Guy Harrison - Yet Another Database Blog

Monday

Oct122015

Exploring CouchBase N1QL

Monday, October 12, 2015 at 9:06PM

Couchbase recently announced Non-first Normal Form Query Language (N1QL) – pronounced “Nickel” – a virtually complete SQL language implementation for use with document databases, and implemented within the Couchbase server 4.0.

I recently took a quick look.

Most of the examples use the sample films documents shown below (this is the same sample data we created for MongoDB in this post):

N1QL allows us to perform basic queries to retrieve selected documents or attributes of selected documents:

N1QL allows us to access nested documents within the JSON structure using array notation. So for instance in the example below Actors[0] refers to the first nested document within the actors array:

We can query for subdocuments which match a search criteria using WHERE ANY syntax:

The UNNEST command allows embedded documents to be “joined” back to the parent document. So here we get one result for each Actor who starred in film 200, with the film title included in the results:

The UNNEST command allows us to perform the equivalent of joins between parent and children documents where the child documents are nested within the parent. N1QL also allows us to join across disparate documents, providing that one of the document collections contains a reference to the primary key in the other.

So for instance if we had a bucket of documents that contains the primary keys of “overdue” films in our imaginary (and by now definitely struggling) DVD store, then we can join that to the films collection to return just those films:

N1QL also contains DML statements allowing us to manipulate the contents of documents, and DDL statements allowing creation and modification of indexes.

N1QL is ambitious attempt to bring SQL into the world of document databases. It’s interesting to consider that as the same time that companies like CouchBase are introducing SQL support into their database, that companies like Oracle are introducing strong JSON support into their SQL-based database. It would seem that the two worlds are coming together.

Guy Harrison |

On my way to Collaborate 2015

Saturday, April 11, 2015 at 5:05PM

As I write this I’m at 35,000 ft (or so) on my way to yet another Collaborate. This year I’ll be presenting two sessions and participating in one panel:

963: Writing to Lead

Monday, April 13 | Banyan E, South Convention Center, Level 2

10:30 a.m. – 11:30 a.m.

Jonathan Gennick will be moderating a panel including myself, Bobby Curtis, Charles Kim, Darl Kuln and Michael Rosenblum. We’ll be talking about how writing – book, blogs, articles, twitter, etc – to best effect. The panel are all authors and hopefully between us we’ll have a lot of useful advise to offer the aspiring or existing book authors.

995: Top 5 Trends in Database Technology

Tue, Apr 14, 2015 (04:30 PM - 05:30 PM) : Banyan E

It’s been an amazing 5 years in database technology – we went from RDBMS for everything to Big Data, NoSQL, NewSQL and a lot of innovation within RDBMS as well. This talk covers 5 big technologies that are influencing Oracle and other database technologies. It includes some of my best recycled jokes as well hopefully some usefull technical content. There’s an article based on this presentation if you want a preview here.

442: Top 10 Oracle Database Tuning Tips and Techniques

Thu, Apr 16, 2015 (08:30 AM - 09:30 AM) : Palm B

Seasoned presenters know that for some reason “top 10” presentations somehow get accepted more readily than more abstract titles. Something to do with the impression of a concrete amount of information I suppose. But once you start giving a top 10 talk you quickly realize you can only budget 5 minutes at the most for each topic. But actually I feel pretty happy about this talk as a high level summary of the core concepts that I’ve picked up in 25 years of Oracle performance tuning. The main thing I’ve tried to accomplish in this talk is to organize the core concepts of performance tuning around a sensible tuning philosophy. I’m not assuming a great deal of Oracle tuning experience.

Guy Harrison |

Sakila sample schema in MongoDB

Monday, March 23, 2015 at 2:29PM

2018 Update: You can download this and other sample schemas we use in dbKoda from https://medium.com/dbkoda/mongodb-sample-collections-52d6a7745908.

I wanted to do some experimenting with MongoDB, but I wasn’t really happy with any of the sample data I could find in the web. So I decided that I would translate the MySQL “Sakila” schema into MongoDB collections as part of the learning process.

For those that don’t know, Sakila is a MySQL sample schema that was published about 8 years ago. It’s based on a DVD rental system. OK, not the most modern data ever, but DVDs are still a thing aren’t they??

You can get the MongoDB version of Sakilia here. To load, use unpack using tar zxvf sakilia.tgz then use mongoimport to load the resulting JSON documents. On windows you should be able to double click on the file to get to the JSON.

The Sakila database schema is shown below. There are 16 tables representing a fairly easy to understand inventory of films, staff, customers and stores.

When modelling MongoDB schemas, we partially ignore our relational modelling experience – “normalization” is not the desired end state. Instead of driving our decision on the nature of the data, we drive it on the nature of operations. The biggest decision is which “entities” get embedded within documents, and which get linked. I’m not the best person to articulate these principles – the O’Reilly book “MongoDB Applied Design Patterns” does a pretty good job and this presentation is also useful.

My first shot at mapping the data – which may prove to be flawed as I play with MongoDB queries – collapsed the 16 tables into just 3 documents: FILMS, STORES and CUSTOMERS. ACTORS became a nested document in FILMS, STAFF and INVENTORY were nested into STORES, while RENTALS and PAYMENTS nested into CUSTOMERS. Whether these nestings turn out to be good design decisions will depend somewhat on the application. Some operations are going to be awkward while others will be expedited.

Here’s a look at the FILMS collection:

Here is STORES:

And here is CUSTOMERS:

Looks like I have to fix some float rounding issues on customers.rentals.payments.amount Smile .

The code that generates the schema is here. It’s pretty slow, mainly because of the very high number of lookups on rentals and payments. It would be better to bulk collect everything and scan through it but it would make the code pretty ugly. If this were Oracle I’m pretty sure I could make it run faster but with MySQL SQL tuning is much harder.

Code is pretty straight forward. To insert a MongoDB document we get the DBCollection, then create BasicDBObjects which we insert into the DBCollection. To nest a documnet we create a BasicDBList and insert BasicDBObjects into it. Then we add the BasicDBList to the parent BasicDBObject. The following snippit illustrates that sequence. It's mostly boilerplate code, with the only human decision being the nesting structure.

DBCollection filmCollection = mongoDb.getCollection(mongoCollection);

  

while (fileRs.next()) { // For each film

  

        // Create the actors document

        BasicDBObject filmDoc = new BasicDBObject();

        Integer filmId = fileRs.getInt("FILM_ID");

        filmDoc.put("_id", filmId);

        filmDoc.put("Title", fileRs.getString("TITLE"));

        // Other attributes

        BasicDBList actorList = getActors(mysqlConn, filmId);

        // put the actor list into the film document

        filmDoc.put("Actors", actorList);

        filmCollection.insert(filmDoc); // insert the film

  

    }

Anyway, hopefully this might be of some use to those moving from MySQL to MongoDB. Comments welcome!

Guy Harrison |

6 Comments |

tagged

MongoDB,

mysql,

nosql in

mysql

Wednesday

Apr302014

Best practices for accessing Oracle from scala using JDBC

Wednesday, April 30, 2014 at 11:20AM

I’ve been looking for an excuse to muck about with scala for a while now. So I thought i’d do a post similar to those I’ve done the past for .NET, python, perl and R. Best practices for Java were included in my book Oracle Performance Survival Guide (but I’d be more than happy to post them if anyone asks).

One of the great things about scala is that it runs in the JVM, so we can use the Oracle JDBC drivers to access Oracle. These drivers are very mature and support all the best programming practices.

Best practices for programming Oracle in any language require at least the following:

Use bind variables appropriately.
Ensure you are performing array fetch when retrieving more than one row
Ensure that you use array insert when doing bulk inserts

You can get the scala program which contains the code snippets below here.

Connecting

If you’ve ever used Oracle with JDBC, you’ll find things very familiar. Here’s a snippet that connects to an oracle database with username,password, host and service specified on the command line (assumes the default 1521 port, but of course this could be parameterized as well):

   1:  import java.sql.Connection

   2:  import java.sql.ResultSet

3:

   4:  import oracle.jdbc.pool.OracleDataSource

5:

   6:  object guyscala2 {

   7:    def main(args: Array[String]) {

   8:      if (args.length != 4) {

   9:        println("Arguments username password hostname serviceName")

  10:        System.exit(1)

  11:      }

12:

  13:      val ods = new OracleDataSource()

  14:      ods.setUser(args(0))

  15:      ods.setPassword(args(1))

  16:      ods.setURL("jdbc:oracle:thin:@" + args(2)+":1521/"+args(3))

  17:      val con = ods.getConnection()

  18:      println("Connected")

Using Bind variables

As in most languages, it's all to easy to omit bind variables in scala. Here's an example where the variable value is simply concatenated into a SQL string

   1:     for (cust_id <- 1 to rows) {

   2:        val s1 = con.createStatement()

   3:        s1.execute("UPDATE customers SET cust_valid = 'Y'"

   4:          + " WHERE cust_id = " + cust_id)

5:

   6:        s1.close()

   7:      }

On line 3 we build up a SQL statement concatenating the value we want into the string and immediately exeucte it. Each execution is a unique SQL statement which requires parsing, optimization and caching in the shared pool.

Here’s an example using bind variables and a prepared Statement:

   1:    val s2 = con.prepareStatement(

   2:        "UPDATE customers SET cust_valid = 'Y'"

   3:          + " WHERE cust_id = :custId")

4:

   5:      for (cust_id <- 1 to rows) {

   6:        s2.setInt(1, cust_id)

   7:        s2.execute()

   8:      }

9:

  10:      s2.close()

Slightly more complex: we prepare a statement on line 1, associate the bind variable on line 6, then execute on line 7. It might get tedious if there are a lot of bind variables, but still definitely worthwhile. Below we see the difference in execution time when using bind variables compared with concanating the variables into a string. Bind variables definitely increase execution time.

As well as the reduction in execution time for the individual application, using bind variables reduces the chance of latch and/or mutex contention for SQL statements in the shared pool – where Oracle caches SQL statements to avoid re-parsing. If many sessions are concurrently trying to add new SQL statements to the shared pool, then some may have to wait on the library cache mutex. Historically, this sort of contention has been one of the most common causes of poor application scalability – applications which did not use bind variables risked strangling on library cache latch or mutex as the SQL exectuion rate increased.

Exploiting the array interface

Oracle can retrieve rows either from the database one at a time, or can retrieve rows in “batches” sometimes called “arrays”. Array fetch refers to the mechanism by which Oracle can retrieve multiple rows in a single fetch operation. Fetching rows in batches reduces the number of calls issued to the database server, and can also reduce network traffic and logical IO overhead. Fetching rows one at a time is like moving thousands of people from one side of a river to another in a boat with all but one of the seats empty - it’s incredibly inefficient.

Fetching rows using the array interface is simple as can be and in fact enabled by default - though with a small default batch size of 10. The setFetchSize method of the connection and statement objects sets the number of rows to be batched. Unfortunately, the default setting of 10 is often far too small – especially since there is typically no degradation even when the fetch size is set very large – you get diminishing, but never negative, returns as you increase the fetch size beyond the point at which every SQL*NET packet is full

Here’s a bit of code that sets the fetch size to 1000 before executing the SQL:

   1:  val s1 = con.createStatement()

2:

   3:  s1.setFetchSize(1000)

   4:  val rs = s1.executeQuery("Select /*fetchsize=" + s1.getFetchSize() + " */ * " +

   5:          "from customers where rownum<= " + rows)

   6:  while (rs.next()) {

   7:          val c1 = rs.getString(1)

   8:          val c2 = rs.getString(2)

   9:   }

  10:   rs.close()

Here’s the elapsed times for various fetchsizes for the above query:

While the default setting of 10 is clearly better than any lower value, it’s still more than 6 times worse than a setting of 100 or 200.

Inserting data is another situation in which we normally want to consider the array interface. In this case we need to change our code structure a bit more noticeably. Here’s the code we probably would write if we didn’t know about array processing:

   1:    val insSQL = "INSERT into arrayinsertTest" +

   2:        " (cust_id,cust_first_name,cust_last_name,cust_street_address) " +

   3:        " VALUES(:1,:2,:3,:4)"

   4:      val insStmt = con.prepareStatement(insSQL)

   5:      val startMs = System.currentTimeMillis

   6:      var rowCount = 0

   7:      while (rs.next()) {

   8:        insStmt.setInt(1, rs.getInt(1))

   9:        insStmt.setString(2, rs.getString(2))

  10:        insStmt.setString(3, rs.getString(3))

  11:        insStmt.setString(4, rs.getString(4))

  12:        rowCount += insStmt.executeUpdate()

  13:      }

  14:      val elapsedMs = System.currentTimeMillis - startMs

  15:      println(rowCount + " rows inserted - " + elapsedMs + " ms")

  16:      con.commit()

We prepare the statement on line 4, bind the values to be inserted on lines 8-11, then execute the insert on line 12.

WIth a few minor changes, this code can perform array inserts:

   1:      val insSQL = "INSERT into arrayinsertTest" +

   2:        " (cust_id,cust_first_name,cust_last_name,cust_street_address) " +

   3:        " VALUES(:1,:2,:3,:4)"

   4:      val insStmt = con.prepareStatement(insSQL)

   5:      val startMs = System.currentTimeMillis

   6:      var rowCount = 0

   7:      while (rs.next()) {

   8:        insStmt.setInt(1, rs.getInt(1))

   9:        insStmt.setString(2, rs.getString(2))

  10:        insStmt.setString(3, rs.getString(3))

  11:        insStmt.setString(4, rs.getString(4))

  12:        insStmt.addBatch()

  13:        rowCount += 1

  14:        if (rowCount % batchSize == 0) {

  15:          insStmt.executeBatch()

  16:        }

  17:      }

18:

  19:      val elapsedMs = System.currentTimeMillis - startMs

  20:      println(rowCount + " rows inserted - " + elapsedMs + " ms")

  21:      con.commit()

On line 12, we now call the addBatch() method instead of executeUpdate(). Once we’ve added enough rows to our batch (defined by the batchsize constant in the above code) we can call executeBatch() to insert the batch.

Array insert gives about the same performance improvements as array fetch. For the above example I got the performance improvement below:

To be fair, the examples above are a best case scenario for array processing – the Oracle database was running in an Amazon EC2 instance in the US, while I was running the scala code from my home in Australia! So the round trip time was as bad as you are ever likely to see. Nevertheless, you see pretty impressive performance enhancements from simply increasing array size in the real world all the time.

Conclusion

If you use JDBC to get data from Oracle RDBMS within a scala project, then the principles for optimization are the same as for Java JDBC – use preparedStatements, bind variables and array processing. Of course there’s a lot more involved in optimizing database queries (SQL Tuning, indexing, etc), but these are the three techniques that vary significantly from language to language. The performance delta from these simple techniques are very significant and should represent the default pattern for a professional database programmer.

Guy Harrison |

2 Comments |

tagged

Oracle,

scala in

Oracle

Thursday

Jan022014

Using SSD for a temp tablespace on Exadata

Thursday, January 2, 2014 at 7:10PM

I seem to be getting a lot of surprising performance results lately on our X-2 quarter rack Exadata system, which is good – the result you don’t expect is the one that teaches you something new.

This time, I was looking at using a temporary tablespace based on flash disks rather than spinning disks. In the past – using Fusion IO PCI cards, I found that using flash for temp tablespace was very effective in reducing the overhead of multi-pass sorts:

See (http://guyharrison.squarespace.com/ssdguide/04-evaluating-the-options-for-exploiting-ssd.html)

However, when I repeated these tests for Exadata, I got very disappointing results. SSD based temp tablespace actually lead to marginally worse performance:

Looking in depth at a particular point (the 500K SORT_AREA_SIZE point), we can see that although the SSD based temp tablespace has marginally better read times, it involves a significantly higher write overhead:

I can understand the higher read overhead (at least partially). It’s Yet Another time when sequential write operations to an SSD device have provided disappointing performance. However, it’s strange to see such poor read performance. How can a spinning disk serve blocks up at effectively the same latency an SSD?

So I dumped all the direct path read waits from a 10046 trace and plotted them logarithmically:

We can see in this chart, that the SDD based tablespace suffers from a small “spike” of high latencies between 600-1000 us (eg .6-1 ms). These are extremely high latencies for an SSD ! What could be causing them? Garbage collection being caused by the almost writes to the temp tablespaces? There was negliglbe concurrent activity on the system and the table concerned had flash cache disabled so for now that is my #1 theory.

For that matter, why are the HDD reads times so low? An average disk read latency of 500 us for a spinning disk is unreasonably low, is the storage cell somehow buffering temporary tablespace IO?

As always I’m wondering if there’s someone with more expertise in Exadata internals who could shed some light on all of this!

Guy Harrison |

2 Comments |

tagged

Exadata,

Oracle,

ssd in

Oracle