Sunday
24Jan2010

Flash tablespace vs. DB Flash Cache

 

In this post I'm going to report on some performance comparisons I've recently conducted on using SSD flash storage for datafiles vs. using the new Oracle 11GR2 database flash cache.  

It sometimes seems like I’ve been waiting for the the end of the spinning magnetic disk all my professional  life.   The technology is so ancient, so clearly limited and (argh) mechanical.  So the fact that Solid State Disk (SSD) is becoming more an more practical as a database storage medium - and directly supported by Oracle in 11GR2 - is very exciting.  

Using SSD as part of database storage can certainly yield big results, but it's important to understand the performance characteristics of flash SSD and make sure we don't use it inappropriately. 

SSD comes in two flavours:

  • Flash  based SSD uses the familiar flash technology that underlies the ubiquitous USB drives that have replaced floppy disks for small portable data storage.   Flash RAM is cheap, provides permanent storage without battery backup and so has low power consumption. 
  • DDR RAM based SSD uses memory modules that are not particularly different in nature from those that provide core memory for the server.   This RAM is backed by non-volatile storage (disk or flash RAM) and internal batteries.  In the event of a power failure, the batteries provide enough power to write the RAM memory to the non-volatile storage.  

DDR SSD is too expensive (in terms of $$/GB) to be very useful to we database professionals today.  But Flash based disk is becoming an increasingly viable alternative to magnetic disk.

Read, write and erase operations

 

Flash disk storage is organized into pages – typically 4K – and blocks – typically 256K.  For read operations, the flash disk can return data very quickly from individual pages.  Writing to a page is slower – maybe 10 times slower.  However,  it is only possible to write a page of data if there is an empty page in the block.   If we need to write to a full block, it must first be erased.   The Wikipedia entry on SSD gives the following figures for seek, write and erase times:

 

As a flash SSD fills with data, the percentage of simple block level writes that require an erase increases, and the write performance of the flash SSD degrades.

The TRIM API function

 

High end flash SSD support an API call TRIM, which allows the OS to proactively erase full blocks so that writes can complete with only a page-level IO.  Most high end drives also support wear-leveling algorithms that move hot-pages around the device to avoid the risk of block level failures.  Flash drives can only support a limited number of erase operations before a block becomes unreliable,  but this is minimized providing the disk automatically moves hot pages around in physical storage. 

MLC vs SLC

 

Cheap flash drives generally use Multi-Level-Cell (MLC) technology in which each cell can store two bits of data rather than SLC which can only store one bit.  The effect of MLC is to increase density at the expense of performance, particularly write performance.   MLC is also less reliable in terms of data loss. If you’re concerned about write performance – and you probably should be – then you should avoid MLC.  

In general,  if you want a high performance flash SSD - and why get one if it's not high performance - you should go for a SLC based flash SSD that supports the TRIM API and has good wear leveling capabilities.  For my testing,  I used an Intel X-25 E 32GB drive.   It cost about $600 Australian (about $534 US dollars).

The problem with read/write speed disparities

 

Given that most databases do many more reads than writes, do we need to worry about the disparity between seek times and write times for flash SSD?  The answer is definitely YES.   For an Oracle database, a big mismatch between the ability to read from a device and the ability to write to the device can be particularly harmful when performing transactional activity through the buffer cache.

The issue relates to the caching of data within the buffer cache.  If it’s very much easier to put blocks into the buffer cache than to write them out, then there’s a good chance that the buffer cache will fill up with dirty (unmodified) blocks and that free buffer waits will result.  The figure below illustrates how free buffer waits can occur:

 

(This illustration from Oracle Performance Survival Guide).

If you're using cheap flash drives, then the write speed is going to be much slower than the read speed and consequently you could end up with free buffer waits becoming your limited factor during high transactional activity. 

Enter the Oracle Database Flash Cache 

 

Oracle's database flash cache provides an alternative way of exploiting flash SSD.  Instead of putting the entire datafile on flash, we use flash as a secondary cache.  The flash cache can be massive and can speed up reads for frequently accessed blocks.  But if the flash drive is busy, Oracle will simply decline to write to the cache.  That way we can get some of the advantage of flash for read optimization without having to pay the write penalty.  

I outlined the flash cache algorithms in this previous post, and here's the diagram I used there to outline the lifecycle of a block when the database flash cache is enabled:

The important part of this architecture is that the DBWR will only write the block to the flash cache if it is not overloaded.  If the DBWR gets busy, then the flash cache write can be omitted and while this will reduce the efficiency of the cache, it prevents the buffer cache from filling up with modified blocks - so free buffer waits should not result. 

Flash disk read performance

 

Let's see how this pans out in practice.  Below we see the relative performance for 500,000 random indexed reads against:

 

  1. A table on magnetic disk with no flash cache
  2. A table on magnetic disk with flash caching
  3. A table stored directly on flash disk 

The magnetic disk was a Seagate 7200.7 80GB Barracuda drive, while the Solid State Disk drive was an Intel X-25 E 32GB drive which is a pretty high end SLC drive.  In both cases, the tablespaces were created on raw partitions to avoid filesystem caching and the redo logs and other tablespaces were on a seperate drive.   The database was restarted prior to each test. 

Here's the read performance:

 

As you'd expect,  storing the table directly on flash is fastest, since there are only read I/Os, and every read I/O is from the much faster flash SSD.   Note that the flash cache gives us a lot of advantages as well - the number of database file IOs is reduced, and the reads from the flash cache are very fast indeed. 

Update performance

 

Next I measured the performance of primary key based UPDATEs.  The performance for such an UPDATE is a mix of read time - get the block into cache - and DBWR performance.  If the DBWR can't write the modified blocks out fast enough then free buffer waits and/or write complete waits will occur. 

What I expected was that the flash tablespace test would show a lot of free buffer waits, since in theory flash writes are so much slower than flash reads.  I thought that using the flash cache would avoid that issue, and provide the best performance.   However, the results surprised me:

 

I repeated the above tests about a dozen times with a variety of workloads, but the results were pretty similar in every case.  While using the flash cache is better than no flash at all, it's still better to put the table directly on flash storage.   The X-25 E was simply able to sustain a write rate much higher than I had expected - about 2,000 writes/sec.   Intel claim - and now I believe - that they have sophisticated algorithms that avoid the write penalty usually associated with the flash SSD storage medium.

Note that it's actually the Flash Cache configuration that generates the most free buffer waits.  The flash cache allows Oracle to achieve much higher logical read rates but - since the writes still go to the relatively slower spinning magnetic disk - the buffer cache often fills with dirty blocks. 

It may be that over time the performance of the X-25 will reduce when all the blocks have been written to at least once  Erase operations become more prevalent.   However,   if the TRIM API is working properly then this degradation should in theory be avoided.  Note that TRIM support is not universal - not all SSDs and older versions of Windows may not support it. 

Write complete waits: flash cache

 

In previous postings I noted that the DBWR will skip writing to the flash cache when busy, which should prevent flash cache activity being a direct cause of buffer busy waits.   However,  I noticed under some workloads that I would get a large number of "write complete waits: flash cache".  For instance, the output below shows about 74% of elapsed time in this wait:

Write complete waits occur when a session wants to modify a block, but the DBWR is in the process of writing it out to the datafile.  The flash cache equivalent occurs for the same reason, but instead of writing it to the datafile, the DBWR is writing it to the flash cache.  This phenomenon can occur when individual blocks are being modified at very high rates;  in this scenario, the chance that the block will be modified before the DBWR writes it out can be substantial. 

Conclusion

 

I'm very impressed with the performance of the X-25 E flash SSD drive.  If it can maintain it's high write throughput over time, then it offers a significant increase in throughput over traditional spinning disk or over the new 11GR2 database flash cache.  However, a few words of warning:

 

  • I did not stress test the disk over a long period of time.  Some flash drives exhibit more write latency as they wear in.  In theory, the TRIM functionality of the Intel X-25 E should avoid that, but I did not verify it in these tests.
  • Flash drives can be an unreliable form of storage over long periods of time.  Any disk can fail of course, but if you run a flash disk hot it will likely fail before a magnetic disk (though it will do more work during that period). 
  • In my tests, my table was small enough to fit on the flash storage.  Had it not, then using the database flash cache would have allowed me to get some benefits from the flash drive without having to buy enough storage to fit the whole database on the SSD.

 

I'm now even more enthusiastic about the use of flash drives.  Providing you buy a high-performance drive (think SLC & TRIM) then you could consider placing entire tablespaces or database on these drives.  If you can't afford top end flash drives for all your data, then the 11GR2 database flash cache can give you a pretty substantial performance boost. 

Sunday
17Jan2010

Best practices with Python and Oracle

This is the third in a series of postings outlining the basics of best practices when coding against Oracle in the various languages.  In Oracle Performance Survival Guide (and in it's predecessor Oracle SQL High Performance Tuning) I emphasise the need to use bind variables and array processing when writing Oracle programs.   The book provides examples of both in PL/SQL and Java, while these postings outline the techniques in other languages. 

Previous postings covered .NET languages and perl

The Python code for the examples used in the posting is here

I became interested in python a few years back when I read the article Strong Typing vs Strong Testing in Best Software Writing and from various other sources that sjggested that Python was a more modern and robust scripting language than Perl (and superior to Java for that matter).  I dabbled with Python and thought about dropping perl as my scripting language of choice but in the end stuck with perl and Java as my primary languages because: (a) decades of perl experience were going to be too hard to replicate in Python and (b) the database drivers for python were not universally of the same quality as perl.

Nevertheless,  Python is an excellent language for Oracle development and utilities and the cx_oracle driver fully supports best coding practices for Oracle. 

Bind Variables

As with most languages, it's all too easy to concatenate literals - creating a unique SQL which mush be re-parsed - rather than using the more efficient bind variable approach.  If anything, it's even easier in Python, since Python's string substitution idiom looks very similar to bind variables.  

For instance, in this fragment, the "%d" literal in the SQL text represents the substitution value for the query:

 

Of course, if you are at all familiar with Python you'll know that the "%" operator substitutes the values in the subsequent list into the preceding string - so this example creates many unique SQL statements leading to high parse overhead; and possibly to library cache latch or mutex contention.  

The cursor execute method allows you to specify bind variables as a list following the SQL, as in this example:

In line 129 we define a named bind variable ":x_value".  Within the execute procedure (line 130) we assign the bind variable to the value of the python variable "i". 

There's a couple of different styles of using bind variables in python.  You can use positional parameters (eg ":1",":2", etc) and pass in a python list that contains the values in sequence and - as we shall see soon - you can bind arrays.  

 

 

As with any programming language, using bind variables improves performance by reducing parse overhead and library cache latch or mutex contention.  The above SQL is about as simple as you can get, so parse overhead reductions should be modest.   As the only active session on my (local laptop) database,  library cache contention would not be expected either.  Nevertheless,  elapsed time even in this simple example reduced by more than 1/2 when bind variables were used. 

Array Fetch

 

Fetching rows in batches generally reduces elapsed time for bulk selects by up to a factor of 10.  Array fetch works to improve performance by reducing network round trips and by reducing logical reads.

Like most modern interfaces,  Python with cx_oracle will do some level of array fetch for you, even if your program handles rows and at a time.  So in the example below, it looks like we are fetching rows one at a time:


Using SQL trace, we can see that Python in fact fetched rows in batches of 50 (this is shown using Spotlight on Oracles trace file analyzer, but you can deduce the same thing from tkprof output): 

You can adjust or view the array size through the arraysize cursor property.  Here we foolishly set the arraysize to 1:

 

Python provides fetchmany() and fetchall() calls in addition to the "fetchone" call used above.  While these do affect the handling of the result set on the python side, the arraysize variable continues to control the size of the array fetch actually sent to Oracle.   The fetchall() call is more efficient in some ways, but requires that you have enough memory on the client side to hold the entire result set in memory:

 

 

 

 Fetchmany() let's you handle the results in batches:

The arraysize cursor property determines how many rows are in each set, so with fetchmany() the python code corresponds most closely to the underlying Oracle processing logic. 

For 500,000 rows, here's the performance of each of the approaches we've tried so far:

 

Although fetchall() performed best in this example, it may backfire for very massive result sets, because of the need to allocate memory for all of the data in the result set.   fetchmany() may be the best approach if you're not sure that the result set is small enough to fit comfortably in memory. 

 

Array insert

Array insert has similar performance implications as array fetch, but it harder to program and does not happen automatically.  The "natural" way to perform inserts is probably to loop through the rows and assign them one at a time:

The values to be inserted are supplied as a Python list which is bound to the positional bind variables (":1, :2") defined in the SQL.   In this example we used the prepare method to create the SQL once before executing it many times.  This is not a bad practice, but doesn't seem to have any impact on parse overhead:  Python only reparses the SQL when the SQL text changes.

To perform an array insert, we create a "list of lists" which defines our array to be inserted.  We then call the executemany() method to insert that array:

Each entry in the arrayValues list is itself a list that contains the values to be bound.

As usual, the performance improvements are fairly remarkable.  The chart above compares the time to insert 500,000 rows using array insert vs. single row inserts. 

Conclusion

 

The basics of good practice in Python when working with an Oracle database are much the same as those for other languages:

 

  • Use bind variables whenever possible
  • Array fetch is performed automatically, but you can tweak it to get better performance
  • Array insert requires manual coding but is definitely worth it

 

 

Monday
11Jan2010

Histograms of histograms

 

One of our TOAD customers asked us if we could show a chart of CBO histogram values so that they could observe the skew in column values.   We don't currently, but will in an upcoming version of SQL Optimizer (part of TOAD DBA and Developer suites).  In the meantime, the idea of have an SQL that generated a histogram (as bar charts are sometimes called) of a histogram seemed appealing, so here's some SQL to do that. 

In case you don't know,  Oracle uses height balanced histograms when the number of distinct values is high.  In a height balanced histogram the number of rows in each bucket is about the same, but the range of values in each bucket varies.  However, when the number of buckets in the histogram is greater than or equal to the number of distinct column values then Oracle will create a frequency histogram. In a frequency histogram each bucket represents a value, and the number of rows matching the value is recorded.  The Oracle documentation set describes the two types of histograms here

This script will create a bar chart (AKA a histogram) as for a frequency histogram.  For instance, here is a display of the values for the CUST_DATE_OF_BIRTH column in the CUSTOMERS table:

The script uses the ENDPOINT_ACTUAL_VALUE column if present, otherwise the ENDPOINT_VALUE column.  For numeric columns,  ENDPOINT_VALUE is fine,  but for character columns it's meaningless.   

Unfortunately ENDPOINT_ACTUAL_VALUE is not generally populated.  According to Jonathan Lewis in Cost Based Oracle Fundamentals,  from 9i onwards:  "Oracle only populates the endpoint_actual_value column of the histogram views if there is at least one pair of values that, roughly speaking, are identical for the first six characters".

So for most character columns, the script will report something unhelpful like this (this is the CUSTOMERS.CUST_INCOME_LEVEL column):

Tom Kyte provided a PL/SQL function "hexstr" that can decode the ENPOINT_VALUE to a character string - at least for the first 6 bytes.  So using Tom's function, we can get a better sense of the distribution of CUST_INCOME_LEVEL:

Remember, these "histogram histograms" are only really useful for FREQUENCY histograms;  check USER_TAB_COL_STATISTICS to see what types of histograms exist on your columns:
Also the accuracy of the histograms is going to vary depending on your DBMS_STATS options and on how recently you collected the statistics.  All that having been said, the query may be a good way to get a quick idea of how your data is distributed which can be critical when tuning SQL. 

 

 

 

 

Friday
01Jan2010

The 11GR2 IGNORE_ROW_ON_DUPKEY_INDEX hint

One of the strangest new features in 11GR2 is the new IGNORE_ROW_ON_DUPKEY_INDEX hint.  When this hint is applied to an INSERT statement, any duplicate key values that are inserted will be silently ignored, rather than causing an ORA-0001 error. 

Why is this so strange?  Mainly because unlike almost all other hints, this hint has a semantic effect: it changes the actual behavior - not just the optimization - of the SQL.  In my opinion, clauses that affect the actual data effect of the SQL should be contained in official SQL syntax, not embedded in a comment string as a "hint".  The Oracle documentation acknowledges the uniqueness of the hint:

Note:

The CHANGE_DUPKEY_ERROR_INDEXIGNORE_ROW_ON_DUPKEY_INDEX, and RETRY_ON_ROW_CHANGE hints are unlike other hints in that they have a semantic effect. The general philosophy explained in "Hints" does not apply for these three hints.

 Given all that, I'd be reluctant to use such a hint unless there was a compelling performance advantage.  So, let's see if there's any performance justification for this strange hint. 

Usage

We can use the hint in circumstances in which we expect that an insert may encounter a duplicate key on index error.  Traditionally, we have two ways of avoiding this scenario.  The most familiar is to code an exception handler that handles the duplicate key error:

Another common approach is to check to see if there is a matching value before the insert:

This doesn't totally eliminate the possibility of an ORA-0001, since a row could get inserted between the SELECT and the INSERT - so you still might want to have an exception handler (or merge the SELECT into the INSERT).  In any case, the exception handler will likely not be invoked very often with the above approach.  

The IGNORE_ROW_ON_DUPKEY_INDEX approach requires merely that we add a hint referring to the table alias and optionally the index for which we anticipate ORA-0001:

 

As with all hints, a syntax error in the hint causes it to be silently ignored. The result will be that ORA-00001 will be raised, just as if no hint were used.

However, unlike most hints, you might actually generate an error if you don't specify a valid index.  As in this example:

Performance implications 

So, which approach is faster?  I inserted about 600,000 rows into a copy of the SALES table.  About 1/2 the rows already existed.  I traced each of the programming approaches illustrated above:

 

  1. Issue a select to see if there is already a matching value and only insert if there is not.
  2. Issue an INSERT and use an EXCEPTION handler to catch any duplicate values
  3. use the IGNORE_ROW_ON_DUPKEY_INDEX hint

 

Here's the elapsed times for each approach.  Not that these times only include SQL statement execution time - the PL/SQL time (loop and collection handling, etc) is not shown:

 

It's clearly much more efficient to avoid  insert duplicates than to insert them and handle the exception or use the IGNORE_ROW_ON_DUPKEY_INDEX hint.   Rollbacks - explicit or implicit statement level rollbacks - are expensive and should generally be avoided.  The IGNORE_ROW_ON_DUPKEY_INDEX hint does seem to be a little faster than handling the exception manually, but it's still far more efficient to avoid inserting the duplicate in the first place. 

Use with caution

As well as the other limitations of this hint, there's also a significant bug which causes an ORA-600 [qerltcInsertSelectRop_bad_state]  if you try to perform a bulk insert with the hint enabled:

I've raised an SR on this issue. 

Conclusion

It's an odd idea to have a hint that affects the functionality rather than the performance of a SQL statement and this alone would make me avoid the hint.  However, in addition to that reservation, the hint does not seem to offer a very significant performance advantage and in the initial release of 11GR2 is associated with some significant bugs.  

 

Monday
30Nov2009

More on the database flash cache

I'm eagerly awaiting my high-performance flash SSD (an Intel X-25 E), but in the meantime I've played a bit more with the database flash cache on the cheap hardware setup described in my last post.   Sometimes it can be useful to test new features on slow hardware, since you observe phenomenon that don't occur when everything is running full speed.

I originally naively imagined that blocks would be copied into the flash cache by the Oracle server process .  Eg, that if I read from disk, I deposit the block in both the buffer cache or the flash cache.  However,  upon investigation it appears that blocks are moved from the buffer cache to the flash cache by the DBWR as they are about to be flushed from the buffer cache.  

This is of course, a much better approach.  The DBWR can write to the flash cache asynchronously, so that user sessions get the benefit - less time reading from magnetic disk - without having to wait while blocks are inserted into the flash cache. 

 

So the lifecycle of a block looks something like this:

  1. The Oracle server process reads a file from disk and adds it to the buffer cache
  2. If a session wants that block later and it's still in the buffer cache, they can read it from the buffer cache
  3. Before the block leaves the buffer cache the DBWR will write it to the flash cache (if the DBWR is not too busy)
  4. If a session want a block later and it's still in the flash cache, then they will read it from the flash cache (and presumably place it back in the buffer cache)
  5. If the block is modified, the DBWR will eventually write it back to the disk.  (Q:  What happens to any unmodified copies of that block in the flash cache?)

DBWR and the flash cache

 

Reading from flash is fast - but writing is much slower.  Therefore,  in order to avoid performance issues, the DBWR should:

 

  1. Not write to flash unless it has to and
  2. Not write to flash at all if it will get in the way of other more important activities. 

 

To the first point,  DBWR does not appear to populate the flash cache until blocks are about to be flushed.  In other words, the DBWR doesn't seem to write a buffer to the flash cache just in case it is going to be flushed, but only if it is actually (or maybe probably) going to be flushed.   We do not observe writes to the flash cache when blocks are brought into the buffer cache unless other blocks are also being flushed out.

Secondly,  the DBWR won't write to flash cache - and won't delay blocks from being flushed - if it is busy writing dirty blocks to disk.   The DBWR must clear dirty blocks from the cache as quickly as possible, otherwise "free buffer waits" will prevent new blocks from being introduced to the buffer cache.  If blocks are about to be flushed from the buffer cache but the DBWR is busy then the blocks will not get written to the flash cache, and the statistic 'flash cache insert skip: DBWR overloaded' will be incremented. 

 

Measuring db flash cache effectiveness

 

The following query reports on how often a busy DBWR is forced to skip flash cache inserts because it is too busy or for other reasons:

 

The "DBWR overloaded" statistic I think we understand - the DBWR is busy writing dirty blocks to disk and hasn't the bandwidth to write clean blocks to the flash cache.

"flash cache insert skip: exists" is also easy to understand.  If we read a block back from the flash cache into the buffer cache, then it remains in the flash cache.  There's no need to write it to the flash cache again should it again age out of the buffer cache.

"not current" probably means that the DBWR is aware of a more recent copy of the block on disk or in the buffer cache and for that reason is going to decline to write a version to flash.

"not useful" I do not understand....

The biggest takeaway here is - I think:

On a system with a very busy DBWR,  the db flash cache may be less effective. 

 

To measure the benefit from the flash cache we can compute the number of flash cache reads that avoided a physical read and multiply by the average time for each type of read.  This query attepts to do that though only for single block reads:

 

Conclusion

 

The architecture of the DB flash cache allows us to relax a bit regarding the relatively poor write performance that is associated with flash SSD.  If we place a datafile on flash SSD, we risk creating a free buffer wait bottleneck:  the DBWR will stall trying to write blocks to the slow (in terms of write latency) flash drive and so we might actually get worst performance compared to magnetic disk - at least if we have a high write rate.  But with the flash cache, the DBWR only writes when it has free time.  Consequentially, there should be little real down side - the worst that can happen is that the DBWR is too busy to populate the flash cache and so it becomes less useful.

The efficiency of the db flash cache is going to depend on two main factors:

  1. Does the DBWR have enough free time to write to the flash cache?
  2. How long does it take to write to the flash device?

The answer to the first question depends on the current rate of DBWR activity which in turn probably depends on how much IO write bandwidth exists on your disk array.  But if you have the sort of database that suffers from or is on the verge of suffering from free buffer waits, then probably the db flash cache won't work too well  because the DBWR will never be able to write to it.

The answer to the second question depends on the quality of the flash SSD.   You generally want to use Flash disks which have the lowest possible write latency.  That usually means support for the TRIM API call, and flash that uses Single Level Cells (SLC).