dbKoda for MongoDB

dbKoda for MongoDB - a modern open source database IDE, now available for MongoDB. Download it here!

Next Generation
Databases: NoSQL,
NewSQL and Big Data

Buy at Amazon

Read sample at Amazon

Buy at Apress

Latest Postings:


Oracle Performance Survival Guide

Buy It
Read it on Safari
Scripts and Examples
Sample Chapter


Powered by Squarespace

 MySQL Stored procedure programming

Buy It
Read it on Safari
Scripts and Examples 


Entries in ssd (17)


More on the database flash cache

I'm eagerly awaiting my high-performance flash SSD (an Intel X-25 E), but in the meantime I've played a bit more with the database flash cache on the cheap hardware setup described in my last post.   Sometimes it can be useful to test new features on slow hardware, since you observe phenomenon that don't occur when everything is running full speed.

I originally naively imagined that blocks would be copied into the flash cache by the Oracle server process .  Eg, that if I read from disk, I deposit the block in both the buffer cache or the flash cache.  However,  upon investigation it appears that blocks are moved from the buffer cache to the flash cache by the DBWR as they are about to be flushed from the buffer cache.  

This is of course, a much better approach.  The DBWR can write to the flash cache asynchronously, so that user sessions get the benefit - less time reading from magnetic disk - without having to wait while blocks are inserted into the flash cache. 


So the lifecycle of a block looks something like this:

  1. The Oracle server process reads a file from disk and adds it to the buffer cache
  2. If a session wants that block later and it's still in the buffer cache, they can read it from the buffer cache
  3. Before the block leaves the buffer cache the DBWR will write it to the flash cache (if the DBWR is not too busy)
  4. If a session want a block later and it's still in the flash cache, then they will read it from the flash cache (and presumably place it back in the buffer cache)
  5. If the block is modified, the DBWR will eventually write it back to the disk.  (Q:  What happens to any unmodified copies of that block in the flash cache?)

DBWR and the flash cache


Reading from flash is fast - but writing is much slower.  Therefore,  in order to avoid performance issues, the DBWR should:


  1. Not write to flash unless it has to and
  2. Not write to flash at all if it will get in the way of other more important activities. 


To the first point,  DBWR does not appear to populate the flash cache until blocks are about to be flushed.  In other words, the DBWR doesn't seem to write a buffer to the flash cache just in case it is going to be flushed, but only if it is actually (or maybe probably) going to be flushed.   We do not observe writes to the flash cache when blocks are brought into the buffer cache unless other blocks are also being flushed out.

Secondly,  the DBWR won't write to flash cache - and won't delay blocks from being flushed - if it is busy writing dirty blocks to disk.   The DBWR must clear dirty blocks from the cache as quickly as possible, otherwise "free buffer waits" will prevent new blocks from being introduced to the buffer cache.  If blocks are about to be flushed from the buffer cache but the DBWR is busy then the blocks will not get written to the flash cache, and the statistic 'flash cache insert skip: DBWR overloaded' will be incremented. 


Measuring db flash cache effectiveness


The following query reports on how often a busy DBWR is forced to skip flash cache inserts because it is too busy or for other reasons:


The "DBWR overloaded" statistic I think we understand - the DBWR is busy writing dirty blocks to disk and hasn't the bandwidth to write clean blocks to the flash cache.

"flash cache insert skip: exists" is also easy to understand.  If we read a block back from the flash cache into the buffer cache, then it remains in the flash cache.  There's no need to write it to the flash cache again should it again age out of the buffer cache.

"not current" probably means that the DBWR is aware of a more recent copy of the block on disk or in the buffer cache and for that reason is going to decline to write a version to flash.

"not useful" I do not understand....

The biggest takeaway here is - I think:

On a system with a very busy DBWR,  the db flash cache may be less effective. 


To measure the benefit from the flash cache we can compute the number of flash cache reads that avoided a physical read and multiply by the average time for each type of read.  This query attepts to do that though only for single block reads:




The architecture of the DB flash cache allows us to relax a bit regarding the relatively poor write performance that is associated with flash SSD.  If we place a datafile on flash SSD, we risk creating a free buffer wait bottleneck:  the DBWR will stall trying to write blocks to the slow (in terms of write latency) flash drive and so we might actually get worst performance compared to magnetic disk - at least if we have a high write rate.  But with the flash cache, the DBWR only writes when it has free time.  Consequentially, there should be little real down side - the worst that can happen is that the DBWR is too busy to populate the flash cache and so it becomes less useful.

The efficiency of the db flash cache is going to depend on two main factors:

  1. Does the DBWR have enough free time to write to the flash cache?
  2. How long does it take to write to the flash device?

The answer to the first question depends on the current rate of DBWR activity which in turn probably depends on how much IO write bandwidth exists on your disk array.  But if you have the sort of database that suffers from or is on the verge of suffering from free buffer waits, then probably the db flash cache won't work too well  because the DBWR will never be able to write to it.

The answer to the second question depends on the quality of the flash SSD.   You generally want to use Flash disks which have the lowest possible write latency.  That usually means support for the TRIM API call, and flash that uses Single Level Cells (SLC).





Using the Oracle 11GR2 database flash cache

Oracle just released a patch which allows you to use the database flash cache on Oracle Enterprise Linux even if you don't have exadata storage.  The patch is the obscurely named:


Once you install the patch you can use any old flash device as a database flash cache.  Below I've documented some initial dabbling on a very old server and a cheap usb flash device.   The results are not representative of the performance you'd get on quality hardware, but are still interesting, I think.

Setup and configuration


If, like me, you just want to experiment using an USB flash device, then you first need to get that device mounted. On my test machine I created a directory "/mnt/usbflash" then created an /etc/fstab entry like this:

   /dev/sda1               /mnt/usbflash           vfat    noauto,users,rw,umask=0 0 0

On your system you might need to change "/dev/sda1" to another device depending on how your fixed disks are configured.  You should then be able to mount the flashdrive by typiing "mount /dev/sda1".  Make sure that the mount point is writable by oracle (chmod 777 /mnt/usbflash). 

Once mounted, you configure the flash cache by setting the parameters DB_FLASH_CACHE_FILE and DB_FLASH_CACHE_SIZE.  My settings are shown below:


Note that the value of DB_FLASH_CACHE_FILE needs to be a file on the flash drive, not the flash drive mount point itself.

Once these parameters are set, the flash cache will be enabled and will act as a secondary cache to the buffer cache.  When a block is removed from the primary cache, it will still exist in the flash cache, and can be read back without a physical read to the spinning disk.


There's a few ways to examine how the flash cache is being used.  Firstly,  V$SYSSTAT contains some new statistics showing the number of blocks added to the cache and the number of "hits" that were satisfied from cache (script here):


Some new wait events are now visible showing the waits incurred when adding or reading from the flash  cache.   Below we see that 'db flash' waits are higher than 'db file sequential read', though reads from the flash cache are much quicker than reads from disk (but there are a lot more of them):




Now, remember that I could not have picked worst hardware for this test - an old two CPU intel box and a cheap thumb drive.  Even so, it's remarkable how high the write overhead is in comparison to the total time.    Although the flash reads save time when compared to db file sequential reads, the overhead of maintaining the cache can be high because flash based SSD has a relatively severe write penalty.

All flash-based Solid State Disk have issues with write performance.  However, cheap Multi Level Cell (MLC) flash take about 3 times as long to write as the more expensive Single Level Cell (SLC).  When flash drives are new, the empty space can be written to in single page increments (usually 4KB).  However, when the flash drive is older, writes typically require erasing a complete 128 page block which is very much slower.  My cheap USB drive was old and MLC, so it had very poor write performance.   But even the best flash based SSD is going to be much slower for writes than for reads, and in some cases using a  flash cache might slow a database down as a result.  So monitoring is important.

There's a couple of other V$SYSSTAT statistics of interest:


To examine the contents of the cache, we can examine the V$BH view.  Buffers in the flash cache have STATUS values such as 'flashcur', allowing us to count the buffers from each object in the main buffer cache and in the flash cache (script is here):



In this case, the TXN_DATA table has 85,833 blocks in the flash cache and 28,753 blocks in the main buffer cache. 


I was happy to get the flash cache working, even with this crappy hardware.  I'm really happy that Oracle opened this up to non-exadata hardware. 

I'll be getting a better setup soon so that I can see how this works with a decent commerical SSD flash drive.

I'm a big believer in the need for a hierarchy of storage ideally including at least two layers - one for mass storage, the other for fast retrieval.   We should be cautious however, because the flash write penalty may result in performance problems similar to those we've seen with RAID5.


Page 1 ... 1 2 3 4