Entries in Exadata (11)


Exadata smart flash logging

Exadata storage software introduced the Smart Flash Logging feature.  The intent is to reduce overall redo log sync times, especially outliers, by allowing the Exadata flash storage to serve as a secondary destination for redo log writes.  During a redo log sync, Oracle writes to disk and flash simultaneously and allows the redo log sync operation to complete as soon as the first device completes.

Jason Arneil reports some initial observations here, and Luis Moreno Campos summarized it here.

I’ve reported in the past on using SSD for redo, including on Exadata, and generally I’ve found that SSD is a poor fit for redo log style sequential write IO.  But this architecture should at least do no harm, and on the assumption that the SSD will at least occasionally complete faster than a spinning disk, I tried it out.

My approach involved the same workload I’ve used in similar tests.  I ran 20 concurrent processes each of which performed 200,000 updates and commits – a total of 4,000,000 redo log sync operations.  I captured every redo log sync wait from 10046 traces and loaded them in R for analysis.
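
For readers who want to repeat the extraction step, here is a minimal sketch in Python of pulling the sync waits out of a 10046 trace. It assumes the waits appear under the standard 'log file sync' event name with the usual "ela=" microsecond field; the trace excerpt is made up for illustration and the function name is my own.

```python
import re
from statistics import mean, median

# Matches the elapsed time (microseconds) on 'log file sync' WAIT lines.
# Assumption: the redo log sync waits are traced under this event name.
SYNC_WAIT = re.compile(r"nam='log file sync' ela=\s*(\d+)")

def sync_times_us(trace_lines):
    """Return the elapsed time in microseconds of every 'log file sync' wait."""
    times = []
    for line in trace_lines:
        match = SYNC_WAIT.search(line)
        if match:
            times.append(int(match.group(1)))
    return times

# Hypothetical trace excerpt, invented for this example only.
sample_trace = [
    "WAIT #3: nam='log file sync' ela= 512 buffer#=4372 p2=0 p3=0 obj#=-1 tim=1325000000000001",
    "WAIT #3: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1325000000000100",
    "WAIT #3: nam='log file sync' ela= 448 buffer#=4390 p2=0 p3=0 obj#=-1 tim=1325000000000900",
]

times = sync_times_us(sample_trace)
print(len(times), mean(times), median(times))
```

In practice you would feed the function the lines of every trace file and write the resulting list out as a single column for loading into R.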

I turned flash logging on or off by using an ALTER IORMPLAN command like this (my DB is called SPOT, and $1 is the on or off argument passed to my script):

ALTER IORMPLAN dbplan=((name='SPOT', flashLog=$1),(name=other,flashlog=on))

And I ran “list metriccurrent where objectType='FLASHLOG'” before and after each run so I could be sure that flash logging was on or off.

When flash logging was on, I saw data like this (the first listing is from before a run, the second from after it):


     FL_DISK_FIRST                     FLASHLOG     32,669,310 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,318,741 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests


     FL_DISK_FIRST                     FLASHLOG     33,201,462 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,337,931 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests


So for this particular cell the flash disk “won” only about 3.5% of the time, (7,337,931-7,318,741)*100/(7,337,931-7,318,741+33,201,462-32,669,310), and prevented no “outliers”.  Outliers are defined as redo log syncs that would have taken longer than 500 ms to complete.
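
The arithmetic is easy to check. The sketch below, in Python, parses the two metric listings shown above and takes the difference between them; the function name is my own and the parsing assumes the three-column layout shown in the listings.

```python
def parse_metrics(text):
    """Map metric name -> integer count from 'list metriccurrent' style lines."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "FLASHLOG":
            counts[fields[0]] = int(fields[2].replace(",", ""))
    return counts

before = parse_metrics("""
     FL_DISK_FIRST                     FLASHLOG     32,669,310 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,318,741 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests
""")
after = parse_metrics("""
     FL_DISK_FIRST                     FLASHLOG     33,201,462 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,337,931 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests
""")

# Requests that happened during the run are the difference between snapshots.
delta = {name: after[name] - before[name] for name in before}
flash_share = 100 * delta["FL_FLASH_FIRST"] / (
    delta["FL_FLASH_FIRST"] + delta["FL_DISK_FIRST"]
)

print(delta)
print(f"flash completed first in {flash_share:.2f}% of requests")
```

During the run flash completed first 19,190 times against 532,152 for disk, which is a little under 3.5%, and the prevented-outliers counter did not move.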

Looking at my 4 million redo log sync times, I saw that the mean and median times were both higher when smart flash logging was involved, and that the difference in means was statistically significant:

> summary(flashon.data$synctime_us) #Smart flash logging ON
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0   452.0   500.0   542.4   567.0  3999.0
> summary(flashoff.data$synctime_us) #Smart flash logging OFF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   29.0   435.0   481.0   508.7   535.0  3998.0
> t.test(flashon.data$synctime_us,flashoff.data$synctime_us,paired=FALSE)

    Welch Two Sample t-test

data:  flashon.data$synctime_us and flashoff.data$synctime_us
t = 263.2139, df = 7977922, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
33.43124 33.93285
sample estimates:
mean of x mean of y
542.3583  508.6763
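
For anyone who wants to see what the Welch test is computing, here is a minimal sketch of the same statistic in Python using only the standard library. The two short samples are made up purely for illustration; they are not my measured data, and the function name is my own.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each sample mean.
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Hypothetical sync times in microseconds, invented for this example.
flash_on = [500, 540, 520, 560]
flash_off = [480, 500, 490, 510]

t, df = welch_t(flash_on, flash_off)
print(f"t = {t:.4f}, df = {df:.2f}")
```

Because the test does not assume equal variances, the degrees of freedom are estimated rather than simply n1 + n2 - 2, which is why R reports df = 7977922 rather than a round figure.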

Plotting the distribution of redo log sync times we can pretty easily see that there’s actually a small “hump” in times when flash logging is on (note logarithmic scale):


This is of course the exact opposite of what we expect, and I checked my data very carefully to make sure that I had not somehow switched samples.  And I repeated the test many times and always saw the same pattern.  

It may be that there is a slight overhead to running the race between disk and flash, and that this overhead makes redo log sync times slightly higher.  That overhead may become negligible on a busy system.  But for now I personally can’t confirm that smart flash logging provides the intended optimization, and in fact I observed a small but statistically significant and noticeable degradation in redo log sync times when it is enabled.


Using SSD for redo on Exadata - pt 2

In my previous post on this topic, I presented data showing that redo logs placed on an ASM diskgroup built from Exadata grid disks created on flash performed far worse than redo logs placed on an ASM diskgroup built from spinning SAS disks.

Of course, theory predicts that flash will not outperform spinning magnetic disk for the sequential write IOs experienced by redo logs, but on Exadata, flash disk performed much worse than seemed reasonable and worse than experience on regular Oracle with FusionIO SSD would predict (see this post).

Greg Rahn and Kevin Closson were both kind enough to help explain this phenomenon.  In particular, they pointed out that the flash cards might be performing poorly because of the default 512 byte redo block size and that I should try a 4K blocksize.   Unfortunately, at least on my patch level (, there appears to be a problem with setting a 4K blocksize

ALTER DATABASE add logfile thread 1 group 9 ('+DATA_SSD') size 4096M blocksize 4096
ERROR at line 1:
ORA-01378: The logical block size (4096) of file +DATA_SSD is not compatible with the disk sector size (media sector size is 512 and host sector size is 512)

According to Greg, the F20 SSD cards are incorrectly reporting their physical characteristics and this is fixed in the current patch level.   Luckily, you can override the check by setting

ALTER SYSTEM SET "_disk_sector_size_override"=TRUE SCOPE=BOTH;

Greg and Kevin really know their stuff:  setting a 4k redo log block size resulted in dramatic improvements to redo log throughput – elapsed time reduced by 70%:


As expected,  redo log performance for SSD still slightly lags that of SAS spinning disks.     It’s clear that you can’t expect a performance improvement by placing redo on SSD, but at least the 4K blocksize fix makes the response time comparable.  Of course, with the price of SSD being what it is, and the far higher benefits provided for other workloads – especially random reads – it’s hard to see an economic rationale for SSD-based redo.    But at least with a 4K blocksize it’s tolerable.

When our Exadata system is updated to the latest storage cell software, I’ll try comparing workloads with the Exadata smart flash logging feature.


Using flash disk for Redo on Exadata

In this Quest white paper and on my SSD blog here,  I report on how using a FusionIO flash SSD compares with SAS disk for various configurations – datafile, flash cache, temp tablespace and redo log.  Of all the options I found that using flash for redo was the least suitable, with virtually no performance benefit:


That being the case,  I was surprised to see that Oracle had decided to place redo logs on flash disk within the database appliance, and also that the latest release of the Exadata storage cell software used flash disk to cache redo log writes (Greg Rahn explains it here).   I asked around at OOW hoping someone could explain the thinking behind this, but generally I got very little insight.

I thought I had better repeat my comparisons between spinning and solid state disk on our Exadata system here at Quest.  Maybe the “super capacitor” backed 64M DRAM on each flash chip would provide enough buffering to improve performance.  Or maybe I was just completely wrong in my previous tests (though I REALLY don’t think so :-/).

Our Exadata 1/4 rack has a 237GB disk group constructed on top of storage cell flash disk.  I described how that is created in this post.   I chose 96GB per storage cell in order to allow the software to isolate the grid disks created on flash to 4 24GB FMODs (each cell has 16 FMODs).    Our Exadata system has fast SAS spinning disks – 12 per storage cell for a total of 36 disks.  Both the SAS and SSD disk groups had normal redundancy.
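
The sizing arithmetic behind that choice can be sketched in a few lines of Python. It assumes 384GB of flash per storage cell (the figure for this system given in the grid disk configuration post) and three storage cells, which follows from 36 SAS disks at 12 per cell; the variable names are my own.

```python
FLASH_PER_CELL_GB = 384       # assumed, from the grid disk configuration post
FMODS_PER_CELL = 16
GRID_DISK_PER_CELL_GB = 96
CELLS = 36 // 12              # 36 SAS disks at 12 per cell

fmod_size_gb = FLASH_PER_CELL_GB // FMODS_PER_CELL           # size of one FMOD
fmods_used_per_cell = GRID_DISK_PER_CELL_GB // fmod_size_gb  # FMODs holding grid disk
fmods_used_total = fmods_used_per_cell * CELLS

print(fmod_size_gb, fmods_used_per_cell, fmods_used_total)
```

On those assumptions each FMOD is 24GB, the flash grid disks occupy 4 FMODs per cell, and the flash disk group spans 12 FMODs in total against 36 SAS spindles.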

I ran an identical redo-intensive workload on the system using SAS or SSD diskgroups for the redo logs.  Redo logs were 3 groups of 4GB per instance.   I ran the workload on its own, and as 10 separate concurrent sessions.

The results shocked me:


When running at a single level of concurrency,  the SSD based ASM redo seemed to be around 4 times slower than the default SAS-based ASM redo.  Things got substantially worse as I upped the level of concurrency with SSD being almost 20 times slower.  Wow.

I had expected the SAS based redo to win – the SAS ASM disk group has 36 spindles to write to, while the SSD group is (probably) only writing to 12 FMODs.  And we know that we don’t expect flash to be as good as SAS disks for sequential writes.  But still, the performance delta is remarkable. 



I’m yet to see any evidence that putting redo logs on SSD is a good idea, and I keep observing data from my own tests indicating that it is neutral at best and A Very Bad Idea at worst.  Is anybody seeing anything similar?  Does anybody think there’s a valid scenario for flash-based redo?


Configuring Exadata flash as grid disk

The default, or at least a very common, configuration for Exadata is to configure all the flash as Exadata Smart Flash Cache (ESFC).   This is a simple and generally performant configuration, but won’t be the best choice for all cases.  In particular, if you have a table that is performance critical, and it could fit in the flash storage you have available, you might be better off configuring some of your flash as grid disk, creating an ASM disk group from that, and putting the table there.

Here’s the procedure:

1. Drop the flash cache, create a new flashcache of a smaller size, then create the griddisks from the unallocated space.  These CELLCLI commands do that:

CellCLI> drop flashcache
Flash cache exa1cel01_FLASHCACHE successfully dropped
CellCLI> create flashcache all size=288g
Flash cache exa1cel01_FLASHCACHE successfully created
CellCLI> create griddisk all flashdisk prefix=ssddisk

There’s 384G of flash on each storage cell, so the above commands leave about 96G (384G minus the 288G flash cache) for SSD grid disk.   Run those commands on each cell node, perhaps by using the dcli command (see this post for an example).

2. The above procedure will create disks in the format o/cellIpAddress/ssddisk_FD_*_cellnode.  Log into an ASM instance, and issue the following command to create a diskgroup from those disks:

create diskgroup DATA_SSD normal redundancy disk 'o/*/ssddisk*'
attribute 'compatible.rdbms'='',
'compatible.asm'='',
'cell.smart_scan_capable'='TRUE',
'au_size'='4M'

Alternatively you can use the database control for the ASM instance to create the new diskgroup.  Your new flash disks should show up as candidate disks.

The relative performance of flash disk versus flash cache in Exadata is similar to what I’ve seen using the Database flash cache.  Placing an object directly on flash is faster than using the cache, although the cache is very effective.  Here are the results for 200,000 primary key lookups across 1,000,000 possible primary keys:



Clearing the Exadata smart flash cache using dcli

I’ve been doing some performance benchmarks on our Exadata box, specifically focusing on the performance of the smart flash cache.  I found that even if I switched the CELL_FLASH_CACHE storage setting to NONE,  the flash cache would still keep cached blocks in flash and would therefore give me artificially high values for the “cell flash cache read hits” statistic when I set CELL_FLASH_CACHE back to DEFAULT or KEEP.  What I needed was a way to flush the Exadata flash cache.

Unfortunately there doesn’t seem to be a good way to flush the flash cache – no obvious CELLCLI command.   Maybe I’ve missed something, but for now I’m dropping and recreating the flash cache before each run.

Luckily the dcli command lets me drop and recreate on each cell directly from the database node and even sets up passwordless connections.  Here’s how to do it.

Firstly, create a script that will drop and recreate the flash cache for a single cell:

$ cat flushcache.sh
cellcli <<!
drop flashcache;
create flashcache all;
!

Now, use dcli to execute that on each cell node (I have three, named exa1cel01, exa1cel02 and exa1cel03):

$ dcli -c exa1cel01,exa1cel02,exa1cel03 --serial -k -l userid -x flushcache.sh

The “-k” option copies the ssh key to the cell nodes, which means that after the first execution you’ll be able to do this without typing in the password for each cell node.   The “--serial” option makes each command happen one after another rather than all at once; you probably don’t need this…
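
If you drive your benchmarks from a script, the same command line can be assembled programmatically. This is only a sketch in Python: the helper name and its parameters are my own invention, it uses just the dcli options shown above, and it builds the argument list without running anything.

```python
import shlex

def dcli_command(cells, user, script, serial=True, copy_key=True):
    """Build the dcli argument list used above; nothing is executed here."""
    argv = ["dcli", "-c", ",".join(cells)]
    if serial:
        argv.append("--serial")   # run on one cell at a time
    if copy_key:
        argv.append("-k")         # push the ssh key for passwordless access
    argv += ["-l", user, "-x", script]
    return argv

argv = dcli_command(
    ["exa1cel01", "exa1cel02", "exa1cel03"], "userid", "flushcache.sh"
)
print(shlex.join(argv))
```

On the database node the list could then be handed to subprocess.run(argv, check=True) before each benchmark run, assuming dcli is on the path.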

Anyone know a better way to flush the Exadata flash cache?