

Exadata Smart Flash Logging–Outliers

In my last post, I looked at the effect of the Exadata smart flash logging.  Overall,  there seemed to be a slight negative effect on median redo log sync times.  This chart (slightly different from the last post because of different load and configuration of the system), shows how there’s a “hump” of redo log syncs that take slightly longer when the flash logging is enabled:


But of course, the flash logging feature was designed to improve performance not of the “average” redo log sync, but of the “outliers”. 

In my tests, I had 40 concurrent processes writing redo as fast as they could.  Occasionally this would result in some really long wait times.  For instance, in this trace you see an outlier of 291,780 microseconds (the biggest outlier in my tests BTW) within an otherwise unremarkable set of waits:

WAIT #47124064145648: nam='log file sync' ela= 1043 buffer#=101808 sync scn=1266588527 p3=0 obj#=-1 tim=1347583167588250
WAIT #47124064145648: nam='log file sync' ela= 2394 buffer#=130714 sync scn=1266588560 p3=0 obj#=-1 tim=1347583167590888
WAIT #47124064145648: nam='log file sync' ela= 932 buffer#=101989 sync scn=1266588598 p3=0 obj#=-1 tim=1347583167592057
WAIT #47124064145648: nam='log file sync' ela= 291780 buffer#=102074 sync scn=1266588637 p3=0 obj#=-1 tim=1347583167884090
WAIT #47124064145648: nam='log file sync' ela= 671 buffer#=102196 sync scn=1266588697 p3=0 obj#=-1 tim=1347583167885294
WAIT #47124064145648: nam='log file sync' ela= 957 buffer#=102294 sync scn=1266588730 p3=0 obj#=-1 tim=1347583167886575
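For readers who want to pull the same numbers out of their own 10046 traces, here is a minimal sketch in Python (not the tooling used for this post). It assumes the WAIT lines look exactly like the ones above, with `ela=` giving the elapsed time in microseconds; the regular expression and the function name `log_file_sync_waits` are illustrative.

```python
import re

# Matches 'log file sync' WAIT lines in the format shown above.
# Group 1 is ela (elapsed microseconds), group 2 is tim (trace timestamp).
WAIT_RE = re.compile(r"^WAIT #\d+: nam='log file sync' ela= *(\d+) .*tim=(\d+)")

def log_file_sync_waits(lines):
    """Yield (tim, ela_us) for every 'log file sync' wait found in the lines."""
    for line in lines:
        match = WAIT_RE.match(line)
        if match:
            yield int(match.group(2)), int(match.group(1))

sample = [
    "WAIT #47124064145648: nam='log file sync' ela= 932 buffer#=101989 sync scn=1266588598 p3=0 obj#=-1 tim=1347583167592057",
    "WAIT #47124064145648: nam='log file sync' ela= 291780 buffer#=102074 sync scn=1266588637 p3=0 obj#=-1 tim=1347583167884090",
    "WAIT #47124064145648: nam='log file sync' ela= 671 buffer#=102196 sync scn=1266588697 p3=0 obj#=-1 tim=1347583167885294",
]

waits = list(log_file_sync_waits(sample))
slowest = max(waits, key=lambda w: w[1])  # the wait with the largest ela
print(slowest)
```

Passing an open trace file instead of `sample` works the same way, since the function only iterates over lines.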

To see if the flash logging feature was successful in removing these outliers, I extracted the 10,000 longest waits from the roughly 8,000,000 waits I recorded in each category (flash logging on and off).  Here’s a plot (non-logarithmic) of those waits:
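Picking the longest waits out of millions of samples does not require sorting the whole set. A minimal Python sketch of that step, assuming the wait times are already available as a sequence of microsecond values (the function name `longest_waits` and the sample data are illustrative):

```python
import heapq

def longest_waits(sync_times_us, n=10_000):
    """Return the n longest waits, longest first, without a full sort."""
    return heapq.nlargest(n, sync_times_us)

# The six ela values from the trace excerpt above, in microseconds.
sample_us = [1043, 2394, 932, 291780, 671, 957]
top3 = longest_waits(sample_us, n=3)
print(top3)
```

`heapq.nlargest` keeps only `n` candidates in memory at a time, which matters when the input is a generator reading a large trace file.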


So – the flash log feature was effective in eliminating or at least reducing very extreme outlying redo log sync times.    Most redo log sync operations will experience no improvement or maybe even a slight degradation. But for the small number of log syncs that would have experienced a really excessive delay, the feature works as advertised – it reduces the chance of really excessive log file syncs. 

In my opinion, this effect doesn't imply that the flash can process a redo log write faster than the magnetic disks - in fact probably the opposite is true.  But given two destinations to choose from, we avoid the really long delays that occur when only one of the destinations is overloaded. 


Exadata smart flash logging

Exadata storage software introduced the Smart Flash Logging feature.  The intent of this is to reduce overall redo log sync times - especially outliers - by allowing the Exadata flash storage to serve as a secondary destination for redo log writes.  During a redo log sync, Oracle will write to the disk and flash simultaneously and allow the redo log sync operation to complete when the first device completes. 
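The "first device to finish wins" idea can be illustrated with a toy simulation in Python. This is only a sketch under stated assumptions: the latency distributions, the stall probability and the function name `simulate` are all hypothetical, not measurements, and the model ignores any overhead of running the race itself.

```python
import random

def simulate(n=100_000, seed=1):
    """Compare disk-only write latency with a disk/flash race (microseconds)."""
    rng = random.Random(seed)
    disk_only, raced = [], []
    for _ in range(n):
        # Hypothetical typical disk write latency.
        disk = rng.lognormvariate(6.2, 0.2)
        # Hypothetical occasional disk stall, the source of outliers.
        if rng.random() < 0.001:
            disk += rng.uniform(50_000, 300_000)
        # Flash assumed slightly slower than disk on a typical write.
        flash = rng.lognormvariate(6.5, 0.2)
        disk_only.append(disk)
        # The sync completes when the first device completes.
        raced.append(min(disk, flash))
    return disk_only, raced

disk_only, raced = simulate()
print(f"worst disk-only: {max(disk_only):.0f} us, worst raced: {max(raced):.0f} us")
```

In this model the race can never be slower than disk alone, so it says nothing about typical sync times; it only shows why a second destination trims the extreme tail even when that destination is usually the slower one.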

Jason Arneil reports some initial observations here, and Luis Moreno Campos summarized it here.

I’ve reported in the past on using SSD for redo, including on Exadata, and generally I’ve found that SSD is a poor fit for redo log style sequential write IO.  But this architecture should at least do no harm, and on the assumption that the SSD will at least occasionally complete faster than a spinning disk, I tried it out. 

My approach involved the same workload I’ve used in similar tests.  I ran 20 concurrent processes each of which performed 200,000 updates and commits – a total of 4,000,000 redo log sync operations.  I captured every redo log sync wait from 10046 traces and loaded them in R for analysis.

I turned flash logging on or off by using an ALTER IORMPLAN command like this (my DB is called SPOT, and $1 stands for on or off):

ALTER IORMPLAN dbplan=((name='SPOT', flashLog=$1),(name=other,flashlog=on))'

And I ran “list metriccurrent where objectType='FLASHLOG'” before and after each run so I could be sure that flash logging was on or off.

When flash logging was on, I saw data like this:


     FL_DISK_FIRST                     FLASHLOG     32,669,310 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,318,741 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests


     FL_DISK_FIRST                     FLASHLOG     33,201,462 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,337,931 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests


So for this particular cell the flash disk “won” only about 3.5% of the time, (7,337,931-7,318,741)*100/(7,337,931-7,318,741+33,201,462-32,669,310), and prevented no “outliers” (FL_PREVENTED_OUTLIERS did not change).  Outliers are defined as redo log syncs that would have taken longer than 500 ms to complete. 
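The arithmetic is easy to check by taking the difference between the two snapshots. A minimal Python sketch, assuming the first listing above is the "before" snapshot and the second is the "after" snapshot (the dictionary names are illustrative):

```python
# Counter values copied from the two metric listings above.
before = {"FL_DISK_FIRST": 32_669_310, "FL_FLASH_FIRST": 7_318_741,
          "FL_PREVENTED_OUTLIERS": 774_146}
after = {"FL_DISK_FIRST": 33_201_462, "FL_FLASH_FIRST": 7_337_931,
         "FL_PREVENTED_OUTLIERS": 774_146}

# The counters are cumulative, so the run's activity is the difference.
delta = {name: after[name] - before[name] for name in before}

# Share of redo writes during the run where flash completed first.
flash_first_pct = 100 * delta["FL_FLASH_FIRST"] / (
    delta["FL_FLASH_FIRST"] + delta["FL_DISK_FIRST"])

print(delta)
print(f"flash first: {flash_first_pct:.2f}%")
```

With the counters quoted above, flash completed first for 19,190 of 551,342 writes, roughly 3.5%, and the prevented-outlier count did not move.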

Looking at my 4 million redo log sync times, I saw that the mean and median times were higher when smart flash logging was enabled, and that the difference in means was statistically significant:

> summary(flashon.data$synctime_us) #Smart flash logging ON
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0   452.0   500.0   542.4   567.0  3999.0
> summary(flashoff.data$synctime_us) #Smart flash logging OFF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   29.0   435.0   481.0   508.7   535.0  3998.0
> t.test(flashon.data$synctime_us,flashoff.data$synctime_us,paired=FALSE)

    Welch Two Sample t-test

data:  flashon.data$synctime_us and flashoff.data$synctime_us
t = 263.2139, df = 7977922, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
33.43124 33.93285
sample estimates:
mean of x mean of y
542.3583  508.6763
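For readers unfamiliar with the test reported above: R's `t.test` with its default settings runs Welch's unequal-variance t-test. The statistic can be sketched in a few lines of Python using only the standard library. The function name `welch_t` and the tiny sample vectors are illustrative; they are not the data from these runs.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, df) for Welch's two-sample t-test on sequences x and y."""
    # Squared standard error contributed by each sample.
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t:.4f}, df = {df:.3f}")
```

With millions of samples in each group, even a difference of a few tens of microseconds in the means yields a very large t value, which is why statistical significance here says little about whether the difference matters in practice.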

Plotting the distribution of redo log sync times we can pretty easily see that there’s actually a small “hump” in times when flash logging is on (note logarithmic scale):


This is of course the exact opposite of what we expect, and I checked my data very carefully to make sure that I had not somehow switched samples.  And I repeated the test many times and always saw the same pattern.  

It may be that there is a slight overhead to running the race between disk and flash, and that this overhead makes redo log sync times slightly higher.  That overhead may become negligible on a busy system.  But for now I personally can’t confirm that smart flash logging provides the intended optimization, and in fact I observed a small but statistically significant and noticeable degradation in redo log sync times when it is enabled.


Using SSD for redo on Exadata - pt 2

In my previous post on this topic, I presented data showing that redo logs placed on an ASM disk group built from flash-based Exadata grid disks performed far worse than redo logs placed on an ASM disk group built from spinning SAS disks.

Of course, theory predicts that flash will not outperform spinning magnetic disk for the sequential write IOs experienced by redo logs, but on Exadata, flash disk performed much worse than seemed reasonable and worse than experience on regular Oracle with FusionIO SSD would predict (see this post).

Greg Rahn and Kevin Closson were both kind enough to help explain this phenomenon.  In particular, they pointed out that the flash cards might be performing poorly because of the default 512 byte redo block size and that I should try a 4K blocksize.   Unfortunately, at least on my patch level (, there appears to be a problem with setting a 4K blocksize

ALTER DATABASE add logfile thread 1 group 9 ('+DATA_SSD') size 4096M blocksize 4096
ERROR at line 1:
ORA-01378: The logical block size (4096) of file +DATA_SSD is not compatible with the disk sector size (media sector size is 512 and host sector size is 512)

According to Greg, the F20 SSD cards are incorrectly reporting their physical characteristics and this is fixed in the current patch level.   Luckily, you can override the check by setting

ALTER SYSTEM SET "_disk_sector_size_override"=TRUE SCOPE=BOTH;

Greg and Kevin really know their stuff:  setting a 4k redo log block size resulted in dramatic improvements to redo log throughput – elapsed time reduced by 70%:


As expected,  redo log performance for SSD still slightly lags that of SAS spinning disks.     It’s clear that you can’t expect a performance improvement by placing redo on SSD, but at least the 4K blocksize fix makes the response time comparable.  Of course, with the price of SSD being what it is, and the far higher benefits provided for other workloads – especially random reads – it’s hard to see an economic rationale for SSD-based redo.    But at least with a 4K blocksize it’s tolerable.

When our Exadata system is updated to the latest storage cell software, I’ll try comparing workloads with the Exadata smart flash logging feature.


Using flash disk for Redo on Exadata

In this Quest white paper and on my SSD blog here,  I report on how using a FusionIO flash SSD compares with SAS disk for various configurations – datafile, flash cache, temp tablespace and redo log.  Of all the options I found that using flash for redo was the least suitable, with virtually no performance benefit:


That being the case, I was surprised to see that Oracle had decided to place redo logs on flash disk within the database appliance, and also that the latest release of the Exadata storage cell software used flash disk to cache redo log writes (Greg Rahn explains it here).   I asked around at OOW hoping someone could explain the thinking behind this, but generally I got very little insight.

I thought I better repeat my comparisons between spinning and solid state disk on our Exadata system here at Quest.  Maybe the “super capacitor” backed 64M DRAM on each flash chip would provide enough buffering to improve performance.  Or maybe I was just completely wrong in my previous tests (though I REALLY don’t think so :-/).

Our Exadata 1/4 rack has a 237GB disk group constructed on top of storage cell flash disk.  I described how that is created in this post.   I chose 96GB per storage cell in order to allow the software to isolate the grid disks created on flash to 4 24GB FMODs (each cell has 16 FMODs).    Our Exadata system has fast SAS spinning disks – 12 per storage cell for a total of 36 disks.  Both the SAS and SSD disk groups had normal redundancy.

I ran an identical redo-intensive workload on the system using SAS or SSD disk groups for the redo logs.  Redo logs were 3 groups of 4GB per instance.   I ran the workload on its own, and as 10 separate concurrent sessions.  

The results shocked me:


When running at a single level of concurrency,  the SSD based ASM redo seemed to be around 4 times slower than the default SAS-based ASM redo.  Things got substantially worse as I upped the level of concurrency with SSD being almost 20 times slower.  Wow.

I had expected the SAS based redo to win – the SAS ASM disk group has 36 spindles to write to, while the SSD group is (probably) only writing to 12 FMODs.  And we know that we don’t expect flash to be as good as SAS disks for sequential writes.  But still, the performance delta is remarkable. 



I’m yet to see any evidence that putting redo logs on SSD is a good idea, and I keep observing data from my own tests indicating that it is neutral at best and A Very Bad Idea at worst.  Is anybody seeing anything similar?  Does anybody think there’s a valid scenario for flash-based redo?


Comparing Hadoop Oracle loaders

Oracle put a lot of effort into highlighting the upcoming Oracle Hadoop Loader (OHL) at OOW 2011 – it was even highlighted in Andy Mendelsohn's keynote.  It’s great to see Oracle recognizing Hadoop as a top tier technology!

However, there were a few comments made about the “other loaders” that I wanted to clarify.  At Quest, I lead the team that writes the Quest Data Connector for Oracle and Hadoop (let’s call it the “Quest Connector”) which is a plug-in to the Apache Hadoop SQOOP framework and which provides optimized bidirectional data loads between Oracle and Hadoop.  Below I’ve outlined some of the high level features of the Quest Connector in the context of the  Oracle-Hadoop loaders.  Of course, I got my information on the Oracle loader from technical sessions at OOW so I may have misunderstood and/or the facts may change between now and the eventual release of that loader.  But I wanted to go on the record with the following:

  • All parties (Quest, Cloudera, Oracle) agree that native SQOOP (i.e., without the Quest plug-in) will be sub-optimal: it will not exploit Oracle direct path reads or writes, will not use partitioning, nologging, etc.   Both Cloudera and Quest recommend that if you are doing transfers between Oracle and Hadoop, you use SQOOP with the Quest connector.
  • The Quest connector is a free, open source plug in to SQOOP, which is itself a free, open source software product.  Both are licensed under the Apache 2.0 open source license.  Licensing for the Oracle Loader has not been announced, but Oracle has said it will be a commercial product and therefore presumably not free under all circumstances.   It’s definitely not open source.
  • The Quest loader is available now (version 1.4); the Oracle loader is in beta and will be released commercially in 2012.
  • The Oracle loader moves data from Hadoop to Oracle only.  The Quest loader can also move data from Oracle to Hadoop.   We import data into Hadoop from an Oracle database usually 5+ times faster than SQOOP alone.
  • Both Quest and the Oracle loader use direct path writes when loading from Hadoop to Oracle.  Oracle do say they use OCI calls which may be faster than the direct path SQL calls used by Quest in some circumstances.   But I’d suggest that the main optimization in each case is direct path.
  • Both Quest and the Oracle loader can do parallel direct path writes to a partitioned Oracle table.  In the case of the Quest loader, we create partitions based on the job and mapper ids.  Oracle can use logical keys and write into existing partitioned tables.  My understanding is that they will shuffle and sort the data in the mappers to direct the output to the appropriate partition in bulk.  They also do statistical sampling which may improve the load balancing when you are inserting into an existing table. 
  • The Quest loader can update existing tables, and can do Merge operations that insert or update rows depending on the existence of a matching key value.  My understanding is that the Oracle loader will do inserts only - at least initially.
  • Both the Quest connector and the Oracle loader have some form of GUI.  The Oracle GUI I believe is in the commercial ODI product.  The Quest GUI is in the free Toad for Cloud Databases Eclipse plug-in.  I’ve put a screenshot of that at the end of the post.
  • The Quest connector uses the SQOOP framework, which is an Apache Hadoop sub-project maintained by multiple companies, most notably Cloudera.  This means that the Hadoop side of the product was written by people with a lot of experience in Hadoop.   Cloudera and Quest jointly support SQOOP when used with the Quest connector, so you get the benefit of having very experienced Hadoop people involved as well as Quest people who know Oracle very well.   Obviously Oracle knows Oracle better than anyone, but people like me have been working with Oracle for decades and have credibility, I think, when it comes to Oracle performance optimization.

Again,  I’m happy to see Oracle embracing Hadoop;  I just wanted to set the record straight with regard to our technology, which exists today as a free tool for optimized bi-directional data transfer between Oracle and Hadoop. 

You can download the Quest Connector at http://bit.ly/questHadoopConnector.  The documentation is at  http://bit.ly/QuestHadoopDoc.

15-09-2011 3-01-01 PM import

21-09-2011 9-21-41 AM Hadoop solutions