dbKoda for MongoDB

dbKoda for MongoDB - a modern open source database IDE, now available for MongoDB. Download it here!

Next Generation
Databases: NoSQL,
NewSQL and Big Data

Buy at Amazon

Read sample at Amazon

Buy at Apress

Latest Postings:


Oracle Performance Survival Guide

Buy It
Read it on Safari
Scripts and Examples
Sample Chapter


Powered by Squarespace

 MySQL Stored procedure programming

Buy It
Read it on Safari
Scripts and Examples 



Redo log sync time vs redo size

It’s been tough to find time to do actual performance research of late, but I have managed to get a test system prepared that will allow me to determine if Solid State disks offer some performance advantage over spinning disks when the redo entries are very large.   This is to test the theory that the results I’ve published in the past (here and here for instance) actually apply only when the redo entries are relatively small.  For small sequential writes to SSD, each successive write will invoke an erase of a complete NAND page, whereas in a larger sequential write  this will not occur since each write will hit different pages.

I’m still setting up the test environment to look at this, but first I thought it would be worth showing this pretty picture:

16-09-2013 9-08-26 PM redo sync2

This chart shows how redo log sync time (eg, time taken to COMMIT) varies with the amount of redo information written since the last COMMIT.  There is a slight overall upwards trend, but the really noticeable trend is the “sawtooth” effect, which I’ve highlighted below:


Can you guess what causes this?  

I think it’s pretty clear that we are seeing the effect of redo buffer flushing.  Remember, when you write redo entries, they are written to the redo buffer (or sometimes a strand).  Oracle flushes the buffer when you commit, but also flushes it when it is 1/3rd full, after 3 seconds (I think from memory) or after 1MB of redo entries.   Given that, we can see what happens when we commit:

  • If there has been no redo log flush since we started writing, we have to wait while LGWR writes all the entries to disk
  • If a redo log flush occurs after we have written our entries but before we COMMIT, then we have to write virtually nothing (but a COMMIT marker I suppose)
  • Between the two scenarios, we may have to write some of our redo log entries.  However, we should never have to write more than about 1MB

In the chart above we can clearly see the redo log flushes occurring at 1MB intervals.  If we write less than 1MB we generally have to write it all, above 1MB we only have to write a portion of the redo entry.     Note that on this system, I was pretty much the only session doing significant activity, so the pattern is very clear.  On a busy system the effect would be randomized by others causing flushes to occur.

Hopefully I’ll soon be able to compare HDD and SSD performance to see if there are any significant differneces to these trends – the above data was generated by redo on SSD.



When Steven and I wrote MySQL Stored Procedure programming our biggest reservation about the new stored procedure language was the lack of support for proper error handling.  The lack of the SIGNAL and RESIGNAL clauses prevented a programmer from raising an error that could be propagated throughout a call stack properly, and the lack of a general purpose exception handler which could examine error codes at run time led to awkward exception handling code at best, and poorly implemented error handling at worst.

In 5.4 MySQL implemented the SIGNAL and RESIGNAL clauses (see http://guyharrison.squarespace.com/blog/2009/7/13/signal-and-resignal-in-mysql-54-and-60.html), which corrected half of the problem.  Now finally, MySQL 5.6 implements the ANSI GET DIAGNOSTICS clause and we can write a general catch-all exception handler.

Here’s an example:


The exception handler is on lines 10-27.  It catches any SQL exception, then uses the GET DIAGNOSTICS clause to fetch the SQLstate, MySQL error code and messages to local variables.  We then decide what to do for anticipated errors – duplicate or badly formed product codes and SIGNAL a more more meaningful application error.  Unexpected errors are RESIGNALed on line 24.

This is a great step forward for MySQL stored procedures – the lack of a means to programmatically examine error codes made proper error handling difficult or impossible.  This is a major step forward in maturity. 

Thanks to Ernst Bonat of www.evisualwww.com for helping me work through the usage of GET DIAGNOSTICS.


Exadata Smart Flash Logging–Outliers

In my last post, I looked at the effect of the Exadata smart flash logging.  Overall,  there seemed to be a slight negative effect on median redo log sync times.  This chart (slightly different from the last post because of different load and configuration of the system), shows how there’s a “hump” of redo log syncs that take slightly longer when the flash logging is enabled:


But of course, the flash logging feature was designed to improve performance not of the “average” redo log sync, but of the “outliers”. 

In my tests, I had 40 concurrent processes writing redo as fast as they could.  Occasionally this would result in some really long wait times.  For instance, in this trace you see an outlier of 291,780 microseconds (the biggest outlier in my tests BTW) within an otherwise unremarkable set of waits:

WAIT #47124064145648: nam='log file sync' ela= 1043 buffer#=101808 sync scn=1266588527 p3=0 obj#=-1 tim=1347583167588250
WAIT #47124064145648: nam='log file sync' ela= 2394 buffer#=130714 sync scn=1266588560 p3=0 obj#=-1 tim=1347583167590888
WAIT #47124064145648: nam='log file sync' ela= 932 buffer#=101989 sync scn=1266588598 p3=0 obj#=-1 tim=1347583167592057
WAIT #47124064145648: nam='log file sync' ela= 291780 buffer#=102074 sync scn=1266588637 p3=0 obj#=-1 tim=1347583167884090
WAIT #47124064145648: nam='log file sync' ela= 671 buffer#=102196 sync scn=1266588697 p3=0 obj#=-1 tim=1347583167885294
WAIT #47124064145648: nam='log file sync' ela= 957 buffer#=102294 sync scn=1266588730 p3=0 obj#=-1 tim=1347583167886575

To see if the flash logging feature was successful in removing these outliers, I extracted the top 10,000 waits from each of the roughly 8,000,000 waits I recorded in each category.  Here’s a plot (non-logarithmic) of those waits:


So – the flash log feature was effective in eliminating or at least reducing very extreme outlying redo log sync times.    Most redo log sync operations will experience no improvement or maybe even a slight degradation. But for the small number of log syncs that would have experienced a really excessive delay, the feature works as advertised – it reduces the chance of really excessive log file syncs. 

In my opinion, this effect doesn't imply that the flash can process a redo log write faster than the magnetic disks - in fact probably the opposite is true.  But given two desitinations to choose from, we avoid really long delays that occur when one of the destinations only is overloaded. 


Exadata smart flash logging

Exadata storage software introduced the Smart flash logging feature.  The intent of this is to reduce overall redo log sync times - especially outliers - by allowing the exadata flash storage to serve as a secondary destination for redo log writes.  During a redo log sync, Oracle will write to the disk and flash simultaneously and allow the redo log sync operation to complete when the first device completes. 

Jason Arneil reports some initial observations here, and Luis Moreno Campos summarized it here.

I’ve reported in the past on using SSD for redo including on Exadata and generally I’ve found that SSD is a poor fit for redo log style sequential write IO.  But this architecture should at least do now harm and on the assumption that the SSD will at least occasionally complete faster than a spinning disk I tried it out. 

My approach involved the same workload I’ve used in similar tests.  I ran 20 concurrent processes each of which performed 200,000 updates and commits – a total of 4,000,000 redo log sync operations.  I captured every redo log sync wait from 10046 traces and loaded them in R for analysis.

I turned flash logging on or off by using an ALTER IORMPLAN command like this (my DB is called SPOT):

ALTER IORMPLAN dbplan=((name='SPOT', flashLog=$1),(name=other,flashlog=on))'

And I ran “list metriccurrent where objectType='FLASHLOG'” before and after each run so I could be sure that flash logging was on or off.

When flash logging was on, I saw data like this:


     FL_DISK_FIRST                     FLASHLOG     32,669,310 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,318,741 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests


      FL_DISK_FIRST                     FLASHLOG     33,201,462 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,337,931 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests


So for this particular cell the flash disk “won” only 3.8% of times (7,337,931-7,318,741)*100/(7,337,931-7,318,741+33,201,462-32,669,310) and prevented no “outliers”.  Outliers are defined as being redo log syncs that would have taken longer than 500 ms to complete. 

Looking at my 4 million redo log sync times,  I saw that the average and median times where statistically significantly higher when the smart flash logging was involved:

> summary(flashon.data$synctime_us) #Smart flash logging ON
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0   452.0   500.0   542.4   567.0  3999.0
> summary(flashoff.data$synctime_us) #Smart flash logging OFF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   29.0   435.0   481.0   508.7   535.0  3998.0
> t.test(flashon.data$synctime_us,flashoff.data$synctime_us,paired=FALSE)

    Welch Two Sample t-test

data:  flashon.data$synctime_us and flashoff.data$synctime_us
t = 263.2139, df = 7977922, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
33.43124 33.93285
sample estimates:
mean of x mean of y
542.3583  508.6763

Plotting the distribution of redo log sync times we can pretty easily see that there’s actually a small “hump” in times when flash logging is on (note logarithmic scale):


This is of course the exact opposite of what we expect, and I checked my data very carefully to make sure that I had not somehow switched samples.  And I repeated the test many times and always saw the same pattern.  

It may be that there is a slight overhead to running the race between disk and flash, and that that overhead makes redo log sync times slightly higher.  That overhead may become more negligible on a busy system.  But for now I personally can’t confirm that smart flash logging provides the intended optimization and in fact I observed a small but statistically significant and noticeable degradation in redo log sync times when it is enabled.


Getting started with Apache Pig

If, like me, you want to play around with data in a Hadoop cluster without having to write hundreds or thousands of lines of Java MapReduce code, you most likely will use either Hive (using the  Hive Query Language HQL) or Pig.

Hive is a SQL-like language which compiles to Java map-reduce code, while Pig is a data flow language which allows you to specify your map-reduce data pipelines using high level abstractions. 

The way I like to think of it is that writing Java MapReduce is like programming in assembler:  you need to manually construct every low level operation you want to perform.  Hive allows people familiar with SQL to extract data from Hadoop with ease and – like SQL – you specify the data you want without having to worry too much about the way in which it is retrieved.  Writing a Pig script is like writing a SQL execution plan:  you specify the exact sequence of operations you want to undertake when retrieving the data.  Pig also allows you to specify more complex data flows than is possible using HQL alone.

As a crusty old RDBMS guy, I at first thought that Hive and HQL was the most attractive solution and I still think Hive is critical to enterprise adoption of Hadoop since it opens up Hadoop to the world of enterprise Business Intelligence.  But Pig really appeals to me as someone who has spent so much time tuning SQL.  The Hive optimizer is currently at the level of early rule-based RDBMS optimizers from the early 90s.  It will get better and get better quickly, but given the massive size of most Hadoop clusters, the cost of a poorly optimized HQL statement is really high.  Explicitly specifying the execution plan in Pig arguably gives the programmer more control and lessens the likelihood of the “HQL statement from Hell” brining a cluster to it’s knees.

So I’ve started learning Pig, using the familiar (to me) Oracle sample schema which I downloaded using SQOOP.   (Hint:  Pig likes tab separated  files, so use the --fields-terminated-by '\t' flag in your SQOOP job). 

Here’s a diagram I created showing how some of the more familiar HQL idioms are implemented in Pig:

Note how using Pig we explicitly control the execution plan:  In HQL it’s up to the optimizer whether tables are joined before or after the “country_region=’Asia’” filter is applied.  In Pig I explicitly execute the filter before the join.    It turns out that the Hive optimizer does the same thing, but for complex data flows being able to explicitly control the sequence of events can be an advantage. 

Pig is only a little more wordy than HQL and while I definitely like the familiar syntax of HQL I really like the additional control of Pig.

Page 1 ... 3 4 5 6 7 ... 19 Next 5 Entries »