Guy Harrison - Yet Another Database Blog

Can the Exadata Smart Flash Cache slow smart scans?

Monday, December 30, 2013 at 3:37PM

I’ve been doing some work on the Exadata Smart Flash Cache recently and came across a situation in which setting CELL_FLASH_CACHE to KEEP will significantly slow down smart scans on a table.

If we create a table with default settings, then the Exadata Smart Flash Cache (ESFC) will not be involved in smart scans, since by default only small IOs get cached. If we want the ESFC to be involved, we need to set the CELL_FLASH_CACHE to KEEP. Of course, we don’t expect immediate improvements, since we expect that the next smart scan will need to populate the cache before subsequent scans can benefit.

HOWEVER, what I’m seeing in practice is that the next smart scan following an ALTER TABLE … STORAGE(CELL_FLASH_CACHE KEEP) is significantly degraded, while subsequent scans get a performance boost. Here’s an example of what I observe:

The big increase in cell IO time comes from an increase in both the number and the latency of “cell smart table scan” waits. The wait stats for the first scan with the default setting looked like this:

Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait Total Waited
----------------------------------------   Waited ---------- ------------
gc cr disk read                                 1        0.00          0.00
cell single block physical read                 2        0.01          0.01
row cache lock                                  2        0.00          0.00
gc cr grant 2-way                               1        0.00          0.00
SQL*Net message to client                    1021        0.00          0.00
reliable message                                1        0.00          0.00
enq: KO - fast object checkpoint                2        0.00          0.00
cell smart table scan                        9322        0.14          7.60
SQL*Net message from client                  1021        0.00          0.02

For the first scan with KEEP cache it looked like this:

Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait Total Waited
----------------------------------------   Waited ---------- ------------
SQL*Net message to client                    1021        0.00          0.00
reliable message                                1        0.00          0.00
enq: KO - fast object checkpoint                2        0.00          0.00
cell smart table scan                       14904        1.21         33.37
SQL*Net message from client                  1021        0.00          0.02
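
As a rough cross-check, the two “cell smart table scan” lines above can be compared with a few lines of Python. This is only arithmetic on the figures already shown; the variable and function names are mine and are not taken from the test script.

```python
# Figures copied from the "cell smart table scan" lines in the two
# tkprof wait summaries above.
# Tuple layout: (times waited, max wait in seconds, total waited in seconds)
default_scan = (9322, 0.14, 7.60)
keep_scan = (14904, 1.21, 33.37)


def avg_wait_ms(stats):
    """Average time per wait event, in milliseconds."""
    waits, _max_wait, total_seconds = stats
    return total_seconds / waits * 1000.0


# How much each dimension grew on the first scan after setting KEEP.
count_ratio = keep_scan[0] / default_scan[0]
total_ratio = keep_scan[2] / default_scan[2]
latency_ratio = avg_wait_ms(keep_scan) / avg_wait_ms(default_scan)

print(f"number of waits: {count_ratio:.2f}x")
print(f"average wait:    {avg_wait_ms(default_scan):.2f} ms -> "
      f"{avg_wait_ms(keep_scan):.2f} ms ({latency_ratio:.2f}x)")
print(f"total waited:    {total_ratio:.2f}x")
```

On these figures the KEEP scan issued about 1.6 times as many waits, each wait took roughly 2.7 times as long on average (about 2.2 ms versus 0.8 ms), and the total time waited was about 4.4 times higher.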

Looking at the raw trace file didn’t help – it just shows a bunch of lines like this, with only a small number (3 in this case) of unique cellhash values… I couldn’t see a pattern:

WAIT #… : nam='cell smart table scan' ela= 678 cellhash#=398250101 p2=0 p3=0 obj#=139207 tim= …

I’m at a loss to understand why there would be such a high penalty for the initial smart scan with the CELL_FLASH_CACHE KEEP setting. You expect some overhead from constructing and storing the result set blocks in the cache, but an IO penalty of 200-300% seems way too high. Has anybody seen anything like this, or does anyone have a clear explanation?

Test script is here, and formatted tkprof here

Update on Thursday, January 2, 2014 at 9:15PM by Guy Harrison

@kevinclosson was kind enough to gently correct my embarrassing misconception about how the Exadata Smart Flash Cache (ESFC) deals with smart scans. Somehow I had got the impression that the ESFC was storing parts of the smart scan results - kind of like the result set cache. I really can't remember where I got that impression but, as Kevin pointed out, it's completely incorrect. What is being cached are the underlying table blocks - and you can clearly see this using LIST FLASHCACHECONTENT and looking at the flash hit rates for other queries.

So the overhead is easier to understand, since we are looking at placing all the blocks for a reasonably large table into the flash cache. This involves a large number of write operations over a very short period of time, which in turn probably requires some on-the-fly garbage collection and significant write amplification.

The conclusion to draw is probably NOT to set CELL_FLASH_CACHE KEEP for a table of significant size unless you know for sure that it will be re-scanned within a short period of time. If the blocks age out of the cache and have to be reloaded you'll probably see a degradation rather than an improvement in scan performance.

Guy Harrison | 2 Comments | tagged Exadata, Oracle, ssd in Oracle

Redo on SSD: effect of redo size (Exadata)

Tuesday, November 26, 2013 at 3:31PM

Of all the claims I make about SSD for Oracle databases, the one that generates the most debate is that placing redo logs on SSD is not likely to be effective. I’ve published data to that effect; in particular, see Using SSD for redo on Exadata - pt 2 and 04 Evaluating the options for Exploiting SSD.

I get a lot of pushback on these findings – often on theoretical grounds from Flash vendors (“our SSD use advanced caching and garbage collection that support high rates of sequential IO”) or from people who say that they’ve used flash for redo and it “worked fine”.

Unfortunately, every single test I do comparing the performance of redo on flash and HDD shows flash with little or no advantage, and in some cases with a clear disadvantage.

One argument for flash SSD that I’ve heard is that while flash might not have the advantage for the small transactions I use for testing, it would work better for “big” redo writes – such as those associated with LOB updates. The idea is that the overhead of garbage collection and free page pool processing is less with big writes since you don’t hit the same flash SSD pages in rapid succession as you would with smaller writes. On the other hand, a reader who knows more about flash than I do (flashdba.com) recently commented: “in foreground garbage collection a larger write will require more pages to be erased, so actually will suffer from even more performance issues.”

It’s taken me a while to get around to testing this, but I tried on our Exadata X-2 recently with a test that generates a variable amount of redo and then commits. The relationship between the size of the redo and redo log sync time is shown below

I’m now putting on my flame retardant underwear in anticipation of some dispute over this data…. but, this suggests that while SSD and HDD (at least on Exadata) are about at parity for small writes, flash degrades much more steeply than HDD as the size of the redo entry increases. Regardless of whether the redo is on flash or HDD, there’s a break at the 1MB point which corresponds to the log buffer flush threshold. When a redo entry is only slightly bigger than 1MB then the chances are high that some of it will have been flushed already – see Redo log sync time vs redo size for a discussion of this phenomenon.

The SSD redo files were on an ASM disk group carved out of the Exadata flash disks - see Configuring Exadata flash as grid disk to see how I created these. Also, the redo logs were created with a 4K blocksize as outlined in Using SSD for redo on Exadata - pt 2. The database was in NOARCHIVELOG mode. Smart flash logging was disabled. As far as I can determine, there was no other significant activity on the flash disks (the grid disks were supporting all the database tablespaces, so if anything the SSD had the advantage).

Why are we seeing such a sharp dropoff in performance for the SSD as the redo write increases in size? Well, one explanation was given by flashdba in this comment thread. It has to do with understanding what happens when a write IO which modifies an existing block hits a flash SSD. I tried to communicate my limited understanding of this process in Fundamentals of Flash SSD Technology. Instead of erasing the existing page, the flash controller will pull a page off a “free list” of pages and mark the old page as invalid. Later on, the garbage collection routines will reorganize the data and free up invalid pages. In this case, it’s possible that no free blocks were available because garbage collection fell behind during the write-intensive workload. The more blocks written by LGWR, the more SSD pages had to be erased during these un-optimized writes, and therefore the larger the redo log write, the worse the performance of the SSD.
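
To make that explanation concrete, here is a toy model in Python. It is a sketch only: the timings and the function name are made up for illustration, it treats erase as a per-page cost in the same simplified way as the paragraph above, and it does not describe how the F20 cards or any real flash controller behave.

```python
# Hypothetical per-page timings, chosen only to show the shape of the effect.
PROGRAM_MS = 0.2   # time to program a page taken from the free list
ERASE_MS = 2.0     # extra time when a page must be erased before reuse


def write_latency_ms(pages_to_write, free_pages):
    """Latency of one write under a very simple free-list model.

    Pages served from the free list cost only a program operation.
    Once the free list is exhausted, every further page pays for an
    erase in the foreground before it can be programmed.
    """
    fast_pages = min(pages_to_write, free_pages)
    slow_pages = pages_to_write - fast_pages
    return fast_pages * PROGRAM_MS + slow_pages * (ERASE_MS + PROGRAM_MS)


# Assume garbage collection has fallen behind and only 100 pages are free.
FREE_PAGES = 100
for pages in (10, 50, 100, 200, 400):
    total = write_latency_ms(pages, FREE_PAGES)
    print(f"{pages:4d} pages: {total:7.1f} ms total, {total / pages:.2f} ms per page")
```

In this model small writes fit inside the free list and stay cheap per page, while writes larger than the free list get more expensive per page the bigger they are, which is the general shape of the result described above.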

Any other theories and/or observations?

I hope soon to have a Dell system with Dell express flash so that I can repeat these tests on a non-Exadata system. The F20 cards used in my X-2 are not state of the art, so it’s possible that different results could be obtained with a more recent flash card, or with a less contrived workload.

However, yet again I’m gathering data that suggests that using flash for redo logs is not worthwhile. I’d love to argue the point but even better than argument would be some hard data in either direction….

Guy Harrison | 1 Comment | tagged Exadata, Oracle, ssd

Exadata Write-back cache and free buffer waits

Friday, November 22, 2013 at 8:49AM

Prior to storage server software version 11.2.3.2.0 (associated with Exadata X3), Exadata Smart Flash Cache was a “write-through” cache, meaning that write operations are applied both to the cache and to the underlying disk devices, but are not signalled as complete until the IO to the disk has completed.

Starting with 11.2.3.2.0 of the Exadata storage software[1], Exadata Smart Flash Cache may act as a write-back cache. This means that a write operation is made to the cache initially and de-staged to grid disks at a later time. This can be effective in improving the performance of an Exadata system that is subject to IO write bottlenecks on the Oracle datafiles.

Writes to datafiles generally happen as a background task in Oracle, and most of the time we don’t actually “wait” on these IOs. That being the case, what advantage can we expect if these writes are optimized? To understand the possible advantages of the write-back cache let’s review the nature of datafile write IO in Oracle and the symptoms that occur when write IO becomes the bottleneck.

When blocks in the buffer cache are modified, it is the responsibility of the database writer (DBWR) to write these “dirty” blocks to disk. The DBWR does this continuously and uses asynchronous IO processing, so generally sessions do not have to wait for the IO to occur – the only time sessions wait directly on write IO is when a redo log sync occurs following a COMMIT.

However, should all the buffers in the buffer cache become dirty then a session may wait when it wants to bring a block into the cache – resulting in a “free buffer wait”.

Free buffer waits can occur in update-intensive workloads when the IO bandwidth of the Oracle sessions reading into the cache exceeds the IO bandwidth of the database writer. Because the database writer uses asynchronous parallelized write IO, and because all processes concerned are accessing the same files, free buffer waits usually happen when the IO subsystem can service reads faster than it can service writes.

There exists just such an imbalance between read and write latency in Exadata X2 – the Exadata Smart Flash Cache accelerates reads by a factor of perhaps 4-10 times, while offering no comparable advantage for writes. As a result, a very busy Exadata X2 system could become bottlenecked on free buffer waits. The Exadata Smart Flash Cache write-back cache provides acceleration to datafile writes as well as reads and therefore reduces the chance of free buffer wait bottlenecks.
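
The mechanism can be sketched with a small simulation in Python. This is a deliberately crude model with made-up numbers: it assumes every buffer a session acquires becomes dirty, that DBWR cleans a fixed number of buffers per tick, and that a session which cannot get a free buffer simply records a wait. None of the names or rates come from Oracle or from the test workload.

```python
def simulate(cache_buffers, wanted_per_tick, written_per_tick, ticks):
    """Count buffer requests that found no free buffer.

    cache_buffers    -- size of the (hypothetical) buffer cache
    wanted_per_tick  -- buffers the sessions want to acquire and dirty per tick
    written_per_tick -- dirty buffers the database writer can clean per tick
    """
    dirty = 0
    free_buffer_waits = 0
    for _ in range(ticks):
        # The database writer cleans what it can.
        dirty = max(0, dirty - written_per_tick)
        # Sessions take free buffers; any shortfall is a free buffer wait.
        free = cache_buffers - dirty
        granted = min(wanted_per_tick, free)
        free_buffer_waits += wanted_per_tick - granted
        dirty += granted
    return free_buffer_waits


# Writes much slower than the rate at which sessions dirty buffers.
slow_writes = simulate(cache_buffers=1000, wanted_per_tick=100,
                       written_per_tick=25, ticks=100)
# Writes accelerated until they keep pace with the sessions.
fast_writes = simulate(cache_buffers=1000, wanted_per_tick=100,
                       written_per_tick=100, ticks=100)
print("free buffer waits with slow writes:", slow_writes)
print("free buffer waits with fast writes:", fast_writes)
```

With the slow writer the cache fills with dirty buffers after a few ticks and most requests then wait; once the writer keeps pace with the sessions the waits disappear, which is the effect the write-back cache is aiming for.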

The figure below illustrates the effectiveness of the write-back cache for workloads that encounter free buffer waits. The workload used to generate this data was heavily write-intensive with very little read IO overhead (all the necessary read data was in cache). As a result, it experienced a very high degree of free buffer waits and some associated buffer busy waits. Enabling the write-back cache completely eliminated the free buffer waits by effectively accelerating the write IO bandwidth of the database writer. As a result, throughput increased fourfold.

However, don’t be misled into thinking that the write-back cache will be a silver bullet for any workload. Workloads that are experiencing free buffer waits are likely to see this sort of performance gain. Workloads where the dominant waits are for CPU, read IO, global cache co-ordination, log writes and so on will be unlikely to see any substantial benefit from the write-back cache.

[1] 11.2.3.2.1 is recommended as the minimum version for this feature as it contains fixes to significant issues discovered in the initial release.

Guy Harrison | 5 Comments | tagged Exadata, Oracle, ssd

Redo log sync time vs redo size

Tuesday, September 17, 2013 at 1:49PM

It’s been tough to find time to do actual performance research of late, but I have managed to get a test system prepared that will allow me to determine if Solid State disks offer some performance advantage over spinning disks when the redo entries are very large. This is to test the theory that the results I’ve published in the past (here and here for instance) actually apply only when the redo entries are relatively small. For small sequential writes to SSD, each successive write will invoke an erase of a complete NAND page, whereas in a larger sequential write this will not occur since each write will hit different pages.

I’m still setting up the test environment to look at this, but first I thought it would be worth showing this pretty picture:

This chart shows how redo log sync time (i.e., the time taken to COMMIT) varies with the amount of redo information written since the last COMMIT. There is a slight overall upwards trend, but the really noticeable trend is the “sawtooth” effect, which I’ve highlighted below:

Can you guess what causes this?

I think it’s pretty clear that we are seeing the effect of redo buffer flushing. Remember, when you write redo entries, they are written to the redo buffer (or sometimes a strand). Oracle flushes the buffer when you commit, but also flushes it when it is 1/3rd full, after 3 seconds (I think from memory) or after 1MB of redo entries. Given that, we can see what happens when we commit:

If there has been no redo log flush since we started writing, we have to wait while LGWR writes all the entries to disk.
If a redo log flush occurs after we have written our entries but before we COMMIT, then we have to write virtually nothing (but a COMMIT marker I suppose).
Between the two scenarios, we may have to write some of our redo log entries. However, we should never have to write more than about 1MB.

In the chart above we can clearly see the redo log flushes occurring at 1MB intervals. If we write less than 1MB we generally have to write it all, above 1MB we only have to write a portion of the redo entry. Note that on this system, I was pretty much the only session doing significant activity, so the pattern is very clear. On a busy system the effect would be randomized by others causing flushes to occur.
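
The sawtooth can be reproduced with a very small model in Python. It is a sketch under simplifying assumptions: a single session, a background flush that fires (and completes) every time 1MB of redo has accumulated, and no 1/3-full or timeout flushes. The function name and the sample sizes are mine.

```python
MB = 1024 * 1024


def redo_pending_at_commit(redo_bytes, flush_threshold=MB):
    """Redo still unwritten when the session commits.

    Assumes a background flush has already written every complete
    flush_threshold-sized chunk, so only the remainder is left for
    LGWR to write while the session waits on the commit.
    """
    return redo_bytes % flush_threshold


for redo_kb in (256, 768, 1024 + 100, 1024 + 900, 2560):
    pending = redo_pending_at_commit(redo_kb * 1024)
    print(f"{redo_kb:5d} KB generated -> {pending // 1024:4d} KB left to write at COMMIT")
```

Below 1MB the amount left to write grows with the redo generated; just past each 1MB boundary it drops back to nearly nothing and starts climbing again, which gives the sawtooth shape.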

Hopefully I’ll soon be able to compare HDD and SSD performance to see if there are any significant differences in these trends – the above data was generated by redo on SSD.

Guy Harrison | 2 Comments | 1 Reference

Using GET DIAGNOSTICS in MySQL 5.6

Friday, July 5, 2013 at 4:15PM

When Steven and I wrote MySQL Stored Procedure Programming, our biggest reservation about the new stored procedure language was the lack of support for proper error handling. The lack of the SIGNAL and RESIGNAL clauses prevented a programmer from raising an error that could be propagated properly through a call stack, and the lack of a general purpose exception handler which could examine error codes at run time led to awkward exception handling code at best, and poorly implemented error handling at worst.

In 5.4 MySQL implemented the SIGNAL and RESIGNAL clauses (see http://guyharrison.squarespace.com/blog/2009/7/13/signal-and-resignal-in-mysql-54-and-60.html), which corrected half of the problem. Now finally, MySQL 5.6 implements the ANSI GET DIAGNOSTICS clause and we can write a general catch-all exception handler.

Here’s an example:

The exception handler is on lines 10-27. It catches any SQL exception, then uses the GET DIAGNOSTICS clause to fetch the SQLSTATE, MySQL error code and message into local variables. We then decide what to do for anticipated errors – duplicate or badly formed product codes – and SIGNAL a more meaningful application error. Unexpected errors are RESIGNALed on line 24.

This is a major step forward in maturity for MySQL stored procedures – the lack of a means to programmatically examine error codes made proper error handling difficult or impossible.

Thanks to Ernst Bonat of www.evisualwww.com for helping me work through the usage of GET DIAGNOSTICS.

Guy Harrison | 1 Comment | tagged mysql