Redo on SSD: effect of redo size (Exadata)
Of all the claims I make about SSD for Oracle databases, the one that generates the most debate is that placing redo logs on SSD is not likely to be effective. I’ve published data to that effect; in particular, see Using SSD for redo on Exadata - pt 2 and 04 Evaluating the options for Exploiting SSD.
I get a lot of pushback on these findings – often on theoretical grounds from flash vendors (“our SSDs use advanced caching and garbage collection that support high rates of sequential IO”) or from people who say that they’ve used flash for redo and it “worked fine”.
Unfortunately, every test I run comparing the performance of redo on flash and on HDD shows flash offering little or no advantage, and in some cases a clear disadvantage.
One argument for flash SSD that I’ve heard is that while flash might not have the advantage for the small transactions I use in testing, it would do better for “big” redo writes – such as those associated with LOB updates. The idea is that the overhead of garbage collection and free-page-pool processing is lower for big writes, since you don’t hit the same flash SSD pages in rapid succession as you do with a stream of small writes. On the other hand, a reader who knows more about flash than I do (flashdba.com) recently commented: “in foreground garbage collection a larger write will require more pages to be erased, so actually will suffer from even more performance issues.”
It’s taken me a while to get around to testing this, but I recently tried it on our Exadata X-2 with a test that generates a variable amount of redo and then commits. The relationship between the size of the redo and redo log sync time is shown below.
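For the curious, the harness amounted to something like the block below. This is a sketch of the kind of test I mean, not my exact script – the table name, column sizes and loop bounds are all illustrative. The redo actually generated per commit can be confirmed from the “redo size” statistic in V$MYSTAT, and the commit waits from the “log file sync” event in V$SESSION_EVENT.

```sql
-- Sketch only: generate a variable amount of redo, then commit.
-- Table redo_test (id NUMBER, data VARCHAR2(32767)) is hypothetical.
DECLARE
   payload VARCHAR2(32767);
BEGIN
   FOR size_kb IN 1 .. 31 LOOP
      payload := RPAD('x', size_kb * 1024, 'x');
      UPDATE redo_test SET data = payload WHERE id = 1;
      COMMIT WRITE IMMEDIATE WAIT;   -- force a synchronous redo write
   END LOOP;
END;
/
-- Snapshot "redo size" (V$MYSTAT) and "log file sync" (V$SESSION_EVENT)
-- before and after the run to get redo volume vs. sync time.
```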
I’m now putting on my flame-retardant underwear in anticipation of some dispute over this data… but it suggests that while SSD and HDD (at least on Exadata) are roughly at parity for small writes, flash degrades much more steeply than HDD as the size of the redo entry increases. Regardless of whether the redo is on flash or HDD, there’s a break at the 1MB point, which corresponds to the log buffer flush threshold: when a redo entry is only slightly bigger than 1MB, the chances are high that some of it will already have been flushed – see Redo log sync time vs redo size for a discussion of this phenomenon.
The SSD redo files were on an ASM disk group carved out of the Exadata flash disks – see Configuring Exadata flash as grid disk for how I created these. The redo logs were created with a 4K blocksize, as outlined in Using SSD for redo on Exadata - pt 2. The database was in NOARCHIVELOG mode and smart flash logging was disabled. As far as I can determine, there was no other significant activity on the flash disks (the grid disks were supporting all the database tablespaces, so if anything the SSD had the advantage).
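For reference, the setup amounted to roughly the following. The disk group and disk-string names are illustrative, not my actual ones, and – as discussed in the earlier post – creating redo logs with a 4K blocksize on storage that doesn’t report 4K sectors may require the "_disk_sector_size_override" parameter:

```sql
-- Illustrative only: ASM disk group over the Exadata flash grid disks
CREATE DISKGROUP ssd_redo NORMAL REDUNDANCY
   DISK 'o/*/flash*';

-- Allow a 4K redo blocksize on storage reporting 512-byte sectors
ALTER SYSTEM SET "_disk_sector_size_override" = TRUE SCOPE = SPFILE;

-- Redo logs with a 4K blocksize in that group
ALTER DATABASE ADD LOGFILE GROUP 10 ('+SSD_REDO') SIZE 1G BLOCKSIZE 4096;
```

Smart flash logging can be disabled on the storage cells via CellCLI (for example by dropping the flash log with DROP FLASHLOG), though the exact approach depends on the cell software version.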
Why are we seeing such a sharp drop-off in SSD performance as the redo write increases in size? One explanation was given by flashdba in this comment thread. It comes down to what happens when a write IO that modifies an existing block hits a flash SSD – I tried to communicate my limited understanding of this process in Fundamentals of Flash SSD Technology. Instead of erasing the existing page in place, the flash controller pulls a page off a “free list” and marks the old page as invalid; later on, garbage collection routines reorganize the data and free up the invalid pages. In this case, it’s possible that the free list was exhausted because garbage collection fell behind during the write-intensive workload. The more data written by LGWR, the more flash blocks had to be erased in the foreground during these un-optimized writes – and therefore the larger the redo log write, the worse the SSD performed.
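To make the foreground-garbage-collection explanation concrete, here is a toy model – entirely my own assumption, with made-up timings; it does not model the F20 or any real device. Background GC replenishes a small free-page pool at a fixed rate, and any write that needs more pages than are currently free must erase blocks inline, paying the erase cost in its own latency:

```python
PAGE_WRITE_US = 100     # assumed page program time (us) - illustrative only
BLOCK_ERASE_US = 2000   # assumed block erase time (us) - illustrative only
PAGES_PER_BLOCK = 64

def write_latency_us(pages_needed, free_pages, gc_refill=8):
    """Latency of one write, given the current free-page pool.

    Returns (latency_us, free_pages_after). If the write needs more
    pages than background GC has freed, blocks must be erased in the
    foreground and the erase time is added to the write latency.
    """
    free = free_pages + gc_refill                  # background GC tops up the pool
    latency = pages_needed * PAGE_WRITE_US
    if pages_needed > free:
        shortfall = pages_needed - free
        erases = -(-shortfall // PAGES_PER_BLOCK)  # ceiling division
        latency += erases * BLOCK_ERASE_US         # foreground GC penalty
        free += erases * PAGES_PER_BLOCK
    return latency, free - pages_needed

# A small redo write fits within what background GC can free;
# a big one overruns the pool and pays for foreground erases.
small, _ = write_latency_us(pages_needed=4, free_pages=16)    # no erase needed
big, _   = write_latency_us(pages_needed=256, free_pages=16)  # erases required
```

In this model the small write costs a flat amount per page, while the large write picks up an extra erase penalty once the free pool is exhausted – qualitatively the behaviour flashdba describes, with per-write latency growing faster than linearly in write size.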
Any other theories and/or observations?
I hope soon to have a Dell system with Dell Express Flash so that I can repeat these tests on a non-Exadata system. The F20 cards used in my X-2 are not state of the art, so it’s possible that different results could be obtained with a more recent flash card, or with a less contrived workload.
However, yet again I’m gathering data that suggests using flash for redo logs is not worthwhile. I’d love to argue the point, but even better than argument would be some hard data in either direction…
Reader Comments (1)
When using SSD for redo log writes:
1. You need a dedicated device for it.
2. You need more over-provisioning, i.e. more space reserved for garbage collection (erases).
3. You need a more stable flash device from the vendor – for example, Fusion-io.