Redo log sync time vs redo size
It’s been tough to find time to do actual performance research of late, but I have managed to get a test system prepared that will allow me to determine whether solid state disks (SSDs) offer a performance advantage over spinning disks when the redo entries are very large. This is to test the theory that the results I’ve published in the past (here and here, for instance) actually apply only when the redo entries are relatively small. For small sequential writes to SSD, each successive write will invoke an erase of a complete NAND page, whereas in a larger sequential write this will not occur since each write will hit different pages.
I’m still setting up the test environment to look at this, but first I thought it would be worth showing this pretty picture:
This chart shows how redo log sync time (i.e., the time taken to COMMIT) varies with the amount of redo information written since the last COMMIT. There is a slight overall upwards trend, but the really noticeable feature is the “sawtooth” effect, which I’ve highlighted below:
Can you guess what causes this?
I think it’s pretty clear that we are seeing the effect of redo buffer flushing. Remember, when you write redo entries, they are written to the redo buffer (or sometimes a strand). Oracle flushes the buffer when you commit, but also flushes it when it is one-third full, after 3 seconds (I think, from memory), or after 1MB of redo entries have accumulated. Given that, we can see what happens when we commit:
- If there has been no redo log flush since we started writing, we have to wait while LGWR writes all the entries to disk
- If a redo log flush occurs after we have written our entries but before we COMMIT, then we have to write virtually nothing (other than a commit marker, I suppose)
- Between the two scenarios, we may have to write some of our redo log entries; however, we should never have to write more than about 1MB (the toy model below illustrates this)
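To make the sawtooth concrete, here’s a toy model of the flush behaviour. This is my own sketch in Python, not anything Oracle publishes, and it assumes (simplistically) that the 1MB trigger is the only background flush that fires:

```python
# Toy model: how much redo a COMMIT has to wait for, assuming the
# only background flush trigger is the (assumed) 1MB threshold.
# Real LGWR behaviour is more complicated (one-third-full and
# 3-second triggers, piggy-backed commits, strands, etc.).

FLUSH_THRESHOLD = 1024 * 1024  # 1MB background flush trigger

def redo_to_sync(redo_written: int) -> int:
    """Bytes still unflushed at COMMIT time, given `redo_written`
    bytes of redo generated since the last COMMIT."""
    # Everything up to the most recent 1MB boundary has already been
    # flushed in the background; only the remainder is left for our
    # log file sync to wait on.
    return redo_written % FLUSH_THRESHOLD

if __name__ == "__main__":
    KB = 1024
    for kb in (100, 900, 1024, 1100, 2000, 2048, 2100):
        print(f"{kb:>5} KB written -> sync waits on ~{redo_to_sync(kb * KB) // KB} KB")
```

Plotting redo_to_sync against redo_written gives exactly the kind of sawtooth visible in the chart: the wait climbs with redo volume, then drops to near zero each time a background flush beats the COMMIT to it.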
In the chart above we can clearly see the redo log flushes occurring at 1MB intervals. If we write less than 1MB we generally have to write it all; above 1MB, we only have to write a portion of the redo entry. Note that on this system I was pretty much the only session doing significant activity, so the pattern is very clear. On a busy system the effect would be randomized by other sessions causing flushes to occur.
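For anyone who wants to collect this kind of data themselves, a measurement loop along the following lines would do it. This is a hedged sketch using the python-oracledb driver; the connection details, the scratch table TXN_TEST, and the row counts are all made up for illustration, and reading V$MYSTAT requires the appropriate privileges:

```python
# Sketch: measure redo generated per transaction vs COMMIT time.
import time
import oracledb

conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb")
cur = conn.cursor()

def redo_size() -> int:
    """Session-level 'redo size' statistic from V$MYSTAT."""
    cur.execute("""
        SELECT m.value
          FROM v$mystat m JOIN v$statname n ON n.statistic# = m.statistic#
         WHERE n.name = 'redo size'""")
    return cur.fetchone()[0]

for rows in (10, 100, 1000, 10000):  # vary the redo volume per transaction
    before = redo_size()
    cur.executemany("INSERT INTO txn_test VALUES (:1, :2)",
                    [(i, "x" * 200) for i in range(rows)])
    redo_written = redo_size() - before
    t0 = time.perf_counter()
    conn.commit()  # the log file sync happens here
    sync_ms = (time.perf_counter() - t0) * 1000
    print(f"redo={redo_written:>10} bytes  commit={sync_ms:8.3f} ms")
```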
Hopefully I’ll soon be able to compare HDD and SSD performance to see if there are any significant differences in these trends – the above data was generated with redo on SSD.
Reader Comments (2)
Guy, I'm struggling to understand what you mean by this statement:
"For small sequential writes to SSD, each successive write will invoke an erase of a complete NAND page, whereas in a larger sequential write this will not occur since each write will hit different pages."
Any write to an SSD will require an available empty page in which the data can be stored. Unless the SSD is brand new, it's likely that the page will need to be erased first. However, every SSD I know of performs page erasing as a background garbage collection process. There are therefore two situations to consider.
Firstly, there is the situation where garbage collection is able to keep pace with the rate of change, i.e. there are always erased, empty pages available for data to be written into. In this case it doesn't matter what size the redo writes are, as we have no performance degradation.
The second case, which I believe you are referring to when you say "each successive write will invoke an erase of a complete NAND page", is that of foreground garbage collection, where the writing process has to wait until the target page is erased. This happens when background garbage collection is unable to keep up with the rate of change and results in a steep decline in performance (the so-called "write cliff"). SSDs and PCIe flash cards are more prone to this situation than all-flash arrays because they are effectively silos of flash.
The bit I don't understand from your article is, "in a larger sequential write this will not occur since each write will hit different pages". In background garbage collection we have no issues so the size of the write is not relevant, while in foreground garbage collection a larger write will require more pages to be erased, so actually will suffer from even more performance issues.
What am I missing?
OK, this is just a theory, so reader beware.
The theory is that when you do lots of small sequential writes (really tiny redo log file sync operations, for instance), you'll hit the same NAND page multiple times within a very short time interval. Using the normal erase-prevention routines would require that the controller pull a fresh page off the free list for each of those operations, eventually exhausting the free list and hitting a cliff.
If the entries are large, then the rate at which any given page is re-hit will be much lower; you won't touch the same page again until you've cycled through all of your redo log groups. That would suggest a lower rate of GC and less chance of hitting a cliff.
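To put some entirely made-up numbers on this, assume a 16KB NAND page and compare how often a given page gets rewritten under small versus large writes before the log groups wrap around. This is just arithmetic illustrating the theory, not a measurement:

```python
# Back-of-envelope illustration of the theory above. The 16KB NAND
# page size is an assumption; real page sizes vary by device.
PAGE = 16 * 1024  # assumed NAND page size in bytes

for write_size in (512, 4096, 1024 * 1024):
    if write_size < PAGE:
        # Small writes: several consecutive writes land in the same
        # page, and since NAND can't be rewritten in place, each one
        # forces the controller to relocate the page's contents.
        programs_per_page = PAGE // write_size
    else:
        # Large writes: each page is written once and not touched
        # again until the redo log groups wrap around.
        programs_per_page = 1
    print(f"write={write_size:>8} B -> ~{programs_per_page} program ops per page per pass")
```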
HOWEVER, my most recent performance tests don't bear this out, so my theory is probably bunk :-)
But thanks for the comment; it really helps me think about this...