
Wednesday, July 7, 2010

The Flash cache KEEP option

Oracle’s database flash cache (AKA “database smart flash cache”) supports selective caching at the table level via the FLASH_CACHE setting in the storage clause.

Settings can be DEFAULT, KEEP or NONE. For example:

[Screenshot: ALTER TABLE ... STORAGE (FLASH_CACHE ...) statements setting KEEP, DEFAULT and NONE on three test tables]

The word KEEP is a bit unfortunate, since it perhaps implies behaviour identical to the KEEP pool of the buffer cache.  The buffer cache KEEP pool is a separately managed area of a specific size, so you can control what proportion of your memory is allocated to “kept” blocks.  The flash cache KEEP setting, by contrast, merely prioritises caching – there seems to be no limit on how many blocks will be kept.

For the above three tables, if we read all the rows in each more or less simultaneously, we get the following outcomes (shown here in our lovely new Spotlight on Oracle flash cache drilldown):

[Chart: flash cache usage for each of the three tables]

Not surprisingly, where FLASH_CACHE is NONE, there is no flash cache storage.  When FLASH_CACHE is KEEP, all the blocks are kept in the flash cache, and consequently there is not enough room for all the blocks from tables where FLASH_CACHE=DEFAULT.  However, unlike the KEEP buffer pool, there’s no way to limit the number of blocks in the KEEP pool, so you can quite easily fill up the cache with blocks from FLASH_CACHE=KEEP tables.  Once that happens, those blocks will stay there forever.

If you have Spotlight (or at least the 7.5 version, about to be released with flash cache diagnostics), you can see this happening.  Once the flash cache fills up, every new block introduced forces an old block out.  This shows up as evictions in the following chart:

[Chart: flash cache evictions over time]

Oracle clearly evicts blocks from FLASH_CACHE=DEFAULT tables in preference to FLASH_CACHE=KEEP.  Presumably (I haven't tested this) within each setting blocks are moved out using some form of Least Recently Used (LRU) or Not Recently Used algorithm. 
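
To make that eviction behaviour concrete, here is a minimal Python sketch of a two-tier LRU cache.  It is a toy model of the behaviour observed above, not Oracle's actual implementation; the fixed block capacity and the strict “evict DEFAULT before KEEP” rule are assumptions drawn from the observations.

```python
from collections import OrderedDict

class TwoTierFlashCache:
    """Toy model of the observed behaviour (not Oracle's implementation):
    blocks from FLASH_CACHE=DEFAULT tables are evicted in preference to
    blocks from FLASH_CACHE=KEEP tables, and within each tier the least
    recently used block goes first."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.tiers = {"DEFAULT": OrderedDict(), "KEEP": OrderedDict()}

    def cache_block(self, block_id, flash_cache_setting):
        if flash_cache_setting == "NONE":
            return                              # never cached in flash
        tier = self.tiers[flash_cache_setting]
        tier.pop(block_id, None)                # re-insert at the MRU end
        tier[block_id] = True
        while sum(len(t) for t in self.tiers.values()) > self.capacity:
            # Prefer a DEFAULT victim; fall back to KEEP only when the
            # DEFAULT tier is empty -- which is how KEEP tables can end
            # up monopolising the flash cache.
            victim_tier = self.tiers["DEFAULT"] or self.tiers["KEEP"]
            victim_tier.popitem(last=False)     # evict the LRU block
```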

KEEP should therefore be used only for relatively small tables.  If you set FLASH_CACHE=KEEP for a large table (larger than the flash cache, for instance), you risk rendering the flash cache unusable for other tables.   I hope Oracle chooses to implement a limit on KEEP size in a subsequent release.  It would be far more useful if you could allocate multiple flash cache pools of specific sizes, as you can with the KEEP and RECYCLE pools in the buffer cache.

 

 

Spotlight 7.5 – which includes flash cache diagnostics as well as ESX diagnostics, support for the 11gR2 parallel query queue and lots of other cool stuff – will be available next month, both on its own and as part of the TOAD DBA suite.

Sunday, June 13, 2010

Consistency models in Non-relational Databases

One of the most significant differences between the new generation of non-relational (AKA NoSQL) databases and the traditional RDBMS is the way in which consistency of data is handled.  In a traditional RDBMS, all users see a consistent view of the data.  Once a user commits a transaction, all subsequent queries will reflect that transaction, and no one will see the partial results of a transaction.

RDBMS transactions are typically described as “ACID” transactions.  That is, they are:

  • Atomic: The transaction is indivisible – either all the statements in the transaction are applied to the database, or none are.
  • Consistent: The database remains in a consistent state before and after transaction execution.
  • Isolated: While multiple transactions can be executed by one or more users simultaneously, one transaction should not see the effects of other concurrent transactions.
  • Durable: Once a transaction is saved to the database (an action referred to in database programming circles as a commit), its changes are expected to persist.

As databases become distributed across multiple hosts, maintaining ACID consistency becomes increasingly difficult.  In a transaction that spans multiple independent databases, complex two-phase commit protocols must be employed.  In the case of a truly clustered distributed database, even more complex protocols are required, since the state of data in memory, in the transaction logs and in the data files must all be kept consistent (Cache Fusion in Oracle RAC, for instance).

CAP Theorem:  You can’t have it all

 

In 2000, Eric Brewer outlined the CAP (AKA Brewer’s) theorem.   Simplistically, the CAP theorem says that in a distributed database system, you can have at most two of the following three characteristics:

  • Consistency: All nodes in the cluster see exactly the same data at any point in time
  • Availability: Failure of a node does not render the database inoperative
  • Partition tolerance:  Nodes can still function when communication with other groups of nodes is lost


Interpretation and implementations of CAP theorem vary,  but most of the NoSQL database system architectures favour partition tolerance and availability over strong consistency.

Eventual Consistency


A compromise between strong consistency and weak (no guarantees) consistency is eventual consistency.

The core of the eventual consistency concept is that although the database may have some inconsistencies at a point in time, it will eventually become consistent should all updates cease.  That is,  inconsistencies are transitory:  eventually all nodes will receive the latest consistent updates.

BASE – Basically Available, Soft-state, Eventually consistent – is an acronym used to contrast this approach with the ACID transactions of the RDBMS described above.

Not all implementations of eventual consistency are equal.  In particular, an eventually consistent database may also elect to provide the following:

  • Causal consistency:  This involves a signal being sent between application sessions indicating that a change has occurred.  From that point on, the receiving session will always see the updated value.
  • Read your own writes:  In this mode of consistency, a session that performs a change to the database will immediately see that change, even if other sessions experience a delay.
  • Monotonic consistency:  In this mode, a session will never see data revert to an earlier point in time.  Once we read a value, we will never see an earlier value.  (A sketch of the last two modes follows this list.)
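
To make the last two modes concrete, here is a small Python sketch of a client session that remembers the newest version it has observed.  The Replica class and the explicit version numbers are illustrative assumptions, not any particular product's API.

```python
import random

class Replica:
    """A toy replica that may lag behind the most recent update."""
    def __init__(self):
        self.version, self.value = 0, None


class Session:
    """Client-side sketch of read-your-own-writes and monotonic reads: the
    session tracks the newest version it has seen and refuses to accept
    anything older from a lagging replica."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.high_water_mark = 0          # newest version seen by this session

    def write(self, value, new_version):
        # Apply the write to one replica; the others are assumed to catch
        # up asynchronously in the background.
        target = random.choice(self.replicas)
        target.version, target.value = new_version, value
        self.high_water_mark = new_version        # read-your-own-writes

    def read(self):
        # Try replicas in a random order, as if routed to arbitrary nodes,
        # but never accept a version older than one we have already seen.
        for replica in random.sample(self.replicas, len(self.replicas)):
            if replica.version >= self.high_water_mark:
                self.high_water_mark = replica.version   # monotonic reads
                return replica.value
        raise RuntimeError("no sufficiently fresh replica yet; retry later")
```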

 

The NRW notation

NRW notation describes at a high level how a distributed database will trade off consistency, read performance and write performance.  NRW stands for:

  • N: the number of copies of each data item that the database will maintain. 
  • R: the number of copies that the application will access when reading the data item 
  • W: the number of copies of the data item that must be written before the write can complete.  

When W=N, the database will always write every copy before returning control to the client – this is more or less what traditional databases do when implementing synchronous replication.   If you are more concerned about write performance than read performance, you can set W=1, R=N.  Then each read must access all copies to determine which is correct, but each write only has to touch a single copy of the data.

Most NoSQL databases use N>W>1:  more than one write must complete, but not all nodes need to be updated immediately.   You can increase the level of consistency in roughly three stages:

  1. If R=1, then the database will accept whatever value it reads first.  This might be out of date if not all updates have propagated through the system.   
  2. If R>1, then the database will read more than one value and pick the most recent (or otherwise “correct”) value.
  3. If W+R>N, then a read will always retrieve the latest value, although it may be mixed in with “older” values.  In other words, the number of copies you write and the number of copies you read is high enough to guarantee that you’ll always have at least one copy of the latest version in your read set.   This is sometimes referred to as quorum assembly, and is summarised in the table and the sketch below.
     
NRW configuration    Outcome
W=N, R=1             Read-optimized strong consistency
W=1, R=N             Write-optimized strong consistency
W+R<=N               Weak eventual consistency: a read might not see the latest update
W+R>N                Strong consistency through quorum assembly: a read will see at least one copy of the most recent update
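
To illustrate the quorum rule, here is a small Python sketch.  The Replica class, the explicit version numbers and the random choice of read and write sets are assumptions for illustration; real systems add partitioning, hinted handoff, conflict resolution and so on.

```python
import random

class Replica:
    def __init__(self):
        self.version, self.value = 0, None
    def write(self, version, value):
        if version > self.version:              # ignore stale updates
            self.version, self.value = version, value

N = 3                                           # copies of each data item

def write_quorum(replicas, w, version, value):
    # Synchronously update W of the N copies; the remaining copies are
    # assumed to catch up later (asynchronous replication / anti-entropy).
    for replica in random.sample(replicas, w):
        replica.write(version, value)

def read_quorum(replicas, r):
    # Read R copies and keep the value carrying the highest version.
    # If W + R > N, the read set must overlap the last write set, so the
    # newest version is guaranteed to be among the responses.
    responses = [(rep.version, rep.value) for rep in random.sample(replicas, r)]
    return max(responses, key=lambda pair: pair[0])[1]

nodes = [Replica() for _ in range(N)]
write_quorum(nodes, w=2, version=1, value="new value")
print(read_quorum(nodes, r=2))                  # always "new value": 2 + 2 > 3
```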

 

NoSQL databases generally try hard to be as consistent as possible, even when configured for weaker consistency.  For instance, the read repair algorithm is often implemented to improve consistency when R=1.  Although the application does not wait for all the copies of a data item to be read,  the database will read all known copies in the background after responding to the initial request.  If the application asks for the data item again, it will therefore see the latest version. 

Vector clocks

NoSQL databases can seem simplistic in some respects, but there are a lot of really clever algorithms going on behind the scenes.   For example, the vector clock algorithm can be used to ensure that updates are processed in order (monotonic consistency).

With vector clocks, each node participating in the cluster maintains a change number (or event count) similar to the System Change Number used in some RDBMSs.  The “vector” is a list including the current node's change number as well as the change numbers that have been received from other nodes.  When an update is transmitted, the vector is included with the update, and the receiving node compares that vector with the other vectors it has received to determine whether updates are arriving out of sequence.    Out-of-sequence updates can be held until the preceding updates appear.
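
Here is a minimal Python sketch of the comparison rule.  Representing the vector as a dictionary of per-node change numbers, and the node names, are assumptions for illustration.

```python
def happened_before(vc_a, vc_b):
    """True if update a causally precedes update b: every counter in a is
    <= the matching counter in b, and at least one is strictly smaller."""
    nodes = set(vc_a) | set(vc_b)
    return (all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes) and
            any(vc_a.get(n, 0) < vc_b.get(n, 0) for n in nodes))

def classify(vc_a, vc_b):
    if happened_before(vc_a, vc_b):
        return "a precedes b"      # safe to apply a before b
    if happened_before(vc_b, vc_a):
        return "b precedes a"      # b has arrived out of sequence; hold it
    return "concurrent"            # conflicting updates: resolution needed

# Example: node1's clock after two local events and one event seen from node2.
print(classify({"node1": 2, "node2": 1}, {"node1": 2, "node2": 2}))  # a precedes b
print(classify({"node1": 3, "node2": 1}, {"node1": 2, "node2": 2}))  # concurrent
```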

I found vector clocks hard to understand until I read the description in Operating Systems: Concurrent and Distributed Software Design  by Jean Bacon and Tim Harris (Addison-Wesley).

Amazon’s Dynamo

A lot of the eventually consistent concepts were best articulated by Amazon in Werner Vogels’ Eventually Consistent paper and in Amazon’s paper on the Dynamo eventually consistent key-value store.   Dynamo implements most of the ideas above, and is the inspiration for several well-known NoSQL datastores, including Voldemort and – together with Google’s BigTable specification – Cassandra.

 

 

 

 

Tuesday, June 8, 2010

Cloud (AKA NoSQL) Databases and me

I’ve been an RDBMS guy now since about 1988.    Around the time my first son Chris was born, the government department where I worked shifted from an ADABAS/MVS environment to an Oracle/VMS system, and I was dropped headfirst into the joys of Oracle 5.1.  Since then I’ve been continuously involved with Oracle development, administration and performance, as well as with Sybase, MySQL and SQL Server.  So you could say I’m a pretty hard-core old-school SQL database guy.

For the longest time, it looked like the RDBMS was here to stay.  But a few years ago, we noticed signs of new database models that were better aligned with modern application architectures, the massive “big data” demands of Web 2.0 and the increasing disparity between IO and CPU capabilities.   I’ve written and talked about some of these trends over the past two years, and I’ve placed links to those articles at the end of this post.

What I haven’t been able to talk much about is what we are doing to support these Cloud/NoSQL databases at Quest Software.   Quest is a diverse company, with strong offerings in Applications, Virtualization and Windows management, but we are definitely very dominant in database tools, so you might have been wondering how we planned to move into the next generation of database management tools.

For the past year or so, I’ve been directing a team of developers who are building “Toad for Cloud Databases”, which will provide data management capabilities for NoSQL/Cloud databases in the familiar Toad platform.   This Toad will allow developers, DBAs and data analysts to work with data in these databases just as easily as they would with data in an RDBMS.  Even better, you’ll be able to move data between an RDBMS and a NoSQL/Cloud database, and issue queries that join data in both.

I can’t say much more about Toad for Cloud Databases in advance of its official release later this month.  But I will be starting to blog more frequently on NoSQL topics, both in this blog and in the Toad for Cloud Databases community site that will be active by the end of June.

In this personal blog, I’m going to publish some summaries of the things we’ve learned about the various NoSQL/Cloud databases, especially from the perspective of an RDBMS professional.  I’ll also be posting early versions of articles and postings that will eventually go up on the official Toad for Cloud website.

I’m pretty excited about what we’re doing with Toad for Cloud Databases, and I’m looking forward to sharing some of this stuff with you.  Of course, we’re still busily working on RDBMS products at Quest, particularly in Melbourne, where we develop SQL Navigator, Spotlight on Oracle, Spotlight on Oracle RAC, Spotlight on MySQL and Spotlight on SQL Server Enterprise.

References

 

As promised, here are links to my previous articles on next generation databases.  Some are a bit dated now, but give you an idea of how the new world has emerged from my (SQL guy) perspective:

Is the next DBMS Revolution Looming? – Database Trends and Applications, June 2008

End of the one-size fits all RDBMS? – Database Trends and Applications, July 2008

Map Reduce for Business Intelligence and Analytics – Database Trends and Applications, September 2009

What's next for RDBMS? – InfoManagement Direct, May 7, 2009

Hadoop sets its sights on the Enterprise – January 2010

Next Generation Databases – RMOUG presentation, January 2010: http://www.slideshare.net/gharriso/next-generation-databases

Tuesday, May 11, 2010

Disabling the Intel X25-E write cache

In my last posting on using SSD with Oracle, I said how impressed I was with the X25-E SSD's write performance.  However, at the O'Reilly MySQL conference last month, I attended a talk by Vadim Tkachenko and Morgan Tocker from Percona on An Overview of Flash Storage for Databases.  It was a great talk overall, but one important thing I learned is that the X25-E has a volatile 64MB write cache.  This means that the X25-E can report that a block has been written to disk while it is still within a RAM buffer inside the device.  If the drive failed between the write to RAM and the write to flash, the data could be lost.

We’d normally regard this data loss as an unacceptable risk, so you would think that the best thing to do would be to turn the write cache off.  This can be done with the following command:

hdparm -W 0 /dev/sdb

(assuming that /dev/sdb is the flash SSD).

Turning off the write cache – as you’d expect – reduces the write IO capacity of the device.  Below we see the effect on two identical workloads:

[Chart: IO rates and elapsed time for identical workloads with the write cache enabled and disabled]

These workloads involve SELECT and UPDATE operations on a table which is stored on a datafile directly on the SSD.  There’s no db flash cache involved in this simulation.

The datafile write rate drops substantially and the work takes longer to complete, as we would expect.  But why does the read IO rate drop as well?   The reason is free buffer waits.

As described in this post, when an IO subsystem has a higher read bandwidth than write bandwidth, sessions may be able to add and update blocks in the buffer cache faster than the DBWR can clear them out.  When this happens, free buffer waits occur as sessions wait for buffers to be cleared.

We can see the free buffer waits in Spotlight’s event wait chart:

[Chart: Spotlight event waits showing free buffer waits]

Disabling the write cache slows down disk performance somewhat, but it’s still a lot faster than a spinning disk.  Furthermore, most workloads are not as update-intensive as my simulation, so much of the time you won’t hit this problem.  Nevertheless, it's important to realize that the X25-E has this write cache and that it may be artificially increasing write throughput at the cost of write safety.

One word of caution:  I met a guy from Percona who told me that Intel doesn’t actually support the X25-E with the write cache disabled.  This is a bit disturbing, since it implies that you can choose data safety or vendor support, but not both!

Also, note that the write cache can be left enabled if the SSD is only being used for the 11gR2 database flash cache.  In that configuration, failed writes to the cache in the event of a disk failure will cause no harm:  Oracle will detect that the cache has failed and will bypass it completely.

Saturday, April 10, 2010

ESX CPU optimization for Oracle Databases

In the last post, I talked about managing memory for Oracle databases when running under ESX.  In this post I’ll cover the basics of CPU management.

ESX CPU Scheduling

 

Everyone probably understands that in an ESX server there are likely to be more virtual CPUs than physical CPUs.  For instance, you might have an 8-core ESX server with 16 virtual machines, each of which has a single virtual CPU.  Since there are twice as many virtual CPUs as physical CPUs, not all the virtual CPUs can be active at the same time.  If they all try to gain CPU simultaneously, some of them will have to wait.

In essence, a virtual CPU (vCPU) can be in one of three states:

  • Associated with an ESX CPU but idle
  • Associated with an ESX CPU and executing instructions
  • Waiting for an ESX CPU to become available

As with memory,  ESX uses reservations, shares and limits to determine which virtual CPUs get to use the physical CPUs if the total virtual demand exceeds physical capacity.

  • Shares represent the relative amount of CPU allocated to a VM if there is competition.  The more shares a VM has, the larger the proportion of CPU cycles it will be allocated.  All other things being equal, a VM with twice the number of shares will get access to twice as much CPU capacity.
  • The Reservation determines the minimum amount of CPU cycles allocated to the VM
  • The Limit determines the maximum amount of CPU that can be made available to the VM

VMs compete for CPU cycles between their reservation and their limit.  The outcome of the competition is determined by the relative number of shares allocated to each VM.
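
Here is a deliberately simplified Python sketch of that competition – a toy model, not the real ESX scheduler: every VM is guaranteed its reservation, never exceeds its limit, and spare capacity is divided in proportion to shares.  The MHz figures and VM names are made up for illustration.

```python
def allocate_cpu(capacity_mhz, vms):
    """Toy proportional-share model: vms maps a VM name to a dict with
    'reservation', 'limit' and 'shares'. Assumes admission control has
    already ensured the reservations fit within capacity_mhz."""
    alloc = {name: vm["reservation"] for name, vm in vms.items()}
    spare = capacity_mhz - sum(alloc.values())
    active = {name for name, vm in vms.items() if alloc[name] < vm["limit"]}
    while spare > 1e-9 and active:
        total_shares = sum(vms[n]["shares"] for n in active)
        grants = {n: spare * vms[n]["shares"] / total_shares for n in active}
        spare = 0.0
        for name, grant in grants.items():
            headroom = vms[name]["limit"] - alloc[name]
            taken = min(grant, headroom)
            alloc[name] += taken
            spare += grant - taken            # redistribute what hit a limit
            if alloc[name] >= vms[name]["limit"]:
                active.discard(name)
    return alloc

# Example: two VMs competing for 4000 MHz; vm_b has twice the shares of vm_a.
print(allocate_cpu(4000, {
    "vm_a": {"reservation": 500, "limit": 4000, "shares": 1000},
    "vm_b": {"reservation": 500, "limit": 4000, "shares": 2000},
}))   # roughly {'vm_a': 1500.0, 'vm_b': 2500.0}
```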

[Screenshot: vSphere CPU resource allocation settings showing shares, reservation and limit]

 

Measuring CPU consumption in a virtual machine

 

Because ESX can vary the amount of CPU actually allocated to a VM, operating system reports of CPU consumption can be misleading.  On a physical machine with a single 2 GHz CPU, 50% utilization clearly means 1GHz of CPU consumed.  But on a VM, 50% might mean 50% of the reservation, the limit, or anything in between.  So interpreting CPU consumption requires the ESX perspective as to how much CPU was actually provided to the VM.

The Performance monitor in the vSphere client gives us the traditional measures of CPU consumption:  CPU used, CPU idle, etc.  However it adds the critical “CPU Ready” statistic.  This statistic reflects the amount of time the VM wanted to consume CPU, but was waiting for a physical CPU to become available.  It is the most significant measure of contention between VMs for CPU power.

For instance, in the chart below, we can see that the amount of ready time is sometimes almost as great as the amount of CPU actually consumed.  In fact, you can probably see that as the ready time goes up, the VM’s actual CPU consumption goes down – the VM wants to do more computation, but is unable to do so because of competition with other VMs.

[Chart: vSphere performance monitor showing CPU used and CPU ready time]

The display of milliseconds in each mode makes it hard to work out exactly what is going on.  In the next release of Spotlight on Oracle (part of the Toad DBA suite) we’ll be showing the amount of ready time as a proportion of the maximum possible CPU, and provide drilldowns that show CPU limits, reservations, utilization and ready time. 

 

Co-scheduling

 

For a VM with multiple virtual CPUs, ESX needs to synchronize vCPU cycles with physical CPU consumption.   Significant disparities between the amounts of CPU given to each vCPU in a multi-CPU VM will cause performance issues and maybe even instability.    A “strict co-scheduling” policy is one in which all the vCPUs are allocated to physical CPUs simultaneously, or at least whenever any one vCPU falls significantly behind in processing.   Modern ESX uses “relaxed co-scheduling”, in which only the vCPUs that have fallen behind need to be co-scheduled.

In practice, however, on a multi-vCPU system all the vCPUs generally consume roughly equivalent amounts of CPU, and most of the time all of them will need to be scheduled together.  This can make it harder for ESX to allocate CPUs.   For instance, in the diagram below we see how the more vCPUs are configured, the fewer scheduling choices are available to the ESX scheduler:

 

[Diagram: scheduling options available to the ESX scheduler as the number of vCPUs per VM increases]

(Thanks to Carl Bradshaw for letting me reprint that diagram from his Oracle on VMWare whitepaper)

As a result, you can actually find performance decreasing as the number of cores increases.  This will be most apparent if you try to configure as many vCPUs as there are physical CPUs.

Even if there is no competition from other virtual machines, the ESX hypervisor itself will require CPU resources and will find it difficult to schedule all the cores of the VM.  This is very noticeable in VMware Workstation:  if you create a two-CPU virtual machine on a dual-core laptop, it will almost certainly perform worse than a single-CPU VM, because VMware will have trouble scheduling both vCPUs simultaneously.

In general, don’t allocate a lot of vCPUs unless you are sure that the ESX server is usually under light load from other VMs and that your database actually needs the extra cores.

Summary

 

Effective ESX memory configuration requires co-ordination between Oracle’s memory management and the ESX memory management to avoid PGA or SGA ending up on disk.  CPU is a lot simpler.  In general I recommend the following:

  • Avoid over-allocating CPU cores:  don’t automatically assume that more CPUs will lead to better performance
  • Use reservations, limits and shares to determine the relative amount of CPU that will be allocated to your VM
  • Monitor the ESX “CPU ready” statistic to determine how competition with other VMs is affecting your virtualized databases’ access to CPU.