Thursday

Mar032011

Statistical analysis of Oracle performance data using R

Thursday, March 3, 2011 at 9:27PM

R is without doubt the Open Source tool of choice for statistical analysis, it contains a huge variety of statistical analysis techniques – rivalled only by hugely expensive commercial products such as SAS and SPSS. I’ve been playing with R a bit lately, and – of course – working with data held in Oracle. In particular, I’ve been playing with data held in the Oracle dynamic performance views.

This post is a brief overview of installing R, connecting R to Oracle, and using R to analyse Oracle performance data.

Installing R

R can be install in linux as a standard package:

yum install R

On Windows, you may wish to use the Revolution R binaries: http://info.revolutionanalytics.com/download-revolution-r-community.html. I had a bit of trouble installing the 32-bit binaries on my system as they conflicted with my 64-bit JDBC. But if you are 32-bit you might be OK.

The easiest way to setup a connection to Oracle in to install the RJDBC package.

[oracle@GuysOEL ~]$ R

R version 2.12.1 (2010-12-16)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)

<snip>

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages("RJDBC")

Using the StatET Eclipse plug-in

I use a free eclipse plug-in called StatET. It provides an editing environment and GUI console for the R system. The configuration steps are a little laborious, but it has a online getting started module that guides you through the steps. Once you have it installed, I doubt you’ll go back to the command line.

You can get StatET at http://www.walware.de/goto/statet. Using the eclipse environment is really handy if you’re going to use R with Oracle, since you can also use the free Toad for Eclipse extension to work on your SQLs. Eclipse becomes a complete environment for both R and Oracle.

Getting data from Oracle into R

Once you’ve installed R, it’s pretty simple to get data out of Oracle and into R. Here’s a very short snippet that grabs data from the V$SQL table:

library(RJDBC)

  

drv <- JDBC("oracle.jdbc.driver.OracleDriver",

                "/ora11/home/jdbc/lib/ojdbc6.jar")

  

conn <- dbConnect(drv,"jdbc:oracle:thin:@hostname:1521: service","username","password")

sqldata<-dbGetQuery(conn, "SELECT cpu_time cpu,elapsed_time ela,disk_reads phys,

                                  buffer_gets bg,sorts sorts

                             FROM V$SQL ")

summary(sqldata)

Let’s look at that line by line:

Line	Comments
1	The library command loads the RJDBC module, which will provide connectivity to Oracle.
3	We create a driver object for the Oracle JDBC driver. The second argument is the location of the Oracle JDBC jar file, almost always $ORACLE_HOME/jdbc/lib/ojdbc6.jar.
6	Connect to the Oracle database using standard JDBC connections strings
7	Create an R dataset from the result set of a query. In this case, we are loading the contents of the V$SQL table.
10	The R “summary” package provides simple descriptive statistics for each variable in the provided dataset.

Basic R statistical functions

R has hundreds of statistical functions, in the above example we used “summary”, which prints descriptive statistics. The output is shown below; mean, medians, percentiles, etc:

Correlation

Statistical correlation reveals the association between two numeric variables. If two variables always increase or decrease together the correlation is 1; if two variables are absolutely random with respect of each other then the correlation tends towards 0.

cor prints the correlation between every variable in the data set:

cor.test calculates the correlation coefficient and prints out the statistical significance of the correlation, which allows you to determine if there is a significant relationship between the two variables. So does the number of sorts affect response time? Let’s find out:

The p-value is 0.19 which indicates no significant relationship – p values of no more than 0.05 (one chance in 20) are usually requires before we assume statistical significance.

On the other hand, there is a strong relationship between CPU time and Elapsed time:

Plotting

plot prints a scattergram chart. Here’s the output from plot(sqldata$ELA,sqldata$CPU):

Here’s a slightly more sophisticated chart using “smoothScatter”, logarithmic axes and labels for the axes:

Regression

Regression is used to draw “lines of best fit” between variables.

In the simplest case, we use the “lm” package to create a linear regression model between two variables (which we call “regdata” in the example). The summary function prints a summary of the analysis:

This might seem a little mysterious if your statistics is a bit rusty, but the data above tells us that there is a significant relationship between elapsed time (ELA) and physical reads (PHYS) and gives us the gradient and Y axis intercept if we wanted to draw the relationship. We can get R to draw a graph, and plot the original data by using the plot at abline functions:

Testing a hypothesis

One of the benefits of statistical analysis is you can test hypotheses about your data. For instance, what about we test the until recently widely held notion that the buffer cache hit rate is a good measure of performance. We might suppose if that were true that SQL statements with high buffer cache hit rates would show smaller elapsed times than those with low buffer cache hit rates. To be sure, there are certain hidden assumptions underlying that hypothesis, but for the sake of illustration let’s use R to see if our data supports the hypothesis.

Simple correlation is a fair test for this, all we need to do is see if there is a statistically signifcant correlation between hit rate and elapsed time. Here’s the analysis:

The correlation is close to 0, and the statistical significance way higher than the widely accepted .05 threshold for statistical significance. Statements with high hit ratios do not show statistically signficantly lower elasped times that SQLs with low hit ratios.

Conclusion

There’s tons of data in our Oracle databases that could benefit from statistical analysis – not the least the performance data in the dynamic performance views, ASH and AWR. We use statistical tests in Spotlight on Oracle to extrapolate performance into the future and to set some of the alarm thresholds. Using R, you have easy access to the most sophisticated statistical analysis techniques and as I hope I’ve shown, you can easily integrate R with Oracle data.

Guy Harrison |

6 Comments |

1 Reference |

tagged

Oracle,

R in

Oracle

Wednesday

Feb022011

Using Toad with Hive in Amazon Elastic Map Reduce

Wednesday, February 2, 2011 at 3:24PM

The Toad for Cloud Databases eclipse client has support for Hive queries which makes it really easy for me to run queries against our test hadoop clusters. It also supports Hive running on top of Amazon Elastic Map Reduce (EMR), but you do need to be aware that in EMR the default ports are different from what we have come to expect.

Firstly, if you have started an EMR cluster with Hive 0.5 support, then the Hive server will be running on port 10001, not port 10000. The second difference is that the JobTracker is running on port 9100, rather than 50030. So when attaching to EMR, you would set up your hive connection something like this:

Once you’ve done that, the Hive connection will show all the Hive tables and you can enter HQL queries in the SQL editor. You can drag table and column names into the editor as well:

One of the simple, but really useful things about the hive client is that you can jump to the jobtracker web page while the HQL is running to see how it is going:

Here’s the resulting JobTracker console. We can see the job running and – if we scroll to the right or maximize the window – we can see how the Map and reduce phases of the Hive job are progressing:

Guy Harrison |

3 Comments |

2 References |

tagged

Hive,

toad in

TCD blog post

Wednesday

Feb022011

Correlation probability in Oracle

Wednesday, February 2, 2011 at 12:40PM

Being at the OReilly Strata conference has re-energized my inner statistic geek, so I thought I’d share a PL/SQL package I some years ago to calculate the probability associated with correlation coefficients calculated by Oracle.

Correlation is a statistical measure of the association between two variables. Oracle introduced direct support for correlation in the SQL language from at least 10g (may have even been 9i). So we can write SQL statements that calculate correlation. So for instance, this statement generates the correlation between the number of sorts and the elapsed times of SQL statements still cached in the SQL area:

The correlation coefficient “r” will be 1 if the two variables always increase or decrease together and –1 if one variable always increases as the other decreases. A correlation coefficient of .33 probably represents some relationship, but anyone who has done statistics knows that you should calculate the statistical significance – chance that the result did not arise from chance – before making assertions.

Oracle doesn’t report correlation probabilities, so I implemented the formula to calculate the probability in a PLSQL package. It’s here.

If you pass the correlation coefficient and the number of samples (rows input to the CORR function) into the CORR_PROG_PKG.PROB_R function then it will show you the probability associated with that correlation. So below we see the probability that this correlation is due to random fluctuation is effectively 0 (down to 4 significant figures):

Generally if the probability is less than .05 or .01 (one in 20 or 1 in 100 respectively) , then we would refer to the correlation as being statistically significant and that the two variables are related. So we would be justified in claiming there was a relationship between sorts and elapsed time.

Just for kicks, let’s use this technique to test the hypothesis that the Buffer Cache Hit ratio is a reliable indicator of performance. Here’s the result:

The correlation is very low and the probability that there is no association is very high – way above the .05 probability. In other words, in this database there is no statistically significant relationship between the elapsed time of an SQL and it’s buffer cache hit rate!

Guy Harrison |

Working with Cassandra 0.7

Friday, January 14, 2011 at 4:21PM

In this post, I experimented with inserting data from Oracle into Cassandra column families using Hector. Unfortunately, that code isn’t compatible with the latest Cassandra 0.7 release, so I had to rework it. The new version uses the addInsertion method of the Mutator object and while not totally intuitive didn’t take long to get working. Here are the key changes:

private static void insertSales(Connection oracleConn, Keyspace keyspace,

        String cfName) throws SQLException {

    int rows = 0;

    ColumnPath cf = new ColumnPath(cfName);

    Statement query = oracleConn.createStatement();

  

    String sqlText = "SELECT cust_id, cust_first_name,  cust_last_name, prod_name, "

            + "           SUM (amount_sold) sum_amount_sold,sum(quantity_sold) sum_quantity_sold "

            + "          FROM sh.sales    "

            + "          JOIN sh.customers USING (cust_id) "

            + "          JOIN sh.products  USING (prod_id)  "

            + "         GROUP BY cust_id, cust_first_name,  cust_last_name,  prod_name "

            + "         ORDER BY cust_id, prod_name ";

    ResultSet results = query.executeQuery(sqlText);

    int rowCount = 0;

    int lastCustId = -1;

    while (results.next()) { // For each customer

        Integer custId = results.getInt("CUST_ID");

        String keyValue = custId.toString();

  

        if (rowCount++ == 0 || custId != lastCustId) { // New Customer

            String custFirstName = results.getString("CUST_FIRST_NAME");

            String custLastName = results.getString("CUST_LAST_NAME");

            System.out.printf("%s %s\n", custFirstName, custLastName);

            // Create a supercolumn for customer details (first, lastname)

            Mutator<String> mutator = HFactory.createMutator(keyspace,

                    stringSerializer);

            mutator.addInsertion(keyValue, cfName, HFactory

                    .createSuperColumn("CustomerDetails", Arrays

                            .asList(HFactory.createStringColumn(

                                    "customerFirstName", custFirstName)),

                            StringSerializer.get(), StringSerializer.get(),

                            StringSerializer.get()));

            mutator.addInsertion(keyValue, cfName, HFactory

                    .createSuperColumn("CustomerDetails", Arrays

                            .asList(HFactory.createStringColumn(

                                    "customerLastName", custLastName)),

                            StringSerializer.get(), StringSerializer.get(),

                            StringSerializer.get()));

  

            mutator.execute();

        }

        // Insert product sales total for that customer

        String prodName = results.getString("PROD_NAME");

        Float SumAmountSold = results.getFloat("SUM_AMOUNT_SOLD");

        Float SumQuantitySold = results.getFloat("SUM_QUANTITY_SOLD");

        // Supercolumn name is the product name

        Mutator<String> mutator = HFactory.createMutator(keyspace,

                stringSerializer);

        mutator.addInsertion(keyValue, cfName, HFactory.createSuperColumn(

                prodName, Arrays.asList(HFactory.createStringColumn(

                        "AmountSold", SumAmountSold.toString())),

                StringSerializer.get(), StringSerializer.get(),

                StringSerializer.get()));

        mutator.addInsertion(keyValue, cfName, HFactory.createSuperColumn(

                prodName, Arrays.asList(HFactory.createStringColumn(

                        "QuantitySold", SumQuantitySold.toString())),

                StringSerializer.get(), StringSerializer.get(),

                StringSerializer.get()));

        mutator.execute(); 

        lastCustId = custId;

        rows++;

    }

    System.out.println(rows + " rows loaded into " + cf.getColumn_family());

}

The reason why I wanted to do this was to play with Cassandra using our (relatively) new Toad for Cloud Databases Eclipse client. Toad for Cloud Databases lets you work with non-relational datasources such as Cassandra, HBase, SimpleDB, etc, using SQL.

Here’s how it works. We select the column family we want to map from the Cassandra server:

That column family contains data loaded from both the Oracle CUSTOMER and SALES tables. Toad recognizes that the data in that single column family is best represented by two normalized tables, and gives us the opportunity to specify the names for the primary and foreign keys. We can also rename the “tables” (more like views really) that Toad will create:

The resulting tables look similar to the tables that we originally loaded from Oracle, and we can issue SQL queries against them just as we could have with Oracle. The queries get translated from SQL to thrift calls against the underlying Cassandra Server:

I definitely find it easier to issue SQL than write a 200 line Java program to do the same thing! Of course, I'm not much of a Java programmer, but at a minimum having Toad to query the Cassandra data is invaluable when checking to see that your program did was it was intended to do

Guy Harrison |

2 Comments |

tagged

Oracle,

cassandra in

TCD blog post

Thursday

Oct212010

Accelerating Oracle database performance with SSD

Thursday, October 21, 2010 at 3:05PM

I was recently asked to provide advice on how best to accelerate Oracle database performance with flash SSD. In this post I’ll attempt to outline some thoughts on that.

There’s some thinking that SSD will soon replace spinning disk altogether - we just have to wait for the price to come down. However, I’m not so sure that this complete replacement is likely to occur in the immediate future. There are two competing forces at work:

The economies of “big data”, which drive us to reduce the cost per unit of storage ($$/TB): magnetic disk costs only about 5% of the cost of a high end SSD per terabyte.
The economies of high IO throughput, in which we are motivated to reduce the cost of providing a given IO rate. SSD (PCi flash) can generate IO rates at 1/20th of the cost of magnetic disks.

These two competing economies are not likely to change in the immediate future: magnetic disk can store large amounts of data much more cheaply than SSD; SSD can deliver very high IO rates more cheaply than magnetic disk. For most databases, it will not be cost effective to place the entire database on SSD – the best outcome will be obtained when we place parts of the database on SSD and some on magnetic disk.

SSD Performance basics

I’ve posted in the past on how SSD works with the Oracle flash cache, and I’ve presented on SSD performance at OOW. You might like to review those items. However, here’s a quick review of the basics:

All flash drives offer pretty good read performance – say 25 microseconds for a single page read
When inserting data into an empty page, performance is slower, but not awful – around 250 microseconds
In order to update an existing page, a block erase is required – much, much slower – maybe 2000 microseconds

Enterprise SSD vendors - Fusion IO, Virident, etc - all have sophisticated algorithms to avoid the write penalty associated with SSD block erase operations. Amongst other techniques, they will maintain a pool of unused blocks. When a page needs to be updated it will be marked as invalid in the original location and moved to one of these blocks. Later on, garbage collection routines will clear up the invalid entries. The result is that updates don't always have to incur the high overhead of block erase operations.

Nevertheless, you want to avoid placing write intensive files on a flash disk, because as the disk fills up with modified blocks - and especially if the write rate exceeds the garbage collection capabilities - you might see the disk slow down dramatically.

To summarize:

Flash-based disks perform reads much faster than writes, and can suffer from performance degradation if subjected to intensive sustained write operations.

Options for SSD deployment on Oracle

Given the performance characteristics of SSD, how best to use SSD to boost Oracle performance? There are a few options:

Put the entire database on SSD
Place selected segments on SSD
Use the 11GR2 flash cache
Put temporary tablespace on SSD
Put the redo logs on SSD

Let’s look at each of these in turn:

Put the entire Database on SSD

This works great – if you can afford it. For most databases the cost of putting it all on SSD is too high. Placing data that is rarely accessed on SSD is not very cost effective, because you are paying a high cost per GB, but not getting any benefit from the relatively cheap IO rates.

Place selected segments on SSD

This is probably the best option if you don’t have enough SSD for the entire database, but it does require a bit of configuration. We identify the segments with the most IO, perhaps using a query on V$SEGMENT_STATISTICS:

We are looking for object that have a high read/write ratio, contribute to a significant number of reads, and are small enough to fit on our SSD. We create a tablespace on SSD, then move these objects to that tablespace.

If our most read intensive tables are massive, then they might not fit on the SSD. In this case we could consider partitioning them and placing the hottest partitions on SSD.

Use the 11GR2 flash cache

If you have 11GR2 and are on a compatible platform (Solaris or Oracle Enterprise Linux), then you can setup the Oracle DB flash cache. This is absolutely the easiest option – all you need to do is set a few parameters and bounce the database. The flash cache automatically caches the hottest blocks and usually provides an excellent performance improvement.

The improvement is not as significant as moving objects directly onto flash storage, since there still has to be magnetic disk IO getting things on and off. Also, the architecture tends to create a lot of writes to the SSD, which can challenge garbage collection. The chart below sums up a typical outcome – a huge boost from the flash cache, but not as big a boost as we could get by putting objects directly on SSD:

Put temporary tablespace on SSD

Temporary tablespace IO (from sorts and hash operations) can sometimes be the most significant form of database IO, and often the entire temporary tablespace will fit on the SSD. So should we consider putting temporary tablespace on the SSD?

I’m a bit reluctant to do this: by definition temporary tablespace IO is equal parts reads and writes, and since we are motivated to avoid write IO to the SSD, this seems like a dubious option. However, if your performance is absolutely dominated by temp tablespace IO and your SSD is tolerant of high write rates (eg has excellent garbage collection,etc) then it might be worth trying. After all, the SSD will definitely speed up the read side of temporary tablespace IO, even if the write IO does not get a significant boost.

Put redo logs on SSD

Although redo logs are very IO intensive, it’s almost all write IO and therefore this is probably not the best option. During sustained writes the Garbage Collection will probably break down and the disk might not end up performing that much better than a spinning disk. I'd not recommend this.

Conclusions

You can use SSD to boost database performance even if you don’t have enough for the entire database. Best options are:

Place segments with high read but low write rates directly on SSD
Use the SSD with the 11GR2 database flash cache (if you are on OEL or Solaris)
If temporary segment IO is absolutely your bottleneck – but only then – consider placing the temporary tablespace on SSD.

I don’t think using SSD for redo logs is a good idea. The high rates of sustained sequential write IO is not the ideal type of IO for SSD.

Guy Harrison |

3 Comments |

2 References |

tagged

Oracle,

ssd in

Oracle