dbKoda for MongoDB

dbKoda for MongoDB - a modern open source database IDE, now available for MongoDB. Download it here!

Next Generation
Databases: NoSQL,
NewSQL and Big Data

Buy at Amazon

Read sample at Amazon

Buy at Apress

Latest Postings:

Search

Oracle Performance Survival Guide

Buy It
Read it on Safari
Scripts and Examples
Sample Chapter

                                    

Powered by Squarespace

 MySQL Stored procedure programming

Buy It
Read it on Safari
Scripts and Examples 

                                                

Entries in R (4)

Thursday
Aug092012

Exadata smart flash logging

Exadata storage software 11.2.2.4 introduced the Smart flash logging feature.  The intent of this is to reduce overall redo log sync times - especially outliers - by allowing the exadata flash storage to serve as a secondary destination for redo log writes.  During a redo log sync, Oracle will write to the disk and flash simultaneously and allow the redo log sync operation to complete when the first device completes. 

Jason Arneil reports some initial observations here, and Luis Moreno Campos summarized it here.

I’ve reported in the past on using SSD for redo including on Exadata and generally I’ve found that SSD is a poor fit for redo log style sequential write IO.  But this architecture should at least do now harm and on the assumption that the SSD will at least occasionally complete faster than a spinning disk I tried it out. 

My approach involved the same workload I’ve used in similar tests.  I ran 20 concurrent processes each of which performed 200,000 updates and commits – a total of 4,000,000 redo log sync operations.  I captured every redo log sync wait from 10046 traces and loaded them in R for analysis.

I turned flash logging on or off by using an ALTER IORMPLAN command like this (my DB is called SPOT):

ALTER IORMPLAN dbplan=((name='SPOT', flashLog=$1),(name=other,flashlog=on))'

And I ran “list metriccurrent where objectType='FLASHLOG'” before and after each run so I could be sure that flash logging was on or off.

When flash logging was on, I saw data like this:

Before:

     FL_DISK_FIRST                     FLASHLOG     32,669,310 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,318,741 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests

After:

      FL_DISK_FIRST                     FLASHLOG     33,201,462 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,337,931 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests

 

So for this particular cell the flash disk “won” only 3.8% of times (7,337,931-7,318,741)*100/(7,337,931-7,318,741+33,201,462-32,669,310) and prevented no “outliers”.  Outliers are defined as being redo log syncs that would have taken longer than 500 ms to complete. 

Looking at my 4 million redo log sync times,  I saw that the average and median times where statistically significantly higher when the smart flash logging was involved:

> summary(flashon.data$synctime_us) #Smart flash logging ON
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0   452.0   500.0   542.4   567.0  3999.0
> summary(flashoff.data$synctime_us) #Smart flash logging OFF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   29.0   435.0   481.0   508.7   535.0  3998.0
> t.test(flashon.data$synctime_us,flashoff.data$synctime_us,paired=FALSE)

    Welch Two Sample t-test

data:  flashon.data$synctime_us and flashoff.data$synctime_us
t = 263.2139, df = 7977922, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
33.43124 33.93285
sample estimates:
mean of x mean of y
542.3583  508.6763

Plotting the distribution of redo log sync times we can pretty easily see that there’s actually a small “hump” in times when flash logging is on (note logarithmic scale):

image

This is of course the exact opposite of what we expect, and I checked my data very carefully to make sure that I had not somehow switched samples.  And I repeated the test many times and always saw the same pattern.  

It may be that there is a slight overhead to running the race between disk and flash, and that that overhead makes redo log sync times slightly higher.  That overhead may become more negligible on a busy system.  But for now I personally can’t confirm that smart flash logging provides the intended optimization and in fact I observed a small but statistically significant and noticeable degradation in redo log sync times when it is enabled.

Friday
May202011

RJDBC 2.0 performance improvements

But hopefully this will be the last RJDDC posting the I make for a while :-).

Simon Urbanek has made some fixes to RJDBC which resolve the performance issues five referred to in my last post.  As you can see below, these fixes have led to a pretty dramatic performance improvements:

17-05-2011 10-29-16 AM RJDBC 2.0

Furthermore, RJDBC is substantially easier to install than Roracle, and is of course portable across different databases.

If you’re really concerned with RJDBC performance and have memory to burn you can improve performance someone by selecting all rows in a single internal fetch, like this:

jdata2<-fetch(dbSendQuery(jcon,sqltext),n=rowsToFetch);

This improves performance, though Roracle still has the advantage:

18-05-2011 10-56-47 AM RJDBC2

You can get the new RJDBC here.

Tuesday
May102011

RORACLE vs RJDBC

In this post I looked at hooking up the R statistical system to Oracle for either analysing database performance data or analysing other data that happened to be stored in Oracle.   A reader asked me if I’d compared the performance of RJDBC – which I used in that post - with the RORACLE package.  I hadn’t, but now I have and found some pretty significant performance differences.

RJDBC hooks up pretty much any JDBC datasource into R, while ROracle uses native libraries to mediate the connection.  RJDBC is probably easier to start with, but ROracle can be installed pretty easily, provided you create some client libary entries.  So for me,  after downloading the ROracle package, my install looked something like this (run it as root):

cd $ORACLE_HOME/bin
genclntsh
genclntst

R CMD INSTALL --configure-args='--enable-static' /root/Desktop/ROracle_0.5-9.tar.gz

It was pretty obvious right away that RJDBC was much slower than ROracle.  Here’s a plot of elapsed times for variously sized datasets:

8-05-2011 3-09-24 PM R1

The performance of RJDBC degrades fairly rapidly as the size of the data being retrieved from Oracle increases,  and would probably be untenable for very large data sets. 

The RJDBC committers do note that RJDBC performance will not be optimal:

The current implementation of RJDBC is done entirely in R, no Java code is used. This means that it may not be extremely efficient and could be potentially sped up by using Java native code. However, it was sufficient for most tasks we tested. If you have performance issues with RJDBC, please let us know and tell us more details about your test case.

The default array size used by RJDBC is only 10, while the default for Roracle is 500… could this be the explaination for the differences in performance?

You can’t change the default RJDBC fetch size (at least, I couldn’t work out how to), but you can change ROracle's.  Here’s a breakdown of elapsed time for RJDBC and Roracle using defaults, and for ROracle using a fetch size of 10:

10-05-2011 11-56-52 AM R3

As you can see, issue does not seem to be the array size alone.   I suspect the overhead of building up the data frame from the result set in RJDBC is where the major inefficiency occurs.  Increasing the array fetch size might reduce the impact of this, but the array fetch size alone is not the cause of the slow performance.    

Conclusion

The current implementation of RJDBC is easy to install and fairly portable, but doesn’t provide good performance when loading large data sets.   For now, ROracle is the better choice.   

Scripts:

Create test table (SQL)

scalability testing (R script)

Comparison of elapsed times and array size (R script).

Thursday
Mar032011

Statistical analysis of Oracle performance data using R

R is without doubt the Open Source tool of choice for statistical analysis, it contains a huge variety of statistical analysis techniques – rivalled only by hugely expensive commercial products such as SAS and SPSS.   I’ve been playing with R a bit lately, and – of course – working with data held in Oracle.  In particular, I’ve been playing with data held in the Oracle dynamic performance views.

 

This post is a brief overview of installing R, connecting R to Oracle, and using R to analyse Oracle performance data.

Installing R

 

R can be install in linux as a standard package:

yum install R

On Windows, you may wish to use the Revolution R binaries:  http://info.revolutionanalytics.com/download-revolution-r-community.html.  I had a bit of trouble installing the 32-bit binaries on my system as they conflicted with my 64-bit JDBC.  But if you are 32-bit you might be OK.

The easiest way to setup a connection to Oracle in to install the RJDBC package.

[oracle@GuysOEL ~]$ R

R version 2.12.1 (2010-12-16)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)

<snip>

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages("RJDBC")

Using the StatET Eclipse plug-in

 

I use a free eclipse plug-in called StatET.  It provides an editing environment and GUI console for the R system.  The configuration steps are a little laborious, but it has a online getting started module that guides you through the steps.  Once you have it installed, I doubt you’ll go back to the command line. 

 

 

You can get StatET at http://www.walware.de/goto/statet.  Using the eclipse environment is really handy if you’re going to use R with Oracle, since you can also use the free Toad for Eclipse extension to work on your SQLs.  Eclipse becomes a complete environment for both R and Oracle.

Getting data from Oracle into R

 

Once you’ve installed R, it’s pretty simple to get data out of Oracle and into R.   Here’s a very short snippet that grabs data from the V$SQL table:

   1: library(RJDBC)
   2:  
   3: drv <- JDBC("oracle.jdbc.driver.OracleDriver",
   4:                 "/ora11/home/jdbc/lib/ojdbc6.jar")
   5:  
   6: conn <- dbConnect(drv,"jdbc:oracle:thin:@hostname:1521: service","username","password")
   7: sqldata<-dbGetQuery(conn, "SELECT cpu_time cpu,elapsed_time ela,disk_reads phys,
   8:                                   buffer_gets bg,sorts sorts
   9:                              FROM V$SQL ")
  10: summary(sqldata)

 

Let’s look at that line by line:

 

Line Comments
1 The library command loads the RJDBC module, which will provide connectivity to Oracle.
3 We create a driver object for the Oracle JDBC driver.  The second argument is the location of the Oracle JDBC jar file,  almost always $ORACLE_HOME/jdbc/lib/ojdbc6.jar.
6 Connect to the Oracle database using standard JDBC connections strings
7 Create an R dataset from the result set of a query.  In this case, we are loading the contents of the V$SQL table. 
10 The R “summary” package provides simple descriptive statistics for each variable in the provided dataset.

 

Basic R statistical functions

R has hundreds of statistical functions,  in the above example we used “summary”, which prints descriptive statistics.  The output is shown below;  mean, medians, percentiles, etc:

image

Correlation

Statistical correlation reveals the association between two numeric variables.  If two variables always increase or decrease together the correlation is 1;  if two variables are absolutely random with respect of each other then the correlation tends towards 0.

cor prints the correlation between every variable in the data set:

image

cor.test calculates the correlation coefficient and prints out the statistical significance of the correlation, which allows you to determine if there is a significant relationship between the two variables.  So does the number of sorts affect response time?  Let’s find out:

image

The p-value is 0.19 which indicates no significant relationship – p values of no more than 0.05 (one chance in 20) are usually requires before we assume statistical significance.

On the other hand,  there is a strong relationship between CPU time and Elapsed time:

image

Plotting

plot prints a scattergram chart.  Here’s the output from plot(sqldata$ELA,sqldata$CPU):

image

Here’s a slightly more sophisticated chart using “smoothScatter”, logarithmic axes and labels for the axes:

image

Regression

Regression is used to draw “lines of best fit” between variables.  

In the simplest case, we use the “lm” package to create a linear regression model between two variables (which we call “regdata” in the example).  The summary function prints a summary of the analysis:

image

This might seem a little mysterious if your statistics is a bit rusty, but the data above tells us that there is a significant relationship between elapsed time (ELA) and physical reads (PHYS) and gives us the gradient and Y axis intercept if we wanted to draw the relationship.   We can get R to draw a graph, and plot the original data by using the plot at abline functions:

image

Testing a hypothesis

 

One of the benefits of statistical analysis is you can test hypotheses about your data.  For instance, what about we test the until recently widely held notion that the buffer cache hit rate is a good measure of performance.  We might suppose if that were true that SQL statements with high buffer cache hit rates would show smaller elapsed times than those with low buffer cache hit rates.  To be sure, there are certain hidden assumptions underlying that hypothesis, but for the sake of illustration let’s use R to see if our data supports the hypothesis.

Simple correlation is a fair test for this, all we need to do is see if there is a statistically signifcant correlation between hit rate and elapsed time.  Here’s the analysis:

image

The correlation is close to 0, and the statistical significance way higher than the widely accepted .05 threshold for statistical significance.  Statements with high hit ratios do not show statistically signficantly lower elasped times that SQLs with low hit ratios.

Conclusion

 

There’s tons of data in our Oracle databases that could benefit from statistical analysis – not the least the performance data in the dynamic performance views, ASH and AWR.  We use statistical tests in Spotlight on Oracle to extrapolate performance into the future and to set some of the alarm thresholds.  Using R,  you have easy access to the most sophisticated statistical analysis techniques and as I hope I’ve shown, you can easily integrate R with Oracle data.