dbKoda - a modern open source databases IDE, now available for MongoDB Download it here!

Next Generation
Databases: NoSQL,
NewSQL and Big Data

Buy at Amazon

Read sample at Amazon

Buy at Apress

Latest Postings:

Search

Oracle Performance Survival Guide

Buy It
Read it on Safari
Scripts and Examples
Sample Chapter

                                    

Powered by Squarespace

 MySQL Stored procedure programming

Buy It
Read it on Safari
Scripts and Examples 

                                                

« Gearing up for OOW 2010 | Main | Playing with Cassandra and Oracle »
Saturday
Sep112010

Oracle tables vs Cassandra SuperColumns

 

In my last post,  I wrote some Java code to insert Oracle tables into Cassandra column families.  As much fun as this was for me, it was  fairly trivial and not a particularly useful exercise in terms of learning Cassandra. 

In Cassandra,  data modelling is very different from the relational models we are used to and one would rarely convert a complete Oracle schema from tables directly to ColumnFamilies .  Instead, Cassandra data modelling involves the creation of ColumnFamilies with SuperColumns to represent master-detail structures that are commonly referenced together

SuperColumns vs Relational schema

 

At the Cassandra Summit in August,  Eben Hewitt gave a presentation on Cassandra Data Modelling.   There’s a lot of nuance in that talk and in the topic, but a key point in Cassandra – as in many other NoSQL databases – is that you model data to match the queries you need to satisfy,  rather than to a more theoretically "pure" normalized form.   For relational guys, the process is most similar to radical denormalization in which you introduce redundancy to allow for efficient query processing.

For example, let’s consider the Oracle SH sample schema.  Amongst other things, it includes SALES, PRODUCTS and CUSTOMERS:

 

9-09-2010 3-35-32 PM Oracle sample schema

We could map each Oracle table to a Cassandra ColumnFamily, but because there are no foreign key indexes or joins,  such a Cassandra data model would not necessarily support the types of queries we want.  For instance, if we want to query sales totals by customer ID, we should create a column family keyed by customer id, which contains SuperColumns named for each product which in turn includes columns for sales totals.  It might look something like this:

ID CustomerDetails Product Name #1 Product Name #2 ………….. Product Name #N
1
First Name Last Name
Guy Harrison
Quantity Value
3 $100,020
  …………..
Quantity Value
3 $130,000
2
First Name Last Name
Greg Cottman
 
Quantity Value
34 $10,080
…………..
Quantity Value
4 $99,000

 

Each customer “row” has super column for each product that contains the sales for that product.  Not all customers have all the supercolumns - each customer has supercolumns only for each product they have purchased.  The name of the SuperColumn is the name of the product.  

Giving the column the name of the product is a major departure from how we would do things in Oracle.  The name of a column or SuperColumn can be determined by the data, not by the schema - a concept completely alien to relational modelling.

Inserting into SuperColumns with Hector

 

To try and understand this,  I created a Cassandra columnfamily of the type “Super”.  Here’s my definition in the storage-conf.xml file:

<ColumnFamily Name="SalesByCustomer" 
ColumnType="Super"
CompareWith="UTF8Type"
CompareSubcolumnsWith="UTF8Type"
Comment="Sales summary for each customer "/>

And here is some of my Hector Java program, which reads sales totals for each customer from the Oracle sample schema, and inserts them into the ColumnFamily:

   1: private static void insertSales(Connection oracleConn, Keyspace keyspace,
   2:         String cfName) throws SQLException {
   3:     int rows = 0;
   4:     ColumnPath cf = new ColumnPath(cfName);
   5:     Statement query = oracleConn.createStatement();
   6:  
   7:     String sqlText = "SELECT cust_id, cust_first_name,  cust_last_name, prod_name, "
   8:             + "           SUM (amount_sold) sum_amount_sold,sum(quantity_sold) sum_quantity_sold "
   9:             + "          FROM sh.sales    "
  10:             + "          JOIN sh.customers USING (cust_id) "
  11:             + "          JOIN sh.products  USING (prod_id)  "
  12:             + "         GROUP BY cust_id, cust_first_name,  cust_last_name,  prod_name "
  13:             + "         ORDER BY cust_id, prod_name ";
  14:     ResultSet results = query.executeQuery(sqlText);
  15:     int rowCount = 0;
  16:     int lastCustId = -1;
  17:     while (results.next()) { // For each customer
  18:         Integer custId = results.getInt("CUST_ID");
  19:         String keyValue = custId.toString();
  20:  
  21:         if (rowCount++ == 0 || custId != lastCustId) { // New Customer
  22:             String custFirstName = results.getString("CUST_FIRST_NAME");
  23:             String custLastName = results.getString("CUST_LAST_NAME");
  24:             System.out.printf("%s %s\n", custFirstName, custLastName);
  25:             //Create a supercolumn for customer details (first, lastname)     
  26:             cf.setSuper_column(StringUtils.bytes("CustomerDetails"));
  27:             cf.setColumn(StringUtils.bytes("customerFirstName"));
  28:             keyspace.insert(keyValue, cf, StringUtils.bytes(custFirstName));
  29:             cf.setColumn(StringUtils.bytes("customerLastName"));
  30:             keyspace.insert(keyValue, cf, StringUtils.bytes(custLastName));
  31:         }
  32:         //Insert product sales total for that customer 
  33:         String prodName = results.getString("PROD_NAME");
  34:         Float SumAmountSold = results.getFloat("SUM_AMOUNT_SOLD");
  35:         Float SumQuantitySold = results.getFloat("SUM_QUANTITY_SOLD");
  36:         //Supercolumn name is the product name 
  37:         cf.setSuper_column(StringUtils.bytes(prodName));
  38:         cf.setColumn(StringUtils.bytes("AmountSold"));
  39:         keyspace.insert(keyValue, cf, StringUtils.bytes(SumAmountSold.toString()));
  40:         cf.setColumn(StringUtils.bytes("QuantitySold"));
  41:         keyspace.insert(keyValue, cf, StringUtils.bytes(SumQuantitySold.toString()));
  42:         
  43:         lastCustId = custId;
  44:         rows++;
  45:     }
  46:     System.out.println(rows + " rows loaded into " + cf.getColumn_family());
  47: }

This code is fairly straightforward,  but let’s step through it anyway:

Lines Description
7-14 Execute the Oracle SQL to get product summaries for each customer
17 Loop through each row returned (one row per product per customer)
21 Check to see if this is a completely new customer
26-30 If it is a new customer,  create the CustomerDetails SuperColumn for that customer.  The SuperColumn name is “CustomerDetails” and it contains columns for Firstname and Lastname.
37-41

Now we create a SuperColumn for a specfic product, still keyed to the customer.  The SuperColumn name is set to the name of the product (line 37).  Inside the supercolumn are placed columns “AmountSold” (lines 38-39) and “QuantitySold” (lines 40-41)



Querying SuperColumns

 

Inserting master detail relationships into a supercolumn column family was easy enough.  I had a lot more difficulty writing code to query the data.  The tricky part seems to be when you don’t know the name of the SuperColumn you want to read from.  There's no direct equivalent to the JDBC ResultMetaData object to query the SuperColumn names - instead you create a "SuperSlice" predictate that defines a range of SuperColumns that you want to retrieve.  It's a bit awkward to express the simple case in which you want to return all the SuperColumns. 

Below is a bit of code which retrieves sales totals for a specific customer id.  I suspect I've made a few newbie mistakes :-):

   1: public static void querySuperColumn(Keyspace keyspace, String cfName,
   2:         String keyValue) {
   3:  
   4:     ColumnPath colFamily = new ColumnPath(cfName);
   5:     System.out.println("Details for customer id " + keyValue);
   6:  
   7:     /* Get Customer Details */
   8:     colFamily.setSuper_column(StringUtils.bytes("CustomerDetails"));
   9:     SuperColumn custDetailsSc = keyspace
  10:             .getSuperColumn(keyValue, colFamily);
  11:     for (Column col : custDetailsSc.getColumns()) {
  12:         String colName = StringUtils.string(col.getName()); 
  13:         String colValue = StringUtils.string(col.getValue()); 
  14:         System.out.printf("\t%-20s:%-20s\n", colName, colValue);
  15:     }
  16:     /* Get dynamic columns -  */
  17:     ColumnParent colParent = new ColumnParent(cfName);
  18:     SliceRange sliceRange = new SliceRange(StringUtils.bytes(""), StringUtils
  19:             .bytes(""), false, 2 ^ 32); // TODO: what if there are > 2^32 ??                                             
  20:     SlicePredicate slicePredicate = new SlicePredicate();
  21:     slicePredicate.setSlice_range(sliceRange);
  22:     //TODO:  Surely there's an easier way to select all SC than the above??
  23:     List superSlice = keyspace.getSuperSlice(keyValue,
  24:             colParent, slicePredicate);
  25:     for (SuperColumn prodSuperCol : superSlice) {  //For each super column
  26:         String superColName = StringUtils.string(prodSuperCol.getName());
  27:         if (!superColName.equals("CustomerDetails")) { // Already displayed
  28:                                                          
  29:             System.out.printf("\n%50s:", superColName); // product Name 
  30:             List columns1 = prodSuperCol.getColumns();
  31:             for (Column col : columns1) {               // product data 
  32:                 String colName = StringUtils.string(col.getName()); 
  33:                 String colValue = StringUtils.string(col.getValue()); 
  34:                 System.out.printf("\t%20s:%-20s", colName, colValue);
  35:  
  36:             }
  37:         }
  38:     }
  39:  
  40: }
Lines Description
8-9 Set the superColumn to the “CustomerDetails” supercolumn
11-14 Retrieve the column values (firstname, surname) for the CustomerDetails supercolumn
17-21 Set up a “SlicePredicate” that defines the supercolumns to be queried.  I want to get all of the supercolumns (eg every product), so I set up an unbounded range (line 18) and supply that to the slice predicate (line 21)
23 Create a list of supercolumns.  This will include all the SuperColumns in the column family (including, unfortunately,  CustomerDetails)
27 Eliminate CustomerDetails from the result.  Here we only want product names
30-35 Iterate through the columns in each supercolumn.  THis will extract QuantitySold and AmountSold for each Product name

 

Here’s some output from the Java program.  It prints out customer Details and product sales totals for customer# 10100:

Details for customer id 101000
customerFirstName :Aidan
customerLastName :Wilbur

CD-RW, High Speed Pack of 5: AmountSold:11.99 QuantitySold:1.0
Keyboard Wrist Rest: AmountSold:11.99 QuantitySold:1.0
Multimedia speakers- 3" cones: AmountSold:44.99 QuantitySold:1.0


SuperColumns with Toad for Cloud Databases 

 

Toad for cloud databases now has Cassandra support, which makes querying SuperColumns s a lot easier.  SuperColumns that have dynamic names but uniform internal column structure (as in my example above) are represented by Toad for Cloud Databases as a detail table.  To put it another way,  Toad for Cloud Databases re-normalizes the data - displaying it in the format that we would typically use in an RDBMS. 

So when we point Toad for Cloud databases at our SalesByCustomer column family, it maps the column family to two tables:  one for CustomerDetails and the other - which by default it will call SalesByCustomersuper_column” – for product sales totals.  We can rename the subtable and subtable key during the mapping phase to make it clearer that it represents product details.

9-09-2010 1-56-19 PM map cassandra super col

Now if we want to extract product details for a particular customer, we can do a SQL join.  Below we build the join in the query builder, but of course we could simply code the SQL by hand as we would for any NoSQL or SQL database supported by Toad for Cloud Databases:

9-09-2010 3-49-36 PM cassandra supercol qry

And just to close the loop, here we can see that the Toad for Cloud databases query returns the same data as the Hector query:

9-09-2010 3-50-48 PM cassabdra supercol results

 

Conclusion

 

All NoSQL databases require that we change the way we think about data modelling, and Cassandra is no exception.  SuperColumns are an incredibly powerful construct, but I can’t say that I found them intuitive or easy.  Hopefully APIs and tooling will evolve to make life easier for those of us coming from the relational world.

Reader Comments (2)

Do you think that cassandra and non-relational model will replace traditional rdbms vendors ( ibm,oracle,microsoft ) ?

September 13, 2010 | Unregistered CommenterAlex

Sorry for the delay replying! No I don't think they will replace, but definitely I expect them to supplement. There's an interview I had with a guy from TechTarget that reflects my thinking:
http://searchcloudcomputing.techtarget.com/news/article/0,289142,sid201_gci1520760,00.html

September 28, 2010 | Registered CommenterGuy Harrison

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>