Guy Harrison

Saturday

Sep112010

Oracle tables vs Cassandra SuperColumns

Saturday, September 11, 2010 at 1:49PM

In my last post, I wrote some Java code to insert Oracle tables into Cassandra column families. As much fun as this was for me, it was fairly trivial and not a particularly useful exercise in terms of learning Cassandra.

In Cassandra, data modelling is very different from the relational models we are used to and one would rarely convert a complete Oracle schema from tables directly to ColumnFamilies . Instead, Cassandra data modelling involves the creation of ColumnFamilies with SuperColumns to represent master-detail structures that are commonly referenced together

SuperColumns vs Relational schema

At the Cassandra Summit in August, Eben Hewitt gave a presentation on Cassandra Data Modelling. There’s a lot of nuance in that talk and in the topic, but a key point in Cassandra – as in many other NoSQL databases – is that you model data to match the queries you need to satisfy, rather than to a more theoretically "pure" normalized form. For relational guys, the process is most similar to radical denormalization in which you introduce redundancy to allow for efficient query processing.

For example, let’s consider the Oracle SH sample schema. Amongst other things, it includes SALES, PRODUCTS and CUSTOMERS:

We could map each Oracle table to a Cassandra ColumnFamily, but because there are no foreign key indexes or joins, such a Cassandra data model would not necessarily support the types of queries we want. For instance, if we want to query sales totals by customer ID, we should create a column family keyed by customer id, which contains SuperColumns named for each product which in turn includes columns for sales totals. It might look something like this:

CustomerDetails

Product Name #1

Product Name #2

…………..

Product Name #N

First Name	Last Name
Guy	Harrison

Quantity	Value
3	$100,020

…………..

Quantity	Value
3	$130,000

First Name	Last Name
Greg	Cottman

Quantity	Value
34	$10,080

…………..

Quantity	Value
4	$99,000

Each customer “row” has super column for each product that contains the sales for that product. Not all customers have all the supercolumns - each customer has supercolumns only for each product they have purchased. The name of the SuperColumn is the name of the product.

Giving the column the name of the product is a major departure from how we would do things in Oracle. The name of a column or SuperColumn can be determined by the data, not by the schema - a concept completely alien to relational modelling.

Inserting into SuperColumns with Hector

To try and understand this, I created a Cassandra columnfamily of the type “Super”. Here’s my definition in the storage-conf.xml file:

<ColumnFamily Name="SalesByCustomer" 
                  ColumnType="Super"
                  CompareWith="UTF8Type"
                  CompareSubcolumnsWith="UTF8Type"
                  Comment="Sales summary for each customer "/>

And here is some of my Hector Java program, which reads sales totals for each customer from the Oracle sample schema, and inserts them into the ColumnFamily:

private static void insertSales(Connection oracleConn, Keyspace keyspace,

        String cfName) throws SQLException {

    int rows = 0;

    ColumnPath cf = new ColumnPath(cfName);

    Statement query = oracleConn.createStatement();

  

    String sqlText = "SELECT cust_id, cust_first_name,  cust_last_name, prod_name, "

            + "           SUM (amount_sold) sum_amount_sold,sum(quantity_sold) sum_quantity_sold "

            + "          FROM sh.sales    "

            + "          JOIN sh.customers USING (cust_id) "

            + "          JOIN sh.products  USING (prod_id)  "

            + "         GROUP BY cust_id, cust_first_name,  cust_last_name,  prod_name "

            + "         ORDER BY cust_id, prod_name ";

    ResultSet results = query.executeQuery(sqlText);

    int rowCount = 0;

    int lastCustId = -1;

    while (results.next()) { // For each customer

        Integer custId = results.getInt("CUST_ID");

        String keyValue = custId.toString();

 

        if (rowCount++ == 0 || custId != lastCustId) { // New Customer

            String custFirstName = results.getString("CUST_FIRST_NAME");

            String custLastName = results.getString("CUST_LAST_NAME");

            System.out.printf("%s %s\n", custFirstName, custLastName);

            //Create a supercolumn for customer details (first, lastname)     

            cf.setSuper_column(StringUtils.bytes("CustomerDetails"));

            cf.setColumn(StringUtils.bytes("customerFirstName"));

            keyspace.insert(keyValue, cf, StringUtils.bytes(custFirstName));

            cf.setColumn(StringUtils.bytes("customerLastName"));

            keyspace.insert(keyValue, cf, StringUtils.bytes(custLastName));

        }

        //Insert product sales total for that customer 

        String prodName = results.getString("PROD_NAME");

        Float SumAmountSold = results.getFloat("SUM_AMOUNT_SOLD");

        Float SumQuantitySold = results.getFloat("SUM_QUANTITY_SOLD");

        //Supercolumn name is the product name 

        cf.setSuper_column(StringUtils.bytes(prodName));

        cf.setColumn(StringUtils.bytes("AmountSold"));

        keyspace.insert(keyValue, cf, StringUtils.bytes(SumAmountSold.toString()));

        cf.setColumn(StringUtils.bytes("QuantitySold"));

        keyspace.insert(keyValue, cf, StringUtils.bytes(SumQuantitySold.toString()));

        

        lastCustId = custId;

        rows++;

    }

    System.out.println(rows + " rows loaded into " + cf.getColumn_family());

}

This code is fairly straightforward, but let’s step through it anyway:

Lines	Description
7-14	Execute the Oracle SQL to get product summaries for each customer
17	Loop through each row returned (one row per product per customer)
21	Check to see if this is a completely new customer
26-30	If it is a new customer, create the CustomerDetails SuperColumn for that customer. The SuperColumn name is “CustomerDetails” and it contains columns for Firstname and Lastname.
37-41	Now we create a SuperColumn for a specfic product, still keyed to the customer. The SuperColumn name is set to the name of the product (line 37). Inside the supercolumn are placed columns “AmountSold” (lines 38-39) and “QuantitySold” (lines 40-41)

Querying SuperColumns

Inserting master detail relationships into a supercolumn column family was easy enough. I had a lot more difficulty writing code to query the data. The tricky part seems to be when you don’t know the name of the SuperColumn you want to read from. There's no direct equivalent to the JDBC ResultMetaData object to query the SuperColumn names - instead you create a "SuperSlice" predictate that defines a range of SuperColumns that you want to retrieve. It's a bit awkward to express the simple case in which you want to return all the SuperColumns.

Below is a bit of code which retrieves sales totals for a specific customer id. I suspect I've made a few newbie mistakes :-):

public static void querySuperColumn(Keyspace keyspace, String cfName,

        String keyValue) {

  

    ColumnPath colFamily = new ColumnPath(cfName);

    System.out.println("Details for customer id " + keyValue);

  

    /* Get Customer Details */

    colFamily.setSuper_column(StringUtils.bytes("CustomerDetails"));

    SuperColumn custDetailsSc = keyspace

            .getSuperColumn(keyValue, colFamily);

    for (Column col : custDetailsSc.getColumns()) {

        String colName = StringUtils.string(col.getName()); 

        String colValue = StringUtils.string(col.getValue()); 

        System.out.printf("\t%-20s:%-20s\n", colName, colValue);

    }

    /* Get dynamic columns -  */

    ColumnParent colParent = new ColumnParent(cfName);

    SliceRange sliceRange = new SliceRange(StringUtils.bytes(""), StringUtils

            .bytes(""), false, 2 ^ 32); // TODO: what if there are > 2^32 ??                                             

    SlicePredicate slicePredicate = new SlicePredicate();

    slicePredicate.setSlice_range(sliceRange);

    //TODO:  Surely there's an easier way to select all SC than the above??

    List superSlice = keyspace.getSuperSlice(keyValue,

            colParent, slicePredicate);

    for (SuperColumn prodSuperCol : superSlice) {  //For each super column

        String superColName = StringUtils.string(prodSuperCol.getName());

        if (!superColName.equals("CustomerDetails")) { // Already displayed

                                                         

            System.out.printf("\n%50s:", superColName); // product Name 

            List columns1 = prodSuperCol.getColumns();

            for (Column col : columns1) {               // product data 

                String colName = StringUtils.string(col.getName()); 

                String colValue = StringUtils.string(col.getValue()); 

                System.out.printf("\t%20s:%-20s", colName, colValue);

  

            }

        }

    }

  

}

Lines	Description
8-9	Set the superColumn to the “CustomerDetails” supercolumn
11-14	Retrieve the column values (firstname, surname) for the CustomerDetails supercolumn
17-21	Set up a “SlicePredicate” that defines the supercolumns to be queried. I want to get all of the supercolumns (eg every product), so I set up an unbounded range (line 18) and supply that to the slice predicate (line 21)
23	Create a list of supercolumns. This will include all the SuperColumns in the column family (including, unfortunately, CustomerDetails)
27	Eliminate CustomerDetails from the result. Here we only want product names
30-35	Iterate through the columns in each supercolumn. THis will extract QuantitySold and AmountSold for each Product name

Here’s some output from the Java program. It prints out customer Details and product sales totals for customer# 10100:

Details for customer id 101000
    customerFirstName   :Aidan               
    customerLastName    :Wilbur              

                       CD-RW, High Speed Pack of 5:              AmountSold:11.99                           QuantitySold:1.0                 
                               Keyboard Wrist Rest:              AmountSold:11.99                           QuantitySold:1.0                 
                     Multimedia speakers- 3" cones:              AmountSold:44.99                           QuantitySold:1.0

SuperColumns with Toad for Cloud Databases

Toad for cloud databases now has Cassandra support, which makes querying SuperColumns s a lot easier. SuperColumns that have dynamic names but uniform internal column structure (as in my example above) are represented by Toad for Cloud Databases as a detail table. To put it another way, Toad for Cloud Databases re-normalizes the data - displaying it in the format that we would typically use in an RDBMS.

So when we point Toad for Cloud databases at our SalesByCustomer column family, it maps the column family to two tables: one for CustomerDetails and the other - which by default it will call SalesByCustomersuper_column” – for product sales totals. We can rename the subtable and subtable key during the mapping phase to make it clearer that it represents product details.

Now if we want to extract product details for a particular customer, we can do a SQL join. Below we build the join in the query builder, but of course we could simply code the SQL by hand as we would for any NoSQL or SQL database supported by Toad for Cloud Databases:

And just to close the loop, here we can see that the Toad for Cloud databases query returns the same data as the Hector query:

Conclusion

All NoSQL databases require that we change the way we think about data modelling, and Cassandra is no exception. SuperColumns are an incredibly powerful construct, but I can’t say that I found them intuitive or easy. Hopefully APIs and tooling will evolve to make life easier for those of us coming from the relational world.

Guy Harrison |

2 Comments |

tagged

Oracle,

cassandra,

nosql in

TCD blog post

Reader Comments (2)

Do you think that cassandra and non-relational model will replace traditional rdbms vendors ( ibm,oracle,microsoft ) ?

September 13, 2010 |

Alex

Sorry for the delay replying! No I don't think they will replace, but definitely I expect them to supplement. There's an interview I had with a guy from TechTarget that reflects my thinking:
http://searchcloudcomputing.techtarget.com/news/article/0,289142,sid201_gci1520760,00.html

September 28, 2010 |

Guy Harrison

Post a New Comment

Enter your information below to add a new comment.

My response is on my own website »

Author:

Author Email (optional):

Author URL (optional):

Post:

↓ | ↑

Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>

Oracle tables vs Cassandra SuperColumns

SuperColumns vs Relational schema

Inserting into SuperColumns with Hector

Querying SuperColumns

SuperColumns with Toad for Cloud Databases

Conclusion

Reader Comments (2)

Post a New Comment

Link an External Response