In my last post, I wrote some Java code to insert Oracle tables into Cassandra column families. As much fun as this was for me, it was fairly trivial and not a particularly useful exercise in terms of learning Cassandra.
In Cassandra, data modelling is very different from the relational models we are used to and one would rarely convert a complete Oracle schema from tables directly to ColumnFamilies . Instead, Cassandra data modelling involves the creation of ColumnFamilies with SuperColumns to represent master-detail structures that are commonly referenced together
SuperColumns vs Relational schema
At the Cassandra Summit in August, Eben Hewitt gave a presentation on Cassandra Data Modelling. There’s a lot of nuance in that talk and in the topic, but a key point in Cassandra – as in many other NoSQL databases – is that you model data to match the queries you need to satisfy, rather than to a more theoretically "pure" normalized form. For relational guys, the process is most similar to radical denormalization in which you introduce redundancy to allow for efficient query processing.
For example, let’s consider the Oracle SH sample schema. Amongst other things, it includes SALES, PRODUCTS and CUSTOMERS:
We could map each Oracle table to a Cassandra ColumnFamily, but because there are no foreign key indexes or joins, such a Cassandra data model would not necessarily support the types of queries we want. For instance, if we want to query sales totals by customer ID, we should create a column family keyed by customer id, which contains SuperColumns named for each product which in turn includes columns for sales totals. It might look something like this:
|ID||CustomerDetails||Product Name #1||Product Name #2||…………..||Product Name #N|
Each customer “row” has super column for each product that contains the sales for that product. Not all customers have all the supercolumns - each customer has supercolumns only for each product they have purchased. The name of the SuperColumn is the name of the product.
Giving the column the name of the product is a major departure from how we would do things in Oracle. The name of a column or SuperColumn can be determined by the data, not by the schema - a concept completely alien to relational modelling.
Inserting into SuperColumns with Hector
To try and understand this, I created a Cassandra columnfamily of the type “Super”. Here’s my definition in the storage-conf.xml file:
And here is some of my Hector Java program, which reads sales totals for each customer from the Oracle sample schema, and inserts them into the ColumnFamily:
This code is fairly straightforward, but let’s step through it anyway:
|7-14||Execute the Oracle SQL to get product summaries for each customer|
|17||Loop through each row returned (one row per product per customer)|
|21||Check to see if this is a completely new customer|
|26-30||If it is a new customer, create the CustomerDetails SuperColumn for that customer. The SuperColumn name is “CustomerDetails” and it contains columns for Firstname and Lastname.|
Now we create a SuperColumn for a specfic product, still keyed to the customer. The SuperColumn name is set to the name of the product (line 37). Inside the supercolumn are placed columns “AmountSold” (lines 38-39) and “QuantitySold” (lines 40-41)
Inserting master detail relationships into a supercolumn column family was easy enough. I had a lot more difficulty writing code to query the data. The tricky part seems to be when you don’t know the name of the SuperColumn you want to read from. There's no direct equivalent to the JDBC ResultMetaData object to query the SuperColumn names - instead you create a "SuperSlice" predictate that defines a range of SuperColumns that you want to retrieve. It's a bit awkward to express the simple case in which you want to return all the SuperColumns.
Below is a bit of code which retrieves sales totals for a specific customer id. I suspect I've made a few newbie mistakes :-):
|8-9||Set the superColumn to the “CustomerDetails” supercolumn|
|11-14||Retrieve the column values (firstname, surname) for the CustomerDetails supercolumn|
|17-21||Set up a “SlicePredicate” that defines the supercolumns to be queried. I want to get all of the supercolumns (eg every product), so I set up an unbounded range (line 18) and supply that to the slice predicate (line 21)|
|23||Create a list of supercolumns. This will include all the SuperColumns in the column family (including, unfortunately, CustomerDetails)|
|27||Eliminate CustomerDetails from the result. Here we only want product names|
|30-35||Iterate through the columns in each supercolumn. THis will extract QuantitySold and AmountSold for each Product name|
Here’s some output from the Java program. It prints out customer Details and product sales totals for customer# 10100:
SuperColumns with Toad for Cloud Databases
Toad for cloud databases now has Cassandra support, which makes querying SuperColumns s a lot easier. SuperColumns that have dynamic names but uniform internal column structure (as in my example above) are represented by Toad for Cloud Databases as a detail table. To put it another way, Toad for Cloud Databases re-normalizes the data - displaying it in the format that we would typically use in an RDBMS.
So when we point Toad for Cloud databases at our SalesByCustomer column family, it maps the column family to two tables: one for CustomerDetails and the other - which by default it will call SalesByCustomersuper_column” – for product sales totals. We can rename the subtable and subtable key during the mapping phase to make it clearer that it represents product details.
Now if we want to extract product details for a particular customer, we can do a SQL join. Below we build the join in the query builder, but of course we could simply code the SQL by hand as we would for any NoSQL or SQL database supported by Toad for Cloud Databases:
And just to close the loop, here we can see that the Toad for Cloud databases query returns the same data as the Hector query:
All NoSQL databases require that we change the way we think about data modelling, and Cassandra is no exception. SuperColumns are an incredibly powerful construct, but I can’t say that I found them intuitive or easy. Hopefully APIs and tooling will evolve to make life easier for those of us coming from the relational world.