A Cassandra column family is a key aspect of the Apache Cassandra data model. Cassandra runs a distributed data store across several machines within a cluster, their outermost container. Keyspace is the outermost Cassandra data container. A keyspace is a container for one or more column families, and each column family can contain one or more rows (each of which contains one or more ordered columns). The replication factor determines the total number of copies of keyspace data stored in the cluster.
In the relational model, a schema is fixed. Once certain table columns are defined, all columns in every row must have at least a null value. However, in Cassandra, columns are not defined (although column families are), so it’s possible to add any column to a column family at any time. Cassandra does not force all columns on individual rows.
In Cassandra, a table can be defined as a super column family or can contain columns. In Cassandra, a column family:
Cassandra Column Family FAQs
In the Cassandra data model, data is stored in clusters. Cassandra clusters are the outermost container of the distributed Cassandra database which operates across multiple machines. Each machine has its own replica in case of failure, acting as a node. These nodes are arranged in a ring format as a cluster.
The Cassandra Keyspace is a data container in the Cassandra data model, and has these basic attributes:
A Cassandra column family consists of a collection of ordered columns in rows which represent a structured version of the stored data. The keyspace holds these Cassandra column families and each keyspace has at least one column family.
Columns are the basic data structure unit in Cassandra and they store three values: column or key name, timestamp, and value. The columns in rows can be removed or added at will, unlike the predefined Cassandra column families which cannot be changed.
Cassandra column family properties include:
A column is the basic Cassandra data structure or the smallest increment of data in Cassandra. Each column in a Cassandra column family database has three main values: key or column name, value, and an uneditable time stamp that reflects when the column was updated.
Expiring columns are those given an expiration date in advance.
Super columns create another level of nesting more granular than the standard column family structure, grouping columns together logically based on business needs. A super column family structure stores a map of related sub-columns to optimize performance for columns likely to be queried together.
A static column family defines names and Cassandra column family data types. It is used to store values that are shared by all rows in the same partition. It is called a static column because the number of columns available is known.
In contrast, a dynamic column family does not define column names in advance, leaving Cassandra’s capability for storing data with arbitrary application and column names available.
There is no difference between column families and tables in Cassandra. In the older Thrift API, the name “column family” is used, while the newer CQL API uses the name “table.” However, this is only true for a Cassandra table vs column family because there are differences between a column family and a relational database table.
A Cassandra column family is not exactly equivalent to relational database tables for several reasons.
A Cassandra column family is schema-free, and since it doesn’t follow any schema, depending on your needs, you can add any column to any column family freely at any time. A Cassandra column family also has a comparator, whose value indicates how to order and sort columns when they are returned in a query.
Unlike RDBMS tables, it’s important to define and keep related columns together in the same family because Cassandra column families are each stored in separate files on disk. And finally, a Cassandra column family can be defined as a super column family or hold columns itself, while the user supplies the values for a relational table which defines only columns.
Cassandra column family design is a key component of Cassandra data modeling.
Two primary goals for modeling the data structure are:
Three top resources for learning more about how to design Cassandra column families include:
Wide Column Store NoSQL vs SQL Data Modeling video: NoSQL schemas are designed with very different goals in mind than SQL schemas. Where SQL normalizes data, NoSQL denormalizes. Where SQL joins ad-hoc, NoSQL pre-joins. And where SQL tries to push performance to the runtime, NoSQL bakes performance into the schema. Join us for an exploration of the core concepts of NoSQL schema design.
Data Modeling and Application Development training course: This is an intermediate level course that explains basic and advanced data modeling techniques including information on workflow application, query analysis, denormalization and other NoSQL data modeling topics. After completing this course, you will be able to perform workflow application and query analysis, explain commonly used data types, understand collections and UDTs, and understand denormalization.
Data Modeling Best Practices: Migrating SQL Schemas for Wide Column NoSQL: To maximize the benefits of wide column databases like Cassandra or ScyllaDB, you must adapt the structure of your data. Data modeling for wide column databases should be query-driven based on your access patterns– a very different approach than normalization for SQL tables. In this video, you will learn how tools can help you migrate your existing SQL structures to accelerate your digital transformation and application modernization.
In Cassandra, create a column family by using the CREATE TABLE command. Create table schema using cqlsh. Define the primary key. Name the table and specify the keyspace that contains its name. Refer to Apache Cassandra Documentation for more information.
ScyllaDB is a modern high-performance NoSQL wide column store database that is API-compatible with Apache Cassandra.
Cassandra was revolutionary when it first debuted in 2008, leading to its broad adoption. However, more than a decade later, many companies have recognized its underlying limitations and have now moved on. Leading companies such as Discord, Comcast, Fanatics, Expedia, Samsung, and Rakuten have replaced Cassandra with ScyllaDB. ScyllaDB delivers on the original vision of NoSQL — without the architectural downsides associated with Apache Cassandra (or the costs at volume of databases like Amazon DynamoDB). ScyllaDB is built with deep knowledge of the underlying Linux operating system and architectural advancements that enable consistently high performance at extreme scale.
Access white papers, benchmarks, and engineer perspectives on ScyllaDB vs Apache Cassandra.