Friday, February 22, 2013

Cassandra Comparators and Validators


About Data Types (Comparators and Validators)

In a relational database, you must specify a data type for each column when you define a table. The data type constrains the values that can be inserted into that column. For example, if you have a column defined as an integer datatype, you would not be allowed to insert character data into that column. Column names in a relational database are typically fixed labels (strings) that are assigned when you define the table schema.

In Cassandra, the data type for a column (or row key) value is called a validator. The data type for a column name is called a comparator. You can define data types when you create your column family schemas (which is recommended), but Cassandra does not require it. Internally, Cassandra stores column names and values as hex byte arrays (BytesType). This is the default client encoding used if data types are not defined in the column family schema (or if not specified by the client request).

Cassandra comes with the following built-in data types, which can be used as both validators (row key and column value data types) or comparators (column name data types). One exception is CounterColumnType, which is only allowed as a column value (not allowed for row keys or column names).
Internal Type CQL Name Description
BytesType
blob
Arbitrary hexadecimal bytes (no validation)
AsciiType ascii US-ASCII character string
UTF8Type text, varchar UTF-8 encoded string
IntegerType varint Arbitrary-precision integer
LongType int, bigint 8-byte long
UUIDType uuid Type 1 or type 4 UUID
DateType timestamp Date plus time, encoded as 8 bytes since epoch
BooleanType boolean true or false
FloatType float 4-byte floating point
DoubleType double 8-byte floating point
DecimalType decimal Variable-precision decimal
CounterColumnType counter Distributed counter value (8-byte long)



About Validators
For all column families, it is best practice to define a default row key validator using the key_validation_class property.

For static column families, you should define each column and its associated type when you define the column family using the column_metadata property.

For dynamic column families (where column names are not known ahead of time), you should specify a default_validation_class instead of defining the per-column data types.

Key and column validators may be added or changed in a column family definition at any time. If you specify an invalid validator on your column family, client requests that respect that metadata will be confused, and data inserts or updates that do not conform to the specified validator will be rejected.

About Comparators
Within a row, columns are always stored in sorted order by their column name. The comparator specifies the data type for the column name, as well as the sort order in which columns are stored within a row. Unlike validators, the comparator may not be changed after the column family is defined, so this is an important consideration when defining a column family in Cassandra.

Typically, static column family names will be strings, and the sort order of columns is not important in that case. For dynamic column families, however, sort order is important. For example, in a column family that stores time series data (the column names are timestamps), having the data in sorted order is required for slicing result sets out of a row of columns.


No comments:

Post a Comment