Cassandra filter based on secondary index - python

We have been using Cassandra for a while now and we are trying to get a really optimized table going that will be able to quickly query and filter about 100k rows.
Our model looks something like this:
class FailedCDR(Model):
    uuid = columns.UUID(partition_key=True, primary_key=True)
    num_attempts = columns.Integer(index=True)
    datetime = columns.Integer()
If I describe the table, it clearly shows that num_attempts is indexed.
CREATE TABLE cdrs.failed_cdrs (
uuid uuid PRIMARY KEY,
datetime int,
num_attempts int
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX index_failed_cdrs_num_attempts ON cdrs.failed_cdrs (num_attempts);
We want to be able to run a filter similar to this:
failed = FailedCDR.filter(num_attempts__lte=9)
But this happens:
QueryException: Where clauses require either a "=" or "IN" comparison with either a primary key or indexed field
How can we accomplish a similar task?

If you want to do a range query in CQL, you need the field to be a clustering column.
So you'll want the num_attempts field to be a clustering column.
Also if you want to do a single query, you need all the rows you want to query in the same partition (or a small number of partitions that you can access using an IN clause). Since you only have 100K rows, that is small enough to fit in one partition.
So you could define your table like this:
CREATE TABLE test.failed_cdrs (
partition int,
num_attempts int,
uuid uuid,
datetime int,
PRIMARY KEY (partition, num_attempts, uuid));
You would insert your data with a constant for the partition key, such as 1.
INSERT INTO failed_cdrs (uuid, datetime, num_attempts, partition)
VALUES ( now(), 123, 5, 1);
Then you can do range queries like this:
SELECT * from failed_cdrs where partition=1 and num_attempts >=8;
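Back in cqlengine terms, a model matching that table might look roughly like this (a sketch; the constant partition column and the lte filter are assumptions based on the schema above):
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class FailedCDR(Model):
    # Constant partition key so all rows land in one partition.
    partition = columns.Integer(partition_key=True)
    # Clustering columns: range queries on num_attempts become possible.
    num_attempts = columns.Integer(primary_key=True)
    uuid = columns.UUID(primary_key=True)
    datetime = columns.Integer()

failed = FailedCDR.filter(partition=1, num_attempts__lte=9)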
The drawback to this method is that to change the value of num_attempts, you need to delete the old row and insert a new row since you are not allowed to update key fields. You could do the delete and insert for that in a batch statement.
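With cqlengine, that delete-and-insert could be wrapped in a BatchQuery, something like this (a sketch assuming the model sketched above and an existing row old whose attempt count just changed):
from cassandra.cqlengine.query import BatchQuery

with BatchQuery() as b:
    # Re-insert the row under the new num_attempts, then drop the old one.
    FailedCDR.batch(b).create(partition=1, num_attempts=old.num_attempts + 1,
                              uuid=old.uuid, datetime=old.datetime)
    old.batch(b).delete()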
A better option that will become available in Cassandra 3.0 is to make a materialized view that has num_attempts as a clustering column, in which case Cassandra would take care of the delete and insert for you when you updated num_attempts in the base table. The 3.0 release is currently in beta testing.

Related

SQLAlchemy Efficient Subquery for Latest Value

The current value of an Entity's status attribute can be queried as the latest entry in an EntityHistory table for that entity, i.e.
Entities (id) <- EntityHistory (timestamp, entity_id, value)
How do I write an efficient SQLALchemy expression that eagerly loads the current value from the history table for all entities without resulting in N+1 queries?
I tried writing a property for my model, but this generates a query for each (N+1) when I iterate over it. To my knowledge, there is no way to solve this without a subquery, which still seems inefficient to me on the database.
Example EntityHistory data:
timestamp | entity_id | value
----------+-----------+------
    15:00 |         1 | x
    15:01 |         1 | y
    15:02 |         2 | x
    15:03 |         2 | y
    15:04 |         1 | z
So the current value for entity 1 would be z and for entity 2 it would be y. The backing database is Postgres.
I think you could use a column_property to load the latest value as an attribute of an Entities instance alongside the other column-mapped attributes:
from sqlalchemy import select
from sqlalchemy.orm import column_property
class Entities(Base):
    ...
    value = column_property(
        select([EntityHistory.value]).
        where(EntityHistory.entity_id == id).  # the id column from before
        order_by(EntityHistory.timestamp.desc()).
        limit(1).
        correlate_except(EntityHistory)
    )
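With that in place, the latest value comes back populated on each instance in the same round trip; a quick usage sketch:
for entity in session.query(Entities):
    # .value was fetched by the correlated subquery in the same SELECT.
    print(entity.id, entity.value)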
A subquery could of course also be used in a query instead of a column_property.
query = session.query(
Entities,
session.query(EntityHistory.value).
filter(EntityHistory.entity_id == Entities.id).
order_by(EntityHistory.timestamp.desc()).
limit(1).
label('value')
)
Performance would naturally depend on a proper index being in place:
Index('entityhistory_entity_id_timestamp_idx',
EntityHistory.entity_id,
EntityHistory.timestamp.desc())
In a way this is still your dreaded N+1, as the query uses a subquery per row, but it's hidden in a single round trip to the DB.
If on the other hand having value as a property of Entities is not necessary, in PostgreSQL you can join with a DISTINCT ON ... ORDER BY query to fetch the latest values:
# The same index from before speeds this up.
# Remember nullslast(), if timestamp can be NULL.
values = session.query(EntityHistory.entity_id,
                       EntityHistory.value).\
    distinct(EntityHistory.entity_id).\
    order_by(EntityHistory.entity_id, EntityHistory.timestamp.desc()).\
    subquery()
query = session.query(Entities, values.c.value).\
join(values, values.c.entity_id == Entities.id)
though in limited testing with dummy data the subquery-as-output-column always beat the join by a notable margin if every entity had values. On the other hand, if there were millions of entities and a lot of missing history values, then a LEFT JOIN was faster. I'd recommend testing both queries against your own data to see which suits it better. For random access of a single entity, given that the index is in place, the correlated subquery is faster. For bulk fetches: test.
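For completeness, the LEFT JOIN variant mentioned above would look roughly like this (a sketch reusing the values subquery from before):
query = session.query(Entities, values.c.value).\
    outerjoin(values, values.c.entity_id == Entities.id)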

What is the best way to store and query a very large number of variable-length lists in a MySQL database?

Maybe this question will be made more clear through an example. Let's say the dataset I'm working with is a whole bunch (several gigabytes) of variable-length lists of tuples, each associated with a unique ID and a bit of metadata, and I want to be able to quickly retrieve any of these lists by its ID.
I currently have two tables set up more or less like this:
TABLE list(
id VARCHAR PRIMARY KEY,
flavor VARCHAR,
type VARCHAR,
list_element_start INT,
list_element_end INT)
TABLE list_element(
id INT PRIMARY KEY,
value1 FLOAT,
value2 FLOAT)
To pull a specific list out of the database I currently do something like this:
SELECT list_element_start, list_element_end FROM list WHERE id = 'my_list_id'
Then I use the retrieved list_element_start and list_element_end values to get the list elements:
SELECT *
FROM list_element
WHERE id BETWEEN(my_list_element_start, my_list_element_end)
Of course, this works very fast, but I feel as though there's a better way to do this. I'm aware that I could have another column in list_element called list_id, and then do something like SELECT * FROM list_element WHERE list_id = 'my_list_id' ORDER BY id. However, it seems to me that having that extra column, as well as a foreign key index on that column, would take up a lot of unnecessary space.
Is there a simpler way to do this?
Apologies if this question has been asked before, but I was unable to locate the answer. I'd also like to use SQLAlchemy in Python to do all of this, if possible.
Thanks in advance!
BETWEEN is not a function, so I don't know what you think is going on there. Anyway... why not:
SELECT e.*
FROM list_element e
JOIN list l
  ON e.id BETWEEN l.list_element_start AND l.list_element_end
WHERE l.id = 'my_list_id'
Or am I missing something?
You can normalize each element of your array into a row. The following is the declarative style in SQLAlchemy that will give you a "MyList" object with flavor etc., and elements will be an actual Python list of each "MyElement" object. You could get fancier to weed out the extra id and idx within the returned element list, but this should be plenty fast.
Also, above you mixed VARCHAR and INT for your primary keys; not sure if that was just an oversight, but you ought not do that. Additionally, when handling large data sets, remember options like chunking: you can use offset and limit to work with smaller batches and process them iteratively (there's a usage sketch after the models below).
from sqlalchemy import Column, ForeignKey, Integer, String, PrimaryKeyConstraint
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class MyList(Base):
    __tablename__ = 'my_list'
    id = Column(Integer, primary_key=True)
    flavor = Column(String)
    list_type = Column(String)
    # Ordered collection of the element rows belonging to this list.
    elements = relationship('MyElement', order_by='MyElement.idx')

class MyElement(Base):
    __tablename__ = 'my_element'
    id = Column(Integer, ForeignKey('my_list.id'))
    idx = Column(Integer)
    val = Column(Integer)
    __table_args__ = (PrimaryKeyConstraint('id', 'idx'),)
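A quick usage sketch under those models (the session setup, list id 42, and the process() function are hypothetical), including the offset/limit chunking mentioned above:
# Fetch one list by id; elements come back ordered by idx via the relationship.
my_list = session.query(MyList).filter(MyList.id == 42).one()
for element in my_list.elements:
    print(element.idx, element.val)

# Chunked iteration over a very large element table.
chunk_size = 10000
offset = 0
while True:
    chunk = (session.query(MyElement)
             .order_by(MyElement.id, MyElement.idx)
             .offset(offset).limit(chunk_size).all())
    if not chunk:
        break
    process(chunk)  # hypothetical processing function
    offset += chunk_size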

Mysql custom column

I am trying to create a MySQL column that I want to be a five digit integer. I want the first two digits to come from my software and the last three to be generated by the database.
Example: store number 10 would get 10000, then 10001, 10002, and so on; store number 20 would get 20000, 20001, 20002, ...
Make the order_id an autoincrement field and then make a primary key on store_id and order_id (in that order).
This way the order_id will count separately for each store_id.
See this example:
http://sqlfiddle.com/#!2/33b3e/1
full code:
CREATE TABLE order_ticket_number ( id_store_ticket int(10) NOT NULL,
id_order_ticket int(10) AUTO_INCREMENT NOT NULL,
id_order int(10) unsigned NOT NULL default 0,
PRIMARY KEY (id_store_ticket,id_order_ticket)
)
ENGINE=myisam DEFAULT CHARSET=utf8;
INSERT INTO order_ticket_number (id_store_ticket) VALUES (10),(10),(20),(20);
Edit:
This can only be done with MyISAM and (apparently) not with InnoDB.
So I think there are two options: either handle this in your application logic, or create a MyISAM table just to handle the numbering. Once you have inserted in there, you'll know the order_id and you can insert it into the InnoDB table. Although this does not seem like the most elegant solution, I think it's far less error-prone than trying to generate the number yourself (race conditions).
The last thing you should ask yourself is why you want these numbers at all. Why not use a simple autoincrement for each order regardless of the store_id?
As suggested in the comments, do consider that approach. Simply have two columns and bind them through a UNIQUE constraint so there are no conflicts. If you look for the first id in store 10, simply use WHERE store_id = 10 AND other_id = 1. It's more logical, and you can write a simple function to output this as 10001:
function store_string($int_store_id, $int_other_id) {
$str = str_repeat('0', (2 - strlen($int_store_id))).$int_store_id;
$str .= str_repeat('0', (3 - strlen($int_other_id))).$int_other_id;
return $str;
}
(This is a PHP example, but simply look up strlen and str_repeat to get the idea.)
This gives you a lot of advantages, such as easier searching for either value, and the possibility to go beyond store_id 99 without having to alter all existing rows; you only change the output function.
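In Python the same formatting is a one-liner with format specifiers (a sketch; the function name is mine):
def store_string(store_id, other_id):
    # Zero-pad the store id to 2 digits and the per-store id to 3 digits.
    return "{:02d}{:03d}".format(store_id, other_id)

store_string(10, 1)   # '10001'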
Regarding the actual INSERT, you can run your inserts like this:
INSERT INTO table_name (store_id, other_id, third_value)
SELECT {$store_id}, (other_id + 1), {$third_value}
FROM (( SELECT other_id
FROM table_name
WHERE store_id = {$store_id}
ORDER BY other_id DESC)
UNION ALL
( SELECT '0')
LIMIT 1) AS h
And simply extend with more values the same way $third_value is used.
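If you're issuing that INSERT from Python rather than PHP, a hedged sketch with DB-API parameter binding (assuming a PyMySQL/MySQLdb-style connection and the hypothetical table_name from above, with store_id and third_value as Python variables) would be:
sql = """
    INSERT INTO table_name (store_id, other_id, third_value)
    SELECT %s, (other_id + 1), %s
    FROM ((SELECT other_id
           FROM table_name
           WHERE store_id = %s
           ORDER BY other_id DESC)
          UNION ALL
          (SELECT 0)
          LIMIT 1) AS h
"""
with connection.cursor() as cursor:
    cursor.execute(sql, (store_id, third_value, store_id))
connection.commit()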

How to insert arrays into a database?

In my previous question a lot of users wanted me to give some more data to toy with. So I got working on exporting all my data and processing it with Python, but then I realized: where do I leave all this data?
Well, I decided the best thing would be to stick them in a database, so at least I don't have to parse the raw files every time. But since I know nothing about databases, this is turning out to be quite confusing. I tried some tutorials to create a sqlite database, add a table and field, and insert my numpy.arrays, but I can't get it to work.
I have 35 different dogs and each dog has 24 measurements. Each measurement has an unknown number of contacts and consists of a 3D array (248 frames of the whole plate [255x63]) and a 2D array (the maximal values for each sensor of the plate [255x63]). Storing one value in a database wasn't a problem, but I couldn't figure out how to get my 2D arrays in there.
So my question is how should I order this in a database and insert my arrays into it?
You'll probably want to start out with a dogs table containing all the flat (non array) data for each dog, things which each dog has one of, like a name, a sex, and an age:
CREATE TABLE `dogs` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`name` VARCHAR(64),
`age` INT UNSIGNED,
`sex` ENUM('Male','Female')
);
From there, each dog "has many" measurements, so you need a dog_measurements table to store the 24 measurements:
CREATE TABLE `dog_measurements` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`dog_id` INT UNSIGNED NOT NULL,
`paw` ENUM ('Front Left','Front Right','Rear Left','Rear Right'),
`taken_at` DATETIME NOT NULL
);
Then whenever you take a measurement, you INSERT INTO dog_measurements (dog_id, taken_at) VALUES (?, NOW()); where ? is the dog's ID from the dogs table.
You'll then want tables to store the actual frames for each measurement, something like:
CREATE TABLE `dog_measurement_data` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`dog_measurement_id` INT UNSIGNED NOT NULL,
`frame` INT UNSIGNED,
`sensor_row` INT UNSIGNED,
`sensor_col` INT UNSIGNED,
`value` FLOAT
);
That way, for each of the 248 frames, you loop through each of the 255x63 sensors and store the value for that sensor along with the frame number in the database:
INSERT INTO `dog_measurement_data` (`dog_measurement_id`,`frame`,`sensor_row`,`sensor_col`,`value`) VALUES
(<measurement_id>, <frame_number>, <sensor_row>, <sensor_col>, <value>)
Obviously replace <measurement_id>, <frame_number>, <sensor_row>, <sensor_col>, and <value> with real values :-)
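In Python that loop might look roughly like this (a sketch assuming a DB-API connection such as PyMySQL as conn, a known measurement_id, and the 248x255x63 NumPy array as data):
rows = []
for frame in range(data.shape[0]):        # 248 frames
    for r in range(data.shape[1]):        # 255 sensor rows
        for c in range(data.shape[2]):    # 63 sensor columns
            rows.append((measurement_id, frame, r, c, float(data[frame, r, c])))

with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO dog_measurement_data "
        "(dog_measurement_id, frame, sensor_row, sensor_col, value) "
        "VALUES (%s, %s, %s, %s, %s)",
        rows)
conn.commit()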
So basically, each dog_measurement_data row is a single sensor value for a given frame. That way, to get all the sensor values for a given frame, you would:
SELECT `sensor_row`, `sensor_col`, `value` FROM `dog_measurement_data`
WHERE `dog_measurement_id`=<some measurement id> AND `frame`=<some frame number>
ORDER BY `sensor_row`, `sensor_col`
And this will give you all the rows and cols for that frame.
Django has a library for encapsulating all the database work into Python classes, so you don't have to mess with raw SQL until you have to do something really clever. Even though Django is a framework for web applications, you can use the database ORM by itself.
Josh's models would look like this in Python using Django:
from django.db import models
class Dog(models.Model):
    # Might want to look at storing a birthday instead of an age.
    # If you track age, you probably need another field telling
    # you when in the year age goes up by 1... and at that point,
    # you're really storing a birthday.
    name = models.CharField(max_length=64)
    age = models.IntegerField()
    genders = [
        ('M', 'Male'),
        ('F', 'Female'),
    ]
    gender = models.CharField(max_length=1, choices=genders)

class Measurement(models.Model):
    dog = models.ForeignKey(Dog, on_delete=models.CASCADE,
                            related_name="measurements")
    paws = [
        ('FL', 'Front Left'),
        ('FR', 'Front Right'),
        ('RL', 'Rear Left'),
        ('RR', 'Rear Right'),
    ]
    paw = models.CharField(max_length=2, choices=paws)
    taken_at = models.DateTimeField(auto_now_add=True)

class Measurement_Point(models.Model):
    measurement = models.ForeignKey(Measurement, on_delete=models.CASCADE,
                                    related_name="data_points")
    frame = models.IntegerField()
    sensor_row = models.PositiveIntegerField()
    sensor_col = models.PositiveIntegerField()
    value = models.FloatField()

    class Meta:
        ordering = ['frame', 'sensor_row', 'sensor_col']
The id fields are created automatically.
Then you can do things like:
dog = Dog()
dog.name = "Pochi"
dog.age = 3
dog.gender = 'M'
# dog.gender will return 'M', and dog.get_gender_display() will return 'Male'
dog.save()
# Or, written another way:
dog = Dog.objects.create(name="Fido", age=3, gender='M')
To take a measurement:
measurement = dog.measurements.create(paw='FL')
for frame in range(248):
    for row in range(255):
        for col in range(63):
            measurement.data_points.create(frame=frame, sensor_row=row,
                                           sensor_col=col, value=myData[frame][row][col])
Finally, to get a frame:
# For the sake of argument, assuming the dogs have unique names.
# If not, you'll need some more fields in the Dog model to disambiguate.
dog = Dog.objects.get(name="Pochi", gender='M')
# For example, grab the latest measurement...
measurement = dog.measurements.all().order_by('-taken_at')[0]
# `theFrameNumber` has to be set somewhere...
theFrame = measurement.data_points.filter(frame=theFrameNumber).values_list('value')
Note: this will return a list of tuples (e.g. [(1.5,), (1.8,), ... ]), since values_list() can retrieve multiple fields at once. I'm not familiar with NumPy, but I'd imagine it's got a function similar to Matlab's reshape function for remapping vectors to matrices.
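For example, to turn that flat result back into a 255x63 NumPy array for one frame (a sketch relying on the Meta ordering defined above):
import numpy as np

values = measurement.data_points.filter(frame=theFrameNumber).values_list('value', flat=True)
frame_array = np.array(values).reshape(255, 63)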
I think the part you haven't been able to figure out is how to put 2D data into a database.
If you think of the relation between two columns, you can treat it as 2D data, with the first column as the X-axis data and the second column as the Y-axis data. The same idea extends to 3D data.
Finally your db should look like this:
Table: Dogs
Columns: DogId, DogName -- contains data for each dog
Table: Measurements
Columns: DogId, MeasurementId, 3D_DataId, 2D_DataId -- contains measurements of each dog
Table: 3D_data
Columns: 3D_DataId, 3D_X, 3D_Y, 3D_Z -- contains all 3D data of a measurement
Table: 2D_data
Columns: 2D_DataId, 2D_X, 2D_Y -- contains all 2D data of a measurement
Also, you may want to store your 3D and 2D data in a particular order. In that case, you will have to add a column for that order to the 3D_data and 2D_data tables.
The only thing I would add to Josh's answer is that if you don't need to query individual frames or sensors, you can just store the arrays as BLOBs in the dog_measurement_data table. I have done this before with large binary sets of sensor data and it worked out well. You basically query the 2D and 3D arrays with each measurement and manipulate them in code instead of in the database.
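A minimal sketch of that approach, serializing the NumPy arrays with np.save/np.load so that dtype and shape survive the round trip (the function names are mine):
import io
import numpy as np

def array_to_blob(arr):
    # Serialize the array (dtype and shape included) to raw bytes for a BLOB column.
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()

def blob_to_array(blob):
    # Reverse of array_to_blob.
    return np.load(io.BytesIO(blob))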
I've benefited a lot from the sqlalchemy package; it is an Object Relational Mapper. What this means is that you can create a very clear and distinct separation between your objects and your data:
SQL databases behave less like object collections the more size and performance start to matter; object collections behave less like tables and rows the more abstraction starts to matter. SQLAlchemy aims to accommodate both of these principles.
You can create objects representing your different nouns (Dog, Measurement, Plate, etc.). Then you create a table via SQLAlchemy constructs that will contain all the data you want to associate with, say, a Dog object. Finally you create a mapper between the Dog object and the dog_table.
This is difficult to understand without an example, and I will not reproduce one here. Instead, please start by reading this case study and then study this tutorial.
Once you can think of your Dogs and Measurements as you do in the real world (that is, the objects themselves) you can start factoring out the data that makes them up.
Finally, try not to marry your data to a specific format (as you currently do by using numpy arrays). Instead, think in terms of the simple numbers and transform them on demand into the specific format your current application requires (along the lines of a Model-View-Controller paradigm).
Good luck!
From your description, I would highly recommend looking into PyTables. It's not a relational database in the traditional sense, but it has most of the features you're likely to use (e.g. querying) while allowing easy storage of large, multidimensional datasets and their attributes. As an added bonus, it's tightly integrated with NumPy.
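A small sketch of what that could look like (the file, group, and array names are hypothetical):
import numpy as np
import tables

h5 = tables.open_file('dog_data.h5', mode='w')
# One group per dog, one array per measurement.
dog = h5.create_group('/', 'dog_01', 'Measurements for dog 01')
frames = np.zeros((248, 255, 63))   # stand-in for a real measurement
h5.create_array(dog, 'measurement_01', frames, 'Whole plate, 248 frames')
h5.close()

# Reading back: tables.open_file('dog_data.h5').root.dog_01.measurement_01[:]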

Querying a view in SQLAlchemy

I want to know if SQLAlchemy has problems querying a view. If I query the view with normal SQL on the server like:
SELECT * FROM ViewMyTable WHERE index1 = '608_56_56';
I get a whole bunch of records. But with SQLAlchemy I get only the first one, even though the count is correct. I have no idea why.
This is my SQLAlchemy code.
myQuery = Session.query(ViewMyTable)
erg = myQuery.filter(ViewMyTable.index1 == index1.strip())
# Contains the correct number of all entries I found with that query.
totalCount = erg.count()
# Contains only the first entry I found with my query.
ergListe = erg.all()
If you've mapped ViewMyTable, the query will only return rows that have a fully non-NULL primary key. This behavior is specific to versions 0.5 and lower; on 0.6, if any of the primary key columns is non-NULL, the row is turned into an instance. Specify the flag allow_null_pks=True on your mappers to ensure that partial primary keys still count:
mapper(ViewMyTable, myview, allow_null_pks=True)
If, on the other hand, the rows returned have all NULLs for the primary key, then SQLAlchemy cannot create an entity since it can't place it in the identity map. You can instead get at the individual columns by querying for them specifically:
for id, index in session.query(ViewMyTable.id, ViewMyTable.index):
    print id, index
I was facing a similar problem: how to filter a view with SQLAlchemy. For the table:
t_v_full_proposals = Table(
'v_full_proposals', metadata,
Column('proposal_id', Integer),
Column('version', String),
Column('content', String),
Column('creator_id', String)
)
I'm filtering:
proposals = session.query(t_v_full_proposals).filter(t_v_full_proposals.c.creator_id != 'greatest_admin')
Hopefully this helps :)
