Is it possible to grab all values of a Cassandra composite key? - python

Say I have:
cur.execute("CREATE TABLE data_by_year ( device_id int, \
site_id text, year_id int, event_time timestamp, value float, \
PRIMARY KEY ((device_id, site_id, year_id),event_time))")
And I want to query all devices for years 2014 and 2013.
result=cur.execute("select distinct device_id, site_id, year_id,\
from data_by_year where device_id IN (324535, 32453l),\
and site_id in and year_id IN (2014)")
Obviously this statement has many issues, but it's the best example I could come up with. My beef is with the "where device_id IN (324535, 32453l)". In reality I will not know all the various devices, so I want to grab them "ALL". How do I do this?
I'm dealing with time series minute data so I felt that one year was a reasonable partition.

knifewine's answer is correct, but if you're going to be executing this query frequently (and want good performance), I suggest using a second table:
CREATE TABLE all_device_data_by_year (
site_id text,
year_id int,
device_id int,
event_time timestamp,
value float,
PRIMARY KEY ((site_id, year_id), device_id, event_time)
)
You might want to partition by day/month instead of year, depending on the number of devices.
Regarding automatic query paging support in the python driver, it's available right now in the 2.0 branch. I should have a 2.0-beta release ready soon.
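For illustration, pulling every device's rows for one site and year out of that table might look like this with the DataStax Python driver (the contact point, keyspace name, and site value are assumptions):
from cassandra.cluster import Cluster

# Minimal sketch: read all devices for one (site_id, year_id) partition.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')  # keyspace name is an assumption

rows = session.execute(
    "SELECT device_id, event_time, value "
    "FROM all_device_data_by_year "
    "WHERE site_id = %s AND year_id = %s",
    ('site-1', 2014),
)
for row in rows:
    print(row.device_id, row.event_time, row.value)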

You can grab everything using ALLOW FILTERING, but you should be aware that this is costly in terms of performance, because all nodes will need to answer back:
select distinct device_id, site_id, year_id from data_by_year ALLOW FILTERING;
The performance issue could be mitigated a bit by including a limit clause, but this won't allow you to page through all the data. If you want paging, you may want to use the datastax java driver with the paging feature (or wait for paging to land in the datastax python driver).
If none of the above will work for your use case, redesigning your table may be a better option (and possibly involving a secondary index but that can incur performance penalties as well).
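If you are on a driver version that does support paging (the 2.0 driver mentioned above), a rough sketch of paging through that query would be to set a fetch_size and let the driver request pages as you iterate (keyspace name and fetch size are assumptions):
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')  # keyspace name is an assumption

query = SimpleStatement(
    "select distinct device_id, site_id, year_id from data_by_year ALLOW FILTERING",
    fetch_size=1000,  # rows per page
)
for row in session.execute(query):
    # The driver transparently fetches the next page as iteration proceeds.
    print(row.device_id, row.site_id, row.year_id)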

Related

Batch inserting into multiple tables using DataStax model operations in Cassandra

Following DataStax's advice to 'use roughly one table per query pattern' as mentioned here, I have set up the same table twice, but keyed differently to optimize read times.
-- This table supports queries that filter on specific first_ids and a gt/lt filter on time
CREATE TABLE IF NOT EXISTS table_by_first_Id
(
first_id INT,
time TIMESTAMP,
second_id INT,
value FLOAT,
PRIMARY KEY (first_id, time, second_id)
);
-- Same table, but rearranged to filter on specific second_ids and the same gt/lt time filter
CREATE TABLE IF NOT EXISTS table_by_second_Id
(
second_id INT,
time TIMESTAMP,
first_id INT,
value FLOAT,
PRIMARY KEY (second_id, time, first_id)
);
Then, I have created 2 models using DataStax's Python driver, one for each table.
class ModelByFirstId (...)
class ModelBySecondId (...)
The Problem
I can't seem to figure out how to cleanly ensure that an insert into one of the tables is atomically paired with an insert into the other. The only thing I can think of is
def insert_some_data(...):
    ModelByFirstId.create(...)
    ModelBySecondId.create(...)
I'm looking to see if there's an alternative way to ensure that insertion into one table is reflected into the other - perhaps in the model or table definition, in order to hopefully protect against errant inserts into just one of the models.
I'm also open to restructuring or remaking my tables altogether to accommodate this if needed.
NoSQL databases designed for high availability and partition tolerance (the AP in CAP) are not built to provide strong referential integrity. Rather, they are designed for high-throughput, low-latency reads and writes. Cassandra itself has no concept of referential integrity across tables. That said, look into lightweight transactions (LWT) and the batches concept for your use case.
Some good material to read on this:
https://www.oreilly.com/content/cassandra-data-modeling/
https://docs.datastax.com/en/cql-oss/3.3/cql/cql_using/useBatch.html
Specifically for your use case, see whether you can use the single-table data model below:
CREATE TABLE IF NOT EXISTS table_by_Id
(
primary_id INT,
secondary_id INT,
time TIMESTAMP,
value FLOAT,
PRIMARY KEY (primary_id, secondary_id, time)
);
For each input record you can create two entries in the table: one with first_id as primary_id (and second_id as secondary_id), and a second record with second_id as primary_id (and first_id as secondary_id). Then use batch inserts (as mentioned in the documentation above). This might not be the best solution for your problem, but think about it.
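For example, a logged batch with the DataStax Python driver could write both mirror rows together (a sketch; the contact point, keyspace name, and helper function are assumptions):
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')  # keyspace name is an assumption

insert = session.prepare(
    "INSERT INTO table_by_Id (primary_id, secondary_id, time, value) "
    "VALUES (?, ?, ?, ?)"
)

def insert_both_directions(first_id, second_id, time, value):
    # Both orientations of the record go into one logged batch, so they are
    # applied together (at the cost of some extra write latency).
    batch = BatchStatement()
    batch.add(insert, (first_id, second_id, time, value))
    batch.add(insert, (second_id, first_id, time, value))
    session.execute(batch)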

Fastest way of checking whether a record exists

What is the fastest way of checking whether a record exists, when I know the primary key? select, count, filter, where or something else?
When you use count, the database has to continue the search even if it found the record, because a second one might exist.
So you should search for the actual record, and tell the database to stop after the first one.
When you ask to return data from the record, then the database has to read that data from the table. But if the record can be found by looking up the ID in an index, that table access would be superfluous.
So you should return nothing but the ID you're using to search:
SELECT id FROM MyTable WHERE id = ? LIMIT 1;
In any case, not reading the actual data and the LIMIT are both implied when you use EXISTS, which is also simpler to express in peewee:
SELECT EXISTS (SELECT * FROM MyTable WHERE id = ?);
MyTable.select().where(MyTable.id == x).exists()
You can check yourself via EXPLAIN QUERY PLAN which will tell you the cost & what it intends to do for a particular query.
Costs don't directly compare between runs, but you should get a decent idea of whether there are any major differences.
That being said, I would expect SELECT COUNT(id) FROM table WHERE table.id = 'KEY' is probably the ideal, as it will take advantage of any partial lookup ability (particularly fast in columnar databases like Amazon's Redshift) and the primary key indexing.
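To compare the candidates on your own schema, here is a small self-contained sqlite3 sketch of the EXPLAIN QUERY PLAN check suggested above (the table name and schema are placeholders):
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE MyTable (id INTEGER PRIMARY KEY, data TEXT)")

for query in (
    "SELECT EXISTS (SELECT 1 FROM MyTable WHERE id = ?)",
    "SELECT COUNT(id) FROM MyTable WHERE id = ?",
):
    print(query)
    for row in cur.execute("EXPLAIN QUERY PLAN " + query, (42,)):
        print("   ", row)  # the last column describes the lookup strategy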

Python sqlite3 user-defined queries (selecting tables)

I have a uni assignment where I'm implementing a database that users interact with over a webpage. The goal is to search for books given some criteria. This is one module within a bigger project.
I'd like to let users be able to select the criteria and order they want, but the following doesn't seem to work:
cursor.execute("SELECT * FROM Books WHERE ? REGEXP ? ORDER BY ? ?", [category, criteria, order, asc_desc])
I can't work out why, because when I go
cursor.execute("SELECT * FROM Books WHERE title REGEXP ? ORDER BY price ASC", [criteria])
I get full results. Is there any way to fix this without resorting to injection?
The data is organised in a table where the book's ISBN is a primary key, and each row has many columns, such as the book's title, author, publisher, etc. The user should be allowed to select any of these columns and perform a search.
Generally, SQL engines only support parameters for values, not for the names of tables, columns, etc. This is true of sqlite itself, and of Python's sqlite3 module.
The rationale behind this is partly historical (traditional clumsy database APIs had explicit bind calls where you had to say which column number you were binding with which value of which type, etc.), but mainly because there isn't much good reason to parameterize anything other than values.
On the one hand, you don't need to worry about quoting or type conversion for table and column names. On the other hand, once you start letting end-user-sourced text specify a table or column, it's hard to limit the harm it could do.
Also, from a performance point of view (and if you read the sqlite docs—see section 3.0—you'll notice they focus on parameter binding as a performance issue, not a safety issue), the database engine can reuse a prepared optimized query plan when given different values, but not when given different columns.
So, what can you do about this?
Well, generating SQL strings dynamically is one option, but not the only one.
First, this kind of thing is often a sign of a broken data model that needs to be normalized one step further. Maybe you should have a BookMetadata table, where you have many rows—each with a field name and a value—for each Book?
Second, if you want something that's conceptually normalized as far as this code is concerned, but actually denormalized (either for efficiency, or because to some other code it shouldn't be normalized)… functions are great for that. Register a wrapper with create_function, and you can pass parameters to that function when you execute the query.
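If you do end up building the SQL string dynamically for this assignment, one common pattern is to check the user-chosen column and sort direction against a whitelist and keep the search value as a bound parameter. A sketch, assuming the Books columns mentioned in the question:
# Hypothetical whitelists; the real names come from the Books schema.
ALLOWED_COLUMNS = {"isbn", "title", "author", "publisher", "price"}
ALLOWED_DIRECTIONS = {"ASC", "DESC"}

def search_books(cursor, category, criteria, order_by, direction):
    if category not in ALLOWED_COLUMNS or order_by not in ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    direction = direction.upper()
    if direction not in ALLOWED_DIRECTIONS:
        raise ValueError("direction must be ASC or DESC")
    # Only whitelisted identifiers are interpolated; the REGEXP pattern is
    # still passed as a parameter (REGEXP itself has to be registered on the
    # connection with create_function, as in the question's working query).
    sql = f"SELECT * FROM Books WHERE {category} REGEXP ? ORDER BY {order_by} {direction}"
    return cursor.execute(sql, (criteria,)).fetchall()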

Speed up massive insertion with subqueries for foreign keys

I have to insert massive data (from a Python programme into a SQLite DB), where many fields are validated via foreign keys.
The query looks like this, and I perform the insertion with executemany()
INSERT INTO connections_to_jjos(
connection_id,
jjo_error_id,
receiver_task_id,
sender_task_id
)
VALUES
(
:connection_id,
(select id from rtt_errors where name = :rtx_error),
(select id from tasks where name = :receiver_task),
(select id from tasks where name = :sender_task)
)
About 300 insertions take something like 15 seconds, which I think is way too much. In production, there will be blocks of roughly 1500 insertions at a time. In similar cases without subqueries for the foreign keys, the speed is unbelievable. It's quite clear that FKs will add overhead and slow down the process, but this is too much.
I could do a pre-query to catch all the foreign key id's, and then insert them directly, but I feel there must be a cleaner option.
On the other hand, I have read about the isolation level, and if I understand it correctly, it could be that before each SELECT query there is an automatic COMMIT to enforce integrity... which could slow down the process as well, but my attempts to work this way were totally unsuccessful.
Maybe I'm doing something essentially wrong with the FK's. How can I improve the performance?
ADDITIONAL INFORMATION
The query:
EXPLAIN QUERY PLAN select id from rtt_errors where name = '--Unknown--'
Outputs:
SEARCH TABLE rtt_errors USING COVERING INDEX sqlite_autoindex_rtt_errors_1 (name=?) (~1 rows)
I have created an index on rtt_errors.name, but apparently it is not being used.
In theory, Python's default COMMITs should not happen between consecutive INSERTs, but your extremely poor performance looks as if this is what is happening.
Set the isolation level to None, and then execute a pair of BEGIN/COMMIT commands once around all the INSERTs.
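A minimal sketch of that approach with the query from the question (the database path and the contents of the parameter dictionaries are assumptions):
import sqlite3

# isolation_level=None stops the sqlite3 module from issuing implicit
# BEGIN/COMMIT statements; we control the transaction explicitly.
conn = sqlite3.connect("data.db", isolation_level=None)
cur = conn.cursor()

rows = [
    # {"connection_id": 1, "rtx_error": "--Unknown--",
    #  "receiver_task": "task_a", "sender_task": "task_b"},
]

cur.execute("BEGIN")
cur.executemany(
    """
    INSERT INTO connections_to_jjos (
        connection_id, jjo_error_id, receiver_task_id, sender_task_id
    ) VALUES (
        :connection_id,
        (SELECT id FROM rtt_errors WHERE name = :rtx_error),
        (SELECT id FROM tasks WHERE name = :receiver_task),
        (SELECT id FROM tasks WHERE name = :sender_task)
    )
    """,
    rows,
)
cur.execute("COMMIT")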

Should I use a surrogate key (id= 1) or natural primary key (tag='sqlalchemy') for my sqlalchemy model?

On the database side, I gather that a natural primary key is preferable as long as it's not prohibitively long, which can cause indexing performance problems. But as I'm reading through projects that use sqlalchemy via google code search, I almost always find something like:
class MyClass(Base):
    __tablename__ = 'myclass'
    id = Column(Integer, primary_key=True)
If I have a simple class, like a tag, where I only plan to store one value and require uniqueness anyway, what do I gain through a surrogate primary key, when I'm using sqlalchemy? One of the SQL books I'm reading suggests ORM's are a legitimate use of the 'antipattern,' but the ORMs he envisions sound more like ActiveRecord or Django. This comes up a few places in my model, but here's one:
class Tag(Base):
    __tablename__ = 'tag'
    id = Column(Integer, primary_key=True)  # should I drop this and add primary_key to Tag.tag?
    tag = Column(Unicode(25), unique=True)
    ....
In my broader, relational model, Tag has multiple many-to-many relationships with other objects. So there will be a number of intermediate tables that have to store a longer key. Should I pick tag or id for my primary key?
Although ORMs and programming languages make some usages easier than others, I think that choosing a primary key is a database design problem unrelated to the ORM. It is more important to get the database schema right on its own grounds. Databases tend to live longer than the code that accesses them, anyway.
Search SO (and Google) for more general questions on how to choose a primary key, e.g.: https://stackoverflow.com/search?q=primary+key+natural+surrogate+database-design (Surrogate vs. natural/business keys, Relational database design question - Surrogate-key or Natural-key?, When not to use surrogate primary keys?, ...)
I assume that Tag table will not be very large or very dynamic.
In this case I would try to use tag as the primary key, unless there are important reasons to add a primary key that is invisible to the end user, e.g.:
poor performance under real world data (measured, not imagined),
frequent changes of tag names (but then, I'd still use some unique string based on first used tag name as key),
invisible behind-the-scenes merging of tags (but, see previous point),
problems with different collations -- comparing international data -- in your RDBMS (but, ...)
...
In general I have observed that people tend to err in both directions:
by using complex multi-field "natural" keys (where particular fields are themselves opaque numbers), when table rows have their own identity and would benefit from having their own surrogate IDs,
by introducing random numeric codes for everything, instead of using short meaningful strings.
Meaningful primary key values -- if possible -- will prove useful when browsing the database by hand. You won't need multiple joins to figure out your data.
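For illustration, the natural-key version of the Tag model from the question would drop the surrogate id and promote the tag string itself (a sketch, assuming a SQLAlchemy 1.4+ declarative setup):
from sqlalchemy import Column, Unicode
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Tag(Base):
    __tablename__ = 'tag'
    # The tag string is the primary key; association tables then store this
    # Unicode(25) value as their foreign key instead of a surrogate integer.
    tag = Column(Unicode(25), primary_key=True)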
Personally I prefer surrogate keys in most places; the two biggest reasons for this are 1) integer keys are generally smaller/faster and 2) updating data doesn't require cascades. That second point is a fairly important one for what you are doing; if there are several many-to-many tables referencing the tag table, then remember that if someone wants to update a tag (e.g., to fix a spelling/case mistake, or to use a more/less specific word, etc.), the update will need to be done across all of those tables at the same time.
I'm not saying that you should never use a natural key -- if I am certain that the natural key will never change, I will consider a natural key. Just be certain, otherwise it becomes a pain to maintain.
Whenever I see people (over)using surrogate keys, I remember Roy Hann's blog articles regarding this topic, especially the second and the third article:
http://community.actian.com/forum/blogs/rhann/127-surrogate-keys-part-2-boring-bit.html
http://community.actian.com/forum/blogs/rhann/128-surrogate-keys-part-3-surrogates-composites.html
I strongly suggest reading them, as these articles come from a person who has spent a few decades as a database expert.
Nowadays surrogate key usage reminds me of the early years of the 21st century, when people used XML for literally everything, both where it belonged and where it did not.
