Batch inserting into multiple tables using DataStax model operations in Cassandra - python

Following DataStax's advice to 'use roughly one table per query pattern' as mentioned here, I have set up two copies of the same table, keyed differently to optimize read times.
-- This table supports queries that filter on specific first_ids and a gt/lt filter on time
CREATE TABLE IF NOT EXISTS table_by_first_Id
(
first_id INT,
time TIMESTAMP,
second_id INT,
value FLOAT,
PRIMARY KEY (first_id, time, second_id)
);
-- Same table, but rearranged to filter on specific second_ids and the same gt/lt time filter
CREATE TABLE IF NOT EXISTS table_by_second_Id
(
second_id INT,
time TIMESTAMP,
first_id INT,
value FLOAT,
PRIMARY KEY (second_id, time, first_id)
);
Then, I have created 2 models using DataStax's Python driver, one for each table.
class ModelByFirstId (...)
class ModelBySecondId (...)
The Problem
I can't seem to figure out how to cleanly ensure atomicity when inserting into one of the tables to also insert into the other table. The only thing I can think of is
def insert_some_data(...):
    ModelByFirstId.create(...)
    ModelBySecondId.create(...)
I'm looking to see if there's an alternative way to ensure that insertion into one table is reflected into the other - perhaps in the model or table definition, in order to hopefully protect against errant inserts into just one of the models.
I'm also open to restructuring or remaking my tables altogether to accommodate this if needed.

NoSQL databases designed for high availability and partition tolerance (the AP of CAP) are not meant to provide strong referential integrity; rather, they are designed for high-throughput, low-latency reads and writes. Cassandra itself has no concept of referential integrity across tables, but do look at lightweight transactions (LWT) and batches for your use case.
Some good material to read on the topic:
https://www.oreilly.com/content/cassandra-data-modeling/
https://docs.datastax.com/en/cql-oss/3.3/cql/cql_using/useBatch.html
Specifically for your use case, see whether you can use the single-table data model below:
CREATE TABLE IF NOT EXISTS table_by_Id
(
primary_id INT,
secondary_id INT,
time TIMESTAMP,
value FLOAT,
PRIMARY KEY (primary_id, secondary_id, time)
);
For each input record you can create two entries in the table: one with first_id as primary_id (and second_id as secondary_id), and a second with second_id as primary_id (and first_id as secondary_id). Then use batch inserts, as described in the documentation above. This might not be the best solution for your problem, but think about it.
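For what it's worth, the same batch idea applies to the two models from the original question as well: cqlengine's BatchQuery groups both writes into one logged batch. A rough sketch, assuming the models are defined roughly as described in the question (a logged batch guarantees both writes are eventually applied together, but it is not an isolated transaction):
from datetime import datetime, timezone
from cassandra.cqlengine.query import BatchQuery

def insert_some_data(first_id, second_id, value):
    # Both inserts go into one logged batch: if the batch is accepted,
    # Cassandra replays it until both tables have received the write.
    now = datetime.now(timezone.utc)
    with BatchQuery() as b:
        ModelByFirstId.batch(b).create(
            first_id=first_id, time=now, second_id=second_id, value=value)
        ModelBySecondId.batch(b).create(
            second_id=second_id, time=now, first_id=first_id, value=value)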

Related

How to create a large number of tables

I've got a large dataset to work with to create a storage system to monitor movement in a store. There are over 300 products in that store, and the main structure of all the tables is the same; the only difference is the data inside. There's a larger database called StorageTF, and I want to create a lot of tables called Product_1, Product_2, Product_3, etc.
The table structure should look like
The main large data set (table) looks like this:
CREATE TABLE StoringTF (
Store_code INTEGER,
Store TEXT,
Product_Date TEXT,
Permission INTEGER,
Product_Code INTEGER,
Product_Name TEXT,
Incoming INTEGER,
Unit_Buying_Price INTEGER,
Total_Buying_Price INTEGER,
Outgoing INTEGER,
Unit_Sell_Price INTEGER,
Total_Sell_Price INTEGER,
Description TEXT)
I want the user to input a code in an entry called PCode
it looks like this
PCode = Entry(root, width=40)
PCode.grid(row=0,column=0)
Then a function compares the input with all the codes in the main table and picks out the table that has the same product_code.
So the sequence is: all the product tables for all product_codes in the main table will be created and will hold all the data from the main table that has the same product_code. Then, when the program is opened, the user inputs a product_code, and the program picks the table that has the same code and shows it to the user.
Thanks a lot. I know it's hard, but I really need your help and I'm certain you can help me.
The product table should look like
CREATE TABLE Product_x (Product_Code INTEGER,
Product_Name TEXT, --taken from main table from lines that has same product code
Entry_Date TEXT,
Permission_Number INTEGER,
Incoming INTEGER,
Outgoing INTEGER,
Description TEXT,
Total_Quantity_In_Store INTEGER, --which is main table's incoming - outgoing
Total_Value_In_Store INTEGER --main table's total_buying_price - total_sell_price
)
Thank you for your help and hope you can figure it out because I'm really struggling with it.
From your comment:
I think I'd select some columns from the main table, but I don't know how I'd update only some columns with selected columns from the main table where product code = PCode.get() (which is the entry box). Is that possible?
Yes, it is definitely possible to present only certain rows and columns of data to the user.
However, there are many patterns (i.e. programming techniques) that you could follow for presenting data to the user, but every common, best-practice technique always separates the backend data (i.e. database) from the user interface. It is not necessary to limit presentation of data to one entire table at a time. In most cases the data should never be presented and/or exposed to the user exactly as it appears in a table. Of course sometimes the data is simple and direct enough to do that, but most applications re-format and group data in different views for proper presentation. (Here the term view is meant as a very general, abstract term for representing data in alternative ways from how it is stored. I mention specific sqlite views below.)
The entire philosophy behind modern databases is for efficient, well-designed storage that can be queried to return just what data is appropriate for each application. Much of this capability is based on the host-language data models, but sqlite directly supports features to help with this. For instance, a view can be defined to select only certain columns and rows at a time (i.e. choose certain Product_Code values). A sqlite view is just an SQL query that is saved and can have certain properties and actions defined for it. By default, a sqlite view is read-only, but triggers can be defined to allow updates to the underlying tables via the view.
From my earlier comment: You should research data normalization. That is the key principle for designing relational databases. For instance, you should avoid duplicate data columns like Product_Name. That column should only be in the StoringTF. Calculated columns are also usually redundant and unnecessary--don't store the Total_Value_In_Store column, rather calculate it when needed by query and/or view. Having duplicate columns invites mismatched data or at least unnecessary care to make sure all columns are synced when one is updated. Instead you can just query joined tables to get related values.
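To make the view idea concrete, a rough sketch follows; it assumes the main table keeps the column names shown in your question, and the database file name and view name are placeholders:
import sqlite3

conn = sqlite3.connect("store.db")  # placeholder database file

# One view can replace the ~300 per-product tables: it reshapes the main
# table and derives the calculated columns instead of storing them.
conn.execute("""
    CREATE VIEW IF NOT EXISTS Product_View AS
    SELECT Product_Code,
           Product_Name,
           Product_Date AS Entry_Date,
           Permission AS Permission_Number,
           Incoming,
           Outgoing,
           Description,
           Incoming - Outgoing AS Total_Quantity_In_Store,
           Total_Buying_Price - Total_Sell_Price AS Total_Value_In_Store
    FROM StoringTF
""")

# Show only the rows for the code typed into the PCode entry box.
code = int(PCode.get())  # PCode is the Tkinter Entry from the question
rows = conn.execute(
    "SELECT * FROM Product_View WHERE Product_Code = ?", (code,)).fetchall()
If edits through the view are ever needed, an INSTEAD OF trigger can route them back to the underlying table.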
Honestly, these concepts can require much study before implementing properly. By all means, go forward with developing a solution that fits your needs, but a Stack Overflow answer is no place for full tutorials which I perceive that you might need. Really your question seems more about overall design and I think my answer can get you started on the right track. Anything more specific and you'll need to ask other questions later on.

Fastest way of checking whether a record exists

What is the fastest way of checking whether a record exists, when I know the primary key? select, count, filter, where or something else?
When you use count, the database has to continue the search even if it found the record, because a second one might exist.
So you should search for the actual record, and tell the database to stop after the first one.
When you ask to return data from the record, then the database has to read that data from the table. But if the record can be found by looking up the ID in an index, that table access would be superfluous.
So you should return nothing but the ID you're using to search:
SELECT id FROM MyTable WHERE id = ? LIMIT 1;
Both of these — not reading the actual data and stopping after the first match — are implied when you use EXISTS, which is also simpler to express in peewee:
SELECT EXISTS (SELECT * FROM MyTable WHERE id = ?);
MyTable.select().where(MyTable.id == x).exists()
You can check yourself via EXPLAIN QUERY PLAN which will tell you the cost & what it intends to do for a particular query.
Costs don't directly compare between runs, but you should get a decent idea of whether there are any major differences.
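If your backend happens to be SQLite, checking the plan from Python might look roughly like this (app.db and MyTable are placeholders for your own file and table):
import sqlite3

conn = sqlite3.connect("app.db")  # placeholder database file

# Ask SQLite how it would run the existence check; a primary-key lookup
# should appear as a SEARCH using an index or the rowid.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM MyTable WHERE id = ? LIMIT 1",
    (42,)).fetchall()
for row in plan:
    print(row)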
That being said, I would expect SELECT COUNT(id) FROM table WHERE table.id = 'KEY' is probably ideal, as it will take advantage of any partial lookup ability (particularly fast in columnar databases like Amazon's Redshift) and the primary key indexing.

How can I prevent certain combinations of values in a table, derived from the values already in the table?

I'm developing a web app for storing and managing tasks, using Flask-SQLAlchemy to talk to the backend. I want to use a table to store priority comparisons between pairs of tasks in order to construct a partial ordering. The table will have two columns, one for the lesser element and one for the greater element, and both columns together will form the primary key.
My code so far looks like this:
class PriorityPair(db.Model):
    lesser_id = db.Column(db.Integer, db.ForeignKey('task.id'),
                          primary_key=True)
    lesser = db.relationship('Task', foreign_keys=[lesser_id])
    greater_id = db.Column(db.Integer, db.ForeignKey('task.id'),
                           primary_key=True)
    greater = db.relationship('Task', foreign_keys=[greater_id])

    def __init__(self, lesser, greater):
        self.lesser = lesser
        self.greater = greater
All told, this should be sufficient for what I want to do with the table, but there's a problem in that inconsistent rows might be inserted. Suppose I have two tasks, A and B. If task A is of greater priority than task B, I could do the following:
pair = PriorityPair(task_b, task_a)
db.session.add(pair)
db.session.commit()
and the relation between the two would be stored as desired. But if at some future point, the opposite relation, PriorityPair(task_a, task_b), were to be inserted into the table, that would be inconsistent. The two tasks can't be greater in importance than each other at the same time.
Now I could probably prevent this in python code, but I'd like to be sure that the DB table itself is guaranteed to remain consistent. Is it possible to put (via Flask-SqlAlchemy) some kind of constraint on the table, so that if (A,B) is already present, then (B,A) will be automatically rejected? And would such a constraint work across DB backends?
No and No.
This is not possible. SqlAlchemy has support for CHECK constraints, but the expression of the check is given as a string. It will require a sub-query, something like (greater_id, lesser_id) not in (select sub.lesser_id, sub.greater_id from priority_pair as sub). And the underlying database backends will prevent it:
MySQL ignores all CHECK constraints.
SQLite does not allow sub-queries in CHECK constraints.
PostgreSQL does not allow sub-queries in CHECK constraints.
Oracle does not allow sub-queries in CHECK constraints.
Instead, you must find some other solution, whether it's triggers or just changing the whole model, which is what I decided to do.
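If a database-level guarantee is out of reach, one application-level sketch is to funnel every insert through a helper that checks for the reverse pair first (add_priority is just a name I made up here, and this does not close the race window that a trigger or a serialized transaction would):
def add_priority(session, lesser, greater):
    # Refuse the insert if the opposite ordering is already stored.
    reverse_exists = (
        session.query(PriorityPair)
        .filter(PriorityPair.lesser_id == greater.id,
                PriorityPair.greater_id == lesser.id)
        .first() is not None)
    if reverse_exists:
        raise ValueError("conflicting priority pair already stored")
    session.add(PriorityPair(lesser, greater))
    session.commit()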

Is it possible to grab all values of a Cassandra composite key?

Say I have:
cur.execute("CREATE TABLE data_by_year ( device_id int, \
site_id text, year_id int, event_time timestamp, value float, \
PRIMARY KEY ((device_id, site_id, year_id),event_time))")
And I want to query all devices for years 2014 and 2013.
result=cur.execute("select distinct device_id, site_id, year_id,\
from data_by_year where device_id IN (324535, 32453l),\
and site_id in and year_id IN (2014)")
Obviously this statement has many issues, but it's the best example I could come up with. My beef is with the "where device_id IN (324535, 32453l)" part. In reality I will not know all the various devices, so I want to grab them "ALL". How do I do this?
I'm dealing with time series minute data so I felt that one year was a reasonable partition.
knifewine's answer is correct, but if you're going to be executing this query frequently (and want good performance), I suggest using a second table:
CREATE TABLE all_device_data_by_year (
site_id text,
year_id int,
device_id int,
event_time timestamp,
value float,
PRIMARY KEY ((site_id, year_id), device_id, event_time)
)
You might want to partition by day/month instead of year, depending on the number of devices.
Regarding automatic query paging support in the python driver, it's available right now in the 2.0 branch. I should have a 2.0-beta release ready soon.
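With that table in place, pulling every device's data for a given site and year becomes a single-partition query. A rough sketch with the DataStax Python driver, where the contact point and keyspace name are placeholders:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

# (site_id, year_id) is the partition key, so this hits one partition and
# returns rows for every device, ordered by device_id and event_time.
rows = session.execute(
    "SELECT device_id, event_time, value FROM all_device_data_by_year "
    "WHERE site_id = %s AND year_id = %s",
    ('site-1', 2014))
for row in rows:
    print(row.device_id, row.event_time, row.value)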
You can grab everything using ALLOW FILTERING, but should be aware that this is costly in terms of performance because all nodes will need to answer back:
select distinct device_id, site_id, year_id from data_by_year ALLOW FILTERING;
The performance issue could be mitigated a bit by including a LIMIT clause, but this won't allow you to page through all the data. If you want paging, you may want to use the DataStax Java driver with the paging feature (or wait for paging to land in the DataStax Python driver).
If none of the above will work for your use case, redesigning your table may be a better option (and possibly involving a secondary index but that can incur performance penalties as well).
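For completeness, once paging is available in the Python driver (the 2.0 line mentioned above), iterating over all partition keys with a fetch size might look roughly like this, assuming the session is already set up:
from cassandra.query import SimpleStatement

# The driver fetches 1000 rows per page and transparently requests the
# next page as the result set is iterated.
query = SimpleStatement(
    "SELECT DISTINCT device_id, site_id, year_id FROM data_by_year",
    fetch_size=1000)
for row in session.execute(query):
    print(row.device_id, row.site_id, row.year_id)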

Speed up massive insertion with subqueries for foreign keys

I have to insert massive data (from a Python programme into a SQLite DB), where many fields are validated via foreign keys.
The query looks like this, and I perform the insertion with executemany()
INSERT INTO connections_to_jjos(
connection_id,
jjo_error_id,
receiver_task_id,
sender_task_id
)
VALUES
(
:connection_id,
(select id from rtt_errors where name = :rtx_error),
(select id from tasks where name = :receiver_task),
(select id from tasks where name = :sender_task)
)
About 300 insertions take something like 15 seconds, which I think is way too much. In production, there would be blocks of around 1500 insertions in bulk. In similar cases without subqueries for the foreign keys, the speed is unbelievable. It's quite clear that FKs add overhead and slow down the process, but this is too much.
I could do a pre-query to catch all the foreign key id's, and then insert them directly, but I feel there must be a cleaner option.
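For reference, the pre-query approach I have in mind would look roughly like this (build_lookup and data.db are just names for the sketch, and records stands in for whatever currently feeds executemany()):
import sqlite3

def build_lookup(conn, table):
    # Map name -> id once, so the inserts can skip the per-row subqueries.
    return dict(conn.execute("SELECT name, id FROM " + table))

conn = sqlite3.connect("data.db")  # placeholder file name
errors = build_lookup(conn, "rtt_errors")
tasks = build_lookup(conn, "tasks")

rows = [(rec["connection_id"],
         errors[rec["rtx_error"]],
         tasks[rec["receiver_task"]],
         tasks[rec["sender_task"]])
        for rec in records]  # `records` stands in for the executemany() input
conn.executemany(
    "INSERT INTO connections_to_jjos"
    "(connection_id, jjo_error_id, receiver_task_id, sender_task_id)"
    " VALUES (?, ?, ?, ?)", rows)
conn.commit()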
On the other hand, I have read about the isolation level, and if I understand it correctly, there may be an automatic COMMIT before each SELECT query to enforce integrity; that could be slowing down the process as well, but my attempts to work this way were totally unsuccessful.
Maybe I'm doing something essentially wrong with the FK's. How can I improve the performance?
ADDITIONAL INFORMATION
The query:
EXPLAIN QUERY PLAN select id from rtt_errors where name = '--Unknown--'
Outputs:
SEARCH TABLE rtt_errors USING COVERING INDEX sqlite_autoindex_rtt_errors_1 (name=?) (~1 rows)
I have created an index on rtt_errors.name, but apparently it is not being used.
In theory, Python's default COMMITs should not happen between consecutive INSERTs, but your extremely poor performance looks as if this is what is happening.
Set the isolation level to None, and then execute a pair of BEGIN/COMMIT commands once around all the INSERTs.
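A minimal sketch of that suggestion with the sqlite3 module, where data stands in for the parameter sequence you already pass to executemany():
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the sqlite3
# module stops issuing its own implicit BEGIN/COMMIT around statements.
conn = sqlite3.connect("data.db", isolation_level=None)

conn.execute("BEGIN")
conn.executemany("""
    INSERT INTO connections_to_jjos(
        connection_id, jjo_error_id, receiver_task_id, sender_task_id)
    VALUES (
        :connection_id,
        (SELECT id FROM rtt_errors WHERE name = :rtx_error),
        (SELECT id FROM tasks WHERE name = :receiver_task),
        (SELECT id FROM tasks WHERE name = :sender_task)
    )
""", data)
conn.execute("COMMIT")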
