I'm attempting to connect to Cassandra in order to do a bulk insert. However, when I attempt to connect, I get an error.
The code I'm using:
from pycassa import columnfamily
from pycassa import pool
cassandra_ips = ['<an ip addr>']
conpool = pool.ConnectionPool('my_keyspace', cassandra_ips)
colfam = columnfamily.ColumnFamily(conpool, 'my_table')
However, this fails on the last line with:
pycassa.cassandra.ttypes.NotFoundException: NotFoundException(_message=None, why='Column family my_table not found.')
The column family definitely exists:
cqlsh> use my_keyspace
... ;
cqlsh:my_keyspace> desc tables;
my_table
cqlsh:my_keyspace>
And I don't think this is a simple typo in the table name: I've checked it a dozen times, and there is also this:
In [3]: sys_mgr = pycassa.system_manager.SystemManager(cassandra_ips[0])
In [4]: sys_mgr.get_keyspace_column_families('my_keyspace')
Out[4]: {}
Why is that {}?
If it matters:
The table/column family was created using CQL.
The table is currently empty.
The table was roughly created using:
CREATE TABLE my_table (
user_id int,
year_month int,
t timestamp,
<tons of other attributes>
PRIMARY KEY ((user_id, year_month), t)
) WITH compaction =
{ 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
In order to access CQL3 tables via a Thrift API such as pycassa, the tables must be created using compact storage.
CREATE TABLE my_table (
...
) WITH COMPACT STORAGE;
With regards to the primary keys, from the docs:
Using the compact storage directive prevents you from defining more
than one column that is not part of a compound primary key.
You are currently using a composite partition key, but enabling compact storage limits you to a compound partition key. So you will not have to limit it to a single column; it just has to be part of the compound key.
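Once the table is recreated WITH COMPACT STORAGE, a quick sanity check is to ask the Thrift API whether it can see the column family now; a minimal sketch reusing the SystemManager call from the question:
import pycassa

sys_mgr = pycassa.system_manager.SystemManager('<an ip addr>')
# After recreating the table WITH COMPACT STORAGE it should show up here,
# instead of the empty {} seen in the question
print sys_mgr.get_keyspace_column_families('my_keyspace').keys()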
This can also happen after creating a column family with an uppercase (quoted) name:
https://docs.datastax.com/en/cql/3.0/cql/cql_reference/ucase-lcase_r.html
I had a keyspace with a strange mix of quoted CFs:
cqlsh:testkeyspace> DESC TABLES;
"Tabletest" users "PlayerLastStats"
I got an error from pycassa's system_manager.create_column_family(...), but only when the column_validation_classes parameter was present:
pycassa.cassandra.ttypes.NotFoundException: NotFoundException(_message=None, why='Column family NewTable not found.')
After renaming all the tables to lowercase, everything looked good:
cqlsh:testkeyspace> DESC TABLES;
tabletest newtable users playerlaststats
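To see exactly which (case-sensitive) names the Thrift API exposes, you can list them through pycassa before opening a ColumnFamily; a small sketch assuming the keyspace from the example above:
from pycassa.system_manager import SystemManager

sys_mgr = SystemManager('<an ip addr>')
# The dict keys are the exact names Thrift knows the column families by;
# a quoted, mixed-case CQL name must be passed to ColumnFamily() verbatim
print sys_mgr.get_keyspace_column_families('testkeyspace').keys()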
Related
This question is a bit related to another question: Get List of Primary Key Columns in Snowflake.
INFORMATION_SCHEMA.COLUMNS does not provide the required information about primary keys, and the method proposed by Snowflake itself, where you describe the table and then run a RESULT_SCAN, is unreliable when queries are run in parallel.
I was thinking about using SHOW PRIMARY KEYS IN DATABASE. This works great when querying the database from within Snowflake. But as soon as I try to do this in Python, I get results for the column name like 'Built-in function id', which is not useful when dynamically generating SQL statements.
The code I am using is as follows:
SQL_PK = "SHOW PRIMARY KEYS IN DATABASE;"
snowflake_service = SnowflakeService(username=cred["username"], password=cred["password"])
snowflake_service.connect(database=DATABASE,role=ROLE, warehouse=WAREHOUSE)
curs = snowflake_service.cursor
primary_keys = curs.execute(SQL_PK).fetchall()
curs.close()
snowflake_service.connection.close()
Is there something I am doing wrong? Is it even possible to do it like this?
Or is the solution that Snowflake provides reliable enough when these queries are sent as one string? Although with many tables, many round trips will be required to get all the data needed.
where you would describe the table followed by a result_scan, is unreliable when queries are run in parallel
You could search for the specific query run using information_schema.query_history_by_session and then refer to its result set using the retrieved QUERY_ID.
SHOW PRIMARY KEYS IN DATABASE;
-- find the newest occurrence of `SHOW PRIMARY KEYS`:
SET queryId = (SELECT QUERY_ID
FROM TABLE(information_schema.query_history_by_session())
WHERE QUERY_TEXT LIKE '%SHOW PRIMARY KEYS IN DATABASE%'
ORDER BY END_TIME DESC LIMIT 1);
SELECT * FROM TABLE(RESULT_SCAN($queryId));
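From Python you can avoid the LIKE search over the session history entirely: a plain snowflake-connector-python cursor exposes the query ID of the statement it just executed as cursor.sfqid, and that ID can be fed straight into RESULT_SCAN. A rough sketch reusing the question's SnowflakeService wrapper, assuming its cursor is a standard connector cursor:
SQL_PK = "SHOW PRIMARY KEYS IN DATABASE"

snowflake_service = SnowflakeService(username=cred["username"], password=cred["password"])
snowflake_service.connect(database=DATABASE, role=ROLE, warehouse=WAREHOUSE)
curs = snowflake_service.cursor

curs.execute(SQL_PK)
query_id = curs.sfqid  # query ID of the SHOW statement that just ran

# RESULT_SCAN exposes the SHOW output as a normal result set,
# with proper column names such as table_name and column_name
curs.execute("SELECT * FROM TABLE(RESULT_SCAN('{0}'))".format(query_id))
primary_keys = curs.fetchall()

curs.close()
snowflake_service.connection.close()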
tl;dr: I'm trying to figure out the most efficient way to SELECT a record, or INSERT it if it doesn't already exist, in a way that works with multiple concurrent connections.
The situation:
I'm constructing a Postgres database (9.3.5, x64) containing a whole bunch of information associated with a customer. This database features a "customers" table that contains an "id" column (SERIAL PRIMARY KEY), and a "system_id" column (VARCHAR(64)). The id column is used as a foreign key in other tables to link to the customer. The "system_id" column must be unique, if it is not null.
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
system_id VARCHAR(64),
name VARCHAR(256));
An example of a table that references the id in the customers table:
CREATE TABLE tsrs (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
filename VARCHAR(256) NOT NULL,
name VARCHAR(256),
timestamp TIMESTAMP WITHOUT TIME ZONE);
I have written a python script that uses the multiprocessing module to push data into the database through multiple connections (from different processes).
The first thing that each process needs to do when pushing data into the database is to check if a customer with a particular system_id is in the customers table. If it is, the associated customer.id is cached. If it's not already in the table, a new row is added, and the resulting customer.id is cached. I have written an SQL function to do this for me:
CREATE OR REPLACE FUNCTION get_or_insert_customer(p_system_id customers.system_id%TYPE, p_name customers.name%TYPE) RETURNS customers.id%TYPE AS $$
DECLARE
v_id customers.id%TYPE;
BEGIN
LOCK TABLE customers IN EXCLUSIVE MODE;
SELECT id INTO v_id FROM customers WHERE system_id=p_system_id;
IF v_id is NULL THEN
INSERT INTO customers(system_id, name)
VALUES(p_system_id,p_name)
RETURNING id INTO v_id;
END IF;
RETURN v_id;
END;
$$ LANGUAGE plpgsql;
The problem: The table locking was the only way I was able to prevent duplicate system_ids being added to the table by concurrent processes. This isn't really ideal as it effectively serialises all the processing at this point, and basically doubles the amount of time that it takes to push a given amount of data into the db.
I wanted to ask if there is a more efficient/elegant way to implement the "SELECT or INSERT" mechanism that wouldn't cause as much of a slowdown. I suspect that there isn't, but figured it was worth asking, just in case.
Many thanks for reading this far. Any advice is much appreciated!
I managed to rewrite the function in plain SQL, changing the order of operations (avoiding the IF and the potential race condition):
CREATE OR REPLACE FUNCTION get_or_insert_customer
( p_system_id customers.system_id%TYPE
, p_name customers.name%TYPE
) RETURNS customers.id%TYPE AS $func$
LOCK TABLE customers IN EXCLUSIVE MODE;
INSERT INTO customers(system_id, name)
SELECT p_system_id,p_name
WHERE NOT EXISTS (SELECT 1 FROM customers WHERE system_id = p_system_id)
;
SELECT id
FROM customers WHERE system_id = p_system_id
;
$func$ LANGUAGE sql;
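For completeness, this is roughly how each worker process can call the function; a sketch assuming psycopg2 and one connection per process (the DSN and sample values are placeholders):
import psycopg2

def get_customer_id(conn, system_id, name):
    # The function does the select-or-insert server side; each call is
    # committed so other workers see the new row as soon as possible.
    with conn.cursor() as cur:
        cur.execute("SELECT get_or_insert_customer(%s, %s)", (system_id, name))
        customer_id = cur.fetchone()[0]
    conn.commit()
    return customer_id

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN; one connection per process
cid = get_customer_id(conn, "SYS-0001", "Example customer")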
I'm working with sqlite3 on Python 2.7 and I am facing a problem with a many-to-many relationship. I have a table from which I am fetching its primary key like this:
current.execute("SELECT ExtensionID FROM tblExtensionLookup where ExtensionName = ?",[ext])
and then I am fetching another primary key from another table:
current.execute("SELECT HostID FROM tblHostLookup where HostName = ?",[host])
Now I have a third table with these two keys as foreign keys, and I insert them like this:
current.execute("INSERT INTO tblExtensionHistory VALUES(?,?)",[Hid,Eid])
The problem is that the last insertion is not working; I don't know why, but it keeps giving errors. Here is what I have tried:
First I thought it was because the mapping table has an autoincrement primary ID that I didn't provide, but isn't it supposed to fill that in itself since it's auto-incremented? I went ahead anyway and tried passing Null, None, and 0, but nothing works.
Secondly, I thought maybe I'm not actually getting the values from the tables above, so I tried printing them out; they do print, so that part works.
Any suggestions what I am doing wrong here?
EDIT:
When I don't provide the primary key, I get this error:
The table has three columns but you provided only two values
and when I do provide it as None, Null, or 0, it says:
Parameter 0 is not supported probably because of unsupported type
I tried implementing it the @abarnet way, but it still keeps saying parameter 0 is not supported:
connection = sqlite3.connect('WebInfrastructureScan.db')
with connection:
current = connection.cursor()
current.execute("SELECT ExtensionID FROM tblExtensionLookup where ExtensionName = ?",[ext])
Eid = current.fetchone()
print Eid
current.execute("SELECT HostID FROM tblHostLookup where HostName = ?",[host])
Hid = current.fetchone()
print Hid
current.execute("INSERT INTO tblExtensionHistory(HostID,ExtensionID) VALUES(?,?)",[Hid,Eid])
EDIT 2:
The database schema is:
Table 1:
CREATE TABLE tblHostLookup (
HostID INTEGER PRIMARY KEY AUTOINCREMENT,
HostName TEXT);
Table 2:
CREATE TABLE tblExtensionLookup (
ExtensionID INTEGER PRIMARY KEY AUTOINCREMENT,
ExtensionName TEXT);
Table 3:
CREATE TABLE tblExtensionHistory (
ExtensionHistoryID INTEGER PRIMARY KEY AUTOINCREMENT,
HostID INTEGER,
ExtensionID INTEGER,
FOREIGN KEY(HostID) REFERENCES tblHostLookup(HostID),
FOREIGN KEY(ExtensionID) REFERENCES tblExtensionLookup(ExtensionID));
It's hard to be sure without full details, but I think I can guess the problem.
If you use the INSERT statement without column names, the values must exactly match the columns as given in the schema. You can't skip over any of them.*
The right way to fix this is to just use the column names in your INSERT statement. Something like:
current.execute("INSERT INTO tblExtensionHistory (HostID, ExtensionID) VALUES (?,?)",
[Hid, Eid])
Now you can skip any columns you want (as long as they're autoincrement, nullable, or otherwise skippable, of course), or provide them in any order you want.
For your second problem, you're trying to pass in rows as if they were single values. You can't do that. From your code:
Eid = current.fetchone()
This will return a one-column row (a tuple), something like:
(3,)
And then you try to bind that to the ExtensionID column, which gives you an error.
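A minimal sketch of the corrected lookups, unpacking the single column from each fetched row (names reused from the question; this assumes a matching row exists):
current.execute("SELECT ExtensionID FROM tblExtensionLookup WHERE ExtensionName = ?", [ext])
Eid = current.fetchone()[0]   # the row is a 1-tuple; take its only column

current.execute("SELECT HostID FROM tblHostLookup WHERE HostName = ?", [host])
Hid = current.fetchone()[0]

current.execute("INSERT INTO tblExtensionHistory (HostID, ExtensionID) VALUES (?, ?)",
                [Hid, Eid])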
In the future, you may want to write and debug the SQL statements in the sqlite3 command-line tool and/or your favorite GUI database manager (there's a simple extension that runs in Firefox if you don't want anything fancy) and get them right, before you try getting the Python right.
* This is not true with all databases. For example, in MSJET/Access, you must skip over autoincrement columns. See the SQLite documentation for how SQLite interprets INSERT with no column names, or similar documentation for other databases.
I am working on a small database application in Python (currently targeting 2.5 and 2.6) using sqlite3.
It would be helpful to be able to provide a series of functions that could set up the database and validate that it matches the current schema. Before I reinvent the wheel, I thought I'd look around for libraries that would provide something similar. I'd love to have something akin to RoR's migrations. xml2ddl doesn't appear to be meant as a library (although it could be used that way), and more importantly doesn't support sqlite3. I'm also worried about the need to move to Python 3 one day, given the lack of recent attention to xml2ddl.
Are there other tools around that people are using to handle this?
You can find the schema of a sqlite3 table this way:
import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('create table foo (bar integer, baz timestamp)')
c.execute("select sql from sqlite_master where type = 'table' and name = 'foo'")
r=c.fetchone()
print(r)
# (u'CREATE TABLE foo (bar integer, baz timestamp)',)
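Building on that, a rough way to validate an existing database against an expected schema is to compare what sqlite_master reports with what you expect. A sketch under the assumption that a literal text comparison of the CREATE TABLE statements is good enough (it is brittle if the DDL is ever reformatted):
import sqlite3

EXPECTED = {
    u'foo': u'CREATE TABLE foo (bar integer, baz timestamp)',
}

def schema_matches(db_path, expected):
    con = sqlite3.connect(db_path)
    try:
        cur = con.execute("select name, sql from sqlite_master where type = 'table'")
        actual = dict(cur.fetchall())  # {table name: original CREATE TABLE text}
    finally:
        con.close()
    return actual == expected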
Take a look at SQLAlchemy migrate. I see no problem using it as a migration tool only, but comparing a configuration to the current database state is still experimental.
I use this to keep schemas in sync.
Keep in mind that it adds a metadata table to keep track of the versions.
South is the closest thing I know of to RoR migrations. But just as you need Rails for those migrations, you need Django to use South.
Not sure if this is standard, but I just saved all my schema queries in a text file like so (tables_creation.txt):
CREATE TABLE "Jobs" (
"Salary" TEXT,
"NumEmployees" TEXT,
"Location" TEXT,
"Description" TEXT,
"AppSubmitted" INTEGER,
"JobID" INTEGER NOT NULL UNIQUE,
PRIMARY KEY("JobID")
);
CREATE TABLE "Questions" (
"Question" TEXT NOT NULL,
"QuestionID" INTEGER NOT NULL UNIQUE,
PRIMARY KEY("QuestionID" AUTOINCREMENT)
);
CREATE TABLE "FreeResponseQuestions" (
"Answer" TEXT,
"FreeResponseQuestionID" INTEGER NOT NULL UNIQUE,
PRIMARY KEY("FreeResponseQuestionID"),
FOREIGN KEY("FreeResponseQuestionID") REFERENCES "Questions"("QuestionID")
);
...
Then I used this function, taking advantage of the fact that each query is delimited by two newline characters:
def create_db_schema(self):
    # each CREATE TABLE statement in the file is separated by a blank line
    with open("./tables_creation.txt", "r") as db_schema:
        sql_qs = db_schema.read().split('\n\n')
    c = self.conn.cursor()
    for sql_q in sql_qs:
        c.execute(sql_q)
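An alternative that doesn't rely on the blank-line convention is sqlite3's executescript, which splits and runs the whole script itself; a sketch of the same method using it:
def create_db_schema(self):
    with open("./tables_creation.txt", "r") as db_schema:
        script = db_schema.read()
    # executescript commits any pending transaction, then runs every
    # statement in the script, so no manual splitting is needed
    self.conn.executescript(script)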
I am using SQLite with Python. When I insert into table A, I need to feed it an ID from table B. So what I want to do is insert default data into B, grab the ID (which is auto-incremented), and use it in table A. What's the best way to receive the key from the table I just inserted into?
As Christian said, sqlite3_last_insert_rowid() is what you want... but that's the C level API, and you're using the Python DB-API bindings for SQLite.
It looks like the cursor attribute lastrowid will do what you want (search for 'lastrowid' in the documentation for more information). Insert your row with cursor.execute(...), then do something like lastid = cursor.lastrowid to check the last ID inserted.
That you say you need "an" ID worries me, though... does it not matter which ID you have? Unless you are using the data just inserted into B for something (in which case you need that specific row ID), your database structure is seriously screwed up if you just need any old row ID for table B.
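A minimal sketch of that pattern with the sqlite3 module (the table and column names are made up for illustration):
import sqlite3

conn = sqlite3.connect(':memory:')  # illustrative database
cur = conn.cursor()
cur.execute("CREATE TABLE b (id INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT)")
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "b_id INTEGER REFERENCES b(id), payload TEXT)")

# Insert the default row into B and grab its auto-incremented key
cur.execute("INSERT INTO b (label) VALUES (?)", ("default",))
b_id = cur.lastrowid

# Use that key as the foreign key when inserting into A
cur.execute("INSERT INTO a (b_id, payload) VALUES (?, ?)", (b_id, "some data"))
conn.commit()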
Check out sqlite3_last_insert_rowid() -- it's probably what you're looking for:
Each entry in an SQLite table has a unique 64-bit signed integer key called the "rowid". The rowid is always available as an undeclared column named ROWID, OID, or _ROWID_ as long as those names are not also used by explicitly declared columns. If the table has a column of type INTEGER PRIMARY KEY then that column is another alias for the rowid.

This routine returns the rowid of the most recent successful INSERT into the database from the database connection in the first argument. If no successful INSERTs have ever occurred on that database connection, zero is returned.
Hope it helps! (More info on ROWID is available in the SQLite documentation.)
Simply use:
SELECT last_insert_rowid();
However, if you have multiple connections writing to the database, you might not get back the key that you expect.