Reading Cassandra 1.2 table with pycassa - python

I am using Cassandra 1.2. I created a table using CQL 3 in the following way:
CREATE TABLE foo (
    user text PRIMARY KEY,
    emails set<text>
);
Now I am trying to query the data through pycassa:
import pycassa
from pycassa.pool import ConnectionPool
pool = ConnectionPool('ks1', ['localhost:9160'])
foo = pycassa.ColumnFamily(pool, 'foo')
This gives me
Traceback (most recent call last):
File "test.py", line 5, in <module>
foo = pycassa.ColumnFamily(pool, 'foo')
File "/home/john/src/pycassa/lib/python2.7/site-packages/pycassa/columnfamily.py", line 284, in __init__
self.load_schema()
File "/home/john/src/pycassa/lib/python2.7/site-packages/pycassa/columnfamily.py", line 312, in load_schema
raise nfe
pycassa.cassandra.ttypes.NotFoundException: NotFoundException(_message=None, why='Column family foo not found.')
How can this be accomplished?

If you have created your tables using CQL3 and you want to access them through a Thrift-based client, you will have to specify the COMPACT STORAGE property, e.g.:
CREATE TABLE dummy_file_test
(
    dtPtn INT,
    pxID INT,
    startTm INT,
    endTm INT,
    patID BIGINT,
    efile BLOB,
    PRIMARY KEY((dtPtn, pxID, startTm))
) WITH COMPACT STORAGE;
This is what I had to do to access CQL3-based column families with pycassa.

I am testing with Cassandra 1.2.8, pycassa 1.9.0, and CQL3. I was able to verify that creating a table in CQL3 with "WITH COMPACT STORAGE" in the CREATE TABLE statement does make the table (column family) visible to pycassa. Unfortunately, I was not able to find a way to alter an existing table so that "WITH COMPACT STORAGE" shows up in the DESCRIBE TABLE output. The ALTER TABLE ... WITH statement is supposed to let you change that setting, but I had no luck.
To verify this, simply create two tables in CQL3, one with WITH COMPACT STORAGE and one without it; the results are reproducible.
It looks like, to accomplish the stated goal, the table would need to be dropped and then re-created with the "WITH COMPACT STORAGE" option as part of the CREATE statement. If you don't want to lose any data, you could rename the existing table, create the new empty table with the correct options, and then move the data back into the desired table. Unless, of course, you can find a way to alter the table in place, which would be easier if it is possible.

Column families created with CQL3 cannot be accessed through the Thrift API, which is what pycassa uses.
You can read this if you have more questions.

It certainly appears that your column family (table) is not defined properly. Run cqlsh and then describe keyspace ks1;. My guess is you won't see your CF listed. Check to see that your keyspace name is correct.

Pycassa doesn't support the newer versions of Cassandra - see Ival's answer and here for more info. See https://pypi.python.org/pypi/cql/1.0.4 for an alternative to pycassa.
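For example, a minimal sketch with the cql package linked above (a DB-API style driver; the exact connect() arguments may vary between versions, so treat this as an assumption rather than a drop-in script):
import cql

# connect over the Thrift port and ask for CQL 3, so the CQL3 table and its
# collection columns are visible
con = cql.connect('localhost', 9160, 'ks1', cql_version='3.0.0')
cursor = con.cursor()
cursor.execute("SELECT user, emails FROM foo")
for row in cursor.fetchall():
    print row
cursor.close()
con.close()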

Related

How to delete top "n" rows using sqlite3 in Python

Is there an efficient way to delete the top "n" rows of a table in an SQLite database using sqlite3?
USE CASE:
I need to keep rolling time-series data in a table. To do this, I regularly fetch n new data points and append them to the table. However, to keep the table up to date and a constant size (in terms of number of rows), I need to trim it by removing the top n rows of data.
DELETE TOP(n) FROM <table_name>
looked promising, but it appears to be unsupported by sqlite3.
Below for example:
import sqlite3
conn = sqlite3.connect("testDB.db")
table_name = "test_table"
c = conn.cursor()
c.execute("DELETE TOP(2500) FROM test_table")
conn.commit()
conn.close()
The following error is raised:
Traceback (most recent call last):
File "test_db.py", line 9, in <module>
c.execute("DELETE TOP(10) FROM test_table")
sqlite3.OperationalError: near "TOP": syntax error
The only workaround I've seen is to use c.executemany instead of c.execute, but this would require specifying the exact dates to delete, which is much more cumbersome than it needs to be.
SQLite supports this, but not out of the box.
You first have to build a custom sqlite3.c amalgamation from the master source tree with the C preprocessor macro SQLITE_ENABLE_UPDATE_DELETE_LIMIT defined (by running ./configure --enable-update-limit; make), and then put the resulting shared library where Python will load it instead of whatever version it would otherwise use. (That last step is the hard part compared to using it in C or C++, where you can just add the custom sqlite3.c to the project's source files instead of using a library.)
Once all that's done and you're successfully using your own custom sqlite3 library from Python, you can do
DELETE FROM test_table LIMIT 10
which will delete 10 unspecified rows. To control which rows to delete, you need an ORDER BY clause:
DELETE FROM test_table ORDER BY foo LIMIT 10
See the documentation for details.
I suspect most people would give up on this as too complicated and would instead first find the rowids (or another primary/unique key) of the rows they want to delete, and then delete those.
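For example, a sketch of that rowid-based workaround with the stock sqlite3 module, assuming the table has a timestamp column named ts that defines which rows are the oldest (adjust the ORDER BY to your schema):
import sqlite3

conn = sqlite3.connect("testDB.db")
c = conn.cursor()
# select the rowids of the 2500 oldest rows, then delete exactly those;
# this works on a standard sqlite3 build, no custom compilation needed
c.execute("""DELETE FROM test_table
             WHERE rowid IN (SELECT rowid FROM test_table
                             ORDER BY ts ASC
                             LIMIT 2500)""")
conn.commit()
conn.close()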
There is no way to delete a specific row, as far as I know; you need to drop the table and recreate it.
So basically, if I had to do it, I would fetch the whole table, store it in a variable, and remove the part I don't need. Then I would drop the table, create the same table again, and insert the remaining data.

How to read Snowflake primary keys into python

This question is a bit related to another question: Get List of Primary Key Columns in Snowflake.
INFORMATION_SCHEMA.COLUMNS does not provide the required information regarding primary keys, and the method proposed by Snowflake itself, where you describe the table and follow it with a result_scan, is unreliable when queries are run in parallel.
I was therefore thinking about using SHOW PRIMARY KEYS IN DATABASE. This works great when querying the database from within Snowflake, but as soon as I try to do it in Python, I get results for the column name like 'Built-in function id', which is not useful when dynamically generating SQL statements.
The code I am using is as follows:
SQL_PK = "SHOW PRIMARY KEYS IN DATABASE;"
snowflake_service = SnowflakeService(username=cred["username"], password=cred["password"])
snowflake_service.connect(database=DATABASE,role=ROLE, warehouse=WAREHOUSE)
curs = snowflake_service.cursor
primary_keys = curs.execute(SQL_PK).fetchall()
curs.close()
snowflake_service.connection.close()
Is there something I am doing wrong? Is it even possible to do it like this?
Or is the solution that Snowflake provides reliable enough, when sending these queries as one string? Although with many tables, there will be many round trips required to get all the data needed.
where you would describe the table followed by a result_scan, is unreliable when queries are run in parallel
You could search for the specific query run using information_schema.query_history_by_session and then refer to its result set using the retrieved QUERY_ID.
SHOW PRIMARY KEYS IN DATABASE;
-- find the newest occurrence of `SHOW PRIMARY KEYS`:
SET queryId = (SELECT QUERY_ID
               FROM TABLE(information_schema.query_history_by_session())
               WHERE QUERY_TEXT LIKE '%SHOW PRIMARY KEYS IN DATABASE%'
               ORDER BY END_TIME DESC
               LIMIT 1);
SELECT * FROM TABLE(RESULT_SCAN($queryId));
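Alternatively, a minimal sketch of the same idea in Python that avoids searching the history by using LAST_QUERY_ID() on the same cursor (this assumes the standard Snowflake Python connector cursor from the question, and that the SHOW PRIMARY KEYS output exposes lower-case column names such as "table_name" and "column_name"):
# run SHOW first, then read its result set back through RESULT_SCAN in the same session
curs.execute("SHOW PRIMARY KEYS IN DATABASE")
curs.execute(
    'SELECT "table_name", "column_name", "key_sequence" '
    'FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))'
)
primary_keys = curs.fetchall()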

Trying to write a streaming dataframe from spark in postgreSQL with Kafka and pyspark

I have searched for this issue all over this site and have not found any solution.
I have written a Java class that creates a Kafka producer and sends some files, and it works fine.
Then I want to write a Python script that reads these files and puts them into a PostgreSQL database.
Each file (each file is a dataset with a lot of columns) becomes a topic in the Kafka consumer, and each row of the file becomes a message in the corresponding topic.
This is the spark dataframe that I create in python from the streaming data:
list = df.select("fileName", "Satellite_PRN_number", "date", "time", "Crs", "Delta_n", "m0", "Cuc",
                 "e_Eccentricity", "Cus", "sqrt_A", "Toe_Time_of_Ephemeris", "Cic", "OMEGA_maiusc",
                 "cis", "i0", "Crc", "omega", "omega_dot", "idot")
Here is my python function that should insert each row in my postgreSQL table. I used psycopg2 for creating a connection between python and postgre and I use "self.cursor.execute" in order to write queries.
def process_row(self, row):
    self.cursor.execute(
        'INSERT INTO satellite(fileName, Satellite_PRN_number, date, time, Crs, Delta_n, m0, Cuc, '
        'e_Eccentricity, Cus, sqrt_A, Toe_Time_of_Ephemeris, Cic, OMEGA_maiusc, cis, i0, Crc, omega, '
        'omega_dot, idot) '
        'VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
        (row.fileName, row.Satellite_PRN_number, row.date, row.time, row.Crs, row.Delta_n, row.m0,
         row.Cuc, row.e_Eccentricity, row.Cus, row.sqrt_A, row.Toe_Time_of_Ephemeris, row.Cic,
         row.OMEGA_maiusc, row.cis, row.i0, row.Crc, row.omega, row.omega_dot, row.idot))
    self.connection.commit()
Finally, I use the method above to populate my table in PostgreSQL with the following command:
query = list.writeStream.outputMode("append").foreachBatch(process_row)\
    .option("checkpointLocation", "C:\\Users\\Admin\\AppData\\Local\\Temp").start()
I got the following error: AttributeError: 'DataFrame' object has no attribute 'cursor'.
I think the issue is in row.fileName, etc., or in the "process_row" method. I don't really understand how to write "process_row" so that each row of the streaming dataframe is used to populate the PostgreSQL table.
Can anyone help me? Thanks.
The signature of your foreachBatch function is not correct. It should look like this:
def foreach_batch_function(df, epoch_id):
    # Transform and write batchDF
    pass

streamingDF.writeStream.foreachBatch(foreach_batch_function).start()
As you can see, the first argument of the foreachBatch function is a DataFrame, not the instance of your psycopg2 class that you were expecting.
The DataFrame passed to foreachBatch contains all the rows of the current micro-batch, not just one row.
So you can either declare your PostgreSQL connection inside that function and use it there, as in the sketch below, or you can try the second approach described after it.
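A rough sketch of the first option (the connection parameters are placeholders, the column list is shortened for brevity, and in practice you would reuse a connection or a pool rather than opening one per batch):
import psycopg2

def foreach_batch_function(df, epoch_id):
    # open one connection per micro-batch and close it once the batch is written
    connection = psycopg2.connect(host="localhost", dbname="mydb",
                                  user="user", password="password")
    cursor = connection.cursor()
    # df.collect() pulls the micro-batch to the driver; fine for small batches
    for row in df.collect():
        cursor.execute(
            'INSERT INTO satellite(fileName, Satellite_PRN_number, date, time) '
            'VALUES (%s, %s, %s, %s)',
            (row.fileName, row.Satellite_PRN_number, row.date, row.time))
    connection.commit()
    cursor.close()
    connection.close()

list.writeStream.outputMode("append").foreachBatch(foreach_batch_function).start()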
For the second approach, I would create a Hive table backed by the JDBC source for your PostgreSQL database, like this:
CREATE TABLE jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:postgresql:dbserver",
dbtable "schema.tablename",
user 'username',
password 'password'
)
which will enable you to use your foreachBatch function like this:
def foreach_batch_function(df, epoch_id):
    # Transform and write batchDF
    df.write.insertInto("jdbcTable")
Hope that was helpful.

PostgreSQL - Unable to create table - data type point has no default operator class for access method "btree"

So I'm working on a project to learn SQL, Python, and some server-side stuff, and I'm pretty stumped at this point after much searching. I want to create two identical tables, and I have the following code:
import psycopg2 as db
import json

with open('config.json', 'r') as f:
    config = json.load(f)

conn = db.connect(user=config['dbuser'], database=config['dbname'], host=config['dbhost'], password=config['dbpass'])
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS A
(
    city TEXT,
    location POINT NOT NULL,
    eta INTEGER,
    time TIMESTAMP WITH TIME ZONE NOT NULL,
    holiday BOOLEAN,
    PRIMARY KEY (location, time)
);""")
and also table B, with the same format, where A and B are different services. When running this, I get:
Traceback (most recent call last):
File "dbsetup.py", line 18, in <module>
);""")
psycopg2.ProgrammingError: data type point has no default operator class for access method "btree"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
I've been searching for a while now, but nothing I'm coming across is very similar to this case, and I'm confused since surely I'm missing something very basic here. Any help or pointing in the right direction would be much appreciated! General advice is very welcome too.
PostgreSQL doesn't support creating a normal or unique b-tree index on the point data type (true up to and including 9.5 at least), so you can't use point as part of a PRIMARY KEY.
test=> CREATE TABLE test_table( xy point primary key );
ERROR: data type point has no default operator class for access method "btree"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
You'll need to change your data model so you don't try to use a point as part of the PK.
Or you could write an extension to add b-tree index support for the point type, but that requires a lot more work and understanding of the guts of how PostgreSQL's indexing works.
Anyway, if you're doing geographic stuff you might want to look into using PostGIS and the geometry data type. It has a much richer set of operations for searches to find the nearest point, find by distance, check if a point is within a region, etc, all efficiently indexed.
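For instance, one possible restructuring (a sketch only, with an assumed surrogate key; PostgreSQL does ship a GiST operator class for point, and CREATE INDEX IF NOT EXISTS needs 9.5+):
cur.execute("""CREATE TABLE IF NOT EXISTS A
(
    id BIGSERIAL PRIMARY KEY,  -- surrogate key instead of (location, time)
    city TEXT,
    location POINT NOT NULL,
    eta INTEGER,
    time TIMESTAMP WITH TIME ZONE NOT NULL,
    holiday BOOLEAN
);""")
# spatial lookups on the point column can still be indexed, just not with b-tree
cur.execute("CREATE INDEX IF NOT EXISTS a_location_idx ON A USING gist (location);")
conn.commit()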

is there an easy/efficient way to make a mysql database dynamically 'build its own structure' as it saves name/value pairs to disk?

Is there any efficient and easy way to execute statements similar to:
CREATE TABLE IF NOT EXISTS fubar ( id int, name varchar(80))
for columns as you perform INSERT statements?
I'd imagine it will be a lot more complicated, but just for the sake of explanation, I guess I'm looking for something like:
IF NOT EXISTS
(
    SELECT * FROM information_schema.COLUMNS
    WHERE
        COLUMN_NAME = 'new_column' AND
        TABLE_NAME = 'the_table' AND
        TABLE_SCHEMA = 'the_schema'
)
THEN
    ALTER TABLE `the_schema`.`the_table`
    ADD COLUMN `new_column` bigint(20) unsigned NOT NULL default 1;
Alternatively, is there a Python library that might handle the process?
Basically, I want the 'ids' of a 'dictionary' to define the columns, and to create them if they do not already exist.
I would also like the database to stay reasonably efficient, so I guess some dynamic handling of the data types would also be necessary.
I'm just wondering if anything like this exists at the moment, and if not, I'm looking for advice on how best to achieve it.
I believe that you're looking for an Object Relational Mapping system.
For Python there are a couple available:
SQLAlchemy: http://www.sqlalchemy.org/
SQLObject: http://sqlobject.org/
Or if you're building a website, the Django project includes an ORM system: http://www.djangoproject.com/
To simplify the process, you can add something like Elixir on top of SQLAlchemy: http://elixir.ematia.de/trac/wiki
You would get code like this:
class Movie(Entity):
    title = Field(Unicode(30))
    year = Field(Integer)
    description = Field(UnicodeText)
This is how to insert:
>>> Movie(title=u"Blade Runner", year=1982)
<Movie "Blade Runner" (1982)>
>>> session.commit()
Or fetch the results:
>>> Movie.query.all()
[<Movie "Blade Runner" (1982)>]
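For comparison, a minimal sketch of the same model in plain SQLAlchemy (no Elixir layer), where the class definition drives table creation and create_all() only creates tables that do not already exist; the connection URL here is a placeholder:
from sqlalchemy import create_engine, Column, Integer, Unicode, UnicodeText
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Movie(Base):
    __tablename__ = 'movies'
    id = Column(Integer, primary_key=True)
    title = Column(Unicode(30))
    year = Column(Integer)
    description = Column(UnicodeText)

engine = create_engine('mysql://user:password@localhost/the_schema')
Base.metadata.create_all(engine)  # creates the table only if it does not exist

Session = sessionmaker(bind=engine)
session = Session()
session.add(Movie(title=u"Blade Runner", year=1982))
session.commit()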
