How to delete top "n" rows using sqlite3 in Python - python

Is there an efficient way to delete the top "n" rows of a table in an SQLite database using sqlite3?
USE CASE:
I need to keep rolling timeseries data in a table. To do this, I regularly fetch n new data points and append them to the table. However, to keep the table up to date and a constant size (in terms of number of rows), I need to trim it by removing the top n rows of data.
DELETE TOP(n) FROM <table_name>
looked promising, but it appears to be unsupported by sqlite3.
For example:
import sqlite3
conn = sqlite3.connect("testDB.db")
table_name = "test_table"
c = conn.cursor()
c.execute("DELETE TOP(2500) FROM test_table")
conn.commit()
conn.close()
The following error is raised:
Traceback (most recent call last):
File "test_db.py", line 9, in <module>
c.execute("DELETE TOP(10) FROM test_table")
sqlite3.OperationalError: near "TOP": syntax error
The only workaround I've seen is to use c.executemany instead of c.execute, but that would require specifying the exact dates to delete, which is much more cumbersome than it needs to be.

SQLite supports this, but not out of the box.
You first have to create a custom sqlite3.c amalgamation file from the master source tree and compile it with the C preprocessor macro SQLITE_ENABLE_UPDATE_DELETE_LIMIT defined (by running ./configure --enable-update-limit; make). You then have to put the resulting shared library where Python will load it instead of whatever version it would otherwise use. This is the hard part compared to using it in C or C++, where you can simply add the custom sqlite3.c to the project source files instead of using a library.
Once all that's done and you're successfully using your own custom sqlite3 library from Python, you can do
DELETE FROM test_table LIMIT 10
which will delete 10 unspecified rows. To control which rows to delete, you need an ORDER BY clause:
DELETE FROM test_table ORDER BY foo LIMIT 10
See the documentation for details.
I suspect most people would give up on this as too complicated and instead first find the rowids (or another primary/unique key) of the rows they want to delete, and then delete them.
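For example, a minimal sketch of that simpler route, assuming the oldest rows are the ones with the smallest rowid (swap the ORDER BY for a timestamp column if that is what defines "top"):
import sqlite3

n = 2500  # number of oldest rows to trim
conn = sqlite3.connect("testDB.db")
c = conn.cursor()

# A subquery with LIMIT is supported by stock SQLite,
# unlike DELETE ... LIMIT itself.
c.execute(
    "DELETE FROM test_table "
    "WHERE rowid IN (SELECT rowid FROM test_table ORDER BY rowid LIMIT ?)",
    (n,),
)

conn.commit()
conn.close()
If the table has a date or timestamp column, ordering by that column is safer than relying on rowid order.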

There is no way to delete a specific row as far as I know.
You need to drop the table and add it again.
So basically, if I had to do it, I would fetch the whole table, store it in a variable, and remove the part I don't need. Then I would drop the table, create the same table again, and insert the rest of the data.
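A rough sketch of that approach, assuming a hypothetical single-column table (the column name and type are placeholders):
import sqlite3

n = 2500  # number of leading rows to discard
conn = sqlite3.connect("testDB.db")
c = conn.cursor()

# Fetch everything and keep only the rows after the first n.
rows = c.execute("SELECT value FROM test_table").fetchall()
keep = rows[n:]

# Drop and recreate the table, then re-insert the remainder.
c.execute("DROP TABLE test_table")
c.execute("CREATE TABLE test_table (value REAL)")
c.executemany("INSERT INTO test_table (value) VALUES (?)", keep)

conn.commit()
conn.close()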

Related

Insert into Vertica if table does not exist or not a duplicate row

I have written a Python script that creates a table using a create table if not exists statement and then inserts rows from a dataframe into a Vertica database. The first time I run the script, I want it to create the table and insert the data - this works fine.
From the next run onwards, I want it to create the table only if it does not exist (works fine) and insert data only if that row is not already in the database.
I use both an insert statement and a COPY statement to insert data. How can I do this in Python? I am accessing the Vertica database from Python using pyodbc.
Editing the post to include some code:
There is a dataframe called tablename_df, from which I need to populate content into a table created as below.
I create the table in Vertica with create table if not exists, which creates it only if it does not already exist:
cursor.execute("create table if not exists <tablename> (fields in the table)")
A COPY statement writes to this table from a CSV file that was created earlier:
cursor.execute("COPY tablename1 FROM LOCAL 'tablename.csv' DELIMITER ',' exceptions 'exceptions' rejected data 'rejected'")
and rows are also inserted one at a time from the dataframe:
for i, row in tablename_df.iterrows():
    cursor.execute("insert into tablename2 values(?,?,?,?,?,?,?,?,?,?,?,?)", row.values[0], row.values[1], row.values[2], row.values[3], row.values[4], row.values[5], row.values[6], row.values[7], row.values[8], row.values[9], row.values[10], row.values[11])
In the code above, I create the table and then insert into tablename1 and tablename2 using COPY and insert. This works fine the first time it is executed (as there is no data in the table yet). But if I run the same script twice by mistake, the data will be inserted twice into these tables. What check should I perform to ensure that data does not get inserted if it is already present?
First I'll mention that INSERT VALUES is pretty slow if you are doing a lot of rows. If you use batch SQL and the standard Vertica drivers, it should be converted to a COPY, but if it isn't, your inserts might take forever. I don't think this conversion happens with pyodbc, since it doesn't implement executemany() optimally. You might be able to get it with ceODBC, though I haven't tried it. Alternatively, you can use vertica_python, which has an efficient .copy('COPY FROM STDIN...', data) command.
Anyhow, for your question...
You can do it one of two ways. Also, for the inserts, I would really try to change this to a COPY or at least an executemany. Again, pyodbc does not do this properly, at least in the releases I have used.
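If you do go the vertica_python route, a minimal sketch of its COPY-from-STDIN path might look like this (the connection details are placeholders, and the exact copy() call is as I recall it, so check it against the version you install):
import vertica_python

conn_info = {
    'host': 'vertica-host',   # placeholder connection details
    'port': 5433,
    'user': 'dbadmin',
    'password': 'secret',
    'database': 'mydb',
}

connection = vertica_python.connect(**conn_info)
cur = connection.cursor()

# Stream the CSV straight into the table instead of row-by-row INSERTs.
with open('tablename.csv', 'rb') as fs:
    cur.copy("COPY tablename1 FROM STDIN DELIMITER ','", fs)

connection.commit()
connection.close()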
Use a control table that uniquely describes the set of data being loaded. Insert into it, and before loading, check that the data set has not already been loaded.
-- Step 1. Check control table for data set load.
SELECT *
FROM mycontroltable
WHERE dataset = ?

-- Step 2. If row not found, insert rows.
for row in data:
    cursor.execute('INSERT INTO mytargettable....VALUES(...)')

-- Step 3. Insert row into control table.
INSERT INTO mycontroltable( dataset ) VALUES ( ? )

-- Step 4. Commit data.
COMMIT
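In Python with pyodbc, that pattern could look roughly like this (the DSN, table names, and the data iterable are placeholders taken from the pseudocode above):
import pyodbc

dataset = 'load_2016_01_15'  # placeholder label for this data set
data = [(1, 'a', 'b'), (2, 'c', 'd')]  # placeholder rows, e.g. built from the dataframe

conn = pyodbc.connect('DSN=VerticaDSN')  # placeholder DSN
cursor = conn.cursor()

# Step 1: has this data set already been loaded?
cursor.execute("SELECT 1 FROM mycontroltable WHERE dataset = ?", dataset)
if cursor.fetchone() is None:
    # Step 2: load the rows.
    for row in data:
        cursor.execute("INSERT INTO mytargettable VALUES (?, ?, ?)", row)
    # Step 3: record the load in the control table.
    cursor.execute("INSERT INTO mycontroltable (dataset) VALUES (?)", dataset)
    # Step 4: commit everything in one transaction.
    conn.commit()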
Alternatively, you can insert or merge data based on a key. You can create a temp or other staging table to do it. If you don't want updates and the data does not change once inserted, then INSERT will be better, as it will not incur a delete vector. I'll do INSERT based on the way you phrased your question.
-- Step 1. Create local temp table for intermediate target.
CREATE LOCAL TEMP TABLE mytemp (fields) ON COMMIT DELETE ROWS;

-- Step 2. Insert data.
for row in data:
    cursor.execute('INSERT INTO mytemp....VALUES(...)')

-- Step 3. Insert/select only data that doesn't exist by key value.
INSERT INTO mytargettable (fields)
SELECT fields
FROM mytemp
WHERE NOT EXISTS (
    SELECT 1
    FROM mytargettable t
    WHERE t.key = mytemp.key
)

-- Step 4. Commit.
COMMIT;
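And a rough Python sketch of this staging-table variant, again with placeholder connection details, columns, and key names:
import pyodbc

data = [(1, 'a'), (2, 'b')]  # placeholder rows

conn = pyodbc.connect('DSN=VerticaDSN')  # placeholder DSN
cursor = conn.cursor()

# Step 1: local temp table that empties itself on commit.
cursor.execute("CREATE LOCAL TEMP TABLE mytemp (id INT, field1 VARCHAR(50)) "
               "ON COMMIT DELETE ROWS")

# Step 2: load everything into the temp table first.
for row in data:
    cursor.execute("INSERT INTO mytemp VALUES (?, ?)", row)

# Step 3: copy across only the keys that are not already present.
cursor.execute("INSERT INTO mytargettable (id, field1) "
               "SELECT id, field1 FROM mytemp "
               "WHERE NOT EXISTS ("
               "  SELECT 1 FROM mytargettable t WHERE t.id = mytemp.id)")

# Step 4: commit (this also clears the temp table).
conn.commit()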

How to read a csv using sql

I would like to know how to read a csv file using sql. I would like to use group by and join other csv files together. How would I go about this in Python?
example:
select * from csvfile.csv where name LIKE 'name%'
SQL code is executed by a database engine. Python does not directly understand or execute SQL statements.
While some SQL databases store their data in csv-like files, almost all of them use more complicated file structures. Therefore, you're required to import each csv file into a separate table in the SQL database engine. You can then use Python to connect to the SQL engine and send it SQL statements (such as SELECT). The engine will perform the SQL, extract the results from its data files, and return them to your Python program.
The most common lightweight engine is SQLite.
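For instance, a minimal sketch using only the standard library's csv and sqlite3 modules, assuming a hypothetical csvfile.csv with name and amount columns:
import csv
import sqlite3

conn = sqlite3.connect(":memory:")  # or a file path to keep the imported data
cur = conn.cursor()
cur.execute("CREATE TABLE csvfile (name TEXT, amount REAL)")

# Import the CSV into the table.
with open("csvfile.csv", newline="") as f:
    reader = csv.DictReader(f)
    cur.executemany("INSERT INTO csvfile (name, amount) VALUES (?, ?)",
                    ((row["name"], row["amount"]) for row in reader))

# Ordinary SQL now works, including GROUP BY and joins against other imported tables.
for name, total in cur.execute("SELECT name, SUM(amount) FROM csvfile "
                               "WHERE name LIKE 'name%' GROUP BY name"):
    print(name, total)
Each additional CSV file gets its own table, after which JOINs work as usual.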
littletable is a Python module I wrote for working with lists of objects as if they were database tables, but using a relational-like API, not actual SQL select statements. Tables in littletable can easily read and write from CSV files. One of the features I especially like is that every query from a littletable Table returns a new Table, so you don't have to learn different interfaces for Table vs. RecordSet, for instance. Tables are iterable like lists, but they can also be selected, indexed, joined, and pivoted - see the opening page of the docs.
# print a particular customer name
# (unique indexes will return a single item; non-unique
# indexes will return a Table of all matching items)
print(customers.by.id["0030"].name)
print(len(customers.by.zipcode["12345"]))

# print all items sold by the pound
for item in catalog.where(unitofmeas="LB"):
    print(item.sku, item.descr)

# print all items that cost more than 10
for item in catalog.where(lambda o: o.unitprice > 10):
    print(item.sku, item.descr, item.unitprice)

# join tables to create queryable wishlists collection
wishlists = customers.join_on("id") + wishitems.join_on("custid") + catalog.join_on("sku")

# print all wishlist items with price > 10
bigticketitems = wishlists().where(lambda ob: ob.unitprice > 10)
for item in bigticketitems:
    print(item)
Columns of Tables are inferred from the attributes of the objects added to the table. namedtuples work well too, as do types.SimpleNamespace objects. You can insert dicts into a Table, and they will be converted to SimpleNamespaces.
littletable takes a little getting used to, but it sounds like you are already thinking along a similar line.
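For example, loading the CSV files used in the snippets above might look roughly like this (file, column, and index names are placeholders, and numeric columns usually need a transform because CSV values arrive as strings):
import littletable as lt

# load each CSV file into its own Table;
# transforms convert selected columns from strings to numbers
customers = lt.Table("customers")
customers.csv_import("customers.csv")
catalog = lt.Table("catalog")
catalog.csv_import("catalog.csv", transforms={"unitprice": float})
wishitems = lt.Table("wishitems")
wishitems.csv_import("wishitems.csv")

# indexes enable the customers.by.id[...] style lookups shown above
customers.create_index("id", unique=True)
customers.create_index("zipcode")
catalog.create_index("sku", unique=True)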
You can easily query an SQL database using a PHP script. PHP runs server-side, so all your code will have to be on a webserver (the one with the database). You could make a function to connect to the database like this:
$con= mysql_connect($hostname, $username, $password)
or die("An error has occured");
Then use the $con to accomplish other tasks such as looping through data and creating a table, or even adding rows and columns to an existing table.
EDIT: I noticed you said .CSV file. You can upload a CSV file into an SQL database and create a table out of it. If you are using a control panel service such as phpMyAdmin, you can simply import the CSV file into your database using its import feature.
If you are looking for a free web host to test your SQL and PHP files on, check out x10 hosting.

Reading Cassandra 1.2 table with pycassa

Using Cassandra 1.2. I created a table using CQL 3 the following way:
CREATE TABLE foo (
    user text PRIMARY KEY,
    emails set<text>
);
Now I am trying to query the data through pycassa:
import pycassa
from pycassa.pool import ConnectionPool
pool = ConnectionPool('ks1', ['localhost:9160'])
foo = pycassa.ColumnFamily(pool, 'foo')
This gives me
Traceback (most recent call last):
File "test.py", line 5, in <module>
foo = pycassa.ColumnFamily(pool, 'foo')
File "/home/john/src/pycassa/lib/python2.7/site-packages/pycassa/columnfamily.py", line 284, in __init__
self.load_schema()
File "/home/john/src/pycassa/lib/python2.7/site-packages/pycassa/columnfamily.py", line 312, in load_schema
raise nfe
pycassa.cassandra.ttypes.NotFoundException: NotFoundException(_message=None, why='Column family foo not found.')
How can this be accomplished?
If you have created your tables using CQL3 and you want to access them through a Thrift-based client, you have to specify the COMPACT STORAGE property, e.g.:
CREATE TABLE dummy_file_test (
    dtPtn INT,
    pxID INT,
    startTm INT,
    endTm INT,
    patID BIGINT,
    efile BLOB,
    PRIMARY KEY ((dtPtn, pxID, startTm))
) WITH COMPACT STORAGE;
This is what I had to do to access CQL3-based column families with pycassa.
I am testing with Cassandra 1.2.8, pycassa 1.9.0 and CQL3. I was able to verify that a table created in CQL3 with WITH COMPACT STORAGE in the CREATE TABLE statement does become visible to pycassa. Unfortunately, I was not able to find a way to alter an existing table so that WITH COMPACT STORAGE shows up in the DESCRIBE TABLE output. The ALTER TABLE ... WITH statement is supposed to allow you to change that setting, but I had no luck.
To verify the results, simply create two tables using CQL3, one WITH COMPACT STORAGE and one without, and the results are reproducible.
It looks like, to accomplish the stated goal, the table would need to be dropped and then re-created with the WITH COMPACT STORAGE option as part of the CREATE statement. If you don't want to lose any data, you could rename the existing table, create the new empty table with the correct options, and then move the data back into the desired table. Unless, of course, you can find a way to alter the table correctly, which would be easier, if possible.
Column families created with CQL3 cannot use the Thrift API which pycassa uses.
You can read this if you have more questions.
It certainly appears that your column family (table) is not defined properly. Run cqlsh and then describe keyspace ks1;. My guess is you won't see your CF listed. Check to see that your keyspace name is correct.
Pycassa doesn't support newer versions of cassandra - see Ival's answer and here for more info. See https://pypi.python.org/pypi/cql/1.0.4 for an alternative solution to pycassa.
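For reference, a sketch of the cql-driver route; the connect() signature, port, and cql_version argument here are from memory and may differ between versions:
import cql

# the old Python cql driver speaks CQL3 over the Thrift port
conn = cql.connect('localhost', 9160, 'ks1', cql_version='3.0.0')
cursor = conn.cursor()
cursor.execute("SELECT user, emails FROM foo")
for row in cursor.fetchall():
    print(row)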

Python script to diff same table in two different databases

I am about to write a python script to help me migrate data between different versions of the same application.
Before I get started, I would like to know if there is a script or module that does something similar, and I can either use, or use as a starting point for rolling my own at least. The idea is to diff the data between specific tables, and then to store the diff as SQL INSERT statements to be applied to the earlier version database.
Note: This script is not robust in the face of schema changes
Generally the logic would be something along the lines of
def diff_table(table1, table2):
    # return all rows in table2 that are not in table1
    pass

def persist_rows_tofile(rows, tablename):
    # save rows to file
    pass

dbnames = ('db.v1', 'db.v2')
tables_to_process = ('foo', 'foobar')

for table in tables_to_process:
    table1 = dbnames[0] + '.' + table
    table2 = dbnames[1] + '.' + table
    rows = diff_table(table1, table2)
    if len(rows):
        persist_rows_tofile(rows, table)
Is this a good way to write such a script, or could it be improved? I suspect it could be improved by caching database connections, etc. (which I have left out, because I am not too familiar with SQLAlchemy).
Any tips on how to add SQLAlchemy and generally improve such a script?
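As a starting point, if both versions happen to be plain SQLite files, one way to fill in diff_table is to ATTACH the old database and let EXCEPT do the row comparison; a rough sketch (persisting the rows as INSERT statements is left out):
import sqlite3

def diff_table(conn, table):
    # rows present in the new version (main) but not in the old one (old)
    query = (f"SELECT * FROM main.{table} "
             f"EXCEPT SELECT * FROM old.{table}")
    return conn.execute(query).fetchall()

conn = sqlite3.connect("db.v2")                 # newer version
conn.execute("ATTACH DATABASE 'db.v1' AS old")  # older version

for table in ('foo', 'foobar'):
    rows = diff_table(conn, table)
    if rows:
        print(table, len(rows), "rows to migrate")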
To move data between two databases I use pg_comparator. It's like diff and patch for sql! You can use it to swap the order of columns but if you need to split or merge columns you need to use something else.
I also use it to duplicate a database asynchronously. A cron job runs every five minutes and pushes all changes on the "master" database to the "slave" databases. It is especially handy if you only need to distribute a single table, or not all columns of a table, etc.

Join with Pythons SQLite module is slower than doing it manually

I am using pythons built-in sqlite3 module to access a database. My query executes a join between a table of 150000 entries and a table of 40000 entries, the result contains about 150000 entries again. If I execute the query in the SQLite Manager it takes a few seconds, but if I execute the same query from Python, it has not finished after a minute. Here is the code I use:
cursor = self._connection.cursor()
annotationList = cursor.execute("SELECT PrimaryId, GOId " +
                                "FROM Proteins, Annotations " +
                                "WHERE Proteins.Id = Annotations.ProteinId")
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[protein].append(goterm)
I did the fetchall just to measure the execution time. Does anyone have an explanation for the huge difference in performance? I am using Python 2.6.1 on Mac OS X 10.6.4.
I implemented the join manually, and this works much faster. The code looks like this:
cursor = self._connection.cursor()
proteinList = cursor.execute("SELECT Id, PrimaryId FROM Proteins").fetchall()
annotationList = cursor.execute("SELECT ProteinId, GOId FROM Annotations").fetchall()
proteins = dict(proteinList)
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[proteins[protein]].append(goterm)
So when I fetch the tables myself and then do the join in Python, it takes about 2 seconds. The code above takes forever. Am I missing something here?
I tried the same with apsw, and it works just fine (the code does not need to be changed at all); the performance is great. I'm still wondering why this is so slow with the sqlite3 module.
There is a discussion about it here: http://www.mail-archive.com/python-list#python.org/msg253067.html
It seems that there is a performance bottleneck in the sqlite3 module. There is advice on how to make your queries faster:
make sure that you do have indices on the join columns
use pysqlite
You haven't posted the schema of the tables in question, but I think there might be a problem with indexes, specifically not having an index on Proteins.Id or Annotations.ProteinId (or both).
Create the SQLite indexes like this
CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)
CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId)
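From Python, those statements can be issued once against the same database file (the file name here is a placeholder):
import sqlite3

conn = sqlite3.connect("proteins.db")  # placeholder database file name
conn.execute("CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)")
conn.execute("CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId "
             "ON Annotations (ProteinId)")
conn.commit()
conn.close()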
I wanted to update this because I am noticing the same issue and we are now 2022...
In my own application I am using python3 and sqlite3 to do some data wrangling on large databases (>100000 rows * >200 columns). In particular, I have noticed that my 3-table inner join clocks in at around 12 minutes of run time in Python, whereas running the same join query in sqlite3 from the CLI runs in about 100 seconds. All the join predicates are properly indexed, and EXPLAIN QUERY PLAN indicates that the added time is most likely because I am using SELECT *, which is a necessary evil in my particular context.
The performance discrepancy caused me to pull my hair out all night until I realized there is a quick fix from here: Running a Sqlite3 Script from Command Line. This is definitely a workaround at best, but I have research due so this is my fix.
Write out the query to an .sql file (I am using f-strings to pass variables in, so the new table name appears as {foo} here):
fi = open("filename.sql", "w")
fi.write(f"CREATE TABLE {foo} AS SELECT * FROM Table1 INNER JOIN Table2 ON Table2.KeyColumn = Table1.KeyColumn INNER JOIN Table3 ON Table3.KeyColumn = Table1.KeyColumn;")
fi.close()
Run os.system from inside Python and send the .sql file to sqlite3 (this needs import os; {database} is the path to the database file):
import os

os.system(f"sqlite3 {database} < filename.sql")
Make sure you close any open connection before running this so you don't end up locked out and you'll have to re-instantiate any connection objects afterward if you're going back to working in sqlite within python.
Hope this helps and if anyone has figured the source of this out, please link to it!
