Recommendation for writing data from SQLite file via Python sqlite3 - python

I have generated a giant SQLite database and need to get some data out of it. I wrote some script to do so, and profiling led to the unfortunate conclusion that the write process would take approx. 3 days with the current setup. I wrote the script as simply as possible to make it as fast as possible.
The database has a unique index, but the columns I am querying don't (because of duplicate rows in those columns).
Would it make sense to use any multi-processing Python library here?
The script looks like this:
import sqlite3


def write_from_query(db_name, table_name, condition, content_column, out_file):
    '''
    Writes contents from a SQLite database column to an output file.

    Keyword arguments:
    db_name (str): Path of the .sqlite database file.
    table_name (str): Name of the target table in the SQLite file.
    condition (str): Condition for querying the SQLite database table.
    content_column (str): Name of the column that contains the content for the output file.
    out_file (str): Path of the output file that will be written.
    '''
    # Connecting to the database file
    conn = sqlite3.connect(db_name)
    c = conn.cursor()

    # Querying the database and writing the output file
    c.execute('SELECT {} FROM {} WHERE {}'.format(content_column, table_name, condition))
    with open(out_file, 'w') as outf:
        for row in c:
            outf.write(row[0])

    # Closing the connection to the database
    conn.close()


if __name__ == '__main__':
    write_from_query(
        db_name='my_db.sqlite',
        table_name='my_table',
        condition='variable1=1 AND variable2<=5 AND variable3="Zinc_Plus"',
        content_column='variable4',
        out_file='sqlite_out.txt'
    )
Thanks for your help, I am looking forward to your suggestions!
EDIT:
more information about the database:

I assume that you are running the write_from_query function for a huge number of queries.
If so, the problem is the missing indices on your filter criteria.
This results in the following: for each query you execute, SQLite loops through the whole 50 GB of data and checks whether your conditions hold true. That is VERY inefficient.
The easiest way out would be to slap indices on your columns (see the sketch below).
An alternative would be to formulate fewer queries that each cover multiple of your cases, and then loop over that data again to split it into different files. How well this works, however, depends on how your data is structured.
I'm not sure about multiprocessing/threading; SQLite is not really made for concurrency, but I guess it could work out since you only read data...
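For illustration, here is a minimal sketch of adding those indices with Python's sqlite3 module; the table and column names are taken from the question's example call and are otherwise assumptions:
import sqlite3

conn = sqlite3.connect('my_db.sqlite')  # path from the question's example; adjust as needed
c = conn.cursor()

# One index per column used in the WHERE clause (column names assumed from the example call).
# Building these on a 50 GB database will take a while, but it is a one-off cost.
c.execute('CREATE INDEX IF NOT EXISTS idx_variable1 ON my_table (variable1)')
c.execute('CREATE INDEX IF NOT EXISTS idx_variable2 ON my_table (variable2)')
c.execute('CREATE INDEX IF NOT EXISTS idx_variable3 ON my_table (variable3)')

# Verify that the query planner actually uses an index:
c.execute('EXPLAIN QUERY PLAN SELECT variable4 FROM my_table '
          'WHERE variable1=1 AND variable2<=5 AND variable3="Zinc_Plus"')
print(c.fetchall())

conn.commit()
conn.close()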

Either you dump the content and filter in your own program, or you add indices to all columns you use in your conditions.
Adding indices to all the columns will take a long, long time.
But for many different queries there is no alternative.
No, multiprocessing will probably not help. An SSD might, or 64 GiB of RAM. But they are not needed with indices; queries will be fast on normal disks too.
In conclusion, you created a database without creating indices for the columns you want to query. With 8 million rows this won't work.

While the process of actually writing this data to a file will take a while, I would expect it to be more like minutes than days; e.g. at a 50 MB/s sequential write speed, 15 GB works out at around 5 minutes.
I suspect that the issue is with the queries / lack of indexes. I would suggest trying to build composite indexes based on the combinations of columns that you need to filter on. As you will see from the documentation here, you can add as many columns as you want to an index.
Just to make you aware: adding indexes will slow down inserts/updates to your database, because every insert now also needs to find the appropriate place in the relevant indexes in addition to appending data to the end of the tables, but this is probably your only option to speed up the queries (a composite-index sketch follows below).
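As a hedged illustration of the composite-index suggestion (table and column names are assumed from the question's example call):
import sqlite3

conn = sqlite3.connect('my_db.sqlite')

# A single composite index covering all three filter columns lets SQLite resolve the
# whole WHERE clause from the index instead of scanning the table. The usual rule of
# thumb: equality columns first, the range column (variable2 here) last.
conn.execute('CREATE INDEX IF NOT EXISTS idx_filter '
             'ON my_table (variable1, variable3, variable2)')
conn.commit()
conn.close()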

I will look at the indices! But meanwhile another thing I just stumbled upon... Sorry for writing my own answer to my question here, but I thought it is better for the organization...
I was thinking that the .fetchall() command could also speed up the whole process, but I find the sqlite3 documentation on this a little bit brief... Would something like
with open(out_file, 'w') as outf:
    c.execute('SELECT * ...')
    results = c.fetchmany(10000)
    while results:
        for row in results:
            outf.write(row[0])
        results = c.fetchmany(10000)
make sense?

Related

mongoengine query for duplicates in embedded documentlist

I'm making a Python app with mongoengine where I have a MongoDB database of n users, and each user holds n daily records. I have a list of n new records per user that I want to add to my db.
I want to check if a record for a certain date already exists for a user before adding a new record to that user.
What I found in the docs is to iterate through every embedded document in the list to check for duplicate fields, but that's an O(n^2) algorithm and took 5 solid seconds for 300 records, which is too long. Below is an abbreviated version of the code.
There's gotta be a better way to query, right? I tried accessing something like user.records.date but that throws a not found error.
import mongoengine

# snippet here is abbreviated and does not run
# zone of interest is in conditional_insert()


class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required=True)
    # contents = ...


class User(mongoengine.Document):
    # meta = {}
    # account details
    records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)


def conditional_insert(user, new_record):
    # the docs tell me to iterate through every record in the user
    # there has to be a better way
    for r in user.records:
        if str(new_record.date) == str(r.date):  # I had to do that in my program
            # because Python kept converting the datetime obj to str
            return
    # if record of duplicate date not found, insert new record
    save_record(user, new_record)


def save_record(user, new_record):
    pass


if __name__ == "__main__":
    lst_to_insert = []  # list of (user, record_to_insert)
    for user, record in lst_to_insert:  # O(n)
        conditional_insert(user, record)  # O(n)
    # and I have n lst_to_insert so in reality I'm currently at O(n^3)
Hi everyone (and future me who will probably search for the same question 10 years later)
I optimized the code using the idea of a search tree. Instead of putting all records in a single list in User, I broke it down by year and month:
class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required=True)
    # contents = ...


class Month(mongoengine.EmbeddedDocument):
    daily_records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)


class Year(mongoengine.EmbeddedDocument):
    monthly_records = mongoengine.EmbeddedDocumentListField(Month)


class User(mongoengine.Document):
    # meta = {}
    # account details
    yearly_records = mongoengine.EmbeddedDocumentListField(Year)
Because it's MongoDB, I can later partition by decades, heck even centuries, but by that point I don't think this code will be relevant.
I then group my data to insert by month into separate pandas dataframes and feed each dataframe separately. The data flow thus looks like:
0) get the monthly df
1) loop through the years until we get the right one (let's say 10 steps; I don't think my program will live that long)
2) loop through the months until we get the right one (12 steps)
3) for each record in the df, loop through each daily record in the month to check for duplicates
The algorithm to insert with the check is still O(n^2), but since there are at most 31 records at the last step, the code is much faster. I tested 2000 duplicate records and it ran in under a second (I didn't actually time it, but as long as it feels instant it won't matter that much in my use case). A rough sketch of the lookup follows below.
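A minimal sketch of that lookup, assuming (this is not in the original classes) that Year and Month also carry a numeric year/month field so the right branch can be found, and reusing the save_record stub from the question:
def conditional_insert_tree(user, new_record):
    # Hypothetical helper: assumes Year has a 'year' field and Month a 'month' field.
    y, m = new_record.date.year, new_record.date.month

    year = next((yr for yr in user.yearly_records if yr.year == y), None)
    if year is None:
        # no records for that year yet, so no duplicate is possible
        save_record(user, new_record)
        return

    month = next((mo for mo in year.monthly_records if mo.month == m), None)
    if month is None:
        save_record(user, new_record)
        return

    # at most ~31 daily records to scan at this point
    for r in month.daily_records:
        if r.date == new_record.date:
            return  # duplicate, skip
    save_record(user, new_record)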
Mongo cannot conveniently offer you suitable indexes, very sad.
You frequently iterate over user.records.
If you can afford to allocate the memory for 300 users,
just iterate once and throw them into a set, which
offers O(1) constant time lookup, and
offers RAM speed rather than network latency.
When you save a user, also make note of it with cache.add((user_id, str(new_record.date))), as sketched below.
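A minimal sketch of that set-based cache, assuming (hypothetically) that user.id and record dates are available as in the question's snippet:
def build_seen_cache(users):
    # One pass over all users: remember every (user_id, date) pair already stored.
    cache = set()
    for user in users:
        for r in user.records:
            cache.add((user.id, str(r.date)))
    return cache


def conditional_insert(user, new_record, cache):
    key = (user.id, str(new_record.date))
    if key in cache:           # O(1) membership test in RAM
        return
    save_record(user, new_record)
    cache.add(key)             # keep the cache in sync with what was written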
EDIT
If you can't afford the memory for all those (user_id, date) tuples,
then sure, a relational database JOIN is fair,
it's just an out-of-core merge of sorted records.
I have had good results with using sqlalchemy to
hit local sqlite (memory or file-backed), or
heavier databases like Postgres or MariaDB.
Bear in mind that relational databases offer lovely ACID guarantees,
but you're paying for those guarantees. In this application
it doesn't sound like you need such properties.
Something as simple as /usr/bin/sort could do an out-of-core
ordering operation that puts all of a user's current
records right next to his historic records,
letting you filter them appropriately.
Sleepycat is not an RDBMS, but its B-tree does offer
external sorting, sufficient for the problem at hand.
(And yes, one can do transactions with sleepycat,
but again this problem just needs some pretty pedestrian reporting.)
Bench it and see.
Without profiling data for a specific workload,
it's pretty hard to tell if any extra complexity
would be worth it.
Identify the true memory or CPU bottleneck,
and focus just on that.
You don't necessarily need ordering, as hashing would suffice,
given enough core.
Send those tuples to a redis cache, and make it his problem
to store them somewhere.
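If the Redis route is attractive, here is a hedged sketch with the redis-py client (the key name and connection details are assumptions):
import redis

r = redis.Redis(host='localhost', port=6379, db=0)


def already_seen(user_id, record_date):
    # Membership test against a Redis set named 'seen_records' (name is arbitrary).
    return r.sismember('seen_records', f'{user_id}:{record_date}')


def mark_seen(user_id, record_date):
    r.sadd('seen_records', f'{user_id}:{record_date}')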

Optimize processing of large CSV file Python

I have a CSV file of about 175 million lines (2.86 GB), composed of three columns: ID1, ID2 and val.
I need to get the value in column "val" given "ID1" and "ID2". I query this dataframe constantly with varying combinations of ID1 and ID2, which are unique in the whole file.
I have tried to use pandas as shown below, but it takes a lot of time.
def is_av(Qterm, Cterm, df):
    try:
        return df.loc[(Qterm, Cterm), 'val']
    except KeyError:
        return 0
Is there a faster way to access CSV values, knowing that the value is located in one single row of the whole file?
If not, could you check this function and tell me what might be the cause of the slow processing?
for nc in L:  # ID1
    score = 0.0
    for ni in id_list:  # ID2
        e = is_av(ni, nc, df_g)
        InDegree = df1.loc[ni].values[0]
        SumInMap = df2.loc[nc].values[0]
        score = score + term_score(InDegree, SumInMap, e)  # compute a score
    key = pd_df3.loc[nc].values[0]
    tmt[key] = score
TL;DR: Use a DBMS (I suggest MySQL or PostgreSQL). Pandas is definitely not suited for this sort of work. Dask is better, but not as good as a traditional DBMS.
The absolute best way of doing this would be to use SQL; consider MySQL or PostgreSQL for starters (both free and very efficient for your current use case). While Pandas is an incredibly strong library, indexing and quick reading are not things it excels at, given that it needs to either load the data into memory or stream over it with little control compared to a DBMS.
Consider the case where you have multiple values and you want to skip specific rows; let's say you're looking for (ID1, ID2) with values of (3108, 4813). You want to skip over every row that starts with anything other than 3, then anything other than 31, and so on, and then skip any row starting with anything other than 3108,4 (assuming your CSV delimiter is a comma), and so on until you get exactly the ID1 and ID2 you're looking for. This is reading the data at a character level.
Pandas does not allow you to do this (as far as I know; someone can correct this answer if it does). The other example uses Dask, which is a library designed to handle data much larger than RAM at scale, but it is not optimized for index management the way DBMSs are. Don't get me wrong, Dask is good, but not for your use case.
Another very basic alternative would be to index your data based on ID1 and ID2, store it indexed, and only look up your data through actual file reading by skipping lines that do not start with your designated ID1, then skipping lines that do not start with your ID2, and so on. However, the best practice would be to use a DBMS, as caching, read optimization and many other serious advantages would be available, reducing the I/O read time from your disk.
You can get started with MySQL here: https://dev.mysql.com/doc/mysql-getting-started/en/
You can get started with PostgreSQL here: https://www.postgresqltutorial.com/postgresql-getting-started/
import os
os.system('pip install dask')
import dask.dataframe as dd
dd_data = dd.read_csv('sample.csv')
bool_filter_conditions = (dd_data['ID1'] == 'a') & (dd_data['ID2'] == 'b')
dd_result = dd_data[bool_filter_conditions][['val']]
dd_output = dd_result.compute()
dd_output
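As a rough, hedged sketch of the DBMS route, here is the same lookup using Python's built-in sqlite3 as a stand-in for the MySQL/PostgreSQL setups recommended above (the file name comes from the Dask snippet; column types and the database file name are assumptions):
import csv
import sqlite3

conn = sqlite3.connect('lookup.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS data (ID1 TEXT, ID2 TEXT, val REAL)')

# One-off load of the 175M-row CSV inside a single transaction.
with open('sample.csv', newline='') as f:
    reader = csv.DictReader(f)
    conn.executemany('INSERT INTO data VALUES (?, ?, ?)',
                     ((row['ID1'], row['ID2'], row['val']) for row in reader))

# The composite index is what makes the repeated lookups fast.
conn.execute('CREATE INDEX IF NOT EXISTS idx_ids ON data (ID1, ID2)')
conn.commit()


def is_av(Qterm, Cterm):
    cur = conn.execute('SELECT val FROM data WHERE ID1=? AND ID2=?', (Qterm, Cterm))
    row = cur.fetchone()
    return row[0] if row else 0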

getting a random set of database entries

I would like to retrieve a random set of X entries from my postgres database using sqlalchemy. My first approach was this
random_set_of_Xrows = models.Table.query.filter(something).order_by(func.random()).limit(len(X)).all()
Since my Table is quite big, this command takes about 1 second, and I was wondering how to optimise it. I guess the order_by function requires looking at all rows, so I figured using offset instead might make it faster. However, I can't quite see how to avoid the row count entirely.
Here is an approach using offset
rowCount = db.session.query(func.count(models.Table.id)).filter(something).scalar()
random_set_of_Xrows = models.Table.query.offset(func.floor(func.random()*rowCount)).limit(len(X)).all()
which however is not faster, with most of the time spent getting rowCount.
Any ideas how to make this faster?
cheers
carl
EDIT: As suggested below I added a column to the table with a random value and used that to extract the rows like
random_set_of_Xrows = models.Table.query.filter(something).order_by(models.Table.random_value).limit(len(X)).all()
I did ignore the offset part, since it doesn't matter to me if two calls give me the same results, I just need a random set of rows.
I've optimized this before by adding an indexed column r that inserts a random value automatically when a row is created. Then when you need a random set of rows just SELECT * FROM table ORDER BY r LIMIT 10 OFFSET some_random_value. You can run a script that updates your schema to add this column to your existing rows. You'll add a slight performance hit to writes with this approach, but if this is a functionality you need persistently it should be a fair trade off.
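A hedged sketch of that indexed random column in SQLAlchemy terms (the model and column names are assumptions chosen to match the thread):
import random

from sqlalchemy import Column, Float, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Table(Base):
    __tablename__ = 'my_table'  # hypothetical name
    id = Column(Integer, primary_key=True)
    # Indexed column filled with a random value at insert time.
    random_value = Column(Float, index=True, default=lambda: random.random())


# Later, fetching X "random" rows becomes an index scan instead of a full-table sort:
# session.query(Table).filter(something).order_by(Table.random_value).limit(X).all()
# Without a random offset, repeated calls return the same rows, which the EDIT above
# says is acceptable for this use case.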

Is MapReduce a possible solution for two lists that have an id in common?

I have a list of 30m entries, containing a unique id and 4 attributes for each entry. In addition to that, I have a second list with 10m entries, again containing a unique id and 2 other attributes.
The unique IDs in list 2 are a subset of the IDs in list 1.
I want to combine the two lists to do some analysis.
Example:
List 1:
ID|Age|Flag1|Flag2|Flag3
------------------------
ucab577|12|1|0|1
uhe4586|32|1|0|1
uhf4566|45|1|1|1
45e45tz|37|1|1|1
7ge4546|42|0|0|1
vdf4545|66|1|0|1
List 2:
ID|Country|Flag4|Flag5|Flag6
------------------------
uhe4586|US|0|0|1
uhf4566|US|0|1|1
45e45tz|UK|1|1|0
7ge4546|ES|0|0|1
I want to do analysis like:
"How many at the age of 45 have Flag4=1?" Or "What is the age structure of all IDs in US?"
My current approach is to load the two lists into separate tables of a relational database and then do a join.
Does a MapReduce approach make sense in this case?
If yes, what would a MapReduce approach look like?
How can I combine the attributes of list 1 with list 2?
Will it bring any advantages? (Currently I need more than 12 hours for importing the data)
When the files are big, Hadoop's distributed processing helps (it's faster). Once you bring the data into HDFS, you can use Hive or Pig for your query. Both use Hadoop MR for processing, so you do not need to write separate code for it. Hive is almost SQL-like; from your query type I guess you can manage with Hive. If your queries are more complex, you can consider Pig. If you use Hive, here are the sample steps:
load both files into two separate folders in HDFS,
create external tables for both of them and point their location to the destination folders,
perform the join and the query!
hive> create external table hiveint_r(id string, age int, Flag1 int, Flag2 int, Flag3 int)
    > row format delimited
    > fields terminated by '|'
    > location '/user/root/data/hiveint_r';   (the location is in HDFS)
The table will be populated with data automatically; there is no need to load it.
Create the other table (hiveint_l, for list 2) the same way, then run the join and the query:
select a.* from hiveint_l a full outer join hiveint_r b on (a.id = b.id) where b.age >= 30 and a.flag4 = 1;
MapReduce might be overkill for just 30m entries. How you should work really depends on your data. Is it dynamic (e.g. will new entries be added)? If not, just stick with your database; the data is already in it. 30m entries shouldn't take 12 hours to import, more likely 12 minutes (you should be able to get 30,000 inserts/second with a 20-byte data size), so your approach should be to fix your import. You might want to try bulk import, LOAD DATA INFILE, transactions and/or generating the indexes afterwards, or try another engine (InnoDB, MyISAM), ...
You can get just one big table (so you can get rid of the joins when you query, which will speed them up) by e.g.
UPDATE List1 JOIN List2 ON List1.Id = List2.Id
SET List1.Flag4 = List2.Flag4, List1.Flag5 = List2.Flag5, List1.Flag6 = List2.Flag6
after adding the columns to List1 of course; afterwards you should add indexes for all the columns you query on.
You can actually combine your data before you import it to MySQL, e.g. by reading list 2 into a hashmap (HashMap in C/C++/Java, dict in Python) and then generating a new import file with the combined data. It should actually take only some seconds to read the data. You can even do the evaluation here; it is not as flexible as SQL, but if you just have some fixed queries, it might be the fastest approach if your data changes often. A minimal sketch of that pre-merge follows below.
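Here is a hedged Python sketch of that pre-merge step, assuming the pipe-delimited layout from the example lists with a plain header row and no separator line (the file names are hypothetical):
import csv

# Read the smaller list (10m rows) into a dict keyed by ID.
with open('list2.txt', newline='') as f:
    reader = csv.reader(f, delimiter='|')
    header2 = next(reader)
    lookup = {row[0]: row[1:] for row in reader}

# Stream the big list (30m rows) and append the matching attributes.
with open('list1.txt', newline='') as fin, open('combined.txt', 'w', newline='') as fout:
    reader = csv.reader(fin, delimiter='|')
    writer = csv.writer(fout, delimiter='|')
    header1 = next(reader)
    writer.writerow(header1 + header2[1:])
    for row in reader:
        # IDs in list 2 are a subset of list 1, so fall back to empty fields.
        extra = lookup.get(row[0], [''] * (len(header2) - 1))
        writer.writerow(row + extra)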
In MapReduce you can process the two files by using join techniques. There are two types of joins: map-side and reduce-side.
A map-side join can be used efficiently via the DistributedCache API, in which one file is loaded into memory. In your case you can create a HashMap with key -> id and value -> Flag4, and during the map phase you can join the data based on the ID. One point should be noted: that file has to be small enough to fit in memory.
If both files are large, go for a reduce-side join.
First try to load the 2nd file into memory and create a map-side join.
Or you can go for Pig. Anyway, Pig executes its statements as MapReduce jobs only, but hand-written MapReduce is fast compared to Pig and Hive.

Insert slowing down over time as database grows (no index)

I'm trying to create a (single) database file (that will be regularly updated/occasionally partially recreated/occasionally queried) that is over 200GB, so relatively large in my view. There are about 16k tables and they range in size from a few kb to ~1gb; they have 2-21 columns. The longest table has nearly 15 million rows.
The script I wrote goes through the input files one by one, doing a bunch of processing and regex to get usable data. It regularly sends a batch (0.5-1GB) to be written in sqlite3, with one separate executemany statement to each table that data is inserted to. There are no commit or create table statements etc in-between these execute statements so I believe that all comes under a single transaction
Initially the script worked fast enough for my purposes, but it slows down dramatically over time as it nears completion, which is unfortunate given that I will need to slow it down further to keep the memory use manageable for normal use of my laptop.
I did some quick benchmarking, comparing inserting identical sample data into an empty database versus inserting into the 200GB database. The latter test was ~3 times slower to execute the insert statements (the relative slowdown of the commit was even worse, but in absolute terms it's insignificant); aside from that there was no significant difference between the two.
When I researched this topic before, it mostly returned results about indexes slowing down inserts on large tables. The answer seemed to be that inserts on tables without an index should stay at more or less the same speed regardless of size; since I don't need to run numerous queries against this database, I didn't make any indexes. I even double-checked and ran a check for indexes, which if I have it right should exclude that as a cause:
c.execute('SELECT name FROM sqlite_master WHERE type="index"')
print(c.fetchone()) #returned none
The other issue that cropped up was transactions, but I don't see how that could be a problem here: it is the same script and the same data being written, and only the size of the target database differs.
abbreviated relevant code:
import functools
import sqlite3

# process pre-defined objects and files, retrieve data in batches -
# all fine, no slowdown on full database

conn = sqlite3.connect(db_path)
c = conn.cursor()

# creates a list of tuples: (table name "subject-item", subject, item)
table_breakdown = [(tup[0] + '-' + tup[1], tup[0], tup[1]) for tup in all_tup]

# creates a new table if needed for new subject/item combos
targeted_create_tables = functools.partial(create_tables, c)
list(map(targeted_create_tables, table_breakdown))  # no slowdown on full database

# inserts data for a specific subject/item combo
targeted_insert_data = functools.partial(insert_data, c)
list(map(targeted_insert_data, table_breakdown))  # (3+)x slower

conn.commit()  # significant relative slowdown, but insignificant in absolute terms
conn.close()
and relevant insert function:
def insert_data(c, tup):
    global collector   # list of tuples of data for a combo of a subject and item
    global sql_length  # pre-defined dictionary translating the item into the
                       # right-length (?,?,?...) string
    tbl_name = tup[0]
    subject = tup[1]
    item = tup[2]
    subject_data = collector[subject][item]
    if not (subject_data == []):
        statement = '''INSERT INTO "{0}" VALUES {1}'''.format(tbl_name, sql_length[item])
        # massively slower, about 80% of inserts > twice slower
        c.executemany(statement, subject_data)
        subject_data = []
EDIT: table create function per CL's request. I'm aware that this is inefficient (it takes roughly the same time to check whether a table name exists this way as to create the table), but it's not significant to the slowdown.
def create_tables(c, tup):
    global collector
    global title  # list of column schemes to match to items
    tbl_name = tup[0]
    bm_unit = tup[1]
    item = tup[2]
    subject_data = collector[bm_unit][item]
    if not (subject_data == []):
        c.execute('SELECT * FROM sqlite_master WHERE name = "{0}" and type="table"'.format(tbl_name))
        if c.fetchone() is None:
            c.execute('CREATE TABLE "{0}" {1}'.format(tbl_name, title[item]))
There are, all told, 65 different column schemes in the title dict, but this is an example of what they look like:
title.append(('WINDFOR','(TIMESTAMP TEXT, SP INTEGER, SD TEXT, PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)'))
Anyone got any ideas about where to look or what could cause this issue? I apologize if I've left out important information or missed something horribly basic, I've come into this topic area completely cold.
Appending rows to the end of a table is the fastest way to insert data (and you are not playing games with the rowid, so you are indeed appending to the end).
However, you are not using a single table but 16k tables, so the overhead of managing the table structures is multiplied.
Try increasing the cache size (a sketch of that setting follows below). But the most promising change would be to use fewer tables.
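For reference, a minimal sketch of raising SQLite's page cache from Python; the value shown is an example, not a recommendation from the answer:
import sqlite3

conn = sqlite3.connect(db_path)
# Negative values are interpreted as a size in KiB; this asks for roughly 1 GiB of page cache.
conn.execute('PRAGMA cache_size = -1000000')
c = conn.cursor()
# ... create tables / executemany inserts as in the question ...
conn.commit()
conn.close()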
It makes sense to me that the time to INSERT increases as a function of the database size. The operating system itself may be slower when opening/closing/writing to larger files. An index could slow things down much more of course, but that doesn't mean that there would be no slowdown at all without an index.
