Fastest way to write database table to file in python

I'm trying to extract huge amounts of data from a DB and write it to a csv file. I'm trying to find out what the fastest way would be to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.
with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
    while row:
        count += 1
        # Add row with delimiters to builder
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()
    f.write(''.join(builder))
Edit: The database I'm using is unique to the small company I'm working for, so unfortunately I can't provide much information on that front. I'm using JPype to connect to the database, since the only way to connect is via a JDBC driver. I'm running CPython 2.7.5; I'd love to use PyPy, but it doesn't work with pandas.
Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear that I'll run out of memory. row has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!

With the little you've given us to go on, it's hard to be more specific, but…
I've wrapped your code up as a function, and written three alternative versions:
def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())
Notice that the last one is the one you say you tried.
Now, I wrote this test driver:
def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])

filename = 'db.csv'
for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))
And here are the results:
manual     0.055549702141433954
row        0.03852885402739048
rows       0.03992213006131351
rowsall    0.02850699401460588
So, your code takes nearly twice as long as calling fetchall and writerows in my test!
When I repeat a similar test with other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.
I think the explanation is that your custom formatting code is significantly slower than csv.writerows, but that with some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that fetchone does all of the same work as fetchall and then hands you the rows one at a time; with a remote database, fetchone may do anything from fetching all the rows, to fetching a buffer at a time, to fetching one row at a time, making it potentially much slower or faster than fetchall, depending on your data.
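If fetchall worries you memory-wise but per-row fetching turns out to be slow with your driver, a middle ground worth timing is fetchmany plus writerows. This is only a sketch against the generic DB-API; the chunk size of 1000 is an arbitrary starting point, not something I've tuned:

import csv

def rowsmany(filename, cursor, chunk_size=1000):
    # Stream the result set in fixed-size chunks so memory stays bounded,
    # while csv.writerows still handles the per-row formatting.
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow([col[0] for col in cursor.description])
        while True:
            chunk = cursor.fetchmany(chunk_size)
            if not chunk:
                break
            writer.writerows(chunk)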
But for a really useful explanation, you'd have to tell us exactly which database and library you're using (and which Python version—CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, but then either one also might run your code faster too…) and so on.

Related

How to deal with large csv file quickly?

I have a large csv file with more than 1 million rows. Each row has two features, callsite (the location of an API invocation) and a sequence of tokens to the callsite. They are written as:
callsite 1, token 1, token 2, token 3, ...
callsite 1, token 3, token 4, token 4, token 6, ...
callsite 2, token 3, token 1, token 6, token 7, ...
I want to shuffle the rows and split them into two files (for training and testing). The problem is that I want to split them according to the callsites instead of the rows. There may be more than one row belonging to one callsite. So I first read all the callsites, then shuffle and split them as follows:
import csv
import random

with open(file, 'r') as csv_file:
    reader = csv.reader(csv_file)
    callsites = [row[0] for row in reader]

random.shuffle(callsites)
test_callsites = callsites[0:n_test]  # n_test is the number of test cases
Then, I read each row from the csv file and compare the callsite to put it in the train.csv or test.csv as follows:
with open(file, 'r') as csv_file, open('train.csv', 'w') as train_file, open('test.csv', 'w') as test_file:
    reader = csv.reader(csv_file)
    train_writer = csv.writer(train_file)
    test_writer = csv.writer(test_file)
    for row in reader:
        if row[0] in test_callsites:
            test_writer.writerow(row)
        else:
            train_writer.writerow(row)
The problem is that the code runs too slowly, taking more than a day to finish. The membership check for each row makes the overall complexity O(n^2), and reading and writing row by row may also be inefficient. But I'm afraid that loading all the data into memory would cause a memory error. Is there a better way to deal with large files like this?
Would it be quicker to use a dataframe to read and write it? But the sequence length varies from row to row. I tried writing the data like this (putting all the tokens as a list in one column):
callsite, sequence
callsite 1, [token1||token2||token 3]
However, it doesn't seem convenient to restore [token 1||token 2||token 3] as a sequence.
Is there a better practice for storing and restoring that kind of variable-length data?
The simplest fix is to change:
test_callsites = callsites[0:n_test]
to
test_callsites = frozenset(callsites[:n_test]) # set also works; frozenset just reduces chance of mistakenly modifying it
This would reduce the work for each test of if row[0] in test_callsites: from O(n_test) to O(1), which would likely make a huge improvement if n_test is on the order of four digits or more (likely, when we're talking about millions of rows).
You could also slightly reduce the work (mostly in terms of improving memory locality by having a smaller bin of things being selected) in creating it in the first place by changing:
random.shuffle(callsites)
test_callsites = callsites[0:n_test]
to:
test_callsites = frozenset(random.sample(callsites, n_test))
which avoids reshuffling the whole of callsites in favor of selecting n_test values from it (which you then convert to a frozenset, or just set, for cheap lookup). Bonus, it's a one-liner. :-)
Side-note: Your code is potentially wrong as written. You must pass newline='' to your various calls to open to ensure that the chosen CSV dialect's newline preferences are honored.
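Putting the pieces together, here is a sketch of the whole split with those two changes plus the newline='' fix (file and n_test are your names from the question):

import csv
import random

# First pass: collect callsites and sample the test set (O(1) lookups later).
with open(file, 'r', newline='') as csv_file:
    callsites = [row[0] for row in csv.reader(csv_file)]
test_callsites = frozenset(random.sample(callsites, n_test))

# Second pass: route each row to train or test based on its callsite.
with open(file, 'r', newline='') as csv_file, \
     open('train.csv', 'w', newline='') as train_file, \
     open('test.csv', 'w', newline='') as test_file:
    reader = csv.reader(csv_file)
    train_writer = csv.writer(train_file)
    test_writer = csv.writer(test_file)
    for row in reader:
        writer = test_writer if row[0] in test_callsites else train_writer
        writer.writerow(row)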
What about something like this?
import csv
import random

random.seed(42)  # need this to get reproducible splits

with open("input.csv", "r") as input_file, open("train.csv", "w") as train_file, open(
    "test.csv", "w"
) as test_file:
    reader = csv.reader(input_file)
    train_writer = csv.writer(train_file)
    test_writer = csv.writer(test_file)

    test_callsites = set()
    train_callsites = set()

    for row in reader:
        callsite = row[0]
        if callsite in test_callsites:
            test_writer.writerow(row)
        elif callsite in train_callsites:
            train_writer.writerow(row)
        elif random.random() <= 0.2:  # put here the train/test split you need
            test_writer.writerow(row)
            test_callsites.add(callsite)
        else:
            train_writer.writerow(row)
            train_callsites.add(callsite)
This way you only need a single pass over the file. The drawback is that the split will only be approximately 20%.
Tested on 1M x 100 rows (~850 MB) and it seems reasonably usable.

Effective way to create .csv file from MongoDB

I have a MongoDB (media_mongo) with a collection main_hikari and a lot of data inside. I'm trying to write a function that creates a .csv file from this data as quickly as possible. I'm using the code below, but it takes too much time and CPU:
import pandas as pd
import pymongo
from pymongo import MongoClient

mongo_client = MongoClient('mongodb://admin:password@localhost:27017')
db = mongo_client.media_mongo

def download_file(down_file_name="hikari"):
    docs = pd.DataFrame(columns=[])
    if down_file_name == "kokyaku":
        col = db.main_kokyaku
    if down_file_name == "hikari":
        col = db.main_hikari
    if down_file_name == "hikanshou":
        col = db.main_hikanshou
    cursor = col.find()
    mongo_docs = list(cursor)
    for num, doc in enumerate(mongo_docs):
        doc["_id"] = str(doc["_id"])
        doc_id = doc["_id"]
        series_obj = pd.Series(doc, name=doc_id)
        docs = docs.append(series_obj)
    csv_export = docs.to_csv("file.csv", sep=",")

download_file()
My database has data in this format (sorry for that Japanese :D)
_id:"ObjectId("5e0544c4f4eefce9ee9b5a8b")"
事業者受付番号:"data1"
開通区分/処理区分:"data2"
開通ST/処理ST:"data3"
申込日,顧客名:"data4"
郵便番号:"data5"
住所1:"data6"
住所2:"data7"
連絡先番号:"data8"
契約者電話番号:"data9"
And about 150000 entries like this
If you have a lot of data as you indicate, then this line is going to hurt you:
mongo_docs = list(cursor)
It basically means "read the entire collection into a client-side list at once", which creates a huge memory high-water mark.
Better to use mongoexport (the command-line export tool that ships with MongoDB), or to walk the cursor yourself instead of having list() slurp the whole thing, e.g.:
cursor = col.find()
for doc in cursor:
    # read docs one at a time
or to be very pythonic about it:
for doc in col.find():  # or find(expression of your choice)
    # read docs one at a time
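For the CSV goal specifically, the streamed documents can go straight into csv.DictWriter instead of being appended to a DataFrame one Series at a time. A rough sketch, assuming every document shares the same set of fields (the field names are taken from the first document):

import csv

def export_to_csv(col, out_path="file.csv"):
    cursor = col.find()
    first = next(cursor, None)
    if first is None:
        return  # empty collection, nothing to write
    fieldnames = list(first.keys())
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        # write the first document, then stream the rest one at a time
        writer.writerow({k: str(v) for k, v in first.items()})
        for doc in cursor:
            writer.writerow({k: str(v) for k, v in doc.items()})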

Python and PostgreSQL - Check on multi-insert operation

I'll make it easier on you.
I need to perform a multi-insert operation using parameters from a text file.
However, I need to report each input line in a log or an err file depending on the insert status.
I was able to tell whether the insert was OK or not when performing one insert at a time (for example, using cur.rowcount or simply a try..except statement).
Is there a way to perform N inserts (corresponding to N input lines) and find out which ones fail?
Here is my code:
QUERY="insert into table (field1, field2, field3) values (%s, %s, %s)"
Let
a b c
d e f
g h i
be 3 rows from the input file. Then:
args = [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'h', 'i')]
cur.executemany(QUERY, args)
Now, let's suppose only the first 2 rows were successfully added. So I have to track such a situation as follows:
log file
a b c
d e f
err file
g h i
Any idea?
Thanks!
try this:
QUERY = "insert into table (field1, field2, field3) values (%s, %s, %s)"

with open('input.txt', 'r') as inputfile:
    readfile = inputfile.read()
inputlist = readfile.splitlines()

for x in inputlist:
    intermediate = x.split(' ')
    # parameterized execute avoids the quoting/injection problems of str.format
    cur.execute(QUERY, (intermediate[0], intermediate[1], intermediate[2]))
    # if error:
    #     log into the error file
    # else:
    #     log into the success file
Don't forget to uncomment those lines and adjust the error handling as you like.
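Filled in with real exception handling, those comments could become something like the sketch below. This assumes psycopg2 (with a connection conn alongside your cursor cur) and uses the original input lines themselves as the log entries:

import psycopg2

QUERY = "insert into table (field1, field2, field3) values (%s, %s, %s)"

with open('input.txt') as inputfile, \
     open('insert.log', 'w') as logfile, \
     open('insert.err', 'w') as errfile:
    for line in inputfile:
        params = line.split()
        try:
            cur.execute(QUERY, params)
            conn.commit()
            logfile.write(line)   # insert succeeded
        except psycopg2.Error:
            conn.rollback()       # clear the aborted transaction before continuing
            errfile.write(line)   # insert failed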
How common do you expect failures to be, and what kind of failures? What I have done in similar cases is insert 10,000 rows at a time; if a chunk fails, go back and do that chunk one row at a time to get the full error message and the specific row. Of course, that depends on failures being rare. What I would be more likely to do today is just turn off synchronous_commit and always process them one row at a time.
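A sketch of that chunk-then-retry idea, again assuming psycopg2 and that args is the list of parameter tuples built from the input file; the chunk size and file handling are illustrative only:

import psycopg2

CHUNK = 10000

def insert_with_fallback(conn, cur, query, args, logfile, errfile):
    # Insert in big batches; if a batch fails, replay it row by row so the
    # offending rows (and their error messages) can be logged individually.
    for start in range(0, len(args), CHUNK):
        batch = args[start:start + CHUNK]
        try:
            cur.executemany(query, batch)
            conn.commit()
            logfile.writelines(' '.join(row) + '\n' for row in batch)
        except psycopg2.Error:
            conn.rollback()
            for row in batch:
                try:
                    cur.execute(query, row)
                    conn.commit()
                    logfile.write(' '.join(row) + '\n')
                except psycopg2.Error as exc:
                    conn.rollback()
                    errfile.write(' '.join(row) + '  -- ' + str(exc).strip() + '\n')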

Fetching huge data from Oracle in Python

I need to fetch huge data from Oracle (using cx_oracle) in python 2.6, and to produce some csv file.
The data size is about 400k records x 200 columns x 100 chars each.
Which is the best way to do that?
Now, using the following code...
ctemp = connection.cursor()
ctemp.execute(sql)
ctemp.arraysize = 256
for row in ctemp:
    file.write(row[1])
    ...
... the script stays in the loop for hours and nothing is written to the file... (is there a way to print a message for every record extracted?)
Note: I don't have any issue with Oracle, and running the query in SqlDeveloper is super fast.
Thank you, gian
You should use cur.fetchmany() instead.
It will fetch a chunk of rows defined by arraysize (256).
Python code:
def chunks(cur):  # 256
    global log, d
    while True:
        # log.info('Chunk size %s' % cur.arraysize, extra=d)
        rows = cur.fetchmany()
        if not rows:
            break
        yield rows
Then do your processing in a for loop:
for i, chunk in enumerate(chunks(cur)):
    for row in chunk:
        # Process your rows here
That is exactly how I do it in my TableHunter for Oracle.
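To tie that back to producing the CSV, each chunk can be handed straight to csv.writerows. A minimal sketch, assuming a cx_Oracle cursor named cur (binary mode because the question is on Python 2):

import csv

def export_to_csv(cur, path, batch_size=256):
    cur.arraysize = batch_size
    with open(path, 'wb') as f:           # 'wb' for the csv module on Python 2
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])
        while True:
            rows = cur.fetchmany()        # fetches cur.arraysize rows at a time
            if not rows:
                break
            writer.writerows(rows)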
add print statements after each line
add a counter to your loop indicating progress after each N rows (see the sketch below)
look into a module like 'progressbar' for displaying a progress indicator
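For example, a plain counter printed every few thousand rows costs almost nothing and shows whether the loop is actually progressing (a sketch around the original loop):

ctemp = connection.cursor()
ctemp.execute(sql)
ctemp.arraysize = 256

count = 0
for row in ctemp:
    file.write(row[1])
    count += 1
    if count % 10000 == 0:
        print("%d rows written" % count)   # progress message every 10,000 rows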
I think your code is asking the database for the data one row at a time, which might explain the slowness.
Try:
ctemp = connection.cursor()
ctemp.execute(sql)
Results = ctemp.fetchall()
for row in Results:
    file.write(row[1])

How to Compare 2 very large matrices using Python

I have an interesting problem.
I have a very large (over 300 MB, more than 10,000,000 lines/rows) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous one, except that a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matter. It's not that big. Two 300 MB files easily fit into memory.
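A minimal difflib sketch along those lines, assuming both files are already sorted so related lines end up adjacent (the file names are placeholders):

import difflib
import sys

with open('old.csv') as f1, open('new.csv') as f2:
    old_lines = f1.readlines()
    new_lines = f2.readlines()

# Removed lines come out prefixed with '-', added lines with '+';
# a modified row shows up as a '-'/'+' pair.
sys.stdout.writelines(difflib.unified_diff(old_lines, new_lines,
                                           fromfile='old.csv', tofile='new.csv'))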
This is a little bit of a naive implementation but will deal with unsorted data:
import csv

file1_dict = {}
file2_dict = {}

# store values as tuples so they can be concatenated with the key tuples below
with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = tuple(row[2:])

with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])

with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.iteritems():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key + val + ('Same',))
            else:
                writer.writerow(key + file2_dict[key] + ('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key + val + ('Removed',))
    # deal with added keys!
    for key, val in file2_dict.iteritems():
        writer.writerow(key + val + ('Added',))
You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right, two 300 MB files will easily fit in memory... if your files get into the 1-2 GB range then this may have to be modified with the assumption of sorted data.
Something like this is close... although you may have to change the comparisons around for the added and modified cases to make sense:
# assuming both files are sorted by columns 1 and 2
import csv
import datetime
from itertools import imap

def str2date(s):
    return datetime.date(*map(int, s.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1:
    with open('file2.csv') as handle2:
        with open('outfile.csv', 'w') as outhandle:
            writer = csv.writer(outhandle)
            gen1 = imap(convert_tups, csv.reader(handle1))
            gen2 = imap(convert_tups, csv.reader(handle2))
            gen2key, gen2val = gen2.next()
            for gen1key, gen1val in gen1:
                if gen1key == gen2key and gen1val == gen2val:
                    writer.writerow(gen1key + gen1val + ('Same',))
                    gen2key, gen2val = gen2.next()
                elif gen1key == gen2key and gen1val != gen2val:
                    writer.writerow(gen2key + gen2val + ('Modified',))
                    gen2key, gen2val = gen2.next()
                elif gen1key > gen2key:
                    while gen1key > gen2key:
                        writer.writerow(gen2key + gen2val + ('Added',))
                        gen2key, gen2val = gen2.next()
                else:
                    writer.writerow(gen1key + gen1val + ('Removed',))
