Fetching huge data from Oracle in Python

I need to fetch a huge amount of data from Oracle (using cx_Oracle) in Python 2.6 and produce a csv file.
The data size is about 400k records x 200 columns x 100 chars each.
Which is the best way to do that?
Now, using the following code...
ctemp = connection.cursor()
ctemp.execute(sql)
ctemp.arraysize = 256
for row in ctemp:
    file.write(row[1])
...
... the script stays in the loop for hours and nothing is written to the file... (is there a way to print a message for every record extracted?)
Note: I don't have any issue with Oracle, and running the query in SqlDeveloper is super fast.
Thank you, gian

You should use cur.fetchmany() instead.
It will fetch a chunk of rows whose size is defined by arraysize (256).
Python code:
def chunks(cur):  # 256
    global log, d
    while True:
        # log.info('Chunk size %s' % cur.arraysize, extra=d)
        rows = cur.fetchmany()
        if not rows:
            break
        yield rows
Then do your processing in a for loop:
for i, chunk in enumerate(chunks(cur)):
    for row in chunk:
        # Process your rows here
        pass
That is exactly how I do it in my TableHunter for Oracle.
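For the CSV output in the question, here is a minimal sketch of the same fetchmany() idea (assuming the connection and sql objects from the question, a hypothetical out.csv, and Python 2 syntax):
import csv

cursor = connection.cursor()
cursor.arraysize = 1000            # fetch buffer size; tune to taste
cursor.execute(sql)

with open('out.csv', 'wb') as f:   # Python 2: open csv files in binary mode
    writer = csv.writer(f)
    total = 0
    while True:
        rows = cursor.fetchmany()  # pulls up to cursor.arraysize rows per round trip
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
        print 'wrote %d rows' % total   # progress message per chunk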

add print statements after each line
add a counter to your loop indicating progress after each N rows
look into a module like 'progressbar' for displaying a progress indicator
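A minimal sketch of the counter idea (reusing the ctemp cursor and file object from the question; the interval of 10,000 is arbitrary):
count = 0
for row in ctemp:
    file.write(row[1])
    count += 1
    if count % 10000 == 0:
        print '%d records extracted' % count   # progress every 10,000 rows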

I think your code is asking the database for the data one row at a time, which might explain the slowness.
Try:
ctemp = connection.cursor()
ctemp.execute(sql)
Results = ctemp.fetchall()
for row in Results:
    file.write(row[1])

Python and PostgreSQL - Check on multi-insert operation

I'll make it easier on you.
I need to perform a multi-insert operation using parameters from a text file.
However, I need to report each input line in a log or an err file depending on the insert status.
I was able to tell whether the insert was OK or not when performing them one at a time (for example, using cur.rowcount or simply a try..except statement).
Is there a way to perform N inserts (corresponding to N input lines) and to find out which ones fail?
Here is my code:
QUERY="insert into table (field1, field2, field3) values (%s, %s, %s)"
Let
a b c
d e f
g h i
be 3 rows from the input file. So
args=[('a','b','c'), ('d','e','f'),('g','h','i')]
cur.executemany(QUERY,args)
Now, let's suppose only the first 2 rows were successfully added. So I have to track such a situation as follows:
log file
a b c
d e f
err file
g h i
Any idea?
Thanks!
try this:
QUERY = "insert into table (field1, field2, field3) values (%s, %s, %s)"
logfile = open('insert.log', 'w')    # success log; pick whatever file names you like
errfile = open('insert.err', 'w')
with open('input.txt', 'r') as inputfile:
    for line in inputfile.read().splitlines():
        values = line.split(' ')
        try:
            cur.execute(QUERY, tuple(values))
        except Exception:
            conn.rollback()              # assumes the connection is available as conn; a failed statement aborts the transaction
            errfile.write(line + '\n')   # log into the error file
        else:
            conn.commit()
            logfile.write(line + '\n')   # log into the success file
logfile.close()
errfile.close()
Adjust the exception type, the file names and the commit handling as you like.
How common do you expect failures to be, and what kind of failures? What I have done in similar cases is insert 10,000 rows at a time, and if a chunk fails, go back and do that chunk one row at a time to get the full error message and the specific row. Of course, that depends on failures being rare. What I would be more likely to do today is just turn off synchronous_commit and always process them one row at a time.
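A rough sketch of that chunk-then-retry approach (assuming psycopg2, the QUERY from the question, a connection conn, and args as the full list of parameter tuples):
CHUNK = 10000

ok_rows, bad_rows = [], []
for start in range(0, len(args), CHUNK):
    chunk = args[start:start + CHUNK]
    try:
        cur.executemany(QUERY, chunk)
        conn.commit()
        ok_rows.extend(chunk)
    except Exception:
        conn.rollback()
        # the chunk failed somewhere: retry one row at a time to find the culprits
        for row in chunk:
            try:
                cur.execute(QUERY, row)
                conn.commit()
                ok_rows.append(row)
            except Exception:
                conn.rollback()
                bad_rows.append(row)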

Reading specific chunks pandas / not reading all chunks in pandas

I am trying to read a large csv file in chunks and process it, following this question and answer. Since I'm not a native Python user, I ran into an optimization problem and am looking for a better solution here.
What my code does:
I read in the line count of my csv with
with open(file) as f:
    row_count = sum(1 for line in f)
afterwards I "slice" my data into 30 equally sized chunks and process them according to the linked answer, with a for loop and pd.read_csv(file, chunksize). Since plotting 30 graphs in one figure is pretty unclear, I only plot every 5th chunk using modulo (which may be varied). For this I use an external counter.
chunksize = row_count // 30
counter = 0
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    print(counter)
    if (counter % 5) == 0 or counter == 0:
        plt.plot(df["Variable"])
    counter = counter + 1
plt.show()
Now to my question:
It seems like this loop reads in each chunk before processing it, which is reasonable. I can see this since the print(counter) steps are also fairly slow. Since I read a few million rows of csv, every step takes some time. Is there a way to skip the unwanted chunks in the for loop before reading them in? I was trying out something like:
wanted_plts = [1,5,10,15,20,25,30]
for i in wanted_plts:
    for chunk[i] in pd.read_csv(file, chunksize=chunksize):
        ...
I think I have trouble understanding how to manipulate the range of this for loop. There should be an elegant way to fix this.
Also: I found pandas' .get_chunk(x), but this seems to create just one chunk of size x.
Another attempt was to subset the reader object of pd.read_csv, like pd.read_csv()[0,1,2], but it seems that's not possible either.
Amendment: I'm aware plotting a lot of data in matplotlib is slow. I preprocess it earlier, but for making this code readable I removed all unnecessary parts.
You are wasting a lot of resources parsing CSV into a DataFrame without using it. To avoid this, you can create a line index during the first pass:
fp = open(file_name)
row_count = 0
pos = {0: 0}
line = fp.readline()
while line:
    row_count += 1
    pos[row_count] = fp.tell()
    line = fp.readline()
Do not dispose of the file handle yet! Because read_csv() accepts streams, you can move your file pointer as you want:
chunksize = row_count // 30
wanted_plts = [1,5,10,15,20,25,30]
for i in wanted_plts:
    fp.seek(pos[i*chunksize])                    # this will bring you to the first line of the desired chunk
    obj = pd.read_csv(fp, chunksize=chunksize)   # read your chunk lazily
    df = obj.get_chunk()                         # convert to DataFrame object
    plt.plot(df["Variable"])                     # do something
fp.close()                                       # don't forget to close the file when finished
And finally a warning: when reading CSV this way you will lose column names. So make an adjustment:
obj = pd.read_csv(fp, chunksize=chunksize, names=['<the column names you have>'])
P.S. file is a built-in name in Python; avoid using it to prevent undesired side effects. You can use file_ or file_name instead.
I've toyed with your setup, trying to find a way to skip chunks, using another rendering library like pyqtgraph or using matplotlib.pyplot subroutines instead of plot(), all to no avail.
So the only fair advice I can give you is to limit the scope of read_csv to only the data you're interested in by passing the usecols parameter.
Instead of:
for chunk in pd.read_csv(file, chunksize=chunksize):
    plt.plot(chunk['Variable'])
Use:
for chunk in pd.read_csv(file, usecols=['Variable'], chunksize=chunksize):
    plt.plot(chunk)
And, if you haven't already, definitely limit the number of iterations by going for the biggest chunksize you possibly can (so in your case the lowest row_count divider).
I haven't quantified their respective weights, but you will gain on both the read_csv() and plot() overheads, even if only slightly, given that your current chunks are already quite big.
With my test data, quadrupling the chunksize cuts down processing time in half:
chunksize=1000 => executed in 12.7s
chunksize=2000 => executed in 9.06s
chunksize=3000 => executed in 7.68s
chunksize=4000 => executed in 6.94s
And specifying usecols at read time also cuts down processing time in half again:
chunksize=1000 + usecols=['Variable'] => executed in 8.33s
chunksize=2000 + usecols=['Variable'] => executed in 5.27s
chunksize=3000 + usecols=['Variable'] => executed in 4.39s
chunksize=4000 + usecols=['Variable'] => executed in 3.54s
As far as I know, pandas does not provide any support for skipping chunks of file. At least I never found anything about it in the documentation.
In general, skipping lines from file (not reading them at all) is difficult unless you know in advance how many lines you want to skip and how many characters you have in each of those lines. In this case you can try to play with IO and seek to move the stream position to the exact place you need the next iteration.
But that does not seem to be your case.
I think the best thing you can do to improve efficiency is to read the lines using standard IO, and convert to a dataframe only the lines you need / want to plot.
Consider for example the following custom iterator.
When instantiated, it saves the header (first line). On each iteration it reads a chunk of lines from the file and then skips the following n*chunksize lines. It returns the header line followed by the read lines, wrapped in an io.StringIO object (so it's a stream and can be fed directly to pandas.read_csv).
import io
from itertools import islice

class DfReaderChunks:
    def __init__(self, filename, chunksize, n):
        self.fo = open(filename)
        self.chs = chunksize
        self.skiplines = self.chs * n
        self.header = next(self.fo)

    def getchunk(self):
        ll = list(islice(self.fo, self.chs))
        if len(ll) == 0:
            raise StopIteration
        dd = list(islice(self.fo, self.skiplines))  # consume (and discard) the lines to skip
        return self.header + ''.join(ll)

    def __iter__(self):
        return self

    def __next__(self):
        return io.StringIO(self.getchunk())

    def close(self):
        self.fo.close()

    def __del__(self):
        self.fo.close()
Using this class, you can read from your file:
reader = DfReaderChunks(file, chunksize, 4)
for dfst in reader:
    df = pd.read_csv(dfst)
    print(df)  # here I print to stdout, you can plot
reader.close()
which is "equivalent" to your setup:
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    if (counter % 5 == 0):
        print(df)  # again I print, you can plot
    counter += 1
I tested the time used by both of the above snippets using a dataframe of 39 MB (100,000 rows of random numbers).
On my machine, the former takes 0.458 seconds, the latter 0.821 seconds.
The only drawback is that the former snippet loses track of the row index (it's a new dataframe each time, so the index always starts from 0), but the printed chunks are the same.

Load PostgreSQL database with data from a NetCDF file

I have a netCDF file with eight variables (sorry, can't share the actual file).
Each variable has two dimensions, time and station. Time is about 14 steps and station is currently 38000 different ids.
So for 38000 different "locations" (actually just an id) we have 8 variables and 14 different times.
$ ncdump -h stationdata.nc
netcdf stationdata {
dimensions:
    station = 38000 ;
    name_strlen = 40 ;
    time = UNLIMITED ; // (14 currently)
variables:
    int time(time) ;
        time:long_name = "time" ;
        time:units = "seconds since 1970-01-01" ;
    char station_name(station, name_strlen) ;
        station_name:long_name = "station_name" ;
        station_name:cf_role = "timeseries_id" ;
    float var1(time, station) ;
        var1:long_name = "Variable 1" ;
        var1:units = "m3/s" ;
    float var2(time, station) ;
        var2:long_name = "Variable 2" ;
        var2:units = "m3/s" ;
...
This data needs to be loaded into a Postgres database so that it can be joined to some geometries matching station_name for later visualization.
Currently I have done this in Python with the netCDF4-module. Works but it takes forever!
Now I am looping like this:
times = rootgrp.variables['time']
stations = rootgrp.variables['station_name']
for timeindex, time in enumerate(times):
    stations = rootgrp.variables['station_name']
    for stationindex, stationnamearr in enumerate(stations):
        var1val = var1[timeindex][stationindex]
        print "INSERT INTO ncdata (validtime, stationname, var1) \
               VALUES ('%s','%s', %s);" % \
              ( time, stationnamearr, var1val )
This takes several minutes on my machine to run and I have a feeling it could be done in a much more clever way.
Anyone has any idea on how this can be done in a smarter way? Preferably in Python.
Not sure this is the right way to do it but I found a good way to solve this and thought I should share it.
In the first version the script took about one hour to run. After a rewrite of the code it now runs in less than 30 sec!
The big thing was to use numpy arrays and transpose the variable arrays from the NetCDF reader into rows, and then stack all columns into one matrix. This matrix was then loaded into the db using the psycopg2 copy_from function. I got the code for that from this question:
Use binary COPY table FROM with psycopg2
Parts of my code:
dates = num2date(rootgrp.variables['time'][:], units=rootgrp.variables['time'].units)
var1 = rootgrp.variables['var1']
var2 = rootgrp.variables['var2']

cpy = cStringIO.StringIO()

for timeindex, time in enumerate(dates):
    validtimes = np.empty(var1[timeindex].size, dtype="object")
    validtimes.fill(time)

    # Transpose and stack the arrays of parameters
    # [a,a,a,a]       [[a,b,c],
    # [b,b,b,b]   =>   [a,b,c],
    # [c,c,c,c]        [a,b,c],
    #                  [a,b,c]]
    a = np.hstack((
        validtimes.reshape(validtimes.size, 1),
        stationnames.reshape(stationnames.size, 1),
        var1[timeindex].reshape(var1[timeindex].size, 1),
        var2[timeindex].reshape(var2[timeindex].size, 1)
    ))

    # Fill the cStringIO with a text representation of the created array
    for row in a:
        cpy.write(row[0].strftime("%Y-%m-%d %H:%M") + '\t' + row[1] + '\t' + '\t'.join([str(x) for x in row[2:]]) + '\n')

conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()

cpy.seek(0)
curs.copy_from(cpy, 'ncdata', columns=('validtime', 'stationname', 'var1', 'var2'))
conn.commit()
There are a few simple improvements you can make to speed this up. All these are independent, you can try all of them or just a couple to see if it's fast enough. They're in roughly ascending order of difficulty:
Use the psycopg2 database driver, it's faster
Wrap the whole block of inserts in a transaction. If you're using psycopg2 you're already doing this - it auto-opens a transaction you have to commit at the end.
Collect up several rows worth of values in an array and do a multi-valued INSERT every n rows.
Use more than one connection to do the inserts via helper processes - see the multiprocessing module. Threads won't work as well because of GIL (global interpreter lock) issues.
If you don't want to use one big transaction you can set synchronous_commit = off and set a commit_delay so the connection can return before the disk flush actually completes. This won't help you much if you're doing all the work in one transaction.
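For that last point, a hedged sketch of relaxing durability per session (assuming psycopg2 and an open connection conn; commit_delay normally needs superuser rights or a postgresql.conf change, so only synchronous_commit is shown):
curs = conn.cursor()
curs.execute("SET synchronous_commit TO OFF")  # commits return before the WAL flush completes
# ... run the per-row INSERTs and commits here ...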
Multi-valued inserts
Psycopg2 doesn't directly support multi-valued INSERT but you can just write:
curs.execute("""
INSERT INTO blah(a,b) VALUES
(%s,%s),
(%s,%s),
(%s,%s),
(%s,%s),
(%s,%s);
""", parms);
and loop with something like:
parms = []
rownum = 0
for x in input_data:
    parms.extend([x.firstvalue, x.secondvalue])
    rownum += 1
    if rownum % 5 == 0:
        curs.execute("""INSERT ...""", tuple(parms))
        del parms[:]
Organize your loop to access all the variables for each time. In other words, read and write a record at a time rather than a variable at a time. This can speed things up enormously, especially if the source netCDF dataset is stored on a file system with large disk blocks, e.g. 1MB or larger. For an explanation of why this is faster and a discussion of order-of-magnitude resulting speedups, see this NCO speedup discussion, starting with entry 7.
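A minimal sketch of that record-at-a-time ordering (assuming rootgrp is the open netCDF4 Dataset from the question; the variable selection is illustrative):
for timeindex in range(len(rootgrp.variables['time'])):
    # read every (time, station) variable once for this time step,
    # so each netCDF record is touched a single time
    record = {name: var[timeindex, :]
              for name, var in rootgrp.variables.items()
              if var.dimensions == ('time', 'station')}
    # ... build the INSERT/COPY payload for all stations of this time step from record ...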

How to download huge Oracle LOB with cx_Oracle on memory constrained system?

I'm developing part of a system where processes are limited to about 350MB of RAM; we use cx_Oracle to download files from an external system for processing.
The external system stores files as BLOBs, and we can grab them doing something like this:
# ... set up Oracle connection, then
cursor.execute(u"""SELECT filename, data, filesize
FROM FILEDATA
WHERE ID = :id""", id=the_one_you_wanted)
filename, lob, filesize = cursor.fetchone()
with open(filename, "w") as the_file:
the_file.write(lob.read())
lob.read() will obviously fail with MemoryError when we hit a file larger than 300-350MB, so we've tried something like this instead of reading it all at once:
read_size = 0
chunk_size = lob.getchunksize() * 100
while read_size < filesize:
    data = lob.read(chunk_size, read_size + 1)
    read_size += len(data)
    the_file.write(data)
Unfortunately, we still get MemoryError after several iterations. From the time lob.read() is taking, and the out-of-memory condition we eventually get, it looks as if lob.read() is pulling ( chunk_size + read_size ) bytes from the database every time. That is, reads are taking O(n) time and O(n) memory, even though the buffer is quite a bit smaller.
To work around this, we've tried something like:
read_size = 0
while read_size < filesize:
    q = u'''SELECT dbms_lob.substr(data, 2000, %s)
            FROM FILEDATA WHERE ID = :id''' % (read_size + 1)
    cursor.execute(q, id=filedataid[0])
    row = cursor.fetchone()
    read_size += len(row[0])
    the_file.write(row[0])
This pulls 2000 bytes (argh) at a time, and takes forever (something like two hours for a 1.5GB file). Why 2000 bytes? According to the Oracle docs, dbms_lob.substr() stores its return value in a RAW, which is limited to 2000 bytes.
Is there some way I can store the dbms_lob.substr() results in a larger data object and read maybe a few megabytes at a time? How do I do this with cx_Oracle?
I think that the argument order in lob.read() is reversed in your code. The first argument should be the offset, the second argument should be the amount to read. This would explain the O(n) time and memory usage.
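A hedged sketch of that fix (reusing lob, filesize and the_file from the question): pass the offset first and the amount second, so each call pulls only one chunk.
offset = 1                                # LOB offsets are 1-based
chunk_size = lob.getchunksize() * 100
while offset <= filesize:
    data = lob.read(offset, chunk_size)   # read chunk_size bytes starting at offset
    if not data:
        break
    the_file.write(data)
    offset += len(data)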

Optimize python file comparison script

I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:
Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc.
Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

old = []
new = []
for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)

output = []
x = len(new)
i = 0
num = 0
while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num])
    num += 1
    i += 1

fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However I'm trying to run this on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies with reading both files to lists and treating the entire row of data as a single string for comparison.
My question is, for what I am trying to do is this there a faster, more efficient, way to process the two files to create the third file containing only new and updated records? I don't really have a target time, just mostly wanting to understand if there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A
How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time: because old is a list, you have to iterate through the entire list. Using a dictionary is much, much faster.
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

old = {row[0]: row[1:] for row in fOld}
new = {row[0]: row[1:] for row in fNew}
fileAin.close()
fileBin.close()

output = {}
for row_id in new:
    if row_id not in old or not old[row_id] == new[row_id]:
        output[row_id] = new[row_id]

for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])
fileCout.close()
difflib is quite efficient: http://docs.python.org/library/difflib.html
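A minimal sketch of the difflib idea (assuming the rows of old.csv and new.csv are directly comparable as text lines):
import difflib

with open('old.csv') as f_old, open('new.csv') as f_new:
    old_lines = f_old.readlines()
    new_lines = f_new.readlines()

with open('NewAndUpdated.csv', 'w') as f_out:
    for line in difflib.unified_diff(old_lines, new_lines, n=0):
        # '+' marks lines that are new or changed in new.csv; skip the '+++' header
        if line.startswith('+') and not line.startswith('+++'):
            f_out.write(line[1:])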
Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
http://en.wikipedia.org/wiki/Merge_sort
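And a sketch of that sort-then-merge approach (assuming the unique ID is the first CSV column, as in the sample row above, and that IDs compare correctly as plain strings):
import csv

def load_sorted(path):
    with open(path, 'rb') as f:            # Python 2: csv wants binary mode
        return sorted(csv.reader(f), key=lambda row: row[0])

old_rows = load_sorted('old.csv')
new_rows = load_sorted('new.csv')

with open('NewAndUpdated.csv', 'wb') as f_out:
    writer = csv.writer(f_out)
    i = 0
    for row in new_rows:                   # walk both sorted lists in lock-step
        while i < len(old_rows) and old_rows[i][0] < row[0]:
            i += 1                         # skip old IDs smaller than the current new ID
        if i >= len(old_rows) or old_rows[i] != row:
            writer.writerow(row)           # new ID, or same ID with changed fields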
