Context
I have a function in Python that scores a row in my table. I would like to combine the scores of all the rows arithmetically (e.g. computing the sum, average, etc. of the scores).
def compute_score(row):
    # some complicated Python code that would be painful to convert into SQL-equivalent
    return score
The obvious first approach is to simply read in all the data
import psycopg2
from psycopg2 import sql

def sum_scores(dbname, tablename):
    conn = psycopg2.connect(dbname=dbname)
    cur = conn.cursor()
    # table names cannot be passed as query parameters, so compose the identifier instead
    cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename)))
    rows = cur.fetchall()
    total = 0
    for row in rows:
        total += compute_score(row)
    conn.close()
    return total
Problem
I would like to be able to handle as much data as my database can hold. This could be larger than what would fit into Python's memory, so fetchall() seems like it would not work in that case.
Proposed Solutions
I was considering 3 approaches, all with the aim of processing a few records at a time:
One-by-one record processing using fetchone()
def sum_scores(dbname, tablename):
    ...
    total = 0
    for row_num in range(cur.rowcount):
        row = cur.fetchone()
        total += compute_score(row)
    ...
    return total
Batch-record processing using fetchmany(n)
def sum_scores(dbname, tablename):
    ...
    batch_size = 1000  # tunable; fetchmany() expects an int
    total = 0
    batch = cur.fetchmany(batch_size)
    while batch:
        for row in batch:
            total += compute_score(row)
        batch = cur.fetchmany(batch_size)
    ...
    return total
Relying on the cursor's iterator
def sum_scores(dbname, tablename):
    ...
    total = 0
    for row in cur:
        total += compute_score(row)
    ...
    return total
Questions
Was my thinking correct in that my 3 proposed solutions would only pull in manageable sized chunks of data at a time? Or do they suffer from the same problem as fetchall?
Which of the 3 proposed solutions would work (i.e. compute the correct score combination and not crash in the process) for LARGE datasets?
How does the cursor's iterator (Proposed Solution #3) actually pull in data into Python's memory? One-by-one, in batches, or all at once?
All 3 solutions will work, and only bring a subset of the results into memory.
Iterating via the cursor (Proposed Solution #3) will work the same as Proposed Solution #2 if you pass a name to the cursor, which makes it a server-side cursor. Iterating over it will then fetch itersize records at a time (the default is 2000).
Solutions #2 and #3 will be much quicker than #1, because there is far less round-trip overhead.
http://initd.org/psycopg/docs/cursor.html#fetch
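For reference, a minimal sketch of the named (server-side) cursor variant described above; the cursor name 'score_cursor' and the hard-coded table name are placeholders for illustration:

import psycopg2

def sum_scores(dbname):
    conn = psycopg2.connect(dbname=dbname)
    # passing a name creates a server-side cursor, so rows are streamed
    # in batches of cur.itersize rather than loaded all at once
    cur = conn.cursor(name='score_cursor')
    cur.itersize = 5000  # optional: tune the batch size (default is 2000)
    cur.execute('SELECT * FROM mytable')
    total = 0
    for row in cur:
        total += compute_score(row)
    conn.close()
    return total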
Related
I'm using pyodbc with an application that requires me to insert >1000 rows, which I currently do one at a time with pyodbc. This tends to take >30 minutes to finish. I was wondering if there are any faster methods that could do this in < 1 minute. I know you can use multiple values in an INSERT statement, but according to this (Multiple INSERT statements vs. single INSERT with multiple VALUES) it could possibly be even slower.
The code currently looks like this.
def Insert_X(X_info):
    columns = ', '.join(X_info.keys())
    placeholders = ', '.join('?' * len(X_info))
    columns = columns.replace("'", "")
    values = list(X_info.values())
    query_string = f"INSERT INTO X ({columns}) VALUES ({placeholders});"
    with conn.cursor() as cursor:
        cursor.execute(query_string, values)
With Insert_X being called >1000 times.
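One common way to speed this up (a sketch, not taken from the code above, and assuming an ODBC driver that supports parameter arrays, e.g. ODBC Driver 17 for SQL Server) is to collect the rows and send them in a single executemany() call with pyodbc's fast_executemany enabled:

def insert_many_X(rows):
    # rows: a list of dicts that all share the same keys
    columns = ', '.join(rows[0].keys())
    placeholders = ', '.join('?' * len(rows[0]))
    query_string = f"INSERT INTO X ({columns}) VALUES ({placeholders});"
    values = [list(r.values()) for r in rows]
    with conn.cursor() as cursor:
        cursor.fast_executemany = True  # send the whole batch as a parameter array
        cursor.executemany(query_string, values)

Called once with all >1000 rows (or in a few large batches), this typically cuts the per-row round-trip cost dramatically.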
I need an algorithm or method of computing a checksum of a SQL column that is easy to replicate in Python given a CSV.
I want to verify that the CSV column and the SQL column match.
I have a scheme that sums the BINARY_CHECKSUM of each row in the column, on both the SQL side and the Python side, into two overall column sums, but I'm worried about collisions and I want to know if there's a faster or better way.
I need a function, where c is a full SQL column, such that
python_function(c) == pyquery('EXEC sql_function("some_table", c)')
where python_function(c) and sql_function(c) return something like a hash or checksum.
One function doesn't need to encompass all possible types of c, although it's a plus if it does; a scheme specific to varchars or ints or bytes, etc. would also work.
The CSVs will be large: almost 50 million rows and 66 columns (varchar, int, bit, smallint, decimal, numeric).
The CSV comes from an external source and I need to verify that it matches the data in my database.
It doesn't need to be 100% accurate; missing fewer than 100,000 rows of difference is fine.
As an example, here's a high-level implementation of my solution in Python pseudocode.
def is_likely_equal(csv_filename, column_name):
    column_data = get_column_data(csv_filename, column_name)
    # I know it won't fit in memory; this is just an example
    python_b_sum = get_b_sum(column_data)
    sql_b_sum = some_db.execute("SELECT SUM(BINARY_CHECKSUM(column_name)) FROM table")
    if python_b_sum == sql_b_sum:
        return True
    else:
        return False
def get_b_sum(column_data):
    b_sum = 0
    for entry in column_data:
        b_sum += b_checksum(entry)
    return b_sum
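As a rough illustration of the streaming side only, here is a sketch that sums a per-value checksum without loading the CSV into memory. It uses zlib.crc32 purely as a stand-in for b_checksum; CRC32 is not the same algorithm as SQL Server's BINARY_CHECKSUM, so the database side would need a matching function, and the column_index argument is assumed:

import csv
import zlib

def streamed_b_sum(csv_filename, column_index):
    # sum a CRC32 checksum of one column, one row at a time
    total = 0
    with open(csv_filename, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row, if there is one
        for row in reader:
            total += zlib.crc32(row[column_index].encode('utf-8'))
    return total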
I am attempting to parse a very big MySQL table that may not fit in memory. The approach I am following, using pymysql, is:
import pymysql

db = pymysql.connect(**connection_params)
cur = db.cursor()
cur.execute('SELECT * FROM big_table')

for row in cur:
    process(row)
What I am observing is that cur.execute() eagerly loads the data into memory. Is it possible to iterate by rows lazily?
I am aware this could be done combining LIMIT and OFFSET clauses, but is it possible to be done in a more transparent way?
You can get the number of results (after cur.execute) with:
numrows = cur.rowcount
Then, you can iterate over them with a simple for loop:
for num in range(numrows):
    row = cur.fetchone()
    # do stuff with row...
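For genuinely lazy iteration, a sketch using pymysql's unbuffered cursor class (SSCursor), which streams rows from the server instead of loading the whole result set at execute() time; note that rowcount is not meaningful with an unbuffered cursor until all rows have been read:

import pymysql
import pymysql.cursors

db = pymysql.connect(**connection_params, cursorclass=pymysql.cursors.SSCursor)
cur = db.cursor()
cur.execute('SELECT * FROM big_table')

for row in cur:  # rows are fetched from the server as you iterate
    process(row)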
I'm downloading some data from a SQL Server database through a library that leverages pymssql in the back-end. The result of a cursor.execute("""<QUERY BODY>""") is a sqlalchemy.engine.result.ResultProxy object. How can I check if the result of the query was empty, i.e. there are no rows?
cur = ff.sql.create_engine(server=dw.address, db=dw.BI_DW,
                           login=":".join([os.environ["SQL_USER"],
                                           os.environ["SQL_PASSWD"]]))

for n in range(100):
    result = cur.execute("""QUERY BODY;""")
    if result:
        break
Unfortunately, result will never be None even when no rows were returned by the SQL query.
What's the best way to check for that?
The ResultProxy object does not contain any rows yet. Therefore it has no information about the total number of them, or even whether there are any. ResultProxy is just a "pointer" to the database. You get your rows only when you explicitly fetch them via the ResultProxy. You can do that by iterating over the object, via the .first() method, or via the .fetchall() method.
Bottom line: you cannot know the number of fetched rows until you actually fetch all of them and the ResultProxy object is exhausted.
Approach #1
You can fetch all the rows at once and count them and then do whatever you need with them:
rows = result.fetchall()
if len(rows):
    # do something with rows
The downside of this method is that we load all rows into memory at once (rows is a Python list containing all the fetched rows). This may not be desirable if the amount of fetched rows is very large and/or if you only need to iterate over the rows one-by-one independently (usually that's the case).
Approach #2
If loading all fetched rows into memory at once is not acceptable, then we can do this:
rows_amount = 0
for row in result:
    rows_amount += 1
    # do something with row

if not rows_amount:
    print('There were zero rows')
else:
    print('{} rows were fetched and processed'.format(rows_amount))
SQLAlchemy < 1.2: You can always turn the ResultProxy into an iterator:
res = engine.execute(...)
rp_iter = iter(res)
row_count = 0
try:
    row = next(rp_iter)
    row_count += 1
except StopIteration:
    pass  # end of data

if not row_count:
    ...  # no rows returned; StopIteration was raised on the first attempt
In SQLAlchemy >= 1.2, the ResultProxy implements both .next() and .__next__(), so you do not need to create the iterator:
res = engine.execute(...)
row_count = 0
try:
    row = next(res)
    row_count += 1
except StopIteration:
    ...
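For completeness, a sketch of the .first() approach mentioned above: it fetches the first row (or None when the result is empty) and closes the ResultProxy, so it suits a pure emptiness check rather than further iteration.

res = engine.execute("""QUERY BODY;""")
first_row = res.first()  # None when the query returned no rows
if first_row is None:
    print('No rows returned')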
I am trying to push some big files (around 4 million records) into a Mongo instance. What I am basically trying to achieve is to update the existing data with the data from the files. The algorithm would look something like:
rowHeaders = ('orderId', 'manufacturer', 'itemWeight')

for row in dataFile:
    row = row.strip('\n').split('\t')
    row = dict(zip(rowHeaders, row))

    mongoRow = mongoCollection.find_one({'orderId': row['orderId']})
    if mongoRow is not None:
        if mongoRow['itemWeight'] != row['itemWeight']:
            row['tsUpdated'] = time.time()
    else:
        row['tsUpdated'] = time.time()

    mongoCollection.update({'orderId': row['orderId']}, row, upsert=True)
So: update the whole row except 'tsUpdated' if the weights are the same, add a new row if the row is not in Mongo, or update the whole row including 'tsUpdated' otherwise. That is the algorithm.
The question is: can this be done faster, easier, and more efficiently from Mongo's point of view (possibly with some kind of bulk insert)?
Combine a unique index on orderId with an update query where you also check for a change in itemWeight. The unique index prevents an insert with only a modified timestamp if the orderId is already present and itemWeight is the same.
mongoCollection.ensure_index('orderId', unique=True)

mongoCollection.update({'orderId': row['orderId'],
                        'itemWeight': {'$ne': row['itemWeight']}},
                       row, upsert=True)
My benchmark shows a 5-10x performance improvement against your algorithm (depending on the amount of inserts vs updates).
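To get the bulk behaviour the question asks about, the same pattern can be batched with pymongo's bulk_write. This is a sketch, assuming a recent pymongo (create_index instead of ensure_index, ReplaceOne for full-document replacement); parsed_rows stands for the dicts built from the file as in the question. With an unordered bulk, the duplicate-key errors produced by the "unchanged weight" case are collected instead of aborting the batch:

import time
from pymongo import ReplaceOne
from pymongo.errors import BulkWriteError

mongoCollection.create_index('orderId', unique=True)

ops = []
for row in parsed_rows:
    row['tsUpdated'] = time.time()  # only written when the row is new or the weight changed
    ops.append(ReplaceOne({'orderId': row['orderId'],
                           'itemWeight': {'$ne': row['itemWeight']}},
                          row, upsert=True))
    if len(ops) == 1000:  # flush in batches
        try:
            mongoCollection.bulk_write(ops, ordered=False)
        except BulkWriteError:
            pass  # duplicate key: orderId already present with the same itemWeight
        ops = []

if ops:  # flush the final partial batch
    try:
        mongoCollection.bulk_write(ops, ordered=False)
    except BulkWriteError:
        pass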