I am attempting to parse a very big MySQL table that may not fit in memory. The approach I am following, using pymysql, is:
import pymysql

db = pymysql.connect(**connection_params)
cur = db.cursor()
cur.execute('SELECT * FROM big_table')
for row in cur:
    process(row)
What I am observing is that cur.execute() eagerly loads the data into memory. Is it possible to iterate over the rows lazily?
I am aware this could be done by combining LIMIT and OFFSET clauses, but is it possible to do it in a more transparent way?
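For reference, this is roughly what the LIMIT/OFFSET workaround mentioned above would look like. It is only a minimal sketch (the page size is an arbitrary choice, and large OFFSET values get progressively slower on MySQL):

import pymysql

db = pymysql.connect(**connection_params)
cur = db.cursor()
page_size = 10000  # arbitrary page size
offset = 0
while True:
    cur.execute('SELECT * FROM big_table LIMIT %s OFFSET %s', (page_size, offset))
    rows = cur.fetchall()
    if not rows:
        break
    for row in rows:
        process(row)
    offset += page_size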
You can get the number of results after cur.execute() with:
numrows = cur.rowcount
Then, you can iterate over them with a simple for:
for num in range(numrows):
    row = cur.fetchone()
    # do stuff with row...
I'm working on a project where I need to get data from my SQL Server, but there is a catch. In total there are around 100,000 rows in the specific column I need the data out of, but I only need the last 20,000-30,000 rows of it.
I use the usual connection string and a stored procedure, but is there a way to select a specific row to start from? (For example, let it start at row 70,000.)
import pyodbc

try:
    CONNECTION_STRING = 'DRIVER=' + driver + ';SERVER=' + server + ';DATABASE=' + databaseName + ';UID=' + username + ';PWD=' + password
    conn = pyodbc.connect(CONNECTION_STRING)
    cursor = conn.cursor()
    storedproc = "*"  # actual statement omitted in the original
    cursor.execute(storedproc)
    row = cursor.fetchone()
    while row:
        OID = int(row[1])
        print(OID)
        row = cursor.fetchone()  # advance, otherwise the loop never ends
except pyodbc.Error as err:
    print(err)
So my question: is there a way to (for example) make cursor.fetchone() start at row 70,000 instead of row 1? Or is there another way to do that?
Thanks in advance!
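For what it's worth, one way to start at a given row on SQL Server is to do the paging in the query itself with OFFSET ... FETCH. This is only a minimal sketch, not the original stored procedure; it assumes the table is called someone and has a sortable id column, since OFFSET requires an ORDER BY:

import pyodbc

conn = pyodbc.connect(CONNECTION_STRING)
cursor = conn.cursor()
# Skip the first 70,000 rows and read the next 30,000.
cursor.execute(
    "SELECT * FROM someone ORDER BY id "
    "OFFSET 70000 ROWS FETCH NEXT 30000 ROWS ONLY"
)
for row in cursor:
    print(row)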
I have this example of an SQL database. I want to use a certain item from the database in a math calculation, but I can't because the value looks like this: (25.0,) instead of just 25.0. Please see the attached picture.
https://imgur.com/a/j7JOZ5H
import sqlite3
#Create the database:
connection = sqlite3.connect('DataBase.db')
c = connection.cursor()
c.execute('CREATE TABLE IF NOT EXISTS table1 (name TEXT,age NUMBER)')
c.execute("INSERT INTO table1 VALUES('Jhon',25)")
#Pull out the value:
c.execute('SELECT age FROM table1')
data = c.fetchall()
print(data[0])
#Simple math calculation:
r = data[0] + 1
print(r)
According to Python's PEP 249, the DB-API specification that most database modules follow, including sqlite3, fetchall returns a sequence of sequences, usually a list of tuples. Therefore, to retrieve the single value in the first column and do arithmetic with it, index the return value twice: once for the specific row and then for the specific position in that row.
data = c.fetchall()
data[0][0]
Alternatively, fetchone returns a single row, either the first or the next row in the result set, so simply index once: the position in that single row.
data = c.fetchone()
data[0]
The returned data from fetchall always comes back as a list of tuples, even if the tuple only contains 1 value. Your data variable should be:
[(25,)]
You need to use:
print(data[0][0])
r = data[0][0] + 1
print(r)
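Putting it together with the table from the question, a minimal sketch (using an in-memory database so nothing is written to disk):

import sqlite3

connection = sqlite3.connect(':memory:')
c = connection.cursor()
c.execute('CREATE TABLE table1 (name TEXT, age NUMBER)')
c.execute("INSERT INTO table1 VALUES ('Jhon', 25)")
c.execute('SELECT age FROM table1')
data = c.fetchall()       # [(25,)]
r = data[0][0] + 1        # first row, first column
print(r)                  # 26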
Context
I have a function in Python that scores a row in my table. I would like to combine the scores of all the rows arithmetically (e.g. computing the sum, average, etc. of the scores).
def compute_score(row):
    # some complicated python code that would be painful to convert into SQL-equivalent
    return score
The obvious first approach is to simply read in all the data
import psycopg2
from psycopg2 import sql

def sum_scores(dbname, tablename):
    conn = psycopg2.connect(dbname=dbname)
    cur = conn.cursor()
    # A table name cannot be passed as a query parameter, so compose the identifier safely.
    cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename)))
    rows = cur.fetchall()
    total = 0
    for row in rows:
        total += compute_score(row)
    conn.close()
    return total
Problem
I would like to be able to handle as much data as my database can hold. This could be larger than what fits into Python's memory, so fetchall() seems like it would not work in that case.
Proposed Solutions
I was considering 3 approaches, all with the aim of processing only a few records at a time:
One-by-one record processing using fetchone()
def sum_scores(dbname, tablename):
    ...
    total = 0
    for row_num in range(cur.rowcount):
        row = cur.fetchone()
        total += compute_score(row)
    ...
    return total
Batch-record processing using fetchmany(n)
def sum_scores(dbname, tablename):
    ...
    batch_size = 1000  # tunable; fetchmany expects an integer
    total = 0
    batch = cur.fetchmany(batch_size)
    while batch:
        for row in batch:
            total += compute_score(row)
        batch = cur.fetchmany(batch_size)
    ...
    return total
Relying on the cursor's iterator
def sum_scores(dbname, tablename):
    ...
    total = 0
    for row in cur:
        total += compute_score(row)
    ...
    return total
Questions
Was my thinking correct in that my 3 proposed solutions would only pull in manageably sized chunks of data at a time? Or do they suffer from the same problem as fetchall?
Which of the 3 proposed solutions would work (i.e. compute the correct score combination and not crash in the process) for LARGE datasets?
How does the cursor's iterator (Proposed Solution #3) actually pull in data into Python's memory? One-by-one, in batches, or all at once?
All 3 solutions will work, and only bring a subset of the results into memory.
Iterating via the cursor, Proposed Solution #3, will work the same as Proposed Solution #2 if you pass a name to the cursor. Iterating over the cursor will then fetch itersize records at a time (the default is 2000).
Solutions #2 and #3 will be much quicker than #1, because there is far less round-trip overhead.
http://initd.org/psycopg/docs/cursor.html#fetch
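To make the "pass a name to the cursor" part concrete, here is a minimal sketch of a server-side (named) cursor in psycopg2; the cursor name is arbitrary and the identifier composition is an illustrative choice:

import psycopg2
from psycopg2 import sql

def sum_scores(dbname, tablename):
    conn = psycopg2.connect(dbname=dbname)
    # Naming the cursor makes it server-side: rows are streamed in
    # batches of itersize instead of being loaded all at once.
    cur = conn.cursor(name='score_cursor')
    cur.itersize = 2000
    cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename)))
    total = 0
    for row in cur:
        total += compute_score(row)
    conn.close()
    return total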
I am new to Python and to pyodbc. Maybe my question is very simple, but I could not find any answer to it.
I'm using this select statement:
cursor.execute("SELECT [something] FROM [someone] WHERE [user_name]='John'")
rows = cursor.fetchall()
for row in rows:
print row.something
It prints several values, for example 4 or 5. How can I print only the second or only the third value?
I also tried cursor.fetchmany() but I'm having the same problem.
If you want just the 4th row you can do:
rows = cursor.fetchall()
print rows[3].something
But it's better if you do it in the SQL query and avoid fetching all the rows from the database:
SELECT [something] FROM [someone] WHERE [user_name]='John' LIMIT 1 OFFSET 3
I guess you mean field and not parameter:
cursor.execute("SELECT [something] FROM [someone] WHERE [user_name]='John'")
rows = cursor.fetchall()
from row in rows:
print row[1]
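Note that the index refers to the position of the column in the SELECT list, so row[1] only exists when at least two columns are selected. A minimal sketch, assuming hypothetical columns col_a and col_b exist on [someone]:

cursor.execute("SELECT [col_a], [col_b] FROM [someone] WHERE [user_name]='John'")
for row in cursor.fetchall():
    print row[1]  # second selected column, i.e. col_b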
I am facing a performance problem in my code. I make a DB connection, run a select query, and then insert into a table. One select query populates around 500 row IDs. Before inserting, I run the select query around 8-9 times first and then insert them all using cursor.executemany. But it is taking 2 minutes to insert, which is not good. Any ideas?
def insert1(id, state, cursor):
    adding = []
    cursor.execute("select * from qwert where asd_id = %s", [id])
    if some_condition:
        adding.append(rd[i])
    cursor.executemany(indata, adding)
where rd[i] is an array used to build the records and indata is an insert statement
# prog starts here
cursor.execute("select * from assd")
for row in cursor.fetchall():
    if row[1] == 'aq':
        insert1(row[1], row[2], cursor)
    if row[1] == 'qw':
        insert2(row[1], row[2], cursor)
I don't really understand why you're doing this.
It seems that you want to insert a subset of rows from "assd" into one table, and another subset into another table?
Why not just do it with two SQL statements, structured like this:
insert into tab1 select * from assd where asd_id = 42 and cond1 = 'set';
insert into tab2 select * from assd where asd_id = 42 and cond2 = 'set';
That'd dramatically reduce your number of roundtrips to the database and your client-server traffic. It'd also be an order of magnitude faster.
Of course, I'd also strongly recommend that you specify your column names in both the insert and select parts of the code.
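From Python, that reduces the whole job to two execute() calls. A minimal sketch, reusing the table names and conditions from the SQL above; the asd_id value is a placeholder:

def insert_all(cursor, asd_id):
    # One round trip per target table instead of one per row.
    cursor.execute(
        "insert into tab1 select * from assd where asd_id = %s and cond1 = 'set'",
        [asd_id],
    )
    cursor.execute(
        "insert into tab2 select * from assd where asd_id = %s and cond2 = 'set'",
        [asd_id],
    )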