Inserting billions of rows into SQLite via Python

I want to insert billions of values (exchange rates) into a SQLite db file. I want to use threading because it takes a lot of time, but the thread pool loop executes the same nth element multiple times. I have a print statement at the beginning of my method, and it prints out multiple times instead of just once.
pool = ThreadPoolExecutor(max_workers=2500)

def gen_nums(i, cur):
    global x
    print('row number', x, ' has started')
    gen_numbers = list(mydata)
    sql_data = []
    for f in gen_numbers:
        sql_data.append((f, i, mydata[i]))
    cur.executemany('INSERT INTO numbers (rate, min, max) VALUES (?, ?, ?)', sql_data)
    print('row number', x, ' has finished')
    x += 1

with conn:
    cur = conn.cursor()
    for i in mydata:
        pool.submit(gen_nums, i, cur)
    pool.shutdown(wait=True)
and the output is:
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
...

Divide your data into chunks on the fly using generator expressions, and make the inserts inside a transaction.
Here is how your code might look.
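For example, a minimal sketch of that approach (assuming, as in the question, that mydata is an iterable of (rate, min, max) tuples; the chunked() helper, batch size, and file name are illustrative):

import sqlite3
from itertools import islice

def chunked(iterable, size=100000):
    # Yield successive lists of up to `size` items, so the whole
    # dataset never has to sit in memory at once.
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

conn = sqlite3.connect('rates.db')
for batch in chunked(mydata):
    with conn:  # each batch runs inside a single BEGIN/COMMIT transaction
        conn.executemany('INSERT INTO numbers (rate, min, max) VALUES (?, ?, ?)', batch)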
Also, SQLite has the ability to import CSV files.
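For instance, from the sqlite3 command-line shell (the file and table names here are illustrative):

$ sqlite3 rates.db
sqlite> .mode csv
sqlite> .import rates.csv numbers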
SQLite can do tens of thousands of inserts per second; just make sure to do all of them in a single transaction by surrounding the inserts with BEGIN and COMMIT. (executemany() does this automatically.)
As always, don't optimize before you know speed will be a problem. Test the easiest solution first, and only optimize if the speed is unacceptable.

Related

How can I fix a for loop to do other things

Currently I have a for loop which retrieves data from the database for every row.
I also need to run a while loop, but it would not run, as the for loop finishes once it has retrieved the database data. As a result, this stops the rest of my while True loop from awaiting a user response:
c.execute("SELECT * FROM maincores WHERE set_status = 1")
rows = c.fetchall()
for v in rows:
# skip
while True:
#skip
I have tried using a global variable to store the database data and then returning to the loop, all resulting in failure.
How can I get sqlite3 database information without using a for loop?
I'm not 100% sure about the problem, but I think you might want to use a generator so that you throttle your intake of information with your loop. So, you could write a function like:
def getDBdata():
    c.execute("SELECT * FROM maincores WHERE set_status = 1")
    rows = c.fetchall()
    for v in rows:
        yield v  # just returns one result at a time
x = True
data = getDBdata()
while x is True:
    # do something with data
    if <condition>:
        next(data)  # get the next row of data
    else:
        x = False
So now you are controlling the data flow from your DB, so that your while loop is not exhausted by the data retrieval.
My apologies if I'm not answering the question you're asking, but I hope this helps.

SQLite3: How to SELECT the first 100 rows from a database, then the next 100

Currently I have a database filled with 1000s of rows.
I want to SELECT the first 100 rows, then the next 100, then the next 100, and so on...
So far I have:
c.execute('SELECT words FROM testWords')
data = c.fetchmany(100)
This allows me to get the first 100 rows; however, I can't find the syntax for selecting the next 100 rows after that using another SELECT statement.
I've seen that it is possible in other languages, but I haven't found a solution with Python's sqlite3.
When you are using cursor.fetchmany() you don't have to issue another SELECT statement. The cursor is keeping track of where you are in the series of results, and all you need to do is call c.fetchmany(100) again until that produces an empty result:
c.execute('SELECT words FROM testWords')
while True:
    batch = c.fetchmany(100)
    if not batch:
        break
    # each batch contains up to 100 rows
or using the iter() function (which can be used to repeatedly call a function until a sentinel result is reached):
c.execute('SELECT words FROM testWords')
for batch in iter(lambda: c.fetchmany(100), []):
    pass  # each batch contains up to 100 rows
If you can't keep hold of the cursor (say, because you are serving web requests), then using cursor.fetchmany() is the wrong interface. You'll instead have to tell the SELECT statement to return only a selected window of rows, using the LIMIT syntax. LIMIT takes an optional OFFSET keyword; together these two keywords specify at what row to start and how many rows to return.
Note that you want to make sure that your SELECT statement is ordered so you get a stable result set that you can then slice into batches.
batchsize = 1000
offset = 0
while True:
    c.execute(
        'SELECT words FROM testWords ORDER BY somecriteria LIMIT ? OFFSET ?',
        (batchsize, offset))
    batch = list(c)
    offset += batchsize
    if not batch:
        break
Pass the offset value to the next call of your code if you need to send these batches elsewhere and resume later on.
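For example, a sketch of such a resumable fetch (the function name and the connection handling are illustrative, not part of the answer above):

def fetch_batch(conn, offset, batchsize=1000):
    # Fetch one window of rows and return it together with the
    # offset to pass to the next call.
    c = conn.cursor()
    c.execute(
        'SELECT words FROM testWords ORDER BY somecriteria LIMIT ? OFFSET ?',
        (batchsize, offset))
    return c.fetchall(), offset + batchsize

# e.g. serve one request, then hand next_offset to the following request
batch, next_offset = fetch_batch(conn, offset=0)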
sqlite3 has nothing to do with Python; it is a standalone database, and Python just supplies an interface to it.
As a normal database, SQLite supports standard SQL. In SQL, you can use LIMIT and OFFSET to determine the start and end of your query. Note that if you do this, you should really use an explicit ORDER BY clause to ensure that your results are consistently ordered between queries.
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100')
...
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100 OFFSET 100')
You can create an iterator and call it multiple times:
def ResultIter(cursor, arraysize=100):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
Or simply like this for returning the first 5 rows:
num_rows = 5
cursor = dbconn.execute("SELECT words FROM testWords")
for row in cursor.fetchmany(num_rows):
    print("Words= " + str(row[0]) + "\n")

How to speed up insertion into Redis from SQL Query using Python

I have a SQL query that I execute, and its results come into my Python program in ~500 ms (about 100k rows).
I want to quickly insert these into Redis, but it currently takes ~6 s, even with pipelining.
pipe = r.pipeline()
for row in q:
    pipe.zincrby(SKEY, row["name"], 1)
pipe.execute()
Is there a way to speed this up?
The problem is that you insert a large number of items into a sorted set. The Redis docs say that the time complexity of ZINCRBY is O(log(N)), where N is the number of elements in the sorted set. So the more items you insert, the longer it takes. You should probably rethink the way you use Redis in this case; maybe a sorted set is not the best answer to your use case.
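For example, if all you need is a per-name counter, a plain hash might fit better, since HINCRBY is O(1) per increment (a hypothetical sketch; HKEY is a made-up key name):

pipe = r.pipeline()
for row in q:
    pipe.hincrby(HKEY, row["name"], 1)  # O(1) per increment, vs O(log N) for zincrby
pipe.execute()

The tradeoff is that a hash cannot answer the rank-ordered queries a sorted set can.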
In general there's no way to speed this up from Redis's perspective, but there are two things you can do:
1. If keys repeat themselves, try reducing the number of rows by summing up the names before calling Redis, i.e.:
d = dict()
for row in q:
    name = row["name"]
    d[name] = d.get(name, 0) + 1
Then, if you have recurring names, you'll issue fewer commands to Redis.
2. Another thing I would try is to call execute() every, say, 1000 or 5000 commands; that way Redis would not be blocked for other callers while this executes, and Python itself would allocate less memory, which might speed things up.
e.g. (combined with the above):
d = dict()
for row in q:
    name = row["name"]
    d[name] = d.get(name, 0) + 1

pipe = r.pipeline()
for i, (k, v) in enumerate(d.iteritems()):  # d.items() on Python 3
    pipe.zincrby(SKEY, k, v)
    if i > 0 and i % 5000 == 0:
        pipe.execute()  # flush the pipeline periodically
pipe.execute()

Getting a random record set with Django: what is affecting the performance?

It is said that
Record.objects.order_by('?')[:n]
has performance issues, and doing something like this is recommended instead:
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
Given that, why not do it directly like this:
result = random.sample(Record.objects.all(),n)
I have no idea what Django is actually doing in the background when this code runs. Please tell me: is the last one-line version more efficient or not? Why?
================ Edit 2013-5-12 23:21 UTC+8 ================
I spent my whole afternoon doing this test.
My computer: Intel i5-3210M CPU, 8 GB RAM.
System: Win 8.1 Pro x64, WampServer 2.4 x64 (Apache 2.4.4, MySQL 5.6.12, PHP 5.4.12), Python 2.7.5, Django 1.4.6.
What I did was:
1. Create an app.
2. Build a simple model with an index and a CharField content, then run syncdb.
3. Create 3 views that each get a random set of 20 records in the 3 different ways above, and output the time used.
4. Modify settings.py so that Django outputs the SQL log to the console.
5. Insert rows into the table until the number of rows is what I want.
6. Visit the 3 views and note the SQL query statement, the SQL time, and the total time.
7. Repeat steps 5 and 6 with different numbers of rows in the table (10k, 200k, 1m, 5m).
This is views.py:
def test1(request):
    start = datetime.datetime.now()
    result = Record.objects.order_by('?')[:20]
    l = list(result)  # QuerySets are lazy; force evaluation by converting to a list
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))

def test2(request):
    start = datetime.datetime.now()
    sample = random.sample(xrange(Record.objects.count()), 20)
    result = [Record.objects.all()[i] for i in sample]
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))

def test3(request):
    start = datetime.datetime.now()
    result = random.sample(Record.objects.all(), 20)
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))
As @Yeo said, result = random.sample(Record.objects.all(), n) is crap. I won't talk about that.
But interestingly, Record.objects.order_by('?')[:n] is always better than the others, especially when the table is smaller than 1m rows. Here is the data:
(timing table and charts not reproduced here)
So, what happened?
In the last test, with 5,195,536 rows in the target table, result = random.sample(Record.objects.all(), n) actually did this:
(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=()
Everyone is right: it used 22 seconds. And
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
actually did this:
(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=()
(3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=()
...20 lines
As you can see, fetching one row cost 3 seconds, and I found that the larger the offset, the more time is needed.
But... why?
My thinking is: if there were some way to speed up the large-offset query, then
sample = random.sample(xrange(Record.objects.count()), n)
result = [Record.objects.all()[i] for i in sample]
should be the best approach. Except(!) when the table is smaller than 1m rows.
The problem with .order_by('?') is that under the hood it does ORDER BY RAND() (or equivalent, depending on the DB), which basically has to create a random number for each row and do the sorting. This is a heavy operation and requires lots of time.
On the other hand, doing Record.objects.all() forces your app to download all objects, and then you choose from them. It is not that heavy on the database side (it will be faster than sorting), but it is heavy on network and memory. Thus it can kill your performance as well.
So that's the tradeoff.
Now this is a lot better:
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
simply because it avoids all the problems mentioned above (note that Record.objects.all()[i] gets translated to SELECT * FROM table LIMIT 1 OFFSET i, depending on DB).
However it may still be inefficient, since .count() might be slow (as usual: depends on the DB).
Record.objects.count() gets translated into a very light SQL query:
SELECT COUNT(*) FROM TABLE
Record.objects.all()[0] is also translated into a very light SQL query:
SELECT * FROM TABLE LIMIT 1
With Record.objects.all(), the results usually get sliced to increase performance:
SELECT * FROM table LIMIT 20; // or something similar
list(Record.objects.all()) will query all the data and put it into a list data structure:
SELECT * FROM TABLE
Thus, any time you convert a QuerySet into a list, that's where the expensive part happens.
In your example, random.sample() will convert it into a list (if I'm not wrong).
Thus when you do result = random.sample(Record.objects.all(), n), it will evaluate the full QuerySet, convert it into a list, and then randomly pick from that list.
Just imagine if you have millions of records: are you going to query and store them all in a list with millions of elements, or would you rather query them one by one?

cx_Oracle query returns zero rows

Why does the code below not work? It returns zero rows even though I have many rows matching the search criteria.
A simple query of the form select * from Table_1 works fine and returns a positive number of rows.
import cx_Oracle

def function_A(data):
    connection = cx_Oracle.connect('omitted details here')
    for index in range(len(data)):
        # connection is already open
        cursor = connection.cursor()
        query = "select * from Table_1 where column_1=:column1 and column_2=:column2 and column_3=:column3 and column_4=:column4"
        bindVars = {'column1': data[index][3], 'column2': data[index][4],
                    'column4': data[index][5], 'column5': data[index][6]}
        cursor.execute(query, bindVars)
        cursor.arraysize = 256
        rowCount = 0
        resultSet = cursor.fetchall()
        if (resultSet != None):
            logger.debug("Obtained a resultSet with length = %s", len(resultSet))
            for index in range(len(resultSet)):
                logger.debug("Fetched one row from cursor, incrementing counter !!")
                rowCount = rowCount + 1
                logger.debug("Fetched one row from cursor, incremented counter !!")
        logger.debug("Successfully executed the select statement for table Table_1; that returned %s rows !!", rowCount)
        logger.debug("Successfully executed the select statement for table Table_1; that returned %s rows !!", cursor.rowcount)
Please ignore minor formatting issues; the code runs, it just does not give me a positive number of rows.
The code is being run on IBM AIX with Python 2.6 and a compatible version of cx_Oracle.
cx_Oracle's cursor object has a read-only rowcount property, which returns how many rows have been fetched so far by the fetch* methods.
Say the query yields 5 rows; then the interaction goes like this:
execute     rowcount = 0
fetchone    rowcount = 1
fetchone    rowcount = 2
fetchall    rowcount = 5
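A minimal sketch of that interaction (assuming an already-open cx_Oracle connection and a table T holding exactly 5 rows):

cursor = connection.cursor()
cursor.execute("select * from T")
print(cursor.rowcount)  # 0 - nothing fetched yet
cursor.fetchone()
print(cursor.rowcount)  # 1
cursor.fetchone()
print(cursor.rowcount)  # 2
cursor.fetchall()
print(cursor.rowcount)  # 5 - all rows fetched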
That way you do not need to track it manually. Your query issues will have to be resolved first, of course :)
Your query returns 0 rows because there are 0 rows that match your query. Either remove a predicate from your WHERE clause or change the value you pass into one.
It's worth noting that you're not binding anything to column3 in your bindVars variable (you bind column5 instead, which the query never uses). I'm also not entirely certain why you're iterating; cursor.rowcount, as you have it, gives you the number of rows that have been fetched by the cursor.
Generally, if you think a SELECT statement is not returning the correct result, take it out of the code and run it directly against the database. Bind all variables first so you can see exactly what you're actually running.
I was banging my head against the monitor on this one... you have to do something like the code below to check, as the cursor's state changes once you operate on it:
result_set = DB_connector.execute(sql)
result_list = result_set.fetchall()  # assign the returned rows to a list
sql_result_set = []
if result_set.rowcount == 0:
    print('DB check failed, no row returned')
    sql_result_set = None
else:
    for row in result_list:  # use this list instead of result_set
        print('row fetched: ' + row[0])
        sql_result_set.append(row[0])
    print('DB test Passed')
