I have a list, and len(list) is currently 199, but it can grow during runtime at the same pace that the columns of my table increase.
I want to INSERT these elements into a table in SQLite.
Code so far:
conn.execute('INSERT INTO Financial_data VALUES (?)',list)
I have read the docs and other Stack Overflow questions, but I cannot place 199 question marks in there, and even if I could, maybe after 20 seconds of running there will be 200 columns; what then?
What I want to do is fill all the columns of a row with data.
Before the above code I have this, which appends to a list one element for every header that exists in the SQL database; if my other source has no value for one of the headers, it appends None, so that I can dump an entire row at once.
if header in lista_header_tabel:
    lista_valori.append(valoare)
else:
    lista_valori.append(None)
This matters because the insert part of the program is running really slowly and I don't know why. I tried wrapping the statements in a conn.execute("begin") and it improves performance, but...
Shouldn't SQLite be able to handle sizes of, say, 20-50K rows with 200 columns? At this point I get better performance using shelve or JSON than using SQLite.
Where am I going wrong?
Current working solution, but slow after 2k rows:
conn.execute("begin")
for celula in rand:
    if sheet_results.cell(row=1, column=celula.col_idx).value in lista_header_tabel and celula.value is not None:
        header_coloana = sheet_results.cell(row=1, column=celula.col_idx).value
        valoare_aferenta = str(celula.value).replace('"', '^')
        nume_fara_ghilimele = str(rand[1].value).replace('"', '^')
        query = 'UPDATE Financial_data SET "' + header_coloana + '" = "' + valoare_aferenta + '" WHERE `Company name`="' + nume_fara_ghilimele + '"'
        # print(query)
        conn.execute(query)
conn.commit()
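For what it's worth, the usual pattern for a variable number of columns is to build the placeholder string from the length of the value list and insert the whole row in one statement (a minimal sketch; the three-column table here is a made-up stand-in for Financial_data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in schema for Financial_data.
conn.execute('CREATE TABLE Financial_data ("Company name" TEXT, col1 TEXT, col2 TEXT)')

lista_valori = ["Acme", "1.5", None]  # one value (or None) per column
# One "?" per value, regardless of how many columns the table has grown to.
placeholders = ",".join("?" * len(lista_valori))  # "?,?,?"
conn.execute("INSERT INTO Financial_data VALUES (%s)" % placeholders, lista_valori)
conn.commit()

print(conn.execute("SELECT * FROM Financial_data").fetchall())
# → [('Acme', '1.5', None)]
```

Only the placeholders are built by string formatting; the values themselves still go through parameter binding, so quoting is never an issue.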
I've got this code. It does what I want, but it only works the first time I run it; the second time around it enters an infinite loop for some reason.
I'm using FastAPI and SQLAlchemy with MySQL.
As an alternative to this, just to avoid the XY problem: what I am actually intending to do is make a table that contains variable columns (as populated by the table "Attributes"). This could be a view, but I haven't really found a way to make a view that contains everything I want.
The end goal is to collect all relevant columns and rows of my database in a single view or table, and connect that to a piece of software called Altium.
def get_altium_plugin():
    engine.execute("DROP TABLE IF EXISTS altium_plugin")
    aRows = session.query(Attributes).group_by(Attributes.name).all()
    cRows = session.query(Component_Attributes).all()
    concat = " "
    x = 0
    while x < len(aRows):
        concat = concat + (aRows[x].name + " VARCHAR(20), ")
        x = x + 1
    print(concat)
    concat = concat[:-2]
    engine.execute("CREATE TABLE IF NOT EXISTS altium_plugin (id INTEGER PRIMARY KEY AUTO_INCREMENT," + concat + ")")
    x = 0
    y = 0
    z = 1
    while x < len(aRows):
        while y < len(cRows):
            if cRows[y].attribute_id == aRows[x].id:
                dbstring = "INSERT INTO altium_plugin (" + aRows[x].name + ") VALUES ('" + cRows[y].value + "')"
                engine.execute(dbstring)
                print("X :" + str(x) + ", Y: " + str(y) + ", " + dbstring)
                z = z + 1
            y = y + 1
        y = 0
        x = x + 1
    rows = session.query(Altium_Plugin).all()
    return rows
I've got the above code. I've tried doing it with non-raw SQL and that throws errors; I've also run it directly on the SQL side as a script, and it works as raw SQL. But the second time you run this function after starting the Python script, it gets caught by the
engine.execute("DROP TABLE IF EXISTS altium_plugin")
and just loops infinitely, or at least it doesn't continue beyond that point while trying to process something.
EDIT: It seems that after running the code once, I'm no longer able to run scripts directly within MySQL Workbench until I stop my Python process. The engine is probably still holding something open on the SQL side while doing nothing but taking up space and processing power. How can I tell the engine to let go without killing the engine outright?
DROP TABLE ...
... just loops infinitely

I wouldn't be surprised to find the occasional lock outstanding if you query the locks table. An ALTER or DROP has to acquire an exclusive lock before it can begin. It doesn't loop; it just hangs, patiently waiting for the lock to be granted.

I didn't notice any COMMIT or ROLLBACK statements in your code. Dropping the TCP connection, or bouncing the backend DB daemon, are other (more violent) ways of releasing locks, including reader locks. Notice that your interactive workbench can hold uncommitted transaction locks as well.

I recommend you COMMIT before attempting DDL.
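The ordering can be illustrated with plain DB-API code (sqlite3 is used here only because it needs no server; the same commit-before-DDL discipline is what releases the locks on MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE altium_plugin (id INTEGER PRIMARY KEY)")
cur.execute("INSERT INTO altium_plugin (id) VALUES (1)")

# Release this connection's transaction locks BEFORE issuing DDL,
# so the DROP is not left waiting on an open transaction.
conn.commit()

cur.execute("DROP TABLE IF EXISTS altium_plugin")
conn.commit()
```

The same applies to the Workbench session: an uncommitted SELECT there can block the Python side's DROP just as easily.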
I want to insert billions of values (exchange rates) into an SQLite db file. I want to use threading because it takes a lot of time, but the thread-pool loop executes the same nth element multiple times. I have a print statement at the beginning of my method, and it prints multiple times instead of just once.
pool = ThreadPoolExecutor(max_workers=2500)

def gen_nums(i, cur):
    global x
    print('row number', x, ' has started')
    gen_numbers = list(mydata)
    sql_data = []
    for f in gen_numbers:
        sql_data.append((f, i, mydata[i]))
    cur.executemany('INSERT INTO numbers (rate, min, max) VALUES (?, ?, ?)', sql_data)
    print('row number', x, ' has finished')
    x += 1

with conn:
    cur = conn.cursor()
    for i in mydata:
        pool.submit(gen_nums, i, cur)
pool.shutdown(wait=True)
and the output is:
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
...
Divide your data into chunks on the fly using generator expressions, and make the inserts inside a single transaction.
Also, SQLite has the ability to import CSV files.
Sqlite can do tens of thousands of inserts per second, just make sure to do all of them in a single transaction by surrounding the inserts with BEGIN and COMMIT. (executemany() does this automatically.)
As always, don't optimize before you know speed will be a problem. Test the easiest solution first, and only optimize if the speed is unacceptable.
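A minimal single-threaded sketch of the chunked approach (SQLite serializes writes anyway, so 2500 threads sharing one cursor mostly adds contention; the table and data below are illustrative):

```python
import sqlite3
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (rate REAL, min REAL, max REAL)")

rows = ((float(i), 0.0, 1.0) for i in range(100000))  # stand-in exchange-rate data
with conn:  # one transaction wrapped around all the inserts
    for batch in chunks(rows, 10000):
        conn.executemany("INSERT INTO numbers (rate, min, max) VALUES (?, ?, ?)", batch)

print(conn.execute("SELECT COUNT(*) FROM numbers").fetchone()[0])
# → 100000
```

On a typical machine this inserts the full 100k rows in well under a second, precisely because there is only one BEGIN/COMMIT pair.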
Currently I have a database filled with thousands of rows.
I want to SELECT the first 100 rows, and then select the next 100, then the next 100 and so on...
So far I have:
c.execute('SELECT words FROM testWords')
data = c.fetchmany(100)
This allows me to get the first 100 rows, however, I can't find the syntax for selecting the next 100 rows after that, using another SELECT statement.
I've seen that it is possible with other languages, but haven't found a solution with Python's sqlite3.
When you are using cursor.fetchmany() you don't have to issue another SELECT statement. The cursor is keeping track of where you are in the series of results, and all you need to do is call c.fetchmany(100) again until that produces an empty result:
c.execute('SELECT words FROM testWords')
while True:
    batch = c.fetchmany(100)
    if not batch:
        break
    # each batch contains up to 100 rows
or using the iter() function (which can be used to repeatedly call a function until a sentinel result is reached):
c.execute('SELECT words FROM testWords')
for batch in iter(lambda: c.fetchmany(100), []):
    pass  # each batch contains up to 100 rows
If you can't keep hold of the cursor (say, because you are serving web requests), then using cursor.fetchmany() is the wrong interface. You'll instead have to tell the SELECT statement to return only a selected window of rows, using the LIMIT syntax. LIMIT has an optional OFFSET keyword, together these two keywords specify at what row to start and how many rows to return.
Note that you want to make sure that your SELECT statement is ordered so you get a stable result set you can then slice into batches.
batchsize = 1000
offset = 0
while True:
    c.execute(
        'SELECT words FROM testWords ORDER BY somecriteria LIMIT ? OFFSET ?',
        (batchsize, offset))
    batch = list(c)
    offset += batchsize
    if not batch:
        break
Pass the offset value to the next call of your code if you need to send these batches elsewhere and resume later on.
sqlite3 has nothing to do with Python. It is a standalone database; Python just supplies an interface to it.
As a normal database, sqlite supports standard SQL. In SQL, you can use LIMIT and OFFSET to determine the start and end for your query. Note that if you do this, you should really use an explicit ORDER BY clause, to ensure that your results are consistently ordered between queries.
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100')
...
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100 OFFSET 100')
You can create an iterator and call it multiple times:
def ResultIter(cursor, arraysize=100):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
Or simply like this, for returning the first 5 rows:
num_rows = 5
cursor = dbconn.execute("SELECT words FROM testWords")
for row in cursor.fetchmany(num_rows):
    print("Words= " + str(row[0]) + "\n")
It is said that
Record.objects.order_by('?')[:n]
has performance issues, and it is recommended to do something like this instead (here):
sample = random.sample(xrange(Record.objects.count()), n)
result = [Record.objects.all()[i] for i in sample]
Given that, why not do it directly like this:
result = random.sample(Record.objects.all(), n)
I have no idea what Django actually does in the background when this code runs. Please tell me: is the one-line version at the end more efficient or not? Why?
================ Edit 2013-5-12 23:21 UTC+8 ================
I spent my whole afternoon doing this test.
My computer: CPU Intel i5-3210M, RAM 8G
System: Win8.1 Pro x64, WampServer 2.4-x64 (with Apache 2.4.4, MySQL 5.6.12, PHP 5.4.12), Python 2.7.5, Django 1.4.6
What I did was:
1. Create an app.
2. Build a simple model with an index and a CharField content, then syncdb.
3. Create 3 views that each get a random set of 20 records in one of the 3 different ways above, and output the time used.
4. Modify settings.py so that Django outputs its SQL log to the console.
5. Insert rows into the table, until the number of rows is what I want.
6. Visit the 3 views, and note the SQL query statement, the SQL time, and the total time.
7. Repeat steps 5 and 6 with different numbers of rows in the table (10k, 200k, 1m, 5m).
This is views.py:
def test1(request):
    start = datetime.datetime.now()
    result = Record.objects.order_by('?')[:20]
    l = list(result)  # QuerySets are lazy; force the QuerySet into a list
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % ((end - start).microseconds / 1000))

def test2(request):
    start = datetime.datetime.now()
    sample = random.sample(xrange(Record.objects.count()), 20)
    result = [Record.objects.all()[i] for i in sample]
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))

def test3(request):
    start = datetime.datetime.now()
    result = random.sample(Record.objects.all(), 20)
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))
As #Yeo said, result = random.sample(Record.objects.all(), n) is crap. I won't talk about that.
But interestingly, Record.objects.order_by('?')[:n] was always better than the others, especially when the table is smaller than 1m rows. Here is the data:
and the charts:
So, what's happened?
In the last test, with 5,195,536 rows in the target table, result = random.sample(Record.objects.all(), n) actually did this:
(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=()
Everything in one query, and it took 22 seconds. And
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
actually did this:
(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=()
(3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=()
...20 lines
As you can see, getting one row costs 3 seconds, and I find that the larger the offset, the more time is needed.
But... why?
My thinking is this:
if there were some way to speed up the large-offset query, then
sample = random.sample(xrange(Record.objects.count()), n)
result = [Record.objects.all()[i] for i in sample]
should be the best, except(!) when the table is smaller than 1m rows.
The problem with .order_by('?') is that under the hood it does ORDER BY RAND() (or equivalent, depending on the DB), which basically has to create a random number for each row and then sort. This is a heavy operation and requires a lot of time.
On the other hand, doing Record.objects.all() forces your app to download all objects, and then you choose from them. It is not that heavy on the database side (it will be faster than sorting) but it is heavy on the network and on memory. Thus it can kill your performance as well.
So that's the tradeoff.
Now this is a lot better:
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
simply because it avoids all the problems mentioned above (note that Record.objects.all()[i] gets translated into SELECT * FROM table LIMIT 1 OFFSET i, depending on the DB).
However, it may still be inefficient, since .count() might be slow (as usual: it depends on the DB).
Record.objects.count() gets translated into a very light SQL query:
SELECT COUNT(*) FROM TABLE
Record.objects.all()[0] is also translated into a very light SQL query:
SELECT * FROM TABLE LIMIT 1
With Record.objects.all(), the result usually gets sliced to improve performance:
SELECT * FROM table LIMIT 20; // or something similar
list(Record.objects.all()) will query all the data and put it into a list data structure:
SELECT * FROM TABLE
Thus, any time you convert a QuerySet into a list, that's where the expense happens.
In your example, random.sample() will convert it into a list (if I'm not wrong).
Thus when you do result = random.sample(Record.objects.all(), n), it will evaluate the full QuerySet, convert it into a list, and then randomly pick from that list.
Just imagine if you have millions of records: are you going to query and store them all in a list with millions of elements, or would you rather query them one by one?
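The count-plus-offset approach recommended above looks roughly like this in plain SQL (a sqlite3 sketch with a made-up table; the Django ORM generates the equivalent LIMIT 1 OFFSET i queries):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany("INSERT INTO record (content) VALUES (?)",
                 [("row %d" % i,) for i in range(1000)])

n = 20
total = conn.execute("SELECT COUNT(*) FROM record").fetchone()[0]
# Pick n distinct offsets, then fetch exactly one row per offset.
offsets = random.sample(range(total), n)
result = [conn.execute("SELECT content FROM record LIMIT 1 OFFSET ?", (i,)).fetchone()
          for i in offsets]
print(len(result))  # → 20
```

This never sorts the whole table; the trade-off, as the timings above show, is that a large OFFSET can itself be slow, because the engine still has to skip past all the preceding rows.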
This is a query that totals up every player's game results from a game and displays the players who match the conditions.
select *,
(kills / deaths) as killdeathratio,
(totgames - wins) as losses
from (select gp.name as name,
gp.gameid as gameid,
gp.colour as colour,
Avg(dp.courierkills) as courierkills,
Avg(dp.raxkills) as raxkills,
Avg(dp.towerkills) as towerkills,
Avg(dp.assists) as assists,
Avg(dp.creepdenies) as creepdenies,
Avg(dp.creepkills) as creepkills,
Avg(dp.neutralkills) as neutralkills,
Avg(dp.deaths) as deaths,
Avg(dp.kills) as kills,
sc.score as totalscore,
Count(* ) as totgames,
Sum(case
when ((dg.winner = 1 and dp.newcolour < 6) or
(dg.winner = 2 and dp.newcolour > 6))
then 1
else 0
end) as wins
from gameplayers as gp,
dotagames as dg,
games as ga,
dotaplayers as dp,
scores as sc
where dg.winner <> 0
and dp.gameid = gp.gameid
and dg.gameid = dp.gameid
and dp.gameid = ga.id
and gp.gameid = dg.gameid
and gp.colour = dp.colour
and sc.name = gp.name
group by gp.name
having totgames >= 30
) as h
order by totalscore desc
Now I'm not too sure what's the best way to go but what would in your opinion be to optimize this query?
I run a Q6600 @ 2.4GHz, 4GB of RAM, on a 64-bit Linux Ubuntu 9.04 system, and this query can take up to 6.7 seconds to run (I do have a huge database).
Also, I would like to paginate the results, and executing extra conditions on top of this query is far too slow...
I use Django as a frontend, so any methods involving Python and/or Django would be great. MySQL and Apache2 tweaks are also welcome. And of course, I'm open to changing the query to make it run faster.
Thanks for reading my question; look forward to reading your answers!
Edit: EXPLAIN QUERY RESULTS
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 783 Using filesort
2 DERIVED sc ALL name,name_2 NULL NULL NULL 2099 Using temporary; Using filesort
2 DERIVED gp ref gameid,colour,name name 17 development.sc.name 2
2 DERIVED ga eq_ref PRIMARY,id,id_2 PRIMARY 4 development.gp.gameid 1 Using index
2 DERIVED dg ref gameid,winner gameid 4 development.ga.id 1 Using where
2 DERIVED dp ref gameid_2,colour gameid_2 4 development.ga.id 10 Using where
First of all, the SQL is badly formatted. The most obvious error is the line splitting before each AS clause. The second obvious problem is the use of implicit joins instead of explicit INNER JOIN ... ON ... syntax.
Now to answer the actual question.
Without knowing the data or the environment, the first thing I'd look at would be some of the MySQL server settings, such as sort_buffer and key_buffer. If you haven't changed any of these, go read up on them. The defaults are extremely conservative and can often be raised more than ten times their default, particularly on the large iron like you have.
Having reviewed that, I'd be running pieces of the query to see speed and what EXPLAIN says. The effect of indexing can be profound, but MySQL has a "fingers-and-toes" problem where it just can't use more than one per table. And JOINs with filtering can need two. So it has to descend to a rowscan for the other check. But having said that, dicing up the query and trying different combinations will show you where it starts stumbling.
Now you will have an idea where a "tipping point" might be: this is where a small increase in some raw data size, like how much it needs to extract, will result in a big loss of performance as some internal structure gets too big. At this point, you will probably want to raise the temporary tables size. Beware that this kind of optimization is a bit of a black art. :-)
However, there is another approach: denormalization. In a simple implementation, regularly scheduled scripts will run this expensive query from time-to-time and poke the data into a separate table in a structure much closer to what you want to display. There are multiple variations of this approach. It can be possible to keep this up-to-date on-the-fly, either in the application, or using table triggers. At the other extreme, you could allow your application to run the expensive query occasionally, but cache the result for a little while. This is most effective if a lot of people will call it often: even 2 seconds cache on a request that is run 15 times a second will show a visible improvement.
You could find ways of producing the same data by running half-a-dozen queries that each return some of the data, and post-processing the data. You could also run version of your original query that returns more data (which is likely to be much faster because it does less filtering) and post-process that. I have found several times that five simpler, smaller queries can be much faster - an order of magnitude, sometimes two - than one big query that is trying to do it all.
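The short-lived cache mentioned above can be sketched in a few lines (the two-second TTL and the query function are illustrative stand-ins, not part of the original code):

```python
import time

def make_cached(fn, ttl=2.0):
    """Wrap an expensive zero-argument query so repeat calls
    within `ttl` seconds reuse the last result."""
    state = {"at": 0.0, "value": None}
    def cached():
        now = time.monotonic()
        if now - state["at"] >= ttl:
            state["value"] = fn()  # refresh only when the cache has expired
            state["at"] = now
        return state["value"]
    return cached

calls = []
def expensive_query():
    calls.append(1)          # stand-in for the 6.7-second SQL query
    return [("player", 42)]

get_stats = make_cached(expensive_query, ttl=60.0)
get_stats(); get_stats(); get_stats()
print(len(calls))  # → 1, the underlying query ran only once
```

At 15 requests a second, even a 2-second TTL turns ~30 query executions into 1.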
No index will help you since you are scanning entire tables.
As your database grows the query will always get slower.
Consider accumulating the stats: after every game, insert the row for that game and also increment counters in the player's row. Then you don't need count() and sum(), because the information is already available.
select * is bad most of the time; select only the columns you need
break the select into multiple simple selects, and use temporary tables when needed
the sum(case ...) part could be done with a subselect
MySQL has very bad performance with OR expressions; use two selects which you UNION together
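To illustrate that last point, the OR-to-UNION rewrite looks like this (a minimal sqlite3 demo of the equivalence, with a made-up mini table; in MySQL each UNION branch can then use its own index):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dotaplayers (gameid INTEGER, newcolour INTEGER, winner INTEGER)")
conn.executemany("INSERT INTO dotaplayers VALUES (?, ?, ?)",
                 [(1, 3, 1), (1, 8, 2), (2, 9, 1), (2, 2, 2)])

# Original form: one scan with an OR predicate.
with_or = conn.execute(
    "SELECT COUNT(*) FROM dotaplayers "
    "WHERE (winner = 1 AND newcolour < 6) OR (winner = 2 AND newcolour > 6)"
).fetchone()[0]

# Rewritten form: two simple selects glued together with UNION ALL.
with_union = conn.execute(
    "SELECT COUNT(*) FROM ("
    " SELECT gameid FROM dotaplayers WHERE winner = 1 AND newcolour < 6"
    " UNION ALL"
    " SELECT gameid FROM dotaplayers WHERE winner = 2 AND newcolour > 6)"
).fetchone()[0]

print(with_or, with_union)  # → 2 2
```

UNION ALL (rather than UNION) is safe here because the two branches are mutually exclusive on `winner`, so no row can appear in both.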
Small Improvement
select *,
       (kills / deaths) as killdeathratio,
       (totgames - wins) as losses
from (select gp.name as name,
             gp.gameid as gameid,
             gp.colour as colour,
             Avg(dp.courierkills) as courierkills,
             Avg(dp.raxkills) as raxkills,
             Avg(dp.towerkills) as towerkills,
             Avg(dp.assists) as assists,
             Avg(dp.creepdenies) as creepdenies,
             Avg(dp.creepkills) as creepkills,
             Avg(dp.neutralkills) as neutralkills,
             Avg(dp.deaths) as deaths,
             Avg(dp.kills) as kills,
             sc.score as totalscore,
             Count(1) as totgames,
             Sum(case
                   when ((dg.winner = 1 and dp.newcolour < 6) or
                         (dg.winner = 2 and dp.newcolour > 6))
                   then 1
                   else 0
                 end) as wins
      from gameplayers as gp,
           (select * from dotagames dg1 where dg1.winner <> 0) as dg,
           games as ga,
           dotaplayers as dp,
           scores as sc
      where dp.gameid = gp.gameid
        and dg.gameid = dp.gameid
        and dp.gameid = ga.id
        and gp.gameid = dg.gameid
        and gp.colour = dp.colour
        and sc.name = gp.name
      group by gp.name
      having totgames >= 30
     ) as h
order by totalscore desc
Changes:
1. count(*) changed to count(1).
2. In the FROM clause, the rows from dotagames are filtered down (winner <> 0) before the join, reducing the number of rows joined.