I have an existing table with a large number of entries and I want to calculate a new column for every row. I have only found the following solution. This works, but it's slow as it needs to scan most of the entries of the table.
What I would like is a way to:
Read a row
Calculate the value for the new column based on the contents of the row
Update the row in the database
This way it would only go through the table once and would have linear complexity.
cursor.execute("SELECT tweet FROM Table")
row = cursor.fetchone()
while row is not None:
vader = analyser.polarity_scores(row)
sentiment_vader = vader["compound"]
cursor2.execute(
"UPDATE Table SET sentiment_vader = %s WHERE tweet = %s LIMIT 1",
(sentiment_vader, row[0]))
kody.cnx.commit()
row = cursor.fetchone()
The main performance issue I see is that you commit for each row update, which adds overhead. You should commit at the end of the while loop, or after each batch.
while row is not None:
    ...
else:
    kody.cnx.commit()   # the else clause runs once the loop has finished
Also, if the tweet column is not indexed, create an index on that column so that the UPDATE does not have to do a full table scan for every row.
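A rough sketch of the batched-commit approach (reusing the analyser, cursors, and kody.cnx connection from the question; the index name and the batch size of 1000 are arbitrary, and MySQL needs a prefix length to index a TEXT column):

cursor.execute("CREATE INDEX idx_tweet ON Table (tweet(64))")   # index the lookup column (prefix length for TEXT)

cursor.execute("SELECT tweet FROM Table")
processed = 0
row = cursor.fetchone()
while row is not None:
    score = analyser.polarity_scores(row[0])["compound"]
    cursor2.execute(
        "UPDATE Table SET sentiment_vader = %s WHERE tweet = %s LIMIT 1",
        (score, row[0]))
    processed += 1
    if processed % 1000 == 0:   # commit every 1000 rows instead of every row
        kody.cnx.commit()
    row = cursor.fetchone()
kody.cnx.commit()               # final commit for the last partial batch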
OK, so first, not to criticize the other answers, which are correct given the general assumption that you have to do it in Python.
However, when you really have bulk volumes, chasing a client-side, in-Python answer is often not the best approach. Since you want to update all the rows, and assuming you can translate your polarity_scores algorithm into SQL,
UPDATE Table
SET sentiment_vader = <sql expressing your polarity_scores>;
would be the best performer. There is no back and forth with the database and everything gets committed at once.
Now, I am not saying it's easy or even possible. Often in these cases, even assuming the algorithm can be expressed in SQL, you may have to use work tables to store intermediate results and there is a lot of SQL going on. It's a different skill set than writing Python code.
But, if you truly need performance and you have large volumes, letting the server do the job on its own, in SQL, can be the way to go. That can be done via a series of sql commands, or using stored procedures.
In a previous job, we had explicit instructions to avoid loop-and-write constructs in client code, and code reviews would almost always reject them for bulk data manipulations. I remember advising a colleague that doing a select-then-update loop on a table with potentially up to 5M rows seemed a bad approach. He certainly ignored me at the time, but 3 months later his mission-critical code had all mysteriously shifted to a no-loop approach.
Note however one key conceptual difference: an error on a server-side update would roll back the transaction for all rows indiscriminately, whereas with a loop construct like yours you could choose to commit row by row (even though you don't want that in your case).
The expected performance profile server-side is usually considerably better than O(n) linear time. Most of the time you should be near constant O(1) time, once your queries and indexes are written correctly. Linear time to update would be commercial suicide for an RDBMS vendor. Usually what you see is near-constant time, followed by non-linear and hard-to-predict performance degradation past very high volume thresholds. You will hit linear time sooner when indexes can't be used and the RDBMS falls back to full-table scans for your queries.
Is this MySQLdb? Maybe you can try executemany.
cursor.execute("SELECT tweet FROM Table")
cursor2.executemany(
"UPDATE Table SET sentiment_vader = %s WHERE tweet = %s LIMIT 1",
((analyser.polarity_scores(row)["compound"], row[0]) for row in cursor)
)
kody.cnx.commit()
Just as #abc suggested above, you should also make sure autocommit is set to False, so that each query isn't committed separately during the executemany.
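For example (a small sketch; both MySQLdb and PyMySQL connections expose an autocommit() method, and kody.cnx is the connection object from the question):

kody.cnx.autocommit(False)   # keep the whole executemany inside one transaction
# ... run the executemany above ...
kody.cnx.commit()            # commit everything at once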
Related
I have an INSERT query that gets its values from a SELECT statement. But since the SELECT returns millions of records, it puts too much load on the MySQL server. So we decided to break the SELECT query into parts and execute it with a LIMIT clause.
INSERT INTO target_table
SELECT * FROM source_table
WHERE my_condition = value
...
LIMIT <start>, <end>
We will keep increasing start and end values until SELECT returns 0 rows. I'm also thinking of making this multi-threaded.
How can I do it with PyMySQL?
Do I need to execute the SELECT, get the results and then generate the INSERT?
First of all, to answer your question: in PyMySQL, you get that value as the result of cursor.execute:
execute(query, args=None)
Execute a query
Parameters:
query (str) – Query to execute.
args (tuple, list or dict) – parameters used with query. (optional)
Returns: Number of affected rows
So you could just execute your query repeatedly until you get a value less than your selected range as a result.
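A rough sketch of that loop with PyMySQL (the connection details, the batch size of 10000, and the value placeholder are all illustrative; the ORDER BY is there for the reasons discussed below):

import pymysql

connection = pymysql.connect(host="localhost", user="user", password="secret", db="mydb")  # hypothetical credentials
batch_size = 10000
value = "some_value"   # placeholder for the condition's value
start = 0
with connection.cursor() as cursor:
    while True:
        affected = cursor.execute(
            "INSERT INTO target_table "
            "SELECT * FROM source_table "
            "WHERE my_condition = %s "
            "ORDER BY id "
            "LIMIT %s, %s",
            (value, start, batch_size))
        connection.commit()
        if affected < batch_size:   # fewer rows than the batch size means the SELECT is exhausted
            break
        start += batch_size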
Anyway, please consider:
the first thing you should check is whether you can optimize your select (assuming it's not as simple as in your example), e.g. by adding indexes. You may also want to test the difference between just selecting and actually inserting, to get a rough idea of which part is more relevant.
if the insertion is causing the problem, it can be due to the size of the transaction. In that case, splitting it up will only reduce the problems if you can also split up the transaction (although since you consider executing queries in parallel, this doesn't seem to be a concern)
if a query generates too much (CPU) load, running multiple instances of that query in parallel can, at best, only spread it over multiple cores, which will actually reduce the available CPU time for other queries. If the "load" is related to I/O, limited resources or general responsiveness, it is possible though: e.g. a small query might generate a small temporary table in memory, while a big query generates a big temporary table on disk (although specifically with offset, this is unlikely, see below). Otherwise, you would usually need to add small pauses between (small enough) parts that you run consecutively, to spread the same workload over a longer time.
limit only makes sense if you have an order by (probably by the primary key), otherwise, in successive runs, the m-th row can be a different row than before (because the order is not fixed). This may or may not increase the load (and resource requirements) depending on your indexes and your where-condition.
the same is true for updates to your source table, as if you add or remove a row from the resultset (e.g. changing the value of my_condition of the first row), all successive offsets will shift, and you may skip a row or get a row twice. You will probably need to lock the rows, which might prevent running your queries in parallel (as they lock the same rows), and also might influence the decision if you can split the transaction (see 2nd bullet point).
using an offset requires MySQL to first find and then skip rows. So if you split the query in n parts, the first row will need to be processed n times (and the last row usually once), so the total work (for selecting) will be increased by (n^2-n)/2. So especially if selecting the rows is the most relevant part (see 1st bullet point), this can actually make your situation much worse: just the last run will need to find the same amount of rows as your current query (although it throws most of them away), and might even need more resources for it depending on the effect of order by.
You may be able to get around some of the offset-problems by using the primary key in the condition, e.g. have a loop that contains something like this:
select max(id) as new_max from (
    select id from source_table
    where id > last_id and <your condition>
    order by id limit 1000 -- no offset!
) as batch
Exit the loop if new_max is null, otherwise do the insert:
insert ... select ...
where id > last_id and id <= new_max and <your condition>
Then set last_id = new_max and continue the loop.
It doubles the number of queries, since in contrast to limit with an offset, you need to know the actual id. It still requires your primary key and your where-condition to be compatible (so you may need to add an index that fits). If your search condition finds a significant percentage (more than about 15% or 20%) of your source table, using the primary key might be the best execution plan anyway, though.
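Put together, a minimal sketch of that loop (reusing the connection and value placeholders from the sketch above; the batch size of 1000 and the id/my_condition names follow the examples in this answer and the question):

last_id = 0
with connection.cursor() as cursor:
    while True:
        # find the upper bound of the next batch of at most 1000 matching rows
        cursor.execute(
            "SELECT MAX(id) AS new_max FROM ("
            "  SELECT id FROM source_table "
            "  WHERE id > %s AND my_condition = %s "
            "  ORDER BY id LIMIT 1000) AS batch",
            (last_id, value))
        new_max = cursor.fetchone()[0]
        if new_max is None:     # nothing left to copy
            break
        cursor.execute(
            "INSERT INTO target_table "
            "SELECT * FROM source_table "
            "WHERE id > %s AND id <= %s AND my_condition = %s",
            (last_id, new_max, value))
        connection.commit()
        last_id = new_max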
If you want to parallelize this (depending on your transaction requirements and whether it is potentially worthwhile, see above), you could first get the maximum value of the primary key (select max(id) as max_id from ...), and give each thread a range to work with. E.g. for max_id=3000 and 3 threads, start them with one of (0..1000), (1001..2000), (2001..3000) and include that in the first query:
select max(id) as new_max from (
    select id from source_table
    where id > last_id
      and id >= $threadmin_id and id <= $threadmax_id
      and <your condition>
    order by id limit 1000
) as batch
Whether those ranges are equally sized depends on your data distribution (you may find better ranges for your situation; calculating exact ranges would require executing the query, though, so you probably can't be exact).
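A rough sketch of the range split (illustrative only; each worker should open its own connection and run the keyset loop above with the extra id bounds added to both queries):

from concurrent.futures import ThreadPoolExecutor

def copy_range(thread_min_id, thread_max_id):
    # open a separate connection here and run the loop above, adding
    # "AND id >= %s AND id <= %s" with (thread_min_id, thread_max_id) to both queries
    ...

with connection.cursor() as cursor:
    cursor.execute("SELECT MAX(id) AS max_id FROM source_table")
    max_id = cursor.fetchone()[0] or 0

n_threads = 3
step = -(-max_id // n_threads)   # ceiling division, e.g. 3000 / 3 -> 1000
ranges = [(i * step + 1, min((i + 1) * step, max_id)) for i in range(n_threads)]

with ThreadPoolExecutor(max_workers=n_threads) as pool:
    for bounds in ranges:
        pool.submit(copy_range, *bounds)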
I've got 500K rows I want to insert into PostgreSQL using SQLAlchemy.
For speed, I'm inserting them using session.bulk_insert_mappings().
Normally, I'd break up the insert into smaller batches to minimize session bookkeeping. However, bulk_insert_mappings() uses dicts and bypasses a lot of the traditional session bookkeeping.
Will I still see a speed improvement if I break the insert up into smaller discrete batches, say doing an insert every 10K rows?
If so, should I close the PG transaction after every 10K rows, or leave it open the whole time?
In my experience, you'll see substantial performance improvements if you use INSERT INTO tbl (column1, column2) VALUES (...), (...), ...; as opposed to bulk_insert_mappings, which uses executemany. In this case you'll want to batch the rows at least on a statement level for sanity.
SQLAlchemy supports generating a multi-row VALUES clause for a single INSERT statement, so you don't have to hand-craft the statement.
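A minimal sketch of that approach (the table and column names are made up; Table.insert().values() with a list of dicts renders one multi-row VALUES clause per batch):

from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("postgresql://user:secret@localhost/mydb")   # hypothetical DSN
metadata = MetaData()
tbl = Table("tbl", metadata, autoload_with=engine)                  # reflect the existing table

rows = [{"column1": i, "column2": "value %d" % i} for i in range(500000)]   # illustrative data

batch_size = 10000
with engine.begin() as conn:                        # one transaction for the whole load
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.execute(tbl.insert().values(batch))    # single INSERT ... VALUES (...), (...), ...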
Committing between batches probably won't have much of an effect on the performance, but the reason to do it would be to not keep an open transaction for too long, which could impact other transactions running on the server.
You can also experiment with using COPY to load it into a temporary table, then INSERTing from that table.
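And a rough sketch of the COPY route with psycopg2 (the DSN, table names, and CSV staging format are assumptions):

import csv
import io

import psycopg2

conn = psycopg2.connect("dbname=mydb user=user")            # hypothetical DSN
rows = [(1, "a"), (2, "b")]                                  # illustrative data as tuples
with conn, conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE staging (LIKE tbl INCLUDING DEFAULTS)")
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv)", buf)
    cur.execute("INSERT INTO tbl SELECT * FROM staging")     # move the staged rows into the real table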
I have a scenario where I have to obfuscate data in a database (i.e. scramble it for testing purposes, so it is not possible to see the real data; there is no need to unscramble/unobfuscate it). There are several tables that reference the address_table. I cannot obfuscate the address_table itself, so I figured I would simply change the references in those tables to random other address_table IDs. The address_table contains 6M+ records. So I would create a temp table with all the address IDs and then, when needed, call some function to get a random one from there. I could possibly generate a random value and take that row like:
SELECT * FROM (
    SELECT id, ROWNUM rn FROM myTempTable
)
WHERE rn = x;
where x is some random value generated by dbms_random. Now, although this is what I need, it does not perform anywhere near what I expect.
Another thing I have tried is the sample() function; this (at least on a small table) performs a bit better, but it is still not good enough.
I know there are several threads on this matter, like this one or this one on MySQL, but they do not directly answer it in terms of performance.
Also, I am not limited to using PL/SQL. I know very little PL/SQL; how is it in terms of performance? I mean, it is just another process in the DB server's processing queue; perhaps I could get better performance by doing the processing (I mean generating the update scripts, populating randoms, etc.) on the client side using something like Python, even considering network latency? Does anybody have any experience with this?
Use the SAMPLE clause:
select * from myTempTable SAMPLE(10);
This will return only about 10% of the rows.
If you just want to hide the real data, why don't you take care of that in the SELECT part of the query? Instead of querying:
select column_name from table;
you could select
select scrambling_function(column_name) from table;
scrambling_function can be whatever you like.
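For instance, a minimal sketch using Oracle's built-in ORA_HASH as the scrambling function (the connection details and the street_name column are hypothetical; any other one-way function would do):

import cx_Oracle

conn = cx_Oracle.connect("user", "password", "localhost/orclpdb")   # hypothetical connection details
cur = conn.cursor()
# ORA_HASH maps each value to an opaque bucket number, hiding the real text
cur.execute("SELECT ORA_HASH(street_name) FROM address_table")
for (scrambled,) in cur:
    print(scrambled)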
There is not a good way to sample randomly using SQL that I am aware of. The sample function available in some SQL dialects does not give a sufficiently random sample. The best way is to export the full sample set and use random software to determine the indexes of the rows to be included in your final solution. Or, if you have a simple numeric index (1, 2, 3, ... n) and know how many rows you have to select from, you could upload a list of indexes to include and query against that. Try random.org for random number generation; their API is located at http://www.random.org/clients/http/.
I have lists of about 20,000 items that I want to insert into a table (with about 50,000 rows in it). Most of these items update certain fields in existing rows and a minority will insert entirely new rows.
I am accessing the database twice for each item. First is a select query that checks whether the row exists. Next I insert or update a row depending on the result of the select query. I commit each transaction right after the update/insert.
For the first few thousand entries, I am getting through about 3 or 4 items per second, then it starts to slow down. By the end it takes more than 1/2 second for each iteration. Why might it be slowing down?
My average times are: 0.5 seconds per iteration, divided up as 0.18 s for the select query and 0.31 s for the insert/update. The last 0.01 s is due to a couple of unmeasured processes to do with parsing the data before entering it into the database.
Update
I've commented out all the commits as a test and got no change, so that's not it (any more thoughts on optimal committing would be welcome, though).
As to table structure:
Each row has twenty columns. The first four are TEXT fields (all set with the first insert) and the other 16 are REAL fields, one of which is populated by the initial insert statement.
Over time the 'outstanding' REAL fields will be populated by the process I'm trying to optimize here.
I don't have an explicit index, though one of the fields is a unique key for each row.
I should note that as the database has gotten larger both the SELECT and UPDATE queries have taken more and more time, with a particularly remarkable deterioration in performance in the SELECT operation.
I initially thought this might be some kind of structural problem with SQLite (whatever that means), but haven't been able to find any documentation anywhere that suggests there are natural limits to the program.
The database is about 60 MB now.
I think your bottleneck is that you commit with each insert/update:
I commit each transaction right after the update/insert.
Either stop doing that, or at least switch to WAL journaling; see this answer of mine for why:
SQL Server CE 4.0 performance comparison
If you have a primary key you can optimize out the select by using the ON CONFLICT clause with INSERT INTO:
http://www.sqlite.org/lang_conflict.html
EDIT: Earlier I meant to write "primary key" rather than "foreign key"; I fixed it.
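A minimal sqlite3 sketch of both suggestions (the file, table, and column names are made up; INSERT OR REPLACE is the conflict-resolution form described on the linked page, and note that it rewrites the whole row, so it fits best when you supply every column):

import sqlite3

con = sqlite3.connect("data.db")                # hypothetical database file
con.execute("PRAGMA journal_mode=WAL")          # switch to WAL journaling
cur = con.cursor()

items = [("a", 1.0), ("b", 2.0)]                # illustrative (key, value) pairs to apply
for key, value in items:
    # one statement replaces the SELECT + INSERT/UPDATE pair;
    # requires key to be declared PRIMARY KEY (or UNIQUE) on the table
    cur.execute("INSERT OR REPLACE INTO mytable (key, value) VALUES (?, ?)", (key, value))

con.commit()                                    # one commit at the end instead of one per row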
Edit: shame on me. I misread the question and somehow understood this was for MySQL rather than SQLite... Oops.
Please disregard this response, other than to get generic ideas about updating DBMSes. The likely solution to the OP's problem is the overly frequent commits, as pointed out in sixfeetsix's response.
A plausible explanation is that the table gets fragmented.
You can verify this by defragmenting the table every so often and checking whether the performance returns to the 3 or 4 items per second rate. (Which, BTW, is a priori relatively slow, but may depend on hardware, data schema and other specifics.) Of course, you'll need to weigh the amount of time defragmentation takes against the time lost to the slow update rate to find an optimal frequency for the defragmentation.
If the slowdown is effectively caused, at least in part, by fragmentation, you may also look into performing the updates in a particular order. It is hard to be more specific without knowing details of the schema and of the overall statistical profile of the data, but fragmentation is indeed sensitive to the order in which various changes to the database take place.
A final suggestion, to boost the overall update performance, is (if this is possible) to drop a few indexes on the table, perform the updates, and recreate the indexes anew. This counter-intuitive approach works for relatively big updates because the cost of re-creating the indexes is often less than the cumulative cost of maintaining them as the update progresses.
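A rough sketch of that pattern (reusing the sqlite3 connection, cursor, and illustrative items from the earlier sketch; the index, table, and column names are hypothetical):

cur.execute("DROP INDEX IF EXISTS idx_mytable_key")               # drop the index up front
for key, value in items:                                          # the bulk update itself
    cur.execute("UPDATE mytable SET value = ? WHERE key = ?", (value, key))
con.commit()
cur.execute("CREATE INDEX idx_mytable_key ON mytable (key)")      # rebuild the index once at the end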
There are two columns in a table inside a MySQL database. The first column contains the fingerprint while the second one contains the list of documents which have that fingerprint. It's much like an inverted index built by search engines. An instance of a record inside the table is shown below:
34 "doc1, doc2, doc45"
The number of fingerprints is very large (it can range up to trillions). There are basically two operations on the database: inserting/updating a record and retrieving a record according to a match on the fingerprint. The Python snippet for the table definition is:
self.cursor.execute("CREATE TABLE IF NOT EXISTS `fingerprint` (fp BIGINT, documents TEXT)")
And the snippet for insert/update operation is:
if self.cursor.execute("UPDATE `fingerprint` SET documents = CONCAT(documents, %s) WHERE fp = %s",
                       ("," + newDocId, thisFP)) == 0:
    self.cursor.execute("INSERT INTO `fingerprint` VALUES (%s, %s)", (thisFP, newDocId))
The only bottleneck I have observed so far is the query time in MySQL. My whole application is web based, so time is a critical factor. I have also thought of using Cassandra, but I have little knowledge of it. Please suggest a better way to tackle this problem.
Get a high-end database. Oracle has some offerings. SQL Server as well.
TRILLIONS of entries is well beyond the scope of a normal database. This is very high-end, very special stuff, especially if you want decent performance. Also get the hardware for it: this means a decent mid-range server, 128+ GB of memory for caching, and either a decent SAN or a good enough DAS setup via SAS.
Remember, TRILLIONS means:
1000 GB used for EVERY BYTE.
If the fingerprint is stored as an int64, this is 8000 GB of disk space alone for this data.
Or do you plan to run that from a small cheap server with a couple of 2 TB disks? Good luck.
That data structure isn't a great fit for SQL - the 'correct' design in SQL would be to have a row for each fingerprint/document pair, but querying would be impossibly slow unless you add an index that would take up too much space. For what you are trying to do, SQL adds a lot of overhead to support functions you don't need while not supporting the multiple value column that you do need.
A redis cluster might be a good fit - the atomic set operations should be perfect for what you are doing, and with the right virtual memory setup and consistent hashing to distribute the fingerprints across nodes it should be able to handle the data volume. The commands would then be
SADD fingerprint docid
to add or update the record, and
SMEMBERS fingerprint
to get all the document ids with that fingerprint.
SADD is O(1). SMEMBERS is O(n), but n is the number of documents in the set, not the number of documents/fingerprints in the system, so effectively also O(1) in this case.
The SQL insert you are currently using is O(n) with n being the very large total number of records, because the records are stored as an ordered list which must be reordered on insert rather than a hash table which is constant time for both get and set.
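A minimal redis-py sketch of those two operations (assuming a local Redis instance; the key naming scheme is illustrative):

import redis

r = redis.Redis(host="localhost", port=6379)

def add_document(fingerprint, doc_id):
    # SADD is a no-op if doc_id is already in the set, so insert and update are the same call
    r.sadd("fp:%d" % fingerprint, doc_id)

def get_documents(fingerprint):
    # returns the set of document ids sharing this fingerprint
    return r.smembers("fp:%d" % fingerprint)

add_document(34, "doc1")
add_document(34, "doc2")
print(get_documents(34))    # e.g. {b'doc1', b'doc2'}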
Greenplum data warehouse, FOC, postgres driven, good luck ...