I am trying to perform some n-gram counting in Python and I thought I could use MySQL (the MySQLdb module) for organizing my text data.
I have a pretty big table, around 10 million records, representing documents that are indexed by a unique numeric id (auto-increment) and by a language varchar field (e.g. "en", "de", "es", and so on).
A plain select * from table is too slow and eats far too much memory.
I ended up splitting the whole id range into smaller ranges (say 2,000 records wide each) and processing each of those smaller record sets one by one with queries like:
select * from table where id >= 1 and id <= 2000
select * from table where id >= 2001 and id <= 4000
and so on...
Is there any way to do it more efficiently with MySQL and achieve similar performance to reading a big corpus text file serially?
I don't care about the ordering of the records, I just want to be able to process all the documents that pertain to a certain language in my big table.
You can use the HANDLER statement to traverse a table (or index) in chunks. It is not very portable and interacts in "interesting" ways with transactions if rows appear and disappear while you're looking at them (hint: you're not going to get consistency), but it makes the code simpler for some applications.
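A rough sketch of what that looks like from Python with MySQLdb (the table and column names are placeholders for your schema, and the chunk size is arbitrary):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="corpus")
cur = conn.cursor()

# Open a handler on the table and walk its primary key 2000 rows at a time.
cur.execute("HANDLER documents OPEN")
cur.execute("HANDLER documents READ `PRIMARY` FIRST LIMIT 2000")
rows = cur.fetchall()
while rows:
    for row in rows:
        pass  # do the n-gram counting here (skip rows whose language you don't want)
    cur.execute("HANDLER documents READ `PRIMARY` NEXT LIMIT 2000")
    rows = cur.fetchall()
cur.execute("HANDLER documents CLOSE")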
In general, you are going to take a performance hit: even if the database server is local to the machine, several copies of the data have to exist in memory, on top of the other processing involved. This is unavoidable, and if it really bothers you, you shouldn't use MySQL for this purpose.
Aside from having indexes defined on whatever columns you're using to filter the query (probably language and id, where id already has an index courtesy of the primary key), no.
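For example (assuming the table is called documents and the language column is lang; adjust to your schema), a composite index lets MySQL restrict to one language and still walk the rows in id order:

ALTER TABLE documents ADD INDEX idx_lang_id (lang, id);
SELECT id, doc FROM documents WHERE lang = 'en' AND id BETWEEN 1 AND 2000;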
First: you should avoid using * if you can specify the columns you need (lang and doc in this case). Second: unless you change your data very often, I don't see the point of storing all
of this in a database, especially if you are storing file names. You could use an XML format, for example, and read/write it with a SAX API.
If you want a DB and something faster than MySQL, you could consider an embedded database such as SQLite or BerkeleyDB, both of which have Python bindings.
I am attempting to make an UPDATE query run faster in Postgres. The query is relatively simple, and I have broken it up to spread it across all of the CPUs on my database server.
UPDATE p797.line a SET p = 5 FROM p797.pt b WHERE a.source = b.node AND a.id >= 0 and a.id < 40000000
where "0" and "40000000" are replaced with different values as you move through all the rows in the table. The "line" table has 1.3 billion records and the pt table has 500 million.
Right now this process runs in about 16 hours. I have other update queries that I need to perform and if each takes 16 hours, the results will take weeks to acquire.
I found something interesting that I would like to try, but am unsure if it can be implemented in my case as I am running queries over a Network.
Slow simple update query on PostgreSQL database with 3 million rows
Here, Le Droid makes reference to COPY, a method which I believe I cannot employ since I am running over a network. They also use BUFFER, which I do not understand how to employ. Also, both my tables reside in the same database, rather than one database table plus a CSV file. How can I massage my query to get the gains that @Le Droid mentions? Is there another methodology I can employ to see time gains? I did see Le Droid mention that HOT only gives marginal gains at a lot of cost. Other methods?
It might also be noteworthy that I am creating the queries in Python and sending them to the Postgres database using psycopg2.
EDIT:
Here is an EXPLAIN on the above statement without the ID limitation:
"Update on line a (cost=10665536.12..342338721.96 rows=1381265438 width=116)"
" -> Hash Join (cost=10665536.12..342338721.96 rows=1381265438 width=116)"
" Hash Cond: (a.source= b.node)"
" -> Seq Scan on line a (cost=0.00..52953645.38 rows=1381265438 width=102)"
" -> Hash (cost=8347277.72..8347277.72 rows=126271072 width=22)"
" -> Seq Scan on pt b (cost=0.00..8347277.72 rows=126271072 width=22)"
Frankly I'd extract the data, apply all transformations outside the database, then reload it. So an ETL, but with the E and the L being the same table. The transactional guarantees the database provides do not come cheap, and if I didn't need them I wouldn't want to pay that price in this situation.
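A sketch of what that could look like with psycopg2's COPY support, since you're already in Python (the staging table p797.line_new and the exact column list are placeholders; note that COPY ... TO STDOUT / FROM STDIN streams through the client connection, so it works fine over a network):

import csv
import psycopg2

conn = psycopg2.connect("dbname=mydb host=dbserver user=me")
cur = conn.cursor()

# Extract: stream the join out of the database as CSV.
with open("line_dump.csv", "w") as f:
    cur.copy_expert(
        "COPY (SELECT a.id, b.node FROM p797.line a JOIN p797.pt b ON a.source = b.node) "
        "TO STDOUT WITH CSV", f)

# Transform: do the cheap part outside the database (here: p = 5 for every matched row).
with open("line_dump.csv") as src, open("line_new.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line_id, node in csv.reader(src):
        writer.writerow([line_id, 5])

# Load: bulk-load into a staging table, then update or swap from it in one set-based pass.
cur.execute("CREATE TABLE p797.line_new (id bigint, p integer)")
with open("line_new.csv") as f:
    cur.copy_expert("COPY p797.line_new FROM STDIN WITH CSV", f)
conn.commit()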
Dropping all indexes from the table you are updating gives a tremendous performance boost for updates, on the order of 100x faster, even if the indexes are not related to the columns you are updating or joining on.
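Roughly like this (the index name here is invented; look up the real ones in pg_indexes before dropping anything):

DROP INDEX IF EXISTS p797.line_source_idx;
UPDATE p797.line a SET p = 5 FROM p797.pt b WHERE a.source = b.node;
CREATE INDEX line_source_idx ON p797.line (source);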
I have a script that repopulates a large database and has to look up or generate id values from other tables as needed.
An example would be recording order information when given only customer names. I check whether the customer exists in a CUSTOMER table. If so, I run a SELECT query to get the ID and insert the new record; otherwise I create a new CUSTOMER entry and get LAST_INSERT_ID().
Since these values repeat a lot and I don't always need to generate a new ID, would it be better to store the ID => CUSTOMER relationship as a dictionary that gets checked before reaching the database, or should the script constantly requery the database? I'm thinking the first approach is best since it reduces load on the database, but I'm concerned about how large the ID dictionary would get and the impact of that.
The script is running on the same box as the database, so network delays are negligible.
"Is it more efficient"?
Well, a dictionary stores its values in a hash table, which makes looking up a value quite efficient.
The major downside is maintaining the dictionary. If you know the database is not going to be updated, then you can load it once and the in-application memory operations are probably going to be faster than anything you can do with a database.
However, if the data is changing, then you have a real challenge. How do you keep the memory version aligned with the database version? This can be very tricky.
My advice would be to keep the work in the database, using indexes for the dictionary key. This should be fast enough for your application. If you need to eke out further speed, then using a dictionary is one possibility -- but no doubt, one possibility out of many -- for improving the application performance.
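If you do go the dictionary route, a minimal sketch looks something like this (assuming MySQLdb and a CUSTOMER table with id and name columns, which are guesses at your schema; it is only safe as long as nothing else inserts customers behind your back):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="orders")
cur = conn.cursor()

customer_ids = {}  # name -> id cache, checked before touching the database

def get_customer_id(name):
    if name in customer_ids:
        return customer_ids[name]
    cur.execute("SELECT id FROM CUSTOMER WHERE name = %s", (name,))
    row = cur.fetchone()
    if row:
        cid = row[0]
    else:
        cur.execute("INSERT INTO CUSTOMER (name) VALUES (%s)", (name,))
        cid = cur.lastrowid  # same value LAST_INSERT_ID() would give you
    customer_ids[name] = cid
    return cid

A few million name-to-id pairs typically costs on the order of a few hundred megabytes at worst, so the size of the dictionary is usually less of a problem than keeping it in sync.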
I'm trying to use a Python script to parse the Wikipedia archives. (Yeah, I know.) Of course:
Wikipedia XML: 45.95 GB
Available memory: 16 GB
This precludes loading the file into memory, and going into virtual memory isn't going to fare much better. So in order to work with the data, I decided to parse the necessary information into a SQLite database. For the XML parsing, I used the ElementTree library, which performs quite well. I confirmed that when running ONLY the XML parsing (with the database calls commented out), it runs in linear time, with no slowdown as it traverses the file.
The problem comes with trying to insert MILLIONS of rows into the SQLite database (one per Wikipedia article). The simple version of the table that I'm using for testing is as follows:
CREATE TABLE articles(
    id INTEGER NOT NULL PRIMARY KEY,
    title TEXT NOT NULL UNIQUE ON CONFLICT IGNORE);
So I just have the id and a text field during this initial phase. When I start adding rows via:
INSERT OR IGNORE INTO articles(title) VALUES(?1);
it performs well at first. But at around 8 million rows in, it begins to slow down dramatically, by an order of magnitude or more.
Some detail is of course needed. I'm using cur.executemany() with a single cursor created before the insert statements. Each call to this function has a batch of about 100,000 rows. I don't call db.commit() until ALL of the million+ rows have been inserted. According to what I've read, executemany() shouldn't commit a transaction until db.commit() as long as there are only INSERT statements.
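The loop is roughly this shape (a simplified sketch; the real code handles XML namespaces and more fields, and the file name is just an example):

import sqlite3
import xml.etree.ElementTree as ET

db = sqlite3.connect("wiki.db")
cur = db.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS articles("
            "id INTEGER NOT NULL PRIMARY KEY, "
            "title TEXT NOT NULL UNIQUE ON CONFLICT IGNORE)")

batch = []
for event, elem in ET.iterparse("enwiki-pages-articles.xml"):
    if elem.tag.endswith("title") and elem.text:
        batch.append((elem.text,))
        if len(batch) >= 100000:
            cur.executemany("INSERT OR IGNORE INTO articles(title) VALUES(?)", batch)
            batch = []
    elem.clear()  # discard parsed elements so memory stays flat while streaming
if batch:
    cur.executemany("INSERT OR IGNORE INTO articles(title) VALUES(?)", batch)
db.commit()  # single commit at the very end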
The source XML being read and the database being written are on two separate disks, and I've also tried creating the database in memory, but I see the slowdown regardless. I also tried the isolation_level=None option, adding the BEGIN TRANSACTION and COMMIT TRANSACTION calls myself at the beginning and end (so the entire parse sequence is one transaction), but it still doesn't help.
Some other questions on this site suggest that indexing is the problem. I don't have any indexes on the table. I did try removing the UNIQUE constraint and just limiting it to id INTEGER PRIMARY KEY and title TEXT NOT NULL but that also had no effect.
What's the best way to perform these types of insertions in SQLite for large data sets? Of course this simple query is just the first of many; there are other queries that will be more complex, involving foreign keys (ids of articles in this table) as well as insert statements with embedded selects (selecting an id from the articles table during an insert). These are bound to have the same problem, but exacerbated by a large margin: whereas the articles table has fewer than 15 million rows, the other tables will probably have over a billion. So these performance issues are even more concerning.
One "invisible" thing happening on insertion is updating a table's indices (and checking index-related constraints such as UNIQUE). Since you're ignoring UNIQUE violations anyway, you may find it useful to disable the indices on the table while you're loading the table, and if you really need them, build the indices once after the loading is complete.
But also beware that SQLite's lightning speed for small data comes from certain implicit assumptions that get increasingly violated when you're processing big data. It may not be an appropriate tool for your current problem on your current hardware.
I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table in chunks (typically 10,000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to having a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible and the client can do whatever it wants with the data.
However, it is not suitable for large amounts of data, because the data is cached in the client process, which can consume a great deal of memory.
In this case, it may be better to use mysql_use_result() (C) or SSCursor / SSDictCursor (Python). This requires you to consume the whole result set and do nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
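With MySQLdb that looks roughly like this (a sketch: the connection parameters, table and column names, and the tab-joined output are placeholders for your actual export):

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host="dbhost", user="me", passwd="secret", db="mydb",
                       cursorclass=MySQLdb.cursors.SSCursor)
cur = conn.cursor()
cur.execute("SELECT id, field1, field2 FROM big_table")  # rows stream; nothing is buffered client-side
with open("export.txt", "w") as out:
    for row in cur:
        # apply your per-row transformation here instead of just tab-joining
        out.write("\t".join(str(col) for col in row) + "\n")
cur.close()
conn.close()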
I don't know exactly what query you used, because you haven't given it here, but I suppose you're specifying LIMIT and OFFSET. These queries are quite quick at the beginning of the data, but become very slow.
If you have a unique column such as ID, you can still fetch only N rows at a time, but modify the WHERE clause instead:
WHERE ID > (last_id)
This uses the index and is acceptably fast.
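In Python the keyset-pagination loop is roughly this (table/column names and the chunk size are placeholders):

import MySQLdb

conn = MySQLdb.connect(host="dbhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()

last_id = 0
while True:
    cur.execute("SELECT id, field1, field2 FROM big_table "
                "WHERE id > %s ORDER BY id LIMIT 10000", (last_id,))
    rows = cur.fetchall()
    if not rows:
        break
    for row in rows:
        pass  # transform and write the row to the output file here
    last_id = rows[-1][0]  # resume after the last id seen; no OFFSET to rescan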
However, it is generally even faster to simply do
SELECT * FROM table
and open a cursor for that query, with a reasonably big fetch size.
Apologies for the longish description.
I want to run a transform on every doc in a largish MongoDB collection with roughly 10 million records (about 10 GB). Specifically, I want to apply a geoip transform to the ip field in every doc and either append the resulting record to that doc or create a whole separate record linked to this one by, say, id (the linking is not critical; I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do that last part).
The major reason I believe I can't use map-reduce is that I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and gets progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()), ideally fetching only the few fields you need (_id, ip), should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and it doesn't fit the previously allocated space, Mongo will have to move the document somewhere else, so you might be better off creating a new document.
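A rough pymongo version of that (the database/collection names and the geoip call are placeholders for whatever you actually have):

import pymongo

def lookup_city(ip):
    return "unknown"  # stand-in for your geoip library call

client = pymongo.MongoClient()
coll = client.mydb.mycoll
out = client.mydb.ip_locations  # write results to a separate collection

batch = []
cursor = coll.find({}, {"_id": 1, "ip": 1}).batch_size(1000)  # project only what you need
for doc in cursor:
    batch.append({"src_id": doc["_id"], "ip": doc["ip"], "city": lookup_city(doc["ip"])})
    if len(batch) >= 1000:
        out.insert_many(batch)
        batch = []
if batch:
    out.insert_many(batch)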
Actually, I am also attempting another approach in parallel (as plan B), which is to use mongoexport. I use it with --csv to dump a large CSV file with just the (id, ip) fields. Then the plan is to use a Python script to do the geoip lookup and post the results back to Mongo as new docs, on which map-reduce can then be run for counts etc. Not sure whether this or the cursor is faster. We'll see.