I've been poring over everything I can find for an answer to this, but can't seem to find anything:
I've got a batch update to a MySQL database that happens every few minutes, with Python handling the ETL work (I'm pulling data from web APIs into the MySQL system).
I'm trying to get a sense of what kinds of potential impact (be it positive or negative) I'd see by using either multithreading or multiprocessing to do multiple connections & inserts of the data simultaneously. Each worker (be it thread or process) would be updating a different table from any other worker.
At the moment I'm only updating a half-dozen tables with a few thousand records each, but this needs to be scalable to dozens of tables and hundreds of thousands of records each.
Every other resource I can find addresses multithreading/multiprocessing against the same table, not a distinct table per worker. I get the impression I would definitely want to use multithreading or multiprocessing, but everyone seems to be addressing the one-table use case.
Thoughts?
I think your question is too broad to answer concisely. It seems you're asking about two separate subjects - will writing to separate MySQL tables speed it up, and is Python multithreading the way to go. For the Python part, since you're probably doing mostly IO, you should look at gevent and ultramysql. As for the MySQL part, you'll have to wait for more answers.
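For what it's worth, the thread-per-table variant the question describes could look roughly like the sketch below, assuming mysql-connector-python; the table names, columns, and credentials are placeholders, not a definitive implementation:

# Rough sketch only: one worker thread per table, each with its own connection.
# Assumes mysql-connector-python; table/column names and credentials are placeholders.
from concurrent.futures import ThreadPoolExecutor
import mysql.connector

def load_table(table, rows):
    # Connections aren't thread-safe to share, so each worker opens its own.
    conn = mysql.connector.connect(
        host="localhost", user="etl", password="...", database="warehouse")
    try:
        cur = conn.cursor()
        cur.executemany(
            "INSERT INTO {} (col_a, col_b) VALUES (%s, %s)".format(table), rows)
        conn.commit()
    finally:
        conn.close()

def load_all(batches):
    # batches: {table_name: [(col_a, col_b), ...]} produced by the extract/transform step
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        futures = [pool.submit(load_table, t, rows) for t, rows in batches.items()]
        for f in futures:
            f.result()  # re-raise any worker exception

Because each worker spends most of its time waiting on the network and the server, the GIL isn't really the bottleneck here; multiprocessing would mainly buy you something if the transform step were CPU-heavy.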
For an ETL process I wrote in C#, I decided the best work partitioning was each "source" having a thread for extraction, one thread for each transform "type", and one to load the transformed data to each target.
In my case, I found multiple threads per source just ended up saturating the source server too much; it became less responsive overall (to even non-ETL queries) and the extractions didn't really finish any faster since they ended up competing with each other on the source. Since retrieving the remote extract was more time consuming than the local (in memory) transform, I was able to pipeline the extract results from all sources through one transformer thread/queue (per transform "type"). Similarly, I only had a single target to load the data to, so having multiple threads there would have just monopolized the target.
(Some details omitted/simplified for brevity, and due to poor memory.)
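In Python, that partitioning maps naturally onto a queue-per-stage pipeline. A rough sketch, with stub functions standing in for the real extract/transform/load work (all names here are hypothetical):

# Sketch of a thread-per-stage pipeline: N extractors -> 1 transformer -> 1 loader,
# with queue.Queue as the hand-off between stages. The stage functions are stubs.
import queue
import threading

SENTINEL = object()

def extract(source):      # stub: yield records from one source
    yield from source

def transform(record):    # stub: the per-record transform
    return record

def load(record):         # stub: write to the single target
    print(record)

def extractor(source, out_q):
    for record in extract(source):
        out_q.put(record)
    out_q.put(SENTINEL)                 # tell the transformer this source is done

def transformer(in_q, out_q, n_sources):
    finished = 0
    while finished < n_sources:         # one transformer drains all sources
        item = in_q.get()
        if item is SENTINEL:
            finished += 1
            continue
        out_q.put(transform(item))
    out_q.put(SENTINEL)

def loader(in_q):
    while True:                         # one loader, since there is a single target
        item = in_q.get()
        if item is SENTINEL:
            break
        load(item)

def run(sources):
    raw_q, clean_q = queue.Queue(maxsize=1000), queue.Queue(maxsize=1000)
    threads = [threading.Thread(target=extractor, args=(s, raw_q)) for s in sources]
    threads.append(threading.Thread(target=transformer, args=(raw_q, clean_q, len(sources))))
    threads.append(threading.Thread(target=loader, args=(clean_q,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run([["a1", "a2"], ["b1"]])             # two toy "sources"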
...but I'd think we'd need more details about what your ETL process does.
I am writing my bachelor thesis on a project with a massive database that tracks around 8000 animals, three times a second. After a few months we now have approximately 127 million entries, and each row includes a column holding an array of 1000-3000 entries with the coordinates of every animal that was tracked in that square at that moment. All of that lies in a SQL database that now easily exceeds 2 TB in size.
To export the data and analyse the animals' movement patterns, they used to do it online through phpMyAdmin as a CSV export that would take hours to finish and break down almost every time.
I wrote them a Python script (they wanted me to use Python) with mysql-connector-python that fetches the data for them automatically. The problem is that, since the database is so massive, one query can take minutes or technically even hours to complete (downloading a day of tracking data means 3*60*60*24 entries).
The moment anything goes wrong (the connection fails, the computer is overloaded, etc.) the whole query is aborted and has to start all over again, because the results aren't cached anywhere.
I then rewrote the whole thing as a class that fetches the data using smaller, multithreaded queries.
I start about 5-7 threads that each take a connection out of a connection pool, run their query, write the results to a CSV file as they come in, and put the connection back in the pool once the query is done.
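For reference, the pattern I'm describing looks roughly like this; the table, columns, and time-range chunking are placeholders, and it assumes mysql-connector-python's built-in connection pool:

# Sketch of the chunked, pooled approach described above.
# Table/column names and the chunking scheme are placeholders.
import csv
import threading
from concurrent.futures import ThreadPoolExecutor
import mysql.connector.pooling

pool = mysql.connector.pooling.MySQLConnectionPool(
    pool_name="export", pool_size=6,
    host="localhost", user="reader", password="...", database="tracking")

write_lock = threading.Lock()

def fetch_chunk(writer, start_ts, end_ts):
    conn = pool.get_connection()           # borrow a connection from the pool
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT ts, positions FROM tracking_data WHERE ts >= %s AND ts < %s",
            (start_ts, end_ts))
        rows = cur.fetchall()
    finally:
        conn.close()                       # returns the connection to the pool
    with write_lock:                       # one writer at a time into the shared CSV
        writer.writerows(rows)

def export(chunks, path="export.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        with ThreadPoolExecutor(max_workers=6) as ex:
            futures = [ex.submit(fetch_chunk, writer, s, e) for s, e in chunks]
            for fut in futures:
                fut.result()

Each thread writes complete rows under the lock, so a dropped connection only costs the chunk that was in flight.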
My solution works well: the export is about 5-6 times faster, depending on the number of threads I use and the size of the chunks I download. The data gets written into the file, and when the connection breaks or anything else happens, the CSV file still holds all the data that has been downloaded up to that point.
But when looking for ways to improve my method, I can find absolutely nothing about a similar approach; no one seems to do it this way for large datasets.
What am I missing? Why does it seem like everyone is using a single-query approach to fetch their massive datasets, instead of splitting it into threads and avoiding these annoying issues with connection breaks and whatnot?
Is my solution even usable and sound in a commercial environment, or are there things I just don't see right now that would make my approach useless or even much worse?
Or maybe it is a matter of the programming language, and if I had used C# to do the same thing it would have been faster anyway?
EDIT:
To clear some things up: I am not responsible for the database. While I can tinker with it, since I also have admin rights, someone else who (hopefully) actually knows what he is doing has set it up and writes the data. My job is only to fetch the data as simply and efficiently as possible. And since exporting from phpMyAdmin is too slow, and so is a single Python query for 100k rows (I do it using pd.read_sql), I switched to multithreading. So my question is only about SELECTing the data efficiently, not about changing the DB.
I hope this is not becoming too long of a question...
There are many issues in a database of that size. We need to do the processing fast enough so that it never gets behind. (Once it lags, it will keel over, as you see.)
Ingestion. It sounds like a single client is receiving 8000 lat/lng values three times a second, then INSERTing a single, quite wide row each time. Is that correct?
When you "process" the data, are you looking at each of the 8000 animals? Or looking at a selected animal? Fetching one out of a lat/lng from a wide row is messy and slow.
If the primary way things are SELECTed is one animal at a time, then your matrix needs to be transposed. That will make selecting all the data for one animal much faster, and we can mostly avoid the impact that Inserting and Selecting have on each other.
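For illustration only (column names and types are guesses about the data, not your actual schema), the transposed layout would be one narrow row per animal per sample, clustered by animal:

# Illustration of the transposed layout: one narrow row per animal per sample,
# clustered on (animal_id, ts) so one animal's track is a sequential range scan.
# All names and types here are assumptions, not the real schema.
import mysql.connector

DDL = """
CREATE TABLE animal_positions (
    animal_id SMALLINT UNSIGNED NOT NULL,
    ts        DATETIME(3)       NOT NULL,  -- 3 samples/second, so sub-second precision
    x         SMALLINT UNSIGNED NOT NULL,  -- scaled coordinates (see the scaling note below)
    y         SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (animal_id, ts)
) ENGINE=InnoDB
"""

conn = mysql.connector.connect(host="localhost", user="admin", password="...", database="tracking")
conn.cursor().execute(DDL)
conn.close()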
Are you inserting while you are reading?
What is the value of innodb_buffer_pool_size? You must plan carefully with the 2TB versus the much smaller RAM size. Depending on the queries, you may be terribly I/O-bound and maybe the data structure can be changed to avoid that.
"...csv file and put it back..." -- Huh? Are you deleting data, then re-inserting it? That sees 'wrong'. And very inefficient.
Do minimize the size of every column in the table. How big is the range for the animals? Your backyard? The Pacific Ocean? How much precision is needed in the location? Meters for whales; millimeters for ants. Maybe the coordinates can be scaled to a pair of SMALLINTs (2 bytes, 16-bit precision) or MEDIUMINTs (3 bytes each)?
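As a made-up worked example of that scaling: if the tracked area were a known 100 m x 100 m enclosure, a pair of SMALLINT UNSIGNED columns gives roughly 1.5 mm of resolution:

# Made-up example: map coordinates in a known, bounded area onto 16-bit integers.
# 100 m spread over 65535 steps is roughly 1.5 mm per step.
X_MIN, X_MAX = 0.0, 100.0   # metres; assumed bounds of the tracked area

def to_smallint(x, lo=X_MIN, hi=X_MAX):
    return round((x - lo) / (hi - lo) * 65535)   # store as SMALLINT UNSIGNED

def from_smallint(v, lo=X_MIN, hi=X_MAX):
    return lo + (v / 65535) * (hi - lo)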
I haven't dwelled on threading; I would like to wait until the rest of the issues are ironed out. Threads interfere with each other to some extent.
I find this topic interesting. Let's continue the discussion.
I have a question on the general strategy of how to integrate data into an MSSQL database.
Currently, I use Python for my whole ETL process. I use it to clean, transform, and integrate the data into an MSSQL database. My data is small, so I think this process works fine for now.
However, it feels a little awkward for my code to constantly read data from and write data to the database. I think this strategy will become an issue once I'm dealing with large amounts of data, and the constant read/write seems very inefficient. However, I don't know enough to tell whether this is a real problem or not.
I want to know if this is a feasible approach or whether I should switch entirely to SSIS to handle it. SSIS feels clunky to me and I'd prefer not to rewrite my entire codebase. Any input on the general ETL architecture would be very helpful.
Is this practice alright? Maybe?
There are too many factors to give a definitive answer. Conceptually, what you're doing - extract data from a source, transform it, load it to a destination: ETL - is all that SSIS does. It likely can do some things more efficiently than Python; at least I've had a devil of a time getting a bulk load to work with memory-mapped data, whereas dumping to disk and bulk inserting that via Python was no problem. But if the existing process works, then let it go until it doesn't work.
If your team knows Python, introducing SSIS just to do ETL is likely going to be a bigger maintenance cost than scaling up your existing approach. On the other hand, if it's standard-ish Python + libraries and you're on SQL Server 2017+, you might be able to execute your scripts from within the database itself via sp_execute_external_script.
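If you do stay on the Python side, batching the writes is usually the part that pays off most. A minimal sketch assuming pyodbc (the DSN, table, and columns are placeholders):

# Sketch: batched load into SQL Server from Python, assuming pyodbc.
# Table/column names and the DSN are placeholders.
import pyodbc

def bulk_load(rows):
    conn = pyodbc.connect("DSN=warehouse;UID=etl;PWD=...", autocommit=False)
    cur = conn.cursor()
    cur.fast_executemany = True        # send parameter batches instead of row-by-row
    cur.executemany(
        "INSERT INTO dbo.staging_table (col_a, col_b) VALUES (?, ?)", rows)
    conn.commit()
    conn.close()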
If the ETL process runs on the same box as the database, then ensure you have sufficient resources to support both processes at their maximum observed levels of activity. If the ETL runs elsewhere, then you'll want to ensure you have fast, full duplex connectivity between the database server and the processing box.
Stand up a load testing environment that parallels production's resources. Dummy up a 10x increase in source data and observe how the ETL fares. 100x, 1000x. At some point, you'll identify what development sins you committed that do not scale and then you're poised to ask a really good, detailed question describing the current architecture, the specific code that does not perform well under load and how one can reproduce this load.
The above design considerations will hold true for Python, SSIS or any other ETL solution - prepackaged or bespoke.
I have a huge amount of data stored in Cassandra and I want to process it using Spark through Python.
I just want to know how to connect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working for me, and fetching all the data from Cassandra at once and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
# `spark` is assumed to be a SparkSession created with the
# spark-cassandra-connector package on the classpath.
spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="test") \
    .load() \
    .show()
I'll just give my "short" 2 cents. The official docs are totally fine for you to get started. You might want to specify why this isn't working, i.e. did you run out of memory (perhaps you just need to increase the "driver" memory) or is there some specific error that is causing your example not to work. Also it would be nice if you provided that example.
Here are some of my opinions/experiences that I had. Usually, not always, but most of the time you have multiple columns in partitions. You don't always have to load all the data in a table and more or less you can keep the processing (most of the time) within a single partition. Since the data is sorted within a partition this usually goes pretty fast. And didn't present any significant problem.
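For example (keyspace, table, and column names here are made up), restricting the read to a single partition key lets the connector push the filter down to Cassandra instead of pulling the whole table into Spark:

# Made-up example: read only one partition rather than the whole table.
# The connector pushes the partition-key equality filter down to Cassandra.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="measurements", keyspace="test")
      .load()
      .filter("sensor_id = 'abc-123'"))   # sensor_id assumed to be the partition key
df.show()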
If you don't want the whole store-in-Cassandra, fetch-into-Spark cycle for your processing, you really have a lot of solutions out there. Basically that would be Quora material. Here are some of the more common ones:
* Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast or, even better, Akka Cluster; this is really a wide topic
* Spark Streaming - do your processing right away in micro-batches and flush the results for reading to some persistence layer - might be Cassandra
* Apache Flink - use a proper streaming solution and periodically flush the state of the process to, e.g., Cassandra
* Store data in Cassandra the way it's supposed to be read - this approach is the most advisable (just hard to say with the info you provided)
The list could go on and on... user-defined functions in Cassandra, aggregate functions if your task is something simpler.
It might also be a good idea to provide some details about your use case. More or less what I said here is pretty general and vague, but then again putting all of this into a comment just wouldn't make sense.
I'm working with two databases, a local version and the version on the server. The server holds the most up-to-date version, and instead of recopying all values in all tables from the server to my local version,
I would like to go through each table and only insert/update the values that have changed on the server, copying those values to my local version.
Is there some simple method of handling such a case? Some sort of batch insert/update? Googling for the answer isn't working, and I've tried my hand at coding one but am starting to get tied up in error handling.
I'm using Python and MySQLdb... Thanks for any insight
Steve
If all of your tables' records had timestamps, you could identify "the values that have changed in the server" -- otherwise, it's not clear how you plan to do that part (which has nothing to do with insert or update, it's a question of "selecting things right").
Once you have all the important values, somecursor.executemany will let you apply them all as a batch. Depending on your indexing it may be faster to put them into a non-indexed auxiliary temporary table, then insert/update from all of that table into the real one (before dropping the aux/temp one), the latter of course being a single somecursor.execute.
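Assuming the timestamp column does exist, a compressed sketch of that select-then-batch-apply flow with MySQLdb (all table and column names are placeholders):

# Sketch: pull rows changed on the server since the last sync and apply them
# locally in one batch. Assumes a last_modified timestamp column; names are placeholders.
import MySQLdb

def sync_table(remote, local, last_sync):
    rcur = remote.cursor()
    rcur.execute(
        "SELECT id, col_a, col_b FROM items WHERE last_modified > %s", (last_sync,))
    changed = rcur.fetchall()

    lcur = local.cursor()
    # REPLACE = delete-then-insert by primary key; good enough for a straight copy.
    lcur.executemany(
        "REPLACE INTO items (id, col_a, col_b) VALUES (%s, %s, %s)", changed)
    local.commit()

remote = MySQLdb.connect(host="server", user="sync", passwd="...", db="app")
local = MySQLdb.connect(host="localhost", user="sync", passwd="...", db="app")
sync_table(remote, local, "2024-01-01 00:00:00")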
You can reduce wall-clock time for the whole job by using one (or a few) threads to do the selects and put the results onto a Queue.Queue, and a few worker threads to apply results plucked from the queue to the local server. (The best balance of reading vs. writing threads is obtained by trying a few combinations and measuring; writing per se is slower than reading, but your bandwidth to your local server may be higher than to the remote one, so it's difficult to predict.)
However, all of this is moot unless you do have a strategy to identify "the values that have changed in the server", so it's not necessarily very useful to enter into more discussion about details "downstream" from that identification.
I have a database into which I regularly need to import large amounts of data via some Python scripts. Compacted, the data for a single month's imports takes about 280 MB, but during the import the file size swells to over a GB.
Given the 2 GB size limit on MDB files, this is a bit of a concern. Apart from breaking the inserts into chunks and compacting in between each, are there any techniques for avoiding the increase in file size?
Note that no temporary tables are being created/deleted during the process: just inserts into existing tables.
And to forestall the inevitable comments: yes, I am required to store this data in Access 2003. No, I can't upgrade to Access 2007.
If it could help, I could preprocess in sqlite.
Edit:
Just to add some further information (some already listed in my comments):
The data is being generated in Python on a table-by-table basis, and then all of the records for that table are batch inserted via ODBC
All processing is happening in Python: all the mdb file is doing is storing the data
All of the fields being inserted are valid fields (none are being excluded due to unique key violations, etc.)
Given the above, I'll be looking into how to disable row-level locking via ODBC, and considering presorting the data and/or removing and then reinstating indexes. Thanks for the suggestions.
Any further suggestions still welcome.
Are you sure row locking is turned off? In my case, turning off row locking reduced bloat by over 100 megs when working on a 5 meg file (in other words, the file barely grew after turning off row locking, to about 6 megs). With row locking on, the same operation results in a file well over 100 megs in size.
Row locking is a HUGE source of bloat during recordset operations since it pads each record to a page size.
Do you have MS Access installed here, or are you just using JET? (JET is the data engine that MS Access uses; you can use JET without Access.)
Open the database in ms-access and go:
Tools->options
On the advanced tab, un-check the box:
[ ] Open databases using record level locking.
This will not only make a HUGE difference in the file growth (bloat), it will also speed things up by a factor of 10 times.
There is also a registry setting that you can use here.
And, are you using ODBC or an OLE DB connection?
You can try:
Set rs = New ADODB.Recordset
With rs
    .ActiveConnection = RsCnn
    .Properties("Jet OLEDB:Locking Granularity") = 1
    ' ... open and process the recordset here
End With
Try the setting from Access (change the setting), exit, re-enter, and then compact and repair. Then run your test import; the bloat issue should go away.
There is likely no need to open the database using row locking. If you turn off that feature, then you should be able to reduce the bloat in file size down to a minimum.
For further reading and an example, see here:
Does ACEDAO support row level locking?
One thing to watch out for is records which are present in the append queries but aren't inserted into the data due to duplicate key values, null required fields, etc. Access will allocate the space taken by the records which aren't inserted.
About the only significant thing I'm aware of is to ensure you have exclusive access to the database file, which might be impossible if doing this during the day. I noticed a change in behavior from Jet 3.51 (used in Access 97) to Jet 4.0 (used in Access 2000): the Access MDBs started getting a lot larger when doing record appends. I think that if the MDB is being used by multiple folks, then records are inserted once per 4k page rather than as many as can be stuffed into a page, likely because this made index insert/update operations faster.
Now compacting does indeed put as many records in the same 4k page as possible but that isn't of help to you.
A common trick, if feasible with regard to the schema and semantics of the application, is to have several MDB files with Linked tables.
Also, the way the insertions take place matters with regard to how the file size balloons. For example: batched vs. one/few records at a time, sorted (relative to particular index(es)), number of indexes (as you mentioned, readily dropping some during the insert phase)...
Tentatively, a pre-processing approach would be to store the new rows in a separate linked table, heap fashion (no indexes), then sort/index this data in a minimal fashion, and "bulk load" it to its real destination. Similar pre-processing in SQLite (as hinted at in the question) would serve the same purpose. Keeping it "all MDB" is maybe easier (fewer languages/processes to learn, fewer inter-op issues [hopefully ;-)])...
EDIT: on why inserting records in a sorted/bulk fashion may slow down the MDB file's growth (question from Tony Toews)
One of the reasons for MDB files' propensity to grow more quickly than the rate at which text/data added to them (and their counterpart ability to be easily compacted back down) is that as information is added, some of the nodes that constitute the indexes have to be re-arranged (for overflowing / rebalancing etc.). Such management of the nodes seems to be implemented in a fashion which favors speed over disk space and harmony, and this approach typically serves simple applications / small data rather well. I do not know the specific logic in use for such management but I suspect that in several cases, node operations cause a particular node (or much of it) to be copied anew, and the old location simply being marked as free/unused but not deleted/compacted/reused. I do have "clinical" (if only a bit outdated) evidence that by performing inserts in bulk we essentially limit the number of opportunities for such duplication to occur and hence we slow the growth.
EDIT again: After reading and discussing things with Tony Toews and Albert Kallal, it appears that a possibly more significant source of bloat, in particular in Jet Engine 4.0, is the way locking is implemented. It is therefore important to set the database in single-user mode to avoid this. (Read Tony's and Albert's responses for more details.)
Is your script executing a single INSERT statement per row of data? If so, pre-processing the data into a text file of many rows that could then be inserted with a single INSERT statement might improve the efficiency and cut down on the accumulating temporary crud that's causing it to bloat.
You might also make sure the INSERT is being executed without transactions. Whether or not that happens implicitly depends on the Jet version and the data interface library you're using to accomplish the task. By explicitly making sure it's off, you could improve the situation.
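With pyodbc against the Jet/Access ODBC driver, for example, that just means opening the connection with autocommit on, so no explicit transaction wraps the batch (driver name, path, and table are assumptions):

# Sketch: insert a batch through the Access ODBC driver with autocommit on,
# so no explicit transaction wraps the inserts. Driver/path/table are assumptions.
import pyodbc

rows = [(1, "a"), (2, "b")]   # placeholder batch produced by the Python ETL step

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb",
    autocommit=True)
cur = conn.cursor()
cur.executemany("INSERT INTO readings (col_a, col_b) VALUES (?, ?)", rows)
conn.close()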
Another possibility is to drop the indexes before the insert, compact, run the insert, compact, re-instate the indexes, and run a final compact.
I find I am able to link from Access to SQLite and run a make-table query to import the data. I used this ODBC driver: http://www.ch-werner.de/sqliteodbc/ and created a User DSN.
File --> Options --> Current Database --> check the options below:
* Use the Cache format that is compatible with Microsoft Access 2010 and later
* Clear Cache on Close
Then your file will be saved compacted, back to the original size.