simultaneous database access question (Mysql,Sqlite?) - python

My basic problem is that i am trying to have two python programs run simultaneously and have access to the same database table. I feel like this should have a simple solution but it has passed my by so far.
All my attempts at this have caused the database(sqlite) to be locked and the program falling over.
i have tried being clever with the timing with how they programs run so that as one program opens the connection the other closes it, copying data from one database to another etc.. but this just gets horrible and messy very quickly and also a big goal in my design is that I would like to keep latency to an absolute minimum.
The basic structure is pictured below.
I should add too that program one - 'always running and adding to database' is in the milliseconds timeframe.
Program two can be in the multiple seconds range. Obviously none of my solutions have been able to come close to that.
Any help, steps in the right direction or links to further reading is greatly appreciated!
Cheers

Although your title mentions MySQL, in your question you are only using sqlite. Now, sqlite is a perfectly capable database if you only have a single process accessing it, but it is not good for multiple simultaneous access. This is exactly where you need a proper database - like MySQL.

Related

Automate a manual task using Python

I have a question and hope someone can direct me in the right direction; Basically every week I have to run a query (SSMS) to get a table containing some information (date, clientnumber, clientID, orderid etc) and then I copy all the information and that table and past it in a folder as a CSV file. it takes me about 15 min to do all this but I am just thinking can I automate this, if yes how can I do that and also can I schedule it so it can run by itself every week. I believe we live in a technological era and this should be done without human input; so I hope I can find someone here willing to show me how to do it using Python.
Many thanks for considering my request.
This should be pretty simple to automate:
Use some database adapter which can work with your database, for MSSQL the one delivered by pyodbc will be fine,
Within the script, connect to the database, perform the query, parse an output,
Save parsed output to a .csv file (you can use csv Python module),
Run the script as the periodic task using cron/schtask if you work on Linux/Windows respectively.
Please note that your question is too broad, and shows no research effort.
You will find that Python can do the tasks you desire.
There are many different ways to interact with SQL servers, depending on your implementation. I suggest you learn Python+SQL using the built-in sqlite3 library. You will want to save your query as a string, and pass it into an SQL connection manager of your choice; this depends on your server setup, there are many different SQL packages for Python.
You can use pandas for parsing the data, and saving it to a ~.csv file (literally called to_csv).
Python does have many libraries for scheduling tasks, but I suggest you hold off for a while. Develop your code in a way that it can be run manually, which will still be much faster/easier than without Python. Once you know your code works, you can easily implement a scheduler. The downside is that your program will always need to be running, and you will need to keep checking to see if it is running. Personally, I would keep it restricted to manually running the script; you could compile to an ~.exe and bind to a hotkey if you need the accessibility.

Is it useful to multithread sql queries to fetch data from a large DB

I am writing my bachelor thesis on a project with a massive database that tracks around 8000 animals, three times a second. After a few months, we now have approx 127 million entries and each row includes a column with an array with 1000-3000 entries that has the coordinates for every animal that was tracked in that square that moment. All that lays in a sql database that now easily exceeds 2 TB in size.
To export the data and analyse the moving patterns of the animals, they did it online over PHPMyAdmin as a csv export that would take hours to be finished and break down about everytime.
I wrote them a python (they wanted me to use python) script with mysql-connector-python that will fetch the data for them automatically. The problem is, since the database is so massive, one query can take up minutes or technically even hours to complete. (downloading a day of tracking data would be 3*60*60*24 entries)
The moment anything goes wrong (connection fails, computer is overloaded etc) the whole query is closed and it has to start all over again cause its not cached anywhere.
I then rewrote the whole thing as a class that will fetch the data by using smaller multithreaded queries.
I start about 5-7 Threads that each take a connection out of a connection pool, make the query, write it in a csv file successively and put the connection back in the pool once done with the query.
My solution works perfectly, the queries are about 5-6 times faster, depending on the amount of threads I use and the size of the chunks that I download. The data gets written into the file and when the connection breaks or anything happens, the csvfile still holds all the data that has been downloaded up to that point.
But on looking at solutions how to improve my method, I can find absolutely nothing about a similar approach and no-one seems to do it that way for large datasets.
What am I missing? Why does it seem like everyone is using a single-query approach to fetch their massive datasets, instead of splitting it into threads and avoiding these annoying issues with connection breaks and whatnot?
Is my solution even usable and good in a commercial environment or are there things that I just dont see right now, that would make my approach useless or even way worse?
Or maybe it is a matter of the programming language and if I had used C# to do the same thing it wouldve been faster anyways?
EDIT:
To clear some things up, I am not responsible for the database. While I can tinker with it since I also have admin rights, someone else that (hopefully) actually knows what he is doing, has set it up and writes the data. My Job is only to fetch it as simple and effective as possible. And since exporting from PHPMyAdmin is too slow and so is a single query on python for 100k rows (i do it using pd.read_sql) I switched to multithreading. So my question is only related to SELECTing the data effectively, not to change the DB.
I hope this is not becoming too long of a question...
There are many issues in a database of that size. We need to do the processing fast enough so that it never gets behind. (Once it lags, it will keel over, as you see.)
Ingestion. It sounds like a single client is receiving 8000 lat/lng values every 3 seconds, then INSERTing a single, quite wide row. Is that correct?
When you "process" the data, are you looking at each of the 8000 animals? Or looking at a selected animal? Fetching one out of a lat/lng from a wide row is messy and slow.
If the primary way things are SELECTed is one animal at a time, then your matrix needs to be transposed. That will make selecting all the data for one animal much faster, and we can mostly avoid the impact that Inserting and Selecting have on each other.
Are you inserting while you are reading?
What is the value of innodb_buffer_pool_size? You must plan carefully with the 2TB versus the much smaller RAM size. Depending on the queries, you may be terribly I/O-bound and maybe the data structure can be changed to avoid that.
"...csv file and put it back..." -- Huh? Are you deleting data, then re-inserting it? That sees 'wrong'. And very inefficient.
Do minimize the size of every column in the table. How big is the range for the animals? Your backyard? The Pacific Ocean? How much precision is needed in the location? Meters for whales; millimeters for ants. Maybe the coordinates can be scaled to a pair of SMALLINTs (2 bytes, 16-bit precision) or MEDIUMINTs (3 bytes each)?
I haven't dwelled on threading; I would like to wait until the rest of the issues are ironed out. Threads interfere with each other to some extent.
I find this topic interesting. Let's continue the discussion.

How do I run Python scripts automatically, while my Flask website is running on a VPS?

Okay, so basically I am creating a website. The data I need to display on this website is delivered twice daily, where I need to read the delivered data from a file and store this new data in the database (instead of the old data).
I have created the python functions to do this. However, I would like to know, what would be the best way to run this script, while my flask application is running? This may be a very simple answer, but I have seen some answers saying to incorporate the script into the website design (however these answers didn't explain how), and others saying to run it separately. The script needs to run automatically throughout the day with no monitoring or input from me.
TIA
Generally it's a really bad idea to put a webserver to handle such tasks, that is the flask application in your case. There are many reasons for it so just to name a few:
Python's Achilles heel - GIL.
Sharing system resources of the application between users and other operations.
Crashes - it happens, it could be unlikely but it does. And if you are not careful, the web application goes down along with it.
So with that in mind I'd advise you to ditch this idea and use crontabs. Basically write a script that does whatever transformations or operations it needs to do and create a cron job at a desired time.

Python Multithreading/processing gains for inserts to different tables in MySQL?

I've been pouring over everywhere I can to find an answer to this, but can't seem to find anything:
I've got a batch update to a MySQL database that happens every few minutes, with Python handling the ETL work (I'm pulling data from web API's into the MySQL system).
I'm trying to get a sense of what kinds of potential impact (be it positive or negative) I'd see by using either multithreading or multiprocessing to do multiple connections & inserts of the data simultaneously. Each worker (be it thread or process) would be updating a different table from any other worker.
At the moment I'm only updating a half-dozen tables with a few thousand records each, but this needs to be scalable to dozens of tables and hundreds of thousands of records each.
Every other resource I can find out there addresses doing multithreading/processing to the same table, not a distinct table per worker. I get the impression I would definitely want to use multithreading/processing, but it seems everyone's addressing the one-table use case.
Thoughts?
I think your question is too broad to answer concisely. It seems you're asking about two separate subjects - will writing to separate MySQL tables speed it up, and is python multithreading the way to go. For the python part, since you're probably doing mostly IO, you should look at gevent, and ultramysql. As for the MySQL part, you'll have to wait for more answers.
For one I wrote in C#, I decided the best work partitioning was each "source" having a thread for extraction, one for each transform "type", and one to load the transformed data to each target.
In my case, I found multiple threads per source just ended up saturating the source server too much; it became less responsive overall (to even non-ETL queries) and the extractions didn't really finish any faster since they ended up competing with each other on the source. Since retrieving the remote extract was more time consuming than the local (in memory) transform, I was able to pipeline the extract results from all sources through one transformer thread/queue (per transform "type"). Similarly, I only had a single target to load the data to, so having multiple threads there would have just monopolized the target.
(Some details omitted/simplified for brevity, and due to poor memory.)
...but I'd think we'd need more details about what your ETL process does.

Python over a network share

So I have tried to find a answer but must not be searching correctly or what I'm trying to do is the wrong way to go about it.
So I have a simple python script that creates a chess board and pieces in a command line environment. You can in put commands to move the pieces. So one of my co workers thought it would be cool to play each other over the network. I agreed and tried by creating a text file to read and write to on the network share. Then we would both run the script that reads that file. The problem I ran into is I pretty much DOS attacked that file share since it kept trying to check that file on network share for a update.
I am still new to python and haven't ever wrote code that travels the internet, our even a simple local network. So my question is how should I go about properly allowing 2 people to access this data at the same time with out stealing all the network resources?
Oh also im using version 2.6 because thats what everyone else uses and they refuse to change to new syntax
You need to use the proper networking way. It's not quite hard for simple networked program like yours.
Use the one from the Python's stdlib http://docs.python.org/library/socket.html (also take a look at the examples at the bottom of the page).
First off, without knowing how many times you are checking the fle with the moves, it is difficult to know why the file-share is getting DoS-ed. Most networks and network shares these days can handle that level of traffic - they are all gigabit Ethernet, so unless you are transferring large chunks of data each time, you should be ok. If you are transferring the whole file each time, then I'd suggest that you look at optimizing that.
That said, coming to your second question on how this is handled at a network level, to be honest, you are already doing it in a certain way - you are accessing a file on a network share and modifying it. The only optimization required is to be able to do it efficiently. Even over the network operations in a concurrent world do the same. In that case, it will be using fast in-memory database storing various changes / using a high-scale RDBMS / in the case of fast-serving web-servers better async I/O.
In the current case, since there are two users playing the game, I suggest that you work on a way to transmit only the difference in the moves each time over the network. So, instead of modifying the file over the network share, you can send the moves over to a server component and it synchronizing the changes locally to the file. Of course, this means you will need to create a server component that would do something like this
user1's moves <--> server <--> user2's moves . Server will modify the moves file.
Once you start doing this, you get into the realm of server programming / preventing race conditions etc. It will be a good learning experience.

Categories

Resources