Writing data repeatedly to an SQLite database with Python - python

I want to use SQLite for my GUI Python application, but I have to update the database every 500 ms without affecting the performance of my program.
I'm using PyQt4, so I thought about using QThread, but it seems difficult to deal with, so I wondered whether it was the best way before really trying to understand it.
My question is: is QThread the best way, or are there other ways?

Because the CPython implementation relies on the GIL, even using threads or a timer you won't be able to do something (potentially costly) in your program without affecting its overall performance.
I suggest you have a look at the multiprocessing module to get around this limitation. With this module, you no longer use threads (which are affected by the GIL) but processes (which are not).
Maybe you could create a subprocess that arms a timer to perform the update every 500 ms while the main process continues its work.
Then you let the operating system do the job of balancing the processes, which may be better in terms of responsiveness (especially on a multi-core machine).
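A minimal sketch of that idea, using a multiprocessing.Queue and a daemon child process that owns all SQLite writes; the samples table, the app.db path, and the 500 ms flush interval are assumptions for illustration, not from the original question:

```python
import multiprocessing
import queue as queue_module
import sqlite3
import time

def writer(db_path, q):
    """Child process: flush queued rows into SQLite every 500 ms."""
    conn = sqlite3.connect(db_path)  # each process needs its own connection
    conn.execute("CREATE TABLE IF NOT EXISTS samples (ts REAL, value REAL)")
    while True:
        time.sleep(0.5)
        rows = []
        while True:
            try:
                rows.append(q.get_nowait())
            except queue_module.Empty:
                break
        if rows:
            conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)
            conn.commit()

if __name__ == "__main__":
    q = multiprocessing.Queue()
    multiprocessing.Process(target=writer, args=("app.db", q), daemon=True).start()
    # The PyQt4 GUI (main process) only enqueues data; the child does the writes.
    q.put((time.time(), 42.0))
    time.sleep(1)  # give the writer a chance to flush before this demo exits
```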

Related

Will asyncio help me here?

I have a little Python application that I developed using wxPython 4.0.3. It performs a fairly simple ETL-type task:
Takes user input to select a parent directory containing several sub-directories full of CSV files (time series data).
Transforms the data in each of those files to comply with formatting required by a third-party recipient and writes them out to new files.
Zips each of the new files into a file named after the original file's parent directory.
At the user's initiative, the application then FTPs the zip files to the third party to be loaded into a data store.
The application works well enough, but the time required to process several thousand CSV files is pretty extreme and, from what I can tell, mostly I/O-bound.
Is asyncio a reasonable option to pursue, or are there other recommendations that anyone can make? I originally wrote this as a CLI and saw significant performance gains by using PyPy, but I was reluctant to combine PyPy with wxPython when I developed the UI for others.
Thanks for your guidance.
If you saw a significant speedup by using PyPy instead of CPython, that implies that your code probably isn't I/O-bound. Which means that making the I/O asynchronous isn't going to help very much. Plus, it'll be extra work, because you'll have to restructure all of your CPU-heavy tasks into small pieces that can await repeatedly so they don't block the other tasks.
So, you probably want to use multiple processes here.
The simplest solution is to use a concurrent.futures.ProcessPoolExecutor: just toss tasks at the executor, and it'll run them on the child processes and return you a Future.
Unlike using asyncio, you won't have to change those tasks at all. They can read a file just by looping over the csv module, process it all in one big chunk, and even use the synchronous ftplib module, without needing to worry about anyone blocking anyone else. Only your top-level code needs to change.
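A rough sketch of that top-level change, assuming a hypothetical process_file function that wraps your existing per-file read/transform/zip code and a placeholder parent_dir:

```python
import concurrent.futures
from pathlib import Path

def process_file(csv_path):
    """Placeholder for the existing synchronous read/transform/zip code."""
    ...

if __name__ == "__main__":
    csv_files = list(Path("parent_dir").rglob("*.csv"))  # user-chosen directory
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(process_file, p) for p in csv_files]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # re-raises any exception from the child process
```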
However, you may want to consider splitting the code into a wx GUI that you run in CPython, and a multiprocessing engine that you run via subprocess in PyPy, which then spins off the ProcessPoolExecutor in PyPy as well. This would take a bit more work, but it means you'll get the CPU benefits of using PyPy, the well-tested-with-wx benefits of using CPython, and the parallelism of multiprocessing.
Another option to consider is pulling in a library like NumPy or Pandas that can do the slow parts (whether that's reading and processing the CSV, or doing some kind of elementwise computation on thousands of rows, or whatever) more quickly (and possibly even releasing the GIL, meaning you don't need multiprocessing).
If your code really is I/O-bound code, and primarily bound on the FTP requests, asyncio would help. But it would require rewriting a lot of code. You'd need to find or write an asyncio-driven FTP client library. And, if the file reading takes any significant part of your time, converting that to async is even more work.
There's also the problem of integrating the wx event loop with the asyncio event loop. You might be able to get away with running the asyncio loop in a second thread, but then you need to come up with some way of communicating between the wx event loop in the main thread and the asyncio loop in the background thread. Alternatively, you might be able to drive one loop from the other (or there might even be third-party libraries that do that for you). But this might be a lot easier to do with (or have better third-party libraries to help with) something like twisted instead of asyncio.
But, unless you need massive concurrency (which you probably don't, unless you've got hundreds of different FTP servers to talk to), threads should work just as well, with a lot fewer changes to your code. Just use a concurrent.futures.ThreadPoolExecutor, which is nearly identical to using a ProcessPoolExecutor as explained above.
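For the FTP stage specifically, a hedged sketch of the same pattern with threads; the host, credentials, file names, and upload helper are all placeholders:

```python
import concurrent.futures
import ftplib

def upload(zip_path):
    """Hypothetical upload of one zip; host and credentials are placeholders."""
    with ftplib.FTP("ftp.example.com", "user", "password") as ftp:
        with open(zip_path, "rb") as f:
            ftp.storbinary(f"STOR {zip_path}", f)

if __name__ == "__main__":
    zip_files = ["dir_a.zip", "dir_b.zip"]  # assumed output of the zip step
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        list(executor.map(upload, zip_files))  # list() forces exceptions to surface
```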
Yes, you will probably benefit from using an asynchronous library. Since most of your time is spent waiting for I/O, a well-written asynchronous program will use that time to do something else, without the overhead of extra threads/processes. It will scale really well.

Does the Python multiprocessing module need a multi-core CPU?

Do I need a multi-core CPU to take advantage of the Python multiprocessing module?
Also, can someone tell me how it works under the hood?
multiprocessing asks the OS to launch one or more new processes, running the same version of Python and the same version of your script. It can also set up pipes or other ways of sharing data directly between them.
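For example, a minimal sketch of launching one child process and sharing data over a Pipe (the worker function and the squaring task are just illustrations):

```python
import multiprocessing

def worker(conn):
    """Runs in a separate OS process; talks to the parent over a pipe."""
    n = conn.recv()
    conn.send(n * n)
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send(7)
    print(parent_conn.recv())  # 49, computed in the child process
    p.join()
```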
It usually works like magic; when you peek under the hood, sometimes it looks like sausage being made, but you can usually understand the sausage grinders. The multiprocessing docs do a great job explaining things further (they're long, but then there's a lot to explain). And if you need even more under-the-hood knowledge, the docs link to the source, which is pretty readable Python code. If you have a specific question after reading, come back to SO and ask a specific question.
Meanwhile, you can get some of the benefits of multiprocessing without multiple cores.
The main benefit—the reason the module was designed—is parallelism for speed. And obviously, without 4 cores, you aren't going to cut your time down to 25%. But sometimes, you actually can get a bit of speedup even with a single core, especially if that core has "hyperthreading" or similar technologies. I've seen times come down to 80%, or even 60%. More commonly, they'll go up to 108% instead (because you did get a small benefit from hyperthreading, but the overhead cost was higher than the gain). But try it with your code and see.
Meanwhile, you get all of the side benefits:
Concurrency: You can run multiple tasks at once without them blocking each other. Of course threads, asyncio, and other techniques can do this too.
Isolation: You can run multiple tasks at once without the risk of one of them changing data that another one wasn't expecting to change.
Crash protection: If a child task segfaults, only that task is affected. (Well, you still have to be careful of any side-effects—if it crashed in the middle of writing a file that another task expects to be in a consistent shape, you're still in trouble.)
You can also use the multiprocessing module without multiple processes. Sometimes you just want the higher-level API of the module, but you want to use it with threads; multiprocessing.dummy does that. And you can switch back and forth in a couple lines of code to test it both ways. Or you can use the higher-level concurrent.futures.ProcessPoolExecutor wrapper, if its model fits what you want to do. Besides often being simpler, it lets you switch between threads and processes by just changing one word in one line.
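A small sketch of that switch; the work function is just a stand-in for whatever you actually run:

```python
# Swap which import is active to run the exact same code with processes or threads.
import multiprocessing as mp            # real processes
# import multiprocessing.dummy as mp    # same API, backed by threads

def work(x):
    return x * x

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        print(pool.map(work, range(10)))
```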
Also, redesigning your program around multiprocessing takes you a step closer to further redesigning it as a distributed system that runs on multiple separate machines. It forces you to deal with questions like how your tasks communicate without being able to share everything, without forcing you to deal with further questions like how they communicate without reliable connections.

Python Tkinter for real-time GUIs

I have written a monitoring program for the control system at our plant. It is basically a GUI which lets the operator see the current lock status of the closed-loop system and alerts the operator in case the lock/loop breaks.
Now, the operation is heavily dependent on the responsiveness of the GUI. My seniors told me that they prefer plain console prints to a Tkinter-based GUI, because Tkinter lags when working in real time.
Can anyone please comment on this aspect?
Can this lag be checked and corrected?
Thanks in advance.
I would say that if your program is simply displaying data and not letting the user interact with it, then a GUI seems to be a bit of overkill. A GUI (graphical user interface) is made for interaction; if the interface is just a status display, as you indicated, then I see nothing wrong with a console program.
If, however, your program also interacts with data in a way that would be difficult without a GUI, then the GUI is likely the right choice.
Have you considered a GUI in another programming language? Python is known to be a bit slow compared to compiled languages, even in a console. In my experience, a C++ GUI is faster at displaying data. Best of luck!
Python / tkinter in general
In a tkinter program, your code falls into one of four categories:
Initialization code that runs before the mainloop is started.
Callbacks that are run from within the mainloop.
Code running in other threads.
Code running in different processes.
In the first case, the time the code takes only influences the startup time, which for a long-running program is probably not all that relevant.
Concerning the second case, well-written callbacks should not take long to run; on the order of tens of milliseconds, maybe up to 100 ms. If they take longer, they will render the GUI unresponsive. So unless you notice a sluggish GUI (without threads; see below), this should not be a problem.
One pitfall here is after callbacks, that is, functions that are scheduled to run after a certain time. If you launch them too often, this will also starve the GUI of time.
Another possible problem might be the manipulation of a Canvas with lots and lots of items in it.
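As a hedged illustration of a well-behaved after-based poll (the 500 ms interval and the read_lock_status helper are assumptions, not from the question):

```python
import tkinter as tk

POLL_MS = 500  # assumed polling interval

def read_lock_status():
    return "locked"  # placeholder for the real check of the control system

def poll_status():
    # Keep this callback short; it runs inside the mainloop.
    status_var.set(read_lock_status())
    root.after(POLL_MS, poll_status)  # reschedule; too small an interval starves the GUI

root = tk.Tk()
status_var = tk.StringVar(value="unknown")
tk.Label(root, textvariable=status_var).pack(padx=20, pady=20)
root.after(POLL_MS, poll_status)
root.mainloop()
```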
As of Python 3.x, tkinter is thread-safe to the best of my understanding. However, in the reference implementation of Python, only one thread at a time can be executing Python bytecode. So doing heavy calculations in a second thread would slow down the GUI.
If your GUI uses multiprocessing to run calculations in another process, that should not influence the speed of your GUI much, unless you do things wrong when communicating with that other process.
Your monitoring program
What is too slow depends on the situation. In general, Python is not considered a suitable language for hard real-time programs; to do hard real-time, one also needs a suitable operating system.
So the question then becomes: what is the acceptable lag in your system specification? Without knowing that, it is impossible to answer your question precisely.
It seems that your GUI is just displaying some system status. That should not cause too much of a load, provided that you don't read/check the data too often. As described in the callbacks paragraph above it is possible to starve your GUI of CPU cycles with callbacks that run too often. From what you've written, I gather that the GUI's task is just to inform the human operator.
That leads me to believe that the task is not hugely time critical; a system that requires millisecond intervention time should not rely on a human operator.
So based on your information I would say that a competently written GUI should probably not be too slow.

How to choose between the different concurrency methods available in Python?

There are different ways of doing concurrency in Python; below is a simple list:
process-based: subprocess.Popen, multiprocessing.Process, old-fashioned os.system, os.popen, os.exec*
thread-based: threading.Thread
microthread-based: greenlet
I know the difference between thread-based concurrency and process-based concurrency, and I know some (but not too much) about GIL's impact in CPython's thread support.
For a beginner who wants to implement some level of concurrency, how should they choose between them? Or, what's the general difference between them? Are there any more ways to do concurrency in Python?
I'm not sure if I'm asking the right question, please feel free to improve this question.
The reason all three of these mechanisms exist is that they have different strengths and weaknesses.
First, if you have huge numbers of small, independent tasks, and there's no sensible way to batch them up (typically, this means you're writing a C10k server, but that's not the only possible case), microthreads win hands down. You can only run a few hundred OS threads or processes before everything either bogs down or just fails. So, either you use microthreads, or you give up on automatic concurrency and start writing explicit callbacks or coroutines. This is really the only time microthreads win; otherwise, they're just like OS threads except a few things don't work right.
Next, if your code is CPU-bound, you need processes. Microthreads are an inherently single-core solution; Threads in Python generally can't parallelize well because of the GIL; processes get as much parallelism as the OS can handle. So, processes will let your 4-core system run your code 4x as fast; nothing else will. (In fact, you might want to go farther and distribute across separate computers, but you didn't ask about that.) But if your code is I/O-bound, core-parallelism doesn't help, so threads are just as good as processes.
If you have lots of shared, mutable data, things are going to be tough. Processes require explicitly putting everything into sharable structures, like using multiprocessing.Array in place of list, which gets nightmarishly complicated. Threads share everything automatically—which means there are race conditions everywhere. Which means you need to think through your flow very carefully and use locks effectively. With processes, an experienced developer can build a system that works on all of the test data but has to be reorganized every time you give it a new set of inputs. With threads, an experienced developer can write code that runs for weeks before accidentally and silently scrambling everyone's credit card numbers.
Whichever of those two scares you more—do that one, because you understand the problem better. Or, if it's at all possible, step back and try to redesign your code to make most of the shared data independent or immutable. This may not be possible (without making things either too slow or too hard to understand), but think about it hard before deciding that.
If you have lots of independent data or shared immutable data, threads clearly win. Processes need either explicit sharing (like multiprocessing.Array again) or marshaling. multiprocessing and its third-party alternatives make marshaling pretty easy for the simple cases where everything is picklable, but it's still not as simple as just passing values around directly, and it's also a lot slower.
Unfortunately, most cases where you have lots of immutable data to pass around are the exact same cases where you need CPU parallelism, which means you have a tradeoff. And the best answer to this tradeoff may be OS threads on your current 4-core system, but processes on the 16-core system you have in 2 years. (If you organize things around, e.g., multiprocessing.ThreadPool or concurrent.futures.ThreadPoolExecutor, and trivially switch to Pool or ProcessPoolExecutor later—or even with a runtime configuration switch—that pretty much solves the problem. But this isn't always possible.)
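A sketch of such a runtime switch using concurrent.futures; the USE_PROCESSES flag and the crunch workload are made up for illustration:

```python
import concurrent.futures

USE_PROCESSES = False  # assumed runtime configuration switch

ExecutorClass = (concurrent.futures.ProcessPoolExecutor if USE_PROCESSES
                 else concurrent.futures.ThreadPoolExecutor)

def crunch(chunk):
    return sum(chunk)  # placeholder for the real work

if __name__ == "__main__":
    data = [range(100_000)] * 8
    with ExecutorClass() as executor:
        print(list(executor.map(crunch, data)))
```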
Finally, if your application inherently requires an event loop (e.g., a GUI app or a network server), pick the framework you like first. Coding with, say, PySide vs. wx, or twisted vs. gevent, is a bigger difference than coding with microthreads vs. OS threads. And, once you've picked the framework, see how much you can take advantage of its event loop where you thought you needed real concurrency. For example, if you need some code to run every 30 seconds, don't start a thread (micro- or OS) for that, ask the framework to schedule it however it wants.

How to track memory for a Python script

We have a system that only has one interpreter. Many user scripts come through this interpreter. We want to put a cap on each script's memory usage. There is only one process, and that process invokes tasklets for each script. So since we only have one interpreter and one process, we don't know a way to put a cap on each script's memory usage. What is the best way to do this?
I don't think that it's possible at all. Your question implies that the memory used by your tasklets is completely separate, which is probably not the case. Python caches small objects like integers; as far as I know, every 3 in your code uses the same object, which is not a problem because it is immutable. So if two of your tasklets use the same (small?) integer, they are already sharing memory. ;-)
Memory is separated at OS process level. There's no easy way to tell to which tasklet and even to which thread does a particular object belong.
Also, there's no easy way to add a custom bookkeeping allocator that would analyze which tasklet or thread is allocating a piece of memory and prevent it from allocating too much. It would also need to plug into garbage-collection code to discount objects which are freed.
Unless you're keen to write a custom Python interpreter, using a process per task is your best bet.
You don't even need to kill and respawn the interpreters every time you need to run another script. Pool several interpreters and only kill the ones that grow beyond a certain memory threshold after running a script. Limit the interpreters' memory consumption by means provided by the OS if you need to.
If you need to share large amounts of common data between the tasks, use shared memory; for smaller interactions, use sockets (with a messaging level above them as needed).
Yes, this might be slower than your current setup. But from your use of Python I suppose that in these scripts you don't do any time-critical computing anyway.
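One possible (Unix-only) sketch of capping a per-script worker process with resource.setrlimit, as suggested above; the 256 MB limit, the run_user_script helper, and the user_script.py path are assumptions:

```python
import multiprocessing
import resource  # Unix-only

MEM_LIMIT = 256 * 1024 * 1024  # assumed 256 MB cap per script

def run_user_script(path):
    # Cap this process's address space; allocations beyond it raise MemoryError.
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT, MEM_LIMIT))
    with open(path) as f:
        exec(compile(f.read(), path, "exec"), {"__name__": "__main__"})

if __name__ == "__main__":
    p = multiprocessing.Process(target=run_user_script, args=("user_script.py",))
    p.start()
    p.join()
    print("exit code:", p.exitcode)
```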
