Python 3.4 multiprocessing

This question is asking for advice as well as assistance with some code.
I am currently learning Python with 3.4.
I have built a basic network checking tool. I import items from a text file, and for each of them I want Python to check DNS (using pydns) and ping the IP (using subprocess to call the OS-native ping).
Currently I am checking 5000 to 9000 IP addresses and it is taking a number of hours, approximately 4, to return all the results.
I am wondering if I can use multiprocessing or threading to speed this up but still return the output to a list, so that the rows can be written to a CSV file in bulk at the very end of the script.
I am new to Python, so please also tell me if I have overlooked something I should have considered.
Main code
http://pastebin.com/ZS23XrdE
Class
http://pastebin.com/kh65hYhG

You could use multiple threads to run the child processes (ping in your case) and collect their output, but it is not necessary. There are existing examples of making multiple HTTP requests with a thread pool, and of making DNS requests concurrently with concurrent.futures.
You don't need multiple threads/processes to check 5000-9000 IPs (DNS, ICMP).
You could use gevent, twisted or asyncio to make the network connections in a single process.
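For illustration, here is a minimal asyncio sketch of the ping part. It uses Python 3.7+ syntax (asyncio.run), which is newer than the 3.4 mentioned in the question; the -c/-W flags assume a Linux-style ping and the IP list is a placeholder:

import asyncio

async def ping(ip, semaphore):
    # run one OS ping per address, but cap the number of concurrent pings
    async with semaphore:
        proc = await asyncio.create_subprocess_exec(
            "ping", "-c", "1", "-W", "1", ip,
            stdout=asyncio.subprocess.DEVNULL,
            stderr=asyncio.subprocess.DEVNULL)
        return ip, (await proc.wait()) == 0

async def main(ips):
    semaphore = asyncio.Semaphore(100)   # at most 100 pings in flight
    return await asyncio.gather(*(ping(ip, semaphore) for ip in ips))

results = asyncio.run(main(["8.8.8.8", "1.1.1.1"]))  # placeholder IP list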

As most of the work appears to be I/O-bound, you can easily rely on threads.
Take a look at the Executor.map() function in concurrent.futures:
https://docs.python.org/3/library/concurrent.futures.html
You pass it the list of IPs and the function you want to run against each element, and the value it returns is effectively the list of results of that function.
In your specific case you can wrap the two worker methods (check_dns_ip and os_ping) in a single function and pass it to ThreadPoolExecutor.map.
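A minimal self-contained sketch of that idea follows; it uses a plain reverse DNS lookup and a Linux-style ping in place of the question's own check_dns_ip/os_ping methods, and the IP list and worker count are placeholders:

import csv
import socket
import subprocess
from concurrent.futures import ThreadPoolExecutor

def check_host(ip):
    # reverse DNS lookup; the real script would call its own check_dns_ip here
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        name = ""
    # one ping with a 1-second timeout (Linux-style flags), as os_ping would do
    alive = subprocess.call(["ping", "-c", "1", "-W", "1", ip],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL) == 0
    return ip, name, alive

ip_list = ["8.8.8.8", "1.1.1.1"]  # stand-in for the IPs imported from the text file

with ThreadPoolExecutor(max_workers=50) as executor:
    rows = list(executor.map(check_host, ip_list))  # results keep input order

with open("results.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)  # write everything to the CSV in one go at the end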

Related

Python multiprocessing: one worker, dynamic number of receivers of all worker data (1:n)

I am planning to set up a small proxy service for a remote sensor that only accepts one connection. I have a temporary solution and I am now designing a more robust version, and have therefore dived deeper into the Python multiprocessing module.
I have written a couple of systems in Python using a main process which spawns subprocesses via the multiprocessing module and uses multiprocessing.Queue to communicate between them. This works quite well, and some of these programs/scripts are doing their job in a production environment.
The new case is slightly different since it uses 2+n processes:
One data-collector that reads data from the sensor (at 100 Hz) and every once in a while receives a short ASCII string as a command
One main-server that binds to a socket, listens for new connections and spawns...
n child-servers that handle clients who want to receive the sensor data
While communication from the child-servers to the data-collector is pretty straightforward using a multiprocessing.Queue, which handles an n:1 connection well enough, I have problems with the other direction. I can't use a queue for that as well, because all child-servers need to get all the data the sensor produces while they are active. At least I haven't found a way to configure a Queue to mimic that behaviour, as get() takes the topmost item out of the Queue by design.
I looked into shared memory already, which massively increases the management overhead, since, as far as I understand it, I would basically need to implement a streaming buffer myself.
The only safe way I see right now is using a Redis server and message queues, but I am a bit hesitant, since that would need more infrastructure than I would like.
Is there a pure python internal way?
Maybe you can use MQTT for that?
You did not clearly specify, but this sounds like the observer pattern -
or do you want the clients to poll each time they need data?
It depends on which delays / data rate / jitter you can accept.
After you provided the information:
"The whole setup runs on one machine in one process space. What I would like to have is a way without going through a third-party process."
I would suggest checking the observer pattern.
More information can be found, for example, at:
https://www.youtube.com/watch?v=_BpmfnqjgzQ&t=1882s
and
https://refactoring.guru/design-patterns/observer/python/example
and
https://www.protechtraining.com/blog/post/tutorial-the-observer-pattern-in-python-879
and
https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Observer.html
Your server should fork for each new connection and register with the observer; it will therefore be informed about every change.
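If a third-party broker is off the table, one simple pure-Python variant of that idea is to give each child-server its own multiprocessing.Queue and have the data-collector publish every sample to all registered queues. The sketch below is illustrative only; the sensor loop and the names are placeholders:

import multiprocessing as mp

def collector(subscriber_queues):
    for sample in range(5):              # stand-in for reading the sensor at 100 Hz
        for q in subscriber_queues:
            q.put(sample)                # every child-server gets every sample
    for q in subscriber_queues:
        q.put(None)                      # sentinel: no more data

def child_server(q, name):
    while True:
        sample = q.get()
        if sample is None:
            break
        print(name, "got", sample)       # stand-in for forwarding to the client

if __name__ == "__main__":
    queues = [mp.Queue() for _ in range(3)]   # one queue per connected client
    servers = [mp.Process(target=child_server, args=(q, "server-%d" % i))
               for i, q in enumerate(queues)]
    for p in servers:
        p.start()
    collector(queues)
    for p in servers:
        p.join()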

Best way of parallelizing functions from django task

I have defined a Django task (it gets launched using ./manage.py task_name). This task reads a set of objects from the database and performs an operation (usually sending a ping) on each of them, writing each individual result back to the database.
Currently I have a plain for loop, but it's obviously too slow, because it waits for each ping to finish before starting the next one. So my question here is: what's the best way of parallelizing the operations?
As far as I've read, the best way I've found is using Pool from the multiprocessing module, something like the code in this answer.
For your task, which appears pretty simple, multiprocessing is probably the easiest approach, if only because it's already part of the stdlib. You could do it something like this (untested!):
from multiprocessing import Pool

def run_process(record):
    return ping(record)          # return the result so map_async can collect it

pool = Pool(processes=10)
results = pool.map_async(run_process, records)   # records is the list of objects
for r in results.get():          # get() blocks until all pings have finished
    write_to_database(r)
I would simply recommend Celery.
Write Celery tasks for the operations you want executed in parallel/asynchronously. Let Celery handle the concurrency, and your own code can be rid of the mess of process management.
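A minimal sketch of that approach, assuming a local Redis broker; the Django model lookup and the result write-back are only hinted at in comments, and MyModel is a hypothetical model name:

import subprocess
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker URL

@app.task
def ping_record(host):
    # ping once; the real task would load the object and save the result back
    alive = subprocess.call(["ping", "-c", "1", host],
                            stdout=subprocess.DEVNULL) == 0
    return host, alive

# In the management command, fan out one task per object:
# for obj in MyModel.objects.all():          # MyModel is hypothetical
#     ping_record.delay(obj.hostname)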
I'd say that the best tool would be an event-driven networking engine like the twisted library.
Unlike multithreading/multiprocessing solutions, event-driven networking engines shine when it comes to intense I/O operations: without context switching and waiting on blocking operations, they use system resources in the most efficient way.
One way to use the twisted library is to write a scrapy spider that handles both the external network calls, like those ping requests you mentioned, and writing the responses back to the database.
A few guidelines for writing such a spider:
to read the spider's list of URLs from the database, see https://gist.github.com/saidimu/1024207
to properly write the responses to the database, see Writing items to a MySQL database in Scrapy
Once you have this spider written, simply launch it from your Django command or straight from the shell:
scrapy crawl <spider name>
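A minimal sketch of such a spider, assuming a hypothetical load_targets_from_db() helper in place of a real database query and a plain dict item that an item pipeline would persist:

import scrapy

def load_targets_from_db():
    # hypothetical stand-in for reading the target URLs from the database
    return ["http://example.com/health", "http://example.org/health"]

class PingSpider(scrapy.Spider):
    name = "ping_targets"

    def start_requests(self):
        for url in load_targets_from_db():
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # yield a plain item; an item pipeline can write it back to the database
        yield {"url": response.url, "status": response.status}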

Parallel data processing in Python

I have an architecture which is basically a queue of URL addresses and some classes to process the content of those URLs. At the moment the code works well, but it is slow to sequentially pull a URL out of the queue, send it to the corresponding class, download the URL content and finally process it.
It would be faster, and make proper use of resources, if it could for example read n URLs out of the queue and then spawn n processes or threads to handle the downloading and processing.
I would appreciate if you could help me with these:
What packages could be used to solve this problem ?
What other approach can you think of ?
You might want to look into the Python multiprocessing library. With multiprocessing.Pool, you can give it a function and a list, and it will call the function on each value of the list in parallel, using as many or as few processes as you specify.
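A minimal sketch of that approach; the URL list is a placeholder and the body of fetch_and_process stands in for the question's download-and-process step:

from multiprocessing import Pool
from urllib.request import urlopen

def fetch_and_process(url):
    content = urlopen(url, timeout=10).read()   # download the URL content
    return url, len(content)                    # replace with the real processing

if __name__ == "__main__":
    urls = ["http://example.com", "http://example.org"]  # placeholder URL list
    with Pool(processes=8) as pool:
        for url, size in pool.map(fetch_and_process, urls):
            print(url, size)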
If C calls are slow - such as downloading, database requests and other I/O - you can use just threading.Thread.
If Python code is slow - frameworks, your own logic, non-accelerated parsers - you need to use a multiprocessing Pool or Process. That also speeds up Python code, but it is less thread-safe and needs a deeper understanding of how it works in complex code (locks, semaphores).

Python-C api concurrency issue

We are developing a small c server application. The server application does some data processing and responds back to the client. To keep the data processing part configurable and flexible we decided to go for scripting and based on the availability of various ready modules we decided to go for Python. We are using the Python-C api to send/receive the data between c and python.
The Algorithm works something like this:-
The server receives some data from the client; this data is stored in a dictionary created in C. The dictionary is created from C using the API function PyDict_New(), and the input is stored as key-value pairs using the API function PyDict_SetItemString().
Next, we execute the Python script with PyRun_SimpleString(), passing the script as a parameter. This script makes use of the dictionary created in C. Note that we make the dictionary created in C accessible to the script using PyImport_AddModule() and PyModule_AddObject().
We store the result of the data processing in the script as a key-value pair in the same dictionary created above. The C code can then simply access the result variable (key-value pair) after the script has executed.
The problem
The problem we are facing is with concurrent requests coming in from different clients. When multiple requests come in from different clients we tend to get object reference count errors. Note that for each request which comes in for a user, we create an independent dictionary for that user alone. To overcome this problem we wrapped the call to PyRun_SimpleString() in PyEval_AcquireLock() and PyEval_ReleaseLock(), but doing this has turned the script execution into a blocking call. So if a script takes a long time to execute, all the other users are left waiting for a response.
Could you please suggest the best possible approach or give pointers to where we are going wrong. Please ping me for more information.
Any help/guidance will be appreciated.
Perhaps you are missing one of the calls mentioned in this answer.
I suggest you investigate the multiprocessing module.
You should probably read http://docs.python.org/c-api/init.html#thread-state-and-the-global-interpreter-lock (your problem is explained in its first paragraph).
When you acquire the GIL, do so around your direct manipulation of Python objects. The call to PyRun_SimpleString will handle the GIL internally and will give it up on long-running operations or just every X instructions. It WILL NOT be truly multi-threaded, however.
Edit:
You need to acquire the lock and you need to ensure that Python knows it's in a different thread state:
// acquire the lock and switch to this connection's thread state
PyEval_AcquireLock();
PyThreadState_Swap(perThreadState);
// execute some python code
PyRun_SimpleString("print 123");
// clear the thread state and release the lock
PyThreadState_Swap(NULL);
PyEval_ReleaseLock();

Python Socket and Thread pooling, how to get more performance?

I am trying to implement a basic library to issue HTTP GET requests. My target is to receive data through socket connections - a minimalistic design to improve performance - for use with threads and thread pool(s).
I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:
hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...
I intend to use sockets because of performance issues. I intend to use a number of sockets which stay connected (if possible, and they usually do) and issue HTTP GET requests over them. The idea came from urllib's low performance on continuous requests; then I met urllib3, then I realized it uses httplib, and then I decided to try sockets. So here's what I have accomplished so far:
GETSocket class, SocketPool class, ThreadPool and Worker classes
GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
So, I use these classes like that:
sp = Comm.SocketPool(host, size=self.poolsize, timeout=5)
for link in linklist:
    pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
    self.count += 1
pool.wait_completion()
The __get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to the results list. I am using a pool of 5 threads which has a socket pool of 5 GETSocket instances.
What I wonder is: is there any other possible way to improve the performance of this system?
I've read about asyncore here, but I couldn't figure out how to use the same socket connection with the class HTTPClient(asyncore.dispatcher) provided.
Another point: I don't know whether I'm using a blocking or a non-blocking socket, which of the two would be better for performance, or how to implement either one.
Please be specific about your experiences; I don't intend to import another library just to do HTTP GET, so I want to write my own tiny library.
Any help appreciated, thanks.
Do this.
Use multiprocessing: http://docs.python.org/library/multiprocessing.html
Write a worker Process which puts all of the URLs into a Queue.
Write a worker Process which gets a URL from that Queue, does a GET, saves the file and puts the file information into another Queue. You'll probably want multiple copies of this Process; you'll have to experiment to find the correct number.
Write a worker Process which reads file information from the second Queue and does whatever it is that you're trying to do.
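A minimal sketch of that three-stage pipeline; the URL list, the "file information" (here just the body length) and the final processing step are placeholders:

import multiprocessing as mp
from urllib.request import urlopen

def producer(url_q, n_workers):
    for url in ["http://example.com", "http://example.org"]:   # placeholder URLs
        url_q.put(url)
    for _ in range(n_workers):
        url_q.put(None)                      # one stop sentinel per fetcher

def fetcher(url_q, result_q):
    while True:
        url = url_q.get()
        if url is None:
            break
        body = urlopen(url, timeout=10).read()
        result_q.put((url, len(body)))       # "file information" for stage three
    result_q.put(None)

def consumer(result_q, n_workers):
    done = 0
    while done < n_workers:
        item = result_q.get()
        if item is None:
            done += 1
            continue
        print("processed", item)             # whatever you're trying to do

if __name__ == "__main__":
    n_workers = 4
    url_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=producer, args=(url_q, n_workers))]
    procs += [mp.Process(target=fetcher, args=(url_q, result_q))
              for _ in range(n_workers)]
    procs += [mp.Process(target=consumer, args=(result_q, n_workers))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()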
I finally found a path that solved my problems. I was using Python 3 for my project and my only option was to use pycurl, so this made me port my project back to the Python 2.7 series.
Using pycurl, I gained:
- Consistent responses to my requests (my script actually has to deal with a minimum of 10k URLs)
- With the use of the ThreadPool class I receive responses as fast as my system can handle (the received data is processed later, so multiprocessing is not much of a possibility here)
I tried httplib2 first, but I realized that it does not behave as solidly as it does on Python 2; by switching to pycurl I lost caching support.
Final conclusion: when it comes to HTTP communication, one may need a tool like (py)curl at one's disposal. It is a lifesaver, especially when dealing with loads of URLs (try it sometime for fun: you will get lots of weird responses from them).
Thanks for the replies, folks.
