Parallel data processing in Python

Parallel data processing in Python - python

I have an architecture which is basically a queue with url addresses and some classes to process the content of those url addresses. At the moment the code works good, but it is slow to sequentially pull a url out of the queue, send it to the correspondent class, download the url content and finally process it.
It would be faster and make proper use of resources if for example it could read n urls out of the queue and then shoot n processes or threads to handle the downloading and processing.
I would appreciate if you could help me with these:
What packages could be used to solve this problem ?
What other approach can you think of ?

You might want to look into the Python Multiprocessing library. With multiprocessing.pool, you can give it a function and an array, and it will call the function with each value of the array in parallel, using as many or as few processes as you specify.

If C-calls are slow, like downloading, database requests, other IO - You can use just threading.Thread
If python code is slow, like frameworks, your logic, not accelerated parsers - You need to use multiprocessing Pool or Process. Also it speedups python code, but it is less tread-save and need to deep understanding how it works in complex code (locks, semaphores).

Related

Python 3.5 non blocking functions

I have a fairly large python package that interacts synchronously with a third party API server and carries out various operations with the server. Additionally, I am now also starting to collect some of the data for future analysis by pickling the JSON responses. After profiling several serialisation/database methods, using pickle was the fastest in my case. My basic pseudo-code is:
While True:
do_existing_api_stuff()...
# additional data pickling
data = {'info': []} # there are multiple keys in real version!
if pickle_file_exists:
data = unpickle_file()
data['info'].append(new_data)
pickle_data(data)
if len(data['info']) >= 100: # file size limited for read/write speed
create_new_pickle_file()
# intensive section...
# move files from "wip" (Work In Progress) dir to "complete"
if number_of_pickle_files >= 100:
compress_pickle_files() # with lzma
move_compressed_files_to_another_dir()
My main issue is that the compressing and moving of the files takes several seconds to complete and is therefore slowing my main loop. What is the easiest way to call these functions in a non-blocking way without any major modifications to my existing code? I do not need any return from the function, however it will raise an error if anything fails. Another "nice to have" would be for the pickle.dump() to also be non-blocking. Again, I am not interested in the return beyond "did it raise an error?". I am aware that unpickle/append/re-pickle every loop is not particularly efficient, however it does avoid data loss when the api drops out due to connection issues, server errors, etc.
I have zero knowledge on threading, multiprocessing, asyncio, etc and after much searching, I am currently more confused than I was 2 days ago!
FYI, all of the file related functions are in a separate module/class, so that could be made asynchronous if necessary.
EDIT:
There may be multiple calls to the above functions, so I guess some sort of queuing will be required?

Easiest solution is probably the threading standard library package. This will allow you to spawn a thread to do the compression while your main loop continues.
There is almost certainly quite a bit of 'dead time' in your existing loop waiting for the API to respond and conversely there is quite a bit of time spent doing the compression when you could be usefully making another API call. For this reason I'd suggest separating these two aspects. There are lots of good tutorials on threading so I'll just describe a pattern which you could aim for
Keep the API call and the pickling in the main loop but add a step which passes the file path to each pickle to a queue after it is written
Write a function which takes a the queue as its input and works through the filepaths performing the compression
Before starting the main loop, start a thread with the new function as its target

python 3.4 multiprocessing

This question is asking for advice as well as assistance with some code.
I currently am learning Python with 3.4
I have built a basic network checking tool, i import items from a text file and for each of them i want python to check dns (using pydns), ping the ip (using subprocess to call OS native ping).
Currently i am checking 5000 to 9000 thousand IP address and its taking a number of hours, approx 4 to return all the results.
I am wondering if i can use multiprocessing or threading to speed this up but still the return the output to a list so that the row can be written to a csv file at the very end of the script in bulk.
I am new to python so please tell me if i have overlooked something i should of also.
Main code
http://pastebin.com/ZS23XrdE
Class
http://pastebin.com/kh65hYhG

You could use multiple threads to run child processes (ping in your case) and collect their output but it is not necessary. Here's a code example how to make multiple http requests using a thread pool. Here's code that uses concurrent.futures to make dns requests concurrently.
You don't need multiple threads/process to check 5000-9000 IPs (DNS, ICMP).
You could use gevent, twisted, asyncio to make network connections in the same process.

As most of the work seems IO based, you can easily rely on Threads.
Take a look at the Executor.map() function in cocurrent.futures:
https://docs.python.org/3/library/concurrent.futures.html
You can pass the list of IPs and the function you want to run against each element, the returned value, virtually, is the list of results of the given function.
In your specific case you can wrap the two worker's methods (check_dns_ip and os_ping) in a single one and pass it to the ThreadPoolExecutor.map function.

Best way of parallelizing functions from django task

I have defined a Django task (it gets launched using ./manage.py task_name). This task reads a set of objects from the database and performs an operation (usually sending a ping) on each of them, writing each individual result back to the database.
Currently I have a plain for loop, but it's obviously too slow, because it waits for each ping to end to start with the next one. So my question here is, what's the best way of parallelizing the operations?
As far as I've read, the best way I've found is using Pool from the multiprocessing module, something like the code in this answer.

For your task, which appears pretty simple, multiprocessing is probably the easiest approach, if only because it's already part of the stdlib. You could do it something like this (untested!):
def run_process(record):
result = ping(record)
pool = Pool(processes=10)
results = pool.map_async(run_process, [records])
for r in results.get():
write_to_database(r)

I would simply recommend celery.
Write celery tasks for operations which you want to be executed parallelizing/async. Let celery handle the concurrency, and you own code can get rid of the mess process management.

I'd say that the best tool would be some event-driven networking engine like twisted library
unlike multi threading / multi processing solutions, event-driven networking engines shine when it comes to intense io operations, without context switching and waiting for block operation they use the system resources in the most efficient way.
one way to use twisted library is to write a scrapy spider that will handle both external network calls like those ping requests you mentioned as well as writing back the response to the database.
a few guidelines for writing such spider:
to read spider list of urls from the database see https://gist.github.com/saidimu/1024207
to properly write the responses to the database see Writing items to a MySQL database in Scrapy
once you have this spider written, simply launch it from your django command or straight from the shell:
scrapy crawl <spider name>

Python Socket and Thread pooling, how to get more performance?

I am trying to implement a basic lib to issue HTTP GET requests. My target is to receive data through socket connections - minimalistic design to improve performance - usage with threads, thread pool(s).
I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:
hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...
I intend to use sockets because of performance issues. I intend to use a number of sockets which keeps connected (if possible and it usually is) and issue HTTP GET requests. The idea came from urllib low performance on continuous requests, then I met urllib3, then I realized it uses httplib and then I decided to try sockets. So here's what I accomplished till now:
GETSocket class, SocketPool class, ThreadPool and Worker classes
GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
So, I use these classes like that:
sp = Comm.SocketPool(host,size=self.poolsize, timeout=5)
for link in linklist:
pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
self.count += 1
pool.wait_completion()
pass
__get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to results list. I am using a pool of 5 threads which has a socket pool of 5 GETSocket classes.
What I wonder is, is there any other possible way that I can improve performance of this system?
I've read about asyncore here, but I couldn't figure out how to use same socket connection with class HTTPClient(asyncore.dispatcher) provided.
Another point, I don't know if I'm using a blocking or a non-blocking socket, which would be better for performance or how to implement which one.
Please be specific about your experiences, I don't intend to import another library to do just HTTP GET so I want to code my own tiny library.
Any help appreciated, thanks.

Do this.
Use multiprocessing. http://docs.python.org/library/multiprocessing.html.
Write a worker Process which puts all of the URL's into a Queue.
Write a worker Process which gets a URL from a Queue and does a GET, saving a file and putting the File information into another Queue. You'll probably want multiple copies of this Process. You'll have to experiment to find how many is the correct number.
Write a worker Process which reads file information from a Queue and does whatever it is that you're trying do.

I finally found a well chosen path to solve my problems. I was using Python 3 for my project and my only option was to use pycurl, so this made me have to port my project back to Python 2.7 series.
Using pycurl, I gained:
- Consistent responses to my requests (actually my script has to deal with minimum 10k URLs)
- With the usage of ThreadPool class I am receiving responses as fast as my system can (received data is processed later - so multiprocessing is not much of a possibility here)
I tried httplib2 first, I realized that it is not acting as solid as it acts on Python 2, by switching to pycurl I lost caching support.
Final conclusion: When it comes to HTTP communication, one could need a tool like (py)curl at his disposal. It is a lifesaver, especially when one is dealing with loads of URLs (try sometimes for fun: you will get lots of weird responses from them)
Thanks for the replies, folks.

Downloading a Large Number of Files from S3

What's the Fastest way to get a large number of files (relatively small 10-50kB) from Amazon S3 from Python? (In the order of 200,000 - million files).
At the moment I am using boto to generate Signed URLs, and using PyCURL to get the files one by one.
Would some type of concurrency help? PyCurl.CurlMulti object?
I am open to all suggestions. Thanks!

I don't know anything about python, but in general you would want to break the task down into smaller chunks so that they can be run concurrently. You could break it down by file type, or alphabetical or something, and then run a separate script for each portion of the break down.

In the case of python, as this is IO bound, multiple threads will use of the CPU, but it will probably use up only one core. If you have multiple cores, you might want to consider the new multiprocessor module. Even then you may want to have each process use multiple threads. You would have to do some tweaking of number of processors and threads.
If you do use multiple threads, this is a good candidate for the Queue class.

You might consider using s3fs, and just running concurrent file system commands from Python.

I've been using txaws with twisted for S3 work, though what you'd probably want is just to get the authenticated URL and use twisted.web.client.DownloadPage (by default will happily go from stream to file without much interaction).
Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000, I'd probably make a generator and use a cooperator to set my concurrency and just let the generator generate every required download request.
If you're not familiar with twisted, you'll find the model takes a bit of time to get used to, but it's oh so worth it. In this case, I'd expect it to take minimal CPU and memory overhead, but you'd have to worry about file descriptors. It's quite easy to mix in perspective broker and farm the work out to multiple machines should you find yourself needing more file descriptors or if you have multiple connections over which you'd like it to pull down.

what about thread + queue, I love this article: Practical threaded programming with Python

Each job can be done with appropriate tools :)
You want use python for stress testing S3 :), so I suggest find a large volume downloader program and pass link to it.
On Windows I have experience for installing ReGet program (shareware, from http://reget.com) and creating downloading tasks via COM interface.
Of course there may other programs with usable interface exists.
Regards!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parallel data processing in Python - python

You might want to look into the Python Multiprocessing library. With multiprocessing.pool, you can give it a function and an array, and it will call the function with each value of the array in parallel, using as many or as few processes as you specify.

Related

Python 3.5 non blocking functions

python 3.4 multiprocessing

Best way of parallelizing functions from django task

Python Socket and Thread pooling, how to get more performance?

Downloading a Large Number of Files from S3

Categories

Resources