Python parallelism preserving data

I need to repeatedly calculate very large Python arrays based on a small input and a very large, constant bulk of data stored on the drive. I can successfully parallelize it by splitting that input bulk and joining the responses. Here comes the problem: sending the identical data bulk to the pool is too slow. Moreover, it doubles the required memory. Ideally I would read the data in the worker from the file and keep it there for repeated reuse.
How do I do that? The only thing I can think of is creating multiple servers that will listen to requests from the pool. Somehow that looks like an unnatural solution to quite a common problem. Am I missing a better solution?
best regards,
Vladimir
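One common way to get what the question asks for (not from the original thread, just a hedged sketch): give multiprocessing.Pool an initializer that loads the constant bulk once per worker process and keeps it in a module-level global, so each worker reads the file exactly once and reuses it for every task. The file name 'bulk.npy' and the compute function are hypothetical.

import numpy as np
from multiprocessing import Pool

_bulk = None  # per-worker copy of the constant data

def init_worker(path):
    # Runs once per worker process: load the big constant data from disk
    # and keep it alive for the lifetime of the worker.
    global _bulk
    _bulk = np.load(path)

def compute(small_input):
    # Combine the worker-local bulk data with the small per-task input.
    return float((_bulk * small_input).sum())

if __name__ == '__main__':
    small_inputs = [0.1, 0.2, 0.3, 0.4]
    # 'bulk.npy' is a hypothetical file holding the constant bulk data.
    with Pool(processes=4, initializer=init_worker,
              initargs=('bulk.npy',)) as pool:
        results = pool.map(compute, small_inputs)
    print(results)

Only the small inputs and the small results cross process boundaries; the bulk data never gets pickled or sent through the pool.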

Related

Dask read csv versus pandas read csv

I have the following problem. I have a huge CSV file and want to load it with multiprocessing. Pandas needs 19 seconds for an example file with 500,000 rows and 130 columns with different dtypes. I tried Dask because I want to multiprocess the reading, but it took much longer and I wonder why. I have 32 cores and tried this:
import dask.dataframe as dd
import dask.multiprocessing

dask.config.set(scheduler='processes')

df = dd.read_csv(filepath,
                 sep='\t',
                 blocksize=1000000,
                 )
df = df.compute(scheduler='processes')  # convert to pandas
When reading a huge file from disk, the bottleneck is the IO. As Pandas is highly optimized with a C parsing engine, there is very little to gain. Any attempt to use multi-processing or multi-threading is likely to be less performant, because you will spend the same time for loading the data from the disk, and only add some overhead for synchronizing the different processes or threads.
Consider what this means:
df = df.compute(scheduler='processes')
- each process accesses some chunk of the original data; this may be in parallel or, quite likely, limited by the IO of the underlying storage device
- each process makes a dataframe from its data, which is CPU-heavy and will parallelise well
- each chunk is serialised by that process and communicated to the client from where you called it
- the client deserialises the chunks and concatenates them for you.
Short story: don't use Dask if your only job is to get a Pandas dataframe in memory, it only adds overhead. Do use Dask if you can operate on the chunks independently, and only collect small output in the client (e.g., groupby-aggregate, etc.).
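For example, the kind of job where Dask does pay off is chunk-wise work that only returns a small aggregate. A minimal sketch, reusing the filepath from the question and hypothetical column names:

import dask.dataframe as dd

# Each partition is parsed and aggregated inside its own worker process;
# only the small groupby result is sent back to the client.
ddf = dd.read_csv(filepath, sep='\t', blocksize=1000000)
summary = (ddf.groupby('category')['value']   # hypothetical column names
              .mean()
              .compute(scheduler='processes'))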
You could use multiprocessing, but since the file is not split, you risk having processes wait whenever one of them wants to access the file (which is what your measurements suggest is happening).
If you want to use multiprocessing effectively, I recommend splitting the file into separate parts and merging all the results in a final step.
I recommend trying different numbers of processes with the num_workers keyword argument to compute.
Contrary to what is said above, read_csv is definitely compute-bound, and having a few processes working in parallel will likely help.
However, having too many processes all hammering at the disk at the same time might cause a lot of contention and slow things down.
I recommend experimenting a bit with different numbers of processes to see what works best.
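Something like the following, assuming the dd.read_csv result from the question is kept around as ddf (the worker counts are just values to try):

import time

# Try a few worker counts and keep whichever is fastest on your hardware.
for n in (4, 8, 16, 32):
    start = time.perf_counter()
    df = ddf.compute(scheduler='processes', num_workers=n)
    print(n, 'workers:', round(time.perf_counter() - start, 2), 's')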

Split search between python processes

I currently have a large database (stored as a numpy array) that I'd like to perform a search on, however due to its size I'd like to split the database into pieces and perform a search on each piece before combining the results.
I'm looking for a way to host the split database pieces on separate python processes where they will wait for a query, after which they perform the search, and send the results back to the main process.
I've tried a load of different things with the multiprocessing package, but I can't find any way to (a) keep processes alive after loading the database on them, and (b) send more commands to the same process after initialisation.
Been scratching my head about this one for several days now, so any advice would be much appreciated.
EDIT: My problem is analogous to trying to host 'web' apis in the form of python processes, and I want to be able to send and receive requests at will without reloading the database shards every time.
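A sketch of the long-lived-worker pattern the question describes (not from the original thread): each worker process loads its shard once, then loops over its own request queue until it receives a sentinel, so queries can keep arriving without anything being reloaded. The shard file names and the "search" itself are hypothetical.

import numpy as np
from multiprocessing import Process, Queue

def shard_worker(shard_path, requests, results):
    # Load this worker's shard once; it stays in memory for the
    # lifetime of the process.
    shard = np.load(shard_path)
    while True:
        query = requests.get()
        if query is None:                       # sentinel: shut down
            break
        # Hypothetical search: indices of entries close to the query.
        hits = np.flatnonzero(np.abs(shard - query) < 1e-3)
        results.put((shard_path, hits))

if __name__ == '__main__':
    shard_paths = ['shard0.npy', 'shard1.npy']  # hypothetical shard files
    results = Queue()
    request_queues = [Queue() for _ in shard_paths]
    workers = [Process(target=shard_worker, args=(path, q, results))
               for path, q in zip(shard_paths, request_queues)]
    for w in workers:
        w.start()

    # Send as many queries as you like; the shards are never reloaded.
    for query in (0.5, 0.9):
        for q in request_queues:
            q.put(query)                        # broadcast to every shard
        partial = [results.get() for _ in workers]  # one answer per shard
        print(query, partial)

    for q in request_queues:
        q.put(None)                             # tell workers to exit
    for w in workers:
        w.join()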

Real-time data collection and 'offline' processing

I have a continuous stream of data. I want to do a small amount of processing to the data in real-time (mostly just compression, rolling some data off the end, whatever needs doing) and then store the data. Presumably no problem. HDF5 file format should do great! OOC data, no problem. Pytables.
Now the trouble. Occasionally, as a completely separate process (so that data is still being gathered), I would like to perform a time-consuming calculation involving the data (on the order of minutes). This involves reading the same file I'm writing.
How do people do this?
Of course, reading a file that you're currently writing is bound to be challenging, but it seems it must have come up often enough in the past that people have considered some sort of slick solution, or at least a natural work-around.
Partial solutions:
1. It seems that HDF5 1.10.0 has a capability called SWMR (Single-Writer, Multiple-Reader). This seems like exactly what I want. I can't find a Python wrapper for this recent version, or if it exists I can't get Python to talk to the right version of HDF5. Any tips here would be welcomed. I'm using the Conda package manager. (A sketch of the SWMR API appears after this list.)
2. I could imagine writing to a buffer which is occasionally flushed and added to the large database. How do I ensure that I'm not missing data going by while doing this? This also seems like it might be computationally expensive, but perhaps there's no getting around that.
3. Collect less data. What's the fun in that?
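For reference, a minimal sketch of what SWMR looks like through h5py; this assumes an h5py build linked against HDF5 1.10 or newer, and the file and dataset names are hypothetical:

import h5py
import numpy as np

# --- writer process ---
f = h5py.File('stream.h5', 'w', libver='latest')
dset = f.create_dataset('data', shape=(0,), maxshape=(None,), dtype='f8')
f.swmr_mode = True                   # from here on, readers may attach
for block in (np.random.rand(100) for _ in range(5)):
    n = dset.shape[0]
    dset.resize((n + len(block),))
    dset[n:] = block
    dset.flush()                     # make the new rows visible to readers

# --- reader process (a separate script) ---
# r = h5py.File('stream.h5', 'r', libver='latest', swmr=True)
# d = r['data']
# d.refresh()                        # pick up rows written since opening
# print(d.shape)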
I suggest you take a look at adding Apache Kafka to your pipeline; it can act as a data buffer and help you separate the different tasks done on the data you collect.
Pipeline example:
raw data ===> kafka topic (raw_data) ===> small processing ===> kafka topic (light_processing) ===> a process reads from the light_processing topic and writes to a db or file
At the same time, another process can read the same data from the light_processing topic (or any other topic) and do your heavy processing, and so on.
As long as the light-processing and heavy-processing consumers use different group IDs, each of them gets its own full copy of the stream, so both processes see the same data.
hope it helped.
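A minimal sketch of the middle ("small processing") stage, assuming the kafka-python package, a broker at localhost:9092, and the topic names from the diagram above; all of these are assumptions, not something given in the original answer:

from kafka import KafkaConsumer, KafkaProducer
import json

consumer = KafkaConsumer(
    'raw_data',                              # topic from the diagram above
    bootstrap_servers='localhost:9092',      # assumed broker address
    group_id='light-processing',
    value_deserializer=lambda b: json.loads(b.decode('utf-8')),
)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda obj: json.dumps(obj).encode('utf-8'),
)

for message in consumer:
    record = message.value
    # "small processing": compress, trim, roll data off the end, etc.
    processed = {'t': record.get('t'), 'value': record.get('value')}
    producer.send('light_processing', value=processed)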

Communicating between one Read and one Write process in python

I am looking for a way to pass information safely (!) between two Python processes running separate scripts.
One process reads only, and the other writes only.
The data I need to pass is a dictionary.
The best solution I found so far is a local SQL server (since the data itself is kind of a table) and I assume SQLite will handle the transaction safely.
Is there a better way, maybe a module to lock a file from being read while it is written to and vice versa?
I am using linux ubuntu 11.10, but a cross platform solution will be welcome.
To transfer data between processes, you can use pipes and sockets. Python uses pickling to convert between objects and byte streams.
To transfer the data safely, you need to make sure that the data is transferred exactly once. Which means: The destination process needs to be able to say "I didn't get everything, send it again" while the sender needs some form of receipt.
To achieve this, you should add a "header" to the data which gives it a unique key (maybe a timestamp or a hash). Both receiver and sender can then keep a list of those they have sent/seen to avoid processing data twice.
For one-way communication you could use e.g. multiprocessing.Queue or multiprocessing.queues.SimpleQueue.
Shared memory is also an option, using multiprocessing.Array. But you'd have to split the dictionary into at least two arrays (keys and values). This will only work well if all the values are of the same basic type.
The good point about both multiprocessing.Queue and multiprocessing.Array is that they are both protected by locks internally, so you don't have to worry about that.
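As a concrete illustration of the Queue option (a sketch, not from the original answer): the writer puts whole dictionary snapshots on a multiprocessing.Queue and the reader takes them off; the Queue handles the locking and the pickling for you.

from multiprocessing import Process, Queue

def writer(q):
    # Send a series of dictionary snapshots; None signals "no more data".
    for i in range(3):
        q.put({'step': i, 'value': i * i})
    q.put(None)

def reader(q):
    while True:
        snapshot = q.get()
        if snapshot is None:
            break
        print('received', snapshot)

if __name__ == '__main__':
    q = Queue()
    w = Process(target=writer, args=(q,))
    r = Process(target=reader, args=(q,))
    w.start(); r.start()
    w.join(); r.join()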

Python Strategy for Large Scale Analysis (on-the-fly or deferred)

To analyze a large number of websites or financial data and pull out parametric data, what are the optimal strategies?
I'm classifying the following strategies as either "on-the-fly" or "deferred". Which is best?
1. On-the-fly: process the data on-the-fly and store the parametric data in a database
2. Deferred: store all the source data as ASCII in a file system and post-process later, or with a processing-data-daemon
3. Deferred: store all pages as BLOBs in a database to post-process later, or with a processing-data-daemon
Number 1 is simplest, especially if you only have a single server. Can #2 or #3 be more efficient with a single server, or do you only see the power with multiple servers?
Are there any python projects that are already geared toward this kind of analysis?
Edit: by best, I mean fastest execution (to keep the user from waiting), with ease of programming as a secondary concern.
I'd use Celery, either on a single machine or on multiple machines, with the "on-the-fly" strategy. You can have an aggregation Task that fetches the data, and a process Task that analyzes it and stores it in a db. This is a highly scalable approach, and you can tune it according to your computing power.
The "on-the-fly" strategy is more efficient in the sense that you process your data in a single pass. The other two involve an extra step: re-retrieving the data from wherever you saved it and processing it after that.
Of course, everything depends on the nature of your data and the way you process them. If the process phase is slower than the aggregation, the "on-the-fly" strategy will hang and wait until completion of the processing. But again, you can configure celery to be asynchronous, and continue to aggregate while there are data yet unprocessed.
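A minimal sketch of that two-task layout with Celery; the broker URL, the task bodies, and the storage step are placeholders, not something prescribed by the answer above:

from celery import Celery, chain

# Assumed broker; any broker Celery supports would do.
app = Celery('analysis', broker='redis://localhost:6379/0')

@app.task
def aggregate(url):
    # Fetch the raw page / financial record for this url (placeholder body).
    return {'url': url, 'raw': '<html>placeholder</html>'}

@app.task
def process(payload):
    # Pull out the parametric data and store it in a db (placeholder body).
    params = {'url': payload['url'], 'length': len(payload['raw'])}
    # save_to_db(params)  # hypothetical storage step
    return params

# Queue the two-step pipeline for each item; workers pick the jobs up
# as computing power allows.
for url in ('http://example.com/a', 'http://example.com/b'):
    chain(aggregate.s(url), process.s()).delay()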
First: "fastest execution to prevent user from waiting" means some kind of deferred processing. Once you decide to defer the processing -- so the user doesn't see it -- the choice between flat-file and database is essentially irrelevant with respect to end-user-wait time.
Second: databases are slow. Flat files are fast. Since you're going to use celery and avoid end-user-wait time, however, the distinction between flat file and database becomes irrelevant.
"Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon"
This is the fastest. Use Celery to load the flat files.
