Newbie-ish Python/pandas user here. I've been playing with the chunksize arg in read_fwf and iterating value_counts over variables. I wrote a function that takes args such as the file iterator and the variables to parse and count. I was hoping to parallelize this function so I could read two files at the same time through the same function.
It does appear to work... However, I'm getting unexpected slowdowns. The threads finish at about the same time, but one seems to be slowing the other down (an IO bottleneck?). I'm getting faster times by running the functions sequentially rather than in parallel (324 secs vs 172 secs). Any ideas? Am I executing this wrong? I've tried multiprocessing, but starmap errors out because it can't pickle the file iterator (the output of read_fwf).
import pandas as pd
import queue, threading

testdf1 = pd.read_fwf(filepath_or_buffer='200k.dat', header=None, colspecs=wlist, names=nlist, dtype=object, na_values=[''], chunksize=1000)
testdf2 = pd.read_fwf(filepath_or_buffer='200k2.dat', header=None, colspecs=wlist, names=nlist, dtype=object, na_values=[''], chunksize=1000)
def tfuncth(df, varn, q, *args):
    # df: chunked reader from read_fwf; varn: {name: expression}; q: queue for results
    td = {}
    for key in varn.keys():
        td[key] = pd.Series(dtype='int64')
    for rdf in df:
        if args:  # optional filter expressions
            for arg in args:
                rdf = rdf.query(arg)
        for key in varn.keys():
            ecode = f'rdf.{varn[key]}.value_counts()'
            td[key] = pd.concat([td[key], eval(ecode)])
            td[key] = td[key].groupby(td[key].index).sum()
    for key in varn.keys():
        td[key] = (pd.DataFrame(td[key].reset_index())
                   .rename(columns={'index': 'Value', 0: 'Counts'})
                   .assign(Var=key, PCT=lambda x: round(x.Counts / x.Counts.sum() * 100, 2))
                   [['Var', 'Value', 'Counts', 'PCT']])
    q.put(td)
bands = {
    '1': 'A',
    '2': 'B',
    '3': 'C',
    '4': 'D',
    '5': 'E',
    '6': 'F',
    '7': 'G',
    '8': 'H',
    '9': 'I'
}
vdict = {
    'var1': 'e1270.str.slice(0,2)',
    'var2': 'e1270.str.slice(2,3)',
    'band': 'e7641.str.slice(0,1).replace(bands)'
}
my_q1 = queue.Queue()
my_q2 = queue.Queue()
thread1 = threading.Thread(target=tfuncth, args=(testdf1, vdict, my_q1, flter1))
thread2 = threading.Thread(target=tfuncth, args=(testdf2, vdict, my_q2))
thread1.start()
thread2.start()
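Collecting the output afterwards looks roughly like this (a minimal sketch using the names above; not shown in my original run):
thread1.join()
thread2.join()

counts1 = my_q1.get()   # dict of DataFrames keyed by 'var1', 'var2', 'band'
counts2 = my_q2.get()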
UPDATE:
After much reading, this is the conclusion I've come to. I'm sure it's an extremely simplified conclusion, so if someone knows otherwise, please inform me.
Pandas is not a fully thread-friendly package.
Apparently there's a package called 'dask' that is, and it replicates a lot of pandas functionality, so I'll be looking into that.
Python is not truly a multi-threading-friendly language in many cases.
Python is bound by its interpreter. In pure Python (CPython), code is interpreted and bound by the GIL, which allows only one thread to execute at a time.
Multiple threads can be spun off, but only non-CPU-bound work will actually run in parallel.
My code mixes IO and CPU. The simple IO is probably running in parallel but getting held up waiting on the processor for execution.
I plan to test this out by writing IO-only operations and attempting threading (see the small timing sketch after these notes).
Python can also be run on implementations that don't have a global interpreter lock (GIL) on threads.
Thus packages such as 'dask' can make better use of multi-threading.
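Something like this minimal timing sketch (illustrative only) is what I have in mind for that test: threads running a pure-Python CPU-bound loop should take roughly twice as long as a single run, while threads doing sleep-based IO should overlap.
import threading
import time

def cpu_bound():
    # pure-Python busy work; holds the GIL
    total = 0
    for i in range(10_000_000):
        total += i

def io_bound():
    # sleeping releases the GIL, so threads overlap
    time.sleep(2)

def timed_threads(target, n=2):
    threads = [threading.Thread(target=target) for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

print('cpu-bound x2:', timed_threads(cpu_bound))  # roughly 2x the single-run time
print('io-bound  x2:', timed_threads(io_bound))   # roughly the single-run time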
I did manage to get this to work and fix my problems by using the multiprocessing package. I ran into two issues:
1) the multiprocessing package is not compatible with Jupyter Notebook,
and
2) you can't pickle a handle to a pandas reader (multiprocessing pickles the objects passed to the processes).
I fixed 1 by coding outside the Notebook environment, and I fixed 2 by passing each process the arguments needed to open a chunked file, so each process starts its own chunked read.
After doing those two things, I was able to get a 60% increase in speed over sequential runs.
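A simplified sketch of that approach (not my exact code; wlist, nlist, vdict and bands are assumed to be defined at module level as in the snippets above, and the aggregation is trimmed down):
import multiprocessing as mp
import pandas as pd

def count_file(path, varn, colspecs, names):
    # each worker opens its own chunked reader, so no reader handle is ever pickled
    reader = pd.read_fwf(path, header=None, colspecs=colspecs, names=names,
                         dtype=object, na_values=[''], chunksize=1000)
    td = {key: pd.Series(dtype='int64') for key in varn}
    for chunk in reader:
        for key, expr in varn.items():
            counts = eval(f'chunk.{expr}.value_counts()')
            td[key] = pd.concat([td[key], counts]).groupby(level=0).sum()
    return td

if __name__ == '__main__':
    jobs = [('200k.dat', vdict, wlist, nlist),
            ('200k2.dat', vdict, wlist, nlist)]
    with mp.Pool(processes=2) as pool:
        results = pool.starmap(count_file, jobs)   # one dict of count Series per file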
Related
Short context:
Our application has a backend written in Python. It contains a couple of REST API endpoints and message-queue handling (RabbitMQ and Pika).
The reason we use Python is that it is a Data Science/AI project, so a lot of the data processing requires some DS knowledge.
Problem:
Some parts of our application have CPU-heavy calculations, and we are using multiprocessing to add parallelization.
However, we need to be careful, because each process starts a new interpreter and imports everything again.
The environment is Windows, so the process start method is "spawn".
Question:
What is the best way to manage this? The team is big, so there is a chance that someone will add a big object creation or a long-running function at module level that runs on application boot, stays in memory, and is then re-executed when the pool of processes is created.
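To illustrate the kind of convention I mean (function names here are just placeholders), the idea is to keep module import cheap and build heavy state only under the main guard, ideally after the pool exists so workers never re-create it:
import multiprocessing as mp

def cpu_heavy_task(item):
    # pure function: everything it needs arrives as an argument
    return item * item

def load_big_state():
    # expensive object creation; deliberately NOT done at module level
    return list(range(10_000_000))

if __name__ == '__main__':
    mp.set_start_method('spawn')           # explicit; matches the Windows default
    with mp.Pool(processes=4) as pool:     # workers re-import only the cheap module top
        results = pool.map(cpu_heavy_task, range(100))
    big_state = load_big_state()           # heavy state lives in the parent only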
I have a fairly large python package that interacts synchronously with a third party API server and carries out various operations with the server. Additionally, I am now also starting to collect some of the data for future analysis by pickling the JSON responses. After profiling several serialisation/database methods, using pickle was the fastest in my case. My basic pseudo-code is:
while True:
    do_existing_api_stuff()...

    # additional data pickling
    data = {'info': []}  # there are multiple keys in real version!
    if pickle_file_exists:
        data = unpickle_file()
    data['info'].append(new_data)
    pickle_data(data)
    if len(data['info']) >= 100:  # file size limited for read/write speed
        create_new_pickle_file()

    # intensive section...
    # move files from "wip" (Work In Progress) dir to "complete"
    if number_of_pickle_files >= 100:
        compress_pickle_files()  # with lzma
        move_compressed_files_to_another_dir()
My main issue is that the compressing and moving of the files takes several seconds to complete and is therefore slowing my main loop. What is the easiest way to call these functions in a non-blocking way without any major modifications to my existing code? I do not need any return from the function, however it will raise an error if anything fails. Another "nice to have" would be for the pickle.dump() to also be non-blocking. Again, I am not interested in the return beyond "did it raise an error?". I am aware that unpickle/append/re-pickle every loop is not particularly efficient, however it does avoid data loss when the api drops out due to connection issues, server errors, etc.
I have zero knowledge on threading, multiprocessing, asyncio, etc and after much searching, I am currently more confused than I was 2 days ago!
FYI, all of the file related functions are in a separate module/class, so that could be made asynchronous if necessary.
EDIT:
There may be multiple calls to the above functions, so I guess some sort of queuing will be required?
The easiest solution is probably the threading standard library package. This will allow you to spawn a thread to do the compression while your main loop continues.
There is almost certainly quite a bit of 'dead time' in your existing loop waiting for the API to respond, and conversely quite a bit of time spent doing the compression when you could usefully be making another API call. For this reason I'd suggest separating these two aspects. There are lots of good tutorials on threading, so I'll just describe a pattern you could aim for (a rough sketch follows the steps below):
Keep the API call and the pickling in the main loop, but add a step that passes the file path of each pickle to a queue after it is written.
Write a function that takes the queue as its input and works through the file paths, performing the compression.
Before starting the main loop, start a thread with the new function as its target.
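A minimal sketch of that pattern (assuming your compression/move functions from the pseudo-code above can be told which file to work on, and that failures should just be logged):
import logging
import queue
import threading

work_q = queue.Queue()

def compression_worker(q):
    # runs in a background thread; blocks until a file path arrives
    while True:
        path = q.get()
        if path is None:                                # sentinel to shut the worker down
            break
        try:
            compress_pickle_files(path)                 # your existing functions;
            move_compressed_files_to_another_dir(path)  # exact signatures assumed here
        except Exception:
            logging.exception('compression failed for %s', path)
        finally:
            q.task_done()

threading.Thread(target=compression_worker, args=(work_q,), daemon=True).start()

# in the main loop, instead of compressing inline:
# work_q.put(path_to_finished_pickle)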
This is a more general question about how to run "embarrassingly parallel" problems with Python "schedulers" in a science environment.
I have a code base that is a Python/Cython/C hybrid (for this example I'm using github.com/tardis-sn/tardis, but I have similar problems with other codes) and is internally OpenMP-parallelized. It provides a single function that takes a parameter dictionary and evaluates to an object within a few hundred seconds running on ~8 cores (result = fun(paramset, calibdata), where paramset is a dict, result is an object (basically a collection of pandas and numpy arrays), and calibdata is a pre-loaded pandas dataframe/object). It logs using the standard Python logging module.
I would like a Python framework that can easily evaluate ~10-100k parameter sets using fun on a SLURM/TORQUE/... cluster environment.
Ideally, this framework would automatically spawn workers (given availability of a few cores each) and distribute the parameter sets between the workers (different parameter sets take different amounts of time). It would be nice to see the state (in_queue, running, finished, failed) of each parameter set, as well as logs (whether it failed, finished or is still running).
It would be nice if it kept track of what is finished and what still needs to be done, so that I can restart if my scheduler task fails. It would also be nice if this seamlessly integrated into Jupyter notebook and ran locally for testing.
I have tried dask, but it does not seem to queue the tasks; it runs them all at once with client.map(fun, [list of parameter sets]). Maybe there are better tools, or maybe this is a very niche problem. It's also unclear to me what the difference between dask, joblib and ipyparallel is (having quickly tried all three of them at various stages).
Happy to give additional info if things are not clear.
UPDATE: dask seems to provide some of the functionality I require, but dealing with OpenMP-parallelized code in addition to dask is not straightforward; see the issue at https://github.com/dask/dask-jobqueue/issues/181
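For reference, my dask-jobqueue attempt looks roughly like this (a sketch only; the SLURMCluster parameters are placeholders for my cluster, and fun/calibdata/parameter_sets are as described above):
from dask_jobqueue import SLURMCluster
from dask.distributed import Client, as_completed

cluster = SLURMCluster(cores=8, processes=1, memory='16GB',
                       walltime='02:00:00', queue='compute')
cluster.scale(jobs=20)                     # ask SLURM for 20 worker jobs
client = Client(cluster)

calib_future = client.scatter(calibdata, broadcast=True)   # share calibration data once
futures = client.map(fun, parameter_sets, calibdata=calib_future)

for future in as_completed(futures):       # results arrive as workers finish
    result = future.result()
    # ...store or summarise the result here...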
My system is Windows 7. I wrote a Python program to do data analysis, and I use the multiprocessing library to achieve parallelism. When I open Windows PowerShell and type python MyScript.py, it starts using all the CPU cores. But after a while, the CPU (all cores) goes idle. If I hit Enter in the PowerShell window, all cores go back to full load. To be clear, the program is fine and has been tested; the problem is that the CPU cores go idle by themselves.
This happened not only on my office computer, which runs Windows 7 Pro, but also on my home desktop, which runs Windows 7 Ultimate.
The parallel part of the program is very simple:
import multiprocessing as mp
import pandas as pd

def myfunc(input):
    ## some operations based on a huge data set and a small data set ##
    # operation1: read in a piece of HugeData (query-based HDF5)
    # operation2: some operation based on HugeData and SmallData
    return output

# read in Small data
SmallData = pd.read_csv('data.csv')

if __name__ == '__main__':
    pool = mp.Pool()
    result = pool.map_async(myfunc, a_list_of_input)
    out = result.get()
My functions mainly do data manipulation using pandas.
There is nothing wrong with the program, because I've successfully finished it a couple of times. But I have to keep watching it and hit Enter whenever the cores go idle. The job takes a couple of hours, and I really can't keep watching it.
Is this a problem with the Windows system itself or with my program?
By the way, can all the cores access the same variable stored in memory? e.g. I have a data set mydata read into memory right before if __name__ == '__main__':. This data will be used in myfunc. All the cores should be able to access mydata at the same time, right?
Please help!
I was redirected to this question because I was facing a similar problem while using Python's multiprocessing library on Ubuntu. In my case, the processes do not resume when I hit Enter or anything like that; instead they start again abruptly after some time. My code is an iterative heuristic that uses multiprocessing in each of its iterations. I have to rerun the code after some iterations complete in order to get steady runtime performance. As the question was posted quite a while ago, did you ever find the actual reason behind it and a solution?
I confess to not understanding the subtleties of map_async, but I'm not sure whether you can use it like that (I can't seem to get it to work at all)...
I usually use the following recipe (a list comprehension of the calls I want doing):
import multiprocessing

procs = [multiprocessing.Process(target=f, args=()) for _ in range(4)]  # f is your worker function
for p in procs:
    p.start()
for p in procs:
    p.join()
It's simple and waits until the jobs are finished before continuing.
This works fine with pandas objects provided you're not doing modifications... (I think) copies of the objects are passed to each process, and if you perform mutations they will not propagate back and will be garbage collected.
You can use multiprocessing's version of a dict or list via the Manager class; this is useful for storing the result of each job (simply access the dict/list from within the function):
mgr = multiprocessing.Manager()
d = mgr.dict()
L = mgr.list()
and they will have shared access (the Manager handles the locking for you). It's worth mentioning that if you are appending to a list, the order will not necessarily be the same as the order of procs!
You may be able to do something similar to the Manager for pandas objects (locking objects in shared memory without copying), but I think that would be a non-trivial task...
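A minimal, self-contained sketch of the Manager pattern described above (the squaring worker is just a placeholder):
import multiprocessing

def worker(key, shared_d):
    # each process writes its result into the managed dict
    shared_d[key] = key ** 2

if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    d = mgr.dict()
    procs = [multiprocessing.Process(target=worker, args=(i, d)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(dict(d))   # {0: 0, 1: 1, 2: 4, 3: 9}, insertion order may vary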
I have an architecture that is basically a queue of URL addresses and some classes to process the content of those URLs. At the moment the code works well, but it is slow to sequentially pull a URL out of the queue, send it to the corresponding class, download the URL content and finally process it.
It would be faster and make better use of resources if, for example, it could read n URLs out of the queue and then spawn n processes or threads to handle the downloading and processing.
I would appreciate it if you could help me with these:
What packages could be used to solve this problem?
What other approaches can you think of?
You might want to look into Python's multiprocessing library. With multiprocessing.Pool, you can give it a function and a list, and it will call the function with each value of the list in parallel, using as many or as few processes as you specify.
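A minimal sketch of that idea (download_and_process and the url list are placeholders for your own code):
from multiprocessing import Pool

def download_and_process(url):
    # download the content for one url and run the matching processing class on it
    ...

if __name__ == '__main__':
    urls = [...]                       # the next n urls pulled from your queue
    with Pool(processes=8) as pool:
        results = pool.map(download_and_process, urls)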
If C calls are slow (downloading, database requests, other IO), you can just use threading.Thread.
If Python code is slow (frameworks, your own logic, non-accelerated parsers), you need to use a multiprocessing Pool or Process. That speeds up Python code too, but it is easier to get wrong and needs a deeper understanding of how it behaves in complex code (locks, semaphores).
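For the IO-bound case, a minimal threading sketch (download and the url batch are placeholders):
import threading

def download(url):
    # network-bound work: the GIL is released while waiting on the socket
    ...

urls = [...]   # placeholder for the batch of urls pulled from your queue
threads = [threading.Thread(target=download, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()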