Pandas pd.concat() in separate threads shows no speed-up - python

I am trying to use pandas in a multi-threaded environment. I have a few lists of pandas DataFrames (long lists of about 5000 frames, each roughly 300x2500) which I need to concatenate.
Since I have multiple lists, I want to run the concat for each list in its own thread (or use a thread pool, at least to get some parallel processing).
For some reason the processing in my multi-threaded setup takes exactly as long as single-threaded processing. I am wondering if I am doing something systematically wrong.
Here is my code snippet; I use ThreadPoolExecutor to implement the parallelization:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed

def func_merge(the_list, key):
    return (key, pd.concat(the_list))

def my_thread_starter():
    buffer = {
        'A': [df_1, ..., df_5000],
        'B': [df_a1, ..., df_a5000]
    }
    with ThreadPoolExecutor(max_workers=2) as executor:
        submitted = []
        for key, df_list in buffer.items():
            submitted.append(executor.submit(func_merge, df_list, key=key))
        for future in as_completed(submitted):
            out = future.result()
            # do something with the result
Is there a trick to using Pandas' concat in separate threads? I would at least expect my CPU utilization to spike when running more threads, but it doesn't seem to have any effect. Consequently, the time advantage is zero, too.
Does anyone have an idea what the problem could be?

Because of the Global Interpreter Lock (GIL), I'm not sure your code is actually leveraging multi-threading.
Basically, ThreadPoolExecutor is useful when the workload is not CPU-bound but I/O-bound, like making many web API calls at the same time.
It may have changed in Python 3.8, but I don't know how to interpret the "tasks which release the GIL" phrase in the documentation.
ProcessPoolExecutor could help, but because it requires serializing the input and output of the function, with a huge data volume it won't be faster.
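For reference, a minimal sketch of that ProcessPoolExecutor variant, reusing func_merge and buffer from the question; note that every DataFrame is pickled to the worker process and the concatenated result is pickled back, so with data this large the copying can easily outweigh the parallelism:
from concurrent.futures import ProcessPoolExecutor, as_completed

# on Windows/macOS, run this under an if __name__ == '__main__': guard,
# with func_merge defined at module level so it can be pickled
with ProcessPoolExecutor(max_workers=2) as executor:
    submitted = [executor.submit(func_merge, df_list, key=key)
                 for key, df_list in buffer.items()]
    for future in as_completed(submitted):
        key, merged = future.result()
        # do something with the merged frame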

Related

multiprocessing a function execution with python

I have a pandas DataFrame which consists of approximately 1M rows; it contains information entered by users. I wrote a function that validates whether the number entered by the user is correct or not. What I'm trying to do is execute the function on multiple processors to overcome the issue of doing heavy computation on a single processor. What I did is split my DataFrame into multiple chunks, where each chunk contains 50K rows, and then used the Python multiprocessing module to perform the processing on each chunk separately. The issue is that only the first process is starting and it's still using one processor instead of distributing the load across all processors. Here is the code I wrote:
import multiprocessing

pool = multiprocessing.Pool(processes=16)
r7 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[0], fields, dictionary))
r8 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[1], fields, dictionary))
print(r7.get())
print(r8.get())
pool.close()
pool.join()
I have attached a screenshot that shows the CPU usage when executing the above code.
Any advice on how I can overcome this issue?
I suggest you try this:
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

with ProcessPoolExecutor() as executor:
    # executor.map zips its iterables, so fields and dictionary are passed to every call
    for result in executor.map(validate.validate_phone_number,
                               has_phone_num_list,
                               repeat(fields),
                               repeat(dictionary)):
        pass  # process results here
By constructing the ProcessPoolExecutor with no parameters, most of your CPUs will be fully utilised. This is a very portable approach because there's no explicit assumption about the number of CPUs available. You could, of course, construct with max_workers=N where N is a low number to ensure that a minimal number of CPUs are used concurrently. You might do that if you're not too concerned about how long the overall process is going to take.
As suggested in this answer, you can use pandarallel to run Pandas' apply function in parallel. Unfortunately, as I cannot try your code, I am not able to pinpoint the problem. Did you try using fewer processes (like 8 instead of 16)?
Note that in some cases the parallelization doesn't work.
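For illustration, a minimal pandarallel sketch (assuming pandarallel is installed); the tiny DataFrame and the is_valid check are placeholders standing in for your real data and validate.validate_phone_number:
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)

def is_valid(value):
    # placeholder check standing in for the real validation logic
    return value.isdigit()

df = pd.DataFrame({'phone': ['12345', '67890', 'abc']})
# parallel_apply splits the Series across worker processes and reassembles the result
df['valid'] = df['phone'].parallel_apply(is_valid)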

Python 3.8 Convert for loop to multiprocessing/multithreading

I am new to multiprocessing, and I would really appreciate it if someone could guide/help me here. I have the following for loop which gets some data from two functions. The code looks like this:
for a in accounts:
    dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
    group_users[a['Email']] = get_group_users(a['Id'], adConn)

print(f"Users part of DL - {dl_users}")
print(f"Users part of groups - {group_users}")
adConn.unbind()
This works fine and gets all the results, but recently I have noticed it takes a lot of time to get the list of users, i.e. dl_users and group_users. It takes almost 14-15 mins to complete. I am looking for ways to speed this up and would like to convert this for loop to multiprocessing. get_group_users and get_dl_users make calls to LDAP, so I am not 100% sure if I should be converting this to multiprocessing or multithreading. Any suggestion would be a big help.
As mentioned in the comments, multithreading is appropriate for I/O operations (reading/writing from/to files, sending HTTP requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (such as transforming data or making calculations). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate them internally and profile which of the two really needs optimisation, since both multiprocessing and multithreading introduce overhead that might not be worth adding.
That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).
Multiprocessing
from multiprocessing import Pool

# Pick the amount of processes that works best for you
processes = 4
with Pool(processes) as pool:
    processed = pool.map(your_func, your_data)
Where your_func is a function to apply to each element of your_data, which is an iterable. If you need to provide some other parameters to the callable, you can bind them with functools.partial (a lambda would not work here, because multiprocessing needs to pickle the callable):
from functools import partial

processed = pool.map(partial(your_func, some_kwarg="some value"), your_data)
Multithreading
The API for multithreading is very similar:
from concurrent.futures import ThreadPoolExecutor

# Pick the amount of workers that works best for you.
# Most likely equal to the amount of threads of your machine.
workers = 4
with ThreadPoolExecutor(workers) as pool:
    processed = pool.map(your_func, your_data)
If you need some attribute of the items rather than the items themselves, and want to avoid holding an extra copy of your_data in memory, you can pass a generator:
processed = pool.map(your_func, (account["Email"] for account in accounts))
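Applied to the loop in your question, a minimal thread-based sketch could look like this; it assumes get_dl_users, get_group_users, accounts and adConn exactly as shown there, and if adConn is not thread-safe you would open one connection per worker instead:
from concurrent.futures import ThreadPoolExecutor

def fetch(account):
    # one LDAP lookup pair per account; the calls are I/O-bound, so threads fit well
    email = account['Email']
    return email, get_dl_users(email, adConn), get_group_users(account['Id'], adConn)

dl_users, group_users = {}, {}
with ThreadPoolExecutor(max_workers=8) as pool:
    for email, dl, grp in pool.map(fetch, accounts):
        dl_users[email] = dl
        group_users[email] = grp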

Slow parallel code in Python compared to sequential version, why is it slower?

I have created two versions of a program to add the numbers of an array: one version uses concurrent programming and the other is sequential. The problem I have is that I cannot make the parallel program return a faster processing time. I am currently using Windows 8 and Python 3.x. My code is:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
import random
import time

def fun(v):
    s = 0
    for i in range(0, len(v)):
        s = s + v[i]
    return s

def sumSeq(v):
    s = 0
    start = time.time()
    for i in range(0, len(v)):
        s = s + v[i]
    start1 = time.time()
    print("time seq ", start1 - start, " sum = ", s)

def main():
    workers = 4
    vector = [random.randint(1, 101) for _ in range(1000000)]
    sumSeq(vector)
    dim = int(len(vector) / (workers * 10))
    s = 0
    chunks = (vector[k:k + dim] for k in range(0, len(vector), dim))
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fun, chunk) for chunk in chunks]
        start1 = time.time()
        for future in as_completed(futures):
            s = s + future.result()
        print("concurrent time ", start1 - start, " sum = ", s)
The problem is that I get the following answer:
time seq 0.048101186752319336 sum = 50998349
concurrent time 0.059157371520996094 sum = 50998349
I cannot make the concurrent version run faster. I have changed the chunk size and set the number of max workers to None, but nothing seems to work. What am I doing wrong? I have read that the problem could be the creation of the processes, so how can I fix that in a simple way?
A long-standing weakness of Python is that it can't run pure-Python code simultaneously in multiple threads; the keyword to search for is "GIL" or "global interpreter lock".
Ways around this:
This only applies to CPU-heavy operations, like addition; I/O operations and the like can happily run in parallel. You can continue to run Python code in one thread while others are waiting for disk, network, database etc.
This only applies to pure-Python code; several computation-heavy extension modules release the GIL and let code in other threads run. Things like matrix operations in numpy or image operations can thus run in threads alongside a CPU-heavy Python thread.
It applies to threads (ThreadPoolExecutor) specifically; the ProcessPoolExecutor will work the way you expect, but it is more isolated, so the program will spend more time marshalling and demarshalling the data and intermediate results (a minimal sketch follows below).
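For what it's worth, a minimal sketch of that ProcessPoolExecutor swap, reusing fun() from your code; the chunks are materialised as lists because they are pickled to the worker processes, and on Windows the call must sit under an if __name__ == '__main__': guard with fun defined at module level:
from concurrent.futures import ProcessPoolExecutor, as_completed

def parallel_sum(vector, workers=4):
    dim = len(vector) // (workers * 10)
    # slices are copied and pickled to the workers, which is part of the overhead
    chunks = [vector[k:k + dim] for k in range(0, len(vector), dim)]
    total = 0
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fun, chunk) for chunk in chunks]
        for future in as_completed(futures):
            total += future.result()
    return total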
I should also note that it looks like your example isn't really well suited to this test:
It's too small; with a total time of 0.05s, a lot of that is going to be the setup and tear-down of the parallel run. To test this properly, you need workloads of at least several seconds, ideally a couple of minutes.
The sequential version visits the array in sequence; things like CPU cache are optimised for this sort of access. The parallel version will access the chunks at random, potentially causing cache evictions and the like.

Python Pandas Threading reads read_fwf

Newbie-ish Python/pandas user here. I've been playing with using the chunksize arg in read_fwf and iterating value_counts of variables. I wrote a function that takes args such as the file iterator and the variables to parse and count. I was hoping to parallelize this function and be able to read 2 files at the same time into the same function.
It does appear to work... However, I'm getting unexpected slowdowns. The threads finish at the same time, but one seems to be slowing the other down (IO bottleneck?). I'm getting faster times by running the functions sequentially rather than in parallel (324 secs vs 172 secs). Ideas? Am I executing this wrong? I've tried multiprocessing, but starmap errors because it can't pickle the file iterator (the output of read_fwf).
import queue
import threading
import pandas as pd

testdf1 = pd.read_fwf(filepath_or_buffer='200k.dat', header=None, colspecs=wlist, names=nlist,
                      dtype=object, na_values=[''], chunksize=1000)
testdf2 = pd.read_fwf(filepath_or_buffer='200k2.dat', header=None, colspecs=wlist, names=nlist,
                      dtype=object, na_values=[''], chunksize=1000)

def tfuncth(df, varn, q, *args):
    td = {}
    for key in varn.keys():
        td[key] = pd.Series()
    for rdf in df:
        if args is not None:
            for arg in args:
                rdf = eval(f"rdf.query(\"{arg}\")")
        for key in varn.keys():
            ecode = f'rdf.{varn[key]}.value_counts()'
            td[key] = pd.concat([td[key], eval(ecode)])
            td[key] = td[key].groupby(td[key].index).sum()
    for key in varn.keys():
        td[key] = pd.DataFrame(td[key].reset_index()).rename(columns={'index': 'Value', 0: 'Counts'}).assign(
            Var=key, PCT=lambda x: round(x.Counts / x.Counts.sum() * 100, 2))[['Var', 'Value', 'Counts', 'PCT']]
    q.put(td)

bands = {
    '1': 'A',
    '2': 'B',
    '3': 'C',
    '4': 'D',
    '5': 'E',
    '6': 'F',
    '7': 'G',
    '8': 'H',
    '9': 'I'
}

vdict = {
    'var1': 'e1270.str.slice(0,2)',
    'var2': 'e1270.str.slice(2,3)',
    'band': 'e7641.str.slice(0,1).replace(bands)'
}

my_q1 = queue.Queue()
my_q2 = queue.Queue()

thread1 = threading.Thread(target=tfuncth, args=(testdf1, vdict, my_q1, flter1))
thread2 = threading.Thread(target=tfuncth, args=(testdf2, vdict, my_q2))

thread1.start()
thread2.start()
UPDATE:
After much reading, this is the conclusion I've come to. It is an extremely simplified conclusion, I'm sure, so if someone knows otherwise please inform me:
Pandas is not a fully multi-thread friendly package.
Apparently there's a package called 'dask' that is, and it replicates a lot of pandas functions. So I'll be looking into that.
Python is not truly a multi-threading compatible language in many cases.
Python is bound by its interpreter. Pure Python is interpreted and bound by the GIL, so only one thread can execute at a time.
Multiple threads can be spun off, but they will only be able to parallelize non-CPU-bound functions.
My code is a mix of IO and CPU work. The simple IO is probably running in parallel but getting held up waiting on the processor for execution.
I plan to test this out by writing IO-only operations and attempting threading.
There are Python implementations that don't have a global interpreter lock (GIL) on threads.
Thus packages such as 'dask' can utilize multi-threading.
I did manage to get this to work and fix my problems by using the multiprocessing package. I ran into two issues:
1) the multiprocessing package is not compatible with Jupyter Notebook
and
2) you can't pickle a handle to a pandas reader (multiprocessing pickles the objects passed to the processes).
I fixed 1 by coding outside the Notebook environment, and I fixed 2 by passing each process the arguments needed to open the chunked file, so each process starts its own chunked read.
After doing those two things I was able to get a 60% increase in speed over sequential runs.
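For illustration, here is a minimal sketch of that fix; COLSPECS and NAMES are hypothetical placeholders for the wlist/nlist used above, and the counting loop is a stripped-down stand-in for tfuncth:
import multiprocessing as mp
import pandas as pd

# hypothetical column specs standing in for wlist/nlist
COLSPECS = [(0, 2), (2, 3)]
NAMES = ['var1', 'var2']

def count_file(path):
    # each process opens its own chunked reader, so nothing unpicklable
    # crosses the process boundary; only the file path does
    reader = pd.read_fwf(path, header=None, colspecs=COLSPECS, names=NAMES,
                         dtype=object, na_values=[''], chunksize=1000)
    counts = {name: pd.Series(dtype='int64') for name in NAMES}
    for chunk in reader:
        for name in NAMES:
            counts[name] = counts[name].add(chunk[name].value_counts(), fill_value=0)
    return counts

if __name__ == '__main__':
    with mp.Pool(processes=2) as pool:
        results = pool.map(count_file, ['200k.dat', '200k2.dat'])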

Slower Execution with Python multiprocessing

I may be in a little over my head here, but I am working on a little bioinformatics project in Python. I am trying to parallelize a program that analyzes a large dictionary of sets of strings (~2-3GB in RAM). I find that the multiprocessing version is faster when I have smaller dictionaries, but it is of little benefit and mostly slower with the large ones. My first theory was that running out of memory slowed everything down and the bottleneck was swapping into virtual memory. However, I ran the program on a cluster with 4*48GB of RAM and the same slowdown occurred. My second theory is that access to certain data was being locked. If one thread is trying to access a reference currently being accessed in another thread, will that thread have to wait? I have tried creating copies of the dictionaries I want to manipulate, but that seems terribly inefficient. What else could be causing my problems?
My multiprocessing method is below:
import math
import multiprocessing
from copy import copy

def worker(seqDict, oQueue):
    # do stuff with the given partial dictionary
    oQueue.put(seqDict)

oQueue = multiprocessing.Queue()
chunksize = int(math.ceil(len(sdict) / 4))  # 4 cores
inDict = {}
i = 0
dicts = list()
for key in sdict.keys():
    i += 1
    if len(sdict[key]) > 0:
        inDict[key] = sdict[key]
    if i % chunksize == 0 or i == len(sdict.keys()):
        print(str(len(inDict.keys())) + ", size")
        dicts.append(copy(inDict))
        inDict.clear()

for pdict in dicts:
    p = multiprocessing.Process(target=worker, args=(pdict, oQueue))
    p.start()

finalDict = {}
for i in range(4):
    finalDict.update(oQueue.get())

return finalDict
As I said in the comments, and as Kinch said in his answer, everything passed through to a subprocess has to be pickled and unpickled to duplicate it in the local context of the spawned process. If you use multiprocessing.Manager().dict() for sdict (thereby allowing processes to share the same data through a server process that proxies the objects created on it) and spawn the processes with slice indices into that shared sdict, that should cut down on the serialize/deserialize sequence involved in spawning the child processes. You still might hit bottlenecks with that, though, in the server communication step of working with the shared objects. If so, you'll have to look at simplifying your data so you can use true shared memory with multiprocessing.Array or multiprocessing.Value, or look at multiprocessing.sharedctypes to create custom data structures to share between your processes.
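For illustration, a minimal sketch of that Manager approach; the toy sets and the per-key work are placeholders standing in for the real sequence data and analysis:
import multiprocessing

def worker(shared, keys, out_q):
    # each process only touches its assigned keys; shared is a proxy to the manager's dict
    out_q.put({k: len(shared[k]) for k in keys})

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared = manager.dict({'seqA': {'AT', 'CG'}, 'seqB': {'GC'}})
    out_q = multiprocessing.Queue()
    key_slices = [['seqA'], ['seqB']]
    procs = [multiprocessing.Process(target=worker, args=(shared, ks, out_q))
             for ks in key_slices]
    for p in procs:
        p.start()
    results = [out_q.get() for _ in procs]
    for p in procs:
        p.join()
    print(results)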
Seems like the data from the "large dictionary of sets of strings" could be reformatted into a something that could be stored in a file or string, allowing you to use the mmap module to share it among all the processes. Each process might incur some startup overhead if it needs to convert the data back into some other more preferable form, but that could be minimized by passing each process something indicating what subset of the whole dataset in shared memory they should do their work on and only reconstitute the part required by that process.
All data passed through the queue will be serialized and deserialized using pickle. I would guess this could be a bottleneck if you pass a lot of data around.
You could reduce the amount of data, make use of shared memory, write a multi-threaded version in a C extension, or try a multi-threaded version of this with a thread-safe implementation of Python (maybe Jython or PyPy; I don't know).
Oh and by the way: You are using multiprocessing and not multithreading.
