Unable to put() + get() larger DataFrames in a Queue - python

The code below simulates a problem with multiprocessing I am facing.
There are two functions - f1 and f2 - which return (pandas) dataframes with n rows to a calling function run_fns(n). The two functions are to be run in parallel.
The code works fine for smaller values of n (e.g. n <= 700), but freezes for larger values (say n >= 7000).
I have tried constructing the queues with Queue([maxsize]), using various maxsize values including the default, 0, -1 and many other numbers both small and large, with no change in this behaviour.
Any solutions, workarounds or alternate approaches would be very welcome. I also have a secondary question: do I really need to include
if __name__ == "__main__":
somewhere? If so where?
The code:
f1 returns n rows and 3 columns, f2 returns n rows and 5 columns. The dataframes are built with randomly generated integers.
import numpy as np
import pandas as pd
from multiprocessing import Process, Queue

def run_fns(n):
    """Run p1 and p2 in parallel, and get the returned dataframes."""
    q1 = Queue()
    q2 = Queue()
    p1 = Process(target=f1, args=(n, q1))
    p2 = Process(target=f2, args=(n, q2))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    df1 = q1.get()
    df2 = q2.get()
    return df1, df2

def f1(n, q):
    """Create a dataframe with n rows and 3 columns."""
    df = pd.DataFrame(np.random.randint(n * 3, size=(n, 3)))
    q.put(df)

def f2(n, q):
    """Create a dataframe with n rows and 5 columns."""
    df = pd.DataFrame(np.random.randint(n * 5, size=(n, 5)))
    q.put(df)

You are facing a typical issue which is documented in the multiprocessing programming guidelines.
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate.
You need to make sure you get the data before joining the processes.
# start the processes
p1.start()
p2.start()
# drain the queues
df1 = q1.get()
df2 = q2.get()
# then join the processes
p1.join()
p2.join()
return df1, df2
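To the secondary question: yes, the code that starts processes should be protected by an if __name__ == "__main__": guard at module level, because with start methods such as spawn (the default on Windows) each child re-imports the module and would otherwise try to spawn processes of its own. Putting both points together, a corrected calling side might look like the sketch below (the call with n=7000 at the end is just an example):

import numpy as np
import pandas as pd
from multiprocessing import Process, Queue

def run_fns(n):
    """Run f1 and f2 in parallel, and get the returned dataframes."""
    q1 = Queue()
    q2 = Queue()
    p1 = Process(target=f1, args=(n, q1))
    p2 = Process(target=f2, args=(n, q2))
    p1.start()
    p2.start()
    df1 = q1.get()   # drain the queues first ...
    df2 = q2.get()
    p1.join()        # ... then join the processes
    p2.join()
    return df1, df2

def f1(n, q):
    q.put(pd.DataFrame(np.random.randint(n * 3, size=(n, 3))))

def f2(n, q):
    q.put(pd.DataFrame(np.random.randint(n * 5, size=(n, 5))))

if __name__ == "__main__":
    df1, df2 = run_fns(7000)
    print(df1.shape, df2.shape)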

Related

Python does not keep data when multiprocessing

The script below does the following:
1) makes a data frame with 200 rows
2) sorts the df into a list of objects, using multiprocessing so that each core handles a quarter of the df into its own list
3) sticks the lists together into one big list and prints it
The problem: the list ends up empty. It is almost as if the get_car_terms function was never called in each process, yet there is no error message.
import random
import psutil
import pandas as pd
import multiprocessing as mp

class car_term():  # object to go into list
    def __init__(self, capcode, miles, months, cmprice, fmprice):
        self.capcode = capcode
        self.months = months
        self.miles = miles
        self.cmprice = cmprice
        self.fmprice = fmprice

df_final = pd.DataFrame({'capcode':[],'months':[],'mileage':[],'cm':[],'fm':[]})
for i in range(200):  # making dataframe to get data from
    df_final.append(pd.DataFrame({'capcode':[i],'months':[random.randint(1, 12)],'mileage':[random.randint(0, 10000)],'cm':[random.randint(5, 700)],'fm':[random.randint(15, 710)]}))

all_deals = []  # this is the list i want to put my objects into

def get_car_terms(data, mdb1, all_deals1):
    all_deals1.append(car_term(mdb1['capcode'][data], mdb1['mileage'][data], mdb1['months'][data], mdb1['cm'][data], mdb1['fm'][data]))  # i make the objects with the dataframe like this

all_deals1a = []  # individual lists for each processor
all_deals2a = []
all_deals3a = []
all_deals4a = []

print("yo1")

if __name__ == "__main__":
    n_cpus = psutil.cpu_count()  # number of cpus
    print(n_cpus)  # i have 4 cpus
    if df_final.shape[0] % n_cpus == 0:
        for i in range(int(df_final.shape[0] / n_cpus)):
            ############# the problem is the get_car_terms function doesn't run below
            p1 = mp.Proccess(target=get_car_terms, args=(i + ((df_final.shape[0] / n_cpus) * 1), df_final, all_deals1a))  # each cpu sorts a quarter of the dataframe into my objects list
            p2 = mp.Proccess(target=get_car_terms, args=(i + ((df_final.shape[0] / n_cpus) * 2), df_final, all_deals2a))
            p3 = mp.Proccess(target=get_car_terms, args=(i + ((df_final.shape[0] / n_cpus) * 3), df_final, all_deals3a))
            p4 = mp.Proccess(target=get_car_terms, args=(i + ((df_final.shape[0] / n_cpus) * 4), df_final, all_deals4a))
            p1.start()
            p2.start()
            p3.start()
            p4.start()
            p1.end()
            p2.end()
            p3.end()
            p4.end()
        all_deals.append(all_deals1a)  # group lists together
        all_deals.append(all_deals2a)
        all_deals.append(all_deals3a)
        all_deals.append(all_deals4a)
        print("we did it")
        print(len(all_deals))  # this should have 200 of my objects in it... it doesn't
        for i in all_deals:
            print(i.capcode)
There are a few problems here. Process objects have no .end() method (and the class is spelled mp.Process, not mp.Proccess); you wait for a process to finish with .join(). More importantly, a plain Python list is not shared between processes: each child gets its own copy of all_deals1a and friends, so appends made in the children never reach the parent, which is why your list stays empty. Pass results back through a multiprocessing.Queue, a Manager().list(), or simply use a Pool and collect the return values.
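A minimal sketch of the Pool variant (the names CarTerm and make_df are hypothetical stand-ins for the question's class and DataFrame-building loop):

import random
import pandas as pd
import multiprocessing as mp

class CarTerm:
    def __init__(self, capcode, miles, months, cmprice, fmprice):
        self.capcode = capcode
        self.miles = miles
        self.months = months
        self.cmprice = cmprice
        self.fmprice = fmprice

def make_df(n=200):
    return pd.DataFrame({
        'capcode': range(n),
        'months': [random.randint(1, 12) for _ in range(n)],
        'mileage': [random.randint(0, 10000) for _ in range(n)],
        'cm': [random.randint(5, 700) for _ in range(n)],
        'fm': [random.randint(15, 710) for _ in range(n)],
    })

def get_car_terms(row):
    # row is a plain tuple (capcode, months, mileage, cm, fm)
    capcode, months, mileage, cm, fm = row
    return CarTerm(capcode, mileage, months, cm, fm)

if __name__ == "__main__":
    df = make_df()
    rows = df.itertuples(index=False, name=None)  # plain tuples pickle cheaply
    with mp.Pool() as pool:
        all_deals = pool.map(get_car_terms, rows)  # results come back to the parent
    print(len(all_deals))        # 200
    print(all_deals[0].capcode)  # 0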

Python parallel data processing

We have a dataset which has approx 1.5MM rows. I would like to process it in parallel. The main purpose of the code is to look up master information and enrich the 1.5MM rows. The master is a two-column dataset with roughly 25000 rows. However, I am unable to make the multiprocessing work and test its scalability properly. Can someone please help? A cut-down version of the code is as follows:
import pandas
from multiprocessing import Pool

def work(data):
    mylist = []
    # Business Logic
    return mylist.append(data)

if __name__ == '__main__':
    data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv', header='infer')
    print('Source Data :', data_df)
    agents = 2
    chunksize = 2
    with Pool(processes=agents) as pool:
        result = pool.map(func=work, iterable=data_df, chunksize=20)
    pool.close()
    pool.join()
    print('Result :', result)
The work method will contain the business logic, and I would like to pass partitioned chunks of data_df into work to enable parallel processing. The sample data is as follows:
CUSTOMER_ID,PRODUCT_ID,SALE_QTY
641996,115089,2
1078894,78144,1
1078894,121664,1
1078894,26467,1
457347,59359,2
1006860,36329,2
1006860,65237,2
1006860,121189,2
825486,78151,2
825486,78151,2
123445,115089,4
Ideally I would like to process 6 rows in each partition.
Please help.
Thanks and Regards
Bala
First, work is returning the output of mylist.append(data), which is None. I assume (and if not, I suggest) you want to return a processed DataFrame.
To distribute the load, you could use numpy.array_split to split the large DataFrame into a list of 6-row DataFrames, which are then processed by work.
import pandas
import math
import numpy as np
from multiprocessing import Pool

def work(data):
    # Business Logic
    return data  # Return it as a DataFrame

if __name__ == '__main__':
    data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv', header='infer')
    print('Source Data :', data_df)
    agents = 2
    rows_per_workload = 6
    num_loads = math.ceil(data_df.shape[0] / float(rows_per_workload))
    split_df = np.array_split(data_df, num_loads)  # A list of DataFrames
    with Pool(processes=agents) as pool:
        result = pool.map(func=work, iterable=split_df)
    result = pandas.concat(result)  # Stitch them back together
    print('Result :', result)
My best recommendation is to use the chunksize parameter of read_csv (Docs) and iterate over the chunks. This way you won't exhaust your RAM trying to load everything at once, and if you want you can, for example, use threads to speed up the processing.
for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    ...  # process each chunk here
I'm not sure if this answers your specific question, but I hope it helps.
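Putting the two answers together, a rough sketch (assuming the same CSV path as the question and a placeholder work function that simply returns its chunk) might look like this:

import pandas as pd
from multiprocessing import Pool

def work(chunk):
    # Business logic goes here, e.g. enrich the chunk from the master table
    return chunk

if __name__ == '__main__':
    chunks = pd.read_csv('D:\\retail\\customer_sales_parallel.csv', chunksize=6)
    with Pool(processes=2) as pool:
        result = pd.concat(pool.imap(work, chunks))
    print('Result :', result)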

pandas apply first attempt on threading

I have a sample df of 1000 rows, read from Excel, which looks like this:
exFcode
0 38907030
1 47036870
2 54060696
3 38907039
4 100811680
(...)
I need to assign the number of articles to each code. To do this I connect to an API for each code (the API only allows 1 code per request) and return the value into a second column of the df. Currently I do it this way:
def getArticles(code):
    r = requests.get(API_link % code).content
    jsonized = json.loads(r.decode("utf-8"))
    try:
        num_articles = jsonized["TotalRecords"]
    except:
        return 'not found'
    return num_articles

df['articles'] = df["exFcode"].apply(lambda row: getArticles(row))
It does the job, but it's slow: it performs each request one by one. For 1000 codes it takes around 10 minutes, and very often I have to deal with files of 50k rows and more...
I was thinking about how to do it more efficiently. I thought I could divide the df into 2 parts and then process each part in a separate thread. It's my first attempt at using threading in my programs... So I have created two additional functions, wrapper and main.
def wrapper(df):
    df['articles'] = df["exFcode"].apply(lambda row: getArticles(row))
    return df

def main(df):
    # split df into two even halves
    half = len(df) // 2
    df1 = df.iloc[:half]
    df2 = df.iloc[half:]
    t1 = Thread(target=wrapper, args=(df1,))
    t2 = Thread(target=wrapper, args=(df2,))
    t1.start()
    t2.start()
    print('completed')
However, when I execute the function main(df), nothing happens. Am I completely misunderstanding the concept of threading? Any other idea how to make it more efficient?
You print 'completed' as soon as the threads have started, but you're missing the join calls that wait for them to complete.
t1.start()
t2.start()
print('threads started')
t1.join()
t2.join()
print('really completed')
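For completeness, here is a sketch of main with the joins added and the two enriched halves stitched back together (it reuses wrapper and getArticles from the question; the .copy() calls are my addition, to avoid pandas' SettingWithCopyWarning when the threads assign a new column to a slice):

import pandas as pd
from threading import Thread

def main(df):
    half = len(df) // 2
    df1 = df.iloc[:half].copy()
    df2 = df.iloc[half:].copy()
    t1 = Thread(target=wrapper, args=(df1,))
    t2 = Thread(target=wrapper, args=(df2,))
    t1.start()
    t2.start()
    t1.join()   # wait for both halves to finish
    t2.join()
    return pd.concat([df1, df2])  # enriched DataFrame with the 'articles' column

Because the time is spent waiting on HTTP responses, the GIL is released during the requests, so the two threads really do run the API calls concurrently.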

How to reduce time for multiprocessing in python

I am trying to use multiprocessing in Python to reduce computation time, but it seems like after adding multiprocessing the overall speed of the computation decreased significantly. I have created 4 different processes and split the DataFrame into 4 different dataframes, one as input to each process. After timing each process, it seems like the overhead cost is significant, and I was wondering if there is a way to reduce these overhead costs.
I am using Windows 7 and Python 3.5, and my machine has 8 cores.
from functools import partial
import datetime as dt
import multiprocessing
import numpy as np
import pandas as pd

def doSomething(args, dataPassed):
    ...  # processing data, and calculating outputs

def parallelize_dataframe(df, nestedApply):
    df_split = np.array_split(df, 4)
    pool = multiprocessing.Pool(4)
    df = pool.map(nestedApply, df_split)
    print('finished with Simulation')
    time = float((dt.datetime.now() - startTime).total_seconds())
    pool.close()
    pool.join()

def nestedApply(df):
    func2 = partial(doSomething, args=())
    res = df.apply(func2, axis=1)
    res = ...  # [output Tables]
    return res

if __name__ == '__main__':
    data = pd.read_sql_query(query, conn)
    parallelize_dataframe(data, nestedApply)
I would suggest using queues instead of providing your DataFrame as chunks. Copying each chunk takes a lot of resources and quite some time, and you could run out of memory if your DataFrame is really big. Using queues you can benefit from the fast iterators in pandas.
Here is my approach. The overhead shrinks as the workers become more complex. Unfortunately, my workers are far too simple to really show that, but sleep simulates complexity a bit.
import pandas as pd
import multiprocessing as mp
import numpy as np
import time

def worker(in_queue, out_queue):
    for row in iter(in_queue.get, 'STOP'):
        value = (row[1] * row[2] / row[3]) + row[4]
        time.sleep(0.1)
        out_queue.put((row[0], value))

if __name__ == "__main__":
    # fill a DataFrame
    df = pd.DataFrame(np.random.randn(int(1e5), 4), columns=list('ABCD'))
    in_queue = mp.Queue()
    out_queue = mp.Queue()
    # set up workers
    numProc = 2
    process = [mp.Process(target=worker,
                          args=(in_queue, out_queue)) for x in range(numProc)]
    # run processes
    for p in process:
        p.start()
    # iterator over rows
    it = df.itertuples()
    # fill queue and get data
    # code fills the queue until a new element is available in the output
    # fill blocks if no slot is available in the in_queue
    for i in range(len(df)):
        while out_queue.empty():
            # fill the queue
            try:
                row = next(it)
                in_queue.put((row[0], row[1], row[2], row[3], row[4]), block=True)  # row = (index, A, B, C, D) tuple
            except StopIteration:
                break
        row_data = out_queue.get()
        df.loc[row_data[0], "Result"] = row_data[1]
    # signal the processes to stop
    for p in process:
        in_queue.put('STOP')
    # wait for processes to finish
    for p in process:
        p.join()
Using numProc = 2 it takes 50sec per loop, with numProc = 4 it is twice as fast.
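A further knob, not covered in the answer above but often enough on its own: pool.map takes a chunksize argument, which batches many rows into each inter-process message and so amortises the pickling/IPC overhead. A hypothetical sketch with a stand-in per-row function:

import multiprocessing as mp
import numpy as np
import pandas as pd

def row_work(row):
    # stand-in for the real per-row computation
    index, a, b, c, d = row
    return index, (a * b / c) + d

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randn(int(1e5), 4), columns=list('ABCD'))
    with mp.Pool(processes=4) as pool:
        # name=None yields plain tuples, which pickle cheaply
        results = pool.map(row_work, df.itertuples(name=None), chunksize=1000)
    df['Result'] = pd.Series(dict(results))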

How to use pass by reference for data frame in python pandas

Manager Code..
import pandas as pd
import multiprocessing
import time
import MyDF
import WORKER

class Manager():
    'Common base class for all Manager'
    def __init__(self, Name):
        print('Hello Manager..')
        self.MDF = MyDF.MYDF(Name)
        self.Arg = self.MDF.display()
        self.WK = WORKER.Worker(self.Arg)

MGR = Manager('event_wise_count')

if __name__ == '__main__':
    jobs = []
    x = 5
    for i in range(5):
        x = 10 * i
        print('Manager : ', i)
        p = multiprocessing.Process(target=MGR.WK.DISPLAY)
        jobs.append(p)
        p.start()
        time.sleep(x)
worker code...
import pandas as pd
import time

class Worker():
    'Common base class for all Workers'
    empCount = 0
    def __init__(self, DF):
        self.DF = DF
        print('Hello worker..', self.DF.count())
    def DISPLAY(self):
        self.DF = self.DF.head(10)
        return self.DF
Hi, I am trying to do multiprocessing, and I want to share a DataFrame reference with all sub-processes.
So, in the Manager class above, I am spawning 5 processes, where each sub-process is required to use the DataFrame of the Worker class, expecting that each sub-process will share a reference to the Worker's DataFrame. But unfortunately it is not happening.
Any answer is welcome.
Thanks in advance :)
This answer suggests using Namespaces to share large objects between processes by reference.
Here's an example of an application where multiple processes can read from the same DataFrame. (Note: you can't run this on an interactive console -- save this as a program.py and run it.)
import numpy as np
import pandas as pd
from multiprocessing import Manager, Pool

def get_slice(namespace, column, rows):
    '''Return the first `rows` rows from column `column` in namespace.data'''
    return namespace.data[column].head(rows)

if __name__ == '__main__':
    # Create a namespace and place our DataFrame in it
    manager = Manager()
    namespace = manager.Namespace()
    namespace.data = pd.DataFrame(np.random.rand(1000, 10))
    # Create a pool of worker processes
    pool = Pool(processes=2)
    for column in namespace.data.columns:
        # Each worker can access the same DataFrame object
        result = pool.apply_async(get_slice, [namespace, column, 5])
        print(result._job, column, result.get().tolist())
While reading from the DataFrame is perfectly fine, it gets a little tricky if you want to write back to it. It's better to just stick to immutable objects unless you really need large writable objects.
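As a rough illustration of why writing back is tricky (my sketch, not part of the original answer): reading namespace.data hands you a pickled copy, so in-place changes to that copy never reach the shared object; you have to reassign the attribute to publish a change.

df = namespace.data        # a pickled copy travels over to this process
df['doubled'] = df[0] * 2  # modifies only the local copy
namespace.data = df        # reassigning the attribute pushes it back to the manager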
Sorry about the necromancy.
The issue is that the workers must have unique DataFrame instances. Almost all attempts to slice, or chunk, a Pandas DataFrame will result in aliases to the original DataFrame. These aliases will still result in resource contention between workers.
There are two things that should improve performance. The first is to make sure that you are working with Pandas idiomatically: iterating row by row, with iloc or iterrows, fights against the design of DataFrames. Using a new-style class object and applying a method with apply is one option.
import numpy as np
import pandas as pd

def get_example_df():
    return pd.DataFrame(np.random.randint(10, 100, size=(5, 5)))

class Math(object):
    def __init__(self):
        self.summation = 0

    def operation(self, row):
        row_result = 0
        for elem in row:
            if elem % 2:
                row_result += elem
            else:
                row_result += 1
        self.summation += row_result
        if row_result % 2:
            return row_result
        else:
            return 1

    def get_summation(self):
        return self.summation

Custom = Math()
df = get_example_df()
df['new_col'] = df.apply(Custom.operation, axis=1)
print(Custom.get_summation())
The second option would be to read in, or generate, each DataFrame for each worker. Then recombine if desired.
workers = 5
df_list = [get_example_df() for _ in range(workers)]  # unique instances, not aliases
...
# worker code
...
aggregated = pd.concat(df_list, axis=0)
However, multiprocessing will not be necessary in most cases. I've processed more than 6 million rows of data without multiprocessing in a reasonable amount of time (on a laptop).
Note: I did not time the above code and there is probably room for improvement.
