I'm relatively new to python and very new to multithreading and multiprocessing. I've been trying to send out thousands of values (Approx. 70,000) into chunks through a web-based API and want it to return me data associated with all those values. The API can take on 50 values a batch at a time so for now as a test I have 100 values I'd like to send in 2 chunks of 50 values. Without multithreading, it would've taken me hours to finish the job so I've tried to use multithreading to improve performance.
The Issue: The code is getting stuck after performing only one task(first row, that even the header, not even the main values) on pool.map() part, I had to restart the notebook kernel. I've heard not to use multiprocessing on a notebook, so I've coded the whole thing on Spyder and ran it, but still the same. Code is below:
#create df data frame with
#some codes to get df of 100 values in
#2 chunks, each chunk contains 50 values.
output:
df = VAL
0 1166835704;1352357565;544477351;159345951;22...
1 354236462063;54666246046;13452466248...
def get_val(df):
data = []
v_list = df
s = requests.Session()
url = 'https://website/'
post_fields = {'format': 'json', 'data':v_list}
r = s.post(url, data=post_fields)
d = json.loads(r.text)
sort = pd.json_normalize(d, ['Results'])
return sort
if __name__ == "__main__":
pool = ThreadPool(4) # Make the Pool of workers
results = pool.map(get_val, df) #Open the df in their own threads
pool.close() #close the pool and wait for the work to finish
pool.join()
Any suggestions would be helpful. Thanks!
Can you check once with following
with ThreadPool(4) as pool:
results= pool.map(get_val, df) #df should be iterable.
print(results)
Also, pls.check if chunksize can be passed to threadpool as that can affect performance.
Related
I have to loop for N times to calculate formulas and add results in dataframe.
My code works and takes a few seconds to process each Item. However, it can only do one item at a time because I'm running the array through a for loop:
I try to update Code and I add numba library to optimise code
def calculationResults(myconfig,df_results,isvalid,dimension,....othersparams):
for month in nb.prange(0, myconfig.len_production):
calculationbymonth(month,df_results,,....othersparams)
return df_results
But it's still doing one item at a time?
ANy Ideas?
We can use parallelized apply using the similar to below function.
def parallelize_dataframe(df, func, n_cores=4):
df_split = np.array_split(df, n_cores)
pool = Pool(n_cores)
df = pd.concat(pool.map(func, df_split))
pool.close()
pool.join()
return df
datasets = {}
datasets['df1'] = df1
datasets['df2'] = df2
datasets['df3'] = df3
datasets['df4'] = df4
def prepare_dataframe(dataframe):
return dataframe.apply(lambda x: x.astype(str).str.lower().str.replace('[^\w\s]', ''))
for key, value in datasets.items():
datasets[key] = prepare_dataframe(value)
I need to prepare the data in some dataframes for further analysis. I would like to parallelize the for loop that updates the dictionary with a prepared dataframe. This code will eventually run on a machine with dozens of cores and thousands of dataframes. On my local machine I do not appear to be using more than a single core in the prepare_dataframe function.
I have looked at Numba and Joblib but I cannot find a way to work with dictionary values in either library.
Any insight would be very much appreciated!
You can use the multiprocessing library. You can read about its basics here.
Here is the code that does what you need:
from multiprocessing import Pool
def prepare_dataframe(dataframe):
# do whatever you want here
# changes made here are *not* global
# return a modified version of what you want
return dataframe
def worker(dict_item):
key,value = dict_item
return (key,prepare_dataframe(value))
def parallelize(data, func):
data_list = list(data.items())
pool = Pool()
data = dict(pool.map(func, data_list))
pool.close()
pool.join()
return data
datasets = parallelize(datasets,worker)
We have a dataset which has approx 1.5MM rows. I would like to process that in parallel. The main function of that code is to lookup master information and enrich the 1.5MM rows. The master is a two column dataset with roughly 25000 rows. However i am unable to make the multi-process work and test its scalability properly. Can some one please help. The cut-down version of the code is as follows
import pandas
from multiprocessing import Pool
def work(data):
mylist =[]
#Business Logic
return mylist.append(data)
if __name__ == '__main__':
data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv',header='infer')
print('Source Data :', data_df)
agents = 2
chunksize = 2
with Pool(processes=agents) as pool:
result = pool.map(func=work, iterable= data_df, chunksize=20)
pool.close()
pool.join()
print('Result :', result)
Method work will have the business logic and i would like to pass partitioned data_df into work to enable parallel processing. The sample data is as follows
CUSTOMER_ID,PRODUCT_ID,SALE_QTY
641996,115089,2
1078894,78144,1
1078894,121664,1
1078894,26467,1
457347,59359,2
1006860,36329,2
1006860,65237,2
1006860,121189,2
825486,78151,2
825486,78151,2
123445,115089,4
Ideally i would like to process 6 rows in each partition.
Please help.
Thanks and Regards
Bala
First, work is returning the output of mylist.append(data), which is None. I assume (and if not, I suggest) you want to return a processed Dataframe.
To distribute the load, you could use numpy.array_split to split the large Dataframe into a list of 6-row Dataframes, which are then processed by work.
import pandas
import math
import numpy as np
from multiprocessing import Pool
def work(data):
#Business Logic
return data # Return it as a Dataframe
if __name__ == '__main__':
data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv',header='infer')
print('Source Data :', data_df)
agents = 2
rows_per_workload = 6
num_loads = math.ceil(data_df.shape[0]/float(rows_per_workload))
split_df = np.array_split(data_df, num_loads) # A list of Dataframes
with Pool(processes=agents) as pool:
result = pool.map(func=work, iterable=split_df)
result = pandas.concat(result) # Stitch them back together
pool.close()
pool.join()pool = Pool(processes=agents)
print('Result :', result)
My best recommendation is for you to use the chunksize parameter in read_csv (Docs) and iterate over. This way you wont crash your ram trying to load everything plus if you want you can for example use threads to speed up the process.
for i,chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
Im not sure if this answer your specific question but i hope it helps.
I have a sample df of 1000 rows, that I read from excel, which looks like that:
exFcode
0 38907030
1 47036870
2 54060696
3 38907039
4 100811680
(...)
I need to assign number of articles for each code. To do this I connect to an API taking each code (this API only allows 1 code per request) and return value in the second column of the df. Currently I do it this way:
def getArticles(code):
r = requests.get(API_link % code).content
jsonized = json.loads(r.decode("utf-8"))
try:
num_articles = jsonized["TotalRecords"]
except:
return 'not found'
return num_articles
df['articles'] = df["exFcode"].apply(lambda row: getArticles(row))
It does the job but it's slow, performs each operation one by one. For 1000 codes it takes around 10 minutes. Very often I have to deal with files of 50k and more...
I was thinking how to do it more efficiently. I thought I could divide the df into 2 parts and then perform each part in separate thread. It's my first attempt of applying threading into my programs... So I have created two additional functions wrapper and main.
def wrapper(df):
df['articles'] = df["exFcode"].apply(lambda row: getArticles(row))
return df
def main(df):
#separate df to two even halves
half = int(len(df))
df1 = df.iloc[:half]
df2 = df.iloc[half:]
t1 = Thread(target=wrapper, args=(df1,))
t2 = Thread(target=wrapper, args=(df2,))
t1.start()
t2.start()
print('completed')
However when I execute the function main(df) nothing happens. Am I completely misunderstanding the concept of threading? Any other idea how to make it more efficient?
you print "complete" when the threads have started. But you're missing the join part to wait for them to complete.
t1.start()
t2.start()
print('threads started')
t1.join()
t2.join()
print('really completed')
I am reading in hundreds of HDF files and processing the data of each HDF seperately. However, this takes an awful amount of time, since it is working on one HDF file at a time. I just stumbled upon http://docs.python.org/library/multiprocessing.html and am now wondering how I can speed things up using multiprocessing.
So far, I came up with this:
import numpy as np
from multiprocessing import Pool
def myhdf(date):
ii = dates.index(date)
year = date[0:4]
month = date[4:6]
day = date[6:8]
rootdir = 'data/mydata/'
filename = 'no2track'+year+month+day
records = read_my_hdf(rootdir,filename)
if records.size:
results[ii] = np.mean(records)
dates = ['20080105','20080106','20080107','20080108','20080109']
results = np.zeros(len(dates))
pool = Pool(len(dates))
pool.map(myhdf,dates)
However, this is obviously not correct. Can you follow my chain of thought what I want to do? What do I need to change?
Try joblib for a friendlier multiprocessing wrapper:
from joblib import Parallel, delayed
def myhdf(date):
# do work
return np.mean(records)
results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)
The Pool classes map function is like the standard python libraries map function, you're guaranteed to get your results back in the order that you put them in. Knowing that, the only other trick is that you need to return results in a consistant manner, and the filter them afterwards.
import numpy as np
from multiprocessing import Pool
def myhdf(date):
year = date[0:4]
month = date[4:6]
day = date[6:8]
rootdir = 'data/mydata/'
filename = 'no2track'+year+month+day
records = read_my_hdf(rootdir,filename)
if records.size:
return np.mean(records)
dates = ['20080105','20080106','20080107','20080108','20080109']
pool = Pool(len(dates))
results = pool.map(myhdf,dates)
results = [ result for result in results if result ]
results = np.array(results)
If you really do want results as soon as they are available you can use imap_unordered