I am quite new to python.I have been thinking of making the below code to parellel calls where a list of doj values are formatted with help of lambda,
m_df[['doj']] = m_df[['doj']].apply(lambda x: formatdoj(*x), axis=1)
def formatdoj(doj):
doj = str(doj).split(" ")[0]
doj = datetime.strptime(doj, '%Y' + "-" + '%m' + "-" + "%d")
return doj
Since the list has million records, the time it takes to format all takes a lot of time.
How to make parellel function call in python similar to Parellel.Foreach in c#?
I think that in your case using parallel computation is a bit of an overkill. The slowness comes from the code, not from using a single processor. I'll show you in some steps how to make it faster, guessing a bit that you're working with a Pandas dataframe and what your dataframe contains (please stick to SO guidelines and include a complete working example!!)
For my test, I've used the following random dataframe with 100k rows (scale times up to get to your case):
N=int(1e5)
m_df = pd.DataFrame([['{}-{}-{}'.format(y,m,d)]
for y,m,d in zip(np.random.randint(2007,2019,N),
np.random.randint(1,13,N),
np.random.randint(1,28,N))],
columns=['doj'])
Now this is your code:
tstart = time()
m_df[['doj']] = m_df[['doj']].apply(lambda x: formatdoj(*x), axis=1)
print("Done in {:.3f}s".format(time()-tstart))
On my machine it runs in around 5.1s. It has several problems. The first one is you're using dataframes instead of series, although you work only on one column, and creating a useless lambda function. Simply doing:
m_df['doj'].apply(formatdoj)
Cuts down the time to 1.6s. Also joining strings with '+' is slow in python, you can change your formatdoj to:
def faster_formatdoj(doj):
return datetime.strptime(doj.split()[0], '%Y-%m-%d')
m_df['doj'] = m_df['doj'].apply(faster_formatdoj)
This is not a great improvement but does cut down a bit to 1.5s. If you need to join the strings for real (because e.g. they are not fixed), rather use '-'.join('%Y','%m','%d'), that's faster.
But the true bottleneck comes from using datetime.strptime a lot of times. It is intrinsically a slow command - dates are a bulky thing. On the other hand, if you have millions of dates, and assuming they're not uniformly spread since the beginning of humankind, chances are they are massively duplicated. So the following is how you should truly do it:
tstart = time()
# Create a new column with only the first word
m_df['doj_split'] = m_df['doj'].apply(lambda x: x.split()[0])
converter = {
x: faster_formatdoj(x) for x in m_df['doj_split'].unique()
}
m_df['doj'] = m_df['doj_split'].apply(lambda x: converter[x])
# Drop the column we added
m_df.drop(['doj_split'], axis=1, inplace=True)
print("Done in {:.3f}s".format(time()-tstart))
This works in around 0.2/0.3s, more than 10 times faster than your original implementation.
After all this, if you still are running to slow, you can consider working in parallel (rather parallelizing separately the first "split" instruction and, maybe, the apply-lambda part, otherwise you'd be creating many different "converter" dictionaries nullifying the gain). But I'd take that as a last step rather than the first solution...
[EDIT]: Originally in the first step of the last code box I used m_df['doj_split'] = m_df['doj'].str.split().apply(lambda x: x[0]) which is functionally equivalent but a bit slower than m_df['doj_split'] = m_df['doj'].apply(lambda x: x.split()[0]). I'm not entirely sure why, probably because it's essentially applying two functions instead of one.
Your best bet is to use dask. Dask has a data_frame type which you can use to create this a similar dataframe, but, while executing compute function, you can specify number of cores with num_worker argument. this will parallelize the task
Since I'm not sure about your example, I will give you another one using the multiprocessing library:
# -*- coding: utf-8 -*-
import multiprocessing as mp
input_list = ["str1", "str2", "str3", "str4"]
def format_str(str_input):
str_output = str_input + "_test"
return str_output
if __name__ == '__main__':
with mp.Pool(processes = 2) as p:
result = p.map(format_str, input_list)
print (result)
Now, let's say you want to map a function with several arguments, you should then use starmap():
# -*- coding: utf-8 -*-
import multiprocessing as mp
input_list = ["str1", "str2", "str3", "str4"]
def format_str(str_input, i):
str_output = str_input + "_test" + str(i)
return str_output
if __name__ == '__main__':
with mp.Pool(processes = 2) as p:
result = p.starmap(format_str, [(input_list, i) for i in range(len(input_list))])
print (result)
Do not forget to place the Pool inside the if __name__ == '__main__': and that multiprocessing will not work inside an IDE such as spyder (or others), thus you'll need to run the script in the cmd.
To keep the results, you can either save them to a file, or keep the cmd open at the end with os.system("pause") (Windows) or an input() on Linux.
It's a fairly simple way to use multiprocessing with python.
Related
Currently I am using Multiprocessing feature of Python. Though this works fine for text file up to 2 million records, it fails for the file with 8 million records with "Can't access lock."
Moreover, it takes about 30 minutes to process the file with 2 million records and fails like after about an hour or so for the big file.
I am doing this:
def try_multiple_operations(item):
aab_type = item[15:17]
aab_amount = item[35:46]
aab_name = item[82:100]
aab_reference = item[64:82]
if aab_type not in '99' or 'Z5':
aab_record = f'{aab_name} {aab_amount} {aab_reference}'
else:
aab_record = 'ignore'
return aab_record
Calling the try_multiple_operations in the __main__:
if __name__ == '__main__':
//some other code
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_multiple_operations, item) for item in aab ]
concurrent.futures.wait(futures)
aab_list = [x.result() for x in futures]
aab_list.sort()
//some other code for further processing
I have used pandas/dataframes too. I am able to do a bit of the processing using that. However, I want to be able to retain the original format of the file after processing which dataframes make a bit tricky as they return data in either ndarray format or acsv format.
I would like to understand if there is a faster way of doing this, maybe using some other programming language.
As advised by #wwii, updated the code to get rid of the multiprocessing. This made the code a lot faster.
def try_multiple_operations(items):
data_aab = []
value = ['Z5','RA','Z4','99', 99]
for item in items:
aab_type = item[15:17]
aab_amount = item[35:46]
aab_name = item[82:100]
aab_reference = item[64:82]
if aab_type not in value:
# aab_record = f'{aab_name} {aab_amount} {aab_reference}'
data_aab.append(f'{aab_name} {aab_amount} {aab_reference}')
return data_aab
Called this as:
if __name__ == '__main__':
#some code
aab_list = try_multiple_operations(aab)
#some extra code
This, while a bit surprising for me, is a lot faster than multiprocessing.
I use multiprocessing Pool to run parallel. I tried with 4 cores first in HPC with sub. When it uses 4 core, the time is reduced 4 times compared to 1 core. When I check with qstat, several times it uses 4 cores but after that just 1 core, with exactly the same code.
Could you please give some advice what is wrong with my code or the system?
import pandas as pd
import numpy as np
from multiprocessing import Pool
from datetime import datetime
t1 = pd.read_csv("template.csv",header=None)
s1 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_adfr.csv")
s2 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_dock.csv")
s3 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_gemdock.csv")
s4 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_ledock.csv")
s5 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_plants.csv")
s6 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_psovina.csv")
s7 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_quickvina2.csv")
s8 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_smina.csv")
s9 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_vina.csv")
s10 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_vinaxb.csv")
#number of core and arrays
n = 4
m = (len(t1) // n)+1
g= m*n - len(t1)
for g1 in range(g):
t1.loc[len(t1)]=0
results=[]
def block_linear(i):
temp = pd.DataFrame(np.zeros((m,29)))
for a in range(0,m):
sum_matrix = (t1.iloc[a,0]*s1) + (t1.iloc[a,1]*s2) + (t1.iloc[a,2]*s3)+ (t1.iloc[a,3]*s4) + (t1.iloc[a,4]*s5) + (t1.iloc[a,5]*s6) + (t1.iloc[a,6]*s7) + (t1.iloc[a,7]*s8) + (t1.iloc[a,8]*s9) + (t1.iloc[a,9]*s10)
rank_sum= pd.DataFrame.rank(sum_matrix,axis=0,ascending=True,method='min') #real-True
temp.iloc[a,:] = rank_sum.iloc[999].values
temp['median'] = temp.median(axis=1)
temp.index = range(i*m,(i+1)*m)
return temp
start=datetime.now()
if __name__ == '__main__':
pool = Pool(processes=n)
results = pool.map(block_linear,range(0,n))
print(datetime.now()-start)
out=pd.concat(results)
out.drop(out.tail(g).index,inplace=True)
out.to_csv('test_10dock_4core.csv',index=False)
The main idea is to cut large table into smallers, run calculations and combine together.
Without a more detailed usage of the multiprocessing's Pool package is really difficult to understand and help. Please notice that the Pool package does not guarantee parallelization: the _apply function, for example, only uses one worker of the Pool, and block all your executions. You can check out more details about it here and there.
But assuming you are using the library properly, you should make sure your code is fully parallelizable: an I/O operation on disk, for example, can bottleneck your parallelization and thus making your code run in only one process at a time.
I hope it helped.
[Edit]
Since you provided more details about your problem, I can give more specific tips:
The first thing is that your code is zero parallel. You are just calling the same function N times. This is not how multiprocessing should work.
Instead, the part that should be parallel is the one that is usually in a for loops, like the one you have inside the block_linear().
So, what I recommend to you:
You should change your code to first calculate all your weighted sum and only after that do the rest of the operations. This will help a lot with parallelization.
So, put this operation in a function:
def weighted_sum(column,df2):
temp = pd.DataFrame(np.zeros(m))
for a in range(0,m):
result = (t1.iloc[a,column]*df2)
temp.iloc[a] = result
return temp
So then, you use pool.starmap to parallel the function for the 10 dataframes you have, something like this:
results = pool.starmap(weighted_sum,[(0,s1),(1,s2),(2,s3),....,[9,s10]])
ps: pool.starmap is similar to pool.map but accepts a list of tuple arguments. You can have more details about it here.
At last but not least, you should operate over your results to end your calculations. Since you will have one weighted_sum per column, you can apply a sum over the columns and then the rank_sum.
This is not a fully runnable code to solve your problem, but a general guide of how your should restructure your code to have a multiprocessing advantage. I recommend you to test it over a subsample of the data frames just to make sure it's working properly before you run it on all your data.
I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
def func(i):
return i*2
def par_func_dict(mydict):
values = mydict['values']
df = mydict['df']
return pd.Series([func(i) for i in values])
N = 10000
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10,10))
pool = Pool(cores)
gen = ({'values' : i, 'df' : df}
for i in data_split)
data = pd.concat(pool.map(par_func_dict,gen), axis=0)
pool.close()
pool.join()
I'm wondering if there's a way I can prevent feeding the generator with copies of the dataframe to prevent taking up so much memory.
The answer to the link above suggests using multiprocessing.Process(), but from what I can tell, it's difficult to use that on top of functions that return things (need to incorporate signals / events), and the comments indicate that each process still ends up using a large amount of memory.
I understand from simple examples that Pool.map is supposed to behave identically to the 'normal' python code below except in parallel:
def f(x):
# complicated processing
return x+1
y_serial = []
x = range(100)
for i in x: y_serial += [f(x)]
y_parallel = pool.map(f, x)
# y_serial == y_parallel!
However I have two bits of code that I believe should follow this example:
#Linear version
price_datas = []
for csv_file in loop_through_zips(data_directory):
price_datas += [process_bf_data_csv(csv_file)]
#Parallel version
p = Pool()
price_data_parallel = p.map(process_bf_data_csv, loop_through_zips(data_directory))
However the Parallel code doesn't work whereas the Linear code does. From what I can observe, the parallel version appears to be looping through the generator (it's printing out log lines from the generator function) but then not actually performing the "process_bf_data_csv" function. What am I doing wrong here?
.map tries to pull all values from your generator to form it into an iterable before actually starting the work.
Try waiting longer (till the generator runs out) or use multi threading and a queue instead.
My first post:
Before beginning, I should note I am relatively new to OOP, though I have done DB/stat work in SAS, R, etc., so my question may not be well posed: please let me know if I need to clarify anything.
My question:
I am attempting to import and parse large CSV files (~6MM rows and larger likely to come). The two limitations that I've run into repeatedly have been runtime and memory (32-bit implementation of Python). Below is a simplified version of my neophyte (nth) attempt at importing and parsing in reasonable time. How can I speed up this process? I am splitting the file as I import and performing interim summaries due to memory limitations and using pandas for the summarization:
Parsing and Summarization:
def ParseInts(inString):
try:
return int(inString)
except:
return None
def TextToYearMo(inString):
try:
return 100*inString[0:4]+int(inString[5:7])
except:
return 100*inString[0:4]+int(inString[5:6])
def ParseAllElements(elmValue,elmPos):
if elmPos in [0,2,5]:
return elmValue
elif elmPos == 3:
return TextToYearMo(elmValue)
else:
if elmPos == 18:
return ParseInts(elmValue.strip('\n'))
else:
return ParseInts(elmValue)
def MakeAndSumList(inList):
df = pd.DataFrame(inList, columns = ['x1','x2','x3','x4','x5',
'x6','x7','x8','x9','x10',
'x11','x12','x13','x14'])
return df[['x1','x2','x3','x4','x5',
'x6','x7','x8','x9','x10',
'x11','x12','x13','x14']].groupby(
['x1','x2','x3','x4','x5']).sum().reset_index()
Function Calls:
def ParsedSummary(longString,delimtr,rowNum):
keepColumns = [0,3,2,5,10,9,11,12,13,14,15,16,17,18]
#Do some other stuff that takes very little time
return [pse.ParseAllElements(longString.split(delimtr)[i],i) for i in keepColumns]
def CSVToList(fileName, delimtr=','):
with open(fileName) as f:
enumFile = enumerate(f)
listEnumFile = set(enumFile)
for lineCount, l in enumFile:
pass
maxSplit = math.floor(lineCount / 10) + 1
counter = 0
Summary = pd.DataFrame({}, columns = ['x1','x2','x3','x4','x5',
'x6','x7','x8','x9','x10',
'x11','x12','x13','x14'])
for counter in range(0,10):
startRow = int(counter * maxSplit)
endRow = int((counter + 1) * maxSplit)
includedRows = set(range(startRow,endRow))
listOfRows = [ParsedSummary(row,delimtr,rownum)
for rownum, row in listEnumFile if rownum in includedRows]
Summary = pd.concat([Summary,pse.MakeAndSumList(listOfRows)])
listOfRows = []
counter += 1
return Summary
(Again, this is my first question - so I apologize if I simplified too much or, more likely, too little, but I am at a loss as to how to expedite this.)
For runtime comparison:
Using Access I can import, parse, summarize, and merge several files in this size-range in <5 mins (though I am right at its 2GB lim). I'd hope I can get comparable results in Python - presently I'm estimating ~30 min run time for one file. Note: I threw something together in Access' miserable environment only because I didn't have admin rights readily available to install anything else.
Edit: Updated parsing code. Was able to shave off five minutes (est. runtime at 25m) by changing some conditional logic to try/except. Also - runtime estimate doesn't include pandas portion - I'd forgotten I'd commented that out while testing, but its impact seems negligible.
If you want to optimize performance, don't roll your own CSV reader in Python. There is already a standard csv module. Perhaps pandas or numpy have faster csv readers; I'm not sure.
From https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file:
In short, pandas.io.parsers.read_csv beats everybody else, NumPy's loadtxt is impressively slow and NumPy's from_file and load impressively fast.