I found the bottleneck in my Python script: this function takes over 4 minutes on my CSV.
Is it better to use the DataFrame assign function with a lambda here? And is that even possible for the function I wrote?
The function checks whether an article number (ArtikelNr) appears more than once in the dataframe and, if so, marks all of those rows as a variant.
def mark_variants(df):
    single_prods = df["ArtikelNr"].unique()
    varianten = pd.DataFrame()
    non_varianten = pd.DataFrame()
    for prod in single_prods:
        filtered_prods = df[df.ArtikelNr == prod]
        if len(filtered_prods["ArtikelNr"]) > 1:
            varianten = pd.concat([varianten, filtered_prods])
        else:
            non_varianten = pd.concat([non_varianten, filtered_prods])
    varianten["variante"] = 1
    non_varianten["variante"] = 0
    return pd.concat([varianten, non_varianten])
You are concatenating dataframes multiple times within the for-loop, which is computationally expensive.
You did not provide a reproducible example, so I cannot test it myself, but using lists instead of empty dataframes to instantiate varianten and non_varianten, and concatenating only once after the for-loop is over, might speed things up.
Here is how you could refactor your function; give it a try:
def mark_variants(df):
    single_prods = df["ArtikelNr"].unique()
    varianten = []
    non_varianten = []
    for prod in single_prods:
        filtered_prods = df[df.ArtikelNr == prod]
        if len(filtered_prods["ArtikelNr"]) > 1:
            varianten.append(filtered_prods)
        else:
            non_varianten.append(filtered_prods)
    varianten = pd.concat(varianten)
    non_varianten = pd.concat(non_varianten)
    varianten["variante"] = 1
    non_varianten["variante"] = 0
    return pd.concat([varianten, non_varianten])
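Going one step further, the loop could be dropped entirely with a vectorized check. This is only a sketch, assuming the goal is simply to flag every row whose ArtikelNr occurs more than once (the mark_variants_vectorized name is made up here):

def mark_variants_vectorized(df):
    df = df.copy()
    # duplicated(keep=False) is True for every row whose ArtikelNr appears more than once
    df["variante"] = df["ArtikelNr"].duplicated(keep=False).astype(int)
    return df

Unlike the loop version, this also preserves the original row order.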
Input
mydfs= [df1,df2,df3,df4,df5,df6,df7,df8,df9]
My Code
import pandas as pd
df_1 = pd.concat([mydfs[0],mydfs[1],mydfs[2]])
df_m = df_1.merge(mydfs[2])
df_2 = pd.concat([mydfs[3],mydfs[4],mydfs[5]])
df_m1 = df_2.merge(mydfs[5])
df_3 = pd.concat([mydfs[6],mydfs[7],mydfs[8]])
df_m2 = df_3.merge(mydfs[8])
But I want to write this dynamically instead of doing it manually. Is it possible with a for loop? The list of dataframes may grow in the future.
You can use a dictionary comprehension:
N = 3
out_dfs = {f'df_{i//N+1}': pd.concat(mydfs[i:i+N])
for i in range(0, len(mydfs), N)}
output:
{'df_1': <concatenation result of ['df1', 'df2', 'df3']>,
'df_2': <concatenation result of ['df4', 'df5', 'df6']>,
'df_3': <concatenation result of ['df7', 'df8', 'df9']>,
}
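If you also need the merge step from your code, the same grouping can carry it. A sketch, assuming (as in your snippet) each concatenated group is merged with the last frame of that group:

N = 3
out_dfs = {}
for k, i in enumerate(range(0, len(mydfs), N), start=1):
    group = mydfs[i:i+N]
    out_dfs[f'df_{k}'] = pd.concat(group)
    # mirrors df_1.merge(mydfs[2]), df_2.merge(mydfs[5]), ...
    out_dfs[f'df_m{k}'] = out_dfs[f'df_{k}'].merge(group[-1])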
You can use a loop with "globals" to iterate through mydfs and define two "kth" variables each round
i = 0
k = 1
while i < len(mydfs):
    globals()["df_{}".format(k)] = pd.concat([mydfs[i], mydfs[i+1], mydfs[i+2]])
    globals()["df_m{}".format(k)] = globals()["df_{}".format(k)].merge(mydfs[i+2])
    i = i + 3
    k = k + 1
I would like to store the result of the work in specific variables after multiprocessing, as shown below. Alternatively, I want to save the results of the job as CSV files. May I know how to do it?
This is my code (I want to get the 'df4' and 'df7' data and save them as CSV files):
import pandas as pd
from pandas import DataFrame
import time
import multiprocessing
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
df5 = pd.DataFrame()
df6 = pd.DataFrame()
df7 = pd.DataFrame()
df8 = pd.DataFrame()
date = '2011-03', '2011-02' ........ '2021-03'  # there are 120 entries
list1 = df1['resion'].drop_duplicates()  # there are 20 entries; 'df1' is the original data
#I'd like to divide the list and work on it.
list11 = list1.iloc[0:10]
list12 = list1.iloc[10:20]
# It's a function using 'list11'.
def cal1():
    global df2
    global df3
    global df4
    start = time.time()
    for i, t in enumerate(list11):
        df2 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending=False)
            df4 = df4.append(pd.DataFrame(df3))
        return df4
# It's a function using 'list12'.
def cal2():
    global df5
    global df6
    global df7
    start = time.time()
    for i, t in enumerate(list12):
        df5 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df6 = pd.DataFrame(df5[df5['date'] == d])
            df6['number'] = df6['price'].rank(pct=True, ascending=False)
            df7 = df7.append(pd.DataFrame(df6))
        return df7
## Multiprocessing code
if __name__ == "__main__":
    # creating processes
    p1 = multiprocessing.Process(target=cal1, args=())
    p2 = multiprocessing.Process(target=cal2, args=())

    # starting process 1
    p1.start()
    # starting process 2
    p2.start()

    # wait until process 1 is finished
    p1.join()
    # wait until process 2 is finished
    p2.join()

    # both processes finished
    print("Done!")
It looks like your functions cal1 and cal2 are identical except that they are trying to assign results to some different global variables. This is not going to work, because when you run them in a subprocess, they will assign that global variable in the subprocess, but that will have no impact whatsoever on the main process from which you started them.
If you want to map a function to multiple input ranges across multiple processes you can use a process Pool and Pool.map.
For example:
def cal(input_list):
    start = time.time()
    for i, t in enumerate(input_list):
        df2 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending=False)
            df4 = df4.append(pd.DataFrame(df3))
        # I kept your original code unmodified but I'm not really sure this
        # is what to do, because you are returning after one pass through the
        # outer loop. I haven't scrutinized what you are actually trying to
        # do but I suspect this is wrong too.
        return df4
Then create a process pool and you can divide up the input how you want (or, with a bit of tweaking, you can let Pool.map chunk the input for you, and then reduce the outputs from map into a single output):
pool = multiprocessing.Pool(2)
dfs = pool.map(cal, [list1.iloc[0:10], list1.iloc[10:20]])
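If you also want the CSV files mentioned in the question, you could write out the frames that map returns. A sketch with hypothetical file names:

df4, df7 = dfs
df4.to_csv("df4.csv", index=False)  # hypothetical file names
df7.to_csv("df7.csv", index=False)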
This is just to get you started. I would probably do a number of other things differently as well.
I am trying to find the percent match between keywords using filters, and have had some trouble getting the correct percentage when using a loop.
Here's what I've tried so far:
import pandas as pd

def percentmatch(component=[], manufacture=[]):
    dummy = 0
    for i in component:
        if i in manufacture:
            dummy += 1
    requirements = len(component)
    return (dummy / requirements) * 100

def isDesired(innovator=[], manufacture=[]):
    for i in innovator:
        if i in manufacture:
            return True
    return False
part = pd.read_csv("fakedata.csv")
# Change the Value for test case
part['Size'].iloc[5] = 'Startup'
manufacture = pd.read_csv("book1.csv")

# First filter if the manufacture wants to work with certain customer
criteria = []
for i, r in manufacture.iterrows():
    criteria.append(isDesired([part['Size'].iloc[0]], r['Desired Customer**'].split(", ")))
manufacture['criteria'] = criteria
firstfilter = manufacture[criteria]
Now the second filter.
# Second filter if the manufacture can do certain phase. Ex: prototype, pre-release
criteria2 = []
for i, r in firstfilter.iterrows():
    criteria2.append(isDesired([part['Phase'].iloc[0]], r['Preferred'].split(", ")))
firstfilter['criteria2'] = criteria2
secondfilter = firstfilter[criteria2]
# Third Filter to find the percent match in Methods
percentmatch1 = []
for i, r in secondfilter.iterrows():
    print(r['Method'].split(", "))
    print(part['Method'].iloc[0].split(", "))
    percentmatch1.append(percentmatch([part['Method'].iloc[0].split(", ")], r['Method'].split(",")))
secondfilter['Method match'] = percentmatch1
In the above code block, my output is
['CNC Machining', '3D printing', 'Injection Molding']
['CNC Machining', '3D printing']
Doing a quick secondfilter.head() lookup gives me the following:
[secondfilter.head() output screenshot]
The method match should be 100%, not 0%. How do I correct this?
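For what it's worth (an observation on the code as posted, not a tested fix): percentmatch receives the part's method list wrapped in an extra list, and r['Method'] is split on "," without the space, so no single element can ever match and the result is always 0%. Unwrapping the argument and splitting both sides the same way should give a real percentage; which list is passed as component decides which side the percentage is relative to.

# unwrap the list and split both strings on ", " so the items compare cleanly;
# swap the two arguments if the match should be relative to the manufacturer's methods instead
percentmatch1.append(
    percentmatch(part['Method'].iloc[0].split(", "), r['Method'].split(", "))
)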
I would like to run a loop that retrieves data from a function (not coded in the loop) for each base_currency. The code runs without error, but it displays the first item in the list 5 times (the number of base currencies) instead of looping over them one after the other (the x in the function call is not working properly).
The code:
base_currency = ['BTC', 'ABX', 'ADH', 'ALX', '1WO']
length = len(base_currency)
d_volu = []
i = 0
while i < length:
    for x in base_currency:
        volu = daily_volume_historical(x, 'JPY', exchange='CCCAGG').set_index('timestamp').volume
        d_volu.append(volu)
        i += 1

d_volu = pd.concat(d_volu, axis=1)
print(d_volu)
Thank you
You're looping over base_currency twice, as mentioned by @Grismar. You can avoid the confusion by using a list comprehension like this:
base_currency = ['BTC','ABX','ADH','ALX','1WO']
d_volu = [daily_volume_historical(x, 'JPY', exchange='CCCAGG').set_index('timestamp').volume
for x in base_currency]
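The concatenation step from your original code then stays the same:

d_volu = pd.concat(d_volu, axis=1)
print(d_volu)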
A quick question: I want to know if there is a way to create a sequence of dataframes by putting a variable inside the name of each dataframe. For example:
df_0 = pd.read_csv(file1, sep=',')
b = 0
x = 1
while (b == 0):
    df_+str(x) = pd.merge(df_+str(x-1), Source, left_on='R_Key', right_on='S_Key', how='inner')
    if Final_+str(x).empty != 'True':
        x = x + 1
    else:
        b = b + 1
Now when executed, this raises "can't assign to operator" for df_+str(x). Any idea how to fix this?
This is the right time to use a list (a sequence type in Python), so you can refer to exactly as many data frames as you need.
dfs = []
dfs.append(pd.read_csv(file1, sep=','))  # It is now dfs[0]
b = 0
x = 1
while (b == 0):
    dfs.append(pd.merge(dfs[x-1],
                        Source, left_on='R_Key',
                        right_on='S_Key', how='inner'))
    if Final[x].empty != 'True':
        x = x + 1
    else:
        b = b + 1
Now, you never define Final. You'll need to use the same trick there.
Not sure why you want to do this, but I think a clearer and more logical way is just to create a dictionary with dataframe name strings as keys and your generated dataframes as values.
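A minimal sketch of that idea, assuming the same merge as in the question and that the loop should stop once a merge comes back empty (the question's Final_ check is never defined, as noted above):

dfs = {'df_0': pd.read_csv(file1, sep=',')}
x = 1
while True:
    merged = pd.merge(dfs[f'df_{x-1}'], Source,
                      left_on='R_Key', right_on='S_Key', how='inner')
    if merged.empty:
        break
    dfs[f'df_{x}'] = merged
    x += 1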