I'm currently trying to learn how to use multiprocessing in Python, and I want to apply it to some code of mine.
I have read other questions on the subject, but the solutions in those questions did not work in my environment (maybe because something changed in Python 3.10).
My code looks like this:
def obtenern2():
    A = []
    for d in days:
        aux = dfhabil[dfhabil["day"] == d]
        n2 = casosn(aux, 2)
        aml = ExportarMODml(n2)
        adl = ExportarMODdl(n2)
        A.append(aml)
        A.append(adl)
    return pd.concat(A)

B = obtenern2()
where "ExportarMODml" or "ExportarMODdl" takes the dataframe "n2" and perform some calculations returning a dataframe (so "A" is actually a list of dataframes).
I think that "ExportarMODml" and "ExportarMODdl" could be process in parallel, but I dont know how to append the resulting dataframes to the same list without causing corruption or something like that.
Here is a pattern that you could probably adapt to your requirements.
We have two functions ExportarMODml and ExportarMODdl. Each function takes a dictionary as its only argument and returns a DataFrame.
These can be executed in parallel and a concatenation of the returned DataFrames can be achieved thus:
from pandas import concat, DataFrame
from concurrent.futures import ProcessPoolExecutor

def ExportarMODml(d):
    return DataFrame(d)

def ExportarMODdl(d):
    return DataFrame(d)

def main():
    d = {'a': [1, 2], 'b': [3, 4]}
    with ProcessPoolExecutor() as ppe:
        futures = [ppe.submit(func, d) for func in (ExportarMODml, ExportarMODdl)]
        df = concat([future.result() for future in futures])
        print(df)

if __name__ == '__main__':
    main()
Output:
a b
0 1 3
1 2 4
0 1 3
1 2 4
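Adapted to the loop from the question, the same pattern could look like the following sketch. It assumes days, dfhabil, casosn, ExportarMODml and ExportarMODdl are defined at module level as in the question (they must be importable by the worker processes):

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def obtenern2():
    futures = []
    with ProcessPoolExecutor() as ppe:
        for d in days:
            aux = dfhabil[dfhabil["day"] == d]
            n2 = casosn(aux, 2)
            # submit both exports for this day; they run in parallel
            futures.append(ppe.submit(ExportarMODml, n2))
            futures.append(ppe.submit(ExportarMODdl, n2))
        # results come back in submission order; only the parent
        # process touches this list, so nothing can be corrupted
        return pd.concat([f.result() for f in futures])

B = obtenern2()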
I'm trying to use DataFrame.map_partitions() from Dask to apply a function to each partition. The function takes as input a list of values and has to return the rows of the dataframe partition that contain these values in a specific column (using loc() and isin()).
The issue is that I get this error:
"index = partition_info['number'] - 1
TypeError: 'NoneType' object is not subscriptable"
When I print partition_info, it prints None hundreds of times (but I only have 60 elements in the loop, so I expected only 60 prints). Is it normal that it prints None because it's a child process, or am I missing something with partition_info? I cannot find useful information on this.
def apply_f(df, barcodes_per_core: List[List[str]], partition_info=None):
    print(partition_info)
    index = partition_info['number'] - 1
    indexes = barcodes_per_core[index]
    return df.loc[df['barcode'].isin(indexes)]

df = from_pandas(df, npartitions=nb_cores)
dfs_per_core = df.map_partitions(apply_f, barcodes_per_core, meta=df)
dfs_per_core = dfs_per_core.compute(scheduler='processes')
=> The documentation of partition_info is at the end of this page.
It's not clear why things are not working on your end; one potential issue is that you are re-using df multiple times. Here's a MWE that works:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(range(10), columns=["a"])
ddf = dd.from_pandas(df, npartitions=3)

def my_func(d, x, partition_info=None):
    print(x, partition_info)
    return d  # return the partition so the results can be concatenated

ddf.map_partitions(my_func, 3, meta=df.head()).compute(scheduler='processes')
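Regarding the None prints in the question: Dask can call the mapped function with partition_info=None (for example while inferring metadata or building the graph), so a common defensive pattern is to guard before subscripting it. Here is a sketch based on the question's function, where the guard is my addition:

from typing import List

def apply_f(df, barcodes_per_core: List[List[str]], partition_info=None):
    if partition_info is None:
        # called without real partition info (e.g. during meta
        # inference); return an empty frame with the same schema
        return df.head(0)
    index = partition_info['number'] - 1
    indexes = barcodes_per_core[index]
    return df.loc[df['barcode'].isin(indexes)]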
I want to use the multiprocessing library to parallelize the computation. If you comment out the two lines marked "# comment to run normal way" and uncomment the line marked "# uncomment to run normal way", the code runs in a serial fashion.
My dataframe is very big and takes a lot of time, so I want to use multiprocessing.
This is what I am trying:
import pandas as pd
from multiprocessing import Pool

def do_something(df):
    return df

def main(df, df_hide, df_res):
    p = Pool()  # comment to run normal way
    for i in range(0, df_hide.shape[0]):
        # note: DataFrame.append was removed in pandas 2.0;
        # pd.concat is the modern replacement
        df = df.append(df_hide.iloc[i, :])
        df = p.map(do_something, df)  # comment to run normal way
        # df = do_something(df)  # uncomment to run normal way
        df_res.iloc[i, 0] = df.iloc[-1, 0]
    return df_res

if __name__ == '__main__':
    df = pd.DataFrame({'a': [1, 2, 3]})
    df_hide = pd.DataFrame({'a': [4, 5, 6]})
    df_res = pd.DataFrame({'z': [0, 0, 0]})
    df_res1 = main(df, df_hide, df_res)
    print(df_res1)
Expected output (this is what I get if I run it the normal way):
z
0 4
1 5
2 6
With multiprocessing this gives me nothing; it freezes the cmd. And even if it did run, I don't think I would get the expected results, as I have to do something after every step. Can you please suggest how to parallelize the above code using multiprocessing? For reference, here is the full serial version that produces the expected output:
import numpy as np
import pandas as pd

def do_something(df):
    return df

def main(df, df_hide, df_res):
    for i in range(0, df_hide.shape[0]):
        df = df.append(df_hide.iloc[i, :])
        df_res.iloc[i, 0] = df.iloc[-1, 0]
    return df_res

if __name__ == '__main__':
    df = pd.DataFrame({'a': [1, 2, 3]})
    df_hide = pd.DataFrame({'a': [4, 5, 6]})
    df_res = pd.DataFrame({'z': [0, 0, 0]})
    df_res1 = main(df, df_hide, df_res)
    print(df_res1)
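For what it's worth, here is a minimal sketch of how Pool is usually combined with a DataFrame: the frame is split into independent row chunks, the worker function is mapped over the chunks, and the results are concatenated in the parent. This only applies when the chunks can be processed independently, which is not the case for the loop above, where each iteration depends on the previous one:

import os

import numpy as np
import pandas as pd
from multiprocessing import Pool

def do_something(chunk):
    # placeholder for the real per-chunk computation
    return chunk

if __name__ == '__main__':
    big_df = pd.DataFrame({'a': range(1000)})
    n_workers = os.cpu_count() or 1
    # split the frame into independent row chunks, one per worker
    chunks = np.array_split(big_df, n_workers)
    with Pool(n_workers) as p:
        result = pd.concat(p.map(do_something, chunks))
    print(result.shape)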
I would like to create dataframes in a loop and then use these dataframes in another loop. I tried the eval() function but it didn't work.
For example:
for i in range(5):
    df_i = df[(df.age == i)]
There I would like to create df_0, df_1, etc., and then concatenate these new dataframes after some calculations:
final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, df_i])
You can create a dict of DataFrames x, keyed by i:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})

x = {}
for i in range(5):
    x[i] = df[df['age'] == i]

final = pd.concat(x.values())
Then you can refer to individual DataFrames as:
x[1]
Output:
age
5 1
13 1
15 1
And concatenate all of them with:
pd.concat(x.values())
Output:
age
18 0
5 1
13 1
15 1
2 2
6 2
...
This way is weird and not recommended, but it can be done.
Answer
for i in range(5):
    exec(f"df_{i} = df[df['age'] == {i}]")

def UDF(dfi):
    # do something in the user-defined function
    return dfi

for i in range(5):
    exec(f"df_{i} = UDF(df_{i})")

final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    exec(f"final_df = pd.concat([final_df, df_{i}])")
Better Way 1
Using a list or a dict to store the dataframes is a better way, since you can access each dataframe by an index or a key.
Since another answer shows the way using a dict (@perl), I will show the way using a list.
def UDF(dfi):
    # do something in the user-defined function
    return dfi

dfs = [df[df['age'] == i] for i in range(5)]
final_df = pd.concat(map(UDF, dfs))
Better Way 2
Since you are using pandas.DataFrame, groupby is the 'pandas' way to do what you want (maybe; I'm guessing, since I don't know exactly what you want to do).
def UDF(dfi):
    # do something in the user-defined function
    return dfi

final_df = df.groupby('age').apply(UDF)
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
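For concreteness, here is a minimal runnable sketch of the groupby approach; the body of UDF (tagging each row with its group size) is just a stand-in for whatever calculation you actually need:

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})

def UDF(dfi):
    # stand-in calculation: tag each row with the size of its group
    out = dfi.copy()
    out['group_size'] = len(dfi)
    return out

final_df = df.groupby('age').apply(UDF)
print(final_df.head())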
Sometimes I end up with a series of tuples/lists when using Pandas. This is common when, for example, doing a group-by and passing a function that has multiple return values:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(dict(x=np.random.randn(100),
                       y=np.repeat(list("abcd"), 25)))
out = df.groupby("y").x.apply(stats.ttest_1samp, 0)
print(out)
y
a (1.3066417476, 0.203717485506)
b (0.0801133382517, 0.936811414675)
c (1.55784329113, 0.132360504653)
d (0.267999459642, 0.790989680709)
dtype: object
What is the correct way to "unpack" this structure so that I get a DataFrame with two columns?
A related question is how I can unpack either this structure or the resulting dataframe into two Series/array objects. This almost works:
t, p = zip(*out)
but then t is:
(array(1.3066417475999257),
array(0.08011333825171714),
array(1.557843291126335),
array(0.267999459641651))
and one needs to take the extra step of squeezing it.
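One possible way to do that squeeze (a sketch; converting through numpy is just one option):

import numpy as np

# each element of t is a 0-d array; converting the whole tuple
# yields an ordinary 1-d float array
t = np.asarray(t, dtype=float)
p = np.asarray(p, dtype=float)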
Maybe this is most straightforward (most pythonic, I guess):

df_out = out.apply(pd.Series)

If you want to rename the columns to something more meaningful, then:

df_out.columns = ['Kstats', 'Pvalue']

If you do not want the default name for the index:

df_out.index.name = None
maybe:
>>> pd.DataFrame(out.tolist(), columns=['out-1','out-2'], index=out.index)
out-1 out-2
y
a -1.9153853424536496 0.067433
b 1.277561889173181 0.213624
c 0.062021492729736116 0.951059
d 0.3036745009819999 0.763993
[4 rows x 2 columns]
I believe you want this:
df = pd.DataFrame(out.tolist())
df.columns = ['KS-stat', 'P-value']
result:
KS-stat P-value
0 -2.12978778869 0.043643
1 3.50655433879 0.001813
2 -1.2221274198 0.233527
3 -0.977154419818 0.338240
I have met a similar problem. The two ways I found to solve it are exactly the answers of @CT Zhu and @Siraj S.
Here is some supplementary information you might be interested in:
I compared the two ways and found that @CT Zhu's way performs much faster as the size of the input grows.
Example:
# Python 3
import time
from statistics import mean

import pandas as pd

df_a = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# function to test
def func1(x):
    c = str(x) * 3
    d = int(x) + 100
    return c, d

# Siraj S's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = df_a['b'].apply(lambda x: func1(x)).apply(pd.Series)
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.14907703161239624

# CT Zhu's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = pd.DataFrame(df_a['b'].apply(lambda x: func1(x)).tolist())
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.0014058423042297363
The difference is presumably because .apply(pd.Series) constructs a new Series object for every row, while .tolist() hands all the values to the DataFrame constructor in a single call.
PS: Please forgive my ugly code.
Not sure if t and r are predefined somewhere, but if not, I am getting the two tuples into t and r with:
>>> t, r = zip(*out)
>>> t
(-1.776982300308175, 0.10543682705459552, -1.7206831272759038, 1.0062163376448068)
>>> r
(0.08824925924534484, 0.9169054844258786, 0.09817788453771065, 0.3243492942246433)
Thus, you could do this:
>>> df = pd.DataFrame(columns=['t', 'r'])
>>> df.t, df.r = zip(*out)
>>> df
t r
0 -1.776982 0.088249
1 0.105437 0.916905
2 -1.720683 0.098178
3 1.006216 0.324349
I am attempting to create approximately 120 data frames based upon a list of files and dataframe names. The problem is that after the loop works, the dataframes don't persist. My code can be found below. Does anyone know why this may not be working?
for fname, dfname in zip(CSV_files, DF_names):
    filepath = find(fname, path)
    dfname = pd.DataFrame.from_csv(filepath)
This is a Python feature: rebinding a loop variable has no effect outside the loop body.
See this simpler example (comments show the outputs):
values = [1, 2, 3]

for v in values:
    print(v, end=' ')
# 1 2 3

for v in values:
    v = 4
    print(v, end=' ')
# 4 4 4

print(values)
# [1, 2, 3]
# the values have not been modified
Also look at this SO question and answer: Modifying a list iterator in Python not allowed?
The solution suggested in the comment should work better because you do not modify the iterator. If you need a name to access each dataframe, you can also use a dictionary:
dfs = {}
for fname, dfname in zip(CSV_files, DF_names):
    filepath = find(fname, path)
    # pd.DataFrame.from_csv was removed in pandas 1.0;
    # pd.read_csv is the modern equivalent
    df = pd.read_csv(filepath)
    dfs[dfname] = df

print(dfs[DF_names[1]])
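Equivalently, the dictionary can be built with a comprehension (find, CSV_files, DF_names and path are the question's own names):

dfs = {dfname: pd.read_csv(find(fname, path))
       for fname, dfname in zip(CSV_files, DF_names)}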