I need to apply a function to a DataFrame, and I used pandarallel to parallelize the process. However, I have an issue: I need to pass fun_do N rows per call so that I can take advantage of vectorization inside that function.
The following calls fun_do once per row. Any idea how to make a single call per batch while keeping the parallelization?
def fun_do(value_col):
    return do(value_col)

df['processed_col'] = df.parallel_apply(lambda row: fun_do(row['col']), axis=1)
A possible solution is to create virtual groups of N rows:
import numpy as np
import pandas as pd
from pandarallel import pandarallel

# Setup MRE
pandarallel.initialize(progress_bar=False)
df = pd.DataFrame({'col1': np.linspace(0, 100, 11)})

def fun_do(sr):
    return sr**2

N = 4  # size of chunk
df['col2'] = (df.groupby(pd.RangeIndex(len(df)) // N)
                .parallel_apply(lambda x: fun_do(x['col1']))
                .droplevel(0))  # <- remove virtual group index
Output:
>>> df
     col1     col2
0     0.0      0.0
1    10.0    100.0
2    20.0    400.0
3    30.0    900.0
4    40.0   1600.0
5    50.0   2500.0
6    60.0   3600.0
7    70.0   4900.0
8    80.0   6400.0
9    90.0   8100.0
10  100.0  10000.0
Note: I don't know why groupby(...)['col'].parallel_apply(fun_do) doesn't work. It seems parallel_apply is not available on a SeriesGroupBy.
This is the first time I have used pandarallel; I usually use the multiprocessing module.
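For comparison, a chunked version with the standard multiprocessing module might look roughly like this (a sketch using the fun_do from the MRE above; splitting with np.array_split is my choice, not part of the original answer):
import numpy as np
import pandas as pd
from multiprocessing import Pool

def fun_do(sr):
    return sr ** 2  # vectorized work on a whole chunk

if __name__ == '__main__':
    df = pd.DataFrame({'col1': np.linspace(0, 100, 11)})
    N = 4  # size of chunk
    # Split the column into chunks of N rows and process them in parallel.
    chunks = np.array_split(df['col1'], np.arange(N, len(df), N))
    with Pool() as pool:
        df['col2'] = pd.concat(pool.map(fun_do, chunks))
    print(df)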
I am working in pandas and want to implement an algorithm that requires assessing a modified centered median over a window, omitting the middle value. So for instance the unmodified version might be:
ser = pd.Series(data=[0.,1.,2.,4.5,5.,6.,8.,9])
med = ser.rolling(5,center=True).median()
print(med)
and I would like the result for med[3] to be 3.5 (the median of 1., 2., 5., 6.) rather than 4.5, which is the ordinary windowed median. Is there an economical way to do this?
Try:
import numpy as np
import pandas as pd
ser = pd.Series(data=[0.,1.,2.,4.5,5.,6.,8.,9])
med = ser.rolling(5).apply(lambda x: np.median(np.concatenate([x[0:2],x[3:5]]))).shift(-2)
print(med)
With output:
0 NaN
1 NaN
2 2.75
3 3.50
4 5.25
5 6.50
6 NaN
7 NaN
And more generally:
rolling_size = 5
ser.rolling(rolling_size).apply(lambda x: np.median(np.concatenate([x[0:int(rolling_size/2)],x[int(rolling_size/2)+1:rolling_size]]))).shift(-int(rolling_size/2))
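For readability, the same idea can be wrapped in a small helper (a sketch; the function name is mine, the window is assumed to be odd, and raw=True simply passes NumPy arrays to the lambda):
def rolling_median_omit_center(s, window=5):
    # Rolling median over `window` values with the center value removed (window assumed odd).
    half = window // 2
    return (s.rolling(window)
             .apply(lambda x: np.median(np.concatenate([x[:half], x[half + 1:window]])), raw=True)
             .shift(-half))

print(rolling_median_omit_center(ser, 5))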
ser = pd.Series(data=[0.,1.,2.,4.5,5.,6.,8.,9])

def median(series, window=2):
    df = pd.DataFrame(series[window:].reset_index(drop=True))
    df[1] = series[:-window]
    df = df.apply(lambda x: x.mean(), axis=1)
    df.index += window - 1
    return df

median(ser)
I think this is simpler.
I have a simulation that uses pandas DataFrames to describe objects in a hierarchy. To achieve this, I have used a MultiIndex to show the route to a child object.
Parent df
        par_val
a b
0 0.0  0.366660
  1.0  0.613888
1 2.0  0.506531
  3.0  0.327356
2 4.0  0.684335
  0.0  0.013800
3 1.0  0.590058
  2.0  0.179399
4 3.0  0.790628
  4.0  0.310662
Child df
         child_val
a b   c
0 0.0 0   0.528217
  1.0 0   0.515479
1 2.0 0   0.719221
  3.0 0   0.785008
2 4.0 0   0.249344
  0.0 0   0.455133
3 1.0 0   0.009394
  2.0 0   0.775960
4 3.0 0   0.639091
  4.0 0   0.150854
0 0.0 1   0.319277
  1.0 1   0.571580
1 2.0 1   0.029063
  3.0 1   0.498197
2 4.0 1   0.424188
  0.0 1   0.572045
3 1.0 1   0.246166
  2.0 1   0.888984
4 3.0 1   0.818633
  4.0 1   0.366697
This implies that objects (0, 0, 0) and (0, 0, 1) in the child DataFrame are both characterised by the values at (0, 0) in the parent DataFrame.
When a function is performed on the child DataFrame for a certain subset of 'a', it may therefore need to grab a value from the parent DataFrame. My current solution locates the value from the parent DataFrame by index within the solution function:
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt

r = range(10, 1000, 10)
dt = []
for i in r:
    start = time.time()

    df_par = pd.DataFrame(
        {'a': np.repeat(np.arange(5), i // 5),  # integer repeat count
         'b': np.append(np.arange(i / 2), np.arange(i / 2)),
         'par_val': np.random.rand(i)
         }).set_index(['a', 'b'])

    df_child = pd.concat([df_par[[]]] * 2, keys=[0, 1], names=['c']) \
        .reorder_levels(['a', 'b', 'c'])
    df_child['child_val'] = np.random.rand(i * 2)
    df_child['solution'] = np.nan

    def solution(row, df_par, var):
        data_level = len(df_par.index.names)
        index_filt = tuple([row.name[i] for i in range(data_level)])
        sol = df_par.loc[index_filt, 'par_val'] / row.child_val
        return sol

    a_mask = df_child.index.get_level_values('a') == 0
    df_child.loc[a_mask, 'solution'] = df_child.loc[a_mask].apply(solution,
                                                                  df_par=df_par,
                                                                  var=10,
                                                                  axis=1)

    stop = time.time()
    dt.append(stop - start)

plt.plot(r, dt)
plt.show()
The solution function is becoming very costly for large numbers of iterations in the simulation:
(iterations (x) vs time in seconds (y))
Is there a more efficient way of calculating this? I have considered including 'par_val' in the child df, but I was trying to avoid this because the very large amount of repetition reduces the number of simulations I can fit in RAM.
par_val is a float64, which takes 8 bytes per value. If the child data frame has 1 million rows, that's 8 MB of memory (before the OS's memory compression feature kicks in). If it has 1 billion rows, then yes, I would worry about the memory impact.
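If you want to check the actual footprint before deciding, something like this would show it (a sketch using the df_par and df_child from the question):
# Per-column memory usage in bytes; deep=True also counts object dtypes.
print(df_par.memory_usage(deep=True))
print(df_child.memory_usage(deep=True))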
The bigger performance bottleneck, though, is your df_child.loc[a_mask].apply(..., axis=1) line. This makes pandas use a slow Python-level loop instead of much faster vectorized code. In SQL, the loop approach is called "row-by-agonizing-row" and it's an anti-pattern. You generally want to avoid .apply(..., axis=1) for this reason.
Here's one way to improve the performance without changing df_par or df_child:
# Select the child rows of interest and drop the 'c' level so the index
# matches df_par's (a, b) index.
a_mask = df_child.index.get_level_values('a') == 0
child_val = df_child.loc[a_mask, 'child_val'].droplevel(-1)

# Vectorized, index-aligned division instead of a per-row lookup.
solution = df_par.loc[child_val.index, 'par_val'] / child_val
df_child.loc[a_mask, 'solution'] = solution.to_numpy()
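As a quick sanity check (a sketch, assuming df_par and df_child from the question are in scope), the vectorized result should match the original row-wise computation:
# Recompute row-by-row for the same mask and compare.
expected = df_child.loc[a_mask].apply(
    lambda row: df_par.loc[row.name[:2], 'par_val'] / row.child_val, axis=1)
assert np.allclose(df_child.loc[a_mask, 'solution'], expected)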
Before and after timing plots (images not reproduced here).
Neither .size() nor .count() seems to produce a single count column when applied to data produced with pd.cut().
This may only be a problem of syntax, but I have tried .size(), .count(), and .describe(), and I get multiple columns with a group count, but not one single count column.
# python 2.7
import pandas as pd
import numpy as np

np.random.seed(seed=1)
df = pd.DataFrame({"var1": np.random.random(100),
                   "var2": np.random.random(100) + 5})

# Bin the data frame by "var1" with 10 bins...
df = df.groupby(pd.cut(df.var1, 10)).describe().var2[['mean', 'count']]
df = df.reset_index()
print df
#Results:
var1 mean count
0 (-0.000874, 0.099] 5.546257 11.0
1 (0.099, 0.198] 5.434613 12.0
2 (0.198, 0.297] 5.483686 9.0
3 (0.297, 0.396] 5.313241 6.0
4 (0.396, 0.494] 5.537168 13.0
5 (0.494, 0.593] 5.518476 10.0
6 (0.593, 0.692] 5.614630 10.0
7 (0.692, 0.791] 5.443415 10.0
8 (0.791, 0.89] 5.464804 7.0
9 (0.89, 0.989] 5.418756 12.0
#Updated the posted question with code that provides the desired answer.
If that is the case, you need transform with groupby():
df['cnt']=df.groupby(pd.cut(df.var1, 10))['var2'].transform('count')
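If the goal is a single count column alongside the mean, agg is another option (a sketch, not from the original answer, assuming df is still the original frame with var1 and var2):
out = df.groupby(pd.cut(df.var1, 10))['var2'].agg(['mean', 'count']).reset_index()
print(out)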
I have a pandas DataFrame and I want to calculate, on a rolling basis, the average of all the values: across all the columns, for all the observations in the rolling window.
I have a solution with loops but it feels very inefficient. Note that I can have NaNs in my data, so calculating the sum and dividing by the size of the window would not be safe (as I want a nanmean).
Any better approach?
Setup
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=['A', 'B'])
df[df>5] = np.nan # EDIT: add nans
My Attempt
n_roll = 2
df_stacked = df.values
roll_avg = {}
for idx in range(n_roll, len(df_stacked) + 1):
    roll_avg[idx - 1] = np.nanmean(df_stacked[idx - n_roll:idx, :].flatten())
roll_avg = pd.Series(roll_avg)
roll_avg.index = df.index[n_roll-1:]
roll_avg = roll_avg.reindex(df.index)
Desired Result
roll_avg
Out[33]:
0 NaN
1 5.000000
2 1.666667
3 0.333333
4 1.000000
5 3.000000
6 3.250000
7 3.250000
8 3.333333
9 4.000000
Thanks!
Here's one NumPy solution using sliding windows from skimage's view_as_windows -
from skimage.util.shape import view_as_windows
# Setup o/p array
out = np.full(len(df),np.nan)
# Get sliding windows of length n_roll along axis=0
w = view_as_windows(df.values,(n_roll,1))[...,0]
# Assign nan-ignored mean values computed along last 2 axes into o/p
out[n_roll-1:] = np.nanmean(w, (1,2))
Memory efficiency with views -
In [62]: np.shares_memory(df,w)
Out[62]: True
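To get the result back as a Series aligned with df (a small addition, not part of the original answer):
roll_avg = pd.Series(out, index=df.index)
print(roll_avg)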
To get the same result in the presence of NaN, you can use column_stack on all the df.shift(i).values for i in range(n_roll), take the nanmean on axis=1, and then replace the first n_roll-1 values with NaN afterwards:
roll_avg = pd.Series(np.nanmean(np.column_stack([df.shift(i).values for i in range(n_roll)]),1))
roll_avg[:n_roll-1] = np.nan
and with the second input (containing NaN), you get, as expected:
0 NaN
1 5.000000
2 1.666667
3 0.333333
4 1.000000
5 3.000000
6 3.250000
7 3.250000
8 3.333333
9 4.000000
dtype: float64
Using the answer referenced in the comment, one can do:
wsize = n_roll
cols = df.shape[1]
# Stack row-wise, roll over wsize * cols flattened values, keep the last value per row.
out = (df.stack(dropna=False).rolling(window=wsize * cols, min_periods=1).mean()
         .reset_index(-1, drop=True).sort_index())
out = out.groupby(out.index).last()
out.iloc[:n_roll - 1] = np.nan
In my case it was important to specify dropna=False in stack(); otherwise the length of the rolling window would not be correct.
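A quick way to see why (a sketch using the df from the setup above):
# stack() drops NaN cells by default, which would shorten the flattened series
# and throw off the window length.
print(len(df.stack()), len(df.stack(dropna=False)))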
But I am looking forward to other approaches as this does not feel very elegant/efficient.
I have read in some data with pandas and I want to add a column after the last column. After I did, the problem is that its values start from zero, and I want them to start from one.
I have 12800 rows and I want the added column to go from 1 to 100 and then start over from 1 to 100 again, repeating this pattern for all the rows. So basically I want this cycle from 1 to 100 to repeat 128 times. Can anyone tell me how I can do this?
import numpy as np
import pandas as pd
df = pd.read_csv('...csv')
df1=pd.DataFrame(df.values.reshape(12800, -1))
df1['10'] = df1.index
The included picture is not what I want: the last column (number 10) should start from one and follow the pattern I described above.
To repeat a pattern of 1..100 and assign to a column you can do:
df['1_to_100'] = np.tile(
    np.arange(1, 101), int(len(df) * 0.01) + 1)[:len(df)]
To add a pattern that is 100 per step you can do:
df['by_100'] = np.floor(df.index / 100)
Test Code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(2002, 3))
df['1_to_100'] = np.tile(
    np.arange(1, 101), int(len(df) * 0.01) + 1)[:len(df)]
df['by_100'] = np.floor(df.index / 100)
print(df.head())
print(df.tail())
Results:
          0         1         2  1_to_100  by_100
0  0.301862  0.824019  0.267810         1     0.0
1  0.568186  0.040328  0.799634         2     0.0
2  0.887218  0.407702  0.351990         3     0.0
3  0.871072  0.583761  0.498725         4     0.0
4  0.169657  0.026824  0.446667         5     0.0

             0         1         2  1_to_100  by_100
1997  0.370640  0.662019  0.541747        98    19.0
1998  0.545908  0.682259  0.970764        99    19.0
1999  0.416177  0.665771  0.926145       100    19.0
2000  0.207109  0.762653  0.813754         1    20.0
2001  0.711998  0.236817  0.025387         2    20.0