How to use groupby + transform instead of pipe? - python

Let's say I have a dataframe like this
import pandas as pd
from scipy import stats
df = pd.DataFrame(
    {
        'group': list('abaab'),
        'val1': range(5),
        'val2': range(2, 7),
        'val3': range(4, 9)
    }
)
group val1 val2 val3
0 a 0 2 4
1 b 1 3 5
2 a 2 4 6
3 a 3 5 7
4 b 4 6 8
Now I want to calculate linear regressions for each group in column group using two of the val columns (potentially all pairs, so I don't want to hard-code column names anywhere).
A potential implementation based on pipe could look like this:
def do_lin_reg_pipe(df, group_col, col1, col2):
    group_names = df[group_col].unique()
    df_subsets = []
    for s in group_names:
        df_subset = df.loc[df[group_col] == s]
        x = df_subset[col1].values
        y = df_subset[col2].values
        slope, intercept, r, p, se = stats.linregress(x, y)
        df_subset = df_subset.assign(
            slope=slope,
            intercept=intercept,
            r=r,
            p=p,
            se=se
        )
        df_subsets.append(df_subset)
    return pd.concat(df_subsets)
and then I can use
df_linreg_pipe = (
    df.pipe(do_lin_reg_pipe, group_col='group', col1='val1', col2='val3')
    .assign(p=lambda d: d['p'].round(3))
)
which gives the desired outcome
group val1 val2 val3 slope intercept r p se
0 a 0 2 4 1.0 4.0 1.0 0.0 0.0
2 a 2 4 6 1.0 4.0 1.0 0.0 0.0
3 a 3 5 7 1.0 4.0 1.0 0.0 0.0
1 b 1 3 5 1.0 4.0 1.0 0.0 0.0
4 b 4 6 8 1.0 4.0 1.0 0.0 0.0
What I don't like is that I have to loop through the groups, append to a list and then concat, so I thought I should somehow use groupby and transform, but I can't get this to work. The function call should be something like
df_linreg_transform = df.copy()
df_linreg_transform[['slope', 'intercept', 'r', 'p', 'se']] = (
    df.groupby('group').transform(do_lin_reg_transform, col1='val1', col2='val3')
)
The question is how to define do_lin_reg_transform; I would like to have something along these lines:
def do_lin_reg_transform(df, col1, col2):
    x = df[col1].values
    y = df[col2].values
    slope, intercept, r, p, se = stats.linregress(x, y)
    return (slope, intercept, r, p, se)
but that - of course - crashes with a KeyError
KeyError: 'val1'
How could one implement do_lin_reg_transform to make it work with groupby and transform?

Since you can't use groupby + transform here (you need extra columns to compute the result), the idea is to use groupby + apply together with map to broadcast each group's result to its rows:
cols = ['slope', 'intercept', 'r', 'p', 'se']
lingress = lambda x: stats.linregress(x['val1'], x['val3'])
df[cols] = pd.DataFrame.from_records(df['group'].map(df.groupby('group').apply(lingress)))
print(df)
# Output
group val1 val2 val3 slope intercept r p se
0 a 0 2 4 1.0 4.0 1.0 9.003163e-11 0.0
1 b 1 3 5 1.0 4.0 1.0 0.000000e+00 0.0
2 a 2 4 6 1.0 4.0 1.0 9.003163e-11 0.0
3 a 3 5 7 1.0 4.0 1.0 9.003163e-11 0.0
4 b 4 6 8 1.0 4.0 1.0 0.000000e+00 0.0
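If you want to avoid hard-coding the column names inside the lambda (the question mentions wanting to run this for arbitrary column pairs), a small wrapper can build the regression function from parameters. This is only a sketch of the same approach; make_linregress is an illustrative name, not part of the original answer.
def make_linregress(col1, col2):
    # build a per-group regression function for an arbitrary pair of columns
    return lambda g: stats.linregress(g[col1], g[col2])

cols = ['slope', 'intercept', 'r', 'p', 'se']
df[cols] = pd.DataFrame.from_records(
    df['group'].map(df.groupby('group').apply(make_linregress('val1', 'val3')))
)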

Transform is meant to produce a result for one column at a time. A regression requires multiple columns, so you should use apply.
If you wanted, you could define your aggregation to return a DataFrame as opposed to a Series (so the result doesn't reduce). For this to work, you'd want to make sure your index is unique. Then concat the result back so it aligns on the index. You won't have any issues if there's more than one grouping column.
def group_reg(gp, col1, col2):
    # repeat the group's regression result once per row so the shape is preserved
    df = pd.DataFrame([stats.linregress(gp[col1], gp[col2])] * len(gp),
                      columns=['slope', 'intercept', 'r', 'p', 'se'],
                      index=gp.index)
    return df
pd.concat([df, df.groupby('group').apply(group_reg, col1='val1', col2='val3')], axis=1)
group val1 val2 val3 slope intercept r p se
0 a 0 2 4 1.0 4.0 1.0 9.003163e-11 0.0
1 b 1 3 5 1.0 4.0 1.0 0.000000e+00 0.0
2 a 2 4 6 1.0 4.0 1.0 9.003163e-11 0.0
3 a 3 5 7 1.0 4.0 1.0 9.003163e-11 0.0
4 b 4 6 8 1.0 4.0 1.0 0.000000e+00 0.0
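An alternative to broadcasting inside the group function is to compute one row of statistics per group and merge it back on the group key. This is a sketch under the same setup; group_linreg and df_linreg_merge are illustrative names.
def group_linreg(g, col1, col2):
    # one row of regression statistics per group
    slope, intercept, r, p, se = stats.linregress(g[col1], g[col2])
    return pd.Series({'slope': slope, 'intercept': intercept,
                      'r': r, 'p': p, 'se': se})

group_stats = df.groupby('group').apply(group_linreg, col1='val1', col2='val3')
df_linreg_merge = df.merge(group_stats, left_on='group', right_index=True)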

Related

Match multiple csv columns entries to another csv and extract data in python

I am trying to match csv entries and extract data but am stuck. My csv files are in this format:
df1 looks like this:
type prediction ax ay az
df2 looks like this:
type ax ay az x y z fx fy fz
I would like to first match df1 and df2. For this, I need to match ax, ay and az all together with df2. Matching only a single column can give me the wrong dataframe because entries are repeated.
After matching multiple columns with df2, I would like to extract those values and make a dataframe with df1.
Expected dataframe:
type prediction ax ay az x y z
df1 and df2 don't have the same size. Actually, df2 is a huge file, which is why I want to extract only the required dataset.
This is my code:
def match_dataset(df1, df2):
    df1_new = pd.DataFrame(columns=['x','y','z','fx','fy','fz','az','ax','ay'])
    df2_new = pd.DataFrame(columns=['x','y','z','fx','fy','fz','az','ax','ay'])
    for i in range(len(df1)):
        for j in range(len(df2)):
            if df1.iloc[i]['az'] == df2.iloc[j]['az'] and df1.iloc[i]['ay'] == df2.iloc[j]['ay'] and df1.iloc[i]['ax'] == df2.iloc[j]['ax']:
                df1_new = df1_new.append(df2.iloc[j], ignore_index=True)
                #df2_new = df2_new.append(df2.iloc[j], ignore_index=True)
    return df1_new
data = match_dataset(df1, df2)
print(data.head())
But my code gets stuck in the loops and doesn't give me any output.
Can I get some help?
Thank you.
I think you can use df1.merge() to get the desired output. Here, I'm creating two dfs with your columns and some random digits in each column:
import pandas as pd
import numpy as np
df1_cols = ['type', 'prediction', 'ax', 'ay', 'az']
df2_cols = ['type', 'ax', 'ay', 'az', 'x', 'y', 'z', 'fx', 'fy', 'fz']
df1 = pd.DataFrame(np.random.randint(10,size=(4,len(df1_cols))), columns = df1_cols)
df2 = pd.DataFrame(np.random.randint(10,size=(1000,len(df2_cols))), columns = df2_cols)
In this example, df1 came out like this:
type prediction ax ay az
0 8 1 8 2 7
1 3 0 5 4 5
2 7 3 0 0 2
3 2 5 3 5 7
Now apply merge:
df1_new = df1.merge(df2, on=['az','ay','ax'], how='left')
print(df1_new)
result:
type_x prediction ax ay az type_y x y z fx fy fz
0 8 1 8 2 7 3.0 0.0 2.0 6.0 7.0 8.0 9.0
1 3 0 5 4 5 NaN NaN NaN NaN NaN NaN NaN
2 7 3 0 0 2 NaN NaN NaN NaN NaN NaN NaN
3 2 5 3 5 7 9.0 8.0 4.0 0.0 3.0 3.0 1.0
4 2 5 3 5 7 9.0 9.0 1.0 3.0 6.0 5.0 9.0
Apparently, in this random example, we found 2 matches for df1.iloc[3], but zero for df1.iloc[1:3], hence the NA values. We can simply drop these from the df with df1_new.dropna(inplace=True). Finally, reset the index: df1_new.reset_index(drop=True, inplace=True):
type_x prediction ax ay az type_y x y z fx fy fz
0 8 1 8 2 7 3.0 0.0 2.0 6.0 7.0 8.0 9.0
1 2 5 3 5 7 9.0 8.0 4.0 0.0 3.0 3.0 1.0
2 2 5 3 5 7 9.0 9.0 1.0 3.0 6.0 5.0 9.0
To select only the columns mentioned in your def, you can use:
df1_new[['x','y','z','fx','fy','fz','az','ax','ay']]
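If you only want the rows of df1 that actually found a match, an inner join gives that in one step and makes the separate dropna unnecessary; a brief sketch of the same merge call:
# an inner join keeps only rows of df1 with a matching (ax, ay, az) in df2
df1_new = (
    df1.merge(df2, on=['ax', 'ay', 'az'], how='inner')
       .reset_index(drop=True)
)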

Summarize data from a list of pandas dataframes

I have a list of dfs, df_list:
[ CLASS IDX A B C D
0 1 1 1.0 0.0 0.0 0.0
1 1 2 1.0 0.0 0.0 0.0
2 1 3 1.0 0.0 0.0 0.0,
CLASS IDX A B C D
0 1 1 NaN NaN NaN NaN
1 1 2 1.0 0.0 0.0 0.0
2 1 3 1.0 0.0 0.0 0.0,
CLASS IDX A B C D
0 1 1 0.900 0.100 0.0 0.0
1 1 2 1.000 0.000 0.0 0.0
2 1 3 0.999 0.001 0.0 0.0]
I would like to summarize the data into one df based on conditions and values in the individual dfs. Each df has 4 columns of interest, A, B, C and D. If the value in for example column A is >= 0.1 in df_list[0], I want to print 'A' in the summary df. If two columns, for example A and B, have values >= 0.1, I want to print 'A/B'. The final summary df for this data should be:
CLASS IDX 0 1 2
0 1 1 A NaN A/B
1 1 2 A A A
2 1 3 A A A
In the summary df, the column labels (0,1,2) represent the position of the df in the df_list.
I am starting with this
for index, values in enumerate(df_list):
    # summarize the data
But I'm not sure what would be the best way to continue.
Any help greatly appreciated!
Here is one approach:
import numpy as np

cols = ['A','B','C','D']

def join_func(df):
    # replace cells >= 0.1 with their column name, everything else with NaN,
    # then join the surviving names per row
    m = df[cols].ge(0.1)
    return (df[cols].mask(m, cols)
                    .where(m, np.nan)
                    .apply(lambda x: '/'.join(x.dropna()), axis=1))

result = (df_list[0].loc[:, ['CLASS', 'IDX']]
                    .assign(**{str(i): join_func(df)
                               for i, df in enumerate(df_list)}))
print(result)
CLASS IDX 0 1 2
0 1 1 A A/B
1 1 2 A A A
2 1 3 A A A
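Note that rows where no column reaches 0.1 come out as an empty string rather than the NaN shown in the expected output. If you need a real NaN there, a hedged follow-up is to replace empty strings afterwards:
import numpy as np
# turn empty joins (no column >= 0.1) into NaN to match the expected output
result = result.replace('', np.nan)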

Perform custom function on each row of Dataframe, while ignoring the first col

I am new to Python and still trying to figure things out. I am not sure how to approach this.
I am trying to apply a custom function that calculates a percentile while ignoring the first column, as it is a string, and I also want to use only the last 3 data points of each row.
I tried using the df.rolling function from pandas, but was not successful in its implementation. Thanks for the help in advance.
import pandas as pd
import numpy as np

# df = pd.read_csv('data/imp_vol.csv')
df = pd.DataFrame({"A": ['a', 3, None, 4, 2, 4],
                   "B": ['b', 2, 4, 3, 2, 5],
                   "C": ['c', 3, 8, 5, 4, None],
                   "D": ['d', 2, None, 4, 2, None]})
df['heading'] = ['a', 'b', 'c', 'd', 'e', 'f']
new_order = [-1, 0, 1, 2, 3]
df = df[df.columns[new_order]]
df = df.replace(np.nan, 0)
df.update(df.iloc[:, -4:].mask(lambda x: x.isin([0, '0'])).ffill(axis=1))

def perc_func(r):
    x = r
    last_val = x[-1]
    min_val = x.min()
    max_val = x.max()
    percentile = ((last_val - min_val) / (max_val - min_val) * 100)
    return percentile

df['Percentile'] = df.apply(lambda row: perc_func(row), axis=1)
print(df)
The sample output I am after is below (the data is only a placeholder for the Percentile column):
heading A B C D Percentile
0 a 1.0 3.0 2.0 4.0 45
1 b 3.0 2.0 3.0 2.0 44
2 c 0.0 4.0 8.0 8.0 32
3 d 4.0 3.0 5.0 4.0 48
4 e 2.0 2.0 4.0 2.0 59
5 f 4.0 5.0 5.0 5.0 59
You want to apply your custom function on all rows of df except the 1st, and also exclude the 1st column:
In [1157]: df['Percentile'] = df.iloc[1:, 1:].apply(perc_func, 1)
In [1158]: df
Out[1158]:
heading A B C D Percentile
0 a a b c d NaN
1 b 3 2 3 2 0.0
2 c 0 4 8 8 100.0
3 d 4 3 5 4 50.0
4 e 2 2 4 2 0.0
5 f 4 5 5 5 100.0
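The question also asks to use only the last 3 data points of each row, which the call above doesn't address. One hedged way to do that is a small variant of the function that slices each row positionally first; perc_func_last3 is a hypothetical helper, not from the original code:
def perc_func_last3(r):
    # use only the row's last 3 values; coerce to numbers since the columns are object dtype
    x = pd.to_numeric(r.iloc[-3:], errors='coerce')
    rng = x.max() - x.min()
    if rng == 0:
        return np.nan  # percentile is undefined when all three values are equal
    return (x.iloc[-1] - x.min()) / rng * 100

df['Percentile'] = df.iloc[1:, 1:].apply(perc_func_last3, axis=1)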

Perform arithmetic operations on null values

When I try to do an arithmetic operation involving two or more columns, I face a problem with null values.
One more thing I want to mention here: I don't want to fill the missing/null values.
Actually I want something like 1 + np.nan = 1, but it gives np.nan. I tried to solve it with np.nansum but it didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df["c"] = df.a + df.b
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
The np.nansum here calculated the sum of the entire column. You do not want that; you probably want to call np.nansum on the two columns, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yields the expected:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
Simply use DataFrame.sum over axis=1:
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
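Another common idiom for getting 1 + NaN = 1 row-wise is Series.add with fill_value, which treats a missing operand as 0 for that operation only (if both values are NaN, the result is still NaN):
# a + b, treating a missing value in either column as 0
df['c'] = df['a'].add(df['b'], fill_value=0)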

How to implement sql coalesce in pandas

I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D'. Expected output is
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
... return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
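For the question's expected column D, the same helper can be used directly; an illustrative use of the coalesce function defined above:
df['D'] = coalesce(df['A'], df['B'], df['C'])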
I think you need bfill with selecting the first column by iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
same as:
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 1
pandas
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Naive time test (benchmark plots over the given data and over larger data omitted).
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or just stack them if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
