Creating two shifted columns in a grouped pandas DataFrame - python

I have looked all over and I still can't find an example of how to create two shifted columns in a Pandas Dataframe within its groups.
I have done it with one column as follows:
data_frame['previous_category'] = data_frame.groupby('id')['category'].shift()
But I have to do it with 2 columns, shifting one upwards and the other downwards.
Any ideas?

This is possible with a custom function and GroupBy.apply, because one column needs to be shifted down and the other up:
import pandas as pd

df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'F': list('aaabbb')
})

def f(x):
    # shift B down and C up within each group
    x['B'] = x['B'].shift()
    x['C'] = x['C'].shift(-1)
    return x

df = df.groupby('F').apply(f)
print(df)
     B    C  F
0  NaN  8.0  a
1  4.0  9.0  a
2  5.0  NaN  a
3  NaN  2.0  b
4  5.0  3.0  b
5  5.0  NaN  b
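If you prefer to avoid GroupBy.apply, a minimal sketch of the same result with two direct groupby assignments (starting again from the original df; this variant is mine, not part of the answer above):
df['B'] = df.groupby('F')['B'].shift()     # shift down within each group
df['C'] = df.groupby('F')['C'].shift(-1)   # shift up within each group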
If you want to shift both columns the same way, just pass the list of columns:
df[['B','C']] = df.groupby('F')[['B','C']].shift()
print(df)
     B    C  F
0  NaN  NaN  a
1  4.0  7.0  a
2  5.0  8.0  a
3  NaN  NaN  b
4  5.0  4.0  b
5  5.0  2.0  b

Related

Python Dataframe Duplicated Columns while Merging multiple times

I have a main dataframe and a sub dataframe. I want to merge each column of the sub dataframe into the main dataframe, with the main dataframe column as a reference. I have successfully arrived at my desired answer, except that I see duplicated copies of the main dataframe column. Below are my present and expected answers.
Present solution:
df = pd.DataFrame({'Ref': [1, 2, 3, 4]})
df1 = pd.DataFrame({'A': [2, 3], 'Z': [1, 2]})
df = [df.merge(df1[col_name], left_on='Ref', right_on=col_name, how='left') for col_name in df1.columns]
df = pd.concat(df, axis=1)
df =
   Ref    A  Ref    Z
0    1  NaN    1  1.0
1    2  2.0    2  2.0
2    3  3.0    3  NaN
3    4  NaN    4  NaN
Expected Answer:
df =
   Ref    A    Z
0    1  NaN  1.0
1    2  2.0  2.0
2    3  3.0  NaN
3    4  NaN  NaN
Update
Use duplicated:
>>> df.loc[:, ~df.columns.duplicated()]
   Ref    A    Z
0    1  NaN  1.0
1    2  2.0  2.0
2    3  3.0  NaN
3    4  NaN  NaN
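Putting it together, a minimal sketch of the whole pipeline (same df and df1 as in the question; parts and out are my names):
import pandas as pd

df = pd.DataFrame({'Ref': [1, 2, 3, 4]})
df1 = pd.DataFrame({'A': [2, 3], 'Z': [1, 2]})

parts = [df.merge(df1[c], left_on='Ref', right_on=c, how='left') for c in df1.columns]
out = pd.concat(parts, axis=1)

# keep only the first occurrence of each duplicated column label
out = out.loc[:, ~out.columns.duplicated()]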
Old answer
You can use:
# Your code
...
df = pd.concat(df, axis=1)
# Use pop and insert to clean up your dataframe
df.insert(0, 'Ref', df.pop('Ref').iloc[:, 0])
Output:
>>> df
   Ref    A    Z
0    1  NaN  1.0
1    2  2.0  2.0
2    3  3.0  NaN
3    4  NaN  NaN
What about setting the 'Ref' column as the index while building the dataframe list, and then resetting the index to get Ref back as a column?
df = pd.DataFrame({'Ref': [1, 2, 3, 4]})
df1 = pd.DataFrame({'A': [2, 3], 'Z': [1, 2]})
df = [df.merge(df1[col_name], left_on='Ref', right_on=col_name, how='left').set_index('Ref') for col_name in df1.columns]
df = pd.concat(df, axis=1)
df = df.reset_index()
   Ref    A    Z
0    1  NaN  1.0
1    2  2.0  2.0
2    3  3.0  NaN
3    4  NaN  NaN
This is a reduction process. Instead of the list comprehension, use a for loop, or even functools.reduce:
from functools import reduce

reduce(lambda x, y: x.merge(df1[y], left_on='Ref', right_on=y, how='left'), df1.columns, df)
   Ref    A    Z
0    1  NaN  1.0
1    2  2.0  2.0
2    3  3.0  NaN
3    4  NaN  NaN
The above is similar to:
for y in df1.columns:
    df = df.merge(df1[y], left_on='Ref', right_on=y, how='left')
df
   Ref    A    Z
0    1  NaN  1.0
1    2  2.0  2.0
2    3  3.0  NaN
3    4  NaN  NaN

How to append two dataframes with different column names and avoid columns with NaN values

import pandas as pd

xyarr = [[0, 1, 2], [1, 1, 3], [2, 1, 2]]
df1 = pd.DataFrame(xyarr, columns=['a', 'b', 'c'])
df2 = pd.DataFrame([['text', 'text2']], columns=['x', 'y'])
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)
df3 will have NaN values wherever a row has no data for a column:
     a    b    c     x      y
0  0.0  1.0  2.0   NaN    NaN
1  1.0  1.0  3.0   NaN    NaN
2  2.0  1.0  2.0   NaN    NaN
3  NaN  NaN  NaN  text  text2
I want to save df3 to a CSV, but without the extra commas. Any suggestions?
As pd.concat does an outer join by default, you get the NaN values in the non-overlapping columns. If you use another pandas function, e.g. .join(), which is a left join on the index by default, you can get around the problem here.
You can try using .join(), as follows:
df3 = df1.join(df2)
Result:
print(df3)
   a  b  c     x      y
0  0  1  2  text  text2
1  1  1  3   NaN    NaN
2  2  1  2   NaN    NaN
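For the CSV step, note that to_csv writes NaN as empty fields by default (na_rep=''), so the remaining gaps in x and y for rows 1 and 2 still appear as consecutive commas. A minimal sketch (df3 from the join above; 'out.csv' is a hypothetical path):
# NaN cells become empty fields in the CSV, e.g. '1,1,3,,'
df3.to_csv('out.csv', index=False)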

Compare two pandas dataframes and replace value based on condition

I have the following two pandas dataframes:
df1
   A   B   C
0  1   2   1
1  7   3   6
2  3  10  11
df2
   A  B  C
0  2  0  2
1  8  4  7
Where A, B, and C are the column headings of both dataframes.
I am trying to compare the columns of df1 to the columns of df2 such that the first row in df2 is the lower bound and the second row is the upper bound. Any value in df1 outside the lower and upper bound (column-wise) needs to be replaced with NaN.
So in this example the output should be:
     A    B    C
0  NaN    2  NaN
1    7    3    6
2    3  NaN  NaN
As a first attempt I tried df1[df1 < df2] = np.nan, but this does not work because the two frames do not align row-wise. I have also tried .where() but without success.
Would appreciate some help here, thanks.
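For reproducibility, a minimal setup matching the tables above (my construction; the question only shows the printed frames):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 7, 3], 'B': [2, 3, 10], 'C': [1, 6, 11]})
df2 = pd.DataFrame({'A': [2, 8], 'B': [0, 4], 'C': [2, 7]})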
IIUC
df = df1.where(df1.ge(df2.iloc[0]) & df1.lt(df2.iloc[1]))
     A    B    C
0  NaN  2.0  NaN
1  7.0  3.0  6.0
2  3.0  NaN  NaN
You could do something like:
lower = df1 < df2.iloc[0, :]
upper = df1 > df2.iloc[1, :]
df1[lower | upper] = np.nan
print(df1)
Output
     A    B    C
0  NaN  2.0  NaN
1  7.0  3.0  6.0
2  3.0  NaN  NaN
Here is one with df.clip and mask:
df1.mask(df1.ne(df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)))
     A    B    C
0  NaN  2.0  NaN
1  7.0  3.0  6.0
2  3.0  NaN  NaN
A slightly different approach using between (the inclusive argument is spelled out for current pandas; the original passed False positionally, which older versions accepted):
df1.apply(lambda x: x.where(x.between(*df2.values, inclusive='neither')), axis=1)
Note that the answers treat the bounds differently: ge/lt is inclusive-lower and exclusive-upper, the mask and clip versions include both bounds, and between with inclusive='neither' excludes both. The results coincide here because no value in df1 equals a bound.

Python dfply: unable to mask on multiple conditions

I am an R user learning how to use Python's dfply, the Python equivalent to R's dplyr. My problem: in dfply, I am unable to mask on multiple conditions in a pipe. I seek a solution involving dfply pipes rather than multiple lines of subsetting.
My code:
# Import
import pandas as pd
import numpy as np
from dfply import *

# Create data frame and mask it
df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5], 'b': [6, 7, 8, 9, np.nan], 'c': [5, 4, 3, 2, 1]})
df2 = (df >>
       mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)
Here is the original data frame, df:
     a    b  c
0  NaN  6.0  5
1  2.0  7.0  4
2  3.0  8.0  3
3  4.0  9.0  2
4  5.0  NaN  1
And here is the result of the piped mask, df2:
     a    b  c
0  NaN  6.0  5
4  5.0  NaN  1
However, I expect this instead:
     a    b  c
0  NaN  6.0  5
1  2.0  7.0  4
2  3.0  8.0  3
3  4.0  9.0  2
Why don't the "|" and "~" operators result in rows in which column "a" is either NaN or column "b" is not NaN?
By the way, I also tried np.logical_or():
df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5], 'b': [6, 7, 8, 9, np.nan], 'c': [5, 4, 3, 2, 1]})
df2 = (df >>
       mask(np.logical_or(X.a.isnull(), ~X.b.isnull())))
print(df)
print(df2)
But this resulted in error:
mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__
Edit: Tweak the second condition to X.b.notnull(). No idea why the tilde is ignored after the pipe.
df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5], 'b': [6, 7, 8, 9, np.nan], 'c': [5, 4, 3, 2, 1]})
df2 = (df >> mask((X.a.isnull()) | (X.b.notnull())))
print(df2)
     a    b  c
0  NaN  6.0  5
1  2.0  7.0  4
2  3.0  8.0  3
3  4.0  9.0  2
How about filter_by?
df >> filter_by((X.a.isnull()) | (X.b.notnull()))
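For comparison, a plain-pandas equivalent of the intended filter (a sketch, outside dfply; assumes the same df):
# rows where 'a' is NaN or 'b' is not NaN
df2 = df[df['a'].isna() | df['b'].notna()]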

Pandas: General Data Imputation Based on Column Dtype

I'm working with a dataset with ~80 columns, many of which contain NaN. I definitely don't want to manually inspect dtype for each column and impute based on that.
So I wrote a function to impute a column's missing values based on its dtype:
def impute_df(df, col):
    # if col is int64, impute the mean; otherwise impute the mode
    if df[col].dtype == "int64":
        df[col].fillna(df[col].mean(), inplace=True)
    else:
        df[col].fillna(df[col].mode()[0], inplace=True)
But to use this, I'd have to loop over all columns in my DataFrame, something like:
for col in train_df.columns:
    impute_df(train_df, col)
And I know looping in Pandas is generally slow. Is there a better way of going about this?
Thanks!
I think you need select_dtypes to split numeric and non-numeric columns, and then apply fillna to each filtered set:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [np.nan, 5, 4, 5, 5, 4],
                   'C': [7, 8, np.nan, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': ['a', 'a', 'b', 'b', 'b', np.nan]})
print(df)
   A    B    C  D  E    F
0  a  NaN  7.0  1  5    a
1  b  5.0  8.0  3  3    a
2  c  4.0  NaN  5  6    b
3  d  5.0  4.0  7  9    b
4  e  5.0  2.0  1  2    b
5  f  4.0  3.0  0  4  NaN
cols1 = df.select_dtypes([np.number]).columns
cols2 = df.select_dtypes(exclude=[np.number]).columns
df[cols1] = df[cols1].fillna(df[cols1].mean())
df[cols2] = df[cols2].fillna(df[cols2].mode().iloc[0])
print(df)
   A    B    C  D  E  F
0  a  4.6  7.0  1  5  a
1  b  5.0  8.0  3  3  a
2  c  4.0  4.8  5  6  b
3  d  5.0  4.0  7  9  b
4  e  5.0  2.0  1  2  b
5  f  4.0  3.0  0  4  b
I think you do not need a function here. For example:
df = pd.DataFrame({'A': [1, np.nan, 3, 4], 'A_1': [1, np.nan, 3, 4], 'B': ['A', 'A', np.nan, 'B']})
v = df.select_dtypes(exclude=['object']).columns
t = ~df.columns.isin(v)
df.loc[:, v] = df.loc[:, v].fillna(df.loc[:, v].mean().to_dict())
df.loc[:, t] = df.loc[:, t].fillna(df.loc[:, t].mode().iloc[0].to_dict())
df
Out[1440]:
          A       A_1  B
0  1.000000  1.000000  A
1  2.666667  2.666667  A
2  3.000000  3.000000  A
3  4.000000  4.000000  B
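A more compact variant of the same idea (a sketch; assumes the six-column df from the first answer, and fill is my name):
import pandas as pd

# build one fill value per column that actually has NaN:
# the mean for numeric columns, the first mode otherwise
fill = {c: (df[c].mean() if pd.api.types.is_numeric_dtype(df[c])
            else df[c].mode()[0])
        for c in df.columns[df.isna().any()]}
df = df.fillna(fill)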
