I have a pandas dataframe like:
index col1 col2 col3 col4 col5
0 a c 1 2 f
1 a c 1 2 f
2 a d 1 2 f
3 b d 1 2 g
4 b e 1 2 g
5 b e 1 2 g
If I group by two columns, like the following:
df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
I get:
col3 col4
col1 col2
a c 2 4
d 1 2
b d 1 2
e 2 4
Is it possible to convert this to:
col1 c_col3 d_col3 c_col4 d_col4 e_col3 e_col4
a 2 1 4 2 NaN NaN
b NaN 1 NaN 2 2 4
in an efficient manner where col1 is the index?
Use unstack, which creates a MultiIndex in the columns, so flattening is necessary:
df1 = df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'}).unstack()
#python 3.6+
df1.columns = [f'{j}_{i}' for i, j in df1.columns]
#python below 3.6
#df1.columns = ['{}_{}'.format(j, i) for i, j in df1.columns]
print (df1)
c_col3 d_col3 e_col3 c_col4 d_col4 e_col4
col1
a 2.0 1.0 NaN 4.0 2.0 NaN
b NaN 1.0 2.0 NaN 2.0 4.0
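A self-contained sketch of the whole pipeline, rebuilding the sample dataframe from the question:

```python
import pandas as pd

# rebuild the sample dataframe from the question
df = pd.DataFrame({'col1': list('aaabbb'),
                   'col2': list('ccddee'),
                   'col3': [1] * 6,
                   'col4': [2] * 6,
                   'col5': list('fffggg')})

# aggregate, reshape col2 into columns, then flatten the column MultiIndex
df1 = df.groupby(['col1', 'col2']).agg({'col3': 'sum', 'col4': 'sum'}).unstack()
df1.columns = [f'{j}_{i}' for i, j in df1.columns]
```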
Let's take this dataframe :
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4,5,6], Col2=[4,np.nan,5,np.nan,1,5]))
Col1 Col2
0 1.0 4.0
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
4 5.0 1.0
5 6.0 5.0
I would like to extract the last n rows of df with no NaN.
Could you please help me to get this expected result?
Col1 Col2
0 5 1
1 6 5
EDIT: Let's say I don't know where the last NaN is.
Use DataFrame.dropna with DataFrame.tail and convert to integers:
N = 2
df1 = df.dropna().tail(N).astype(int)
#alternative
#df1 = df.dropna().iloc[-N:].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
EDIT: To get the last group with no missing values, detect missing values with DataFrame.isna and DataFrame.any, then reverse the order and take the cumulative sum, so the last clean group has 0 values in the mask:
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
If no rows match, it correctly returns an empty DataFrame:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4], Col2=[np.nan,np.nan,5,np.nan]))
print (df)
Col1 Col2
0 1.0 NaN
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Empty DataFrame
Columns: [Col1, Col2]
Index: []
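A runnable check of the mask logic on the question's data; the intermediate mask shows how the reversed cumulative sum isolates the trailing clean rows:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(dict(Col1=[1, 2, np.nan, 4, 5, 6],
                       Col2=[4, np.nan, 5, np.nan, 1, 5]))

# rows with any NaN, reversed; the cumulative sum stays 0 only for the
# trailing all-valid block, so eq(0) marks exactly the last clean group
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
```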
Another way is to use isna with cumsum and drop_duplicates to get the position just after the last row containing NaN, and then use positional filtering:
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
print(new_df)
Col1 Col2
4 5.0 1.0
5 6.0 5.0
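A runnable sketch of this approach on the same data. Note the assumptions: the frame has a default RangeIndex (since index.max() + 1 is used positionally with iloc) and contains at least one row with NaN:

```python
import pandas as pd
import numpy as np

# assumes a default RangeIndex and at least one NaN row
df = pd.DataFrame(dict(Col1=[1, 2, np.nan, 4, 5, 6],
                       Col2=[4, np.nan, 5, np.nan, 1, 5]))

# cumulative NaN counts stop changing once the trailing clean block starts,
# so drop_duplicates keeps rows only up to the last row containing NaN
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
```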
I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply some condition to the column values and return the final result in a new column.
The condition is to assign values based on this order of priority, with 2 being the first priority: [2, 1, 3, 0, 4]
I tried to define a function to append the final results but wasn't really getting anywhere... any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
First you may want to get rid of the NaNs:
df = df.fillna(5)
and then apply a function to every row to find your value:
def func(x, l=[2, 1, 3, 0, 4, 5]):
    for j in l:
        if j in x:
            return j
df['new'] = df.apply(lambda x: func(list(x)), axis=1)
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
Maybe a little late, but here is another way:
import numpy as np

def f(x):
    for i in [2, 1, 3, 0, 4]:
        if i in x.tolist():
            return i
    return np.nan

df["col4"] = df.apply(f, axis=1)
and the Output:
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2
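As a self-contained sketch of the second approach, reconstructing the question's dataframe:

```python
import pandas as pd
import numpy as np

# rebuild the question's dataframe (NaN columns are float)
df = pd.DataFrame({'col1': [1, 2, 3, 2, 0],
                   'col2': [2, 3, 4, np.nan, 2],
                   'col3': [3, np.nan, np.nan, np.nan, np.nan]})

def f(x):
    # return the first value from the priority list present in the row
    for i in [2, 1, 3, 0, 4]:
        if i in x.tolist():
            return i
    return np.nan

df['col4'] = df.apply(f, axis=1)
```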
Hi, I want to join 2 or more dataframes together based on a column, let's say 'id'. The column has both shared and distinct IDs, but I want to join/merge/concat/append the dataframes together so they are all in one big dataframe.
Here is an example:
Df1:
id col1 col2
1
2
4
5
Df2:
id col3 col4
1
2
3
5
This is what I want:
Df3:
Id col1 col2 col3 col4
1
2
3
4
5
Assuming no columns overlap other than the id column, you can merge them.
df1 = pd.DataFrame({'id': [1, 2, 4, 5], 'col1': list('ABCD'), 'col2': list('EFGH')})
df2 = pd.DataFrame({'id': [1, 2, 3, 5], 'col3': list('ABCD'), 'col4': list('EFGH')})
>>> df1.merge(df2, how='outer', on='id').set_index('id').sort_index()
col1 col2 col3 col4
id
1 A E A E
2 B F B F
3 NaN NaN C G
4 C G NaN NaN
5 D H D H
Note that concatenation does not work given your example:
>>> pd.concat([df1, df2], axis=1)
col1 col2 id col3 col4 id
0 A E 1 A E 1
1 B F 2 B F 2
2 C G 4 C G 3
3 D H 5 D H 5
You can also combine the dataframes with concat if you first set the index on each. Here is a general solution for multiple dataframes:
dfs = (df1, df2) # Add other dataframes as required.
>>> pd.concat([df.set_index('id') for df in dfs], axis=1)
col1 col2 col3 col4
id
1 A E A E
2 B F B F
3 NaN NaN C G
4 C G NaN NaN
5 D H D H
Note that if you have overlapping columns in your dataframe (e.g. col2), you would end up with something like this using pd.concat:
col1 col2 col2 col4
id
1 A E A E
2 B F B F
3 NaN NaN C G
4 C G NaN NaN
5 D H D H
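Since the question mentions 2 or more dataframes, the pairwise merge can also be chained with functools.reduce; a sketch along those lines:

```python
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 4, 5], 'col1': list('ABCD'), 'col2': list('EFGH')})
df2 = pd.DataFrame({'id': [1, 2, 3, 5], 'col3': list('ABCD'), 'col4': list('EFGH')})
dfs = [df1, df2]  # add further dataframes as required

# chain outer merges on 'id' across any number of dataframes
df3 = reduce(lambda left, right: left.merge(right, how='outer', on='id'), dfs)
df3 = df3.set_index('id').sort_index()
```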
I have a dataframe like the follow:
Col1
0 C
1 A
3 D
4 A
5 A
I would like to count the number of steps (rows) until a certain value re-occurs, so I would get the following:
Col1 Col2
0 C NaN
1 A 2
3 D NaN
4 A 1
5 A NaN
Any ideas on how to do it? Thanks for the help!
Use GroupBy.cumcount and then replace 0 with NaN:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).replace(0,np.nan)
print (df)
Col1 Col2
0 C NaN
1 A 2.0
3 D NaN
4 A 1.0
5 A NaN
Alternative solution with mask:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).mask(lambda x: x == 0)
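A self-contained sketch, rebuilding the question's dataframe with its non-contiguous index:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': list('CADAA')}, index=[0, 1, 3, 4, 5])

# cumcount(ascending=False) counts the remaining occurrences of each value;
# 0 marks the last occurrence, which has no re-occurrence to report
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).replace(0, np.nan)
```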
I have two Pandas DataFrames:
df:
   col1 col2  val
0     1    a  1.2
1     2    b  3.2
2     2    a  4.2
3     1    b -1.2
df2:
   col1 col2     s
0     1    a  0.90
1     3    b  0.70
2     1    b  0.02
3     2    a  0.10
and I want to use df2['s'] and multiply it by df['val'] whenever the combination of ['col1', 'col2'] matches. If a row does not match, I don't need to do the multiplication.
I create a mapper
mapper = df2.set_index(['col1','col2'])['s']
where I get
mapper
col1 col2
1 a 0.90
3 b 0.70
1 b 0.02
2 a 0.10
Name: s, dtype: float64
and I want to match it with df[['col1','col2']]
df[['col1','col2']]
col1 col2
0 1 a
1 2 b
2 2 a
3 1 b
but when I do the mapping
mapped_value = df[['col1','col2']].map(mapper)
I get the following error
AttributeError: 'DataFrame' object has no attribute 'map'
any hint?
I think you need mul:
df = df2.set_index(['col1','col2'])['s'].mul(df.set_index(['col1','col2'])['val'])
print (df)
col1 col2
1 a 1.080
b -0.024
2 a 0.420
b NaN
3 b NaN
dtype: float64
If you need to avoid NaN for non-matching rows, use fill_value=1:
df = df2.set_index(['col1','col2'])['s'].mul(df.set_index(['col1','col2'])['val'], fill_value=1)
print (df)
col1 col2
1 a 1.080
b -0.024
2 a 0.420
b 3.200
3 b 0.700
dtype: float64
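The map-style lookup the question was after can be sketched with Series.reindex over the (col1, col2) pairs, which keeps df's shape and row order (assuming pandas >= 0.24 for MultiIndex.from_frame):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 1], 'col2': list('abab'),
                   'val': [1.2, 3.2, 4.2, -1.2]})
df2 = pd.DataFrame({'col1': [1, 3, 1, 2], 'col2': list('abba'),
                    's': [0.90, 0.70, 0.02, 0.10]})

# look up s for each (col1, col2) pair of df; NaN where there is no match
s = df2.set_index(['col1', 'col2'])['s']
mapped = s.reindex(pd.MultiIndex.from_frame(df[['col1', 'col2']]))

# multiply only matching rows; unmatched rows keep their original val
df['val'] = df['val'].to_numpy() * mapped.fillna(1).to_numpy()
```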