Groupby and transpose in pandas, python

DataFrame I have:
ID col col2 col3 col4
 1   A   50    S    1
 1   A   52    M    4
 1   B   45    N    8
 1   C   18    S    7
DataFrame I want:
ID col colA colB colC colD colE colF
 1   A   50   52    S    M    1    4
 1   B   45 NULL    N NULL    8 NULL
 1   C   18 NULL    S NULL    7 NULL
I want one line per unique ID+col (groupby ID and col).
If there are multiple entries per ID+col (at most 2, never more), put the first value of col2 in colA and the second in colB, the first value of col3 in colC and the second in colD, and the first value of col4 in colE and the second in colF. If there is only one entry per ID+col, put the col2 value in colA and leave colB null, and so on.
I tried to first create a counter:
df['COUNT'] = df.groupby(['ID','col']).cumcount()+1
From here I was thinking of just adding a column to say
if count=1 then df['colA']=df.col2
if count=2 then df['colB']=df.col2
... but this will still result in the same number of rows as the original df.
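For reference, the example frame used below can be built like this (a minimal sketch; names taken from the tables above):
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID':   [1, 1, 1, 1],
                   'col':  ['A', 'A', 'B', 'C'],
                   'col2': [50, 52, 45, 18],
                   'col3': ['S', 'M', 'N', 'S'],
                   'col4': [1, 4, 8, 7]})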

I think you need set_index with unstack:
df['COUNT'] = df.groupby(['ID','col']).cumcount()+1
df = df.set_index(['ID','col', 'COUNT'])['col2'].unstack().add_prefix('col').reset_index()
print (df)
COUNT  ID col  col1  col2
0       1   A  50.0  52.0
1       1   B  45.0   NaN
2       1   C  18.0   NaN
Or pass the counter Series directly to set_index, without creating a helper column:
c = df.groupby(['ID','col']).cumcount()+1
df = df.set_index(['ID','col', c])['col2'].unstack().add_prefix('col').reset_index()
print (df)
  ID col  col1  col2
0  1   A  50.0  52.0
1  1   B  45.0   NaN
2  1   C  18.0   NaN
EDIT:
For multiple columns the solution changes a bit, because unstack now creates a MultiIndex in the columns:
df['COUNT'] = (df.groupby(['ID','col']).cumcount()+1).astype(str)
#no column is selected, so all value columns (col2, col3, col4) are unstacked
df = df.set_index(['ID','col', 'COUNT']).unstack()
#flatten the MultiIndex in columns
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
  ID col col2_1 col2_2 col3_1 col3_2 col4_1 col4_2
0  1   A   50.0   52.0      S      M    1.0    4.0
1  1   B   45.0    NaN      N   None    8.0    NaN
2  1   C   18.0    NaN      S   None    7.0    NaN
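If you want the exact colA...colF names from the question, a plain rename after flattening does it (a sketch; the mapping assumes the col2_1 ... col4_2 names produced above):
df = df.rename(columns={'col2_1': 'colA', 'col2_2': 'colB',
                        'col3_1': 'colC', 'col3_2': 'colD',
                        'col4_1': 'colE', 'col4_2': 'colF'})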

You can use groupby with apply(pd.Series):
df.groupby(['ID','col']).col2.apply(list).apply(pd.Series).add_prefix('col').reset_index()
Out[404]:
  ID col  col0  col1
0  1   A  50.0  52.0
1  1   B  45.0   NaN
2  1   C  18.0   NaN

Not sure if this is exactly what you are looking for, but it produces the result you describe. Note that I am using multiple aggregate functions on the same column, hence the ravel call to flatten the resulting columns.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': [1, 1, 1, 1],
                   'Col1': ['A', 'A', 'B', 'C'],
                   'Col2': [50, 52, 45, 18]})
df = df.groupby(['ID','Col1']).agg({'Col2':['first','last']})
df.columns = ["_".join(x) for x in df.columns.ravel()]
df = df.reset_index()
df['Col2_last'] = np.where(df.Col2_first == df.Col2_last, float('nan'), df.Col2_last)
print(df)
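One caveat with the np.where line: it also blanks Col2_last when a group genuinely contains two equal values. Comparing the group size instead avoids that (a sketch):
agg = df.groupby(['ID', 'Col1'])['Col2'].agg(['first', 'last', 'size']).reset_index()
#blank 'last' only for groups that really had a single row
agg.loc[agg['size'] == 1, 'last'] = np.nan
agg = agg.drop(columns='size')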

Related

Get last non NaN value after groupby and aggregation

I have a data frame like this for example:
   col1 col2
0     A    3
1     B    4
2     A  NaN
3     B    5
4     A    5
5     A  NaN
6     B  NaN
...
47    B    8
48    A    9
49    B  NaN
50    A  NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index() it gives me this output:
  col1 col2
0    A  NaN
1    B  NaN
I want to get the last non-NaN value after groupby and agg. The desired output is below:
  col1 col2
0    A    9
1    B    8
Your solution works fine for me if the NaN values are real missing values (groupby 'last' skips them).
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are actually strings, first convert them to real missing values:
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
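A minimal end-to-end run of that idea, assuming the NaNs really are strings:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B'],
                   'col2': [3, 4, 9, 'NaN']})
df['col2'] = df['col2'].replace('NaN', np.nan)
#groupby 'last' skips missing values, so A -> 9 and B -> 4
out = df.groupby('col1', sort=False)['col2'].last().reset_index()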

Take n last rows of a dataframe with no NaN

Let's take this dataframe:
df = pd.DataFrame(dict(Col1=[1, 2, np.nan, 4, 5, 6], Col2=[4, np.nan, 5, np.nan, 1, 5]))
   Col1  Col2
0   1.0   4.0
1   2.0   NaN
2   NaN   5.0
3   4.0   NaN
4   5.0   1.0
5   6.0   5.0
I would like to extract the n last rows of df with no NaN.
Could you please help me get this expected result?
   Col1  Col2
0     5     1
1     6     5
EDIT: Let's say I don't know where the last NaN is.
Use DataFrame.dropna with DataFrame.tail, then convert to integers:
N = 2
df1 = df.dropna().tail(N).astype(int)
#alternative
#df1 = df.dropna().iloc[-N:].astype(int)
print (df1)
   Col1  Col2
4     5     1
5     6     5
EDIT: To get the last block of rows with no missing values, flag rows containing missing values with DataFrame.isna and DataFrame.any, then reverse the order and take the cumulative sum, so the last block is the one with 0 values in the mask:
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
   Col1  Col2
4     5     1
5     6     5
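To see why this isolates the trailing block, here are the intermediates for the sample frame (a sketch):
flag = df.isna().any(axis=1)     #True where the row contains any NaN
rev = flag.iloc[::-1].cumsum()   #counts NaN rows from the bottom up
m = rev.eq(0).sort_index()       #0 means no NaN seen yet from the bottom
#flag: [False, True, True, True, False, False]
#rev (bottom to top): [0, 0, 1, 2, 3, 3] -> m keeps only rows 4 and 5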
If no rows match, it correctly returns an empty DataFrame:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4], Col2=[np.nan,np.nan,5,np.nan]))
print (df)
   Col1  Col2
0   1.0   NaN
1   2.0   NaN
2   NaN   5.0
3   4.0   NaN
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Empty DataFrame
Columns: [Col1, Col2]
Index: []
Another way is to use isna with cumsum and drop_duplicates to find the position just after the last row that introduces a NaN, then slice by position (note this mixes index labels with iloc, so it assumes the default RangeIndex):
#cumulative NaN count per column; rows adding no new NaN duplicate the previous row
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
print(new_df)
   Col1  Col2
4   5.0   1.0
5   6.0   5.0

Move values in rows in a new column in pandas

I have a DataFrame with an id column and several columns with data, like the column "value" in this example.
For this DataFrame I want to move all the values that correspond to the same id into new columns in the same row, as shown below.
I guess there is a function opposite to "melt" that allows this, but I'm not getting how to pivot this DF.
The dicts for the input and output DFs are:
d = {"id": [1, 1, 1, 2, 2, 3, 3, 4, 5], "value": [12, 13, 1, 22, 21, 23, 53, 64, 9]}
d2 = {"id": [1, 2, 3, 4, 5], "value1": [12, 22, 23, 64, 9], "value2": [13, 21, 53, "", ""], "value3": [1, "", "", "", ""]}
Create a MultiIndex with cumcount, reshape with unstack, and change the column names with add_prefix:
df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index())
print (df)
   id  value0  value1  value2
0   1    12.0    13.0     1.0
1   2    22.0    21.0     NaN
2   3    23.0    53.0     NaN
3   4    64.0     NaN     NaN
4   5     9.0     NaN     NaN
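If the float upcast caused by the NaNs is unwanted, the value columns can be cast to the nullable integer dtype afterwards (a sketch, assuming a recent pandas version):
df[['value0', 'value1', 'value2']] = df[['value0', 'value1', 'value2']].astype('Int64')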
Missing values can be replaced by fillna, but that mixes numeric and string data, so some functions may then fail:
df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index()
        .fillna(''))
print (df)
   id value0 value1 value2
0   1   12.0     13      1
1   2   22.0     21
2   3   23.0     53
3   4   64.0
4   5    9.0
You can GroupBy to a list, then expand the series of lists:
df = pd.DataFrame(d) # create input dataframe
res = df.groupby('id')['value'].apply(list).reset_index() # groupby to list
res = res.join(pd.DataFrame(res.pop('value').values.tolist())) # expand lists to columns
print(res)
   id   0     1     2
0   1  12  13.0   1.0
1   2  22  21.0   NaN
2   3  23  53.0   NaN
3   4  64   NaN   NaN
4   5   9   NaN   NaN
In general, such operations are expensive because the number of output columns is arbitrary. pandas/NumPy solutions work best when you can pre-allocate memory, which isn't possible here.
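Since the question mentions the opposite of melt: the same reshape can also be written with pivot plus a cumcount key (a sketch):
df = pd.DataFrame(d)
df['k'] = df.groupby('id').cumcount()
res = df.pivot(index='id', columns='k', values='value').add_prefix('value').reset_index()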

.combine_first for merging multiple rows

I have a pandas dataframe (df) where some rows are duplicated. Some columns in these repeating rows have NaN values while the same columns in their duplicates have values. I would like to merge the duplicating rows such that the missing values are replaced with the values from the duplicates, and then drop the duplicated rows. For example, the following are duplicated rows:
   id col1 col2  col3
0  01  abc        123
9  01        xy
The result should be like:
   id col1 col2  col3
0  01  abc   xy   123
I tried .combine_first using df.iloc[0:1,].combine_first(df.iloc[9:10,]) but with no success. Can anybody help me with this? Thanks!
I think you need groupby with forward and backward filling of NaNs, then drop_duplicates:
print (df)
   id col1 col2   col3
0   1  abc  NaN  123.0
9   1  NaN   xy    NaN
0   2  abc  NaN   17.0
9   2  NaN   xr    NaN
9   2  NaN   xu    NaN
df = df.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
print (df)
   id col1 col2   col3
0   1  abc   xy  123.0
0   2  abc   xr   17.0
9   2  abc   xu   17.0
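If only one merged row per id is wanted (not the case for id 2 above, which keeps two rows), GroupBy.first is a shorter route, because it takes the first non-null value per column (a sketch):
merged = df.groupby('id', as_index=False).first()
#id 1 -> col1='abc', col2='xy', col3=123.0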

Re-occurrence count

I have a dataframe like the following:
  Col1
0    C
1    A
3    D
4    A
5    A
I would like to count the step/index at which a certain value re-occurs, so I would get the following:
  Col1  Col2
0    C   NaN
1    A     2
3    D   NaN
4    A     1
5    A   NaN
Any ideas on how to do it? Thanks for the help!
Use GroupBy.cumcount with ascending=False and then replace 0 with NaN:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).replace(0,np.nan)
print (df)
  Col1  Col2
0    C   NaN
1    A   2.0
3    D   NaN
4    A   1.0
5    A   NaN
Alternative solution with mask:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).mask(lambda x: x == 0)
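For clarity, the descending cumcount counts how many later occurrences of each value remain, and the zeros mark last occurrences (a sketch of the intermediate):
c = df.groupby('Col1').cumcount(ascending=False)
#Col1: [C, A, D, A, A] -> c: [0, 2, 0, 1, 0]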
