.combine_first for merging multiple rows - python

I have a pandas DataFrame (df) where some rows are duplicated. Some columns in these repeated rows have NaN values, while the same columns in the duplicate rows have values. I would like to merge the duplicated rows so that the missing values are filled with the values from the duplicates, and then drop the duplicate rows. For example, the following are duplicated rows:
   id col1 col2  col3
0  01  abc  NaN   123
9  01  NaN   xy   NaN
The result should look like:
   id col1 col2  col3
0  01  abc   xy   123
I tried .combine_first with df.iloc[0:1,].combine_first(df.iloc[9:10,]), but with no success. Can anybody help me with this? Thanks!

I think you need groupby with forward and backward filling of NaNs, then drop_duplicates:
print (df)
id col1 col2 col3
0 1 abc NaN 123.0
9 1 NaN xy NaN
0 2 abc NaN 17.0
9 2 NaN xr NaN
9 2 NaN xu NaN
df = df.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
print (df)
id col1 col2 col3
0 1 abc xy 123.0
0 2 abc xr 17.0
9 2 abc xu 17.0
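As a side note, the original combine_first attempt fails because combine_first aligns on the index, so rows labelled 0 and 9 never overlap. A minimal sketch for merging just those two rows, assuming the question's df where positions 0 and 9 hold the duplicates, is to reset both indexes first:
# combine_first fills NaNs in the caller with values from the argument,
# matching on index and column labels; reset_index lines both rows up on label 0
a = df.iloc[0:1].reset_index(drop=True)
b = df.iloc[9:10].reset_index(drop=True)
merged = a.combine_first(b)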

Get last non NaN value after groupby and aggregation

I have a data frame like this for example:
col1 col2
0 A 3
1 B 4
2 A NaN
3 B 5
4 A 5
5 A NaN
6 B NaN
...
47 B 8
48 A 9
49 B NaN
50 A NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index() it gives me this output:
col1 col2
0 A NaN
1 B NaN
I want to get the last non-NaN value after groupby and agg. The desired output is like below:
col1 col2
0 A 9
1 B 8
Your solution works well for me, if the NaNs are real missing values.
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to real missing values (np.nan requires import numpy as np):
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
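A minimal runnable sketch of the pitfall, with a toy frame of my own: a 'NaN' stored as a string is not a missing value, so GroupBy.last will not skip it until it is converted:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B'],
                   'col2': [3, 4, 'NaN', 'NaN']})
# convert the string 'NaN' to a real missing value so GroupBy.last skips it
df['col2'] = df['col2'].replace('NaN', np.nan)
print(df.groupby('col1', sort=False).agg({'col2': 'last'}).reset_index())
#   col1 col2
# 0    A    3
# 1    B    4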

Creating multiple dataframes from 2 data frames basis dates in Python

I am new to Python and need some help with my problem:
I have a dataframe that looks something like this:
df1
date col1 col2 col3
01-01-2008 nan 16 19
02-01-2008 nan 25 20
03-01-2008 nan nan nan
04-01-2008 18 18 nan
I have another dataframe that looks like:
df2
start end col4
01-01-2008 04-01-2008 [col1,col2]
02-01-2008 04-01-2008 [col1]
03-01-2008 04-01-2008 [col3]
I need to write code such that the values of col1 and col2 from 01-01-2008 to 04-01-2008 are stored in one dataframe, the values of col1 from 02-01-2008 to 04-01-2008 in another dataframe, and so forth.
Basically I want my output something like this:
df3
date col1 col2
01-01-2008 nan 16
02-01-2008 nan 25
03-01-2008 nan nan
04-01-2008 18 18
df4
date col1
02-01-2008 nan
03-01-2008 nan
04-01-2008 18
df5
date col3
03-01-2008 nan
04-01-2008 nan
Please Help!!
Try this and let me know if you face any problem:
for i in range(len(df2)):
    start = df2["start"][i]
    end = df2["end"][i]
    # col4 holds a string like "[col1,col2]" - parse it into a list of names
    cols = [c.strip() for c in df2["col4"][i].replace("[", "").replace("]", "").split(",")]
    # slice the date-indexed frame by label from start to end
    print(df1.set_index("date").loc[start:end, cols])
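If you need the slices as separate dataframes (df3, df4, df5) rather than printed output, a minimal sketch along the same lines; the frames dict and its df3/df4/... keys are my own naming, not from the question:
d = df1.set_index("date")
frames = {}
for i, row in df2.iterrows():
    cols = [c.strip() for c in row["col4"].strip("[]").split(",")]
    # key the frames df3, df4, ... to mirror the desired output
    frames[f"df{i + 3}"] = d.loc[row["start"]:row["end"], cols].reset_index()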

How to select columns and generate NaN values for non-existing columns?

I have a list that contains a list of target columns:
cols = ["col1", "col2", "col4"]
Then I have several pandas DataFrames with a different number of columns. I must select columns from cols. If one of the columns from cols does not exist in a DataFrame, then NaN values should be generated.
df1 =
col1 col3
1 x1
2 x2
3 x3
df2 =
col1 col2 col4
1 f1 car3
3 f2 car2
4 f5 car1
For example, df2[cols] works well, but df1[cols] obviously fails. I need the following output for df1:
df1 =
col1 col2 col4
1 NaN NaN
2 NaN NaN
3 NaN NaN
Use DataFrame.reindex with the list of columns; any column with no match is added as a NaN column:
df1 = df1.reindex(cols, axis=1)
print (df1)
col1 col2 col4
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
So for df2 the same columns are returned:
df2 = df2.reindex(cols, axis=1)
print (df2)
col1 col2 col4
0 1 f1 car3
1 3 f2 car2
2 4 f5 car1
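If the same selection has to be applied to several DataFrames, a small sketch (the dfs list is my own assumption):
# reindex(columns=cols) is an equivalent spelling of reindex(cols, axis=1)
dfs = [df1, df2]
dfs = [d.reindex(columns=cols) for d in dfs]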

how to lag columns in batch in dataframe

I have a data frame with more than 100 columns. I need to lag 60 of them, and I know the column names for which I need the lag. Is there a way to lag them in batch, or in just a few lines?
Say I have a dataframe like below:
col1 col2 col3 col4 col5 col6 ... col100
1 2 3 4 5 6 8
3 9 15 19 21 23 31
The only way I know is to do it one by one, i.e. run df['col1_lag'] = df['col1'].shift(1) for each column.
That seems too much for so many columns. Is there a better way to do this? Thanks in advance.
Use shift with add_suffix for the new DataFrame and join it to the original:
df1 = df.join(df.shift().add_suffix('_lag'))
#alternative
#df1 = pd.concat([df, df.shift().add_suffix('_lag')], axis=1)
print (df1)
col1 col2 col3 col4 col5 col6 col100 col1_lag col2_lag col3_lag \
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 2.0 3.0
col4_lag col5_lag col6_lag col100_lag
0 NaN NaN NaN NaN
1 4.0 5.0 6.0 8.0
If you want to lag only some columns, it is possible to filter them by a list:
cols = ['col1','col3','col5']
df2 = df.join(df[cols].shift().add_suffix('_lag'))
print (df2)
col1 col2 col3 col4 col5 col6 col100 col1_lag col3_lag col5_lag
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 3.0 5.0
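If several lag periods are needed, a sketch along the same lines (the lags list is an assumption; join accepts a list of DataFrames):
lags = [1, 2]
df3 = df.join([df[cols].shift(k).add_suffix(f'_lag{k}') for k in lags])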

Groupby and transpose in pandas, python

Dataframe have
ID col col2 col3 col4
1 A 50 S 1
1 A 52 M 4
1 B 45 N 8
1 C 18 S 7
Dataframe want
ID col colA colB colC colD colE colF
1 A 50 52 S M 1 4
1 B 45 NULL N NULL 8 NULL
1 C 18 NULL S NULL 7 NULL
I want 1 line per unique ID+col (groupby ID and col).
If there are multiple entries per ID+col (at most 2, no more), then put the first value of col2 in colA and the second in colB, the first value of col3 in colC and the second in colD, and the first value of col4 in colE and the second in colF. If there is only one entry per ID+col, then put the value of col2 in colA and leave colB null, etc.
I tried to first create a counter:
df['COUNT'] = df.groupby(['ID','col']).cumcount()+1
From here I was thinking of just adding columns along the lines of:
if count=1 then df['colA'] = df.col2
if count=2 then df['colB'] = df.col2
... but this will still result in the same number of rows as the original df.
I think you need set_index with unstack:
df['COUNT'] = df.groupby(['ID','col']).cumcount()+1
df = df.set_index(['ID','col', 'COUNT'])['col2'].unstack().add_prefix('col').reset_index()
print (df)
COUNT ID col col1 col2
0 1 A 50.0 52.0
1 1 B 45.0 NaN
2 1 C 18.0 NaN
Or:
c = df.groupby(['ID','col']).cumcount()+1
df = df.set_index(['ID','col', c])['col2'].unstack().add_prefix('col').reset_index()
print (df)
ID col col1 col2
0 1 A 50.0 52.0
1 1 B 45.0 NaN
2 1 C 18.0 NaN
EDIT:
For multiple columns the solution changes a bit, because you are working with a MultiIndex in the columns:
df['COUNT'] = (df.groupby(['ID','col']).cumcount() + 1).astype(str)
# no ['col2'] selection here, so every value column is unstacked
df = df.set_index(['ID','col', 'COUNT']).unstack()
# flatten the MultiIndex columns (COUNT was cast to str so '_'.join works)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
ID col col2_1 col2_2 col3_1 col3_2 col4_1 col4_2
0 1 A 50.0 52.0 S M 1.0 4.0
1 1 B 45.0 NaN N None 8.0 NaN
2 1 C 18.0 NaN S None 7.0 NaN
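To get exactly the colA..colF labels from the question, a rename mapping can follow the unstack; a sketch, assuming at most two entries per ID+col as stated:
rename = {'col2_1': 'colA', 'col2_2': 'colB',
          'col3_1': 'colC', 'col3_2': 'colD',
          'col4_1': 'colE', 'col4_2': 'colF'}
df = df.rename(columns=rename)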
You can use groupby with apply(pd.Series):
df.groupby(['ID','col']).col2.apply(list).apply(pd.Series).add_prefix('col').reset_index()
Out[404]:
ID col col0 col1
0 1 A 50.0 52.0
1 1 B 45.0 NaN
2 1 C 18.0 NaN
Not sure if this is what you are looking for, but it renders the same result. Please note I am using multiple aggregate functions on the same column, and therefore use ravel to flatten the DataFrame's column MultiIndex.
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 1, 1, 1],
                   'Col1': ['A', 'A', 'B', 'C'],
                   'Col2': [50, 52, 45, 18]})
df = df.groupby(['ID', 'Col1']).agg({'Col2': ['first', 'last']})
# flatten the column MultiIndex created by the two aggregations
df.columns = ["_".join(x) for x in df.columns.ravel()]
df = df.reset_index()
# when first == last there was only one entry, so blank out the duplicate
df['Col2_last'] = np.where(df.Col2_first == df.Col2_last, float('nan'), df.Col2_last)
print(df)
