merging dataframes on the same index - python

I can't find the answer to this in here.
I have two dataframes:
index  name  color   day
0      NaN   NaN     NaN
1      b     red     thu
2      NaN   NaN     NaN
3      d     green   mon

index  name  color   week
0      c     blue    1
1      NaN   NaN     NaN
2      t     yellow  4
3      NaN   NaN     NaN
And I'd like the result to be one dataframe:
index  name  color   day  week
0      c     blue    NaN  1
1      b     red     thu  NaN
2      t     yellow  NaN  4
3      d     green   mon  NaN
Is there a way to merge the dataframes on their indexes, while adding new columns?

You can use DataFrame.combine_first:
df = df1.combine_first(df2)
print (df)
color day name week
0 blue NaN c 1.0
1 red thu b NaN
2 yellow NaN t 4.0
3 green mon d NaN
For a custom column order, build the list of column names with numpy.concatenate and pd.unique, then reindex (reindex_axis is deprecated in newer pandas):
cols = pd.unique(np.concatenate([df1.columns, df2.columns]))
df = df1.combine_first(df2).reindex(columns=cols)
print (df)
name color day week
0 c blue NaN 1.0
1 b red thu NaN
2 t yellow NaN 4.0
3 d green mon NaN
EDIT:
If the week values should be merged into the existing day column, rename the column first:
df = df1.combine_first(df2.rename(columns={'week':'day'}))
print (df)
name color day
0 c blue 1
1 b red thu
2 t yellow 4
3 d green mon
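For reference, the two input frames can be reconstructed like this (a minimal sketch; the exact dtypes are an assumption, and the NaN entries make week a float column):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': [np.nan, 'b', np.nan, 'd'],
                    'color': [np.nan, 'red', np.nan, 'green'],
                    'day': [np.nan, 'thu', np.nan, 'mon']})
df2 = pd.DataFrame({'name': ['c', np.nan, 't', np.nan],
                    'color': ['blue', np.nan, 'yellow', np.nan],
                    'week': [1, np.nan, 4, np.nan]})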

Related

Cumsum with nan values - pandas

I want to pass a cumulative sum of unique values to a separate column. However, I want to disregard nan values so it essentially skips these rows and continues the count with the next viable row.
import numpy as np
import pandas as pd

d = {'Item': [np.nan, "Blue", "Blue", np.nan, "Red", "Blue", "Blue", "Red"]}
df = pd.DataFrame(data=d)
df['count'] = df.Item.ne(df.Item.shift()).cumsum()
Intended output:
Item count
0 NaN NaN
1 Blue 1
2 Blue 1
3 NaN NaN
4 Red 2
5 Blue 3
6 Blue 3
7 Red 4
Try:
df['count'] = (df.Item.ne(df.Item.shift()) & df.Item.notna()).cumsum().mask(df.Item.isna())
Or, as suggested by @SeanBean:
df['count'] = df.Item.ne(df.Item.shift()).mask(df.Item.isna()).cumsum()
Output of df:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0
Here's one way (you just need to add a where condition):
df['count'] = df.Item.ne(df.Item.shift()).where(~df.Item.isna()).cumsum()
OUTPUT:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0
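For clarity, a sketch of the intermediate steps behind the mask-then-cumsum approach (same data as above; the variable names are only illustrative):
changed = df.Item.ne(df.Item.shift())   # True wherever the value changes (NaN rows also compare unequal)
masked = changed.mask(df.Item.isna())   # blank out the NaN rows so cumsum skips them
df['count'] = masked.cumsum()           # cumulative count; NaN rows stay NaN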

Retain NaN values after concatenating

I have a df where I want to apply a function. How can I retain the NaN values even after concatenating two columns? I want to avoid np.where since the real function has more elif conditions.
df
   fruit  year  price  vol significance
0  apple  2010      1    5          NaN
1  apple  2011      2    4          NaN
2  apple  2012      3    3          NaN
3    NaN  2013      3    3          NaN
4    NaN   NaN    NaN    3          NaN
5  apple  2015      3    3     important
df = df.fillna('')

def func(row):
    if pd.notna(row['year']):
        return row['fruit'] + row['significance'] + str(row['price']) + '_test'
    else:
        return np.NaN

df['final'] = df.apply(func, axis=1)
Expected Output
df fruit year price vol significance final
0 apple 2010 1 5 NaN apple1_test
1 apple 2011 2 4 NaN apple2_test
2 apple 2012 3 3 NaN apple3_test
3 NaN 2013 3 3 NaN 3_test
4 NaN 2014 NaN 3 NaN NaN
5 apple 2015 3 3 important appleimportant3_test
df = df.fillna('')

def func(row):
    a = f"{row['fruit']}{row['significance']}{row['price']}"
    if a:
        return a + '_test'
    return np.NaN

df['final'] = df.apply(func, axis=1)
Alternatively, first remove df = df.fillna('') and then extend your function with an elif that tests for missing values in both columns:
def func(row):
    if pd.notna(row['fruit']) & pd.notna(row['significance']):
        return row['fruit'] + '_' + row['significance']
    elif pd.isna(row['fruit']) & pd.isna(row['significance']):
        return 'apple'
    else:
        return row['fruit']

df['final'] = df.apply(func, axis=1)
print(df)
df fruit year price vol significance final
0 0 apple 2010 1 5 NaN apple
1 1 apple 2011 2 4 NaN apple
2 2 apple 2012 3 3 NaN apple
3 3 apple 2013 3 3 NaN apple
4 4 NaN 2014 3 3 NaN apple
5 5 apple 2015 3 3 important apple_important
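For reference, a minimal reconstruction of the example frame (assumed from the question's table; the NaN entries turn year and price into float columns):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'apple', 'apple', np.nan, np.nan, 'apple'],
    'year': [2010, 2011, 2012, 2013, np.nan, 2015],
    'price': [1, 2, 3, 3, np.nan, 3],
    'vol': [5, 4, 3, 3, 3, 3],
    'significance': [np.nan, np.nan, np.nan, np.nan, np.nan, 'important'],
})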

New Columns based on column value in pandas from long to wide format

I have a dataframe that has unique identifier in one column and is in long format. My goal is to have one user_id(student) per row and to pivot so that the structure is wide.
Current dataframe example:
user_id test_type test_date
0 1 ACT 2013-08-15
1 2 ACT 2011-12-09
2 3 SAT 2012-03-09
3 4 ACT 2003-07-27
4 4 SAT 2013-12-31
The problem is that some students have taken both tests so I want to ultimately have one column for ACT, one column for SAT, and a column each for the corresponding date.
Desired Format:
user_id test_ACT ACT_date test_SAT SAT_date
0 1 ACT 2013-08-15 NaN NaN
1 2 ACT 2011-12-09 NaN NaN
2 3 NaN NaN SAT 2012-03-09
3 4 ACT 2003-07-27 SAT 2013-12-31
I have tried to groupby and pivot:
df['idx'] = df.groupby('user_id').cumcount()
tmp = []
for var in ['test_type', 'test_date']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='user_id', columns='tmp_idx', values=var))
df_wide = pd.concat(tmp, axis=1).reset_index()
This means that the format is wide but not separated by test type.
Output from attempt but not desired:
user_id test_type_0 test_date_0 test_type_1 test_date_1
0 1 ACT 2013-08-15 NaN NaN
1 2 ACT 2011-12-09 NaN NaN
2 3 SAT 2012-03-09 NaN NaN
3 4 ACT 2003-07-27 SAT 2013-12-31
After trying the provided answer:
index user_id ACT_date test_ACT user_id SAT_date test_SAT
0 0 1.0 2013-08-15 ACT NaN NaN NaN
1 1 2.0 2011-12-09 ACT NaN NaN NaN
2 2 NaN NaN NaN 3.0 2012-03-09 SAT
3 3 4.0 2003-07-27 ACT NaN NaN NaN
4 4 NaN NaN NaN 4.0 2013-12-31 SAT
This should work:
df1 = df[df.test_type == 'ACT'].set_index('user_id')[['test_date']]
df1.columns = ['ACT_date']
df1["test_ACT"] = "ACT"
df2 = df[df.test_type == 'SAT'].set_index('user_id')[['test_date']]
df2.columns = ['SAT_date']
df2["test_SAT"] = "SAT"
finaldf = pd.concat([df1, df2], axis=1).reset_index()
# create a temporary column and set a MultiIndex
res = (df.assign(temp=df.test_type)
         .set_index(['user_id', 'temp']))

# unstack, drop the unnecessary column level and rename the columns
(res.unstack()
    .droplevel(0, axis=1)
    .set_axis(['test_ACT', 'test_SAT', 'ACT_date', 'SAT_date'], axis=1))
test_ACT test_SAT ACT_date SAT_date
user_id
1 ACT NaN 2013-08-15 NaN
2 ACT NaN 2011-12-09 NaN
3 NaN SAT NaN 2012-03-09
4 ACT SAT 2003-07-27 2013-12-31
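Both approaches assume a frame like the one in the question; a minimal reconstruction (an assumed sketch, with the dates kept as strings):
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 4],
    'test_type': ['ACT', 'ACT', 'SAT', 'ACT', 'SAT'],
    'test_date': ['2013-08-15', '2011-12-09', '2012-03-09', '2003-07-27', '2013-12-31'],
})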

Pandas dataframe: Replace multiple rows based on values in another column

I'm trying to replace some values in one dataframe's column with values from another data frame's column. Here's what the data frames look like. df2 has a lot of rows and columns.
df1
0 1029
0 aaaaa Green
1 bbbbb Green
2 fffff Blue
3 xxxxx Blue
4 zzzzz Green
df2
0 1 2 3 .... 1029
0 aaaaa 1 NaN 14 NaN
1 bbbbb 1 NaN 14 NaN
2 ccccc 1 NaN 14 Blue
3 ddddd 1 NaN 14 Blue
...
25 yyyyy 1 NaN 14 Blue
26 zzzzz 1 NaN 14 Blue
The final df should look like this
0 1 2 3 .... 1029
0 aaaaa 1 NaN 14 Green
1 bbbbb 1 NaN 14 Green
2 ccccc 1 NaN 14 Blue
3 ddddd 1 NaN 14 Blue
...
25 yyyyy 1 NaN 14 Blue
26 zzzzz 1 NaN 14 Green
So basically what needs to happen is that df1[0] and df2[0] need to be matched, and then df2[1029] needs to have its values replaced by the corresponding row in df1[1029] for the rows that matched. I don't want to lose any values in df2['1029'] which are not in df1['1029'].
I believe the re module in python can do that? This is what I have so far:
import re

for line in replace:
    line = re.sub(df1['1029'], '1029', line.rstrip())
    print(line)
But it definitely doesn't work.
I could also use merge as in merged1 = df1.merge(df2, left_index=True, right_index=True, how='inner') but that doesn't replace the values inline.
You need:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'0': ['aaaaa', 'bbbbb', 'fffff', 'xxxxx', 'zzzzz'], '1029': ['Green', 'Green', 'Blue', 'Blue', 'Green']})
df2 = pd.DataFrame({'0': ['aaaa', 'bbbb', 'ccccc', 'ddddd', 'yyyyy', 'zzzzz'], '1029': [None, None, 'Blue', 'Blue', 'Blue', 'Blue']})

# Fill NaNs by index alignment
df2['1029'] = df2['1029'].fillna(df1['1029'])
# Merge the dataframes
df_ = df2.merge(df1, how='left', on=['0'])
df_['1029'] = np.where(df_['1029_y'].isna(), df_['1029_x'], df_['1029_y'])
df_.drop(['1029_y', '1029_x'], axis=1, inplace=True)
print(df_)
Output:
0 1029
0 aaaa Green
1 bbbb Green
2 ccccc Blue
3 ddddd Blue
4 yyyyy Blue
5 zzzzz Green
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'0':['aa','bb','ff','xx', 'zz'], '1029':['Green', 'Green', 'Blue', 'Blue', 'Green']})
df2 = pd.DataFrame({'0':['aa','bb','cc','dd','ff','gg','hh','xx','yy', 'zz'], '1': [1]*10, '2': [np.nan]*10, '1029':[np.nan, np.nan, 'Blue', 'Blue', np.nan, np.nan, 'Blue', 'Green', 'Blue', 'Blue']})
df1
0 1029
0 aa Green
1 bb Green
2 ff Blue
3 xx Blue
4 zz Green
df2
0 1 1029 2
0 aa 1 NaN NaN
1 bb 1 NaN NaN
2 cc 1 Blue NaN
3 dd 1 Blue NaN
4 ff 1 NaN NaN
5 gg 1 NaN NaN
6 hh 1 Blue NaN
7 xx 1 Green NaN
8 yy 1 Blue NaN
9 zz 1 Blue NaN
If the column '0' in both the data frames is sorted, this will work.
df2.loc[(df2['1029'].isna() & df2['0'].isin(df1['0'])), '1029'] = df1['1029'][df2['0'].isin(df1['0'])].tolist()
df2
0 1 1029 2
0 aa 1 Green NaN
1 bb 1 Green NaN
2 cc 1 Blue NaN
3 dd 1 Blue NaN
4 ff 1 Green NaN
5 gg 1 NaN NaN
6 hh 1 Blue NaN
7 xx 1 Green NaN
8 yy 1 Blue NaN
9 zz 1 Blue NaN
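A map-based alternative (a sketch, not part of the original answers) that matches on column '0' by value rather than by position, assuming the keys in df1['0'] are unique:
# lookup: value in column '0' -> replacement colour from df1['1029']
lookup = df1.set_index('0')['1029']
# replace matched rows with df1's colour, keep df2's own value where there is no match
df2['1029'] = df2['0'].map(lookup).fillna(df2['1029'])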

Applying values to column and grouping all columns by those values

I have a pandas dataframe as shown here. All rows without a value in ["sente"] contain further information, but they are not yet linked to a ["sente"] value.
id pos value sente
1 a I 21
2 b have 21
3 b a 21
4 a cat 21
5 d ! 21
6 cat N NaN
7 a My 22
8 a cat 22
9 b is 22
10 a cute 22
11 d . 22
12 cat N NaN
13 cute M NaN
Now I want each row where there is no value in ["sente"] to get its value from the row above. Then I want to group them all by ["sente"] and create a new column with its content from the row without a value in ["sente"].
sente pos value content
21 a,b,b,a,d I have a cat ! 'cat,N'
22 a,a,b,a,d My cat is cute . 'cat,N','cute,M'
This would be my first step:
df.loc[(df['sente'] != df["sente"].shift(-1)) & (df["sente"].isna()), "sente"] = df["sente"].shift(1)
but it only works for one additional row, not if there are 2 or more.
This groups up one column like I want:
df.groupby(["sente"])['value'].apply(lambda x: " ".join(x))
But for more columns it doesn't work the way I want:
df.groupby(["sente"]).agg(lambda x: ",".join(x))
Is there any way to do this without using stack functions?
Use:
# boolean mask of rows with missing sente
m = df['sente'].isnull()
# new column joining pos and value, only where the mask is True
df['content'] = np.where(m, df['pos'] + ',' + df['value'], np.nan)
# replace pos and value with NaN on the masked rows
df[['pos', 'value']] = df[['pos', 'value']].mask(m)
print(df)
id pos value sente content
0 1 a I 21.0 NaN
1 2 b have 21.0 NaN
2 3 b a 21.0 NaN
3 4 a cat 21.0 NaN
4 5 d ! 21.0 NaN
5 6 NaN NaN NaN cat,N
6 7 a My 22.0 NaN
7 8 a cat 22.0 NaN
8 9 b is 22.0 NaN
9 10 a cute 22.0 NaN
10 11 d . 22.0 NaN
11 12 NaN NaN NaN cat,N
12 13 NaN NaN NaN cute,M
Finally, forward fill the NaNs in sente with ffill, group by it, and join each column after dropping NaNs with dropna:
df1 = df.groupby(df["sente"].ffill()).agg(lambda x: " ".join(x.dropna()))
print (df1)
pos value content
sente
21.0 a b b a d I have a cat ! cat,N
22.0 a a b a d My cat is cute . cat,N cute,M
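For completeness, a minimal reconstruction of the example frame (assumed from the question's table; sente becomes a float column because of the NaNs):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 14),
    'pos': ['a', 'b', 'b', 'a', 'd', 'cat', 'a', 'a', 'b', 'a', 'd', 'cat', 'cute'],
    'value': ['I', 'have', 'a', 'cat', '!', 'N', 'My', 'cat', 'is', 'cute', '.', 'N', 'M'],
    'sente': [21, 21, 21, 21, 21, np.nan, 22, 22, 22, 22, 22, np.nan, np.nan],
})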
