I can't find the answer to this here.
I have two dataframes:
index, name, color, day
0      NaN   NaN    NaN
1      b     red    thu
2      NaN   NaN    NaN
3      d     green  mon

index, name, color, week
0      c     blue   1
1      NaN   NaN    NaN
2      t     yellow 4
3      NaN   NaN    NaN
And I'd like the result to be one dataframe:
index, name, color, day, week
0      c     blue   NaN  1
1      b     red    thu  NaN
2      t     yellow NaN  4
3      d     green  mon  NaN
Is there a way to merge the dataframes on their indexes, while adding new columns?
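For reference, the two frames can be built like this (a minimal sketch; the exact dtypes are my assumption):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': [np.nan, 'b', np.nan, 'd'],
                    'color': [np.nan, 'red', np.nan, 'green'],
                    'day': [np.nan, 'thu', np.nan, 'mon']})
df2 = pd.DataFrame({'name': ['c', np.nan, 't', np.nan],
                    'color': ['blue', np.nan, 'yellow', np.nan],
                    'week': [1, np.nan, 4, np.nan]})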
You can use DataFrame.combine_first:
df = df1.combine_first(df2)
print (df)
color day name week
0 blue NaN c 1.0
1 red thu b NaN
2 yellow NaN t 4.0
3 green mon d NaN
For a custom column order, build the list of column names with numpy.concatenate and pd.unique, then pass it to reindex (older pandas used reindex_axis, which has since been removed):
cols = pd.unique(np.concatenate([df1.columns, df2.columns]))
df = df1.combine_first(df2).reindex(columns=cols)
print (df)
name color day week
0 c blue NaN 1.0
1 b red thu NaN
2 t yellow NaN 4.0
3 d green mon NaN
EDIT:
To merge week into the existing day column instead, rename it first:
df = df1.combine_first(df2.rename(columns={'week':'day'}))
print (df)
name color day
0 c blue 1
1 b red thu
2 t yellow 4
3 d green mon
I want to pass a cumulative sum of unique values to a separate column. However, I want to disregard NaN values so that it essentially skips those rows and continues the count with the next viable row.
import numpy as np
import pandas as pd

d = {'Item': [np.nan, "Blue", "Blue", np.nan, "Red", "Blue", "Blue", "Red"]}
df = pd.DataFrame(data=d)
df['count'] = df.Item.ne(df.Item.shift()).cumsum()
Intended output:
Item count
0 NaN NaN
1 Blue 1
2 Blue 1
3 NaN NaN
4 Red 2
5 Blue 3
6 Blue 3
7 Red 4
Try:
df['count'] = (df.Item.ne(df.Item.shift()) & df.Item.notna()).cumsum().mask(df.Item.isna())
Or, as suggested by @SeanBean:
df['count'] = df.Item.ne(df.Item.shift()).mask(df.Item.isna()).cumsum()
Output of df:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0
Here's one way:
Note: you just need to add a where condition:
df['count'] = df.Item.ne(df.Item.shift()).where(~df.Item.isna()).cumsum()
OUTPUT:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0
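To see why this works, it helps to inspect the intermediates (a sketch on the question's df):

changed = df.Item.ne(df.Item.shift())   # True at every change point; NaN rows also compare unequal
valid = changed.where(~df.Item.isna())  # blank the NaN rows out so they never increment
df['count'] = valid.cumsum()            # cumsum leaves NaN in place and carries the running total forward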
I have a df to which I want to apply a function. How can I retain the NaN values even after concatenating two columns? I want to avoid np.where since the real function has more elif conditions.
df
  fruit  year price  vol significance
0 apple  2010     1    5          NaN
1 apple  2011     2    4          NaN
2 apple  2012     3    3          NaN
3   NaN  2013     3    3          NaN
4   NaN   NaN   NaN    3          NaN
5 apple  2015     3    3    important
df = df.fillna('')

def func(row):
    if pd.notna(row['year']):
        return row['fruit'] + row['significance'] + row['price'] + '_test'
    else:
        return np.NaN

df['final'] = df.apply(func, axis=1)
Expected output:
df
  fruit  year price  vol significance                 final
0 apple  2010     1    5          NaN           apple1_test
1 apple  2011     2    4          NaN           apple2_test
2 apple  2012     3    3          NaN           apple3_test
3   NaN  2013     3    3          NaN                3_test
4   NaN   NaN   NaN    3          NaN                   NaN
5 apple  2015     3    3    important  appleimportant3_test
df = df.fillna('')

def func(row):
    a = f"{row['fruit']}{row['significance']}{row['price']}"
    if a:
        return a + '_test'
    return np.NaN
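To use it, apply row-wise (assuming the question's df; note that after fillna('') the numeric price values are still floats, so 1 renders as '1.0' unless you cast or format it first):

df['final'] = df.apply(func, axis=1)  # row-wise apply; price may need an explicit cast to int/str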
First remove df = df.fillna(''), then use your solution with an added elif that tests for missing values in both columns:
def func(row):
    if pd.notna(row['fruit']) & pd.notna(row['significance']):
        return row['fruit'] + '_' + row['significance']
    elif pd.isna(row['fruit']) & pd.isna(row['significance']):
        return 'apple'
    else:
        return row['fruit']

df['final'] = df.apply(func, axis=1)
print (df)
df
  fruit  year price  vol significance            final
0 apple  2010     1    5          NaN            apple
1 apple  2011     2    4          NaN            apple
2 apple  2012     3    3          NaN            apple
3 apple  2013     3    3          NaN            apple
4   NaN  2014     3    3          NaN            apple
5 apple  2015     3    3    important  apple_important
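Since the question wants to avoid chained np.where calls, np.select (my suggestion, not part of the answers above) can express the same branches vectorized, without a row-wise apply:

import numpy as np

conditions = [
    df['fruit'].notna() & df['significance'].notna(),  # both present
    df['fruit'].isna() & df['significance'].isna(),    # both missing
]
choices = [df['fruit'] + '_' + df['significance'], 'apple']
# the default covers the remaining "only fruit present" branch
df['final'] = np.select(conditions, choices, default=df['fruit'])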
I have a dataframe in long format with a unique identifier in one column. My goal is to have one user_id (student) per row and to pivot so that the structure is wide.
Current dataframe example:
user_id test_type test_date
0 1 ACT 2013-08-15
1 2 ACT 2011-12-09
2 3 SAT 2012-03-09
3 4 ACT 2003-07-27
4 4 SAT 2013-12-31
The problem is that some students have taken both tests so I want to ultimately have one column for ACT, one column for SAT, and a column each for the corresponding date.
Desired Format:
user_id test_ACT ACT_date test_SAT SAT_date
0 1 ACT 2013-08-15 NaN NaN
1 2 ACT 2011-12-09 NaN NaN
2 3 NaN NaN SAT 2012-03-09
3 4 ACT 2003-07-27 SAT 2013-12-31
I have tried to groupby and pivot:
df['idx'] = df.groupby('user_id').cumcount()
tmp = []
for var in ['test_type','test_date']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='user_id', columns='tmp_idx', values=var))
df_wide = pd.concat(tmp, axis=1).reset_index()
This gets me a wide format, but it is not separated by test type.
Output from the attempt (not what I want):
user_id test_type_0 test_date_0 test_type_1 test_date_1
0 1 ACT 2013-08-15 NaN NaN
1 2 ACT 2011-12-09 NaN NaN
2 3 SAT 2012-03-09 NaN NaN
3 4 ACT 2003-07-27 SAT 2013-12-31
After trying the provided answer:
index user_id ACT_date test_ACT user_id SAT_date test_SAT
0 0 1.0 2013-08-15 ACT NaN NaN NaN
1 1 2.0 2011-12-09 ACT NaN NaN NaN
2 2 NaN NaN NaN 3.0 2012-03-09 SAT
3 3 4.0 2003-07-27 ACT NaN NaN NaN
4 4 NaN NaN NaN 4.0 2013-12-31 SAT
This should work:
df1 = df[df.test_type=='ACT'].set_index('user_id')[['test_date']]
df1.columns = ['ACT_date']
df1["test_ACT"] = "ACT"
df2 = df[df.test_type=='SAT'].set_index('user_id')[['test_date']]
df2.columns = ['SAT_date']
df2["test_SAT"] = "SAT"
finaldf = pd.concat([df1, df2], axis=1).reset_index()
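pd.concat with axis=1 aligns the two frames on their user_id index, so a student who took only one test gets NaN in the other test's columns. On the question's data the result should look roughly like:

   user_id    ACT_date test_ACT    SAT_date test_SAT
0        1  2013-08-15      ACT         NaN      NaN
1        2  2011-12-09      ACT         NaN      NaN
2        3         NaN      NaN  2012-03-09      SAT
3        4  2003-07-27      ACT  2013-12-31      SAT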
#create temporary column
#and set index
res = (df.assign(temp = df.test_type)
.set_index(['user_id','temp'])
)
#unstack
#remove unnecessary column level
#and rename columns
(res.unstack()
.droplevel(0,axis=1)
.set_axis(['test_ACT','test_SAT','ACT_date','SAT_date'],axis=1)
)
test_ACT test_SAT ACT_date SAT_date
user_id
1 ACT NaN 2013-08-15 NaN
2 ACT NaN 2011-12-09 NaN
3 NaN SAT NaN 2012-03-09
4 ACT SAT 2003-07-27 2013-12-31
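The desired format has user_id as an ordinary column rather than the index, so appending .reset_index() to the chain above should match it exactly:

(res.unstack()
    .droplevel(0,axis=1)
    .set_axis(['test_ACT','test_SAT','ACT_date','SAT_date'],axis=1)
    .reset_index()
)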
I'm trying to replace some values in one dataframe's column with values from another data frame's column. Here's what the data frames look like. df2 has a lot of rows and columns.
df1
0 1029
0 aaaaa Green
1 bbbbb Green
2 fffff Blue
3 xxxxx Blue
4 zzzzz Green
df2
0 1 2 3 .... 1029
0 aaaaa 1 NaN 14 NaN
1 bbbbb 1 NaN 14 NaN
2 ccccc 1 NaN 14 Blue
3 ddddd 1 NaN 14 Blue
...
25 yyyyy 1 NaN 14 Blue
26 zzzzz 1 NaN 14 Blue
The final df should look like this
0 1 2 3 .... 1029
0 aaaaa 1 NaN 14 Green
1 bbbbb 1 NaN 14 Green
2 ccccc 1 NaN 14 Blue
3 ddddd 1 NaN 14 Blue
...
25 yyyyy 1 NaN 14 Blue
26 zzzzz 1 NaN 14 Green
So basically df1[0] and df2[0] need to be matched, and then df2[1029] needs to have its values replaced by the corresponding row of df1[1029] for the rows that matched. I don't want to lose any values in df2['1029'] which are not in df1['1029'].
I believe the re module in Python can do that? This is what I have so far:
import re

for line in replace:
    line = re.sub(df1['1029'], '1029', line.rstrip())
    print(line)
But it definitely doesn't work.
I could also use merge as in merged1 = df1.merge(df2, left_index=True, right_index=True, how='inner') but that doesn't replace the values inline.
You need:
df1 = pd.DataFrame({'0':['aaaaa','bbbbb','fffff','xxxxx','zzzzz'], '1029':['Green','Green','Blue','Blue','Green']})
df2 = pd.DataFrame({'0':['aaaaa','bbbbb','ccccc','ddddd','yyyyy','zzzzz'], '1029':[None,None,'Blue','Blue','Blue','Blue']})
# Fill NaNs by index alignment
df2['1029'] = df2['1029'].fillna(df1['1029'])
# Merge the dataframes on the key column
df_ = df2.merge(df1, how='left', on=['0'])
df_['1029'] = np.where(df_['1029_y'].isna(), df_['1029_x'], df_['1029_y'])
df_.drop(['1029_y','1029_x'], axis=1, inplace=True)
print(df_)
Output:
0 1029
0 aaaaa Green
1 bbbbb Green
2 ccccc Blue
3 ddddd Blue
4 yyyyy Blue
5 zzzzz Green
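A more direct variant (my sketch, not part of the answer above) maps the key column against df1 and falls back to df2's own value where there is no match:

s = df1.set_index('0')['1029']                     # lookup table: name -> color
df2['1029'] = df2['0'].map(s).fillna(df2['1029'])  # keep df2's value where df1 has none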
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'0':['aa','bb','ff','xx', 'zz'], '1029':['Green', 'Green', 'Blue', 'Blue', 'Green']})
df2 = pd.DataFrame({'0':['aa','bb','cc','dd','ff','gg','hh','xx','yy', 'zz'], '1': [1]*10, '2': [np.nan]*10, '1029':[np.nan, np.nan, 'Blue', 'Blue', np.nan, np.nan, 'Blue', 'Green', 'Blue', 'Blue']})
df1
0 1029
0 aa Green
1 bb Green
2 ff Blue
3 xx Blue
4 zz Green
df2
0 1 1029 2
0 aa 1 NaN NaN
1 bb 1 NaN NaN
2 cc 1 Blue NaN
3 dd 1 Blue NaN
4 ff 1 NaN NaN
5 gg 1 NaN NaN
6 hh 1 Blue NaN
7 xx 1 Green NaN
8 yy 1 Blue NaN
9 zz 1 Blue NaN
If the column '0' is sorted in both data frames, this will work:
df2.loc[(df2['1029'].isna() & df2['0'].isin(df1['0'])), '1029'] = df1['1029'][df2['0'].isin(df1['0'])].tolist()
df2
0 1 1029 2
0 aa 1 Green NaN
1 bb 1 Green NaN
2 cc 1 Blue NaN
3 dd 1 Blue NaN
4 ff 1 Green NaN
5 gg 1 NaN NaN
6 hh 1 Blue NaN
7 xx 1 Green NaN
8 yy 1 Blue NaN
9 zz 1 Blue NaN
I have a pandas dataframe as shown below. All lines without a value in ["sente"] contain further information, but they are not yet linked to a ["sente"].
id  pos   value  sente
1   a     I      21
2   b     have   21
3   b     a      21
4   a     cat    21
5   d     !      21
6   cat   N      NaN
7   a     My     22
8   a     cat    22
9   b     is     22
10  a     cute   22
11  d     .      22
12  cat   N      NaN
13  cute  M      NaN
Now I want each row with no value in ["sente"] to take its value from the row above. Then I want to group everything by ["sente"] and collect the content of the rows that had no ["sente"] value into a new column:
sente pos value content
21 a,b,b,a,d I have a cat ! 'cat,N'
22 a,a,b,a,d My cat is cute . 'cat,N','cute,M'
This would be my first step:
df.loc[(df['sente'] != df['sente'].shift(-1)) & (df['sente'].isna()), 'sente'] = df['sente'].shift(+1)
but it only works for one additional row not if there is 2 or more.
This groups up one column like I want it:
df.groupby(["sente"])['value'].apply(lambda x: " ".join(x))
But for more columns it doesn't work like I want:
df.groupby(["sente"]).agg(lambda x: ",".join(x))
Is there any way to do this without using stack functions?
Use:
#check NaN values with a boolean mask
m = df['sente'].isnull()
#new column of joined columns, only where the mask is True
df['content'] = np.where(m, df['pos'] + ',' + df['value'], np.nan)
#replace with NaN by mask
df[['pos', 'value']] = df[['pos', 'value']].mask(m)
print (df)
id pos value sente content
0 1 a I 21.0 NaN
1 2 b have 21.0 NaN
2 3 b a 21.0 NaN
3 4 a cat 21.0 NaN
4 5 d ! 21.0 NaN
5 6 NaN NaN NaN cat,N
6 7 a My 22.0 NaN
7 8 a cat 22.0 NaN
8 9 b is 22.0 NaN
9 10 a cute 22.0 NaN
10 11 d . 22.0 NaN
11 12 NaN NaN NaN cat,N
12 13 NaN NaN NaN cute,M
Last, replace the NaNs in sente by forward filling with ffill, then group and join each column with the missing values removed by dropna:
df1 = df.groupby(df["sente"].ffill()).agg(lambda x: " ".join(x.dropna()))
print (df1)
pos value content
sente
21.0 a b b a d I have a cat ! cat,N
22.0 a a b a d My cat is cute . cat,N cute,M
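The desired output quotes each content pair and separates them with commas, while the join above is space-separated; if the exact quoting matters, a per-column agg could be used instead (a sketch):

df1 = df.groupby(df['sente'].ffill()).agg(
    {'pos': lambda x: ' '.join(x.dropna()),
     'value': lambda x: ' '.join(x.dropna()),
     'content': lambda x: ','.join(f"'{v}'" for v in x.dropna())})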