I'm trying to replace some values in one dataframe's column with values from another data frame's column. Here's what the data frames look like. df2 has a lot of rows and columns.
df1
0 1029
0 aaaaa Green
1 bbbbb Green
2 fffff Blue
3 xxxxx Blue
4 zzzzz Green
df2
0 1 2 3 .... 1029
0 aaaaa 1 NaN 14 NaN
1 bbbbb 1 NaN 14 NaN
2 ccccc 1 NaN 14 Blue
3 ddddd 1 NaN 14 Blue
...
25 yyyyy 1 NaN 14 Blue
26 zzzzz 1 NaN 14 Blue
The final df should look like this
0 1 2 3 .... 1029
0 aaaaa 1 NaN 14 Green
1 bbbbb 1 NaN 14 Green
2 ccccc 1 NaN 14 Blue
3 ddddd 1 NaN 14 Blue
...
25 yyyyy 1 NaN 14 Blue
26 zzzzz 1 NaN 14 Green
So basically what needs to happen is that df1[0] and df2[0] need to be matched, and then df2[1029] needs to have its values replaced by the corresponding values in df1[1029] for the rows that matched. I don't want to lose any values in df2['1029'] which are not in df1['1029'].
I believe the re module in python can do that? This is what I have so far:
import re
for line in replace:
    line = re.sub(df1['1029'],
                  '1029',
                  line.rstrip())
    print(line)
But it definitely doesn't work.
I could also use merge as in merged1 = df1.merge(df2, left_index=True, right_index=True, how='inner') but that doesn't replace the values inline.
You need:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'0':['aaaaa','bbbbb','fffff','xxxxx','zzzzz'], '1029':['Green','Green','Blue','Blue','Green']})
df2 = pd.DataFrame({'0':['aaaa','bbbb','ccccc','ddddd','yyyyy','zzzzz',], '1029':[None,None,'Blue','Blue','Blue','Blue']})
# Fill NaNs (note: fillna aligns on the row index, not on column '0')
df2['1029'] = df2['1029'].fillna(df1['1029'])
# Merge the dataframes
df_ = df2.merge(df1, how='left', on=['0'])
# Take df1's value where the key matched, otherwise keep df2's value
df_['1029'] = np.where(df_['1029_y'].isna(), df_['1029_x'], df_['1029_y'])
df_.drop(['1029_y','1029_x'], axis=1, inplace=True)
print(df_)
Output:
0 1029
0 aaaa Green
1 bbbb Green
2 ccccc Blue
3 ddddd Blue
4 yyyyy Blue
5 zzzzz Green
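A key-based lookup is another way to get the result the question asks for. The following is a minimal sketch, assuming the keys in df1['0'] are unique and using the five-letter keys from the question (the `lookup` name is only illustrative). Rows whose key appears in df1 take df1's value; all other rows keep what they had.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"0": ["aaaaa", "bbbbb", "fffff", "xxxxx", "zzzzz"],
                    "1029": ["Green", "Green", "Blue", "Blue", "Green"]})
df2 = pd.DataFrame({"0": ["aaaaa", "bbbbb", "ccccc", "ddddd", "yyyyy", "zzzzz"],
                    "1029": [np.nan, np.nan, "Blue", "Blue", "Blue", "Blue"]})

# Lookup table keyed on column '0' (assumes the keys in df1 are unique).
lookup = df1.set_index("0")["1029"]

# Keys found in df1 get df1's value; unmatched rows map to NaN and are
# then filled back in from df2's existing values.
df2["1029"] = df2["0"].map(lookup).fillna(df2["1029"])
print(df2)
```

Because the match is made on the key column, the row order of the two frames does not matter here.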
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'0':['aa','bb','ff','xx', 'zz'], '1029':['Green', 'Green', 'Blue', 'Blue', 'Green']})
df2 = pd.DataFrame({'0':['aa','bb','cc','dd','ff','gg','hh','xx','yy', 'zz'], '1': [1]*10, '2': [np.nan]*10, '1029':[np.nan, np.nan, 'Blue', 'Blue', np.nan, np.nan, 'Blue', 'Green', 'Blue', 'Blue']})
df1
0 1029
0 aa Green
1 bb Green
2 ff Blue
3 xx Blue
4 zz Green
df2
0 1 1029 2
0 aa 1 NaN NaN
1 bb 1 NaN NaN
2 cc 1 Blue NaN
3 dd 1 Blue NaN
4 ff 1 NaN NaN
5 gg 1 NaN NaN
6 hh 1 Blue NaN
7 xx 1 Green NaN
8 yy 1 Blue NaN
9 zz 1 Blue NaN
The following fills only the NaN entries of df2['1029'], looking each one up in df1 by column '0', so it does not depend on the row order of either data frame.
df2['1029'] = df2['1029'].fillna(df2['0'].map(df1.set_index('0')['1029']))
df2
0 1 1029 2
0 aa 1 Green NaN
1 bb 1 Green NaN
2 cc 1 Blue NaN
3 dd 1 Blue NaN
4 ff 1 Blue NaN
5 gg 1 NaN NaN
6 hh 1 Blue NaN
7 xx 1 Green NaN
8 yy 1 Blue NaN
9 zz 1 Blue NaN
I'm hoping to streamline a function that only returns columns based on a single string value. Using the code below, I have two distinct colours in a df. I want to pass each colour to a function, but I only want the output to include columns relating to that colour.
If I have numerous colours and multiple outputs within the function, the returned df gets too large.
import pandas as pd
import numpy as np
d = ({
    'Date' : ['1/1/18','1/1/18','2/1/18','3/1/18','1/2/18','1/3/18','2/1/19','3/1/19'],
    'Val' : ['A','B','C','D','A','B','C','D'],
    'Blue' : ['Blue', 'Blue', 'Blue', np.nan, np.nan, 'Blue', np.nan, np.nan],
    'Red' : [np.nan, np.nan, np.nan, 'Red', 'Red', np.nan, 'Red', 'Red']
})
df = pd.DataFrame(data = d)
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%y')
df['Count'] = df.Date.map(df.groupby('Date').size())
def func(df, val):
    df['%s_cat' % val] = df['Count'] * 2
    return df
blue = func(df, 'Blue')
red = func(df, 'Red')
Intended output (Blue):
Date Val Blue Count Blue_cat
0 2018-01-01 A Blue 2 4
1 2018-01-01 B Blue 2 4
2 2018-01-02 C Blue 1 2
5 2018-03-01 B Blue 1 2
Intended output (Red):
Date Val Blue Red Count Red_cat
3 2018-01-03 D NaN Red 1 2
4 2018-02-01 A NaN Red 1 2
6 2019-01-02 C NaN Red 1 2
7 2019-01-03 D NaN Red 1 2
Use boolean indexing with DataFrame.copy to avoid SettingWithCopyWarning. If you modify values in the filtered DataFrame later, the modifications do not propagate back to the original data, and pandas warns about that:
def func(df, val):
    df = df[df[val].eq(val)].copy()
    df[f'{val}_cat'] = df['Count'] * 2
    return df
blue = func(df, 'Blue')
print (blue)
Date Val Blue Red Count Blue_cat
0 2018-01-01 A Blue NaN 2 4
1 2018-01-01 B Blue NaN 2 4
2 2018-01-02 C Blue NaN 1 2
5 2018-03-01 B Blue NaN 1 2
red = func(df, 'Red')
print (red)
Date Val Blue Red Count Red_cat
3 2018-01-03 D NaN Red 1 2
4 2018-02-01 A NaN Red 1 2
6 2019-01-02 C NaN Red 1 2
7 2019-01-03 D NaN Red 1 2
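The filtered frames above still carry the other colour's column, filled with NaN. If the goal is to return only the columns that relate to the requested colour, one option is to drop columns that are entirely NaN after filtering. This is a sketch under that assumption; the name `func_drop_empty` is hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["1/1/18", "1/1/18", "2/1/18", "3/1/18", "1/2/18", "1/3/18", "2/1/19", "3/1/19"],
    "Val": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "Blue": ["Blue", "Blue", "Blue", np.nan, np.nan, "Blue", np.nan, np.nan],
    "Red": [np.nan, np.nan, np.nan, "Red", "Red", np.nan, "Red", "Red"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")
df["Count"] = df["Date"].map(df.groupby("Date").size())

def func_drop_empty(df, val):
    # Keep the rows for this colour, then drop columns that are all NaN
    # in the filtered result (here: the other colour's column).
    out = df[df[val].eq(val)].dropna(axis=1, how="all").copy()
    out[f"{val}_cat"] = out["Count"] * 2
    return out

blue = func_drop_empty(df, "Blue")
red = func_drop_empty(df, "Red")
print(blue)
print(red)
```

Note that this drops any column that happens to be all NaN for the selected rows, not only colour columns, so it may need an explicit column list if other columns can be empty.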
I want to pass a cumulative sum of unique values to a separate column. However, I want to disregard nan values so it essentially skips these rows and continues the count with the next viable row.
d = {'Item': [np.nan, "Blue", "Blue", np.nan, "Red", "Blue", "Blue", "Red"],
}
df = pd.DataFrame(data=d)
df['count'] = df.Item.ne(df.Item.shift()).cumsum()
Intended output:
Item count
0 NaN NaN
1 Blue 1
2 Blue 1
3 NaN NaN
4 Red 2
5 Blue 3
6 Blue 3
7 Red 4
Try:
df['count'] =(df.Item.ne(df.Item.shift()) & df.Item.notna()).cumsum().mask(df.Item.isna())
OR
as suggested by @SeanBean:
df['count'] =df.Item.ne(df.Item.shift()).mask(df.Item.isna()).cumsum()
Output of df:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0
Here's one way:
NOTE: (you just need to add the where condition):
df['count'] = df.Item.ne(df.Item.shift()).where(~df.Item.isna()).cumsum()
OUTPUT:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0
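A variation that may be worth considering: drop the NaN rows first, count the runs on what remains, and align the result back to the original index. This is a sketch, not taken from the answers above; the `valid` name is illustrative, and the cast to the nullable Int64 dtype is an optional choice to show whole numbers as in the intended output. With this approach, a run of the same item that is interrupted only by NaN rows keeps the same count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Item": [np.nan, "Blue", "Blue", np.nan, "Red", "Blue", "Blue", "Red"]})

# Work only on the non-missing rows; their original index labels are kept.
valid = df["Item"].dropna()

# A new group starts wherever the item differs from the previous valid item.
counts = valid.ne(valid.shift()).cumsum()

# Align back to the full index (NaN rows stay missing) and use a nullable
# integer dtype so the counts print as 1, 2, 3 rather than 1.0, 2.0, 3.0.
df["count"] = counts.reindex(df.index).astype("Int64")
print(df)
```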
I have df like this
df1:
Name A B C
a b r t y U
0 xyz 1 2 3 4 3 4
1 abc 3 5 4 7 7 8
2 pqr 2 4 4 5 4 6
df2:
Name A B C
a b r t y U
0 xyz Nan Nan Nan Nan Nan Nan
1 abc 2 4 5 7 7 9
2 pqr Nan Nan Nan Nan Nan Nan
I want a df like this
Name A B C
a b r t y U
0 xyz Nan Nan Nan Nan Nan Nan
1 abc 5 9 9 14 14 17
2 pqr Nan Nan Nan Nan Nan Nan
Basically, I want the sum for the abc row only.
First check what the column names are. Here the first column is evidently the tuple ('Name', ''), so set it as the index and then add the two frames:
print (df1.columns.tolist())
print (df2.columns.tolist())
df1 = df1.set_index([('Name', '')])
df2 = df2.set_index([('Name', '')])
#set by position
#df1 = df1.set_index([df1.columns[0]])
#df2 = df2.set_index([df2.columns[0]])
df = df1.add(df2)
Or:
df = df1 + df2
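To make the steps above reproducible, here is a small self-contained sketch. It assumes the frames have two-level column labels as printed in the question; the construction of `cols`, `left`, `right` and `total` is illustrative only. Because df2 holds NaN for xyz and pqr, the sum is NaN for those rows and only abc gets values.

```python
import numpy as np
import pandas as pd

# Two-level column labels, assumed from the printed frames in the question.
cols = pd.MultiIndex.from_tuples(
    [("Name", ""), ("A", "a"), ("A", "b"), ("B", "r"), ("B", "t"), ("C", "y"), ("C", "U")]
)
nan = np.nan
df1 = pd.DataFrame([["xyz", 1, 2, 3, 4, 3, 4],
                    ["abc", 3, 5, 4, 7, 7, 8],
                    ["pqr", 2, 4, 4, 5, 4, 6]], columns=cols)
df2 = pd.DataFrame([["xyz", nan, nan, nan, nan, nan, nan],
                    ["abc", 2, 4, 5, 7, 7, 9],
                    ["pqr", nan, nan, nan, nan, nan, nan]], columns=cols)

# Move the name column into the index so that only numeric columns are added.
left = df1.set_index([("Name", "")])
right = df2.set_index([("Name", "")])

# Addition aligns on index and columns; NaN in either operand gives NaN.
total = left.add(right)
print(total)
```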
I'm trying to group several sets of columns to count or sum the rows in a pandas dataframe.
I've checked many questions already, and the most similar one I found is "Groupby sum and count on multiple columns in python", but as far as I understand, I would have to do many steps to reach my goal. I was also looking at this link.
As an example, I have the dataframe below:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(5, 7)), columns=["grey2","red1","blue1","red2","red3","blue2","grey1"])
grey2 red1 blue1 red2 red3 blue2 grey1
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
I want to group here, all the columns by colour, for example, and what I would expect is:
If I sum the numbers,
blue 15
grey 22
red 34
If I count ( x > 0 ) then I will get,
blue 7
grey 10
red 13
This is what I have achieved so far. I would now have to sum these and then create a dataframe with the results, but if I have 100 groups, this would be very time consuming.
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='sum', margins=True)
red1 red2 red3
0 3 2 4
1 2 4 0
2 1 1 1
3 4 4 1
4 4 0 3
All 14 11 9
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='count', margins=True)
But this also counts the zeros:
red1 red2 red3
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
All 5 5 5
I'm not sure how to alter the function to get my results, and I've already spent hours on this, so hopefully you can help.
NOTE:
I only use colours in this example to simplify the case, but I could have many columns, named col001 through col300, etc.
So, the groups could be:
blue = col131, col254, col005
red = col023, col190, col053
and so on.....
You can use pd.wide_to_long:
data= pd.wide_to_long(df.reset_index(), stubnames=['grey','red','blue'],
i='index',
j='group',
sep=''
)
Output:
# data
grey red blue
index group
0 1 2.0 3 0.0
2 4.0 2 0.0
3 NaN 4 NaN
1 1 1.0 2 0.0
2 4.0 4 3.0
3 NaN 0 NaN
2 1 1.0 1 3.0
2 1.0 1 3.0
3 NaN 1 NaN
3 1 1.0 4 1.0
2 4.0 4 1.0
3 NaN 1 NaN
4 1 1.0 4 1.0
2 3.0 0 3.0
3 NaN 3 NaN
And:
data.sum()
# grey 22.0
# red 34.0
# blue 15.0
# dtype: float64
data.gt(0).sum()
# grey 10
# red 13
# blue 7
# dtype: int64
Update: wide_to_long is just a convenient shortcut for merge and rename. So if you have a dictionary {cat: [col_list]}, you could resort to that:
groups = {'blue' : ['col131', 'col254', 'col005'],
'red' : ['col023', 'col190', 'col053']}
# create the inverse dictionary for mapping
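# (each column name maps to its group; the lists themselves cannot be dict keys)
inv_group = {col: k for k, cols in groups.items() for col in cols}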
data = df.melt()
# map the original columns to group
data['group'] = data['variable'].map(inv_group)
# from now on, it's similar to other answers
# sum
data.groupby('group')['value'].sum()
# count
data['value'].gt(0).groupby(data['group']).sum()
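For reference, the same dictionary-based steps can be run end to end on the colour data from the question. This is a sketch: the `groups` dictionary below is written out by hand for the example columns, and the names `col_to_group`, `long_df`, `totals` and `positives` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[4, 3, 0, 2, 4, 0, 2],
     [4, 2, 0, 4, 0, 3, 1],
     [1, 1, 3, 1, 1, 3, 1],
     [4, 4, 1, 4, 1, 1, 1],
     [3, 4, 1, 0, 3, 3, 1]],
    columns=["grey2", "red1", "blue1", "red2", "red3", "blue2", "grey1"],
)

# Group name -> member columns (hand-written for this example).
groups = {"blue": ["blue1", "blue2"],
          "red": ["red1", "red2", "red3"],
          "grey": ["grey1", "grey2"]}

# Invert to column -> group name so it can be used with Series.map.
col_to_group = {col: name for name, cols in groups.items() for col in cols}

# melt() gives one row per cell, with columns 'variable' and 'value'.
long_df = df.melt()
long_df["group"] = long_df["variable"].map(col_to_group)

totals = long_df.groupby("group")["value"].sum()
positives = long_df["value"].gt(0).groupby(long_df["group"]).sum()
print(totals)
print(positives)
```

The two results match the sums (blue 15, grey 22, red 34) and the counts of positive values (blue 7, grey 10, red 13) given in the question.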
The complication here is that you want to collapse both by rows and columns, which is generally difficult to do at the same time. We can melt to go from your wide format to a longer format, which then reduces the problem to a single groupby.
# Get rid of the numbers + reshape
df.columns = pd.Index(df.columns.str.rstrip('0123456789'), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#blue 15
#grey 22
#red 34
df.value.gt(0).groupby(df.color).sum()
#color
#blue 7.0
#grey 10.0
#red 13.0
#Name: value, dtype: float64
With names that are less simple to group, we'd need to have the mapping somewhere, the steps are very similar:
# Unnecessary in this case, but more general
d = {'grey1': 'color_1', 'grey2': 'color_1',
'red1': 'color_2', 'red2': 'color_2', 'red3': 'color_2',
'blue1': 'color_3', 'blue2': 'color_3'}
df.columns = pd.Index(df.columns.map(d), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#color_1 22
#color_2 34
#color_3 15
Use:
df.T.groupby(df.columns.str.replace(r'\d+', '', regex=True)).sum().sum(axis=1)
Output:
blue 15
grey 22
red 34
dtype: int64
This works regardless of the number of digits contained in the column names:
df=df.add_suffix('2222')
print(df)
grey22222 red12222 blue12222 red22222 red32222 blue22222 grey12222
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
df.T.groupby(df.columns.str.replace(r'\d+', '', regex=True)).sum().sum(axis=1)
blue 15
grey 22
red 34
dtype: int64
You could also do something like this for the general case:
colors = {'blue':['blue1','blue2'], 'red':['red1','red2','red3'], 'grey':['grey1','grey2']}
orig_columns = df.columns
df.columns = [key for col in df.columns for key in colors.keys() if col in colors[key]]
print(df.T.groupby(level=0).sum().sum(axis=1))
df.columns = orig_columns
I can't find the answer to this in here.
I have two dataframes:
index, name, color, day
0 Nan Nan Nan
1 b red thu
2 Nan Nan Nan
3 d green mon
index, name, color, week
0 c blue 1
1 Nan Nan Nan
2 t yellow 4
3 Nan Nan Nan
And I'd like the result to be one dataframe:
index, name, color, day, week
0 c Blue Nan 1
1 b red thu Nan
2 t yellow Nan 4
3 d green mon Nan
Is there a way to merge the dataframes on their indexes, while adding new columns?
You can use DataFrame.combine_first:
df = df1.combine_first(df2)
print (df)
color day name week
0 blue NaN c 1.0
1 red thu b NaN
2 yellow NaN t 4.0
3 green mon d NaN
For a custom column order, build the column names with numpy.concatenate and pd.unique, and then add reindex:
cols = pd.unique(np.concatenate([df1.columns, df2.columns]))
df = df1.combine_first(df2).reindex(columns=cols)
print (df)
name color day week
0 c blue NaN 1.0
1 b red thu NaN
2 t yellow NaN 4.0
3 d green mon NaN
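A self-contained version of the above, for anyone who wants to run it. This is a sketch that rebuilds the two frames from the question by hand and, as one possible choice, derives the column order with dict.fromkeys (which keeps first-seen order) and applies it with DataFrame.reindex(columns=...).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"name": [np.nan, "b", np.nan, "d"],
                    "color": [np.nan, "red", np.nan, "green"],
                    "day": [np.nan, "thu", np.nan, "mon"]})
df2 = pd.DataFrame({"name": ["c", np.nan, "t", np.nan],
                    "color": ["blue", np.nan, "yellow", np.nan],
                    "week": [1, np.nan, 4, np.nan]})

# Column order: df1's columns first, then any new columns from df2.
cols = list(dict.fromkeys([*df1.columns, *df2.columns]))

# combine_first fills the missing cells of df1 from df2, aligned on the index,
# and returns the union of columns; reindex restores the desired order.
out = df1.combine_first(df2).reindex(columns=cols)
print(out)
```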
EDIT:
To put the week values into the day column instead, rename the column before combining:
df = df1.combine_first(df2.rename(columns={'week':'day'}))
print (df)
name color day
0 c blue 1
1 b red thu
2 t yellow 4
3 d green mon