reuse function with multiple string values pandas - python

I'm hoping to streamline a function that only return columns based on a single string value. Using below, I have two distinct colours in a df. I want to pass each colour to a function. But I only want the output to include columns relating to that colour.
If I have numerous colours and multiple outputs within the function, the returned df gets too large.
import pandas as pd
import numpy as np
d = ({
'Date' : ['1/1/18','1/1/18','2/1/18','3/1/18','1/2/18','1/3/18','2/1/19','3/1/19'],
'Val' : ['A','B','C','D','A','B','C','D'],
'Blue' : ['Blue', 'Blue', 'Blue', np.NaN, np.NaN, 'Blue', np.NaN, np.NaN],
'Red' : [np.NaN, np.NaN, np.NaN, 'Red', 'Red', np.NaN, 'Red', 'Red']
})
df = pd.DataFrame(data = d)
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%y')
df['Count'] = df.Date.map(df.groupby('Date').size())
def func(df, val):
df['%s_cat' % val] = df['Count'] * 2
return df
blue = func(df, 'Blue')
red = func(df, 'Red')
Intended output (Blue):
Date Val Blue Count Blue_cat
0 2018-01-01 A Blue 2 4
1 2018-01-01 B Blue 2 4
2 2018-01-02 C Blue 1 2
5 2018-03-01 B Blue 1 2
Intended output (Red):
Date Val Blue Red Count Red_cat
3 2018-01-03 D NaN Red 1 2
4 2018-02-01 A NaN Red 1 2
6 2019-01-02 C NaN Red 1 2
7 2019-01-03 D NaN Red 1 2

Use boolean indexing with DataFrame.copy for avoid SettingWithCopyWarning, because if you modify values in filtered DataFrame later you will find that the modifications do not propagate back to the original data, and that Pandas does warning:
def func(df, val):
df = df[df[val].eq(val)].copy()
df[f'{val}_cat'] = df['Count'] * 2
return df
blue = func(df, 'Blue')
print (blue)
Date Val Blue Red Count Blue_cat
0 2018-01-01 A Blue NaN 2 4
1 2018-01-01 B Blue NaN 2 4
2 2018-01-02 C Blue NaN 1 2
5 2018-03-01 B Blue NaN 1 2
red = func(df, 'Red')
print (red)
Date Val Blue Red Count Red_cat
3 2018-01-03 D NaN Red 1 2
4 2018-02-01 A NaN Red 1 2
6 2019-01-02 C NaN Red 1 2
7 2019-01-03 D NaN Red 1 2

Related

Cumsum with nan values - pandas

I want to pass a cumulative sum of unique values to a separate column. However, I want to disregard nan values so it essentially skips these rows and continues the count with the next viable row.
d = {'Item': [np.nan, "Blue", "Blue", np.nan, "Red", "Blue", "Blue", "Red"],
}
df = pd.DataFrame(data=d)
df['count'] = df.Item.ne(df.Item.shift()).cumsum()
intended out:
Item count
0 NaN NaN
1 Blue 1
2 Blue 1
3 NaN NaN
4 Red 2
5 Blue 3
6 Blue 3
7 Red 4
Try:
df['count'] =(df.Item.ne(df.Item.shift()) & df.Item.notna()).cumsum().mask(df.Item.isna())
OR
as suggested by #SeanBean:
df['count'] =df.Item.ne(df.Item.shift()).mask(df.Item.isna()).cumsum()
Output of df:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0
Here's one way:
NOTE: (you just need to add the where condition):
df['count'] = df.Item.ne(df.Item.shift()).where(~df.Item.isna()).cumsum()
OUTPUT:
Item count
0 NaN NaN
1 Blue 1.0
2 Blue 1.0
3 NaN NaN
4 Red 2.0
5 Blue 3.0
6 Blue 3.0
7 Red 4.0

Pandas: Remove values that meet condition

Let's say I have data like this:
df = pd.DataFrame({'category': ["blue","red","blue", "blue","green"], 'val1': [5, 3, 2, 2, 5], 'val2':[1, 3, 2, 2, 5], 'val3': [2, 1, 1, 4, 3]})
print(df)
category val1 val2 val3
0 blue 5 1 2
1 red 3 3 1
2 blue 2 2 1
3 blue 2 2 4
4 green 5 5 3
How do I remove (or replace with for example NaN) values that meet a certain condition without removing the entire row or shift the column?
Let's say my condition is that I want to remove all values below 3 from the above data, the result would have to look like this:
category val1 val2 val3
0 blue 5
1 red 3 3
2 blue
3 blue 4
4 green 5 5 3
Use mask:
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.iloc[:, 1:] < 3)
print(df)
Output
category val1 val2 val3
0 blue 5.0 NaN NaN
1 red 3.0 3.0 NaN
2 blue NaN NaN NaN
3 blue NaN NaN 4.0
4 green 5.0 5.0 3.0
If you want to set particular value, for example 0, do:
df.iloc[:, 1:] = df.iloc[:, 1:].mask(df.iloc[:, 1:] < 3, 0)
print(df)
Output
category val1 val2 val3
0 blue 5 0 0
1 red 3 3 0
2 blue 0 0 0
3 blue 0 0 4
4 green 5 5 3
If you just need a few columns, you could do:
df[['val1', 'val2', 'val3']] = df[['val1', 'val2', 'val3']].mask(df[['val1', 'val2', 'val3']] < 3)
print(df)
Output
category val1 val2 val3
0 blue 5.0 NaN NaN
1 red 3.0 3.0 NaN
2 blue NaN NaN NaN
3 blue NaN NaN 4.0
4 green 5.0 5.0 3.0
One approach is to create a mask of the values that don't meet the removal criteria.
mask = df[['val1','val2','val3']] > 3
You can then create a new df, that is just the non-removed vals.
updated_df = df[['val1','val2','val3']][mask]
You need to add back in the unaffected columns.
updated_df['category'] = df['category']
You can use applymap or transform to columns containing integers.
df[df.iloc[:,1:].transform(lambda x: x>=3)].fillna('')

Fill one column by excluding specific values in another column in Pandas

For a dataframe df, I'm trying to fill column b by value 2017-01-01 if the values in column a are either empty NaNs or Others:
df = pd.DataFrame({'a':['Coffee','Muffin','Donut','Others',pd.np.nan, pd.np.nan]})
a
0 Coffee
1 Muffin
2 Donut
3 Others
4 NaN
5 NaN
The expected result is like this:
a b
0 Coffee 2017-01-01
1 Muffin 2017-01-01
2 Donut 2017-01-01
3 Others NaN
4 NaN NaN
5 NaN NaN
What I have tried which didn't exclude NaNs:
df.loc[~df['a'].isin(['nan', 'Others']), 'b'] = '2017-01-01'
a b
0 Coffee 2017-01-01
1 Muffin 2017-01-01
2 Donut 2017-01-01
3 Others NaN
4 NaN 2017-01-01
5 NaN 2017-01-01
Thanks!
Use np.nan instead nan:
df.loc[~df['a'].isin([np.nan, 'Others']), 'b'] = '2017-01-01'
Or before comparing replace missing values by Others:
df.loc[~df['a'].fillna('Others').eq('Others'), 'b'] = '2017-01-01'
print (df)
a b
0 Coffee 2017-01-01
1 Muffin 2017-01-01
2 Donut 2017-01-01
3 Others NaN
4 NaN NaN
5 NaN NaN
check this out:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['Coffee', 'Muffin', 'Donut', 'Others', pd.np.nan, pd.np.nan]})
conditions = [
(df['a'] == 'Others'),
(df['a'].isnull())
]
choices = [np.nan, np.nan]
df['color'] = np.select(conditions, choices, default='2017-01-01')
print(df)
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':['Coffee','Muffin','Donut','Others',pd.np.nan, pd.np.nan]})
df.loc[df['a'].replace('Others',np.nan).notnull(),'b'] = '2017-01-01'
print(df)

Pandas dataframe: Replace multiple rows based on values in another column

I'm trying to replace some values in one dataframe's column with values from another data frame's column. Here's what the data frames look like. df2 has a lot of rows and columns.
df1
0 1029
0 aaaaa Green
1 bbbbb Green
2 fffff Blue
3 xxxxx Blue
4 zzzzz Green
df2
0 1 2 3 .... 1029
0 aaaaa 1 NaN 14 NaN
1 bbbbb 1 NaN 14 NaN
2 ccccc 1 NaN 14 Blue
3 ddddd 1 NaN 14 Blue
...
25 yyyyy 1 NaN 14 Blue
26 zzzzz 1 NaN 14 Blue
The final df should look like this
0 1 2 3 .... 1029
0 aaaaa 1 NaN 14 Green
1 bbbbb 1 NaN 14 Green
2 ccccc 1 NaN 14 Blue
3 ddddd 1 NaN 14 Blue
...
25 yyyyy 1 NaN 14 Blue
26 zzzzz 1 NaN 14 Green
So basically what needs to happen is that df1[0] and df[2] need to be matched and then df2[1029] needs to have values replaced by the corresponding row in df1[1029] for the rows that matched. I don't want to lose any values in df2['1029'] which are not in df1['1029']
I believe the re module in python can do that? This is what I have so far:
import re
for line in replace:
line = re.sub(df1['1029'],
'1029',
line.rstrip())
print(line)
But it definitely doesn't work.
I could also use merge as in merged1 = df1.merge(df2, left_index=True, right_index=True, how='inner') but that doesn't replace the values inline.
You need:
df1 = pd.DataFrame({'0':['aaaaa','bbbbb','fffff','xxxxx','zzzzz'], '1029':['Green','Green','Blue','Blue','Green']})
df2 = pd.DataFrame({'0':['aaaa','bbbb','ccccc','ddddd','yyyyy','zzzzz',], '1029':[None,None,'Blue','Blue','Blue','Blue']})
# Fill NaNs
df2['1029'] = df2['1029'].fillna(df1['1029'])
# Merge the dataframes
df_ = df2.merge(df1, how='left', on=['0'])
df_['1029'] = np.where(df_['1029_y'].isna(), df_['1029_x'], df_['1029_y'])
df_.drop(['1029_y','1029_x'],1,inplace=True)
print(df_)
Output:
0 1029
0 aaaa Green
1 bbbb Green
2 ccccc Blue
3 ddddd Blue
4 yyyyy Blue
5 zzzzz Green
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'0':['aa','bb','ff','xx', 'zz'], '1029':['Green', 'Green', 'Blue', 'Blue', 'Green']})
df2 = pd.DataFrame({'0':['aa','bb','cc','dd','ff','gg','hh','xx','yy', 'zz'], '1': [1]*10, '2': [np.nan]*10, '1029':[np.nan, np.nan, 'Blue', 'Blue', np.nan, np.nan, 'Blue', 'Green', 'Blue', 'Blue']})
df1
0 1029
0 aa Green
1 bb Green
2 ff Blue
3 xx Blue
4 zz Green
df2
0 1 1029 2
0 aa 1 NaN NaN
1 bb 1 NaN NaN
2 cc 1 Blue NaN
3 dd 1 Blue NaN
4 ff 1 NaN NaN
5 gg 1 NaN NaN
6 hh 1 Blue NaN
7 xx 1 Green NaN
8 yy 1 Blue NaN
9 zz 1 Blue NaN
If the column '0' in both the data frames is sorted, this will work.
df2.loc[(df2['1029'].isna() & df2['0'].isin(df1['0'])), '1029'] = df1['1029'][df2['0'].isin(df1['0'])].tolist()
df2
0 1 1029 2
0 aa 1 Green NaN
1 bb 1 Green NaN
2 cc 1 Blue NaN
3 dd 1 Blue NaN
4 ff 1 Green NaN
5 gg 1 NaN NaN
6 hh 1 Blue NaN
7 xx 1 Green NaN
8 yy 1 Blue NaN
9 zz 1 Blue NaN

merging dataframes on the same index

I can't find the answer to this in here.
I have two dataframes:
index, name, color, day
0 Nan Nan Nan
1 b red thu
2 Nan Nan Nan
3 d green mon
index, name, color, week
0 c blue 1
1 Nan Nan Nan
2 t yellow 4
3 Nan Nan Nan
And I'd like the result to be one dataframe:
index, name, color, day, week
0 c Blue Nan 1
1 b red thu Nan
2 t yellow Nan 4
3 d green mon Nan
Is there a way to merge the dataframes on their indexes, while adding new columns?
You can use DataFrame.combine_first:
df = df1.combine_first(df2)
print (df)
color day name week
0 blue NaN c 1.0
1 red thu b NaN
2 yellow NaN t 4.0
3 green mon d NaN
For custom order of columns create columns names by numpy.concatenate, pd.unique and then add reindex_axis:
cols = pd.unique(np.concatenate([df1.columns, df2.columns]))
df = df1.combine_first(df2).reindex_axis(cols, axis=1)
print (df)
name color day week
0 c blue NaN 1.0
1 b red thu NaN
2 t yellow NaN 4.0
3 d green mon NaN
EDIT:
Use rename columns:
df = df1.combine_first(df2.rename(columns={'week':'day'}))
print (df)
name color day
0 c blue 1
1 b red thu
2 t yellow 4
3 d green mon

Categories

Resources