I have a DataFrame that may contain NaN values in two columns of the first row. If so, I want to replace those values with 0; however, if integers are present there, they should be left as is. So this df should have X and Y replaced with 0:
df = pd.DataFrame({
    'Code1' : ['A','A','B','B','C','C'],
    'Code2' : [np.nan,np.nan,5,np.nan,np.nan,10],
    'X' : [np.nan,np.nan,1,np.nan,np.nan,3],
    'Y' : [np.nan,np.nan,2,np.nan,np.nan,4],
})
1)
if df.loc[0,'X':'Y'] == np.nan:
    df.loc[:0, 'X':'Y'] = 0
2)
if df.loc[[0],['X','Y']].isnull():
    df.loc[:0, 'X':'Y'] = 0
else:
    pass
But in this example df nothing should be replaced with 0, since integers exist:
df1 = pd.DataFrame({
    'Code1' : ['A','A','B','B','C','C'],
    'Code2' : [np.nan,np.nan,5,np.nan,np.nan,10],
    'X' : [5,np.nan,1,np.nan,np.nan,3],
    'Y' : [6,np.nan,2,np.nan,np.nan,4],
})
df.loc[0, ['X', 'Y']] = df.loc[0, ['X', 'Y']].fillna(0)
>>> df
Code1 Code2 X Y
0 A NaN 0.0 0.0
1 A NaN NaN NaN
2 B 5.0 1.0 2.0
3 B NaN NaN NaN
4 C NaN NaN NaN
5 C 10.0 3.0 4.0
Try this:
if sum(pd.isnull(df.loc[0,'X':'Y'])) == 2:
    df.loc[0, ['X', 'Y']] = df.loc[0, ['X', 'Y']].fillna(0)
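A slightly more defensive variant (a sketch, using the same columns as above) checks that every target cell in row 0 is NaN before filling, so the test does not depend on hard-coding the column count:
cols = ['X', 'Y']
# only fill when every one of the target cells in row 0 is NaN
if df.loc[0, cols].isnull().all():
    df.loc[0, cols] = 0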
Related
I would like to update a dataframe with another one, but with multiple "destinations". Here is an example:
df1 = pd.DataFrame({'name':['A', 'B', 'C', 'A'], 'category':['X', 'X', 'Y', 'Y'], 'value1':[None, 1, None, None], 'value2':[None, 10, None, None]})
name category value1 value2
0 A X NaN NaN
1 B X 1.0 10.0
2 C Y NaN NaN
3 A Y NaN NaN
df2 = pd.DataFrame({'name':['A', 'C'], 'value1':[2, 3], 'value2':[11, 12]})
name value1 value2
0 A 2 11
1 C 3 12
And the desired result would be
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0
I don't think pd.update works, since 'A' appears twice in my first DataFrame.
pd.merge creates extra columns, and there is probably a more elegant way than merging those columns manually after they are created.
Thanks in advance for your help!
You can use fillna after mapping the 'name' column of df1 to the corresponding values from df2. For a single column, say value1:
mapping = df2.set_index('name')['value1']
df1['value1'] = df1['value1'].fillna(df1['name'].map(mapping))
If you want to map multiple columns:
mapping = df2.set_index('name')
for col in mapping:
    df1[col] = df1[col].fillna(df1['name'].map(mapping[col]))
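For reference, a minimal end-to-end run of the map approach on the example data (nothing here beyond the frames defined above):
import pandas as pd

df1 = pd.DataFrame({'name': ['A', 'B', 'C', 'A'],
                    'category': ['X', 'X', 'Y', 'Y'],
                    'value1': [None, 1, None, None],
                    'value2': [None, 10, None, None]})
df2 = pd.DataFrame({'name': ['A', 'C'], 'value1': [2, 3], 'value2': [11, 12]})

mapping = df2.set_index('name')
for col in mapping:  # iterates over 'value1', 'value2'
    df1[col] = df1[col].fillna(df1['name'].map(mapping[col]))
print(df1)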
Alternatively you can try merge:
df = df1.merge(df2, on='name', how='left', suffixes=['', '_r'])
df.groupby(df.columns.str.rstrip('_r'), axis=1, sort=False).first()
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0
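Note that groupby(..., axis=1) is deprecated in recent pandas, so the second line above may warn or fail on newer versions. A sketch of the same merge idea that fills the original columns explicitly (column names follow the example above):
merged = df1.merge(df2, on='name', how='left', suffixes=['', '_r'])
for col in ['value1', 'value2']:
    # take the merged-in value wherever the original is NaN
    merged[col] = merged[col].fillna(merged[col + '_r'])
result = merged.drop(columns=['value1_r', 'value2_r'])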
Can I use pandas pivot_table to aggregate over a column with missing values and have those missing values included as separate category?
In:
df = pd.DataFrame({'a': pd.Series(['X', 'X', 'Y', 'Y', 'N', 'N'], dtype='category'),
                   'b': pd.Series([None, None, 'd', 'd', 'd', 'd'], dtype='category')})
Out:
a b
0 X NaN
1 X NaN
2 Y d
3 Y d
4 N d
5 N d
In:
df.groupby('a')['b'].apply(lambda x: x.value_counts(dropna=False)).unstack(1)
Out:
NaN d
a
N NaN 2.0
X 2.0 0.0
Y NaN 2.0
Can I achieve the same result using pandas pivot_table? If yes than how? Thanks.
For some unknown reason, dtype="category" does not work with pivot_table() when counting NaN values. Casting the columns to regular strings makes a plain pivot_table(aggfunc="size") work:
df.astype(str).pivot_table(index="a", columns="b", aggfunc="size")
Result
b d nan
a
N 2.0 NaN
X NaN 2.0
Y 2.0 NaN
One can optionally append .fillna(0) to replace the NaNs with 0s.
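Putting it together as a runnable sketch (only the frame from the question plus the optional fill):
import pandas as pd

df = pd.DataFrame({'a': pd.Series(['X', 'X', 'Y', 'Y', 'N', 'N'], dtype='category'),
                   'b': pd.Series([None, None, 'd', 'd', 'd', 'd'], dtype='category')})

# cast to plain strings so NaN becomes the literal string 'nan',
# which pivot_table then counts as its own category
result = (df.astype(str)
            .pivot_table(index='a', columns='b', aggfunc='size')
            .fillna(0))
print(result)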
DataFrame df has many thousands of columns and rows. For a subset of columns given in a particular sequence, say columns B, C, E, I want to fill NaN values in B with the first non-NaN value found in the remaining columns (C, E), searching sequentially. Finally, C and E are dropped.
Sample df can be built as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(10*(2+np.random.randn(6, 5)), columns=list('ABCDE'))
df.loc[1, 'B'] = np.nan
df.loc[2, 'B'] = np.nan
df.loc[5, 'B'] = np.nan
df.loc[2, 'C'] = np.nan
df.loc[5, 'C'] = np.nan
df.loc[2, 'D'] = np.nan
df.loc[2, 'E'] = np.nan
df.loc[4, 'E'] = np.nan
df
A B C D E
0 18.161033 6.453597 25.253036 18.542586 20.667311
1 27.629402 NaN 40.654821 22.804547 23.633502
2 15.459256 NaN NaN NaN NaN
3 19.115203 4.002131 14.167508 23.796780 29.557706
4 27.180622 NaN 20.763618 15.923794 NaN
5 17.917170 NaN NaN 21.865184 9.867743
The expected outcome is as follows:
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here is one way (the values below differ from the sample above because the frame is built from random numbers):
drop = ['C', 'E']
fill = 'B'
d = dict(zip(df.columns, [fill if x in drop else x for x in df.columns.tolist()]))
df.groupby(d, axis=1).first()
Out[172]:
A B D
0 14.472915 30.598602 24.528571
1 22.010242 22.215140 15.412039
2 5.383674 NaN NaN
3 38.265940 24.746673 35.367622
4 22.730089 20.244289 27.570413
5 31.216037 15.496690 9.746814
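Note that groupby(..., axis=1) is deprecated in recent pandas. As a sketch under that assumption, the same grouping can be expressed by transposing, grouping the mapped index labels, and transposing back:
drop = ['C', 'E']
fill = 'B'
d = {c: (fill if c in drop else c) for c in df.columns}
# rows of df.T are the original columns, so first() picks the first
# non-NaN value per group in original column order (B, then C, then E)
result = df.T.groupby(d).first().T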
IIUC, use bfill to backfill, then drop to remove unwanted columns.
df.assign(B=df[['B', 'C', 'E']].bfill(axis=1)['B']).drop(['C', 'E'], axis=1)
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here's a slightly more generalised version of the one above,
to_drop = ['C', 'E']
upd = 'B'
df.update(df[[upd, *to_drop]].bfill(axis=1)[upd]) # in-place
df.drop(to_drop, axis=1) # not in-place, need to assign
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
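A small reusable helper wrapping the same bfill-then-drop idea (a sketch; the name coalesce_into is my own, not a pandas API):
import pandas as pd

def coalesce_into(df, target, fallbacks):
    # fill NaNs in `target` with the first non-NaN value found in
    # `fallbacks` (searched left to right), then drop the fallbacks
    out = df.copy()
    out[target] = out[[target, *fallbacks]].bfill(axis=1)[target]
    return out.drop(columns=fallbacks)

# usage on the sample frame built above:
# df = coalesce_into(df, 'B', ['C', 'E'])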
I currently have a Pandas DataFrame in which I'm performing comparisons between columns. I found a case where both columns being compared are empty, and for some reason the comparison returns the else value. I added an extra statement to reset those rows to empty. I'm looking to see if I can simplify this into a single statement:
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
Code
df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', ''],
    'a_score': [1, 2, 3, 4, '', 6, ''],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, ''],
})
print(df)
# Replace empty or whitespace-only strings with NaN
df = df.replace(r'^\s*$', np.nan, regex=True)
# Calculate higher score
df['doc_id'] = df.apply(lambda row: row['a_id'] if row['a_score'] >= row['b_score'] else row['b_id'], axis=1)
# Select type based on higher score
df['doc_type'] = df.apply(lambda row: 'a' if row['a_score'] >= row['b_score'] else 'b', axis=1)
print(df)
# Update type when both ids are empty
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
print(df)
You can use numpy.where instead of apply. Also, when selecting by a boolean mask to assign into a column, it is better to use this pattern:
df.loc[mask, 'colname'] = val
# Replace empty or whitespace-only strings with NaN
df = df.replace(r'^\s*$', np.nan, regex=True)
# Calculate higher score
df['doc_id'] = np.where(df['a_score'] >= df['b_score'], df['a_id'], df['b_id'])
# Select type based on higher score
df['doc_type'] = np.where(df['a_score'] >= df['b_score'], 'a', 'b')
print (df)
# Update type when both ids are empty
df.loc[(df['a_id'].isnull() & df['b_id'].isnull()), 'doc_type'] = ''
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
An alternative mask uses DataFrame.all to check whether all values in a row are True (axis=1):
print (df[['a_id', 'b_id']].isnull())
a_id b_id
0 False False
1 False False
2 False False
3 False False
4 True False
5 False False
6 True True
print (df[['a_id', 'b_id']].isnull().all(axis=1))
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
df.loc[df[['a_id', 'b_id']].isnull().all(axis=1), 'doc_type'] = ''
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
But better still is a nested numpy.where:
# Replace empty or whitespace-only strings with NaN
df = df.replace(r'^\s*$', np.nan, regex=True)
# create the boolean masks once, to avoid comparing twice
mask = df['a_score'] >= df['b_score']
mask1 = (df['a_id'].isnull() & df['b_id'].isnull())
# alternative solution for mask1
#mask1 = df[['a_id', 'b_id']].isnull().all(axis=1)
# Calculate higher score
df['doc_id'] = np.where(mask, df['a_id'], df['b_id'])
# Select type based on higher score
df['doc_type'] = np.where(mask, 'a', np.where(mask1, '', 'b'))
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
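An equivalent formulation (a sketch, not part of the answer above) uses numpy.select, which can be easier to extend when more branches appear; listing the both-missing condition first gives it precedence:
import numpy as np

conditions = [
    df['a_id'].isnull() & df['b_id'].isnull(),  # both ids missing -> ''
    df['a_score'] >= df['b_score'],             # a wins -> 'a'
]
choices = ['', 'a']
df['doc_type'] = np.select(conditions, choices, default='b')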
I want to make sure that when Column A is NULL (in csv), or NaN (in dataframe), Column B is "Cash".
I've tried this:
check = df[df['A'].isnull()]['B']
check = check.to_string(index=False)
if "Cash" not in check:
print "Column A Fail"
else:
print "Column A Pass!"
But it is not working. Any suggestions?
I also need to make sure that it doesn't treat '0' as NaN.
UPDATE:
my goal is not to assign 'Cash', but rather to make sure that it's already there, as a quality check.
In [40]: df
Out[40]:
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN Cash
In [41]: df.query("A != A and B != 'Cash'")
Out[41]:
A B
0 NaN a
or using boolean indexing:
In [42]: df.loc[df.A.isnull() & (df.B != 'Cash')]
Out[42]:
A B
0 NaN a
OLD answer:
Alternative solution:
In [23]: df.B = np.where(df.A.isnull(), 'Cash', df.B)
In [24]: df
Out[24]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
another solution:
In [31]: df = df.mask(df.A.isnull(), df.assign(B='Cash'))
In [32]: df
Out[32]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Use loc to assign where A is null.
df.loc[df['A'].isnull(), 'B'] = 'Cash'
example
df = pd.DataFrame(dict(
    A=[np.nan, 1, 2, np.nan],
    B=['a', 'b', 'c', 'd']
))
print(df)
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN d
Then do
df.loc[df['A'].isnull(), 'B'] = 'Cash'
print(df)
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Check if all B are 'Cash' where A is null:
(df.loc[df.A.isnull(), 'B'] == 'Cash').all()
According to the rules of logic, P => Q is equivalent to (not P) or Q. So
(~df.A.isnull()|(df.B=="Cash")).all()
checks all the rows.
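Putting the quality check together as a runnable sketch (the Pass/Fail messages mirror the question; the sample frame is the one shown above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1.0, 2.0, np.nan],
                   'B': ['a', 'b', 'c', 'Cash']})

# every row with a missing A must have B == 'Cash'
if (df.loc[df['A'].isnull(), 'B'] == 'Cash').all():
    print("Column A Pass!")
else:
    print("Column A Fail")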