How do I do conditional replacements in pandas?
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
In R, I think this code is very easy to understand:
library(dplyr)
df = df %>%
  mutate(  # mutate means "create a new column", for non-R people
    my_new_column = ifelse(is.na(the_2nd_column) & is.na(the_3rd_column), 'abc', 'cuz')
  )
How do I do this in pandas? Probably a dumb syntax question, but I have heard np.where is the equivalent of ifelse in R...
df['new_column'] = np.where(np.nan(....help here with a conditional....))
Use np.where like this:
df['new_column'] = np.where(df[1].isnull() & df[2].isnull(), 'abc', 'cuz')
print(df)
or, faster, with more NumPy:
df['new_column'] = \
    np.where(np.isnan(df[1].values) & np.isnan(df[2].values), 'abc', 'cuz')
0 1 2 new_column
0 1.0 2.0 3.0 cuz
1 4.0 NaN NaN abc
2 NaN NaN 9.0 cuz
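One caveat with the NumPy variant: np.isnan only works on float arrays and raises a TypeError on object-dtype columns, while pandas' isnull() handles any dtype. A minimal sketch of the difference (the Series s is made up for illustration):
s = pd.Series([1, None, 'x'])  # object dtype
# np.isnan(s.values)           # would raise TypeError on object dtype
print(s.isnull().values)       # [False  True False] -- works for any dtype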
Using np.where with all(axis=1) to check both columns at once:
In [279]: df['new'] = np.where(df[[1, 2]].isnull().all(axis=1), 'abc', 'cuz')
In [280]: df
Out[280]:
0 1 2 new
0 1.0 2.0 3.0 cuz
1 4.0 NaN NaN abc
2 NaN NaN 9.0 cuz
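If you ever need more than two outcomes, np.select generalizes np.where by taking a list of conditions, first match wins. A sketch on the same df (the 'partial' label and the new3 column name are made up for illustration):
conditions = [
    df[[1, 2]].isnull().all(axis=1),  # both columns missing
    df[[1, 2]].isnull().any(axis=1),  # at least one missing (checked second)
]
df['new3'] = np.select(conditions, ['abc', 'partial'], default='cuz')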
Related
I have a data frame with numeric values, such as
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
and I append a single row with all the column sums
totals = df.sum()
totals.name = 'totals'
df_append = df.append(totals)
Simple enough.
Here are the values of df, totals, and df_append
>>> df
A B
0 1 2
1 3 4
>>> totals
A 4
B 6
Name: totals, dtype: int64
>>> df_append
A B
0 1 2
1 3 4
totals 4 6
Unfortunately, in newer versions of pandas the method DataFrame.append is deprecated and will be removed in some future version. The advice is to replace it with pandas.concat.
Now, using pd.concat naively as follows
df_concat_bad = pd.concat([df, totals])
produces
>>> df_concat_bad
A B 0
0 1.0 2.0 NaN
1 3.0 4.0 NaN
A NaN NaN 4.0
B NaN NaN 6.0
Apparently, with df.append the Series object got interpreted as a row, but with pd.concat it got interpreted as a column.
You cannot fix this by calling pd.concat with axis=1, because that would add the totals as a column:
>>> pd.concat([df, totals], axis=1)
A B totals
0 1.0 2.0 NaN
1 3.0 4.0 NaN
A NaN NaN 4.0
B NaN NaN 6.0
(In this case, the result looks the same as using the default axis=0, because the indexes of df and totals are disjoint, as are their column names.)
How to handle this (elegantly and efficiently)?
The solution is to convert totals (a Series object) to a DataFrame (which will then be a single column) using to_frame(), and then transpose it with T:
df_concat_good = pd.concat([df, totals.to_frame().T])
yields the desired
>>> df_concat_good
A B
0 1 2
1 3 4
totals 4 6
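An equivalent one-liner, assuming your pandas version builds the row index from the Series name when constructing a DataFrame from a list of named Series (worth double-checking):
df_concat_good = pd.concat([df, pd.DataFrame([totals])])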
I prefer to use df.loc rather than pd.concat to solve this problem:
df.loc["totals"] = df.sum()
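Note that this loc assignment mutates df in place. If you need the original intact, work on a copy first (a small sketch; df_with_totals is a made-up name):
df_with_totals = df.copy()
df_with_totals.loc['totals'] = df_with_totals.sum()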
I was trying to impute some missing values with fillna() in pandas, but I don't know how to impute using the mean of the last 3 rows in the same column (rather than the mean of the entire column). If anyone can help, it will be greatly appreciated, thanks.
You can fillna with rolling(3).mean(); shift gets the alignment correct. This approach fills everything at once, so for consecutive NaN values the fills are independent of each other. If you need iterative filling (fill the first NaN, then use that filled value to compute the fill for the next consecutive NaN), it cannot be done this way; see the loop sketch after the example below.
df = pd.DataFrame({'col1': [np.nan, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
# min_periods=1: fill as long as at least one of the last 3 values is non-NaN
df.fillna(df.rolling(3, min_periods=1).mean().shift())  # works for many cols at once
col1
0 NaN # Unfilled because < min_periods
1 3.0
2 4.0
3 5.0
4 4.0 # np.nanmean([3, 4, 5])
5 4.5 # np.nanmean([np.nan, 4, 5])
6 5.0 # np.nanmean([np.nan, np.nan, 5])
7 7.0
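For the iterative variant mentioned above (each filled value feeds into the next window), a plain loop is the straightforward way. A minimal sketch for a single numeric column; the col1_iter name is made up:
s = df['col1'].copy()
for i in range(len(s)):
    if pd.isna(s.iloc[i]):
        window = s.iloc[max(0, i - 3):i]  # up to the previous 3 values
        if window.notna().any():
            s.iloc[i] = window.mean()     # mean skips any remaining NaN
df['col1_iter'] = s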
You could do:
df.fillna(df.iloc[-3:].mean())
For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1': [1, 2, 3, np.nan, 5, 6, 7],
                   'var2': [np.nan, np.nan, np.nan, np.nan, np.nan, 1, 0]})
var1 var2
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN NaN
4 5.0 NaN
5 6.0 1.0
6 7.0 0.0
print(df.fillna(df.iloc[-3:].mean()))
Output:
var1 var2
0 1.0 0.5
1 2.0 0.5
2 3.0 0.5
3 6.0 0.5
4 5.0 0.5
5 6.0 1.0
6 7.0 0.0
Dan's solution is much simpler if the kink is worked out. If not, this loop will accomplish it:
df2 = df1.copy()
nrows, ncols = df2.shape
for row in range(nrows):
    for col in range(ncols):
        # only fill when three prior rows exist to average over
        if pd.isna(df2.iloc[row, col]) and row >= 3:
            df2.iloc[row, col] = (df2.iloc[row-1, col] + df2.iloc[row-2, col]
                                  + df2.iloc[row-3, col]) / 3
df2
I want to fill np.nan with 0 in a pd.DataFrame when a column satisfies a specific condition.
import pandas as pd
import numpy as np
from datetime import datetime as dt
df = pd.DataFrame({'A': [np.datetime64('NaT'), dt.strptime('201803', '%Y%m'), dt.strptime('201804', '%Y%m'), np.datetime64('NaT'), dt.strptime('201806', '%Y%m')],
                   'B': [1, np.nan, 3, 4, np.nan],
                   'C': [8, 9, np.nan, 4, 1]})
A B C
0 NaT 1.0 8.0
1 2018-03-01 NaN 9.0
2 2018-04-01 3.0 NaN
3 NaT 4.0 4.0
4 2018-06-01 NaN 1.0
When df['A'] >= dt.strptime('201804', '%Y%m'), I want to fill np.nan with 0 in columns B and C, to get the dataframe below.
A B C
0 NaT 1.0 8.0
1 2018-03-01 NaN 9.0
2 2018-04-01 3.0 0.0
3 NaT 4.0 4.0
4 2018-06-01 0.0 1.0
I tried
m = df[df['A'] >= dt.strptime('201804', '%Y%m')][['B', 'C']].isnull()
df.mask(m, 0, inplace=True)
and got the error Cannot do inplace boolean setting on mixed-types with a non np.nan value. I think this error is caused by the existence of NaT in column A...
Is there another way to get desired dataframe by using mask method?
I'm sure there is a more elegant solution, but this works:
df2 = df.copy()
df2.loc[df2.A >= dt.strptime('201804', '%Y%m')] = \
    df2[df2.A >= dt.strptime('201804', '%Y%m')].fillna(0)
The first line makes a copy of your original dataframe. The second fills the NaN items on the slice that meets the condition and assigns the result back.
I hope it is useful,
cheers!
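A variant that sidesteps the mixed-type problem entirely by touching only columns B and C (a sketch using the same condition, with the dt alias from the question):
cond = df['A'] >= dt.strptime('201804', '%Y%m')
df.loc[cond, ['B', 'C']] = df.loc[cond, ['B', 'C']].fillna(0)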
I want to make sure that when Column A is NULL (in csv), or NaN (in dataframe), Column B is "Cash".
I've tried this:
check = df[df['A'].isnull()]['B']
check = check.to_string(index=False)
if "Cash" not in check:
    print("Column A Fail")
else:
    print("Column A Pass!")
But it is not working.
any suggestions?
I also need to make sure that it doesn't treat '0' as NaN
UPDATE:
my goal is not to assign 'Cash', but rather to make sure that it's
already there as a quality check
In [40]: df
Out[40]:
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN Cash
In [41]: df.query("A != A and B != 'Cash'")
Out[41]:
A B
0 NaN a
or using boolean indexing (the query above works because NaN != NaN, so A != A selects the null rows):
In [42]: df.loc[df.A.isnull() & (df.B != 'Cash')]
Out[42]:
A B
0 NaN a
OLD answer:
Alternative solution:
In [23]: df.B = np.where(df.A.isnull(), 'Cash', df.B)
In [24]: df
Out[24]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
another solution:
In [31]: df = df.mask(df.A.isnull(), df.assign(B='Cash'))
In [32]: df
Out[32]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Use loc to assign where A is null.
df.loc[df['A'].isnull(), 'B'] = 'Cash'
example
df = pd.DataFrame(dict(
    A=[np.nan, 1, 2, np.nan],
    B=['a', 'b', 'c', 'd']
))
print(df)
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN d
Then do
df.loc[df['A'].isnull(), 'B'] = 'Cash'
print(df)
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Check whether all B are 'Cash' where A is null:
(df.loc[df.A.isnull(), 'B'] == 'Cash').all()
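One caveat: if no row has a null A, the selection is empty and .all() is vacuously True, which is usually the semantics you want for a quality check. Wired into the asker's print logic, a sketch:
ok = (df.loc[df.A.isnull(), 'B'] == 'Cash').all()
print("Column A Pass!" if ok else "Column A Fail")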
According to the rules of logic, P ⇒ Q is equivalent to (not P) or Q. So
(~df.A.isnull() | (df.B == "Cash")).all()
checks all the lines at once.
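Equivalently, with notnull (the same logic without the double negation):
(df.A.notnull() | (df.B == "Cash")).all()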
I have the following problem. I have this kind of a dataframe:
import math
import pandas as pd
f = pd.DataFrame([['Meyer', 2], ['Mueller', 4], ['Radisch', math.nan], ['Meyer', 2], ['Pavlenko', math.nan]])
Is there an elegant way to split the DataFrame up into several dataframes by the first column? I would like to get one dataframe where the first column = 'Mueller' and another one where the first column = 'Radisch'.
Thanks in advance,
Erik
You can loop by unique values of column A with boolean indexing:
df = pd.DataFrame([['Meyer', 2], ['Mueller', 4],
                   ['Radisch', np.nan], ['Meyer', 2],
                   ['Pavlenko', np.nan]])
df.columns = list("AB")
print (df)
A B
0 Meyer 2.0
1 Mueller 4.0
2 Radisch NaN
3 Meyer 2.0
4 Pavlenko NaN
print (df.A.unique())
['Meyer' 'Mueller' 'Radisch' 'Pavlenko']
for x in df.A.unique():
    print(df[df.A == x])
A B
0 Meyer 2.0
3 Meyer 2.0
A B
1 Mueller 4.0
A B
2 Radisch NaN
A B
4 Pavlenko NaN
Then use a dict comprehension to get a dictionary of DataFrames:
dfs = {x:df[df.A == x].reset_index(drop=True) for x in df.A.unique()}
print (dfs)
{'Meyer': A B
0 Meyer 2.0
1 Meyer 2.0, 'Radisch': A B
0 Radisch NaN, 'Mueller': A B
0 Mueller 4.0, 'Pavlenko': A B
0 Pavlenko NaN}
print (dfs.keys())
dict_keys(['Meyer', 'Radisch', 'Mueller', 'Pavlenko'])
print (dfs['Meyer'])
A B
0 Meyer 2.0
1 Meyer 2.0
print (dfs['Pavlenko'])
A B
0 Pavlenko NaN
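A groupby-based sketch that builds the same dictionary without a boolean mask per key (note that groupby sorts the keys by default, so the dict order may differ from df.A.unique()):
dfs = {name: group.reset_index(drop=True) for name, group in df.groupby('A')}
print(dfs['Mueller'])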