I currently have a pandas DataFrame in which I'm performing comparisons between columns. I found a case where both compared columns are empty, yet the comparison still returns the else value. I added an extra statement to reset those rows to an empty string. I'm looking to see if I can simplify this into a single statement.
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
Code
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a_id': ['A', 'B', 'C', 'D', '', 'F', ''],
'a_score': [1, 2, 3, 4, '', 6, ''],
'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, ''],
})
print(df)
# Strip whitespace from string cells, then replace empty strings with NaN
df = df.applymap(lambda v: v.strip() if isinstance(v, str) else v).replace('', np.nan)
# Calculate higher score
df['doc_id'] = df.apply(lambda row: row['a_id'] if row['a_score'] >= row['b_score'] else row['b_id'], axis=1)
# Select type based on higher score
df['doc_type'] = df.apply(lambda row: 'a' if row['a_score'] >= row['b_score'] else 'b', axis=1)
print(df)
# Update type when both ids are empty
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
print(df)
You can use numpy.where instead of apply. Also, for setting values via boolean indexing on a column, it is better to use:
df.loc[mask, 'colname'] = val
# Strip whitespace from string cells, then replace empty strings with NaN
df = df.applymap(lambda v: v.strip() if isinstance(v, str) else v).replace('', np.nan)
# Calculate higher score
df['doc_id'] = np.where(df['a_score'] >= df['b_score'], df['a_id'], df['b_id'])
# Select type based on higher score
df['doc_type'] = np.where(df['a_score'] >= df['b_score'], 'a', 'b')
print (df)
# Update type when is empty
df.loc[(df['a_id'].isnull() & df['b_id'].isnull()), 'doc_type'] = ''
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
An alternative mask uses DataFrame.all to check whether all values in a row are True (axis=1):
print (df[['a_id', 'b_id']].isnull())
a_id b_id
0 False False
1 False False
2 False False
3 False False
4 True False
5 False False
6 True True
print (df[['a_id', 'b_id']].isnull().all(axis=1))
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
df.loc[df[['a_id', 'b_id']].isnull().all(axis=1), 'doc_type'] = ''
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
But it is better to use a double numpy.where:
# Strip whitespace from string cells, then replace empty strings with NaN
df = df.applymap(lambda v: v.strip() if isinstance(v, str) else v).replace('', np.nan)
# create the masks once as Series so the comparison is not repeated
mask = df['a_score'] >= df['b_score']
mask1 = (df['a_id'].isnull() & df['b_id'].isnull())
# alternative solution for mask1
#mask1 = df[['a_id', 'b_id']].isnull().all(axis=1)
# Calculate higher score
df['doc_id'] = np.where(mask, df['a_id'], df['b_id'])
# Select type based on higher score
df['doc_type'] = np.where(mask, 'a', np.where(mask1, '', 'b'))
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
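The nested where can also be flattened with numpy.select, which checks the conditions in order (a sketch of an equivalent formulation, not part of the original answer):
# the first matching condition wins; default covers the remaining rows
df['doc_type'] = np.select([mask, mask1], ['a', ''], default='b')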
Related
I have a simple dataset:
import pandas as pd
data = [['A', 10,16], ['B', 15,11], ['C', 14,8]]
df = pd.DataFrame(data, columns = ['Name', 'Apple','Pear'])
Output
Name Apple Pear
0 A 10 16
1 B 15 11
2 C 14 8
I want to rank the quantities of the different fruits, apple and pear. The rules:
determine the difference in quantity between each pair of places, for apple and for pear
rank the differences per place; two places with closer quantities receive a lower rank
# apple
dif = abs(df['Apple'].values - df['Apple'].values[:, None])
df_apple = pd.concat((df['Name'], pd.DataFrame(dif, columns = df['Name'])), axis=1)
df_apple1 = pd.melt(df_apple, id_vars = ['Name'], value_name='Difference_apple')
df_apple1 = df_apple1[df_apple1.Difference_apple != 0]
df_apple1['Ranking_apple'] = df_apple1.groupby('variable')['Difference_apple'].rank(method = 'dense', ascending = True)
df_apple1 = df_apple1[["variable","Name","Ranking_apple"]]
df_apple1
# Output - apple
variable Name Ranking_apple
1 A B 2.0
2 A C 1.0
3 B A 2.0
5 B C 1.0
6 C A 2.0
7 C B 1.0
# pear
dif = abs(df['Pear'].values - df['Pear'].values[:, None])
df_pear = pd.concat((df['Name'], pd.DataFrame(dif, columns = df['Name'])), axis=1)
df_pear1 = pd.melt(df_pear, id_vars = ['Name'], value_name='Difference_pear')
df_pear1 = df_pear1[df_pear1.Difference_pear != 0]
df_pear1['Ranking_pear'] = df_pear1.groupby('variable')['Difference_pear'].rank(method = 'dense', ascending = True)
df_pear1 = df_pear1[["variable","Name","Ranking_pear"]]
df_pear1
# output-pear
variable Name Ranking_pear
1 A B 1.0
2 A C 2.0
3 B A 2.0
5 B C 1.0
6 C A 2.0
7 C B 1.0
That is the algorithm for each fruit. Since the logic is the same, I can wrap it in a loop over the fruits.
I am not sure how to merge the two pieces, because I need the final output to look like this:
new_df = pd.merge(df_apple1, df_pear1, how='inner', left_on=['variable','Name'], right_on = ['variable','Name'])
new_df = new_df[["variable","Name","Ranking_apple","Ranking_pear"]]
new_df
# output
variable Name Ranking_apple Ranking_pear
0 A B 2.0 1.0
1 A C 1.0 2.0
2 B A 2.0 2.0
3 B C 1.0 1.0
4 C A 2.0 2.0
5 C B 1.0 1.0
I appreciate any ideas. Thank you
If you are looking to generalise your method for any arbitrary number of fruits, you could do the following:
data = [['A', 10,16], ['B', 15,11], ['C', 14,8]]
df = pd.DataFrame(data, columns = ['Name', 'Apple','Pear'])
# all fruit
final = pd.DataFrame()
fruitcols = df.columns.values.tolist()
fruitcols.remove('Name')
for col in fruitcols:
    dif = abs(df[col].values - df[col].values[:, None])
    diff_col = 'Difference_{}'.format(col)
    rank_col = 'Ranking_{}'.format(col)
    df_frt = pd.concat((df['Name'], pd.DataFrame(dif, columns=df['Name'])), axis=1)
    df_frt1 = pd.melt(df_frt, id_vars=['Name'], value_name=diff_col)
    df_frt1 = df_frt1[df_frt1[diff_col] != 0]
    df_frt1[rank_col] = df_frt1.groupby('variable')[diff_col].rank(method='dense', ascending=True)
    df_frt1 = df_frt1[["variable", "Name", rank_col]]
    final = pd.concat([final, df_frt1], axis=1)
final = final.loc[:, ~final.columns.duplicated()]
print(final)
variable Name Ranking_Apple Ranking_Pear
1 A B 2.0 1.0
2 A C 1.0 2.0
3 B A 2.0 2.0
5 B C 1.0 1.0
6 C A 2.0 2.0
7 C B 1.0 1.0
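If you would rather keep the inner merge from the question instead of concat plus duplicate-column dropping, the per-fruit frames can be folded together with functools.reduce (a sketch, assuming the same df and fruitcols as above):
from functools import reduce
frames = []
for col in fruitcols:
    # pairwise absolute differences for this fruit
    dif = abs(df[col].values - df[col].values[:, None])
    tmp = pd.concat((df['Name'], pd.DataFrame(dif, columns=df['Name'])), axis=1)
    tmp = pd.melt(tmp, id_vars=['Name'], value_name='diff')
    tmp = tmp[tmp['diff'] != 0]
    tmp['Ranking_{}'.format(col)] = tmp.groupby('variable')['diff'].rank(method='dense')
    frames.append(tmp.drop(columns='diff'))
# inner-merge every per-fruit ranking on the shared key columns
final = reduce(lambda l, r: pd.merge(l, r, on=['variable', 'Name']), frames)
print(final)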
I have a dataframe with three columns:
   a  b  c
0  1  0  2
1  0  3  2
2  0  0  2
and need to create a fourth column based on a hierarchy as follows:
If column a has a value, then column d = column a
If column a has no value but b does, then column d = column b
If columns a and b have no value but c does, then column d = column c
   a  b  c  d
0  1  0  2  1
1  0  3  2  3
2  0  0  2  2
I'm quite the beginner at Python and have no clue where to start.
Edit: I have tried the following, but none of them return a value in column d when column a is empty or None:
df['d'] = df['a']
df.loc[df['a'] == 0, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
df['d'] = df['a']
df.loc[df['a'] == None, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
df['d'] = np.where(df.a != 0, df.a,
          np.where(df.b != 0, df.b, df.c))
A simple one-liner would be,
df['d'] = df.replace(0, np.nan).bfill(axis=1)['a'].astype(int)
Step by step visualization
Replace 0 (no value) with NaN
a b c
0 1.0 NaN 2
1 NaN 3.0 2
2 NaN NaN 2
Now backward fill the values along rows
a b c
0 1.0 2.0 2.0
1 3.0 3.0 2.0
2 2.0 2.0 2.0
Now select the required column, i.e. 'a', and assign it to the new column 'd'
Output
a b c d
0 1 0 2 1
1 0 3 2 3
2 0 0 2 2
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,0,2], [0,3,2], [0,0,2]], columns = ('a','b','c'))
print(df)
df['d'] = df['a']
df.loc[df['a'] == 0, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
print(df)
Try this (df is your dataframe; note that element-wise tests on a Series need & and notna() rather than Python's and / is not None):
df['d'] = np.where((df.a != 0) & df.a.notna(), df.a, np.where((df.b != 0) & df.b.notna(), df.b, df.c))
>>> print(df)
a b c d
0 1 0 2 1
1 0 3 2 3
2 0 0 2 2
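For completeness, numpy.select expresses the same hierarchy without nesting (a sketch equivalent to the nested where above, not from the original answers):
# conditions are evaluated top-down; the first truthy one supplies the value
df['d'] = np.select([df.a != 0, df.b != 0], [df.a, df.b], default=df.c)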
I want to create a function that takes a dataframe and replaces NaN in categorical columns with the mode, and NaN in numerical columns with the mean of that column. If there is more than one mode in a categorical column, it should use the first one.
I have managed to do it with following code:
def exercise4(df):
    df1 = df.select_dtypes(np.number)
    df2 = df.select_dtypes(exclude='float')
    mode = df2.mode()
    df3 = df1.fillna(df.mean())
    df4 = df2.fillna(mode.iloc[0, :])
    new_df = [df3, df4]
    df5 = pd.concat(new_df, axis=1)
    new_cols = list(df.columns)
    df6 = df5[new_cols]
    return df6
But I am sure there is a far easier way to do this?
You can use:
df = pd.DataFrame({
'A':list('abcdec'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':list('bbcdeb'),
})
df.iloc[[1,3], [1,2,0,4]] = np.nan
print (df)
A B C D E
0 a 4.0 7.0 1 b
1 NaN NaN NaN 3 NaN
2 c 4.0 9.0 5 c
3 NaN NaN NaN 7 NaN
4 e 5.0 2.0 1 e
5 c 4.0 3.0 0 b
The idea is to use DataFrame.select_dtypes for the non-numeric columns with DataFrame.mode, selecting the first row by position with DataFrame.iloc, then compute the means of the numeric columns. The two Series of replacement values are combined and passed to DataFrame.fillna:
modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
means = df.mean(numeric_only=True)
# Series.append was removed in pandas 2.0, so concatenate instead
both = pd.concat([modes, means])
print (both)
A c
E b
B 4.25
C 5.25
D 2.83333
dtype: object
df.fillna(both, inplace=True)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Passed to a function with DataFrame.pipe:
def exercise4(df):
    modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
    means = df.mean(numeric_only=True)
    both = pd.concat([modes, means])
    df.fillna(both, inplace=True)
    return df
df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Another idea is to use DataFrame.apply, but the result_type='expand' parameter is necessary, together with a dtype test via pandas.api.types.is_numeric_dtype:
from pandas.api.types import is_numeric_dtype
f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
df.fillna(df.apply(f, result_type='expand'), inplace=True)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Passed to function:
from pandas.api.types import is_numeric_dtype
def exercise4(df):
    f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
    df.fillna(df.apply(f, result_type='expand'), inplace=True)
    return df
df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
You can use the _get_numeric_data() method to get the numeric columns (and consequently the categorical ones):
numerical_col = df._get_numeric_data().columns
At this point you only need one line of code using an apply function that runs through the columns:
fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)
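Note that _get_numeric_data is a private pandas method and may change between releases; a sketch with the public select_dtypes does the same split:
import numpy as np
# public API: pick the numeric columns by dtype instead of the private helper
numerical_col = df.select_dtypes(include=np.number).columns
fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)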
Actually you have all the ingredients already there! Some of your steps can be chained though making some others obsolete.
Looking at these two lines for example:
mode = df2.mode()
df4 = df2.fillna(mode.iloc[0,:])
You could just replace them with df4 = df2.fillna(df2.mode().iloc[0,:]). Then, instead of constantly reassigning new (sub)dataframes to variables, altering them, and concatenating them, you can make these alterations in place, meaning they are applied directly to the dataframe in question. Lastly, exclude='float' might work in your particular (example) case, but what if there are even more datatypes in the dataframe? A string column, maybe?
My suggestion:
def mean_mode(df):
    # select_dtypes returns a copy, so assign the filled columns back to df
    num = df.select_dtypes(np.number).columns
    cat = df.select_dtypes(exclude=np.number).columns
    df[num] = df[num].fillna(df[num].mean())
    df[cat] = df[cat].fillna(df[cat].mode().iloc[0])
    return df
You can work as follows:
df = df.apply(lambda x: x.fillna(x.mode()[0]) if x.dtype == 'category' else x.fillna(x.mean()))
I have a dataframe similar to
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, np.nan, 2, 3, np.nan, 4], 'B': [np.nan, 1, np.nan, 2, 3, np.nan]})
df
A B
0 1.0 NaN
1 NaN 1.0
2 2.0 NaN
3 3.0 2.0
4 NaN 3.0
5 4.0 NaN
How do I count the number of rows where A is np.nan but B is not, where A is not np.nan but B is, and where both A and B are not np.nan?
I tried df.groupby(['A', 'B']).count() but it doesn't read the rows with np.nan.
Using
df.isnull().groupby(['A','B']).size()
Out[541]:
A B
False False 1
True 3
True False 2
dtype: int64
You can use DataFrame.isna with crosstab to count the True values:
df1 = df.isna()
df2 = pd.crosstab(df1.A, df1.B)
print (df2)
B False True
A
False 1 3
True 2 0
For a scalar:
print (df2.loc[False, False])
1
For clearer labels, add prefixes to both axes:
df2 = pd.crosstab(df1.A, df1.B).add_prefix('B_').rename(lambda x: 'A_' + str(x))
print (df2)
B B_False B_True
A
A_False 1 3
A_True 2 0
Then select the scalar by label:
print (df2.loc['A_False', 'B_False'])
1
Another solution is to use DataFrame.dot with the column names, followed by Series.replace and Series.value_counts:
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4, np.nan],
'B': [np.nan, 1,np.nan,2, 3, np.nan, np.nan]})
s = df.isna().dot(df.columns).replace({'':'no match'}).value_counts()
print (s)
B 3
A 2
no match 1
AB 1
dtype: int64
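To see why this works: the dot product of the boolean mask with the column names string-concatenates, per row, the names of the NaN columns (an empty string when none are missing):
print(df.isna().dot(df.columns))
0     B
1     A
2     B
3
4     A
5     B
6    AB
dtype: object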
If we are dealing with two columns only, there's a very simple solution that involves assigning simple weights to columns A and B, then summing them.
v = df.isna().mul([1, 2]).sum(axis=1).value_counts()
v.index = v.index.map({2: 'only B', 1: 'only A', 0: 'neither'})
v
only B 3
only A 2
neither 1
dtype: int64
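The weights act like bits, so a row missing both columns would sum to 3; this example has no such row, but with one present you would extend the map (a sketch):
v.index = v.index.map({3: 'both', 2: 'only B', 1: 'only A', 0: 'neither'})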
Another alternative can be achieved with pivot_table and stack:
df.isna().pivot_table(index='A', columns='B', aggfunc='size').stack()
A B
False False 1.0
True 3.0
True False 2.0
dtype: float64
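If you want the absent combination (here True/True) kept as an integer zero count instead of being dropped by stack, fill_value should do it (a sketch):
df.isna().pivot_table(index='A', columns='B', aggfunc='size', fill_value=0).stack()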
I think you need:
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4], 'B': [np.nan, 1,np.nan,2, 3, np.nan]})
count1 = len(df[(~df['A'].isnull()) & (df['B'].isnull())])
count2 = len(df[(~df['A'].isnull()) & (~df['B'].isnull())])
count3 = len(df[(df['A'].isnull()) & (~df['B'].isnull())])
print(count1, count2, count3)
Output:
3 1 2
To get rows where either A or B is null, we can do:
bool_df = df.isnull()
df[bool_df['A'] ^ bool_df['B']].shape[0]
To get rows where both are null values:
df[bool_df['A'] & bool_df['B']].shape[0]
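On pandas 1.1 or newer, DataFrame.value_counts counts unique rows directly, covering every combination in one call (a sketch):
# each (A, B) null-pattern becomes one row of the result
print(df.isna().value_counts())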
I want to make sure that when Column A is NULL (in csv), or NaN (in dataframe), Column B is "Cash".
I've tried this:
check = df[df['A'].isnull()]['B']
check = check.to_string(index=False)
if "Cash" not in check:
print "Column A Fail"
else:
print "Column A Pass!"
But it is not working. Any suggestions?
I also need to make sure that it doesn't treat '0' as NaN.
UPDATE:
my goal is not to assign 'Cash', but rather to make sure that it's
already there as a quality check
In [40]: df
Out[40]:
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN Cash
In [41]: df.query("A != A and B != 'Cash'")
Out[41]:
A B
0 NaN a
or using boolean indexing:
In [42]: df.loc[df.A.isnull() & (df.B != 'Cash')]
Out[42]:
A B
0 NaN a
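Since the goal is a quality check rather than assignment, the query result converts naturally into a pass/fail flag (a sketch building on the answer above):
# no violating rows means the check passes
if df.query("A != A and B != 'Cash'").empty:
    print("Column A Pass!")
else:
    print("Column A Fail")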
OLD answer:
Alternative solution:
In [23]: df.B = np.where(df.A.isnull(), 'Cash', df.B)
In [24]: df
Out[24]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
another solution:
In [31]: df = df.mask(df.A.isnull(), df.assign(B='Cash'))
In [32]: df
Out[32]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Use loc to assign where A is null.
df.loc[df['A'].isnull(), 'B'] = 'Cash'
example
df = pd.DataFrame(dict(
A=[np.nan, 1, 2, np.nan],
B=['a', 'b', 'c', 'd']
))
print(df)
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN d
Then do
df.loc[df['A'].isnull(), 'B'] = 'Cash'
print(df)
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Check whether all B are 'Cash' where A is null:
(df.loc[df.A.isnull(), 'B'] == 'Cash').all()
According to the rules of logic, P => Q is equivalent to (not P) or Q. So
(~df.A.isnull() | (df.B == "Cash")).all()
checks all the rows.
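As a quick check against the example frame above (row 0 has A NaN but B == 'a', so the implication fails):
print((~df.A.isnull() | (df.B == "Cash")).all())   # False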