I have a simple dataset:
import pandas as pd
data = [['A', 10,16], ['B', 15,11], ['C', 14,8]]
df = pd.DataFrame(data, columns = ['Name', 'Apple','Pear'])
Output
  Name  Apple  Pear
0    A     10    16
1    B     15    11
2    C     14     8
I want to rank the quantities of the different fruits (apple and pear). The rule:
determine the difference in quantity between each pair of places, for apple and for pear
rank the differences per place: two places with closer quantities receive a lower rank
# apple
dif = abs(df['Apple'].values - df['Apple'].values[:, None])
df_apple = pd.concat((df['Name'], pd.DataFrame(dif, columns = df['Name'])), axis=1)
df_apple1 = pd.melt(df_apple, id_vars = ['Name'], value_name='Difference_apple')
df_apple1 = df_apple1[df_apple1.Difference_apple != 0]
df_apple1['Ranking_apple'] = df_apple1.groupby('variable')['Difference_apple'].rank(method = 'dense', ascending = True)
df_apple1 = df_apple1[["variable","Name","Ranking_apple"]]
df_apple1
# Output - apple
variable Name Ranking_apple
1 A B 2.0
2 A C 1.0
3 B A 2.0
5 B C 1.0
6 C A 2.0
7 C B 1.0
# pear
dif = abs(df['Pear'].values - df['Pear'].values[:, None])
df_pear = pd.concat((df['Name'], pd.DataFrame(dif, columns = df['Name'])), axis=1)
df_pear1 = pd.melt(df_pear, id_vars = ['Name'], value_name='Difference_pear')
df_pear1 = df_pear1[df_pear1.Difference_pear != 0]
df_pear1['Ranking_pear'] = df_pear1.groupby('variable')['Difference_pear'].rank(method = 'dense', ascending = True)
df_pear1 = df_pear1[["variable","Name","Ranking_pear"]]
df_pear1
# Output - pear
variable Name Ranking_pear
1 A B 1.0
2 A C 2.0
3 B A 2.0
5 B C 1.0
6 C A 2.0
7 C B 1.0
That is the algorithm for each fruit. Since the logic is the same, I should be able to write a loop over the fruits.
I am not sure how to merge these two pieces, because I need the final output to look like this:
new_df = pd.merge(df_apple1, df_pear1, how='inner', left_on=['variable','Name'], right_on = ['variable','Name'])
new_df = new_df[["variable","Name","Ranking_apple","Ranking_pear"]]
new_df
# output
variable Name Ranking_apple Ranking_pear
0 A B 2.0 1.0
1 A C 1.0 2.0
2 B A 2.0 2.0
3 B C 1.0 1.0
4 C A 2.0 2.0
5 C B 1.0 1.0
I appreciate any ideas. Thank you
If you are looking to generalise your method for any arbitrary number of fruits, you could do the following:
data = [['A', 10,16], ['B', 15,11], ['C', 14,8]]
df = pd.DataFrame(data, columns = ['Name', 'Apple','Pear'])
# all fruit
final = pd.DataFrame()
fruitcols = df.columns.values.tolist()
fruitcols.remove('Name')
for col in fruitcols:
    dif = abs(df[col].values - df[col].values[:, None])
    diff_col = 'Difference_{}'.format(col)
    rank_col = 'Ranking_{}'.format(col)
    df_frt = pd.concat((df['Name'], pd.DataFrame(dif, columns=df['Name'])), axis=1)
    df_frt1 = pd.melt(df_frt, id_vars=['Name'], value_name=diff_col)
    df_frt1 = df_frt1[df_frt1[diff_col] != 0]
    df_frt1[rank_col] = df_frt1.groupby('variable')[diff_col].rank(method='dense', ascending=True)
    df_frt1 = df_frt1[["variable", "Name", rank_col]]
    final = pd.concat([final, df_frt1], axis=1)

final.loc[:, ~final.columns.duplicated()]
variable Name Ranking_Apple Ranking_Pear
1 A B 2.0 1.0
2 A C 1.0 2.0
3 B A 2.0 2.0
5 B C 1.0 1.0
6 C A 2.0 2.0
7 C B 1.0 1.0
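For what it's worth, the loop can also collect the per-fruit frames in a list and inner-merge them on ['variable', 'Name'], the same merge the question uses for two fruits. This avoids the duplicated-column clean-up at the end. A minimal sketch of that variant:
import functools
import pandas as pd

data = [['A', 10, 16], ['B', 15, 11], ['C', 14, 8]]
df = pd.DataFrame(data, columns=['Name', 'Apple', 'Pear'])

frames = []
for col in df.columns.drop('Name'):
    # pairwise absolute differences for this fruit
    dif = abs(df[col].values - df[col].values[:, None])
    wide = pd.concat((df['Name'], pd.DataFrame(dif, columns=df['Name'])), axis=1)
    tall = pd.melt(wide, id_vars=['Name'], var_name='variable', value_name='diff')
    tall = tall[tall['diff'] != 0]
    rank_col = 'Ranking_{}'.format(col)
    tall[rank_col] = tall.groupby('variable')['diff'].rank(method='dense')
    frames.append(tall[['variable', 'Name', rank_col]])

# inner-merge every per-fruit frame on the (variable, Name) pair
final = functools.reduce(
    lambda left, right: pd.merge(left, right, on=['variable', 'Name']), frames)
print(final)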
Related
I have a dataframe with three columns:
   a  b  c
   1  0  2
   0  3  2
   0  0  2
and need to create a fourth column based on a hierarchy as follows:
If column a has a value, then column d = column a.
If column a has no value but b does, then column d = column b.
If columns a and b have no value but c does, then column d = column c.
   a  b  c  d
   1  0  2  1
   0  3  2  3
   0  0  2  2
I'm quite the beginner at python and have no clue where to start.
Edit: I have tried the following, but none of them return a value in column d if column a is empty or None:
df['d'] = df['a']
df.loc[df['a'] == 0, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
df['d'] = df['a']
df.loc[df['a'] == None, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
df['d'] = np.where(df.a != 0, df.a,
                   np.where(df.b != 0, df.b, df.c))
A simple one-liner would be:
df['d'] = df.replace(0, np.nan).bfill(axis=1)['a'].astype(int)
Step by step visualization
Convert no value to NaN
a b c
0 1.0 NaN 2
1 NaN 3.0 2
2 NaN NaN 2
Now backward fill the values along rows
a b c
0 1.0 2.0 2.0
1 3.0 3.0 2.0
2 2.0 2.0 2.0
Now select the required column, i.e. 'a', and create the new column 'd'
Output
a b c d
0 1 0 2 1
1 0 3 2 3
2 0 0 2 2
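For reference, the same one-liner split into runnable steps on the asker's frame, so each intermediate frame above can be reproduced:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 0, 2], [0, 3, 2], [0, 0, 2]], columns=['a', 'b', 'c'])

step1 = df.replace(0, np.nan)     # treat 0 ("no value") as NaN
step2 = step1.bfill(axis=1)       # backward fill along each row
df['d'] = step2['a'].astype(int)  # column 'a' now holds the first non-missing value
print(df)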
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,0,2], [0,3,2], [0,0,2]], columns = ('a','b','c'))
print(df)
df['d'] = df['a']
df.loc[df['a'] == 0, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
print(df)
Try this (df is your dataframe). Note that Python's and is ambiguous for Series, so the element-wise & with notna() is used instead:
df['d'] = np.where((df.a != 0) & df.a.notna(), df.a, np.where((df.b != 0) & df.b.notna(), df.b, df.c))
>>> print(df)
a b c d
0 1 0 2 1
1 0 3 2 3
2 0 0 2 2
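If more fallback columns get added later, the nested np.where calls become hard to read. numpy.select takes the conditions and choices as flat lists in priority order; a minimal sketch on the same data:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 0, 2], [0, 3, 2], [0, 0, 2]], columns=['a', 'b', 'c'])

# conditions are checked in order; the first match wins per row
df['d'] = np.select([df['a'] != 0, df['b'] != 0, df['c'] != 0],
                    [df['a'], df['b'], df['c']], default=0)
print(df)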
I have a dataframe that looks like this
import pandas as pd
import numpy as np
fff = pd.DataFrame({'group': ['a','a','a','b','b','b','b','c','c'], 'value': [1,2, np.nan, 1,2,3,4, np.nan, np.nan]})
I would like to drop the NAs by group, but only if all values inside the group are NA. How could I do that?
Expected output:
fff = pd.DataFrame({'group': ['a','a','a','b','b','b','b'], 'value': [1,2, np.nan, 1,2,3,4]})
You can check value for NaN and use groupby().transform('any'):
fff = fff[(~fff['value'].isna()).groupby(fff['group']).transform('any')]
Output:
group value
0 a 1.0
1 a 2.0
2 a NaN
3 b 1.0
4 b 2.0
5 b 3.0
6 b 4.0
Create a boolean series with isna(), group on fff['group'] and transform with 'all', then filter out (exclude) the rows where the result is True:
c = fff['value'].isna()
fff[~c.groupby(fff['group']).transform('all')]
group value
0 a 1.0
1 a 2.0
2 a NaN
3 b 1.0
4 b 2.0
5 b 3.0
6 b 4.0
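The same requirement can also be written with groupby().filter, which keeps the whole group when the predicate is True. It is usually slower than the transform-based masks on frames with many groups, but it reads close to the original wording; a minimal sketch:
import numpy as np
import pandas as pd

fff = pd.DataFrame({'group': ['a','a','a','b','b','b','b','c','c'],
                    'value': [1, 2, np.nan, 1, 2, 3, 4, np.nan, np.nan]})

# keep a group only if it contains at least one non-NaN value
print(fff.groupby('group').filter(lambda g: g['value'].notna().any()))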
Another option:
fff["cases"] = fff.groupby("group").cumcount()
fff["null"] = fff["value"].isnull()
fff["cases 2"] = fff.groupby(["group","null"]).cumcount()
fff[~((fff["value"].isnull()) & (fff["cases"] == fff["cases 2"]))][["group","value"]]
Output:
group value
0 a 1.0
1 a 2.0
2 a NaN
3 b 1.0
4 b 2.0
5 b 3.0
6 b 4.0
An addition to the answers already provided: keep only the groups that contain at least one non-NaN value, then filter the fff dataframe with the result variable.
result = fff.groupby("group")["value"].any()
result = result[result].index.tolist()
fff.query("group in @result")
I am trying to concatenate two dataframes horizontally. df2 contains 2 result variables for every observation in df1.
df1.shape
(242583, 172)
df2.shape
(242583, 2)
My code is:
Fin = pd.concat([df1, df2], axis= 1)
But somehow the result is stacked in 2 dimensions:
Fin.shape
(485166, 174)
What am I missing here?
The index values differ, so the indexes are not aligned and you get NaNs:
df1 = pd.DataFrame({
'A': ['a','a','a'],
'B': range(3)
})
print (df1)
A B
0 a 0
1 a 1
2 a 2
df2 = pd.DataFrame({
'C': ['b','b','b'],
'D': range(4,7)
}, index=[5,7,8])
print (df2)
C D
5 b 4
7 b 5
8 b 6
Fin = pd.concat([df1, df2], axis= 1)
print (Fin)
A B C D
0 a 0.0 NaN NaN
1 a 1.0 NaN NaN
2 a 2.0 NaN NaN
5 NaN NaN b 4.0
7 NaN NaN b 5.0
8 NaN NaN b 6.0
One possible solution is to create default indexes:
Fin = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis= 1)
print (Fin)
A B C D
0 a 0 b 4
1 a 1 b 5
2 a 2 b 6
Or assign one index to the other:
df2.index = df1.index
Fin = pd.concat([df1, df2], axis= 1)
print (Fin)
A B C D
0 a 0 b 4
1 a 1 b 5
2 a 2 b 6
df1.index = df2.index
Fin = pd.concat([df1, df2], axis= 1)
print (Fin)
A B C D
5 a 0 b 4
7 a 1 b 5
8 a 2 b 6
If you are looking for the one-liner, there is the set_index method:
import pandas as pd
x = pd.DataFrame({'A': ["a"] * 3, 'B': range(3)})
y = pd.DataFrame({'C': ["b"] * 3, 'D': range(4,7)})
pd.concat([x, y.set_index(x.index)], axis = 1)
Note that pd.concat([x, y], axis = 1) will instead create new rows and produce NA values, due to the non-matching indexes, as shown by @jezrael
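A cheap sanity check before concatenating makes this failure mode visible early. A minimal sketch, giving y a deliberately misaligned index:
import pandas as pd

x = pd.DataFrame({'A': ['a'] * 3, 'B': range(3)})
y = pd.DataFrame({'C': ['b'] * 3, 'D': range(4, 7)}, index=[5, 7, 8])

# differing indexes are what turns two 3-row frames into 6 rows
print(x.index.equals(y.index))  # False
fin = pd.concat([x, y.set_index(x.index)], axis=1)
print(fin.shape)                # (3, 4)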
I have a dataset in which I want to count the missing values for each column, and if a column has missing values, print its header name. I use the following to find the missing values per column:
myData.isnull().sum()
If I print the result, everything is OK; but when I try to put the result into a list and then handle the headers, I can't.
newList = pd.isnull(myData).sum()
print(newList)
In this case the output is:
Name 5
Surname 0
Age 3
and I want to print only Surname, but I can't find out how to return it to a variable.
newList = pd.isnull(myData).sum()
print(newList[0])
This prints 5 (the number of missing values for column 'Name').
Use boolean indexing with Series:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[np.nan,8,9,4,2,3],
'D':[1,3,5,np.nan,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 NaN 1.0 5 a
1 b 5 8.0 3.0 3 a
2 c 4 9.0 5.0 6 a
3 d 5 4.0 NaN 9 b
4 e 5 2.0 1.0 2 b
5 f 4 3.0 0.0 4 b
newList = df.isnull().sum()
print (newList)
A 0
B 0
C 1
D 1
E 0
F 0
dtype: int64
# return the columns that contain NaNs
print(newList.index[newList != 0].tolist())
['C', 'D']
# return the columns without NaNs
print(newList.index[newList == 0].tolist())
['A', 'B', 'E', 'F']
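Applied back to the frame from the question (Name, Surname, Age), the header names land in ordinary lists. The stand-in myData below only mimics the counts shown in the question:
import numpy as np
import pandas as pd

# stand-in data reproducing the counts from the question
myData = pd.DataFrame({'Name': [np.nan] * 5 + ['x'] * 5,
                       'Surname': ['y'] * 10,
                       'Age': [np.nan] * 3 + [30] * 7})

counts = myData.isnull().sum()
print(counts.index[counts > 0].tolist())   # ['Name', 'Age']
print(counts.index[counts == 0].tolist())  # ['Surname']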
I currently have a pandas DataFrame in which I'm performing comparisons between columns. I found a case in which the columns being compared are empty, and the comparison for some reason returns the else value. I added an extra statement to clean it up to an empty string. I'm looking to see whether I can simplify this into a single statement.
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
Code
df = pd.DataFrame({
'a_id': ['A', 'B', 'C', 'D', '', 'F', ''],
'a_score': [1, 2, 3, 4, '', 6, ''],
'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, ''],
})
print(df)
# Replace empty string with NaN
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)
# Calculate higher score
df['doc_id'] = df.apply(lambda df: df['a_id'] if df['a_score'] >= df['b_score'] else df['b_id'], axis=1)
# Select type based on higher score
df['doc_type'] = df.apply(lambda df: 'a' if df['a_score'] >= df['b_score'] else 'b', axis=1)
print(df)
# Update type when is empty
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
print(df)
You can use numpy.where instead of apply. Also, for setting values selected by a boolean mask on a column, it is better to use:
df.loc[mask, 'colname'] = val
# Replace empty string with NaN
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)
# Calculate higher score
df['doc_id'] = np.where(df['a_score'] >= df['b_score'], df['a_id'], df['b_id'])
# Select type based on higher score
df['doc_type'] = np.where(df['a_score'] >= df['b_score'], 'a', 'b')
print (df)
# Update type when is empty
df.loc[(df['a_id'].isnull() & df['b_id'].isnull()), 'doc_type'] = ''
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
An alternative for the mask is DataFrame.all, which checks whether all values in a row are True (axis=1):
print (df[['a_id', 'b_id']].isnull())
a_id b_id
0 False False
1 False False
2 False False
3 False False
4 True False
5 False False
6 True True
print (df[['a_id', 'b_id']].isnull().all(axis=1))
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
df.loc[df[['a_id', 'b_id']].isnull().all(axis=1), 'doc_type'] = ''
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
But it is better to use a double numpy.where:
# Replace empty string with NaN
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)
# create the masks once as Series - avoids comparing twice
mask = df['a_score'] >= df['b_score']
mask1 = (df['a_id'].isnull() & df['b_id'].isnull())
# alternative solution for mask1:
# mask1 = df[['a_id', 'b_id']].isnull().all(axis=1)
# Calculate higher score
df['doc_id'] = np.where(mask, df['a_id'], df['b_id'])
# Select type based on higher score
df['doc_type'] = np.where(mask, 'a', np.where(mask1, '', 'b'))
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN