I have a dataframe below:
df = {'a': [1, 2, 3],
      'b': [77, 88, 99],
      'c1': [1, 1, 1],
      'c2': [2, 2, 2],
      'c3': [3, 3, 3]}
df = pd.DataFrame(df)
and a function:
def test_function(row):
    return row['b']
How can I apply this function to the 'c' columns (i.e. c1, c2 and c3), BUT only for the rows whose 'a' value matches the 2nd character of the 'c' column's name?
For example, for the first row, the value of 'a' is 1, so for the first row, I would like to apply this function on column 'c1'.
For the second row, the value of 'a' is 2, so for the second row, I would like to apply this function on column 'c2'. And so forth for the rest of the rows.
The desired end result should be:
df_final = {'a': [1, 2, 3],
            'b': [77, 88, 99],
            'c1': [77, 1, 1],
            'c2': [2, 88, 2],
            'c3': [3, 3, 99]}
df_final = pd.DataFrame(df_final)
Use Series.mask: select the c columns with DataFrame.filter, compare their numeric suffix with 'a', and where they match replace the values with those of b:
c_cols = df.filter(like='c').columns
def test_function(row):
    # if the column suffixes are single digits 0-9, the 2nd character is enough:
    # m = c_cols.str[1].astype(int) == row['a']
    # general case: extract the whole number from the column name
    m = c_cols.str.extract(r'(\d+)', expand=False).astype(int) == row['a']
    row[c_cols] = row[c_cols].mask(m, row['b'])
    return row
df = df.apply(test_function, axis=1)
print (df)
a b c1 c2 c3
0 1 77 77 2 3
1 2 88 1 88 3
2 3 99 1 2 99
A faster, loop-free alternative uses broadcasting:
arr = c_cols.str.extract(r'(\d+)', expand=False).astype(int).to_numpy()
# boolean mask of shape (n_rows, n_c_cols): True where the row's 'a' equals the column number
m = df['a'].to_numpy()[:, None] == arr
df[c_cols] = df[c_cols].mask(m, df['b'], axis=0)
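For completeness, the vectorised version reproduces df_final from the question (a quick check, assuming df was rebuilt from the original dict before running the three lines above):
print(df)
#    a   b  c1  c2  c3
# 0  1  77  77   2   3
# 1  2  88   1  88   3
# 2  3  99   1   2  99
print(df.equals(df_final))  # True, same as the desired frame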
Related
I have a DataFrame with multiple columns and I would like to add two more: one for the highest number in the row and another for the second highest. However, instead of the number, I would like to show the name of the column where it is found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values and assign the top-2 values:
import numpy as np

cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
A B C D E max2 max1
0 1 2 3 4 5 4 5
1 5 6 7 8 9 8 9
2 10 11 12 13 14 13 14
If you prefer the columns in max1, max2 order, reverse the last two values:
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top-2 column names and the top-2 values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
# values as a numpy array
vals = df[cols].to_numpy()
# column names as a numpy array
cols = np.array(cols)
# indices that would sort each row in descending order
arr = np.argsort(-vals, axis=1)
# top-2 column names
df[['top1','top2']] = cols[arr[:, :2]]
# top-2 values (largest first, so max1 holds the row maximum as above)
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
    A   B   C   D   E top1 top2  max1  max2
0   1   2   3  40   5    D    E    40     5
1  50   6   7   8   9    A    E    50     9
2  10  11  12  13  14    E    D    14    13
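As a slower but readable cross-check, Series.nlargest applied row-wise gives the same names and values; the top2 helper below is just illustrative:
def top2(row):
    # nlargest keeps the column labels, so names and values come together
    s = row.nlargest(2)
    return pd.Series([s.index[0], s.index[1], s.iloc[0], s.iloc[1]],
                     index=['top1', 'top2', 'max1', 'max2'])

print(df[['A', 'B', 'C', 'D', 'E']].apply(top2, axis=1))
#   top1 top2  max1  max2
# 0    D    E    40     5
# 1    A    E    50     9
# 2    E    D    14    13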
Another approach: get the first max, then remove it and take the max again to get the second max.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
max1 = df.max(axis=1)
maxcolum1 = df.idxmax(axis=1)
# replace the per-row max values with 0 (wherever those values occur), then take the max again
max2 = df.replace(np.array(df.max(axis=1)), 0).max(axis=1)
maxcolum2 = df.replace(np.array(df.max(axis=1)), 0).idxmax(axis=1)
df2 = pd.DataFrame({'max1': max1, 'max2': max2, 'maxcol1': maxcolum1, 'maxcol2': maxcolum2})
df.join(df2)
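For the example frame above, df.join(df2) should come out as follows (a sketch of the expected output, worked out by hand):
print(df.join(df2))
#     A   B   C   D   E  max1  max2 maxcol1 maxcol2
# 0   1   2  80   4   5    80     5       C       E
# 1  15  89   7   8   9    89    15       B       A
# 2  10  11  12  13  14    14    13       E       D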
This works:
import pandas as pd
data = [["aa", 1, 2], ["bb", 3, 4]]
df = pd.DataFrame(data, columns=['id', 'a', 'b'])
df = df.set_index('id')
print(df)
"""
a b
id
aa 1 2
bb 3 4
"""
but is it possible in just one call of pd.DataFrame(...) directly with a parameter, without using set_index after?
Convert the values to a 2d numpy array and slice out the index column:
import numpy as np

data = [["aa", 1, 2], ["bb", 3, 4]]
arr = np.array(data)
df = pd.DataFrame(arr[:, 1:], columns=['a', 'b'], index=arr[:, 0])
print (df)
a b
aa 1 2
bb 3 4
Details:
print (arr)
[['aa' '1' '2']
['bb' '3' '4']]
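Note that np.array on mixed types produces an all-string array, so the resulting columns hold strings; a small follow-up sketch to restore numeric dtypes:
# cast the numeric columns back to integers after the string round-trip
df = df.astype({'a': int, 'b': int})
print(df.dtypes)
# a    int64
# b    int64
# dtype: object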
Another solution:
data = [["aa", 1, 2], ["bb", 3, 4], ["cc", 30, 40]]
cols = ['a','b']
L = list(zip(*data))
print (L)
[('aa', 'bb', 'cc'), (1, 3, 30), (2, 4, 40)]
df = pd.DataFrame(dict(zip(cols, L[1:])), index=L[0])
print (df)
a b
aa 1 2
bb 3 4
cc 30 40
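It is not the bare pd.DataFrame(...) constructor, but DataFrame.from_records gets there in a single call by naming the index field (a sketch; this assumes from_records accepts a column label for index, as current pandas does):
data = [["aa", 1, 2], ["bb", 3, 4]]
# 'id' is pulled out as the index in the same call; a and b keep their int dtype
df = pd.DataFrame.from_records(data, columns=['id', 'a', 'b'], index='id')
print(df)
#     a  b
# id
# aa  1  2
# bb  3  4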
I have 2 dataframes in pandas, shown below, but they don't have the same columns; only a few columns are shared.
df1
no datas1 datas2 datas3 datas4
0 a b a a
1 b c b b
2 d b c a
df2
no datas1 datas2 datas3 data4 data 5 data6
0 c a a a a b
1 a c b b b b
2 a b c b c c
I'd like to know, for each shared column, how many of the values match, aligned on the "no" field, using pandas functions.
The result should look like:
data3 is 100% match
data4 is 66% match
or
data3 is 3 matched
data4 is 2 matched
What's the best way to do that?
You can do this: first run the equals method and, if it returns True, print that the dataframes match; otherwise use the compare method and calculate the percentage of rows that match between the dfs:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [2, 2, 3], 'b': [4, 5, 7]})
if df1.equals(df2):
    print('df1 matched df2')
else:
    comp = df1.compare(df2)
    match_perc = (df1.shape[0] - comp.shape[0]) / df1.shape[0]
    print(f'{match_perc * 100: .4f} match')  # Out: 33.3333 match
You can simplify this by using only compare; if the dataframes match perfectly, print that they matched:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
comp = df1.compare(df3)
match_perc = (df1.shape[0] - comp.shape[0]) / df1.shape[0]
if match_perc == 1:
    print('dfs matched')
else:
    print(f'{match_perc * 100: .4f} match')
# Out: dfs matched
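For the original per-column question, a minimal sketch (assuming both frames carry the shared 'no' key and the overlapping columns use the same names) could align on 'no' and compare only the columns present in both:
import pandas as pd

# hypothetical frames with a shared 'no' key and partly overlapping columns
df1 = pd.DataFrame({'no': [0, 1, 2],
                    'datas3': ['a', 'b', 'c'],
                    'datas4': ['a', 'b', 'a']})
df2 = pd.DataFrame({'no': [0, 1, 2],
                    'datas3': ['a', 'b', 'c'],
                    'datas4': ['a', 'b', 'b'],
                    'data6': ['b', 'b', 'c']})

a = df1.set_index('no')
b = df2.set_index('no')
common = a.columns.intersection(b.columns)   # columns present in both frames

matched = (a[common] == b[common]).sum()     # matched rows per shared column
ratio = (a[common] == b[common]).mean()      # fraction of matched rows per shared column
print(matched)   # datas3    3, datas4    2
print(ratio)     # datas3    1.000000, datas4    0.666667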
I'd like to apply a mathematical operation (division by 2) to each cell of col1 that satisfies the condition col1 > 5, and save the result back into df. Any ideas how to do that?
I've tried apply with a lambda, but had no success, because the whole df was changed to those values.
df = pd.DataFrame(data = {'col1' : [8, 6, 2, 2],
                          'col2' : [2, 2, 1, 1],
                          'col3' : [4, 4, 4, 4]})
out = df[df.col1 > 5].index
I expect the first column to look like [4, 3, 2, 2]
You are very close, since you already have the indices. You just need to divide the values at those indices and assign the result back to the same indices. It will look like this:
df = pd.DataFrame(data = {'col1' : [8, 6, 2, 2],
                          'col2' : [2, 2, 1, 1],
                          'col3' : [4, 4, 4, 4]})
# indices where the condition is true
ix = df[df['col1'] > 5].index
# divide only those rows of col1 and assign back with loc (avoids chained assignment)
df.loc[ix, 'col1'] = df.loc[ix, 'col1'] // 2
>>> df
col1 col2 col3
0 4 2 4
1 3 2 4
2 2 1 4
3 2 1 4
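Equivalent vectorised one-liners (a sketch; both leave the rows that fail the condition untouched):
# Series.mask replaces values where the condition holds and keeps the rest
df['col1'] = df['col1'].mask(df['col1'] > 5, df['col1'] // 2)

# the same with numpy (run on a fresh copy of the frame, not after the line above):
# import numpy as np
# df['col1'] = np.where(df['col1'] > 5, df['col1'] // 2, df['col1'])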
import pandas as pd
df1 = pd.DataFrame({'ID': ['i1', 'i2', 'i3'],
                    'A': [2, 3, 1],
                    'B': [1, 1, 2],
                    'C': [2, 1, 0],
                    'D': [3, 1, 2]})
df1 = df1.set_index('ID')
df1.head()
A B C D
ID
i1 2 1 2 3
i2 3 1 1 1
i3 1 2 0 2
df2 = pd.DataFrame({'ID': ['i1-i2', 'i1-i3', 'i2-i3'],
                    'A': [2, 1, 1],
                    'B': [1, 1, 1],
                    'C': [1, 0, 0],
                    'D': [1, 1, 1]})
df2 = df2.set_index('ID')
df2
A B C D
ID
i1-i2 2 1 1 1
i1-i3 1 1 0 1
i2-i3 1 1 0 1
Given a data frame like df1, I want to compare every pair of different rows, take the smaller value in each column, and output the result to a new data frame like df2.
For example, comparing row i1 and row i2 gives the new row i1-i2 as 2, 1, 1, 1.
Please advise on the best way to do that in pandas.
Try this:
from itertools import combinations
import numpy as np

v = df1.values
r = pd.DataFrame([np.minimum(v[t[0]], v[t[1]])
                  for t in combinations(np.arange(len(df1)), 2)],
                 columns=df1.columns,
                 index=list(combinations(df1.index, 2)))
Result:
In [72]: r
Out[72]:
A B C D
(i1, i2) 2 1 1 1
(i1, i3) 1 1 0 2
(i2, i3) 1 1 0 1
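To match the 'i1-i2' style labels of df2 in the question, the tuple index can be joined into strings (a small follow-up sketch):
# turn the ('i1', 'i2') tuples into 'i1-i2' labels like the desired df2
r.index = ['-'.join(pair) for pair in r.index]
print(r)
#        A  B  C  D
# i1-i2  2  1  1  1
# i1-i3  1  1  0  2
# i2-i3  1  1  0  1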