I have a data frame with four columns containing values between 0 and 100.
In a new column I want to assign a value dependent on the values in the first four columns.
Each value in the first four columns is assigned a number 0, 1 or 2, and these are then summed, as follows:
0 - 30 = 0
31 -70 = 1
71 - 100 = 2
So the maximum number in the fifth column will be 8 and the minimum 0.
In the example data frame below the fifth column should come out as 3 and 6. (Just in case I haven't described this clearly.)
I'm still very new to Python, and at this stage the only string in my bow is a very long and cumbersome nested if statement, applied with df['E'] = df.apply().
My question is: what is the best and most efficient function/method for populating the fifth column?
import pandas as pd

data = {
    'A': [50, 90],
    'B': [2, 4],
    'C': [20, 80],
    'D': [75, 72],
}
df = pd.DataFrame(data)
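For reference, the row-wise apply approach described above might look like this (a sketch; the `band` helper name is illustrative, not from the original post):

```python
import pandas as pd

data = {'A': [50, 90], 'B': [2, 4], 'C': [20, 80], 'D': [75, 72]}
df = pd.DataFrame(data)

def band(v):
    # map 0-30 -> 0, 31-70 -> 1, 71-100 -> 2
    if v <= 30:
        return 0
    elif v <= 70:
        return 1
    return 2

# sum the banded values across each row
df['E'] = df.apply(lambda row: sum(band(v) for v in row), axis=1)
# df['E'] -> [3, 6]
```

This works, but calls Python code per row; the vectorized answers below avoid that.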
Edit
A more general method, with np.select:
import numpy as np

condlist = [(0 <= df) & (df <= 30),
            (31 <= df) & (df <= 70),
            (71 <= df) & (df <= 100)]
choicelist = [0, 1, 2]
df['E'] = np.select(condlist, choicelist).sum(axis=1)
print(df)
# Output
A B C D E
0 50 2 20 75 3
1 90 4 80 72 6
Use pd.cut after flattening your dataframe into one column with melt:
df['E'] = pd.cut(pd.melt(df, ignore_index=False)['value'],
bins=[0, 30, 70, 100], labels=[0, 1, 2]) \
.cat.codes.groupby(level=0).sum()
print(df)
# Output:
A B C D E
0 50 2 20 75 3
1 90 4 80 72 6
Details:
>>> pd.melt(df, ignore_index=False)
variable value
0 A 50
1 A 90
0 B 2
1 B 4
0 C 20
1 C 80
0 D 75
1 D 72
>>> pd.cut(pd.melt(df, ignore_index=False)['value'],
bins=[0, 30, 70, 100], labels=[0, 1, 2])
0 1
1 2
0 0
1 0
0 0
1 2
0 2
1 2
Name: value, dtype: category
Categories (3, int64): [0 < 1 < 2]
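Since the three bands are separated by just two thresholds, an equivalent one-liner counts how many thresholds each value exceeds (a sketch, assuming all values lie in 0-100 as stated):

```python
import pandas as pd

# same example frame as above
df = pd.DataFrame({'A': [50, 90], 'B': [2, 4], 'C': [20, 80], 'D': [75, 72]})

# 0-30 -> 0, 31-70 -> 1, 71-100 -> 2: each value contributes one point
# per threshold (30, 70) it exceeds
df['E'] = ((df > 30).astype(int) + (df > 70).astype(int)).sum(axis=1)
# df['E'] -> [3, 6]
```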
Related
I have some data that needs to be clustered into groups according to a few predefined conditions.
Suppose we have the following table:
import pandas as pd

d = {'ID': [100, 101, 102, 103, 104, 105],
     'col_1': [12, 3, 7, 13, 19, 25],
     'col_2': [3, 1, 3, 3, 2, 4]
     }
df = pd.DataFrame(data=d)
df.head()
Here, I want to group the IDs based on the following range conditions on col_1 and col_2.
For col_1 I divide the values into the following groups: [0, 10], [11, 15], [16, 20], [20, +inf].
For col_2 I just use the df['col_2'].unique() values: [1], [2], [3], [4].
The desired grouping is in the group_num column:
Notice that rows 0 and 3 get the same group number, and note the order in which the group numbers are assigned.
So far I have only come up with an if-elif function to predefine all the groups. That is not a real solution, because in my actual task there are far more ranges and conditions.
My code snippet, if it's relevant:
# This logic doesn't work, because it requires predefining all the group
# configurations (numbers), but I want to create the groups "dynamically":
# the first group is created, and if the next row is not in that group -> create a new one
def groupping(val_1, val_2):
    # not using match-case here because my Python < 3.10
    if ((val_1 >= 0) and (val_1 < 10)) and (val_2 == 1):
        return 1
    elif ((val_1 >= 0) and (val_1 < 10)) and (val_2 == 2):
        return 2
    elif ...
        ...

df['group_num'] = df.apply(lambda x: groupping(x.col_1, x.col_2), axis=1)
Make a dataframe for checking the groups:
bins = [0, 10, 15, 20, float('inf')]
df1 = df[['col_1', 'col_2']].assign(col_1=pd.cut(df['col_1'], bins=bins, right=False)).sort_values(['col_1', 'col_2'])
df1
col_1 col_2
1 [0.0, 10.0) 1
2 [0.0, 10.0) 3
0 [10.0, 15.0) 3
3 [10.0, 15.0) 3
4 [15.0, 20.0) 2
5 [20.0, inf) 4
Check the groups using df1:
df1.ne(df1.shift(1)).any(axis=1).cumsum()
output:
1 1
2 2
0 3
3 3
4 4
5 5
dtype: int32
Assign the output to a group_num column:
df.assign(group_num=df1.ne(df1.shift(1)).any(axis=1).cumsum())
result:
ID col_1 col_2 group_num
0 100 12 3 3
1 101 3 1 1
2 102 7 3 2
3 103 13 3 3
4 104 19 2 4
5 105 25 4 5
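The two steps above can also be sketched in a single call with groupby(...).ngroup(), which numbers each observed (col_1 bin, col_2) combination in sorted-key order; this happens to match the desired numbering here (a variant, not from the answer above):

```python
import pandas as pd

d = {'ID': [100, 101, 102, 103, 104, 105],
     'col_1': [12, 3, 7, 13, 19, 25],
     'col_2': [3, 1, 3, 3, 2, 4]}
df = pd.DataFrame(data=d)

bins = [0, 10, 15, 20, float('inf')]
# ngroup() numbers the groups by sorted key; observed=True skips
# bin/value combinations that never occur in the data
df['group_num'] = df.groupby(
    [pd.cut(df['col_1'], bins=bins, right=False), 'col_2'],
    observed=True).ngroup() + 1
# df['group_num'] -> [3, 1, 2, 3, 4, 5]
```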
Not sure I understand the full logic, but can't you use pandas.cut?
import numpy as np

bins = [0, 10, 15, 20, np.inf]
df['group_num'] = pd.cut(df['col_1'], bins=bins,
                         labels=range(1, len(bins)))
Output:
ID col_1 col_2 group_num
0 100 12 3 2
1 101 3 1 1
2 102 7 3 1
3 103 13 3 2
4 104 19 2 3
5 105 25 4 4
Say I have such a Pandas dataframe:
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value with:
def second_largest(l=[]):
    return l.nlargest(2).min()

print(df.apply(second_largest, axis=1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names of those values, that is:
0 b
1 b
2 c
3 c
4 c
Pandas has the function idxmax, which can do the job for the largest value:
df.idxmax(axis=1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for the positions of the second-largest values:
df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution would work with idxmin in place of min, but it is slow:
def second_largest(l=[]):
    return l.nlargest(2).idxmin()

print(df.apply(second_largest, axis=1))
If efficiency is important, numpy.argpartition is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure-pandas approach (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
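The argsort idea generalizes to any n-th largest column; a small helper (the `nth_largest_col` name is mine, not from the answers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

def nth_largest_col(frame, n):
    # argsort ascending per row; the n-th-from-last position is the
    # column index of the n-th largest value
    order = np.argsort(frame.to_numpy(), axis=1)
    return pd.Series(frame.columns.to_numpy()[order[:, -n]], index=frame.index)

# nth_largest_col(df, 1) reproduces df.idxmax(axis=1);
# nth_largest_col(df, 2) gives the 2nd-largest column names
```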
I have the following dataframe, where I want to determine whether column A is greater than column B, and whether column B is greater than column C. Whenever a value is smaller than the value in the previous column, I want to change it to 0.
import pandas as pd
import numpy as np

d = {'A': [6, 8, 10, 1, 3], 'B': [4, 9, 12, 0, 2], 'C': [3, 14, 11, 4, 9]}
df = pd.DataFrame(data=d)
df
I have tried this with np.where, and it works:
df['B'] = np.where(df['A'] > df['B'], 0, df['B'])
df['C'] = np.where(df['B'] > df['C'], 0, df['C'])
However, I have a huge number of columns, and I want to know if there is any way to do this without writing each comparison separately, for example with a for loop.
Thanks
A solution with a different output, because it compares the original columns with DataFrame.diff and sets values lower than the previous column to 0 with DataFrame.mask:
df1 = df.mask(df.diff(axis=1).lt(0), 0)
print (df1)
A B C
0 6 0 0
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
If you instead use a loop with zip over shifted column names, the output is different, because each comparison uses the already reassigned columns B, C, ...:
for a, b in zip(df.columns, df.columns[1:]):
    df[b] = np.where(df[a] > df[b], 0, df[b])
print (df)
A B C
0 6 0 3
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
For a vectorized approach, you cannot simply use a diff, because the condition depends on whether the previous value was itself replaced by 0; two consecutive negative diffs must therefore not both trigger a replacement.
You can achieve a correct vectorized replacement using a shifted mask:
m1 = df.diff(axis=1).lt(0) # check if < than previous
m2 = ~m1.shift(axis=1, fill_value=False) # and this didn't happen twice
df2 = df.mask(m1&m2, 0)
output:
A B C
0 6 0 3
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
I'd like to create two cumcount columns, depending on the values of two columns.
In the example below, I'd like one cumcount starting when colA is at least 100, and another cumcount starting when colB is at least 10.
import pandas as pd

columns = ['ID', 'colA', 'colB', 'cumcountA', 'cumountB']
data = [['A', 3, 1, '', ''],
        ['A', 20, 4, '', ''],
        ['A', 102, 8, 1, ''],
        ['A', 117, 10, 2, 1],
        ['B', 75, 0, '', ''],
        ['B', 170, 12, 1, 1],
        ['B', 200, 13, 2, 2],
        ['B', 300, 20, 3, 3],
        ]
pd.DataFrame(columns=columns, data=data)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
How would I calculate cumcountA and cumcountB?
You can clip the columns at lower bounds (here 100 and 10) with df.clip, compare, then groupby ID and cumsum:
col_list = ['colA','colB']
val_list = [100,10]
df[['cumcountA','cumountB']] = (df[col_list].ge(df[col_list].clip(lower=val_list,axis=1))
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
Or, maybe even better, compare directly:
df[['cumcountA','cumountB']] = (df[['colA','colB']].ge([100,10])
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
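If you'd rather keep the columns numeric than fill with empty strings, the leading zeros can be hidden as NaN with mask instead of replace (a variant of the answer above, not from it):

```python
import pandas as pd

columns = ['ID', 'colA', 'colB']
data = [['A', 3, 1], ['A', 20, 4], ['A', 102, 8], ['A', 117, 10],
        ['B', 75, 0], ['B', 170, 12], ['B', 200, 13], ['B', 300, 20]]
df = pd.DataFrame(columns=columns, data=data)

# count rows meeting each threshold per ID; mask the zeros to NaN
# so the new columns stay numeric instead of becoming object dtype
counts = df[['colA', 'colB']].ge([100, 10]).groupby(df['ID']).cumsum()
df[['cumcountA', 'cumountB']] = counts.mask(counts.eq(0))
```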
I have a dataframe like
import pandas as pd

df1 = pd.DataFrame({'name': ['al', 'ben', 'cary'], 'bin': [1.0, 1.0, 3.0], 'score': [40, 75, 15]})
bin name score
0 1 al 40
1 1 ben 75
2 3 cary 15
and a dataframe like
df2 = pd.DataFrame({'bin':[1.0, 2.0, 3.0, 4.0, 5.0], 'x':[1, 1, 0, 0, 0],
'y':[0, 0, 1, 1, 0], 'z':[0, 0, 0, 1, 0]})
bin x y z
0 1 1 0 0
1 2 1 0 0
2 3 0 1 0
3 4 0 1 1
4 5 0 0 0
what I want to do is extend df1 with the columns 'x', 'y', and 'z', and fill them with score only where the bin matches and the respective 'x', 'y', 'z' value is 1, not 0.
I’ve gotten as far as
df3 = pd.merge(df1, df2, how='left', on=['bin'])
bin name score x y z
0 1 al 40 1 0 0
1 1 ben 75 1 0 0
2 3 cary 15 0 1 0
but I don't see an elegant way to get the score values into the correct 'x', 'y', etc. columns (my real-life problem has over a hundred such columns, so doing df3['x'] = df3['score'] * df3['x'] column by column might be rather slow).
You can just get a list of the columns you want to multiply the scores by, and then use the apply function:
cols = [each for each in df2.columns if each not in ('name', 'bin')]
df3 = pd.merge(df1, df2, how='left', on=['bin'])
df3[cols] = df3.apply(lambda x: x['score'] * x[cols], axis=1)
This may not be much faster than iterating, but here is an idea.
Import numpy and define the columns covered by the operation:
import numpy as np
columns = ['x','y','z']
score_col = 'score'
Construct a numpy array from the score column, reshaped to match the number of columns in the operation.
score_matrix = np.repeat(df3[score_col].values, len(columns))
score_matrix = score_matrix.reshape(len(df3), len(columns))
Multiply by the columns and assign back to the dataframe.
df3[columns] = score_matrix * df3[columns]
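For completeness, the same per-column multiplication can be done in one vectorized call with DataFrame.mul(axis=0), which broadcasts score down the rows and should scale well to hundreds of columns (a sketch using the frames from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['al', 'ben', 'cary'], 'bin': [1.0, 1.0, 3.0], 'score': [40, 75, 15]})
df2 = pd.DataFrame({'bin': [1.0, 2.0, 3.0, 4.0, 5.0], 'x': [1, 1, 0, 0, 0],
                    'y': [0, 0, 1, 1, 0], 'z': [0, 0, 0, 1, 0]})

df3 = pd.merge(df1, df2, how='left', on=['bin'])
cols = ['x', 'y', 'z']
# multiply every indicator column by score in a single broadcast
df3[cols] = df3[cols].mul(df3['score'], axis=0)
# x -> [40, 75, 0], y -> [0, 0, 15], z -> [0, 0, 0]
```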