I have some data that needs to be clustered into groups, based on a few predefined conditions.
Suppose we have the following table:
d = {'ID': [100, 101, 102, 103, 104, 105],
'col_1': [12, 3, 7, 13, 19, 25],
'col_2': [3, 1, 3, 3, 2, 4]
}
df = pd.DataFrame(data=d)
df.head()
Here, I want to group ID based on the following range conditions on col_1 and col_2.
For col_1 I divide the values into the following groups: [0, 10], [11, 15], [16, 20], [20, +inf].
For col_2 I just use the df['col_2'].unique() values: [1], [2], [3], [4].
The desired grouping goes into a group_num column (it matches the result shown in the answer below): notice that rows 0 and 3 get the same group number, and note the order in which the group numbers are assigned.
For now, I have only come up with an if-elif function that pre-defines all the groups. That is not a workable solution, because in my real task there are far more ranges and conditions.
My code snippet, if it's relevant:
# This logic does not work for me, because here I have to predefine all the group configurations (numbers),
# but I want to create the groups "dynamically":
# the first group is created, and if the next row does not fit that group -> create a new one
def groupping(val_1, val_2):
    # not using match/case here because my Python < 3.10
    if ((val_1 >= 0) and (val_1 < 10)) and (val_2 == 1):
        return 1
    elif ((val_1 >= 0) and (val_1 < 10)) and (val_2 == 2):
        return 2
    elif ...
    ...

df['group_num'] = df.apply(lambda x: groupping(x.col_1, x.col_2), axis=1)
Make a dataframe for checking the groups:
bins = [0, 10, 15, 20, float('inf')]
df1 = df[['col_1', 'col_2']].assign(col_1=pd.cut(df['col_1'], bins=bins, right=False)).sort_values(['col_1', 'col_2'])
df1
col_1 col_2
1 [0.0, 10.0) 1
2 [0.0, 10.0) 3
0 [10.0, 15.0) 3
3 [10.0, 15.0) 3
4 [15.0, 20.0) 2
5 [20.0, inf) 4
Check the groups on df1:
df1.ne(df1.shift(1)).any(axis=1).cumsum()
output:
1 1
2 2
0 3
3 3
4 4
5 5
dtype: int32
Write the result to a group_num column:
df.assign(group_num=df1.ne(df1.shift(1)).any(axis=1).cumsum())
result:
ID col_1 col_2 group_num
0 100 12 3 3
1 101 3 1 1
2 102 7 3 2
3 103 13 3 3
4 104 19 2 4
5 105 25 4 5
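A shorter variant that should produce the same labels here is to let groupby().ngroup() number the observed (col_1 bin, col_2) combinations directly. This is only a sketch: observed=True is assumed so that unobserved bin/value combinations do not shift the numbering, and ngroup() numbers the groups in sorted key order, which happens to match the numbering above.
import numpy as np
import pandas as pd

bins = [0, 10, 15, 20, np.inf]
# number the observed (col_1 bin, col_2) pairs in sorted key order, starting from 1
df['group_num'] = (
    df.groupby([pd.cut(df['col_1'], bins=bins, right=False), df['col_2']], observed=True)
      .ngroup() + 1
)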
Not sure I understand the full logic; can't you just use pandas.cut:
bins = [0, 10, 15, 20, np.inf]
df['group_num'] = pd.cut(df['col_1'], bins=bins,
labels=range(1, len(bins)))
Output:
ID col_1 col_2 group_num
0 100 12 3 2
1 101 3 1 1
2 102 7 3 1
3 103 13 3 2
4 104 19 2 3
5 105 25 4 4
Here I have a working solution, but my question focuses on how to do this the pandas way. I assume pandas offers better solutions for this.
I use groupby() and then apply() to compare the values in the rows of each group, and while doing this I decide which row to delete.
The rule doesn't matter! In this example the rule is that when values in column A differ only by 1 (the values are "near"), the second one is deleted. How the decision is made is not part of the question. There could also be a list of color names, and I would say that darkblue and marineblue are "near" and one of them should be deleted.
This is the initial data frame:
X A B
0 A 9 0 <--- DELETE
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
7 B 11 7 <--- DELETE
8 B 30 8
Row index 0 should be deleted because its value 9 is near the value 8 in row index 2. The same with row index 7: its value 11 is "near" 10 in row index 5.
This is the code:
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame(
{
'X': list('AAAAABBBB'),
'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
'B': range(9)
}
)
print(df)
def mark_near_neighbors(group):
    # I snip the decision process here.
    # Delete 9 because it is "near" 8.
    default_result = pd.Series(
        data=[False] * len(group),
        index=['Delete'] * len(group)
    )
    if group.X.iloc[0] == 'A':
        # the 9
        default_result.iloc[0] = True
    else:
        # the 11
        default_result.iloc[2] = True
    return default_result
result = df.groupby('X').apply(mark_near_neighbors)
result = result.reset_index(drop=True)
print(result)
df = df.loc[~result]
print(df)
So in the end I use a "boolean indexing thing" to solve this:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
dtype: bool
But is there a better way to do this?
Initialize the dataframe
df = pd.DataFrame([
['A', 9, 0],
['A', 14, 1],
['A', 8, 2],
['A', 1, 3],
['A', 18, 4],
['B', 10, 5],
['B', 20, 6],
['B', 11, 7],
['B', 30, 8],
], columns=['X', 'A', 'B'])
Sort the dataframe based on A column
df = df.sort_values('A')
Find the difference between consecutive values within each group
df["diff"] = df.groupby('X')['A'].diff()
Select the rows where the difference is not 1
result = df[df["diff"] != 1.0]
Drop the extra column and sort by the index to restore the initial order
result = result.drop("diff", axis=1)
result = result.sort_index()
Sample output
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8
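If "near" should mean a difference of at most some threshold rather than exactly 1, the same idea can be relaxed slightly; here is a minimal sketch, where thresh is a hypothetical parameter that is not part of the question:
import pandas as pd

df = pd.DataFrame({'X': list('AAAAABBBB'),
                   'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
                   'B': range(9)})

thresh = 1                                   # hypothetical "nearness" threshold
df = df.sort_values('A')
diff = df.groupby('X')['A'].diff()           # gap to the previous sorted value within each group
result = df[~(diff <= thresh)].sort_index()  # NaN gaps (first row of a group) are always kept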
IIUC, you can use numpy broadcasting to compare all values within a group. I'm keeping everything inside apply here, as that seems to be what is wanted (this assumes import numpy as np):
def mark_near_neighbors(group, thresh=1):
    a = group.to_numpy().astype(float)
    idx = np.argsort(a)                          # order that sorts the group's values
    b = a[idx]                                   # values in sorted order
    d = abs(b - b[:, None])                      # pairwise absolute differences
    d[np.triu_indices(d.shape[0])] = thresh + 1  # ignore self-comparisons and larger-sorted values
    # keep a row only if it is farther than `thresh` from every smaller value;
    # np.argsort(idx) is the inverse permutation back to the original row order
    return pd.Series((d > thresh).all(1)[np.argsort(idx)], index=group.index)
out = df[df.groupby('X')['A'].apply(mark_near_neighbors)]
output:
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8
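Note that with this approach, within each run of "near" values in a group only the smallest one survives: a row is kept only if it is farther than thresh from every smaller value in its group, which is why 9 (near 8) and 11 (near 10) are dropped.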
I have a data frame with four columns that have values between 0 and 100.
In a new column I want to assign a value dependent on the values within the first four columns.
The values from the first four columns will be assigned a number 0, 1 or 2 and then summed together as follows:
0 - 30 = 0
31 -70 = 1
71 - 100 = 2
So the maximum number in the fifth column will be 8 and the minimum 0.
In the example data frame below, the fifth column should result in 3 and 6. (Just in case I haven't described this clearly.)
I'm still very new to Python, and at this stage the only string in my bow is a very long and cumbersome set of nested if statements, followed by df['E'] = df.apply().
My question is: what is the best and most efficient function/method for populating the fifth column?
data = {
'A': [50, 90],
'B': [2, 4],
'C': [20, 80],
'D': [75, 72],
}
df = pd.DataFrame(data)
Edit
A more comprehensive method with np.select:
condlist = [(0 <= df) & (df <= 30),
(31 <= df) & (df <= 70),
(71 <= df) & (df <= 100)]
choicelist = [0, 1, 2]
df['E'] = np.select(condlist, choicelist).sum(axis=1)
print(df)
# Output
A B C D E
0 50 2 20 75 3
1 90 4 80 72 6
Use pd.cut after flattening your dataframe into one column with melt:
df['E'] = pd.cut(pd.melt(df, ignore_index=False)['value'],
bins=[0, 30, 70, 100], labels=[0, 1, 2]) \
.cat.codes.groupby(level=0).sum()
print(df)
# Output:
A B C D E
0 50 2 20 75 3
1 90 4 80 72 6
Details:
>>> pd.melt(df, ignore_index=False)
variable value
0 A 50
1 A 90
0 B 2
1 B 4
0 C 20
1 C 80
0 D 75
1 D 72
>>> pd.cut(pd.melt(df, ignore_index=False)['value'],
bins=[0, 30, 70, 100], labels=[0, 1, 2])
0 1
1 2
0 0
1 0
0 0
1 2
0 2
1 2
Name: value, dtype: category
Categories (3, int64): [0 < 1 < 2]
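A column-wise variant of the same idea, without melt, could look like the sketch below. It assumes all four columns share the same bins; values of exactly 0 would additionally need include_lowest=True.
import pandas as pd

df = pd.DataFrame({'A': [50, 90], 'B': [2, 4], 'C': [20, 80], 'D': [75, 72]})

bins = [0, 30, 70, 100]
# pd.cut maps each value to a 0/1/2 category, .cat.codes turns that into integers,
# and summing the per-column code Series gives the row-wise total
df['E'] = sum(pd.cut(df[c], bins=bins, labels=[0, 1, 2]).cat.codes
              for c in ['A', 'B', 'C', 'D'])
print(df['E'].tolist())  # [3, 6]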
I'd like to create two cumcount columns, depending on the values of two columns.
In the example below, I'd like one cumcount starting when colA is at least 100, and another cumcount starting when colB is at least 10.
columns = ['ID', 'colA', 'colB', 'cumcountA', 'cumountB']
data = [['A', 3, 1, '',''],
['A', 20, 4, '',''],
['A', 102, 8, 1, ''],
['A', 117, 10, 2, 1],
['B', 75, 0, '',''],
['B', 170, 12, 1, 1],
['B', 200, 13, 2, 2],
['B', 300, 20, 3, 3],
]
pd.DataFrame(columns=columns, data=data)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
How would I calculate cumcountA and cumcountB?
You can try df.clip with lower = your values (here 100 and 10), then compare, then group by ID and take the cumulative sum:
col_list = ['colA','colB']
val_list = [100,10]
df[['cumcountA','cumountB']] = (df[col_list].ge(df[col_list].clip(lower=val_list,axis=1))
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
Or, maybe even better, compare directly:
df[['cumcountA','cumountB']] = (df[['colA','colB']].ge([100,10])
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
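If the empty strings are not required, a variant that keeps the counts numeric (NaN where the threshold has not been reached yet) could look like this sketch:
import pandas as pd

df = pd.DataFrame({'ID': list('AAAABBBB'),
                   'colA': [3, 20, 102, 117, 75, 170, 200, 300],
                   'colB': [1, 4, 8, 10, 0, 12, 13, 20]})

# True once the per-column threshold is reached; the cumulative sum counts those rows per ID
counts = df[['colA', 'colB']].ge([100, 10]).groupby(df['ID']).cumsum()
# keep NaN (instead of '') where the count is still zero
df[['cumcountA', 'cumountB']] = counts.where(counts.gt(0)).to_numpy()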
I'm trying to use .isin with ~ so I can get back a list of unique rows, based on multiple columns, from 2 datasets.
So, I have 2 datasets with 9 rows each:
df1 is the bottom table and df2 is the top one (sorry, but I couldn't get it to show both below; it showed one and then a row of numbers).
Index Serial Count Churn
1 9 5 0
2 8 6 0
3 10 2 1
4 7 4 2
5 7 9 2
6 10 2 2
7 2 9 1
8 9 8 3
9 4 3 5
Index Serial Count Churn
1 10 2 1
2 10 2 1
3 9 3 0
4 8 6 0
5 9 8 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
I would like to get a list of rows from df1 which aren't in df2 based on more than 1 column.
For example if I base my search on the columns Serial and Count I wouldn't get Index 1 and 2 back from df1 as it appears in df2 at Index position 6, the same with Index position 4 in df1 as it appears at Index position 2 in df2. The same would apply to Index position 5 in df1 as it is at Index position 8 in df2.
The churn column doesn't really matter.
I can get it to work based on only 1 column, but not on more than 1.
df2[~df2.Serial.isin(df1.Serial.values)] kinda does what I want, but only on 1 column. I want it to be based on 2 or more.
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
One solution is to merge with an indicator:
df1 = pd.DataFrame([[10, 2, 0], [9, 4, 1], [9, 8, 1], [8, 6, 1], [9, 8, 1], [1, 9, 1], [10, 3, 1], [6, 7, 1], [4, 8, 1]], columns=['Serial', 'Count', 'Churn'])
df2 = pd.DataFrame([[9, 5, 1], [8, 6, 1], [10, 2, 1], [7, 4, 1], [7, 9, 1], [10, 2, 1], [2, 9, 1], [9, 8, 1], [4, 3, 1]], columns=['Serial', 'Count', 'Churn'])
# merge with indicator on
df_temp = df1.merge(df2[['Serial', 'Count']].drop_duplicates(), on=['Serial', 'Count'], how='left', indicator=True)
res = df_temp.loc[df_temp['_merge'] == 'left_only'].drop('_merge', axis=1)
Output
Serial Count Churn
1 9 4 1
5 1 9 1
6 10 3 1
7 6 7 1
8 4 8 1
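Note that drop_duplicates on the df2 key columns matters here: without it, repeated (Serial, Count) pairs in df2, such as (10, 2), would duplicate the matching df1 rows in the left merge.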
I've had a similar issue to solve, and I found the easiest way to deal with it was to create a temporary column consisting of the merged identifier columns, and then use isin on the values of this newly created temporary column.
A simple function achieving this could be the following:
from functools import reduce
get_temp_col = lambda df, cols: reduce(lambda x, y: x + df[y].astype('str'), cols, "")
def subset_on_x_columns(df1, df2, cols):
    """
    Subsets the input dataframe `df1` based on the missing unique values of input columns
    `cols` of dataframe `df2`.

    :param df1: Pandas dataframe to be subsetted
    :param df2: Pandas dataframe whose missing values are going to be
                used to subset `df1` by
    :param cols: List of column names
    """
    df1_temp_col = get_temp_col(df1, cols)
    df2_temp_col = get_temp_col(df2, cols)
    return df1[~df1_temp_col.isin(df2_temp_col.unique())]
Thus, for your case, all that is needed is to execute:
result_df = subset_on_x_columns(df1, df2, ['Serial', 'Count'])
which has the wanted rows:
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
The nice thing about this solution is that it scales naturally with the number of columns to use, i.e. all that is needed is to specify, via the input parameter cols, which columns to use as identifiers.
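For completeness, a related sketch that avoids concatenating strings is to build a MultiIndex from the key columns and test membership with isin (subset_on_missing_keys is just a hypothetical name; MultiIndex.from_frame needs pandas >= 0.24):
import pandas as pd

def subset_on_missing_keys(df1, df2, cols):
    # keep the rows of df1 whose key tuple does not appear among df2's key tuples
    keys1 = pd.MultiIndex.from_frame(df1[cols])
    keys2 = pd.MultiIndex.from_frame(df2[cols])
    return df1[~keys1.isin(keys2)]

result_df = subset_on_missing_keys(df1, df2, ['Serial', 'Count'])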
Here is the operation I am trying to do:
ID SUB_ID AMOUNT
1 101 1 50
2 101 1 -10
3 101 1 -20
4 101 2 30
5 101 2 20
6 102 3 10
7 102 3 -10
8 102 4 10
9 102 4 10
We want to group by ID and SUB_ID, and then take the sum of the absolute value of AMOUNT. Then order this summed column within the ID groups and return, for each ID, the SUB_ID with the maximum value.
We can get the summation by:
df1 = (df
       .groupby(['ID', 'SUB_ID'])
       .apply(lambda x: np.sum(np.absolute(x['AMOUNT']))))
This will return a Series with MultiIndex
ID SUB_ID
101 1 80
2 50
102 3 20
4 20
From here I would like to return [1, 3] ([1, 4] is also accepted, as the two sums in the 102 group are the same, but we want to return only one value per group!).
Obviously we can loop and pick the max but I am trying to find out the most efficient way possible. This operation will be applied to millions of rows.
This is one way. As your dataset is large, I strongly recommend you avoid lambda functions since these are not applied in a vectorised fashion.
res = df.assign(AMOUNT=df['AMOUNT'].abs())\
.groupby(['ID', 'SUB_ID'], as_index=False).sum()\
.sort_values('AMOUNT', ascending=False)\
.groupby('ID').head(1)
Example
df = pd.DataFrame([[101, 1, 50], [101, 1, -10], [101, 1, -20], [101, 2, 30],
[101, 2, 20], [102, 3, 10], [102, 3, -10], [102, 4, 10], [102, 4, 10]],
columns=['ID', 'SUB_ID', 'AMOUNT'])
res = df.assign(AMOUNT=df['AMOUNT'].abs())\
.groupby(['ID', 'SUB_ID'], as_index=False).sum()\
.sort_values('AMOUNT', ascending=False)\
.groupby('ID').head(1)
print(res)
ID SUB_ID AMOUNT
0 101 1 80
2 102 3 20
I think you can use nlargest:
df1.groupby('ID').nlargest(1).index.get_level_values(level='SUB_ID').tolist()
# [1, 3]
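Similarly, idxmax should work on the MultiIndexed df1 Series from the question, since per ID it returns the (ID, SUB_ID) label of the largest sum; a small sketch:
# each element of the grouped idxmax result is an (ID, SUB_ID) tuple
sub_ids = [sub_id for _, sub_id in df1.groupby('ID').idxmax()]
print(sub_ids)  # [1, 3]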