I am trying to aggregate and count values together. Below is my dataset:
import pandas as pd
import numpy as np

data = {'id': ['1', '2', '3', '4', '5'],
        'name': ['Company1', 'Company1', 'Company3', 'Company3', 'Company5'],
        'sales': [10, 3, 5, 1, 0],
        'income': [10, 3, 5, 1, 0],
        }
df = pd.DataFrame(data, columns=['id', 'name', 'sales', 'income'])
conditions = [
(df['sales'] < 1),
(df['sales'] >= 1) & (df['sales'] < 3),
(df['sales'] >= 3) & (df['sales'] < 5),
(df['sales'] >= 5)
]
values = ['<1', '1-3', '3-5', '>= 5']
df['range'] = np.select(conditions, values)
df = df.groupby('range')[['sales', 'income']].agg(['count', 'sum']).reset_index()
This code gives me the following table:
But I am not satisfied with the appearance of this table, because 'count' appears twice. Can anybody help me reshape it so that it has the separate columns 'range', 'count', 'sales' and 'income'?
You could try named aggregation:
df.groupby('range', as_index=False).agg(count=('range', 'count'),
                                        sales=('sales', 'sum'),
                                        income=('income', 'sum'))
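(Named aggregation like this requires pandas 0.25 or later.)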
Output:
range count sales income
0 1-3 1 1 1
1 3-5 1 3 3
2 <1 1 0 0
3 >= 5 2 15 15
P.S. You probably want to make "range" a categorical variable, so that the output is sorted in the correct order:
df['range'] = pd.Categorical(np.select(conditions, values), categories=values, ordered=True)
Then the above code outputs:
range count sales income
0 <1 1 0 0
1 1-3 1 1 1
2 3-5 1 3 3
3 >= 5 2 15 15
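As a further option, pd.cut may build the same ordered ranges in one step, since it returns an ordered categorical directly; a minimal sketch, assuming the same bin edges as the conditions above:
bins = [-np.inf, 1, 3, 5, np.inf]
df['range'] = pd.cut(df['sales'], bins=bins, right=False,
                     labels=['<1', '1-3', '3-5', '>= 5'])
Because the labels come out as ordered categories, the separate pd.Categorical step is no longer needed.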
I have some data that needs to be clustered into groups, based on a few predefined conditions.
Suppose we have the following table:
import pandas as pd

d = {'ID': [100, 101, 102, 103, 104, 105],
     'col_1': [12, 3, 7, 13, 19, 25],
     'col_2': [3, 1, 3, 3, 2, 4]
     }
df = pd.DataFrame(data=d)
df.head()
Here, I want to group ID based on the following range conditions on col_1 and col_2.
For col_1, I divide values into the following groups: [0, 10], [11, 15], [16, 20], [20, +inf].
For col_2, just use the df['col_2'].unique() values: [1], [2], [3], [4].
The desired grouping is in the group_num column:
Notice that rows 0 and 3 have the same group number, and note the order in which group numbers are assigned.
For now, I have only come up with an if-elif function to predefine all the groups. It's not a workable solution, because in my real task there are far more ranges and conditions.
My code snippet, if it's relevant:
# This logic is not working because I have to predefine all the group
# configurations (i.e. the numbers), but I want to create groups "dynamically":
# the first group is created, and if the next row is not in that group -> create a new one
def groupping(val_1, val_2):
    # not using match-case here, because my Python < 3.10
    if ((val_1 >= 0) and (val_1 < 10)) and (val_2 == 1):
        return 1
    elif ((val_1 >= 0) and (val_1 < 10)) and (val_2 == 2):
        return 2
    elif ...
        ...

df['group_num'] = df.apply(lambda x: groupping(x.col_1, x.col_2), axis=1)
Make a dataframe for checking the groups:
bins = [0, 10, 15, 20, float('inf')]
df1 = (df[['col_1', 'col_2']]
       .assign(col_1=pd.cut(df['col_1'], bins=bins, right=False))
       .sort_values(['col_1', 'col_2']))
df1
col_1 col_2
1 [0.0, 10.0) 1
2 [0.0, 10.0) 3
0 [10.0, 15.0) 3
3 [10.0, 15.0) 3
4 [15.0, 20.0) 2
5 [20.0, inf) 4
Check the groups using df1:
df1.ne(df1.shift(1)).any(axis=1).cumsum()
Output:
1 1
2 2
0 3
3 3
4 4
5 5
dtype: int32
Assign the result to a group_num column:
df.assign(group_num=df1.ne(df1.shift(1)).any(axis=1).cumsum())
Result:
ID col_1 col_2 group_num
0 100 12 3 3
1 101 3 1 1
2 102 7 3 2
3 103 13 3 3
4 104 19 2 4
5 105 25 4 5
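If the groups are fully determined by the (col_1 bin, col_2) pair, groupby(...).ngroup() may get to the same numbering in one step; a minimal sketch, assuming the same bins as above:
bins = [0, 10, 15, 20, float('inf')]
key = pd.cut(df['col_1'], bins=bins, right=False)
# ngroup numbers the groups in sorted key order; observed=True skips empty bins
df['group_num'] = df.groupby([key, df['col_2']], observed=True).ngroup() + 1
This reproduces the group_num column above, because ngroup assigns numbers in the sorted order of the (bin, col_2) keys.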
Not sure I understand the full logic; can't you use pandas.cut?
import numpy as np

bins = [0, 10, 15, 20, np.inf]
df['group_num'] = pd.cut(df['col_1'], bins=bins,
                         labels=range(1, len(bins)))
Output:
ID col_1 col_2 group_num
0 100 12 3 2
1 101 3 1 1
2 102 7 3 1
3 103 13 3 2
4 104 19 2 3
5 105 25 4 4
I have data as you can see in the terminal output. I need it converted to the Excel sheet format shown in the Excel file, by creating multi-level columns.
I researched this and tried many different things, but could not achieve my goal. Then I found "transpose", and it gave me the shape I need, but unfortunately it reshaped columns into rows, so I got the wrong data ordering.
Current result:
Desired result:
What can I try next?
You can use the pivot() function and reorder the multi-level columns.
Before that, index/group the data by repeated iterations/rounds:
import itertools
import pandas as pd

data = [
    (2, 0, 0, 1),
    (10, 2, 5, 3),
    (2, 0, 0, 0),
    (10, 1, 1, 1),
    (2, 0, 0, 0),
    (10, 1, 2, 1),
]
columns = ["player_number", "cel1", "cel2", "cel3"]
df = pd.DataFrame(data=data, columns=columns)
df_nbr_plr = df[["player_number"]].groupby("player_number").agg(cnt=("player_number", "count"))
df["round"] = list(itertools.chain.from_iterable(
    itertools.repeat(x, df_nbr_plr.shape[0]) for x in range(df_nbr_plr.iloc[0, 0])))
[Out]:
player_number cel1 cel2 cel3 round
0 2 0 0 1 0
1 10 2 5 3 0
2 2 0 0 0 1
3 10 1 1 1 1
4 2 0 0 0 2
5 10 1 2 1 2
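As an aside, the round index can probably be derived more directly with groupby(...).cumcount(), which numbers each player's rows in order of appearance; a minimal sketch:
df["round"] = df.groupby("player_number").cumcount()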
Now, pivot and reorder the column levels:
df = (df.pivot(index="round", columns="player_number")
        .reorder_levels([1, 0], axis=1)
        .sort_index(axis=1))
[Out]:
player_number 2 10
cel1 cel2 cel3 cel1 cel2 cel3
round
0 0 0 1 2 5 3
1 0 0 0 1 1 1
2 0 0 0 1 2 1
This can be done with unstack after setting player__number as the index. You have to reorder the MultiIndex columns and fill missing values/delete duplicates, though:
import pandas as pd

data = {"player__number": [2, 10, 2, 10, 2, 10],
        "cel1": [0, 2, 0, 1, 0, 1],
        "cel2": [0, 5, 0, 1, 0, 2],
        "cel3": [1, 3, 0, 1, 0, 1],
        }
df = pd.DataFrame(data).set_index('player__number', append=True)
df = df.unstack('player__number').reorder_levels([1, 0], axis=1).sort_index(axis=1)  # unstack, reorder and sort the columns
df = df.ffill().iloc[1::2].reset_index(drop=True)  # fill missing values and keep every second row
df.to_excel('output.xlsx')
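Note that the ffill() / iloc[1::2] step relies on the rows strictly alternating between the two players, as they do in this input.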
Output:
I have a data frame with four columns that have values between 0-100.
In a new column I want to assign a value dependent on the values within the first four columns.
The values from the first four columns will be assigned a number 0, 1 or 2 and then summed together as follows:
0 - 30 = 0
31 -70 = 1
71 - 100 = 2
So the maximum number in the fifth column will be 8 and the minimum 0.
In the example data frame below, the fifth column should therefore contain 3 and 6. (Just in case I haven't described this clearly.)
I'm still very new to Python, and at this stage the only string in my bow is a very long and cumbersome nested if statement, followed by df['E'] = df.apply().
My question is: what is the best and most efficient function/method for populating the fifth column?
import pandas as pd

data = {
    'A': [50, 90],
    'B': [2, 4],
    'C': [20, 80],
    'D': [75, 72],
}
df = pd.DataFrame(data)
Edit
A more comprehensive method with np.select:
import numpy as np

condlist = [(0 <= df) & (df <= 30),
            (31 <= df) & (df <= 70),
            (71 <= df) & (df <= 100)]
choicelist = [0, 1, 2]
df['E'] = np.select(condlist, choicelist).sum(axis=1)
print(df)
# Output
A B C D E
0 50 2 20 75 3
1 90 4 80 72 6
Use pd.cut after flattening your dataframe into one column with melt:
df['E'] = pd.cut(pd.melt(df, ignore_index=False)['value'],
bins=[0, 30, 70, 100], labels=[0, 1, 2]) \
.cat.codes.groupby(level=0).sum()
print(df)
# Output:
A B C D E
0 50 2 20 75 3
1 90 4 80 72 6
Details:
>>> pd.melt(df, ignore_index=False)
variable value
0 A 50
1 A 90
0 B 2
1 B 4
0 C 20
1 C 80
0 D 75
1 D 72
>>> pd.cut(pd.melt(df, ignore_index=False)['value'],
bins=[0, 30, 70, 100], labels=[0, 1, 2])
0 1
1 2
0 0
1 0
0 0
1 2
0 2
1 2
Name: value, dtype: category
Categories (3, int64): [0 < 1 < 2]
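Since the three bins amount to just two thresholds, plain comparisons may produce the same result without any binning machinery; a minimal sketch, assuming all values lie within 0-100:
cols = ['A', 'B', 'C', 'D']
# each value scores one point for exceeding 30 and another for exceeding 70
df['E'] = (df[cols] > 30).sum(axis=1) + (df[cols] > 70).sum(axis=1)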
I am trying to count consecutive zeros (e.g. 2 consecutive zeros or 3 consecutive zeros) in groups and combine the results in a new dataframe.
import pandas as pd

raw_data = {'groups': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'z', 'y', 'y', 'y', 'y', 'y', 'z'],
            'runs': [0, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 2]}
df = pd.DataFrame(raw_data, columns=['groups', 'runs'])
For example, in the dataframe above, I first want to know how many runs of 2 consecutive zeros are in each group, and then how many runs of 3 consecutive zeros are in each group.
I want the results (preferably in a dataframe):
group 2_0s 3_0s
x 1 1
y 1 0
z 0 0
I am hoping to find a generic way, as I want to be able to do the same for consecutive 1s and 2s as well.
Thanks.
You can use:
import numpy as np

# get the original unique sorted group values
orig = np.sort(df.groups.unique())
# add a helper column distinguishing consecutive runs within each group
df['g'] = (df.runs != df.runs.shift()).cumsum()
# keep only the 0 values
df = df[df.runs == 0]
print(df)
groups runs g
0 x 0 1
1 x 0 1
2 x 0 1
5 x 0 3
6 x 0 3
11 y 0 6
12 y 0 6
# get the run lengths per (groups, g)
df = df.groupby(['groups', 'g']).size().reset_index(name='0')
print(df)
groups g 0
0 x 1 3
1 x 3 2
2 y 6 2
# count run lengths per group, unstack,
# reindex by the original unique values and add a suffix to the column names
df1 = (df.groupby(['groups', '0'])
         .size()
         .unstack(fill_value=0)
         .reindex(orig, fill_value=0)
         .add_suffix('_0s'))
print(df1)
0 2_0s 3_0s
groups
x 1 1
y 1 0
z 0 0
More generic solution (starting again from the original df):
df['g'] = (df.runs != df.runs.shift()).cumsum()
df = df.groupby(['groups', 'g', 'runs']).size().reset_index(name='0')
df1 = (df.groupby(['groups', 'runs', '0'])
         .size()
         .unstack(level=[1, 2])
         .fillna(0)
         .astype(int))
print(df1)
runs    0     1     2
0       2  3  2  3  1
groups
x       1  1  1  0  0
y       1  0  0  1  0
z       0  0  0  0  2
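If you need this for one target value at a time (consecutive 1s, 2s, ...), the steps above can be wrapped in a small helper applied to the original df; a sketch (count_runs is a hypothetical name):
import numpy as np
import pandas as pd

def count_runs(df, value):
    # count runs of `value` per group, keyed by run length
    orig = np.sort(df.groups.unique())
    g = (df.runs != df.runs.shift()).cumsum()
    lengths = df[df.runs == value].groupby(['groups', g]).size().reset_index(name='len')
    return (lengths.groupby(['groups', 'len'])
                   .size()
                   .unstack(fill_value=0)
                   .reindex(orig, fill_value=0)
                   .add_suffix(f'_{value}s'))

count_runs(df, 0)  # should reproduce the 2_0s / 3_0s table above
count_runs(df, 1)  # the same counts for consecutive 1s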
I have a dataframe like
import pandas as pd

df1 = pd.DataFrame({'name': ['al', 'ben', 'cary'], 'bin': [1.0, 1.0, 3.0], 'score': [40, 75, 15]})
bin name score
0 1 al 40
1 1 ben 75
2 3 cary 15
and a dataframe like
df2 = pd.DataFrame({'bin':[1.0, 2.0, 3.0, 4.0, 5.0], 'x':[1, 1, 0, 0, 0],
'y':[0, 0, 1, 1, 0], 'z':[0, 0, 0, 1, 0]})
bin x y z
0 1 1 0 0
1 2 1 0 0
2 3 0 1 0
3 4 0 1 1
4 5 0 0 0
What I want to do is extend df1 with the columns 'x', 'y', and 'z', and fill them with score only where the bin matches and the respective 'x', 'y', 'z' value is 1, not 0.
I’ve gotten as far as
df3 = pd.merge(df1, df2, how='left', on=['bin'])
bin name score x y z
0 1 al 40 1 0 0
1 1 ben 75 1 0 0
2 3 cary 15 0 1 0
but I don't see an elegant way to get the score values into the correct 'x', 'y', etc. columns (my real-life problem has over a hundred such columns, so df3['x'] = df3['score'] * df3['x'] for each column might be rather slow).
You can just get a list of the columns you want to multiply the scores by and then use the apply function:
cols = [each for each in df2.columns if each not in ('name', 'bin')]
df3 = pd.merge(df1, df2, how='left', on=['bin'])
df3[cols] = df3.apply(lambda x: x['score'] * x[cols], axis=1)
This may not be much faster than iterating, but is an idea.
Import numpy and define the columns covered in the operation:
import numpy as np
columns = ['x','y','z']
score_col = 'score'
Construct a numpy array of the score column, reshaped to match the number of columns in the operation.
score_matrix = np.repeat(df3[score_col].values, len(columns))
score_matrix = score_matrix.reshape(len(df3), len(columns))
Multiply by the columns and assign back to the dataframe.
df3[columns] = score_matrix * df3[columns]
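For what it's worth, both approaches can likely be collapsed into a single vectorized DataFrame.mul call, which broadcasts the score column across all target columns at once; a minimal sketch:
cols = [c for c in df2.columns if c != 'bin']
df3 = pd.merge(df1, df2, how='left', on=['bin'])
df3[cols] = df3[cols].mul(df3['score'], axis=0)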