I have a dataframe where I need to group ages into bins and then get the average Tip amount for each group.
My data looks like the following:
   Tip amount  Age
0           3   30
1          30   35
2           4   60
3           1   12
4           7   25
5           3   45
6          15   31
7           5    8
I have tried to use pd.cut() with bins to create the grouping, but I can't seem to get the average Tip amount (maybe using mean()) into the DataFrame as well.
import pandas as pd

bins = [0, 15, 30, 45, 60, 85]
labels = ['0-14', '15-29', '30-44', '45-59', '60+']
df['Tip amount'] = df['Tip amount'].astype(int)
# df = df.groupby('Age')[['Tip amount']].mean()
df = df.groupby(pd.cut(df['Age'], bins=bins, labels=labels, right=False)).size()
This gives the following output:
Age
0-14     2
15-29    1
30-44    3
45-59    1
60+      1
But I would like to have the average Tip amount for the groups as well.
Age    count  Tip amount
0-14       2         avg
15-29      1         avg
30-44      3         avg
45-59      1         avg
60+        1         avg
Try:
df.groupby(pd.cut(df['Age'], bins=bins, labels=labels, right=False)).agg({'Age': ['size'], 'Tip amount': ['mean']})
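For reference, a minimal self-contained sketch using the sample data above (the group means follow directly from that table):
import pandas as pd

df = pd.DataFrame({'Tip amount': [3, 30, 4, 1, 7, 3, 15, 5],
                   'Age': [30, 35, 60, 12, 25, 45, 31, 8]})

bins = [0, 15, 30, 45, 60, 85]
labels = ['0-14', '15-29', '30-44', '45-59', '60+']

# Group by the binned ages, then aggregate: group size plus mean tip
out = (df.groupby(pd.cut(df['Age'], bins=bins, labels=labels, right=False))
         .agg({'Age': ['size'], 'Tip amount': ['mean']}))
print(out)
#         Age Tip amount
#        size       mean
# Age
# 0-14      2        3.0
# 15-29     1        7.0
# 30-44     3       16.0
# 45-59     1        3.0
# 60+       1        4.0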
I have a dataframe like this:
   year  count_yes  count_no
0  1900          5         7
1  1903          5         3
2  1915         14         6
3  1919          6        14
I want to have two bins, independent of the values themselves.
How can I group those categories and sum its values?
Expected result:
   year  count_yes  count_no
0  1900         10        10
1  1910         20        20
Logic: group the first two rows (1900 and 1903) and the last two rows (1915 and 1919), summing the values of each category.
I want to create a stacked percentage column chart, so 1900 would be 50/50% and 1910 would also be 50/50%.
I've already created the function to build this chart; I just need to bin the dataframe to get a better distribution and visualization.
This is a way to do what you need, if you are ok using the decades as index:
df['year'] = (df.year//10)*10
df_group = df.groupby('year').sum()
Output of df_group:
      count_yes  count_no
year
1900         10        10
1910         20        20
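Since the end goal is a stacked-percentage chart, a possible follow-up (a sketch, assuming matplotlib is available) converts these counts to row percentages:
import matplotlib.pyplot as plt

# Turn the binned counts into row percentages (1900 -> 50/50, 1910 -> 50/50)
pct = df_group.div(df_group.sum(axis=1), axis=0) * 100
pct.plot(kind='bar', stacked=True)
plt.ylabel('percent')
plt.show()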
You can bin the years with pandas.cut and aggregate with groupby+sum:
bins = list(range(1900, df['year'].max()+10, 10))
group = pd.cut(df['year'], bins=bins, labels=bins[:-1], right=False)
df.drop('year', axis=1).groupby(group).sum().reset_index()
If you only want to specify the number of bins, compute group with:
group = pd.cut(df['year'], bins=2, right=False)
output:
year count_yes count_no
0 1900 10 10
1 1910 20 20
I have 4 columns in my dataset: cid (customer level), month, spending, and transaction (max cid = 10000). See df.head() below.
cid month spending transaction
0 1 3 61.94 28
1 1 4 73.02 23
2 1 7 59.34 25
3 1 8 48.69 24
4 1 9 121.79 26
I use the following function to calculate the trend (slope) in outflow spending per customer. However, I get one identical number for the whole dataset, while I expected the trend to be calculated at the customer level (a trend value for each customer).
Is there a way to iterate over each customer in the dataset and obtain individual trends per customer? Thanks in advance!
import numpy as np
import pandas as pd

df = pd.read_csv("/content/case_data.csv")

def trendline(df, order=1):
    coeffs = np.polyfit(df.index.values, list(df), order)
    slope = coeffs[-2]
    return float(slope)

outflow = df['spending']
cid = df['cid']
df_ = pd.DataFrame({'cid': cid, 'outflow': outflow})
slope_outflow = trendline(df_['cid'])
slope_outflow
Output : 0.13377820413729283
Expected Output: (Trend1), (Trend2), (Trend3), ......, (Trend10000)
def trendline(x, y, order=1):
    return np.polyfit(x, y, order)[-2]

df.groupby('cid').apply(lambda subdf: trendline(subdf['month'].values, subdf['spending'].values))
You can use groupby to calculate the trend for each cid value; the example above computes the trend of spending over month.
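A minimal runnable sketch of this approach, with made-up data for two customers:
import numpy as np
import pandas as pd

# Toy data: two customers with clear linear trends (made-up values)
df = pd.DataFrame({'cid': [1, 1, 1, 2, 2, 2],
                   'month': [1, 2, 3, 1, 2, 3],
                   'spending': [10.0, 20.0, 30.0, 50.0, 45.0, 40.0]})

def trendline(x, y, order=1):
    # The slope is the second-to-last polyfit coefficient
    return np.polyfit(x, y, order)[-2]

slopes = df.groupby('cid').apply(
    lambda g: trendline(g['month'].values, g['spending'].values))
print(slopes)  # cid 1 -> 10.0, cid 2 -> -5.0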
I am new to Python. I can see at least 5 similar questions, and this one is very close, but none of them works for me.
I have a dataframe with non-unique customers.
customer_id amount male age income days reward difficulty duration
0 id_1 16.06 1 45 62000.0 608 2.0 10.0 10.0
1 id_1 18.00 1 45 62000.0 608 2.0 10.0 10.0
I am trying to group by customer_id, sum the amount, and keep all other columns, PLUS add one column, total, counting my transactions.
Desired output
customer_id amount male age income days reward difficulty duration total
0 id_1 34.06 1 45 62000.0 608 2.0 10.0 10.0 2
My best attempt so far does not preserve all the other columns:
df.groupby('customer_id')['amount'].agg(total_sum='sum', total='count')
You could do it this way: include all other columns in your groupby, then reset_index after aggregating:
df.groupby(df.columns.difference(['amount']).tolist())['amount']\
  .agg(total_sum='sum', total='count').reset_index()
Output:
age customer_id days difficulty duration income male reward total_sum total
0 45 id_1 608 10.0 10.0 62000.0 1 2.0 34.06 2
You could do:
grouper = df.groupby('customer_id')
first_dict = {col: 'first' for col in df.columns.difference(['customer_id', 'amount'])}
o = grouper.agg({
    'amount': 'sum',
    **first_dict,
})
o['total'] = grouper.size().values
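A compact variant of the same idea uses named aggregation (a sketch; like the answers above, it assumes the non-amount columns are constant per customer):
out = (df.groupby('customer_id')
         .agg(amount=('amount', 'sum'),
              total=('amount', 'count'),
              male=('male', 'first'),
              age=('age', 'first'),
              income=('income', 'first'),
              days=('days', 'first'),
              reward=('reward', 'first'),
              difficulty=('difficulty', 'first'),
              duration=('duration', 'first'))
         .reset_index())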
Based on @Scott Boston's answer, I found an answer myself too. I acknowledge that my solution is not elegant (maybe something will help clean it up), but it gives me an expanded solution for when I have non-unique rows (for instance, each customer_id has five different transactions).
starbucks_grouped = df.groupby('customer_id').agg({'amount': ['sum'], 'reward': ['sum'],
                                                   'difficulty': ['mean'], 'duration': ['mean'],
                                                   'male': ['mean'], 'income': ['mean'],
                                                   'days': ['mean'], 'age': ['mean'],
                                                   'customer_id': ['count']}).reset_index()
df_grouped = starbucks_grouped.droplevel(1, axis=1)
Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'VipNo': np.repeat(range(3), 2),
    'Quantity': np.random.randint(200, size=6),
    'OrderDate': np.random.choice(pd.date_range('1/1/2020', periods=365, freq='D'), 6, replace=False)})
print(df)
So I have a couple of steps to do. I want to create a new column named qtywithin1mon/totalqty. First I want to group by VipNo (each number represents an individual), because a person may have made multiple purchases. Then I want to check whether the OrderDate is within a certain range (let's say 2020/03/01 - 2020/03/31). If so, I want to divide the quantity ordered in that range by the total quantity that customer purchased. My dataset is big, so a customer may have ordered twice within the time range; in that case I want the sum of the two orders divided by the total quantity. How can I achieve this? I really have no idea where to start.
Thank you so much!
You can create a new column by masking Quantity within the given date range, then groupby:
start, end = pd.to_datetime(['2020/03/01','2020/03/31'])
(df.assign(QuantitySub=df['OrderDate'].between(start, end)*df.Quantity)
   .groupby('VipNo')[['Quantity', 'QuantitySub']]
   .sum()
   .assign(output=lambda x: x['QuantitySub']/x['Quantity'])
   .drop('QuantitySub', axis=1)
)
With a data frame:
VipNo Quantity OrderDate
0 0 105 2020-01-07
1 0 56 2020-03-04
2 1 167 2020-09-05
3 1 18 2020-05-08
4 2 151 2020-11-01
5 2 14 2020-03-17
The output is:
       Quantity    output
VipNo
0           161  0.347826
1           185  0.000000
2           165  0.084848
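If the ratio is wanted as a new column on the original frame (the asker's qtywithin1mon/totalqty), here is a transform-based sketch, reusing start and end from above:
in_range = df['OrderDate'].between(start, end)

# Per-customer totals, broadcast back to every row
total_qty = df.groupby('VipNo')['Quantity'].transform('sum')
in_range_qty = (df['Quantity'] * in_range).groupby(df['VipNo']).transform('sum')
df['qtywithin1mon/totalqty'] = in_range_qty / total_qty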
I have a dataframe containing dates and prices. I need to sum all prices belonging to a week (e.g. 17/12 to 23/12) and put the total against a new label corresponding to that week.
Date Price
12/17/2015 10
12/18/2015 20
12/19/2015 30
12/21/2015 40
12/24/2015 50
I want the output to be the following
week total
17/12-23/12 100
24/12-30/12 50
I tried using different datetime and groupby functions but was not able to get this output. Please help.
What about this approach?
In [19]: df.groupby(df.Date.dt.isocalendar().week)['Price'].sum().rename_axis('week_no').reset_index(name='total')
Out[19]:
week_no total
0 51 60
1 52 90
UPDATE:
In [49]: df.resample(on='Date', rule='7D', origin='start').sum().rename_axis('week_from') \
    .reset_index()
Out[49]:
week_from Price
0 2015-12-17 100
1 2015-12-24 50
UPDATE2:
x = (df.resample(on='Date', rule='7D', origin='start')
       .sum()
       .reset_index()
       .rename(columns={'Price': 'total'}))
x = x.assign(week=x['Date'].dt.strftime('%d/%m')
             + '-'
             + (x.pop('Date') + pd.DateOffset(days=7)).dt.strftime('%d/%m'))
In [127]: x
Out[127]:
total week
0 100 17/12-24/12
1 50 24/12-31/12
Using resample:
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df = df.resample('W').sum()
            Price
Date
2015-12-20     60
2015-12-27     90
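Note that resample('W') bins weeks ending on Sunday, which is why the labels above are 2015-12-20 and 2015-12-27. To match the asker's Thursday-to-Wednesday weeks, an anchored offset can be used (a sketch):
# 'W-WED' bins weeks ending on Wednesday: 17/12-23/12 and 24/12-30/12
df.resample('W-WED').sum()
#             Price
# Date
# 2015-12-23    100
# 2015-12-30     50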