Summing different dates and categories - python

So I have a pandas DataFrame that is grouped by date and a particular category and has the sum of another column. What I would like to do is take the value for a particular category on a particular day, add it to the next day's value, then carry that running total forward to the day after. For example, say the category is apples, the date is 5-26-2021, and the cost is $5. The next day, 5-27-2021, is $6, so 5-27-2021 should show a cost of $11. Then 5-28-2021 has a cost of $3, which should be added to the $11, so it should show $14. How can I go about doing this? There are multiple categories besides just apples, by the way. Thank you!
[Image of the input data frame omitted.]
Expected output: [image omitted] (the output is not the most accurate, and this data frame is not the most accurate, so feel free to ask questions)

Use groupby then cumsum
import pandas as pd

data = [
    [2021, 'apple', 1],
    [2022, 'apple', 2],
    [2021, 'banana', 3],
    [2022, 'cherry', 4],
    [2022, 'banana', 5],
    [2023, 'cherry', 6],
]
columns = ['date', 'category', 'cost']
df = pd.DataFrame(data, columns=columns)
>>> df
date category cost
0 2021 apple 1
1 2022 apple 2
2 2021 banana 3
3 2022 cherry 4
4 2022 banana 5
5 2023 cherry 6
# Sort so the cumulative sum runs in date order within each category
df.sort_values(['category', 'date'], inplace=True)
df.reset_index(drop=True, inplace=True)
# Running total of cost within each category
df['CostCsum'] = df.groupby('category')['cost'].cumsum()
date category cost CostCsum
0 2021 apple 1 1
1 2022 apple 2 3
2 2021 banana 3 3
3 2022 banana 5 8
4 2022 cherry 4 4
5 2023 cherry 6 10
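
Applied to daily data like the question describes (a hypothetical one-category frame with the dates and dollar amounts above), the same groupby/cumsum pattern produces the running totals 5, 11, 14:

import pandas as pd

# Hypothetical daily data matching the example in the question
daily = pd.DataFrame({
    'date': ['5-26-2021', '5-27-2021', '5-28-2021'],
    'category': ['apples', 'apples', 'apples'],
    'cost': [5, 6, 3],
})
# Parse the dates so sorting is chronological rather than lexical
daily['date'] = pd.to_datetime(daily['date'], format='%m-%d-%Y')
daily = daily.sort_values(['category', 'date'])
daily['running_cost'] = daily.groupby('category')['cost'].cumsum()
print(daily)
#         date category  cost  running_cost
# 0 2021-05-26   apples     5             5
# 1 2021-05-27   apples     6            11
# 2 2021-05-28   apples     3            14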

Related

How to find the number of rows in a column that are above the mean?

I have a dataset in which column A holds the release year of each product and column B holds its sales.
I want to know how many products have sales above the mean for each year.
The dataset is a pandas DataFrame.
Thank you, and I hope my question is clear.
Compute yearly averages with groupby.transform() and compare them against the individual sales, e.g.:
import numpy as np
import pandas as pd

# Random example data; the sampled values shown below will differ run to run
df = pd.DataFrame({
    'product': np.random.choice(['foo', 'bar'], size=10),
    'year': np.random.choice([2019, 2020, 2021], size=10),
    'sales': np.random.randint(10000, size=10),
})
# product year sales
# 0 foo 2019 7507
# 1 bar 2019 9186
# 2 foo 2021 6234
# 3 foo 2021 7375
# 4 bar 2020 9934
# 5 foo 2021 6403
# 6 foo 2021 7729
# 7 foo 2021 1875
# 8 bar 2020 7148
# 9 foo 2019 8163
df['above_mean'] = df.sales > df.groupby(['product','year']).sales.transform('mean')
df.groupby('year', as_index=False).above_mean.sum()
# year above_mean
# 0 2019 1
# 1 2020 1
# 2 2021 4
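
Note the comparison above is against each product-and-year group mean. If, as the question reads, each sale should instead be compared against the overall mean of its year, group by year alone; a minimal variant of the same idea:

# Compare each sale to the mean of its year across all products
df['above_year_mean'] = df.sales > df.groupby('year').sales.transform('mean')
df.groupby('year', as_index=False).above_year_mean.sum()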

How to find churned customers on a monthly basis? Python Pandas

I have a large customer dataset with things like Customer ID, Service ID, Product, etc. So there are two ways we can measure churn: at a Customer-ID level, if the entire customer leaves, and at a Service-ID level, where maybe they cancel 2 out of 5 services.
The data looks like this, and as we can see:
Alligators stops being a customer at the end of Jan, as they don't have any rows in Feb (CustomerChurn)
Aunties stops being a customer at the end of Jan, as they don't have any rows in Feb (CustomerChurn)
Bricks continues with Apples and Bananas in Jan and Feb (ServiceContinue)
Bricks continues being a customer but cancels two services at the end of Jan (ServiceChurn)
I am trying to write some code that creates the 'Churn' column. I have tried manually grabbing sets of CustomerIDs and ServiceIDs from Oct 2019 and comparing them to Nov 2019 to find the ones that churned; a sketch of that attempt follows the sample data below. This is not too slow, but it doesn't seem very Pythonic.
Thank you!
import pandas as pd

data = {
    'CustomerName': ['Alligators', 'Aunties', 'Bricks', 'Bricks', 'Bricks', 'Bricks', 'Bricks', 'Bricks', 'Bricks', 'Bricks'],
    'ServiceID': [1009, 1008, 1001, 1002, 1003, 1004, 1001, 1002, 1001, 1002],
    'Product': ['Apples', 'Apples', 'Apples', 'Bananas', 'Oranges', 'Watermelon', 'Apples', 'Bananas', 'Apples', 'Bananas'],
    'Month': ['Jan', 'Jan', 'Jan', 'Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Year': [2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021],
    'Churn': ['CustomerChurn', 'CustomerChurn', 'ServiceContinue', 'ServiceContinue', 'ServiceChurn', 'ServiceChurn', 'ServiceContinue', 'ServiceContinue', 'NA', 'NA'],
}
df = pd.DataFrame(data)
df
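
For reference, a minimal sketch of the set-comparison attempt described above, using the Jan and Feb rows of this sample data to stand in for the Oct/Nov 2019 months mentioned in the question:

# Service IDs active in each month
jan_services = set(df.loc[df['Month'].eq('Jan'), 'ServiceID'])
feb_services = set(df.loc[df['Month'].eq('Feb'), 'ServiceID'])

# Services present in Jan but missing in Feb have churned
churned_services = jan_services - feb_services
print(churned_services)  # {1003, 1004, 1008, 1009} (set order may vary)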
I think this gets close to what you want, except for the NA in the last two rows; if you really need those NA values, you can filter by date and change them.
Because you are really testing two different groupings, I send the first CustomerName grouping through a function and, depending on what I see, send a more refined grouping through a second function. For this data set it seems to work.
I create an actual date column and make sure everything is sorted before grouping. The logic inside the functions tests the max date of each group against a cutoff date; it looks like you are treating March as the current month.
You should be able to adapt this to your needs.
import datetime

# Build a real date from the Month/Year columns so groups can be compared chronologically
df['testdate'] = df.apply(lambda x: datetime.datetime.strptime('-'.join((x['Month'], str(x['Year']))), '%b-%Y'), axis=1)
df = df.sort_values('testdate')
df1 = df.drop('Churn', axis=1)

def get_customerchurn(x, tdate):
    # If the customer's last active month is before the cutoff, the whole customer churned
    if x.testdate.max() < tdate:
        x.loc[:, 'Churn'] = 'CustomerChurn'
        return x
    else:
        # Otherwise check each of the customer's products individually
        x = x.groupby(['CustomerName', 'Product']).apply(lambda x: get_servicechurn(x, datetime.datetime(2021, 3, 1)))
        return x

def get_servicechurn(x, tdate):
    # If this product's last active month is before the cutoff, the service churned
    if x.testdate.max() < tdate:
        x.loc[:, 'Churn'] = 'ServiceChurn'
        return x
    else:
        x.loc[:, 'Churn'] = 'ServiceContinue'
        return x

df2 = df1.groupby(['CustomerName']).apply(lambda x: get_customerchurn(x, datetime.datetime(2021, 3, 1)))
df2
Output:
CustomerName ServiceID Product Month Year testdate Churn
0 Alligators 1009 Apples Jan 2021 2021-01-01 CustomerChurn
1 Aunties 1008 Apples Jan 2021 2021-01-01 CustomerChurn
2 Bricks 1001 Apples Jan 2021 2021-01-01 ServiceContinue
3 Bricks 1002 Bananas Jan 2021 2021-01-01 ServiceContinue
4 Bricks 1003 Oranges Jan 2021 2021-01-01 ServiceChurn
5 Bricks 1004 Watermelon Jan 2021 2021-01-01 ServiceChurn
6 Bricks 1001 Apples Feb 2021 2021-02-01 ServiceContinue
7 Bricks 1002 Bananas Feb 2021 2021-02-01 ServiceContinue
8 Bricks 1001 Apples Mar 2021 2021-03-01 ServiceContinue
9 Bricks 1002 Bananas Mar 2021 2021-03-01 ServiceContinue
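
As a design note, the same labels can be computed without the nested apply by broadcasting each group's last active month with transform and comparing it to the cutoff; a sketch under the same assumption that March 2021 is the current month:

import numpy as np

cutoff = datetime.datetime(2021, 3, 1)
# Last active month per customer, and per customer/product pair
last_customer = df.groupby('CustomerName')['testdate'].transform('max')
last_service = df.groupby(['CustomerName', 'Product'])['testdate'].transform('max')
df['Churn'] = np.where(last_customer < cutoff, 'CustomerChurn',
                       np.where(last_service < cutoff, 'ServiceChurn', 'ServiceContinue'))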

Sort values intra group [duplicate]

This question already has an answer here:
Pandas groupby sort each group values and order dataframe groups based on max of each group
(1 answer)
Suppose I have this dataframe:
import pandas as pd

df = pd.DataFrame({
    'price': [2, 13, 24, 15, 11, 44],
    'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})
price category
0 2 shirts
1 13 pants
2 24 shirts
3 15 tops
4 11 hat
5 44 tops
I want to sort the values in such a way that I:
1. Find the highest price in each category.
2. Sort the categories by their highest price (here, in descending order: tops, shirts, pants, hat).
3. Sort the rows within each category by price, descending.
The final dataframe would look like:
price category
0 44 tops
1 15 tops
2 24 shirts
3 2 shirts
4 13 pants
5 11 hat
I'm not a big fan of one-liners, so here's my solution:
# Add a column with the max price for each category
df = df.merge(df.groupby('category')['price'].max().rename('max_cat_price'),
              left_on='category', right_index=True)
# Sort by each category's max price, then by price within the category
df = df.sort_values(['max_cat_price', 'price'], ascending=False)
# Drop the helper column
df.drop('max_cat_price', axis=1, inplace=True)
print(df)
price category
5 44 tops
3 15 tops
2 24 shirts
0 2 shirts
1 13 pants
4 11 hat
You can use .groupby and .sort_values:
df.join(df.groupby("category").agg("max"), on="category", rsuffix="_r").sort_values(
    ["price_r", "price"], ascending=False
)
Output
price category price_r
5 44 tops 44
3 15 tops 44
2 24 shirts 24
0 2 shirts 24
1 13 pants 13
4 11 hat 11
I used get_group in a DataFrame apply to get the max price for each category:
df = pd.DataFrame({
    'price': [2, 13, 24, 15, 11, 44],
    'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})
grouped = df.groupby('category')
# Look up each row's category group and take its max price
df['price_r'] = df['category'].apply(lambda row: grouped.get_group(row).price.max())
df = df.sort_values(['price_r', 'price'], ascending=False)
print(df)
Output:
price category price_r
5 44 tops 44
3 15 tops 44
2 24 shirts 24
0 2 shirts 24
1 13 pants 13
4 11 hat 11
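
For completeness, the three steps in the question can also be written with groupby.transform, which broadcasts each category's max price onto its rows without a merge, join, or get_group; a minimal sketch starting from the original two-column frame:

out = (df[['price', 'category']]
       .assign(max_cat_price=lambda d: d.groupby('category')['price'].transform('max'))
       .sort_values(['max_cat_price', 'price'], ascending=False)
       .drop(columns='max_cat_price'))
print(out)
#    price category
# 5     44     tops
# 3     15     tops
# 2     24   shirts
# 0      2   shirts
# 1     13    pants
# 4     11      hat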

Sum of a Column in a Dictionary of Dataframes

How can I work with a dictionary of dataframes, please? Or is there a better way to get an overview of my data? For example, if I have:
Fruit Qty Year
Apple 2 2016
Orange 1 2017
Mango 2 2016
Apple 9 2016
Orange 8 2015
Mango 7 2016
Apple 6 2016
Orange 5 2017
Mango 4 2015
Then I am trying to find out how many in total I get per year, for example:
2015 2016 2017
Apple 0 17 0
Orange 8 0 6
Mango 4 9 0
I have written some code but it might not be useful:
import pandas as pd

# Fruit data
df_1 = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Mango', 'Apple', 'Orange', 'Mango', 'Apple', 'Orange', 'Mango'],
    'Qty': [2, 1, 2, 9, 8, 7, 6, 5, 4],
    'Year': [2016, 2017, 2016, 2016, 2015, 2016, 2016, 2017, 2015],
})
# Create a list of fruits
Fruits = df_1.Fruit.unique()
# Break down the dataframe by year
df_2015 = df_1[df_1['Year'] == 2015]
df_2016 = df_1[df_1['Year'] == 2016]
df_2017 = df_1[df_1['Year'] == 2017]
# Create a dataframe dictionary keyed by fruit
Dict_2015 = {elem: pd.DataFrame for elem in Fruits}
Dict_2016 = {elem: pd.DataFrame for elem in Fruits}
Dict_2017 = {elem: pd.DataFrame for elem in Fruits}
# Store the rows for each fruit x each year
for Fruit in Dict_2015.keys():
    Dict_2015[Fruit] = df_2015[df_2015.Fruit == Fruit]
for Fruit in Dict_2016.keys():
    Dict_2016[Fruit] = df_2016[df_2016.Fruit == Fruit]
for Fruit in Dict_2017.keys():
    Dict_2017[Fruit] = df_2017[df_2017.Fruit == Fruit]
You can use pandas.pivot_table.
import numpy as np

res = df_1.pivot_table(index='Fruit', columns='Year', values='Qty',
                       aggfunc=np.sum, fill_value=0)
print(res)
Year 2015 2016 2017
Fruit
Apple 0 17 0
Mango 4 9 0
Orange 8 0 6
For guidance on usage, see How to pivot a dataframe.
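An equivalent spelling, in case pivot_table feels heavy, is groupby followed by unstack; a small sketch assuming the df_1 frame defined in the question:

res = df_1.groupby(['Fruit', 'Year'])['Qty'].sum().unstack(fill_value=0)
print(res)
# Year    2015  2016  2017
# Fruit
# Apple      0    17     0
# Mango      4     9     0
# Orange     8     0     6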
jpp has already posted an answer in the format you wanted. However, since your question seems open to other views, I thought I'd share another way. It's not exactly the format you posted, but this is how I usually do it.
res = df_1.groupby(['Fruit', 'Year']).agg({'Qty': 'sum'}).reset_index()
This will look something like:
Fruit Year Qty
Apple 2016 17
Mango 2015 4
Mango 2016 9
Orange 2015 8
Orange 2017 6
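If the wide year-by-fruit layout from the question is still wanted, the long result above can be reshaped with pivot; a small sketch, assuming the res frame just shown:

wide = res.pivot(index='Fruit', columns='Year', values='Qty').fillna(0).astype(int)
print(wide)
# Year    2015  2016  2017
# Fruit
# Apple      0    17     0
# Mango      4     9     0
# Orange     8     0     6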

Groupby 2 different columns Python Pandas

import pandas as pd

df1 = pd.DataFrame([['Dog', '2017', 100], ['Dog', '2017', 500], ['Dog', '2016', 200],
                    ['Dog', '2016', 150], ['Cat', '2017', 50], ['Cat', '2017', 100],
                    ['Cat', '2016', 50]], columns=('Pet', 'Year', 'Amount'))
DF1
Pet Year Amount
Dog 2017 100
Dog 2017 500
Dog 2016 200
Dog 2016 150
Cat 2017 50
Cat 2017 100
Cat 2016 50
I would like to turn the above dataframe into the following:
DF2
Pet Year Amount
Dog 2017 600
Dog 2016 350
Cat 2017 150
Cat 2016 50
This is grouping by Pet and Year and summing the Amount within each group.
Any ideas?
Use groupby with the parameter as_index=False to avoid returning a MultiIndex, and sort=False to avoid sorting:
print (df1.groupby(['Pet','Year'], as_index=False, sort=False).sum())
Pet Year Amount
0 Dog 2017 600
1 Dog 2016 350
2 Cat 2017 150
3 Cat 2016 50
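On newer pandas versions it is safer to select the aggregation column explicitly, so the sum only touches Amount; a minimal variant:

print(df1.groupby(['Pet', 'Year'], as_index=False, sort=False)['Amount'].sum())
#    Pet  Year  Amount
# 0  Dog  2017     600
# 1  Dog  2016     350
# 2  Cat  2017     150
# 3  Cat  2016      50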
