df

order_date  Month Name  Year  Days  Data
2015-12-20  Dec         2014  1     3
2016-1-21   Jan         2014  2     3
2015-08-20  Aug         2015  1     1
2016-04-12  Apr         2016  4     1
... and so on
Code:
df1 = df.groupby(["Year", "Month Name"], as_index=False)["Days"].agg(['min', 'mean'])
df3 = df.groupby(["Year", "Month Name"], as_index=False)["Data"].agg(['count'])
merged_df = pd.merge(df3, df1, on=['Year', 'Month Name'])
The merged groupby output looks like this:
Min Mean Count
Year Month Name
2015 Aug 2 11 200
Dec 5 13 130
Feb 3 15 100
Jan 4 20 123
May 1 21 342
Nov 2 12 234
2016 Apr 1 10 200
Dec 2 12 120
Feb 2 13 200
Jan 2 24 200
Sep 1 25 220
Issue:
The groupby output is sorted alphabetically by Month Name, so I get Apr, Aug, Dec, Feb, ... rather than Jan, Feb, ..., Dec. How do I get the output sorted by month number?
I need output like 2016: Jan, Feb, ..., Dec, then 2017: Jan, Feb, ..., Dec.
Please note the solution must survive merging the two DataFrames. I have only presented simplified code here (the real code is different); I need to merge both frames before I can continue.
EDIT: Your solution needs only small changes:
df1 = df.groupby(["Year", "Month Name"], as_index=False)["Days"].agg(['min', 'mean'])
df3 = df.groupby(["Year", "Month Name"], as_index=False)["Data"].agg(['count'])
merged_df = pd.merge(df3, df1, on=['Year', 'Month Name']).reset_index()
cats = ['Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
merged_df['Month Name'] = pd.Categorical(merged_df['Month Name'],categories=cats, ordered=True)
merged_df = merged_df.sort_values(["Year", "Month Name"])
print (merged_df)
Year Month Name count min mean
1 2014 Jan 1 2 2
0 2014 Dec 1 1 1
2 2015 Aug 1 1 1
3 2016 Apr 1 4 4
Or:
df1 = (df.groupby(["Year", "Month Name"])
         .agg(min_days=("Days", 'min'),
              avg_days=("Days", 'mean'),
              count=('Data', 'count'))
         .reset_index())
cats = ['Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df1['Month Name'] = pd.Categorical(df1['Month Name'], categories=cats, ordered=True)
df1 = df1.sort_values(["Year", "Month Name"])
print (df1)
Year Month Name min_days avg_days count
1 2014 Jan 2 2 1
0 2014 Dec 1 1 1
2 2015 Aug 1 1 1
3 2016 Apr 4 4 1
The last solution keeps the MultiIndex and uses no categoricals: it creates a helper dates column and sorts by it:
df1 = (df.groupby(["Year", "Month Name"])
         .agg(min_days=("Days", 'min'),
              avg_days=("Days", 'mean'),
              count=('Data', 'count')))
df1['dates'] = pd.to_datetime([f'{y}{m}' for y, m in df1.index], format='%Y%b')
df1 = df1.sort_values('dates')
print (df1)
min_days avg_days count dates
Year Month Name
2014 Jan 2 2 1 2014-01-01
Dec 1 1 1 2014-12-01
2015 Aug 1 1 1 2015-08-01
2016 Apr 4 4 1 2016-04-01
Simply tell groupby that you don't want it to sort the group keys (sorting them is the default behaviour - see the docs):
df.groupby(["Year", "Month Name"], as_index=False, sort=False)["Days"].agg(
["min", "mean"]
)
NOTE: you should make sure your df is sorted chronologically before applying groupby.
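Putting those two steps together, a minimal sketch (the sample frame below is hypothetical, with Year values matching the dates for clarity):

```python
import pandas as pd

# Hypothetical sample in the shape of the question's frame
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2015-12-20", "2016-01-21", "2015-08-20", "2016-04-12"]),
    "Month Name": ["Dec", "Jan", "Aug", "Apr"],
    "Year": [2015, 2016, 2015, 2016],
    "Days": [1, 2, 1, 4],
})

# Sort chronologically first, then tell groupby not to re-sort the keys
out = (df.sort_values("order_date")
         .groupby(["Year", "Month Name"], sort=False)["Days"]
         .agg(["min", "mean"])
         .reset_index())
print(out)
```

With sort=False the groups come out in order of first appearance, which after the sort_values call is chronological order.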
Here is my solution: sort by month number and return the sorted month names in level 1 of the MultiIndex, taking merged_df as the input:
import calendar
d = {m: i for i, m in enumerate(calendar.month_abbr)}
# for full month names use: d = {m: i for i, m in enumerate(calendar.month_name)}
merged_df.index = pd.MultiIndex.from_tuples(sorted(merged_df.index, key=lambda x: d.get(x[1])))
merged_df = merged_df.sort_index(level=0)
print(merged_df)
count min mean
Year Month Name
2014 Jan 1 2 2
Dec 1 1 1
2015 Aug 1 1 1
2016 Apr 1 4 4
Related
I have a large customer dataset, it has things like Customer ID, Service ID, Product, etc. So the two ways we can measure churn are at a Customer-ID level, if the entire customer leaves and at a Service-ID level where maybe they cancel 2 out of 5 services.
The data looks like this, and as we can see:
Alligators stops being a customer at the end of Jan, as they don't have any rows in Feb (CustomerChurn)
Aunties stops being a customer at the end of Jan, as they don't have any rows in Feb (CustomerChurn)
Bricks continues with Apples and Bananas in Jan and Feb (ServiceContinue)
Bricks continues being a customer but cancels two services at the end of Jan (ServiceChurn)
I am trying to write some code that creates the 'Churn' column. What I have tried:
Manually grabbing sets of CustomerIDs and ServiceIDs from Oct 2019 and comparing them to Nov 2019 to find the ones that churned. This is not too slow, but it doesn't seem very Pythonic.
Thank you!
data = {'CustomerName': ['Alligators','Aunties', 'Bricks', 'Bricks','Bricks', 'Bricks', 'Bricks', 'Bricks', 'Bricks', 'Bricks'],
'ServiceID': [1009, 1008, 1001, 1002, 1003, 1004, 1001, 1002, 1001, 1002],
'Product': ['Apples', 'Apples', 'Apples', 'Bananas', 'Oranges', 'Watermelon', 'Apples', 'Bananas', 'Apples', 'Bananas'],
'Month': ['Jan', 'Jan', 'Jan', 'Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
'Year': [2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021],
'Churn': ['CustomerChurn', 'CustomerChurn', 'ServiceContinue', 'ServiceContinue', 'ServiceChurn', 'ServiceChurn','ServiceContinue', 'ServiceContinue', 'NA', 'NA']}
df = pd.DataFrame(data)
df
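For reference, the set-based comparison described above can be sketched like this (a minimal illustration against a trimmed copy of the sample data, not the general solution being asked for):

```python
import pandas as pd

# Trimmed copy of the sample data (Jan and Feb rows only)
df = pd.DataFrame({
    "CustomerName": ["Alligators", "Aunties", "Bricks", "Bricks", "Bricks", "Bricks", "Bricks", "Bricks"],
    "ServiceID": [1009, 1008, 1001, 1002, 1003, 1004, 1001, 1002],
    "Month": ["Jan", "Jan", "Jan", "Jan", "Jan", "Jan", "Feb", "Feb"],
})

# Customer-level churn: customers present in Jan but absent in Feb
jan = set(df.loc[df["Month"] == "Jan", "CustomerName"])
feb = set(df.loc[df["Month"] == "Feb", "CustomerName"])
customer_churn = jan - feb

# Service-level churn: (customer, service) pairs dropped by customers who stayed
jan_svc = set(df.loc[df["Month"] == "Jan", ["CustomerName", "ServiceID"]].itertuples(index=False, name=None))
feb_svc = set(df.loc[df["Month"] == "Feb", ["CustomerName", "ServiceID"]].itertuples(index=False, name=None))
service_churn = {(c, s) for c, s in jan_svc - feb_svc if c not in customer_churn}

print(customer_churn)   # customers who fully left
print(service_churn)    # services cancelled by continuing customers
```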
I think this gets close to what you want, except for the NA in the last two rows; if you really need those NA values, you can filter by date and change them.
Because you are really testing two different groupings, I send the first CustomerName grouping through a function and, depending on what I see, send a more refined grouping through a second function. For this data set it seems to work.
I create an actual date column and make sure everything is sorted before grouping. The logic inside the functions tests the max date of the group to see whether it falls before a certain cutoff date; it looks like you are treating March as the current month.
You should be able to adapt it for your needs.
import datetime

df['testdate'] = df.apply(lambda x: datetime.datetime.strptime('-'.join((x['Month'], str(x['Year']))), '%b-%Y'), axis=1)
df = df.sort_values('testdate')
df1 = df.drop('Churn', axis=1)

def get_customerchurn(x, tdate):
    # no rows on or after the cutoff -> the whole customer churned
    if x.testdate.max() < tdate:
        x.loc[:, 'Churn'] = 'CustomerChurn'
        return x
    else:
        # otherwise check each product separately
        x = x.groupby(['CustomerName', 'Product']).apply(lambda x: get_servicechurn(x, datetime.datetime(2021, 3, 1)))
        return x

def get_servicechurn(x, tdate):
    # no rows for this service on or after the cutoff -> the service churned
    if x.testdate.max() < tdate:
        x.loc[:, 'Churn'] = 'ServiceChurn'
        return x
    else:
        x.loc[:, 'Churn'] = 'ServiceContinue'
        return x

df2 = df1.groupby(['CustomerName']).apply(lambda x: get_customerchurn(x, datetime.datetime(2021, 3, 1)))
df2
Output:
CustomerName ServiceID Product Month Year testdate Churn
0 Alligators 1009 Apples Jan 2021 2021-01-01 CustomerChurn
1 Aunties 1008 Apples Jan 2021 2021-01-01 CustomerChurn
2 Bricks 1001 Apples Jan 2021 2021-01-01 ServiceContinue
3 Bricks 1002 Bananas Jan 2021 2021-01-01 ServiceContinue
4 Bricks 1003 Oranges Jan 2021 2021-01-01 ServiceChurn
5 Bricks 1004 Watermelon Jan 2021 2021-01-01 ServiceChurn
6 Bricks 1001 Apples Feb 2021 2021-02-01 ServiceContinue
7 Bricks 1002 Bananas Feb 2021 2021-02-01 ServiceContinue
8 Bricks 1001 Apples Mar 2021 2021-03-01 ServiceContinue
9 Bricks 1002 Bananas Mar 2021 2021-03-01 ServiceContinue
I have a dataframe structured as follows:
Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Mar D
Jason Jan B
Sue Apr A
Jason Feb C
I want to get the list of students who got a D for 3 consecutive months in the past 6 months. In the example above, Sue will be on the list since she got a D in Jan, Feb and Mar. How can I do that using Python, Pandas, or NumPy?
I tried to solve your problem. I do have a solution for you, but it may not be the fastest in terms of efficiency or execution time. Please see below:
newdf = df.pivot(index='Name', columns='Month', values='Grade')
newdf = newdf[['Jan', 'Feb', 'Mar', 'Apr']].fillna(-1)
newdf['concatenated'] = newdf['Jan'].astype('str') + newdf['Feb'].astype('str') + newdf['Mar'].astype('str') + newdf['Apr'].astype('str')
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)]
Output will be like:
Month Jan Feb Mar Apr concatenated
Name
Sue D D D A DDDA
If you just want the names, then the following command instead.
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)].index.to_list()
I came up with this.
df['Month_Nr'] = pd.to_datetime(df.Month, format='%b').dt.month
names = df.Name.unique()
students = np.array([])
for name in names:
    filter = df[(df.Name == name) & (df.Grade == 'D')].sort_values('Month_Nr')
    if filter['Month_Nr'].diff().cumsum().max() >= 2:
        students = np.append(students, name)
print(students)
Output:
['Sue']
You have a few ways to deal with this. The first is to use my previous solution, but that requires mapping academic numbers to months (i.e. September = 1, August = 12) so that you can apply math to work out consecutive values.
The alternative below converts Month into a datetime and works out the difference in months; we can then apply a cumulative sum and filter any values of 3 or greater.
from io import StringIO

d = StringIO("""Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Dec D
Jason Jan B
Sue Apr A
Jason Feb C""")

df = pd.read_csv(d, sep='\s+')
df['date'] = pd.to_datetime(df['Month'],format='%b').dt.normalize()
# set any values greater than June to the previous year.
df['date'] = np.where(df['date'].dt.month > 6,
(df['date'] - pd.DateOffset(years=1)),df['date'])
df.sort_values(['Name','date'],inplace=True)
def month_diff(date):
    # 1 where the gap to the previous row is exactly one month; cumulative-summed
    cumulative_months = (
        np.round(date.sub(date.shift(1)) / np.timedelta64(1, "M")).eq(1).cumsum()
    ) + 1
    return cumulative_months
df['count'] = df.groupby(["Name", "Grade"])["date"].apply(month_diff)
print(df.drop('date',axis=1))
Name Month Grade count
4 Jason Jan B 1
6 Jason Feb C 1
2 Jason Mar B 1
3 Sue Dec D 1
0 Sue Jan D 2
1 Sue Feb D 3
5 Sue Apr A 1
print(df.loc[df['Name'] == 'Sue'])
Name Month Grade date count
3 Sue Dec D 1899-12-01 1
0 Sue Jan D 1900-01-01 2
1 Sue Feb D 1900-02-01 3
5 Sue Apr A 1900-04-01 1
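To go from the count column to the actual list of names the question asks for, one way is to filter on rows where the running count reaches 3. A sketch against a hypothetical frame in the shape produced above:

```python
import pandas as pd

# Hypothetical frame in the shape produced above: each row carries the
# running count of consecutive D months for that student.
df = pd.DataFrame({
    "Name": ["Jason", "Jason", "Jason", "Sue", "Sue", "Sue", "Sue"],
    "Grade": ["B", "C", "B", "D", "D", "D", "A"],
    "count": [1, 1, 1, 1, 2, 3, 1],
})

# Students whose run of consecutive D grades reached 3
students = df.loc[df["Grade"].eq("D") & df["count"].ge(3), "Name"].unique().tolist()
print(students)
```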
This question already has answers here:
Pandas DataFrame Groupby two columns and get counts
(8 answers)
Closed 3 years ago.
I have a data frame like this:
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
And I need to group drug names and mean number of ingredients by year like this:
year drug_name avg_number_of_ingredients
0 2019 drug a,b,c.. mean value for column
1 2018 drug a,b,c.. mean value for column
2 2017 drug a,b,c.. mean value for column
If I do df.groupby('year'), I lose drug names. How can I do it?
Let me show you the solution on the simple example. First, I make the same data frame as you have:
>>> df = pd.DataFrame(
[
{'year': 2019, 'drug_name': 'NEXIUM I.V.', 'avg_number_of_ingredients': 8},
{'year': 2016, 'drug_name': 'ZOLADEX', 'avg_number_of_ingredients': 10},
{'year': 2017, 'drug_name': 'PRILOSEC', 'avg_number_of_ingredients': 59},
{'year': 2017, 'drug_name': 'BYDUREON BCise', 'avg_number_of_ingredients': 24},
{'year': 2019, 'drug_name': 'Lynparza', 'avg_number_of_ingredients': 28},
]
)
>>> print(df)
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
Now, I make df_grouped, which still keeps the drug name information.
>>> df_grouped = df.groupby('year', as_index=False).agg({'drug_name': ', '.join, 'avg_number_of_ingredients': 'mean'})
>>> print(df_grouped)
year drug_name avg_number_of_ingredients
0 2016 ZOLADEX 10.0
1 2017 PRILOSEC, BYDUREON BCise 41.5
2 2019 NEXIUM I.V., Lynparza 18.0
I'm downloading data from FRED. I'm summing to get annual numbers, but I don't want incomplete years, so I need to sum only when the count of observations in a year is 12 (the series is monthly).
import pandas_datareader.data as web
mnemonic = 'RSFSXMV'
df = web.DataReader(mnemonic, 'fred', 2000, 2020)
df['year'] = df.index.year
new_df = df.groupby(["year"])[mnemonic].sum().reset_index()
print(new_df)
I don't want 2019 to show up.
In your case, we use transform with nunique to make sure each year has 12 unique months; if not, we drop it before doing the groupby sum:
df['Month']=df.index.month
m=df.groupby('year').Month.transform('nunique')==12
new_df = df.loc[m].groupby(["year"])[mnemonic].sum().reset_index()
Or with isin:
df['Month']=df.index.month
m=df.groupby('year').Month.nunique()
new_df = df.loc[df.year.isin(m.index[m==12])].groupby(["year"])[mnemonic].sum().reset_index()
You could use the aggregate function count while doing the groupby:
df['year'] = df.index.year
df = df.groupby('year').agg({'RSFSXMV': 'sum', 'year': 'count'})
which will give you:
RSFSXMV year
year
2000 2487790 12
2001 2563218 12
2002 2641870 12
2003 2770397 12
2004 2969282 12
2005 3196141 12
2006 3397323 12
2007 3531906 12
2008 3601512 12
2009 3393753 12
2010 3541327 12
2011 3784014 12
2012 3934506 12
2013 4043037 12
2014 4191342 12
2015 4252113 12
2016 4357528 12
2017 4561833 12
2018 4810502 12
2019 2042147 5
Then simply drop the rows whose year count is less than 12.
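That last step could be sketched like this (the frame below is a hypothetical slice of the aggregated output above):

```python
import pandas as pd

# Hypothetical slice of the aggregated frame shown above:
# the 'year' column here is the month count per year.
agg = pd.DataFrame(
    {"RSFSXMV": [2487790, 2563218, 2042147], "year": [12, 12, 5]},
    index=pd.Index([2000, 2001, 2019], name="year"),
)

# Keep only complete years (12 monthly observations) and drop the count column
complete = agg.loc[agg["year"] == 12, ["RSFSXMV"]]
print(complete)
```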
Here's my dataset, (only one column)
Apr 1 09:14:55 i have apple
Apr 2 08:10:10 i have mango
There's the result I need
month date time message
Apr 1 09:14:55 i have apple
Apr 2 09:10:10 i have mango
This is what I've done
import pandas as pd

month = []
date = []
time = []
message = []
for line in dns_data:
    month.append(line.split()[0])
    date.append(line.split()[1])
    time.append(line.split()[2])
df = pd.DataFrame(data={'month': month, 'date': date, 'time': time})
This is the output I get
month date time
0 Apr 1 09:14:55
1 Apr 2 09:10:10
How to display message column?
Use the parameter n in Series.str.split to split on only the first 3 whitespaces; expand=True makes the output a DataFrame:
print (df)
col
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
df1 = df['col'].str.split(n=3, expand=True)
df1.columns=['month','date','time','message']
print (df1)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
Another solution with list comprehension:
c = ['month','date','time','message']
df1 = pd.DataFrame([x.split(maxsplit=3) for x in df['col']], columns=c)
print (df1)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
You could use Series.str.extractall with a regex pattern:
df = pd.DataFrame({'text': {0: 'Apr 1 09:14:55 i have apple', 1: 'Apr 2 08:10:10 i have mango'}})
df_new = (df.text.str
.extractall(r'^(?P<month>\w{3})\s?(?P<date>\d{1,2})\s?(?P<time>\d{2}:\d{2}:\d{2})\s?(?P<message>.*)$')
.reset_index(drop=True))
print(df_new)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
This may help you.
(?<Month>\w+)\s(?<Date>\d+)\s(?<Time>[\w:]+)\s(?<Message>.*)
Match 1
Month Apr
Date 1
Time 09:14:55
Message i have apple
Match 2
Month Apr
Date 2
Time 08:10:10
Message i have mango
https://rubular.com/r/1S4BcbDxPtlVxE
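Note that the pattern above uses Ruby's (?<name>...) group syntax; in Python's re module, named groups are written (?P<name>...). A quick sketch of the same pattern in Python:

```python
import re

# Same fields as the Rubular pattern, with Python's (?P<name>...) syntax
pattern = re.compile(r"(?P<Month>\w+)\s(?P<Date>\d+)\s(?P<Time>[\w:]+)\s(?P<Message>.*)")

m = pattern.match("Apr 1 09:14:55 i have apple")
print(m.group("Month"), m.group("Date"), m.group("Time"), m.group("Message"))
```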