pandas groupby sum if count equals condition - python

I'm downloading data from FRED and summing monthly values to get annual numbers, but I don't want incomplete years. So I need to sum only when the count of observations for a year is 12, since the series is monthly.
import pandas_datareader.data as web
mnemonic = 'RSFSXMV'
df = web.DataReader(mnemonic, 'fred', 2000, 2020)
df['year'] = df.index.year
new_df = df.groupby(["year"])[mnemonic].sum().reset_index()
print(new_df)
I don't want 2019 to show up.

In your case, use transform with nunique to make sure each year has 12 unique months; if not, we drop those rows before doing the groupby sum:
df['Month'] = df.index.month
m = df.groupby('year').Month.transform('nunique') == 12
new_df = df.loc[m].groupby(["year"])[mnemonic].sum().reset_index()
Or with isin:
df['Month'] = df.index.month
m = df.groupby('year').Month.nunique()
new_df = df.loc[df.year.isin(m.index[m == 12])].groupby(["year"])[mnemonic].sum().reset_index()

You could use the aggregate function count while doing the groupby:
df['year'] = df.index.year
df = df.groupby('year').agg({'RSFSXMV': 'sum', 'year': 'count'})
which will give you:
RSFSXMV year
year
2000 2487790 12
2001 2563218 12
2002 2641870 12
2003 2770397 12
2004 2969282 12
2005 3196141 12
2006 3397323 12
2007 3531906 12
2008 3601512 12
2009 3393753 12
2010 3541327 12
2011 3784014 12
2012 3934506 12
2013 4043037 12
2014 4191342 12
2015 4252113 12
2016 4357528 12
2017 4561833 12
2018 4810502 12
2019 2042147 5
Then simply drop the rows with a year count less than 12.
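For example, a minimal sketch (renaming the count column to n_months is my own choice, to avoid the clash with the year index):
df = df.rename(columns={'year': 'n_months'})
df = df[df['n_months'] == 12].drop(columns='n_months')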

Related

how to find the number of rows in a column that are above the mean?

I have a dataset where column A holds the release year of each product and column B holds its sales.
I want to know how many products have sales above the mean for each year.
The dataset is a pandas dataframe.
Thank you, and I hope my question is clear.
Compute yearly averages with groupby.transform() and compare them against the individual sales, e.g.:
import numpy as np
import pandas as pd

df = pd.DataFrame({'product': np.random.choice(['foo','bar'], size=10), 'year': np.random.choice([2019,2020,2021], size=10), 'sales': np.random.randint(10000, size=10)})
# product year sales
# 0 foo 2019 7507
# 1 bar 2019 9186
# 2 foo 2021 6234
# 3 foo 2021 7375
# 4 bar 2020 9934
# 5 foo 2021 6403
# 6 foo 2021 7729
# 7 foo 2021 1875
# 8 bar 2020 7148
# 9 foo 2019 8163
df['above_mean'] = df.sales > df.groupby(['product','year']).sales.transform('mean')
df.groupby('year', as_index=False).above_mean.sum()
# year above_mean
# 0 2019 1
# 1 2020 1
# 2 2021 4
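If instead you want to compare each sale against the overall mean of its year, ignoring the product, drop 'product' from the grouping. A sketch on the same df (the above_year_mean column name is mine):
df['above_year_mean'] = df.sales > df.groupby('year').sales.transform('mean')
df.groupby('year', as_index=False).above_year_mean.sum()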

How to groupby in Pandas and keep all columns [duplicate]

I have a data frame like this:
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
And I need to group drug names and mean number of ingredients by year like this:
year drug_name avg_number_of_ingredients
0 2019 drug a,b,c.. mean value for column
1 2018 drug a,b,c.. mean value for column
2 2017 drug a,b,c.. mean value for column
If I do df.groupby('year'), I lose drug names. How can I do it?
Let me show you the solution on a simple example. First, I make the same data frame as you have:
>>> df = pd.DataFrame(
...     [
...         {'year': 2019, 'drug_name': 'NEXIUM I.V.', 'avg_number_of_ingredients': 8},
...         {'year': 2016, 'drug_name': 'ZOLADEX', 'avg_number_of_ingredients': 10},
...         {'year': 2017, 'drug_name': 'PRILOSEC', 'avg_number_of_ingredients': 59},
...         {'year': 2017, 'drug_name': 'BYDUREON BCise', 'avg_number_of_ingredients': 24},
...         {'year': 2019, 'drug_name': 'Lynparza', 'avg_number_of_ingredients': 28},
...     ]
... )
>>> print(df)
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
Now, I make df_grouped, which still keeps the drug name information.
>>> df_grouped = df.groupby('year', as_index=False).agg({'drug_name': ', '.join, 'avg_number_of_ingredients': 'mean'})
>>> print(df_grouped)
year drug_name avg_number_of_ingredients
0 2016 ZOLADEX 10.0
1 2017 PRILOSEC, BYDUREON BCise 41.5
2 2019 NEXIUM I.V., Lynparza 18.0
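On pandas 0.25+ the same result can also be written with named aggregation; a sketch on the same df:
>>> df_grouped = df.groupby('year', as_index=False).agg(
...     drug_name=('drug_name', ', '.join),
...     avg_number_of_ingredients=('avg_number_of_ingredients', 'mean'),
... )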

Select by Column Values

I know it is possible in arcpy. I'm trying to find out whether it can be done in pandas.
I have the following
data = {'Species': ['P.PIN', 'P.PIN', 'V.FOG', 'V.KOP', 'E.MON', 'E.CLA', 'E.KLI', 'D.FGH', 'W.ERT', 'S.MIX', 'P.PIN'],
        'FY': ['2002', '2016', '2018', '2010', '2009', '2019', '2017', '2016', '2018', '2018', '2016']}
I need to select all the P.PIN, P.RAD and any other species starting with E that have a FY of 2016 or later, and put them into a new dataframe.
How can I get this done? All I am able to do is select P.PIN and P.RAD, but I'm stuck on adding in all the others starting with E:
df3 = df[(df['FY'] >= 2016) & (df['Species'].isin(['P.PIN', 'P.RAD']))]
Your help will be highly appreciated.
Here is a step-by-step way. You can also combine the logic inside a single np.where(); I just want to show each condition being applied separately.
Start by typecasting your df['FY'] values as int so we can use the greater than (>) operator.
>>> df = pd.DataFrame(data)
>>> df['FY'] = df['FY'].astype(int)
>>> df['flag'] = np.where(df['Species'].isin(['P.PIN', 'P.RAD']), 'Take', 'Remove')
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Remove
6 E.KLI 2017 Remove
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> df['flag'] = np.where((df['FY'] > 2016) & (df['Species'].str.startswith('E')), 'Take', df['flag'])
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Take
6 E.KLI 2017 Take
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> new_df = df[df['flag'].isin(['Take'])][['Species', 'FY']]
>>> new_df
Species FY
0 P.PIN 2002
1 P.PIN 2016
5 E.CLA 2019
6 E.KLI 2017
10 P.PIN 2016
Hope this helps :D
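For completeness, here is the combined version mentioned above, with both conditions in a single boolean mask (a sketch on the same df, assuming FY has already been cast to int):
>>> mask = df['Species'].isin(['P.PIN', 'P.RAD']) | ((df['FY'] > 2016) & df['Species'].str.startswith('E'))
>>> new_df = df[mask][['Species', 'FY']]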

iterate over pandas dataframe and create another dataframe with repetitive records

I have a dataframe act with columns ['ids','start-yr','end-yr'].
I want to use it to create another dataframe timeline with columns ['ids','years'].
So if act has fields like
ids start-yr end-yr
--------------------------------
'IAs728-ahe83j' 2014 2016
'J8273nbajsu-193h' 2012 2018
I want the timeline df to be populated like this:
ids years
------------------------
'IAs728-ahe83j' 2014
'IAs728-ahe83j' 2015
'IAs728-ahe83j' 2016
'J8273nbajsu-193h' 2012
'J8273nbajsu-193h' 2013
'J8273nbajsu-193h' 2014
'J8273nbajsu-193h' 2015
'J8273nbajsu-193h' 2016
'J8273nbajsu-193h' 2017
'J8273nbajsu-193h' 2018
My attempt so far:
timeline = pd.DataFrame(columns=['ids', 'years'])
cnt = 0
for ix, row in act.iterrows():
    for yr in range(int(row['start-yr']), int(row['end-yr']) + 1):
        timeline.loc[cnt, 'ids'] = row['ids']
        timeline.loc[cnt, 'years'] = yr
        cnt += 1
But this is a very costly, time-consuming operation (which is obvious, I know). So what would be the most pythonic approach to populate a pandas df in a situation like this?
Any help is appreciated, thanks.
Use a list comprehension with range to build a list of tuples, and pass it to the DataFrame constructor:
a = [(i, x) for i, a, b in df.values for x in range(a, b + 1)]
df = pd.DataFrame(a, columns=['ids','years'])
print(df)
ids years
0 'IAs728-ahe83j' 2014
1 'IAs728-ahe83j' 2015
2 'IAs728-ahe83j' 2016
3 'J8273nbajsu-193h' 2012
4 'J8273nbajsu-193h' 2013
5 'J8273nbajsu-193h' 2014
6 'J8273nbajsu-193h' 2015
7 'J8273nbajsu-193h' 2016
8 'J8273nbajsu-193h' 2017
9 'J8273nbajsu-193h' 2018
If the DataFrame has more columns than these, first filter the relevant ones by a list:
c = ['ids','start-yr','end-yr']
a = [(i, x) for i, a, b in df[c].values for x in range(a, b + 1)]
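On pandas 0.25 or newer, another option is to build each year range as a list and let explode do the reshaping; a sketch assuming the same three columns:
df['years'] = [list(range(a, b + 1)) for a, b in zip(df['start-yr'], df['end-yr'])]
timeline = df.explode('years')[['ids', 'years']].reset_index(drop=True)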

Pandas Groupby with multiple columns selecting rows with full range of values

I am working with a pandas dataframe. From the code:
contracts.groupby(['State','Year'])['$'].mean()
I have a pandas groupby object with two group layers: State and Year.
State / Year / $
NY 2009 5
2010 10
2011 5
2012 15
NJ 2009 2
2012 12
DE 2009 1
2010 2
2011 3
2012 6
I would like to look at only those states for which I have data on all the years (i.e. NY and DE, not NJ as it is missing 2010). Is there a way to suppress those nested groups with less than full rank?
After grouping by State and Year and taking the mean,
means = contracts.groupby(['State', 'Year'])['$'].mean()
you could groupby the State alone, and use filter to keep the desired groups:
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))
For example,
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 15
states = ['NY','NJ','DE']
years = range(2009, 2013)
contracts = pd.DataFrame({
    'State': np.random.choice(states, size=N),
    'Year': np.random.choice(years, size=N),
    '$': np.random.randint(10, size=N)})
means = contracts.groupby(['State', 'Year'])['$'].mean()
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))
print(result)
yields
State Year
DE 2009 8
2010 5
2011 3
2012 6
NY 2009 2
2010 1
2011 5
2012 9
Name: $, dtype: int64
Alternatively, you could filter first and then take the mean:
filtered = contracts.groupby(['State']).filter(lambda x: x['Year'].nunique() >= len(years))
result = filtered.groupby(['State', 'Year'])['$'].mean()
but playing with various examples suggests this is typically slower than taking the mean and then filtering.
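Another way to express the same filter is to pivot the years into columns, drop states with any missing year, and stack back; a sketch on the same means Series:
result = means.unstack('Year').dropna().stack()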
