I have a data frame like this:
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
And I need to group the drug names and take the mean number of ingredients by year, like this:
year drug_name avg_number_of_ingredients
0 2019 drug a,b,c.. mean value for column
1 2018 drug a,b,c.. mean value for column
2 2017 drug a,b,c.. mean value for column
If I do df.groupby('year'), I lose the drug names. How can I do this?
Let me show you the solution on a simple example. First, I make the same data frame as yours:
>>> df = pd.DataFrame(
[
{'year': 2019, 'drug_name': 'NEXIUM I.V.', 'avg_number_of_ingredients': 8},
{'year': 2016, 'drug_name': 'ZOLADEX', 'avg_number_of_ingredients': 10},
{'year': 2017, 'drug_name': 'PRILOSEC', 'avg_number_of_ingredients': 59},
{'year': 2017, 'drug_name': 'BYDUREON BCise', 'avg_number_of_ingredients': 24},
{'year': 2019, 'drug_name': 'Lynparza', 'avg_number_of_ingredients': 28},
]
)
>>> print(df)
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
Now, I make df_grouped, which still keeps the information about drug names.
>>> df_grouped = df.groupby('year', as_index=False).agg({'drug_name': ', '.join, 'avg_number_of_ingredients': 'mean'})
>>> print(df_grouped)
year drug_name avg_number_of_ingredients
0 2016 ZOLADEX 10.0
1 2017 PRILOSEC, BYDUREON BCise 41.5
2 2019 NEXIUM I.V., Lynparza 18.0
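As a variant, the same result can be written with named aggregation, which reads a little clearer than the dict syntax; a minimal sketch, assuming pandas 0.25 or newer:
>>> df_grouped = df.groupby('year', as_index=False).agg(
...     drug_name=('drug_name', ', '.join),  # concatenate the names per year
...     avg_number_of_ingredients=('avg_number_of_ingredients', 'mean'),
... )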
The "REPAIR_YEAR" column in my dataframe consists of many different years (2018, 2019, etc).
Using pandas, I would like to define a new dataframe where all rows are filtered based on the years 2019, 2020, and 2021 from the REPAIR_YEAR column.
How do I do this?
Consider the following dataframe:
[In]: df = pd.DataFrame({"REPAIR_YEAR": [2017, 2018, 2019, 2020, 2021, 2022], "Letters": ["a", "a", "a", "a", "b", "c"]})
[Out]:
REPAIR_YEAR Letters
0 2017 a
1 2018 a
2 2019 a
3 2020 a
4 2021 b
5 2022 c
Create a condition to filter the rows where REPAIR_YEAR is equal to 2019, 2020, or 2021:
condition = (df["REPAIR_YEAR"] == 2019) | (df["REPAIR_YEAR"] == 2020) | (df["REPAIR_YEAR"] == 2021)
Then apply the condition to get a new dataframe as follows:
[In]: new_df = df[condition]
[Out]:
REPAIR_YEAR Letters
2 2019 a
3 2020 a
4 2021 b
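If you don't want to spell out each comparison, the same filter can be written with isin; a minimal sketch:
condition = df["REPAIR_YEAR"].isin([2019, 2020, 2021])  # True where the year is in the list
new_df = df[condition]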
I have a dataframe with some duplicates that I need to remove. In the dataframe below, where the month, year and type are all the same, it should keep the row with the highest sale. E.g.:
df = pd.DataFrame({'month': [1, 1, 7, 10],
'year': [2012, 2012, 2013, 2014],
'type':['C','C','S','C'],
'sale': [55, 40, 84, 31]})
After removing duplicates and keeping the highest value of column 'sale' should look like:
df_2 = pd.DataFrame({'month': [1, 7, 10],
'year': [2012, 2013, 2014],
'type':['C','S','C'],
'sale': [55, 84, 31]})
You can use:
(df.sort_values('sale',ascending=False)
.drop_duplicates(['month','year','type']).sort_index())
month year type sale
0 1 2012 C 55
2 7 2013 S 84
3 10 2014 C 31
You could groupby and take the max of sale:
df.groupby(['month', 'year', 'type']).max().reset_index()
month year type sale
0 1 2012 C 55
1 7 2013 S 84
2 10 2014 C 31
If you have another column, say other, then you must specify which column to take the max of, like this:
df.groupby(['month', 'year', 'type'])[['sale']].max().reset_index()
month year type sale
0 1 2012 C 55
1 7 2013 S 84
2 10 2014 C 31
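If you want to keep entire rows (including any other columns) rather than taking the max of each column separately, one option is to select the rows at the index of the per-group maximum; a sketch, assuming the row with the highest sale is the one you want:
df.loc[df.groupby(['month', 'year', 'type'])['sale'].idxmax()]  # full rows with the highest sale per group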
I have a pandas DataFrame with 2 columns: Year (int) and Condition (string). In column Condition I have a NaN value and I want to replace it based on information from a groupby operation.
import pandas as pd
import numpy as np
year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", 'excellent','excellent', np.nan, 'good','excellent', 'good']
X = pd.DataFrame({'year': year, 'condition': cond})
stat = X.groupby('year')['condition'].value_counts()
It gives:
print(X)
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 NaN
7 2016 good
8 2015 excellent
9 2015 good
print(stat)
year condition
2015 good 2
excellent 1
2016 good 3
excellent 1
2017 excellent 2
Since the NaN value in row 6 has year = 2015, and stat shows that the most frequent condition for 2015 is 'good', I want to replace this NaN value with 'good'.
I have tried fillna and the .transform method, but it does not work :(
I would be grateful for any help.
I did a little extra transformation to get stat as a dictionary mapping the year to its highest frequency name (credit to this answer):
In[0]:
fill_dict = stat.unstack().idxmax(axis=1).to_dict()
fill_dict
Out[0]:
{2015: 'good', 2016: 'good', 2017: 'excellent'}
Then use fillna with map based on this dictionary (credit to this answer):
In[0]:
X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
X
Out[0]:
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 good
7 2016 good
8 2015 excellent
9 2015 good
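For reference, the same fill can be done in one step with transform and the group mode, skipping the intermediate dictionary; a minimal sketch (Series.mode ignores NaN by default):
X['condition'] = X['condition'].fillna(
    X.groupby('year')['condition'].transform(lambda s: s.mode().iloc[0])
)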
I have a dataframe with columns like this:
id salary year emp_type salary1 year1 emp_type1 salary2 year2 emp_type2 .. salary9 year9 emp_type9
1 xx xx xx .. ..
2 .. ..
3
I want to pivot the columns to rows like this:
id salary year emp_type
-------------------------------------------------------------------
value of salary value of year value of emp_type
value of salary1 value of year1 value of emp_type1
.. .. ..
.. .. ..
value of salary9 value of year9 value of emp_type9
If columns are guaranteed to be in this order, you can simply create a new dataframe from the reshaped old one. Drop the id column first so the remaining salary/year/emp_type columns reshape cleanly into triples:
new_df = pd.DataFrame(old_df.drop(columns='id').values.reshape((-1, 3)),
                      columns=['salary', 'year', 'emp_type'])
The new dataframe will not keep the old index, though.
The solution given by @Marat should work. Here I demonstrate it with 9 sets of values:
df = pd.DataFrame(['1000', 2011, 'Type1', '2000', 2012, 'Type2', '3000', 2013, 'Type3',
'4000', 2014, 'Type4', '5000', 2015, 'Type5', '6000', 2016, 'Type6',
'8000', 2018, 'Type7', '8000', 2018, 'Type8', '9000', 2019, 'Type9'])
df = pd.DataFrame(df.values.reshape(-1,3),columns=['salary', 'year', 'emp_type'])
print(df)
Output:
salary year emp_type
0 1000 2011 Type1
1 2000 2012 Type2
2 3000 2013 Type3
3 4000 2014 Type4
4 5000 2015 Type5
5 6000 2016 Type6
6 8000 2018 Type7
7 8000 2018 Type8
8 9000 2019 Type9
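Alternatively, if you want to keep the id column and not rely on column order, pd.wide_to_long can do the reshape; a sketch, assuming the columns are named exactly salary, year, emp_type, salary1, year1, emp_type1, and so on:
# Rename the unsuffixed set so every group of columns ends in a digit,
# then let wide_to_long match the stubs.
df = df.rename(columns={'salary': 'salary0', 'year': 'year0', 'emp_type': 'emp_type0'})
long_df = pd.wide_to_long(df, stubnames=['salary', 'year', 'emp_type'],
                          i='id', j='set', suffix=r'\d+').reset_index()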
I'm downloading data from FRED. I'm summing to get annual numbers, but I don't want incomplete years. So I need to sum only if the count of observations is 12, because the series is monthly.
import pandas_datareader.data as web
mnemonic = 'RSFSXMV'
df = web.DataReader(mnemonic, 'fred', 2000, 2020)
df['year'] = df.index.year
new_df = df.groupby(["year"])[mnemonic].sum().reset_index()
print(new_df)
I don't want 2019 to show up.
In your case, we use transform with nunique to make sure each year has 12 unique months; if not, we drop it before doing the groupby sum:
df['Month']=df.index.month
m=df.groupby('year').Month.transform('nunique')==12
new_df = df.loc[m].groupby(["year"])[mnemonic].sum().reset_index()
Or, with isin:
df['Month']=df.index.month
m=df.groupby('year').Month.nunique()
new_df = df.loc[df.year.isin(m.index[m == 12])].groupby(["year"])[mnemonic].sum().reset_index()
You could use the aggregate function count while doing the groupby:
df['year'] = df.index.year
df = df.groupby('year').agg({'RSFSXMV': 'sum', 'year': 'count'})
which will give you:
RSFSXMV year
year
2000 2487790 12
2001 2563218 12
2002 2641870 12
2003 2770397 12
2004 2969282 12
2005 3196141 12
2006 3397323 12
2007 3531906 12
2008 3601512 12
2009 3393753 12
2010 3541327 12
2011 3784014 12
2012 3934506 12
2013 4043037 12
2014 4191342 12
2015 4252113 12
2016 4357528 12
2017 4561833 12
2018 4810502 12
2019 2042147 5
Then simply drop the rows with a year count less than 12.
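A minimal sketch of that final filter, given that the count column produced above is named 'year':
df = df[df['year'] == 12]  # keep only years with all 12 monthly observations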