DataFrame from a list of dictionaries of dictionaries - Python

I have a list of dictionaries like this:
my_list = [
    {
        'Currency': 'USD',
        'Product': 'a',
        'Quantity': {
            'Apr 2019': 1.0,
            'Jun 2019': 7.0
        }
    },
    {
        'Currency': 'USD',
        'Product': 'b',
        'Quantity': {
            'Jan 2019': 4.0,
            'Feb 2019': 8.0
        }
    }
]
And I want a dataframe like this:
Currency  Product  Quantity  Date
'USD'     'a'      1         Apr 2019
'USD'     'a'      7         Jun 2019
'USD'     'b'      4         Jan 2019
'USD'     'b'      8         Feb 2019
Currently I am doing this:
# df is assumed to be an existing (empty) DataFrame with these columns, e.g.
# df = pd.DataFrame(columns=['Currency', 'Product', 'Quantity', 'Date'])
for element in my_list:
    currency = element.get('Currency')
    product = element.get('Product')
    dates = list(element.get('Quantity').keys())
    for date in dates:
        quantity = element.get('Quantity')[date]
        row = [currency, product, quantity, date]
        df.loc[df.shape[0]] = row
But I imagine there is a better way than looping over the list.
pd.DataFrame.from_dict(my_list)
works if there is only one value in Quantity (with a little modification using .apply).
Thanks

df_dict = [
    {**d, "Quantity": quantity, "Date": date}
    for d in my_list
    for date, quantity in d['Quantity'].items()
]
df = pd.DataFrame.from_dict(df_dict)
output:
>>> df
Currency Product Quantity Date
0 USD a 1.0 Apr 2019
1 USD a 7.0 Jun 2019
2 USD b 4.0 Jan 2019
3 USD b 8.0 Feb 2019
Explanation:
By using a double nested loop, you enumerate your list by the number of quantity/date pairs, which is what you want. Then you unpack the dictionary at the first level (using **d). This sets the correct Currency and Product values but leaves us with the "bad" Quantity value (the nested dict). That value is overwritten in the next step of the dictionary comprehension, and finally Date is set. From there, pandas simply reads each dictionary as a row.
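As a minimal standalone illustration of that overwrite behaviour (not tied to the data above): in a dict literal, a key written after **d wins over the same key coming from d.

d = {'Currency': 'USD', 'Quantity': {'Apr 2019': 1.0}}
row = {**d, 'Quantity': 1.0, 'Date': 'Apr 2019'}
print(row)  # {'Currency': 'USD', 'Quantity': 1.0, 'Date': 'Apr 2019'}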

Use json_normalize:
from pandas.io.json import json_normalize

df = json_normalize(my_list, 'Quantity', ['Currency', 'Product'])

# json_normalize puts the dict keys (the dates) into a column named 0,
# so rebuild the Quantity column by walking the original list in the same order
Quantity = []
for d in my_list:
    for month in d['Quantity']:
        Quantity.append(d['Quantity'][month])
df['Quantity'] = Quantity

df = df.rename(columns={0: 'Date'}).reindex(columns=['Currency', 'Product', 'Quantity', 'Date'])
print(df)
Currency Product Quantity Date
0 USD a 1.0 Apr 2019
1 USD a 7.0 Jun 2019
2 USD b 4.0 Jan 2019
3 USD b 8.0 Feb 2019
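Note that in pandas 1.0+ the function is exposed directly as pd.json_normalize (the pandas.io.json import is deprecated). For reference, a rough equivalent sketch that flattens the nested dict first and then melts it into long form (column order differs slightly from the above):

import pandas as pd

flat = pd.json_normalize(my_list)  # columns: Currency, Product, 'Quantity.Apr 2019', ...
df = (flat.melt(id_vars=['Currency', 'Product'], var_name='Date', value_name='Quantity')
          .dropna(subset=['Quantity']))
df['Date'] = df['Date'].str.replace('Quantity.', '', regex=False)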

You can use a double loop to process your data.
The following code
df = pd.DataFrame(
    [
        {
            'Currency': item.get('Currency'),
            'Product': item.get('Product'),
            'Date': quant_key,
            'Quantity': quant_val,
        }
        for item in my_list
        for quant_key, quant_val in item['Quantity'].items()
    ]
)
print(df)
returns this output:
Currency Product Date Quantity
0 USD a Apr 2019 1.0
1 USD a Jun 2019 7.0
2 USD b Jan 2019 4.0
3 USD b Feb 2019 8.0

Related

Pandas Groupby returning max value and other specific column

Suppose a dataframe:
import pandas as pd

df = pd.DataFrame({
    'Model': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'Year': [2019, 2020, 2021, 2018, 2019, 2020, 2021],
    'Transmission': ['Manual', 'Automatic', 'Automatic', 'Manual', 'Automatic', 'Automatic', 'Manual'],
    'EngineSize': [1.4, 2.0, 1.4, 1.5, 2.0, 1.5, 1.5],
    'MPG': [55.4, 67.3, 58.9, 52.3, 64.2, 68.9, 83.1]
})
df
and I want to return the highest MPG per year plus the model. It would look like this:
Year MPG Model
2018 52.3 D
2019 64.2 E
2020 68.9 F
2021 83.1 G
I'm thinking of using groupby, but I'm still stuck on how to show the Model column.
You could use groupby + idxmax to get the index of the max MPG of each year; then use loc to filter:
out = df.loc[df.groupby('Year')['MPG'].idxmax(), ['Year', 'MPG', 'Model']]
Output:
Year MPG Model
3 2018 52.3 D
4 2019 64.2 E
5 2020 68.9 F
6 2021 83.1 G
I like @enke's answer better. But you could use groupby apply with pd.DataFrame.nlargest:
df.groupby('Year').apply(pd.DataFrame.nlargest, n=1, columns=['MPG'])
        Model  Year Transmission  EngineSize   MPG
Year
2018 3      D  2018       Manual         1.5  52.3
2019 4      E  2019    Automatic         2.0  64.2
2020 5      F  2020    Automatic         1.5  68.9
2021 6      G  2021       Manual         1.5  83.1
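If the extra Year level that groupby adds to the index is unwanted, it can be dropped; a small variation on the same call:

df.groupby('Year').apply(pd.DataFrame.nlargest, n=1, columns=['MPG']).droplevel(0)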

Pandas dataframe mapping one value to another in a row

I have two pandas dataframes. One has the columns 'Name', 'Year' and 'Currency Rate'.
The other has a column named 'Name' and a long list of year columns (1950, 1951, ..., 2019, 2020).
The 'Year' column stores values 2000, 2001, ..., 2015, and the year columns (1950, 1951, ..., 2019, 2020) store the income for that year.
I want to merge these two dataframes, mapping the income for the year given in the 'Year' column into a new column named 'Income' and dropping all the other year columns. Is there a convenient way to do this?
What I am thinking of is splitting the second dataframe into 16 different years and then joining it to the first dataframe.
UPDATED to reflect OP's comment clarifying the question:
To restate the question based on my updated understanding:
There is a dataframe with columns Name, Year and Currency Rate (3 columns in total). Each row contains a Currency Rate in a given Year for a given Name. Each row is unique by (Name, Year) pair.
There is a second dataframe with columns Name, 1950, 1951, ..., 2020 (72 columns in total). Each cell contains an Income value for the corresponding Name for the row and the Year corresponding to the column name. Each row is unique by Name.
Question: How do we add an Income column to the first dataframe with each row containing the Income value from the second dataframe in the cell with (row, column) corresponding to the (Name, Year) pair of such row in the first dataframe?
Test case assumptions I have made:
Name in the first dataframe is a letter from 'a' to 'n' with some duplicates.
Year in the first dataframe is between 2000 and 2015 (as in the question).
Currency Rate in the first dataframe is arbitrary.
Name in the second dataframe is a letter from 'a' to 'z' (no duplicates).
The values in the second dataframe (which represent Income) are arbitrarily constructed using the ASCII offsets of the characters in the corresponding Name concatenated with the Year of the corresponding column name. This way we can visually "decode" them in the test results to confirm that the value from the correct location in the second dataframe has been loaded into the new Income column in the first dataframe.
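For example, for Name 'b' (letter offset 2) and year 2001 the stored Income is 2 * 10000 + 2001 = 22001, which can be checked against the output below.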
table1 = [
    {'Name': 'a', 'Year': 2000, 'Currency Rate': 1.1},
    {'Name': 'b', 'Year': 2001, 'Currency Rate': 1.2},
    {'Name': 'c', 'Year': 2002, 'Currency Rate': 1.3},
    {'Name': 'd', 'Year': 2003, 'Currency Rate': 1.4},
    {'Name': 'e', 'Year': 2004, 'Currency Rate': 1.5},
    {'Name': 'f', 'Year': 2005, 'Currency Rate': 1.6},
    {'Name': 'g', 'Year': 2006, 'Currency Rate': 1.7},
    {'Name': 'h', 'Year': 2007, 'Currency Rate': 1.8},
    {'Name': 'i', 'Year': 2008, 'Currency Rate': 1.9},
    {'Name': 'j', 'Year': 2009, 'Currency Rate': 1.8},
    {'Name': 'k', 'Year': 2010, 'Currency Rate': 1.7},
    {'Name': 'l', 'Year': 2011, 'Currency Rate': 1.6},
    {'Name': 'm', 'Year': 2012, 'Currency Rate': 1.5},
    {'Name': 'm', 'Year': 2013, 'Currency Rate': 1.4},
    {'Name': 'n', 'Year': 2014, 'Currency Rate': 1.3},
    {'Name': 'n', 'Year': 2015, 'Currency Rate': 1.2}
]

table2 = [
    {'Name': name} | {
        str(year): sum(ord(c) - ord('a') + 1 for c in name) * 10000 + year
        for year in range(1950, 2021)
    }
    for name in set(['x', 'y', 'z']) | set(map(lambda row: row['Name'], table1))
]

import pandas as pd

df1 = pd.DataFrame(table1)
df2 = pd.DataFrame(table2).sort_values(by='Name')
print(df1)
print(df2)

df1['Income'] = df1.apply(lambda x: int(df2[df2['Name'] == x['Name']][str(x['Year'])]), axis=1)
print(df1.to_string(index=False))
Output:
Name Year Currency Rate
0 a 2000 1.1
1 b 2001 1.2
2 c 2002 1.3
3 d 2003 1.4
4 e 2004 1.5
5 f 2005 1.6
6 g 2006 1.7
7 h 2007 1.8
8 i 2008 1.9
9 j 2009 1.8
10 k 2010 1.7
11 l 2011 1.6
12 m 2012 1.5
13 m 2013 1.4
14 n 2014 1.3
15 n 2015 1.2
Name 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 ... 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
11 a 11950 11951 11952 11953 11954 11955 11956 11957 11958 11959 11960 ... 12009 12010 12011 12012 12013 12014 12015 12016 12017 12018 12019 12020
15 b 21950 21951 21952 21953 21954 21955 21956 21957 21958 21959 21960 ... 22009 22010 22011 22012 22013 22014 22015 22016 22017 22018 22019 22020
0 c 31950 31951 31952 31953 31954 31955 31956 31957 31958 31959 31960 ... 32009 32010 32011 32012 32013 32014 32015 32016 32017 32018 32019 32020
10 d 41950 41951 41952 41953 41954 41955 41956 41957 41958 41959 41960 ... 42009 42010 42011 42012 42013 42014 42015 42016 42017 42018 42019 42020
9 e 51950 51951 51952 51953 51954 51955 51956 51957 51958 51959 51960 ... 52009 52010 52011 52012 52013 52014 52015 52016 52017 52018 52019 52020
1 f 61950 61951 61952 61953 61954 61955 61956 61957 61958 61959 61960 ... 62009 62010 62011 62012 62013 62014 62015 62016 62017 62018 62019 62020
4 g 71950 71951 71952 71953 71954 71955 71956 71957 71958 71959 71960 ... 72009 72010 72011 72012 72013 72014 72015 72016 72017 72018 72019 72020
3 h 81950 81951 81952 81953 81954 81955 81956 81957 81958 81959 81960 ... 82009 82010 82011 82012 82013 82014 82015 82016 82017 82018 82019 82020
2 i 91950 91951 91952 91953 91954 91955 91956 91957 91958 91959 91960 ... 92009 92010 92011 92012 92013 92014 92015 92016 92017 92018 92019 92020
13 j 101950 101951 101952 101953 101954 101955 101956 101957 101958 101959 101960 ... 102009 102010 102011 102012 102013 102014 102015 102016 102017 102018 102019 102020
14 k 111950 111951 111952 111953 111954 111955 111956 111957 111958 111959 111960 ... 112009 112010 112011 112012 112013 112014 112015 112016 112017 112018 112019 112020
12 l 121950 121951 121952 121953 121954 121955 121956 121957 121958 121959 121960 ... 122009 122010 122011 122012 122013 122014 122015 122016 122017 122018 122019 122020
5 m 131950 131951 131952 131953 131954 131955 131956 131957 131958 131959 131960 ... 132009 132010 132011 132012 132013 132014 132015 132016 132017 132018 132019 132020
7 n 141950 141951 141952 141953 141954 141955 141956 141957 141958 141959 141960 ... 142009 142010 142011 142012 142013 142014 142015 142016 142017 142018 142019 142020
8 x 241950 241951 241952 241953 241954 241955 241956 241957 241958 241959 241960 ... 242009 242010 242011 242012 242013 242014 242015 242016 242017 242018 242019 242020
6 y 251950 251951 251952 251953 251954 251955 251956 251957 251958 251959 251960 ... 252009 252010 252011 252012 252013 252014 252015 252016 252017 252018 252019 252020
16 z 261950 261951 261952 261953 261954 261955 261956 261957 261958 261959 261960 ... 262009 262010 262011 262012 262013 262014 262015 262016 262017 262018 262019 262020
[17 rows x 72 columns]
Name Year Currency Rate Income
a 2000 1.1 12000
b 2001 1.2 22001
c 2002 1.3 32002
d 2003 1.4 42003
e 2004 1.5 52004
f 2005 1.6 62005
g 2006 1.7 72006
h 2007 1.8 82007
i 2008 1.9 92008
j 2009 1.8 102009
k 2010 1.7 112010
l 2011 1.6 122011
m 2012 1.5 132012
m 2013 1.4 132013
n 2014 1.3 142014
n 2015 1.2 142015
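A vectorized alternative to the row-wise apply lookup above (a sketch, reusing the same df1 and df2) is to reshape the second dataframe from wide to long with melt and then merge on Name and Year:

incomes = df2.melt(id_vars='Name', var_name='Year', value_name='Income')
incomes['Year'] = incomes['Year'].astype(int)  # the year column names were strings
df1_merged = df1.merge(incomes, on=['Name', 'Year'], how='left')
print(df1_merged.to_string(index=False))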

How can I use groupby with multiple values in a column in pandas?

I have a dataframe as follows:
import pandas as pd

data = {
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
}
df = pd.DataFrame(data)
print(df)
brand model year price
0 Mercedes X 2011 NaN
1 Renault Y 2010 1000.4
2 Ford Z 2009 2000.3
3 Mercedes X 2010 1000.0
4 Mercedes X 2012 1100.3
5 Mercedes X 2020 3000.5
6 Renault Q 2011 NaN
And here is another case to test your solution:

data = {
    'brand': ['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'Mercedes'],
    'model': ['X', 'X', 'X', 'X', 'X'],
    'year': [2017, 2018, 2018, 2019, 2019],
    'price': [None, None, None, 1000.0, 1200.50]
}

Expected output:
brand model year price
0 Mercedes X 2017 NaN
1 Mercedes X 2018 1100.25
2 Mercedes X 2018 1100.25
3 Mercedes X 2019 1000.00
4 Mercedes X 2019 1200.50
I want to fill the missing values with the average of the observations containing year-1, year and year+1 and also same brand and model. For instance, Mercedes X model has a null price in 2011. When I look at the data,
2011 - 1 = 2010
2011 + 1 = 2012
The 4th observation -> Mercedes,X,2010,1000.0
The 5th observation -> Mercedes,X,2012,1100.3
The mean -> (1000.0 + 1100.3) / 2 = 1050.15
I've tried something as follows:

for c_key, _ in df.groupby(['brand', 'model', 'year']):
    fc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'].isin([c_key[2] + 1, c_key[2], c_key[2] - 1]))
    )
    sc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'] == c_key[2])
        & (df['price'].isnull())
    )
    mean_val = df[fc]['price'].mean()
    df.loc[sc, 'price'] = mean_val

print(df)
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
But this solution takes a long time for 90,000 rows and 27 columns, so is there a more efficient solution? For instance, can I use groupby over the values year-1, year, year+1, brand and model?
Thanks in advance.
I think actually a more efficient way would be to sort by Brand and then Year, and then use interpolate:
df = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))
Output:
>>> df
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 1000.40
That also handles all the columns.
Based on the solution of @richardec, but with some additions to correct the price if the next year's price is known. Not sure if it is faster than your original solution though.
import numpy as np

# Make an interpolated average
df_out = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))
# Make an average per brand/year/model
df1 = df.sort_values(['brand', 'year']).groupby(['brand', 'year', 'model']).mean().reset_index()
# Check if the next row has the same brand and model. If so, take the next average price when the price is NaN
mask1 = df1["model"] == df1["model"].shift(-1)
mask2 = df1["brand"] == df1["brand"].shift(-1)
mask3 = df1["price"].isna()
df1["priceCorr"] = np.where(mask1 & mask2 & mask3, df1["price"].shift(-1), df1["price"])
# Merge everything together
df_out = df_out.merge(df1[["brand", "year", "model", "priceCorr"]], on=["brand", "year", "model"])
df_out["price"] = np.where(df_out["price"].isna(), df_out["priceCorr"], df_out["price"])
Here goes a solution that looks simpler:
Sort values in the original dataframe:
df = df.sort_values(["brand", "model", "year"])
Group by "brand" and "model", and store the groups in a variable (to calculate only once):
groups = df.groupby(["brand", "model"])
Fill nan values using the average of the previous and next rows (Important: this assumes that you have data of consecutive years, meaning that if you're missing data for 2015 you know the values of 2014 and 2016. If you have no data for consecutive years, null values will remain null).
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
Resulting code:
df = df.sort_values(["brand", "model", "year"])
groups = df.groupby(["brand", "model"])
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
print(df)
Output:
brand model year price
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
0 Mercedes X 2011 1050.15
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
1 Renault Y 2010 1000.40
def fill_it(x):
    return df[(df.brand == df.iat[x, 0]) & (df.model == df.iat[x, 1]) & ((df.year == df.iat[x, 2] - 1) | (df.year == df.iat[x, 2] + 1))].price.mean()

df = df.apply(lambda x: x.fillna(fill_it(x.name)), axis=1)
df
Output 1:
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
Output 2:
brand model year price
0 Mercedes X 2017 NaN
1 Mercedes X 2018 1100.25
2 Mercedes X 2018 1100.25
3 Mercedes X 2019 1000.00
4 Mercedes X 2019 1200.50
This is 3x faster:
df.loc[df.price.isna(), 'price'] = df[df.price.isna()].apply(lambda x: x.fillna(fill_it(x.name)), axis=1)
I tried another approach using pd.rolling, and it is way faster (it runs in 200 ms on a dataframe with 70k rows). The outputs are still as you wanted them.
df.year = pd.to_datetime(df.year, format='%Y')
df.sort_values('year', inplace=True)
df.groupby(['brand', 'model']).apply(lambda x: x.fillna(x.rolling('1095D',on='year', center=True).mean())).sort_index()
This is not a pretty solution, but from your description, I believe it would work and be really fast. It's just a lot of ifs inside a np.where on a sorted data frame.
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
})

data = data.sort_values(by=['brand', 'model', 'year'])
data['adjusted_price'] = np.where(
    data['price'].isnull()
    & (data['brand'] == data['brand'].shift(1)) & (data['brand'] == data['brand'].shift(-1))
    & (data['model'] == data['model'].shift(1)) & (data['model'] == data['model'].shift(-1))
    & (data['year'] == (data['year'].shift(1) + 1)) & (data['year'] == (data['year'].shift(-1) - 1)),
    (data['price'].shift(1) + data['price'].shift(-1)) / 2,
    data['price'])
data['price'] = data['adjusted_price']
data = data.drop(['adjusted_price'], axis=1)

Python: increase code efficiency for a multiple-column filter

I was wondering if someone could help me find a more efficient way to run my code.
I have a dataset that contains 7 columns: country, sector, year, month, week, weekday, value.
The year column has only 3 values: 2019, 2020 and 2021.
What I have to do here is subtract the 2019 value from every value in 2020 and 2021.
But it is more complicated than that: I need to match the weekday column.
For example, I need to take the year 2020, month 1, week 1, weekday 0 (Monday) value and subtract the
year 2019, month 1, week 1, weekday 0 (Monday) value; if it can't be found, it is skipped, and so on. In other words, the weekday (Monday, Tuesday, ...) must be matched.
And here is my code. It runs, but it takes hours :(
for i in itertools.product(year_list, country_list, sector_list, month_list, week_list, weekday_list):
    try:
        data_2 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == i[0])
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        data_1 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == 2019)
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        co2.append(data_2 - data_1)
        country.append(i[1])
        sector.append(i[2])
        year.append(i[0])
        month.append(i[3])
        week.append(i[4])
        weekday.append(i[5])
    except:
        pass
I changed the for loops to itertools, but it is still not fast enough. Any other ideas?
Many thanks :)
##############################
Here is the sample dataset:
country co2 sector date week weekday year month
Brazil 108.767782 Power 2019-01-01 0 1 2019 1
China 14251.044482 Power 2019-01-01 0 1 2019 1
EU27 & UK 1886.493814 Power 2019-01-01 0 1 2019 1
France 53.856398 Power 2019-01-01 0 1 2019 1
Germany 378.323440 Power 2019-01-01 0 1 2019 1
Japan 21.898788 IA 2021-11-30 48 1 2021 11
Russia 19.773822 IA 2021-11-30 48 1 2021 11
Spain 42.293944 IA 2021-11-30 48 1 2021 11
UK 56.425121 IA 2021-11-30 48 1 2021 11
US 166.425000 IA 2021-11-30 48 1 2021 11
or this
import pandas as pd

pd.DataFrame({
    'year': [2019, 2020, 2021],
    'co2': [1, 2, 3],
    'country': ['Brazil', 'Brazil', 'Brazil'],
    'sector': ['power', 'power', 'power'],
    'month': [1, 1, 1],
    'week': [0, 0, 0],
    'weekday': [0, 0, 0]
})
pandas can subtract two dataframes index-by-index, so the idea would be to separate your data into a minuend and a subtrahend, set ['country', 'sector', 'month', 'week', 'weekday'] as their indices, simply subtract them, and remove the rows (with dropna) where a match in year 2019 is not found.
df_carbon = pd.DataFrame({
    'year': [2019, 2020, 2021],
    'co2': [1, 2, 3],
    'country': ['ab', 'ab', 'bc']
})

index = ['country']
# index = ['country', 'sector', 'month', 'week', 'weekday']

df_2019 = df_carbon[df_carbon['year'] == 2019].set_index(index)
df_rest = df_carbon[df_carbon['year'] != 2019].set_index(index)
ans = (df_rest - df_2019).reset_index().dropna()
ans['year'] += 2019
Two additional points:
In this subtraction the year is also covered, so I need to add 2019 back.
I created a small example of df_carbon to test my code. If you had provided a more realistic version in text form, I would have tested my code using your data.

How to groupby in Pandas and keep all columns [duplicate]

This question already has answers here:
Pandas DataFrame Groupby two columns and get counts
(8 answers)
Closed 3 years ago.
I have a data frame like this:
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
And I need to group drug names and mean number of ingredients by year like this:
year drug_name avg_number_of_ingredients
0 2019 drug a,b,c.. mean value for column
1 2018 drug a,b,c.. mean value for column
2 2017 drug a,b,c.. mean value for column
If I do df.groupby('year'), I lose drug names. How can I do it?
Let me show you the solution on a simple example. First, I make the same data frame as you have:
>>> df = pd.DataFrame(
...     [
...         {'year': 2019, 'drug_name': 'NEXIUM I.V.', 'avg_number_of_ingredients': 8},
...         {'year': 2016, 'drug_name': 'ZOLADEX', 'avg_number_of_ingredients': 10},
...         {'year': 2017, 'drug_name': 'PRILOSEC', 'avg_number_of_ingredients': 59},
...         {'year': 2017, 'drug_name': 'BYDUREON BCise', 'avg_number_of_ingredients': 24},
...         {'year': 2019, 'drug_name': 'Lynparza', 'avg_number_of_ingredients': 28},
...     ]
... )
>>> print(df)
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
Now, I make df_grouped, which still contains the drug name information.
>>> df_grouped = df.groupby('year', as_index=False).agg({'drug_name': ', '.join, 'avg_number_of_ingredients': 'mean'})
>>> print(df_grouped)
year drug_name avg_number_of_ingredients
0 2016 ZOLADEX 10.0
1 2017 PRILOSEC, BYDUREON BCise 41.5
2 2019 NEXIUM I.V., Lynparza 18.0
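As a small variation, the same grouping can be written with named aggregation (pandas 0.25+), which makes the output column names explicit:

df_grouped = df.groupby('year', as_index=False).agg(
    drug_name=('drug_name', ', '.join),
    avg_number_of_ingredients=('avg_number_of_ingredients', 'mean'),
)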
