import pandas as pd
df1 = pd.DataFrame([['Dog', '2017', 100], ['Dog', '2017', 500], ['Dog', '2016', 200],
                    ['Dog', '2016', 150], ['Cat', '2017', 50], ['Cat', '2017', 100],
                    ['Cat', '2016', 50]], columns=('Pet', 'Year', 'Amount'))
df1:
Pet Year Amount
Dog 2017 100
Dog 2017 500
Dog 2016 200
Dog 2016 150
Cat 2017 50
Cat 2017 100
Cat 2016 50
I would like to turn the above dataframe into the following:
df2:
Pet Year Amount
Dog 2017 600
Dog 2016 350
Cat 2017 150
Cat 2016 50
This is grouping by Pet and by Year and summing the amount between them.
Any ideas?
Use groupby with as_index=False so the result does not get a MultiIndex, and sort=False to avoid sorting the groups:
print (df1.groupby(['Pet','Year'], as_index=False, sort=False).sum())
Pet Year Amount
0 Dog 2017 600
1 Dog 2016 350
2 Cat 2017 150
3 Cat 2016 50
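A side note: on newer pandas versions, calling .sum() on the whole grouped frame can warn or fail when non-numeric columns are present, so selecting the Amount column before aggregating is safer. A minimal sketch with the same data:

```python
import pandas as pd

df1 = pd.DataFrame(
    [['Dog', '2017', 100], ['Dog', '2017', 500], ['Dog', '2016', 200],
     ['Dog', '2016', 150], ['Cat', '2017', 50], ['Cat', '2017', 100],
     ['Cat', '2016', 50]],
    columns=('Pet', 'Year', 'Amount'))

# Selecting the Amount column first keeps non-numeric columns out of the sum;
# as_index=False still returns Pet and Year as ordinary columns
out = df1.groupby(['Pet', 'Year'], as_index=False, sort=False)['Amount'].sum()
```

The result has the same four rows (Dog 2017 600, Dog 2016 350, Cat 2017 150, Cat 2016 50) in first-seen order.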
I have a pandas DataFrame that is grouped by date and category, with the sum of a cost column. For each category I'd like a running total: each day's cost should be added to the previous day's total. For example, say the category is apples, the date is 5-26-2021 and the cost is $5. The next day, 5-27-2021, costs $6, so it should show $11. Then 5-28-2021 costs $3, which added to $11 should show as $14. How can I do this? There are multiple categories besides apples. Thank you!
[Expected output was attached as a screenshot; the asker notes the sample data and output are approximate, so feel free to ask questions.]
Use groupby, then cumsum:
import pandas as pd

data = [
    [2021, 'apple', 1],
    [2022, 'apple', 2],
    [2021, 'banana', 3],
    [2022, 'cherry', 4],
    [2022, 'banana', 5],
    [2023, 'cherry', 6],
]
columns = ['date', 'category', 'cost']
df = pd.DataFrame(data, columns=columns)
>>> df
date category cost
0 2021 apple 1
1 2022 apple 2
2 2021 banana 3
3 2022 cherry 4
4 2022 banana 5
5 2023 cherry 6
df.sort_values(['category','date'], inplace=True)
df.reset_index(drop=True, inplace=True)
df['CostCsum'] = df.groupby(['category'])['cost'].cumsum()
date category cost CostCsum
0 2021 apple 1 1
1 2022 apple 2 3
2 2021 banana 3 3
3 2022 banana 5 8
4 2022 cherry 4 4
5 2023 cherry 6 10
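The row order matters here: cumsum accumulates in whatever order the rows appear, so the sort_values by date within each category is what makes the running total chronological. A condensed sketch of the steps above:

```python
import pandas as pd

data = [[2021, 'apple', 1], [2022, 'apple', 2], [2021, 'banana', 3],
        [2022, 'cherry', 4], [2022, 'banana', 5], [2023, 'cherry', 6]]
df = pd.DataFrame(data, columns=['date', 'category', 'cost'])

# Sort by date within each category first, then take the per-category
# running total; cumsum accumulates in row order
df = df.sort_values(['category', 'date']).reset_index(drop=True)
df['CostCsum'] = df.groupby('category')['cost'].cumsum()
```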
Suppose I have this dataframe:
df = pd.DataFrame({
'price': [2, 13, 24, 15, 11, 44],
'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})
price category
0 2 shirts
1 13 pants
2 24 shirts
3 15 tops
4 11 hat
5 44 tops
I want to sort values in such a way that:
Find what is the highest price of each category.
Sort categories according to highest price (in this case, in descending order: tops, shirts, pants, hat).
Sort the rows within each category by descending price.
The final dataframe would look like:
price category
0 44 tops
1 15 tops
2 24 shirts
3 2 shirts
4 13 pants
5 11 hat
I'm not a big fan of one-liners, so here's my solution:
# Add a column with the max price for each category
df = df.merge(df.groupby('category')['price'].max().rename('max_cat_price'),
              left_on='category', right_index=True)
# Sort by each category's max price first, then by price within the category;
# note sort_values returns a new frame, so assign it back
df = df.sort_values(['max_cat_price', 'price'], ascending=False)
# Drop the helper column
df.drop('max_cat_price', axis=1, inplace=True)
print(df)
price category
5 44 tops
3 15 tops
2 24 shirts
0 2 shirts
1 13 pants
4 11 hat
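As a possible alternative to the merge, groupby(...).transform('max') broadcasts each category's max price straight onto its rows, avoiding the merge/drop round-trip. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'price': [2, 13, 24, 15, 11, 44],
    'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})

# transform('max') returns a Series aligned to df's index, holding
# each row's category-wide max price
df['max_cat_price'] = df.groupby('category')['price'].transform('max')
df = (df.sort_values(['max_cat_price', 'price'], ascending=False)
        .drop(columns='max_cat_price'))
```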
You can use .groupby and .sort_values:
df.join(df.groupby("category").agg("max"), on="category", rsuffix="_r").sort_values(
["price_r", "price"], ascending=False
)
Output
price category price_r
5 44 tops 44
3 15 tops 44
2 24 shirts 24
0 2 shirts 24
1 13 pants 13
4 11 hat 11
I used get_group in a DataFrame apply to get the max price for each category:
df = pd.DataFrame({
'price': [2, 13, 24, 15, 11, 44],
'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})
grouped=df.groupby('category')
df['price_r']=df['category'].apply(lambda row: grouped.get_group(row).price.max())
df = df.sort_values(['price_r', 'price'], ascending=False)
print(df)
output
price category price_r
5 44 tops 44
3 15 tops 44
2 24 shirts 24
0 2 shirts 24
1 13 pants 13
4 11 hat 11
I've searched for an answer to the following question but haven't found one yet. I have a large dataset, like this small example:
df =
A B
1 I bought 3 apples in 2013
3 I went to the store in 2020 and got milk
1 In 2015 and 2019 I went on holiday to Spain
2 When I was 17, in 2014 I got a new car
3 I got my present in 2018 and it broke down in 2019
What I would like is to extract all year values greater than 1950 and have this as the end result:
A B C
1 I bought 3 apples in 2013 2013
3 I went to the store in 2020 and got milk 2020
1 In 2015 and 2019 I went on holiday to Spain 2015_2019
2 When I was 17, in 2014 I got a new car 2014
3 I got my present in 2018 and it broke down in 2019 2018_2019
I tried to extract values first, but didn't get further than:
df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())
But all I get are error messages (I've only started Python and working with text a few weeks ago...). Could someone help me?
With a single regex pattern (considering your comment "need the year it took place"):
In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')
In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))
In [270]: df
Out[270]:
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
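To unpack that pattern: 19(?:[6-9]\d|5[1-9]) covers 1951-1999 and [2-9]\d{3} covers 2000-9999, so 1950 and earlier are rejected. A quick sanity check on an illustrative string:

```python
import re

# First alternative: 1951-1999; second: any four-digit number from 2000 up
pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')

print(pat.findall('born 1949, moved in 1950, married 1951, retired 2013'))
# ['1951', '2013']
```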
Here's one way using str.findall and joining the matched numbers that are greater than 1950:
s = df["B"].str.findall(r'\d+')
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i) > 1950))
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
How can I work with a dictionary of dataframes please? Or, is there a better way to get an overview of my data? If I have for example:
Fruit Qty Year
Apple 2 2016
Orange 1 2017
Mango 2 2016
Apple 9 2016
Orange 8 2015
Mango 7 2016
Apple 6 2016
Orange 5 2017
Mango 4 2015
Then I am trying to find out how many in total I get per year, for example:
2015 2016 2017
Apple 0 17 0
Orange 8 0 6
Mango 4 9 0
I have written some code but it might not be useful:
import pandas as pd
# Fruit Data
df_1 = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Mango', 'Apple', 'Orange', 'Mango', 'Apple', 'Orange', 'Mango'],
    'Qty': [2, 1, 2, 9, 8, 7, 6, 5, 4],
    'Year': [2016, 2017, 2016, 2016, 2015, 2016, 2016, 2017, 2015]})
# Create a list of Fruits
Fruits = df_1.Fruit.unique()
# Break down the dataframe by Year
df_2015 = df_1[df_1['Year'] == 2015]
df_2016 = df_1[df_1['Year'] == 2016]
df_2017 = df_1[df_1['Year'] == 2017]
# Create a dataframe dictionary of Fruits
Dict_2015 = {elem : pd.DataFrame for elem in Fruits}
Dict_2016 = {elem : pd.DataFrame for elem in Fruits}
Dict_2017 = {elem : pd.DataFrame for elem in Fruits}
# Store the Qty for each Fruit x each Year
for Fruit in Dict_2015.keys():
    Dict_2015[Fruit] = df_2015[:][df_2015.Fruit == Fruit]
for Fruit in Dict_2016.keys():
    Dict_2016[Fruit] = df_2016[:][df_2016.Fruit == Fruit]
for Fruit in Dict_2017.keys():
    Dict_2017[Fruit] = df_2017[:][df_2017.Fruit == Fruit]
You can use pandas.pivot_table:
res = df_1.pivot_table(index='Fruit', columns='Year', values='Qty',
                       aggfunc='sum', fill_value=0)
print(res)
Year 2015 2016 2017
Fruit
Apple 0 17 0
Mango 4 9 0
Orange 8 0 6
For guidance on usage, see How to pivot a dataframe.
jpp has already posted an answer in the format you wanted. However, since your question seemed like you are open to other views, I thought of sharing another way. Not exactly in the format you posted but this how I usually do it.
df_2 = df_1.groupby(['Fruit', 'Year']).agg({'Qty': 'sum'}).reset_index()
This will look something like the following (note that groupby only keeps Fruit/Year combinations present in the data, so missing pairs are not filled with zeros):
    Fruit  Year  Qty
0   Apple  2016   17
1   Mango  2015    4
2   Mango  2016    9
3  Orange  2015    8
4  Orange  2017    6
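If the wide, one-column-per-year layout from the question is still wanted, the long groupby result can be reshaped with unstack; a sketch reusing the question's df_1:

```python
import pandas as pd

df_1 = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Mango', 'Apple', 'Orange', 'Mango', 'Apple', 'Orange', 'Mango'],
    'Qty': [2, 1, 2, 9, 8, 7, 6, 5, 4],
    'Year': [2016, 2017, 2016, 2016, 2015, 2016, 2016, 2017, 2015]})

# unstack moves the Year level of the grouped index into columns;
# fill_value=0 supplies zeros for Fruit/Year pairs absent from the data
wide = df_1.groupby(['Fruit', 'Year'])['Qty'].sum().unstack(fill_value=0)
```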
I've a dataframe which contains a list of tuples in one of its columns. I need to split the list tuples into corresponding columns. My dataframe df looks like as given below:-
A B
[('Apple',50),('Orange',30),('banana',10)] Winter
[('Orange',69),('WaterMelon',50)] Summer
The expected output should be:
Fruit rate B
Apple 50 Winter
Orange 30 Winter
banana 10 Winter
Orange 69 Summer
WaterMelon 50 Summer
You can use DataFrame constructor with numpy.repeat and numpy.concatenate:
import numpy as np

df1 = pd.DataFrame(np.concatenate(df.A), columns=['Fruit', 'rate'])
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
Another solution with chain.from_iterable:
from itertools import chain
df1 = (pd.DataFrame(list(chain.from_iterable(df.A)), columns=['Fruit', 'rate'])
         .reset_index(drop=True))
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
This should work:
fruits = []
rates = []
seasons = []

def create_lists(row):
    tuples = row['A']
    season = row['B']
    for t in tuples:
        fruits.append(t[0])
        rates.append(t[1])
        seasons.append(season)

df.apply(create_lists, axis=1)
new_df = pd.DataFrame({"Fruit": fruits, "Rate": rates, "B": seasons})[["Fruit", "Rate", "B"]]
output:
Fruit Rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
You can do this in a chained operation:
(
    df.apply(lambda x: [[k, v, x.B] for k, v in x.A], axis=1)
      .apply(pd.Series)
      .stack()
      .apply(pd.Series)
      .reset_index(drop=True)
      .rename(columns={0: 'Fruit', 1: 'rate', 2: 'B'})
)
Out[1036]:
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer