Including missing combination of values based on a group of grouped data


I am expanding on an earlier thread: Including missing combinations of values in a pandas groupby aggregation
In that thread, the accepted answer computes all possible combinations of the grouping variables. In this version, I'd like to compute combinations based on a group of groups.
Let's take an example.
Here's the input dataframe:
Here, one group is [Year, Quarter], i.e.
Year Quarter
2014 Q1
2015 Q2
2015 Q3
The other group is Name:
Name
Adam
Smith
Now I want to apply groupby and sum so that any missing combination of the above groups shows up as NaN.
Here's sample output:
I'd appreciate any help.
Here's sample input and output in dict format:
input=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015},
'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3'},
'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Adam', 4: 'Smith'},
'Value': {0: 2, 1: 3, 2: 4, 3: 5, 4: 5}}
output=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015, 5: 2015},
'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3', 5: 'Q3'},
'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Smith', 4: 'Smith', 5: 'Adam'},
'Value': {0: 2.0, 1: 3.0, 2: 9.0, 3: nan, 4: 5.0, 5: nan}}
Clarification:
I am looking for a method that avoids melt and cast, i.e. without switching between long and wide format.

The post you linked has the correct answer: group by and sum, unstack to expose the missing combinations, then stack with the parameter dropna=False (see the docs on stack):
df.groupby(['Year','Quarter','Name']).sum().unstack().stack(dropna=False).reset_index()
Year Quarter Name Value
0 2014 Q1 Adam 2.0
1 2014 Q1 Smith 3.0
2 2015 Q2 Adam 9.0
3 2015 Q2 Smith NaN
4 2015 Q3 Adam NaN
5 2015 Q3 Smith 5.0

Using pivot_table (you can add reset_index at the end):
df.pivot_table(index=['Year','Quarter'],columns='Name',values='Value',aggfunc='sum').stack(dropna=False)
Year Quarter Name
2014 Q1 Adam 2.0
Smith 3.0
2015 Q2 Adam 9.0
Smith NaN
Q3 Adam NaN
Smith 5.0
dtype: float64
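If the aim is to pair only the (Year, Quarter) combinations that actually occur with every Name, without reshaping between long and wide, a cross join followed by a left merge may be another option. This is a minimal sketch, assuming pandas 1.2 or later (needed for how='cross'); the variable names keys, summed, pairs, names, full, and result are illustrative and do not come from the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2014, 2014, 2015, 2015, 2015],
    'Quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q3'],
    'Name': ['Adam', 'Smith', 'Adam', 'Adam', 'Smith'],
    'Value': [2, 3, 4, 5, 5],
})

keys = ['Year', 'Quarter', 'Name']

# Sum Value for the combinations that are present in the data.
summed = df.groupby(keys, as_index=False)['Value'].sum()

# First group: the (Year, Quarter) pairs that occur; second group: the names.
pairs = df[['Year', 'Quarter']].drop_duplicates()
names = df[['Name']].drop_duplicates()

# Every observed pair combined with every name.
full = pairs.merge(names, how='cross')

# A left merge leaves NaN where a combination has no data.
result = full.merge(summed, on=keys, how='left')
print(result)
```

Because the left merge keeps the row order of full, the result lists each (Year, Quarter) pair once per name, with NaN in Value for the combinations that never occur.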

Related

selecting duplicates by condition python pandas

I have a simple dataframe that I would like to split into two according to some conditions.
Car    Year  Speed  Cond
BMW    2001  150    X
BMW    2000  150
Audi   1997  200
Audi   2000  200
Audi   2012  200    X
Fiat   2020  180
Mazda  2022  183
What I have to do is move the duplicates to another dataframe and leave only one row per car in my main dataframe.
Rows that are duplicated in the Car column should go into a separate dataframe, except those that have X in the Cond column.
In the main dataframe I would like to keep one row per car, and the row that stays should be the one that contains X in the Cond column.
I have code:
import pandas as pd
import numpy as np
cars = {'Car': {0: 'BMW', 1: 'BMW', 2: 'Audi', 3: 'Audi', 4: 'Audi', 5: 'Fiat', 6: 'Mazda'},
'Year': {0: 2001, 1: 2000, 2: 1997, 3: 2000, 4: 2012, 5: 2020, 6: 2022},
'Speed': {0: 150, 1: 150, 2: 200, 3: 200, 4: 200, 5: 180, 6: 183},
'Cond': {0: 'X', 1: np.nan, 2: 'X', 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan}}
df = pd.DataFrame.from_dict(cars)
df_duplicates = df.loc[df.duplicated(subset=['Car'], keep = False)].loc[df['Cond']!='X']
I don't know how to leave the main dataframe with only one row per car, namely the one that contains X in the Cond column.
Maybe there is a single command that both removes the rows and selects them into another dataframe according to the rules above?

If I understand the desired logic correctly, you can use groupby.idxmax to select the first row with X per group, if any (else the first row of the group), to keep in the main DataFrame. The rest go in the other DataFrame (df2).
# get the index of the row with X if any, else of the first row, per group
# (sort=False keeps the groups in order of appearance, as in the output shown)
keep = df['Cond'].eq('X').groupby(df['Car'], sort=False).idxmax()
# drop selected rows
df2 = df.drop(keep)
# keep selected rows
df = df.loc[keep]
Output:
# updated df (main DataFrame)
Car Year Speed Cond
0 BMW 2001 150 X
2 Audi 1997 200 X
5 Fiat 2020 180 NaN
6 Mazda 2022 183 NaN
# df2
Car Year Speed Cond
1 BMW 2000 150 NaN
3 Audi 2000 200 NaN
4 Audi 2012 200 NaN
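An alternative that avoids groupby is to put the rows with X first using a stable sort and then let drop_duplicates keep the first row per car. This is only a sketch of that different technique, not the method from the answer above; the names no_x, order, main, and others are made up for illustration, and the data is the dict from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Car': ['BMW', 'BMW', 'Audi', 'Audi', 'Audi', 'Fiat', 'Mazda'],
    'Year': [2001, 2000, 1997, 2000, 2012, 2020, 2022],
    'Speed': [150, 150, 200, 200, 200, 180, 183],
    'Cond': ['X', np.nan, 'X', np.nan, np.nan, np.nan, np.nan],
})

# False where Cond is X, True otherwise (NaN counts as "not X").
no_x = ~df['Cond'].eq('X')

# Stable ascending sort: rows with X come first, ties keep their original order.
order = no_x.sort_values(kind='stable').index

# Keep the first row per car in that order, then restore the original row order.
main = df.loc[order].drop_duplicates(subset='Car').sort_index()

# Everything that was not kept goes to the second DataFrame.
others = df.drop(main.index)

print(main)
print(others)
```

If a car has several rows with X, this sketch keeps only the first of them in main and sends the rest to others, which may or may not be what is wanted.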

drop.na() not working on dataframe with Nan values?

I have a data frame with Nan values. For some reason, df.dropna() doesn't work when I try to drop these rows. Any thoughts?
Example of a row:
30754 22 Nan Nan Nan Nan Nan Nan Jewellery-Women N
df = pd.read_csv('/Users/xxx/Desktop/CS 677/Homework_4/FashionDataset.csv')
df.dropna()
df.head().to_dict()
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'BrandName': {0: 'life',
1: 'only',
2: 'fratini',
3: 'zink london',
4: 'life'},
'Deatils': {0: 'solid cotton blend collar neck womens a-line dress - indigo',
1: 'polyester peter pan collar womens blouson dress - yellow',
2: 'solid polyester blend wide neck womens regular top - off white',
3: 'stripes polyester sweetheart neck womens dress - black',
4: 'regular fit regular length denim womens jeans - stone'},
'Sizes': {0: 'Size:Large,Medium,Small,X-Large,X-Small',
1: 'Size:34,36,38,40',
2: 'Size:Large,X-Large,XX-Large',
3: 'Size:Large,Medium,Small,X-Large',
4: 'Size:26,28,30,32,34,36'},
'MRP': {0: 'Rs\n1699',
1: 'Rs\n3499',
2: 'Rs\n1199',
3: 'Rs\n2299',
4: 'Rs\n1699'},
'SellPrice': {0: '849', 1: '2449', 2: '599', 3: '1379', 4: '849'},
'Discount': {0: '50% off',
1: '30% off',
2: '50% off',
3: '40% off',
4: '50% off'},
'Category': {0: 'Westernwear-Women',
1: 'Westernwear-Women',
2: 'Westernwear-Women',
3: 'Westernwear-Women',
4: 'Westernwear-Women'}}
This is what I get when using df.head().to_dict()

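dropna() returns a new DataFrame rather than modifying df in place, so assign the result. Try this:
import numpy as np
import pandas as pd
df = pd.DataFrame({"col1":[12,20,np.nan,np.nan],
"col2":[10,np.nan,np.nan,40]})
df1 = df.dropna()
# df: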
col1 col2
0 12.0 10.0
1 20.0 NaN
2 NaN NaN
3 NaN 40.0
# df1:
col1 col2
0 12.0 10.0
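The example row in the question shows the text Nan. If those cells hold the literal string "Nan" rather than real missing values, dropna() will not treat them as missing even when the result is assigned. The sketch below assumes exactly that situation; the tiny DataFrame is made up for illustration and is not the asker's file.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the middle row contains the string "Nan", not a real NaN.
df = pd.DataFrame({
    'BrandName': ['life', 'Nan', 'only'],
    'SellPrice': ['849', 'Nan', '2449'],
})

# Nothing is dropped, because "Nan" is an ordinary string.
unchanged = df.dropna()

# Convert the placeholder strings to real missing values, then drop.
cleaned = df.replace('Nan', np.nan).dropna()

print(len(unchanged), len(cleaned))
```

When reading from CSV, passing na_values=['Nan'] to pd.read_csv should have a similar effect, since it tells the parser to treat that string as missing.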

running discounted price in pandas data frame

This is my first post, so pardon any missing information. I have a dataset like the one below.
Dataset:-
My expected final output should look like this:
Final Output:-
Basically, I would like to iterate over the discounted price and carry the last discounted price over to the next year.
For example, in 2019 the NYC budget is $10,000 and the discount is 0.05, so the discounted price is $9,500. The next year the discount is 0.64, applied to $9,500, which gives $3,420; in 2021 the discount is 0.04, which gives $3,283 as the final discounted price for NYC.
I need to code this in Python using a pandas data frame. I think I need to write a for loop with an if inside it, but I am struggling so far.
I'd really appreciate any help.

You could use groupby and apply a function to obtain the combined discount factor over all the years for a particular city. Then use first on the grouped Budget column to get the first budget in each group. This works because groupby preserves the order of rows within each group, which is documented behavior.
import pandas as pd
data = {'Year': {0: 2019, 1: 2020, 2: 2021, 3: 2019, 4: 2020, 5: 2019},
'City': {0: 'NYC', 1: 'NYC', 2: 'NYC', 3: 'Edison', 4: 'Edison', 5: 'Princeton'},
'Budget': {0: 10000, 1: 10000, 2: 10000, 3: 5000, 4: 5000, 5: 2000},
'Discount': {0: 0.05, 1: 0.64, 2: 0.04, 3: 0.35, 4: 0.06, 5: 0.45}}
df = pd.DataFrame(data)
g = df.groupby("City")
disc_prod = g['Discount'].apply(lambda x: (1-x).prod())
budget = g['Budget'].first()
result = disc_prod * budget
print(result)
City
Edison 3055.0
NYC 3283.2
Princeton 1100.0
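The result above gives only the final price per city. If the running discounted price for every year is wanted, as the question describes, a grouped cumulative product is one possible way to get it. This is a sketch using the same data; the names factor and Discounted are chosen here for illustration, and it assumes the rows of each city are already in year order.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2019, 2020, 2021, 2019, 2020, 2019],
    'City': ['NYC', 'NYC', 'NYC', 'Edison', 'Edison', 'Princeton'],
    'Budget': [10000, 10000, 10000, 5000, 5000, 2000],
    'Discount': [0.05, 0.64, 0.04, 0.35, 0.06, 0.45],
})

# Running product of (1 - discount) within each city, row by row.
factor = (1 - df['Discount']).groupby(df['City']).cumprod()

# Apply the running factor to the budget; round to cents to hide float noise.
df['Discounted'] = (df['Budget'] * factor).round(2)

print(df)
```

For NYC this reproduces the steps in the question: 9,500 in 2019, 3,420 in 2020, and 3,283.2 in 2021.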

How can I create a stacked bar chart in matplotlib where the stacks vary from bar to bar?

So I have a pandas DataFrame that looks something like this:
year country total
0 2010 USA 10
1 2010 CHIN 12
2 2011 USA 8
3 2011 JAPN 12
4 2012 KORR 7
5 2012 USA 10
6 2013 CHIN 9
7 2013 USA 13
I'd like to create a stacked bar chart in matplotlib, where there is one bar for each year and stacks for the two countries in that year with height based on the total column. The color should be based on the country and be represented in the legend.
I can't seem to figure out how to make this happen. I think I could do it using for loops to go through each year and each country, then construct the bar with the color corresponding to values in a dictionary. However, this will create individual legend entries for each individual bar such that there are 8 total values in the legend. This is also a horribly inefficient way to graph in matplotlib as far as I can tell.
Can anyone give some pointers?

You need to reshape your df first. It can be done as follows:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'year': {0: 2010, 1: 2010, 2: 2011, 3: 2011, 4: 2012, 5: 2012, 6: 2013, 7: 2013},
'country': {0: 'USA', 1: 'CHIN', 2: 'USA', 3: 'JAPN', 4: 'KORR', 5: 'USA', 6: 'CHIN', 7: 'USA'},
'total': {0: 10, 1: 12, 2: 8, 3: 12, 4: 7, 5: 10, 6: 9, 7: 13}})
df2 = df.groupby(['year',"country"])['total'].sum().unstack("country")
print(df2)
# output:
country CHIN JAPN KORR USA
year
2010 12.0 NaN NaN 10.0
2011 NaN 12.0 NaN 8.0
2012 NaN NaN 7.0 10.0
2013 9.0 NaN NaN 13.0
#
ax = df2.plot(kind='bar', stacked=True)
plt.show()
Result:

Aggregate a bunch of different data in a single groupby with multiple columns

I have a large dataframe of data in Pandas (let's say of courses at a university) looking like:
ID name credits enrolled ugrad/grad year semester
1 Math 4 62 ugrad 2016 Fall
2 History 3 15 ugrad 2016 Spring
3 Adv Math 3 8 grad 2017 Fall
...
and I want to group it by year and semester, and then get a bunch of different aggregate data on it, but all at one time if I can. For example, I want a total count of courses, count of only undergraduate courses, and sum of enrollment for a given semester. I can do each of these individually using value_counts, but I'd like to get an output such as:
year semester count count_ugrad total_enroll
2016 Fall # # #
Spring # # #
2017 Fall # # #
Spring # # #
...
Is this possible?

Here I added a new course (Python) and provide the data as a dict to load into a dataframe.
The solution combines the agg() method on a groupby, where the aggregations are provided in a dictionary, with a custom aggregation function for your ugrad requirement:
import pandas as pd
def my_custom_ugrad_aggregator(arr):
    # count the entries equal to 'ugrad'
    return sum(arr == 'ugrad')
# named data rather than dict to avoid shadowing the built-in
data = {'name': {0: 'Math', 1: 'History', 2: 'Adv Math', 3: 'Python'}, 'year': {0: 2016, 1: 2016, 2: 2017, 3: 2017}, 'credits': {0: 4, 1: 3, 2: 3, 3: 4}, 'semester': {0: 'Fall', 1: 'Spring', 2: 'Fall', 3: 'Spring'}, 'ugrad/grad': {0: 'ugrad', 1: 'ugrad', 2: 'grad', 3: 'ugrad'}, 'enrolled': {0: 62, 1: 15, 2: 8, 3: 8}, 'ID': {0: 1, 1: 2, 2: 3, 3: 4}}
df = pd.DataFrame(data)
ID credits enrolled name semester ugrad/grad year
0 1 4 62 Math Fall ugrad 2016
1 2 3 15 History Spring ugrad 2016
2 3 3 8 Adv Math Fall grad 2017
3 4 4 8 Python Spring ugrad 2017
print(df.groupby(['year','semester']).agg({'name':['count'],'enrolled':['sum'],'ugrad/grad':my_custom_ugrad_aggregator}))
gives:
name ugrad/grad enrolled
count my_custom_ugrad_aggregator sum
year semester
2016 Fall 1 1 62
Spring 1 1 15
2017 Fall 1 0 8
Spring 1 1 8

Use agg with dictionary on how to rollup/aggregate each column:
df_out = df.groupby(['year','semester'])[['enrolled','ugrad/grad']]\
.agg({'ugrad/grad':lambda x: (x=='ugrad').sum(),'enrolled':['sum','size']})\
.set_axis(['Ugrad Count','Total Enrolled','Count Courses'], axis=1)
df_out
Output:
Ugrad Count Total Enrolled Count Courses
year semester
2016 Fall 1 62 1
Spring 1 15 1
2017 Fall 0 8 1
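The same three numbers can also be produced with named aggregation, which names the output columns directly and avoids the set_axis step. This is a sketch assuming pandas 0.25 or later; the helper column is_ugrad and the output names count, count_ugrad, and total_enroll (taken from the question's desired header) are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'name': ['Math', 'History', 'Adv Math'],
    'credits': [4, 3, 3],
    'enrolled': [62, 15, 8],
    'ugrad/grad': ['ugrad', 'ugrad', 'grad'],
    'year': [2016, 2016, 2017],
    'semester': ['Fall', 'Spring', 'Fall'],
})

out = (
    # Boolean helper column, so undergraduate courses can be counted with a sum.
    df.assign(is_ugrad=df['ugrad/grad'].eq('ugrad'))
      .groupby(['year', 'semester'])
      .agg(
          count=('name', 'size'),           # number of courses
          count_ugrad=('is_ugrad', 'sum'),  # number of undergraduate courses
          total_enroll=('enrolled', 'sum'), # total enrollment
      )
)
print(out)
```

Each keyword in agg becomes one output column, written as (source column, aggregation), so the column order in the result follows the order of the keywords.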
