Groupby with counts + aggregate row - python

I have the following code, which produces a DataFrame showing the average sold price per month and per year. I would like to add to this a total row per year and a total row per pid (person). Sample code and data:
import pandas as pd
from io import StringIO
s = StringIO("""pid,year,month,price
1,2017,4,2000
1,2017,4,2900
1,2018,4,2000
1,2018,4,2300
1,2018,5,2000
1,2018,5,1990
1,2018,6,2200
1,2018,6,2400
1,2018,6,2250
1,2018,7,2150
""")
df = pd.read_csv(s)
maths = {'price': 'mean'}
gb = df.groupby(['pid','year','month'])
counts = gb.size().to_frame(name='n')
out = counts.join(gb.agg(maths)).reset_index()
print(out)
Which yields:
pid year month n price
0 1 2017 4 2 2450.000000
1 1 2018 4 2 2150.000000
2 1 2018 5 2 1995.000000
3 1 2018 6 3 2283.333333
4 1 2018 7 1 2150.000000
I would like the additional per-year rows to look like:
pid year month n price
0 1 2017 all 2 2450.000000
1 1 2018 all 8 2161.250000
And then the per pid rollup to look like:
pid year month n price
0 1 all all 10 2219.000000
I'm having trouble cleanly grouping/aggregating those last two frames, where I essentially want an 'all' split for each year and month value, and then combining the three frames into one that I can write to a CSV file or a database table.

Using pd.concat
df1 = df.groupby(['pid', 'year', 'month']).price.agg(['size', 'mean']).reset_index()
df2 = df.groupby(['pid', 'year']).price.agg(['size', 'mean']).assign(month='all').reset_index()
df3 = df.groupby(['pid']).price.agg(['size', 'mean']).assign(**{'month': 'all', 'year': 'all'}).reset_index()
pd.concat([df1, df2, df3])
Out[484]:
mean month pid size year
0 2450.000000 4 1 2 2017
1 2150.000000 4 1 2 2018
2 1995.000000 5 1 2 2018
3 2283.333333 6 1 3 2018
4 2150.000000 7 1 1 2018
0 2450.000000 all 1 2 2017
1 2161.250000 all 1 8 2018
0 2219.000000 all 1 10 all
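To match the question's column layout and names before writing the result out, a small rename/reorder step works (a sketch, not part of the original answer; the CSV filename is a placeholder):
out = pd.concat([df1, df2, df3], ignore_index=True)
out = out.rename(columns={'size': 'n', 'mean': 'price'})
out = out[['pid', 'year', 'month', 'n', 'price']]  # restore column order
out.to_csv('rollup.csv', index=False)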

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 in PyCharm. I have the following DataFrame:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows so that I only keep the ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id and then filter based on set equality between the years you want and the years available for a particular id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Alternatively, build a boolean mask to filter your dataframe. For that, use pandas.DataFrame.groupby with pandas.DataFrame.transform to count the rows for each id, and keep the ids whose row count equals the number of distinct years.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep='\t')
mask = df.groupby('id')['year'].transform('count').eq(df['year'].nunique())  # rows per id must equal the number of distinct years
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
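Yet another compact option (a sketch, not from the original answers) is to compare the number of distinct years per id against the total number of distinct years:
n_years = df['year'].nunique()
keep = df.groupby('id')['year'].nunique().eq(n_years)  # boolean Series indexed by id
out = df[df['id'].map(keep)]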

Split int64 Pandas column in two

I've been given a dataset that stores dates as integers in the format 52019 for May 2019. I've put it into a pandas DataFrame, and I need to extract that date format into a month column and a year column, but I can't figure out how to do that for an int64 datatype, or how to handle the two-digit months. So I want to take something like
ID Date
1 22019
2 32019
3 52019
5 102019
and make it become
ID Month Year
1 2 2019
2 3 2019
3 5 2019
5 10 2019
What should I do?
divmod
import numpy as np

# integer division by 10000 yields the month; the remainder is the year
df['Month'], df['Year'] = np.divmod(df.Date, 10000)
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Without mutating the original DataFrame, using assign:
df.assign(**dict(zip(['Month', 'Year'], np.divmod(df.Date, 10000))))
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Using // and %
df['Month'], df['Year'] = df.Date // 10000, df.Date % 10000
df
Out[528]:
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
Use:
s = pd.to_datetime(df.pop('Date'), format='%m%Y')  # convert to datetime; pop removes the column
df['Month'], df['Year'] = s.dt.month, s.dt.year  # extract month and year
print(df)
ID Month Year
0 1 2 2019
1 2 3 2019
2 3 5 2019
3 5 10 2019
str.extract can handle the tricky part of figuring out whether the Month has 1 or 2 digits.
(df['Date'].astype(str)
   .str.extract(r'^(?P<Month>\d{1,2})(?P<Year>\d{4})$')
   .astype(int))
Month Year
0 2 2019
1 3 2019
2 5 2019
3 10 2019
You may also use string slicing if it's guaranteed your numbers have only 5 or 6 digits (if not, use str.extract above). Note that slicing yields strings, so cast back to int if you need numeric columns:
u = df['Date'].astype(str)
df['Month'], df['Year'] = u.str[:-4].astype(int), u.str[-4:].astype(int)
df
ID Date Month Year
0 1 22019 2 2019
1 2 32019 3 2019
2 3 52019 5 2019
3 5 102019 10 2019
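If you later need a real date for sorting or plotting, you can reassemble one from the two new columns (a sketch; pinning day=1 is an assumption, since the source data carries no day):
df['date'] = pd.to_datetime(dict(year=df['Year'], month=df['Month'], day=1))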

Iterating Through Pandas Dataframe to Calculate based on Conditions

For the DataFrame below, I need to create a new column 'unit_count' which is 'unit'/'count' for each year and month. However, because each year and month is not unique, for each entry, I only want to use the count for a given month from the B option.
key UID count month option unit year
0 1 100 1 A 10 2015
1 1 200 1 B 20 2015
2 1 300 2 A 30 2015
3 1 400 2 B 40 2015
Essentially, I need a function that does the following:
unit_count = df.unit / df.count
for each value of 'unit', but using only the 'count' value of option 'B' in that given 'month'.
So the end result would look like the table below, where unit_count divides the number of units by the count of option 'B' for the given month.
key UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.05
1 1 200 1 B 20 2015 0.10
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.10
Here is the code I used to create the original DataFrame:
df = pd.DataFrame({'UID': [1, 1, 1, 1],
                   'year': [2015, 2015, 2015, 2015],
                   'month': [1, 1, 2, 2],
                   'option': ['A', 'B', 'A', 'B'],
                   'unit': [10, 20, 30, 40],
                   'count': [100, 200, 300, 400]})
You can first create a column that is NaN wherever option is not B, back-fill those NaN values, and then divide.
Notice: the DataFrame has to be sorted by year, month and option first, so that the B row is the last one in each group.
# if necessary in real data
# df.sort_values(['year', 'month', 'option'], inplace=True)
df['unit_count'] = df.loc[df.option=='B', 'count']
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 NaN
1 1 200 1 B 20 2015 200.0
2 1 300 2 A 30 2015 NaN
3 1 400 2 B 40 2015 400.0
df['unit_count'] = df.unit.div(df['unit_count'].bfill())
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.050
1 1 200 1 B 20 2015 0.100
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.100
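An alternative that does not depend on row order (a sketch, not from the original answer) is to pull out the B rows and merge their counts back onto the frame:
# counts of option B, keyed by year and month
b = (df.loc[df['option'] == 'B', ['year', 'month', 'count']]
       .rename(columns={'count': 'b_count'}))
out = df.merge(b, on=['year', 'month'], how='left')
out['unit_count'] = out['unit'] / out['b_count']
out = out.drop(columns='b_count')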

How to extract a percentage column from a periodic column and the sum of the column?

I have a matrix that looks like this as a pandas DataFrame:
Store Sales year month day
0 1 5263 2015 7 31
1 1 5020 2015 7 30
2 1 4782 2015 7 29
3 2 5011 2015 7 28
4 2 6102 2015 7 27
[986159 rows x 5 columns]
After I do some transformation I get the total sales sum for each shop:
train['StoreTotalSales'] = train.groupby('Store')['Sales'].transform('sum')
But now I need to iterate through each row of train.groupby(['Store', 'day', 'month']) and divide the Sales figure of each row of the groupby by the StoreTotalSales.
I've tried the following:
train['PercentSales'] = train.groupby(['Store','day', 'month'])['Sales'].transform(lambda x: x /float(x.sum()))
But it returns all 1s for the new PercentSales column:
Store Sales year month day StoreTotalSales PercentSales
0 1 5263 2015 7 31 26178 1
1 1 5020 2015 7 30 26178 1
2 1 4782 2015 7 29 26178 1
3 2 5011 2015 7 28 12357 1
4 2 6102 2015 7 27 12357 1
But the PercentSales column should have been:
0 5263/26178
1 5020/26178
2 4782/26178
3 5011/12357
4 6102/12357
Why the complication of another groupby? If all you want is to divide the column by the group sum, you can simply do:
train['PercentSales'] = train.groupby('Store')['Sales'].transform(lambda x: x/x.sum())
Or equivalently, following your method:
train['StoreTotalSales'] = train.groupby('Store')['Sales'].transform('sum')
train['PercentSales'] = train['Sales']/train['StoreTotalSales']
Let me know if you run into additional problems.
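As a quick sanity check (not part of the original answer), the percentages within each store should sum to 1:
train.groupby('Store')['PercentSales'].sum()  # each store should be (approximately) 1.0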

Pandas/Python Modeling Time-Series, Groups with Different Inputs

I am trying to model different scenarios for groups of assets in future years. This is something I have accomplished very tediously in Excel, but I want to leverage the large database I have built with Pandas.
Example:
annual_group_cost = 0.02
df1:
year group x_count y_count value
2018 a 2 5 109000
2019 a 0 4 nan
2020 a 3 0 nan
2018 b 0 0 55000
2019 b 1 0 nan
2020 b 1 0 nan
2018 c 5 1 500000
2019 c 3 0 nan
2020 c 2 5 nan
df2:
group x_benefit y_cost individual_avg starting_value
a 0.2 0.72 1000 109000
b 0.15 0.75 20000 55000
c 0.15 0.70 20000 500000
I would like to update the values in df1, by taking the previous year's value (or starting value) and adding the x benefit, y cost, and annual cost. I am assuming this will take a function to accomplish, but I don't know of an efficient way to handle it.
The final output I would like to have is:
df1:
year group x_count y_count value
2018 a 2 5 103620
2019 a 0 4 98667.3
2020 a 3 0 97294.248
2018 b 0 0 53900
2019 b 1 0 56822
2020 b 1 0 59685.56
2018 c 5 1 495000
2019 c 3 0 497100
2020 c 2 5 420158
I achieved this by using:
starting_value-(starting_value*annual_group_cost)+(x_count*(individual_avg*x_benefit))-(y_count*(individual_avg*y_cost))
Since subsequent new values depend on previously calculated new values, this will need to involve a for loop (even if it is hidden behind the scenes, e.g. by apply):
import numpy as np

for i in range(1, len(df1)):
    if np.isnan(df1.loc[i, 'value']):
        # previous year's value, adjusted by your cost/benefit formula
        df1.loc[i, 'value'] = df1.loc[i-1, 'value']  # your logic here
You should merge the two tables together and then run the calculations on the resulting Series:
hold = df1.merge(df2, on=['group']).fillna(0)
x = hold.x_count * (hold.individual_avg * hold.x_benefit)
y = hold.y_count * (hold.individual_avg * hold.y_cost)
for year in hold.year.unique():
    start = hold.loc[hold.year == year, 'starting_value']
    hold.loc[hold.year == year, 'value'] = start - (start * annual_group_cost) + x - y
    if year != hold.year.max():
        # seed next year's starting value with this year's result
        hold.loc[hold.year == year + 1, 'starting_value'] = hold.loc[hold.year == year, 'value'].values
hold = hold.drop(['x_benefit', 'y_cost', 'individual_avg', 'starting_value'], axis=1)
hold
This will give you:
year group x_count y_count value
0 2018 a 2 5 103620.0
1 2019 a 0 4 98667.6
2 2020 a 3 0 97294.25
3 2018 b 0 0 53900.0
4 2019 b 1 0 55822.0
5 2020 b 1 0 57705.56
6 2018 c 5 1 491000.0
7 2019 c 3 0 490180.0
8 2020 c 2 5 416376.4
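Because the recurrence is linear (value_t = r * value_{t-1} + net_t, with r = 1 - annual_group_cost), it also admits a closed form that avoids the explicit year loop. The sketch below is an alternative, not the original answer; it assumes hold is freshly merged (before the loop above mutates starting_value) and sorted by group and year:
import numpy as np

r = 1 - annual_group_cost
hold['net'] = x - y  # per-row benefit minus cost, from the Series above

def closed_form(g):
    n = np.arange(1, len(g) + 1)  # years elapsed since the start
    decay = r ** n
    # start * r^n, plus each net contribution decayed by the years remaining
    contrib = np.cumsum(g['net'].to_numpy() / decay) * decay
    return pd.Series(g['starting_value'].iloc[0] * decay + contrib, index=g.index)

hold['value'] = hold.groupby('group', group_keys=False).apply(closed_form)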
