Calculating difference in multi index pandas data frame - python

I have a df:
import numpy as np
import seaborn as sns

df_test = sns.load_dataset("flights")
df_test['cat_2'] = np.random.choice(range(10), df_test.shape[0])
df_test.pivot_table(index='month',
                    columns='year',
                    values=['passengers', 'cat_2'])\
       .swaplevel(0, 1, axis=1)\
       .sort_index(axis=1, level=0)\
       .fillna(0)
I am trying to calculate, for both cat_2 and passengers, the difference between each year and the year before.
What is the best way to achieve this?
Desired output would look similar to this:
year   1949                       1950                        1951
       cat_2  passengers  % diff  cat_2  passengers  % diff   cat_2  passengers  % diff
month
Jan    6      112         0       6      115         115/112  6      90          90/115
Feb    0      118         0       6      126         126/118  6      150         150/126
Mar    2      132         0       7      141                  7      141
Apr    0      129         0       9      135                  9      135
May    5      121         0       4      125                  4      125
Jun    1      135         0       3      149                  3      149
Jul    6      148         0       5      170                  5      170
Aug    5      148         0       2      170                  2      170
Sep    1      136         0       4      158                  4      158
Oct    5      119         0       5      133                  5      133
Nov    0      104         0       1      114                  1      114
Dec    7      118         0       1      140                  1      140
I only showed the desired calculations for the passengers columns, but the same calculation can be applied to cat_2 as well.
As there is nothing to compare the first year against, I filled those values with 0.

You can select the passengers columns with DataFrame.xs, divide them by the shifted columns using DataFrame.shift, wrap the result in a new MultiIndex level and append it to the original DataFrame:
import numpy as np
import pandas as pd
import seaborn as sns

df_test = sns.load_dataset("flights")
df_test['cat_2'] = np.random.choice(range(10), df_test.shape[0])
df = df_test.pivot_table(index='month',
                         columns='year',
                         values=['passengers', 'cat_2'])\
            .swaplevel(0, 1, axis=1)\
            .sort_index(axis=1, level=0)\
            .fillna(0)

# ratio of each year's passengers to the previous year's passengers
df1 = df.xs('passengers', axis=1, level=1)
df2 = pd.concat({'% diff': df1.div(df1.shift(axis=1)).fillna(0)}, axis=1)

out = (pd.concat([df, df2.swaplevel(axis=1)], axis=1)
         .sort_index(axis=1, level=0, sort_remaining=False))
Another idea is to shift the values by adding 1 to the year in the column labels; then it is only necessary to remove the extra last column with iloc:
df1 = df.xs('passengers', axis=1, level=1)
# renaming the columns to year + 1 aligns each year with the previous year's values
df2 = (pd.concat({'% diff': df1.div(df1.rename(columns=lambda x: x + 1))}, axis=1)
         .fillna(0)
         .iloc[:, :-1])

out = (pd.concat([df, df2.swaplevel(axis=1)], axis=1)
         .sort_index(axis=1, level=0, sort_remaining=False))
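A quick sanity check on out from either snippet above (a minimal sketch; it assumes the year level holds plain integers, as in the flights dataset):
# pull all '% diff' columns back out of the combined frame
ratios = out.xs('% diff', axis=1, level=1)

# the 1950 ratio should equal passengers 1950 / passengers 1949
check = out[(1950, 'passengers')] / out[(1949, 'passengers')]
print(ratios[1950].equals(check))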


Choosing values with df.quantile() for separate years and months

I have a large data set and I want to add values to a column based on the highest values in another column in my data set.
Easy enough: I can just use df.quantile() and access the appropriate values.
However, I want to check for each month in each year.
I solved it for years only, see the code below.
I'm sure I could do it for months with nested for loops, but I'd rather avoid that if I can.
I guess the most pythonic way would be to not loop at all and use pandas in a smarter way.
Any suggestions?
Sample code:
import numpy as np
import pandas as pd

df = pd.read_excel(file)
df.index = df['date']
df = df.drop('date', axis=1)
df['new'] = 0

year = (2016, 2017, 2018, 2019, 2020)
for i in year:
    df['new'].loc[str(i)] = np.where(df['cost'].loc[str(i)] < df['cost'].loc[str(i)].quantile(0.5), 0, 1)
print(df)
Sample input
file =
cost
date
2016-11-01 30
2016-12-01 29
2017-11-01 40
2017-12-01 45
2018-11-30 240
2018-12-01 200
2019-11-30 220
2019-12-30 180
2020-11-30 150
2020-12-30 130
Output
cost new
date
2016-11-01 30 1
2016-12-01 29 0
2017-11-01 40 0
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 0
2019-11-30 220 1
2019-12-30 180 0
2020-11-30 150 1
2020-12-30 130 0
Desired output (if quantile works like that on single values, but as an example)
cost new
date
2016-11-01 30 1
2016-12-01 29 1
2017-11-01 40 1
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 1
2019-11-30 220 1
2019-12-30 180 1
2020-11-30 150 1
2020-12-30 130 1
Thank you _/_
An interesting question; it took me a little while to work out a solution!
import pandas as pd

df = pd.DataFrame(data={"cost": [30, 29, 40, 45, 240, 200, 220, 180, 150, 130],
                        "date": ["2016-11-01", "2016-12-01", "2017-11-01",
                                 "2017-12-01", "2018-11-30", "2018-12-01",
                                 "2019-11-30", "2019-12-30", "2020-11-30",
                                 "2020-12-30"]})
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
df["new"] = df.groupby([lambda x: x.year, lambda x: x.month]).transform(lambda x: (x >= x.quantile(0.5))*1)
#Out:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 1
#2017-11-01 40 1
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 1
#2019-11-30 220 1
#2019-12-30 180 1
#2020-11-30 150 1
#2020-12-30 130 1
What the important line does:
- Groups by the index year and month
- For each item in the group, calculates whether it is greater than or equal to the 0.5 quantile (as a bool)
- Multiplying by 1 converts the bool to an integer (1/0) instead of True/False
The initial creation of the dataframe should be equivalent to your df = pd.read_excel(file).
Leaving out the , lambda x: x.month part of the groupby (grouping by year only) gives the same output as your current code:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 0
#2017-11-01 40 0
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 0
#2019-11-30 220 1
#2019-12-30 180 0
#2020-11-30 150 1
#2020-12-30 130 0
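A slightly more explicit variant (my own sketch, not part of the answer above) groups on the DatetimeIndex accessors directly, which avoids the lambdas:
# same grouping, expressed via the index's year/month attributes
df["new"] = (df.groupby([df.index.year, df.index.month])["cost"]
               .transform(lambda x: (x >= x.quantile(0.5)) * 1))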

Create Columns for Count for Each Variable Pandas

After using value_counts and some other data cleaning, I have my data in the form:
year city category count_per_city
2005 NYC 1 145
2007 ATL 1 75
2005 NYC 2 55
2006 LA 3 40
I'd like to convert it to this:
year city 1 2 3 total
2005 NYC 145 55 0 200
2006 LA 0 0 40 40
2007 ATL 75 0 0 75
I feel like there is a relatively simple way to do this that I'm missing.
You can use pivot_table() with margins and fill_value:
out = df.pivot_table(
    index=['year', 'city'],
    columns='category',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='total'
).drop('total')
# count_per_city
# category 1 2 3 total
# year city
# 2005 NYC 145 55 0 200
# 2006 LA 0 0 40 40
# 2007 ATL 75 0 0 75
If you want the exact output from the OP, you can do some cleanup (thanks to @HenryEcker):
out.droplevel(0, axis=1).rename_axis(columns=None).reset_index()
# year city 1 2 3 total
# 0 2005 NYC 145 55 0 200
# 1 2006 LA 0 0 40 40
# 2 2007 ATL 75 0 0 75
Another solution using unstack:
(
df.set_index(['year', 'city', 'category']).unstack(2)
.droplevel(0, axis=1)
.assign(Total =lambda x: x.fillna(0).apply(sum, axis=1))
.reset_index()
.rename_axis(columns='')
)
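For reference, a minimal frame to try either solution against (my own reconstruction of the sample data in the question):
import pandas as pd

df = pd.DataFrame({'year': [2005, 2007, 2005, 2006],
                   'city': ['NYC', 'ATL', 'NYC', 'LA'],
                   'category': [1, 1, 2, 3],
                   'count_per_city': [145, 75, 55, 40]})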

groupby cumsum sorted dataframe

I want to group a dataframe by one column, then apply a cumsum over the other column, ordered by the first column in descending order.
df1:
id PRICE DEMAND
0 120 10
1 232 2
2 120 3
3 232 8
4 323 5
5 323 6
6 323 2
df2:
id PRICE DEMAND
0 323 13
1 232 23
2 120 36
I do it in two instructions, but I have a feeling it can be done in a single statement:
data = data.groupby('PRICE',as_index=False).agg({'DEMAND': 'sum'}).sort_values(by='PRICE', ascending=False)
data['DEMAND'] = data['DEMAND'].cumsum()
What you have seems perfectly fine to me. But if you want to chain everything together, first sort, then groupby with sort=False so the order isn't changed. Then you can sum within each group and cumsum the resulting Series:
(df.sort_values('PRICE', ascending=False)
.groupby('PRICE', sort=False)['DEMAND'].sum()
.cumsum()
.reset_index())
PRICE DEMAND
0 323 13
1 232 23
2 120 36
Another option would be to sort then cumsum and then drop_duplicates:
(df.sort_values('PRICE', ascending=False)
.set_index('PRICE')
.DEMAND.cumsum()
.reset_index()
.drop_duplicates('PRICE', keep='last'))
PRICE DEMAND
2 323 13
4 232 23
6 120 36
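To reproduce either chain quickly (my own reconstruction of df1 from the question):
import pandas as pd

df = pd.DataFrame({'PRICE': [120, 232, 120, 232, 323, 323, 323],
                   'DEMAND': [10, 2, 3, 8, 5, 6, 2]})

out = (df.sort_values('PRICE', ascending=False)
         .groupby('PRICE', sort=False)['DEMAND'].sum()
         .cumsum()
         .reset_index())
print(out)
#    PRICE  DEMAND
# 0    323      13
# 1    232      23
# 2    120      36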

How to merge 2 df based on comparison of 2 columns to match 1 column

How do I .merge 2 dataframes so that 1 column matches against 2 columns?
The goal is to merge the 2 dataframes to get, for every campaign id in a reference table, the count of records in the data table.
The issue is that .merge just compares 1 column with 1 column.
The data is messed up, and for some rows there are id names rather than ids.
It works if I want to merge 1 column to 1 column, or 2 columns to 2 columns, but NOT 1 column to 2 columns.
Reference table
g_spend =
campaignid id_name cost
154 campaign1 15
155 campaign2 12
1566 campaign33 12
158 campaign4 33
Data
cw =
campaignid
154
154
155
campaign1
campaign33
1566
158
campaign1
campaign1
campaign33
campaign4
Desired output
g_spend =
campaignid id_name cost leads
154 campaign1 15 5
155 campaign2 12 0
1566 campaign33 12 3
158 campaign4 33 2
What I've done so far:
# Only works for one column
cw.head()
grouped_cw = cw.groupby(["campaignid"]).count()
grouped_cw.rename(columns={'reach':'leads'}, inplace=True)
grouped_cw = pd.DataFrame(grouped_cw)
# now merging
g_spend.campaignid = g_spend.campaignid.astype(str)
g_spend = g_spend.merge(grouped_cw, left_on='campaignid', right_index=True)
I would first set id_name as index in g_spend, then do a replace on cw, followed by a value_counts:
s = (cw.campaignid
       .replace(g_spend.set_index('id_name').campaignid)
       .value_counts()
       .to_frame('leads'))

g_spend = g_spend.merge(s, left_on='campaignid', right_index=True)
Output:
campaignid id_name cost leads
0 154 campaign1 15 5
1 155 campaign2 12 1
2 1566 campaign33 12 3
3 158 campaign4 33 2
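To reproduce this end to end, here is my own reconstruction of the two frames from the question (campaignid kept as strings so ids and id names can share one column):
import pandas as pd

g_spend = pd.DataFrame({'campaignid': ['154', '155', '1566', '158'],
                        'id_name': ['campaign1', 'campaign2', 'campaign33', 'campaign4'],
                        'cost': [15, 12, 12, 33]})

cw = pd.DataFrame({'campaignid': ['154', '154', '155', 'campaign1', 'campaign33',
                                  '1566', '158', 'campaign1', 'campaign1',
                                  'campaign33', 'campaign4']})

# replace id names with ids, count records per id, then merge the counts in
s = (cw.campaignid
       .replace(g_spend.set_index('id_name').campaignid)
       .value_counts()
       .to_frame('leads'))
out = g_spend.merge(s, left_on='campaignid', right_index=True)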

Python pandas group by two columns

I have a pandas dataframe:
code type
index
312 11 21
312 11 41
312 11 21
313 23 22
313 11 21
... ...
So I need to count the pairs from the 'code' and 'type' columns for each index item:
11_21 11_41 23_22
index
312 2 1 0
313 1 0 1
... ...
How do I implement this with python and pandas?
Here's one way: use pd.crosstab and then rename the column names using the level information.
In [136]: dff = pd.crosstab(df['index'], [df['code'], df['type']])
In [137]: dff
Out[137]:
code 11 23
type 21 41 22
index
312 2 1 0
313 1 0 1
In [138]: dff.columns = ['%s_%s' % c for c in dff.columns]
In [139]: dff
Out[139]:
11_21 11_41 23_22
index
312 2 1 0
313 1 0 1
Alternatively, less elegantly, create another column and use crosstab.
In [140]: df['ct'] = df.code.astype(str) + '_' + df.type.astype(str)
In [141]: df
Out[141]:
index code type ct
0 312 11 21 11_21
1 312 11 41 11_41
2 312 11 21 11_21
3 313 23 22 23_22
4 313 11 21 11_21
In [142]: pd.crosstab(df['index'], df['ct'])
Out[142]:
ct 11_21 11_41 23_22
index
312 2 1 0
313 1 0 1
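For completeness, a groupby-based sketch of the same idea (my own alternative, not from the answer above), reconstructing the sample data with 'index' as a regular column:
import pandas as pd

df = pd.DataFrame({'index': [312, 312, 312, 313, 313],
                   'code': [11, 11, 11, 23, 11],
                   'type': [21, 41, 21, 22, 21]})

# count each (code, type) pair per index value and spread the pairs into columns
dff = (df.groupby(['index', 'code', 'type']).size()
         .unstack(['code', 'type'], fill_value=0))
dff.columns = ['%s_%s' % c for c in dff.columns]
print(dff)
#        11_21  11_41  23_22
# index
# 312        2      1      0
# 313        1      0      1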
