I have this pandas dataframe, which is a single-year snapshot:

import pandas as pd

data = pd.DataFrame({'ID': (1, 2),
                     'area': (2, 3),
                     'population': (100, 200),
                     'demand': (100, 200)})
I want to make this into a time series where population grows by 10% per year and demand grows by 20% per year; that is, the year-n population is population * 1.1**(n - 1), and likewise for demand with 1.2. In this example I do this for two extra years.
This should be the output (note: it includes an added 'year' column):
output = pd.DataFrame({'ID': (1, 2, 1, 2, 1, 2),
                       'year': (1, 1, 2, 2, 3, 3),
                       'area': (2, 3, 2, 3, 2, 3),
                       'population': (100, 200, 110, 220, 121, 242),
                       'demand': (100, 200, 120, 240, 144, 288)})
Setup variables:
k = 5     # number of years to forecast
a = 1.20  # demand growth
b = 1.10  # population growth
Forecast dataframe:
df_out = (data[['ID', 'area']]
          .merge(pd.concat([data[['demand', 'population']]
                            .mul([pow(a, i), pow(b, i)])
                            .assign(year=i + 1)
                            for i in range(k)]),
                 left_index=True, right_index=True)
          .sort_values(by='year'))
print(df_out)
Output:
ID area demand population year
0 1 2 100.00 100.00 1
1 2 3 200.00 200.00 1
0 1 2 120.00 110.00 2
1 2 3 240.00 220.00 2
0 1 2 144.00 121.00 3
1 2 3 288.00 242.00 3
0 1 2 172.80 133.10 4
1 2 3 345.60 266.20 4
0 1 2 207.36 146.41 5
1 2 3 414.72 292.82 5
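As a small optional tweak (not part of the answer above), you can match the expected output's column order and get a clean RangeIndex with:

df_out = df_out[['ID', 'year', 'area', 'population', 'demand']].reset_index(drop=True)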
The idea:

- create a numpy array with [1.2, 1.1] that I repeat and cumprod
- prepend a row of ones [1.0, 1.0] to account for the initial condition
- multiply by the values of a conveniently stacked pd.Series
- feed the result into the pd.DataFrame constructor
- clean up indices and what not
import numpy as np
import pandas as pd

k = 5
cols = ['ID', 'area']

# per-year growth factors laid out as [demand, demand, population, population];
# note: this relies on demand sorting before population in `s` below, which held
# in older pandas where dict-built frames got alphabetical column order
cum_ret = np.vstack(
    [np.ones((1, 2)),
     np.array([[1.2, 1.1]])[[0] * k].cumprod(0)])[:, [0, 0, 1, 1]]

s = data.set_index(cols).unstack(cols)

pd.DataFrame(
    cum_ret * s.values,
    columns=s.index
).stack(cols).reset_index(cols).reset_index(drop=True)
ID area demand population
0 1 2 100.000 100.000
1 2 3 200.000 200.000
2 1 2 120.000 110.000
3 2 3 240.000 220.000
4 1 2 144.000 121.000
5 2 3 288.000 242.000
6 1 2 172.800 133.100
7 2 3 345.600 266.200
8 1 2 207.360 146.410
9 2 3 414.720 292.820
10 1 2 248.832 161.051
11 2 3 497.664 322.102
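On newer pandas, where the DataFrame constructor keeps dict insertion order instead of sorting columns alphabetically, a plainer sketch of the same forecast (same growth rates, years 1 through k+1) sidesteps the column-order subtlety entirely:

import pandas as pd

k = 5
frames = [data.assign(year=i + 1,
                      population=data['population'] * 1.10 ** i,
                      demand=data['demand'] * 1.20 ** i)
          for i in range(k + 1)]
df_out = pd.concat(frames, ignore_index=True)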
I have two dataframes with common columns. I would like to create a new column that contains the difference between two columns (one from each dataframe) based on a condition from a third column.
df_a:
Time Volume ID
1 5 1
2 6 2
3 7 3
df_b:
Time Volume ID
1 2 2
2 3 1
3 4 3
The desired output appends a new column to df_a with the difference between the Volume columns (df_a.Volume - df_b.Volume) where the two IDs are equal.
df_a:
Time Volume ID Diff
1 5 1 2
2 6 2 4
3 7 3 3
If ID is unique per row in each dataframe:
df_a['Diff'] = df_a['Volume'] - df_a['ID'].map(df_b.set_index('ID')['Volume'])
Output:
Time Volume ID Diff
0 1 5 1 2
1 2 6 2 4
2 3 7 3 3
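To spell out what the map is doing, here is the same one-liner unpacked, with my own intermediate names:

# lookup table from df_b: index is ID, values are Volume (2 -> 2, 1 -> 3, 3 -> 4)
lookup = df_b.set_index('ID')['Volume']
# fetch df_b's Volume for each ID in df_a, aligned to df_a's rows: 3, 2, 4
matched = df_a['ID'].map(lookup)
df_a['Diff'] = df_a['Volume'] - matched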
An option is to merge the two dfs on ID and then calculate Diff:
df_a = df_a.merge(df_b.drop(['Time'], axis=1), on="ID", suffixes=['', '2'])
df_a['Diff'] = df_a['Volume'] - df_a['Volume2']
df_a:
Time Volume ID Volume2 Diff
0 1 5 1 3 2
1 2 6 2 2 4
2 3 7 3 4 3
Merge the two dataframes on 'ID', then take the difference:
import pandas as pd
df_a = pd.DataFrame({'Time': [1,2,3], 'Volume': [5,6,7], 'ID':[1,2,3]})
df_b = pd.DataFrame({'Time': [1,2,3], 'Volume': [2,3,4], 'ID':[2,1,3]})
merged = pd.merge(df_a, df_b, on='ID')
# an inner merge preserves the order of the left keys, so the result
# lines up row-for-row with df_a here
df_a['Diff'] = merged['Volume_x'] - merged['Volume_y']
print(df_a)
#output:
Time Volume ID Diff
0 1 5 1 2
1 2 6 2 4
2 3 7 3 3
I'm trying to go from df to df2.
I'm grouping by review_meta_id and age_bin, then calculating ctr as sum(click_count) / sum(impression_count).
In [69]: df
Out[69]:
review_meta_id age_month impression_count click_count age_bin
0 3 4 10 3 1
1 3 10 5 2 2
2 3 20 5 3 3
3 3 8 9 2 2
4 4 9 9 5 2
In [70]: df2
Out[70]:
review_meta_id ctr age_bin
0 3 0.300000 1
1 3 0.285714 2
2 3 0.600000 3
3 4 0.555556 2
import pandas as pd

bins = [0, 5, 15, 30]
labels = [1, 2, 3]

l = [dict(review_meta_id=3, age_month=4, impression_count=10, click_count=3),
     dict(review_meta_id=3, age_month=10, impression_count=5, click_count=2),
     dict(review_meta_id=3, age_month=20, impression_count=5, click_count=3),
     dict(review_meta_id=3, age_month=8, impression_count=9, click_count=2),
     dict(review_meta_id=4, age_month=9, impression_count=9, click_count=5)]
df = pd.DataFrame(l)
df['age_bin'] = pd.cut(df['age_month'], bins=bins, labels=labels)

grouped = df.groupby(['review_meta_id', 'age_bin'])
Is there an elegant way of doing the following?
data = []
for name, group in grouped:
    ctr = group['click_count'].sum() / group['impression_count'].sum()
    review_meta_id, age_bin = name
    data.append(dict(review_meta_id=review_meta_id, ctr=ctr, age_bin=age_bin))
df2 = pd.DataFrame(data)
You can first aggregate both columns by sum, then divide the columns with DataFrame.pop (which uses and removes them in one step), and last convert the MultiIndex to columns, removing rows with missing values with DataFrame.dropna:

df2 = df.groupby(['review_meta_id', 'age_bin'])[['click_count', 'impression_count']].sum()
df2['ctr'] = df2.pop('click_count') / df2.pop('impression_count')
df2 = df2.reset_index().dropna()
print(df2)
review_meta_id age_bin ctr
0 3 1 0.300000
1 3 2 0.285714
2 3 3 0.600000
4 4 2 0.555556
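The dropna is needed because age_bin is categorical, so the groupby emits every (review_meta_id, age_bin) combination by default, even empty ones. If your pandas version supports the observed parameter of groupby, a variant sketch skips the empty combinations up front and makes the dropna unnecessary:

df2 = (df.groupby(['review_meta_id', 'age_bin'], observed=True)[['click_count', 'impression_count']]
         .sum())
df2['ctr'] = df2.pop('click_count') / df2.pop('impression_count')
df2 = df2.reset_index()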
You can use apply after grouping the dataframe by 'review_meta_id' and 'age_bin' in order to calculate 'ctr'. The result is a pandas Series; to convert it to a dataframe, use reset_index() and pass name='ctr', the name of the column holding the Series values.
def divide_two_cols(df_sub):
    return df_sub['click_count'].sum() / float(df_sub['impression_count'].sum())

df2 = df.groupby(['review_meta_id', 'age_bin']).apply(divide_two_cols).reset_index(name='ctr')
df = pd.DataFrame([["Alpha", 3, 2, 4], ["Bravo", 2, 3, 1], ["Charlie", 4, 1, 3], ["Delta", 1, 4, 2]],
columns = ["Company", "Running", "Combat", "Range"])
print(df)
Company Running Combat Range
0 Alpha 3 2 4
1 Bravo 2 3 1
2 Charlie 4 1 3
3 Delta 1 4 2
Hi, I am trying to sort the following dataframe so the rows are arranged such that the best-performing company across the three columns is at the top. In this case that would be Bravo company, as it is 2 in Running, 3 in Combat and 1 in Range.
Would this approach work if the list had many more companies, making it hard to know the exact "best performing company" in advance?
I have tried:
df_sort = df.sort_values(['Running', 'Combat', 'Range'], ascending=[True, True, True])
current output:
Company Running Combat Range
1 Delta 1 4 2
0 Bravo 2 3 1
3 Alpha 3 2 4
2 Charlie 4 1 3
but it doesn't turn out how I wanted. Can this be done with pandas?
I was expecting the output to be:
Company Running Combat Range
0 Bravo 2 3 1
1 Delta 1 4 2
2 Charlie 4 1 3
3 Alpha 3 2 4
If you want to sort by the mean of each row, first create the means, then use Series.argsort to get the positions of the sorted values, and last reorder the rows with DataFrame.iloc:
# numeric_only=True keeps the non-numeric Company column out of the mean
df1 = df.iloc[df.mean(axis=1, numeric_only=True).argsort()]
print(df1)
Company Running Combat Range
1 Bravo 2 3 1
3 Delta 1 4 2
2 Charlie 4 1 3
0 Alpha 3 2 4
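To see what drives the ordering, it can help to print the intermediate row means (my own printout, not part of the original answer):

print(df.mean(axis=1, numeric_only=True))
# 0    3.000000   <- Alpha
# 1    2.000000   <- Bravo
# 2    2.666667   <- Charlie
# 3    2.333333   <- Delta
# argsort of these means gives positions [1, 3, 2, 0], hence Bravo first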
EDIT: If you need to exclude some columns from the mean first, drop them with DataFrame.drop:

cols = ['Overall', 'Subordination']
df2 = text_df.iloc[text_df.drop(cols, axis=1).mean(axis=1, numeric_only=True).argsort()]
print(df2)
Company Running Combat Overall Subordination Range
1 Bravo 2 3 0.70 Poor 1
3 Delta 1 4 0.83 Good 2
2 Charlie 4 1 0.81 Good 3
0 Alpha 3 2 0.91 Excellent 4
Say I have a dataframe df and group it by a few columns into dfg, taking the median of one of its columns. How could I then take those median values and expand them out so that they sit in a new column of the original df, associated with the respective conditions? This will mean there are duplicates, but I will be using this column for a subsequent calculation, and having the medians in a column makes that possible.
Example data:
import numpy as np
import pandas as pd

data = {'idx': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
        'condition1': [1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4],
        'condition2': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        'values': np.random.normal(0, 1, 16)}
df = pd.DataFrame(data)
dfg = df.groupby(['idx', 'condition2'], as_index=False)['values'].median()
Example of the desired result (note the duplicates corresponding to the correct conditions):
idx condition1 condition2 values medians
0 1 1 1 0.35031 0.656355
1 1 1 2 -0.291736 -0.024304
2 1 2 1 1.593545 0.656355
3 1 2 2 -1.275154 -0.024304
4 1 3 1 0.075259 0.656355
5 1 3 2 1.054481 -0.024304
6 1 4 1 0.9624 0.656355
7 1 4 2 0.243128 -0.024304
8 2 1 1 1.717391 1.155406
9 2 1 2 0.788847 1.006583
10 2 2 1 1.145891 1.155406
11 2 2 2 -0.492063 1.006583
12 2 3 1 -0.157029 1.155406
13 2 3 2 1.224319 1.006583
14 2 4 1 1.164921 1.155406
15 2 4 2 2.042239 1.006583
I believe you need GroupBy.transform with 'median' for the new column:
df['medians'] = df.groupby(['idx', 'condition2'])['values'].transform('median')
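transform returns a Series aligned to the original rows, so the per-group medians are broadcast back automatically. For comparison, a sketch of the same thing using your existing dfg and a merge (equivalent here; the rename to 'medians' is my own naming):

df = df.merge(dfg.rename(columns={'values': 'medians'}),
              on=['idx', 'condition2'],
              how='left')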
I have this data frame:
dict_data = {'id': [1, 1, 1, 2, 2, 2, 2, 2],
             'datetime': np.array(['2016-01-03T16:05:52.000000000', '2016-01-03T16:05:52.000000000',
                                   '2016-01-03T16:05:52.000000000', '2016-01-27T15:45:20.000000000',
                                   '2016-01-27T15:45:20.000000000', '2016-11-27T15:08:04.000000000',
                                   '2016-11-27T15:08:04.000000000', '2016-11-27T15:08:04.000000000'],
                                  dtype='datetime64[ns]')}
df_data = pd.DataFrame(dict_data)
The data looks like this:

   id            datetime
0   1 2016-01-03 16:05:52
1   1 2016-01-03 16:05:52
2   1 2016-01-03 16:05:52
3   2 2016-01-27 15:45:20
4   2 2016-01-27 15:45:20
5   2 2016-11-27 15:08:04
6   2 2016-11-27 15:08:04
7   2 2016-11-27 15:08:04
I want to rank over customer id and date. I used this code:
(df_data.assign(rn=df_data.sort_values(['datetime'], ascending=True)
                          .groupby(['datetime', 'id'])
                          .cumcount() + 1)
        .sort_values(['datetime', 'rn'])
)
I get a different rank for every row, even when the ID and datetime are the same:

   id            datetime  rn
0   1 2016-01-03 16:05:52   1
1   1 2016-01-03 16:05:52   2
2   1 2016-01-03 16:05:52   3
3   2 2016-01-27 15:45:20   1
4   2 2016-01-27 15:45:20   2
5   2 2016-11-27 15:08:04   1
6   2 2016-11-27 15:08:04   2
7   2 2016-11-27 15:08:04   3
What I would like instead is a rank within each ID, where rows with the same datetime get the same rank.
Here is how you can rank by datetime and id:
##### RANK BY datetime and id #####
In[]: df_data.rank(axis=0, ascending=1, method='dense')
Out[]:
datetime id
0 1 1
1 1 1
2 1 1
3 2 2
4 2 2
5 3 2
6 3 2
7 3 2
##### GROUPBY id AND USE APPLY TO GET VALUES FOR EACH GROUP #####
In[]: df_data.rank(axis=0, ascending=1, method='dense').groupby('id').apply(lambda x: x)
Out[]:
datetime id
0 1 1
1 1 1
2 1 1
3 2 2
4 2 2
5 3 2
6 3 2
7 3 2
##### THEN RANK INSIDE EACH GROUP #####
In[]: df_data.assign(rank=df_data.rank(axis=0, ascending=1, method='dense').groupby('id').apply(lambda x: x.rank(axis=0, ascending=1, method='dense'))['datetime'])
Out[]:
datetime id rank
0 2016-01-03 16:05:52 1 1
1 2016-01-03 16:05:52 1 1
2 2016-01-03 16:05:52 1 1
3 2016-01-27 15:45:20 2 1
4 2016-01-27 15:45:20 2 1
5 2016-11-27 15:08:04 2 2
6 2016-11-27 15:08:04 2 2
7 2016-11-27 15:08:04 2 2
If you want to change the ranking method, you'll find more information in the pandas documentation on rank.
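A shorter route to the same result, if I'm reading the requirement correctly, is to rank the datetimes directly within each id group; a minimal sketch (the ranks come back as floats):

df_data['rank'] = df_data.groupby('id')['datetime'].rank(method='dense')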