Calculating YoY growth for columns

Calculating YoY growth for columns - python

I am trying to calculate the YoY change between columns. Lets say I have the df below
datedf = pd.DataFrame({'ID':list('12345'),'1/1/2019':[1,2,3,4,5],'2/1/2019':[1,2,3,4,5],'3/1/2019':[1,2,3,4,5],'1/1/2020':[2,4,6,8,10],'2/1/2020':[2,4,6,8,10],'3/1/2020':[2,4,6,8,10]})
What transformation would I have to do in order to get to this result below, to show 100% gain YoY.
endingdf = pd.DataFrame({'ID':list('12345'),'1/1/2020':[1,1,1,1,1],'2/1/2020':[1,1,1,1,1],'3/1/2020':[1,1,1,1,1]})
This is the code I have tried but it does not work. The real data I am working with has multiple years.
just_dates = datedf.loc[:,'1/1/2019':]
just_dates.columns = pd.to_datetime(just_dates.columns)
just_dates.groupby(pd.Grouper(level=0,freq='M',axis=1),axis=1).pct_change()

Try this:
result = datedf.set_index('ID')
result.columns = pd.to_datetime(result.columns)
result = result.pct_change(periods=12, freq='MS', axis=1)
Result:
2019-01-01 2019-02-01 2019-03-01 2020-01-01 2020-02-01 2020-03-01
ID
1 NaN NaN NaN 1.0 1.0 1.0
2 NaN NaN NaN 1.0 1.0 1.0
3 NaN NaN NaN 1.0 1.0 1.0
4 NaN NaN NaN 1.0 1.0 1.0
5 NaN NaN NaN 1.0 1.0 1.0

Related

combine two pd df by index and column

The data looks like this:
df1 = 456089.0 456091.0 456093.0
5428709.0 1.0 1.0 NaN
5428711.0 1.0 1.0 NaN
5428713.0 NaN NaN 1.0
df2 = 456093.0 456095.0 456097.0
5428711.0 2.0 NaN NaN
5428713.0 NaN 2.0 NaN
5428715.0 NaN NaN 2.0
I would like to have this output:
df3 = 456089.0 456091.0 456093.0 456095.0 456097.0
5428709.0 1.0 1.0 NaN NaN NaN
5428711.0 1.0 1.0 2.0 NaN NaN
5428713.0 NaN NaN 1.0 2.0 NaN
5428715.0 NaN NaN NaN NaN 2.0
I tried several combinations with pd.merge, pd.join, pd.concat but nothing worked the way I want it, since I want to combine the data by index and column.
Does anyone have an idea how to do this? Thanks in advance!

Let us try sum with concat
out = pd.concat([df1,df2]).sum(axis=1,level=0,min_count=1).sum(axis=0,level=0,min_count=1)
Out[150]:
456089.0 456091.0 456093.0 456095.0 456097.0
5428709.0 1.0 1.0 NaN NaN NaN
5428711.0 1.0 1.0 2.0 NaN NaN
5428713.0 NaN NaN 1.0 2.0 NaN
5428715.0 NaN NaN NaN NaN 2.0

Convert two pandas rows into one

I want to convert below dataframe,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 10.0 4.0 NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, multiple rows for A and B to merge into one row as shown above.

You are looking for a pivot, which will end up giving you a multi-index. You'll need to join those columns to get the suffix you are looking for.
df = df.pivot(index='ID',columns='TYPE', values=['A','B'])
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df.reset_index()

Pandas combine two columns

I have following database:
df = pandas.DataFrame({'Buy':[10,np.nan,2,np.nan,np.nan,4],'Sell':[np.nan,7,np.nan,9,np.nan,np.nan]})
Out[37]:
Buy Sell
0 10.0 NaN
1 NaN 7.0
2 2.0 NaN
3 NaN 9.0
4 NaN NaN
5 4.0 NaN
I want o create two more columns called Quant and B/S
for Quant it is working fine as follows:
df['Quant'] = df['Buy'].fillna(df['Sell']) # Fetch available value from both column and if both values are Nan then output is Nan.
Output is:
df
Out[39]:
Buy Sell Quant
0 10.0 NaN 10.0
1 NaN 7.0 7.0
2 2.0 NaN 2.0
3 NaN 9.0 9.0
4 NaN NaN NaN
5 4.0 NaN 4.0
But I want to create B/S on the basis of "from which column they have taken value while creating Quant"

You can perform an equality test and feed into numpy.where:
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
For the case where both values are null, you can use an additional step:
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Example
from io import StringIO
import pandas as pd
mystr = StringIO("""Buy Sell
10 nan
nan 8
4 nan
nan 5
nan 7
3 nan
2 nan
nan nan""")
df = pd.read_csv(mystr, delim_whitespace=True)
df['Quant'] = df['Buy'].fillna(df['Sell'])
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Result
print(df)
Buy Sell Quant B/S
0 10.0 NaN 10.0 B
1 NaN 8.0 8.0 S
2 4.0 NaN 4.0 B
3 NaN 5.0 5.0 S
4 NaN 7.0 7.0 S
5 3.0 NaN 3.0 B
6 2.0 NaN 2.0 B
7 NaN NaN NaN NaN

Visualizing multiple dummy variables over time

I have a dataframe with dummy variables for daily weather types observations.
date high_wind thunder snow smoke
0 2050-10-23 1.0 NaN NaN NaN
1 2050-10-24 1.0 1.0 NaN NaN
2 2050-10-25 NaN NaN NaN NaN
3 2050-10-26 NaN NaN NaN 1.0
4 2050-10-27 NaN NaN NaN 1.0
5 2050-10-28 NaN NaN NaN 1.0
6 2050-10-29 1.0 NaN NaN NaN
7 2050-10-30 NaN 1.0 NaN NaN
8 2050-10-31 NaN 1.0 NaN NaN
9 2050-11-01 1.0 1.0 NaN NaN
10 2050-11-02 1.0 1.0 NaN NaN
11 2050-11-03 1.0 1.0 NaN NaN
12 2050-11-04 1.0 NaN NaN NaN
13 2050-11-05 1.0 NaN NaN NaN
14 2050-11-06 NaN NaN NaN NaN
15 2050-11-07 NaN 1.0 NaN NaN
16 2050-11-08 NaN NaN NaN NaN
17 2050-11-09 NaN NaN 1.0 NaN
18 2050-11-10 NaN NaN NaN NaN
19 2050-11-11 NaN NaN 1.0 NaN
20 2050-11-12 NaN NaN 1.0 NaN
21 2050-11-13 NaN NaN NaN NaN
For those of you playing along at home, copy the above and then:
import pandas as pd
df = pd.read_clipboard()
df.date = df.date.apply(pd.to_datetime)
df.set_index('date', inplace=True)
I want to visualize this dataframe with the date on the x axis and each weather type category on the y axis. Here's what I've tried so far:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
labels = df.columns.tolist()
#unsatisfying loop to give categories some y separation
for i,col in enumerate(df.columns):
ax.scatter(x=df[col].index, y=(df[col]+i)) #add a little to each
ax.set_yticklabels(labels)
ax.set_xlim(df.index.min(), df.index.max())
fig.autofmt_xdate()
Which gives me this:
Questions:
How do I get the y labels aligned properly?
Is there a better way to structure the data to make plotting easier?

This aligns you y labels:
ax.set_yticks(range(1, len(df.columns) + 1))

Get range from sparse datetimeindex

I have this kind of pandas DataFrame for each user in a large database.
each row is a period of length [start_date, end_date], but sometimes 2 consecutive rows are in fact the same period : end_date is equal to the following start_date (red underlining). Sometimes periods even overlap on more than 1 date.
I would like to get the "real periods" by combining rows which corresponds to the same periods.
What I have tried
def split_range(name):
df_user = de_201512_echant[de_201512_echant.name == name]
# -- Create a date_range with a length [min_start_date, max_start_date]
t_date = pd.DataFrame(index=pd.date_range("2005-01-01", "2015-12-12").date)
for row in range(0, df_user.shape[0]):
start_date = df_user.iloc[row].start_date
end_date = df_user.iloc[row].end_date
if ((pd.isnull(start_date) == False) and (pd.isnull(end_date) == False)):
t = pd.DataFrame(index=pd.date_range(start_date, end_date))
t["period_%s" % (row)] = 1
t_date = pd.merge(t_date, t, right_index=True, left_index=True, how="left")
else:
pass
return t_date
which yields a DataFrame where each colunms is a period (1 if in the range, NaN if not) :
t_date
Out[29]:
period_0 period_1 period_2 period_3 period_4 period_5 \
2005-01-01 NaN NaN NaN NaN NaN NaN
2005-01-02 NaN NaN NaN NaN NaN NaN
2005-01-03 NaN NaN NaN NaN NaN NaN
2005-01-04 NaN NaN NaN NaN NaN NaN
2005-01-05 NaN NaN NaN NaN NaN NaN
2005-01-06 NaN NaN NaN NaN NaN NaN
2005-01-07 NaN NaN NaN NaN NaN NaN
2005-01-08 NaN NaN NaN NaN NaN NaN
2005-01-09 NaN NaN NaN NaN NaN NaN
2005-01-10 NaN NaN NaN NaN NaN NaN
2005-01-11 NaN NaN NaN NaN NaN NaN
Then if I sum all the columns (periods) I got almost exactly what I want :
full_spell = t_date.sum(axis=1)
full_spell.loc[full_spell == 1]
Out[31]:
2005-11-14 1.0
2005-11-15 1.0
2005-11-16 1.0
2005-11-17 1.0
2005-11-18 1.0
2005-11-19 1.0
2005-11-20 1.0
2005-11-21 1.0
2005-11-22 1.0
2005-11-23 1.0
2005-11-24 1.0
2005-11-25 1.0
2005-11-26 1.0
2005-11-27 1.0
2005-11-28 1.0
2005-11-29 1.0
2005-11-30 1.0
2006-01-16 1.0
2006-01-17 1.0
2006-01-18 1.0
2006-01-19 1.0
2006-01-20 1.0
2006-01-21 1.0
2006-01-22 1.0
2006-01-23 1.0
2006-01-24 1.0
2006-01-25 1.0
2006-01-26 1.0
2006-01-27 1.0
2006-01-28 1.0
2015-07-06 1.0
2015-07-07 1.0
2015-07-08 1.0
2015-07-09 1.0
2015-07-10 1.0
2015-07-11 1.0
2015-07-12 1.0
2015-07-13 1.0
2015-07-14 1.0
2015-07-15 1.0
2015-07-16 1.0
2015-07-17 1.0
2015-07-18 1.0
2015-07-19 1.0
2015-08-02 1.0
2015-08-03 1.0
2015-08-04 1.0
2015-08-05 1.0
2015-08-06 1.0
2015-08-07 1.0
2015-08-08 1.0
2015-08-09 1.0
2015-08-10 1.0
2015-08-11 1.0
2015-08-12 1.0
2015-08-13 1.0
2015-08-14 1.0
2015-08-15 1.0
2015-08-16 1.0
2015-08-17 1.0
dtype: float64
But I could not find a way to slice all the time range of this sparse datetime index to finally get my desired output : the original dataframe containing the "real" period of time.
It might not be the most efficient way to do this, so If you have alternatives, do not hesitate!

I found a much more efficient way to do this by using apply:
def get_range(row):
'''returns a DataFrame containing the day-range from a "start_date"
and a "end_date"'''
start_date = row["start_date"]
end_date = row["end_date"]
period = pd.date_range(start_date, end_date, freq="1D")
return pd.Dataframe(period, columns='days_in_period')
# -- Apply get_range() to the initial df
t_all = df.apply(get_range)
# -- Drop overlapping dates
t_all.drop_duplicates(inplace=True)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Calculating YoY growth for columns - python

Related

combine two pd df by index and column

Convert two pandas rows into one

Pandas combine two columns

Visualizing multiple dummy variables over time

Get range from sparse datetimeindex

Categories

Resources