I am trying to calculate the YoY change between columns. Let's say I have the df below:
datedf = pd.DataFrame({'ID':list('12345'),'1/1/2019':[1,2,3,4,5],'2/1/2019':[1,2,3,4,5],'3/1/2019':[1,2,3,4,5],'1/1/2020':[2,4,6,8,10],'2/1/2020':[2,4,6,8,10],'3/1/2020':[2,4,6,8,10]})
What transformation do I need to get the result below, which shows a 100% gain YoY?
endingdf = pd.DataFrame({'ID':list('12345'),'1/1/2020':[1,1,1,1,1],'2/1/2020':[1,1,1,1,1],'3/1/2020':[1,1,1,1,1]})
This is the code I have tried but it does not work. The real data I am working with has multiple years.
just_dates = datedf.loc[:,'1/1/2019':]
just_dates.columns = pd.to_datetime(just_dates.columns)
just_dates.groupby(pd.Grouper(level=0,freq='M',axis=1),axis=1).pct_change()
Try this:
result = datedf.set_index('ID')
result.columns = pd.to_datetime(result.columns)
result = result.pct_change(periods=12, freq='MS', axis=1)
Result:
2019-01-01 2019-02-01 2019-03-01 2020-01-01 2020-02-01 2020-03-01
ID
1 NaN NaN NaN 1.0 1.0 1.0
2 NaN NaN NaN 1.0 1.0 1.0
3 NaN NaN NaN 1.0 1.0 1.0
4 NaN NaN NaN 1.0 1.0 1.0
5 NaN NaN NaN 1.0 1.0 1.0
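The same year-over-year ratio can also be spelled out by hand, which may be easier to follow than passing `axis` and `freq` through `pct_change`. This is only a sketch, assuming every column label other than `ID` parses as a date; the names `wide`, `prev` and `yoy` are mine, not from the question.

```python
import pandas as pd

datedf = pd.DataFrame({
    'ID': list('12345'),
    '1/1/2019': [1, 2, 3, 4, 5],
    '2/1/2019': [1, 2, 3, 4, 5],
    '3/1/2019': [1, 2, 3, 4, 5],
    '1/1/2020': [2, 4, 6, 8, 10],
    '2/1/2020': [2, 4, 6, 8, 10],
    '3/1/2020': [2, 4, 6, 8, 10],
})

wide = datedf.set_index('ID')
wide.columns = pd.to_datetime(wide.columns)

# Relabel every column one year later, so last year's value lines up
# under this year's date.
prev = wide.copy()
prev.columns = prev.columns + pd.DateOffset(years=1)

# Align on the original dates; columns with no prior year become NaN.
yoy = wide / prev.reindex(columns=wide.columns) - 1

# Keep only the dates that actually have a prior-year value.
yoy = yoy.dropna(axis=1, how='all')
print(yoy)
```

The final `dropna` removes the all-NaN 2019 columns, which brings the frame to the three 2020 columns asked for in the question.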
The data looks like this:
df1 = 456089.0 456091.0 456093.0
5428709.0 1.0 1.0 NaN
5428711.0 1.0 1.0 NaN
5428713.0 NaN NaN 1.0
df2 = 456093.0 456095.0 456097.0
5428711.0 2.0 NaN NaN
5428713.0 NaN 2.0 NaN
5428715.0 NaN NaN 2.0
I would like to have this output:
df3 = 456089.0 456091.0 456093.0 456095.0 456097.0
5428709.0 1.0 1.0 NaN NaN NaN
5428711.0 1.0 1.0 2.0 NaN NaN
5428713.0 NaN NaN 1.0 2.0 NaN
5428715.0 NaN NaN NaN NaN 2.0
I tried several combinations with pd.merge, pd.join, pd.concat but nothing worked the way I want it, since I want to combine the data by index and column.
Does anyone have an idea how to do this? Thanks in advance!
Let us try sum with concat
out = pd.concat([df1,df2]).sum(axis=1,level=0,min_count=1).sum(axis=0,level=0,min_count=1)
Out[150]:
456089.0 456091.0 456093.0 456095.0 456097.0
5428709.0 1.0 1.0 NaN NaN NaN
5428711.0 1.0 1.0 2.0 NaN NaN
5428713.0 NaN NaN 1.0 2.0 NaN
5428715.0 NaN NaN NaN NaN 2.0
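As far as I know, the `level` argument of `sum` was deprecated in pandas 1.3 and is no longer accepted in pandas 2.0, so on newer versions the same idea can be written with `groupby(level=0)`. The sketch below rebuilds `df1` and `df2` from the question and assumes the column labels are unique within each frame, so only the row index needs grouping.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [[1.0, 1.0, np.nan], [1.0, 1.0, np.nan], [np.nan, np.nan, 1.0]],
    index=[5428709.0, 5428711.0, 5428713.0],
    columns=[456089.0, 456091.0, 456093.0],
)
df2 = pd.DataFrame(
    [[2.0, np.nan, np.nan], [np.nan, 2.0, np.nan], [np.nan, np.nan, 2.0]],
    index=[5428711.0, 5428713.0, 5428715.0],
    columns=[456093.0, 456095.0, 456097.0],
)

# concat stacks the rows and takes the union of the columns;
# grouping on the index then collapses rows that share a label.
# min_count=1 keeps a cell NaN when no frame supplied a value for it.
out = pd.concat([df1, df2]).groupby(level=0).sum(min_count=1)
print(out)
```

Because this sums, a cell present in both frames would be added together; if one frame should simply take priority, `df1.combine_first(df2)` is worth a look instead.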
I want to convert the dataframe below,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the multiple rows for A and B should be merged into one row, as shown above.
You are looking for a pivot, which will end up giving you a multi-index. You'll need to join those columns to get the suffix you are looking for.
df = df.pivot(index='ID',columns='TYPE', values=['A','B'])
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df.reset_index()
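For anyone who wants to run this end to end, here is a small self-contained sketch built from the sample data in the question. The `swaplevel` and `sort_index` steps are my addition, so that the `_A` and `_B` columns of each type sit next to each other; the column order is alphabetical by type rather than the order shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 3, 4, 5, 5],
    'TYPE': ['MISSING', '1T', '2T', 'MISSING', '2T', 'CBN', 'DSV'],
    'A': [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    'B': [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})

# pivot produces MultiIndex columns of the form (value name, TYPE).
wide = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])

# Put TYPE first and sort, so each type's A and B end up adjacent.
wide = wide.swaplevel(axis=1).sort_index(axis=1)

# Flatten (TYPE, value name) into TYPE_value, e.g. ('2T', 'A') -> '2T_A'.
wide.columns = ['_'.join(col) for col in wide.columns]
wide = wide.reset_index()
print(wide)
```

Note that `pivot` raises an error if an (ID, TYPE) pair occurs more than once; in that case `pivot_table` with an explicit aggregation would be the usual fallback.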
I have the following DataFrame:
df = pandas.DataFrame({'Buy':[10,np.nan,2,np.nan,np.nan,4],'Sell':[np.nan,7,np.nan,9,np.nan,np.nan]})
Out[37]:
Buy Sell
0 10.0 NaN
1 NaN 7.0
2 2.0 NaN
3 NaN 9.0
4 NaN NaN
5 4.0 NaN
I want to create two more columns called Quant and B/S.
For Quant it is working fine as follows:
df['Quant'] = df['Buy'].fillna(df['Sell']) # Take whichever value is available; if both are NaN the output is NaN.
Output is:
df
Out[39]:
Buy Sell Quant
0 10.0 NaN 10.0
1 NaN 7.0 7.0
2 2.0 NaN 2.0
3 NaN 9.0 9.0
4 NaN NaN NaN
5 4.0 NaN 4.0
But I want to create B/S based on which column each value was taken from when creating Quant.
You can perform an equality test and feed into numpy.where:
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
For the case where both values are null, you can use an additional step:
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Example
from io import StringIO
import numpy as np
import pandas as pd
mystr = StringIO("""Buy Sell
10 nan
nan 8
4 nan
nan 5
nan 7
3 nan
2 nan
nan nan""")
df = pd.read_csv(mystr, sep=r'\s+')
df['Quant'] = df['Buy'].fillna(df['Sell'])
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Result
print(df)
Buy Sell Quant B/S
0 10.0 NaN 10.0 B
1 NaN 8.0 8.0 S
2 4.0 NaN 4.0 B
3 NaN 5.0 5.0 S
4 NaN 7.0 7.0 S
5 3.0 NaN 3.0 B
6 2.0 NaN 2.0 B
7 NaN NaN NaN NaN
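A variation that does not depend on comparing against Quant is to test directly which source column is populated. This is a sketch on the data from the question; the helper names `side` and `has_value` are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Buy': [10, np.nan, 2, np.nan, np.nan, 4],
    'Sell': [np.nan, 7, np.nan, 9, np.nan, np.nan],
})
df['Quant'] = df['Buy'].fillna(df['Sell'])

# 'B' wherever Buy is populated, otherwise tentatively 'S'.
side = pd.Series(np.where(df['Buy'].notna(), 'B', 'S'), index=df.index)

# Rows where neither column has a value should stay missing.
has_value = df[['Buy', 'Sell']].notna().any(axis=1)
df['B/S'] = side.where(has_value)
print(df)
```

This mirrors the priority of `fillna`: Buy wins whenever it is present, Sell is used only as the fallback.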
I have a dataframe with dummy variables for daily weather types observations.
date high_wind thunder snow smoke
0 2050-10-23 1.0 NaN NaN NaN
1 2050-10-24 1.0 1.0 NaN NaN
2 2050-10-25 NaN NaN NaN NaN
3 2050-10-26 NaN NaN NaN 1.0
4 2050-10-27 NaN NaN NaN 1.0
5 2050-10-28 NaN NaN NaN 1.0
6 2050-10-29 1.0 NaN NaN NaN
7 2050-10-30 NaN 1.0 NaN NaN
8 2050-10-31 NaN 1.0 NaN NaN
9 2050-11-01 1.0 1.0 NaN NaN
10 2050-11-02 1.0 1.0 NaN NaN
11 2050-11-03 1.0 1.0 NaN NaN
12 2050-11-04 1.0 NaN NaN NaN
13 2050-11-05 1.0 NaN NaN NaN
14 2050-11-06 NaN NaN NaN NaN
15 2050-11-07 NaN 1.0 NaN NaN
16 2050-11-08 NaN NaN NaN NaN
17 2050-11-09 NaN NaN 1.0 NaN
18 2050-11-10 NaN NaN NaN NaN
19 2050-11-11 NaN NaN 1.0 NaN
20 2050-11-12 NaN NaN 1.0 NaN
21 2050-11-13 NaN NaN NaN NaN
For those of you playing along at home, copy the above and then:
import pandas as pd
df = pd.read_clipboard()
df.date = df.date.apply(pd.to_datetime)
df.set_index('date', inplace=True)
I want to visualize this dataframe with the date on the x axis and each weather type category on the y axis. Here's what I've tried so far:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
labels = df.columns.tolist()
# unsatisfying loop to give categories some y separation
for i, col in enumerate(df.columns):
    ax.scatter(x=df[col].index, y=(df[col]+i)) # add a little to each
ax.set_yticklabels(labels)
ax.set_xlim(df.index.min(), df.index.max())
fig.autofmt_xdate()
Which gives me this:
Questions:
How do I get the y labels aligned properly?
Is there a better way to structure the data to make plotting easier?
This aligns your y labels:
ax.set_yticks(range(1, len(df.columns) + 1))
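On the second question, one option is to reshape the frame to long form so that each observed event is a single row, and then plot everything with one `scatter` call. The sketch below uses only the first five rows of the frame above, selects a non-interactive backend so it runs without a display, and introduces the names `long` and `codes` purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# First five rows of the weather frame from the question.
dates = pd.date_range("2050-10-23", periods=5, freq="D")
df = pd.DataFrame(
    {
        "high_wind": [1.0, 1.0, np.nan, np.nan, np.nan],
        "thunder": [np.nan, 1.0, np.nan, np.nan, np.nan],
        "snow": [np.nan, np.nan, np.nan, np.nan, np.nan],
        "smoke": [np.nan, np.nan, np.nan, 1.0, 1.0],
    },
    index=dates,
)
labels = df.columns.tolist()

# Long form: one row per (date, weather type) that was actually observed.
long = (
    df.rename_axis("date")
    .reset_index()
    .melt(id_vars="date", var_name="weather", value_name="flag")
    .dropna(subset=["flag"])
)

# Map each weather type to an integer y position, in column order.
codes = pd.Categorical(long["weather"], categories=labels).codes

fig, ax = plt.subplots()
ax.scatter(long["date"], codes)
ax.set_yticks(range(len(labels)))  # fix tick positions before labelling
ax.set_yticklabels(labels)
fig.autofmt_xdate()
fig.savefig("weather.png")
```

Setting the tick positions before the tick labels is what keeps each label on its own row, and it is the same idea as the `set_yticks` call above.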
I have this kind of pandas DataFrame for each user in a large database.
Each row is a period [start_date, end_date], but sometimes two consecutive rows are in fact the same period: the end_date is equal to the following start_date (red underlining). Sometimes periods even overlap on more than one date.
I would like to get the "real periods" by combining the rows that correspond to the same period.
What I have tried
def split_range(name):
    df_user = de_201512_echant[de_201512_echant.name == name]
    # -- Create a date_range with a length [min_start_date, max_start_date]
    t_date = pd.DataFrame(index=pd.date_range("2005-01-01", "2015-12-12").date)
    for row in range(0, df_user.shape[0]):
        start_date = df_user.iloc[row].start_date
        end_date = df_user.iloc[row].end_date
        if ((pd.isnull(start_date) == False) and (pd.isnull(end_date) == False)):
            t = pd.DataFrame(index=pd.date_range(start_date, end_date))
            t["period_%s" % (row)] = 1
            t_date = pd.merge(t_date, t, right_index=True, left_index=True, how="left")
        else:
            pass
    return t_date
which yields a DataFrame where each column is a period (1 if in the range, NaN if not):
t_date
Out[29]:
period_0 period_1 period_2 period_3 period_4 period_5 \
2005-01-01 NaN NaN NaN NaN NaN NaN
2005-01-02 NaN NaN NaN NaN NaN NaN
2005-01-03 NaN NaN NaN NaN NaN NaN
2005-01-04 NaN NaN NaN NaN NaN NaN
2005-01-05 NaN NaN NaN NaN NaN NaN
2005-01-06 NaN NaN NaN NaN NaN NaN
2005-01-07 NaN NaN NaN NaN NaN NaN
2005-01-08 NaN NaN NaN NaN NaN NaN
2005-01-09 NaN NaN NaN NaN NaN NaN
2005-01-10 NaN NaN NaN NaN NaN NaN
2005-01-11 NaN NaN NaN NaN NaN NaN
Then if I sum all the columns (periods), I get almost exactly what I want:
full_spell = t_date.sum(axis=1)
full_spell.loc[full_spell == 1]
Out[31]:
2005-11-14 1.0
2005-11-15 1.0
2005-11-16 1.0
2005-11-17 1.0
2005-11-18 1.0
2005-11-19 1.0
2005-11-20 1.0
2005-11-21 1.0
2005-11-22 1.0
2005-11-23 1.0
2005-11-24 1.0
2005-11-25 1.0
2005-11-26 1.0
2005-11-27 1.0
2005-11-28 1.0
2005-11-29 1.0
2005-11-30 1.0
2006-01-16 1.0
2006-01-17 1.0
2006-01-18 1.0
2006-01-19 1.0
2006-01-20 1.0
2006-01-21 1.0
2006-01-22 1.0
2006-01-23 1.0
2006-01-24 1.0
2006-01-25 1.0
2006-01-26 1.0
2006-01-27 1.0
2006-01-28 1.0
2015-07-06 1.0
2015-07-07 1.0
2015-07-08 1.0
2015-07-09 1.0
2015-07-10 1.0
2015-07-11 1.0
2015-07-12 1.0
2015-07-13 1.0
2015-07-14 1.0
2015-07-15 1.0
2015-07-16 1.0
2015-07-17 1.0
2015-07-18 1.0
2015-07-19 1.0
2015-08-02 1.0
2015-08-03 1.0
2015-08-04 1.0
2015-08-05 1.0
2015-08-06 1.0
2015-08-07 1.0
2015-08-08 1.0
2015-08-09 1.0
2015-08-10 1.0
2015-08-11 1.0
2015-08-12 1.0
2015-08-13 1.0
2015-08-14 1.0
2015-08-15 1.0
2015-08-16 1.0
2015-08-17 1.0
dtype: float64
But I could not find a way to slice this sparse datetime index into its contiguous time ranges to finally get my desired output: the original dataframe containing the "real" periods of time.
It might not be the most efficient way to do this, so if you have alternatives, do not hesitate!
I found a much more efficient way to do this by using apply:
def get_range(row):
    '''Return the list of days from "start_date" to "end_date".'''
    start_date = row["start_date"]
    end_date = row["end_date"]
    period = pd.date_range(start_date, end_date, freq="1D")
    return list(period)
# -- Apply get_range() to each row of the initial df (axis=1), then put one day per row
t_all = df.apply(get_range, axis=1).explode().rename("days_in_period")
# -- Drop overlapping dates
t_all.drop_duplicates(inplace=True)
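To go from the input rows straight to the merged "real" periods, without expanding every day, a common pattern is to sort by start date and open a new group whenever a row starts after the latest end date seen so far. The sketch below uses made-up rows for a single user (the column names follow the question, the dates are hypothetical) and assumes there are no missing dates.

```python
import pandas as pd

# Hypothetical periods for one user; rows 0-1 and rows 2-3 touch or overlap.
df = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2005-11-14", "2005-11-20", "2006-01-16", "2006-01-20"]),
    "end_date": pd.to_datetime(
        ["2005-11-20", "2005-11-30", "2006-01-22", "2006-01-28"]),
})

df = df.sort_values("start_date").reset_index(drop=True)

# Latest end date among all earlier rows (NaT for the first row).
prev_max_end = df["end_date"].cummax().shift()

# A row opens a new period only if it starts after everything before it ended.
new_period = df["start_date"] > prev_max_end
period_id = new_period.cumsum().rename("period")

# Collapse each group to its earliest start and latest end.
merged = (
    df.groupby(period_id)
    .agg(start_date=("start_date", "min"), end_date=("end_date", "max"))
    .reset_index(drop=True)
)
print(merged)
```

With the strict `>` comparison, a row whose start_date equals the previous end_date is folded into the same period, which matches the case described in the question. For many users, the same steps could be run per user inside a `groupby` on the name column.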