I am fairly new to pandas. I have the following dataframe:
Date
2019-06-01 195.585770
2019-07-01 210.527466
2019-08-01 206.278168
2019-09-01 222.169479
2019-10-01 246.760193
2019-11-01 265.101562
2019-12-01 292.163818
2020-01-01 307.943604
2020-02-01 271.976532
2020-03-01 253.603500
2020-04-01 293.006836
2020-05-01 317.081665
2020-06-01 331.500000
2020-06-05 331.500000
Name: AAPL, dtype: float64
How can I quickly calculate the difference between 2 dates in days? In the end I want to calculate the average monthly increase percentage-wise.
The result should be that the difference is alternately 30 and 31 days. There must be a quick command to calculate the difference between two consecutive dates but I can't seem to find it.
We can do pct_change and mean:
df['AAPL'].pct_change().mean()
Or, if your data is a Series:
s.pct_change().mean()
If you want to find out the daily percentage change:
s.pct_change()/s.index.to_series().diff().dt.days
Output:
Date
2019-06-01 NaN
2019-07-01 0.002546
2019-08-01 -0.000651
2019-09-01 0.002485
2019-10-01 0.003689
2019-11-01 0.002398
2019-12-01 0.003403
2020-01-01 0.001742
2020-02-01 -0.003768
2020-03-01 -0.002329
2020-04-01 0.005012
2020-05-01 0.002739
2020-06-01 0.001467
2020-06-05 0.000000
dtype: float64
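To see the day gaps on their own, the one-liner can be split into steps. This is a minimal sketch, assuming the dates form the index of a Series called s; only the first four rows of the question's data are rebuilt here by hand.

```python
import pandas as pd

# First rows of the question's data, rebuilt by hand for illustration
s = pd.Series(
    [195.585770, 210.527466, 206.278168, 222.169479],
    index=pd.to_datetime(["2019-06-01", "2019-07-01", "2019-08-01", "2019-09-01"]),
    name="AAPL",
)

# Days between consecutive index entries; the first entry has no predecessor
day_gaps = s.index.to_series().diff().dt.days

# Change per row, and the same change spread over the gap in days
monthly_change = s.pct_change()
daily_change = monthly_change / day_gaps

print(day_gaps.tolist())
print(monthly_change.mean())
```

The first gap is NaN because there is no earlier date to compare against, which is also why the first row of the daily change is NaN.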
I see that you already have your answer, but this may be useful in some other instances.
You can use df.iterrows() to easily grab the difference in dates. First, make sure you convert your Date column to datetime. Then iterate over the rows and grab the difference.
import pandas as pd

df = pd.DataFrame(data={'Date': ['2020-04-01', '2020-05-01', '2020-06-01', '2020-06-05'], 'Col2': ['195.585770', '210.527466', '331.500000', '331.500000']})
df['Date'] = pd.to_datetime(df['Date'])  # convert to datetime
for index, row in df.iterrows():
    if index != 0:  # the first row has no previous date
        print(row['Date'] - df['Date'][index - 1])
The output will look like this:
30 days 00:00:00
31 days 00:00:00
4 days 00:00:00
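Because iterrows runs a Python-level loop, it gets slow on larger frames. The same differences can probably be obtained in one step with Series.diff; this is a sketch on the same four dates, and the names gaps and gap_days are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2020-04-01", "2020-05-01", "2020-06-01", "2020-06-05"]})
df["Date"] = pd.to_datetime(df["Date"])

# Timedelta between each row and the previous one; the first row is NaT
gaps = df["Date"].diff()

# The same gaps as plain day counts
gap_days = gaps.dt.days

print(gaps.dropna())
```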
I have a dataset called weather and it contains one column 'Date' that looks like this.
Date
2020-01-01
2020-01-02
2020-02-01
2020-02-04
2020-03-01
2020-04-01
2020-04-02
2020-04-03
2020-04-04
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-01-01
2020-02-01
2020-04-01
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2020-01-01
The problem is the year is always 2020 when it should be 2020, 2021, and 2022.
The desired column looks like this
Date
2020-01-01
2020-01-02
2020-02-01
2020-02-04
2020-03-01
2020-04-01
2020-04-02
2020-04-03
2020-04-04
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2021-01-01
2021-02-01
2021-04-01
2021-05-01
2021-06-01
2021-07-01
2021-08-01
2021-09-01
2021-10-01
2021-11-01
2021-12-01
2022-01-01
Each year's last month is not necessarily 12, but the new year starts with month 01.
Here is my code:
month = ['01','02','03','04','05','06','07','08','09','10','11','12']
for i in range(len(weather['Date'])):
    year = 2022
    for j in range(len(month)):
        if weather['Date'][i][5:7] == '01':
            weather['Date'][i] = weather['Date'][i].apply(lambda x: 'year' + x[5:])
Is there any suggestion for fixing my code and getting the desired column?
Here's one approach:
Turn the date strings in column Date into datetime using pd.to_datetime, then apply Series.diff and chain Series.dt.days.
Since each negative value (i.e. "day") in our Series will represent the start of a new year, let's apply Series.lt(0) to turn all values below 0 into True and the rest into False.
At this stage, we chain Series.cumsum to end up with a Series containing 0, ..., 1, ..., 2. These will be the values that need to be added to the year 2020 to achieve the correct years.
Now, finally, we can create the correct dates by passing (new_year = year + addition), month, day again to pd.to_datetime (cf. this SO answer).
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(dict(year=(df['Date'].dt.year
+ df['Date'].diff().dt.days.lt(0).cumsum()),
month=df['Date'].dt.month,
day=df['Date'].dt.day))
df['Date']
0 2020-01-01
1 2020-01-02
2 2020-02-01
3 2020-02-04
4 2020-03-01
5 2020-04-01
6 2020-04-02
7 2020-04-03
8 2020-04-04
9 2020-05-01
10 2020-06-01
11 2020-07-01
12 2020-08-01
13 2020-09-01
14 2020-10-01
15 2020-11-01
16 2021-01-01
17 2021-02-01
18 2021-04-01
19 2021-05-01
20 2021-06-01
21 2021-07-01
22 2021-08-01
23 2021-09-01
24 2021-10-01
25 2021-11-01
26 2021-12-01
27 2022-01-01
Name: Date, dtype: datetime64[ns]
You don't need to convert to datetime, of course. You can also recreate the date strings directly, starting from the following line:
df['Date'].str[5:7].astype(int).diff().lt(0).cumsum()
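The string-only route can be finished along these lines. This is a minimal sketch, assuming the column holds 'YYYY-MM-DD' strings and that every new year shows up as a drop in the month number; the five-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical shortened column: every year is written as 2020
df = pd.DataFrame({"Date": ["2020-01-01", "2020-11-01", "2020-01-01",
                            "2020-12-01", "2020-01-01"]})

# A drop in the month number marks the first row of a new year
addition = df["Date"].str[5:7].astype(int).diff().lt(0).cumsum()

# Rebuild the strings: corrected year plus the original "-MM-DD" tail
year = df["Date"].str[:4].astype(int) + addition
df["Date"] = year.astype(str) + df["Date"].str[4:]
print(df["Date"].tolist())
```

Note that this only compares month numbers, so a year boundary between two rows with the same month would go undetected.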
Similar to @ouroboros1's answer, but using numpy to get the number of years to add to each date, and then pd.offsets.DateOffset(years=...) for the addition.
import numpy as np
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
s = df['Date'].values
y = np.r_[0, (s[:-1] > s[1:]).cumsum()]
At this point, it would be tempting to do:
df['Date'] += y * pd.offsets.DateOffset(years=1)
But then we would get a warning: PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized.
So instead, we group by number of years to add, and add the relevant offset to all the dates in the group.
def add_years(g):
    # every row in the group shares the same number of years to add
    return g['Date'] + pd.offsets.DateOffset(years=g['y'].iloc[0])

df['Date'] = df.assign(y=y).groupby('y', sort=False, group_keys=False).apply(add_years)
This is reasonably fast (4.25 ms for 1000 rows and 10 distinct y values) and, for situations other than yours, a bit more general than @ouroboros1's answer:
It handles date changes due to leap years (not present in your example, where all dates are on the first of a month; but if one of the dates were '2020-02-29' and we tried to add 1 year to it using the construct dt = df['Date'].dt; pd.to_datetime(dict(year=dt.year + y, month=dt.month, ...)), we would get a ValueError: cannot assemble the datetimes: day is out of range for month).
It preserves any time of day and timezone information (again, not in your case, but in the general case one would retain those).
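Both points can be checked on a single value. This is a small sketch with a made-up timestamp (the time of day and the UTC zone are assumptions chosen for illustration), comparing DateOffset with reassembling the date from its components.

```python
import pandas as pd

# Hypothetical leap-day timestamp with a time of day and a timezone
leap_day = pd.Timestamp("2020-02-29 13:45", tz="UTC")

# DateOffset falls back to the last valid day of February and keeps time and tz
shifted = leap_day + pd.offsets.DateOffset(years=1)
print(shifted)

# Reassembling from year/month/day components fails for the same date
parts = pd.DataFrame({"year": [2021], "month": [2], "day": [29]})
try:
    pd.to_datetime(parts)
    assembled_ok = True
except ValueError as err:
    assembled_ok = False
    print(err)
```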
I have the following dataframe, where date was set as the index column:
            renormalized
date
2017-01-01             6
2017-01-08             5
2017-01-15             3
2017-01-22             3
2017-01-29             3
I want to append 00:00:00 to each datetime in the index column, so that it looks like this:
                     renormalized
date
2017-01-01 00:00:00             6
2017-01-08 00:00:00             5
2017-01-15 00:00:00             3
2017-01-22 00:00:00             3
2017-01-29 00:00:00             3
I am stuck and have not found a way to make this happen. It would be great if anyone could help.
Thanks
AL
When your time is 0 for all instances, pandas doesn't show the time by default (although it's a Timestamp class, so it has the time!). Probably your data is already normalized, and you can perform delta time operations as usual.
You can see a target observation with df.index[0] for instance, or take a look at all the times with df.index.time.
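A quick way to convince yourself, sketched here on three of the question's dates (the variable name idx is only illustrative):

```python
import datetime
import pandas as pd

idx = pd.to_datetime(["2017-01-01", "2017-01-08", "2017-01-15"])

# A single element is a full Timestamp, midnight included
print(idx[0])

# The time component of every entry, as datetime.time objects
print(idx.time)
```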
You can use DatetimeIndex.strftime
df.index = pd.to_datetime(df.index).strftime('%Y-%m-%d %H:%M:%S')
print(df)
renormalized
date
2017-01-01 00:00:00 6
2017-01-08 00:00:00 5
2017-01-15 00:00:00 3
2017-01-22 00:00:00 3
2017-01-29 00:00:00 3
Or, if the index holds date strings rather than datetimes, you can simply append the time:
df.index = df.index + ' 00:00:00'
I have a dataset in which every value comes with a specific date. I want to place these values, according to their dates, in an Excel sheet that contains the date range of the whole year. The dates start at 01-01-2020 00:00:00 and end at 31-12-2020 23:45:00 with a frequency of 15 minutes, so there will be a total of 35040 date-time values in Excel.
My data looks like this:
load date
12 01-02-2020 06:30:00
21 29-04-2020 03:45:00
23 02-07-2020 12:15:00
54 07-08-2020 16:00:00
23 22-09-2020 16:30:00
As you can see, these values are not continuous, but each has a specific date. I want to use these dates as the index, put each value at its date in the Excel sheet that has the date column, and put zero wherever a value is missing. Can someone please help?
Use DataFrame.reindex with date_range, so that 0 is added for every datetime that does not exist in your data. Your dates are day-first, so tell to_datetime about it:
rng = pd.date_range('2020-01-01','2020-12-31 23:45:00', freq='15Min')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.set_index('date').reindex(rng, fill_value=0)
print (df)
load
2020-01-01 00:00:00 0
2020-01-01 00:15:00 0
2020-01-01 00:30:00 0
2020-01-01 00:45:00 0
2020-01-01 01:00:00 0
...
2020-12-31 22:45:00 0
2020-12-31 23:00:00 0
2020-12-31 23:15:00 0
2020-12-31 23:30:00 0
2020-12-31 23:45:00 0
[35136 rows x 1 columns]
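The row count is 35136 rather than the 35040 expected in the question because 2020 is a leap year (366 days of 96 quarter-hours each). A self-contained sketch on the five sample rows, assuming the day-first format shown in the question; the name full is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "load": [12, 21, 23, 54, 23],
    "date": ["01-02-2020 06:30:00", "29-04-2020 03:45:00", "02-07-2020 12:15:00",
             "07-08-2020 16:00:00", "22-09-2020 16:30:00"],
})

# Explicit day-first format, so 01-02-2020 is read as 1 February
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y %H:%M:%S")

rng = pd.date_range("2020-01-01", "2020-12-31 23:45:00", freq="15min")
full = df.set_index("date").reindex(rng, fill_value=0)

print(len(full))  # 366 days * 96 quarter-hours
print(full.loc[pd.Timestamp("2020-02-01 06:30:00"), "load"])
```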
I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
1 A 1,000 1 2020-01-01 11:00:00
2 A 2,000 2 2019-11-01 10:00:00
3 A 5,000 3 2019-12-01 08:00:00
4 B 1,000 4 2020-01-01 05:00:00
5 B 6,000 5 2019-01-01 01:00:00
6 B 7,000 6 2019-04-01 11:00:00
7 A 8,000 7 2019-11-30 07:00:00
8 B 500 8 2020-01-01 05:00:00
9 B 1,000 9 2020-01-01 03:00:00
10 B 2,000 10 2020-01-01 02:00:00
11 A 1,000 11 2019-05-02 01:00:00
Purpose:
For each row, get the different rolling statistics for column_b based on a window of time in the datetime_column defined as the last N months. The window of time to look at however, is filtered by the value in column_a.
Code example using a for loop which is not feasible given the size:
from datetime import timedelta

mean_dict = {}
for index, test_date in df.datetime_column.items():
    test_column_a = df.column_a[index]
    # rows of the same column_a group in the 180 days before test_date
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00 to 2019-11-01 10:00:00
Average would be the mean of row 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big ~2M rows which means that looping is not an option. I already tried looping and creating a subset based on conditional values but it takes too long.
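One possible loop-free direction is a time-based rolling window inside a groupby. This is only a sketch under assumptions: column_b is taken to be numeric (the table shows strings such as "1,000"), the window is the same 180 days as in the loop rather than calendar months, and the names by_time, by_group and mean_180d are made up for illustration.

```python
import pandas as pd

# The question's table, rebuilt by hand with numeric column_b (an assumption)
df = pd.DataFrame(
    {
        "column_a": list("AAABBBABBBA"),
        "column_b": [1000, 2000, 5000, 1000, 6000, 7000, 8000, 500, 1000, 2000, 1000],
        "datetime_column": pd.to_datetime([
            "2020-01-01 11:00:00", "2019-11-01 10:00:00", "2019-12-01 08:00:00",
            "2020-01-01 05:00:00", "2019-01-01 01:00:00", "2019-04-01 11:00:00",
            "2019-11-30 07:00:00", "2020-01-01 05:00:00", "2020-01-01 03:00:00",
            "2020-01-01 02:00:00", "2019-05-02 01:00:00",
        ]),
    },
    index=range(1, 12),
)

# Time-based windows need the datetime column in sorted order
by_time = df.sort_values("datetime_column", kind="stable")

# Per column_a group: mean of column_b over the 180 days before each row.
# closed="left" leaves the row's own timestamp out of its window.
means = (
    by_time.groupby("column_a")
    .rolling("180D", on="datetime_column", closed="left")["column_b"]
    .mean()
)

# The result comes back group by group, in time order within each group,
# so it lines up row for row with a stable sort on column_a
by_group = by_time.sort_values("column_a", kind="stable")
df["mean_180d"] = pd.Series(means.to_numpy(), index=by_group.index)
print(df)
```

For row 1 this gives the mean of rows 2, 3 and 7, as in the example. Rows with nothing in their window come out as NaN.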
Let's assume that I would like to know how the value of my invested money changes over time. I have the following data in Pandas DataFrame:
contribution monthly_return
2020-01-01 91.91 NaN
2020-02-01 102.18 0.037026
2020-03-01 95.90 -0.012792
2020-04-01 117.89 -0.009188
2020-05-01 100.44 0.011203
2020-06-01 98.89 0.053917
2020-07-01 106.10 -0.049397
2020-08-01 112.55 0.062375
2020-09-01 103.16 -0.063198
...and so on. Every month I invest an additional sum of money in my "fund" (contribution). Monthly return shows how the value of my money changed during the last month.
I would like to add a column showing the current value of my investment in every month (so I can plot it on a graph). As far as I know, I cannot use any of the numpy financial functions (such as np.fv()) because contributions and rates change over time. I can take the cumulative sum of the contributions, but I don't know how to add the profit and loss on the investment.
This may be a trivial question, but I am completely stuck and I have wasted more hours on this problem than I could ever admit. Any help would be appreciated!
Suppose you have a df:
print(df)
contribution monthly_return
2020-01-01 91.91 NaN
2020-02-01 102.18 0.037026
2020-03-01 95.90 -0.012792
2020-04-01 117.89 -0.009188
2020-05-01 100.44 0.011203
2020-06-01 98.89 0.053917
2020-07-01 106.10 -0.049397
2020-08-01 112.55 0.062375
2020-09-01 103.16 -0.063198
Then let's find the multiplier by which your money grows each month:
df['monthly_multiplier'] = 1 + df['monthly_return'].shift(-1)
print(df)
contribution monthly_return monthly_multiplier
2020-01-01 91.91 NaN 1.037026
2020-02-01 102.18 0.037026 0.987208
2020-03-01 95.90 -0.012792 0.990812
2020-04-01 117.89 -0.009188 1.011203
2020-05-01 100.44 0.011203 1.053917
2020-06-01 98.89 0.053917 0.950603
2020-07-01 106.10 -0.049397 1.062375
2020-08-01 112.55 0.062375 0.936802
2020-09-01 103.16 -0.063198 NaN
Finally we can iterate over rows and see how your wealth grows:
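df['fv'] = 0.0  # float column, so the computed values fit its dtype
fv = 0
for index, row in df.iterrows():
    # add this month's contribution, then apply the month's growth
    fv = (fv + row['contribution']) * row['monthly_multiplier']
    df.loc[index, 'fv'] = fv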
print(df)
contribution monthly_return monthly_multiplier fv
2020-01-01 91.91 NaN 1.037026 95.313060
2020-02-01 102.18 0.037026 0.987208 194.966728
2020-03-01 95.90 -0.012792 0.990812 288.194245
2020-04-01 117.89 -0.009188 1.011203 410.633607
2020-05-01 100.44 0.011203 1.053917 538.629162
2020-06-01 98.89 0.053917 0.950603 606.027628
2020-07-01 106.10 -0.049397 1.062375 756.546589
2020-08-01 112.55 0.062375 0.936802 814.171423
2020-09-01 103.16 -0.063198 NaN NaN
df['fv'] is your wealth at the end of the stated month, or just prior to your next contribution.
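If the loop becomes a bottleneck, the same recursion fv[t] = (fv[t-1] + contribution[t]) * multiplier[t] can be unrolled with cumulative products and sums. This is a sketch and not part of the answer's code; the names growth and discounted are introduced here for illustration, and it assumes no multiplier is zero.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "contribution": [91.91, 102.18, 95.90, 117.89, 100.44,
                         98.89, 106.10, 112.55, 103.16],
        "monthly_return": [float("nan"), 0.037026, -0.012792, -0.009188, 0.011203,
                           0.053917, -0.049397, 0.062375, -0.063198],
    },
    index=pd.date_range("2020-01-01", periods=9, freq="MS"),
)
multiplier = 1 + df["monthly_return"].shift(-1)

# growth[t]: cumulative growth factor from the first month through month t
growth = multiplier.cumprod()

# Scale each contribution back to the start, accumulate, then grow forward again
discounted = df["contribution"] / growth.shift(1, fill_value=1.0)
df["fv"] = growth * discounted.cumsum()
print(df["fv"].round(6))
```

The last row stays NaN, as in the loop version, because the return for the following month is not known yet.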