pandas - changing the start and end date of a resampled time series - python

I have a time series that I resampled into this dataframe, df.
My data runs from 6 June to 28 June. I want to extend it to cover 1 June to 30 June. The count column should be 0 in the extended period only, and keep my real values from the 6th to the 28th.
Out[123]:
count
Timestamp
2009-06-07 02:00:00 1
2009-06-07 03:00:00 0
2009-06-07 04:00:00 0
2009-06-07 05:00:00 0
2009-06-07 06:00:00 0
I need to make the
start date: 2009-06-01 00:00:00
end date: 2009-06-30 23:00:00
so the data would look something like this:
count
Timestamp
2009-06-01 01:00:00 0
2009-06-01 02:00:00 0
2009-06-01 03:00:00 0
Is there an efficient way to do this? The only way I can think of is not very efficient. I have been trying since yesterday. Please help.
import numpy as np
import pandas as pd
index = pd.date_range('2009-06-01 00:00:00', '2009-06-30 23:00:00', freq='H')
df = pd.DataFrame(np.zeros((len(index), 1)), index=index)
df.columns = ['zeros']
result = pd.concat([df2, df])
result1 = pd.concat([df, result])
result1 = result1.fillna(0)
del result1['zeros']

You can create a new index with the desired start and end times, resample the time series data and aggregate by count, then reindex the result to the new index and fill the gaps with 0.
import pandas as pd
# create the index with the start and end times you want
t_index = pd.date_range(start='2009-06-01', end='2009-06-30 23:00:00', freq='1h')
# create the data frame
df = pd.DataFrame([['2009-06-07 02:07:42'],
['2009-06-11 17:25:28'],
['2009-06-11 17:50:42'],
['2009-06-11 17:59:18']], columns=['daytime'])
df['daytime'] = pd.to_datetime(df['daytime'])
# resample the data to 1 hour, aggregate by counts,
# then reindex to the full range and fill the NaNs with 0
df2 = df.resample('1h', on='daytime').count().reindex(t_index).fillna(0)

Passing start and end directly to pd.DatetimeIndex() no longer works; it raises __new__() got an unexpected keyword argument 'start'. Build the index with pd.date_range() instead.
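If the frame has already been resampled, as in the question, the extension can be sketched with reindex alone. This is a minimal sketch, not the answerer's code: the names counts, full_index and extended are made up for illustration, and the five sample rows are copied from the question.

```python
import pandas as pd

# Hypothetical stand-in for the already-resampled frame from the question.
counts = pd.DataFrame(
    {"count": [1, 0, 0, 0, 0]},
    index=pd.date_range("2009-06-07 02:00:00", periods=5, freq="h", name="Timestamp"),
)

# The full hourly range that the result should cover.
full_index = pd.date_range(
    "2009-06-01 00:00:00", "2009-06-30 23:00:00", freq="h", name="Timestamp"
)

# reindex keeps existing rows and creates the missing ones;
# fill_value=0 fills the new rows without going through NaN.
extended = counts.reindex(full_index, fill_value=0)
print(extended.head())
```

Passing fill_value=0 to reindex keeps the column as integers, whereas reindexing first and calling fillna(0) afterwards leaves it as floats.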

Related

How to tackle a dataset that has multiple same date values

I have a large data set from which I'm trying to build a time series model using ARIMA. However, some of the data in the date column has multiple rows with the same date.
The dates were entered this way because the exact date of the event was not always known, so unknown dates were entered as the first of that month (which biases the data). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not all the same. I have code to randomise a date, but I cannot find a way to replace the dates in the data set with dates drawn from those ranges.
import random
import time
def str_time_prop(start, end, time_format, prop):
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))
def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)
# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
I use the code above to generate a random date within a date range, but I'm struggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
    {
        "date": [
            "2016-01-01",
            "2015-01-01",
            "2013-01-01",
            "2014-01-01",
            "2017-01-01",
            "2011-01-01",
        ],
        "value": [10035, 5397, 4567, 4343, 3981, 2049],
    }
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group the rows by year, by month, and by whether the day equals 1. Then use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import calendar
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group rows by year, month, and whether the day equals 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day == 1])
FIRST_DAY = 2 # set for the desired range
df_list = []
for (year, month, is_first), g in grouped:
    g = g.copy()  # work on a copy so the assignment below does not warn
    if is_first:
        last_day = calendar.monthrange(year, month)[1]  # last day of this month and year
        g['New_Date'] = g['date'].apply(
            lambda d: d.replace(day=np.random.randint(FIRST_DAY, last_day + 1))
        )
    else:
        g['New_Date'] = g['date']  # known dates are kept unchanged
    df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12
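The same idea can also be sketched without a Python-level loop. This is only an illustration under assumptions: the toy rows, the new_date column name and the seed are invented, and offsets are drawn from day 1 onwards, so a placeholder date may stay on the 1st.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy data: three placeholder dates (day 1) and one known date.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-01-01", "2016-01-01", "2015-02-01", "2015-02-17"]
        ),
        "value": [10035, 5397, 4567, 4343],
    }
)

is_first = df["date"].dt.day == 1
days_in_month = df["date"].dt.days_in_month.to_numpy()

# One random offset per row, between 0 and days_in_month - 1 (upper bound is exclusive),
# so adding it to the 1st never leaves the month.
offsets = rng.integers(0, days_in_month)
shifted = df["date"] + pd.to_timedelta(offsets, unit="D")

# Keep known dates as they are; replace only the placeholder rows.
df["new_date"] = df["date"].where(~is_first, shifted)
print(df)
```

To force placeholder rows away from the 1st, the lower bound of rng.integers could be set to 1 instead of 0.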

How to use Pandas to get date_range from some timestamp?

I need to split a year into enumerated 20-minute chunks and then, for randomly distributed timestamps within the year, find the sequence number of the chunk each one falls into, for further processing.
I tried to use pandas for this, but I can't find a way to look up a timestamp in a date_range:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from datetime import timedelta
if __name__ == '__main__':
    date_start = pd.to_datetime('2018-01-01')
    date_end = date_start + timedelta(days=365)
    index = pd.date_range(start=date_start, end=date_end, freq='20min')
    data = range(len(index))
    df = pd.DataFrame(data, index=index, columns=['A'])
    print(df)
    event_ts = pd.to_datetime('2018-10-14 02:17:43')
    # How to find the corresponding df['A'] for event_ts?
    # print(df.loc[event_ts])
Output:
A
2018-01-01 00:00:00 0
2018-01-01 00:20:00 1
2018-01-01 00:40:00 2
2018-01-01 01:00:00 3
2018-01-01 01:20:00 4
... ...
2018-12-31 22:40:00 26276
2018-12-31 23:00:00 26277
2018-12-31 23:20:00 26278
2018-12-31 23:40:00 26279
2019-01-01 00:00:00 26280
[26281 rows x 1 columns]
What is the best practice for doing this in Python? I can imagine finding the range "by hand" by converting the date_range to integers and comparing, but maybe there is a more elegant pandas/Python-style way to do it?
First of all, I've worked with a small interval, one week:
date_end = date_start + timedelta(days=7)
Then I've followed your steps, and got a portion of your dataframe.
My event_ts is this:
event_ts = pd.to_datetime('2018-01-04 02:17:43')
And I've chosen to reset the index, and have a dataframe easy to manipulate:
df = df.reset_index()
With this code I found the last value where event_ts belongs:
run = []
for i in df['index']:
    if i <= event_ts:
        run.append(i)
print(max(run))
#2018-01-04 02:00:00
or:
top = max(run)
Finally:
df.loc[df['index'] == top].index[0]
222
So event_ts belongs to the chunk at row 222 of df.
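For the full year from the question, the lookup can be sketched without a loop. This is a hedged sketch rather than the answerer's method: it relies on the index being sorted, and the names pos, chunk_start, chunk_no and chunk_no_arith are chosen here for illustration.

```python
import pandas as pd

date_start = pd.Timestamp("2018-01-01")
index = pd.date_range(
    start=date_start, end=date_start + pd.Timedelta(days=365), freq="20min"
)
df = pd.DataFrame({"A": range(len(index))}, index=index)

event_ts = pd.Timestamp("2018-10-14 02:17:43")

# Binary search on the sorted index: position of the last chunk start <= event_ts.
pos = index.searchsorted(event_ts, side="right") - 1
chunk_start = index[pos]
chunk_no = int(df["A"].iloc[pos])

# Because the chunks are evenly spaced, plain arithmetic gives the same number.
chunk_no_arith = (event_ts - date_start) // pd.Timedelta("20min")

print(chunk_start, chunk_no, chunk_no_arith)
```

The arithmetic form needs no DataFrame at all, but it only works while every chunk has the same length; the searchsorted form also handles irregular chunk boundaries.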

How to find the row index of the first occurrence of a match in a cell in Python dataframe (containing date)

I have a Python data frame containing a column with Date Time like this
2019-01-02 09:00:00 (which means January 2, 2019 9 AM)
There may be a bunch of rows which have the same date in the Date Time column.
In other words, I can have 2019-01-02 09:00:00 or 2019-01-02 09:15:00 or 2019-01-02 09:30:00 and so on.
Now I need to find the row index of the first occurrence of the date 2019-01-02 in the Python data frame.
I can obviously do this with a loop, but I am wondering if there is a better way.
With the df['Date Time'].str.contains() method I can get all the rows that match a given date, but I need the index.
The generic question is: how do we find the index of the first cell in a data frame column that matches a given string pattern?
The more specific question is: how do we find the index of the first cell that matches a given date, when the column contains date-times and the data frame is sorted in chronologically ascending order of Date Time, i.e.
2019-01-02 09:00:00 occurs at an earlier index than 2019-01-02 09:15:00, followed by 2019-01-03 09:00:00, and so on?
Thank you for any inputs
You can use next with iter to get the first index value that matches the condition; the default argument prevents a failure when nothing matches:
df = pd.DataFrame({'dates':pd.date_range(start='2018-01-01 20:00:00',
end='2018-01-02 02:00:00', freq='H')})
print (df)
dates
0 2018-01-01 20:00:00
1 2018-01-01 21:00:00
2 2018-01-01 22:00:00
3 2018-01-01 23:00:00
4 2018-01-02 00:00:00
5 2018-01-02 01:00:00
6 2018-01-02 02:00:00
date = '2018-01-02'
mask = df['dates'] >= date
idx = next(iter(mask.index[mask]), 'not exist')
print (idx)
4
date = '2018-01-08'
mask = df['dates'] >= date
idx = next(iter(mask.index[mask]), 'not exist')
print (idx)
not exist
If performance is important, see Efficiently return the index of the first value satisfying condition in array.
Yes, you can use .loc with a condition to slice the df, then use .iloc to take the first matching row and return its index.
import pandas as pd
df = pd.DataFrame({'time':pd.date_range(start='2018-01-01 00:00:00',end='2018-12-31 00:00:00', freq='H')}, index=None).reset_index(drop=True)
# then use conditions and .iloc to get the first instance
df.loc[df['time']>'2018-10-30 01:00:00'].iloc[[0,]].index[0]
# if you specify a coarser condition, for instance without time,
# it will also return the first instance
df.loc[df['time']>'2018-10-30'].iloc[[0,]].index[0]
I do not know if it is optimal, but it works:
(df['Date Time'].dt.strftime('%Y-%m-%d') == '2019-01-02').idxmax()
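One caveat with idxmax on a boolean mask is that it returns the first label even when nothing matches, because every value is then False. A small sketch of a guarded version follows; the helper name first_index_on and the sample rows are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date Time": pd.to_datetime(
            [
                "2019-01-01 09:00:00",
                "2019-01-02 09:00:00",
                "2019-01-02 09:15:00",
                "2019-01-03 09:00:00",
            ]
        )
    }
)


def first_index_on(frame, column, day):
    """Return the index label of the first row whose date equals `day`, or None."""
    # normalize() sets the time part to midnight, so only the date is compared.
    mask = frame[column].dt.normalize() == pd.Timestamp(day)
    return mask.idxmax() if mask.any() else None


print(first_index_on(df, "Date Time", "2019-01-02"))
print(first_index_on(df, "Date Time", "2019-02-01"))
```

Comparing normalized timestamps avoids formatting every value as a string, which strftime does.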

Most efficient way to break up a dataframe using multiple DateTimeIndexes

I have a dataframe which contains prices for a security each minute over a long period of time.
I would like to extract a subset of the prices, 1 per day between certain hours.
Here is an example of brute-forcing it (using hourly for brevity):
import datetime
import numpy
import pandas
dates = pandas.date_range('20180101', '20180103', freq='H')
prices = pandas.DataFrame(index=dates,
                          data=numpy.random.rand(len(dates)),
                          columns=['price'])
I now have a DateTimeIndex for the hours within each day I want to extract:
start = datetime.datetime(2018,1,1,8)
end = datetime.datetime(2018,1,1,17)
day1 = pandas.date_range(start, end, freq='H')
start = datetime.datetime(2018,1,2,9)
end = datetime.datetime(2018,1,2,13)
day2 = pandas.date_range(start, end, freq='H')
days = [ day1, day2 ]
I can then use prices.index.isin with each of my DateTimeIndexes to extract the relevant day's prices:
daily_prices = [ prices[prices.index.isin(d)] for d in days]
This works as expected:
daily_prices[0]
daily_prices[1]
The problem is that as the length of each selection DateTimeIndex increases, and the number of days I want to extract increases, my list-comprehension slows down to a crawl.
Since I know each selection DateTimeIndex is fully inclusive of the hours it encompasses, I tried using loc and the first and last element of each index in my list comprehension:
daily_prices = [ prices.loc[d[0]:d[-1]] for d in days]
Whilst a bit faster, it is still exceptionally slow when the number of days is very large
Is there a more efficient way to divide up a dataframe into begin and end time ranges like above?
If the hours are consistent from day to day as it seems like they might be, you can just filter the index, which should be pretty fast:
In [5]: prices.loc[prices.index.hour.isin(range(8,18))]
Out[5]:
price
2018-01-01 08:00:00 0.638051
2018-01-01 09:00:00 0.059258
2018-01-01 10:00:00 0.869144
2018-01-01 11:00:00 0.443970
2018-01-01 12:00:00 0.725146
2018-01-01 13:00:00 0.309600
2018-01-01 14:00:00 0.520718
2018-01-01 15:00:00 0.976284
2018-01-01 16:00:00 0.973313
2018-01-01 17:00:00 0.158488
2018-01-02 08:00:00 0.053680
2018-01-02 09:00:00 0.280477
2018-01-02 10:00:00 0.802826
2018-01-02 11:00:00 0.379837
2018-01-02 12:00:00 0.247583
....
EDIT: To your comment, working directly on the index and then doing a single lookup at the end will still probably be fastest even if it's not always consistent from day to day. Single day frames at the end will be easy with a groupby.
For example:
df = prices.loc[[i for i in prices.index
                 if (i.hour in range(8, 18) and i.day in range(1, 11))
                 or (i.hour in range(2, 4) and i.day in range(11, 32))]]
framelist = [frame for _, frame in df.groupby(df.index.date)]
This gives you a list of dataframes with one day per list element, including 8:00-17:00 for the first 10 days of each month and 2:00-3:00 for days 11-31.
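The same selection can be sketched with vectorized comparisons on the index instead of a Python-level list comprehension, which tends to matter on minute-level data. The sample range, the seed and the names mask, selected and frames are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2018-01-01 00:00", "2018-01-20 23:00", freq="h")
rng = np.random.default_rng(0)
prices = pd.DataFrame({"price": rng.random(len(dates))}, index=dates)

idx = prices.index
# Hours 8-17 on days 1-10, or hours 2-3 on days 11-31, as boolean arrays.
mask = ((idx.hour >= 8) & (idx.hour <= 17) & (idx.day <= 10)) | (
    (idx.hour >= 2) & (idx.hour <= 3) & (idx.day >= 11)
)
selected = prices[mask]

# One frame per calendar day, keyed by the date.
frames = {day: frame for day, frame in selected.groupby(selected.index.date)}
print(len(selected), len(frames))
```

A dict keyed by date is used here instead of a list so that a given day's frame can be looked up directly.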

Drop datetimes not within certain range from index

I have a DataFrame like this:
Date X
....
2014-01-02 07:00:00 16
2014-01-02 07:15:00 20
2014-01-02 07:30:00 21
2014-01-02 07:45:00 33
2014-01-02 08:00:00 22
....
2014-01-02 23:45:00 0
....
1)
So my "Date" column is a datetime and has a value for every 15 minutes of a day.
What I want is to remove ALL rows where the time is NOT between 08:00 and 18:00.
2)
Some days are missing in the data. How could I put the missing days into my dataframe and fill them with the value 0 for X?
My approach: create a new Series between two dates with a 15-minute frequency, and concat my X column with the newly created Series. Is that right?
Edit:
Problem with my second question:
# create new full DF without missing dates and reindex
full_range = pandas.date_range(start='2014-01-02', end='2017-11-14', freq='15min')
df = df.reindex(full_range, fill_value=0)
df.head()
Output:
Date X
2014-01-02 00:00:00 1970-01-01 0
2014-01-02 00:15:00 1970-01-01 0
2014-01-02 00:30:00 1970-01-01 0
2014-01-02 00:45:00 1970-01-01 0
2014-01-02 01:00:00 1970-01-01 0
That didn't work, as you can see.
The "Date" column is not an index, by the way; I need it as a column in my df.
And why did it use "1970-01-01"? 1970 as the year makes no sense to me.
What I want is to remove ALL Rows where the time is NOT between 08:00
and 18:00 o'clock.
Create a mask with datetime.time. Example:
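from datetime import time
import numpy as np
import pandas as pd
idx = pd.date_range('2014-01-02', freq='15min', periods=10000)
df = pd.DataFrame({'x': np.empty(idx.shape[0])}, index=idx)
t1 = time(8); t2 = time(18)
times = df.index.time
# strict comparisons drop the 08:00 and 18:00 rows themselves; use >= and <= to keep them
mask = (times > t1) & (times < t2)
df = df.loc[mask]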
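As an alternative to building the mask by hand, pandas has DataFrame.between_time, which works when the datetimes are the index. A short sketch on one day of made-up 15-minute data (the X values are placeholders):

```python
import pandas as pd

# One day of 15-minute rows with placeholder values.
idx = pd.date_range("2014-01-02", periods=96, freq="15min")
df = pd.DataFrame({"X": range(len(idx))}, index=idx)

# Both endpoints are included by default.
inside = df.between_time("08:00", "18:00")
print(inside.index[0], inside.index[-1], len(inside))
```

If "Date" is a regular column rather than the index, it would first need df.set_index("Date").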
Some days are missing in the data...how could I put the missing days
in my DataFrame and fill them with the value 0 as X?
Build a date range that doesn't have missing data with pd.date_range() (see above).
Call reindex() on df and specify fill_value=0.
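These two steps assume the datetimes are the index. When "Date" is an ordinary column, as in the question's edit, reindexing against the default integer index creates rows that are entirely new, and fill_value=0 in a datetime column is most likely what shows up as 1970-01-01 (0 is the start of the Unix epoch). A minimal sketch on made-up rows, moving the column into the index and back:

```python
import pandas as pd

# Toy frame with "Date" as a regular column and two missing quarter-hours.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2014-01-02 08:00", "2014-01-02 08:15", "2014-01-02 08:45"]
        ),
        "X": [16, 20, 33],
    }
)

full_range = pd.date_range("2014-01-02 08:00", "2014-01-02 09:00", freq="15min")

filled = (
    df.set_index("Date")                    # align on the timestamps
    .reindex(full_range, fill_value=0)      # add the missing rows with X = 0
    .rename_axis("Date")                    # restore the name lost by reindexing
    .reset_index()                          # turn "Date" back into a column
)
print(filled)
```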
Answering your questions in comments:
np.empty creates an uninitialized array: it allocates the memory without setting the values, so the contents are arbitrary. I was just using it to build some "example" data that is basically garbage. Here idx.shape is the shape of your index, a one-element tuple (length,). So np.empty(idx.shape[0]) creates a 1-D array with the same length as idx.
times = df.index.time creates a variable (a NumPy array) called times. df.index.time is the time for each element in the index of df. You can explore this yourself by just breaking the code down in pieces and experimenting with it on your own.
