I need to split a year into enumerated 20-minute chunks and then find the sequence number of the chunk that contains each of a set of randomly distributed timestamps in the year, for further processing.
I tried to use pandas for this, but I can't find a way to index a timestamp in a date_range:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from datetime import timedelta

if __name__ == '__main__':
    date_start = pd.to_datetime('2018-01-01')
    date_end = date_start + timedelta(days=365)
    index = pd.date_range(start=date_start, end=date_end, freq='20min')
    data = range(len(index))
    df = pd.DataFrame(data, index=index, columns=['A'])
    print(df)
    event_ts = pd.to_datetime('2018-10-14 02:17:43')
    # How to find the corresponding df['A'] for event_ts?
    # print(df.loc[event_ts])
Output:
A
2018-01-01 00:00:00 0
2018-01-01 00:20:00 1
2018-01-01 00:40:00 2
2018-01-01 01:00:00 3
2018-01-01 01:20:00 4
... ...
2018-12-31 22:40:00 26276
2018-12-31 23:00:00 26277
2018-12-31 23:20:00 26278
2018-12-31 23:40:00 26279
2019-01-01 00:00:00 26280
[26281 rows x 1 columns]
What is the best practice to do this in Python? I can imagine finding the range "by hand" by converting the date_range to integers and comparing them, but maybe there are some elegant pandas/Python-style ways to do it?
First of all, I worked with a small interval, one week:
date_end = date_start + timedelta(days=7)
Then I followed your steps and got a portion of your dataframe.
My event_ts is this:
event_ts = pd.to_datetime('2018-01-04 02:17:43')
And I chose to reset the index, to get a dataframe that is easy to manipulate:
df = df.reset_index()
With this code I found the last chunk start that event_ts belongs to:
run = []
for i in df['index']:
    if i <= event_ts:
        run.append(i)
print(max(run))
# 2018-01-04 02:00:00
or:
top = max(run)
Finally:
df.loc[df['index'] == top].index[0]
222
So event_ts belongs to index 222, i.e. df.loc[222].
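A vectorized alternative (my sketch, not part of the answer above), working on the original df from the question, whose DatetimeIndex is sorted: searchsorted finds the containing chunk by binary search, and since the chunks are a fixed 20 minutes apart, plain timedelta arithmetic works as well.
# binary search on the sorted DatetimeIndex
pos = df.index.searchsorted(event_ts, side='right') - 1
print(df['A'].iloc[pos])  # 222 for the one-week example above

# equivalent, exploiting the fixed 20-minute spacing
pos = (event_ts - df.index[0]) // pd.Timedelta(minutes=20)
print(df['A'].iloc[pos])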
I'd like to get a time series with a fixed set of dates in the index. I thought that resample with a freq and origin='epoch' would do the trick. It seems that I'm using this method in the wrong way. Here's an example showing that origin='epoch' does not seem to work.
import pandas as pd

dates = pd.date_range('2022-01-01', '2022-02-01', freq="1D")
freq = '2W-MON'
vals = range(len(dates))
print(
    pd.Series(vals, index=dates)
    .resample(freq, origin="epoch", convention='end')
    .sum()
    .to_markdown()
)
|                     |   0 |
|:--------------------|----:|
| 2022-01-03 00:00:00 |   3 |
| 2022-01-17 00:00:00 | 133 |
| 2022-01-31 00:00:00 | 329 |
| 2022-02-14 00:00:00 |  31 |
If I change the first date in the series to anything after "2022-01-03", I get a different result.
dates = pd.date_range('2022-01-04', '2022-02-01', freq="1D")
freq = '2W-MON'
vals = range(len(dates))
print(
    pd.Series(vals, index=dates)
    .resample(freq, origin="epoch", convention='end')
    .sum()
    .to_markdown()
)
|                     |   0 |
|:--------------------|----:|
| 2022-01-10 00:00:00 |  21 |
| 2022-01-24 00:00:00 | 189 |
| 2022-02-07 00:00:00 | 196 |
I'd expect that with freq='2W-MON' and origin='epoch', both examples would end up with the same bin dates (so both should have either 2022-01-10 or 2022-01-03).
Is there an elegant way of forcing pandas to actually use origin="epoch"?
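One workaround sketch (my own, under the assumption that stable Monday-anchored 2-week bins are the actual goal, rather than resample itself): compute the bin edges by hand from a fixed Monday near the Unix epoch, so they cannot drift with the first timestamp. Note the bins below are labeled by their left edge, unlike '2W-MON' with convention='end'.
import pandas as pd

dates = pd.date_range('2022-01-04', '2022-02-01', freq='1D')
s = pd.Series(range(len(dates)), index=dates)

# 1969-12-29 is the Monday on or before the Unix epoch (1970-01-01, a Thursday),
# so anchoring here gives Monday-aligned bins that never depend on the data.
anchor = pd.Timestamp('1969-12-29')
n = (s.index - anchor) // pd.Timedelta(weeks=2)  # integer bin number per timestamp
bins = anchor + n * pd.Timedelta(weeks=2)        # left edge of each 2-week bin
print(s.groupby(bins).sum())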
I have a large data set from which I'm trying to produce a time series using ARIMA. However,
some of the rows in the date column share the same date.
The dates were entered this way because the exact date of the event was not known, so unknown dates were entered as the first of that month (biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not all the same. I have code to randomise a date within a range, but I cannot find a way to replace the dates in the dataframe.
import random
import time

def str_time_prop(start, end, time_format, prop):
    # Interpolate between start and end: prop is a fraction in [0, 1).
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))

def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)

# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
I use the code above to generate a random date within a date range, but I'm struggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd

df = pd.DataFrame(
    {
        "date": [
            "2016-01-01",
            "2015-01-01",
            "2013-01-01",
            "2014-01-01",
            "2017-01-01",
            "2011-01-01",
        ],
        "value": [10035, 5397, 4567, 4343, 3981, 2049],
    }
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
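If the goal is to keep each row's year and month and only randomize the day (as the question asks), here is a hedged variant of the above that reuses the asker's random_date helper; randomize_day_within_month is my own hypothetical name, not from the question.
import calendar

def randomize_day_within_month(date_str):
    # Keep the row's year and month; draw a uniform day inside that month.
    d = pd.to_datetime(date_str)
    last = calendar.monthrange(d.year, d.month)[1]  # last day of that month
    return random_date(f"{d.year}-{d.month:02d}-01",
                       f"{d.year}-{d.month:02d}-{last}",
                       random.random())

df["date"] = df["date"].apply(randomize_day_within_month)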
Since the date column has multiple rows with the same date, and you want to randomize the dates within the month, you can group by year, month, and whether the day equals 1, so that only the rows with day 1 get a new day. Then use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np

np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])

# group rows with the same year and month, separating those whose day equals 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day == 1])

FIRST_DAY = 2  # set for the desired range
df_list = []
for (year, month, is_first), g in grouped:
    if not is_first:
        df_list.append(g)  # known (exact) dates are kept unchanged
        continue
    last_day = calendar.monthrange(year, month)[1]  # last day for this month and year
    g['New_Date'] = g['date'].apply(
        lambda d: d.replace(day=np.random.randint(FIRST_DAY, last_day + 1))
    )
    df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12
I have a dataframe with a datetime column. I want to group by the time component only and aggregate, e.g. by taking the mean.
I know that I can use pd.Grouper to group by date AND time, but it doesn't work on time only.
Say we have the following dataframe:
import numpy as np
import pandas as pd
drange = pd.date_range('2019-08-01 00:00', '2019-08-12 12:00', freq='1T')
time = drange.time
c0 = np.random.rand(len(drange))
c1 = np.random.rand(len(drange))
df = pd.DataFrame(dict(drange=drange, time=time, c0=c0, c1=c1))
print(df.head())
drange time c0 c1
0 2019-08-01 00:00:00 00:00:00 0.031946 0.159739
1 2019-08-01 00:01:00 00:01:00 0.809171 0.681942
2 2019-08-01 00:02:00 00:02:00 0.036720 0.133443
3 2019-08-01 00:03:00 00:03:00 0.650522 0.409797
4 2019-08-01 00:04:00 00:04:00 0.239262 0.814565
In this case, the following throws a TypeError:
grouper = pd.Grouper(key='time', freq='5T')
grouped = df.groupby(grouper).mean()
I could set key='drange' to group by date and time and then:
Reset the index
Transform the new column to float
Bin with pd.cut
Cast back to time
Finally group-by and then aggregate
... But I wonder whether there is a cleaner way to achieve the same results.
Series.dt.time/DatetimeIndex.time returns the time as datetime.time objects. This isn't great because pandas works best with timedelta64, so your 'time' column is cast to object, losing all datetime functionality.
You can subtract off the normalized date to obtain the time as a timedelta so you can continue to use the datetime tools of pandas. You can floor this to group.
s = (df.drange - df.drange.dt.normalize()).dt.floor('5T')
df.groupby(s).mean()
c0 c1
drange
00:00:00 0.436971 0.530201
00:05:00 0.441387 0.518831
00:10:00 0.465008 0.478130
... ... ...
23:45:00 0.523233 0.515991
23:50:00 0.468695 0.434240
23:55:00 0.569989 0.510291
Alternatively, if you feel unsure of floor, this gets identical output up to the index name:
df['time'] = (df.drange - df.drange.dt.normalize()) # timedelta64[ns]
df.groupby(pd.Grouper(key='time', freq='5T')).mean()
When you use DataFrame.groupby, you can pass a Series as an argument. Moreover, if your Series holds datetimes, you can use the .dt accessor to get at the date's properties. In your case, df['drange'].dt.hour or df['drange'].dt.time should do it.
# df['drange']=pd.to_datetime(df['drange'])
df.groupby(df['drange'].dt.hour).agg(...)
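One caveat on dt.hour: it gives whole-hour buckets. For the 5-minute bins in the question, a hedged combination (my suggestion, not part of the original answer) is to floor the datetimes first and then take the time component:
# Floor to 5-minute marks, then extract the time of day, so the same
# 5-minute slot on different days falls into one group.
df.groupby(df['drange'].dt.floor('5T').dt.time)[['c0', 'c1']].mean()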
I have a DataFrame like this:
Date X
....
2014-01-02 07:00:00 16
2014-01-02 07:15:00 20
2014-01-02 07:30:00 21
2014-01-02 07:45:00 33
2014-01-02 08:00:00 22
....
2014-01-02 23:45:00 0
....
1)
So my "Date" Column is a datetime and has values vor every 15min of a day.
What i want is to remove ALL Rows where the time is NOT between 08:00 and 18:00 o'clock.
2)
Some days are missing in the datas...how could i put the missing days in my dataframe and fill them with the value 0 as X.
My approach: Create a new Series between two Dates and set 15min as frequenz and concat my X Column with the new created Series. Is that right?
Edit:
Problem with my second question:
# create a new full df without missing dates and reindex
full_range = pandas.date_range(start='2014-01-02', end='2017-11-14', freq='15min')
df = df.reindex(full_range,fill_value=0)
df.head()
Output:
Date X
2014-01-02 00:00:00 1970-01-01 0
2014-01-02 00:15:00 1970-01-01 0
2014-01-02 00:30:00 1970-01-01 0
2014-01-02 00:45:00 1970-01-01 0
2014-01-02 01:00:00 1970-01-01 0
That didn't work, as you can see.
The "Date" column is not an index, btw. I need it as a column in my df.
And why did it take "1970-01-01"? 1970 as the year makes no sense to me.
What I want is to remove ALL Rows where the time is NOT between 08:00
and 18:00 o'clock.
Create a mask with datetime.time. Example:
from datetime import time

import numpy as np
import pandas as pd

idx = pd.date_range('2014-01-02', freq='15min', periods=10000)
df = pd.DataFrame({'x': np.empty(idx.shape[0])}, index=idx)

t1 = time(8); t2 = time(18)
times = df.index.time
mask = (times > t1) & (times < t2)
df = df.loc[mask]
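As a side note (my addition, not part of this answer): with the datetimes in the index, pandas' built-in DataFrame.between_time covers the same case in one call; unlike the strict comparisons above, it includes the boundary times by default.
df_between = df.between_time('08:00', '18:00')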
Some days are missing in the data...how could I put the missing days
in my DataFrame and fill them with the value 0 as X?
Build a date range that doesn't have missing data with pd.date_range() (see above).
Call reindex() on df and specify fill_value=0 (see the sketch below).
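A short sketch of those two steps (my assumption: the 'Date' values from the question, moved into the index first). This also answers the 1970-01-01 mystery from the edit: reindexing while 'Date' was still an ordinary column filled the new rows' datetime cells with fill_value=0, which pandas renders as the Unix epoch.
df = df.set_index('Date')
full_range = pd.date_range(start='2014-01-02', end='2017-11-14', freq='15min')
df = df.reindex(full_range, fill_value=0)
df = df.rename_axis('Date').reset_index()  # only if 'Date' is needed as a column again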
Answering your questions in comments:
np.empty creates an empty array. I was just using it to build some "example" data that is basically garbage. Here idx.shape is the shape of your index, a tuple (for a 1-d index, just (length,)). So np.empty(idx.shape[0]) creates an empty 1d array with the same length as idx.
times = df.index.time creates a variable (a NumPy array) called times. df.index.time is the time for each element in the index of df. You can explore this yourself by just breaking the code down in pieces and experimenting with it on your own.
I have a time series that I resampled into this dataframe df.
My data runs from 6 June to 28 June. I want to extend it from 1 June to 30 June. The count column should have the value 0 only in the extended period, keeping my real values from the 6th to the 28th.
Out[123]:
count
Timestamp
2009-06-07 02:00:00 1
2009-06-07 03:00:00 0
2009-06-07 04:00:00 0
2009-06-07 05:00:00 0
2009-06-07 06:00:00 0
I need to make the
start date: 2009-06-01 00:00:00
end date: 2009-06-30 23:00:00
so the data would look something like this:
count
Timestamp
2009-06-01 01:00:00 0
2009-06-01 02:00:00 0
2009-06-01 03:00:00 0
Is there an effective way to perform this? The only way I can think of is not that effective. I have been trying this since yesterday. Please help.
import numpy
import pandas as pd

index = pd.date_range('2009-06-01 00:00:00', '2009-06-30 23:00:00', freq='H')
df = pd.DataFrame(numpy.zeros((len(index), 1)), index=index)
df.columns = ['zeros']
result = pd.concat([df2, df])
result1 = pd.concat([df, result])
result1 = result1.fillna(0)
del result1['zeros']
You can create a new index with the desired start and end day/times, resample the time series data and aggregate by count, then set the index to the new index.
import pandas as pd

# create the index with the start and end times you want
t_index = pd.DatetimeIndex(pd.date_range(start='2009-06-01', end='2009-06-30 23:00:00', freq="1h"))

# create the data frame
df = pd.DataFrame([['2009-06-07 02:07:42'],
                   ['2009-06-11 17:25:28'],
                   ['2009-06-11 17:50:42'],
                   ['2009-06-11 17:59:18']], columns=['daytime'])
df['daytime'] = pd.to_datetime(df['daytime'])

# resample the data to 1 hour, aggregate by counts,
# then reindex onto the full index and fill the NaNs with 0
df2 = df.resample('1h', on='daytime').count().reindex(t_index).fillna(0)
Note: pd.DatetimeIndex(start=..., end=..., freq=...) no longer works with those arguments; it raises __new__() got an unexpected keyword argument 'start'. Build the index with pd.date_range instead, as the answer above does.
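For reference, a minimal sketch (my note, not part of the answer): pd.date_range already returns a DatetimeIndex, so the wrapper can be dropped entirely.
t_index = pd.date_range(start='2009-06-01', end='2009-06-30 23:00:00', freq='1h')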