Is there a way to convert daily predictions to hourly predictions in pandas (Python)?

#df_test.head(3)
ID Datetime
0 18288 26-09-2014 00:00
1 18289 26-09-2014 01:00
2 18290 26-09-2014 02:00
#df_test['Datetime'] = pd.to_datetime(df_test['Datetime'])
#df_test = df_test.set_index('Datetime')
Datetime ID
2014-09-26 00:00:00 18288
2014-09-26 01:00:00 18289
2014-09-26 02:00:00 18290
# Converting to daily mean
df_test_daily = df_test.resample('D').mean()
# model.predict(df_test_daily)
After making the traffic count predictions on the daily data, how can we convert them to hourly predictions?

Just change the 'D' to 'H':
# Converting to hourly mean
df_test_hourly = df_test.resample('H').mean()
More info here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
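Note that resample('H') on data that is already daily will upsample and leave the new hourly slots as NaN. If the goal is to spread existing daily predictions across the hours of each day, here is a minimal sketch (with made-up prediction values) that fills those slots:
import pandas as pd

# Hypothetical daily predictions indexed by date
daily_pred = pd.Series(
    [120.0, 150.0, 130.0],
    index=pd.date_range('2014-09-26', periods=3, freq='D'),
)

# Upsample to hourly: repeat each day's value across its hours...
hourly_pred = daily_pred.resample('H').ffill()
# ...or interpolate smoothly between the daily values instead
hourly_smooth = daily_pred.resample('H').interpolate()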

Related

Python Dataframe extract quarterly data and export to a quarterly folder

I have a summary dataframe. I want to extract quarterly data and export it to quarterly folders that already exist.
My code:
ad = pd.DataFrame({"sensor_value":[10,20]},index=['2019-01-01 05:00:00','2019-06-01 05:00:00'])
ad =
sensor_value
2019-01-01 05:00:00 10
2019-06-01 05:00:00 20
ad.index = pd.to_datetime(ad.index,format = '%Y-%m-%d %H:%M:%S')
# create quarter column
ad['quarter'] = ad.index.to_period('Q')
ad =
sensor_value quarter
2019-01-01 05:00:00 10 2019Q1
2019-06-01 05:00:00 20 2019Q2
# quarters list
qt_list = ad['quarter'].unique()
# extract data for quarter and store it in the corresponding folder that already exist
fold_location = 'C:\\Data'
for i in qt_list:
    auxdf = ad[ad['quarter'] == '%s' % i]
    save_loc = fold_location + '\\' + str(i)
    auxdf.to_csv(save_loc + '\\' + 'Sensor_1minData_%s.csv' % i)
Is there a better way of doing it?
Thanks
You can use groupby with something like:
for quarter, df in ad.groupby('quarter'):
    df.to_csv(f"C:\\Data\\{quarter}\\Sensor_1minData_{quarter}.csv")
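If some of the quarterly folders might not exist yet (the question says they already do, so this is just a defensive variant), you can create them on the fly:
import os

for quarter, df in ad.groupby('quarter'):
    save_dir = os.path.join('C:\\Data', str(quarter))
    os.makedirs(save_dir, exist_ok=True)  # no-op if the folder already exists
    df.to_csv(os.path.join(save_dir, f'Sensor_1minData_{quarter}.csv'))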

How to split a datetime column efficiently to have timezone in a new column?

So, I am creating a year of time series data, taking DST into consideration, as follows:
import pandas as pd
sd = '2020-01-01'
ed = '2021-01-01'
df = pd.date_range(sd, ed, freq='0.25H', tz='Europe/Berlin')
df = df.to_frame().reset_index(drop=True)
df.rename(columns={0:'dates'}, inplace=True)
The dates column also contains the UTC offset (+01 for CET and +02 for CEST). Now, I want to split the dates column so that it holds only the timestamp in the format YYYY-MM-DD HH:MM, and a new column named tz holds the offset as a string, either +01 or +02.
I did:
df['dates'] = df['dates'].apply(lambda t: str(t))
df['tz'] = df['dates'].str.split('+').str[1]
df['tz'] = df['tz'].str.split(':').str[0]
df['dates'] = pd.to_datetime(df['dates'])
df['dates'] = df['dates'].apply(lambda t: t.strftime('%Y-%m-%d %H:%M'))
and this gives me the output as follows:
dates tz
2020-01-01 00:00 01
2020-01-01 00:15 01
2020-01-01 00:30 01
2020-01-01 00:45 01
2020-01-01 01:00 01
2020-01-01 01:15 01
2020-01-01 01:30 01
Now, I need help with a couple of things:
As you can see, the values in the tz column are only 01. How can I include the '+' sign in the tz column while splitting?
I know I can do it by doing:
df['tz'] = '+' + df['tz'].str.split(':').str[0]
But it seems very messy.
Is there a more efficient way to split the column, after creating the original time series with pd.date_range(sd, ed, freq='0.25H', tz='Europe/Berlin'), into the desired output?
Desired output
dates tz
2020-01-01 00:00 +01
2020-01-01 00:15 +01
2020-01-01 00:30 +01
2020-01-01 00:45 +01
2020-01-01 01:00 +01
2020-01-01 01:15 +01
2020-01-01 01:30 +01
In general, I'd advise against storing datetime-type data as strings, especially in a non-standard format. However, if you insist, you can do:
# from the original dataframe
df['tz'] = df['dates'].astype(str).str.extract(r'(\+\d{2})')[0]
df['dates'] = df['dates'].dt.strftime('%Y-%m-%d %H:%M')
Or use a single extract with a more complex regex:
df['tz'] = ''
df[['dates', 'tz']] = df['dates'].astype(str).str.extract(r'([\d\- \:]+):\d{2}(.+):')
Output (head):
dates tz
0 2020-01-01 00:00 +01
1 2020-01-01 00:15 +01
2 2020-01-01 00:30 +01
3 2020-01-01 00:45 +01
4 2020-01-01 01:00 +01
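As a side note, since the dates column is timezone-aware before any conversion, the offset can also be pulled out directly with strftime's %z directive, avoiding string splitting entirely (a sketch against the original, unconverted dataframe):
# '%z' renders the UTC offset as e.g. '+0100'; keep the sign and the hour part
df['tz'] = df['dates'].dt.strftime('%z').str[:3]
df['dates'] = df['dates'].dt.strftime('%Y-%m-%d %H:%M')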

How to calculate a mean of measurements taken at the same time (n-hours window) on different days in pandas dataframe?

I have a dataset with measurements acquired roughly every 2 hours over a week. I would like to calculate the mean of measurements taken at the same time on different days. For example, I want to calculate the mean of every measurement taken between 12:00 and 13:59.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# generating test dataframe
date_today = datetime.now()
time_of_taken_measurment = pd.date_range(date_today, date_today + timedelta(72), freq='2H20MIN')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(time_of_taken_measurment))
df = pd.DataFrame({'measurementTimestamp': time_of_taken_measurment, 'measurment': data})
df = df.set_index('measurementTimestamp')
# Calculating the mean for measurements taken in the same hour
hourly_average = df.groupby([df.index.hour]).mean()
hourly_average
hourly_average
The code above gives me this output:
0 47.967742
1 43.354839
2 46.935484
.....
22 42.833333
23 52.741935
I would like to have a result like this:
0 mean0
2 mean1
4 mean2
.....
20 mean10
22 mean11
I was trying to solve my problem using the rolling_mean function, but I could not find a way to apply it to my static case.
Use the built-in floor functionality of DatetimeIndex, which lets you easily create 2-hour time bins.
df.groupby(df.index.floor('2H').time).mean()
Output:
measurment
00:00:00 51.516129
02:00:00 54.868852
04:00:00 52.935484
06:00:00 43.177419
08:00:00 43.903226
10:00:00 55.048387
12:00:00 50.639344
14:00:00 48.870968
16:00:00 43.967742
18:00:00 49.225806
20:00:00 43.774194
22:00:00 50.590164
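The bin width generalizes directly: for, say, a 6-hour window, just change the floor frequency (a sketch on the same dataframe):
# Group by the start of each 6-hour window instead of each 2-hour window
six_hour_average = df.groupby(df.index.floor('6H').time).mean()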

Hourly average for each week/month in dataframe (moving average)

I have a dataframe with a full year of data, with a value for each second:
YYYY-MO-DD HH-MI-SS_SSS TEMPERATURE (C)
2016-09-30 23:59:55.923 28.63
2016-09-30 23:59:56.924 28.61
2016-09-30 23:59:57.923 28.63
... ...
2017-05-30 23:59:57.923 30.02
I want to create a new dataframe that takes each week or month of values and averages them over the same hour of each day (a kind of moving average, but per hour).
So the result for the month case will be like this:
Date TEMPERATURE (C)
2016-09 00:00:00 28.63
2016-09 01:00:00 27.53
2016-09 02:00:00 27.44
...
2016-10 00:00:00 28.61
... ...
I'm aware of the fact that I can split the df into 12 dfs, one per month, and use:
hour = pd.to_timedelta(df['YYYY-MO-DD HH-MI-SS_SSS'].dt.hour, unit='H')
df2 = df.groupby(hour).mean()
But I'm searching for a better and faster way.
Thanks !!
Here's an alternate method of converting your date and time columns:
df['datetime'] = pd.to_datetime(df['YYYY-MO-DD'] + ' ' + df['HH-MI-SS_SSS'])
Additionally, you could group by both week and hour to form a MultiIndex dataframe (instead of creating and managing 12 dfs):
df.groupby([df.datetime.dt.weekofyear, df.datetime.dt.hour]).mean()
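For the monthly case shown in the desired output, the same pattern works with a month key instead of the week number (a sketch; note that newer pandas versions replace dt.weekofyear with dt.isocalendar().week):
# Average per (month, hour-of-day) pair
df.groupby([df.datetime.dt.to_period('M'), df.datetime.dt.hour]).mean()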

Hours, Date, Day Count Calculation

I have a huge dataset with timestamps spanning several days; the datetimes are in UNIX (epoch) format. The data are login logs.
The code is supposed to group logs by start and end time and provide log counts and unique-ID counts.
I am trying to get stats like:
total log count per hour and unique login IDs per hour;
log count for a chosen window, i.e. 24 hrs, 12 hrs, 6 hrs, 1 hr, etc., plus day-of-the-week breakdowns and similar options.
I am able to split the data by start and end hours, but I am not able to get the counts of logs and unique IDs.
Code:
from datetime import datetime, time

# This splits data from start to end time
start = time(8, 0, 0)
end = time(20, 0, 0)
with open('input', 'r') as infile, open('output', 'w') as outfile:
    for row in infile:
        col = row.strip().split(',')  # the input is comma-separated (see format below)
        t1 = datetime.fromtimestamp(float(col[1])).time()  # StartTime
        t2 = datetime.fromtimestamp(float(col[2])).time()  # StopTime
        print(t1 >= start and t2 <= end)
Input data format: the data has no header row, but the fields are listed below. The number of days covered is not known in advance.
UserID, StartTime, StopTime, GPS1, GPS2
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Expected output (example):
StartTime, EndTime, Day, LogCount, UniqueIDCount
00:00:00, 01:00:00, Mon, 349, 30
StartTime and EndTime are in human-readable format.
Separating the data by a time range is already achieved; now I am trying to round off the times and calculate the counts of logs and unique IDs. A solution with pandas is also welcome.
Edit 1: more details.
StartTime --> EndTime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03
But that falls between 5:00:00 --> 6:00:00, so the count of all logs in that time range is what I am trying to find. Similarly for other ranges:
5:00:00 --> 6:00:00 Hourly Count
00:00:00 --> 6:00:00 Every 6 hours
00:00:00 --> 12:00:00 Every 12 hours
5 Jan 2004, Mon --> count
6 Jan 2004, Tue --> Count
And so on. I am looking for a generic program where I can change the time/hours range as needed.
Unfortunately I couldn't find an elegant solution.
Here is my attempt:
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv(fn, header=None, names=cols)
# sum and difference of the endpoints, used in the interval-overlap test below
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H'  # 1 hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.Timestamp(1970, 1, 1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60 * 60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # I've slightly simplified the calculations of m and d
    # by getting rid of the division by 2,
    # because the common terms can be eliminated
    u = df[np.abs(df.m - 2 * row.start - interval) < df.d + interval].UserID
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
print(r[r.LogCount > 0])
PS: the fewer periods there are in the reporting DF r, the faster it will run, so you may want to drop rows (times) if you know beforehand that those timeframes won't contain any data (for example weekends, holidays, etc.).
Result:
start LogCount UniqueIDCount Day StartTime EndTime
2004-01-05 00:00:00 1073260800 24 15 Mon 00:00:00 01:00:00
2004-01-05 01:00:00 1073264400 5 5 Mon 01:00:00 02:00:00
2004-01-05 02:00:00 1073268000 3 3 Mon 02:00:00 03:00:00
2004-01-05 03:00:00 1073271600 3 3 Mon 03:00:00 04:00:00
2004-01-05 04:00:00 1073275200 2 2 Mon 04:00:00 05:00:00
2004-01-06 12:00:00 1073390400 22 12 Tue 12:00:00 13:00:00
2004-01-06 13:00:00 1073394000 3 2 Tue 13:00:00 14:00:00
2004-01-06 14:00:00 1073397600 3 2 Tue 14:00:00 15:00:00
2004-01-06 15:00:00 1073401200 3 2 Tue 15:00:00 16:00:00
2004-01-10 16:00:00 1073750400 20 11 Sat 16:00:00 17:00:00
2004-01-14 23:00:00 1074121200 218 69 Wed 23:00:00 00:00:00
2004-01-15 00:00:00 1074124800 12 11 Thu 00:00:00 01:00:00
2004-01-15 01:00:00 1074128400 1 1 Thu 01:00:00 02:00:00
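To report over a different window (6 hrs, 12 hrs, 24 hrs, ...), only the frequency and the interval length in the code above need to change; a sketch, assuming the same setup:
hours = 6  # window width: 1, 6, 12, 24, ...
freq = '%dH' % hours
interval = hours * 60 * 60 - 1  # window in seconds, minus one second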
