Upsample time series with weather data correctly - python

I have a dataset that holds weather data for each hour of the day, from the 1st to the 20th of each month, throughout a year; the last 10 days of each month (with their hours) are missing.
The weather columns are:
(temperature - humidity - wind_speed - visibility - dew_temperature - solar_radiation - rainfall - snowfall)
I want to upsample the dataset as a time series to fill in the missing days, but I face many issues due to the changing climate over the year.
Here is what I have tried so far:
def get_hour_month_mean(data, date, hour, max_id):
    return {
        'ID': max_id,
        'temperature': data['temperature'].mean(),
        'humidity': data['humidity'].mean(),
        'date': date,
        'hour': hour,
        'wind_speed': data['wind_speed'].mean(),
        'visibility': data['visibility'].mean(),
        'dew_temperature': data['dew_temperature'].mean(),
        'solar_radiation': data['solar_radiation'].mean(),
        'rainfall': data['rainfall'].mean(),
        'count': data['count'].mean() if str(date.date()) not in seoul_not_func else 0,
        'snowfall': data['snowfall'].mean(),
        'season': data['season'].mode()[0],
        'is_holiday': 'No Holiday' if str(date.date()) not in seoul_p_holidays_17_18 else 'Holiday',
        'functional_day': 'Yes' if str(date.date()) not in seoul_not_func else 'No',
    }
def upsample_data_with_missing_dates(data):
    data_range = pd.date_range(start="2017-12-20", end="2018-11-30", freq='D')
    missing_range = data_range.difference(data['date'])  # was df['date']: use the argument, not a global
    hour_range = range(0, 24)
    max_id = data['ID'].max()
    data_copy = data.copy()
    new_rows = []
    for date in missing_range:
        for hour in hour_range:
            max_id += 1
            year = date.year  # was data_copy.year, i.e. the whole column
            month = date.month
            # The missing days are at the end of each month, so borrow the first
            # days of the following month (November wraps back to the December
            # that opens the dataset).
            if date.month == 11:
                year -= 1
                month = 12
            else:
                month += 1
            month_mask = ((data_copy['year'] == year) &
                          (data_copy['month'] == month) &
                          (data_copy['hour'] == hour) &
                          (data_copy['day'].isin([1, 2])))
            data_filter = data_copy[month_mask]
            new_rows.append(get_hour_month_mean(data_filter, date, hour, max_id))
    # DataFrame.append is deprecated; build all new rows once and concatenate
    return pd.concat([data, pd.DataFrame(new_rows)], ignore_index=True)
Any ideas on the best way to get the values of the missing days, given that I have the previous 20 days and the next 20 days?

There are in fact many ways to deal with missing time series values.
You already tried the traditional one, imputing the gaps with mean values. The drawback of this method is the bias it introduces when so many values are missing.
You can also try a genetic algorithm (GA), support vector regression (SVR), or autoregressive (AR) and moving-average (MA) models for time series imputation and modeling. These methods forecast and/or impute the series, avoiding the bias problem of the traditional mean-based approach.
(Keep in mind that you have a multivariate time series.)
Here are some resources you can use:
A Survey on Deep Learning Approaches
time.series.missing-values-in-time-series-in-python
Interpolation in Python to fill Missing Values
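As a minimal concrete starting point, here is a sketch of time-based interpolation with pandas. It is an assumption-laden sketch, not a drop-in solution: it assumes data has the 'date' and 'hour' columns from the question and interpolates only the numeric weather columns.

import pandas as pd

# Index the frame by real timestamps so the missing days appear as NaN rows
data = data.set_index(pd.to_datetime(data['date']) + pd.to_timedelta(data['hour'], unit='h'))
full_index = pd.date_range('2017-12-20 00:00', '2018-11-30 23:00', freq='H')
data = data.reindex(full_index)

# method='time' weights the interpolation by the actual gap length;
# limit_direction='both' also fills any leading/trailing gaps
num_cols = ['temperature', 'humidity', 'wind_speed', 'visibility',
            'dew_temperature', 'solar_radiation', 'rainfall', 'snowfall']
data[num_cols] = data[num_cols].interpolate(method='time', limit_direction='both')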

Related

How to calculate the icclim indicator tn10p over a period < 365 days of the year

The goal is to calculate the climate indicator tn10p (percentage of days when Tmin < 10th percentile) based on the icclim package (link). Alternatively, I tried the same indicator from the xclim package (here). I want to calculate the indicator for a specific time period, e.g. '1960-12-01' to '1961-01-31', which can span two different years and is <= 12 months.
1 - open xarray dataset (for every 3 hours)
t2m = xa.open_dataset('filepath.nc', decode_cf = True, decode_coords = "all").sel(time=slice('1960-12-01', '1961-01-31'))
2 - calculate minimum daily temperature values
t2m_min = t2m.t2m.resample(time='1D').min(keep_attrs = True)
3.1 - With Icclim:
icclim_tn10p = icclim._generated_api.tn10p(in_files=t2m_min, slice_mode=['season',([12,1])])
3.2 - With xClim:
t2m_min_q10 = percentile_doy(arr = t2m_min, window=5, per=10).sel(percentiles=10)
xclim_tn10p = xclim.indicators.atmos.tn10p(tasmin = t2m_min, t10 = t2m_min_q10)
In both cases, 3.1 and 3.2, I get the following ValueError:
ValueError: conflicting sizes for dimension 'dayofyear': length 61 on <this-array> and length 365 on {'longitude': 'longitude', 'latitude': 'latitude', 'dayofyear': 'dayofyear', 'percentiles': 'percentiles'}
I believe that the problem is the percentile_doy function (link), which only seems to work with 365 or 366 calendar days. Any suggestions on how to solve this?
It seems to be related to this xclim issue.
The following way works with the icclim library (a sketch follows):
Compute tn10p for each year with slice_mode='month'
Select the time range from the output array
However, this workaround comes at a performance cost.
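A minimal sketch of that workaround (assuming icclim exposes the generated tn10p function at the package top level, as the _generated_api module in the question suggests):

import icclim

# 1. Compute tn10p with monthly slicing over the full years involved
tn10p_monthly = icclim.tn10p(in_files=t2m_min, slice_mode='month')

# 2. Select the sub-year period of interest from the output
tn10p_dec_jan = tn10p_monthly.sel(time=slice('1960-12-01', '1961-01-31'))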

Pandas Dataframes: Addition of float to column value based on if condition

Relative newbie with Python and Pandas, finally admitting defeat at not being able to figure this out myself. I have a pandas DataFrame from our energy supplier's API; each row is a 30-min interval showing the wholesale energy cost in p/kWh 'value_exc_vat', the solar output for the house 'export', and a datetime stamp 'datetime'.
| index |'value_exc_vat'|'datetime'|'export'|'hour'|'export_rate'|'export_rate_var'|
'hour' is taken from 'datetime' for each row, e.g. 13, 14, 15, 16, etc.
To calculate the price/kWh we are paid, I need to compute
0.97 x 'value_exc_vat' + peak_rate_uplift
where peak_rate_uplift is only applied during the hours 16-19 inclusive.
I've tried just about every method I can think of, but I can't get this to work.
peak_rate = [16,17,18,19]
for hour in df['hour']:
    if hour == peak_rate:
        df['export_rate_var'] = (df['export_rate'] + peak_rate_uplift)
    else:
        df['export_rate_var'] = df['export_rate']
Printing the output from the if statement, I can see that 'hour' is being matched for the correct values, but the remainder of the statement doesn't then add the peak_rate_uplift as I would expect.
Any advice or help on how to apply the addition to the selected rows would be appreciated; it feels like it should be something simple, but I've been at this for 3 days now...
You could use:
peak_rate = [16,17,18,19]
df['export_rate_var'] = (df['export_rate'] + df.hour.isin(peak_rate) * peak_rate_uplift)
Here df.hour.isin(peak_rate) returns a boolean Series. Multiplying it by the number peak_rate_uplift gives a Series that is 0 where the hour is not in the peak-rate hours and peak_rate_uplift where it is.
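A quick toy example, with hypothetical values, showing the boolean-multiplication trick:

import pandas as pd

df = pd.DataFrame({'hour': [15, 16, 17, 20],
                   'export_rate': [10.0, 10.0, 10.0, 10.0]})
peak_rate_uplift = 2.5
df['export_rate_var'] = df['export_rate'] + df['hour'].isin([16, 17, 18, 19]) * peak_rate_uplift
# hours 15 and 20 keep 10.0; hours 16 and 17 become 12.5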
Does this work:
peak_rate = [16,17,18,19]
for i in range(len(df)):
    # a scalar has no .isin(), so test membership with `in`,
    # and assign row-by-row instead of overwriting the whole column
    if df['hour'].iloc[i] in peak_rate:
        df.loc[df.index[i], 'export_rate_var'] = df['export_rate'].iloc[i] + peak_rate_uplift
    else:
        df.loc[df.index[i], 'export_rate_var'] = df['export_rate'].iloc[i]

Is there a way to join two datasets on timestamp with an offset such that it connects time_1 with time_2 where time_2 is 2hrs earlier than time_1?

I'm trying to predict delays based on the weather 2 hours before scheduled travel. I have one dataset of travel data (call it df1) and one dataset of weather (call it df2). In order to predict the delay, I am trying to join df1 and df2 with an offset of 2 hours; that is, I want to look at the weather data from 2 hours before the scheduled travel. A pared-down view of the data would look something like this:
example df1 (travel data):
travel_data | location | departure_time               | delayed
blah        | KPHX     | 2015-04-23T15:02:00.000+0000 | 1
bleh        | KRDU     | 2015-04-27T15:19:00.000+0000 | 0
example df2 (weather data):
location | report_time         | weather_data
KPHX     | 2015-01-01 01:53:00 | blih
KRDU     | 2015-01-01 09:53:00 | bloh
I would like to join the data first on location and then on the timestamps with a minimum 2-hour offset. If there are multiple weather reports more than 2 hours earlier than the departure time, I would like to join the travel data with the report closest to the 2-hour offset.
So far I have used
joinedDF = airlines_6m_recode.join(weather_filtered, (col("location") == col("location")) & (col("departure_time") == (col("report_date") + f.expr('INTERVAL 2 HOURS'))), "inner")
This works only when the departure time and (report date + 2hrs) match exactly, so I'm losing a large percentage of my data. Is there a way to join to the next closest report date outside the 2hr buffer?
I have looked into window functions but they don't describe how to do joins.
Change the equality in the join condition to an inequality (the report must be at least 2 hours before departure), then take the largest report timestamp per travel row with a window partitioned by location and departure time.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Join on location, keeping only reports at least 2 hours before departure
# 2. Partition by location and departure time, order by report_ts desc, add row_number
# 3. Filter row_number == 1 to keep the report closest to the 2-hour offset
joinedDF = airlines_6m_recode.join(
    weather_filtered,
    (airlines_6m_recode["location"] == weather_filtered["location"]) &
    (weather_filtered["report_time_ts"] <= airlines_6m_recode["departure_time_ts"] - F.expr("INTERVAL 2 HOURS")),
    "inner")\
    .withColumn("row_number", F.row_number().over(
        Window.partitionBy(airlines_6m_recode["location"], airlines_6m_recode["departure_time_ts"])
              .orderBy(weather_filtered["report_time_ts"].desc())))

# Just to print the intermediate result.
joinedDF.show()
joinedDF.filter('row_number == 1').show()

Convert quandl fx hourly data to daily

I would like to convert hourly financial data, imported into a pandas DataFrame with the following CSV header, to daily data:
symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
I've imported the data with pandas.read_csv(). I have eliminated all but one symbol from the data for testing purposes, and have figured out this part so far:
df.groupby('date').agg({'highask': [max], 'lowask': [min]})
I'm still pretty new to Python, so I'm not really sure how to continue. I'm guessing I can use some kind of anonymous function to create additional fields. For example, I'd like to get the open ask price for each date at hour 0 and the close ask price for each date at hour 23. Ideally, I would add additional columns and create a new DataFrame. I also want to add a new column for the market price, which is an average of the bid/ask low, high, open, and close.
Any advice would be greatly appreciated. Thanks!
edit
As requested, here is the output I would expect for just 2018-07-24:
symbol,date,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,0.7422,0.74297,0.7429,0.74196,0.74257,0.743,0.74197,0.74258,5191
openbid is the openbid at the lowest hour for a given date, closebid is the closebid at the highest hour for that date, etc. totalticks is the sum. What I am really struggling with is determining openbid, openask, closebid, and closeask.
Sample data:
symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,22,0.7422,0.74249,0.74196,0.7423,0.74225,0.74252,0.74197,0.74234,1470
AUD/USD,2018-07-24,23,0.7423,0.74297,0.7423,0.74257,0.74234,0.743,0.74234,0.74258,3721
AUD/USD,2018-07-25,0,0.74257,0.74334,0.74237,0.74288,0.74258,0.74335,0.74239,0.74291,7443
AUD/USD,2018-07-25,1,0.74288,0.74492,0.74105,0.74111,0.74291,0.74501,0.74107,0.74111,14691
AUD/USD,2018-07-25,2,0.74111,0.74127,0.74015,0.74073,0.74111,0.74129,0.74018,0.74076,6898
AUD/USD,2018-07-25,3,0.74073,0.74076,0.73921,0.73987,0.74076,0.74077,0.73923,0.73989,6207
AUD/USD,2018-07-25,4,0.73987,0.74002,0.73921,0.73953,0.73989,0.74003,0.73923,0.73956,3453
AUD/USD,2018-07-25,5,0.73953,0.74094,0.73946,0.74041,0.73956,0.74096,0.73947,0.74042,7187
AUD/USD,2018-07-25,6,0.74041,0.74071,0.73921,0.74056,0.74042,0.74069,0.73922,0.74059,10646
AUD/USD,2018-07-25,7,0.74056,0.74066,0.73973,0.74035,0.74059,0.74068,0.73974,0.74037,9285
AUD/USD,2018-07-25,8,0.74035,0.74206,0.73996,0.74198,0.74037,0.74207,0.73998,0.742,10234
AUD/USD,2018-07-25,9,0.74198,0.74274,0.74176,0.74225,0.742,0.74275,0.74179,0.74227,8224
AUD/USD,2018-07-25,10,0.74225,0.74237,0.74122,0.74142,0.74227,0.74237,0.74124,0.74143,7143
AUD/USD,2018-07-25,11,0.74142,0.74176,0.74093,0.74152,0.74143,0.74176,0.74095,0.74152,7307
AUD/USD,2018-07-25,12,0.74152,0.74229,0.74078,0.74219,0.74152,0.74229,0.74079,0.74222,10523
AUD/USD,2018-07-25,13,0.74219,0.74329,0.74138,0.74141,0.74222,0.74332,0.74136,0.74145,13983
AUD/USD,2018-07-25,14,0.74141,0.74217,0.74032,0.74065,0.74145,0.7422,0.74034,0.74067,21814
AUD/USD,2018-07-25,15,0.74065,0.74151,0.73989,0.74113,0.74067,0.74152,0.73988,0.74115,16085
AUD/USD,2018-07-25,16,0.74113,0.74144,0.74056,0.7411,0.74115,0.74146,0.74058,0.74111,7752
AUD/USD,2018-07-25,17,0.7411,0.7435,0.74092,0.74346,0.74111,0.74353,0.74094,0.74348,11348
AUD/USD,2018-07-25,18,0.74346,0.74445,0.74331,0.74373,0.74348,0.74446,0.74333,0.74373,9898
AUD/USD,2018-07-25,19,0.74373,0.74643,0.74355,0.74559,0.74373,0.74643,0.74358,0.7456,11756
AUD/USD,2018-07-25,20,0.74559,0.74596,0.74478,0.74549,0.7456,0.746,0.74481,0.74562,5607
AUD/USD,2018-07-25,21,0.74549,0.74562,0.74417,0.74438,0.74562,0.74576,0.74422,0.74442,3613
AUD/USD,2018-07-26,22,0.73762,0.73792,0.73762,0.73774,0.73772,0.73798,0.73768,0.73779,1394
AUD/USD,2018-07-26,23,0.73774,0.73813,0.73744,0.73807,0.73779,0.73816,0.73746,0.73808,3465
AUD/USD,2018-07-27,0,0.73807,0.73826,0.73733,0.73763,0.73808,0.73828,0.73735,0.73764,6582
AUD/USD,2018-07-27,1,0.73763,0.73854,0.73734,0.73789,0.73764,0.73857,0.73736,0.73788,7373
AUD/USD,2018-07-27,2,0.73789,0.73881,0.73776,0.73881,0.73788,0.73883,0.73778,0.73882,3414
AUD/USD,2018-07-27,3,0.73881,0.7393,0.73849,0.73875,0.73882,0.73932,0.73851,0.73877,4639
AUD/USD,2018-07-27,4,0.73875,0.739,0.73852,0.73858,0.73877,0.73901,0.73852,0.73859,2487
AUD/USD,2018-07-27,5,0.73858,0.73896,0.7381,0.73887,0.73859,0.73896,0.73812,0.73888,5332
AUD/USD,2018-07-27,6,0.73887,0.73902,0.73792,0.73879,0.73888,0.73902,0.73793,0.73881,7623
AUD/USD,2018-07-27,7,0.73879,0.7395,0.73844,0.73885,0.73881,0.7395,0.73846,0.73887,9577
AUD/USD,2018-07-27,8,0.73885,0.73897,0.73701,0.73727,0.73887,0.73899,0.73702,0.73729,12280
AUD/USD,2018-07-27,9,0.73727,0.73784,0.737,0.73721,0.73729,0.73786,0.73701,0.73723,8634
AUD/USD,2018-07-27,10,0.73721,0.73798,0.73717,0.73777,0.73723,0.73798,0.73718,0.73779,7510
AUD/USD,2018-07-27,11,0.73777,0.73789,0.73728,0.73746,0.73779,0.73789,0.7373,0.73745,4947
AUD/USD,2018-07-27,12,0.73746,0.73927,0.73728,0.73888,0.73745,0.73929,0.73729,0.73891,16853
AUD/USD,2018-07-27,13,0.73888,0.74083,0.73853,0.74066,0.73891,0.74083,0.73855,0.74075,14412
AUD/USD,2018-07-27,14,0.74066,0.74147,0.74025,0.74062,0.74075,0.74148,0.74026,0.74064,15187
AUD/USD,2018-07-27,15,0.74062,0.74112,0.74002,0.74084,0.74064,0.74114,0.74003,0.74086,10044
AUD/USD,2018-07-27,16,0.74084,0.74091,0.73999,0.74001,0.74086,0.74092,0.74,0.74003,6893
AUD/USD,2018-07-27,17,0.74001,0.74022,0.73951,0.74008,0.74003,0.74025,0.73952,0.74009,5865
AUD/USD,2018-07-27,18,0.74008,0.74061,0.74002,0.74046,0.74009,0.74062,0.74004,0.74047,4334
AUD/USD,2018-07-27,19,0.74046,0.74072,0.74039,0.74041,0.74047,0.74073,0.74041,0.74043,3654
AUD/USD,2018-07-27,20,0.74041,0.74066,0.74005,0.74011,0.74043,0.74068,0.74018,0.74023,1547
AUD/USD,2018-07-25,22,0.74438,0.74526,0.74436,0.74489,0.74442,0.7453,0.74439,0.74494,2220
AUD/USD,2018-07-25,23,0.74489,0.74612,0.74489,0.7459,0.74494,0.74612,0.74492,0.74592,4886
AUD/USD,2018-07-26,0,0.7459,0.74625,0.74536,0.74571,0.74592,0.74623,0.74536,0.74573,6602
AUD/USD,2018-07-26,1,0.74571,0.74633,0.74472,0.74479,0.74573,0.74634,0.74471,0.74481,10123
AUD/USD,2018-07-26,2,0.74479,0.74485,0.74375,0.74434,0.74481,0.74487,0.74378,0.74437,7844
AUD/USD,2018-07-26,3,0.74434,0.74459,0.74324,0.744,0.74437,0.74461,0.74328,0.744,6037
AUD/USD,2018-07-26,4,0.744,0.74428,0.74378,0.74411,0.744,0.7443,0.74379,0.74414,3757
AUD/USD,2018-07-26,5,0.74411,0.74412,0.74346,0.74349,0.74414,0.74414,0.74344,0.74349,5713
AUD/USD,2018-07-26,6,0.74349,0.74462,0.74291,0.74299,0.74349,0.74464,0.74293,0.743,12650
AUD/USD,2018-07-26,7,0.74299,0.74363,0.74267,0.74361,0.743,0.74363,0.74269,0.74362,8067
AUD/USD,2018-07-26,8,0.74361,0.74375,0.74279,0.74287,0.74362,0.74376,0.7428,0.74288,6988
AUD/USD,2018-07-26,9,0.74287,0.74322,0.74212,0.74318,0.74288,0.74323,0.74212,0.74319,7784
AUD/USD,2018-07-26,10,0.74318,0.74329,0.74249,0.74276,0.74319,0.74331,0.7425,0.74276,5271
AUD/USD,2018-07-26,11,0.74276,0.74301,0.74179,0.74201,0.74276,0.74303,0.7418,0.74199,7434
AUD/USD,2018-07-26,12,0.74201,0.74239,0.74061,0.74064,0.74199,0.74241,0.74063,0.74066,20513
AUD/USD,2018-07-26,13,0.74064,0.74124,0.73942,0.74008,0.74066,0.74124,0.73943,0.74005,19715
AUD/USD,2018-07-26,14,0.74008,0.74014,0.73762,0.73887,0.74005,0.74013,0.73764,0.73889,21137
AUD/USD,2018-07-26,15,0.73887,0.73936,0.73823,0.73831,0.73889,0.73936,0.73824,0.73833,11186
AUD/USD,2018-07-26,16,0.73831,0.73915,0.73816,0.73908,0.73833,0.73916,0.73817,0.73908,6016
AUD/USD,2018-07-26,17,0.73908,0.73914,0.73821,0.73884,0.73908,0.73917,0.73823,0.73887,6197
AUD/USD,2018-07-26,18,0.73884,0.73885,0.73737,0.73773,0.73887,0.73887,0.73737,0.73775,6127
AUD/USD,2018-07-26,19,0.73773,0.73794,0.73721,0.73748,0.73775,0.73797,0.73724,0.73751,3614
AUD/USD,2018-07-26,20,0.73748,0.73787,0.73746,0.73767,0.73751,0.7379,0.73748,0.73773,1801
AUD/USD,2018-07-26,21,0.73767,0.73807,0.73755,0.73762,0.73773,0.73836,0.73769,0.73772,1687
To assign a new column avg_market_price as the average:
df = df.assign(avg_market_price=df[['openbid', 'highbid', 'lowbid', 'closebid',
                                    'openask', 'highask', 'lowask', 'closeask']].mean(axis=1))
You then want to set the index to a datetime index by combining the date and hour fields, then resample your data to daily periods ('1d'). Finally, use apply to get the max, min and average values of specific columns.
import numpy as np
>>> (df
     .set_index(pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h'))
     .resample('1d')
     .apply({'highask': 'max', 'lowask': 'min', 'avg_market_price': np.mean}))
highask lowask avg_market_price
2018-07-24 0.74300 0.74197 0.742402
2018-07-25 0.74643 0.73922 0.742142
2018-07-26 0.74634 0.73724 0.741239
2018-07-27 0.74148 0.73701 0.739011
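To also recover the open/close fields the question asks about, here is a sketch using named aggregation (this assumes pandas >= 0.25 and that rows are sorted chronologically within each date; 'first' and 'last' then pick the lowest- and highest-hour rows):

daily = (df
         .sort_values(['date', 'hour'])
         .groupby('date')
         .agg(openbid=('openbid', 'first'), closebid=('closebid', 'last'),
              openask=('openask', 'first'), closeask=('closeask', 'last'),
              highbid=('highbid', 'max'), lowbid=('lowbid', 'min'),
              highask=('highask', 'max'), lowask=('lowask', 'min'),
              totalticks=('totalticks', 'sum')))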

Python pandas finding data in between time

I am using crime statistics (in a DataFrame) and I am trying to find when most crimes occur: between 12am-8am, 8am-4pm, or 4pm-12am. I have already converted the column to DateTime. The code I used is:
#first attempt
df_15['FIRST_OCCURRENCE_DATE']=pd.date_range('01/01/2015',periods=10000,freq='H')
df_15[(df_15['FIRST_OCCURRENCE_DATE'] > '2015-1-1 00:00:00') & (df_15['FIRST_OCCURRENCE_DATE'] <= '2015-12-31 08:00:00')]
#second attempt
df_15 = df_15.set_index(df_15['FIRST_OCCURRENCE_DATE'])
df_15.loc['2015-01-01 00:00:00':'2015-12-31 00:00:00']
#third attempt
date_rng = pd.date_range(start='00:00:00', end='08:00:00',freq='H')
date_rng1 = pd.DataFrame(date_rng)
date_rng1.head(30)
#fourth attempt
df_15.FIRST_OCCURRENCE_DATE.dt.hour
ts = pd.to_datetime('12/31/2015 08:00:00')
df_15.loc[df_15.FIRST_OCCURRENCE_DATE <= ts,:].head()
The results I get are time entries that go outside of 08:00:00.
PS. all the data is from the same year
Looks like you can just do a little arithmetic and count:
(df_15['FIRST_OCCURRENCE_DATE'].dt.hour // 8).value_counts()
There are a lot of ways to solve this problem, but this is likely the simplest. Extract the hour of day from each timestamp and floor-divide by 8 to find which time slot it belongs to: 0 (12AM-8AM), 1 (8AM-4PM), or 2 (4PM-12AM). Then just count these occurrences.
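A tiny illustration of the bucketing, on hypothetical timestamps:

import pandas as pd

s = pd.Series(pd.to_datetime(['2015-01-01 03:00', '2015-01-01 09:30',
                              '2015-01-01 17:45', '2015-01-01 23:59']))
print((s.dt.hour // 8).value_counts())
# hour 3 -> bucket 0, hour 9 -> bucket 1, hours 17 and 23 -> bucket 2,
# so bucket 2 (4PM-12AM) has count 2 and the others have count 1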
