Upsample a weather time series the correct way - python
I have a dataset that holds weather data for each month, from the 1st to the 20th day of the month, for each hour of the day, throughout a year; the last 10 days of each month (with their hours) are removed.
The weather variables are:
(temperature - humidity - wind_speed - visibility - dew_temperature - solar_radiation - rainfall - snowfall)
I want to upsample the dataset as a time series to fill in the missing days, but I face many issues due to the changes in climate.
Here is what I have tried so far:
def get_hour_month_mean(data, date, hour, max_id):
    return {
        'ID': max_id,
        'temperature': data['temperature'].mean(),
        'humidity': data['humidity'].mean(),
        'date': date,
        'hour': hour,
        'wind_speed': data['wind_speed'].mean(),
        'visibility': data['visibility'].mean(),
        'dew_temperature': data['dew_temperature'].mean(),
        'solar_radiation': data['solar_radiation'].mean(),
        'rainfall': data['rainfall'].mean(),
        'count': data['count'].mean() if str(date.date()) not in seoul_not_func else 0,
        'snowfall': data['snowfall'].mean(),
        'season': data['season'].mode()[0],
        'is_holiday': 'No Holiday' if str(date.date()) not in seoul_p_holidays_17_18 else 'Holiday',
        'functional_day': 'Yes' if str(date.date()) not in seoul_not_func else 'No',
    }
def upsample_data_with_missing_dates(data):
    data_range = pd.date_range(
        start="2017-12-20", end="2018-11-30", freq='D')
    missing_range = data_range.difference(data['date'])  # was df['date']: use the argument, not a global
    hour_range = range(0, 24)
    max_id = data['ID'].max()
    data_copy = data.copy()
    new_rows = []
    for date in missing_range:
        for hour in hour_range:
            max_id += 1
            year = date.year  # was data_copy.year, which is a whole column
            month = date.month
            # Fill a missing day from days 1-2 of the following month at the
            # same hour; for November (the last month in the data) fall back
            # to December of the previous year.
            if date.month == 11:
                year -= 1
                month = 12
            else:
                month += 1
            month_mask = ((data_copy['year'] == year) &
                          (data_copy['month'] == month) &
                          (data_copy['hour'] == hour) &
                          (data_copy['day'].isin([1, 2])))
            data_filter = data_copy[month_mask]
            new_rows.append(get_hour_month_mean(data_filter, date, hour, max_id))
    # DataFrame.append was removed in pandas 2.x; concatenate the rows instead.
    return pd.concat([data, pd.DataFrame(new_rows)], ignore_index=True)
Any ideas on the best way to get the values of the missing days, given that I have the previous 20 days and the next 20 days?
There are in fact many ways to deal with missing time-series values.
You already tried the traditional way, imputing the data with mean values. The drawback of this method is the bias it introduces when many values are missing.
To overcome the bias caused by the traditional method (the mean), you can instead use methods that forecast and/or impute the time series: a genetic algorithm (GA), Support Vector Regression (SVR), or autoregressive (AR) and moving-average (MA) models, as sketched below.
(Consider that you have a multivariate time series.)
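For example, here is a minimal AR-based gap-filling sketch with statsmodels; the series name s, the gap dates, and the choice of 24 hourly lags are all hypothetical, not taken from the question:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical setup: `s` is an hourly temperature Series on a complete
# DatetimeIndex, with NaNs over a 10-day gap (June 21-30 here).
train = s.loc['2018-06-01':'2018-06-20 23:00'].dropna()
model = ARIMA(train, order=(24, 0, 0))  # pure AR with 24 hourly lags; tune this
fit = model.fit()
# Forecast 240 hours into the gap and write the values back.
s.loc['2018-06-21':'2018-06-30 23:00'] = fit.forecast(steps=10 * 24).values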
Here are some resources you can use:
A Survey on Deep Learning Approaches
time.series.missing-values-in-time-series-in-python
Interpolation in Python to fill Missing Values
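For the interpolation route, here is a minimal pandas sketch, assuming the question's column names and date range; the frame is reindexed onto a full hourly range so the removed days appear as NaN before interpolating:

import pandas as pd

# Build a datetime index from the 'date' and 'hour' columns, then reindex to
# the complete hourly range so the removed days become NaN rows.
data = data.set_index(pd.to_datetime(data['date']) + pd.to_timedelta(data['hour'], unit='h'))
full_range = pd.date_range('2017-12-20', '2018-11-30 23:00', freq='h')
data = data.reindex(full_range)

# Time-weighted interpolation of the numeric weather columns only.
weather_cols = ['temperature', 'humidity', 'wind_speed', 'visibility',
                'dew_temperature', 'solar_radiation', 'rainfall', 'snowfall']
data[weather_cols] = data[weather_cols].interpolate(method='time', limit_direction='both')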
Related
How to calculate the icclim indicator tn10p over a period < 365 days of the year
The goal is to calculate the climate indicator tn10p (percentage of days when Tmin < 10th percentile) based on the icclim package (link). Alternatively, I tried the same indicator from the xclim package (here). I want to calculate the predictor for a specific time period, e.g. '1960-12-01' to '1961-01-31', which can include two different years and is <= 12 months.

1 - open the xarray dataset (one value every 3 hours):

t2m = xa.open_dataset('filepath.nc', decode_cf=True, decode_coords="all").sel(time=slice('1960-12-01', '1961-01-31'))

2 - calculate minimum daily temperature values:

t2m_min = t2m.t2m.resample(time='1D').min(keep_attrs=True)

3.1 - with icclim:

icclim_tn10p = icclim._generated_api.tn10p(in_files=t2m_min, slice_mode=['season', ([12, 1])])

3.2 - with xclim:

t2m_min_q10 = percentile_doy(arr=t2m_min, window=5, per=10).sel(percentiles=10)
xclim_tn10p = xclim.indicators.atmos.tn10p(tasmin=t2m_min, t10=t2m_min_q10)

In both cases, 3.1 and 3.2, I get the following ValueError:

ValueError: conflicting sizes for dimension 'dayofyear': length 61 on <this-array> and length 365 on {'longitude': 'longitude', 'latitude': 'latitude', 'dayofyear': 'dayofyear', 'percentiles': 'percentiles'}

I believe the problem is the percentile_doy function (link), which only seems to work with 365 or 366 calendar days. Any suggestions on how to solve this? It seems to be related to this xclim issue.
The following approach works with the icclim library: compute tn10p for each year with slice_mode='month', then select the time range from the output array. However, this process decreases performance.
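A minimal sketch of that workaround, reusing the calls from the question (exact output types may vary between icclim versions):

import icclim
import xarray as xa

# Open WITHOUT the short time slice: the percentile baseline needs full years.
t2m = xa.open_dataset('filepath.nc', decode_cf=True, decode_coords="all")
t2m_min = t2m.t2m.resample(time='1D').min(keep_attrs=True)

# Per-month tn10p over the whole record, then cut out the window of interest.
tn10p_monthly = icclim._generated_api.tn10p(in_files=t2m_min, slice_mode='month')
result = tn10p_monthly.sel(time=slice('1960-12-01', '1961-01-31'))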
Pandas DataFrames: addition of a float to a column value based on a condition
Relative newbie with Python and Pandas, finally admitting defeat on not being able to figure this out myself. I have a pandas DataFrame from our energy supplier's API; each row is a 30-min interval showing the wholesale energy cost in p/kWh ('value_exc_vat'), the solar output for the house ('export'), and a datetime stamp ('datetime'). The columns are:

index | value_exc_vat | datetime | export | hour | export_rate | export_rate_var

'hour' is taken from datetime for each row, e.g. 13, 14, 15, 16, etc. To calculate the price/kWh we are paid I need to compute 0.97 * value_exc_vat + peak_rate_uplift, where peak_rate_uplift is only applied during the hours 16-19 inclusive. I've tried just about every method I can think of, but I can't get this to work:

peak_rate = [16, 17, 18, 19]
for hour in df['hour']:
    if hour == peak_rate:
        df['export_rate_var'] = df['export_rate'] + peak_rate_uplift
    else:
        df['export_rate_var'] = df['export_rate']

Printing the output from the if statement I can see that 'hour' is being selected for the correct values, but the remainder of the statement doesn't then add the peak_rate_uplift I would expect. Any advice on how to apply the addition to the selected rows would be appreciated; it feels like it should be something simple, but I've been at this for 3 days now...
You could use:

peak_rate = [16, 17, 18, 19]
df['export_rate_var'] = (df['export_rate']
                         + df.hour.isin(peak_rate) * peak_rate_uplift)

Here df.hour.isin(peak_rate) returns a boolean Series. Multiplied by peak_rate_uplift, it gives a Series that is 0 where the hour is not in the peak-rate hours and peak_rate_uplift where it is.
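A tiny runnable demo of that boolean-multiply trick (the uplift value of 0.5 and the sample rows are made up for illustration):

import pandas as pd

peak_rate = [16, 17, 18, 19]
peak_rate_uplift = 0.5  # hypothetical value
df = pd.DataFrame({'hour': [15, 16, 19, 20],
                   'export_rate': [10.0, 10.0, 10.0, 10.0]})
df['export_rate_var'] = df['export_rate'] + df.hour.isin(peak_rate) * peak_rate_uplift
print(df)
#    hour  export_rate  export_rate_var
# 0    15         10.0             10.0
# 1    16         10.0             10.5
# 2    19         10.0             10.5
# 3    20         10.0             10.0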
Does this work:

peak_rate = [16, 17, 18, 19]
for i in range(len(df)):
    if df['hour'].iloc[i] in peak_rate:  # scalar membership test; a scalar has no .isin
        df.loc[df.index[i], 'export_rate_var'] = df['export_rate'].iloc[i] + peak_rate_uplift
    else:
        df.loc[df.index[i], 'export_rate_var'] = df['export_rate'].iloc[i]
Is there a way to join two datasets on timestamp with an offset such that it connects time_1 with time_2 where time_2 is 2hrs earlier than time_1?
I'm trying to predict delays based on the weather 2 hours before scheduled travel. I have one dataset of travel data (call it df1) and one dataset of weather (call it df2). In order to predict the delay, I am trying to join df1 and df2 with an offset of 2 hours; that is, I want to look at the weather data from 2 hours before the scheduled travel. A pared-down view of the data would look something like this:

example df1 (travel data):

travel_data  location  departure_time                delayed
blah         KPHX      2015-04-23T15:02:00.000+0000  1
bleh         KRDU      2015-04-27T15:19:00.000+0000  0

example df2 (weather data):

location  report_time          weather_data
KPHX      2015-01-01 01:53:00  blih
KRDU      2015-01-01 09:53:00  bloh

I would like to join the data first on location and then on the timestamps, with a minimum 2-hour offset. If there are multiple weather reports more than 2 hours earlier than the departure time, I would like to join the travel data with the report closest to the 2-hour offset. So far I have used

joinedDF = airlines_6m_recode.join(
    weather_filtered,
    (col("location") == col("location")) &
    (col("departure_time") == (col("report_date") + f.expr('INTERVAL 2 HOURS'))),
    "inner")

This works only for the times when the departure time and (report date + 2 hrs) match exactly, so I'm losing a large percentage of my data. Is there a way to join to the next closest report date outside the 2-hour buffer? I have looked into window functions, but they don't describe how to do joins.
Change the join condition to >= and get the largest report timestamp after partitioning by location:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Join as per the conditions.
# 2. Partition by location, order by report_time_ts desc, add row_number.
#    (Note: partitioning only by location keeps a single report per location;
#    to keep the closest report per flight, add a per-flight key such as
#    departure_time_ts to partitionBy.)
# 3. Filter row_number == 1.
joinedDF = airlines_6m_recode.join(
    weather_filtered,
    (airlines_6m_recode["location"] == weather_filtered["location"]) &
    (weather_filtered["report_time_ts"] <= airlines_6m_recode["departure_time_ts"] - F.expr("INTERVAL 2 HOURS")),
    "inner")\
    .withColumn("row_number", F.row_number().over(
        Window.partitionBy(airlines_6m_recode['location'])
              .orderBy(weather_filtered["report_time_ts"].desc())))

# Just to print the intermediate result.
joinedDF.show()

joinedDF.filter('row_number == 1').show()
Convert Quandl FX hourly data to daily
I would like to convert hourly financial data, imported into a pandas DataFrame with the following CSV header, to daily data:

symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks

I've imported the data with pandas.read_csv(). I have eliminated all but one symbol from the data for testing purposes, and have figured out this part so far:

df.groupby('date').agg({'highask': [max], 'lowask': [min]})

I'm still pretty new to Python, so I'm not really sure how to continue. I'm guessing I can use some kind of anonymous function to create additional fields. For example, I'd like to get the open ask price for each date at hour 0, and the close ask price for each date at hour 23. Ideally, I would add additional columns and create a new DataFrame. I also want to add a new column for market price, which is an average of the ask/bid for low, high, open, and close. Any advice would be greatly appreciated. Thanks!

edit As requested, here is the output I would expect for just 2018-07-24:

symbol,date,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,0.7422,0.74297,0.7429,0.74196,0.74257,0.743,0.74197,0.74258,5191

openbid is the openbid at the lowest hour column for a single date, closebid is the closebid at the highest hour for a single date, etc. totalticks is the sum. What I am really struggling with is determining openbid, openask, closebid, and closeask.

Sample data:

symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,22,0.7422,0.74249,0.74196,0.7423,0.74225,0.74252,0.74197,0.74234,1470
AUD/USD,2018-07-24,23,0.7423,0.74297,0.7423,0.74257,0.74234,0.743,0.74234,0.74258,3721
AUD/USD,2018-07-25,0,0.74257,0.74334,0.74237,0.74288,0.74258,0.74335,0.74239,0.74291,7443
AUD/USD,2018-07-25,1,0.74288,0.74492,0.74105,0.74111,0.74291,0.74501,0.74107,0.74111,14691
AUD/USD,2018-07-25,2,0.74111,0.74127,0.74015,0.74073,0.74111,0.74129,0.74018,0.74076,6898
AUD/USD,2018-07-25,3,0.74073,0.74076,0.73921,0.73987,0.74076,0.74077,0.73923,0.73989,6207
AUD/USD,2018-07-25,4,0.73987,0.74002,0.73921,0.73953,0.73989,0.74003,0.73923,0.73956,3453
AUD/USD,2018-07-25,5,0.73953,0.74094,0.73946,0.74041,0.73956,0.74096,0.73947,0.74042,7187
AUD/USD,2018-07-25,6,0.74041,0.74071,0.73921,0.74056,0.74042,0.74069,0.73922,0.74059,10646
AUD/USD,2018-07-25,7,0.74056,0.74066,0.73973,0.74035,0.74059,0.74068,0.73974,0.74037,9285
AUD/USD,2018-07-25,8,0.74035,0.74206,0.73996,0.74198,0.74037,0.74207,0.73998,0.742,10234
AUD/USD,2018-07-25,9,0.74198,0.74274,0.74176,0.74225,0.742,0.74275,0.74179,0.74227,8224
AUD/USD,2018-07-25,10,0.74225,0.74237,0.74122,0.74142,0.74227,0.74237,0.74124,0.74143,7143
AUD/USD,2018-07-25,11,0.74142,0.74176,0.74093,0.74152,0.74143,0.74176,0.74095,0.74152,7307
AUD/USD,2018-07-25,12,0.74152,0.74229,0.74078,0.74219,0.74152,0.74229,0.74079,0.74222,10523
AUD/USD,2018-07-25,13,0.74219,0.74329,0.74138,0.74141,0.74222,0.74332,0.74136,0.74145,13983
AUD/USD,2018-07-25,14,0.74141,0.74217,0.74032,0.74065,0.74145,0.7422,0.74034,0.74067,21814
AUD/USD,2018-07-25,15,0.74065,0.74151,0.73989,0.74113,0.74067,0.74152,0.73988,0.74115,16085
AUD/USD,2018-07-25,16,0.74113,0.74144,0.74056,0.7411,0.74115,0.74146,0.74058,0.74111,7752
AUD/USD,2018-07-25,17,0.7411,0.7435,0.74092,0.74346,0.74111,0.74353,0.74094,0.74348,11348
AUD/USD,2018-07-25,18,0.74346,0.74445,0.74331,0.74373,0.74348,0.74446,0.74333,0.74373,9898
AUD/USD,2018-07-25,19,0.74373,0.74643,0.74355,0.74559,0.74373,0.74643,0.74358,0.7456,11756
AUD/USD,2018-07-25,20,0.74559,0.74596,0.74478,0.74549,0.7456,0.746,0.74481,0.74562,5607
AUD/USD,2018-07-25,21,0.74549,0.74562,0.74417,0.74438,0.74562,0.74576,0.74422,0.74442,3613
AUD/USD,2018-07-26,22,0.73762,0.73792,0.73762,0.73774,0.73772,0.73798,0.73768,0.73779,1394
AUD/USD,2018-07-26,23,0.73774,0.73813,0.73744,0.73807,0.73779,0.73816,0.73746,0.73808,3465
AUD/USD,2018-07-27,0,0.73807,0.73826,0.73733,0.73763,0.73808,0.73828,0.73735,0.73764,6582
AUD/USD,2018-07-27,1,0.73763,0.73854,0.73734,0.73789,0.73764,0.73857,0.73736,0.73788,7373
AUD/USD,2018-07-27,2,0.73789,0.73881,0.73776,0.73881,0.73788,0.73883,0.73778,0.73882,3414
AUD/USD,2018-07-27,3,0.73881,0.7393,0.73849,0.73875,0.73882,0.73932,0.73851,0.73877,4639
AUD/USD,2018-07-27,4,0.73875,0.739,0.73852,0.73858,0.73877,0.73901,0.73852,0.73859,2487
AUD/USD,2018-07-27,5,0.73858,0.73896,0.7381,0.73887,0.73859,0.73896,0.73812,0.73888,5332
AUD/USD,2018-07-27,6,0.73887,0.73902,0.73792,0.73879,0.73888,0.73902,0.73793,0.73881,7623
AUD/USD,2018-07-27,7,0.73879,0.7395,0.73844,0.73885,0.73881,0.7395,0.73846,0.73887,9577
AUD/USD,2018-07-27,8,0.73885,0.73897,0.73701,0.73727,0.73887,0.73899,0.73702,0.73729,12280
AUD/USD,2018-07-27,9,0.73727,0.73784,0.737,0.73721,0.73729,0.73786,0.73701,0.73723,8634
AUD/USD,2018-07-27,10,0.73721,0.73798,0.73717,0.73777,0.73723,0.73798,0.73718,0.73779,7510
AUD/USD,2018-07-27,11,0.73777,0.73789,0.73728,0.73746,0.73779,0.73789,0.7373,0.73745,4947
AUD/USD,2018-07-27,12,0.73746,0.73927,0.73728,0.73888,0.73745,0.73929,0.73729,0.73891,16853
AUD/USD,2018-07-27,13,0.73888,0.74083,0.73853,0.74066,0.73891,0.74083,0.73855,0.74075,14412
AUD/USD,2018-07-27,14,0.74066,0.74147,0.74025,0.74062,0.74075,0.74148,0.74026,0.74064,15187
AUD/USD,2018-07-27,15,0.74062,0.74112,0.74002,0.74084,0.74064,0.74114,0.74003,0.74086,10044
AUD/USD,2018-07-27,16,0.74084,0.74091,0.73999,0.74001,0.74086,0.74092,0.74,0.74003,6893
AUD/USD,2018-07-27,17,0.74001,0.74022,0.73951,0.74008,0.74003,0.74025,0.73952,0.74009,5865
AUD/USD,2018-07-27,18,0.74008,0.74061,0.74002,0.74046,0.74009,0.74062,0.74004,0.74047,4334
AUD/USD,2018-07-27,19,0.74046,0.74072,0.74039,0.74041,0.74047,0.74073,0.74041,0.74043,3654
AUD/USD,2018-07-27,20,0.74041,0.74066,0.74005,0.74011,0.74043,0.74068,0.74018,0.74023,1547
AUD/USD,2018-07-25,22,0.74438,0.74526,0.74436,0.74489,0.74442,0.7453,0.74439,0.74494,2220
AUD/USD,2018-07-25,23,0.74489,0.74612,0.74489,0.7459,0.74494,0.74612,0.74492,0.74592,4886
AUD/USD,2018-07-26,0,0.7459,0.74625,0.74536,0.74571,0.74592,0.74623,0.74536,0.74573,6602
AUD/USD,2018-07-26,1,0.74571,0.74633,0.74472,0.74479,0.74573,0.74634,0.74471,0.74481,10123
AUD/USD,2018-07-26,2,0.74479,0.74485,0.74375,0.74434,0.74481,0.74487,0.74378,0.74437,7844
AUD/USD,2018-07-26,3,0.74434,0.74459,0.74324,0.744,0.74437,0.74461,0.74328,0.744,6037
AUD/USD,2018-07-26,4,0.744,0.74428,0.74378,0.74411,0.744,0.7443,0.74379,0.74414,3757
AUD/USD,2018-07-26,5,0.74411,0.74412,0.74346,0.74349,0.74414,0.74414,0.74344,0.74349,5713
AUD/USD,2018-07-26,6,0.74349,0.74462,0.74291,0.74299,0.74349,0.74464,0.74293,0.743,12650
AUD/USD,2018-07-26,7,0.74299,0.74363,0.74267,0.74361,0.743,0.74363,0.74269,0.74362,8067
AUD/USD,2018-07-26,8,0.74361,0.74375,0.74279,0.74287,0.74362,0.74376,0.7428,0.74288,6988
AUD/USD,2018-07-26,9,0.74287,0.74322,0.74212,0.74318,0.74288,0.74323,0.74212,0.74319,7784
AUD/USD,2018-07-26,10,0.74318,0.74329,0.74249,0.74276,0.74319,0.74331,0.7425,0.74276,5271
AUD/USD,2018-07-26,11,0.74276,0.74301,0.74179,0.74201,0.74276,0.74303,0.7418,0.74199,7434
AUD/USD,2018-07-26,12,0.74201,0.74239,0.74061,0.74064,0.74199,0.74241,0.74063,0.74066,20513
AUD/USD,2018-07-26,13,0.74064,0.74124,0.73942,0.74008,0.74066,0.74124,0.73943,0.74005,19715
AUD/USD,2018-07-26,14,0.74008,0.74014,0.73762,0.73887,0.74005,0.74013,0.73764,0.73889,21137
AUD/USD,2018-07-26,15,0.73887,0.73936,0.73823,0.73831,0.73889,0.73936,0.73824,0.73833,11186
AUD/USD,2018-07-26,16,0.73831,0.73915,0.73816,0.73908,0.73833,0.73916,0.73817,0.73908,6016
AUD/USD,2018-07-26,17,0.73908,0.73914,0.73821,0.73884,0.73908,0.73917,0.73823,0.73887,6197
AUD/USD,2018-07-26,18,0.73884,0.73885,0.73737,0.73773,0.73887,0.73887,0.73737,0.73775,6127
AUD/USD,2018-07-26,19,0.73773,0.73794,0.73721,0.73748,0.73775,0.73797,0.73724,0.73751,3614
AUD/USD,2018-07-26,20,0.73748,0.73787,0.73746,0.73767,0.73751,0.7379,0.73748,0.73773,1801
AUD/USD,2018-07-26,21,0.73767,0.73807,0.73755,0.73762,0.73773,0.73836,0.73769,0.73772,1687
To assign a new column avg_market_price as the average:

df = df.assign(avg_market_price=df[['openbid', 'highbid', 'lowbid', 'closebid',
                                    'openask', 'highask', 'lowask', 'closeask']].mean(axis=1))

You then want to set the index to a datetime index by combining the date and hour fields (this assumes 'date' was parsed as a datetime, e.g. via parse_dates in read_csv), resample your data to daily time periods ('1d'), and finally use apply to get the max, min and average values of specific columns:

import numpy as np

>>> (df
     .set_index(df['date'] + pd.to_timedelta(df['hour'], unit='h'))
     .resample('1d')
     .apply({'highask': 'max', 'lowask': 'min', 'avg_market_price': np.mean}))

            highask   lowask  avg_market_price
2018-07-24  0.74300  0.74197          0.742402
2018-07-25  0.74643  0.73922          0.742142
2018-07-26  0.74634  0.73724          0.741239
2018-07-27  0.74148  0.73701          0.739011
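The snippet above doesn't produce the open/close fields the question struggles with; a possible extension (not part of the original answer) uses named aggregation with 'first'/'last' after sorting by hour:

daily = (df.sort_values(['date', 'hour'])
           .groupby('date')
           .agg(openbid=('openbid', 'first'),      # openbid at the lowest hour
                highbid=('highbid', 'max'),
                lowbid=('lowbid', 'min'),
                closebid=('closebid', 'last'),     # closebid at the highest hour
                openask=('openask', 'first'),
                highask=('highask', 'max'),
                lowask=('lowask', 'min'),
                closeask=('closeask', 'last'),
                totalticks=('totalticks', 'sum')))

Because the rows are sorted by hour first, 'first' and 'last' match the question's definition of open (lowest hour) and close (highest hour) for each date.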
Python pandas: finding data between times
I am using crime statistics (in a DataFrame) and I am trying to find when most crimes occur: between 12am-8am, 8am-4pm, or 4pm-12am. I have already converted the column to DateTime. The code I used is:

# first attempt
df_15['FIRST_OCCURRENCE_DATE'] = pd.date_range('01/01/2015', periods=10000, freq='H')
df_15[(df_15['FIRST_OCCURRENCE_DATE'] > '2015-1-1 00:00:00') & (df_15['FIRST_OCCURRENCE_DATE'] <= '2015-12-31 08:00:00')]

# second attempt
df_15 = df_15.set_index(df_15['FIRST_OCCURRENCE_DATE'])
df_15.loc['2015-01-01 00:00:00':'2015-12-31 00:00:00']

# third attempt
date_rng = pd.date_range(start='00:00:00', end='08:00:00', freq='H')
date_rng1 = pd.DataFrame(date_rng)
date_rng1.head(30)

# fourth attempt
df_15.FIRST_OCCURRENCE_DATE.dt.hour
ts = pd.to_datetime('12/31/2015 08:00:00')
df_15.loc[df_15.FIRST_OCCURRENCE_DATE <= ts, :].head()

The results I get are time entries that go outside of 08:00:00. PS: all the data is from the same year.
Looks like you can just do a little arithmetic and count:

(df_15['FIRST_OCCURRENCE_DATE'].dt.hour // 8).value_counts()

There are a lot of ways to solve this problem, but this is likely the simplest. Extract the hour of day from each date and floor-divide by 8 to find the time slot it belongs to: 0 (12am-8am), 1 (8am-4pm), or 2 (4pm-12am). Then just count these occurrences.
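A small self-contained demo of that counting trick (the hourly date range stands in for real crime timestamps, and the slot labels are added only for readability):

import pandas as pd

# Stand-in data: one row per hour, mimicking the question's setup.
df_15 = pd.DataFrame({
    'FIRST_OCCURRENCE_DATE': pd.date_range('2015-01-01', periods=10000, freq='H'),
})

counts = (df_15['FIRST_OCCURRENCE_DATE'].dt.hour // 8).value_counts().sort_index()
counts.index = counts.index.map({0: '12am-8am', 1: '8am-4pm', 2: '4pm-12am'})
print(counts)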