R merge two different time series with same date - python

I have a csv file with data. Link is here. The granularity of the time series is 5 minutes, for the year 2013. However, values are missing for some time stamps.
I want to create a time series at a 5-minute interval, with a value of zero for the time stamps that are missing.
Please advise how to do this, either in Pandas or plain Python.

In pandas, you just join on the index:
from io import StringIO
import numpy as np
import pandas
ts1_string = StringIO("""\
V1,V2
01/01/2013 00:05:00,10
01/01/2013 00:10:00,6
01/01/2013 00:15:00,10
01/01/2013 00:25:00,8
01/01/2013 00:30:00,11
01/01/2013 00:35:00,7""")
ts2_string = StringIO("""\
V1,V2
2013-01-01 00:00:00,0
2013-01-01 00:05:00,0
2013-01-01 00:10:00,0
2013-01-01 00:15:00,0
2013-01-01 00:20:00,0
2013-01-01 00:25:00,0""")
ts1 = pandas.read_csv(ts1_string, parse_dates=True, index_col='V1')
ts2 = pandas.read_csv(ts2_string, parse_dates=True, index_col='V1')
# here's where the join happens
# (lsuffix/rsuffix disambiguate the overlapping V2 column names)
ts_joined = ts1.join(ts2, lsuffix='_ts1', rsuffix='_ts2')
# and finally
print(ts_joined.head())
Which gives:
                     V2_ts1  V2_ts2
V1
2013-01-01 00:05:00      10     0.0
2013-01-01 00:10:00       6     0.0
2013-01-01 00:15:00      10     0.0
2013-01-01 00:25:00       8     0.0
2013-01-01 00:30:00      11     NaN
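Note that the join leaves a NaN where ts2 has no matching stamp. If the goal is simply a complete 5-minute grid with zeros at the missing time stamps, reindexing against a full date_range is a more direct route. A minimal sketch, assuming the series is already indexed by its parsed timestamps (the sample values stand in for the real CSV):
import pandas as pd

# sample series with a gap at 00:20 (stands in for the real data)
ts = pd.Series(
    [10, 6, 10, 8, 11],
    index=pd.to_datetime(['2013-01-01 00:05', '2013-01-01 00:10',
                          '2013-01-01 00:15', '2013-01-01 00:25',
                          '2013-01-01 00:30']),
)

# build the full 5-minute grid and fill the holes with 0
full_index = pd.date_range(ts.index.min(), ts.index.max(), freq='5min')
ts_filled = ts.reindex(full_index, fill_value=0)
print(ts_filled)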

Related

Converting timeseries data given in timedeltas64[ns] to datetime64[ns]

Suppose I'm given a pandas dataframe that is indexed in timedelta64[ns].
A B C D E
0 days 00:00:00 0.642973 -0.041259 253.377516 0.0
0 days 00:15:00 0.647493 -0.041230 253.309167 0.0
0 days 00:30:00 0.723258 -0.063110 253.416138 0.0
0 days 00:45:00 0.739604 -0.070342 253.305809 0.0
0 days 01:00:00 0.643327 -0.041131 252.967084 0.0
... ... ... ... ...
364 days 22:45:00 0.650392 -0.064805 249.658052 0.0
364 days 23:00:00 0.652765 -0.064821 249.243891 0.0
364 days 23:15:00 0.607198 -0.103190 249.553821 0.0
364 days 23:30:00 0.597602 -0.107975 249.687942 0.0
364 days 23:45:00 0.595224 -0.110376 250.059530 0.0
There does not appear to be any "permitted" way of converting the index to datetimes. Basic operations to convert the index such as:
df.index = pd.DatetimeIndex(df.index)
Or:
test_df.time = pd.to_datetime(test_df.index,format='%Y%m%d%H%M')
Both yield:
TypeError: dtype timedelta64[ns] cannot be converted to datetime64[ns]
Is there any permitted way to do this operation other than completely reformatting all of these (very numerous) datasets manually? The data is yearly with 15 minute intervals.
Your issue is that you cannot convert a timedelta object to a datetime object, because the former is the difference between two datetimes. Based on your question it sounds like all these deltas are measured from the same base time, so you need to add that base time back in. Example usage below:
In [1]: import datetime
In [2]: now = datetime.datetime.now()
In [3]: delta = datetime.timedelta(minutes=5)
In [4]: print(now, delta + now)
2021-02-22 20:14:37.273444 2021-02-22 20:19:37.273444
You can see above that the second printed datetime is five minutes after the now object.
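Applied to the dataframe itself, that means adding the base timestamp the offsets are measured from to the TimedeltaIndex, which yields a DatetimeIndex. A minimal sketch, assuming the year starts at 2013-01-01 (substitute whatever base date the datasets actually use):
import pandas as pd

# toy frame indexed by timedelta64[ns], like the 15-minute data above
df = pd.DataFrame(
    {'A': [0.642973, 0.647493, 0.723258]},
    index=pd.to_timedelta(['0 days 00:00:00', '0 days 00:15:00', '0 days 00:30:00']),
)

base = pd.Timestamp('2013-01-01')  # assumed base date; use the real start of each dataset
df.index = base + df.index         # TimedeltaIndex + Timestamp -> DatetimeIndex
print(df)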

How to match time series in python?

I have two high frequency time series of 3 months worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series, by discarding the extra data, in order to run some regression analysis?
You can use the combine_first function of pandas Series. Where both series share an index label, it keeps the value of the calling series; labels only present in the other series are filled in from there.
The following code shows a minimal example:
import pandas as pd

idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
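Since the question is really about keeping only the timestamps the two series have in common (so a regression can be run on matched observations), an inner alignment is another option. A minimal sketch, assuming both series carry a DatetimeIndex:
import pandas as pd

idx1 = pd.date_range('2018-01-01 00:00', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 02:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)

# keep only the timestamps present in both series, discarding the extra data
ts1_matched, ts2_matched = ts1.align(ts2, join='inner')
print(ts1_matched)
print(ts2_matched)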

Is there a Pandas function to highlight a week's 10 lowest values in a time series?

Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                   'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
                                 '2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
                                 '2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
                                 '2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest. I guess there should be an nsmallest counterpart. pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
My bad, so your DatetimeIndex is an hourly sampling, and you need the hour(s) with the fewest events per week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting the hours into columns.
1. Create an hour column that holds the hour of the day:
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, using n_events as the values (pandas.DataFrame.pivot_table).
You'll then have one date index, 24 hour columns, and values denoting the number of events.
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to the weekly level and aggregate using sum:
df.resample('w').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the output printed:
for row in df.itertuples():
    print(sorted(row[1:]))
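If the goal is the 10 lowest hours per week taken directly from the original hourly series, grouping by calendar week and taking nsmallest also works. A minimal sketch with made-up names (an hourly DatetimeIndex and an n_events column standing in for the real dataset):
import numpy as np
import pandas as pd

# two months of hourly event counts (random stand-in data)
idx = pd.date_range('2020-06-01', periods=24 * 60, freq='H')
df = pd.DataFrame({'n_events': np.random.randint(0, 50, len(idx))}, index=idx)

# for each calendar week, the 10 hours with the fewest events
lowest = df.groupby(pd.Grouper(freq='W'))['n_events'].nsmallest(10)
print(lowest)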

How can I do a calculation depending on a condition in a column?

I would like to make a calculation over each group of ones that follow continuously.
I have a database on how a compressor works. Every 5 minutes I get the compressor status (ON/OFF) and the electricity consumed at that moment. The column On_Off holds a 1 when the compressor is working (ON) and 0 when it is OFF.
import numpy as np
import pandas as pd

Compresor = pd.Series([0,0,1,1,1,0,0,1,1,1,0,0,0,0,1,1,1,0], index=pd.date_range('1/1/2012', periods=18, freq='5 min'))
df = pd.DataFrame(Compresor)
df.index.rename("Date", inplace=True)
df.set_axis(["ON_OFF"], axis=1, inplace=True)
df.loc[(df.ON_OFF == 1), 'Electricity'] = np.random.randint(4, 20, df.ON_OFF.sum())
df.loc[(df.ON_OFF < 1), 'Electricity'] = 0
df
ON_OFF Electricity
Date
2012-01-01 00:00:00 0 0.0
2012-01-01 00:05:00 0 0.0
2012-01-01 00:10:00 1 4.0
2012-01-01 00:15:00 1 10.0
2012-01-01 00:20:00 1 9.0
2012-01-01 00:25:00 0 0.0
2012-01-01 00:30:00 0 0.0
2012-01-01 00:35:00 1 17.0
2012-01-01 00:40:00 1 10.0
2012-01-01 00:45:00 1 5.0
2012-01-01 00:50:00 0 0.0
2012-01-01 00:55:00 0 0.0
2012-01-01 01:00:00 0 0.0
2012-01-01 01:05:00 0 0.0
2012-01-01 01:10:00 1 14.0
2012-01-01 01:15:00 1 5.0
2012-01-01 01:20:00 1 19.0
2012-01-01 01:25:00 0 0.0
What I would like to do is add up the electrical consumption only over each set of consecutive ones and put the results in another DataFrame. For example:
In this example, the first time the compressor was turned on was between 00:20 and 00:30; during this period it consumed 25 (10+10+5). The second time it stayed on longer (00:50-01:15) and consumed 50 in this interval (10+10+10+10+10+5+5). The third time it consumed 20 (10 + 10).
I would like to do this automatically; I'm new to pandas and can't think of a way to do it.
Let's say you have the following data:
from operator import itemgetter
import numpy as np
import numpy.random as rnd
import pandas as pd
from funcy import concat, repeat
from toolz import partitionby
base_data = {
    'time': list(range(20)),
    'state': list(concat(repeat(0, 3), repeat(1, 4), repeat(0, 5), repeat(1, 6), repeat(0, 2))),
    'value': list(concat(repeat(0, 3), rnd.randint(5, 20, 4), repeat(0, 5), rnd.randint(5, 20, 6), repeat(0, 2)))
}
Well, there are two ways:
The first one is functional and independent of pandas: you simply partition your data by a field, i.e. the method processes the data sequentially and generates a new partition every time the value of the field changes. You can then summarize each partition as desired.
# transform into sample data
sample_data = [dict(zip(base_data.keys(), x)) for x in zip(*base_data.values())]
# and compute statistics the functional way
[sum(x['value'] for x in part if x['state'] == 1)
for part in partitionby(itemgetter('state'), sample_data)
if part[0]['state'] == 1]
There is also the pandas way, similar to what @ivallesp mentioned:
You compute the change of state by shifting the state column, then summarize your data frame by group:
pd_data = pd.DataFrame(base_data)
pd_data['shifted_state'] = pd_data['state'].shift(fill_value = pd_data['state'][0])
pd_data['cum_state'] = np.cumsum(pd_data['state'] != pd_data['shifted_state'])
pd_data[pd_data['state'] == 1].groupby('cum_state').sum()
Depending on what you and your peers can read best, you can choose your way. Also, the functional way may not be easily readable, and it can be rewritten with plain loop statements.
What I would do is create a variable representing each period of activity with an integer as an ID, then group by it and sum the Electricity column. An easy way of creating it would be to cumulatively sum On_Off (the data has to be sorted by increasing date) and multiply the resulting value by the On_Off column. If you provide a reproducible example of your table in Pandas I can quickly write you the solution.
Hope it helps
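As a concrete illustration of that idea on the compressor frame from the question, here is a minimal sketch; it uses the change-in-state cumulative sum as the run ID, which is one way to realize the suggestion above:
import pandas as pd

# stand-in for the ON_OFF / Electricity frame built in the question
df = pd.DataFrame({
    'ON_OFF':      [0, 0, 1, 1, 1, 0, 1, 1, 0],
    'Electricity': [0, 0, 4, 10, 9, 0, 17, 10, 0],
}, index=pd.date_range('1/1/2012', periods=9, freq='5 min'))

# label each continuous run of identical ON_OFF values with an integer ID
run_id = (df['ON_OFF'] != df['ON_OFF'].shift(fill_value=0)).cumsum()

# total consumption of each continuous ON period
on = df['ON_OFF'] == 1
consumption = df.loc[on, 'Electricity'].groupby(run_id[on]).sum().reset_index(drop=True)
print(consumption)  # one value per ON period: 23 and 27 for this stand-in data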

How to resample data in a single dataframe within 3 distinct groups

I've got a dataframe and want to resample certain columns (as hourly sums and means from 10-minutely data) WITHIN the 3 different 'users' that exist in the dataset.
A normal resample would use code like:
import pandas as pd
import numpy as np
df = pd.read_csv('example.csv')
df['Datetime'] = pd.to_datetime(df['date_datetime/_source'] + ' ' + df['time']) #create datetime stamp
df.set_index(df['Datetime'], inplace = True)
df = df.resample('1H', how={'energy_kwh': np.sum, 'average_w': np.mean, 'norm_average_kw/kw': np.mean, 'temperature_degc': np.mean, 'voltage_v': np.mean})
df
To get a result like (please forgive the column formatting, I have no idea how to paste this properly to make it look nice):
energy_kwh norm_average_kw/kw voltage_v temperature_degc average_w
Datetime
2013-04-30 06:00:00 0.027 0.007333 266.333333 4.366667 30.000000
2013-04-30 07:00:00 1.250 0.052333 298.666667 5.300000 192.500000
2013-04-30 08:00:00 5.287 0.121417 302.333333 7.516667 444.000000
2013-04-30 09:00:00 12.449 0.201000 297.500000 9.683333 726.000000
2013-04-30 10:00:00 26.101 0.396417 288.166667 11.150000 1450.000000
2013-04-30 11:00:00 45.396 0.460250 282.333333 12.183333 1672.500000
2013-04-30 12:00:00 64.731 0.440833 276.166667 13.550000 1541.000000
2013-04-30 13:00:00 87.095 0.562750 284.833333 13.733333 2084.500000
However, in the original CSV, there is a column containing URLs - in the dataset of 100,000 rows, there are 3 different URLs (effectively IDs). I want each resampled individually rather than having a 'lumped' resample across all of them (e.g. 9:00 AM on 2014-01-01 would have data for all 3 users, but each should have its own hourly sums and means).
I hope this makes sense - please let me know if I need to clarify anything.
FYI, I tried using the advice in the following 2 posts but to no avail:
Resampling a multi-index DataFrame
Resampling Within a Pandas MultiIndex
Thanks in advance
You can resample a groupby object, grouped by URLs, as in this minimal example:
In [157]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': np.random.random(100)})
df['Datetime'] = pd.date_range('2001-01-01', periods=100, freq='5H')  # create random dataset
df.set_index(df['Datetime'], inplace=True)
del df['Datetime']
df['Location'] = np.tile(['l0', 'l1', 'l2', 'l3', 'l4'], 20)
In [158]:
print(df.groupby('Location').resample('10D').agg({'Val': np.mean}))
Val
Location Datetime
l0 2001-01-01 00:00:00 0.334183
2001-01-11 00:00:00 0.584260
l1 2001-01-01 05:00:00 0.288290
2001-01-11 05:00:00 0.470140
l2 2001-01-01 10:00:00 0.381273
2001-01-11 10:00:00 0.461684
l3 2001-01-01 15:00:00 0.703523
2001-01-11 15:00:00 0.386858
l4 2001-01-01 20:00:00 0.448857
2001-01-11 20:00:00 0.310914
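For the original question the same pattern applies, with an hourly rule and the mixed sum/mean aggregation. A minimal sketch with made-up column names (a 'url' ID column plus two of the measured columns stand in for the real CSV; adjust the names and the agg dict to match it):
import numpy as np
import pandas as pd

# toy 10-minute data for two users; the real frame would come from the CSV
idx = pd.date_range('2013-04-30 06:00', periods=12, freq='10min')
df = pd.DataFrame({
    'url': np.tile(['user_a', 'user_b'], len(idx)),
    'energy_kwh': np.random.random(2 * len(idx)),
    'average_w': 100 * np.random.random(2 * len(idx)),
}, index=idx.repeat(2))

# hourly sums/means computed separately for each user
hourly = df.groupby('url').resample('1H').agg({'energy_kwh': 'sum', 'average_w': 'mean'})
print(hourly)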
