Interpolate (upsample) non-equispaced timeseries into equispaced with Pandas version 18.0rc1 - python

I want to interpolate (upscale) nonequispaced time-series to obtain equispaced time-series.
Currently I am doing it in following way:
take original timeseries.
create new timeseries with NaN values at each 30 seconds intervals ( using resample('30S').asfreq() )
concat original timeseries and new timeseries
sort the timeseries to restore order of times (This I do not like - sorting has complexity of O = n log(n) )
interpolate
remove original points from the timeseries
is there a more simple way with Pandas version 18.0rc1? like in matlab you have original timeseries and you pass new times as a parameter to the interpolate() function to receive values at desired times.
I remark that times of original timeseries might not be be a subset of the times of desired timeseries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
values = [271238, 329285, 50, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:4',
'2015-01-04 08:37:05',
'2015-01-04 08:41:07',
'2015-01-04 08:43:05',
'2015-01-04 08:49:05'])
ts = pd.Series(values, index=timestamps)
ts
ts[ts==-1] = np.nan
newFreq=ts.resample('60S').asfreq()
new=pd.concat([ts,newFreq]).sort_index()
new=new.interpolate(method='time')
ts.plot(marker='o')
new.plot(marker='+',markersize=15)
new[newFreq.index].plot(marker='.')
lines, labels = plt.gca().get_legend_handles_labels()
labels = ['original values (nonequispaced)', 'original + interpolated at new frequency (nonequispaced)', 'interpolated values without original values (equispaced!)']
plt.legend(lines, labels, loc='best')
plt.show()

There have been several requests for a simpler way to interpolate at desired values (I'll edit in links later, but search the issue tracker for interpolate issues). So in the future there will be an easier way.
For now you can write the option a bit more cleanly as
In [9]: (ts.reindex(ts.index | newFreq.index)
.interpolate(method='time')
.loc[newFreq.index])
Out[9]:
2015-01-04 08:29:00 NaN
2015-01-04 08:30:00 277996.070686
2015-01-04 08:31:00 285236.860707
2015-01-04 08:32:00 292477.650728
2015-01-04 08:33:00 299718.440748
...
2015-01-04 08:45:00 261362.402778
2015-01-04 08:46:00 261937.569444
2015-01-04 08:47:00 262512.736111
2015-01-04 08:48:00 263087.902778
2015-01-04 08:49:00 263663.069444
Freq: 60S, dtype: float64
This still involves all the steps you listed above, but the unioning of the indexes is cleaner than concating and dropping.

Related

Resample a pandas dataframe with multiple variables

I have a dataframe in long format with data on a 15 min interval for several variables. If I apply the resample method to get the average daily value, I get the average values of all variables for a given time interval (and not the average value for speed, distance).
Does anyone know how to resample the dataframe and keep the 2 variables?
Note: The code below contains an EXAMPLE dataframe in long format, my real example loads data from csv and has different time intervals and frequencies for the variables, so I cannot simply resample the dataframe in wide format.
import pandas as pd
import numpy as np
dti = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index = dti)
# Average speed in miles per hour
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
# Distance in miles (speed * 0.5 hours)
df['distance'] = df['speed'] * 0.25
df.reset_index(inplace=True)
df2 = df.melt (id_vars = 'index')
df3 = df2.resample('d', on='index').mean()
IIUC:
>>> df.groupby(df.index.date).mean()
speed distance
2015-01-01 29.562500 7.390625
2015-01-02 31.885417 7.971354
2015-01-03 30.895833 7.723958
2015-01-04 30.489583 7.622396
2015-01-05 28.500000 7.125000
... ... ...
2015-12-27 28.552083 7.138021
2015-12-28 29.437500 7.359375
2015-12-29 29.479167 7.369792
2015-12-30 28.864583 7.216146
2015-12-31 48.000000 12.000000
[365 rows x 2 columns]

Boxplot of Multiindex df

I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I guet a bucnh of errors after playing with different selections.
If I try to do df.stack().unstack() or df.stack().T I get then following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique to make the functions you want to work. I suggest a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random
import numpy as np
cat = ["NumericIndex","OriginMovementID","DestinationMovementID","MeanTravelTimeSeconds",
"RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
[{"Date":d, "Observation":cat[random.randint(0,len(cat)-1)],
"Value":random.randint(1000,10000)}
for i in range(random.randint(5,20))
for d in pd.date_range(dt.datetime(2016,1,2), dt.datetime(2016,3,31), freq="14D")])
# starting point....
df = df.sort_values(["Date","Observation"]).set_index(["Date","Observation"])
# generate an array that is sequential within change of key
seq = np.full(df.index.shape, 0)
s=0
p=""
for i, v in enumerate(df.index):
if i==0 or p!=v: s=0
else: s+=1
seq[i] = s
p=v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN

How can I plot different length pandas series with matplotlib?

I've got two pandas series, one with a 7 day rolling mean for the entire year and another with monthly averages. I'm trying to plot them both on the same matplotlib figure, with the averages as a bar graph and the 7 day rolling mean as a line graph. Ideally, the line would be graph on top of the bar graph.
The issue I'm having is that, with my current code, the bar graph is showing up without the line graph, but when I try plotting the line graph first I get a ValueError: ordinal must be >= 1.
Here's what the series' look like:
These are first 15 values of the 7 day rolling mean series, it has a date and a value for the entire year:
date
2016-01-01 NaN
2016-01-03 NaN
2016-01-04 NaN
2016-01-05 NaN
2016-01-06 NaN
2016-01-07 NaN
2016-01-08 0.088473
2016-01-09 0.099122
2016-01-10 0.086265
2016-01-11 0.084836
2016-01-12 0.076741
2016-01-13 0.070670
2016-01-14 0.079731
2016-01-15 0.079187
2016-01-16 0.076395
This is the entire monthly average series:
dt_month
2016-01-01 0.498323
2016-02-01 0.497795
2016-03-01 0.726562
2016-04-01 1.000000
2016-05-01 0.986411
2016-06-01 0.899849
2016-07-01 0.219171
2016-08-01 0.511247
2016-09-01 0.371673
2016-10-01 0.000000
2016-11-01 0.972478
2016-12-01 0.326921
Here's the code I'm using to try and plot them:
ax = series_one.plot(kind="bar", figsize=(20,2))
series_two.plot(ax=ax)
plt.show()
Here's the graph that generates:
Any help is hugely appreciated! Also, advice on formatting this question and creating code to make two series for a minimum working example would be awesome.
Thanks!!
The problem is that pandas bar plots are categorical (Bars are at subsequent integer positions). Since in your case the two series have a different number of elements, plotting the line graph in categorical coordinates is not really an option. What remains is to plot the bar graph in numerical coordinates as well. This is not possible with pandas, but is the default behaviour with matplotlib.
Below I shift the monthly dates by 15 days to the middle of the month to have nicely centered bars.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
import pandas as pd
t1 = pd.date_range("2018-01-01", "2018-12-31", freq="D")
s1 = pd.Series(np.cumsum(np.random.randn(len(t1)))+14, index=t1)
s1[:6] = np.nan
t2 = pd.date_range("2018-01-01", "2018-12-31", freq="MS")
s2 = pd.Series(np.random.rand(len(t2))*15+5, index=t2)
# shift monthly data to middle of month
s2.index += pd.Timedelta('15 days')
fig, ax = plt.subplots()
ax.bar(s2.index, s2.values, width=14, alpha=0.3)
ax.plot(s1.index, s1.values)
plt.show()
The problem might be the two series' indices are of very different scales. You can use ax.twiny to plot them:
ax = series_one.plot(kind="bar", figsize=(20,2))
ax_tw = ax.twiny()
series_two.plot(ax=ax_tw)
plt.show()
Output:

Pandas reindex and interpolate time series efficiently (reindex drops data)

Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between old and new index. For example
# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data,index)
# I want rounded date-times
desired_index = pandas.date_range("2010-10-08",periods=10,freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex as that will lose precision, so I came up with the following; concatenate the reindexed series with the original before interpolating:
pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
The only (simple) way I can see of doing this is to use resample to upsample to your time resolution (say 1 second), then reindex.
Get an example DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2)
df = (pd.DataFrame()
.assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
+ pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
Value=np.random.randn(337)
)
.set_index(['SampleTime'])
)
Let's see what the data looks like:
df.head()
Value
SampleTime
2018-10-01 00:00:03 0.033171
2018-10-01 00:30:03 0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
.reindex(df.index.union(desired_index))
.interpolate(method='time')
.reindex(desired_index)
)
Value
2018-10-01 00:00:00 NaN
2018-10-01 00:30:00 0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are number of ways to deal with this (pad, for example).
my methods
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
for d in prices.rename_axis([None]).index:
df.loc[d] = prices.loc[d]
df.interpolate(method='linear')
method 2
prices = data.loc[~data.index.duplicated(keep='last')]
#prices = data.reset_index()
idx1 = prices.index
idx1 = pd.to_datetime(idx1, errors='coerce')
merged = idx1.union(idx2)
s = prices.reindex(merged)
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data=df

Pandas: bar plot with multiIndex dataframe

I have a pandas DataFrame with a TIMESTAMP column (not the index), and the timestamp format is as follows:
2015-03-31 22:56:45.510
I also have columns called CLASS and AXLES. I would like to compute the count of records for each month separately for each unique value of AXLES (AXLES can take an integer value between 3-12).
I came up with a combination of resample and groupby:
resamp = dfWIM.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
This seems to give me a multiIndex dataframe object, as shown below.
In [72]: resamp
Out [72]:
AXLES TIMESTAMP
3 2014-07-31 5517
2014-08-31 31553
2014-09-30 42816
2014-10-31 49308
2014-11-30 44168
2014-12-31 45518
2015-01-31 54782
2015-02-28 52166
2015-03-31 47929
4 2014-07-31 3147
2014-08-31 24810
2014-09-30 39075
2014-10-31 46857
2014-11-30 42651
2014-12-31 48282
2015-01-31 42708
2015-02-28 43904
2015-03-31 50033
From here, how can I access different components of this multiIndex object to create a bar plot for the following conditions?
show data when AXLES = 3
show x ticks in the Month - Year format (no days, hours, minutes etc.)
Thanks!
EDIT: Following code gives me the plot, but I could not change the xtick formatting to MM-YY.
resamp[3].plot(kind='bar')
EDIT 2 below is a code snippet that generates a small sample of the data similar to what I have:
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
resamp[3].plot(kind='bar')
EDIT 3:
Here below is the solution:
A.Plot the whole resampled dataframe (based on #Ako 's suggestion):
df = resamp.unstack(0)
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
B.Plot an individual index from the resampled dataframe (based on #Alexander 's suggestion):
df = resamp[3]
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
You could generate and set the labels explicitly using ax.xaxis.set_major_formatter with a ticker.FixedFormatter. This will allow you to keep your DataFrame's MultiIndex with timestamp values, while displaying the timestamps in the desired %m-%Y format:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.ticker as ticker
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
ax = resamp[3].plot(kind='bar')
ticklabels = [timestamp.strftime('%m-%Y') for axle, timestamp in resamp.index]
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: ticklabels[int(x)]))
plt.gcf().autofmt_xdate()
plt.show()
yields
The following should work, but it is difficult to test without some data.
Start by resetting your index to get access to the TIMESTAMP column. Then use strftime to format it to your desired text representation (e.g. mm-yy). Finally, reset the index back to AXLES and TIMESTAMP.
df = resamp.reset_index()
df['TIMESTAMP'] = [ts.strftime('%m-%y') for ts in df.TIMESTAMP]
df.set_index(['AXLES', 'TIMESTAMP'], inplace=True)
>>> df.xs(3, level=0).plot(kind='bar')

Categories

Resources