I am trying to learn about rolling statistics. I started from this example:
rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)
and built a data frame from it as follows:
import numpy as np
from numpy.random import randn
from pandas import DataFrame, Series, date_range
r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data = s, columns=['Random Number Generated'])
df_new.diff().hist()
Now I am trying to find the rolling mean of the series over the last 3 hours and store it in a new column of the DataFrame. I tried to compute the rolling mean first:
df_new['mean'] = rolling_mean(df_new, window=3)
Is this correct? The result doesn't look like a mean. Can someone explain this to me, please?
I reran your code and could not find any problems; it seems to work.
If you want the rolling mean over the last 3 hours, the window must be 3, i.e. rolling_mean(df_new, window=3) rather than rolling_mean(df_new, window=5).
Here is the code I used to verify it:
import numpy as np
window = 3
mean_list = []
val_list = []
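for i, val in enumerate(s):
    val_list.append(val)
    if i < window - 1:
        # not enough values yet to fill a full window
        mean_list.append(np.nan)
    else:
        mean_list.append(np.mean(np.array(val_list)))
        # drop the oldest value so the list always holds the last `window` values
        val_list.pop(0)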
df_new['mean2'] = mean_list
print(df_new)
Output:
Random Number Generated mean mean2
2011-01-01 00:00:00 1.457483 NaN NaN
2011-01-01 01:00:00 0.009979 NaN NaN
2011-01-01 02:00:00 0.581128 0.682864 0.682864
2011-01-01 03:00:00 1.905528 0.832212 0.832212
2011-01-01 04:00:00 2.221040 1.569232 1.569232
2011-01-01 05:00:00 0.696211 1.607593 1.607593
2011-01-01 06:00:00 -0.854759 0.687497 0.687497
2011-01-01 07:00:00 -0.033226 -0.063925 -0.063925
2011-01-01 08:00:00 0.097187 -0.263599 -0.263599
2011-01-01 09:00:00 -1.579210 -0.505083 -0.505083
...
The results from rolling_mean are consistent with the manually calculated rolling mean values.
Another way to check the result is to look at a plot of the calculated rolling mean. pandas.DataFrame provides a plot method for drawing graphs easily.
from matplotlib import pyplot
df_new.plot()
pyplot.show()
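Note that the top-level rolling_mean function used above no longer exists in current pandas releases, where the same calculation is written with the .rolling() method. A minimal sketch, assuming a recent pandas version; the fixed seed and the names `col`, `manual` and `by_time` are my own additions for illustration:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the numbers are reproducible
idx = pd.date_range("2011-01-01", periods=72, freq="h")
df_new = pd.DataFrame({"Random Number Generated": np.random.randn(72)}, index=idx)
col = df_new["Random Number Generated"]

# Method form of a rolling mean over 3 consecutive rows.
df_new["mean"] = col.rolling(window=3).mean()

# Cross-check one entry by hand: the mean of rows 0, 1 and 2 belongs to row 2.
manual = col.iloc[0:3].mean()

# Time-based window: every row averages the samples of the last 3 hours.
by_time = col.rolling("3h").mean()

print(df_new.head())
```

With the integer window the first two rows are NaN, because fewer than 3 values are available there. The time-based window "3h" returns partial means for those rows instead and matches the integer window from the third row on.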
As long as your index is a timestamp (as it currently is), you can also use resample, which averages each non-overlapping 3-hour block rather than a sliding window:
s.resample('3H').mean()
When you use random numbers, it is best to set a seed value so that others can replicate your results.
np.random.seed(0)
s = pd.Series(np.random.randn(72), pd.date_range('1/1/2011', periods=72, freq='H'))
s.plot(); s.resample('3H').mean().plot()
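To make the difference between the two approaches concrete, here is a small sketch that puts a 3-hour resample next to a 3-row rolling mean on the same seeded series. The names `binned` and `rolled` are mine, and the lowercase "h" alias is used on the assumption of a recent pandas version:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.randn(72),
              index=pd.date_range("2011-01-01", periods=72, freq="h"))

# resample: one mean per non-overlapping 3-hour block
binned = s.resample("3h").mean()

# rolling: one mean per row, each computed over the current and two previous rows
rolled = s.rolling(window=3).mean()

print(len(binned), len(rolled))
```

The first block (00:00 to 02:00) holds the same three samples as the rolling window that ends at 02:00, so those two means agree. After that the resampled series advances three hours at a time while the rolling series advances one row at a time.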
Related
Say I have this dataframe:
import pandas as pd
import datetime
x = [datetime.time(23,0),datetime.time(6,0),datetime.time(18,0),datetime.time(17,0)]
y = [datetime.time(22,0),datetime.time(9,0),datetime.time(9,0),datetime.time(23,0)]
df = pd.DataFrame({'time1':x,'time2':y})
which looks like this:
How would I compute the absolute difference between the two columns? Subtraction doesn't work. The result should look like this:
df['abs_diff'] = [1,3,9,6]
Thanks so much!
pandas stores datetime.time objects in a column of object dtype, so you can't do arithmetic on them directly. You can convert the data to pandas timedeltas instead:
df['abs_diff'] = (pd.to_timedelta(df['time1'].astype(str))  # convert to timedelta
                  .sub(pd.to_timedelta(df['time2'].astype(str)))  # then you can subtract
                  .abs().div(pd.Timedelta('1H'))  # take the absolute value and divide by one hour
                  )
Output:
time1 time2 abs_diff
0 23:00:00 22:00:00 1.0
1 06:00:00 09:00:00 3.0
2 18:00:00 09:00:00 9.0
3 17:00:00 23:00:00 6.0
I have a dataset with measurements acquired roughly every 2 hours over a week. I would like to calculate the mean of measurements taken at the same time of day on different days. For example, I want to calculate the mean of all measurements taken between 12:00 and 13:59.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
#generating test dataframe
date_today = datetime.now()
time_of_taken_measurment = pd.date_range(date_today, date_today +
                                         timedelta(72), freq='2H20MIN')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100,
                         size=len(time_of_taken_measurment))
df = pd.DataFrame({'measurementTimestamp': time_of_taken_measurment, 'measurment': data})
df = df.set_index('measurementTimestamp')
#Calculating the mean for measurments taken in the same hour
hourly_average = df.groupby([df.index.hour]).mean()
hourly_average
The code above gives me this output:
0 47.967742
1 43.354839
2 46.935484
.....
22 42.833333
23 52.741935
I would like to have a result like this:
0 mean0
2 mean1
4 mean2
.....
20 mean10
22 mean11
I tried to solve my problem with the rolling_mean function, but I could not find a way to apply it to my case.
Use the built-in floor method of DatetimeIndex, which lets you easily create 2-hour time bins.
df.groupby(df.index.floor('2H').time).mean()
Output:
measurment
00:00:00 51.516129
02:00:00 54.868852
04:00:00 52.935484
06:00:00 43.177419
08:00:00 43.903226
10:00:00 55.048387
12:00:00 50.639344
14:00:00 48.870968
16:00:00 43.967742
18:00:00 49.225806
20:00:00 43.774194
22:00:00 50.590164
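If you would rather have the integer labels 0, 2, 4, ..., 22 from the question instead of time objects, integer arithmetic on the hour gives the same bins. A minimal sketch on made-up data; the fixed start date, the 500 samples and the names `by_time` and `by_hour` are assumptions of mine, not part of the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical test data: one sample every 2 h 20 min from a fixed start date.
idx = pd.date_range("2021-01-01", periods=500, freq="140min")
rng = np.random.default_rng(1111)
df = pd.DataFrame({"measurment": rng.integers(1, 100, size=len(idx))}, index=idx)

# Bins labelled with time objects, as in the answer above.
by_time = df.groupby(df.index.floor("2h").time).mean()

# The same bins labelled with the even starting hour: 0, 2, ..., 22.
by_hour = df.groupby(df.index.hour // 2 * 2).mean()

print(by_hour)
```

Both groupings put a timestamp such as 13:40 into the bin that starts at 12:00, so the means are identical and only the index labels differ.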
I am using pandas dataframes with a DatetimeIndex to manipulate time series data. The data is stored in UTC and I usually keep it that way (with a naive DatetimeIndex), and only use timezones for output. I like it that way because nothing in the world confuses me more than trying to manipulate timezones.
e.g.
In: ts = pd.date_range('2017-01-01 00:00','2017-12-31 23:30',freq='30Min')
data = np.random.rand(17520,1)
df= pd.DataFrame(data,index=ts,columns = ['data'])
df.head()
Out[15]:
data
2017-01-01 00:00:00 0.697478
2017-01-01 00:30:00 0.506914
2017-01-01 01:00:00 0.792484
2017-01-01 01:30:00 0.043271
2017-01-01 02:00:00 0.558461
I want to plot a chart of data versus time for each day of the year so I reshape the dataframe to have time along the index and dates for columns
df.index = [df.index.time,df.index.date]
df_new = df['data'].unstack()
In: df_new.head()
Out :
2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05 \
00:00:00 0.697478 0.143626 0.189567 0.061872 0.748223
00:30:00 0.506914 0.470634 0.430101 0.551144 0.081071
01:00:00 0.792484 0.045259 0.748604 0.305681 0.333207
01:30:00 0.043271 0.276888 0.034643 0.413243 0.921668
02:00:00 0.558461 0.723032 0.293308 0.597601 0.120549
If I'm not worried about timezones I can plot like this:
fig, ax = plt.subplots()
ax.plot(df_new.index,df_new)
but I want to plot the data in the local timezone (tz = pytz.timezone('Australia/Sydney')), making allowance for daylight saving time. The times and dates are no longer Timestamp objects, though, so I can't use pandas' timezone handling. Or can I?
Assuming I can't, I'm trying to do the shift manually, (given DST starts 1/10 at 2am and finishes 1/4 at 2am), so I've got this far:
df_new[[c for c in df_new.columns if c >= dt.datetime(2017,4,1) and c <dt.datetime(2017,10,1)]].shift_by(+10)
df_new[[c for c in df_new.columns if c < dt.datetime(2017,4,1) or c >= dt.datetime(2017,10,1)]].shift_by(+11)
but am not sure how to write the function shift_by.
(This doesn't handle midnight to 2am on the changeover days correctly, which is not ideal, but I could live with it.)
Use tz_localize + tz_convert on the DatetimeIndex to convert the dataframe dates to a particular timezone.
df.index = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')
df.index = [df.index.time, df.index.date]
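Be a little careful when creating the MultiIndex: as you observed, it ends up with two rows of duplicate timestamps (the local times that repeat when daylight saving time ends), and unstack cannot reshape an index with duplicate entries. If that's the case, get rid of them with duplicated: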
df = df[~df.index.duplicated()]
df = df['data'].unstack()
You can also create subplots with df.plot:
df.plot(subplots=True)
plt.show()
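Put together, the steps above might look like the sketch below. It rebuilds the half-hourly 2017 frame from the question with random data; the level names "time" and "date" and the variables `local`, `n_dupes` and `wide` are my own choices for illustration:

```python
import numpy as np
import pandas as pd

# Half-hourly data for 2017, stored as naive UTC timestamps (as in the question).
ts = pd.date_range("2017-01-01 00:00", "2017-12-31 23:30", freq="30min")
df = pd.DataFrame({"data": np.random.rand(len(ts))}, index=ts)

# Declare the naive timestamps as UTC, then convert them to Sydney local time.
local = df.index.tz_localize("UTC").tz_convert("Australia/Sydney")

# Split each local timestamp into (time of day, calendar date).
df.index = pd.MultiIndex.from_arrays([local.time, local.date],
                                     names=["time", "date"])

# Local wall-clock times that occur twice when DST ends show up as duplicates.
n_dupes = int(df.index.duplicated().sum())

# Keep the first occurrence of each (time, date) pair, then pivot dates into columns.
wide = df.loc[~df.index.duplicated(), "data"].unstack("date")

print(n_dupes, wide.shape)
```

Dropping the duplicates keeps only the first of the two repeated half-hours on the day DST ends, and the half-hours skipped on the day DST starts simply appear as NaN in that date's column.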
I just started learning pandas. I came across this;
rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)
I understood what the above code means, and I tried it in IPython:
import numpy as np
from numpy.random import randn
from pandas import DataFrame, Series, date_range
r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data = s, columns=['Random Number Generated'])
Is this the correct way of creating a data frame?
The next step given is: return a series where the absolute difference between a number and the next number in the series is less than 0.5.
Do I need to find the difference between each random number generated and the next, and keep only the entries where the absolute difference is < 0.5? Can someone explain how I can do that in pandas?
I also tried to plot the series as a histogram with:
df_new.diff().hist()
The graph shows the random number on the x axis and a y axis running from 0 to 18, which I don't understand. Can someone explain this to me as well?
To give you some pointers in addition to #Dthal's comments:
r = pd.date_range('1/1/2011', periods=72, freq='H')
As commented by #Dthal, you can simplify the creation of your DataFrame randomly sampled from the normal distribution like so:
df = pd.DataFrame(index=r, data=randn(len(r)), columns=['Random Number Generated'])
To show only the differences from the preceding value that are smaller than 0.5 in absolute value:
diff = df.diff()
diff[abs(diff['Random Number Generated']) < 0.5]
Random Number Generated
2011-01-01 02:00:00 0.061821
2011-01-01 05:00:00 0.463712
2011-01-01 09:00:00 -0.402802
2011-01-01 11:00:00 -0.000434
2011-01-01 22:00:00 0.295019
2011-01-02 03:00:00 0.215095
2011-01-02 05:00:00 0.424368
2011-01-02 08:00:00 -0.452416
2011-01-02 09:00:00 -0.474999
2011-01-02 11:00:00 0.385204
2011-01-02 12:00:00 -0.248396
2011-01-02 14:00:00 0.081890
2011-01-02 17:00:00 0.421897
2011-01-02 18:00:00 0.104898
2011-01-03 05:00:00 -0.071969
2011-01-03 15:00:00 0.101156
2011-01-03 18:00:00 -0.175296
2011-01-03 20:00:00 -0.371812
You can also use .dropna() to get rid of the missing values.
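The filter above returns the differences themselves. If the exercise wants the original numbers whose step to the next number is below 0.5, one possible reading is sketched here; the seed and the names `step_to_next` and `close_to_next` are my own, and comparing with the next value (rather than the previous one) is my interpretation of the task:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
r = pd.date_range("2011-01-01", periods=72, freq="h")
s = pd.Series(np.random.randn(len(r)), index=r)

# Difference between each value and the NEXT one; the last row has no successor (NaN).
step_to_next = s.shift(-1) - s

# Keep the original values whose step to the next value is smaller than 0.5.
close_to_next = s[step_to_next.abs() < 0.5]

print(close_to_next.head())
```

The comparison with NaN is False, so the last row drops out without an explicit dropna().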
The pandas.Series.hist() docs state that the default number of bins is 10, so that is the number of bars you should expect. The y axis shows how many differences fall into each bin. In this case the histogram turns out roughly symmetric around zero, ranging over roughly [-4, +4].
Series.hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, **kwds)
diff.hist()
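To see where the bar heights come from, you can compute the bin counts yourself. A small sketch with numpy.histogram, assuming (as I understand it) that hist() splits the data range into equal-width bins by default; the seed is my own addition, so the counts will not match the plot from the question:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.randn(72))

# 72 values give 71 differences once the leading NaN is dropped.
diff = s.diff().dropna()

# counts[i] is the number of differences between edges[i] and edges[i + 1].
counts, edges = np.histogram(diff, bins=10)

print(counts)
print(edges.round(2))
```

The counts always add up to the number of differences, so a tallest bar of about 18 just means that about 18 of the 71 differences landed in that bin.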
Say I have a dataframe with several timestamps and values. I would like to measure Δ values / Δt every 2.5 seconds. Does Pandas provide any utilities for time differentiation?
time_stamp values
19492 2014-10-06 17:59:40.016000-04:00 1832128
167106 2014-10-06 17:59:41.771000-04:00 2671048
202511 2014-10-06 17:59:43.001000-04:00 2019434
161457 2014-10-06 17:59:44.792000-04:00 1294051
203944 2014-10-06 17:59:48.741000-04:00 867856
It most certainly does. First, you'll need to convert your timestamps into a pandas DatetimeIndex and then use the offset-based functions available to series/dataframes indexed with that class. Helpful documentation here. Read more here about offset aliases.
This code should resample your data to 2.5s intervals:
#df is your dataframe
index = pd.DatetimeIndex(df['time_stamp'])
values = pd.Series(df['values'].values, index=index)
#Read above link about the different Offset Aliases, S=Seconds
resampled_values = values.resample('2.5S').mean() #average the samples in each 2.5 s bin
resampled_values.diff() #compute the difference between each point!
That should do it.
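As a self-contained illustration, here are the five rows from the question run through that recipe. This is only a sketch: I dropped the -04:00 offset to keep the timestamps naive, wrote the interval as '2500ms' (the same 2.5 seconds in whole milliseconds), and the variable names are mine:

```python
import pandas as pd

stamps = pd.to_datetime([
    "2014-10-06 17:59:40.016",
    "2014-10-06 17:59:41.771",
    "2014-10-06 17:59:43.001",
    "2014-10-06 17:59:44.792",
    "2014-10-06 17:59:48.741",
])
values = pd.Series([1832128, 2671048, 2019434, 1294051, 867856],
                   index=stamps, dtype="float64")

# Average the samples that fall into each 2.5 s bin; empty bins become NaN.
binned = values.resample("2500ms").mean()

# Change between consecutive bins, then divide by the bin width to get a rate per second.
delta_per_bin = binned.diff()
rate_per_second = delta_per_bin / 2.5

print(binned)
print(rate_per_second)
```

A bin without samples yields NaN, and so do the differences on either side of it, so gaps in the data remain visible instead of being silently bridged.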
If you really want the time derivative, then you also need to divide by the time difference (delta time, dt) since the last sample.
An example:
dti = pd.DatetimeIndex([
'2018-01-01 00:00:00',
'2018-01-01 00:00:02',
'2018-01-01 00:00:03'])
X = pd.DataFrame({'data': [1,3,4]}, index=dti)
X.head()
data
2018-01-01 00:00:00 1
2018-01-01 00:00:02 3
2018-01-01 00:00:03 4
You can find the time delta by calling diff() on the DatetimeIndex (wrapped in a Series). This gives you a series of timedeltas. You only need the values in seconds, though:
dt = pd.Series(X.index).diff().dt.total_seconds().values
dXdt = X.diff().div(dt, axis=0)
dXdt.head()
data
2018-01-01 00:00:00 NaN
2018-01-01 00:00:02 1.0
2018-01-01 00:00:03 1.0
As you can see, this approach takes into account that there are two seconds between the first two values, and only one between the two last values. :)
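The same idea applied to the irregular, sub-second timestamps from the question might look like this. It is a sketch only: the timestamps are kept naive and the names `dt_seconds` and `rate` are mine. It uses .dt.total_seconds(), because .dt.seconds returns only the whole-seconds component of each timedelta and would drop the milliseconds here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time_stamp": pd.to_datetime([
        "2014-10-06 17:59:40.016",
        "2014-10-06 17:59:41.771",
        "2014-10-06 17:59:43.001",
        "2014-10-06 17:59:44.792",
        "2014-10-06 17:59:48.741",
    ]),
    "values": [1832128, 2671048, 2019434, 1294051, 867856],
})

# Elapsed time since the previous sample, as fractional seconds (NaN for the first row).
dt_seconds = df["time_stamp"].diff().dt.total_seconds()

# Finite-difference estimate of d(values)/dt between consecutive samples.
rate = df["values"].diff() / dt_seconds

print(rate)
```

Each entry is the change in values divided by the actual gap to the previous sample, so unevenly spaced samples are handled without resampling first.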