I have a dataframe with a "time" column of floats representing days from 0 to 8, plus one more column with other data. The time step is not uniform.
import numpy as np
import pandas as pd
import random

time_clean = np.arange(0, 8, 0.1)
noise = [random.random()/10 for n in range(len(time_clean))]
time = time_clean + noise
data = [random.random()*100 for n in range(len(time_clean))]
df = pd.DataFrame({"time": time, "data":data})
df.head()
data time
0 89.965240 0.041341
1 95.964621 0.109215
2 70.552763 0.232596
3 74.457244 0.330750
4 13.228426 0.471623
I want to resample and interpolate the data to every 15 minutes (15/(60*24) days).
I think the most efficient way to do this would be the resample method of pandas dataframes, but in order to use it I need to convert the time column into a timestamp and make it the index.
What is the most efficient way of doing this? Is it possible to transform a float like this into a datetime?
I think you need to first convert the time column with to_timedelta and then sort_values before resample:
I also think it is best to add one new row with 0 so that resample always starts from 0 (if 0 is not in the time column, it starts from the minimal time value):
df.loc[-1] = 0                                # anchor row so resampling starts at time 0
df.time = pd.to_timedelta(df.time, unit='d')  # float days -> timedelta
df = df.sort_values('time').set_index('time').resample('15T').ffill()
print (df.head(20))
data
time
00:00:00 0.000000
00:15:00 0.000000
00:30:00 0.000000
00:45:00 0.000000
01:00:00 0.000000
01:15:00 0.000000
01:30:00 50.869889
01:45:00 50.869889
02:00:00 50.869889
02:15:00 50.869889
02:30:00 50.869889
02:45:00 50.869889
03:00:00 50.869889
03:15:00 8.846017
03:30:00 8.846017
03:45:00 8.846017
04:00:00 8.846017
04:15:00 8.846017
04:30:00 8.846017
04:45:00 8.846017
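Note that ffill only repeats the last observed value into each 15-minute bin. If genuine interpolation is wanted, a minimal sketch starting again from the original df with the float time column is:
df.time = pd.to_timedelta(df.time, unit='d')
df = df.set_index('time').sort_index()
# empty 15-minute bins come out NaN from mean(), then get filled linearly
out = df.resample('15T').mean().interpolate()
print(out.head())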
So I have a dataset with a specific date attached to every data point. I want to fill these values, according to their dates, into an Excel sheet that contains the date range of the whole year. The dates start at 01-01-2020 00:00:00 and end at 31-12-2020 23:45:00 with a frequency of 15 minutes, so there will be a total of 35040 date-time values in Excel.
my data is like:
load date
12 01-02-2020 06:30:00
21 29-04-2020 03:45:00
23 02-07-2020 12:15:00
54 07-08-2020 16:00:00
23 22-09-2020 16:30:00
As you can see these values are not continuous, but they have specific dates with them, so I want to use these date values as the index, put each value at its particular date in the Excel date column, and put zero in the missing values. Can someone please help?
Use DataFrame.reindex with date_range, which adds 0 values for all datetimes that do not exist:
rng = pd.date_range('2020-01-01','2020-12-31 23:45:00', freq='15Min')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').reindex(rng, fill_value=0)
print (df)
load
2020-01-01 00:00:00 0
2020-01-01 00:15:00 0
2020-01-01 00:30:00 0
2020-01-01 00:45:00 0
2020-01-01 01:00:00 0
...
2020-12-31 22:45:00 0
2020-12-31 23:00:00 0
2020-12-31 23:15:00 0
2020-12-31 23:30:00 0
2020-12-31 23:45:00 0
[35136 rows x 1 columns]
(2020 is a leap year, so the full 15-minute range contains 35136 rows rather than the 35040 mentioned in the question.)
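Since the goal is an Excel sheet, a hedged follow-up (the filename is hypothetical) is to turn the index back into a column and export it:
# name the index so reset_index produces a proper 'date' column
df.rename_axis('date').reset_index().to_excel('filled_year.xlsx', index=False)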
I have the following pandas df (datetime is of type datetime64):
device datetime
0 846ee 2020-03-22 14:27:29
1 0a26e 2020-03-22 15:33:31
2 8a906 2020-03-27 16:19:06
3 6bf11 2020-03-27 16:05:20
4 d3923 2020-03-23 18:58:51
I wanted to use the KDE function of Seaborn's distplot. Even though I don't exactly understand why, I got this to work:
df['hour'] = df['datetime'].dt.floor('T').dt.time
df['hour'] = pd.to_timedelta(df['hour'].astype(str)) / pd.Timedelta(hours=1)
and then
sns.distplot(df['hour'], hist=False, bins=arr, label='tef')
The question is: how do I do the same, but only counting unique devices? I have tried
df.groupby(['hour']).nunique().reset_index()
df.groupby(['hour'])[['device']].size().reset_index()
But they give me different results (same order of magnitude, but a few more or less). I think I don't understand what I'm doing in pd.to_timedelta(df['hour'].astype(str)) / pd.Timedelta(hours=1), and that is preventing me from reasoning about the uniques... maybe.
pd.to_timedelta(df['time'].astype(str)) creates an output like 0 days 01:00:00.
pd.to_timedelta(df['time'].astype(str)) / pd.Timedelta(hours=1) creates an output like 1.00, which is the hour as a float.
See Calculate Pandas DataFrame Time Difference Between Two Columns in Hours and Minutes for a more detailed discussion of timedeltas.
import pandas as pd
import numpy as np  # for test data
import random  # for test data
from datetime import datetime  # for test data
# test data
np.random.seed(365)
random.seed(365)
rows = 40
data = {'device': [random.choice(['846ee', '0a26e', '8a906', '6bf11', 'd3923']) for _ in range(rows)],
'datetime': pd.bdate_range(datetime(2020, 7, 1), freq='15min', periods=rows).tolist()}
# create test dataframe
df = pd.DataFrame(data)
# this date column is already in a datetime format; for the real dataframe, make sure it's converted
# df.datetime = pd.to_datetime(df.datetime)
# this extracts the time component from the datetime and is a datetime.time object
df['time'] = df['datetime'].dt.floor('T').dt.time
# this creates a timedelta column; note its format
df['timedelta'] = pd.to_timedelta(df['time'].astype(str))
# this creates a float representing the hour and its fractional component (minutes)
df['hours'] = pd.to_timedelta(df['time'].astype(str)) / pd.Timedelta(hours=1)
# extracts just the hour
df['hour'] = df['datetime'].dt.hour
display(df.head(15))
This view should elucidate the difference between the time extraction methods.
device datetime time timedelta hours hour
0 8a906 2020-07-01 00:00:00 00:00:00 0 days 00:00:00 0.00 0
1 0a26e 2020-07-01 00:15:00 00:15:00 0 days 00:15:00 0.25 0
2 8a906 2020-07-01 00:30:00 00:30:00 0 days 00:30:00 0.50 0
3 d3923 2020-07-01 00:45:00 00:45:00 0 days 00:45:00 0.75 0
4 0a26e 2020-07-01 01:00:00 01:00:00 0 days 01:00:00 1.00 1
5 d3923 2020-07-01 01:15:00 01:15:00 0 days 01:15:00 1.25 1
6 6bf11 2020-07-01 01:30:00 01:30:00 0 days 01:30:00 1.50 1
7 d3923 2020-07-01 01:45:00 01:45:00 0 days 01:45:00 1.75 1
8 6bf11 2020-07-01 02:00:00 02:00:00 0 days 02:00:00 2.00 2
9 d3923 2020-07-01 02:15:00 02:15:00 0 days 02:15:00 2.25 2
10 0a26e 2020-07-01 02:30:00 02:30:00 0 days 02:30:00 2.50 2
11 846ee 2020-07-01 02:45:00 02:45:00 0 days 02:45:00 2.75 2
12 0a26e 2020-07-01 03:00:00 03:00:00 0 days 03:00:00 3.00 3
13 846ee 2020-07-01 03:15:00 03:15:00 0 days 03:15:00 3.25 3
14 846ee 2020-07-01 03:30:00 03:30:00 0 days 03:30:00 3.50 3
Plot device counts for each hour with seaborn.countplot:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
sns.countplot(x='hour', hue='device', data=df)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
Plot a seaborn.distplot for each device
Use seaborn.FacetGrid
This will give the hourly distribution for each device.
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.FacetGrid(df, row='device', height=5)
g.map(sns.distplot, 'hours', bins=24, kde=True)
g.set(xlim=(0, 24), xticks=range(0, 25, 1))
You can try:
df['hour'] = df['datetime'].dt.strftime('%Y-%m-%d %H')
s = df.groupby('hour')['device'].value_counts()
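If the goal is the original KDE but with each device counted at most once per hour-of-day, a small sketch (assuming the float 'hours' column built in the answer above) is to de-duplicate before plotting:
import seaborn as sns
# keep one row per (device, hours) pair so repeat pings don't inflate the density
uniq = df.drop_duplicates(subset=['device', 'hours'])
sns.kdeplot(uniq['hours'], label='unique devices')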
I'm new to Python, and I'm trying to convert this code from another language. I don't know if there is a simple way to solve my problem and avoid a long processing time.
About the problem
I have a data frame with 2 columns (a time, every 30 minutes, and a value), and I am trying to find the maximum aggregate value over a specific time step for each day.
The times already represent an accumulation. For example, '2019-03-28 04:00:00' represents the accumulation from 03:31:00 to 04:00:00.
So, for a time step of 2 hours, for example, I may find the maximum value over the range 04:00:00 to 05:30:00 (= 80.0) on 2019-03-28, but it could fall elsewhere in a different dataset.
Time Value
2019-03-28 00:30:00 10.0
2019-03-28 01:00:00 5.0
2019-03-28 01:30:00 0.0
2019-03-28 02:00:00 15.0
2019-03-28 02:30:00 2.0
2019-03-28 03:00:00 0.0
2019-03-28 03:30:00 0.0
2019-03-28 04:00:00 10.0 *
2019-03-28 04:30:00 0.0 *
2019-03-28 05:00:00 10.0 *
2019-03-28 05:30:00 60.0 *
2019-03-28 06:00:00 0.0
........
........
2019-03-28 23:30:00 0.0
........
........
EDIT
Is there a simple way to automatically find the maximum value aggregating 2 hours for each day?
Please try the following. If it doesn't work, let us know and we will help further.
df['Time']=pd.to_datetime(df['Time'])#Coerce Time to datetime
df.set_index(df['Time'], inplace=True)#Set time as the index
df.groupby(df.index.date)['Value'].max().to_frame()#groupby date. Can also substitute date for day
Using .resample():
# Import and initialise pacakages in session:
import pandas as pd
# Coerce Time to datetime: Time => Date Vector
df['Time'] = pd.to_datetime(df['Time'])
# Replace index with date vec: index => Time
df.set_index(df['Time'], inplace=True)
# Resample to get the daily max: stdout => aggregated Series
df.resample('D').max()
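Both answers take the daily maximum of single 30-minute values; neither aggregates over the 2-hour window the question describes. A hedged sketch for that part (assuming the Time/Value columns above) combines a time-based rolling sum with a daily groupby:
import pandas as pd
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index('Time').sort_index()
# sum each trailing 2-hour window (four 30-minute accumulations)...
window_sums = df['Value'].rolling('2H').sum()
# ...then take the best window per calendar day
daily_max = window_sums.groupby(window_sums.index.date).max()
print(daily_max)  # 2019-03-28 -> 80.0 for the sample above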
I have a bunch of timestamp data in a csv file like this:
2012-01-01 00:00:00, data
2012-01-01 00:01:00, data
2012-01-01 00:02:00, data
...
2012-01-01 00:59:00, data
2012-01-01 01:00:00, data
2012-01-01 01:01:00, data
I want to drop the per-minute data and keep only the hourly rows in Python, like the following:
2012-01-01 00:00:00, data
2012-01-01 01:00:00, data
2012-01-01 02:00:00, data
Could anyone help me? Thank you.
I believe you need pandas resample; here is an example of how it is used to achieve the output you desire. However, keep in mind that since this is a resampling operation during frequency conversion, you must pass a function describing how the other columns should behave (summing all values falling into the new timeframe, averaging them, taking the difference, etc.); otherwise you get back a DatetimeIndexResampler object. Here is an example:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='40T')
series = pd.Series(range(9),index=index)
print(series)
Output:
2000-01-01 00:00:00 0
2000-01-01 00:40:00 1
2000-01-01 01:20:00 2
2000-01-01 02:00:00 3
2000-01-01 02:40:00 4
2000-01-01 03:20:00 5
2000-01-01 04:00:00 6
2000-01-01 04:40:00 7
2000-01-01 05:20:00 8
Applying resample hourly without passing the aggregation function:
print(series.resample('H'))
Output:
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
After passing .sum():
print(series.resample('H').sum())
Output:
2000-01-01 00:00:00 1
2000-01-01 01:00:00 2
2000-01-01 02:00:00 7
2000-01-01 03:00:00 5
2000-01-01 04:00:00 13
2000-01-01 05:00:00 8
Freq: H, dtype: int64
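Since the question asks to keep only the on-the-hour rows rather than aggregate them, a hedged alternative for minute-level data like the asker's is:
# select the first observation in each hourly bin (the :00 row)
print(series.resample('H').first())
# or snap to the exact hourly timestamps without any aggregation
print(series.asfreq('H'))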
I have to resample my dataset from a 10-minute interval to a 15-minute interval to make it sync with another dataset. Based on my searches on Stack Overflow I have some ideas about how to proceed, but none of them delivers a clean and clear solution.
Problem
Problem set up
#%% Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#%% make timestamps
periods = 12
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
df = pd.DataFrame(index=timestamp10min)
y = -(np.arange(periods)-periods/2)**2
df['y'] = y
Desired output
Now I want the values that are already at the 10-minute marks to be unchanged, and the values at **:15 and **:45 to be the mean of **:10, **:20 and of **:40, **:50. The core of the problem is that 15 minutes is not an integer multiple of 10 minutes; otherwise simply applying df.resample('15Min', how='mean') would have worked.
Possible solutions
Simply use the 15-minute resampling and just live with the small introduced error.
Use two forms of resample, with closed='left', label='left' and closed='right', label='right', and afterwards average both resampled forms. The result still carries some error, but a smaller one than the first method.
Resample everything to 5-minute data and then apply a rolling average. Something like that is applied here: Pandas: rolling mean by time interval
Resample and average with a varying number of inputs: Use numpy.average with weights for resampling a pandas array
For this I would have to create a new Series with varying weight lengths, where the weights alternate between 1 and 2.
Resample everything to 5-minute data and then apply linear interpolation. This method is close to method 3. Pandas data frame: resample with linear interpolation
Edit: @Paul H gave a workable solution along these lines, which is still readable. Thanks!
None of these methods is really satisfying to me. Some introduce a small error, and others would be quite difficult for an outsider to read.
Implementation
The implementation of methods 1, 2 and 5, together with the desired output, in combination with a visualization.
#%% start plot
plt.figure()
plt.plot(df.index, df['y'], label='original')
#%% resample the data to 15 minutes and plot the result
close = 'left'; label='left'
dfresamplell = pd.DataFrame()
dfresamplell['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label
plt.plot(dfresamplell.index, dfresamplell['15min'], label=labelstring)
close = 'right'; label='right'
dfresamplerr = pd.DataFrame()
dfresamplerr['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label
plt.plot(dfresamplerr.index, dfresamplerr['15min'], label=labelstring)
#%% make an average
dfresampleaverage = pd.DataFrame(index=dfresamplell.index)
dfresampleaverage['15min'] = (dfresamplell['15min'].values+dfresamplerr['15min'].values[:-1])/2
plt.plot(dfresampleaverage.index, dfresampleaverage['15min'], label='average of both resampling methods')
#%% desired output
ydesired = np.zeros(periods // 3 * 2)
i = 0
j = 0
k = 0
for val in ydesired:
if i+k==len(y): k=0
ydesired[j] = np.mean([y[i],y[i+k]])
j+=1
i+=1
if k==0: k=1;
else: k=0; i+=1
plt.plot(dfresamplell.index, ydesired, label='ydesired')
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5T', periods=periods*2))
dfreindex.interpolate(inplace=True)
dfreindex = dfreindex.resample('15T').first().head()
plt.plot(dfreindex.index, dfreindex['y'], label='method Paul H')
#%% finalize plot
plt.legend()
Implementation for angles
As a bonus I have added the code I will use for the interpolation of angles. This is done by using complex numbers. Because complex interpolation is not implemented (yet), I split the complex numbers into a real and an imaginary part. After averaging, these numbers can be converted back to angles. For certain angles this is a better resampling method than simply averaging the two angles, for example: 345 and 5 degrees.
#%% make timestamps
periods = 24*6
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
degrees = np.cumsum(np.random.randn(periods)*25) % 360
df = pd.DataFrame(index=timestamp10min)
df['deg'] = degrees
df['zreal'] = np.cos(df['deg']*np.pi/180)
df['zimag'] = np.sin(df['deg']*np.pi/180)
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5T', periods=periods*2))
dfreindex = dfreindex.interpolate()
dfresample = dfreindex.resample('15T').first()
#%% convert complex to degrees
def f(x):
return np.angle(x[0] + x[1]*1j, deg=True )
dfresample['degrees'] = dfresample[['zreal', 'zimag']].apply(f, axis=1)
#%% set all the values between 0-360 degrees
dfresample.loc[dfresample['degrees'] < 0, 'degrees'] += 360
#%% wrong resampling
dfresample['deg'] = dfresample['deg'] % 360
#%% plot different sampling methods
plt.figure()
plt.plot(df.index, df['deg'], label='normal', marker='v')
plt.plot(dfresample.index, dfresample['degrees'], label='resampled according #Paul H', marker='^')
plt.plot(dfresample.index, dfresample['deg'], label='wrong resampling', marker='<')
plt.legend()
I might be misunderstanding the problem, but does this work?
TL;DR version:
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(start='2012-01-01 00:00', freq='10T', periods=data.shape[0])
index_05T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='05T')
index_15T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='15T')
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1.reindex(index=index_05T).interpolate().loc[index_15T])
Long version
setup fake data
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(start='2012-01-01 00:00', freq='10T', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1)
A
2012-01-01 00:00:00 0
2012-01-01 00:10:00 8
2012-01-01 00:20:00 16
2012-01-01 00:30:00 24
2012-01-01 00:40:00 32
2012-01-01 00:50:00 40
2012-01-01 01:00:00 48
2012-01-01 01:10:00 56
2012-01-01 01:20:00 64
2012-01-01 01:30:00 72
2012-01-01 01:40:00 80
2012-01-01 01:50:00 88
2012-01-01 02:00:00 96
So then build a new 5-minute index and reindex the original dataframe
index_05T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='05T')
df2 = df1.reindex(index=index_05T)
print(df2)
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 NaN
2012-01-01 00:10:00 8
2012-01-01 00:15:00 NaN
2012-01-01 00:20:00 16
2012-01-01 00:25:00 NaN
2012-01-01 00:30:00 24
2012-01-01 00:35:00 NaN
2012-01-01 00:40:00 32
2012-01-01 00:45:00 NaN
2012-01-01 00:50:00 40
2012-01-01 00:55:00 NaN
2012-01-01 01:00:00 48
2012-01-01 01:05:00 NaN
2012-01-01 01:10:00 56
2012-01-01 01:15:00 NaN
2012-01-01 01:20:00 64
2012-01-01 01:25:00 NaN
2012-01-01 01:30:00 72
2012-01-01 01:35:00 NaN
2012-01-01 01:40:00 80
2012-01-01 01:45:00 NaN
2012-01-01 01:50:00 88
2012-01-01 01:55:00 NaN
2012-01-01 02:00:00 96
and then linearly interpolate
print(df2.interpolate())
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 4
2012-01-01 00:10:00 8
2012-01-01 00:15:00 12
2012-01-01 00:20:00 16
2012-01-01 00:25:00 20
2012-01-01 00:30:00 24
2012-01-01 00:35:00 28
2012-01-01 00:40:00 32
2012-01-01 00:45:00 36
2012-01-01 00:50:00 40
2012-01-01 00:55:00 44
2012-01-01 01:00:00 48
2012-01-01 01:05:00 52
2012-01-01 01:10:00 56
2012-01-01 01:15:00 60
2012-01-01 01:20:00 64
2012-01-01 01:25:00 68
2012-01-01 01:30:00 72
2012-01-01 01:35:00 76
2012-01-01 01:40:00 80
2012-01-01 01:45:00 84
2012-01-01 01:50:00 88
2012-01-01 01:55:00 92
2012-01-01 02:00:00 96
build a 15-minute index and use that to pull out data:
index_15T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='15T')
print(df2.interpolate().loc[index_15T])
A
2012-01-01 00:00:00 0
2012-01-01 00:15:00 12
2012-01-01 00:30:00 24
2012-01-01 00:45:00 36
2012-01-01 01:00:00 48
2012-01-01 01:15:00 60
2012-01-01 01:30:00 72
2012-01-01 01:45:00 84
2012-01-01 02:00:00 96
Ok, here's one way to do it.
Make a list of the times you want to have filled in
Make a combined index that includes the times you want and the times you already have
Take your data and "forward fill it"
Take your data and "backward fill it"
Average the forward and backward fills
Select only the rows you want
Note this only works because you want the values exactly halfway, time-wise, between the values you already have. Note also that the last time comes out np.nan because you don't have any later data.
import datetime as dt

times_15 = []
current = df.index[0]
while current < df.index[-2]:
    current = current + dt.timedelta(minutes=15)
    times_15.append(current)
combined = sorted(set(times_15) | set(df.index))  # union of new and existing stamps
df = df.reindex(combined).sort_index(axis=0)
df['ff'] = df['y'].ffill()                        # forward fill
df['bf'] = df['y'].bfill()                        # backward fill
df['solution'] = df[['ff', 'bf']].mean(axis=1)    # average the two fills
df.loc[times_15, :]
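A hedged shortcut for the same numbers: because the new stamps sit exactly halfway between the existing ones, time-weighted linear interpolation reproduces the average of the two fills in one call (assuming the combined index built above):
# time-weighted interpolation along the DatetimeIndex
df['solution2'] = df['y'].interpolate(method='time')
df.loc[times_15, :]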
In case someone is working with data without any regularity at all, here is a solution adapted from the one provided by Paul H above.
If you don't want to interpolate throughout the whole time series, but only in those places where resampling is meaningful, you may keep the interpolated column side by side with the original and finish with a resample and dropna():
import numpy as np
import pandas
data = np.arange(0, 101, 3)
index_setup = pandas.date_range(freq='01T', start='2022-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_setup, columns=['A'])
df1 = df1.sample(frac=0.2).sort_index()
print(df1)
A
2022-01-01 00:03:00 9
2022-01-01 00:06:00 18
2022-01-01 00:08:00 24
2022-01-01 00:18:00 54
2022-01-01 00:25:00 75
2022-01-01 00:27:00 81
2022-01-01 00:30:00 90
Notice that resampling this irregular DF simply snaps each value to the floor of its 5-minute interval, without interpolating:
print(df1.resample('05T').mean())
A
2022-01-01 00:00:00 9.0
2022-01-01 00:05:00 24.0
2022-01-01 00:10:00 39.0
2022-01-01 00:15:00 51.0
2022-01-01 00:20:00 NaN
2022-01-01 00:25:00 79.5
A better result can be achieved by interpolating on a small enough interval and then resampling. The resulting DF now has too many rows, but a dropna() brings it back close to its original shape:
index_1min = pandas.date_range(freq='01T', start='2022-01-01 00:00', end='2022-01-01 23:59')
df2 = df1.reindex(index=index_1min)
df2['A_interp'] = df2['A'].interpolate(limit_direction='both')
print(df2.resample('05T').first().dropna())
A A_interp
2022-01-01 00:00:00 9.0 9.0
2022-01-01 00:05:00 21.0 15.0
2022-01-01 00:10:00 39.0 30.0
2022-01-01 00:15:00 51.0 45.0
2022-01-01 00:25:00 75.0 75.0