Resample a pandas dataframe with multiple variables - python

I have a dataframe in long format with data at a 15 min interval for several variables. If I apply the resample method to get the average daily value, I get the average of all variables combined for a given time interval (not separate averages for speed and distance).
Does anyone know how to resample the dataframe and keep the 2 variables?
Note: the code below builds an EXAMPLE dataframe in long format. My real data is loaded from CSV and has different time intervals and frequencies for the variables, so I cannot simply resample the dataframe in wide format.
import pandas as pd
import numpy as np
dti = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index=dti)
# Average speed in miles per hour
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
# Distance in miles (speed * 0.25 hours)
df['distance'] = df['speed'] * 0.25
df.reset_index(inplace=True)
df2 = df.melt(id_vars='index')
df3 = df2.resample('d', on='index').mean()

IIUC:
>>> df.groupby(df.index.date).mean()
speed distance
2015-01-01 29.562500 7.390625
2015-01-02 31.885417 7.971354
2015-01-03 30.895833 7.723958
2015-01-04 30.489583 7.622396
2015-01-05 28.500000 7.125000
... ... ...
2015-12-27 28.552083 7.138021
2015-12-28 29.437500 7.359375
2015-12-29 29.479167 7.369792
2015-12-30 28.864583 7.216146
2015-12-31 48.000000 12.000000
[365 rows x 2 columns]
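If the frame really has to stay in long format (the variables having different frequencies), one option is to group by the variable name and a daily Grouper at the same time, so each variable keeps its own daily mean. A minimal sketch, assuming df2 with the 'index', 'variable' and 'value' columns produced by the melt above:
import pandas as pd
# Group by variable and by calendar day simultaneously; each variable gets its
# own daily mean even if the variables are sampled at different frequencies.
daily = (
    df2.groupby(['variable', pd.Grouper(key='index', freq='D')])['value']
       .mean()
       .reset_index()
)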

Related

Pandas: Calculate average of values for a time frame

I am working on a large dataset that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
.... , ,,,,
31.12.2018 23:59:59.000, 6.3498
The data is minute-level data from the first day of the year to the last day of the year.
I want to use Pandas to find the average of every 5 days.
For example:
Average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is the average for 05.01.2018
The next average, from 02.01.2018 00:00:00.000 to 06.01.2018 23:59:59.000, is the average for 06.01.2018
The next average, from 03.01.2018 00:00:00.000 to 07.01.2018 23:59:59.000, is the average for 07.01.2018
and so on... We increment the day by 1 but calculate the average over the past 5 days, including the current date.
For a given day, there are 24 hours * 60 minutes = 1440 data points, so I need the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, with time in [DD.MM.YYYY] format (without hh:mm:ss) and the Value being the 5-day average including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
.... , ,,,,
31.12.2018, 6.3498
The bottom line is to calculate the average of the data from the current day back over the past 5 days, with the result shown as above.
I tried iterating with a plain Python loop, but I wanted something better using Pandas.
Perhaps this will work?
import pandas as pd
import numpy as np
# Create one year of random data spaced evenly in 1 minute intervals.
np.random.seed(0) # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the date portion of the original timestamps, obtained with the .dt accessor, and take the last rolling_5d_avg value for each date.
df = (
    df
    .assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
    .groupby(df['Time'].dt.date)['rolling_5d_avg']
    .last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
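If the data were not spaced exactly one minute apart, a time-based window would be a possible alternative to the fixed 7200-row window; a hedged sketch, assuming the original df with 'Time' and 'Value' columns built at the top of this answer (before it is reassigned above):
# Time-based 5-day window instead of a fixed 7200-row window; this also copes
# with gaps or irregular spacing. Note that offset windows use min_periods=1 by
# default, so the first few days are partial averages rather than NaN.
rolling_5d = (
    df.set_index('Time')['Value']
      .rolling('5D')
      .mean()
      .resample('D')
      .last()
)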

Pandas group values and get mean by date range

I have a DataFrame like this
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
value date
0 64.885 2018-01-11
1 74.839 2018-01-15
2 41.481 2018-01-17
3 22.027 2018-01-17
4 53.747 2018-01-18
... ... ...
514 61.017 2018-12-22
515 68.376 2018-12-21
516 79.079 2018-12-26
517 73.975 2018-12-26
518 76.923 2018-12-26
519 rows × 2 columns
And I want to plot this value vs date and I am using this
df.plot( x='date',y='value')
And I get this
The point here is that this plot has too many fluctuations and I want to smooth it. My idea is to group the values by date intervals and take the mean, for example every 10 days: the mean between July 1 and July 10 becomes the point at July 5.
A long way to do it is: get the date range, split it into N ranges with start and end dates, filter the data by date, calculate the mean, and put the results in another DataFrame.
Is there a short way to do that?
P.S.: Ignore the peaks
One thing you could do for instance is to take the rolling mean of the dataframe, using DataFrame.rolling along with mean:
df = df.set_index(df.date).drop('date', axis=1)
df.rolling(3).mean().plot()
For the example dataframe you have, directly plotting the dataframe would result in:
And after taking the rolling mean, you would have:
Here I chose a window of 3, but this will depend on how smooth you want it to be.
Based on yatu's answer
The problem with that answer is that the rolling function counts values by position, not by date. With some transformations, rolling can read the Timestamp index and use time as the window [pandas.rolling]
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
df['date'] = df.apply(lambda row: pd.Timestamp(row.date), axis=1)
df = df.set_index(df.date).drop('date', axis=1)
df.sort_index(inplace=True)
df.rolling('10d').mean().plot(ylim=(30, 100), figsize=(16, 5), grid=True)
Final results
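If the goal really is non-overlapping 10-day bins rather than a rolling window (as the question originally describes: one point per 10-day interval), resample can do that directly. A minimal sketch, assuming the datetime-indexed df built in the snippet above:
# Each point is the mean of one non-overlapping 10-day interval.
binned = df.resample('10D').mean()
binned.plot(ylim=(30, 100), figsize=(16, 5), grid=True)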

Reading daily time-series using pandas and re-sampling to monthly

I am very new to Python. I usually use scikits.timeseries to process time-series data. Now I would like to use pandas (e.g. read_csv) to do the same as the code shown below. I used the read_csv manual to read the file, but I don't know how to convert the daily time series to a monthly time series.
The input is one column daily data starting from 2002-01-01 to 2011-12-31, so the length is 3652. The output will be one column monthly data starting from 2002-01 to 2011-12, so the length is 120.
import numpy as np
import pandas as pd
import scikits.timeseries as ts
stgSim = ts.time_series(np.loadtxt('examp.txt', delimiter=',', skiprows=1,
                                   usecols=[37]),
                        start_date='2002-01-01',
                        freq='d')
v4 = ts.time_series(np.random.rand(3652),start_date='2002-01-01',freq='d')
startD = stgSim.date_to_index(v4.start_date)
stgSim = stgSim[startD:]
stgSimAnMonth = stgSim.convert(freq='m',func=np.ma.mean)
Are you asking for resample which converts daily data to monthly data?
Say
rng = np.random.RandomState(42) # set a random seed so that result is repeatable
ts = pd.Series(data=rng.rand(100),
               index=pd.date_range('2018/01/01', periods=100, freq='D'))
mts = ts.resample('M').mean() # resample (convert) to monthly data
ts is like
2018-01-01 0.374540
2018-01-02 0.950714
2018-01-03 0.731994
...
2018-04-08 0.427541
2018-04-09 0.025419
2018-04-10 0.107891
Now you should have mts like
2018-01-31 0.444047
2018-02-28 0.498545
2018-03-31 0.477100
2018-04-30 0.450325
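Applied to the original scikits.timeseries snippet, a rough pandas equivalent might look like the following. This is only a sketch under the assumption that the file layout is the same (values in column 37, daily data starting on 2002-01-01):
import numpy as np
import pandas as pd
# Load the daily values and attach a daily DatetimeIndex starting 2002-01-01
# (an assumption carried over from the scikits.timeseries code above).
values = np.loadtxt('examp.txt', delimiter=',', skiprows=1, usecols=[37])
stgSim = pd.Series(values,
                   index=pd.date_range('2002-01-01', periods=len(values), freq='D'))
monthly = stgSim.resample('M').mean()  # one value per month, 2002-01 .. 2011-12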

Select a time range in DataFrame without date

I'm using/learning Pandas to load a csv style dataset where I have a time column that can be used as index. The data is sampled roughly at 100Hz. Here is a simplified snippet of the data:
Time (sec) Col_A Col_B Col_C
0.0100 14.175 -29.97 -22.68
0.0200 13.905 -29.835 -22.68
0.0300 12.257 -29.32 -22.67
... ...
1259.98 -0.405 2.205 3.825
1259.99 -0.495 2.115 3.735
There are 20 min of data, resulting in about 120,000 rows at 100 Hz. My goal is to select those rows within a certain time range, say 100-200 sec.
Here is what I've figured out
import pandas as pd
df = pd.DataFrame(my_data) # my_data is a numpy array
df.set_index(0, inplace=True)
df.columns = ['Col_A', 'Col_B', 'Col_C']
df.index = pd.to_datetime(df.index, unit='s', origin='1900-1-1')  # the date in origin is just a placeholder
My dataset doesn't include the date. How can I avoid setting a fake date like I did above? It feels wrong, and it is also quite annoying when I plot the data against time.
I know there are ways to remove the date from the datetime object, as described here.
But my goal is to select rows that fall within a certain time range, which means I need to use pd.date_range(). This function does not seem to work without a date.
It's not the end of the world if I just use a fake date throughout my project. But I'd like to know if there are more elegant ways around it.
I don't see why you need to use datetime64 objects for this. Your time column is a number, so you can very easily select time intervals with inequalities. You can also plot the columns without issue.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Time': np.arange(0, 1200, 0.01),
                   'Col_A': np.random.randint(1, 100, 120000),
                   'Col_B': np.random.randint(1, 10, 120000)})
Select Data between 100 and 200 seconds.
df[df.Time.between(100,200)]
Outputs:
Time Col_A Col_B
10000 100.00 75 9
10001 100.01 23 7
...
19999 199.99 39 7
20000 200.00 25 2
Plotting against time
#First 100 rows just for illustration
df[0:100].plot(x='Time')
Convert to timedelta64
If you really wanted to, you could convert the column to a timedelta64[ns]
df['Time'] = pd.to_datetime(df.Time, unit='s') - pd.to_datetime('1970-01-01')
print(df.head())
# Time Col_A Col_B
#0 00:00:00 67 6
#1 00:00:00.010000 93 1
#2 00:00:00.020000 99 3
#3 00:00:00.030000 18 2
#4 00:00:00.040000 84 3
df.dtypes
#Time timedelta64[ns]
#Col_A int32
#Col_B int32
#dtype: object
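With the column converted to timedelta64, the 100-200 second selection itself can also be written as an index slice; a small sketch under that assumption:
# After the timedelta conversion above: set Time as the (sorted) index and
# slice the 100-200 s window with Timedelta bounds.
df_td = df.set_index('Time').sort_index()
window = df_td.loc[pd.Timedelta(seconds=100):pd.Timedelta(seconds=200)]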

Subtracting values across grouped data frames in Pandas

I have a set of IDs and Timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb returns a DataFrameGroupedBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id','timeDeltaMin'])
def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert result into cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
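The two separate groupbys above can also be collapsed into a single pass; a small sketch of the same idea, using the column names from the question:
# Peak-to-peak time span per id, converted to minutes in one chained expression.
delta_min = (
    df.groupby('id')['timestamp']
      .agg(lambda s: s.max() - s.min())
      .div(pd.Timedelta(minutes=1))
)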
You can sort by id and timestamp, then groupby id, and then find the difference between the min and max timestamp per group.
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd
import numpy as np
import datetime
ids = [1,1,2,2,2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id':ids,'timestamp':pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
print(df.groupby(level=0).diff().groupby(level=0).sum()['timestamp'].dt.seconds / 60)
