I want to do some data processing on sensor data (about 300 different sensors). This is an example of the raw data from a temperature sensor:
"2018-06-30T13:17:05.986Z" 30.5
"2018-06-30T13:12:05.984Z" 30.3
"2018-06-30T13:07:05.934Z" 29.5
"2018-06-30T13:02:05.873Z" 30.3
"2018-06-30T12:57:05.904Z" 30
I want to resample the data to smooth datetimes:
13:00:00
13:05:00
13:10:00
...
I have written some code that works, but it is incredibly slow on bigger files. My code simply upsamples all the data to 1-second resolution via linear interpolation and then downsamples to the requested frequency.
Is there a faster method to achieve this?
EDIT:
The sensor data is written into a database, and my code loads data from an arbitrary time interval from the database.
EDIT2: My working code
upsampled = dataframe.resample('1S').asfreq()
upsampled = upsampled.interpolate(method=method, limit=limitT) # ffill or bfill for some sensors
resampled = upsampled.astype(float).resample(str(sampling_time) + 'S').mean() # for temperature
resampled = upsampled.astype(float).resample(str(sampling_time) + 'S').asfreq() # for everything else
You can first set the DataFrame's index to the timestamp column and then use the resample() method to bring the data to a 1-second or 5-minute interval.
For example:
temp_df = pd.read_csv('temp.csv',header=None)
temp_df.columns = ['Timestamps','TEMP']
temp_df = temp_df.set_index('Timestamps') #set the timestamp column as index
temp_re_df = temp_df.TEMP.resample('5T').mean()
You can pass the period as an argument to resample(), e.g. 'T' for minutes, 'S' for seconds, 'H' for hours, 'M' for months, and apply a function such as mean(), max() or min() as the down-sampling method.
P.S.: This assumes your timestamps are already in pandas datetime format. Otherwise, use pd.to_datetime(temp_df['Timestamps'], unit='s') to convert the column into a datetime index.
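For the sample data in the question, a minimal sketch of this direct approach could look as follows. The file and column names follow the example above, and since the timestamps are ISO 8601 strings, a plain pd.to_datetime call (without unit='s') should be enough; the final interpolate step mirrors the gap-filling in the original code.
import pandas as pd

temp_df = pd.read_csv('temp.csv', header=None, names=['Timestamps', 'TEMP'])
temp_df['Timestamps'] = pd.to_datetime(temp_df['Timestamps'])  # parse ISO 8601 strings
temp_df = temp_df.set_index('Timestamps')

# aggregate straight onto the 5-minute grid instead of upsampling to 1 second first
resampled = temp_df['TEMP'].resample('5T').mean()

# optionally fill empty 5-minute buckets, as the original code did
resampled = resampled.interpolate(method='linear')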
Related
I have 1-minute interval intraday stock data which looks like this:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')
I am trying to resample it to '5m' data like this:
n = n.resample('5T').agg(dict(zip(n.columns, ['first', 'max', 'min', 'last', 'last', 'sum'])))
But it also resamples times that are not in my data. The market data is only available until 03:30 PM, but when I look at the resampled dataframe I find it has resampled the entire 24 hours.
How do I stop the resampling at 03:30 PM and move on to the succeeding date?
Right now the dataframe has mostly NaN values due to this. Any suggestions will be welcome.
I am not sure what you are trying to achieve with that agg() function. Assuming 'first' refers to the first quantile and 'last' to the last quantile and you want to calculate some statistics per column, I suggest you do the following:
Get your data:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')
Resample your data:
Note: your result is the same as when you resample with n.resample('5T').first(), but this means every value in the dataframe equals the first value from the 5-minute interval consisting of 5 values. A more logical resampling method is to use the mean() or sum() function, as shown below.
If this is data on stock prices it makes more sense to use mean():
resampled_df = n.resample('5T').mean()
To remove resampled hours that fall outside of working stock hours, you have two options.
Option 1: drop na values:
filtered_df = resampled_df.dropna()
Note: this will not work if you use sum() since the result won't contain missing values but zeros.
Option 2: filter based on start and end times
Get the minimum and maximum time of day where data is available, as datetime.time objects:
start = n.index.min().time() # 09:15 as datetime.time object
end = n.index.max().time() # 15:29 as datetime.time object
Filter dataframe based on start and end times:
filtered_df = resampled_df.between_time(start, end)
Get the statistics:
statistics = filtered_df.describe()
statistics
Note that describe() will not contain the sum, so in order to add it you could do:
statistics = pd.concat([statistics, filtered_df.agg(['sum'])])
statistics
The agg() is there to apply an individual aggregation method to each column; I used it so that I can see the 'candlestick' formation, as it is called in stock technical analysis.
I was able to fix the issue by dropping the NaN values.
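For reference, a minimal sketch of that combination (per-column agg for the candlestick columns, then dropna); the column names below assume the usual yfinance output, so adjust the dict if yours differ:
import yfinance as yf

n = yf.download('^nsei', period='5d', interval='1m')

candles = n.resample('5T').agg({
    'Open': 'first',     # first price in each 5-minute bar
    'High': 'max',
    'Low': 'min',
    'Close': 'last',
    'Adj Close': 'last',
    'Volume': 'sum',
}).dropna()              # drop the empty bars outside trading hours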
I have a big amount of timeseries sensor data in a pandas dataframe. The resolution of the data is one observation every 15 minutes for 1 month for 876 sensors.
The data has some daily seasonality and some faulty measurements in single sensors on about 50% of the observations.
I want to remove the seasonality.
df.diff(periods=96)
This does not work, because then I have an outlier on 2 days (the day with the actual faulty measurement and the day after).
Therefore I wrote this snippet of code which does what it should and works fine:
for index in df.index:
    for column in df.columns:
        df[column][index] = df[column][index] - (
            df[column][df.index % 96 == index % 96]).mean()
The problem is that this is incredibly slow.
Is there a way to achieve the same thing with a pandas function significantly faster?
Iterating over a DataFrame or Series should be your last resort; it's very slow.
In this case, you can use groupby + transform to compute the mean of each season for all the columns, and then subtract it from your DataFrame in a vectorized way.
Based on your code, it seems that this should work:
period = 96
season_mean = df.groupby(df.index % period).transform('mean')
df -= season_mean
Or, if you prefer to do it in one step:
period = 96
df = df.groupby(df.index % period).transform(lambda g: g - g.mean())
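As a quick sanity check, here is a small self-contained sketch with made-up data; it assumes an integer RangeIndex, which is what df.index % 96 in the question implies:
import numpy as np
import pandas as pd

period = 96                                   # 96 observations of 15 minutes = 1 day
idx = np.arange(4 * period)                   # four "days" of observations
signal = np.sin(2 * np.pi * idx / period) + 0.1 * np.random.randn(len(idx))
df = pd.DataFrame({'sensor_1': signal}, index=idx)

# subtract each slot's mean across days; the daily seasonality disappears
deseasonalised = df - df.groupby(df.index % period).transform('mean')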
I have a dataframe containing following 3 columns:
1. ID
2. timestamp
3. IP_Address
The data spans from 2019-07-01 to 2019-09-20. I am trying to aggregate counts of IP_Address over the last 60 days, partitioned by ID, for all the rows in the 20-day period from 2019-09-01 to 2019-09-20.
I have tried using the following window function and it works just fine:
import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.sql.functions import col, count, unix_timestamp

days = lambda i: i * 86400
w = Window.partitionBy('id')\
    .orderBy(unix_timestamp(col('timestamp')))\
    .rangeBetween(start=-days(60), end=Window.currentRow)
df = df.withColumn("ip_counts", count(df.ip_address).over(w))
However, the problem with that is that it calculates these aggregations even for the period I don't need the computation for: 2019-07-01 to 2019-08-31. I could easily filter out the results for the selected period retrospectively after calculations but I don't want unnecessary computations as I am dealing with ~3-10 Million rows per day.
If I filter the dataframe like this:
dates = ('2019-09-01', '2019-09-20')
date_from, date_to = [F.to_date(F.lit(s)).cast("timestamp") for s in dates]
w = Window.partitionBy('id')\
    .orderBy(unix_timestamp(col('timestamp')))\
    .rangeBetween(start=-days(60), end=Window.currentRow)
df = df.where((df.timestamp >= date_from) & (df.timestamp <= date_to))\
    .withColumn("ip_counts", count(df.ip_address).over(w))
in that case, the rows between these dates no longer have access to the data from the preceding 60 days, and therefore the counts are incorrect.
What can I do to compute aggregations only for the rows falling between 2019-09-01 and 2019-09-20, while at the same time making sure that the windows have access to the preceding 60 days of data for each of those aggregations? Thank you so much for your help.
I would first make a new dataframe that keeps all the data from the 60 days preceding 2019-09-01 up to 2019-09-20, then follow your first method over it, and finally keep only the aggregations for the rows falling between 2019-09-01 and 2019-09-20.
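A minimal sketch of that idea, reusing the names from the question's own code; the only new piece is the INTERVAL expression used to step back 60 days from the period start:
import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.sql.functions import col, count, unix_timestamp

days = lambda i: i * 86400
date_from, date_to = [F.to_date(F.lit(s)).cast("timestamp")
                      for s in ('2019-09-01', '2019-09-20')]

# 1. keep only the rows that can influence the target windows
pre_filtered = df.where((df.timestamp >= date_from - F.expr('INTERVAL 60 DAYS'))
                        & (df.timestamp <= date_to))

# 2. run the original window over this reduced dataframe
w = Window.partitionBy('id')\
    .orderBy(unix_timestamp(col('timestamp')))\
    .rangeBetween(start=-days(60), end=Window.currentRow)
counted = pre_filtered.withColumn('ip_counts', count('ip_address').over(w))

# 3. report only the 20-day period itself
result = counted.where((counted.timestamp >= date_from) & (counted.timestamp <= date_to))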
Python newbie here, but I have some intra-day financial data going back to 2012, so it has the same hours each day (the same trading session each day), just with different dates. I want to be able to select certain times out of the data, check the corresponding OHLC data for that period, and then do some analysis on it.
So at the moment it's a CSV file, and I'm doing:
import pandas as pd
data = pd.read_csv('data.csv')
date = data['date']
op = data['open']
high = data['high']
low = data['low']
close = data['close']
volume = data['volume']
The thing is that the date column is in the format "dd/mm/yyyy 00:00:00" as one string, so is it possible to still select between certain times, like between "09:00:00" and "10:00:00"? Or do I have to separate the time part from the date and make it its own column? If so, how?
I believe pandas has a between_time() function, but that seems to need a datetime index, so how can I convert my data so that I can use between_time() to select the times I want? Also, because there are obviously thousands of days, all with their own "xx:xx:xx" to "xx:xx:xx", I want to pull that same time period from each day, not just the first "xx:xx:xx" to "xx:xx:xx" it finds as it works its way down the data, if that makes sense. Thanks!!
Consider the dataframe df
from pandas_datareader import data
df = data.get_data_yahoo('AAPL', start='2016-08-01', end='2016-08-03')
df = df.asfreq('H').ffill()
option 1
convert index to series then dt.hour.isin
slc = df.index.to_series().dt.hour.isin([9, 10])
df.loc[slc]
option 2
numpy broadcasting
slc = (df.index.hour[:, None] == [9, 10]).any(1)
df.loc[slc]
response to comment
To then get a range within that time slot per day, use resample + agg + np.ptp (peak to peak)
df.loc[slc].resample('D').agg(np.ptp)
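Coming back to the CSV in the question, here is a small sketch of the same idea using between_time(); the file and column names are the ones from the question, and dayfirst=True matches the "dd/mm/yyyy" format mentioned there:
import pandas as pd

data = pd.read_csv('data.csv')
data['date'] = pd.to_datetime(data['date'], dayfirst=True)  # "dd/mm/yyyy HH:MM:SS" strings
data = data.set_index('date')

# the same hour window is pulled from every day in the data
morning = data.between_time('09:00', '10:00')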
I have a file with intraday prices every ten minutes. [0:41] times in a day. Each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the differences of the ten-minute prices, restricting the differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot take a difference that overlaps days, i.e. from one day's 16:20 to the next day's 09:30. The differences should start at 09:40 - 09:30 and end at 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded, data):
    '''This function accepts a column called rounded from 'data'.
    The 2nd input 'data' is a dataframe.
    '''
    df = rounded.shift(1)
    idf = data.set_index(['date', 'time'])
    data['diff'] = ['000']
    for i in range(0, len(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1] != 1620:
                    data['diff'] = rounded[i] - df[i]
                else:
                    day += 1
                    time += 2
    data[['date', 'time', 'price', 'II', 'diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
In the traceback, I get an AssertionError.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time

# Create a date range with 10-minute intervals, and filter out irrelevant times
times = pd.bdate_range(start=datetime(2012, 11, 14, 0, 0, 0), end=datetime(2012, 11, 17, 0, 0, 0), freq='10T')
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
prices = randn(len(filtered_times))

# Create a MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
tuples = list(zip(*arrays))
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
                       prices
date       time
2012-11-14 09:30:00  0.696054
           09:40:00 -1.263852
           09:50:00  0.196662
           10:00:00 -0.942375
           10:10:00  1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
                       prices
date       time
2012-11-14 09:30:00       NaN
           09:40:00 -1.959906
           09:50:00  1.460514
           10:00:00 -1.139036
           10:10:00  2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
You might want to consider using a TimeSeries instead of a DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it into a TimeSeries and perform a similar function to get the differences:
dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date, i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions, such as resampling. See the pandas time series documentation for more details.
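For instance, a one-line sketch continuing from the ts built above, aggregating the 10-minute prices into 30-minute bars:
half_hourly = ts.resample('30T').last()   # last price in each 30-minute bar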
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007-8/30/2012 with 42 times per day (10-minute intervals). Observe the data cleaning issues.
Here the price data coming from the CSV has length 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have all the times that bdate_range will generate (half days in the market, holidays, etc.).
Your solution is elegant, but I am not sure how to overcome the mismatch between the actual data and the a priori prescribed dataframe.
Your second TimeSeries suggestion seems to still require the construction of a datetime index similar to the first one. For example, if I were to use the following two lines to get the actual data of interest:
DF = pd.read_csv('MR10min.csv')
data = DF.set_index(['date', 'time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_range-style datetime array that is completely informed by the actual data available?
Thank you to @MattiJohn and to anyone with an interest in continuing this discussion.
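One way around the TypeError, sketched under the assumption that the CSV stores the date and time columns as plain strings (e.g. '2007-01-03' and '09:30:00'), is to let pandas parse them directly instead of calling datetime.combine on raw strings, so the index is built from the data that actually exists rather than from a prescribed bdate_range:
import pandas as pd

DF = pd.read_csv('MR10min.csv', dtype={'date': str, 'time': str})
# combine the string date and time columns and parse them in one go
dt_index = pd.DatetimeIndex(pd.to_datetime(DF['date'] + ' ' + DF['time']))
ts = pd.Series(DF['price'].values, index=dt_index)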