Performance issue on operations involving pandas dataframes - python

I have a pandas dataframe containing 1-minute OHLC data (19,724 rows). I am looking to add 2 new columns keeping track of the minimum and maximum price over the past 3 days (including today, up to the current bar in question, and ignoring missing days). However, I am running into performance issues: a %timeit of the for loop reports 57 seconds... I am looking for ways to speed this up (vectorization? I tried, but I am struggling a little, I must admit).
#Import the data and put them in a DataFrame. The DataFrame should contain
#the following fields: DateTime (the index), Open, Close, High, Low, Volume.
#----------------------
#The following assumes the first column of the file is Datetime
dfData = pd.read_csv(os.path.join(DataLocation, FileName), index_col='Date')
dfData.index = pd.to_datetime(dfData.index, dayfirst=True)
dfData.index = dfData.index.tz_localize('Singapore')  # tz_localize returns a new index, so assign it back
# Calculate the list of unique dates in the dataframe to find T-2
ListOfDates = pd.to_datetime(dfData.index.date).unique()
# Add an ExtMin and an ExtMax column to the DataFrame to keep track of the min and max over a certain window
dfData['ExtMin'] = np.nan
dfData['ExtMax'] = np.nan

# For each line in the dataframe, calculate the minimum price reached over the past 3 days including today.
def addMaxMin(dfData):
    for index, row in dfData.iterrows():
        # Find the index in ListOfDates, strip out the time, offset by -2 rows
        Start = ListOfDates[max(0, ListOfDates.get_loc(index.date()) - 2)]
        # Populate the ExtMin and ExtMax columns (.loc replaces the deprecated .ix)
        dfData.loc[index, 'ExtMin'] = dfData[(Start <= dfData.index) & (dfData.index < index)]['LOW'].min()
        dfData.loc[index, 'ExtMax'] = dfData[(Start <= dfData.index) & (dfData.index < index)]['HIGH'].max()
    return dfData

%timeit addMaxMin(dfData)
Thanks.
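One possible vectorized sketch (not the asker's method): if "past 3 days" can be approximated by a 3-calendar-day time window, a time-based rolling window with closed='left' excludes the current bar and avoids the Python-level loop entirely. The LOW/HIGH column names follow the question; counting 3 distinct trading dates rather than calendar days would need extra handling.

```python
import pandas as pd

# Toy OHLC-like data, one bar per day for brevity
idx = pd.date_range('2020-01-01', periods=5, freq='D')
df = pd.DataFrame({'LOW': [5, 3, 4, 2, 6],
                   'HIGH': [7, 8, 6, 9, 10]}, index=idx)

# Rolling 3-calendar-day window; closed='left' excludes the current bar,
# matching the strict "dfData.index < index" condition in the loop
df['ExtMin'] = df['LOW'].rolling('3D', closed='left').min()
df['ExtMax'] = df['HIGH'].rolling('3D', closed='left').max()
```

The first bar has an empty window and comes out as NaN, just like the loop version's first row.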

Related

How to compare elements of one dataframe to another?

I have a dataframe, called PORResult, of daily temperatures where rows are years and each column is a day (121 rows x 365 columns). I also have an array, called Percentile_90, of a threshold temperature for each day (length=365). For every day of every year in the PORResult dataframe, I want to find out if the value for that day is higher than the value for that day in the Percentile_90 array. I want to store the results in a new dataframe, called Count (121 rows x 365 columns). To start, the Count dataframe is full of zeros; if the daily value in PORResult is greater than the daily value in Percentile_90, I want to change the daily value in Count to 1.
This is what I'm starting with:
for i in range(len(PORResult)):
    if PORResult.loc[i] > Percentile_90[i]:
        CountResult[i] += 1
But when I try this I get KeyError:0. What else can I try?
(Edited:)
Depending on your data structure, I think
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
should do the trick (axis=1 broadcasts the 365-long threshold across the columns, i.e. the days). Generally, the toolset provided in pandas is sufficient that for-looping over a dataframe is unnecessary (as well as remarkably inefficient).
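A minimal sketch of this kind of elementwise comparison (toy data; with the years-by-days layout described, the per-day thresholds align with the columns):

```python
import pandas as pd

# Rows play the role of years, columns the role of days
PORResult = pd.DataFrame([[1, 5, 3],
                          [4, 2, 6]])
Percentile_90 = pd.Series([2, 3, 4])  # one threshold per day (column)

# axis=1 broadcasts the per-day thresholds across the columns
Count = PORResult.gt(Percentile_90, axis=1).astype(int)
```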

How do I drop rows in a pandas dataframe based on the time of day

I am trying to drop specific rows in a dataframe where the index is a date with 1hr intervals during specific times of the day. (It is hourly intervals of stock market data).
For instance, 2021-10-26 09:30:00-04:00, 2021-10-26 10:30:00-04:00, 2021-10-26 11:30:00-04:00, 2021-10-26 12:30:00-04:00, etc.
I want to be able to specify the row to keep by hh:mm (e.g. keep just the 6:30, 10:30 data each day), and drop all the rest.
I'm pretty new to programming so have absolutely no idea how to do this.
If your columns are datetime objects (Timestamps) and not strings, you can do something like this:
df = pd.DataFrame()
# ...input data, etc...
kept = []
for col in df.columns:
    # a Timestamp exposes .hour and .minute directly (.dt is only for Series)
    if col.hour in (6, 10) and col.minute == 30:
        kept.append(col)
df = df[kept]
See the section about halfway down on working with time in pandas at this source:
https://www.dataquest.io/blog/python-datetime-tutorial/
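Since the question describes the timestamps in the index rather than in the columns, here is an alternative sketch (my assumption about the layout) that filters rows by time of day:

```python
import pandas as pd

# Hourly intraday rows, as in the question's example
idx = pd.date_range('2021-10-26 06:30', periods=6, freq='h')
df = pd.DataFrame({'price': range(6)}, index=idx)

# Keep only the 06:30 and 10:30 rows of each day
mask = ((df.index.hour == 6) | (df.index.hour == 10)) & (df.index.minute == 30)
df = df[mask]
```

Note that df.between_time('06:30', '10:30') would instead keep everything between the two times, so a boolean mask is the more direct fit here.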

Pandas: Resampling Hourly Data for each Group

I have a dataframe that contains gps locations of vehicles received at various times in a day. For each vehicle, I want to resample hourly data such that I have the median report (according to the time stamp) for each hour of the day. For hours where there are no corresponding rows, I want a blank row.
I am using the following code:
for i, j in enumerate(list(df.id.unique())):
    data = df.loc[df.id == j]
    data['hour'] = data['timestamp'].dt.hour
    data_grouped = data.groupby(['imo', 'hour']).median().reset_index()
    data = data_grouped.set_index('hour').reindex(idx).reset_index()  # idx is a list of integers from 0 to 23.
Since my dataframe has millions of id's it takes me a lot of time to iterate though all of them. Is there an efficient way of doing this?
Unlike Pandas reindex dates in Groupby, I have multiple rows for each hour, in addition to some hours having no rows at all.
Tested in the latest version of pandas: convert the hour column to a categorical with all possible categories, then aggregate without a loop:
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
df1 = df.groupby(['id','imo','hour']).median().reset_index()
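A toy demonstration of the categorical trick (hypothetical vehicle data; observed=False is passed explicitly because newer pandas versions otherwise warn about the default for categorical groupers). Hours with no reports still appear, as NaN rows:

```python
import pandas as pd

# One vehicle with reports in hours 1 and 5 only
df = pd.DataFrame({'id': [1, 1, 1],
                   'imo': [9, 9, 9],
                   'timestamp': pd.to_datetime(['2022-01-01 01:10',
                                                '2022-01-01 01:50',
                                                '2022-01-01 05:30']),
                   'speed': [10.0, 14.0, 20.0]})

# All 24 hours become categories, so empty hours survive the groupby
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
df1 = (df.groupby(['id', 'imo', 'hour'], observed=False)['speed']
         .median().reset_index())
```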

How to calculate moving average incrementally with daily data added to data frame in pandas?

I have daily data and want to calculate 5-day, 30-day and 90-day moving averages per user and write them out to a CSV. New data comes in every day. How do I calculate these averages for the new data only, assuming I will load the data frame with the last 89 days of data plus today's data?
date        user  daily_sales  5_days_MA  30_days_MV  90_days_MV
2019-05-01  1     34
2019-05-01  2     20
....
2019-07-18  .....
The number of rows per day is about 1 million. If data for 90 days is too much, 30 days is OK.
You can apply the rolling() method to your dataset if it's in DataFrame format:
your_df['MA_30_days'] = df[where_to_apply].rolling(window = 30).mean()
If you need a different window for the moving average, just change the window parameter. In my example I used mean(), but you can choose another statistic as well.
This code will create another column named 'MA_30_days' with the calculated moving average in your DataFrame.
You can also create another DataFrame, collect all the moving averages there by looping over your dataset, and save it to CSV format as you wanted:
your_df.to_csv('filename.csv')
In your case the calculation should consider only the newest data. If you want to perform this on just the latest data, slice it. Note, however, that the very first rows will be NaN (depending on the window):
df[where_to_apply][-90:].rolling(window = 30).mean()
This will calculate the moving average over the last 90 rows of a specific column in some df; the first 29 rows will be NaN. If your latest 90 rows are all meaningful data, you can start the calculation earlier than the last 90 rows - it depends on the window size.
If the df already contains yesterday's moving average, and just the new day's simple MA is required, I would say use this approach:
MAlength = 90
df.loc[day, 'MA'] = (
    (df.loc[day-1, 'MA'] * MAlength)   #expand yesterday's MA value
    - df.loc[day-MAlength, 'Price']    #remove the oldest price
    + df.loc[day, 'Price']             #add the newest price
) / MAlength                           #re-average
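A quick self-contained sketch (toy numbers, window of 3 instead of 90 for brevity) checking that this incremental update matches rolling().mean():

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0])
window = 3
ma = prices.rolling(window).mean()

# Incremental update for the newest day:
# expand yesterday's MA, drop the oldest price, add the newest, re-average
day = len(prices) - 1
incremental = (ma.iloc[day - 1] * window
               - prices.iloc[day - window]
               + prices.iloc[day]) / window
```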

A Multi-Index Construction for Intraday TimeSeries (10 min price data)

I have a file with intraday prices every ten minutes. [0:41] times in a day. Each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the differences of the ten-minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot overlap differences between days of price from 16:20 - 09:30. The differences should start as 09:40 - 09:30 and end as 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded, data):
    '''This function accepts a column called rounded from 'data'
    The 2nd input 'data' is a dataframe
    '''
    df = rounded.shift(1)
    idf = data.set_index(['date', 'time'])
    data['diff'] = ['000']
    for i in range(0, len(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1] != 1620:
                    data['diff'] = rounded[i] - df[i]
                else:
                    day += 1
                    time += 2
    data[['date', 'time', 'price', 'II', 'diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
On the traceback - I get an Assertion Error.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time
# Create date range with 10 minute intervals, and filter out irrelevant times
times = pd.bdate_range(start=datetime(2012, 11, 14), end=datetime(2012, 11, 17), freq='10min')
filtered_times = [x for x in times if time(9, 30) <= x.time() <= time(16, 20)]
prices = randn(len(filtered_times))
# Create MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
tuples = list(zip(*arrays))  # zip returns an iterator in Python 3, so materialise it
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
prices
date time
2012-11-14 09:30:00 0.696054
09:40:00 -1.263852
09:50:00 0.196662
10:00:00 -0.942375
10:10:00 1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
prices
date time
2012-11-14 09:30:00 NaN
09:40:00 -1.959906
09:50:00 1.460514
10:00:00 -1.139036
10:10:00 2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
You might want to consider using a TimeSeries instead of a DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it into a TimeSeries and perform a similar function to get the differences:
from datetime import datetime  # needed for datetime.combine
dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date, i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions such as resampling. See here for more about time series in pandas.
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007-8/30/2012 for 42 times per day (10 minute intervals). Observe the data cleaning issues.
Here the price data coming from the csv has length 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have all the times that bdate_range will generate (half days in the market, holidays, etc.).
Your solution is elegant. But I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second TimeSeries suggestion seems to still require construction of a datetime index similar to the first one. For example, if I were to use the following lines to get the actual data of interest:
DF=pd.read_csv('MR10min.csv')
data=DF.set_index(['date','time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_range-style index completely informed by the actual data available?
Thank you to @MattiJohn and to anyone with an interest in continuing this discussion.
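Regarding the TypeError: when read from a CSV, the date and time columns are strings, so datetime.combine cannot be used directly on them. One sketch (assuming the columns parse as standard date/time strings) is to concatenate the strings and let pd.to_datetime build the index from the rows that actually exist, rather than from a prescribed bdate_range:

```python
import pandas as pd

# Toy stand-in for the CSV contents; real data would come from pd.read_csv
DF = pd.DataFrame({'date': ['2012-11-14', '2012-11-14', '2012-11-15'],
                   'time': ['09:30', '09:40', '09:30'],
                   'price': [1.0, 2.0, 3.0]})

# Build the DatetimeIndex directly from the data that is actually present
dt_index = pd.DatetimeIndex(pd.to_datetime(DF['date'] + ' ' + DF['time']))
ts = pd.Series(DF['price'].values, index=dt_index)
```

Because the index is built from the file itself, half days and holidays no longer cause a length mismatch.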
