I have a dataframe that contains GPS locations of vehicles received at various times in a day. For each vehicle, I want to resample the data hourly so that I have the median report (according to the timestamp) for each hour of the day. For hours where there are no corresponding rows, I want a blank row.
I am using the following code:
for i, j in enumerate(list(df.id.unique())):
    data = df.loc[df.id == j]
    data['hour'] = data['timestamp'].dt.hour
    data_grouped = data.groupby(['imo', 'hour']).median().reset_index()
    data = data_grouped.set_index('hour').reindex(idx).reset_index()  # idx is a list of integers from 0 to 23.
Since my dataframe has millions of ids, it takes a long time to iterate through all of them. Is there a more efficient way of doing this?
Unlike Pandas reindex dates in Groupby, I have multiple rows for each hour, in addition to some hours having no rows at all.
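Tested in the latest version of pandas: convert the hour column to a categorical with all 24 possible categories, then aggregate without a loop. observed=False is what keeps the hours that have no rows; it has long been the default, but stating it explicitly guards against the default changing:
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
df1 = df.groupby(['id','imo','hour'], observed=False).median().reset_index()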
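To make the effect of the categorical hour column concrete, here is a minimal sketch on made-up data. The column name lat and the sample values are assumptions for illustration only, and the imo column is left out to keep the example small.

```python
import pandas as pd

# Hypothetical sample: two vehicles with a handful of reports each.
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2021-01-01 00:10", "2021-01-01 00:20", "2021-01-01 00:50",
        "2021-01-01 05:30",
    ]),
    "lat": [10.0, 11.0, 15.0, 20.0],  # assumed numeric column to aggregate
})

# Declare all 24 hours as categories, including hours that never occur.
df["hour"] = pd.Categorical(df["timestamp"].dt.hour, categories=range(24))

# observed=False keeps the unobserved hours, so every id gets 24 rows;
# hours without reports show up as NaN.
hourly = (
    df.groupby(["id", "hour"], observed=False)["lat"]
      .median()
      .reset_index()
)
print(hourly.head())
```

Note that the result holds every combination of the grouping keys, so with millions of ids and several key columns the output can become very large.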
I have several days of data indexed at 30-minute intervals, Monday to Friday. Some dates may be missing (for example because of holidays). I would like to find the highest value of column high and the lowest value of column low for every past week. For example, if I am calculating for today, the previous week's high and low are the ones marked in yellow in the attached image.
I tried using rolling and resampling, but somehow it is not working. Can anyone help?
You really should add sample data to your question (by that I mean a piece of code/text that can easily be used to create a dataframe for illustrating how the proposed solution works).
Here's a suggestion, with df your dataframe and a column datetime holding datetimes (not strings):
df["week"] = (
    df["datetime"].dt.isocalendar().year.astype(str)
    # zero-pad the week so the string keys sort chronologically for .shift()
    + df["datetime"].dt.isocalendar().week.astype(str).str.zfill(2)
)
mask = df["high"] == df.groupby("week")["high"].transform("max")
df = df.merge(
    df[mask].rename(columns={"low": "high_low"})
    .groupby("week").agg({"high_low": "min"}).shift(),
    on="week", how="left"
).drop(columns="week")
Add a week column to df (year + week) for grouping along weeks.
Extract the rows with the weekly maximum highs by mask (there could be more than one for a week).
Build a corresponding dataframe with the weekly minimum of the lows corresponding to the weekly maximum highs (column named high_low), shift it once to get the value from the previous week, and .merge it to df.
If column datetime doesn't contain datetimes:
df["datetime"] = pd.to_datetime(df["datetime"])
If I have understood correctly, the solution should be:
Get the week number from the date.
Group by the week number and fetch the max and min values.
Group by the week and fetch the max date to get the last date of each week.
Merge all the dataframes into one, based on the date key.
Once these steps are done, you can apply any formatting as required.
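The steps above could be sketched roughly as follows. This is only an illustration on made-up data: the column names datetime, high and low follow the question, while week, week_high, week_low, last_bar and the prev_ prefix are names chosen here. Instead of a week number it uses the Monday of each week as the key, which avoids ambiguity around year boundaries.

```python
import pandas as pd

# Hypothetical bars spread over three consecutive weeks.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2021-11-01 09:30", "2021-11-03 10:00",  # first week
        "2021-11-08 09:30", "2021-11-10 10:00",  # second week
        "2021-11-15 09:30",                      # third week
    ]),
    "high": [10.0, 12.0, 11.0, 15.0, 9.0],
    "low": [8.0, 9.0, 7.0, 10.0, 6.0],
})

# Step 1: a week key (the Monday of each row's week).
df["week"] = df["datetime"].dt.normalize() - pd.to_timedelta(
    df["datetime"].dt.weekday, unit="D"
)

# Steps 2 and 3: weekly max/min and the last timestamp seen in each week.
weekly = df.groupby("week").agg(
    week_high=("high", "max"),
    week_low=("low", "min"),
    last_bar=("datetime", "max"),
)

# Shift by one row so each week carries the values of the week before it.
# This is the previous week present in the data, which differs from the
# previous calendar week if a whole week is missing.
prev = weekly[["week_high", "week_low"]].shift(1).add_prefix("prev_")

# Step 4: merge the previous-week values back onto the original rows.
result = df.merge(prev, left_on="week", right_index=True, how="left")
print(result)
```

Rows in the first week have no previous week, so their prev_ columns are NaN.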
I have a dataframe, called PORResult, of daily temperatures where rows are years and each column is a day (121 rows x 365 columns). I also have an array, called Percentile_90, of a threshold temperature for each day (length=365). For every day of every year in the PORResult dataframe, I want to find out whether the value for that day is higher than the value for that day in the Percentile_90 array. I want to store the results in a new dataframe, called Count (121 rows x 365 columns). To start, the Count dataframe is full of zeros, but if the daily value in PORResult is greater than the daily value in Percentile_90, I want to change the daily value in Count to 1.
This is what I'm starting with:
for i in range(len(PORResult)):
    if PORResult.loc[i] > Percentile_90[i]:
        CountResult[i]+=1
But when I try this I get KeyError:0. What else can I try?
(Edited:)
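Depending on your data structure, I think
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
should do the trick. With one column per day and a threshold array of length 365, the array has to be matched against the columns, hence axis=1; use axis=0 only if the days run down the rows. Generally, the toolset provided in pandas is sufficient that for-looping over a dataframe is unnecessary (as well as remarkably inefficient).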
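A small sketch of this comparison, shrunk to 3 years x 4 days with made-up numbers. The year labels on the index are an assumption; if your index looks like this, it would also explain the KeyError: 0, because .loc looks up labels and 0 is not one of them.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the 121 x 365 table: rows are years, columns are days.
PORResult = pd.DataFrame(
    [[10, 20, 30, 40],
     [15, 15, 35, 35],
     [5, 25, 25, 45]],
    index=[1900, 1901, 1902],  # assumed year labels
)
Percentile_90 = np.array([12, 18, 30, 41])  # one threshold per day (column)

# axis=1 matches the threshold array against the columns, so each row
# (year) is compared element by element with the daily thresholds.
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
print(CountResult)

# Number of exceedances per year, if that is the eventual goal.
print(CountResult.sum(axis=1))
```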
I am trying to drop specific rows in a dataframe where the index is a date with 1hr intervals during specific times of the day. (It is hourly intervals of stock market data).
For instance, 2021-10-26 09:30:00-4:00,2021-10-26 10:30:00-4:00,2021-10-26 11:30:00-4:00, 2021-10-26 12:30:00-4:00 etc.
I want to be able to specify the rows to keep by hh:mm (e.g. keep just the 6:30 and 10:30 data each day) and drop all the rest.
I'm pretty new to programming so have absolutely no idea how to do this.
If your index holds datetime objects and not strings, you can do something like this:
import pandas as pd
df = pd.DataFrame()
# ...input data, etc...
kept = []
for ts in df.index:
    # keep only the 6:30 and 10:30 rows
    if ts.hour in (6, 10) and ts.minute == 30:
        kept.append(ts)
df = df.loc[kept]
See about halfway down this tutorial for more on working with time in pandas:
https://www.dataquest.io/blog/python-datetime-tutorial/
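The same selection can also be written without a loop. This is a sketch under the assumption that the index is a DatetimeIndex; the column name close, the time zone and the sample timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly bars starting at 06:30 New York time.
idx = pd.date_range(
    "2021-10-26 06:30", periods=8,
    freq=pd.Timedelta(hours=1), tz="America/New_York",
)
df = pd.DataFrame({"close": range(8)}, index=idx)

# Times of day to keep, written as zero-padded hh:mm strings.
keep = ["06:30", "10:30"]

# Format each index entry as hh:mm (in its own time zone) and keep the matches.
filtered = df[df.index.strftime("%H:%M").isin(keep)]
print(filtered)
```

Changing the keep list is enough to select other times of day.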
I have a time series hourly_df, containing some hourly data:
import pandas as pd
import numpy as np
hourly_index = pd.date_range(start='2018-01-01', end='2018-01-07', freq='H')
hourly_data = np.random.rand(hourly_index.shape[0])
hourly_df = pd.DataFrame(hourly_data, index=hourly_index)
and I have a DatetimeIndex containing some dates (as many days as I wish), e.g.
daily_index = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-06'])
I want to select every row of hourly_df whose index date is in daily_index, so in my case all hourly data from the 1st, 5th and 6th of January. What is the best way to do this?
If I naively use hourly_df.loc[daily_index], I only get the rows at 0:00:00 for each of the three days. What I want is the hourly data for the whole day for each of the days in daily_index.
One possible solution is to create a filter that takes the date of each element in the index of hourly_df and checks whether or not this date is in daily_index.
day_filter = [hour.date() in daily_index.date for hour in hourly_df.index]
hourly_df[day_filter]
This produces the desired output, but it seems the filter should be avoidable, with the selection done in an expression similar to hourly_df.loc[daily_index.date].
Save daily_index as a dataframe.
Merge on the index using hourly_df.merge(daily_index, how = 'inner', ...)
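As an alternative to the merge, here is a sketch of a different technique: truncate each hourly timestamp to midnight and test membership in daily_index. It reuses the names from the question; selected is a name introduced here.

```python
import numpy as np
import pandas as pd

# Same setup as in the question (hourly frequency given as a Timedelta).
hourly_index = pd.date_range(
    start="2018-01-01", end="2018-01-07", freq=pd.Timedelta(hours=1)
)
hourly_df = pd.DataFrame(np.random.rand(len(hourly_index)), index=hourly_index)
daily_index = pd.to_datetime(["2018-01-01", "2018-01-05", "2018-01-06"])

# normalize() sets every timestamp to 00:00 of its day, so isin() can
# compare whole days against daily_index.
selected = hourly_df[hourly_df.index.normalize().isin(daily_index)]
print(selected.index.min(), selected.index.max(), len(selected))
```

This keeps the original hourly index, which a merge would not do unless the index is carried along as a column.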
I have a pandas dataframe containing 1-minute OHLC data (19724 rows). I want to add 2 new columns that keep track of the minimum and maximum price over the past 3 days (including today up to the current bar in question, and ignoring missing days). However, I am running into performance issues: a %timeit of the for loop reports 57 seconds. I am looking for ways to speed this up (vectorization? I tried, but I must admit I am struggling a bit).
import os
import numpy as np
import pandas as pd

#Import the data and put them in a DataFrame. The DataFrame should contain
#the following fields: DateTime (the index), Open, Close, High, Low, Volume.
#----------------------
#The following assume the first column of the file is Datetime
dfData=pd.read_csv(os.path.join(DataLocation,FileName),index_col='Date')
dfData.index=pd.to_datetime(dfData.index,dayfirst=True)
dfData.index.tz_localize('Singapore')
# Calculate the list of unique dates in the dataframe to find T-2
ListOfDates=pd.to_datetime(dfData.index.date).unique()
#Add a ExtMin and and ExtMax to the dataFrame to keep track of the min and max over a certain window
dfData['ExtMin']=np.nan
dfData['ExtMax']=np.nan
#For each line in the dataframe, calculate the minimum price reached over the past 3 days including today.
def addMaxMin(dfData):
    for index,row in dfData.iterrows():
        #Find the index in ListOfDates, strip out the time, offset by -2 rows
        Start=ListOfDates[max(0,ListOfDates.get_loc(index.date())-2)]
        #Populate the ExtMin and ExtMax columns (.loc replaces the removed .ix indexer)
        dfData.loc[index,'ExtMin']=dfData[(Start<=dfData.index) & (dfData.index<index)]['LOW'].min()
        dfData.loc[index,'ExtMax']=dfData[(Start<=dfData.index) & (dfData.index<index)]['HIGH'].max()
    return dfData
%timeit addMaxMin(dfData)
Thanks.