Python pandas.DatetimeIndex piecewise dataframe slicing

I have a dataframe with a pandas DatetimeIndex. I need to take many slices from it (for plotting a piecewise graph with matplotlib). In other words, I need a new DataFrame that is a subset of the first one.
More precisely, I need all rows that fall between 9 and 16 o'clock, but only if they are within a given date range. Fortunately I have only one date range and one time range to apply.
What would be a clean way to do that? Thanks.

The first step is to set the index of the dataframe to the column where you store the time. Once the index is based on time, you can subset the dataframe easily.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S.%f')  # assuming you have a col called 'time'
df['time'] = df['time'].dt.strftime('%H:%M:%S.%f')  # back to zero-padded strings so they sort lexicographically
df.set_index('time', inplace=True)
df.sort_index(inplace=True)  # string-based slicing requires a sorted index
new_df = df[startTime:endTime]  # startTime and endTime are strings, e.g. '09:00:00.000000'
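If you keep the original DatetimeIndex instead of converting to strings, pandas can do both filters directly with partial-string slicing and DataFrame.between_time. A minimal sketch, assuming the index is a DatetimeIndex; the frame, dates, and hours below are made up:
import pandas as pd

# toy frame: four days of 15-minute data
df = pd.DataFrame({'value': range(4 * 96)},
                  index=pd.date_range('2020-01-01', periods=4 * 96, freq='15min'))

subset = df.loc['2020-01-02':'2020-01-03']      # hypothetical date range
subset = subset.between_time('09:00', '16:00')  # then keep only 09:00-16:00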

Related

Filter datetime index by date

Can you please explain why this code doesn't work and how to make it work:
index = pd.date_range('1/1/2000', periods=9, freq='T')
dates = index.date
index.loc(dates[0])
I tried other solutions like:
index = pd.date_range('1/1/2000', periods=9, freq='T')
dates = pd.to_datetime(index.date)
index.loc(dates[0])
As you can see, I want to extract one date from the datetime object.
When you create index, it is a DatetimeIndex object, which does not have a loc attribute. A DatetimeIndex has no indexer of its own because it is itself used as an index. You can access its elements with square brackets, as with lists.
It is not clear what exactly you want to do.
You can use index[0] to access the first element, or turn it into a list, numpy array, or DataFrame using .to_list(), .to_numpy(), or .to_frame() for easier manipulation.
To extract a date from the index, index[0].date() is enough.
Also note that when you create dates, all the dates are identical, since the index elements differ from each other only by minutes.
First, create a date_range using a 5-hour frequency (so there are several points in one day). The last line below returns all index entries from a specific date; you can pick any date and write index[index.date == your_date] to use it.
index = pd.date_range('1/1/2000', periods=100, freq='5H')
dates = index.date
index[index.date == dates[0]]
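For example, to pull out all timestamps falling on one particular day (the target date below is made up):
import datetime
target = datetime.date(2000, 1, 2)  # hypothetical date to select
index[index.date == target]         # all index entries falling on that day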

Change date format for specific cells in Pandas

I am working with a big dataset (more than 2 million rows x 10 columns) that has a date column. Some of the rows are formatted correctly (e.g. 2020/04/08) but I want to change the format of others that are not (concretely, those are formatted as 20200408).
I want to change the format of those that are wrong but I don't want to iterate through all the rows.
Normally, for a small dataset I would do
from datetime import datetime

for i in range(0, len(df)):
    cell = str(df.iloc[i]['date'])
    if len(cell) == 8:
        df.iat[i, df.columns.get_loc('date')] = datetime.strptime(cell, '%Y%m%d').strftime('%Y-%m-%d')
but I know this is far from optimal.
How can I use the power of pandas to avoid the loop here?
Thanks!
Filter the rows by Series.str.len, then select the column with DataFrame.loc and the mask, convert to datetimes with to_datetime, and finally convert to the custom format with Series.dt.strftime:
m = df['date'].str.len() == 8
df.loc[m, 'date'] = pd.to_datetime(df.loc[m, 'date'], format='%Y%m%d').dt.strftime('%Y-%m-%d')
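A minimal demonstration on two made-up values (the column name 'date' comes from the question):
import pandas as pd

df = pd.DataFrame({'date': ['2020/04/08', '20200408']})
m = df['date'].str.len() == 8
df.loc[m, 'date'] = pd.to_datetime(df.loc[m, 'date'], format='%Y%m%d').dt.strftime('%Y-%m-%d')
# df['date'] is now ['2020/04/08', '2020-04-08']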
Try letting to_datetime infer the format for the whole column instead of applying a function row by row:
df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')  # pandas 2.x may need format='mixed' for mixed formats

How to manipulate your Data Set based on the values of your index?

I have this dataset, wind_modified. In this dataset, the columns are the locations, the index is the date, and the values in the columns are wind speeds.
Let's say I want to find the average wind speed in January for each location. How do I use groupby or some other method to find the average?
Would it be possible without resetting the INDEX?
Edit - This is the actual dataset. I have combined the three columns "Yr", "Mo", "Dy" into one, "DATE", and made it the INDEX.
I imported the dataset by using pd.read_fwf.
And "DATE" is of type datetime64[ns].
Sure, if you want all Januaries across all years, first filter them by boolean indexing and take the mean:
#if necessary convert index to DatetimeIndex
#df.index = pd.to_datetime(df.index)
df1 = df[df.index.month == 1].mean().to_frame().T
Or if you need each January for each year separately, after filtering use groupby with DatetimeIndex.year and aggregate the mean:
df2 = df[df.index.month == 1]
df3 = df2.groupby(df2.index.year).mean()
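A quick demonstration on made-up numbers (the location column names here are invented for the sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(730, 3),
                  columns=['LOC_A', 'LOC_B', 'LOC_C'],  # hypothetical locations
                  index=pd.date_range('1961-01-01', periods=730, freq='D'))

df2 = df[df.index.month == 1]
df3 = df2.groupby(df2.index.year).mean()  # one row per year (1961, 1962), one column per location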

Pandas - New Row for Each Day in Date Range

I have a Pandas df with one column (Reservation_Dt_Start) representing the start of a date range and another (Reservation_Dt_End) representing the end of a date range.
Rather than each row having a date range, I'd like to expand each row to have as many records as there are dates in the date range, with each new row representing one of those dates.
See the two pics below for an example input and the desired output.
The code snippet below works! However, for every 250 rows in the input table, it takes one second to run. Given that my input table is 120,000,000 rows, this code would take about a week to run.
pd.concat([pd.DataFrame({'Book_Dt': row.Book_Dt,
                         'Day_Of_Reservation': pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End),
                         'Pickup': row.Pickup,
                         'Dropoff': row.Dropoff,
                         'Price': row.Price},
                        columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
           for i, row in df.iterrows()], ignore_index=True)
There has to be a faster way to do this. Any ideas? Thanks!
pd.concat with a large dataset gets pretty slow, as it copies every intermediate frame into the new dataframe, and you are attempting to build 120M of them. I would try to work with this data as a simple list of tuples instead, then convert to a dataframe at the end.
e.g.
rows = []  # plain list of tuples; avoid calling it `list`, which shadows the built-in
for row in df.itertuples(index=False):  # itertuples is also cheaper than iterrows
    # pd.date_range still works here to expand one row into its dates
    for d in pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End):
        rows.append((row.Book_Dt, d, row.Pickup, row.Dropoff, row.Price))
Finally you can convert the list of tuples to a dataframe:
df_out = pd.DataFrame(rows, columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
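If your pandas version has DataFrame.explode (0.25+), you can skip the inner loop entirely by building a list column of dates and exploding it. A sketch under that version assumption, with out as a hypothetical output name:
out = df.assign(Day_Of_Reservation=[pd.date_range(s, e) for s, e in
                                    zip(df['Reservation_Dt_Start'], df['Reservation_Dt_End'])])
out = out.explode('Day_Of_Reservation')  # one row per date in each range
out = out.drop(columns=['Reservation_Dt_Start', 'Reservation_Dt_End'])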

How to make this row-wise operation performant (python)?

My issue is very simple, but I just can't wrap my head around it:
I have two dataframes:
A time series dataframe with two columns: Timestamp and DataValue
A time interval dataframe with start and end timestamps and a label
What I want to do:
Add a third column to the timeseries that yields the labels according to the time interval dataframe.
Every timepoint needs to have an assigned label designated by the time interval dataframe.
This code works:
TimeSeries_labelled = TimeSeries.copy(deep=True)
TimeSeries_labelled["State"] = 0
for index in Timeintervals_States.index:
    for entry in TimeSeries_labelled.index:
        if Timeintervals_States.loc[index, "start"] <= TimeSeries_labelled.loc[entry, "Timestamp"] <= Timeintervals_States.loc[index, "end"]:
            TimeSeries_labelled.loc[entry, "State"] = Timeintervals_States.loc[index, "state"]
But it is really slow. I tried to make it shorter and faster with Python's built-in filter functions, but failed miserably.
Please help!
I don't know TimeSeries specifically, but with a dataframe whose timestamps are datetime objects you could use something like the following:
import pandas as pd
# Create the third column in the target dataframe
df_timeseries['label'] = pd.Series('', index=df_timeseries.index)
# Loop over the dataframe containing start and end timestamps
for index, row in df_start_end.iterrows():
    # Create a boolean mask to filter data
    mask = (df_timeseries['timestamp'] > row['start']) & (df_timeseries['timestamp'] < row['end'])
    df_timeseries.loc[mask, 'label'] = row['label']
For each row of the dataframe containing start and end timestamps, this assigns that row's label to every row of the timeseries dataframe that matches the mask.
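If the intervals do not overlap, the per-interval loop can be replaced by a single lookup with pd.IntervalIndex. A sketch under that non-overlap assumption, reusing the df_start_end / df_timeseries names from above:
import numpy as np
import pandas as pd

# one interval per labelled range; closed='both' keeps boundary timestamps
intervals = pd.IntervalIndex.from_arrays(df_start_end['start'], df_start_end['end'], closed='both')
pos = intervals.get_indexer(df_timeseries['timestamp'])  # -1 where no interval matches
labels = df_start_end['label'].to_numpy()[pos]
df_timeseries['label'] = np.where(pos == -1, '', labels)  # blank label outside all intervals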
