Search in pandas dataframe - python

Potentially a slightly misleading title but the problem is this:
I have a large dataframe with multiple columns. This looks a bit like
df =
id date value
A 01-01-2015 1.0
A 03-01-2015 1.2
...
B 01-01-2015 0.8
B 02-01-2015 0.8
...
What I want to do is, within each ID, identify the date one week earlier and place the value from that date into e.g. a 'lagvalue' column. The problem is that not all dates exist for all ids, so a simple .shift(7) won't pull the correct value [in this instance I guess I should put a NaN in].
I can do this with a lot of horrible iterating over the dates and ids to find the value, for example this rough idea:
[
    df[df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1)]
      [df['id'] == df['id'].iloc[i]]['value']
    for i in range(len(df.index))
]
but I'm certain there is a 'better' way to do it that cuts down on time and processing that I just can't think of right now.
I could write a function using a groupby on the id and then look within that and I'm certain that would reduce the time it would take to perform the operation - is there a much quicker, simpler way [aka am I having a dim day]?

Basic strategy is, for each id, to:
Use date index
Use reindex to expand the data to include all dates
Use shift to shift 7 spots
Use ffill to do last-value interpolation. I'm not sure if you want this, or possibly bfill, which will use the nearest value less than a week in the past. But it's simple to change. Alternatively, if you want NaN whenever the date exactly 7 days earlier is missing, you can just remove the *fill step completely.
Drop unneeded data
This algorithm gives NaN when the lag is too far in the past.
There are a few assumptions here. In particular that the dates are unique inside each id and they are sorted. If not sorted, then use sort_values to sort by id and date. If there are duplicate dates, then some rules will be needed to resolve which values to use.
import pandas as pd
import numpy as np

# Build two sample ids with different (irregular) date spacing.
dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::3]
A = pd.DataFrame({'date': dates,
                  'id': ['A'] * len(dates),
                  'value': np.random.randn(len(dates))})

dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::5]
B = pd.DataFrame({'date': dates,
                  'id': ['B'] * len(dates),
                  'value': np.random.randn(len(dates))})

df = pd.concat([A, B])

with_lags = []
for key, group in df.groupby('id'):
    group = group.set_index(group.date)
    index = group.index
    # Expand to a full daily index, fill forward, then lag by 7 days.
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    group = group.ffill()
    group['lag_value'] = group.value.shift(7)
    # Keep only the dates that were actually present for this id.
    group = group.loc[index]
    with_lags.append(group)

with_lags = pd.concat(with_lags, axis=0)
with_lags.index = np.arange(with_lags.shape[0])
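If you specifically want the NaN-when-missing behaviour from the question (take the value from exactly one week earlier, NaN otherwise), a self-merge on a shifted date is another option. A rough sketch, not part of the answer above, reusing the df built in the example and assuming (id, date) pairs are unique:
import pandas as pd

# Shift each row's date forward by a week, then merge back onto the
# original (id, date) pairs; rows with no observation exactly a week
# earlier get NaN in lag_value.
lagged = df[['id', 'date', 'value']].copy()
lagged['date'] = lagged['date'] + pd.Timedelta(weeks=1)
lagged = lagged.rename(columns={'value': 'lag_value'})

out = df.merge(lagged, on=['id', 'date'], how='left')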

Related

Front fill pandas DataFrame conditional on time

I have the following daily dataframe:
daily_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='D')
random_values = np.random.randint(1, 3,size=(len(daily_index), 1))
daily_df = pd.DataFrame(random_values, index=daily_index, columns=['A']).replace(1, np.nan)
I want to map each value to a dataframe where each day is expanded to multiple 1 minute intervals. The final DF looks like so:
intraday_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='1min')
intraday_df_full = daily_df.reindex(intraday_index)
# Choose random indices.
drop_indices = np.random.choice(intraday_df_full.index, 5000, replace=False)
intraday_df = intraday_df_full.drop(drop_indices)
In the final dataframe, each day is broken into 1 min intervals, but some are missing (so the minute count on each day is not the same). Some days have a value in the beginning of the day, but nan for the rest.
My question is, only for the days which start with some value in the first minute, how do I front fill for the rest of the day?
I initially tried to simply do the following: daily_df.reindex(intraday_index, method='ffill', limit=1440), but since some rows are missing, this cannot work. Maybe there is a way to limit by time?
Following @Datanovice's comments, this line achieves the desired result:
intraday_df.groupby(intraday_df.index.date).transform('ffill')
where my groupby defines the desired groups on which we want to apply the operation and transform does this without modifying the DataFrame's index.
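As a sanity check, here is a minimal self-contained sketch of that pattern on toy data (my own example, not the asker's frame): grouping by calendar date means ffill only propagates values within a day, so a day whose first minute has a value gets filled for the rest of that day, while a day that starts empty stays empty.
import numpy as np
import pandas as pd

# Two days of 1-minute data: day one has a value in its first minute,
# day two is entirely NaN.
idx = pd.date_range('2015-01-01', periods=2 * 1440, freq='1min')
vals = np.full(len(idx), np.nan)
vals[0] = 2.0  # first minute of day one
intraday_df = pd.DataFrame({'A': vals}, index=idx)

# Forward-fill within each calendar day only.
filled = intraday_df.groupby(intraday_df.index.date).transform('ffill')

print(filled.loc['2015-01-01', 'A'].notna().all())  # True
print(filled.loc['2015-01-02', 'A'].isna().all())   # True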

Find if there is any holidays between two dates in a large dataset?

I am working on a dataset that has some 26 million rows and 13 columns, including two datetime columns arr_date and dep_date. I am trying to create a new boolean column to check if there are any US holidays between these dates.
I am using apply on the entire dataframe but the execution time is too slow. The code has been running for more than 48 hours now on Google Cloud Platform (24GB RAM, 4 cores). Is there a faster way to do this?
The dataset looks like this:
Sample data
The code I am using is -
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
    df['includes_holiday'] = df.apply(lambda x: True if (len(cal.holidays(start=x['dep_date'], end=x['arr_date'])) > 0 and x['num_days'] < 20) else False, axis=1)
    return df
df = mark_holiday(df)
This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.
The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).
from bisect import bisect_left
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000  # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
                   'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.TimedeltaIndex(df['interval'], unit='d')
df = df.drop('interval', axis=1)

# Get a sorted list of holidays since the first start date.
hols = calendar().holidays(df['start_date'].min())

# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
    df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))
>>> df.head(6)
start_date end_date holiday_in_range
0 2015-07-14 2015-07-31 False
1 2010-12-18 2010-12-30 True # 2010-12-24
2 2013-04-06 2013-04-16 False
3 2013-09-12 2013-09-24 False
4 2017-10-28 2017-10-31 False
5 2013-12-14 2013-12-29 True # 2013-12-25
So, for a given start_date timestamp (e.g. 2013-12-14), bisect_left(hols, '2013-12-14') would yield 39, and hols[39] results in 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday is calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date, and holiday_in_range is thus True if the end_date is greater than or equal to this holiday value; otherwise the holiday must fall after this end_date.
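As an aside (not part of the original answer), the same next-holiday lookup can be done fully vectorised with searchsorted, avoiding the per-row apply. A rough sketch, assuming the df and hols built above:
import numpy as np

# Index of the first holiday >= each start_date (same idea as bisect_left).
pos = hols.searchsorted(df['start_date'].values)

# Guard against start dates beyond the last generated holiday.
pos = np.minimum(pos, len(hols) - 1)
next_holiday = hols.values[pos]

df['holiday_in_range'] = (next_holiday >= df['start_date'].values) & \
                         (next_holiday <= df['end_date'].values)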
Have you already considered using pandas.merge_asof for this?
I could imagine that map and apply with lambda functions cannot be executed that efficiently.
UPDATE: Ah sorry, I just read that you only need a boolean for whether there are any holidays in between; that makes it much easier. If that's enough, you just need to perform steps 1-5, then group the DataFrame that results from step 5 by start/end date and use count as the aggregate function to get the number of holidays in each range. You can join this result back to your original dataset, similar to step 8 described below, and fill the remaining values with fillna(0). Then do something like joined_df['includes_holiday'] = joined_df['joined_count_column'] > 0. After that, you can delete joined_count_column from your DataFrame again if you like. (A rough sketch of this boolean variant is given at the end of this answer.)
If you use pandas.merge_asof you could work through these steps (steps 6 and 7 are only necessary if you need all the holidays between start and end in your result DataFrame as well, not just the booleans):
1. Load your holiday records into a DataFrame and index it on the date. The holidays should be one date per row (storing ranges, like Christmas from the 24th-26th in one row, would make it much more complex).
2. Create a copy of your dataframe with just the start and end date columns. UPDATE: every start/end date pair should occur only once in it, e.g. by using groupby.
3. Use merge_asof with a reasonable tolerance value (if you join on the start of the period, use direction='forward'; if you join on the end date, use direction='backward'). merge_asof always behaves like a left join, so afterwards keep only the rows where a holiday was actually matched.
4. As a result you have a merged DataFrame with your start and end columns and the date column from your holiday dataframe. You keep only the records for which a holiday was found within the given tolerance, but you can later merge this data back with your original DataFrame. You will probably now have duplicates of your original records.
5. Then check the joined holiday for your records by comparing it with the start and end columns, and remove the holidays which are not in between.
6. Sort the dataframe you obtained from step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True)). Now insert a number column that numbers the holidays within each period (the ones you obtained after step 5) from 1 to ... (starting from 1 for each period). This is necessary to use unstack in the next step to get the holidays into columns.
7. Add an index on your dataframe based on period start date, period end date and the count column you inserted in step 6. Use df.unstack(level=-1) on the DataFrame you prepared in steps 1-6. What you now have is a condensed DataFrame with your original periods and the holidays arranged columnwise.
8. Now you only have to merge this DataFrame back to your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left')
The result of this is a DataFrame with your original data containing the date ranges, and for each date range the holidays that fall within the period are stored in separate columns after the data. Loosely speaking, the numbering in step 6 assigns the holidays to the columns and has the effect that the holidays are always assigned from left to right to the columns (you wouldn't have a holiday in column 3 if column 1 is empty).
Step 6 is probably also a bit tricky, but you can do it, for example, by adding a series filled with a range and then fixing it so the numbering starts at 0 or 1 in each group, by using shift or by grouping by start/end with aggregate({'idcol': 'min'}) and joining the result back to subtract it from the value assigned by the range sequence.
In all, I think it sounds more complicated than it is, and it should perform quite efficiently. Especially if your periods are not that large, because then after step 5 your result set should be much smaller than your original dataframe; but even if that is not the case, it should still be quite efficient, since it can use compiled code.
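For concreteness, here is a rough sketch of the boolean variant described in the UPDATE above. It is an illustration of the idea, not the answerer's exact code: the start_date/end_date column names follow the sample in the other answer, and the 30-day tolerance and the range of generated holidays are assumptions.
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Step 1: one holiday per row, sorted by date.
hols = pd.DataFrame({'holiday': USFederalHolidayCalendar().holidays('2010-01-01', '2020-12-31')})

# Step 2: unique (start, end) pairs, sorted for merge_asof.
pairs = df[['start_date', 'end_date']].drop_duplicates().sort_values('start_date')

# Step 3: nearest holiday on or after each start date, within a tolerance.
matched = pd.merge_asof(pairs, hols,
                        left_on='start_date', right_on='holiday',
                        direction='forward',
                        tolerance=pd.Timedelta(days=30))

# Steps 4-5: a holiday counts only if it also falls on or before the end date.
matched['includes_holiday'] = matched['holiday'].le(matched['end_date'])

# Step 8 (simplified): join the boolean back onto the original rows.
out = df.merge(matched[['start_date', 'end_date', 'includes_holiday']],
               on=['start_date', 'end_date'], how='left')
out['includes_holiday'] = out['includes_holiday'].fillna(False)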

Manual Date Filter in Pandas

I am working on processing a very large data set using pandas into more manageable data frames. I have a loop that goes through and splits the data frame into smaller data frames based on a leading ID number, and I then sort by the date column. However, I notice that after everything runs there are still some issues with dates not being sorted correctly. I want to create a manual filter that basically loops through the date column and checks whether the next date is greater than or equal to the previous date. This would ideally eliminate issues where the date column goes something like (obviously in more of a data frame format):
[2017,2017,2018,2018,2018,2017,2018,2018]
I am writing some code to take care of this; however, I keep getting errors and was hoping someone could point me in the right direction.
for i in range(len(Rcols)):
    dfs[i] = data.filter(regex=f'{Rcols[i]}-')
    dfs[i]['Engine'] = data['Operation_ID:-PARAMETER_NAME:']
    dfs[i].set_index('Engine', inplace=True)
    dfs[i][f'{Rcols[i]}-DATE_TIME_START'] = pd.to_datetime(dfs[i][f'{Rcols[i]}-DATE_TIME_START'], errors='ignore')
    dfs[i].sort_values(by=f'{Rcols[i]}-DATE_TIME_START', ascending=True, inplace=True)
    for index, item in enumerate(dfs[i][f'{Rcols[i]}-DATE_TIME_START']):
        if dfs[i][f'{Rcols[i]}-DATE_TIME_START'][index + 1] >= dfs[i][f'{Rcols[i]}-DATE_TIME_START'][index]:
            continue
        else:
            dfs[i].drop(dfs[i][index])
Where Rcols is just a list of the column header leading IDs. dfs is a large list of names that call pandas data frames.
Thanks
This isn't particularly "manual", but you can use pd.Series.shift. Here's a minimal example, but the principle works equally well with a series of dates:
df = pd.DataFrame({'Years': [2017,2017,2018,2018,2018,2017,2018,2018]})
mask = df['Years'].shift() > df['Years']
df = df[~mask]
print(df)
Years
0 2017
1 2017
2 2018
3 2018
4 2018
6 2018
7 2018
Notice how the row with index 5 has been dropped since 2017 < 2018 (the row before). You can extend this for multiple columns via a for loop.
You should under no circumstances modify rows while you are iterating over them. This is spelt out in the docs for pd.DataFrame.iterrows:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
However, this becomes irrelevant when there is a vectorised solution available, as described above.
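One caveat worth noting (my addition, not part of the answer above): the shifted comparison only looks at the immediately preceding row, so a long out-of-order run may need the mask applied more than once. If the intent is to keep only rows that never fall behind anything seen so far, a running-maximum mask is a possible alternative; the same toy data is reused here:
import pandas as pd

df = pd.DataFrame({'Years': [2017, 2017, 2018, 2018, 2018, 2017, 2018, 2018]})

# Keep rows that are >= the running maximum seen so far.
mask = df['Years'] >= df['Years'].cummax()
print(df[mask])
This works the same way on a column of datetimes, since cummax is defined for them as well.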

Pandas.DataFrame - find the oldest date for which a value is available

I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the oldest date for which data is available in the shorter series, and remove the data in the 2 columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I should add that there may be NAs in one of the columns for more recent dates).
You can use idxmax on the inverted series s = df['osr'][::-1] and then take a subset of df:
print(df)
# osr go
#Date
#1990-08-17 NaN 239.75
#1990-08-20 NaN 251.50
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
s = df['osr'][::-1]
print(s)
#Date
#1990-08-23 351.75
#1990-08-22 353.25
#1990-08-21 352.00
#1990-08-20 NaN
#1990-08-17 NaN
#Name: osr, dtype: float64
maxnull = s.isnull().idxmax()
print(maxnull)
#1990-08-20 00:00:00
print(df[df.index > maxnull])
# osr go
#Date
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
EDIT: New answer based upon comments/edits
It sounds like the data is sequential and once you have lines that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good. Or if you don't care about dropping rows in the middle... it depends on how sequential you need to be. If the data needs to be sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much here by way of structure in your dataframe, so I am going to make assumptions. I'm going to assume you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series (the one with fewer non-null values) by just using
shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']
Now we want the oldest date for which it has data
remove_date = shorter_col.first_valid_index()
Now we want to remove data before that date
df = df[df.index >= remove_date]
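For the frame shown in the question, the same thing can also be written directly with first_valid_index; a small sketch using the question's sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'osr': [np.nan, np.nan, 352.00, 353.25, 351.75],
     'go': [239.75, 251.50, 265.00, 274.25, 290.25]},
    index=pd.to_datetime(['1990-08-17', '1990-08-20', '1990-08-21',
                          '1990-08-22', '1990-08-23']))

# Keep everything from the oldest date at which 'osr' has a value.
df = df.loc[df['osr'].first_valid_index():]
print(df)  # rows from 1990-08-21 onwards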

Load un-aligned time series into a DataFrame, with one index?

I am starting out learning this wonderful tool, and I am stuck at the simple task of loading several time series and aligning them with a "master" date vector.
For example: I have a csv file: Data.csv where the first row contains the headers "Date1, Rate1, Date2, Rate2" where Date1 is the dates of the Rate1 and Date2 are the dates of Rate2.
In this case, Rate2 has more observations (the start date is the same as Date1, but the end date is further apart than the end date in Date1, and there are fewer missing values), and everything should be indexed according to Date2.
What is the preferred way to get the following DataFrame? (or accomplishing something similar)
index(Date2) Rate1 Rate2
11/12/06 1.5 1.8
12/12/06 NaN 1.9
13/12/06 1.6 1.9
etc
etc
11/10/06 NaN 1.2
12/10/06 NaN 1.1
13/10/06 NaN 1.3
I have tried to follow the examples in the official pandas.pdf and Googling, but to no avail. (I even bought the pre-edition of Mr McKinney's Pandas book, but the chapters concerning Pandas were not ready yet :( )
Is there a nice recipe for this?
Thank you very much
EDIT: Concerning the answer of separating the series into two .CSV files:
But what if I have very many time series, e.g
Date1 Rate1 Date2 Rate2 ... DateN RateN
And all I know is that the dates should be almost the same, with exceptions coming from series that contain missing values (where there is no Date or Rate entry) (this would be an example of some financial economics time series, by the way)
Is the preferred way to load this dataset still to split every series into a separate .CSV?
EDIT2: archlight is completely right, just doing read_csv will mess things up.
Essentially my question would then boil down to: how to join several unaligned time series, where each series has a date column, and column for the series itself (.CSV file exported from Excel)
Thanks again
I don't think splitting up the data into multiple files is necessary. How about loading the file with read_csv and converting each date/rate pair into a separate time series? So your code would look like:
from pandas import DataFrame, Series, read_csv

data = read_csv('foo.csv')
ts1 = Series(data['rate1'].values, index=data['date1'])
ts2 = Series(data['rate2'].values, index=data['date2'])
Now, to join then together and align the data in a DataFrame, you can do:
frame = DataFrame({'rate1': ts1, 'rate2': ts2})
This will form the union of the dates in ts1 and ts2 and align all of the data (inserting NA values where appropriate).
Or, if you have N time series, you could do:
all_series = {}
for i in range(1, N + 1):
    all_series['rate%d' % i] = Series(data['rate%d' % i].values, index=data['date%d' % i])
frame = DataFrame(all_series)
This is a very common pattern in my experience
if you are sure that Date1 is a subset of Date2 and Date2 contains no empty values, you can simply do
df = read_csv('foo.csv', index_col=2, parse_dates=True)
df = df[["rate1", "rate2"]]
but it will be complicated if Date2 has dates which Date1 doesn't have. I suggest you put each date/rate pair in a separate file with the date as a common header
df1 = read_csv('foo1.csv', index_col=0, parse_dates=True)
df2 = read_csv('foo2.csv', index_col=0, parse_dates=True)
df1.join(df2, how="outer")
EDIT:
This method doesn't look good, so for the NaN values in your datetime column you can do something like
from datetime import datetime
from pandas import notnull

dateindex2 = [datetime(int("20" + x.split("/")[2]), int(x.split("/")[0]), int(x.split("/")[1]))
              for x in filter(notnull, df['Date2'].values)]
ts2 = Series(df["Rate2"].dropna().values, index=dateindex2)
#same for ts1
df2 = DataFrame({"rate1":ts1, "rate2":ts2})
The thing is, you have to make sure there is no case where a date exists but the rate doesn't, because dropna() will shift the records and mismatch them with the index.
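A more modern sketch of the same idea (my wording, not taken from either answer): parse each date column with pd.to_datetime, build one Series per date/rate pair, and let the DataFrame constructor align them on the union of dates. The day-first date format and the Date1/Rate1 column names are assumptions based on the question.
import pandas as pd

data = pd.read_csv('Data.csv')

series = {}
for i in (1, 2):  # or range(1, N + 1) for many pairs
    dates = pd.to_datetime(data['Date%d' % i], format='%d/%m/%y', errors='coerce')
    rates = data['Rate%d' % i]
    keep = dates.notna() & rates.notna()  # drop rows where either piece is missing
    series['Rate%d' % i] = pd.Series(rates[keep].values, index=dates[keep])

frame = pd.DataFrame(series).sort_index()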
