Finding consecutive days in the pandas dataframe - python

I have a dataframe:
ColA ColB
0 A 1/2/2020
1 A 1/3/2020
2 A 1/4/2020
3 A 1/10/2020
4 B 1/3/2020
5 B 1/19/2020
6 C 1/2/2020
7 C 1/7/2020
8 D 1/8/2020
Now I want to find out the value in ColA which has three consecutive days in ColB.
Output:
The answer would be A, since it has 1/2/2020, 1/3/2020 and 1/4/2020 in ColB.

A general approach would be like this:
# 1. To make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])
# 2. Standardize the dates by offsetting them
df["ColB_std"] = df["ColB"] - pd.to_timedelta(range(df.shape[0]), 'day')
# 3. Counting each instance of ColA and standardized date
s = df.groupby(["ColA", "ColB_std"])["ColB_std"].count()
# 4. Getting elements from ColA that have at least 1 sequence of at least length 3
colA = s[ s >= 3 ].index.get_level_values(0).unique().values
# 5. Filtering the dataframe
df[ df["ColA"].isin(colA) ]
You want ColA values with 3 consecutive dates. Put differently, you want ColA values where there is a sequence of date, date + 1 day and date + 2 days. By sorting the dataframe by ColA and ColB (1), we know that in any such sequence, date + 1 day will always follow date, and date + 2 days will follow that one.
With this, you can standardize the dates by subtracting from each row a number of days equal to its position. The sequence date, date + 1 day, date + 2 days then becomes date, date, date (2).
Now that the date column is standardized, we just need to count how many rows each pair ('ColA', 'ColB_std') has (3), get the elements from ColA that have counts of 3 or more (4), and filter the dataframe (5).
However, this doesn't support duplicated pairs of ('ColA', 'ColB'); for that you'd need to do this first:
df2 = df.drop_duplicates(["ColA", "ColB"])
Then proceed to use this df2 in steps 1, 2, 3 and 4, and at the end filter the real df in step 5.
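For reference, here is a self-contained sketch of the whole approach on the sample data (my assumption: ColB starts out as strings and must first be parsed with pd.to_datetime):
import pandas as pd

df = pd.DataFrame({
    "ColA": ["A", "A", "A", "A", "B", "B", "C", "C", "D"],
    "ColB": ["1/2/2020", "1/3/2020", "1/4/2020", "1/10/2020",
             "1/3/2020", "1/19/2020", "1/2/2020", "1/7/2020", "1/8/2020"],
})
df["ColB"] = pd.to_datetime(df["ColB"], format="%m/%d/%Y")

df = df.sort_values(["ColA", "ColB"])                                     # step 1
df["ColB_std"] = df["ColB"] - pd.to_timedelta(range(df.shape[0]), "day")  # step 2
s = df.groupby(["ColA", "ColB_std"])["ColB_std"].count()                  # step 3
colA = s[s >= 3].index.get_level_values(0).unique().values                # step 4
print(colA)                      # ['A']
print(df[df["ColA"].isin(colA)]) # step 5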
Previously, I answered that you also could do it like this:
# To make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])
# Calculating the difference between dates inside each group
s = df.groupby("ColA")["ColB"].diff().dt.days
# Filtering the dataframe
df[ ((s == 1) & (s.shift(1) == 1)).groupby(df["ColA"]).transform("any") ]
The idea is that in s, each value is the difference in days between the current date and the previous one within the group. A single difference of 1, however, only guarantees 2 consecutive dates, not 3. By also looking at the series shifted by 1, you make sure that both the current difference and the previous one are 1 [ (s == 1) & (s.shift(1) == 1) ].
After that, I just groupby(df["ColA"]) and check whether any element inside the group is True with transform("any").
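For a quick check, this is what the mask looks like on the sample data (a sketch; it assumes ColB has already been parsed with pd.to_datetime as above):
df = df.sort_values(["ColA", "ColB"])
s = df.groupby("ColA")["ColB"].diff().dt.days
# For group A the diffs are NaN, 1, 1, 6: the third row of A has s == 1 and
# s.shift(1) == 1, so transform("any") marks every row of group A as True.
mask = (s == 1) & (s.shift(1) == 1)
df[mask.groupby(df["ColA"]).transform("any")]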

Related

Check if row's date range overlap any previous rows date range in Python / Pandas Dataframe

I have some data in a pandas dataframe that contains a rank column, a start date and an end date. The data is sorted on the rank column from lowest to highest (consequently the start/end dates are unordered). I wish to remove every row whose date range overlaps ANY previous row's.
By way of a toy example:
Raw Data
Rank Start_Date End_Date
1 1/1/2021 2/1/2021
2 1/15/2021 2/15/2021
3 12/7/2020 1/7/2021
4 5/1/2020 6/1/2020
5 7/10/2020 8/10/2020
6 4/20/2020 5/20/2020
Desired Result
Rank Start_Date End_Date
1 1/1/2021 2/1/2021
4 5/1/2020 6/1/2020
5 7/10/2020 8/10/2020
Explanation: Row 2 is removed because its start overlaps Row 1, Row 3 is removed because its end overlaps Row 1. Row 4 is retained as it doesn’t overlap any previously retained Rows (ie Row 1). Similarly, Row 5 is retained as it doesn’t overlap Row 1 or Row 4. Row 6 is removed because it overlaps with Row 4.
Attempts:
I can use np.where to check the previous row against the current row, create a column "overlap" and then filter on this column. But this doesn't satisfy my requirement (i.e. in the toy example above, Row 3 would be included as it doesn't overlap with Row 2, but should be excluded as it overlaps with Row 1).
df['overlap'] = np.where((df['start'] > df['start'].shift(1)) &
                         (df['start'] < df['end'].shift(1)), 1, 0)
df['overlap'] = np.where((df['end'] < df['end'].shift(1)) &
                         (df['end'] > df['start'].shift(1)), 1, df['overlap'])
I have tried an implementation based on answers from this question Removing 'overlapping' dates from pandas dataframe, using a lookback period from the End Date, but the length of days between my Start Date and End Date is not constant, and it doesn't seem to produce the correct answer anyway.
target = df.iloc[0]
day_diff = abs(target['End_Date'] - df['End_Date'])
day_diff = day_diff.reset_index().sort_values(['End_Date', 'index'])
day_diff.columns = ['old_index', 'End_Date']
non_overlap = day_diff.groupby(day_diff['End_Date'].dt.days // window).first().old_index.values
results = df.iloc[non_overlap]
Two intervals overlap if (a) End2>Start1 and (b) Start2<End1:
We can use numpy.triu to calculate those comparisons with the previous rows only:
a = np.triu(df['End_Date'].values>df['Start_Date'].values[:,None])
b = np.triu(df['Start_Date'].values<df['End_Date'].values[:,None])
The good rows are those that have only True on the diagonal for a&b
df[(a&b).sum(0)==1]
output:
Rank Start_Date End_Date
1 2021-01-01 2021-02-01
4 2020-05-01 2020-06-01
5 2020-07-10 2020-08-10
NB: since it computes the comparisons for every pair of rows, this method can use a lot of memory when the array becomes large, but it should be fast.
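For completeness, a self-contained sketch of this broadcasting approach on the toy data (my assumption: the date columns start as strings and need parsing):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5, 6],
    "Start_Date": ["1/1/2021", "1/15/2021", "12/7/2020", "5/1/2020", "7/10/2020", "4/20/2020"],
    "End_Date": ["2/1/2021", "2/15/2021", "1/7/2021", "6/1/2020", "8/10/2020", "5/20/2020"],
})
df["Start_Date"] = pd.to_datetime(df["Start_Date"], format="%m/%d/%Y")
df["End_Date"] = pd.to_datetime(df["End_Date"], format="%m/%d/%Y")

# pairwise overlap tests, restricted to the upper triangle (previous rows only)
a = np.triu(df["End_Date"].values > df["Start_Date"].values[:, None])
b = np.triu(df["Start_Date"].values < df["End_Date"].values[:, None])

# a row is kept when the only True in its column of a & b is the diagonal entry,
# i.e. it overlaps nothing but itself among the previous rows
print(df[(a & b).sum(0) == 1])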
Another option, that could help with memory usage, is a combination of IntervalIndex and a for loop:
Convert dates:
df.Start_Date = df.Start_Date.transform(pd.to_datetime, format='%m/%d/%Y')
df.End_Date = df.End_Date.transform(pd.to_datetime, format='%m/%d/%Y')
Create IntervalIndex:
intervals = pd.IntervalIndex.from_arrays(df.Start_Date,
                                         df.End_Date,
                                         closed='both')
Run a for loop (this avoids broadcasting, which while fast, can be memory intensive, depending on the array size):
index = np.arange(intervals.size)
keep = []  # indices of `df` to be retained
# get rid of the indices where the intervals overlap
for interval in intervals:
    keep.append(index[0])
    checks = intervals[index].overlaps(intervals[index[0]])
    if checks.any():
        index = index[~checks]
    else:
        break
    if index.size == 0:
        break
df.loc[keep]
Rank Start_Date End_Date
0 1 2021-01-01 2021-02-01
3 4 2020-05-01 2020-06-01
4 5 2020-07-10 2020-08-10

find max-min values for one column based on another

I have a dataset that looks like this.
datetime id
2020-01-22 11:57:09.286 UTC 5
2020-01-22 11:57:02.303 UTC 6
2020-01-22 11:59:02.303 UTC 5
Ids are not unique and give different datetime values. Let's say:
duration = max(datetime)-min(datetime).
I want to count the ids for which the duration max(datetime) - min(datetime) is less than 2 seconds. So, for example, I will output:
count = 1
because of id 5. Then, I want to create a new dataset which contains only those rows with the min(datetime) value for each of the unique ids. So, the new dataset will contain the first row but not the third. The final data set should not have any duplicate ids.
datetime id
2020-01-22 11:57:09.286 UTC 5
2020-01-22 11:57:02.303 UTC 6
How can I do any of these?
P.S.: the dataset I provided might not be a good example, since the condition is 2 seconds but here the differences are in minutes.
Do you want this?
df.datetime = pd.to_datetime(df.datetime)
from datetime import timedelta  # needed for the 2-second threshold below

c = 0
def count(x):
    global c
    x = x.sort_values('datetime')
    if len(x) > 1:
        diff = x.iloc[-1]['datetime'] - x.iloc[0]['datetime']
        if diff < timedelta(seconds=2):
            c += 1
    return x.head(1)

new_df = df.groupby('id').apply(count).reset_index(drop=True)
Now, if you print c it'll show the count which is 1 for this case and new_df will hold the final dataframe.
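If you'd rather avoid the global counter, here is a vectorized sketch of the same idea (my own variation, not part of the answer above; it assumes the datetime column has already been parsed with pd.to_datetime):
g = df.groupby('id')['datetime']
span = g.max() - g.min()                        # per-id duration
c = (span < pd.Timedelta(seconds=2)).sum()      # how many ids stay under the threshold
new_df = df.loc[g.idxmin()]                     # one row per id, at its earliest datetime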

Pandas: Transform and merge a date interval into a dummy variable in a panel

I have two dataframes, the main one is a monthly (MS) panel like this:
df = pd.DataFrame({'Location':['A', 'A', 'B', 'B'],
                   'Date':pd.to_datetime(['1990-1-1', '1990-2-1']*2, yearfirst=True)})
Date Location
0 1990-01-01 A
1 1990-02-01 A
2 1990-01-01 B
3 1990-02-01 B
The second one is a list of events that includes locations, start date and end date (month first), like this:
events = pd.DataFrame({'Location':['A', 'B'],
                       'Start Date':pd.to_datetime(['1/14/1990', '1/2/1990']),
                       'End Date':pd.to_datetime(['1/15/1990', '2/13/1990'])})
Location Start Date End Date
0 A 1990-01-14 1990-01-15
1 B 1990-01-02 1990-02-13
What I need is to turn the start-and-end-date/location combos in the second dataframe into dummy variables in the first. In other words, I need a column that takes on the value of 1 if a particular location had an event on a given date, 0 otherwise. Like this:
Date Location Event
0 1990-01-01 A 1
1 1990-02-01 A 0
2 1990-01-01 B 1
3 1990-02-01 B 1
As you can see, the date 1990-2-1 did not fall within the range of any event in the second dataframe for location A, so it's a 0. Sometimes events will span multiple months, other times not. The day of the event within the month is not relevant, since the main data is all MS frequency. It's a large panel, so the same location will have events on many different dates, and the same date will have events in different locations.
The solution I've worked out is messy and not very fast:
events2 = pd.melt(events, id_vars='Location',
                  value_vars=['Start Date', 'End Date'],
                  value_name='Event')
import datetime

def date_fill(g):
    # to make sure the 1st of a month is always in the range
    y, m = g['Event'].min().year, g['Event'].min().month
    date_range = pd.date_range(datetime.datetime(year=y, month=m, day=1),
                               g['Event'].max(),
                               freq='MS')
    return g.set_index('Event').reindex(date_range,
                                        fill_value=g['Location'].iloc[0])
events3 = events2.groupby('Location', as_index=False).apply(lambda g: date_fill(g))
Which gives me this:
Location variable
0 1990-01-01 A A
1 1990-01-01 B B
1990-02-01 B B
Which I can then clean up a bit, create a column of all 1s, and left-merge into the first dataframe on location and date, filling NaNs with 0. It works, but it's obviously messy and slow (a smaller consideration than messy because the data isn't overly large). I feel like there must be a better way, but I haven't turned it up yet.
Edit: There are actually several problems with my "solution" also, as I explore this more, which was my fear with such a messy bit of work. Specifically it chokes on some corner cases, like when the event starts and ends on the 1st of the month (can't reindex with duplicates).
This one should produce the desired output. (not fast)
df["Date"] = df["Date"].dt.to_period('M')
events["Start Date"] = events["Start Date"].dt.to_period('M')
events["End Date"] = events["End Date"].dt.to_period('M')
e_g = events.groupby("Location")
def f(x):
    g = e_g.get_group(x.Location)
    return ((x.Date >= g["Start Date"]) & (x.Date <= g["End Date"])).any()
df["dummy"] = df.apply(f, axis=1).astype(int)
df
Date Location dummy
0 1990-01 A 1
1 1990-02 A 0
2 1990-01 B 1
3 1990-02 B 1
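A merge-based alternative, as a sketch (my own, not part of the answer above; it assumes the original df and events with datetime columns, i.e. before the to_period conversion applied above). Multiple events per location are handled by the max aggregation:
m = df.merge(events, on='Location', how='left')
in_event = ((m['Date'].dt.to_period('M') >= m['Start Date'].dt.to_period('M')) &
            (m['Date'].dt.to_period('M') <= m['End Date'].dt.to_period('M')))
m['Event'] = in_event.astype(int)
out = m.groupby(['Location', 'Date'], as_index=False)['Event'].max()
result = df.merge(out, on=['Location', 'Date'], how='left')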

pandas - How to check consecutive order of dates and copy groups of them?

At first I have two problems, the first will follow now:
I have a dataframe df that contains the same userid many times, along with a date and some unimportant other columns:
userid date
0 243 2014-04-01
1 234 2014-12-01
2 234 2015-11-01
3 589 2016-07-01
4 589 2016-03-01
I am currently trying to group them by userid, sort the dates descending, and keep only the twelve most recent. My code looks like this:
df = df.groupby(['userid'], group_keys=False).agg(lambda x: x.sort_values(['date'], ascending=False, inplace=False).head(12))
And I get this error:
ValueError: cannot copy sequence with size 6 to array axis with dimension 12
At the moment my aim is to avoid splitting the dataframe into individual ones.
My second problem is more complex:
I try to find out if the sorted dates (per group of userid) are monthly consecutive. This means that if there is a date for one group of userid, for example userid: 234 and date: 2014-04-01, the next entry below must be userid: 234 and date: 2014-03-01. There is no focus on the day; only the year and month are important.
Only such sequences of 12 consecutive dates should be copied into another dataframe.
A second dataframe df2 contains the same userids, but here they are unique, and there is another column 'code'. Here is an example:
userid code
0 433805 1
24 5448 0
48 3434 1
72 34434 1
96 3202 1
120 23766 1
153 39457 0
168 4113 1
172 3435 5
374 34093 1
To summarize: I try to check if there are 12 consecutive months per userid and copy every correct sequence into another dataframe. For this I also have to compare the 'code' from df2.
This is a version of my code:
df['YearMonthDiff'] = df['date'].map(lambda x: 1000*x.year + x.month).diff()
df['id_before'] = df['userid'].shift()
final_df = pd.DataFrame()
for group in df.groupby(['userid'], group_keys=False):
    fi = group[1]
    if (fi['userid'] <> fi['id_before']) & group['YearMonthDiff'].all(-1.0) & df.loc[fi.userid]['code'] != 5:
        final_df.append(group['userid','date', 'consum'])
First I converted each date to an integer and applied diff(). On other posts I saw that people shift the column to compare the value in the current row with the one in the row before. Then I grouped by userid to iterate over the individual groups. Now it gets extra ugly: I tried to find the beginning of each userid group, check whether it contains only consecutive months and the correct 'code', and finally append it to the final dataframe.
One of the biggest problems is comparing a row with the following row. I can iterate over them with iterrows(), but I cannot compare them without shift(). There is also a calendar function, but I will take a look at that over the weekend. Sorry for the mess, I am new to pandas.
Has anyone an idea how to solve my problem?
for your first problem, try this
df.groupby(by='userid').apply(lambda x: x.sort_values(by='date',ascending=False).iloc[[e for e in range(12) if e <len(x)]])
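A slightly simpler equivalent, as a sketch (head(12) already copes with groups that have fewer than 12 rows):
df.groupby('userid', group_keys=False).apply(lambda x: x.sort_values('date', ascending=False).head(12))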
Using groupby and nlargest, we get the index values of those largest dates. Then we use .loc to get just those rows
df.loc[df.groupby('userid').date.nlargest(12).index.get_level_values(1)]
Consider the dataframe df
dates = pd.date_range('2015-08-08', periods=10)
df = pd.DataFrame(dict(
    userid=np.arange(2).repeat(4),
    date=np.random.choice(dates, 8, False)
))
print(df)
date userid
0 2015-08-12 0 # <-- keep
1 2015-08-09 0
2 2015-08-11 0
3 2015-08-15 0 # <-- keep
4 2015-08-13 1
5 2015-08-10 1
6 2015-08-17 1 # <-- keep
7 2015-08-16 1 # <-- keep
We'll keep the latest 2 dates per user id
df.loc[df.groupby('userid').date.nlargest(2).index.get_level_values(1)]
date userid
0 2015-08-12 0
3 2015-08-15 0
6 2015-08-17 1
7 2015-08-16 1
In case someone is interested, I solved my second problem like this:
I cast the date to an int, calculated the difference, and shifted the userid one row, as in my example above. Then comes this part, based on a solution I found on Stack Overflow:
gr_ob = df.groupby('userid')
gr_dict = gr_ob.groups
final_df = pd.DataFrame(columns=['userid', 'date', 'consum'])
for group_name in gr_dict.keys():
    new_df = gr_ob.get_group(group_name)
    if (new_df['userid'].iloc[0] != new_df['id_before'].iloc[0]) & (new_df['YearMonthDiff'].iloc[1:] == -1.0).all() & (len(new_df) == 12):
        final_df = final_df.append(new_df[['userid', 'date', 'consum']])
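A cleaner sketch of the consecutive-month check (my own variation, not the poster's code; it assumes date is a datetime column). Encoding the month as year*12 + month makes a December-to-January step a difference of 1, which the 1000*year + month encoding above does not:
df['month_no'] = df['date'].dt.year * 12 + df['date'].dt.month

def is_12_consecutive(g):
    g = g.sort_values('date', ascending=False)
    return len(g) == 12 and bool((g['month_no'].diff().iloc[1:] == -1).all())

good = df.groupby('userid').filter(is_12_consecutive)
# mirror the 'code' check against df2 (drop userids whose code is 5)
good = good[~good['userid'].isin(df2.loc[df2['code'] == 5, 'userid'])]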

Slicing Multi Index Header DataFrames in Python With Custom Sorting

I'm trying to get a handle on slicing. I've got the following dataframe, df:
Feeder # 1 Feeder # 2
TimeStamp MW Month Day Hour TimeStamp MW Month Day Hour
0 2/3 1.2 1 30 22 2/3 2.4 1 30 22
1 2/4 2.3 1 31 23 2/3 4.1 1 31 23
2 2/5 3.4 2 1 0 2/3 3.7 2 1 0
There are 8 feeders in total.
If I want to select all the MW columns in all the Feeders, I can do:
df.xs('MW', level=1, axis=1,drop_level=False)
If I want Feeders 2 through 4, I can do:
df.loc[:,'Feeder #2':'Feeder #4']
BUT if I want columns MW through Day in just Feeders 2 through 4 via:
df.loc[:,pd.IndexSlice['Feeder #2':'Feeder #4','MW':'Day']]
I get the following error.
MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)
So if I sort the dataframe, then I'm able to do:
df.sortlevel(level=0,axis=1).loc[:,pd.IndexSlice['Feeder #2':'Feeder #4','Day':'MW']]
But sorting the dataframe destroys the original order of level 1 in the header-- everything gets alphabetized (lexsorted in Python-speak?). And my desired contents get jumbled: 'Day':'MW' yields the Day, Hour and MW columns. But what I want is 'MW':'Day' which would yield the MW, Month, and Day columns.
So my question is: is it possible to slice through my dataframe and preserve the order of the columns? Alternatively, can I lexsort the dataframe, perform the slices I need and then put the dataframe back in its original order?
Thanks in advance.
I think you can use CategoricalIndex to keep the order:
import pandas as pd
import numpy as np
level0 = "Feeder#1 Feeder#2 Feeder#3 Feeder#4".split()
level1 = "TimeStamp MW Month Day Hour".split()
idx0 = pd.CategoricalIndex(level0, level0, ordered=True)
idx1 = pd.CategoricalIndex(level1, level1, ordered=True)
columns = pd.MultiIndex.from_product([idx0, idx1])
df = pd.DataFrame(np.random.randint(0, 10, (10, 20)), columns=columns)
Then you can do this:
df.loc[:, pd.IndexSlice["Feeder#2":"Feeder#3", "MW":"Day"]]
edit
to convert the levels to CategoricalIndex:
columns = df.columns
for i in range(columns.nlevels):
    level = pd.unique(columns.get_level_values(i))
    cidx = pd.CategoricalIndex(level, level, ordered=True)
    print(cidx)
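To actually apply this to an existing frame, here is a hedged sketch: rebuild the column MultiIndex so each level becomes an ordered Categorical while keeping the original left-to-right order. If the columns are already grouped by feeder (as in the question) this is enough; otherwise sort with df.sort_index(axis=1), which with ordered categories preserves the custom order:
cols = df.columns
new_arrays = []
for i in range(cols.nlevels):
    vals = cols.get_level_values(i)
    cats = pd.unique(vals)  # categories in order of first appearance
    new_arrays.append(pd.Categorical(vals, categories=cats, ordered=True))
df.columns = pd.MultiIndex.from_arrays(new_arrays)

df.loc[:, pd.IndexSlice["Feeder #2":"Feeder #4", "MW":"Day"]]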
