Comparing values with groups - pandas - python

First things first, I have a data frame that has these columns:
issue_date | issue | special | group
Multiple rows can belong to the same group. For each group, I want to get its maximum date:
date_current = history.groupby('group').agg({'issue_date' : [np.min, np.max]})
date_current = date_current.issue_date.amax
After that, I want to filter each group by its maximum date minus n months:
date_before = date_current.values - pd.Timedelta(weeks=4*n)
I.e., for each group, I want to discard rows where the column issue_date < date_before:
hh = history[history['issue_date'] > date_before]
ValueError: Lengths must match to compare
This last line doesn't work though, since the lengths don't match. This is expected because I have x lines in my data frame, but date_before has a length equal to the number of groups in my data frame.
Given this data, I'm wondering how I can perform this subtraction, or filtering, by group. Do I have to iterate over the data frame somehow?

You can solve this in a similar manner as you attempted it.
I've created my own sample data as follows:
history
issue_date group
0 2014-01-02 1
1 2014-01-02 2
2 2016-02-04 3
3 2016-03-05 2
You can use groupby and apply to do what you were attempting. First you define the function you want to apply; then groupby.apply will apply it to every group. In this case I've used n=1 to demonstrate the point:
def date_compare(df):
    date_current = df.issue_date.max()
    date_before = date_current - pd.Timedelta(weeks=4*1)
    hh = df[df['issue_date'] > date_before]
    return hh
hh = history.groupby('group').apply(date_compare)
issue_date group
group
1 0 2014-01-02 1
2 3 2016-03-05 2
3 2 2016-02-04 3
So the smaller date in group 2 has not survived the cut.
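As a small optional tweak (not required for the result above), if you'd rather not carry the extra group level in the index, you can pass group_keys=False to the groupby before the apply:
hh = history.groupby('group', group_keys=False).apply(date_compare)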
Hope that's helpful and that it follows the same logic you were going for.

I think your best option will be to merge your original df with the per-group cutoff, but this will only work if you change your calculation of date_before so that the group information isn't lost (i.e. don't call .values):
date_before = date_current - pd.Timedelta(weeks=4*n)
Then you can merge, left on group and right on the index (since you grouped on that before):
history = pd.merge(history, date_before.to_frame(), left_on='group', right_index=True)
Then your filter should work. The call to to_frame is necessary because you can't merge a DataFrame with a Series.
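Putting the pieces together, a minimal end-to-end sketch (assuming history and n are already defined; the date_before column name is just a choice made here for clarity):
import numpy as np
import pandas as pd

# per-group maximum issue_date (accessed via the 'amax' column, as in the question)
date_current = history.groupby('group').agg({'issue_date': [np.min, np.max]})
date_current = date_current.issue_date.amax

# shift each group's maximum back by roughly n months, keeping the group index
date_before = date_current - pd.Timedelta(weeks=4 * n)

# bring the per-group cutoff onto every row, then filter row-wise
history = pd.merge(history, date_before.to_frame('date_before'),
                   left_on='group', right_index=True)
hh = history[history['issue_date'] > history['date_before']]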
Hope that helps.

Related

Check if row's date range overlap any previous rows date range in Python / Pandas Dataframe

I have some data in a pandas dataframe that contains a rank column, a start date and an end date. The data is sorted on the rank column lowest to highest (consequently the start/end dates are unordered). I wish to remove every row whose date range overlaps ANY PREVIOUS row's date range.
By way of a toy example:
Raw Data
Rank Start_Date End_Date
1 1/1/2021 2/1/2021
2 1/15/2021 2/15/2021
3 12/7/2020 1/7/2021
4 5/1/2020 6/1/2020
5 7/10/2020 8/10/2020
6 4/20/2020 5/20/2020
Desired Result
Rank Start_Date End_Date
1 1/1/2021 2/1/2021
4 5/1/2020 6/1/2020
5 7/10/2020 8/10/2020
Explanation: Row 2 is removed because its start overlaps Row 1, Row 3 is removed because its end overlaps Row 1. Row 4 is retained as it doesn’t overlap any previously retained Rows (ie Row 1). Similarly, Row 5 is retained as it doesn’t overlap Row 1 or Row 4. Row 6 is removed because it overlaps with Row 4.
Attempts:
I can use np.where to compare the current row with the previous row, create a column "overlap", and then filter on this column. But this doesn't satisfy my requirement (i.e. in the toy example above, Row 3 would be included because it doesn't overlap with Row 2, but it should be excluded because it overlaps with Row 1):
df['overlap'] = np.where((df['start'] > df['start'].shift(1)) &
                         (df['start'] < df['end'].shift(1)), 1, 0)
df['overlap'] = np.where((df['end'] < df['end'].shift(1)) &
                         (df['end'] > df['start'].shift(1)), 1, df['overlap'])
I have tried an implementation based on answers from this question Removing 'overlapping' dates from pandas dataframe, using a lookback period from the End Date, but the number of days between my Start Date and End Date is not constant, and it doesn't seem to produce the correct answer anyway:
target = df.iloc[0]
day_diff = abs(target['End_Date'] - df['End_Date'])
day_diff = day_diff.reset_index().sort_values(['End_Date', 'index'])
day_diff.columns = ['old_index', 'End_Date']
non_overlap = day_diff.groupby(day_diff['End_Date'].dt.days // window).first().old_index.values
results = df.iloc[non_overlap]
Two intervals overlap if (a) End2>Start1 and (b) Start2<End1:
We can use numpy.triu to calculate those comparisons with the previous rows only:
a = np.triu(df['End_Date'].values>df['Start_Date'].values[:,None])
b = np.triu(df['Start_Date'].values<df['End_Date'].values[:,None])
The good rows are those that have only True on the diagonal for a&b
df[(a&b).sum(0)==1]
output:
Rank Start_Date End_Date
1 2021-01-01 2021-02-01
4 2020-05-01 2020-06-01
5 2020-07-10 2020-08-10
NB: since it computes the comparisons for every pair of rows, this method can use a lot of memory when the array becomes large, but it should be fast.
Another option, that could help with memory usage, is a combination of IntervalIndex and a for loop:
Convert dates:
df.Start_Date = df.Start_Date.transform(pd.to_datetime, format='%m/%d/%Y')
df.End_Date = df.End_Date.transform(pd.to_datetime, format='%m/%d/%Y')
Create IntervalIndex:
intervals = pd.IntervalIndex.from_arrays(df.Start_Date,
                                         df.End_Date,
                                         closed='both')
Run a for loop (this avoids broadcasting, which while fast, can be memory intensive, depending on the array size):
index = np.arange(intervals.size)
keep = []  # indices of `df` to be retained
# get rid of the indices where the intervals overlap
for interval in intervals:
    keep.append(index[0])
    checks = intervals[index].overlaps(intervals[index[0]])
    if checks.any():
        index = index[~checks]
    else:
        break
    if index.size == 0:
        break
df.loc[keep]
Rank Start_Date End_Date
0 1 2021-01-01 2021-02-01
3 4 2020-05-01 2020-06-01
4 5 2020-07-10 2020-08-10

Updating a Panda dataframe using an apply and where, and a second apply

I have a data frame with the following structure:
>>>df
name threshold ... time
0 a no ... 1.1
1 a 1 ... 1.5
2 b no ... 1.1
3 a 2 ... 1.5
...
For each name (groupby), I'd like to find the row where df['threshold'] == 'no' and divide the time of every row in that group (a, b, etc.) by that row's time value. I'd like to preserve the rest of the dataframe as it was. I was not able to find a way to do this with df.apply:
df.groupby(['name']).apply(lambda x: x['threshold'])
After which, I can't apply df.where on it, and I can't quite express these multiple conditions with df.apply.
So the answer should do a groupby, find the row where threshold is 'no', take its corresponding time value, and divide the time of all rows in the same group by it. Note that there is only one 'no' per group name.
Thanks for any suggestions.
IIUC, you could do:
df['no_time'] = df['threshold'].eq('no') * df['time']
df['time'] = df['time'] / df.groupby('name')['no_time'].transform('max')
res = df.drop('no_time', axis=1)
print(res)
Output
name threshold time
0 a no 1.000000
1 a 1 1.363636
2 b no 1.000000
3 a 2 1.363636
The first step:
df['no_time'] = df['threshold'].eq('no') * df['time']
creates a new column where the only values different from 0 are those where threshold equals 'no'.
The second step has two parts. Part 2.1:
df.groupby('name')['no_time'].transform('max')
finds the maximum of the new column (no_time) per group, i.e. the value of time where threshold equals 'no' (assuming time is always positive, or at least positive where threshold equals 'no').
The final part just divides the df['time'] column by the result of the previous step (2.1).
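If time could be negative (so the max trick wouldn't necessarily pick the 'no' row), a hedged variant of the same idea masks the other rows instead, assuming, as stated, exactly one 'no' per group:
no_time = df['time'].where(df['threshold'].eq('no'))                       # NaN everywhere except the 'no' rows
df['time'] = df['time'] / no_time.groupby(df['name']).transform('first')   # 'first' skips the NaNs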

Pandas: using groupby and nunique taking time into account

I have a dataframe in this form:
A B time
1 2 2019-01-03
1 3 2018-04-05
1 4 2020-01-01
1 4 2020-02-02
where A and B contain some integer identifiers.
I want to measure the number of different identifiers each A has interacted with. To do this I usually simply do
df.groupby('A')['B'].nunique()
I now have to do a slightly different thing: each identifier has a date assigned (different for each identifier) that splits its interactions into 2 parts: the ones happening before that date, and the ones happening after it. The same operation as before (counting the number of unique B values interacted with) needs to be done for both parts separately.
For example, if the date for A=1 was 2018-07-01, the output would be
A before after
1 1 2
In the real data, A contains millions of different identifiers, each with its unique date assigned.
EDITED
To be clearer, I added a line to df. I want to count the number of different values of B each A interacts with, before and after the date.
I would map A to its assigned date, compare those dates with df['time'], and then count the unique B values on each side of the split with groupby().nunique():
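Here date_dict is assumed to hold each identifier's split date; for the example above it would be:
date_dict = {1: pd.Timestamp('2018-07-01')}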
s = df['A'].map(date_dict).gt(df['time'])   # True where the interaction happened before the split date
(df.groupby(['A', s])['B']
   .nunique()
   .unstack()
   .rename({False: 'after', True: 'before'}, axis=1)
)
Output:
after before
A
1 2 1

pandas - How to check consecutive order of dates and copy groups of them?

At first, I have two problems; the first follows now:
I have a dataframe df with many rows per userid, each with a date and some other, unimportant columns:
userid date
0 243 2014-04-01
1 234 2014-12-01
2 234 2015-11-01
3 589 2016-07-01
4 589 2016-03-01
I am currently trying to group them by userid, sort the dates in descending order, and keep only the twelve most recent dates per user. My code looks like this:
df = df.groupby(['userid'], group_keys=False).agg(lambda x: x.sort_values(['date'], ascending=False, inplace=False).head(12))
And I get this error:
ValueError: cannot copy sequence with size 6 to array axis with dimension 12
At the moment, my aim is to avoid splitting the dataframe into individual ones.
My second problem is more complex:
I am trying to find out whether the sorted dates (per userid group) are consecutive months. This means that if there is a date for one userid group, for example userid: 234 and date: 2014-04-01, the next entry below must be userid: 234 and date: 2014-03-01. The day doesn't matter; only the year and month are important.
Only such sequences of 12 consecutive dates should be copied into another dataframe.
A second dataframe df2 contains the same userids, but here they are unique, and there is another column 'code'. Here is an example:
userid code
0 433805 1
24 5448 0
48 3434 1
72 34434 1
96 3202 1
120 23766 1
153 39457 0
168 4113 1
172 3435 5
374 34093 1
To summarize: I am trying to check whether there are 12 consecutive months per userid and copy every correct sequence into another dataframe. For this I also have to compare against the 'code' from df2.
This is a version of my code:
df['YearMonthDiff'] = df['date'].map(lambda x: 1000*x.year + x.month).diff()
df['id_before'] = df['userid'].shift()
final_df = pd.DataFrame()
for group in df.groupby(['userid'], group_keys=False):
    fi = group[1]
    if (fi['userid'] <> fi['id_before']) & group['YearMonthDiff'].all(-1.0) & df.loc[fi.userid]['code'] != 5:
        final_df.append(group['userid','date', 'consum'])
First I calculated an integer from the date and took diff(). In other posts I saw people shift the column to compare the values of the current row and the row before. Then I used groupby(userid) to iterate over the individual groups. Now it gets extra ugly: I tried to find the beginning of each userid group, check whether there are only consecutive months and the correct 'code', and finally append it to the final dataframe.
One of the biggest problems is comparing a row with the following row. I can iterate over the rows with iterrows(), but I cannot compare them without shift(). There is also a calendar module, but I will take a look at that on the weekend. Sorry for the mess, I am new to pandas.
Has anyone an idea how to solve my problem?
For your first problem, try this:
df.groupby(by='userid').apply(lambda x: x.sort_values(by='date',ascending=False).iloc[[e for e in range(12) if e <len(x)]])
Using groupby and nlargest, we get the index values of those largest dates. Then we use .loc to get just those rows
df.loc[df.groupby('userid').date.nlargest(12).index.get_level_values(1)]
Consider the dataframe df
dates = pd.date_range('2015-08-08', periods=10)
df = pd.DataFrame(dict(
    userid=np.arange(2).repeat(4),
    date=np.random.choice(dates, 8, False)
))
print(df)
date userid
0 2015-08-12 0 # <-- keep
1 2015-08-09 0
2 2015-08-11 0
3 2015-08-15 0 # <-- keep
4 2015-08-13 1
5 2015-08-10 1
6 2015-08-17 1 # <-- keep
7 2015-08-16 1 # <-- keep
We'll keep the latest 2 dates per user id
df.loc[df.groupby('userid').date.nlargest(2).index.get_level_values(1)]
date userid
0 2015-08-12 0
3 2015-08-15 0
6 2015-08-17 1
7 2015-08-16 1
In case someone is interested, I solved my second problem like this:
I cast the date to an int, calculated the difference, and shifted the userid by one row, as in my example above. Then follows this (based on a solution I found on Stack Overflow):
gr_ob = df.groupby('userid')
gr_dict = gr_ob.groups
final_df = pd.DataFrame(columns=['userid', 'date', 'consum'])
for group_name in gr_dict.keys():
    new_df = gr_ob.get_group(group_name)
    if (new_df['userid'].iloc[0] <> new_df['id_before'].iloc[0]) & (new_df['YearMonthDiff'].iloc[1:] == -1.0).all() & (len(new_df) == 12):
        final_df = final_df.append(new_df[['userid', 'date', 'consum']])
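For reference, a hedged sketch of the same check in a more vectorized style; the column names follow the question, while the helper function and the year*12 + month counter are illustrative assumptions rather than part of the original solution:
import pandas as pd

def twelve_consecutive_months(df, df2):
    # newest 12 rows per user, ordered newest first
    recent = df.sort_values('date', ascending=False).groupby('userid').head(12)
    # month counter: consecutive months (newest first) differ by exactly -1
    ym = recent['date'].dt.year * 12 + recent['date'].dt.month
    ok = (recent.assign(ym=ym)
                .groupby('userid')['ym']
                .agg(lambda s: len(s) == 12 and (s.diff().iloc[1:] == -1).all())
                .astype(bool))
    # keep only users with a full run of 12 consecutive months and code != 5 in df2
    codes = df2.set_index('userid')['code']
    good = [u for u in ok[ok].index if codes.get(u) != 5]
    return recent[recent['userid'].isin(good)]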

Dividing two columns of an unstacked dataframe

I have two columns in a pandas dataframe.
Column 1 is ed and contains strings (e.g. 'a','a','b','c','c','a')
ed column = ['a','a','b','c','c','a']
Column 2 is job and also contains strings (e.g. 'aa','bb','aa','aa','bb','cc')
job column = ['aa','bb','aa','aa','bb','cc'] #these are example values from column 2 of my pandas data frame
I then generate a two column frequency table like this:
my_counts= pdata.groupby(['ed','job']).size().unstack().fillna(0)
How do I then divide the frequencies in one column by the frequencies in another column of that frequency table? I want to take that ratio and use it with argsort() so that I can sort by the calculated ratio, but I don't know how to reference each column of the resulting table.
I initialized the data as follows:
ed_col = ['a','a','b','c','c','a']
job_col = ['aa','bb','aa','aa','bb','cc']
pdata = pd.DataFrame({'ed':ed_col, 'job':job_col})
my_counts= pdata.groupby(['ed','job']).size().unstack().fillna(0)
Now my_counts looks like this:
job aa bb cc
ed
a 1 1 1
b 1 0 0
c 1 1 0
To access a column, you could use my_counts.aa or my_counts['aa'].
To access a row, you could use my_counts.loc['a'].
So the frequencies of aa divided by bb are my_counts['aa'] / my_counts['bb']
and now, if you want to get it sorted, you can do:
my_counts.iloc[(my_counts['aa'] / my_counts['bb']).argsort()]
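Equivalently, as a hedged variant of the same idea, you can compute the ratio as its own Series and reindex by its sorted order:
ratio = my_counts['aa'] / my_counts['bb']          # Series indexed by 'ed'
sorted_counts = my_counts.loc[ratio.sort_values().index]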
