Python - Group by time periods with tolerance

Three columns exist in this data set: ID (a unique employee identifier), WorkComplete (indicating whether all work has been completed), and DaysDiff (the number of days from their start date). I am looking to group the DaysDiff column into certain time periods with an added layer of tolerance or leniency. For my mock data, I am spacing the time periods by 30 days.
Group 0: 0-30 DaysDiff (with a 30 day extra window if 'Y' is not found)
Group 1: 31-60 DaysDiff (with a 30 day extra window if 'Y' is not found)
Group 2: 61-90 DaysDiff (with a 30 day extra window if 'Y' is not found)
I was able to create very basic code and assign the groupings, but I am having trouble with the extra 30 day window. For example, if an employee completed their work (Y) during one of the time periods above, then they receive the attributed grouping. For ID 111 below, you can see that the person did not complete their work within the first 30 days, so I am giving them an additional 30 days to complete it. If they do, then the first instance where we see a 'Y' is grouped into the previous grouping.
df = pd.DataFrame({'ID': [111, 111, 111, 111, 111, 111, 123, 123, 123],
                   'WorkComplete': ['N', 'N', 'Y', 'N', 'N', 'N', 'N', 'Y', 'Y'],
                   'DaysDiff': [0, 29, 45, 46, 47, 88, 1, 12, 89]})
Input
ID WorkComplete DaysDiff
111 N 0
111 N 29
111 Y 45
111 N 46
111 N 47
111 N 88
123 N 1
123 Y 12
123 Y 89
Output
ID WorkComplete DaysDiff Group
111 N 0 0
111 N 29 0
111 Y 45 0 <---- note here the grouping is 0 to provide extra time
111 N 46 1 <---- back to normal
111 N 47 1
111 N 88 2
123 N 1 0
123 Y 12 0
123 Y 89 2
minQ1 = 0
highQ1 = 30
minQ2 = 31
highQ2 = 60
minQ3 = 61
highQ3 = 90

def Group_df(df):
    if minQ1 <= df['DaysDiff'] <= highQ1:
        return '0'
    elif minQ2 <= df['DaysDiff'] <= highQ2:
        return '1'
    elif minQ3 <= df['DaysDiff'] <= highQ3:
        return '2'

df['Group'] = df.apply(Group_df, axis=1)
The trouble I am having is allowing for the additional 30 days if the person did not complete the work. My attempt above only partially resolves the issue.

You can use np.select for the primary conditions.
Then, use mask for the specific condition you mention. s is the first index location of a 'Y' value per ID. I then temporarily assign s as a new column so that I can compare it against df.index (the index) and flag the rows that meet the condition. The second condition is that the group number is 1 from the previous line of code:
df['Group'] = np.select([df['DaysDiff'].between(0, 30),
                         df['DaysDiff'].between(31, 60),
                         df['DaysDiff'].between(61, 90)],
                        [0, 1, 2])
s = df[df['WorkComplete'] == 'Y'].groupby('ID')['DaysDiff'].transform('idxmin')
df['Group'] = df['Group'].mask((df.assign(s=s)['s'].eq(df.index)) & (df['Group'].eq(1)), 0)
df
Out[1]:
ID WorkComplete DaysDiff Group
0 111 N 0 0
1 111 N 29 0
2 111 Y 45 0
3 111 N 46 1
4 111 N 47 1
5 111 N 88 2
6 123 N 1 0
7 123 Y 12 0
8 123 Y 89 2
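As a side note, the primary binning could also be done with pd.cut instead of np.select. Below is a minimal sketch of that variant; it assumes every DaysDiff value falls between 0 and 90, and it reuses the mask step from the answer above unchanged:
import pandas as pd

df = pd.DataFrame({'ID': [111, 111, 111, 111, 111, 111, 123, 123, 123],
                   'WorkComplete': ['N', 'N', 'Y', 'N', 'N', 'N', 'N', 'Y', 'Y'],
                   'DaysDiff': [0, 29, 45, 46, 47, 88, 1, 12, 89]})

# bin DaysDiff into 0-30, 31-60, 61-90 (right-inclusive edges), labelled 0/1/2
df['Group'] = pd.cut(df['DaysDiff'], bins=[-1, 30, 60, 90], labels=[0, 1, 2]).astype(int)

# same tolerance adjustment as above: the first 'Y' per ID that landed in group 1 drops back to group 0
s = df[df['WorkComplete'] == 'Y'].groupby('ID')['DaysDiff'].transform('idxmin')
df['Group'] = df['Group'].mask(df.assign(s=s)['s'].eq(df.index) & df['Group'].eq(1), 0)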

Related

How to name the column when using value_count function in pandas?

I was counting the number of occurrences of angle and dist with the code below:
g = new_df.value_counts(subset=['Current_Angle','Current_dist'] ,sort = False)
the output:
current_angle current_dist 0
-50 30 1
-50 40 2
-50 41 6
-50 45 4
try 1:
g.columns = ['angle','Distance','count','Percentage Missed'] - the result was no change in the column names.
try 2:
When I print the columns using print(g.columns), it ends with the error AttributeError: 'Series' object has no attribute 'columns'.
I want to rename the column 0 as count and add a new column to the dataframe g called percent missed, which is calculated as 100 minus the value in column 0.
Expected output
current_angle current_dist count percent missed
-50 30 1 99
-50 40 2 98
-50 41 6 94
-50 45 4 96
1. How to modify the code? I mean, instead of value_counts, is there any other function that can give the expected output?
2. How to get the expected output with the current method?
EDIT 1 (exceptional case)
data:
angle distance velocity
0 124 -3
50 24 -25
50 34 25
expected output (count is calculated based on distance):
angle distance velocity count percent missed
0 124 -3 1 99
50 24 -25 1 99
50 34 25 1 99
First add Series.reset_index, because DataFrame.value_counts returns a Series; you can pass its name parameter to turn the 0 column into a count column, and then subtract from 100 into a new column with Series.rsub (subtract from the right side, i.e. 100 - df['count']):
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'] ,sort = False)
.reset_index(name='count')
.assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
Or, if you also need to set new column names, use DataFrame.set_axis:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'] ,sort = False)
.reset_index(name='count')
.set_axis(['angle','Distance','count'], axis=1)
.assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
If you need to assign new column names, here is an alternative solution:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'] ,sort = False)
.reset_index())
df.columns = ['angle','Distance','count']
df['percent missed'] = df['count'].rsub(100)
Assuming a DataFrame as input (if not reset_index first), simply use rename and a subtraction:
df = df.rename(columns={'0': 'count'}) # assuming string '0' here, else use 0
df['percent missed'] = 100 - df['count']
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
alternative: using groupby.size:
(new_df
.groupby(['current_angle','current_dist']).size()
.reset_index(name='count')
.assign(**{'percent missed': lambda d: 100-d['count']})
)
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
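For the EDIT 1 case (count based on distance, keeping every row), a possible sketch is a groupby/transform on distance; the frame below is rebuilt from the edit, and the column names count and percent missed are the ones requested:
import pandas as pd

df = pd.DataFrame({'angle': [0, 50, 50],
                   'distance': [124, 24, 34],
                   'velocity': [-3, -25, 25]})

# count occurrences of each distance without collapsing rows, then derive percent missed
df['count'] = df.groupby('distance')['distance'].transform('size')
df['percent missed'] = 100 - df['count']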

determine the range of a value using a look up table

I have a df with numbers:
numbers = pd.DataFrame(columns=['number'], data=[
50,
65,
75,
85,
90
])
and a df with ranges (look up table):
ranges = pd.DataFrame(
columns=['range','range_min','range_max'],
data=[
['A',90,100],
['B',85,95],
['C',70,80]
]
)
I want to determine what range (in second table) a value (in the first table) falls in. Please note ranges overlap, and limits are inclusive.
Also please note the vanilla dataframe above has 3 ranges, however this dataframe gets generated dynamically. It could have from 2 to 7 ranges.
Desired result:
numbers = pd.DataFrame(columns=['number','detected_range'], data=[
[50,'out_of_range'],
[65, 'out_of_range'],
[75,'C'],
[85,'B'],
[90,'overlap'] * could be A or B *
])
I solved this with a for loop, but it doesn't scale well to the big dataset I am using. Also, the code is too extensive and inelegant. See below:
numbers['detected_range'] = np.nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        if row1.number >= row2.range_min and row1.number <= row2.range_max:
            numbers.loc[i, 'detected_range'] = row2['range']
        elif (other cases...):
            ...and so on...
How could I do this?
You can use a bit of numpy vectorized operations to generate masks, and use them to select your labels:
import numpy as np
a = numbers['number'].values # numpy array of numbers
r = ranges.set_index('range') # dataframe of min/max with labels as index
m1 = (a>=r['range_min'].values[:,None]).T # is number above each min
m2 = (a<=r['range_max'].values[:,None]).T # is number at or below each max (limits are inclusive)
m3 = (m1&m2) # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2
m4 = m3.sum(1) # how many matches?
# 0 -> out_of_range
# 2 -> overlap
# 1 -> get column name
# now we select the label according to the conditions
numbers['detected_range'] = np.select([m4==0, m4==2], # out_of_range and overlap
['out_of_range', 'overlap'],
# otherwise get column name
default=np.take(r.index, m3.argmax(1))
)
output:
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
edit: it works with any number of intervals in ranges.
Example output with an extra row ['D', 50, 51]:
number detected_range
0 50 D
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
Pandas IntervalIndex fits in here; however, since your intervals overlap, a for loop is the approach I'll use (for unique, non-overlapping intervals, IntervalIndex.get_indexer is a fast approach):
intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
                                         ranges.range_max,
                                         closed='both')
box = []
for num in numbers.number:
    bools = intervals.contains(num)
    if bools.sum() == 1:
        box.append(ranges.range[bools].item())
    elif bools.sum() > 1:
        box.append('overlap')
    else:
        box.append('out_of_range')
numbers.assign(detected_range = box)
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
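To illustrate the parenthetical remark above, here is a sketch of the get_indexer path. It only applies when the intervals do not overlap, so the non_overlapping table below is made up for illustration and is not the data from the question; numbers is reused from the question:
import numpy as np
import pandas as pd

# hypothetical non-overlapping lookup table
non_overlapping = pd.DataFrame({'range': ['A', 'B', 'C'],
                                'range_min': [90, 81, 70],
                                'range_max': [100, 89, 80]})
iv = pd.IntervalIndex.from_arrays(non_overlapping.range_min,
                                  non_overlapping.range_max,
                                  closed='both')
pos = iv.get_indexer(numbers.number)   # -1 where no interval contains the number
numbers['detected_range'] = np.where(pos == -1, 'out_of_range',
                                     non_overlapping['range'].to_numpy()[pos])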
First, explode the ranges (using range_max + 1 so the inclusive upper limit is covered):
df1 = ranges.assign(col1=ranges.apply(lambda ss: range(ss.range_min, ss.range_max + 1), axis=1)).explode('col1')
df1
range range_min range_max col1
0 A 90 100 90
0 A 90 100 91
0 A 90 100 92
0 A 90 100 93
0 A 90 100 94
0 A 90 100 95
0 A 90 100 96
0 A 90 100 97
0 A 90 100 98
0 A 90 100 99
0 A 90 100 100
1 B 85 95 85
1 B 85 95 86
1 B 85 95 87
1 B 85 95 88
1 B 85 95 89
1 B 85 95 90
Second, check whether each number in the first df falls into a range:
def function1(x):
    df11 = df1.loc[df1.col1 == x]
    if len(df11) == 0:
        return 'out_of_range'
    if len(df11) > 1:
        return 'overlap'
    return df11.iloc[0, 0]
numbers.assign(col2=numbers.number.map(function1))
number col2
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
the logic is simple and clear
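For completeness, one more possible approach is a cross merge of the two frames followed by an aggregation. This is only a sketch (how='cross' needs pandas 1.2+, and the full cross product can get large), reusing numbers and ranges from the question:
merged = numbers.merge(ranges, how='cross')
hits = merged[(merged['number'] >= merged['range_min'])
              & (merged['number'] <= merged['range_max'])]

# one label if exactly one range matches, 'overlap' if several, 'out_of_range' if none
labels = (hits.groupby('number')['range']
              .agg(lambda s: s.iloc[0] if len(s) == 1 else 'overlap'))
numbers['detected_range'] = numbers['number'].map(labels).fillna('out_of_range')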

Python - Groupby Multiple Criteria and Closest Integer

Here, I am trying to assign groups based on multiple criteria and the closest date diff prior to the zero. The groupby should look only within each ID, then find the closest negative datediff value prior to each zero (not positive, since I am trying to look back in time) and, based on the Location integer, assign a group. I will have hundreds of groups, and the groups should be assigned based on the Location integer, so multiple IDs can have the same group if the Location is the same.
Please let me know if I should elaborate or reword - thank you for your help!
Input:
ID Location Date Diff (Days)
111 87 -5
111 88 0
123 97 -123
123 98 -21
123 55 0
123 56 -59
123 30 -29
123 46 0
123 46 25
123 31 87
234 87 -32
234 55 0
234 30 -26
234 54 0
Expected Output:
ID Location Date Diff (Days) Group
111 87 -5 1
111 88 0
123 97 -123
123 98 -21 2
123 55 0
123 56 -59
123 30 -29 3
123 46 0
123 46 25
123 31 87
234 87 -32 1
234 55 0
234 30 -26 3
234 54 0
IIUC, you can find the indexes that should receive a group value by using where to keep only the values in Diff (I renamed the column Date Diff (Days) to Diff for simplicity) that are below 0. Then group by ID and by sub-groups built from where the shifted Diff column equals 0, using cumsum. For each group take the idxmax, drop the NaN, and collect the list of indexes. The second step is to use this list of indexes and the Location column to create a unique id for each Location with pd.factorize:
idx = (df['Diff'].where(lambda x: x.lt(0))
.groupby([df['ID'],
df['Diff'].shift().eq(0).cumsum()])
.idxmax().dropna().tolist()
)
df['Group'] = ''
df.loc[idx, 'Group'] = (pd.factorize(df.loc[idx, 'Location'])[0]+1)
print (df)
ID Location Diff Group
0 111 87 -5 1
1 111 88 0
2 123 97 -123
3 123 98 -21 2
4 123 55 0
5 123 56 -59
6 123 30 -29 3
7 123 46 0
8 123 46 25
9 123 31 87
10 234 87 -32 1
11 234 55 0
12 234 30 -26 3
13 234 54 0
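As a small aside on the pd.factorize step, this is how it assigns stable integer codes by first appearance; the values below are just the Location values that the idx list picks out above:
import pandas as pd

codes, uniques = pd.factorize(pd.Series([87, 98, 30, 87, 30]))
print(codes + 1)       # [1 2 3 1 3] -> the Group numbers in the output above
print(list(uniques))   # [87, 98, 30]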
Because the order of rows matters, the most straightforward answer that I can think of (that will have somewhat readable code) uses a loop... So I sure hope that performance is not an issue.
The code is less cumbersome than it seems. I hope that the code comments are clear enough.
# Your data
df = pd.DataFrame(
    data=[[111,87,-5],
          [111,88,0],
          [123,97,-123],
          [123,98,-21],
          [123,55,0],
          [123,56,-59],
          [123,30,-29],
          [123,46,0],
          [123,46,25],
          [123,31,87],
          [234,87,-32],
          [234,55,0],
          [234,30,-26],
          [234,54,0]], columns=['ID','Location','Date Diff (Days)'])
N_ID, N_Location, N_Date, N_Group = 'ID', 'Location', 'Date Diff (Days)', 'Group'

# Some preparations
col_group = pd.Series(index=df.index)  # The final column we'll add to our `df`
groups_found = 0
location_to_group = dict()  # To maintain our mapping of Location to "group" values

# LOOP
prev_id, prev_DD, best_idx = None, None, None
for idx, row in df.iterrows():
    #print(idx, row.values)
    if prev_id is None:
        if row[N_Date] < 0:
            best_idx = idx
            #best_date_diff_in_this_run = row[N_Date]
    else:
        if row[N_ID] != prev_id or row[N_Date] < prev_DD:
            # Associate a 'group' value to row with index `best_idx`
            if best_idx is not None:
                best_location = df.loc[best_idx, N_Location]
                if best_location in location_to_group:
                    col_group.loc[best_idx] = location_to_group[best_location]
                else:
                    groups_found += 1
                    location_to_group[best_location] = groups_found
                    col_group.loc[best_idx] = groups_found
            # New run
            best_idx = None
        # Regardless, update best_idx
        if row[N_Date] < 0:
            best_idx = idx
            #best_date_diff_in_this_run = row[N_Date]
    # Done
    prev_id, prev_DD = row[N_ID], row[N_Date]

# Deal with the last "run" (same code as the one inside the loop)
# Associate a 'group' value to row with index `best_idx`
if best_idx is not None:
    best_location = df.loc[best_idx, N_Location]
    if best_location in location_to_group:
        col_group.loc[best_idx] = location_to_group[best_location]
    else:
        groups_found += 1
        location_to_group[best_location] = groups_found
        col_group.loc[best_idx] = groups_found

# DONE
df['Group'] = col_group
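A quick, optional sanity check after running the loop (output shown approximately; unassigned rows stay NaN, so the Group values may print as floats):
print(df[df[N_Group].notna()])
#      ID  Location  Date Diff (Days)  Group
# 0   111        87                -5      1
# 3   123        98               -21      2
# 6   123        30               -29      3
# 10  234        87               -32      1
# 12  234        30               -26      3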

What is the equivalent of proc format in SAS to python

I want
proc format;
value RNG
low - 24 = '1'
24< - 35 = '2'
35< - 44 = '3'
44< - high ='4'
I need this in python pandas.
If you are looking for an equivalent of the mapping function, you can use something like this.
df = pd.DataFrame(np.random.randint(100,size=5), columns=['score'])
print(df)
output:
score
0 73
1 90
2 83
3 40
4 76
Now let's apply the binning function to the score column and create a new column in the same dataframe.
def format_fn(x):
    if x <= 24:
        return '1'
    elif x <= 35:
        return '2'
    elif x <= 44:
        return '3'
    else:
        return '4'

df['binned_score'] = df['score'].apply(format_fn)
print(df)
output:
score binned_score
0 73 4
1 90 4
2 83 4
3 40 3
4 76 4
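A more direct, vectorized equivalent of the SAS value ranges (low-24, 24<-35, 35<-44, 44<-high, i.e. right-inclusive bins) is pd.cut; a minimal sketch, reusing the df above:
import numpy as np
import pandas as pd

# right-closed bins: (-inf, 24], (24, 35], (35, 44], (44, inf)
df['binned_score'] = pd.cut(df['score'],
                            bins=[-np.inf, 24, 35, 44, np.inf],
                            labels=['1', '2', '3', '4'])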

Python Pandas Feature Generation as aggregate function

I have a pandas df which is more or less like:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
this DF contains a couple of million points. I am trying to generate some descriptors now to incorporate the time nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of that particular key in the window. I did an implementation, but according to my estimation, for 23 different windows the calculation will run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. I however have that uneasy feeling that the iteration is probably not the smartest way to go for this data aggregation. Is there a way to implement it to run faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas .loc label slicing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of instances key appears in df, prior to this row
    return sum(df.loc[:cut_idx, 'key'] == key)
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records, this will still take awhile, and the relative performance gains will likely be diminished somewhat compared to this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
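Note that the cumulative count above counts all earlier occurrences of a key, not just those inside the bounded window from the question. If a trailing window of fixed length is required, here is one possible vectorized sketch; the names pos, prior_in_window and window1 are illustrative, and the window covers the previous 200 rows:
import numpy as np
import pandas as pd

window = 200

# position of each row in the frame (0, 1, 2, ...), regardless of the index labels
pos = pd.Series(np.arange(len(df)), index=df.index)

def prior_in_window(p):
    # p holds the (increasing) positions of one key's rows;
    # for each position x, count this key's rows with position in [x - window, x)
    arr = p.to_numpy()
    return np.searchsorted(arr, arr) - np.searchsorted(arr, arr - window)

df['window1'] = pos.groupby(df['key'].to_numpy()).transform(prior_in_window)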
