Complex Groupby Pandas Operation to Replace For Loops and If Statements - python

I have a complex group-within-a-group problem I need help with.
I have names of drivers, each of whom has driven several cars over time. Each time they turn on the car and drive, I capture cycles and hours, which are transmitted remotely.
What I am trying to do is use grouping to see when the driver gets a new car.
I'm using Car_Cycles and Car_Hours to monitor for a reset (new car). The hours and cycles are tabulated for each driver in ascending sequence until there's a new car and a reset. I want to number each car as a sequence, but I can only recognize a new car by a cycle/hour reset.
I used a for loop with if statements to do this on the dataframe, and it takes several hours to run. I have several hundred thousand rows, each with about 20 columns.
My data comes from sensors over a moderately reliable connection, so I want to filter using the following criterion: a new group is only valid when both Car_Hours and Car_Cycles are less than the previous group's last row for 2 consecutive rows. Using both values and checking for two rows of change sufficiently filters all erroneous data.
If someone could show me how to quickly solve for Car_Group without using my cumbersome for loops and if statements, I would greatly appreciate it.
Also, for those who are adventurous, I added my original for loop with if statements below. Note I did some other data analysis/tracking within each group to look at other behavior of the car. If you dare look at that code and show me an efficient Pandas replacement, all the more kudos.
name Car_Hours Car_Cycles Car_Group DeltaH
jan 101 404 1 55
jan 102 405 1 55
jan 103 406 1 56
jan 104 410 1 55
jan 105 411 1 56
jan 0 10 2 55
jan 1 12 2 58
jan 2 14 2 57
jan 3 20 2 59
jan 4 26 2 55
jan 10 36 2 56
jan 15 42 2 57
jan 27 56 2 57
jan 100 61 2 58
jan 500 68 2 58
jan 2 4 3 56
jan 3 15 3 57
pete 190 21 1 54
pete 211 29 1 58
pete 212 38 1 55
pete 304 43 1 56
pete 14 20 2 57
pete 15 27 2 57
pete 36 38 2 58
pete 103 47 2 55
mike 1500 2001 1 55
mike 1512 2006 1 59
mike 1513 2012 1 58
mike 1515 2016 1 57
mike 1516 2020 1 55
mike 1517 2024 1 57
..............
for i in range(len(file)):
    if i == 0:
        DeltaH_limit = 57
        car_thresholds = 0
        car_threshold_counts = 0
        car_change_true = 0
        car_change_index_loc = i
        total_person_thresholds = 0
        person_alert_count = 0
        person_car_count = 1
        person_car_change_count = 0
        total_fleet_thresholds = 0
        fleet_alert_count = 0
        fleet_car_count = 1
        fleet_car_change_count = 0
        if float(file['Delta_H'][i]) >= DeltaH_limit:
            car_threshold_counts += 1
            car_thresholds += 1
            total_person_thresholds += 1
            total_fleet_thresholds += 1
    elif i == 1:
        if float(file['Delta_H'][i]) >= DeltaH_limit:
            car_threshold_counts += 1
            car_thresholds += 1
            total_person_thresholds += 1
            total_fleet_thresholds += 1
    elif i > 1:
        if file['name'][i] == file['name'][i-1]:  # is same person?
            if float(file['Delta_H'][i]) >= DeltaH_limit:
                car_threshold_counts += 1
                car_thresholds += 1
                total_person_thresholds += 1
                total_fleet_thresholds += 1
            else:
                car_threshold_counts = 0
            if car_threshold_counts == 3:
                car_threshold_counts += 1
                person_alert_count += 1
                fleet_alert_count += 1
            # Car change?? Compare cycles and hours to look for reset
            if i+1 < len(file):
                if file['name'][i] == file['name'][i+1] == file['name'][i-1]:
                    if int(file['Car_Cycles'][i]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i]) < int(file['Car_Hours'][i-1]):
                        # next row must also be below the previous value (two consecutive rows)
                        if int(file['Car_Cycles'][i+1]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i+1]) < int(file['Car_Hours'][i-1]):
                            car_thresholds = 0
                            car_change_true = 1
                            car_threshold_counts = 0
                            old_pump_first_flight = car_change_index_loc
                            car_change_index_loc = i
                            old_pump_last_flight = i-1
                            person_car_count += 1
                            person_car_change_count += 1
                            fleet_car_count += 1
                            fleet_car_change_count += 1
                            print(i, ' working hard!')
                        else:
                            car_change_true = 0
                    else:
                        car_change_true = 0
                else:
                    car_change_true = 0
            else:
                car_change_true = 0
        else:  # new person, so new car
            car_thresholds = 0
            car_threshold_counts = 0
            car_change_index_loc = i
            car_change_true = 0
            total_person_thresholds = 0
            person_alert_count = 0
            person_car_count = 1
            person_car_change_count = 0
            if float(file['Delta_H'][i]) >= DeltaH_limit:
                car_threshold_counts += 1
                car_thresholds += 1
                total_person_thresholds += 1
                total_fleet_thresholds += 1
    file.loc[i, 'car_thresholds'] = car_thresholds
    file.loc[i, 'car_threshold_counts'] = car_threshold_counts
    file.loc[i, 'car_change_true'] = car_change_true
    file.loc[i, 'car_change_index_loc'] = car_change_index_loc
    file.loc[i, 'total_person_thresholds'] = total_person_thresholds
    file.loc[i, 'person_alert_count'] = person_alert_count
    file.loc[i, 'person_car_count'] = person_car_count
    file.loc[i, 'person_car_change_count'] = person_car_change_count
    file.loc[i, 'Total_Fleet_Thresholds'] = total_fleet_thresholds
    file.loc[i, 'Fleet_Alert_Count'] = fleet_alert_count
    file.loc[i, 'fleet_car_count'] = fleet_car_count
    file.loc[i, 'fleet_car_change_count'] = fleet_car_change_count

IIUC, if all we need to do is reproduce Car_Group, we can take advantage of a few tricks:
def twolow(s):
    # True where this value and the next one are both below the previous value
    return (s < s.shift()) & (s.shift(-1) < s.shift())

new_hour = twolow(df["Car_Hours"])
new_cycle = twolow(df["Car_Cycles"])
new_name = df["name"] != df["name"].shift()       # True at each new driver
name_group = new_name.cumsum()                    # integer id per driver
new_cargroup = new_name | (new_hour & new_cycle)  # True at each new car
cargroup_without_reset = new_cargroup.cumsum()
# renumber so the count restarts at 1 for each driver
cargroup = (cargroup_without_reset -
            cargroup_without_reset.groupby(name_group).transform(min) + 1)
Trick #1: if you want to find out where a transition occurs, compare something to a shifted version of itself.
Trick #2: if you have a True where every new group begins, when you take the cumulative sum of that, you get a series where every group has an integer associated with it.
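As a toy illustration of both tricks (my own mini-example, not part of the original answer):
import pandas as pd

s = pd.Series(["a", "a", "b", "b", "a"])
starts = s != s.shift()          # Trick #1: True where a new run begins
print(starts.cumsum().tolist())  # Trick #2: [1, 1, 2, 2, 3], one id per run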
The above gives me
>>> cargroup.head(10)
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
dtype: int32
>>> (cargroup == df.Car_Group).all()
True
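For the more adventurous part of the question, some of the loop's per-car and per-person bookkeeping can likewise be vectorized with groupby/cumsum. A rough sketch, assuming the sample's DeltaH column (the loop calls it Delta_H) and the loop's DeltaH_limit of 57; the running counts below are the ones that never reset mid-car in the original loop:
DeltaH_limit = 57
over = df["DeltaH"].astype(float) >= DeltaH_limit

# running threshold count that resets with each new car, per driver
df["car_thresholds"] = over.groupby([df["name"], cargroup]).cumsum()
# running threshold count that resets with each new driver
df["total_person_thresholds"] = over.groupby(df["name"]).cumsum()
# running threshold count across the whole fleet
df["Total_Fleet_Thresholds"] = over.cumsum()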


How do I create a while loop for this df that has moving average in every stage? [duplicate]

I want to allocate the shipments for each ID one unit at a time, using the average (BAL/SALES) to determine which store receives each unit.
Here's my dataframe:
ID STOREID BAL SALES SHIP
1 STR1 50 5 18
1 STR2 6 7 18
1 STR3 74 4 18
2 STR1 35 3 500
2 STR2 5 4 500
2 STR3 54 7 500
While SHIP (grouped by ID) is greater than 0, calculate AVG (BAL/SALES), give the row with the lowest AVG in each group +1 to its BAL column and +1 to its Final column, and then repeat the process until SHIP is 0. The AVG is different at every stage, which is why I wanted a while loop.
Sample output of the first round is below. Repeat this until SHIP is 0 and the sum of Final per ID equals SHIP:
ID STOREID BAL SALES SHIP AVG Final
1 STR1 50 5 18 10 0
1 STR2 6 4 18 1.5 1
1 STR3 8 4 18 2 0
2 STR1 35 3 500 11.67 0
2 STR2 5 4 500 1.25 1
2 STR3 54 7 500 7.71 0
I've tried a couple of ways in SQL, I thought it would be better to do it in python but I haven't been doing a great job with my loop. Here's what I tried so far:
df['AVG'] = 0
df['FINAL'] = 0

for i in df.groupby(["ID"])['SHIP']:
    if i > 0:
        df['AVG'] = df['BAL'] / df['SALES']
        df['SHIP'] = df.groupby(["ID"])['SHIP'] - 1
        total = df.groupby(["ID"])["FINAL"].transform("cumsum")
        df['FINAL'] = + 1
        df['A'] = + 1
    else:
        df['FINAL'] = 0
This was challenging because more than one row in a group can have the same average, which then throws off the allocation.
This works on the example dataframe, if I understood you correctly.
import pandas as pd

d = {'ID': [1, 1, 1, 2, 2, 2],
     'STOREID': ['str1', 'str2', 'str3', 'str1', 'str2', 'str3'],
     'BAL': [50, 6, 74, 35, 5, 54],
     'SALES': [5, 7, 4, 3, 4, 7],
     'SHIP': [18, 18, 18, 500, 500, 500]}
df = pd.DataFrame(data=d)
df['AVG'] = 0
df['FINAL'] = 0

def calc_something(x):
    # one iteration per unit to ship, capped at 500
    for i in range(x.iloc[0]['SHIP'])[0:500]:
        x['AVG'] = x['BAL'] / x['SALES']
        x['SHIP'] = x['SHIP'] - 1
        # put the row with the lowest AVG first, then give it one unit
        x = x.sort_values('AVG').reset_index(drop=True)
        x.iloc[0, 2] = x['BAL'][0] + 1    # column 2 is BAL
        x.iloc[0, 6] = x['FINAL'][0] + 1  # column 6 is FINAL
    return x

df_final = (df.groupby('ID').apply(calc_something)
              .reset_index(drop=True)
              .sort_values(['ID', 'STOREID']))
df_final
ID STOREID BAL SALES SHIP AVG FINAL
1 1 STR1 50 5 0 10.000 0
0 1 STR2 24 7 0 3.286 18
2 1 STR3 74 4 0 18.500 0
4 2 STR1 127 3 0 42.333 92
5 2 STR2 170 4 0 42.500 165
3 2 STR3 297 7 0 42.286 243
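On the tie concern from the question: sort_values is not guaranteed to be stable with its default kind='quicksort', so equal AVGs can be ordered arbitrarily. One hedged way to make the tie-break explicit is a secondary sort key inside calc_something, for example:
# hypothetical tie-break: among equal AVGs, prefer the store with the lower BAL
x = x.sort_values(['AVG', 'BAL']).reset_index(drop=True)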

if and elif condition based on values of a dataframe

I'm trying to create a new column based on the columns that are present in the dataframe. This is the sample data:
ORG CHAIN_NBR SEQ_NBR INT_STATUS BLOCK_CODE_1 BLOCK_CODE_2
0 523 1 0 A C A
1 523 2 1 I A D
2 521 3 1 A H F
3 513 4 1 D H Q
4 513 5 1 8 I M
This is the code that I'm Executing:
df = pd.read_csv("rtl_one", sep="\x01")

def risk():
    if df['INT_STATUS'].isin(['B','C','F','H','P','R','T','X','Z','8','9']):
        df['rcut'] = '01'
    elif df['BLOCK_CODE_1'].isin(['A','B','C','D','E','F','G','I','J','K','L','M', 'N','O','P','R','U','W','Y','Z']):
        df['rcut'] = '02'
    elif df["BLOCK_CODE_2"].isin(['A','B','C','D','E','F','G','I','J','K','L','M', 'N','O','P','R','U','W','Y','Z']):
        df['rcut'] == '03'
    else:
        df['rcut'] = '00'

risk()
Output data should look like this:
ORG CHAIN_NBR SEQ_NBR INT_STATUS BLOCK_CODE_1 BLOCK_CODE_2 rcut
0 523 1 0 A C A 02
1 523 2 1 I A D 02
2 521 3 1 A H F 03
3 513 4 1 D H Q 00
4 513 5 1 8 I M 01
Use iterrows and store the results in a list that you can then append to the dataframe as a column:
rcut = []
for i, row in df.iterrows():
    if row['INT_STATUS'] in ['B','C','F','H','P','R','T','X','Z','8','9']:
        rcut.append('01')
    elif row['BLOCK_CODE_1'] in ['A','B','C','D','E','F','G','I','J','K','L','M', 'N','O','P','R','U','W','Y','Z']:
        rcut.append('02')
    elif row['BLOCK_CODE_1'] in ['A','B','C','D','E','F','G','I','J','K','L','M', 'N','O','P','R','U','W','Y','Z']:
        rcut.append('03')
    else:
        rcut.append('00')
df['rcut'] = rcut
(Note: your 2nd and 3rd conditions are the same; I've reused your code here, so you would have to change that.)
Alternatively, use .index to add a new column (df['rcut'] = df.index), and then use df.insert(index, 'rcut', value) inside your if/elif conditions.
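For reference, the whole if/elif chain can also be vectorized. A sketch using numpy.select, assuming the third condition is meant to test BLOCK_CODE_2 (as the expected output suggests):
import numpy as np

# value lists copied from the question
status_vals = ['B','C','F','H','P','R','T','X','Z','8','9']
block_vals = ['A','B','C','D','E','F','G','I','J','K','L','M',
              'N','O','P','R','U','W','Y','Z']

conditions = [
    df['INT_STATUS'].isin(status_vals),
    df['BLOCK_CODE_1'].isin(block_vals),
    df['BLOCK_CODE_2'].isin(block_vals),
]
# the first matching condition wins, mirroring if/elif order
df['rcut'] = np.select(conditions, ['01', '02', '03'], default='00')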

Python Pandas Feature Generation as aggregate function

I have a pandas df which is more or less like:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
this DF contains a couple of million rows. I am now trying to generate some descriptors to incorporate the time nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of that row's key within the window. I did an implementation, but by my estimate the calculation for 23 different windows would run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. I have the uneasy feeling, however, that iteration is probably not the smartest way to go for this data aggregation. Is there a way to implement it to run faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas label-based indexing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of instances key appears in df, prior to this row
    return sum(df.loc[:cut_idx, 'key'] == key)  # .loc replaces the deprecated .ix
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records, this will still take a while, and the relative performance gains will likely be diminished somewhat compared to this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
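One caveat: the cumsum approach counts all prior occurrences of a key, while the question asked for a bounded lookback window. A hedged sketch of a bounded variant using numpy.searchsorted, assuming a default integer row order and the strict bounds from the question (index < lid, index > lid - 200); window200 is a made-up column name:
import numpy as np

w = 200
out = np.empty(len(df), dtype=int)
for key, grp in df.reset_index().groupby('key'):
    pos = grp.index.to_numpy()  # row positions where this key occurs, sorted
    # earlier occurrences strictly inside (p - w, p):
    # rank of p among its key's rows, minus those at or below p - w
    lo = np.searchsorted(pos, pos - w, side='right')
    out[pos] = np.arange(len(pos)) - lo
df['window200'] = out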

Python: How to read in file once but not exhaust iterator in nested for loops

Two input files. Regions input file:
Start: 1 id123 1234
Stop: 1 id234 3456
Start: 2 id456 34523
Stop: 2 id231 35234
Positions input file:
1 123
1 1234
1 1256
1 1390
1 1490
1 3456
1 3560
1 5000
2 345
2 456
2 34523
2 34589
2 35234
2 40000
I want to add a third field to the positions file, where the positions fall inside my regions. This is what I wrote, Option 1:
regions = open(fileone, 'r')
positions = open(filetwo, 'r').readlines()

for start in regions:
    stop = regions.next()
    a = start.split()
    b = stop.split()
    if 'Start' in a[0] and 'Stop' in b[0]:
        for line in positions:
            pos = line.split()
            if pos[0] == a[1] and pos[1] >= a[3] and pos[1] <= b[3]:
                pos.append("1")
            else:
                pos.append("0")
            print("\t".join(pos))
Alternative, option 2:
regions = open(fileone, 'r')
positions = open(filetwo, 'r')
d = {}

for start in regions:
    stop = regions.next()
    a = start.split()
    b = stop.split()
    if 'Start' in a[0] and 'Stop' in b[0]:
        d[a[1]] = [a[3], b[3]]

for line in positions:
    pos = line.split()
    chr = d.keys()
    beg = d.values()[0][0]
    end = d.values()[0][1]
    if pos[0] == chr and pos[1] >= beg and pos[1] <= end:
        pos.append("1")
    else:
        pos.append("0")
    print("\t".join(pos))
Option 1 returns the file twice, with only one region annotated in each repetition:
1 123 0
1 1234 1
1 1256 1
1 1390 1
1 1490 1
1 3456 1
1 3560 0
1 5000 0
2 345 0
2 456 0
2 34523 0
2 34589 0
2 35234 0
2 40000 0
1 123 0
1 1234 0
1 1256 0
1 1390 0
1 1490 0
1 3456 0
1 3560 0
1 5000 0
2 345 0
2 456 0
2 34523 1
2 34589 1
2 35234 1
2 40000 0
Option 2 just returns all 0 in column 3 (chr = d.keys() is a list of all keys, so pos[0] == chr never matches).
What I would like is a combination of the two, where the second region is also annotated the first go around. I know I could run it once for each region and then combine, but that would get messy with the volume of my real data so I'd rather avoid combining them after the fact.
Thanks in advance :)
Desired output:
1 123 0
1 1234 1
1 1256 1
1 1390 1
1 1490 1
1 3456 1
1 3560 0
1 5000 0
2 345 0
2 456 0
2 34523 1
2 34589 1
2 35234 1
2 40000 0
I propose:
def one(regions):
    with open(regions, 'r') as f:
        for line in f:
            a = line.split()
            b = f.next().split()
            assert(a[0] == 'Start:' and b[0] == 'Stop:')
            assert(a[1] == b[1])
            yield (a[1], (int(a[3]), int(b[3])))

def two(positions, regions):
    d = dict(one(regions))
    with open(positions, 'r') as g:
        for line in g:
            ls = tuple(line.split())
            yield ls + (1 if d[ls[0]][0] <= int(ls[1]) <= d[ls[0]][1] else 0,)

print list(two('filetwo.txt', 'fileone.txt'))
print '================================'
print '\n'.join('%s\t%s\t%s' % x for x in two('filetwo.txt', 'fileone.txt'))
EDIT
It seems that the following code does what you ask as an improvement:
def two(positions, regions):
    d = defaultdict(list)
    for k, v in one(regions):
        d[k].append(v)
    with open(positions, 'r') as g:
        for line in g:
            ls = tuple(line.split())
            yield ls + (1 if any(x <= int(ls[1]) <= y for (x, y) in d[ls[0]])
                        else 0,)
with
from collections import defaultdict
at the beginning
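(A side note, not from the original answer: this is Python 2 code. Under Python 3, f.next() becomes next(f) and print needs parentheses. A minimal sketch of one() ported accordingly:)
def one(regions):
    with open(regions) as f:
        for line in f:
            a = line.split()
            b = next(f).split()  # Python 3: next(f) instead of f.next()
            assert a[0] == 'Start:' and b[0] == 'Stop:'
            assert a[1] == b[1]
            yield a[1], (int(a[3]), int(b[3]))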

Is it efficient to perform individual, nested if statements?

I'm working on the following problem:
You are driving a little too fast, and a police officer stops you. Write code to compute the result, encoded as an int value: 0=no ticket, 1=small ticket, 2=big ticket. If speed is 60 or less, the result is 0. If speed is between 61 and 80 inclusive, the result is 1. If speed is 81 or more, the result is 2. Unless it is your birthday -- on that day, your speed can be 5 higher in all cases.
I came up with the following code:
def caught_speeding(speed, is_birthday):
    if is_birthday == True:
        if speed <= 65:
            return 0
        elif speed <= 85:
            return 1
        else:
            return 2
    else:
        if speed <= 60:
            return 0
        elif speed <= 80:
            return 1
        else:
            return 2
I feel like checking each one individually is a bit inefficient, or is it ok?
You gotta love the bisect module.
import bisect

def caught_speeding(speed, is_birthday):
    l = [60, 80]
    if is_birthday:
        speed -= 5
    return bisect.bisect_left(l, speed)
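A quick spot-check of the boundary cases (my own, not part of the original answer):
>>> caught_speeding(60, False)   # at the no-ticket boundary
0
>>> caught_speeding(65, True)    # birthday allowance of 5
0
>>> caught_speeding(81, False)   # big ticket
2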
I have no problems with your code. It is readable and clear.
If you want to go for less lines then you can do something like this:
def caught_speeding(speed, is_birthday):
    adjustment = 5 if is_birthday else 0
    if speed <= 60 + adjustment:
        return 0
    elif speed <= 80 + adjustment:
        return 1
    else:
        return 2
You can do this:
def caught_speeding(speed, is_birthday):
    if is_birthday:
        speed = speed - 5
    if speed <= 60:
        return 0
    elif speed <= 80:
        return 1
    else:
        return 2
Doing is_birthday == True means you didn't quite get booleans yet ;-)
Check this one. It is optimised:
def caught_speeding(speed, is_birthday):
    if speed in range(0, 66 if is_birthday else 61):
        return 0
    elif speed in range(0, 86 if is_birthday else 81):
        return 1
    return 2
Assuming that speed is an integer, and that efficiency means speed of running, not speed of understanding:
>>> def t(speed, is_birthday):
...     speed -= 5 * is_birthday
...     return speed // 61 + speed // 81
...
>>> for s in xrange(58, 87):
...     print s, t(s, False), t(s, True)
...
58 0 0
59 0 0
60 0 0
61 1 0
62 1 0
63 1 0
64 1 0
65 1 0
66 1 1
67 1 1
68 1 1
69 1 1
70 1 1
71 1 1
72 1 1
73 1 1
74 1 1
75 1 1
76 1 1
77 1 1
78 1 1
79 1 1
80 1 1
81 2 1
82 2 1
83 2 1
84 2 1
85 2 1
86 2 2
>>>
def caught_speeding(speed, is_birthday):
    speed -= 5 * is_birthday
    return 0 if speed < 61 else 2 if speed > 80 else 1
