Find Common Region in Two CSV Files in Python

I have two CSV files with 10 columns each, where the first column is the "Primary Key".
I need to use Python to find the common regions between the two CSV files. For example, I should be able to detect that rows 27-45 in CSV1 are equal to rows 125-145 in CSV2, and so on.
I am only comparing the Primary Key (column one); the rest of the data is not considered for comparison. I need to extract these common regions into two separate CSV files (one for CSV1 and one for CSV2).
I have already parsed and stored the rows of the two CSV files in two lists of lists, lstCAN_LOG_TABLE and lstSHADOW_LOG_TABLE, so the problem reduces to comparing these two lists of lists.
My current assumption is that if there are 10 subsequent matches (MAX_COMMON_THRESHOLD), I have reached the beginning of a common region. I must not log single matching rows, because what I need to identify are whole regions that are equal (as per the Primary Key).
for index in range(len(lstCAN_LOG_TABLE)):
    for l_index in range(len(lstSHADOW_LOG_TABLE)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]): #Consider for comparison only CAN IDs
            index_can_log = index        #Position where CAN Log is to be compared
            index_shadow_log = l_index   #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
            can_index = index
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1
                    else:
                        break
                else:
                    bPreScreened = 0
                    print("No Match")
                    break
            #we might have found start of common region
            if(bPreScreened == 1):
                print("Start={0} End={1} can_index={2}".format(start,end,can_index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1
                        else:
                            dump_file.close()
                            print("\nCommon Region in Two CSVs identifed and recorded\n")
                            return
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")
I am getting strange output. The first CSV file has only 1880 rows, yet the common-region CSV generated for it contains many more entries than that. I am not getting the desired output.
EDITED FROM HERE
CSV1:
216 0.000238225 F4 41 C0 FB 28 0 0 0 MS CAN
109 0.0002256 15 8B 31 0 8 43 58 0 HS CAN
216 0.000238025 FB 47 C6 1 28 0 0 0 MS CAN
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
CSV2:
216 0.0002361 0F 5C DB 14 28 0 0 0 MS CAN
216 0.000236225 16 63 E2 1B 28 0 0 0 MS CAN
109 0.0001412 16 A3 31 0 8 63 58 0 HS CAN
216 0.000234075 1C 6A E9 22 28 0 0 0 MS CAN
40A 0.000259925 C1 1 46 54 30 44 47 36 HS CAN
4A 0.000565975 2 0 0 0 0 0 0 C0 MS CAN
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
EXPECTED OUTPUT CSV1:
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
EXPECTED OUTPUT CSV2:
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
OBSERVED OUTPUT CSV1:
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
...followed by many thousands of redundant rows of data.
EDITED - SOLVED AS PER ADVICE (CHANGED FOR TO WHILE):
LEARNING: in Python, reassigning a for-loop index inside the loop does not persist; the loop variable is reset from the iterator on the next iteration.
dump_file = open("MATCH_PATTERN.txt", 'w+')
print("Number of Entries CAN LOG={0}".format(len(lstCAN_LOG_TABLE)))
print("Number of Entries SHADOW LOG={0}".format(len(lstSHADOW_LOG_TABLE)))
index = 0
while(index < (input_file_one_row_count - 1)):
    l_index = 0
    while(l_index < (input_file_two_row_count - 1)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]): #Consider for comparison only CAN IDs
            index_can_log = index        #Position where CAN Log is to be compared
            index_shadow_log = l_index   #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            can_index = index
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1
                    else:
                        break
                else:
                    bPreScreened = 0
                    break
            #we might have found start of common region
            if(bPreScreened == 1):
                print("Shadow Start={0} Shadow End={1} CAN INDEX={2}".format(start,end,index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1
                        if((l_index + 1) < (input_file_two_row_count-1)):
                            l_index = l_index + 1
                        else:
                            dump_file.close()
                            print("\nCommon Region in Two CSVs identifed and recorded\n")
                            return
            else:
                l_index = l_index + 1
        else:
            l_index = l_index + 1
    index = index + 1
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")

index is the iterator in your for loop. Even if you change it inside the loop, it will be reassigned from the iterator at the start of the next iteration.
Say index = 5 in your for loop, and index += 1 was executed 3 times, so now index = 8. But when this iteration ends and your code goes back to the for statement, index will be assigned 6.
Try following example:
for index in range(0, 5):
    print('iterator:', index)
    index = index + 2
    print('index:', index)
The output will be:
iterator: 0
index: 2
iterator: 1
index: 3
iterator: 2
index: 4
iterator: 3
index: 5
iterator: 4
index: 6
To fix this issue, you might want to change your for loop to a while loop.
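For contrast, here is a minimal while-loop sketch of the same pattern (hypothetical bounds), where the manual increment does persist:
index = 0
while index < 5:
    print('iterator:', index)
    index = index + 2   # this change is kept on the next pass of the while loop
    print('index:', index)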
EDIT:
If I didn't misunderstand, you are trying to find the 'same' rows in two files and store them.
If that is the case, your task can actually be done easily using the following code:
import csv  # import csv module to read csv files

file1 = 'csv1.csv'    # input file 1
file2 = 'csv2.csv'    # input file 2
outfile = 'csv3.csv'  # only one output file, since the two output files would be identical

in1 = open(file1, 'r', newline='')
out = open(outfile, 'w', newline='')
read1 = csv.reader(in1)   # read input file 1
write = csv.writer(out)   # write to output file

# for each row in input file 1, compare it with each row in input file 2;
# if they are the same, write that row into the output file
for row1 in read1:
    with open(file2, 'r', newline='') as in2:
        read2 = csv.reader(in2)
        for row2 in read2:
            if row1 == row2:
                write.writerow(row1)

in1.close()
out.close()
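A shorter variant of the same idea (only a sketch, not the answer above): read file2 once into a set of row tuples, then stream file1 and keep the rows present in both files. This avoids re-reading file2 for every row of file1.
import csv

with open('csv2.csv', 'r', newline='') as f2:
    rows2 = {tuple(row) for row in csv.reader(f2)}   # all rows of file 2, as hashable tuples

with open('csv1.csv', 'r', newline='') as f1, \
     open('csv3.csv', 'w', newline='') as fout:
    write = csv.writer(fout)
    for row1 in csv.reader(f1):
        if tuple(row1) in rows2:        # keep rows that also appear in file 2
            write.writerow(row1)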

Related

How to skip a for loop by X amount and continue for Y amount, then repeat

I've been struggling with this for a while: how can I skip ahead in a for loop in Python by X and continue for Y, then repeat?
For example:
# This loop should run up to 99, but after it reaches a number that is a multiple of 10 (e.g. 10, 20, 30, etc.)
# it should continue for 5 more iterations, then skip to the next multiple of 10.
# E.g. 0,1,2,3,4,5,10,11,12,13,14,15,20,21,22,23,24,25,30... etc.
for page_num in range(100):
    # Use page_num however
Modify the loop to use a step of 10, and add a sub-loop to have iterations that break off after the 5th element.
for j in range(0, 100, 10):
    for page_num in range(j, j + 6):
        # use page_num
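A quick way to check the sequence the nested loops produce (a small sketch using a list comprehension):
seq = [page_num for j in range(0, 100, 10) for page_num in range(j, j + 6)]
print(seq[:13])   # [0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 20]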
Use continue to skip the rest of the loop if the digit in units place is greater than 5.
skip_after = 5
for page_num in range(100):
    if page_num % 10 > skip_after:
        continue
    # ... Do the rest of your loop
    print(page_num)
page_num % 10 uses the modulo % operator to give the remainder from dividing page_num by 10 (which is the digit in the units place).
Output (joined into a single line for readability):
0 1 2 3 4 5 10 11 12 13 14 15 20 21 22 23 24 25 30
Using itertools.filterfalse
from itertools import filterfalse

for x in filterfalse(lambda x: (x % 10) > 5, range(0, 100)):
    print(x, end=' ')
Output:
0 1 2 3 4 5 10 11 12 13 14 15 20 21 22 23 24 25 30 31 32 33 34 35 40 41 42 43 44 45 50 51 52 53 54 55 60 61 62 63 64 65 70 71 72 73 74 75 80 81 82 83 84 85 90 91 92 93 94 95
I would recommend a list comprehension. I'm not positive this is the complete answer, but I would do it as:
def find_divisors_2(page_num):
    """
    You can insert an algorithmic phrase after `if` and it will produce the answer.
    """
    find_divisors_2 = [x for x in range(100) if x % 10 <= 5]  # insert your condition here
    return find_divisors_2
You can directly modify the value of page_num inside the for loop with a certain condition:
for page_num in range(100):
    if str(page_num)[-1] == "6":
        page_num += 4
    print(page_num)

determine the range of a value using a look up table

I have a df with numbers:
numbers = pd.DataFrame(columns=['number'], data=[
    50,
    65,
    75,
    85,
    90
])
and a df with ranges (look up table):
ranges = pd.DataFrame(
    columns=['range', 'range_min', 'range_max'],
    data=[
        ['A', 90, 100],
        ['B', 85, 95],
        ['C', 70, 80]
    ]
)
I want to determine what range (in second table) a value (in the first table) falls in. Please note ranges overlap, and limits are inclusive.
Also please note the vanilla dataframe above has 3 ranges, however this dataframe gets generated dynamically. It could have from 2 to 7 ranges.
Desired result:
numbers = pd.DataFrame(columns=['number', 'detected_range'], data=[
    [50, 'out_of_range'],
    [65, 'out_of_range'],
    [75, 'C'],
    [85, 'B'],
    [90, 'overlap']   # could be A or B
])
I solved this with a for loop, but it doesn't scale well to the big dataset I am using, and the code is too extensive and inelegant. See below:
numbers['detected_range'] = np.nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        if row2.range_min <= row1.number <= row2.range_max:
            numbers.loc[i, 'detected_range'] = row2['range']
        elif (other cases...):
            ...and so on...
How could I do this?
You can use a bit of numpy vectorial operations to generate masks, and use them to select your labels:
import numpy as np

a = numbers['number'].values   # numpy array of numbers
r = ranges.set_index('range')  # dataframe of min/max with labels as index

m1 = (a >= r['range_min'].values[:, None]).T  # is number at or above each min
m2 = (a <= r['range_max'].values[:, None]).T  # is number at or below each max (limits are inclusive)
m3 = (m1 & m2)                                # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2

m4 = m3.sum(1)  # how many matches?
# 0 -> out_of_range
# 2 -> overlap
# 1 -> get column name

# now we select the label according to the conditions
numbers['detected_range'] = np.select(
    [m4 == 0, m4 == 2],                        # out_of_range and overlap
    ['out_of_range', 'overlap'],
    default=np.take(r.index, m3.argmax(1))     # otherwise get column name
)
output:
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
Edit: it works with any number of intervals in ranges.
Example output with an extra row ['D', 50, 51]:
number detected_range
0 50 D
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
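The answer does not show how the extra row was added; one plausible way (a sketch, with column names matching the ranges frame above) is:
extra = pd.DataFrame([['D', 50, 51]], columns=['range', 'range_min', 'range_max'])
ranges = pd.concat([ranges, extra], ignore_index=True)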
Pandas IntervalIndex fits in here; however, since the ranges overlap, a for loop is the approach I'll use (for unique, non-overlapping intervals, pd.get_indexer is a fast approach):
intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
                                         ranges.range_max,
                                         closed='both')
box = []
for num in numbers.number:
    bools = intervals.contains(num)
    if bools.sum() == 1:
        box.append(ranges.range[bools].item())
    elif bools.sum() > 1:
        box.append('overlap')
    else:
        box.append('out_of_range')

numbers.assign(detected_range=box)
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
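The answer mentions pd.get_indexer as a fast path when the intervals are unique and non-overlapping. A minimal sketch of that path, using hypothetical non-overlapping ranges (it does not apply to the overlapping lookup table in the question):
import numpy as np
import pandas as pd

numbers = pd.DataFrame({'number': [50, 65, 75, 85, 90]})
lut = pd.DataFrame({'range': ['B', 'A'],
                    'range_min': [70, 90],
                    'range_max': [89, 100]})   # non-overlapping on purpose
iv = pd.IntervalIndex.from_arrays(lut.range_min, lut.range_max, closed='both')
pos = iv.get_indexer(numbers.number)           # -1 where no interval contains the value
numbers['detected_range'] = np.where(pos == -1, 'out_of_range',
                                     lut['range'].to_numpy()[pos])
print(numbers)   # 50/65 -> out_of_range, 75/85 -> B, 90 -> A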
Firstly, explode the ranges:
df1 = ranges.assign(col1=ranges.apply(lambda ss: range(ss.range_min, ss.range_max), axis=1)).explode('col1')
df1
range range_min range_max col1
0 A 90 100 90
0 A 90 100 91
0 A 90 100 92
0 A 90 100 93
0 A 90 100 94
0 A 90 100 95
0 A 90 100 96
0 A 90 100 97
0 A 90 100 98
0 A 90 100 99
1 B 85 95 85
1 B 85 95 86
1 B 85 95 87
1 B 85 95 88
1 B 85 95 89
1 B 85 95 90
Secondly, judge whether each of the numbers in the first df falls in a range:
def function1(x):
    df11 = df1.loc[df1.col1 == x]
    if len(df11) == 0:
        return 'out_of_range'
    if len(df11) > 1:
        return 'overlap'
    return df11.iloc[0, 0]

numbers.assign(col2=numbers.number.map(function1))
number col2
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
The logic is simple and clear.
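One caveat worth noting (an editorial addition, not part of the original answer): Python's range excludes the stop value, so with inclusive limits as stated in the question, the explode step would presumably need the upper bound bumped by one:
df1 = ranges.assign(col1=ranges.apply(lambda ss: range(ss.range_min, ss.range_max + 1), axis=1)).explode('col1')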

Python - Groupby Multiple Criteria and Closest Integer

Here, I am trying to assign groups based on multiple criteria and the closest date diff prior to each zero. The groupby should look only within each ID, find the closest negative datediff value prior to each zero (not positive; I am trying to look back in time), and assign a group based on the Location integer. I will have hundreds of groups, and groups should be assigned based on the Location integer, so multiple IDs can have the same group if the Location is the same.
Please let me know if I should elaborate or reword - thank you for your help!
Input:
ID Location Date Diff (Days)
111 87 -5
111 88 0
123 97 -123
123 98 -21
123 55 0
123 56 -59
123 30 -29
123 46 0
123 46 25
123 31 87
234 87 -32
234 55 0
234 30 -26
234 54 0
Expected Output:
ID Location Date Diff (Days) Group
111 87 -5 1
111 88 0
123 97 -123
123 98 -21 2
123 55 0
123 56 -59
123 30 -29 3
123 46 0
123 46 25
123 31 87
234 87 -32 1
234 55 0
234 30 -26 3
234 54 0
IIUC, you can find the indexes where a group value should be added by using where to mask all values in Diff (I renamed the column Date Diff (Days) to Diff for simplicity) that are greater than or equal to 0. Then group by ID and by sub-groups formed where the shifted Diff column equals 0, cumulatively summed. For each group get the idxmax, drop the NaNs, and collect the list of indexes. The second step is to use this list of indexes and the Location column to create a unique ID for each Location with pd.factorize.
idx = (df['Diff'].where(lambda x: x.lt(0))
         .groupby([df['ID'],
                   df['Diff'].shift().eq(0).cumsum()])
         .idxmax().dropna().tolist()
       )
df['Group'] = ''
df.loc[idx, 'Group'] = (pd.factorize(df.loc[idx, 'Location'])[0] + 1)
print(df)
ID Location Diff Group
0 111 87 -5 1
1 111 88 0
2 123 97 -123
3 123 98 -21 2
4 123 55 0
5 123 56 -59
6 123 30 -29 3
7 123 46 0
8 123 46 25
9 123 31 87
10 234 87 -32 1
11 234 55 0
12 234 30 -26 3
13 234 54 0
Because the order of rows matters, the most straightforward answer I can think of (that still has somewhat readable code) uses a loop... so I sure hope that performance is not an issue.
The code is less cumbersome than it seems. I hope the code comments are clear enough.
# Your data
df = pd.DataFrame(
    data=[[111, 87, -5],
          [111, 88, 0],
          [123, 97, -123],
          [123, 98, -21],
          [123, 55, 0],
          [123, 56, -59],
          [123, 30, -29],
          [123, 46, 0],
          [123, 46, 25],
          [123, 31, 87],
          [234, 87, -32],
          [234, 55, 0],
          [234, 30, -26],
          [234, 54, 0]], columns=['ID', 'Location', 'Date Diff (Days)'])
N_ID, N_Location, N_Date, N_Group = 'ID', 'Location', 'Date Diff (Days)', 'Group'

# Some preparations
col_group = pd.Series(index=df.index)  # The final column we'll add to our `df`
groups_found = 0
location_to_group = dict()  # To maintain our mapping of Location to "group" values

# LOOP
prev_id, prev_DD, best_idx = None, None, None
for idx, row in df.iterrows():
    #print(idx, row.values)
    if prev_id is None:
        if row[N_Date] < 0:
            best_idx = idx
            #best_date_diff_in_this_run = row[N_Date]
    else:
        if row[N_ID] != prev_id or row[N_Date] < prev_DD:
            # Associate a 'group' value to row with index `best_idx`
            if best_idx is not None:
                best_location = df.loc[best_idx, N_Location]
                if best_location in location_to_group:
                    col_group.loc[best_idx] = location_to_group[best_location]
                else:
                    groups_found += 1
                    location_to_group[best_location] = groups_found
                    col_group.loc[best_idx] = groups_found
            # New run
            best_idx = None
        # Regardless, update best_idx
        if row[N_Date] < 0:
            best_idx = idx
            #best_date_diff_in_this_run = row[N_Date]
    # Done
    prev_id, prev_DD = row[N_ID], row[N_Date]

# Deal with the last "run" (same code as the one inside the loop)
# Associate a 'group' value to row with index `best_idx`
if best_idx is not None:
    best_location = df.loc[best_idx, N_Location]
    if best_location in location_to_group:
        col_group.loc[best_idx] = location_to_group[best_location]
    else:
        groups_found += 1
        location_to_group[best_location] = groups_found
        col_group.loc[best_idx] = groups_found

# DONE
df['Group'] = col_group

Python Pandas Feature Generation as aggregate function

I have a pandas df which is more or less like:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
This DF contains a couple of million rows. I am now trying to generate some descriptors to incorporate the time nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of the particular key within that window. I did an implementation, but according to my estimate, for 23 different windows the calculation would run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. However, I have the uneasy feeling that iteration is probably not the smartest way to do this data aggregation. Is there a way to implement it so it runs faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window

print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas label indexing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of instances key appears in df, prior to this row
    return sum(df.loc[:cut_idx, 'key'] == key)

print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records, this will still take awhile, and the relative performance gains will likely be diminished somewhat compared to this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
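To tie this back to the 10-row sample from the first part of the answer, here is a minimal check using the equivalent groupby(...).cumcount() (equivalent here because dist is all ones, so the cumulative sum minus one is just the count of prior occurrences):
import pandas as pd

sample = pd.DataFrame({'ID': range(1, 11),
                       'key': [57, 22, 12, 45, 94, 36, 38, 94, 94, 38],
                       'dist': [1] * 10}).set_index('ID')
sample['window'] = sample.groupby('key').cumcount()   # prior occurrences of each key
print(sample)   # second 94 -> 1, third 94 -> 2, second 38 -> 1, all others 0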

Complex Groupby Pandas Operation to Replace For Loops and If Statements

I have a complex group-of-a-group problem I need help with.
I have names of drivers, each of whom has driven several cars over time. Each time they turn on a car and drive, I capture cycles and hours, which are transmitted remotely.
What I am trying to do is use grouping to see when a driver gets a new car.
I'm using Car_Cycles and Car_Hours to monitor for a reset (new car). The hours and cycles are tabulated for each driver in ascending sequence until there is a new car and a reset. I want to make each car a sequence, but logically I can only recognize the car by a cycle/hour reset.
I used a for loop with if statements to do this on the dataframe, and it takes several hours to run. I have several hundred thousand rows, each with about 20 columns.
My data comes from sensors over a moderately reliable connection, so I want to filter using the following criteria: a new group is only valid when both Car_Hours and Car_Cycles are less than the previous group's last row for 2 consecutive rows. Using both outputs and checking for two rows of change sufficiently filters all erroneous data.
If someone could show me how to quickly solve for Car_Group without using my cumbersome for loops and if statements, I would greatly appreciate it.
Also, for those who are adventurous, I added my original for loop with if statements below. Note that I did some other data analysis/tracking within each group to look at other behavior of the car. If you dare look at that code and can show me an efficient Pandas replacement, all the more kudos.
name Car_Hours Car_Cycles Car_Group DeltaH
jan 101 404 1 55
jan 102 405 1 55
jan 103 406 1 56
jan 104 410 1 55
jan 105 411 1 56
jan 0 10 2 55
jan 1 12 2 58
jan 2 14 2 57
jan 3 20 2 59
jan 4 26 2 55
jan 10 36 2 56
jan 15 42 2 57
jan 27 56 2 57
jan 100 61 2 58
jan 500 68 2 58
jan 2 4 3 56
jan 3 15 3 57
pete 190 21 1 54
pete 211 29 1 58
pete 212 38 1 55
pete 304 43 1 56
pete 14 20 2 57
pete 15 27 2 57
pete 36 38 2 58
pete 103 47 2 55
mike 1500 2001 1 55
mike 1512 2006 1 59
mike 1513 2012 1 58
mike 1515 2016 1 57
mike 1516 2020 1 55
mike 1517 2024 1 57
..............
for i in range(len(file)):
    if i == 0:
        DeltaH_limit = 57
        car_thresholds = 0
        car_threshold_counts = 0
        car_threshold_counts = 0
        car_change_true = 0
        car_change_index_loc = i
        total_person_thresholds = 0
        person_alert_count = 0
        person_car_count = 1
        person_car_change_count = 0
        total_fleet_thresholds = 0
        fleet_alert_count = 0
        fleet_car_count = 1
        fleet_car_change_count = 0
        if float(file['Delta_H'][i]) >= DeltaH_limit:
            car_threshold_counts += 1
            car_thresholds += 1
            total_person_thresholds += 1
            total_fleet_thresholds += 1
    elif i == 1:
        if float(file['Delta_H'][i]) >= DeltaH_limit:
            car_threshold_counts += 1
            car_thresholds += 1
            total_person_thresholds += 1
            total_fleet_thresholds += 1
    elif i > 1:
        if file['name'][i] == file['name'][i-1]: #is same person?
            if float(file['Delta_H'][i]) >= DeltaH_limit:
                car_threshold_counts += 1
                car_thresholds += 1
                total_person_thresholds += 1
                total_fleet_thresholds += 1
            else:
                car_threshold_counts = 0
            if car_threshold_counts == 3:
                car_threshold_counts += 1
                person_alert_count += 1
                fleet_alert_count += 1
            #Car Change?? Compare cycles and hours to look for reset
            if i+1 < len(file):
                if file['name'][i] == file['name'][i+1] == file['name'][i-1]:
                    if int(file['Car_Cycles'][i]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i]) < int(file['Car_Hours'][i-1]):
                        if int(file['Car_Cycles'][i+1]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i]) < int(file['Car_Hours'][i-1]):
                            car_thresholds = 0
                            car_change_true = 1
                            car_threshold_counts = 0
                            car_threshold_counts = 0
                            old_pump_first_flight = car_change_index_loc
                            car_change_index_loc = i
                            old_pump_last_flight = i-1
                            person_car_count += 1
                            person_car_change_count += 1
                            fleet_car_count += 1
                            fleet_car_change_count += 1
                            print(i, ' working hard!')
                        else:
                            car_change_true = 0
                    else:
                        car_change_true = 0
                else:
                    car_change_true = 0
            else:
                car_change_true = 0
        else: #new car
            car_thresholds = 0
            car_threshold_counts = 0
            car_threshold_counts = 0
            car_change_index_loc = i
            car_change_true = 0
            total_person_thresholds = 0
            person_alert_count = 0
            person_car_count = 1
            person_car_change_count = 0
            if float(file['Delta_H'][i]) >= DeltaH_limit:
                car_threshold_counts += 1
                car_thresholds += 1
                total_person_thresholds += 1
                total_fleet_thresholds += 1
    file.loc[i, 'car_thresholds'] = car_thresholds
    file.loc[i, 'car_threshold_counts'] = car_threshold_counts
    file.loc[i, 'car_threshold_counts'] = car_threshold_counts
    file.loc[i, 'car_change_true'] = car_change_true
    file.loc[i, 'car_change_index_loc'] = car_change_index_loc
    file.loc[i, 'total_person_thresholds'] = total_person_thresholds
    file.loc[i, 'person_alert_count'] = person_alert_count
    file.loc[i, 'person_car_count'] = person_car_count
    file.loc[i, 'person_car_change_count'] = person_car_change_count
    file.loc[i, 'Total_Fleet_Thresholds'] = total_fleet_thresholds
    file.loc[i, 'Fleet_Alert_Count'] = fleet_alert_count
    file.loc[i, 'fleet_car_count'] = fleet_car_count
    file.loc[i, 'fleet_car_change_count'] = fleet_car_change_count
IIUC, if all we need to do is reproduce Car_Group, we can take advantage of a few tricks:
def twolow(s):
    return (s < s.shift()) & (s.shift(-1) < s.shift())

new_hour = twolow(df["Car_Hours"])
new_cycle = twolow(df["Car_Cycles"])
new_name = df["name"] != df["name"].shift()
name_group = new_name.cumsum()
new_cargroup = new_name | (new_hour & new_cycle)
cargroup_without_reset = new_cargroup.cumsum()
cargroup = (cargroup_without_reset -
            cargroup_without_reset.groupby(name_group).transform(min) + 1)
Trick #1: if you want to find out where a transition occurs, compare something to a shifted version of itself.
Trick #2: if you have a True where every new group begins, when you take the cumulative sum of that, you get a series where every group has an integer associated with it.
The above gives me
>>> cargroup.head(10)
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
dtype: int32
>>> (cargroup == df.Car_Group).all()
True
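To make tricks #1 and #2 concrete, here is a small standalone illustration (toy data, not the car log) of the shift-compare and cumulative-sum pattern:
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'b', 'b', 'a'])
new_group = s != s.shift()      # True wherever the value changes (trick #1)
group_id = new_group.cumsum()   # running count of "new group" flags (trick #2)
print(group_id.tolist())        # [1, 1, 2, 2, 2, 3]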
