Grouping, but the lists can contain duplicate elements - Python

I have an ordered list
L = [330.56, 330.6, 330.65, 330.7, ...]
and I want to group it within a certain tolerance (±). Groups can overlap, so the same element may end up in more than one group ("duplicates"). I therefore need to iterate over every element, collect the other elements within [element - tol, element + tol], and then delete any group that is a complete sublist of another group.
The code I'm using looks like this, but it produces no duplicates and only checks the following elements:
def mw_grouper(iterable):
    group = []
    for item in iterable:
        if not group or item - group[0] <= 0.05:
            group.append(item)
        else:
            yield group
            group = [item]
    if group:
        yield group
What I get is the following; I need to also check the previous elements:
R = [[330.56, 330.6], [330.65], [330.7]]
The output I want:
R = [[330.56, 330.6], [330.56, 330.6, 330.65], [330.6, 330.65, 330.7], [330.7, 330.65]]
and then, after deleting the sublists:
F = [[330.56, 330.6, 330.65], [330.6, 330.65, 330.7]]
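For illustration, a brute-force sketch of exactly that behaviour (O(n²), with a made-up function name, and treating "sublist" as a contiguous run since the input is sorted) could look like this:
def brute_force_groups(values, tol):
    # one window per element: everything within [x - tol, x + tol]
    windows = [[y for y in values if abs(y - x) <= tol] for x in values]
    # keep only windows that are not a contiguous sub-list of a longer window
    result = []
    for w in windows:
        is_sub = any(
            w != o and len(w) < len(o) and
            any(o[i:i + len(w)] == w for i in range(len(o) - len(w) + 1))
            for o in windows
        )
        if not is_sub and w not in result:
            result.append(w)
    return result

L = [330.56, 330.6, 330.65, 330.7]
print(brute_force_groups(L, 0.051))
# [[330.56, 330.6, 330.65], [330.6, 330.65, 330.7]]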

OK, this should work, though I didn't test it intensively, so there might be unforeseen errors.
Instead of deleting subgroups after the fact, I prevent yielding a group that is a subgroup of the previous or next group.
I decided to use only one "sliding" list (group), but it would probably have been simpler (though more memory-intensive for huge lists) to build a list for each element of the iterable.
I may have overthought this and overlooked a simpler implementation, but that's the best I'm able to produce at this time ;)
def group_by_tolerance(my_iterable, tolerance):
    active_index = 0  # the index (in group) of the next element "whose" group has to be yielded
    group = []
    last_released_group = []
    for number in my_iterable:
        if not group:
            group = [number]
            continue
        elif number - group[active_index] <= tolerance:
            group.append(number)
            continue
        while group and number - group[active_index] > tolerance:
            # check that this group is not a subgroup of the next one
            if active_index >= len(group) - 1 or group[active_index + 1] - group[0] > tolerance:
                # check that this group is not a subgroup of the previous one
                if len(last_released_group) < len(group) or group != last_released_group[-len(group):]:
                    last_released_group = group.copy()
                    yield last_released_group
            active_index += 1
            if active_index >= len(group):
                active_index = 0
                group = []
                continue
            while group[active_index] - group[0] > tolerance:
                group.pop(0)
                active_index -= 1
        group.append(number)
    if group:
        yield group
L = [330.56, 330.6, 330.65, 330.7]
print(list(group_by_tolerance(L,0.01)))
# [[330.56], [330.6], [330.65], [330.7]]
print(list(group_by_tolerance(L,0.051)))
# [[330.56, 330.6, 330.65], [330.6, 330.65, 330.7]]
print(list(group_by_tolerance(L,0.1)))
# [[330.56, 330.6, 330.65, 330.7]]
Here's the code tweaked to return (index, value) instead of the value only; you can then process the output to get the values only, or the indices only:
def group_by_tolerance(my_iterable, tol):
    active_index = 0
    group = []
    last_released_group = []
    for i, number in enumerate(my_iterable):
        if not group:
            group = [(i, number)]
            continue
        elif number - group[active_index][1] <= tol:
            group.append((i, number))
            continue
        while group and number - group[active_index][1] > tol:
            # check that this group is not a subgroup of the next one
            if active_index >= len(group) - 1 or group[active_index + 1][1] - group[0][1] > tol:
                # check that this group is not a subgroup of the previous one
                if len(last_released_group) < len(group) or group != last_released_group[-len(group):]:
                    last_released_group = group.copy()
                    yield last_released_group
            active_index += 1
            if active_index >= len(group):
                active_index = 0
                group = []
                continue
            while group[active_index][1] - group[0][1] > tol:
                group.pop(0)
                active_index -= 1
        group.append((i, number))
    if group:
        yield group
print(list(group_by_tolerance(L,0.051)))
# [[(0, 330.56), (1, 330.6), (2, 330.65)], [(1, 330.6), (2, 330.65), (3, 330.7)]]
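To get just the values or just the indices out of those (index, value) pairs, a small usage sketch:
pairs = list(group_by_tolerance(L, 0.051))
values_only = [[value for _, value in grp] for grp in pairs]
indices_only = [[index for index, _ in grp] for grp in pairs]
print(values_only)   # [[330.56, 330.6, 330.65], [330.6, 330.65, 330.7]]
print(indices_only)  # [[0, 1, 2], [1, 2, 3]]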

Related

Deleting matching items from a list

I have a list of values and I would like to remove values which cancel each other out (+ and -). The values are randomly placed in the list, so I first added a new column in Excel with the absolute values and then sorted on the absolute values, so the amounts which need to be cancelled sit below each other.
I was thinking of creating a for loop that sums the first row with the second row and, when this sums to zero, deletes both rows and starts from the top again. Please refer to the picture for an example; I have marked in yellow the items which should be deleted. As I only want to delete matching items, the total sum of the amount column should not change after the operation.
Currently I have something like this:
for i in df["Amount in Entity Currency"]:
    if df["Amount in Entity Currency"][i] + df["Amount in Entity Currency"][i+1] == 0:
        df.drop(df[df["Amount in Entity Currency"][i]])
        df.drop(df[df["Amount in Entity Currency"][i + 1]])
Try something like this after you have sorted the list (as you already said):
for i, elem in enumerate(yourList):
    nextElem = yourList[i + 1]
    if abs(elem + nextElem) < 0.00000001:  # the pair cancels (within a small tolerance)
        yourList.remove(elem)
        yourList.remove(nextElem)
I would base an answer on several building blocks. The first is creating two "iterable" sub-lists: one for the positive numbers and one for the negative numbers.
Then I would iterate over both of them using next(), and as long as one of the two lists still has values I would act on the current values as appropriate.
import random

full_data = [random.randint(0, 10) for _ in range(20)] + [-random.randint(0, 10) for _ in range(20)]

zeros = [i for i in full_data if i == 0]
positives = iter(sorted([i for i in full_data if i > 0]))
negatives = iter(sorted([i for i in full_data if i < 0], reverse=True))

## ------------------------------
## prime the list with zero or one 0s: an odd number of 0s
## results in [0] as the evens cancel each other out.
## ------------------------------
result = [0] * (len(zeros) % 2)
## ------------------------------

pos = next(positives, None)
neg = next(negatives, None)

while pos is not None or neg is not None:
    if pos is None:
        # we ran out of positive numbers so add any remaining negatives
        result.extend([neg] + list(negatives))
        break
    if neg is None:
        # we ran out of negative numbers so add any remaining positives
        result.extend([pos] + list(positives))
        break
    if pos == -neg:
        # these results cancel each other
        pos = next(positives, None)
        neg = next(negatives, None)
    elif pos > -neg:
        # this positive is "larger" than this negative so add the negative
        result.append(neg)
        neg = next(negatives, None)
    else:
        # this positive is "smaller" than this negative so add the positive
        result.append(pos)
        pos = next(positives, None)

print(f"The original list has {len(full_data)} items and sums to: {sum(full_data)}")
print(f"The filtered list has {len(result)} items and sums to: {sum(result)}")

How to split a series by the longest repetition of a number in python?

import pandas as pd

df = pd.DataFrame({
    'label': [f"subj_{i}" for i in range(28)],
    'data': [i for i in range(1, 14)] + [1, 0, 0, 0, 2] + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
})
I have a dataset something like that. It looks like:
I want to cut it where the longest repetition of 0s occurs, so I want to cut at index 18, but I want to leave indices 14-16 intact. So far I've tried stuff like:
Counters
cad_recorder = 0
new_index = []
for i, row in tqdm(temp_df.iterrows()):
    if row['cadence'] == 0:
        cad_recorder += 1
        new_index.append(i)
But obviously that won't work since the indices will be rewritten at each occurrence of zero.
I also tried a dictionary, but I'm not sure how to compare previous and next values using iterrows.
I also took the rolling mean for X rows at a time and, if it was zero, recorded an index. But then I got stuck at actually inferring the range of indices, or at finding the longest sequence of zeroes.
Edit: A friend of mine suggested the following logic, which gave the same results as #shubham-sharma. The poster's solution is much more pythonic and elegant.
def find_longest_zeroes(df):
    '''
    Finds the index at which the longest repetition of <1 values begins
    '''
    current_length = 0
    max_length = 0
    start_idx = 0
    max_idx = 0
    for i in range(len(df['data'])):
        if df.iloc[i, 9] <= 1:
            if current_length == 0:
                start_idx = i
            current_length += 1
            if current_length > max_length:
                max_length = current_length
                max_idx = start_idx
        else:
            current_length = 0
    return max_idx
The code I went with following #shubham-sharma's solution:
cut_us_sof = {}
og_df_sof = pd.DataFrame()
cut_df_sof = pd.DataFrame()
for lab in df['label'].unique():
    temp_df = df[df['label'] == lab].reset_index(drop=True)
    mask = temp_df['data'] <= 1  # some values in actual dataset were 0.0000001
    counts = temp_df[mask].groupby((~mask).cumsum()).transform('count')['data']
    idx = counts.idxmax()
    # my dataset's trailing zeroes are usually after 200th index. But I also didn't want to remove trailing zeroes < 500 in length
    if (idx > 2000) & (counts.loc[idx] > 500):
        cut_us_sof[lab] = idx
        og_df_sof = og_df_sof.append(temp_df)
        cut_df_sof = cut_df_sof.append(temp_df.iloc[:idx, :])
We can use boolean masking and cumsum to identify the blocks of zeros, then groupby and transform these blocks using count, followed by idxmax, to get the starting index of the block with the maximum number of consecutive zeros:
m = df['data'].eq(0)
idx = m[m].groupby((~m).cumsum()).transform('count').idxmax()
print(idx)
# 18
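If the goal is then to actually cut the frame at that position, a small usage sketch building on the idx computed above:
head = df.iloc[:idx]   # rows before the longest run of zeros (indices 0-17 here)
tail = df.iloc[idx:]   # the longest run of zeros and anything after it
print(len(head), len(tail))  # 18 10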

python: filter based on IF condition

I am working with a simple Python condition aimed at filtering the values greater than or equal to zero and storing the filtered values in a list:
# make an object containing all clusters
clustering = d.clusterer.clustering_dict[cut_off]
# list of ignored objects
banned_conf = []
for clust in clustering:
    clustStr = str(clustering.index(clust))
    clustStr = int(clustStr) + 1
    # get the value of energy for the clust
    ener = clust[0].energy
    # set up filter to ignore conformations with positive energies
    if ener > 0:
        print('Conformation in ' + str(clustStr) + ' cluster poses positive energy')
        banned_conf.append(ener)
        print('Nonsence: It is ignored!')
        continue
    elif ener == 0:
        print('Conformation in ' + str(clustStr) + ' cluster poses ZERO energy')
        banned_conf.append(ener)
        print('Very rare case: it is ignored!')
        continue
    #else:
    #    print("Ain't no wrong conformations in " + str(clustStr) + " cluster")
How would it be possible to ignore all values greater than or equal to 0 within the same if statement (without elif)? Which filtering would be better: with elif, or in a single if statement?
I would use the filter function:
lst = [0, 1, -1, 2, -2, 3, -3, 4, -4]
filtered = list(filter(lambda x: x >= 0, lst))
for ele in filtered:
    print(f'{ele} is >= 0')
Or, if you don't want to use a lambda function and filter, I would do:
lst = [0, 1, -1, 2, -2, 3, -3, 4, -4]
filtered = []
for ele in lst:
    if ele >= 0:
        filtered.append(ele)
for ele in filtered:
    print(f'{ele} is >= 0')
Or you can use list comprehension:
lst = [0, 1, -1, 2, -2, 3, -3, 4, -4]
filtered = [ele for ele in lst if ele >= 0]
for ele in filtered:
    print(f'{ele} is >= 0')
You can use >= to test both conditions at once.
for index, clust in enumerate(clustering, 1):
    ener = clust[0].energy
    if ener >= 0:
        print(f'Conformation in {index} cluster poses zero or positive energy, it is ignored')
        banned_conf.append(clust)
Your original method is better if you want to show a different message for zero and positive energy.

How to optimize an O(N*M) to be O(n**2)?

I am trying to solve USACO's Milking Cows problem. The problem statement is here: https://train.usaco.org/usacoprob2?S=milk2&a=n3lMlotUxJ1
Given a series of intervals in the form of a 2D array, I have to find the longest continuous interval of milking and the longest interval in which no milking is occurring.
Ex. Given the array [[500,1200],[200,900],[100,1200]], the longest interval would be 1100, as there is continuous milking, and the longest interval without milking would be 0, as there are no rest periods.
I have tried looking at whether utilizing a dictionary would decrease run times but I haven't had much success.
f = open('milk2.in', 'r')
w = open('milk2.out', 'w')

# getting the input
farmers = int(f.readline().strip())
schedule = []
for i in range(farmers):
    schedule.append(f.readline().strip().split())

# schedule = data
minvalue = 0
maxvalue = 0

# getting the minimums and maximums of the data
for time in range(farmers):
    schedule[time][0] = int(schedule[time][0])
    schedule[time][1] = int(schedule[time][1])
    if (minvalue == 0):
        minvalue = schedule[time][0]
    if (maxvalue == 0):
        maxvalue = schedule[time][1]
    minvalue = min(schedule[time][0], minvalue)
    maxvalue = max(schedule[time][1], maxvalue)

filled_thistime = 0
filled_max = 0
empty_max = 0
empty_thistime = 0

# goes through all the possible items in between the minimum and the maximum
for point in range(minvalue, maxvalue):
    isfilled = False
    # goes through all the data for each point value in order to find the best values
    for check in range(farmers):
        if point >= schedule[check][0] and point < schedule[check][1]:
            filled_thistime += 1
            empty_thistime = 0
            isfilled = True
            break
    if isfilled == False:
        filled_thistime = 0
        empty_thistime += 1
    if (filled_max < filled_thistime):
        filled_max = filled_thistime
    if (empty_max < empty_thistime):
        empty_max = empty_thistime

print(filled_max)
print(empty_max)

if (filled_max < filled_thistime):
    filled_max = filled_thistime

w.write(str(filled_max) + " " + str(empty_max) + "\n")
f.close()
w.close()
The program works fine, but I need to decrease the time it takes to run.
A less pretty but more efficient approach would be to solve this like a free list, though it is a bit more tricky since the ranges can overlap. This method only requires looping through the input list a single time.
def insert(start, end):
    for existing in times:
        existing_start, existing_end = existing
        # New time is a subset of existing time
        if start >= existing_start and end <= existing_end:
            return
        # New time ends during existing time
        elif end >= existing_start and end <= existing_end:
            times.remove(existing)
            return insert(start, existing_end)
        # New time starts during existing time
        elif start >= existing_start and start <= existing_end:
            # existing[1] = max(existing_end, end)
            times.remove(existing)
            return insert(existing_start, end)
        # New time is a superset of existing time
        elif start <= existing_start and end >= existing_end:
            times.remove(existing)
            return insert(start, end)
    times.append([start, end])

data = [
    [500, 1200],
    [200, 900],
    [100, 1200]
]

times = [data[0]]
for start, end in data[1:]:
    insert(start, end)

longest_milk = 0
longest_gap = 0
for i, time in enumerate(times):
    duration = time[1] - time[0]
    if duration > longest_milk:
        longest_milk = duration
    if i != len(times) - 1 and times[i+1][0] - times[i][1] > longest_gap:
        longest_gap = times[i+1][0] - times[i][1]

print(longest_milk, longest_gap)
As stated in the comments, if the input is sorted the complexity could be O(n); if that's not the case we need to sort it first and the complexity is O(n log n):
lst = [ [300, 1000],
        [700, 1200],
        [1500, 2100] ]

from itertools import groupby

longest_milking = 0
longest_idle = 0

l = sorted(lst, key=lambda k: k[0])
for v, g in groupby(zip(l[::1], l[1::1]), lambda k: k[1][0] <= k[0][1]):
    l = [*g][0]
    if v:
        mn, mx = min(i[0] for i in l), max(i[1] for i in l)
        if mx - mn > longest_milking:
            longest_milking = mx - mn
    else:
        mx = max((i2[0] - i1[1] for i1, i2 in zip(l[::1], l[1::1])))
        if mx > longest_idle:
            longest_idle = mx

# corner case, N=1 (only one interval)
if len(lst) == 1:
    longest_milking = lst[0][1] - lst[0][0]

print(longest_milking)
print(longest_idle)
Prints:
900
300
For input:
lst = [ [500,1200],
[200,900],
[100,1200] ]
Prints:
1100
0
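The same sort-then-scan idea can also be written by merging overlapping intervals in a single pass over the sorted list; a minimal sketch (the function name is just illustrative), which reproduces the 1100 / 0 result for the original example:
def milking_intervals(intervals):
    # Sort by start time, then merge overlapping intervals while tracking the
    # longest merged block and the longest gap between blocks. O(n log n) overall.
    ivs = sorted(intervals)
    longest_milk = 0
    longest_gap = 0
    cur_start, cur_end = ivs[0]
    for start, end in ivs[1:]:
        if start <= cur_end:                    # overlaps or touches the current block
            cur_end = max(cur_end, end)
        else:                                   # a gap between two blocks
            longest_gap = max(longest_gap, start - cur_end)
            longest_milk = max(longest_milk, cur_end - cur_start)
            cur_start, cur_end = start, end
    longest_milk = max(longest_milk, cur_end - cur_start)
    return longest_milk, longest_gap

print(milking_intervals([[500, 1200], [200, 900], [100, 1200]]))  # (1100, 0)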

How do I count these comparisons in selection sort?

Here is my code:
count = 0

def selectionSort(data):
    for index in range(len(data)):
        min = index
        count += 1
        # Find the index'th smallest element
        for scan in range(index + 1, len(data)):
            if (data[scan] < data[min]):
                min = scan
        if min != index:  # swap the elements
            data[index], data[min] = data[min], data[index]
    return data

data = selectionSort([3, 4, 5, 2, 6])
print(count, data)
Your code as-is should not run; you should get an UnboundLocalError: local variable 'count' referenced before assignment.
To fix this, add the following to the top of selectionSort(data):
global count
A better way is to scrap the global variable and return count alongside the sorted data:
def selectionSort(data):
    count = 0
    for index in range(len(data)):
        min = index
        count += 1
        # Find the index'th smallest element
        for scan in range(index + 1, len(data)):
            if (data[scan] < data[min]):
                min = scan
        if min != index:  # swap the elements
            data[index], data[min] = data[min], data[index]
    return count, data

count, data = selectionSort([3, 4, 5, 2, 6])
print(count, data)
Last but not least, you are counting something other than comparisons. I leave fixing that as an exercise for the reader.
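For reference, one way to count the comparisons themselves rather than the outer-loop iterations is to increment the counter exactly where data[scan] < data[min] is evaluated; a small sketch (the variable smallest replaces min to avoid shadowing the builtin):
def selection_sort_counted(data):
    # Counts element-to-element comparisons made by selection sort.
    count = 0
    for index in range(len(data)):
        smallest = index
        for scan in range(index + 1, len(data)):
            count += 1                          # one comparison per inner-loop step
            if data[scan] < data[smallest]:
                smallest = scan
        if smallest != index:                   # swap the elements
            data[index], data[smallest] = data[smallest], data[index]
    return count, data

print(selection_sort_counted([3, 4, 5, 2, 6]))  # (10, [2, 3, 4, 5, 6])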
