Removing overlapping tuple values using Python


I have a list of tuples (let's call it yz_list) containing N tuples, each holding a start and an end value in the form (start, end), as in the example below:
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
I would like to remove every tuple that overlaps the interval of a previously saved tuple. For the list above, the expected output is:
result = [(0, 6), (6, 12), (18, 24)]
How can I achieve this result using Python?
Edit #1
This is the code I use to generate these tuples:
for i, a in enumerate(seq):
    if seq[i:i+multiplier] == "x"*multiplier:
        to_replace.append((i, i+multiplier))
for i, j in enumerate(to_replace):
    print(i,j)
    if i == 0:
        def_to_replace.append(j)
    else:
        ind = def_to_replace[i-1]
        print(j[0]+1, "\n", ind)
        if j[0]+1 not in range(ind[0], ind[1]):
            def_to_replace.append(j)
            # print(i, j)
print(def_to_replace)
for item in def_to_replace:
    frag = replacer(frame_calc(seq[:item[0]]), rep0, rep1, rep2)
    for k, v in enumerate(seq_dup[item[0]:item[1]]):
        seq_dup[int(item[0]) + int(k)] = list(frag)[k]
return "".join(seq_dup)
As I'm developing with TDD, I'm progressing step by step, and now I'm thinking about how to implement the removal of overlapping tuples. I'm not sure whether it's a good idea to treat them as sets and look for the overlapping items.
The pseudocode for generating the result list is:
for item in yz_list:
    if item is not the first item of yz_list:
        get the item's first value
        check whether that value lies between the values of any tuple already added to the result list

This may work. Nothing fancy: process each tuple in turn and check whether either of its values falls within the bounds of the most recently saved tuple:
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
result = [yz_list[0]]
bounds = yz_list[0][0], yz_list[0][1]
for tup in yz_list[1:]:
    if tup[0] in range(bounds[0], bounds[1]) or tup[1] in range(bounds[0], bounds[1]):
        pass
    else:
        result.append(tup)
        # compare later tuples against the tuple that was just saved
        bounds = tup[0], tup[1]
print(result) # [(0, 6), (6, 12), (18, 24)]

Here is a class that calculates the overlaps using efficient binary search, and code showing its use to solve your problem. Run with python3.
import bisect
import sys
class Overlap():
    def __init__(self):
        self._intervals = []
    def intervals(self):
        return self._intervals
    def put(self, interval):
        istart, iend = interval
        # Ignoring intervals that start after the window.
        i = bisect.bisect_right(self._intervals, (iend, sys.maxsize))
        # Look at remaining intervals to find overlap.
        for start, end in self._intervals[:i]:
            if end > istart:
                return False
        bisect.insort(self._intervals, interval)
        return True
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
ov = Overlap()
for i in yz_list:
    ov.put(i)
print('Original:', yz_list)
print('Result:', ov.intervals())
OUTPUT:
Original: [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
Result: [(0, 6), (6, 12), (18, 24)]

yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
result = []
for start, stop in yz_list:
    for low, high in result:
        if (low < start < high) or (low < stop < high):
            break
    else:
        result.append((start, stop))
This gives the desired output, and it's pretty easy to see how it works. The else clause basically just means "run this if we didn't break out of the loop".
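To make the for/else behaviour easier to follow, here is a minimal sketch that wraps the same idea in a function. The names overlaps and drop_overlapping are made up for illustration, and the overlap test treats intervals as half-open (touching ends such as (0, 6) and (6, 12) do not count as overlapping), which is an assumption based on the expected output in the question.

```python
def overlaps(a, b):
    """Return True if intervals a and b overlap; touching ends do not count."""
    return a[0] < b[1] and b[0] < a[1]


def drop_overlapping(intervals):
    """Keep each interval only if it overlaps none of the intervals kept so far."""
    kept = []
    for candidate in intervals:
        for saved in kept:
            if overlaps(candidate, saved):
                break  # leaving the loop via break skips the else branch
        else:
            # Runs only when the inner loop finished without hitting break.
            kept.append(candidate)
    return kept


yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
print(drop_overlapping(yz_list))  # [(0, 6), (6, 12), (18, 24)]
```

Unlike the strict comparison on the endpoints, this predicate also rejects an exact duplicate of an interval that was already kept.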

Related

How to find the maximum per group in an rdd?

I'm using PySpark and I have an RDD that looks like this:
[
    ("MovieX", [(1, 100), (2, 20), (3, 50)]),
    ("MovieY", [(1, 100), (2, 250), (3, 100), (4, 120)]),
    ("MovieZ", [(1, 1000), (2, 250)]),
    ("MovieX", [(4, 50), (5, 10), (6, 0)]),
    ("MovieY", [(3, 0), (4, 260)]),
    ("MovieZ", [(5, 180)]),
]
The first element in the tuple represents the week number and the second element represents the number of viewers. I want to find the week with the most views for each movie, but ignoring the first week.
I've tried some things but nothing worked, for example:
stats.reduceByKey(max).collect()
returns:
[('MovieX', [(4, 50), (5, 10), (6, 0)]),
('MovieY', [(5, 180)]),
('MovieC', [(3, 0), (4, 260)])]
so the entire second set.
Also this:
stats.groupByKey().reduce(max)
which returns just this:
('MovieZ', <pyspark.resultiterable.ResultIterable at 0x558f75eeb0>)
How can I solve this?

If you want the most views per movie, ignoring the first week, i.e. [('MovieA', 50), ('MovieC', 250), ('MovieB', 260)],
then you'll want your own map function rather than a reduce.
movie_stats = spark.sparkContext.parallelize([
    ("MovieA", [(1, 100), (2, 20), (3, "50")]),
    ("MovieC", [(1, 100), (2, "250"), (3, 100), (4, "120")]),
    ("MovieB", [(1, 1000), (2, 250)]),
    ("MovieA", [(4, 50), (5, "10"), (6, 0)]),
    ("MovieB", [(3, 0), (4, "260")]),
    ("MovieC", [(5, "180")]),
])
def get_views_after_first_week(v):
    values = iter(v) # iterator over the lists of tuples, grouped by key
    result = list()
    for x in values:
        result.extend([int(y[1]) for y in x if y[0] > 1])
    return result
mapped = movie_stats.groupByKey().mapValues(get_views_after_first_week).mapValues(max)
mapped.collect()
To include the week number, i.e. [('MovieA', (3, 50)), ('MovieC', (2, 250)), ('MovieB', (4, 260))]:
def get_max_weekly_views_after_first_week(v):
    values = iter(v) # iterator over the lists of tuples, grouped by key
    max_views = float('-inf')
    max_week = None
    for x in values:
        for t in x:
            week, views = t
            views = int(views)
            if week > 1 and views > max_views:
                max_week = week
                max_views = views
    return (max_week, max_views, )
mapped = movie_stats.groupByKey().mapValues(get_max_weekly_views_after_first_week)

Some code is needed to convert the strings to int and to apply a map function that 1) filters out the week 1 data and 2) gets the week with the most views.
def helper(arr: list):
    max_week = None
    for sub_arr in arr:
        for item in sub_arr:
            if item[0] == 1:
                continue
            count = int(item[1])
            if max_week is None or max_week[1] < count:
                max_week = [item[0], count]
    return max_week
movie_stats.groupByKey().map(lambda x: (x[0], helper(x[1]))).collect()
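For checking the expected result without a Spark session, the same per-key logic can be sketched in plain Python. This is only a local stand-in, not PySpark code: rows is a hypothetical list in the same shape as the RDD, and best_week_after_first is a name made up for this example.

```python
from collections import defaultdict

# Hypothetical local copy of the data, in the same shape as the RDD rows.
rows = [
    ("MovieA", [(1, 100), (2, 20), (3, "50")]),
    ("MovieC", [(1, 100), (2, "250"), (3, 100), (4, "120")]),
    ("MovieB", [(1, 1000), (2, 250)]),
    ("MovieA", [(4, 50), (5, "10"), (6, 0)]),
    ("MovieB", [(3, 0), (4, "260")]),
    ("MovieC", [(5, "180")]),
]


def best_week_after_first(rows):
    """Group rows by movie, drop week 1, and return the (week, views) pair with the most views."""
    grouped = defaultdict(list)
    for movie, weeks in rows:
        # Views may arrive as strings, so convert them while filtering out week 1.
        grouped[movie].extend((week, int(views)) for week, views in weeks if week > 1)
    return {
        movie: max(pairs, key=lambda pair: pair[1])
        for movie, pairs in grouped.items()
        if pairs
    }


best = best_week_after_first(rows)
print(best)
```

On ties, max returns the first pair it encounters, which matches the strict comparison used in the Spark version above it.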

Number of passengers. Error: list indices must be integers or slices, not list

So, I'm trying to sum the number of passengers at each stop.
The "stops" variable holds the stops. It is a list of tuples, each containing the number of passengers getting in and out at that stop, for example:
stops = [(in1, out1), (in2, out2), (in3, out3), (in4, out4)]
stops = [(10, 0), (4, 1), (3, 5), (3, 4), (5, 1), (1, 5), (5, 8), (4, 6), (2, 3)]
number_passenger_per_stop = []
for i in stops:
    resta = stops[i][0] - stops[i][1]
    number_passenger_per_stop.append(resta)
print(number_passenger_per_stop)
I can do the math like this outside the loop, but I don't understand why it crashes inside the loop:
stops[i][0] - stops[i][1]

i is not the list index, it's the list element itself. You don't need to write stops[i].
resta = i[0] - i[1]
Your code would be correct if you had written
for i in range(len(stops)):
You could also replace the entire thing with a list comprehension:
number_passenger_per_stop = [on - off for on, off in stops]
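If the goal is the number of passengers on board after each stop rather than the net change per stop (an assumption about the intent, since the question speaks of summing), a running total can be sketched with itertools.accumulate. The names net_change and on_board are illustrative.

```python
from itertools import accumulate

stops = [(10, 0), (4, 1), (3, 5), (3, 4), (5, 1), (1, 5), (5, 8), (4, 6), (2, 3)]

# Net change at each stop: passengers getting on minus passengers getting off.
net_change = [on - off for on, off in stops]

# Running total of the net changes: passengers on board after each stop.
on_board = list(accumulate(net_change))

print(net_change)  # [10, 3, -2, -1, 4, -4, -3, -2, -1]
print(on_board)    # [10, 13, 11, 10, 14, 10, 7, 5, 4]
```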

I just edited the for loop so that it addresses each element by its index in the list. You need to access each element by its position, not by its value:
stops = [(10, 0), (4, 1), (3, 5), (3, 4), (5, 1), (1, 5), (5, 8), (4, 6), (2, 3)]
number_passenger_per_stop = []
for i in range(len(stops)):
    resta = stops[i][0] - stops[i][1]
    number_passenger_per_stop.append(resta)
print(number_passenger_per_stop)
Output:
[10, 3, -2, -1, 4, -4, -3, -2, -1]

code doesn't seem to run after if statement inside function

I'm following some online courses and I have this function sort, but nothing seems to run after the print "here" part:
import unittest
def sort(meetings, indx):
    print("call function")
    print meetings
    firstfirst = meetings[indx][0]
    firstsecond = meetings[indx][1]
    secondfirst = meetings[indx+1][0]
    secondsecond = meetings[indx+1][1]
    first = meetings[indx]
    second = meetings[indx+1]
    print firstfirst
    print secondfirst
    if firstfirst > secondfirst:
        meetings[indx] = second
        meetings[indx+1] = first
    print "here"
    indx = index + 1
    print "meetings: "
    sort(meetings[indx:len(meetings)-1], indx)
def merge_ranges(meetings):
    # Merge meeting range
    sort(meetings, 0)
    return []
# Tests
class Test(unittest.TestCase):
    def test_meetings_overlap(self):
        actual = merge_ranges([(1, 3), (2, 4)])
        expected = [(1, 4)]
        self.assertEqual(actual, expected)
    def test_meetings_touch(self):
        actual = merge_ranges([(5, 6), (6, 8)])
        expected = [(5, 8)]
        self.assertEqual(actual, expected)
    def test_meeting_contains_other_meeting(self):
        actual = merge_ranges([(1, 8), (2, 5)])
        expected = [(1, 8)]
        self.assertEqual(actual, expected)
    def test_meetings_stay_separate(self):
        actual = merge_ranges([(1, 3), (4, 8)])
        expected = [(1, 3), (4, 8)]
        self.assertEqual(actual, expected)
    def test_multiple_merged_meetings(self):
        actual = merge_ranges([(1, 4), (2, 5), (5, 8)])
        expected = [(1, 8)]
        self.assertEqual(actual, expected)
    def test_meetings_not_sorted(self):
        actual = merge_ranges([(5, 8), (1, 4), (6, 8)])
        expected = [(1, 4), (5, 8)]
        self.assertEqual(actual, expected)
    def test_sample_input(self):
        actual = merge_ranges([(0, 1), (3, 5), (4, 8), (10, 12), (9, 10)])
        expected = [(0, 1), (3, 8), (9, 12)]
        self.assertEqual(actual, expected)
unittest.main(verbosity=2)
The output shows this and only throws errors for the test cases (which I didn't include), since those are to be expected:
call function
[(1, 8), (2, 5)]
1
2
here
call function
[(5, 8), (1, 4), (6, 8)]
5
1
here
call function
[(1, 3), (2, 4)]
1
2
here
call function
[(1, 3), (4, 8)]
1
4
here
call function
[(5, 6), (6, 8)]
5
6
here
call function
[(1, 4), (2, 5), (5, 8)]
1
2
here
call function
[(0, 1), (3, 5), (4, 8), (10, 12), (9, 10)]
0
3
here

"but nothing nothing seems to run after the print "here" part"
Are you basing this on the fact that nothing else prints? If so, that's because you have to print the variables you change. Also, none of your functions return anything you have worked on inside them. While sort mutates the meetings variable, it has no way of knowing when to stop calling itself, so it will eventually throw an error when it tries to index into an empty list held in the meetings variable. Even your use of print is confusing: you use print("call function") at the top and then print meetings afterwards, mixing Python 2 and Python 3 print syntax.
But let's get to the heart of your problem here.
def sort(meetings, indx):
    print("call function")
    print meetings
    # eventually meetings will be an empty list and meetings[indx]
    # will throw an IndexError
    firstfirst = meetings[indx][0]
    firstsecond = meetings[indx][1]
    secondfirst = meetings[indx+1][0]
    secondsecond = meetings[indx+1][1]
    first = meetings[indx]
    second = meetings[indx+1]
    print firstfirst
    print secondfirst
    if firstfirst > secondfirst:
        meetings[indx] = second
        meetings[indx+1] = first
    # "here" is printed
    print "here"
    # you alter the indx variable but do not print it
    # (as posted, index is never defined, so this line raises a NameError
    # and nothing after it runs; it presumably should read indx + 1)
    indx = index + 1
    # "meetings:" is printed but nothing else is printed below it
    print "meetings: "
    # sort calls itself without any condition to stop calling itself
    # and which will eventually have the indx variable exceed the
    # meetings length in the call:
    # meetings[indx:len(meetings)-1]
    sort(meetings[indx:len(meetings)-1], indx)
    # nothing is returned here and sort does not mutate the object in
    # any way that I could see that would cause sort to stop
    # calling itself
def merge_ranges(meetings):
    # Merge meeting range
    sort(meetings, 0)
    return [] # <- this empty list is always returned no matter what
sort doesn't return anything, which isn't a huge issue if you are just mutating something.
sort calls itself recursively with nothing to tell it to stop, so it keeps going until an error is raised.
Let's assume meetings is this list:
meetings = [(0, 1), (3, 5)]
meetings[5:] # ==> [] will always return an empty list when indx exceeds the length of meetings
This means sort keeps calling itself with an empty list and a higher index number.
merge_ranges always returns an empty list.
You need to test for the index being larger than len(meetings).
Suggestion (assuming Python 3):
def sort(meetings, indx):
    print("call function")
    print(meetings)
    first = meetings[indx]
    second = meetings[indx+1]
    firstfirst = first[0]
    firstsecond = first[1]
    secondfirst = second[0]
    secondsecond = second[1]
    print(firstfirst)
    print(secondfirst)
    if firstfirst > secondfirst:
        meetings[indx] = second
        meetings[indx+1] = first
    # use indx here; index is not defined anywhere
    indx = indx + 1
    print("meetings: ", meetings)
    if len(meetings) - 1 > indx:
        sort(meetings[indx:], indx)
While this takes care of stopping the recursive calls, it still doesn't sort completely. It sorts two elements relative to each other, but it will need several passes to achieve a proper sort.
For example:
In [1]: a = [(5,3), (0,2), (4,1), (1,1)]
In [2]: sort(a, 0)
call function
[(0, 2), (5, 3), (4, 1), (1, 1)]
0
5
meetings: [(0, 2), (5, 3), (4, 1), (1, 1)]
call function
[(5, 3), (4, 1), (1, 1)]
4
1
meetings: [(5, 3), (1, 1), (4, 1)]
In [3]: a
Out[3]: [(0, 2), (5, 3), (4, 1), (1, 1)]
I'll leave that up to you to figure out seeing as this was an assignment.
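As a rough illustration of the "several passes" idea only, here is a sketch of a repeated adjacent-swap sort (a plain bubble sort, which is a technique swapped in here rather than the course's intended solution). The name sort_meetings is made up. It works on the list through indices instead of slices, because a slice such as meetings[indx:] is a copy and swaps made on it never reach the original list.

```python
def sort_meetings(meetings):
    """Sort (start, end) tuples in place by start, swapping adjacent pairs over several passes."""
    n = len(meetings)
    for done in range(n - 1):
        swapped = False
        # After each pass the largest remaining start has moved to the end,
        # so the last `done` positions are already in place.
        for i in range(n - 1 - done):
            if meetings[i][0] > meetings[i + 1][0]:
                meetings[i], meetings[i + 1] = meetings[i + 1], meetings[i]
                swapped = True
        if not swapped:
            break  # a pass without swaps means the list is sorted
    return meetings


a = [(5, 3), (0, 2), (4, 1), (1, 1)]
sort_meetings(a)
print(a)  # [(0, 2), (1, 1), (4, 1), (5, 3)]
```

Outside of an exercise, the built-in meetings.sort() or sorted(meetings) would normally be used instead.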

Compare two tuple with variable length in Python

I have two tuples of tuples and I want to compare the values on the basis of their first element
list1 = ((1, 2450.0), (2, 2095.0), (4, 1290.0), (5, 1190.0), (6, 1150.0), (7, 1150.0), (8, 1090.0), (9, 1090.0))
list2 = ((1, 2673.0), (4, 1488.0), (5, 1139.0), (6, 1057.0), (7, 1482.0), (8, 1037.0), (9, 1169.0), (10, 937.0))
The expected result is:
list1 = ((1, 2450.0), (2, 2095.0), (3, nan), (4, 1290.0), (5, 1190.0), (6, 1150.0), (7, 1150.0), (8, 1090.0), (9, 1090.0), (10, nan))
list2 = ((1, 2673.0), (3, nan), (4, 1488.0), (5, 1139.0), (6, 1057.0), (7, 1482.0), (8, 1037.0), (9, 1169.0), (10, 937.0))
What would be an efficient way of doing this?

If I understood your question correctly, you want to check whether each tuple contains certain numbers, which are stored in the first element of each sub-tuple, and if a number is missing, create a sub-tuple for it with the second element set to None (if nan means None).
I would follow this process, which may not be the most efficient.
# First create a tuple which contains the numbers to be checked
checkTuple = ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
# Create a function to check if each number is in one of the sub-tuples
def chooseName( checkTuple, randomList ):
    newList = []
    for checkItem in checkTuple:
        foundItem = None
        for item in randomList:
            # compare against the first element of the sub-tuple only
            if checkItem == item[0]:
                foundItem = item
                break
        if foundItem is not None:
            newList.append( foundItem )
        else:
            newList.append( (checkItem, None) )
    return tuple( newList )
# Call the function and take back the tuple
newList1 = chooseName( checkTuple, list1 )
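A dictionary lookup avoids the nested loop. The sketch below is one possible variant, not the only way to do it: fill_missing is a made-up name, the key range 1 to 10 mirrors the checkTuple above and is an assumption, and float("nan") is used as the fill value because the question writes nan.

```python
import math

list1 = ((1, 2450.0), (2, 2095.0), (4, 1290.0), (5, 1190.0), (6, 1150.0),
         (7, 1150.0), (8, 1090.0), (9, 1090.0))
list2 = ((1, 2673.0), (4, 1488.0), (5, 1139.0), (6, 1057.0), (7, 1482.0),
         (8, 1037.0), (9, 1169.0), (10, 937.0))


def fill_missing(pairs, keys, fill=float("nan")):
    """Return one (key, value) pair per key, using fill where pairs has no entry."""
    lookup = dict(pairs)  # maps the first element to the second for O(1) lookups
    return tuple((key, lookup.get(key, fill)) for key in keys)


keys = range(1, 11)  # assumed key range, same numbers as checkTuple
filled1 = fill_missing(list1, keys)
filled2 = fill_missing(list2, keys)
print(filled1)
print(filled2)
```

Because nan never compares equal to itself, use math.isnan to test for the filled entries rather than ==.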

More memory efficient way of making a dictionary?

VERY sorry for the vagueness, but I don't actually know what part of what I'm doing is inefficient.
I've made a program that takes a list of positive integers (example*):
[1, 1, 3, 5, 16, 2, 4, 6, 6, 8, 9, 24, 200,]
*the real lists can be up to 2000 in length and the elements between 0 and 100,000 exclusive
And creates a dictionary where each number tupled with its index (like so: (number, index)) is a key and the value for each key is a list of every number (and that number's index) in the input that it goes evenly into.
So the entry for the 3 would be: (3, 2): [(16, 4), (6, 7), (6, 8), (9, 10), (24, 11)]
My code is this:
num_dict = {}
sorted_list = sorted(beginning_list)
for a2, a in enumerate(sorted_list):
    num_dict[(a, a2)] = []
for x2, x in enumerate(sorted_list):
    for y2, y in enumerate(sorted_list[x2 + 1:]):
        if y % x == 0:
            pair = (y, y2 + x2 + 1)
            num_dict[(x, x2)].append(pair)
But, when I run this script, I hit a MemoryError.
I understand that this means that I am running out of memory but in the situation I'm in, adding more ram or updating to a 64-bit version of python is not an option.
I am certain that the problem is not coming from the list sorting or the first for loop. It has to be the second for loop. I just included the other lines for context.
The full output for the list above would be (sorry for the unsortedness, that's just how dictionaries do):
(200, 12): []
(6, 7): [(24, 11)]
(16, 10): []
(6, 6): [(6, 7), (24, 11)]
(5, 5): [(200, 12)]
(4, 4): [(8, 8), (16, 10), (24, 11), (200, 12)]
(9, 9): []
(8, 8): [(16, 10), (24, 11), (200, 12)]
(2, 2): [(4, 4), (6, 6), (6, 7), (8, 8), (16, 10), (24, 11), (200, 12)]
(24, 11): []
(1, 0): [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (6, 7), (8, 8), (9, 9), (16, 10), (24, 11), (200, 12)]
(1, 1): [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (6, 7), (8, 8), (9, 9), (16, 10), (24, 11), (200, 12)]
(3, 3): [(6, 6), (6, 7), (9, 9), (24, 11)]
Is there a better way of going about this?
EDIT:
This dictionary will then be fed into this:
ans_set = set()
for x in num_dict:
    for y in num_dict[x]:
        for z in num_dict[y]:
            ans_set.add((x[0], y[0], z[0]))
return len(ans_set)
to find all unique possible triplets in which the 3rd value can be evenly divided by the 2nd value which can be evenly divided by the 1st.
If you think you know of a better way of doing the entire thing, I'm open to redoing the whole of it.
Final Edit
I've found the best way to find the number of triples by reevaluating what I needed it to do. This method doesn't actually find the triples, it just counts them.
def foo(l):
    llen = len(l)
    total = 0
    cache = {}
    for i in range(llen):
        cache[i] = 0
    for x in range(llen):
        for y in range(x + 1, llen):
            if l[y] % l[x] == 0:
                cache[y] += 1
                total += cache[x]
    return total
And here's a version of the function that explains the thought process as it goes (not good for huge lists though because of spam prints):
def bar(l):
    list_length = len(l)
    total_triples = 0
    cache = {}
    for i in range(list_length):
        cache[i] = 0
    for x in range(list_length):
        print("\n\nfor index[{}]: {}".format(x, l[x]))
        for y in range(x + 1, list_length):
            print("\n\ttry index[{}]: {}".format(y, l[y]))
            if l[y] % l[x] == 0:
                print("\n\t\t{} can be evenly divided by {}".format(l[y], l[x]))
                cache[y] += 1
                total_triples += cache[x]
                print("\t\tcache[{0}] is now {1}".format(y, cache[y]))
                print("\t\tcount is now {}".format(total_triples))
                print("\t\t(+{} from cache[{}])".format(cache[x], x))
            else:
                print("\n\t\tfalse")
    print("\ntotal number of triples:", total_triples)

Well, you could start by not unnecessarily duplicating information.
Storing full tuples (number and index) for each multiple is inefficient when you already have that information available.
For example, rather than:
(3, 2): [(16, 4), (6, 7), (6, 8), (9, 10), (24, 11)]
(the 16 appears to be wrong there as it's not a multiple of 3 so I'm guessing you meant 15) you could instead opt for:
(3, 2): [15, 6, 9, 24]
(6, 7): ...
That pretty much halves your storage needs since you can go from the 6 in the list and find all its indexes by searching the tuples. That will, of course, be extra processing effort to traverse the list but it's probably better to have a slower working solution than a faster non-working one :-)
You could reduce the storage even more by not storing the multiples at all, instead running through the tuple list using % to see if you have a multiple.
But, of course, this all depends on your actual requirements. You would be better off stating what you're trying to achieve rather than presupposing a solution.
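The idea of not storing the multiples at all can be sketched as a generator that recomputes them on demand with %. The name multiples_after is made up for this example, and the sketch assumes strictly positive integers (as stated in the question), so no division by zero can occur.

```python
def multiples_after(sorted_values, idx):
    """Yield (value, index) for every later entry that sorted_values[idx] divides evenly."""
    base = sorted_values[idx]
    for j in range(idx + 1, len(sorted_values)):
        if sorted_values[j] % base == 0:
            yield (sorted_values[j], j)


sorted_list = sorted([1, 1, 3, 5, 16, 2, 4, 6, 6, 8, 9, 24, 200])

# Multiples of the value at index 3 (which is 3 in the sorted list).
threes = list(multiples_after(sorted_list, 3))
print(threes)  # [(6, 6), (6, 7), (9, 9), (24, 11)]
```

Nothing is kept in memory beyond the sorted list itself; the trade-off is that the multiples are recomputed every time they are requested.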

You rebuild tuples in places like pair = (y, y2 + x2 + 1) and num_dict[(x, x2)].append(pair) when you could build a canonical set of tuples early on and then just put references to them in the containers. I cobbled up a 2000-item test on my machine that works. I have Python 3.4 64-bit with a relatively modest 3.5 GB of RAM...
import random
# a test list that should generate longish lists
# (values start at 1 because a 0 would make the % below divide by zero)
l = list(random.randint(1, 2000) for _ in range(2000))
# setup canonical index and sort ascending
sorted_index = sorted((v,i) for i,v in enumerate(l))
num_dict = {}
for idx, vi in enumerate(sorted_index):
    v = vi[0]
    num_dict[vi] = [vi2 for vi2 in sorted_index[idx+1:] if not vi2[0] % v]
for item in num_dict.items():
    print(item)
