How to find the maximum per group in an rdd? - python

I'm using PySpark and I have an RDD that looks like this:
[
    ("MovieX", [(1, 100), (2, 20), (3, 50)]),
    ("MovieY", [(1, 100), (2, 250), (3, 100), (4, 120)]),
    ("MovieZ", [(1, 1000), (2, 250)]),
    ("MovieX", [(4, 50), (5, 10), (6, 0)]),
    ("MovieY", [(3, 0), (4, 260)]),
    ("MovieZ", [(5, 180)]),
]
In each inner tuple, the first element is the week number and the second element is the number of viewers. I want to find the week with the most views for each movie, ignoring the first week.
I've tried a few things, but nothing worked. For example:
stats.reduceByKey(max).collect()
returns:
[('MovieX', [(4, 50), (5, 10), (6, 0)]),
 ('MovieZ', [(5, 180)]),
 ('MovieY', [(3, 0), (4, 260)])]
so the entire second list for each movie.
I also tried this:
stats.groupByKey().reduce(max)
which returns just this:
('MovieZ', <pyspark.resultiterable.ResultIterable at 0x558f75eeb0>)
How can I solve this?

If you want the most views per movie, ignoring the first week, i.e. a result like [('MovieA', 50), ('MovieC', 250), ('MovieB', 260)], then you'll want your own map function rather than a reduce.
movie_stats = spark.sparkContext.parallelize([
    ("MovieA", [(1, 100), (2, 20), (3, "50")]),
    ("MovieC", [(1, 100), (2, "250"), (3, 100), (4, "120")]),
    ("MovieB", [(1, 1000), (2, 250)]),
    ("MovieA", [(4, 50), (5, "10"), (6, 0)]),
    ("MovieB", [(3, 0), (4, "260")]),
    ("MovieC", [(5, "180")]),
])
def get_views_after_first_week(v):
    values = iter(v)  # iterator over the lists of tuples grouped under one key
    result = list()
    for x in values:
        # keep the view counts (as int) for every week after the first
        result.extend([int(y[1]) for y in x if y[0] > 1])
    return result
mapped = movie_stats.groupByKey().mapValues(get_views_after_first_week).mapValues(max)
mapped.collect()
To include the week number, i.e. a result like [('MovieA', (3, 50)), ('MovieC', (2, 250)), ('MovieB', (4, 260))]:
def get_max_weekly_views_after_first_week(v):
    values = iter(v)  # iterator over the lists of tuples grouped under one key
    max_views = float('-inf')
    max_week = None
    for x in values:
        for t in x:
            week, views = t
            views = int(views)
            # skip week 1; keep the first week that reaches the highest count
            if week > 1 and views > max_views:
                max_week = week
                max_views = views
    return (max_week, max_views)
mapped = movie_stats.groupByKey().mapValues(get_max_weekly_views_after_first_week)

Some code is needed to convert the strings to int and to apply a map function that 1) filters out the week 1 data and 2) finds the week with the most views.
def helper(arr: list):
    max_week = None
    for sub_arr in arr:
        for item in sub_arr:
            if item[0] == 1:
                continue
            count = int(item[1])
            if max_week is None or max_week[1] < count:
                max_week = [item[0], count]
    return max_week
movie_stats.groupByKey().map(lambda x: (x[0], helper(x[1]))).collect()
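A possible alternative avoids groupByKey by flattening each list into (movie, (week, views)) pairs and reducing them pairwise. The sketch below runs that logic in plain Python so it can be checked without a cluster. The helper names later_weeks and better are hypothetical, and the commented RDD chain at the end is an assumption about how they would plug into flatMapValues and reduceByKey. It also shows why reduceByKey(max) returned whole lists: max compares Python lists element by element.

```python
stats = [
    ("MovieA", [(1, 100), (2, 20), (3, "50")]),
    ("MovieC", [(1, 100), (2, "250"), (3, 100), (4, "120")]),
    ("MovieB", [(1, 1000), (2, 250)]),
    ("MovieA", [(4, 50), (5, "10"), (6, 0)]),
    ("MovieB", [(3, 0), (4, "260")]),
    ("MovieC", [(5, "180")]),
]


def later_weeks(weekly):
    """Drop week 1 and normalise the view counts to int (hypothetical helper)."""
    return [(week, int(views)) for week, views in weekly if week > 1]


def better(a, b):
    """Pairwise reducer: keep the (week, views) pair with more views.

    On a tie the earlier week wins, so the result does not depend on the
    order in which pairs are combined.
    """
    return max(a, b, key=lambda t: (t[1], -t[0]))


# Why reduceByKey(max) failed: max() on two lists compares them element by
# element, so the list whose first tuple is larger wins as a whole.
assert max([(1, 100), (2, 20)], [(4, 50), (5, 10)]) == [(4, 50), (5, 10)]

# Plain-Python stand-in for flatMapValues followed by reduceByKey.
best = {}
for movie, weekly in stats:
    for pair in later_weeks(weekly):
        best[movie] = better(best[movie], pair) if movie in best else pair

print(best)

# Assumed RDD equivalent (not executed here):
# movie_stats.flatMapValues(later_weeks).reduceByKey(better).collect()
```

Because better is associative and commutative, it should be safe to use as a reduceByKey function, which combines values in no guaranteed order.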

Related

Number of passengers. Error: list indices must be integers or slices, not list

I'm trying to compute the net number of passengers at each stop.
The "stops" variable is a list with one tuple per stop, and each tuple holds the passengers getting on and off at that stop, for example:
stops = [(in1, out1), (in2, out2), (in3, out3), (in4, out4)]
stops = [(10, 0), (4, 1), (3, 5), (3, 4), (5, 1), (1, 5), (5, 8), (4, 6), (2, 3)]
number_passenger_per_stop = []
for i in stops:
    resta = stops[i][0] - stops[i][1]
    number_passenger_per_stop.append(resta)
print(number_passenger_per_stop)
Outside the loop, an expression like the one below works, but I don't understand why it crashes inside the loop:
stops[i][0] - stops[i][1]
i is not the list index, it's the list element itself. You don't need to write stops[i].
resta = i[0] - i[1]
Your code would be correct if you had written
for i in range(len(stops)):
You could also replace the entire thing with a list comprehension:
number_passenger_per_stop = [on - off for on, off in stops]
I just edited the for loop so that it addresses each element of the list by its index. You need to access each element by its position, not by its value:
stops = [(10, 0), (4, 1), (3, 5), (3, 4), (5, 1), (1, 5), (5, 8), (4, 6), (2, 3)]
number_passenger_per_stop = []
for i in range(len(stops)):
    resta = stops[i][0] - stops[i][1]
    number_passenger_per_stop.append(resta)
print(number_passenger_per_stop)
Output:
[10, 3, -2, -1, 4, -4, -3, -2, -1]
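If the goal is the number of passengers on board after each stop rather than the net change per stop, a running total could be sketched with itertools.accumulate. The names net_change and on_board are illustrative and not taken from the question.

```python
from itertools import accumulate

stops = [(10, 0), (4, 1), (3, 5), (3, 4), (5, 1), (1, 5), (5, 8), (4, 6), (2, 3)]

# Net change at each stop: passengers boarding minus passengers leaving.
net_change = [boarded - left for boarded, left in stops]

# Running total: passengers on board after each stop.
on_board = list(accumulate(net_change))

print(net_change)
print(on_board)
```

Unpacking each tuple directly in the comprehension sidesteps the index-versus-element confusion that caused the original error.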

How to iterate over a dictionary of tuples

I have a list of tuples called possible_moves containing possible moves on a board in my game:
[(2, 1), (2, 2), (2, 3), (3, 1), (4, 5), (5, 2), (5, 3), (6, 0), (6, 2), (7, 1)]
Then, I have a dictionary that assigns a value to each cell on the game board:
{(0,0): 10000, (0,1): -3000, (0,2): 1000, (0,3): 800, etc.}
I want to iterate over all possible moves and find the move with the highest value.
my_value = 0
possible_moves = dict(possible_moves)
for move, value in moves_values:
    if move in possible_moves and possible_moves[move] > my_value:
        my_move = possible_moves[move]
        my_value = value
return my_move
The problem is in the "for move, value" part: iterating over the dictionary yields its keys, so each key is unpacked into two integers, but I want move to be a tuple.
IIUC, you don't even need the list of possible moves. The moves and their scores you care about are already contained in the dictionary.
>>> from operator import itemgetter
>>>
>>> scores = {(0,0): 10000, (0,1): -3000, (0,2): 1000, (0,3): 800}
>>> max_move, max_score = max(scores.items(), key=itemgetter(1))
>>>
>>> max_move
(0, 0)
>>> max_score
10000
edit: turns out I did not understand quite correctly. Assuming that the list of moves, let's call it possible_moves, contains the moves possible right now and that the dictionary scores contains the scores for all moves, even the impossible ones, you can issue:
max_score, max_move = max((scores[move], move) for move in possible_moves)
... or if you don't need the score:
max_move = max(possible_moves, key=scores.get)
You can use max with dict.get:
possible_moves = [(2, 1), (2, 2), (2, 3), (3, 1), (4, 5), (5, 2),
(5, 3), (6, 0), (6, 2), (7, 1), (0, 2), (0, 1)]
scores = {(0,0): 10000, (0,1): -3000, (0,2): 1000, (0,3): 800}
res = max(possible_moves, key=lambda x: scores.get(x, 0)) # (0, 2)
This assumes moves not found in your dictionary have a default score of 0. If you can guarantee that every move is included as a key in your scores dictionary, you can simplify somewhat:
res = max(possible_moves, key=scores.__getitem__)
Note that the [] syntax is syntactic sugar for __getitem__: if the key isn't found, you'll get a KeyError.
If d is a dict, iterating over d yields its keys, while d.items() yields key-value pairs. So:
for move, value in moves_values.items():
possibleMoves=[(2, 1), (2, 2), (2, 3), (3, 1), (4, 5), (5, 2),(0, 3),(5, 3), (6, 0), (6, 2), (7, 1),(0,2)]
movevalues={(0,0): 10000, (0,1): -3000, (0,2): 1000, (0,3): 800}
def func():
    my_value=0
    for i in range(len(possibleMoves)):
        for k,v in movevalues.items():
            if possibleMoves[i]==k and v>my_value:
                my_value=v
    return my_value
maxValue=func()
print(maxValue)
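The loops above start from a best value of 0, so a position in which every possible move scores below zero would never select a move. A minimal sketch that avoids this, assuming moves without an entry in the dictionary should simply be skipped (the function name best_move is hypothetical):

```python
def best_move(possible_moves, scores):
    """Return (move, score) for the highest-scoring possible move, or None."""
    # Keep only the moves that have a score; skipping unknown moves is an assumption.
    scored = [(move, scores[move]) for move in possible_moves if move in scores]
    # default=None avoids a ValueError when no possible move has a score.
    return max(scored, key=lambda pair: pair[1], default=None)


possible_moves = [(2, 1), (0, 3), (5, 3), (0, 2), (0, 1)]
scores = {(0, 0): 10000, (0, 1): -3000, (0, 2): 1000, (0, 3): 800}

print(best_move(possible_moves, scores))
```

Returning the move together with its score keeps both pieces of information available, which the original loop was trying to track in two separate variables.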

Removing overlapping tuple values using Python

I have a list of N tuples (let's call it yz_list), each holding a start and an end value in the form (start, end), as in the example below:
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
I would like to remove every tuple that overlaps the interval of a previously saved tuple. For the list shown above, the expected output is:
result = [(0,6), (6,12), (18,24)]
How could I achieve this result using Python?
Edit #1
The code below is how I'm generating these tuples:
for i, a in enumerate(seq):
    if seq[i:i+multiplier] == "x"*multiplier:
        to_replace.append((i, i+multiplier))
for i, j in enumerate(to_replace):
    print(i,j)
    if i == 0:
        def_to_replace.append(j)
    else:
        ind = def_to_replace[i-1]
        print(j[0]+1, "\n", ind)
        if j[0]+1 not in range(ind[0], ind[1]):
            def_to_replace.append(j)
# print(i, j)
print(def_to_replace)
for item in def_to_replace:
    frag = replacer(frame_calc(seq[:item[0]]), rep0, rep1, rep2)
    for k, v in enumerate(seq_dup[item[0]:item[1]]):
        seq_dup[int(item[0]) + int(k)] = list(frag)[k]
return "".join(seq_dup)
As I'm developing with TDD, I'm making step-by-step progress, and now I'm thinking about how to implement the removal of overlapping tuples. I don't really know whether it's a good idea to treat them as sets and look for the overlapping items.
The pseudocode for generating the result list is:
for item in yz_list:
    if item is not the first item of yz_list:
        get the item's first value
        check whether that value lies between the values of any tuple already added to the result list
This may work. No fancy stuff, just manually process each tuple to see if either value is within the bounds of the most recently saved tuple:
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
result = [yz_list[0]]
bounds = yz_list[0][0], yz_list[0][1]
for tup in yz_list[1:]:
    if tup[0] in range(bounds[0], bounds[1]) or tup[1] in range(bounds[0], bounds[1]):
        pass
    else:
        result.append(tup)
        bounds = tup  # compare later tuples against the tuple just saved
print(result)  # [(0, 6), (6, 12), (18, 24)]
Here is a class that calculates the overlaps using efficient binary search, and code showing its use to solve your problem. Run with python3.
import bisect
import sys
class Overlap():
    def __init__(self):
        self._intervals = []
    def intervals(self):
        return self._intervals
    def put(self, interval):
        istart, iend = interval
        # Ignoring intervals that start after the window.
        i = bisect.bisect_right(self._intervals, (iend, sys.maxsize))
        # Look at remaining intervals to find overlap.
        for start, end in self._intervals[:i]:
            if end > istart:
                return False
        bisect.insort(self._intervals, interval)
        return True
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
ov = Overlap()
for i in yz_list:
    ov.put(i)
print('Original:', yz_list)
print('Result:', ov.intervals())
OUTPUT:
Original: [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
Result: [(0, 6), (6, 12), (18, 24)]
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
result = []
for start, stop in yz_list:
    for low, high in result:
        if (low < start < high) or (low < stop < high):
            break
    else:
        result.append((start, stop))
This gives the desired output, and it's pretty easy to see how it works. The else clause basically just means "run this if we didn't break out of the loop".
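When the tuples can be sorted by their start value, a single pass that only remembers the end of the last kept tuple may be enough. This is a sketch under two assumptions that the question does not state explicitly: intervals that merely touch, such as (0, 6) and (6, 12), do not count as overlapping, and processing the tuples in sorted order is acceptable. The name drop_overlaps is hypothetical.

```python
def drop_overlaps(intervals):
    """Keep each interval that starts at or after the end of the last kept one."""
    kept = []
    last_end = float("-inf")
    for start, end in sorted(intervals):
        # Touching intervals (start == last_end) are treated as non-overlapping.
        if start >= last_end:
            kept.append((start, end))
            last_end = end
    return kept


yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
print(drop_overlaps(yz_list))
```

After sorting, each tuple is compared with one value only, so the pass itself is linear instead of checking every saved tuple.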

Compare two tuple with variable length in Python

I have two tuples of tuples and I want to compare the values on the basis of their first element
list1 = ((1, 2450.0), (2, 2095.0), (4, 1290.0), (5, 1190.0), (6, 1150.0), (7, 1150.0), (8, 1090.0), (9, 1090.0))
list2 = ((1, 2673.0), (4, 1488.0), (5, 1139.0), (6, 1057.0), (7, 1482.0), (8, 1037.0), (9, 1169.0), (10, 937.0))
Expected result should be
list1 = ((1, 2450.0), (2, 2095.0), (3, nan),(4, 1290.0), (5, 1190.0), (6, 1150.0), (7, 1150.0), (8, 1090.0), (9, 1090.0), (10,nan))
list2 = ((1, 2673.0), (3, nan),(4, 1488.0), (5, 1139.0), (6, 1057.0), (7, 1482.0), (8, 1037.0), (9, 1169.0), (10, 937.0))
What would be an efficient way of doing this?
If I understood your question correctly, you want to check whether each tuple contains certain numbers, stored as the first element of each sub-tuple, and, when a number is missing, create a sub-tuple whose second element is None (assuming nan means None).
I would follow this process, which may not be the most efficient.
# First create a tuple that contains the numbers to be checked
checkTuple = ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
# Create a function that checks whether each number is the first element of a sub-tuple
def chooseName( checkTuple, randomList ):
    newList = []
    for checkItem in checkTuple:
        foundItem = None
        for item in randomList:
            # compare against the first element only, not the value
            if item[0] == checkItem:
                foundItem = item
                break
        if foundItem is not None:
            newList.append( foundItem )
        else:
            newList.append( (checkItem, None) )
    return tuple( newList )
# Call the function and take back the tuple
newList1 = chooseName( checkTuple, list1 )
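The nested loops above scan the whole input once per number. A dictionary lookup avoids that, as in the following sketch. It assumes the keys of interest are 1 through 10, as the expected output suggests, and that nan means float nan; the function name fill_missing is hypothetical.

```python
import math


def fill_missing(pairs, keys):
    """Return one (key, value) pair per key, using nan where a key is absent."""
    lookup = dict(pairs)  # first element -> second element
    return tuple((key, lookup.get(key, math.nan)) for key in keys)


list1 = ((1, 2450.0), (2, 2095.0), (4, 1290.0), (5, 1190.0), (6, 1150.0),
         (7, 1150.0), (8, 1090.0), (9, 1090.0))
list2 = ((1, 2673.0), (4, 1488.0), (5, 1139.0), (6, 1057.0), (7, 1482.0),
         (8, 1037.0), (9, 1169.0), (10, 937.0))

keys = range(1, 11)  # assumed key range
filled1 = fill_missing(list1, keys)
filled2 = fill_missing(list2, keys)

print(filled1)
print(filled2)
```

Since nan never compares equal to itself, math.isnan is the reliable way to test for the filled-in entries afterwards.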

More memory efficient way of making a dictionary?

VERY sorry for the vagueness, but I don't actually know what part of what I'm doing is inefficient.
I've made a program that takes a list of positive integers (example*):
[1, 1, 3, 5, 16, 2, 4, 6, 6, 8, 9, 24, 200,]
*the real lists can be up to 2000 in length and the elements between 0 and 100,000 exclusive
And creates a dictionary where each number tupled with its index (like so: (number, index)) is a key and the value for each key is a list of every number (and that number's index) in the input that it goes evenly into.
So the entry for the 3 would be: (3, 2): [(16, 4), (6, 7), (6, 8), (9, 10), (24, 11)]
My code is this:
num_dict = {}
sorted_list = sorted(beginning_list)
for a2, a in enumerate(sorted_list):
    num_dict[(a, a2)] = []
for x2, x in enumerate(sorted_list):
    for y2, y in enumerate(sorted_list[x2 + 1:]):
        if y % x == 0:
            pair = (y, y2 + x2 + 1)
            num_dict[(x, x2)].append(pair)
But, when I run this script, I hit a MemoryError.
I understand that this means I am running out of memory, but in my situation, adding more RAM or switching to a 64-bit version of Python is not an option.
I am certain that the problem is not coming from the list sorting or the first for loop. It has to be the second for loop. I just included the other lines for context.
The full output for the list above would be (sorry for the unsortedness, that's just how dictionaries do):
(200, 12): []
(6, 7): [(24, 11)]
(16, 10): []
(6, 6): [(6, 7), (24, 11)]
(5, 5): [(200, 12)]
(4, 4): [(8, 8), (16, 10), (24, 11), (200, 12)]
(9, 9): []
(8, 8): [(16, 10), (24, 11), (200, 12)]
(2, 2): [(4, 4), (6, 6), (6, 7), (8, 8), (16, 10), (24, 11), (200, 12)]
(24, 11): []
(1, 0): [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (6, 7), (8, 8), (9, 9), (16, 10), (24, 11), (200, 12)]
(1, 1): [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (6, 7), (8, 8), (9, 9), (16, 10), (24, 11), (200, 12)]
(3, 3): [(6, 6), (6, 7), (9, 9), (24, 11)]
Is there a better way of going about this?
EDIT:
This dictionary will then be fed into this:
ans_set = set()
for x in num_dict:
    for y in num_dict[x]:
        for z in num_dict[y]:
            ans_set.add((x[0], y[0], z[0]))
return len(ans_set)
to find all unique possible triplets in which the 3rd value can be evenly divided by the 2nd value which can be evenly divided by the 1st.
If you think you know of a better way of doing the entire thing, I'm open to redoing the whole of it.
Final Edit
I've found the best way to find the number of triples by reevaluating what I needed it to do. This method doesn't actually find the triples, it just counts them.
def foo(l):
    llen = len(l)
    total = 0
    cache = {}
    for i in range(llen):
        cache[i] = 0
    for x in range(llen):
        for y in range(x + 1, llen):
            if l[y] % l[x] == 0:
                cache[y] += 1
                total += cache[x]
    return total
And here's a version of the function that explains the thought process as it goes (not good for huge lists though because of spam prints):
def bar(l):
    list_length = len(l)
    total_triples = 0
    cache = {}
    for i in range(list_length):
        cache[i] = 0
    for x in range(list_length):
        print("\n\nfor index[{}]: {}".format(x, l[x]))
        for y in range(x + 1, list_length):
            print("\n\ttry index[{}]: {}".format(y, l[y]))
            if l[y] % l[x] == 0:
                print("\n\t\t{} can be evenly divided by {}".format(l[y], l[x]))
                cache[y] += 1
                total_triples += cache[x]
                print("\t\tcache[{0}] is now {1}".format(y, cache[y]))
                print("\t\tcount is now {}".format(total_triples))
                print("\t\t(+{} from cache[{}])".format(cache[x], x))
            else:
                print("\n\t\tfalse")
    print("\ntotal number of triples:", total_triples)
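One way to gain confidence in the counting approach is to compare it against a brute-force check on small inputs. The sketch below restates the counting loop with a plain list of counters and cross-checks it with itertools.combinations; the names count_triples and count_triples_brute are hypothetical. Note that both functions count triples of positions, so repeated values are counted once per position, which differs from the set of unique value triples built earlier.

```python
from itertools import combinations


def count_triples(nums):
    """Count index triples i < j < k with nums[j] % nums[i] == 0 and nums[k] % nums[j] == 0."""
    divisors_before = [0] * len(nums)  # divisors_before[j]: earlier elements that divide nums[j]
    total = 0
    for x in range(len(nums)):
        for y in range(x + 1, len(nums)):
            if nums[y] % nums[x] == 0:
                divisors_before[y] += 1
                # Every earlier divisor of nums[x] completes a triple ending at y.
                total += divisors_before[x]
    return total


def count_triples_brute(nums):
    """Reference implementation: test every ordered selection of three positions."""
    return sum(1 for a, b, c in combinations(nums, 3) if b % a == 0 and c % b == 0)


sample = sorted([1, 1, 3, 5, 16, 2, 4, 6, 6, 8, 9, 24, 200])
print(count_triples(sample), count_triples_brute(sample))
```

The counting version needs only one integer per element, whereas the brute-force version is cubic in time and meant purely as a check.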
Well, you could start by not unnecessarily duplicating information.
Storing full tuples (number and index) for each multiple is inefficient when you already have that information available.
For example, rather than:
(3, 2): [(16, 4), (6, 7), (6, 8), (9, 10), (24, 11)]
(the 16 appears to be wrong there as it's not a multiple of 3 so I'm guessing you meant 15) you could instead opt for:
(3, 2): [15, 6, 9, 24]
(6, 7): ...
That pretty much halves your storage needs since you can go from the 6 in the list and find all its indexes by searching the tuples. That will, of course, be extra processing effort to traverse the list but it's probably better to have a slower working solution than a faster non-working one :-)
You could reduce the storage even more by not storing the multiples at all, instead running through the tuple list using % to see if you have a multiple.
But, of course, this all depends on your actual requirements. You would be better off stating what you're trying to achieve rather than presupposing a solution.
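The idea of not storing the multiples at all could be sketched with a generator that recomputes them on demand using %. The name multiples_after is hypothetical, and the input is assumed to be sorted and to contain only positive integers, as in the question.

```python
def multiples_after(sorted_nums, idx):
    """Yield (value, index) for every later element divisible by sorted_nums[idx]."""
    base = sorted_nums[idx]
    for j in range(idx + 1, len(sorted_nums)):
        if sorted_nums[j] % base == 0:
            yield sorted_nums[j], j


nums = sorted([1, 1, 3, 5, 16, 2, 4, 6, 6, 8, 9, 24, 200])

# Multiples of the value at index 3 (which is 3), computed without storing them.
print(list(multiples_after(nums, 3)))
```

This trades repeated modulo checks for memory: nothing beyond the sorted list itself is kept between calls.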
You rebuild tuples in places like pair = (y, y2 + x2 + 1) and num_dict[(x, x2)].append(pair) when you could build a canonical set of tuples early on and then just put references to them in the containers. I cobbled together a 2000-item test that works on my machine. I have 64-bit Python 3.4 with a relatively modest 3.5 GB of RAM...
import random
# a test list that should generate longish lists
# (starts at 1: the inputs are positive, and a 0 would make % v raise ZeroDivisionError)
l = list(random.randint(1, 2000) for _ in range(2000))
# setup canonical index and sort ascending
sorted_index = sorted((v,i) for i,v in enumerate(l))
num_dict = {}
for idx, vi in enumerate(sorted_index):
    v = vi[0]
    # store references to the canonical (value, index) tuples instead of new ones
    num_dict[vi] = [vi2 for vi2 in sorted_index[idx+1:] if not vi2[0] % v]
for item in num_dict.items():
    print(item)
