More memory efficient way of making a dictionary?

More memory efficient way of making a dictionary? - python

VERY sorry for the vagueness, but I don't actually know what part of what I'm doing is inefficient.
I've made a program that takes a list of positive integers (example*):
[1, 1, 3, 5, 16, 2, 4, 6, 6, 8, 9, 24, 200,]
*the real lists can be up to 2000 in length and the elements between 0 and 100,000 exclusive
And creates a dictionary where each number tupled with its index (like so: (number, index)) is a key and the value for each key is a list of every number (and that number's index) in the input that it goes evenly into.
So the entry for the 3 would be: (3, 2): [(16, 4), (6, 7), (6, 8), (9, 10), (24, 11)]
My code is this:
num_dict = {}
sorted_list = sorted(beginning_list)
for a2, a in enumerate(sorted_list):
num_dict[(a, a2)] = []
for x2, x in enumerate(sorted_list):
for y2, y in enumerate(sorted_list[x2 + 1:]):
if y % x == 0:
pair = (y, y2 + x2 + 1)
num_dict[(x, x2)].append(pair)
But, when I run this script, I hit a MemoryError.
I understand that this means that I am running out of memory but in the situation I'm in, adding more ram or updating to a 64-bit version of python is not an option.
I am certain that the problem is not coming from the list sorting or the first for loop. It has to be the second for loop. I just included the other lines for context.
The full output for the list above would be (sorry for the unsortedness, that's just how dictionaries do):
(200, 12): []
(6, 7): [(24, 11)]
(16, 10): []
(6, 6): [(6, 7), (24, 11)]
(5, 5): [(200, 12)]
(4, 4): [(8, 8), (16, 10), (24, 11), (200, 12)]
(9, 9): []
(8, 8): [(16, 10), (24, 11), (200, 12)]
(2, 2): [(4, 4), (6, 6), (6, 7), (8, 8), (16, 10), (24, 11), (200, 12)]
(24, 11): []
(1, 0): [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (6, 7), (8, 8), (9, 9), (16, 10), (24, 11), (200, 12)]
(1, 1): [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (6, 7), (8, 8), (9, 9), (16, 10), (24, 11), (200, 12)]
(3, 3): [(6, 6), (6, 7), (9, 9), (24, 11)]
Is there a better way of going about this?
EDIT:
This dictionary will then be fed into this:
ans_set = set()
for x in num_dict:
for y in num_dict[x]:
for z in num_dict[y]:
ans_set.add((x[0], y[0], z[0]))
return len(ans_set)
to find all unique possible triplets in which the 3rd value can be evenly divided by the 2nd value which can be evenly divided by the 1st.
If you think you know of a better way of doing the entire thing, I'm open to redoing the whole of it.
Final Edit
I've found the best way to find the number of triples by reevaluating what I needed it to do. This method doesn't actually find the triples, it just counts them.
def foo(l):
llen = len(l)
total = 0
cache = {}
for i in range(llen):
cache[i] = 0
for x in range(llen):
for y in range(x + 1, llen):
if l[y] % l[x] == 0:
cache[y] += 1
total += cache[x]
return total
And here's a version of the function that explains the thought process as it goes (not good for huge lists though because of spam prints):
def bar(l):
list_length = len(l)
total_triples = 0
cache = {}
for i in range(list_length):
cache[i] = 0
for x in range(list_length):
print("\n\nfor index[{}]: {}".format(x, l[x]))
for y in range(x + 1, list_length):
print("\n\ttry index[{}]: {}".format(y, l[y]))
if l[y] % l[x] == 0:
print("\n\t\t{} can be evenly diveded by {}".format(l[y], l[x]))
cache[y] += 1
total_triples += cache[x]
print("\t\tcache[{0}] is now {1}".format(y, cache[y]))
print("\t\tcount is now {}".format(total_triples))
print("\t\t(+{} from cache[{}])".format(cache[x], x))
else:
print("\n\t\tfalse")
print("\ntotal number of triples:", total_triples)

Well, you could start by not unnecessarily duplicating information.
Storing full tuples (number and index) for each multiple is inefficient when you already have that information available.
For example, rather than:
(3, 2): [(16, 4), (6, 7), (6, 8), (9, 10), (24, 11)]
(the 16 appears to be wrong there as it's not a multiple of 3 so I'm guessing you meant 15) you could instead opt for:
(3, 2): [15, 6, 9, 24]
(6, 7): ...
That pretty much halves your storage needs since you can go from the 6 in the list and find all its indexes by searching the tuples. That will, of course, be extra processing effort to traverse the list but it's probably better to have a slower working solution than a faster non-working one :-)
You could reduce the storage even more by not storing the multiples at all, instead running through the tuple list using % to see if you have a multiple.
But, of course, this all depends on your actual requirements which would be better off stating the intent of what your trying to achieve rather than pre-supposing a solution.

You rebuild tuples in places like pair = (y, y2 + x2 + 1) and num_dict[(x, x2)].append(pair) when you could build a canonical set of tuples early on and then just put references in the containers. I cobbled up a 2000 item test my machine that works. I have python 3.4 64 bit with a relatively modest 3.5 GIG of RAM...
import random
# a test list that should generate longish lists
l = list(random.randint(0, 2000) for _ in range(2000))
# setup canonical index and sort ascending
sorted_index = sorted((v,i) for i,v in enumerate(l))
num_dict = {}
for idx, vi in enumerate(sorted_index):
v = vi[0]
num_dict[vi] = [vi2 for vi2 in sorted_index[idx+1:] if not vi2[0] % v]
for item in num_dict.items():
print(item)

Related

How to find the maximum per group in an rdd?

I'm using PySpark and I have an RDD that looks like this:
[
("Moviex", [(1, 100), (2, 20), (3, 50)]),
("MovieY", [(1, 100), (2, 250), (3, 100), (4, 120)]),
("MovieZ", [(1, 1000), (2, 250)]),
("MovieX", [(4, 50), (5, 10), (6, 0)]),
("MovieY", [(3, 0), (4, 260)]),
("MovieZ", [(5, 180)]),
]
The first element in the tuple represents the week number and the second element represents the number of viewers. I want to find the week with the most views for each movie, but ignoring the first week.
I've tried some things but nothing worked, for example:
stats.reduceByKey(max).collect()
returns:
[('MovieX', [(4, 50), (5, 10), (6, 0)]),
('MovieY', [(5, 180)]),
('MovieC', [(3, 0), (4, 260)])]
so the entire second set.
Also this:
stats.groupByKey().reduce(max)
which returns just this:
('MovieZ', <pyspark.resultiterable.ResultIterable at 0x558f75eeb0>)
How can I solve this?

If you want the most views per movie, ignoring the first week ... [('MovieA', 50), ('MovieC', 250), ('MovieB', 260)]
Then, you'll want your own map function rather than a reduce.
movie_stats = spark.sparkContext.parallelize([
("MovieA", [(1, 100), (2, 20), (3, "50")]),
("MovieC", [(1, 100), (2, "250"), (3, 100), (4, "120")]),
("MovieB", [(1, 1000), (2, 250)]),
("MovieA", [(4, 50), (5, "10"), (6, 0)]),
("MovieB", [(3, 0), (4, "260")]),
("MovieC", [(5, "180")]),
])
def get_views_after_first_week(v):
values = iter(v) # iterator of tuples, groupped by key
result = list()
for x in values:
result.extend([int(y[1]) for y in x if y[0] > 1])
return result
mapped = movie_stats.groupByKey().mapValues(get_views_after_first_week).mapValues(max)
mapped.collect()
to include the week number... [('MovieA', (3, 50)), ('MovieC', (2, 250)), ('MovieB', (4, 260))]
def get_max_weekly_views_after_first_week(v):
values = iter(v) # iterator of tuples, groupped by key
max_views = float('-inf')
max_week = None
for x in values:
for t in x:
week, views = t
views = int(views)
if week > 1 and views > max_views:
max_week = week
max_views = views
return (max_week, max_views, )
mapped = movie_stats.groupByKey().mapValues(get_max_weekly_views_after_first_week)

Some code is needed to convert the string into int, and apply a map function to 1) filter out week 1 data; 2) get the week with max view.
def helper(arr: list):
max_week = None
for sub_arr in arr:
for item in sub_arr:
if item[0] == 1:
continue
count = int(item[1])
if max_week is None or max_week[1] < count:
max_week = [item[0], count]
return max_week
movie_stats.groupByKey().map(lambda x: (x[0], helper(x[1]))).collect()

Number of passengers. Error: list indices must be integers or slices, not list

So, I'm trying to sum the number of passenger at each stop.
The "stops" variable are the number of stops, and is conformed by a tuple which contains the in's and out's of passengers, example:
stops = [(in1, out1), (in2, out2), (in3, out3), (in4, out4)]
stops = [(10, 0), (4, 1), (3, 5), (3, 4), (5, 1), (1, 5), (5, 8), (4, 6), (2, 3)]
number_passenger_per_stop = []
for i in stops:
resta = stops[i][0] - stops[i][1]
number_passenger_per_stop.append(resta)
print(number_passenger_per_stop)
I can do the math like this outside the loop, but I don't understand why in the loop crashes:
stops[i][0] - stops[i][1]

i is not the list index, it's the list element itself. You don't need to write stops[i].
resta = i[0] - i[1]
Your code would be correct if you had written
for i in range(len(stops)):
You could also replace the entire thing with a list comprehension:
number_passenger_per_stop = [on - off for on, off in stops]

I just edited the for loop to adress each in the index in the list correctly, you needed to call each element in the list by its position, and not by its value:
stops = [(10, 0), (4, 1), (3, 5), (3, 4), (5, 1), (1, 5), (5, 8), (4, 6), (2, 3)]
number_passenger_per_stop = []
for i in range(len(stops)):
resta = stops[i][0] - stops[i][1]
number_passenger_per_stop.append(resta)
print(number_passenger_per_stop)
Output:
[10, 3, -2, -1, 4, -4, -3, -2, -1]

Removing overlapping tuple values using Python

I have a list of tuples (let's name it yz_list) that contains N tuples, which have the start and end values like: (start, end), represented by the example below:
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
And I would like to remove all values which are overlapped by the interval of a previous saved tuple. The output that represents this case on the sequences showed above is:
result = [(0,6), (6,12), (18,24)]
How could I achieve this result using Python?
Edit #1
The below code is the code that I'm generating this tuples:
for i, a in enumerate(seq):
if seq[i:i+multiplier] == "x"*multiplier:
to_replace.append((i, i+multiplier))
for i, j in enumerate(to_replace):
print(i,j)
if i == 0:
def_to_replace.append(j)
else:
ind = def_to_replace[i-1]
print(j[0]+1, "\n", ind)
if j[0]+1 not in range(ind[0], ind[1]):
def_to_replace.append(j)
# print(i, j)
print(def_to_replace)
for item in def_to_replace:
frag = replacer(frame_calc(seq[:item[0]]), rep0, rep1, rep2)
for k, v in enumerate(seq_dup[item[0]:item[1]]):
seq_dup[int(item[0]) + int(k)] = list(frag)[k]
return "".join(seq_dup)
As I'm developing with TDD, I'm making a step-by-step progress on the development and now I'm thinking on how to implement the removal of overlaping tuples. I don't really know if it's a good idea to use them as sets, and see the overlapping items.
The pseudocode for generating the result list is:
for item in yz_list:
if is not yz_list first item:
gets item first value
see if the value is betwen any of the values from tuples added on the result list

This may work. No fancy stuff, just manually process each tuple to see if either value is within the range of the saved tuple's set bounds:
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
result = [yz_list[0]]
bounds = yz_list[0][0], yz_list[0][1]
for tup in yz_list[1:]:
if tup[0] in range(bounds[0], bounds[1]) or tup[1] in range(bounds[0], bounds[1]):
pass
else:
result.append(tup)
print result # [(0, 6), (6, 12), (18, 24)]

Here is a class that calculates the overlaps using efficient binary search, and code showing its use to solve your problem. Run with python3.
import bisect
import sys
class Overlap():
def __init__(self):
self._intervals = []
def intervals(self):
return self._intervals
def put(self, interval):
istart, iend = interval
# Ignoring intervals that start after the window.
i = bisect.bisect_right(self._intervals, (iend, sys.maxsize))
# Look at remaining intervals to find overlap.
for start, end in self._intervals[:i]:
if end > istart:
return False
bisect.insort(self._intervals, interval)
return True
yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
ov = Overlap()
for i in yz_list:
ov.put(i)
print('Original:', yz_list)
print('Result:', ov.intervals())
OUTPUT:
Original: [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
Result: [(0, 6), (6, 12), (18, 24)]

yz_list = [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10), (5, 11), (6, 12), (18, 24)]
result = []
for start, stop in yz_list:
for low, high in result:
if (low < start < high) or (low < stop < high):
break
else:
result.append((start, stop))
This gives the desired output, and it's pretty easy to see how it works. The else clause basically just means "run this if we didn't break out of the loop".

Compare two tuple with variable length in Python

I have two tuples of tuples and I want to compare the values on the basis of their first element
list1 = ((1, 2450.0), (2, 2095.0), (4, 1290.0), (5, 1190.0), (6, 1150.0), (7, 1150.0), (8, 1090.0), (9, 1090.0))
list2 = ((1, 2673.0), (4, 1488.0), (5, 1139.0), (6, 1057.0), (7, 1482.0), (8, 1037.0), (9, 1169.0), (10, 937.0))
Expected result should be
list1 = ((1, 2450.0), (2, 2095.0), (3, nan),(4, 1290.0), (5, 1190.0), (6, 1150.0), (7, 1150.0), (8, 1090.0), (9, 1090.0), (10,nan))
list2 = ((1, 2673.0), (3, nan),(4, 1488.0), (5, 1139.0), (6, 1057.0), (7, 1482.0), (8, 1037.0), (9, 1169.0), (10, 937.0))
What would be the efficient way of doing this ?

If I understood your question correctly, you want to check if each tuple contain certain numbers which are stored in the first element for each sub-tuple and if the number is not inside create a sub-tuple with the second element equal to None (if nan means None).
I would follow this process, which may not be the most efficient.
# Create first a list which contains the desired numbers to be checked
checkTuple = ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
# Create a function to check if each number is in one of the sub-tuples
def chooseName( checkTuple, randomList ):
newList = []
for checkItem in checkTuple:
itemFound = False
for item in randomList:
if checkItem in item:
numberFound = True
break
if numberFound:
newList.append( checkItem )
else:
newList.append( (checkItem, None) )
return tuple( newList )
# Call the function and take back the tuple
newList1 = chooseName( checkTuple, list1 )

Sort a list of tuples in consecutive order

I want to sort a list of tuples in a consecutive order, so the first element of each tuple is equal to the last element of the previous one.
For example:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
output = [(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]
I have developed a search like this:
output=[]
given = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
t = given[0][0]
for i in range(len(given)):
# search tuples starting with element t
output += [e for e in given if e[0] == t]
t = output[-1][-1] # Get the next element to search
print(output)
Is there a pythonic way to achieve such order?
And a way to do it "in-place" (with only a list)?
In my problem, the input can be reordered in a circular way using all the tuples, so it is not important the first element chosen.

Assuming your tuples in the list will be circular, you may use dict to achieve it within complexity of O(n) as:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
input_dict = dict(input) # Convert list of `tuples` to dict
elem = input[0][0] # start point in the new list
new_list = [] # List of tuples for holding the values in required order
for _ in range(len(input)):
new_list.append((elem, input_dict[elem]))
elem = input_dict[elem]
if elem not in input_dict:
# Raise exception in case list of tuples is not circular
raise Exception('key {} not found in dict'.format(elem))
Final value hold by new_list will be:
>>> new_list
[(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]

if you are not afraid to waste some memory you could create a dictionary start_dict containing the start integers as keys and the tuples as values and do something like this:
tpl = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
start_dict = {item[0]: item for item in tpl}
start = tpl[0][0]
res = []
while start_dict:
item = start_dict[start]
del start_dict[start]
res.append(item)
start = item[-1]
print(res)
if two tuples start with the same number you will lose one of them... if not all the start numbers are used the loop will not terminate.
but maybe this is something to build on.

Actually there're many questions about what you intend to have as an output and what if the input list has invalid structure to do what you need.
Assuming you have an input of pairs where each number is included twice only. So we can consider such input as a graph where numbers are nodes and each pair is an edge. And as far as I understand your question you suppose that this graph is cyclic and looks like this:
10 - 7 - 13 - 4 - 9 - 10 (same 10 as at the beginning)
This shows you that you can reduce the list to store the graph to [10, 7, 13, 4, 9]. And here is the script that sorts the input list:
# input
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
# sorting and archiving
first = input[0][0]
last = input[0][1]
output_in_place = [first, last]
while last != first:
for item in input:
if item[0] == last:
last = item[1]
if last != first:
output_in_place.append(last)
print(output_in_place)
# output
output = []
for i in range(len(output_in_place) - 1):
output.append((output_in_place[i], output_in_place[i+1]))
output.append((output_in_place[-1], output_in_place[0]))
print(output)

I would first create a dictionary of the form
{first_value: [list of tuples with that first value], ...}
Then work from there:
from collections import defaultdict
chosen_tuples = input[:1] # Start from the first
first_values = defaultdict()
for tup in input[1:]:
first_values[tup[0]].append(tup)
while first_values: # Loop will end when all lists are removed
value = chosen_tuples[-1][1] # Second item of last tuple
tuples_with_that_value = first_values[value]
chosen_tuples.append(tuples_with_that_value.pop())
if not chosen_with_that_value:
del first_values[value] # List empty, remove it

You can try this:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
output = [input[0]] # output contains the first element of input
temp = input[1:] # temp contains the rest of elements in input
while temp:
item = [i for i in temp if i[0] == output[-1][1]].pop() # We compare each element with output[-1]
output.append(item) # We add the right item to output
temp.remove(item) # We remove each handled element from temp
Output:
>>> output
[(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]

Here is a robust solution using the sorted function and a custom key function:
input = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
def consec_sort(lst):
def key(x):
nonlocal index
if index <= lower_index:
index += 1
return -1
return abs(x[0] - lst[index - 1][1])
for lower_index in range(len(lst) - 2):
index = 0
lst = sorted(lst, key=key)
return lst
output = consec_sort(input)
print(output)
The original list is not modified. Note that sorted is called 3 times for your input list of length 5. In each call, one additional tuple is placed correctly. The first tuple keeps it original position.
I have used the nonlocal keyword, meaning that this code is for Python 3 only (one could use global instead to make it legal Python 2 code).

My two cents:
def match_tuples(input):
# making a copy to not mess up with the original one
tuples = input[:] # [(10,7), (4,9), (13, 4), (7, 13), (9, 10)]
last_elem = tuples.pop(0) # (10,7)
# { "first tuple's element": "index in list"}
indexes = {tup[0]: i for i, tup in enumerate(tuples)} # {9: 3, 4: 0, 13: 1, 7: 2}
yield last_elem # yields de firts element
for i in range(len(tuples)):
# get where in the list is the tuple which first element match the last element in the last tuple
list_index = indexes.get(last_elem[1])
last_elem = tuples[list_index] # just get that tuple
yield last_elem
Output:
input = [(10,7), (4,9), (13, 4), (7, 13), (9, 10)]
print(list(match_tuples(input)))
# output: [(10, 7), (7, 13), (13, 4), (4, 9), (9, 10)]

this is a (less efficient than the dictionary version) variant where the list is changed in-place:
tpl = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
for i in range(1, len(tpl)-1): # iterate over the indices of the list
item = tpl[i]
for j, next_item in enumerate(tpl[i+1:]): # find the next item
# in the remaining list
if next_item[0] == item[1]:
next_index = i + j
break
tpl[i], tpl[next_index] = tpl[next_index], tpl[i] # now swap the items
here is a more efficient version of the same idea:
tpl = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
start_index = {item[0]: i for i, item in enumerate(tpl)}
item = tpl[0]
next_index = start_index[item[-1]]
for i in range(1, len(tpl)-1):
tpl[i], tpl[next_index] = tpl[next_index], tpl[i]
# need to update the start indices:
start_index[tpl[next_index][0]] = next_index
start_index[tpl[i][0]] = i
next_index = start_index[tpl[i][-1]]
print(tpl)
the list is changed in-place; the dictionary only contains the starting values of the tuples and their index in the list.

To get a O(n) algorithm one needs to make sure that one doesn't do a double-loop over the array. One way to do this is by keeping already processed values in some sort of lookup-table (a dict would be a good choice).
For example something like this (I hope the inline comments explain the functionality well). This modifies the list in-place and should avoid unnecessary (even implicit) looping over the list:
inp = [(10, 7), (4, 9), (13, 4), (7, 13), (9, 10)]
# A dictionary containing processed elements, first element is
# the key and the value represents the tuple. This is used to
# avoid the double loop
seen = {}
# The second value of the first tuple. This must match the first
# item of the next tuple
current = inp[0][1]
# Iteration to insert the next element
for insert_idx in range(1, len(inp)):
# print('insert', insert_idx, seen)
# If the next value was already found no need to search, just
# pop it from the seen dictionary and continue with the next loop
if current in seen:
item = seen.pop(current)
inp[insert_idx] = item
current = item[1]
continue
# Search the list until the next value is found saving all
# other items in the dictionary so we avoid to do unnecessary iterations
# over the list.
for search_idx in range(insert_idx, len(inp)):
# print('search', search_idx, inp[search_idx])
item = inp[search_idx]
first, second = item
if first == current:
# Found the next tuple, break out of the inner loop!
inp[insert_idx] = item
current = second
break
else:
seen[first] = item

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

More memory efficient way of making a dictionary? - python

Related

How to find the maximum per group in an rdd?

Number of passengers. Error: list indices must be integers or slices, not list

Removing overlapping tuple values using Python

Compare two tuple with variable length in Python

Sort a list of tuples in consecutive order

Categories

Resources