I have a question about a MemoryError in Python 3.6.
import itertools

input_list = ['a', 'b', 'c', 'd']
group_to_find = list(itertools.product(input_list, input_list))

a = []
for i in range(len(group_to_find)):
    if group_to_find[i] not in a:
        a.append(group_to_find[i])

With my real (much larger) input list it fails at the product call:

    group_to_find = list(itertools.product(input_list, input_list))
MemoryError
You are creating a list, in full, from the Cartesian product of your input list, so in addition to input_list you now need len(input_list) ** 2 memory slots for all the results. You then filter that list down again into yet another list. All in all, for N items you need memory for 2N + (N * N) references. If N is 1000, that's 1,002,000 references; for N = 1 million, you need a million million (10^12) plus 2 million references, and so on.
Your code doesn't need to create the group_to_find list at all, for two reasons:
You could just iterate and handle each pair individually:
a = []
for pair in itertools.product(input_list, repeat=2):
    if pair not in a:
        a.append(pair)
This is still going to be slow, because pair not in a has to scan the whole list to find a match. You do this once per generated pair (N * N times), scanning up to K stored pairs each time (where K is the square of the number of unique values in input_list, potentially as large as N * N itself), so that's on the order of N² * K time spent checking for duplicates. You could use a = set() to make each membership test O(1) on average. But see point 2.
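For illustration, a minimal sketch of that set-based variant (my naming, not part of the original answer):

import itertools

input_list = ['a', 'b', 'c', 'd']

seen = set()
for pair in itertools.product(input_list, repeat=2):
    if pair not in seen:   # set membership is O(1) on average, no list scan
        seen.add(pair)
        # ... handle the new pair here ...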
Your end product in a is the exact same list of pairs that itertools.product() would produce anyway, unless your input values are not unique. You could just make them unique first:
a = itertools.product(set(input_list), repeat=2)
Again, don't put this in a list. Iterate over it in a loop and use the pairs it produces one by one.
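Something along these lines, where handle() stands in for whatever you actually do with each pair (a hypothetical placeholder, not a real function):

for pair in itertools.product(set(input_list), repeat=2):
    handle(pair)   # process each pair as it is produced; nothing is materialized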
Related
Given a list containing N sublists of multiple lengths, find all unique combinations of size k, selecting only one element from each sublist.
The order of the elements in the combination is not relevant: (a, b) = (b, a)
sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
expected_output =
[
('B1', 'T1'),('B1', 'T2'),('B1', 'L1'),('B1', 'L2'),('B1', 'L3'),('B1', 'L4'),
('B2', 'T1'),('B2', 'T2'),('B2', 'L1'),('B2', 'L2'),('B2', 'L3'),('B2', 'L4'),
('B3', 'T1'),('B3', 'T2'),('B3', 'L1'),('B3', 'L2'),('B3', 'L3'),('B3', 'L4'),
('T1', 'L1'),('T1', 'L2'),('T1', 'L3'),('T1', 'L4'),
('T2', 'L1'),('T2', 'L2'),('T2', 'L3'),('T2', 'L4')
]
Extra points for a pythonic way of doing it
Speed/efficiency matters; the idea is to use this on a list containing hundreds of sublists ranging from 5 to 50 elements in length.
What I have been able to accomplish so far:
Using for and while loops to move pointers and build the answer; however, I am having a hard time figuring out how to include the k parameter to set the size of the tuple combinations dynamically. (I'm not really happy with it.)
def build_combinations(lst):
    result = []
    count_of_lst = len(lst)
    for i, sublist in enumerate(lst):
        if i == count_of_lst - 1:
            continue
        else:
            for item in sublist:
                j = 0
                while i < len(lst) - 1:
                    while j <= len(lst[i+1]) - 1:
                        comb = (item, lst[i+1][j])
                        result.append(comb)
                        j = j + 1
                    i = i + 1
                    j = 0
                i = 0
    return result
I've seen many similar questions on Stack Overflow, but none of them addresses the parameters the way I am trying to (one item from each list, and the size of the combinations being a parameter of the function).
I tried using itertools combinations, product and permutations, flipping them around without success. Whenever I use itertools I either have a hard time taking only one item from each list, or I can't set the size of the tuples I need.
I tried NumPy using arrays and a more math/matrix approach, but didn't get very far. There's definitely a way of solving this with NumPy, which is why I tagged numpy as well.
You need to combine two itertools helpers: combinations to select which sample_k of the sublists to use (keeping their relative order), then product to combine the elements of the selected sublists:
from itertools import combinations, product

sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]

expected_output = [pair
                   for lists in combinations(sample_list, sample_k)
                   for pair in product(*lists)]
print(expected_output)
If you want to get really fancy/clever/ugly, you can push all the work down to the C layer with:
from itertools import combinations, product, starmap, chain
sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
expected_output = list(chain.from_iterable(starmap(product, combinations(sample_list, sample_k))))
print(expected_output)
That will almost certainly run meaningfully faster for huge inputs (especially if you can loop over the results of chain.from_iterable directly rather than realizing them as a list), but it's probably not worth the ugliness unless you're really tight on cycles (I wouldn't expect much more than a 10% speed-up, but you'd need to benchmark to be sure).
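If it helps, this is how I would wrap the first version into a reusable function (my own wrapper, not part of either snippet above); k is the size of the combinations, and one element is taken from each chosen sublist:

from itertools import combinations, product

def one_from_each(lists, k):
    # Pick k distinct sublists (order preserved), then take one element from each.
    for chosen in combinations(lists, k):
        yield from product(*chosen)

sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
print(list(one_from_each(sample_list, 2)))   # same pairs as expected_output above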
import math

def my_func(stud, grps=2):
    students_list = [[] for x in range(grps)]
    students_amount = math.ceil(len(stud) / grps)
    for x, y in zip(range(0, grps), range(0, students_amount * grps, students_amount)):
        students_list[x].extend(stud[y:y + students_amount])
        if (students_amount - len(students_list[x])) > 0:
            students_list[x].extend(['__EMPTY__' for x in range(students_amount - len(students_list[x]))])
    return students_list

print(my_func(['Anne', 'Diana', 'Gabriele', 'Hannah', 'Inna', 'Luna', 'Maya', 'Nora'], grps=7))
Hi everyone. This is a function that takes a list of group members and the number of groups as arguments. The function then calculates an equal number of students per group (rounded up if it is a fractional number). It returns a new list containing all the groups with their members and, if needed, fills all empty slots with the word __EMPTY__.
For example, if the parameter grps = 3, then there will be three nested lists containing the 8 group members (3 slots per group or list) and one __EMPTY__ element in the last nested list.
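Concretely, with grps = 3 the example list above comes out as:

print(my_func(['Anne', 'Diana', 'Gabriele', 'Hannah', 'Inna', 'Luna', 'Maya', 'Nora'], grps=3))
# [['Anne', 'Diana', 'Gabriele'], ['Hannah', 'Inna', 'Luna'], ['Maya', 'Nora', '__EMPTY__']]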
My question is, can this be done differently without using two indexes? My function seems complicated and I wanted to find out, if there are easier solutions for doing this.
How I would do it: first create an evenly splittable list, then return it evenly chunked.
import math

def my_func(stud, grps=2):
    students_per_grp = math.ceil(len(stud) / grps)
    while len(stud) < students_per_grp * grps:
        stud.append('__EMPTY__')
    return [stud[i:i + students_per_grp] for i in range(0, len(stud), students_per_grp)]
Edit:
Following @Matiiss' suggestion makes it a bit shorter:
def my_func(stud, grps=2):
    students_per_grp = math.ceil(len(stud) / grps)
    stud += ['__EMPTY__'] * (students_per_grp * grps - len(stud))
    return [stud[i:i + students_per_grp] for i in range(0, len(stud), students_per_grp)]
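One caveat I would add: both versions above extend the caller's stud list in place. A minimal variant that avoids mutating the argument (my own tweak, not part of the answer) is:

def my_func(stud, grps=2):
    students_per_grp = math.ceil(len(stud) / grps)
    stud = stud + ['__EMPTY__'] * (students_per_grp * grps - len(stud))   # new list, caller's list untouched
    return [stud[i:i + students_per_grp] for i in range(0, len(stud), students_per_grp)]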
Today, while practicing problem solving on HackerRank (The Full Counting Sort problem), I found out that flattening a 2D array, unpacking it and printing it as below:
print(*sum(_2DArray, []))
was causing a "Time Limit Exceeded" error on submission, but using regular nested loops was fine:
print(' '.join(j for i in _2DArray for j in i))
Why is this flattening and unpacking slower than the O(n^2) nested loops?
Thanks in advance
Edit: full solution for the problem
def countSort(arr):
    result = [[] for x in range(len(arr))]
    for i in range(len(arr)):
        result[int(arr[i][0])].append('-' if i < len(arr)//2 else arr[i][1])
    print(' '.join(j for i in result for j in i))
    # print(*sum(result, []))
The way you wrote it, in the worst case your 2D list has a million strings in the first row and then 999,999 empty rows. The sum then copies the million string references on every addition, i.e., a million times. That's a trillion string reference copies. And all the while, the strings' reference counters get increased and decreased. A looooot of work. Your other method doesn't have that issue. In short: they're O(n²) vs O(n), where n, as defined in the problem, can be as large as a million.
And as discussed, the outer list never needs length n but at most 100 (more precisely, the maximum x + 1), and with that change, the sum method gets accepted. It still does up to 100 million reference copies, but such copying is low-level and relatively fast compared to Python code.
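A sketch of what that adjustment might look like (my own rewrite of the posted solution; it assumes the x values are bounded as in the problem statement):

def countSort(arr):
    # Size the outer list by the largest key actually present, not by len(arr).
    max_key = max(int(row[0]) for row in arr)
    result = [[] for _ in range(max_key + 1)]
    for i, (x, s) in enumerate(arr):
        result[int(x)].append('-' if i < len(arr) // 2 else s)
    print(*sum(result, []))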
sum(_2DArray, []) is essentially equivalent to
l = []
for item in _2DArray:
    l = l + item
The computational cost of concatenating two lists of sizes n1 and n2 is O(n1 + n2)
Let us assume that our 2D array is a square matrix of size n * n. We can denote the size of l at each iteration to be n1 and the size of item to be n2. The number of steps of each loop can be determined as follows:
t = 1: n1 = 0, n2 = n; the number of steps required to perform the concatenation is n
t = 2: n1 = n, n2 = n; the number of steps is n + n = 2n
t = 3: n1 = 2n, n2 = n; the number of steps is 2n + n = 3n
...
t = n: n1 = (n - 1)*n, n2 = n; the number of steps is n(n - 1) + n = n^2
The effective number of steps required to perform concatenation using sum is n + 2n + 3n + ... + n^2 = n * n * (n + 1) / 2 which yields a time complexity of O(n^3).
The concatenation approach is effectively slower than quadratic time. We also do not factor in the overhead for creation of n new lists which take at least O(n) and at most O(n^2) space. There's also the overhead of the print method having to treat each element of the array as an argument and iterate to print it.
Using ' '.join(j for i in _2DArray for j in i) essentially creates the resulting string in O(n^2) time by using an iterator over the list elements and also yields a single argument to pass to print.
If you denote by n the number of atoms in your array, printing them with a loop is O(n), whereas flattening them is O(n²). This is because adding two Python lists is linear in the size of both lists.
Besides that, when you flatten the list, you actually store the intermediate results in RAM, whereas the loop way is lazily evaluated, meaning you only have O(1) memory consumption. Writing to memory is a slow operation, which may also contribute to the poor performance.
EDIT: some details (that were lacking in my original answer).
When you print something, that something has to be stored somewhere (at least in a buffer). So when I talk about memory complexity, I do not take into account that aspect. To put it another way, the complexity I am talking about is the extra space required.
With atom, I meant the bottom elements. [["a", "b"], ["c", "d"]] has four atoms. Flattening a list of lists using sum has a complexity that is quadratic with respect to the number of atoms (in the case of a list of singletons, that high bound is reached). However, simply traversing the array has a complexity that is linear with respect to the number of atoms.
Don't let the join fool you into thinking you're doing something more than if you left it out. Your last line is essentially equivalent to
print(*(j for i in _2DArray for j in i))
Note that this last line really shows that the difference between the two ways to print the elements of the array is in the non-lazy flattening vs lazy traversal.
A note about merging Python lists.
The theoretical problem
When you add two lists in Python, you actually copy (at least) one of the two lists, which simply means copying all of its elements. This operation requires writing to memory. The main problem of flattening (done this way) is that you keep copying some elements again and again. There is a way to make it linear, which is the following:
result = []
for line in _2DArray:
    result.extend(line)
However, since in Python you have very little control over how memory is managed, you can't be sure (unless you dig into the specs, which you usually want to avoid) that this is how it is done. The other way it could be done would be:
result = []
for line in _2DArray:
    temp = result
    result = line[:]
    result.extend(temp)
This is clearly much worse. Well, when you simply sum a list of lists, you can't really tell which one is going to happen.
The practical problem
In any case, even if what was actually done was the linear-time solution, you still copy arrays a lot, which implies you write to memory more than if you simply did it with generators.
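A rough way to see the difference yourself (my own micro-benchmark sketch, not from the answers above; absolute numbers will vary by machine):

import timeit

rows = [[str(i)] for i in range(10000)]   # 10,000 single-element rows

print(timeit.timeit(lambda: sum(rows, []), number=10))                      # quadratic copying
print(timeit.timeit(lambda: [x for row in rows for x in row], number=10))   # linear traversal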
So I wrote a model that computes results over various parameters via a nested loop. Each computation returns a list of len(columns) = 10 elements, which is added to a list of lists (res).
Say I compute my results for some parameters len(alpha) = 2, len(gamma) = 2, rep = 3, where rep is the number of repetitions that I run. This yields results in the form of a list of lists like this:
res = [ [elem_1, ..., elem_10], ..., [elem_1, ..., elem_10] ]
I know that len(res) = len(alpha) * len(gamma) * repetitions = 12 and that each inner list has len(columns) = 10 elements. I also know that every 3rd list in res is going to be a repetition (which I know from the way I set up my nested loops to iterate over all parameter combinations, in fact I am using itertools).
I now want to average the result list of lists. What I need to do is take every (len(res) // repetitions) = 4th list, add them together element-wise, and divide by the number of repetitions (3). Easier said than done, for me.
Here is my ugly attempt to do so:
# create a list of lists of lists, where the inner lists of lists group together
# the runs that used identical parameters alpha and gamma
res = [res[i::(len(res)//rep)] for i in range(len(res)//rep)]

avg_res = []
for i in res:
    result = []
    for j in zip(*i):
        result.append(sum(j))
    avg_res.append([i/repetitions for i in result])
print(len(result_list), avg_res)
This actually yields what I want, but it surely is not the pythonic way to do it. It's ugly as hell, and five minutes later I can hardly make sense of my own code...
What would be the most pythonic way to do it? Thanks in advance!
In some cases pythonic code is a matter of style; one of its idioms is using a list comprehension instead of a loop, so writing result = [sum(j) for j in zip(*i)] is simpler than appending inside a loop over zip(*i).
On the other hand, a nested list comprehension looks more complex, so don't do:
avg_res = [[i/repetitions for i in [sum(j) for j in zip(*i)]] for i in res]
You can write:
res = [res[i::(len(res)//rep)] for i in range(len(res)//rep)]

avg_res = []
for i in res:
    result = [sum(j) for j in zip(*i)]
    avg_res.append([i/repetitions for i in result])
print(len(result_list), avg_res)
Another idiom in programming in general (and in Python in particular) is naming operations with functions, and using descriptive variable names, to make the code more readable:
def sum_columns(list_of_rows):
    return [sum(col) for col in zip(*list_of_rows)]

def align_alpha_and_gamma(res):
    return [res[i::(len(res)//rep)] for i in range(len(res)//rep)]

aligned_lists = align_alpha_and_gamma(res)

avg_res = []
for aligned_list in aligned_lists:
    sums_of_column = sum_columns(aligned_list)
    avg_res.append([sum_of_column/repetitions for sum_of_column in sums_of_column])
print(len(result_list), avg_res)
Of course you can choose better names according to what you want to do in the code.
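To see it in action, here is a tiny worked example with made-up numbers (my own data: 2 parameter combinations, 2 repetitions, 3 columns each):

rep = repetitions = 2
res = [[1, 2, 3], [10, 20, 30], [3, 4, 5], [30, 40, 50]]   # rows 0 and 2 share one parameter combination, rows 1 and 3 the other

aligned_lists = align_alpha_and_gamma(res)   # [[[1, 2, 3], [3, 4, 5]], [[10, 20, 30], [30, 40, 50]]]

avg_res = []
for aligned_list in aligned_lists:
    avg_res.append([s/repetitions for s in sum_columns(aligned_list)])
print(avg_res)   # [[2.0, 3.0, 4.0], [20.0, 30.0, 40.0]]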
It was a bit hard to follow your instructions, but as far as I can tell, you want to sum over all elements of every Nth list and divide by repetitions.
res = [list(range(i, i+10)) for i in range(10)]
N = 4
repetitions = 3
average_of_Nth_lists = sum([num for iter, item in enumerate(res) for num in item if iter % N == 0]) / repetitions
print(average_of_Nth_lists)
output:
85.0
Explanation of the result: it equals sum(0..9) + sum(4..13) + sum(8..17) = 45 + 85 + 125 = 255, and 255/3 = 85.0.
I created res as a list of lists and iterate over every Nth list (in my case the 1st, 5th and 9th; you can shift it to the 4th, 8th, etc. if that is what you want; find the right spot in the code or ask for help if you don't get it), sum them up and divide by repetitions.
I have a list of element, label pairs like this: [(e1, l1), (e2, l2), (e3, l1)]
I have to count how many labels two elements have in common, i.e. in the list above e1 and e3 have the label l1 in common and thus 1 label in common.
I have this Python implementation:
from collections import defaultdict

def common_count(e_l_list):
    count = defaultdict(int)
    l_list = defaultdict(set)
    for e1, l in e_l_list:
        for e2 in l_list[l]:
            if e1 == e2:
                continue
            elif e1 > e2:
                count[e1, e2] += 1
            else:
                count[e2, e1] += 1
        l_list[l].add(e1)
    return count
It takes a list like the one above and computes a dictionary of element pairs and counts. The result for the list above should be {(e3, e1): 1} (e1 and e3 share one label).
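For reference, a quick check of the example (using placeholder string names here just for illustration; the real data is integers, see below):

print(common_count([('e1', 'l1'), ('e2', 'l2'), ('e3', 'l1')]))
# defaultdict(<class 'int'>, {('e3', 'e1'): 1})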
Now I have to scale this to millions of elements and labels, and I thought Cython would be a good solution to save CPU time and memory. But I can't find docs on how to use maps in Cython.
How would i implement the above in pure Cython?
It can be assumed that all elements and labels are unsigned integers.
Thanks in advance :-)
I think you are over-complicating this by creating pairs of elements and storing all common labels as the value, when you can create a dict with the element as the key and a list of all labels associated with that element as the value. When you want to find common labels, convert the lists to sets and intersect them; the resulting set will hold the labels the two elements have in common. The average time of the intersection, checked with ~20000 lists, is roughly 0.006 seconds, which is very fast.
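A minimal sketch of that approach (my own names, using integer elements and labels as the question allows):

from collections import defaultdict

def labels_by_element(e_l_list):
    d = defaultdict(list)
    for element, label in e_l_list:
        d[element].append(label)
    return d

d = labels_by_element([(1, 10), (2, 20), (3, 10)])
common = set(d[1]).intersection(set(d[3]))
print(len(common))   # 1 -> elements 1 and 3 share one label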
I tested this with this code
from collections import *
import random
import time

l = []
for i in xrange(10000000):
    # With element range 0-10000000 the dictionary creation time increases to ~16 seconds
    l.append((random.randrange(0, 50000), random.randrange(0, 50000)))

start = time.clock()
d = defaultdict(list)
for i in l:                   # O(n)
    d[i[0]].append(i[1])      # O(n)
print time.clock() - start

times = []
for i in xrange(10000):
    start = time.clock()
    tmp = set(d[random.randrange(0, 50000)])    # picks a random list of labels
    tmp2 = set(d[random.randrange(0, 50000)])   # not guaranteed to be a different list but more than likely
    times.append(time.clock() - start)
    common_elements = tmp.intersection(tmp2)
print sum(times) / 100.0
18.6747529999 #creation of list
4.17812619876 #creation of dictionary
0.00633531142994 #intersection
Note: the times do change slightly depending on the number of labels. Also, creating the dict might take too long for your situation, but that is only a one-time operation.
I would also strongly recommend against creating all pairs of elements. If you have 5,000,000 elements and they all share at least one label, which is the worst case, then you are looking at ~1.25e+13 pairs or, more bluntly, 12.5 trillion. That would be ~1700 terabytes or ~1.7 petabytes.