I have a question regarding joblib. I am working with networkX graphs, and wanted to parallelize the modification of edges, since iterating over the edge list is indeed an embarrassingly parallel problem. In doing so, I thought of running a simplified version of the code.
I have a variable x. It is a list of lists, akin to an edge list, though I understand that networkX returns a list of tuples for the edge list, and has primarily a dictionary-based implementation. Please bear with this simple example for the moment.
x = [[0, 1, {'a': 1}],
     [1, 3, {'a': 3}]]
I have two functions that modify the dictionary's 'a' value to be either the sum or the difference of the first two values. They are defined as follows:
def additive_a(edge):
    edge[2]['a'] = edge[0] + edge[1]

def subtractive_a(edge):
    edge[2]['a'] = edge[0] - edge[1]
If I do a regular for loop, the variable x can be modified properly:
for edge in x:
    subtractive_a(edge)  # or additive_a(edge) works as well.
Result:
[[0, 1, {'a': -1}], [1, 3, {'a': -2}]]
However, when I try doing it with joblib, I cannot get the desired result:
Parallel(n_jobs=8)(delayed(subtractive_a)(edge) for edge in x)
# I understand that n_jobs=8 for a two-item list is overkill.
The desired result is:
[[0, 1, {'a': -1}], [1, 3, {'a': -2}]]
When I check x, it is unchanged:
[[0, 1, {'a': 1}], [1, 3, {'a': 3}]]
I am unsure as to what is going on here. I can understand the example provided in the joblib documentation - which specifically showed computing an array of numbers using a single, simple function. However, that did not involve modifying an existing object in memory, which is what I think I'm trying to do. Is there a solution to this? How would I modify this code to parallelize the modification of a single object in memory?
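A likely explanation, assuming joblib's default process-based backend: each edge is pickled and sent to a worker process, so the workers mutate copies while the original x in the parent process stays unchanged. A minimal sketch of the usual workaround, returning new values and writing them back in the parent, is shown below; subtractive_a_value is a hypothetical helper, not part of the original code.

from joblib import Parallel, delayed

def subtractive_a_value(edge):
    # Return the new value instead of mutating the (copied) edge inside the worker.
    return edge[0] - edge[1]

# Collect the results in the parent process...
results = Parallel(n_jobs=2)(delayed(subtractive_a_value)(edge) for edge in x)

# ...and write them back into the original object there.
for edge, value in zip(x, results):
    edge[2]['a'] = value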
Related
I'm taking a data structures course in Python, and a suggestion for a solution includes this code which I don't understand.
This is a sample of a dictionary:
vc_metro = {
    'Richmond-Brighouse': set(['Lansdowne']),
    'Lansdowne': set(['Richmond-Brighouse', 'Aberdeen'])
}
It is suggested that to remove some of the elements in the value, we use this code:
vc_metro['Lansdowne'] -= set(['Richmond-Brighouse'])
I have never seen such a structure, and using it in a basic situation such as:
my_list = [1, 2, 3, 4, 5, 6]
other_list = [1, 2]
my_list -= other_list
doesn't work. Where can I learn more about this recommended strategy?
You can't subtract lists, but you can subtract set objects meaningfully. Sets are hashtables, somewhat similar to dict.keys(), which allow only one instance of an object.
The -= operator is the in-place counterpart of the difference method (it calls difference_update). It removes from the left operand all the elements that are also present in the right operand.
Your simple example with sets would look like this:
>>> my_set = {1, 2, 3, 4, 5, 6}
>>> other_set = {1, 2}
>>> my_set -= other_set
>>> my_set
{3, 4, 5, 6}
Curly braces with commas but no colons are interpreted as a set object. So the direct constructor call
set(['Richmond-Brighouse'])
is equivalent to
{'Richmond-Brighouse'}
Notice that you can't do set('Richmond-Brighouse'): that would add all the individual characters of the string to the set, since strings are iterable.
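For example:

>>> set('abc')   # iterates over the string: one element per character (display order may vary)
{'a', 'b', 'c'}
>>> {'abc'}      # a set containing the string itself
{'abc'}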
The reason to use -= (difference_update) instead of remove is that differencing only removes elements that are actually present and silently ignores the rest, whereas remove raises a KeyError for a missing element. The discard method does this for a single element. Differencing allows removing multiple elements at once.
The original line vc_metro['Lansdowne'] -= set(['Richmond-Brighouse']) could be rewritten as
vc_metro['Lansdowne'].discard('Richmond-Brighouse')
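A small sketch of the three options side by side, reusing the station names from the question:

stations = {'Richmond-Brighouse', 'Aberdeen'}

stations -= {'Richmond-Brighouse', 'Lansdowne'}  # the absent 'Lansdowne' is silently ignored
print(stations)                                  # {'Aberdeen'}

stations.discard('Lansdowne')                    # also fine: no error for a missing element
stations.remove('Lansdowne')                     # raises KeyError: 'Lansdowne'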
I am trying to use numpy in Python to solve a problem in my project.
I have a random binary array rndm = [1, 0, 1, 1] and a resource_arr = [[2, 3], 4, 2, [1, 2]]. What I am trying to do is multiply the arrays element-wise and then sum the nested entries. The expected output for the sample above is
output = 5 0 2 3. I find it hard to solve this problem because of the nested array/list.
So far my code looks like this:
def fitness_score():
    output = numpy.add(rndm * resource_arr)
    return output

fitness_score()
I keep getting
ValueError: invalid number of arguments.
I think this is because of the addition that I am trying to do. Any help would be appreciated. Thank you!
Numpy treats its arrays as matrices, which must be rectangular, and the jagged resource_arr is not a valid matrix. In your case a plain Python list is more suitable:
def sum_nested(l):
    tmp = []
    for element in l:
        if isinstance(element, list):
            tmp.append(numpy.sum(element))
        else:
            tmp.append(element)
    return tmp
In this function we check for each element inside l if it is a list. If so, we sum its elements. On the other hand, if the encountered element is just a number, we leave it untouched. Please note that this only works for one level of nesting.
Now, if we run sum_nested([[2, 3], 4, 2, [1, 2]]) we will get [5, 4, 2, 3]. All that's left is multiplying this result by the elements of rndm, which can be achieved easily using numpy:
def fitness_score(a, b):
    return numpy.multiply(a, sum_nested(b))
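For instance, assuming the two functions above are defined, running them on the sample inputs from the question should reproduce the expected result:

import numpy

rndm = [1, 0, 1, 1]
resource_arr = [[2, 3], 4, 2, [1, 2]]

# sum_nested(resource_arr) gives [5, 4, 2, 3]; multiplying by rndm zeroes out the second entry.
print(fitness_score(rndm, resource_arr))  # [5 0 2 3]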
Numpy is all about non-jagged arrays. You can do things with jagged arrays, but doing so efficiently and elegantly isn't trivial.
Almost always, finding a way to map your data structure to a non-nested one, for instance by encoding the information as below, will be more flexible and more performant.
resource_arr = (
    [0, 0, 1, 2, 3, 3],
    [2, 3, 4, 2, 1, 2],
)
That is, an integer denoting the 'row' each value belongs to, paired with an array of equal size of the values themselves.
This may 'feel' wasteful when coming from a C-style way of doing arrays (omg, more memory consumption), but staying away from nested data structures is almost certainly your best bet, both for performance and for how much of the numpy/scipy ecosystem will actually be compatible with your data representation. Whether it really uses more memory is rather questionable; every new Python object costs quite a few bytes, so if you have only a few elements per nesting level, the flat representation is the more memory-efficient solution too.
In this case, that would give you the following efficient solution to your problem:
output = np.bincount(*resource_arr) * rndm
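To make that one-liner concrete, here is a self-contained sketch of the flat encoding and the call; the values in the comments assume the sample inputs from the question:

import numpy as np

rndm = np.array([1, 0, 1, 1])
resource_arr = (
    [0, 0, 1, 2, 3, 3],   # the 'row' each value belongs to
    [2, 3, 4, 2, 1, 2],   # the values themselves
)

# np.bincount(indices, weights) sums the weights per index:
# row 0 -> 2 + 3 = 5, row 1 -> 4, row 2 -> 2, row 3 -> 1 + 2 = 3
print(np.bincount(*resource_arr))          # [5. 4. 2. 3.]
print(np.bincount(*resource_arr) * rndm)   # [5. 0. 2. 3.]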
I have not worked much with pandas/numpy, so I'm not sure if this is the most efficient way, but it works (at least for the example you have shown):
import numpy as np

rndm = [1, 0, 1, 1]
resource_arr = [[2, 3], 4, 2, [1, 2]]

# Note: this relies on NumPy building an object array from the ragged input;
# newer NumPy versions may require an explicit dtype=object for this to work.
multiplied_output = np.multiply(rndm, resource_arr)
print(multiplied_output)

output = []
for elem in multiplied_output:
    if isinstance(elem, list):
        output.append(sum(elem))
    else:
        output.append(elem)

final_output = np.array(output)
print(final_output)
The following code implements a backtracking algorithm to find all the possible permutations of a given array of numbers, and the record variable stores each permutation when the code reaches the base case. The code seems to run correctly, that is, the record variable gets filled with valid permutations, but for some reason when the function finishes it returns a two-dimensional array whose elements are empty.
I tried declaring record as a tuple or a dictionary and tried using global and nonlocal variables, but none of it worked.
def permute(arr):
    record = []
    def createPermutations(currentArr, optionArr):
        if len(optionArr) == 0:
            if len(currentArr) != 0: record.append(currentArr)
            else: pass
            print(record)
        else:
            for num in range(len(optionArr)):
                currentArr.append(optionArr[num])
                option = optionArr[0:num] + optionArr[num+1::]
                createPermutations(currentArr, option)
                currentArr.pop()
    createPermutations([], arr)
    return record

print(permute([1,2,3]))
The expected result is [[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 1, 2], [3, 2, 1]], but instead I got [[], [], [], [], [], []].
With recursive functions, you should pass a copy of the current array, rather than having all of those currentArr.pop() mutating the same array.
Replace
createPermutations(currentArr, option)
by
createPermutations(currentArr[:], option)
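Applied in context, a sketch of the function with only that change (everything else as in the question, minus the debugging print) should return the expected six permutations:

def permute(arr):
    record = []
    def createPermutations(currentArr, optionArr):
        if len(optionArr) == 0:
            if len(currentArr) != 0:
                record.append(currentArr)
        else:
            for num in range(len(optionArr)):
                currentArr.append(optionArr[num])
                option = optionArr[0:num] + optionArr[num+1::]
                createPermutations(currentArr[:], option)  # pass a copy, not the shared list
                currentArr.pop()
    createPermutations([], arr)
    return record

print(permute([1, 2, 3]))
# [[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 1, 2], [3, 2, 1]]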
Finally, as a learning exercise for recursion, something like this is fine, but if you need permutations for a practical programming problem, use itertools:
import itertools

print([list(p) for p in itertools.permutations([1,2,3])])
I would accept John Coleman's answer as it is the correct way to solve your issue and resolves other bugs that you run into as a result.
The reason you run into this issue is that Python is pass-by-object-reference: the actual list is passed around, not a copy of it. This also leads to another symptom in your code: the last print(record) inside the recursion shows [[3, 2, 1], [3, 2, 1], [3, 2, 1], [3, 2, 1], [3, 2, 1], [3, 2, 1]] as its output.
Why this happens is that every call to record.append(currentArr) appends a reference to the very same object as every other call. Thus you end up with 6 entries that are all the same array (currentArr), because all your appends point to the same list. A 2D list is just a list of references to other lists.
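A tiny illustration of that aliasing, independent of the permutation code:

a = []
record = [a, a, a]              # three references to the same list object
a.append(1)
print(record)                   # [[1], [1], [1]]  (every entry reflects the change)
print(record[0] is record[1])   # True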
Now that you understand this, it is easier to understand why you get [[],[],[],[],[],[]] as your final output. Because you add to currentArr with currentArr.append(optionArr[num]) and then pop from it with currentArr.pop() to return it to its previous state, your final version of currentArr is exactly what you passed in, i.e. [].
Since record is a 2D list holding 6 references to that same currentArr, you get [[],[],[],[],[],[]] as your returned value.
This may help you better understand how it all works, since it has diagrams as well: https://robertheaton.com/2014/02/09/pythons-pass-by-object-reference-as-explained-by-philip-k-dick/
While helping my co-worker troubleshoot a problem, I saw something I was unaware Python could do. Compared to other ways of doing this, I am curious how the performance and time complexity stack up, and which approach is best for the sake of performance.
What my co-worker did that prompted this question:
list_of_keys = []
test_dict = {'foo': 1, 'bar': [1, 2, 3, 4, 5]}
list_of_keys.extend(test_dict)
print(list_of_keys)
['foo', 'bar']
vs other examples I have seen:
list_of_keys = []
test_dict = {'foo': 1, 'bar': [1, 2, 3, 4, 5]}
for i in test_dict.keys():
    list_of_keys.append(i)
and
keys = list(test_dict)
Which one of these is the most beneficial and the most Pythonic for the sake of simply appending the keys? Which one yields the best performance?
As the docs explain, s.extend(t):
extends s with the contents of t (for the most part the same as s[len(s):len(s)] = t)
OK, so that isn't very clear as to whether it should be faster or slower than calling append in a loop. But it is a little faster—the looping is happening in C rather than in Python, and it can use some special optimized code for adding onto the list because it knows you're not touching the list at the same time.
More importantly, it's a lot simpler, more readable, and harder to get wrong.
As for starting with an empty list and then extending it (or appending to it), there's no good reason to do that. If you already have a list with some values in it, and want to add the dict keys, then use extend. But if you just want to create a list of the keys, just do list(d).
As for d.keys() vs. d, there's really no difference at all. Whether you iterate over a dict or its dict_keys view, you get the exact same values iterated, even using the exact same dict_keyiterator. The extra call to keys() does make things a tiny bit slower, but that's a fixed cost, not once per element, so unless your dicts are tiny, you won't see any noticeable difference.
So, do whichever one seems more readable in the circumstances. Generally speaking, the only reason you want to loop over d.keys() is when you want to make it clear that you're iterating over a dict's keys, but it isn't obvious from the surrounding code that d is a dict.
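For illustration, with the dict from the question:

test_dict = {'foo': 1, 'bar': [1, 2, 3, 4, 5]}

print(list(test_dict))                            # ['foo', 'bar']
print(list(test_dict.keys()))                     # ['foo', 'bar']
print(list(test_dict) == list(test_dict.keys()))  # True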
Among other things, you also asked about complexity.
All of these solutions have the same (linear) complexity, because they all do the same thing under the covers: for every key in the dictionary, append it to the end of a list. That's one step per key, and the cost of each step is amortized constant (because Python lists expand exponentially), so the total time is O(N), where N is the length of the dict.
After #thebjorn mentioned the timeit module, it seems that calling extend is fastest.
It seems that list() is the most Pythonic for the sake of readability and cleanliness.
Which is the most beneficial seems dependent on the use case, but more or less doing this is redundant, as mentioned in a comment. This was discovered from a mistake and I got curious.
timeit.timeit("for i in {'foo': 1, 'bar': [1, 2, 3, 4, 5]}.keys():[].append(i)", number=1000000)
0.6147394659928977
timeit.timeit("[].extend({'foo': 1, 'bar': [1, 2, 3, 4, 5]})", number=1000000)
0.36140396299015265
timeit.timeit("list({'foo': 1, 'bar': [1, 2, 3, 4, 5]})", number=1000000)
0.4726199270080542
In my program I have many places where I need to both iterate over something and modify it in that same for loop.
However, I know that modifying the thing over which you're iterating is bad because it may, and probably will, lead to an undesired result.
So I've been doing something like this:
for el_idx, el in enumerate(theList):
    if theList[el_idx].IsSomething() is True:
        theList[el_idx].SetIt(False)
Is this the best way to do this?
This is a conceptual misunderstanding.
It is dangerous to modify the list itself from within the iteration, because of the way Python translates the loop to lower-level code. This can cause unexpected side effects during the iteration; there's a good example here:
https://unspecified.wordpress.com/2009/02/12/thou-shalt-not-modify-a-list-during-iteration/
But modifying mutable objects stored in the list is acceptable, and common practice.
I suspect that you're thinking that because the list is made up of those objects, that modifying those objects modifies the list. This is understandable - it's just not how it's normally thought of. If it helps, consider that the list only really contains references to those objects. When you modify the objects within the loop - you are merely using the list to modify the objects, not modifying the list itself.
What you should not do is add or remove items from the list during the iteration.
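A small sketch of the distinction, using plain dicts as stand-ins for the objects in the question:

theList = [{'flag': True}, {'flag': False}, {'flag': True}]

# Fine: only the objects the list refers to are mutated; the list itself is untouched.
for el in theList:
    if el['flag']:
        el['flag'] = False

print(theList)  # [{'flag': False}, {'flag': False}, {'flag': False}]

# Risky: adding or removing items shifts positions under the running iterator.
# for el in theList:
#     theList.remove(el)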
Your problem is a bit unclear to me, but if we are talking about the harm of modifying a list during a for loop iteration in Python, I can think of two scenarios.
First, you modify some elements in the list that are supposed to be used, with their original values, in a later round of the computation.
e.g. you want to write a program with inputs and outputs like these:
Input:
[1, 2, 3, 4]
Expected output:
[1, 3, 6, 10] #[1, 1 + 2, 1 + 2 + 3, 1 + 2 + 3 + 4]
But... you write the code this way:
#!/usr/bin/env python
mylist = [1, 2, 3, 4]
for idx, n in enumerate(mylist):
    mylist[idx] = sum(mylist[:idx + 1])
print(mylist)
Result is:
[1, 3, 7, 15] # undesired result
Second, you change the size of the list during a for loop iteration.
e.g. From python-delete-all-entries-of-a-value-in-list:
>>> s=[1,4,1,4,1,4,1,1,0,1]
>>> for i in s:
... if i ==1: s.remove(i)
...
>>> s
[4, 4, 4, 0, 1]
The example shows the undesired result arising from the side effect of changing the list's size. It clearly shows that a Python for loop cannot properly handle a list whose size changes during iteration. Below, I show you a simple way to overcome this problem:
#!/usr/bin/env python
s = [1, 4, 1, 4, 1, 4, 1, 1, 0, 1]
list_size = len(s)
i = 0
while i != list_size:
    if s[i] == 1:
        del s[i]
        list_size = len(s)
    else:
        i = i + 1
print(s)
Result:
[4, 4, 4, 0]
Conclusion: it is not harmful to modify elements of a list during a loop iteration, as long as you don't 1) change the size of the list or 2) introduce unintended side effects into your own computation.
You could get the indices first:
idx = [ el_idx for el_idx, el in enumerate(theList) if el.IsSomething() ]
[ theList[i].SetIt(False) for i in idx ]