I have a simple dataset on HDFS that I'm loading into Spark. It looks like this:
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
...
basically, a matrix. I'm trying to implement something that requires grouping matrix rows, and so I'm trying to add a unique key for every row like so:
(1, [1 1 1 1 1 ... ])
(2, [1 1 1 1 1 ... ])
(3, [1 1 1 1 1 ... ])
...
I tried something somewhat naive: set a global variable and write a function to increment the global variable:
# initialize global index
global global_index
global_index = 0

# function to generate keys
def generateKeys(x):
    global_index += 1
    return (global_index, x)

# read in data and operate on it
data = sc.textFile("/data.txt")
...some preprocessing...
data.map(generateKeys)
And it seemed to not recognize the existence of the global variable.
Is there an easy way that comes to mind to do this?
Thanks,
Jack
>>> lsts = [
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 1],
... [1, 1, 1, 1, 1, 2],
... [1, 1, 1, 2, 1, 2]
... ]
>>> list(enumerate(lsts))
[(0, [1, 1, 1, 1, 1, 1]),
(1, [1, 1, 1, 1, 1, 1]),
(2, [1, 1, 1, 1, 1, 1]),
(3, [1, 1, 1, 1, 1, 1]),
(4, [1, 1, 1, 1, 1, 1]),
(5, [1, 1, 1, 1, 1, 1]),
(6, [1, 1, 1, 1, 1, 2]),
(7, [1, 1, 1, 2, 1, 2])]
enumerate generates a unique index for each item in the iterable and yields tuples of the form (index, original_item).
If you want to start numbering with something other than 0, pass the starting value to enumerate as the second parameter.
>>> list(enumerate(lsts, 1))
[(1, [1, 1, 1, 1, 1, 1]),
(2, [1, 1, 1, 1, 1, 1]),
(3, [1, 1, 1, 1, 1, 1]),
(4, [1, 1, 1, 1, 1, 1]),
(5, [1, 1, 1, 1, 1, 1]),
(6, [1, 1, 1, 1, 1, 1]),
(7, [1, 1, 1, 1, 1, 2]),
(8, [1, 1, 1, 2, 1, 2])]
Note that list is used here to pull the actual values out of enumerate, which returns an iterator rather than a list.
Alternative: globally available id assigner
enumerate is easy to use, but if you need to assign ids in different pieces of your code, it becomes difficult or impossible. For such a case, a globally available generator (as drafted in the OP) is the way to go.
itertools provides count, which can serve our need:
>>> from itertools import count
>>> idgen = count()
Now we have a (globally available) idgen generator ready to yield unique ids.
We can test it with a function prid (print id):
>>> def prid():
...     id = next(idgen)
...     print(id)
...
>>> prid()
0
>>> prid()
1
>>> prid()
2
>>> prid()
3
Since it works, we can try it on a list of values:
>>> lst = ['100', '101', '102', '103', '104', '105', '106', '107', '108', '109']
and define the actual function which, when called with a value, returns the tuple (id, value):
>>> def assignId(val):
...     return (next(idgen), val)
...
Note that there is no need to declare idgen as global, as we are not going to rebind it (idgen only changes its internal state when called, but remains the same generator object).
Test whether it works:
>>> assignId("ahahah")
(4, 'ahahah')
and try it on the list:
>>> list(map(assignId, lst))
[(5, '100'),
(6, '101'),
(7, '102'),
(8, '103'),
(9, '104'),
(10, '105'),
(11, '106'),
(12, '107'),
(13, '108'),
(14, '109')]
The main difference from the enumerate solution is that we can assign ids one by one anywhere in the code, without doing it all from within one all-processing enumerate call.
>>> assignId("lonely line")
(15, 'lonely line')
Try dataRdd.zipWithIndex and, if having the index first is a must, swap the resulting tuples.
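A minimal PySpark sketch of that, assuming sc is an existing SparkContext and that the preprocessing leaves one list of values per line (zipWithIndex numbers from 0; add 1 to the index if keys must start at 1):
# hedged sketch, `sc` assumed to be an existing SparkContext
data = sc.textFile("/data.txt")
rows = data.map(lambda line: [int(v) for v in line.split()])
# zipWithIndex yields (row, index); swap to get (index, row)
keyed = rows.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
keyed.take(2)
# [(0, [1, 1, 1, 1, 1, 1, 1]), (1, [1, 1, 1, 1, 1, 1, 1])]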
Related
I'd like to find equal values in an array, and their indices, if they occur consecutively more than 2 times.
[0, 3, 0, 1, 0, 1, 2, 1, 2, 2, 2, 2, 1, 3, 4]
so in this example I would find that the value "2" occurred "4" times, starting from position "8". Is there any built-in function to do that?
I found a way with collections.Counter
collections.Counter(a)
# Counter({2: 5, 1: 4, 0: 3, 3: 2, 4: 1})
but this is not what I am looking for.
Of course I can write a loop, compare values, and count them, but maybe there is a more elegant solution?
Find consecutive runs and the lengths of runs, with a condition
import numpy as np
arr = np.array([0, 3, 0, 1, 0, 1, 2, 1, 2, 2, 2, 2, 1, 3, 4])
res = np.ones_like(arr)
np.bitwise_xor(arr[:-1], arr[1:], out=res[1:]) # set equal, consecutive elements to 0
# use this for np.floats instead
# arr = np.array([0, 3, 0, 1, 0, 1, 2, 1, 2.4, 2.4, 2.4, 2, 1, 3, 4, 4, 4, 5])
# res = np.hstack([True, ~np.isclose(arr[:-1], arr[1:])])
idxs = np.flatnonzero(res) # get indices of non zero elements
values = arr[idxs]
counts = np.diff(idxs, append=len(arr))  # differences between consecutive indices are the run lengths
cond = counts > 2
values[cond], counts[cond], idxs[cond]
Output
(array([2]), array([4]), array([8]))
# (array([2.4, 4. ]), array([3, 3]), array([ 8, 14]))
_, i, c = np.unique(np.r_[[0], ~np.isclose(arr[:-1], arr[1:])].cumsum(),
                    return_index=True,
                    return_counts=True)
for index, count in zip(i, c):
    if count > 2:
        print([arr[index], count, index])
# [2, 4, 8]
A little more compact way of doing it that works for all input types.
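As a pure-Python cross-check (not vectorized), the same runs can be read off with itertools.groupby; a minimal sketch:
from itertools import groupby

arr = [0, 3, 0, 1, 0, 1, 2, 1, 2, 2, 2, 2, 1, 3, 4]
pos = 0
for value, group in groupby(arr):
    length = len(list(group))  # length of this run of equal values
    if length > 2:
        print(value, length, pos)  # 2 4 8
    pos += length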
I have an n x k binary numpy array, and I am trying to find an efficient way to count, for each column j, the number of pairs of ones that belong to column j but not to any higher column; in this case higher means a larger column index.
For example in the array:
array([[1, 1, 1, 0, 1, 0],
[1, 0, 1, 1, 1, 0],
[1, 0, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 0, 1, 0, 1, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 1, 0]], dtype=int32)
the output should be array([ 0, 0, 11, 2, 14, 1], dtype=int32). This makes sense because column[2] has all ones, so any pair of ones necessarily has a highest column in common of at least 2; even though column[0] also has all ones, it is lower, so no pair of ones has it as their highest in common. In all cases I am considering, column[0] will always have all ones.
Here is some example code that works and that I believe is something like O(n^2 k):
import numpy as np

def hcc(i, j, k, bin_mat):
    # hcc means highest common column
    # i: index i
    # j: index j
    # k: number of columns - 1
    # bin_mat: binary matrix
    for q in range(k, 0, -1):
        if bin_mat[i, q] and bin_mat[j, q]:
            return q
    return 0

def get_num_pairs_columns(bin_mat):
    k = bin_mat.shape[1] - 1
    num_pairs_hcc = np.zeros(k + 1, dtype=np.int32)  # number of one-pairs per column
    for i in range(bin_mat.shape[0]):
        for j in range(bin_mat.shape[0]):
            if i < j:
                num_pairs_hcc[hcc(i, j, k, bin_mat)] += 1
    return num_pairs_hcc
Another way I've thought of approaching the problem is through sets. So every column gets its own set, and the index of every row with a one gets added to that set. So for the example above, this would look like:
sets = [{0, 1, 2, 3, 4, 5, 6, 7},
        {0, 3, 6, 7},
        {0, 1, 2, 3, 4, 5, 6, 7},
        {1, 3, 6},
        {0, 1, 3, 4, 5, 7},
        {4, 5}]
The idea is to find the number of pairs in sets[j] that are in no higher set (a pair can be in a lower set, just not in a higher one). Since, as I mentioned before, all cases will have column zero with all ones, every set is a subset of sets[0]. So a much worse performing code using this approach looks like this:
import itertools
import numpy as np

def generate_sets(bin_mat):
    sets = []
    for j in range(bin_mat.shape[1]):
        column = set()
        for i in range(bin_mat.shape[0]):
            if bin_mat[i, j] == 1:
                column.add(i)
        sets.append(column)
    return sets

def get_hcc_sets(bin_mat):
    sets = generate_sets(bin_mat)
    pairs_sets = []
    num_pairs_hcc = np.zeros(len(sets), dtype=np.int32)
    for subset in sets:
        pairs_sets.append({p for p in itertools.combinations(sorted(subset), r=2)})
    for j in range(len(sets) - 1):
        intersections = [pairs_sets[j].intersection(pairs_sets[q]) for q in range(j + 1, len(sets))]
        num_pairs_hcc[j] = len(pairs_sets[j] - set.union(*intersections))
    num_pairs_hcc[len(sets) - 1] = len(pairs_sets[len(sets) - 1])
    return num_pairs_hcc
I haven't checked that this sets implementation always produces the same results as the previous one, but in the handful of cases I tried, it does. However, I am 100% certain that my first implementation gives exactly the result I need.
another reference example:
input:
array([[1, 1, 0],
[1, 1, 0],
[1, 1, 0],
[1, 1, 0],
[1, 0, 1],
[1, 0, 1],
[1, 0, 1],
[1, 0, 1]], dtype=int32)
output:
array([16, 6, 6], dtype=int32)
Is there a way to beat my O(n^2 k) implementation? It seems rather brute-force, and it feels like there should be something I can exploit to make this calculation faster. I always expect n to be greater than k, by orders of magnitude in many cases, so I'd rather k have the higher exponent than n.
If you are going for the O(n² k) approach in Python, you can do it with much shorter code using itertools and set; the code might be more efficient too.
import itertools
t = [[1, 1, 1, 0, 1, 0],
[1, 0, 1, 1, 1, 0],
[1, 0, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 0, 1, 0, 1, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 1, 0]]
n, k = len(t), len(t[0])

# build the set of pairs of 1s in column j
def candidates(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2) if 1 == t[i1][j] == t[i2][j]}

# build the set of pairs of 1s in higher columns
def badpairs(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2) if any(1 == t[i1][j0] == t[i2][j0] for j0 in range(j + 1, k))}

# set difference
def finalpairs(j):
    return candidates(j) - badpairs(j)

# print pairs
for j in range(k):
    print(j, finalpairs(j))
# 0 set()
# 1 set()
# 2 {(2, 4), (1, 2), (2, 7), (4, 6), (0, 6), (2, 3), (6, 7), (0, 2), (2, 6), (5, 6), (2, 5)}
# 3 {(1, 6), (3, 6)}
# 4 {(0, 1), (0, 7), (0, 4), (3, 4), (1, 5), (3, 7), (0, 3), (1, 4), (5, 7), (1, 7), (0, 5), (1, 3), (4, 7), (3, 5)}
# 5 {(4, 5)}
# print number of pairs
for j in range(k):
    print(j, len(finalpairs(j)))
# 0 0
# 1 0
# 2 11
# 3 2
# 4 14
# 5 1
Alternate definition for badpairs:
def badpairs(j):
    return set().union(*(candidates(j0) for j0 in range(j + 1, k)))
Slightly different approach: avoid building badpairs
def finalpairs(j):
    return {(i1, i2) for (i1, i2) in itertools.combinations(range(n), 2) if 1 == t[i1][j] == t[i2][j] and not any(1 == t[i1][j0] == t[i2][j0] for j0 in range(j + 1, k))}
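If numpy is on the table, a hedged sketch that vectorizes the same O(n² k) pair loop (it builds an O(n² k) boolean intermediate, trading memory for speed, and relies on the stated guarantee that column 0 is all ones, so every pair has at least one common column):
import numpy as np

a = np.array(t, dtype=bool)
n, k = a.shape
# common[i1, i2, j] is True where rows i1 and i2 both have a 1 in column j
common = a[:, None, :] & a[None, :, :]
# highest common column per pair = position of the last True along the column axis
highest = k - 1 - np.argmax(common[:, :, ::-1], axis=2)
i1, i2 = np.triu_indices(n, 1)  # each unordered pair exactly once
print(np.bincount(highest[i1, i2], minlength=k))
# [ 0  0 11  2 14  1]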
Given a pattern [1,1,0,1,1] and a binary list of length 100, [0,1,1,0,0,...,0,1], I want to count the number of occurrences of this pattern in the list. Is there a simple way to do this without needing to track each item at every index with a variable?
Note that something like [...,1, 1, 0, 1, 1, 1, 1, 0, 1, 1,...,0] can occur, and this should be counted as 2 occurrences.
Convert your list to a string using join, then do:
text.count(pattern)
If you need to count overlapping matches then you will have to use regex matching or define your own function.
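For the regex route, wrapping the pattern in a lookahead makes the matches overlap, since a lookahead consumes no characters; a small sketch:
import re

text = ''.join(str(x) for x in [1, 1, 0, 1, 1, 1, 1, 0, 1, 1])
pattern = '11011'
print(len(re.findall('(?=' + re.escape(pattern) + ')', text)))  # 2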
Edit
Here is the full code:
def overlapping_occurrences(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count
given_list = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
pattern = [1,1,0,1,1]
text = ''.join(str(x) for x in given_list)
print(text)
pattern = ''.join(str(x) for x in pattern)
print(pattern)
print(text.count(pattern)) #for no overlapping
print(overlapping_occurrences(text, pattern))
l1 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0]
l1str = str(l1).replace(" ", "").replace("[", "").replace("]", "")
l3 = [1, 1, 0, 1, 1]
l3str = str(l3).replace(" ", "").replace("[", "").replace("]", "")
l1str = l1str.replace(l3str, "foo")
foo = l1str.count("foo")
print(foo)
You can always use the naive way: a for loop over slices of the list (the slice that starts at the i-th index and ends at i + [length of pattern]).
You can improve on it: notice that if you found an occurrence at index i, you can skip i+1 and i+2 and check from i+3 onwards (that is, check whether the pattern has a sub-pattern that lets you skip ahead after a match).
This costs O(n*m).
You can also use convolution with the reversed pattern (FFT-based pattern matching).
This costs O(n*log(n)), which is better.
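For concreteness, a minimal sketch of the naive slice loop described above:
pattern = [1, 1, 0, 1, 1]
data = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
# compare the pattern against every window of the same length
count = sum(data[i:i + len(pattern)] == pattern
            for i in range(len(data) - len(pattern) + 1))
print(count)  # 2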
I think a simple regex would suffice:
import re

def find(sample_list):
    list_1 = [1, 1, 0, 1, 1]
    str_1 = str(list_1)[1:-1]
    print(len(re.findall(str_1, str(sample_list))))
Hope this solves your problem.
from collections import Counter

a = [1, 1, 0, 1, 1]
b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
lst = list()
for i in range(len(b) - len(a) + 1):
    lst.append(tuple(b[i:i + len(a)]))
c = Counter(lst)
print(c[tuple(a)])
output
2
The loop can be written in one line for cleaner, though perhaps less readable, code:
lst = [tuple(b[i:i+len(a)]) for i in range(len(b)-len(a)+1)]
Note: I'm using tuples because they are immutable objects and can be hashed.
You can also use hashing and create your own hash method, e.g. multiply each element by 10 raised to its position:
[1,0,1] = 1 * 1 + 0 * 10 + 1 * 100 = 101
That way you can make one pass over the list and check whether it contains the pattern by simply checking whether the encoded window == 101.
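A hedged sketch of that encoding idea as a simple rolling hash; the weighting (oldest element at 10**0) is my assumption, chosen so the window can be updated in O(1):
pattern = [1, 1, 0, 1, 1]
data = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
m = len(pattern)

def encode(window):
    # oldest element gets weight 10**0, newest gets 10**(m - 1)
    return sum(v * 10 ** i for i, v in enumerate(window))

target = encode(pattern)
h = encode(data[:m])
count = int(h == target)
for i in range(len(data) - m):
    # drop data[i], shift the window down, push data[i + m] on top
    h = (h - data[i]) // 10 + data[i + m] * 10 ** (m - 1)
    count += h == target
print(count)  # 2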
You can solve it using the following two steps:
Combine all elements of the list into a single string
Use the Python count function to match the pattern in the string
a_new = ''.join(map(str,a))
pattern = ''.join(map(str,pattern))
a_new.count(pattern)
You can divide the lookup list into chunks the size of the pattern you are looking for. You can achieve this using a simple recipe involving itertools.islice to yield a sliding window iterator:
>>> from itertools import islice
>>> p = [1,1,0,1,1]
>>> l = [0,1,1,0,0,0,1,1,0,1,1,1,0,0,1]
>>> [tuple(islice(l,k,len(p)+k)) for k in range(len(l)-len(p)+1)]
This will give you output like:
[(0, 1, 1, 0, 0), (1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (0, 0, 0, 1, 1), (0, 0, 1, 1, 0), (0, 1, 1, 0, 1), (1, 1, 0, 1, 1), (1, 0, 1, 1, 1), (0, 1, 1, 1, 0), (1, 1, 1, 0, 0), (1, 1, 0, 0, 1)]
Now you can use collections.Counter to count the occurrences of each sublist in the sequence, like
>>> from collections import Counter
>>> c = Counter([tuple(islice(l,k,len(p)+k)) for k in range(len(l)-len(p)+1)])
>>> c
Counter({(0, 1, 1, 0, 1): 1, (1, 1, 1, 0, 0): 1, (0, 0, 1, 1, 0): 1, (0, 1, 1, 1, 0): 1, (1, 1, 0, 0, 0): 1, (0, 0, 0, 1, 1): 1, (1, 1, 0, 1, 1): 1, (0, 1, 1, 0, 0): 1, (1, 0, 1, 1, 1): 1, (1, 1, 0, 0, 1): 1, (1, 0, 0, 0, 1): 1})
To fetch the frequency of your desired sequence, use
>>> c.get(tuple(p),0)
1
Note that I have used tuples throughout as dict keys, since list is not a hashable type in Python and so cannot be used as a dict key.
You can try a range-based approach:
pattern_data = [1, 1, 0, 1, 1]
data=[1,1,0,1,1,0,0,0,0,1,1,1,1,0,0,1,1,0,1,1,1,1,0,1,1,0,0,0,0,0,1,1,0,1,0,1,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,0,1,1,0,1,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1]
count = 0
for i in range(len(data)):
    if data[i:i + len(pattern_data)] == pattern_data:
        print(i, data[i:i + len(pattern_data)])
        count += 1
print(count)
output:
0 [1, 1, 0, 1, 1]
15 [1, 1, 0, 1, 1]
20 [1, 1, 0, 1, 1]
35 [1, 1, 0, 1, 1]
40 [1, 1, 0, 1, 1]
52 [1, 1, 0, 1, 1]
55 [1, 1, 0, 1, 1]
60 [1, 1, 0, 1, 1]
75 [1, 1, 0, 1, 1]
80 [1, 1, 0, 1, 1]
95 [1, 1, 0, 1, 1]
11
I have a dataset that needs to output boolean-style data, just 1 and 0, for true or not true. I am trying to parse simple data sets I've processed to look for a subset of information in a numpy array. The array is about 100,000 elements in one direction and 20 in the other. I only need to search along the 20 axis, but I need to do that for each of the 100,000 entries and get output that I can map.
I've produced an array of this size made up of zeros, with the intention of simply setting the matching index indicator to 1. A main hitch is that if I find a long set (I'm working from long sets down to small sets), I must NOT include any smaller set that is contained within it.
Sample:
[0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,0,1]
I need to find here that there is 1 group of 5, starting at index 2, and 1 group of 3, starting at index 9, and not return any subset of the group of 5 as though it were a group of 4 or a group of 3, thus leaving zeros at all the already-covered positions; i.e. for groups of 3, the indices 2, 3, 4, 5, and 6 would all remain zero. It doesn't need to be overly efficient; I don't care if it searches anyway, I just need to not keep the result.
Currently I'm using a codeblock basically like this for a simple search:
import numpy

values = numpy.array([0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1])
searchval = [1, 2]
N = len(searchval)
possibles = numpy.where(values == searchval[0])[0]
print(possibles)

solns = []
for p in possibles:
    check = values[p:p + N]
    if numpy.all(check == searchval):
        solns.append(p)
print(solns)
I've been wracking my brain trying to come up with a way to restructure this or similar code to produce the desired results. The end goal is to search for groups of 9 down to groups of 3, and to have, effectively, a matrix of 1s and 0s indicating whether an index has a group starting on it that is as long as we want.
Hopefully someone can point me to what I'm missing to make this work. Thanks!
Using more_itertools, a third-party library (pip install more_itertools):
import more_itertools as mit
sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
groups = [list(c) for c in mit.consecutive_groups(mit.locate(sample))]
d = {group[0]: len(group) for group in groups}
d
# {2: 5, 9: 3, 15: 1, 17: 1}
This result reads "At index 2 is a group of 5 ones. At index 9 is a group of 3 ones," etc.
Details
more_itertools.locate finds indices for truthy items by default.
more_itertools.consecutive_groups chunks consecutive numbers together.
The result is a dictionary of (starting-index, length) pairs.
As a dictionary, you can extract different kinds of information:
>>> # List of starting indices
>>> list(d)
[2, 9, 15, 17]
>>> # List indices for all lonely groups
>>> [k for k, v in d.items() if v == 1]
[15, 17]
>>> # List indices of groups with more than one item
>>> [k for k, v in d.items() if v > 1]
[2, 9]
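Since the question's data is roughly 100,000 entries of 20 values each, the same recipe can be applied one row at a time; a hedged sketch, assuming data is an iterable of such rows:
# hypothetical `data`: one row of 20 zeros/ones per entry
runs_per_row = [
    {g[0]: len(g) for g in (list(c) for c in mit.consecutive_groups(mit.locate(row)))}
    for row in data
]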
Here is a numpy solution. I'm using a small example for demonstration but it easily scales (20 x 100,000 takes 25 ms on my rather modest laptop, see timings at the end of this post):
>>> import numpy as np
>>>
>>>
>>> a = np.random.randint(0, 2, (5, 10), dtype=np.int8)
>>> a
array([[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 1, 1, 1, 1, 0, 0]], dtype=int8)
>>>
>>> padded = np.pad(a,((1,1),(0,0)), 'constant')
# compare array to itself with offset to mark all switches from
# 0 to 1 or from 1 to 0
# then use 'where' to extract the coordinates
>>> colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
>>>
# the lengths of sets are the differences between switch points
>>> lengths = rowinds[1::2] - rowinds[::2]
# now we have the lengths we are free to throw the off-switches away
>>> colinds, rowinds = colinds[::2], rowinds[::2]
>>>
# admire
>>> from pprint import pprint
>>> pprint(list(zip(colinds, rowinds, lengths)))
[(0, 2, 1),
(1, 0, 2),
(2, 1, 2),
(2, 4, 1),
(3, 2, 1),
(4, 0, 5),
(5, 0, 1),
(5, 2, 1),
(5, 4, 1),
(6, 1, 1),
(6, 3, 2),
(7, 4, 1)]
Timings:
>>> def find_stretches(a):
... padded = np.pad(a,((1,1),(0,0)), 'constant')
... colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
... lengths = rowinds[1::2] - rowinds[::2]
... colinds, rowinds = colinds[::2], rowinds[::2]
... return colinds, rowinds, lengths
...
>>> a = np.random.randint(0, 2, (20, 100000), dtype=np.int8)
>>> from timeit import repeat
>>> kwds = dict(globals=globals(), number=100)
>>> repeat('find_stretches(a)', **kwds)
[2.475784719004878, 2.4715258619980887, 2.4705517270049313]
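To get back to the question's 1s-and-0s output, a hedged follow-up that marks the start of every run of at least 3 (runs here lie along the first axis, as in the example above):
colinds, rowinds, lengths = find_stretches(a)
keep = lengths >= 3  # groups of 3 up to 9, per the question

out = np.zeros(a.shape, dtype=np.int8)
out[rowinds[keep], colinds[keep]] = 1  # 1 at the start of each long-enough run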
Something like this?
from collections import defaultdict

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]

# Keys are numbers of consecutive 1's, values are starting indices
results = defaultdict(list)
found = 0
for i, x in enumerate(sample):
    if x == 1:
        found += 1
    elif i == 0 or found == 0:
        continue
    else:
        results[found].append(i - found)
        found = 0
if found:
    results[found].append(i - found + 1)

assert results == {1: [15, 17], 3: [9], 5: [2]}
In relation to the question:
How to initialize a two-dimensional array in Python?
While working with a two-dimensional array, I found that initializing it in a certain way would produce unexpected results. I would like to understand the difference between initializing an 8x8 grid in these two ways:
>>> a = [[1]*8]*8
>>> a
[[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], \
[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], \
[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], \
[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]
vs.
>>> A = [[1 for i in range(8)] for j in range(8)]
>>> A
[[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], \
[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], \
[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], \
[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]
The unexpected result was that any element accessed with indices [0-6][x] pointed to the same element as [7][x]. The arrays look identical in the interpreter, hence my confusion. What is wrong with the first approach?
If relevant, these arrays hold references to GTK EventBoxes that represent the squares of a chess board. After changing the initialization approach to the list-comprehension method, the squares respond properly to the intended hover and click events.
When you create the 2D array using a = [[1]*8]*8, the outer * operator creates 8 references to the same object: it builds a list of size 8 whose 8 elements are all the very same inner list (the same reference). Since all rows are the same reference, updating a value through any one of them changes what you see through every row of the array.
Using the list comprehension A = [[1 for i in range(8)] for j in range(8)] ensures that each row in your 2D array is a uniquely referenced list. This avoids the erroneous behavior you were seeing where all of the rows appeared to update simultaneously.
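A quick way to see the shared reference directly:
a = [[1] * 8] * 8
print(a[0] is a[7])  # True: every row is the same list object
a[3][0] = 9
print(a[0])  # [9, 1, 1, 1, 1, 1, 1, 1] -- and so is every other row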
In your first version, you create a list containing the number 1, multiply it by 8 to get a list containing eight 1s, and then use that one list 8 times to create a.
So when you change anything in the first version, you see that change everywhere else. The problem is that you are reusing the same instance, something that doesn't happen in the second version.
Maybe refactoring the first way makes it easier to understand:
one_item = [1]
row = one_item*8
matrix = row*8
As you can see, the array of arrays holds eight references to row, which means
(a[0] is a[1]) and (a[1] is a[2]) ...
Try this, for example:
a = [[1] * 8] * 8
a[0][0] = 2

b = [[1 for i in range(8)] for j in range(8)]
b[0][0] = 3

print(a)
print(b)
print((a[0] is a[1]) and (a[1] is a[2]) and (a[2] is a[3]))  # True
print((b[0] is b[1]) and (b[1] is b[2]) and (b[2] is b[3]))  # False