I am trying to take a set of arrays and convert them into a matrix that will essentially be an indicator matrix for a set of items.
I currently have an array of N items:
A_ = [A,B,C,D,E,...,Y,Z]
In addition, I have S arrays (currently stored in an array), each containing a subset of the items in A_.
B_ = [A,B,C,Z]
C_ = [A,B]
D_ = [D,Y,Z]
The array they are stored in is structured like so:
X = [B_,C_,D_]
I would like to convert the data into an indicator matrix for easier manipulation. It would ideally look like this (an S x N matrix, one row per subset):
[1,1,1,0,...,0,1]
[1,1,0,0,...,0,0]
[0,0,0,1,...,1,1]
I know how I could use a for loop to iterate through this and create the matrix, but I was wondering if there is a more efficient, syntactically simpler way of going about this.
A concise way would be to use a list comprehension.
# Create a list containing the alphabet using a list comprehension
A_ = [chr(i) for i in range(65,91)]
# A list containing two sub-lists with some letters
M = [["A","B","C","Z"],["A","B","G"]]
# Nested list comprehension to convert character matrix
# into matrix of indicator vectors
I_M = [[1 if char in sublist else 0 for char in A_] for sublist in M]
The last line is a bit dense if you aren't familiar with comprehensions, but it's not too tricky once you take it apart. The inner part...
[1 if char in sublist else 0 for char in A_]
...is a list comprehension in itself, which creates a list containing 1s for all characters (char) in A_ that are also found in sublist, and 0s for characters not found in sublist.
The outer bit...
[ ... for sublist in M]
simply runs the inner list comprehension for each sublist found in M, resulting in a list of all the sublists created by the inner list comprehension stored in I_M.
Edit:
While I tried to keep this example simple, it is worth noting (as DSM and jterrace point out) that testing membership in vanilla lists is O(N). Converting each sublist to a hash-based structure like a set would speed up the membership check for large sublists.
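A minimal sketch of that optimization, reusing M and A_ from above:
# Convert each sublist to a set for O(1) average-case membership tests
M_sets = [set(sublist) for sublist in M]
I_M = [[1 if char in s else 0 for char in A_] for s in M_sets]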
Using numpy:
>>> import numpy as np
>>> A_ = np.array(['A','B','C','D','E','Y','Z'])
>>> B_ = np.array(['A','B','C','Z'])
>>> C_ = np.array(['A','B'])
>>> D_ = np.array(['D','Y','Z'])
>>> X = [B_,C_,D_]
>>> matrix = np.array([np.in1d(A_, x) for x in X])
>>> matrix.shape
(3, 7)
>>> matrix
array([[ True,  True,  True, False, False, False,  True],
       [ True,  True, False, False, False, False, False],
       [False, False, False,  True, False,  True,  True]], dtype=bool)
This is O(NS).
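If you need the 0/1 entries from the question rather than booleans, casting does it (and np.isin is the modern spelling of np.in1d, for what it's worth):
>>> matrix.astype(int)
array([[1, 1, 1, 0, 0, 0, 1],
       [1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1]])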
I have a list of lists of boolean values. I'm trying to return a list of indexes where all values in the same positions are only False. So in the example below, positions 3 and 4 of each inner list are False, so return [3,4].
Some assumptions to keep in mind: the list of lists could contain any number of lists, so we can't rely on there being just three; there could be 50. Also, the inner lists will always have equal lengths, so there's no worrying about different-sized lists, but they could be longer than the six in my example. So the solution must work for dynamic sizes/lengths.
list1 = [True, True, True, False, False, False]
list2 = [True, True, False, False, False, False]
list3 = [True, True, True, False, False, True]
list_of_lists = [list1, list2, list3]
result_list_of_all_false_indexes = []
# Pseudo code
# Look in position 0 of each list in the list of lists. If they're all false
# result_list_of_all_false_indexes.append(index)
# Look in position 1 of each list in the list of lists. If they're all false
# result_list_of_all_false_indexes.append(index)
# continue for entire length of inner lists
assert result_list_of_all_false_indexes == [3,4], "Try again..."
With some help from numpy, we can check your conditions by axis:
import numpy as np
results = np.where(~np.any(list_of_lists, axis=0))[0].tolist()
# Output:
[3, 4]
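Unpacking the one-liner, with intermediate values for the question's data:
import numpy as np
cols_with_any_true = np.any(list_of_lists, axis=0)   # True where some inner list is True
# array([ True,  True,  True, False, False,  True])
results = np.where(~cols_with_any_true)[0].tolist()  # indexes where no list is True
# [3, 4]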
lol - list of lists
output - returns desired list of indexes.
def output(lol):
    res = []
    if lol:  # Checks if the outer list contains any lists at all
        res = [i for i in range(len(lol[0]))]
        for list_ in lol:
            res = [i for i in res if not list_[i]]
            if not res:
                break
    return res
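For instance, with the lists from the question:
assert output(list_of_lists) == [3, 4]
assert output([]) == []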
I would use zip to unpack the list_of_lists and enumerate to get the indexes. Then the any function can be used with not to test for all False values.
import random
n_lists = random.randrange(1, 20)
n_elements = random.randrange(3, 10)
# I set the relative weights to favor getting all False entries
list_of_lists = [
    random.choices([True, False], k=n_elements, weights=[1, 10])
    for i in range(n_lists)
]
result_list_of_all_false_indexes = [i for i, vals in enumerate(zip(*list_of_lists)) if not any(vals)]
result_list_of_all_false_indexes = []
for i in range(len(list_of_lists[0])):
    if not any(lst[i] for lst in list_of_lists):
        result_list_of_all_false_indexes.append(i)
EDIT: added explanation
Iterate over each possible index, and check whether every list is False at that index. If so, add the index to your results list.
Thanks to @Cleb in the comments, I was able to make this work:
import numpy as np
list1 = [True, True, True, False, False, False]
list2 = [True, True, False, False, False, False]
list3 = [True, True, True, False, False, True]
list_of_lists = [list1, list2, list3]
result = np.where(np.any(list_of_lists, axis=0) == False)[0]
print(result)
assert result == [3,4], "Try again..."
Rather than using all (which tests whether ALL values along a certain axis are True), I used any and == False to accomplish what I was after.
It fails the assertion since the result is now in array format (not a list separated by commas), but that's fine with me. A little more concise than having to use iteration, but either will work. Thanks all!
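If you do want the assertion to pass, converting the array to a list (as in the np.where answer above) does it:
result = np.where(np.any(list_of_lists, axis=0) == False)[0].tolist()
assert result == [3, 4]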
I have an array which I want to use boolean indexing on, with multiple index arrays, each producing a different array. Example:
w = np.array([1,2,3])
b = np.array([[False, True, True], [True, False, False]])
Should return something along the lines of:
[[2,3], [1]]
I assume that since the number of cells containing True can vary between masks, I cannot expect the result to reside in a 2D numpy array, but I'm still hoping for something more elegant than iterating over the masks and appending the result of indexing w by the i-th mask to a list.
Am I missing a better option?
Edit: The next step I want to do afterwards is to sum each of the arrays returned by w[b], returning a list of scalars. If that somehow makes the problem easier, I'd love to know as well.
Assuming you want a list of numpy arrays you can simply use a comprehension:
w = np.array([1,2,3])
b = np.array([[False, True, True], [True, False, False]])
[w[mask] for mask in b]  # mask rather than bool, to avoid shadowing the builtin
# [array([2, 3]), array([1])]
If your goal is just a sum of the masked values, you can use:
np.sum(w*b) # 6
or
np.sum(w*b, axis=1) # array([5, 1])
# or equivalently b @ w
…since False times your number will be 0 and therefore won't affect the sum.
Try this:
[w[x] for x in b]
Hope this helps.
I have a 2D list of booleans. I want to select a random index from the list where the value is False. For example, given the following list:
[[True, False, False],
 [True, True, True],
 [False, True, True]]
The valid choices would be: [0, 1], [0, 2], and [2, 0].
I could keep a list of valid indices and then use random.choice to select from it, but it seems unpythonic to keep a variable and update it every time the underlying list changes for only this one purpose.
Bonus points if your answer runs quickly.
We can use a one-liner like:
import numpy as np
from random import choice
choice(np.argwhere(~a))
With a being the array of booleans.
This works as follows: with ~a, we negate the elements of the array. Next we use np.argwhere to construct a k×2 array: an array where every row holds the indices, one per dimension, of an element whose value is False.
With choice(..) we thus select a random row. We cannot, however, use that row directly to index the array; we can cast it to a tuple with the tuple(..) constructor:
>>> tuple(choice(np.argwhere(~a)))
(2, 0)
You can thus fetch the element then with:
t = tuple(choice(np.argwhere(~a)))
a[t]
But of course, it is not a surprise that:
>>> t = tuple(choice(np.argwhere(~a)))
>>> a[t]
False
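Putting it together with the list from the question (converted to a numpy array first, since ~ needs one), a minimal end-to-end sketch:
import numpy as np
from random import choice

a = np.array([[True, False, False],
              [True, True, True],
              [False, True, True]])
t = tuple(choice(np.argwhere(~a)))  # one of (0, 1), (0, 2), (2, 0)
assert a[t] == False                # the chosen cell is indeed False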
My non-numpy version:
result = random.choice([
    (i, j)
    for i in range(len(a))
    for j in range(len(a[i]))
    if not a[i][j]])
Like Willem's np version, this generates a list of valid tuples and invokes random.choice() to pick one.
Alternatively, if you hate seeing range(len(...)) as much as I do, here is an enumerate() version:
result = random.choice([
    (i, j)
    for i, row in enumerate(a)
    for j, cell in enumerate(row)
    if not cell])
Assuming you don't want to use numpy.
matrix = [[True, False, False],
          [True, True, True],
          [False, True, True]]
valid_choices = [(i,j) for i, x in enumerate(matrix) for j, y in enumerate(x) if not y]
random.choice(valid_choices)
With list comprehensions, you can change the if condition (if not y) to suit your needs. This returns the coordinate that is randomly selected. Optionally, you could change the value part of the comprehension ((i, j) in this case) to y, and it would return False, though that's a bit redundant in this case.
I've been looking for a way to efficiently check for duplicates in a numpy array and stumbled upon a question that contained an answer using this code.
What does this line mean in numpy?
s[s[1:] == s[:-1]]
I'd like to understand the code before applying it. I looked in the NumPy docs but had trouble finding this information.
The slices [1:] and [:-1] mean all but the first and all but the last elements of the array:
>>> import numpy as np
>>> s = np.array((1, 2, 2, 3)) # four element array
>>> s[1:]
array([2, 2, 3]) # last three elements
>>> s[:-1]
array([1, 2, 2]) # first three elements
therefore the comparison generates an array of boolean comparisons between each element s[x] and its "neighbour" s[x+1], which will be one shorter than the original array (as the last element has no neighbour):
>>> s[1:] == s[:-1]
array([False, True, False], dtype=bool)
and using that array to index the original array gets you the elements where the comparison is True, i.e. the elements that are the same as their neighbour:
>>> s[s[1:] == s[:-1]]
array([2])
Note that this only identifies adjacent duplicate values.
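Also note that on recent NumPy versions a boolean index must match the length of the axis it indexes, so s[s[1:] == s[:-1]] raises an IndexError there; slicing first keeps the mask and the indexed array the same length:
>>> s[1:][s[1:] == s[:-1]]
array([2])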
Check this out:
>>> import numpy
>>> s = numpy.array([1,3,5,6,7,7,8,9])
>>> s[1:] == s[:-1]
array([False, False, False, False, True, False, False], dtype=bool)
>>> s[s[1:] == s[:-1]]
array([7])
So s[1:] gives all numbers but the first, and s[:-1] all but the last.
Now compare these two vectors, i.e. check whether two adjacent elements are the same. Finally, select those elements.
s[1:] == s[:-1] compares s without the first element with s without the last element, i.e. 0th with 1st, 1st with 2nd etc, giving you an array of len(s) - 1 boolean elements. s[boolarray] will select only those elements from s which have True at the corresponding place in boolarray. Thus, the code extracts all elements that are equal to the next element.
It will show duplicates in a sorted array.
Basically, the inner expression s[1:] == s[:-1] compares the array with its shifted version. Imagine this:
s[1:]  =  [2, 3, ..., n-1, n  ]
s[:-1] =  [1, 2, ..., n-2, n-1]
   ==     [F, F, ..., F,   F  ]
In a sorted array, there will be no True in the resulting array unless you had repetitions. Then the expression s[array] filters out those elements which have True at the corresponding index in the boolean array.
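So if the data isn't sorted yet, sorting first makes equal values adjacent, e.g.:
>>> s = numpy.sort(numpy.array([3, 1, 2, 3, 1]))
>>> s[1:][s[1:] == s[:-1]]
array([1, 3])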
I would like to determine the sum of a two dimensional numpy array. However, elements with a certain value I want to exclude from this summation. What is the most efficient way to do this?
For example, here I initialize a two dimensional numpy array of 1s and replace several of them by 2:
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
How can I sum over the elements in my two dimensional array while excluding all of the 2s? Note that with the 10 by 10 array the correct answer should be 97 as I replaced three elements with the value 2.
I know I can do this with nested for loops. For example:
elements = []
for idx_x in range(data_set.shape[0]):
    for idx_y in range(data_set.shape[1]):
        if data_set[idx_x][idx_y] != 2:
            elements.append(data_set[idx_x][idx_y])
data_set_sum = numpy.sum(elements)
However on my actual data (which is very large) this is too slow. What is the correct way of doing this?
Use numpy's capability of indexing with boolean arrays. In the example below, data_set != 2 evaluates to a boolean array which is True wherever the element is not 2 (and which has the same shape as data_set). So data_set[data_set != 2] is a fast and convenient way to get an array which doesn't contain a certain value. Of course, the boolean expression can be more complex.
In [1]: import numpy as np
In [2]: data_set = np.ones((10, 10))
In [4]: data_set[4,4] = 2
In [5]: data_set[5,5] = 2
In [6]: data_set[6,6] = 2
In [7]: data_set[data_set != 2].sum()
Out[7]: 97.0
In [8]: data_set != 2
Out[8]:
array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       ...
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]], dtype=bool)
Without numpy, the solution is not much more complex:
x = [1,2,3,4,5,6,7]
sum(y for y in x if y != 7)
# 21
Works for a list of excluded values too:
# set is faster for resolving `in`
exl = set([1,2,3])
sum(y for y in x if y not in exl)
# 22
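For what it's worth, a numpy counterpart for a list of excluded values, assuming np.isin (the modern name for the np.in1d used earlier in this thread):
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7])
excl = np.array([1, 2, 3])
x[~np.isin(x, excl)].sum()  # 22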
Using np.sum's where= argument, we avoid the need for array copying that would otherwise be triggered by using advanced array indexing:
>>> import numpy as np
>>> data_set = np.ones((10,10))
>>> data_set[(4,5,6),(4,5,6)] = 2
>>> np.sum(data_set, where=data_set != 2)
97.0
>>> data_set.sum(where=data_set != 2)
97.0
https://numpy.org/doc/stable/reference/generated/numpy.sum.html
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
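The where= argument also composes with axis sums, if you need per-row or per-column totals (same data_set as above):
>>> data_set.sum(axis=0, where=data_set != 2)
array([10., 10., 10., 10.,  9.,  9.,  9., 10., 10., 10.])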
How about this way, which makes use of numpy's boolean indexing capabilities?
We simply set all the values that meet the specification to zero before taking the sum; that way we don't change the shape of the array, as we would if we filtered them out of the array.
The other benefit of this is that we can still sum along an axis after the filter is applied.
import numpy
data_set = numpy.ones((10, 10))
data_set[4][4] = 2
data_set[5][5] = 2
data_set[6][6] = 2
print "Sum", data_set.sum()
another_set = numpy.array(data_set) # Take a copy, we'll need that later
data_set[data_set == 2] = 0 # Set all the values that are 2 to zero
print "Filtered sum", data_set.sum()
print "Along axis", data_set.sum(0), data_set.sum(1)
Equally, we could use any other boolean expression to zero out the data we wish to exclude from the sum.
another_set[(another_set > 1) & (another_set < 3)] = 0
print "Another filtered sum", another_set.sum()