I have two 1D arrays, x & y, one smaller than the other. I'm trying to find the index of every element of y in x.
I've found two naive ways to do this, the first is slow, and the second memory-intensive.
The slow way
indices = []
for iy in y:
    indices.append(np.where(x == iy)[0][0])
The memory hog
xe = np.outer([1,]*len(x), y)
ye = np.outer(x, [1,]*len(y))
junk, indices = np.where(np.equal(xe, ye))
Is there a faster or less memory-intensive approach? Ideally the search would take advantage of the fact that we are searching for not one thing, but many things at once, and thus should be slightly more amenable to parallelization.
Bonus points if you don't assume that every element of y is actually in x.
I want to suggest a one-line solution:
indices = np.where(np.in1d(x, y))[0]
The result is an array of indices into x marking the elements of x that were found in y; note that they come back in x's order, not y's.
One can also use np.in1d without numpy.where if only a boolean mask is needed.
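For example, a quick sketch reusing the sample arrays from the answer below:
import numpy as np

x = np.array([3, 5, 7, 1, 9, 8, 6, 6])
y = np.array([2, 1, 5, 10, 100, 6])

indices = np.where(np.in1d(x, y))[0]
print(indices)  # [1 3 6 7] -> positions in x of the values 5, 1, 6, 6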
As Joe Kington said, searchsorted() can search elements very quickly. To deal with elements that are not in x, you can check the search result against the original y and create a masked array:
import numpy as np
x = np.array([3,5,7,1,9,8,6,6])
y = np.array([2,1,5,10,100,6])
index = np.argsort(x)
sorted_x = x[index]
sorted_index = np.searchsorted(sorted_x, y)
yindex = np.take(index, sorted_index, mode="clip")
mask = x[yindex] != y
result = np.ma.array(yindex, mask=mask)
print(result)
The result is:
[-- 3 1 -- -- 6]
How about this?
It does assume that every element of y is in x (and will return results even for elements that aren't!), but it is much faster.
import numpy as np
# Generate some example data...
x = np.arange(1000)
np.random.shuffle(x)
y = np.arange(100)
# Actually perform the operation...
xsorted = np.argsort(x)
ypos = np.searchsorted(x[xsorted], y)
indices = xsorted[ypos]
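As a quick sanity check (this assumes, like the answer itself, that every element of y really is in x):
# Each recovered index should point back at the matching element of x.
assert np.all(x[indices] == y)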
I think this is a clearer version:
np.where(y.reshape(y.size, 1) == x)[1]
than indices = np.where(y[:, None] == x[None, :])[1]; you don't need to expand x into 2D yourself, since broadcasting handles it.
I found this type of solution to be the best because, unlike the searchsorted() or in1d() based solutions that I have seen posted here or elsewhere, the above works with duplicates and doesn't care whether anything is sorted. This was important to me because I wanted x to be in a particular custom order.
I would just do this:
indices = np.where(y[:, None] == x[None, :])[1]
Unlike your memory-hog way, this makes use of broadcasting to generate the 2D boolean comparison directly, without first building 2D copies of both x and y.
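For example, a small sketch with made-up arrays (the comparison still materializes a len(y) x len(x) boolean array, so memory grows with both sizes; note how duplicates in x each contribute an index):
import numpy as np

x = np.array([3, 5, 7, 1, 9, 8, 6, 6])
y = np.array([1, 5, 6])

indices = np.where(y[:, None] == x[None, :])[1]
print(indices)  # [3 1 6 7]: y[0]=1 at x[3], y[1]=5 at x[1], y[2]=6 at x[6] and x[7]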
The numpy_indexed package (disclaimer: I am its author) contains a function that does exactly this:
import numpy_indexed as npi
indices = npi.indices(x, y, missing='mask')
By default it will raise a KeyError if not all elements in y are present in x; the missing kwarg controls this, as described in the edit below.
It should have the same efficiency as the currently accepted answer, since the implementation is along similar lines. numpy_indexed is however more flexible, and also allows searching for the indices of rows of multidimensional arrays, for instance.
EDIT: I've changed the handling of missing values; the missing kwarg can now be set to 'raise', 'ignore' or 'mask'. In the latter case you get a masked array of the same length as y, on which you can call .compressed() to get the valid indices. Note that there is also npi.contains(x, y) if that is all you need to know.
Another solution would be:
a = np.array(['Bob', 'Alice', 'John', 'Jack', 'Brian', 'Dylan'])
z = ['Bob', 'Brian', 'John']
for i in z:
    print(np.argwhere(i == a))
My solution can additionally handle a multidimensional x. By default, it will return a standard numpy array of corresponding y indices in the shape of x.
If you can't assume that every element of x is in y, then set masked=True to return a masked array (this has a performance penalty). Otherwise, you will still get indices for elements not contained in y, but they probably won't be useful to you.
The answers by HYRY and Joe Kington were helpful in making this.
# For each element of ndarray x, return the index of the corresponding element in 1d array y.
# If y contains duplicates, the index of one of them is returned
# (which one depends on the sort order of equal elements).
# Optionally, mask indices where the x element does not exist in y.
def matched_indices(x, y, masked=False):
    # Flattened x
    x_flat = x.ravel()
    # Indices to sort y
    y_argsort = y.argsort()
    # Indices in sorted y of corresponding x elements, flat
    x_in_y_sort_flat = y.searchsorted(x_flat, sorter=y_argsort)
    # Indices in y of corresponding x elements, flat
    # (mode="clip" guards against x elements larger than everything in y,
    # which searchsorted would map to the out-of-range position len(y))
    x_in_y_flat = np.take(y_argsort, x_in_y_sort_flat, mode="clip")
    if not masked:
        # Reshape to shape of x
        return x_in_y_flat.reshape(x.shape)
    else:
        # Check for inequality at each y index to mask invalid indices
        mask = x_flat != y[x_in_y_flat]
        # Reshape to shape of x
        return np.ma.array(x_in_y_flat.reshape(x.shape), mask=mask.reshape(x.shape))
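A short usage sketch with made-up data (assuming import numpy as np and the function above):
x = np.array([[5, 3], [9, 5]])
y = np.array([3, 5, 9])

print(matched_indices(x, y))
# [[1 0]
#  [2 1]]

print(matched_indices(np.array([5, 7]), y, masked=True))
# [1 --]  (7 is not in y, so its index is masked)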
A more direct solution that doesn't expect the array to be sorted:
import numpy as np
import pandas as pd
A = pd.Series(['amsterdam', 'delhi', 'chromepet', 'tokyo', 'others'])
B = pd.Series(['chromepet', 'tokyo', 'tokyo', 'delhi', 'others'])
# Find index position of B's items in A
B.map(lambda x: np.where(A==x)[0][0]).tolist()
Result is:
[2, 3, 3, 1, 4]
A more compact solution:
indices, = np.in1d(a, b).nonzero()
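Note the trailing comma after indices: nonzero() returns a one-element tuple of index arrays, and the comma unpacks it. A small sketch with made-up arrays:
import numpy as np

a = np.array([3, 5, 7, 1, 9, 8, 6, 6])
b = np.array([1, 5, 6])

indices, = np.in1d(a, b).nonzero()
print(indices)  # [1 3 6 7]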
I want to map a list of tuples like the one below using the function process_slide_index(x).
tiles_index:
[(1, 1024, 0, 16, 0, 0), (1, 1024, 0, 16, 0, 1), (1, 1024, 0, 16, 0, 2), (1, 1024, 0, 16, 0, 3), (1, 1024, 0, 16, 0, 4), (1, 1024, 0, 16, 0, 5), (1, 1024, 0, 16, 0, 6),...]
tiles:
tiles = map(lambda x: process_slide_index(x), tiles_index)
the map function:
def process_slide_index(tile_index):
    print("PROCESS SLIDE INDEX")
    slide_num, tile_size, overlap, zoom_level, col, row = tile_index
    slide = open_slide(slide_num)
    generator = create_tile_generator(slide, tile_size, overlap)
    tile = np.asarray(generator.get_tile(zoom_level, (col, row)))
    return (slide_num, tile)
I'm applying the map function but I don't seem to get inside my process_slide_index(tile_index) function.
I also want to filter some results using a function that returns True or False. But once again, execution never seems to reach my function.
filtered_tiles = filter(lambda x: keep_tile(x, tile_size, tissue_threshold), tiles)
What am I doing wrong?
Regards
EDIT: The only way I got the checkpoint message PROCESS SLIDE INDEX to show up was by adding list(map(print, tiles)) after the tiles line. I was using this to try to debug, and my prints started showing up. I'm pretty confused right now.
You are using Python 3. In Python 2, map and filter return lists, but in Python 3 they return lazy objects that you have to consume to get the values:
>>> l = list(range(10))
>>> def foo(x):
... print(x)
... return x+1
...
>>> map(foo, l)
<map object at 0x7f69728da828>
To consume this object you can use list, for example. Notice how the print is called this time:
>>> list(map(foo, l))
0
1
2
3
4
5
6
7
8
9
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
These objects are lazy, which means they yield their values one by one. Check the difference when using them as iterators in a for loop:
>>> for e in map(foo, l):
... print(e)
...
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
Using list does the same, but stores each yielded value in the resulting list.
You should remove the lambda from your map call. map will call the function provided as its first argument, and in your case you have wrapped the function you actually want to call in a needless lambda.
tiles = map(process_slide_index, tiles_index)
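To see the difference in the question's own terms, here is a toy stand-in (the function body is hypothetical, just enough to show when the call actually happens):
def process_slide_index(tile_index):
    # Hypothetical stand-in for the question's function.
    print("PROCESS SLIDE INDEX")
    return tile_index

tiles_index = [(1, 1024, 0, 16, 0, 0), (1, 1024, 0, 16, 0, 1)]

tiles = map(process_slide_index, tiles_index)  # prints nothing yet
tiles = list(tiles)                            # now the prints appear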
TL;DR -
List comprehensions can do a lot of what you want here. [x for x in mylist if x > y] is a powerful expression that more than replaces filter(). It's also a nice alternative to map(), and avoids the overhead of a lambda expression. It also spits out a list instead of a lazy iterator, which is probably preferable in your case. (If you're dealing with huge streams of data, you might want to stick with map and filter: lazy iterators don't keep the whole thing in RAM, and you can work out one value at a time.) If you like this suggestion and want to skip the talk, I give you the code in 2b.
Don't write a lambda expression for a function that already exists! Lambda expressions are stand-in functions for places where you haven't defined one. Wrapping an existing function in one adds a needless extra call per element and is easy to get wrong. Avoid them where possible. You can replace the lambda in your map() call with the function itself: tiles = map(process_slide_index, tiles_index)
The long version:
There are two problems, both are pretty easy to fix. First one is more of a style/efficiency thing, but it'll save you some obscure headaches, too:
1. Instead of creating a lambda expression, it's best to use the function you already went to the work of defining!
tiles = map(process_slide_index, tiles_index) does the job just fine, and behaves better.
2. You should probably switch to list comprehensions. Why? Because map() and filter() are uglier and they're slower if you have to use a lambda or want to convert the output to a list afterwards. Still, if you insist on using map() and filter()...
2a. When you need to pass multiple arguments into a function for map, try functools.partial if you know many of the values ahead of time. I think there's an error in your logic when you write
filtered_tiles = filter(lambda x: keep_tile(x, tile_size, tissue_threshold), tiles)
What you're telling it to do is call keep_tile(x, tile_size, tissue_threshold) for each x in tiles, holding tile_size and tissue_threshold constant.
If this is the intended behavior, try import functools and use functools.partial(keep_tile, tile_size, tissue_threshold).
Note: functools.partial binds the values you supply as the leftmost positional arguments, so the argument that still varies must come last; you'd have to rewrite the function header as def keep_tile(tile_size, tissue_threshold, tiles): instead of def keep_tile(tiles, tile_size, tissue_threshold):. (See that we again manage to avoid a lambda expression!)
If that isn't the intended behavior, and you want each of those values to change with every call, pass tuples in instead: build an iterable of (tile, tile_size, tissue_threshold) tuples and filter over that. If you only want the tile variable out of each tuple, a list comprehension such as
[x[0] for x in tiles if keep_tile(*x)] does it (again, with no lambdas). However, since we're already doing a list comprehension here, you might want to try the solution in 2b.
2b. It's generally faster and cleaner on later Python releases just to use a list comprehension such as [x[0] for x in tiles if keep_tile(*x)]. (Or, if you meant to hold the other two values constant, you could use [x for x in tiles if keep_tile(x, tile_size, tissue_threshold)].) Any time you're just going to read that map() or filter()'s output into a list afterwards, you should probably have used a list comprehension instead. At this point map() and filter() are really only useful for streaming results through a pipeline, or for async routines.
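To make 2b concrete, here is a minimal sketch; keep_tile here is a hypothetical stand-in predicate, not the question's real function:
def keep_tile(tile, tile_size, tissue_threshold):
    # Hypothetical predicate: keep tiles whose score clears the threshold.
    return tile > tissue_threshold

tiles = [0.2, 0.9, 0.5, 0.8]
tile_size, tissue_threshold = 1024, 0.6

# filter() with a lambda wrapper...
kept = list(filter(lambda t: keep_tile(t, tile_size, tissue_threshold), tiles))

# ...versus the equivalent list comprehension.
kept = [t for t in tiles if keep_tile(t, tile_size, tissue_threshold)]
print(kept)  # [0.9, 0.8]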
I have been learning about ANN but the book I'm reading has examples in Python. The problem is that I have never written in Python and these lines of code are too hard for me to understand:
sizes = [3,2,4]
self.weights = [np.random.randn(y, x)
for x, y in zip(sizes[:-1], sizes[1:])]
I read some things about it and found out that randn(y, x) returns a y-by-x array populated with random samples from the standard normal distribution, zip() pairs two lists up element by element, sizes[:-1] returns the list without its last element, and sizes[1:] returns the list without its first element.
But with all of this I still can't explain to myself what this would generate.
sizes[:-1] will return the sublist [3,2] (that is, all the elements except the last one).
sizes[1:] will return the sublist [2,4] (that is, all the elements except the first one).
zip([a,b], [c,d]) gives [(a,c), (b,d)].
So zipping the two lists above gives you [(3,2), (2,4)]
The construction of weights is a list comprehension. Therefore this code is equivalent to
weights = []
for x, y in [(3,2), (2,4)]:
    weights.append(np.random.randn(y, x))
So the final result would be the same as
[ np.random.randn(2,3),
np.random.randn(4,2) ]
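You can verify the shapes directly; a quick sketch:
import numpy as np

sizes = [3, 2, 4]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
print([w.shape for w in weights])  # [(2, 3), (4, 2)]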
Let's break this up into chunks:
self.weights = [some junk]
is going to be a list comprehension. Meaning: evaluate the some junk part and you'll end up with a list of the resulting elements. Usually these look like so:
self.weights = [some_func(x) for x in a_list]
This is the equivalent of:
self.weights = []
for x in a_list:
    self.weights.append(some_func(x))
zip(a, b)
Will piecewise combine the elements of a and b into tuple pairs:
(a1, b1), (a2, b2), (a3, b3), ...
for x, y in zip(a, b):
This iterates through the tuple pairs described above.
sizes[:-1]
This says to take all the elements of the list sizes except the last item.
sizes[1:]
This says to take all the elements of the list sizes except the first item.
So, finally piecing this all together you get:
self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
This is a list comprehension that iterates through the zipped tuple pairs (the first element of each pair taken from sizes without its last item, the second from sizes without its first item), creates a random matrix from each pair, and appends it to the list that is stored as self.weights.
A lot is going on here.
Let's decompose that expression. As you said, zip will create a list of tuples pairing each element of sizes with its successor (so the last element has no pair of its own).
The list comprehension [... for x, y in zip(...)] works as follows: each tuple is unpacked into the variables x and y, and those are passed on to np.random.randn to create a list of random matrices.
Each matrix has its number of rows given by an element of sizes[1:] and its number of columns given by the preceding element of sizes.
Interestingly, the matrices have dimensions compatible for being multiplied with each other in that sequence, though I guess that is not the purpose here. The purpose of each matrix in the weights list is to hold the weights between fully connected layers of neurons. Good luck! Seems a fun project!
Post Scriptum
Since you are a beginner: you can add the statement import pdb; pdb.set_trace() anywhere in your code to set a breakpoint. Then you can copy and paste different parts of any expression to see what comes out.
For example:
ipdb> print sizes
[3, 2, 4]
ipdb> print sizes[:-1]
[3, 2]
ipdb> print sizes[1:]
[2, 4]
ipdb> print zip(sizes[:-1], sizes[1:])
[(3, 2), (2, 4)]
ipdb> print [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
[array([[ 0.25933943, 0.59855688, 0.49055744],
[ 0.94602292, -0.8012292 , 0.56352986]]), array([[ 0.81328847, -0.53234407],
[-0.272656 , -1.24978881],
[-1.2306653 , 0.56038948],
[ 1.15837792, 1.19408038]])]
This code generates a list and assigns it to the self.weights attribute (this is presumably inside a class, which would explain the self). The second line is a list comprehension: it generates the list by applying the function np.random.randn to each pair of variables (x, y).
What happens when numpy.apply_along_axis takes a 1d array as input? When I use it on 1d array, I see something strange:
from numpy import array, apply_along_axis
y = array([1, 2, 3, 4])
First try:
apply_along_axis(lambda x: x > 2, 0, y)
apply_along_axis(lambda x: x - 2, 0, y)
returns:
array([False, False, True, True], dtype=bool)
array([-1, 0, 1, 2])
However when I try:
apply_along_axis(lambda x: x - 2 if x > 2 else x, 0, y)
I get an error:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I could of course use list comprehension then convert back to array instead, but that seems convoluted and I feel like I'm missing something about apply_along_axis when applied to a 1d array.
UPDATE: as per Jeff G's answer, my confusion stems from the fact that for a 1d array with only one axis, what is being passed to the function is in fact the whole 1d array itself rather than the individual elements.
"numpy.where" is clearly better for my chosen example (and no need for apply_along_axis), but my question is really about the proper idiom for applying a general function (that takes one scalar and returns one scalar) to each element of an array (other than list comprehension), something akin to pandas.Series.apply (or map). I know of 'vectorize' but it seems no less unwieldy than list comprehension.
I'm unclear whether you're asking if y must be 1-D (answer is no, it can be multidimensional) or if you're asking about the function passed into apply_along_axis. To that, the answer is yes: the function you pass must take a 1-D array. (This is stated clearly in the function's documentation).
In your three examples, the type of x is always a 1-D array. The reason your first two examples work is because Python is implicitly broadcasting the > and - operators along that array.
Your third example fails because there is no such broadcasting along an array for if / else. For this to work with apply_along_axis you need to pass a function that takes a 1-D array. numpy.where would work for this:
>>> apply_along_axis(lambda x: numpy.where(x > 2, x - 2, x), 0, y)
array([1, 2, 1, 2])
P.S. In all these examples, apply_along_axis is unnecessary, thanks to broadcasting. You could achieve the same results with these:
>>> y > 2
>>> y - 2
>>> numpy.where(y > 2, y - 2, y)
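Concretely, reusing the question's array (a quick sketch):
import numpy
y = numpy.array([1, 2, 3, 4])
print(y > 2)                          # [False False  True  True]
print(y - 2)                          # [-1  0  1  2]
print(numpy.where(y > 2, y - 2, y))   # [1 2 1 2]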
This answer addresses the updated addendum to your original question:
numpy.vectorize will take an elementwise function and return a new function. The new function can be applied to an entire array. It's like map, but it uses the broadcasting rules of numpy.
f = lambda x: x - 2 if x > 2 else x # your elementwise fn
fv = np.vectorize(f)
fv(np.array([1,2,3,4]))
# Out[5]: array([1, 2, 1, 2])
I am trying to use itertools.product to manage the bookkeeping of some nested for loops, where the number of nested loops is not known in advance. Below is a specific example where I have chosen two nested for loops; the choice of two is only for clarity, what I need is a solution that works for an arbitrary number of loops.
This question provides an extension/generalization of the question appearing here:
Efficient algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array
Now I am extending the above technique using an itertools trick I learned here:
Iterating over an unknown number of nested loops in python
Preamble:
from itertools import product
def trivial_functional(i, j): return lambda x : (i+j)*x
idx1 = [1, 2, 3, 4]
idx2 = [5, 6, 7]
joint = [idx1, idx2]
func_table = []
for items in product(*joint):
    f = trivial_functional(*items)
    func_table.append(f)
At the end of the above itertools loop, I have a 12-element, 1-d list of functions, func_table, each element having been built by trivial_functional.
Question:
Suppose I am given a pair of integers, (i_1, i_2), where these integers are to be interpreted as the indices of idx1 and idx2, respectively. How can I use itertools.product to determine the correct corresponding element of the func_table array?
I know how to hack the answer by writing my own function that mimics the itertools.product bookkeeping, but surely there is a built-in feature of itertools.product that is intended for exactly this purpose?
I don't know of a way of calculating the flat index other than doing it yourself. Fortunately this isn't that difficult:
def product_flat_index(factors, indices):
    # The index along the first factor advances in strides equal to the
    # number of combinations of all the remaining factors.
    if len(factors) == 1:
        return indices[0]
    stride = 1
    for factor in factors[1:]:
        stride *= len(factor)
    return indices[0] * stride + product_flat_index(factors[1:], indices[1:])

>>> product_flat_index(joint, (2, 1))
7
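A quick check against the actual order generated by product (a sketch reusing the question's setup):
from itertools import product

flat = list(product(*joint))
assert flat[product_flat_index(joint, (2, 1))] == (idx1[2], idx2[1])  # (3, 6)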
An alternative approach is to store the results in a nested array in the first place, making translation unnecessary, though this is more complex:
from functools import reduce
from operator import getitem, setitem, itemgetter

def get_items(container, indices):
    return reduce(getitem, indices, container)

def set_items(container, indices, value):
    c = reduce(getitem, indices[:-1], container)
    setitem(c, indices[-1], value)

def initialize_table(lengths):
    if len(lengths) == 1:
        return [0] * lengths[0]
    subtable = initialize_table(lengths[1:])
    return [subtable[:] for _ in range(lengths[0])]

func_table = initialize_table(list(map(len, joint)))
for items in product(*map(enumerate, joint)):
    f = trivial_functional(*map(itemgetter(1), items))
    set_items(func_table, list(map(itemgetter(0), items)), f)

>>> get_items(func_table, (2, 1))  # same as func_table[2][1]
<function>
Numerous answers here were quite useful; thanks to everyone for the solutions.
It turns out that if I recast the problem slightly with Numpy, I can accomplish the same bookkeeping, and solve the problem I was trying to solve with vastly improved speed relative to pure python solutions. The trick is just to use Numpy's reshape method together with the normal multi-dimensional array indexing syntax.
Here's how this works. We just convert func_table into a Numpy array and reshape it:
import numpy as np

func_table = np.array(func_table)
component_dimensions = [len(idx1), len(idx2)]
func_table = func_table.reshape(component_dimensions)
Now func_table can be used to return the correct function not just for a single 2d point, but for a full array of 2d points:
dim1_pts = [3,1,2,1,3,3,1,3,0]
dim2_pts = [0,1,2,1,2,0,1,2,1]
func_array = func_table[dim1_pts, dim2_pts]
As usual, Numpy to the rescue!
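As a quick sanity check, reusing the question's trivial_functional:
print(func_table[2, 1](5))  # idx1[2]=3, idx2[1]=6, so (3+6)*5 = 45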
This is a little messy, but here you go:
from itertools import product
def trivial_functional(i, j): return lambda x : (i+j)*x
idx1 = [1, 2, 3, 4]
idx2 = [5, 6, 7]
joint = [enumerate(idx1), enumerate(idx2)]
func_map = {}
for indexes, items in map(lambda x: zip(*x), product(*joint)):
    f = trivial_functional(*items)
    func_map[indexes] = f
print(func_map[(2, 0)](5)) # 40 = (3+5)*5
I'd suggest using enumerate() in the right place:
from itertools import product
def trivial_functional(i, j): return lambda x : (i+j)*x
idx1 = [1, 2, 3, 4]
idx2 = [5, 6, 7]
joint = [idx1, idx2]
func_table = []
for items in product(*joint):
    f = trivial_functional(*items)
    func_table.append(f)
From what I understood from your comments and your code, func_table is simply indexed by the position at which each input tuple occurs in the sequence. You can access it again using:
for index, items in enumerate(product(*joint)):
    # because of the append(), index is now the
    # position of the function created from the
    # respective tuple in product(*joint)
    func_table[index](some_value)