Python: How to find a unique element pattern from 2 arrays?

I have two numpy arrays, A and B:
A = ([1, 2, 3, 2, 3, 1, 2, 1, 3])
B = ([2, 3, 1, 2])
where B is a unique pattern within A.
I need the output to be all the elements of A, which aren't present in B.
Output = ([1, 2, 3, 1, 3])

Easiest is to use Python's builtins, i.e. string type:
A = "123231213"
B = "2312"
result = A.replace(B, "")
To efficiently convert a numpy.array to and from a byte string, use these functions:
x = numpy.frombuffer(b"3452353", dtype="|i1")
x
array([51, 52, 53, 50, 51, 53, 51], dtype=int8)
x.tobytes()
b"3452353"
Note that this mixes up ASCII codes with values (1 != "1"), but substring search will work just fine. Your data type had better fit in one char, or you may get a false match.
To sum it up, a quick hack looks like this:
A = numpy.array([1, 2, 3, 2, 3, 1, 2, 1, 3])
B = numpy.array([2, 3, 1, 2])
numpy.frombuffer(A.tobytes().replace(B.tobytes(), b""), dtype=A.dtype)
array([1, 2, 3, 1, 3])
# note, here dtype is some int, I'm relying on the fact that:
# "1 matches 1" is equivalent to "0001 matches 00001"
# this holds as long as values of B are typically non-zero.
#
# this trick can conceptually be used with floating point too,
# but beware of multiple floating point representations of same number
In depth explanation:
Assuming the sizes of A and B are arbitrary, the naive approach runs in quadratic time. However, better probabilistic algorithms exist, for example Rabin-Karp, which relies on a sliding-window hash.
This is the main reason text-oriented functions, such as xxx in str, str.replace or re, will be much faster than custom numpy code.
If you truly need this function to be integrated with numpy, you can always write an extension, but it's not easy :)
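If the byte-string trick feels too fragile, the same result can be obtained by comparing whole elements instead of raw bytes. The following is a minimal sketch, not part of the answer above: it assumes NumPy 1.20 or newer (for sliding_window_view), removes non-overlapping matches from left to right like str.replace, and uses the hypothetical helper name remove_pattern. It still performs on the order of len(A) * len(B) comparisons, so it illustrates the naive approach rather than Rabin-Karp.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def remove_pattern(a, b):
    """Return a copy of 1-D `a` with non-overlapping occurrences of `b` removed."""
    a = np.asarray(a)
    b = np.asarray(b)
    if b.size == 0 or b.size > a.size:
        return a.copy()
    # Each row of `windows` is one length-len(b) slice of `a`
    windows = sliding_window_view(a, b.size)
    starts = np.flatnonzero((windows == b).all(axis=1))
    keep = np.ones(a.size, dtype=bool)
    next_free = 0  # first index not covered by an already removed match
    for s in starts:
        if s >= next_free:
            keep[s:s + b.size] = False
            next_free = s + b.size
    return a[keep]


A = np.array([1, 2, 3, 2, 3, 1, 2, 1, 3])
B = np.array([2, 3, 1, 2])
print(remove_pattern(A, B))  # [1 2 3 1 3]
```

Because whole elements are compared, this cannot produce a false match that starts in the middle of a multi-byte value.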

Related

How to get a reverse mapping in numpy in O(1)?

I have a numpy array, whose elements are unique, for example:
b = np.array([5, 4, 6, 8, 1, 2])
(Edit2: b can have large numbers, and float numbers. The above example is there for simplicity)
I am getting numbers, that are elements in b.
I want to find their index in b, meaning I want a reverse mapping, from value to index, in b.
I could do
for number in input:
    ind = np.where(number==b)
which would iterate over the entire array every call to where.
I could also create a dictionary,
d = {}
for i, element in enumerate(list(b)):
    d[element] = i
I could create this dictionary at "preprocessing" time, but still I would be left with a strange looking dictionary, in a mostly numpy code, which seems (to me) not how numpy is meant to be used.
How can I do this reverse mapping in numpy?
usage (O(1) time and memory required):
print("index of 8 is: ", foo(b, 8))
Edit1: not a duplicate of this
Using in1d like explained here doesn't solve my problem. Using their example:
b = np.array([1, 2, 3, 10, 4])
I want to be able to find for example 10's index in b, at runtime, in O(1).
Doing a pre-processing move
mapping = np.in1d(b, b).nonzero()[0]
>> [0, 1, 2, 3, 4]
(which could be accomplished using np.arange(len(b)))
doesn't really help, because when 10 comes in as input, It is not possible to tell its index in O(1) time with this method.
It's simpler than you think, by exploiting numpy's advanced indexing.
What we do is make our target array and just assign using b as an index. We'll assign the indices we want by using arange.
>>> t = np.zeros((np.max(b) + 1,))
>>> t[b] = np.arange(0, b.size)
>>> t
array([0., 4., 5., 0., 1., 0., 2., 0., 3.])
You might use nans or -1 instead of zeros to construct the target to help detect invalid lookups.
Memory usage: lookups run in O(1) and are handled entirely by numpy, but the table needs np.max(b) + 1 entries, so this only suits non-negative integer values of moderate size.
If you can tolerate collisions, you can implement a poor man's hashtable. Suppose we have currencies, for example:
h = np.int32(b * 100.0) % 101 # Typically some prime number
t = np.zeros((101,))
t[h] = np.arange(0, h.size)
# Retrieving a value v; keep in mind v can be an ndarray itself.
t[np.int32(v * 100.0) % 101]
You can do any other steps to munge the address if you know what your dataset looks like.
This is about the limit of what's useful to do with numpy.
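A collision in such a table silently returns the index of a different key. As a hedged sketch of one way to detect that, the table below stores -1 for empty slots and every lookup is verified against the stored key. The names slot, lookup and SIZE, the sample prices, and the rounding to whole cents are assumptions made for illustration.

```python
import numpy as np

SIZE = 101  # table size; a prime, as in the example above


def slot(values):
    # Same address scheme as above, but rounding to whole cents first
    cents = np.round(np.atleast_1d(values) * 100.0).astype(np.int64)
    return cents % SIZE


b = np.array([5.25, 4.10, 6.00, 8.75, 1.50, 2.00])
h = slot(b)
# Refuse to build the table if two keys of b share a slot
assert np.unique(h).size == b.size, "collision between keys of b"

t = np.full(SIZE, -1, dtype=np.int64)  # -1 marks an empty slot
t[h] = np.arange(b.size)


def lookup(v):
    v = np.atleast_1d(v)
    idx = t[slot(v)]
    # A hit only counts if the stored key really equals the query
    hit = (idx >= 0) & (b[idx] == v)
    return np.where(hit, idx, -1)


print(lookup([8.75, 9.76, 3.0]))  # [ 3 -1 -1]
```

Here 9.76 hashes to the same slot as 8.75, and the final comparison is what turns that collision into -1 instead of a wrong index.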
Solution
If you want constant time (ie O(1)), then you'll need to precompute a lookup table of some sort. If you want to make your lookup table using another Numpy array, it'll effectively have to be a sparse array, in which most values are "empty". Here's a workable approach in which empty values are marked as -1:
b = np.array([5, 4, 6, 8, 1, 2])
_b_ix = np.array([-1]*(b.max() + 1))
_b_ix[b] = np.arange(b.size)
# _b_ix: array([-1, 4, 5, -1, 1, 0, 2, -1, 3])
def foo(*val):
    return _b_ix[list(val)]
Test:
print("index of 8 is: %s" % foo(8))
print("index of 0,5,1,8 is: %s" % foo(0,5,1,8))
Output:
index of 8 is: [3]
index of 0,5,1,8 is: [-1 0 4 3]
Caveat
In production code, you should definitely use a dictionary to solve this problem, as other answerers have pointed out. Why? Well, for one thing, say that your array b contains float values, or any non-int value. Then a Numpy-based lookup table won't work at all.
Thus, you should use the above answer only if you have a deep-seated philosophical opposition to using a dictionary (eg a dict ran over your pet cat).
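For float or very large values, a pure NumPy alternative is a sorted lookup with np.argsort and np.searchsorted. Note that this is O(log n) per lookup, not the O(1) asked for in the question, so it is a sketch of a compromise rather than a replacement for the dictionary. The function name reverse_lookup and the sample values are assumptions.

```python
import numpy as np

b = np.array([5.5, 4.0, 6.25, 8.0, 1.0, 2.0])
sorter = np.argsort(b)  # one-time O(n log n) preprocessing


def reverse_lookup(values):
    """Index in b of each value, or -1 where the value is not in b."""
    values = np.atleast_1d(values)
    pos = np.searchsorted(b, values, sorter=sorter)
    pos = np.clip(pos, 0, b.size - 1)  # values above max(b) land past the end
    idx = sorter[pos]
    return np.where(b[idx] == values, idx, -1)


print(reverse_lookup(8.0))              # [3]
print(reverse_lookup([1.0, 7.0, 5.5]))  # [ 4 -1  0]
```

The exact equality test means float inputs must match an element of b bit for bit.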
Here's a nice way to generate a reverse lookup dict:
ix = {k:v for v,k in enumerate(b.flat)}
You can use dict, zip and numpy.arange to create your reverse lookup:
import numpy as np
b = np.array([5, 4, 6, 8, 1, 2])
d = dict(zip(b, np.arange(0,len(b))))
print(d)
gives:
{5: 0, 4: 1, 6: 2, 8: 3, 1: 4, 2: 5}
If you want to do multiple lookups, you can do these in O(1) after an initial O(n) traversal to create a lookup dictionary.
b = np.array([5, 4, 6, 8, 1, 2])
lookup_dict = {e:i for i,e in enumerate(b)}
def foo(element):
    return lookup_dict[element]
And this works for your test:
>>> print('index of 8 is:', foo(8))
index of 8 is: 3
Note that if there is a possibility that b may have changed since the last foo() call, we must re-create the dictionary.

Numpy - the best way to remove the last element from 1 dimensional array?

What is the most efficient way to remove the last element from a numpy 1 dimensional array? (like pop for list)
NumPy arrays have a fixed size, so you cannot remove an element in-place. For example using del doesn't work:
>>> import numpy as np
>>> arr = np.arange(5)
>>> del arr[-1]
ValueError: cannot delete array elements
Note that the index -1 represents the last element. That's because negative indices in Python (and NumPy) are counted from the end, so -1 is the last, -2 is the one before last and -len is actually the first element. That's just for your information in case you didn't know.
Python lists are variable sized so it's easy to add or remove elements.
So if you want to remove an element you need to create a new array or view.
Creating a new view
You can create a new view containing all elements except the last one using the slice notation:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> arr[:-1] # all but the last element
array([0, 1, 2, 3])
>>> arr[:-2] # all but the last two elements
array([0, 1, 2])
>>> arr[1:] # all but the first element
array([1, 2, 3, 4])
>>> arr[1:-1] # all but the first and last element
array([1, 2, 3])
However a view shares the data with the original array, so if one is modified so is the other:
>>> sub = arr[:-1]
>>> sub
array([0, 1, 2, 3])
>>> sub[0] = 100
>>> sub
array([100, 1, 2, 3])
>>> arr
array([100, 1, 2, 3, 4])
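If it is unclear whether two arrays share data, NumPy can report it. A small sketch using np.shares_memory and the base attribute, comparing a slice with an explicit copy:

```python
import numpy as np

arr = np.arange(5)
view = arr[:-1]          # basic slicing: a view on arr's data
copy = arr[:-1].copy()   # explicit copy: independent data

print(np.shares_memory(arr, view))  # True
print(np.shares_memory(arr, copy))  # False
print(view.base is arr)             # True
print(copy.base is None)            # True
```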
Creating a new array
1. Copy the view
If you don't like this memory sharing, you have to create a new array. In this case it's probably simplest to create a view and then copy it (for example using the copy() method of arrays):
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> sub_arr = arr[:-1].copy()
>>> sub_arr
array([0, 1, 2, 3])
>>> sub_arr[0] = 100
>>> sub_arr
array([100, 1, 2, 3])
>>> arr
array([0, 1, 2, 3, 4])
2. Using integer array indexing
However, you can also use integer array indexing to remove the last element and get a new array. Integer array indexing is a form of advanced indexing, which always creates a copy and never a view:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> indices_to_keep = [0, 1, 2, 3]
>>> sub_arr = arr[indices_to_keep]
>>> sub_arr
array([0, 1, 2, 3])
>>> sub_arr[0] = 100
>>> sub_arr
array([100, 1, 2, 3])
>>> arr
array([0, 1, 2, 3, 4])
This integer array indexing can be useful to remove arbitrary elements from an array (which can be tricky or impossible when you want a view):
>>> arr = np.arange(5, 10)
>>> arr
array([5, 6, 7, 8, 9])
>>> arr[[0, 1, 3, 4]] # keep first, second, fourth and fifth element
array([5, 6, 8, 9])
If you want a generalized function that removes the last element using integer array indexing:
def remove_last_element(arr):
    return arr[np.arange(arr.size - 1)]
3. Using boolean array indexing
There is also boolean indexing that could be used, for example:
>>> arr = np.arange(5, 10)
>>> arr
array([5, 6, 7, 8, 9])
>>> keep = [True, True, True, True, False]
>>> arr[keep]
array([5, 6, 7, 8])
This also creates a copy! And a generalized approach could look like this:
def remove_last_element(arr):
    if not arr.size:
        raise IndexError('cannot remove last element of empty array')
    keep = np.ones(arr.shape, dtype=bool)
    keep[-1] = False
    return arr[keep]
If you would like more information on NumPy's indexing, the documentation on "Indexing" is quite good and covers a lot of cases.
4. Using np.delete()
Normally I wouldn't recommend the NumPy functions that "seem" like they are modifying the array in-place (like np.append and np.insert) but do return copies because these are generally needlessly slow and misleading. You should avoid them whenever possible, that's why it's the last point in my answer. However in this case it's actually a perfect fit so I have to mention it:
>>> arr = np.arange(10, 20)
>>> arr
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> np.delete(arr, -1)
array([10, 11, 12, 13, 14, 15, 16, 17, 18])
5. Using np.resize()
NumPy has another method that sounds like it does an in-place operation but it really returns a new array:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> np.resize(arr, arr.size - 1)
array([0, 1, 2, 3])
To remove the last element I simply provided a new shape that is 1 smaller than before, which effectively removes the last element.
Modifying the array inplace
Yes, I've written previously that you cannot modify an array in place. But I said that because in most cases it's not possible or only by disabling some (completely useful) safety checks. I'm not sure about the internals but depending on the old size and the new size it could be possible that this includes an (internal-only) copy operation so it might be slower than creating a view.
Using np.ndarray.resize()
If the array doesn't share its memory with any other array, then it's possible to resize the array in place:
>>> arr = np.arange(5, 10)
>>> arr.resize(4)
>>> arr
array([5, 6, 7, 8])
However that will throw ValueErrors in case it's actually referenced by another array as well:
>>> arr = np.arange(5)
>>> view = arr[1:]
>>> arr.resize(4)
ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function
You can disable that safety check by setting refcheck=False, but that shouldn't be done lightly because you make yourself vulnerable to segmentation faults and memory corruption in case the other reference tries to access the removed elements! This refcheck argument should be treated as an expert-only option!
Summary
Creating a view is really fast and doesn't take much additional memory, so you should work with views whenever possible. However, depending on the use case, it's not so easy to remove arbitrary elements using basic slicing. While it's easy to remove the first n elements and/or last n elements or to keep only every x-th element (the step argument for slicing), this is all you can do with it.
But in your case of removing the last element of a one-dimensional array I would recommend:
arr[:-1] # if you want a view
arr[:-1].copy() # if you want a new array
because these most clearly express the intent and everyone with Python/NumPy experience will recognize that.
Timings
Based on the timing framework from this answer:
# Setup
import numpy as np
def view(arr):
    return arr[:-1]
def array_copy_view(arr):
    return arr[:-1].copy()
def array_int_index(arr):
    return arr[np.arange(arr.size - 1)]
def array_bool_index(arr):
    if not arr.size:
        raise IndexError('cannot remove last element of empty array')
    keep = np.ones(arr.shape, dtype=bool)
    keep[-1] = False
    return arr[keep]
def array_delete(arr):
    return np.delete(arr, -1)
def array_resize(arr):
    return np.resize(arr, arr.size - 1)
# Timing setup
timings = {view: [],
           array_copy_view: [], array_int_index: [], array_bool_index: [],
           array_delete: [], array_resize: []}
sizes = [2**i for i in range(1, 20, 2)]
# Timing
for size in sizes:
    print(size)
    func_input = np.random.random(size=size)
    for func in timings:
        print(func.__name__.ljust(20), ' ', end='')
        res = %timeit -o func(func_input) # if you use IPython, otherwise use the "timeit" module
        timings[func].append(res)
# Plotting
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(1)
ax = plt.subplot(111)
for func in timings:
    ax.plot(sizes,
            [time.best for time in timings[func]],
            label=func.__name__)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
I get the following timings as a log-log plot to cover all the details. A lower time still means faster, but the range between two ticks represents one order of magnitude instead of a fixed amount. In case you're interested in the specific values, I copied them into this gist:
According to these timings those two approaches are also the fastest. (Python 3.6 and NumPy 1.14.0)
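The timing loop above relies on the IPython %timeit magic. Outside IPython, a rough equivalent can be sketched with the standard timeit module. This reduced version times only three of the candidates on a few small sizes, and the number and repeat values are arbitrary assumptions, so its figures are not comparable to the plot described above.

```python
import timeit

import numpy as np


def view(arr):
    return arr[:-1]


def array_copy_view(arr):
    return arr[:-1].copy()


def array_delete(arr):
    return np.delete(arr, -1)


funcs = [view, array_copy_view, array_delete]
sizes = [2**i for i in range(1, 14, 4)]  # 2, 32, 512, 8192
timings = {func.__name__: [] for func in funcs}

for size in sizes:
    func_input = np.random.random(size=size)
    for func in funcs:
        # Best of 3 runs of 1000 calls each, reported per call in seconds
        runs = timeit.repeat(lambda: func(func_input), number=1000, repeat=3)
        timings[func.__name__].append(min(runs) / 1000)

for name, per_call in timings.items():
    print(name.ljust(20), ['%.2e' % t for t in per_call])
```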
If you want to quickly get the array without its last element (without explicitly removing it), use slicing:
array[:-1]
To delete the last element from a 1-dimensional NumPy array, use the numpy.delete method, like so:
import numpy as np
# Create a 1-dimensional NumPy array that holds 5 values
values = np.array([1, 2, 3, 4, 5])
# Remove the last element of the array using the numpy.delete method
values = np.delete(values, -1)
print(values)
Output:
[1 2 3 4]
The last value of the NumPy array, which was 5, is now removed.

Getting the indices of several elements in a NumPy array at once

Is there any way to get the indices of several elements in a NumPy array at once?
E.g.
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
I would like to find the index of each element of a in b, namely: [0,1,4].
I find the solution I am using a bit verbose:
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
c = np.zeros_like(a)
for i, aa in np.ndenumerate(a):
    c[i] = np.where(b == aa)[0][0]
print('c: {0}'.format(c))
Output:
c: [0 1 4]
You could use in1d and nonzero (or where for that matter):
>>> np.in1d(b, a).nonzero()[0]
array([0, 1, 4])
This works fine for your example arrays, but in general the array of returned indices does not honour the order of the values in a. This may be a problem depending on what you want to do next.
In that case, a much better answer is the one @Jaime gives here, using searchsorted:
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([0, 1, 4])
This returns the indices for values as they appear in a. For instance:
a = np.array([1, 2, 4])
b = np.array([4, 2, 3, 1])
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([3, 1, 0]) # the other method would return [0, 1, 3]
This is a simple one-liner using the numpy-indexed package (disclaimer: I am its author):
import numpy_indexed as npi
idx = npi.indices(b, a)
The implementation is fully vectorized, and it gives you control over the handling of missing values. Moreover, it works for nd-arrays as well (for instance, finding the indices of rows of a in b).
All of the solutions here recommend using a linear search. You can use np.argsort and np.searchsorted to speed things up dramatically for large arrays:
sorter = b.argsort()
i = sorter[np.searchsorted(b, a, sorter=sorter)]
For an order-agnostic solution, you can use np.flatnonzero with np.isin (v 1.13+).
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
res = np.flatnonzero(np.isin(b, a)) # NumPy v1.13+
res = np.flatnonzero(np.in1d(b, a)) # earlier versions
# array([0, 1, 4], dtype=int64)
There are a bunch of approaches for getting the index of multiple items at once mentioned in passing in answers to this related question: Is there a NumPy function to return the first index of something in an array?. The wide variety and creativity of the answers suggests there is no single best practice, so if your code above works and is easy to understand, I'd say keep it.
I personally found this approach to be both performant and easy to read: https://stackoverflow.com/a/23994923/3823857
Adapting it for your example:
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
indices = [b_list.index(x) for x in a]
vals_at_indices = b_array[indices]
I personally like adding a little bit of error handling in case a value in a does not exist in b.
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
b_set = set(b_list)
indices = [b_list.index(x) if x in b_set else np.nan for x in a]
vals_at_indices = b_array[indices]
For my use case, it's pretty fast, since it relies on parts of Python that are fast (list comprehensions, .index(), sets, numpy indexing). Would still love to see something that's a NumPy equivalent to VLOOKUP, or even a Pandas merge. But this seems to work for now.
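One caveat with the np.nan placeholder: NumPy cannot index an array with NaN, so b_array[indices] raises an error as soon as one value of a is missing from b. A hedged variant is sketched below. It uses a dictionary for the lookup, -1 as the "not found" marker, and a boolean mask before indexing. The names position and found and the extra value 7 are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 4, 7])          # 7 does not occur in b
b_array = np.array([1, 2, 3, 10, 4])

# Build the value -> index mapping once, in O(len(b))
position = {value: i for i, value in enumerate(b_array.tolist())}

# -1 marks values of a that are absent from b
indices = np.array([position.get(x, -1) for x in a.tolist()])
found = indices >= 0

vals_at_indices = b_array[indices[found]]
print(indices)          # [ 0  1  4 -1]
print(vals_at_indices)  # [1 2 4]
```

Using a dictionary also avoids the repeated linear scans of b_list.index().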

numpy parse int into bit groupings

I have a np.array of np.uint8
a = np.array([randint(1,255) for _ in range(100)],dtype=np.uint8)
and I want to split this into low and high nibbles
I could get the low nibble
low = np.bitwise_and(a,0xF)
and I could get the high nibble with
high = np.bitwise_and(np.right_shift(a,4),0xF)
is there some way to do something like
>>> numpy.keep_bits(a,[(0,3),(4,7)])
numpy.array([
[low1,high1],
[low2,high2],
...
[lowN,highN]
])
I'm not even sure what this would be called, but I thought maybe some numpy guru would know a cool way to do this (in reality I am looking to do this with uint32s and much more varied nibbles).
basically something like struct.unpack but for vectorized numpy operations
EDIT: I went with a modified version of the accepted answer below
here is my final code for anyone interested
import numpy

def bitmask(start,end):
    """
    >>> bitmask(0,2) == 0b111
    True
    >>> bitmask(3,5) == 0b111000
    True
    :param start: start bit
    :param end: end bit (unlike range, end bit is inclusive)
    :return: integer bitmask for the specified bit pattern
    """
    return (2**(end+1-start)-1)<<start

def mask_and_shift(a,mask_a,shift_a):
    """
    :param a: np.array
    :param mask_a: array of masks to apply (must be same size as shift_a)
    :param shift_a: array of shifts to apply (must be same size as mask_a)
    :return: reshaped a, that has masks and shifts applied
    """
    masked_a = numpy.bitwise_and(a.reshape(-1,1), mask_a)
    return numpy.right_shift(masked_a,shift_a)

def bit_partition(rawValues,bit_groups):
    """
    >>> a = numpy.array([1,15,16,17,125,126,127,128,129,254,255])
    >>> bit_partition(a,[(0,2),(3,7)])
    >>> bit_partition(a,[(0,2),(3,5),(6,7)])
    :param rawValues: np.array of raw values
    :param bit_groups: list of start_bit,end_bit values for where to bit twiddle
    :return: np.array len(rawValues)xlen(bit_groups)
    """
    masks,shifts = zip(*[(bitmask(s,e),s) for s,e in bit_groups])
    return mask_and_shift(rawValues,masks,shifts)
A one-liner, using broadcasting, for the four bit lower and upper nibbles:
In [38]: a
Out[38]: array([ 1, 15, 16, 17, 127, 128, 255], dtype=uint8)
In [39]: (a.reshape(-1,1) & np.array([0xF, 0xF0], dtype=np.uint8)) >> np.array([0, 4], dtype=np.uint8)
Out[39]:
array([[ 1, 0],
[15, 0],
[ 0, 1],
[ 1, 1],
[15, 7],
[ 0, 8],
[15, 15]], dtype=uint8)
To generalize this, replace the hardcoded values [0xF, 0xF0] and [0, 4] with the appropriate bit masks and shifts. For example, to split the values into three groups, containing the highest two bits, followed by the remaining two groups of three bits, you can do this:
In [41]: masks = np.array([0b11000000, 0b00111000, 0b00000111], dtype=np.uint8)
In [42]: shifts = np.array([6, 3, 0], dtype=np.uint8)
In [43]: a
Out[43]: array([ 1, 15, 16, 17, 127, 128, 255], dtype=uint8)
In [44]: (a.reshape(-1,1) & np.array(masks, dtype=np.uint8)) >> np.array(shifts, dtype=np.uint8)
Out[44]:
array([[0, 0, 1],
[0, 1, 7],
[0, 2, 0],
[0, 2, 1],
[1, 7, 7],
[2, 0, 0],
[3, 7, 7]], dtype=uint8)
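The same broadcasting idea carries over to wider integers. The sketch below shifts first and masks afterwards, so the masks can be derived from the group widths alone. The helper name split_bits is hypothetical, and it assumes uint32 input and groups narrower than 32 bits, given as inclusive (start_bit, end_bit) pairs.

```python
import numpy as np


def split_bits(values, bit_groups):
    """Split each uint32 into bit fields given as inclusive (start, end) pairs."""
    values = np.asarray(values, dtype=np.uint32).reshape(-1, 1)
    starts = np.array([s for s, _ in bit_groups], dtype=np.uint32)
    widths = np.array([e - s + 1 for s, e in bit_groups], dtype=np.uint32)
    # A field of width w is selected by the mask (1 << w) - 1 after shifting
    masks = (np.uint32(1) << widths) - np.uint32(1)
    return (values >> starts) & masks


a = np.array([1, 15, 16, 17, 255])
print(split_bits(a, [(0, 3), (4, 7)]))
# [[ 1  0]
#  [15  0]
#  [ 0  1]
#  [ 1  1]
#  [15 15]]
print(split_bits([0xABCD1234], [(0, 3), (4, 15), (16, 31)]))
# one row holding 0x4, 0x123 and 0xABCD
```

Each column of the result corresponds to one entry of bit_groups, in the order given.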
So, I won't comment on the specific logical operators you want to implement, since bit-hacking isn't quite a specialty of mine, but I can tell you where you should look in numpy to implement this kind of custom operator.
If you look through the numpy source you'll notice that nearly all of the bit-manipulation techniques in numpy.ma are just instances of _MaskedBinaryOperation; for example, the definition of bitwise_and is simply:
bitwise_and = _MaskedBinaryOperation(umath.bitwise_and)
The magic here comes in the form of the umath module, which calls down, typically to the low level libraries that numpy is built on. If you really want to, you could add your operator there, but I don't think it's worth mucking around at that level.
That said, this isn't the only way to incorporate these functions into numpy. In fact, the umath module has a really handy function called frompyfunc that will let you turn an arbitrary python function into one of these handy umath operators. Documentation can be found here. An example of creating such a function is below:
>>> oct_array = np.frompyfunc(oct, 1, 1)
>>> oct_array(np.array((10, 30, 100)))
array(['0o12', '0o36', '0o144'], dtype=object)
>>> np.array((oct(10), oct(30), oct(100))) # for comparison
array(['0o12', '0o36', '0o144'], dtype='<U5')
If you decide on the specifics of the bitwise operator you want to implement, using this interface would be the best way to implement it.
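As a rough illustration of that interface applied to the nibble question, np.frompyfunc can wrap a plain Python function with one input and two outputs. This is only a sketch: the name split_nibbles is made up, the results come back as object arrays that need a cast, and the Python function is called once per element, so it will be far slower than the broadcasting one-liner on large arrays.

```python
import numpy as np

# One input, two outputs: (low nibble, high nibble) of a byte
split_nibbles = np.frompyfunc(lambda v: (v & 0xF, (v >> 4) & 0xF), 1, 2)

a = np.array([1, 17, 255], dtype=np.uint8)
low, high = split_nibbles(a)

# frompyfunc returns object arrays, so cast back to a numeric dtype
low = low.astype(np.uint8)
high = high.astype(np.uint8)
print(low)   # [ 1  1 15]
print(high)  # [ 0  1 15]
```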
This doesn't answer 100% of your question, but I assumed your question was much more about implementing some custom bitwise operator in proper numpy form rather than digging into the bitwise operator itself. Let me know if that's inaccurate and I can put together an example using the bitwise operator you alluded to above.

numpy.subtract but only until difference reaches threshold - replace numbers smaller than that with threshold

I want to subtract a given value from each element in my numpy array.
For example, if I have a numpy array called a_q, and variable called subtract_me, then I can simply do this:
result = np.subtract(a_q,subtract_me)
That's fine. But I don't want it to simply subtract blindly from every element. If the difference is lower than a threshold, then I don't want the subtraction to happen. Instead, I want that element of the array to be replaced by that threshold.
What's the most efficient way to do this? I could simply iterate through the array and subtract from each element and put a check condition on whether the threshold has been reached or not, and build a new array out of the results (as below) - but is there a better or more efficient way to do it?
import numpy as np
threshold = 3 # in my real program, the threshold is the
# lowest non-infinity number that python can handle
subtract_me = 6
a_q = []
for i in range(10):
    val = i - subtract_me
    if val < threshold:
        val = threshold
    a_q.append(val)
myarr = np.array(a_q)
print(myarr)
Vectorised methods are typically most efficient with NumPy arrays so here's one way which is likely to be more efficient than iterating over an array one element at a time:
>>> threshold = 3
>>> subtract_me = 6
>>> a_q = np.arange(10)
>>> arr = a_q - subtract_me # take away the subtract_me value
>>> arr
array([-6, -5, -4, -3, -2, -1, 0, 1, 2, 3])
>>> arr[arr < threshold] = threshold # replace any value less than threshold
>>> arr
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
EDIT: since np.clip was mentioned in the comments below the question, I may as well absorb it into my answer for completeness ;-)
Here's one way you could use it to get the desired result:
>>> np.clip((a_q - subtract_me), threshold, np.max(a_q))
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
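Since only a lower bound is needed here, np.maximum expresses the same idea without an upper limit. A minimal sketch, with subtract_me changed to 2 (an arbitrary choice) so that some values stay above the threshold:

```python
import numpy as np

threshold = 3
subtract_me = 2
a_q = np.arange(10)

# Element-wise maximum of the difference and the threshold
result = np.maximum(a_q - subtract_me, threshold)
print(result)  # [3 3 3 3 3 3 4 5 6 7]
```

np.clip(a_q - subtract_me, threshold, None) gives the same result, because passing None leaves the upper side unbounded.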
