I'm looking for an efficient way (maybe numpy?) to count the number of occurrences of a sequence of numbers in a 2D array.
e.g.

count_seq_occ([2, 3],
              array([[2, 3, 5, 2, 3],
                     [5, 2, 3],
                     [1]]))

will output the result 3.
The three-way nested loop option is obvious, but maybe a better approach exists? Thanks.
KMP search
Try using this code and editing it to search in every vector of the matrix:
http://code.activestate.com/recipes/117214/
It's a KMP (Knuth-Morris-Pratt) python function for finding a pattern in a text or list. You can slightly optimize it by creating the search pattern's shifts array once, then running the rest of the algorithm on every 1D sub-array.
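A minimal sketch of that idea (my own adaptation rather than the recipe's exact code, and count_seq_occ just reuses the name from the question): build the pattern's shift table once, then scan every row with it.

def count_seq_occ(pattern, rows):
    # Standard KMP failure table, built once for the pattern.
    shifts = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[k] != pattern[i]:
            k = shifts[k - 1]
        if pattern[k] == pattern[i]:
            k += 1
        shifts[i] = k

    count = 0
    for row in rows:               # scan each 1D sub-array separately
        k = 0
        for value in row:
            while k and pattern[k] != value:
                k = shifts[k - 1]
            if pattern[k] == value:
                k += 1
            if k == len(pattern):
                count += 1         # full match found
                k = shifts[k - 1]  # allow overlapping matches
    return count

# count_seq_occ([2, 3], [[2, 3, 5, 2, 3], [5, 2, 3], [1]])  -> 3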
Alternative
How about converting the array into a string representation and counting occurrences in the string?
repr(your_array).count("2, 3")
Note: make sure the representation and the substring you count use the same formatting. For example, repr() of a float numpy array contains text like "1., 2., 3., ", so you may need to normalize one side or the other.
Alternatively, you can flatten the array and join all rows into one string, but be careful to add a unique separator character after every row so a match cannot span two rows.
The details depend on how you turn the array into a string, but it should be fast enough: searching for a substring in a string is O(n), so that part isn't a concern. The only real reason to avoid this method is if you don't want to allocate the temporary string object when the array is very large.
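A hedged sketch of the join-with-separator variant (count_seq_str is an illustrative name, and the separator and number formatting here are my own choices):

def count_seq_str(pattern, rows):
    # Join each row with commas and separate rows with a character that can
    # never appear inside a number, so a match cannot span two rows.
    text = ";".join(",".join(map(str, row)) for row in rows)
    needle = ",".join(map(str, pattern))
    # Caveat: this counts non-overlapping matches only, and a pattern like
    # [2, 3] would also match inside [12, 3]; fine for the example data.
    return text.count(needle)

# count_seq_str([2, 3], [[2, 3, 5, 2, 3], [5, 2, 3], [1]])  -> 3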
This is one way, but I hope there is a better solution. It would be helpful if you showed us your nested loop and provided some data for benchmarking.
import numpy as np
from itertools import chain

x = [2, 3]
A = np.array([[2, 3, 5, 2, 3],
              [5, 2, 3],
              [1]], dtype=object)

# Flatten the jagged array and count matching slices.
# Note: because the rows are concatenated, a pattern could in principle
# also be counted across a row boundary.
arr = list(chain.from_iterable(A))
res = sum(arr[i:i + len(x)] == x for i in range(len(arr)))  # 3
Related
I have a bottleneck that is ruining the performance of my code. I rewrote the section, but after timing it, things didn't improve.
The problem is as follows. Given a list of fixed-length lists of integers
data = [[1,2,3], [3,2,1], [8,1,0], [1,3,4]]
I need to build a separate output list for each column in the data: the index of each sublist is appended to a column's list as many times as that sublist's value in that column.
For instance, for the above data, there will be three resulting lists since the sub-lists have three columns.
There are 4 sublists, so we expect the numbers 0-3 to appear in each of the final lists.
We expect the following three lists to be generated from the above data
[[0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3],
[0, 0, 1, 1, 2, 3, 3, 3],
[0, 0, 0, 1, 3, 3, 3, 3]]
I have two ways of doing this:
processed_data = list([] for _ in range(len(data[0])))
for n in range(len(data)):
    sub_list = data[n]
    for k, proc_list in enumerate(processed_data):
        for _ in range(sub_list[k]):
            proc_list.append(n)
processed_data = []
for i, col in enumerate(zip(*data)):
    processed_data.append([j for j, count in enumerate(col) for _ in range(count)])
The average size of the data list is around 100,000.
Is there a way I can speed this up?
You can't improve the computational complexity of your algorithm unless you're able to tweak the output format (see below). In other words, you'll at best be able to improve the speed by a modest percentage (and the percentage will be independent of the size of the input).
I don't see any obvious implementation issues. The one idea I had was to get rid of the large number of append() calls and the overhead of gradual list expansion by preallocating the output lists, but @juanpa.arrivillaga notes in a comment that append() is in fact very optimized on CPython. If you're on another interpreter, you could try it: the length of the output list for column c equals the sum of the input values in column c, so you can preallocate each output list as [0] * sum_of_input_values_at_column_c and then write proc_list[i] = n instead of calling proc_list.append(n) (incrementing i manually). This does, however, require two passes over the input, so it might not actually be an improvement; your problem is quite memory-intensive, since its core computation is extremely simple.
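A hedged sketch of that preallocation idea (preallocated is just an illustrative name, and as noted above it may well not beat append() on CPython, so benchmark it):

def preallocated(data):
    n_cols = len(data[0])
    # First pass: the output list for column c has length sum of column c.
    col_sums = [sum(row[c] for row in data) for c in range(n_cols)]
    processed_data = [[0] * s for s in col_sums]
    positions = [0] * n_cols
    # Second pass: write runs of the row index directly, no append() calls.
    for n, row in enumerate(data):
        for c, count in enumerate(row):
            pos = positions[c]
            processed_data[c][pos:pos + count] = [n] * count
            positions[c] = pos + count
    return processed_data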
The reason you can't improve the computational complexity is that it is already optimal: any algorithm has to spend time producing its output, so the size of the output is a lower bound on its running time. In your case the output size equals the sum of the values in the input matrix (and it is generally considered bad when running time depends on the input values themselves rather than on the number of inputs), and that is exactly the number of iterations your algorithm performs. However, if the output of this function is going to reside in memory to be consumed by another function (rather than being written to a file), and you are able to adapt that function, you could instead output a matrix of generators, where each generator knows it needs to yield sub_list[k] occurrences of n. The complexity of producing that output is then proportional to the size of the input matrix, although consuming it still takes the same amount of time that generating the full output would have.
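One possible shape for that generator-based output, sketched under the assumption that the consumer can iterate cell by cell (lazy_output is an illustrative name):

from itertools import repeat

def lazy_output(data):
    # One iterator per (column, row) cell: it yields the row index n exactly
    # data[n][k] times, but nothing is materialized until it is consumed.
    return [[repeat(n, row[k]) for n, row in enumerate(data)]
            for k in range(len(data[0]))]

# Consuming it reproduces the original lists:
# [[i for cell in col for i in cell] for col in lazy_output(data)]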
Perhaps itertools can make this go faster for you by minimizing the amount of python code inside loops:
from itertools import chain, repeat, starmap

data = [[1,2,3], [3,2,1], [8,1,0], [1,3,4]]

result = [list(chain.from_iterable(starmap(repeat, r)))
          for r in map(enumerate, zip(*data))]
print(result)
[[0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3],
[0, 0, 1, 1, 2, 3, 3, 3],
[0, 0, 0, 1, 3, 3, 3, 3]]
If you're processing the output in the same order as the result's rows come out, you can convert this to a generator and use it directly in your main process:
iResult = (chain.from_iterable(starmap(repeat, r))
           for r in map(enumerate, zip(*data)))

for iRow in iResult:          # iRow is also an iterator
    for resultItem in iRow:
        # Perform your item processing here
        print(resultItem, end=" ")
    print()
0 1 1 1 2 2 2 2 2 2 2 2 3
0 0 1 1 2 3 3 3
0 0 0 1 3 3 3 3
This will avoid creating and storing the lists of indexes altogether (i.e. it brings that bottleneck down to zero), but only if you process the result sequentially.
I am trying to use numpy in Python to solve a problem in my project.
I have a random binary array rndm = [1, 0, 1, 1] and a resource_arr = [[2, 3], 4, 2, [1, 2]]. What I am trying to do is multiply them element-wise, summing the nested sub-lists, so the expected output for the sample above is output = 5 0 2 3. I find this hard to solve because of the nested array/list.
So far my code looks like this:
def fitness_score():
    output = numpy.add(rndm * resource_arr)
    return output

fitness_score()
I keep getting
ValueError: invalid number of arguments.
which I think is because of the addition that I am trying to do. Any help would be appreciated. Thank you!
NumPy expects rectangular (non-jagged) arrays, and resource_arr, with its nested lists of different lengths, is not one. In your case working on the Python list directly is more suitable:
import numpy

def sum_nested(l):
    tmp = []
    for element in l:
        if isinstance(element, list):
            tmp.append(numpy.sum(element))
        else:
            tmp.append(element)
    return tmp
In this function we check, for each element inside l, whether it is a list. If so, we sum its elements; if the element is just a number, we leave it untouched. Please note that this only works for one level of nesting.
Now, if we run sum_nested([[2, 3], 4, 2, [1, 2]]) we get [5, 4, 2, 3]. All that's left is multiplying this result by the elements of rndm, which can be done easily with numpy:
def fitness_score(a, b):
    return numpy.multiply(a, sum_nested(b))
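A quick check on the question's sample data, just illustrating the two functions above:

rndm = [1, 0, 1, 1]
resource_arr = [[2, 3], 4, 2, [1, 2]]

print(fitness_score(rndm, resource_arr))  # [5 0 2 3]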
NumPy is all about non-jagged arrays. You can do things with jagged arrays, but doing so efficiently and elegantly isn't trivial.
Almost always, finding a way to map your data structure to a non-nested one, for instance encoding the information as below, will be more flexible and more performant.
resource_arr = (
    [0, 0, 1, 2, 3, 3],
    [2, 3, 4, 2, 1, 2],
)
That is, an integer denoting the 'row' each value belongs to, paired with an array of equal size containing the values themselves.
This may 'feel' wasteful when coming from a C-style way of doing arrays (more memory consumption), but staying away from nested data structures is almost certainly your best bet, both for performance and for how much of the numpy/scipy ecosystem will actually be compatible with your data representation. Whether it really uses more memory is questionable anyway: every Python object costs dozens of bytes, so if you have only a few elements per nesting level, the flat representation is the more memory-efficient one too.
In this case, that would give you the following efficient solution to your problem:
output = np.bincount(*resource_arr) * rndm
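Worked through on the question's data, just to make the intermediate step visible (this assumes the flat two-row encoding shown above):

import numpy as np

rndm = [1, 0, 1, 1]
resource_arr = (
    [0, 0, 1, 2, 3, 3],   # 'row' each value belongs to
    [2, 3, 4, 2, 1, 2],   # the values themselves
)

per_row_sums = np.bincount(*resource_arr)   # [5. 4. 2. 3.]
print(per_row_sums * rndm)                  # [5. 0. 2. 3.]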
I have not worked much with pandas/numpy, so I'm not sure if this is the most efficient way, but it works (at least for the example you have shown):
import numpy as np

rndm = [1, 0, 1, 1]
resource_arr = [[2, 3], 4, 2, [1, 2]]

# dtype=object keeps the ragged rows intact (newer NumPy versions refuse to
# build a ragged array implicitly).
multiplied_output = np.multiply(rndm, np.array(resource_arr, dtype=object))
print(multiplied_output)

output = []
for elem in multiplied_output:
    output.append(sum(elem)) if isinstance(elem, list) else output.append(elem)

final_output = np.array(output)
print(final_output)
Consider an array [1, 2, 3], say; what I would like to generate is the list containing the pairs [[1, 2], [1, 3], [2, 3]]. This can be done using itertools, but I would like to produce the pairs using pure numpy operations, with no loops or branching allowed.
A close solution is provided here, but it generates all the possible pairs, rather than only the pairs in the particular form shown above.
Can you please suggest a way to do that? The array will always be 1D.
Also, this is my first question on SE. If it requires any edit, please let me know.
Here is a one-liner using np.triu_indices:
>>> a = np.array([1, 2, 3])
>>> a[np.transpose(np.triu_indices(len(a), 1))]
array([[1, 2],
[1, 3],
[2, 3]])
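For reference, the intermediate index arrays behind that one-liner look like this:

>>> np.triu_indices(len(a), 1)   # row/column indices of the strict upper triangle
(array([0, 0, 1]), array([1, 2, 2]))
>>> np.transpose(np.triu_indices(len(a), 1))
array([[0, 1],
       [0, 2],
       [1, 2]])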
I'm trying to create a function that will calculate the lattice distance (number of horizontal and vertical steps) between elements in a multi-dimensional numpy array. For this I need to retrieve the actual numbers from the indexes of each element as I iterate through the array. I want to store those values as numbers that I can run through a distance formula.
For the example array A
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
I'd like to create a loop that iterates through each element and for the first element 1 it would retrieve a=0, b=0 since 1 is at A[0,0], then a=0, b=1 for element 2 as it is located at A[0,1], and so on...
My envisioned output is two numbers (corresponding to the two index values for that element) for each element in the array. So in the example above, it would be the two values that I am assigning to be a and b. I only will need to retrieve these two numbers within the loop (rather than save separately as another data object).
Any thoughts on how to do this would be greatly appreciated!
As I've become more familiar with the numpy and pandas ecosystem, it's become clearer to me that explicit iteration is usually outright wrong because of how slow it is, and that writing vectorized operations is best whenever possible. Though the style is not as obvious/Pythonic at first, I've (anecdotally) gained ridiculous speedups with vectorized operations; more than 1000x in one case, by swapping out something like row-wise .apply(lambda).
@MSeifert's answer below provides this much better and will be significantly more performant on a dataset of any real size.
For a more general answer, see @cs95's post covering and comparing alternatives to iteration in Pandas.
Original Answer
You can iterate through the values in your array with numpy.ndenumerate, which yields the index of each value along with the value itself.
Using the documentation above:
import numpy as np

A = np.array([[1,2,3],[4,5,6],[7,8,9]])
for index, values in np.ndenumerate(A):
    print(index, values)  # operate here
You can do it using np.ndenumerate but generally you don't need to iterate over an array.
You can simply create a meshgrid (or open grid) to get all indices at once and you can then process them (vectorized) much faster.
For example
>>> x, y = np.mgrid[slice(A.shape[0]), slice(A.shape[1])]
>>> x
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> y
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
and these can be processed like any other array. So if your function that needs the indices can be vectorized you shouldn't do the manual loop!
For example to calculate the lattice distance for each point to a point say (2, 3):
>>> abs(x - 2) + abs(y - 3)
array([[5, 4, 3],
[4, 3, 2],
[3, 2, 1]])
For distances an ogrid would be faster. Just replace np.mgrid with np.ogrid:
>>> x, y = np.ogrid[slice(A.shape[0]), slice(A.shape[1])]
>>> np.hypot(x - 2, y - 3) # cartesian distance this time! :-)
array([[ 3.60555128, 2.82842712, 2.23606798],
[ 3.16227766, 2.23606798, 1.41421356],
[ 3. , 2. , 1. ]])
Another possible solution:
import numpy as np

A = np.array([[1,2,3],[4,5,6],[7,8,9]])
for _, val in np.ndenumerate(A):
    ind = np.argwhere(A == val)
    print(val, ind)
Note that with this approach you get every index at which a value occurs, so if a value appears in the array more than once, all of its positions are returned.
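For example, a small made-up array illustrating that caveat:

B = np.array([[1, 2], [2, 3]])
print(np.argwhere(B == 2))
# [[0 1]
#  [1 0]]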
What is the easiest and cleanest way to get the first AND the last elements of a sequence? E.g., I have a sequence [1, 2, 3, 4, 5], and I'd like to get [1, 5] via some kind of slicing magic. What I have come up with so far is:
l = len(s)
result = s[0:l:l-1]
I actually need this for a bit more complex task. I have a 3D numpy array, which is cubic (i.e. is of size NxNxN, where N may vary). I'd like an easy and fast way to get a 2x2x2 array containing the values from the vertices of the source array. The example above is an oversimplified, 1D version of my task.
Use this:
result = [s[0], s[-1]]
Since you're using a numpy array, you may want to use fancy indexing:
a = np.arange(27)
indices = [0, -1]
b = a[indices] # array([0, 26])
For the 3d case:
import numpy as np

vertices = [(0,0,0), (0,0,-1), (0,-1,0), (0,-1,-1),
            (-1,-1,-1), (-1,-1,0), (-1,0,0), (-1,0,-1)]
indices = tuple(zip(*vertices))   # can store this for later use; NumPy wants a tuple index here
a = np.arange(27).reshape((3,3,3))  # dummy array for testing; can be any shape :)
vertex_values = a[indices].reshape((2,2,2))
I first write down all the vertices (although I am willing to bet there is a clever way to do it using itertools which would let you scale this up to N dimensions ...). The order you specify the vertices is the order they will be in the output array. Then I "transpose" the list of vertices (using zip) so that all the x indices are together and all the y indices are together, etc. (that's how numpy likes it). At this point, you can save that index array and use it to index your array whenever you want the corners of your box. You can easily reshape the result into a 2x2x2 array (although the order I have it is probably not the order you want).
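The itertools trick hinted at above could look something like this (a sketch that should generalize to any number of dimensions; corner_values is just an illustrative name):

from itertools import product
import numpy as np

def corner_values(a):
    # Every corner of an N-dimensional array has index 0 or -1 along each axis.
    vertices = list(product((0, -1), repeat=a.ndim))
    indices = tuple(zip(*vertices))           # transpose, as above
    return a[indices].reshape((2,) * a.ndim)

a = np.arange(27).reshape((3, 3, 3))
print(corner_values(a))   # the eight corners as a 2x2x2 array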
This would give you a list of the first and last element in your sequence:
result = [s[0], s[-1]]
Alternatively, this would give you a tuple
result = s[0], s[-1]
For the particular case of an (N,N,N) ndarray X that you mention, would the following work for you?
s = slice(0,N,N-1)
X[s,s,s]
Example
>>> N = 3
>>> X = np.arange(N*N*N).reshape(N,N,N)
>>> s = slice(0,N,N-1)
>>> print(X[s,s,s])
[[[ 0  2]
  [ 6  8]]

 [[18 20]
  [24 26]]]
>>> from operator import itemgetter
>>> first_and_last = itemgetter(0, -1)
>>> first_and_last([1, 2, 3, 4, 5])
(1, 5)
Why do you want to use a slice? Getting each element with
result = [s[0], s[-1]]
is better and more readable.
If you really need to use the slice, then your solution is the simplest working one that I can think of.
This also works for the 3D case you've mentioned.