Frequencies of elements in 2D numpy array - python

I have a numpy array output of shape (1000, 4). It contains 1000 quadruples with no repetitions, and each quadruple is ordered (e.g. one element is [0,1,2,3]). I want to count how many times each of the possible quadruples occurs in output. In practice, I use the following code:
import itertools
import numpy as np

comb = np.array(list(itertools.combinations(range(32), 4)))

def counting(comb, output):
    n_output = np.zeros(comb.shape[0])
    for i in range(comb.shape[0]):
        k = 0
        for j in range(output.shape[0]):
            if (output[j] == comb[i]).all():
                k += 1
        n_output[i] = k
    return n_output
How can I optimize this code? At the moment it takes 30 s to run.

Your current implementation is inefficient for two reasons:
the complexity of the algorithm is O(n^2);
it makes use of (slow CPython) loops.
You can write a simple O(n) algorithm using Python sets (still with a loop), since output does not have any repetitions. Here is the result:
def countingFast(comb, output):
    n_output = np.zeros(comb.shape[0])
    tmp = set(map(tuple, output))
    for i in range(comb.shape[0]):
        n_output[i] = int(tuple(comb[i]) in tmp)
    return n_output
On my machine, using the described input sizes, the original version takes 55.2 seconds while this implementation takes 0.038 seconds. This is roughly 1400 times faster.
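If you also want to stay entirely within numpy, a set-free variant is to encode every quadruple as a single integer and test membership with np.isin. This is only a sketch and assumes, as in the question, that comb and output hold integer quadruples drawn from range(32):
import numpy as np

weights = 32 ** np.arange(4)      # base-32 place values
comb_keys = comb @ weights        # one unique integer key per quadruple in comb
output_keys = output @ weights    # same encoding for the observed quadruples
# since output has no repetitions, every count is either 0 or 1
n_output = np.isin(comb_keys, output_keys).astype(np.int64)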

You can generate a boolean array indicating whether the sequence you want to check is equal to a given row of your array. As numpy boolean arrays can be summed, you can then use this result to get the total number of matching rows.
A basic approach could look like this (including sample data generation):
import numpy as np
# set seed value of random generator to fixed value for repeatable output
np.random.seed(1234)
# create a random array with 950x4 elements
arr = np.random.rand(950, 4)
# create a 50x4 array with sample sequence
# this is the sequence we want to count in our final array
sequence = [0, 1, 2, 3]
sample = np.array([sequence, ]*50)
# stack arrays to create sample data with 1000x4 elements
arr = np.vstack((arr, sample))
# shuffle array to get a random distribution of random sample data and known sequence
np.random.shuffle(arr)
# check for equal array elements, returns a boolean array
results = np.equal(sequence, arr)
# sum the boolean array to get the number of total occurrences per column
# as the sum is the same for all columns, we just need to get the first element at index 0
occurrences = np.sum(results, axis=0)[0]
print(occurrences)
# --> 50
You need to call the relevant lines for each sequence you are interested in. Therefore, it would be useful to wrap them in a function like this:
def number_of_occurrences(data, sequence):
    results = np.equal(sequence, data)
    return np.sum(results, axis=0)[0]
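Note that np.equal compares element-wise, so summing a single column counts only the rows whose first value matches; with the sample data above that equals the number of fully matching rows, because a random float essentially never coincides with an integer from the sequence. If partial matches are possible in your data, a stricter variant (a sketch, not part of the original answer) can require every value in a row to match:
import numpy as np

def number_of_full_matches(data, sequence):
    # a row counts only if all of its elements equal the corresponding sequence values
    row_matches = np.equal(sequence, data).all(axis=1)
    return int(np.sum(row_matches))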

Related

Random partitioning given array with given bin sizes

How to randomly partition given array with given bin sizes?
Is there an inbuilt function for that? For example, I want something like
function(12,(2,3,3,2,2)) to output five partitions of the numbers from 1 to 12 (or 0 to 11, it doesn't matter). So the output may be a list like [[3,4],[7,8,11],[12,1,2],[5,9],[6,10]] (or some other efficient data structure). The first argument of the function may be just a number n, in which case it will consider np.arange(n) as the input; otherwise it may be any other ndarray.
Of course we can randomly permute the list and then pick the first 2, next 3, next 3, next 2 and last 2 elements. But does there exist something more efficient?
The numpy.partition() function means something different: it performs the partitioning step used in quicksort. I also couldn't find any such function in the numpy.random submodule.
Try the following solution:
import numpy as np
from typing import List

def func(a, b: List):
    # a is an integer and b is a Python list of bin sizes
    indx = np.random.rand(a).argsort()   # randomly permuted indexes 0 .. a-1
    b = np.array(b)
    # split the permuted indexes at the cumulative bin sizes
    return np.split(indx, b.cumsum()[:-1])
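As a usage sketch, calling the function above with the bin sizes from the question could look like this (the exact grouping depends on the random state):
parts = func(12, [2, 3, 3, 2, 2])
print([p.tolist() for p in parts])
# e.g. [[3, 7], [0, 10, 4], [11, 1, 8], [5, 9], [2, 6]] (five bins of sizes 2, 3, 3, 2, 2)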

Access multiple items of list

I'm currently trying to implement a replay buffer, in which I store 20 numbers in a list and then want to sample 5 of these numbers randomly.
I tried the following with numpy arrays:
import numpy as np

ac = np.zeros(20, dtype=np.int32)
for i in range(20):
    ac[i] = i + 1
batch = np.random.choice(20, 5, replace=False)
sample = ac[batch]
print(sample)
This works the way it should, but I want the same done with a list instead of a numpy array.
But when I try to get sample = ac[batch] with a list, I get this error message:
TypeError: only integer scalar arrays can be converted to a scalar index
How can I access multiple elements of a list like I did with numpy?
For a list it is quite easy. Just use the sample function from the random module:
import random
ac = [i+1 for i in range(20)]
sample = random.sample(ac, 5)
Also, on a side note: when you want to create a numpy array with a range of numbers, you don't have to create an array of zeros and then fill it in a for loop; that is less convenient and also significantly slower than using the numpy function arange.
ac = np.arange(1, 21, 1)
If you really want to create a batch list that contains the indexes you want to access, then you will have to use a list comprehension to access them, since you can't index a list with multiple indexes the way you can a numpy array.
batch = [random.randint(0, len(ac) - 1) for _ in range(5)]  # randint includes both endpoints, and indexes may repeat
sample = [ac[i] for i in batch]
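If you prefer to avoid the comprehension, operator.itemgetter from the standard library can also pull several positions out of a plain list at once; this is an alternative sketch, not part of the original answer:
import random
from operator import itemgetter

ac = [i + 1 for i in range(20)]
batch = random.sample(range(len(ac)), 5)   # 5 distinct indexes
sample = list(itemgetter(*batch)(ac))      # the elements of ac at those positions
print(sample)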

Randomly select a length of numbers from numpy array

I have a number of data files, each containing a large number of data points.
After loading the file with numpy, I get a numpy array:
f=np.loadtxt("...-1.txt")
How do I randomly select a sequence of length x, such that the order of the numbers is not changed?
For example:
f = [1,5,3,7,4,8]
if I wanted to select a random length of 3 data points, the output should be:
1,5,3, or
3,7,4, or
5,3,7, etc.
Pure logic will get you there.
For a list f and a slice length x, the valid starting points of your random slice are limited to 0 through len(f)-x:
       0  1  2  3           <- valid starting indexes when x = 3
f =   [1, 5, 3, 7, 4, 8]
So any valid starting point can be selected with random.randrange(len(f)-x+1) (the +1 is needed because randrange, like range, excludes the end point).
Store the random starting point in a variable start and slice your array with [start:start+x], or be creative and use another slice after the first:
result = f[random.randrange(len(f)-x+1):][:x]
Building on usr2564301's answer, you can take out only the elements you need in one go by indexing with a range, so you avoid building a potentially very large intermediate array:
start = random.randrange(len(f) - x + 1)
result = f[range(start, start + x)]
A range also avoids building a large index array when the length x becomes larger.
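Putting the pieces together, a minimal self-contained version of the slicing approach (using the example data from the question) might look like this:
import random
import numpy as np

f = np.array([1, 5, 3, 7, 4, 8])
x = 3
start = random.randrange(len(f) - x + 1)   # valid starts are 0 .. len(f) - x
result = f[start:start + x]                # x consecutive values, original order preserved
print(result)                              # e.g. [5 3 7]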

Numpy Arrays comparison and indexing

I have 2 arrays of unequal size:
>>> np.size(array1)
4004001
>>> np.size(array2)
1000
Now, each element in array2 needs to be compared to all the elements in array1, to find the element of array1 whose value is nearest to it.
Upon finding this value, I need to store it in a separate array of size 1000, i.e. one whose size corresponds to array2.
The tedious and crude way of doing it would be a for loop: take each element of array2, compute the absolute difference with every element of array1, and then take the minimum, but this is going to make my code really slow.
I'd like to use numpy vectorized operations to do this, but I've kind of hit a wall.
To make full use of numpy's vectorization we need vectorized functions. Furthermore, all values are looked up in the same array (array1) using the same criterion (nearest). Therefore, it is possible to make a special function for searching array1 specifically.
However, to make the solution more reusable it is better to start from a more general solution and then turn it into a more specific one. Thus, as a general approach to finding the closest value, we start with this find-nearest solution. Then we turn that into a more specific function and vectorize it, so it can work on multiple elements at once:
import math
from functools import partial

import numpy as np

def find_nearest_sorted(array, value):
    idx = np.searchsorted(array, value, side="left")
    if idx > 0 and (idx == len(array) or math.fabs(value - array[idx-1]) < math.fabs(value - array[idx])):
        return array[idx-1]
    else:
        return array[idx]

array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)
array1_sorted = np.sort(array1)

# Partially apply array1 to the find function, to turn the general function
# into a specific one working with array1 only.
find_nearest_in_array1 = partial(find_nearest_sorted, array1_sorted)

# Vectorize the specific function to allow us to apply it to all elements of
# array2, the numpy way.
vectorized_find = np.vectorize(find_nearest_in_array1)
output = vectorized_find(array2)
Hopefully this is what you wanted, a new vector, mapping the data in array2 to the nearest values in array1.
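Note that np.vectorize still runs a Python-level loop under the hood, so if speed matters you could also vectorize the search itself with a single np.searchsorted call. This is only a sketch; it reuses array1_sorted and array2 from the snippet above and assumes array1_sorted has at least two elements:
idx = np.searchsorted(array1_sorted, array2)      # insertion points, one per element of array2
idx = np.clip(idx, 1, len(array1_sorted) - 1)     # keep both idx-1 and idx in bounds
left = array1_sorted[idx - 1]
right = array1_sorted[idx]
# step back by one where the left neighbour is strictly closer
idx -= (array2 - left) < (right - array2)
output = array1_sorted[idx]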
The most "numpythonic" way is is to use broadcasting. This is a quick and easy way to calculate a distance matrix, for which you can then take the argmin of the absolute value.
array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)
# Calculate distance matrix (on truncated array1 for memory reasons)
dmat = array1[:400400] - array2[:,None]
# Take the abs of the distance matrix and work out the argmin along the last axis
ix = np.abs(dmat).argmin(axis=1)
shape of dmat:
(1000, 400400)
shape of ix and contents:
(1000,)
array([237473, 166831, 72369, 11663, 22998, 85179, 231702, 322752, ...])
However, it's memory hungry if you do this operation in one go, and actually doesn't work on my 8GB machine for the size of arrays that you specify, which is why I reduced the size of array1.
To make it work within memory constraints, simply slice one of the arrays into chunks and apply broadcasting on each chunk in turn (or parallelise). In this case, I've sliced array2 into 10 chunks:
# Define number of chunks and calculate chunk size
n_chunks = 10
chunk_len = array2.size // n_chunks

# Preallocate output array
out = np.zeros(1000)

for i in range(n_chunks):
    s = slice(i * chunk_len, (i + 1) * chunk_len)
    out[s] = np.abs(array1 - array2[s, None]).argmin(axis=1)
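The chunked loop stores argmin indices into array1; if you want the nearest values themselves, as the question asks, a small follow-up step along these lines should work (a sketch, assuming out now holds valid indices):
# out was preallocated as float, so cast the indices back to integers before indexing
nearest_values = array1[out.astype(np.intp)]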
import numpy as np

a = np.random.random(size=4004001).astype(np.float16)
b = np.random.random(size=1000).astype(np.float16)

# Use numpy broadcasting to compute the pairwise differences, then find the argmin
# in a for each element in b. Finally, extract the elements from a using the argmin
# array as indexes.
output = a[np.argmin(np.abs(b[:, None] - a), axis=1)]
This solution, while simple, can be very memory intensive. It may need a bit of further optimisation when used on large arrays.
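To put the memory cost in perspective, the intermediate (1000, 4004001) float16 difference matrix alone is on the order of 8 GB:
import numpy as np

# rough size of the broadcasted difference matrix b[:, None] - a
n_bytes = 1000 * 4004001 * np.dtype(np.float16).itemsize
print(f"{n_bytes / 1e9:.1f} GB")  # roughly 8.0 GB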

Float required in list output

I am trying to create a custom filter to run with the generic filter from the SciPy package.
scipy.ndimage.filters.generic_filter
The problem is that I don't know how to make the returned value a scalar, as generic_filter requires. I read through the threads listed at the bottom, but I can't find a way to make my function work.
The code is this:
import numpy as np
import scipy.ndimage as sc

def minimum(window):
    list = []
    for i in range(window.shape[0]):
        window[i] -= min(window)
        list.append(window[i])
    return list

test = np.ones((10, 10)) * np.arange(10)
result = sc.generic_filter(test, minimum, size=3)
It gives the error:
cval, origins, extra_arguments, extra_keywords)
TypeError: a float is required
Scipy filter with multi-dimensional (or non-scalar) output
How to apply ndimage.generic_filter()
http://ilovesymposia.com/2014/06/24/a-clever-use-of-scipys-ndimage-generic_filter-for-n-dimensional-image-processing/
If I understand correctly, you want to subtract from each pixel the minimum of its horizontal 3-neighbourhood. It's not good practice to do that with lists, because numpy is built for efficiency (~100 times faster). The simplest way to do that is just:
test - sc.generic_filter(test, np.min, size=3)
Then the subtraction is vectorized over the whole array.
You can also do:
test - np.min([np.roll(test, 1), np.roll(test, -1), test], axis=0)
This is about 10 times faster, if you accept the artefact at the border.
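As a side note, scipy.ndimage also provides a dedicated minimum_filter, which should behave like the generic_filter call above but run faster; a quick sketch:
import numpy as np
import scipy.ndimage as sc

test = np.ones((10, 10)) * np.arange(10)
# subtract the local 3x3 minimum from each pixel using the dedicated filter
filtered = test - sc.minimum_filter(test, size=3)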
Using the example in Scipy filter with multi-dimensional (or non-scalar) output I converted your code to:
def minimum(window, out):
    list = []
    for i in range(window.shape[0]):
        window[i] -= min(window)
        list.append(window[i])
    out.append(list)
    return 0

test = np.ones((10, 10)) * np.arange(10)
result = []
sc.generic_filter(test, minimum, size=3, extra_arguments=(result,))
Now your function minimum writes its result to the parameter out, and its return value is no longer used. So the final result list contains all the per-window results collected during filtering, not the output of generic_filter.
Edit 1: When generic_filter is used with a function that returns a scalar, a matrix with the same dimensions as the input is returned. Here, however, a list is appended on each call of the filter, which results in a larger matrix (100x9 in this case).
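If it helps downstream, the collected result can be turned back into an array after the filter has run; a small sketch reusing the names from the snippet above:
# one processed 3x3 window (9 values) per pixel of the 10x10 input
result_arr = np.array(result)
print(result_arr.shape)   # (100, 9)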
