I'm just learning Python, but have decided to do so by recoding and improving an old Java-based school AI project.
My project involved a mathematical operation that is basically a discrete convolution, but without one of the functions being time-reversed (in other words, a cross-correlation).
In my original Java project I wrote all the code for this operation myself. Since I'm now working in Python, which has great math libraries like numpy and scipy, I figured I could just use an existing convolution function like scipy.convolve. However, that requires me to pre-reverse one of the two arrays, so that when scipy.convolve reverses one of the arrays to perform the convolution, it's really un-reversing it. (I also still don't know how to be sure I pre-reverse the right one of the two arrays, so that the two arrays are slid past each other forwards rather than backwards, but I assume I should ask that as a separate question.)
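A minimal 1-D sanity check of this pre-reversal idea (using numpy.convolve and numpy.correlate as stand-ins for whichever convolution routine is ultimately used): convolving with a pre-reversed kernel matches correlating with the original one, which is exactly the "no time reversal" operation described above.
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([1, 0, -1])
# convolve flips its second argument internally; pre-flipping b un-does that
np.array_equal(np.convolve(a, b[::-1], mode='full'),
               np.correlate(a, b, mode='full'))   # True
# pre-flipping a instead would yield the correlation in reversed order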
Unlike my Java code, which only handled one-dimensional data, I want to extend this project to multidimensional data. I have learned that if I have a numpy array of known dimension, such as a three-dimensional array a, I can fully reverse it (or rather get back a view that is reversed, which is much faster) with
a = a[::-1, ::-1, ::-1]
However, this requires me to have a ::-1 for every dimension. How can I perform this same reversal within a method for an array of arbitrary dimension that has the same result as the above code?
You can use np.flip. From the documentation:
numpy.flip(m, axis=None)
Reverse the order of elements in an array along the given axis.
The shape of the array is preserved, but the elements are reordered.
Note: flip(m) corresponds to m[::-1,::-1,...,::-1] with ::-1 at all positions.
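For instance, on a small three-dimensional array, np.flip with the default axis=None (available since numpy 1.15) matches the hand-written version exactly:
import numpy as np
a = np.arange(8).reshape(2, 2, 2)
np.array_equal(np.flip(a), a[::-1, ::-1, ::-1])   # True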
An alternative is to build the reversing slices by hand:
slices = tuple([slice(-1, -n-1, -1) for n in a.shape])
result = a[slices]
This extends to an arbitrary number of axes. Verification:
a = np.arange(8).reshape(2, 4)
slices = tuple([slice(-1, -n-1, -1) for n in a.shape])
result = a[slices]
yields:
>>> a
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
>>> result
array([[7, 6, 5, 4],
       [3, 2, 1, 0]])
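and np.flip gives the identical result:
>>> np.array_equal(np.flip(a), result)
True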
I am new to Python and learning machine learning. I have trouble understanding what this line of code means:
y_scores_knn = y_probas_knn[:,1]
I didn't get what the ,1 inside the square brackets [] means and what it does.
Say for example your data is a numpy array somewhat like this:
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
If you do print(a[:, 0])
it will return [1 4 7], the column with index 0.
So basically the ':' means that you take every row, and the number after the comma picks out a single column. (Note that this tuple indexing does not work on a plain Python list.)
That is how you access the column with index 1. This only works if y_scores_knn is an instance of a class that supports two-dimensional indexing. The one I'm most familiar with is the numpy array, which allows accessing columns, along with many other helpful operations.
If x is a two-dimensional array (a matrix), then x[:,1] is the second column (all elements whose second index is 1). This notation is common in systems that have a concept of a two-dimensional array, such as Octave, MATLAB, and numpy. While it works for numpy arrays, it will not work for a Python list of lists: a list of lists that happen to have the same number of elements is just another one-dimensional list as far as Python is concerned. If you want to work with matrices and you want to use Python, use numpy.
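A short illustration of the difference, assuming x holds the 2-d data in question:
import numpy as np
x = np.array([[1, 2],
              [3, 4],
              [5, 6]])
x[:, 1]                    # array([2, 4, 6]) -- the second column
lst = x.tolist()           # the same data as a plain list of lists
# lst[:, 1]                # TypeError: list indices must be integers or slices
[row[1] for row in lst]    # the list-of-lists equivalent: [2, 4, 6]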
I am trying to randomly select a set of integers in numpy and am encountering a strange error. If I define a numpy array with two sets of different sizes, np.random.choice chooses between them without issue:
Set1 = np.array([[1, 3, 5], [2, 4]])
In: np.random.choice(Set1)
Out: [2, 4]
However, once the sets in the numpy array are the same size, I get a value error:
Set2 = np.array([[1, 3, 5], [2, 4, 6]])
In: np.random.choice(Set2)
ValueError: a must be 1-dimensional
Could be user error, but I've checked several times and the only difference is the size of the sets. I realize I can do something like:
Chosen = np.random.choice(N, k)
Selection = Set[Chosen]
Where N is the number of sets and k is the number of samples, but I'm just wondering if there was a better way and specifically what I am doing wrong to raise a value error when the sets are the same size.
Printout of Set1 and Set2 for reference:
In: Set1
Out: array([list([1, 3, 5]), list([2, 4])], dtype=object)
In: type(Set1)
Out: numpy.ndarray
In: Set2
Out:
array([[1, 3, 5],
       [2, 4, 6]])
In: type(Set2)
Out: numpy.ndarray
Your issue is caused by a misunderstanding of how numpy arrays work. The first example cannot "really" be turned into an array, because numpy does not support ragged arrays: you end up with a one-dimensional array of object references pointing to two Python lists. The second example is a proper 2xN numerical array. I can think of two types of solutions here.
The obvious approach (which would work in both cases, by the way), would be to choose the index instead of the sublist. Since you are sampling with replacement, you can just generate the index and use it directly:
Set[np.random.randint(N, size=k)]
This is the same as
Set[np.random.choice(N, k)]
If you want to choose without replacement, your best bet is to use np.random.choice with replace=False. This is similar to, but less efficient than, shuffling. In either case, you can write a one-liner for the index:
Set[np.random.choice(N, k, replace=False)]
Or:
index = np.arange(Set.shape[0])
np.random.shuffle(index)
Set[index[:k]]
The nice thing about np.random.shuffle, though, is that you can apply it to Set directly, whether it is a one- or many-dimensional array. Shuffling will always happen along the first axis, so you can just take the top k elements afterwards:
np.random.shuffle(Set)
Set[:k]
The shuffling operation works only in-place, so you have to write it out the long way rather than as a one-liner. The index version is also less efficient for large arrays, since you have to create the entire range up front, no matter how small k is.
The other solution is to turn the second example into an array of list objects, like the first one. I do not recommend this unless the only reason you are using numpy is the choice function. In fact, I wouldn't recommend it at all, since you can, and probably should, use Python's standard random module at that point. Disclaimers aside, you can coerce the datatype of the second array to object. Doing so removes any benefit of using numpy, and can't be done directly: simply setting dtype=object will still create a 2D array, just one that stores references to Python int objects instead of primitives. You have to do something like this:
Set = np.zeros(N, dtype=object)
Set[:] = [[1, 2, 3], [2, 4]]
You will now get an object essentially equivalent to the one in the first example, and can therefore apply np.random.choice directly.
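For example, a concrete version of that snippet with N = 2 for the two sublists (the names are just illustrative):
import numpy as np
Set = np.zeros(2, dtype=object)
Set[:] = [[1, 3, 5], [2, 4]]
np.random.choice(Set)   # returns one of the two sublists at random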
Note
I show the legacy np.random methods here because of personal inertia if nothing else. The correct way, as suggested in the documentation I link to, is to use the new Generator API. This is especially true for the choice method, which is much more efficient in the new implementation. The usage is not any more difficult:
Set[np.random.default_rng().choice(N, k, replace=False)]
There are additional advantages, like the fact that you can now choose directly, even from a multidimensional array:
np.random.default_rng().choice(Set2, k, replace=False)
The same goes for shuffle, which, like choice, now allows you to select the axis you want to rearrange:
np.random.default_rng().shuffle(Set)
Set[:k]
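Putting the new API together on the second array from the question (the seed is optional and shown only for reproducibility):
import numpy as np
rng = np.random.default_rng(seed=42)
Set2 = np.array([[1, 3, 5], [2, 4, 6]])
rng.choice(Set2, 1, replace=False)   # picks one full row directly
rng.shuffle(Set2, axis=0)            # shuffles the rows in place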
I have a numpy function that converts a 2D array of x,y coordinates into a flat array of the distances between each coordinate and the previous one (see Numpy - transform 2D array of x,y coordinates into flat array of distance between coordinates).
input = [[-8081441,5685214], [-8081446,5685216], [-8081442,5685219], [-8081440,5685211], [-8081441,5685214]]
output = [-8081441, 5685214, 5, -2, -4, -3, -2, 8, 1, -3]
Thanks to Divakar's answer, I have two numpy functions that do what I want:
arr = np.asarray(input).astype(int)
np.hstack((arr[0], (-np.diff(arr, axis=0)).ravel()))
Another approach, using slicing to replicate the differencing:
arr = np.asarray(input).astype(int)
np.hstack((arr[0], (arr[:-1,:] - arr[1:,:]).ravel()))
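A quick check that both versions reproduce the expected output from the question:
import numpy as np
coords = [[-8081441, 5685214], [-8081446, 5685216], [-8081442, 5685219],
          [-8081440, 5685211], [-8081441, 5685214]]
arr = np.asarray(coords).astype(int)
out1 = np.hstack((arr[0], (-np.diff(arr, axis=0)).ravel()))
out2 = np.hstack((arr[0], (arr[:-1, :] - arr[1:, :]).ravel()))
np.array_equal(out1, out2)   # True
print(out1)                  # [-8081441  5685214  5  -2  -4  -3  -2  8  1  -3]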
My question: is there a way to turn one of these numpy functions into a generator to improve performance? Is it possible to use numpy in a generator?
A Python generator expression is a lazy spinoff of a list comprehension.
In [207]: [i*2 for i in range(3)]
Out[207]: [0, 2, 4]
In [208]: (i*2 for i in range(3))
Out[208]: <generator object <genexpr> at 0xb6a1ffbc>
In [209]: list(_)
Out[209]: [0, 2, 4]
You could think of it as a lazy list: it doesn't actually evaluate the elements until you iterate through it. In Py3, range is lazy in the same way (it was xrange in Py2), though strictly speaking it is a lazy sequence rather than a generator. The In [208] line sets up a generator, but doesn't evaluate anything, so it is fast. But iterating over it in [209] takes just as long as the original at [207]. (Well, there might be minor differences.)
Thus a generator lets you think in blocks, as you would with lists, without creating all the intermediate lists. It's more of a code organization tool than a performance one.
I can't think of anything equivalent when working with numpy arrays.
arr=np.array(input) # creates fixed size array from input list
-np.diff(arr, axis=0) # create another array
This creates a number of intermediate arrays, even a list, and ends up returning an array (and discarding the intermediates):
np.hstack((arr[0],(-np.diff(arr, axis=0)).ravel()))
There are a number of simple building blocks in that expression. Numpy's speed comes from performing those steps in fast compiled code. To get better speed you'd have to rewrite the problem in C or Cython. In that code you can iterate, and perform complex operations at each step.
Conceivably numpy could perform some sort of lazy evaluation, but that would require major low-level coding, and there's no guarantee that it would result in performance improvements.
I looked at the issue of intermediate buffers, and whether add.at improved performance (it doesn't) at: https://stackoverflow.com/a/40688879/901925
This seems like a pretty basic question, but I didn't find anything related to it on stack. Apologies if I missed an existing question.
I've seen some mathematical/linear algebraic reasons why one might want to use numpy vectors "proper" (i.e. ndim 1), as opposed to row/column vectors (i.e. ndim 2).
But now I'm wondering: are there any (significant) efficiency reasons why one might pick one over the other? Or is the choice pretty much arbitrary in that respect?
(edit) To clarify: By "ndim 1 vs ndim 2 vectors" I mean representing a vector that contains, say, numbers 3 and 4 as either:
np.array([3, 4]) # ndim 1
np.array([[3, 4]]) # ndim 2
The numpy documentation seems to lean towards the first case as the default, but like I said, I'm wondering if there's any performance difference.
If you use numpy properly, then no - it is not a consideration.
If you look at the numpy internals documentation, you can see that
Numpy arrays consist of two major components, the raw array data (from now on, referred to as the data buffer), and the information about the raw array data. The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed sized data items. Numpy also contains a significant set of data that describes how to interpret the data in the data buffer.
So, irrespective of the dimensions of the array, all data is stored in a contiguous buffer. Now consider
a = np.array([1, 2, 3, 4])
and
b = np.array([[1, 2], [3, 4]])
It is true that accessing a[1] requires (slightly) fewer operations than b[1, 1] (since translating 1, 1 to a flat index requires some calculation), but for high performance you need vectorized operations anyway.
If you want to sum all elements in the arrays, then in both cases you would use the same thing, a.sum() and b.sum(), and the sum runs over elements in contiguous memory either way. Conversely, if the data is inherently 2d, you can do things like b.sum(axis=1) to sum over rows. Doing that yourself on a 1d array would be error-prone and no more efficient.
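Concretely:
a = np.array([1, 2, 3, 4])
b = np.array([[1, 2], [3, 4]])
a.sum()          # 10
b.sum()          # 10 -- both sums run over the same kind of contiguous buffer
b.sum(axis=1)    # array([3, 7]) -- per-row sums, free with the 2d layout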
So, basically a 2d array, if it is natural for the problem just gives greater functionality, with zero or negligible overhead.
I'm new to python and wanted to do something I normally do in matlab/R all the time, but couldn't figure it out from the docs.
I'd like to slice an array not as 0:3, which includes elements 0, 1, 2, but with an explicit vector of indices such as [0, 3].
For example, say I had this data structure
a = [1, 2, 3, 4, 5]
I'd like the second and fourth elements,
so I thought something like this would work
a[[1, 3]]
but that gives me this error
TypeError: list indices must be integers
This happens for most other data types as well such as numpy arrays
In MATLAB, you could even say a([2, 1]), which would return the second and then the first element.
There is an alternative implementation I am considering, but I think it would be slow for large arrays. At least it would be damn slow in matlab. I'm primarily using numpy arrays.
[ a[i] for i in [1,3] ]
What's the python way oh wise ones?
Thanks!!
NumPy allows you to use lists as indices:
import numpy
a = numpy.array([1, 2, 3, 4, 5])
a[[1, 3]]
Note that this makes a copy instead of a view.
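Since it is a copy, modifying the result leaves the original untouched:
b = a[[1, 3]]   # a new array([2, 4])
b[0] = 99
a               # unchanged: array([1, 2, 3, 4, 5])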
I believe you want numpy.take:
newA = numpy.take(a, [1,3])