Given an array of values, say 300x80, where 300 is the number of samples and 80 is the number of features to keep.
I know that in MATLAB and Python you can use interp1d and the like, but I don't think that works for me in this situation. All I could find were 1D examples.
Is there a way to do interpolation to make this array say 500x80 in Python?
Simple question of 300x80 -> 500x80.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp2d.html
x, y are your matrix indices (row/column index), and z is the value at that position. It returns a function that you can call on all points of a new 500x80 grid.
Of course, this does not make much sense here, since those are sample/feature indices: it just means inventing more samples and extrapolating what their values would look like. Interpolation only makes sense when x (or y) represents several measurements of the same variable, not a sample number.
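A minimal sketch of that resampling, assuming the 300x80 array is called data (note that recent SciPy versions deprecate interp2d, but the call pattern is as documented above):
import numpy as np
from scipy.interpolate import interp2d

data = np.random.rand(300, 80)              # stand-in for the real 300x80 array
rows = np.arange(data.shape[0])             # sample index 0..299
cols = np.arange(data.shape[1])             # feature index 0..79

# interp2d expects z with shape (len(y), len(x)), i.e. (rows, cols)
f = interp2d(cols, rows, data, kind='linear')

new_rows = np.linspace(0, data.shape[0] - 1, 500)
resampled = f(cols, new_rows)               # shape (500, 80)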
I have found the following code:
x=0.3*np.random.randn(100,2)
x_train=np.r_[x+2,x-2]
As far as I can see, x in the first line is an array of 100 rows and two columns (a list-of-lists style layout), and when I call size on it I get 200. In the x_train line np.r_ is used; as far as I know this concatenates arrays, so size now returns 400. However, I can't work out what x+2 and x-2 do here. Why is 2 added in one case and subtracted in the other?
I have read the documentation and still don't have a clue.
The linked scikit-learn example shows how to find two separate classes in 2 dimensions. The code you are asking about generates random x and y coordinate data for those two classes.
np.random.randn generates 100 standard normally-distributed random (x, y) coordinate pairs (i.e. x is a 100x2 matrix). As a side note, the 0.3 multiplier is probably there to decrease the standard deviation and give tighter clusters.
By adding 2 to x (i.e. adding the value 2 to each element of x), they create a set of x and y coordinates scattered closely around (2, 2), and by subtracting 2 from x they create a set scattered around (-2, -2).
np.r_, in this case, is the same as using np.concatenate((x+2, x-2), axis=0), which creates a 200x2 array with 100 observations scattered around (2, 2) and 100 scattered around (-2, -2).
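A quick way to see this is to run the two lines and look at where the points end up (the seed and the prints are only for illustration):
import numpy as np

np.random.seed(0)                       # only so the numbers are repeatable
x = 0.3 * np.random.randn(100, 2)       # 100 (x, y) points scattered tightly around (0, 0)
x_train = np.r_[x + 2, x - 2]           # shift copies to (2, 2) and (-2, -2), then stack

print(x_train.shape)                    # (200, 2)
print(x_train[:100].mean(axis=0))       # roughly [2, 2]
print(x_train[100:].mean(axis=0))       # roughly [-2, -2]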
I've got a multidimensional array that has 1 million sets of 3 points, each point being a coordinate specified by x and y. Calling this array pointVec, what I mean is
np.shape(pointVec) = (1000000,3,2)
I want to find the center of each of the set of 3 points. One obvious way is to iterate through all 1 million sets, finding the center of each set at each iteration. However, I have heard that vectorization is a strong-suit of Numpy's, so I'm trying to adapt it to this problem. Since this problem fits so intuitively with iteration, I don't have a grasp of how one might do it with vectorization, or if using vectorization would even be useful.
It depends on how you define the center of a set of three points. However, if it is the average of the coordinates, as @Quang mentioned in the comments, you can take the mean along a specific axis in numpy:
pointVec.mean(1)
This takes the mean along axis=1 (the second axis, which holds the 3 points) and returns an array of shape (1000000, 2).
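For example, on random data of the same shape (the names here are just for illustration):
import numpy as np

pointVec = np.random.rand(1000000, 3, 2)   # 1M sets of 3 (x, y) points
centers = pointVec.mean(axis=1)            # average the 3 points in each set
print(centers.shape)                       # (1000000, 2)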
I'm trying to optimize code that currently uses nested for loops and calls to scipy functions.
Basically, I have a first function that calls scipy's find_peaks() method, and then I want to interpolate those data points (the peaks) to find a function that describes them. For example, I first find the peaks. The data is basically a 2D array with 25*30 rows (axis 0) of 1000 elements each (axis 1).
arr = np.random.rand(25,30,1000)
arr = arr.reshape((arr.shape[0]*arr.shape[1], arr.shape[2]))
# we have a 25*30 set of 1000 pts each. find peaks for that
peaks = np.apply_along_axis(find_peaks, 1, arr, height=0,)
find_peaks returns something of the form:
peak_indices = peaks[:,0]
peak_values = peaks[:,1]["peak_heights"]
So far so good. That's essentially the (x,y) coordinates of the points I want to interpolate.
Now, I want to interpolate those index-height pairs to obtain a function, using scipy.interpolate.interp1d(...). interp1d's signature is of the form:
interp1d(x, y, kind='linear', axis=-1, copy=True, bounds_error=None, fill_value=nan, assume_sorted=False)
Where x would be my peak_indices, and y my peak_values.
The question:
How can I pass this function two arguments that vary with each slice? In other words, my first use of apply_along_axis only needed a single slice-dependent argument (the 1000 points for each of my 25*30 elements along axis 0). Here, however, I need to pass TWO arguments to the function: the peak_indices and the peak_values. Can any pythonista think of a clever way to unpack those arguments AFTER I pass them to apply_along_axis as tuples or something? Kind of:
arr=*[peak_indices, peak_values]
I cannot really edit the interp1D function itself, which would be my solution if I was going to call my own function...
EDIT: part of the benefit of using apply_along_axis is that I should get performance improvements compared to nested loops, since numpy should be able to bulk-process those calculations. Ideally any solution should use a notation that still allows those optimisations.
Where do you get the idea that apply_along_axis is a performance tool? Does it actually work faster in this case?
arr = np.random.rand(25,30,1000)
arr = arr.reshape((arr.shape[0]*arr.shape[1], arr.shape[2]))
# we have a 25*30 set of 1000 pts each. find peaks for that
peaks = np.apply_along_axis(find_peaks, 1, arr, height=0,)
compared to:
peaks = np.array([find_peaks(x, height=0) for x in arr])
That is a simple iteration over the 25*30 set of 1d arrays.
apply_along_axis does a test calculation to determine the return shape and dtype. It constructs a result array, then iterates over all axes except the chosen one and calls the function on each 1d slice. There's no compiling and no "bulk processing" (whatever that is). It just hides a loop inside a function call.
It does make iterating over 2 axes of a 3d array prettier, but not faster.
You could have used it on the original 3d array to get a (25, 30, 2) result:
peaks = np.apply_along_axis(find_peaks, 2, arr_3d, height=0,)
I'm guessing find_peaks returns a 2 element tuple of values, and peaks will then be an object dtype array.
Since apply_along_axis does not have any performance advantages, I don't see the point to trying to use it with a more complex array. It's handy when you have a 3d array, and a function that takes a 1d input, but beyond that ....?
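As a rough sketch (one reasonable way, not the only one), the two per-slice arguments can simply be produced and consumed inside the same plain loop, building one interp1d per slice; the keyword arguments chosen here are assumptions for illustration:
import numpy as np
from scipy.signal import find_peaks
from scipy.interpolate import interp1d

arr = np.random.rand(25 * 30, 1000)

interpolators = []
for row in arr:
    idx, props = find_peaks(row, height=0)          # this slice's peak indices and properties
    # build one interpolant from this slice's own (x, y) = (indices, heights)
    interpolators.append(interp1d(idx, props["peak_heights"], kind='linear',
                                  bounds_error=False, fill_value=np.nan))

dense = interpolators[0](np.arange(1000))           # evaluate the first slice's interpolant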
I got a np.ndarray with ~3000 trajectories. Each trajectory has x, y and z coordinates and a different length; between 150 and 250 (points in time). Now I want to remove the z coordinate for all of these trajectories.
So arr.shape gives me (3000,) (3000 trajectories), and (for example) arr[0].shape yields (3, 178) (three coordinate axes and 178 values).
I have found multiple explanations for removing lines in 2D-arrays and I found np.delete(arr[0], 2, axis=0) working for me. However, I don't just want to delete the z coordinates for the first trajectory; I want to do this for every trajectory.
If I want to do this with a loop for arr[i] I would need to know the exact length of every trajectory (It doesn't suit my purpose to just create the array with the length of the longest and fill it up with zeroes).
TL;DR: So how do I get from a ndarray with [amountOfTrajectories][3][value] to [amountOfTrajectories][2][value]?
The purpose is to use these trajectories as labels for a neural net that creates trajectories. So I guess that's an entirely new question, but is the shape I'm asking for suitable for use as labels in TensorFlow?
Also: What would have been a better title and some terms to find results for this with google? I just started with Python and I'm afraid I'm missing some keywords here...
If this comes from loadmat, the source is probably a MATLAB workspace with a cell, which contains these matrices.
loadmat has evidently created a 1d array of object dtype (the equivalent of a cell, with squeeze on).
A 1d object array is similar to a Python list: it contains pointers to arrays elsewhere in memory. Most operations on such an array use Python-level iteration, and iterating over the equivalent list (arr.tolist()) is usually faster.
alist = [a[:2,:] for a in arr]
should give you a list of arrays, each of shape (2, n) (n varying). This makes new arrays - but then so does np.delete.
You can't operate on all arrays in the 1d array with one operation. It has to be iterative.
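A small sketch of that, using a made-up object array with trajectories of varying length:
import numpy as np

# build a 1d object array like the one loadmat produces (lengths vary per trajectory)
arr = np.empty(3000, dtype=object)
for i in range(arr.shape[0]):
    n = np.random.randint(150, 251)
    arr[i] = np.random.rand(3, n)        # rows are x, y, z; n time points

alist = [a[:2, :] for a in arr]          # drop the z row from every trajectory
print(arr[0].shape, alist[0].shape)      # e.g. (3, 178) and (2, 178)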
I have a question about linear interpolation in python/numpy.
I have a 4D array with the data (all of it in binary files), arranged in this way:
t - time (let's say each hour for a month = 720)
Z - levels (let's say Z'=7)
Y - data1 (one for each t and Z)
X - data2 (one for each t and Z)
So, I want to obtain new Y and X data for Z'=25 with the same t.
First, I'm having a little trouble with the right way to read my data from the binary files. Second, I have to interpolate the first 3 levels to Z'=15 and the others to the other values.
If anyone has an idea how to do it and can help it will be great.
Thank you for your attention!
You can create different interpolation formulas for different combinations of z' and t.
For example, for z=7 and a specific value of t, you can create an interpolation formula:
formula = scipy.interpolate.interp1d(x, y)
Another one for say z=25 and so on.
Then, given any combination of z and t, you can refer to the specific interpolation formula and do the interpolation.
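A hedged sketch of that idea, with made-up array names and shapes (the real layout of the binary data may differ):
import numpy as np
from scipy.interpolate import interp1d

T, Z, N = 720, 7, 50                      # hypothetical: time steps, z-levels, samples
x = np.random.rand(T, Z, N)               # X-data2 for each t and Z
y = np.random.rand(T, Z, N)               # Y-data1 for each t and Z

# one interpolation formula per (t, z) combination
formulas = {(t, z): interp1d(x[t, z], y[t, z], fill_value="extrapolate")
            for t in range(T) for z in range(Z)}

y_new = formulas[(0, 3)](0.5)             # evaluate the t=0, z=3 formula at x=0.5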
In 2D, for instance, there is bilinear interpolation; the standard example is the unit square with corner values 0, 1, 1 and 0.5, where the interpolated values in between can be shown as a colour gradient.
Then trilinear, and so on...
Follow the pattern and you'll see that you can nest interpolations to any dimension you require...
:)
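As an illustration of the bilinear case, here is a small sketch on the unit square; the assignment of the four corner values is an assumption made for the example:
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# corner values: (0,0)->0, (1,0)->1, (0,1)->1, (1,1)->0.5 (assumed layout)
corners = np.array([[0.0, 1.0],
                    [1.0, 0.5]])
bilinear = RegularGridInterpolator(([0.0, 1.0], [0.0, 1.0]), corners, method='linear')

print(bilinear([[0.5, 0.5]]))   # centre of the square: (0 + 1 + 1 + 0.5) / 4 = 0.625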