np.r_ function with two values - python

I have found the following code:
import numpy as np
x = 0.3 * np.random.randn(100, 2)
x_train = np.r_[x + 2, x - 2]
As far as I can see, x is an array with 100 rows and two columns (it prints like a list of lists), so x.size returns 200. The x_train line then uses np.r_. As far as I know, np.r_ concatenates arrays, so x_train.size returns 400. What I don't understand is what x+2 and x-2 do here. Why is 2 added in one case and subtracted in the other?
I have read the documentation and I still don't have a clue.

The linked scikit-learn example shows how to find two separate classes in 2 dimensions. The code you are asking about generates random x and y coordinate data for those two classes.
np.random.randn(100, 2) generates 100 pairs of x and y coordinates drawn from the standard normal distribution (i.e. x is a 100x2 matrix). Side note: the 0.3 multiplier is probably there to decrease the standard deviation and give tighter clusters.
By adding 2 to x (i.e. adding the value 2 to each element of x), they create a set of x and y coordinates that are closely scattered around (2,2), and by subtracting 2 from x, they create a set of x and y coordinates that are scattered around (-2,-2).
In this case, np.r_ is the same as np.concatenate((x+2, x-2), 0), which creates a 200x2 array: the first 100 rows are points scattered around (2,2) and the last 100 rows are points scattered around (-2,-2).
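A quick way to check the explanation above is to build the array both ways and look at the shapes and cluster centres. This is only a sketch: the fixed seed and the name `same` are assumptions added for the illustration.

```python
import numpy as np

# Fixed seed (an assumption, only so the run is repeatable).
rng = np.random.RandomState(0)
x = 0.3 * rng.randn(100, 2)

# Stack the two shifted copies along the first axis.
x_train = np.r_[x + 2, x - 2]
same = np.concatenate((x + 2, x - 2), axis=0)

print(x.shape, x_train.shape)          # (100, 2) (200, 2)
print(np.array_equal(x_train, same))   # True
print(x_train[:100].mean(axis=0))      # close to [2, 2]
print(x_train[100:].mean(axis=0))      # close to [-2, -2]
```

Adding a scalar to an array broadcasts it to every element, which is why each cluster keeps its spread and only its centre moves.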

Related

Vectorize finding center of sets of points in multidimensional array in Numpy

I've got a multidimensional array that has 1 million sets of 3 points, each point being a coordinate specified by x and y. Calling this array pointVec, what I mean is
np.shape(pointVec) = (1000000,3,2)
I want to find the center of each of the set of 3 points. One obvious way is to iterate through all 1 million sets, finding the center of each set at each iteration. However, I have heard that vectorization is a strong-suit of Numpy's, so I'm trying to adapt it to this problem. Since this problem fits so intuitively with iteration, I don't have a grasp of how one might do it with vectorization, or if using vectorization would even be useful.
It depends on how you define the center of a set of three points. However, if it is the average of the coordinates, as @Quang mentioned in the comments, you can take the mean along a specific axis in numpy:
pointVec.mean(1)
This takes the mean along axis=1 (the second axis, which holds the 3 points) and returns an array of shape (1000000,2).
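As a small sketch of the same idea, the following uses a tiny stand-in array (4 sets instead of 1 million, an assumption made for readability) and compares the vectorized mean with an explicit loop.

```python
import numpy as np

# Stand-in for pointVec: 4 sets of 3 points, each point an (x, y) pair.
pointVec = np.arange(24, dtype=float).reshape(4, 3, 2)

# Vectorized: average over the axis that indexes the 3 points.
centers = pointVec.mean(axis=1)

# Equivalent explicit loop, for comparison only.
loop_centers = np.array([pts.mean(axis=0) for pts in pointVec])

print(centers.shape)                       # (4, 2)
print(np.allclose(centers, loop_centers))  # True
```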

Can I vectorize scipy.interpolate.interp1d

interp1d works excellently for the individual datasets that I have, however I have in excess of 5 million datasets that I need to have interpolated.
I need the interpolation to be cubic and there should be one interpolation per subset.
Right now I am able to do this with a for loop, however, for 5 million sets to be interpolated, this takes quite some time (15 minutes):
interpolants = []
for i in range(5000000):
    interpolants.append(interp1d(xArray[i], interpData[i], kind='cubic'))
What I'd like to do would maybe look something like this:
interpolants = interp1d(xArray, interpData, kind='cubic')
This however fails, with the error:
ValueError: x and y arrays must be equal in length along interpolation axis.
Both my x array (xArray) and my y array (interpData) have identical dimensions...
I could parallelize the for loop, but that would only give me a small increase in speed, I'd greatly prefer to vectorize the operation.
I have also been trying to do something similar over the past few days. I finally managed to do it with np.vectorize, using function signatures. Try with the code snippet below:
import numpy as np
from scipy import interpolate

fn_vectorized = np.vectorize(interpolate.interp1d,
                             signature='(n),(n)->()')
interp_fn_array = fn_vectorized(x[np.newaxis, :, :], y)
x and y are arrays of shape (m x n). The objective is to generate an array of interpolation functions, one for row i of x and row i of y. The array interp_fn_array contains the interpolation functions (its shape is (1 x m)). Note that np.vectorize is essentially a Python-level loop, so this is a convenience rather than a real speed-up.

Remove column from a 3D array with varied length for every first-level index (Python)

I got a np.ndarray with ~3000 trajectories. Each trajectory has x, y and z coordinates and a different length; between 150 and 250 (points in time). Now I want to remove the z coordinate for all of these trajectories.
So arr.shape gives me (3000,) (3000 trajectories), and (for example) arr[0].shape yields (3,178) (three coordinate axes and 178 values).
I have found multiple explanations for removing lines in 2D-arrays and I found np.delete(arr[0], 2, axis=0) working for me. However, I don't just want to delete the z coordinates for the first trajectory; I want to do this for every trajectory.
If I want to do this with a loop for arr[i] I would need to know the exact length of every trajectory (It doesn't suit my purpose to just create the array with the length of the longest and fill it up with zeroes).
TL;DR: So how do I get from a ndarray with [amountOfTrajectories][3][value] to [amountOfTrajectories][2][value]?
The purpose is to use these trajectories as labels for a neural net that creates trajectories. So I guess it's an entirely new question, but is the shape I'm asking for suitable for use as labels in TensorFlow?
Also: What would have been a better title and some terms to find results for this with google? I just started with Python and I'm afraid I'm missing some keywords here...
If this comes from loadmat, the source is probably a MATLAB workspace with a cell, which contains these matrices.
loadmat has evidently created a 1d array of object dtype (the equivalent of a cell, with squeeze on).
A 1d object array is similar to a Python list: it contains pointers to arrays elsewhere in memory. Most operations on such an array use Python iteration. Iterating on the equivalent list (arr.tolist()) is usually faster.
alist = [a[:2,:] for a in arr]
should give you a list of arrays, each of shape (2, n) (n varying). This makes new array objects (views onto the original data; add .copy() if you need independent arrays) - but then np.delete makes new arrays too.
You can't operate on all arrays in the 1d array with one operation. It has to be iterative.
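Here is a minimal sketch of that iteration on a made-up object array. The trajectory lengths and the names `xy_list` and `xy_arr` are assumptions for the illustration; the real array would come from wherever the trajectories were loaded.

```python
import numpy as np

# Hypothetical ragged data: 3 trajectories of shape (3, n) with varying n.
lengths = [178, 150, 250]
arr = np.empty(len(lengths), dtype=object)
for i, n in enumerate(lengths):
    arr[i] = np.random.rand(3, n)

# Keep only the x and y rows of every trajectory.
xy_list = [a[:2, :] for a in arr]

# Optional: pack the results back into a 1d object array.
xy_arr = np.empty(len(xy_list), dtype=object)
for i, a in enumerate(xy_list):
    xy_arr[i] = a

print(xy_arr.shape)                  # (3,)
print([a.shape for a in xy_list])    # [(2, 178), (2, 150), (2, 250)]
```

No trajectory length has to be known in advance, because each slice takes its length from the array it is applied to.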

Understanding scikitlearn PCA.transform function in Python

So I'm currently working on a project that involves Principal Component Analysis (PCA), and I'm attempting to learn it on the fly. Luckily, scikit-learn has a very convenient module, sklearn.decomposition, that seems to do most of the work for you. Before I really start to use it though, I'm trying to figure out exactly what it's doing.
The dataframe I've been testing on looks like this:
   0  1
0  1  2
1  3  1
2  4  6
3  5  3
And when I call PCA.fit() and then view the components I get:
array([[ 0.5172843 ,  0.85581362],
       [ 0.85581362, -0.5172843 ]])
From my rather limited knowledge of PCA, I kind of grasp how this was calculated, but where I get lost is when I then call PCA.transform. This is the output it gives me:
array([[-2.0197033 , -1.40829634],
       [-1.84094831,  0.8206152 ],
       [ 2.95540408, -0.9099927 ],
       [ 0.90524753,  1.49767383]])
Could someone potentially walk me through how it takes the original dataframe and components and transforms it into this new array? I'd like to be able to understand the exact calculations it's doing so that when I scale up I'll have a better sense of what's going on. Thanks!
When you call fit, PCA computes some vectors that you can project your data onto in order to reduce the dimension of your data. Since each row of your data is 2-dimensional, there will be a maximum of 2 vectors onto which data can be projected, and each of those vectors will be 2-dimensional. Each row of PCA.components_ is a single vector onto which things get projected, and it has the same size as the number of columns in your training data. Since you did a full PCA you get 2 such vectors, so you get a 2x2 matrix. The first of those vectors maximizes the variance of the projected data. The second maximizes the variance of what's left after the first projection. Typically one passes a value of n_components that's less than the dimension of the input data, so that you get back fewer rows and have a wide but not tall components_ array.
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data into the vector space that was learned when fit was called. For each row of the data you pass to transform you'll have 1 row in the output and the number of columns in that row will be the number of vectors that were learned in the fit phase. In other words, the number of columns will be equal to the value of n_components you passed to the constructor.
Typically one uses PCA when the source data has lots of columns and you want to reduce the number of columns while preserving as much information as possible. Suppose you had a data set with 100 rows and each row had 500 columns. If you constructed a PCA like PCA(n_components = 10) and then called fit you'd find that components_ has 10 rows, one for each of the components you requested, and 500 columns as that's the input dimension. If you then called transform all 100 rows of your data would be projected into this 10-dimensional space so the output would have 100 rows (1 for each in the input) but only 10 columns thus reducing the dimension of your data.
The short answer to how this is done is that PCA centers the data, computes a Singular Value Decomposition of the centered data, and keeps only the first n_components right singular vectors as the rows of components_. Wikipedia has much more information on the actual linear algebra behind this - it's a bit long for a StackOverflow answer.
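For the 4x2 example in the question, the projection can be reproduced by hand with plain NumPy. This sketch assumes the default settings (no whitening) and simply copies the components printed in the question instead of refitting; the variable names are made up for the illustration.

```python
import numpy as np

# The dataframe from the question, as a plain array.
X = np.array([[1, 2],
              [3, 1],
              [4, 6],
              [5, 3]], dtype=float)

# components_ as printed in the question (copied, not recomputed).
components = np.array([[0.5172843, 0.85581362],
                       [0.85581362, -0.5172843]])

# Step 1: subtract the per-column mean learned during fit.
mean = X.mean(axis=0)          # [3.25, 3.0]
centered = X - mean

# Step 2: project every centered row onto every component vector.
transformed = centered @ components.T
print(transformed)
```

The first output row, for instance, is (1 - 3.25) * 0.5172843 + (2 - 3) * 0.85581362 ≈ -2.0197 for the first component, which matches the transform output shown in the question.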

Interpolate Array to a New Length | Python

Given an array of values say 300x80, where 300 represents the # of samples and 80 represents the features you want to keep.
I know in MATLAB and Python you can do interp1d and such, but I don't think that works for me in this situation. All I could find are 1D examples.
Is there a way to do interpolation to make this array say 500x80 in Python?
Simple question of 300x80 -> 500x80.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp2d.html
x, y are your matrix indices (row/column index), and z is the value at that position. It returns a function that you can call on all points of a new 500x80 grid.
Of course, this does not really make sense here: the row and column indices are sample and variable indices, so interpolating just invents more samples and guesses what their values should look like. Interpolation only works when x (or y) represents several measurements of the same variable (unlike a sample number).
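If the 300 rows really are ordered measurements of the same quantities (time steps, say), one alternative to a 2D interpolant is to resample each column separately with np.interp. The sketch below uses linear interpolation on an evenly spaced axis; the helper name `resample_rows` and the random test data are assumptions, and the caveat above still applies when the rows are independent samples.

```python
import numpy as np

def resample_rows(data, new_len):
    """Linearly resample each column of a 2D array to new_len rows.

    Hypothetical helper: assumes the rows are ordered and evenly spaced.
    """
    old_pos = np.linspace(0.0, 1.0, data.shape[0])
    new_pos = np.linspace(0.0, 1.0, new_len)
    columns = [np.interp(new_pos, old_pos, data[:, j])
               for j in range(data.shape[1])]
    return np.column_stack(columns)

data = np.random.rand(300, 80)      # made-up 300x80 input
resized = resample_rows(data, 500)
print(resized.shape)                # (500, 80)
```

The first and last rows are preserved, and every new row lies between two neighbouring original rows.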
