Resample a categorical numpy array - python

I have a 1 dimensional numpy array labels (say its length is 700k) sampled at 700 Hz. So, it corresponds to 1000 seconds of time series data. The array consists of integers 0 to 3 which stand for some categorical information. Also, the categories rarely change, like 200 seconds of 0, then 150 seconds of 2 and so on...
Now, I would like to convert it to an array of 64 Hz, that is, the new length of the array will be 700k x (64/700) = 64k.
resampledLabels = scipy.signal.resample(labels, 64000)
The problem with the above code is that it makes assumptions about the underlying signal and interpolates, producing values that are not in the original set. I tried rounding them to the nearest integer, but the result also contained a -1, which is outside the range of the actual array.
My problem is, how can I resample the array without making interpolations?

I think you can just use simple numpy slicing, which has the format start:stop:step. This is constant time because it returns a view, so changes you make to the resampled array are reflected in the original.
In your case the step is the downsampling factor, roughly 700/64 ≈ 11, so something like labels[::11]. Since the ratio is not an integer, a fixed step only approximates 64 Hz.
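If the output length needs to be exactly 64000, a minimal sketch of nearest-neighbour downsampling (no interpolation, so the result can only contain the original categories 0-3; labels is the array from the question):
import numpy as np

old_rate, new_rate = 700, 64
new_len = int(len(labels) * new_rate / old_rate)   # 700000 -> 64000
# index of the nearest original sample for each output position
idx = np.round(np.linspace(0, len(labels) - 1, new_len)).astype(int)
resampledLabels = labels[idx]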

Related

Python, fast computation of rolling percentile

Given a multidimensional array, I want to compute a rolling percentile over one of its axes, with the rolling windows truncated near the boundaries of the array. Below is a minimal example implementation using only numpy via np.nanpercentile() applied to stacked, rolled (through np.roll()) arrays. However, the input array may be very large (~ 1 GB or more), so two issues arise:
For the current implementation, the stacked, rolled array may not fit into RAM. This is avoidable with for-loops over all axes unaffected by the rolling, but those may be slow.
Even fully vectorized (as below), the computation time is quite long, understandably due to the sheer amount of computation performed.
Questions: Is there a more efficient python implementation of a rolling percentile (with an axis/axes argument or the like, and with truncated windows near the boundaries)? If not, how could the computation be sped up (and, if possible, without exceeding the RAM)? C code called from Python? Computation of percentiles at fewer "central" points, with approximation in between via (e.g. linear) interpolation (a sketch of this idea follows the timings below)? Other ideas?
Related post (implementing rolling percentiles): How to compute moving (or rolling, if you will) percentile/quantile for a 1d array in numpy? Issues are:
pandas implementation via pd.Series().rolling().quantile() works only for pd.Series or pd.DataFrame objects, not multidimensional (4D or arbitrary D) arrays;
implementation via np.lib.stride_tricks.as_strided() with np.nanpercentile() is similar to the one below and should not be much faster, given that np.nanpercentile() is the speed bottleneck (see below).
Minimal example implementation:
import numpy as np
np.random.seed(100)
# random array of numbers
a = np.random.rand(10000,1,70,70)
# size of rolling window
n_window = 150
# percentile to compute
p = 0.7
# NaN values to prepend/append to array before rolling
nan_temp = np.full(tuple([n_window] + list(np.array(a.shape)[1:])), fill_value=np.nan)
# prepend and append NaN values to array
a_temp = np.concatenate((nan_temp, a, nan_temp), axis=0)
# roll array, stack rolled arrays along new dimension, compute percentile (ignoring NaNs) using np.nanpercentile()
res = np.nanpercentile(np.concatenate([np.roll(a_temp, shift=i, axis=0)[...,None] for i in range(-n_window, n_window+1)],axis=-1),p*100,axis=-1)
# cut away the prepended/appended NaN values
res = res[n_window:-n_window]
Computation times (in seconds) for an example where a has shape (1000,1,70,70) instead of (10000,1,70,70):
create random array: 0.0688176155090332
prepend/append NaN values: 0.03478217124938965
stack rolled arrays: 38.17830514907837
compute nanpercentile: 1145.1418626308441
cut out result: 0.0004646778106689453
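To illustrate the last idea above (percentiles at fewer "central" points plus linear interpolation in between), here is a rough sketch. The function strided_rolling_percentile and its stride parameter are invented for illustration; plain np.percentile is used because the explicitly truncated slices make the NaN padding unnecessary, and the result only approximates the exact rolling percentile between the sampled centres.
import numpy as np

def strided_rolling_percentile(a, n_window, p, stride=25):
    """Rolling percentile (fraction p) along axis 0, computed exactly only at
    every `stride`-th position and linearly interpolated in between; windows
    are truncated at the array boundaries."""
    n = a.shape[0]
    centers = np.arange(0, n, stride)
    # exact percentile at the selected centres only
    vals = np.stack([np.percentile(a[max(0, c - n_window):c + n_window + 1],
                                   p * 100, axis=0) for c in centers])
    # linear interpolation back to full length, one flattened "pixel" at a time
    out = np.empty((n,) + a.shape[1:])
    flat_vals = vals.reshape(len(centers), -1)
    flat_out = out.reshape(n, -1)   # view into out, so writes fill it in place
    for j in range(flat_vals.shape[1]):
        flat_out[:, j] = np.interp(np.arange(n), centers, flat_vals[:, j])
    return out

res_approx = strided_rolling_percentile(a, n_window, p)   # a, n_window, p as above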

How to avoid for-loop while using append()

First of all, I apologize for being an absolute beginner in both python and numpy. Please forgive my ignorance.
I have a 4D cube of pressure measurements whose dimensions are (number of samples, time, y-axis, x-axis), which means that for each sample I have a 3D cube of spatio-temporal data. For each sample, I need to collect the pressure readings of this 3D cube (time, y-axis, x-axis) into an array, but only where the coordinates satisfy a specific condition. The size of this array varies as the condition varies, so I have to use append() to build it. However, for, say, 1000 samples I have to search through more than a million coordinates per sample with for-loops, so the code I have written is pretty inefficient and takes a very long time to run (more than several hours). Can you please help me write it more efficiently?
Below is the code I've tried. It works nicely and gives the expected result, but it is extremely slow.
import numpy as np
# Number of sample points in x,y and t-axis
Nx = 101
Ny = 101
Nt = 100
n_train = 1000
target_array = []
for i_train in range(n_train):
    for k in range(Nt):
        for j in range(Ny):
            for i in range(Nx):
                if np.round(np.sqrt((i - np.round(Nx/2))**2 + (j - np.round(Ny/2))**2)) == 2*k:
                    target_array.append(Pressure[i_train, k, j, i])
Since the condition involves the indexes and not the values of your 4D array, you can vectorize it using numpy.meshgrid.
Here pp is your 4D array:
kv, jv, iv = np.meshgrid(np.arange(pp.shape[1]), np.arange(pp.shape[2]), np.arange(pp.shape[3]), indexing='ij')
selecting = np.round(np.sqrt((iv - np.round(pp.shape[3]/2))**2 + (jv - np.round(pp.shape[2]/2))**2)) == 2*kv
target = pp[:, selecting]
Provided that I've understood correctly how your 4D array is organized:
the arrays created by meshgrid hold the indices used to select pp elements along its last three dimensions (t, y, x); with indexing='ij' they have the same shape as pp.shape[1:].
selecting is a boolean array created by replicating your equation, to check which coordinates satisfy the condition.
target is a selection of pp, taking along axis 0 every element that satisfies the condition (i.e. where selecting is True) on the other three axes.
Note that target is a 2D array; to get a 1D array, use target.flatten().
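As a quick sanity check that the vectorized selection matches the original loops (the array sizes, random data and seed here are arbitrary, chosen small so the loop version finishes quickly):
import numpy as np

Nx, Ny, Nt, n_train = 21, 21, 10, 3
rng = np.random.default_rng(0)
pp = rng.random((n_train, Nt, Ny, Nx))   # same layout as the question: (sample, t, y, x)

# vectorized selection
kv, jv, iv = np.meshgrid(np.arange(Nt), np.arange(Ny), np.arange(Nx), indexing='ij')
selecting = np.round(np.sqrt((iv - np.round(Nx/2))**2 + (jv - np.round(Ny/2))**2)) == 2*kv
target = pp[:, selecting]

# loop version from the question
expected = []
for i_train in range(n_train):
    for k in range(Nt):
        for j in range(Ny):
            for i in range(Nx):
                if np.round(np.sqrt((i - np.round(Nx/2))**2 + (j - np.round(Ny/2))**2)) == 2*k:
                    expected.append(pp[i_train, k, j, i])

assert np.allclose(target.ravel(), expected)   # same values, same order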

Understanding scikitlearn PCA.transform function in Python

so I'm currently working on a project that involves the use of Principal Component Analysis, or PCA, and I'm attempting to kind of learn it on the fly. Luckily, Python has a very convenient module from scikitlearn.decomposition that seems to do most of the work for you. Before I really start to use it though, I'm trying to figure out exactly what it's doing.
The dataframe I've been testing on looks like this:
   0  1
0  1  2
1  3  1
2  4  6
3  5  3
And when I call PCA.fit() and then view the components I get:
array([[ 0.5172843 , 0.85581362],
[ 0.85581362, -0.5172843 ]])
From my rather limited knowledge of PCA, I kind of grasp how this was calculated, but where I get lost is when I then call PCA.transform. This is the output it gives me:
array([[-2.0197033 , -1.40829634],
[-1.84094831, 0.8206152 ],
[ 2.95540408, -0.9099927 ],
[ 0.90524753, 1.49767383]])
Could someone potentially walk me through how it takes the original dataframe and components and transforms it into this new array? I'd like to be able to understand the exact calculations it's doing so that when I scale up I'll have a better sense of what's going on. Thanks!
When you call fit, PCA computes a set of vectors that you can project your data onto in order to reduce its dimension. Since each row of your data is 2-dimensional, there is a maximum of 2 vectors onto which data can be projected, and each of those vectors is 2-dimensional. Each row of PCA.components_ is a single vector onto which things get projected, and it has the same size as the number of columns in your training data. Since you did a full PCA you get 2 such vectors, so you get a 2x2 matrix. The first of those vectors maximizes the variance of the projected data. The second maximizes the variance of what's left after the first projection. Typically one passes a value of n_components that is less than the dimension of the input data, so you get back fewer rows and components_ is a wide but not tall array.
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data into the vector space that was learned when fit was called. For each row of the data you pass to transform, you'll get one row in the output, and the number of columns in that row will be the number of vectors learned in the fit phase. In other words, the number of columns will be equal to the value of n_components you passed to the constructor.
Typically one uses PCA when the source data has lots of columns and you want to reduce the number of columns while preserving as much information as possible. Suppose you had a data set with 100 rows and each row had 500 columns. If you constructed a PCA like PCA(n_components = 10) and then called fit you'd find that components_ has 10 rows, one for each of the components you requested, and 500 columns as that's the input dimension. If you then called transform all 100 rows of your data would be projected into this 10-dimensional space so the output would have 100 rows (1 for each in the input) but only 10 columns thus reducing the dimension of your data.
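A tiny sketch of those shapes (the random matrix is just a stand-in for such a data set):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 500)           # 100 rows, 500 columns
pca = PCA(n_components=10).fit(X)
print(pca.components_.shape)           # (10, 500): one row per requested component
print(pca.transform(X).shape)          # (100, 10): every row projected into 10 dimensions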
The short answer to how this is done is that PCA computes a Singular Value Decomposition and then keeps only some of the columns of one of those matrices. Wikipedia has much more information on the actual linear algebra behind this - it's a bit long for a StackOverflow answer.
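As for how transform() produced the array in the question: with the default settings (no whitening), it centers the data with the fitted mean and projects it onto the rows of components_. A minimal sketch reproducing it by hand with the question's numbers:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1., 2.],
              [3., 1.],
              [4., 6.],
              [5., 3.]])

pca = PCA(n_components=2).fit(X)

manual = (X - pca.mean_) @ pca.components_.T   # center, then project onto the components
print(np.allclose(manual, pca.transform(X)))   # True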

Integer data in -> averaged -> float data out in python for large files

I have a large amount of data from a sensor, represented as integers from 0-255: each reading is 2048 values long, and there are at least 10000 such readings stored as the rows of one 2D numpy array. I wish to average all of the rows together to obtain a single 2048-long array of floats and plot it. Simple, right?
When there were fewer than 1000 rows, my graph looked beautiful and not very quantized at all. The averaging was obvious.
When there were more than 10000 rows, my graph looked worse: more quantized-looking than the average of the smaller array. Even though it was all made of floats, it was so close to integers it hurt.
What I'm asking is: why might that be the case? Averaging should "smooth out" sensor measurements, yet they are noisier (and more quantized) when I take a longer data sample!
Here's an example of my current code:
import matplotlib.pyplot as plt
import numpy as np
lower_bound=0
upper_bound=2048
# this loads data into raw_array as [n rows][2048 columns]
raw_array = np.loadtxt('raw_data.txt', dtype=int)
avg_array = np.mean(raw_array, 0)  # average over axis 0, i.e. down the rows, giving one value per column
x_inc = np.arange(lower_bound, upper_bound)
plt.plot(x_inc[lower_bound:upper_bound], avg_array[lower_bound:upper_bound])
plt.show()
The problem was that the data I was averaging contained an error caused by the data acquisition program: more than half of the frames collected from the sensor were identical repeats. This caused the output to approximate a single frame of data rather than to smooth out the large data set I thought I had.
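In case it helps anyone hitting the same symptom, a quick way to check for such repeated frames (assuming the same raw_data.txt layout as above):
import numpy as np

raw_array = np.loadtxt('raw_data.txt', dtype=int)
n_unique = np.unique(raw_array, axis=0).shape[0]   # number of distinct rows (frames)
print(raw_array.shape[0], "frames loaded,", n_unique, "distinct")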

Interpolate Array to a New Length | Python

Given an array of values, say 300x80, where 300 is the number of samples and 80 is the number of features you want to keep.
I know in MATLAB and Python you can do interp1d and such, but I don't think that works for me in this situation. All I could find are 1D examples.
Is there a way to do interpolation to make this array say 500x80 in Python?
Simple question of 300x80 -> 500x80.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp2d.html
x, y are your matrix indices (row/column index), and z is the value at that position. It returns a function that you can call on all points of a new 500x80 grid.
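A rough sketch of that approach (interp2d is deprecated in recent SciPy releases, so this assumes a version where it is still available; the random array is a stand-in for the 300x80 data):
import numpy as np
from scipy.interpolate import interp2d

a = np.random.rand(300, 80)                   # 300 samples x 80 features
x = np.arange(a.shape[1])                     # column (feature) index
y = np.arange(a.shape[0])                     # row (sample) index
f = interp2d(x, y, a, kind='linear')          # z is indexed as z[y, x]

y_new = np.linspace(0, a.shape[0] - 1, 500)   # 500 rows spanning the same range
a_new = f(x, y_new)                           # shape (500, 80)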
Of course it does not make much sense here, since the rows are sample indices rather than measurements of a continuous variable, so interpolating just means inventing more samples and extrapolating what their values should look like. Interpolation only works along an axis that represents several measurements of the same variable (unlike a sample number).
