Pytorch and numpy least squares with an intercept: performance complications - python

I am performing regression analysis on some reasonably large vectors (for now, working with numpy and other scientific tools is ok if I leave the computer working overnight), but they will eventually grow by several factors, so I was looking to improve performance by moving the implementation to pytorch.
The regression is fairly simple. I have 2 matrices, predictions and betas, with dimensions (750, 6340) and (750, 4313) respectively. The least squares solution I am looking for is predictions * x = betas, where x would have dimensions (6340, 4313), but I have to account for intercepts in the regression. With numpy I solved this by iterating through the second dimension in predictions, creating a matrix with each column plus a column of ones, and passing that as the first argument:
for candidate in range(0, predictions.shape[1]):  # each column is a candidate
    prediction = predictions[:, candidate]
    # allow for an intercept by adding a column with ones
    prediction = np.vstack([prediction, np.ones(prediction.shape[0])]).T
    sol = np.linalg.lstsq(prediction, betas, rcond=-1)
Question number 1 would be: is there a way to avoid iterating over each candidate in order to allow the least squares calculation to account for an intercept? That would improve computation time by a lot.
I tried using statsmodels.regression.linear_model.ols, which allows for this by default (you can add a -1 to the formula if you want it removed), but using this approach either forces me to iterate over each candidate (using apply was appealing but didn't really improve computation time noticeably) or there is something I'm missing. Question 1.5 then: can I use this tool in such a way, or is that all there is to it?
Similarly in pytorch I would do
t_predictions = torch.tensor(predictions, dtype=torch.float)
t_betas_roi = torch.tensor(betas, dtype=torch.float)
t_sol = torch.linalg.lstsq(t_predictions, t_betas_roi)
And it's fast indeed, but I'm missing the intercept here. I reckon that doing this with numpy instead of iterating as I do would also be much faster, but either way, if question 1 has a solution I imagine it could be similarly applied here, right?
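For completeness, the closest torch analogue of my numpy loop that I can think of looks something like this sketch (it still iterates per candidate, so it doesn't solve question 1 by itself; it just shows where the intercept column would go, reusing the tensors above):

# sketch: per-candidate design matrix with an intercept column
ones = torch.ones(t_predictions.shape[0], 1)
for candidate in range(t_predictions.shape[1]):
    A = torch.cat([t_predictions[:, candidate:candidate + 1], ones], dim=1)
    t_sol = torch.linalg.lstsq(A, t_betas_roi).solution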

Related

How to perform Weighted dimensionality reduction with Umap

The title pretty much says it all: I have a df with 40+ dimensions which I'd like to feed into the UMAP algorithm in order to get a 2-d output.
I would like to know if it is possible to weight the input columns differently, for the purpose of studying the possible different UMAP outcomes.
Thank you for your time
P.S. I work in python
Why not simply apply UMAP to A:
A = X*W
where X is your Nx40 matrix and W=diag(w) is a 40x40 diagonal matrix of weights w=[w1, w2,..., w40]?
Consider using normalized weights wi, i = 1, 2, ..., 40 such that sum(w) == 1, so the weights express the relative importance of each column on a common scale.
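A minimal sketch of this column-weighting idea, assuming the umap-learn package (the data and weights below are placeholders):

import numpy as np
import umap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))     # stand-in for your Nx40 dataframe values
w = np.full(40, 1.0 / 40)          # normalized weights, sum(w) == 1

A = X * w                          # same as X @ np.diag(w), without building the 40x40 matrix
embedding = umap.UMAP(n_components=2).fit_transform(A)
print(embedding.shape)             # (500, 2)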

Rolling window PCA in python

I'm wondering if anyone knows how to implement a rolling/moving window PCA that reuses the calculated PCA when adding and removing measurements.
The idea is that I have a large set of data (measurements) over a very long time, and I would like to have a moving window (say, 200 days) starting at the beginning of my dataset; at each step I include the next day's measurement and drop the oldest one, so my window is always 200 days long. However, I would not like to simply recalculate the PCA each time.
Is it possible to make an algorithm that is more efficient than simply calculating the PCA for each window independently? Thanks in advance!
A complete answer depends on a lot of factors. I'll cover what I think are the most important such factors, and hopefully that'll be enough information to point you in the right direction.
First, directly answering your question, yes it is possible to make an algorithm that is more efficient than simply calculating the PCA for each window independently.
Improving the Naive PCA Algorithm (low-dimensional inputs)
As a first pass at the problem, let's assume that you're doing a naive PCA calculation with no normalization (i.e., you're leaving the data alone, computing the covariance matrix, and finding that matrix's eigenvalues/eigenvectors).
When faced with an input matrix X whose PCA we want to compute, the naive algorithm first computes the covariance matrix W = X.T @ X. Once we've computed that for some window of 200 elements, we can cheaply add or remove elements from consideration by adding or removing their contribution to the covariance.
"""
W: shape (p, p)
row: shape (1, p)
"""
def add_row(W, row):
return W + (row.T # row)
def remove_row(W, row):
return W - (row.T # row)
Your description of a sliding window is equivalent to removing a row and adding a new one, so we can quickly compute a new covariance matrix using O(p^2) computations rather than the O(n p^2) a typical matrix multiply would take (with n==200 for this problem).
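For example, sliding the window forward by one day is just one remove_row followed by one add_row (a small sanity-check sketch of my own):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(201, 5))        # 201 days of 5-dimensional measurements

W = X[:200].T @ X[:200]              # covariance-style matrix for the first 200-day window
W = remove_row(W, X[0:1])            # drop the oldest day
W = add_row(W, X[200:201])           # add the newest day

# check against recomputing from scratch
assert np.allclose(W, X[1:201].T @ X[1:201])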
The covariance matrix isn't the final answer though, and we still need to find the principal components. If you aren't hand-rolling the eigensolver yourself there isn't a lot to be done -- you'll pay the cost for new eigenvalues and eigenvectors every time.
However, if you are writing your own eigensolver, most such methods accept a starting input and iterate till some termination condition (usually a max number of iterations or if the error becomes low enough, whichever you hit first). Swapping out a single data point isn't likely to drastically alter the principal components, so for typical data one might expect that re-using the existing eigenvalues/eigenvectors as inputs into the eigensolver would allow you to terminate in far fewer iterations than when starting from randomized inputs, affording an additional speedup.
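To make the warm-start idea concrete, here is a hedged sketch using plain power iteration for the leading eigenvector (my own illustration; the same reasoning carries over to whatever eigensolver you roll yourself):

import numpy as np

def power_iteration(W, v0, tol=1e-10, max_iter=1000):
    """Return the leading eigenpair of a symmetric PSD matrix W, starting from guess v0."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        w = W @ v
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:
            break
        v = w
    return v @ W @ v, v   # eigenvalue estimate (Rayleigh quotient), eigenvector

# When the window slides, pass the previous eigenvector as v0 instead of a random
# vector; after swapping a single data point it is usually already close to the
# answer, so the loop tends to terminate in far fewer iterations.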
Improving Covariance-Free Algorithms (high-dimensional inputs)
Usually (maybe always?), covariance-free PCA algorithms have some kind of iterated solver (much like an eigensolver), but they have computational shortcuts that allow finding eigenvalues/eigenvectors without explicitly materializing the covariance matrix.
Any individual such method might have additional tricks that allow you to save some information from one window to the next, but in general one would expect that you can reduce the total number of iterations simply by re-using the existing principal components instead of using random inputs to start the solver (much like in the eigensolver case above).
Window Normalization w/ Naive Algorithm
Supposing you're normalizing each window to have a mean of 0 in each column (common in PCA), you'll have some additional work when modifying the covariance matrix.
First I'll assume you already have a rolling mechanism for keeping track of any differences that need to be applied from one window to the next. If not, consider something like the following:
"""
We're lazy and don't want to handle a change in sample
size, so only work with row swaps -- good enough for
a sliding window.
old_row: shape (1, p)
new_row: shape (1, p)
"""
def replaced_row_mean_adjustment(old_row, new_row):
return (new_row - old_row)/200. # whatever your window size is
The effect on the covariance matrix isn't too bad to compute, but I'll put some code here anyway.
"""
W: shape (p, p)
center: shape (1, p)
exactly equal to the mean diff vector we referenced above
X: shape (200, p)
exactly equal to the window you're examining after any
previous mean centering has been applied, but before the
current centering has happened. Note that we only use
its row and column sums, so you could get away with
a rolling computation for those instead, but that's
a little more code, and I want to leave at least some
of the problem for you to work on
"""
def update_covariance(W, center, X):
result = W
result -= center.T # np.sum(X, axis=0).reshape(1, -1)
result -= np.sum(X, axis=1).reshape(-1, 1) # center
result += 200 * center.T # center # or whatever your window size is
return result
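A quick sanity check of the update above (my own): if W = X.T @ X and every row of X is shifted by the mean-diff vector c, the updated matrix should equal (X - c).T @ (X - c).

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))           # one 200-row window
c = X.mean(axis=0, keepdims=True)       # pretend this is the mean diff to remove
W = X.T @ X

assert np.allclose(update_covariance(W, c, X), (X - c).T @ (X - c))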
Rescaling to have a standard deviation of 1 is also common in PCA. That's pretty easy to accommodate as well.
"""
Updates the covariance matrix assuming you're modifing a window
of data X with shape (200, p) by multiplying each column by
its corresponding element in v. A rolling algorithm to compute
v isn't covered here, but it shouldn't be hard to figure out.
W: shape (p, p)
v: shape (1, p)
"""
def update_covariance(W, v):
return W * (v.T # v) # Note that this is element-wise multiplication of W
Window Normalization w/ Covariance-free Algorithm
The tricks available here will vary quite a bit depending on the algorithm you're using, but the general strategy I'd try first is to use a rolling algorithm to keep track of the mean and standard deviation of each column for the current window, and to modify the iterative solver to take that into account (i.e., given a window X you want to iterate on the rescaled window Y -- substitute Y = a*X + b into the iterative algorithm of your choice and simplify symbolically, hopefully yielding a version with only a small additional constant cost).
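One possible concrete form of that rolling bookkeeping (my own sketch, independent of any particular covariance-free solver; swap_row_stats and its signature are made up for illustration):

import numpy as np

def swap_row_stats(mean, std, old_row, new_row, n=200):
    """Update per-column mean and standard deviation of an n-row window when
    old_row is replaced by new_row, using var = E[x^2] - E[x]^2.
    mean, std, old_row, new_row: shape (1, p)."""
    col_sums = mean * n + (new_row - old_row)
    sum_sq = (std**2 + mean**2) * n + (new_row**2 - old_row**2)
    new_mean = col_sums / n
    new_var = sum_sq / n - new_mean**2   # for very long runs, prefer a more numerically stable scheme
    return new_mean, np.sqrt(new_var)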
As before you'll want to re-use any principal components you find instead of using a random initialization vector for each window.

More efficient way for multidimensional state-action space tiling than with np.meshgrid?

First of all, this is for practice and comparison; I know there are more efficient ways to tile state space than with a linear grid.
To run some reinforcement learning algorithm, I would like to tile my state and action space linearly. As a result I want to have every state-action pair in array form. The problem is, there are different (gym) environments with different state- and action-space dimensions, so I don't want to hard-code variables or dimensions.
So I need to calculate every state-action pair given only the min and max for each.
I've mostly solved the easy problems, but none of the solutions are "pretty".
First, let's compute the state and action space. Tile the area with linspace from min to max. I've given the variables for one random test environment.
import numpy as np
NOF_ACTION_SPACE_TILES = 20
NOF_STATE_SPACE_TILES = 10
action_low = np.array([-2])
action_high = np.array([2])        # assumed here: bounds symmetric to action_low, so the snippet runs
state_low = np.array([-1, -1, -8])
state_high = np.array([1, 1, 8])   # assumed here: bounds symmetric to state_low, so the snippet runs
action_space = np.vstack([*[x.flatten() for x in (np.meshgrid(*(np.linspace(action_low, action_high, NOF_ACTION_SPACE_TILES).T)))]]).T
state_space = np.vstack([*[x.flatten() for x in (np.meshgrid(*(np.linspace(state_low, state_high, NOF_STATE_SPACE_TILES).T)))]]).T
That works as intended and gives all the possible combinations for the states and actions on their own. Is there any way to do this in a more straightforward fashion? I needed to use *[...] twice, because np.meshgrid returns multiple matrices and I'm trying to flatten the vectors.
Now to the funny part...
In the end I want to have every possible state-action pair: every state with every action. This is coded pretty fast with for loops, but well... numpy and for loops are no speedy friends.
So here's my workaround, which works for a 1D action space:
s_s, a_s = np.meshgrid(state_space, action_space)
state_action_space = np.concatenate((
    s_s.reshape(-1, state_space.shape[1]),
    a_s.reshape(state_space.shape[1], action_space.shape[1], -1)[0].T), axis=1)
With state_space.shape[1] being the dimension of a single state / action.
One problem being that np.meshgrid returns a_s for each of the 3 state-space dimensions, and reshaping it as above does not work, because we need to reshape the states to 3xn and the actions to 1xn.
This is even worse than the code above, but works for now. Does anyone have suggestions on how to use meshgrid or something else properly and fast?
In the end, for the second step, it's just a combination of every row of the two matrices. There has to be a better way...
Thanks to both answers above, here are my final results.
I still had to use *() to unpack the linspace result for meshgrid, but it looks more human-readable now.
The big issue with the state-action code before was that I tried to overcomplicate it. It's just copying the arrays on top of each other. So just copy (or tile, in this case) the state-space array as often as there are different actions in your action space, which is ACTION_SPACE_SIZE^(action dims).
action_space = np.stack(np.meshgrid(*(np.linspace(env.action_space.low, env.action_space.high, ACTION_SPACE_SIZE)).T), -1).reshape(-1, env.action_space.shape[0])
state_space = np.stack(np.meshgrid(*(np.linspace(env.observation_space.low, env.observation_space.high, STATE_SPACE_SIZE)).T), -1).reshape(-1, env.observation_space.shape[0])
state_action_space = np.concatenate((
    np.tile(state_space, (action_space.shape[0])).reshape(-1, state_space.shape[1]),
    np.tile(action_space, (state_space.shape[0], 1))
), axis=1)
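For reference, an equivalent way to build the pairing (a sketch of my own) that skips the tile-then-reshape step by repeating state rows directly:

state_action_space = np.concatenate((
    np.repeat(state_space, action_space.shape[0], axis=0),   # each state once per action
    np.tile(action_space, (state_space.shape[0], 1)),        # the whole action grid once per state
), axis=1)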

Normalizations in sklearn and their differences

I have read many articles that suggest this formula
N = (x - min(x))/(max(x)-min(x))
for normalization, but when I dug into sklearn's Normalizer I found it uses this formula:
x / np.linalg.norm(x)
The latter uses the l2 norm by default. Which one should I use? Why is there a difference between the two?
There are different normalization techniques and sklearn provides many of them. Please note that we are looking at 1d arrays here; for a matrix most of these operations are applied to each column (have a look at this post for an in-depth example: Scaling features for machine learning). Let's go through some of them:
Scikit-learn's MinMaxScaler performs (x - min(x))/(max(x)-min(x)). This scales your array in such a way that you only have values between 0 and 1. It can be useful if you want to apply some transformation afterwards where no negative values are allowed (e.g. a log-transform, or scaling RGB pixels as done in some MNIST examples).
Scikit-learn's StandardScaler performs (x - x.mean())/x.std(), which centers the array around zero and scales by the standard deviation of the features. This is a standard transformation and is applicable in many situations, but keep in mind that you will get negative values. It is especially useful when you have Gaussian-sampled data which is not centered around 0 and/or does not have unit variance.
Scikit-learn's Normalizer performs x / np.linalg.norm(x). This sets the length of your array/vector to 1. Note that, unlike the scalers above, Normalizer works on each row (sample) of a matrix rather than each column. It might come in handy if you want to do some linear algebra, for example implementing the Gram-Schmidt algorithm.
Scikit-learn's RobustScaler can be used to scale data with outliers. Mean and standard deviation are not robust to outliers, so this scaler uses the median and scales the data to quantile ranges.
There are other non-linear transformations like QuantileTransformer, which scales by quantile ranges, and PowerTransformer, which maps any distribution to something similar to a Gaussian distribution.
There are many other normalizations used in machine learning, and their vast number can be confusing. The idea behind normalizing data in ML is usually that you don't want your model to treat one feature differently than others simply because it has a higher mean or a larger variance. For most standard cases I use MinMaxScaler or StandardScaler, depending on whether scaling according to the variance seems important to me.
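A small side-by-side sketch of the first three on a toy single-feature example (values picked arbitrarily):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

x = np.array([[1.0], [2.0], [0.0], [-1.0]])        # one feature, four samples

print(MinMaxScaler().fit_transform(x).ravel())      # [0.667 1.    0.333 0.   ]
print(StandardScaler().fit_transform(x).ravel())    # zero mean, unit standard deviation
print(Normalizer().fit_transform(x.T).ravel())      # the vector scaled to unit L2 norm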
np.linalg.norm (with its default Frobenius norm for matrices) is given by:
np.linalg.norm(X) = sqrt(sum_i_j(abs(x_i_j)^2))
so let's assume you have:
X = (1  2
     0 -1)
then with this you would have:
np.linalg.norm(X) = sqrt(1 + 4 + 0 + 1) = sqrt(6) ≈ 2.45
X / np.linalg.norm(X) = (0.41  0.82
                         0    -0.41)
with the other approach you would have:
min(X) = -1
max(X) = 2
max(X) - min(X) = 3
(X - min(X)) / (max(X) - min(X)) = (0.67  1
                                    0.33  0)
The (x - min(x))/(max(x) - min(x)) approach is what MinMaxScaler does: all values are always between 0 and 1. The norm-based approach rescales your values, but you can still end up with negative values. Depending on your next steps you need to decide which one to use.
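You can check both calculations quickly with numpy:

import numpy as np

X = np.array([[1.0, 2.0], [0.0, -1.0]])

print(np.linalg.norm(X))                     # 2.449... == sqrt(6)
print(X / np.linalg.norm(X))                 # [[ 0.408  0.816] [ 0.    -0.408]]
print((X - X.min()) / (X.max() - X.min()))   # [[ 0.667  1.   ] [ 0.333  0.   ]]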
Based on the API description
Scikit-learn normalizer scales input vectors individually to a unit norm (vector length).
That is why it uses the L2 norm by default (you can also use the L1 norm, as explained in the API).
I think you are looking for a scaler instead of a normalizer by your description. Please find the Min-Max scaler in this link.
Also, you can consider StandardScaler, which standardizes values by removing the mean and scaling to unit standard deviation.

python sklearn: what is the difference between "sklearn.preprocessing.normalize(X, norm='l2')" and "sklearn.svm.LinearSVC(penalty='l2')"

Here are two methods of normalization:
1: this one is used in data pre-processing: sklearn.preprocessing.normalize(X, norm='l2')
2: the other one is used in the classifier: sklearn.svm.LinearSVC(penalty='l2')
I want to know: what is the difference between them? Do both steps have to be used in a complete model, or is using just one of them enough?
These two are different things, and you normally need them both in order to make a good SVC model.
1) The first one means that in order to scale (normalize) the X data matrix you divide each sample (row) by its L2 norm, sqrt(sum(abs(X[i, :])**2)) (by default sklearn.preprocessing.normalize works per sample, i.e. axis=1). This keeps any single sample from dominating simply because its raw values are large, which can make it hard for some algorithms to converge.
2) Irrespective of how well scaled (and small in values) your data is, there may still be outliers or features (j) that are way too dominant, and your algorithm (LinearSVC()) may over-trust them when it shouldn't. This is where L2 regularization comes into play: apart from the data-fitting term the algorithm minimizes, a cost proportional to sum_j(beta[j]^2) is applied to the coefficients so that they don't become too big. In LinearSVC the trade-off is controlled by the parameter C, which multiplies the data loss, so a smaller C means stronger regularization.
To sum up, the first one tells you by what value to divide each sample of the X matrix; the second decides how much the coefficients themselves should burden the cost function.
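To make the "you normally need both" point concrete, here is a hedged sketch of one way to combine them using scikit-learn's standard API (the dataset is synthetic, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = make_pipeline(
    Normalizer(norm='l2'),            # step 1: scale each sample to unit length
    LinearSVC(penalty='l2', C=1.0),   # step 2: l2-regularized linear SVM
)
model.fit(X, y)
print(model.score(X, y))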
