Scipy Detrend in python - python

I detrended my data in Python using scipy.signal.detrend with the following code:
detrended = signal.detrend(feature, axis=-1, type='constant', bp=0, overwrite_data=True)
np.savetxt('constant detrend.csv', detrended, delimiter=',', fmt='%s')
The last line saves the data to a CSV file, which I then reload to run some models. I found that my RandomForest model performs really well on the detrended dataset.
The next step is to make forecasts with this model. However, I am unsure how to get from the detrended dataset back to values I can interpret. As I understand it, detrend removed the mean and normalized the data. When I make predictions, though, I need to see the actual numbers of my forecasts, not the detrended ones.
Is there a way to re-add the mean and renormalize to get a meaningful dataset that I can interpret? For example, my dataset has a rainfall variable, so for each month I can see how much it rained. After detrending, the rainfall value is no longer the actual rainfall. When I make forecasts I want to be able to say that it rained 200 mm in a given month, but my forecasts don't tell me this because the data has been detrended.
Any help would be appreciated.

According to the docs, detrend simply removes the least squares line fit from the data. When you use type='constant', it's even simpler, since it just removes the mean:
If type == 'constant', only the mean of data is subtracted.
The source code bears this out. After checking the inputs, the entire computation is done in one line (scipy/signal/signaltools.py, line 3261):
ret = data - np.expand_dims(np.mean(data, axis), axis)
The easiest way to get the subtracted mean is to implement the calculation by hand, given how simple it is.
mean = np.mean(feature, axis=-1, keepdims=True)
detrended = feature - mean
You can save the mean to a file, or do whatever else you want with it. To "retrend", just add the mean back:
point = prediction + mean
If you had some other manipulation you were concerned with, like normalizing to the maximum, you could handle it the same way.
peak = np.amax(detrended, axis=-1, keepdims=True)
detrended /= peak
In this case you'd have to multiply before offsetting to retrend:
point = prediction * peak + mean
Simple manipulations like this are easy to reproduce by hand. A more complicated function might be hard to reproduce reliably, but would also be more likely to return the parameters it uses, at least optionally.
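The round trip described in this answer can be sketched end to end as follows. This is only an illustration under stated assumptions: the rainfall numbers are made up, and `prediction` is a stand-in that reuses the last normalized value of each series rather than a real model output.

```python
import numpy as np

# Hypothetical data: 2 series (rows) of 6 monthly rainfall values in mm.
feature = np.array([[120.0, 80.0, 200.0, 150.0, 90.0, 60.0],
                    [30.0, 45.0, 10.0, 25.0, 60.0, 40.0]])

# Same operation as detrend(..., type='constant'): subtract the mean along the last axis.
mean = np.mean(feature, axis=-1, keepdims=True)
detrended = feature - mean

# Optional second step from the answer: normalize to the maximum.
peak = np.amax(detrended, axis=-1, keepdims=True)
normalized = detrended / peak

# Stand-in for a model output in normalized units (here: the last known value).
prediction = normalized[:, -1:]

# "Retrend": undo the steps in reverse order, multiply first, then add the mean.
restored = prediction * peak + mean
print(restored.ravel())
```

Keeping `mean` and `peak` (for example by saving them next to the CSV file) is what makes the inversion possible; the detrended values alone do not carry that information.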

Related

Correct way of normalizing and scaling the MNIST dataset

I've looked everywhere but couldn't quite find what I want. Basically the MNIST dataset has images with pixel values in the range [0, 255]. People say that in general, it is good to do the following:
Scale the data to the [0,1] range.
Normalize the data to have zero mean and unit standard deviation (data - mean) / std.
Unfortunately, no one ever shows how to do both of these things. They all subtract a mean of 0.1307 and divide by a standard deviation of 0.3081. These values are basically the mean and the standard deviation of the dataset divided by 255:
import torchvision
import torchvision.transforms as transforms
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True)
print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(trainset.data.min(), trainset.data.max()))
print('Mean Pixel Value {} \nPixel Values Std: {}'.format(trainset.data.float().mean(), trainset.data.float().std()))
print('Scaled Mean Pixel Value {} \nScaled Pixel Values Std: {}'.format(trainset.data.float().mean() / 255, trainset.data.float().std() / 255))
This outputs the following
Min Pixel Value: 0
Max Pixel Value: 255
Mean Pixel Value 33.31002426147461
Pixel Values Std: 78.56748962402344
Scaled Mean: 0.13062754273414612
Scaled Std: 0.30810779333114624
However, clearly this does none of the above! The resulting data 1) will not be in [0, 1] and 2) will not have mean 0 or std 1. In fact this is what we are doing:
[data - (mean / 255)] / (std / 255)
which is very different from this
[(scaled_data) - (mean/255)] / (std/255)
where scaled_data is just data / 255.
Euler_Salter
I may have stumbled upon this a little too late, but hopefully I can help a little bit.
Assuming that you are using torchvision.transforms, the following code can be used to normalize the MNIST dataset.
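import torch
from torchvision import datasets, transforms
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),                       # scales pixels to [0, 1]
                       transforms.Normalize((0.1307,), (0.3081,))   # then (x - mean) / std
                   ])),
)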
Usually, transforms.ToTensor() is used to convert the input image, with values in the range [0, 255], into a 3-dimensional tensor. In doing so it automatically scales the values to the range [0, 1].
Therefore, it makes sense that the mean and std passed to transforms.Normalize(...) are 0.1307 and 0.3081, respectively: these are the statistics of the already scaled data, so the result has zero mean and unit standard deviation.
Please refer to the link below for a better explanation.
https://pytorch.org/vision/stable/transforms.html
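The claim can be checked without downloading MNIST. The sketch below uses random integers as a stand-in for the pixel data (an assumption, not the real dataset) and plain NumPy instead of torchvision; it compares "divide by 255, then normalize with mean/255 and std/255" against standardizing the raw values directly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MNIST pixels: random integers in [0, 255] (not the real data).
data = rng.integers(0, 256, size=(100, 28, 28)).astype(np.float64)

mean, std = data.mean(), data.std()

# What scaling to [0, 1] followed by Normalize(mean/255, std/255) amounts to.
scaled = data / 255.0
two_step = (scaled - mean / 255.0) / (std / 255.0)

# Plain standardization of the raw pixel values.
one_step = (data - mean) / std

print(np.allclose(two_step, one_step))
print(two_step.mean(), two_step.std())
```

The factor 1/255 cancels between numerator and denominator, which is why the two routes agree and the result has mean 0 and standard deviation 1.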
I think you misunderstand one critical concept: these are two different, and inconsistent, scaling operations. You can have only one of the two:
mean = 0, stdev = 1
data range [0,1]
Think about it, considering the [0,1] range: if the data are all small positive values, with min=0 and max=1, then the sum of the data must be positive, giving a positive, non-zero mean. Similarly, the stdev cannot be 1 when none of the data can possibly be as much as 1.0 different from the mean.
Conversely, if you have mean=0, then some of the data must be negative.
You use only one of the two transformations. Which one you use depends on the characteristics of your data set, and -- ultimately -- which one works better for your model.
For the [0,1] scaling, you simply divide by 255.
For the mean=0, stdev=1 scaling, you perform the simple linear transformation you already know:
new_val = (old_val - old_mean) / old_stdev
Does that clarify it for you, or have I entirely missed your point of confusion?
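A tiny numeric illustration of why the two targets exclude each other, using four made-up pixel values (hypothetical data chosen only for readability):

```python
import numpy as np

# Four hypothetical pixel values in [0, 255].
pixels = np.array([0.0, 51.0, 102.0, 255.0])

# Option 1: [0, 1] scaling. The mean stays positive and the std stays well below 1.
scaled = pixels / 255.0

# Option 2: mean 0, stdev 1. Some values must become negative.
standardized = (pixels - pixels.mean()) / pixels.std()

print(scaled.min(), scaled.max(), scaled.mean(), scaled.std())
print(standardized.min(), standardized.max(), standardized.mean(), standardized.std())
```

The first line of output stays inside [0, 1] with a positive mean, while the second has mean 0 and unit standard deviation but leaves the [0, 1] range on both sides.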
Purpose
Two of the most important reasons for features scaling are:
You scale features to make them all of the same magnitude (i.e. importance or weight).
Example:
A dataset with two features, Age and Weight, with ages in years and weights in grams. A person in their twenties who weighs only 60 kg translates to the vector [20 yrs, 60000 g], and so on for the whole dataset. The Weight attribute will dominate during training. How much depends on the algorithm you are using; some are more sensitive than others. For example, in a neural network the learning rate for gradient descent is affected by the magnitude of the network's thetas (i.e. weights), which in turn vary with the inputs (i.e. features) during training; feature scaling also improves convergence. Another example is K-Means clustering, which requires features of the same magnitude since it is isotropic in all directions of space.
You scale features to speed up execution time.
This is straightforward: all these matrix multiplications and parameter summations would be faster with small numbers than with very large numbers (or very large numbers produced by multiplying features by other parameters, etc.).
Types
The most popular types of Feature Scalers can be summarized as follows:
StandardScaler: usually your first option; it is very commonly used. It standardizes the data, i.e. centers and rescales them to Mean=0 and STD=1. It is affected by outliers, and should only be used if your data have a Gaussian-like distribution.
MinMaxScaler: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]). It is heavily affected by outliers simply because it uses the range.
RobustScaler: It's "robust" against outliers because it scales the data according to the quantile range. However, you should know that outliers will still exist in the scaled data.
MaxAbsScaler: Mainly used for sparse data.
Unit Normalization: It basically scales the vector for each sample to have unit norm, independently of the distribution of the samples.
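The arithmetic behind each scaler in the list above can be sketched by hand in NumPy. These are simplified illustrations of the ideas, not the library implementations, and the Age/Weight matrix is made up (including the deliberate outlier in the last row).

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are Age (years) and Weight (grams).
X = np.array([[20.0, 60000.0],
              [35.0, 82000.0],
              [50.0, 75000.0],
              [65.0, 300000.0]])  # last weight is a deliberate outlier

# StandardScaler idea: zero mean, unit standard deviation per column.
standard = (X - X.mean(axis=0)) / X.std(axis=0)

# MinMaxScaler idea: map each column onto [0, 1] using its range.
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# RobustScaler idea: center on the median, scale by the interquartile range.
q1, median, q3 = np.percentile(X, [25, 50, 75], axis=0)
robust = (X - median) / (q3 - q1)

# MaxAbsScaler idea: divide by the largest absolute value per column.
maxabs = X / np.abs(X).max(axis=0)

# Unit normalization: scale each sample (row) to unit Euclidean norm.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)

print(minmax)
print(robust)
```

Comparing the Weight column of `minmax` and `robust` shows the point about outliers: the min-max version squeezes the three ordinary weights close together, while the quantile-based version keeps them spread out and leaves the outlier as a large value.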
Which One & How Many
You need to get to know your dataset first. As mentioned above, there are things you need to look at beforehand, such as the distribution of the data, the existence of outliers, and the algorithm being used.
In any case, you need one scaler per dataset unless there is a specific requirement, for example an algorithm that works only if the data are within a certain range and also have a mean of zero and a standard deviation of 1, all together. I have never come across such a case.
Key Takeaways
There are different types of Feature Scalers that are used based on some rules of thumb mentioned above.
You pick one Scaler based on the requirements, not randomly.
You scale data for a purpose, for example, in the Random Forest Algorithm you do NOT usually need to scale.
Well, the data gets scaled to [0, 1] using torchvision.transforms.ToTensor(), and then the normalization (0.1306, 0.3081) is applied.
You can read about it in the PyTorch documentation: https://pytorch.org/vision/stable/transforms.html.
Hope that answers your question.

Detecting and Replacing Outliers

In my mind, there are multiple ways to treat dataset outliers:
-> Delete the data
-> Transform using log or binning
-> Use the mean or median
-> Test separately
I have a dataset of around 50000 observations, and each observation has quite a few outlier values (some variables have a small number of outliers, some have 100-200), so excluding data is not what I'm looking for, as it causes me to lose a huge chunk of data.
I read somewhere that using the mean and median is for artificial outliers, but in my case I think the outliers are natural.
I was about to use the median to get rid of the outliers and then the mean to fill in missing values, but that doesn't seem right. I did use it nevertheless, with this code:
median = X.median()
std = X.std()
outliers = (X - median).abs() > std
X[outliers] = np.nan  # mask the flagged cells (X.outliers = ... only sets an attribute)
X.fillna(median, inplace=True)
It did lower the overfitting of just one model, logistic regression, but Random Forest still gives 100%, and the shape of the graph changed (before and after plots not shown).
So I'm really confused about which technique to use. I tried replacing the 5th and 95th percentiles of the data as well, but that didn't work either. Should I bin the data in each column from 1 to 10? Also, should I normalize or standardize my data before applying any model? Any guidance will be appreciated.
Check robust statistics.
I would suggest looking at Huber's method or winsorization, both of which are available in Python.
For hypothesis testing you have the Wilcoxon signed-rank test and, I think, the Mann-Whitney test.
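As a rough illustration of winsorization, here is a hand-rolled NumPy sketch that clips values to the 5th and 95th percentiles instead of deleting rows. The data are synthetic and the percentile limits are an arbitrary choice; SciPy also provides scipy.stats.mstats.winsorize for the same purpose.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic column: 1000 ordinary values plus a few extreme ones at both ends.
x = rng.normal(50.0, 5.0, size=1000)
x[:5] = 1000.0
x[5:10] = -1000.0

# Winsorize by hand: pull everything outside the 5th-95th percentile band to its edge.
lower, upper = np.percentile(x, [5, 95])
winsorized = np.clip(x, lower, upper)

print(x.min(), x.max())
print(winsorized.min(), winsorized.max())
```

No observation is dropped, so the dataset keeps its size; only the most extreme values are replaced by the band limits.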

Double integration of discrete data with respect to time

Let's say I have a set of data points called signal and I want to integrate it twice with respect to time (i.e., if signal was acceleration, I'd like to integrate it twice w.r.t. time to get the position). I can integrate it once using simps but the output here is a scalar. How can you numerically integrate a (random) data set twice? I'd imagine it would look something like this, but obviously the inputs are not compatible after the first integration.
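import numpy as np
from scipy.integrate import simps  # named simpson in newer SciPy releases
n_samples = 5000
t_range = np.arange(float(n_samples))
signal = np.random.normal(0.,1.,n_samples)
signal_integration = simps(signal, t_range)
# The attempt below fails: the inner simps call already returns a scalar.
signal_integration_double = simps(simps(signal, t_range), t_range)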
Any help would be appreciated.
Sorry, I answered too fast. scipy.integrate.simps gives the value of the integral over the whole range you give it, similar to np.sum(signal).
What you want is the integral between the start and each data point, which is what cumsum does. A better method could be scipy.integrate.cumtrapz. You can apply either method twice to get the result you want.
See:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.cumtrapz.html
Original answer:
I think you want np.cumsum. Integration of discrete data is just a sum. You have to multiply the result by the step value to get the correct scale.
See https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.cumsum.html
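A minimal sketch of the "apply it twice" idea, written with NumPy only. The helper `cumulative_trapezoid` below is a hand-written illustration of a running trapezoidal integral (newer SciPy releases ship a function of the same name in scipy.integrate, replacing cumtrapz); the constant acceleration is a made-up test signal whose exact position is known.

```python
import numpy as np

def cumulative_trapezoid(y, t):
    """Running trapezoidal integral of y over t, starting at 0."""
    steps = 0.5 * (y[1:] + y[:-1]) * np.diff(t)
    return np.concatenate(([0.0], np.cumsum(steps)))

t = np.linspace(0.0, 10.0, 5001)
acceleration = np.full_like(t, 2.0)   # constant acceleration, so position = t**2

velocity = cumulative_trapezoid(acceleration, t)   # first integration
position = cumulative_trapezoid(velocity, t)       # second integration

print(velocity[-1], position[-1])
```

Because each call returns an array of the same length as its input, the output of the first integration can be fed straight into the second, which is exactly what the scalar result of simps prevents.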
Integrating by parts, you get from y''=f to
y(t) = y(0) + y'(0)*t + integral from 0 to t of (t-s)*f(s) ds
As you seem to assume that y(0)=0 and also y'(0)=0, you can get the desired integral value at the final time t = t_range[-1] in a single integration as
simps((t-t_range)*signal, t_range)
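The single-integral formula can be checked on a signal with a known answer. The sketch below assumes y'' = cos(t) with y(0) = y'(0) = 0, for which y(t) = 1 - cos(t), and uses a hand-written trapezoidal rule in place of simps so that it depends on NumPy only; `trapezoid` and `t_end` are illustrative names.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal integral of y over t (a scalar)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

t_range = np.linspace(0.0, 10.0, 5001)
signal = np.cos(t_range)   # y'' = cos(t); with y(0) = y'(0) = 0, y(t) = 1 - cos(t)

# Position at the final time from one weighted integration.
t_end = t_range[-1]
position_end = trapezoid((t_end - t_range) * signal, t_range)

exact = 1.0 - np.cos(t_end)
print(position_end, exact)
```

This gives the position at one time only; to get the whole trajectory, the same expression has to be evaluated for each end time, or the cumulative approach used instead.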

Python, Arima prediction out of sample

I am trying to use an ARIMA model fitted with arima_mod = sm.tsa.ARIMA(residual, (p,d,q)).fit(trend="c",maxiter = 20) for out-of-sample prediction of the next value in the residual series. To do that, I can apply one of the following:
next_pred1 = arima_mod.predict(start,end,dynamic=True)[-1]
next_pred2 = arima_mod.predict(start,end,dynamic=False)[-1]
The results of both predictions are bad. Correction: with dynamic=False they are bad; with dynamic=True they are horrible. I am trying to understand why. When I set start to the distant past and end to the next value (which is out of sample), the prediction is bad, but at least it does not change between different start values. I.e., for start = 4 and for start = 8 the prediction gives the same output (I have about 40 samples to base the prediction on). When I set the same start/end values and use dynamic=True, the output gets worse the more past is included. This does not make sense to me: for ARIMA(2,0,2) the prediction should only use the estimates of the last 2 values with the dynamic setting, but it seems to use all past samples for forecasting. The output value is suddenly 1E-20 rather than, say, 0.15 or 1.1... It is as if a strange (weighted by p or similar) approximation of the mean of the series is used for the prediction...
Please advise: how many samples back should I take to forecast the next out-of-sample value with ARIMA(p,d,q)? Only the last p values? Something else? Why are the results of the prediction so bad? (The model is trained on 40 values; is this not sufficient for small p and q? I know that the more values the better, but this is all I have.)
Your advice will be appreciated.
Thanks!

Pseudoexperiments in PyMC

Is it possible to perform "pseudoexperiments" using PyMC?
By pseudoexperiments, I mean generating random "observations" by sampling from the prior, and then, given each pseudoexperiment, drawing samples from the posterior. Afterwards, one would compare the trace for each parameter to the sample (obtained from the prior) used in sampling from the posterior.
A more concrete example: Suppose that I want to know the rate of process X. I count how many occurrences there are in a certain period of time. However, I know that process Y also sometimes occurs and will contaminate my count. The rate of process Y is known with some uncertainty. So, I build a model, include my observations, and sample from the posterior:
import pymc  # PyMC 2 API
class mymodel:
    # Priors on the two rates; only their sum is observed.
    rate_x = pymc.Uniform('rate_x', lower=0, upper=100)
    rate_y = pymc.Normal('rate_y', mu=150, tau=1./(15**2))
    total_rate = pymc.LinearCombination('total_rate', [1,1], [rate_x, rate_y])
    data = pymc.Poisson('data', mu=total_rate, value=193, observed=True)
Mod = pymc.Model(mymodel)
MCMC = pymc.MCMC(Mod)
MCMC.sample(100000, burn=5000, thin=5)
print(MCMC.stats()['rate_x']['quantiles'])
However, before I do my experiment (or before I "unblind" my analysis and look at my data), I would like to know how sensitive I expect to be -- what will be the uncertainty on my measurement of rate_x?
To answer this, I could sample from the prior
Mod.draw_from_prior()
but this only samples rate_x, rate_y, and calculates total_rate. But once the values of those are set by draw_from_prior(), I can draw a pseudoexperiment:
Mod.data.random()
This just returns a number, so I have to set the value of Mod.data to a random sample. Because Mod.data has the observed flag set, I have to also "force" it:
Mod.data.set_value(Mod.data.random(), force=True)
Now I can sample from the posterior again
MCMC.sample(100000, burn=500, thin=5)
print(MCMC.stats()['rate_x']['quantiles'])
All this works, so I suppose the simple answer to my question is "yes". But it feels very hacky. Is there a better or more natural way to accomplish this?
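For comparison, the data-generating half of a pseudoexperiment (draw the rates from their priors, then draw a count) can be sketched outside PyMC in plain NumPy. This is only an illustration of that step under the priors stated in the model above; it does not sample from the posterior, and `n_pseudo`, the seed, and the clipping guard are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_pseudo = 1000   # number of pseudoexperiments (arbitrary choice)

# Draw the parameters from the priors used in the model above.
rate_x = rng.uniform(0.0, 100.0, size=n_pseudo)
rate_y = rng.normal(150.0, 15.0, size=n_pseudo)   # tau = 1/15**2 means sigma = 15
total_rate = np.clip(rate_x + rate_y, 0.0, None)  # guard against a negative Poisson mean

# One pseudo-observed count per prior draw.
pseudo_data = rng.poisson(total_rate)

print(pseudo_data[:10])
print(pseudo_data.mean())
```

Each entry of `pseudo_data` plays the role of the value passed to Mod.data.set_value(..., force=True), and the matching `rate_x` entry is the "true" value the posterior trace would later be compared against.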
