How do I calculate standard deviation of two arrays in python? - python

I have two arrays: one with 30 years of observations, and one with 30 years of historical model runs. I want to calculate the standard deviation between observations and model results, to see how much the model deviates from observations. How do I go about doing this?
Edit
Here are the two arrays (each number represents one year, 1971-2000):
obs = [ 2790.90283203 2871.02514648 2641.31738281 2721.64453125
2554.19384766 2773.7746582 2500.95825195 3238.41186523
2571.62133789 2421.93017578 2615.80395508 2271.70654297
2703.82275391 3062.25366211 2656.18359375 2593.62231445
2547.87182617 2846.01245117 2530.37573242 2535.79931641
2237.58032227 2890.19067383 2406.27587891 2294.24975586
2510.43847656 2395.32055664 2378.36157227 2361.31689453 2410.75
2593.62915039]
model = [ 2976.01928711 3353.92114258 3000.92700195 3116.5078125 2935.31787109
2799.75805664 3328.06225586 3344.66333008 3318.31689453
3348.85302734 3578.70800781 2791.78198242 4187.99902344
3610.77124023 2991.984375 3112.97412109 4223.96826172
3590.92724609 3284.6015625 3846.34936523 3955.84350586
3034.26074219 3574.46362305 3674.80175781 3047.98144531
3209.56616211 2654.86547852 2780.55053711 3117.91699219
2737.67626953]

You want to compare two signals, e.g. A and B in the following example:
import numpy as np
A = np.random.rand(5)
B = np.random.rand(5)
print("A:", A)
print("B:", B)
Output:
A: [ 0.66926369 0.63547359 0.5294013 0.65333154 0.63912645]
B: [ 0.17207719 0.26638423 0.55176735 0.05251388 0.90012135]
Analyzing individual signals
The standard deviation of each single signal is not what you need:
print("standard deviation of A:", np.std(A))
print("standard deviation of B:", np.std(B))
Output:
standard deviation of A: 0.0494162021651
standard deviation of B: 0.304319034639
Analyzing the difference
Instead you might compute the difference and apply some common measure like the sum of absolute differences (SAD), the sum of squared differences (SSD) or the correlation coefficient:
print("difference:", A - B)
print("SAD:", np.sum(np.abs(A - B)))
print("SSD:", np.sum(np.square(A - B)))
print("correlation:", np.corrcoef(np.array((A, B)))[0, 1])
Output:
difference: [ 0.4971865 0.36908937 -0.02236605 0.60081766 -0.2609949 ]
SAD: 1.75045448355
SSD: 0.813021824351
correlation: -0.38247081

Use numpy.
import numpy as np
data = [1.2, 2.3, 1.3, 1.2, 5.4]
np.std(data)
Or you could try this:
import numpy as np
obs = np.array([1.2, 2.3, 1.3, 1.2, 5.4])
model = np.array([1.1, 2.4, 1.2, 1.2, 5.3])
np.std(obs-model)
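To make the difference-based approach more concrete, here is a minimal sketch that summarizes the model-minus-observation errors with three common statistics: the mean bias, the standard deviation of the differences, and the root-mean-square error (RMSE). The short arrays are only the first five values of obs and model from the question, rounded to two decimals, and the variable names are illustrative.

```python
import numpy as np

# First five values of each series from the question, rounded (illustrative subset).
obs = np.array([2790.90, 2871.03, 2641.32, 2721.64, 2554.19])
model = np.array([2976.02, 3353.92, 3000.93, 3116.51, 2935.32])

diff = model - obs                   # per-year model error
bias = diff.mean()                   # systematic offset of the model
spread = diff.std()                  # std of the errors (ddof=0, the NumPy default)
rmse = np.sqrt(np.mean(diff ** 2))   # overall error magnitude

print("bias:", bias)
print("std of differences:", spread)
print("RMSE:", rmse)
```

With ddof=0 the three quantities are linked by RMSE² = bias² + spread², so np.std(obs-model) alone describes only the scatter of the errors, not a constant offset between model and observations.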

The standard deviation at each index across multiple lists (e.g. comparing model vs. measurement, or several sets of measurements), such as
import numpy as np
obs = np.array([0,1,2,3,4])
model = np.array([2,4,6,8,10])
can be calculated by stacking the data into one array:
arr = np.vstack((obs,model))
Now the standard deviation is calculated using np.std() with a specific axis
std = np.std(arr,axis=0)
Alternative one line solution:
std = np.std((model,obs),axis=0)
Output:
[1.0, 1.5, 2.0, 2.5, 3.0]

If you're doing anything more complicated than just finding the standard deviation and/or mean, use numpy/scipy. If that's all you need to do, use the statistics package from the Python Standard Library.
>>> import statistics
>>> statistics.stdev([1, 2, 3])
1.0
It was added in Python 3.4 (see PEP 450) as a lightweight alternative to NumPy for basic statistics.
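One detail worth checking when switching between the two libraries is the divisor. The sketch below compares the sample and population versions in the statistics module with np.std at ddof=0 and ddof=1; the variable names are illustrative.

```python
import statistics
import numpy as np

data = [1, 2, 3]

sample_sd = statistics.stdev(data)        # divides by n - 1
population_sd = statistics.pstdev(data)   # divides by n

np_default = np.std(data)                 # ddof=0, corresponds to pstdev
np_sample = np.std(data, ddof=1)          # ddof=1, corresponds to stdev

print(sample_sd, population_sd, np_default, np_sample)
```

So statistics.stdev and a plain np.std call give different numbers for the same data unless ddof=1 is passed to NumPy.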

Related

How to Use NumPy 1.4 Polynomial Class to Fit Values

How do you use the new Polynomials sub-package in numpy to give it new x values and get an output of y values?
https://numpy.org/doc/stable/reference/routines.polynomials.package.html
In prior versions of numpy it went something like this:
poly = np.poly1d(np.polyfit(x, y, 3))
new_x = np.linspace(0, 100)
new_y = poly(new_x)
With the new version, I am struggling to pass it x values and get the corresponding y values:
from numpy.polynomial import Polynomial
poly = Polynomial(Polynomial.fit(x, y, 3))
When I give it an array of x it just returns the coefficients.
You can directly call the resulting series to evaluate it:
from numpy.polynomial import Polynomial
poly = Polynomial.fit(x, y, 3)
new_y = poly(new_x)
Check this page of the documentation; it has several examples.
Unfortunately, the answer by @Joan Charmant and the supportive comment by @rh109019 do not work.
The intuitive way suggested by @Joan Charmant is basically what the question is about: it doesn't work.
Evidently, there is a new method introduced in numpy.polynomial.polynomial devoted specifically to evaluating polynomials. See here.
Here's my code where I'm comparing the two approaches.
import numpy as np
Pgauge = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
NIST = np.asarray([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1])
calibrationCurve = np.polynomial.polynomial.Polynomial.fit(
    Pgauge,
    NIST,
    deg=1,
)
print("The polynomial: {}".format(calibrationCurve))
x = np.asarray([0, 1]) # values of x to evaluate the polynomial at
c = calibrationCurve.coef # coefficients of the polynomial
print("The intuitive (wrong) way: {}".format(calibrationCurve(x)))
print("The correct way: {}".format(np.polynomial.polynomial.polyval(x, c)))
The first print command prints out the polynomial: 4.6 + 3.5x.
If we want to evaluate it at the points 0 and 1 (x = np.asarray([0, 1])), we expect to get 4.6 and 8.1 respectively.
The second print command (labelled "The intuitive (wrong) way") uses the method suggested by @Joan Charmant. It gives [0.1, 1.1] as the result, which is wrong. At first glance it looks fine, since it returns two numbers as expected, but the numbers themselves are wrong. I don't know how they were calculated, and with a bigger data series I could not check every value with a calculator, so I would not trust the result.
The last print command makes use of the polyval method suggested in the user manual that I cited above. And it works perfectly well. It gives [4.6, 8.1] as the result.
It so happens that my answer is wrong as well (see all the comments below by @user2357112 supports Monica).
But still, I'll leave it here for the folks who, like me, fell victim to the confusing new numpy.polynomial library.
FIRST: why is my code wrong?
Everything's ok with it. But the line print("The polynomial: {}".format(calibrationCurve)) doesn't give me what I think it should. It takes the correct polynomial, changes its coefficients somehow and prints out a new polynomial with the changed coefficients. Still, it does store the correct polynomial in memory, and when you do what @Joan Charmant suggested, it may give you the correct answer if you ask it properly.
SECOND: how to use the new numpy.polynomial library in order to get a correct result?
Due to that peculiarity, you have to introduce a new line of code. Namely, do the Polynomial.fit() and immediately afterwards use the .convert() method. Then work with the converted polynomial only.
Here's my code that works correctly now.
import numpy as np
Pgauge = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
NIST = np.asarray([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1])
calibrationCurveMessedUp = np.polynomial.polynomial.Polynomial.fit(
    Pgauge,
    NIST,
    deg=1,
)
calibrationCurve = calibrationCurveMessedUp.convert()
print("The polynomial: {}".format(calibrationCurve))
print("The rounded polynomial coefficients: {}".format(calibrationCurve.coef))
x = np.asarray([0, 1]) # values of x to evaluate the polynomial at
print(calibrationCurve(x))
THIRD: a little note.
Apparently, it is possible to get the correct polynomial without the additional line of code. Probably, you have to give the correct window and domain parameters to the Polynomial.fit() function. Or maybe there is another way.
If anybody knows such a way, you're welcome to edit my current answer and add your code.
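The behaviour discussed in this thread can be reproduced with a small sketch. It uses the same calibration data as above and compares the coefficients stored by Polynomial.fit(), the coefficients after .convert(), and a fit in which domain and window are both set to [-1, 1]. That last call is an assumption on my part about how to switch the internal rescaling off (equal domain and window should make the mapping the identity); the variable names are illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.arange(1.0, 9.0)   # 1.0 ... 8.0, same values as Pgauge above
y = x + 0.1               # 1.1 ... 8.1, same values as NIST above

fitted = Polynomial.fit(x, y, deg=1)   # coefficients refer to a rescaled variable
unscaled = fitted.convert()            # same polynomial, coefficients in terms of x

# Assumption: equal domain and window make the internal mapping the identity.
direct = Polynomial.fit(x, y, deg=1, domain=[-1, 1], window=[-1, 1])

new_x = np.array([0.0, 1.0])
print(fitted.coef)      # coefficients in the rescaled variable
print(unscaled.coef)    # coefficients in x
print(direct.coef)      # coefficients in x, without calling convert()
print(fitted(new_x))    # calling the series applies the mapping internally
```

If this sketch is right, calling the fitted series directly and calling the converted one give the same values; only the stored and printed coefficients differ.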

Find a pivot point that minimizes the "overlap" of two Python lists

I have two Python lists of float numbers and I want to find a pivot point which minimizes the "overlap" of these two lists.
The problem is illustrated in the figure below, where I would like to get the cross point of the two curves (each curve can be imagined as the histogram plot of a list), and the "overlap" is defined as the green area.
For example, I have two lists [2.1, 3.5, 3.8, 3.8, 3.8, 4.2] and [3.7, 4.1, 4.1, 4.1, 5.0]. A good pivot point could be 4.0 (or any number between 3.8 and 4.1), where the "overlap" corresponds to only one number (4.2) from the 1st list and one number (3.7) from the 2nd list.
Apparently the set() & set() method doesn't apply here, as the numbers wouldn't be the same in both lists. The only method I came up with is a brute-force search, starting from 4.2 and ending at 3.7, which is not ideal.
Following the comments, I need to separate it into two questions:
1) What's the Python solution to find such a pivot point of the two lists?
2) Much better, though maybe too much to ask here: how can I get a statistically rigorous solution for separating the two sets of values? I am not sure if I can assume a Gaussian distribution of the values, but let's assume we can if that helps to formulate a solution.
We have two lists a and b. We are looking for such a value x for which the cumulative probability of higher values in a is equal to cumulative probability of lower values in b.
Formally:
1 − CDF(a, x) == CDF(b, x)
Alternatively:
1 − CDF(a, x) − CDF(b, x) == 0
Let's implement it in Python.
import itertools
import random
def boundary(a, b):
    """Return interval of boundary values."""
    # Calculate probability density function for both list
    # Merge lists and sort them by their values
    cc = sorted(itertools.chain(
        ((x, 1/len(a)) for i, x in enumerate(a)),
        ((x, 1/len(b)) for i, x in enumerate(b))))
    # Mark all values with 1 − CDF(a, x) − CDF(b, x)
    pp = [(x[0], 1-sum(z[1] for z in cc[:i+1])) for i, x in enumerate(cc)]
    # Find index of a value closest to zero
    m = min(enumerate(pp), key=lambda x: abs(x[1][1]))
    # Return range of values
    index = m[0]
    return pp[index][0], pp[index+1][0]
Test simple cases:
print(boundary([1, 2], [3, 4])) # -> (2, 3)
print(boundary([1], [3])) # -> (1, 3)
print(boundary([1, 3], [2, 4])) # -> (2, 3)
And test a more complicated case:
a = sorted(random.gauss(0, 1) for _ in range(300))
b = sorted(random.gauss(1, 1) for _ in range(200))
print(boundary(a, b)) # -> approx (0.5, 0.5 + Δ)
Please note that the algorithm correctly processes lists of different lengths.
And with slight performance optimizations it can successfully handle lists with millions of items.
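For the example lists in the question, the "overlap" can also be counted directly: the number of values from the lower list that fall above the pivot plus the number of values from the upper list that fall below it. The sketch below is one possible way to do this, not the method of the answer above. The function name best_pivot is hypothetical, and it assumes the first list is the one expected to lie mostly below the pivot and that the two lists contain at least two distinct values.

```python
from bisect import bisect_left, bisect_right

def best_pivot(low, high):
    """Return (pivot, overlap) with the fewest values on the 'wrong' side.

    Assumes `low` is the list expected to lie mostly below the pivot.
    """
    low, high = sorted(low), sorted(high)
    values = sorted(set(low) | set(high))
    # Candidate pivots: midpoints between neighboring distinct values.
    candidates = [(u + v) / 2 for u, v in zip(values, values[1:])]

    def overlap(p):
        above = len(low) - bisect_right(low, p)   # items of `low` greater than p
        below = bisect_left(high, p)              # items of `high` less than p
        return above + below

    pivot = min(candidates, key=overlap)
    return pivot, overlap(pivot)

a = [2.1, 3.5, 3.8, 3.8, 3.8, 4.2]
b = [3.7, 4.1, 4.1, 4.1, 5.0]
pivot, count = best_pivot(a, b)
print(pivot, count)
```

For these lists the chosen pivot lies between 3.8 and 4.1 and leaves two values on the wrong side (4.2 and 3.7), which matches the expectation stated in the question.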
One idea is to use a decision tree classifier to determine the best separation point.
Code
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
# Setup Data
df = pd.DataFrame({'Feature': [1.0, 2.1, 3.5, 4.2,3.7, 4.1, 5.0],'Label':[0,0,0,0,1,1,1]})
feature_cols = ['Feature']
X = df[feature_cols] # Features
y = df.Label # Target variable
# Create Decision Tree classifier object (use max_depth of 1 to have one boundary)
clf = DecisionTreeClassifier(max_depth = 1)
# Train Decision Tree classifier
clf = clf.fit(X, y)
# Find decision boundary by creating test data in 0.1 steps from min to max
# (i.e. 1 to 5)
arr = np.arange(1, 5.1, 0.1)
test_set = pd.DataFrame({'Feature': arr})
# Create predictor so we can see where boundary is created
y_pred = clf.predict(test_set)
indexes = np.where(y_pred > 0) # all points with label 1
pivot_index = indexes[0][0] # first point with label 1
pivot_value = arr[pivot_index] # value is pivot value
print(f'Pivot value: {pivot_value}')
Output
Pivot value: 3.7000000000000024
If I understood correctly, the values you are looking for do not necessarily belong to the lists.
If that is the case, you can "artificially" resample your lists with decimal spacing between the min and max of the original lists, transform them to sets and compute the intersection between them.

QQ-plot python mean and standard deviation

I am trying to draw a Q-Q plot using Python. I was checking scipy.stats.probplot, and the input seems to be the measurements compared against a normal distribution.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
and in my code, I had
stats.probplot(mean, dist="norm", plot=plt)
to compare distributions.
But I am wondering where can I input standard deviation? I thought that's a very important factor when comparing distributions but so far I can only input the mean.
Thanks
Let's suppose you have a list of floats:
X = [-1.31, 4.82, 2.18, 1.99, 4.37, 2.58, 7.22, 3.93, 6.95, 2.41,
     2.02, 2.48, -1.01, 2.3, 2.87, -0.06, 2.13, 3.62, 5.24, 0.57]
If you want to make a Q-Q plot test, you need to compare X against a distribution.
For example, N(0, 1), a normal distribution with mean = 0 and sigma = 1.
In OpenTURNS, it goes like this:
import openturns as ot
from openturns.viewer import View
sample = ot.Sample([[p] for p in X])
graph = ot.VisualTest.DrawQQplot(sample, ot.Normal(0,1))
View(graph)
Explanation: I tell OpenTURNS I have a sample of 20 points [p] coming from X and not 1 point in dimension 20. Then I call ot.VisualTest.DrawQQplot with 2 arguments: sample and the Normal distribution (0,1) ot.Normal(0,1).
We see on the graph that the test fails:
The question now is: what is the best Normal Distribution fitting the sample?
Thanks to NormalFactory() the answer is simple:
BestNormalDistribution = ot.NormalFactory().build(sample)
If you print(BestNormalDistribution) you get the parameters of this distribution:
Normal(mu = 2.76832, sigma = 2.27773)
If we repeat the Q-Q plot test of sample against BestNormalDistribution, the result is much better.
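Staying with scipy.stats.probplot from the question, the mean and standard deviation of the reference normal distribution can be supplied through the sparams argument, which for dist="norm" is read as (loc, scale). The sketch below is a minimal illustration under that reading; it skips the plot (plot is left unset) and only inspects the fitted line, and the seed and sample size are arbitrary choices.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
measurements = rng.normal(loc=20, scale=5, size=1000)

# sparams=(loc, scale) sets the mean and standard deviation of the reference normal.
(osm, osr), (slope, intercept, r) = stats.probplot(
    measurements, sparams=(20, 5), dist="norm"
)

# If the data really follow N(20, 5), the points lie near the line y = x,
# so the fitted slope should be close to 1.
print(slope, intercept, r)
```

Passing plot=pylab as in the question draws the same comparison. Without sparams the reference is the standard normal, in which case the slope and intercept of the fitted line estimate the standard deviation and mean of the data.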

Weighted Least Squares in Statsmodels vs. Numpy?

I am trying to replicate the functionality of Statsmodels's weighted least squares (WLS) function with Numpy's ordinary least squares (OLS) function (i.e. Numpy refers to OLS as just "least squares").
In other words, I want to compute the WLS in Numpy. I used this Stackoverflow post as reference, but drastically different R² values arise moving from Statsmodel to Numpy.
Take the following example code that replicates this:
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd
# Test Data
patsy_equation = "y ~ C(x) - 1" # Use minus one to get rid of the hidden intercept ("+ 1")
weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 3, 3])
d = {"x": x.tolist(), "y": y.tolist()}
data_df = pd.DataFrame(data=d)
# Weighted Least Squares from Statsmodel API
statsmodel_model = smf.wls(formula=patsy_equation, weights=weight, data=data_df)
statsmodel_r2 = statsmodel_model.fit().rsquared
# Weighted Least Squares from Numpy API
Aw = x.reshape((-1, 1)) * np.sqrt(weight[:, np.newaxis]) # Multiply two column vectors
Bw = y * np.sqrt(weight)
numpy_model, numpy_resid = np.linalg.lstsq(Aw, Bw, rcond=None)[:2]
numpy_r2 = 1 - numpy_resid / (Bw.size * Bw.var())
print("Statsmodels R²: " + str(statsmodel_r2))
print("Numpy R²: " + str(numpy_r2[0]))
After running such code, I get the following results:
Statsmodels R²: 2.220446049250313e-16
Numpy R²: 0.475486515775414
Clearly something is wrong here! Can anyone point out my mistake? Am I misunderstanding the patsy formula?
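One possible reading of the discrepancy, offered as a sketch and not a confirmed diagnosis: since x has a single level, "y ~ C(x) - 1" should produce one indicator column of ones, so the Statsmodels model is effectively intercept-only, whereas the NumPy code regresses on the value 3 and uses an unweighted variance in the R² denominator. The code below fits the intercept-only design with NumPy and computes a weighted, centered R², assuming that is the convention Statsmodels applies when the model contains a constant.

```python
import numpy as np

weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])

# Assumption: "y ~ C(x) - 1" with a single level of x yields one indicator
# column of ones, so the design matrix is intercept-only.
A = np.ones((y.size, 1))

sqrt_w = np.sqrt(weight)
coef, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)

fitted = A @ coef
y_bar_w = np.average(y, weights=weight)        # weighted mean of y
ss_res = np.sum(weight * (y - fitted) ** 2)    # weighted residual sum of squares
ss_tot = np.sum(weight * (y - y_bar_w) ** 2)   # weighted, centered total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)
```

Under these assumptions the fitted constant equals the weighted mean of y and R² is zero up to rounding, which is consistent with the 2.2e-16 reported by Statsmodels above.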

Difference between R.scale() and sklearn.preprocessing.scale()

I am currently moving my data analysis from R to Python. When scaling a dataset in R I would use R.scale(), which in my understanding would do the following: (x-mean(x))/sd(x)
To replace that function I tried to use sklearn.preprocessing.scale(). From my understanding of the description it does the same thing. Nonetheless I ran a little test-file and found out, that both of these methods have different return-values. Obviously the standard deviations are not the same... Is someone able to explain why the standard deviations "deviate" from one another?
MWE:
# import packages
from sklearn import preprocessing
import numpy
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up R namespaces
R = rpy2.robjects.r
np1 = numpy.array([[1.0,2.0],[3.0,1.0]])
print("Numpy-array:")
print(np1)
print("Scaled numpy array through R.scale()")
print(R.scale(np1))
print("-------")
print("Scaled numpy array through preprocessing.scale()")
print(preprocessing.scale(np1, axis = 0, with_mean = True, with_std = True))
scaler = preprocessing.StandardScaler()
scaler.fit(np1)
print("Mean of preprocessing.scale():")
print(scaler.mean_)
print("Std of preprocessing.scale():")
print(scaler.std_)
Output:
It seems to have to do with how standard deviation is calculated.
>>> import numpy as np
>>> a = np.array([[1, 2],[3, 1]])
>>> np.std(a, axis=0)
array([ 1. , 0.5])
>>> np.std(a, axis=0, ddof=1)
array([ 1.41421356, 0.70710678])
From numpy.std documentation,
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Apparently, R.scale() uses ddof=1, but sklearn.preprocessing.StandardScaler() uses ddof=0.
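As a quick check of that explanation, the formula from the question, (x - mean(x)) / sd(x) with the n - 1 divisor, can be written in plain NumPy. This is a minimal sketch; the helper name scale_like_r is hypothetical and it only covers the default centered case.

```python
import numpy as np

def scale_like_r(x):
    """Center each column and divide by the sample standard deviation (ddof=1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

np1 = np.array([[1.0, 2.0], [3.0, 1.0]])
scaled = scale_like_r(np1)
print(scaled)
```

For the 2x2 array from the question every entry comes out as plus or minus 0.70710678, whereas dividing by the ddof=0 standard deviation would give plus or minus 1.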
EDIT: (To explain how to use alternate ddof)
There doesn't seem to be a straightforward way to calculate std with alternate ddof, without accessing the variables of the StandardScaler() object itself.
import numpy
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(data)
# Now, sc.mean_ and sc.std_ are the mean and standard deviation of the data
# Replace the sc.std_ value using std calculated using numpy
sc.std_ = numpy.std(data, axis=0, ddof=1)
The current answers are good, but sklearn has changed a bit in the meantime. The new syntax that makes sklearn behave exactly like R.scale() is now:
from sklearn.preprocessing import StandardScaler
import numpy as np
sc = StandardScaler()
sc.fit(data)
sc.scale_ = np.asarray(np.std(data, axis=0, ddof=1))  # works for NumPy arrays and pandas DataFrames
sc.transform(data)
Feature request:
https://github.com/scikit-learn/scikit-learn/issues/23758
R.scale documentation says:
The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
However, sklearn.preprocessing.StandardScaler always scales by the standard deviation.
In my case, I want to replicate R.scale in Python without centering. I followed @Sid's advice in a slightly different way:
import numpy as np
from sklearn.preprocessing import StandardScaler
def get_scale_1d(v):
    # Root-mean-square with an n - 1 divisor, copied from the R source code
    v = v[~np.isnan(v)]
    std = np.sqrt(
        np.sum(v ** 2) / np.max([1, len(v) - 1])
    )
    return std
sc = StandardScaler()
sc.fit(data)
# Older scikit-learn releases named this attribute std_
sc.scale_ = np.apply_along_axis(func1d=get_scale_1d, axis=0, arr=data)
sc.transform(data)
