linear fit with predefined error in response variable - python

I have the following dataset (replication):
ordinal_var fraction error_on_fraction
1 1.2 0.1
2 0.87 0.23
4 1.12 0.11
5 0.75 0.06
5 0.66 0.15
6 0.98 0.08
7 1.34 0.05
7 2.86 0.12
Now I want to do a linear regression analysis (preferably in R, but Python is also fine) where I pass the error in y for each point within the formula. In R this would be something like the following (for better understanding of the question):
lm(fraction +-error_on_fraction ~ ordinal_var, data = dataset)
Of course I tried to find how to do it myself first but I can't find an answer.
For a previous analysis with errors on both x and y I just used the scipy.odr library, but I can't find how to do it with only an error in the y (response) variable.
Any help would be much appreciated!

We can use a simple weighted least squares model.
Sample data
Let's read in your sample data.
df <- read.table(text =
"ordinal_var fraction error_on_fraction
1 1.2 0.1
2 0.87 0.23
4 1.12 0.11
5 0.75 0.06
5 0.66 0.15
6 0.98 0.08
7 1.34 0.05
7 2.86 0.12", header = T)
Weighted least squares model
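We fit a weighted linear model of the form fraction ~ ordered(ordinal_var), where the weights are given by 1 / error_on_fraction. Two caveats: ordered() makes R fit orthogonal polynomial contrasts (.L, .Q, .C, ...) rather than a single slope, so use fraction ~ ordinal_var if you want a straight line; and if error_on_fraction is a standard deviation, the conventional inverse-variance choice is weights = 1 / error_on_fraction^2. The output below was produced with the call exactly as shown.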
fit <- lm(
fraction ~ ordered(ordinal_var),
weights = 1 / error_on_fraction,
data = df)
summary(fit)
#
#Call:
#lm(formula = fraction ~ ordered(ordinal_var), data = df, weights = 1/error_on_fraction)
#
#Weighted Residuals:
# 1 2 3 4 5 6 7
# 2.220e-16 -1.851e-16 -1.753e-17 1.050e-01 -1.660e-01 1.810e-17 -1.999e+00
# 8
# 3.097e+00
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.1136 0.3365 3.309 0.0804 .
#ordered(ordinal_var).L 0.3430 0.7847 0.437 0.7047
#ordered(ordinal_var).Q 0.6228 0.7057 0.883 0.4706
#ordered(ordinal_var).C 0.2794 0.8920 0.313 0.7838
#ordered(ordinal_var)^4 0.2127 0.9278 0.229 0.8400
#ordered(ordinal_var)^5 -0.2469 0.7916 -0.312 0.7846
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 2.61 on 2 degrees of freedom
#Multiple R-squared: 0.5427, Adjusted R-squared: -0.6004
#F-statistic: 0.4748 on 5 and 2 DF, p-value: 0.783
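Since the question also allows Python, here is a minimal sketch of the same idea with NumPy. It assumes that ordinal_var can be treated as a plain numeric predictor (unlike the ordered() contrasts above) and that error_on_fraction is a one-sigma uncertainty. np.polyfit applies its w argument to the unsquared residuals, so w = 1/sigma corresponds to inverse-variance weighting; the normal-equation solve is only there as a cross-check.

```python
import numpy as np

# Data from the question.
x = np.array([1, 2, 4, 5, 5, 6, 7, 7], dtype=float)
y = np.array([1.2, 0.87, 1.12, 0.75, 0.66, 0.98, 1.34, 2.86])
sigma = np.array([0.1, 0.23, 0.11, 0.06, 0.15, 0.08, 0.05, 0.12])

# Weighted straight-line fit: polyfit multiplies each residual by w
# before squaring, so w = 1/sigma weights squared residuals by 1/sigma**2.
slope, intercept = np.polyfit(x, y, deg=1, w=1.0 / sigma)

# Cross-check with the weighted normal equations (X^T W X) b = X^T W y.
X = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma**2)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print("polyfit:  intercept=%.4f slope=%.4f" % (intercept, slope))
print("normal eq: intercept=%.4f slope=%.4f" % (beta[0], beta[1]))
```

Both routes should agree to numerical precision; if the errors are something other than standard deviations, the weights would need to be adapted accordingly.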


How to assign new observations to cluster using distance matrix and kmedoids?

I have a dataframe that holds the Word Mover's Distance between each document in my dataframe. I am running k-medoids on this to generate clusters.
      1     2     3     4     5
1  0.00  0.05  0.07  0.04  0.05
2  0.05  0.00  0.06  0.04  0.05
3  0.07  0.06  0.00  0.06  0.06
4  0.04  0.04  0.06  0.00  0.04
5  0.05  0.05  0.06  0.04  0.00
kmed = KMedoids(n_clusters= 3, random_state=123, method ='pam').fit(distance)
After running on this initial matrix and generating clusters, I want to add new points to be clustered. After adding a new document to the distance matrix I end up with:
      1     2     3     4     5     6
1  0.00  0.05  0.07  0.04  0.05  0.12
2  0.05  0.00  0.06  0.04  0.05  0.21
3  0.07  0.06  0.00  0.06  0.06  0.01
4  0.04  0.04  0.06  0.00  0.04  0.05
5  0.05  0.05  0.06  0.04  0.00  0.12
6  0.12  0.21  0.01  0.05  0.12  0.00
I have tried using kmed.predict on the new row.
kmed.predict(new_distance.loc[-1: ])
However, this gives me an error of incompatible dimensions X.shape[1] == 6 while Y.shape[1] == 5.
How can I use this distance of the new document to determine which cluster it should be a part of? Is this even possible, or do I have to recompute clusters every time? Thanks!
The source code for k-medoids says the following:
def transform(self, X):
    """Transforms X to cluster-distance space.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_query, n_features), \
            or (n_query, n_indexed) if metric == 'precomputed'
        Data to transform.
    """
I assume that you use the precomputed metric (because you compute the distances outside the classifier), so in your case n_query is the number of new documents, and n_indexed is the number of the documents for which the fit method was called.
In your particular case, where you fit the model on 5 documents and then want to classify the 6th one, the X passed to predict should have shape (1, 5). With positional indexing, that is the last row without its last column (the new document's distance to itself):
kmed.predict(new_distance.iloc[-1:, :-1])
This is my attempt. We must recompute the distance between the new point and the old ones each time.
import pandas as pd
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import pairwise_distances
import numpy as np
# dummy data for the trial
df = pd.DataFrame({0: [0, 1], 1: [1, 2]})
# calculate the pairwise distances between the initial points
distance = pairwise_distances(df.values, df.values)
# fit the model; metric='precomputed' tells KMedoids that the input is a distance matrix
kmed = KMedoids(n_clusters=2, metric='precomputed', random_state=123, method='pam').fit(distance)
new_point = [2, 3]
# calculate the distance between the new point and the initial dataset
distance = pairwise_distances(np.array(new_point).reshape(1, -1), df.values)
print(distance)
# keep only the distances to the two fitted points, so the input has shape (1, 2)
print(kmed.predict(distance[0][:2].reshape(1, -1)))
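To make explicit what assigning a new document means for a precomputed metric, here is a minimal NumPy-only sketch: the new document goes to the cluster whose medoid is closest. The medoid positions below are hypothetical, chosen only for illustration; with a fitted scikit-learn-extra model they would come from its medoid_indices_ attribute.

```python
import numpy as np

# Distances from the new document (6) to the five documents the model
# was fitted on, taken from the last row of the question's 6x6 matrix.
new_row = np.array([[0.12, 0.21, 0.01, 0.05, 0.12]])  # shape (1, 5)

# HYPOTHETICAL medoid positions (0-based row numbers of documents 1, 3, 4).
# These are assumed for illustration, not computed from the data.
medoid_indices = np.array([0, 2, 3])

# Pick, for each query row, the medoid with the smallest distance.
labels = np.argmin(new_row[:, medoid_indices], axis=1)
print(labels)  # index into medoid_indices, i.e. the cluster label
```

Under these assumed medoids the new document lands in the cluster of document 3, which matches its very small distance (0.01) to that document. No refit is needed as long as the medoids themselves are kept fixed.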

How can I compute the cumulative weighted average in new column?

I have read all the related pages on Google and Stack Overflow, and I still can't find a solution.
Given this df fragment:
key_br_acc_posid lot_in price
ix
1 1_885020_76141036 0.03 1.30004
2 1_885020_76236801 0.02 1.15297
5 1_885020_76502318 0.50 2752.08000
8 1_885020_76502318 4.50 2753.93000
9 1_885020_76502318 0.50 2753.93000
... ... ...
1042 1_896967_123068980 0.01 1.17657
1044 1_896967_110335293 0.01 28.07100
1047 1_896967_110335293 0.01 24.14000
1053 1_896967_146913299 25.00 38.55000
1054 1_896967_147039856 2.00 121450.00000
How can I create a new column w_avg_price computing the moving weighted average price by key_br_acc_posid? The lot_in is the weight and the price is the value.
I tried many approaches with groupby() + np.average(), but I have to avoid the data aggregation. I need this value in each row.
Use groupby and then perform the calculation for each group using cumsum():
(df.groupby('key_br_acc_posid', as_index = False)
.apply(lambda g: g.assign(w_avg_price = (g['lot_in']*g['price']).cumsum()/g['lot_in'].cumsum()))
.reset_index(level = 0, drop = True)
)
result:
key_br_acc_posid lot_in price w_avg_price
---- ------------------ -------- ------------ -------------
1 1_885020_76141036 0.03 1.30004 1.30004
2 1_885020_76236801 0.02 1.15297 1.15297
5 1_885020_76502318 0.5 2752.08 2752.08
8 1_885020_76502318 4.5 2753.93 2753.74
9 1_885020_76502318 0.5 2753.93 2753.76
1044 1_896967_110335293 0.01 28.071 28.071
1047 1_896967_110335293 0.01 24.14 26.1055
1042 1_896967_123068980 0.01 1.17657 1.17657
1053 1_896967_146913299 25 38.55 38.55
1054 1_896967_147039856 2 121450 121450
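I'm not sure I'm calculating it right, but what you want is cumsum(). Note that the snippet below only accumulates lot_in * price, which is the numerator of the weighted average; dividing it by df['lot_in'].cumsum() would turn it into the cumulative weighted average.
import pandas as pd
df = pd.DataFrame({'lot_in':[.1,.2,.3],'price':[1.0,1.25,1.3]})
df['mvg_avg'] = (df['lot_in'] * df['price']).cumsum()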
print(df)
lot_in price mvg_avg
0 0.1 1.00 0.10
1 0.2 1.25 0.35
2 0.3 1.30 0.74
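The per-group cumulative weighted average can also be sketched without apply(), which keeps the original row order and index. This is only an illustration on the first five rows of the question's fragment; the intermediate names cum_value and cum_weight are made up for the example.

```python
import pandas as pd

# First five rows of the fragment from the question.
df = pd.DataFrame(
    {
        "key_br_acc_posid": [
            "1_885020_76141036",
            "1_885020_76236801",
            "1_885020_76502318",
            "1_885020_76502318",
            "1_885020_76502318",
        ],
        "lot_in": [0.03, 0.02, 0.50, 4.50, 0.50],
        "price": [1.30004, 1.15297, 2752.08, 2753.93, 2753.93],
    },
    index=pd.Index([1, 2, 5, 8, 9], name="ix"),
)

keys = df["key_br_acc_posid"]
# Running sum of weight * value and of the weights, restarted per key.
cum_value = (df["lot_in"] * df["price"]).groupby(keys).cumsum()
cum_weight = df["lot_in"].groupby(keys).cumsum()

# Cumulative weighted average, one value per original row.
df["w_avg_price"] = cum_value / cum_weight
print(df)
```

Because groupby(...).cumsum() returns a result aligned with the original index, no reset_index step is needed afterwards.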

Jupyter RMagic doesn't return interactive R output

According to the rpy2 docs it should be possible to see interactive R output like this:
%%R
X=c(1,4,5,7)
Y = c(2,4,3,9)
summary(lm(Y~X))
Which is supposed to produce this:
Call:
lm(formula = Y ~ X)
Residuals:
1 2 3 4
0.88 -0.24 -2.28 1.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0800 2.3000 0.035 0.975
X 1.0400 0.4822 2.157 0.164
Residual standard error: 2.088 on 2 degrees of freedom
Multiple R-squared: 0.6993,Adjusted R-squared: 0.549
F-statistic: 4.651 on 1 and 2 DF, p-value: 0.1638
But despite being able to return nice-looking graphs from R using stuff like this SO answer, the R interactive output from summary is not returned.
I can get round this by doing:
In [75]: %%R -o return_value
: X = c(1,4,5,7)
: Y = c(2,4,3,9)
: return_value = summary(lm(Y~X))
Then printing the returned value like this (although the output looks pretty bad because there are blank lines between each output line):
In [86]: print(return_value)
Call:
lm(formula = Y ~ X)
Residuals:
1 2 3 4
0.88 -0.24 -2.28 1.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0800 2.3000 0.035 0.975
X 1.0400 0.4822 2.157 0.164
Are the docs incorrect, or am I doing something wrong?

Make console-friendly string a useable pandas dataframe python

A quick question as I'm currently changing from R to pandas for some projects:
I get the following print output from metrics.classification_report in scikit-learn:
             precision    recall  f1-score   support
          0       0.67      0.67      0.67         3
          1       0.50      1.00      0.67         1
          2       1.00      0.80      0.89         5
avg / total       0.83      0.78      0.79         9
I want to use this (and similar ones) as a matrix/dataframe, so that I can subset it to extract, say, the precision of class 0.
In R, I'd give the first "column" a name like 'outcome_class' and then subset it:
my_dataframe[my_dataframe$outcome_class == 1, 'precision']
I can do this in pandas as well, but what I want to use as a dataframe is simply a string (see scikit-learn's docs).
How can I turn the table output here into a usable dataframe in pandas?
Assign it to a variable, s:
s = classification_report(y_true, y_pred, target_names=target_names)
Or directly:
s = '''
             precision    recall  f1-score   support
    class 0       0.50      1.00      0.67         1
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.67      0.80         3
avg / total       0.70      0.60      0.61         5
'''
Use that as the string input for StringIO:
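import io  # For Python 2.x use import StringIO
import pandas as pd
# The separator is a regular expression (two or more whitespace characters), which requires the Python parsing engine
df = pd.read_table(io.StringIO(s), sep=r'\s{2,}', engine='python')  # For Python 2.x use StringIO.StringIO(s)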
df
Out:
             precision  recall  f1-score  support
class 0            0.5    1.00      0.67        1
class 1            0.0    0.00      0.00        1
class 2            1.0    0.67      0.80        3
avg / total        0.7    0.60      0.61        5
Now you can slice it like an R data.frame:
df.loc['class 2']['f1-score']
Out: 0.80000000000000004
Here, classes are the index of the DataFrame. You can use reset_index() if you want to use it as a regular column:
df = df.reset_index().rename(columns={'index': 'outcome_class'})
df.loc[df['outcome_class']=='class 1', 'support']
Out:
1 1
Name: support, dtype: int64
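As an alternative to read_table, the report string can be split by hand, which makes the parsing rule explicit. The sketch below is only an illustration: report_to_frame is a hypothetical helper name, and it assumes the older report layout shown above, where every row has a label followed by exactly four numbers separated by at least two spaces.

```python
import re
import pandas as pd

report = """
             precision    recall  f1-score   support

    class 0       0.50      1.00      0.67         1
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.67      0.80         3

avg / total       0.70      0.60      0.61         5
"""

def report_to_frame(text):
    """Parse a classification report string into a DataFrame (hypothetical helper)."""
    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Columns are separated by runs of two or more whitespace characters.
    columns = re.split(r"\s{2,}", lines[0])
    rows = {}
    for ln in lines[1:]:
        label, *values = re.split(r"\s{2,}", ln)
        rows[label] = [float(v) for v in values]
    frame = pd.DataFrame.from_dict(rows, orient="index", columns=columns)
    frame["support"] = frame["support"].astype(int)
    return frame

df = report_to_frame(report)
print(df.loc["class 2", "f1-score"])
```

Recent scikit-learn versions can also return the report as a dictionary via classification_report(..., output_dict=True), which can be passed to pd.DataFrame directly and avoids string parsing altogether.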

MUSIC Algorithm Spectrum Python Implementation

I am working on a small radar project that can measure the Doppler shift created by the heart and chest. Since I know the number of sources in advance, I decided to choose the MUSIC Algorithm for spectral analysis. I am acquiring data and sending it to Python for analysis. However, my Python code is saying that the power for ALL frequencies of a signal with two mixed sinusoids of frequency 1 Hz and 2 Hz is equal. My code is linked here with a sample output:
from scipy import signal
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
import cmath
import scipy
N = 5
z = np.linspace(0,2*np.pi, num=N)
x = np.sin(2*np.pi * z) + np.sin(1 * np.pi * z) + np.random.random(N) * 0.3 # sample signal
conj = np.conj(x);
l = len(conj)
sRate = 25 # sampling rate
p = 2
flipped = [0 for h in range(0, l)]
flipped = conj[::-1]
acf = signal.convolve(x,flipped,'full')
a1 = scipy.linalg.toeplitz(c=np.asarray(acf),r=np.asarray(acf))#autocorrelation matrix that will be decomposed into eigenvectors
eigenValues,eigenVectors = LA.eig(a1)
idx = eigenValues.argsort()[::-1]
eigenValues = eigenValues[idx]
eigenVectors = eigenVectors[:,idx]
idx = eigenValues.argsort()[::-1]
eigenValues = eigenValues[idx]  # sorting the eigenvectors and eigenvalues from greatest to least eigenvalue
eigenVectors = eigenVectors[:,idx]
signal_eigen = eigenVectors[0:p]  # these vectors make up the signal subspace, using the number of principal components (2) to split the eigenvectors
noise_eigen = eigenVectors[p:len(eigenVectors)]  # noise subspace
for f in range(0, sRate):
    sum1 = 0
    frequencyVector = np.zeros(len(noise_eigen[0]), dtype=np.complex_)
    for i in range(0,len(noise_eigen[0])):
        frequencyVector[i] = np.conjugate(complex(np.cos(2 * np.pi * i * f), np.sin(2 * np.pi * i * f)))  # creating a frequency vector with e to the 2pi*k*f and taking the conjugate of each component
    for u in range(0,len(noise_eigen)):
        sum1 += (abs(np.dot(np.asarray(frequencyVector).transpose(), np.asarray( noise_eigen[u]) )))**2  # summing the dot product of each noise eigenvector and frequency vector, taking the absolute value and squaring
    print(1/sum1)
    print("\n")
"""
(OUTPUT OF THE ABOVE CODE)
0.120681885992
0
0.120681885992
1
0.120681885992
2
0.120681885992
3
0.120681885992
4
0.120681885992
5
0.120681885992
6
0.120681885992
7
0.120681885992
8
0.120681885992
9
0.120681885992
10
0.120681885992
11
0.120681885992
12
0.120681885992
13
0.120681885992
14
0.120681885992
15
0.120681885992
16
0.120681885992
17
0.120681885992
18
0.120681885992
19
0.120681885992
20
0.120681885992
21
0.120681885992
22
0.120681885992
23
0.120681885992
24
Process finished with exit code 0
"""
Here is the formula for the MUSIC Algorithm:
https://drive.google.com/file/d/0B5EG2FEWlIZwYmkteUludHNXS0k/view?usp=sharing
Mathematically, the problem is that i and f are both integers. Thus, 2*π*i*f is an integer multiple of 2π. Allowing for a tiny bit of round-off error, this gives you a cosine very close to 1.0 and a sine very close to 0.0, so frequencyVector is virtually identical from one iteration to the next.
I also see a problem in that you set up your signal_eigen matrix, but never use it. Isn't the signal itself required by this algorithm? As a result, all you're doing is sampling the noise at intervals of 2πi.
Let's try chopping up one cycle into sRate evenly-spaced sampling points. This results in spikes at 0.24 and 0.76 (out of the range 0.0 - 0.99). Does this match your intuition about how this should work?
signal_eigen = eigenVectors[0:p]
noise_eigen = eigenVectors[p:len(eigenVectors)]  # noise subspace
print("Signal\n", signal_eigen)
print("Noise\n", noise_eigen)
for f_int in range(0, sRate * p + 1):
    sum1 = 0
    # np.complex_ was removed in NumPy 2.0; the built-in complex is equivalent here
    frequencyVector = np.zeros(len(noise_eigen[0]), dtype=complex)
    f = float(f_int) / sRate
    for i in range(0, len(noise_eigen[0])):
        # create a frequency vector with e to the 2pi*k*f, taking the conjugate of each component
        frequencyVector[i] = np.conjugate(complex(np.cos(2 * np.pi * i * f), np.sin(2 * np.pi * i * f)))
        # print(f, i, np.pi, np.cos(2 * np.pi * i * f))
    # print(frequencyVector)
    for u in range(0, len(noise_eigen)):
        # sum the squared magnitude of the dot product of each noise eigenvector and the frequency vector
        sum1 += (abs(np.dot(np.asarray(frequencyVector).transpose(), np.asarray(noise_eigen[u]))))**2
    print(f, 1/sum1)
Output
Signal
[[ -3.25974386e-01 3.26744322e-01 -5.24205744e-16 -1.84108176e-01
-7.07106781e-01 -6.86652798e-17 2.71561652e-01 3.78607948e-16
4.23482344e-01]
[ 3.40976541e-01 5.42419088e-02 -5.00000000e-01 -3.62655793e-01
-1.06880232e-16 3.53553391e-01 -3.89304223e-01 -3.53553391e-01
3.12595284e-01]]
Noise
[[ -3.06261935e-01 -5.16768248e-01 7.82012443e-16 -3.72989138e-01
-3.12515753e-16 -5.00000000e-01 5.19589478e-03 -5.00000000e-01
-2.51205535e-03]
[ 3.21775774e-01 8.19916352e-02 5.00000000e-01 -3.70053622e-01
1.44550753e-16 3.53553391e-01 4.33613344e-01 -3.53553391e-01
-2.54514258e-01]
[ -4.00349040e-01 4.82750272e-01 -8.71533036e-16 -3.42123880e-01
-2.68725150e-16 2.42479504e-16 -4.16290671e-01 -4.89739378e-16
-5.62428795e-01]
[ 3.21775774e-01 8.19916352e-02 -5.00000000e-01 -3.70053622e-01
-2.80456498e-16 -3.53553391e-01 4.33613344e-01 3.53553391e-01
-2.54514258e-01]
[ -3.06261935e-01 -5.16768248e-01 1.08027782e-15 -3.72989138e-01
-1.25036869e-16 5.00000000e-01 5.19589478e-03 5.00000000e-01
-2.51205535e-03]
[ 3.40976541e-01 5.42419088e-02 5.00000000e-01 -3.62655793e-01
-2.64414807e-16 -3.53553391e-01 -3.89304223e-01 3.53553391e-01
3.12595284e-01]
[ -3.25974386e-01 3.26744322e-01 -4.97151703e-16 -1.84108176e-01
7.07106781e-01 -1.62796158e-16 2.71561652e-01 2.06561854e-16
4.23482344e-01]]
0.0 0.115397176866
0.04 0.12355071192
0.08 0.135377011677
0.12 0.136669716901
0.16 0.148772917566
0.2 0.195742574649
0.24 0.237792763699
0.28 0.181921271171
0.32 0.12959840172
0.36 0.121070836044
0.4 0.139075881122
0.44 0.139216853056
0.48 0.117815494324
0.52 0.117815494324
0.56 0.139216853056
0.6 0.139075881122
0.64 0.121070836044
0.68 0.12959840172
0.72 0.181921271171
0.76 0.237792763699
0.8 0.195742574649
0.84 0.148772917566
0.88 0.136669716901
0.92 0.135377011677
0.96 0.12355071192
I'm also unsure of the correct implementation; having more of the paper for formula context would help. I'm not certain about the range and sampling of the f values. When I worked on FFT software, f was swept over the wave form in small increments, typically 2π/sRate.
I'm not getting those distinctive spikes now -- not sure what I did before. I made a small parametrized change, adding a num_slice variable:
num_slice = sRate * N
for f_int in range(0, num_slice + 1):
    sum1 = 0
    frequencyVector = np.zeros(len(noise_eigen[0]), dtype=complex)
    f = float(f_int) / num_slice
You can compute it however you like, of course, but the ensuing loop runs through just the one cycle. Here's my output:
0.0 0.136398199883
0.008 0.136583829848
0.016 0.13711117893
0.024 0.137893463111
0.032 0.138792904453
0.04 0.139633157335
0.048 0.140219450839
0.056 0.140365986349
0.064 0.139926689416
0.072 0.138822121693
0.08 0.137054535152
0.088 0.13470609994
0.096 0.131921188389
0.104 0.128879079596
0.112 0.125765649854
0.12 0.122750994163
0.128 0.119976226317
0.136 0.117549199221
0.144 0.115546862203
0.152 0.114021482029
0.16 0.113008398728
0.168 0.112533730494
0.176 0.112621097254
0.184 0.113296863522
0.192 0.114593615279
0.2 0.116551634665
0.208 0.119218062482
0.216 0.12264326497
0.224 0.126873674308
0.232 0.131940131305
0.24 0.137840727381
0.248 0.144517728837
0.256 0.151830000359
0.264 0.159526062508
0.272 0.167228413981
0.28 0.174444818009
0.288 0.180621604818
0.296 0.185241411664
0.304 0.187943197745
0.312 0.188619481273
0.32 0.187445977812
0.328 0.184829467764
0.336 0.181300320748
0.344 0.177396490666
0.352 0.173576190425
0.36 0.170171993077
0.368 0.167379359825
0.376 0.165265454514
0.384 0.163786582966
0.392 0.16280869726
0.4 0.162130870823
0.408 0.161514399035
0.416 0.160719375729
0.424 0.159546457646
0.432 0.157875982968
0.44 0.155693319037
0.448 0.153091632029
0.456 0.150251065569
0.464 0.147402137481
0.472 0.144785618099
0.48 0.14261932062
0.488 0.141076562538
0.496 0.140275496354
0.504 0.140275496354
0.512 0.141076562538
0.52 0.14261932062
0.528 0.144785618099
0.536 0.147402137481
0.544 0.150251065569
0.552 0.153091632029
0.56 0.155693319037
0.568 0.157875982968
0.576 0.159546457646
0.584 0.160719375729
0.592 0.161514399035
0.6 0.162130870823
0.608 0.16280869726
0.616 0.163786582966
0.624 0.165265454514
0.632 0.167379359825
0.64 0.170171993077
0.648 0.173576190425
0.656 0.177396490666
0.664 0.181300320748
0.672 0.184829467764
0.68 0.187445977812
0.688 0.188619481273
0.696 0.187943197745
0.704 0.185241411664
0.712 0.180621604818
0.72 0.174444818009
0.728 0.167228413981
0.736 0.159526062508
0.744 0.151830000359
0.752 0.144517728837
0.76 0.137840727381
0.768 0.131940131305
0.776 0.126873674308
0.784 0.12264326497
0.792 0.119218062482
0.8 0.116551634665
0.808 0.114593615279
0.816 0.113296863522
0.824 0.112621097254
0.832 0.112533730494
0.84 0.113008398728
0.848 0.114021482029
0.856 0.115546862203
0.864 0.117549199221
0.872 0.119976226317
0.88 0.122750994163
0.888 0.125765649854
0.896 0.128879079596
0.904 0.131921188389
0.912 0.13470609994
0.92 0.137054535152
0.928 0.138822121693
0.936 0.139926689416
0.944 0.140365986349
0.952 0.140219450839
0.96 0.139633157335
0.968 0.138792904453
0.976 0.137893463111
0.984 0.13711117893
0.992 0.136583829848
1.0 0.136398199883
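For comparison, here is a compact, vectorised sketch of a MUSIC pseudospectrum on a synthetic two-tone signal. It is not the asker's code or a reference implementation. The signal length, the window length M, the noise level and the frequency grid are all assumptions made for the example. It also relies on two facts: NumPy returns eigenvectors as columns (so the subspaces are selected with eigvecs[:, ...], not by rows), and each real sinusoid occupies two dimensions of the signal subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 25.0                       # sampling rate in Hz (sRate in the question)
n = np.arange(500)              # 20 s of samples (assumed length)
x = (np.sin(2 * np.pi * 1.0 * n / fs)
     + np.sin(2 * np.pi * 2.0 * n / fs)
     + 0.1 * rng.standard_normal(n.size))   # 1 Hz + 2 Hz + noise

M = 50                          # window length (assumed choice)
# Each row is one length-M window of the signal.
snapshots = np.array([x[i:i + M] for i in range(x.size - M + 1)])
R = snapshots.T @ snapshots / snapshots.shape[0]   # M x M autocorrelation estimate

# eigh returns eigenvalues in ascending order and eigenvectors as COLUMNS.
eigvals, eigvecs = np.linalg.eigh(R)
n_signal = 4                    # two real sinusoids = four complex exponentials
noise = eigvecs[:, :M - n_signal]                  # noise subspace (smallest eigenvalues)

# Steering vectors e(f) on a 0.01 Hz grid from 0 Hz up to Nyquist.
freqs = np.linspace(0.0, fs / 2, 1251)
steering = np.exp(2j * np.pi * np.outer(np.arange(M), freqs) / fs)

# MUSIC pseudospectrum: inverse of the energy of e(f) in the noise subspace.
pseudo = 1.0 / np.sum(np.abs(noise.conj().T @ steering) ** 2, axis=0)

# Report the two strongest local maxima.
is_peak = (pseudo[1:-1] > pseudo[:-2]) & (pseudo[1:-1] > pseudo[2:])
peak_idx = np.flatnonzero(is_peak) + 1
top_two = peak_idx[np.argsort(pseudo[peak_idx])[-2:]]
peak_freqs = np.sort(freqs[top_two])
print(peak_freqs)
```

The key differences from the loops above are that the frequency is expressed in Hz and divided by the sampling rate, so the phase 2π·i·f/fs is no longer an integer multiple of 2π, and that the noise subspace is taken column-wise from the eigenvector matrix.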
