Fitting Linear Regression with one row of data - Scikit Learn - python

I have my data in the following format:
http://www.codeskulptor.org/#user41_xK7cevIoVs_0.py
Basically, I have one row per observation, and I want to find a linear combination of the variables that predicts the output variable.
PNL is my output variable; GRID_MTH_DT is different per row and I need to keep it.
All others are my independent variables.
What I want is an equation like this
PNL = linear_fn(WTI,Brent,Canada C5,Canadian Heavy,Gulf Coast ASCI,Gulf Coast LLS,Gulf Coast Mars,Gulf Coast St. James,Midwest Bakken,Midwest WTS,Midwest Midland,Midwest WTI)
How can I achieve this?
I have tried to fit this with a linear regression model using scikit-learn, but all the coefficients come out as 0:
('Coefficients: \n', array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]), '\n')
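For illustration, a minimal version of the kind of fit being attempted might look like the sketch below (the file name and DataFrame setup are assumed, not taken from the linked data):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical CSV with one row per GRID_MTH_DT; column names follow the
# description above.
df = pd.read_csv("prices.csv")
feature_cols = ["WTI", "Brent", "Canada C5", "Canadian Heavy",
                "Gulf Coast ASCI", "Gulf Coast LLS", "Gulf Coast Mars",
                "Gulf Coast St. James", "Midwest Bakken", "Midwest WTS",
                "Midwest Midland", "Midwest WTI"]

X = df[feature_cols]
y = df["PNL"]

model = LinearRegression().fit(X, y)
print("Coefficients:\n", model.coef_)
print("Intercept:", model.intercept_)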
I'd appreciate any help here.

Related

How to estimate the importance of a query for a particular document?

I have two lists of words:
q = ['hi', 'how', 'are', 'you']
doc1 = ['hi', 'there', 'guys']
doc2 = ['how', 'is', 'it', 'going']
Is there any way to calculate a "relevance" or importance score between q and doc1 and doc2? My intuition tells me I can do this through IDF. So here is an implementation of IDF:
from math import log

def IDF(term, allDocs):
    docsWithTheTerm = 0
    for doc in allDocs:
        if term.lower() in allDocs[doc].lower().split():
            docsWithTheTerm = docsWithTheTerm + 1
    if docsWithTheTerm > 0:
        return 1.0 + log(float(len(allDocs)) / docsWithTheTerm)
    else:
        return 1.0
However, this doesn't by itself give me something like a "relevance score". Is IDF the correct way of getting a relevance score? And if IDF is the wrong way of measuring the importance of a query given a document, how can I get something like a "relevance score"?
The premise of tf-idf is to place emphasis on the rarer words that appear in the text: focusing on overly common words will not allow one to determine which words are meaningful and which are not.
In your example, here is how you could implement tf-idf in Python:
import re
from nltk.tokenize import word_tokenize

doc1 = ['hi', 'there', 'guys']
doc2 = ['how', 'is', 'it', 'going']

# Join the two lists into one string and keep letters only
stringdata = str(doc1) + str(doc2)
text2 = re.sub('[^A-Za-z]+', ' ', stringdata)

# Tokenize
text3 = word_tokenize(text2)
print(text3)
The words have been tokenized and appear as follows:
['hi', 'there', 'guys', 'how', 'is', 'it', 'going']
Then, a matrix is generated:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text3).todense()
This is the matrix output:
matrix([[0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0., 0., 0.]])
However, to make sense of this matrix, we now store it as a pandas DataFrame, with word frequency in ascending order:
import pandas as pd
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=True)
Here is what we come up with:
going 1.0
guys 1.0
hi 1.0
how 1.0
is 1.0
it 1.0
there 1.0
dtype: float64
In this example, there is little context to the words: all three sentences are common introductions. Therefore, tf-idf won't necessarily reveal anything meaningful here, but in the context of a text with 1000+ words, for example, tf-idf can be quite useful for determining importance across words. For instance, you might decide that words appearing between 20 and 100 times in the text are rare, yet occur often enough to merit importance.
In this particular case, one could potentially obtain a relevance score by determining how many times the words in the query appear in the relevant documents - specifically the words that tf-idf have flagged as important.
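For example, one way to turn that idea into a concrete per-document relevance score is to build the tf-idf vocabulary from the documents and compare the query vector to each document vector with cosine similarity. A minimal sketch (not the code above, just an illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

q = "hi how are you"
docs = ["hi there guys", "how is it going"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)   # learn the vocabulary from the documents
q_vec = vectorizer.transform([q])             # project the query into the same space

scores = cosine_similarity(q_vec, doc_matrix)[0]
for doc, score in zip(docs, scores):
    print(round(score, 3), doc)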
Basically, you have to represent the words as numbers somehow so you can do arithmetic on them to find "similarity". TF-IDF is one such way and Michael Grogan's answer should get you started there.
Another way is to use a pretrained Word2Vec or GloVe model. These word embedding models map words to a set of numbers which represent the semantic meaning of the word.
Libraries such as Gensim would allow you to very easily use pretrained embedding models to measure similarity. See here: https://github.com/RaRe-Technologies/gensim-data
Edit: For more advanced word embeddings, check out ELMo or BERT.
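As a rough sketch of the Gensim route (the model name is one of the pretrained sets from the gensim-data catalogue linked above; treat the choice as an assumption):

import gensim.downloader as api

# Downloads a small pretrained GloVe model on first use
model = api.load("glove-wiki-gigaword-50")

q = ['hi', 'how', 'are', 'you']
doc1 = ['hi', 'there', 'guys']
doc2 = ['how', 'is', 'it', 'going']

# n_similarity compares two sets of words via their averaged embeddings
print(model.n_similarity(q, doc1))
print(model.n_similarity(q, doc2))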

How can I change multiple values at once in pandas dataframe, using arrays as indices that vary in length?

I want to change a number of values in my pandas DataFrame, where the arrays indicating the column indices may vary in length.
I need something that is faster than a for-loop, because it will be done on a lot of rows, and this turned out to be too slow.
As a simple example, consider this
df = pd.DataFrame(np.zeros((5,5)))
Now, I want to change some of the values in this DataFrame to 1. If, for example, I want to change the values in the first two columns of the second and fifth rows, but change all the values in the fourth row, I would like something like this to work:
col_indices = np.array([np.arange(2), np.arange(5), np.arange(2)])
row_indices = np.array([1, 3, 4])
df.loc[row_indices, col_indices] = 1
However, this does not work (I suspect because the shape of the data that would be selected does not conform to a DataFrame).
Is there any more flexible way of indexing without having to loop over rows?
A solution that works only for range-like arrays (as above) would solve my current problem, but a general answer would also be nice.
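For context, the kind of row-by-row loop this is meant to replace might look like the following (purely illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 5)))
col_indices = [np.arange(2), np.arange(5), np.arange(2)]
row_indices = [1, 3, 4]

# Slow version: set the selected columns to 1, one row at a time
for row, cols in zip(row_indices, col_indices):
    df.iloc[row, cols] = 1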
Thanks for any help!
IIUC, here's one approach. Define the column indices as the number of columns in which you want to insert 1s, together with the rows where you want to insert them:
col_indices = np.array([2,5,2])
row_indices = np.array([1,3,4])
arr = df.values
And use advanced indexing to set the cells of interest to 1:
arr[row_indices] = np.arange(arr.shape[1]) < col_indices[:, None]
array([[0., 0., 0., 0., 0.],
[1., 1., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1.],
[1., 1., 0., 0., 0.]])
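To get back to a DataFrame after editing the underlying array, something along these lines should work (a small sketch using the same 5x5 example):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 5)))
col_indices = np.array([2, 5, 2])
row_indices = np.array([1, 3, 4])

arr = df.values
# For each selected row, broadcast a boolean mask: True for the first k columns
arr[row_indices] = np.arange(arr.shape[1]) < col_indices[:, None]

result = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(result)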

Dummy variables: is it necessary to standardize them?

I have the following dataset, represented as a numpy array:
direccion_viento_pos
Out[32]:
array([['S'],
['S'],
['S'],
...,
['SO'],
['NO'],
['SO']], dtype=object)
The dimension of this array is:
direccion_viento_pos.shape
(17249, 8)
I am using python and scikit learn to encode these categorical variables in this way:
from __future__ import unicode_literals
import pandas as pd
import numpy as np
# from sklearn import preprocessing
# from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
Then I create a label encoder object:
labelencoder_direccion_viento_pos = LabelEncoder()
I take column position 0 (the only column) of direccion_viento_pos and apply the fit_transform() method to all of its rows:
direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])
My direccion_viento_pos now looks like this:
direccion_viento_pos[:, 0]
array([5, 5, 5, ..., 7, 3, 7], dtype=object)
Up to this point, each row/observation of direccion_viento_pos has a numeric value, but I want to avoid the weighting problem that arises because some rows now have higher values than others.
Due to this, I create the dummy variables, which according to this reference are:
A Dummy variable or Indicator Variable is an artificial variable created to represent an attribute with two or more distinct categories/levels
Then, in my direccion_viento_pos context, I have 8 values:
SO - Southwest
SE - Southeast
S - South
N - North
NO - Northwest
NE - Northeast
O - West
E - East
That is, 8 categories.
Next, I create a OneHotEncoder object with the categorical_features argument, which specifies which features will be treated as categorical variables:
onehotencoder = OneHotEncoder(categorical_features = [0])
And apply this onehotencoder to our direccion_viento_pos matrix.
direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()
My direccion_viento_pos with its one-hot encoded variables now looks like this:
direccion_viento_pos
array([[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 1.]])
So, up to here, I've created dummy variables for each category.
I wanted to walk through this process to arrive at my question.
If these dummy-encoded variables are already in a 0-1 range, is it necessary to apply MinMaxScaler feature scaling?
Some say that it is not necessary to scale these dummy variables. Others say that it is necessary, because we want accuracy in the predictions.
I ask because when I apply MinMaxScaler with feature_range=(0, 1), my values change in some positions ... despite still keeping this scale.
What is the best option to choose with respect to my dataset direccion_viento_pos?
I don't think scaling them will change the answer at all. They're all on the same scale already. Min 0, max 1, range 1. If some continuous variables were present, you'd want to normalize the continuous variables only, leaving the dummy variables alone. You could use the min-max scaler to give those continuous variables the same minimum of zero, max of one, range of 1. Then your regression slopes would be very easy to interpret. Your dummy variables are already normalized.
Here's a related question asking if one should ever standardize binary variables.
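If continuous columns do sit alongside the dummies, a minimal sketch of scaling only the continuous ones could look like this (the frame and column names are made up for illustration):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame: one-hot wind-direction columns plus two continuous features
df = pd.DataFrame({
    "dir_N": [1, 0, 0], "dir_S": [0, 1, 1],   # dummies: leave untouched
    "temperature": [12.0, 25.0, 31.0],         # continuous: scale
    "humidity": [0.2, 0.8, 0.5],
})

continuous_cols = ["temperature", "humidity"]
scaler = MinMaxScaler(feature_range=(0, 1))
df[continuous_cols] = scaler.fit_transform(df[continuous_cols])
print(df)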

5D Kalman filter is not working, and we're not sure where we're going wrong

Right now we're attempting to create a 5-dimensional Kalman filter that receives as input the x and y coordinates of a small bug that is wandering around in a box. We observe this bug as it makes roughly 2000 movements, then predict its position from then on. The dimensions are x-coord, y-coord, velocity, heading, and angular acceleration. Following is the code that we have so far.
#x is a list of five variables - x, y, velocity, angularVelocity, angularAcceleration
#currentmeasurement is the x and y that were observed
def kalman_filter(x, P, currentmeasurement, lastMeasurement = None):
    prevmeasurement = []
    #if there is a lastMeasurement argument, it becomes measurement
    if lastMeasurement:
        prevmeasurement = [lastMeasurement[0], lastMeasurement[1], x.value[3][0]]
    #if there is no lastMeasurement argument, the current measurement becomes measurement.
    else:
        prevmeasurement = [x.value[0][0], x.value[1][0], x.value[3][0]]

    #Prediction Step
    a = x.value[3][0]
    F = matrix([
        [1., 0., cos(a), 0., 0.],
        [0., 1., sin(a), 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., dt],
        [0., 0., 0., 0., 1.]])
    x = (F * x) + u
    P = F * P * F.transpose()

    #we can calculate the heading from our observations.
    heading = atan2(currentmeasurement[0][1] - prevmeasurement[1],
                    currentmeasurement[0][0] - prevmeasurement[0])
    while abs(heading - prevmeasurement[2]) > pi:
        heading = heading + 2*pi*((prevmeasurement[2] - heading)/abs(prevmeasurement[2] - heading))
    #perhaps the velocity should also be calculated?

    # measurement update
    dt = 1.
    u = matrix([[0.], [0.], [0.], [0.], [0.]])  # external motion
    H = matrix([[1., 0., 0., 0., 0.],  # the measurement function
                [0., 1., 0., 0., 0.],
                [0., 0., 0., 1., 0.]])
    R = matrix([[1., 0., 0.],  # measurement uncertainty
                [0., 1., 0.],
                [0., 0., 1.]])
    I = matrix([[]])  # a 5x5 identity matrix
    I.identity(5)

    prevmeasurement = [currentmeasurement[0][0], currentmeasurement[0][1], heading]
    Z = matrix([prevmeasurement])
    y = Z.transpose() - (H * x)
    S = H * P * H.transpose() + R
    K = P * H.transpose() * S.inverse()
    x = x + (K * y)
    P = (I - (K * H)) * P

    return x, P
This is resulting in wildly incorrect estimations. We're not sure what we're doing wrong here - I think we're following all of the steps correctly, but not sure if we have all of the required matrices correct. Any input would be helpful!
The top comment has the wrong description. Your state must be x, y, velocity, angle, angularVelocity
You're missing Q, the process covariance. It should reflect how much your state can change between updates, and is added to your update of P in the predict step.
You're building an EKF (since your update requires trig, it is nonlinear). You've constructed a matrix F which performs your state update, but what you need for the process covariance update is the Jacobian of your update function. In your case it looks like:
J = matrix([
    [1., 0., cos(a), -sin(a), 0.],
    [0., 1., sin(a),  cos(a), 0.],
    [0., 0., 1., 0., 0.],
    [0., 0., 0., 1., dt],
    [0., 0., 0., 0., 1.]])
If your only direct measurement is position, you should not compute heading and use it as a measurement. Let the KF do it.
You should set R to be the real uncertainties of the measurement. The noise represented by Q, R, and propagated in P is more important than your state x if you want the KF to work.
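As a rough numpy sketch of what the predict step looks like with Q included (this is generic EKF bookkeeping, not the asker's matrix class; build_F and build_J stand in for functions that construct the state-update matrix and its Jacobian from the current state):

import numpy as np

def predict(x, P, build_F, build_J, Q, u=None):
    # Propagate the state with the update matrix evaluated at the current state
    F = build_F(x)
    x_new = F @ x
    if u is not None:
        x_new = x_new + u          # optional external motion
    # Propagate the covariance with the Jacobian and add process noise
    J = build_J(x)
    P_new = J @ P @ J.T + Q
    return x_new, P_new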
I was working on a very similar problem, and having exactly the same issues. Given noisy (x,y) measurements of a robot's position, and the knowledge that it is turning at constant velocity in a circle of constant radius, predict its next position.
I also attempted a 5D EKF, which would blow up after a few iterations. My state consisted of:
[x, y, distance_per_timestep, heading, heading_delta_per_timestep]
I tried numerous formulations for the state transition matrix (Jacobian of f(x), modified version of Jacobian, etc). In increasing desperation, I tried manually tweaking Kalman gains, covariances, everything... nothing seemed to work.
My final solution, which works amazingly well, was to use a 2D Kalman filter on the x and y positions, with velocity, heading, and turning rate calculated from the previous 3 positions and used as control inputs to the state estimation update. I tried other variations on that theme (a state of [x, y, vx, vy, ax, ay] and so forth), but nothing beat the KISS (keep it simple, stupid) approach.
At first I used the raw, noisy measurements to calculate velocity, heading, and delta_heading. I tried using my state estimates of x and y after the filter had run for a while, but that led to instability too: a very slow onset, but a gently growing weave back and forth across the ground-truth track that eventually turned into a piece of Spirograph art! I later found that if I did my measurement update first, then used those state values (after the measurement update, but before the state update), I'd get slightly better performance than with the raw measurements.
I'm not sure if my approach would solve your problem (I keep getting noisy measurements, and only need to predict a few moves ahead in order to "catch" my runaway), but take from it what you may.
I am also very interested in knowing why my EKF blew up, and also how to formulate the correct F matrix in a non-linear problem like this- the "proper" Jacobian, if I calculated it correctly, causes the EKF to blow up almost immediately!
I'd be very interested in hearing your solution, or any other advances you may have made! Good luck!
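Following up on the 2D-filter-plus-control-input idea above, a minimal sketch might look like this (the matrices and noise values are illustrative assumptions, with the displacement u computed outside the filter from the last few positions):

import numpy as np

def kf_step(x, P, z, u, Q, R):
    # 2D KF on position only: x is the position estimate, z the noisy
    # measurement, u the externally computed displacement for this step.
    x_pred = x + u                      # predict: move by the supplied displacement
    P_pred = P + Q

    y = z - x_pred                      # innovation (measurement matrix H = I)
    S = P_pred + R                      # innovation covariance
    K = P_pred @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(2) - K) @ P_pred
    return x_new, P_new

# Example usage with made-up numbers
x, P = np.array([0.0, 0.0]), np.eye(2)
Q, R = 0.01 * np.eye(2), 0.1 * np.eye(2)
x, P = kf_step(x, P, z=np.array([1.1, 0.2]), u=np.array([1.0, 0.0]), Q=Q, R=R)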
Add Q, the model noise matrix.
Your H matrix should be the Jacobian, not your Q matrix.
Don't estimate anything yourself for the filter. Let the Kalman filter do it for you and stick to the observed space.
If you send the main script I can help more.

scipy.optimize.leastsq returns best guess parameters not new best fit

I want to fit a Lorentzian peak to a set of data x and y; the data itself is fine. Other programs like OriginLab fit it perfectly, but I wanted to automate the fitting with Python, so I have the code below, which is based on http://mesa.ac.nz/?page_id=1800
The problem I have is that the scipy.optimize.leastsq returns as the best fit the same initial guess parameters I passed to it, essentially doing nothing. Here is the code.
from scipy.optimize import leastsq

#x, y are the arrays with the x, y axes respectively
#defining functions
def lorentzian(x, p):
    return p[2]*(p[0]**2)/((x - (p[1]))**2 + p[0]**2)

def residuals(p, y, x):
    err = y - lorentzian(x, p)
    return err

p = [0.055, wv[midIdx], y[midIdx-minIdx]]
pbest = leastsq(residuals, p, args=(y, x), full_output=1)
best_parameters = pbest[0]
print p
print pbest
p are the initial guesses and best_parameters are the returned 'best fit' parameters from leastsq, but they are always the same.
This is what is returned with full_output=1 (the long numeric arrays have been shortened but are still representative):
[0.055, 855.50732, 1327.0]
(array([ 5.50000000e-02, 8.55507324e+02, 1.32700000e+03]),
None, {'qtf':array([ 62.05192947, 69.98033905, 57.90628052]),
'nfev': 4,
'fjac': array([[-0., 0., 0., 0., 0., 0., 0.,],
[ 0., -0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
[ 0., 0., -0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.]]),
'fvec': array([ 62.05192947, 69.98033905,
53.41218567, 45.49879837, 49.58242035, 36.66483688,
34.74443436, 50.82238007, 34.89669037]),
'ipvt': array([1, 2, 3])},
'The cosine of the angle between func(x) and any column of the\n Jacobian
is at most 0.000000 in absolute value', 4)
Can anyone see what's wrong?
A quick Google search hints at a problem with the data being single precision (your other programs almost certainly upcast to double precision, though this is admittedly a problem with scipy as well; see also this bug report). If you look at your full_output=1 result, you see that the Jacobian is approximated as zero everywhere.
So giving the Jacobian explicitly might help (though even then you may want to upcast, because the minimum relative-error precision you can get with single precision is just very limited).
Solution: the easiest and numerically best solution (of course giving the real Jacobian is also a bonus) is to just cast your x and y data to double precision (x = x.astype(np.float64) will do for example).
I would not suggest this, but you may also be able to fix it by setting the epsfcn keyword argument (and probably the tolerance keyword arguments too) by hand, something along the lines of epsfcn=np.finfo(np.float32).eps. This seems to fix the issue in a way, but (since most calculations are with scalars, and scalars do not force an upcast in your calculation) the calculations are done in float32 and the precision loss seems to be rather big, at least when not providing Dfun.
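For concreteness, the cast-to-double fix is just this (a minimal runnable sketch; the Lorentzian and residuals are copied from the question, and the data here is synthetic only to make the snippet self-contained):

import numpy as np
from scipy.optimize import leastsq

def lorentzian(x, p):
    return p[2]*(p[0]**2)/((x - p[1])**2 + p[0]**2)

def residuals(p, y, x):
    return y - lorentzian(x, p)

# Synthetic single-precision data, standing in for the real measurement
x = np.linspace(850.0, 860.0, 200, dtype=np.float32)
y = lorentzian(x, [0.1, 855.5, 1300.0]).astype(np.float32)

# The fix: upcast to double precision before fitting
x = x.astype(np.float64)
y = y.astype(np.float64)

p0 = [0.055, 855.5, 1327.0]
pbest, ier = leastsq(residuals, p0, args=(y, x))
print(pbest)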
