Calculate logistic regression in Python

I tried to fit a logistic regression. I have the data as a CSV file.
It looks like this:
node_id,second_major,gender,major_index,year,dorm,high_school,student_fac
0,0,2,257,2007,111,2849,1
1,0,2,271,2005,0,51195,2
2,0,2,269,2007,0,21462,1
3,269,1,245,2008,111,2597,1
..........................
This is my code:
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("Reed98.csv")
print df.describe()
dummy_ranks = pd.get_dummies(df['second_major'], prefix='second_major')
cols_to_keep = ['second_major', 'dorm', 'high_school']
data = df[cols_to_keep].join(dummy_ranks.ix[:, 'year':])
train_cols = data.columns[1:]
# Index([gre, gpa, prestige_2, prestige_3, prestige_4], dtype=object)
logit = sm.Logit(data['second_major'], data[train_cols])
result = logit.fit()
print result.summary()
When I run the code in Python I get an error:
Traceback (most recent call last):
File "D:\project\logisticregression.py", line 24, in <module>
result = logit.fit()
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6- win32.egg\statsmodels\discrete\discrete_model.py", line 282, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6- win32.egg\statsmodels\discrete\discrete_model.py", line 233, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6- win32.egg\statsmodels\base\model.py", line 291, in fit
hess=hess)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\base\model.py", line 341, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 328, in solve
raise LinAlgError('Singular matrix')
LinAlgError: Singular matrix
How should I rewrite the code?

There's nothing wrong with your code. My guess is that you have missing values in your data. Try calling dropna on the DataFrame first, or pass missing='drop' to Logit. You might also check that the right-hand side is full rank by comparing np.linalg.matrix_rank(data[train_cols].values) with the number of columns.
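A minimal sketch of the two checks suggested above, on a made-up design matrix rather than the Reed98 data. The frame `exog` and its columns `a`, `b`, `c` are hypothetical; column `c` is built as an exact sum of the other two so that the matrix is deliberately rank deficient, which is one way a singular Hessian can arise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy design matrix: column "c" is exactly a + b,
# so the three columns are linearly dependent.
exog = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.0, 1.0, 0.0, 1.0],
})
exog["c"] = exog["a"] + exog["b"]

# Full rank means the rank equals the number of columns.
rank = np.linalg.matrix_rank(exog.values)
n_cols = exog.shape[1]
full_rank = rank == n_cols
print("rank:", rank, "columns:", n_cols, "full rank:", full_rank)

# Rows with missing values can be dropped before fitting.
with_gaps = exog.copy()
with_gaps.loc[1, "a"] = np.nan
cleaned = with_gaps.dropna()
print("rows before:", len(with_gaps), "rows after dropna:", len(cleaned))
```

If the rank comes out smaller than the number of columns, removing the redundant column (for example one dummy level) is the usual remedy.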

Related

2D array error in python using scikitlearn package

I have used the following code in PyCharm, but I keep getting the error shown below:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
import matplotlib.pyplot as plt
df=pd.read_csv(r"C:\Users\gmcks\Downloads\Data samples\homeprices.csv")
df
https://docs.google.com/spreadsheets/d/1wxaadKAHTZtECv6gW6Mpreq3tFb2PWgVOhqANbWlIAk/edit?usp=sharing
x=df[["area"]]
y=df.price
reg=linear_model.LinearRegression()
reg.fit(x,y)
LinearRegression()
m=reg.coef_
c=reg.intercept_
print(m,c)
reg.predict(2000)
Error:
Traceback (most recent call last):
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3319, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-30-b5b06b1b028e>", line 1, in <module>
reg.predict(2000)
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\sklearn\linear_model\_base.py", line 236, in predict
return self._decision_function(X)
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\sklearn\linear_model\_base.py", line 218, in _decision_function
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
return f(**kwargs)
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\sklearn\utils\validation.py", line 616, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got scalar array instead:
array=2000.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Why do I have to reshape my data again when I have already written df[["area"]]? That expression gives a (5, 1) frame, so a 2D array has already been created.
You need to provide input with the same shape as your predictors, that is, a 2D structure with one row per sample and one column per feature:
from sklearn import linear_model
import numpy as np
import pandas as pd
np.random.seed(111)
df = pd.DataFrame({'x' : np.random.uniform(0,1,100),
'y' : np.random.uniform(0,1,100)})
reg=linear_model.LinearRegression()
reg.fit(df[["x"]],df['y'])
Then pass a single sample as a nested list (one row, one column):
reg.predict([[2000]])
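The shape of the training frame does not carry over to predict; the argument passed to predict is checked on its own. A small NumPy-only sketch of the shapes involved (the variable names are just for illustration):

```python
import numpy as np

# A bare number becomes a 0-dimensional array, which is what the error reports.
scalar = np.array(2000)

# One sample with one feature: shape (1, 1).
one_sample = scalar.reshape(1, -1)
nested = np.array([[2000]])          # same shape as one_sample

# Several samples of a single feature: shape (n, 1).
several = np.array([2000, 2500, 3000])
as_column = several.reshape(-1, 1)

print(scalar.ndim, one_sample.shape, nested.shape, as_column.shape)
```

So reshape(1, -1) suits a single sample, and reshape(-1, 1) suits many values of a single feature, matching the two hints in the error message.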

PatsyError: Error evaluating factor: NameError:

I am an absolute newbie in Python programming and am currently learning basic statistics with it.
I am facing a
"PatsyError: Error evaluating factor: NameError:"
on the line pred = model.predict(pd.DataFrame(calo['wt']))
Below is my code:
# For reading data set
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading a csv file using pandas library
calo=pd.read_csv("/Users/Sanjeev/Desktop/Excel R Assignments/Simple Linear Regression/calories_consumed.csv")
calo.columns = ['wt','cal']
np.corrcoef(calo.wt,calo.cal)
plt.plot(calo.wt,calo.cal,"bo");plt.xlabel("WEIGHT");plt.ylabel("CALORIES")
# For preparing linear regression model we need to import the statsmodels.formula.api
import statsmodels.formula.api as smf
model = smf.ols("wt~cal",data=calo).fit()
# For getting coefficients of the variables used in equation
model.params
# P-values for the variables and R-squared value for prepared model
model.summary()
model.conf_int(0.05) # 95% confidence interval
pred = model.predict(pd.DataFrame(calo['wt']))
This throws up an error:
Traceback (most recent call last):
File "<ipython-input-43-4fcbf1ee1921>", line 1, in <module>
pred = model.predict(pd.DataFrame(calo['wt']))
File "/anaconda3/lib/python3.7/site-packages/statsmodels/base/model.py", line 837, in predict
exog = dmatrix(design_info, exog, return_type="dataframe")
File "/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 291, in dmatrix
NA_action, return_type)
File "/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 169, in _do_highlevel_design
return_type=return_type)
File "/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 888, in build_design_matrices
value, is_NA = _eval_factor(factor_info, data, NA_action)
File "/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 63, in _eval_factor
result = factor.eval(factor_info.state, data)
File "/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 566, in eval
data)
File "/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 551, in _eval
inner_namespace=inner_namespace)
File "/anaconda3/lib/python3.7/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
exec("raise new_exc from e")
File "<string>", line 1, in <module>
PatsyError: Error evaluating factor: NameError: name 'cal' is not defined
wt~cal
^^^
Need your help to resolve this.
Thanks in advance. :)
Looking at the statsmodels API here, it looks like they expect the parameters as input, rather than the covariates.
So what you probably want is
pred = model.predict(model.params)
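You need to pass predict the independent variable, the one on the right-hand side of the formula, from which the dependent variable (y) is predicted. Your formula is wt~cal, so predict expects a column named cal, not wt. In general:
model = statsmodels.formula.api.ols('y ~ x', data=df).fit()
model.predict(pd.DataFrame(df['x']))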
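Applied to the question's formula, a sketch could look like the following. The data here is synthetic (the numbers are made up and only stand in for calories_consumed.csv); the column names wt and cal are the ones from the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the calories data (values are made up).
rng = np.random.RandomState(0)
cal = rng.uniform(1500, 3500, size=20)
calo = pd.DataFrame({"cal": cal, "wt": 0.4 * cal + rng.normal(0, 50, size=20)})

model = smf.ols("wt ~ cal", data=calo).fit()

# The right-hand side of the formula is `cal`, so predict needs a `cal` column.
pred = model.predict(pd.DataFrame({"cal": calo["cal"]}))

# New observations are passed the same way.
new_pred = model.predict(pd.DataFrame({"cal": [2000, 2500]}))
print(new_pred)
```

Passing a frame that only has a wt column leaves cal undefined when the formula is evaluated, which is what the NameError in the traceback reports.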
I was having this problem. I was doing something like this:
for _, i in frame.iterrows():
    model.predict(i)
This doesn't give predict the column headers it needs. You have to wrap each row in a DataFrame:
for _, i in frame.iterrows():
    model.predict(pd.DataFrame([i]))

Numpy Error "Could not convert string to float: 'Illinois'"

I created the below table in Google Sheets and downloaded it as a CSV file.
My code is posted below. I'm really not sure where it's failing. I tried to highlight and run the code line by line and it keeps throwing that error.
# Data Preprocessing
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import Dataset
dataset = pd.read_csv('Data2.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 5].values
# Replace Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:5 ])
X[:, 1:6] = imputer.transform(X[:, 1:5])
The error I'm getting is:
Could not convert string to float: 'Illinois'
I also have this line above my error message
array = np.array(array, dtype=dtype, order=order, copy=copy)
It seems like my code is not able to read my GPA column which contains floats. Maybe I didn't create that column right and have to specify that they're floats?
*** I'm updating with the full error message:
[15]: runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
Traceback (most recent call last):
File "<ipython-input-15-5f895cf9ba62>", line 1, in <module>
runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py", line 16, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/imputation.py", line 155, in fit
force_all_finite=False)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'Illinois'
Actually, the full error you are getting is this (it would have helped tremendously had you pasted it in full):
Traceback (most recent call last):
File "<ipython-input-7-6a92ceaf227a>", line 8, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: Illinois
If you look carefully, it points out where the code is failing:
imputer = imputer.fit(X[:, 1:5 ])
The failure comes from trying to take the mean of a categorical variable (a column of strings such as 'Illinois'), which doesn't make sense.
This has already been asked and answered in another Stack Overflow thread.
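One way to avoid averaging a string column is to impute only the numeric columns. The sketch below uses a made-up frame in place of Data2.csv (the column names state, age and gpa are assumptions), and it uses SimpleImputer from sklearn.impute, which replaces the Imputer class in newer scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-in for Data2.csv; the column names are assumptions.
dataset = pd.DataFrame({
    "state": ["Illinois", "Ohio", "Illinois", "Texas"],
    "age": [21.0, np.nan, 23.0, 22.0],
    "gpa": [3.5, 3.0, np.nan, 4.0],
})

# Impute only the numeric columns; the string column is left untouched.
numeric_cols = dataset.select_dtypes(include="number").columns
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
dataset[numeric_cols] = imputer.fit_transform(dataset[numeric_cols])

print(dataset)
```

Selecting columns by dtype rather than by position also avoids slicing mistakes when the column order in the CSV changes.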
Replace the line:
dataset = pd.read_csv('Data2.csv')
with:
dataset = pd.read_csv('Data2.csv', delimiter=";")

Having Issues with an AssertionError when trying to use the psd() command in matplotlib

I'm trying to write a short script that takes a .csv file with some distance data and outputs the PSD (power spectral density) plot for it. The code is here:
import math
import matplotlib.pyplot as plt
name = raw_input('File:')
data = open(name + '.csv', 'r')
distances = []
for row in data:
    distances.append(row.replace("\n",""))
for i in range(len(distances)):
    distances[i] = float(distances[i])
Pxx, freqs = plt.psd(distances, NFFT=16,Fs=2,detrend='detrend_mean',window='window_none',noverlap=128,sides='onesided',scale_by_freq=True)
plot(Pxx,freqs)
plt.savefig(name + 'psd.png', bbox_inches = 'tight')
As you can see, it's pretty simple. The CSV file just features one column of numbers, so distances is a vector.
The error I'm getting is as follows:
Traceback (most recent call last):
File "C:psdplot.py", line 15, in <module>
Pxx, freqs = plt.psd(distances, NFFT=16,Fs=2,detrend='detrend_mean',window='window_none',noverlap=128,sides='onesided',scale_by_freq=True)
File "C:\Python27\lib\site-packages\matplotlib\pyplot.py", line 3029, in psd
sides=sides, scale_by_freq=scale_by_freq, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 8696, in psd
sides, scale_by_freq)
File "C:\Python27\lib\site-packages\matplotlib\mlab.py", line 389, in psd
scale_by_freq)
File "C:\Python27\lib\site-packages\matplotlib\mlab.py", line 423, in csd
noverlap, pad_to, sides, scale_by_freq)
File "C:\Python27\lib\site-packages\matplotlib\mlab.py", line 251, in _spectral_helper
assert(len(window) == NFFT)
AssertionError
Could someone direct me on how to fix this? I'm sure it's rather obvious, but I haven't been able to find anything on fixing it in this particular context.
Thanks in advance!
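Judging only from the traceback, a plausible cause is that window was given as the string 'window_none'. A string is iterable, so it gets treated as an array of window values, and its length (11) does not match NFFT=16, which is what the assertion checks. A hedged sketch that passes the functions from matplotlib.mlab instead, on made-up data; it calls mlab.psd directly (rather than plt.psd) so that no figure is needed, and it lowers noverlap to 8 on the assumption that the overlap has to be smaller than NFFT:

```python
import numpy as np
from matplotlib import mlab

# Made-up distance samples standing in for the CSV column.
rng = np.random.RandomState(0)
distances = rng.uniform(0.0, 10.0, size=256)

Pxx, freqs = mlab.psd(
    distances,
    NFFT=16,
    Fs=2,
    detrend=mlab.detrend_mean,   # a function, not the string 'detrend_mean'
    window=mlab.window_none,     # a function, not the string 'window_none'
    noverlap=8,                  # assumed value; must be smaller than NFFT
    sides="onesided",
    scale_by_freq=True,
)

print(len(freqs), Pxx.shape)
```

The same detrend and window arguments should carry over to plt.psd, and the later plot call in the script would need to be plt.plot(freqs, Pxx) to run.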

scikit-learn crashing on certain amounts of data

I'm trying to process a NumPy array with 71,000 rows of 200 columns of floats, and the two scikit-learn models I'm trying both give different errors when I exceed 5853 rows. I tried removing the problematic row, but it continues to fail. Can scikit-learn not handle this much data, or is it something else? X is a NumPy array built from a list of lists.
KNN:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
Error:
File "knn.py", line 48, in <module>
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
K-Means:
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
Error:
Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Please check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the lengths of the lines of X into an array:
lengths = [len(line) for line in X]
Then take a look to see whether all lines have the same length, by invoking
np.unique(lengths)
If there is more than one number in the output, then your line lengths differ, e.g. from line 5853 on, but possibly not in every line after that.
NumPy arrays are only useful if all lines have the same length (they continue to work if not, but don't do what you expect). You should find out what is causing this, correct it, and then return to KNN.
Here is an example of what happens if line lengths are not the same:
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 20)
# now remove one element from the 56th line
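X = list(X)
X[55] = X[55][:-1]
# turn it back into an ndarray; ragged rows need dtype=object
# (recent NumPy versions refuse to build a ragged array implicitly)
X = np.array(X, dtype=object)
# check the dtype
print(X.dtype)  # prints object, i.e. dtype('O')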
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X) # raises your first error
from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X) # raises your second error
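The length check described above can be extended to report which rows are the odd ones out. A minimal sketch on a hypothetical list of lists (the name rows and the sizes are made up); it assumes the most common row length is the intended one.

```python
import numpy as np

# Hypothetical list of lists in which row 3 is one element short.
rows = [[0.0] * 5 for _ in range(6)]
rows[3] = rows[3][:-1]

lengths = np.array([len(line) for line in rows])
unique_lengths = np.unique(lengths)

# Take the most common length as the expected one (an assumption).
values, counts = np.unique(lengths, return_counts=True)
expected = values[np.argmax(counts)]
bad_rows = np.flatnonzero(lengths != expected)

print(unique_lengths, bad_rows)
```

Running the check on the raw list of lists, before converting to an array, makes it easier to trace the short rows back to the input file.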
