2D array error in python using scikitlearn package

I ran the following code in PyCharm, but I keep getting the error shown below:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
import matplotlib.pyplot as plt
df=pd.read_csv(r"C:\Users\gmcks\Downloads\Data samples\homeprices.csv")
df
Data: https://docs.google.com/spreadsheets/d/1wxaadKAHTZtECv6gW6Mpreq3tFb2PWgVOhqANbWlIAk/edit?usp=sharing
x=df[["area"]]
y=df.price
reg=linear_model.LinearRegression()
reg.fit(x,y)
LinearRegression()
m=reg.coef_
c=reg.intercept_
print(m,c)
reg.predict(2000)
ERROR :
Traceback (most recent call last):
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3319, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-30-b5b06b1b028e>", line 1, in <module>
reg.predict(2000)
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\sklearn\linear_model\_base.py", line 236, in predict
return self._decision_function(X)
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\sklearn\linear_model\_base.py", line 218, in _decision_function
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
return f(**kwargs)
File "C:\Users\gmcks\PycharmProjects\using jupyter.py\venv\lib\site-packages\sklearn\utils\validation.py", line 616, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got scalar array instead:
array=2000.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Why do I have to reshape my data when I have already written df[["area"]]? That expression produces an input of shape (5,1), so a 2D array has already been created.

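df[["area"]] only makes the training input 2D. predict checks its own argument separately, and a bare 2000 is a scalar, not a 2D array. You need to provide input with the same layout as your predictor (rows are samples, columns are features):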
from sklearn import linear_model
import numpy as np
import pandas as pd
np.random.seed(111)
df = pd.DataFrame({'x' : np.random.uniform(0,1,100),
'y' : np.random.uniform(0,1,100)})
reg=linear_model.LinearRegression()
reg.fit(df[["x"]],df['y'])
Then pass the value to predict as a 2D array with one row and one column:
reg.predict([[2000]])
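
As a minimal sketch of the same idea with an "area" column like the one in the question: the numbers below are made up for illustration and are not taken from homeprices.csv. It shows three equivalent ways to hand a single sample to predict, all of shape (1, 1).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only (not the asker's CSV).
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "price": [550000, 565000, 610000, 680000, 725000],
})

reg = LinearRegression()
reg.fit(df[["area"]], df["price"])  # X has shape (5, 1), y has shape (5,)

# Three ways to pass one sample with one feature, each of shape (1, 1).
pred_list = reg.predict([[2000]])
pred_reshaped = reg.predict(np.array([2000]).reshape(-1, 1))
pred_frame = reg.predict(pd.DataFrame({"area": [2000]}))

print(pred_list, pred_reshaped, pred_frame)
```

Because the model was fitted on a DataFrame, recent scikit-learn versions may warn about missing feature names for the first two calls; passing a DataFrame with the same column name avoids that warning.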

Related

ML Code throws value error when transforming data

Data source can be found here.
Hello all,
I've hit a stumbling block in some code I'm writing because the fit_transform method continuously fails. It throws this error:
Traceback (most recent call last):
File "/home/user/Datasets/CSVs/Working/Playstore/untitled0.py", line 18, in <module>
data = data[oh_cols].apply(oh.fit_transform)
File "/usr/lib/python3.8/site-packages/pandas/core/frame.py", line 7547, in apply
return op.get_result()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
return self.apply_standard()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 255, in apply_standard
results, res_index = self.apply_series_generator()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 284, in apply_series_generator
results[i] = self.f(v)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 410, in fit_transform
return super().fit_transform(X, y)
File "/usr/lib/python3.8/site-packages/sklearn/base.py", line 690, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
self._fit(X, handle_unknown=self.handle_unknown)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
X_list, n_samples, n_features = self._check_X(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
X_temp = check_array(X, dtype=None)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 620, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=['Everyone' 'Everyone' 'Everyone' ... 'Everyone' 'Mature 17+' 'Everyone'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
In short:
ValueError: Expected 2D array, got 1D array instead:
I've done some searching on this online and arrived at a few potential solutions, but they didn't seem to work.
Here's my code:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from category_encoders import CatBoostEncoder,CountEncoder,TargetEncoder
data = pd.read_csv("/home/user/Datasets/CSVs/Working/Playstore/data.csv")
oh = OneHotEncoder()
cb = CatBoostEncoder()
ce = CountEncoder()
te = TargetEncoder()
obj = [i for i in data if data[i].dtypes=="object"]
unique = dict(zip(list(obj),[len(data[i].unique()) for i in obj]))
oh_cols = [i for i in unique if unique[i] < 100]
te_cols = [i for i in unique if unique[i] > 100]
data = data[oh_cols].apply(oh.fit_transform)
It throws the aforementioned error. A solution I saw advised me to use .values when transforming the data and I tried the following:
data = data[oh_cols].values.apply(oh.fit_transform)
data = data[oh_cols].apply(oh.fit_transform).values
encoding = np.array(data[oh_cols])
encoding.apply(oh.fit_transform)
The first and the third threw the same error, shown below:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
While the second threw the first error I mentioned again:
ValueError: Expected 2D array, got 1D array instead:
I'm honestly stumped and I'm not sure where to go from here. The Kaggle exercise I learnt this from went smoothly, but for some reason things never do when I try my hand at things myself.

The fix
data_enc = oh.fit_transform(data[oh_cols])
This is much better than the apply approach anyway. The fitted object oh now holds useful information for inspecting the results, and you can later call oh.transform on your test data.
Explaining the errors
Your data is in a pandas DataFrame object. The pandas function apply calls oh.fit_transform once per column and passes each column as a 1D Series, but OneHotEncoder expects a 2D input.
Using .values or np.array() converts your DataFrame to a numpy array, and numpy arrays have no apply method.
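
A small sketch of the fix on a toy frame. The column names and values here are invented for illustration (the real Play Store columns may differ). The encoder is fitted once on the whole 2D selection and then reused on new rows.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for data[oh_cols]; column names are hypothetical.
train = pd.DataFrame({
    "Content Rating": ["Everyone", "Teen", "Everyone", "Mature 17+"],
    "Type": ["Free", "Paid", "Free", "Free"],
})

oh = OneHotEncoder(handle_unknown="ignore")

# Fit on the 2D DataFrame in one call; the result is a sparse matrix.
train_enc = oh.fit_transform(train).toarray()

# The fitted encoder remembers the categories of every column.
print(oh.categories_)

# Reuse the same encoder on unseen rows.
test = pd.DataFrame({"Content Rating": ["Teen"], "Type": ["Paid"]})
test_enc = oh.transform(test).toarray()
print(train_enc.shape, test_enc)
```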

Numpy Error "Could not convert string to float: 'Illinois'"

I created the below table in Google Sheets and downloaded it as a CSV file.
My code is posted below. I'm really not sure where it's failing. I tried to highlight and run the code line by line and it keeps throwing that error.
# Data Preprocessing
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import Dataset
dataset = pd.read_csv('Data2.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 5].values
# Replace Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:5 ])
X[:, 1:6] = imputer.transform(X[:, 1:5])
The error I'm getting is:
Could not convert string to float: 'Illinois'
I also have this line above my error message
array = np.array(array, dtype=dtype, order=order, copy=copy)
It seems like my code is not able to read my GPA column which contains floats. Maybe I didn't create that column right and have to specify that they're floats?
*** I'm updating with the full error message:
[15]: runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
Traceback (most recent call last):
File "<ipython-input-15-5f895cf9ba62>", line 1, in <module>
runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py", line 16, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/imputation.py", line 155, in fit
force_all_finite=False)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'Illinois'

Actually the full error you are getting is this (which would help tremendously if you pasted it in full):
Traceback (most recent call last):
File "<ipython-input-7-6a92ceaf227a>", line 8, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: Illinois
If you look carefully, it points out where the code is failing:
imputer = imputer.fit(X[:, 1:5 ])
The slice passed to the imputer includes a categorical column (the one containing 'Illinois'), and taking the mean of a categorical variable doesn't make sense.
This has already been asked and answered in this StackOverflow thread.
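
A hedged sketch of imputing only the numeric columns. The frame below is invented (the real Data2.csv is not shown in the question), and it uses sklearn.impute.SimpleImputer, which replaced the Imputer class that was removed in scikit-learn 0.22.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented stand-in for Data2.csv; column names are hypothetical.
dataset = pd.DataFrame({
    "State": ["Illinois", "Ohio", "Texas", "Ohio"],
    "GPA": [3.2, np.nan, 3.8, 3.5],
    "Age": [20, 22, np.nan, 24],
})

# Select only numeric columns so strings never reach the imputer.
numeric_cols = dataset.select_dtypes(include="number").columns

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
dataset[numeric_cols] = imputer.fit_transform(dataset[numeric_cols])

print(dataset)
```

Selecting columns by dtype rather than by position avoids off-by-one slices that accidentally include a text column.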

Replace the line:
dataset = pd.read_csv('Data2.csv')
with:
dataset = pd.read_csv('Data2.csv', delimiter=";")

sci-kit learn crashing on certain amounts of data

I'm trying to process a numpy array with 71,000 rows of 200 columns of floats and the two sci-kit learn models I'm trying both give different errors when I exceed 5853 rows. I tried removing the problematic row, but it continues to fail. Can sci-kit learn not handle this much data, or is it something else? The X is numpy array of a list of lists.
KNN:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
Error:
File "knn.py", line 48, in <module>
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
K-Means:
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
Error:
Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

Please check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the lengths of the lines of X into an array:
lengths = [len(line) for line in X]
Then take a look to see whether all lines have the same length, by invoking
np.unique(lengths)
If there is more than one number in the output, then your rows have different lengths, e.g. from row 5853 on, but possibly not in every row after that.
Numpy arrays are only useful as data matrices if all rows have the same length (with ragged rows you get a 1D array of objects rather than a 2D array of floats, which does not do what you expect). You should find out what is causing this, correct it, and then return to knn.
Here is an example of what happens if line lengths are not the same:
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 20)
# now remove one element from the 56th line
X = list(X)
X[55] = X[55][:-1]
# turn it back into an ndarray (recent NumPy versions need dtype=object for ragged rows)
X = np.array(X, dtype=object)
# check the dtype
print(X.dtype) # prints object
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X) # raises your first error
from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X) # raises your second error
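
The length check described in this answer can be sketched as follows. The rows are a tiny made-up example, and treating the most common length as the expected one is an assumption for illustration, not something the answer prescribes.

```python
import numpy as np

# Tiny made-up example: the second row is one element short.
rows = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0]]

lengths = np.array([len(row) for row in rows])
unique_lengths = np.unique(lengths)

# Assumption: the most common length is the intended one.
expected = np.bincount(lengths).argmax()
bad_rows = np.flatnonzero(lengths != expected)

print(unique_lengths)  # more than one value means ragged rows
print(bad_rows)        # indices of the rows to inspect
```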

Plotting data from csv using matplotlib.pyplot

I am trying to follow a tutorial on youtube, now in the tutorial they plot some standard text files using matplotlib.pyplot, I can achieve this easy enough, however I am now trying to perform the same thing using some csvs I have of real data.
The code I am using is:
import matplotlib.pyplot as plt
import csv
#import numpy as np
with open(r"Example RFI regression axis\Delta RFI.csv") as x, open(r"Example RFI regression axis\strikerate.csv") as y:
    readx = csv.reader(x)
    ready = csv.reader(y)
plt.plot(readx,ready)
plt.title ('Test graph')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()
The traceback I receive is long
Traceback (most recent call last):
File "C:\V4 code snippets\matplotlib_test.py", line 11, in <module>
plt.plot(readx,ready)
File "C:\Python27\lib\site-packages\matplotlib\pyplot.py", line 2832, in plot
ret = ax.plot(*args, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 3997, in plot
self.add_line(line)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 1507, in add_line
self._update_line_limits(line)
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 1516, in _update_line_limits
path = line.get_path()
File "C:\Python27\lib\site-packages\matplotlib\lines.py", line 677, in get_path
self.recache()
File "C:\Python27\lib\site-packages\matplotlib\lines.py", line 401, in recache
x = np.asarray(xconv, np.float_)
File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 320, in asarray
return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
Please advise what I need to do, I realise this is probably very easy to most seasoned coders. Kind regards SMNALLY

csv.reader() returns a reader object that yields strings (technically, each iteration of the reader gives a list of strings). plt.plot() cannot convert the reader object itself to numbers, so you have to read the rows and convert them to float or int before plotting.
To save the trouble of converting, I suggest using genfromtxt() from numpy. (http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html)
For example, there are two files:
data1.csv:
data1
2
3
4
3
6
6
4
and data2.csv:
data2
92
73
64
53
16
26
74
Both of them have one line of header. We can do:
import numpy as np
import matplotlib.pyplot as plt
data1=np.genfromtxt('data1.csv', skip_header=1) #suppose it is in the current working directory
data2=np.genfromtxt('data2.csv', skip_header=1)
plt.plot(data1, data2,'o-')
plt.show()
This plots data2 against data1 as circle markers connected by lines.
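
If you would rather stay with the csv module, a minimal sketch of the manual conversion might look like this. The in-memory strings stand in for the two files and the helper name read_column is made up for illustration.

```python
import csv
import io

# In-memory stand-ins for two single-column CSV files with a header line.
file_x = io.StringIO("data1\n2\n3\n4\n")
file_y = io.StringIO("data2\n92\n73\n64\n")

def read_column(handle):
    """Read the first column of a CSV file as floats, skipping the header."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return [float(row[0]) for row in reader if row]

xs = read_column(file_x)
ys = read_column(file_y)
print(xs, ys)
```

The resulting lists of floats can be passed straight to plt.plot(xs, ys).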

Calculate logistic regression in python

I am trying to fit a logistic regression. My data is in a CSV file
that looks like this:
node_id,second_major,gender,major_index,year,dorm,high_school,student_fac
0,0,2,257,2007,111,2849,1
1,0,2,271,2005,0,51195,2
2,0,2,269,2007,0,21462,1
3,269,1,245,2008,111,2597,1
..........................
This is my code:
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("Reed98.csv")
print df.describe()
dummy_ranks = pd.get_dummies(df['second_major'], prefix='second_major')
cols_to_keep = ['second_major', 'dorm', 'high_school']
data = df[cols_to_keep].join(dummy_ranks.ix[:, 'year':])
train_cols = data.columns[1:]
# Index([gre, gpa, prestige_2, prestige_3, prestige_4], dtype=object)
logit = sm.Logit(data['second_major'], data[train_cols])
result = logit.fit()
print result.summary()
When I run the code in Python, I get this error:
Traceback (most recent call last):
File "D:\project\logisticregression.py", line 24, in <module>
result = logit.fit()
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6- win32.egg\statsmodels\discrete\discrete_model.py", line 282, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6- win32.egg\statsmodels\discrete\discrete_model.py", line 233, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6- win32.egg\statsmodels\base\model.py", line 291, in fit
hess=hess)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\base\model.py", line 341, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 328, in solve
raise LinAlgError('Singular matrix')
LinAlgError: Singular matrix
How should I rewrite the code?

There's nothing wrong with your code as such. My guess is that you have missing values in your data. Try calling dropna on the data first, or pass missing='drop' to Logit. You might also check that the right-hand side is full rank: np.linalg.matrix_rank(data[train_cols].values) should equal the number of columns in train_cols.
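
To illustrate the rank check, here is a sketch on invented data (not Reed98.csv). It shows one common way a design matrix becomes singular: a constant column together with a full set of dummy columns. Whether that is the cause in your case is only a guess.

```python
import numpy as np
import pandas as pd

# Invented data for illustration only.
df = pd.DataFrame({
    "group": ["a", "b", "c", "a", "b", "c"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
})

# Constant column + all three dummies: the dummies sum to the constant.
dummies = pd.get_dummies(df["group"], prefix="group").astype(float)
X_full = np.column_stack([np.ones(len(df)), dummies.values, df["x"].values])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one dummy level removes the exact collinearity.
reduced = pd.get_dummies(df["group"], prefix="group", drop_first=True).astype(float)
X_reduced = np.column_stack([np.ones(len(df)), reduced.values, df["x"].values])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # rank below column count: singular
print(X_reduced.shape[1], rank_reduced)  # rank equals column count: full rank
```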
