ValueError: could not convert string to float: '2100 - 2850' - python

How can I solve this problem?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
train = pd.read_csv(r"G:\data_science\input\train.csv")
cat_columns = ['area_type','availability','location','size','society','bath','balcony']
for col in train.columns:
    if col in cat_columns:
        train[col] = train[col].astype('category')
        train[col] = train[col].cat.codes
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
test = pd.read_csv(r"G:\data_science\input\test.csv")
y_train = train['price']
x_train = train.drop('price', axis = 1)
y_test = test['price']
x_test = test.drop('price',axis = 1)
model = LinearRegression()
model.fit(x_train, y_train)
prediction = model.predict(x_test)
prediction

pandas reads a column as strings (dtype object) whenever some of its values cannot be parsed as numbers.
When you fit the model, scikit-learn tries to convert every value in x_train to float.
One of the values it encounters is '2100 - 2850'.
That string is a range, not a number, so it cannot be converted to float, and that is what the error is saying.
Check the dataset and clean or remove any such values before fitting.
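As a minimal sketch of such a cleanup, assuming the offending column holds ranges like '2100 - 2850' that can reasonably be replaced by their midpoint. The midpoint rule, the column name total_sqft, the sample values and the helper to_number are all illustrative assumptions, not taken from the original post:

```python
import pandas as pd

def to_number(value):
    """Return a float; average 'low - high' ranges; NaN if unparseable."""
    parts = [p.strip() for p in str(value).split(" - ")]
    try:
        numbers = [float(p) for p in parts]
    except ValueError:
        # Anything that is not a number or a range becomes NaN.
        return float("nan")
    return sum(numbers) / len(numbers)

# Hypothetical column with a clean value, a range, and an unparseable value.
df = pd.DataFrame({"total_sqft": ["1056", "2100 - 2850", "34.46Sq. Meter"]})
df["total_sqft"] = df["total_sqft"].apply(to_number)
print(df)
```

Rows that end up as NaN still need a decision (drop them or impute them) before calling fit, since LinearRegression does not accept missing values.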

Related

y should be a 1d array, got an array of shape (2603, 2) instead

import numpy as np
import pandas as pd
import matplotlib
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
train = pd.read_csv("train_final.csv")
y = train['Y']
YValues = []
for x in range(len(y)):
    YValues.append(y[x])
print(YValues)
print(type(YValues))
YVal = np.array(YValues)
train = train.drop(['Y'], axis=1)
test = pd.read_csv("test_final.csv")
dtrain = xgb.DMatrix(train, label = y)
dtest = xgb.DMatrix(test)
xgb2_hyperparams = XGBClassifier()
xgb2_hyperparams = xgb2.predict_proba(test)
xgb2_hyperparams_test = xgb2.predict_proba(train)
print('Accuracy: ', roc_auc_score(YVal, xgb2_hyperparams_test))
np.savetxt("xgboostHyperParams.csv", xgb2_hyperparams, delimiter=",")
print(xgb2_hyperparams)
I explicitly created YVal as a 1D NumPy array, but the error still says it is an array of shape (2603, 2). I originally tried fiddling with y, but that led to more errors. I'm not sure why Python insists on the (2603, 2) shape regardless of whether the data is a Series, an ndarray, or an array. What am I missing?
You have to convert it to a NumPy array. A pandas Series is essentially a container around your data, and .values returns the underlying array. Try:
import numpy as np
import pandas as pd
import matplotlib
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
train = pd.read_csv("train_final.csv")
y = train['Y'].values
YValues = []
for x in range(len(y)):
    YValues.append(y[x])
print(YValues)
print(type(YValues))
YVal = np.array(YValues)
train = train.drop(['Y'], axis=1).values
test = pd.read_csv("test_final.csv")
dtrain = xgb.DMatrix(train, label = y)
dtest = xgb.DMatrix(test)
xgb2_hyperparams = XGBClassifier()
xgb2_hyperparams = xgb2.predict_proba(test)
xgb2_hyperparams_test = xgb2.predict_proba(train)
print('Accuracy: ', roc_auc_score(YVal, xgb2_hyperparams_test))
np.savetxt("xgboostHyperParams.csv", xgb2_hyperparams, delimiter=",")
print(xgb2_hyperparams)
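One more thing worth checking: the shape (2603, 2) in the message matches what predict_proba returns for a binary classifier (one column per class), so the complaint may be about the scores passed to roc_auc_score rather than about YVal. The sketch below illustrates this with synthetic data and uses LogisticRegression as a stand-in for XGBClassifier; both are assumptions made only to keep the example self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary problem (illustrative data, not from the post).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)  # stand-in for the XGBClassifier
proba = clf.predict_proba(X)
print(proba.shape)  # two columns: P(class 0), P(class 1)

# For a binary target, pass only the positive-class column as the score.
auc = roc_auc_score(y, proba[:, 1])
print(auc)
```

In the code above, that would mean scoring with xgb2_hyperparams_test[:, 1] instead of the full two-column array.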

Using Python to build a linear regression model and find R2'd; cannot get the model to fit or predict

Some imports for several reasons
import pandas as pd
import numpy as np
I successfully split the data into test (30%) and train (70%) sets and separated the features from the target:
X_train = df_train.drop(columns='Rating')
y_train = df_train.Rating
from sklearn.linear_model import LinearRegression
X_test = df_test.drop(columns='Rating')
y_test = df_test.Rating
Everything is fine to this point, then
linreg = LinearRegression()
linreg.fit(X_train, y_train)
ValueError: could not convert string to float: 'GAME'
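I am positive the Rating column is a float.
Check the first row of your df; the header may be repeated there. If so, drop that row, or train from the second row onward. Also note that the message refers to a value that fit() could not convert, so the string 'GAME' may be sitting in one of the feature columns of X_train rather than in Rating.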
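A quick way to find where a string such as 'GAME' comes from is to list the non-numeric columns of X_train. The sketch below does that and then one-hot encodes them with pd.get_dummies, which is one common option rather than the only one. The column names Category and Reviews and the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical training frame; only 'Rating' is named in the question.
df_train = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Reviews": [120, 45, 300],
    "Rating": [4.5, 3.9, 4.1],
})
X_train = df_train.drop(columns="Rating")

# Columns that LinearRegression cannot convert to float as they are.
non_numeric = X_train.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# One-hot encode them so that every feature is numeric.
X_encoded = pd.get_dummies(X_train, columns=non_numeric, dtype=float)
print(X_encoded.columns.tolist())
```

The same encoding has to be applied to X_test so that both frames end up with identical columns.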

Simple Linear Regression using Sklearn. Fit() is not working

I am using this dataset:
https://filebin.net/wr2jy0ass7rsl0vt
There are three columns: "Date", "Temperature", "Anomaly". I use "Date" to predict "Temperature". The code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
data_df = pd.read_csv("ave_yearly_temp_nyc_1895-2017.csv")
data_df.columns= ["Date","Temperature","Anomaly"]
data_df["Date"] = data_df["Date"]//100
regressor = LinearRegression()
X_train,X_test, y_train,y_test = train_test_split(data_df.iloc[:,0],data_df.iloc[:,1],test_size=0.2, random_state=0)
regressor.fit(X_train,y_train) #training the algorithm
The data_df:
The error:
How can I fix it?
It needs a 2D array; with iloc[:,0] you are getting a 1D array.
Instead, select the column with double brackets so that X stays a one-column DataFrame, which is 2D.
Try using:
X_train, X_test, y_train, y_test = train_test_split(data_df[['Date']], data_df['Temperature'], test_size=0.2, random_state=0)
Try to do what the error message tells you. The implementation expects X to be a 2D array of shape (n_samples, n_features), even when there is only one feature. Hence you'll need to reshape it like this:
X_train, X_test, y_train, y_test = train_test_split(np.array(data_df.iloc[:,0]).reshape(-1, 1), data_df.iloc[:,1], test_size=0.2, random_state=0)
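To see the difference between the 1D and 2D selections, here is a small sketch that prints the shapes and fits on the 2D version. The four rows are made-up values for illustration and do not come from the linked dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up rows standing in for the real CSV.
data_df = pd.DataFrame({
    "Date": [1895, 1896, 1897, 1898],
    "Temperature": [50.0, 50.5, 51.0, 51.5],
})

X_1d = data_df["Date"]                               # Series, 1D
X_2d = data_df[["Date"]]                             # DataFrame, 2D
X_np = data_df["Date"].to_numpy().reshape(-1, 1)     # ndarray, 2D
print(X_1d.shape, X_2d.shape, X_np.shape)

# Either 2D form is accepted by fit(); the 1D Series is not.
regressor = LinearRegression().fit(X_2d, data_df["Temperature"])
print(regressor.coef_)
```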

How to get predicted values along with test data, and visualize actual vs predicted?

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
data = pd.read_csv('student_selection.csv')
x = data[['Average','Pass','Division','Domicile']]
y = data[['Selected']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1, random_state=0)
ppn = Perceptron(eta0=1.0, fit_intercept=True, max_iter=1000, n_iter_no_change=5, random_state=0)
ppn.fit(x_train, y_train)
y_pred = ppn.predict(x_train)
x_train['Predicted'] = pd.Series(y_pred)
How can I see the actual vs. predicted values as a table and also as a plot? I added the predictions to x_train as a 'Predicted' column, but I am unable to merge them with the actual data to see the deviation.
How to see the actual vs predicted as a table and along with a plot?
Just run:
y_predict = ppn.predict(x)
data['y_predict'] = y_predict
This adds the predictions as a column in your dataframe. If you want to plot it, you can use:
import matplotlib.pyplot as plt
plt.scatter(data['Selected'], data['y_predict'])
plt.show()
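If you would rather keep the predictions next to the training rows only, note that pd.Series(y_pred) gets a fresh 0..n-1 index, while x_train keeps the shuffled index from train_test_split. Assignment aligns on the index, so the values can end up as NaN. A minimal sketch of the problem and one way around it, using small hand-made stand-ins for x_train, y_train and the output of ppn.predict (all illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Stand-in for a shuffled training split: note the non-sequential index.
x_train = pd.DataFrame({"Average": [71, 55, 88]}, index=[7, 4, 5])
y_train = pd.Series([1, 0, 1], index=x_train.index, name="Selected")
y_pred = np.array([1, 1, 1])  # stand-in for ppn.predict(x_train)

# Misaligned: the new Series is indexed 0, 1, 2, so nothing matches.
broken = x_train.assign(Predicted=pd.Series(y_pred))

# Aligned: reuse the index of x_train for the predictions.
comparison = x_train.copy()
comparison["Actual"] = y_train
comparison["Predicted"] = pd.Series(y_pred, index=x_train.index)
comparison["Match"] = comparison["Actual"] == comparison["Predicted"]
print(comparison)
```

Assigning the raw array (comparison["Predicted"] = y_pred) also works, because a NumPy array is placed by position rather than by index.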

Expected 2D array, got 1D array instead, any solution?

I'm new to machine learning and I am trying to predict stock prices 30 days ahead.
This is my code:
import pandas as pd
import matplotlib.pyplot as plt
import pymysql as MySQLdb
import numpy as np
import sqlalchemy
import datetime
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
forecast_out = int(30)
df['Prediction'] = df[['LastPrice']].shift(-forecast_out)
df['Prediction'].fillna(0)
X = np.array(df['Prediction'].fillna(0))
X = preprocessing.scale(X)
X_forecast = X[-forecast_out:]
X = X[:-forecast_out]
y = np.array(df['Prediction'].fillna(0))
y = y[:-forecast_out]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
X_train, X_test, y_train, y_test.reshape(-1,1)
# Training
clf = LinearRegression()
clf.fit(X_train,y_train)
# Testing
confidence = clf.score(X_test, y_test)
print("confidence: ", confidence)
forecast_prediction = clf.predict(X_forecast)
print(forecast_prediction)
I got this error:
ValueError: Expected 2D array, got 1D array instead:
array=[-0.46939923 -0.47076913 -0.47004993 ... -0.42782272 3.07433019 -0.46573474].
Reshape your data either using
array.reshape(-1, 1) if your data has a single feature
or
array.reshape(1, -1) if it contains a single sample.
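It's expecting a 2D array, but you're only passing in a 1D array. One way to solve this is to put another set of brackets around the argument where you're getting the problem. Note that this turns the list into a single row (one sample), which matches array.reshape(1, -1) in the message; if your data is a single feature with many samples, use array.reshape(-1, 1) instead. For example: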
x = [1,2,3,4]
Foo(x)
If that throws the error, you could just do
Foo([x])
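The two reshapes named in the error message give different layouts, which the short sketch below makes visible. The variable names and the value forecast_out = 2 are chosen only for illustration:

```python
import numpy as np

x = [1, 2, 3, 4]

# Extra brackets: one row with four columns (a single sample).
as_single_sample = np.array([x])
print(as_single_sample.shape)

# reshape(-1, 1): four rows with one column (a single feature).
as_single_feature = np.array(x).reshape(-1, 1)
print(as_single_feature.shape)

# Slicing rows afterwards keeps the data 2D, as fit() and predict() expect.
forecast_out = 2
X_forecast = as_single_feature[-forecast_out:]
X = as_single_feature[:-forecast_out]
print(X.shape, X_forecast.shape)
```

For the stock example in the question, where X is one price series, the second form is the likely fit: reshape X once, right after scaling, and the later slices and train_test_split will keep it 2D.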
