Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value-
Final.Pred)/Final.median_house_value
Here is my dataset: https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try to reshape my 1-D data into a 2-D array as follows, I cannot:
```python
import numpy as np
x1_train = x1_train.reshape(-1, 1)
x1_test = x1_test.reshape(-1, 1)
```
This is the error I get:
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so a bit of explanation would be really helpful.
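x1_train and x1_test are pandas Series objects, whereas reshape() is a method of NumPy arrays. scikit-learn also expects the features as a 2-D array of shape (n_samples, n_features), which is why fitting on the 1-D Series failed.
Convert each Series to a NumPy array first, then reshape:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)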
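As a rough sketch of the whole single-feature workflow (fit, then predict on the test set), the snippet below uses a small made-up DataFrame in place of the California housing file; the generated values are assumptions for illustration only. Selecting the column with double brackets keeps it as a 2-D DataFrame, which is an alternative to calling to_numpy().reshape(-1, 1).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in for the housing data (illustration only).
rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=50)
df = pd.DataFrame({
    "median_income": income,
    "median_house_value": 40000.0 * income + rng.normal(0.0, 5000.0, size=50),
})

x = df[["median_income"]]      # double brackets -> DataFrame, shape (50, 1)
y = df["median_house_value"]   # the target stays 1-D
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=0
)

model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)  # one prediction per test row

print(x_train.shape, x_test.shape)
print(y_pred.shape)
```

From here, plotting x_test against both y_test and y_pred (for example with matplotlib) shows how well the fitted line matches the test data.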
First, some imports:
import pandas as pd
import numpy as np
I successfully split the data into test (30%) and train (70%) sets and separated the features from the target:
X_train = df_train.drop(columns='Rating')
y_train = df_train.Rating
from sklearn.linear_model import LinearRegression
X_test = df_test.drop(columns='Rating')
y_test = df_test.Rating
Everything is fine up to this point, then:
linreg = LinearRegression()
linreg.fit(X_train, y_train)
ValueError: could not convert string to float: 'GAME'
I am positive the Rating column is a float.
Check the first row of your DataFrame; the header might be repeated there. If so, drop that row, or train from the second row onward.
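One way to track down where the string comes from is sketched below on a made-up DataFrame (the column names Category and Reviews are hypothetical, not taken from the question). The error is raised while scikit-learn converts the inputs to floats, so listing the non-numeric columns of X_train will usually point at the culprit.

```python
import pandas as pd

# Hypothetical training frame; 'Category' holds strings such as 'GAME'.
df_train = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Reviews": [120, 45, 300],
    "Rating": [4.1, 3.8, 4.5],
})
X_train = df_train.drop(columns="Rating")

# Columns whose dtype is not numeric cannot be fed to LinearRegression as-is.
non_numeric = X_train.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Simplest option for a first fit: keep only the numeric feature columns.
X_numeric = X_train.select_dtypes(include="number")
print(X_numeric.columns.tolist())
```

Any column reported this way would need to be encoded as numbers (or dropped) before fitting.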
I am attempting to use MultinomialNB from sklearn to classify some data. I have made a sample csv with some labelled training data, which I want to use to train the model but I receive the following error message:
ValueError: Expected 2D array, got 1D array instead: array=[0 1 2 2].
I know it is a very small data set but I will eventually add more data once the code is working.
Here is my data:
Here is my code:
import numpy as np
import pandas as pd
import array as array
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
data_file = pd.read_csv("CSV_Labels.csv", engine='python')
data_file.tail()
vectorizer = CountVectorizer(stop_words='english')
all_features = vectorizer.fit_transform(data_file.Word)
all_features.shape
x_train = data_file.label
y_train = data_file.Word
x_train.values.reshape(1, -1)
y_train.values.reshape(1, -1)
classifer = MultinomialNB()
classifer.fit(x_train, y_train)
Try this:
x_train = x_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
NumPy reshape operations are not in place, so the arrays you're passing to the classifier still have their old shapes.
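A minimal sketch of that behaviour, using the four labels from the error message (the variable names are only for illustration): calling reshape without assigning the result leaves the original array untouched.

```python
import numpy as np

labels = np.array([0, 1, 2, 2])

labels.reshape(-1, 1)              # result is discarded; labels is unchanged
shape_before = labels.shape

labels_2d = labels.reshape(-1, 1)  # keep the returned array instead
shape_after = labels_2d.shape

print(shape_before)
print(shape_after)
```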
I have used scikit-learn in Python for prediction. When I import the built-in dataset with
from sklearn import datasets and store the result in iris = datasets.load_iris(), training the model works fine.
iris = pandas.read_csv(r"E:\scikit\sampleTestingCSVInput.csv")
iris_header = ["Sepal_Length","Sepal_Width","Petal_Length","Petal_Width"]
Model algorithm:
model = SVC(gamma='scale')
model.fit(iris.data, iris.target_names[iris.target])
But when I import a CSV file to train the model, and also create a new array for target_names, I get an error like:
ValueError: Found input variables with inconsistent numbers of samples: [150, 4]
My CSV file has 5 columns: 4 input columns and 1 output column. I need to fit the model to that output column.
How do I provide the arguments to fit?
Could anyone share a code sample that imports a CSV file and fits an SVM model with sklearn in Python?
Since the question was not very clear to begin with and attempts to clarify it were going nowhere, I decided to download the dataset and try it myself. To make sure we are working with the same dataset: iris.head() should show something similar to mine; a few names and values might differ, but the overall structure will be the same.
Now the first four columns are features and the fifth one is target/output.
Now you will need your X and Y as NumPy arrays. To get them, use
X = iris[['sepal length:','sepal Width:','petal length','petal width']].values
Y = iris['Target'].values  # single brackets: 1-D array of labels
Since Y is categorical data, encode its labels as integers using sklearn's LabelEncoder (this is label encoding, not one-hot encoding), and scale the input X. To do that, use
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)
X = StandardScaler().fit_transform(X)
To keep with the norm of separate train and test data, split the dataset using
X_train , X_test, y_train, y_test = train_test_split(X,Y)
Now just train it on your model using X_train and y_train
clf = SVC(C=1.0, kernel='rbf').fit(X_train,y_train)
After this you can use the test data to evaluate the model and tune the value of C as you wish.
Edit: Just in case you don't know where the functions are, here are the import statements:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
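Putting the steps above together, here is a self-contained sketch. It builds the DataFrame from sklearn's bundled iris data as a stand-in for pd.read_csv(...), and the feature column names f1 to f4 are hypothetical; only the Target name follows the answer. The [150, 4] in the original error presumably means that the second argument to fit had 4 entries (such as a list of header names) instead of one label per row.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

# Stand-in for pd.read_csv(...): four feature columns plus a string 'Target'.
raw = load_iris()
iris = pd.DataFrame(raw.data, columns=["f1", "f2", "f3", "f4"])  # hypothetical names
iris["Target"] = raw.target_names[raw.target]

X = iris[["f1", "f2", "f3", "f4"]].values  # shape (150, 4)
Y = iris["Target"].values                  # shape (150,): one label per row

Y = LabelEncoder().fit_transform(Y)        # strings -> integers 0, 1, 2
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
clf = SVC(C=1.0, kernel="rbf").fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)       # mean accuracy on the held-out rows
print(X.shape, Y.shape, accuracy)
```

The key point is that X and Y must have the same number of rows before they reach fit.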
I'm relatively new to using sklearn and python for data analysis and am trying to run some linear regression on a dataset that I loaded from a .csv file.
I have loaded my data into train_test_split without any issues, but when I try to fit my training data I receive an error: ValueError: Expected 2D array, got 1D array instead: ... Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Error at model = lm.fit(X_train, y_train)
Because I am new to these packages, I'm trying to determine whether this is the result of not converting my imported CSV to a pandas DataFrame before running the regression, or whether it has to do with something else.
My CSV is in the format of:
Month,Date,Day of Week,Growth,Sunlight,Plants
7,7/1/17,Saturday,44,611,26
7,7/2/17,Sunday,30,507,14
7,7/5/17,Wednesday,55,994,25
7,7/6/17,Thursday,50,1014,23
7,7/7/17,Friday,78,850,49
7,7/8/17,Saturday,81,551,50
7,7/9/17,Sunday,59,506,29
Here is how I set up the regression:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
organic = pd.read_csv("linear-regression.csv")
organic.columns
Index(['Month', 'Date', 'Day of Week', 'Growth', 'Sunlight', 'Plants'], dtype='object')
# Set the dependent (Growth) and independent (Sunlight) variables
y = organic['Growth']
X = organic['Sunlight']
# Test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print (X_train.shape, X_test.shape)
print (y_train.shape, y_test.shape)
(192,) (49,)
(192,) (49,)
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# Error pointing to an array with values from Sunlight [611, 507, 994, ...]
You just need to change your last lines to
lm = linear_model.LinearRegression()
model = lm.fit(X_train.values.reshape(-1,1), y_train)
and the model will fit. The reason for this is that the linear model from sklearn expects
X : numpy array or sparse matrix of shape [n_samples,n_features]
So our training data must have shape [n_samples, 1] in this particular case (for example, [7, 1] for the seven rows shown in the question).
You are only using one feature, so it tells you what to do within the error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature.
The feature data X always has to be 2D in scikit-learn.
(Don't forget the typo in X = organic['Sunglight'])
Once you pass the data to train_test_split(X, y, test_size=0.2), it returns pandas Series X_train and X_test with shapes (192,) and (49,). As mentioned in the previous answers, sklearn expects matrices of shape [n_samples,n_features] for X_train and X_test. You can simply convert the Series X_train and X_test to pandas DataFrames to change their shapes to (192, 1) and (49, 1).
lm = linear_model.LinearRegression()
model = lm.fit(X_train.to_frame(), y_train)
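The options mentioned in these answers can be compared side by side. The sketch below uses only the seven sample rows from the question, so the shapes are (7,) and (7, 1) rather than those of the full file; the variable names are chosen for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The seven sample rows from the question (Sunlight and Growth only).
organic = pd.DataFrame({
    "Sunlight": [611, 507, 994, 1014, 850, 551, 506],
    "Growth": [44, 30, 55, 50, 78, 81, 59],
})

X_series = organic["Sunlight"]               # 1-D Series, shape (7,)
X_reshaped = X_series.values.reshape(-1, 1)  # NumPy array, shape (7, 1)
X_frame = X_series.to_frame()                # DataFrame, shape (7, 1)
X_brackets = organic[["Sunlight"]]           # DataFrame, shape (7, 1)

print(X_series.shape, X_reshaped.shape, X_frame.shape, X_brackets.shape)

# Any of the 2-D variants can be passed to fit and predict.
model = LinearRegression().fit(X_frame, organic["Growth"])
pred = model.predict(X_frame)
print(pred.shape)
```

Whichever form is used for fitting, X_test needs the same conversion before it is passed to predict.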