Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. The code follows.
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value-
Final.Pred)/Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try converting my 1-D data to 2-D as follows, I cannot:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_train.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so a bit of explanation would be really helpful.
x1_train and x1_test are pandas Series objects, whereas reshape() is a method of NumPy arrays.
Convert each Series to an array first:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
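To make the shape requirement concrete, here is a minimal sketch on synthetic data (the numbers are made up and only stand in for the housing file). It shows two ways of turning a 1-D Series into the 2-D input that scikit-learn expects, assuming a single feature named median_income as in the question:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (values are invented for illustration).
rng = np.random.default_rng(0)
income = pd.Series(rng.uniform(1, 10, size=50), name="median_income")
value = 40000 * income + rng.normal(0, 5000, size=50)

# A Series is 1-D; scikit-learn wants X with shape (n_samples, n_features).
X_2d = income.to_numpy().reshape(-1, 1)  # NumPy route
X_df = income.to_frame()                 # pandas route, keeps the column name

model = LinearRegression().fit(X_2d, value)
predictions = model.predict(X_2d)

print(income.shape, X_2d.shape, X_df.shape)  # (50,) (50, 1) (50, 1)
```

Either X_2d or X_df can be passed to fit; the target can stay 1-D.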
Related
Some imports for several reasons
import pandas as pd
import numpy as np
I successfully split the data into test (30%) and train (70%) sets and separated the target:
X_train = df_train.drop(columns='Rating')
y_train = df_train.Rating
from sklearn.linear_model import LinearRegression
X_test = df_test.drop(columns='Rating')
y_test = df_test.Rating
Everything is fine to this point, then
linreg = LinearRegression()
linreg.fit(X_train, y_train)
ValueError: could not convert string to float: 'GAME'
I am positive the Rating column is a float.
Check the first row of your df; the header may be repeated there as data. If so, train from the second row onward.
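A quick way to see where a string such as 'GAME' comes from is to list the non-numeric columns among the features. The sketch below uses a made-up frame; the column names Category and Reviews are hypothetical, and one-hot encoding is only one possible remedy, not necessarily what the original data needs:

```python
import pandas as pd

# Hypothetical training frame: 'Category' holds strings such as 'GAME'.
df_train = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Reviews": [120, 45, 300],
    "Rating": [4.1, 3.9, 4.5],
})
X_train = df_train.drop(columns="Rating")

# Columns that are not numeric are the ones LinearRegression cannot convert.
non_numeric_cols = X_train.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)

# One possible remedy (an assumption): one-hot encode the text columns.
X_encoded = pd.get_dummies(X_train, columns=non_numeric_cols)
print(X_encoded.columns.tolist())
```

If the list comes back empty, the stray string is more likely a repeated header row inside the data.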
I am attempting to use MultinomialNB from sklearn to classify some data. I have made a sample csv with some labelled training data, which I want to use to train the model but I receive the following error message:
ValueError: Expected 2D array, got 1D array instead: array=[0 1 2 2].
I know it is a very small data set but I will eventually add more data once the code is working.
Here is my data:
Here is my code:
import numpy as np
import pandas as pd
import array as array
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
data_file = pd.read_csv("CSV_Labels.csv", engine='python')
data_file.tail()
vectorizer = CountVectorizer(stop_words='english')
all_features = vectorizer.fit_transform(data_file.Word)
all_features.shape
x_train = data_file.label
y_train = data_file.Word
x_train.values.reshape(1, -1)
y_train.values.reshape(1, -1)
classifer = MultinomialNB()
classifer.fit(x_train, y_train)
Try this:
x_train = x_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
NumPy reshape operations are not in-place: reshape returns a new array, so the arrays you were passing to the classifier still had their old shapes.
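The difference between calling reshape and keeping its result can be seen in a few lines. This is a small illustration using the array from the error message, not the asker's full pipeline:

```python
import numpy as np

labels = np.array([0, 1, 2, 2])

labels.reshape(-1, 1)              # result is discarded; labels is unchanged
print(labels.shape)                # (4,)

labels_2d = labels.reshape(-1, 1)  # keep the returned array instead
print(labels_2d.shape)             # (4, 1)
```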
I am using this dataset:
https://filebin.net/wr2jy0ass7rsl0vt
There are three columns: "Date", "Temperature", "Anomaly". I use "Date" to predict "Temperature". The code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
data_df = pd.read_csv("ave_yearly_temp_nyc_1895-2017.csv")
data_df.columns= ["Date","Temperature","Anomaly"]
data_df["Date"] = data_df["Date"]//100
regressor = LinearRegression()
X_train,X_test, y_train,y_test = train_test_split(data_df.iloc[:,0],data_df.iloc[:,1],test_size=0.2, random_state=0)
regressor.fit(X_train,y_train) #training the algorithm
The data_df:
The error:
How to fix it?
It needs a 2D array; using iloc[:,0] you are getting a 1D Series.
Instead, select the column with double brackets so that it stays a one-column DataFrame.
Try using:
X_train,X_test, y_train,y_test = train_test_split(data_df[['Date']],data_df['Temperature'],test_size=0.2, random_state=0)
Do what the error message tells you. The implementation expects X to be 2D, with shape (n_samples, n_features), even when there is only one feature. Hence you'll need to reshape it like this:
X_train, X_test, y_train, y_test = train_test_split(np.array(data_df.iloc[:,0]).reshape(-1, 1),data_df.iloc[:,1],test_size=0.2, random_state=0)
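With positional indexing, passing a list to iloc keeps the result two-dimensional, which avoids the explicit reshape. The sketch below assumes made-up yearly temperatures in place of the NYC file:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Invented yearly temperatures standing in for the real dataset.
data_df = pd.DataFrame({
    "Date": list(range(2000, 2010)),
    "Temperature": [54.1, 54.3, 54.0, 54.6, 54.8, 55.0, 54.9, 55.3, 55.1, 55.6],
})

print(data_df.iloc[:, 0].shape)    # (10,)   scalar indexer -> 1-D Series
print(data_df.iloc[:, [0]].shape)  # (10, 1) list indexer -> 2-D DataFrame

X_train, X_test, y_train, y_test = train_test_split(
    data_df.iloc[:, [0]], data_df.iloc[:, 1], test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)
predicted = regressor.predict(X_test)
```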
I am building an application in Python which can predict the values for Pm2.5 pollution from a dataframe. I am using the values for November and I am trying to first build the linear regression model. How can I make the linear regression without using the dates? I only need predictions for the Pm2.5, the dates are known.
Here is what I tried so far:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(data['day'], data['pm25'], test_size=0.3,
random_state=0
)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(data['day'], data['pm25'])
This code throws the following error:
ValueError: Expected 2D array, got 1D array instead:
array=['2019-11-01T00:00:00.000000000' '2019-11-01T00:00:00.000000000'
'2019-11-01T00:00:00.000000000' ... '2019-11-30T00:00:00.000000000'
'2019-11-30T00:00:00.000000000' '2019-11-30T00:00:00.000000000'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You need to pass a pandas DataFrame instead of a pandas Series for the X values, so you might want to do something like this:
UPDATE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import datetime
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
print(data.head())
x_data = data[['day']]
y_data = data['pm25']
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3,
random_state=0
)
# linear regression does not work on date type of data, convert it into numerical type
X_train['day'] = X_train['day'].map(datetime.datetime.toordinal)
X_test['day'] = X_test['day'].map(datetime.datetime.toordinal)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(X_train[["day"]], y_train)
Now you can predict the data using,
print(lin_reg.predict(X_test[["day"]])) #-->predict the data
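The date-to-number step can also be done before splitting, by building a separate numeric feature frame. This is a minimal sketch on invented readings; the column name day_ordinal is hypothetical, and ordinal days are just one reasonable numeric encoding of a date:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented readings; the 'day' and 'pm25' column names follow the question.
data = pd.DataFrame({
    "day": pd.to_datetime(
        ["01/11/2019", "02/11/2019", "03/11/2019", "04/11/2019"], dayfirst=True
    ),
    "pm25": [12.0, 15.5, 14.0, 18.2],
})

# Build a new 2-D feature frame holding each date as an integer day count.
X = pd.DataFrame({"day_ordinal": data["day"].apply(lambda d: d.toordinal())})

model = LinearRegression().fit(X, data["pm25"])

# Predict for a date outside the sample, encoded the same way.
next_day = pd.DataFrame({"day_ordinal": [pd.Timestamp("2019-11-05").toordinal()]})
forecast = model.predict(next_day)
print(forecast)
```

Creating a fresh frame rather than assigning into a slice of the split data also sidesteps pandas' chained-assignment warnings.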
This is just something else to add to why you need the "[[", and how to avoid the frustration.
The reason data[['day']] works and data['day'] doesn't is that the fit method expects X to be 2D, with shape (n_samples, n_features), while y may be 1D. See the documentation:
fit(self, X, y, sample_weight=None)
Fit linear model.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Training data.
y : array-like of shape (n_samples,) or (n_samples, n_targets)
    Target values. Will be cast to X's dtype if necessary.
So for example:
data[['day']].shape
(43040, 1)
data['day'].shape
(43040,)
np.resize(data['day'],(len(data['day']),1)).shape
(43040, 1)
These work because they have the structure required:
lin_reg.fit(data[['day']], data['pm25'])
lin_reg.fit(np.resize(data['day'],(len(data['day']),1)), data['pm25'])
While this doesn't:
lin_reg.fit(data['day'], data['pm25'])
Hence before running the function, check that you are providing input in the required format :)
I am working on a multi-target (binary) classification. There are 11 targets and I am using sklearn's MultiOutputClassifier. I am having difficulty with the predict_proba function. See a snippet of the dataset, and code below:
import pandas as pd
import numpy as npy
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier
data = pd.read_csv("123.csv")
target = ['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72']
train, test = train_test_split(data, test_size=0.2)
X_train = train.drop(['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72','FORMULA_NUMBER'],axis=1)
X_test = test.drop(['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72','FORMULA_NUMBER'],axis=1)
Y_train = train[target]
Y_test = test[target]
model = MultiOutputClassifier(GradientBoostingClassifier())
model.fit(X_train, Y_train)
target_probabilities = model.predict_proba(X_test)
print(target_probabilities)
The probabilities output does not seem to be in the correct form. I get 11 arrays of shape 565x2 (565 is the length of my test set). I'd like to save target_probabilities to a csv file, but I get the error: ValueError: Expected 1D or 2D array, got 3D array instead. My question is essentially the same as the one at this link -
https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier, but the answer there only explains why the output is a set of arrays.
EDIT: I have simplified the problem.
target_probabilities = npy.array(target_probabilities)
Now target_probabilities is an (11, 565, 2) array. I need to reshape it to (565, 11), where row i is target_probabilities[:,i][:,1], for i in range(0, 565).
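One way to get the (565, 11) layout described above is to slice out the last column of every per-target array and transpose. The sketch below fakes the predict_proba output with random numbers, uses hypothetical column names target_0 to target_10, and assumes every target saw both classes during training so that each array really has two columns:

```python
import numpy as np
import pandas as pd

# Fake predict_proba output: a list of 11 arrays, each of shape (565, 2).
rng = np.random.default_rng(0)
n_targets, n_samples = 11, 565
target_probabilities = []
for _ in range(n_targets):
    p1 = rng.uniform(size=n_samples)
    target_probabilities.append(np.column_stack([1 - p1, p1]))

stacked = np.array(target_probabilities)  # shape (11, 565, 2)
positive = stacked[:, :, 1].T             # class-1 probabilities, shape (565, 11)

# Hypothetical column names; the real target names could be used instead.
columns = [f"target_{i}" for i in range(n_targets)]
proba_df = pd.DataFrame(positive, columns=columns)

# to_csv returns the CSV text when no path is given; pass a path to write a file.
csv_text = proba_df.to_csv(index=False)
print(proba_df.shape)  # (565, 11)
```

Row i of positive then holds the probability of class 1 for each of the 11 targets on test sample i.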