Linear regression throws ValueError and UserWarning - python

I was writing a Python program to predict the price of a house from its area:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error
import pandas as pd
df = pd.read_csv('traindata.csv')
plt.xlabel('area')
plt.ylabel('price')
plt.scatter(df.area,df.price)
reg = linear_model.LinearRegression()
reg.fit(df[['area']], df.price)
reg.predict(33000)
When I executed the program, it raised:
raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=33000.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Then I changed (33000) to ([[33000]]) and it showed
UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
Then I changed it to ([['33000']]) and it still showed the same message.

You cannot use ([['33000']]) because that passes a string, and the model needs a numeric value to predict from.
If you are worried about the warning, you can create a DataFrame on the fly with the same column name that was used for fitting, for example:
import pandas as pd
import numpy as np
from sklearn import linear_model
df = pd.DataFrame({'area':np.random.randint(10000,40000,100),
'price':np.random.uniform(1,100,100)})
reg = linear_model.LinearRegression()
reg.fit(df[['area']], df.price)
reg.predict(pd.DataFrame({'area':[33000]}))
array([53.70626723])
But you can see that the result is the same as when you pass a nested list, which only adds the warning:
reg.predict([[33000]])
/Users/gen/anaconda3/lib/python3.8/site-packages/sklearn/base.py:445: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
warnings.warn(
Out[15]: array([53.70626723])
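
To make the comparison above concrete, here is a minimal sketch on synthetic data. The `area` and `price` values are invented for illustration and are not taken from traindata.csv, and the sketch assumes a scikit-learn version that performs the feature-name check (1.0 or later). It suggests that a DataFrame with the training column name and a nested list give the same prediction, and that only the nested list can trigger the warning.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data (invented values, for illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame({"area": rng.integers(10000, 40000, 100)})
df["price"] = 0.002 * df["area"] + rng.normal(0, 1, 100)

reg = LinearRegression().fit(df[["area"]], df["price"])

# Input with the same column name as the training data:
# no feature-name warning is expected here.
pred_named = reg.predict(pd.DataFrame({"area": [33000]}))

# Nested list: same numbers, but scikit-learn may warn about
# missing feature names, so the warning is silenced locally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    pred_list = reg.predict([[33000]])

print(pred_named, pred_list)
```

Either form is fine for a one-off prediction; the DataFrame form is mainly useful when you want the column names checked against the ones seen during fitting.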


LinearSVC: Equation of a straight line that separates two classes from a scatterplot graph and pandas DataFrame

I'm trying to draw a straight line that separates two classes. I'm using a pandas DataFrame with a scatterplot.
Here is my code before I get to my problem:
Libraries:
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import LinearSVC
from sklearn.metrics import ConfusionMatrixDisplay
from scipy.io import arff
Data:
arquivo_arff = arff.loadarff(r"/content/Rice_MSC_Dataset.arff")
dados = pd.DataFrame(arquivo_arff[0])
Filter:
dados = dados[['MINOR_AXIS', 'MAJOR_AXIS', 'CLASS']]
Another filter:
dados = dados[dados['CLASS'].isin([b"Arborio", b"Ipsala"])]
Graph with two parameters:
sns.scatterplot(
data=dados,
x="MINOR_AXIS",
y="MAJOR_AXIS",
hue="CLASS")
plt.show()
My problem is here, when I use LinearSVC to find the parameters and coefficients of my equation:
model = LinearSVC()
model.fit(dados.drop('CLASS', axis=1), dados['CLASS'])
a, b = model.coef_[0]
d = model.intercept_[0]
print('a:', a)
print('b:', b)
print('d:', d)
You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
I didn't quite understand that error. Is there any way I can fix this in my code?
The documentation for MultiLabelBinarizer has some good examples for specific uses, but a general workflow for sklearn transformers is:
Split data into features and labels
X = dados.drop('CLASS', axis=1)
y = dados['CLASS']
#optionally, use train_test_split to split data into training and validation sets
#X_train,X_test,y_train,y_test=train_test_split(X,y)
Do transformations on input and target data
from sklearn.preprocessing import MultiLabelBinarizer
mb = MultiLabelBinarizer()
mb.fit(y)
mb.transform(y)
#can also be done in one step with mb.fit_transform(y)
#if using train_test_split: mb.fit_transform(y_train); mb.transform(y_test)
Fit your model
model = LinearSVC()
model.fit(X,y) #or model.fit(X_train,y_train) if using training and validation sets
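
As an alternative worth checking, the filter on b"Arborio" and b"Ipsala" in the question suggests that the CLASS column holds bytes objects, which some scikit-learn versions treat as sequences and reject with this message. The sketch below is an assumption-laden illustration, not the original answer's method: the numeric values are invented stand-ins for the rice data, and it decodes the labels to plain strings before fitting, then reads the separating line off `coef_` and `intercept_`.

```python
import pandas as pd
from sklearn.svm import LinearSVC

# Tiny stand-in for the rice data (invented values);
# the labels are bytes, as arff.loadarff would return them.
dados = pd.DataFrame({
    "MINOR_AXIS": [1.0, 2.0, 1.0, 2.0, 5.0, 6.0, 5.0, 6.0],
    "MAJOR_AXIS": [1.0, 1.0, 2.0, 2.0, 5.0, 5.0, 6.0, 6.0],
    "CLASS": [b"Arborio"] * 4 + [b"Ipsala"] * 4,
})

# Decode bytes labels to plain strings before fitting.
dados["CLASS"] = dados["CLASS"].str.decode("utf-8")

X = dados[["MINOR_AXIS", "MAJOR_AXIS"]]
y = dados["CLASS"]
model = LinearSVC().fit(X, y)

# Decision boundary: a*MINOR_AXIS + b*MAJOR_AXIS + d = 0
a, b = model.coef_[0]
d = model.intercept_[0]

# Solved for MAJOR_AXIS (valid only when b != 0).
slope, offset = -a / b, -d / b
print(f"MAJOR_AXIS = {slope:.3f} * MINOR_AXIS + {offset:.3f}")
```

With `slope` and `offset` in hand, the line can be drawn over the scatterplot with an ordinary `plt.plot` call on a range of MINOR_AXIS values.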

FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version

I am building a model for binary classification. While processing the data before feeding it to the model, I came across this warning:
FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Here is my code
import torch
import torch.nn as nn
import matplotlib.pyplot as pyp
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.metrics import precision_score,recall_score,roc_curve,auc,roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
#loading the dataset
path='G:/My Drive/datasets/bank.csv'
df=pd.read_csv(path)
print(df.head(5))
print(df.shape)
#distribution of the target values
print("Distribution of Target Values in Dataset -")
df.deposit.value_counts()
#check if we have NA values in the dataset
df.isna().sum()
#extracting columns with strings
cartegorical_columns=df.select_dtypes(include='object').columns
print('categorical columns:',list(cartegorical_columns))
#for all categorical columns with yes/no values, convert into a 1/0 flag
for col in cartegorical_columns:
    if df[col].nunique()==2:
        df[col]=np.where(df[col]=='yes',1,0)
print(df.head(5))
#for the remaining categorical columns that are not binary
#create a one-hot encoded version of the dataset
new_df=pd.get_dummies(df)
#define the target and predictors for the model
target='deposit'
predictors=set(new_df.columns) - set([target])
print('new_df shape:',new_df.shape)
print(new_df[predictors].head())
The specific error
FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
print(new_df[predictors].head())
What could be raising this warning in my code, and how can I solve it?
You are indexing new_df with predictors, which is a set.
Convert it to a list.
Example:
print(new_df[list(predictors)].head())
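
A small variation, sketched here on an invented three-column frame rather than the bank data, is to build the predictors as a list from the start. Unlike a set, a list comprehension keeps the original column order, which can matter when the columns are later turned into a tensor or array.

```python
import pandas as pd

# Small invented frame standing in for the one-hot encoded bank data.
new_df = pd.DataFrame({"age": [30, 45], "deposit": [1, 0], "balance": [100, 250]})
target = "deposit"

# A list comprehension keeps the original column order, unlike a set.
predictors = [col for col in new_df.columns if col != target]

X = new_df[predictors]
y = new_df[target]
print(X.head())
```

`new_df.drop(columns=[target])` would give the same predictor frame in one step.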

ValueError: Expected 2D array, got 1D array instead: array=[-1]

Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value-
Final.Pred)/Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try to convert my 1D data to a 2D array as follows, I cannot:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_train.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so an explanation would be really helpful.
x1_train and x1_test are pandas Series objects, whereas reshape() is a method of NumPy arrays.
Do this instead:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
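
The shapes involved can be checked with a short sketch. The frame below is an invented stand-in for x_train (only the median_income column name comes from the question). It also shows a second option, selecting the column with double brackets, which keeps a 2D DataFrame and needs no reshape at all.

```python
import pandas as pd

# Invented stand-in for x_train; only the column name comes from the question.
x_train = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8],
    "population": [300, 800, 1200],
})

as_series = x_train["median_income"]            # 1D Series, shape (3,)
as_frame = x_train[["median_income"]]           # 2D DataFrame, shape (3, 1)
as_array = as_series.to_numpy().reshape(-1, 1)  # 2D NumPy array, shape (3, 1)

print(as_series.shape, as_frame.shape, as_array.shape)
```

LinearRegression().fit accepts either of the two 2D forms as its first argument; the 1D Series is what triggers the "Expected 2D array" error.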

Identifying arrays with a band structure

I would like to identify arrays with a band-like structure (first image), as opposed to the more homogeneous structure shown in the homogeneous image.
I have so far used some skewness and RMS techniques to test for this but it doesn't work well if the bands are evenly spaced. Are there any more refined ways of identifying such arrays in Python?
Try sns.pairplot.
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The dataset: wages
# We fetch the data from OpenML. Note that setting the parameter as_frame to True will retrieve the data as a pandas dataframe.
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
# Then, we identify features X and targets y: the column WAGE is our target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")
y = survey.target.values.ravel()
survey.target.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
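
Separately from the pairplot suggestion, one possible heuristic for the banding question is to measure how much of an array's total variance comes from differences between row means. This is only a sketch and is not taken from the original answer: the function name `row_variance_fraction` is hypothetical and the test arrays are synthetic. It assumes horizontal bands; for vertical bands the mean would be taken along axis 0 instead. Because it looks at row levels rather than spacing, evenly spaced bands should not be a problem for it.

```python
import numpy as np

def row_variance_fraction(arr):
    """Fraction of the total variance explained by differences between row means."""
    arr = np.asarray(arr, dtype=float)
    total = arr.var()
    if total == 0:
        return 0.0  # constant array: no structure to detect
    return arr.mean(axis=1).var() / total

rng = np.random.default_rng(0)
# Horizontal bands: each row has its own level plus a little noise.
banded = rng.normal(size=(50, 1)) + 0.05 * rng.normal(size=(50, 40))
# Homogeneous: independent noise everywhere.
homogeneous = rng.normal(size=(50, 40))

print(row_variance_fraction(banded), row_variance_fraction(homogeneous))
```

The value lies between 0 and 1: it should be close to 1 for a strongly banded array and close to 1 divided by the number of columns for pure noise, so a threshold somewhere in between would have to be tuned on real data.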

Unable to run logit model/ logistic regression

I'm trying to run a logistic regression. The data has been scrubbed and the categorical variables changed to dummies. However, when I run the code I get an error from inside the "statsmodels" package rather than from my own code, and I'm not sure how to correct it.
A friend of mine ran the same code and got an output (screenshot below). As I'm using Spyder with Python 3.6, he thinks it might be a version issue; he is using Python 3.5.
I've got the code below. Any ideas on how to fix it, or how better to run a logistic regression, are appreciated.
The error message I'm getting is in the statsmodels library:
File "C:\Users\sebas\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 2405, in llr_pvalue
return stats.chisqprob(self.llr, self.df_model)
AttributeError: module 'scipy.stats' has no attribute 'chisqprob'
thanks!
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
################################################################################
## Logistic regression
###############################################################################
data = pd.read_csv(r"log reg test Lending club 2007-2011 car only.csv")
#data = data.dropna()
print(data.shape)
##print(list(data.columns))
print(data['Distressed'].value_counts()) ## number of defaulted car loans is binary
sns.countplot(x='Distressed', data=data, palette='hls')
plt.show() ## confirm dependent variable is binary
##basic numerical analysis of variables to check feasibility for model
## we will need to create dummy variables for strings
#print(data.groupby('Distressed').mean()) ##numerical variable means
#print(data.groupby('grade').mean()) ## string variable means
#print(data.groupby('sub_grade').mean())
#print(data.groupby('emp_length').mean())
#print(data.groupby('home_ownership').mean())
##testing for nulls in dataset
print(data.isnull().sum())
scrub_data=data.drop(['mths_since_last_delinq'],axis=1) ## this variable is not statistically significant
print('Here is the logit model data')
print(scrub_data.isnull().sum()) ## removed records of missing info, sample still sufficiently large
print(list(scrub_data.columns))
print(scrub_data.head())
##convert categorical variables to dummies completed in csv file
X=scrub_data.iloc[:,1:23].values  ## columns 1..22 by position (.ix was removed from pandas)
y=scrub_data.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3, random_state=25)
LogReg=LogisticRegression()
LogReg.fit(X_train,y_train)
y_pred=LogReg.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
print('alternative method using RFE')
#y=['Distressed']
#x=[i for i in data if i not in y]
#print(y)
#print(x)
#print(data.info())
## check for independence between features
correlation=sns.heatmap(data.corr()) ## heatmap showing correlations of the variables
print(correlation)
from sklearn.svm import LinearSVC
#logreg = LogisticRegression()
#rfe = RFE(logreg,10)
#rfe=rfe.fit(x,y)
#print(rfe.support_)
#print(rfe.ranking_)
import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary())
The error can be fixed by assigning the missing function back into the scipy.stats namespace as shown below:
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
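
A slightly more defensive version of the same patch is sketched below. It only assigns the function when it is missing, so it does nothing on older SciPy releases that still ship `chisqprob`, and it runs a quick sanity check against a well-known chi-square value. Upgrading statsmodels is likely the cleaner long-term fix, since the traceback comes from statsmodels calling a function that SciPy has removed.

```python
from scipy import stats

# Only patch when the function is missing (it was removed from newer SciPy releases).
if not hasattr(stats, "chisqprob"):
    stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

# Sanity check: 3.84 is close to the 5% critical value
# of the chi-square distribution with 1 degree of freedom.
p_value = stats.chisqprob(3.84, 1)
print(round(p_value, 3))
```

The patch has to run before `result.summary()` is called, because that is where statsmodels looks up `stats.chisqprob`.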
