How to make a linear regression for a dataframe? - python

I am building an application in Python which can predict the values for Pm2.5 pollution from a dataframe. I am using the values for November and I am trying to first build the linear regression model. How can I make the linear regression without using the dates? I only need predictions for the Pm2.5, the dates are known.
Here is what I tried so far:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(data['day'], data['pm25'], test_size=0.3,
random_state=0
)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(data['day'], data['pm25'])
This code throws the following error:
ValueError: Expected 2D array, got 1D array instead:
array=['2019-11-01T00:00:00.000000000' '2019-11-01T00:00:00.000000000'
'2019-11-01T00:00:00.000000000' ... '2019-11-30T00:00:00.000000000'
'2019-11-30T00:00:00.000000000' '2019-11-30T00:00:00.000000000'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

You need to pass a pandas DataFrame rather than a pandas Series for the X values, so you might want to do something like this:
UPDATE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import datetime
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
print(data.head())
x_data = data[['day']]
y_data = data['pm25']
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3,
random_state=0
)
# linear regression does not work on date type of data, convert it into numerical type
X_train['day'] = X_train['day'].map(datetime.datetime.toordinal)
X_test['day'] = X_test['day'].map(datetime.datetime.toordinal)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(X_train[["day"]], y_train)
Now you can predict on the test data with:
print(lin_reg.predict(X_test[["day"]])) #-->predict the data
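For reference, here is a self-contained sketch of the same approach on made-up data. The values are invented for illustration and are not taken from the CSV, and the column name day_ordinal is my own. It converts the dates to ordinals before splitting, which also avoids assigning into the slices returned by train_test_split:

```python
import datetime

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real CSV: one reading per day in November 2019.
data = pd.DataFrame({
    "day": pd.date_range("2019-11-01", "2019-11-30", freq="D"),
    "pm25": [10.0 + 0.5 * i for i in range(30)],
})

# Dates must become numbers first; toordinal() counts days from 0001-01-01.
data["day_ordinal"] = data["day"].map(datetime.datetime.toordinal)

# Double brackets keep X two-dimensional: shape (n_samples, 1).
X = data[["day_ordinal"]]
y = data["pm25"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
predictions = lin_reg.predict(X_test)
```

Because the conversion happens once on the full frame, the train and test sets share the same numeric column and no second map() call is needed.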

This adds some detail on why you need the "[[" and how to avoid the frustration.
data[['day']] works and data['day'] doesn't because the fit method expects X to be two-dimensional, with shape (n_samples, n_features), whereas y may be one-dimensional. See the documentation:
fit(self, X, y, sample_weight=None)
Fit linear model.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Training data.
y : array-like of shape (n_samples,) or (n_samples, n_targets)
    Target values. Will be cast to X's dtype if necessary.
So for example:
data[['day']].shape
(43040, 1)
data['day'].shape
(43040,)
np.resize(data['day'],(len(data['day']),1)).shape
(43040, 1)
These work because they have the structure required:
lin_reg.fit(data[['day']], data['pm25'])
lin_reg.fit(np.resize(data['day'],(len(data['day']),1)), data['pm25'])
While this doesn't:
lin_reg.fit(data['day'], data['pm25'])
Hence before running the function, check that you are providing input in the required format :)
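One way to do that check is sketched below on a toy frame. The data is made up, and ensure_2d is a hypothetical helper of my own, not part of pandas or scikit-learn:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

series_shape = df["x"].shape      # single brackets: 1D Series
frame_shape = df[["x"]].shape     # double brackets: 2D DataFrame


def ensure_2d(X):
    """Return X as a NumPy array, turning 1D input into a single column."""
    arr = np.asarray(X)
    if arr.ndim == 1:
        arr = arr.reshape(-1, 1)
    return arr


fixed_shape = ensure_2d(df["x"]).shape
print(series_shape, frame_shape, fixed_shape)
```

Printing the shapes before calling fit makes it obvious whether X is still one-dimensional.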

Related

Why does this error arise when I pass TF-IDF vectorized data and numeric data to fit() of a machine learning model at the same time?

I want to combine text and numeric features for duplicate question pair detection. But when I pass the data to classifier.fit(), the following error occurs:
setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
How can I solve this? Here is the relevant code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X=train_df.drop(['is_duplicate', 'Unnamed: 0', 'id', 'qid1',
'qid2','word_mover_distance','jaccard_sim'], axis=1)
y=train_df['is_duplicate'].values
X_num = train_df[['word_mover_distance', 'jaccard_sim']].values
scaler = StandardScaler()
X_num = scaler.fit_transform(X_num)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_num_train, X_num_test = train_test_split(X_num, test_size=0.2, random_state=42)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None,
lowercase=True,
preprocessor=None, # applied in Data Cleaning
use_idf=True,
norm='l2',
smooth_idf=True)
train_q1 =tfidf.fit_transform(X_train['question1'])
train_q2 =tfidf.fit_transform(X_train['question2'])
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(C=0.5, max_iter=1000)
log_clf.fit([train_q1,train_q2, X_num_train],y_train) #Error arise here
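The logistic regression .fit() method expects (from the documentation):
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Passing the features as a list of three matrices means X is no longer a single 2D matrix. NumPy tries to build one array from the three elements and fails because their shapes differ, which is the error shown.
Combine the features column-wise into one matrix before calling fit(). np.concatenate() with axis=1 works for dense arrays; because TfidfVectorizer returns sparse matrices, scipy.sparse.hstack() is the appropriate function here.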
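A minimal sketch of combining the features column-wise, assuming SciPy is available. The questions, numeric values and labels below are made up, and using one vectorizer per text column is my own choice so that each column keeps its own vocabulary (refitting a single vectorizer replaces the vocabulary learned from the first column):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data in place of train_df.
q1 = ["how do I learn python", "what is a neural network",
      "best way to cook rice", "how to fit a model"]
q2 = ["how can I learn python", "what is the capital of France",
      "how to cook rice well", "where is the moon"]
X_num = np.array([[0.1, 0.9], [0.8, 0.1], [0.2, 0.8], [0.9, 0.0]])
y = np.array([1, 0, 1, 0])

# One vectorizer per text column.
tfidf_q1 = TfidfVectorizer()
tfidf_q2 = TfidfVectorizer()
train_q1 = tfidf_q1.fit_transform(q1)
train_q2 = tfidf_q2.fit_transform(q2)

# Stack the three blocks side by side into one sparse matrix
# of shape (n_samples, n_features_q1 + n_features_q2 + 2).
X_combined = hstack([train_q1, train_q2, csr_matrix(X_num)]).tocsr()

log_clf = LogisticRegression(C=0.5, max_iter=1000)
log_clf.fit(X_combined, y)
```

At prediction time the test questions would go through transform() (not fit_transform()) of the same two vectorizers and be stacked in the same column order.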

ValueError: Expected 2D array, got 1D array instead: array=[-1]

Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Here is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value-
Final.Pred)/Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try to convert my 1D data to 2D as follows, I cannot:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_train.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science so if you can explain a bit then it would be real helpful
x1_train and x1_test are pandas Series objects, whereas the reshape() method belongs to NumPy arrays.
Convert each Series to a NumPy array first (and note that x1_test must be built from x1_test, not from x1_train):
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
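To show the whole step end to end, here is a small sketch with invented numbers in place of the housing data. Only the column names come from the question; the values are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical values standing in for the California housing split.
x_train = pd.DataFrame({"median_income": [1.0, 2.0, 3.0, 4.0],
                        "total_rooms": [10, 20, 30, 40]})
y_train = pd.Series([100.0, 200.0, 300.0, 400.0], name="median_house_value")
x_test = pd.DataFrame({"median_income": [5.0, 6.0],
                       "total_rooms": [50, 60]})

# Series -> NumPy array -> single column of shape (n_samples, 1).
x1_train = x_train["median_income"].to_numpy().reshape(-1, 1)
x1_test = x_test["median_income"].to_numpy().reshape(-1, 1)

california_model_new = LinearRegression().fit(x1_train, y_train)
predicted = california_model_new.predict(x1_test)
```

Selecting with double brackets, x_train[["median_income"]], gives the same 2D shape without any reshape.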

Using Python to build a linear regression model and find R2'd; cannot get the model to fit or predict

Some imports for several reasons
import pandas as pd
import numpy as np
I successfully split the data -test(30%) and train(70%) and separated it:
X_train = df_train.drop(columns='Rating')
y_train = df_train.Rating
from sklearn.linear_model import LinearRegression
X_test = df_test.drop(columns='Rating')
y_test = df_test.Rating
Everything is fine to this point, then
linreg = LinearRegression()
linreg.fit(X_train, y_train)
ValueError: could not convert string to float: 'GAME'
I am positive the Rating column is a float.
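The string 'GAME' is somewhere in the data passed to fit(), and since Rating is a float it is most likely in one of the feature columns of X_train. Check the first row of your df: the header may be repeated there, in which case train from the second row onward.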
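A quick way to find out where the string comes from is to list the non-numeric columns. The sketch below uses a made-up frame; the Category and Reviews columns are assumptions, since the question does not show the real ones, and one-hot encoding with get_dummies is just one possible way to handle a text column:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical frame; only 'Rating' and the value 'GAME' come from the question.
df_train = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Reviews": [120, 45, 300],
    "Rating": [4.1, 3.9, 4.5],
})
X_train = df_train.drop(columns="Rating")
y_train = df_train["Rating"]

# Columns that LinearRegression cannot convert to float.
non_numeric = X_train.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# One option: one-hot encode them so every feature is numeric.
X_encoded = pd.get_dummies(X_train, columns=non_numeric, dtype=float)

linreg = LinearRegression()
linreg.fit(X_encoded, y_train)
```

If the list comes back empty, the string is more likely in a stray row, which is the case the advice about the repeated header covers.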

How can I normalize columns in a data frame and then plot a regression line? [duplicate]

This question already has answers here:
Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)
(5 answers)
Closed 3 years ago.
I have a large data frame with MANY columns. I want to normalize a few columns which are all numeric, and then plot two using regression. I thought the code below would do it for me.
from sklearn import preprocessing
# Create x, where x the 'scores' column's values as floats
modDF = df[['WeightedAvg','Score','Co','Score', 'PeerGroup', 'TimeT', 'Ter', 'Spread']].values.astype(float)
# Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
# Create an object to transform the data to fit minmax processor
x_scaled = min_max_scaler.fit_transform(modDF)
# Run the normalizer on the dataframe
df_normalized = pd.DataFrame(x_scaled)
import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x="WeightedAvg", y="Spread", data=modDF)
However, I am getting the following error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
I did a regression without normalizing, using sns.regplot, and it worked, but it looked weird, so I want to see it with normalization applied. I know how the regression works. I just don't know how the normalization works.
There is no need for the command df_normalized = pd.DataFrame(x_scaled).
If you want to run a linear regression, this should work:
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# feature columns ('Score' was listed twice in the question; once is enough)
cols = ['WeightedAvg', 'Score', 'Co', 'PeerGroup', 'TimeT', 'Ter', 'Spread']
# convert the feature columns to numeric; values that cannot be parsed become NaN
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
X = df[cols]
#select your target variable ('target' is a placeholder for your own column name)
y = df['target']
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
# Fit the scaler on the training data only, then apply the same scaling to the test data
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_test_scaled = min_max_scaler.transform(X_test)
#start linear regression
reg = LinearRegression().fit(X_train_scaled, y_train)
#predict for test
y_predict = reg.predict(X_test_scaled)
If you work with a train/test split, it is important to fit the scaler only on the training data; the test data is unknown at that point in time. On the test data you may only use the scaler for transforming.
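If the goal is only to plot two normalized columns, the scaled values need to go back into a DataFrame with column names. A sketch with invented values, using just two of the columns from the question:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values; the column names are taken from the question.
df = pd.DataFrame({
    "WeightedAvg": [10.0, 20.0, 30.0, 40.0],
    "Spread": [1.0, 3.0, 2.0, 5.0],
})
cols = ["WeightedAvg", "Spread"]

# fit_transform returns a bare NumPy array, so restore the column names
# and the original index when wrapping it in a DataFrame again.
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(
    scaler.fit_transform(df[cols]), columns=cols, index=df.index
)
```

With the names in place, a call such as sns.regplot(x="WeightedAvg", y="Spread", data=df_normalized) can look the columns up by name. Passing the NumPy array modDF as data is what produces the IndexError in the question, because an array cannot be indexed with a string.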

Python - SKLearn Fit Array Error

I'm relatively new to using sklearn and python for data analysis and am trying to run some linear regression on a dataset that I loaded from a .csv file.
I have loaded my data into train_test_split without any issues, but when I try to fit my training data I receive an error ValueError: Expected 2D array, got 1D array instead: ... Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample..
Error at model = lm.fit(X_train, y_train)
Because of my freshness with working with these packages, I'm trying to determine if this is the result of not setting my imported csv to a pandas data frame before running the regression or if this has to do with something else.
My CSV is in the format of:
Month,Date,Day of Week,Growth,Sunlight,Plants
7,7/1/17,Saturday,44,611,26
7,7/2/17,Sunday,30,507,14
7,7/5/17,Wednesday,55,994,25
7,7/6/17,Thursday,50,1014,23
7,7/7/17,Friday,78,850,49
7,7/8/17,Saturday,81,551,50
7,7/9/17,Sunday,59,506,29
Here is how I set up the regression:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
organic = pd.read_csv("linear-regression.csv")
organic.columns
Index(['Month', 'Date', 'Day of Week', 'Growth', 'Sunlight', 'Plants'], dtype='object')
# Set the dependent (Growth) and independent (Sunlight) variables
y = organic['Growth']
X = organic['Sunlight']
# Test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print (X_train.shape, X_test.shape)
print (y_train.shape, y_test.shape)
(192,) (49,)
(192,) (49,)
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# Error pointing to an array with values from Sunlight [611, 507, 994, ...]
You just need to adjust your last lines to
lm = linear_model.LinearRegression()
model = lm.fit(X_train.values.reshape(-1,1), y_train)
and the model will fit. The reason for this is that the linear model from sklearn expects
X : numpy array or sparse matrix of shape [n_samples,n_features]
So the training data must have shape [n_samples, 1] in this case, i.e. [192, 1] for the split shown above.
You are only using one feature, so the error message itself tells you what to do:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature.
The feature data X always has to be 2D in scikit-learn.
(Don't forget the typo in X = organic['Sunglight'])
Once you load the data into train_test_split(X, y, test_size=0.2), it returns pandas Series X_train and X_test with shapes (192,) and (49,). As mentioned in the previous answers, sklearn expects matrices of shape [n_samples, n_features] for X_train and X_test. You can simply convert the Series X_train and X_test to DataFrames to change their shapes to (192, 1) and (49, 1).
lm = linear_model.LinearRegression()
model = lm.fit(X_train.to_frame(), y_train)
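Another option, sketched here with only the seven sample rows shown in the question, is to select the feature with double brackets from the start, so that everything derived from X is already 2D:

```python
import io

import pandas as pd
from sklearn import linear_model

# The seven sample rows from the question, read from a string instead of a file.
csv_text = """Month,Date,Day of Week,Growth,Sunlight,Plants
7,7/1/17,Saturday,44,611,26
7,7/2/17,Sunday,30,507,14
7,7/5/17,Wednesday,55,994,25
7,7/6/17,Thursday,50,1014,23
7,7/7/17,Friday,78,850,49
7,7/8/17,Saturday,81,551,50
7,7/9/17,Sunday,59,506,29
"""
organic = pd.read_csv(io.StringIO(csv_text))

X = organic[["Sunlight"]]   # DataFrame: 2D, one column
y = organic["Growth"]       # Series: 1D is fine for the target

lm = linear_model.LinearRegression()
model = lm.fit(X, y)
```

No split is done here because seven rows are too few; with the full file, train_test_split(X, y, test_size=0.2) would return DataFrames for X_train and X_test, so no reshape or to_frame() call is needed afterwards.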
