Random Forest on Panel Data using Python

So I am having some trouble running a random forest regression on panel data.
The data currently looks like this:
I want to conduct a random forest regression which predicts KwH for each ID over time based on the variables I have. I have split my data into training and test samples using the following code:
from sklearn.model_selection import train_test_split
X = df[['hour', 'day', 'month', 'dayofweek', 'apparentTemperature',
'summary', 'household_size', 'work_from_home', 'num_rooms',
'int_in_renew', 'int_in_gen', 'conc_abt_cc', 'feel_abt_lifestyle',
'smrt_meter_help', 'avg_gender', 'avg_age', 'house_type', 'sum_insul',
'total_lb', 'total_fridges', 'bigg_apps', 'small_apps',
'look_at_meter']]
y = df[['KwH']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
I then wish to train my model and test it against the testing sample; however, I am unsure how to do this. I have tried this code:
from sklearn.ensemble import RandomForestRegressor
rfc = RandomForestRegressor(n_estimators=200)
rfc.fit(X_train, y_train)
However I get the following error message:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
I'm not sure whether the error is fundamentally in the way my data is arranged or in the way I am doing the random forest, so any help with this, and then with testing against the test sample afterwards, would be greatly appreciated.
Thanks in advance.

Simply switching y = df[['KwH']] to y = df['KwH'] or y = df.KwH should solve this.
This is because scikit-learn doesn't expect y to be a DataFrame, and selecting columns with the double [[...]] syntax is precisely what returns a DataFrame; single brackets return a 1-dimensional Series.
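Here is a minimal end-to-end sketch of the fix plus an evaluation on the held-out test sample, reusing the names from the question and assuming all the listed features are already numeric (the metric choice is just one reasonable option):
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
# Select y with single brackets so it is a 1-D Series, not a DataFrame
y = df['KwH']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rfc = RandomForestRegressor(n_estimators=200, random_state=42)
rfc.fit(X_train, y_train)
# Test against the held-out sample
y_pred = rfc.predict(X_test)
print('R^2:', r2_score(y_test, y_pred))
print('MAE:', mean_absolute_error(y_test, y_pred))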

Related

Building a basic prediction model with the output being the sum of the two inputs but accuracy score is significantly low

I have a csv of size 12500 x 3. The first two columns (A and B) are inputs and the final column (C) is the sum of the two columns.
I wanted to build a prediction model to get the value of C for a given A and B. This is just a basic model to improve my understanding of machine learning.
The accuracy score is almost zero (0.00032), and the model is way too simple to get the predictions wrong. The code is below:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('Dataset.csv') #importing dataset
X = data.drop(columns=['C'])
y = data['C']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
score
I did not even include outliers in the data, and I created the CSV using Excel formulae. I used a Jupyter notebook to build this prediction model. Can someone please point out if/what I'm doing wrong?
Before you build your model, you should understand its behavior and main purpose. A decision tree classifier assigns data to discrete classes based on criteria extracted from the data, so it treats every distinct sum as a separate class, and accuracy_score only counts exact matches against a continuous target. For this purpose, you should just choose a simple Linear Regression model, not the Decision Tree.
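A minimal sketch of that switch, keeping the rest of the question's code but scoring with r2_score, since exact-match accuracy is meaningless for a continuous target:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
data = pd.read_csv('Dataset.csv')
X = data.drop(columns=['C'])
y = data['C']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
# R^2 should be ~1.0 here, since C is exactly A + B
predictions = model.predict(X_test)
print(r2_score(y_test, predictions))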

How to use a multiclassification model to make predictions on an entire dataframe

I have trained multiclassification models on my training and test sets and achieved good results with SVC. Now I want to use the model to make predictions on my entire dataframe, but I get the following error: ValueError: X has 36976 features, but SVC is expecting 8989 features as input.
My dataframe has two columns: one with the categories (which I manually labeled for around 1/5 of the dataframe) and a text column with all the documents (including those that have not been labeled).
data={'categories':['1','NaN','3', 'NaN'], 'documents':['Paragraph 1.\nParagraph 2.\nParagraph 3.', 'Paragraph 1.\nParagraph 2.', 'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.', 'Paragraph 1.\nParagraph 2.']}
df=pd.DataFrame(data)
First, I drop the rows with NaN values in the 'categories' column. Then I create the document-term matrix, define y, and split into training and test sets.
tf = CountVectorizer(tokenizer=word_tokenize)
X = tf.fit_transform(df['documents'])
y = df['categories']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Second, I run the SVC model getting good results:
from sklearn.svm import SVC
svm = SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)
model = svm.fit(X_train, y_train)
print('accuracy:', model.score(X_test, y_test))
y_pred = model.predict(X_test)
print(metrics.classification_report(y_test, y_pred))
Finally, I try to apply the SVC model to predict the categories of the entire 'documents' column of my dataframe. To do so, I create the document-term matrix of the entire column and then apply the model:
tf_entire_df = CountVectorizer(tokenizer=word_tokenize)
X_entire_df = tf_entire_df.fit_transform(df['documents'])
y_pred_entire_df = model.predict(X_entire_df)
But then I get the error that my X_entire_df has more features than the SVC model expects as input. I imagine this is because I am now applying the model to the whole 'documents' column, but I do not know how to fix this.
I would appreciate your help!
These issues usually come from feeding the model unknown or unseen data (more or fewer features than were used for training). In your case, fitting a second CountVectorizer on the full column learns a different, larger vocabulary, so the feature count no longer matches the one the SVC was trained on.
I would strongly suggest using sklearn.pipeline to combine the preprocessing (CountVectorizer) and the machine learning model (SVC) in a single object.
From experience, this helps a lot to avoid tedious mismatches between the preprocessing fitted at training time and the data seen at prediction time.
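A sketch of such a pipeline, reusing the CountVectorizer/SVC setup from the question; note it assumes the unlabeled rows hold real NaN values (the sample dict above uses the string 'NaN'). Because the fitted pipeline reuses the vocabulary learned during fit, the feature count always matches at prediction time:
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
pipe = Pipeline([
    ('tf', CountVectorizer(tokenizer=word_tokenize)),
    ('svm', SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)),
])
# Fit only on the manually labeled rows
labeled = df.dropna(subset=['categories'])
pipe.fit(labeled['documents'], labeled['categories'])
# Predict over the entire column: transform() reuses the training vocabulary
df['predicted_category'] = pipe.predict(df['documents'])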

X has 8 features, but RandomForestRegressor is expecting 67 features as input

I want to build a House Price Prediction app. It has fields where users can enter their inputs; a predictive model then predicts the price and displays it to the user. I am using a dataset from Kaggle for the prediction. When I run the code, it shows an error message that says
X has 8 features, but RandomForestRegressor is expecting 67 features as input.
Below is the code. Xy contains the data from Kaggle and df is the user input. Xy is the training set and df is the test set. Xy has 8 variables including the target. df only receives 7 inputs (so it has 7 variables, because no target variable comes from the user).
# Assign to X for input features and Y for target
X = Xy.drop('Price', axis=1)
Y = Xy['Price'].values
# Build Regression Model
model = RandomForestRegressor()
model.fit(X, Y)
df = pd.get_dummies(df, columns=['Location', 'Furnishing', 'Property_Type_Supergroup', 'Size_Type'])
# Apply Model to Make Prediction
prediction = model.predict(df)
I tried to search the solutions online but nothing works for my code. Hope someone can help.
It's a little difficult to tell without seeing the data you're fitting the model on. Between the error and your code, though, it seems you're fitting the model on a data frame of 67 features (pd.get_dummies can easily expand a handful of raw columns into 67 dummy columns). The data frame you call fit on needs to match the data frame you call predict on, at least in terms of features.
Sorry if this answer is redundant; it is difficult to tell without seeing the data and the exact error.
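One common way to make the two frames line up is to reindex the encoded user input onto the training columns. A minimal sketch, assuming X (the frame passed to fit) already holds the one-hot-encoded training columns:
import pandas as pd
# Encode the user input the same way the training data was encoded
df_encoded = pd.get_dummies(df, columns=['Location', 'Furnishing', 'Property_Type_Supergroup', 'Size_Type'])
# Force the exact training columns: dummies missing from the user input
# become 0, and categories unseen during training are dropped
df_encoded = df_encoded.reindex(columns=X.columns, fill_value=0)
prediction = model.predict(df_encoded)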
"X has 8 features, but RandomForestRegressor is expecting 67 features as input."
I assume this is the standard Kaggle dataset you used; after unzipping and loading, it has the following files:
sample_submission.csv
test.csv
data_description.txt
train.csv
If you check the shapes of train.csv and test.csv:
train = pd.read_csv('./house_prices/train.csv')
test = pd.read_csv('./house_prices/test.csv')
print(f'Train shape : {train.shape}')
print(f'Test shape : {test.shape}')
#Train shape : (1460, 81)
#Test shape : (1459, 80)
That shows you deleted or dropped some columns/features/attributes, reducing them from 81 to 67, so no problem up to that point. The problem is that after converting the categorical variables into numeric ones with pd.get_dummies() in the pre-processing stage, you need to split that same encoded df into x_train and x_test, fit() your model on x_train, and predict on x_test via y_pred = model.predict(x_test). Otherwise the shape of df does not match X (one has 8 columns, the other 67 in your case)!
So I suggest splitting the df first:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# Choosing features for predicting the target variable
x = df.drop('Price', axis=1)
y = df['Price'].values
# Data split on df
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2 , random_state=42)
# Apply RandomForestRegressor
model = RandomForestRegressor(n_estimators=300, max_depth=13, random_state=0)
model.fit(x_train,y_train)
# Predicting the data using the model
y_pred = model.predict(x_test)
# Evaluating the model
print(metrics.r2_score(y_test,y_pred))

How to make a linear regression for a dataframe?

I am building an application in Python that can predict PM2.5 pollution values from a dataframe. I am using the values for November, and I am trying first to build the linear regression model. How can I make the linear regression without using the dates? I only need predictions for the PM2.5; the dates are known.
Here is what I tried so far:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(data['day'], data['pm25'], test_size=0.3, random_state=0)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(data['day'], data['pm25'])
This code throws the following error:
ValueError: Expected 2D array, got 1D array instead:
array=['2019-11-01T00:00:00.000000000' '2019-11-01T00:00:00.000000000'
'2019-11-01T00:00:00.000000000' ... '2019-11-30T00:00:00.000000000'
'2019-11-30T00:00:00.000000000' '2019-11-30T00:00:00.000000000'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You need to pass a pandas DataFrame instead of a pandas Series for the X values, so you might want to do something like this:
UPDATE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import datetime
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
print(data.head())
x_data = data[['day']]
y_data = data['pm25']
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=0)
# linear regression does not work on date type of data, convert it into numerical type
X_train['day'] = X_train['day'].map(datetime.datetime.toordinal)
X_test['day'] = X_test['day'].map(datetime.datetime.toordinal)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(X_train[["day"]], y_train)
Now you can predict the data using:
print(lin_reg.predict(X_test[["day"]])) #-->predict the data
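If you also want a quick quality check on the held-out split, an r2_score call works here (an optional addition to the answer above):
from sklearn.metrics import r2_score
print(r2_score(y_test, lin_reg.predict(X_test[["day"]])))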
This is just something else to add on why you need the "[[", and how to avoid the frustration.
The reason data[['day']] works and data['day'] doesn't is that fit expects X to be 2-dimensional, with shape (n_samples, n_features), but does not require that of y; see the docstring:
fit(self, X, y, sample_weight=None)
Fit linear model.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features). Training data.
y : array-like of shape (n_samples,) or (n_samples, n_targets). Target values. Will be cast to X's dtype if necessary.
So for example:
data[['day']].shape
(43040, 1)
data['day'].shape
(43040,)
np.resize(data['day'],(len(data['day']),1)).shape
(43040, 1)
These work because they have the structure required:
lin_reg.fit(data[['day']], data['pm25'])
lin_reg.fit(np.resize(data['day'],(len(data['day']),1)), data['pm25'])
While this doesn't:
lin_reg.fit(data['day'], data['pm25'])
Hence before running the function, check that you are providing input in the required format :)
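As an illustration only, here is a tiny guard (a hypothetical helper, not part of sklearn) that reproduces this check before fitting:
import numpy as np
def ensure_2d(X):
    # Raise early if X is not 2-D, mirroring sklearn's own validation
    arr = np.asarray(X)
    if arr.ndim != 2:
        raise ValueError(
            f"Expected 2D array, got {arr.ndim}D array instead; "
            "reshape with array.reshape(-1, 1) for a single feature."
        )
    return X
ensure_2d(data[['day']])   # passes: shape (43040, 1)
# ensure_2d(data['day'])   # would raise, just like lin_reg.fit does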

How to train SVM model in sklearn python by input CSV file?

I have used scikit-learn in Python for prediction. When importing the datasets package via from sklearn import datasets and storing the result in iris = datasets.load_iris(), it works fine for training the model. Instead, I am loading my data from a CSV file:
iris = pandas.read_csv("E:\scikit\sampleTestingCSVInput.csv")
iris_header = ["Sepal_Length","Sepal_Width","Petal_Length","Petal_Width"]
Model Algorithm :
model = SVC(gamma='scale')
model.fit(iris.data, iris.target_names[iris.target])
But when importing a CSV file to train the model, and also creating a new array for target_names, I am facing an error like
ValueError: Found input variables with inconsistent numbers of
samples: [150, 4]
My CSV file has 5 columns, of which 4 are input and 1 is output. I need to fit the model to that output column.
How do I provide the arguments to fit the model?
Could anyone share a code sample that imports a CSV file and fits an SVM model in scikit-learn?
Since the question was not very clear to begin with and attempts to clarify it were in vain, I decided to download the dataset and do it myself. Just to make sure we are working with the same dataset: run iris.head(); a few names and values might be different, but the overall structure will be the same.
Now the first four columns are features and the fifth one is target/output.
Now you will need your X and Y as numpy arrays, to do that use
X = iris[ ['sepal length:','sepal Width:','petal length','petal width']].values
Y = iris['Target'].values # single brackets give a 1-D array, which LabelEncoder expects
Now, since Y is categorical data, you will need to encode it as integer labels using sklearn's LabelEncoder, and scale the input X. To do that, use
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)
X = StandardScaler().fit_transform(X)
To keep with the norm of separate train and test data, split the dataset using
X_train , X_test, y_train, y_test = train_test_split(X,Y)
Now just train it on your model using X_train and y_train
clf = SVC(C=1.0, kernel='rbf').fit(X_train,y_train)
After this you can use the test data to evaluate the model and tune the value of C as you wish.
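For a quick check on that held-out split, the classifier's score method returns mean accuracy (a small addition, not from the original answer):
print(clf.score(X_test, y_test)) # mean accuracy on the held-out test split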
Edit: Just in case you don't know where the functions come from, here are the import statements
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
