I use spaCy to output a vectorized array of my text field. I'm having issues plugging this output into my random forest and could use some guidance. I label-encoded the other fields, so my pandas DataFrame looks something like:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
d = {'le1': [0,1,2,1], 'le2': [3,0,2,1], 'spacy_output':[[0.12,0.14,3.5],[1.21,0.84,1.92],[0.34,0.85,2.43],[0.09,0.18,2.21]], 'response':[0,1,1,0]}
df = pd.DataFrame(d)
Then I try to plug this into my model:
X = np.array(df.drop('response', axis=1))
y = df['response'].values.ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 23)
clf = RandomForestClassifier(min_samples_split=4, n_estimators=100, criterion='entropy')
clf.fit(X_train,y_train)
I'm confused about how to pass this DataFrame to my model. I get the following errors:
TypeError: only size-1 arrays can be converted to Python scalars
ValueError: setting an array element with a sequence.
The spacy_output column in your DataFrame holds a list in every row, so when you convert the DataFrame to a NumPy array with np.array you end up with an object array whose spacy_output entries are lists rather than numbers. This causes problems when you pass that array to the RandomForestClassifier, because the classifier expects a 2D array of numerical values. Instead, expand spacy_output into its own numeric columns and concatenate them with the label-encoded features:
d = {'le1': [0,1,2,1], 'le2': [3,0,2,1], 'spacy_output':[[0.12,0.14,3.5],[1.21,0.84,1.92],[0.34,0.85,2.43],[0.09,0.18,2.21]], 'response':[0,1,1,0]}
df = pd.DataFrame(d)
# expand the spacy_output lists into their own columns and stack them next to le1/le2
X = np.concatenate([np.array(df[['le1', 'le2']]), np.array(df['spacy_output'].tolist())], axis=1)
y = df['response'].values.ravel()
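With X built this way it is a plain 2-D numerical array, so the classifier can be fit exactly as before; a minimal sketch reusing the variables above:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)
clf = RandomForestClassifier(min_samples_split=4, n_estimators=100, criterion='entropy')
clf.fit(X_train, y_train)
print(clf.predict(X_test))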
Here is the problem:
1. Extract just the median_income column from the independent variables (from X_train and X_test).
2. Perform linear regression to predict housing values based on median_income.
3. Predict output for the test dataset using the fitted model.
4. Plot the fitted model for the training data as well as for the test data to check whether the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Predicted_values = pd.DataFrame(california_model.predict(x_test), columns=['Pred'])
Predicted_values
Final = pd.concat([x_test.reset_index(drop=True), y_test.reset_index(drop=True), Predicted_values], axis=1)
Final['Err_pct'] = abs(Final.median_house_value - Final.Pred) / Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try converting my 1-D Series to a 2-D array as follows, I cannot:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_test.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so if you can explain a bit it would be really helpful.
x1_train and x1_test are pandas Series objects, whereas the reshape() method applies to NumPy arrays.
Do this instead:
x1_train= x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
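From there the remaining steps of the exercise follow directly. A sketch (assuming matplotlib is available) that fits on median_income only, predicts on the test set, and plots the fitted line over both the training and the test data:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train, y_train)
pred_train = california_model_new.predict(x1_train)
pred_test = california_model_new.predict(x1_test)
# fitted line over the training data and over the test data, side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x1_train, y_train, s=2)
ax1.plot(x1_train, pred_train, color='red')
ax1.set_title('Training data')
ax2.scatter(x1_test, y_test, s=2)
ax2.plot(x1_test, pred_test, color='red')
ax2.set_title('Test data')
plt.show()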
Some imports, needed for several reasons:
import pandas as pd
import numpy as np
I successfully split the data into test (30%) and train (70%) sets and separated the features from the target:
X_train = df_train.drop(columns='Rating')
y_train = df_train.Rating
from sklearn.linear_model import LinearRegression
X_test = df_test.drop(columns='Rating')
y_test = df_test.Rating
Everything is fine up to this point; then:
linreg = LinearRegression()
linreg.fit(X_train, y_train)
ValueError: could not convert string to float: 'GAME'
I am positive the Rating column is a float.
Check the first row of your df; the header might be repeated there as a data row. Or just train from the second row onward.
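A quick way to check both possibilities (a small sketch, assuming df_train and the split from above) is to inspect the dtypes and look for rows whose Rating is not numeric:
print(X_train.dtypes)          # any 'object' feature column will break LinearRegression
print(df_train.head(2))        # a repeated header row would show up here
# rows where Rating cannot be parsed as a number (e.g. a stray header value)
bad_rows = pd.to_numeric(df_train['Rating'], errors='coerce').isna()
print(df_train[bad_rows])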
I am building an application in Python that can predict PM2.5 pollution values from a DataFrame. I am using the values for November, and I am first trying to build the linear regression model. How can I do the linear regression without using the dates? I only need predictions for PM2.5; the dates are known.
Here is what I tried so far:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(data['day'], data['pm25'], test_size=0.3,
random_state=0
)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(data['day'], data['pm25'])
This code throws the following error:
ValueError: Expected 2D array, got 1D array instead:
array=['2019-11-01T00:00:00.000000000' '2019-11-01T00:00:00.000000000'
'2019-11-01T00:00:00.000000000' ... '2019-11-30T00:00:00.000000000'
'2019-11-30T00:00:00.000000000' '2019-11-30T00:00:00.000000000'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You need to pass a pandas DataFrame instead of a pandas Series for the X values, so you might want to do something like this:
UPDATE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import datetime
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
print(data.head())
x_data = data[['day']]
y_data = data['pm25']
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3,
random_state=0
)
# linear regression does not work on date type of data, convert it into numerical type
X_train['day'] = X_train['day'].map(datetime.datetime.toordinal)
X_test['day'] = X_test['day'].map(datetime.datetime.toordinal)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(X_train[["day"]], y_train)
Now you can predict the data using:
print(lin_reg.predict(X_test[["day"]])) #-->predict the data
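To get a feel for the fit, you might also compare the predictions against y_test, for example (a small sketch, assuming matplotlib is installed):
import matplotlib.pyplot as plt
pred = lin_reg.predict(X_test[['day']])
plt.scatter(X_test['day'], y_test, s=2, label='actual pm25')
plt.scatter(X_test['day'], pred, s=2, label='predicted pm25')
plt.legend()
plt.show()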
This is just something to add about why you need the "[[" and how to avoid the frustration.
The reason data[['day']] works and data['day'] doesn't is that the fit method expects X to be two-dimensional, with shape (n_samples, n_features), but does not require that of y; see the documentation:
fit(self, X, y, sample_weight=None): Fit linear model.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features). Training data.
y : array-like of shape (n_samples,) or (n_samples, n_targets). Target values. Will be cast to X's dtype if necessary.
So for example:
data[['day']].shape
(43040, 1)
data['day'].shape
(43040,)
np.resize(data['day'],(len(data['day']),1)).shape
(43040, 1)
These work because they have the structure required:
lin_reg.fit(data[['day']], data['pm25'])
lin_reg.fit(np.resize(data['day'],(len(data['day']),1)), data['pm25'])
While this doesn't:
lin_reg.fit(data['day'], data['pm25'])
Hence before running the function, check that you are providing input in the required format :)
I'm new to machine learning, and I am trying to predict stock prices 30 days ahead.
This is my code:
import pandas as pd
import matplotlib.pyplot as plt
import pymysql as MySQLdb
import numpy as np
import sqlalchemy
import datetime
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
forecast_out = int(30)
df['Prediction'] = df[['LastPrice']].shift(-forecast_out)
df['Prediction'].fillna(0)
X = np.array(df['Prediction'].fillna(0))
X = preprocessing.scale(X)
X_forecast = X[-forecast_out:]
X = X[:-forecast_out]
y = np.array(df['Prediction'].fillna(0))
y = y[:-forecast_out]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
X_train, X_test, y_train, y_test.reshape(-1,1)
# Training
clf = LinearRegression()
clf.fit(X_train,y_train)
# Testing
confidence = clf.score(X_test, y_test)
print("confidence: ", confidence)
forecast_prediction = clf.predict(X_forecast)
print(forecast_prediction)
I got this error:
ValueError: Expected 2D array, got 1D array instead:
array=[-0.46939923 -0.47076913 -0.47004993 ... -0.42782272 3.07433019 -0.46573474].
Reshape your data either using
array.reshape(-1, 1) if your data has a single feature
or
array.reshape(1, -1) if it contains a single sample.
It's expecting a 2D array when you're only passing in a 1D array. You can solve this by putting another set of brackets around the value where you're getting the problem. For example:
x = [1,2,3,4]
Foo(x)
If that throws the error, you could just do
Foo([x])
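Applied to the code in the question, that means giving X (and X_forecast) a second dimension before fitting; a sketch reusing the variables from the question:
X = X.reshape(-1, 1)                    # (n_samples,) -> (n_samples, 1)
X_forecast = X_forecast.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LinearRegression()
clf.fit(X_train, y_train)
forecast_prediction = clf.predict(X_forecast)
print(forecast_prediction)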
We are trying to plot the predicted values and the truth values on the same graph after fitting a RandomForestRegressor in Python on a three-column dataset (the full CSV dataset, available via the link, is formatted as follows):
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
Here is how we do the prediction.
import pandas as pd
import numpy as np
import glob, os
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import math
from math import sqrt
from sklearn.model_selection import train_test_split
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "data*.csv"))))
for i in range(1,10):
df['X_t'+str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(10)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
reg = RandomForestRegressor(criterion='mse')
reg.fit(X_train,y_train)
modelPred_test = reg.predict(X_test)
print(modelPred_test)
For comparison, we wish to generate a plot before prediction and after prediction. For the truth values, we tried:
fig, ax = plt.subplots()
ax.plot(df['time'].values, df['Y'].values)
We wish to plot (on the same graph) the ground truth, with time as the x-axis and the value of Y as the y-axis. When we do
ax.plot(df['time'].values, modelPred_test)
We are getting the following error.
raise ValueError("x and y must have same first dimension")
ValueError: x and y must have same first dimension
This means that we have fewer prediction values than we have time stamps in our dataset. To verify this, I ran print(df['time'].values.shape) and print(modelPred_test.shape), which output (258523,) and (103410,) respectively. How can we determine which of my time values correspond to the prediction values, so that I can use a subset of the time values in my plot command?
You have to set up your data like the following:
X = df.drop('Y', axis=1)
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
X_train = X_train.drop('time', axis=1)
X_test = X_test.drop('time', axis=1)
and then sort the datasets
index_values=range(0,len(y_test))
y_test.sort_index(inplace=True)
X_test.sort_index(inplace=True)
reg.fit(X_train, y_train)  # refit on the feature set without the time column
modelPred_test = reg.predict(X_test)
ax.plot(pd.Series(index_values), y_test.values)
Finally, do the same plot for the predicted values of y. Hope this helps.
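For instance, a one-line sketch using the same variables:
ax.plot(pd.Series(index_values), modelPred_test)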
You need to keep track of the indices for training and testing datasets. For example, you could define
train_index, test_index = train_test_split(df.index, test_size=0.40)
and then X_train = X[train_index], etc.
Then, you could plot the results via ax.plot(df['time'][test_index].values, modelPred_test), since modelPred_test is already ordered by test_index.
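Putting it together, a minimal sketch of this idea using row positions instead of index labels (which avoids problems with the duplicate labels that pd.concat can produce), assuming X, y and df are built as in the question and the timestamp column is named time as in your plot call:
import numpy as np
import matplotlib.pyplot as plt
# positions 0..len(df)-1 line up with the rows of X and y and with df['time']
positions = np.arange(len(df))
train_pos, test_pos = train_test_split(positions, test_size=0.40)
X_train, X_test = X[train_pos], X[test_pos]
y_train, y_test = y[train_pos], y[test_pos]
reg = RandomForestRegressor()
reg.fit(X_train, y_train)
modelPred_test = reg.predict(X_test)
# sort the test positions so the plot follows the time axis
order = np.argsort(test_pos)
fig, ax = plt.subplots()
ax.plot(df['time'].values[test_pos[order]], y_test[order], label='truth')
ax.plot(df['time'].values[test_pos[order]], modelPred_test[order], label='predicted')
ax.legend()
plt.show()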