Cross validation in random forest using anaconda - python

I'm using the titanic data set to predict if a passenger survived or not using random forest. This is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data1.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables
x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)
But, I keep on getting this error:
ValueError: could not convert string to float: 'female'
and I don't understand what is the problem because I changed the Sex feature to a dummy
Thanks:)

pd.get_dummies returns a data frame, and does not do the operation in place. Therefore you really are sending a sting with the sex column.
So you would need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']) and this should fix your problem. I think PClass will also be a string column you need to use dummy variables, as you have it filling '3rd'.
There are still some more places where you call data.isnull().any() that is not doing anything to the underlying dataframe. I left those as they were, but just FYI they may not be doing what you intended.
Full code would be:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any() <-----Beware this is not doing anything to the data
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any() <-----Beware this is not doing anything to the data
data1.isnull().any() <-----Beware this is not doing anything to the data
#********Fix for your code*******
X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass'])
# choosing the predictive variables
# x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)

Related

i can't run a linear regression and cross validation. can someone enlighten me? i get errors such as could not convert string to float

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
import statsmodels.api as sm
import scipy
import scipy.stats
import seaborn as sns
import numpy.random as npr
import math
from scipy.stats import norm
import sqlite3 as sql
import seaborn
from numba import jit, prange
df = pd.read_csv('ODI-2021.edited.csv')
df.info()
sr_targets = pd.Series(df['What is your stress level (0-100)?'])
sr_targets.describe()
df_features = df.drop('What is your stress level (0-100)?', axis=1)
print (df_features)
df_features.describe()
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
df.isnull().sum()
df_missing = df.dropna()
df_missing.shape
X = df["What is your stress level (0-100)?"]
y = df["Time you went to be Yesterday"]
est = sm.OLS(y, X.astype(float)).fit()
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
-ValueError: could not convert string to float: 'over 9000'
model.summary()
-AttributeError: 'LinearRegression' object has no attribute 'summary'
from sklearn import preprocessing
def convert(df):
number = preprocessing.LabelEncoder()
data['Date'] = number.fit_transform(df['Date'])
data=data.fillna(-999)
return data
model = LinearRegression(fit_intercept=True)
result = model.fit(df_features, sr_targets)
-ValueError: could not convert string to float: '3/16/2021'
sr_coef = pd.Series(result.coef_, index=df_features.columns)
sr_coef
-NameError: name 'result' is not defined
sr_endog = sr_targets.copy()
df_exog = sm.add_constant(df_features)
model = sm.OLS(sr_endog, df_exog)
result = model.fit()
-ValueError: Pandas data cast to numpy dtype of object. Check input data with
np.asarray(data)
result.summary()
-NameError: name 'result' is not defined
cross validation
from sklearn.linear_model import LassoCV
from sklearn.model_selection import Fold
nb_folds = 10
cv = KFold(n_splits=nb_folds)
model = LassoCV(fit_intercept=True, cv=cv, n_alphas=200, max_iter=2000)
result = model.fit(df_features_rescaled, sr_targets)
-NameError: name 'df_features_rescaled' is not defined
fig = plt.figure(figsize=[16,15])
xvalues = np.log10(result.alphas_)
rmse_path = np.sqrt(result.mse_path_)
for k in range (nb_folds):
yvalues = rmse_path[:,k]
plt.plot(xvalues, yvalues)
pos_ymin = yvalues.argmin()
plt.plot(xvalues[pos_ymin], yvalues[pos_ymin], marker='o')
plt.axvline(np.log10(result.alpha_))
plt.title('RMSE for differebt alpha', fontsize=20)
plt.grid()
-NameError: name 'result' is not define
sr_coef = pd.Series(result.coef_, index=df_features.columns)
sr_coef
this is my code and these are the errors im getting, could someone help me with what im doing wrong? i have looked up the errors and i have no clue how to fix those. my data set has numbers but also dates and answers such as yes no and university level education level responses which i have no clue how to convert in float. i have been trying to run a regression with two columns which consist of numbers and i get there errors. for the cross validation i am dropping one column and im using the rest and im getting the error that i havent defined the variable result which i have , im clueless
3.thanks in advance!
Each of your errors means something. Learning to read the errors is extremely important in understanding what is going on. For example,
est = sm.OLS(y, X.astype(float)).fit()
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
-ValueError: could not convert string to float: 'over 9000'
This appears to suggest that someone place the phrase "over 9000" in the cell of the CSV file you are opening. Hence, python is having trouble figuring out how to convert that to a float. Same thing would happen if you tried to run
float("over 9000")
It appears the data needs to be cleaned up a bit before it can be used by your sm. It appears that python is trying to tell you the same thing here too:
-ValueError: could not convert string to float: '3/16/2021'
The string "3/16/2021" has symbols that are not apart of a float(), namely "/" symbol.
I think it would be helpful if you broke up your errors and concerns into separate questions, then people could tackle them one at a time for you.

Issues with Pymc3 Summary

I am currently struggling to obtain a summary of the statistics of a model I ran through Bayesian regression on. I first used Lasso and model selection to filter the best variables, then used pm.Model to obtain the regression proper.
Of course, having 'filtered' the explanatory variables that weren't relevant, the shape of the X matrix had changed. The data I worked on is the load_boston dataset from sklearn.dataset. I coded the data as independent variable and the target as dependent variable.
Having performed model selection with SelectFromModel, I used the get.support method to obtain an index of the retained variables. I then used a loop over both the indexes of all variables and the numbers contained in the support, with the purpose of storing the names of the retained variables in an empty list I had created at hoc. The code looks something like this
import pandas as pd
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(9)
# Load the boston dataset.
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston['data'], boston['target']
# Here is the code for the estimator LassoCV
# Here is the code for Model Selection
support(indices=True) #to obtain the list of indices of retained variables
X_transform = sfm.transform(X) #to remove the unnecessary variables
#Here is the line for linear modeling
#I initialize some useful variables
m = y.shape[0]
n = X.shape[1]
c = supp.shape[0]
L = boston['feature_names']
varnames=[]
for i in range (0, n):
for j in range (0, c):
if i == supp[j]:
varnames.append(L[i])
pm.summary(trace, varnames=varnames)
The console then displays 'KeyError: RM', which is one of the names of the variables used. One issue I noticed that every object of varnames is classified as str_ object of numpy module, meaning that I can't read the name of the retained variables on the list unless I double click on them.
How could I fix this? I have no clue what I am doing wrong.

np.argmax does not return the correct index

I used np.argmax to search for the index of the highest value of this array:
And it returned 720. It was supposed to be 721. I tried to google the problem but haven't found the solution yet.
Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from statsmodels.tsa.stattools import acf, pacf
dir='C:\\Users\\DELL\\Google Drive\\JVN couse materials\\Projects\\Practice projects\\Time series project\\energydata_complete.csv'
rawdata=pd.read_csv(dir, index_col='date')
timeseries=pd.DataFrame(rawdata['Appliances'])
timeseries.index=pd.to_datetime(timeseries.index)
timeseries['Log scale']=np.log10(timeseries['Appliances'])
lag_pacf = pacf(timeseries.loc['2016-01-12':'2016-01-21','Log scale'], nlags=1439, method='ols')
highest_pacf_lag=np.argmax(lag_pacf[1:]) ###this is where the problem happens
csv file indexes values from 1 and Python (and numpy and pandas too)is zero indexed. Hence cell no 721 is shown as 720 in python

Why does Pandas say this data frame has only one column?

I began a python course in linear and logistic regression but I am encountering what is probably a stupid error. I have to work with this data frame:
http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
And this is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rwq = pd.read_csv('*filepath*/winequality-red.csv')
rows = len(rwq.index)
cols = rwq.shape[1]
When I print rows and cols, rows correctly prints 1599 but for some reason cols always equals 1 (when in fact they are 12).
I also tried 'len(rwq.columns)' and I still get 1.
Am I doing something wrong or is the problem with the file provided?

Trying to work out a python code

I am new to Machine Learning and python. Recently i have been working with Amazon fine food review data from kaggle and its code.
What i don't understand is how is the 'partiton' method used here ?
Moreover, What actually does last 3 lines of code do ?
%matplotlib inline
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
# using the SQLite Table to read data.
con = sqlite3.connect('./amazon-fine-food-reviews/database.sqlite')
#filtering only positive and negative reviews i.e.
# not taking into consideration those reviews with Score=3
filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", con)
# Give reviews with Score>3 a positive rating, and reviews with a
score<3 a negative rating.
def partition(x):
if x < 3:
return 'negative'
return 'positive'
#changing reviews with score less than 3 to be positive vice-versa
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative
creates an array called actualScore using the column Score from filtered_data
actualScore = filtered_data['Score']
creates array positiveNegative coding negative for values <3 and positive for >3
positiveNegative = actualScore.map(partition)
overwrites old column score with new coded values
filtered_data['Score'] = positiveNegative
I think Actually to replace Score column in table with positve or negetive, we use method called partition.Get the Score column as dataframe actualScore, then map the dataframe with replacing values of whether it is positive or negetive. Then replace values in score column with positiveNegative.

Categories

Resources