Label Encoder is not creating dummy variables

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset=pd.read_csv('Churn_Modelling.csv')
X=dataset.iloc[:, 3:13]
Y=dataset.iloc[:, 13]
from sklearn.preprocessing import LabelEncoder
label_en1=LabelEncoder()
X.values[:, 1]=label_en1.fit_transform(X.values[:, 1])
label_en2=LabelEncoder()
X.values[:, 2]=label_en2.fit_transform(X.values[:, 2])
I tried creating dummy variables but it is not happening. I am using X.values in the encoding section because the version of Spyder that I have does not support object arrays, so I left X and Y as dataframes. I added .values because dataframes do not support this slice notation. Where might I have gone wrong?
I created a similar program before for creating dummy variables and it worked then. I don't understand why it is not happening for this one.

Edit:
Can you pass in a slice of your slice? Like so:
X.iloc[:, 1] = label_en1.fit_transform(X.iloc[:, 1])
That selects the single column as a Series, which behaves like a one-dimensional array, and writes the encoded values back into the dataframe itself.
Instead of accessing X.values, try accessing the feature / column name directly:
X['col_name'] = label_en1.fit_transform(X['col_name'])
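A minimal sketch of the difference between label encoding and dummy variables, in case it helps. The column names Geography and Gender and the toy values are assumptions made for illustration; they are not taken from the original CSV. LabelEncoder only replaces each category with an integer code, so it never adds columns, whereas pd.get_dummies expands a column into one indicator column per category. Also note that for a frame with mixed column types, X.values most likely builds a new array on each access, so values written into it would not show up in X.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the churn data (column names and values are assumed).
X = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 50],
})

# LabelEncoder maps each category to an integer code; no columns are added.
codes = LabelEncoder().fit_transform(X["Geography"])

# get_dummies creates one indicator column per category of each listed column.
X_dummies = pd.get_dummies(X, columns=["Geography", "Gender"])

print(codes)
print(X_dummies.columns.tolist())
```

If integer codes are all that is needed, assigning by column name, as suggested above, keeps the result in the dataframe.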

Related

Cross validation in random forest using anaconda

I'm using the titanic data set to predict if a passenger survived or not using random forest. This is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data1.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables
x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)
But, I keep on getting this error:
ValueError: could not convert string to float: 'female'
and I don't understand what is the problem because I changed the Sex feature to a dummy
Thanks:)

pd.get_dummies returns a new data frame; it does not operate in place. So you really are still passing the Sex column to the model as strings.
You would need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']), and this should fix your problem. I think PClass is also a string column that needs dummy variables, since you are filling its missing values with '3rd'.
There are also several places where you call data.isnull().any(); that only reports missing values and does nothing to the underlying dataframe. I left those calls as they were, but just FYI, they may not be doing what you intended.
Full code would be:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; cross_val_score now lives in model_selection
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()  # <----- Beware: this is not doing anything to the data
# data1 is never defined in the question, so data is used here instead
data["Age"]=data["Age"].fillna(data["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()  # <----- Beware: this is not doing anything to the data
data.isnull().any()  # <----- Beware: this is not doing anything to the data
#********Fix for your code*******
X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass'])
# choosing the predictive variables
# x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
# pass the encoded X (capital X), and keep the scores separate from the model
scores=cross_val_score(modelrandom,X,y,cv=5)
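The point about get_dummies not working in place can be checked on a small frame. This is only a sketch: the column names come from the question, but the four rows are made-up values standing in for the Titanic file.

```python
import pandas as pd

# Made-up rows using the question's column names.
data = pd.DataFrame({
    "PClass": ["1st", "3rd", "2nd", "3rd"],
    "Age": [29.0, 2.0, 30.0, 25.0],
    "Sex": ["female", "male", "male", "female"],
})

# The result is discarded here, so data["Sex"] still holds strings.
pd.get_dummies(data["Sex"])
still_strings = data["Sex"].tolist()

# Keeping the returned frame gives a fully numeric feature matrix.
X = pd.get_dummies(data[["Sex", "PClass", "Age"]], columns=["Sex", "PClass"])

print(still_strings)
print(X.columns.tolist())
```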

How do I fix this "TypeError: float() argument must be a string or a number, not 'method'" Error?

I've tried to use the imputer to replace all of the NaN entries in my dataset with the average of their respective column. For example, I want a blank entry under the salary column to be filled with the average of the salary values in that column. I tried doing this by following along with a tutorial, but I think the video was outdated, resulting in this error.
Code:
#Data Processing
#Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
#Taking care of Missing Data
from sklearn.preprocessing import Imputer
#The source of all the problems
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform
Initially, X looked like this when compiled prior to using Imputer:
However, once I ran lines 16-18, I got this error, and I'm not sure what to do.

The line
X[:, 1:3] = imputer.transform
should be
X[:, 1:3] = imputer.transform(X[:, 1:3])
...with parentheses and the data as an argument, so that the method is actually called rather than the method object itself being assigned.
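On current scikit-learn releases the Imputer class from the question no longer exists (it was removed in version 0.22), and SimpleImputer from sklearn.impute is its replacement. A minimal sketch of the same mean imputation follows; the small array stands in for the age and salary columns and its values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up stand-in for the two numeric columns X[:, 1:3] (age, salary).
X = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# SimpleImputer always works column-wise, so there is no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns the column means and returns the filled array.
X_filled = imputer.fit_transform(X)

print(X_filled)
```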

Issues with Pymc3 Summary

I am currently struggling to obtain a summary of the statistics of a model I fitted with Bayesian regression. I first used Lasso and model selection to filter the best variables, then used pm.Model to obtain the regression proper.
Of course, having 'filtered' the explanatory variables that weren't relevant, the shape of the X matrix had changed. The data I worked on is the load_boston dataset from sklearn.datasets. I coded the data as independent variable and the target as dependent variable.
Having performed model selection with SelectFromModel, I used the get_support method to obtain an index of the retained variables. I then used a loop over both the indexes of all variables and the numbers contained in the support, with the purpose of storing the names of the retained variables in an empty list I had created ad hoc. The code looks something like this:
import pandas as pd
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(9)
# Load the boston dataset.
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston['data'], boston['target']
# Here is the code for the estimator LassoCV
# Here is the code for Model Selection
supp = sfm.get_support(indices=True) #to obtain the list of indices of retained variables
X_transform = sfm.transform(X) #to remove the unnecessary variables
#Here is the line for linear modeling
#I initialize some useful variables
m = y.shape[0]
n = X.shape[1]
c = supp.shape[0]
L = boston['feature_names']
varnames=[]
for i in range (0, n):
    for j in range (0, c):
        if i == supp[j]:
            varnames.append(L[i])
pm.summary(trace, varnames=varnames)
The console then displays 'KeyError: RM', which is one of the names of the variables used. One issue I noticed is that every element of varnames is a numpy str_ object, meaning that I can't read the names of the retained variables in the list unless I double-click on them.
How could I fix this? I have no clue what I am doing wrong.

Time Series Regression Model Issue

I am new to Python trying to do a time series regression model. I have 3 columns, X, Y, and the date. I imported everything below, but I am getting stuck with an error.
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import adfuller
raw_data = pd.read_csv("IMF and BBG Fair Values.csv")
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
Filtering to get rid of NaN:
filtered_TH = ISO_TH[np.isfinite(raw_data['BBG_FV'])]
I get this error:
C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py:2698: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation

Your problem has exactly the origin described in the pandas documentation you linked. Look at the minimal example they provide there:
def do_something(df):
    foo = df[['bar', 'baz']] # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    foo['quux'] = value # We don't know whether this will modify df or not!
    return foo
The problem is that foo might either be a copy of the dataframe df or a view. If it is a view, then changes on foo will also affect the original dataframe df. If foo is a copy, then the line foo['quux'] = value will have no effect on df.
How does this now translate to your problem?
You start with creating a dataframe from a *.csv file:
raw_data = pd.read_csv("IMF and BBG Fair Values.csv")
Then you select the columns "IMF_VALUE", "BBG_FV", "IMF_DATE" from the dataframe raw_data in the following way:
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
Now, this looks very similar to the second line from the documentation:
foo = df[['bar', 'baz']]
Is your ISO_TH a view or a copy of raw_data? We don't know! So what happens if we change a column of ISO_TH? Does raw_data also change or not? We don't know, hence the warning.
Toy example:
import pandas as pd
import numpy as np
raw_data=pd.DataFrame([[np.inf,22,333,44], [3,4,5,2],[1,2,3,4],[np.inf,0,0,0]],columns=["BBG_FV", "IMF_VALUE", "IMF_DATE", "unused"])
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
# if we now change ISO_TH, we get a warning
ISO_TH.IMF_VALUE=[0,0,0,0] # SettingWithCopyWarning
The fact that you create an intermediate object filtered_TH from ISO_TH changes nothing here.
How can we solve this? Easy, we read the docs and do what is written there!
ISO_TH = raw_data.loc[:,["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
And continue as before.
Additional information: What rules does Pandas use to generate a view vs a copy?
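Another common way to remove the ambiguity is to take an explicit copy, so that later assignments clearly affect only the new frame. The sketch below assumes that approach rather than the .loc form above; the column names come from the question, while the values and the ratio column are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values using the question's column names.
raw_data = pd.DataFrame({
    "IMF_VALUE": [22.0, 4.0, 2.0, 0.0],
    "BBG_FV": [np.nan, 3.0, 1.0, np.nan],
    "IMF_DATE": ["2017-01", "2017-02", "2017-03", "2017-04"],
    "unused": [44, 2, 4, 0],
})

# .copy() makes ISO_TH an independent frame rather than a possible view.
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]].copy()

# Keep only the rows with a finite BBG_FV, again as an explicit copy.
filtered_TH = ISO_TH[np.isfinite(ISO_TH["BBG_FV"])].copy()

# Adding a column (hypothetical name) now touches filtered_TH only.
filtered_TH["ratio"] = filtered_TH["IMF_VALUE"] / filtered_TH["BBG_FV"]

print(filtered_TH)
```

The extra memory for the copies is usually negligible for data of this size, and raw_data is guaranteed to stay untouched.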

could not convert categorical data to number OneHotEncoder

I have a simple dataset whose categorical columns I want to convert to one-hot encoding in Python:
a,1,p
b,3,r
a,5,t
I tried to convert them with python OneHotEncoder:
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
data = pd.read_csv("C:\\test.txt", sep=",", header=None)
one_hot_encoder = OneHotEncoder(categorical_features=[0,2])
one_hot_encoder.fit(data.values)
This piece of code does not work and throws an error
ValueError: could not convert string to float: 't'
Can you please help me?

Try this:
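from sklearn import preprocessing
# df is the DataFrame to encode (it is called data in the question)
for c in df.columns:
    df[c] = df[c].apply(str)
    le = preprocessing.LabelEncoder().fit(df[c])
    df[c] = le.transform(df[c])
    # np.float was removed in NumPy 1.24; the built-in float is equivalent
    df[c] = pd.to_numeric(df[c]).astype(float)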

@user3104352,
I encountered the same behavior and found it frustrating.
Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
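For readers on newer scikit-learn releases: since version 0.20 OneHotEncoder accepts string columns directly, and the categorical_features parameter was later removed in favour of ColumnTransformer. A minimal sketch under those assumptions follows; the column names letter, number and code are invented for illustration, since the file in the question has no header.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The three rows from the question, with invented column names.
data = pd.DataFrame(
    [["a", 1, "p"], ["b", 3, "r"], ["a", 5, "t"]],
    columns=["letter", "number", "code"],
)

# One-hot encode the two string columns and pass the numeric one through.
# sparse_threshold=0 asks for a dense NumPy array as output.
transformer = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["letter", "code"])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = transformer.fit_transform(data)

print(encoded.shape)
```

The encoded columns come first (two for letter, three for code), followed by the untouched number column.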
