Dealing with empty data points in Titanic Machine Learning train.csv - python

In the train.csv data of the Titanic Machine Learning project, some passengers have their age missing, so pandas reads it in as NaN, and sklearn's algorithms do not accept that when the data is fed in. I tried dataset.fillna(''), but that turns the value into an empty string rather than a float. Please send help.
https://www.kaggle.com/c/titanic/data
import pandas as pd
from sklearn.cross_validation import train_test_split

dataset = pd.read_csv('train.csv')
#dataset = dataset.fillna()

def preprocess(df):
    from sklearn.preprocessing import LabelEncoder
    processed_df = df.copy()
    le = LabelEncoder()
    done = le.fit_transform(processed_df)
    return done

survival = preprocess(dataset.Survived)

data = dataset.drop('Survived', axis=1)
data = data.drop('PassengerId', axis=1)
data = data.drop('Embarked', axis=1)
data = data.drop('Cabin', axis=1)
data = data.drop('Fare', axis=1)
data = data.drop('Ticket', axis=1)
data = data.drop('Name', axis=1)

x_train, x_test, y_train, y_test = train_test_split(data, survival, test_size=0.25, random_state=0)

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import svm
from sklearn.metrics import accuracy_score

pipeline = make_pipeline(StandardScaler(),
                         svm.SVC(kernel='rbf', C=0.1))
pipeline.fit(x_train, y_train)
print(accuracy_score(pipeline.predict(x_test), y_test))

fillna replaces the NaN values with whatever you pass in, so if you pass '' they become empty strings. Just write:
dataset.fillna(0)
If you need to distinguish between 0 and NaN, you can replace NaN with -1 instead; that's what we do.
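For instance, a minimal sketch that fills only the Age column (assuming the standard Titanic train.csv layout), leaving the rest of the frame untouched:

import pandas as pd

dataset = pd.read_csv('train.csv')

# Fill missing ages with a sentinel value; -1 lets you tell "was missing"
# apart from a genuine 0. Using the column median is another common choice.
dataset['Age'] = dataset['Age'].fillna(-1)
# dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())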

There are several ways to deal with missing values in a machine learning project:
drop the columns that contain missing values
drop the rows that contain missing values
set the missing values to some value (zero, the mean, the median, etc.)
For the third option, Scikit-Learn provides a handy class to take care of missing values: Imputer. Here is how to use it. First, you need to create an Imputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy="median")  # or "mean" if you prefer
x_train = imputer.fit_transform(x_train)
x_test = imputer.transform(x_test)    # transform only: reuse the medians learned on the training set
The result is a plain NumPy array containing the transformed features. If you want to put it back into a pandas DataFrame, it's simple.
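A minimal sketch (assuming x_train and x_test are still pandas DataFrames when they reach the imputer, so their column names and indexes are available):

x_train = pd.DataFrame(imputer.fit_transform(x_train), columns=x_train.columns, index=x_train.index)
x_test = pd.DataFrame(imputer.transform(x_test), columns=x_test.columns, index=x_test.index)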
NB: You could also add the imputer to the pipeline, just before the scaler:
pipeline = make_pipeline(Imputer(strategy="median"),
                         StandardScaler(),
                         svm.SVC(kernel='rbf', C=0.1))

Related

sklearn2pmml omits field names

I export an instance of sklearn.preprocessing.StandardScaler into a PMML file. The problem is that the names of the fields do not appear in the PMML file: e.g., when using the iris dataset, the original field names ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)'] do not appear; instead only names like x1, x2, etc. appear. Is there a way to get the original field names into the PMML file?
The following code should be runnable:
from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
data = load_iris()
dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)
ssModel = StandardScaler()
ssModel.fit(dfIris)
pipe = PMMLPipeline([("StandardScaler", ssModel)])
sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")
In ssIris.pmml the fields show up only as x1, x2, etc., not under their original names.
First, I believe you want to fit the PMMLPipeline after it is constructed, so use pipe.fit(dfIris) rather than fitting ssModel beforehand. To preserve the column names, add a "do-nothing" preprocessing step before the scaler that uses DataFrameMapper to map each pandas column to no transformation (None); the pipeline needs such a preprocessing step in order to keep the column names. I am not sure whether this is the best way, but I checked it and it preserved the column names.
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
data = load_iris()
dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)
ssModel = StandardScaler()
pipe = PMMLPipeline([("df_mapper",
                      DataFrameMapper([(d, None) for d in data.feature_names],
                                      df_out=True)),
                     ("StandardScaler", ssModel)])
pipe.fit(dfIris)
sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")
The only component that comes into contact with the dfIris data frame (which holds the feature name information) is the StandardScaler.fit(X) method. This method does not collect or store incoming feature names in any way.
The SkLearn2PMML package gets feature names from the value of the PMMLPipeline.active_fields attribute. Right now it's missing, which is why SkLearn2PMML falls back to default feature names "x1", "x2", .., "xn".
This attribute is automatically set during the PMMLPipeline.fit(X, y) method invocation. Alternatively, you may set/reset this attribute manually at any time.
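For example, a minimal sketch of setting it by hand (assuming the attribute accepts an array-like of column-name strings):

import numpy as np

pipe.active_fields = np.asarray(dfIris.columns)  # assumption: any array-like of feature names works here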
If you're constructing a PMMLPipeline object using the sklearn2pmml.make_pmml_pipeline utility method, then this method also takes active_fields and target_fields arguments. Please note that in your example code you have a manually constructed PMMLPipeline object, which you then wrap into a new PMMLPipeline object using this utility function call. This is redundant, and actually masks any feature/target names that were possibly set there.
A much better example:
from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn2pmml import sklearn2pmml, PMMLPipeline
data = load_iris()
iris_X = DataFrame(data = data.data, columns = data.feature_names)
iris_y = None
pipeline = PMMLPipeline([
    ("ss", StandardScaler())
])
pipeline.fit(iris_X, iris_y)
sklearn2pmml(pipeline, "ssIris.pmml")

Creating a Decision Tree in Python, Numerical and Categorical Variables: "Unable to coerce to Series"

I am creating a Decision Tree; the dataset has 21 columns, a mix of numeric and categorical variables. I understand that sklearn does not support categorical variables directly, so I converted the categorical columns to numeric using label encoding while keeping the numeric variables separate. I would then have to add both groups back together so I can split into training and test data. However, when I tried to add the two together (the originally numeric variables with the categorical variables converted to numeric), I received a ValueError.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
credit = pd.read_csv('german_credit_risk.csv')
credit.head(10)
credit.info()
credit.describe(include='all')
col_names = ['Duration', 'Credit.Amount', 'Disposable.Income', 'Present.Residence', 'Age', 'Existing.Credits', 'Number.Liable', 'Cost.Matrix']
obj_cols = list(credit.select_dtypes(include='O').columns)
obj_cols
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded_obj_df = pd.DataFrame(columns=obj_cols)
for col in obj_cols:
    encoded_obj_df[col] = le.fit_transform(credit[col])
encoded_obj_df.head(10)
credit.columns = col_names + encoded_obj_df
ValueError
Do I have the right idea and I'm just not adding the two together properly?
The error occurs because you are adding a list of strings to a DataFrame and trying to assign the result of that operation to the column names of another DataFrame.
What you actually need is to concatenate the two data frames (the one with the numerical values and the one with the label-encoded values) along axis 1 with the pd.concat function, as sketched below.
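A minimal sketch using the names from the code above (credit_encoded is a new, made-up name):

# Combine the untouched numeric columns with the label-encoded categorical ones, column-wise.
credit_encoded = pd.concat([credit[col_names], encoded_obj_df], axis=1)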
However, since you are already using scikit-learn, I would advise you to use it to the full extent. The Pipeline and ColumnTransformer classes can help you with the combined task of preprocessing and classification.
A Pipeline chains a sequence of scikit-learn transformers, so you don't need to pass the data between the components yourself.
A ColumnTransformer selects slices of the data, applies the given transformers to them, and then automatically combines the processed (and remaining) columns into a single np.array.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

# Note: OrdinalEncoder is the per-column counterpart of LabelEncoder.
# LabelEncoder itself is meant for the target column only and cannot be
# used inside a ColumnTransformer.
clf = make_pipeline(
    ColumnTransformer(
        [('categorical', OrdinalEncoder(), credit.select_dtypes(include='O').columns)],
        remainder='passthrough'
    ),
    DecisionTreeClassifier()
)
You can then call the standard clf.fit and clf.predict on the resulting pipeline, and all of the preprocessing and prediction happens in one step, for example:
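A short usage sketch (the X / y split is an assumption; substitute your actual feature columns and target column):

from sklearn.model_selection import train_test_split

# Assumption: X holds the feature columns and y the target column of your data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf.fit(X_train, y_train)           # categorical encoding + tree fitting in one call
predictions = clf.predict(X_test)   # the same preprocessing is applied automatically
print(clf.score(X_test, y_test))    # mean accuracy on the held-out data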

creating a pipeline for onehotencoded variables not working

I have a problem where I am trying to apply transformations to my categorical feature 'country' and to the rest of my numerical columns. How can I do this? Here is what I am trying:
preprocess = make_column_transformer(
    (numeric_cols, make_pipeline(MinMaxScaler())),
    (categorical_cols, OneHotEncoder()))

model = make_pipeline(preprocess, XGBClassifier())
model.fit(X_train, y_train)
Note that numeric_cols is passed as a list, and so is categorical_cols.
However, I get this error: TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers, followed by the list of all my numerical columns and (type <class 'list'>) doesn't.
What am I doing wrong? Also, how can I deal with unseen categories in the country column?
You need to put the transformer first and the columns after it; if you check out the help page, the signature is:
sklearn.compose.make_column_transformer(*transformers, **kwargs)
Something like the below will work:
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
import numpy as np
import pandas as pd

X = pd.DataFrame({'x1': np.random.uniform(0, 1, 5),
                  'x2': np.random.choice(['A', 'B'], 5)})
y = pd.Series(np.random.choice(['0', '1'], 5))

numeric_cols = X.select_dtypes('number').columns.to_list()
categorical_cols = X.select_dtypes('object').columns.to_list()

preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    (OneHotEncoder(), categorical_cols)
)

model = make_pipeline(preprocess, XGBClassifier())
model.fit(X, y)
This fits without error; the fitted pipeline looks like:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('minmaxscaler',
                                                  MinMaxScaler(), ['x1']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(), ['x2'])])),
                ('xgbclassifier', XGBClassifier())])
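As for the unseen-categories part of the question: OneHotEncoder takes a handle_unknown parameter; setting it to 'ignore' makes categories never seen during fit encode as all zeros instead of raising an error. A minimal sketch of the same transformer with that option:

preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    # Categories that only appear at predict time are encoded as all zeros
    # instead of raising an error.
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols)
)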

Can encode categorical data in train set but not in the test set

I need to encode the categorical values in my test set, but somehow it throws TypeError: argument must be a string or number. I do not know why this happens, because I could do it on my train set. They are the train/test feature sets, so they have exactly the same columns; the only difference is, of course, the number of rows. I do not know how to fix this; I have tried using a different LabelEncoder for each, but it still does not fix the error. Please, someone help me.
For your information, the categorical data is in column 8 of both the train and test feature sets.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')
avo_sales.rename(columns={'4046':'small PLU sold',
                          '4225':'large PLU sold',
                          '4770':'xlarge PLU sold'},
                 inplace=True)
avo_sales.columns = avo_sales.columns.str.replace(' ','')
x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.fit_transform(X_test[:,8])
On the test set you should never use fit_transform, only transform. It also seems that you're not applying to your test data the preprocessing you did on the training data; that is a mistake as well.
EDIT
When you use fit_transform, for example SimpleImputer(strategy='most_frequent'), on your training data, you're calculating the most frequent value and using it to fill the rows containing NaN. This is fine. If you call fit_transform on your test set, you are cheating, because you're assuming you have a lot of instances from which to calculate the most frequent value (whereas you might be predicting only a single instance). The right thing to do is to fill the missing data using the most frequent value found on the training set. This is done by using only transform. The same logic applies to every other fit_transform / transform pair in sklearn, for example when applying PCA or a CountVectorizer.
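Concretely, a sketch of how the question's preprocessing could look with this fixed (same variable names as in the question; note that LabelEncoder.transform will still fail if the test set contains a category never seen in training):

impC = SimpleImputer(strategy='most_frequent')
X_train[:, 8] = impC.fit_transform(X_train[:, 8].reshape(-1, 1)).ravel()
X_test[:, 8] = impC.transform(X_test[:, 8].reshape(-1, 1)).ravel()   # transform only

imp = SimpleImputer(strategy='median')
X_train[:, 1:8] = imp.fit_transform(X_train[:, 1:8])
X_test[:, 1:8] = imp.transform(X_test[:, 1:8])                       # transform only

le = LabelEncoder()
X_train[:, 8] = le.fit_transform(X_train[:, 8])
X_test[:, 8] = le.transform(X_test[:, 8])                            # transform only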

Using Imputer in a Pipeline doesn't remove NaNs, gives "Input contains NaN, infinity or a value too large for dtype('float64')"

So I'm using a pipeline to perform a Ridge Regression on some data, which also includes an imputer to remove the NaNs.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
#making Data ready
gapdata = pd.read_csv('/Users/naveedanwer/desktop/Python Files/Life Expectancy Data.csv')
gapdata.columns = gapdata.columns.str.strip()
gapdata.rename(columns={'Life expectancy':'life'},
               inplace=True)
gapdata.Status = gapdata.Status.astype('category')
model_data = gapdata.drop('Country',axis =1)
model_data = pd.get_dummies(model_data)
#initialize imputer
imp = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
#initializing regression model
reg = Ridge(alpha = 0.5, normalize = True)
#steps for pipeline
steps = [('imputation',imp),('Ridge',reg)]
#initializing pipeline
pipeline = Pipeline(steps)
#target and feature variables
X = model_data.drop('life', axis = 1)
y = model_data.loc[:,'life']
#splitting into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
pipeline.fit(X_train, y_train)
The original data does contain a lot of NaN values, which is why the imputer is in place. However, the following error is output after the code is executed:
Input contains NaN, infinity or a value too large for dtype('float64')
Which indicates that there are still NaNs in the data, despite the presence of the imputer. Any idea why this is happening?
There are a few kinds of NaN that can be present in the dataframe.
Type 1: np.nan
Type 2: math.nan
Type 3: float('nan')
When you compare any of these with each other (or even with themselves), the comparison returns False. So one thing you can do is to check whether the NaNs were saved in some specific format in the CSV and then use that format for the missing_values argument.
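A quick illustration of why equality checks don't help here, and why pandas' own check does:

import math
import numpy as np
import pandas as pd

print(np.nan == np.nan)        # False: NaN never compares equal, not even to itself
print(np.nan == float('nan'))  # False
print(math.nan == np.nan)      # False
print(pd.isna(np.nan), pd.isna(math.nan), pd.isna(float('nan')))  # True True True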
You can also use the pandas fillna() function to impute values instead of the sklearn Imputer, to see whether the problem lies there. For example, if you want to impute each column with that column's mean value, something like this should work (note that fillna must be used without inplace=True here, otherwise apply collects a column of Nones):
df = df.apply(lambda x: x.fillna(x.mean()))
