I have some simple code to convert categorical data into a one-hot encoding in Python.
I tried to convert the data with scikit-learn's OneHotEncoder:
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
data = pd.read_csv("C:\\test.txt", sep=",", header=None)
one_hot_encoder = OneHotEncoder(categorical_features=[0,2])
This piece of code does not work and throws an error:
ValueError: could not convert string to float: 't'
Can you please help me?
Try this:
from sklearn import preprocessing
# Label-encode every column so that all values are numeric
le = preprocessing.LabelEncoder()
for c in df.columns:
    df[c] = le.fit_transform(df[c])
I encountered the same behavior and found it frustrating.
Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is:
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
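One way around the all-numeric check is to encode only the string columns with pandas before anything reaches scikit-learn. The following is a minimal sketch, not the original poster's data: the tiny frame and its column names (flag, value, kind) are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the CSV in the question: two string columns
# and one numeric column. The column names are invented for this sketch.
data = pd.DataFrame({
    "flag": ["t", "f", "t"],
    "value": [1.5, 2.0, 3.5],
    "kind": ["a", "b", "a"],
})

# One-hot encode only the listed columns; "value" is left unchanged.
encoded = pd.get_dummies(data, columns=["flag", "kind"])
print(encoded.columns.tolist())
```

Because get_dummies works on the string values directly, no prior label encoding is needed.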
I am trying to convert the breast cancer dataset to a DataFrame and use it,
but I get the following error:
TypeError: only integer scalar arrays can be converted to a scalar
#Import library
from sklearn.datasets import load_breast_cancer
import pandas as pd
#Load dataset
cancer = load_breast_cancer()
#Convert the sklearn.dataset cancer to a DataFrame.
df=pd.DataFrame(cancer.data,columns =[cancer.feature_names])
The dataset converts to a dataframe correctly. It seems to be an indexing issue, so I changed the last line to
and it worked for me.
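The changed line is not shown above. As a sketch of one change that avoids this kind of indexing problem, assuming the cause is the extra list wrapped around cancer.feature_names, the names can be passed directly:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Pass the feature names directly. Wrapping them in a list, as in
# [cancer.feature_names], can make pandas build a MultiIndex for the columns.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

print(df.shape)
```

With a flat column index, a lookup such as df["mean radius"] returns a single Series.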
I've tried to use the imputer to replace all of the NaN entries in my dataset with the average of their respective column. For example, I want a blank entry in the salary column to be filled with the average of the salary values in that column. I tried doing this by following along with a tutorial, but I think the video was outdated, resulting in this error.
# Data Processing
# Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
# Taking care of Missing Data
from sklearn.preprocessing import Imputer
#The source of all the problems
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform
Initially, X looked like this when compiled prior to using Imputer:
However, once I ran lines 16-18, I got this error and I'm not sure what to do.
The line
X[:, 1:3] = imputer.transform
should be
X[:, 1:3] = imputer.transform(X[:, 1:3])
with parentheses, to actually call the method rather than assign the method itself to something.
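Separately from the missing call, the Imputer class has been removed from recent scikit-learn releases; its replacement is SimpleImputer in sklearn.impute. Here is a minimal sketch of the same column-mean imputation. The small array stands in for X[:, 1:3] and its values are invented.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X[:, 1:3]: two numeric columns with gaps.
X_num = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
])

# SimpleImputer always imputes column by column, so it has no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_num)
print(X_filled)
```

Each NaN is replaced by the mean of the non-missing values in its own column.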
Currently, I am getting this error in my code
'ValueError: Input contains NaN, infinity or a value too large for dtype('float64')'
when I want to run this code
import pandas as pd
features_col=['Num_comments', 'Num_Commits','Changed_files']
from sklearn.linear_model import LogisticRegression
If you need a sample of my dataset to check what is really happening, please let me know.
I've loaded the sample set, and the code below ran on my computer:
import pandas as pd
from sklearn.linear_model import LogisticRegression
train = pd.read_csv('RailsDataset_bis.csv')
features_col = ['Num_Comments', 'Num_Commits', 'Changed_files']
X = train[features_col].dropna()
y = train['class'].dropna()
logreg = LogisticRegression()
logreg.fit(X, y)
I have corrected the following issues:
- There is no Num_comments column, only a Num_Comments column, as pandas is case-sensitive. The line X=train.loc[:,features_col] didn't give you an error, but generated a column full of NaN. Selecting columns with X = train[features_col] will throw an error if a column name doesn't exist.
- There is no train.classes, as the column name is class and not classes.
- There was a row full of NaN at the bottom of the set that needed to be removed with dropna().
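One caveat about calling dropna() separately on X and y: if a row is missing a feature but not its label (or the reverse), the two can end up with different lengths. A sketch of dropping incomplete rows once, before splitting, is shown here. The miniature frame is invented; only the column names come from the thread.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical miniature of the dataset, with one all-NaN row at the bottom.
train = pd.DataFrame({
    "Num_Comments": [1, 5, 2, 8, 3, 9, np.nan],
    "Num_Commits": [1, 4, 1, 6, 2, 7, np.nan],
    "Changed_files": [2, 9, 3, 12, 4, 15, np.nan],
    "class": [0, 1, 0, 1, 0, 1, np.nan],
})
features_col = ["Num_Comments", "Num_Commits", "Changed_files"]

# Drop incomplete rows once, so features and labels keep the same rows.
clean = train.dropna(subset=features_col + ["class"])
X = clean[features_col]
y = clean["class"].astype(int)

logreg = LogisticRegression()
logreg.fit(X, y)
```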
I am encoding some data to pass into an ML model using the LabelEncoder and OneHotEncoder from sklearn. However, I am getting an error that relates to a column that I don't think should be encoded.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as py
Dataset = pd.read_csv('C:\\Users\\taylorr2\\Desktop\\SID Alerts.csv', sep = ',')
X = Dataset.iloc[:,:-1].values
Y = Dataset.iloc[:,18].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
As far as I can see, I am only trying to encode the first column of data, but the error I am getting is the following:
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
Traceback (most recent call last):
File "<ipython-input-132-360fc0133165>", line 2, in <module>
X = onehotencoder.fit_transform(X).toarray()
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
self.categorical_features, copy=True)
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'A string that only appears in column 16 or 18 of my data'
What is it about my code that makes it try to convert a value in column 16 or 18 into a float? And in any case, why would that conversion be a problem?
Thanks in advance for your advice!
I'm sorry, this is actually a comment but due to my reputation I can't post comments yet :(
That string probably appears in column 17 of your data, and I think that is because, for some reason, the last columns of the data are checked first. You can try passing fewer columns (e.g. 17, by passing X[:,0:17]) to see what happens; it will complain about the last column again.
Anyway, the input to OneHotEncoder should be a matrix of integers, as described here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
But since you specified the indices of the categorical features to the OneHotEncoder class, I think that shouldn't matter anyway (at least I'd expect the non-categorical features to be "ignored").
Reading the code in 'sklearn/preprocessing/data.py', I've seen that when they do "X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)", they are considering the non-categorical features, even though their indices are passed as an argument to the function that calls check_array. I don't know; maybe this should be raised with the sklearn community on GitHub?
I encountered the same behavior and found it frustrating. As @Vivek pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is:
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
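In scikit-learn 0.20 and later, ColumnTransformer can apply OneHotEncoder to selected columns only, so the remaining columns are never coerced to float. This is a sketch under assumptions rather than a drop-in fix: the three-column array is invented, and sparse_threshold=0 is set here only to force a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is categorical, column 1 is numeric,
# and column 2 is free text that must not be converted to float.
X = np.array([
    ["a", 1.0, "note one"],
    ["b", 2.0, "note two"],
    ["a", 3.0, "note three"],
], dtype=object)

# Encode only column 0; remainder="passthrough" keeps the other columns as they are.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = ct.fit_transform(X)
print(X_encoded.shape)
```

The encoded columns come first in the output, followed by the passed-through columns in their original order. OneHotEncoder accepts string categories directly in these versions, so the LabelEncoder step is not needed.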
When trying to convert the sklearn dataset into a pandas DataFrame with the following code, I get this error: "ufunc 'add' did not contain a loop with signature matching types dtype('
import numpy as np
from sklearn.datasets import load_breast_cancer
import pandas as pd
cancer = load_breast_cancer()
data = pd.DataFrame(data= np.c_[cancer['data'], cancer['target']],columns= cancer['feature_names'] + cancer['target'])
Here is how I converted the sklearn dataset to a pandas dataframe. The target column name needs to be appended.
bostonData = pd.DataFrame(data= np.c_[boston['data'], boston['target']],
columns= np.append(boston['feature_names'],['target']))
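The same pattern applied to the breast cancer dataset from the question might look like the sketch below. The `+` in the original code is the likely source of the ufunc 'add' error, since it attempts an element-wise addition of the string feature names and the integer target array instead of appending a column name.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# np.append concatenates the feature names with one extra name, "target".
columns = np.append(cancer["feature_names"], ["target"])

# np.c_ stacks the target as an additional column next to the features.
data = pd.DataFrame(np.c_[cancer["data"], cancer["target"]], columns=columns)
print(data.shape)
```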
You have a NumPy array of strings. Please provide the full error so we can figure out what's missing.
For example, I am assuming you got dtype('U9'). If so, please add
dtype=float to your array. Something like this, though I am not certain:
data = pd.DataFrame(data= np.c_[cancer['data'], cancer['target']],columns= cancer['feature_names'] + cancer['target'], dtype=float)
Sometimes it's just easier to keep it simple. Create a DataFrame for both the data and the target, then combine them using pandas.
data_df = pd.DataFrame(data=cancer['data'] ,columns=cancer['feature_names'])
target_df = pd.DataFrame(data=cancer['target'], columns=['target']).reset_index(drop=True)
df = pd.concat([data_df, target_df], axis=1)