Using sklearn feature selection for reducing the number of features - python

I have a dataframe of 205 features and 949 observations. To reduce the number of features and keep only the most important inputs, I wanted to use from sklearn.feature_selection import RFE. From the docs, it is just a matter of choosing X and y and passing the data to the method. The code is below:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

new_array = DF_new.values
x = new_array[:, 1:]  # all columns except the first are features
y = new_array[:, 0]   # the first column is the target
model = LinearRegression()
rfe = RFE(model)
fit = rfe.fit(x, y)
It throws ValueError: could not convert string to float: ''. I tried to find and replace any strings, but I hit the same error:
DF_new = df.replace(r'^([A-Za-z]|[0-9]|_)+$', np.NaN, regex=True)
But the same error arose. Is this a problem with my data, or have I made a mistake somewhere else?
Any help is appreciated.
In short: I need to convert the stray strings to floats so that I can run feature selection with the sklearn library.
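For reference, a minimal sketch of one way to locate and coerce such values, using pandas' to_numeric (assuming DF_new is the raw frame from the snippet above):

import pandas as pd

# Coerce every column to numeric; anything unparseable (such as '') becomes NaN
DF_numeric = DF_new.apply(pd.to_numeric, errors='coerce')

# Inspect which columns still held non-numeric values
print(DF_numeric.isna().sum().sort_values(ascending=False).head(10))

# Drop (or impute) the offending rows before running RFE
DF_numeric = DF_numeric.dropna()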

Related

knn, cannot perform reduce with flexible type

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OrdinalEncoder

# encode the string labels as ordinal floats
y = np.array(df.pitch_name)
ord_enc = OrdinalEncoder()
y = ord_enc.fit_transform(y.reshape(-1, 1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)
knn_model = KNeighborsRegressor(n_neighbors=3)
knn_model.fit(X_train, y_train)
knn_model.predict([X_test[0]])
X is all float values and y is all strings. If I use OrdinalEncoder and predict with the model, it works, but the issue is that the result I get is sometimes not a whole number (e.g. 6.3333) when I want an exact category.
Whenever I fit the model with the raw categorical values (strings), I see this error: TypeError: cannot perform reduce with flexible type. Looking at the traceback, the error seems to happen at line 238, where it computes y_pred = np.mean(_y[neigh_ind], axis=1); should it be a median instead, since y is a list of strings? Any help will be appreciated.
237 if weights is None:
--> 238 y_pred = np.mean(_y[neigh_ind], axis=1)
239 else:
240 y_pred = np.empty((X.shape[0], _y.shape[1]), dtype=np.float64)
Forgive me if I misunderstood something, but it sounds like you are trying to perform classification with a regression model: KNeighborsRegressor.
From here you can see that:
Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.
So, by using the OrdinalEncoder you encoded the categories as floats; afterwards, however, the model predicted the mean of the labels of its nearest neighbors, which will not generally be an integer, and thus not a category.
I suggest that you read this, to learn how to use a KNeighborsClassifier.
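For illustration, a minimal sketch of the classifier route, assuming X and df.pitch_name from the question (the raw string labels are fine here, since a classifier takes a majority vote among the neighbors instead of averaging):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, df.pitch_name, test_size=0.2, random_state=12345
)
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.predict(X_test[:1])  # returns an actual pitch name, never 6.3333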

Sklearn Naive Bayes GaussianNB from .csv

I'm having a problem with sklearn.
When I train it with .fit(), it raises ValueError: could not convert string to float: 'Casado'
This is my code:
"""
from sklearn.naive_bayes import GaussianNB
import pandas as pd
# 1. Create Naive Bayes classifier:
gaunb = GaussianNB()
# 2. Create dataset:
dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X_train = dataset.drop(["Compra"], axis=1) #Here I removed the last column "Compra"
Y_train = dataset["Compra"] #This one only consists of that column "Compra"
print("X_train: ","\n", X_train)
print("Y_train: ","\n", Y_train)
dataset2 = pd.read_csv("archivos_de_datos/Nuevos_Clientes.csv")
X_test = dataset2.drop("Compra", axis=1)
print("X_test: ","\n", X_test)
# 3. Train classifier with dataset:
gaunb = gaunb.fit(X_train, Y_train) #Here shows "ValueError: could not convert string to float: 'Casado'"
# 4. Predict using classifier:
prediction = gaunb.predict(X_test)
print("PREDICTION: ",prediction)
"""
And the dataset I'm using is a .csv file that looks like this (but with more rows):
IdCliente,EstadoCivil,Profesion,Universitario,TieneVehiculo,Compra
1,Casado,Empresario,Si,No,No
2,Casado,Empresario,Si,Si,No
3,Soltero,Empresario,Si,No,Si
I'm trying to train it to determine (with a test dataset) whether the last column would be a yes or no (Si or No).
I appreciate your help; I'm obviously new at this and I don't understand what I'm doing wrong here.
I would use OneHotEncoder to, as Lavin mentioned, turn the yes/no values into numbers. A model such as this can't process categorical data directly.
OneHotEncoder is used to handle binary data such as yes/no or male/female, while a label encoder is used for categorical data with more than 2 values, e.g., country names.
It will look something like this; however, you'll have to do this with all the categorical data, not just your y column, and use a label encoder for columns that are not binary (more than 2 values, for example EstadoCivil).
Also, I would suggest removing any variables that don't contribute to your model; for instance, IdCliente sounds like it may not add any value in determining your dependent variable. This is context specific, but something to keep in mind.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# replace ... with the column number(s) for your df
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [...])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
See the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
More info:
https://contactsunny.medium.com/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621#:~:text=What%20one%20hot%20encoding%20does,which%20column%20has%20what%20value.&text=So%2C%20that's%20the%20difference%20between%20Label%20Encoding%20and%20One%20Hot%20Encoding.
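Putting it together for the dataset shown above, a minimal sketch (assuming the column layout from the question; sparse_output=False needs scikit-learn >= 1.2, older versions use sparse=False, and GaussianNB requires a dense array):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import GaussianNB

dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X_train = dataset.drop(["IdCliente", "Compra"], axis=1)  # drop the ID and the target
Y_train = dataset["Compra"]  # the target can stay as Si/No strings

# one-hot encode the four categorical feature columns
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(sparse_output=False), [0, 1, 2, 3])],
    remainder="passthrough",
)
X_train_enc = ct.fit_transform(X_train)

gaunb = GaussianNB().fit(X_train_enc, Y_train)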

The target is binary, but I get "ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead."

I'm facing an issue in a simple ML model using sklearn's KFold.
I categorize my target value using the following code:
import pandas as pd

# Import the DB
df = pd.read_csv("DB_ML_TJA20182019.csv")
# Transform the continuous target into a binary one
category = pd.cut(df.length, bins=[0, 4, 100], labels=[0, 1])
df.insert(18, "length_over", category)
Now, if I open the csv, I can see an added column (length_over, the 18th column, counting from 0) containing the binarized version of the column length. Then I save the dataset as a new file and split it into train and validation subsets, using the following code:
# Save the dataset with the binary target
df.to_csv(r'DB_ML_TJA20182019_multilabel.csv', index=False)

# Load the dataset for ML modeling (already imputed)
from pandas import read_csv
from sklearn.model_selection import train_test_split
url = 'DB_ML_TJA20182019_multilabel.csv'
names = ...
dataset = read_csv(url, names=names, skiprows=1)

# Split out the validation dataset
array = dataset.values
X = array[:, 0:18]
y = array[:, 18]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.30, random_state=1)
However, before proceeding with model evaluation and comparison, I get the error: ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.
I also checked the type of target using
#Check the type of target
from sklearn.utils.multiclass import type_of_target
print(type_of_target(y))
And the result is unknown
What could be the issue? The target is binary when I open the csv, but the function sees it as unknown...
The dtype is int64.
Very late to the party, but I met this error while preparing the dataset for a multi-label classification task and using MultilabelStratifiedKFold().
Essentially, ensure that each label you have is a numpy array with the correct data type (int, for example).
In my case, after performing some operations on a pd.DataFrame, the y label was a pandas.Series whose elements were lists (the labels) rather than np.array objects.
I solved it by:
y = df["label_column"].to_numpy()
y = [np.array(label) for label in y]
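The same principle likely explains the question above: dataset.values on a DataFrame with mixed column types yields an object-dtype array, and type_of_target reports 'unknown' for object arrays of non-strings even when every element is an integer (which is why the column can show int64 in pandas while the sliced target does not). A minimal sketch of the fix, assuming the slicing from the question:

from sklearn.utils.multiclass import type_of_target

y = array[:, 18].astype(int)  # cast the object-dtype 0/1 labels to a numeric dtype
print(type_of_target(y))      # now prints 'binary'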

Different ways to pre-process date in Machine Learning using Python?

I want to pre-process the date and use it to train my model in Python.
My date format is like this:
22-02-2026
The code I have developed so far is attached below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
df=pd.read_csv('data.csv')
df['previous_date'] = pd.to_datetime(df['previous_date'])
df['current_date'] = pd.to_datetime(df['current_date'])
df['previous_date_day'] = df['previous_date'].dt.day
df['previous_date_month'] = df['previous_date'].dt.month
df['previous_date_year'] = df['previous_date'].dt.year
df['current_date_day'] = df['current_date'].dt.day
df['current_date_month'] = df['current_date'].dt.month
df['current_date_year'] = df['current_date'].dt.year
X=df.iloc[:,3:]
Y=df['value']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, np.ravel(y_train))
from sklearn.metrics import accuracy_score
y_pred=clf.predict(X_test)
acc_score=accuracy_score(y_test, y_pred)*100
print("Accuracy Score : " , acc_score)
Based on your comment, you need to convert a date to an ordinal number so that the algorithm can tell the order.
Here is one way to do it:
import datetime
origin = datetime.datetime(1970,1,1)
days = (datetime.datetime.strptime('22-02-2026', '%d-%m-%Y') - origin).days
In this case it's 20506.
I set the origin to the Unix epoch, but you can change it to your liking. It doesn't really matter, since the purpose here is to tell the order. The majority of machine learning algorithms will be able to use a feature in this format, but whether it's the best representation depends on the nature of the problem.
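Applied to a whole column, the same conversion is a couple of lines with pandas (a sketch, assuming the previous_date column and the dd-mm-yyyy format from the question):

import pandas as pd

dates = pd.to_datetime(df['previous_date'], format='%d-%m-%Y')
df['days_since_epoch'] = (dates - pd.Timestamp('1970-01-01')).dt.days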
As there are many dates that need to be converted to a numeric representation, the first thing to make sure of is that the output preserves the date order, as Lukas mentioned. One way to do this is to weight each unit (weight_year > weight_month > weight_day), where each weight must exceed the largest total the smaller units can contribute; for example, the month weight must be more than 31 times the day weight.
def date2num(date_time):
    d, m, y = date_time.split('-')
    # the weights must be large enough that a smaller unit can never
    # outweigh a larger one: 31 days < 100, and 12 months * 100 < 10000
    num = int(d) + int(m)*100 + int(y)*10000
    return num
Now, it's important to normalize the numeric values.
import numpy as np

date_features = []
for d in list(df['date_time']):
    date_features.append(date2num(d))
date_features = np.array(date_features)

# min-max normalization to the [0, 1] range
date_features_normalized = (date_features - np.min(date_features)) / (np.max(date_features) - np.min(date_features))
You wrote in one of the comments to your post:
I just want to compare 2 dates. If the first date is bigger than the second date I want to predict true, else I want my prediction as false. So my question is how should I pre-process the date to train the Machine Learning model.
You do not need machine learning for this; you can solve it with a simple if/else condition.
You really do not need to make things complicated when they are simple!
All you need is this:
if first_date > second_date:
    return True
else:
    return False
Or in your case:
def get_value_for_dates(row):
    if row['first_column'] > row['second_column']:
        return 1
    else:
        return 0

df['value'] = df.apply(get_value_for_dates, axis=1)

ValueError using sklearn and pandas for decision trees?

I'm new to scikit-learn and I just read the documentation and a couple of other stackoverflow posts to build a decision tree.
I have a CSV data set with 16 attributes and 1 target label. How should I pass it to the decision tree classifier?
My current code looks like this:
import pandas
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import tree
data = pandas.read_csv("yelp_atlanta_data_labelled.csv", sep=',')
vect = TfidfVectorizer()
X = vect.fit_transform(data)
Y = data['go']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
When I run the code it gives me the following error:
ValueError: Number of labels=501 does not match number of samples=17
To give some context, my data set has 501 data points and 17 total columns. The go column is the target column with yes/no labels.
The problem is that TfidfVectorizer cannot operate on a dataframe directly; it can only operate on a sequence of strings. Because you are passing a dataframe, iterating over it yields the 17 column names, so the vectorizer treats those 17 strings as the documents, which is why the error reports 17 samples against your 501 labels.
Try instead using:
X = vect.fit_transform(data['my_column_name'])
You may want to preprocess the dataframe to concatenate different columns prior to calling vect.fit_transform.
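For instance, a short sketch of that preprocessing (the column names title and review_text are hypothetical; substitute whichever text columns your CSV actually contains):

# 'title' and 'review_text' are hypothetical column names for illustration
text = data['title'].fillna('') + ' ' + data['review_text'].fillna('')
X = vect.fit_transform(text)  # one document per row: 501 samples
Y = data['go']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)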
