Different ways to pre-process dates for machine learning in Python? - python

I want to pre-process dates and use them to train my model in Python.
My dates are formatted like this:
22-02-2026
The code I have developed so far is below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
df=pd.read_csv('data.csv')
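# The dates are day-first (dd-mm-YYYY), so state the format explicitly
df['previous_date'] = pd.to_datetime(df['previous_date'], format='%d-%m-%Y')
df['current_date'] = pd.to_datetime(df['current_date'], format='%d-%m-%Y')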
df['previous_date_day'] = df['previous_date'].dt.day
df['previous_date_month'] = df['previous_date'].dt.month
df['previous_date_year'] = df['previous_date'].dt.year
df['current_date_day'] = df['current_date'].dt.day
df['current_date_month'] = df['current_date'].dt.month
df['current_date_year'] = df['current_date'].dt.year
X=df.iloc[:,3:]
Y=df['value']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, np.ravel(y_train))
from sklearn.metrics import accuracy_score
y_pred=clf.predict(X_test)
acc_score=accuracy_score(y_test, y_pred)*100
print("Accuracy Score : " , acc_score)

Based on your comment, you need to convert a date to an ordinal number so that the algorithm can tell the order.
Here is one way to do it:
import datetime
origin = datetime.datetime(1970,1,1)
days = (datetime.datetime.strptime('22-02-2026', '%d-%m-%Y') - origin).days
In this case it's 20506.
I set the origin to the Unix epoch, but you can change it to your liking. It doesn't really matter, since the purpose here is only to preserve the order. The majority of machine learning algorithms can use a feature in this format, but whether it is the best representation depends on the nature of the problem.
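The same conversion can be applied to a whole list of date strings. This is only a sketch, assuming the dd-mm-YYYY format from the question; the helper name `days_since_epoch` is made up for illustration.

```python
from datetime import datetime

ORIGIN = datetime(1970, 1, 1)  # Unix epoch; any fixed origin works


def days_since_epoch(date_string, fmt="%d-%m-%Y"):
    """Return the number of whole days between ORIGIN and the given date."""
    return (datetime.strptime(date_string, fmt) - ORIGIN).days


dates = ["22-02-2026", "01-03-2026", "31-12-2025"]
ordinals = [days_since_epoch(d) for d in dates]
print(ordinals)
```

Because the result counts real days, the difference between two encoded dates is also the number of days between them, which can be a useful feature on its own.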

As there are many dates to convert to a numeric representation, the first thing to ensure is that the numbers keep the same order as the dates, as Lukas mentioned. The easiest way to do this is to weight each unit (weight_year > weight_month > weight_day), with each weight larger than the biggest value the lower units can contribute.
def date2num(date_time):
    d, m, y = date_time.split('-')
    # The weights must be ordered AND large enough that a lower unit can never
    # outweigh a higher one (day <= 31 < 100, month*100 + day <= 1231 < 10000).
    num = int(y)*10000 + int(m)*100 + int(d)
    return num
Now, it's important to normalize the numeric values.
import numpy as np
date_features = []
for d in list(df['date_time']):
    date_features.append(date2num(d))
date_features = np.array(date_features)
date_features_normalized = (date_features - np.min(date_features))/(np.max(date_features) - np.min(date_features))
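A quick way to check that a weighted encoding keeps chronological order is to compare it against real date parsing. This sketch assumes weights of 10000 for the year, 100 for the month and 1 for the day; `encode` and `min_max` are illustrative names, not part of any library.

```python
from datetime import datetime


def encode(date_string):
    """Encode a dd-mm-YYYY string as the integer YYYYmmdd."""
    d, m, y = date_string.split("-")
    return int(y) * 10000 + int(m) * 100 + int(d)


def min_max(values):
    """Scale values linearly to [0, 1]; assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


dates = ["31-01-2026", "01-02-2026", "15-06-2025"]

# Sorting by the encoded number should match sorting by the parsed date.
by_code = sorted(dates, key=encode)
by_date = sorted(dates, key=lambda s: datetime.strptime(s, "%d-%m-%Y"))
order_preserved = by_code == by_date

scaled = min_max([encode(s) for s in dates])
print(order_preserved, scaled)
```

Note that such an encoding preserves order but not distance: 31-01-2026 and 01-02-2026 are one day apart, yet their codes differ by 70. If distances matter, counting days from a fixed origin is probably the safer choice.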

You wrote in one of your comments on your post:
I just want to compare 2 dates. If the first date is bigger than the
second date I want to predict true, else I want my prediction as
false. So my question is how should I pre-process the date to train the Machine Learning model.
You do not need machine learning for this; you can solve it with a simple if/else condition.
There is no need to make things complicated when they are simple!
All you need is this:
def first_is_later(first_date, second_date):
    if first_date > second_date:
        return True
    else:
        return False
Or in your case:
def get_value_for_dates(row):
    # Assumes both columns were already parsed with pd.to_datetime
    if row['first_column'] > row['second_column']:
        return 1
    else:
        return 0
df['value'] = df.apply(get_value_for_dates, axis=1)
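One caveat with a plain comparison: it only works on parsed dates. Comparing the raw dd-mm-YYYY strings compares them character by character, which gives the wrong answer as soon as the days and months disagree. A small sketch, with `parse` as an illustrative helper:

```python
from datetime import datetime


def parse(date_string):
    """Parse a dd-mm-YYYY string into a datetime object."""
    return datetime.strptime(date_string, "%d-%m-%Y")


first, second = "22-02-2026", "01-03-2026"

as_strings = first > second              # lexicographic: "2" > "0"
as_dates = parse(first) > parse(second)  # chronological: 22 Feb vs 1 Mar

print(as_strings, as_dates)
```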

Related

Using sklearn feature selection for reducing the number of features

I have a dataframe with 205 features and 949 observations. To reduce the number of features and keep the most important inputs, I wanted to use RFE (from sklearn.feature_selection import RFE). According to the docs, I only need to choose X and y and pass them to the method. The code is below:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
new_array = DF_new.values
x = new_array[:,1:]
y = new_array[:,0]
model = LinearRegression()
rfe = RFE(model)
fit = rfe.fit(x, y)
It throws ValueError: could not convert string to float: ''. I tried to find and replace any strings, but I still get the same error:
DF_new = df.replace(r'^([A-Za-z]|[0-9]|_)+$', np.nan, regex=True)
Is this a problem with my data, or did I make a mistake somewhere else?
Any help is appreciated.
In short: converting any str to float and then doing feature selection with the sklearn library.
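The value in the error message is an empty string, and the regex above requires at least one character (because of the `+`), so it never matches `''`. One possible approach, sketched here on made-up data, is to coerce every column with `pd.to_numeric(errors="coerce")`, which turns anything unparseable (including `''`) into NaN, and then fill or drop the gaps before running RFE. The column names and the median fill are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Made-up data: the target depends on f1 and f2, while f3 is pure noise.
rng = np.random.default_rng(0)
n = 50
f1, f2, f3 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
target = 3.0 * f1 - 2.0 * f2 + rng.normal(scale=0.1, size=n)

# Simulate a frame read as text, with one empty cell.
raw = pd.DataFrame({"target": target, "f1": f1, "f2": f2, "f3": f3}).astype(str)
raw.loc[0, "f3"] = ""

# Coerce every column to numbers; unparseable cells become NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")
clean = clean.fillna(clean.median())

X = clean.drop(columns="target")
y = clean["target"]

rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
selected = X.columns[rfe.support_].tolist()
print(selected)
```

Whether filling with the median is appropriate depends on the data; dropping the affected rows or columns is another option.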

sklearn to PMML: can't create a pipeline for the preprocessing step of categorical columns

I'm having a tough time trying to create a PMML pipeline with the sklearn2pmml library (Python). I want to convert categorical variables to numerical ones by reassigning their values, but I don't have a clue how. I tried many sklearn preprocessors, but they are not compatible. Has anyone encountered the same problem?
Here's an example. I know it is clearly wrong, but I want to make sure you understand what I'm trying to do.
Even an automatable solution in PMML would help me.
See the example below, thanks.
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
# FORGET ABOUT TRAIN TEST SPLIT; we only care if the PMML pipeline works for now
BIRTHDAY_SEED = 1995
nrows, cols = 1000, 5
X, y = make_classification(n_samples=nrows, n_features=cols, n_informative=2, n_redundant=3, n_classes=2, shuffle=True, random_state=BIRTHDAY_SEED)
X, y = pd.DataFrame(X), pd.Series(y)
X["cat_variable"] = np.random.choice(["a","b","c"],size=len(X) )
# DEFINE FUNCTIONS FOR REASSIGNING CATEGORY
def simple_category_asignation(value):
    """
    The returns are random; I just want to reassign a number to a category.
    """
    if value == "a":
        return 1.5
    elif value == "b":
        return 2.0
    elif value == "c":
        return 1.97
    else:
        return -1
def preprocessing_cat_variables(X):
    """
    Reprocess categorical variable.
    """
    X["cat_variable"] = X["cat_variable"].apply(simple_category_asignation)
    return X
# FIT THE MODEL AND TRY TO CREATE THE PMML PIPELINE, it does not work
X = preprocessing_cat_variables(X)
model = DecisionTreeClassifier()
model.fit(X,y)
pmml_pipeline = PMMLPipeline([
    # here we should place the category preprocessor; I know it does not work, but this way you can get the idea
    ("preprocessing_categories_step", preprocessing_cat_variables),
    #
    ('decisiontree', model)
])
sklearn2pmml(pmml_pipeline, "example_pipeline_pmml.pmml", with_repr = True)
You can map from one categorical value space (strings) to another (floats) using the sklearn2pmml.preprocessing.LookupTransformer transformer type.
Your code simplifies to this:
from sklearn2pmml.preprocessing import LookupTransformer
mapping = {
    "a" : 1.5,
    "b" : 2.0,
    "c" : 1.97
}
pmml_pipeline = PMMLPipeline([
    ("category_remapper", LookupTransformer(mapping, default_value = -1.0)),
    ("classifier", DecisionTreeClassifier())
])
Alternatively, you can build a mapper based on free-form Python expressions using the sklearn2pmml.preprocessing.ExpressionTransformer transformation type.

How to get the machine learning model permutation Importance into an array or dictionary in python

I have this code sample, which works well when I run it in a Jupyter notebook. It displays a table (as an image) with two columns:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance
X = inputsdf
y = targetdf
X_traindf, X_testdf, y_traindf, y_testdf = train_test_split(X, y, random_state=0)
estimator = RandomForestClassifier(max_depth=2, random_state=0)
estimator.fit(X_traindf, y_traindf)
perm = PermutationImportance(estimator, random_state=1).fit(X_testdf, y_testdf)
eli5.show_weights(perm, feature_names = X_testdf.columns.tolist())
But I need these values as an array, a dictionary, or anything else that I can assign to a variable and reuse, so that it looks something like this:
{
"PercentageSalaryHike": "0.0960 +- 0.0222",
.
.
.
}
Can someone please help me? Or is there a better way to find the permutation importance for each column?
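eli5.show_weights returns an HTML object meant for display in the notebook, so wrapping it in np.array will not give you the numbers. The fitted PermutationImportance object exposes them as attributes, so something like this should work:
variable = dict(zip(X_testdf.columns, zip(perm.feature_importances_, perm.feature_importances_std_)))
Each value is then a (mean importance, standard deviation) pair for that column.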
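As for a possibly better way: scikit-learn ships its own `sklearn.inspection.permutation_importance`, which returns plain arrays that are easy to put into a dictionary. The sketch below uses synthetic data and made-up feature names in place of the asker's `inputsdf` and `targetdf`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real inputs and target.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
estimator = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Shuffle each column n_repeats times and measure the drop in score.
result = permutation_importance(
    estimator, X_test, y_test, n_repeats=10, random_state=1
)

# Map each feature name to its (mean, std) importance.
importances = {
    name: (float(mean), float(std))
    for name, mean, std in zip(
        feature_names, result.importances_mean, result.importances_std
    )
}
print(importances)
```

With a DataFrame as input, `X_test.columns` can replace the hand-written `feature_names` list.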

Check the accuracy of decision tree classifier with Python

I wrote a function that takes a dataset (Excel / pandas) and some values, and then predicts the outcome with a decision tree classifier. I did that with sklearn.
Can you help me check its accuracy? I have looked around the web and on this website, but I couldn't find an answer that works.
I have tried the following, but it does not work:
from sklearn.metrics import accuracy_score
score = accuracy_score(variable_list, result_list)
This is the error that I get:
ValueError: Classification metrics can't handle a mix of continuous-multioutput and multiclass targets
This is the code (I removed the accuracy part):
import pandas as pd
import math
import xlrd
from sklearn.model_selection import train_test_split
from sklearn import tree
def predict_concrete_class(input_data, cement, blast_fur_slug, fly_ash,
                           water, superpl, coarse_aggr, fine_aggr, days):
    data_for_tree = concrete_strenght_class(input_data)
    variable_list = []
    result_list = []
    for index, row in data_for_tree.iterrows():
        variable = row.tolist()
        variable = variable[0:8]
        variable_list.append(variable)
        # positional access to the last column (the class label)
        result_list.append(row.iloc[-1])
    decision_tree = tree.DecisionTreeClassifier()
    decision_tree = decision_tree.fit(variable_list, result_list)
    input_values = [cement, blast_fur_slug, fly_ash, water, superpl, coarse_aggr, fine_aggr, days]
    prediction = decision_tree.predict([input_values])
    info = "Prediction of future concrete class after " + str(days) + " days: " + str(prediction[0])
    return info
print(predict_concrete_class(data, 500, 0, 0, 200, 0, 1125, 613, 3))
Split your data into train and test:
var_train, var_test, res_train, res_test = train_test_split(variable_list, result_list, test_size = 0.3)
Train your decision tree on train set:
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(var_train, res_train)
Test model performance by calculating accuracy on test set:
res_pred = decision_tree.predict(var_test)
score = accuracy_score(res_test, res_pred)
Or you could directly use decision_tree.score:
score = decision_tree.score(var_test, res_test)
The error you are getting is because you are trying to pass variable_list (which is your list of input features) as a parameter in accuracy_score. You are supposed to pass your list of true labels and predicted labels.
You should evaluate on held-out data if you want to check the accuracy of your system.
You have to split your data set into two parts. The first part is used to train the model. Then you run the prediction on the second part and compare the predicted results with the true ones. This way, you check your system on data it has not seen during training.
To split your set, use train_test_split from sklearn.model_selection.
It splits the set randomly.
For k-fold cross-validation, which repeats this over several splits, here is a good read: https://machinelearningmastery.com/k-fold-cross-validation/
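For k-fold cross-validation, scikit-learn's `cross_val_score` does the splitting, fitting and scoring in one call. A minimal sketch on synthetic data (the real feature and label lists would take the place of `X` and `y`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for variable_list / result_list.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 5-fold cross-validation: train on 4 folds, score on the remaining one, 5 times.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5, scoring="accuracy"
)
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

Averaging over folds usually gives a more stable estimate than a single train/test split, at the cost of fitting the model several times.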

ValueError: could not convert string to float: 'n'

Hello, I am following a video course on Udemy. We are trying to apply a random forest classifier. Before we do so, we convert one of the columns in a data frame into a string. The 'Cabin' column holds values such as "4C", but in order to reduce the number of unique values, we want to use only the first character and map it onto a new column, 'Cabin_mapped'.
data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
data['Cabin_mapped'].unique(),0)}
data.loc[:,'Cabin_mapped'] = data.loc[:,'Cabin_mapped'].map(cabin_dict)
data[['Cabin_mapped', 'Cabin']].head()
This part below is simply splitting the data into training and test set. The parameters don't really matter for figuring out the problem.
X_train_less_cat, X_test_less_cat, y_train, y_test = \
train_test_split(data[use_cols].fillna(0), data.Survived,
test_size = 0.3, random_state=0)
I get an error here after the fit, saying I could not convert the string into a float.
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
It seems like I need to convert one of the inputs back into float to use the random forest algorithms. This is despite the error not showing up in the tutorial video. If anyone could help me out, that'd be great.
Here's a fully working example. I've highlighted the bit that you are missing: you need to convert EVERY column to a number, not just "Cabin".
!wget https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv
import pandas as pd
data = pd.read_csv("train.csv")
data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
data['Cabin_mapped'].unique(),0)}
data.loc[:,'Cabin_mapped'] = data.loc[:,'Cabin_mapped'].map(cabin_dict)
data[['Cabin_mapped', 'Cabin']].head()
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn.model_selection import train_test_split
## YOU ARE MISSING THIS BIT, some of your columns are still strings
## they need to be converted to numbers (ints OR floats)
for n, v in data.items():
    if v.dtype == "object":
        data[n] = v.factorize()[0]
## END of the bit you're missing
use_cols = data.drop("Survived",axis=1).columns
X_train_less_cat, X_test_less_cat, y_train, y_test = \
train_test_split(data[use_cols].fillna(0), data.Survived,
test_size = 0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
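To see what the factorize step does, here is a small sketch on a made-up frame with Titanic-like column names. It selects every non-numeric column and replaces it with integer codes; missing values get the code -1.

```python
import pandas as pd

# Tiny made-up frame; only "Age" is numeric to begin with.
data = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Cabin": ["C85", None, "E46"],
    "Age": [22.0, 38.0, 26.0],
})

# Columns a scikit-learn estimator could not convert to float.
non_numeric = data.select_dtypes(exclude="number").columns.tolist()

for name in non_numeric:
    # factorize returns (integer codes, unique values); keep the codes.
    codes, uniques = pd.factorize(data[name])
    data[name] = codes

print(non_numeric)
print(data)
```

The codes are arbitrary labels without a meaningful order, which tree-based models such as a random forest usually tolerate; for linear models, one-hot encoding is often the safer choice.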
