Looping scikit-learn machine learning datasets - python

How can I call datasets.load_<DATASET_NAME>() for every string in the Datasets list while looping, so that I can apply some ML algorithms to one dataset at a time?
I have the following sample program:
from sklearn import datasets
_Datasets_=['iris' , 'breast_cancer' , 'wine' , 'diabetes', 'linnerud' , 'boston' ]
for Dataset_name in _Datasets_:
    # Load the dataset
    Dataset = datasets.load_'DATASET_NAME'()

You could make a dictionary mapping names to loader functions, then call the right one as you iterate:
Datasets = {'data1': load_data_1}  # hypothetical loader function
Data = Datasets['data1']()
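For the datasets in the question, a minimal sketch of that mapping with the real scikit-learn loaders (note: load_boston was removed in recent scikit-learn releases, so it is left out here):
from sklearn import datasets
# Map each name to its loader function (the function itself, not a string)
loaders = {
    'iris': datasets.load_iris,
    'breast_cancer': datasets.load_breast_cancer,
    'wine': datasets.load_wine,
    'diabetes': datasets.load_diabetes,
    'linnerud': datasets.load_linnerud,
}
for name, loader in loaders.items():
    dataset = loader()  # call this iteration's loader
    print(name, dataset.data.shape)
    # ... apply your ML algorithms to `dataset` here
Alternatively, getattr(datasets, 'load_' + name)() builds the call directly from the string, with no dictionary needed.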

Related

standardizing data column-wise before using keras models

I'm working with a large dataset whose data I want to standardize to use with a CNN.
Does Keras have a quick utility to standardize a block of numbers column-wise that you can use inside a Sequential model? I'm asking because I expect the data to eventually be used on-line, so ideally this standardization could be applied to incoming data as well, i.e. a trailing moving average of the mean and std used to normalize the incoming data.
import numpy as np
import pandas as pd
np.random.seed(42)
col_names = ['Column' + str(x+1) for x in range(5)]
training_data = pd.DataFrame(np.random.randint(1,10 **6, 50).reshape(-1,5), columns = col_names)
I am not sure about the on-line part, but sklearn's StandardScaler() should do the right thing: it standardizes each column to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
# StandardScaler works column-wise by default, so no transposing is needed
training_data[:] = StandardScaler().fit_transform(training_data)
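For the in-model part, newer Keras releases ship a preprocessing layer that can live inside a Sequential model; a sketch assuming TensorFlow 2.x (note the layer learns fixed per-column statistics via adapt(), not a trailing moving average):
import tensorflow as tf
# Normalization learns per-column mean and variance from the data passed
# to adapt(), then standardizes inputs inside the model itself.
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(training_data.to_numpy().astype("float32"))
model = tf.keras.Sequential([
    norm,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
Here training_data is the DataFrame from the question.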

Generate Test data using TfIdfVectorizer

I have separated my data into train and test parts. My data table has a 'text' column. Consider that I have ten other columns representing numerical features. I have used TfidfVectorizer on the training data to generate the term matrix and combined it with the numerical features to create the training dataframe.
tfidf_vectorizer=TfidfVectorizer(use_idf=True, max_features=5000, max_df=0.95)
tfidf_vectorizer_train = tfidf_vectorizer.fit_transform(X_train['text'].values)
df1_tfidf_train = pd.DataFrame(tfidf_vectorizer_train.toarray(), columns=tfidf_vectorizer.get_feature_names())
df2_train = df_main_ques.iloc[train_index][traffic_metrics]  # to collect numerical features
df_combined_train = pd.concat([df1_tfidf_train, df2_train], axis=1)
To calculate tf-idf scores for the test part, I need to reuse the vectorizer fitted on the training data. I am not sure how to generate the test data part.
Related post:
[1] Append tfidf to pandas dataframe: discusses only the training dataset part.
[2] How does TfidfVectorizer compute scores on test data: discusses the test data part, but it is not clear how to generate a test dataframe that contains both terms and numerical features.
You can use the transform method of the trained vectorizer to transform your test data; that reuses the vocabulary and idf weights learned from the training set:
tfidf_vectorizer_test = tfidf_vectorizer.transform(X_test['text'].values)
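To get a test dataframe with both terms and numerical features, you can mirror the training construction; a sketch reusing the question's objects (test_index is the hypothetical test counterpart of the question's train_index):
tfidf_vectorizer_test = tfidf_vectorizer.transform(X_test['text'].values)
df1_tfidf_test = pd.DataFrame(tfidf_vectorizer_test.toarray(),
                              columns=tfidf_vectorizer.get_feature_names())
df2_test = df_main_ques.iloc[test_index][traffic_metrics]  # numerical features
# reset_index so the row-wise concat lines up with the tf-idf frame
df_combined_test = pd.concat([df1_tfidf_test,
                              df2_test.reset_index(drop=True)], axis=1)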

I would like to use a feature set (vector) for my data in Python for my machine learning algorithm. How can I do it?

I have data in the following form
Class Feature set list
classlabel1 - [size,time] example:[6780.3,350.00]
classlabel2 - [size,time]
classlabel3 - [size,time]
classlabel4 - [size,time]
How do I save this data in excel sheet and how can I train the model using this feature set? Currently I am working on SVM classifier.
I have tried saving the feature set list in a dataframe and saving this dataframe to a csv file. But the size and time are getting split into two different columns.
The dataframe is getting saved in csv file in the following way:
col0   col1         col2
62309  396.5099154  label1
I would like to train and test on the feature vector [size,time] combined. Is it possible and is this a right way? If it is possible, how can I do it?
Firstly responding to your question:
I would like to train and test on the feature vector [size,time]
combined. Is it possible and is this a right way? If it is possible,
how can I do it?
Combining the two is not the right thing to do, because the two features are on different scales (if they are actually what their names suggest), and collapsing them into one value would lose the information each provides; they are two independent features for any supervised ML algorithm. So I would suggest treating these two features separately, as two columns, rather than combining them into one.
Now let's move onto to next section:
How do I save this data in excel sheet and how can I train the model
using this feature set? Currently I am working on SVM classifier.
Storing data: In my opinion you can store the data in whichever format you want, but I would prefer CSV, as it is convenient and the file loads quickly.
sample_data.csv
size,time,class_label
100,150,label1
200,250,label2
240,180,label1
Below is the code for reading the data from the CSV and training an SVM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# loading data
data = pd.read_csv("sample_data.csv")
# Dividing into dependent and independent features
Y = data.class_label.values
X = data.drop("class_label", axis=1).values
# encode the class column values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(Y)
# split training and testing data
x_train, x_test, y_train, y_test = train_test_split(X, label_encoded_Y,
                                                    train_size=0.8,
                                                    test_size=0.2)
# Now use whichever training algorithm you want
clf = SVC(gamma='auto')
clf.fit(x_train, y_train)
# Using the predictor
y_pred = clf.predict(x_test)
print(accuracy_score(y_test, y_pred))  # evaluate on the held-out split
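Because the two columns are on very different scales, it may also help the SVM to standardize them before fitting; a minimal sketch using a Pipeline (an addition beyond the original answer):
from sklearn.pipeline import make_pipeline
# Scale each column to zero mean / unit variance, then fit the SVM
clf = make_pipeline(preprocessing.StandardScaler(), SVC(gamma='auto'))
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)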
Since size and time are different features, you should separate them into two different columns so your model can assign a separate weight to each, i.e.
# data.csv
size    time    label
6780.3  350.00  classLabel1
...
If you want to transform the data you have into the format above, you can use pandas.read_excel to load it and ast to turn each string list into a Python list object.
import pandas as pd
import ast
df = pd.read_excel("data.xlsx")
pairs = [ast.literal_eval(x) for x in df["Feature set list"]]  # parse the "[size, time]" strings
size = [p[0] for p in pairs]
time = [p[1] for p in pairs]
label = df["Class"]
new_df = pd.DataFrame({"size":size, "time":time, "label":label})
# This will result in the DataFrame below.
# size time label
# 6780.3 350.0 classlabel1
# Save DataFrame to csv; index=False keeps the row index out of the file
new_df.to_csv("data_fix.csv", index=False)
# Use it
x = new_df.drop("label", axis=1)
y = new_df.label
# Further data preparation, such as split the dataset
# into train and test set, etc.
...
Hope this helps

Bag of Words formation for articles

I am trying to play around with the 20 NewsGroups dataset in sklearn. I have used the following code to import all the training and testing data into 2 utils.Bunch structures:
from sklearn.datasets import fetch_20newsgroups
# Import Newsgroup data
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test= fetch_20newsgroups(subset='test')
My end goal is to use a naive Bayes classifier on the dataset to learn how it works and see how accurate I can make it. I'm trying to prep the dataset for the classifier by representing it with the 'bag-of-words' representation.
By my research, I should be able to accomplish this with the sklearn.feature_extraction.text.HashingVectorizer
However, I'm unclear as to how to implement this seeing as the two data structures I have are unusual and I'm not sure how to pull the data out of them.
After loading the data using your code, newsgroups_train is a dictionary with the following keys:
In [3]: newsgroups_train.keys()
Out[3]: dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])
You can simply get the data via:
train_data = newsgroups_train["data"]
test_data = newsgroups_test["data"]
And it is stored as a list of strings. Then you can simply apply the HashingVectorizer to that data.
You will get sparse matrices for your train and test data via .fit() and then .transform(). For example:
from sklearn.feature_extraction.text import HashingVectorizer
h = HashingVectorizer()
h.fit(train_data)
h_train = h.transform(train_data)
h_test = h.transform(test_data)
Then, h_train and h_test will be sparse matrices.
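As a sketch of the stated end goal, you could then train the naive Bayes classifier on these hashed features; alternate_sign=False keeps the values non-negative, which MultinomialNB requires:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
h = HashingVectorizer(alternate_sign=False)  # HashingVectorizer is stateless
X_train = h.transform(newsgroups_train["data"])
X_test = h.transform(newsgroups_test["data"])
clf = MultinomialNB()
clf.fit(X_train, newsgroups_train["target"])
pred = clf.predict(X_test)
print(accuracy_score(newsgroups_test["target"], pred))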

Ignore a column while building a model with SKLearn

With R, one can ignore a variable (column) while building a model with the following syntax:
model = lm(dependant.variable ~ . - ignored.variable, data=my.training.set)
It's very handy when your data set contains indexes or ID.
How would you do that with SKlearn in python, assuming your data are Pandas data frames ?
So this is from my own code I used to do some prediction on StackOverflow last year:
from __future__ import division
from pandas import *
from sklearn import cross_validation
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
basic_feature_names = ['BodyLength',
                       'NumTags',
                       'OwnerUndeletedAnswerCountAtPostTime',
                       'ReputationAtPostCreation',
                       'TitleLength',
                       'UserAge']
fea = # extract the features - removed for brevity
# construct our classifier
clf = GradientBoostingClassifier(n_estimators=num_estimators, random_state=0)
# now fit
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)
# now
priv_fea = # this was my test dataset
# now calculate the predicted classes
pred = clf.predict(priv_fea[basic_feature_names])
So if we wanted a subset of the features for classification I could have done this:
# want to train using fewer features so remove 'BodyLength'
basic_feature_names.remove('BodyLength')
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)
So the idea here is that a list can be used to select a subset of the columns of a pandas DataFrame; we can construct a new list, or remove a value from an existing one, and use it for the selection.
I'm not sure how you could do this as easily with numpy arrays, as indexing is done differently there.
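The closest pandas equivalent of R's ~ . - ignored.variable is simply to drop the unwanted column before fitting; a minimal sketch with a hypothetical frame:
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
# Hypothetical data with an 'id' column the model should ignore
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "x1": [0.5, 1.2, 3.1, 0.7],
                   "x2": [10, 20, 15, 30],
                   "target": [0, 1, 0, 1]})
X = df.drop(columns=["id", "target"])  # every column except id and target
y = df["target"]
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y)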
