Bag of Words formation for articles - python

I am trying to play around with the 20 NewsGroups dataset in sklearn. I have used the following code to import all the training and testing data into 2 utils.Bunch structures:
from sklearn.datasets import fetch_20newsgroups
# Import Newsgroup data
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test= fetch_20newsgroups(subset='test')
My end goal is to use a naive bayes classifier on the dataset to learn how it works and see how accurate I can make it. I'm trying to prep the dataset for the classifier by representing it with the 'bag-of-words' representation.
By my research, I should be able to accomplish this with the sklearn.feature_extraction.text.HashingVectorizer
However, I'm unclear as to how to implement this seeing as the two data structures I have are unusual and I'm not sure how to pull the data out of them.

After loading the data using your code, newsgroups_train is a dictionary with the following keys:
In [3]: newsgroups_train.keys()
Out[3]: dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])
You can simply get the data via:
train_data = newsgroups_train["data"]
test_data = newsgroups_test["data"]
It is stored as a list of strings, so you can apply HashingVectorizer to it directly.
HashingVectorizer is stateless, so there is nothing to learn in .fit(); calling .transform() on the train and test data gives you a sparse matrix for each. For example:
from sklearn.feature_extraction.text import HashingVectorizer
h = HashingVectorizer()
h_train = h.transform(train_data)
h_test = h.transform(test_data)
Then, h_train and h_test will be sparse matrices.
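Since the end goal is a naive Bayes classifier, here is a minimal sketch of that next step. It uses a tiny made-up corpus instead of the newsgroups data (the documents and labels are assumptions for illustration only). Note that HashingVectorizer produces signed values by default, which MultinomialNB rejects, so the sketch passes alternate_sign=False:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (hypothetical): label 0 = space, label 1 = hockey
train_docs = [
    "the rocket launch was delayed",
    "nasa plans a new orbit mission",
    "the goalie made a great save",
    "the team won the hockey game",
]
train_labels = [0, 0, 1, 1]
test_docs = ["hockey team game", "rocket orbit mission"]

# alternate_sign=False keeps every feature non-negative,
# which MultinomialNB requires
vectorizer = HashingVectorizer(alternate_sign=False)
X_train = vectorizer.transform(train_docs)
X_test = vectorizer.transform(test_docs)

clf = MultinomialNB().fit(X_train, train_labels)
predicted = clf.predict(X_test)
print(predicted)
```

With the real dataset you would pass newsgroups_train["data"] and newsgroups_train["target"] in place of the toy lists.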


standardizing data column-wise before using keras models

I'm working with a large dataset whose data I want to standardize to use with a CNN.
Does keras have a quick utility to standardize a block of numbers column-wise that you can use inside a Sequential model? I'm asking because I expect the data to eventually be used online, so ideally this standardization could also be applied to incoming data, i.e. a trailing moving average of mean and std to normalize the incoming data.
import numpy as np
import pandas as pd
col_names = ['Column' + str(x+1) for x in range(5)]
training_data = pd.DataFrame(np.random.randint(1,10 **6, 50).reshape(-1,5), columns = col_names)
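I am not sure about the online case, but sklearn's StandardScaler should do the right thing for a fixed dataset. It standardizes each column (feature) to zero mean and unit variance, so there is no need to transpose the data:
from sklearn.preprocessing import StandardScaler
scaled = StandardScaler().fit_transform(training_data)
training_data = pd.DataFrame(scaled, columns=training_data.columns, index=training_data.index)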
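For incoming data, one option is StandardScaler.partial_fit, which updates the running mean and variance batch by batch. Here is a rough sketch, assuming the data arrives in batches of rows (the batch size and random data are made up). Keep in mind this is a cumulative estimate over everything seen so far, not a trailing moving average:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three hypothetical incoming batches, 10 rows x 5 columns each
batches = [rng.integers(1, 10**6, size=(10, 5)).astype(float) for _ in range(3)]

scaler = StandardScaler()
for batch in batches:
    # update the running column-wise mean/variance with the new rows
    scaler.partial_fit(batch)
    # standardize the batch with the statistics seen so far
    scaled = scaler.transform(batch)

# after all batches, the statistics match those of the full data
all_data = np.vstack(batches)
print(np.allclose(scaler.mean_, all_data.mean(axis=0)))
```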

Custom function in sklearn2pmml PMMLPipeline

I am trying to create a machine learning model to suggest treatment for stroke patients based on their responses to various questionnaires and assessments. For instance, the patient will be asked to rate the stiffness of the fingers, elbow,  shoulder, and pectoral muscles (each on a scale of 0 to 100) or answer 14 questions related to mental health (each on a scale of 0 to 3).
I would like to create an sklearn pipeline roughly as follows:
1.       The patient responses are aggregated. For example, the four stiffness responses should be averaged to create a single “stiffness” value, while the fourteen mental health questions should be summed up to create a single “mental health” value. The “stiffness” and “mental health” values would then be features in the model.
2.       Once the features have been aggregated in this way, a decision tree classifier is trained on labeled data to assign each patient to the appropriate therapy.
3.       The trained pipeline is exported as a pmml file for production
I assume this must be doable with some code like this:
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from sklearn.tree import DecisionTreeClassifier
from somewhere import Something
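pipeline = PMMLPipeline([
    ("input_aggregation", Something()),
    ("classifier", DecisionTreeClassifier())
])
# patient_responses is a placeholder name for the table of questionnaire answers
pipeline.fit(patient_responses, therapy_labels)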
sklearn2pmml(pipeline, "ClassificationPipeline.pmml", with_repr = True)
I’ve been poking around the documentation and I can figure out how to apply PCA to a group of columns, but not how to do something as straightforward as collapsing a group of columns by summing or averaging. Does anyone have any hints about how I could do this?
Thanks for your help.
Sample code:
from sklearn.tree import DecisionTreeClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import Aggregator
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (["stiffness_1", "stiffness_2", "stiffness_3", "stiffness_4"], Aggregator(function = "mean")),
        # list all 14 mental health column names here
        (["mental_health_1", "mental_health_2", .., "mental_health_14"], Aggregator(function = "sum"))
    ])),
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(X, y)
Explanation - you can use sklearn_pandas.DataFrameMapper to define a column group and apply a transformation to it. For the conversion to PMML to work, you need to provide a transformer class, not a plain function. Perhaps all your transformation needs are handled by the sklearn2pmml.preprocessing.Aggregator transformer class. If not, you can always define your own.
While @makis has provided a 100% valid Python example, it wouldn't work in the Python-to-PMML case, because the converter cannot parse/handle custom Python functions.
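To sanity-check what the mapper should produce, the same aggregation can be sketched in plain pandas. This is only an illustration of the expected feature values, not something that can be exported to PMML; the column names and numbers below are made up, and only three mental health columns are used to keep it short:

```python
import pandas as pd

# Hypothetical questionnaire answers for two patients
responses = pd.DataFrame({
    "stiffness_1": [10, 80],
    "stiffness_2": [20, 60],
    "stiffness_3": [30, 40],
    "stiffness_4": [40, 20],
    "mental_health_1": [0, 3],
    "mental_health_2": [1, 2],
    "mental_health_3": [2, 1],
})

stiffness_cols = [c for c in responses.columns if c.startswith("stiffness_")]
mental_cols = [c for c in responses.columns if c.startswith("mental_health_")]

# one row per patient: mean of the stiffness answers, sum of the mental health answers
features = pd.DataFrame({
    "stiffness": responses[stiffness_cols].mean(axis=1),
    "mental_health": responses[mental_cols].sum(axis=1),
})
print(features)
```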
You just need to define a custom function and use it in the Pipeline.
Here is the full code:
from sklearn.preprocessing import FunctionTransformer
import numpy as np
from sklearn2pmml import make_pmml_pipeline
# fake data with 7 columns
X = np.random.rand(10,7)
n_rows = X.shape[0]
def custom_function(X):
    # average the first 4 columns and sum the remaining ones, row by row
    return np.concatenate([np.mean(X[:, 0:4], axis=1).reshape(-1, 1), np.sum(X[:, 4:], axis=1).reshape(-1, 1)], axis=1)
# Now, if you run `custom_function(X)`, it should return an array of shape (10, 2).
pipeline = make_pmml_pipeline(

I would like to consider a feature set(vector) for a data in python for my machine learning algorithm. How can I do it?

I have data in the following form
Class Feature set list
classlabel1 - [size,time] example:[6780.3,350.00]
classlabel2 - [size,time]
classlabel3 - [size,time]
classlabel4 - [size,time]
How do I save this data in excel sheet and how can I train the model using this feature set? Currently I am working on SVM classifier.
I have tried saving the feature set list in a dataframe and saving this dataframe to a csv file. But the size and time are getting split into two different columns.
The dataframe is getting saved in csv file in the following way:
col 0 col1 col2
62309 396.5099154 label1
I would like to train and test on the feature vector [size,time] combined. Is it possible and is this a right way? If it is possible, how can I do it?
Firstly responding to your question:
I would like to train and test on the feature vector [size,time]
combined. Is it possible and is this a right way? If it is possible,
how can I do it?
Combining the two is not the right thing to do, because they are on two different scales (if they are actually what their names suggest), and combining them loses the information each one provides. For any supervised ML algorithm they are two independent features, so I would suggest treating them separately rather than merging them into one.
Now let's move on to the next part:
How do I save this data in excel sheet and how can I train the model
using this feature set? Currently I am working on SVM classifier.
Storing data : In my opinion, you can store data in whichever format you want but I would prefer storing data in csv format as it is convenient and loading of data file is faster.
Below is the code for reading the data from csv and training SVM :
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# loading data (the csv is expected to hold the features plus a class_label_col column)
data = pd.read_csv("sample_data.csv")
# Dividing into dependent and independent features
Y = data.class_label_col.values
X = data.drop("class_label_col", axis=1).values
# encode the class column values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(Y))
# split training and testing data
x_train, x_test, y_train, y_test = train_test_split(X, label_encoded_Y)
# Now use whichever training algorithm you want
clf = SVC(gamma='auto').fit(x_train, y_train)
# Using the predictor
y_pred = clf.predict(x_test)
# accuracy on the held-out data
print(accuracy_score(y_test, y_pred))
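Because size and time are on very different scales, it usually helps to standardize them before an SVM. Here is a small sketch of that idea, assuming a handful of made-up [size, time] samples (the numbers and labels are illustrative only):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical samples: each row is [size, time]
X = [
    [6780.3, 350.0],
    [6900.0, 360.0],
    [7100.5, 340.0],
    [62309.0, 396.5],
    [61000.0, 410.0],
    [63500.0, 400.0],
]
y = ["label1", "label1", "label1", "label2", "label2", "label2"]

# scale each column to zero mean / unit variance, then fit the SVM
model = make_pipeline(StandardScaler(), SVC())
model.fit(X, y)

new_samples = [[6800.0, 355.0], [62000.0, 400.0]]
predictions = model.predict(new_samples)
print(predictions)
```

Putting the scaler inside the pipeline means the same scaling is applied automatically at prediction time.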
Since size and time are different features, you should separate them into 2 different columns so your model can assign a separate weight to each of them, i.e.
# data.csv
size time label
6780.3 350.00 classLabel1
If you want to transform the data you have into the format above you could use pandas.read_excel and use ast to transform the string list into python list object.
import pandas as pd
import ast
df = pd.read_excel("data.xlsx")
size_time = [(ast.literal_eval(x)[0], ast.literal_eval(x)[1]) for x in df["Feature set list"]]
size = [x[0] for x in size_time]
time = [x[1] for x in size_time]
label = df["Class"]
new_df = pd.DataFrame({"size":size, "time":time, "label":label})
# This will result in the DataFrame below.
# size time label
# 6780.3 350.0 classlabel1
# Save DataFrame to csv
new_df.to_csv("data.csv", index=False)
# Use it
x = new_df.drop("label", axis=1)
y = new_df.label
# Further data preparation, such as split the dataset
# into train and test set, etc.
Hope this helps

How to match the name column with result after classification scikit-learn

This is an example of my data:
When I train on my data, I use only columns 2 -> 7 as input and class as output. But when I test the model after it has been trained and saved, I need to know which files belong to which class, for example how to know that d.txt is class 1.
I use pandas to import the data from .csv files, with the train set and test set in 2 different csv files. In the training phase, I use columns 2-7 as input and the class column as target; these columns are numerical. The filename column is just text. In the test phase, I need to get the filename together with the predicted class, but I don't know how to do that.
P/s: I used MLP,SVM, NB as classifier.
Assuming your data is in .csv format:
You can output the corresponding filename to a predicted class using:
features=[1,0,1,0,0,1] #input
output=clf.predict([features])[0] #predicted class
print(df[df["class"]==output]["filename"]) #corresponding filename
Note that in your example the number of features is greater than the number of examples, so the classifier may perform poorly.
Hopefully you just gave a sample of your data, in which case you are likely to be fine. Just be careful about which classifier you use.
Full code:
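import numpy as np
import pandas as pd
from sklearn import svm
df = pd.read_csv("train.csv")  # placeholder file name for the training csv
X = df.iloc[:, 1:7].values  # the six feature columns
y = df.iloc[:, 7].values    # the class column
clf = svm.SVC()  # using SVM as classifier
clf.fit(X, y)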
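If the goal is to get a predicted class for every file in the test csv, another option is to predict on all test rows at once and attach the result next to the filename column. Here is a rough sketch, assuming toy train/test frames with a filename column and two feature columns (all names and values below are made up):

```python
import pandas as pd
from sklearn.svm import SVC

# Hypothetical train and test tables
train_df = pd.DataFrame({
    "filename": ["a.txt", "b.txt", "c.txt", "g.txt"],
    "f1": [1, 2, 9, 10],
    "f2": [2, 1, 10, 8],
    "class": [0, 0, 1, 1],
})
test_df = pd.DataFrame({
    "filename": ["d.txt", "e.txt"],
    "f1": [9.5, 1.5],
    "f2": [9.0, 1.5],
})

feature_cols = ["f1", "f2"]
clf = SVC(kernel="linear").fit(train_df[feature_cols], train_df["class"])

# rows keep their order, so each prediction lines up with its filename
results = test_df[["filename"]].assign(
    predicted_class=clf.predict(test_df[feature_cols])
)
print(results)
```

The filename column is never passed to the classifier; it is only carried along so each prediction can be matched back to its file.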

ValueError using sklearn and pandas for decision trees?

I'm new to scikit learn and I just saw the documentation and a couple of other stackoverflow posts to build a decision tree.
I have a CSV data set with 16 attributes and 1 target label. How should I pass it into the decision tree classifier?
My current code looks like this:
import pandas
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import tree
data = pandas.read_csv("yelp_atlanta_data_labelled.csv", sep=',')
vect = TfidfVectorizer()
X = vect.fit_transform(data)
Y = data['go']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
When I run the code it gives me the following error:
ValueError: Number of labels=501 does not match number of samples=17
To give some context, my data set has 501 data points and 17 total columns. The go column is the target column with yes/no labels.
The problem is that TfidfVectorizer cannot operate on a dataframe directly. It expects a sequence of strings. Iterating over a dataframe yields its column names, so the vectorizer treats each of the 17 column names as one document, which is why X ends up with 17 samples while Y has 501 labels.
Try instead using:
X = vect.fit_transform(data['my_column_name'])
You may want to preprocess the dataframe to concatenate different columns prior to calling vect.fit_transform.
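A minimal sketch of that concatenation step, assuming a small made-up dataframe with two text columns (title and review are hypothetical names; only go comes from the question):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the csv data
data = pd.DataFrame({
    "title": ["great food", "slow service", "lovely place", "rude staff"],
    "review": ["would visit again", "never coming back",
               "friendly and quick", "cold and bland meal"],
    "go": ["yes", "no", "yes", "no"],
})

# join the text columns into one string per row
text = data["title"].astype(str) + " " + data["review"].astype(str)

vect = TfidfVectorizer()
X = vect.fit_transform(text)  # one row per data point
Y = data["go"]

clf = DecisionTreeClassifier(random_state=0).fit(X, Y)
print(X.shape[0], len(Y))
```

Now the number of rows in X matches the number of labels, so the ValueError goes away.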

