Machine Learning Feature Columns - python

I'm trying to wrap my head around the whole feature column thing. I can build the Iris training model etc. with Keras, but I want to understand a little more.
What I'm having trouble with is the "Feature_Column" aspect. For example, let's say my data set looks like:
Length Width Weight Label
1.5 2 .07 1
1 5 .09 3
3 6 .19 4
I don't get a Feature "Column", as the "Row" is the example, isn't it? So when I read about a Feature Column, what do they mean by that?
From what I can see it's a Feature Row no? [1.5, 2, .07] = 1 <= that's a Row of Features with a Label 1
I don't know I'm just confused :). Anyone willing to explain what a Feature Column is when they Transform the data for a feature column?
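To make the row/column distinction concrete, here is a minimal pandas sketch using the toy table above: each row of the frame is one example, while a "feature column" is one feature taken across all examples, which is what a feature-column definition describes.

```python
import pandas as pd

# The toy dataset from above: each ROW is one training example
df = pd.DataFrame({
    'Length': [1.5, 1.0, 3.0],
    'Width':  [2.0, 5.0, 6.0],
    'Weight': [0.07, 0.09, 0.19],
    'Label':  [1, 3, 4],
})

example = df.iloc[0]          # one example: Length=1.5, Width=2, Weight=0.07, Label=1
length_column = df['Length']  # one *feature column*: that feature across all examples

print(list(length_column))  # [1.5, 1.0, 3.0]
```

So `[1.5, 2, .07]` is indeed a row of features for one example; a feature column is the vertical slice (`Length` for every example), and a feature-column definition tells the framework how to transform that slice into model input.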


How can I convert a Pandas dataframe to a PyTorch tensor?

I have the following data frame (first 5 rows shown):
text                                               verified  review_score  product_category  polarity_score  sentiment_label
work described nice combo tool nice combo tool     1         4             0                 0.6808          1
used using straight bridge make third string t...  1         5             0                 0.9468          1
first hesitant game seemed surface capcom chan...  0         5             1                 0.9907          1
great reverb pedal great addition pedal board ...  1         5             0                 0.9022          1
five star good deal must buy                       1         5             1                 0.4404          1
I want to convert this data frame to a PyTorch dataset/tensor to use with a neural network model. The text has already been tokenised and preprocessed, and I would like to use all columns in the data frame. Also, after doing so, I would like to acquire the number of classes and vocabulary size, as well as split the data to train and test components. I have done some programming already, but it has not gone to plan as I cannot acquire the number of classes or the vocabulary size.
Here is some of the code I have done but I am unsure if this is the correct method:
import torch
from torchtext.legacy.data import Field, LabelField  # torchtext 0.9-0.11; older versions use torchtext.data

TEXT = Field(tokenize=None)  # text has already been tokenised
LABEL = LabelField(dtype=torch.float)
TEXT.build_vocab(train, max_size=25000, vectors='glove.6B.100d')
LABEL.build_vocab(train)
fields = {'review_score': LABEL, 'product_category': LABEL, 'sentiment_label': LABEL, 'text': TEXT}
train_dataset = DataFrameDataset(train, fields)  # DataFrameDataset is a custom/third-party helper
test_dataset = DataFrameDataset(test, fields)
If anyone can help, then that will be much appreciated.
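A torchtext-free sketch of the conversion is below: it builds tensors directly from the frame, derives the number of classes and a whitespace-token vocabulary, and splits into train/test components. The frame here is a small synthetic stand-in with the same column names as above, and the 80/20 split ratio is an arbitrary assumption.

```python
import pandas as pd
import torch

# synthetic stand-in with the same columns as the frame above
df = pd.DataFrame({
    'text': ['work described nice combo tool', 'five star good deal must buy'] * 5,
    'verified': [1, 1] * 5,
    'review_score': [4, 5] * 5,
    'product_category': [0, 1] * 5,
    'polarity_score': [0.6808, 0.4404] * 5,
    'sentiment_label': [1, 0] * 5,
})

# vocabulary from the (already tokenised) text, mapped to integer ids
vocab = {tok: i for i, tok in enumerate(sorted({t for s in df['text'] for t in s.split()}))}
vocab_size = len(vocab)

# numeric feature columns and labels as tensors
features = torch.tensor(
    df[['verified', 'review_score', 'product_category', 'polarity_score']].values,
    dtype=torch.float32)
labels = torch.tensor(df['sentiment_label'].values, dtype=torch.long)
num_classes = int(labels.unique().numel())

# simple 80/20 train/test split
n_train = int(0.8 * len(df))
train_x, test_x = features[:n_train], features[n_train:]
train_y, test_y = labels[:n_train], labels[n_train:]
```

The text column would still need to be converted to id sequences via `vocab` (and padded) before it can join the tensor, but `num_classes` and `vocab_size` fall out directly this way.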

How to handle categorical data for preprocessing in Machine Learning

This may be a basic question. I have categorical data and I want to feed it into my machine learning model, but my ML model accepts only numerical data. What is the correct way to convert this categorical data into numerical data?
My Sample DF:
T-size Gender Label
0 L M 1
1 L M 1
2 M F 1
3 S F 0
4 M M 1
5 L M 0
6 S F 1
7 S F 0
8 M M 1
I know the following code converts my categorical data into numerical data:
Type-1:
df['T-size'] = df['T-size'].cat.codes
The above line simply converts the categories to codes from 0 to N-1; it doesn't preserve any relationship between them.
For this example I know that S < M < L. What should I do when I want to convert data like this?
Type-2:
In this type there is no ordinal relationship between M and F. But I can tell that M has a higher probability of being 1 than F, i.e. (number of samples that are 1) / (total number of samples):
for Male,
(4/5)
for Female,
(2/4)
We know that
(4/5) > (2/4)
How should I replace for this kind of column?
Can I replace M with (4/5) and F with (2/4) for this problem?
What is the proper way of dealing with this kind of column?
Help me to understand this better.
There are many ways to encode categorical data, some of them depend on exactly what you plan to do with it. For example, one-hot-encoding which is easily the most popular choice is an extremely poor choice if you're planning on using a decision tree / random forest / GBM.
Regarding your t-shirts above, you can give a pandas categorical type an order:
df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True)).
If you had set up your t-shirt categorical like that, then your .cat.codes method would work perfectly. It also means you can easily use scikit-learn's LabelEncoder, which fits neatly into pipelines.
Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to do the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from scikit-learn-contrib's category_encoders, but again, be sure to use it within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into your training set.
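A minimal sketch of the ordered-categorical approach, using the T-size values from the sample DF:

```python
import pandas as pd

df = pd.DataFrame({'T-size': ['L', 'L', 'M', 'S', 'M', 'L', 'S', 'S', 'M']})

# declare the order S < M < L explicitly
size_type = pd.api.types.CategoricalDtype(['S', 'M', 'L'], ordered=True)
df['T-size'] = df['T-size'].astype(size_type)

# now .cat.codes respects the declared order: S -> 0, M -> 1, L -> 2
df['size_code'] = df['T-size'].cat.codes
print(df['size_code'].tolist())  # [2, 2, 1, 0, 1, 2, 0, 0, 1]
```

Because the dtype is ordered, comparisons like `df['T-size'] > 'S'` also work as expected.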
If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be :
size_mapping = {"S": 1, "M":2 , "L":3}
#mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)
This allows you to treat the input as numerical data while preserving the hierarchy.
And as for gender, you are conflating the class distribution with the preprocessing. If you feed the observed distribution in as an input, you will introduce a bias into your data. You must treat male and female as two distinct categories regardless of their observed distribution: map them to two different numbers, but without encoding the proportions.
df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})
For a more detailed explanation and a coverage of more specificities than your question, I suggest reading this article regarding categorical data in Machine Learning
For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:
d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)
Output:
T-size Gender Label
0 2 M 1
1 2 M 1
2 1 F 1
3 0 F 0
4 1 M 1
5 2 M 0
6 0 F 1
7 0 F 0
8 1 M 1
For the second question, you can use the same method, but I would just map males and females to 0 and 1. If you only need the category and don't perform arithmetic on the values, one value is as good as another.
It might be overkill for the M/F example, since it's binary - but if you are ever concerned about mapping a categorical into a numerical form, then consider one hot encoding. It basically stretches your single column containing n categories, into n binary columns.
So a dataset of:
Gender
M
F
M
M
F
Would become
Gender_M Gender_F
1 0
0 1
1 0
1 0
0 1
This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
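The expansion above can be reproduced with pandas' get_dummies (note pandas orders the new columns alphabetically, so Gender_F comes before Gender_M here):

```python
import pandas as pd

gender = pd.Series(['M', 'F', 'M', 'M', 'F'], name='Gender')

# one column per category, with a 1 marking membership
one_hot = pd.get_dummies(gender, prefix='Gender').astype(int)
print(one_hot)
```

scikit-learn's OneHotEncoder does the same job and has the advantage of fitting into a Pipeline.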

What is the best approach for learning the relationship of two dataframes affecting values of a given list?

I have the following dataset of pandas dataframes:
A:
col1 col2
0 5 3
1 5 4
B:
col1 col2
0 6 4
1 2 4
my_list:
[24.5, 65.4]
Assume I have a dataset of thirty (A, B, my_list) triples with different sets of values. Changing one or more values in either or both of the dataframes A and B affects the values in my_list.
Given that I want to achieve [65.0, 46.21] in my_list, I need to find out what values need to be present in the A and B dataframes.
I am looking for suggestions on the best solution to this problem. A simple ML algorithm? A deep learning model? If so, which one should I be using?
Please note that my dataset is as small as 30 samples, and I am looking to get as close as possible to the desired my_list values.
Any suggestions will be highly appreciated.
Sounds like you need a regression algorithm.
Let me translate your task: given 8 positional inputs, find a general formula that produces the output closest to the desired output. This is a typical regression problem, and you have many powerful tools you can use.
Given that your dataset is small, you'd better start with simple algorithms such as linear regression, then move to more complicated ones such as support vector machines if necessary.
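A hedged sketch of that suggestion with scikit-learn: flatten each (A, B) pair into 8 inputs and fit a multi-output linear regression against the 2-element my_list. The data here is synthetic, purely to show the shapes involved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 30 samples: each row is the 8 flattened values of one (A, B) pair (two 2x2 frames)
X = rng.random((30, 8))
# each target is the corresponding 2-element my_list
y = rng.random((30, 2))

model = LinearRegression().fit(X, y)

# predict my_list for a new (A, B) pair
new_pair = rng.random((1, 8))
pred = model.predict(new_pair)
print(pred.shape)  # (1, 2)
```

Note this learns the forward direction (A, B → my_list); finding the A, B that produce a desired my_list is the inverse problem, which with a linear model can be attacked by solving the fitted linear system, or by searching over inputs.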

Conditional Forecasting with VAR model in Python

I am using a VAR model to forecast a multivariate time series with lag 2. I have three features and would like to forecast several timestamps forward. Instead of forecasting all three features, I actually know the values of two of them, and would like to forecast only one feature.
If I wanted to forecast all three features 5 timestamps ahead, I could do that as follows (this is a toy example):
import pandas as pd
from statsmodels.tsa.api import VAR

data = pd.DataFrame({'Date': ['1959-06-01', '1959-06-02', '1959-06-03', '1959-06-04'],
                     'a': [1, 2, 3, 5], 'b': [2, 3, 5, 8], 'c': [3, 4, 7, 11]})
data.set_index('Date', inplace=True)
model = VAR(data)
results = model.fit(2)
results.forecast(data.values[-2:], 5)
Note that data is
a b c
Date
1959-06-01 1 2 3
1959-06-02 2 3 4
1959-06-03 3 5 7
1959-06-04 5 8 11
And the forecast gives me
array([[ 8.01388889, 12.90277778, 17.79166667],
[ 12.93113426, 20.67650463, 28.421875 ],
[ 20.73343461, 33.12405961, 45.51468461],
[ 33.22366195, 52.98948789, 72.75531383],
[ 53.15895736, 84.72805652, 116.29715569]])
Let's say I knew that the next 5 values for a should have actually been 8,13,21,34,55 and the next 5 values for b should have been 13,21,34,55,89. Is there a way to incorporate that into the model in statsmodels.tsa (or any other python package) to forecast only the 5 values of c? I know that R has such an option, by incorporating "hard" conditions into cpredict.VAR, but I was wondering if this can be done in python as well.
The above is a toy example. In reality I have several dozen features, but I know the future values of all of them except one, and would like to predict only that one using a VAR model.
I had a similar issue when solving this problem. Here is a makeshift method to accomplish what you are asking:
prediction = model_fit.forecast(model_fit.y, steps=len(test))
predictions = prediction[:, 0]
where the 0 in prediction[:, 0] refers to the column that contains the desired forecast value.

Machine Learning: combining features into single feature

I am a beginner in machine learning, and I am confused about how to combine different features of a data set into one single feature.
For example, I have a data set in Python Pandas data frame of features like this:
movie unknown action adventure animation fantasy horror romance sci-fi
Toy Story 0 1 1 0 1 0 0 1
Golden Eye 0 1 0 0 0 0 1 0
Four Rooms 1 0 0 0 0 0 0 0
Get Shorty 0 0 0 1 1 0 1 0
Copy Cat 0 0 1 0 0 1 0 0
I would like to convert these n features into one single feature named "movie_genre". One solution would be to assign an integer value to each genre (unknown = 0, action = 1, adventure = 2, etc.) and create a data frame like this:
movie genre
Toy Story 1,2,4,7
Golden Eye 1,6
Four Rooms 0
Get Shorty 3,4,6
Copy Cat 2,5
But in this case the entries in the column will no longer be integer/float values. Will that affect later steps in the machine learning process, like fitting models and evaluating algorithms?
Convert each series of zeros and ones into an 8-bit number:
toy story = 01101001
In binary, that's 105.
Similarly, Golden Eye = 01000010, which is 66.
You can do the rest manually here: http://www.binaryhexconverter.com/binary-to-decimal-converter
It's relatively straightforward to do programmatically: just loop through each label, assign it the appropriate power of two, then sum them up.
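The bit-packing above can be sketched programmatically like this, taking the genre columns in table order with the leftmost column as the most significant bit:

```python
def pack_genres(bits):
    """Pack a list of 0/1 genre flags into a single integer (leftmost bit is most significant)."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

# genre flags in table order: unknown, action, adventure, animation, fantasy, horror, romance, sci-fi
toy_story  = [0, 1, 1, 0, 1, 0, 0, 1]
golden_eye = [0, 1, 0, 0, 0, 0, 1, 0]

print(pack_genres(toy_story))   # 105
print(pack_genres(golden_eye))  # 66
```

Keep in mind that a model will treat these packed integers as magnitudes, so two movies with similar genre sets can end up with wildly different values; this encoding is lossless but not distance-preserving.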
It may be effective to leave them in their current multi-feature format and perform some sort of dimensionality reduction technique on that data.
This is very similar to a classic question: how do we treat categorical variables? One answer is one-hot or dummy encoding, which your original DataFrame is very similar to. With one-hot encoding, you start with a single, categorical feature. Using that feature, you make a column for each level, and assign a binary value to that column. The encoded result looks quite similar to what you are starting with. This sort of encoding is popular and many find it quite effective. Yours takes this one step further as each movie could be multiple genres. I'm not sure reversing that is a good idea.
Simply having more features is not always a bad thing if it is representing the data appropriately, and if you have enough observations. If you end up with a prohibitive number of features, there are many ways of reducing dimensionality. There is a wealth of knowledge on this topic out there, but one common technique is to apply principal component analysis (PCA) to a higher-dimensional dataset to find a lower-dimensional representation.
Since you're using python, you might want to check out what's available in scikit-learn for more ideas. A few resources in their documentation can be found here and here.
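For instance, PCA on the genre matrix from the question compresses the 8 binary genre columns into a smaller number of continuous components (2 is an arbitrary choice here):

```python
import numpy as np
from sklearn.decomposition import PCA

# rows: Toy Story, Golden Eye, Four Rooms, Get Shorty, Copy Cat
genres = np.array([
    [0, 1, 1, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
])

pca = PCA(n_components=2)
reduced = pca.fit_transform(genres)
print(reduced.shape)  # (5, 2)
```

Each movie is now described by 2 continuous values instead of 8 binary flags, at the cost of some information (the discarded components).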
One thing you can do is make a matrix of all possible combinations and reshape it into a single vector. If you want to account for all combinations, it will have the same length as the original. If there are combinations you don't need, simply leave them out; your network is label-agnostic and won't mind.
But why is this a problem? Your dataset looks small enough.
