I am a beginner in machine learning. I am confused how to combine different features of a data set into one single feature.
For example, I have a data set in Python Pandas data frame of features like this:
movie unknown action adventure animation fantasy horror romance sci-fi
Toy Story 0 1 1 0 1 0 0 1
Golden Eye 0 1 0 0 0 0 1 0
Four Rooms 1 0 0 0 0 0 0 0
Get Shorty 0 0 0 1 1 0 1 0
Copy Cat 0 0 1 0 0 1 0 0
I would like to convert this n features into one single feature named "movie_genre". One solution would be assign an integer value to each genre (unknown = 0, action = 1, adventure = 2 ..etc) and create a data frame like this:
movie genre
Toy Story 1,2,4,7
Golden Eye 1,6
Four Rooms 0
Get Shorty 3,4,6
Copy Cat 2,5
But in this case the entries in the column will be no longer an integer/ float value. Will that affect my future steps in machine learning process like fitting model and evaluating the algorithms?
convert each series of zeros and ones into an 8-bit number
toy story = 01101001
in binary, that's 105
similarly, Golden Eye=01000010 = 26946
you can do the rest here manually: http://www.binaryhexconverter.com/binary-to-decimal-converter
it's relatively straight forward to do programatically - just look through each label, and assign it the appropriate power of two then sum them up
It may be effective to leave them in their current multi-feature format and perform some sort of dimensionality reduction technique on that data.
This is very similar to a classic question: how do we treat categorical variables? One answer is one-hot or dummy encoding, which your original DataFrame is very similar to. With one-hot encoding, you start with a single, categorical feature. Using that feature, you make a column for each level, and assign a binary value to that column. The encoded result looks quite similar to what you are starting with. This sort of encoding is popular and many find it quite effective. Yours takes this one step further as each movie could be multiple genres. I'm not sure reversing that is a good idea.
Simply having more features is not always a bad thing if it is representing the data appropriately, and if you have enough observations. If you end up with a prohibitive number of features, there are many ways of reducing dimensionality. There is a wealth of knowledge on this topic out there, but one common technique is to apply principal component analysis (PCA) to a higher-dimensional dataset to find a lower-dimensional representation.
Since you're using python, you might want to check out what's available in scikit-learn for more ideas. A few resources in their documentation can be found here and here.
One thing you can do is to make a matrix of all possible combinations and reshape it into a single vector. If you want to account for all combinations it will have the same length as the original. If there are combinations that you don't need simply don't take them into account. Your network is label-agnostic and it won't mind.
But why is that a problem? Your dataset looks small enough.
Related
I have the following data frame (first 5 rows showing):
text
verified
review_score
product_category
polarity_score
sentiment_label
work described nice combo tool nice combo tool
1
4
0
0.6808
1
used using straight bridge make third string t...
1
5
0
0.9468
1
first hesitant game seemed surface capcom chan...
0
5
1
0.9907
1
great reverb pedal great addition pedal board ...
1
5
0
0.9022
1
five star good deal must buy
1
5
1
0.4404
1
I want to convert this data frame to a PyTorch dataset/tensor to use with a neural network model. The text has already been tokenised and preprocessed, and I would like to use all columns in the data frame. Also, after doing so, I would like to acquire the number of classes and vocabulary size, as well as split the data to train and test components. I have done some programming already, but it has not gone to plan as I cannot acquire the number of classes or the vocabulary size.
Here is some of the code I have done but I am unsure if this is the correct method:
TEXT = Field(tokenize=None) # already has been tokenised
LABEL = LabelField(dtype=torch.float)
TEXT.build_vocab(train, max_size=25000, vectors='glove.6B.100d')
LABEL.build_vocab(train)
fields = {'review_score': LABEL, 'product_category': LABEL, 'sentiment_label': LABEL, 'text': TEXT}
train_dataset = DataFrameDataset(train, fields)
test_dataset = DataFrameDataset(test, fields)
If anyone can help, then that will be much appreciated.
I'm experimenting with Machine Learning and LSTM models for river level prediction based on the current level and rainfall within the upstream catchment. I'm currently using TensorFlow and a Keras LSTM model.
I have a hourly rainfall data in mm from eight sites within the catchment, and the river stage (level) in meters at a site downstream from the catchment area. The problem I face is that every now and then the rainfall sites are tested by pouring water into them. This creates a significant spike in the data that I would like to filter out.
Here's an example of what a typical rainfall event looks like within the dataframe:
DataFrame showing typical seqeunce rainfall and river stage data
And here is an example of what it looks like when two sites have been tested
DataFrame showing abnormal rainfall data due to two sites being tested
I've come across several ways to statistically cluster data and identify outliers however none of these really worked on a timed sequence of events. Also, the rainfall site columns are listed in the DataFrame in the order in which they are located with the catchment so there is a loose spatial coupling moving across the columns.
I was thinking of using something a little like a 3x3 or 3x4 convolution matrix, but rather than calculating new cell values it would find outliers by comparing the values from the central cells with the values in the outer cells. Any significant difference would indicate abnormal data.
The Pandas DataFrame API is quite large and I'm still getting familiar with it. Any suggestions on specific functions or methods I should be looking at would be much appreciated.
In the following example the 10:00:00 reading for Site 2 would be an obvious anomaly.
Timestamp
Site 1
Site 2
Site 3
09:00:00
0
0
0
10:00:00
0
20
0
11:00:00
0
0
0
20mm of rainfall at one site with zero rainfall at the adjacent sites, or at the same site for the hour before and hour after is a very clear and obvious case.
This is what a normal rainfall pattern might look like:
Timestamp
Site 1
Site 2
Site 3
09:00:00
6
4
0
10:00:00
0
20
2
11:00:00
0
0
11
This is a less obvious example:
Timestamp
Site 1
Site 2
Site 3
09:00:00
1
0
0
10:00:00
0
20
2
11:00:00
0
3
1
One possibility might be to compare the central cell value to the maximum of the surrounding cell values and if the difference is greater than 15 (or some other arbitrary threshold value).
The exact criteria will probably change as I discover more about the data. The mechanism of how to apply that criteria to the dataframe is what I'm looking for. For example, if the criteria were implemented as a lambda function that could reference adjacent cells, is there a way to apply that lambda function across DataFrame?'
An extra complication is how to deal with checking values for Site 1 when there is preceding site to the left, and Site 3 where there is no following site to the right.
This may be a basic question, I have a categorical data and I want to feed this into my machine learning model. my ML model accepts only numerical data. What is the correct way to convert this categorical data into numerical data.
My Sample DF:
T-size Gender Label
0 L M 1
1 L M 1
2 M F 1
3 S F 0
4 M M 1
5 L M 0
6 S F 1
7 S F 0
8 M M 1
I know this following code convert my categorical data into numerical
Type-1:
df['T-size'] = df['T-size'].cat.codes
Above line simply converts category from 0 to N-1. It doesn't follow any relationship between them.
For this example I know S < M < L. What should I do when I have want to convert data like above.
Type-2:
In this type I No relationship between M and F. But I can tell that When M has more probability than F. i.e., sample to be 1 / Total number of sample
for Male,
(4/5)
for Female,
(2/4)
WKT,
(4/5) > (2/4)
How should I replace for this kind of column?
Can I replace M with (4/5) and F with (2/4) for this problem?
What is the proper way to dealing with column?
help me to understand this better.
There are many ways to encode categorical data, some of them depend on exactly what you plan to do with it. For example, one-hot-encoding which is easily the most popular choice is an extremely poor choice if you're planning on using a decision tree / random forest / GBM.
Regarding your t-shirts above, you can give a pandas categorical type an order:
df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True)).
if you had set up your tshirt categorical like that then your .cat.codes method would work perfectly. It also means you can easily use scikit-learn's LabelEconder which fits neatly into pipelines.
Regarding you encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. This gets even more complicated if you're using cross-validation as you'll need to do the encoding with in each CV iteration (i.e. new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from skcontribs Category Encoders but again, be sure to use this within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into you training set.
If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be :
size_mapping = {"S": 1, "M":2 , "L":3}
#mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)
This allows you to treat the input as numerical data while preserving the hierarchy
And as for the gender, you are misconceiving the repartition and the preproces. If you already put the repartition as an input, you will introduce a bias in your data. You must consider that Male and female as two distinct categories regardless of their existing repartition. You should map it with two different numbers, but without introducing proportions.
df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})
For a more detailed explanation and a coverage of more specificities than your question, I suggest reading this article regarding categorical data in Machine Learning
For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:
d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)
Output:
T-size Gender Label
0 2 M 1
1 2 M 1
2 1 F 1
3 0 F 0
4 1 M 1
5 2 M 0
6 0 F 1
7 0 F 0
8 1 M 1
For the second question, you can use the same method, but i would leave the 2 values for males and females 0 and 1. If you need just the category and you dont have to make operations with the values, a values is equal to another.
It might be overkill for the M/F example, since it's binary - but if you are ever concerned about mapping a categorical into a numerical form, then consider one hot encoding. It basically stretches your single column containing n categories, into n binary columns.
So a dataset of:
Gender
M
F
M
M
F
Would become
Gender_M Gender_F
1 0
0 1
1 0
1 0
0 1
This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
I have a the following dataset of pandas dataframes:
A:
col1 col2
0 5 3
1 5 4
B:
col1 col2
0 6 4
1 2 4
my_list:
[24.5, 65.4]
Assume I have a dataset of thirty A,B,my_list pairs which have different sets of values. Changing a single or multiple values in either or both of the dataframes A and B affect the values in my_list.
Provided that I want to achieve [65.0, 46.21] in my_list, I need to find out what values need to be present in A,B dataframes.
I am looking for suggestions what would be the best solution to this problem? A simple ML algorithm? A deep learning model? If so, which one should I be using?
Please note that my dataset is small as 30 and I am looking to achieve a value as close as possible to the desired my_list value.
Any suggestions will be highly appreciated.
Sounds like you need a regression algorithm.
May I translate your task to: given 8 positional inputs, find a general formula, that produces the output closest to the desired output. This is a typical regression problem and you have many powerful tools you can use.
Given that your dataset is small, you'd better start with simple algorithms such as linear regression, then move to more complicated ones such as support vector machines if necessary.
I want perform cluster analysis for the following data (sample):
ID CODE1 CODE2 CODE3 CODE4 CODE5 CODE6
------------------------------------------------------------------
00001 0 1 1 0 0 0
00002 1 0 0 0 1 1
00003 0 1 0 1 1 1
00004 1 1 1 0 1 0
...
Where 1 indicates the presence of that code for a person, and 0 the absence..
Is k-means or hierarchical clustering most appropriate for clustering the codes for this kind of data (for about a million distinct ids), and with which distance measure? If neither of these methods are appropriate, what do you think is most appropriate?
Thank you
No, k-means does not make a lot of sense for binary data.
Because k-means computes means. But what is the mean vector for binary data?
Your cluster "centers" will be not part of your data space, and nothing like your input data. That doesn't seem like a proper "center" to me, when it's totally different from your objects.
Most likely, your cluster "centers" will end up being more similar to each other than to the actual cluster members, because they are somewhere in the center, and all your data is in corners.
Seriously, investigate similarity functions for your data type. Then choose a clustering algorithm that works with this distance function. Hierarchical clustering is quite general, but really slow. But you don't have to use a 40 year old algorithm, you may want to look into more modern stuff.