I am trying to predict the outcome of tennis matches - just a fun side project. I'm using a random forest regressor to do this. Now, one of the features is the ranking of the player before a specific match. For many matches I don't have a ranking (I only have the first 200 ranked players). The question is: is it better to put a value that is not an integer, for example the string "NoRank", or to put an integer beyond the 1-200 range? Considering the learning algorithm, I'm inclined to put the value 201, but I would like to hear your opinions on this.
Thanks!
scikit-learn random forests do not support missing values, unfortunately. If you think that unranked players are likely to behave worse than players ranked 200 on average, then imputing the rank 201 makes sense.
Note: all scikit-learn models expect homogeneous numerical input features, not string labels or other Python objects. If you have string labels as features, you first need to find the right feature extraction strategy depending on the meaning of your string features (e.g. categorical variable identifiers, or free text to be extracted as a bag of words).
I would be careful with just adding 201 (or any other value) for the unranked players.
Random forests are insensitive to the scale of the data (see "Do I need to normalize (or scale) data for randomForest (R package)?"), which means a split may or may not group 201 together with rank 200; you are basically faking data that you do not have.
I would add another column, "haverank",
and use 0/1 for it:
0 for players without a rank,
1 for players with a rank.
Call it "highrank" if that name sounds better.
You can also add another column named "veryhighrank"
and give the value 1 to all players ranked 1-50, etc.
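A minimal sketch of that indicator-column idea in plain Python (toy data; `None` marks an unranked player, and the column names are just placeholders):

```python
# Toy ranking data; None means the player had no recorded rank
ranks = [3, 150, None, 27, None]

# Indicator column: 1 if the player has a rank, 0 otherwise
have_rank = [0 if r is None else 1 for r in ranks]

# Fill the missing ranks with the out-of-range sentinel 201
rank_filled = [201 if r is None else r for r in ranks]
```

In a pandas DataFrame the same thing is typically done with `df["rank"].notna().astype(int)` and `df["rank"].fillna(201)`.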
Related
I am working with a dataset consisting of 190 columns and more than 3 million rows.
Unfortunately, it has all the data as string values.
Is there any way of building a model with this kind of data,
other than tokenising?
Thank you and regards!
This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.
Firstly, I don't think there is any straightforward answer, as most ML models depend on the data that you have. If your string data is simply Yes or No (binary), it can easily be dealt with by replacing Yes = 1 and No = 0, but that doesn't work on something like country.
One-Hot Encoding - For features like country, it would be fairly simple to just one-hot encode them and start training the model on the resulting numerical data. But with the number of columns that you have, and depending on the number of unique values in such a large amount of data, the dimensionality will increase by a lot.
Assigning numeric values - We also cannot simply assign numeric values to the strings and expect our model to work, as there is a very high chance that the model will pick up on a numeric order that does not exist in the first place. more info
Bag of words, Word2Vec - Since you excluded tokenization, I don't know if you want to do this, but there are also these options.
Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R, so I cannot confirm how far this is true. Nevertheless, you can find more about it here.
One-Hot + VectorAssembler - I only came up with this in theory. If you manage to somehow convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be combined into a single vector using VectorAssembler in PySpark (Spark). More info about VectorAssembler is here.
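As an illustration of the one-hot option above, a minimal sketch in plain Python (toy country values; in practice you would use scikit-learn's OneHotEncoder or pandas.get_dummies rather than rolling it by hand):

```python
# Toy categorical feature
countries = ["US", "DE", "US", "FR"]

# Fixed category order, so every row gets the same column layout
categories = sorted(set(countries))  # ['DE', 'FR', 'US']

# Each value becomes a 0/1 vector with a single 1 in its category's slot
one_hot = [[1 if c == cat else 0 for cat in categories] for c in countries]
```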
The question is two-fold:
1. How to select the ideal value for size?
2. How to get the vocabulary size dynamically (per row as I intend) to set that ideal size?
My data looks like the following (example)—just one row and one column:
Row 1
{kfhahf}
Lfhslnf;
.
.
.
Row 2
(stdgff ksshu, hsihf)
asgasf;
.
.
.
Etc.
Based on this post (Python: What is the "size" parameter in Gensim Word2vec model class), the size parameter should be less than (or equal to?) the vocabulary size. So I am trying to assign size dynamically, as follows:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
# I do Word2Vec for each row
for item in dataset:
    tokenized = word_tokenize(item)
    model = Word2Vec([tokenized], min_count=1)
I get the vocabulary size here. So I create a second model:
model1 = Word2Vec([tokenized], min_count=1, size=len(model.wv.vocab))
This sets the size to the vocabulary size of the current row, as I intended. But is it the right way to do it? What is the right size for a small-vocabulary text?
There's no simple formula for the best size - it will depend on your data and purposes.
The best practice is to devise a robust, automatable way to score a set of word-vectors for your purposes – likely with some hand-constructed representative subset of the kinds of judgments, and preferred results, you need. Then, try many values of size (and other parameters) until you find the value(s) that score highest for your purposes.
In the domain of natural language modeling, where vocabularies are at least in the tens-of-thousands of unique words but possibly in the hundreds-of-thousands or millions, typical size values are usually in the 100-1000 range, but very often in the 200-400 range. So you might start a search of alternate values around there, if your task/vocabulary is similar.
But if your data or vocabulary is small, you may need to try smaller values. (Word2Vec really needs large, diverse training data to work best, though.)
Regarding your code-as-shown:
there's unlikely to be any point in computing a new model for every item in your dataset (discarding the previous model on each loop iteration). If you want a count of the unique tokens in any one tokenized item, you could use idiomatic Python like len(set(word_tokenize(item))). Any Word2Vec model of interest would likely need to be trained on the combined corpus of tokens from all items.
it's usually the case that min_count=1 makes a model worse than larger values (like the default of min_count=5). Words that only appear once generally can't get good word-vectors, as the algorithm needs multiple subtly-contrasting examples to work its magic. But, trying-and-failing to make useful word-vectors from such singletons tends to take up training-effort and model-state that could be more helpful for other words with adequate examples – so retaining those rare words even makes other word-vectors worse. (It is most definitely not the case that "retaining every raw word makes the model better", though it is almost always the case that "more real diverse data makes the model better".)
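The per-row unique-token count mentioned above needs no model at all; a minimal sketch in plain Python (toy strings, with str.split standing in for word_tokenize):

```python
# Toy stand-in for the question's dataset
dataset = ["the quick brown fox", "the dog the dog barks"]

# Unique-token count per row, without building any Word2Vec model
vocab_sizes = [len(set(doc.split())) for doc in dataset]

# A single model would instead be trained on the combined corpus:
sentences = [doc.split() for doc in dataset]  # pass all of these to Word2Vec at once
```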
I would like to implement ML model for classification problem. My csv data looks like this:
Method1; Method2; Method3; Method4; Category; Class
result1; result2; result3; result4; Sport; 12
...
...
Each method gives a text. Sometimes it is one word, sometimes more, and sometimes the cell is empty (no answer from that method). The column "Category" always has a text, and the column "Class" is a number identifying the methods that gave the correct answer (i.e. the number 12 means that only the results from methods 1 and 2 are correct). I may add more columns if necessary.
Now, given new answers from all methods, I would like to classify them into one of the classes.
How should I prepare this data? I know I should have numerical data, but how do I get there, and how do I handle all the empty cells and the inconsistent number of words in each answer?
There are many different ways of doing this, but the simplest would be to just use a Bag of Words representation, which means concatenating all your Methodx columns and counting how many times each word appears in them.
With that, you have a vector representation (each word is a column/feature, each count is a numerical value).
Now, from here there are several problems (the main one being that the number of columns/features in your dataset will be quite large), so you may have to either preprocess your data further or find an ML technique that can deal with it for you. In any case, I would recommend having a look at several NLP tutorials to get a better idea of this and a better estimate of the best solution for your dataset.
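A minimal sketch of that Bag of Words idea on toy rows (plain Python; the row layout is made up for illustration, and in practice scikit-learn's CountVectorizer does the counting for you):

```python
from collections import Counter

# Toy rows: (Method1, Method2, Method3, Category); empty strings are empty cells
rows = [
    ("good answer", "", "good result", "Sport"),
    ("bad", "bad answer", "", "News"),
]

# Vocabulary over the concatenated method columns (empty cells contribute nothing)
vocab = sorted({w for row in rows for cell in row[:3] for w in cell.split()})

# One word-count vector per row
X = [[Counter(w for cell in row[:3] for w in cell.split())[w] for w in vocab]
     for row in rows]
```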
I'm trying to predict whether a fan is going to turn out to a sporting event or not. My data (a pandas DataFrame) consists of fan information (demographics, etc.) and whether or not they attended each of the last 10 matches (g1_attend - g10_attend).
fan_info age neighborhood g1_attend g2_attend ... g1_neigh_turnout
2717 22 downtown 0 1 .47
2219 67 east side 1 1 .78
How can I predict if they're going to attend g11_attend, when g11_attend doesn't exist in the DataFrame?
Originally, I was going to look into applying some of the basic scikit-learn classification models, possibly just adding a g11_attend column to the DataFrame. This all has me quite confused for some reason. I'm now thinking it would be more appropriate to treat this as a time series, and was looking into other models.
You are correct, you can't just add a new category (i.e. output class) to a classifier; that requires something that handles time series.
But there is a fairly standard technique for using a classifier on time series: asserting (conditional) time independence and using windowing.
In short, we are going to assume that whether or not someone attends a game depends only on the variables we have captured, and not on some other time factor (or other factor in general).
I.e. we assume we can shift their history of attended games around the year and the probability will stay the same.
This is clearly wrong, but we do it anyway because machine learning techniques will deal with some noise in the data.
It is clearly wrong because, for example, some people will avoid games in winter because it is too cold.
So now on to the classifier:
We have our inputs, and we want just one output.
The basic idea is that we are going to train a model
that, given as input whether they attended the first 9 games, predicts whether they will attend the 10th.
So our inputs are[1] age, neighbourhood, g1_attend, g2_attend, ..., g9_attend,
and the output is g10_attend, a binary value.
This gives us training data.
Then, when it is time to make a prediction, we shift everything across: switch g1_attend for g2_attend, g2_attend for g3_attend, ..., and g9_attend for g10_attend.
Our prediction output will then be for g11_attend.
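A minimal sketch of that shift with plain Python lists (toy attendance data; the real feature set would of course also include neighbourhood etc.):

```python
# Each row: [age, g1, ..., g10] for one fan (toy data, 1 = attended)
rows = [
    [22, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0],
    [67, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
]

# Training: features are age plus g1..g9, label is g10
X_train = [r[:10] for r in rows]
y_train = [r[10] for r in rows]

# Prediction time: shift the window by one game (age plus g2..g10);
# the fitted model's output is then an estimate for g11
X_pred = [[r[0]] + r[2:11] for r in rows]
```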
You can also train several models with different window sizes,
e.g. only looking at the last 2 games to predict attendance at the 3rd.
This gives you a lot more training data, since you can do
g1,g2->g3 and g2,g3->g4, etc., for each row.
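A minimal sketch of that windowed expansion for a single fan's history (plain Python, toy data):

```python
# One fan's attendance for g1..g5 (toy data, 1 = attended)
history = [0, 1, 1, 0, 1]
window = 2

# Every (window -> next game) training pair: g1,g2->g3, g2,g3->g4, g3,g4->g5
pairs = [(history[i:i + window], history[i + window])
         for i in range(len(history) - window)]
```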
You could train a bundle of models with different window sizes and merge the results with some ensemble technique.
In particular, it is a good idea to train g1,...,g8 -> g9,
and then use that model to predict g10 (using g2,...,g9 as inputs)
to check whether it is working.
I suggest in future you may like to ask these questions on Cross Validated. While this may be on topic on stack overflow, it is more on topic there, and has a lot more statisticians and machine learning experts.
[1] I suggest discarding fan_id as an input for now. I just don't think it will get you anywhere, but explaining why is beyond the scope of this question.
I want to do a linear regression analysis. I have multiple features, and some features have unassigned (null) values for some items in the data, because for some items specific feature values were missing in the data source. To be clearer, here is an example:
As you can see, some items are missing values for some features. For now, I just assigned them to 'Null', but how do I handle these values when doing a linear regression analysis of the data? I do not want these unassigned values to incorrectly affect the regression model. Unfortunately, I cannot get rid of the items where unassigned feature values are present. I plan to use Python for the regression.
You need to either ignore those rows -- you've already said you can't, and it's not a good idea with the quantity of missing values -- or use an algorithm that proactively discounts those items, or impute (that's the technical term for filling in an educated guess) the missing data.
There's a limited amount of help we can give, because you haven't given us the semantics you want for missing data. You can impute some of the missing values by using your favourite "closest match" algorithm against the data you do have. For instance, you may well be able to infer a good guess for area from the other data.
For your non-linear, discrete features (i.e. District), you may well want to keep NULL as a separate district. If you have few enough missing entries, you'll be able to get a decent model anyway.
A simple imputation is to replace each NULL with the mean value for the feature, but this works only for those with a proper mean (i.e. not District).
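A minimal sketch of that mean imputation for one numeric feature (plain Python, toy values; None marks a missing entry):

```python
# Toy 'area' feature with missing entries
area = [120.0, None, 95.0, None, 145.0]

# Mean over the known values only
known = [v for v in area if v is not None]
mean = sum(known) / len(known)

# Replace each missing entry with that mean
imputed = [mean if v is None else v for v in area]
```

In practice scikit-learn's SimpleImputer(strategy="mean") does the same thing column-wise.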
Overall, I suggest that you search for appropriate references on "impute missing data". Since we're not sure of your needs, we can't help much with this, and doing so is outside the scope of SO.