I have applied Boruta to my dataset to determine the importance of features with respect to a predictor variable. However, it is unable to decide on several features: they are marked as tentative.
Is there a Python equivalent of the TentativeRoughFix function from the R Boruta package? If such a function exists, can anybody point me towards it? Alternatively, any suggestion on how to resolve variables from "tentative" to "important" or "not important" in Python would be very appreciated.
There are plenty of options for feature selection in scikit-learn (see the documentation).
There is also a Python implementation of Boruta, Boruta_py, but I have never tested it.
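If I remember correctly, Boruta_py marks tentative features through its `support_weak_` attribute. The rough-fix logic itself (confirm a tentative feature when its median importance across the Boruta iterations beats the median importance of the best shadow feature) is simple enough to sketch by hand; the importance histories below are made-up numbers:

```python
from statistics import median

def rough_fix(tentative_history, shadow_max_history):
    """Confirm a tentative feature when its median importance over the Boruta
    iterations beats the median of the best shadow feature's importance."""
    threshold = median(shadow_max_history)
    return {name: ("important" if median(hist) > threshold else "not important")
            for name, hist in tentative_history.items()}

# made-up importance histories recorded during the Boruta iterations
history = {"f1": [0.9, 1.1, 1.0], "f2": [0.2, 0.3, 0.25]}
shadow_max = [0.5, 0.6, 0.55]
print(rough_fix(history, shadow_max))
```

This mirrors what R's TentativeRoughFix does as far as I understand it; it is a sketch, not the library's own code.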
I am working on a cyber-security project in which we have to prioritize vulnerabilities based on the existing features, which are mostly categorical variables (including a couple of ordinal variables).
The objective is to detect the vulnerability that is most likely to be exploited and prioritize it accordingly. Hence we have to predict a score from 0 to 10; whichever vulnerability receives the highest predicted score (in this case 10) is the most critical one and needs immediate attention.
All that we have are the categorical variables (as input features).
Summarizing the problem once again:
Current input features: all categorical variables (with a couple of ordinal variables)
Current output feature: DOES NOT EXIST
Expected output: predict a score in the range 0-10, with 10 being the most critical vulnerability
I have never come across this kind of problem before. It definitely looks like regression is not the answer. Can you please share your thoughts?
I may be misunderstanding, but it appears that you don't have the necessary information to make the prediction.
My understanding is that you have category information but no other associations. For some categories you might be able to hard-code your prediction based on expert opinion. A ping sweep, for example, is basically benign, and you know that just from its name. For anything more dynamic you're going to need more information than you listed.
If you can't assign a score yourself, there's no way a machine learning algorithm is going to be able to do it. It can't know what to optimize for.
However, you might find success by using an unsupervised algorithm to cluster your data based on the categorical values, then looking at the clusters and determining which ones seem to have the most important issues. You can find one discussion on categorical k-means clustering here.
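A bare-bones sketch of k-modes clustering on purely categorical records (the toy vulnerability data, column values, and parameter choices are invented for illustration):

```python
from collections import Counter

def k_modes(rows, k, n_iter=10):
    """Tiny k-modes sketch for categorical rows: distance is the number of
    mismatched attributes, and each cluster centre is the per-column mode."""
    # initialize centres with the first k distinct rows
    centres = []
    for row in rows:
        if row not in centres:
            centres.append(row)
        if len(centres) == k:
            break
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for row in rows:
            dists = [sum(a != b for a, b in zip(row, c)) for c in centres]
            clusters[dists.index(min(dists))].append(row)
        # recompute each centre as the column-wise mode of its members
        centres = [tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
                   if members else centres[i]
                   for i, members in enumerate(clusters)]
    return centres, clusters

# toy vulnerability records: (protocol, attack type)
rows = [("tcp", "scan"), ("tcp", "scan"), ("udp", "exploit"), ("udp", "exploit")]
centres, clusters = k_modes(rows, k=2)
```

In practice the kmodes package offers a tested implementation; the sketch above just shows the mechanics.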
I am looking at the popular Python implementation of SOMs: MiniSom.
A lot of blogs cite examples such as fraud detection using MiniSom.
While I do understand how to get the items associated with the outlier nodes (BMUs),
I cannot figure out how to get the important features that distinguish an outlier. Is there a function or package that can help me do that?
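I don't know of a ready-made MiniSom function for this, but one common trick is to compare the outlier unit's codebook (weight) vector against the other units, feature by feature: a z-score per feature gives a ranking. A stdlib-only sketch (the codebook values and feature names are invented):

```python
from statistics import mean, stdev

def distinguishing_features(codebook, outlier_unit, feature_names):
    """Rank features by how far the outlier unit's weight deviates from the
    other units' weights, in units of standard deviation (a z-score per feature)."""
    scores = {}
    for j, name in enumerate(feature_names):
        others = [w[j] for unit, w in codebook.items() if unit != outlier_unit]
        scores[name] = abs(codebook[outlier_unit][j] - mean(others)) / stdev(others)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# toy codebook: unit -> weight vector; unit (2, 2) is the outlier BMU
codebook = {
    (0, 0): [0.1, 0.5], (0, 1): [0.2, 0.4],
    (1, 0): [0.15, 0.45], (2, 2): [0.9, 0.44],
}
print(distinguishing_features(codebook, (2, 2), ["amount", "frequency"]))
```

With MiniSom you would build the codebook dictionary from `som.get_weights()`; the ranking logic itself is independent of the library.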
If you take a look at the Python implementation of popsom (originally an R package), it has a function to show feature significance: m.significance().
popsom for python
Premise: I am not an expert in machine learning, maths, or statistics. I am a linguist entering the world of ML. When answering, please be as explicit as you can.
My problem: I have 3000 expressions containing aspects (characteristics, or features) that users typically comment on in online reviews. These expressions were identified and approved by human annotators and experts.
Example: “they play a difficult role”
The labels are: Acting (referring to the act of acting and also to actors), Direction, Script, Sound, Image.
The goal: I am trying to classify these expressions according to their aspects.
My system: I am using SkLearn and Python under a Jupyter environment.
Technique used until now:
I built a bag-of-words matrix (keeping track of the presence/absence of stemmed words for each expression), and
I applied a multiclass SVM classifier with an RBF kernel and C = 1 (or tuned it according to the final accuracy). The code I used is the one from
https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/
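The bag-of-words step can be sketched in plain Python (tokens here are just whitespace-split; in practice you would plug in your stemmer):

```python
def bag_of_words(expressions):
    """Build a binary presence/absence matrix: one row per expression,
    one column per vocabulary word."""
    vocab = sorted({tok for e in expressions for tok in e.split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for e in expressions:
        row = [0] * len(vocab)
        for tok in set(e.split()):
            row[index[tok]] = 1
        matrix.append(row)
    return vocab, matrix

vocab, X = bag_of_words(["play difficult role", "difficult script"])
```

scikit-learn's CountVectorizer(binary=True) does the same job with more options; the sketch only shows what the matrix contains.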
The first attempt gave an accuracy of 0.63. When I tried to create more labels from the class Script, accuracy went down to 0.50. I was interested in doing that because some expressions clearly describe the plot or the characters.
I think that the problem is due to the presence of some words that are shared among these aspects.
I searched for a way to improve the model and found something called a "learning curve". I used the official code provided by the sklearn documentation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
The result looks like the second picture (the right one). I can't tell whether it is good or not.
In addition to this, I would like to:
import the expressions from a text file. For the moment I have
just created an array and put the expressions inside it, which
doesn't feel very robust.
find a way, if possible, to tell the system that some words are very specific/important to an aspect, to help it improve the classification.
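For the first point, the expressions can be read from a tab-separated text file with the standard csv module (the file name and the one-pair-per-line format are my assumption):

```python
import csv

def load_expressions(path):
    """Read one 'expression<TAB>label' pair per line from a UTF-8 text file."""
    expressions, labels = [], []
    with open(path, encoding="utf-8") as f:
        for expression, label in csv.reader(f, delimiter="\t"):
            expressions.append(expression)
            labels.append(label)
    return expressions, labels

# expressions.tsv would contain lines like:
# they play a difficult role<TAB>Acting
```

Tab-separated is convenient here because the expressions themselves can contain commas.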
How can I do this? I read that in some works researchers have combined several systems. How should I handle this? How can I feed the numbers produced by the first system into the second one?
I would like to underline that some expressions, verbs, nouns, etc. are used a lot in some contexts and not in others. There are some names that are certainly names of actors and not directors, for example. In the future I would like to add more linguistic information to the system and try to improve it.
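For the second wish, one simple option is feature engineering: append a binary indicator per aspect that fires when the expression contains a word from a hand-built lexicon for that aspect. A sketch (LEXICON and its words are invented for illustration):

```python
# hypothetical lexicon mapping each aspect to words known to be specific to it
LEXICON = {
    "Acting": {"role", "actor", "performance"},
    "Sound": {"soundtrack", "score", "audio"},
}

def add_aspect_flags(tokens, bow_vector):
    """Append one binary feature per aspect: 1 if the expression contains
    any word from that aspect's hand-built lexicon, else 0."""
    flags = [int(any(t in words for t in tokens)) for words in LEXICON.values()]
    return bow_vector + flags
```

The SVM then receives the extended vectors; because these columns are perfectly aligned with the aspects, the classifier can lean on them when the ordinary bag-of-words features are ambiguous.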
I hope I have expressed myself clearly enough and used appropriate, understandable language.
Does anyone know a Python library that provides online PCA estimation (something similar to what is described in this paper on online PCA)?
Does it make sense to use the sklearn.decomposition.IncrementalPCA method with batch_size=1?
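For reference, the core of a streaming PCA estimator (Oja's rule for the top component) fits in a few lines of plain Python. This is just a sketch of the idea, not any particular library's implementation, and the toy stream is made up:

```python
import random

def oja_top_component(stream, dim, lr=0.005):
    """Estimate the top principal direction from a stream of (centered) vectors
    using Oja's rule: w += lr * y * (x - y * w), where y = w . x."""
    w = [random.gauss(0, 1) for _ in range(dim)]
    for x in stream:
        y = sum(wi * xi for wi, xi in zip(w, x))   # projection onto the estimate
        w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    norm = sum(wi * wi for wi in w) ** 0.5
    return [wi / norm for wi in w]                 # unit-length direction

# toy stream: variance concentrated along the first axis
random.seed(0)
stream = [[random.gauss(0, 3), random.gauss(0, 0.3)] for _ in range(5000)]
w = oja_top_component(stream, dim=2)  # points (up to sign) along the first axis
```

Each sample is seen exactly once, which is what distinguishes the online setting from repeatedly fitting a batch PCA.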
You can check this out:
https://github.com/flatironinstitute/online_psp
It is not exactly PCA, since the components might not be orthogonal (you can easily orthogonalize them if needed; there is also an object method to do so).
Cheers
DISCLAIMER: I am one of the developers of this project.
A mixed-effects regression model is used when I believe there is dependency within a particular group of a feature. I've attached the Wikipedia link because it explains this better than I can: https://en.wikipedia.org/wiki/Mixed_model
Although I believe there are many occasions on which we need to consider mixed effects, not many modules support this.
R has lme4, and Python seems to have a similar module, but both are statistics-driven; they do not use cost-function-based algorithms such as gradient boosting.
In a machine-learning setting, how would you handle a situation in which you need to consider mixed effects? Are there any other models that can handle longitudinal data with mixed (random) effects?
(R seems to have a package that supports mixed effects: https://rd.springer.com/article/10.1007%2Fs10994-011-5258-3, but I am looking for a Python solution.)
There are, at least, two ways to handle longitudinal data with mixed-effects in Python:
StatsModel for linear mixed effects;
MERF for mixed effects random forest.
If you go for StatsModel, I'd recommend working through some of the examples provided here. If you go for MERF, I'd say the best starting point is here.
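A minimal random-intercept sketch with StatsModels (the simulated data, variable names, and true slope of 2 are my own, just to show the API shape):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per = 10, 20
g = np.repeat(np.arange(n_groups), n_per)          # group label per observation
x = rng.normal(size=n_groups * n_per)
intercepts = rng.normal(size=n_groups)             # one random intercept per group
y = 2.0 * x + intercepts[g] + rng.normal(scale=0.5, size=n_groups * n_per)
df = pd.DataFrame({"y": y, "x": x, "g": g})

# random-intercept model: fixed effect for x, random intercept per group g
model = smf.mixedlm("y ~ x", df, groups=df["g"])
result = model.fit()
print(result.params["x"])  # estimated fixed slope, should be close to 2
```

MERF follows a different pattern (separate fixed-effect features, random-effect design, and cluster labels passed to fit), so check its own examples for the exact call.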
I hope it helps!