I've read in the documentation that sklearn uses the CART algorithm for trees.
Are there specific attributes to change so that it becomes similar to a C4.5 implementation?
CART and C4.5 are somewhat similar algorithms, but there are fundamental differences which won't let you tweak sklearn's implementation to get C4.5 without a lot of work.
C4.5 converts the trained tree into rule sets and can split a categorical attribute into more than two branches, whereas CART does not compute rule sets and builds binary trees using only a numerical splitting criterion.
You can take a look at this implementation of C4.5
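As a rough illustration of how far the built-in options go, the sketch below switches sklearn's tree to an entropy-based criterion. This is only an approximation: the result is still a binary CART tree that uses information gain (not C4.5's gain ratio) and produces no rule sets. The iris dataset is an assumption chosen purely for the demo.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Toy data, used here only for illustration.
X, y = load_iris(return_X_y=True)

# criterion="entropy" makes the split quality measure information gain,
# which is the closest built-in setting to C4.5's criterion.
# The tree is still grown by CART: binary splits, no rule extraction.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

train_accuracy = tree.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")
```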
I have read this to learn about the various methods for multi-label classification.
I learned that there are 3 techniques for multi-label classification:
1. Problem Transformation
2. Adapted Algorithm
3. Ensemble approaches
Within the Problem Transformation category there are three more sub-categories:
a. Binary Relevance
b. Classifier Chains
c. Label Powerset
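For concreteness, the three problem transformations listed above can be sketched with scikit-learn. This is a minimal sketch under assumptions: the synthetic dataset and the logistic regression base learner are arbitrary choices, and the label powerset is hand-rolled here by encoding each label combination as one class string.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Synthetic multi-label data: Y is a 0/1 indicator matrix, one column per label.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

# a. Binary relevance: one independent binary classifier per label.
br = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
br_pred = br.predict(X)

# b. Classifier chain: each classifier also sees the earlier labels in the chain.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2])
chain_pred = chain.fit(X, Y).predict(X)

# c. Label powerset: every distinct label combination becomes one class.
combos = np.array(["".join(str(v) for v in row) for row in Y])
lp = LogisticRegression(max_iter=1000).fit(X, combos)
lp_pred = np.array([[int(ch) for ch in combo] for combo in lp.predict(X)])

print(br_pred.shape, chain_pred.shape, lp_pred.shape)
```

Binary relevance ignores correlations between labels, the chain models them in one fixed order, and the label powerset models them fully but needs enough examples of every combination.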
I know that when we want better results we should apply an ensemble model.
I would like to know in which situations each of the other algorithms should be used.
I know how they work differently, but I do not know when I should use each of them.
Also, there are only two methods implemented for the Adapted Algorithm approach.
What if I want other methods implemented in the adapted algorithm approach?
Please let me know if my statements are not clear.
Thanks,
Is there a way to give an x,y pair dataset to a function that will return a list of fitted curve models and their coefficients? The program DataFit does this with about 200 different models, from exponential to inverse polynomial etc., but we are looking for a Pythonic way.
I have seen many posts that use scipy and type out each model manually, but this is not feasible for the number of models we want to test.
The closest I found was pyeq2, but it does not return the list of functions, and seems to be a rabbit hole to code for.
If R has this available, we could use that, but Python is really the goal.
Below is an example of the data, we want to find the best way to describe this curve
You can try the splines library in R. I have used it for higher-order curve fitting on some univariate data. You can vary the settings and compare the resulting R^2 values.
You can decide to do one of the following:
Choose a model and fit its parameters. This model should be based on a single independent variable. This can be done with the curve_fit function in Python's scipy.optimize. You can choose something like a hyperbola.
Choose a model that is complex and likely represents an underlying mechanism at work, like the system of ODEs from an SIR disease model. Fitting the parameters will be no easy task. This will be done by Markov Chain Monte Carlo (MCMC) methods. This is VERY difficult.
Realise that you have data and can use machine learning via scikit-learn to predict from your data. This is a method that doesn't require you to pick a parametric model.
Machine learning and neural networks don't fit an explicit model and can't really tell you about the underlying mechanism, but they can make predictions just as a best-fit model would... dare I say even better.
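The first option can be extended to try several candidate models in a loop and rank them. The sketch below is only an illustration: the four models in CANDIDATE_MODELS, the helper name fit_all, the starting guess of all ones, and the synthetic data are assumptions, and a real catalogue would need better initial guesses per model.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical catalogue: name -> (model function, number of parameters).
CANDIDATE_MODELS = {
    "linear": (lambda x, a, b: a * x + b, 2),
    "quadratic": (lambda x, a, b, c: a * x**2 + b * x + c, 3),
    "exponential": (lambda x, a, b, c: a * np.exp(b * x) + c, 3),
    "hyperbola": (lambda x, a, b: a / x + b, 2),
}

def fit_all(x, y):
    """Fit every candidate model; return (name, params, r2) sorted by R^2."""
    results = []
    ss_tot = np.sum((y - y.mean()) ** 2)
    for name, (func, n_params) in CANDIDATE_MODELS.items():
        try:
            params, _ = curve_fit(func, x, y, p0=np.ones(n_params), maxfev=10000)
        except (RuntimeError, ValueError):
            continue  # the optimizer did not converge for this model
        ss_res = np.sum((y - func(x, *params)) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        if np.isfinite(r2):
            results.append((name, params, r2))
    return sorted(results, key=lambda item: item[2], reverse=True)

# Synthetic, noise-free data generated from a hyperbola.
x = np.linspace(1.0, 5.0, 50)
y = 3.0 / x + 2.0

results = fit_all(x, y)
for name, params, r2 in results:
    print(f"{name:12s} R^2={r2:.6f} params={np.round(params, 4)}")
```

Ranking by R^2 alone favours models with more parameters, so a criterion that penalises complexity may be preferable when the catalogue grows.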
In the end, we found that Eureqa software was able to achieve this. https://www.nutonian.com/products/eureqa/
I'm a beginner to using statsmodels & I'm also open to using other Python based methods of solving my problem:
I have a data set with ~ 85 features some of which are highly correlated.
When I run the OLS method I get a helpful 'strong multicollinearity problems' warning as I might expect.
I've previously run this data through Weka, which as part of the regression classifier has an eliminateColinearAttributes option.
How can I do the same thing, i.e. get the model to choose which attributes to use instead of having them all in the model?
Thanks!
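Note that scipy.stats.linregress fits only a single explanatory variable, so it cannot run a regression on ~85 features; for multiple regression, statsmodels' OLS (which you are already using) is the right tool. Check out this nice example which has a good explanation.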
The eliminateColinearAttributes option in the software you've mentioned is just an algorithm implemented in that software to fight the problem. Here, you need to implement an iterative algorithm yourself: among the highly correlated variables, eliminate the one with the highest p-value, then run the regression again and repeat until the multicollinearity is gone.
There's no single right way here; there are different techniques. It is also good practice to choose manually, from each set of highly correlated variables, which ones to omit so that the model still makes sense.
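One possible reading of the iterative procedure described above is sketched here with NumPy and SciPy only. The helper names ols_pvalues and drop_collinear, the 0.9 correlation threshold, and the synthetic data are all assumptions made for illustration; it is not what Weka does internally.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-sided t-test p-values of the OLS coefficients (intercept excluded)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    cov = (resid @ resid / dof) * np.linalg.pinv(A.T @ A)
    t_stats = beta / np.sqrt(np.diag(cov))
    return (2.0 * stats.t.sf(np.abs(t_stats), dof))[1:]

def drop_collinear(X, y, names, threshold=0.9):
    """Repeatedly drop, from the most correlated pair, the higher p-value feature."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        corr = np.abs(np.corrcoef(X, rowvar=False))
        np.fill_diagonal(corr, 0.0)
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] < threshold:
            break                                  # no strongly correlated pair left
        p = ols_pvalues(X, y)                      # refit after every removal
        drop = i if p[i] >= p[j] else j
        X = np.delete(X, drop, axis=1)
        del names[drop]
    return X, names

# Synthetic data: x2 is almost a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
x3 = rng.normal(size=200)
y = 2.0 * x1 + 3.0 * x3 + rng.normal(size=200)

X_kept, kept = drop_collinear(np.column_stack([x1, x2, x3]), y, ["x1", "x2", "x3"])
print(kept)
```

Pairwise correlation misses collinearity that involves three or more variables, so checking variance inflation factors afterwards may still be worthwhile.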
I'm wondering what the set_weights method of the Maxent class in NLTK is used for (or more specifically how to use it). As I understand, it allows you to manually assign weights to certain features? Could somebody provide a basic example of the type of parameter that would be passed into it?
Thanks
Alex
It apparently allows you to set the coefficient matrix of the classifier. This may be useful if you have an external MaxEnt/logistic regression learning package from which you can export the coefficients. The train_maxent_classifier_with_gis and train_maxent_classifier_with_iis learning algorithms call this function.
If you don't know what a coefficient matrix is, it's the β mentioned in Wikipedia's treatment of MaxEnt.
(To be honest, it looks like NLTK is either leaking implementation details here, or has a very poorly documented API.)
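A minimal sketch of the kind of parameter set_weights expects, assuming a recent NLTK: a sequence with one weight per joint feature of the classifier's encoding. The toy training tokens and the weight value 2.0 are made up for illustration, and the classifier is built by hand here rather than trained.

```python
import numpy as np
from nltk.classify.maxent import BinaryMaxentFeatureEncoding, MaxentClassifier

# Toy (featureset, label) pairs, used only to build the feature encoding.
train_toks = [
    ({"outlook": "sunny"}, "play"),
    ({"outlook": "rainy"}, "stay"),
]
encoding = BinaryMaxentFeatureEncoding.train(train_toks)

# Start from all-zero weights: every label is equally likely.
classifier = MaxentClassifier(encoding, np.zeros(encoding.length()))

# Build a new weight vector: one entry per joint (feature, value, label).
new_weights = np.zeros(encoding.length())
for featureset, label in train_toks:
    for index, _value in encoding.encode(featureset, label):
        new_weights[index] = 2.0  # arbitrary positive weight for seen pairs

classifier.set_weights(new_weights)
prediction = classifier.classify({"outlook": "sunny"})
print(prediction)
```

The vector must have exactly encoding.length() entries, which is why weights exported from an external package would have to be reordered to match NLTK's encoding first.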
I have a human-tagged corpus of over 5000 subject-indexed documents in XML. They vary in size from a few hundred kilobytes to a few hundred megabytes, ranging from short articles to manuscripts. They have all been subject indexed down to the paragraph level. I am lucky to have such a corpus available, and I am trying to teach myself some NLP concepts. Admittedly, I've only begun; thus far I have read only the freely available NLTK book and streamhacker, and skimmed Jacobs(?) NLTK cookbook. I'd like to experiment with some ideas.
It was suggested to me that perhaps I could take bigrams and use naive Bayes classification to tag new documents. I feel as if this is the wrong approach. Naive Bayes is proficient at a true/false sort of relationship, but to use it on my hierarchical tag set I would need to build a new classifier for each tag, nearly 1000 of them. I have the memory and processor power to undertake such a task, but am skeptical of the results. However, I will be trying this approach first, to appease someone's request. I should likely have this accomplished in the next day or two, but I predict the accuracy will be low.
So my question is a bit open ended. Largely because of the nature of the discipline and the general unfamiliarity with my data, it will likely be hard to give an exact answer.
What sort of classifier would be appropriate for this task? Was I wrong, and can naive Bayes be used for more than a true/false sort of operation?
What feature extraction should I pursue for such a task? I am not expecting much from the bigrams.
Each document also contains some citation information, including author(s), the authors' gender (m, f, mix (m&f), and other (Gov't inst et al.)), document type, publication date (16th cent. to current), human analyst, and a few other general elements. I'd also appreciate some useful descriptive tasks to help investigate this data for gender bias, analyst bias, etc., but I realize that is a bit beyond the scope of this question.
What sort of classifier would be appropriate for this task? Was I wrong, and can naive Bayes be used for more than a true/false sort of operation?
You can easily build a multilabel classifier by building a separate binary classifier for each class, that can distinguish between that class and all others. The classes for which the corresponding classifier yields a positive value are the combined classifier's output. You can use Naïve Bayes for this or any other algorithm. (You could also play tricks with NB's probability output and a threshold value, but NB's probability estimates are notoriously bad; only its ranking among them is what makes it valuable.)
what feature extraction should I pursue for such a task
For text classification, tf-idf vectors are known to work well, but you haven't specified what the exact task is. Any metadata on the documents might work as well; try doing some simple statistical analysis. If any feature of the data is more frequently present in some classes than in others, it may be a useful feature.
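The two suggestions above (one binary classifier per tag, fed with tf-idf vectors) can be combined in a few lines of scikit-learn. This is a sketch only: the four toy documents and their tags are invented, and including bigrams via ngram_range is an assumption that mirrors the question rather than a recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy corpus: each document carries one or more subject tags.
docs = [
    "the court ruled on the land dispute",
    "the king raised taxes for the war",
    "a treaty ended the war over the land",
    "the court reviewed the new tax law",
]
tags = [{"law"}, {"history", "economy"}, {"history"}, {"law", "economy"}]

# Turn the tag sets into a 0/1 indicator matrix, one column per tag.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)

# tf-idf over unigrams and bigrams, then one naive Bayes model per tag.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(MultinomialNB()),
)
model.fit(docs, Y)

predicted = model.predict(["the court discussed the war tax"])
predicted_tags = binarizer.inverse_transform(predicted)
print(predicted_tags)
```

With roughly 1000 tags this trains 1000 small models, which is cheap for naive Bayes; the hierarchy of the tag set is ignored in this sketch.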
I understand that you have two tasks to solve here. The 1st one is that you want to tag an article based on its topic(?); since an article can fall into more than one category/class, you have a multi-label classification problem. Several algorithms have been proposed for solving multi-label classification problems; please check the literature. I found this paper quite helpful when I was dealing with a similar problem: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.9401
The 2nd problem you want to solve is to tag the paper with authors, gender, and type of document. This is a multi-class problem: each of these attributes can take more than two possible values, but every document has exactly one value for each of them.
I think as a first step it is important to understand the differences between multi-class and multi-label classification.
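The difference shows up directly in how the targets are encoded. In the sketch below (the document types and topic tags are invented examples), a multi-class target is a single integer per document, while a multi-label target is a row of 0/1 flags per document.

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Multi-class: each document has exactly one document type.
doc_types = ["article", "manuscript", "article"]
type_encoder = LabelEncoder()
y_multiclass = type_encoder.fit_transform(doc_types)
print(y_multiclass)        # one class id per document

# Multi-label: each document can carry any number of topic tags.
topics = [{"history", "law"}, {"law"}, set()]
topic_binarizer = MultiLabelBinarizer()
y_multilabel = topic_binarizer.fit_transform(topics)
print(topic_binarizer.classes_)
print(y_multilabel)        # one 0/1 column per tag
```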