How to improve SVM - Python

I apologize in advance for my bad English.
So, I'm working on my final-year study project, and the brief says to produce a state-of-the-art review of the methods often used to predict churn in telecommunications, and then choose two methods to apply to the data.
It also says to try to add my own contribution to one of the methods.
I chose decision trees and SVM. I would like to add my contribution to the SVM method, but I don't know how. I did some research, and the most common suggestion is cross-validation, but since everyone uses it, would it even count as a contribution?
I also thought about hybridization, but I'm not sure which algorithm would pair best with SVM.
So I wanted to know if you could give me some ideas to explore for improving this algorithm, whether in accuracy, speed, or otherwise.
If I sound like a beginner, that's because I am XD.

I am also a beginner in this field, but I can give you some pointers I've come across.
You can look at newer feature generation (try to research features specific to telecommunications).
Use a different algorithm for imputation (KNN imputation, central imputation).
If you want high accuracy, go for XGBoost.
As this is a churn problem, I would concentrate on recall: a missed churner usually costs more than a false alarm.
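A minimal sketch of how these pointers could fit together, assuming a hypothetical telecom_churn.csv with a binary Churn column; the hyperparameter grid is arbitrary, and this is an illustration rather than a definitive recipe:

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("telecom_churn.csv")                   # hypothetical file name
X = df.drop(columns=["Churn"]).select_dtypes("number")  # numeric features only
y = df["Churn"]                                         # assumed 0/1 labels

# KNN imputation -> scaling -> RBF SVM, tuned by cross-validation on recall,
# since missing an actual churner is usually the costly mistake.
pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])
grid = GridSearchCV(
    pipe,
    param_grid={"svm__C": [0.1, 1, 10, 100],
                "svm__gamma": ["scale", 0.01, 0.001]},
    scoring="recall",
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("cross-validated recall:", grid.best_score_)
```

Swapping the SVC step for xgboost.XGBClassifier() in the same pipeline gives the XGBoost comparison mentioned above.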

Related

Feedback on Data Science LSTM Project

I realize that this is slightly outside the realm of what sort of questions are normally asked here, so please forgive that. I have been tasked with an open-ended technical screening for a job as a data scientist. This is the first job application that has asked for something like this, so I want to make sure that I am submitting really good work. I was given a dataset and asked to identify the problem and how to use machine learning to solve it, give stats on the target feature, pre-process the data, model the data, and interpret the results.
I am looking for feedback on whether I am missing anything huge in my results. High-level feedback is fine. Hopefully some of you are data scientists and have either had to complete a technical screening like this or have had to review one, and can offer some valuable feedback to an up-and-coming data scientist.
Thank you!
Github Link to Project
Have a look at the Mars Express Power Challenge ("Get the data, model and predict the thermal power consumption"), here: https://kelvins.esa.int/mars-express-power-challenge/
The challenge was to take the data and predict the future power consumption of the orbiter, in order to plan how to save energy (in the solar field there is a risk of overheating, and in the solar night a risk of being too cold).
The teams used different approaches; LSTM is probably the one I would choose. But the winning team gave a very detailed explanation of their feature engineering and selection. The point is that what matters is not the tool used but the correct choice of feature extraction and selection.
https://arc.aiaa.org/doi/pdf/10.2514/6.2018-2561
I read both the winning paper and your work; honestly, I prefer your approach.
As you will see if you read the paper, your methodology is quite comparable, but they put the feature-extraction study at the center of the research.
You could strengthen your work by providing more evidence that you picked the right method for the feature extraction. For example, you could apply two FE methods and compare the results, or explain that you chose yours knowing the current state of the art, citing the particular paper that supports it.
You could also add comparative results for ARIMA, VAR, VARMA and your model to substantiate the word "outperform", with references to state-of-the-art papers from the past three years in the field and to recent publications on LSTM for energy-consumption prediction.
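For illustration, a hedged sketch of such a baseline comparison with statsmodels; `series` (the target as a pandas Series) and `lstm_pred` (the LSTM's predictions on the held-out tail) are assumed to exist already, and the split and ARIMA order are arbitrary:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

train, test = series[:-100], series[-100:]       # assumed Series; arbitrary split
arima_fit = ARIMA(train, order=(2, 1, 2)).fit()  # order chosen only for illustration
arima_pred = arima_fit.forecast(steps=len(test))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

print("ARIMA baseline RMSE:", rmse(test, arima_pred))
print("LSTM RMSE:          ", rmse(test, lstm_pred))  # lstm_pred assumed from your model
```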
Your document ends abruptly; one would expect a proper conclusion, as in a regular paper.
That's it.
(Please don't put too much weight on my opinion alone, as I don't consider myself a data scientist :) I will be very proud of myself the day I am able to produce what you have done ;) Thanks for sharing, it was nice to read.)
If I was the evaluator, I would ask questions like,
1) What is the research/business problem?
Suggestion: Begin the report by clearly specifying the question.
2) What are the existing solutions to solve the problem?
Suggestion: Add a brief literature review on existing solutions for similar problems and their results preferably in a tabular format.
3) Briefly elaborate on the descriptive and multivariate properties of the data.
Suggestion: Add descriptive and inferential statistics on the data, including some preliminary hypotheses that can be derived from the variable correlations.
4) Why did you choose this particular approach to solve the problem?
Suggestion: Give credible justification, backed up by quantitative hypothetical example solutions, in favour of the proposed approach.
5) If it's a classification task, I would ask, "What is the baseline accuracy of the model?" And if it's a clustering task, "What is the baseline for cluster purity?"
Suggestion: Find this accuracy from the target variable distribution.
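For example (assuming `y` is a pandas Series of target labels), the majority-class baseline is just the share of the most frequent class:

```python
# Accuracy of always predicting the most frequent class; any model must beat this.
baseline = y.value_counts(normalize=True).max()
print(f"majority-class baseline accuracy: {baseline:.3f}")
```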
Finally, you need to understand why such an open-ended question is asked. There are two possibilities:
(a) The company is new to data science and is unsure of what they are looking for; that is, they either lack the expertise required to evaluate a candidate's skills or are simply unsure of their own requirements. If this is the case, then it's imperative that the report be as simple and detailed as possible. Stay away from throwing jargon around.
OR
(b) The company is experienced in data science and this is a filtering test, designed to filter out the self-proclaimed data-scientist nincompoops who think that chaining together some ready-made solution steps (like preprocessing, dimensionality reduction, modelling) solves a problem. The underlying idea is to figure out the analytical capabilities of a candidate.
Therefore, write the report wisely and ensure nothing is falsified.
Best of luck.

Creating a real estate price index for a given location

I have a dataset with property sales data for a city for the last several years. I am attempting to create a price index, but I am struggling to find any examples in code, or even the same algorithms applied in other sectors. From what I understand, the main algorithms are repeat-sales regression (RSR), Case-Shiller and hedonic regression. Maybe there are other methods? But again, I haven't managed to come across anything online; all the ML work I have looked at so far is aimed at estimating the values of individual properties. I would appreciate it if anyone could suggest something helpful.
Also, what other factors should I consider, and what methods should I look at?
A few thoughts on this very interesting issue:
I don't really understand how or why you would use machine learning for this. You are not trying to predict or to find a pattern, but rather to condense a dataset of high complexity into a single number that stays comparable over time.
As stated previously, the complexity of the reality you are trying to study is extremely high, and there are many, many things that have to be taken into account.
For instance, a long-term index could face the following problem: over a few decades, the average house size can vary significantly. That could drive prices up or down, but it would be caused by a change in the houses' attributes, not in the valuation given by the market. Prices would go up because houses would be better, and your index should account for that.
The construction of the index will force you to take decisions that will probably skew the index in some direction. There is hardly a single best solution to the problem, and different solutions will deal differently with situations like the one I described in the previous point.
Finally, I would recommend doing some reading. Institutions that publish price indexes usually publish their methodologies too, and you can learn a lot from them. I suggest this one by Eurostat. This one by the Spanish National Institute of Statistics is very good and concise, but it is in Spanish.
By the way, you can probably find better answers to this question on Cross Validated.
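For what it's worth, here is a minimal sketch of the hedonic "time dummy" approach named in the question, using statsmodels; the file and column names (sale_date, price, area, bedrooms, age) are hypothetical placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("sales.csv")  # hypothetical columns: sale_date, price, area, bedrooms, age
df["period"] = pd.to_datetime(df["sale_date"]).dt.to_period("Q").astype(str)
df["log_price"] = np.log(df["price"])

# Regress log price on property attributes plus one dummy per time period;
# the exponentiated period coefficients form a quality-adjusted price index.
model = smf.ols("log_price ~ area + bedrooms + age + C(period)", data=df).fit()

index = {"base period": 100.0}  # the first quarter is absorbed into the intercept
for name, coef in model.params.items():
    if name.startswith("C(period)"):
        quarter = name.split("[T.")[1].rstrip("]")
        index[quarter] = 100.0 * float(np.exp(coef))
print(index)
```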

Merging many statistical methods for text classification, starting with an SVM multiclass classifier

Premise: I am not an expert in machine learning, maths, or statistics. I am a linguist and I am entering the world of ML. When answering, please be as explicit as you can.
My problem: I have 3000 expressions containing aspects (or characteristics, or features) that users typically comment on in online reviews. These expressions have been identified and approved by human experts.
Example: “they play a difficult role”
The labels are: Acting (referring to the act of acting and also to actors), Direction, Script, Sound, Image.
The goal: I am trying to classify these expressions according to their aspects.
My system: I am using scikit-learn and Python in a Jupyter environment.
Technique used so far:
I built a bag-of-words matrix (keeping track of the presence/absence of stemmed words for each expression) and applied an SVM multiclass classifier with an RBF kernel and C = 1 (or tuned according to the final accuracy). The code used is the one from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/
The first attempt gave an accuracy of 0.63. When I tried to split the class Script into more labels, accuracy went down to 0.50. I was interested in doing that because some expressions clearly describe the plot or the characters.
I think the problem is due to some words that are shared among these aspects.
I searched for a way to improve the model and found something called a "learning curve". I used the official code provided by the sklearn documentation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
The result looks like the second picture (the right one). I can't tell whether it is good or not.
In addition to this, I would like to:
import the expressions from a text file. For the moment I have just created an array and put the expressions inside it, which I am not comfortable with (see the sketch after this question);
find a way, if possible, to tell the system that some words are very specific/important to an aspect, to help it improve the classification.
How can I do this? I read that in some works researchers have combined several systems. How should I handle this? How can I take the output numbers from the first system and use them in the second one?
I would like to underline that some expressions, verbs, nouns, etc. are used a lot in some contexts and not in others. Some names are definitely names of actors and not directors, for example. In the future I would like to add more linguistic information to the system to try to improve it.
I hope I have expressed myself clearly enough and used appropriate, understandable language.
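Not a full answer, but a minimal sketch of the file-loading wish together with the current bag-of-words SVM setup, assuming a hypothetical tab-separated file with one "expression<TAB>label" pair per line; tf-idf weighting is one cheap way to down-weight words shared across aspects:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical file: one "expression<TAB>aspect" pair per line.
df = pd.read_csv("expressions.txt", sep="\t", names=["text", "aspect"])
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["aspect"], test_size=0.2, stratify=df["aspect"], random_state=0)

pipe = Pipeline([
    ("vec", TfidfVectorizer()),        # down-weights words common to all aspects
    ("svm", SVC(kernel="rbf", C=1)),   # the same classifier described above
])
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))
```

For the second wish, one simple lever is the vectorizer's vocabulary and weights; a domain lexicon (e.g. known actor names) could also be added as extra boolean features.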

Utilising a genetic algorithm to overcome different-sized datasets in a model

So I realise that the question I am asking here is large and complex: a potential solution to variances in the sizes of datasets used within the same model.
In all of my searching through statistical forums and posts, I haven't come across a scientifically sound method of taking into account the type of data that I am encountering, but I have thought up a (novel?) potential solution that accounts perfectly (in my mind) for large and small datasets within the same model.
The proposed method uses a genetic algorithm to alter two numbers defining a relationship between the size of the dataset behind an implied strike rate and the percentage of that implied strike rate to be used, with the target of maximising the agreement between two rank columns of the following CSV (ultra-simplified, but hopefully it demonstrates the principle).
Example data
Date,PupilName,Unique class,Achieved rank,x,y,x/y,Average xy
12/12/2012,PupilName1,UniqueClass1,1,3000,9610,0.312174818,0.08527
12/12/2012,PupilName2,UniqueClass1,2,300,961,0.312174818,0.08527
12/12/2012,PupilName3,UniqueClass1,3,1,3,0.333333333,0.08527
13/12/2012,PupilName1,UniqueClass2,1,2,3,0.666666667,0.08527
13/12/2012,PupilName2,UniqueClass2,2,0,1,0,0.08527
13/12/2012,PupilName3,UniqueClass2,3,0,5,0,0.08527
13/12/2012,PupilName4,UniqueClass2,4,0,2,0,0.08527
13/12/2012,PupilName5,UniqueClass2,5,0,17,0,0.08527
14/12/2012,PupilName1,UniqueClass3,1,1,2,0.5,0.08527
14/12/2012,PupilName2,UniqueClass3,2,0,1,0,0.08527
14/12/2012,PupilName3,UniqueClass3,3,0,5,0,0.08527
14/12/2012,PupilName4,UniqueClass3,4,0,6,0,0.08527
14/12/2012,PupilName5,UniqueClass3,5,0,12,0,0.08527
15/12/2012,PupilName1,UniqueClass4,1,0,0,0,0.08527
15/12/2012,PupilName2,UniqueClass4,2,1,25,0.04,0.08527
15/12/2012,PupilName3,UniqueClass4,3,1,29,0.034482759,0.08527
15/12/2012,PupilName4,UniqueClass4,4,1,38,0.026315789,0.08527
16/12/2012,PupilName1,UniqueClass5,1,12,24,0.5,0.08527
16/12/2012,PupilName2,UniqueClass5,2,1,2,0.5,0.08527
16/12/2012,PupilName3,UniqueClass5,3,13,59,0.220338983,0.08527
16/12/2012,PupilName4,UniqueClass5,4,28,359,0.077994429,0.08527
16/12/2012,PupilName5,UniqueClass5,5,0,0,0,0.08527
17/12/2012,PupilName1,UniqueClass6,1,0,0,0,0.08527
17/12/2012,PupilName2,UniqueClass6,2,2,200,0.01,0.08527
17/12/2012,PupilName3,UniqueClass6,3,2,254,0.007874016,0.08527
17/12/2012,PupilName4,UniqueClass6,4,2,278,0.007194245,0.08527
17/12/2012,PupilName5,UniqueClass6,5,1,279,0.003584229,0.08527
So I have created a tiny model dataset, which contains some good examples of where my current methods fall short and of how I feel a genetic algorithm can be used to fix this. The dataset above contains 6 unique classes, and the ultimate objective of the algorithm is to create the highest possible correspondence between the rank of an adjusted x/y and the achieved rank in column 3 (zero-based referencing). In UniqueClass1 we have two identical x/y values. These are comparatively large x/y values compared with the average (note the average isn't calculated from this dataset), but common sense says that 3000/9610 is more significant, and therefore more likely to deserve an achieved rank of 1, than 300/961. So what I want to do is compute an adjusted x/y that overcomes these differences in dataset size, using a logarithmic-growth relationship defined by the equation:
adjusted_xy = ((1 - exp(-y*α)) * (x/y)) + ((1 - (1 - exp(-y*α))) * average_xy)
where α is the only dynamic parameter.
If I can explain my logic a little and open myself up to (hopefully) constructive criticism: there is an exponential (saturating) relationship between the size of the dataset and the % of x/y contributing to the adjusted x/y. Essentially, the above equation says that as the dataset gets larger, the percentage of the original x/y used in the adjusted x/y gets larger; whatever percentage is left is made up by the average xy. It could hypothetically be 75% x/y and 25% average xy for 300/961, and 95%/5% for 3000/9610, creating an adjusted x/y that clearly reflects the size of the underlying dataset.
To help with understanding: lowering α produces a relationship whereby a larger dataset is required to achieve the same "% of xy contributed"; conversely, increasing α produces a relationship whereby a smaller dataset is required to achieve the same "% of xy contributed".
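A tiny numerical check of that blending weight, with an arbitrary α chosen purely for illustration:

```python
from math import exp

alpha = 0.0005  # arbitrary value, purely for illustration
for x, y in [(300, 961), (3000, 9610)]:
    w = 1 - exp(-alpha * y)  # share of x/y used in adjusted_xy
    print(f"x/y = {x}/{y}: {100 * w:.0f}% from x/y, {100 * (1 - w):.0f}% from average_xy")
```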
So I have explained my logic. I am also open to code snippets to help me overcome the problem. I plan to build a multitude of genetic/evolutionary algorithms in the future and could really use a working example to pick apart and play with, to help my understanding of how to use such capabilities of Python. If additional detail or further clarification about the problem or methods is required, please do ask; I really want to be able to solve this problem and future problems of this nature.
So, after much discussion about the methods available to overcome the problem presented here, I have come to the conclusion that the best method would be a genetic algorithm that iterates α in order to maximise the correspondence between the rank of an adjusted x/y and the achieved rank in column 3. I would greatly appreciate it if anyone were able to help in that department.
So, to clarify, this post is no longer a discussion about methodology:
I am hoping someone can help me produce a genetic algorithm to maximise the homology between the results of the equation
adjusted_xy = ((1 - exp(-y*α)) * (x/y)) + ((1 - (1 - exp(-y*α))) * average_xy)
where adjusted_xy is computed for each row of the CSV. Maximising homology could be achieved by minimising the difference between the rank of the adjusted xy (ranked within each Unique class only) and the Achieved rank.
Minimising this value would maximise the homology and essentially solve the problem presented to me of different-sized datasets. If any more information is required, please ask; I check this post about 20 times a day at the moment, so I should reply rather promptly. Many thanks, SMNALLY.
The problem you are facing sounds to me like the "bias-variance dilemma" from a general point of view. In a nutshell, a more precise model favours variance (sensitivity to changes in a single training set), while a more general model favours bias (the model works for many training sets).
May I suggest not focusing on GAs, but looking at instance-based learning and advanced regression techniques. The Andrew Moore page at CMU is a good entry point, and particularly those slides.
[EDIT]
After a second reading, here is my second understanding:
You have a set of example data with two related attributes X and Y.
You do not want X/Y to dominate when Y is small (a small Y being considered less representative).
As a consequence, you want to weight the examples with an adapted value, adjusted_xy.
You want adjusted_xy to be related to a third attribute R (rank), such that, per class, adjusted_xy is sorted like R.
To do so, you suggest posing it as an optimization problem, searching for the PARAMS of a given function F(X, Y, PARAMS) = adjusted_xy,
with the constraint that D = Distance(achieved rank for this class, rank of adjusted_xy for this class) is minimal.
Your question, at least for me, is in the field of attribute selection/attribute adaptation. (I guess the dataset will later be used for supervised learning.)
One problem that I see in your approach (if I have understood it well) is that, at the end, rank will be highly related to adjusted_xy, which will therefore bring no interesting supplementary information.
That said, I think you surely know how a GA works. You have to:
define the content of the chromosome: this appears to be your α parameter;
define an appropriate fitness function: for one individual, it can be a sum of distances over all examples in the dataset.
As you are dealing with real values, other metaheuristics such as Evolution Strategies (ES) or Simulated Annealing may be better adapted than a GA.
As solving optimization problems is CPU-intensive, you might eventually consider C or Java instead of Python (the fitness function, at least, will be interpreted and thus cost a lot).
Alternatively, I would look at using Y as a weight in some supervised learning algorithm (if supervised learning is the target).
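To make the chromosome/fitness recipe above concrete, here is a hedged sketch over the question's CSV (saved under a hypothetical file name). With a single real-valued gene there is no meaningful crossover, so this is effectively the evolution-strategy flavour mentioned above; population size, mutation scale and iteration count are arbitrary choices:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("pupils.csv")  # hypothetical file holding the example data above

def fitness(alpha):
    """Sum over all rows of |rank of adjusted_xy within class - Achieved rank|."""
    w = 1 - np.exp(-alpha * df["y"])
    adjusted = w * df["x/y"] + (1 - w) * df["Average xy"]
    # Rank within each Unique class; the largest adjusted_xy gets rank 1.
    pred_rank = adjusted.groupby(df["Unique class"]).rank(ascending=False, method="first")
    return float((pred_rank - df["Achieved rank"]).abs().sum())

rng = np.random.default_rng(0)
pop = rng.uniform(0.0, 0.1, size=30)        # initial population of alpha values
for _ in range(100):
    scores = np.array([fitness(a) for a in pop])
    parents = pop[np.argsort(scores)[:10]]  # selection: keep the 10 fittest
    children = np.abs(rng.choice(parents, 20) + rng.normal(0, 0.005, 20))  # mutation
    pop = np.concatenate([parents, children])

best = min(pop, key=fitness)
print("best alpha:", best, "fitness:", fitness(best))
```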
Let's start with the problem: you consider that some features lead to some of your classes (a 'strike'). You take a subset of your data and try to establish a rule for the strikes. You do establish one, but then you notice that the accuracy of your rule depends on the volume of the dataset that was used to establish the strike rate in the first place. You also comment on the effect of some samples in biasing your strike estimate.
The immediate answer is that it looks like you have a lot of variation in your data, and therefore you will, in one way or another, need to collect more data to account for that variation (that is, variation inherent to the problem).
The fact that in some cases the numbers end up as 'unusable' could also be down to outliers: measurements that are 'out of bounds' for a number of reasons, which you would have to find a way to either exclude or re-adjust. But this depends a lot on the context of the problem.
'Strike rates' on their own will not help, but they are perhaps a step in the right direction. In any case, you cannot compare strike rates coming from samples of different sizes, as you have found out. If your problem is purely to determine the size of your sample so that your results conform to some specific accuracy, then I would recommend that you have a look at statistical power and how sample size affects it. But still, to determine the sample size you need to know a bit more about your data, which brings us back to point #1 about the inherent variation.
Therefore, my attempt at an answer is this: if I have understood your question correctly, you are dealing with a classification problem in which you seek to assign a number of items (patients) to a number of classes (types of cancer) on the evidence of some features (the existence of genetic markers, their frequency of appearance, or any other quantity) of these items. But some features might not exist for all items; or there is a core group of features, plus some more that do not appear all the time. The question now is: which classifier do you use to achieve this? Logistic regression was mentioned previously and has not helped. Therefore, what I would suggest is a Naive Bayes classifier. The classifier can be trained with the datasets you used to derive the 'strike rates', which will provide the a-priori probabilities. When the classifier is 'running', it will use the features of new data to construct the likelihood that the patient who provided the data should be assigned to each class.
Perhaps the most common example of such a classifier is the spam-email detector, where the likelihood that an email is spam is judged by the presence of specific words in the email (given a suitable training dataset that provides a good starting point, of course).
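A hedged scikit-learn sketch of that suggestion (the count features and labels below are synthetic placeholders, not your data):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(200, 10))  # placeholder: feature counts per item
y = rng.integers(0, 3, size=200)      # placeholder: class label per item

clf = MultinomialNB()  # class priors are estimated from the training data
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
# A feature that is absent for an item is simply a zero count: it contributes
# nothing to the log-likelihood (Laplace smoothing keeps probabilities nonzero).
```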
Now, in terms of trying this out practically (and since your post is tagged with Python-related tags :) ), I would like to recommend Weka. Weka contains a lot of related functionality, including bootstrapping, which could potentially help you with those differences in dataset sizes. Although Weka is written in Java, Python bindings exist for it too. I would definitely give it a go; the Weka package, book and community are very helpful.
No. Don't use a genetic algorithm.
The bigger the search space of models and parameters, the better your chances of finding a good fit for your data points. But the less this fit will mean. Especially since for some groups your sample sizes are small and therefore the measurements have a high random component to them. This is why, somewhat counterintuitively, it is often actually harder to find a good model for your data after collecting it than before.
You have taken the question to the programmer's lair. This is not the place for it. We solve puzzles.
This is not a puzzle to find the best line through the dots. You are searching for a model that makes sense and brings understanding on the subject matter. A genetic algorithm is very creative at line-through-dot drawing but will bring you little understanding.
Take the problem back where it belongs and ask the statisticians instead.
A good model should be based on the theory behind the data. It will have to match the points on the right side of the graph, where (if I understand you right) most of the samples are. It will be able to explain, in hard probabilities, how likely the deviations on the left are, and tell you whether they are significant or not.
If you do want to do some programming, I'd suggest you take the simplest linear model, add some random noise, and do a couple of simulation runs for a population like your subjects. See if the data looks like the data you're looking at, or if it generally 'looks' different, in which case there really is something nonlinear (and possibly interesting) going on on the left.
I once tackled a similar problem (as similar as problems like this ever are), in which there were many classes and high variance in features per data point. I personally used a random forest classifier (which I wrote in Java). Since your data is highly variable, and therefore hard to model, you could create multiple forests from different random samples of your large dataset and put a control layer on top to classify data against all the forests, then take the best score. I don't write Python, but I found this link, which may give you something to play with:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
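A hedged sketch of that ensemble-of-forests idea with the linked scikit-learn class, on synthetic data; the number of forests, the subsample size, and the "best score" rule are all arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8,
                           random_state=0)  # synthetic stand-in data

rng = np.random.default_rng(0)
forests = []
for seed in range(5):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)  # random subsample
    forests.append(
        RandomForestClassifier(n_estimators=200, random_state=seed).fit(X[idx], y[idx]))

def predict(sample):
    """Control layer: ask every forest, return the single most confident answer."""
    probs = np.stack([f.predict_proba(sample.reshape(1, -1))[0] for f in forests])
    best_forest = probs.max(axis=1).argmax()
    return forests[best_forest].classes_[probs[best_forest].argmax()]

print("predicted:", predict(X[0]), "true:", y[0])
```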
Following Occam's razor, you should select a simpler model for a small dataset, and may want to switch to a more complex model as your dataset grows.
There are no [good] statistical tests that show you whether a given model, in isolation, is a good predictor of your data. Or rather, a test may tell you that a given model's fitness is N, but you can never tell what an acceptable value of N is.
Thus, build several models and pick the one with the better tradeoff of predictive power and simplicity, using the Akaike information criterion (AIC). It has useful properties and is not too hard to understand. :)
There are other tests of course, but AIC should get you started.
For a simple test, check out p-values.
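A small illustration of the AIC comparison with statsmodels, on synthetic data where the truth is linear; the lower AIC should point to the simpler model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)  # synthetic data: the truth is linear

linear = sm.OLS(y, sm.add_constant(np.column_stack([x]))).fit()
cubic = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2, x**3]))).fit()
print("linear AIC:", round(linear.aic, 1))  # should be the lower (better) score
print("cubic AIC: ", round(cubic.aic, 1))   # extra terms are penalised
```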

How can I use text analysis in order to investigate questionnaire responses?

I'm the "programmer" of a team of pupils that aims to investigate satisfaction and general problems in my grammar school. We have a questionary that is built upon a scale from 1-6 and we interpret these answers by a diagram software that I wrote in python.
Now there's a <textarea> at the end of our questionary that one can use as he likes.
I'm currently thinking of ways to make this data usable (we don't want to read more than 800+ answers).
How can I use text analysis in Python to investigate what pupils write?
I was thinking of a way to "tag" any sentence that is written down, like:
I don't like being in school. [wellbeing][negative]
I have way too much homework. [homework][much]
I think there should be more interesting projects. [projects][more]
Are there any usable approaches for obtaining that? Does it make sense to use an existing tokenizer?
Thanks for your help!
Well, I am just throwing in ideas here, but one approach I can think of is:
use a clustering algorithm to cluster the responses first, something like k-means, or do topic modelling using something like LDA;
then apply your tagging approach, doing text analysis to generate frequent/related keywords in each of the clusters/topics you get from step 1.
Why would step 1 be a good idea? Well, in my opinion, if you go around tagging sentences arbitrarily while doing text analysis, you could generate a lot of tags, many of them similar in context. Usability might then go down, because you would still have to analyze loads of tags for each sentence.
Clustering/topic modelling can help reduce this context problem to some degree, and is hence more usable in my opinion.
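A minimal sketch of steps 1 and 2 with scikit-learn; `answers` is assumed to hold the 800+ free-text responses, the cluster count is a guess, and stop_words="english" assumes English text (swap in another stop-word list otherwise):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

answers = load_answers()  # assumed: a list of the 800+ free-text responses

vec = TfidfVectorizer(stop_words="english", max_features=2000)
X = vec.fit_transform(answers)

km = KMeans(n_clusters=8, random_state=0).fit(X)  # cluster count is a guess
terms = vec.get_feature_names_out()
for c in range(km.n_clusters):
    top = km.cluster_centers_[c].argsort()[::-1][:8]  # highest-weight terms
    print(f"cluster {c}:", ", ".join(terms[i] for i in top))
```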
"NLTK Sentiment Analysis" is a good place to start searching. The Natural Language Toolkit is the package for doing text analysis in Python but it is not exactly simple because the task is quite complex. The first few results had some compelling demos but I didn't look at them in detail.
I won't quite answer your question. But if I understand correctly, you have a classic survey (with check boxes, etc.) with a small text-area question at the end...
So you will have about 800+ answers, but I guess the answers will not be too long; usually a few lines or even a few words. I think manual QDA (qualitative data analysis) software will serve you better than an algorithm that won't be perfect. For instance, you can use the open-source RQDA (an R package) or commercial software such as NVivo...
Thanks
This sounds a lot like AI programming, just because of the way they 'tag' questions and responses. Maybe take a look at http://pyaiml.sourceforge.net/ and the Artificial Intelligence Markup Language (AIML). I don't have much experience with it, but you might be able to tweak it to your needs instead of building from scratch.
