SVM; training data doesn't contain target - python

I'm trying to predict whether a fan is going to turn out to a sporting event or not. My data (pandas DataFrame) consists of fan information (demographics, etc.) and whether or not they attended the last 10 matches (g1_attend - g10_attend).
fan_info  age  neighborhood  g1_attend  g2_attend  ...  g1_neigh_turnout
2717      22   downtown      0          1          ...  .47
2219      67   east side     1          1          ...  .78
How can I predict whether they're going to attend game 11 (g11_attend), when g11_attend doesn't exist in the DataFrame?
Originally, I was going to look into applying some of the basic models in scikit-learn for classification, and possibly just add a g11_attend column into the DataFrame. This all has me quite confused for some reason. I'm thinking now that it would be more appropriate to treat this as a time series, and was looking into other models.

You are correct, you can't just add a new category (i.e. output class) to a classifier -- this requires something that handles time series.
But there is a fairly standard technique for using a classifier on time series: asserting (conditional) time independence and using windowing.
In short, we are going to make the assumption that whether or not someone attends a game depends only on variables we have captured, and not on some other time factor (or other factor in general).
I.e. we assume we can shift their history of attended games around the calendar and the probability will stay the same.
This is clearly wrong, but we do it anyway because machine learning techniques can deal with some noise in the data.
It is clearly wrong because some people are going to avoid games in winter because it is too cold, etc.
So now on to the classifier:
We have inputs, and we want just one output.
So the basic idea is that we are going to train a model that, given as input whether they attended the first 9 games, predicts whether they will attend the 10th.
So our inputs are¹ age, neighbourhood, g1_attend, g2_attend, ..., g9_attend,
and the output is g10_attend -- a binary value.
This gives us training data.
Then when it is time to use it for prediction, we shift everything across one game: feed g2_attend where g1_attend was, g3_attend where g2_attend was, ..., and g10_attend where g9_attend was.
And then our prediction output will be for g11_attend.
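A minimal sketch of that windowed train/shift/predict step, assuming the DataFrame is called `df` and has the columns shown above (the classifier choice is mine, and categorical columns such as neighborhood would still need encoding):

```python
from sklearn.linear_model import LogisticRegression

features = ["age"]  # plus any encoded demographic columns you keep
game_cols = [f"g{i}_attend" for i in range(1, 11)]

# Train: g1..g9 as inputs, g10 as the output
X_train = df[features + game_cols[:9]].to_numpy()
y_train = df["g10_attend"]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict g11: shift the window across by one game (g2..g10 as inputs)
X_next = df[features + game_cols[1:10]].to_numpy()
df["g11_attend_pred"] = clf.predict(X_next)
df["g11_attend_prob"] = clf.predict_proba(X_next)[:, 1]
```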
You can also train several models with different window sizes.
E.g. only looking at the last 2 games to predict attendance at the 3rd.
This gives you a lot more training data, since you can use
g1, g2 -> g3 and g2, g3 -> g4, etc., for each row.
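For example, a window of size 2 can be expanded into roughly eight training rows per fan (a sketch, with the same assumed column names as above):

```python
import pandas as pd

window = 2
rows = []
for _, fan in df.iterrows():
    for start in range(1, 11 - window):      # windows g1,g2->g3 up to g8,g9->g10
        rows.append({
            "age": fan["age"],
            "prev_1": fan[f"g{start}_attend"],
            "prev_2": fan[f"g{start + 1}_attend"],
            "target": fan[f"g{start + window}_attend"],
        })

augmented = pd.DataFrame(rows)  # ~8 training rows per fan instead of 1
```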
You could train a bundle of models with different window sizes and merge the results with some ensemble technique.
In particular it is a good idea to train g1,...,g8-> g9,
and then use that to predict g10 (using g2,...,g9 as inputs)
to check if it is working.
I suggest that in future you may like to ask these questions on Cross Validated. While this may be on topic on Stack Overflow, it is more on topic there, and it has a lot more statisticians and machine learning experts.
¹ I suggest discarding fan_id for now as an input. I just don't think it will get you anywhere, but it is beyond the scope of this question to explain why.

Not sure if I'm answering the wrong question or wrongly answering the right question. Need suggestions please [closed]

Thank you for taking the time to read my post.
I'm in a bit of a limbo here and really need some intervention. I have been working on an individual project. It is a supervised learning regression problem. After cleaning, initial analysis, EDA, and feature selection, the chosen dataset has a total of 8 attributes, 7 of them numerical. There are around 1000 observations of top global companies ranked by highest revenue. The attributes are as follows:
Names
Revenues 2022
Rev_Percentage_Change
Profits 2022
Pro_Percentage_Change
Assets
Market_Value
Employees
I made my first blunder when finalizing the research/problem question. Though I still think the question is pretty good, I now feel it cannot be answered with this dataset. The question is, "Predict profit 2023 for the global 1000 companies and rank them as per highest profit earned". The main idea behind this is that profit is a better measure than revenue, and therefore companies should be ranked according to the most profit made for the year. As anyone would understand, revenue is money earned and profit is money saved.
So, to answer this question I did my research on Google Scholar to find similar work but could not find any material on the subject matter. Out of all the material I could find, I shortlisted 10 research papers where there was some sort of prediction involved. Still, I could not find anything on exactly what I was doing, except that, as I realize now, most of the search results were projects on time series analysis. I did explore time series, but it was like I was blindfolded or something, and I did not realize that the question I'm trying to answer is basically a time series problem, one that could easily be solved given the right dataset. For example, if we were to predict the profit of these 1000 companies and we had a dataset containing profits for the last 10 years, then we would have been in business! But the data we have is profit for a single year; even though I was able to generate last year's profit figures with the help of the "Profit_Percentage_Change" column, I still felt that it could not get the job done.
At this point I just went with the flow and wanted to apply the regression models to get somewhere at least. Therefore I applied four regression models: multiple linear regression, Random Forest regressor, Decision Tree regressor, and Support Vector Machine regressor. These models were applied properly with a 70/30 split; I generated the predicted values and evaluated the models with RMSE, MSE, and R-squared scores, as well as with 5-fold cross-validation, with the Random Forest regressor outperforming the other three.
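Roughly, what I did looks like the sketch below (X and y stand in for my prepared feature matrix and the 2022 profit column, and hyperparameters are library defaults):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Multiple Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "SVR": SVR(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print(name, "MSE:", mse, "RMSE:", mse ** 0.5, "R2:", r2_score(y_test, pred))
    print("  5-fold CV R2:", cross_val_score(model, X, y, cv=5).mean())
```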
Even though I predicted the values and even compared the actual and the predicted, of course they were for the present year. Something I gained from this exercise was finding out which of these models performed the best. But it still did not answer my question; in fact I was nowhere near it. I could have predicted the profit for next year with the help of time series and an appropriate dataset, but I was too late for that.
I still tried to find an appropriate dataset but had no luck, so the only choice I am left with now is to change my research question. I have thought it through but I am not able to come up with the right question. The only possible question I can think of is "which supervised machine learning model performs best in predicting the profit of the global 1000 companies?". But why am I doing this? What problem does it solve? Is it a good, solid question, do you think? I'm doubtful.
Guys, I know that Stack Overflow is not a personal coding service, nor should it be. All I'm asking for is some sort of direction so that I can row the boat and get to the right destination. Any help from you would be greatly appreciated. Thank you for taking the time to read and respond.
If I understand what you have done in your experiments, you are passing to the regression models all the data except the profit, which is the output you're trying to predict, but you get the current year's profit and not the future one, as expected given the dataset structure.
To do what you want, i.e. predict next year's profit, you need a time series with the "history" of that data, so each company should have these values for each year from the start. Even the data for only 2021 and 2022 could do the job, though I don't think you can trust the results with such a short series.
If you can create such a dataset you can predict the full record, so revenues, profits, etc. of one year, by passing one, two, or more records from past years to the model.
For this type of application I would use a neural network; a simple MLP with a few layers should give you good results, but you need more data...
A basic structure could be a model that takes all the data you have in one record per company for 3 years and returns one record for the next year, or you can return only one value, the profit, if that is all you need.
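A minimal sketch of such an MLP, assuming you can build X as 3 consecutive yearly records per company flattened into one row and y as the following year's profit (layer sizes are arbitrary):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features_per_year = 7   # the 7 numerical attributes
n_years = 3

model = keras.Sequential([
    layers.Input(shape=(n_years * n_features_per_year,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),      # next year's profit
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# X: (n_samples, n_years * n_features_per_year), y: (n_samples,)
# model.fit(X, y, epochs=100, batch_size=32, validation_split=0.2)
```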
As for the last question, I don't know if it's worth predicting the revenue of one year from the other values of the same year; it's not my field, but I guess it's a much easier task.
Here is an article where you can find some info on how to set up a simple neural network for time series predictions: time series prediction with keras
Hope this helped, good luck with your project.

Build a machine learning/Time Series model to predict Heat waves in next 1 year [closed]

I have often been asked this question:
I have a dataset which contains the following attributes:
Date_Day   Geography  Avg_Temp  Max_Temp  Min_Temp
1/01/2018  Delhi      32(C)     35(C)     28(C)
2/01/2018  Delhi      33(C)     34(C)     29(C)
There are 20 cities, and their per-day min, max, and avg temperatures are given.
The question is:
How can we predict when the next heat wave is going to occur in each city in the coming 1 year?
We can make assumptions as required and add any variables.
I thought of approaching this problem with time series forecasting, but then I have the challenge that I have to forecast a lot of data for 1 year at daily resolution. Also, forecasting will not be good in this case since the forecasting period is very long.
Is there any feasible approach to solving such problems?
Any help would be appreciated.
To be serious research, you may need a lot more information than what you have, and you may need to get some ideas from geographers about what makes heat waves happen. You may even need to use other cities' or areas' attributes to predict each city; the other cities could be in countries very far away. The weather-influencing factors could come from the North Pole, the South Pole, the ocean, etc. Of course, that means a lot more data. We don't know what the relation between those factors and the heat wave is, but that is what we want machine learning to learn for us.
If you just want to train a model and learn to write a machine learning algorithm, it won't be too hard. You can try any RNN. You can try using every 10 days as a sequence to predict the 11th day's temperature; each day in the 10-day window has the four or five attributes you listed above. You can train 3 models to predict the max, min, and average. I don't know what you mean by an actual heat wave, but I think it is easy to define one based on the max, min, and average. If you have many years of data, you may get some decent-looking results; for instance, the heat waves always happen during the summer.
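A minimal sketch of that 10-day-window idea with an RNN (an LSTM here), assuming `series` is a (n_days, n_features) NumPy array of daily features for one city; the architecture, sizes, and target column are arbitrary:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window = 10
X, y = [], []
for i in range(len(series) - window):
    X.append(series[i:i + window])     # 10 days of features
    y.append(series[i + window, 0])    # e.g. the 11th day's max temp (column 0 assumed)
X, y = np.array(X), np.array(y)

model = keras.Sequential([
    layers.Input(shape=(window, series.shape[1])),
    layers.LSTM(32),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2)
```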
Again, I don't think it will be helpful for geographic scientific research. For learning machine learning, it is fine.
The atmosphere is too chaotic to be modeled by simple statistical models!
As an atmospheric scientist, I can tell you in confidence that there is no way you can make reliable weather predictions for the next year based on a purely statistical model, especially in a highly localized area like a city.
You can build a statistical model to understand what events or parameters might be related to extreme weather events, such as ENSO or the location of high/low-pressure centers, but even if your model could technically make predictions, its predictions would be useless because you wouldn't know the values of the predictors in your model. Besides, even if you could precisely predict the predictor variables (which is very, very unlikely), your statistical model would still likely fail in most cases. You can test this by splitting past weather data, such as ERA5, into train/test parts to see if you can predict an existing heat wave using the predictor variables over a city. I would be surprised if your model were more successful than a random guess. However, you could get some meaningful results if you take an average over a much larger area than a city, such as a country like France, and over a longer period of time, like a month or an entire season, provided that you already know precisely the state of the atmosphere for the period of prediction.
As an example, such a model might give you an idea of how many heatwaves you might expect to find in your data over southern Europe for the entire 2004 summer. Still, such an analysis wouldn't be useful other than theoretical reasons or climate change perspective since you still wouldn't know the values of predictors for a future time if you stick with a statistical model.
That being said, there are physically based weather/climate models that can be used to predict the future. For instance, WRF is a physically-based (not statistical) atmospheric model that is used to forecast the weather for the next few days with a very high temporal and spatial resolution. It can also be used as a climate model to make climate projections that are only meaningful over like a decade long average and relatively larger area than a city.
If you feel like I sound too discouraging then that's good! Because I am indeed trying to discourage you by all means from trying to predict the future heatwaves in a city using a purely statistical model. Unless you would like to learn from your own mistakes and have days of spare time to spend for only educational purposes, but not for actually achieving real-life applicable results.

Predicting customers' intent

I got this Prospects dataset:
ID     Company_Sector         Company_size  DMU_Final  Joining_Date  Country
65656  Finance and Insurance  10            End User   2010-04-13    France
54535  Public Administration  1             End User   2004-09-22    France
and Sales dataset:
ID     linkedin_shared_connections  online_activity  did_buy  Sale_Date
65656  11                           65               1        2016-05-23
54535  13                           100              1        2016-01-12
I want to build a model which assigns to each prospect in the Prospects table the probability of becoming a customer. The model will predict whether a prospect is going to buy and return that probability. The Sales table gives info about 2015 sales. My approach: the 'did_buy' column should be the label in the model, because 1 represents that the prospect bought in 2016 and 0 means no sale. Another interesting column is online_activity, which ranges from 5 to 685; the higher it is, the more active the prospect is about the product. So I'm thinking of maybe building a Random Forest model and then somehow putting the probability for each prospect in a new 'intent' column. Is a Random Forest an efficient model in this case, or should I use another one? And how can I apply the model results to the new 'intent' column for each prospect in the first table?
Well first, please see the How to Ask and On-topic guidelines. This is more of a consulting question than a practical or specific one; maybe a more appropriate place is a machine learning community.
TL;DR: Random forests are nice but seem inappropriate here due to the unbalanced data. You should read about recommender systems and more fashionable, well-performing models like Wide and Deep.
An answer depends on: How much data do you have? What data is available during inference? Could you see the current "online_activity" attribute of the potential sale before the customer buys? Many such questions may change the whole approach that fits your task.
Suggestion:
Generally speaking, this is the kind of business where you usually deal with very unbalanced data: a low number of "did_buy"=1 against a huge number of potential customers.
On the data science side, you should define a valuable metric for success that can be mapped to money as directly as possible. Here, it seems that taking action by advertising to or approaching the more probable customers can raise "did_buy" / "was_approached", which is a great metric for success. Over time, you succeed if you raise that number.
Another thing to take into account is that your data may be sparse. I do not know how many buys you usually get, but it may be that you have only 1 from each country, etc. That should also be taken into consideration, since a simple random forest can easily end up targeting such a column in most of its random models, and overfitting will become a big issue. Decision trees suffer from unbalanced datasets. However, taking the probability of each label in the leaf, instead of a hard decision, can sometimes be helpful for simple interpretable models, and it reflects the unbalanced data. To be honest, I do not truly believe this is the right approach.
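For completeness, a minimal sketch of that leaf-probability idea (not a recommendation; it assumes the two tables are already joined on ID into a frame called `data`, and `class_weight='balanced'` is my own addition for the imbalance):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = pd.get_dummies(
    data.drop(columns=["ID", "did_buy", "Sale_Date", "Joining_Date"]))
labels = data["did_buy"]

rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
rf.fit(features, labels)

# Leaf probabilities rather than hard 0/1 decisions
data["intent"] = rf.predict_proba(features)[:, 1]
```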
If I were you:
I would first embed the Prospects columns into a vector by:
Converting categories to random vectors (one per category) or one-hot encoding.
Normalizing or bucketizing company sizes into numbers that fit the prediction model (next).
The same idea for dates. Here, maybe the year can be problematic, but months/days should be useful.
Country is definitely categorical; maybe add another "unknown" country class.
Then,
I would use a model that can actually be optimized according to different costs. Logistic regression is a wide one, a deep neural network is another option, or see Google's Wide and Deep for a combination.
Set the cost to be my golden number (the money metric in terms of labels), or something as close as possible.
Run the experiment (a sketch of these steps follows below).
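A minimal sketch of the encoding plus a cost-aware linear model, under assumptions: `data` is the Prospects table joined with the did_buy label, column names are taken from the question, and `class_weight='balanced'` stands in for a proper money-based cost:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

categorical = ["Company_Sector", "DMU_Final", "Country"]
numerical = ["Company_size", "linkedin_shared_connections", "online_activity"]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),
])

model = Pipeline([
    ("prep", preprocess),
    # class_weight='balanced' is one crude stand-in for an asymmetric cost
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X = data[categorical + numerical]
y = data["did_buy"]
model.fit(X, y)

# Probability of becoming a customer, written back as the 'intent' column
data["intent"] = model.predict_proba(X)[:, 1]
```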
Finally,
Inspect my results and understand why it failed.
Suggest another model/feature
Repeat.
Go eat lunch.
Ask a bunch of data questions.
Try to answer at least some.
Discover new interesting relations in the data.
Suggest something interesting.
Repeat (tomorrow).
Of course there is a lot more to it than just the above, but that is for you to discover in your data and business.
Hope I helped! Good luck.

How to do sequence labeling with an unlabeled dataset

I have 1000 text files containing discharge summaries for patients.
SAMPLE_1
The patient was admitted on 21/02/99. he appeared to have pneumonia at the time
of admission so we empirically covered him for community-acquired pneumonia with
ceftriaxone and azithromycin until day 2 when his blood cultures grew
out strep pneumoniae that was pan sensitive so we stopped the
ceftriaxone and completed a 5 day course of azithromycin. But on day 4
he developed diarrhea so we added flagyl to cover for c.diff, which
did come back positive on day 6 so he needs 3 more days of that…” this
can be summarized more concisely as follows: “Completed 5 day course
of azithromycin for pan sensitive strep pneumoniae pneumonia
complicated by c.diff colitis. Currently on day 7/10 of flagyl and
c.diff negative on 9/21.
SAMPLE_2
The patient is an 56-year-old female with history of previous stroke; hypertension;
COPD, stable; renal carcinoma; presenting after
a fall and possible syncope. While walking, she accidentally fell to
her knees and did hit her head on the ground, near her left eye. Her
fall was not observed, but the patient does not profess any loss of
consciousness, recalling the entire event. The patient does have a
history of previous falls, one of which resulted in a hip fracture.
She has had physical therapy and recovered completely from that.
Initial examination showed bruising around the left eye, normal lung
examination, normal heart examination, normal neurologic function with
a baseline decreased mobility of her left arm. The patient was
admitted for evaluation of her fall and to rule out syncope and
possible stroke with her positive histories.
I also have a CSV file which is 1000 rows x 5 columns. Each row has information entered manually for each of the text files.
So for example for the above two files, someone has manually entered these records in the csv file:
Sex, Primary Disease, Age, Date of admission, Other complications
M, Pneumonia, NA, 21/02/99, Diarrhea
F, (Hypertension, stroke), 56, NA, NA
My question is:
How do I represent/use this text:labels information for a machine learning algorithm?
Do I need to do some manual labelling around the areas of interest in all 1000 text files?
If yes, then how, and which method should I use? (i.e. something like <ADMISSION>was admitted on 21/02/99</ADMISSION>,
<AGE>56-year-old</AGE>)
So basically, how do I use this text:labels data to automate the filling in of the labels?
As far as I can tell, the point is not to mark up the texts but to extract the information represented by the annotations. This is an information extraction problem, and you should read up on techniques for it. The CSV file contains the information you want to extract (your "gold standard"), so you should start by splitting it into training (90%) and testing (10%) subsets.
There is a named entity recognition task in there: recognize diseases, numbers, dates, and gender. You could use an off-the-shelf chunker, or find an annotated medical corpus and use it to train one. You can also use a mix of approaches; spotting words that reveal gender is something you could hand-code pretty easily, for example. Once you have all these words, you need some more work, for example, to distinguish the primary disease from the symptoms, the age from other numbers, and the date of admission from any other dates. This is probably best done as a separate classification task.
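For instance, the hand-coded gender spotting could start as simply as this toy sketch (my own illustrative rules, not a recommended clinical approach):

```python
import re

GENDER_CUES = {
    "M": {"he", "him", "his", "male", "man", "gentleman"},
    "F": {"she", "her", "hers", "female", "woman", "lady"},
}

def guess_gender(text: str) -> str:
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {label: sum(tokens.count(w) for w in cues)
              for label, cues in GENDER_CUES.items()}
    if counts["M"] == counts["F"]:
        return "NA"                      # no evidence either way
    return max(counts, key=counts.get)

# Dates can be spotted with a simple pattern first, then the date of
# admission disambiguated from other dates by a classifier, as described above.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
```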
I recommend you now read through the nltk book, chapter by chapter, so that you have some idea of what the available techniques and tools are. It's the approach that matters, so don't get bogged down in comparisons of specific machine learning engines.
I'm afraid the algorithm that fills in the gaps has not yet been invented. If the gaps were strongly correlated or had some sort of causality, you might be able to model that with some sort of Bayesian model. Still, with the amount of data you have, this is pretty much impossible.
Now, on the more practical side of things, you can take two approaches:
Treat the problem as a document-level task, in which case you can just take all rows with a label, train on them, and infer the labels/values of the rest. You should look at Naïve Bayes, multi-class SVM, MaxEnt, etc. for the categorical columns and linear regression for predicting the numerical values (a rough sketch of this approach follows after the two options).
Treat the problem as an information extraction task, in which case you have to add the annotation you mentioned inside the text and train a sequence model. You should look at CRFs, structured SVMs, HMMs, etc. Actually, you could also look at systems that adapt multiclass classifiers to sequence labeling tasks, e.g. SVMTool for POS tagging (it can be adapted to most sequence labeling tasks).
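A rough sketch of approach 1 for one categorical column such as "Primary Disease" (file paths and column names are assumptions based on the question):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

labels = pd.read_csv("labels.csv")                         # the 1000-row CSV
texts = [open(f"summaries/{i}.txt").read() for i in range(len(labels))]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels["Primary Disease"], test_size=0.1, random_state=0)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("nb", MultinomialNB()),
])
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```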
Now, about the problems you will face: in approach 1 it is very unlikely that you will predict the date of the record with any algorithm. It might be possible to roughly predict the patient's age, as this is something that usually correlates with diseases, etc. And it's very, very unlikely that you will even be able to set up the disease column as an entity extraction task.
If I had to solve your problem, I would probably pick approach 2, which is IMHO the correct approach, but it is also quite a bit of work. In that case, you will need to create the markup annotations yourself. A good starting point is an annotation tool called brat. Once you have your annotations, you can develop a classifier in the style of CoNLL-2003.
What you are trying to achieve is quite ambitious, especially with 1000 records. I think (depending on your data) you may be better off using ready-made products instead of building them yourself. There are open source and commercial products that you might be able to use -- lexigram.io has an API, and MetaMap and Apache cTAKES are state-of-the-art open source tools for clinical entity extraction.

Bayes Classifier Training set

I am working on a simple naive bayes classifier and I had a conceptual question about it.
I know that the training set is extremely important, so I wanted to know what constitutes a good training set in the following example. Say I am classifying web pages and deciding whether they are relevant or not. This decision is based on the probabilities of certain attributes being present on the page: certain keywords that increase the relevancy of the page. The keywords are apple, banana, and mango. The relevant/irrelevant score is per user. Assume that a user marks a page relevant or irrelevant with equal likelihood.
Now, for the training data, to get the best training for my classifier, would I need the same number of relevant results as irrelevant results? Do I need to make sure that each user has relevant and irrelevant results present to make a good training set? What do I need to keep in mind?
This is a slightly endless topic, as there are millions of factors involved. Python is a good example, as it drives much of Google (for all I know). And this brings us to the very beginnings of Google: there was an interview with Larry Page some years ago in which he spoke about search engines before Google. For example, when he typed the word "university", the first result he found had the word "university" a few times in its title.
Going back to naive Bayes classifiers, there are a few very important key factors: assumptions and pattern recognition. And relations, of course. For example, you mentioned apples; that could mean a few different things. For example:
Apple: if eating, vitamins, and shape are present, we assume that we are most likely talking about a fruit.
If we are mentioning electronics, screens, maybe Steve Jobs, that should be obvious.
If we are talking about religion, God, gardens, snakes, then it must have something to do with Adam and Eve.
So depending on your needs, you could have basic segments of data that each of these falls into, or a complex structure containing far more details. So yes, you base most of those on plain assumptions. And based on those, you can create more complex patterns for further recognition: Apple, iPod, iPad have similar patterns in their names, contain similar keywords, and mention certain people, so they are most likely related to each other.
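A toy sketch of that keyword-driven idea with a naive Bayes classifier (the training sentences and labels are made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_texts = [
    "apple vitamins eating healthy fruit shape",
    "banana mango fruit eating vitamins",
    "apple screens electronics Steve Jobs iPad",
    "iPod iPad electronics apple devices",
    "garden snake God religion Adam Eve apple",
]
train_labels = ["fruit", "fruit", "electronics", "electronics", "religion"]

clf = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])
clf.fit(train_texts, train_labels)

print(clf.predict(["apple and mango vitamins"]))        # likely 'fruit'
print(clf.predict(["new apple iPad with big screen"]))  # likely 'electronics'
```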
Irrelevant data is very hard to spot. At this very point you are probably assuming that I own multiple Apple devices and am writing this on a large iMac, while that couldn't be further from the truth. So it would be a very wrong assumption to begin with. The classifiers themselves must do very good segmentation and analysis before jumping to exact conclusions.
