I've got a dataset with more than 10 features describing football players' skills. Each row also contains a price value. I would like to predict the price by specifying the features (skills). Initially I used linear regression, but it doesn't look like a good choice because:
price is high for a goalkeeper only if three skills (pace, goalkeeper, passing) are high
price is high for a defender if defender, pace, playmaker and technique are high
price is high for a striker if striker, technique and pace are high together
etc. So it's not just a linear dependency, because not all skills should be considered for goalkeepers, defenders, etc.; a specific combination of skills makes the price high. I don't know how to tackle this and predict the price from all 10+ skills. Do you have any ideas? Thanks!
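For context, this kind of position-dependent interaction is exactly what tree-based models capture automatically, since a tree can split on position first and then on the skills relevant to that position. A minimal sketch on synthetic data (my own illustration using scikit-learn; all column names and the price formula are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: a position flag plus a few skill columns (0-100 scale).
is_goalkeeper = rng.integers(0, 2, n)
pace = rng.uniform(0, 100, n)
goalkeeping = rng.uniform(0, 100, n)
passing = rng.uniform(0, 100, n)

# Price depends on different skills depending on position: an interaction
# a plain linear model cannot represent without manual interaction terms.
price = np.where(is_goalkeeper == 1,
                 goalkeeping * 0.8 + passing * 0.2,
                 pace * 0.6 + passing * 0.4) + rng.normal(0, 2, n)

X = np.column_stack([is_goalkeeper, pace, goalkeeping, passing])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, price)

# Predict the price of a hypothetical goalkeeper with strong goalkeeping.
pred = model.predict([[1, 40.0, 95.0, 70.0]])[0]
print(pred)
```

Gradient boosting (e.g. scikit-learn's GradientBoostingRegressor or XGBoost) works the same way and is often a stronger choice on tabular data like this.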
I'm interested in finding 50+ similar samples within a dataset of 3M+ rows and 200 columns
Consider we've got a .csv database of vehicles. Every row is one car, and the columns are features like brand, mileage, engine size, etc.
brand    year bin     engine bin    mileage
Ford     2014-2016    1-2           20K-30K
The procedure to automate:
When I receive a new sample, I want to find 50+ similar ones. If I can't find exactly the same, I can drop/broaden some information. For example, the same model of Ford between 2012 and 2016 is nearly the same car, so I would expand the search with a bigger year bin. I expect that if I broaden enough categories, I will always find the required population.
After this, I get a "search query" like the one below, which returns 50+ samples, so it is maximally precise yet big enough to observe the mean, variance, etc.
brand    year bin     engine bin    mileage
Ford     2010-2018    1-2           10K-40K
Is there anything like this already implemented?
I've tried k-means clustering the vehicles on these features, but it isn't precise enough and isn't easily interpretable for people without a data science background. I think distance-based metrics can't learn "hard" constraints like never searching across different brands. But maybe there is a way of weighting features?
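One simple way to automate the broadening described above is a loop that filters on all constraints and relaxes them one at a time until enough rows match. A sketch in pandas (it drops a whole constraint rather than widening a bin, but bin-widening follows the same loop; column names follow the example and the toy data is invented):

```python
import pandas as pd

def find_similar(df, query, relax_order, min_rows=50):
    """Filter on all query constraints, then drop them one by one
    (in the given relaxation order) until enough rows match."""
    constraints = dict(query)
    while True:
        mask = pd.Series(True, index=df.index)
        for col, allowed in constraints.items():
            mask &= df[col].isin(allowed)
        result = df[mask]
        if len(result) >= min_rows or not constraints:
            return result, constraints
        # Relax the least important remaining constraint.
        for col in relax_order:
            if col in constraints:
                del constraints[col]
                break

# Toy usage with columns from the question (values are made up).
df = pd.DataFrame({
    "brand": ["Ford"] * 60 + ["Opel"] * 40,
    "year_bin": ["2014-2016"] * 30 + ["2010-2013"] * 70,
})
query = {"brand": ["Ford"], "year_bin": ["2014-2016"]}
rows, used = find_similar(df, query, relax_order=["year_bin", "brand"])
print(len(rows), used)
```

The exact match (Ford, 2014-2016) yields only 30 rows here, so the loop drops the year constraint and returns the 60 Ford rows; the returned `used` dict tells you which constraints survived, which keeps the result interpretable.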
I'm happy to receive any suggestions!
I am working on an insurance-domain use case to predict whether an existing customer will buy a second insurance policy or not. I have a few personal details of the customer in different categories, like marital status, smoker (yes or no), age (young, adult, senior citizen), gender (male/female), and a few continuous variables like premium paid and sum insured.
My target is to use this mixed set of categorical and continuous variables to predict the class (1 = will buy a second policy, 0 = will not buy a second policy). So how can I find/compute the correlations in this dataset and pick only the significant variables to use in a logistic regression for classification?
I would appreciate it if someone could provide articles or links to similar work done in Python.
For this problem, buying a second policy is more of a probabilistic event than a deterministic one: the question is how likely customer A is to buy another policy, not a yes/no certainty.
First, you need a hypothesis. Buying a second policy is your dependent variable (as the name says, it depends on the values of other variables); this is the Y of your equation. Which factors do you believe will lead a customer to acquire another policy?
Based on your experience in the insurance field, you might say customers older than X, or who have been clients for more than Y years, or of gender Z, and so on. These are your independent variables, the X of your equation.
If you really want to work with Python for this, check https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares, but if it were me, I would start in Excel and switch to Python if things get more complex.
For your categorical data, you can assign numeric values, e.g. gender: 1 for male and 0 for female. See this link for more information: https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
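Putting the pieces together, here is a minimal sketch of encoding mixed categorical/continuous features and fitting a logistic regression with scikit-learn (the column names and toy values are invented stand-ins for the question's variables):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data shaped like the question's columns (values invented).
df = pd.DataFrame({
    "marital_status": ["married", "single", "married", "single"] * 25,
    "smoker": ["yes", "no", "no", "yes"] * 25,
    "age_band": ["young", "adult", "senior", "adult"] * 25,
    "premium_paid": [1200.0, 800.0, 1500.0, 600.0] * 25,
    "bought_second_policy": [1, 0, 1, 0] * 25,
})

X = df.drop(columns="bought_second_policy")
y = df["bought_second_policy"]

# One-hot encode the categoricals, scale the continuous column.
pre = ColumnTransformer([
    ("cat", OneHotEncoder(), ["marital_status", "smoker", "age_band"]),
    ("num", StandardScaler(), ["premium_paid"]),
])
clf = Pipeline([("prep", pre), ("model", LogisticRegression())]).fit(X, y)

# Predicted probability of buying a second policy for one customer.
proba = clf.predict_proba(X.iloc[[0]])[0, 1]
print(round(proba, 3))
```

For feature selection, the fitted coefficients (or scikit-learn's `SelectKBest`) can then be inspected to keep only the variables that carry signal.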
I have often been asked this question:
I have a dataset which contains the following attributes:
Date_Day Geography Avg_Temp Max_Temp Min_Temp
1/01/2018 Delhi 32(C) 35(C) 28(C)
2/01/2018 Delhi 33(C) 34(C) 29(C)
There are 20 cities, and their per-day min, max and avg temperatures are given.
The question is:
How can we predict when the next heat wave in each city is going to occur in the coming year?
We can make assumptions as required and add any variables.
I thought of approaching this problem with time series forecasting, but the challenge is that I would have to forecast a full year of daily values per city, and forecasts over such a long horizon are unlikely to be good.
Is there any feasible approach to solving such problems?
Any help would be appreciated.
For serious research, you would need a lot more information than you have, and you would probably need input from geographers about what drives heat waves. You might even need attributes from other cities or areas to predict each city, and those could be in countries very far away: the influences could come from the North Pole, the South Pole, the oceans, etc. Of course that means a lot more data. We don't know the relation between these influencing factors and heat waves, but that is exactly what we want machine learning to learn for us.
If you just want to train a model and practice writing a machine learning algorithm, it won't be too hard. You can try any RNN; for example, use every 10 days as a sequence to predict the 11th day's temperature, where each day in the window has the four or five attributes you listed above. You can train three models to predict max, min and average. I don't know what exactly you mean by a heat wave, but I think it is easy to define one based on max, min and average. If you have many years of data, you may get some decent-looking results; for instance, heat waves always happen during the summer.
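The windowing scheme described above (10 days in, day 11 out) can be sketched as follows; the daily temperature series here is synthetic, and any sequence model could consume the resulting pairs:

```python
import numpy as np

def make_windows(series, window=10):
    """Turn a daily series into (X, y) pairs: each row of X is `window`
    consecutive days, and y is the following day's value."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Toy daily max temperatures for one city: a seasonal sine wave.
temps = np.sin(np.arange(365) * 2 * np.pi / 365) * 15 + 25
X, y = make_windows(temps)
print(X.shape, y.shape)  # (355, 10) (355,)
```

With real data, each window row would hold all four or five daily attributes instead of a single value, and X/y would feed an RNN (or any regressor) directly.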
Again, I don't think it will be useful for scientific geography research, but for learning machine learning it's fine.
The atmosphere is too chaotic to be modeled by simple statistical models!
As an atmospheric scientist, I can tell you with confidence that there is no way to make reliable weather predictions for the next year based on a purely statistical model, especially for a highly localized area like a city.
You can build a statistical model to understand what events or parameters might be related to extreme weather events, such as ENSO or the location of high/low-pressure centers, but even if your model could technically make predictions, those predictions would be useless because you wouldn't know the future values of the predictors in your model. Besides, even if you could precisely predict the predictor variables (which is very, very unlikely), your statistical model would still likely fail in most cases. You can test this by splitting past weather data, such as ERA5, into train/test parts and seeing whether you can predict an existing heat wave over a city using your predictor variables; I would be surprised if your model were more successful than a random guess. However, you could get some meaningful results if you average over a much larger area than a city, such as a country like France, and over a longer period like a month or an entire season, provided that you already know precisely the state of the atmosphere for the prediction period.
As an example, such a model might give you an idea of how many heatwaves you might expect to find in your data over southern Europe for the entire 2004 summer. Still, such an analysis wouldn't be useful other than theoretical reasons or climate change perspective since you still wouldn't know the values of predictors for a future time if you stick with a statistical model.
That being said, there are physically based weather/climate models that can be used to predict the future. For instance, WRF is a physically-based (not statistical) atmospheric model that is used to forecast the weather for the next few days with a very high temporal and spatial resolution. It can also be used as a climate model to make climate projections that are only meaningful over like a decade long average and relatively larger area than a city.
If you feel like I sound too discouraging, then that's good! I am indeed trying to discourage you by all means from trying to predict future heat waves in a city using a purely statistical model, unless you would like to learn from your own mistakes and have days of spare time to spend purely for educational purposes rather than for achieving real-life applicable results.
I've got this Prospects dataset:
ID Company_Sector Company_size DMU_Final Joining_Date Country
65656 Finance and Insurance 10 End User 2010-04-13 France
54535 Public Administration 1 End User 2004-09-22 France
and Sales dataset:
ID linkedin_shared_connections online_activity did_buy Sale_Date
65656 11 65 1 2016-05-23
54535 13 100 1 2016-01-12
I want to build a model that assigns to each prospect in the Prospects table the probability of becoming a customer: it should predict whether a prospect is going to buy and return the probability. The Sales table gives info about 2015 sales.

My approach: the 'did_buy' column should be the label in the model, because 1 means the prospect bought in 2016 and 0 means no sale. Another interesting column is online_activity, which ranges from 5 to 685; the higher it is, the more engaged the prospect is with the product. So I'm thinking of fitting a random forest model and then somehow putting the probability for each prospect in a new 'intent' column.

Is a random forest an efficient model in this case, or should I use another one? And how can I write the model's results into the new 'intent' column for each prospect in the first table?
Well, first, please see the How to Ask and on-topic guidelines. This is more of a consulting question than a practical or specific one; a more appropriate topic might be machine learning.
TL;DR: Random forests are nice but seem inappropriate here due to unbalanced data. You should read about recommender systems and more fashionable well-performing models like Wide and Deep.
An answer depends on several things: How much data do you have? What data is available during inference (could you see the current "online_activity" attribute of a potential sale before the customer buys)? Many such questions may change the whole approach that fits your task.
Suggestion:
Generally speaking, this is the kind of business where you usually deal with very unbalanced data: a low number of "did_buy" = 1 against a huge number of potential customers.
On the data science side, you should define a valuable metric for success that maps to money as directly as possible. Here, the ratio "did_buy" / "was_approached" (conversions among the customers you advertised to or approached) seems like a great success metric. Over time, you succeed if you raise that number.
Another thing to take into account is that your data may be sparse. I don't know how many buys you usually get, but you may have only one per country, etc. That should also be taken into consideration, since a simple random forest can easily target such a column in most of its random trees, and overfitting will become a big issue; decision trees suffer from unbalanced datasets. However, taking the probability of each label at the leaf, instead of a hard decision, can sometimes help for simple interpretable models and reflects the imbalance. To be honest, I don't truly believe this is the right approach.
If I were you:
I would first embed the Prospects columns to a vector by:
Converting categories to random vectors (for each category) or one-hot encoding.
Normalizing or bucketizing company sizes into numbers that fit the prediction model (see next).
The same idea for dates; here, the year may be problematic, but months/days should be useful.
Country is definitely categorical; maybe add another "unknown" country class.
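The embedding steps above might look like this in pandas (a sketch: column names follow the Prospects table, while the bin edges and bucket labels are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Company_Sector": ["Finance and Insurance", "Public Administration"],
    "Company_size": [10, 1],
    "Joining_Date": pd.to_datetime(["2010-04-13", "2004-09-22"]),
    "Country": ["France", None],
})

# One-hot encode the sector.
sector = pd.get_dummies(df["Company_Sector"], prefix="sector")

# Bucketize company size into a small number of bins.
df["size_bucket"] = pd.cut(
    df["Company_size"],
    bins=[0, 5, 50, 500, float("inf")],
    labels=["micro", "small", "medium", "large"],
)

# Month and day from the date; the year is dropped.
df["join_month"] = df["Joining_Date"].dt.month
df["join_day"] = df["Joining_Date"].dt.day

# Missing countries go to an explicit "unknown" class.
df["Country"] = df["Country"].fillna("unknown")

print(df[["size_bucket", "join_month", "Country"]])
```

The resulting numeric/one-hot columns can be concatenated into the feature matrix fed to whichever model is chosen next.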
Then,
I would use a model that can actually be optimized according to different costs. Logistic regression is a wide model; a deep neural network is another option; or see Google's Wide and Deep for a combination.
Set the cost to be my golden number (the money-metric in terms of labels), or something as close as possible.
Run experiment
Finally,
Inspect my results and why it failed.
Suggest another model/feature
Repeat.
Go eat lunch.
Ask a bunch of data questions.
Try to answer at least some.
Discover new interesting relations in the data.
Suggest something interesting.
Repeat (tomorrow).
Of course there is a lot more to it than the above, but that is for you to discover in your data and business.
Hope I helped! Good luck.
I have 1000 text files containing discharge summaries for patients.
SAMPLE_1
The patient was admitted on 21/02/99. he appeared to have pneumonia at the time
of admission so we empirically covered him for community-acquired pneumonia with
ceftriaxone and azithromycin until day 2 when his blood cultures grew
out strep pneumoniae that was pan sensitive so we stopped the
ceftriaxone and completed a 5 day course of azithromycin. But on day 4
he developed diarrhea so we added flagyl to cover for c.diff, which
did come back positive on day 6 so he needs 3 more days of that…” this
can be summarized more concisely as follows: “Completed 5 day course
of azithromycin for pan sensitive strep pneumoniae pneumonia
complicated by c.diff colitis. Currently on day 7/10 of flagyl and
c.diff negative on 9/21.
SAMPLE_2
The patient is an 56-year-old female with history of previous stroke; hypertension;
COPD, stable; renal carcinoma; presenting after
a fall and possible syncope. While walking, she accidentally fell to
her knees and did hit her head on the ground, near her left eye. Her
fall was not observed, but the patient does not profess any loss of
consciousness, recalling the entire event. The patient does have a
history of previous falls, one of which resulted in a hip fracture.
She has had physical therapy and recovered completely from that.
Initial examination showed bruising around the left eye, normal lung
examination, normal heart examination, normal neurologic function with
a baseline decreased mobility of her left arm. The patient was
admitted for evaluation of her fall and to rule out syncope and
possible stroke with her positive histories.
I also have a CSV file which is 1000 rows × 5 columns. Each row has information entered manually for each text file.
So for example for the above two files, someone has manually entered these records in the csv file:
Sex, Primary Disease,Age, Date of admission,Other complications
M,Pneumonia,NA,21/02/99,Diarrhea
F,"(Hypertension, stroke)",56,NA,NA
My question is:
How do I represent/use this text:labels information in a machine learning algorithm?
Do I need to do some manual labelling around the areas of interest in all 1000 text files?
If yes, how and with which method? (e.g. <ADMISSION>was admitted on 21/02/99</ADMISSION>,
<AGE>56-year-old</AGE>)
So basically, how do I use this text:labels data to automate the filling of the labels?
As far as I can tell, the point is not to mark up the texts but to extract the information represented by the annotations. This is an information extraction problem, and you should read up on techniques for it. The CSV file contains the information you want to extract (your "gold standard"), so you should start by splitting it into training (90%) and testing (10%) subsets.
There is a named entity recognition task in there: recognize diseases, numbers, dates and gender. You could use an off-the-shelf chunker, or find an annotated medical corpus and use it to train one. You can also mix approaches; spotting words that reveal gender, for example, is something you could hand-code pretty easily. Once you have all these words, you need some more work: for example, to distinguish the primary disease from the symptoms, the age from other numbers, and the date of admission from any other dates. This is probably best done as a separate classification task.
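As an illustration of the hand-coded gender spotting mentioned above, here is a toy cue-word counter (the cue lists are my own and deliberately minimal; real clinical text needs a fuller lexicon and negation handling):

```python
import re

# Hand-coded cue words for gender: a first-pass rule, illustrative only.
MALE_CUES = {"he", "him", "his", "male", "man", "gentleman"}
FEMALE_CUES = {"she", "her", "hers", "female", "woman", "lady"}

def guess_gender(text):
    """Return 'M', 'F', or 'NA' by counting gendered cue words."""
    words = re.findall(r"[a-z]+", text.lower())
    m = sum(w in MALE_CUES for w in words)
    f = sum(w in FEMALE_CUES for w in words)
    if m > f:
        return "M"
    if f > m:
        return "F"
    return "NA"

print(guess_gender("The patient is a 56-year-old female; she fell to her knees."))
```

Rules like this make good baselines for the easy fields, leaving the trained models for the genuinely hard ones (disease, dates).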
I recommend you now read through the NLTK book, chapter by chapter, so that you have some idea of the available techniques and tools. It's the approach that matters, so don't get bogged down in comparisons of specific machine learning engines.
I'm afraid the algorithm that fills the gaps has not yet been invented. If the gaps were strongly correlated or had some sort of causal structure, you might be able to model that with a Bayesian model; still, with the amount of data you have, this is pretty much impossible.
Now, on the more practical side of things, you can take two approaches:
Treat the problem as a document-level task, in which case you can just take all rows with a label, train on them, and infer the labels/values of the rest. Look at Naive Bayes, multi-class SVMs, MaxEnt, etc. for the categorical columns and linear regression for predicting the numerical values.
Treat the problem as an information extraction task, in which case you have to add the annotations you mentioned inside the text and train a sequence model. Look at CRFs, structured SVMs, HMMs, etc. You could also look at systems that adapt multiclass classifiers to sequence labeling tasks, e.g. SVMTool for POS tagging (which can be adapted to most sequence labeling tasks).
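The first, document-level route can be sketched with a simple bag-of-words classifier; a toy illustration with invented documents and labels, using scikit-learn (the real input would be the 1000 summaries and a CSV column):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy documents and a categorical label (primary disease), invented for
# illustration.
docs = [
    "covered him for community-acquired pneumonia with ceftriaxone",
    "blood cultures grew strep pneumoniae, completed azithromycin course",
    "history of previous stroke and hypertension, presenting after a fall",
    "hypertension stable, rule out syncope and possible stroke",
]
labels = ["Pneumonia", "Pneumonia", "Stroke", "Stroke"]

# Bag-of-words features feeding a Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
print(clf.predict(["empirically treated for pneumonia with ceftriaxone"]))
```

This works for coarse categorical columns but, as noted below, cannot recover fields like the admission date, which genuinely require extraction.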
Now about the problems you will face. With approach 1, it is very unlikely that you will predict the date of the record with any algorithm. It might be possible to roughly predict the patient's age, as this is something that usually correlates with diseases, etc. And it's very unlikely that you will even be able to set up the disease column as an entity extraction task.
If I had to solve your problem, I would probably pick approach 2, which is IMHO the correct approach, but it is also quite a bit of work. In that case, you will need to create the markup annotations yourself; a good starting point is an annotation tool called brat. Once you have your annotations, you could develop a classifier in the style of CoNLL-2003.
What you are trying to achieve is quite ambitious, especially with only 1000 records. I think (depending on your data) you may be better off using ready-made products instead of building your own. There are open-source and commercial products that might work: lexigram.io has an API, and MetaMap and Apache cTAKES are state-of-the-art open-source tools for clinical entity extraction.