Could you please assist me with the following question?
I have a customer activity dataframe that looks like this:
It contains at least 500,000 customers and a "timeseries" of 42 months. The ones and zeroes represent customer activity: if a customer was active during a particular month there will be a 1, if not, a 0. I need to determine which customers will most likely not be active during the next 6 months (July-December 2018), along with that probability.
Could you please point me to the approach/models I should use to predict this? I use Python.
Thanks in advance!
The most direct analysis would be a survival model characterizing the customer's return over time: https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822
If you have more information about the customer besides the time series, you can augment your model with additional signals.
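For a concrete starting point, here is a minimal sketch of the survival idea using the lifelines library (my choice, not necessarily the linked post's); the churn definition (six consecutive inactive months) and the toy activity DataFrame are assumptions:

import pandas as pd
from lifelines import KaplanMeierFitter

def time_to_churn(row, gap=6):
    # Months until the first run of `gap` consecutive zeros;
    # (length, 0) means censored: the customer never churned.
    zeros = 0
    for t, active in enumerate(row):
        zeros = 0 if active else zeros + 1
        if zeros == gap:
            return t + 1, 1
    return len(row), 0

# Stand-in for your customers x 42-months matrix of 0/1 flags.
activity = pd.DataFrame([[1] * 36 + [0] * 6,   # goes quiet in the last 6 months
                         [1, 0] * 21])         # intermittent, never 6 zeros

durations, observed = zip(*(time_to_churn(r) for r in activity.values))

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_)  # P(still active beyond month t)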
My question is something I haven't encountered anywhere: I've been wondering whether it is possible for a TF model to determine values between two dates that have real/validated values assigned to them.
Here's an example:
Let's take the price of nickel; here's its chart for the last week:
There is no data for the following two dates: 19/11 and 20/11.
But we have the data points before and after.
So is it possible to use the data from before and after these two points to guess the values of the two missing dates?
Thanks a lot!
It would be possible to create a machine learning model to predict the prices given a dataset of previous prices; take a look at this post, for instance. You would have to modify it slightly so that it predicts the prices in the gaps given both previous and upcoming prices.
But for the example you gave, assuming the dates are from this year (2022): these are a Saturday and a Sunday, and the stock market is closed on weekends, hence there is no price for the item. Also notice that there are other days in the year when no trading occurs, such as holidays; there is no price on those days either.
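For just two missing points between known values, it may also be worth trying plain time-series interpolation before building a full model; a minimal sketch with made-up nickel prices:

import pandas as pd

# Made-up prices around the 19-20 Nov 2022 weekend gap.
prices = pd.Series(
    [24_500.0, 24_650.0, None, None, 24_800.0],
    index=pd.to_datetime(["2022-11-17", "2022-11-18",
                          "2022-11-19", "2022-11-20", "2022-11-21"]),
)

# method="time" weights by the actual spacing between timestamps.
print(prices.interpolate(method="time"))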
I have a df with many columns of info about Home Depot customer accounts. Some fields are accountname, industry, territory, country, state, city, services, etc...
I need to build a model in Python that lets me put in a customer account name and get as output the customer accounts most similar to the one I put in.
So let's say I put in customer account 'Jon Doe'. I want to get other customer accounts similar to Jon Doe based on features like industry, country, and other categorical variables.
How can I approach this? What kind of a model would I need to build?
You need to create some metric for "closeness" - your definition of distance.
You need a way to compare all (or all relevant to you) fields of a record with the others.
The best/easiest skeletal function I can come up with right now is:

def rowDist(rowA, rowB):
    # Weighted sum of per-field distances. industryDistance,
    # geographicalDistance, and the two weights are yours to define.
    return (industryDistance(rowA.industry, rowB.industry) * industryDistanceWeight
            + geographicalDistance(rowA, rowB) * geographicalDistanceWeight)
Then you just search for rows with lowest distance.
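If you would rather not hand-tune distances and weights, here is a minimal sketch of the same idea with scikit-learn: one-hot encode the categorical fields and let NearestNeighbors rank accounts by distance. The toy data and column names are assumptions:

import pandas as pd
from sklearn.neighbors import NearestNeighbors

df = pd.DataFrame({
    "accountname": ["Jon Doe", "Acme Corp", "Jane Roe"],
    "industry":    ["Retail", "Retail", "Finance"],
    "country":     ["US", "US", "FR"],
})

# One-hot encode the categorical fields into a numeric matrix.
features = pd.get_dummies(df[["industry", "country"]])

nn = NearestNeighbors(n_neighbors=2).fit(features)
_, idx = nn.kneighbors(features[df["accountname"] == "Jon Doe"])
print(df.iloc[idx[0]]["accountname"])  # the query itself, then its nearest match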
I am working on an insurance domain use case to predict whether an existing customer will buy a second insurance policy or not. I have a few personal details of the customer in different categories, like Marital Status, Smoker (Yes or No), Age (Young, Adult, Senior Citizen), and Gender (Male/Female), and a few continuous variables like Premium Paid and Sum Insured.
My target is to use this mixed set of categorical and continuous variables to predict the class (1 - will buy a second policy, 0 - will not buy a second policy). How can I find/compute the correlations in this dataset and pick only the significant variables to use in the logistic regression formula for classification?
I would appreciate it if someone could provide articles or links to a similar piece of work done in Python.
For this problem, buying a second policy is more of a probabilistic event than a deterministic one: you estimate how likely customer A is to buy another policy, not whether customer A definitely will or will not.
First, you need to have a hypothesis. Buying a second policy is your dependent variable (as the name says, it will depend on the values of other variables); this is the Y of your equation. Which factors do you believe will lead a customer to acquire another policy?
Based on your experience in the insurance field, you may say customers older than X, or who have been clients for more than Y years, or of gender Z, and so on. These are your independent variables - the X of your equation.
If you really want to work with Python for this, check https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares - but if it were me, I would start in Excel and switch to Python if things got more complex.
For your categorical data, you can assign values to them - like 1 for Male and 0 for Female for Gender. Check this link for more information: https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
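Here is a minimal sketch of that encode-then-fit flow with scikit-learn, feeding one-hot-encoded categoricals plus the continuous columns into a logistic regression; all column names and values below are made up:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Gender":  ["Male", "Female", "Female", "Male"],
    "Smoker":  ["Yes", "No", "No", "Yes"],
    "Premium": [1200.0, 800.0, 950.0, 400.0],
    "target":  [1, 0, 1, 0],  # 1 = bought a second policy
})

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender", "Smoker"])],
    remainder="passthrough",  # pass the continuous columns through unchanged
)
model = make_pipeline(pre, LogisticRegression())
model.fit(df.drop(columns="target"), df["target"])
print(model.predict_proba(df.drop(columns="target"))[:, 1])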
I got this Prospects dataset:
ID     Company_Sector          Company_size  DMU_Final  Joining_Date  Country
65656  Finance and Insurance   10            End User   2010-04-13    France
54535  Public Administration   1             End User   2004-09-22    France
and Sales dataset:
ID     linkedin_shared_connections  online_activity  did_buy  Sale_Date
65656  11                           65               1        2016-05-23
54535  13                           100              1        2016-01-12
I want to build a model which assigns to each prospect in the Prospects table the probability of becoming a customer: the model will predict whether a prospect is going to buy and return the probability. The Sales table gives info about 2015 sales. My approach: the 'did_buy' column should be the label in the model, because 1 means the prospect bought in 2016 and 0 means no sale. Another interesting column is online_activity, which ranges from 5 to 685; the higher it is, the more active the prospect is about the product. So I'm thinking of building a Random Forest model and then somehow putting the probability for each prospect into the new 'intent' column. Is a Random Forest an efficient model in this case, or should I use another one? And how can I apply the model's results to the new 'intent' column for each prospect in the first table?
Well, first, please see the How to Ask and On-topic guidelines; this is more of a consulting question than a practical or specific one. Perhaps a more appropriate topic is machine learning.
TL;DR: Random forests are nice but seem inappropriate here due to the unbalanced data. You should read about recommender systems and more modern, well-performing models like Wide & Deep.
An answer depends on several things: How much data do you have? What data is available during inference - could you see the current "online_activity" attribute of the potential sale before the customer buys? Many such questions may change the whole approach that fits your task.
Suggestion:
Generally speaking, this is a kind of business where you usually deal with very unbalanced data: a low number of "did_buy" = 1 against a huge number of potential customers.
On the data science side, you should define a valuable success metric that maps to money as directly as possible. Here, "did_buy" / "was_approached" seems like a great success metric: taking action by advertising to or approaching the more probable customers should raise it, and over time you succeed if you keep raising that number.
Another thing to take into account is that your data may be sparse. I do not know how many buys you usually get, but it could be that you have only one from each country, etc. That should also be taken into consideration, since a simple random forest can easily target such a column in most of its random models, and overfitting will become a big issue. Decision trees suffer from unbalanced datasets. However, taking the probability of each label in the leaf, instead of a hard decision, can sometimes be helpful for simple interpretable models, and it reflects the unbalanced data. To be honest, I do not truly believe this is the right approach.
If I were you:
I would first embed the Prospects columns into a vector by (a sketch follows this list):
Converting categories to random vectors (for each category) or one-hot encoding.
Normalizing or bucketizing company sizes into numbers that fit the prediction model (next).
Same idea for dates: the year may be problematic here, but months/days should be useful.
Country is definitely categorical, maybe add another "unknown" country class.
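Here is that sketch - a minimal, assumption-laden version of the four steps, using the two Prospects rows from the question and pandas for one-hot encoding, bucketizing, and date parts:

import pandas as pd

prospects = pd.DataFrame({
    "Company_Sector": ["Finance and Insurance", "Public Administration"],
    "Company_size":   [10, 1],
    "Joining_Date":   pd.to_datetime(["2010-04-13", "2004-09-22"]),
    "Country":        ["France", None],
})

X = pd.get_dummies(prospects["Company_Sector"])            # categories -> one-hot
X["size_bucket"] = pd.cut(prospects["Company_size"],       # bucketize sizes
                          bins=[0, 5, 50, 500, float("inf")], labels=False)
X["join_month"] = prospects["Joining_Date"].dt.month       # keep month, drop year
country = prospects["Country"].fillna("unknown")           # "unknown" country class
X = X.join(pd.get_dummies(country))
print(X)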
Then,
I would use a model that can actually be optimized according to different costs. Logistic regression is a "wide" one, a deep neural network is another option, or see Google's Wide & Deep for a combination (a minimal cost-weighted sketch follows these steps).
Set the cost to be my golden number (the money metric in terms of labels), or something as close as possible.
Run the experiment.
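A minimal sketch of those steps, with scikit-learn's class weighting standing in for the real money metric (the synthetic data, the ~3% buy rate, and the feature count are all assumptions); predict_proba gives the probability the asker wanted for the 'intent' column:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                # embedded prospect features
y = (rng.random(1000) < 0.03).astype(int)     # ~3% did_buy = 1

# class_weight="balanced" counters the unbalanced labels; sample_weight
# in fit() could instead encode the expected revenue per prospect.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
intent = clf.predict_proba(X)[:, 1]           # write back as the 'intent' column
print(intent[:5])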
Finally,
Inspect my results and why they failed.
Suggest another model/feature
Repeat.
Go eat lunch.
Ask a bunch of data questions.
Try to answer at least some.
Discover new interesting relations in the data.
Suggest something interesting.
Repeat (tomorrow).
Of course there is a lot more to it than just the above, but that is for you to discover in your data and business.
Hope I helped! Good luck.
I have a data set which contains site usage behavior of users over a period of six months. It contains data about:
Number of pages viewed
Number of unique cookies associated with each user
Number of different OSes and browsers used
Number of different cities visited
Everything here is collected over a six-month time frame. I have used this data to train a model to predict a target variable 'y'. Everything is in numeric format.
Now, since it is six months of data and the model is built on these six months, I know I can use it to predict the target variable y for the next six months of data.
My question is: if, instead of predicting on a six-month time frame, I use the model to predict on a monthly time frame, will it give me incorrect results?
My logic tells me yes. For example, I used tree methods such as decision trees and random forests; these algorithms essentially learn thresholds to output "0/1". Now, the variables I mentioned above, such as the number of cookies associated, OSes, browsers, etc., would have different values from a one-month standpoint than from a six-month standpoint. For example, the number of unique cookies associated with a user would be smaller over one month than over six months.
But I am confused about whether the model will automatically adjust these values when running on monthly data. Please help me understand whether I am thinking about this right or wrong, and provide a logical explanation if possible.
Thanks.
Is your minimum unit of measurement 6 months? I hope not, but if it is, then I would suggest that you don't try to predict the next 1 month.
Seasonality within a year aside, you would need daily volume measurements. I would be very worried about building anything on monthly or even weekly numbers.
In terms of modelling techniques, please stick to simple regression methods, as kungphu suggests.
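The answer does not spell this out, but one hedged way to address the asker's window mismatch is to convert accumulated counts into per-day rates before training, so the same model can score a window of any length; a minimal sketch with hypothetical column names and numbers:

import pandas as pd

def to_rates(df, count_cols, window_days):
    # Divide accumulated counts by the observation window length.
    out = df.copy()
    out[count_cols] = out[count_cols].div(window_days)
    return out

six_months = pd.DataFrame({"pages": [1200, 300], "cookies": [18, 4]})
one_month = pd.DataFrame({"pages": [210, 45], "cookies": [3, 1]})

train = to_rates(six_months, ["pages", "cookies"], window_days=182)
score = to_rates(one_month, ["pages", "cookies"], window_days=30)
print(train, score, sep="\n")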