I have a df with many columns of info about Home Depot customer accounts. Some fields are accountname, industry, territory, country, state, city, services, etc...
I need to build a model in Python that takes a customer accountname as input and outputs customer accounts similar to it.
So let’s say I put in customeraccount ‘Jon Doe’
I want to get other customer accounts similar to Jon Doe based on features like industry, country, other categorical variables etc..
How can I approach this? What kind of model would I need to build?
You need to create some metric for "closeness" - your definition of distance.
You need a way to compare all the fields of a record (or all the fields relevant to you) with those of the other records.
The best/easiest skeletal function I can come up with right now is
def rowDist(rowA, rowB):
    # industryDistance and geographicalDistance are placeholders for metrics
    # you define yourself (e.g. 0 if industries match, 1 otherwise; physical
    # distance between locations); the weights set how much each field matters.
    return (industryDistance(rowA.industry, rowB.industry) * industryDistanceWeight
            + geographicalDistance(rowA, rowB) * geographicalDistanceWeight)
Then you just search for rows with lowest distance.
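If you would rather not hand-roll every distance, a common shortcut is to one-hot encode the categorical fields and let scikit-learn find the nearest rows for you. A minimal sketch, assuming df holds the accounts and using the column names from the question:

from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder

# Assumed column names from the question; adjust to your actual DataFrame.
features = ["industry", "territory", "country", "state", "city"]

df = df.reset_index(drop=True)  # make row positions line up with the matrix
X = OneHotEncoder(handle_unknown="ignore").fit_transform(df[features])

# Ask for 6 neighbours because the closest match is the query account itself.
nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(X)
query = df.index[df["accountname"] == "Jon Doe"][0]
_, idx = nn.kneighbors(X[query])
print(df.iloc[idx[0][1:]]["accountname"])  # the 5 most similar accounts

If some fields should matter more than others (as in the weighted function above), you can scale their one-hot columns before fitting.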
I am working on an insurance domain use case to predict whether an existing customer will buy a second insurance policy. I have a few personal details for each customer in categorical fields like Marital Status, Smoker (Yes/No), Age (Young, Adult, Senior Citizen), and Gender (Male/Female), plus a few continuous variables like Premium Paid and Sum Insured.
My target is to use this mixed set of categorical and continuous variables to predict the class (1 - will buy a second policy, 0 - will not buy a second policy). How can I compute the correlations in this dataset and pick only the significant variables to use in a logistic regression for classification?
I would appreciate it if someone could point me to articles or links to similar work done in Python.
For this problem, buying a second policy is a probabilistic event rather than a deterministic one: you estimate how likely customer A is to buy another policy, not a hard yes or no.
First, you need a hypothesis. Buying a second policy is your dependent variable (as the name says, it depends on the values of the other variables); this is the Y of your equation. Which factors do you believe will lead a customer to acquire another policy?
Based on your experience in the insurance field, you may say customers older than X, or who have been clients for more than Y years, or of gender Z, and so on. These are your independent variables - the X of your equation.
If you really want to work with Python for this, check https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares but if it were me, I would start in Excel and switch to Python if things get more complex.
For your categorical data, you can assign numeric values - e.g. Gender: 1 for Male, 0 for Female. Check this link for more information: https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
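As a concrete starting point, here is a minimal scikit-learn sketch that dummy-codes the categoricals, scales the continuous variables, and fits a logistic regression. All column names below are assumptions based on the question:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; rename to match your data.
categorical = ["marital_status", "smoker", "age_band", "gender"]
continuous = ["premium_paid", "sum_insured"]

pre = ColumnTransformer([
    ("cat", OneHotEncoder(drop="first"), categorical),  # dummy-code categoricals
    ("num", StandardScaler(), continuous),              # scale continuous vars
])
model = make_pipeline(pre, LogisticRegression())
model.fit(df[categorical + continuous], df["bought_second_policy"])

# Predicted probability that each customer buys a second policy
probs = model.predict_proba(df[categorical + continuous])[:, 1]

Inspecting the fitted coefficients (and refitting without the weakest ones) is one simple way to keep only the variables that carry signal.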
I'm using the django-filter library: https://django-filter.readthedocs.io/en/master/index.html. I need to make a chained select dropdown in my filters.
I know how to do it with plain Django forms, as shown here: https://simpleisbetterthancomplex.com/tutorial/2018/01/29/how-to-implement-dependent-or-chained-dropdown-list-with-django.html.
When a user picks a region, I need to show only the cities in that region. Does anyone have an idea or solution for building filters like this?
Integrate django-smart-selects with how you perform the filtering.
This package allows you to quickly filter or group “chained” models by adding a custom foreign key or many to many field to your models. This will use an AJAX query to load only the applicable chained objects.
In analogy to the original question's Region -> City, the documentation's example is Continent -> Country, which fits exactly what is needed.
Once you select a continent, if you want only the countries on that continent to be available, you can use a ChainedForeignKey on the Location model:
from django.db import models
from smart_selects.db_fields import ChainedForeignKey

class Location(models.Model):
    continent = models.ForeignKey(Continent, on_delete=models.CASCADE)
    country = ChainedForeignKey(
        Country,
        chained_field="continent",        # Location.continent
        chained_model_field="continent",  # Country.continent
        show_all=False,                   # only show the filtered results
        auto_choose=True,
        sort=True)
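Mapped onto the Region -> City case from the question, the same pattern might look like this (the Profile, Region, and City model names are hypothetical):

from django.db import models
from smart_selects.db_fields import ChainedForeignKey

class Profile(models.Model):
    region = models.ForeignKey("Region", on_delete=models.CASCADE)
    city = ChainedForeignKey(
        "City",
        chained_field="region",        # Profile.region
        chained_model_field="region",  # City.region
        show_all=False,                # hide cities outside the chosen region
        auto_choose=True,
        sort=True)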
Related question:
How to use django-smart-select
Could you please assist me with the following question?
I have a customer activity dataframe that looks like this:
It contains at least 500,000 customers and a time series of 42 months. The ones and zeros represent customer activity: if a customer was active during a particular month there is a 1, otherwise a 0. I need to determine which customers will most likely not be active during the next 6 months (July-December 2018), along with that probability.
Could you please point me to the approach/models I should use to predict this? I use Python.
Thanks in advance!
The most direct analysis would be a survival model characterizing the customer's return over time: https://towardsdatascience.com/survival-analysis-in-python-a-model-for-customer-churn-e737c5242822
If you have more information about the customer besides the time series, you can augment your model with additional signals.
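As a rough illustration of that idea, here is a sketch with the lifelines library: treat each gap between active months as an inactivity spell, fit a Kaplan-Meier estimator over spell lengths, and read off the probability that a currently inactive customer stays inactive for 6 more months. The column layout and month_cols name are assumptions based on the question:

import numpy as np
from lifelines import KaplanMeierFitter

# Assumed layout: one row per customer, 42 monthly 0/1 activity columns
# (oldest month first); month_cols is the list of those column names.
activity = df[month_cols].to_numpy()

durations, observed = [], []
for row in activity:
    active = np.flatnonzero(row)
    if active.size == 0:
        continue                  # never active: nothing to measure
    for gap in np.diff(active):
        if gap > 1:               # completed inactivity spell
            durations.append(gap - 1)
            observed.append(1)    # the customer did come back
    tail = len(row) - 1 - active[-1]
    if tail > 0:                  # ongoing spell: right-censored
        durations.append(tail)
        observed.append(0)

kmf = KaplanMeierFitter().fit(durations, event_observed=observed)

# S(t) = P(an inactive customer has still not returned after t months).
# For a customer already inactive r months, P(no activity in the next 6)
# is roughly S(r + 6) / S(r).
s = kmf.survival_function_at_times([3, 9]).to_numpy()
print("inactive 3 months -> P(no return in next 6):", s[1] / s[0])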
I'm trying to prepare a dataset for scikit-learn, planning to build a pandas dataframe to feed to a decision tree classifier.
The data represents different companies with varying criteria, but some criteria can have multiple values. For example, "Customer segment" for any given company could be any, or all, of SMB, midmarket, enterprise, etc. There are other criteria/columns like this with multiple possible values. I need decisions made on individual values, not the aggregate - so company A for SMB, and company A for midmarket, not the grouping of company A for SMB AND midmarket.
Is there guidance on how to handle this? Do I need to generate rows for every variant for a given company to be fed into the learning routine? Such that an input of:
Company,Segment
A,SMB:MM:ENT
becomes:
A, SMB
A, MM
A, ENT
As well as rows for any other variants that may come from additional criteria/columns - for example, "customer vertical", which could also include multiple values? It seems like this will greatly increase the dataset size. Is there a better way to structure this data and/or handle this scenario?
My ultimate goal is to let users complete a short survey with simple questions, and map their responses to values to get a prediction of the "right" company, for a given segment, vertical, product category, etc. But I'm struggling to build the right learning dataset to accomplish that.
Let's try.
import pandas as pd

df = pd.DataFrame({'company': ['A', 'B'], 'segment': ['SMB:MM:ENT', 'SMB:MM']})

# Split the multi-valued column into one column per value
expanded_segment = df.segment.str.split(':', expand=True)
expanded_segment.columns = ['segment' + str(i) for i in range(len(expanded_segment.columns))]

# Re-attach the company column, then melt back to long form:
# one row per (company, segment) pair, as in the question
wide_df = pd.concat([df.company, expanded_segment], axis=1)
result = pd.melt(wide_df, id_vars=['company'], value_vars=list(set(wide_df.columns) - {'company'}))
result = result.dropna()
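If the end goal is a feature matrix for the classifier rather than long-form rows, pandas can also multi-hot encode the delimited column directly, which avoids duplicating rows. Continuing from the df above:

# One indicator column per segment value (SMB, MM, ENT, ...)
dummies = df.segment.str.get_dummies(sep=':')
features = pd.concat([df.company, dummies], axis=1)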
I've got an existing database full of objects (I'll use books as an example). When users login to a website I'd like to recommend books to them.
I can recommend books based on other people they follow etc but I'd like to be more accurate so I've collected a set of training data for each user.
The data is collected by repeatedly presenting each user with a book and asking them if they like the look of it or not.
The training data is stored in mongodb, the books are stored in a postgres database.
I've written code to predict whether or not a given user will like a given book based on their training data, but my question is this:
How should I apply the data/probability to query books in the postgres database?
Saving the probability a user likes a book for every user and every book would be inefficient.
Loading all of the books from the database and calculating the probability for each one would also be inefficient.
I've written code to predict whether or not a given user will like a given book based on their training data
What does this code look like? Ideally it's some kind of decision tree based on attributes of the book like genre, length, etc., and is technically called a classifier. A simple example:
if ( user.genres.contains(book.genre) ) {
    if ( book.length < user.maxLength ) {
        print "10% off, today only!"
    }
} else {
    print "how about some garden tools?"
}
Saving the probability a user likes a book for every user and every book would be inefficient.
True. Note that the above decision tree may be formulated as a database query:
SELECT * FROM Books WHERE Genre IN [user.genres] AND Length < [user.maxLength]
This will give you the books with the highest probability of being liked by the user, with respect to the training data.
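In Python, that query might be issued against the postgres database with, for example, psycopg2. The table, columns, connection string, and the user_genres/user_max_length parameters below are all hypothetical:

import psycopg2

# Hypothetical connection string and schema.
conn = psycopg2.connect("dbname=books")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, title
        FROM books
        WHERE genre = ANY(%s)    -- genres the classifier predicts the user likes
          AND length < %s        -- user's preferred maximum length
        """,
        (list(user_genres), user_max_length),
    )
    recommended = cur.fetchall()

This way the user's learned preferences live in a small per-user profile, and the database does the heavy lifting of selecting candidate books.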