Not sure if the title makes complete sense so sorry about that.
I'm new to Machine Learning and I'm using Scikit and decision trees.
Here's what I want to do: I want to take all of my inputs and include a unique feature, which is a client ID. The client ID is unique and can't be handled the way a normal feature would be in decision tree analysis. What's happening now is that the tree treats the client ID as any other integer value and branches on it, saying, for instance, that client IDs less than 430 go down a different path than those over 430. That isn't correct and isn't what I want. What I want is to make the decision tree understand that this specific field can't be analyzed that way, and that each client should get their own branch. Is this possible with decision trees?
I do have a couple of workarounds. One would be to build a separate decision tree for each client, but training them would be a nightmare. Another would be, say we have 800 clients, to create 800 one-hot (bit-field) features, but that is also crazy.
This is a fairly common problem in machine learning. A feature that is unique to each instance can't be used in any case. Intuitively that makes sense: the algorithm can't learn anything from a feature it can't generalize from.
What you can do is simply separate that piece of information out before you pass the rest of the features to the decision tree, and re-merge the ID with the prediction after it is made.
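For example, a minimal sketch of that pattern, assuming your data is in a pandas DataFrame df with a client_id column and a label target (the names are placeholders):

from sklearn.tree import DecisionTreeClassifier

# Drop the ID before training; the tree only sees real features
# (this assumes the remaining feature columns are numeric)
X = df.drop(columns=["client_id", "label"])
y = df["label"]

clf = DecisionTreeClassifier()
clf.fit(X, y)

# Predict, then re-attach the IDs afterwards for reporting
predictions = df[["client_id"]].copy()
predictions["prediction"] = clf.predict(X)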
I would strongly discourage any kind of manipulation of the feature vector to include the ID in any form. Features should only be things the algorithm is meant to use to make decisions. Don't give it information you don't want it to use. You're right to avoid using an ID as a feature, because (most likely) the ID has no bearing on whatever you're trying to predict.
If you do want individual models (and have enough data for each user to build them), it's not as big a pain as you might think. You can use scikit-learn's model persistence (e.g. joblib) and this answer on saving pickles to MySQL to easily create and store personalized models. Unless you have a very large number of users, creating personalized decision trees shouldn't take very long.
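If you go that route, a rough sketch of per-client models with scikit-learn and joblib (again assuming a DataFrame df with client_id and label columns; the file naming is just illustrative):

import joblib
from sklearn.tree import DecisionTreeClassifier

models = {}
for client_id, group in df.groupby("client_id"):
    X = group.drop(columns=["client_id", "label"])
    y = group["label"]
    clf = DecisionTreeClassifier()
    clf.fit(X, y)
    models[client_id] = clf
    # Persist each client's model so it can be reloaded later
    joblib.dump(clf, f"model_client_{client_id}.joblib")

# Later: load a single client's model back
clf_for_client = joblib.load("model_client_430.joblib")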
I have trained a gensim Doc2Vec model for an English news recommender system. The model was trained on 40K news articles. I am using the code below to recommend the top 5 most similar news items for, e.g., news_1:
inferred_vector = model.infer_vector(news_1)
sims = model.dv.most_similar([inferred_vector], topn=5)
The problem is that if I add another 100 news items to the database (so the database now has 40K + 100 items) and re-run the same code, it will only recommend news from the original 40K (instead of 40K + 100); in other words, the recommended articles never come from the 100 new articles.
How can I address this issue without retraining the model? Thank you in advance!
PS: Since our app is for news, lots of news data comes into our database every day, so we don't want to retrain the model every day (doing so might crash our backend server).
There's a bulk contiguous vector structure initially created by training, for the initial known set of vectors. It's amenable to the every-candidate bulk vector calculation at the heart of most_similar() - so that operation goes about as fast as it can, with the right vector libraries for your OS/processor.
But, that structure wasn't originally designed with incremental expansion in mind. Indeed, if you have 1 million vectors in a dense array, then want to add 1 to the end, the straightforward approach requires you to allocate a new 1-million-and-1 long array, bulk copy over the 1 million, then add the last 1. That works, but what seems like a "tiny" operation then takes a while, and ever-longer as the structure grows. And, each add more-than-doubles the temporary memory usage, for the bulk copy. So, the naive pattern of adding a whole bunch of new items individually in a loop can be really slow & memory-intensive.
So, Gensim hasn't yet focused on providing a set-of-vectors that's easy & efficient to incrementally grow with new vectors. But, it's still indirectly possible, if you understand the caveats.
Especially in gensim-4.0.0 & above, the .dv set of doc-vectors is an instance of KeyedVectors with all that class's standard functions. Those include the add_vector() and add_vectors() methods:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.add_vector
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.add_vectors
You can try these methods to add your new inferred vectors to the model.dv object - and then they'll also be included in follow-up most_similar() results.
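For example, a rough sketch (assuming new_docs is a dict mapping new doc tags to token lists; those names are placeholders):

# Infer and add each new document's vector (requires gensim >= 4.0)
for tag, tokens in new_docs.items():
    vec = model.infer_vector(tokens)
    model.dv.add_vector(tag, vec)

# New docs can now show up in most_similar() results
sims = model.dv.most_similar([model.infer_vector(news_1)], topn=5)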
But keep in mind:
The above caveats about performance & memory-usage - which may be minor concerns as long as your dataset isn't too large, or manageable if you do additions in occasional larger batches.
The containing Doc2Vec model generally isn't expecting its internal .dv to be arbitrarily modified or expanded by other code. So, once you start doing that, parts of the model may not behave as expected. If you have problems with this, you could consider saving-aside the full Doc2Vec model before any direct-tampering with its .dv, and/or only expanding a completely separate instance of the doc-vectors, for example by saving them aside (eg: model.dv.save(DOC_VECS_FILENAME)) & reloading them into a separate KeyedVectors (eg: growing_docvecs = KeyedVectors.load(DOC_VECS_FILENAME)).
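As a sketch of that save-aside option (the filename, the "news_40001" tag, and new_news_tokens are placeholders):

from gensim.models import KeyedVectors

# Keep the original model untouched and grow a separate copy instead
model.dv.save("doc_vectors.kv")
growing_docvecs = KeyedVectors.load("doc_vectors.kv")

new_vec = model.infer_vector(new_news_tokens)      # new_news_tokens is a token list
growing_docvecs.add_vector("news_40001", new_vec)  # expand the copy, not the model
sims = growing_docvecs.most_similar([new_vec], topn=5)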
I'm currently using AzureML with pretty complex workflows involving large datasets, and I'm wondering what the best way is to manage the splits resulting from preprocessing steps. All my projects are built as pipelines fed by registered Datasets. I want to be able to track the splits so that I can easily retrieve, for example, the test and validation sets for integration testing purposes.
What would be the best pattern to apply here? Registering every intermediate set as a different Dataset? Directly retrieving the intermediate sets using the Run IDs? ...
Thanks
I wish I had a more coherent answer, the upside is that you're at the bleeding edge so, should you find a pattern that works for you, you can evangelize it and make it best practice! Hopefully you find my rantings below valuable.
First off -- if you aren't already, you should definitely use PipelineData as the intermediate artifact for passing data between PipelineSteps. That way, you can treat the PipelineData as semi-ephemeral: they are materialized if you need them, but it isn't a requirement to keep hold of every single version of every PipelineData. You can always grab them using Azure Storage Explorer, or, like you said, using the SDK and walking down from a PipelineRun object.
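For reference, a rough sketch of that wiring with the v1 azureml-sdk (the script names, compute target, and source directory below are placeholders, not anything from your setup):

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Semi-ephemeral intermediate artifact passed between steps
featurized = PipelineData("featurized", datastore=datastore)

featurize_step = PythonScriptStep(
    name="featurize",
    script_name="featurize.py",
    arguments=["--output", featurized],
    outputs=[featurized],
    compute_target="cpu-cluster",
    source_directory="./src",
)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    arguments=["--input", featurized],
    inputs=[featurized],
    compute_target="cpu-cluster",
    source_directory="./src",
)

pipeline = Pipeline(workspace=ws, steps=[featurize_step, train_step])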
Another recommendation is to split your workflow into the following pipelines:
featurization pipeline (all joining, munging, and featurizing)
training pipeline
scoring pipeline (if you have a batch scoring scenario).
The intra-pipeline artifacts are PipelineData, and the inter-pipeline artifacts would be registered Datasets.
To actually get to your question of associating data splits with models: our team struggled with this -- especially because for each train/test split we also have "extra cols", which contain either identifiers or leaking variables that the model shouldn't see.
In our current hack implementation, we register our "gold" dataset as an Azure ML Dataset at the end of the featurization pipeline. The first step of our training pipeline is a PythonScriptStep, "Split Data", which contains our train/test split logic and outputs a pickled dictionary as data.pkl. Then we can unpickle it any time we need one of the splits, and can join back on the index for any reporting needs. Here's a gist.
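A rough sketch of what such a "Split Data" script could pickle (gold_df and the column names here are placeholders, not our actual schema):

import pickle
from sklearn.model_selection import train_test_split

extra_cols = ["row_id"]   # identifiers / leaking variables the model must not see
target_col = "label"
feature_cols = [c for c in gold_df.columns if c not in extra_cols + [target_col]]

train_df, test_df = train_test_split(gold_df, test_size=0.2, random_state=42)

splits = {
    "X_train": train_df[feature_cols],
    "y_train": train_df[target_col],
    "X_test": test_df[feature_cols],
    "y_test": test_df[target_col],
    "extra_train": train_df[extra_cols],   # kept so we can join back on the index later
    "extra_test": test_df[extra_cols],
}

with open("data.pkl", "wb") as f:
    pickle.dump(splits, f)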
Registration is meant to make sharing and reuse easier, so that you can retrieve the dataset by its name. If you expect to reuse the test/validation sets in other experiments, then registering them makes sense. However, if you are just trying to keep a record of what you used for this particular experiment, you can always find that information via the Run, as you suggested.
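For illustration, a hedged sketch of register-then-retrieve-by-name with the v1 SDK (the dataset name and path are made up):

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Register a test split that the pipeline wrote out to the datastore
test_ds = Dataset.Tabular.from_delimited_files((datastore, "splits/test.csv"))
test_ds = test_ds.register(workspace=ws, name="experiment_x_test_split",
                           create_new_version=True)

# Any other experiment can now retrieve it by name
same_ds = Dataset.get_by_name(ws, "experiment_x_test_split")
df = same_ds.to_pandas_dataframe()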
Edit: Apologies if this isn't the proper space for this kind of question. Would appreciate if you can please point me to a better forum
I'm manually applying NLP rules to a chatbot.
Currently, I have a simple set of rules: actions that follow certain trigger words.
Ex: "Create the match on saturday."
This has been working for relatively simple phrases like the example above, where I expect a word like "create" and tag it as an action word, then expect "match" and tag it as an object entity, "saturday" as a time, and so on.
When I try to expand the scope of what the bot can handle, it becomes more complicated. Here's an example of something I'm trying to handle:
"Update the memo title of the match on saturday to 'Game Day'."
I'm not sure how to move forward.
I considered manually expanding (anticipating) the entities, then trying a method where I still parse for action words, but if a certain threshold of additional objects is passed, I execute a subset of that action.
For example: "update" obviously signals an action, but the addition of "memo", "title", and "'Game Day'" signals a subset of that action, since there's more to it than a simpler "create match". Then, checking the additional objects like "title" against a predetermined set of entities would narrow the intent down to update + title, roughly along the lines of the sketch below.
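A toy version of that idea (the word lists are illustrative, not a real grammar):

ACTION_WORDS = {"create", "update", "delete"}
OBJECT_WORDS = {"match", "memo", "title"}
TIME_WORDS = {"saturday", "sunday", "monday", "tuesday", "wednesday", "thursday", "friday"}

def parse(utterance: str) -> dict:
    # Crude tokenization: split on whitespace and strip surrounding punctuation
    tokens = [t.strip(".,'\"").lower() for t in utterance.split()]
    action = next((t for t in tokens if t in ACTION_WORDS), None)
    objects = [t for t in tokens if t in OBJECT_WORDS]
    time = next((t for t in tokens if t in TIME_WORDS), None)

    intent = action
    # If extra object words appear, narrow the intent to a subset of the action
    if action and len(objects) > 1:
        intent = f"{action}_{objects[0]}"   # e.g. "update_memo"
    return {"intent": intent, "objects": objects, "time": time}

print(parse("Update the memo title of the match on saturday to 'Game Day'."))
# {'intent': 'update_memo', 'objects': ['memo', 'title', 'match'], 'time': 'saturday'}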
I see many holes in this logic, especially as the complexity increases even slightly.
This led me to the field of dependency parsing.
But upon looking into dependency parsing, I wonder whether it is feasible to implement manually.
I'll be using Python code; it won't be deep-learning based.
What do you think of the basic rule-based method I've outlined? Is it something that sounds workable for a domain-specific bot?
Should I be looking into NLP libraries offered in Python, such as NLTK or spaCy, and using their features? My concern with features such as spaCy's dependency parser is that it might be overkill, or too much added complexity for my domain-specific tasks. Furthermore, likely because I haven't yet seen it used effectively in another project, I have doubts about the practicality of dependency parsers outside of academia or research.
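For reference, here is roughly what spaCy's dependency parser gives for the earlier example (this assumes the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Update the memo title of the match on saturday to 'Game Day'.")

for token in doc:
    # each token, its syntactic role, and the word it attaches to
    print(f"{token.text:10} {token.dep_:10} {token.head.text}")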
Edit: Apologies if this isn't the proper space for this kind of question. Would appreciate if you can please point me in the right direction
I am new to Python. Recently, I got a project that processes a huge amount of health data in XML files.
Here is an example:
In my data there are about 100 of these records, and each has a different id, origin, type, and text. I want to store all of them so that I can train on this dataset. My first idea was to use 2D arrays (one storing id and origin, the other storing text). However, there are too many features, and I want to know which features belong to which document.
Could anyone recommend the best way to do this?
For scalability, simplicity, and maintainability, you should normalise the data, build a database schema, and move it into a database (SQLite, Postgres, MySQL, whatever).
This moves the complicated data logic out of Python, which is typical Model-View-Controller practice.
Creating a Python dictionary and traversing it is quick and dirty, but it will become a huge technical time sink very soon if you want to make practical sense of the data.
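A minimal sketch of that approach with the standard library, assuming each record looks something like <doc id="..." origin="..." type="...">some text</doc> (adjust the tag and attribute names to the real schema):

import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("health_data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, origin TEXT, type TEXT, text TEXT)"
)

# Parse the XML file and load each record into the table
tree = ET.parse("health_data.xml")
for doc in tree.getroot().iter("doc"):
    conn.execute(
        "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
        (doc.get("id"), doc.get("origin"), doc.get("type"), doc.text),
    )
conn.commit()

# Later, pull exactly the columns you need for training
rows = conn.execute("SELECT id, origin, text FROM documents").fetchall()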
I am working on a simple naive Bayes classifier and I have a conceptual question about it.
I know that the training set is extremely important, so I wanted to know what constitutes a good training set in the following example. Say I am classifying web pages and deciding whether they are relevant or not. The decision is based on the probabilities of certain attributes being present on the page: certain keywords that increase the relevancy of the page. The keywords are apple, banana, mango. The relevant/irrelevant label is per user. Assume that a user marks a page relevant or irrelevant with equal probability.
Now for the training data, to get the best training for my classifier, would I need to have the same number of relevant results as irrelevant results? Do I need to make sure that each user would have relevant/irrelevant results present for them to make a good training set? What do I need to keep in mind?
This is a somewhat endless topic, as there are millions of factors involved. Python is a good example, since it drives much of Google (as far as I know). And this brings us to the very beginning of Google: there was an interview with Larry Page some years ago in which he talked about search engines before Google. For example, when he typed the word "university", the first result he found had the word "university" a few times in its title.
Going back to naive Bayes classifiers, there are a few very important factors: assumptions and pattern recognition. And relations, of course. You mentioned apples; that could have a few possibilities. For example:
Apple - if eating, vitamins, and shape are present, we assume we are most likely talking about the fruit.
If we are mentioning electronics, screens, maybe Steve Jobs - that should be obvious.
If we are talking about religion, God, gardens, snakes - then it must have something to do with Adam and Eve.
So depending on your needs, you could have basic segments of data that each of these falls into, or a more complex structure containing far more detail. So yes, you base most of this on plain assumptions. And based on those, you can create more complex patterns for further recognition: Apple, iPod, and iPad have similar patterns in their names, contain similar keywords, and mention certain people, so they are most likely related to each other.
Irrelevant data is very hard to spot. At this very point you are probably assuming that I own multiple Apple devices and am writing this on a large iMac, while that couldn't be further from the truth. So that would be a very wrong assumption to start from. The classifiers themselves must do very good segmentation and analysis before jumping to firm conclusions.
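To tie this back to the question's setup, here is a minimal sketch of a keyword-based naive Bayes classifier in scikit-learn (the pages and labels are made-up toy data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pages = [
    "apple banana mango recipes",        # relevant
    "apple mango smoothie ideas",        # relevant
    "stock prices and market news",      # irrelevant
    "weather forecast for the weekend",  # irrelevant
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

# Only count the three keywords from the question
vectorizer = CountVectorizer(vocabulary=["apple", "banana", "mango"])
X = vectorizer.fit_transform(pages)

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["banana and mango salad"])))

Note that MultinomialNB estimates its class priors from the label frequencies in the training data, which is one reason the relevant/irrelevant balance asked about in the question matters; alternatively, you can set class_prior explicitly.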