First of all, my apologies if I am not following some of the best practices of this site, as you will see, my home is mostly MSE (math stack exchange).
I am currently working on a project where I build a vacation recommendation system. The initial idea was somewhat akin to 20 questions: We ask the user certain questions, such as "Do you like museums?", "Do you like architecture", "Do you like nightlife" etc., and then based on these answers decide for the user their best vacation destination. We answer these questions based on keywords scraped from websites, and the decision tree we would implement would allow us to effectively determine the next question to ask a user. However, we are having some difficulties with the implementation. Some examples of our difficulties are as follows:
There are issues with granularity of questions. For example, to say that a city is good for "nature-lovers" is great, but this does not mean much. Nature could involve say, hot, sunny and wet vacations for some, whereas for others, nature could involve a brisk hike in cool woods. Fortunately, the API we are currently using provides us with a list of attractions in a city, down to a fairly granular level (for example, it distinguishes between different watersport activities such as jet skiing, or white water rafting). My question is: do we need to create some sort of hiearchy like:
nature-> (Ocean,Mountain,Plains) (Mountain->Hiking,Skiing,...)
or would it be best to simply include the bottom level results (the activities themselves) and just ask questions regarding those? I only ask because I am unfamiliar with exactly how the classification is done and the final output produced. Is there a better sort of structure that should be used?
Thank you very much for your help.
I think using a decision tree is a great idea for this problem. It might be an idea to group your granular activities, and for the "nature lovers" category list a number of different climate types: Dry and sunny, coastal, forests, etc and have subcategories within them.
For the activities, you could make a category called watersports, sightseeing, etc. It sounds like your dataset is more granular than you want your decision tree to be, but you can just keep dividing that granularity down into more categories on the tree until you reach a level you're happy with. It might be an idea to include images too, of each place and activity. Maybe even without descriptive text.
Bins and sub bins are a good idea, as is the nature, ocean_nature thing.
I was thinking more about your problem last night, TripAdvisor would be a good idea. What I would do is, take the top 10 items in trip advisor and categorize them by type.
Or, maybe your tree narrows it down to 10 cities. You would rank those cities according to popularity or distance from the user.
I’m not sure how to decide which city would be best for watersports, etc. You could even have cities pay to be top of the list.
Related
I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I’m assuming there’s a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug in to a PRNG to produce a series of data points that not are not only distributed like the original's, but also within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do do the analysis and generate the data points? As I said, my maths is ok, but my statistics isn't great (and the docs for the tools I’ve seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I’m doing for timestamp and then ‘make up’ a product for each price that’s generated, but I discarded that for a couple of reasons: It might be consistent ‘within’ a produced dataset, but not ‘across’ data sets. And I imagine on largish sets would double count quite a bit.
So my next thought was I would create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but Im struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out what product to select. I thought maybe I could generate some sort of bucketed histogram to work out what the frequency of purchases was within a range of costs (say $0-1, 1-2$ etc). I could then use that frequency to define the probability that a given transaction's cost would fall within one those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy to understand tool (or at least one that’s documented in plain English :))
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom
Ive started learning Python and decided to give myself a golf related project to work on. My question revolves around choosing the best data type to use. Now i know th3nanswer to this is based on requirements but tht isnt helping me.
Besides simple data like name, date, name of course, etc., ill alao be generating 9 and 18 hole scores for multiple players in my locl society.
While keeping a historical record of past scores is nice i may want to perform some analytics across my dataset to find handicaps, hardest holes, etc. And, yes, i know there are apps out there aldeady im doing this to learn. ;)
So....which data structur should i use to work with? Lists, dictionaries, numpy arrats, objects, or a combination?
Many thanks!
First: Any recs on how to modify the title?
I am using my own named entity recognition algorithm to parse data from plain text. Specifically, I am trying to extract lawyer practice areas. A common sentence structure that I see is:
1) Neil focuses his practice on employment, tax, and copyright litigation.
or
2) Neil focuses his practice on general corporate matters including securities, business organizations, contract preparation, and intellectual property protection.
My entity extraction is doing a good job of finding the key words, for example, my output from sentence one might look like this:
Neil focuses his practice on (employment), (tax), and (copyright litigation).
However, that doesn't really help me. What would be more helpful is if i got an output that looked more like this:
Neil focuses his practice on (employment - litigation), (tax - litigation), and (copyright litigation).
Is there a way to accomplish this goal using an existing python framework such as nltk (after my algo extracts the practice areas) can I use ntlk to extract the other words that my "practice areas" modify in order to get a more complete picture?
Named entity recognition (NER) systems typically use grammer-based rules or statistical language models. What you have described here seems to be based only on keywords, though.
Typically, and much like most complex NLP tasks, NER systems should be trained on domain-specific data so that they perform well on previously unseen (test) data. You will require adequate knowledge of machine learning to go down that path.
In "normal" language, if you want to extract words or phrases and categorize them into classes defined by you (e.g. litigation), if often makes sense to use category labels in external ontologies. An example could be:
You want to extract words and phrases related to sports.
Such a categorization (i.e. detecting whether or not a word is indeed related to sports) is not a "general"-enough problem. Which means you will not find ready-made systems that will solve the problem (e.g. algorithms in the NLTK library). You can, however, use an ontology like Wikipedia and exploit the category labels available there.
E.g., you can check that if you search Wikipedia for "football", which has a category label "ball games", which in turn is under "sports".
Note that the wikipedia category labels form a directed graph. If you build a system which exploits the category structure of such an ontology, you should be able to categorize terms in your texts as you see fit. Moreover, you can even control the granularity of the categorization (e.g. do you want just "sports", or "individual sports" and "team sports").
I have built such a system for categorizing terms related to computer science, and it worked remarkably well. The closest freely available system that works in a similar way is the Wikifier built by the cognitive computing group at the University of Illinois at Urbana-Champaign.
Caveat: You may need to tweak a simple category-based code to suit your needs. E.g. there is no wikipedia page for "litigation". Instead, it redirects you to a page titled "lawsuit". Such cases need to be handled separately.
Final Note: This solution is really not in the area of NLP, but my past experience suggests that for some domains, this kind of ontology-based approach works really well. Also, I have used the "sports" example in my answer because I know nothing about legal terminology. But I hope my example helps you understand the underlying process.
I do not think your "algo" is even doing entity recognition... however, stretching the problem you presented quite a bit, what you want to do looks like coreference resolution in coordinated structures containing ellipsis. Not easy at all: start by googling for some relevant literature in linguistics and computational linguistics. I use the standard terminology from the field below.
In practical terms, you could start by assigning the nearest antecedent (the most frequently used approach in English). Using your examples:
first extract all the "entities" in a sentence
from the entity list, identify antecedent candidates ("litigation", etc.). This is a very difficult task, involving many different problems... you might avoid it if you know in advance the "entities" that will be interesting for you.
finally, you assign (resolve) each anaphora/cataphora to the nearest antecedent.
Have a look at CogComp NER tagger:
https://github.com/CogComp/cogcomp-nlp/tree/master/ner
I am working on a simple naive bayes classifier and I had a conceptual question about it.
I know that the training set is extremely important so I wanted to know what constitutes a good training set in the following example. Say I am classifying web pages and concluding if they are relevant or not. The factors on which this decision is based takes into account the probabilities of certain attributes being present on that page. These would be certain keywords that increase the relevancy of the page. The keywords are apple, banana, mango. The relevant/irrelevant score is for each user. Assume that a user marks the page relevant/irrelevant equally likely.
Now for the training data, to get the best training for my classifier, would I need to have the same number of relevant results as irrelevant results? Do I need to make sure that each user would have relevant/irrelevant results present for them to make a good training set? What do I need to keep in mind?
This is a slightly endless topic as there are millions of factors involved. Python is a good example as it drives most of goolge(for what I know). And this brings us to the very beginning of google-there was an interview with Larry Page some years ago who was speaking about the search engines before google-for example when he typed the word "university", the first result he found had the word "university" a few times in it's title.
Going back to naive Bayes classifiers - there are a few very important key factors - assumptions and pattern recognition. And relations of course. For example you mentioned apples - that could have a few possibilities. For example:
Apple - if eating, vitamins, and shape is present we assume that the we are most likely talking about a fruit.
If we are mentioning electronics, screens, maybe Steve Jobs - that should be obvious.
If we are talking about religion, God, gardens, snakes - then it must have something to do with Adam and Eve.
So depending on your needs, you could have a basic segments of data where each one of these falls into, or a complex structure containing far more details. So yes-you base most of those on plain assumptions. And based on those you can create a more complex patterns for further recognition-Apple-iPod, iPad -having similar pattern in their names, containing similar keywords, mentioning certain people-most likely related to each other.
Irrelevant data is very hard to spot-at this very point you are probably thinking that I own multiple Apple devices, writing on a large iMac, while this couldn't be further from the truth. So this would be a very wrong assumption to begin with. So the classifiers themselves must make a very good segmentation and analysis before jumping to exact conclusions.
I'm new to data mining and experimenting a bit.
Let's say I have N twitter users and what I want to find
is the overall theme they're writing about (based on tweets).
Then I want to give higher weight to each theme if that user has higher followers.
Then I want to merge all themes if there're similar enough but still
retain the weighting by twitter count.
So basically a list of "important" themes ranked by authority (user's twitter count)
For instance, like news.google.com but ranking would be based on twitter followers that are responsible for theme.
I'd prefer something in python since that's the language I'm most familiar with.
Any ideas?
Thanks
EDIT:
Here's a good example of what I'm trying to do (but with diff data)
http://www.facebook.com/notes/facebook-data-team/whats-on-your-mind/477517358858
Basically analyzing various data and their correlation to each other: work categories and each persons age or word categories and friend count as in this example.
Where would I begin to solve this and generate such graphs?
Generally speaking : R has some packages specifically directed at text mining and datamining, offering a wide range of techniques. I have no knowledge of that kind of packages in Python, but that doesn't mean they don't exist. I just wouldn't implement it all myself, it's a bit more complicated than it looks at first sight.
Some things you have to consider :
define "theme" : Is that the tags they use? Do you group tags? Do you have a small list with a limited set, or is the set unlimited?
define "general theme" : Is that the most used theme? How do you deal with ties? If a user writes about 10 themes about as much, what then?
define "weight" : Is that equivalent to the number of users? The square root? Some category?
If you have a general idea about this, you can start using the tm package for extracting all the information in a workable format. The package is based on matrices, and metadata objects. These allow you to get weighted frequencies for the different themes, provided you have defined what you consider a theme. You can also use different weighting functions to obtain what you want. The manual is here. But please also visit crossvalidated.com for extra guidance if you're not sure about what you're doing. This is actually more a question about data mining than it is about programming.
I have no specific code but I believe the methodology you want to employ is TF-IDF. It is explained here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf and is used quote often classify text.