Machine learning: unsupervised approach to extract patterns from text data using Python?

I would like to know how to use an unsupervised approach to extract patterns from text data.
I have a dataset of product descriptions in the form of a title, a short description, and a long description. My goal is to find the value of a product attribute using the available description. The value I am trying to find appears in the description in many variations.
Below are a few examples of attributes a product can have:
1. Recommended minimum and maximum age for a particular product (get the values).
2. Is a particular product made from recycled material or not? (Yes or no.)
3. Is a remote control included with a particular product? (Yes or no.)
Currently I am using regular expressions to get the values, or to find whether they are present in the data at all. But it is very hard to find the values because, as I mentioned, they appear in many variations. I can't write rules for all of them; more precisely, I can't generalize these patterns. Whenever a new variation appears, my regex fails.
I was wondering whether there is any fairly intuitive way to automatically build these regex patterns with some sort of algorithm.
How do I use a machine learning approach to build an intelligent model that can solve my problem?
Below is one example of a product description.
Example:
UVM1067 Features Quantity per Selling Unit: 1 Set **Total Recycled Content: 30pct** Product Keywords: Kleer-Fax, Inc., Indexes, 8 Color, 10 Color Binders Sets per Pack: 1 Tab Style: 15-Tab Color: Multicolor Country of Manufacture: United States Index Divider Style: Printed Numeric Dimensions Overall Height - Top to Bottom: 11'' Overall Width - Side to Side: 8.5'' Overall Product Weight: 0.3 lbs
You can see that the description above mentions "Total Recycled Content", which means the product is made from recycled material, so I would like to predict 'Y' as my output.
I can do this by searching for the word or with a regex, but I want to build an intelligent/automatic model to achieve this.
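To make my current approach concrete, below is a minimal sketch of the kind of hand-written rule I maintain today (the regex is simplified; the real variations are much messier):

import re

# One hand-written rule per attribute; it breaks whenever a new wording variation appears.
RECYCLED = re.compile(r"recycled\s+content\s*:?\s*(\d+)\s*(?:%|pct)", re.IGNORECASE)

description = ("UVM1067 Features Quantity per Selling Unit: 1 Set "
               "Total Recycled Content: 30pct Product Keywords: Kleer-Fax, Inc.")

match = RECYCLED.search(description)
print("Y" if match else "N")        # -> Y (made from recycled material)
if match:
    print(match.group(1))           # -> 30 (the recycled-content value)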
Thanks,
Niranjan

Related

Need to apply a rule-based algorithm to a large corpus to find similar/relevant keywords present in arrays of elements

I am currently working on an NLP task over text data. I want to match each document in a corpus against domain-specific keyword lists (a domain dictionary) and predict the matching category through a keyword search:
developer_position = ['software engineer', 'florida', 'highest pay', 'startups']
analyst_position = ['qa', 'testing', 'plsql']
data_science_position = ['analytics lead', 'lead', 'python', 'R']
architect_position = ['mongodb', 'technical architect', 'sql', 'java', 'kafka']
manager_position = ['pmp certified', 'sixsigma', 'belt', 'delivery manager']
corpus = ["software engineer positions are high demand in California",
          "qa average salary in USA is $120K-$150K",
          "Django & reactjs are minimum requirements for lead positions"]
The output should predict which category each row of the corpus falls into, based on the highest-scoring keyword matches for each category.
You could use spaCy's rule-based matching in Python; in JavaScript you could use winkNLP custom entities, or in Java, CoreNLP's TokensRegex.
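For the Python option, here is a minimal sketch using spaCy's PhraseMatcher (the keyword lists come from the question; the naive "most keyword hits wins" scoring is an assumption for illustration, not something spaCy provides):

import spacy
from spacy.matcher import PhraseMatcher

categories = {
    "developer_position": ["software engineer", "florida", "highest pay", "startups"],
    "analyst_position": ["qa", "testing", "plsql"],
    "data_science_position": ["analytics lead", "lead", "python", "R"],
}

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")   # case-insensitive phrase lookup
for label, keywords in categories.items():
    matcher.add(label, [nlp.make_doc(kw) for kw in keywords])

corpus = ["software engineer positions are high demand in California",
          "qa average salary in USA is $120K-$150K"]
for text in corpus:
    hits = [nlp.vocab.strings[match_id] for match_id, start, end in matcher(nlp(text))]
    # Naive scoring: the category with the most matched keywords wins.
    print(text, "->", max(set(hits), key=hits.count) if hits else "no match")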

Python + Machine Learning: String matching problem [duplicate]

I have been given a problem to solve, explained below:
The company maintains a dataset of specifications for all the products it sells (nearly 4,500 at present). Each customer shares the details (name, quantity, brand, etc.) of the products he/she wants to buy from the company. A customer entering details into his/her dataset may spell a product name incorrectly. Also, a product can be referred to in many different ways in the company dataset. Example: red chilly can be referred to as guntur chilly, whole red chilly, red chilly with stem, red chilly without stem, etc.
I am absolutely confused about how to approach this problem. Should I use a machine learning based technique? If yes, please explain what to do. Or, if it is possible to solve this problem without machine learning, explain that approach. I am using Python.
The challenge: a customer can refer to a product in many ways, and the company also stores a single product in many ways, with variations in name, quantity, unit of measurement, etc. With a labeled dataset I can find out that red bull energy drink (entered by a customer) maps to the label red bull, and that red bull (entered by a customer) also maps to red bull. But what is the use of finding this label? In my company dataset, red bull is itself present in many forms, so I would again have to find all the different names under which red bull appears there.
My approach:
I will prepare a Python dictionary like this:
{
    "red chilly": ["red chilly", "guntur chilly", "red chilly with stem"],
    "red bull": ["red bull energy drink", "red bull"]
}
Each entry in the dictionary is a product: the keys are, roughly, the stem names of the products, and the values are all possible names for that product. Now a customer enters a product name, say red bull energy drink. I check each key in the dictionary; if any value under a key matches, I conclude that the product is actually red bull and that it can be referred to as either red bull or red bull energy drink in the company dataset. How is this approach?
Best situation
If you have access to all possible usage names of a product, that is the best situation: all you have to do is check whether the name entered by the user falls among the synonyms. 5,000 products with, say, 10 synonyms each, under a well-designed schema, should be easily handled by any capable database system.
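A toy sketch of that lookup in Python (product names taken from the question; a real system would keep this table in a database rather than an in-memory dict):

# Reverse index: every known synonym points to its canonical product name.
synonyms = {
    "red chilly": ["red chilly", "guntur chilly", "whole red chilly",
                   "red chilly with stem", "red chilly without stem"],
    "red bull": ["red bull", "red bull energy drink"],
}
lookup = {name.lower(): product
          for product, names in synonyms.items()
          for name in names}

print(lookup.get("Red Bull Energy Drink".lower()))   # -> red bull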
Search engine based solution
Let's say you don't have access to synonyms, but you do have access to a detailed English description of each product. Then you can search for the user-entered name in the descriptions. You can use a search engine like Apache Solr, which uses an inverted index based on TF-IDF; the document Solr returns as the top result will then be the corresponding product. In short: index your product descriptions into Solr and search Solr for the user-entered product name. Mind that this is lexicon-based, not semantics-based, but lexicon-based should suffice for you, as long as your users will not call a banana a "yellow cylinder-shaped fruit".
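If Solr is too heavy to start with, roughly the same idea can be sketched in memory with scikit-learn's TF-IDF vectorizer (the two product descriptions below are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Index the product descriptions with TF-IDF, then return the closest one.
descriptions = [
    "red bull energy drink, 250 ml can, caffeinated beverage",
    "guntur red chilly with stem, whole dried chillies, 500 g",
]
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(descriptions)

query = vectorizer.transform(["red bull"])
best = cosine_similarity(query, index).argmax()
print(descriptions[best])   # -> the red bull description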
ML-based solution
There are good distributed vector representations (word2vec, GloVe) called embeddings. The important property of embeddings is that the distance between related words is small. However, these vectors are not ideal for you, because what you have are phrases, not words (red is a word, but red chilly is a phrase), and there are no good pre-trained phrase-to-vector embeddings available in open source. If you want a model based on vector similarity, you will have to build your own phrase2vec model. Assuming you can build one, you then find the vector (corresponding to a product) that is closest to the vector of the product name typed by your customer.
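A rough sketch of that averaged-word-vector idea, with a tiny hand-made embedding table standing in for real word2vec/GloVe vectors:

import numpy as np

word_vec = {                              # toy 3-d embeddings for illustration
    "red":    np.array([0.9, 0.1, 0.0]),
    "chilly": np.array([0.1, 0.9, 0.0]),
    "bull":   np.array([0.0, 0.2, 0.9]),
    "guntur": np.array([0.2, 0.8, 0.1]),
}

def phrase_vec(phrase):
    # Crude phrase2vec: average the vectors of the known words in the phrase.
    return np.mean([word_vec[w] for w in phrase.split() if w in word_vec], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(phrase_vec("red chilly"), phrase_vec("guntur chilly")))  # high (~0.8)
print(cosine(phrase_vec("red chilly"), phrase_vec("red bull")))       # lower (~0.65)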

How to implement a supervised learning task

I am trying to implement a machine learning algorithm that will help me with two goals:
1) Classify a given string from a set into a predetermined category based on its content.
2) Estimate the confidence that the given string belongs in that category.
An example set of strings and their categories is below:
"Damage to right rear fender" -- Problem
"Scratch. Side view mirror" -- Problem
"Next scheduled maintenance on 12/23/2016" -- Appointment
"Customer should return on 1/1/2017" -- Appointment
"Red car, Volkswagon" -- Description
"Car is dark gray with large scratch on the side" -- Description
" Do not fill the car with premium fuel" -- Instruction
"Engine should cool to <100 celcius before driving" -- Instruction
I am brand new to machine learning, so I am trying to figure out the best approach to accomplish my goal in Python. I have a training set of approximately 1,000 strings and a test set of 5,000 strings.
My first approach was to try a one-vs-rest classifier using scikit-learn (credit to #Cerin and #JMaurer), but on implementation the results were not great (only 55% of my results were categorized correctly on manual review). I suspect this is because the strings contain symbols and numbers that contribute to their overall categorization.
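Roughly, my attempt looked like the sketch below (simplified; the custom token_pattern that keeps numbers and symbols as tokens is an idea I am experimenting with, not part of my original run):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

train_texts = ["Damage to right rear fender",
               "Next scheduled maintenance on 12/23/2016",
               "Red car, Volkswagon",
               "Do not fill the car with premium fuel"]
train_labels = ["Problem", "Appointment", "Description", "Instruction"]

model = make_pipeline(
    TfidfVectorizer(token_pattern=r"\S+"),      # keep numbers/symbols as tokens
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(train_texts, train_labels)

# predict_proba gives the per-category confidence asked for in goal 2.
for label, p in zip(model.classes_, model.predict_proba(["Scratch. Side view mirror"])[0]):
    print(label, round(p, 3))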
Can anybody with a bit more experience comment on whether this is the right approach for the task, or whether there is a better method I could use? I am a bit in the dark and am really looking for some breadcrumbs to point me in the right direction.
Thanks.
Paul

Hand tagging a training set with customized tags

I would like to perform some natural language processing on cooking recipes, in particular the ingredients (and perhaps the preparation steps later on). Basically, I am looking to create my own set of POS tags to help me determine the meaning of an ingredient line.
For example, if one of the ingredients was:
3/4 cup (lightly packed) flat-leaf parsley leaves, divided
I would want tags to express the ingredient being listed and the quantity, which is usually a number followed by some unit of measurement. For example:
3\NUM-QTY/\FRACTION4\NUM-QTY cup\N-MEAS (lightly\ADV packed\VD) [flat-leaf\ADJ parsley\N]\INGREDIENT leaves\N, divided\VD
The tags I found here.
I am uncertain about a few things:
Should I be using custom tags, or should I be doing some sort of post-tagging processing after using a pre-existing tagger?
If I do use custom tags, is the best way to make a training text to just go through an ingredient list and tag everything by hand?
I feel like this language processing is so specific that it would be beneficial to train a tagger on an applicable set, but I'm not exactly sure how to proceed.
Thanks!
Use the pattern.search library.
The Python pattern library supports many tags [1], including a cardinal-number tag (CD).
Once you have tagged the cardinals, fractions are "cardinal/cardinal" or something like "cardinal cardinal/cardinal".
Regarding quantities, you should build a taxonomy of cooking quantities; the pattern library also supports lemmatization [2].
I think that using pattern.search [2] you could build a Constraint that fits your data and then run pattern searches over the text with it.
[1] http://www.clips.ua.ac.be/pages/mbsp-tags
[2] http://www.clips.ua.ac.be/pages/pattern-search
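A rough sketch of that idea (note that pattern is an older, Python 2-era library; the CD and NN tags follow the MBSP tag set in [1], and the exact tokenization of fractions like "3/4" may vary):

from pattern.en import parsetree
from pattern.search import search

line = "3/4 cup lightly packed flat-leaf parsley leaves, divided"
tree = parsetree(line, lemmata=True)

for match in search("CD", tree):   # CD = cardinal number, i.e. the quantities
    print(match)

for match in search("NN", tree):   # NN = nouns: candidate units and ingredients
    print(match)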

Good algorithm to find themes in tweets ranked by follower counts?

I'm new to data mining and experimenting a bit.
Let's say I have N Twitter users, and what I want to find is the overall theme each of them writes about (based on their tweets). Then I want to give a theme more weight if its user has more followers. Then I want to merge themes that are similar enough, while still retaining the follower-count weighting.
So basically: a list of "important" themes ranked by authority (the users' follower counts).
For instance, like news.google.com, but with ranking based on the Twitter followers of the users responsible for each theme.
I'd prefer something in python since that's the language I'm most familiar with.
Any ideas?
Thanks
EDIT:
Here's a good example of what I'm trying to do (but with different data):
http://www.facebook.com/notes/facebook-data-team/whats-on-your-mind/477517358858
Basically, it analyzes various data and their correlations with each other: work categories against each person's age, or word categories against friend count, as in that example.
Where would I begin to solve this and generate such graphs?
Generally speaking: R has some packages aimed specifically at text mining and data mining, offering a wide range of techniques. I have no knowledge of that kind of package in Python, but that doesn't mean they don't exist; I just wouldn't implement it all myself, as it's a bit more complicated than it looks at first sight.
Some things you have to consider:
define "theme": is that the tags they use? Do you group tags? Do you have a small list with a limited set, or is the set unlimited?
define "general theme": is that the most-used theme? How do you deal with ties? If a user writes about 10 themes roughly equally, what then?
define "weight": is that equivalent to the number of followers? Its square root? Some category?
If you have a general idea about this, you can start using the tm package to extract all the information in a workable format. The package is based on matrices and metadata objects, which allow you to get weighted frequencies for the different themes, provided you have defined what you consider a theme. You can also use different weighting functions to obtain what you want. The manual is here. But please also visit crossvalidated.com for extra guidance if you're not sure about what you're doing. This is really more a question about data mining than about programming.
I have no specific code, but I believe the methodology you want is TF-IDF. It is explained here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf and is used quite often to classify text.
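For instance, a minimal TF-IDF sketch with scikit-learn (the tweets and follower counts are invented; weighting each tweet's terms by its author's follower count is layered on top, as the question describes):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["new python release announced today",
          "python data mining tips and tricks",
          "my cat did something funny again"]
followers = np.array([10000, 500, 50])   # one author per tweet

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(tweets).toarray()

# Scale each tweet's TF-IDF row by its author's followers, then rank terms
# as rough "theme" candidates.
scores = (tfidf * followers[:, None]).sum(axis=0)
terms = vectorizer.get_feature_names_out()
for i in scores.argsort()[::-1][:5]:
    print(terms[i], round(float(scores[i]), 1))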
