I have been given a problem to solve, explained below:
The company maintains a dataset with specifications of all the products it sells (nearly 4,500 at present). Each customer shares the details (name, quantity, brand, etc.) of the products he/she wants to buy from the company. A customer, while entering details in his/her dataset, may spell the name of a product incorrectly. Also, a product can be referred to in many different ways in the company dataset. Example: red chilly can be referred to as guntur chilly, whole red chilly, red chilly with stem, red chilly without stem, etc.
I am absolutely confused about how to approach this problem. Should I use a machine-learning-based technique? If yes, please explain what I should do. If it is possible to solve this problem without machine learning, please explain that approach too. I am using Python.
The challenge: a customer can refer to a product in many ways, and the company also stores a single product in many ways, with different specifications like variations in name, quantity, unit of measurement, etc. With a labeled dataset I can find out that red bull energy drink (data entered by a customer) is red bull (the label), and that red bull (entered by a customer) is also red bull. But what's the use of finding this label? In my company's dataset, red bull is also present in many forms, so again I have to find all the different names under which red bull appears in the company dataset.
My approach:
I will prepare a Python dictionary like this:
{
"red chilly" : ['red chilly', 'guntur chilly', 'red chilly with stem'],
"red bull" : ['red bull energy drink', 'red bull']
}
Each entry in the dictionary represents a product: the key is a sort of canonical (stem) name, and the value is a list of all possible names for that product. Now, when a customer enters a product name, say red bull energy drink, I will check the value list of each key. If any value matches, I'll know the product is actually red bull and that it can be referred to as either red bull or red bull energy drink in the company dataset. How is this approach?
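This dictionary lookup can be combined with fuzzy string matching to absorb customer misspellings; a minimal sketch using Python's standard-library difflib (the 0.8 cutoff is an arbitrary choice):

```python
import difflib

# Synonym dictionary in the shape described above.
product_synonyms = {
    "red chilly": ["red chilly", "guntur chilly", "red chilly with stem"],
    "red bull": ["red bull energy drink", "red bull"],
}

def lookup_product(customer_name, synonyms, cutoff=0.8):
    """Return the canonical product key whose synonym list best matches
    the customer-entered name, tolerating small misspellings."""
    customer_name = customer_name.lower().strip()
    best_key, best_score = None, 0.0
    for key, names in synonyms.items():
        for name in names:
            score = difflib.SequenceMatcher(None, customer_name, name).ratio()
            if score > best_score:
                best_key, best_score = key, score
    return best_key if best_score >= cutoff else None

print(lookup_product("red bull energy drnik", product_synonyms))  # misspelled input
```

Exact matches score 1.0, so this degrades gracefully to the plain dictionary check when the customer types a known synonym.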
Best situation
If you have access to all possible usage names of each product, that is the best situation: all you have to do is check whether the name entered by the user falls among the synonyms. 5,000 products with, say, 10 synonyms each, under a well-designed schema, should be easily handled by any capable database system.
Search engine based solution
Let's say you don't have access to synonyms, but you do have a detailed English description of each product. Then you can search for the user-entered name in the descriptions. You can use a search engine like Apache Solr, which uses an inverted index based on TF-IDF. The document Solr returns as the top result will then be the corresponding product. In short, index your product descriptions into Solr and search for the user-entered product name. Mind that this is lexicon-based, not semantics-based, but lexicon-based will suffice for you, as long as your user doesn't call a banana a "yellow cylinder-shaped fruit".
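To illustrate the inverted-index/TF-IDF idea without standing up Solr, here is a minimal retrieval sketch in plain Python; the product descriptions are invented for illustration:

```python
import math
from collections import Counter

# Hypothetical product descriptions (the real ones would come from the company dataset).
docs = {
    "red bull": "red bull is a carbonated energy drink sold in cans",
    "red chilly": "whole dried red chilly with stem from guntur",
    "banana": "fresh yellow banana fruit sold by the dozen",
}

def tokenize(text):
    return text.lower().split()

# Document frequency and smoothed inverse document frequency over the corpus.
N = len(docs)
df = Counter()
for text in docs.values():
    df.update(set(tokenize(text)))
idf = {term: math.log(N / count) + 1.0 for term, count in df.items()}

def score(query, text):
    """Sum of TF-IDF weights of the query terms found in the document."""
    tf = Counter(tokenize(text))
    return sum(tf[t] * idf.get(t, 0.0) for t in tokenize(query))

def search(query):
    """Return the product whose description scores highest for the query."""
    return max(docs, key=lambda name: score(query, docs[name]))

print(search("red bull energy drink"))
```

Rare, discriminative terms like "bull" or "guntur" get high IDF weight, which is exactly why the top-scoring description usually is the right product.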
ML Based
There are good distributed vector representations (word2vec, GloVe), called embeddings. The important property of embeddings is that the distance between related words is small. However, these vectors alone are not enough for you, because what you have are phrases, not words (red is a word, but red chilly is a phrase). There are no good pre-trained phrase-to-vector embeddings available in open source, so if you want a model based on vector similarity, you will have to build your own phrase2vec model. Assuming you are able to build one, you then find the vector (corresponding to a product) closest to the vector of the product name typed by your customer.
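A common baseline for a phrase vector is simply the average of its word vectors. A toy sketch with made-up 3-dimensional embeddings (real vectors would be loaded from a pre-trained word2vec or GloVe model):

```python
import math

# Tiny invented word vectors; in practice these come from word2vec/GloVe.
word_vecs = {
    "red":    [0.9, 0.1, 0.0],
    "chilly": [0.8, 0.3, 0.1],
    "bull":   [0.1, 0.9, 0.2],
    "energy": [0.0, 0.8, 0.5],
    "drink":  [0.1, 0.7, 0.6],
}

def phrase_vec(phrase):
    """Average the vectors of the phrase's known words
    (assumes at least one word is in the vocabulary)."""
    vecs = [word_vecs[w] for w in phrase.lower().split() if w in word_vecs]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

products = ["red chilly", "red bull"]
query = "red bull energy drink"
best = max(products, key=lambda p: cosine(phrase_vec(p), phrase_vec(query)))
print(best)
```

Averaging loses word order, which is one reason a trained phrase2vec model would do better, but it already captures enough similarity to separate these two products.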
Related
I need to extract entities from sentences using NER and POS tags. For example,
Given the sentence below:
docx = nlp("The two blue cars belong to the tall Lorry Jim.")
where the entities are (two blue cars, tall Lorry Jim). Running spaCy NER on the sentence:
for ent in docx.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
It returns:
two 4 7 CARDINAL
Lorry Jim 37 46 PERSON
My goal is to append the adjectives/numbers in front of the entities identified by NER; in the case above, tall is an ADJ and should be appended to the Lorry Jim entity, and two blue cars should be extracted as NUM ADJ NOUN from the POS tagger.
First, I have to say that the task you describe is NOT what your title says. "Entity" has a standard definition, and an ADJ, for example, is not part of an entity.
I think that to solve your problem you should use dependency parsing and analyze the sentence's dependency tree. It can help you find which words refer to each other.
Alternatively, you can define a chunking task for your problem, build a dataset for what you mean, and train a model for that kind of chunking.
If you want this for practical use, you need to make your problem very clear and simple, so that you can choose a practical method for solving it. If you can accept some error, you can define simple rules over the NOUN and ADJ parts, so that if you have the POS and NER output together, you can solve it. It also depends on the language you are working in. Take your example:
blue car
In English, adjectives are commonly placed before nouns; this is known as the modifier or attributive position. But you have to be careful with sentences like this:
All the cars he had were blue.
For future work, you can also look at coreference resolution, as in:
I saw the cars that he drives and they were all blue.
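The simple NOUN/ADJ rules mentioned above can be sketched over already-POS-tagged tokens (the tags below are hand-written stand-ins for a spaCy pipeline's output):

```python
# Hypothetical (token, POS) pairs, as a tagger would produce for
# "The two blue cars belong to the tall Lorry Jim."
tagged = [
    ("The", "DET"), ("two", "NUM"), ("blue", "ADJ"), ("cars", "NOUN"),
    ("belong", "VERB"), ("to", "ADP"), ("the", "DET"), ("tall", "ADJ"),
    ("Lorry", "PROPN"), ("Jim", "PROPN"), (".", "PUNCT"),
]

MODIFIERS = {"NUM", "ADJ"}
HEADS = {"NOUN", "PROPN"}

def extract_chunks(tagged_tokens):
    """Emit maximal runs of NUM/ADJ modifiers plus NOUN/PROPN heads,
    keeping only runs that actually end in a head word."""
    chunks, run = [], []
    for word, pos in tagged_tokens:
        if pos in MODIFIERS or pos in HEADS:
            run.append((word, pos))
        else:
            if run and run[-1][1] in HEADS:
                chunks.append(" ".join(w for w, _ in run))
            run = []
    if run and run[-1][1] in HEADS:
        chunks.append(" ".join(w for w, _ in run))
    return chunks

print(extract_chunks(tagged))
```

This is exactly the "accept some error" trade-off: it handles attributive modifiers but will misgroup sentences where the adjective appears after the noun, as in the example above.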
Classification of descriptions into categories
I have a problem that involves determining what category a text description falls under. These text descriptions are entered in by users and may contain keywords that can be matched to a specific category. Each category has a set of keywords and phrases that can be matched to. There are about 100 categories.
For example, a text description might look like this, “Burlap aisle runner w/borders”, and the category “Fabric” contains the keyword “Burlap”, so the text description falls under that category:
text description | category
Orange Burlap aisle runner w/borders | Fabric
However, there are a couple of exceptions that make this categorization process more difficult.
First, there are text descriptions that contain keywords that match to multiple categories. For example, a text description could fall under 20 different categories (out of 100) due to having keywords that are the same in the categories. This does not permit the correct categorization of the text description.
For example, a text description that is “Orange Burlap aisle runner w/borders” would have the keyword “Orange”, which falls under the category “Fruit”, while also falling under “Fabric” due to the keyword “Burlap”.
text description | category
Orange Burlap aisle runner w/borders | Fabric, Fruit
Second, there are keywords in the text description that do not match directly to any of the categories. Again, this does not permit the correct categorization of the text description.
For example, a text description that contains the keyword “mouse” does not match directly with the category “Computer Accessory”.
Can anyone suggest an algorithm or Python library that can classify text descriptions even when no keyword matches directly, and that resolves multi-category matches down to a single category?
I have broken down the keywords for both the text descriptions and categories, and then matched them.
This was the code I used to match the text description with the categories.
%LivyPy3.pyspark
# For each description (already split into keywords), look each keyword up
# in the category dictionary.
entries['category'] = list(map(lambda kws: list(map(categories_list.get, kws)),
                               entries['text_description']))
However, this script produces either multiple categorizations or no categorization at all.
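One simple refinement over per-keyword lookup is to score each category by how many of its keywords appear in the description and keep only the best-scoring one; a sketch with invented keyword lists:

```python
# Hypothetical category → keywords mapping (the real one would have ~100 categories).
category_keywords = {
    "Fabric": {"burlap", "runner", "linen"},
    "Fruit": {"orange", "apple", "banana"},
}

def best_category(description):
    """Pick the single category sharing the most keywords with the description;
    return None when nothing matches at all."""
    tokens = set(description.lower().replace("/", " ").split())
    scores = {cat: len(tokens & kws) for cat, kws in category_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(best_category("Orange Burlap aisle runner w/borders"))
```

Scoring resolves the multi-category problem above ("Fabric" wins 2 keywords to "Fruit"'s 1), though indirect matches like mouse → Computer Accessory still need something semantic such as word2vec.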
I suggest you look up https://skymind.ai/wiki/word2vec. Word2vec allows for vectorization of phrases and sentences, applying more context to each word, and word2vec models create better word-association models.
I would also search Google Scholar for papers matching NLP AND word2vec AND NIPS AND categorization. This search yielded 4,300+ papers that would give you a lot of direction in solving your problem. If you want only one category to be chosen overall, this is a very difficult task. I saw a presentation on Mailchimp's NLP model for classifying client content into categories, and sometimes the correct category would literally be the 4th one suggested. The model they created was very well done, but it still couldn't handle some edge cases and contained some classic biases toward more common categories over less common ones.
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C11&q=NLP+AND+word2vec+AND+categorization+AND+mailchimp&btnG=
The recommendation-engine paper is relevant to your task because predicting the context of a small number of words in order to make a search suggestion is a similar problem.
I'm writing a program to analyze the usage of color in text. I want to search for color words such as "apricot" or "orange". For example, an author might write "the apricot sundress billowed in the wind." However, I want to only count the apricots/oranges that actually describe color, not something like "I ate an apricot" or "I drank orange juice."
Is there any way to do this, perhaps using context() in NLTK?
Welcome to the broad field of homonymy, polysemy and WSD (word-sense disambiguation). In corpus linguistics, one approach uses collocations (e.g. "orange juice" vs. "orange jacket") to determine the probability of the juice having the colour "orange" or being made of the respective fruit. For juice, both probabilities are high, but the probability of a jacket being made of the fruit should be much lower. There are different methods you could use. You could ask corpus annotators (specialists, crowdsourcing, etc.) to annotate occurrences in text, and use that data to train a (machine-learning) model, in this case a simple classifier. Otherwise, you could use large text corpora to gather collocation counts in combination with WordNet, which may give you semantic information about whether it is usual for a jacket to be made of fruit. A fortunate detail is that people only rarely name stereotypical colours in text, so you don't have to worry about cases like "the yellow banana".
Shallow parsing may also help, since colour adjectives are preferably used in attributive position.
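The attributive-position heuristic can be sketched directly over POS-tagged tokens (hand-written tags standing in for a real tagger); a colour word counts only when it immediately precedes a noun:

```python
COLOR_WORDS = {"orange", "apricot", "yellow"}

# Hypothetical (token, POS) pairs, e.g. as produced by a POS tagger for
# "the apricot sundress billowed in the wind".
sentence = [
    ("the", "DET"), ("apricot", "ADJ"), ("sundress", "NOUN"),
    ("billowed", "VERB"), ("in", "ADP"), ("the", "DET"), ("wind", "NOUN"),
]

def color_uses(tagged):
    """Return (colour word, noun) pairs where the colour word is
    in attributive position, i.e. immediately before a noun."""
    hits = []
    for (word, _), (nxt, nxt_pos) in zip(tagged, tagged[1:]):
        if word.lower() in COLOR_WORDS and nxt_pos == "NOUN":
            hits.append((word, nxt))
    return hits

print(color_uses(sentence))
```

"I ate an apricot" yields no hit, since the fruit noun has nothing following it to modify; predicative uses like "the sundress was apricot" would slip through this rule, which is where WSD comes in.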
A different approach would be to use word-similarity measures (vector-space semantics) or embeddings for word-sense disambiguation (WSD).
Maybe this helps:
https://web.stanford.edu/~jurafsky/slp3/slides/Chapter18.wsd.pdf
https://towardsdatascience.com/a-simple-word-sense-disambiguation-application-3ca645c56357
I'm implementing an application that tracks the locations of Australia's sharks by analysing a Twitter dataset. I'm using "shark" as the keyword and searching for tweets that contain "shark" and a location phrase.
So the question is: how do I identify that "Airlie Beach at Hardy Reef" is the location correlated with "shark"? If possible, can anyone provide working Python code to demonstrate? Thank you so much!
If you've already used NER to extract a list of locations, could you then create a table of target words and assign probabilities of being the correct location? For example, you are interested in beaches, not hospitals: if "beach" is mentioned within the location, the probability of it being the correct location increases. Another hacky way of doing it might be to measure the number of characters or tokens between the word "shark" and the location, hoping that the smaller the distance, the more likely the location is to be related to the actual attack.
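The distance heuristic can be sketched directly; the tweet text is invented, and the candidate locations are assumed to come from a prior NER pass:

```python
def nearest_location(tweet, keyword, locations):
    """Pick the candidate location whose first token sits closest
    (in token distance) to the keyword."""
    tokens = tweet.lower().split()
    kw_idx = tokens.index(keyword.lower())

    def distance(loc):
        first = loc.lower().split()[0]
        positions = [i for i, t in enumerate(tokens) if t == first]
        # Penalize locations not found at all with the maximum distance.
        return min(abs(i - kw_idx) for i in positions) if positions else len(tokens)

    return min(locations, key=distance)

tweet = "Huge shark spotted near Airlie Beach at Hardy Reef far from Sydney"
print(nearest_location(tweet, "shark", ["Airlie Beach", "Sydney"]))
```

Token distance is crude (it ignores syntax entirely), but as a tie-breaker over NER output it is cheap and often good enough.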
This is not an easy task; it requires named-entity recognition: https://www.quora.com/What-are-the-best-python-libraries-for-extracting-location-from-text
I would like to know how to use an unsupervised approach to extract patterns from text data.
I have a dataset with product descriptions in the form of a title, a short description, and a long description. My goal is to find the values of product attributes using the available descriptions. The values I am trying to find appear in the descriptions in many variations.
Below are a few examples of attributes a product has:
1. Recommended minimum and maximum age for a particular product (get the values).
2. Is a particular product made from recycled material or not? (Yes or no.)
3. Is a remote control included for a particular product? (Yes or no.)
Currently I am using regular expressions to get the values / find whether they are present in the data or not. But it is very hard to find the values because, as I mentioned, they appear in many variations. I can't write all the rules, or, more specifically, I can't generalize these patterns: if a new variation comes along, my regex fails.
I was wondering whether there is any fairly intuitive way to automatically build these regex patterns with some sort of algorithm.
How do I use a machine-learning approach to build an intelligent model that can solve my problem?
Below is one example of a product description.
Example:
UVM1067 Features Quantity per Selling Unit: 1 Set **Total Recycled Content: 30pct** Product Keywords: Kleer-Fax, Inc., Indexes, 8 Color, 10 Color Binders Sets per Pack: 1 Tab Style: 15-Tab Color: Multicolor Country of Manufacture: United States Index Divider Style: Printed Numeric Dimensions Overall Height - Top to Bottom: 11'' Overall Width - Side to Side: 8.5'' Overall Product Weight: 0.3 lbs
You can see that the description above mentions total recycled content, which means the product is made from recycled material, so I would like to predict 'Y' as my output.
I can do this by searching for the word or using a regex, but I want to build some intelligent/automatic model to achieve this.
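Before moving to ML, the regex approach can at least absorb some known variations with alternation; a sketch run on the example description above (the "material" alternative is an assumed variation, and real data would need many more):

```python
import re

description = ("UVM1067 Features Quantity per Selling Unit: 1 Set "
               "Total Recycled Content: 30pct Product Keywords: Kleer-Fax, Inc.")

# One pattern covering several phrasings of the "recycled" attribute;
# the alternations are the part that keeps growing with new variations.
pattern = re.compile(
    r"(?:total\s+)?recycled\s+(?:content|material)s?[:\s]*(\d+)\s*(?:pct|%)?",
    re.IGNORECASE,
)

match = pattern.search(description)
print("Y" if match else "N", match.group(1) if match else None)
```

This illustrates the scaling problem directly: every unseen phrasing means editing the pattern by hand, which is what a learned extraction model would avoid.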
Thanks,
Niranjan