string matching with NLP

I have two dataframes, df1 and df2, with ~40,000 rows and ~70,000 rows respectively of data about polling stations in country A.
The two dataframes share some columns, such as 'polling_station_name', 'province' and 'district'. However, df1 has latitude and longitude columns and df2 doesn't, so I am trying to do string matching between the two dataframes so that at least some rows of df2 get a geolocation. I am blocking on the 'district' column while doing the string matching.
This is the code that I have so far:
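import recordlinkage
from recordlinkage.standardise import clean

# Only compare pairs of rows that share the same district (blocking)
indexer = recordlinkage.Index()
indexer.block('district')
candidate_links = indexer.index(df1, df2)

# Score each candidate pair on the similarity of the station names;
# with a threshold, the feature is 1 when the similarity is >= 0.75, else 0
compare = recordlinkage.Compare()
compare.string('polling_station_name', 'polling_station_name', method='damerau_levenshtein', threshold=0.75)
compare_vectors = compare.compute(candidate_links, df1, df2)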
This produced about 12,000 matches, however I have noticed that some polling station names are incorrectly being matched because their names are very similar when they are in different locations - e.g. 'government girls primary school meilabu' and 'government girls primary school muzaka' are clearly different, yet they are being matched.
I think utilising NLP might help here, to see if there are certain words that occur very frequently in the data, like 'government', 'girls', 'boys', 'primary', 'school', etc. so I can put less emphasis on those words, and put more emphasis on meilabu, muzaka etc. while doing the string matching, but I am not so sure where to start.
(For reference, many of the polling stations are 'government (i.e.public) schools')
Any advice would be greatly appreciated!

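The topic is very broad, so start with the standard approaches:
TF-IDF: term frequency-inverse document frequency is often used as a weighting factor, so that words occurring in most names (such as 'government' or 'school') count for less.
Cosine similarity: measure the similarity between two sentences as the cosine of the angle between their vectors.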
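To make the two ideas above concrete, here is a minimal sketch in plain Python (no external libraries). The first two names come from the question; the other three are made up so that the common words have something to be common against. The helper names (`build_idf`, `tfidf_vector`, `cosine`) are illustrative, not part of any library.

```python
import math
from collections import Counter

def build_idf(names):
    """Inverse document frequency: log(N / number of names containing the token)."""
    doc_freq = Counter(tok for name in names for tok in set(name.split()))
    return {tok: math.log(len(names) / df) for tok, df in doc_freq.items()}

def tfidf_vector(name, idf):
    """Sparse TF-IDF vector as a {token: weight} dict; unseen tokens get weight 0."""
    return {tok: tf * idf.get(tok, 0.0) for tok, tf in Counter(name.split()).items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

names = [
    "government girls primary school meilabu",  # from the question
    "government girls primary school muzaka",   # from the question
    "government boys primary school meilabu",   # hypothetical
    "government boys high school muzaka",       # hypothetical
    "government girls high school kotli",       # hypothetical
]

idf = build_idf(names)
vectors = [tfidf_vector(name, idf) for name in names]

different_place = cosine(vectors[0], vectors[1])
same_name = cosine(vectors[0], vectors[0])
print("meilabu vs muzaka:", round(different_place, 3))
print("identical names:  ", round(same_name, 3))
```

Words that occur in every name ('government', 'school') get an IDF of zero here, so the score is driven mostly by the place names. On the full data you would probably compute the IDF over all station names in both dataframes, and possibly use a smoothed IDF so that no token is ignored completely.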

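@ipj is right, the topic is very broad. You can try out the methods below:
# Assumption: cosine_similarity is scikit-learn's, which expects 2D arrays
from sklearn.metrics.pairwise import cosine_similarity

def get_sim_measure(sentence1, sentence2):
    # get_vector is a placeholder; the options for it are listed below
    vec1 = get_vector(sentence1)
    vec2 = get_vector(sentence2)
    return cosine_similarity(vec1, vec2)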
Now the get_vector method can be many things:
Remove the stop words first, then use word2vec or GloVe at the word level and average the word vectors for the sentence. (simple)
Use doc2vec from Gensim for a vector embedding of the sentence. (medium)
Use BERT (DistilBERT or something lighter) for dynamic, context-aware embeddings. (hard)
Use TF-IDF and then GloVe embeddings. (simple)
Use spaCy's entity recognition and then do similarity matching on the entity labels; in this case the words from 'government girls primary school' will act as stop words. (slow but simple)
Use the BLEU score to measure word overlap, in case you need it. (may be misleading)
There are many possible situations, so it is better to try a few simple ones first and go from there.
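As one of the "simple ones", the stop-word idea can be tried without any embeddings at all: derive the stop words from the data itself, strip them, and fuzzy-match what is left. The sketch below is only an illustration under that assumption. It uses `difflib.SequenceMatcher` from the standard library rather than any of the embedding methods listed above, and the function names and the `min_share` cut-off are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def frequent_tokens(names, min_share=0.5):
    """Tokens that occur in at least `min_share` of all names (data-driven stop words)."""
    doc_freq = Counter(tok for name in names for tok in set(name.split()))
    return {tok for tok, df in doc_freq.items() if df / len(names) >= min_share}

def strip_tokens(name, stop_words):
    """Drop the stop words and keep the remaining tokens in order."""
    return " ".join(tok for tok in name.split() if tok not in stop_words)

def name_similarity(a, b, stop_words):
    """Character-level similarity of the distinctive part of two names."""
    a_core, b_core = strip_tokens(a, stop_words), strip_tokens(b, stop_words)
    if not a_core or not b_core:
        # Nothing distinctive left: fall back to comparing the full names
        return SequenceMatcher(None, a, b).ratio()
    return SequenceMatcher(None, a_core, b_core).ratio()

names = [
    "government girls primary school meilabu",  # from the question
    "government girls primary school muzaka",   # from the question
    "government boys primary school meilabu",   # hypothetical
    "government boys high school muzaka",       # hypothetical
    "government girls high school kotli",       # hypothetical
]
stop_words = frequent_tokens(names)

different_place = name_similarity(names[0], names[1], stop_words)
small_typo = name_similarity(names[0], "government girls primary school meilabo", stop_words)
print("meilabu vs muzaka :", round(different_place, 3))
print("meilabu vs meilabo:", round(small_typo, 3))
```

The design choice here is that typos in the place name still score high, while two different places sharing the same boilerplate words score low.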

Related

Can I use ml/nlp to determine pattern in the type of usernames are generated?

I have a dataset which has first and last names along with their respective email ids. Some of the email ids follow a certain pattern such as:
Fn1 = John, Ln1 = Jacobs, eid1 = jj@xyz.com
Fn2 = Emily, Ln2 = Queens, eid2 = eq@pqr.com
Fn3 = Harry, Ln3 = Smith, eid3 = hsm@abc.com
The content after the @ has no importance for finding the pattern. I want to find out how many people follow a certain pattern and what that pattern is. Is it possible to do so using NLP and Python?
EXTRA: To know what kind of pattern is for a some number of people we can store examples of that pattern along with its count in an excel sheet?!

You certainly could - e.g., you could try to learn a relationship between your input and output data as
(Fn, Ln) --> eid
and further dissect this relationship into patterns.
However, before hitting the problem with complex tools (especially if you are new to ML/NLP), I'd do further analysis of the data first.
For example, I'd first be curious to see what portion of your data displays the clear patterns you've shown in the examples - using the first character(s) of the individual's first/last name to build the corresponding eid (which could easily be determined programmatically).
Setting aside the portion of the data that satisfies this clear pattern - what does the remainder look like?
Is there another clear, but different, pattern in some of this data?
If there is - I'd then perform the same exercise again - construct a proper filter to collect and set aside the data satisfying that pattern - and examine the remainder.
Doing this analysis might help determine at least a partial answer to your inquiry rather quickly:
To know what kind of pattern is for a some number of people we can store examples of that pattern along with its count in an excel sheet?!
Moreover, it will help determine
a) whether you even need more complex tooling (if enough patterns can easily be sieved out this way, is it worth the investment to go heavier?) or
b) if you do, which portion of the data to target with heavier tools (the remainder of this process: the rows not containing simple patterns).
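The filter-and-set-aside process described in this answer could be sketched roughly as follows. The first three records are the ones from the question (assuming the usual @ separator in the addresses); the last two records and the list of candidate patterns are made up for illustration, and you would extend the patterns as you inspect the unmatched remainder.

```python
from collections import Counter

def candidate_patterns(first, last):
    """Local parts that a few simple naming rules would produce (illustrative rules)."""
    f, l = first.lower(), last.lower()
    return {
        "first initial + last initial": f[:1] + l[:1],
        "first initial + first two letters of last name": f[:1] + l[:2],
        "first initial + last name": f[:1] + l,
        "first name . last name": f + "." + l,
    }

def classify(first, last, email):
    """Return the label of the first rule that reproduces the local part, else 'unmatched'."""
    local = email.split("@", 1)[0].lower()
    for label, expected in candidate_patterns(first, last).items():
        if local == expected:
            return label
    return "unmatched"

records = [
    ("John", "Jacobs", "jj@xyz.com"),
    ("Emily", "Queens", "eq@pqr.com"),
    ("Harry", "Smith", "hsm@abc.com"),
    ("Ana", "Lopez", "ana.lopez@abc.com"),  # hypothetical
    ("Raj", "Patel", "rp1987@abc.com"),     # hypothetical
]

counts = Counter(classify(*record) for record in records)
for label, count in counts.most_common():
    print(count, label)
```

The rows labelled "unmatched" are the remainder to examine next. The counts, together with one example row per label, could then be written out with the `csv` module and opened in Excel.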

Emotional score of sentences using Spacy

I have a series of 100,000+ sentences and I want to rank how emotional they are.
I am quite new to the NLP world, but this is how I managed to get started (adapted from spaCy 101):
import spacy
from spacy.matcher import Matcher

# Assumption: any English pipeline works here; en_core_web_sm is one common choice
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

def set_sentiment(matcher, doc, i, matches):
    # Callback run once per match: bump the document-level score
    doc.sentiment += 0.1

myemotionalwordlist = ['you', 'superb', 'great', 'free']
sentence0 = 'You are a superb great free person'
sentence1 = 'You are a great person'
sentence2 = 'Rocks are made of minerals'
sentences = [sentence0, sentence1, sentence2]

pattern2 = [[{"ORTH": emotionalword, "OP": "+"}] for emotionalword in myemotionalwordlist]
# spaCy v2 signature; in spaCy v3 this is matcher.add("Emotional", pattern2, on_match=set_sentiment)
matcher.add("Emotional", set_sentiment, *pattern2)  # Match one or more emotional word

for sentence in sentences:
    doc = nlp(sentence)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
    print("Sentiment", doc.sentiment)
myemotionalwordlist is a list of about 200 words that I've built manually.
My questions are:
(1-a) Counting the number of emotional words does not seem like the best approach. Does anyone have suggestions for a better way of doing so?
(1-b) In case this approach is good enough, any suggestions on how I can extract emotional words from wordnet?
(2) What's the best way of scaling this up? I am thinking about adding all sentences to a pandas data frame and then applying the match function to each one of them.
Thanks in advance!

There are going to be two main approaches:
the one you have started, which is a list of emotional words, and counting how often they appear
showing examples of what you consider emotional sentences and what are unemotional sentences to a machine learning model, and let it work it out.
The first way will get better as you give it more words, but you will eventually hit a limit. (Simply due to the ambiguity and flexibility of human language, e.g. while "you" is more emotive than "it", there are going to be a lot of unemotional sentences that use "you".)
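For the first, word-list approach, a minimal sketch in plain Python might look like the following. It is not the asker's spaCy code: it lowercases before matching (so "You" is counted, whereas an exact-text match on 'you' would miss it) and divides by sentence length so that long sentences are not favoured. The function name `emotional_score` and the tokenizing regex are illustrative choices.

```python
import re

# The four sample words from the question; the real list has about 200 entries
EMOTIONAL_WORDS = {"you", "superb", "great", "free"}

def emotional_score(sentence, lexicon=EMOTIONAL_WORDS):
    """Share of tokens in the sentence that appear in the lexicon (0.0 to 1.0)."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)

sentences = [
    "You are a superb great free person",
    "You are a great person",
    "Rocks are made of minerals",
]

# Rank from most to least emotional
ranked = sorted(sentences, key=emotional_score, reverse=True)
for sentence in ranked:
    print(round(emotional_score(sentence), 2), sentence)
```

Because the function works on one string at a time, it can be applied to a whole column of sentences with an ordinary loop or a dataframe `apply`.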
any suggestions on how I can extract emotional words from wordnet?
Take a look at sentiwordnet, which adds a measure of positivity, negativity or neutrality to each wordnet entry. For "emotional" you could extract just those that have either pos or neg score over e.g. 0.5. (Watch out for the non-commercial-only licence.)
The second approach will probably work better if you can feed it enough training data, but "enough" can sometimes be too much. Other downsides are the models often need much more compute power and memory (a serious issue if you need to be offline, or working on a mobile device), and that they are a blackbox.
I think the 2020 approach would be to start with a pre-trained BERT model (the bigger the better, see the recent GPT-3 paper), and then fine-tune it with a sample of your 100K sentences that you've manually annotated. Evaluate it on another sample, and annotate more training data for the ones it got wrong. Keep doing this until you get the desired level of accuracy.
(Spacy has support for both approaches, by the way. What I called fine-tuning above is also called transfer learning. See https://spacy.io/usage/training#transfer-learning Also googling for "spacy sentiment analysis" will find quite a few tutorials.)

Extracting Key-Phrases from text based on the Topic with Python

I have a large dataset with 3 columns, columns are text, phrase and topic.
I want to find a way to extract key-phrases (phrases column) based on the topic.
Key-Phrase can be part of the text value or the whole text value.
import pandas as pd

text = ["great game with a lot of amazing goals from both teams",
        "goalkeepers from both teams made misteke",
        "he won all four grand slam championchips",
        "the best player from three-point line",
        "Novak Djokovic is the best player of all time",
        "amazing slam dunks from the best players",
        "he deserved yellow-card for this foul",
        "free throw points"]
phrase = ["goals", "goalkeepers", "grand slam championchips", "three-point line", "Novak Djokovic", "slam dunks", "yellow-card", "free throw points"]
topic = ["football", "football", "tennis", "basketball", "tennis", "basketball", "football", "basketball"]
df = pd.DataFrame({"text": text,
                   "phrase": phrase,
                   "topic": topic})
print(df.text)
print(df.phrase)
I'm having a lot of trouble finding a way to do this, because I have more than 50,000 rows in my dataset, around 48,000 unique phrase values, and 3 different topics.
I guess that building a dataset with all football, basketball and tennis topics is not really the best solution. So I was thinking about making some kind of ML model for this, but that means I would have 2 features (text and topic) and one result (phrase), with more than 48,000 different classes in the result, and that is not a good approach.
I was thinking about using the text column as a feature and applying a classification model in order to find the sentiment. After that I could use the predicted sentiment to extract key features, but I do not know how to extract them.
One more problem is that I get only 66% accuracy when I try to classify sentiment using CountVectorizer or TfidfTransformer with Random Forest, Decision Tree, or any other classification algorithm, and also 66% accuracy if I'm using TextBlob for sentiment analysis.
Any help?

It looks like a good approach here would be to use a Latent Dirichlet Allocation (LDA) model, which is an example of what are known as topic models.
LDA is an unsupervised model that finds similar groups among a set of observations, which you can then use to assign a topic to each of them. Here I'll go through one possible approach: training a model on the sentences in the text column. If the phrases are representative enough and contain the information the model needs to capture, they could also be a good (possibly better) candidate for training the model, though that is something you are better placed to judge.
Before you train the model, you need to apply some preprocessing steps, including tokenizing the sentences, removing stopwords, and lemmatizing or stemming. For that you can use nltk:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import lda
from sklearn.feature_extraction.text import CountVectorizer

ignore = set(stopwords.words('english'))
stemmer = WordNetLemmatizer()
text = []
for sentence in df.text:
    words = word_tokenize(sentence)
    stemmed = []
    for word in words:
        if word not in ignore:
            stemmed.append(stemmer.lemmatize(word))
    text.append(' '.join(stemmed))
Now we have a more appropriate corpus to train the model:
print(text)
['great game lot amazing goal team',
'goalkeeper team made misteke',
'four grand slam championchips',
'best player three-point line',
'Novak Djokovic best player time',
'amazing slam dunk best player',
'deserved yellow-card foul',
'free throw point']
We can then convert the text to a matrix of token counts through CountVectorizer, which is the input LDA will be expecting:
vec = CountVectorizer(analyzer='word', ngram_range=(1,1))
X = vec.fit_transform(text)
Note that you can use the ngram_range parameter to specify the n-gram range you want to consider when training the model. By setting ngram_range=(1,2), for instance, you'd end up with features containing all individual words as well as the 2-grams in each sentence. Here's an example after fitting CountVectorizer with ngram_range=(1,2):
vec.get_feature_names()
['amazing',
'amazing goal',
'amazing slam',
'best',
'best player',
....
The advantage of using n-grams is that you could then also find Key-Phrases other than just single words.
Then we can train the LDA with whatever number of topics you want. In this case I'll just select 3 topics (note that this has nothing to do with the topic column), which you can consider to be the Key-Phrases, or words in this case, that you mention. Here I'll be using lda, though there are several other options, such as gensim.
Each topic will have associated a set of words from the vocabulary it has been trained with, with each word having a score measuring the relevance of the word in a topic.
model = lda.LDA(n_topics=3, random_state=1)
model.fit(X)
Through topic_word_ we can now obtain the scores associated with each topic. We can use argsort to sort the vector of scores and use it to index the vector of feature names, which we can obtain with vec.get_feature_names:
import numpy as np

topic_word = model.topic_word_
# Newer scikit-learn releases replace get_feature_names with get_feature_names_out
vocab = vec.get_feature_names()
n_top_words = 3
for i, topic_dist in enumerate(topic_word):
    # argsort is ascending, so the reversed slice takes the n_top_words highest scores
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: best player point
Topic 1: amazing team slam
Topic 2: yellow novak card
The printed results don't really mean much in this case, since the model has only been trained on the sample from the question; however, you should see clearer and more meaningful topics when training on your entire corpus.
Also note that for this example I've used the whole vocabulary to train the model. In your case, though, it seems it would make more sense to split the text column into groups according to the topics you already have, and train a separate model on each group. Hopefully this gives you a good idea of how to proceed.

It appears you're looking to group short pieces of text by topic. You will have to tokenize the data in one way or another. There are a variety of encodings that you could consider:
Bag of words, which encodes each entry by counting the frequency of each word in your vocabulary.
TF-IDF: does the same as above, but makes words that appear in more entries less important.
n-grams / bigrams / trigrams, which essentially do the bag-of-words method but also maintain some context around each word. So you'll have encodings for each word, but you'll also have tokens for "great_game", "game_with", "great_game_with", etc.
Orthogonal Sparse Bigrams (OSBs), which also create features from words that are further apart, like "great__with".
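To show what the n-gram and sparse-bigram tokens in the list above look like, here is a small sketch in plain Python. The function names and the `window` parameter are illustrative, and the sparse-bigram version is a simplified reading of OSB: it pairs each word with the words up to `window` positions after it and marks every position of distance with one underscore.

```python
def word_ngrams(tokens, n):
    """Contiguous n-grams joined with underscores, e.g. 'great_game'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sparse_bigrams(tokens, window=2):
    """Word pairs up to `window` positions apart; one underscore per position of distance."""
    features = []
    for i, left in enumerate(tokens):
        for gap in range(1, window + 1):
            j = i + gap
            if j < len(tokens):
                features.append(left + "_" * gap + tokens[j])
    return features

tokens = "great game with a lot of amazing goals".split()

bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
osb = sparse_bigrams(tokens, window=2)

print(bigrams[:3])
print(trigrams[:2])
print(osb[:4])
```

Any of these token lists can be counted per document to build a bag-of-features matrix, in the same way as single words.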
Any of these options could be ideal for your dataset (the last two are likely your best bet). If none of these options work, there are a few more you could try:
First, you could use word embeddings. These are vector representations of each word that, unlike one-hot encoding, intrinsically contain word meaning. You can sum the vectors of the words in a sentence to get a new vector containing the general idea of what the sentence is about, which can then be decoded.
You can also use word embeddings alongside a bidirectional LSTM. This is the most computationally intensive option, but if your other options are not working it might be a good choice. BiLSTMs try to interpret sentences by looking at the context around each word to understand what the word might mean in that context.
Hope this helps

I think what you're looking for is called "topic modeling" in NLP.
You should try using LDA for topic modeling. It's one of the easiest methods to apply.
Also, as @Mike mentioned, there are many approaches to converting words to vectors. You should first try simple approaches like a count vectorizer and then gradually move to something like word2vec or GloVe.
I am attaching some links for applying LDA to a corpus.
1. https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
2. https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/

Gensim doc2vec training on ngrams

I have several thousand documents that I'd like to use in a gensim doc2vec model, but I only have 5grams for each of the documents, not the full texts in their original word order. In the doc2vec tutorial on the gensim website (https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html), a corpus is created with full texts and then the model is trained on that corpus. It looks something like this:
[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern',...], tags=[1]), TaggedDocument(words=[.....], tags=[2]),...]
Is it possible to create a training corpus where each document consists of a list of 5grams rather than a list of words in their original order?

If you have "all" the 5-grams from the documents – perhaps even still in the order they appeared – it should be possible to stitch-together the original documents (or nearly-equivalent pseudo-documents), as if the 5-grams were puzzle-pieces or dominoes.
(For example, find the first 5-gram, either by its ordinal position in your data or by finding a 5-gram whose 4 prefix tokens aren't any other 5-gram's 4 suffix tokens. Then find its successor by matching its 4 suffix tokens to the 4 prefix tokens of another candidate 5-gram. If at any point you have more than one candidate 'start' or 'continuation', you could try any one and keep going until you either finish or reach a dead end (a depth-first search for consistent chains), and at a dead end, back up and try another. You could probably also just pick another good start 5-gram and continue, at the risk of slightly misordering the document and repeating a few tokens. A bunch of such errors won't have much effect on the final results in a large corpus.)
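A rough sketch of the stitching idea, in plain Python: it follows the greedy variant only (take the first unused continuation, no backtracking), so it assumes the overlaps are mostly unambiguous. The function name `stitch_ngrams` is made up for illustration, and the sample words are the ones shown in the question.

```python
def stitch_ngrams(ngrams):
    """Rebuild a token sequence from overlapping n-grams (greedy, no backtracking)."""
    ngrams = [tuple(g) for g in ngrams]
    if not ngrams:
        return []
    # Index the n-grams by their prefix (all tokens but the last)
    by_prefix = {}
    for gram in ngrams:
        by_prefix.setdefault(gram[:-1], []).append(gram)
    # Start with an n-gram whose prefix is nobody's suffix; fall back to the first one
    suffixes = {gram[1:] for gram in ngrams}
    start = next((g for g in ngrams if g[:-1] not in suffixes), ngrams[0])
    document = list(start)
    current, used = start, {start}
    while True:
        # A successor's prefix must equal the current n-gram's suffix
        candidates = [g for g in by_prefix.get(current[1:], []) if g not in used]
        if not candidates:
            break
        current = candidates[0]
        used.add(current)
        document.append(current[-1])
    return document

original = "hundreds of people have been forced to vacate their homes".split()
five_grams = [tuple(original[i:i + 5]) for i in range(len(original) - 4)]
five_grams.reverse()  # pretend the original order of the 5-grams was lost

rebuilt = stitch_ngrams(five_grams)
print(" ".join(rebuilt))
```

When the chain stops before all 5-grams are used, the approach described above would restart from another unused start 5-gram and append the result.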
Alternatively, the 'PV-DBOW' mode (dm=0) doesn't use context-windows or neighboring words – so getting the exact original word order doesn't matter, just stand-in documents with the right words in any order. So just concatenating all the 5-grams creates a reasonable pseudo-document – especially if you then discard 4/5 of any word (to account for the fact that any one word in the original doc, except at the very beginning or end, appears in 5 5-grams).
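The concatenate-and-thin idea for PV-DBOW might be sketched like this, using only the standard library. The function name `pseudo_document` is hypothetical, and keeping `round(count / 5)` copies of each word (at least one) is just one way of "discarding 4/5"; the word order of the result is arbitrary, which is the point of using a mode that ignores order.

```python
from collections import Counter

def pseudo_document(ngrams, n=5):
    """Bag of words recovered from n-grams: keep roughly 1/n of each word's occurrences."""
    counts = Counter(token for gram in ngrams for token in gram)
    words = []
    for token, count in counts.items():
        # Words at the very start or end occur in fewer than n n-grams, so keep at least one copy
        words.extend([token] * max(1, round(count / n)))
    return words

original = "hundreds of people have been forced to vacate their homes".split()
five_grams = [tuple(original[i:i + 5]) for i in range(len(original) - 4)]

words = pseudo_document(five_grams)
print(words)
```

Each resulting word list could then be wrapped in a TaggedDocument, as in the tutorial corpus shown in the question, and used to train a model with dm=0.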

fuzzy matching lots of strings

I've got a database with property owners; I would like to count the number of properties owned by each person, but am running into standard mismatch problems:
REDEVELOPMENT AUTHORITY vs. REDEVELOPMENT AUTHORITY O vs. PHILADELPHIA REDEVELOPMEN vs. PHILA. REDEVELOPMENT AUTH
COMMONWEALTH OF PENNA vs. COMMONWEALTH OF PENNSYLVA vs. COMMONWEALTH OF PA
TRS UNIV OF PENN vs. TRUSTEES OF THE UNIVERSIT
From what I've seen, this is a pretty common problem, but my problem differs from those with solutions I've seen for two reasons:
1) I've got a large number of strings (~570,000), so computing the 570000 x 570000 matrix of edit distances (or other pairwise match metrics) seems like a daunting use of resources
2) I'm not focused on one-off comparisons--e.g., as is most common for what I've seen from big data fuzzy matching questions, matching user input to a database on file. I have one fixed data set that I want to condense once and for all.
Are there any well-established routines for such an exercise? I'm most familiar with Python and R, so an approach in either of those would be ideal, but since I only need to do this once, I'm open to branching out to other, less familiar languages (perhaps something in SQL?) for this particular task.

That is exactly what I face daily at my new job (but the line counts are a few million). My approach is to:
1) find the set of unique strings by using p = unique(a)
2) remove punctuation, split the strings in p on whitespace, make a table of word frequencies, create a set of rules and use gsub to "recover" abbreviations, mistyped words, etc. E.g. in your case "AUTH" should be recovered back to "AUTHORITY", "UNIV" -> "UNIVERSITY" (or vice versa)
3) fix typos if I spot them by eye
4) advanced: reorder the words in the strings (the English is often improper) to see whether two or more strings are identical apart from word order (e.g. "10pack 10oz" and "10oz 10pack").
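Since the question mentions Python as well as R, here is a rough Python counterpart of steps 1, 2 and 4 using only the standard library. The expansion rules in `EXPANSIONS` are hypothetical examples; in practice you would write them after looking at the word-frequency table. The full-length "PHILADELPHIA REDEVELOPMENT AUTHORITY" string is also made up for the demonstration.

```python
import re
from collections import Counter

# Hypothetical rules for "recovering" abbreviations (step 2)
EXPANSIONS = {
    "AUTH": "AUTHORITY",
    "UNIV": "UNIVERSITY",
    "PHILA": "PHILADELPHIA",
    "PENNA": "PENNSYLVANIA",
    "PA": "PENNSYLVANIA",
}

def tokenize(name):
    """Upper-case, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", name.upper()).split()

def canonical_key(name, expansions=EXPANSIONS):
    """Expand abbreviations, then sort the words so that word order is ignored (step 4)."""
    return " ".join(sorted(expansions.get(token, token) for token in tokenize(name)))

owners = [
    "PHILA. REDEVELOPMENT AUTH",
    "PHILADELPHIA REDEVELOPMENT AUTHORITY",  # hypothetical full form
    "COMMONWEALTH OF PENNA",
    "COMMONWEALTH OF PA",
    "10pack 10oz",
    "10oz 10pack",
    "COMMONWEALTH OF PA",                    # exact duplicate, removed in step 1
]

unique_owners = sorted(set(owners))                                   # step 1
word_freq = Counter(t for name in unique_owners for t in tokenize(name))  # step 2: inspect this table

groups = {}
for name in unique_owners:
    groups.setdefault(canonical_key(name), []).append(name)

for key, members in groups.items():
    print(key, "<-", members)
```

This needs one pass over the unique strings instead of a pairwise comparison, which matters with ~570,000 names.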

You can also use agrep() in R for fuzzy name matching, by giving a percentage of allowed mismatches. If you pass it a fixed dataset, then you can grep for matches out of your database.
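For readers working in Python rather than R, a loosely comparable tool is `difflib.get_close_matches` from the standard library. It is not the same algorithm as agrep (it ranks candidates by a similarity ratio instead of limiting the edit distance), so treat the sketch below as an analogue, with the `cutoff` value chosen arbitrarily.

```python
from difflib import get_close_matches

owners = [
    "REDEVELOPMENT AUTHORITY",
    "REDEVELOPMENT AUTHORITY O",
    "COMMONWEALTH OF PENNA",
    "TRS UNIV OF PENN",
]

# Return up to 5 owners whose similarity to the query is at least 0.9, best first
matches = get_close_matches("REDEVELOPMENT AUTHORITY", owners, n=5, cutoff=0.9)
print(matches)
```

Like agrep, this compares one query against a list at a time, so on the full dataset it would be applied to the unique, already-normalised names rather than to every raw row.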
