I've got a list of company names that I am trying to parse from a large number of PDF documents.
I've forced the PDFs through Apache Tika to extract the raw text, and I've got the list of 200 companies read in.
I'm stuck trying to use some combination of FuzzyWuzzy and Spacy to extract the required matches.
This is as far as I've gotten:
import spacy
from fuzzywuzzy import fuzz, process
nlp = spacy.load("en_core_web_sm")
doc = nlp(strings[1])
companies = []
candidates = []
for ent in doc.ents:
    if ent.label_ == "ORG":
        candidates.append(ent.text)

process.extractBests(company_name, candidates, score_cutoff=80)
What I'm trying to do is:
Read through the document string
Parse for any fuzzy company name matches scoring, say, 80+
Return company names that are contained in the document and their scores.
Help!
This is the way I populated candidates -- mpg is a Pandas DataFrame:
for s in mpg['name'].values:
    doc = nlp(s)
    for ent in doc.ents:
        if ent.label_ == 'ORG':
            candidates.append(ent.text)
Then let's say we have a short list of car data just to test with:
candidates = ['buick',
              'buick skylark',
              'buick estate wagon',
              'buick century']
The method below uses fuzz.token_sort_ratio, which returns a measure of the sequences' similarity between 0 and 100 but sorts the tokens before comparing. Try out some of the other scorers partially documented here: https://github.com/seatgeek/fuzzywuzzy/issues/137
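As a quick aside (my own illustration, not part of the original answer), the token sorting is what makes word order irrelevant:

from fuzzywuzzy import fuzz

print(fuzz.ratio("buick skylark", "skylark buick"))             # word order matters, lower score
print(fuzz.token_sort_ratio("buick skylark", "skylark buick"))  # tokens sorted first, so 100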
results = {} # dictionary to store results
companies = ['buick'] # you'll have more companies
for company in companies:
    results[company] = process.extractBests(company, candidates,
                                            scorer=fuzz.token_sort_ratio,
                                            score_cutoff=50)
And the results are:
In [53]: results
Out[53]:
{'buick': [('buick', 100),
           ('buick skylark', 56),
           ('buick century', 56)]}
In this case using 80 as a cutoff score would work better than 50.
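Putting the pieces together for the original problem, a rough sketch (my own, assuming strings holds the Tika-extracted document texts and companies holds the 200 names read in, as in the question) could look like this:

import spacy
from fuzzywuzzy import fuzz, process

nlp = spacy.load("en_core_web_sm")

def match_companies(doc_text, companies, cutoff=80):
    # Collect ORG entities from the document as fuzzy-match candidates
    candidates = [ent.text for ent in nlp(doc_text).ents if ent.label_ == "ORG"]
    results = {}
    for company in companies:
        matches = process.extractBests(company, candidates,
                                       scorer=fuzz.token_sort_ratio,
                                       score_cutoff=cutoff)
        if matches:
            results[company] = matches
    return results

# e.g. one result dict per document:
# results_per_doc = [match_companies(s, companies) for s in strings]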
I am using Python and spaCy as my NLP library, working on a big dataframe that contains feedback about different cars, which looks like this:
'feedback' column contains natural language text to be processed,
'lemmatized' column contains lemmatized version of the feedback text,
'entities' column contains named entities extracted from the feedback text (I've trained the pipeline so that it will recognise car models and brands, labelling these as 'CAR_BRAND' and 'CAR_MODEL')
I then created the following function, which applies the spaCy nlp pipeline to each row of my dataframe and extracts any [noun + verb], [verb + noun], [adj + noun], [adj + proper noun] combinations.
def bi_gram(x):
    doc = nlp_token(x)
    result = []
    text = ''
    for i in range(len(doc)):
        j = i + 1
        if j < len(doc):
            if (doc[i].pos_ == "NOUN" and doc[j].pos_ == "VERB") or (doc[i].pos_ == "VERB" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "PROPN"):
                text = doc[i].text + " " + doc[j].text
                result.append(text)
                i = i + 1
    return result
Then I applied this function to 'lemmatized' column:
df['bi_gram'] = df['lemmatized'].apply(bi_gram)
This is where I have a problem...
This is producing only one bigram per row maximum. How can I tweak the code so that more than one bigram can be extracted and put in a column? (Also are there more linguistic combinations I should try?)
Is there a possibility to find out what people are saying about 'CAR_BRAND' and 'CAR_MODEL' named entities extracted in the 'entities' column? For example 'Cool Porsche' - Some brands or models are made of more than two words so it's tricky to tackle.
I am very new to NLP. If there is a more efficient way to tackle this, any advice will be super helpful!
Many thanks for your help in advance.
spaCy has a built-in pattern matching engine that's perfect for your application – it's documented here and in a more extensive usage guide. It allows you to define patterns in a readable and easy-to-maintain way, as lists of dictionaries that define the properties of the tokens to be matched.
Set up the pattern matcher
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm") # or whatever model you choose
matcher = Matcher(nlp.vocab)
# your patterns
patterns = {
    "noun_verb": [{"POS": "NOUN"}, {"POS": "VERB"}],
    "verb_noun": [{"POS": "VERB"}, {"POS": "NOUN"}],
    "adj_noun": [{"POS": "ADJ"}, {"POS": "NOUN"}],
    "adj_propn": [{"POS": "ADJ"}, {"POS": "PROPN"}],
}

# add the patterns to the matcher
for pattern_name, pattern in patterns.items():
    matcher.add(pattern_name, [pattern])
Extract matches
doc = nlp("The dog chased cats. Fast cats usually escape dogs.")
matches = matcher(doc)
matches is a list of tuples containing
a match id,
the start index of the matched bit and
the end index (exclusive).
This is a test output adapted from the spaCy usage guide:
for match_id, start, end in matches:
    # Get string representation
    string_id = nlp.vocab.strings[match_id]
    # The matched span
    span = doc[start:end]
    print(repr(span.text))
    print(match_id, string_id, start, end)
    print()
Result
'dog chased'
1211260348777212867 noun_verb 1 3
'chased cats'
8748318984383740835 verb_noun 2 4
'Fast cats'
2526562708749592420 adj_noun 5 7
'escape dogs'
8748318984383740835 verb_noun 8 10
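To get the matched bigrams back into a dataframe column (as asked in the question), one possible approach (a sketch, reusing the nlp and matcher objects set up above and assuming the question's df['lemmatized'] column) is to wrap the matcher in a small helper and apply it per row:

def extract_bigrams(text):
    # Collect the text of every span the matcher finds in this row
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]

df['bi_gram'] = df['lemmatized'].apply(extract_bigrams)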
Some ideas for improvement
Named entity recognition should be able to detect multi-word expressions, so brand and/or model names that consist of more than one token shouldn't be an issue if everything is set up correctly
Matching dependency patterns instead of linear patterns might slightly improve your results (a rough sketch follows at the end of this answer)
That being said, what you're trying to do (a kind of sentiment analysis) is quite a difficult task that's normally tackled with machine learning approaches and heaps of training data. So don't expect too much from simple heuristics.
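For the dependency-pattern idea, here is a rough sketch of my own (not from the original answer) using spaCy v3's DependencyMatcher; the single pattern below only covers a verb and its object and would need to be extended to mirror the POS combinations above:

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# a verb together with its (direct) object, wherever the two tokens sit in the sentence
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": {"IN": ["dobj", "obj"]}}},
]
matcher.add("verb_object", [pattern])

doc = nlp("The dog chased cats. Fast cats usually escape dogs.")
for match_id, token_ids in matcher(doc):
    # token_ids follow the order of the pattern entries
    print(doc[token_ids[0]].text, doc[token_ids[1]].text)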
I have the following dataset:
ID Text
12 Coolest fan we’ve ever seen.
12 SHARE this with anyone you know who can use this tip!
31 Time for a Royal Celebration! Save the date.
54 The way to a sports fan’s heart? Behind-the-scenes content from their favourite teams.
419 Start asking your questions now for tomorrow’s LIVE Q&A on careers you can do without going to university.
451 Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned
I have used ngrams to analyse the text and word/sentence frequencies.
from nltk import ngrams
text=df.Text.tolist()
list_n=[]
for i in text:
    n_grams = ngrams(i.split(), 3)
    for grams in n_grams:
        list_n.append(grams)
list_n
Since I am interested in finding in which text a particular word/words sequence was used, I would need to create an association between text (i.e. ID) and text with particular ngrams.
For example: I am interested in finding texts which contains "Save the date", i.e. ID=31 and ID=451.
To find the n-grams for one single word, I have been using this:
def ngram_filter(col, word, n):
    tokens = col.split()
    all_ngrams = ngrams(tokens, n)
    filtered_ngrams = [x for x in all_ngrams if word in x]
    return filtered_ngrams
However, I do not know how to find the ID associated to the text and how to select more words in the function above.
How could I do this? Any idea?
Please feel free to change tags, if you require. Thanks
I don't have much experience with ngrams, but you could get what you want with str.contains like:
import re
txt = 'save the date'
print(f'The ngrams "{txt}" is in IDs: ',
      df.loc[df['Text'].str.contains(txt, flags=re.IGNORECASE), 'ID'].tolist())
The ngrams "save the date" is in IDs: [31, 451]
Another option could be to do this, but the performance may not be good:
txt = 'save the date'
df_ = (df.assign(ngrams=df.Text.str.replace(r'[^\w\s]+', '')  # remove punctuation
                          .str.lower()                        # lower case
                          .str.split()                        # split over whitespace
                          .apply(lambda x: list(ngrams(x, 3))))
         .explode('ngrams'))  # create a row per ngram
print(df_.loc[df_['ngrams'].isin(ngrams(txt.lower().split(), 3))])
ID Text ngrams
2 31 Time for a Royal Celebration! Save the date. (save, the, date)
5 451 Save the date, we’re hosting a fabulous & fun ... (save, the, date)
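Building on the exploded frame above, one way (my own sketch, not part of the original answer) to get a lookup from each ngram to the IDs it occurs in is a simple groupby:

ngram_to_ids = (df_.dropna(subset=['ngrams'])
                   .groupby('ngrams')['ID']
                   .apply(list)
                   .to_dict())
print(ngram_to_ids[('save', 'the', 'date')])  # [31, 451]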
I have a pandas dataframe called df. It has a column called article. The article column contains 600 strings, each of the strings represent a news article.
I want to only KEEP those articles whose first four sentences contain keywords "COVID-19" AND ("China" OR "Chinese"). But I'm unable to find a way to conduct this on my own.
(in the string, sentences are separated by \n. An example article looks like this:)
\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
First we define a function to return a boolean based on whether your keywords appear in a given sentence:
def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)
Then we create a boolean series by applying this function (using Series.apply) to the sentences of your df.article column.
Note that we use a lambda function in order to truncate the article text passed on to contains_covid_kwds at the fifth occurrence of '\n', i.e. your first four sentences (more info on how this works here):
series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
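To see what the slicing trick does, here is a quick illustration on a toy string (not from the original answer):

s = "\nA.\nB.\nC.\nD.\nE."
s.replace('\n', '#', 4)                 # '#A.#B.#C.#D.\nE.'  - the first four '\n' are masked
s[:s.replace('\n', '#', 4).find('\n')]  # '\nA.\nB.\nC.\nD.'  - everything before the fifth '\n'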
Then we pass the boolean series to df.loc, in order to localize the rows where the series was evaluated to True:
filtered_df = df.loc[series]
You can use the pandas apply method and do it the way I did.
string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
df = pd.DataFrame({'article':[string]})
def findKeys(string):
    string_list = string.strip().lower().split('\n')
    flag = 0
    keywords = ['china', 'covid-19', 'wuhan']
    # Checking if the article has more than 4 sentences
    if len(string_list) > 4:
        # Iterating over string_list variable, which contains sentences
        for i in range(4):
            # Iterating over keywords list
            for key in keywords:
                # Checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    # Else block is executed when article has less than or equal to 4 sentences
    else:
        # Iterating over string_list variable, which contains sentences
        for i in range(len(string_list)):
            # Iterating over keywords list
            for key in keywords:
                # Checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    if flag == 0:
        return False
    else:
        return True
and then call the pandas apply method on df:
df['Contains Keywords?'] = df['article'].apply(findKeys)
First I create a series which contains just the first four sentences from the original df['article'] column, and convert it to lower case, assuming that searches should be case-independent.
articles = df['article'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[:4])).str.lower()
Then use a simple boolean mask to filter only those rows where the keywords were found in the first four sentences.
df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]
Here is a plain loop over the article strings, using the question's keywords:
found = []
s1 = "COVID-19"
s2 = "China"
s3 = "Chinese"
for string in df['article']:
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)
A homograph is a word that has the same spelling as another word but has a different sound and a different meaning, for example, lead (to go in front of) / lead (a metal).
I was trying to use spaCy word vectors to compare documents with each other by summing the word vectors for each document and then finally finding cosine similarity. If, for example, spaCy has the same vector for the two 'lead's listed above, the results will probably be bad.
In the code below, why does the similarity between the two 'bank' tokens come out as 1.00?
import spacy
nlp = spacy.load('en')
str1 = 'The guy went inside the bank to take out some money'
str2 = 'The house by the river bank.'
str1_tokenized = nlp(str1.decode('utf8'))
str2_tokenized = nlp(str2.decode('utf8'))
token1 = str1_tokenized[-6]
token2 = str2_tokenized[-2]
print 'token1 = {} token2 = {}'.format(token1,token2)
print token1.similarity(token2)
The output for given program is
token1 = bank token2 = bank
1.0
As kntgu already pointed out, spaCy distinguishes tokens by their characters, not by their semantic meaning. The sense2vec approach by the developers of spaCy concatenates tokens with their POS-tag and can help in the case of 'lead_VERB' vs. 'lead_NOUN'. However, it will not help in your example of 'bank (river bank)' vs. 'bank (financial institute)', as both are nouns.
SpaCy does not support any solution to this out of the box, but you can have a look at contextualized word representations like ELMo or BERT. Both generate word vectors for a given sentence, taking the context into account. Therefore, I assume the vectors for both 'bank' tokens will be substantially different.
Both are relatively recent approaches and are not as comfortable to use, but might help in your use case. For ELMo, there is a command line tool which lets you generate word embeddings for a set of sentences without having to write any code: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md#writing-contextual-representations-to-disk
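As a rough sketch of the contextual-embedding idea (my own example, using the Hugging Face transformers library rather than the ELMo tool linked above; assumes transformers and torch are installed):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual hidden state of the "bank" token in this sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("bank")
    return outputs.last_hidden_state[0, idx]

v1 = bank_vector("The guy went inside the bank to take out some money")
v2 = bank_vector("The house by the river bank.")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())  # should come out clearly below 1.0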
I have a large excel file like the following:
Timestamp      Text                                                        Work        Id
5/4/16 17:52   rain a lot the packs maybe damage.                          Delivery    XYZ
5/4/16 18:29   wh. screen                                                  Other       ABC
5/4/16 14:54   15107 Lane Pflugerville, TX customer called me and his      Delivery    YYY
               phone number and my phone numbers were not masked. thank
               you customer has had a stroke and items were missing from
               his delivery the cleaning supplies for his wet vacuum
               steam cleaner. he needs a call back from customer support
5/6/16 13:05   How will I know if I                                        Signing up  ASX
5/4/16 23:07   an quality                                                  Delivery    DFC
I want to work only on the "Text" column and then eliminate those rows that basically just have gibberish in the "Text" column (rows 2, 4, 5 from the above example).
I'm reading only the 2nd column as follow:
import xlrd
book = xlrd.open_workbook("excel.xlsx")
sheet = book.sheet_by_index(0)
for row_index in xrange(1, sheet.nrows):  # skip heading row
    timestamp, text = sheet.row_values(row_index, end_colx=2)
    print(text)
How do I remove the gibberish rows? I have an idea that I need to work with nltk and have a positive corpus (one that does not have any gibberish), one negative corpus (only having gibberish text), and train my model with it. But how do I go about implementing it? Please help!!
You can use nltk to do the following.
import nltk
english_words = set(w.lower() for w in nltk.corpus.words.words())
'a' in english_words
True
'dog' in english_words
True
'asdasdase' in english_words
False
How to get individual words in nltk from string:
individual_words_front_string = nltk.word_tokenize('This is my text from text column')
individual_words_front_string
['This', 'is', 'my', 'text', 'from', 'text', 'column']
For each row's text column, test the individual words to see if they are in the English dictionary. If they all are, you know that row's text column is not gibberish.
If your definition of gibberish vs non-gibberish is different than english words found in nltk, you can use the same process above, just with a different list of acceptable words.
How to accept numbers and street addresses?
Simple way to determine if something is a number.
word = '32423432'
word.isdigit()
True
word = '32423432ds'
word.isdigit()
False
Addresses are more difficult. You can find info on that here: Parsing Addresses, and probably many other places. Of course you can always use the above logic if you have access to a list of cities, states, roads, etc.
Will it fail if any one word is False?
It's your code you decide. Perhaps you can mark something as gibberish if x% of words in the text are false?
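For example, a quick sketch of that percentage idea (my own, reusing english_words from above and assuming the nltk tokenizer data is downloaded; the 0.5 threshold is an arbitrary choice you would tune):

def is_gibberish(text, threshold=0.5):
    # Flag the text if more than `threshold` of its tokens are neither
    # known English words nor plain numbers
    words = nltk.word_tokenize(text.lower())
    if not words:
        return True
    unknown = [w for w in words if w not in english_words and not w.isdigit()]
    return len(unknown) / len(words) > threshold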
How to determine if grammar is correct?
This is a bigger topic, and a more in-depth explanation can be found at the following link:
Checking Grammar. But the above answer will just check if words are in the nltk corpus, not whether or not the sentence is grammatically correct.
Separating good text from 'gibber' is not a trivial task, especially if you are dealing with text messages / chats (that's what it looks like to me).
A misspelled word does not make a sample unusable and even a syntactically wrong sentence should not disqualify the whole text. That's a standard you could use for newspaper texts, but not for raw, user generated content.
I would annotate a corpus in which you separate the good samples from the bad ones and train a simple classifier on it. Annotation does not have to be a big effort, since these gibberish texts are shorter than the good ones and should be easy to recognise (at least some of them). Also, you could try to start with a corpus size of ~100 datapoints (50 good / 50 bad) and expand it when the first model is more or less working.
This is a sample code that I always use for text classification. You need to install scikit-learn and numpy though:
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Prepare data
def prepare_data(data):
    """
    data is expected to be a list of tuples of category and texts.
    Returns a tuple of a list of labels and a list of texts.
    """
    random.shuffle(data)
    return zip(*data)
# Format training data
training_data = [
    ("good", "rain a lot the packs maybe damage."),
    ("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support "),
    ("gibber", "wh. screen"),
    ("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)
# Format test set
test_data = [
    ("gibber", "an quality"),
    ("good", "<datapoint with valid text>"),
    # ...
]
test_labels, test_texts = prepare_data(test_data)
# Create feature vectors
"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels
# Train the classifier
clf = LogisticRegression()
clf.fit(X, y)
# Test performance
X_test = vectorizer.transform(test_texts)
y_test = test_labels
# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)
# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))
# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))
# predict labels for unknown texts
data = ["text1", "text2",]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle) always serialize
# classifier & vectorizer
X = vectorizer.transform(data)
# Now predict the labels for the texts in 'data'
labels = clf.predict(X)
# And put them back together
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]
A few words about how it works: the count vectorizer tokenizes the text and creates vectors containing the counts for all words in the corpus. Based upon these vectors, the classifier tries to recognise patterns to distinguish between the two categories. A text with only a few and uncommon (because misspelled) words would rather end up in the 'gibber' category, while a text with a lot of words that are typical for common sentences (think of all the stop words here: 'I', 'you', 'is'...) is more likely to be a good text.
If this method works for you, you should also try other classifiers and use the first model to semi-automatically annotate a larger training corpus.
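For instance, one possible variation (my own sketch, not part of the answer above) swaps in a TF-IDF vectorizer and a Naive Bayes classifier via a scikit-learn pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Reuses training_texts / training_labels from the snippet above
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(training_texts, training_labels)
print(pipeline.predict(["wh. screen", "rain a lot the packs maybe damage."]))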