dictionary based ngram - python

I am trying to count unigram, bigram, and trigram strings where some of the longer n-grams contain the shorter ones. Is there any way to count each n-gram individually, without counting the shorter ones when they appear as part of the longer ones?
text = "the log user should able to identify log entries and domain log entries"
ngramList = ['log', 'log entries', 'domain log entries']

import re

counts = {}
for ngram in ngramList:
    words = ngram.split()
    pattern = re.compile(r'\s+'.join(words), re.IGNORECASE)
    counts[ngram] = len(pattern.findall(text))
print(counts)
current program output: {'log': 3, 'log entries': 2, 'domain log entries': 1}
expected output: {'log': 1, 'log entries': 1, 'domain log entries': 1}

You can first sort the n-gram list by length and then use re.subn to replace each n-gram (from longest to shortest) with an empty string, counting the number of substitutions as you go.
Because you process longer n-grams first and remove each match from the string, the shorter n-grams never get counted 'as part of the larger ones'.
import re

s = "the log user should able to identify log entries and domain log entries"
ngramList = ['log', 'log entries', 'domain log entries']
ngramList.sort(key=len, reverse=True)
counts = {}
for ngram in ngramList:
    words = ngram.split()
    pattern = re.compile(r'\s+'.join(words), re.IGNORECASE)
    s, n = re.subn(pattern, '', s)
    counts[ngram] = n
print(counts)
As Wiktor indicates in the comments, you may want to improve your regex pattern, though. As it stands, the pattern for 'log' would also match 'log' inside the phrase 'key logging'. To be safe, wrap the tokens in word boundaries:
pattern = re.compile(r"\b(?:{})\b".format(r"\s+".join(ngram.split())), re.IGNORECASE)
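Putting the two fixes together, here is a sketch of the full boundary-safe version (same sort-then-substitute idea, using the sample sentence from the question):

```python
import re

s = "the log user should able to identify log entries and domain log entries"
ngramList = ['log', 'log entries', 'domain log entries']

# Count longest n-grams first, removing each match from the string so that
# shorter n-grams are never counted inside longer ones.
counts = {}
for ngram in sorted(ngramList, key=len, reverse=True):
    pattern = re.compile(r"\b(?:{})\b".format(r"\s+".join(ngram.split())),
                         re.IGNORECASE)
    s, counts[ngram] = pattern.subn('', s)
print(counts)  # {'domain log entries': 1, 'log entries': 1, 'log': 1}
```
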

Custom fuzzy search

Good afternoon everyone. There is a parser that receives text as input; before writing to the database, it needs to determine the category of the text by keywords. At the moment the algorithm is as follows:
I lowercase everything
I remove unwanted characters with a regular expression ([^a-zа-я ])
I delete all words that consist of fewer than 4 characters (category names are 4+ characters)
I break the line into words
I remove duplicate words
In a loop, I go through all the categories and use the Levenshtein algorithm (fuzzywuzzy library) to determine the most suitable one.
There will be about 100 categories, and each category will have about 100 keywords.
What tools or algorithms can be used to solve the problem more effectively?
import re
from fuzzywuzzy import process

def detect_category(post):
    tech_categories = ["macbook", "ноутбук", "процессор", "экран", "оперативная память", "видеокарта", "аккумулятор"]
    result = re.sub(r'[^a-zа-я ]+', '', post)
    post_without_small_words = ' '.join(w for w in result.split() if len(w) > 3)
    words = post_without_small_words.split(' ')
    unique_words = sorted(set(words), key=words.index)
    good = []
    for tech_category in tech_categories:
        ratios = process.extract(tech_category, unique_words)
        for ratio in ratios:
            if ratio[1] > 90:
                good.append(ratio)
    return good
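As one lighter-weight direction, the standard library's difflib offers a similar fuzzy ratio without an extra dependency. A minimal sketch (the cutoff value and the sample keywords in the usage line are illustrative, not from the original post; difflib's 0..1 ratio is not the same scale as fuzzywuzzy's 0..100 score, so the threshold needs retuning):

```python
import re
from difflib import get_close_matches

def detect_category(post, tech_categories, cutoff=0.9):
    # Normalise: lowercase, keep only Latin/Cyrillic letters and spaces.
    words = re.sub(r'[^a-zа-я ]+', '', post.lower()).split()
    unique_words = {w for w in words if len(w) > 3}
    hits = []
    for category in tech_categories:
        # difflib's similarity ratio plays the same role as
        # fuzzywuzzy's score; matches above `cutoff` are kept.
        matches = get_close_matches(category, unique_words, cutoff=cutoff)
        if matches:
            hits.append((category, matches))
    return hits

print(detect_category("Новый macbok с хорошим экраном", ["macbook", "экран"]))
```

Note that difflib's ratio is stricter than Levenshtein-based scoring for inflected forms (here 'экраном' falls below 0.9 against 'экран'), so the cutoff is something to tune against real data.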

How to determine the number of negation words per sentence

I would like to know how to count how many negation words (no, not) and contractions (n't) there are in each sentence and in the whole text.
To count the number of sentences I am applying the following:
df["sent"] = df['text'].str.count(r'[\w][.!?]')
However, this gives me the count of sentences in a text. I need the number of negation words per sentence and within the whole text.
Can you please give me some tips?
The expected output for text column is shown below
text                                   sent  count_n_s  count_tot
I haven't tried it yet                    1       1          1
I do not like it. What do you think?      2       0.5        1
It's marvellous!!!                        1       0          0
No, I prefer the other one.               2       1          1
count_n_s is given by counting the total number of negation words per sentence and dividing by the number of sentences.
I tried
split_w = re.split(r"\w+", df['text'])
neg_words = ['no', 'not', "n't"]
words = [w for i, w in enumerate(split_w) if i and (split_w[i-1] in neg_words)]
This would get a count of total negations in the text (not for individual sentences):
import re

NEG = r"""(?:^(?:no|not)$)|n't"""
NEG_RE = re.compile(NEG, re.VERBOSE)

def get_count(text):
    count = 0
    for word in text:
        if NEG_RE.search(word):
            count += 1
    return count

df['text_list'] = df['text'].apply(lambda x: x.split())
df['count'] = df['text_list'].apply(get_count)
To get a count of negations for individual lines, use the code below. Contractions like haven't are handled by keeping the apostrophe when stripping punctuation and checking whether the token ends in n't:
import re

str1 = '''I haven't tried it yet
I do not like it. What do you think?
It's marvellous!!!
No, I prefer the other one.'''

neg_words = ['no', 'not']
for text in str1.split('\n'):
    split_w = re.split(r"\s", text.lower())
    # strip special characters such as the comma in 'No,' but keep apostrophes
    split_w = [re.search(r"^[\w']+", w).group(0) for w in split_w if re.search(r"^[\w']+", w)]
    words = [w for w in split_w if w in neg_words or w.endswith("n't")]
    print(len(words))
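For the per-sentence figure as well, the same idea can be combined with a rough sentence split. A minimal sketch using only re (the sentence-splitting regex mirrors the asker's [\w][.!?] approach and is just as approximate):

```python
import re

# A negation is a standalone no/not, or any token containing n't.
NEG_RE = re.compile(r"\b(?:no|not)\b|n't", re.IGNORECASE)
SENT_RE = re.compile(r'[.!?]+')

def negation_stats(text):
    """Return (count_n_s, count_tot): negations per sentence and in total."""
    count_tot = len(NEG_RE.findall(text))
    n_sents = max(1, len([s for s in SENT_RE.split(text) if s.strip()]))
    return count_tot / n_sents, count_tot

print(negation_stats("I do not like it. What do you think?"))  # (0.5, 1)
```

On a DataFrame this could be applied row by row, e.g. df['text'].apply(lambda t: pd.Series(negation_stats(t))) to get both columns at once.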

How can I look for specific bigrams in text example - python?

I am interested in finding how often (as a percentage) a set of words, stored in n_grams, appears in a sentence.
example_txt = ["order intake is strong for Q4"]

def find_ngrams(text):
    text = re.findall('[A-z]+', text)
    content = [w for w in text if w.lower() in n_grams]  # you can calculate %stopwords using "in"
    return round(float(len(content)) / float(len(text)), 5)

# the goal is for the above procedure to work on a pandas DataFrame, but for now let's use 'text' as an example
# full_MD['n_grams'] = [find_ngrams(x) for x in list(full_MD.loc[:, 'text_no_stopwords'])]
Below are two examples. The first works; the second doesn't.
n_grams= ['order']
res = [find_ngrams(x) for x in list(example_txt)]
print(res)
Output:
[0.16667]
n_grams= ['order intake']
res = [find_ngrams(x) for x in list(example_txt)]
print(res)
Output:
[0.0]
How can I make the find_ngrams() function process bigrams, so the last example from above works?
Edit: Any other ideas?
You can use spaCy's Matcher:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "orderintake" with no callback and one pattern
pattern = [{"LOWER": "order"}, {"LOWER": "intake"}]
matcher.add("orderintake", None, pattern)  # in spaCy 3.x: matcher.add("orderintake", [pattern])
doc = nlp("order intake is strong for Q4")
matches = matcher(doc)
print(len(matches))  # number of times the bigram appears in the text
Maybe you have already explored this option, but why not use a simple .count combined with len:
(example_txt[0].count(n_grams[0]) * len(n_grams[0])) / len(example_txt[0])
or, if you don't want the spaces to count toward the total:
(example_txt[0].count(n_grams[0]) * len(n_grams[0])) / len(example_txt[0].replace(' ', ''))
Of course, you can use these in a list comprehension; this was just for demonstration purposes.
The line
re.findall('[A-z]+', text)
returns
['order', 'intake', 'is', 'strong', 'for', 'Q'].
For this reason, the string 'order intake' will never be matched in your loop here:
content = [w for w in text if w.lower() in n_grams]
If you want it to match, you'll need to join each bigram into a single string.
Instead, you should probably use this to find bigrams.
For n-grams, have a look at this answer.
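One dependency-free alternative is to tokenise once and slide a window of each n-gram's width over the token list. A sketch that keeps the original ratio semantics (matched words divided by total words); note the tokeniser here uses [A-Za-z0-9]+ rather than the question's [A-z]+, so 'Q4' stays intact:

```python
import re

def find_ngrams(text, n_grams):
    # Tokenise once, lowercased, then slide a window of each n-gram's width.
    tokens = [t.lower() for t in re.findall(r'[A-Za-z0-9]+', text)]
    hits = 0
    for gram in n_grams:
        parts = gram.lower().split()
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                hits += len(parts)  # count the matched words, keeping the original ratio
    return round(hits / len(tokens), 5)

print(find_ngrams("order intake is strong for Q4", ['order intake']))  # 0.33333
```

With n_grams = ['order'] this reproduces the original 0.16667, and with ['order intake'] it now counts the bigram instead of returning 0.0.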

Remove close matches / similar phrases from list

I am working on removing similar phrases in a list, but I have hit a small roadblock.
I have sentences and phrases; the phrases are related to the sentences, and all phrases of a sentence are in a single list.
Let the phrase list be : p=[['This is great','is great','place for drinks','for drinks'],['Tonight is a good','good night','is a good','for movies']]
I want my output to be [['This is great','place for drinks'],['Tonight is a good','for movies']]
Basically, I want to get all the longest unique phrases of a list.
I took a look at fuzzywuzzy library, but I am unable to get around to a good solution.
Here is my code:
def remove_dup(arr, threshold=80):
    ret_arr = []
    for item in arr:
        if item[1] < threshold:
            ret_arr.append(item[0])
    return ret_arr

def find_important(sents=sents, phrase=phrase):
    import os, random
    from fuzzywuzzy import process, fuzz
    all_processed = []  # final array to be returned
    for i in range(len(sents)):
        new_arr = []  # reshaped phrases for a single sentence
        for item in phrase[i]:
            new_arr.append(item)
        new_arr.sort(reverse=True, key=lambda x: len(x))  # sort by length, longest first
        important = []  # array to store terms
        important = process.extractBests(new_arr[0], new_arr)  # Levenshtein distance matches
        to_proc = remove_dup(important)  # remove_dup removes all closely matching terms
        to_proc.append(important[0][0])  # the term with the highest match is the important term
        all_processed.append(to_proc)  # add non-duplicates to all_processed
    return all_processed
Can someone point out what I am missing, or what is a better way to do this?
Thanks in advance!
I would use the difference between each phrase and all the other phrases.
If a phrase has at least one different word compared to all the other phrases then it's unique and should be kept.
I've also made it robust to exact duplicates and to stray added spaces (note the leading space in the third example).
sentences = [['This is great', 'is great', 'place for drinks', 'for drinks'],
             ['Tonight is a good', 'good night', 'is a good', 'for movies'],
             ['Axe far his favorite brand for deodorant body spray', ' Axe far his favorite brand for deodorant spray', 'Axe is']]

new_sentences = []
s = " "
for phrases in sentences:
    new_phrases = []
    phrases = [phrase.split() for phrase in phrases]
    for i in range(len(phrases)):
        phrase = phrases[i]
        # keep the phrase only if it has at least one word not in every other phrase
        if all(len(set(phrase).difference(phrases[j])) > 0 or i == j for j in range(len(phrases))):
            new_phrases.append(phrase)
    new_phrases = [s.join(phrase) for phrase in new_phrases]
    new_sentences.append(new_phrases)
print(new_sentences)
Output:
[['This is great', 'place for drinks'],
['Tonight is a good', 'good night', 'for movies'],
['Axe far his favorite brand for deodorant body spray', 'Axe is']]
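If fuzzy scoring turns out to be unnecessary, a plain substring test gives the 'longest unique phrases' directly. A minimal sketch; like the word-set answer above, it keeps 'good night' because that phrase is not contained in any other:

```python
p = [['This is great', 'is great', 'place for drinks', 'for drinks'],
     ['Tonight is a good', 'good night', 'is a good', 'for movies']]

result = []
for phrases in p:
    # Keep only phrases that are not substrings of any other phrase in the list.
    kept = [ph for ph in phrases
            if not any(ph != other and ph in other for other in phrases)]
    result.append(kept)
print(result)
```

Dropping near-matches that are not literal substrings (e.g. 'good night' vs 'Tonight is a good') would still require a fuzzy measure on top of this.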

Python Text processing (str.contains)

I am using str.contains for text analytics in Pandas. For the sentence "My latest Data job was an Analyst", I want to match a combination of the words "Data" and "Analyst", but at the same time I want to specify the number of words allowed between the two (here, 2 words between "Data" and "Analyst"). Currently I am using DataFile.XXX.str.contains('job') & DataFile.XXX.str.contains('Analyst') to get the counts for "job Analyst".
How can I specify the number of words between the 2 words in the str.contains syntax?
Thanks in advance
You can't. At least, not in a simple or standardized way.
Even the basics, like how you define a "word," are a lot more complex than you probably imagine. Both word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.
Let's look at a simple approach. First you need a way to parse a string into words. The following is rough by NLP standards, but will work for simpler cases:
import re

def parse_words(s):
    """
    Simple parser to grab English words from a string.
    CAUTION: A simplistic solution to a hard problem.
    Many possibly-important edge and corner cases
    are not handled. Just one example: hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g.:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
Then you need a way to find all the indices in a list where a target word is found:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
And finally a decision making wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')
    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`
    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]
    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False
    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s)
    actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]
    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
So then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
The only thing left to do is map that back to Pandas:
import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})
df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
This is basically how you'd solve the problem. Keep in mind, it's a rough and simplistic solution. Some simply-posed questions are not simply-answered. NLP questions are often among them.
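That said, if an approximate answer is acceptable, str.contains does accept a regular expression, and a bounded-gap pattern can encode "at most N words between". A sketch (the words and gap size come from the question; \W+ stands in for whatever spaces/punctuation separate the words):

```python
import re

# "job", then at most two intervening words, then "analyst" (case-insensitive).
gap_pattern = re.compile(r"\bjob\b\W+(?:\w+\W+){0,2}analyst\b", re.IGNORECASE)

print(bool(gap_pattern.search("My latest Data job was an Analyst")))  # True
print(bool(gap_pattern.search("The job of every data analyst")))      # False (3 words between)
```

With Pandas this plugs straight in as df.desc.str.contains(r"\bjob\b\W+(?:\w+\W+){0,2}analyst\b", case=False). It trades the word-parsing precision above for brevity: anything \W+ matches counts as a word separator.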
