I have a list of strings called txtFreeForm:
['Add roth Sweep non vested money after 5 years of termination',
'Add roth in-plan to the 401k plan.']
I need to check if only 'Add roth' exists in the sentence. To do that, I used this:
for each_line in txtFreeForm:
    match = re.search('add roth', each_line.lower())
    if match is not None:
        print(each_line)
But this obviously returns both of the strings in my list, as both contain 'add roth'. Is there a way to search exclusively for 'Add roth' in a sentence (and not 'Add roth in-plan'), because I have a bunch of these patterns to search for in strings.
Thanks for your help!
Can you fix this problem by comparing string lengths (len() in Python, since strings have no .Length property)? I'm not an experienced Python programmer, but here is how I think it should work:
for each_line in txtFreeForm:
    match = re.search('add roth', each_line.lower())
    if (match is not None) and (len(each_line) == len("Add Roth")):
        print(each_line)
Basically, if the text is in the string, AND the length of the string is exactly equal to the length of the string "Add Roth", then it must ONLY contain "Add Roth".
I hope this was helpful.
EDIT:
I misunderstood what you were asking. You want to print out sentences that contain "Add Roth", but not sentences that contain "Add Roth in plan". Is this correct?
How about this code?
for each_line in txtFreeForm:
    match_AR = re.search('add roth', each_line.lower())
    match_ARIP = re.search('add roth in-plan', each_line.lower())
    if (match_AR is not None) and (match_ARIP is None):
        print(each_line)
This seems like it should fix the problem. You can exclude any strings (like "in plan") by searching for them too and adding them to the comparison.
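If you have several phrases to exclude, a minimal sketch of that idea (the exclusion list here is just an illustration) could look like:
import re

exclude_patterns = ['add roth in[- ]plan']  # hypothetical list of disqualifying phrases

for each_line in txtFreeForm:
    line = each_line.lower()
    if re.search('add roth', line) and not any(re.search(p, line) for p in exclude_patterns):
        print(each_line)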
You're close :) Give this a shot:
for each_line in txtFreeForm:
    match = re.search(r'add roth (?!in[-]plan)', each_line.lower())
    if match is not None:
        print(each_line[match.end():])
EDIT:
Ahhh I misread... you have a LOT of these. This calls for some more aggressive magic.
import re
from functools import partial, reduce  # in Python 3, reduce lives in functools

txtFreeForm = ['Add roth Sweep non vested money after 5 years of termination',
               'Add roth in-plan to the 401k plan.']

def roths(rows):
    # yield (original row, text remaining after the 'add roth' match)
    for row in rows:
        match = re.search(r'add roth\s*', row.lower())
        if match:
            yield row, row[match.end():]

def filter_pattern(pattern):
    return partial(lazy_filter_out, pattern)

def lazy_filter_out(pattern, rows):
    # drop rows whose remaining text starts with the unwanted pattern
    for row, rest in rows:
        if not re.match(pattern, rest):
            yield row, rest

def magical_transducer(bad_words, nice_rows):
    # compose roths() with one filter per unwanted pattern
    stages = [roths] + [filter_pattern(word) for word in bad_words]
    magical_sentences = reduce(lambda x, y: y(x), stages, nice_rows)
    for row, _ in magical_sentences:
        yield row

def main():
    magic = magical_transducer(['in[-]plan'], txtFreeForm)
    print(list(magic))

if __name__ == '__main__':
    main()
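With the Python 3 fixes above, running this should print only the first sentence, since the second one's remainder begins with "in-plan":
['Add roth Sweep non vested money after 5 years of termination']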
To explain a bit about what's happening here: you mentioned you have a LOT of these words to process. The traditional way you might compare two groups of items is with nested for-loops. So,
results = []
for word in words:
    for pattern in patterns:
        data = do_something(word, pattern)
        results.append(data)
        for item in data:
            for thing in item:
                and so on...
                and so forth...
I'm using a few different techniques to attempt to achieve a "flatter" implementation and avoid the nested loops. I'll do my best to describe them.
**Function compositions**
# You will often see patterns that look like this:
x = foo(a)
y = bar(b)
z = baz(y)
# You may also see patterns that look like this:
z = baz(bar(foo(a)))
# an alternative way to do this is to use a functional composition
# the technique works like this:
z = reduce(lambda x, y: y(x), [foo, bar, baz], a)
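As a concrete, runnable illustration of that reduce-based composition (note that in Python 3 reduce must be imported from functools; foo, bar and baz here are just stand-in functions):
from functools import reduce

def foo(x): return x + 1
def bar(x): return x * 2
def baz(x): return x - 3

a = 10
# left-to-right composition: equivalent to baz(bar(foo(a)))
z = reduce(lambda acc, f: f(acc), [foo, bar, baz], a)
print(z)  # 19, the same as baz(bar(foo(10)))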
Related
I want to do fuzzy matching on a string with words.
The target string could be:
"Hello, I am going to watch a film today."
and the words I want to search for are:
"flim toda".
This hopefully should return "film today" as a search result.
I have used this method but it seems to be working only with one word.
import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    matched_words = []
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            matched_words.append(match)
    return matched_words

large_string = "Hello, I am going to watch a film today"
query_string = "film"
print(list(matches(large_string, query_string, 0.8)))
This only works with a single word, and it only returns a match when there is little noise.
Is there any way to do such fuzzy matching with words?
The feature you are thinking of is called "query suggestion". It does rely on spell checking, but it also relies on Markov chains built out of search-engine query logs.
That being said, you can use an approach similar to the one described in this answer: https://stackoverflow.com/a/58166648/140837
You can simply use Fuzzysearch, please see the example below;
from fuzzysearch import find_near_matches

text_string = "Hello, I am going to watch a film today."
matches = find_near_matches('flim toda', text_string, max_l_dist=2)
print([text_string[m.start:m.end] for m in matches])
This will give you the desired output.
['film toda']
Please note that you can choose the value of the max_l_dist parameter based on how much deviation you are willing to tolerate.
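As a rough illustration of how that tolerance behaves (the expectations in the comments are assumptions based on edit distance, not a documented guarantee):
from fuzzysearch import find_near_matches

text_string = "Hello, I am going to watch a film today."
# the transposed letters in 'flim' cost two edits, so max_l_dist=1 is expected to find nothing
print(find_near_matches('flim toda', text_string, max_l_dist=1))
# a looser tolerance should recover the 'film toda' span again
print([text_string[m.start:m.end] for m in find_near_matches('flim toda', text_string, max_l_dist=2)])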
I was wondering if it's possible to use list comprehension in the following case, or if it should be left as a for loop.
temp = []
for value in my_dataframe[my_col]:
    match = my_regex.search(value)
    if match:
        temp.append(value.replace(match.group(1), ''))
    else:
        temp.append(value)
I believe I can do it with the if/else section, but the 'match' line throws me off. This is close but not exactly it.
temp = [value.replace(match.group(1),'') if (match) else value for
value in my_dataframe[my_col] if my_regex.search(value)]
Single-statement approach:
result = [
    value.replace(match.group(1), '') if match else value
    for value, match in (
        (value, my_regex.search(value))
        for value in my_dataframe[my_col])]
Functional approach - python 2:
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda (v, m): v.replace(m.group(1), '') if m else v
result = map(fix, gen)
Functional approach - python 3:
from itertools import starmap
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda v, m: v.replace(m.group(1), '') if m else v
result = list(starmap(fix, gen))
Pragmatic approach:
def fix_string(value):
    match = my_regex.search(value)
    return value.replace(match.group(1), '') if match else value

result = [fix_string(value) for value in my_dataframe[my_col]]
This is actually a good example of a list comprehension that performs worse than its corresponding for-loop and is (far) less readable.
If you wanted to do it, this would be the way:
temp = [value.replace(my_regex.search(value).group(1), '') if my_regex.search(value) else value
        for value in my_dataframe[my_col]]
Note that there is no place for us to define match inside the comprehension, and as a result we have to call my_regex.search(value) twice (once for the replacement and once for the condition). This is of course inefficient.
As a result, stick to the for-loop!
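That said, if you are on Python 3.8+, an assignment expression lets you bind the match inside the comprehension and search only once per value (a sketch, not tested against your data):
temp = [value.replace(m.group(1), '') if (m := my_regex.search(value)) else value
        for value in my_dataframe[my_col]]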
Use a regular expression with a capturing subgroup: match a word, then whitespace, then a three-letter word ending in "he" (such as "the" or "she"), then whitespace, then a word containing "el" (such as "well" or "fell"), followed by whitespace and one more word.
paragraph="""either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what was
going to happen next. first, she tried to look down and make out what
she was coming to, but it was too dark to see anything; then she
looked at the sides of the well, and noticed that they were filled with
cupboards and book-shelves; here and there she saw maps and pictures
hung upon pegs. she took down a jar from one of the shelves as
she passed; it was labelled 'orange marmalade', but to her great
disappointment it was empty: she did not like to drop the jar for fear
of killing somebody, so managed to put it into one of the cupboards as
she fell past it."""
sentences=paragraph.split(".")
pattern="\w+\s+((\whe)\s+(\w+el\w+)){1}\s+\w+"
temp=[]
for sentence in sentences:
result=re.findall(pattern,sentence)
for item in result:
temp.append("".join(item[0]).replace(' ',''))
print(temp)
output:
['thewell', 'shefell', 'theshelves', 'shefell']
I am using str.contains for text analytics in Pandas. For the sentence "My latest Data job was an Analyst", I want a combination of the words "Data" and "Analyst", but at the same time I want to specify the number of words allowed between the two (here, two words between "Data" and "Analyst"). Currently I am using (DataFile.XXX.str.contains('job') & DataFile.XXX.str.contains('Analyst')) to get the counts for "job Analyst".
How can I specify the number of words between the two words in the str.contains syntax?
Thanks in advance
You can't. At least, not in a simple or standardized way.
Even the basics, like how you define a "word," are a lot more complex than you probably imagine. Both word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.
Let's look at a simple approach. First you need a way to parse a string into words. The following is rough by NLP standards, but will work for simpler cases:
import re

def parse_words(s):
    """
    Simple parser to grab English words from a string.
    CAUTION: A simplistic solution to a hard problem.
    Many possibly-important edge- and corner-cases are
    not handled. Just one example: hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g.:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
Then you need a way to find all the indices in a list where a target word is found:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
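For instance, a quick sanity check of that helper:
>>> list_indices('a', ['a', 'b', 'a', 'c'])
[0, 2]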
And finally a decision making wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')

    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`

    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]

    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False

    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s)
    actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]

    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
So then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
The only thing left to do is map that back to Pandas:
import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})

df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
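With those sample rows, tracing the helpers by hand (not an actual run) suggests the job/analyst pair is 3 words apart in the first sentence and 2 apart in the last, so roughly:
>>> df.ja2.tolist()
[False, False, False, False, True]
>>> df.ja3.tolist()
[True, False, False, False, True]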
This is basically how you'd solve the problem. Keep in mind, it's a rough and simplistic solution. Some simply-posed questions are not simply-answered. NLP questions are often among them.
Examples of words:
ball
encyclopedia
tableau
Examples of random strings:
qxbogsac
jgaynj
rnnfdwpm
Of course it may happen that a random string will actually be a word in some language, or look like one. But basically a human being is able to say whether something looks 'random' or not, essentially by checking whether it is pronounceable.
I was trying to calculate entropy to distinguish the two, but it's far from perfect. Do you have any other ideas, or algorithms that work?
There is one important requirement though, I can't use heavy-weight libraries like nltk or use dictionaries. Basically what I need is some simple and quick heuristic that works in most cases.
I developed a Python 3 package called Nostril for a problem closely related to what the OP asked: deciding whether text strings extracted during source-code mining are class/function/variable/etc. identifiers or random gibberish. It does not use a dictionary, but it does incorporate a rather large table of n-gram frequencies to support its probabilistic assessment of text strings. (I'm not sure if that qualifies as a "dictionary".) The approach does not check pronunciation, and its specialization may make it unsuitable for general word/nonword detection; nevertheless, perhaps it will be useful for either the OP or someone else looking to solve a similar problem.
Example: the following code,
from nostril import nonsense

real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']

for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
Caveat: I am not a natural language expert.
Assuming whatever is mentioned in the link "If You Can Raed Tihs, You Msut Be Raelly Smrat" is authentic, a simple approach would be:
Have an English dictionary (I believe the approach is language-agnostic)
Create a Python dict of the words, with keys formed from the first and last character of each word in the dictionary
from collections import defaultdict

words = defaultdict(list)
with open("your_dict.txt") as fin:
    for word in fin:
        word = word.strip()
        words[word[0] + word[-1]].append(word)
Now for any given word (the needle), search the dictionary (remember the key is the first and last character of the word) and compare whether the characters in the dictionary value and your needle match:
for match in words[needle[0] + needle[-1]]:
    if sorted(match) == sorted(needle):
        print("Human Readable Word")
A comparably slower approach would be to use difflib.get_close_matches(word, possibilities[, n][, cutoff])
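For example, a minimal sketch of that difflib route (the word list is purely illustrative):
import difflib

possibilities = ['ball', 'encyclopedia', 'tableau']
print(difflib.get_close_matches('encyclopedai', possibilities, n=1, cutoff=0.8))  # expect ['encyclopedia']
print(difflib.get_close_matches('qxbogsac', possibilities, n=1, cutoff=0.8))      # expect []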
If you really mean that your metric of randomness is pronounceability, you're getting into the realm of phonotactics: the allowed sequences of sounds in a language. As @ChrisPosser points out in his comment to your question, these allowed sequences of sounds are language-specific.
This question only makes sense within a specific language.
Whichever language you choose, you might have some luck with an n-gram model trained over the letters themselves (as opposed to the words, which is the usual approach). Then you can calculate a score for a particular string and set a threshold under which a string is random and over which a string is something like a word.
EDIT: Someone has done this already and actually implemented it: https://stackoverflow.com/a/6298193/583834
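For what it's worth, here is a toy sketch of the letter-n-gram idea; the training text, smoothing and interpretation are placeholders, and a real model would be trained on a large corpus:
import math
from collections import Counter

def train_bigrams(corpus):
    # character-bigram counts over a cleaned-up corpus
    text = ''.join(c for c in corpus.lower() if c.isalpha() or c == ' ')
    return Counter(text[i:i+2] for i in range(len(text) - 1))

def avg_log_prob(word, bigrams):
    # average log-probability of the word's bigrams, with add-one smoothing
    total = sum(bigrams.values())
    word = word.lower()
    scores = [math.log((bigrams[word[i:i+2]] + 1) / (total + 1)) for i in range(len(word) - 1)]
    return sum(scores) / max(len(scores), 1)

corpus = "the quick brown fox jumps over the lazy dog and other ordinary english sentences provide letter statistics"
model = train_bigrams(corpus)
for s in ["tableau", "qxbogsac"]:
    # with a sufficiently large corpus, higher (less negative) scores should indicate more word-like strings
    print(s, avg_log_prob(s, model))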
Works quite well for me:
VOWELS = "aeiou"
PHONES = ['sh', 'ch', 'ph', 'sz', 'cz', 'sch', 'rz', 'dz']
def isWord(word):
if word:
consecutiveVowels = 0
consecutiveConsonents = 0
for idx, letter in enumerate(word.lower()):
vowel = True if letter in VOWELS else False
if idx:
prev = word[idx-1]
prevVowel = True if prev in VOWELS else False
if not vowel and letter == 'y' and not prevVowel:
vowel = True
if prevVowel != vowel:
consecutiveVowels = 0
consecutiveConsonents = 0
if vowel:
consecutiveVowels += 1
else:
consecutiveConsonents +=1
if consecutiveVowels >= 3 or consecutiveConsonents > 3:
return False
if consecutiveConsonents == 3:
subStr = word[idx-2:idx+1]
if any(phone in subStr for phone in PHONES):
consecutiveConsonents -= 1
continue
return False
return True
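A quick spot check of that heuristic (the expected values in the comment were traced by hand, so treat them as assumptions):
for w in ["ball", "encyclopedia", "qxbogsac"]:
    print(w, isWord(w))  # expected: True, True, False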
Use PyDictionary.
You can install PyDictionary using the following command:
easy_install -U PyDictionary
Now in code:
from PyDictionary import PyDictionary

dictionary = PyDictionary()
a = ['ball', 'asdfg']
for item in a:
    x = dictionary.meaning(item)
    if x is None:
        print(item + ': Not a valid word')
    else:
        print(item + ': Valid')
As far as I know, you can use PyDictionary for some languages other than English.
I wrote this logic to detect the number of consecutive vowels and consonants in a string. You can choose the threshold based on the language.
import re
import numpy as np

def get_num_vowel_bunches(txt, num_consq=3):
    # count positions where num_consq vowels occur in a row
    len_txt = len(txt)
    num_viol = 0
    if len_txt >= num_consq:
        pos_iter = re.finditer('[aeiou]', txt)
        pos_mat = np.zeros((num_consq, len_txt), dtype=int)
        for idx in pos_iter:
            pos_mat[0, idx.span()[0]] = 1
        for i in np.arange(1, num_consq):
            pos_mat[i, 0:-1] = pos_mat[i-1, 1:]
        sum_vec = np.sum(pos_mat, axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol

def get_num_consonent_bunches(txt, num_consq=3):
    # count positions where num_consq consonants occur in a row
    len_txt = len(txt)
    num_viol = 0
    if len_txt >= num_consq:
        pos_iter = re.finditer('[bcdfghjklmnpqrstvwxz]', txt)
        pos_mat = np.zeros((num_consq, len_txt), dtype=int)
        for idx in pos_iter:
            pos_mat[0, idx.span()[0]] = 1
        for i in np.arange(1, num_consq):
            pos_mat[i, 0:-1] = pos_mat[i-1, 1:]
        sum_vec = np.sum(pos_mat, axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol
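One possible way to wire these together into a yes/no heuristic; the zero-violation threshold and the wrapper name are assumptions, not part of the original answer:
def looks_pronounceable(txt, num_consq=3):
    # hypothetical wrapper: accept strings with no long vowel or consonant runs
    return (get_num_vowel_bunches(txt, num_consq) == 0 and
            get_num_consonent_bunches(txt, num_consq) == 0)

for s in ["ball", "qxbogsac"]:
    print(s, looks_pronounceable(s))  # expected: True for "ball", False for "qxbogsac"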
I have some text:
s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
I'd like to parse this into its individual words. I quickly looked into enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program using enchant's ability to check whether a word is English or not. I would have thought there'd be something to do this online; am I wrong?
Greedy approach using trie
Try this using Biopython (pip install biopython):
from Bio import trie
import string

def get_trie(dictfile='/usr/share/dict/american-english'):
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr

def get_trie_word(tr, s):
    # greedily take the longest prefix of s that is a dictionary word
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s

def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        print(word)

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)
Results
>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
...
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches
Caveats
There are degenerate cases in English that this will not work for. You need to use backtracking to deal with those (see the sketch after the test below), but this should get you started.
Obligatory test
>>> main("expertsexchange")
experts
exchange
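To illustrate the backtracking mentioned in the caveats, here is a rough memoized (dynamic-programming) sketch that uses a plain set of words in place of the trie; the tiny word set is only for demonstration:
from functools import lru_cache

WORDS = frozenset({"experts", "expert", "sex", "exchange"})  # illustrative only

def segment(s, words=WORDS):
    @lru_cache(maxsize=None)
    def helper(start):
        # return a list of words covering s[start:], or None if no segmentation exists
        if start == len(s):
            return []
        for end in range(len(s), start, -1):  # greedy: try longer words first, backtrack if needed
            piece = s[start:end]
            if piece in words:
                rest = helper(end)
                if rest is not None:
                    return [piece] + rest
        return None
    return helper(0)

print(segment("expertsexchange"))  # expected: ['experts', 'exchange']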
This is the sort of problem that occurs often in Asian NLP. If you have a dictionary, then you can use this: http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it, hope you don't mind).
Note that the search space might be extremely large, because the number of characters in alphabetic English is surely larger than in syllabic Chinese/Japanese.