Efficient replacement of occurrences of a list of words - python

I need to censor all occurrences of a list of words with *'s. I have about 400 words in the list and it's going to get hit with a lot of traffic, so I want to make it very efficient. What's an efficient algorithm/data structure to do this in? Preferably something already in Python.
Examples:
"piss off" => "**** off"
"hello" => "hello"
"go to hell" => "go to ****"

A case-insensitive trie-backed set implementation might fit the bill. For each word, you'll only process a minimum of characters. For example, you would only need to process the first letter of the word 'zoo' to know the word is not present in your list (assuming you have no 'z' expletives).
This is not something that ships with Python, however. You may observe better performance from a simple dictionary-based solution, since dict lookups are implemented in C.
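If it helps, here is a minimal sketch of that plain set/dict route: it tokenizes with a regex and stars out banned whole words (the banned-word set is only an illustration, not a real list):
import re

BANNED = {"piss", "hell"}  # illustrative banned-word set

def censor(text):
    # Star out any whole word whose lowercase form is in the banned set.
    return re.sub(
        r"\b\w+\b",
        lambda m: "*" * len(m.group(0)) if m.group(0).lower() in BANNED else m.group(0),
        text,
    )

print(censor("piss off"))    # **** off
print(censor("go to hell"))  # go to ****
print(censor("hello"))       # hello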

(1) Let P be the set of phrases to censor.
(2) Precompute H = {h(w) | p in P, w is a word in p}, where h is a sensible hash function.
(3) For each word v that is input, test whether h(v) in H.
(4) If h(v) not in H, emit v.
(5) If h(v) in H, back off to any naive method that will check whether v and the words following form a phrase in P.
Step (5) is not a problem since we assume that P is (very) small compared to the quantity of input. Step (3) is an O(1) operation.
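A loose sketch of this scheme, using Python's built-in hashing (a set) as h; the phrase list is made up, and this version stars every word of a matched phrase:
PHRASES = [p.split() for p in ("piss off", "go to hell")]  # illustrative P
H = {w for p in PHRASES for w in p}                        # step (2)

def censor(words):
    out, i = [], 0
    while i < len(words):
        v = words[i]
        if v not in H:                                     # steps (3)/(4): O(1) test
            out.append(v)
            i += 1
            continue
        for p in PHRASES:                                  # step (5): naive back-off
            if words[i:i + len(p)] == p:
                out.extend('*' * len(w) for w in p)
                i += len(p)
                break
        else:
            out.append(v)
            i += 1
    return ' '.join(out)

print(censor("please piss off now".split()))   # please **** *** now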

As cheeken has mentioned, a trie may be the thing you need; in fact, you should use the Aho–Corasick string matching algorithm, which is a trie plus some extra machinery.
For every string S you need to process, the time complexity is approximately O(len(S)), i.e. linear.
You also need to build the automaton once up front; its time complexity is O(sum(len(word) for word in words)), and its space complexity is about (usually less than) O(52 * sum(len(word) for word in words)), where 52 is the size of the alphabet (taking it as ['a'..'z', 'A'..'Z']). And you need to do this only once (or every time the system launches).
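If a ready-made implementation is acceptable, the third-party pyahocorasick package (an assumption, it is not part of the standard library) exposes such an automaton directly. A minimal censoring sketch with a crude whole-word check, using a made-up word list:
import ahocorasick  # third-party "pyahocorasick" package

# Build the automaton once (or each time the system launches), then reuse it.
automaton = ahocorasick.Automaton()
for bad in ("piss", "hell"):          # illustrative word list
    automaton.add_word(bad, bad)
automaton.make_automaton()

def censor(text):
    out = list(text)
    # iter() yields (index of the last character of the match, stored value).
    for end, bad in automaton.iter(text.lower()):
        start = end - len(bad) + 1
        # crude whole-word check so that e.g. "hello" is not hit by "hell"
        before_ok = start == 0 or not text[start - 1].isalnum()
        after_ok = end == len(text) - 1 or not text[end + 1].isalnum()
        if before_ok and after_ok:
            out[start:end + 1] = '*' * len(bad)
    return ''.join(out)

print(censor("piss off"))     # **** off
print(censor("go to hell"))   # go to ****
print(censor("hello"))        # hello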

You might want to time a regexp based solution against others. I have used similar regexp based substitution of one to three thousand words on a text to change phrases into links before, but I am not serving those pages to many people.
I take the set of words (it could be phrases), and form a regular expression out of them that will match their occurrence as a complete word in the text because of the '\b'.
If you have a dictionary mapping words to their sanitized version then you could use that. I just swap every odd letter with '*' for convenience here.
The sanitizer function just returns the sanitized version of any matched swear word and is used in the regular expression substitution call on the text to return a sanitized version.
import re
swearwords = set("Holy Cow".split())
swear = re.compile(r'\b(%s)\b' % '|'.join(sorted(swearwords, key=lambda w: (-len(w), w))))
sanitized = {sw:''.join((ch if not i % 2 else '*' for i,ch in enumerate(sw))) for sw in swearwords}
def sanitizer(matchobj):
    return sanitized.get(matchobj.group(1), '????')
txt = 'twat prick Holy Cow ... hell hello shitter bonk'
swear.sub(sanitizer, txt)
# Out[1]: 'twat prick H*l* C*w ... hell hello shitter bonk'
You might want to use re.subn and the count argument to limit the number of substitutions done and just reject the whole text if it has too many profanities:
maxswear = 2
newtxt, scount = swear.subn(sanitizer, txt, count=maxswear)
if scount >= maxswear: newtxt = 'Ouch my ears hurt. Please tone it down'
print(newtxt)
# 'Ouch my ears hurt. Please tone it down'

If performance is what you want, I would suggest:
Get a sample of the input.
Calculate the average number of censored words per line.
Define a maximum number of words to filter per line (3, for example).
Calculate which censored words get the most hits in the sample.
Write a function that, given the censored words, generates a Python file with if statements checking each word, putting the 'most hits' words first; since you just want to match whole words it will be fairly simple (a rough sketch follows below).
Once you hit the maximum number per line, exit the function.
I know this is not nice, and I'm only suggesting this approach because of the high-traffic scenario; looping over each word in your list will have a huge negative impact on performance.
Hope that helps, or at least gives you an out-of-the-box idea of how to tackle the problem.
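For what it's worth, a very rough sketch of the code-generation idea; the output file name, the word list, and its "most hits first" ordering are all made up here:
words_by_hits = ["hell", "piss"]   # assume already sorted by hit count, descending

with open("generated_filter.py", "w") as f:
    f.write("def censor_word(w):\n")
    for word in words_by_hits:
        f.write("    if w == %r:\n" % word)
        f.write("        return %r\n" % ("*" * len(word)))
    f.write("    return w\n")

# The generated module can then be imported and applied word by word:
#   from generated_filter import censor_word
#   cleaned = " ".join(censor_word(w) for w in line.split())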

Related

Finding the length of each sentence in a text

I'm new to python (and coding in general), and I ran into an issue while doing my first assignment for school. We need to implement several simple text analysis techniques, and after hours of hitting my head on my keyboard, I figured I better ask for some pointers here.
The problem lies within one of the tasks. I'm supposed to find the number of words per sentence in a given text, and then print out the number of words per sentence from the longest to the shortest. Thus far, I have been able to figure out how to find the longest and shortest sentence, (and even the second longest sentence). However, I'm stuck as to how to find out the second shortest sentence or third longest, etc.
My code looks like this:
length = sentences.split(". ")
tokenized_sentences = [sentence.split(" ") for sentence in length]
longest_sen = max(tokenized_sentences, key=len)
longest_sen_length = len(longest_sen)
shortest_sen = min(tokenized_sentences, key=len)
shortest_sen_length = len(shortest_sen)
print("The longest sentence is", (longest_sen_length), "words.")
print("The shortest sentence is", (shortest_sen_length), "words.")
I'm aware that the code is not robust, and I could save a lot of time using nltk or re. However, the paragraph isn't very long or complex, and I'm not certain my professor would be a-ok with me using an additional platform at this point.
Any pointers would be highly appreciated!
Edit: An example of the text: "Once more. Say, you are in the country, in some high land of lakes. Take almost any path you please, and ten to one it carries you down in a dale, and leaves you there by a pool in the stream. There is magic in it. Let the most absentminded of men be plunged in his deepest reveries--stand that man on his legs, set his feet a-going, and he will infallibly lead you to water, if water there be in all that region. Should you ever be athirst in the great American desert, try this experiment, if your caravan happen to be supplied with a metaphysical professor. Yes, as every one knows, meditation and water are wedded for ever."
If we assume that your entire text is stored in a variable named sentences, then you can do the following, which sorts the sentence lengths in descending order.
l=sentences.split('. ')
m=[len(i.split()) for i in l]
m.sort(reverse=True)
You will then have all the sentence lengths and can play around with what you want to print.
You can use the sort() method to sort a list by length; since you want descending order, just pass reverse=True.
tokenized_sentences.sort(key=len, reverse=True)
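Putting the pieces together, a minimal sketch that prints the word counts from longest to shortest; the sample text here is a shortened version of the paragraph in the question:
sentences = "Once more. Say, you are in the country, in some high land of lakes. There is magic in it."
sentence_list = sentences.split(". ")
word_counts = sorted((len(s.split()) for s in sentence_list), reverse=True)
for rank, count in enumerate(word_counts, start=1):
    print("Sentence", rank, "has", count, "words.")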

get string between stopwords in a time-effective way

assuming I have a text:
txt='A single house painted white with a few windows and a nice door in front of the park'
I would like to drop any leading stopwords and then take the substring up to the next stopword.
desired outcome: single house painted white
I can loop over the list:
txt='A single house painted white with a few windows and a nice door in front of the park'
stopwords = ['a','the','with','this','is','to','etc'] # up to 250 words
for i, word in enumerate(txt.lower().split()):
    pos1 = i
    if word in stopwords:
        break
rest_text = txt.split()[pos1+1:]
print(rest_text)
# and now we do the same for pos2
for i, word in enumerate(rest_text):
    pos2 = i
    if word in stopwords:
        print(word, pos2)
        break
rest_text = rest_text[:pos2]
print(rest_text)
I have to do this for thousands of texts and speed is important. Looping is never the way to go in Python.
But I cannot come up with a list comprehension solution.
Any help?
NOTE1: I made the example text longer to make clear the outcome
NOTE2:
other example:
txt = 'this is a second text to make clear the outcome that I like'
outcome: "second text"
There are 2 ways that I can see might significantly improve performance here.
set instead of list
Your code must check whether some string is a member of stopwords a lot. A list is not the best data structure for this, since in the worst case, it requires a comparison with every element in the list. Membership test for a list is O(n).
sets are much quicker at performing this membership test. Their implementation in Python is something like a hash table, which means they can perform the membership test in constant time, O(1). So, for large numbers of elements, a set will significantly outperform a list for this particular operation.
You can make a set of stopwords, instead of a list with:
stopwords = set(['a','the','with','etc'])
re.finditer instead of str.split()
If your txt is large, and you only require the first qualifying substring of your txt (as is implied in the question), then you may squeeze more performance by using re.finditer instead of str.split() to separate the words of your text.
str.split() returns a list of words from the entire text at once, whereas re.finditer returns an iterator that can yield words as they are needed. In the worst case you will obviously still need to 'loop' over the entire text, but if your matches are near the beginning of txt, time and memory savings may be significant.
For an example:
txt='A single house painted white with a few windows'
stopwords = set(['a','the','with','etc'])
import re
split_txt = (match.group(0) for match in re.finditer(r'\S+', txt))
result = []
word = next(split_txt)
while word.lower() in stopwords:
    word = next(split_txt)
while word.lower() not in stopwords:
    result.append(word)
    word = next(split_txt)
print(' '.join(result))
Note though that it is often better to just start out with some code that works to test on your input, than to prematurely start optimising. Testing will reveal whether optimisation is necessary. You say in the question that
looping is never the way to go in Python
but this is just not true. Looping in one form or another is more often than not unavoidable, in any language. While the performance may not match that of compiled languages like C or Fortran, Python may surprise you with how performant it can be (if you let it).
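For completeness, a compact alternative in the spirit of the one-liner the question asks for, using itertools.dropwhile and takewhile on the same example data as above:
from itertools import dropwhile, takewhile

txt = 'A single house painted white with a few windows and a nice door in front of the park'
stopwords = {'a', 'the', 'with', 'this', 'is', 'to', 'etc'}

# Drop leading stopwords, then take words until the next stopword appears.
rest = dropwhile(lambda w: w.lower() in stopwords, txt.split())
result = takewhile(lambda w: w.lower() not in stopwords, rest)
print(' '.join(result))   # single house painted white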

efficient way to get words before and after substring in text (python)

I'm using regex to find occurrences of string patterns in a body of text. Once I find that the string pattern occurs, I want to get x words before and after the string as well (x could be as small as 4, but preferably ~10 if still as efficient).
I am currently using regex to find all instances, but occasionally it will hang. Is there a more efficient way to solve this problem?
This is the solution I currently have:
sub = r'(\w*)\W*(\w*)\W*(\w*)\W*(\w*)\W*(%s)\W*(\w*)\W*(\w*)\W*(\w*)\W*(\w*)' % result_string  # find the string and get the surrounding +/- 4 words
surrounding_text = re.findall(sub, text)
for found_text in surrounding_text:
    result_found.append(" ".join(map(str, found_text)))
I'm not sure if this is what you're looking for:
>>> text = "Hello, world. Regular expressions are not always the answer."
>>> words = text.partition("Regular expressions")
>>> words
('Hello, world. ', 'Regular expressions', ' are not always the answer.')
>>> words_before = words[0]
>>> words_before
'Hello, world. '
>>> separator = words[1]
>>> separator
'Regular expressions'
>>> words_after = words[2]
>>> words_after
' are not always the answer.'
Basically, str.partition() splits the string into a 3-element tuple. In this example, the first element is all of the words before the specific "separator", the second element is the separator, and the third element is all of the words after the separator.
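Building on that, a small sketch that keeps up to n whitespace-separated words on either side of the separator (n and the text are just examples):
n = 4
text = "Hello, world. Regular expressions are not always the answer."
before, sep, after = text.partition("Regular expressions")
if sep:  # partition found the separator
    context = before.split()[-n:] + [sep] + after.split()[:n]
    print(" ".join(context))
    # Hello, world. Regular expressions are not always the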
The main problem with your pattern is that it begins with optional elements, which causes a lot of attempts at each position in the string until a match is found. The number of attempts increases with the text size and with the value of n (the number of words before and after). This is why only a few lines of text suffice to crash your code.
One approach is to begin the pattern with the target word and to use lookarounds to capture the text (or the words) before and after:
keyword (?= words after ) (?<= words before - keyword)
Starting a pattern with the searched word (a literal string) makes it very fast, and words around are then quickly found from this position in the string. Unfortunately the re module has some limitations and doesn't allow variable length lookbehinds (as many other regex flavors).
The new regex module supports variable length lookbehinds and other useful features like the ability to store the matches of a repeated capture group (handy to get the separated words in one shot).
import regex
text = '''In strange contrast to the hardly tolerable constraint and nameless
invisible domineerings of the captain's table, was the entire care-free
license and ease, the almost frantic democracy of those inferior fellows
the harpooneers. While their masters, the mates, seemed afraid of the
sound of the hinges of their own jaws, the harpooneers chewed their food
with such a relish that there was a report to it.'''
word = 'harpooneers'
n = 4
pattern = r'''
\m (?<target> %s ) \M          # target word
(?<=                           # content before
    (?<before> (?: (?<wdb>\w+) \W+ ){0,%d} )
    %s
)
(?=                            # content after
    (?<after> (?: \W+ (?<wda>\w+) ){0,%d} )
)
''' % (word, n, word, n)
rgx = regex.compile(pattern, regex.VERBOSE | regex.IGNORECASE)
class Result(object):
    def __init__(self, m):
        self.target_span = m.span()
        self.excerpt_span = (m.starts('before')[0], m.ends('after')[0])
        self.excerpt = m.expandf('{before}{target}{after}')
        self.words_before = m.captures('wdb')[::-1]
        self.words_after = m.captures('wda')
results = [Result(m) for m in rgx.finditer(text)]
print(results[0].excerpt)
print(results[0].excerpt_span)
print(results[0].words_before)
print(results[0].words_after)
print(results[1].excerpt)
Making a regex (well, anything, for that matter) with "as many repetitions as you will ever possibly need" is an extremely bad idea. That's because you
do an excessive amount of needless work every time
cannot really know for sure how much you will ever possibly need, thus introducing an arbitrary limitation
The bottom line for the solutions below: the 1st one is the most effective for large data; the 2nd one is the closest to your current approach, but scales much worse.
strip your entities to exactly what you are interested in at each moment:
find the substring (e.g. str.index; for whole words only, re.search with e.g. r'\b%s\b' % re.escape(word) is more suitable)
go N words back.
Since you mentioned a "text", your strings are likely to be very large, so you want to avoid copying potentially unlimited chunks of them.
E.g. re.finditer over a substring reverse iterator in place (see "slices to immutable strings by reference and not copy" and "Best way to loop over a python string backwards"). This will only become better than slicing when the latter is expensive in terms of CPU and/or memory - test on some realistic examples to find out. Update: this doesn't work; re works directly with the memory buffer, so it's impossible to reverse a string for it without copying the data.
There's no function to find a character from a class in Python, nor an "xsplit". So the fastest way appears to be (i for i, c in enumerate(reversed(buffer(text, 0, substring_index))) if c.isspace()) (timeit gives ~100ms on a P3 933MHz for a full pass through a 100k string).
Alternatively:
Fix your regex to not be subject to catastrophic backtracking and eliminate code duplication (DRY principle).
The 2nd measure will eliminate the 2nd issue: we'll make the number of repetitions explicit (Python Zen, koan 2) and thus highly visible and manageable.
As for the 1st issue, if you really only need "up to known, same N" items in each case, you won't actually be doing "excessive work" by finding them together with your string.
The "fix" part here is \w*\W* -> \w+\W+. This eliminates major ambiguity (see the above link) from the fact that each x* can be a blank match.
Matching up to N words before the string efficiently is harder:
with (\w+\W+){,10} or equivalent, the matcher will be matching every run of 10 words before discovering that your string doesn't follow them, then trying 9, 8, etc. To ease things for the matcher somewhat, a \b before the pattern will make it only perform all this work at the beginning of each word
lookbehind is not allowed here: as the linked article explains, the regex engine must know how many characters to step back before trying the contained regex. And even if it were allowed, a lookbehind is tried before every character - i.e. it's even more of a CPU hog
As you can see, regexes aren't quite cut out for matching things backwards
To eliminate code duplication, either
use the aforementioned {,10}. This will not save individual words but should be noticeably faster for large text (see the above on how the matching works here). We can always parse the retrieved chunk of text in more detail (with the regex in the next item) once we have it; a sketch of this bounded approach is shown after this list. Or
autogenerate the repetitive part
note that (\w+\W+)? repeated mindlessly is subject to the same ambiguity as above. To be unambiguous, the expression must be like this (w=(\w+\W+) here for brevity): (w(w...(ww?)?...)?)? (and all the groups need to be non-capturing).
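As a concrete illustration of the bounded approach (the \w+\W+ fix plus an explicit {0,n} repetition instead of a chain of optional groups), here is a sketch with an arbitrary sample sentence, keyword and n:
import re

n = 4
word = 'harpooneers'
pattern = re.compile(
    r'\b(?:\w+\W+){0,%d}%s(?:\W+\w+){0,%d}' % (n, re.escape(word), n),
    re.IGNORECASE)

text = ("While their masters, the mates, seemed afraid of the sound of the "
        "hinges of their own jaws, the harpooneers chewed their food with "
        "such a relish that there was a report to it.")
m = pattern.search(text)
if m:
    print(m.group(0))
    # their own jaws, the harpooneers chewed their food with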
I personally think that using text.partition() is the best option, as it eliminates the messy regular expressions, and automatically leaves output in an easy-to-access tuple.

Finding a substring's position in a larger string

I have a large string and a large number of smaller substrings and I am trying to check if each substring exists in the larger string and get the position of each of these substrings.
string="some large text here"
sub_strings=["some", "text"]
for each_sub_string in sub_strings:
    if each_sub_string in string:
        print each_sub_string, string.index(each_sub_string)
The problem is, since I have a large number of substrings (around a million), it takes about an hour of processing time. Is there any way to reduce this time, maybe by using regular expressions or some other way?
The best way to solve this is with a tree implementation. As Rishav mentioned, you're repeating a lot of work here. Ideally, this should be implemented as a tree-based FSM. Imagine the following example:
Large String: 'The cat sat on the mat, it was great'
Small Strings: ['cat', 'sat', 'ca']
Then imagine a tree where each level is an additional letter.
small_lookup = {
    'c': ['a', {'a': ['t']}],
    's': ['at'],
}
Apologies for the gross formatting, but I think it's helpful to map back to a python data structure directly. You can build a tree where the top level entries are the starting letters, and they map to the list of potential final substrings that could be completed. If you hit something that is a list element and has nothing more nested beneath you've hit a leaf and you know that you've hit the first instance of that substring.
Holding that tree in memory is a little hefty, but if you've only got a million strings this should be the most efficient implementation. You should also make sure that you trim the tree as you find the first instance of words.
For those of you with CS chops, or if you want to learn more about this approach, it's a simplified version of the Aho-Corasick string matching algorithm.
If you're interested in learning more about these approaches there are three main algorithms used in practice:
Aho-Corasick (Basis of fgrep) [Worst case: O(m+n)]
Commentz-Walter (Basis of vanilla GNU grep) [Worst case: O(mn)]
Rabin-Karp (Used for plagiarism detection) [Worst case: O(mn)]
There are domains in which all of these algorithms will outperform the others, but based on the fact that you've got a very high number of sub-strings that you're searching, and there's likely a lot of overlap between them, I would bet that Aho-Corasick is going to give you significantly better performance than the other two methods, as it avoids the O(mn) worst-case scenario.
There is also a great python library that implements the Aho-Corasick algorithm found here that should allow you to avoid writing the gross implementation details yourself.
Depending on the distribution of the lengths of your substrings, you might be able to shave off a lot of time using preprocessing.
Say the set of the lengths of your substrings form the set {23, 33, 45} (meaning that you might have millions of substrings, but each one takes one of these three lengths).
Then, for each of these lengths, find the Rabin Window over your large string, and place the results into a dictionary for that length. That is, let's take 23. Go over the large string, and find the 23-window hashes. Say the hash for position 0 is 13. So you insert into the dictionary rabin23 that 13 is mapped to [0]. Then you see that for position 1, the hash is 13 as well. Then in rabin23, update that 13 is mapped to [0, 1]. Then in position 2, the hash is 4. So in rabin23, 4 is mapped to [2].
Now, given a substring, you can calculate its Rabin hash and immediately check the relevant dictionary for the indices of its occurrence (which you then need to compare).
BTW, in many cases, the lengths of your substrings will exhibit a Pareto behavior, where say 90% of the strings are in 10% of the lengths. If so, you can do this for these lengths only.
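A rough sketch of the per-length window index; the rolling-hash base and modulus are arbitrary illustrative choices, and real code would handle several distinct lengths more carefully:
from collections import defaultdict

BASE, MOD = 256, (1 << 61) - 1

def window_hashes(text, length):
    # Map rolling hash -> list of start positions of every window of `length`.
    index = defaultdict(list)
    if len(text) < length:
        return index
    h, power = 0, pow(BASE, length - 1, MOD)
    for i, ch in enumerate(text):
        h = (h * BASE + ord(ch)) % MOD
        if i >= length - 1:
            index[h].append(i - length + 1)
            h = (h - ord(text[i - length + 1]) * power) % MOD
    return index

def rabin_hash(s):
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

string = "some large text here"
index_by_len = {4: window_hashes(string, 4)}       # one dictionary per distinct length
sub = "text"
for start in index_by_len[len(sub)].get(rabin_hash(sub), []):
    if string[start:start + len(sub)] == sub:      # confirm, since hashes can collide
        print(sub, start)                          # text 11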
This approach is sub-optimal compared to the other answers, but might be good enough regardless, and is simple to implement. The idea is to turn the algorithm around so that instead of testing each sub-string in turn against the larger string, you iterate over the large string and test against possible matching sub-strings at each position, using a dictionary to narrow down the number of sub-strings you need to test.
The output will differ from the original code in that it will be sorted in ascending order of index as opposed to by sub-string, but you can post-process the output to sort by sub-string if you want to.
Create a dictionary mapping each possible 1-3 character prefix to the list of sub-strings beginning with it. Then iterate over the string and at each character read the 1-3 characters after it and check for a match at that position for each sub-string in the dictionary that begins with those 1-3 characters:
string="some large text here"
sub_strings=["some", "text"]
# add each of the substrings to a dictionary based on the first 1-3 characters
dict = {}
for s in sub_strings:
    if s[0:3] in dict:
        dict[s[0:3]].append(s)
    else:
        dict[s[0:3]] = [s]
# iterate over the chars in string, testing words that match on first 1-3 chars
for i in range(0, len(string)):
    for j in range(1, 4):
        char = string[i:i+j]
        if char in dict:
            for word in dict[char]:
                if string[i:i+len(word)] == word:
                    print word, i
If you don't need to match any sub-strings 1 or 2 characters long, then you can get rid of the for j loop and just assign char with char = string[i:i+3].
Using this second approach I timed the algorithm by reading in Tolstoy's War and Peace and splitting it into unique words, like this:
with open ("warandpeace.txt", "r") as textfile:
string=textfile.read().replace('\n', '')
sub_strings=list(set(string.split()))
Doing a complete search for every unique word in the text and outputting every instance of each took 124 seconds.

algorithm for testing multiple substrings in multiple strings

I have several million strings, X, each with fewer than 20 or so words. I also have a list of several thousand candidate substrings, C. For each x in X, I want to see if there are any strings in C that are contained in x. Right now I am using a naive double for loop, but it's been a while and it hasn't finished yet... Any suggestions? I'm using Python, if anyone knows of a nice implementation, but links for any language or general algorithms would be nice too.
Encode one of your sets of strings as a trie (I recommend the bigger set). Lookup time should be faster than an imperfect hash and you will save some memory too.
It's gonna be a long while. You have to check every one of those several million strings against every one of those several thousand candidate substrings, meaning that you will be doing (several million * several thousand) string comparisons. Yeah, that will take a while.
If this is something that you're only going to do once or infrequently, I would suggest using fgrep. If this is something that you're going to do often, then you want to look into implementing something like the Aho-Corasick string matching algorithm.
If your x in X only contains words, and you only want to match words you could do the following:
Insert your keywords into a set; that makes membership tests effectively constant-time, and then check for every word in x whether it is contained in that set.
like:
keywords = set(['bla', 'fubar'])
for x in X:
    for w in x.split(' '):
        if w in keywords:
            pass  # do what you need to do
A good alternative would be to use Google's re2 library, which uses very nice automata theory to produce efficient matchers. (http://code.google.com/p/re2/)
EDIT: Be sure you use proper buffering and something in a compiled language; that makes it a lot faster. If it's less than a couple of gigabytes, it should work with Python too.
You could try using a regex:
import re
subs = re.compile('|'.join(re.escape(c) for c in C))  # escape in case the substrings contain regex metacharacters
for x in X:
    if subs.search(x):
        print 'found'
Have a look at http://en.wikipedia.org/wiki/Aho-Corasick. You can build a pattern-matcher for a set of fixed strings in time linear in the total size of the strings, then search in text, or multiple sections of text, in time linear in the length of the text + the number of matches found.
Another fast exact pattern matcher is http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
