get string between stopwords in an (time)effective way - python

assuming I have a text:
txt='A single house painted white with a few windows and a nice door in front of the park'
I would like to drop the leading words as long as they are stopwords, and then take the substring up to the next stopword.
desired outcome: single house painted white
I can loop over the list:
txt='A single house painted white with a few windows and a nice door in front of the park'
stopwords = ['a','the','with','this','is','to','etc'] # up to 250 words
for i, word in enumerate(txt.lower().split()):
    pos1 = i
    if word in stopwords:
        break
rest_text = txt.split()[pos1+1:]
print(rest_text)
# and now we do the same for pos2
for i, word in enumerate(rest_text):
    pos2 = i
    if word in stopwords:
        print(word, pos2)
        break
rest_text = rest_text[:pos2]
print(rest_text)
I have to do this for thousands of texts and speed is important. Looping is never the way to go in Python, but I cannot come up with a list comprehension solution.
Any help?
NOTE1: I made the example text longer to make clear the outcome
NOTE2:
other example:
txt = 'this is a second text to make clear the outcome that I like'
outcome: "second text"

There are two ways that I can see might significantly improve performance here.
set instead of list
Your code must check whether some string is a member of stopwords a lot. A list is not the best data structure for this, since in the worst case, it requires a comparison with every element in the list. Membership test for a list is O(n).
Sets are much quicker at performing this membership test. Their implementation in Python is something like a hash table, which means they can perform the membership test in constant time, O(1). So, for large numbers of elements, a set will significantly outperform a list for this particular operation.
You can make a set of stopwords, instead of a list with:
stopwords = set(['a','the','with','etc'])
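If you want to see the difference for yourself, a rough micro-benchmark along these lines can be used (the word list here is just an illustrative stand-in for your ~250 stopwords):
import timeit

stopword_list = ['a', 'the', 'with', 'this', 'is', 'to'] * 40   # ~240 entries, as a list
stopword_set = set(stopword_list)                               # same words, as a set

# membership test for a word that is not a stopword (the worst case for the list)
print(timeit.timeit("'zebra' in stopword_list", globals=globals(), number=100000))
print(timeit.timeit("'zebra' in stopword_set", globals=globals(), number=100000))
The set lookup should come out far ahead, and the gap grows with the size of the stopword list.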
re.finditer instead of str.split()
If your txt is large, and you only require the first qualifying substring of your txt (as is implied in the question), then you may squeeze more performance by using re.finditer instead of str.split() to separate the words of your text.
str.split() returns a list of words from the entire text at once, whereas re.finditer returns an iterator that can yield words as they are needed. In the worst case you will obviously still need to 'loop' over the entire text, but if your matches are near the beginning of txt, time and memory savings may be significant.
For an example:
txt = 'A single house painted white with a few windows'
stopwords = set(['a', 'the', 'with', 'etc'])

import re

split_txt = (match.group(0) for match in re.finditer(r'\S+', txt))
result = []
word = next(split_txt)
# skip any leading stopwords
while word.lower() in stopwords:
    word = next(split_txt)
# collect words until the next stopword is reached
# (note: next() raises StopIteration if the text ends before another stopword appears)
while word.lower() not in stopwords:
    result.append(word)
    word = next(split_txt)
print(' '.join(result))
Note though that it is often better to just start out with some code that works to test on your input, than to prematurely start optimising. Testing will reveal whether optimisation is necessary. You say in the question that
looping is never the way to go in Python
but this is just not true. Looping in one form or another is more often than not unavoidable, in any language. While the performance may not match that of compiled languages like C or Fortran, Python may surprise you with how performant it can be (if you let it).
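As a side note, the same skip-then-collect logic can be expressed with itertools, which also stops gracefully when the text runs out of words instead of raising StopIteration. This is a sketch of that alternative (the function name is just illustrative, not from the answer above):
import itertools
import re

def first_content_chunk(txt, stopwords):
    words = (m.group(0) for m in re.finditer(r'\S+', txt))
    # drop leading stopwords, then take words until the next stopword
    rest = itertools.dropwhile(lambda w: w.lower() in stopwords, words)
    chunk = itertools.takewhile(lambda w: w.lower() not in stopwords, rest)
    return ' '.join(chunk)

print(first_content_chunk('A single house painted white with a few windows',
                          {'a', 'the', 'with', 'etc'}))
# single house painted white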

Related

Is there a way to detect if unnecessary characters are added to strings to bypass spam detection?

I'm building a simple spam classifier and, from a cursory look at my dataset, most spam messages put spaces in between "spammy" words, which I assume is done to bypass the spam classifier. Here are some examples:
c redi t card
mort - gage
I would like to be able to take these and encode them in my dataframe as the correct words:
credit card
mortgage
I'm using Python by the way.
This depends a lot on whether you have a list of all spam words or not.
If you do have a list of spam words and you know that there are always only ADDED spaces (e.g. give me your cred it card in formation) but never MISSING spaces (e.g. give me yourcredit cardinformation), then you could use a simple rule-based approach:
import itertools

spam_words = {"credit card", "rolex"}
spam_words_no_spaces = {"".join(s.split()) for s in spam_words}

sentence = "give me your credit car d inform ation and a rol ex"
tokens = sentence.split()
for length in range(1, len(tokens)):
    for t in set(itertools.combinations(tokens, length)):
        if "".join(t) in spam_words_no_spaces:
            print(t)
Which prints:
> ('rol', 'ex')
> ('credit', 'car', 'd')
So first create a set of all spam words, then for an easier comparison remove all spaces (although you could adjust the method to consider only correctly spaced spam words).
Then split the sentence into tokens and finally build all possible unique subsequences of the token list (including one-word sequences and the whole sentence without whitespace), and check whether they're in the set of spam words. Note that itertools.combinations keeps the original token order but does not require the chosen tokens to be adjacent; a contiguous-only variant is sketched below.
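Since combinations can also join tokens that are far apart, a variant that only checks contiguous runs of tokens may be a tighter (and cheaper) test. A minimal sketch of that idea, reusing the names from the snippet above:
spam_words = {"credit card", "rolex"}
spam_words_no_spaces = {"".join(s.split()) for s in spam_words}

sentence = "give me your credit car d inform ation and a rol ex"
tokens = sentence.split()
for start in range(len(tokens)):
    for end in range(start + 1, len(tokens) + 1):
        # join a contiguous run of tokens and compare against the space-less spam words
        if "".join(tokens[start:end]) in spam_words_no_spaces:
            print(tokens[start:end])
# ['credit', 'car', 'd']
# ['rol', 'ex']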
If you don't have a list of spam words, your best chance would probably be to do general whitespace correction on the data. Check out Optical Character Recognition (OCR) Post Correction, which you can find some pretrained models for. Also check out this thread, which talks about how to add spaces to spaceless text and even mentions a Python package for that. So in theory you could remove all spaces and then try to split the text again into meaningful words, to increase the chance that the spam words are found. Generally your problem (and the opposite one, missing whitespace) is called word boundary detection, so you might want to check some resources on that.
Also you should be aware that modern pretrained models such as common transformer models often use sub-token-level embeddings for unknown words, so they can relatively easily combine what they learned for a split and a non-split version of a common word.

Count occurrences of elements in string from a list?

I'm trying to count the number of occurrences of verbal contractions in some speeches I've gathered. One particular speech looks like this:
speech = "I've changed the path of the economy, and I've increased jobs in our own
home state. We're headed in the right direction - you've all been a great help."
So, in this case, I'd like to count four (4) contractions. I have a list of contractions, and here are some of the first few terms:
contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}
My code looks something like this, to begin with:
count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print count
I'm not getting anywhere with this, however, as the code's iterating over every single letter, as opposed to whole words.
Use str.split() to split your string on whitespace:
for word in speech.split():
This will split on arbitrary whitespace; this means spaces, tabs, newlines, and a few more exotic whitespace characters, and any number of them in a row.
You may need to lowercase your words using str.lower() (otherwise Ain't won't be found, for example), and strip punctuation:
from string import punctuation

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1
I use the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
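Put together on the question's example (with a shortened, purely illustrative contractions dict), this counts all four contractions:
from string import punctuation

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")
contractions = {"i've": "i have", "we're": "we are", "you've": "you have"}  # shortened for the example

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1

print(count)  # 4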
You're iterating over a string, so the items are characters. To get the words from a string you can use a naive method like str.split(), which does this for you: you can then iterate over a list of strings (the words, split on the argument of str.split(); by default it splits on whitespace). There is also re.split(), which is more powerful, but I don't think you need to split the text with regexes here.
What you have to do at least is to lowercase your string with str.lower(), or put all possible occurrences (including capitalized ones) in the dictionary. I strongly recommend the first alternative; the latter isn't really practical. Removing the punctuation is also necessary. But this is still naive. If you need a more sophisticated method, you have to split the text with a word tokenizer. NLTK is a good starting point for that; see the nltk tokenizer. But I strongly feel that this problem is not your major one, and it doesn't really block you from solving your question. :)
speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ...
# with re you can define advanced regexes, but maybe
# from string import punctuation (the suggestion from Martijn Pieters' answer)
# is still enough for you
import re

def abbreviation_counter(input_text, abbreviation_dict):
    count = 0
    # what you want is a list of words. str.split() does this job for you.
    # split(" ") splits on single spaces; omit the argument to split on any
    # whitespace. If you really need better methods (see answer text above),
    # you have to take a word tokenizer tool or write your own.
    for word in input_text.split(" "):
        # and also clean the word (remove ',', ';', ...) afterwards. The advantage of
        # using re over `from string import punctuation` is that you have more
        # control over what you want to remove: you can easily add or remove any
        # punctuation mark. That can be very handy - or overpowered. If the latter
        # is the case, just stick to Martijn Pieters' solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1
    return count

print abbreviation_counter(speech, contractions)
2 # yeah, it worked - I've included I've in your list :)
It's a little bit frustrating to give an answer at the same time as Martijn Pieters does ;), but I hope I have still added some value for you. That's why I've edited my answer to give you some hints for future work in addition.
A for loop in Python iterates over all elements in an iterable. In the case of strings the elements are the characters.
You need to split the string into a list (or tuple) of strings that contain the words. You can use .split(delimiter) for this.
Your problem is quite common, so Python has a shortcut: speech.split() splits at any number of spaces/tabs/newlines, so you only get your words in the list.
So your code should look like this:
count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)
speech.split(" ") works too, but only splits on whitespaces but not tabs or newlines and if there are double spaces you'd get empty elements in your resulting list.

efficient way to get words before and after substring in text (python)

I'm using regex to find occurrences of string patterns in a body of text. Once I find that the string pattern occurs, I want to get x words before and after the string as well (x could be as small as 4, but preferably ~10 if still as efficient).
I am currently using regex to find all instances, but occasionally it will hang. Is there a more efficient way to solve this problem?
This is the solution I currently have:
# find the string again and get the surrounding +-4 words
sub = r'(\w*)\W*(\w*)\W*(\w*)\W*(\w*)\W*(%s)\W*(\w*)\W*(\w*)\W*(\w*)\W*(\w*)' % result_string
surrounding_text = re.findall(sub, text)
for found_text in surrounding_text:
    result_found.append(" ".join(map(str, found_text)))
I'm not sure if this is what you're looking for:
>>> text = "Hello, world. Regular expressions are not always the answer."
>>> words = text.partition("Regular expressions")
>>> words
('Hello, world. ', 'Regular expressions', ' are not always the answer.')
>>> words_before = words[0]
>>> words_before
'Hello, world. '
>>> separator = words[1]
>>> separator
'Regular expressions'
>>> words_after = words[2]
>>> words_after
' are not always the answer.'
Basically, str.partition() splits the string into a 3-element tuple. In this example, the first element is all of the words before the specific "separator", the second element is the separator, and the third element is all of the words after the separator.
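If you also need a bounded number of words on each side, one way (sketched here under the assumption that partitioning the whole text is acceptable) is to split the pieces returned by str.partition() and slice them:
text = "Hello, world. Regular expressions are not always the answer."
before, sep, after = text.partition("Regular expressions")

n = 4
words_before = before.split()[-n:]   # last n words before the separator
words_after = after.split()[:n]      # first n words after the separator
print(words_before, sep, words_after)
# ['Hello,', 'world.'] Regular expressions ['are', 'not', 'always', 'the']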
The main problem with your pattern is that it begins with optional things, which causes a lot of attempts at each position in the string until a match is found. The number of attempts increases with the text size and with the value of n (the number of words before and after). This is why only a few lines of text suffice to crash your code.
One way is to begin the pattern with the target word and to use lookarounds to capture the text (or the words) before and after it:
keyword (?= words after ) (?<= words before - keyword)
Starting the pattern with the searched word (a literal string) makes it very fast, and the words around it are then quickly found from this position in the string. Unfortunately the re module has some limitations and doesn't allow variable-length lookbehinds (like many other regex flavors).
The new regex module supports variable length lookbehinds and other useful features like the ability to store the matches of a repeated capture group (handy to get the separated words in one shot).
import regex
text = '''In strange contrast to the hardly tolerable constraint and nameless
invisible domineerings of the captain's table, was the entire care-free
license and ease, the almost frantic democracy of those inferior fellows
the harpooneers. While their masters, the mates, seemed afraid of the
sound of the hinges of their own jaws, the harpooneers chewed their food
with such a relish that there was a report to it.'''
word = 'harpooneers'
n = 4
pattern = r'''
    \m (?<target> %s ) \M    # target word
    (?<=                     # content before
        (?<before> (?: (?<wdb>\w+) \W+ ){0,%d} )
        %s
    )
    (?=                      # content after
        (?<after> (?: \W+ (?<wda>\w+) ){0,%d} )
    )
''' % (word, n, word, n)

rgx = regex.compile(pattern, regex.VERBOSE | regex.IGNORECASE)

class Result(object):
    def __init__(self, m):
        self.target_span = m.span()
        self.excerpt_span = (m.starts('before')[0], m.ends('after')[0])
        self.excerpt = m.expandf('{before}{target}{after}')
        self.words_before = m.captures('wdb')[::-1]
        self.words_after = m.captures('wda')
results = [Result(m) for m in rgx.finditer(text)]
print(results[0].excerpt)
print(results[0].excerpt_span)
print(results[0].words_before)
print(results[0].words_after)
print(results[1].excerpt)
Making a regex (well, anything, for that matter) with "as many repetitions as you will ever possibly need" is an extremely bad idea. That's because you
do an excessive amount of needless work every time
cannot really know for sure how much you will ever possibly need, thus introducing an arbitrary limitation
The bottom line for the solutions below: the 1st solution is the most effective one for large data; the 2nd one is the closest to your current approach, but scales much worse.
strip your entities to exactly what you are interested in at each moment:
find the substring (e.g. with str.index; for whole words only, re.search with e.g. r'\b%s\b' % re.escape(word) is more suitable)
go N words back.
Since you mentioned a "text", your strings are likely to be very large, so you want to avoid copying potentially unlimited chunks of them.
E.g. re.finditer over a substring-reverse-iterator-in-place, per "slices to immutable strings by reference and not copy" and "Best way to loop over a python string backwards". This will only become better than slicing when the latter is expensive in terms of CPU and/or memory - test on some realistic examples to find out. Doesn't work: re works directly with the memory buffer, so it's impossible to reverse a string for it without copying the data.
There's no function to find a character from a class in Python, nor an "xsplit". So the fastest way appears to be (i for i, c in enumerate(reversed(buffer(text, 0, substring_index))) if c.isspace()) (timeit gives ~100ms on a P3 933MHz for a full pass through a 100k string).
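A rough sketch of this first approach (find the match, then slice a bounded amount of surrounding text and take N words from it); the function name and the max_chars parameter are purely illustrative:
import re

def words_around(text, word, n=4, max_chars=200):
    # max_chars bounds how much text is sliced on each side, so the whole
    # (potentially huge) string is never copied; the edge word of a window
    # may be truncated, which is acceptable for a sketch
    m = re.search(r'\b%s\b' % re.escape(word), text)
    if m is None:
        return None
    before = text[max(0, m.start() - max_chars):m.start()].split()[-n:]
    after = text[m.end():m.end() + max_chars].split()[:n]
    return before, m.group(0), after

text = "the harpooneers chewed their food with such a relish that there was a report"
print(words_around(text, "food"))
# (['the', 'harpooneers', 'chewed', 'their'], 'food', ['with', 'such', 'a', 'relish'])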
Alternatively:
Fix your regex to not be subject to catastrophic backtracking and eliminate code duplication (DRY principle).
The 2nd measure will eliminate the 2nd issue: we'll make the number of repetitions explicit (Python Zen, koan 2) and thus highly visible and manageable.
As for the 1st issue, if you really only need "up to known, same N" items in each case, you won't actually be doing "excessive work" by finding them together with your string.
The "fix" part here is \w*\W* -> \w+\W+. This eliminates major ambiguity (see the above link) from the fact that each x* can be a blank match.
Matching up to N words before the string effectively is harder:
with (\w+\W+){,10} or equivalent, the matcher will first find 10 words before discovering that your string doesn't follow them, then retry with 9, 8, etc. To ease things up on the matcher somewhat, a \b before the pattern will make it only perform all this work at the beginning of each word
lookbehind is not allowed here: as the linked article explains, the regex engine must know how many characters to step back before trying the contained regex. And even if it were allowed - a lookbehind is tried before every character - it would be even more of a CPU hog
As you can see, regexes aren't quite cut out to match things backwards
To eliminate code duplication, either
use the aforementioned {,10}. This will not save individual words but should be noticeably faster for large text (see the above on how the matching works here). We can always parse the retrieved chunk of text in more detail (with the regex in the next item) once we have it. Or
autogenerate the repetitive part
note that (\w+\W+)? repeated mindlessly is subject to the same ambiguity as above. To be unambiguous, the expression must be like this (w=(\w+\W+) here for brevity): (w(w...(ww?)?...)?)? (and all the groups need to be non-capturing).
I personally think that using text.partition() is the best option, as it eliminates the messy regular expressions, and automatically leaves output in an easy-to-access tuple.

Finding a substring's position in a larger string

I have a large string and a large number of smaller substrings and I am trying to check if each substring exists in the larger string and get the position of each of these substrings.
string="some large text here"
sub_strings=["some", "text"]
for each_sub_string in sub_strings:
    if each_sub_string in string:
        print each_sub_string, string.index(each_sub_string)
The problem is, since I have a large number of substrings (around a million), it takes about an hour of processing time. Is there any way to reduce this time, maybe by using regular expressions or some other way?
The best way to solve this is with a tree implementation. As Rishav mentioned, you're repeating a lot of work here. Ideally, this should be implemented as a tree-based FSM. Imagine the following example:
Large String: 'The cat sat on the mat, it was great'
Small Strings: ['cat', 'sat', 'ca']
Then imagine a tree where each level is an additional letter.
small_lookup = {
    'c': ['a', {'a': ['t']}],
    's': ['at'],
}
Apologies for the gross formatting, but I think it's helpful to map back to a python data structure directly. You can build a tree where the top level entries are the starting letters, and they map to the list of potential final substrings that could be completed. If you hit something that is a list element and has nothing more nested beneath you've hit a leaf and you know that you've hit the first instance of that substring.
Holding that tree in memory is a little hefty, but if you've only got a million strings this should be the most efficient implementation. You should also make sure that you trim the tree as you find the first instance of words.
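As a more concrete (and runnable) sketch of the same idea, here is a plain character trie with an end-of-word marker; the '$' key is just a convention chosen for this example, and no failure links are built, so this is not full Aho-Corasick:
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w                       # marker: a complete substring ends here
    return root

def first_matches(text, trie):
    found = {}                              # substring -> index of first occurrence
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if '$' in node and node['$'] not in found:
                found[node['$']] = start
    return found

print(first_matches('The cat sat on the mat, it was great', build_trie(['cat', 'sat', 'ca'])))
# {'ca': 4, 'cat': 4, 'sat': 8}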
For those of you with CS chops, or if you want to learn more about this approach, it's a simplified version of the Aho-Corasick string matching algorithm.
If you're interested in learning more about these approaches there are three main algorithms used in practice:
Aho-Corasick (Basis of fgrep) [Worst case: O(m+n)]
Commentz-Walter (Basis of vanilla GNU grep) [Worst case: O(mn)]
Rabin-Karp (Used for plagiarism detection) [Worst case: O(mn)]
There are domains in which each of these algorithms will outperform the others, but based on the fact that you've got a very high number of sub-strings that you're searching, and there's likely a lot of overlap between them, I would bet that Aho-Corasick is going to give you significantly better performance than the other two methods, as it avoids the O(mn) worst-case scenario.
There is also a great python library that implements the Aho-Corasick algorithm found here that should allow you to avoid writing the gross implementation details yourself.
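The answer above doesn't name the library it links to, but as one example, the pyahocorasick package (an assumption on my part, not necessarily the linked one) exposes roughly the following usage:
import ahocorasick

string = "some large text here"
sub_strings = ["some", "text"]

automaton = ahocorasick.Automaton()
for s in sub_strings:
    automaton.add_word(s, s)              # store the substring itself as the payload
automaton.make_automaton()

for end_index, found in automaton.iter(string):
    print(found, end_index - len(found) + 1)   # substring and its start position
# some 0
# text 11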
Depending on the distribution of the lengths of your substrings, you might be able to shave off a lot of time using preprocessing.
Say the set of the lengths of your substrings form the set {23, 33, 45} (meaning that you might have millions of substrings, but each one takes one of these three lengths).
Then, for each of these lengths, find the Rabin Window over your large string, and place the results into a dictionary for that length. That is, let's take 23. Go over the large string, and find the 23-window hashes. Say the hash for position 0 is 13. So you insert into the dictionary rabin23 that 13 is mapped to [0]. Then you see that for position 1, the hash is 13 as well. Then in rabin23, update that 13 is mapped to [0, 1]. Then in position 2, the hash is 4. So in rabin23, 4 is mapped to [2].
Now, given a substring, you can calculate its Rabin hash and immediately check the relevant dictionary for the indices of its occurrence (which you then need to compare).
BTW, in many cases, the lengths of your substrings will exhibit a Pareto behavior, where say 90% of the strings are in 10% of the lengths. If so, you can do this for these lengths only.
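A minimal sketch of this preprocessing, assuming a single substring length and a simple polynomial rolling hash (the constants and function names here are illustrative, not a tuned implementation):
from collections import defaultdict

BASE, MOD = 256, (1 << 61) - 1

def window_hashes(text, length):
    # map the rolling hash of every window of the given length to its start indices
    table = defaultdict(list)
    h, power = 0, pow(BASE, length - 1, MOD)
    for i, ch in enumerate(text):
        h = (h * BASE + ord(ch)) % MOD
        if i >= length - 1:
            table[h].append(i - length + 1)
            h = (h - ord(text[i - length + 1]) * power) % MOD   # drop the leading char
    return table

def find(sub, text, table):
    h = 0
    for ch in sub:
        h = (h * BASE + ord(ch)) % MOD
    # hashes can collide, so confirm each candidate with a direct comparison
    return [i for i in table.get(h, []) if text[i:i + len(sub)] == sub]

text = "some large text here"
table = window_hashes(text, 4)       # preprocess once for all length-4 substrings
print(find("text", text, table))     # [11]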
This approach is sub-optimal compared to the other answers, but might be good enough regardless, and it is simple to implement. The idea is to turn the algorithm around so that instead of testing each sub-string in turn against the larger string, you iterate over the large string and test against possible matching sub-strings at each position, using a dictionary to narrow down the number of sub-strings you need to test.
The output will differ from the original code in that it will be sorted in ascending order of index as opposed to by sub-string, but you can post-process the output to sort by sub-string if you want to.
Create a dictionary containing a list of sub-strings beginning each possible 1-3 characters. Then iterate over the string and at each character read the 1-3 characters after it and check for a match at that position for each sub-string in the dictionary that begins with those 1-3 characters:
string="some large text here"
sub_strings=["some", "text"]
# add each of the substrings to a dictionary based on its first 1-3 characters
dict = {}
for s in sub_strings:
    if s[0:3] in dict:
        dict[s[0:3]].append(s)
    else:
        dict[s[0:3]] = [s]

# iterate over the chars in string, testing words that match on the first 1-3 chars
for i in range(0, len(string)):
    for j in range(1, 4):
        char = string[i:i+j]
        if char in dict:
            for word in dict[char]:
                if string[i:i+len(word)] == word:
                    print word, i
If you don't need to match any sub-strings 1 or 2 characters long, then you can get rid of the for j loop and just assign char with char = string[i:i+3]
Using this second approach I timed the algorithm by reading in Tolstoy's War and Peace and splitting it into unique words, like this:
with open("warandpeace.txt", "r") as textfile:
    string = textfile.read().replace('\n', '')
sub_strings = list(set(string.split()))
Doing a complete search for every unique word in the text and outputting every instance of each took 124 seconds.

Efficient replacement of occurrences of a list of words

I need to censor all occurrences of a list of words with *'s. I have about 400 words in the list and it's going to get hit with a lot of traffic, so I want to make it very efficient. What's an efficient algorithm/data structure to do this in? Preferably something already in Python.
Examples:
"piss off" => "**** off"
"hello" => "hello"
"go to hell" => "go to ****"
A case-insensitive trie-backed set implementation might fit the bill. For each word, you'll only process a minimum of characters. For example, you would only need to process the first letter of the word 'zoo' to know the word is not present in your list (assuming you have no 'z' expletives).
This is something that is not packaged with python, however. You may observe better performance from a simple dictionary solution since it's implemented in C.
(1) Let P be the set of phrases to censor.
(2) Precompute H = {h(w) | p in P, w is a word in p}, where h is a sensible hash function.
(3) For each word v that is input, test whether h(v) in H.
(4) If h(v) not in H, emit v.
(5) If h(v) in H, back off to any naive method that will check whether v and the words following form a phrase in P.
Step (5) is not a problem since we assume that P is (very) small compared to the quantity of input. Step (3) is an O(1) operation.
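A rough sketch of these steps; here Python's built-in set membership stands in for the hash test h(v) in H, since a set already hashes its elements (the function name is just illustrative):
phrases = {"piss off", "go to hell"}
H = {w for p in phrases for w in p.split()}           # words that occur in any phrase
max_len = max(len(p.split()) for p in phrases)

def flag_phrases(text):
    # yield (start_word_index, phrase) for every censored phrase found
    words = text.split()
    for i, w in enumerate(words):
        if w not in H:                                # steps (3)/(4): cheap test, emit as-is
            continue
        for size in range(1, max_len + 1):            # step (5): naive back-off check
            candidate = " ".join(words[i:i + size])
            if candidate in phrases:
                yield i, candidate

print(list(flag_phrases("well go to hell then")))     # [(1, 'go to hell')]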
Like cheeken has mentioned, a trie may be the thing you need, and actually, you should use the Aho-Corasick string matching algorithm - something more than a trie.
For every string you need to process, say S, the time complexity is approximately O(len(S)), i.e. linear.
And you need to build the automaton initially; its time complexity is O(sigma(len(words))), and its space complexity is about (usually less than) O(52*sigma(len(words))), where 52 is the size of the alphabet (I take it as ['a'..'z', 'A'..'Z']). And you need to do this just once (or every time the system launches).
You might want to time a regexp based solution against others. I have used similar regexp based substitution of one to three thousand words on a text to change phrases into links before, but I am not serving those pages to many people.
I take the set of words (it could be phrases), and form a regular expression out of them that will match their occurrence as a complete word in the text because of the '\b'.
If you have a dictionary mapping words to their sanitized version then you could use that. I just swap every odd letter with '*' for convenience here.
The sanitizer function just returns the sanitized version of any matched swear word and is used in the regular expression substitution call on the text to return a sanitized version.
import re

swearwords = set("Holy Cow".split())
swear = re.compile(r'\b(%s)\b' % '|'.join(sorted(swearwords, key=lambda w: (-len(w), w))))
sanitized = {sw: ''.join(ch if not i % 2 else '*' for i, ch in enumerate(sw))
             for sw in swearwords}

def sanitizer(matchobj):
    return sanitized.get(matchobj.group(1), '????')

txt = 'twat prick Holy Cow ... hell hello shitter bonk'
swear.sub(sanitizer, txt)
# Out[1]: 'twat prick H*l* C*w ... hell hello shitter bonk'
You might want to use re.subn and the count argument to limit the number of substitutions done and just reject the whole text if it has too many profanities:
maxswear = 2
newtxt, scount = swear.subn(sanitizer, txt, count=maxswear)
if scount >= maxswear: newtxt = 'Ouch my ears hurt. Please tone it down'
print(newtxt)
# 'Ouch my ears hurt. Please tone it down'
If performance is what you want, I would suggest:
Get a sample of the input.
Calculate the average number of censored words per line.
Define a maximum number of words to filter per line (3, for example).
Work out which censored words get the most hits in the sample.
Write a function that, given the censored words, generates a Python file with if statements that check each word, putting the 'most hits' words first; since you just want to match whole words, it will be fairly simple.
Once you hit the maximum number per line, exit the function.
I know this is not nice, and I'm only suggesting this approach because of the high-traffic scenario; looping over every word in your list will have a huge negative impact on performance.
Hope that helps, or at least gives you some out-of-the-box ideas on how to tackle the problem.
