Python: Faster regex replace

I have a large set of large files and a set of "phrases" that need to be replaced in each file.
The "business logic" imposes several restrictions:
Matching must be case-insensitive
The whitespace, tabs and new lines in the regex cannot be ignored
My solution (see below) is a bit on the slow side. How could it be optimised, both in terms of IO and string replacement?
data = open("INPUT__FILE").read()
o = open("OUTPUT_FILE","w")
for phrase in phrases: # these are the set of words I am talking about
b1, b2 = str(phrase).strip().split(" ")
regex = re.compile(r"%s\ *\t*\n*%s"%(b1,b2), re.IGNORECASE)
data = regex.sub(b1+"_"+b2,data)
o.write(data)
UPDATE: 4x speed-up by converting all text to lower case and dropping re.IGNORECASE
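For reference, a minimal sketch of that update (an assumption here is that it is acceptable for the output to be written entirely in lower case):

import re

data = open("INPUT_FILE").read().lower()  # normalise the text once
o = open("OUTPUT_FILE", "w")
for phrase in phrases:
    b1, b2 = str(phrase).strip().lower().split(" ")
    # no re.IGNORECASE needed now that both the text and the phrases are lower case
    regex = re.compile(r"%s\ *\t*\n*%s" % (b1, b2))
    data = regex.sub(b1 + "_" + b2, data)
o.write(data)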

You could avoid recompiling your regexp for every file:
precompiled = []
for phrase in phrases:
    b1, b2 = str(phrase).strip().split(" ")
    precompiled.append((b1 + "_" + b2, re.compile(r"%s\ *\t*\n*%s" % (b1, b2), re.IGNORECASE)))

for (input, output) in ...:
    with open(output, "w") as o:
        with open(input) as i:
            data = i.read()
            for (pattern, regex) in precompiled:
                data = regex.sub(pattern, data)
            o.write(data)
It's the same for one file, but if you're repeating over many files then you are re-using the compiled regexes.
Disclaimer: untested, may contain typos.
[update] Also, you can simplify the regexp a little by replacing the various space characters with \s*. I suspect you have a bug there, in that you would want to match " \t " (space, tab, space) and the current pattern doesn't.
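For instance, the compile line in the loop above could become the following (a sketch; note that \s also matches \r and \f in addition to spaces, tabs and newlines):

regex = re.compile(r"%s\s*%s" % (b1, b2), re.IGNORECASE)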

You can do this in one pass by using a B-Tree data structure to store your phrases. This is the fastest way of doing it, with a time complexity of O(N log h), where N is the number of characters in your input file and h is the length of your longest word. However, Python does not offer an out-of-the-box implementation of a B-Tree.
You can also use a hashtable (dictionary) and a replacement function to speed things up. This is easy to implement if the words you wish to replace are alphanumeric and single words only.
import re

replace_data = {}
# Populate replace data here
for phrase in phrases:
    key, value = phrase.strip().split(' ')
    replace_data[key.lower()] = value

def replace_func(matchObj):
    # Function which replaces words
    key = matchObj.group(0).lower()
    if key in replace_data:
        return replace_data[key]
    else:
        return key

# Original code flow
data = open("INPUT_FILE").read()
output = re.sub("[a-zA-Z0-9]+", replace_func, data)
o = open('OUTPUT_FILE', 'w')
o.write(output)
o.close()

Related

Search for sentences containing characters using Python regular expressions

I am searching for sentences containing characters using Python regular expressions.
But I can't find the sentence I want.
Please help me
regex.py
import re

opfile = open('file.txt', 'r')
contents = opfile.read()
opfile.close()
index = re.findall(r'\[start file\](?:.|\n)*\[end file\]', contents)
item = re.search(r'age.*', str(index))
file.txt (example)
[start file]
name: steve
age: 23
[end file]
result
<re.Match object; span=(94, 738), match='age: >
The age is not printed
There are several issues here:
str(index) returns the string representation of a list of strings, which makes it difficult to further process the result
(?:.|\n)* is a very resource-consuming construct; use a plain . with the re.S or re.DOTALL option instead
If you plan to find a single match, use re.search, not re.findall.
Here is a possible solution:
match = re.search(r'\[start file].*\[end file]', contents, re.S)
if match:
    match2 = re.search(r"\bage:\s*(\d+)", match.group())
    if match2:
        print(match2.group(1))
Output:
23
If you want to get age in the output, use match2.group().
If you want to match the age only once between the start and end file markers, you could use a single pattern with a capture group, matching in between all lines that do not start with age: or with the start or end marker.
^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]
Regex demo
Example
import re

regex = r"^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]"
s = ("[start file]\n" "name: steve \n" "age: 23\n" "[end file]")
m = re.search(regex, s)
if m:
    print(m.group(1))
Output
23
The example input looks like a list of key, value pairs enclosed between some start/end markers. For this use-case, it might be more efficient and readable to write the parsing stage as:
re.search to locate the document
splitlines() to isolate individual records
split() to extract the key and value of each record
Then, in a second step, access the extracted records.
Doing this separates the parsing and exploitation steps and makes the code easier to maintain.
Additionally, a good practice is to wrap access to a file in a "context manager" (the with statement) to guarantee all resources are correctly cleaned up on error.
Here is a full standalone example:
import re

# 1: Load the raw data from disk, in a context manager
with open('/tmp/file.txt') as f:
    contents = f.read()

# 2: Parse the raw data
fields = {}
if match := re.search(r'\[start file\]\n(.*)\[end file\]', contents, re.S):
    for line in match.group(1).splitlines():
        k, v = line.split(':', 1)
        fields[k.strip()] = v.strip()

# 3: Actual data exploitation
print(fields['age'])

Replace words that are common in two text file with third word

I have two text files: 1.txt is a dictionary of words, and the other, 2.txt, contains phrases. I would like to find the words common to 1.txt and 2.txt and replace those common words with a third word, "explain".
I have tried many ways to crack this but failed. Can anyone help me?
Code I have used:
wordsreplace = open("1.txt", 'r')
with open("2.txt") as main:
    words = main.read().split()

replaced = []
for y in words:
    if y in wordreplace:
        replaced.append(wordreplace[y])
    else:
        replaced.append(y)
text = ' '.join(replaced)

replaced = []
for y in words:
    replacement = wordreplace.get(y, y)
    replaced.append(replacement)
text = ' '.join(replaced)

text = ' '.join(wordreplace.get(y, y) for y in words)

new_main = open("2.txt", 'w')
new_main.write(text)
new_main.close()
This code writes 2.txt back out, but the words are not replaced.
I don't want to point out problems in your code, because this task can basically be done in a few lines. Here's a self-contained example (no files, only text input).
First create a set called words that you can look up in when needed (passing read().split() to set() will do when reading from a file: words = set(main.read().split())).
Now use a word-boundary regex and a replacement function.
The replacement function issues the word if it is not found in the set; otherwise, it issues "explain":
words = {"computer","random"}
text = "computer sometimes yields random results"
import re
new_text = re.sub(r"\b(\w+)\b",lambda m : "explain" if m.group(1) in words else m.group(1),text)
print(new_text)
So the replacement is handled by the regex engine, calling my lambda when there's a match, so I can decide whether to replace the word, or issue it back again.
result:
explain sometimes yields explain results
Of course this doesn't handle plurals (computers, ...) which must be in the dictionary as well.
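To tie this back to the original files, a rough sketch (assuming the 1.txt and 2.txt names from the question, and that rewriting 2.txt in place is acceptable) might look like:

import re

# Build the lookup set from the dictionary file (1.txt)
with open("1.txt") as wordfile:
    words = set(wordfile.read().split())

# Read the phrases (2.txt), replace the common words, and write the result back
with open("2.txt") as main:
    text = main.read()

new_text = re.sub(r"\b(\w+)\b",
                  lambda m: "explain" if m.group(1) in words else m.group(1),
                  text)

with open("2.txt", "w") as out:
    out.write(new_text)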

Counting phrases EXCEPT when they are preceded by another phrase in Python

Using pandas in Python 2.7 I am attempting to count the number of times a phrase (e.g., "very good") appears in pieces of text stored in a CSV file. I have multiple phrases and multiple pieces of text. I have succeeded in this first part using the following code:
for row in df_book.itertuples():
    index, text = row
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', text).lower().strip()
    for row in df_phrase.itertuples():
        index, phrase = row
        count = sum(1 for x in re.finditer(r"\b%s\b" % (re.escape(phrase)), normed))
        file.write("%s," % (count))
However, I don't want to count the phrase if it's preceded by a different phrase (e.g., "it is not"). Therefore I used a negative lookbehind assertion:
for row in df_phrase.itertuples():
    index, phrase = row
    for row in df_negations.itertuples():
        index, negation = row
        count = sum(1 for x in re.finditer(r"(?<!%s )\b%s\b" % (negation, re.escape(phrase)), normed))
The problem with this approach is that it records a value for each and every negation as pulled from the df_negations dataframe. So, if finditer doesn't find "it was not 'very good'", then it will record a 0. And so on for every single possible negation.
What I really want is just an overall count for the number of times a phrase was used without a preceding phrase. In other words, I want to count every time "very good" occurs, but only when it's not preceded by a negation ("it was not") on my list of negations.
Also, I'm more than happy to hear suggestions on making the process run quicker. I have 100+ phrases, 100+ negations, and 1+ million pieces of text.
I don't really do pandas, but this cheesy non-Pandas version gives some results with the data you sent me.
The primary complication is that the Python re module does not allow variable-width negative look-behind assertions. So this example looks for matching phrases, saving the starting location and text of each phrase, and then, if it found any, looks for negations in the same source string, saving the ending locations of the negations. To make sure that negation ending locations are the same as phrase starting locations, we capture the whitespace after each negation along with the negation itself.
Repeatedly calling functions in the re module is fairly costly. If you have a lot of text, as you say, you might want to batch it up, e.g. by using 'non-matching-string'.join() on some of your source strings (a rough sketch of this follows the code below).
import re
from collections import defaultdict
import csv

def read_csv(fname):
    with open(fname, 'r') as csvfile:
        result = list(csv.reader(csvfile))
    return result

df_negations = read_csv('negations.csv')[1:]
df_phrases = read_csv('phrases.csv')[1:]
df_book = read_csv('test.csv')[1:]

negations = (str(row[0]) for row in df_negations)
phrases = (str(re.escape(row[1])) for row in df_phrases)

# Add a word to the negation pattern so it overlaps the
# next group.
negation_pattern = r"\b((?:%s)\W+)" % '|'.join(negations)
phrase_pattern = r"\b(%s)\b" % '|'.join(phrases)

counts = defaultdict(int)
for row in df_book:
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', row[0]).lower().strip()
    # Find the location and text of any matching good groups
    phrases = [(x.start(), x.group()) for x in
               re.finditer(phrase_pattern, normed)]
    if not phrases:
        continue
    # If we had matches, find the (start, end) locations of matching bad
    # groups
    negated = set(x.end() for x in re.finditer(negation_pattern, normed))
    for start, text in phrases:
        if start not in negated:
            counts[text] += 1
        else:
            print("%r negated and ignored" % text)

for pattern, count in sorted(counts.items()):
    print(count, pattern)
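As a rough illustration of the batching idea mentioned above, here is a sketch (the separator string, the sample texts and the patterns are made up for the example, and only overall counts are produced, not per-document ones):

import re
from collections import defaultdict

# Hypothetical pre-cleaned texts and patterns standing in for the ones built above
normed_texts = ["it was not very good", "the food was very good"]
phrase_pattern = r"\b(very good)\b"
negation_pattern = r"\b((?:was not|is not)\W+)"

SEPARATOR = "\n@@@\n"   # chosen so it can never match a phrase or negation
batch = SEPARATOR.join(normed_texts)   # one big string, scanned once

counts = defaultdict(int)
phrase_locations = [(m.start(), m.group()) for m in re.finditer(phrase_pattern, batch)]
negated_ends = set(m.end() for m in re.finditer(negation_pattern, batch))

for start, text in phrase_locations:
    if start not in negated_ends:
        counts[text] += 1

print(dict(counts))   # {'very good': 1}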

How can I replace substrings without replacing all at the same time? Python

I have written a really good program that uses text files as word banks for generating sentences from sentence skeletons. An example:
The skeleton
"The noun is good at verbing nouns"
can be made into a sentence by searching a word bank of nouns and verbs to replace "noun" and "verb" in the skeleton. I would like to get a result like
"The dog is good at fetching sticks"
Unfortunately, the handy replace() method was designed with speed, not custom functions, in mind. I made methods that accomplish the task of selecting random words from the right banks, but doing something like skeleton = skeleton.replace('noun', getNoun(file.txt)) replaces ALL instances of 'noun' with the result of a single call to getNoun(), instead of calling it for each replacement. So the sentences look like
"The dog is good at fetching dogs"
How might I work around this feature of replace() and make my method get called for each replacement? My minimum length code is below.
import random

def getRandomLine(rsv):
    # parameter must be a return-separated value text file whose first line contains the number of lines in the file.
    f = open(rsv, 'r')        # file handle in read mode
    n = int(f.readline())     # number of lines in file
    n = random.randint(1, n)  # line number chosen to use
    s = ""                    # string to hold data
    for x in range(1, n):
        s = f.readline()
    s = s.replace("\n", "")
    return s

def makeSentence(rsv):
    # parameter must be a return-separated value text file whose first line contains the number of lines in the file.
    pattern = getRandomLine(rsv)  # get a random pattern from file
    # replace word tags with random words from matching files
    pattern = pattern.replace('noun', getRandomLine('noun.txt'))
    pattern = pattern.replace('verb', getRandomLine('verb.txt'))
    return str(pattern)

def main():
    result = makeSentence('pattern.txt')
    print(result)

main()
The re module's re.sub function does the job str.replace does, but with far more abilities. In particular, it offers the ability to pass a function for the replacement, rather than a string. The function is called once for each match with a match object as an argument and must return the string that will replace the match:
import re
pattern = re.sub('noun', lambda match: getRandomLine('noun.txt'), pattern)
The benefit here is added flexibility. The downside is that if you don't know regexes, the fact that the replacement interprets 'noun' as a regex may cause surprises. For example,
>>> re.sub('Aw, man...', 'Match found.', 'Aw, manatee.')
'Match found.e.'
If you don't know regexes, you may want to use re.escape to create a regex that will match the raw text you're searching for even if the text contains regex metacharacters:
>>> re.sub(re.escape('Aw, man...'), 'Match found.', 'Aw, manatee.')
'Aw, manatee.'
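Applied to the question's skeleton, a minimal sketch might look like the following (it assumes the getRandomLine helper and the noun.txt/verb.txt word banks from the question):

import re

def fillSkeleton(skeleton):
    # Call getRandomLine (as defined in the question) once per occurrence of
    # 'noun' or 'verb', so every tag gets its own randomly chosen word.
    return re.sub(r'noun|verb',
                  lambda match: getRandomLine(match.group(0) + '.txt'),
                  skeleton)

print(fillSkeleton("The noun is good at verbing nouns"))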
I don't know if you are asking to edit your code or to write new code, so I wrote new code:
import random

verbs = open('verb.txt').read().split()
nouns = open('noun.txt').read().split()

def makeSentence(sent):
    sent = sent.split()
    for k in range(0, len(sent)):
        if sent[k] == 'noun':
            sent[k] = random.choice(nouns)
        elif sent[k] == 'nouns':
            sent[k] = random.choice(nouns) + 's'
        elif sent[k] == 'verbing':
            sent[k] = random.choice(verbs)
    return ' '.join(sent)

var = raw_input('Enter: ')
print makeSentence(var)
This runs as:
$ python make.py
Enter: the noun is good at verbing nouns
the mouse is good at eating cats

Breaking a string into individual words in Python

I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations on approaching this; example code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement it in Python.
We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may be none, like in ["example", "cart", ...])
Then we build the english dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an english word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above english dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk",
"exampledeals.org", "examplesummeroffers.com"]
# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
# Extract the part in front of the first ".", and make it lower case
name = domain.partition(".")[0].lower()
found = set()
for split in substrings_in_set(name, words):
found |= set(split)
for word in found:
count[word] = count.get(word, 0) + 1
if not found:
no_match.append(name)
print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be a valid word both as either a whole or split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
domains = open(domainfile)
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)

from collections import Counter
c = Counter(found)  # this is what you want
print c
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)

with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute force method which only tries to split the domains into 2 English words. If the domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you are clever. Fortunately I guess you'll only need 3 or 4 splits max (see the sketch after the output below).
output:
deals: 1
example: 2
pensions: 1
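As noted above, the brute-force split can be extended to more than two parts; a rough recursive sketch (assuming the same words collection, and close in spirit to the generator in the first answer) could look like this:

def guess_multi_split(word, words, max_parts=4):
    # Return the first way (if any) to break `word` into up to `max_parts`
    # dictionary words; an empty list means no split was found.
    if word in words:
        return [word]
    if max_parts == 1:
        return []
    for n in xrange(1, len(word)):
        if word[:n] in words:
            rest = guess_multi_split(word[n:], words, max_parts - 1)
            if rest:
                return [word[:n]] + rest
    return []

print guess_multi_split("examplesummeroffers", set(["example", "summer", "offers"]))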
