My input file is a large txt file with concatenated texts I got from an open text library. I am now trying to extract only the content of the book itself and filter out other stuff such as disclaimers etc. So I have around 100 documents in my large text file (around 50 mb).
I then have identified the start and end markers of the contents themselves, and decided to use a Python regex to find me everything between the start and end marker. To sum it up, the regex should look for the start marker, then match everything after it, and stop looking once the end marker is reached, then repeat these steps until the end of the file is reached.
The following code works flawlessly when I feed a small, 100kb sized file into it:
import codecs
import re
outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()
When I use this regex operation on my larger file however, it will not do anything, the program just hangs. I tested it overnight to see if it was just slow and even after around 8 hours the program was still stuck.
I am very sure that the source of the problem is the (.*?) part of the regex, in combination with re.DOTALL.
When I use a similar regex on smaller distances, the script will run fine and fast.
My question now is: why is this just freezing up everything? I know the texts between the delimiters are not small, but a 50mb file shouldn't be too much to handle, right?
Am I maybe missing a more efficient solution?
Thanks in advance.
You are correct in thinking that using the sequence .*, which appears more than once, is causing problems. The issue is that the regex engine tries many possible ways of dividing the text between the .* parts, which leads to what is known as catastrophic backtracking.
The usual solution is to replace the . with a much more specific character class, usually one that excludes whatever is supposed to terminate the first .*. Something like:
`[^\n]*(.*)`
so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular expression may not be the best approach, and either use a context-free parser (such as pyparsing) or first break the input into smaller, easier-to-digest chunks (for example, with corpus.split('\n')).
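As a rough illustration of the "smaller chunks" route, here is a minimal sketch that walks the input line by line instead of running one regex over the whole 50 MB string. The function name extract_bodies and the two marker constants are made up for this example, and it assumes each marker sits on a line of its own, so whole marker lines are dropped (the original regex would instead keep any text that precedes the end marker on the same line).

```python
import io

START = "START OF THE PROJECT GUTENBERG EBOOK"
END = "END OF THE PROJECT GUTENBERG EBOOK"


def extract_bodies(lines):
    """Yield only the lines that lie between a START line and an END line."""
    inside = False
    for line in lines:
        if not inside:
            if START in line:
                inside = True   # the marker line itself is skipped
        elif END in line:
            inside = False      # back to skipping disclaimers etc.
        else:
            yield line


# Tiny stand-in for the real file (hypothetical content).
demo = io.StringIO(
    "license text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK ONE ***\n"
    "first book, line 1\n"
    "first book, line 2\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK ONE ***\n"
    "more license text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK TWO ***\n"
    "second book\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK TWO ***\n"
)
bodies = list(extract_bodies(demo))
print(bodies)
```

With a real file you would pass the open file object instead of the StringIO and write the result with outfile.writelines(...). Each line is looked at once, so the running time should grow linearly with the file size.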
Another workaround to this issue is adding a sane limit to the number of matched characters.
So instead of something like this:
[abc]*.*[def]*
You can limit it to 1-100 instances per character group.
[abc]{1,100}.{1,100}[def]{1,100}
This won't work for every situation, but in some cases it's an acceptable quick fix.
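To make the effect of such a limit concrete, here is a small sketch with made-up BEGIN/END markers and an arbitrary limit of 20 characters. It only shows that a bounded gap refuses to match when the markers are too far apart, which is also what caps the amount of work the engine can do per starting point.

```python
import re

# Lazy gap of at most 20 characters between the (hypothetical) markers.
bounded = re.compile(r"BEGIN(.{0,20}?)END", re.DOTALL)

text = "BEGIN short END ... BEGIN " + "x" * 100 + " END"
found = bounded.findall(text)
print(found)  # only the first, short gap fits inside the limit
```

The obvious trade-off is that anything longer than the limit is silently skipped, so the limit has to be chosen with the real data in mind.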
How can I grab 'dlc3.csv' & 'spongebob.csv' from the string below via the absolute quickest method, which I assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this by using split() and for loops, but it's slowing my program down way too much.
I would post an example of my current code, but it's got a load of other stuff in it, so it would only cause you to ask more questions.
In a nutshell, I'm opening a large 6,000-line .csv file and then using nested for loops to iterate through each line, using .split() to find specific parts in each line. I have many files where I need to scan specific things on each line. At the moment I've only implemented a couple of features in my Qt program, and it's already taking up to 5 seconds to load some things and up to 10 seconds for others, all of which is due to the nested loops. I've looked at where to use range, where not to, and where to use enumerate. I also use time.time() and logging.info() to show the speed of each code change. After asking around, I've been told that a regex is the best option for me, as it would remove the need for many of my for loops. The problem is I have no clue how to use regex. I of course plan on learning it, but if someone could help me out with this it would be much appreciated.
Thanks.
Edit: just to point out that when scanning each line the filename is unknown; ".csv" is the only part that is known. So I basically need the regex to grab every filename before .csv, but of course without grabbing the crap before the filename.
I'm currently looking for .csv using .split('/') and .split('|'), then checking whether .csv is in a list element to grab the 'unknown' filename. Some lines will have only 1 filename whereas others will have 2+, so I need the regex to account for this too.
You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash (unlike the dot, a negated class like this also matches newlines)
* - Zero or more of them
\. - A literal dot. (This is necessary because the dot is a special character in regex.)
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO
I have a long text like the one below. I need to split based on some words say ("In","On","These")
Below is sample data:
On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains.
Can this problem be solved with code? I have 1000 rows in a csv file.
As per my comment, I think a good option would be to use regular expression with the pattern:
re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', YourStringVariable)
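Here is a short sketch of how that pattern behaves on a shortened version of the sample text. It assumes Python 3.7 or later, because the pattern only matches empty positions and older versions of re.split do not split on empty matches. The names SPLIT_BEFORE, text and parts are just for this example.

```python
import re

# Split at a word boundary that is followed by one of the keywords,
# but not at the very start of the string.
SPLIT_BEFORE = re.compile(r"(?<!^)\b(?=(?:On|In|These)\b)")

text = (
    "On the other hand, we denounce men who are beguiled. "
    "These cases are perfectly simple. "
    "In a free hour, every pleasure is to be welcomed."
)
parts = [part.strip() for part in SPLIT_BEFORE.split(text)]
for part in parts:
    print(part)
```

Because the split happens in a lookahead, the keywords stay at the start of each piece instead of being removed. For the 1000 rows in the csv file, the same compiled pattern could be applied to each row in turn.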
Yes, this can be done in Python. You can load the text into a variable and use the built-in split method of strings. For example:
with open(filename, 'r') as file:
    lines = file.read()
    lines = lines.split('These')
    # lines is now a list of strings, split wherever the string 'These' was encountered
To find whole words that are not part of larger words, I like using the regular expression:
[^\w]word[^\w]
Sample python code, assuming the text is in a variable named text:
import re
exp = re.compile(r'[^\w]in[^\w]', flags=re.IGNORECASE)
all_occurrences = list(exp.finditer(text))
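One thing to keep in mind with this approach is that [^\w] has to consume a real character on each side, so a word at the very start or end of the text is missed and the surrounding characters become part of the match. The sketch below compares it with the \b word-boundary assertion, which is a different technique from the one in this answer; the sample sentence is made up.

```python
import re

text = "In the end, nothing in the cabin was within reach."

loose = re.compile(r"[^\w]in[^\w]", flags=re.IGNORECASE)
strict = re.compile(r"\bin\b", flags=re.IGNORECASE)

loose_hits = [m.group() for m in loose.finditer(text)]
strict_hits = [m.group() for m in strict.finditer(text)]
print(loose_hits)   # includes the surrounding spaces, misses the leading "In"
print(strict_hits)  # just the word itself, both occurrences
```

Neither version matches the "in" inside "nothing", "cabin" or "within".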
This method works just fine in Python:
with open(file) as f:
    for line in f:
        for field in line.rstrip().split('\t'):
            continue
However, it also means I read each line twice: first I loop over each character of the file searching for newline characters, and then I loop over each character of the line searching for tabs. Is there a built-in method for splitting lines that avoids looping over the same set of characters twice? Apologies if this is a stupid question.
If you're worried about this level of efficiency then you probably shouldn't be programming in Python. Most of what is happening in that loop happens in C (if you're using the CPython implementation). You're not going to find a more efficient way to process your data using a pure python approach or without creating a very complicated looping structure.
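For completeness, the standard library's csv module can do the line and field splitting for tab-separated data in one call. This is only a sketch of an alternative, not a claim that it is faster or that it avoids the second scan; the io.StringIO object stands in for an open file, and QUOTE_NONE is an assumption that the data contains no quoted fields.

```python
import csv
import io

# Stand-in for open(file, newline="") on a real tab-separated file.
sample = io.StringIO("a\tb\tc\n1\t2\t3\n")

# QUOTE_NONE treats quote characters as ordinary data.
reader = csv.reader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows)
```

Whether this beats the plain split('\t') loop depends on the data, so it is worth measuring on a real file.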
If I wanted to avoid looping over the lines and handle the whole file in one go I would go with a regular expression. Also, regular expressions should be really fast.
import re
regexp = re.compile(r"\n+")
with open(file) as f:
    lines = re.split(regexp, f.read())
Now \n+ matches one or more newlines and splits the file there. The result is a Python list with all the lines. If you want to split on other characters, for example any whitespace (spaces, tabs and newlines), replace \n+ with \s+. Depending on what you want to do with the lines, this might not be faster; timeit is your friend.
More on Python's re module:
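A minimal sketch of such a timeit comparison might look like the following. The synthetic text, the repetition count of 20 and the variable names are all arbitrary choices for this example; real measurements should use the actual file contents.

```python
import re
import timeit

# Synthetic data without blank lines; replace with f.read() for a real test.
text = "a\tb\tc\n" * 10000
newline_re = re.compile(r"\n+")

t_regex = timeit.timeit(lambda: newline_re.split(text), number=20)
t_split = timeit.timeit(lambda: text.split("\n"), number=20)
print("re.split:  %.4f s" % t_regex)
print("str.split: %.4f s" % t_split)
```

The two calls only return the same list when there are no blank lines, since \n+ collapses consecutive newlines and split("\n") does not.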
https://docs.python.org/2/library/re.html
Is there a way to get the re.findall, or better yet, re.finditer functionality applied to a stream (i.e. a filehandle open for reading)?
Note that I am not assuming that the pattern to be matched is fully contained within one line of input (i.e. multi-line patterns are permitted). Nor am I assuming a maximum match length.
It is true that, at this level of generality, it is possible to specify a regex that would require that the regex engine have access to the entire string (e.g. r'(?sm).*'), and, of course, this means having to read the entire file into memory, but I am not concerned with this worst-case scenario at the moment. It is, after all, perfectly possible to write multi-line-matching regular expressions that would not require reading the entire file into memory.
Is it possible to access the underlying automaton (or whatever is used internally) from a compiled regex, to feed it a stream of characters?
Thanks!
Edit: Added clarifications regarding multi-line patterns and match lengths, in response to Tim Pietzcker's and rplnt's answers.
This is possible if you know that a regex match will never span a newline.
Then you can simply do
for line in file:
    result = re.finditer(regex, line)
    # do something...
If matches can extend over multiple lines, you need to read the entire file into memory. Otherwise, how would you know if your match was done already, or if some content further up ahead would make a match impossible, or if a match is only unsuccessful because the file hasn't been read far enough?
Edit:
Theoretically it is possible to do this. The regex engine would have to check whether at any point during the match attempt it reaches the end of the currently read portion of the stream, and if it does, read on ahead (possibly until EOF). But the Python engine doesn't do this.
Edit 2:
I've taken a look at the Python stdlib's re.py and its related modules. The actual generation of a regex object, including its .match() method and others is done in a C extension. So you can't access and monkeypatch it to also handle streams, unless you edit the C sources directly and build your own Python version.
It would be possible to implement this for a regexp with a known maximum match length: either no +/* at all, or only ones where you know the maximum number of repetitions. If you know this, you can read the file in chunks and match on those, yielding the results. You would also run the regexp on overlapping chunks, which would cover the case where the regexp would have matched but was cut off by the end of a chunk.
some pseudo(python)code:
overlap_tail = ''
matched = {}
for chunk in file.stream(chunk_size):
    # calculate chunk_start
    for result in finditer(match, overlap_tail+chunk):
        if not chunk_start + result.start() in matched:
            yield result
            matched[chunk_start + result.start()] = result
    # delete old results from dict
    overlap_tail = chunk[-max_re_len:]
Just an idea but I hope you get what I'm trying to achieve. You'd need to consider that file(stream) could end and some other cases. But I think it can be done (if the length of the regular expression is limited(known)).
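A runnable variation on that idea could look roughly like this. It is a sketch under stated assumptions, not a general solution: finditer_stream is a made-up name, max_match_len must really be an upper bound on the match length, and the pattern must not use anchors, word boundaries or lookarounds, because context before the retained tail is thrown away. Instead of a dictionary of seen matches, it only reports a match once enough data has been read that the match can no longer change.

```python
import io
import re


def finditer_stream(pattern, stream, max_match_len, chunk_size=4096):
    """Yield (start, end, text) for each match found in a text stream."""
    regex = re.compile(pattern)
    buf = ""    # part of the stream that is not settled yet
    offset = 0  # absolute stream position of buf[0]
    while True:
        chunk = stream.read(chunk_size)
        at_eof = chunk == ""
        buf += chunk
        # A match starting before `safe` cannot change when more data arrives.
        safe = len(buf) if at_eof else max(len(buf) - max_match_len, 0)
        pos = 0
        while True:
            m = regex.search(buf, pos)
            if m is None or (not at_eof and m.start() >= safe):
                break
            yield offset + m.start(), offset + m.end(), m.group()
            # Continue after the match (one further after an empty match).
            pos = m.end() if m.end() > m.start() else m.end() + 1
        if at_eof:
            return
        # Drop everything that is settled and keep the unsettled tail.
        keep_from = max(pos, safe)
        offset += keep_from
        buf = buf[keep_from:]


text = "ab12cd345ef6" * 50
streamed = list(
    finditer_stream(r"\d+", io.StringIO(text), max_match_len=3, chunk_size=5)
)
in_memory = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\d+", text)]
print(streamed == in_memory)
```

The tiny chunk_size of 5 is only there to force many chunk boundaries in the demo; for a real file a much larger value would make sense.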
I have a somewhat complex regular expression which I'm trying to match against a long string (65,535 characters). I'm looking for multiple occurrences of the re in the string, and so am using finditer. It works, but for some reason it hangs after identifying the first few occurrences. Does anyone know why this might be? Here's the code snippet:
pattern = "(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*b|d*)*c)"
matches = re.finditer(pattern, string)
for match in matches:
    print "(%d-%d): %s" % (match.start(), match.end(), match.group())
It prints out the first four occurrences, but then it hangs. When I kill it using Ctrl-C, it tells me it was killed in the iterator:
Traceback (most recent call last):
  File "code.py", line 133, in <module>
    main(sys.argv[1:])
  File "code.py", line 106, in main
    for match in matches:
KeyboardInterrupt
If I try it with a simpler re, it works fine.
I'm running this on python 2.5.4 running on Cygwin on Windows XP.
I managed to get it to hang with a very much shorter string. With this 50 character string, it never returned after about 5 minutes:
ddddddeddbedddbddddddddddddddddddddddddddddddddddd
With this 39 character string it took about 15 seconds to return (and display no matches):
ddddddeddbedddbdddddddddddddddddddddddd
And with this string it returns instantly:
ddddddeddbedddbdddddddddddddd
Definitely exponential behaviour. You've got so many d* parts in your regexp that it'll be backtracking like crazy when it gets to the long string of d's but fails to match something earlier. You need to rethink the regexp so it has fewer possible paths to try.
In particular I think:
([ef]d*b|d*)* and ([ef]|([gh]d*(ad*[gh]d)*b))d*b
Might need rethinking, as they'll force a retry of the alternate match. Plus they also overlap in terms of what they match. They'd both match edb for example, but if one fails and tries to backtrack the other part will probably have the same behaviour.
So in short try not to use the | if you can and try to make sure the patterns don't overlap where possible.
Could it be that your expression triggers exponential behavior in the Python RE engine?
This article deals with the problem. If you have the time, you might want to try running your expression in an RE engine developed using those ideas.
Thanks to all the responses, which were very helpful. In the end, surprisingly, it was easy to speed it up. Here's the original regex:
(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*b|d*)*c)
I noticed that the |d* near the end was not really what I needed, so I modified it as follows:
(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*bd*)*c)
Now it works almost instantaneously on the 65,536 character string. I guess now I just have to make sure that the regex is really matching the strings I need it to match...
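A quick sanity check along those lines might look like this (written for Python 3, unlike the original snippet). It runs the revised pattern on the 50-character string from the earlier answer and on a few made-up strings. The original pattern is only tried on a very short string, since it is the one that blows up on long runs of d.

```python
import re

original = re.compile(r"(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*b|d*)*c)")
revised = re.compile(r"(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*bd*)*c)")

# The 50-character string that made the original pattern hang.
stuck = "ddddddeddbedddbddddddddddddddddddddddddddddddddddd"
no_matches = [m.group() for m in revised.finditer(stuck)]

# Made-up strings that the revised pattern should still match.
hits = [m.group() for m in revised.finditer("ddedbc--edbeddbddc")]

# A made-up string where the two patterns disagree: d's directly after
# the first b are only accepted by the original pattern.
only_original = (original.search("edbdc") is not None,
                 revised.search("edbdc") is not None)

print(no_matches, hits, only_original)
```

The last check shows that the revised pattern is not equivalent to the original, so it is worth confirming against real data that the dropped case is truly not needed.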
I think you are experiencing what is known as "catastrophic backtracking".
Your regex has many optional/alternative parts, all of which still try to match, so previous sub-expressions give back characters to the following expression on local failure. This leads to a back-and-forth behavior within the regex and exponentially rising execution times.
Python's re module supports atomic grouping and possessive quantifiers only from version 3.11 onward (on older versions the third-party regex module offers them), so if they are available you could examine your regex to identify the parts that should match or fail as a whole. Unnecessary backtracking can be brought under control with that.
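As a small illustration of the possessive syntax, here is a sketch using a deliberately simple pattern, (?:d++)+c, which is not the pattern from the question. It assumes Python 3.11 or later for the ++ quantifier; on older interpreters the guarded branch is skipped, because re.compile would reject the syntax. Without the possessive quantifier, (?:d+)+c is a classic exponential case on a long run of d's with no c.

```python
import re
import sys

subject = "d" * 40  # long run of d's with no closing "c"

if sys.version_info >= (3, 11):
    # d++ never gives back the d's it has consumed, so a failed match
    # is detected without trying every way of splitting up the run.
    possessive = re.compile(r"(?:d++)+c")
    failed = possessive.search(subject)
    succeeded = possessive.search("xxdddc")
    print(failed, succeeded.group() if succeeded else None)
else:
    failed = None
    succeeded = None
    print("possessive quantifiers need Python 3.11+")
```

The same idea can be written with an atomic group, (?>d+), which is also available from 3.11.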
catastrophic backtracking!
Regular Expressions can be very expensive. Certain (unintended and intended) strings may cause RegExes to exhibit exponential behavior. We've taken several hotfixes for this. RegExes are so handy, but devs really need to understand how they work; we've gotten bitten by them.
example and debugger:
http://www.codinghorror.com/blog/archives/000488.html
You already gave yourself the answer: the regular expression is too complex and ambiguous.
You should try to find a less complex and more distinct expression that is easier to process. Or tell us what you want to accomplish and we could try to help you to find one.
Edit If you just want to allow ds in every position as you said in a comment to John Montgomery’s answer, you should remove them before testing the pattern:
import re
string = "ddddddeddbedddbddddddddddddddddddddddddddddddddddd"
pattern = "(([ef]|([gh](a[gh])*b))b([ef]b)*c)"
matches = re.finditer(pattern, re.sub("d+", "", string))
for match in matches:
    print "(%d-%d): %s" % (match.start(), match.end(), match.group())