Pushing data into regex? - python

I am working on a small project which I call pydlp. It is basically a set of regex signatures that extract data from a file object, plus a function that checks whether the extracted data is in fact interesting.
This is how I perform matching. It is far from optimal, as I have to read the file over and over again:
for signature in signatures:
    match = signature.validate(signature.regex.match(fobj.read()))
    if match: matches.append(match)
    fobj.seek(0)
Is there a way to perform multiple regex matches on the same file object while only reading its content once? The file can be large, so I cannot load it into memory.
Edit:
I want to clarify what I mean by "pushing data into regex". I recognize that a regex is similar to a finite state machine. Instead of passing all the data to the regex engine at once, is it possible to push parts of it at a time?
while True:
    data = fobj.read(1024)
    if data == "": break
    for signature in signatures:
        match = signature.regex.push_and_match(data)
        if match: matches.append(match)
Edit 2:
Removed link, as I removed the project from github.

The standard way to do this sort of text processing with files too large to read into memory is to iterate over the file line by line:
regexes = [ .... ]
with open('large.file.txt') as fh:
    for line in fh:
        for rgx in regexes:
            m = rgx.search(line)
            if m:
                # Do stuff.
But that approach assumes your regexes can operate successfully on single lines of text in isolation. If they cannot, perhaps there are other units that you can pass to the regexes (eg, paragraphs delimited by blank lines). In other words, you might need to do a bit of pre-parsing so that you can grab meaningful sections of text before sending them to your main regexes.
with open('large.file.txt') as fh:
    section = []
    for line in fh:
        if line.strip():
            section.append(line)
        else:
            # We've hit the end of a section, so we
            # should check it against our regexes.
            process_section(''.join(section), regexes)
            section = []
    # Don't forget the last one.
    if section:
        process_section(''.join(section), regexes)
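The answer leaves process_section undefined; a minimal sketch of what it might look like, simply reusing the inner loop from the line-by-line version above:
def process_section(text, regexes):
    # Run every regex against one section (e.g. a blank-line-delimited paragraph).
    for rgx in regexes:
        m = rgx.search(text)
        if m:
            # Do stuff.
            pass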
Regarding your literal question: "Is there a way to perform multiple regex matches on the same file object while only reading the file object content once". No and yes. No in the sense that Python regexes operate on strings, not file objects. But you can perform multiple regex searches at the same time on one string, simply by using alternation. Here's a minimal example:
patterns = 'aa bb cc'.split()
big_regex = re.compile('|'.join(patterns))  # Match this or that or that.
m = big_regex.search(some_text)
But that doesn't really solve your problem if the file is too big for memory.

Maybe consider using re.findall() if you don't need match objects but only the matched strings? If the file is too big, you can slice it into parts, as you suggest, but with some overlap between the parts so that no matches are missed (if you know the nature of your regexes, you may be able to work out how big the overlap needs to be).
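A rough sketch of that chunked approach, reusing fobj and signatures from the question; CHUNK_SIZE and MAX_MATCH_LEN are assumed values, and a match that falls entirely inside the overlap region may be reported twice:
CHUNK_SIZE = 64 * 1024       # assumed read size
MAX_MATCH_LEN = 256          # assumed upper bound on the length of any match

matches = []
tail = ''
while True:
    data = fobj.read(CHUNK_SIZE)
    if not data:
        break
    window = tail + data
    for signature in signatures:
        for m in signature.regex.finditer(window):
            result = signature.validate(m)
            if result:
                matches.append(result)
    # Keep the last MAX_MATCH_LEN characters so matches spanning a chunk
    # boundary are not missed.
    tail = window[-MAX_MATCH_LEN:]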

Related

Finding sub-strings in LARGE string

# read in csv file in form ("case, num, val \n case1, 1, baz\n...")
# convert to form FOO = "casenumval..." roughly 6 million characters
for substr in List:  # 60,000 substrings
    if substr not in FOO:
        # do stuff
    else:
        # do other stuff
So my issue is that there are far too many substrings to check against this massive string. I have tried reading the file in line by line and checking the substrings against each line, but this still crashes the program. Are there any techniques for checking a lot of substrings against a very large string efficiently?
FOR CONTEXT:
I am performing data checks; suspect data is saved to a csv file to be reviewed/changed. This reviewed/changed file is then compared to the original file. Data which has not changed has been verified as good and must be saved to a new "exceptionFile". Data that has been changed and now passes is disregarded. And data which has been changed, is checked, and is still suspect is sent off for review again.
The first thing you should do is convert your list of 60,000 strings to search for into one big regular expression:
import re
searcher = re.compile("|".join(re.escape(s) for s in List))
Now you can search for them all at once:
for m in searcher.finditer(FOO):
    print(m.group(0))  # prints the substring that matched
If all you care about is knowing which ones were found,
print(set(m.group(0) for m in searcher.finditer(FOO)))
This is still doing substantially more work than the absolute minimum, but it should be much more efficient than what you were doing before.
Also, if you know that your input is a CSV file and you also know that none of the strings-to-search-for contain a newline, you can operate line by line, which may or may not be faster than what you were doing depending on conditions, but will certainly use less memory:
with open("foo.csv") as FOO:
for line in FOO:
for m in searcher.finditer(line):
# do something with the substring that matched

Need help finding the correct regex pattern for my string pattern

I'm terrible with RegEx patterns, and I'm writing a simple Python program that requires splitting lines of a file into a 'content' part and a 'tags' part, and then further splitting the tags part into individual tags. Here's a simple example of what one line of my file might look like:
The Beatles <music,rock,60s,70s>
I've opened my file and begun reading lines like this:
def Load(self, filename):
    file = open(filename, 'r')
    for line in file:
        # Ignore comments and empty lines...
        if not line.startswith('#') and line.strip():
            # ...
Forgive my likely terrible Python, it's my first few days with the language. Anyway, next I was thinking it would be useful to use a regex to break my string into sections - with a variable to store the 'content' (for example, "The Beatles"), and a list/set to store each of the tags. As such, I need a regex (or two?) that can:
Split the raw part from the <> part.
And split the tags part into a list based on the commas.
Finally, I want to make sure that the content part retains its capitalization and inner spacing. But I want to make sure the tags are all lower-case and without white space.
I'm wondering if any of the regex experts out there can help me find the correct pattern(s) to achieve my goals here?
This is a solution that gets around the problem without using a regex, by relying on multiple splits instead.
# This separates the string into the content and the remainder.
# (Assumes the trailing newline has already been stripped from line.)
content, tagStr = line.split('<')
# This splits tagStr into individual tags. [:-1] is used to remove the trailing '>'.
tags = tagStr[:-1].split(',')
print(content)
print(tags)
The problem with this is that it leaves a trailing whitespace after the content.
You can remove this with:
content = content[:-1]
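Since the question explicitly asks for a regex, here is one possible pattern as a sketch (the pattern and variable names are illustrative, not part of the answer above):
import re

line = "The Beatles <music,rock,60s,70s>"
m = re.match(r'\s*(.*?)\s*<([^>]*)>\s*$', line)
if m:
    content = m.group(1)  # capitalization and inner spacing preserved
    tags = [t.strip().lower() for t in m.group(2).split(',')]
    print(content)  # The Beatles
    print(tags)     # ['music', 'rock', '60s', '70s']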

How can I efficiently search for many strings at once in many files?

Hi, I'll post my portion of code and then explain my goal:
for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        lines = f.readlines()
        for entry in rs:
            for line in lines:
                if entry in line:
                    print("found %s in %s" % (entry, eachcsv))
So in "matches" i got a list of csv files (the path to them). I open every csv file and i load them into memory with readlines(). "rs" is a list of unique ids. For every element of the list "rs" i need to search every line of the csv file and print each time i find the id on the file (later on i will test if the line contains another fixed word also).
The code above works for my purpose but i don't know why it takes more than 10 minutes to process a 400k row file, i need to do this task for thousand of files so it's impossible for me to finish the task. It seems to me that the slow part is the testing process, but i'm not sure.
Please note that i use python because i'm more confortable with it, if there is ANY other solution to my problem using other tools i'm ok with it.
EDIT:
I will try to post some examples:
"rs" list:
rs12334435
rs3244567
rs897686
....
files
# header data not needed
# data
# data
# data
# data
# data [...]
#COLUMN1 COLUMN2 COLUMN3 ...
data rs7854.rs165463 dataSS=1(random_data)
data rs465465data datadata
data rs798436 dataSS=1
data datars45648 dataSS=1
The final goal is to count how many times each rs appears in each file, and if column 3 contains SS=1, to flag that in the output.
Something like
found rs12345 SS yes file 3 folder /root/foobar/file
found rs74565 SS no file 3 folder /root/foobar/file
Much of the problem is because you have so many nested loops. You can probably make your program faster by eliminating loops:
One of the loops is over each of the lines in the file. But if all you want to do is determine whether any of the matches exists in the file, you can search the whole file body in one operation. To be sure, this searches a longer string, but it does so in one operation in native code instead of doing it in Python.
You loop over all the match strings. But you know those before you start and they are the same for each file. So this is a good case where doing more up-front work will pay off in time saved in the rest of the program. Stand back, I'm going to use a regular expression.
Here is a version of the code which combines these two ideas:
import re
import random
import sys
import time

# get_patterns makes up some test data.
def get_patterns():
    rng = random.Random(1)  # fixed seed, for reproducibility
    n = 300
    # Generate up to n unique integers between 60k and 80k.
    return list(set([str(rng.randint(60000, 80000)) for _ in range(n)]))

def original(rs, matches):
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            lines = f.readlines()
            for entry in rs:
                for line in lines:
                    if entry in line:
                        print("found %s in %s" % (entry, eachcsv))

def mine(rs, matches):
    my_rx = re.compile(build_regex(rs))
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            body = f.read()
            found = my_rx.findall(body)
            for match in found:
                print("found %s in %s" % (match, eachcsv))

def build_regex(literal_patterns):
    return "|".join([re.escape(pat) for pat in literal_patterns])

def print_elapsed_time(label, callable, args):
    t1 = time.time()
    callable(*args)
    t2 = time.time()
    elapsed_ms = (t2 - t1) * 1000
    print("%8s: %9.1f milliseconds" % (label, elapsed_ms))

def main(args):
    rs = get_patterns()
    filenames = args[1:]
    for name, func in (('original', original), ('mine', mine)):
        print_elapsed_time(name, func, [rs, filenames])
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))
Your original code is in there as original and my replacement is mine.
For 300 patterns, my implementation runs in 400ms on my computer, which is roughly a 30x speedup. For more match strings, the improvement should be greater. Doubling the number of patterns roughly doubles the runtime of your implementation, but the regex-based one only takes about 3% longer (though this is partly because my test patterns all have similar prefixes, which may not be true for the real data).
Edit: updated code to print one message for each match. My code is now somewhat slower, but it's still an improvement, and the relative speedup should grow with more strings to match:
~/source/stackoverflow/36923237$ python search.py example.csv
found green fox in example.csv
original: 9218.0 milliseconds
found green fox in example.csv
mine: 600.4 milliseconds
Edit: explanation of the regex technique, as requested.
Suppose you want to search your file for the strings foobar and umspquux. One way to do this is to search the file first for foobar and then for umspquux. This is the approach you started with.
Another approach is to search for both strings at once. Imagine that you check the first character of the file. If it's 'f' or 'u', you might be looking at a match, and should check the second character to see if it is, respectively, 'o' or 'm'. And so on. If you get to the end of the file, you will have found all the matches there are in the file to find.
A convenient way to tell a computer to look for multiple strings at once is to use a regular expression. Normal strings are regular expressions. The regular expression 'foobar' matches the sub-string 'foobar' inside 'the foobar is all fuzzy'. However, you can do more complex things. You can combine two regular expressions, each of which matches something, into a combined regular expression which will match either of those somethings. This is done with the alternation symbol, '|'. So the regular expression 'foobar|umspquux' will match either 'foobar' or 'umspquux'. You can also match a real '|' by escaping the significance of the '|' with a backslash '\'.
This is what build_regex is all about. It converts the list ['foobar', 'umspquux'] into the string 'foobar|umspquux'. Try it out - put the function definition into your own file and call it with some try-out arguments to see how it behaves.
That's a good way, by the way, to figure out how any piece of code works - run part of it and print the intermediate result. This is harder to do safely with programs that have side-effects of course, but this program doesn't.
The call to re.escape in build_regex simply ensures that any special regular expression operators (such as '|') are escaped (giving in this case '\|') so that they will match themselves.
The last part of the regular expression method here is to use the findall method of the compiled regular expression. This simply returns all the matches for our regular expression in the input string (i.e. the body of the file).
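To make the "try it out" suggestion concrete, here is a small self-contained demonstration (the sample strings are made up):
import re

def build_regex(literal_patterns):
    return "|".join([re.escape(pat) for pat in literal_patterns])

searcher = re.compile(build_regex(['foobar', 'umspquux']))
print(searcher.pattern)  # foobar|umspquux
print(searcher.findall('the foobar is all fuzzy, and so is the umspquux'))
# ['foobar', 'umspquux']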
You can read up on Python regular expressions in the Python documentation on regular expressions. That documentation is basically reference material, so you might find the gentler introduction to Python regular expressions at the Google Developers site a better starting point. Jeffrey Friedl's Mastering Regular Expressions is a pretty comprehensive work on regular expressions, though it doesn't happen to cover the Python dialect.
Your most memory-intensive operation is reading every line; that loop should sit outside the loop over the IDs.
You also don't want to read the entire document in at once with readlines(); iterate over the file object line by line instead, which is much less memory intensive.
for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    print("found %s in %s" % (entry, eachcsv))
More reading here https://docs.python.org/2/tutorial/inputoutput.html
There are other optimizations you can make too that fall outside the scope of this question, such as using threading or multiprocessing to read many csv files at the same time. https://docs.python.org/2/library/threading.html
import threading

def findRSinCSVFile(csvFile, rs):
    with open(csvFile, 'r') as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    print("found %s in %s" % (entry, csvFile))

threads = []
for csvFile in csvFiles:
    threads.append(threading.Thread(target=findRSinCSVFile, args=(csvFile, rs)))
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
This will allow you to parse all the csv files at the same time.
(Note: I have not tested any of this code; it only serves as an example.)
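Since multiprocessing was mentioned as an alternative above, here is a hedged sketch of the same idea with a process pool (equally untested; findRSinCSVFile is the function defined above, and csvFiles and rs are assumed to be the list of file paths and the list of IDs):
import multiprocessing
from functools import partial

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    # Search each csv file in a separate worker process.
    pool.map(partial(findRSinCSVFile, rs=rs), csvFiles)
    pool.close()
    pool.join()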

Speed up a series of regex replacement in python

My Python script reads each line in a file and does many regex replacements on each line.
If a regex succeeds, it skips to the next line.
Is there any way to speed up this kind of script?
Is it worth calling subn() instead, checking whether a replacement was made, and then skipping the remaining regexes?
If I compile the regexes, is it possible to keep all the compiled regexes in memory?
for file in files:
    for line in file:
        re.sub()  # <--- ~100 re.sub calls
PS: the replacement varies for each regex
You should probably do three things:
Reduce the number of regexes. Depending on differences in the substitution part, you might be able to combine them all into a single one. Using careful alternation, you can determine the sequence in which parts of the regex will be matched.
If possible (depending on file size), read the file into memory completely.
Compile your regex (only for readability; it won't matter in terms of speed as long as the number of regexes stays below 100).
This gives you something like:
regex = re.compile(r"My big honking regex")
for datafile in files:
    content = datafile.read()
    result = regex.sub("Replacement", content)
As @Tim Pietzcker said, you could reduce the number of regexes by making them alternatives. You can determine which alternative matched by using the 'lastindex' attribute of the match object.
Here's an example of what you could do:
>>> import re
>>> replacements = {1: "<UPPERCASE LETTERS>", 2: "<lowercase letters>", 3: "<Digits>"}
>>> def replace(m):
... return replacements[m.lastindex]
...
>>> re.sub(r"([A-Z]+)|([a-z]+)|([0-9]+)", replace, "ABC def 789")
'<UPPERCASE LETTERS> <lowercase letters> <Digits>'

findall/finditer on a stream?

Is there a way to get the re.findall, or better yet, re.finditer functionality applied to a stream (i.e. a filehandle open for reading)?
Note that I am not assuming that the pattern to be matched is fully contained within one line of input (i.e. multi-line patterns are permitted). Nor am I assuming a maximum match length.
It is true that, at this level of generality, it is possible to specify a regex that would require that the regex engine have access to the entire string (e.g. r'(?sm).*'), and, of course, this means having to read the entire file into memory, but I am not concerned with this worst-case scenario at the moment. It is, after all, perfectly possible to write multi-line-matching regular expressions that would not require reading the entire file into memory.
Is it possible to access the underlying automaton (or whatever is used internally) from a compiled regex, to feed it a stream of characters?
Thanks!
Edit: Added clarifications regarding multi-line patterns and match lengths, in response to Tim Pietzcker's and rplnt's answers.
This is possible if you know that a regex match will never span a newline.
Then you can simply do
for line in file:
    result = re.finditer(regex, line)
    # do something...
If matches can extend over multiple lines, you need to read the entire file into memory. Otherwise, how would you know if your match was done already, or if some content further up ahead would make a match impossible, or if a match is only unsuccessful because the file hasn't been read far enough?
Edit:
Theoretically it is possible to do this. The regex engine would have to check whether at any point during the match attempt it reaches the end of the currently read portion of the stream, and if it does, read on ahead (possibly until EOF). But the Python engine doesn't do this.
Edit 2:
I've taken a look at the Python stdlib's re.py and its related modules. The actual generation of a regex object, including its .match() method and others is done in a C extension. So you can't access and monkeypatch it to also handle streams, unless you edit the C sources directly and build your own Python version.
It would be possible to implement this for regexps with a known maximum match length: either no +/* quantifiers, or ones where you know the maximum number of repetitions. If you know this, you can read the file in chunks and match on those, yielding the results. You would also run the regexp on overlapping chunks, which would cover the case where a match would otherwise be cut off by the end of a chunk.
some pseudo(python)code:
overlap_tail = ''
matched = {}
for chunk in file.stream(chunk_size):
    # calculate chunk_start
    for result in finditer(match, overlap_tail + chunk):
        if chunk_start + result.start() not in matched:
            yield result
            matched[chunk_start + result.start()] = result
    # delete old results from dict
    overlap_tail = chunk[-max_re_len:]
Just an idea, but I hope you get what I'm trying to achieve. You'd need to consider that the file (stream) could end, and some other edge cases. But I think it can be done (if the maximum match length of the regular expression is limited and known).
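A runnable sketch of that idea under the stated assumption (no match is longer than some known max_match_len); the function name, the default sizes, and the deduplication-by-offset detail are mine, not the answer's:
import re

def finditer_stream(regex, fobj, chunk_size=64 * 1024, max_match_len=256):
    # Yield matches from a stream, assuming no match is longer than max_match_len.
    overlap_tail = ''
    offset = 0      # absolute position of overlap_tail in the stream
    seen = set()    # absolute start positions already yielded
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        window = overlap_tail + chunk
        for m in regex.finditer(window):
            abs_start = offset + m.start()
            if abs_start not in seen:
                seen.add(abs_start)
                yield m
        # Keep enough of the tail so that matches spanning a chunk boundary
        # are seen in full in the next window.
        overlap_tail = window[-max_match_len:]
        offset += len(window) - len(overlap_tail)
One caveat: a greedy match that is cut off at a window boundary may be yielded in its truncated form; for fixed-length patterns this does not arise.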

Categories

Resources