p = re.compile('>.*\n')
p.sub('', text)
I want to delete all lines starting with a '>'. I have a really huge file (3GB) that I process in chunks of size 250MB, so the variable "text" is a string of size 250MB. (I tried different sizes, but the performance was always the same for the complete file).
Now, can I speed up this regex somehow? I tried the multi-line matching, but it was a lot slower. Or are there even better ways?
(I already tried splitting the string and then filtering out the lines like this, but it was also slower. I also tried a lambda instead of def del_line. This might not be working code, it's just from memory:)
def del_line(x): return x[0] != '>'

def func():
    ...
    text = file.readlines(chunksize)
    text = filter(del_line, text)
    ...
EDIT:
As suggested in the comments, I also tried walking line by line:
text = []
for line in file:
    if line[0] != '>':
        text.append(line)
text = ''.join(text)
That's also slower; it needs ~12 sec, while my regex needs ~7 sec. (Yeah, that's fast, but it must also run on slower machines.)
EDIT: Of course, I also tried str.startswith('>'), it was slower...
If you have the chance, running grep as a subprocess is probably the most pragmatic choice.
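For example, something along these lines (just a sketch; the file names are made up and it assumes grep is available on the PATH):
import subprocess

# grep -v '^>' prints every line that does NOT start with '>'
with open('filtered.txt', 'w') as out:
    subprocess.run(['grep', '-v', '^>', 'huge_input.txt'], stdout=out, check=True)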
If for whatever reason you can't rely on grep, you could try implementing some of the "tricks" that make grep fast. From the author himself, you can read about them here: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
At the end of the article, the author summarizes the main points. The one that stands out to me most is:
Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
The idea would be to load the entire file in memory and iterate over it at the byte level instead of the line level. Only when you find a match do you look for the line boundaries and delete the line.
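As a rough illustration of that idea (my own sketch, not grep's actual algorithm): look for '>' only where a line can start and copy the spans in between, without ever splitting the chunk into lines.
def drop_gt_lines(chunk):
    # Copy everything except the lines whose first character is '>'.
    out = []
    pos = 0
    while True:
        if chunk.startswith('>', pos):
            start = pos
        else:
            # '\n>' marks the next line that begins with '>'
            start = chunk.find('\n>', pos)
            if start == -1:
                out.append(chunk[pos:])
                break
            start += 1                      # keep the preceding '\n'
            out.append(chunk[pos:start])
        end = chunk.find('\n', start)       # skip the matched line
        pos = len(chunk) if end == -1 else end + 1
    return ''.join(out)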
You say you have to run this on other computers. If it's within your reach and you are not doing it already, consider running it on PyPy instead of CPython (the default interpreter). This may (or may not) improve the runtime by a significant factor, depending on the nature of the program.
Also, as some comments already mentioned, benchmark with the actual grep to get a baseline of how fast you can go, reasonably speaking. Get it on Cygwin if you are on Windows, it's easy enough.
Is this not faster?
def cleanup(chunk):
    return '\n'.join(st for st in chunk.split('\n') if not (st and st[0] == '>'))
EDIT: yeah, that is not faster. That's twice as slow.
Maybe consider using subprocess and a tool like grep, as suggested by Ryan P. You could even take advantage of multiprocessing.
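A rough sketch of how the multiprocessing part could look (not from the original post; file names and chunk size are made up, and naive fixed-size chunks can split a line in two, so real code would have to re-align chunk boundaries on '\n'):
from multiprocessing import Pool

def cleanup(chunk):
    return '\n'.join(st for st in chunk.split('\n') if not (st and st[0] == '>'))

def read_chunks(path, chunksize=250 * 1024 * 1024):
    with open(path) as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            yield chunk

if __name__ == '__main__':
    with Pool() as pool:
        with open('output.txt', 'w') as out:
            for cleaned in pool.imap(cleanup, read_chunks('huge_input.txt')):
                out.write(cleaned)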
Related
I am trying to find a fast way of searching for strings in a file. First of all, I don't have only one string to find: I have a list of 1900 strings to find in a file that is 150 MB. So basically I am opening the file and looping 1900 times to find all occurrences of those strings. Here are some of the attributes of my search.
The file to be searched is 150 MB and it is a text file.
I need to find all occurrences of 1900 strings in the file, which means I am looping over the entire file 1900 times.
It's not a simple search; I have to use a regex to find the strings.
In a few cases, I need the line above and the line below the one where I found the search string, so I need to use file.readlines(), not file.read().
In a few cases I also have to replace the found string with a new string.
First I am trying to find the best way to search the file. My code is taking too long, and I am not sure this is the best way to do it:
#searchstrings is list of 1900 strings
file = open("mytextfile.txt", "r")
for line in file:
    for i in range(len(searchstrings)):
        if searchstrings[i] in line:
            print(line)
file.close()
This code does the job, but it's extremely slow. It also does not give me the option to pick the line above or below the one where the search string is found.
Another piece of code I am using to replace strings is below. This code is also extremely slow. Here I am using a regex.
file = open("mytextfile.txt", "r")
file_data = file.read()
#searchstrings is list of 1900 strings
#replacestrings is list of 1900 strings that needs to be replaced
for i in range(len(searchstrings)):
src_str = re.compile(searchstrings[i], re.IGNORECASE)
file_data = src_str.sub(replacestrings[i], file_data)
file.close()
I know the performance of the code depends on the computing power as well; however, I just want to know the best way to write this code so that it runs at optimum speed on the given hardware. I would also like to know how to time the program execution.
I like Unix commands, they are fun, fast and efficient.
import re, sys

# Works like a minimal grep: write every stdin line that matches the pattern given in sys.argv[1].
map(sys.stdout.write, (string_x for string_x in sys.stdin if re.search(sys.argv[1], string_x)))
A few observations.
For idiomatic Python, you usually want
for string in searchstrings:
    ...
instead of
for i in range(len(searchstrings)):
    searchstrings[i]
and with open(filename) as f: ... instead of open()/close(). The with statement will close the file automatically.
When you want to replace any of several strings with a regex, you can do
re.sub('|'.join(YOUR_STRINGS), replacement, text)
because | is the regex symbol for "or", instead of looping over them all individually.
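For example (a sketch; re.escape is added so the strings are matched literally, and the sample strings are made up):
import re

searchstrings = ['error 42', 'timeout', 'disk full']   # stand-in for the 1900 strings

# One combined pattern instead of 1900 separate passes over the file.
combined = re.compile('|'.join(re.escape(s) for s in searchstrings), re.IGNORECASE)

text = 'Job aborted: Disk Full after a TIMEOUT.'
print(combined.sub('<FOUND>', text))   # Job aborted: <FOUND> after a <FOUND>.
A single combined pattern covers the search case; for per-string replacements you would pass a function to sub() that looks the matched text up in a dict.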
For performance, I might try switching from CPython to PyPy. PyPy is another implementation of the same language but often much faster.
On the other hand, if that's really all your program is supposed to do, you might want to use a dedicated tool for the job, like Ag or RipGrep, which have already been optimized for it. Possibly through the subprocess.run() function if you're working from Python.
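As for timing the runs, a minimal sketch with the standard library (the code being timed is a placeholder):
import time

start = time.perf_counter()
# ... run the search/replace here ...
elapsed = time.perf_counter() - start
print('took %.2f seconds' % elapsed)
For micro-benchmarking small snippets in isolation, the timeit module is the usual tool.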
I have two for loops and I want to make this better, for example with a list comprehension or a lambda or something else.
How can I achieve the same result?
For example:
import glob
import os

filename = ['a.txt', 'b.txt', 'c.txt']
for files in filename:
    for f in glob.glob(os.path.join(source_path, files)):
        print f
        # ... some processing ...
Your code is perfectly fine as it is. You can only make it less legible by introducing unnecessary complex constructs.
You can compress the two for loops into a single generator expression*, with a new for loop to extract the file names from it.
for f in (f_ for files in filename
              for f_ in glob.glob(os.path.join(source_path, files))):
    print f
    # ...
As the other answer said, this is not better, this is worse and you shouldn't use it (I'm not sure that's enough emphasis!). It is far harder to understand what is going on, and probably has little performance benefit (in fact, the extra layers of indirection mean it is likely to be slower).
(* basically equivalent to a list comprehension, but better in situations like this.)
I would do it as below. The reason is that you can then separate the search-pattern construction, the searching, and the file processing; it is easier to extend when they are decoupled.
If your system is slightly exotic (e.g. a distributed network drive), the line that combines glob and os.path.join is a nasty one. Although, as others have mentioned, two loops are perfectly OK.
filename = ['a.txt', 'b.txt', 'c.txt']
searchPatterns = [os.path.join(source_path, files) for files in filename]
searchResults = [glob.glob(pattern) for pattern in searchPatterns]
fileListFlat = sum(searchResults, [])
for file in fileListFlat:
    print file
A long expression is hard to read when you have to scan to the right and then wrap back around. It is even worse when many local variables, lambdas and comprehensions are crammed into a few lines, separated only by parens and commas. Use them only if your code does not get longer and more complex.
For your case, I prefer to extract find as a trade-off. But just as the top answer said, your code is fine as it is.
from itertools import chain

find = lambda p: glob.glob(os.path.join(source_path, p))
for file in chain.from_iterable(map(find, filename)):
    """
    =) I like one-level indentation here.
    =( I don't know which file pattern is used currently,
       unless I use a longer expression...
    """
I am presently writing a Python script to process some 10,000 or so input documents. Based on the script's progress output I notice that the first 400+ documents get processed really fast and then the script slows down although the input documents all are approximately the same size.
I am assuming this may have to do with the fact that most of the document processing is done with regexes that I do not save as regex objects once they have been compiled. Instead, I recompile the regexes whenever I need them.
Since my script has about 10 different functions all of which use about 10 - 20 different regex patterns I am wondering what would be a more efficient way in Python to avoid re-compiling the regex patterns over and over again (in Perl I could simply include a modifier //o).
My assumption is that if I store the regex objects in the individual functions using
pattern = re.compile()
the resulting regex object will not be retained until the next invocation of the function for the next iteration (each function is called but once per document).
Creating a global list of pre-compiled regexes seems an unattractive option since I would need to store the list of regexes in a different location in my code than where they are actually used.
Any advice here on how to handle this neatly and efficiently?
The re module caches compiled regex patterns. The cache is emptied when it reaches a size of re._MAXCACHE, which by default is 100. Since you have 10 functions with 10-20 regexes each (i.e. 100-200 regexes), the observed slow-down is consistent with the cache being cleared.
If you are okay with changing private variables, a quick and dirty fix to your program might be to set re._MAXCACHE to a higher value:
import re
re._MAXCACHE = 1000
Last time I looked, re.compile maintained a rather small cache, and when it filled up, it just emptied it. DIY with no limit:
import re

class MyRECache(object):
    def __init__(self):
        self.cache = {}

    def compile(self, regex_string):
        if regex_string not in self.cache:
            self.cache[regex_string] = re.compile(regex_string)
        return self.cache[regex_string]
Compiled regular expressions are automatically cached by re.compile, re.search and re.match, but the maximum cache size is 100 in Python 2.7, so you're overflowing the cache.
Creating a global list of pre-compiled regexes seems an unattractive option since I would need to store the list of regexes in a different location in my code than where they are actually used.
You can define them near the place where they are used: just before the functions that use them. If you reuse the same RE in a different place, then it would have been a good idea to define it globally anyway to avoid having to modify it in multiple places.
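For example, something like this (the pattern and the function name are made up):
import re

# Compiled once at import time, right next to the only function that uses it.
_DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_dates(document):
    return _DATE_RE.findall(document)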
In the spirit of "simple is better" I'd use a little helper function like this:
import re

def rc(pattern, flags=0):
    key = pattern, flags
    if key not in rc.cache:
        rc.cache[key] = re.compile(pattern, flags)
    return rc.cache[key]

rc.cache = {}
Usage:
rc('[a-z]').sub...
rc('[a-z]').findall   # <- no compilation here
I also recommend trying the third-party regex module. Among many other advantages over the stock re, its MAXCACHE is 500 by default, and the cache won't get dropped completely on overflow.
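Usage is essentially a drop-in replacement for re (assuming the package is installed, e.g. with pip install regex):
import regex   # third-party module, not the stdlib re

pattern = regex.compile(r'[a-z]+')
print(pattern.findall('Hello World'))   # ['ello', 'orld']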
Am I correct in thinking that Python doesn't have a direct equivalent for Perl's __END__?
print "Perl...\n";
__END__
End of code. I can put anything I want here.
One thought that occurred to me was to use a triple-quoted string. Is there a better way to achieve this in Python?
print "Python..."
"""
End of code. I can put anything I want here.
"""
The __END__ block in perl dates from a time when programmers had to work with data from the outside world and liked to keep examples of it in the program itself.
Hard to imagine, I know.
It was useful, for example, if you had a moving target like a hardware log file whose messages mutated due to firmware updates and you wanted to compare old and new versions of a line, or if you wanted to keep notes not strictly related to the program's operation ("Code seems slow on day x of month every month"), or, as mentioned above, a reference set of data to run the program against. Telcos are an example of an industry where this was a frequent requirement.
Lastly, Python's cult-like restrictiveness seems to have a real and tiresome effect on the mindset of its advocates: if your only response to a question is "Why would you want to do that when you could do X?" when X is not as useful, please keep quiet++.
The triple-quote form you suggested will still create a python string, whereas Perl's parser simply ignores anything after __END__. You can't write:
"""
I can put anything in here...
Anything!
"""
import os
os.system("rm -rf /")
Comments are more suitable in my opinion.
#__END__
#Whatever I write here will be ignored
#Woohoo !
What you're asking for does not exist.
Proof: http://www.mail-archive.com/python-list#python.org/msg156396.html
A simple solution is to escape any " as \" and do a normal multi line string -- see official docs: http://docs.python.org/tutorial/introduction.html#strings
( Also, atexit doesn't work: http://www.mail-archive.com/python-list#python.org/msg156364.html )
Hm, what about sys.exit(0) ? (assuming you do import sys above it, of course)
As to why it would be useful: sometimes I sit down to do a substantial rewrite of something and want to mark my "good up to this point" place.
By using sys.exit(0) in a temporary manner, I know nothing below that point will get executed, therefore if there's a problem (e.g., server error) I know it had to be above that point.
I like it slightly better than commenting out the rest of the file, because there are more chances to make a mistake and uncomment something (a stray key press at the beginning of a line), and also because it seems better to insert one line (which will later be removed) than to modify many lines which will then have to be un-modified later.
But yeah, this is splitting hairs; commenting works great too... assuming your editor supports easily commenting out a region, of course; if not, sys.exit(0) all the way!
I use __END__ all the time, for many of the reasons given. I've been doing it for so long now that I put it in (usually preceded by an exit('0');), along with BEGIN {} / END {} routines, by force of habit. It is a shame that Python doesn't have an equivalent, but I just comment out the lines at the bottom: extraneous, but that's about what you get with "one way to rule them all" languages.
Python does not have a direct equivalent to this.
Why do you want it? It doesn't sound like a really great thing to have when there are more consistent ways, like putting the text at the end as comments (that's how we include arbitrary text in Python source files; triple-quoted strings are for making multi-line strings, not for non-code-related text).
Your editor should be able to make using many lines of comments easy for you.
I have a text file with lots of lines and with this structure:
[('name_1a',
'name_1b',
value_1),
('name_2a',
'name_2b',
value_2),
.....
.....
('name_XXXa',
'name_XXXb',
value_XXX)]
I would like to convert it to:
name_1a, name_1b, value_1
name_2a, name_2b, value_2
......
name_XXXa, name_XXXb, value_XXX
I wonder what would be the best way, whether awk, python or bash.
Thanks
Jose
Have you tried evaluating it in Python? It looks like a list of tuples to me.
eval(your_string)
Note, it's massively unsafe! If there's code in there to delete your hard disk, evaluating it will run that code!
I would like to use Python:
lines = open('filename.txt', 'r').readlines()
n = len(lines)  # n % 3 == 0
for i in range(0, n, 3):
    name1 = lines[i].strip("',[]\n\r")
    name2 = lines[i+1].strip("',[]\n\r")
    value = lines[i+2].strip("',[]\n\r")
    print name1, name2, value
It looks like legal Python. You might be able to just import it as a module and then write it back out after formatting it.
Oh boy, here is a job for ast.literal_eval:
(literal_eval is safer than eval, since it restricts the input string to literals such as strings, numbers, tuples, lists, dicts, booleans and None.)
import ast

filename = 'in'
with open(filename, 'r') as f:
    contents = f.read()

data = ast.literal_eval(contents)
for elt in data:
    print(', '.join(map(str, elt)))
Here's one way to do it with (g)awk:
$ awk -vRS=")," ' { gsub(/\n|[\047\]\[)(]/,"") } 1' file
name_1a,name_1b,value_1
name_2a,name_2b,value_2
name_XXXa,name_XXXb,value_XXX
Awk is typically line-oriented, and bash is a shell with a limited number of string-manipulation functions. It really depends on where your strength as a programmer lies, but all other things being equal, I would choose Python.
Did you ever consider that by redirecting the time it took to post this on SO, you could have had it done?
"AWK is a language for processing
files of text. A file is treated as a
sequence of records, and by default
each line is a record. Each line is
broken up into a sequence of fields,
so we can think of the first word in a
line as the first field, the second
word as the second field, and so on.
An AWK program is of a sequence of
pattern-action statements. AWK reads
the input a line at a time. A line is
scanned for each pattern in the
program, and for each pattern that
matches, the associated action is
executed." - Alfred V. Aho[2]
Asking what the best language is for a given task is a very different question from asking what the best way is to do that task in a particular language. The first, which is what you're asking, is in most cases entirely subjective.
Since this is a fairly simple task, I would suggest going with what you know (unless you're doing this for learning purposes, which I doubt).
If you know any of the languages you suggested, go ahead and solve this in a matter of minutes. If you know none of them, now enters the subjective part: I would suggest learning Python, since it's so much more fun than the other two ;)
If the values are legal Python values, you can take advantage of eval(), since your data is a legal Python data structure. The following works if the values are integers; otherwise you might have to massage the print call a bit:
input = """[('name_1a',
'name_1b',
1),
('name_2a',
'name_2b',
2),
('name_XXXa',
'name_XXXb',
3)]"""
for e in eval(input):
    print '%s,%s,%d' % e
P.S. using eval() is quite controversial since it will execute any valid python code that you pass into it, so take care.