Assign and Test Regex in Python?

In many of my python projects, I find myself having to go through a file, match lines against regexes, and then perform some computation on the basis of elements from the line extracted by regex.
In pseudo-C code, this is pretty easy:
while (read(line))
{
    if (m = matchregex(regex1, line))
    {
        /* munch on the components extracted in regex1 by accessing m */
    }
    else if (m = matchregex(regex2, line))
    {
        /* munch on the components extracted in regex2 by accessing m */
    }
    else if ...
    ...
    else
    {
        error("Unrecognized line format");
    }
}
However, because python does not allow an assignment in the conditional of an if, this can't be done elegantly. One could first parse against all the regexes and then do the if on the various match objects, but that is neither elegant nor efficient.
What I find myself doing instead is including code like this at the base level of every project:
import re

im = None
img = None

def imps(p, s):
    global im
    global img
    im = re.search(p, s)
    if im:
        img = im.groups()
        return True
    else:
        img = None
        return False
Then I can work like this:
for line in open(file, 'r').read().splitlines():
    if imps(regex1, line):
        # munch on contents of img
    elif imps(regex2, line):
        # munch on contents of img
    else:
        error('Unrecognised line: {}'.format(line))
That works, is reasonably compact, and easy to type. But it is hardly beautiful; it uses global variables and is not thread safe (which has not been an issue for me so far).
But I'm sure others have run across this problem before and come up with an equally compact, but more python-y and generally superior solution. What is it?

Depends on the needs of the code.
A common choice I use is something like this:
import re

# note, order is important here. The first one to match will exit the processing
parse_regexps = [
    (r"^foo", handle_foo),
    (r"^bar", handle_bar),
]

for regexp, handler in parse_regexps:
    m = re.match(regexp, line)
    if m:
        handler(line)  # possibly other data too, like m.groups()
        break
else:
    error("Unrecognized format....")
This has the advantage of moving the handling code into clear and obvious functions, which makes testing and changes easy.
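For completeness, a minimal sketch of the handler side this pattern assumes (handle_foo and handle_bar are placeholders, not defined anywhere in the answer):

def handle_foo(line):
    # placeholder: pull out whatever the "foo" lines carry
    print('foo:', line.rstrip())

def handle_bar(line):
    print('bar:', line.rstrip())

Each handler receives the raw line (and, if useful, m.groups()), so the dispatch loop itself stays free of per-format logic.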

You can just use continue:
for line in file:
    m = re.match(re1, line)
    if m:
        # do stuff
        continue
    m = re.match(re2, line)
    if m:
        # do stuff
        continue
    raise BadLine
Another, less obvious, option is to have a function like this:
def match_any(subject, *regexes):
    for n, regex in enumerate(regexes):
        m = re.match(regex, subject)
        if m:
            return n, m
    return -1, None
and then:
for line in file:
    n, m = match_any(line, re1, re2)
    if n == 0:
        ....
    elif n == 1:
        ....
    else:
        raise BadLine
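Note: since Python 3.8, the assignment expression from PEP 572 (the "walrus" operator) makes the original C-style chain possible almost verbatim. A minimal sketch; the patterns and filename here are placeholders, not from the question:

import re

re1 = re.compile(r'^foo\s+(\w+)')   # placeholder patterns
re2 = re.compile(r'^bar\s+(\w+)')

with open('input.txt') as fh:       # 'input.txt' is a hypothetical file
    for line in fh:
        if m := re1.match(line):
            pass  # munch on m.groups()
        elif m := re2.match(line):
            pass  # munch on m.groups()
        else:
            raise ValueError('Unrecognised line: {}'.format(line))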


While generating a hash of a file using Python3's hashlib.blake2b, are these two methods functionally similar?

I must use Python3.7 in the environment I find myself in. Common tutorials on how to utilize the hashlib.blake2b module show using a 'walrus' while reading out the chunks of the file to be hashed.
Example of conventional approach:
import hashlib

def makeNormalHash():
    with open('fizzy.jpg', "rb") as f:
        file_hash = hashlib.blake2b()
        while chunk := f.read(8192):
            file_hash.update(chunk)
        hexdig = file_hash.hexdigest()
        dig = file_hash.digest()
    return hexdig, dig
This usage of the := operator has me a little confused, but I have attempted to reproduce its end functionality in this use case, written for Python3.7 instead of Python3.8. My understanding of how := works yielded the following:
def makeDifferentHash():
    with open('fizzy.jpg', "rb") as f:
        foo_hash = hashlib.blake2b()
        chunk = f.read(8192)
        while len(chunk) > 0:
            foo_hash.update(chunk)
            chunk = f.read(8192)
        foohexdig = foo_hash.hexdigest()
        foodig = foo_hash.digest()
    return foohexdig, foodig
At first glance this seems to work just the same, but when I compare the resulting values from hashing the same file, the values do not match.
nhd, nd = makeNormalHash()
fhd, fd = makeDifferentHash()
if(nhd != fhd):
    print('hexdig no match')
if(nd != fd):
    print('foodig no match')
I believe I should get the same values when hashing the same file in the same manner each time; this is how you confirm the file is valid and/or not tampered with. So I am using the same method (blake2b) each time, but I am changing how I loop through the file. Is this the cause of the mismatched digest values, or am I missing another aspect of hashing that is creating this difference?
Ultimately I am trying to make a python3.7-friendly function that replaces the usage of the walrus operator (:=).
Any ideas?
*Walrus Operator == Assignment Expression PEP572
In my case, when the file is modified, it becomes empty first. At that point the hash is the "empty string hash", which should be ignored. You can compute the hash of the empty string (b'') like this:
tempHash = hashlib.blake2b()
tempHash.update(b'')
emptyHash = tempHash.hexdigest()
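Another Python 3.7-friendly way to drop the walrus operator is the two-argument form of iter(), which keeps calling f.read(8192) until it returns the sentinel b''. A minimal sketch, not taken from the answer above:

import hashlib

def make_hash_py37(path):
    # Chunked hashing without :=, using iter(callable, sentinel).
    file_hash = hashlib.blake2b()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            file_hash.update(chunk)
    return file_hash.hexdigest(), file_hash.digest()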

Importing big tecplot block files in python as fast as possible

I want to import into Python some ASCII files (from Tecplot, software for CFD post-processing).
Rules for those files are (at least, for those that I need to import):
The file is divided into several sections.
Each section has two lines as header like:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
Each section has a set of variables given by the first line. When a section ends, a new section starts with two similar lines.
For each variable there are I*J*K values.
Each variable is a continuous block of values.
There are a fixed number of values per row (6).
When a variable ends, the next one starts in a new line.
Variables are "IJK ordered data".The I-index varies the fastest; the J-index the next fastest; the K-index the slowest. The I-index should be the inner loop, the K-index shoould be the outer loop, and the J-index the loop in between.
Here is an example of data:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
-3.9999999E+00 -3.3327306E+00 -2.7760824E+00 -2.3117116E+00 -1.9243209E+00 -1.6011492E+00
[...]
0.0000000E+00 #fin first variable
-4.3532482E-02 -4.3584235E-02 -4.3627592E-02 -4.3663762E-02 -4.3693815E-02 -4.3718831E-02 #second variable, 'y'
[...]
1.0738781E-01 #end of second variable
[...]
[...]
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen" #next zone
ZONE T="Window(s) : E_W_Block0003_ALL", I=17, J=17, K=25, F=BLOCK
I am quite new to Python and I have written code to import the data into a dictionary, storing the variables as 3D numpy.array. Those files can be very big (up to GB). How can I make this code faster? (Or, more generally, how can I import such files as fast as possible?)
import re
from numpy import zeros, array, prod

def vectorr(I, J, K):
    """function"""
    vect = []
    for k in range(0, K):
        for j in range(0, J):
            for i in range(0, I):
                vect.append([i, j, k])
    return vect

a = open(r'E:\u.dat')
filelist = a.readlines()
NumberCol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0
while count < leng:
    strVARIABLES = re.findall('VARIABLES', filelist[count])
    variables = re.findall(r'"(.*?)"', filelist[count])
    countzone = countzone + 1
    data[countzone] = {key: [] for key in variables}
    count = count + 1
    strI = re.findall('I=....', filelist[count])
    strI = re.findall(r'\d+', strI[0])
    I = int(strI[0])
    ##
    strJ = re.findall('J=....', filelist[count])
    strJ = re.findall(r'\d+', strJ[0])
    J = int(strJ[0])
    ##
    strK = re.findall('K=....', filelist[count])
    strK = re.findall(r'\d+', strK[0])
    K = int(strK[0])
    data[countzone]['indmax'] = array([I, J, K])
    pr = prod(data[countzone]['indmax'])
    lin = pr // NumberCol
    if pr % NumberCol != 0:
        lin = lin + 1
    vect = vectorr(I, J, K)
    for key in variables:
        init = zeros((I, J, K))
        for ii in range(0, lin):
            count = count + 1
            temp = map(float, filelist[count].split())
            for iii in range(0, len(temp)):
                init.itemset(tuple(vect[ii*6+iii]), temp[iii])
        data[countzone][key] = init
    count = count + 1
P.S. In Python only; no Cython or other languages.
Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here maybe changing it to the following gives you a sufficient speedup:
# add this line to your imports
from numpy import fromstring

# replace the nested for-loop with:
count += 1
for key in variables:
    str_vector = ' '.join(filelist[count:count+lin])
    ar = fromstring(str_vector, sep=' ')
    ar = ar.reshape((I, J, K), order='F')
    data[countzone][key] = ar
    count += lin
Unfortunately at the moment I only have access to my smartphone (no pc) so I can't test how fast this is or even if it works correctly or at all!
Update
Finally I got around to doing some testing:
My code contained a small error, but it does seem to work correctly now.
The code with the proposed changes runs about 4 times faster than the original
Your code spends most of its time on ndarray.itemset, and probably on loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail.
The improved code spends about 70% of time in numpy.fromstring, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python / NumPy.
Update 2
Of course even better would be to iterate over the file instead of loading everything all at once. In this case that is slightly faster (I tried it) and significantly reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally, a word of warning: the fromstring method that I used scales rather badly with the length of the string. E.g. from a certain string length on it becomes more efficient to use something like np.fromiter(itertools.imap(float, str_vector.split()), dtype=float).
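As a rough illustration of that incremental reading, one variable block could be pulled straight from the open file object with itertools.islice instead of slicing a pre-loaded list. A sketch only; the helper name read_block and its arguments are assumptions, not code from the answer:

import itertools
import numpy as np

def read_block(f, lin, I, J, K):
    # Join the next 'lin' lines of numbers and convert them in one go,
    # reshaping in Fortran order so the I-index varies fastest.
    str_vector = ' '.join(itertools.islice(f, lin))
    ar = np.fromstring(str_vector, sep=' ')
    return ar.reshape((I, J, K), order='F')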
If you use regular expressions here, there are two things that I would change:
Compile REs which are used more often (which applies to all REs in your example, I guess). Do regex=re.compile("<pattern>") on them, and use the resulting object with match=regex.match(), as described in the Python documentation.
For the I, J, K REs, consider reducing two REs to one, using the grouping feature (also described above), by searching for a pattern of the form "I=(\d+)", and grabbing the part matched inside the parentheses using regex.group(1). Taking this further, you can define a single regex to capture all three variables in one step.
At least for starting the sections, REs seem a bit overkill: There's no variation in the string you need to look for, and string.find() is sufficient and probably faster in that case.
EDIT: I just saw you use grouping already for the variables...
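Following up on the single-pattern suggestion above, a minimal sketch, assuming the ZONE header format shown in the question:

import re

# One compiled pattern that captures I, J and K in a single search.
zone_re = re.compile(r'I=\s*(\d+),\s*J=\s*(\d+),\s*K=\s*(\d+)')

header = 'ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK'
m = zone_re.search(header)
if m:
    I, J, K = (int(g) for g in m.groups())  # 29, 17, 25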

parse nested conditional statements

I need to parse a file that contains conditional statements, sometimes nested inside one another.
I have a file that stores configuration data, but the configuration data is slightly different depending on user-defined options. I can deal with the conditional statements themselves (they're all just booleans with no operations), but I don't know how to recursively evaluate the nested conditionals. For instance, a piece of the file might look like:
...
#if CELSIUS
#if FROM_KELVIN ; this is a comment about converting kelvin to celsius.
temp_conversion = 1, 273
#else
temp_conversion = 0.556, -32
#endif
#else
#if FROM_KELVIN
temp_conversion = 1.8, -255.3
#else
temp_conversion = 1.8, 17.778
#endif
#endif
...
... Also, some conditionals don't have an #else statement, just #if CONDITION statement(s) #endif.
I realize that this could be easy if the file were just written in XML or something else with a nice parser to begin with, but this is what I have to work with so I'm wondering if there's any relatively simple way to parse this file. It's similar to parenthesis matching so I imagine there would be some module for it but I haven't found anything.
I'm working in python but I can switch for this function if it's easier to solve this in another language.
Here's a simple recursive parser for this syntax:
def parse(lines):
    result = []
    while lines:
        if lines[0].startswith('#if'):
            block = [lines.pop(0).split()[1], parse(lines)]
            if lines[0].startswith('#else'):
                lines.pop(0)
                block.append(parse(lines))
            lines.pop(0)  # endif
            result.append(block)
        elif not lines[0].startswith(('#else', '#endif')):
            result.append(lines.pop(0))
        else:
            break
    return result
tree = parse([x.strip() for x in your_code.splitlines() if x.strip()])
From your example it creates the following tree structure:
[['CELSIUS',
  [['FROM_KELVIN',
    ['temp_conversion = 1, 273'],
    ['temp_conversion = 0.556, -32']]],
  [['FROM_KELVIN',
    ['temp_conversion = 1.8, -255.3'],
    ['temp_conversion = 1.8, 17.778']]]]]
which should be easy to evaluate.
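For example, a minimal sketch of evaluating that tree against a set of condition names that are considered true (this evaluate helper is an assumption, not part of the answer):

def evaluate(tree, defined):
    # Walk the nested-list tree from parse(): a list node is
    # [condition, if_branch] or [condition, if_branch, else_branch],
    # and a string node is a plain configuration line.
    lines = []
    for node in tree:
        if isinstance(node, list):
            name, if_branch = node[0], node[1]
            else_branch = node[2] if len(node) > 2 else []
            branch = if_branch if name in defined else else_branch
            lines.extend(evaluate(branch, defined))
        else:
            lines.append(node)
    return lines

# evaluate(tree, {'CELSIUS', 'FROM_KELVIN'}) -> ['temp_conversion = 1, 273']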
For more advanced parsing consider one of many parsing tools available for Python.
Since all of the conditions are binary and I know the values of all of them in advance (there is no need to evaluate them in order like a programming language would), I was able to do it with a regular expression. This works better for me. It finds the lowest-level conditionals (ones with no nested conditions), evaluates them and replaces them with the correct contents, then repeats for the higher-level conditionals and so on.
import re

conditions = ['CELSIUS', 'FROM_KELVIN']

def eval_conditional(matchobj):
    statement = matchobj.groups()[1].split('#else')
    statement.append('')  # in case there was no else statement
    if matchobj.groups()[0] in conditions: return statement[0]
    else: return statement[1]

def parse(text):
    pattern = r'#if\s*(\S*)\s*((?:.(?!#if|#endif))*.)#endif'
    regex = re.compile(pattern, re.DOTALL)
    while True:
        if not regex.search(text): break
        text = regex.sub(eval_conditional, text)
    return text

if __name__ == '__main__':
    i = open('input.txt', 'r').readlines()
    g = ''.join([x.split(';')[0] for x in i if x.strip()])
    o = parse(g)
    open('output.txt', 'w').write(o)
Given the input in the original post, it outputs:
...
temp_conversion = 1, 273
...
which is what I need. Thanks to everyone for their responses, I really appreciate the help!

Reading n lines from file (but not all) in Python

How do I read n lines from a file instead of just one when iterating over it? I have a file which has a well-defined structure and I would like to do something like this:
for line1, line2, line3 in file:
    do_something(line1)
    do_something_different(line2)
    do_something_else(line3)
but it doesn't work:
ValueError: too many values to unpack
For now I am doing this:
for line in file:
    do_something(line)
    newline = file.readline()
    do_something_else(newline)
    newline = file.readline()
    do_something_different(newline)
    # ... etc.
which sucks because I am writing endless 'newline = file.readline()' lines which clutter the code.
Is there any smart way to do this ? (I really want to avoid reading whole file at once because it is huge)
Basically, your file is an iterator which yields your file one line at a time. This turns your problem into how to yield several items at a time from an iterator. A solution to that is given in this question. Note that the function islice is in the itertools module, so you will have to import it from there.
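A rough sketch of that islice-based grouping (the helper name read_n and the filename are assumptions):

from itertools import islice

def read_n(f, n):
    # Pull n lines at a time from the file iterator until it is exhausted;
    # the last group may be shorter than n.
    while True:
        chunk = list(islice(f, n))
        if not chunk:
            break
        yield chunk

with open('data.txt') as f:  # 'data.txt' is a hypothetical file
    for group in read_n(f, 3):
        pass  # process the group of up to 3 lines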
If it's XML, why not just use lxml?
You could use a helper function like this:
def readnlines(f, n):
    lines = []
    for x in range(0, n):
        lines.append(f.readline())
    return lines
Then you can do something like you want:
while True:
    line1, line2, line3 = readnlines(file, 3)
    do_stuff(line1)
    do_stuff(line2)
    do_stuff(line3)
That being said, if you are using xml files, you will probably be happier in the long run if you use a real xml parser...
itertools to the rescue:
import itertools

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)

fobj = open(yourfile, "r")
for line1, line2, line3 in grouper(3, fobj):
    pass
for i in file produces a str, so you can't just do for i, j, k in file and read it in batches of three (try a, b, c = 'bar' and a, b, c = 'too many characters' and look at the values of a, b and c to work out why you get the "too many values to unpack").
It's not clear entirely what you mean, but if you're doing the same thing for each line and just want to stop at some point, then do it like this:
for line in file_handle:
    do_something(line)
    if some_condition:
        break  # Don't want to read anything else
(Also, don't use file as a variable name, you're shadowing a builtin.)
If you're doing the same thing, why do you need to process multiple lines per iteration?
For line in file is your friend. It is in general much more efficient than manually reading the file, both in terms of I/O performance and memory.
Do you know something about the length of the lines/format of the data? If so, you could read in the first n bytes (say 80*3) and f.read(240).split("\n")[0:3].
If you want to be able to use this data over and over again, one approach might be to do this:
lines = []
for line in file_handle:
    lines.append(line)
This will give you a list of the lines, which you can then access by index. Also, when you say a HUGE file, the size is most likely trivial, because Python can process thousands of lines very quickly.
why can't you just do:
ctr = 0
for line in file:
    if ctr == 0:
        ....
    elif ctr == 1:
        ....
    ctr = ctr + 1
if you find the if/elif construct ugly you could just create a hash table or list of function pointers and then do:
for line in file:
    function_list[ctr]()
or something similar
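A minimal sketch of that dispatch idea (the handler names and filename are placeholders):

def handle_first(line):
    print('first line of record:', line.rstrip())

def handle_second(line):
    print('second line of record:', line.rstrip())

handlers = [handle_first, handle_second]

with open('data.txt') as f:  # hypothetical file
    for ctr, line in enumerate(f):
        # cycle through the handlers, one per line position in the record
        handlers[ctr % len(handlers)](line)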
It sounds like you are trying to read from disk in parallel... that is really hard to do. All the solutions given to you are realistic and legitimate. You shouldn't let something put you off just because the code "looks ugly". The most important thing is how efficient/effective is it, then if the code is messy, you can tidy it up, but don't look for a whole new method of doing something because you don't like how one way of doing it looks like in code.
As for running out of memory, you may want to check out pickle.
It's possible to do it with a clever use of the zip function. It's short, but a bit voodoo-ish for my tastes (hard to see how it works). It cuts off any lines at the end that don't fill a group, which may be good or bad depending on what you're doing. If you need the final lines, itertools.izip_longest might do the trick.
zip(*[iter(inputfile)] * 3)
Doing it more explicitly and flexibly, this is a modification of Mats Ekberg's solution:
def groupsoflines(f, n):
    while True:
        group = []
        for i in range(n):
            try:
                group.append(next(f))
            except StopIteration:
                if group:
                    tofill = n - len(group)
                    yield group + [None] * tofill
                return
        yield group

for line1, line2, line3 in groupsoflines(inputfile, 3):
    ...
N.B. If this runs out of lines halfway through a group, it will fill in the gaps with None, so that you can still unpack it. So, if the number of lines in your file might not be a multiple of three, you'll need to check whether line2 and line3 are None.

Python: item for item until stopterm in item?

Disclaimer: I'm fairly new to python!
If I want all the lines of a file until (edit: and including) the line containing some string stopterm, is there a way of using the list syntax for it? I was hoping there would be something like:
usefullines = [line for line in file until stopterm in line]
For now, I've got
usefullines = []
for line in file:
usefullines.append(line)
if stopterm in line:
break
It's not the end of the world, but since the rest of Python syntax is so straightforward, I was hoping for a 1 thought -> 1 Python line mapping.
from itertools import takewhile
usefullines = takewhile(lambda x: not re.search(stopterm, x), lines)
from itertools import takewhile
usefullines = takewhile(lambda x: stopterm not in x, lines)
Here's a way that keeps the stopterm line:
def useful_lines(lines, stopterm):
    for line in lines:
        if stopterm in line:
            yield line
            break
        yield line

usefullines = useful_lines(lines, stopterm)
# or...
for line in useful_lines(lines, stopterm):
    # ... do stuff
    pass
" I was hoping for a 1 thought->1 Python line mapping." Wouldn't we all love a programming language that somehow mirrored our natural language?
You can achieve that, you just need to define your unique thoughts once. Then you have the 1:1 mapping you were hoping for.
def usefulLines( aFile ):
    for line in aFile:
        yield line
        if line == stopterm:
            break
Is pretty much it.
for line in usefulLines( aFile ):
    # process a line, knowing it occurs BEFORE stopterm.
There are more general approaches. The lassevk answers with enum_while and enum_until are generalizations of this simple design pattern.
That itertools solution is neat. I have earlier been amazed by itertools.groupby, one handy tool.
But still I was just tinkering with whether I could do this without itertools. So here it is.
(There is one assumption and one drawback though: the file is not huge and it goes through one extra complete iteration over the lines, respectively.)
I created a sample file named "try":
hello
world
happy
day
bye
once you read the file and have the lines in a variable named lines:
lines=open('./try').readlines()
then
print [each for each in lines if lines.index(each)<=[lines.index(line) for line in lines if 'happy' in line][0]]
gives the result:
['hello\n', 'world\n', 'happy\n']
and
print [each for each in lines if lines.index(each)<=[lines.index(line) for line in lines if 'day' in line][0]]
gives the result:
['hello\n', 'world\n', 'happy\n', 'day\n']
So you got the last line - the stop term line also included.
Forget this
Leaving the answer, but marking it community. See Steven Huwig's answer for the correct way to do this.
Well, [x for x in enumerable] will run until enumerable doesn't produce data any more; the if-part will simply allow you to filter along the way.
What you can do is add a function, and filter your enumerable through it:
def enum_until(source, until_criteria):
    for k in source:
        if until_criteria(k):
            break
        yield k

def enum_while(source, while_criteria):
    for k in source:
        if not while_criteria(k):
            break
        yield k

l1 = [k for k in enum_until(xrange(1, 100000), lambda y: y == 100)]
l2 = [k for k in enum_while(xrange(1, 100000), lambda y: y < 100)]
print l1
print l2
Of course, it doesn't look as nice as what you wanted...
I think it's fine to keep it that way. Sophisticated one-liner are not really pythonic, and since Guido had to put a limit somewhere, I guess this is it...
I'd go with Steven Huwig's or S.Lott's solutions for real usage, but as a slightly hacky solution, here's one way to obtain this behaviour:
def stop(): raise StopIteration()
usefullines = list(stop() if stopterm in line else line for line in file)
It's slightly abusing the fact that anything that raises StopIteration will abort the current iteration (here the generator expression) and uglier to read than your desired syntax, but will work.
