I need to parse a file that contains conditional statements, sometimes nested inside one another.
I have a file that stores configuration data, but the data differs slightly depending on user-defined options. I can deal with the conditional statements themselves; they're all just booleans with no operations. What I don't know is how to recursively evaluate the nested conditionals. For instance, a piece of the file might look like:
...
#if CELSIUS
#if FROM_KELVIN ; this is a comment about converting kelvin to celsius.
temp_conversion = 1, 273
#else
temp_conversion = 0.556, -32
#endif
#else
#if FROM_KELVIN
temp_conversion = 1.8, -255.3
#else
temp_conversion = 1.8, 17.778
#endif
#endif
...
... Also, some conditionals don't have an #else statement, just #if CONDITION statement(s) #endif.
I realize that this could be easy if the file were just written in XML or something else with a nice parser to begin with, but this is what I have to work with so I'm wondering if there's any relatively simple way to parse this file. It's similar to parenthesis matching so I imagine there would be some module for it but I haven't found anything.
I'm working in python but I can switch for this function if it's easier to solve this in another language.
Here's a simple recursive parser for this syntax:
def parse(lines):
    result = []
    while lines:
        if lines[0].startswith('#if'):
            # store the condition name, then recurse for the "true" branch
            block = [lines.pop(0).split()[1], parse(lines)]
            if lines[0].startswith('#else'):
                lines.pop(0)
                block.append(parse(lines))  # the "false" branch
            lines.pop(0)  # consume the #endif
            result.append(block)
        elif not lines[0].startswith(('#else', '#endif')):
            result.append(lines.pop(0))  # plain content line
        else:
            break  # let the caller consume the #else/#endif
    return result

tree = parse([x.strip() for x in your_code.splitlines() if x.strip()])
From your example it creates the following tree structure:
[['CELSIUS',
  [['FROM_KELVIN',
    ['temp_conversion = 1, 273'],
    ['temp_conversion = 0.556, -32']]],
  [['FROM_KELVIN',
    ['temp_conversion = 1.8, -255.3'],
    ['temp_conversion = 1.8, 17.778']]]]]
which should be easy to evaluate.
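For instance, here is a minimal sketch of an evaluator for that tree, assuming the condition names that are true are collected in a set (the evaluate name and the set literal are illustrative, not from the original post):

def evaluate(tree, true_conditions):
    # Flatten the tree into the lines that survive the conditionals.
    output = []
    for node in tree:
        if isinstance(node, list):
            condition, if_branch = node[0], node[1]
            else_branch = node[2] if len(node) > 2 else []
            branch = if_branch if condition in true_conditions else else_branch
            output.extend(evaluate(branch, true_conditions))
        else:
            output.append(node)
    return output

# e.g. evaluate(tree, {'CELSIUS', 'FROM_KELVIN'}) -> ['temp_conversion = 1, 273']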
For more advanced parsing, consider one of the many parsing tools available for Python.
Since all of the conditions are binary and I know the values of all of them in advance (no need to evaluate them in order like a programming language), I was able to do it with a regular expression. This works better for me. It finds the lowest-level conditionals (ones with no nested conditions), evaluates them and replaces them with the correct contents, then repeats for the higher-level conditionals and so on.
import re

conditions = ['CELSIUS', 'FROM_KELVIN']

def eval_conditional(matchobj):
    statement = matchobj.groups()[1].split('#else')
    statement.append('')  # in case there was no else statement
    if matchobj.groups()[0] in conditions: return statement[0]
    else: return statement[1]

def parse(text):
    pattern = r'#if\s*(\S*)\s*((?:.(?!#if|#endif))*.)#endif'
    regex = re.compile(pattern, re.DOTALL)
    while True:
        if not regex.search(text): break
        text = regex.sub(eval_conditional, text)
    return text

if __name__ == '__main__':
    i = open('input.txt', 'r').readlines()
    g = ''.join([x.split(';')[0] for x in i if x.strip()])
    o = parse(g)
    open('output.txt', 'w').write(o)
Given the input in the original post, it outputs:
...
temp_conversion = 1, 273
...
which is what I need. Thanks to everyone for their responses, I really appreciate the help!
I must use Python 3.7 in the environment I find myself in. Common tutorials on how to use the hashlib.blake2b module show using a 'walrus' operator while reading out the chunks of the file to be hashed.
Example of conventional approach:
import hashlib

def makeNormalHash():
    with open('fizzy.jpg', "rb") as f:
        file_hash = hashlib.blake2b()
        while chunk := f.read(8192):
            file_hash.update(chunk)
        hexdig = file_hash.hexdigest()
        dig = file_hash.digest()
    return hexdig, dig
This usage of the := operator has me a little confused, but I have attempted to reproduce its end result in this use case, written for Python 3.7 instead of Python 3.8. My understanding of how := works yielded the following:
def makeDifferentHash():
    with open('fizzy.jpg', "rb") as f:
        foo_hash = hashlib.blake2b()
        chunk = f.read(8192)
        while len(chunk) > 0:
            foo_hash.update(chunk)
            chunk = f.read(8192)
        foohexdig = foo_hash.hexdigest()
        foodig = foo_hash.digest()
    return foohexdig, foodig
At first glance this seems to work just the same, but if I compare the resulting values when hashing the same file, I find that the values do not match.
nhd, nd = makeNormalHash()
fhd, fd = makeDifferentHash()
if nhd != fhd:
    print('hexdig no match')
if nd != fd:
    print('foodig no match')
I believe I should get the same resulting values when hashing the same file in the same manner each time; this is to confirm the file is valid and/or not tampered with. So I am using the same method (blake2b) each time, but I am changing how I loop through the file. Is this the cause of the mismatch of digest values, or am I missing another aspect of hashing that is creating this difference?
Ultimately I am trying to make a Python 3.7-friendly function that replaces the usage of the walrus operator (:=).
Any ideas?
*Walrus Operator == Assignment Expression PEP572
In my case, when the file is modified, it is emptied first. At that point the hash is the "empty string hash", which should be ignored. You can compute the hash of the empty string (b'') like this:
tempHash = hashlib.blake2b()
tempHash.update(b'')
emptyHash = tempHash.hexdigest()
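For the Python 3.7 side of the question, the two loops in the question are equivalent for an unchanged file; a common walrus-free idiom is two-argument iter(). A minimal sketch, reusing the question's file name and chunk size:

import hashlib

def makeCompatibleHash():
    file_hash = hashlib.blake2b()
    with open('fizzy.jpg', "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(8192) until it returns b''
        for chunk in iter(lambda: f.read(8192), b''):
            file_hash.update(chunk)
    return file_hash.hexdigest(), file_hash.digest()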
I found myself writing some tricky algorithmic code, and I tried to comment it as well as I could since I really do not know who is going to maintain this part of the code.
Following this idea, I've written quite a lot of block and inline comments, also trying not to over-comment it. But still, when I go back to the code I wrote a week ago, I find it difficult to read because of the swarming presence of the comments, especially the inline ones.
I thought that indenting them (to ~120 chars) could ease readability, but that would obviously make the lines way too long according to style standards.
Here's an example of the original code:
fooDataTableOccurrence = nestedData.find("table=\"public\".")
if 0 > fooDataTableOccurrence:  # selects only tables without tag value "public-"
    otherDataTableOccurrence = nestedData.find("table=")
    dbNamePos = nestedData.find("dbname=") + 7  # 7 is the length of "dbname="
    if -1 < otherDataTableOccurrence:  # selects only tables with tag value "table="
        # database resource case
        resourceName = self.findDB(nestedData, dbNamePos, otherDataTableOccurrence, destinationPartitionName)
        if resourceName:  # if the resource is in a wrong path
            if resourceName in ["foo", "bar", "thing", "stuff"]:
                return True, False, False  # respectively isProjectAlreadyExported, isThereUnexpectedData and wrongPathResources
            wrongPathResources.append("Database table: " + resourceName)
And here's how indenting inline comments would look like:
fooDataTableOccurrence = nestedData.find("table=\"public\".")
if 0 > fooDataTableOccurrence:                                                                       # selects only tables without tag value "public-"
    otherDataTableOccurrence = nestedData.find("table=")
    dbNamePos = nestedData.find("dbname=") + 7                                                       # 7 is the length of "dbname="
    if -1 < otherDataTableOccurrence:                                                                # selects only tables with tag value "table="
        # database resource case
        resourceName = self.findDB(nestedData, dbNamePos, otherDataTableOccurrence, destinationPartitionName)
        if resourceName:                                                                             # if the resource is in a wrong path
            if resourceName in ["foo", "bar", "thing", "stuff"]:
                return True, False, False                                                            # respectively isProjectAlreadyExported, isThereUnexpectedData and wrongPathResources
            wrongPathResources.append("Database table: " + resourceName)
The code is in Python (my company's legacy code does not strictly follow the PEP 8 standard, so we had to stick with that), but my point is not about the cleanness of the code itself, but about the comments. I am looking for a trade-off between readability and easy understanding of the code, and sometimes I find it difficult to achieve both at the same time.
Which of the examples is better? If none, what would be?
Maybe this is an XY_Problem?
Could the comments be eliminated altogether?
Here is a (quick & dirty) attempt at refactoring the code posted:
data_table_occurrence_lacks_tag_public = nestedData.find("table=\"public\".") < 0
if data_table_occurrence_lacks_tag_public:
    otherDataTableOccurrence = nestedData.find("table=")
    data_table_occurrence_has_tag_table = otherDataTableOccurrence >= 0
    prefix = "dbname="
    dbNamePos = nestedData.find(prefix) + len(prefix)
    if data_table_occurrence_has_tag_table:
        # database resource case
        resourceName = self.findDB(nestedData,
                                   dbNamePos,
                                   otherDataTableOccurrence,
                                   destinationPartitionName)
        resource_name_in_wrong_path = len(resourceName) > 0
        if resource_name_in_wrong_path:
            if resourceName in ["foo", "bar", "thing", "stuff"]:
                project_already_exported = True
                unexpected_data = False
                return (project_already_exported,
                        unexpected_data,
                        resource_name_in_wrong_path)
            wrongPathResources.append("Database table: " + resourceName)
Further work could involve extracting functions out of the block of code.
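For instance, a rough sketch of what extracting one helper method could look like (the method name and return convention here are illustrative only):

def _find_wrong_path_database_table(self, nestedData, destinationPartitionName):
    # Returns the name of a database-table resource that sits in a wrong path,
    # or None when there is nothing to report.
    table_pos = nestedData.find("table=")
    if table_pos < 0:
        return None
    prefix = "dbname="
    db_name_pos = nestedData.find(prefix) + len(prefix)
    return self.findDB(nestedData, db_name_pos, table_pos, destinationPartitionName) or None

The caller then only decides whether to return early or record the name in wrongPathResources, which removes most of the need for the inline comments.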
About a year back, I wrote a little program in Python that basically automates a part of my job (with quite a bit of assistance from you guys!). However, I ran into a problem. As I kept making the program better and better, I realized that Python did not want to play nice with Excel, and (without boring you with the details, suffice it to say xlutils will not copy formulas) I NEED to have more access to Excel for my purposes.
So I am starting back at square one with VB (2010 Express, if it helps). The only programming course I ever took in my life was on it, and it was pretty straightforward, so I decided I'd go back to it for this. Unfortunately, I've forgotten much of what I had learned, and we never really got this far down the rabbit hole in the first place. So, long story short, I am trying to:
1) Read data from a .csv structured as so:
41,332.568825,22.221759,-0.489714,eow
42,347.142926,-2.488763,-0.19358,eow
46,414.9969,19.932693,1.306851,r
47,450.626074,21.878299,1.841957,r
48,468.909171,21.362568,1.741944,r
49,506.227269,15.441723,1.40972,r
50,566.199838,17.656284,1.719818,r
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
2) Sort that data alphabetically by column 5
3) Then selecting only the ones with an "l" in column 5, sort THOSE numerically by column 2 (ascending order) AND copy them to a new file called coil.csv
4) Then selecting only the ones that have an "r" in column 5, sort those numerically by column 2 (descending order) and copy them to the SAME file coil.csv (appended after the others obviously)
After all of that hoopla I wish to get out:
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
50,566.199838,17.656284,1.719818,r
49,506.227269,15.441723,1.40972,r
48,468.909171,21.362568,1.741944,r
47,450.626074,21.878299,1.841957,r
46,414.9969,19.932693,1.306851,r
I realize that this may be a pretty involved question, and I certainly understand if no one wants to deal with all this bs, lol. Anyway, some full on code, snippets, ideas or even relevant links would be GREATLY appreciated. I've been, and still am googling, but it's harder than expected to find good reliable information pertaining to this.
P.S. Here is the piece of Python code that did what I am talking about (although it created two separate files for the lefts and rights, which I don't really need) - if it helps you at all.
import csv
from easygui import msgbox, fileopenbox, boolbox  # GUI helpers come from the easygui package

msgbox(msg="Please locate your survey file in the next window.")
mainfile = fileopenbox(title="Open survey file")
toponame = boolbox(msg="What is the name of the shots I should use for topography? Note: TOPO is used automatically", choices=("Left", "Right"))
fieldnames = ["A", "B", "C", "D", "E"]
surveyfile = open(mainfile, "r")
left_file = open("left.csv", 'wb')
right_file = open("right.csv", 'wb')
coil_file = open("coil1.csv", "wb")
reader = csv.DictReader(surveyfile, fieldnames=fieldnames, delimiter=",")
left_writer = csv.DictWriter(left_file, fieldnames + ["F"], delimiter=",")
sortedlefts = sorted(reader, key=lambda x: float(x["B"]))
surveyfile.seek(0, 0)
right_writer = csv.DictWriter(right_file, fieldnames + ["F"], delimiter=",")
sortedrights = sorted(reader, key=lambda x: float(x["B"]), reverse=True)
coil_writer = csv.DictWriter(coil_file, fieldnames, delimiter=",", extrasaction='ignore')
for row in sortedlefts:
    if row["E"] == "l" or row["E"] == "cl+l":
        row['F'] = '%s,%s' % (row['B'], row['D'])
        left_writer.writerow(row)
        coil_writer.writerow(row)
for row in sortedrights:
    if row["E"] == "r":
        row['F'] = '%s,%s' % (row['B'], row['D'])
        right_writer.writerow(row)
        coil_writer.writerow(row)
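(As a side note, the coil.csv step by itself can be reproduced with a much shorter plain-Python pass, without the separate left/right files; a minimal sketch, assuming the same five-column layout and a hypothetical survey.csv input name:)

import csv

# Read everything once, then write only coil.csv: lefts ascending, rights descending by column 2.
with open("survey.csv", "r") as f:
    rows = [row for row in csv.reader(f) if len(row) == 5]

lefts = sorted((r for r in rows if r[4] == "l"), key=lambda r: float(r[1]))
rights = sorted((r for r in rows if r[4] == "r"), key=lambda r: float(r[1]), reverse=True)

with open("coil.csv", "w", newline="") as out:
    csv.writer(out).writerows(lefts + rights)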
One option you have is to start with a class to hold the fields. This allows you to override the ToString method to facilitate the output. Then it's a fairly simple matter of reading each line and assigning the values to a list of the class. In your case you'll want the extra step of making two lists, sorting one descending, and combining them:
Class Fields
    Property A As Double = 0
    Property B As Double = 0
    Property C As Double = 0
    Property D As Double = 0
    Property E As String = ""

    Public Overrides Function ToString() As String
        Return Join({A.ToString, B.ToString, C.ToString, D.ToString, E}, ",")
    End Function
End Class

Function SortedFields(filename As String) As List(Of Fields)
    SortedFields = New List(Of Fields)
    Dim test As New List(Of Fields)
    Using sr As New IO.StreamReader(filename)
        Do Until sr.EndOfStream
            Dim fieldarray() As String = sr.ReadLine.Split(","c)
            If fieldarray.Length = 5 AndAlso Not fieldarray(4)(0) = "e"c Then
                If fieldarray(4) = "r" Then
                    test.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
                Else
                    SortedFields.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
                End If
            End If
        Loop
    End Using
    SortedFields = SortedFields.OrderBy(Function(x) x.B).Concat(test.OrderByDescending(Function(x) x.B)).ToList
End Function
One simple way of writing the data to a csv file is to use the IO.File.WriteAllLines method and the ConvertAll method of the List:
IO.File.WriteAllLines("coil.csv", SortedFields("textfile1.txt").ConvertAll(New Converter(Of Fields, String)(Function(x As Fields) x.ToString)))
You'll notice how the ToString method facilitates this quite easily.
If the class will only be used for this you do have the option to make all the fields string.
In many of my python projects, I find myself having to go through a file, match lines against regexes, and then perform some computation on the basis of elements from the line extracted by regex.
In pseudo-C code, this is pretty-easy:
while (read(line))
{
    if (m = matchregex(regex1, line))
    {
        /* munch on the components extracted in regex1 by accessing m */
    }
    else if (m = matchregex(regex2, line))
    {
        /* munch on the components extracted in regex2 by accessing m */
    }
    else if ...
    ...
    else
    {
        error("Unrecognized line format");
    }
}
However, because python does not allow an assignment in the conditional of an if, this can't be done elegantly. One could first parse against all the regexes and then do the if on the various match objects, but that is neither elegant nor efficient.
What I find myself doing instead is including code like this at the base level of every project:
import re

im = None
img = None

def imps(p, s):
    global im
    global img
    im = re.search(p, s)
    if im:
        img = im.groups()
        return True
    else:
        img = None
        return False
Then I can work like this:
for line in open(file, 'r').read().splitlines():
    if imps(regex1, line):
        # munch on contents of img
    elif imps(regex2, line):
        # munch on contents of img
    else:
        error('Unrecognised line: {}'.format(line))
That works, is reasonably compact, and easy to type. But it is hardly beautiful; it uses global variables and is not thread safe (which has not been an issue for me so far).
But I'm sure others have run across this problem before and come up with an equally compact, but more python-y and generally superior solution. What is it?
Depends on the needs of the code.
A common choice I use is something like this:
import re

# note, order is important here. The first one to match will exit the processing
parse_regexps = [
    (re.compile(r"^foo"), handle_foo),
    (re.compile(r"^bar"), handle_bar),
]

for regexp, handler in parse_regexps:
    m = regexp.match(line)
    if m:
        handler(line)  # possibly other data too, like m.groups()
        break
else:
    error("Unrecognized format....")
This has the advantage of moving the handling code into clear and obvious functions which makes testing and change easy.
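A quick, self-contained usage sketch of that dispatch table (the handler bodies and sample lines below are purely illustrative):

import re

def handle_foo(line):
    print("foo line:", line)

def handle_bar(line):
    print("bar line:", line)

parse_regexps = [
    (re.compile(r"^foo"), handle_foo),
    (re.compile(r"^bar"), handle_bar),
]

for line in ["foo 1", "bar 2"]:
    for regexp, handler in parse_regexps:
        if regexp.match(line):
            handler(line)
            break
    else:
        raise ValueError("Unrecognized format: " + line)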
You can just use continue:
for line in file:
    m = re.match(re1, line)
    if m:
        # do stuff
        continue
    m = re.match(re2, line)
    if m:
        # do stuff
        continue
    raise BadLine
Another, less obvious, option is to have a function like this:
def match_any(subject, *regexes):
    for n, regex in enumerate(regexes):
        m = re.match(regex, subject)
        if m:
            return n, m
    return -1, None
and then:
for line in file:
    n, m = match_any(line, re1, re2)
    if n == 0:
        ...
    elif n == 1:
        ...
    else:
        raise BadLine
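Another variation on the question's imps() helper that avoids the module-level globals (and so is easier to make thread safe) is to keep the last match on a small object. A hedged sketch; the Matcher name is illustrative and the loop reuses the question's file/regex1/regex2 placeholders:

import re

class Matcher:
    def __init__(self):
        self.groups = None

    def __call__(self, pattern, line):
        # remember the groups of the last successful search on this instance
        m = re.search(pattern, line)
        self.groups = m.groups() if m else None
        return m is not None

match = Matcher()
for line in open(file, 'r').read().splitlines():
    if match(regex1, line):
        pass  # munch on contents of match.groups
    elif match(regex2, line):
        pass  # munch on contents of match.groups
    else:
        raise ValueError('Unrecognised line: {}'.format(line))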
I wrote a simple procedure to calculate the average of the test coverage of some specific packages in a Java project. The raw data in a huge html file is like this:
<body>
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
...
</body>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11+33)/(111+333)
and average branch coverage is:
(44+66)/(444+666)
I wrote the following procedure to get the result and it works well. But how would I implement this calculation in a functional style? Something like "(x,y) for x in ... for b in ... if...". I know a little Erlang, Haskell and Clojure, so solutions in these languages are also appreciated. Thanks a lot!
from __future__ import division
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
    for pkg in core_pkgs:
        ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
        match = ptn.match(line)
        if match is not None:
            cvln, tlln, cvbh, tlbh = match.groups()
            covered_lines += int(cvln)
            total_lines += int(tlln)
            covered_branches += int(cvbh)
            total_branches += int(tlbh)
print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches / total_branches)
Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is in part because it makes it easier to design your code when you can think in terms of whatever you are designing – closely related to functional programming philosophies, and in part because it can eliminate a few bugs where you think you are doing something but are in actuality doing something else.
Related to the point before: The first thing I do is to convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions for each of those tasks, and you do them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The actual workings of the conversion function is actually quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string with a regex, and if it succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverages function works with a lot of specialised loops, so I don't have to write a ton of functions, I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)
corePkgs = ["d", "f"]
stats = [
    "d>11/23d>34/89d",
    "e>25/65e>13/25e",
    "f>36/92f>19/76"
  ]
format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"
-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int
-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)
-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
  Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)
main = do
  -- First we need to convert the strings to coverage data
  let coverageData = convert stats
  -- Then we want to filter out only the relevant data
      relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
  -- Then we need to summarise it, but we are only interested in the numbers
      Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
  -- So we can finally print them!
  printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
  printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for pkg in core_pkgs:
    ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
    matches = map(ptn.match, datafile)
    statsList = [map(int, match.groups()) for match in matches if match]
    # statsList is a list of [cvln, tlln, cvbh, tlbh]
    stats = np.array(statsList)
    covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=0)
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
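For completeness, roughly the same "sweep" can also be written in plain Python with generator expressions and no NumPy at all; a hedged sketch against the question's sample data (the extract helper name is illustrative, and the prints use the {:.2%} style from the question):

import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')

def extract(pkg, line):
    # Returns the four integers for this package's line, or None if it doesn't match.
    m = re.match('.*' + pkg + r'.*>(\d+)/(\d+).*>(\d+)/(\d+).*', line)
    return tuple(int(x) for x in m.groups()) if m else None

# One pass to extract, one pass to sum the columns.
rows = [r for r in (extract(pkg, line) for pkg in core_pkgs for line in datafile) if r]
covered_lines, total_lines, covered_branches, total_branches = (sum(col) for col in zip(*rows))

print('Line coverage: {:.2%}'.format(covered_lines / total_lines))
print('Branch coverage: {:.2%}'.format(covered_branches / total_branches))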
This is the corresponding Clojure solution:
(defn extract-data
  "extract 4 integer from a string line according to a package name"
  [pkg line]
  (map read-string
       (rest (first (re-seq (re-pattern (str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
                            line)))))

(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
   according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))

(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))

(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))

(defn get-cov
  [pkgs lines]
  {:line-cov (apply get-percent (take 2 (sum-data pkgs lines)))
   :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})

(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])