Regular Expression Parsing Key Value Pairs in Namelist Input File - python

I have an input file in Fortran "namelist" format that I would like to parse with Python regular expressions. The easiest way to demonstrate is with a fictitious example:
$VEHICLES
CARS= 1,
TRUCKS = 0,
PLAINS= 0, TRAINS = 0,
LIB='AUTO.DAT',
C This is a comment
C Data variable spans multiple lines
DATA=1.2,2.34,3.12,
4.56E-2,6.78,
$END
$PLOTTING
PLOT=T,
PLOT(2)=12,
$END
So the keys can contain regular variable-name characters as well as parentheses and numbers. The values can be strings, booleans (T, F, .T., .F., TRUE, FALSE, .TRUE., .FALSE. are all possible), integers, floating-point numbers, or comma-separated lists of numbers. Keys are connected to their values with equal signs. Key-value pairs are separated by commas, but can share a line. Values can span multiple lines for long lists of numbers. Comments are any line beginning with a C. There is generally inconsistent spacing before and after '=' and ','.
I have come up with a working regular expression for parsing the keys and values and getting them into an Ordered Dictionary (need to preserve order of inputs).
Here's my code so far. I've included everything from reading the file to saving to a dictionary for thoroughness.
import re
from collections import OrderedDict
# Read the whole input file into one string
with open('file.dat', 'r') as f:
    file_str = f.read()
# Compile regex pattern for requested namelist
name = 'Vehicles'
p_namelist = re.compile(r"\$" + name.upper() + r"(.*?)\$END", flags=re.DOTALL | re.MULTILINE)
# Execute regex on file string and get a list of captured tokens
m_namelist = p_namelist.findall(file_str)
# Check for a valid result
if m_namelist:
    # The text of the desired namelist is the first captured token
    namelist = m_namelist[0]
    # Split into lines
    lines = namelist.splitlines()
    # List comprehension which returns the list of lines that do not start with "C"
    # Effectively remove comment lines
    # Note: this also drops any data line whose first character is "C"
    lines = [item for item in lines if not item.startswith("C")]
    # Re-combine now that comment lines are removed
    namelist = '\n'.join(lines)
    # Create key-value parsing regex
    p_item = re.compile(r"([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)", flags=re.DOTALL | re.MULTILINE)
    # Execute regex
    items = p_item.findall(namelist)
    # Initialize namelist ordered dictionary
    n = OrderedDict()
    # Remove undesired characters from value
    for item in items:
        n[item[0]] = item[1].strip(',\r\n ')
My question is whether I'm going about this correctly. I realize there is a ConfigParser library, which I have not yet attempted. My focus here is the regular expression:
([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)
but I went ahead and included the other code for thoroughness and to demonstrate what I'm doing with it. For my Regular Expression, because the values can contain commas, and the key-value pairs are also separated by commas, there is no simple way to isolate the pairs. I chose to use a forward look-ahead to find the next key and "=". This allows everything between the "=" and the next key to be the value. Finally, because this doesn't work for the last pair, I threw in "|$" into the forward look-ahead meaning that if another "VALUE=" isn't found, look for the end of the string. I figured matching the value with [^=]+ followed by a look-ahead was better than trying to match all possible value types.
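To see what the look-ahead pattern actually captures, here is a minimal sketch that runs it over a hard-coded piece of the example namelist body, with the comment lines already removed. The `parse_pairs` helper and the `sample` string are illustrative names that do not appear in the code above.

```python
import re
from collections import OrderedDict

# The look-ahead pattern from the question: a key, "=", then everything up to
# the next "key =" (or the end of a line / the string).
ITEM_RE = re.compile(
    r"([^\s,=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,=]+\s*=|$)",
    flags=re.DOTALL | re.MULTILINE,
)


def parse_pairs(body):
    """Return an ordered mapping of key -> raw value string (hypothetical helper)."""
    pairs = OrderedDict()
    for key, raw in ITEM_RE.findall(body):
        # Trim the separating comma and surrounding whitespace from the value
        pairs[key] = raw.strip(", \r\n")
    return pairs


sample = (
    "TRUCKS = 0,\n"
    "PLAINS= 0, TRAINS = 0,\n"
    "LIB='AUTO.DAT',\n"
    "DATA=1.2,2.34,3.12,\n"
    "4.56E-2,6.78,\n"
)
result = parse_pairs(sample)
for key, value in result.items():
    print(key, "->", repr(value))
```

Note that the multi-line DATA value keeps its embedded newline, so it still needs to be split on commas afterwards.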
While writing this question I came up with an alternative Regular Expression that takes advantage of the fact that numbers are the only value that can be in lists:
([^\s,\=]+?)\s*=\s*((?:\s*\d[\d\.\E\+\-]*\s*,){2,}|[^=,]+)
This one matches either a list of 2 or more numbers with (?:\s*\d[\d\.\E\+\-]*\s*,){2,} or anything before the next comma with [^=,].
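A small sketch of this alternative pattern in use follows. It is written with a plain `E` in the character class, on the assumption that a current Python 3 `re` is used (which, as far as I know, rejects `\E` as a bad escape). The `ALT_RE` and `sample` names are illustrative only.

```python
import re

# Either a run of two or more comma-terminated numbers, or anything up to
# the next comma or equal sign.
ALT_RE = re.compile(
    r"([^\s,=]+?)\s*=\s*((?:\s*\d[\d.E+\-]*\s*,){2,}|[^=,]+)"
)

sample = (
    "TRUCKS = 0,\n"
    "PLAINS= 0, TRAINS = 0,\n"
    "LIB='AUTO.DAT',\n"
    "DATA=1.2,2.34,3.12,\n"
    "4.56E-2,6.78,\n"
)

# Strip the trailing comma/whitespace that the list branch leaves on the value
pairs = [(key, raw.strip(", \r\n")) for key, raw in ALT_RE.findall(sample)]

# The list value can then be split into numbers; float() ignores the newline
numbers = [float(tok) for tok in dict(pairs)["DATA"].split(",")]
print(pairs)
print(numbers)
```

One limitation of this form is that the number branch only recognises tokens that start with a digit, so a list containing negative values would fall through to the second branch.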
Are these somewhat messy Regular Expressions the best way to parse a file like this?

I would suggest developing a slightly more sophisticated parser.
I stumbled upon a project on Google Code hosting that implements very similar parser functionality: Fortran Namelist parser for Python prog/scripts, but it was built for a slightly different format.
I played with it a little and updated it to support the structure of the format in your example.
Please see my version on gist:
Updated Fortran Namelist parser for python https://gist.github.com/4506282
I hope this parser will help you with your project.
Here is the example output produced by the script after parsing your example input:
{'PLOTTING':
{'par':
[OrderedDict([('PLOT', ['T']), ('PLOT(2) =', ['12'])])],
'raw': ['PLOT=T', 'PLOT(2)=12']},
'VEHICLES':
{'par':
[OrderedDict([('TRUCKS', ['0']), ('PLAINS', ['0']), ('TRAINS', ['0']), ('LIB', ['AUTO.DAT']), ('DATA', ['1.2', '2.34', '3.12', '4.56E-2', '6.78'])])],
'raw':
['TRUCKS = 0',
'PLAINS= 0, TRAINS = 0',
"LIB='AUTO.DAT'",
'DATA=1.2,2.34,3.12',
'4.56E-2,6.78']}}
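Whichever parser is used, the values come back as strings. The following is a rough sketch of converting them to Python types, assuming the boolean spellings listed in the question and ordinary `E`-style exponents (Fortran `D` exponents are not handled). The `convert_scalar` and `convert_value` names are hypothetical and are not part of the gist.

```python
def convert_scalar(token):
    """Convert one namelist token to bool, str, int or float (hypothetical helper)."""
    token = token.strip()
    # T, F, .T., .F., TRUE, FALSE, .TRUE., .FALSE. in any letter case
    bare = token.upper().strip(".")
    if bare in ("T", "TRUE"):
        return True
    if bare in ("F", "FALSE"):
        return False
    # Quoted strings lose their surrounding quotes
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return token[1:-1]
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return float(token)
    except ValueError:
        return token  # leave anything unrecognised untouched


def convert_value(raw):
    """Convert a raw value that may be a comma-separated list spanning lines."""
    raw = raw.strip()
    if raw.startswith(("'", '"')):
        return convert_scalar(raw)
    parts = [p for p in raw.replace("\n", " ").split(",") if p.strip()]
    values = [convert_scalar(p) for p in parts]
    return values[0] if len(values) == 1 else values


print(convert_value("0"))
print(convert_value(".TRUE."))
print(convert_value("'AUTO.DAT'"))
print(convert_value("1.2,2.34,3.12,\n4.56E-2,6.78"))
```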

Related

In python, find tokens in line

A long time ago I wrote a tool that parses text files line by line and does some stuff, depending on commands and conditions in the file.
I used regex for this; however, I was never good at regex.
A line holding a condition looks like this:
[type==STRING]
And the regex I use is:
re.compile(r'^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
This regex gives me the keyword "type" and the value "STRING".
However, now I need to update my tool to have more conditions in one line, e.g.
[type==STRING][amount==0]
I need to update my regex to get me two pairs of results, one pair type/STRING and one pair amount/0.
But I'm lost on this. My regex above gets me zero results with this line.
Any ideas how to do this?
You could either match a second pair of groups:
^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*(?:\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*)?$
Or you can omit the anchors and the [^\[\]]* part to get the group1 and group 2 values multiple times:
\[([^\]\[=]*)==([^\]\[=]*)\]
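As a quick illustration of the unanchored pattern, `re.findall` returns one tuple per bracketed condition. The `COND_RE` and `line` names below are only for this sketch.

```python
import re

# Unanchored pattern: one match per [key==value] block on the line
COND_RE = re.compile(r"\[([^\]\[=]*)==([^\]\[=]*)\]")

line = "[type==STRING][amount==0]"
pairs = COND_RE.findall(line)
print(pairs)  # [('type', 'STRING'), ('amount', '0')]
```

Because the anchors are gone, this no longer checks that the line contains nothing but conditions; if that matters, the line can be validated separately.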
Is it a requirement that you use regex? You can alternatively accomplish this pretty easily using the split function twice and stripping the first opening and last closing bracket.
line_to_parse = "[type==STRING]"
# omit the first and last char before splitting
pairs = line_to_parse[1:-1].split("][")
for pair in pairs:
    x, y = pair.split("==")
Rather depends on the precise "rules" that describe your data. However, for your given data why not:
import re
text = '[type==STRING][amount==0]'
words = re.findall(r'\w+', text)
lst = []
for i in range(0, len(words), 2):
    lst.append((words[i], words[i+1]))
print(lst)
Output:
[('type', 'STRING'), ('amount', '0')]

Get all substrings between two different start and ending delimiters

I am trying in Python 3 to get a list of all substrings of a given String a, which start after a delimiter x and end right before a delimiter y.
I have found solutions which only get me the first occurrence, but the result needs to be a list of all occurrences.
start = '>'
end = '</'
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print((s.split(start))[1].split(end)[0])
The above example is what I've got so far, but I am searching for a more elegant and stable way to get all the occurrences.
So the expected list would contain the JavaScript code as the following entries:
a=eval;b=alert;a(b(/XSS/.source));
a=eval;b=alert;a(b(/XSS/.source));
Looking for patterns in strings seems like a decent job for regular expressions.
This should return a list of anything between a pair of <script> and </script>:
import re
pattern = re.compile(r'<script>(.*?)</script>')
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print(pattern.findall(s))
Result:
['a=eval;b=alert;a(b(/XSS/.source));', 'a=eval;b=alert;a(b(/XSS/.source));']
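The same idea can be generalised to arbitrary start and end delimiters by escaping them with `re.escape`. This is a minimal sketch; the `between` function name is made up for this example.

```python
import re


def between(text, start, end):
    """Return every substring found between start and end (hypothetical helper)."""
    # re.escape makes delimiters such as "[" or "</" safe to embed in a pattern;
    # the lazy (.*?) stops at the first end delimiter after each start.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return pattern.findall(text)


s = ('<script>a=eval;b=alert;a(b(/XSS/.source));</script>'
     '<script>a=eval;b=alert;a(b(/XSS/.source));</script>'
     '\'"><marquee><h1>XSS by Xylitol</h1></marquee>')
scripts = between(s, '<script>', '</script>')
print(scripts)
print(between("[a] x [b]", "[", "]"))
```

Using the short delimiters `>` and `</` from the question would also match the text between other tags, so the more specific `<script>` and `</script>` are used here.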

Check if a variable substring is in a string

I receive an input string having values expressed in two possible formats. E.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether the short format is used and, if so, convert it to the extended format.
Considering that the string could be composed of multiple values, i.e.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
    #then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
    #extract the number
    #replace ":{number}" with the extended format
I know how to code the replacing part.
I need help implementing the if condition: in my mind, I model it as a variable substring, in which the variable part is the number inside it, while the rigid format is the $(value name) + ":" part.
"some_value":19
^            ^
rigid format variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and I can convert it into a dictionary, easily accessing then the values.
Indeed, I already have this solution in my code. But I don't like it since the input string could be multilevel and I need to iterate on the leaf values of the resulting dictionary, independently from the dictionary levels. The latter is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.
If you replace every key (other than t0 and tf) that is followed by a number, it should work.
Here is an example on a multilevel string, which could probably be put in better shape:
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
# Group 1 is the quoted key plus colon, group 2 sits inside the lookahead, group 3 is the number
gex = r'("(?!(t0|tf)")\w+":)\s*(\d+)'
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
# -> "interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}
You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by any number of digits.
Let's test this
import re
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
If you want any Unicode standard word characters and not only interval you can use the special sequence \w and because it could be multiple characters the expression will be \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty" but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+") if it's preceded by { so we're going to use a negative lookbehind assertion: (?<!{)("\w+"). And after the number part ([.\d]+) we don't want a } or another digit so we're using negative lookahead assertions here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
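For comparison, the parse-then-recurse route mentioned in the question's edit can stay fairly short. The sketch below assumes that wrapping the input in braces yields valid JSON and that any dictionary whose keys are exactly `t0` and `tf` is already in extended format. The `extend` function name is made up for this example.

```python
import json


def extend(node):
    """Recursively replace bare numbers with {"t0": n, "tf": n} (hypothetical helper)."""
    if isinstance(node, dict):
        # Assumption: a dict with exactly these two keys is already extended
        if set(node) == {"t0", "tf"}:
            return node
        return {key: extend(value) for key, value in node.items()}
    if isinstance(node, list):
        return [extend(value) for value in node]
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(node, (int, float)) and not isinstance(node, bool):
        return {"t0": node, "tf": node}
    return node


data = '"interval":19,"interval2":{"t0":10,"tf":15},"deeper":{"other":23.5}'
parsed = json.loads("{" + data + "}")
extended = extend(parsed)
# Serialise back without the enclosing braces that were added for parsing
print(json.dumps(extended, separators=(",", ":"))[1:-1])
```

The recursion handles any nesting depth, which is the part that becomes hard to express as more constraints are added to a regular expression.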

Parse in a file with links Python

I have to parse a file that has a lot of links. An example of how it looks:
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=11908675">colors</p></hm>
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=45103481">yelloW</p></hm>
<td>I have a dream, and it is all good 2</hm>
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=40984930">orangE</p></hm>
<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
pls/facebook?funn=wordlis&sys;sys;colorsdif_id=90648361">pinK</p></hm>
I only have to keep the words that are in the position of >colors<, so I also want >yelloW<, >orangE< and >pinK<.
In this example, the common part between them is the whole link, except for the number (the id, which is different in every link) and the word.
After finding all the words I want to save them in a dictionary that uses the first element as the key and the others as its values, so the final result will be:
d = {"colors": ["yelloW", "orangE", "pinK"]}
You can try something like this, where ree is assumed to hold the contents of the file as a string:
import re
re.findall(r"http://[^>]+>(\w+)",ree)
Where:
[^>]+ - matches any characters except >
\w+ - matches one or more word characters (letters, digits, underscore)
(..) - returns the group between parentheses
Also, Python dictionaries don't support identical keys. You can look at this question.
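Putting the pieces together, here is a sketch that builds the requested dictionary from a shortened copy of the sample data. The `text` variable stands in for the file contents, and taking the first word as the key is the rule stated in the question.

```python
import re

# A shortened copy of the sample input, including the wrapped URLs
text = (
    '<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-\n'
    'pls/facebook?funn=wordlis&sys;sys;colorsdif_id=11908675">colors</p></hm>\n'
    '<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-\n'
    'pls/facebook?funn=wordlis&sys;sys;colorsdif_id=45103481">yelloW</p></hm>\n'
    '<td>I have a dream, and it is all good 2</hm>\n'
    '<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-\n'
    'pls/facebook?funn=wordlis&sys;sys;colorsdif_id=40984930">orangE</p></hm>\n'
)

# [^>]+ also matches the newline inside each wrapped URL
words = re.findall(r"http://[^>]+>(\w+)", text)

# First word becomes the key, the remaining words become its list of values
d = {words[0]: words[1:]} if words else {}
print(d)  # {'colors': ['yelloW', 'orangE']}
```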

regular expression that reference a match from earlier part of expression

I'm looking for a regular expression that will identify a sequence in which an integer in the text specifies the number of trailing letters at the end of the expression. This specific example applies to identifying insertions and deletions in genetic data in the pileup format.
For example:
If the text I am searching is:
AtT+3ACGTTT-1AaTTa
I need to match the insertions and deletions, which in this case are +3ACG and -1A. The integer (n) portion can be any integer larger than 1, and I must capture the n trailing characters.
I can match a single insertion or deletion with [+-]?[0-9]+[ACGTNacgtn], but I can't figure out how to grab the exact number of trailing ACGTN's specified by the integer.
I apologize if there is an obvious answer here, I have been searching for hours. Thanks!
(UPDATE)
I typically work in Python. The one workaround I've been able to figure out with the re module in python is to call both the integers and span of every in/del and combine the two to extract the appropriate length of text.
For example:
>>> import re
>>> a = 'ATTAA$At^&atAA-1A+1G+4ATCG'
>>> expr = '[+-]?([0-9]+)[ACGTNacgtn]'
>>> ints = re.findall(expr, a) #returns a list of the integers
>>> spans = [i.span() for i in re.finditer(expr,a)]
>>> newspans = [(spans[i][0],spans[i][1]+(int(ints[i])-1)) for i in range(len(spans))]
>>> newspans
[(14, 17), (17, 20), (20, 26)]
The resulting tuples allow me to slice out the indels. Probably not the best syntax, but it works!
You can use regular expression substitution passing a function as replacement... for example
s = "abcde+3fghijkl-1mnopqr+12abcdefghijklmnoprstuvwxyz"
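import re
def dump(match):
    start, end = match.span()
    # The digits after the sign give the number of trailing characters to include
    print(s[start:end + int(s[start+1:end])])
    # Return the matched text so that re.sub leaves the string unchanged
    return match.group()
re.sub(r'[-+]\d+', dump, s)
#output
# +3fgh
# -1m
# +12abcdefghijkl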
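If a list of the indels is wanted rather than printed output, the same two-step idea (match the sign and integer, then slice that many characters) can be written as a loop over `search`. This is only a sketch; the `find_indels` and `INDEL_RE` names are invented for illustration.

```python
import re

# A sign followed by the length of the inserted/deleted sequence
INDEL_RE = re.compile(r"[+-](\d+)")


def find_indels(pileup):
    """Return each indel as sign + count + that many following characters."""
    indels = []
    pos = 0
    while True:
        match = INDEL_RE.search(pileup, pos)
        if match is None:
            break
        length = int(match.group(1))
        end = match.end() + length
        indels.append(pileup[match.start():end])
        # Resume after the sequence so its characters are not rescanned
        pos = end
    return indels


print(find_indels("AtT+3ACGTTT-1AaTTa"))          # ['+3ACG', '-1A']
print(find_indels("ATTAA$At^&atAA-1A+1G+4ATCG"))  # ['-1A', '+1G', '+4ATCG']
```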
It's not directly possible, regexes can't 'count' like that.
But if you're using a programming language that allows callbacks as a regex match evaluator (e.g. C#, PHP), then what you could do is have the regex as [+-]?([0-9]+)([ACGTNacgtn]+) and in the callback trim the trailing characters to the desired length.
e.g. for C#
var regexMatches = new List<string>();
Regex theRegex = new Regex(@"[+-]?([0-9]+)([ACGTNacgtn]+)");
text = theRegex.Replace(text, delegate(Match thisMatch)
{
    int numberOfInsertsOrDeletes = Convert.ToInt32(thisMatch.Groups[1].Value);
    string trailingString = thisMatch.Groups[2].Value;
    // Trim the greedy capture down to the number of characters announced by the integer
    if (numberOfInsertsOrDeletes < trailingString.Length)
    { trailingString = trailingString.Substring(0, numberOfInsertsOrDeletes); }
    regexMatches.Add(trailingString);
    return thisMatch.Groups[0].Value;
});
The simple Perl pattern for matching an integer followed by that number of any character is just:
(\d+)(??{"." x $1})
which is quite straight-forward, I think you’ll agree. For example, this snippet:
my $string = "AtT+3ACGTTT-1AaTTa";
print "Matched $&\n" while $string =~ m{
( \d+ ) # capture an integer into $1
(??{ "." x $1 }) # interpolate that many dots back into pattern
}xg;
Merrily prints out the expected
Matched 3ACG
Matched 1A
EDIT
Oh drat, I see you just added the Python tag since I began editing. Oops. Well, maybe this will be helpful to you anyway.
That said, if what you are actually looking for is fuzzy matching where you allow for some number of insertions and deletions (the edit distance), then Matthew Barnett’s regex library for Python will handle that. That doesn’t seem to be quite what you’re doing, as the insertions and deletions are actually represented in your strings.
But Matthew’s library is really very good and very interesting, and it even does many things that Perl cannot do. :) It’s a drop-in replacement for the standard Python re library.
