I'm trying to parse a file, or rather some portions of it. The file contains information about the hardware in a server, and each line starts with a keyword denoting the type of hardware. For example:
pci24 u2480-L0
fcs1 g4045-L1
pci25 h6045-L0
en192 v7024-L3
pci26 h6045-L1
The example above isn't a real file, but it's simple enough to demonstrate the need. I want to parse only the lines starting with "pci" and skip the others. I wrote a grammar for lines starting with "pci":
grammar_pci = Group ( Word( "pci" + nums ) + Word( alphanums + "-" ) )
I've also written a grammar for lines not starting with "pci":
grammar_non_pci = Suppress( Regex( r"(?!pci)" ) )
And then built a grammar that combines the two:
grammar = ( grammar_pci | grammar_non_pci )
Then I read the file and pass it to parseString:
with open("foo.txt", "r") as f:
    data = grammar.parseString(f.read())
print(data)
But no data is written as output. What am I missing? How can I parse the data while skipping the lines that don't start with a specific keyword?
Thanks.
You are off to a good start, but you are missing a few steps, mostly having to do with filling in gaps and repetition.
First, look at your expression for grammar_non_pci:
grammar_non_pci = Suppress( Regex( r"(?!pci)" ) )
This correctly detects a line that does not start with "pci", but it doesn't actually parse the line's content.
The easiest way to fix this is to add ".*" to the regex, so that it matches not only the "not starting with pci" lookahead, but also the rest of the line.
grammar_non_pci = Suppress( Regex( r"(?!pci).*" ) )
Second, your grammar just processes a single instance of an input line.
grammar = ( grammar_pci | grammar_non_pci )
grammar needs to match repeatedly:
grammar = OneOrMore( grammar_pci | grammar_non_pci, stopOn=StringEnd())
[EDIT: since you are up to pyparsing 3.0.9, this can also be written as follows]
grammar = (grammar_pci | grammar_non_pci)[1, ...: StringEnd()]
Since grammar_non_pci could actually match on an empty string, it could repeat forever at the end of the file - that's why the stopOn argument is needed.
With these changes, your sample text should parse correctly.
But there is one issue that you'll need to clean up, and that is the definition of the "pci"-prefixed word in grammar_pci.
grammar_pci = Group ( Word( "pci" + nums ) + Word( alphanums + "-" ) )
Pyparsing's Word class takes 1 or 2 strings of characters, and uses them as a set of the valid characters for the initial word character and the body word characters. "pci" + nums gives the string "pci0123456789", and will match any word group using any of those characters. So it will match not only "pci00" but also "cip123", "cci123", "p0c0i", or "12345".
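A quick sketch showing the problem:

```python
from pyparsing import Word, nums

# Word("pci" + nums) only defines a set of allowed characters, so words
# that merely *use* those characters match, even without a "pci" prefix.
loose = Word("pci" + nums)
print(loose.parseString("cip123")[0])  # -> 'cip123'
print(loose.parseString("12345")[0])   # -> '12345'
```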
To resolve this, use "pci" + Word(nums) wrapped in Combine to represent only word groups that start with "pci":
grammar_pci = Group ( Combine("pci" + Word( nums )) + Word( alphanums + "-" ) )
Since you seem comfortable using Regex items, you could also write this as
grammar_pci = Group ( Regex(r"pci\d+") + Word( alphanums + "-" ) )
These changes should get you moving forward on your parser.
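Putting all the pieces together, a complete sketch (assuming pyparsing 3.x, with the sample text inlined in place of the file):

```python
from pyparsing import (Combine, Group, OneOrMore, Regex, StringEnd,
                       Suppress, Word, alphanums, nums)

# "pci" + digits, glued together with Combine, followed by the location word
grammar_pci = Group(Combine("pci" + Word(nums)) + Word(alphanums + "-"))
# any line not starting with "pci" is matched and suppressed
grammar_non_pci = Suppress(Regex(r"(?!pci).*"))
grammar = OneOrMore(grammar_pci | grammar_non_pci, stopOn=StringEnd())

sample = """\
pci24 u2480-L0
fcs1 g4045-L1
pci25 h6045-L0
en192 v7024-L3
pci26 h6045-L1"""

print(grammar.parseString(sample).asList())
# -> [['pci24', 'u2480-L0'], ['pci25', 'h6045-L0'], ['pci26', 'h6045-L1']]
```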
Read the file one line at a time; if a line starts with pci, add it to the list data, otherwise discard it:
data = []
with open("foo.txt", "r") as f:
    for line in f:
        if line.startswith('pci'):
            data.append(line)
print(data)
If you still need to do further parsing with your grammar, you can now parse the list data, knowing that each item does indeed start with pci.
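If the remaining parsing is simple, you may not need a grammar at all. A hedged sketch, assuming the two-column format from the question (the filtered lines are inlined here for the demo):

```python
# The pci lines collected by the loop above
data = ["pci24 u2480-L0\n", "pci25 h6045-L0\n", "pci26 h6045-L1\n"]

# Each line is just "keyword location", so str.split is enough
parsed = [line.split() for line in data]
print(parsed)  # -> [['pci24', 'u2480-L0'], ['pci25', 'h6045-L0'], ['pci26', 'h6045-L1']]
```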
I have a function where the user passes in a file and a string, and the code should get rid of the specified delimiters. I am having trouble finishing the part where I loop through the lines and remove each of the replacements. I will post the code below.
def forReader(filename):
    try:
        # Opens up the file
        file = open(filename, "r")
        # Reads the lines in the file
        read = file.readlines()
        # Closes the file
        file.close()
        # Loops through the lines in the file
        for sentence in read:
            # Will split each element by a space
            line = sentence.split()
            replacements = (',', '-', '!', '?' '(' ')' '<' ' = ' ';')
            # Will loop through the space-delimited line and get rid
            # of the replacements
            for sentences in line:
    # Exception thrown if the file does not exist
    except FileExistsError:
        print('File is not created yet')

forReader("mo.txt")
mo.txt
for ( int i;
After running the file mo.txt, I would like the output to look like this:
for int i
Here's a way to do this using regex. First, we create a pattern consisting of all the delimiter characters, being careful to escape them, since several of those characters have special meaning in a regex. Then we can use re.sub to replace each delimiter with an empty string. This process can leave us with two or more adjacent spaces, which we then need to replace with a single space.
The Python re module allows us to compile patterns that are used frequently. Theoretically, this can make them more efficient, but it's a good idea to test such patterns against real data to see if it does actually help. :)
import re
delimiters = ',-!?()<=;'
# Make a pattern consisting of all the delimiters
pat = re.compile('|'.join(re.escape(c) for c in delimiters))
s = 'for ( int i;'
# Remove the delimiters
z = pat.sub('', s)
# Clean up any runs of 2 or more spaces
z = re.sub(r'\s{2,}', ' ', z)
print(z)
output
for int i
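For single-character delimiters, str.translate is a regex-free alternative worth considering; note this sketch deletes '=' on its own rather than the ' = ' sequence the question listed:

```python
delimiters = ',-!?()<=;'

# Build a translation table that deletes every delimiter character
table = str.maketrans('', '', delimiters)

s = 'for ( int i;'
# Delete the delimiters, then collapse runs of spaces via split/join
z = ' '.join(s.translate(table).split())
print(z)  # -> 'for int i'
```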
I'm trying to parse some fields from a multi-line file, of which I'm only interested in some lines, while others I would like to skip. Here is an example of something similar to what I'm trying to do:
from pyparsing import *
string = "field1: 5\nfoo\nbar\nfield2: 42"
value1 = Word(nums)("value1")
value2 = Word(nums)("value2")
not_field2 = Regex(r"^(?!field2:).*$")
expression = "field1:" + value1 + LineEnd() + OneOrMore(not_field2)+ "field2:" + value2 + LineEnd()
tokens = expression.parseString(string)
print tokens["value1"]
print tokens["value2"]
where the Regex for a line not starting with field2: is adapted from Regular expression for a string that does not start with a sequence. However, running this example script gives a
pyparsing.ParseException: Expected Re:('^(?!field2:).*$') (at char 10), (line:2, col:1)
I would like the value2 to end up as 42, regardless of the number of lines (foo\n and bar\n in this case). How can I achieve that?
The '^' and '$' characters in your Regex aren't interpreted on a line-by-line basis by pyparsing, but in the context of the whole string being parsed. So '^' will match only at the very beginning of the string and '$' only at the very end.
Instead you can do:
not_field2 = LineStart() + Regex(r"(?!field2:).*")
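The same anchor behavior is easy to demonstrate with the re module directly:

```python
import re

s = "field1: 5\nfoo\nbar\nfield2: 42"

# Without re.MULTILINE, '^' and '$' anchor only at the start/end of the
# whole string, so the per-line pattern finds nothing here.
print(re.findall(r"^(?!field2:).*$", s))        # -> []

# With re.MULTILINE they anchor at every line boundary.
print(re.findall(r"^(?!field2:).*$", s, re.M))  # -> ['field1: 5', 'foo', 'bar']
```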
I have a string in the format
,xys=2/3,
d=e,
b*y,
b/e
I want to fetch xys=2/3 and b/e.
Right now I have a regular expression which picks up just 2/3 and b/e.
pattern = r'(\S+)\s*(?<![;|<|#])/\s*(\S+)'
regex = re.compile(pattern, re.DOTALL)
for result in regex.findall(data):
    f.write("Division " + str(result) + "\n\n\n")
How can I modify it to pick what I intend?
Match anything but , (or newlines) up until the first slash /: [^,/\n]*/
Match the remaining text up to the next comma: [^,\n]*
Put the two together: [^,/\n]*/[^,\n]*
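A quick check of the combined pattern against the sample input:

```python
import re

data = ",xys=2/3,\nd=e,\nb*y,\nb/e\n"

# [^,/\n]*/[^,\n]* : text without commas/slashes, a slash, then text up to the next comma
matches = re.findall(r"[^,/\n]*/[^,\n]*", data)
print(matches)  # -> ['xys=2/3', 'b/e']
```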
No need for regular expressions.
s = """,xys=2/3,
d=e,
b*y,
b/e
"""
l = s.split("\n")
for line in l:
    if '/' in line:
        print(line.strip(","))
Will this work:
x.split(",")[1].split('\n')[0] if "," in x[:-1] else None
It evaluates to None unless , appears somewhere before the last character; otherwise it extracts the part after the first , and trims it at the first newline, if any.
I am trying to parse the following text using pyparsing.
acp (SOLO1,
"solo-100",
"hi here is the gift"
"Maximum amount of money, goes",
430, 90)
jhk (SOLO2,
"solo-101",
"hi here goes the wind."
"and, they go beyond",
1000, 320)
I have tried the following code but it doesn't work.
flag = Word(alphas+nums+'_'+'-')
enclosed = Forward()
nestedBrackets = nestedExpr('(', ')', content=enclosed)
enclosed << (flag | nestedBrackets)
print list(enclosed.searchString (str1))
The comma(,) within the quotation is producing undesired results.
Well, I might have oversimplified slightly in my comments - here is a more complete answer.
If you don't really have to deal with nested data items, then a single-level parenthesized
data group in each section will look like this:
LPAR, RPAR = map(Suppress, "()")
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)

# treat consecutive quoted strings as one combined string
quoted_string = OneOrMore(quotedString)

# add parse action to concatenate multiple adjacent quoted strings
quoted_string.setParseAction(
    lambda t: '"' + ''.join(s.strip('"\'') for s in t) + '"' if len(t) > 1 else t[0])

data_item = ident | integer | quoted_string

# section defined with no nesting
section = ident + Group(LPAR + delimitedList(data_item) + RPAR)
I wasn't sure if it was intentional or not when you omitted the comma between
two consecutive quoted strings, so I chose to implement logic like Python's compiler,
in which two quoted strings are treated as just one longer string, that is "AB CD " "EF" is
the same as "AB CD EF". This was done with the definition of quoted_string, and adding
the parse action to quoted_string to concatenate the contents of the 2 or more component
quoted strings.
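The concatenation logic can be seen in isolation; a minimal sketch:

```python
from pyparsing import OneOrMore, quotedString

# Adjacent quoted strings are combined into one, like Python's own
# implicit string-literal concatenation.
quoted_string = OneOrMore(quotedString)
quoted_string.setParseAction(
    lambda t: '"' + ''.join(s.strip('"\'') for s in t) + '"' if len(t) > 1 else t[0])

print(quoted_string.parseString('"AB CD " "EF"')[0])  # -> "AB CD EF"
```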
Finally, we create a parser for the overall group
results = OneOrMore(Group(section)).parseString(source)
results.pprint()
and get from your posted input sample:
[['acp',
['SOLO1',
'"solo-100"',
'"hi here is the giftMaximum amount of money, goes"',
'430',
'90']],
['jhk',
['SOLO2',
'"solo-101"',
'"hi here goes the wind.and, they go beyond"',
'1000',
'320']]]
If you do have nested parenthetical groups, then your section definition can be
as simple as this:
# section defined with nesting
section = ident + nestedExpr()
Although as you have already found, this will retain the separate commas as if they
were significant tokens instead of just data separators.
I've searched but didn't quite find something for my case. Basically, I'm trying to split the following line:
(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-:INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -:RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+:RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)
You can read this as "CU is NOT DIVD or WEXP or DIVD-", and so on. What I'd like to do is split this line, if it's over 65 characters, into something more manageable like this:
(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-)
(CU!INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT-)
(CU!RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+)
(CU!RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)
They're all less than 65 characters. This can be stored in a list and I can take care of the rest. I'm starting to work on this with RegEx but I'm having a bit of trouble.
Additionally, it can also have the following conditionals:
!
<
>
=
!=
!<
!>
As of now, I have this:
def FilterParser(iteratorIn, headerIn):
    listOfStrings = []
    for eachItem in iteratorIn:
        if len(str(eachItem.text)) > 65:
            exmlLogger.error('The length of filter' + eachItem.text + ' exceeds the limit and will be dropped')
            pass
        else:
            listOfStrings.append(rightSpaceFill(headerIn + EXUTIL.intToString(eachItem), 80))
    return ''.join(stringArray)
Here is a solution using regex, edited to include the CU! prefix (or any other prefix) to the beginning of each new line:
import re
s = '(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-:INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -:RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+:RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)'
prefix = '(' + re.search(r'\w+(!?[=<>]|!)', s).group(0)
maxlen = 64 - len(prefix) # max line length of 65, prefix and ')' will be added
regex = re.compile(r'(.{1,%d})(?:$|:)' % maxlen)
lines = [prefix + line + ')' for line in regex.findall(s[len(prefix):-1])]
>>> print '\n'.join(lines)
(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-)
(CU!INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -)
(CU!RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+)
(CU!RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)
First we need to grab the prefix, which we do using re.search().group(0); group(0) returns the entire match. Each of the final lines should be at most 65 characters. The regex that we will use to get these lines will not include the prefix or the closing parenthesis, which is why maxlen is 64 - len(prefix).
Now that we know the most characters we can match, the first part of the regex, (.{1,<maxlen>}), will match at most that many characters. The portion at the end, (?:$|:), makes sure that we only split the string on colons or at the end of the string. Since there is only one capturing group, regex.findall() will return only that group, leaving off the trailing colon. Here is what it looks like for your sample string:
>>> pprint.pprint(regex.findall(s[len(prefix):-1]))
['DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-',
'INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -',
'RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+',
'RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD']
The list comprehension is used to construct a list of all of the lines by adding the prefix and the trailing ) to each result. The slicing of s is done so that the prefix and the trailing ) are stripped off of the original string before regex.findall(). Hope this helps!