pyparsing: skip to the next token ignoring everything in between

I am trying to parse a log file that contains multiple entries with the following format:
ITEM_BEGIN item_name
some_text
some_text may optionally contain an expression matched by my_expr anywhere within it. I am only interested in item_name and my_expr (or None if it is missing). Ideally, I want a list of (item_name, my_expr) pairs. What is the best way to extract this information using pyparsing?

If you are not trying to define a parser for the entire input text, but only some pieces of it, look into using pyparsing's searchString or scanString methods - something along these lines:
import pyparsing as pp

ident = pp.Word(pp.alphas, pp.alphanums + '_')
item_header = pp.Keyword("ITEM_BEGIN") + ident("name")
other_expr = ...  # whatever expression matches my_expr
search_expr = item_header | other_expr

found = {}
current_name = ''
for result in search_expr.searchString(input_text):  # one ParseResults per match
    if result[0] == "ITEM_BEGIN":
        print("found an item header with name {name}".format_map(result))
        current_name = result.name
        found[current_name] = []
    else:
        # found an other_expr match
        found[current_name].append(result.asList())
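If it helps, here is a minimal self-contained sketch of the same approach. The my_expr pattern and the sample input are hypothetical stand-ins (the question does not say what my_expr looks like); here it is assumed to be an adjacent KEY=VALUE token:

import pyparsing as pp

ident = pp.Word(pp.alphas, pp.alphanums + '_')
item_header = pp.Keyword("ITEM_BEGIN") + ident("name")
other_expr = pp.Combine(ident + '=' + ident)  # hypothetical stand-in for my_expr
search_expr = item_header | other_expr

input_text = """\
ITEM_BEGIN first
some text with marker=abc inside
ITEM_BEGIN second
nothing of interest here
"""

found = {}
current_name = ''
for result in search_expr.searchString(input_text):
    if result[0] == "ITEM_BEGIN":
        current_name = result.name
        found[current_name] = []
    else:
        found[current_name].append(result[0])

# Reduce to the (item_name, my_expr-or-None) pairs the question asks for
pairs = [(name, exprs[0] if exprs else None) for name, exprs in found.items()]
print(pairs)  # [('first', 'marker=abc'), ('second', None)]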

Related

re.sub() gives NameError when no match

So I'm trying to search and replace rows of text from a CSV file, and I keep getting errors if re.sub() can't find any matches.
Say if the text in a row is
text = "a00123 一二三四五"
And my codes are
import re

html = "www.abcdefg.com/"
text = "a00123 一二三四五"
namelist_raw = re.sub(r'([a-z])00(\d{3})', r'\1-\2', text)
p = re.findall(r'\w', namelist_raw)
if p:
    q = re.findall(r'([a-z]-\d{3})', namelist_raw)
    for namelist in q:
        print(namelist)
else:
    namelist = "failed"
link = html + namelist
print(link)
So for this I should be getting a result of
www.abcdefg.com/a-123
and that works without a problem.
But if the text is something like this:
text = "asdfdsdfd123 一二三四五"
I'll get a NameError saying name 'namelist' is not defined.
Why is that? I thought the if/else statement already covers this: if nothing matches, namelist is set to "failed".
Your p = re.findall(r'\w', namelist_raw) extracts every word character from the string, so it is almost always non-empty; you do not need that check at all.
Next, namelist is only assigned inside the loop when there is a match for [a-z]-\d{3}. If p is non-empty but that second pattern finds nothing, the loop body never runs and namelist is never defined. You need to account for that scenario, too.
Use
import re

html = "www.abcdefg.com/"
text = "a00123 一二三四五"
p = re.findall(r'([a-z])00(\d{3})', text)  # Extract a list of tuples
namelist = []  # Init the list
for letter, number in p:
    namelist.append(f"{letter}-{number}")  # Populate namelist with formatted tuple values
if len(namelist):  # If there was a match
    namelist = "/".join(namelist)  # Create a string by joining namelist items with /
else:
    namelist = "failed"  # Else, assign failed to the namelist
link = html + namelist
print(link)
See the Python demo.
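For the two inputs from the question, a quick sanity check (the helper function is just for this demo, not part of the answer above):

import re

def build_link(text, html="www.abcdefg.com/"):
    # Same logic as above, wrapped so both inputs can be tested
    p = re.findall(r'([a-z])00(\d{3})', text)
    namelist = [f"{letter}-{number}" for letter, number in p]
    return html + ("/".join(namelist) if namelist else "failed")

print(build_link("a00123 一二三四五"))        # www.abcdefg.com/a-123
print(build_link("asdfdsdfd123 一二三四五"))  # www.abcdefg.com/failed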

Regex that grabs variable number of groups

This is not a question asking how to use re.findall() or the global modifier (?g) or \g. This is asking how to match n groups with one regex expression, with n between 3 and 5.
Rules:
needs to ignore lines with first non-space character as # (comments)
needs to get at least three items, always in order: ITEM1, ITEM2, ITEM3
class ITEM1(stuff)
model = ITEM2
fields = (ITEM3)
needs to get any of the following matches if they exist (UNKNOWN order, and can be missing)
write_once_fields = (ITEM4)
required_fields = (ITEM5)
needs to know which match is which, so either retrieve matches in order, returning None if there is no match, or retrieve pairs.
My question is if this is doable, and how?
I've gotten this far, but it doesn't handle comments, unknown order, or missing items, and it doesn't stop searching when the next class definition begins: https://www.regex101.com/r/cG5nV9/8
(?s)\nclass\s(.*?)(?=\()
.*?
model\s=\s(.*?)\n
.*?
(?=fields.*?\((.*?)\))
.*?
(?=write_once_fields.*?\((.*?)\))
.*?
(?=required_fields.*?\((.*?)\))
Do I need a conditional?
Thanks for any kinds of hints.
I'd do something like:
from collections import defaultdict
import re

comment_line = re.compile(r"\s*#")
matches = defaultdict(dict)

with open('path/to/file.txt') as inf:
    d = {}  # should catch and dispose of any matching lines
            # not related to a class
    for line in inf:
        if comment_line.match(line):
            continue  # skip this line
        if line.startswith('class '):
            classname = line.split()[1]
            d = matches[classname]
        if line.startswith('model'):
            d['model'] = line.split('=')[1].strip()
        if line.startswith('fields'):
            d['fields'] = line.split('=')[1].strip()
        if line.startswith('write_once_fields'):
            d['write_once_fields'] = line.split('=')[1].strip()
        if line.startswith('required_fields'):
            d['required_fields'] = line.split('=')[1].strip()
You could probably do this more easily with regex matching.
comment_line = re.compile(r"\s*#")
class_line = re.compile(r"class (?P<classname>\w+)")  # the named group needs a pattern to capture
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) +
                       r")\s+=\s+(?P<value>.*)")

with open( ...
    d = {}  # default catcher as above
    for line in ...
        if comment_line.match(line):
            continue
        class_match = class_line.match(line)
        if class_match:
            d = matches[class_match.group('classname')]
            continue  # there won't be more than one match per line
        data_match = data_line.match(line)
        if data_match:
            key, value = data_match.group('key'), data_match.group('value')
            d[key] = value
But this might be harder to understand. YMMV.
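For example, here is a small self-contained check of the regex version; the sample input is hypothetical, modeled on the snippets in the question:

from collections import defaultdict
import re

comment_line = re.compile(r"\s*#")
class_line = re.compile(r"class (?P<classname>\w+)")
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) + r")\s+=\s+(?P<value>.*)")

sample = """\
# a comment that should be skipped
class Foo(stuff)
    model = ITEM2
    fields = (ITEM3)
    required_fields = (ITEM5)
"""

matches = defaultdict(dict)
d = {}
for line in sample.splitlines():
    if comment_line.match(line):
        continue
    class_match = class_line.match(line)
    if class_match:
        d = matches[class_match.group('classname')]
        continue
    data_match = data_line.match(line)
    if data_match:
        d[data_match.group('key')] = data_match.group('value')

print(dict(matches))
# {'Foo': {'model': 'ITEM2', 'fields': '(ITEM3)', 'required_fields': '(ITEM5)'}}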

Parsing key values in string

I have a string that I am getting from a command line application. It has the following structure:
-- section1 --
item11|value11
item12|value12
item13
-- section2 --
item21|value21
item22
what I would like is to parse this to a dict so that I can easily access the values with:
d['section1']['item11']
I already solved it for the case where there are no sections and every key has a value, but I get errors otherwise. I have tried a couple of things, but it is getting complicated and nothing seems to work. This is what I have now:
s="""
item11|value11
item12|value12
item21|value21
"""
d = {}
for l in s.split('\n'):
print(l, l.split('|'))
if l != '':
d[l.split('|')[0]] = l.split('|')[1]
Can somebody help me extend this for the section case and when no values are present?
Seems like a perfect fit for the ConfigParser module in the standard library:
from configparser import ConfigParser
import re

d = ConfigParser(delimiters='|', allow_no_value=True)
d.SECTCRE = re.compile(r"-- *(?P<header>[^]]+?) *--")  # sections regex
d.read_string(s)
Now you have an object that you can access like a dictionary:
>>> d['section1']['item11']
'value11'
>>> print(d['section2']['item22'])  # no value case
None
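Putting it together with the input from the question, a minimal end-to-end sketch:

from configparser import ConfigParser
import re

s = """\
-- section1 --
item11|value11
item12|value12
item13
-- section2 --
item21|value21
item22
"""

d = ConfigParser(delimiters='|', allow_no_value=True)
d.SECTCRE = re.compile(r"-- *(?P<header>[^]]+?) *--")
d.read_string(s)
print(d['section1']['item11'])   # value11
print(d['section2']['item22'])   # None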
Regexes are a good fit for this:
import re

def parse(data):
    lines = data.split("\n")  # split input into lines
    result = {}
    current_header = ""
    for line in lines:
        if line:  # if the line isn't empty
            # try to match anything between double dashes:
            match = re.match(r"^-- (.*) --$", line)
            if match:  # true when the above pattern matches
                # grab the captured group (the section name):
                current_header = match.group(1)
            else:
                # key = part before "|", value = part after it (None if missing):
                key, sep, value = line.partition("|")
                # try to get the section, defaulting to an empty section:
                section = result.get(current_header, {})
                section[key] = value if sep else None  # add data to section
                result[current_header] = section  # update section into result
    return result  # done.

print(parse("""
-- section1 --
item1|value1
item2|value2
-- section2 --
item1|valueA
item2|valueB"""))

Bulk replace with regular expressions in Python

For a Django application, I need to turn all occurrences of a pattern in a string into a link if I have the resource related to the match in my database.
Right now, here's the process:
- I use re.sub to process a very long string of text
- When re.sub finds a pattern match, it runs a function that looks up whether that pattern matches an entry in the database
- If there is a match, it wraps a link around the match.
The problem is that there are sometimes hundreds of hits on the database. What I'd like to be able to do is a single bulk query to the database.
So: can you do a bulk find and replace using regular expressions in Python?
For reference, here's the code (for the curious, the patterns I'm looking up are for legal citations):
import re

def add_linked_citations(text):
    linked_text = re.sub(r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})', create_citation_link, text)
    return linked_text

def create_citation_link(match_object):
    volume = None
    reporter = None
    page = None
    if match_object.group("volume") not in [None, '']:
        volume = match_object.group("volume")
    if match_object.group("reporter") not in [None, '']:
        reporter = match_object.group("reporter")
    if match_object.group("page") not in [None, '']:
        page = match_object.group("page")
    if volume and reporter and page:  # These should all be here...
        # !!! Here's where I keep hitting the database
        citations = Citation.objects.filter(volume=volume, reporter=reporter, page=page)
        if citations.exists():
            citation = citations[0]
            document = citation.document
            url = document.url()
            return '<a href="%s">%s %s %s</a>' % (url, volume, reporter, page)
        else:
            return '%s %s %s' % (volume, reporter, page)
Sorry if this is obvious and wrong (that no-one has suggested it in 4 hours is worrying!), but why not search for all matches, do a batch query for everything (easy once you have all matches), and then call sub with the dictionary of results (so the function pulls the data from the dict)?
You have to run the regexp twice, but it seems like the database access is the expensive part anyway.
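A minimal sketch of that two-pass idea, assuming the Citation model and regex from the question (with its stray closing parenthesis removed) and that document.url() is available as described:

import re
from functools import reduce
from django.db.models import Q

CITE_RE = re.compile(
    r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+'
    r'(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+'
    r'(?P<page>[0-9]+[a-zA-Z]{0,3})')

def add_linked_citations(text):
    # Pass 1: collect the unique (volume, reporter, page) triplets.
    triplets = {m.group('volume', 'reporter', 'page') for m in CITE_RE.finditer(text)}
    if not triplets:
        return text
    # One batched query: OR together one Q object per triplet.
    query = reduce(Q.__or__, (Q(volume=v, reporter=r, page=p) for v, r, p in triplets))
    urls = {(c.volume, c.reporter, c.page): c.document.url()
            for c in Citation.objects.filter(query)}
    # Pass 2: substitute, pulling URLs from the prefetched dict.
    def link(m):
        url = urls.get(m.group('volume', 'reporter', 'page'))
        return '<a href="%s">%s</a>' % (url, m.group()) if url else m.group()
    return CITE_RE.sub(link, text)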
You can do it with a single regexp pass, by using finditer which returns match objects.
Match objects have:
a method returning a dict of the named groups, groupdict()
the start and the end positions of the match in the original text, span()
the original matching text, group()
So I would suggest that you:
Make a list of all the matches in your text using finditer
Make a list of all the unique volume, reporter, page triplets in the matches
Lookup those triplets
Correlate each match object with the result of the triplet lookup if found
Process the original text, splitting by the match spans and interpolating lookup results.
I've implemented the database lookup by OR-ing together a list of Q objects: Q(volume=foo1, reporter=bar1, page=baz1) | Q(volume=foo2, reporter=bar2, page=baz2) | .... There may be more efficient approaches.
Here's an untested implementation:
import re
from functools import reduce
from django.db.models import Q
from collections import namedtuple

Triplet = namedtuple('Triplet', ['volume', 'reporter', 'page'])

def lookup_references(matches):
    match_to_triplet = {}
    triplet_to_url = {}
    for m in matches:
        group_dict = m.groupdict()
        if any(not x for x in group_dict.values()):  # Filter out matches we don't want to look up
            continue
        match_to_triplet[m] = Triplet(**group_dict)
    # Build query
    unique_triplets = set(match_to_triplet.values())
    # List of Q objects
    q_list = [Q(**trip._asdict()) for trip in unique_triplets]
    # Consolidated Q
    single_q = reduce(Q.__or__, q_list)
    for row in Citation.objects.filter(single_q).values('volume', 'reporter', 'page', 'url'):
        url = row.pop('url')
        triplet_to_url[Triplet(**row)] = url
    # Now pair original match objects with the URL where found
    lookups = {}
    for match, triplet in match_to_triplet.items():
        if triplet in triplet_to_url:
            lookups[match] = triplet_to_url[triplet]
    return lookups

def interpolate_citation_matches(text, matches, lookups):
    result = []
    prev = m_start = 0
    last = m_end = len(text)
    for m in matches:
        m_start, m_end = m.span()
        if prev != m_start:
            result.append(text[prev:m_start])
        # Now check the match
        if m in lookups:
            result.append('<a href="%s">%s</a>' % (lookups[m], m.group()))
        else:
            result.append(m.group())
        prev = m_end  # advance past this match (missing in the original sketch)
    if m_end != last:
        result.append(text[m_end:last])
    return ''.join(result)

def process_citations(text):
    citation_regex = r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})'
    matches = list(re.finditer(citation_regex, text))
    lookups = lookup_references(matches)
    new_text = interpolate_citation_matches(text, matches, lookups)
    return new_text

Splitting lines in a file into string and hex and do operations on the hex values

I have a large file with several lines like the ones below. I want to read in only the lines that contain the _INIT pattern, strip the _INIT suffix from the name so that only the OSD_MODE_15_H part is saved in a variable, then read the corresponding hex value (8'h00 in this case), strip the 8'h prefix, replace it with 0x, and save that in a variable.
I have been trying to strip off the _INIT, the spaces, and the =, and the code is becoming really messy.
localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00
Can you suggest a lean and clean method to do this?
Thanks!
The following solution uses a regular expression (compiled to speed up repeated searches) to match the relevant lines and extract the needed information. The expression uses the named groups "id" and "hexValue" to identify the data we want to extract from a matching line.
import re

expression = r"(?P<id>\w+?)_INIT\s*?=.*?'h(?P<hexValue>[0-9a-fA-F]*)"
regex = re.compile(expression)

def getIdAndValueFromInitLine(line):
    mm = regex.search(line)
    if mm is None:
        return None  # not an ..._INIT parameter, an empty line, or some other mismatch
    else:
        return (mm.groupdict()["id"], "0x" + mm.groupdict()["hexValue"])
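A quick check with the sample line from the question:

print(getIdAndValueFromInitLine("localparam OSD_MODE_15_H_INIT = 8'h00"))
# -> ('OSD_MODE_15_H', '0x00')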
EDIT: If I understood the next task correctly, you need to find the hex values of those INIT and ADDR lines whose IDs match, and build a dictionary mapping the INIT hex value to the ADDR hex value.
regex = r"(?P<init_id>\w+?)_INIT\s*?=.*?'h(?P<initValue>[0-9a-fA-F]*)"
init_dict = {}
for x in re.finditer(regex, lines):  # finditer yields match objects; findall would only give tuples
    init_dict[x.groupdict()["init_id"]] = "0x" + x.groupdict()["initValue"]

regex = r"(?P<addr_id>\w+?)_ADDR\s*?=.*?'h(?P<addrValue>[0-9a-fA-F]*)"
addr_dict = {}
for y in re.finditer(regex, lines):
    addr_dict[y.groupdict()["addr_id"]] = "0x" + y.groupdict()["addrValue"]

init_to_addr_hexvalue_dict = {init_dict[x]: addr_dict[x] for x in init_dict if x in addr_dict}
Even if this is not exactly what you need, having the init and addr dictionaries should make it easier to reach your goal. Note that if there are several _INIT (or _ADDR) lines with the same ID and different hex values, this dict approach will not work in a straightforward way.
Try something like this - not sure what all your requirements are, but this should get you close:
with open(someFile, 'r') as infile:
    for line in infile:
        if '_INIT' in line:
            apostropheIndex = line.find("'h")
            clean_hex = '0x' + line[apostropheIndex + 2:]
In the case of "16'h038d;", clean_hex would be "0x038d;" (need to remove the ";" somehow) and in the case of "8'h00", clean_hex would be "0x00"
Edit: if you want to guard against trailing characters like ";", you could keep only the characters that are alphanumeric:
clean_hex = '0x' + ''.join([s for s in line[apostropheIndex + 2:] if s.isalnum()])
You can use a regular expression and the re.findall() function. For example, to generate a list of tuples with the data you want, just try:
import re

lines = open("your_file").read()
regex = r"([\w]+?)_INIT\s*=\s*\d+'h([\da-fA-F]*)"
res = [(x[0], "0x" + x[1]) for x in re.findall(regex, lines)]
print(res)
The regular expression is very specific for your input example. If the other lines in the file are slightly different you may need to change it a bit.
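For the two sample lines from the question, this prints [('OSD_MODE_15_H', '0x00')]; the _ADDR line is skipped because the pattern only matches _INIT names.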
