Editing pyparsing parse results - python

This is similar to a question I've asked before.
I have written a pyparsing grammar, logparser, for a text file which contains multiple logs. A log documents every function call and every function completion. The underlying process is multithreaded, so it is possible that a slow function A is called, then a fast function B is called and finishes almost immediately, and only after that does function A finish and give us its return value. Because of this, the log file is very difficult to read by hand: the call information and the return-value information of one function can be thousands of lines apart.
My parser is able to parse the function calls (from now on called input_blocks) and their return values (from now on called output_blocks). My parse results (logparser.searchString(logfile)) look like this:
[0]: # first log
  - input_blocks:
    [0]:
      - func_name: 'Foo'
      - parameters: ...
      - thread: '123'
      - timestamp_in: '12:01'
    [1]:
      - func_name: 'Bar'
      - parameters: ...
      - thread: '456'
      - timestamp_in: '12:02'
  - output_blocks:
    [0]:
      - func_name: 'Bar'
      - func_time: '1'
      - parameters: ...
      - thread: '456'
      - timestamp_out: '12:03'
    [1]:
      - func_name: 'Foo'
      - func_time: '3'
      - parameters: ...
      - thread: '123'
      - timestamp_out: '12:04'
[1]: # second log
  - input_blocks:
    ...
  - output_blocks:
    ...
... # n-th log
I want to solve the problem that input and output information of one function call are separated. So I want to put an input_block and the corresponding output_block into a function_block. My final parse results should look like this:
[0]: # first log
  - function_blocks:
    [0]:
      - input_block:
        - func_name: 'Foo'
        - parameters: ...
        - thread: '123'
        - timestamp_in: '12:01'
      - output_block:
        - func_name: 'Foo'
        - func_time: '3'
        - parameters: ...
        - thread: '123'
        - timestamp_out: '12:04'
    [1]:
      - input_block:
        - func_name: 'Bar'
        - parameters: ...
        - thread: '456'
        - timestamp_in: '12:02'
      - output_block:
        - func_name: 'Bar'
        - func_time: '1'
        - parameters: ...
        - thread: '456'
        - timestamp_out: '12:03'
[1]: # second log
  - function_blocks:
    [0]: ...
    [1]: ...
... # n-th log
To achieve this, I define a function rearrange which iterates through input_blocks and output_blocks and checks whether func_name, thread, and the timestamps match. However, moving the matching blocks into one function_block is the part I am missing. I then set this function as the parse action for the log grammar: logparser.setParseAction(rearrange)
def rearrange(log_token):
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                    and output_block.thread == input_block.thread
                    and check_timestamp(output_block.timestamp_out,
                                        output_block.func_time,
                                        input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                # modify log_token (this is the part I am missing)
                pass
    return log_token
My question is: How do I put the matching output_block and input_block in a function_block in a way that I still enjoy the easy access methods of pyparsing.ParseResults?
My idea looks like this:
def rearrange(log_token):
    # define a new ParseResults object in which I store matching input & output blocks
    function_blocks = pp.ParseResults(name='function_blocks')
    # find matching blocks
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                    and output_block.thread == input_block.thread
                    and check_timestamp(output_block.timestamp_out,
                                        output_block.func_time,
                                        input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                function_blocks.append(input_block.pop() + output_block.pop())  # this addition causes a maximum recursion error?
    log_token.append(function_blocks)
    return log_token
This doesn't work though. The addition causes a maximum recursion error, and the .pop() doesn't work as expected. It doesn't pop the whole block, it just pops the last entry in that block. Also, it doesn't actually remove that entry either; it just removes it from the list, but it's still accessible by its results name.
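For anyone puzzled by that pop() behaviour, here is a tiny standalone sketch (my own illustration, not the real grammar) that reproduces what is described above:

import pyparsing as pp

block = pp.Group(pp.Word(pp.alphas)('func_name'))('blocks*')
result = pp.OneOrMore(block).parseString('Foo Bar')

first = result.blocks[0]
print(first.pop())      # pops only the last token of that block ('Foo'), not the whole block
print(first.func_name)  # and the results name still resolves to 'Foo'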
It's also possible that some of the input_blocks don't have a corresponding output_block (for example if the process crashes before all functions can finish). So my parse results should have the attributes input_blocks, output_blocks (for the leftover blocks), and function_blocks (for the matching blocks).
Thanks for your help!
EDIT:
I made a simpler example to show my problem. I also experimented a bit and have a solution that kind of works, but it is messy. I must admit there was a lot of trial and error involved, because I found no documentation on the inner workings of ParseResults and couldn't work out how to properly create my own nested ParseResults structure.
from pyparsing import *

def main():
    log_data = '''\
Func1_in
Func2_in
Func2_out
Func1_out
Func3_in'''

    ParserElement.inlineLiteralsUsing(Suppress)
    input_block = Group(Word(alphanums)('func_name') + '_in').setResultsName('input_blocks', listAllMatches=True)
    output_block = Group(Word(alphanums)('func_name') + '_out').setResultsName('output_blocks', listAllMatches=True)
    log = OneOrMore(input_block | output_block)

    parse_results = log.parseString(log_data)
    print('***** before rearranging *****')
    print(parse_results.dump())
    parse_results = rearrange(parse_results)
    print('***** after rearranging *****')
    print(parse_results.dump())

def rearrange(log_token):
    function_blocks = list()
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and delete them from their original positions in log_token
                # I have to do both __setitem__ and .append so it shows up in the dict and in the list
                # and .copy() is necessary because I delete the original objects later
                tmp_function_block = ParseResults()
                tmp_function_block.__setitem__('input', input_block.copy())
                tmp_function_block.append(input_block.copy())
                tmp_function_block.__setitem__('output', output_block.copy())
                tmp_function_block.append(output_block.copy())
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output']  # remove duplicate data
                function_blocks.append(function_block)
                # delete from original position in log_token
                input_block.clear()
                output_block.clear()
    log_token.__setitem__('function_blocks', sum(function_blocks))
    return log_token

if __name__ == '__main__':
    main()
Output:
***** before rearranging *****
[['Func1'], ['Func2'], ['Func2'], ['Func1'], ['Func3']]
- input_blocks: [['Func1'], ['Func2'], ['Func3']]
  [0]:
    ['Func1']
    - func_name: 'Func1'
  [1]:
    ['Func2']
    - func_name: 'Func2'
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [['Func2'], ['Func1']]
  [0]:
    ['Func2']
    - func_name: 'Func2'
  [1]:
    ['Func1']
    - func_name: 'Func1'
***** after rearranging *****
[[], [], [], [], ['Func3']]
- function_blocks: [['Func1'], ['Func1'], ['Func2'], ['Func2'], [], []] # why is this duplicated? I just want the inner function_blocks!
- function_blocks: [[['Func1'], ['Func1']], [['Func2'], ['Func2']], [[], []]]
  [0]:
    [['Func1'], ['Func1']]
    - input: ['Func1']
      - func_name: 'Func1'
    - output: ['Func1']
      - func_name: 'Func1'
  [1]:
    [['Func2'], ['Func2']]
    - input: ['Func2']
      - func_name: 'Func2'
    - output: ['Func2']
      - func_name: 'Func2'
  [2]: # where does this come from?
    [[], []]
    - input: []
    - output: []
- input_blocks: [[], [], ['Func3']]
  [0]: # how do I delete these indexes?
    [] # I think I only cleared their contents
  [1]:
    []
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [[], []]
  [0]:
    []
  [1]:
    []

This version of rearrange addresses most of the issues I see in your example:
def rearrange(log_token):
    function_blocks = list()
    for input_block in log_token.input_blocks:
        # look for a match among output blocks that have not been cleared
        for output_block in filter(None, log_token.output_blocks):
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and clear them from their original positions in log_token

                # create the rearranged block; instead of append()'ing, just
                # initialize with a list containing the two block copies
                tmp_function_block = ParseResults([input_block.copy(), output_block.copy()])
                # now assign the blocks by name
                # x.__setitem__(key, value) is the same as x[key] = value
                tmp_function_block['input'] = tmp_function_block[0]
                tmp_function_block['output'] = tmp_function_block[1]

                # wrap that all in another ParseResults, as if we had matched a Group
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output']  # remove duplicate name references
                function_blocks.append(function_block)

                # clear blocks in their original positions in log_token, so they won't be matched any more
                input_block.clear()
                output_block.clear()

                # match found, no need to keep looking for a matching output block
                break

    # find all input blocks that weren't cleared (had no matching output block) and append them as input-only blocks
    for input_block in filter(None, log_token.input_blocks):
        # no matching output for this input
        tmp_function_block = ParseResults([input_block.copy()])
        tmp_function_block['input'] = tmp_function_block[0]
        function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                      modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
        del function_block['input']  # remove duplicate name reference
        function_blocks.append(function_block)
        input_block.clear()

    # clean out log_token, and reload with the rearranged function blocks
    log_token.clear()
    log_token.extend(function_blocks)
    log_token['function_blocks'] = sum(function_blocks)
    return log_token
And since this takes the input token and returns the rearranged tokens, you can make it a parse action as-is:
# trailing '*' on the results name is equivalent to listAllMatches=True
input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
output_block = Group(Word(alphanums)('func_name') + '_out')('output_blocks*')
log = OneOrMore(input_block | output_block)
log.addParseAction(rearrange)
Since rearrange updates log_token in place, the ending return statement is unnecessary once you make it a parse action.
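As a quick end-to-end check (my own sketch, reusing the grammar just defined and the sample data from the simplified question):

# the dump should now show only the rearranged function_blocks
print(log.parseString('Func1_in Func2_in Func2_out Func1_out Func3_in').dump())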
It is interesting how you were able to update the list in-place by clearing those blocks that you had found matches for - very clever.
Generally, the assembly of tokens into ParseResults is an internal function, so the docs are light on this topic. I was just looking through the module docs and I don't really see a good home for this topic.

Another demo for pyparsing setParseAction:
remove whitespace before the first value, but preserve whitespace between values.
I tried to solve this with pp.Optional(pp.White(' \t')).suppress(),
but then I got a = ["b=1"] (the parser did not stop at the end of the line).
def lstrip_first_value(src, loc, token):
    "remove whitespace before the first value"
    # based on https://stackoverflow.com/a/51335710/10440128
    if token == []:
        return token
    # update the values
    copy = token[:]
    copy[0] = copy[0].lstrip()
    if copy[0] == "" and len(copy) > 1:
        copy = copy[1:]
    # update the token
    token.clear()
    token.extend(copy)
    token["value"] = copy
    return token

# Value must be defined before Values is built from it
Value = pp.Combine(
    pp.QuotedString(quoteChar='"', escChar="\\")
    | pp.White(' \t')  # parse whitespace to separate tokens
)

Values = (
    pp.OneOrMore(Value.leaveWhitespace())
    | pp.Empty().setParseAction(pp.replaceWith(""))
)("value").setParseAction(lstrip_first_value)
Inputs:
a=
b=2
a =
b=2
The values of a should always be [""].

Related

Access elements inside yaml using python

I am using yaml and pyyaml to configure my application.
Is it possible to configure something like this -
config.yml -
root:
  repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
  data_root: $root.repo_root/data
service:
  root: $root.data_root/csv/xyz.csv
yaml loading function -
def load_config(config_path):
    config_path = os.path.abspath(config_path)
    if not os.path.isfile(config_path):
        raise FileNotFoundError("{} does not exist".format(config_path))
    else:
        with open(config_path) as f:
            config = yaml.load(f, Loader=yaml.SafeLoader)
            # logging.info(config)
            logging.info("Config used for run - \n{}".format(yaml.dump(config, sort_keys=False)))
            return DotDict(config)
Current Output -
root:
  repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
  data_root: ${root.repo_root}/data
service:
  root: ${root.data_root}/csv/xyz.csv
Desired Output -
root:
  repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
  data_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data
service:
  root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data/csv/xyz.csv
Is this even possible with python? If so any help would be really nice.
Thanks in advance.
A general approach:
read the file as is
search for strings containing $:
determine the "path" of "variables"
replace the "variables" with actual values
An example, using a recursive call for dictionaries and replacing strings:
import re, pprint, yaml

def convert(input, top=None):
    """Replaces $key1.key2 with actual values. Modifies input in-place"""
    if top is None:
        top = input  # top should be the original input
    if isinstance(input, dict):
        ret = {k: convert(v, top) for k, v in input.items()}  # recursively convert items
        if input != ret:  # in case order matters, do it one or several times more until no change happens
            ret = convert(ret)
        input.update(ret)  # update original input
        return input  # return updated input (for the case of recursion)
    if isinstance(input, str):
        vars = re.findall(r"\$[\w_\.]+", input)  # find $key_1.key_2.keyN sequences
        for var in vars:
            keys = var[1:].split(".")  # remove dollar and split by dots to make a "key chain"
            val = top  # starting from top ...
            for k in keys:  # ... for each key in the key chain ...
                val = val[k]  # ... go one level down
            input = input.replace(var, val)  # replace the $key sequence with the actual value
    # TODO int, float, list, ...
    return input  # return the (possibly) modified input

with open("in.yml") as f:
    config = yaml.load(f, Loader=yaml.SafeLoader)  # load as is
convert(config)  # convert it (in-place)
pprint.pprint(config)
Output:
{'root': {'data_root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data',
          'repo_root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght'},
 'service': {'root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data/csv/xyz.csv'}}
Note: YAML is not that important here, would work also with JSON, XML or other formats.
Note2: If you use exclusively YAML and exclusively Python, some answers from this post may be useful (using anchors and references and application-specific local tags).
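For reference, here is a tiny sketch (my own, not from the linked post) of the plain anchor/alias mechanism Note2 alludes to; an anchor can only reuse a whole value verbatim, it cannot concatenate paths the way convert() above does:

import yaml

doc = """
root:
  repo_root: &repo /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
service:
  repo_copy: *repo
"""
print(yaml.safe_load(doc))  # repo_root and repo_copy now hold the same path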

Python- validate generated powerset

I want to validate generated combinations based only on the data within "< >".
I have an Excel sheet consisting of all the possible combinations generated based on the "<>" condition.
Below is a sample of that:
[<Pen(x)>-C(A2)-C(60)-<jack(c)>-xy1-[dress0]-C(D0)-lbr-]
[<Pen(x)>-C(A2)-C(60)-NULL-xy1-[dress0]-C(D0)-lbr-]
[NULL-C(A2)-C(60)-<jack(c)>-xy1-[dress0]-C(D0)-lbr-]
[NULL-C(A2)-C(60)-NULL-xy1-[dress0]-C(D0)-lbr-]
I want to check whether the generated combinations are valid or not.
For example, for the above list the original string before generating combinations is:
<Pen(x)>-C(A2)-C(60)--<jack(c)>-xy1-[address0]-C(D0)-lbr-
Kindly help me to find a generic method to validate all the powersets generated based on <>.
To give a simple example:
I have the below list1.
[<A><B>-CAT-DOG]
[NULL-<B>-CAT-DOG]
[<A>-NULL-CAT-DOG]
[NULL-NULL-CAT-DOG]
list1 contains all possible combinations of:
<A><B>-CAT-DOG
I want to check if the above list1 is valid or not
We can build the desired combinations using itertools.product, which generates the Cartesian product of its iterable arguments. But first we need to split the input string up into its components. We can do that by first adding some extra spaces and then calling the .split method.
We can then transform each string in the list returned by .split into a tuple. Items enclosed by < and > get transformed into a 2-tuple containing the item and the 'NULL' string, all other items become 1-tuples.
from itertools import product

def make_powerset(base):
    # Add some spaces to make splitting easier
    s = base.replace('-', ' ').replace('<', ' <').replace('>', '> ')
    # Convert items enclosed in <> into 2-tuples and make other items 1-tuples
    elements = [(u, 'NULL') if u.startswith('<') else (u,) for u in s.split()]
    # Create all the subsets by finding the Cartesian product of all the tuples
    return {'-'.join(t).replace('>-<', '><') for t in product(*elements)}

# Tests
# Make a powerset from base
base = '<Pen(x)>-C(A2)-C(60)--<jack(c)>-xy1-[address0]-C(D0)-lbr-'
powerset = make_powerset(base)
for t in powerset:
    print(t)
print()

# Test if the following data are in the powerset
data = (
    '<Pen(x)>-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr-',
    '<Pen(x)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr-',
    'NULL-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr-',
    'NULL-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr-',
    '<Pen(y)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr-',
)
for s in data:
    print(s, s.rstrip('-') in powerset)

print('\n', '- ' * 20, '\n')

# Make another powerset
for t in make_powerset('<A><B>-CAT-DOG<C>'):
    print(t)
Output:
<Pen(x)>-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr
NULL-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr
<Pen(x)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr
NULL-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr
<Pen(x)>-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr- True
<Pen(x)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr- True
NULL-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr- True
NULL-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr- True
<Pen(y)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr- False
- - - - - - - - - - - - - - - - - - - -
NULL-NULL-CAT-DOG-NULL
NULL-<B>-CAT-DOG-NULL
<A>-NULL-CAT-DOG-<C>
<A>-NULL-CAT-DOG-NULL
NULL-NULL-CAT-DOG-<C>
<A><B>-CAT-DOG-<C>
NULL-<B>-CAT-DOG-<C>
<A><B>-CAT-DOG-NULL
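And to validate a whole generated list against the original string (the question's actual goal), something along these lines should work (my own sketch, using the bare strings from list1):

valid = make_powerset('<A><B>-CAT-DOG')
list1 = ['<A><B>-CAT-DOG', 'NULL-<B>-CAT-DOG', '<A>-NULL-CAT-DOG', 'NULL-NULL-CAT-DOG']
print(all(item in valid for item in list1))  # True only if every combination is valid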

Python - Convert a 2 (or less) item set to 2 variables

My code works but I feel like the while loop is possibly not as succinct as it could be.
Maybe using a while loop for a set of 2 items or less is silly? I'm not sure.
# <SETUP CODE TO SIMULATE MY SITUATION>
import random
import re

# The real data set is much larger than this (around 1,000 - 10,000 items):
names = {"abc", "def", "123"}
if random.randint(0, 3):
    # foo value is "foo" followed by a string of unknown digits:
    names.add("foo" + str(random.randint(0, 1000)))
if random.randint(0, 3):
    # bar value is just "bar":
    names.add("bar")
print("names:", names)

matches = {name for name in names if re.match("foo|bar", name)}
print("matches:", matches)
# In the names variable, foo and/or bar may be missing, thus len(matches) should be 0-2:
assert len(matches) <= 2, "Somehow got more than 2 matches"
# </SETUP CODE TO SIMULATE MY SITUATION>
foo, bar = None, None
while matches:
    match = matches.pop()
    if match == "bar":
        bar = match
    else:
        foo = match

print("foo:", foo)
print("bar:", bar)
And here's what else I've tried within the while loop:
I know ternaries don't work like this (at least not in Python) but this is the pipe-dream level of simplicity I was hoping for:
(bar if match == "bar" else foo) = match
The remove function doesn't return anything:
try:
    bar = matches.remove("bar")
except KeyError:
    foo = matches.pop()
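For what it's worth, a working variant of that remove() idea could look like this (my own sketch, not part of the original question; it assumes the matches set built above):

bar = "bar" if "bar" in matches else None   # membership test instead of remove()
foo = next(iter(matches - {"bar"}), None)   # whatever is left over, if anything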
The loop in your first code is OK; 10,000 inputs is really small at computer scale.
If you want to go slightly faster, you can just iterate over matches without popping elements (popping takes extra time), simply replacing
while matches:
    match = matches.pop()
with
for match in matches:
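Put together, the suggested change looks like this (my own sketch, keeping the rest of the original loop body):

foo, bar = None, None
for match in matches:
    if match == "bar":
        bar = match
    else:
        foo = match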
Why don't you use a simple for loop instead of a while loop?
for match in matches:
    # one-line conditional assignment (assumes foo and bar start as None, as in the question)
    foo, bar = (foo, match) if match == "bar" else (match, bar)
print("foo:", foo)
print("bar:", bar)
You don't have to remove the element from the set every time, since your set only contains 2 or fewer elements :P. For larger sets you could delete the entire set after use with
del matches  # will help in garbage collection
In our case, this is not needed.

How can I extract values from an Rx Observable?

I am attempting to integrate some ReactiveX concepts into an existing project, thinking it might be good practice and a way to make certain tasks cleaner.
I open a file, create an Observable from its lines, then do some filtering until I get just the lines I want. Now, I want to extract some information from two of those lines using re.search() to return particular groups. I can't for the life of me figure out how to get such values out of an Observable (without assigning them to globals).
train = 'ChooChoo'

with open(some_file) as fd:
    line_stream = Observable.from_(fd.readlines())

a_stream = line_stream.skip_while(
    # Begin at dictionary
    lambda x: 'config = {' not in x
).skip_while(
    # Begin at train key
    lambda x: "'" + train.lower() + "'" not in x
).take_while(
    # End at closing brace of dict value
    lambda x: '}' not in x
).filter(
    # Filter sdk and clang lines only
    lambda x: "'sdk'" in x or "'clang'" in x
).subscribe(lambda x: match_some_regex(x))
In place of .subscribe() at the end of that stream, I have tried using .to_list() to get a list over which I can iterate "the normal way," but it only returns a value of type:
<class 'rx.anonymousobservable.AnonymousObservable'>
What am I doing wrong here?
Every Rx example I have ever seen does nothing but print results. What if I want them in a data structure I can use synchronously?
For the short term, I implemented the feature I wanted using itertools (as suggested by @jonrsharpe). Still, the problem grated at the back of my mind, so I came back to it today and figured it out.
This is not a good example of Rx, since it only uses a single thread, but at least now I know how to break out of "the monad" when need be.
#!/usr/bin/env python
from __future__ import print_function
from rx import *

def my_on_next(item):
    print(item, end="", flush=True)

def my_on_error(throwable):
    print(throwable)

def my_on_completed():
    print('Done')
    pass

def main():
    foo = []
    # Create an observable from a list of numbers
    a = Observable.from_([14, 9, 5, 2, 10, 13, 4])
    # Keep only the even numbers
    b = a.filter(lambda x: x % 2 == 0)
    # For every item, call a function that appends the item to a local list
    c = b.map(lambda x: foo.append(x))
    c.subscribe(lambda x: x, my_on_error, my_on_completed)
    # Use the list outside the monad!
    print(foo)

if __name__ == "__main__":
    main()
This example is rather contrived, and all the intermediate observables aren't necessary, but it demonstrates that you can easily do what I originally described.
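A slightly more direct variant of the same idea (my own sketch, sticking to the old RxPY 1.x API used above): to_list() collects everything into a single list item, so one subscription captures it all synchronously:

results = []
Observable.from_([14, 9, 5, 2, 10, 13, 4]) \
    .filter(lambda x: x % 2 == 0) \
    .to_list() \
    .subscribe(results.append)
print(results[0])  # [14, 2, 10, 4]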

Search list items in another longer list in python

I am new to this forum, hence apologies if this is a very long question.
I am trying to create a generic keyword parser that accepts a keyword list and a list of text lines (which could have been generated either from a DB or from a free-format text file). I am trying to extract the entities from the text-lines list based on the keyword list, so that I can generate three key outputs:
The keyword that was mentioned
The text line where this keyword was mentioned and,
The number of times this keyword was mentioned in that text line
The following is a sample of the Python code I have written to do this. As you can see, I am trying to accomplish this in three stages:
Stage 1 - accept a reject sequence so that I can remove all known unwanted lines from the text-lines list
Stage 2 (Pass 1 parsing) - carry out an index-type search on the keywords to reduce the list of lines that need a full looped search
Stage 3 - carry out a full looped search.
Problem: Stage 3 (or pass 2 in the code) is extremely inefficient; as an example, for a keyword list with 4,500 elements and text lines with nearly 2 million rows, the code runs for more than 24 hours.
Can anyone suggest a better method of doing pass 2?
or
Is there a better method of writing the whole function?
I am a Python beginner, so apologies in advance if I have missed something obvious.
##########################################################################################
# The keyWord parser conducts a 2-pass keyword lookup and parsing.
# Inputs:
#   keywordIDsList - Is a list of the IDs of the keywords (Standard declaration: keywordIDsList[] = hash value of the keyWords)
#   keywordDict - Is the dict of all the keywords and the associated IDs.
#       (Standard declaration: keywordDict[keywordID] = (keywordID, keyWord) where keywordID is the hash value in keywordIDsList)
#   valueIDsList - Is a list of the IDs of all the values that need to be parsed (Standard declaration: valueIDsList[] = unique reference number of the values)
#   valuesDict - Is the dict of all the value lines and the associated IDs.
#       (Standard declaration: valuesDict[uniqueValueKey] = (uniqueValueKey, valueText) where uniqueValueKey is the unique key in valueIDsList)
#   rejectPattern - A regular-expression-based pattern for rejecting columns with certain types of patterns. This is an optional field.
# Outputs:
#   parsedHashIDsList - Is a list of the hash values that are generated for every successful parse result
#   parsedResultsDict - Is the actual parsed value as parsedResultsDict[parsedHashID] = (uniqueValueKey, keywordID, frequencyResult)
#   successResultIDsList - list of all unique value references that were parsed successfully
#   rejectResultIDsList - list of all unique value references that were rejected
##########################################################################################
def keywordParser(keywordIDsList, keywordDict, valueIDsList, valuesDict, rejectPattern):
    parsedResultsDict = {}
    parsedHashIDsList = []
    successResultIDsList = []
    rejectResultIDsList = []
    processListPass1 = []
    processListPass2 = []
    idxkeyWordDict = {}

    for keyID in keywordIDsList:
        keywordID, keyWord = keywordDict[keyID]
        idxkeyWordDict[keyWord] = (keywordID, keyWord)

    percCount = 1
    # optional: if rejectPattern is provided then reject lines
    # ## Some python code for processing the reject patterns - this works fine

    # Pass 1: Index based matching - partial code for index based search
    for valueID in processListPass1:
        valKey, valText = valuesDict[valueID]
        try:
            keyWordVal, keywordID = idxkeyWordDict[valText]
        except:
            processListPass2.append(valueID)

    percCount = 0
    # Pass 2: Text based search and lookup - this part of the code is extremely inefficient
    for valueID in processListPass2:
        percCount += 1
        valKey, valText = valuesDict[valueID]
        valSuccess = 'N'
        for keyID in keywordIDsList:
            keyWordVal, keywordID = keywordDict[keyID]
            keySearch = re.findall(keyWordVal, valText, re.DOTALL)
            if keySearch:
                parsedHashID = hash(str(valueID) + str(keyID))
                parsedResultsDict[parsedHashID] = (valueID, keywordID, len(keySearch))
                valSuccess = 'Y'
        if valSuccess == 'Y':
            successResultIDsList.append(valueID)
        else:
            rejectResultIDsList.append(valueID)

    return (parsedResultsDict, parsedHashIDsList, successResultIDsList, rejectResultIDsList)
This is a perfect use case for the Aho-Corasick string-matching algorithm. There is an explanation of a similar use case, with Python code examples, in this blog post.
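As a rough illustration only (not the linked post's code), pass 2 could be replaced by a single scan per line using the third-party pyahocorasick package, assuming the keywords are literal strings rather than regular expressions; the names below mirror the question's variables but the helper functions themselves are hypothetical:

import ahocorasick

def build_automaton(keywordIDsList, keywordDict):
    # build the Aho-Corasick automaton once, up front
    automaton = ahocorasick.Automaton()
    for keyID in keywordIDsList:
        keywordID, keyWord = keywordDict[keyID]
        automaton.add_word(keyWord, keywordID)
    automaton.make_automaton()
    return automaton

def count_keywords(automaton, valText):
    # one pass over the line finds every keyword occurrence
    counts = {}
    for _end_index, keywordID in automaton.iter(valText):
        counts[keywordID] = counts.get(keywordID, 0) + 1
    return counts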
