I am parsing a large number of huge XML files (up to 1GB) and I am cross-referencing a list of about 700 possible matches for a given field. If I find a match, I would like to know which entry from my list I hit rather than using the text from the field itself.
I have the following line in my code
<-- outside loops iterating over outer layer tags -->
if any(re.search(s, parsedOutTag.text) for s in preCompiledRegexList):
    <-- checking inner layer tags for additional content -->
I am wondering how to access the iterant s directly when the condition is satisfied. I currently have a very hack'ish implementation of what I need to happen.
I have to admit, and I am sure it is obvious, I adopted this line for the efficiency from another question here on Stack Overflow so I don't really know all the details.
The any function short-circuits, I believe, so even if you could access the s binding from the generator expression, it would only ever be the first matching instance. If that's what you want, then you can just unwrap the if condition:
for s in preCompiledRegexList:
    if re.search(s, parsedOutTag.text):
        # checking inner layer tags for additional content
        break
If you want to process all items in preCompiledRegexList that match, either remove the break above, or use a generator that only yields values that match the required condition:
for outer_s in (inner_s for inner_s in preCompiledRegexList if re.search(inner_s, parsedOutTag.text)):
    # checking inner layer tags for additional content
(Note that having different outer_s and inner_s labels isn't necessary, I just wanted to highlight that they exist in separate scopes.)
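If only the first matching pattern is needed, the explicit loop can also be written with next() over a generator expression. This is a minimal sketch: pre_compiled_regex_list and tag_text are hypothetical stand-ins for the question's preCompiledRegexList and parsedOutTag.text, and the patterns are made up for illustration.

```python
import re

# Hypothetical stand-ins for the question's list and tag text.
pre_compiled_regex_list = [re.compile(p) for p in (r"alpha\d+", r"beta", r"gamma")]
tag_text = "xx beta yy gamma"

# First pattern that matches, or None when nothing matches.
# next() stops consuming the generator at the first hit, like any() does.
first_hit = next((rx for rx in pre_compiled_regex_list if rx.search(tag_text)), None)

# Every pattern that matches, in list order.
all_hits = [rx.pattern for rx in pre_compiled_regex_list if rx.search(tag_text)]

if first_hit is not None:
    print(first_hit.pattern)
print(all_hits)
```

The None default avoids a StopIteration when no pattern matches, so the result can be tested with a plain if.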
This is a generic question and answer for a logical error I've seen in many questions from new programmers in a variety of languages.
The problem is searching an array for an element that matches some input criteria. The algorithm, in pseudo-code, looks something like this:
for each element of Array:
    if element matches criteria:
        do something with element
        maybe break out of loop (if only interested in first match)
    else:
        print "Not found"
This code reports "Not found" even if it successfully finds a matching element.
The problem is that when you're searching for something linearly through an array, you can't know that it's not found until you reach the end of the array. The code in the question reports "Not found" for every non-matching element, even though there may be other matching elements.
The simple modification is to use a variable that tracks whether you found something, and then check this variable at the end of the loop.
found = false
for each element of Array:
    if element matches criteria:
        do something with element
        found = true
        maybe break out of loop (if only interested in first match)
if not found:
    print "Not found"
Python has an else: block in its for loops. This executes code only if the loop runs to completion, rather than ending due to use of break. This allows you to avoid the found variable (although it might still be useful for later processing):
for element in someIterable:
    if matchesCriteria(element):
        print("Found")
        break
else:
    print("Not found")
Some languages have built-in mechanisms that can be used instead of writing your own loop.
Some languages have an any or some function that takes a callback function, and returns a boolean indicating whether it succeeds for any elements of the array.
If the language has an array filtering function, you can filter the input array with a function that checks the criteria, and then check whether the result is an empty array.
If you're trying to match an element exactly, most languages provide a find or index function that will search for a matching element.
If you'll be searching frequently, it may be better to convert the array to a data structure that can be searched more efficiently. Most languages provide set and/or hash table data structures (the latter goes under many names depending on the language, e.g. associative array, map, dictionary), and these are typically searchable in O(1) time, while scanning an array is O(n).
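In Python, the built-in alternatives described above might look roughly like this. The matches_criteria function and the sample items are hypothetical and only serve to make the sketch runnable.

```python
def matches_criteria(element):
    # Hypothetical criterion, for illustration only: even numbers.
    return element % 2 == 0

items = [3, 7, 8, 11, 12]

# any/some style: a boolean that is True as soon as one element matches.
found_any = any(matches_criteria(x) for x in items)

# Filter style: collect the matches, then check whether the result is empty.
matching = [x for x in items if matches_criteria(x)]

# Exact match: list.index raises ValueError when absent, so test membership first.
position = items.index(8) if 8 in items else -1

# Frequent exact lookups: build a set once, then test membership in O(1) on average.
lookup = set(items)
in_set = 11 in lookup

if not matching:
    print("Not found")
```

The set only helps for exact matches; a criterion such as "is even" still needs a scan of the elements.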
I'm using rply and Python 3.6 to create a lexer and a parser for a little private project.
But what I noticed is that the parser appears to flip the order of the lexer stream.
This is the file I'm parsing:
let test:string = "test";
print(test);
Lexer output:
Token('LET', 'let')
Token('NAME', 'test')
Token('COLON', ':')
Token('NAME', 'string')
Token('EQUALS', '=')
Token('STRING', '"test"')
Token('SEMI_COLON', ';')
Token('PRINT', 'print')
Token('OPEN_PARENS', '(')
Token('STRING', '"test"')
Token('CLOSE_PARENS', ')')
Token('SEMI_COLON', ';')
As you can see it is in the order of the script.
I use the parser to create a variable with name test, type string and value test. Then I want to print the variable.
It does create the variable but when I want to print it out, there is nothing.
But when I flip the script like this
print(test);
let test:string = "test";
it is able to print the value correctly.
The two parser 'rules' look like this:
Print:
@self.pg.production('expression : PRINT OPEN_PARENS expression CLOSE_PARENS SEMI_COLON expression')
def print_s(p):
    ...
Create variable:
@self.pg.production('expression : LET expression COLON expression EQUALS expression SEMI_COLON expression')
def create_var(p):
    ...
So my question is: How can I flip the order in which the content is parsed?
Edit: I looked for similar questions or problems and also in the documentation but didn't find anything.
Here's a somewhat simpler example; hopefully, you can see the pattern.
The key insight is that reduction actions (that is, the parser functions) are executed when the production's match has been fully parsed. That means that if a production contains non-terminals, the actions for those non-terminals are executed before the action for the whole production.
It should be clear why this is true. Every production action depends on the semantic values of all of the components, and in the case of non-terminals those values are produced by running the corresponding actions.
Now, consider these two very similar ways to parse a list of things. In both cases, we assume there is a base production which recognises an empty list (list :) and does nothing.
Right recursion:
list : thing list
Left recursion:
list : list thing
In both cases, the action prints the thing, which is p[0] in the right-recursive case, and p[1] in the left-recursive one.
The right-recursive production will cause the things to be printed in reverse order, because printing the thing doesn't happen until after the internal list is parsed (and its components are printed).
But the left-recursive production will print the things in left-to-right order, for the same reason. The difference is that in the left-recursive case, the internal (recursive) list contains the initial things, while in the right-recursive case, the list contains the final things.
If you were just building a Python list of things, this probably wouldn't matter much, since execution order wouldn't be important. It's only visible in this example because the action has a side-effect (printing a value), which makes execution order visible.
There are other techniques to order actions, in the rare cases where it is really necessary. But best practice is to always use left-recursion whenever it is syntactically practical. Left-recursive parsers are more efficient because the parser doesn't need to accumulate a stack of incomplete productions. And left-recursion is often better for your actions as well.
Here, for example, the left-recursive action could append the new value (p[0].append(p[1]); return p[0]), while the right-recursive action needs to create a new list (return [p[0]] + p[1]). Since repeated appending is on average linear time while repeated concatenation is quadratic, the left-recursive parser is more scalable for large lists.
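The ordering difference can be reproduced without a parser generator. The sketch below is a plain-Python simulation of when the actions fire, not rply code; parse_right, parse_left and the log argument are hypothetical names introduced for illustration.

```python
def parse_right(things, log):
    """Simulates `list : thing list`: the action runs after the inner list is reduced."""
    if not things:
        return []                          # base production: empty list
    rest = parse_right(things[1:], log)    # the inner list is reduced first
    log.append(things[0])                  # this production's action runs afterwards
    return [things[0]] + rest              # concatenation builds a new list each time


def parse_left(things, log):
    """Simulates `list : list thing`: the action runs as each thing is consumed."""
    result = []                            # base production: empty list
    for thing in things:
        log.append(thing)                  # action fires in left-to-right order
        result.append(thing)               # in-place append, no copying
    return result


right_log, left_log = [], []
right_result = parse_right(["a", "b", "c"], right_log)
left_result = parse_left(["a", "b", "c"], left_log)
print(right_log)   # order in which the right-recursive actions ran
print(left_log)    # order in which the left-recursive actions ran
```

Both functions return the same list; only the order of the side effects recorded in log differs, which is the effect described for the print and let statements in the question.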
I need to create a Bash script, ideally using sed, to find and replace value lists in href URL link constructs within HTML site files, looking up each value in a map (old to new values). There are around 25K site files to look through, and the map has around 6,000 entries to search.
All old and new values have 6 digits.
The URL construct is:
One value:
HREF=".*jsp\?.*N=[0-9]{1,}.*"
List of values:
HREF=".*\.jsp\?.*N=[0-9]{1,}+N=[0-9]{1,}+N=[0-9]{1,}...*"
The list of values are delimited by + PLUS symbol, and the list can be 1 to n values in length.
I want to ignore a construct such as this:
HREF=".*\.jsp\?.*N=0.*"
i.e. the list is only N=0.
Effectively I'm only interested in URLs that include one or more values that are in the map file and that are not prepended with CHANGED, i.e. the list requires updating.
PLEASE NOTE: in the above construct examples, .* means any character that isn't a digit. I'm just interested in any 6-digit values in the list of values after N=, so I've been trying to isolate the N= list from the rest of the URL construct. Note that this N= list can appear anywhere within the URL construct.
Initially, I want to create a script that produces a report of all links that fulfil the above criteria and have a 6-digit OLD value that's in the map file, along with each link's file path, to get an understanding of the links impacted. E.g.:
Filename link
filea.jsp /jsp/search/results.jsp?N=204200+731&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
filea.jsp /jsp/search/BROWSE.jsp?Ntx=mode+matchallpartial&N=213890+217867+731&
fileb.jsp /jsp/search/results.jsp?N=0+450+207827+213767&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
Lastly, I'd like to find and replace all 6 digit numbers, within the URL construct lists, as outlined above, as efficiently as possible (I'd like it to be reasonably fast as there could be around 25K files, with 6K values to look up, with potentially multiple values in the list).
**PLEASE NOTE:** An additional issue when finding and replacing is that an old value could be assigned a new value that is already in use as an old value elsewhere in the map, and which may itself have to be replaced.
E.G. If the map file is as below:
MAP-FILE.txt
OLD NEW
214865 218494
214866 217854
214867 214868
214868 218633
... ...
and there is a HREF link such as:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214867+214868
214867 changes to 214868 - this would need to be prepended to flag that this value has been changed, and should not be replaced, otherwise what was 214867 would become 218633 as all 214868 would be changed to 218633. Hope this makes sense - I would then need to run through file and remove all 6 digit numbers that had been marked with the prepended flag, such that link would become:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214868CHANGED+218633CHANGED
Unless there's a better way to manage these infile changes.
Could someone please help me with this? I'm not an expert with these kinds of changes, so help would be massively appreciated.
Many thanks in advance,
Alex
I will write the outline for the code in a kind of pseudocode, since I don't remember Python well enough to quickly write the code in Python.
First find what type it is (if contains N=0 then type 3, if contains "+" then type 2, else type 1) and get a list of strings containing "N=..." by exploding (name of PHP function) by "+" sign.
The first loop is on links. The second loop is for each N= number. The third loop looks in map file and finds the replacing value. Load the data of the map file to a variable before all the loops. File reading is the slowest operation you have in programming.
You replace the value in the third loop, then implode (PHP function) the list of new strings to a new link when returning to a first loop.
You probably have several files with links, in which case you need another loop over the files.
When dealing with repeated codes, you need a while loop that runs until a spare number is found, and you need to save the numbers that are already used in a list.
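As a rough Python sketch of the outline above: the map is loaded into a dict once, and every value in an N= list is looked up against the original text in a single pass. This swaps the CHANGED marker and the while loop for single-pass replacement, which is an assumption about what is acceptable here; since a replaced value is never looked up again, 214867 cannot cascade to 218633. The names id_map, N_LIST and remap_link are hypothetical, and the map rows are the sample rows from the question.

```python
import re

# Sample rows from the question's MAP-FILE.txt (old -> new), loaded once up front.
id_map = {
    "214865": "218494",
    "214866": "217854",
    "214867": "214868",
    "214868": "218633",
}

# An N= parameter followed by digits separated by '+'. The lookbehind keeps
# parameter names that merely end in N from matching.
N_LIST = re.compile(r"(?<![A-Za-z0-9])N=(\d+(?:\+\d+)*)")


def remap_link(link, mapping):
    """Return link with each mapped value in its N= list replaced exactly once."""
    def replace_list(match):
        values = match.group(1).split("+")          # "explode" on the plus sign
        # Unmapped values such as 0 or 450 are kept unchanged.
        return "N=" + "+".join(mapping.get(v, v) for v in values)  # "implode"
    return N_LIST.sub(replace_list, link)


link = "/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214867+214868"
new_link = remap_link(link, id_map)
print(new_link)
```

A link needs reporting exactly when remap_link(link, id_map) != link, so the same function can drive the initial report before any file is rewritten.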
I have some XML that is generated by a script that may or may not have empty elements. I was told that now we cannot have empty elements in the XML. Here is an example:
<customer>
<govId>
<id>#</id>
<idType>SSN</idType>
<issueDate/>
<expireDate/>
<dob/>
<state/>
<county/>
<country/>
</govId>
<govId>
<id/>
<idType/>
<issueDate/>
<expireDate/>
<dob/>
<state/>
<county/>
<country/>
</govId>
</customer>
The output should look like this:
<customer>
<govId>
<id>#</id>
<idType>SSN</idType>
</govId>
</customer>
I need to remove all the empty elements. You'll note that my code took out the empty stuff in the "govId" sub-element, but didn't take out anything in the second. I am using lxml.objectify at the moment.
Here is basically what I am doing:
from lxml import objectify

root = objectify.fromstring(xml)
for customer in root.customers.iterchildren():
    for e in customer.govId.iterchildren():
        if not e.text:
            customer.govId.remove(e)
Does anyone know of a way to do this with lxml objectify or is there an easier way period? I would also like to remove the second "govId" element in its entirety if all its elements are empty.
First of all, the problem with your code is that you are iterating over customers, but not over govIds. On the third line you take the first govId for every customer, and iterate over its children. So, you'd need another for loop for the code to work as you intended.
This small sentence at the end of your question then makes the problem quite a bit more complex: I would also like to remove the second "govId" element in its entirety if all its elements are empty.
This means, unless you want to hard code just checking one level of nesting, you need to recursively check whether an element and its children are empty. Like this, for example:
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
Note: Python 2.5+ because of the use of the all() builtin.
You then can change your code to something like this to remove all the elements in the document that are empty all the way down.
# Walk over all elements in the tree and remove all
# nodes that are recursively empty
context = etree.iterwalk(root)
for action, elem in context:
    parent = elem.getparent()
    # The root has no parent, so it cannot be removed this way
    if parent is not None and recursively_empty(elem):
        parent.remove(elem)
Sample output:
<customer>
<govId>
<id>#</id>
<idType>SSN</idType>
</govId>
</customer>
One thing you might want to do is refine the condition if e.text: in the recursive function. Currently this will consider None and the empty string as empty, but not whitespace like spaces and newlines. Use str.strip() if that's part of your definition of "empty".
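A whitespace-aware variant might look like the sketch below. To keep it runnable without third-party packages it uses the standard library's xml.etree.ElementTree instead of lxml, which is a substitution and not the code from this answer; is_blank and prune_empty are hypothetical helper names, and attributes are ignored when deciding whether an element is empty.

```python
import xml.etree.ElementTree as ET


def is_blank(text):
    # None, the empty string and whitespace-only text all count as empty.
    return text is None or not text.strip()


def prune_empty(elem):
    """Remove recursively empty descendants; return True if elem itself is empty."""
    for child in list(elem):            # copy, because children are removed in the loop
        if prune_empty(child):
            elem.remove(child)
    return is_blank(elem.text) and len(elem) == 0


xml_text = """<customer>
  <govId><id>#</id><idType>SSN</idType><issueDate/><dob> </dob></govId>
  <govId><id/><idType/><state/></govId>
</customer>"""

root = ET.fromstring(xml_text)
prune_empty(root)
remaining = [[child.tag for child in gov_id] for gov_id in root]
print(remaining)
```

Children are pruned before their parent is tested, so the second govId disappears once all of its children have been removed.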
Edit: As pointed out by @Dave, the recursive function could be improved by using a generator expression:
return all((recursively_empty(c) for c in e.getchildren()))
This will not evaluate recursively_empty(c) for all the children at once, but evaluate it for each one lazily. Since all() will stop iteration upon the first False element, this could mean a significant performance improvement.
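The difference is easy to observe by counting calls. In this small sketch, is_empty and the sample values are hypothetical; the only point is how many times the predicate runs in each form.

```python
calls = []


def is_empty(value):
    calls.append(value)     # record every evaluation of the predicate
    return value == ""


values = ["", "x", "", ""]

# List comprehension: the whole list is built before all() sees any element.
calls.clear()
eager_result = all([is_empty(v) for v in values])
eager_calls = len(calls)

# Generator expression: all() stops pulling items at the first False.
calls.clear()
lazy_result = all(is_empty(v) for v in values)
lazy_calls = len(calls)

print(eager_calls, lazy_calls)
```

Both forms return the same result; the generator version simply does less work once a non-empty value has been seen.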
Edit 2: The expression can be further optimized by using e.iterchildren() instead of e.getchildren(). This works with the lxml etree API and the objectify API.