I have a lot of XML documents with mixed content, i.e. they contain paragraphs of normal text with interspersed XML for formatting etc. (unrelated to the document structure).
I need to segment these paragraphs within the existing XML document:
identify permissible "breakpoints" based on textual criteria (sentence boundaries - full stop, tab etc.)
divide the paragraph into segments defined by adjacent pairs of breakpoints (segment start and end points), i.e. wrap everything between the two breakpoints in a <seg> tag.
The paragraph start and end are also valid breakpoints.
But a breakpoint pair cannot be used if it clashes with the XML structure.
The simplest example goes like this:
<par>Hello <x>you</x>. How are you?</par> might be segmented into <par><seg>Hello <x>you</x>.</seg> <seg>How are you?</seg></par>
But when the interspersed tags span across a potential breakpoint:
<par>Hello <x>you. How are you</x>?</par> cannot be split up and I can only make <par><seg>Hello <x>you. How are you</x>?</seg></par>
A complication is that a breakpoint, if defined simply as a text index, is ambiguous in terms of the XML structure, e.g.:
<par><x>Hello you. How are you?</x></par> can only be split with all breakpoints inside the <x> tag as <par><x><seg>Hello you.</seg> <seg>How are you?</seg></x></par>
I've been trying to do this with lxml, but that quickly became rather complicated. Each segment's start and end breakpoints have to be at the same "level" within the tree, but that could mean being in the text property of one tag and the tail of another; inserting a new tag means moving some of the surrounding text to other tags; the "level" is ambiguous for empty text/tails; and so on. It didn't feel very natural at all.
What's a better way to do this?
Thank you so much!
The best option for transforming XML is XSLT.
You might take a look here as a starting point: "How to transform an XML file using XSLT in Python?"
And this question, "Tokenize mixed content in XSLT", explains some of the basics of what you might also want.
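For the Python side, a minimal sketch of applying a stylesheet with lxml (the file names input.xml and segment.xsl are placeholders; the actual segmentation logic would live in the stylesheet):

from lxml import etree

# Parse the document and the (hypothetical) segmentation stylesheet
doc = etree.parse("input.xml")
transform = etree.XSLT(etree.parse("segment.xsl"))

# Apply the transform and serialize the result
result = transform(doc)
print(str(result))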
I'm building a table in reportlab and want some cells in a table to be displayed in the format "Example: some text", with part of the cell being bolded and the rest not. I'm wrapping each cell in Paragraph to allow for wrapping when lines are too long, but this doesn't provide a neat way to apply formatting to only part of the cell's contents. These are the unideal things I have tried:
Use XML to apply the formatting to the first part of the cell's content, concatenate the string with the second part, and wrap the whole thing in Paragraph; this is currently what I'm doing, and while it technically works, it isn't the prettiest code to look at, especially when working as part of the rest of my script. Relevant code (let me know if it needs more context):
cellData = (example, someText)
cellBold = "".join(("<b>", cellData[0], "</b>", cellData[1]))  # join takes a single iterable
tableRow.append(Paragraph(cellBold, normalStyle))
Display both Paragraphs in the same cell; I tried this in a manner similar to the one described in the answer to this similar question, but doing so displayed the two Paragraphs on separate lines instead of on the same line. This would work perfectly if there were a way to remove the line break at the end of a Paragraph, but I don't think there is. Relevant code:
cellData = (example, someText)
tableRow.append([Paragraph(cellData[0], boldStyle), Paragraph(cellData[1], normalStyle)])
Two other solutions would be to be able to apply formatting to part of a Paragraph or to concatenate two Paragraphs, but I don't think these are possible in reportlab. Is there a way to neatly accomplish what I want, or do I have to stick with the code mess of using XML formatting on the strings themselves? I'm using the doc.build method of building my PDF, if that's relevant.
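For what it's worth, the first approach reads a bit less messy when tucked into a small helper; this is just a sketch of what I'm already doing (bold_prefix_paragraph is a name I made up, not a reportlab feature):

from xml.sax.saxutils import escape
from reportlab.platypus import Paragraph

def bold_prefix_paragraph(bold_part, normal_part, style):
    # Escape the raw text so a stray '<' or '&' can't break the inline markup
    markup = "<b>%s</b>%s" % (escape(bold_part), escape(normal_part))
    return Paragraph(markup, style)

tableRow.append(bold_prefix_paragraph(example, someText, normalStyle))  # same names as above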
tl;dr: Looking to parse a large set of filenames that are a concatenation of two names (container + child) to recover the original two names, where nomenclature is inconsistent. Python library suggestions or any other guidance appreciated.
I am looking for a way to parse strings for information where the nomenclature and formatting of information within those strings will likely be inconsistent to some degree.
Background
Industry: Automation controls
Problem to be solved:
Time series data is exported from an automation system with a single data point being saved to a single .csv file. (example: If the controls system were an environmental controls system the point might be the measured temperature of a room taken at 15 minute intervals.) It is possible to have an environment where there are a few dozen points that export to CSV files or several thousand points that export to CSV files. The structure that the points are normally stored in is as follows: points are contained within a controller, controllers are integrated under a management system and occasionally management systems could be integrated into another management system. The resulting structure is a simple hierarchical tree.
The filenames associated with the CSV files are assembled from the path structure of each point as follows: Directories are created for the management systems (nested if necessary) and under those directories are the CSV files where the filename is a concatenation of the controller name and the point name.
I have written a python script that processes a monthly export of the CSV files (currently about 5500 of them [growing]) into a structured data store and another that assembles spreadsheets for others to review. Currently, I am using some really ugly regular expressions and even uglier string.find()s with a list of static string values that I have hand-entered to parse out controller names and point names for each file so that they can be inserted into the structured data store.
Unfortunately, as mentioned above, the nomenclature used in these environments is rarely consistent. Point names vary widely. The point referenced above might be known as ROOMTEMP, RM_T, RM-T, ROOM-T, ZN_T, ZNT, RMT or several other possibilities. This applies to almost any point contained within a controller. Controller names are also somewhat inconsistent, where they may be named for the type of device they are controlling, the geographic location of the device or even an asset number associated with the device.
I would very much like to get out of the business of hand writing regular expressions to parse file names every time a new location is added. I would like to write code that reads in filenames and looks for patterns across the filenames and then makes a recommendation for parsing the controller and point name out of each filename. I already have an interface where I can assign controller name and point name to each point object by hand so if there are errors with the parse I can modify the results. Ideally, the patterns created by the existing objects would influence the suggested names of new files being parsed.
Some examples of filenames are as follows:
UNIT1254_SAT.csv, UNIT1254_RMT.csv, UNIT1254_fil.csv, AHU_5311_CLG_O.csv, QE239-01_DISCH_STPT.csv, HX_E2_CHW_Return.csv, Plant_RM221_CHW_Sys_Enable.csv, TU_E7_Actual Clg Setpoint.csv, 1725_ROOMTEMP.csv, 1725_DA_T.csv, 1725_RA_T.csv
The order will always be consistent where it is a concatenation of controller name and then point name. There will most likely be a consistent character used to separate controller name from point name (normally an underscore, but occasionally a dash or some other character.)
Does anyone have any recommendations on how to get started with parsing these file names? I've thought through a few ideas, but I keep shelving them before implementation because I keep finding potential performance issues or failure points. The rest of my code is working pretty much the way I need it to; I just haven't figured out an efficient or useful way to pull the correct names out of the filename. Unfortunately, it is not an option to modify the names on the control system side to be consistent.
I don't know if the following code will help you, but I hope it'll give you at least some idea.
Considering that a filename such as "QE239-01_STPT_1725_ROOMTEMP_DA" can contain the following names
'QE239-01'
'QE239-01_STPT'
'QE239-01_STPT_1725'
'QE239-01_STPT_1725_ROOMTEMP'
'QE239-01_STPT_1725_ROOMTEMP_DA'
'STPT'
'STPT_1725'
'STPT_1725_ROOMTEMP'
'STPT_1725_ROOMTEMP_DA'
'1725'
'1725_ROOMTEMP'
'1725_ROOMTEMP_DA'
'ROOMTEMP'
'ROOMTEMP_DA'
'DA'
as being possible elements (container name or point name) of the filename,
I defined the function treat() to return this list from the name.
Then the code treats all the filenames to find all the possible elements of filenames.
The function is based on the idea that, in the chosen example, the element ROOMTEMP can't directly follow the element STPT: STPT_ROOMTEMP isn't a possible container name in this string, since 1725 sits between these two elements.
Then, with the help of a function from the difflib module, I try to discriminate elements that have some similarity to each other, in order to detect patterns under which several name elements can be gathered.
You must play with the value passed to the cutoff parameter to find what gives the most interesting results for you.
It's far from perfect, certainly, and I haven't understood all aspects of your problem.
s = """UNIT1254_SAT
UNIT1254_RMT
UNIT1254_fil
AHU_5311_CLG_O
QE239-01_DISCH_STPT
HX_E2_CHW_Return
Plant_RM221_CHW_Sys_Enable
TU_E7_Actual Clg Setpoint
1725_ROOMTEMP
1725_DA_T
1725_RA_T
UNT147_ROOMTEMP
TRU_EZ_RM_T
HXX_V2_RM-T
RHXX_V2_ROOM-T
SIX8_ZN_T
Plint_RP228_ZNT
SOHO79_EZ_RMT"""

li = s.split('\n')
print(li)
print('- - - - - - - - - - - - - - - - - ')

import difflib
from pprint import pprint

def treat(name):
    # Return every contiguous run of '_'-separated parts, joined back with '_'
    lu = name.split('_')
    W = []
    while lu:
        W.extend('_'.join(lu[0:x]) for x in range(1, len(lu) + 1))
        lu.pop(0)
    return W

if 0:  # flip to 1 to preview treat() on a single example
    q = "QE239-01_STPT_1725_ROOMTEMP_DA"
    pprint(treat(q))
    print('==========================================')

# Collect all the possible elements from all the filenames
WALL = []
for t in li:
    WALL.extend(treat(t))
pprint(WALL)

# Group elements that are similar to one another; tune cutoff to taste
for x in WALL:
    j = set(difflib.get_close_matches(x, WALL, n=9000000, cutoff=0.7))
    if len(j) > 1:
        print(j, '\n')
I am parsing a large number of huge XML files (up to 1GB) and I am cross-referencing a list of about 700 possible matches for a given field. If I find a match, I would like to know which match from my list I hit, rather than using the text from the field itself.
I have the following line in my code
# outside loops iterating over outer layer tags
if any(re.search(s, parsedOutTag.text) for s in preCompiledRegexList):
    # checking inner layer tags for additional content
I am wondering how to access the iteration variable s directly when the condition is satisfied. I currently have a very hack'ish implementation of what I need to happen.
I have to admit, and I am sure it is obvious, that I adopted this line for its efficiency from another question here on Stack Overflow, so I don't really know all the details.
The any function short-circuits, I believe, so even if you could access the s binding from the generator expression, it would only ever be the first matching instance. If that's what you want, then you can just unwrap the if condition:
for s in preCompiledRegexList:
if re.search(s, parsedOutTag.text):
# checking inner layer tags for additional content
break
If you want to process all items in preCompiledRegexList that match, either remove the break above, or use a generator that only yields values that match the required condition:
for outer_s in (inner_s for inner_s in preCompiledRegexList if re.search(inner_s, parsedOutTag.text)):
    # checking inner layer tags for additional content
(Note that having different outer_s and inner_s labels isn't necessary, I just wanted to highlight that they exist in separate scopes.)
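If you only ever need that first match, another option (a sketch reusing the names from your question) is next() over the same generator expression, with a default for the no-match case:

import re

# First pattern in the list that matches, or None if nothing matched
match = next((s for s in preCompiledRegexList if re.search(s, parsedOutTag.text)), None)
if match is not None:
    # match is the regex that hit; check inner layer tags here
    ...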
I need to create a BASH script, ideally using sed, to find and replace value lists in href URL link constructs within HTML site files, looking up old-to-new values in a map file for a given URL construct. There are around 25K site files to look through, and the map has around 6,000 entries to search against.
All old and new values have 6 digits.
The URL construct is:
One value:
HREF=".*jsp\?.*N=[0-9]{1,}.*"
List of values:
HREF=".*\.jsp\?.*N=[0-9]{1,}+N=[0-9]{1,}+N=[0-9]{1,}...*"
The list of values is delimited by the + (plus) symbol, and the list can be 1 to n values in length.
I want to ignore a construct such as this:
HREF=".*\.jsp\?.*N=0.*"
i.e. where the list is only N=0
Effectively I'm only interested in URLs that include one or more values that are in the map file and are not prepended with CHANGED, i.e. lists that require updating.
PLEASE NOTE: in the above construct examples, .* means any character that isn't a digit; I'm just interested in any 6-digit values in the list of values after N=. So I'm trying to isolate the N= list from the rest of the URL construct, and it should be noted that this N= list can appear anywhere within the URL construct.
Initially, I want to create a script that will create a report of all links that fulfil the above criteria and have a 6-digit OLD value that's in the map file, with each link's file path, to get an understanding of the links impacted. E.g.:
Filename link
filea.jsp /jsp/search/results.jsp?N=204200+731&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
filea.jsp /jsp/search/BROWSE.jsp?Ntx=mode+matchallpartial&N=213890+217867+731&
fileb.jsp /jsp/search/results.jsp?N=0+450+207827+213767&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
Lastly, I'd like to find and replace all 6-digit numbers within the URL construct lists, as outlined above, as efficiently as possible (I'd like it to be reasonably fast, as there could be around 25K files, with 6K values to look up, and potentially multiple values per list).
PLEASE NOTE: an additional issue when finding and replacing is that an old value could have been assigned a new value that's already in use, which may itself also have to be replaced.
E.G. If the map file is as below:
MAP-FILE.txt
OLD NEW
214865 218494
214866 217854
214867 214868
214868 218633
... ...
and there is a HREF link such as:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214867+214868
214867 changes to 214868 - this new value would need to be flagged to show it has already been changed and should not be replaced again; otherwise what was 214867 would become 218633, since all occurrences of 214868 would be changed to 218633. Hope this makes sense - I would then need to run through the file and remove the flag from all 6-digit numbers that had been marked, such that the link would become:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214868CHANGED+218633CHANGED
Unless there's a better way to manage these infile changes.
Could someone please help me with this? I'm not an expert with these kinds of changes, so help would be massively appreciated.
Many thanks in advance,
Alex
I will write the outline for the code in some kind of pseudocode, as I don't remember Python well enough to quickly write the code in it.
First find what type the link is (if it contains only N=0 then type 3, if it contains "+" then type 2, else type 1) and get a list of strings containing "N=..." by exploding (the name of the PHP function) on the "+" sign.
The first loop is over links. The second loop is over each N= number. The third loop looks in the map file and finds the replacement value. Load the data of the map file into a variable before all the loops; file reading is the slowest operation you have in programming.
You replace the value in the third loop, then implode (PHP function) the list of new strings into a new link when returning to the first loop.
You probably have several files with links, so you need another loop over the files.
When dealing with repeated codes you need a while loop until a spare number is found, and you need to save the numbers that are already used in a list.
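A rough Python sketch of that outline, assuming the map file is MAP-FILE.txt as shown in the question and using the CHANGED flag the question proposes (a starting point, not a tested solution):

import re

# Load the OLD -> NEW map once, before all the loops
mapping = {}
with open("MAP-FILE.txt") as fh:
    next(fh)  # skip the "OLD NEW" header row
    for line in fh:
        old, new = line.split()
        mapping[old] = new

def replace_value(m):
    # m.group(0) is one 6-digit value from an N= list
    new = mapping.get(m.group(0))
    # Tag the replacement so a value that is both NEW and OLD isn't replaced twice
    return new + "CHANGED" if new else m.group(0)

def update_link(link):
    # Rewrite only the 6-digit values inside the N=...+... list, wherever it appears
    def fix_list(m):
        return re.sub(r"\b[0-9]{6}\b", replace_value, m.group(0))
    link = re.sub(r"N=[0-9+]+", fix_list, link)
    # Strip the temporary flags once the whole list has been handled
    return link.replace("CHANGED", "")

print(update_link("/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214867+214868"))
# -> /jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214868+218633

Note that a single re.sub pass never rescans text it has already substituted, so the flag mainly matters if the files are processed in several passes; it is kept here because the question's workflow asks for it.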
Forgive me if I have the incorrect terminology; perhaps just getting the "right" words to describe what I want is enough for me to find the answer on my own.
I am working on a parser for ODL (Object Description Language), an arcane language that, as far as I can tell, is now used only by NASA's PDS (Planetary Data System; it's how NASA makes its data available to the public). Fortunately, PDS is finally moving to XML, but I still have to write software for a mission that fell just before the cutoff.
ODL defines objects in something like the following manner:
OBJECT = TABLE
ROWS = 128
ROW_BYTES = 512
END_OBJECT = TABLE
I am attempting to write a parser with pyparsing, and I was doing fine right up until I came to the above construction.
I have to create some rule that is able to ensure that the right-hand-value of the OBJECT line is identical to the RHV of END_OBJECT. But I can't seem to put that into a pyparsing rule. I can ensure that both are syntactically valid values, but I can't go the extra step and ensure that the values are identical.
Am I correct in my intuition that this is a context-sensitive grammar? Is that the phrase I should be using to describe this problem?
Whatever kind of grammar this is in the theoretical sense, is pyparsing able to handle this kind of construction?
If pyparsing is not able to handle it, is there another Python tool capable of doing so? How about ply (the Python implementation of lex/yacc)?
It is in fact a grammar for a context-sensitive language, classically abstracted as wcw, where w is in (a|b)* (note that wcw', where ' indicates reversal, is context-free).
Parsing Expression Grammars are capable of parsing wcw-type languages by using semantic predicates. PyParsing provides the matchPreviousExpr() and matchPreviousLiteral() helper methods for this very purpose, e.g.
w = Word("ab")
s = w + "c" + matchPreviousExpr(w)
So in your case you'd probably do something like
table_name = Word(alphas, alphanums)
object = (Literal("OBJECT") + "=" + table_name + ... +
          Literal("END_OBJECT") + "=" + matchPreviousExpr(table_name))
As a general rule, parsers are built as context-free parsing engines. If there is context sensitivity, it is grafted on after parsing (or at least after the relevant parsing steps are completed).
In your case, you want to write context-free grammar rules:
head = 'OBJECT' '=' IDENTIFIER ;
tail = 'END_OBJECT' '=' IDENTIFIER ;
element = IDENTIFIER '=' value ;
element_list = element ;
element_list = element_list element ;
block = head element_list tail ;
The check that the head and tail constructs have matching identifiers isn't technically done by the parser.
Many parsers, however, allow a semantic action to occur when a syntactic element is recognized, often for the purpose of building tree nodes. In your case, you want to use this to enable additional checking. For element, you want to make sure the IDENTIFIER isn't a duplicate of something already in the block; this means that for each element encountered, you'll want to capture the corresponding IDENTIFIER and build a block-specific list to enable duplicate checking. For block, you want to capture the head IDENTIFIER and check that it matches the tail IDENTIFIER.
This is easiest if you build a tree representing the parse as you go along and hang the various context-sensitive values on the tree in the appropriate places (e.g., attach the actual IDENTIFIER value to the tree node for the head clause). At the point where you are building the tree node for the tail construct, it should be straightforward to walk up the tree, find the head node, and then compare the identifiers.
This is easier to think about if you imagine the entire tree being built first, and then a post-processing pass over the tree being used to do this checking. Lazy people in fact do it this way :-} All we are doing is pushing work that could be done in the post-processing step into the tree-building steps attached to the semantic actions.
None of these concepts is Python-specific, and the details for PyParsing will vary somewhat.
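As an illustration of the post-processing variant, here is a minimal sketch, assuming the parser has already produced simple block nodes (the Block class and its field names are mine, not part of any parser's API):

class Block:
    def __init__(self, head_name, elements, tail_name):
        self.head_name = head_name  # IDENTIFIER from the head clause
        self.elements = elements    # list of (identifier, value) pairs
        self.tail_name = tail_name  # IDENTIFIER from the tail clause

def check_block(block):
    # Context-sensitive check 1: head and tail identifiers must match
    if block.head_name != block.tail_name:
        raise ValueError("END_OBJECT = %s does not close OBJECT = %s"
                         % (block.tail_name, block.head_name))
    # Context-sensitive check 2: no duplicate element identifiers within the block
    seen = set()
    for ident, _value in block.elements:
        if ident in seen:
            raise ValueError("duplicate element %s in OBJECT = %s"
                             % (ident, block.head_name))
        seen.add(ident)

check_block(Block("TABLE", [("ROWS", "128"), ("ROW_BYTES", "512")], "TABLE"))  # passes silently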