Search XML file with partial text - python

I have an XML file and need to search it for text. The problem is that the text I'm searching for is not the full string that appears in the XML file. For example, in the snippet below I have the name "hill" but not "justin".
<people type="test">
<category name="USA" id="1046" file_group="usa">
<type date="24.12.2013" time="00:00" id="225100">
<local name="justin hill" id="1185"/>
Is it possible to search the XML file in this way and get the associated ID of the user?
Thanks!

Look into BeautifulSoup with lxml.
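A minimal sketch of that approach, assuming the snippet above is completed with closing tags and saved as people.xml (both the file name and the completed markup are assumptions): find_all accepts a compiled regular expression for an attribute value, so a partial match on the name attribute is enough to pull out the id.
import re
from bs4 import BeautifulSoup

with open('people.xml') as fh:
    soup = BeautifulSoup(fh, 'xml')  # the 'xml' feature uses lxml's XML parser

# Match any <local> element whose name attribute contains the partial text.
for local in soup.find_all('local', attrs={'name': re.compile('hill')}):
    print(local['id'])  # -> 1185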

Related

Python XML parsing removing empty CDATA nodes

I'm using minidom from xml.dom to parse an xml document. I make some changes to it and then re-export it back to a new xml file. This file is generated by a program as an export and I use the changed document as an import. Upon importing, the program tells me that there are missing CDATA nodes and that it cannot import.
I simplified my code to test the process:
from xml.dom import minidom
filename = 'Test.xml'
dom = minidom.parse(filename)
with open(filename.replace('.xml', '_Generated.xml'), mode='w', encoding='utf8') as fh:
    fh.write(dom.toxml())
Using this for the Test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<body>
<![CDATA[]]>
</body>
This is what the Test_Generated.xml file contains:
<?xml version="1.0" ?><body>
</body>
A simple solution is to first open the document and change all the empty CDATA nodes to include some value before parsing then removing the value from the new file after generation but this seems like unnecessary work and time for execution as some of these documents include tens of thousands of lines.
I partially debugged the issue down to expatbuilder.py and its parser. The parser is installed with custom callbacks. The callback that handles the data from the CDATA nodes is the character_data_handler_cdata method. The data supplied to this method is already missing after parsing.
Anyone know what is going on with this?
Unfortunately the XML specification is not 100% explicit about what counts as significant information in a document and what counts as noise. But there's a fairly wide consensus that CDATA tags serve no purpose other than to delimit text that hasn't been escaped: so %, &#37;, &#x25;, and <![CDATA[%]]> are different ways of writing the same content, and whichever of these you use in your input, the XML parser will produce the same output. On that assumption, an empty <![CDATA[]]> represents "no content" and a parser will remove it.
If your document design attaches significance to CDATA tags then it's out of line with the usual practice followed by most XML tooling, and it would be a good idea to revise the design to use element tags instead.
Having said that, many XML parsers do have an option to report CDATA tags to the application, so you may be able to find a way around this, but it's still not a good design choice.
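For minidom specifically, one workaround is to re-create the CDATA sections yourself before serializing. This is only a sketch of that idea, and it assumes the empty section belongs directly under the document element, as in the Test.xml example above:
from xml.dom import minidom

dom = minidom.parse('Test.xml')
# The parser has already dropped the empty CDATA section, so put one back
# under the document element before writing the file out again.
dom.documentElement.appendChild(dom.createCDATASection(''))
with open('Test_Generated.xml', 'w', encoding='utf8') as fh:
    fh.write(dom.toxml())
minidom serializes a CDATASection node as <![CDATA[...]]> even when its data is empty, so the marker survives the round trip.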

Beautifulsoup4, parsing Tableau XML file, and writing to file

I'm having an issue where I'm using beautifulsoup to parse the xml generated from a Tableau workbook, and when I write the results to file it doesn't behave as expected. I chose bs4 and its standard XML parser because I find it easiest for my brain to comprehend and I don't need the speed of the lxml parser/package.
Background: I have a calculated field in my Tableau workbook that will programmatically change during publish depending on the server and site location that template workbook will go to. I've already gone through and built some functions and scripted out everything I need to get the data to do this, but when my script writes the xml to file it adds some encodings for the ampersand. This results in a file that is valid and can be opened in Tableau, but the field is considered invalid despite looking like it is valid. I'm thinking the XML is somehow getting malformed somewhere in my process.
Code so far for where I think the issue is occurring:
from bs4 import BeautifulSoup as bs
twb = open(Script_config['local_file_location'], 'r')
bs_content = bs(twb, 'xml')
# formula_final below comes from another script that handles getting the data I need to programmatically generate the formula I need.
# Here is what I use to generate the bulk of the formula for Tableau
# 'When &apos;[{}]&apos; then {} '.format(rows['Column_Name'], rows['Formatted_ColumnName']))
# Does some other stuff and slaps together the formula I need as a string that can be written into my XML
# Verified that my result is coming over correctly and only changes once I do the replacement here and/or the writing of the file.
for calculation in bs_content.find_all('column', {'caption': 'Group By', 'datatype': 'string', 'name': '[Calculation_12345678910]'}):
    calculation.find('calculation')['formula'] = formula_final
with open('test.twb', 'w') as file:
    file.write(str(bs_content))
Sample XML:
<?xml version="1.0" encoding="utf-8"?>
<workbook source-build="2021.1.4 (20211.21.0712.0907)" source-platform="win" version="18.1" xml:base="https://localhost" xmlns:user="http://www.tableausoftware.com/xml/user">
...
<column caption="Group By" datatype="string" name="[Calculation_12345678910]" role="dimension" type="nominal">
<calculation class="tableau" formula="Case [Parameters].[Location External ID Parameter] When &apos;[Territory]&apos; then [Territory] End"/>
</column>
Problem:
In the sample XML, Tableau is expecting the attribute value to contain &apos;, but the file my script writes contains &amp;apos; (the ampersand itself has been escaped a second time).
What I've tried:
Thinking that I could just escape the & character, I put the necessary slashes in place to escape it before the apos; portion, but to no avail: I can't figure out how to get my XML formed so that it doesn't always put the ampersand code in front of the other special characters in my XML.
Any help would be much appreciated!
Good problem description.
Your problem is known as 'double escaping'. Your program is reading data which has already been serialized by an XML processor. That's why it contains &apos;[{}]&apos; and not '[{}]'.
I think your program reads that XML value from a file as a simple string and assigns it to the value of a tag. But when BeautifulSoup's XML processor encounters the & in the tag value, it must replace it with &amp;. So you end up with &amp;apos; instead of &apos; in the XML output.
The quick and dirty solution is to write some code to replace all XML entities with the equivalent text. A better solution would be to read the XML data using an XML parser - that way, your program will receive the intended string value automatically.
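A sketch of the quick fix, assuming formula_final is the pre-escaped string from the question's code above: unescape it once before handing it to BeautifulSoup, and let bs4 do the (single) escaping when it serializes the attribute.
from xml.sax.saxutils import unescape

# formula_final currently holds pre-escaped text such as "... &apos;[Territory]&apos; ...".
# Turning the entities back into plain characters means bs4 escapes them exactly once on output.
formula_plain = unescape(formula_final, {'&apos;': "'", '&quot;': '"'})
calculation.find('calculation')['formula'] = formula_plain
The cleaner alternative, as noted above, is to build formula_final from the parsed attribute values in the first place so it never contains entity references at all.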

how to parse xml with multiple root element

I need to parse both var & group root elements.
Code
import xml.etree.ElementTree as ET
tree_ownCloud = ET.parse('0020-syslog_rules.xml')
root = tree_ownCloud.getroot()
Error
xml.etree.ElementTree.ParseError: junk after document element: line 17, column 0
Sample XML
<var name="BAD_WORDS">core_dumped|failure|error|attack| bad |illegal |denied|refused|unauthorized|fatal|failed|Segmentation Fault|Corrupted</var>
<group name="syslog,errors,">
<rule id="1001" level="2">
<match>^Couldn't open /etc/securetty</match>
<description>File missing. Root access unrestricted.</description>
<group>pci_dss_10.2.4,gpg13_4.1,</group>
</rule>
<rule id="1002" level="2">
<match>$BAD_WORDS</match>
<options>alert_by_email</options>
<description>Unknown problem somewhere in the system.</description>
<group>gpg13_4.3,</group>
</rule>
</group>
I tried following a couple of other questions on Stack Overflow, but none helped.
I know the reason it is not getting parsed; people have usually tried hacks. IMO it's a very common use case to have multiple root elements in XML, and there must be something in the ET parsing library to get this done.
As mentioned in the comment, an XML file cannot have multiple roots. Simple as that.
If you do receive/store data in this format (then it's not proper XML), you could consider the hack of surrounding what you have with a fake tag, e.g.
import xml.etree.ElementTree as ET
with open("0020-syslog_rules.xml", "r") as inputFile:
    fileContent = inputFile.read()
root = ET.fromstring("<fake>" + fileContent + "</fake>")
print(root)
Actually, the example data is not a well-formed XML document, but it is a well-formed XML entity. Some XML parsers have an option to accept an entity rather than a document, and in XPath 3.1 you can parse this using the parse-xml-fragment() function.
Another way to parse a fragment like this is to create a wrapper document which references it as an external entity:
<!DOCTYPE wrapper [
<!ENTITY e SYSTEM "fragment.xml">
]>
<wrapper>&e;</wrapper>
and then supply this wrapper document as the input to your XML parser.
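A sketch of that wrapper approach in Python, using lxml rather than the stdlib ElementTree (which does not expand external entities); the file names wrapper.xml and fragment.xml are assumptions, and the parser has to be told to load the DTD and resolve entities:
from lxml import etree

# wrapper.xml is the document shown above; fragment.xml holds the multi-root content.
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
root = etree.parse("wrapper.xml", parser).getroot()
for child in root:  # the original <var> and <group> elements
    print(child.tag)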

Open and read each file in a directory separately [duplicate]

I'd like to search a Word 2007 file (.docx) for a text string, e.g., "some special phrase" that could/would be found from a search within Word.
Is there a way from Python to see the text? I have no interest in formatting - I just want to classify documents as having or not having "some special phrase".
After reading your post above, I made a 100% native Python docx module to solve this specific problem.
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
The docx module is at https://python-docx.readthedocs.org/en/latest/
More exactly, a .docx document is a Zip archive in OpenXML format: you have first to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.
A problem with searching inside a Word document XML file is that the text can be split into elements at any character. It will certainly be split if formatting changes, for example if "Hello" is bold and "World" is not. But it can be split at any point, and that is valid in OOXML. So you will end up dealing with XML like this even if formatting does not change in the middle of the phrase!
<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">
<w:r w:rsidRPr="003F6D7A">
<w:rPr>
<w:b />
</w:rPr>
<w:t>Hello</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">World.</w:t>
</w:r>
</w:p>
You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other "dead ends" just because the OOXML spec is around 6000 pages long and MS Word can write lots of "stuff" you don't expect. So you could end up writing your own document processing library.
Or you can try using Aspose.Words.
It is available as .NET and Java products. Both can be used from Python: one via COM Interop, the other via JPype. See the Aspose.Words Programmers Guide, "Utilize Aspose.Words in Other Programming Languages" (sorry, I can't post a second link, stackoverflow does not let me yet).
In this example, "Course Outline.docx" is a Word 2007 document, which does contain the word "Windows", and does not contain the phrase "random other string".
>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> "Windows" in z.read("word/document.xml")
True
>>> "random other string" in z.read("word/document.xml")
False
>>> z.close()
Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the 'document.xml' file in the 'word' folder. If you wanted to be more sophisticated, you could then parse the XML, but if you're just looking for a phrase (which you know won't be a tag), then you can just look in the XML for the string.
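If split runs are a concern (see the earlier point about formatting breaking the text into multiple elements), a slightly more careful sketch joins all the w:t text nodes before searching. The file name is the one used in the example above, and the search phrase is from the question:
import zipfile
import xml.etree.ElementTree as ET

W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

with zipfile.ZipFile("Course Outline.docx") as z:
    document = ET.fromstring(z.read("word/document.xml"))

# Join every <w:t> run so a phrase split across runs is still found.
text = ''.join(t.text or '' for t in document.iter(W_NS + 't'))
print("some special phrase" in text)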
You can use docx2txt to get the text inside the docx, then search in that text:
npm install -g docx2txt
docx2txt input.docx # This will print the text to stdout
A docx is just a zip archive with lots of files inside. Maybe you can look at some of the contents of those files? Other than that you probably have to find a lib that understands the word format so that you can filter out things you're not interested in.
A second choice would be to interop with word and do the search through it.
A docx file is essentially a zip file with XML inside it.
The XML contains the formatting, but it also contains the text.
OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:
<b>Looking <i>for</i> this <u>phrase</u></b>
There's no easy way to find that using a simple text scan.
You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.
You may also consider using the library from OpenXMLDeveloper.org

Parse Solr output in python

I am trying to parse a solr output of the form:
<doc>
<str name="source">source:A</str>
<str name="url">URL:A</str>
<date name="p_date">2012-09-08T10:02:01Z</date>
</doc>
<doc>
<str name="source">source:B</str>
<str name="url">URL:B</str>
<date name="p_date">2012-08-08T11:02:01Z</date>
</doc>
I am keen on using beautiful soup (versions that have BeautifulStoneSoup; I think prior to BS4) for parsing the docs.
I have used beautiful soup for HTML parsing, but somehow I am not able to find an efficient way to extract the contents of the tag.
I have written:
for tags in soup('doc'):
    print tags.renderContents()
I do sense that I can work my way through it forcibly to get the outputs (like, say, 'soup'-ing it again), but would appreciate an efficient solution to extract the data.
My output required is:
source:A
URL:A
2012-09-08T10:02:01Z
source:B
URL:B
2012-08-08T11:02:01Z
Thanks
Use an XML parser for this task instead; xml.etree.ElementTree is included with Python:
import urllib2
from xml.etree import ElementTree as ET

# `ET.fromstring()` expects a string containing XML to parse:
# tree = ET.fromstring(solrdata)
# Use `ET.parse()` for a filename or an open file object, such as the one returned by urllib2:
tree = ET.parse(urllib2.urlopen(url))

for doc in tree.findall('.//doc'):
    for elem in doc:
        print elem.attrib['name'], elem.text
Do you have to use this particular output format? Solr supports a Python output format out of the box (in version 4 at least); just use wt=python in the query.
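A hedged sketch of that wt=python suggestion, assuming a local Solr endpoint and the field names from the sample above (URL, core path, and fields are all assumptions): the Python response writer returns a dict literal that can be evaluated with ast.literal_eval.
import urllib2
import ast

url = 'http://localhost:8983/solr/select?q=*:*&wt=python'  # assumed local Solr endpoint
response = ast.literal_eval(urllib2.urlopen(url).read())
for doc in response['response']['docs']:
    print doc['source'], doc['url'], doc['p_date']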
