How to efficiently extract <![CDATA[...]]> content from an XML file with Python? - python

I have the following xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
<document><![CDATA["#username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING ]]></document>
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt ]]></document>
<document><![CDATA[#username Shout out to me???? ]]></document>
</author>
What is the most efficient way to parse this and extract the <![CDATA[...]]> content into a list? Something like:
[#username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING Ugh YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt #username Shout out to me???? ]
This is what I tried:
from bs4 import BeautifulSoup
x='/Users/user/PycharmProjects/TratandoDeMejorarPAN/test.xml'
y = BeautifulSoup(open(x), 'xml')
out = [y.author.document]
print out
And this is the output:
[<document>"#username: That boner came at the wrong time ???? http://t.co/5XgDyCaCjR" HELP I'M DYING </document>]
The problem with this output is that I should not get the <document></document> tags. How can I remove them and get the text of every element of this XML in a list?

There are several things wrong here. (Asking questions on selecting a library is against the rules here, so I'm ignoring that part of the question).
You need to pass in a file handle, not a file name.
That is: y = BeautifulSoup(open(x))
You need to tell BeautifulSoup that it's dealing with XML.
That is: y = BeautifulSoup(open(x), 'xml')
CDATA sections don't create elements. You can't search for them in the DOM, because they don't exist in the DOM; they're just syntactic sugar. Just look at the text directly under the document, don't try to search for something named CDATA.
To state it again, somewhat differently: <doc><![CDATA[foo]]></doc> is exactly the same as <doc>foo</doc>. What's different about a CDATA section is that everything inside it is automatically escaped, meaning that <![CDATA[<hello>]]> is equivalent to &lt;hello&gt;. However, you can't tell from the parsed object tree whether your document contained a CDATA section with a literal <hello> or escaped text like &lt;hello&gt;. This is by design, and true of any compliant XML DOM implementation.
Now, how about some code that actually works:
import bs4

doc = """<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
<document><![CDATA["#username: That came at the wrong time ????" HELP I'M DYING ]]></document>
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[YES !!!! WE GO FOR IT. ]]></document>
<document><![CDATA[#username Shout out to me???? ]]></document>
</author>
"""

doc_el = bs4.BeautifulSoup(doc, 'xml')
print([el.text for el in doc_el.findAll('document')])
If you want to read from a file, replace doc with open(filename, 'r').
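For comparison, the standard library's ElementTree handles this just as well, since it too dissolves CDATA sections into plain text. A minimal sketch, assuming the file path from the question:
import xml.etree.ElementTree as ET

tree = ET.parse('/Users/user/PycharmProjects/TratandoDeMejorarPAN/test.xml')
# CDATA content is exposed as the ordinary .text of each element.
texts = [doc.text for doc in tree.iter('document')]
print(texts)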

Related

BeautifulSoup: How to pass a variable into soup.find({variable})

I am using Beautiful Soup to search an XML file provided by the SEC (this is public data). Beautiful Soup works very well for referencing tags, but I cannot seem to pass a variable to its find function. Static content is fine. I think there is a gap in my Python understanding that I can't seem to figure out. (I code a few days a year; it's not my main role.)
File:
https://reports.adviserinfo.sec.gov/reports/CompilationReports/IA_FIRM_SEC_Feed_02_08_2023.xml.gz
I download, unzip and then create the soup from the file using lxml.
with open(Firm_Download_name, 'r') as f:
    soup = BeautifulSoup(f, 'lxml')
Next is where I am running into trouble: I have a list of Firm CRD numbers (these are public numbers identifying the firm) that I am looking for in the XML file, in order to pull out various data points from the child tags.
If I write it statically such as:
soup.find(firmcrdnb="5639055").parent
This works perfectly, but I want to loop through a list of CRD numbers and pull out a different block each time. I cannot figure out how to pass a variable to the soup.find function.
I feel like this should be simple. I appreciate any help you can provide.
Here is my current attempt:
searchstring = 'firmcrdnb="'+Firm_CRD+'"'
select_firm = soup.find(searchstring).parent
I have tried other similar setups and reviewed other stack exchanges such as Is it possible to pass a variable to (Beautifulsoup) soup.find()? but just not quite getting it.
Here is an example of the XML.
<?xml version="1.0" encoding="iso-8859-1"?>
<IAPDFirmSECReport GenOn="2017-09-30">
<Firms>
<Firm>
<Info SECRgnCD="MIRO" FirmCrdNb="9999" SECNb="999-99999" BusNm="XXXX INC." LegalNm="XXX INC" UmbrRgstn="N"/>
<MainAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" PhNb="999-999-9999" FaxNb="999-999-9999"/>
<MailingAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" />
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
<NoticeFiled>
Thanks
ps: if anyone has ideas on how to improve the speed of the search on this large file, I'd appreciate that too. I get messages such as "pydevd warning: Computing repr of soup (BeautifulSoup) was slow (took 43.83s)". I did install and import chardet per the BeautifulSoup documentation, but that hasn't seemed to help.
I'm not sure where I got turned around, but my static answer did in fact not work.
The tag is "info" and the attribute is "firmcrdnb".
The answer that works is:
select_firm = soup.find("info", {"firmcrdnb" : Firm_CRD}).parent
Welcome to Stack Overflow.
Try:
select_firm = soup.find(attrs={'firmcrdnb': str(Firm_CRD)}).parent
Maybe I'm missing something. If it works statically, have you tried something such as:
list_of_crds = ["11111", "22222", "33333"]
for crd in list_of_crds:
    result = soup.find(firmcrdnb=crd).parent
    ...
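Putting the pieces together, a minimal sketch (the CRD numbers are illustrative). Note that the lxml HTML parser lowercases tag and attribute names, which is why FirmCrdNb in the file is matched as firmcrdnb:
from bs4 import BeautifulSoup

with open(Firm_Download_name, 'r') as f:
    soup = BeautifulSoup(f, 'lxml')

list_of_crds = ["11111", "22222", "33333"]  # illustrative CRD numbers
for crd in list_of_crds:
    info = soup.find("info", {"firmcrdnb": crd})
    if info is not None:              # skip CRDs not present in the feed
        firm = info.parent
        print(crd, info.get("busnm"))  # e.g. the BusNm attribute, lowercased by lxml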

'XML' document with multiple root elements

I have an 'XML' file, which I do not control, that I am trying to parse with etree.ElementTree; it contains two root elements:
<?xml version="1.0"?>
<meta>
... data I do not care about
</meta>
<database>
... data I wish to parse
</database>
Trying to parse the file, I'm getting the error 'junk after document element', which I understand is related to the fact that it isn't valid XML, since XML can only have one root element. I've been reading around for a solution, and while I have found a few posts addressing this issue, they have all been different enough or difficult enough that I, as a beginner, could not get my head round them.
As I understand it, the solution would be either to encase everything in a new root element and parse that, or to somehow ignore/split off the <meta> element and its children. Any guidance on how best to accomplish this would be appreciated.
Beautiful Soup might ease your problem (although it is the lxml inside that renders this service), but it's a long-term downgrade, for instance when you later want to use XPath.
Stick to ET. It is strict and won't allow you to parse non-well-formed XML, which requires one root element and nothing else outside of it.
If you manage to parse your XML file, you can be sure it is well-formed. Both options are legitimate:
1) Read the file as a string, remove the declaration and put root tags around it. Then parse from the string. (Clear the string variable after that.) Or you could edit the file first.
2) Create a new root element ( new_root = ET.Element('new_root') ), read the top-level elements in the file and append them with SubElement.
The second option requires more coding and maintenance if the file gets changed.
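A minimal sketch of the first option (the file name is illustrative): strip the XML declaration, wrap the rest in a synthetic root, and parse the result as a string.
import xml.etree.ElementTree as ET

with open('data.xml', 'r') as f:  # hypothetical file name
    content = f.read()

# Drop the declaration if present; it may only appear at the very start.
if content.startswith('<?xml'):
    content = content.split('?>', 1)[1]

root = ET.fromstring('<new_root>' + content + '</new_root>')
database = root.find('database')  # now reachable under the synthetic root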
Here is one solution using BeautifulSoup, where data holds the malformed XML. BeautifulSoup will process it like any other document, so you can access both parts:
from bs4 import BeautifulSoup
data = """<?xml version="1.0"?>
<meta>
<somedata>1</somedata>
</meta>
<database>
<important>100</important>
</database>"""
soup = BeautifulSoup(data, 'lxml')
print(soup.database.important.text)
Prints:
100

How do I parse an XML comment properly in Python

I have been using Python recently and I want to extract information from a given XML file. The problem is that the information is stored really badly, in a format like this:
<Content>
<tags>
....
</tags>
<![CDATA["string1"; "string2"; ....
]]>
</Content>
I cannot post the entire data here, since it is about 20,000 lines.
I just want to receive the list containing ["string1", "string2", ...], and this is the code I have been using so far:
import xml.etree.ElementTree as ET

tree = ET.parse(xmlfile)
for node in tree.iter('Content'):
    print(node.text)
However, my output is None. How can I receive the comment data? (Again, I am using Python.)
You'll want to create a SAX-based parser instead of a DOM-based parser, especially with a document as large as yours.
A SAX-based parser requires you to write your own control logic for how data is stored. It's more complicated than simply loading the document into a DOM, but much faster, since it reads line by line instead of the entire document at once. That gives it the advantage that it can deal with squirrely cases like yours, with comments.
When you build your handler, you'll probably want to use the LexicalHandler in your parser to pull out those comments.
I'd give you a working example of how to build one, but it's been a long time since I've done it myself. There are plenty of guides online on how to build a SAX-based parser, so I'll defer that discussion to another thread.
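For what it's worth, here is a minimal sketch of that idea with the standard library's xml.sax (the file name is hypothetical). Expat reports comments and CDATA boundaries through the lexical-handler property, while the CDATA text itself arrives through the ordinary characters() callback:
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler

class CDataCapture(ContentHandler):
    """Collects text that arrives between startCDATA/endCDATA events."""
    def __init__(self):
        super().__init__()
        self.in_cdata = False
        self.blocks = []

    # Ordinary ContentHandler callback: fires for all character data.
    def characters(self, content):
        if self.in_cdata:
            self.blocks.append(content)

    # Lexical-handler callbacks, wired up by the expat driver.
    def comment(self, content):
        print("comment:", content)

    def startCDATA(self):
        self.in_cdata = True

    def endCDATA(self):
        self.in_cdata = False

handler = CDataCapture()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.setProperty(property_lexical_handler, handler)
parser.parse("content.xml")  # hypothetical file name
print(handler.blocks)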
The problem is that your comment does not seem to be standard. A standard comment looks like this: <!--Comment here-->.
And these kind of comments can be parsed with Beautifulsoup for example:
from bs4 import BeautifulSoup, Comment
xml = """<Content>
<tags>
...
</tags>
<!--[CDATA["string1"; "string2"; ....]]-->
</Content>"""
soup = BeautifulSoup(xml, 'lxml')
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
print(comments)
This returns ['[CDATA["string1"; "string2"; ....]]'], from which it would be easy to parse further into the required strings.
If you have non-standard comments, I would recommend a regular expression like:
import re
xml = """<Content>
<tags>
asd
</tags>
<![CDATA["string1"; "string2"; ....]]>
</Content>"""
for i in re.findall("<!.+>", xml):
    for j in re.findall('\".+\"', i):
        print(j)
This returns: "string1"; "string2"
With Python 3.8 you can keep comments when parsing with ElementTree.
Sample code to read attributes, values, tags and comments in an XML file:
import xml.etree.ElementTree as ET

# Python 3.8+: keep comment nodes in the parsed tree
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
tree = ET.parse(infile_path, parser)

comment = ""
for node in tree.iter():
    if node.tag is ET.Comment:           # comment nodes carry ET.Comment as their tag
        comment = node.text
    else:
        tag = node.tag                   # element name
        name = node.attrib.get("name")   # "name" attribute, if any
        value = node.text                # text content
        print(tag, name, value, comment)

Appropriate xpath syntax with python for non-standard xml

The input file is actually multiple XML files appended into one file. (Sourced from Google Patents.) This is an example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23">
<applicants>
<applicant sequence="001" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Beyer</last-name>
<first-name>Daniel Lee</first-name>
<address><city>Franklin</city>
<state>TN</state>
<country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
<applicant sequence="002" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Friedland</last-name>
<first-name>Jason Michael</first-name>
<address><city>Franklin</city>
<state>TN</state>
<country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
</applicants>
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
I'm trying to create a "-".joined string of the XPath text results for all of the children and grandchildren within <applicant>, using the following in Python with lxml:
import urllib2, os, zipfile
from lxml import etree

count = 0
for item in xmlSplitter(zf.open(xml_file)):
    count += 1
    if count > 1: break
    doc = etree.XML(item)
    docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
    title = first(doc.xpath('//invention-title/text()'))
    applicant = "-".join(doc.xpath('//applicants/applicant/*/text()'))
    print "DocID: {0}\nTitle: {1}\nApplicant: {2}\n".format(docID, title, applicant)
    outFile.write(str(docID) + "|" + str(title) + "|" + str(applicant) + "\n")
I've tried multiple XPath combinations, but I can't produce a string with hyphens for <applicants>, and while //text() can get to the grandchildren, it doesn't help with the stringing. What is the appropriate XPath syntax to select all text within the children and grandchildren of <applicant> and still punch it out as one string? While not shown in this example, is there a way to ignore Unicode that might be present at the beginning of a text line too (I believe it appears in some of the later XML docs)? The 'applicant' output I'm hoping to get should look something like:
Beyer-Daniel Lee-Franklin-TN-US-omitted-US-Friedland-Jason Michael-Franklin-TN-US-omitted-US
This question is very similar to this other question of yours.
There are two problems here:
How to get from "non-standard XML" to "standard XML"?
How to use XPath to get text values of descendant elements and concatenate them?
You need to solve 1 before attacking 2. If you need help with that, ask a separate question.
"Non-standard XML" is the same as not XML at all. You can't parse it as XML, and you can't use XPath on it. But you have phrased the question in a way that makes it look like you are trying to do that anyway.
Assuming that your question is actually about working with "standard XML", how about using the same approach as in my answer to your other question?
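For problem 2, once each record is a well-formed document, descendant text can be gathered with //applicant//text(), filtering out the whitespace-only nodes before joining. A minimal sketch, using an abridged version of the sample above:
from lxml import etree

# One well-formed document (abridged from the sample above).
xml = b"""<us-patent-grant>
<applicants>
<applicant>
<addressbook><last-name>Beyer</last-name>
<first-name>Daniel Lee</first-name>
<address><city>Franklin</city><state>TN</state><country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
</applicants>
</us-patent-grant>"""

doc = etree.XML(xml)
# //text() also returns the whitespace between elements; strip and drop those nodes.
parts = [t.strip() for t in doc.xpath('//applicants/applicant//text()') if t.strip()]
print("-".join(parts))
# Beyer-Daniel Lee-Franklin-TN-US-omitted-US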

Python get ID from XML data

I am a total Python newb and am trying to parse an XML document that is returned from Google as a result of a POST request.
The document returned looks like the one outlined in this doc
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#Archives
where it says 'The response contains information about the archive.'
The only part I am interested in is the Id attribute right near the beginning. There will only ever be one entry, and one Id attribute. How can I extract it to be used later? I've been fighting with this for a while, and I feel like I've tried everything from minidom to ElementTree. No matter what I do, my search comes back blank, loops don't iterate, or methods are missing. Any assistance is much appreciated. Thank you.
I would highly recommend the Python package BeautifulSoup. It is awesome. Here is a simple example using their example data (assuming you've installed BeautifulSoup already):
from BeautifulSoup import BeautifulSoup
data = """<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'
xmlns:docs='http://schemas.google.com/docs/2007'
xmlns:gd='http://schemas.google.com/g/2005'>
<id>
https://docs.google.com/feeds/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA</id>
<published>2010-11-18T18:34:06.981Z</published>
<updated>2010-11-18T18:34:07.763Z</updated>
<app:edited xmlns:app='http://www.w3.org/2007/app'>
2010-11-18T18:34:07.763Z</app:edited>
<category scheme='http://schemas.google.com/g/2005#kind'
term='http://schemas.google.com/docs/2007#archive'
label='archive' />
<title>Document Archive - someuser@somedomain.com</title>
<link rel='self' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<link rel='edit' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<author>
<name>someuser</name>
<email>someuser@somedomain.com</email>
</author>
<docs:archiveNotify>someuser@somedomain.com</docs:archiveNotify>
<docs:archiveStatus>flattening</docs:archiveStatus>
<docs:archiveResourceId>
0Adj-hQNOVsTFSNDEkdk2221OTJfMWpxOGI5OWZu</docs:archiveResourceId>
<docs:archiveResourceId>
0Adj-hQNOVsTFZGZodGs2O72NFMllMQDN3a2Rq</docs:archiveResourceId>
<docs:archiveConversion source='application/vnd.google-apps.document'
target='text/plain' />
</entry>"""
soup = BeautifulSoup(data, fromEncoding='utf8')
print soup('id')[0].text
There is also expat, which is built into Python, but it is worth learning BeautifulSoup, because it will respond way better to real-world XML (and HTML).
Assuming the variable response contains a string representation of the returned XML document, let me tell you the WRONG way to solve your problem:
id = response.split("</id>")[0].split("<id>")[1]
The right way to do it is with xml.sax or xml.dom or expat, but personally, I wouldn't be bothered unless I wanted to have robust error handling of exception cases when response contains something unexpected.
EDIT: I forgot about BeautifulSoup, it is indeed as awesome as Travis describes.
If you'd like to use minidom, you can do the following (replace gd.xml with your xml input):
from xml.dom import minidom
dom = minidom.parse("gd.xml")
id = dom.getElementsByTagName("id")[0].childNodes[0].nodeValue
print id
Also, I assume you meant id element, and not id attribute.
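One likely reason searches "come back blank" with minidom or ElementTree is the Atom namespace: the <id> element is really {http://www.w3.org/2005/Atom}id, so an unqualified lookup finds nothing. A minimal sketch with ElementTree, reusing the data string from the first answer:
import xml.etree.ElementTree as ET

# Encode first so the encoding declaration in the string is handled cleanly.
root = ET.fromstring(data.encode('utf-8'))
ns = {'atom': 'http://www.w3.org/2005/Atom'}
# The lookup must be namespace-qualified, or find() returns None.
print(root.find('atom:id', ns).text.strip())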
