Parsing a large complicated patent classification XML file in Python - python

I am trying to parse a large file, in particular the English version of the file at https://www.wipo.int/ipc/itos4ipc/ITSupport_and_download_area/20200101/MasterFiles/index.html, a classification of patents in XML format. I am new to XML parsing, which I think is why I'm having a hard time extracting the elements I really want from this file.
Let me provide some context:
<?xml version="1.0" encoding="UTF-8"?>
<IPCScheme xmlns="http://www.wipo.int/classifications/ipc/masterfiles" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" edition="20200101" lang="EN" xsi:schemaLocation="http://www.wipo.int/classifications/ipc/masterfiles ipc_scheme_3-1.xsd">
  <ipcEntry kind="s" symbol="A" entryType="K">
    <textBody>
      <title>
        <titlePart>
          <text>HUMAN NECESSITIES</text>
        </titlePart>
      </title>
    </textBody>
    <ipcEntry kind="t" symbol="A01" endSymbol="A01" entryType="K">
      <textBody>
        <title>
          <titlePart>
            <text>AGRICULTURE</text>
          </titlePart>
        </title>
      </textBody>
    </ipcEntry>
    <ipcEntry kind="c" symbol="A01" entryType="K">
      <textBody>
        <title>
          <titlePart>
            <text>AGRICULTURE</text>
          </titlePart>
          <titlePart>
            <text>FORESTRY</text>
          </titlePart>
          <titlePart>
            <text>ANIMAL HUSBANDRY</text>
          </titlePart>
          <titlePart>
            <text>HUNTING</text>
          </titlePart>
          <titlePart>
            <text>TRAPPING</text>
          </titlePart>
          <titlePart>
            <text>FISHING</text>
          </titlePart>
        </title>
      </textBody>
      .
      .
    </ipcEntry>
    .
    .
</IPCScheme>
You can assume that the file is well-formed: every element is properly closed. It is quite long (~800,000 lines), which is why I'm not attaching the whole file in this code sample.
A short overview of the hierarchy:
ROOT
level 1: Symbols {A,B,C,D,E,F,K}
level 2: Subdivisions in each symbol {A01, B22 etc.}
level 3: further subdivisions
This goes on down to symbols such as H05K0013040000, the most granular level. Some branches stop at about level 5; the sample above isn't closed because of these further subdivisions in between.
The task
I would like to extract the textual descriptions from this patent classification file; for example, in the sample provided I would like to extract HUMAN NECESSITIES or AGRICULTURE. You can assume that all these subdivisions have such text in them and that most of them follow this hierarchy (that is, <title> -> <titlePart> -> <text>).
Using lxml in Python
Here is a sample code of what I've been trying to do:
from lxml import etree
import lxml
tree = etree.parse('EN_ipc_scheme_20200101.xml')
root = tree.getroot()
for elem in root.findall(".//*[@kind='s']"):
    body = elem.find('textBody/title/titlePart/text')
    print(body)
My output is
None
None
None
None
None
None
None
None

This might work :)
from lxml import etree
import lxml
tree = etree.parse('EN_ipc_scheme_20200101.xml')
root = tree.getroot()
for element in root.iter():
    if element.text is not None:
        print("%s" % (element.text))
output:
HUMAN NECESSITIES
AGRICULTURE
AGRICULTURE
FORESTRY
ANIMAL HUSBANDRY
HUNTING
TRAPPING
FISHING
SOIL WORKING IN AGRICULTURE OR FORESTRY
PARTS, DETAILS, OR ACCESSORIES OF AGRICULTURAL MACHINES OR IMPLEMENTS, IN GENERAL
making or covering furrows or holes for sowing, planting or manuring
machines for harvesting root crops
mowers convertible to soil working apparatus or capable of soil working
mowers combined with soil working implements
soil working for engineering purposes
... (continued very long had to interrupt)
You might also change the code to write to a text file instead of printing to the console; that would save the result. It might take some time to write all of it.
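If the goal is to save the text rather than print it, one option is to stream the file instead of loading the whole tree. The following is only a sketch under some assumptions: it uses the standard library's xml.etree.ElementTree.iterparse rather than lxml, it assumes every description sits in a <text> element inside a <textBody> as in the question's sample, and the names iter_texts, write_texts and SAMPLE are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns declaration in the question's sample.
IPC_NS = "http://www.wipo.int/classifications/ipc/masterfiles"
TEXT_BODY = "{%s}textBody" % IPC_NS
TEXT = "{%s}text" % IPC_NS

# Tiny stand-in for the real file (assumption: same nesting as the sample).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<IPCScheme xmlns="http://www.wipo.int/classifications/ipc/masterfiles">
  <ipcEntry kind="s" symbol="A">
    <textBody><title><titlePart><text>HUMAN NECESSITIES</text></titlePart></title></textBody>
    <ipcEntry kind="c" symbol="A01">
      <textBody><title>
        <titlePart><text>AGRICULTURE</text></titlePart>
        <titlePart><text>FORESTRY</text></titlePart>
      </title></textBody>
    </ipcEntry>
  </ipcEntry>
</IPCScheme>"""


def iter_texts(source):
    """Yield the text of every <text> element, one <textBody> at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == TEXT_BODY:
            for text_elem in elem.iter(TEXT):
                content = "".join(text_elem.itertext()).strip()
                if content:
                    yield content
            # The body has been read; drop its children to limit memory use.
            # The (much smaller) ipcEntry skeleton still stays in memory.
            elem.clear()


def write_texts(source, out):
    """Write one description per line to an open text file object."""
    for line in iter_texts(source):
        out.write(line + "\n")


buffer = io.StringIO()
write_texts(io.BytesIO(SAMPLE), buffer)
print(buffer.getvalue(), end="")
```

With the real file you would pass its path as source and an object such as open('titles.txt', 'w', encoding='utf-8') as out; both file names are placeholders.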

The namespace of every entity in the XML example you have shown falls under xmlns="http://www.wipo.int/classifications/ipc/masterfiles". You can see this by looking at the children of root.
root.getchildren()
# returns:
[<Element {http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry at 0x210f93ab288>]
The URI in the curly brackets is the namespace. To search, you have to specify the namespace you are searching within. Normally you can just put the namespace prefix in front of each path element and pass in the prefix-to-URI mapping as a dictionary, like this:
root.findall('xs:textBody', namespaces=ns)
The issue is that this namespace has no prefix (it is the default namespace), so it appears in the namespace map under the key None.
root.nsmap
# returns:
{None: 'http://www.wipo.int/classifications/ipc/masterfiles',
'xhtml': 'http://www.w3.org/1999/xhtml',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
As a simple workaround, you can replace the None key with a key of your choosing, then reference that key in searches. Below, the default namespace is referred to as 'z'.
ns = root.nsmap
ns['z'] = ns.pop(None)
for elem in root.findall(".//*[@kind='s']", namespaces=ns):
    body = elem.find('z:textBody/z:title/z:titlePart/z:text', namespaces=ns)
    print(body.text)
# prints:
HUMAN NECESSITIES
Alternatively, you can search through all namespaces using {*} before each path element.
for elem in root.findall(".//*[@kind='s']"):
    body = elem.find('{*}textBody/{*}title/{*}titlePart/{*}text')
    print(body.text)
# prints:
HUMAN NECESSITIES
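To see the prefix approach end to end, here is a small self-contained sketch. It is written against the standard library's xml.etree.ElementTree so it runs without lxml, with the prefix-to-URI dictionary built by hand; the function name titles_by_entry and the shortened SAMPLE document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written prefix mapping; 'z' is an arbitrary prefix for the default namespace.
NS = {"z": "http://www.wipo.int/classifications/ipc/masterfiles"}

# Shortened version of the question's sample (assumption: same nesting).
SAMPLE = """<IPCScheme xmlns="http://www.wipo.int/classifications/ipc/masterfiles">
  <ipcEntry kind="s" symbol="A">
    <textBody><title><titlePart><text>HUMAN NECESSITIES</text></titlePart></title></textBody>
    <ipcEntry kind="t" symbol="A01">
      <textBody><title><titlePart><text>AGRICULTURE</text></titlePart></title></textBody>
    </ipcEntry>
    <ipcEntry kind="c" symbol="A01">
      <textBody><title>
        <titlePart><text>AGRICULTURE</text></titlePart>
        <titlePart><text>FORESTRY</text></titlePart>
      </title></textBody>
    </ipcEntry>
  </ipcEntry>
</IPCScheme>"""


def titles_by_entry(root):
    """Return (kind, symbol, [title parts]) for every ipcEntry, in document order."""
    result = []
    for entry in root.iter("{%s}ipcEntry" % NS["z"]):
        # The path starts at the entry's own textBody, so titles of nested
        # entries are not mixed into the parent's list.
        parts = [
            t.text
            for t in entry.findall("z:textBody/z:title/z:titlePart/z:text", NS)
        ]
        result.append((entry.get("kind"), entry.get("symbol"), parts))
    return result


for kind, symbol, parts in titles_by_entry(ET.fromstring(SAMPLE)):
    print(kind, symbol, "; ".join(parts))
```

Keeping the kind and symbol attributes next to the text makes it possible to tell the class title entry (kind="t") apart from the class entry (kind="c") that shares the symbol A01.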

Related

Why doesn't Element.attrib include namespace definitions?

I'd like to create a XML namespace mapping (e.g., to use in findall calls as in the Python documentation of ElementTree). Given the definitions seem to exist as attributes of the xbrl root element, I'd have thought I could just examine the attrib attribute of the root element within my ElementTree. However, the following code
from io import StringIO
import xml.etree.ElementTree as ET
TEST = '''<?xml version="1.0" encoding="utf-8"?>
<xbrl
xml:lang="en-US"
xmlns="http://www.xbrl.org/2003/instance"
xmlns:country="http://xbrl.sec.gov/country/2021"
xmlns:dei="http://xbrl.sec.gov/dei/2021q4"
xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
xmlns:link="http://www.xbrl.org/2003/linkbase"
xmlns:nvda="http://www.nvidia.com/20220130"
xmlns:srt="http://fasb.org/srt/2021-01-31"
xmlns:stpr="http://xbrl.sec.gov/stpr/2021"
xmlns:us-gaap="http://fasb.org/us-gaap/2021-01-31"
xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</xbrl>'''
xbrl = ET.parse(StringIO(TEST))
print(xbrl.getroot().attrib)
produces the following output:
{'{http://www.w3.org/XML/1998/namespace}lang': 'en-US'}
Why aren't any of the namespace attributes showing up in root.attrib? I'd at least expect xmlns to be in the dictionary given it has no prefix.
What have I tried?
The following code seems to work to generate the namespace mapping:
print({prefix: uri for key, (prefix, uri) in ET.iterparse(StringIO(TEST), events=['start-ns'])})
output:
{'': 'http://www.xbrl.org/2003/instance',
'country': 'http://xbrl.sec.gov/country/2021',
'dei': 'http://xbrl.sec.gov/dei/2021q4',
'iso4217': 'http://www.xbrl.org/2003/iso4217',
'link': 'http://www.xbrl.org/2003/linkbase',
'nvda': 'http://www.nvidia.com/20220130',
'srt': 'http://fasb.org/srt/2021-01-31',
'stpr': 'http://xbrl.sec.gov/stpr/2021',
'us-gaap': 'http://fasb.org/us-gaap/2021-01-31',
'xbrldi': 'http://xbrl.org/2006/xbrldi',
'xlink': 'http://www.w3.org/1999/xlink',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
But yikes is it gross to have to parse the file twice.
As for the answer to your specific question, why the attrib list doesn't contain the namespace prefix declarations: sorry for the unsatisfying answer, but it's because they're not attributes.
http://www.w3.org/XML/1998/namespace is a special schema that doesn't act like the other schemas in your userspace. In that representation, xmlns:prefix="uri" is an attribute. In all other subordinate (by parsing sequence) schemas, xmlns:prefix="uri" is a special thing, a namespace prefix declaration, which is different than an attribute on a node or element. I don't have a reference for this but it holds true perfectly in at least a half dozen (correct) implementations of XML parsers that I've used, including those from IBM, Microsoft and Oracle.
As for the ugliness of reparsing the file, I feel your pain but it's necessary. As tdelaney so well pointed out, you may not assume that all of your namespace decls or prefixes must be on your root element.
Be prepared for the possibility of the same prefix being redefined with a different namespace on every node in your document. This may hold true and the library must work correctly with it, even if it is never the case in your document (or worse, if it's never been the case so far).
Consider if perhaps you are shoehorning some text processing to parse or query XML when there may be a better solution, like XPath or XQuery. There are some good recent changes to and Python wrappers for Saxon, even though their pricing model has changed.
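On the "parse the file twice" complaint, one possible workaround is to collect the prefix declarations and build the tree in the same pass, since iterparse both reports 'start-ns' events and constructs the elements. This is only a sketch: parse_with_namespaces and the shortened SAMPLE are illustrative names, and keeping the first binding seen for each prefix is a simplifying assumption that ignores the redefinition case described above.

```python
import io
import xml.etree.ElementTree as ET

# Shortened version of the question's TEST document (fewer prefixes).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<xbrl xml:lang="en-US"
      xmlns="http://www.xbrl.org/2003/instance"
      xmlns:dei="http://xbrl.sec.gov/dei/2021q4"
      xmlns:xlink="http://www.w3.org/1999/xlink">
</xbrl>"""


def parse_with_namespaces(source):
    """Parse once, returning (root element, {prefix: uri})."""
    namespaces = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            # Assumption: the first declaration of a prefix wins.
            namespaces.setdefault(prefix, uri)
        elif root is None:
            # The first 'start' event carries the root element.
            root = payload
    return root, namespaces


root, namespaces = parse_with_namespaces(io.BytesIO(SAMPLE))
print(root.tag)
print(namespaces)
```

The returned dictionary has the same shape as the one built in the question, with the default namespace stored under the empty-string key.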

XMLTree Parsing and Printing

I'm starting to learn python3 and one of the things being discussed is XMLTree which I'm having a hard time grasping (most likely due to learning python concurrently)
What I am trying to do is output an easier to read version of my XML file.
The XML File: (there is no limit to the number of child customers - i've included two for example)
<?xml version="1.0" encoding="UTF-8"?>
<customers>
<customers>
<number area_code="800" exch_code="225" sub_code="5288" />
<address zip_code="90210" st_addr="9401 Sunset Blvd" />
<nameText>First Choice</nameText>
</customers>
<customers>
<number area_code="800" exch_code="867" sub_code="5309" />
<address zip_code="60652" st_addr="5 Lake Shore Drive" />
<nameText>Green Grass"</nameText>
</customers>
</customers>
From what I understand, the XML tree defines these lines as the following:
<root>
<child>
<element attribute...>
Where the first xml files 'customers' is the root, the second 'customers' is a child of 'customers', and 'number' (or address, or nameText) are elements.
With that being said, here is where I start to get confused.
If we take <number area_code="800" exch_code="225" sub_code="5288" />
This is an element with three attributes, area_code, exch_code, and sub_code but no text.
If we take <nameText>Green Grass"</nameText>
This is an element with no attributes, but does contain Text (Green Grass)
What I would like to see would be something like this:
First Choice
|--> Phone Number: 800-225-5288
|--> Address: 9401 Sunset Blvd, Zip Code: 90210
Green Grass
|--> Phone Number: 800-867-5309
|--> Address: 5 Lake Shore Drive, Zip Code: 60652
I don't really have much code to share, but here it is:
import xml.etree.ElementTree as ET
tree = ET.parse(my_files[0])
root = tree.getroot()
print(root.tag)
for child in root:
    print(child.tag,child.attrib)
Which provides the following output (line 1 being from print(root.tag) I believe)
customer
customer
{}
customer
{}
The questions I have after writing all this:
1 - Is my interpretation of the tree structure correct?
2 - How do you differentiate between attributes in ElementTree?
3 - How/what should I be considerate of in terms of the attributes, tags, and the rest of this file when trying to make the desired output? I might be overthinking how much more complex having XML in the mix is making this scenario so I am struggling to figure out how to do something similar to get the output I saw here: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces but my xml lacks namespaces.
I'm still trying to learn, so any additional explanation is sincerely appreciated!
Resources that I've been trying to read through to understand all this:
https://docs.python.org/3/library/xml.etree.elementtree.html# (When I'm looking through this, I'm going off the assumption that when they are calling something an attribute, its not something unique to ElementTree but the same attribute as defined in the next link)
https://www.w3schools.com/xml/xml_tree.asp (however I havent seen anything yet about multiple attributes)
https://www.edureka.co/blog/python-xml-parser-tutorial/ (This page has been a great help breaking things down step by step so I have been able to follow along)
1 - Is my interpretation of the tree structure correct?
The ElementTree parser only knows about two entities: elements and
attributes. So when you say:
From what I understand, the XML tree defines these lines as the following:
<root>
<child>
<element attribute...>
I'm a little confused. Your XML document -- or any other XML document
-- is just an element that may have zero or more attributes and may
have zero or more children...and so forth all the way down.
2 - How do you differentiate between attributes in ElementTree?
It's not clear what you mean by "differentiate" here; you can ask for
attributes by name. For example, the following code prints out the
area_code attribute of all <number> elements:
>>> from xml.etree import ElementTree as ET
>>> doc = ET.parse(open('data.xml'))
>>> doc.findall('.//number')
[<Element number at 0x7fdb8981e640>, <Element number at 0x7fdb8981e680>]
>>> for x in doc.findall('.//number'):
...     print(x.get('area_code'))
...
800
800
If you'd like, you can get all of the attributes of an element as a Python
dictionary:
>>> number = doc.find('customers/number')
>>> attrs = dict(number.items())
>>> attrs
{'area_code': '800', 'exch_code': '225', 'sub_code': '5288'}
3 - How/what should I be considerate of in terms of the attributes, tags, and the rest of this file when trying to make the desired output?
That code seems to have mostly what you're looking for. As you say,
you're not using namespaces, so you don't need to qualify element
names with namespace names...that is, you can write number instead
of {some/name/space}number.
That gives us something like:
from xml.etree import ElementTree as ET
with open('data.xml') as fd:
    doc = ET.parse(fd)
for customer in doc.findall('customers'):
    name = customer.find('nameText')
    number = customer.find('number')
    address = customer.find('address')
    print(name.text)
    print('|--> Address: {}, Zip Code: {}'.format(
        address.get('st_addr'), address.get('zip_code')))
    print('|--> Phone number: {}-{}-{}'.format(
        number.get('area_code'), number.get('exch_code'), number.get('sub_code')))
Given your sample input, this produces:
First Choice
|--> Address: 9401 Sunset Blvd, Zip Code: 90210
|--> Phone number: 800-225-5288
Green Grass"
|--> Address: 5 Lake Shore Drive, Zip Code: 60652
|--> Phone number: 800-867-5309

lxml with large file: filter out subtrees based on attribute

The high level problem I'm trying to solve is that I have a 1.5 GB SMS data dump, and I am trying to filter the file to preserve only messages to and from a single contact.
I am using lxml in Python to parse the file, but let me know if there are better options.
The structure of the XML file is like this:
SMSES (root node)
'count': 'xxxx',
(Children):
MMS
'address': 'xxxx',
'foo': 'bar',
... : ...,
(Children)
'other fields': 'that _do not_ specify address',
MMS
'address': 'xxxx',
'foo': 'bar',
... : ...,
(Children)
'other fields': 'that _do not_ specify address'
i.e., I want to traverse the children of the root node, and for every MMS where 'address' does not match a specific value, remove that MMS and all its descendents (the children tend to hold items like images, etc.).
What I've tried:
I have found question/answers like this: how to remove an element in lxml
But these threads tend to have simple examples without nested elements.
It's not clear to me how to use tree.xpath() to find elements that do not match a value
It's not clear to me whether calling remove(item) removes the item's descendants (which I want in this case).
I've tried a very naive approach, in which I obtain an iterator, and then walk through the tree, removing elements as I go:
from lxml.etree import XMLParser, parse
p = XMLParser(huge_tree=True)
tree = parse('backup.xml', parser=p)
it = tree.iter()
item = next(it) # consume root node
for item in it:
    if item.attrib['address'] != '0000':
        item.getparent().remove(item)
The problem with this script is that the iterator performs DFS, and the children of MMS elements do not have the address field. So, I am looking for:
What is the most efficient + reasonably easy way to accomplish my task?
Otherwise, how can I force tree.iter() to give me a BFS iterator over only the first-degree neighbors of the root?
Does remove(item) indeed remove all descendants, or does it attach the children to the parent?
Thank you for taking the time to read. Sorry if this is a naive question -- parsing XML files isn't really my bread and butter, and the LXML documentation was difficult for me to read as a novice.
Thanks!
There's a new release of Saxon/C out last week with a Python language binding, incorporating XSLT 3.0 streaming capability: it's very new software but you could give it a try (with a Saxon-EE evaluation license available from saxonica.com). The stylesheet is very simple:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">
  <xsl:mode streamable="yes"/>
  <xsl:template match="/">
    <SMSES>
      <xsl:copy-of select="SMS[@address='specific value']"/>
    </SMSES>
  </xsl:template>
</xsl:transform>
Unfortunately you've abstracted your XML so I can't tell whether "address" is actually an element or an attribute, and it makes a considerable difference when streaming. I've assumed here that it's an attribute, but if you provide a real XML sample then I can help you produce some real working XSLT code.
You could equally well run this directly from the command line using the established Saxon/Java product if there's no real constraint that it has to be run from Python. But either way, streaming requires the enterprise edition of Saxon.
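For comparison, if staying in plain Python is preferred, a rough sketch of the same filter with the standard library's iterparse is shown below. It assumes that address is an attribute on the direct children of the root, as the question's outline suggests; the element names smses, mms and part, the SAMPLE data and the function name keep_only_address are all made up for illustration. Removing a child from the root drops that child together with all of its descendants, and doing so as soon as the child has been read keeps the unwanted messages from piling up in memory.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of the backup file; element names are assumptions.
SAMPLE = b"""<smses count="3">
  <mms address="0000"><part data="a"/></mms>
  <mms address="1111"><part data="b"/></mms>
  <mms address="0000"><part data="c"/></mms>
</smses>"""


def keep_only_address(source, wanted_address):
    """Return the root with only the direct children whose address matches."""
    root = None
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem
            depth += 1
        else:
            depth -= 1
            # depth == 1 means a direct child of the root has just been closed.
            if depth == 1 and elem.get("address") != wanted_address:
                # Detaches the element and its whole subtree from the tree.
                root.remove(elem)
    return root


filtered = keep_only_address(io.BytesIO(SAMPLE), "0000")
print([child.get("address") for child in filtered])
```

The result can then be written out with ET.ElementTree(filtered).write(...), using whatever output path suits you. Note that this sketch does not update the count attribute on the root, and the messages that are kept still have to fit in memory.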

How to parse .xml file with multiple nested children in python?

I am using python to parse a .xml file which is quite complicated since it has a lot of nested children; accessing some of the values contained in it is quite annoying since the code starts to become pretty bad looking.
Let me first present you the .xml file:
<?xml version="1.0" encoding="utf-8"?>
<Start>
<step1 stepA="5" stepB="6" />
<step2>
<GOAL1>11111</GOAL1>
<stepB>
<stepBB>
<stepBBB stepBBB1="pinco">1</stepBBB>
</stepBB>
<stepBC>
<stepBCA>
<GOAL2>22222</GOAL2>
</stepBCA>
</stepBC>
<stepBD>-NO WOMAN NO CRY
-I SHOT THE SHERIF
-WHO LET THE DOGS OUT
</stepBD>
</stepB>
</step2>
<step3>
<GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
<stepB stepB1="12" stepB2="13" />
<stepC>XXX</stepC>
<stepC>
<stepCC>
<stepCC GOAL4="saf12">33333</stepCC>
</stepCC>
</stepC>
</GOAL3>
</step3>
<step3>
<GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
<stepB stepB1="14" stepB2="15" />
<stepC>YYY</stepC>
<stepC>
<stepCC>
<stepCC GOAL4="fwe34">44444</stepCC>
</stepCC>
</stepC>
</GOAL3>
</step3>
</Start>
My goal would be to access the values contained inside the children named "GOAL" in a nicer way than the one I wrote in my sample code below. Furthermore, I would like to find an automated way to get the values of GOALs that have the same type of tag but belong to different children having the same name:
Example: GIOVANNI and ANDREA are both under the same kind of tag (GOAL3_NAME) and belong to different children having the same name (<step3>) though.
Here is the code that I wrote:
import xml.etree.ElementTree as ET
data = ET.parse('test.xml').getroot()
GOAL1 = data.getchildren()[1].getchildren()[0].text
print(GOAL1)
GOAL2 = data.getchildren()[1].getchildren()[1].getchildren()[1].getchildren()[0].getchildren()[0].text
print(GOAL2)
GOAL3 = data.getchildren()[2].getchildren()[0].text
print(GOAL3)
GOAL4_A = data.getchildren()[2].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_A)
GOAL4_B = data.getchildren()[3].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_B)
and the output that I get is the following:
11111
22222
33333
44444
The output that I would like should be like this:
11111
22222
GIOVANNI
33333
ANDREA
44444
As you can see I am able to read GOAL1 and GOAL2 easily but I am looking for a nicer code practice to access those values since it seems to me too long and hard to read/understand.
The second thing I would like to do is getting GOAL3 and GOAL4 in a automated way so that I do not have to repeat similar lines of codes and make it more readable and understandable.
Note: as you can see I was not able to read GOAL3. If possible I would like to get both the GOAL3_NAME and GOAL3_ID
In order to make the .xml file structure more understandable I post an image of what it looks like:
The highlighted elements are what I am looking for.
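Here is a simple example for iterating from head to tail with a recursive function. It was originally written against cElementTree (15-20x faster); on Python 3.3+ plain ElementTree picks up that C implementation automatically, and cElementTree itself was removed in Python 3.9. You can then collect the needed information from that:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
def get_tail(root):
    # Depth-first walk: print each child's text, then descend into it.
    for child in root:
        print(child.text)
        get_tail(child)
get_tail(root)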
import xml.etree.ElementTree as ET
data = ET.parse('test.xml')
for d in data.iter():
    if d.tag in ["GOAL1", "GOAL2", "stepCC"]:
        print(d.text)
    elif d.tag in ["GOAL3", "GOAL4"]:
        # Print the first attribute value (GOAL3_NAME for the GOAL3 elements).
        print(next(iter(d.attrib.values())))
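As an alternative to walking every element, the path syntax that ElementTree supports can address the GOAL values directly. The sketch below runs on a trimmed copy of the question's XML embedded as a string; the function name collect_goals and the trimming are assumptions made to keep the example short.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (only the branches holding GOALs).
SAMPLE = """<Start>
  <step1 stepA="5" stepB="6" />
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBC><stepBCA><GOAL2>22222</GOAL2></stepBCA></stepBC>
    </stepB>
  </step2>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepC><stepCC><stepCC GOAL4="saf12">33333</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepC><stepCC><stepCC GOAL4="fwe34">44444</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
</Start>"""


def collect_goals(root):
    """Return the GOAL values in the order the question asks for."""
    values = [
        root.findtext("step2/GOAL1"),  # explicit path from the root
        root.findtext(".//GOAL2"),     # './/' searches at any depth
    ]
    # One loop handles every <GOAL3>, however many <step3> blocks exist.
    for goal3 in root.iter("GOAL3"):
        values.append(goal3.get("GOAL3_NAME"))
        # Only the inner <stepCC> carries a GOAL4 attribute.
        values.append(goal3.findtext(".//stepCC[@GOAL4]"))
    return values


for value in collect_goals(ET.fromstring(SAMPLE)):
    print(value)
```

goal3.get("GOAL3_ID") would return the ID in the same way if both attributes are needed.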

Extracting Specific Lines of XML with Python ElementTree

I am a bit stuck on a project I am doing which uses Python -which I am very new to. I have been told to use ElementTree and get specified data out of an incoming XML file. It sounds simple but I am not great at programming. Below is a (very!) tiny example of an incoming file along with the code I am trying to use.
I would like any tips or places to go next with this. I have tried searching and following what other people have done but I can't seem to get the same results. My aim is to get the information contained in the "Active", "Room" and "Direction" but later on I will need to get much more information.
I have tried using XPaths but it does not work too well, especially with the namespaces the xml uses and the fact that an XPath for everything I would need would become too large. I have simplified the example so I can understand the principle to do, as after this it must be extended to gain more information from an "AssetEquipment" and multiple instances of them. Then end goal would be all information from one equipment being saved to a dictionary so I can manipulate it later, with each new equipment in its own separate dictionary.
Example XML:
<AssetData>
<Equipment>
<AssetEquipment ID="3" name="PC960">
<Active>Yes</Active>
<Location>
<RoomLocation>
<Room>23</Room>
<Area>
<X-Area>-1</X-Area>
<Y-Area>2.4</Y-Area>
</Area>
</RoomLocation>
</Location>
<Direction>Positive</Direction>
<AssetSupport>12</AssetSupport>
</AssetEquipment>
</Equipment>
Example Code:
import re
import xml.etree.ElementTree as ET
tree = ET.parse(r'C:\Temp\Example.xml')
root = tree.getroot()
ns = "{http://namespace.co.uk}"
for equipment in root.findall(ns + "Equipment//"):
    tagname = re.sub(r'\{.*?\}','',equipment.tag)
    name = equipment.get('name')
    if tagname == 'AssetEquipment':
        print("\tName: " + repr(name))
        for attributes in root.findall(ns + "Equipment/" + ns + "AssetEquipment//"):
            attname = re.sub(r'\{.*?\}','',attributes.tag)
            if tagname == 'Room': #This does not work but I need it to be found while
                                  #in this instance of "AssetEquipment" so it does not
                                  #call information from another asset instead.
                room = equipment.text
                print("\t\tRoom:", repr(room))
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
for elem in tree.iter():
    if elem.tag=='{http://www.namespace.co.uk}AssetEquipment':
        # One dictionary per AssetEquipment element.
        output={}
        for elem1 in list(elem):
            if elem1.tag=='{http://www.namespace.co.uk}Active':
                output['Active']=elem1.text
            if elem1.tag=='{http://www.namespace.co.uk}Direction':
                output['Direction']=elem1.text
            if elem1.tag=='{http://www.namespace.co.uk}Location':
                for elem2 in list(elem1):
                    if elem2.tag=='{http://www.namespace.co.uk}RoomLocation':
                        for elem3 in list(elem2):
                            if elem3.tag=='{http://www.namespace.co.uk}Room':
                                output['Room']=elem3.text
        print(output)
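The nested tag comparisons can also be expressed with namespace-aware paths, which keeps each lookup scoped to its own AssetEquipment. The following is a sketch under assumptions: the namespace URI is the one used in the code above (the question's sample does not show its xmlns declaration), the second equipment entry in SAMPLE is invented to show that records stay separate, and equipment_records is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace; 'a' is an arbitrary prefix used only in the paths below.
NS = {"a": "http://www.namespace.co.uk"}

# The second AssetEquipment is invented for illustration.
SAMPLE = """<AssetData xmlns="http://www.namespace.co.uk">
  <Equipment>
    <AssetEquipment ID="3" name="PC960">
      <Active>Yes</Active>
      <Location><RoomLocation><Room>23</Room></RoomLocation></Location>
      <Direction>Positive</Direction>
    </AssetEquipment>
    <AssetEquipment ID="4" name="PC961">
      <Active>No</Active>
      <Location><RoomLocation><Room>7</Room></RoomLocation></Location>
      <Direction>Negative</Direction>
    </AssetEquipment>
  </Equipment>
</AssetData>"""


def equipment_records(root):
    """Return one dictionary per AssetEquipment element."""
    records = []
    for equipment in root.findall("a:Equipment/a:AssetEquipment", NS):
        records.append({
            "name": equipment.get("name"),
            "Active": equipment.findtext("a:Active", namespaces=NS),
            # Relative path, so the Room always belongs to this equipment.
            "Room": equipment.findtext(
                "a:Location/a:RoomLocation/a:Room", namespaces=NS),
            "Direction": equipment.findtext("a:Direction", namespaces=NS),
        })
    return records


for record in equipment_records(ET.fromstring(SAMPLE)):
    print(record)
```

Adding further fields later is then a matter of adding one more key and path to the dictionary.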
