Generate Antlr Parse Tree - python

I am trying to generate the Antlr parse tree. This is a sample grammar which I got from the web.
grammar Hel;
hi : 'hello' ID ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
I tried the following code in Jupyter notebook to generate the Parse tree.
Please let me know how to fix it.
from antlr4 import *
from HelLexer import HelLexer
from HelParser import HelParser
input = InputStream('hello jgt')
lexer = CoDalogLexer(input)
stream = CommonTokenStream(lexer)
parser = HelParser(stream)
tree = parser.hi()
How do I generate the parse tree? How Do I access the elements of the tree?
Thank you.

Accessing elements in a parse tree is simple and when you look at your generated classes it should become obvious. A parse tree has a children property or method that allows to access all parse rule contexts that have been created for the found elements during parsing. For convenience there are special accessors that ease getting to specific child contexts. In your example you will see an ID() function on your HiParseContext which gives you a context with details for the matched ID token (e.g. location in the input, token type, token text and more). Just look at the generated and the runtime source code to see what's possible.

Related

Use Python docx to create phonetic guide / 'Ruby text' in Word?

I want to add phonetic guides to words in MS Word. Inside of MS Word the original word is called 'Base text' and the phonetic guide is called 'Ruby text.'
Here's what I'm trying to create looks like in Word:
The docx documentation has page that talks about Run-level content with a reference to ruby: <xsd:element name="ruby" type="CT_Ruby"/> located here:
https://python-docx.readthedocs.io/en/latest/dev/analysis/features/text/run-content.html
I can not figure out how to access these in my code.
Here's an example of one of my attempts:
import docx
from docx import Document
document = Document()
base_text = '光栄'
ruby_text = 'こうえい'
p = document.add_paragraph(base_text)
p.add_run(ruby_text).ruby = True
document.save('ruby.docx')
But this code only returns the following:
光栄こうえい
I've tried to use ruby on the paragraph and p.text, removing the = True but I keep getting the error message 'X object has no attribute 'ruby'
Can someone please show me how to accomplish this?
Thank you!
The <xsd:element name="ruby" ... excerpt you mention is from the XML Schema type for a run. This means a child element of type CT_Ruby can be present on a run element (<w:r>) with the tag name <w:ruby>.
There is not yet any API support for this element in python-docx, so if you want to use it you'll need to manipulate the XML using low-level lxml calls. You can get access to the run element on run._r. If you search on "python-docx workaround function" and also perhaps "python-pptx workaround function" you'll find some examples of doing this to extend the functionality.

Parsing XML Data with Python

actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.
XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml
XML structure (example):
<parkingAreaStatus>
<parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="2[Zeil]"
version="1.0"/>
<parkingAreaStatusTime>2018-02-
04T01:30:00.000+01:00</parkingAreaStatusTime
</parkingAreaStatus>
<parkingAreaStatus>
<parkingAreaOccupancy>0.34625</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]"
version="1.0"/>
</parkingAreaStatus>
Using the code
import csv
import pymysql
import urllib.request
url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"
from lxml.objectify import parse
from lxml import etree
from urllib.request import urlopen
locations_root = parse(urlopen(url)).getroot()
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)
print(*locations)
I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.
Playing arround with some code I got the sentiment that only the first block is parsed, I was able to fill the list with the value of "parkingAreaOccupancy" of the "parkingAreaReference" id="2[Zeil]" block by using the code
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)
print(*locations)
-> 0.2533602
which is not the expected outcome
-> 0.2533602
-> 0.34625
MY question is:
What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?
Example output:
A = [[ID:2[Zeil],0.2533602,stable,2018-02-
04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-
04T01:30:00.000+01:00]]
or in general
A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]
After hours of research I hope for some tips from your site
Thank you in advance,
TR
You can just use etree directly and find interesting elements using XPath1 query. One important thing to note is, that your XML has default namespace declared at the root element :
xmlns="http://datex2.eu/schema/2/2_0"
By definition, element where default namespace is declared and all descendant elements without prefix are belong to this default namespace (unless another default namespace found in one of the descendant elements, which is not the case with your XML). This is why we define a prefix d, which references default namespace URI, in the following code, and we use that prefix to find every elements we need to get information from :
root = etree.parse(urlopen(url)).getroot()
ns = { 'd': 'http://datex2.eu/schema/2/2_0' }
parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)
for pa in parking_area:
area_ref = pa.find('d:parkingAreaReference', ns)
occupancy = pa.find('d:parkingAreaOccupancy', ns)
trend = pa.find('d:parkingAreaOccupancyTrend', ns)
status_time = pa.find('d:parkingAreaStatusTime', ns)
print area_ref.get('id'), occupancy.text, trend.text, status_time.text
Below is the output of the demo code above. Instead of print, you can store these information in whatever data structure you like :
2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00
5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00
1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00
3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00
4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00
1) some references on XPath :
XPath 1.0 spec: The most trustworthy reference on XPath 1.0
XPath syntax: Gentler introduction to basic XPath expressions

Processing large XML file [duplicate]

I have the following XML which I want to parse using Python's ElementTree:
<rdf:RDF xml:base="http://dbpedia.org/ontology/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://dbpedia.org/ontology/">
<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
<rdfs:label xml:lang="en">basketball league</rdfs:label>
<rdfs:comment xml:lang="en">
a group of sports teams that compete against each other
in Basketball
</rdfs:comment>
</owl:Class>
</rdf:RDF>
I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:
tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')
Because of the namespace, I am getting the following error.
SyntaxError: prefix 'owl' not found in prefix map
I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.
Kindly let me know how to change the code to find all the owl:Class tags.
You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary:
namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed
root.findall('owl:Class', namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl}Class instead. You can use the same syntax yourself too of course:
root.findall('{http://www.w3.org/2002/07/owl#}Class')
Also see the Parsing XML with Namespaces section of the ElementTree documentation.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in .nsmap attribute on elements and generally has superior namespaces support.
Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):
from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)
UPDATE:
5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes merging documents, but I think most people are having difficulty simply searching documents.
Here's another case and how I handled it:
<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>
xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this
namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k,v in root.nsmap.iteritems():
if not k:
namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.
To extract namespace's prefixes and URI from XML data you can use ElementTree.iterparse function, parsing only namespace start events (start-ns):
>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
... xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
... xmlns:owl="http://www.w3.org/2002/07/owl#"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
... xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
... xmlns="http://dbpedia.org/ontology/">
...
... <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
... <rdfs:label xml:lang="en">basketball league</rdfs:label>
... <rdfs:comment xml:lang="en">
... a group of sports teams that compete against each other
... in Basketball
... </rdfs:comment>
... </owl:Class>
...
... </rdf:RDF>'''
>>> my_namespaces = dict([
... node for _, node in ElementTree.iterparse(
... StringIO(my_schema), events=['start-ns']
... )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
'owl': 'http://www.w3.org/2002/07/owl#',
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'xsd': 'http://www.w3.org/2001/XMLSchema#'}
Then the dictionary can be passed as argument to the search functions:
root.findall('owl:Class', my_namespaces)
I've been using similar code to this and have found it's always worth reading the documentation... as usual!
findall() will only find elements which are direct children of the current tag. So, not really ALL.
It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex xml files so that that sub-sub-elements (etc.) are also included.
If you know yourself where elements are in your xml, then I suppose it'll be fine! Just thought this was worth remembering.
root.iter()
ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
"Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"
To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:
root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)
This way, you can use it later on in your code to find nodes, e.g using string interpolation (Python 3).
link = root.find(f"{ns}link")
This is basically Davide Brunato's answer however I found out that his answer had serious problems the default namespace being the empty string, at least on my python 3.6 installation. The function I distilled from his code and that worked for me is the following:
from io import StringIO
from xml.etree import ElementTree
def get_namespaces(xml_string):
namespaces = dict([
node for _, node in ElementTree.iterparse(
StringIO(xml_string), events=['start-ns']
)
])
namespaces["ns0"] = namespaces[""]
return namespaces
where ns0 is just a placeholder for the empty namespace and you can replace it by any random string you like.
If I then do:
my_namespaces = get_namespaces(my_schema)
root.findall('ns0:SomeTagWithDefaultNamespace', my_namespaces)
It also produces the correct answer for tags using the default namespace as well.
My solution is based on #Martijn Pieters' comment:
register_namespace only influences serialisation, not search.
So the trick here is to use different dictionaries for serialization and for searching.
namespaces = {
'': 'http://www.example.com/default-schema',
'spec': 'http://www.example.com/specialized-schema',
}
Now, register all namespaces for parsing and writing:
for name, value in namespaces.iteritems():
ET.register_namespace(name, value)
For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be made only after the namespaces are registered).
self.namespaces['default'] = self.namespaces['']
Now, the functions from the find() family can be used with the default prefix:
print root.find('default:myelem', namespaces)
but
tree.write(destination)
does not use any prefixes for elements in the default namespace.

Python XML parsing - equivalent of "grep -v" in bash

This is one of my first forays into Python. I'd normally stick with bash, however Minidom seems to perfectly suite my needs for XML parsing, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
Etree is giving me a ton of problems, can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree
tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
if name.text is not None and 'cluster' in name.text:
continue # skip this one

Python String parse

Im working on a data packet retrieval system which will take a packet, and process the various parts of the packet, based on a system of tags [similar to HTML tags].
[text based files only, no binary files].
Each part of the packet is contained between two identical tags, and here is a sample packet:
"<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
The entire packet is contained within the <PACKET><PACKET> tags.
All meta-data is contained within the <HEAD><HEAD> tags and the filename from which the packet is part of is contained within the, you guessed it, the <FILENAME><FILENAME> tags.
Lets say, for example, a single packet is received and stored in a temporary string variable called sTemp.
How do you efficiently retrieve, for example, only the contents of a single pair of tags, for example the contents of the <FILENAME><FILENAME> tags?
I was hoping for such functionality as saying getTagFILENAME( packetX ), which would return the textual string contents of the <FILENAME><FILENAME> tags of the packet.
Is this possible using Python?
Any suggestions or comments appreciated.
If the packet format effectively uses XML-looking syntax (i.e., if the "closing tags" actually include a slash), the xml.etree.ElementTree could be used.
This libray is part of Python Standard Library, starting in Py2.5. I find it a very convenient one to deal with this kind of data. It provides many ways to read and to modify this kind of tree structure. Thanks to the generic nature of XML languages and to the XML awareness built-in the ElementTree library, the packet syntax could evolve easily for example to support repeating elements, element attributes.
Example:
>>> import xml.etree.ElementTree
>>> myPacket = '<PACKET><HEAD><ID>123</ID><SEQ>1</SEQ><FILENAME>Test99.txt</FILE
NAME></HEAD><DATA>spam and cheese</DATA></PACKET>'
>>> xt = xml.etree.ElementTree.fromstring(myPacket)
>>> wrk_ele = xt.find('HEAD/FILENAME')
>>> wrk_ele.text
'Test99.txt'
>>>
Something like this?
import re
def getPacketContent ( code, packetName ):
match = re.search( '<' + packetName + '>(.*?)<' + packetName + '>', code )
return match.group( 1 ) if match else ''
# usage
code = "<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
print( getPacketContent( code, 'HEAD' ) )
print( getPacketContent( code, 'SEQ' ) )
As mjv points out, there's not the least sense in inventing an XML-like format if you can just use XML.
But: If you're going to use XML for your packet format, you need to really use XML for it. You should use an XML library to create your packets, not just to parse them. Otherwise you will come to grief the first time one of your field values contains an XML markup character.
You can, of course, write your own code to do the necessary escaping, filter out illegal characters, guarantee well-formedness, etc. For a format this simple, that may be all you need to do. But going down that path is a way to learn things about XML that you perhaps would rather not have to learn.
If using an XML library to create your packets is a problem, you're probably better off defining a custom format (and I'd define one that didn't look anything like XML, to keep people from getting ideas) and building a parser for it using pyparsing.

Categories

Resources