Use python-docx to create phonetic guide / 'Ruby text' in Word?

I want to add phonetic guides to words in MS Word. Inside of MS Word the original word is called 'Base text' and the phonetic guide is called 'Ruby text.'
In Word, the result looks like small phonetic characters (the Ruby text) displayed directly above the base text.
The python-docx documentation has a page that talks about Run-level content, with a reference to ruby, <xsd:element name="ruby" type="CT_Ruby"/>, located here:
https://python-docx.readthedocs.io/en/latest/dev/analysis/features/text/run-content.html
I cannot figure out how to access this element in my code.
Here's an example of one of my attempts:
import docx
from docx import Document
document = Document()
base_text = '光栄'
ruby_text = 'こうえい'
p = document.add_paragraph(base_text)
p.add_run(ruby_text).ruby = True
document.save('ruby.docx')
But this code only returns the following:
光栄こうえい
I've tried to use ruby on the paragraph and on p.text, and tried removing the = True, but I keep getting the error message 'X object has no attribute 'ruby''.
Can someone please show me how to accomplish this?
Thank you!

The <xsd:element name="ruby" ... excerpt you mention is from the XML Schema type for a run. This means a child element of type CT_Ruby can be present on a run element (<w:r>) with the tag name <w:ruby>.
There is not yet any API support for this element in python-docx, so if you want to use it you'll need to manipulate the XML using low-level lxml calls. You can get access to the run element on run._r. If you search on "python-docx workaround function" and also perhaps "python-pptx workaround function" you'll find some examples of doing this to extend the functionality.
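To illustrate the kind of workaround function meant here, a minimal sketch (this is not a python-docx API; the <w:ruby>, <w:rubyPr>, <w:rt> and <w:rubyBase> element names come from the WordprocessingML schema, and the size values inside <w:rubyPr> are illustrative guesses you would tune to your fonts):
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def add_ruby(paragraph, base_text, ruby_text):
    """Append a run to `paragraph` whose <w:r> carries a <w:ruby> child."""
    run = paragraph.add_run()                 # empty run to host the ruby element
    ruby = OxmlElement('w:ruby')

    ruby_pr = OxmlElement('w:rubyPr')         # ruby properties expected by Word
    for tag, val in (('w:rubyAlign', 'center'),
                     ('w:hps', '12'),         # guide text size, half-points (illustrative)
                     ('w:hpsRaise', '22'),    # how far the guide is raised (illustrative)
                     ('w:hpsBaseText', '24'), # base text size, half-points (illustrative)
                     ('w:lid', 'ja-JP')):
        element = OxmlElement(tag)
        element.set(qn('w:val'), val)
        ruby_pr.append(element)
    ruby.append(ruby_pr)

    rt = OxmlElement('w:rt')                  # the phonetic guide ("Ruby text")
    rt_run = OxmlElement('w:r')
    rt_text = OxmlElement('w:t')
    rt_text.text = ruby_text
    rt_run.append(rt_text)
    rt.append(rt_run)
    ruby.append(rt)

    ruby_base = OxmlElement('w:rubyBase')     # the original word ("Base text")
    base_run = OxmlElement('w:r')
    base_text_el = OxmlElement('w:t')
    base_text_el.text = base_text
    base_run.append(base_text_el)
    ruby_base.append(base_run)
    ruby.append(ruby_base)

    run._r.append(ruby)                       # low-level lxml append onto the run's <w:r>

document = Document()
p = document.add_paragraph()
add_ruby(p, '光栄', 'こうえい')
document.save('ruby.docx')
Opening the resulting ruby.docx in Word should display こうえい as a phonetic guide above 光栄.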

Related

Python's ElementTree, how to create links in a paragraph

I have a website I'm building that runs on Python 2.7 and uses ElementTree to build the HTML on the fly. I have no problem creating the elements and appending them to the tree. Where I'm stumped is inserting links in the middle of a large paragraph. This is easy when it's done as plain text, but not when building it via XML. Here's what I mean:
Sample text:
lawLine = "..., a vessel as defined in Section 21 of the Harbors and Navigation Code which is inhabited and designed for habitation, an inhabited floating home as defined in subdivision (d) of Section 18075.55 of the Health and Safety Code, ..."
To add that text to the HTML as H4-style text, I typically use:
h4 = ET.Element('h4')
htmlTree.append(h4)
h4.text = lawLine
I need to add links at the word "Section" and the numbers associated with it, but I can't simply create a new element "a" in the middle of a paragraph and add it to the HTML tree, so I'm trying to build that piece as text, then do ET.fromstring and append it to the tree:
thisLawType = 'PC'
matches = re.findall(r'Section [0-9.]*', lawLine)
if matches:
    lawLine = """<h4>{0}</h4>""".format(lawLine)
    for thisMatch in matches:
        thisMatchLinked = """<a href="http://...?lawCode={0}&section={1}">{2}</a>""".format(thisLawType, thisMatch.replace('Section ', ''), thisMatch)
        lawLine = lawLine.replace(thisMatch, thisMatchLinked)
    htmlBody.append(ET.fromstring(lawLine))
I am getting "xml.etree.ElementTree.ParseError: not well-formed" errors when I do ET.fromstring. Is there a better way to do this in ElementTree? I'm sure there are better extensions out there, but my work environment is limited to Python 2.7 and the standard library. Any help would be appreciated. Thanks!
Evan
The XML you are generating is indeed not well formed, because of the presence of & in thisMatchLinked. It's one of the special characters which need to be escaped (see an interesting explanation here).
So try replacing & with &amp; and see if it works.
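For example, a small sketch of that fix (the URL and the two variables here are stand-ins for the ones built inside the question's loop; xml.sax.saxutils.escape turns & into &amp;, so the generated string parses cleanly):
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Stand-ins for the values built inside the question's loop
thisLawType = 'PC'
thisMatch = 'Section 18075.55'

url = "http://example.com/codes?lawCode={0}&section={1}".format(
    thisLawType, thisMatch.replace('Section ', ''))
thisMatchLinked = '<a href="{0}">{1}</a>'.format(escape(url), thisMatch)

print(thisMatchLinked)                 # the & in the href is now &amp;
link = ET.fromstring(thisMatchLinked)  # parses without the "not well-formed" error
Alternatively, just write &amp; directly in the href template string.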

Generate ANTLR Parse Tree

I am trying to generate an ANTLR parse tree. This is a sample grammar I got from the web.
grammar Hel;
hi : 'hello' ID ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
I tried the following code in a Jupyter notebook to generate the parse tree. Please let me know how to fix it.
from antlr4 import *
from HelLexer import HelLexer
from HelParser import HelParser

input_stream = InputStream('hello jgt')
lexer = HelLexer(input_stream)
stream = CommonTokenStream(lexer)
parser = HelParser(stream)
tree = parser.hi()
How do I generate the parse tree? How do I access the elements of the tree?
Thank you.
Accessing elements in a parse tree is simple, and when you look at your generated classes it should become obvious. A parse tree has a children property or method that lets you access all the parse rule contexts created for the elements found during parsing. For convenience there are also dedicated accessors that make it easy to get at specific child contexts. In your example you will see an ID() function on your HiContext which gives you a context with details for the matched ID token (e.g. location in the input, token type, token text and more). Just look at the generated code and the runtime source code to see what's possible.
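As a rough, self-contained sketch of both steps (assuming HelLexer.py and HelParser.py were generated from the grammar above with the Python target, e.g. antlr4 -Dlanguage=Python3 Hel.g4; the ID() accessor and the children list live on the generated HiContext):
from antlr4 import InputStream, CommonTokenStream
from HelLexer import HelLexer
from HelParser import HelParser

input_stream = InputStream('hello jgt')
lexer = HelLexer(input_stream)
tokens = CommonTokenStream(lexer)
parser = HelParser(tokens)
tree = parser.hi()                          # root context produced by the `hi` rule

print(tree.toStringTree(recog=parser))      # LISP-style text view of the parse tree
print(tree.ID().getText())                  # text of the token matched by ID, i.e. 'jgt'
for child in tree.children:                 # walk every child node of the rule context
    print(type(child).__name__, child.getText())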

Python: Copy content from one Word document to another Word document while keeping the format?

As the title says, I would like to know if there is any module that will allow me to copy content from one Microsoft Word document to another via Python while keeping the format.
I want to read table data and transfer it to another table in another document.
Both doc A and B exist. I just want to be able to walk through the cells in both docs (not necessarily at the same time) and copy content without having to worry about if the text is formatted (font, italic, bold) or contains bullets.
I'm asking for python since it's my favorite language...
Following Kasra's advice to use python-docx, here is some rough example code.
Query document A for the table:
from docx import Document
import copy

source = Document('xxxzzz.docx')
table = source.tables[0]                   # first table in doc A
Writing the table into document B:
output = Document('yyywwww.docx')
tbl_element = copy.deepcopy(table._tbl)    # deep-copy the table's underlying <w:tbl> XML, formatting and all
anchor = output.add_paragraph()            # empty paragraph to position the table after
anchor._p.addnext(tbl_element)
output.save('new-file-name.docx')

Using custom POS tags for NLTK chunking?

Is it possible to use non-standard part of speech tags when making a grammar for chunking in the NLTK? For example, I have the following sentence to parse:
complication/patf associated/qlco with/prep breast/noun surgery/diap
independent/adj of/prep the/det use/inpr of/prep surgical/diap device/medd ./pd
Locating the phrases I need from the text is greatly assisted by specialized tags such as "medd" or "diap". I thought that because you can use RegEx for parsing, it would be independent of anything else, but when I try to run the following code, I get an error:
grammar = r'TEST: {<diap>}'
cp = nltk.RegexpParser(grammar)
cp.parse(sentence)
ValueError: Transformation generated invalid chunkstring:
<patf><qlco><prep><noun>{<diap>}<adj><prep><det><inpr><prep>{<diap>}<medd><pd>
I think this has to do with the tags themselves, because the NLTK can't generate a tree from them, but is it possible to skip that part and just get the chunked items returned? Maybe the NLTK isn't the best tool, and if so, can anyone recommend another module for chunking text?
I'm developing in Python 2.7.6 with the Anaconda distribution.
Thanks in advance!
Yes it is possible to use custom tags for NLTK chunking. I have used the same.
Refer: How to parse custom tags using nltk.Regexp.parser()
The ValueError and the error description suggest that there is a problem in the formation of your grammar, and that is what you need to check. You can update your question with the exact grammar and code you used if you would like suggestions on corrections.
import nltk
from nltk.tokenize import word_tokenize

example_sent = "The quick brown fox jumps over the lazy dog."  # any input sentence

# POS Tagging
words = word_tokenize(example_sent)
pos = nltk.pos_tag(words)
print(pos)

# Chunking
chunk = r'Chunk: {<JJ.?>+<NN.?>+}'
par = nltk.RegexpParser(chunk)
par2 = par.parse(pos)
print('Chunking - ', par2)

print('------------------------------ Parsing the filtered chunks')
# printing only the required chunks
for i in par2.subtrees():
    if i.label() == 'Chunk':
        print(i)

print('------------------------------NER')
# NER
ner = nltk.ne_chunk(pos)
print(ner)
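To chunk the question's custom-tagged text directly, a minimal sketch (assuming a recent NLTK 3.x; RegexpParser.parse() only needs a list of (word, tag) tuples, so the tag names can be anything. The tuples below come from the question's example sentence, and the TEST grammar is just an illustrative pattern over those tags):
import nltk

# The question's sentence, pre-tagged with its custom tags as (word, tag) tuples
tagged = [('complication', 'patf'), ('associated', 'qlco'), ('with', 'prep'),
          ('breast', 'noun'), ('surgery', 'diap'), ('independent', 'adj'),
          ('of', 'prep'), ('the', 'det'), ('use', 'inpr'), ('of', 'prep'),
          ('surgical', 'diap'), ('device', 'medd'), ('.', 'pd')]

grammar = r'TEST: {<diap><medd>?}'   # a <diap> tag, optionally followed by <medd>
cp = nltk.RegexpParser(grammar)
tree = cp.parse(tagged)

for subtree in tree.subtrees():      # print only the TEST chunks
    if subtree.label() == 'TEST':
        print(subtree)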

Python XML parsing - equivalent of "grep -v" in bash

This is one of my first forays into Python. I'd normally stick with bash; however, minidom seems to perfectly suit my needs for XML parsing, so I'm giving it a shot.
The first question I can't seem to figure out is: what's the equivalent of 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based on a certain string embedded within that tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
ElementTree is giving me a ton of problems; can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree

tree = ElementTree.parse(somefile)
for name in tree.findall('.//network_object//Name'):
    if name.text is not None and 'cluster' in name.text:
        continue  # skip this one
    print(name.text)  # otherwise, process the name
