I'm having trouble setting the font to Times New Roman for this document. I want all the text to be Times New Roman, size 10. I generated a font table that holds that font, and it's the only font in my document. Whenever the document is generated with the font set, it is reported as corrupted; if I don't set the font, the document comes out in Courier by default and is not corrupted.
{\fonttbl{\f1\froman\fprq0\fcharset0 Times New Roman;}
\par
Hi <#name#>
\par\par
Welcome to New York \par\par
\b New iPad\'ae App Is Available \b0 \par
These are all the exciting things you cna do during your stay. \par \par
}
You're missing a lot of formatting in your example up there. For example, save a very simple RTF file from WordPad or another application (Word puts in too much metadata) and see everything that you're missing.
First, here is the most recent RTF spec, 1.9.1. This will help you work through anything RTF related.
Second, any RTF document must start with {\rtfN, where N is the RTF version (currently 1). You're missing this in your example. This is one of the reasons why it says the file is corrupted.
Third, you define something in the font table, f1, and then don't use the definition. This is an old spec for RTF, 1.6, but look at how the font table is defined.
There are many other things, but I think that you're using RTF like you would use HTML or something with tags. I would read up on the specification to see how RTF works. Here is a very small RTF document:
{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2
{\fonttbl{\f0\fcharset0 Times New Roman;}{\f2\fcharset0 Tahoma;}}
{\colortbl\red0\green0\blue0;\red255\green255\blue255;}
\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f2\cf0 \cf0\ql
{\f2 {\ltrch This is a test of RTF.}\li0\ri0\sa0\sb0\fi0\ql\par}}}
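For the original goal (all text in Times New Roman, size 10), a minimal sketch of generating such a file from Python might look like the following. The helper names `rtf_escape` and `build_rtf` are made up for illustration. It assumes the text fits in Windows-1252 and relies on `\fs` being measured in half-points, so size 10 becomes `\fs20`.

```python
def rtf_escape(text):
    """Escape RTF special characters; emit non-ASCII as \\'hh (Windows-1252)."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Raises UnicodeEncodeError for characters outside cp1252.
            out.append("".join("\\'%02x" % b for b in ch.encode("cp1252")))
    return "".join(out)


def build_rtf(paragraphs, font="Times New Roman", size_pt=10):
    """Return a small RTF document with one font applied to every paragraph."""
    header = (
        "{\\rtf1\\ansi\\ansicpg1252\\deff0"
        "{\\fonttbl{\\f0\\froman\\fcharset0 %s;}}" % font
    )
    # \f0 selects the font-table entry; \fsN takes the size in half-points.
    body = "\\pard\\f0\\fs%d " % int(round(size_pt * 2))
    body += "\\par\n".join(rtf_escape(p) for p in paragraphs)
    return header + "\n" + body + "\\par}"


if __name__ == "__main__":
    print(build_rtf(["Hi <#name#>", "Welcome to New York"]))
```

The key differences from the snippet in the question are the `{\rtf1` header, the closed `\fonttbl` group, and the explicit `\f0\fs20` that actually selects the font.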
I have a website I'm building running off Python 2.7 and using ElementTree to build the HTML on the fly. I have no problem creating the elements and appending them to the tree. It's where I have to insert links in the middle of a large paragraph that I am stumped. This is easy when it's done in text, but this is doing it via XML. Here's what I mean:
Sample text:
lawLine = "..., a vessel as defined in Section 21 of the Harbors and Navigation Code which is inhabited and designed for habitation, an inhabited floating home as defined in subdivision (d) of Section 18075.55 of the Health and Safety Code, ..."
To add that text to the HTML as H4-style text, I typically use:
h4 = ET.Element('h4')
htmlTree.append(h4)
h4.text = lawLine
I need to add links at the word "Section" and the numbers associated with it, but I can't simply create a new element "a" in the middle of a paragraph and add it to the HTML tree, so I'm trying to build that piece as text, then do ET.fromstring and append it to the tree:
thisLawType = 'PC'
matches = re.findall(r'Section [0-9.]*', lawLine)
if matches:
    lawLine = """<h4>{0}</h4>""".format(lawLine)
    for thisMatch in matches:
        thisMatchLinked = """{2}""".format(thisLawType, thisMatch.replace('Section ',''), thisMatch)
        lawLine = lawLine.replace(thisMatch, thisMatchLinked)
    htmlBody.append(ET.fromstring(lawLine))
I am getting "xml.etree.ElementTree.ParseError: not well-formed" errors when I do ET.fromstring. Is there a better way to do this in ElementTree? I'm sure there are better extensions out there, but my work environment is limited to Python 2.7 and the standard library. Any help would be appreciated. Thanks!
Evan
The XML you are generating is indeed not well formed, because of the presence of & in thisMatchLinked. It's one of the special characters which need to be escaped (see an interesting explanation here).
So try replacing & with &amp; and see if it works.
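An alternative that avoids hand-escaping altogether is to build the links as real elements and let ElementTree do the escaping when it serializes. The sketch below uses only the standard library and should also run on Python 2.7. The function name `link_sections`, the `base_url` and the query-string layout are hypothetical (the original link markup is not shown), and the regular expression is slightly stricter than the one in the question.

```python
import re
import xml.etree.ElementTree as ET

SECTION_RE = re.compile(r"Section ([0-9]+(?:\.[0-9]+)*)")


def link_sections(law_line, law_type, base_url="https://example.invalid/law"):
    """Return an <h4> in which every 'Section N' mention is wrapped in <a>."""
    h4 = ET.Element("h4")
    last_end = 0
    previous = None  # most recent <a>; text that follows it goes in its tail
    for match in SECTION_RE.finditer(law_line):
        between = law_line[last_end:match.start()]
        if previous is None:
            h4.text = between
        else:
            previous.tail = between
        link = ET.SubElement(h4, "a")
        # The raw '&' is fine here: it is escaped on serialization.
        link.set("href", "{0}?code={1}&section={2}".format(
            base_url, law_type, match.group(1)))
        link.text = match.group(0)
        previous = link
        last_end = match.end()
    remainder = law_line[last_end:]
    if previous is None:
        h4.text = remainder
    else:
        previous.tail = remainder
    return h4
```

The resulting element can be appended directly, for example `htmlBody.append(link_sections(lawLine, 'PC'))`, with no `ET.fromstring` round trip.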
I'm building my Python package documentation as HTML and as a LaTeX PDF. The default LaTeX PDF generated (manual class) has a large amount of white space at the top of the table of contents around the text "CONTENTS". I'm not very familiar with LaTeX, so when I looked at the generated .tex file I didn't see anything that tells me how to remove the whitespace.
I've searched around and couldn't find a latex solution that worked. I also tried setting the :caption: on the toctree to an empty string, but that actually removes the entire TOC and all of my content.
Can anyone help me with this?
The default behaviour of Sphinx for the English language is to use the Bjarne option of the LaTeX package fncychap for chapter headings. But it also loads the package titlesec for title headings in general. It does not make a special chapter definition with titlesec, which simply gathers the fncychap definition and wraps it in its own hooks. To make a long story short, we find
\ttl@save@mkschap #1->\vspace *{50\p@ }{\parindent \z@ \raggedright \normalfont \interlinepenalty \@M \DOTIS {#1} \vskip 40\p@ }
in a log trace, and this is the fncychap definition of \@makeschapterhead as preserved by titlesec in its own macro \ttl@save@mkschap.
fncychap is loaded before sphinx.sty, there is no hook,
edit: in fact the 'fncychap' key whose default value is '\\usepackage[Bjarne]{fncychap}' could serve to add some code to redefine the fncychap setting for un-numbered chapter titles. It is not that different from the approach with 'preamble' key below, except that one would not have needed knowing about titlesec intervention in all this.
but since Sphinx 1.5 you can use your own Jinja template for the LaTeX content. From the look of your contents, which is small, I think you have an older version of Sphinx, so I will go for the LaTeX hacking variant, something like this:
latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',
    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',
    # Additional stuff for the LaTeX preamble.
    #
    'preamble': r"""
\makeatletter
\def\ttl@save@mkschap #1{\vspace *{10\p@ }{\parindent \z@ \raggedright
    \color{blue}%
    \normalfont \interlinepenalty \@M \DOTIS {#1} \vskip 10\p@ }}
\makeatother
""",
    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}
I have added a \color{blue} in there for demonstration purposes only, and modified the \vspace and \vskip commands which is what you need.
The image shows, however, that there is some extra source of vertical space between "Contents" and the TOC entries (it remains even with \vskip 0\p@, but one can do \vskip -40\p@ ...). I think you are after the top space above "Contents", though, and using only \vspace*{10pt} already reduces it a lot (not visible in the screenshot below).
I created a word document which contains the text
Hello. You owe me ${debt}. Please pay me back soon.
in Times New Roman size 12. The file name is debtTemplate.docx. I would like to replace {debt} with an actual number (1.20) using python-docx. I tried the following code:
from docx import Document
document = Document("debtTemplate.docx")
paragraphs = document.paragraphs
debt = "1.20"
paragraph = paragraphs[0]
text = paragraph.text
newText = text.format(debt=debt)
paragraph.clear()
paragraph.add_run(newText)
document.save("debt.docx")
This results in a new document with the desired text, but in Calibri font size 11. I would like the font to be like the original: Times New Roman size 12.
I know that you can pass a style argument to paragraph.add_run(), so I tried that, but nothing worked. E.g. paragraph.add_run(newText, style="Strong") didn't change anything.
Does anyone know what I can do?
EDIT: here's a modified version of my code that I had hoped would work but didn't.
from docx import Document
document = Document("debtTemplate.docx")
document.save("debt.docx")
paragraphs = document.paragraphs
debt = "1.20"
paragraph = paragraphs[0]
style = paragraph.style
text = paragraph.text
newText = text.format(debt=debt)
paragraph.clear()
paragraph.add_run(newText,style)
document.save("debt.docx")
This page in the docs should help you understand why the style is not having an effect. It's a pretty easy fix: http://python-docx.readthedocs.org/en/latest/user/styles.html
I like a couple other things about what you've found though:
Using the str.format() method to do placeholder replacement is a nice, easy way to do lightweight text replacement. I'll have to add that to the documentation as an approach to simple custom document generation.
In the XML for a paragraph, there is an optional element called <w:defRPr> which Word uses to indicate the default formatting for any new text added to the paragraph, like if you started typing after placing your insertion point at the end of the paragraph. Right now, python-docx ignores that element. That's why you're getting the default Calibri 11 instead of the Times New Roman 12 you started with. But a useful feature might be to use that element, if present, to assign run properties to any new runs added at the end of the paragraph. If you want to add that as a feature request to the GitHub tracker, we'll take a look at getting it implemented.
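Until something like that exists, one possible workaround is to read the font from the paragraph's first run before clearing it and apply it to the replacement run. This is only a sketch: `replace_text_keep_font` is a made-up helper name, and it assumes the font was set directly on the run. If the font comes from a style, `font.name` and `font.size` are `None` and nothing is copied.

```python
def replace_text_keep_font(paragraph, **values):
    """Fill str.format placeholders in a paragraph, reusing the first run's font.

    `paragraph` is expected to behave like a python-docx Paragraph:
    it needs .text, .runs, .clear() and .add_run().
    """
    # paragraph.text joins all runs, so a placeholder that Word split
    # across several runs is still seen as one piece of text.
    new_text = paragraph.text.format(**values)

    # Remember the character formatting of the first run, if there is one.
    runs = paragraph.runs
    font_name = runs[0].font.name if runs else None
    font_size = runs[0].font.size if runs else None

    paragraph.clear()  # drops the runs, keeps paragraph-level formatting
    run = paragraph.add_run(new_text)
    if font_name is not None:
        run.font.name = font_name
    if font_size is not None:
        run.font.size = font_size
    return run
```

With python-docx installed, usage would be along the lines of `replace_text_keep_font(Document("debtTemplate.docx").paragraphs[0], debt="1.20")` followed by `document.save(...)` on the same document object.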
As the title says, I would like to know if there is any module that will allow me to copy content from one Microsoft Word document to another via Python while keeping the formatting.
I want to read table data and transfer it to another table in another document.
Both doc A and B exist. I just want to be able to walk through the cells in both docs (not necessarily at the same time) and copy content without having to worry about if the text is formatted (font, italic, bold) or contains bullets.
I'm asking for python since it's my favorite language...
Following Kasra's advice to use python-docx:
Rough example code.
Query document for table:
from docx import *
document = opendocx('xxxzzz.docx')
table = document.xpath('/w:document/w:body/w:tbl', namespaces=nsprefixes)[0]
Writing to another document:
output = opendocx('yyywwww.docx')
body = output.xpath('/w:document/w:body', namespaces=nsprefixes)[0]
body.append(table)
output.save('new-file-name.docx')
I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target) - 1)
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row in Application.ActiveDocument.Tables(2).Rows
Descr = row.Cells(2).Range.Text
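If the extraction ends up being done from Python instead (the win32com option in the question), the same trailing marker would need stripping there too. The sketch below assumes, as far as I know, that a cell's Range.Text ends with a carriage return followed by Chr(7). The names `clean_cell` and `rows_to_csv` are made up for illustration. It also uses the csv module, so commas inside a description don't break the output.

```python
import csv
import io

# Assumed end-of-cell characters: paragraph mark plus the cell marker.
CELL_END = "\r\x07"


def clean_cell(text):
    """Strip Word's trailing end-of-cell characters from a cell's text."""
    return text.rstrip(CELL_END)


def rows_to_csv(rows):
    """Return CSV text for (descr, assign, target) rows with a non-empty target."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        descr, assign, target = (clean_cell(cell) for cell in row)
        if target:  # skip rows whose Target cell is empty
            writer.writerow([descr, assign, target])
    return buffer.getvalue()
```

Note that `rstrip` also removes any genuine trailing paragraph marks in the cell, which is usually what is wanted for an export like this.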
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
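from win32com.client import Dispatch
word = Dispatch('Word.Application')
# Documents are opened through the Documents collection, not the application object
doc = word.Documents.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=2)  # 2 should be wdFormatText (plain text)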
This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right fileformat) – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.
You could use OpenOffice. It can open word files, and also can run python macros.
I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.
How about saving the file as XML, then using Python or something else to pull the data out of Word and into the database?
It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.