When I create a new document with python-docx and add paragraphs, the text starts on the very first line. But if I start from an existing empty document (I need it because of its user-defined styles) and add paragraphs, the document always starts with an empty line. Is there any workaround?
You can call document._body.clear_content() before adding the first paragraph.
document = Document('my-document.docx')
document._body.clear_content()
# start adding new paragraphs and whatever ...
That will leave the document with no paragraphs, so when you add new ones they start at the beginning.
It does, however, leave the document in a technically invalid state. So if you didn't add any new paragraphs and then tried to open it with Word, you might get a repair error on loading.
But if the next thing you're doing is adding paragraphs of your own, this should work just fine.
Also, note that this is technically an "internal" method and is not part of the documented API. So there's no guarantee this method's name won't change in a future release. But frankly I can't see any reason to change or remove it, so I expect it's safe enough :)
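For reference, a minimal sketch of the whole flow, assuming a template named my-document.docx that defines a custom style; 'My Style' and the output file name are placeholders, not part of the original answer:
from docx import Document

document = Document('my-document.docx')   # template carrying the user-defined styles
document._body.clear_content()            # remove the existing (empty) paragraph

# 'My Style' stands in for one of the template's own styles
document.add_paragraph('First paragraph of real content', style='My Style')
document.add_paragraph('Second paragraph')

document.save('output.docx')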
So I am trying to automate some work tasks to help me create reports for various assignments. I have a template report in which I simply want to replace some placeholder text. My code works for the most part, but the method I am using produces some strange results. Here are the relevant sections of my current code:
def create_new_report(self):
    report = Document('Template.docx')
    # Change Headers First
    for sec in report.sections:
        head = sec.header
        for para in head.paragraphs:
            for run in para.runs:
                self.replace_run_text(run)
    # Then the Tables
    for table in report.tables:
        for row in table.rows:
            for cell in row.cells:
                for para in cell.paragraphs:
                    for run in para.runs:
                        self.replace_run_text(run)
    # And finally the Body
    for para in report.paragraphs:
        for run in para.runs:
            self.replace_run_text(run)

def replace_run_text(self, run):
    # Performs str.replace on the run's text for each key and writes the result back to the run
    text = run.text
    for arg in self.args:  # a list of (key, replacement) pairs
        text = text.replace(arg[0], arg[1])
    run.text = text
For the most part this works well. However, when running it, I noticed some odd side effects. For the header, I had to hard-code which specific paragraphs to work with, because running it over everything was deleting my company logo image.
In the body, this code will remove page breaks or form text boxes. I break everything up into individual runs in order to retain all styling, and that at least seems to work well.
For now I have hard-coded around the idiosyncrasies that come up, but I want to be able to make changes to my template document and have it just work, rather than having to change those hard-coded sections as well. Does anyone have any advice as to why this behavior is occurring?
It really doesn't make sense to me. Why are the page break and the logo being removed when they don't appear to contain any runs? At the very least, I can guarantee they do not contain any of the text keys being replaced, so they shouldn't be touched at all. But they are. I would appreciate any insight anyone has!
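One idea I plan to try (just a sketch, not a verified fix) is to only assign run.text when a key was actually replaced, in case the unconditional assignment is what disturbs runs that hold images or breaks:
def replace_run_text(self, run):
    text = run.text
    new_text = text
    for key, replacement in self.args:
        new_text = new_text.replace(key, replacement)
    if new_text != text:   # only touch runs whose text actually changed
        run.text = new_text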
I'm using a config file to supply my Python script with a few key-value pairs, for use in authenticating the user against a website.
I have three variables: the URL, the user name, and the API token.
I've created a config file with each key on a different line, so:
url:<url string>
auth_user:<user name>
auth_token:<API token>
I want to extract the text after each keyword into a variable, also stripping any "\n" at the end of the line. Currently I'm doing this, and it works, but it seems clumsy:
from re import match
from sys import argv

with open(argv[1], mode='r') as config_file:
    lines = config_file.readlines()
    for line in lines:
        url_match = match('jira_url:', line)
        if url_match:
            jira_url = line[9:].split("\n")[0]
        user_match = match('auth_user:', line)
        if user_match:
            auth_user = line[10:].split("\n")[0]
        token_match = match('auth_token', line)
        if token_match:
            auth_token = line[11:].split("\n")[0]
Can anybody suggest a more elegant solution? Specifically it's the ... = line[10:].split("\n")[0] lines that seem clunky to me.
I'm also slightly confused why I can't reuse my match object within the for loop, and have to create new match objects for each config item.
You could use a .yml file and read the values with the yaml.load() function:
import yaml

with open('settings.yml') as file:
    settings = yaml.load(file, Loader=yaml.FullLoader)
Now you can access elements like settings["url"] and so on.
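This assumes the config file is rewritten in YAML syntax; a hypothetical settings.yml matching the keys above might look like:
url: <url string>
auth_user: <user name>
auth_token: <API token>
yaml.load() then gives you a plain dict, so settings["auth_user"] and settings["auth_token"] work the same way.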
If the format is always <tag>:<value> you can easily parse it by splitting the line at the colon and filling up a custom dictionary:
config_file = open(filename, "r")
lines = config_file.readlines()
config_file.close()

settings = dict()
for l in lines:
    elements = l[:-1].split(':')   # l[:-1] drops the trailing newline
    settings[elements[0]] = ':'.join(elements[1:])
So you get a dictionary that has the tags as keys and the values as values. You can then just refer to these dictionary entries in your program (e.g. if you need the auth token, just call settings["auth_token"]).
If you can add one line to the config file, configparser is a good choice:
https://docs.python.org/3/library/configparser.html
[1] Config file: 1.cfg
# a configparser config file needs a section name
[DEFAULT]
url:<url string>
auth_user:<user name>
auth_token:<API token>
[2] Python script
import configparser
config = configparser.ConfigParser()
config.read('1.cfg')
print(config.get('DEFAULT','url'))
print(config.get('DEFAULT','auth_user'))
print(config.get('DEFAULT','auth_token'))
[3] Output
<url string>
<user name>
<API token>
configparser's get() methods also accept a fallback value, which is useful when you can't guarantee that the config file is always complete.
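For example (a small sketch; 'missing_key' is a made-up option just to show the fallback behavior):
import configparser

config = configparser.ConfigParser()
config.read('1.cfg')

url = config.get('DEFAULT', 'url')                                # value from the file
token = config.get('DEFAULT', 'missing_key', fallback='default')  # option not in 1.cfg, so the fallback is returned
print(url, token)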
You have a couple of great answers already, but I wanted to step back and provide some guidance on how you might approach these problems in the future. Getting quick answers sometimes prevents you from understanding how those people knew about the answers in the first place.
When you zoom out, the first thing that strikes me is that your task is to provide config, using a file, to your program. Software has the remarkable property of solve-once, use-anywhere. Config files have been a problem worth solving for at least 40 years, so you can bet your bottom dollar you don't need to solve this yourself. And already-solved means someone has already figured out all the little off-by-one and edge-case dramas like stripping line endings and dealing with unexpected input. The challenge of course, is knowing what solution already exists. If you haven't spent 40 years peeling back the covers of computers to see how they tick, it's difficult to "just know". So you might have a poke around on Google for "config file format" or something.
That would lead you to one of the most prevalent config file systems on the planet - the INI file. Just as useful now as it was 30 years ago, and as a bonus, looks not too dissimilar to your example config file. Then you might search for "read INI file in Python" or something, and come across configparser and you're basically done.
Or you might see that sometime in the last 30 years, YAML became the more trendy option, and wouldn't you know it, PyYAML will do most of the work for you.
But none of this gets you any better at using Python to extract from text files in general. So zooming in a bit, you want to know how to extract parts of lines in a text file. Again, this problem is an age-old problem, and if you were to learn about this problem (rather than just be handed the solution), you would learn that this is called parsing and often involves tokenisation. If you do some research on, say "parsing a text file in python" for example, you would learn about the general techniques that work regardless of the language, such as looping over lines and splitting each one in turn.
Zooming in one more step closer, you're looking to strip the new line off the end of the string so it doesn't get included in your value. Once again, this ain't a new problem, and with the right keywords you could dig up the well-trodden solutions. This is often called "chomping" or "stripping", and with some careful search terms, you'd find rstrip() and friends, and not have to do awkward things like splitting on the '\n' character.
Your final question is about re-using the match object. This is much harder to research. But again, the "solution" won't necessarily show you where you went wrong. What you need to keep in mind is that the statements in the for loop are sequential. To think them through you should literally execute them in your mind, one after another, and imagine what's happening. Each time you call match, it either returns None or a Match object. You never use the object, except to check for truthiness in the if statement. And next time you call match, you do so with different arguments so you get a new Match object (or None). Therefore, you don't need to keep the object around at all. You can simply do:
if match('jira_url:', line):
    jira_url = line[9:].split("\n")[0]
if match('auth_user:', line):
    auth_user = line[10:].split("\n")[0]
and so on. Not only that, if the first if triggered then you don't need to bother calling match again - it will certainly not trigger any of the other matches for the same line. So you could do:
if match('jira_url:', line):
    jira_url = line[9:].rstrip()
elif match('auth_user:', line):
    auth_user = line[10:].rstrip()
and so on.
But then you can start to think - why bother doing all these matches on the colon, only to then manually split the string at the colon afterwards? You could just do:
tokens = line.rstrip().split(':', 1)   # split only on the first colon, since the URL value contains colons too
if tokens[0] == 'jira_url':
    jira_url = tokens[1]
elif tokens[0] == 'auth_user':
    auth_user = tokens[1]
If you keep making these improvements (and there are lots more to make!), eventually you'll end up re-writing configparser, but at least you'll have learned why it's often a good idea to use an existing library where practical!
I'm new to Python and Scrapy and thought I'd try out a simple review site to scrape. While most of the site structure is straightforward, I'm having trouble extracting the content of the reviews. This portion is visually laid out in sets of 3 (the text to the right of the 良 (good), 悪 (bad), and 感 (impressions) fields), but I'm having trouble pulling this content and associating it with a reviewer or section of a review due to the use of generic divs, \n, and other formatting.
Any help would be appreciated.
Here's the site, and the code I've tried for grabbing the reviews, with some results.
http://www.psmk2.net/ps2/soft_06/rpg/p3_log1.html
(1):
response.xpath('//tr//td[@valign="top"]//text()').getall()
This returns the entire set of reviews, but it contains newline markup and, more of a problem, it renders each line as a separate entry. Due to this, I can't figure out where the good, bad, and impression portions end, nor can I easily parse each separate review as entry length varies.
['\n弱点をついた時のメリット、つかれたときのデメリットがはっきりしてて良い', '\nコミュをあげるのが楽しい',
'\n仲間が多くて誰を連れてくか迷う', '\n難易度はやさしめなので遊びやすい', '\nタルタロスしかダンジョンが無くて飽きる。'........and so forth
(2) As an alternative, I tried:
response.xpath('//tr//td[@valign="top"]')[0].get()
Which actually comes close to what I'd like, save for the markup. Here it seems that it returns the entire field of a review section. Every third element should be the "good" points of each separate review (I've replaced the <> with () to show the raw return).
(td valign="top")\n精一杯考えました(br)\n(br)\n戦闘が面白いですね\n主人公だけですが・・・・(br)\n従来のプレスターンバトルの進化なので(br)\n(br)\n以上です(/td)
(3) Figuring I might be able to get just the text, I then tried:
response.xpath('//tr//td[@valign="top"]//text()')[0].get()
But that only provides each line at a time, with the \n at the front. As with (1), a line by line rendering makes it difficult to attribute reviews to reviewers and the appropriate section in their review.
From these (2) seems the closest to what I want, and I was hoping I could get some direction in how to grab each section for each review without the markup. I was thinking that since these sections come in sets of 3, if these could be put in a list that would make pulling them easier in the future (i.e. all "good" reviews follow 0, 0+3; all "bad" ones 1, 1+3 ... etc.)...but first I need to actually get the elements.
I've thought about, and tried, iterating over each line with an "if" conditional (something like:)
i = 0
if i <= len(response.xpath('//tr//td[@valign="top"]//text()').getall()):
    yield {response.xpath('//tr//td[@valign="top"]')[i].get()}
    i + 1
to pull these out, but I'm a bit lost on how to implement something like this. Not sure where it should go. I've briefly looked at Item Loader, but as I'm new to this, I'm still trying to figure it out.
Here's the block where the review code is.
def parse(self, response):
    for table in response.xpath('body'):
        yield {
            # code for other elements in the review
            'date': response.xpath('//td//div[@align="left"]//text()').getall(),
            'name': response.xpath('//td//div[@align="right"]//text()').getall(),
            # this includes the above elements, and is regular enough that I can systematically extract what I want
            'categories': response.xpath('//tr//td[@class="koumoku"]//text()').getall(),
            'scores': response.xpath('//tr//td[@class="tokuten_k"]//text()').getall(),
            'play_time': response.xpath('//td[@align="right"]//span[@id="setumei"]//text()').getall(),
            # reviews code here
        }
Pretty simple task using part of the text as an anchor (I used string() to get the text content of a whole td):
for review_node in response.xpath('//table[@width="645"]'):
    good = review_node.xpath('string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])').get()
    bad = review_node.xpath('string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])').get()
    ...............
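A hedged extension of that idea, assuming each table matched by //table[@width="645"] is one review and that the 感 (impressions) field follows the same td pattern (both assumptions should be checked against the page):
def parse(self, response):
    for review_node in response.xpath('//table[@width="645"]'):
        yield {
            'good': review_node.xpath('string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])').get(),
            'bad': review_node.xpath('string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])').get(),
            'impressions': review_node.xpath('string(.//td[b[starts-with(., "感")]]/following-sibling::td[1])').get(),
        }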
I have old versions of a few Word documents (documents with the '.doc' extension), all of which have a lot of tracked changes in them. Most of the changes have comments associated with them.
I need to figure out a way to use python to reject all the changes that have been made in the documents, while retaining the comments.
I tried this with the newer version of Word documents ('.docx' files) and faced no issues. All the changes were rejected and the document still had all its comments. But when I tried it with the older version ('.doc' files), all my comments got deleted.
I was using the following function at first, with a few different versions of the Word file.
import win32com.client as win32

word = win32.gencache.EnsureDispatch('Word.Application')

def reject_changes(path):
    doc = word.Documents.Open(path)
    doc.Activate()
    word.ActiveDocument.TrackRevisions = False
    word.ActiveDocument.Revisions.RejectAll()
    word.ActiveDocument.Save()
    doc.Close(False)
1. I tried to use the above function with the original Word document.
2. I changed the extension of the file to '.docx' and tried the above function.
3. I made a copy of the document and saved it in '.docx' format.
In all these cases the comments were deleted.
I then tried the following code:
def reject_changes(path):
    doc = word.Documents.Open(path)
    doc.Activate()
    word.ActiveDocument.TrackRevisions = False
    nextRev = word.Selection.NextRevision()
    while nextRev:
        nextRev.Reject()
        nextRev = word.Selection.NextRevision()
    word.ActiveDocument.Save()
    doc.Close(False)
For some reason this code was almost working. But on checking a few of the documents again, I found that while most of the comments remained, a couple of them were still deleted.
I think that since the comments are being deleted, they are probably part of the Revisions collection. In that case, is it possible to check whether a revision is a comment or not? If not, can someone please suggest a way to ensure that no comments are deleted when the changes are rejected.
Edit:
So, I found out that the comments that were getting deleted had been added to the document while the 'Track Changes' option was active. I guess that made the comments part of the revisions. So my first function works pretty well as long as the comments were made while the 'Track Changes' option was not active.
But then, I have more than twenty Word documents (a mix of .doc and .docx files), each of them at least fifteen pages long with over fifty comments.
I am using win32com.client. I am not too familiar with other packages that work with MS Word. Any help would be appreciated.
Thanks!
Okay, so I was able to get a workaround for this by:
1. Creating a Selection object and selecting the scope of the text marked by the comment.
2. Saving the range of the commented text into a Range object.
3. Rejecting the tracked changes for the selected text.
4. Getting the new text based on the range object created in step 2.
This method takes a lot of time, though, and the easiest way to keep the commented text is to ensure that comments are made when Word is not tracking changes.
This is the code I am using now.
import os
import win32com.client as win32

def reject_changes(path, doc_names):
    word = win32.gencache.EnsureDispatch('Word.Application')
    rejected_changes = []
    for doc in doc_names:
        # open the word document (assumes doc_names are file names located under path)
        wb = word.Documents.Open(os.path.join(path, doc))
        wb.Activate()
        current_doc = word.ActiveDocument
        current_doc.TrackRevisions = False
        text = ''
        # iterate over the comments
        for c in current_doc.Comments:
            sentence_range = c.Scope  # a Range object covering the text marked by the comment
            select_sentence = sentence_range.Select()  # select the sentence marked by sentence_range
            nextRev = word.Selection.NextRevision()  # check for the next revision in the document
            while nextRev:
                # if the next revision is not within sentence_range then stop
                if nextRev.Range.Start < sentence_range.Start or nextRev.Range.End > sentence_range.End:
                    break
                else:
                    nextRev.Reject()
                    new_range = current_doc.Range(sentence_range.Start, sentence_range.End)
                    text = new_range.Text
                    nextRev = word.Selection.NextRevision()
            author = c.Author
            rejected_changes.append((doc, author, text, path))
        current_doc.Save()
        wb.Close(False)
    return rejected_changes
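A hypothetical call, just to show the shape of the return value (the folder and file names are made up):
results = reject_changes(r'C:\reports', ['report1.doc', 'report2.docx'])
for doc_name, author, text, path in results:
    print(doc_name, author, text)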
I created a Word document which contains the text
Hello. You owe me ${debt}. Please pay me back soon.
in Times New Roman, size 12. The file name is debtTemplate.docx. I would like to replace {debt} with an actual number (1.20) using python-docx. I tried the following code:
from docx import Document
document = Document("debtTemplate.docx")
paragraphs = document.paragraphs
debt = "1.20"
paragraph = paragraphs[0]
text = paragraph.text
newText = text.format(debt=debt)
paragraph.clear()
paragraph.add_run(newText)
document.save("debt.docx")
This results in a new document with the desired text, but in Calibri font, size 11. I would like the font to be like the original: Times New Roman, size 12.
I know that you can pass a style argument to paragraph.add_run(), so I tried that, but nothing worked. E.g. paragraph.add_run(newText, style="Strong") didn't change anything.
Does anyone know what I can do?
EDIT: here's a modified version of my code that I had hoped would work but didn't.
from docx import Document
document = Document("debtTemplate.docx")
document.save("debt.docx")
paragraphs = document.paragraphs
debt = "1.20"
paragraph = paragraphs[0]
style = paragraph.style
text = paragraph.text
newText = text.format(debt=debt)
paragraph.clear()
paragraph.add_run(newText,style)
document.save("debt.docx")
This page in the docs should help you understand why the style is not having an effect. It's a pretty easy fix: http://python-docx.readthedocs.org/en/latest/user/styles.html
I like a couple other things about what you've found though:
Using the str.format() method to do placeholder replacement is a nice, easy way to do lightweight text replacement. I'll have to add that to the documentation as an approach to simple custom document generation.
In the XML for a paragraph, there is an optional element called <w:defRPr> which Word uses to indicate the default formatting for any new text added to the paragraph, like if you started typing after placing your insertion point at the end of the paragraph. Right now, python-docx ignores that element. That's why you're getting the default Calibri 11 instead of the Times New Roman 12 you started with. But a useful feature might be to use that element, if present, to assign run properties to any new runs added at the end of the paragraph. If you want to add that as a feature request to the GitHub tracker we'll take a look at getting it implemented.
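In the meantime, a minimal workaround sketch: rather than relying on a style, set the character formatting directly on the new run (the font name and size come straight from the question; nothing else is assumed):
from docx import Document
from docx.shared import Pt

document = Document("debtTemplate.docx")
paragraph = document.paragraphs[0]
newText = paragraph.text.format(debt="1.20")

paragraph.clear()
run = paragraph.add_run(newText)
run.font.name = "Times New Roman"  # apply character formatting directly to the run
run.font.size = Pt(12)

document.save("debt.docx")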