I have some HTML-formatted text that I've parsed with BeautifulSoup. I'd like to convert all italic (tag i), bold (b) and link (a href) segments to Word format via python-docx run commands.
I can make a paragraph:
p = document.add_paragraph('text')
I can add the next run as bold/italic:
p.add_run('bold').bold = True
p.add_run('italic.').italic = True
Intuitively, I could find all occurrences of a particular tag (e.g. soup.find_all('i')), then track indices and concatenate partial strings...
...but maybe there's a better, more elegant way?
I don't want libraries or solutions that just convert an HTML page to Word and save it. I want a little more control.
I got nowhere with a dictionary approach. Here is the code; it produces the plain-text (wrong) result rather than the formatted (desired) one:
from docx import Document
import os
from bs4 import BeautifulSoup
html = 'hi, I am link this is some nice regular text. <i> oooh, but I am italic</i> ' \
' or I can be <b>bold</b> '\
' or even <i><b>bold and italic</b></i>'
def get_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    tags = {}
    tags["i"] = soup.find_all("i")
    tags["b"] = soup.find_all("b")
    return tags
def make_test_word():
    document = Document()
    document.add_heading('Demo HTML', 0)
    soup = BeautifulSoup(html, "html.parser")
    p = document.add_paragraph(html)
    # p.add_run('bold').bold = True
    # p.add_run(' and some ')
    # p.add_run('italic.').italic = True
    file_name = "demo_html.docx"
    document.save(file_name)
    os.startfile(file_name)
make_test_word()
I just wrote a bit of code to convert the text from a tkinter Text widget to a Word document, including any bold tags the user can add. This isn't a complete solution for you, but it may help you start toward a working one. I think you're going to have to do some regex work to get the hyperlinks transferred to the Word document, and stacked formatting tags may also get tricky. I hope this helps:
from docx import Document
html = 'HTML string <b>here</b>.'
html = html.split('<')
html = [html[0]] + ['<'+l for l in html[1:]]
doc = Document()
p = doc.add_paragraph()
for run in html:
    if run.startswith('<b>'):
        run = run[len('<b>'):]  # slice off the tag; lstrip('<b>') would also strip leading 'b' characters from the text
        runner = p.add_run(run)
        runner.bold = True
    elif run.startswith('</b>'):
        run = run[len('</b>'):]
        p.add_run(run)
    else:
        p.add_run(run)
doc.save('test.docx')
I came back to it and made it possible to parse multiple formatting tags. This keeps a tally of the formatting tags currently in play in a list. At each tag, a new run is created, and the formatting for the run is set from the tags in play.
from docx import Document
import re
import docx
from docx.shared import Pt
from docx.enum.dml import MSO_THEME_COLOR_INDEX
def add_hyperlink(paragraph, text, url):
    # This gets access to the document.xml.rels file and gets a new relation id value
    part = paragraph.part
    r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)

    # Create the w:hyperlink tag and add needed values
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id)

    # Create a w:r element and a new w:rPr element
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')

    # Join all the xml elements together and add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)

    # Create a new Run object and add the hyperlink into it
    r = paragraph.add_run()
    r._r.append(hyperlink)

    # A workaround for the lack of a hyperlink style (doesn't go purple after using the link)
    # Delete this if using a template that has the hyperlink style in it
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True

    return hyperlink
html = '<H1>I want to</H1> <u>convert HTML to docx in <b>bold and <i>bold italic</i></b>.</u>'
html = html.split('<')
html = [html[0]] + ['<'+l for l in html[1:]]
tags = []
doc = Document()
p = doc.add_paragraph()
for run in html:
    tag_change = re.match('(?:<)(.*?)(?:>)', run)
    if tag_change is not None:
        tag_strip = tag_change.group(0)
        tag_change = tag_change.group(1)
        if tag_change.startswith('/'):
            if tag_change.startswith('/a'):
                tag_change = next(tag for tag in tags if tag.startswith('a '))
            tag_change = tag_change.strip('/')
            tags.remove(tag_change)
        else:
            tags.append(tag_change)
    else:
        tag_strip = ''
    hyperlink = [tag for tag in tags if tag.startswith('a ')]
    if run.startswith('<'):
        run = run.replace(tag_strip, '')
        if hyperlink:
            hyperlink = hyperlink[0]
            hyperlink = re.match('.*?(?:href=")(.*?)(?:").*?', hyperlink).group(1)
            add_hyperlink(p, run, hyperlink)
        else:
            runner = p.add_run(run)
            if 'b' in tags:
                runner.bold = True
            if 'u' in tags:
                runner.underline = True
            if 'i' in tags:
                runner.italic = True
            if 'H1' in tags:
                runner.font.size = Pt(24)
    else:
        p.add_run(run)
doc.save('test.docx')
Hyperlink function thanks to this question. My concern here is that you will need to manually code for every HTML tag that you want carried over to the docx, and I imagine that could be a large number. I've given some examples of tags you may want to account for.
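To keep that manageable, one option (a minimal sketch of my own, not part of the code above; TAG_FORMATTERS and apply_format are invented names) is to drive the formatting from a dictionary, so each new tag becomes one entry rather than another if branch:

from docx.shared import Pt

# Hypothetical mapping: each supported tag name points to a callback that styles a run.
TAG_FORMATTERS = {
    'b': lambda run: setattr(run, 'bold', True),
    'i': lambda run: setattr(run, 'italic', True),
    'u': lambda run: setattr(run, 'underline', True),
    'H1': lambda run: setattr(run.font, 'size', Pt(24)),
}

def apply_format(run, tags_in_play):
    # Apply every formatter whose tag is currently open.
    for tag in tags_in_play:
        formatter = TAG_FORMATTERS.get(tag)
        if formatter:
            formatter(run)

With this, the chain of if 'b' in tags: ... checks collapses to a single apply_format(runner, tags) call.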
Alternatively, you can just save your HTML code to a file and do:
from htmldocx import HtmlToDocx
new_parser = HtmlToDocx()
new_parser.parse_html_file("html_filename", "docx_filename")
# File extensions not needed, but tolerated
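Since you said you already have the HTML as a string, htmldocx can also parse it directly; if I remember the API correctly, parse_html_string returns a python-docx Document you can keep editing before saving (worth verifying against the htmldocx docs):

from htmldocx import HtmlToDocx

new_parser = HtmlToDocx()
# parse_html_string builds and returns a python-docx Document from an HTML string,
# so you keep full control of the result before saving.
doc = new_parser.parse_html_string('<p>some <b>bold</b> text</p>')
doc.save('from_string.docx')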
<ns1:AffectedAreas>
    <ns1:Area>
        <ns1:AreaId>10YDK-1--------W</ns1:AreaId>
        <ns1:AreaName>DK1</ns1:AreaName>
    </ns1:Area>
</ns1:AffectedAreas>
I've been trying my best to access the ns1:AreaId value (10YDK-1--------W) inside ns1:AffectedAreas by using B = soup.find('ns1:area') and then B.next_element, but all I get is an empty string.
Try this method:
import bs4
import re
data = """
<ns1:AffectedAreas>
    <ns1:Area>
        <ns1:AreaId>10YDK-1--------W</ns1:AreaId>
        <ns1:AreaName>DK1</ns1:AreaName>
    </ns1:Area>
</ns1:AffectedAreas>
"""
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
bs = bs4.BeautifulSoup(data, "html.parser")
areaid = bs.find_all('ns1:areaid')
print(striphtml(str(areaid)))
Here, the striphtml function removes every tag enclosed in <>, so the output will be:
[10YDK-1--------W]
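As a side note, you can get the same value without the regex round-trip: BeautifulSoup tags expose their text content directly. A small sketch reusing the bs object from above:

# get_text() returns only the text inside the tag, with no markup left to strip
for tag in bs.find_all('ns1:areaid'):
    print(tag.get_text())  # 10YDK-1--------W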
If you have namespaces defined in your HTML/XML document, you can use the xml parser and CSS selectors.
For example:
from bs4 import BeautifulSoup

txt = '''<root xmlns:ns1="some namespace">
    <ns1:AffectedAreas>
        <ns1:Area>
            <ns1:AreaId>10YDK-1--------W</ns1:AreaId>
            <ns1:AreaName>DK1</ns1:AreaName>
        </ns1:Area>
    </ns1:AffectedAreas>
</root>'''

soup = BeautifulSoup(txt, 'xml')
area_id = soup.select_one('ns1|AffectedAreas ns1|AreaId').text
print(area_id)
Prints:
10YDK-1--------W
You can iterate over the children of soup.find('ns1:area') to find the ns1:areaid tag and then get its text:
for i in soup.find('ns1:area').children:
    if i.name == "ns1:areaid":
        b = i.text
        print(b)
Starting from ns1:AffectedAreas, it looks like this:
for i in soup.find_all('ns1:AffectedAreas'.lower()):
    for child in i.children:
        if child.name == "ns1:area":
            for y in child.children:
                if y.name == "ns1:areaid":
                    print(y.text)
Or search for the tag ns1:AreaId in lower case and take its text; this way you can get the text values from all ns1:AreaId tags:
soup.find_all("ns1:AreaId".lower())[0].text
Both cases will output
"10YDK-1--------W"
Another method.
from simplified_scrapy import SimplifiedDoc, req, utils
html = '''
<ns1:AffectedAreas>
    <ns1:Area>
        <ns1:AreaId>10YDK-1--------W</ns1:AreaId>
        <ns1:AreaName>DK1</ns1:AreaName>
    </ns1:Area>
    <ns1:Area>
        <ns1:AreaId>10YDK-2--------W</ns1:AreaId>
        <ns1:AreaName>DK2</ns1:AreaName>
    </ns1:Area>
</ns1:AffectedAreas>
'''
doc = SimplifiedDoc(html)
AffectedArea = doc.select('ns1:AffectedAreas')
Areas = AffectedArea.selects('ns1:Area')
AreaIds = Areas.select('ns1:AreaId').html
print(AreaIds)
# or
# print(doc.select('ns1:AffectedAreas').selects('ns1:Area').select('ns1:AreaId').html)
Result:
['10YDK-1--------W', '10YDK-2--------W']
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
I've used the code from an earlier question to create a hyperlink:
Adding an hyperlink in MSWord by using python-docx
I now want to create a link to a bookmark within the document, rather than an external hyperlink, but can't work out how to do it. Any ideas?
Found a way, thanks to neilbilly on GitHub:
feature: Paragraph.add_hyperlink() #74
import docx
from docx.enum.dml import MSO_THEME_COLOR_INDEX

def add_link(paragraph, link_to, text):
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    # w:anchor targets a bookmark name inside the document
    # (external links use r:id instead)
    hyperlink.set(docx.oxml.shared.qn('w:anchor'), link_to)
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)
    r = paragraph.add_run()
    r._r.append(hyperlink)
    r.font.name = "Calibri"
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True
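For reference, a minimal usage sketch (the input filename and the bookmark name "chapter1" are placeholders; the document must already contain a bookmark with that name):

from docx import Document

doc = Document("input.docx")  # assumed to already contain a bookmark named "chapter1"
p = doc.add_paragraph("Jump to ")
add_link(p, "chapter1", "chapter one")
doc.save("output.docx")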
I've created a script to fetch the conversation between the different debaters, excluding the moderators. What I've written so far can fetch the total conversation. However, I would like to grab it like {speaker_name: (first speech, second speech), ...}.
Webpage link
another one similar to the above link
webpage link
I've tried so far:
import requests
from bs4 import BeautifulSoup
url = 'https://www.presidency.ucsb.edu/documents/presidential-debate-the-university-nevada-las-vegas'
def get_links(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select(".field-docs-content p:has( > strong:contains('MODERATOR:')) ~ p"):
        print(item.text)

if __name__ == '__main__':
    get_links(url)
How can I scrape the conversation among debaters and put them in a dictionary?
I don't hold much hope of this lasting across lots of pages, given the variability between the two pages I saw and the number of assumptions I have had to make. Essentially, I use a regex on the participant and moderator nodes' text to isolate the lists of moderators and participants. I then loop over all speech paragraphs; each time I encounter a moderator at the start of a paragraph, I set a boolean variable store_paragraph = False and ignore subsequent paragraphs. Likewise, each time I encounter a participant, I set store_paragraph = True and store that paragraph and subsequent ones under the appropriate participant key in my speaker_dict. I store each speaker_dict in a final results dictionary.
import requests, re
from bs4 import BeautifulSoup as bs
import pprint
links = ['https://www.presidency.ucsb.edu/documents/presidential-debate-the-university-nevada-las-vegas','https://www.presidency.ucsb.edu/documents/republican-presidential-candidates-debate-manchester-new-hampshire-0']
results = {}
p = re.compile(r'\b(\w+)\b\s+\(|\b(\w+)\b,')
with requests.Session() as s:
    for number, link in enumerate(links):
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        participants_tag = soup.select_one('p:has(strong:contains("PARTICIPANTS:"))')
        if participants_tag.select_one('strong'):
            participants_tag.strong.decompose()
        speaker_dict = {i[0].upper() + ':' if i[0] else i[1].upper() + ':': []
                        for string in participants_tag.stripped_strings for i in p.findall(string)}
        # print(speaker_dict)
        moderator_data = [string for string in soup.select_one('p:has(strong:contains("MODERATOR:","MODERATORS:"))').stripped_strings][1:]
        # print(moderator_data)
        moderators = [i[0].upper() + ':' if i[0] else i[1].upper() + ':'
                      for string in moderator_data for i in p.findall(string)]
        store_paragraph = False
        for paragraph in soup.select('.field-docs-content p:not(p:contains("PARTICIPANTS:","MODERATOR:"))')[1:]:
            string_to_compare = paragraph.text.split(':')[0] + ':'
            string_to_compare = string_to_compare.upper()
            if string_to_compare in moderators:
                store_paragraph = False
            elif string_to_compare in speaker_dict:
                speaker = string_to_compare
                store_paragraph = True
            if store_paragraph:
                speaker_dict[speaker].append(paragraph.text)
        results[number] = speaker_dict

pprint.pprint(results[1])
I have written a script using python-docx to search Word documents (by searching the runs) for reference numbers and technical keywords, then create a table summarizing the search results, which is appended to the end of the document.
Some of the documents are 100+ pages, so I want to make it easier for the user by adding internal hyperlinks to the search-results table, so a result can take you to the location in the document where it was found.
Once a reference run is found, I don't know how to mark it as a bookmark, or how to create a hyperlink to that bookmark in the results table.
I was able to create hyperlinks to external URLs using the code on this page:
Adding an hyperlink in MSWord by using python-docx
I have also tried creating bookmarks; I found this page:
https://github.com/python-openxml/python-docx/issues/109
The title relates to creating bookmarks, but the code seems to generate figures in Word.
I feel like the two solutions can be put together, but I don't have enough understanding of xml/ word docs to be able to do it.
Update:
I found some code that will add bookmarks to a Word document; what is still needed is a way to link to them from within the document:
https://github.com/python-openxml/python-docx/issues/403
from docx import Document
import docx

def add_bookmark(paragraph, bookmark_text, bookmark_name):
    run = paragraph.add_run()
    tag = run._r  # for reference the following also works: tag = document.element.xpath('//w:r')[-1]
    start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
    start.set(docx.oxml.ns.qn('w:id'), '0')
    start.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(start)
    text = docx.oxml.shared.OxmlElement('w:r')
    text.text = bookmark_text
    tag.append(text)
    end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
    end.set(docx.oxml.ns.qn('w:id'), '0')
    end.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(end)

doc = Document("test_input_1.docx")

# add a bookmark to every paragraph
for paranum, paragraph in enumerate(doc.paragraphs):
    add_bookmark(paragraph=paragraph, bookmark_text=f"temp{paranum}", bookmark_name=f"temp{paranum+1}")

doc.save("output.docx")
Solved: I got it from this post: adding hyperlink to a bookmark.
This is the key line:
hyperlink.set(docx.oxml.shared.qn('w:anchor'), link_to)
As a bonus, I have added the ability to attach a tooltip to your link. Enjoy. Here is the answer:
from docx import Document
import docx
from docx.enum.dml import MSO_THEME_COLOR_INDEX

def add_bookmark(paragraph, bookmark_text, bookmark_name):
    run = paragraph.add_run()
    tag = run._r
    start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
    start.set(docx.oxml.ns.qn('w:id'), '0')
    start.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(start)
    text = docx.oxml.shared.OxmlElement('w:r')
    text.text = bookmark_text
    tag.append(text)
    end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
    end.set(docx.oxml.ns.qn('w:id'), '0')
    end.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(end)

def add_link(paragraph, link_to, text, tool_tip=None):
    # create hyperlink node
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    # set attribute for link to bookmark
    hyperlink.set(docx.oxml.shared.qn('w:anchor'), link_to)
    if tool_tip is not None:
        # set attribute for the tooltip shown on hover
        hyperlink.set(docx.oxml.shared.qn('w:tooltip'), tool_tip)
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)
    r = paragraph.add_run()
    r._r.append(hyperlink)
    r.font.name = "Calibri"
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True

# test the functions
if __name__ == "__main__":
    # input test document
    doc = Document(r"test_input_1.docx")

    # add a bookmark to every paragraph (named temp0, temp1, ... so that
    # link_to="temp0" below actually resolves to the first paragraph)
    for paranum, paragraph in enumerate(doc.paragraphs):
        add_bookmark(paragraph=paragraph,
                     bookmark_text=f"{paranum}", bookmark_name=f"temp{paranum}")

    # add a page to the end to put your link on
    doc.add_page_break()
    paragraph = doc.add_paragraph("This is where the internal link will live")

    # add a link to the first paragraph
    add_link(paragraph=paragraph, link_to="temp0",
             text="this is a link to ", tool_tip="your message here")

    doc.save(r"output.docx")
The previous solution doesn't work for me on LibreOffice (6.4).
After comparing the XML of two documents, one with a bookmark and one without, and after checking this: http://officeopenxml.com/WPbookmark.php, we can see the following:
For Bookmark
The solution is to add the bookmark to the paragraph, not to a run. So in this line:
tag = run._r # for reference the following also works: tag = document.element.xpath('//w:r')[-1]
you should change the "r" to a "p" in ('//w:r'):
tag = doc.element.xpath('//w:p')[-1]
and then it will work.
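Put together, the bookmark function becomes something like the following (my own sketch of the change described above, using paragraph._p to reach the same w:p element rather than the xpath lookup):

import docx

def add_bookmark(paragraph, bookmark_text, bookmark_name):
    tag = paragraph._p  # anchor to the paragraph (w:p), not to a run (w:r)
    start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
    start.set(docx.oxml.ns.qn('w:id'), '0')
    start.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(start)
    text = docx.oxml.shared.OxmlElement('w:r')
    text.text = bookmark_text
    tag.append(text)
    end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
    end.set(docx.oxml.ns.qn('w:id'), '0')
    end.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(end)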
For the link, you have to make the same change. Here is the function:
def add_link(paragraph, link_to, text, tool_tip=None):
    # create hyperlink node
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    # set attribute for link to bookmark
    hyperlink.set(docx.oxml.shared.qn('w:anchor'), link_to)
    if tool_tip is not None:
        # set attribute for the tooltip
        hyperlink.set(docx.oxml.shared.qn('w:tooltip'), tool_tip)
    new_run = docx.oxml.shared.OxmlElement('w:r')
    # here we change the font color and add an underline
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    c = docx.oxml.shared.OxmlElement('w:color')
    c.set(docx.oxml.shared.qn('w:val'), '2A6099')
    rPr.append(c)
    u = docx.oxml.shared.OxmlElement('w:u')
    u.set(docx.oxml.shared.qn('w:val'), 'single')
    rPr.append(u)
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)
    paragraph._p.append(hyperlink)  # this adds the link into the w:p
    # this is wrong:
    # r = paragraph.add_run()
    # r._r.append(hyperlink)
    # r.font.name = "Calibri"
    # r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    # r.font.underline = True
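And a quick usage sketch for the LibreOffice-friendly pair (the filenames and the bookmark name "top" are placeholders of mine):

from docx import Document

doc = Document("test_input_1.docx")

# mark the first paragraph, then link back to it from a new last page
add_bookmark(doc.paragraphs[0], bookmark_text="", bookmark_name="top")
doc.add_page_break()
p = doc.add_paragraph("Back to ")
add_link(p, link_to="top", text="the top", tool_tip="jump to the first paragraph")
doc.save("output.docx")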
How would I modify the parameters of the findAll method to match both li elements and id attributes? li is an element and id is an attribute, correct?
#Author: David Owens
#File name: soupScraper.py
#Description: html scraper that takes surf reports from various websites
import csv
import requests
from bs4 import BeautifulSoup
###################### SURFLINE URL STRINGS AND TAG ###########################
slRootUrl = 'http://www.surfline.com/surf-report/'
slSunsetCliffs = 'sunset-cliffs-southern-california_4254/'
slScrippsUrl = 'scripps-southern-california_4246/'
slBlacksUrl = 'blacks-southern-california_4245/'
slCardiffUrl = 'cardiff-southern-california_4786/'
slTagText = 'observed-wave-range'
slTag = 'id'
#list of surfline URL endings
slUrls = [slSunsetCliffs, slScrippsUrl, slBlacksUrl, slCardiffUrl]
###############################################################################
#################### MAGICSEAWEED URL STRINGS AND TAG #########################
msRootUrl = 'http://magicseaweed.com/'
msSunsetCliffs = 'Sunset-Cliffs-Surf-Report/4211/'
msScrippsUrl = 'Scripps-Pier-La-Jolla-Surf-Report/296/'
msBlacksUrl = 'Torrey-Pines-Blacks-Beach-Surf-Report/295/'
msTagText = 'rating-text text-dark'
msTag = 'li'
#list of magicseaweed URL endings
msUrls = [msSunsetCliffs, msScrippsUrl, msBlacksUrl]
###############################################################################
'''
This function iterates through a list of urls and extracts the surf report from
the webpage, dependent upon its tag location.

rootUrl: The root url of each surf website
urlList: A list of specific urls to be appended to the root url for each break
tag:     the html tag where the actual report lives on the page

returns: a list of strings of each break's surf report
'''
def extract_Reports(rootUrl, urlList, tag, tagText):
    #empty list to hold reports
    reports = []

    #loop thru URLs
    for url in urlList:
        try:
            #request page
            request = requests.get(rootUrl + url)

            #turn into soup
            soup = BeautifulSoup(request.content, 'lxml')

            #get the tag where report lives
            reportTag = soup.findAll(id = tagText)

            for report in reportTag:
                reports.append(report.string.strip())

        #notify if fail
        except Exception:
            print('scrape failure')

    return reports
#END METHOD

slReports = extract_Reports(slRootUrl, slUrls, slTag, slTagText)
msReports = extract_Reports(msRootUrl, msUrls, msTag, msTagText)

print(slReports)
print(msReports)
As of right now, only slReports prints correctly because I have it explicitly set to id=tagText. I am also aware that my tag parameter is currently unused.
So the problem is that you want to search the parse tree for elements that have either a class name of rating-text (it turns out you do not need text-dark to identify the relevant elements in the case of Magicseaweed) or an ID of observed-wave-range, using a single findAll call.
You can use a filter function to achieve this:
def reportTagFilter(tag):
    return (tag.has_attr('class') and 'rating-text' in tag['class']) \
        or (tag.has_attr('id') and tag['id'] == 'observed-wave-range')
Then change your extract_Reports function to read:
reportTag = soup.findAll(reportTagFilter)[0]
reports.append(reportTag.text.strip())
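If you'd rather not write a filter function, the same either/or match can be expressed as a CSS selector, since select accepts comma-separated alternatives (a small sketch; like findAll, it returns a list):

# matches elements with class "rating-text" OR id "observed-wave-range"
reportTag = soup.select('.rating-text, #observed-wave-range')[0]
reports.append(reportTag.text.strip())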