Search multiple words (dependent) from PDF - Python

I happily found Python code to search for multiple words in a PDF.
I want to find the pages where two words both occur. For instance, I want both 'Name' and 'Address' to exist on the same page, and to get the page locations where this occurs. If only one of the two words is present, the page location is not required.
Thank you.
Code that I found:
Search Multiple words from pdf

Referring to the page cited by the author and to what I found here, I would suggest something like:
import re
from PyPDF2 import PdfFileReader

def string_found(word, string_page):
    # whole-word, case-insensitive search
    if re.search(r"\b" + re.escape(word) + r"\b", string_page, re.IGNORECASE):
        return True
    return False

word1 = "name"
word2 = "address"

reader = PdfFileReader("document.pdf")  # hypothetical input file
num_pages = reader.getNumPages()
for i in range(num_pages):
    page = reader.getPage(i)
    text = page.extractText()  # get text of current page
    bool1 = string_found(word1, text)
    bool2 = string_found(word2, text)
    if bool1 and bool2:
        print(i)  # print number of page with both occurrences
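If more than two words must all appear on the same page, the same check generalizes with all(); a minimal sketch reusing string_found and the reader set up above (the word list is hypothetical):
required = ["name", "address", "phone"]  # hypothetical word list
for i in range(num_pages):
    text = reader.getPage(i).extractText()
    if all(string_found(word, text) for word in required):
        print(i)  # page contains every required word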


how to create bookmarks in a word document, then create internal hyperlinks to the bookmark w/ python

I have written a script using python-docx to search Word documents (by searching the runs) for reference numbers and technical key words, and then create a table summarizing the search results, which is appended to the end of the document.
Some of the documents are 100+ pages, so I want to make it easier for the user by creating internal hyperlinks in the search-result table, so it will bring you to the location in the document where the search result was detected.
Once a reference run is found, I don't know how to mark it as a bookmark or how to create a hyperlink to that bookmark in the results table.
I was able to create hyperlinks to external URLs using the code on this page:
Adding an hyperlink in MSWord by using python-docx
I have also tried creating bookmarks. I found this page:
https://github.com/python-openxml/python-docx/issues/109
The title relates to creating bookmarks, but the code seems to generate figures in Word.
I feel like the two solutions can be put together, but I don't have enough understanding of XML/Word documents to be able to do it.
Update:
I found some code that will add bookmarks to a Word document; what is still needed is a way to link to them from within the document:
https://github.com/python-openxml/python-docx/issues/403
from docx import Document
import docx  # needed for the docx.oxml helpers below

def add_bookmark(paragraph, bookmark_text, bookmark_name):
    run = paragraph.add_run()
    tag = run._r  # for reference, the following also works: tag = document.element.xpath('//w:r')[-1]
    start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
    start.set(docx.oxml.ns.qn('w:id'), '0')
    start.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(start)
    text = docx.oxml.OxmlElement('w:r')
    text.text = bookmark_text
    tag.append(text)
    end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
    end.set(docx.oxml.ns.qn('w:id'), '0')
    end.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(end)

doc = Document("test_input_1.docx")
# add a bookmark to every paragraph
for paranum, paragraph in enumerate(doc.paragraphs):
    add_bookmark(paragraph=paragraph, bookmark_text=f"temp{paranum}", bookmark_name=f"temp{paranum+1}")
doc.save("output.docx")
Solved:
I got it from this post: adding hyperlink to a bookmark.
This is the key line:
hyperlink.set(docx.oxml.shared.qn('w:anchor'), link_to)
As a bonus, I have added the ability to attach a tooltip to your link.
Enjoy; here is the answer:
from docx import Document
import docx
from docx.enum.dml import MSO_THEME_COLOR_INDEX

def add_bookmark(paragraph, bookmark_text, bookmark_name):
    run = paragraph.add_run()
    tag = run._r
    start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
    start.set(docx.oxml.ns.qn('w:id'), '0')
    start.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(start)
    text = docx.oxml.OxmlElement('w:r')
    text.text = bookmark_text
    tag.append(text)
    end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
    end.set(docx.oxml.ns.qn('w:id'), '0')
    end.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(end)

def add_link(paragraph, link_to, text, tool_tip=None):
    # create hyperlink node
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    # set attribute for link to bookmark
    hyperlink.set(docx.oxml.shared.qn('w:anchor'), link_to)
    if tool_tip is not None:
        # set attribute for the tooltip
        hyperlink.set(docx.oxml.shared.qn('w:tooltip'), tool_tip)
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)
    r = paragraph.add_run()
    r._r.append(hyperlink)
    r.font.name = "Calibri"
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True

# test the functions
if __name__ == "__main__":
    # input test document
    doc = Document(r"test_input_1.docx")
    # add a bookmark to every paragraph
    for paranum, paragraph in enumerate(doc.paragraphs):
        add_bookmark(paragraph=paragraph,
                     bookmark_text=f"{paranum}", bookmark_name=f"temp{paranum+1}")
    # add a page to the end to put your link on
    doc.add_page_break()
    paragraph = doc.add_paragraph("This is where the internal link will live")
    # link to the first paragraph's bookmark (bookmark names start at "temp1" above)
    add_link(paragraph=paragraph, link_to="temp1",
             text="this is a link to ", tool_tip="your message here")
    doc.save(r"output.docx")
The previous solution didn't work for me on LibreOffice (6.4).
After comparing the XML of two documents, one with a bookmark and one without,
and after checking this: http://officeopenxml.com/WPbookmark.php, we can see the following.
For the bookmark
The solution is to add the bookmark to the paragraph, not to a run. So in this line:
tag = run._r # for reference the following also works: tag = document.element.xpath('//w:r')[-1]
you should change the "r" to "p" in "('//w:r')":
tag = doc.element.xpath('//w:p')[-1]
and then it will work.
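For clarity, a minimal sketch of add_bookmark with that change applied, anchoring the bookmark elements on the paragraph's w:p element (paragraph._p is equivalent to the xpath lookup for the paragraph in question):
import docx

def add_bookmark(paragraph, bookmark_text, bookmark_name):
    tag = paragraph._p  # anchor on the paragraph (w:p), not on a run
    start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
    start.set(docx.oxml.ns.qn('w:id'), '0')
    start.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(start)
    text = docx.oxml.OxmlElement('w:r')
    text.text = bookmark_text
    tag.append(text)
    end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
    end.set(docx.oxml.ns.qn('w:id'), '0')
    end.set(docx.oxml.ns.qn('w:name'), bookmark_name)
    tag.append(end)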
For the link, you have to do the same thing; here is the function:
def add_link(paragraph, link_to, text, tool_tip=None):
    # create hyperlink node
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    # set attribute for link to bookmark
    hyperlink.set(docx.oxml.shared.qn('w:anchor'), link_to)
    if tool_tip is not None:
        # set attribute for the tooltip
        hyperlink.set(docx.oxml.shared.qn('w:tooltip'), tool_tip)
    new_run = docx.oxml.shared.OxmlElement('w:r')
    # here we change the font color and add an underline
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    c = docx.oxml.shared.OxmlElement('w:color')
    c.set(docx.oxml.shared.qn('w:val'), '2A6099')
    rPr.append(c)
    u = docx.oxml.shared.OxmlElement('w:u')
    u.set(docx.oxml.shared.qn('w:val'), 'single')
    rPr.append(u)
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)
    paragraph._p.append(hyperlink)  # this adds the link to the w:p
    # this is wrong:
    # r = paragraph.add_run()
    # r._r.append(hyperlink)
    # r.font.name = "Calibri"
    # r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    # r.font.underline = True

HTML to Word docx

I have some HTML-formatted text that I've extracted with BeautifulSoup. I'd like to convert all italics (tag i), bold (b) and links (a href) to Word format via python-docx runs.
I can make a paragraph:
p = document.add_paragraph('text')
I can ADD the next sequence as bold/italic:
p.add_run('bold').bold = True
p.add_run('italic.').italic = True
Intuitively, I could find all the relevant tags (e.g. soup.find_all('i')), then track indices and concatenate partial strings...
...but maybe there's a better, more elegant way?
I don't want libraries or solutions that just convert an HTML page to Word and save it. I want a little more control.
I got nowhere with a dictionary. Here is the code; the wrong (actual) and right (desired) results were shown as screenshots:
from docx import Document
import os
from bs4 import BeautifulSoup

html = 'hi, I am link this is some nice regular text. <i> oooh, but I am italic</i> ' \
       ' or I can be <b>bold</b> ' \
       ' or even <i><b>bold and italic</b></i>'

def get_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    tags = {}
    tags["i"] = soup.find_all("i")
    tags["b"] = soup.find_all("b")
    return tags

def make_test_word():
    document = Document()
    document.add_heading('Demo HTML', 0)
    soup = BeautifulSoup(html, "html.parser")
    p = document.add_paragraph(html)
    # p.add_run('bold').bold = True
    # p.add_run(' and some ')
    # p.add_run('italic.').italic = True
    file_name = "demo_html.docx"
    document.save(file_name)
    os.startfile(file_name)

make_test_word()
I just wrote a bit of code to convert the text from a tkinter Text widget over to a Word document, including any bold tags that the user can add. This isn't a complete solution for you, but it may help you start toward a working solution. I think you're going to have to do some regex work to get the hyperlinks transferred to the Word document. Stacked formatting tags may also get tricky. I hope this helps:
from docx import Document

html = 'HTML string <b>here</b>.'
html = html.split('<')
html = [html[0]] + ['<'+l for l in html[1:]]

doc = Document()
p = doc.add_paragraph()
for run in html:
    if run.startswith('<b>'):
        run = run[len('<b>'):]  # slice off the prefix; lstrip('<b>') would also eat leading 'b's from the text
        runner = p.add_run(run)
        runner.bold = True
    elif run.startswith('</b>'):
        run = run[len('</b>'):]
        runner = p.add_run(run)
    else:
        p.add_run(run)
doc.save('test.docx')
I came back to it and made it possible to parse out multiple formatting tags. This will keep a tally of what formatting tags are in play in a list. At each tag, a new run is created, and formatting for the run is set by the current tags in play.
from docx import Document
import re
import docx
from docx.shared import Pt
from docx.enum.dml import MSO_THEME_COLOR_INDEX

def add_hyperlink(paragraph, text, url):
    # This gets access to the document.xml.rels file and gets a new relation id value
    part = paragraph.part
    r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)
    # Create the w:hyperlink tag and add needed values
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id)
    # Create a w:r element and a new w:rPr element
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    # Join all the xml elements together and add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)
    # Create a new Run object and add the hyperlink into it
    r = paragraph.add_run()
    r._r.append(hyperlink)
    # A workaround for the lack of a hyperlink style (doesn't go purple after using the link)
    # Delete this if using a template that has the hyperlink style in it
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True
    return hyperlink

html = '<H1>I want to</H1> <u>convert HTML to docx in <b>bold and <i>bold italic</i></b>.</u>'
html = html.split('<')
html = [html[0]] + ['<'+l for l in html[1:]]

tags = []
doc = Document()
p = doc.add_paragraph()
for run in html:
    tag_change = re.match('(?:<)(.*?)(?:>)', run)
    if tag_change is not None:
        tag_strip = tag_change.group(0)
        tag_change = tag_change.group(1)
        if tag_change.startswith('/'):
            if tag_change.startswith('/a'):
                tag_change = next(tag for tag in tags if tag.startswith('a '))
            tag_change = tag_change.strip('/')
            tags.remove(tag_change)
        else:
            tags.append(tag_change)
    else:
        tag_strip = ''
    hyperlink = [tag for tag in tags if tag.startswith('a ')]
    if run.startswith('<'):
        run = run.replace(tag_strip, '')
        if hyperlink:
            hyperlink = hyperlink[0]
            hyperlink = re.match('.*?(?:href=")(.*?)(?:").*?', hyperlink).group(1)
            add_hyperlink(p, run, hyperlink)
        else:
            runner = p.add_run(run)
            if 'b' in tags:
                runner.bold = True
            if 'u' in tags:
                runner.underline = True
            if 'i' in tags:
                runner.italic = True
            if 'H1' in tags:
                runner.font.size = Pt(24)
    else:
        p.add_run(run)
doc.save('test.docx')
Hyperlink function thanks to this question. My concern here is that you will need to manually code for every HTML tag that you want to carry over to the docx. I imagine that could be a large number. I've given some examples of tags you may want to account for.
Alternatively, you can just save your HTML to a file and do:
from htmldocx import HtmlToDocx

new_parser = HtmlToDocx()
new_parser.parse_html_file("html_filename", "docx_filename")
# File extensions not needed, but tolerated
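If you already have the HTML in a string (as in the question), htmldocx also has string-based entry points; a sketch, assuming the library's parse_html_string and add_html_to_document methods:
from docx import Document
from htmldocx import HtmlToDocx

html = 'some text with <b>bold</b> and <i>italic</i>'
parser = HtmlToDocx()
doc = parser.parse_html_string(html)  # assumed to return a python-docx Document
doc.save("from_string.docx")
# or append the HTML into an existing document:
document = Document()
parser.add_html_to_document(html, document)
document.save("appended.docx")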

Python Web crawler, crawl through links and find specific words

So I am trying to code a web crawler that goes into each chapter of a title of a statute and counts the occurrences of a set of key words ("shall", "must") in its content.
Below is the code I used to acquire the links to each chapter.
The base URL I used is http://law.justia.com/codes/georgia/2015/
import requests
from bs4 import BeautifulSoup, SoupStrainer
import re
from collections import Counter

pattern1 = re.compile(r"\bshall\b", re.IGNORECASE)
pattern2 = re.compile(r"\bmust\b", re.IGNORECASE)

######################################## Sections ##########################
def levelthree(item2_url):
    r = requests.get(item2_url)
    for sectionlinks in BeautifulSoup((r.content), "html.parser", parse_only=SoupStrainer('a')):
        if sectionlinks.has_attr('href'):
            if 'section' in sectionlinks['href']:
                href = "http://law.justia.com" + sectionlinks.get('href')
                href = "\n" + str(href)
                print(href)

######################################## Chapters ##########################
def leveltwo(item_url):
    r = requests.get(item_url)
    for sublinks in BeautifulSoup((r.content), "html.parser", parse_only=SoupStrainer('a')):
        if sublinks.has_attr('href'):
            if 'chapt' in sublinks['href']:
                chapterlinks = "http://law.justia.com" + sublinks.get('href')
                # chapterlinks = "\n" + str(chapterlinks)
                # print(chapterlinks)

######################################## Titles #############################
def levelone(url):
    r = requests.get(url)
    for links in BeautifulSoup((r.content), "html.parser", parse_only=SoupStrainer('a')):
        if links.has_attr('href'):
            if 'title-43' in links['href']:
                titlelinks = "http://law.justia.com" + links.get('href')
                # titlelinks = "\n" + str(titlelinks)
                leveltwo(titlelinks)
                # print(titlelinks)

#############################################################################
base_url = "http://law.justia.com/codes/georgia/2015/"
levelone(base_url)
The problem is that the structure of the pages is usually title - chapter - sections - contents (e.g. http://law.justia.com/codes/georgia/2015/title-43/chapter-1/section-43-1-1/),
but there are some that are title - chapter - articles - sections - contents (e.g. http://law.justia.com/codes/georgia/2015/title-43/chapter-4/article-1/section-43-4-1/).
I am able to get the links for the first scenario. However, I will miss all of the title - chapter - article - sections - contents pages.
My question is: how can I code this so that I get the contents of each chapter (from section links, and from article-to-section links), and then look for the occurrences of words (such as "shall" or "must") for each chapter individually?
I want to find the word frequency by chapter; hopefully the output will be something like this:
chapter 1
Word Frequency
shall 35
must 3
chapter 2
Word Frequency
shall 59
must 14
For this problem, count the '/' characters in the URLs:
http://law.justia.com/codes/georgia/2015/title-43/chapter-1/section-43-1-1/
http://law.justia.com/codes/georgia/2015/title-43/chapter-4/article-1/section-43-4-1/
if url.count('/') == 9:
    pass  # chapter - section URL (no article level)
if url.count('/') == 10:
    pass  # chapter - article - section URL
or you can do a simple trick:
part = url.split('/')
title = part[6]
chapter = part[7]
section = part[-2]
Note: negative indices count from the end; because of the trailing slash, part[-1] is an empty string, so the section is part[-2].
To count "shall" or "must", use the count method on the page text:
shall_count = response_text.count('shall')
must_count = response_text.count('must')
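Note that str.count matches substrings, so 'shall' would also count words like "shallow". A sketch that instead reuses the word-boundary patterns already compiled in the question to tally whole words on one section page (requests and BeautifulSoup as above; the URL is one of the section links printed by levelthree):
import requests
from bs4 import BeautifulSoup
import re

pattern1 = re.compile(r"\bshall\b", re.IGNORECASE)
pattern2 = re.compile(r"\bmust\b", re.IGNORECASE)

def count_keywords(section_url):
    r = requests.get(section_url)
    text = BeautifulSoup(r.content, "html.parser").get_text()
    # findall returns every whole-word, case-insensitive match
    return {"shall": len(pattern1.findall(text)),
            "must": len(pattern2.findall(text))}

# example, using a section URL from the question:
# print(count_keywords("http://law.justia.com/codes/georgia/2015/title-43/chapter-1/section-43-1-1/"))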

Cannot download Wordnet Error

I am trying to run this code:
from collections import OrderedDict
import pdb
pdb.set_trace()

def alphaDict(words):
    alphas = OrderedDict()
    words = sorted(words, key=str.lower)
    words = filter(None, words)
    for word in words:
        if word[0].upper() not in alphas:
            alphas[word[0].upper()] = []
        alphas[word[0].upper()].append(word.lower())
    return alphas

def listConvert(passage):
    alphabets = " abcdefghijklmnopqrstuvwxyz"
    for char in passage:
        if char.lower() not in alphabets:
            passage = passage.replace(char, "")
    listConvert(passage)
    passage = rDup(passage.split(" "))
    return passage

def rDup(sequence):
    unique = []
    [unique.append(item) for item in sequence if item not in unique]
    return unique

def otherPrint(word):
    base = "http://dictionary.reference.com/browse/"
    end = "?s=t"
    from nltk.corpus import wordnet as wn
    data = [s.definition() for s in wn.synsets(word)]
    print("<li>")
    print("<a href = '" + base + word + end + "' target = '_blank'><h2 class = 'dictlink'>" + (word.lower()) + ":</h2></a>")
    if not data:
        print("Sorry, we could not find this word in our data banks. Please click the word to check <a target = '_blank' class = 'dictlink' href = 'http://www.dictionary.com'>Dictionary.com</a>")
        return
    print("<ul>")
    for key in data:
        print("<li>" + key + "</li>")
    print("</ul>")
    print("</ol>")
    print("</li>")

def formatPrint(word):
    base = "http://dictionary.reference.com/browse/"
    end = "?s=t"
    from PyDictionary import PyDictionary
    pd = PyDictionary()
    data = pd.meaning(word)
    print "<li>"
    print "<a href = '" + base + word + end + "' target = '_blank'><h2 class = 'dictlink'>" + (word.lower()) + ":</h2></a>"
    if not data:
        print "Sorry, we could not find this word in our data banks. Please click the word to check <a target = '_blank' class = 'dictlink' href = 'http://www.dictionary.com'>Dictionary.com</a>"
        return
    print "<ol type = 'A'>"
    for key in data:
        print "<li><h3 style = 'color: red;' id = '" + word.lower() + "'>" + key + "</h3><ul type = 'square'>"
        for item in data[key]:
            print "<li>" + item + "</li>"
        print "</ul>"
        print "</li>"
    print "</ol>"
    print "</li>"

def specPrint(words):
    print "<ol>"
    for word in words:
        otherPrint(word)
    print "</ol>"
    print "<br/>"
    print "<br/>"
    print "<a href = '#list'> Click here</a> to go back to choose another letter<br/>"
    print "<a href = '#sentence'>Click here</a> to view your sentence.<br/>"
    print "<a href = '#header'>Click here</a> to see the owner's information.<br/>"
    print "<a href = '../html/main.html'>Click here</a> to go back to the main page."
    print "</div>"
    for x in range(0, 10):
        print "<br/>"
To all those who answered my previous question, thank you. It worked, and I will be accepting an answer soon. However, I have another problem. When I try to import wordnet in a shell (from the command line and in IDLE), it works fine. However, on XAMPP, I get this error:
Can someone please explain this as well? Thanks!
Your for loop's body is not indented in the otherPrint function:
for key in data:
print("<li>"+key+"</li>")
print("</ul>")
print("</ol>")
print("</li>")
This is most probably the issue. Try indenting it:
for key in data:
    print("<li>" + key + "</li>")
print("</ul>")
print("</ol>")
print("</li>")
Also, please understand that Python treats tabs and spaces differently: if you indent one line using a tab and the next line using 4 spaces, it will cause an IndentationError. You have to use either all spaces or all tabs; you cannot mix the two (even though they look the same).
A couple of things. First is the indent of line one. That may just be copying here.
Then every time you have a colon, you need to have the next line indented. So in the otherPrint function you have this:
for key in data:
print("<li>"+key+"</li>")
print("</ul>")
print("</ol>")
print("</li>")
At least the first line needs to be indented. If you intend all of the prints to be in the loop then you need to indent all of them.
You also have the same issue with your if statements in the formatPrint function. Try indenting them under the loops and conditionals and this should clear it up. If you are still finding a problem, check to make sure you have the correct number of parentheses and brackets closing out statements; leaving one off will cause the rest of the code to go wonky.
Also, you are using print statements instead of the print() function. The print statement no longer works in Python 3.x; you have to enclose all of that in parentheses.
def formatPrint(word):
    base = "http://dictionary.reference.com/browse/"
    end = "?s=t"
    from PyDictionary import PyDictionary
    pd = PyDictionary()
    data = pd.meaning(word)
    print("<li>")
    print(
        "<a href = '" + base + word + end + "' target = '_blank'>"
        "<h2 class = 'dictlink'>" + (word.lower()) + ":</h2></a>"
    )
    if not data:
        print(
            "Sorry, we could not find this word in our data banks. "
            "Please click the word to check <a target = '_blank' "
            "class = 'dictlink' href "
            "= 'http://www.dictionary.com'>Dictionary.com</a>"
        )
        return
    print("<ol type = 'A'>")
    for key in data:
        print(
            "<li><h3 style = 'color: red;' id = '" + word.lower() +
            "'>" + key + "</h3><ul type = 'square'>"
        )
        for item in data[key]:
            print("<li>" + item + "</li>")
        print("</ul>")
        print("</li>")
    print("</ol>")
    print("</li>")

Very fast webpage scraping (Python)

So I'm trying to filter through a list of URLs (potentially in the hundreds) and filter out every article whose body is less than X number of words (ARTICLE_LENGTH). But when I run my application, it takes an unreasonable amount of time, so much so that my hosting service times out. I'm currently using Goose (https://github.com/grangier/python-goose) with the following filter function:
def is_news_and_data(url):
    """A function that returns a list of the form
    [True, title, meta_description]
    or
    [False]
    """
    result = []
    if url is None:
        return False
    try:
        article = g.extract(url=url)
        if len(article.cleaned_text.split()) < ARTICLE_LENGTH:
            result.append(False)
        else:
            title = article.title
            meta_description = article.meta_description
            result.extend([True, title, meta_description])
    except:
        result.append(False)
    return result
This is used in the context of the following. Don't mind the debug prints and messiness (tweepy is my Twitter API wrapper):
def get_links(auth):
    """Returns a list of t.co links from a list of given tweets"""
    api = tweepy.API(auth)
    page_list = []
    tweets_list = []
    links_list = []
    news_list = []
    regex = re.compile('http://t.co/.[a-zA-Z0-9]*')
    for page in tweepy.Cursor(api.home_timeline, count=20).pages(1):
        page_list.append(page)
    for page in page_list:
        for status in page:
            tweet = status.text.encode('utf-8', 'ignore')
            tweets_list.append(tweet)
    for tweet in tweets_list:
        links = regex.findall(tweet)
        links_list.extend(links)
    #print 'The length of the links list is: ' + str(len(links_list))
    for link in links_list:
        news_and_data = is_news_and_data(link)
        if True in news_and_data:
            news_and_data.append(link)
            #[True, title, meta_description, link]
            news_list.append(news_and_data[1:])
    print 'The length of the news list is: ' + str(len(news_list))
Can anyone recommend a perhaps faster method?
This code is probably causing your slow performance:
len(article.cleaned_text.split())
This is performing a lot of work, most of which is discarded. I would profile your code to see if this is the culprit; if so, replace it with something that just counts spaces, like so:
article.cleaned_text.count(' ')
That won't give you exactly the same result as your original code, but will be very close. To get closer you could use a regular expression to count words, but it won't be quite as fast.
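For instance, the length check in is_news_and_data could become (a sketch; g and ARTICLE_LENGTH as in the question):
# inside is_news_and_data, replacing the split-based length check:
article = g.extract(url=url)
# count(' ') approximates the word count without materializing a list of words
if article.cleaned_text.count(' ') < ARTICLE_LENGTH:
    result.append(False)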
I'm not saying this is the absolute best you can do, but it will be faster.
You'll have to redo some of your code to fit this new function; it will at least give you fewer function calls.
You'll have to pass it the whole URL list.
def is_news_in_data(listings):
    # listings: the whole list of urls
    new_listings = {}
    is_news = {}
    for url in listings:
        is_news[url] = 0
        article = g.extract(url=url).cleaned_text
        tmp_listing = ''
        for s in article:
            is_news[url] += 1
            tmp_listing += s
        if is_news[url] > ARTICLE_LENGTH:
            new_listings[url] = tmp_listing
            del is_news[url]
    return new_listings
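Both suggestions reduce per-article work, but with hundreds of URLs the sequential network fetches are likely the real bottleneck. A sketch of running the question's filter concurrently with a thread pool (concurrent.futures is in the Python 3 standard library, or available as the futures backport on Python 2; is_news_and_data as defined in the question):
from concurrent.futures import ThreadPoolExecutor

def filter_links(links_list, max_workers=10):
    # per-URL work is dominated by network wait, so threads give a real speedup
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(is_news_and_data, links_list))
    news_list = []
    for link, news_and_data in zip(links_list, results):
        if True in news_and_data:
            news_list.append(news_and_data[1:] + [link])  # [title, meta_description, link]
    return news_list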
