How to extract text (PyPDF2) from specific location/span on PDF

I have already extracted the text of a PDF page into the variable Text.
I want to extract the number that follows the string 'your number is' (the 14-character string was matched at span (982, 996)):
reader = PyPDF2.PdfFileReader(filename)
PageObj = reader.getPage(0)
Text = PageObj.extractText()
ResSearch = re.search(String, Text)
I get the result span = (982, 996), match = 'your number is'. Now all I need is to extract the three-digit number that comes after it (e.g. 'your number is 105'). The files change daily, so the extraction has to be dynamic.
Thank you everyone !!

The problem is about regex, not about the PDF itself. Assuming there is at most one match per page, you can use re.search; otherwise use re.findall. Have a look at the re documentation on capturing groups, i.e. the (...) syntax, and on the match object's group method.
import PyPDF2, re
filename = ''  # path to your PDF file
pdf_r = PyPDF2.PdfFileReader(open(filename, 'rb'))
text = pdf_r.getPage(0).extractText()  # from 1st page or make a loop
if p := re.search(r'your number is (\d{3})', text):
    my_number = int(p.group(1))  # as int
Alternatively, use PyPDF4: the syntax is the same and it doesn't "have" this extractText issue.
From the doc of extractText:
This works well for some PDF files, but poorly for others, depending on the generator used. [...] Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
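If you would rather work from the span you already have, here is a minimal sketch, assuming the page text has already been extracted as in the question. The end of the match (the second value of the span) marks where to start reading the number. The function name number_after and the sample string are made up for illustration.

```python
import re

def number_after(text, marker="your number is"):
    """Return the integer that directly follows `marker`, or None if absent."""
    found = re.search(re.escape(marker), text)
    if found is None:
        return None
    # found.span() is the (start, end) pair from the question; read on from `end`.
    tail = text[found.end():]
    digits = re.match(r"\s*(\d+)", tail)
    return int(digits.group(1)) if digits else None

sample = "Dear customer, your number is 105. Please keep it."
print(number_after(sample))
```

Returning None instead of raising keeps a daily batch job running when a file does not contain the marker.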

Related

How to modify a word doc with Python? [duplicate]

The oodocx module mentioned in the same page refers the user to an /examples folder that does not seem to be there.
I have read the documentation of python-docx 0.7.2, plus everything I could find in Stackoverflow on the subject, so please believe that I have done my “homework”.
Python is the only language I know (beginner+, maybe intermediate), so please do not assume any knowledge of C, Unix, xml, etc.
Task : Open a ms-word 2007+ document with a single line of text in it (to keep things simple) and replace any “key” word in Dictionary that occurs in that line of text with its dictionary value. Then close the document keeping everything else the same.
Line of text (for example) “We shall linger in the chambers of the sea.”
from docx import Document
document = Document('/Users/umityalcin/Desktop/Test.docx')
Dictionary = {'sea': 'ocean'}
sections = document.sections
for section in sections:
    print(section.start_type)
#Now, I would like to navigate, focus on, get to, whatever to the section that has my
#single line of text and execute a find/replace using the dictionary above.
#then save the document in the usual way.
document.save('/Users/umityalcin/Desktop/Test.docx')
I am not seeing anything in the documentation that allows me to do this—maybe it is there but I don’t get it because everything is not spelled-out at my level.
I have followed other suggestions on this site and have tried to use earlier versions of the module (https://github.com/mikemaccana/python-docx) that is supposed to have "methods like replace, advReplace" as follows: I open the source-code in the python interpreter, and add the following at the end (this is to avoid clashes with the already installed version 0.7.2):
document = opendocx('/Users/umityalcin/Desktop/Test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    if word in Dictionary.keys():
        print "found it", Dictionary[word]
        document = replace(document, word, Dictionary[word])
savedocx(document, coreprops, appprops, contenttypes, websettings,
    wordrelationships, output, imagefiledict=None)
Running this produces the following error message:
NameError: name 'coreprops' is not defined
Maybe I am trying to do something that cannot be done—but I would appreciate your help if I am missing something simple.
If this matters, I am using the 64 bit version of Enthought's Canopy on OSX 10.9.3

UPDATE: There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.
This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.
This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurrence of "foobar" in the text or perhaps making it bold or appear in a larger font.
The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.
Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)
for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        paragraph.text = 'new text containing ocean'
To search in Tables as well, you would need to use something like:
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    paragraph.text = paragraph.text.replace("sea", "ocean")
If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.
By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.

I needed something to replace regular expressions in docx.
I took scanny's answer.
To handle style I used the answer from:
Python docx Replace string in paragraph while keeping style
I added a recursive call to handle nested tables
and came up with something like this:
import re
from docx import Document
def docx_replace_regex(doc_obj, regex, replace):
    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex, replace)
regex1 = re.compile(r"your regex")
replace1 = r"your replace string"
filename = "test.docx"
doc = Document(filename)
docx_replace_regex(doc, regex1, replace1)
doc.save('result1.docx')
To iterate over dictionary:
for word, replacement in dictionary.items():
    word_re = re.compile(word)
    docx_replace_regex(doc, word_re, replacement)
Note that this solution replaces a match only if the whole match has the same style in the document, i.e. lies within a single run.
Also, if the text was edited after saving, text with the same style might end up in separate runs.
For example, if you open a document that contains the string "testabcd", change it to "test1abcd" and save, then even though it is all the same style there are 3 separate runs, "test", "1" and "abcd". In this case replacing test1 won't work.
This is for tracking changes in the document. To merge it into one run, in Word you need to go to "Options", "Trust Center" and in "Privacy Options" untick "Store random numbers to improve combine accuracy" and save the document.
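To make the split-run problem concrete, here is a rough sketch that works on a plain list of strings standing in for [run.text for run in paragraph.runs]. It is not python-docx code; the function name replace_across_runs is made up. The idea is to find the match in the joined text, put the replacement into the first run the match touches, and cut the matched characters out of every later run.

```python
def replace_across_runs(run_texts, key, value):
    """Replace `key` with `value` even when `key` straddles several runs.

    `run_texts` is a list of strings, one per run. A new list of the same
    length is returned, so each string can be written back to its run.
    """
    texts = list(run_texts)
    if not key:
        return texts
    search_from = 0
    while True:
        full = "".join(texts)
        start = full.find(key, search_from)
        if start == -1:
            return texts
        end = start + len(key)
        pos = 0
        inserted = False
        for i, t in enumerate(texts):
            run_start, run_end = pos, pos + len(t)
            pos = run_end
            lo, hi = max(start, run_start), min(end, run_end)
            if lo >= hi:
                continue  # this run does not overlap the match
            # The first overlapping run receives the replacement text;
            # later runs only lose their part of the match.
            piece = "" if inserted else value
            texts[i] = t[:lo - run_start] + piece + t[hi - run_start:]
            inserted = True
        # Continue after the inserted value so it is never matched again.
        search_from = start + len(value)

print(replace_across_runs(["test", "1", "abcd"], "test1", "X"))
# prints ['X', '', 'abcd']
```

Because the list keeps its length, the result can be assigned back run by run, which leaves every run's own formatting untouched. The replacement simply takes on the style of the first run it touched.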

Sharing a small script I wrote. It helps me generate legal .docx contracts with variables while preserving the original style.
pip install python-docx
Example:
from docx import Document
import os
def main():
    template_file_path = 'employment_agreement_template.docx'
    output_file_path = 'result.docx'
    variables = {
        "${EMPLOEE_NAME}": "Example Name",
        "${EMPLOEE_TITLE}": "Software Engineer",
        "${EMPLOEE_ID}": "302929393",
        "${EMPLOEE_ADDRESS}": "דרך השלום מנחם בגין דוגמא",
        "${EMPLOEE_PHONE}": "+972-5056000000",
        "${EMPLOEE_EMAIL}": "example@example.com",
        "${START_DATE}": "03 Jan, 2021",
        "${SALARY}": "10,000",
        "${SALARY_30}": "3,000",
        "${SALARY_70}": "7,000",
    }
    template_document = Document(template_file_path)
    for variable_key, variable_value in variables.items():
        for paragraph in template_document.paragraphs:
            replace_text_in_paragraph(paragraph, variable_key, variable_value)
        for table in template_document.tables:
            for col in table.columns:
                for cell in col.cells:
                    for paragraph in cell.paragraphs:
                        replace_text_in_paragraph(paragraph, variable_key, variable_value)
    template_document.save(output_file_path)
def replace_text_in_paragraph(paragraph, key, value):
    if key in paragraph.text:
        inline = paragraph.runs
        for item in inline:
            if key in item.text:
                item.text = item.text.replace(key, value)
if __name__ == '__main__':
    main()

I got much help from the earlier answers, but for me the code below works the way the simple find-and-replace function in Word would. Hope this helps.
#!pip install python-docx
#start from here if python-docx is installed
from docx import Document
#open the document
doc=Document('./test.docx')
Dictionary = {"sea": "ocean", "find_this_text":"new_text"}
for i in Dictionary:
    for p in doc.paragraphs:
        if p.text.find(i) >= 0:
            p.text = p.text.replace(i, Dictionary[i])
#save changed document
doc.save('./test.docx')
The above solution has limitations: 1) the paragraph containing "find_this_text" will become plain text without any formatting, 2) content controls that are in the same paragraph as "find_this_text" will be deleted, and 3) "find_this_text" inside either content controls or tables will not be changed.

For the table case, I had to modify @scanny's answer to:
for table in doc.tables:
    for col in table.columns:
        for cell in col.cells:
            for p in cell.paragraphs:
to make it work. Indeed, this does not seem to work with the current state of the API:
for table in document.tables:
    for cell in table.cells:
Same problem with the code from here: https://github.com/python-openxml/python-docx/issues/30#issuecomment-38658149

The Office Dev Centre has an entry in which a developer has published (MIT licenced at this time) a description of a couple of algorithms that appear to suggest a solution for this (albeit in C#, so they require porting): MS Dev Centre posting

The library python-docx-template is pretty useful for this. It's perfect to edit Word documents and save them back to .docx format.

The problem with your second attempt is that you haven't defined the parameters that savedocx needs. You need to do something like this before you save:
relationships = docx.relationshiplist()
title = "Document Title"
subject = "Document Subject"
creator = "Document Creator"
keywords = []
coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
keywords=keywords)
app = docx.appproperties()
content = docx.contenttypes()
web = docx.websettings()
word = docx.wordrelationships(relationships)
output = r"path\to\where\you\want\to\save"

He changed the API in docx py again...
For the sanity of everyone coming here:
import datetime
import os
from decimal import Decimal
from typing import NamedTuple
from docx import Document
from docx.document import Document as nDocument
class DocxInvoiceArg(NamedTuple):
    invoice_to: str
    date_from: str
    date_to: str
    project_name: str
    quantity: float
    hourly: int
    currency: str
    bank_details: str
class DocxService():
    tokens = [
        '#INVOICE_TO#',
        '#IDATE_FROM#',
        '#IDATE_TO#',
        '#INVOICE_NR#',
        '#PROJECTNAME#',
        '#QUANTITY#',
        '#HOURLY#',
        '#CURRENCY#',
        '#TOTAL#',
        '#BANK_DETAILS#',
    ]
    def __init__(self, replace_vals: DocxInvoiceArg):
        total = replace_vals.quantity * replace_vals.hourly
        invoice_nr = replace_vals.project_name + datetime.datetime.strptime(replace_vals.date_to, '%Y-%m-%d').strftime('%Y%m%d')
        self.replace_vals = [
            {'search': self.tokens[0], 'replace': replace_vals.invoice_to },
            {'search': self.tokens[1], 'replace': replace_vals.date_from },
            {'search': self.tokens[2], 'replace': replace_vals.date_to },
            {'search': self.tokens[3], 'replace': invoice_nr },
            {'search': self.tokens[4], 'replace': replace_vals.project_name },
            {'search': self.tokens[5], 'replace': replace_vals.quantity },
            {'search': self.tokens[6], 'replace': replace_vals.hourly },
            {'search': self.tokens[7], 'replace': replace_vals.currency },
            {'search': self.tokens[8], 'replace': total },
            {'search': self.tokens[9], 'replace': 'asdfasdfasdfdasf'},
        ]
        self.doc_path_template = os.path.dirname(os.path.realpath(__file__))+'/docs/'
        self.doc_path_output = self.doc_path_template + 'output/'
        self.document: nDocument = Document(self.doc_path_template + 'invoice_placeholder.docx')
    def save(self):
        for p in self.document.paragraphs:
            self._docx_replace_text(p)
        tables = self.document.tables
        self._loop_tables(tables)
        self.document.save(self.doc_path_output + 'testiboi3.docx')
    def _loop_tables(self, tables):
        for table in tables:
            for index, row in enumerate(table.rows):
                for cell in table.row_cells(index):
                    if cell.tables:
                        self._loop_tables(cell.tables)
                    for p in cell.paragraphs:
                        self._docx_replace_text(p)
                    # for cells in column.
                    # for cell in table.columns:
    def _docx_replace_text(self, p):
        print(p.text)
        for el in self.replace_vals:
            if (el['search'] in p.text):
                inline = p.runs
                # Loop added to work with runs (strings with same style)
                for i in range(len(inline)):
                    print(inline[i].text)
                    if el['search'] in inline[i].text:
                        text = inline[i].text.replace(el['search'], str(el['replace']))
                        inline[i].text = text
                print(p.text)
Test case:
from django.test import SimpleTestCase
from docx.table import Table, _Rows
from toggleapi.services.DocxService import DocxService, DocxInvoiceArg
class TestDocxService(SimpleTestCase):
    def test_document_read(self):
        ds = DocxService(DocxInvoiceArg(invoice_to="""
WAW test1
Multi myfriend
""",date_from="2019-08-01", date_to="2019-08-30", project_name='WAW', quantity=10.5, hourly=40, currency='USD',bank_details="""
Paypal to:
bippo@bippsi.com"""))
        ds.save()
Have the folders docs and docs/output/ in the same folder where you have DocxService.py.
Be sure to parameterize and replace stuff.

As some users above have pointed out, one of the challenges in finding and replacing text in a Word document is retaining styles when the word spans multiple runs. This can happen if the word has several styles or if it was edited multiple times while the document was being created. Code that assumes a word lies completely within a single run is therefore often wrong, so the python-docx based code shared above may not work in many scenarios.
You can try the following API
https://rapidapi.com/more.sense.tech#gmail.com/api/document-filter1
This has generic code to deal with these scenarios. The API currently handles only paragraph text. Tabular text is not yet supported, and I will try that soon.

import docx2txt as d2t
from docx import Document
from docx.text.paragraph import Paragraph
document = Document()
all_text = d2t.process("mydata.docx")
# print(all_text)
words = ["hey", "wow"]
for word in words:
    all_text = all_text.replace(word, "your word variable")
document.add_paragraph(all_text + "\n")
print(all_text)
document.save('data.docx')

Extract parts of text (html) file based on characters before & after with python

I am trying to build a script that will extract specific parts (namely the link & its related description) out of an html file and return the result per line.
I'm trying to build it using lists in Python, yet I'm making a mistake somehow!
This is what I've done so far, but my values list comes back blank:
import re
def subtext (data, first_link, last_link, first_descr, last_descr):
    values = []
    link = re.search('''"first_link"(.+?)"last_link"''', data)
    values.append(link)
    descr = re.search('''"first_descr"(.+?)"last_descr"''', data)
    values.append(descr)
    while values:
        print(values)
html_file = input ("Type filepath: ")
html_code = open (html_file, "r")
html_data = html_code.read()
subtext (html_data, '''11px;">''', '''</td><td style="font-''')
html_code.close()

There is an HTML parser for Python. But if you want to use your own code, you need to fix these mistakes:
link = re.search('''"first_link"(.+?)"last_link"''', data)
values.append(link)
First of all, your regex searches for the literal strings "first_link" and "last_link" instead of the values of the function arguments. Use .format to build the pattern string from the args.
Also, in the code above link will be a re.Match object (or None), not a string. Use group() to pull the string out of the object; just make sure that it found something first. Same story with the next re.search.
while values:
    print(values)
Here you will get into an infinite loop of prints. Simply do print(values) without any loop.
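Putting those fixes together, a minimal sketch could look like this. It assumes a single pair of delimiters per call, so the signature is simplified compared with the question, and the HTML snippet is made up. re.escape is an addition of mine so that characters such as "." or "(" in the delimiters are treated literally.

```python
import re

def subtext(data, first, last):
    """Return the text between the delimiters `first` and `last`, or None."""
    # Build the pattern from the argument values, not from their names.
    pattern = "{}(.+?){}".format(re.escape(first), re.escape(last))
    found = re.search(pattern, data)
    # re.search returns None when nothing matches, so check before group().
    return found.group(1) if found else None

html_data = '<td style="font-size:11px;">Example link</td><td style="font-weight:bold">'
print(subtext(html_data, '11px;">', '</td><td style="font-'))
# prints Example link
```

Calling it once for the link delimiters and once for the description delimiters gives the two values the question wanted to collect.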

Adding Image at the very beginning of an already existing docx document [duplicate]

I use python-docx to generate a Microsoft Word document. The user wants to be able to write, for example, "Good Morning every body,This is my %(profile_img)s do you like it?"
in an HTML field. I then create a Word document, retrieve the user's picture from the database, and replace the keyword %(profile_img)s with the picture of the user, NOT at the END OF THE DOCUMENT. With python-docx we use this instruction to add a picture:
document.add_picture('profile_img.png', width=Inches(1.25))
The picture is added to the document, but the problem is that it is added at the end of the document.
Is it impossible to add a picture at a specific position in a Microsoft Word document with Python? I've not found any answers to this on the net, but I have seen people asking the same elsewhere with no solution.
Thanks (note: I'm not a hugely experienced programmer, and other than this awkward part the rest of my code will be very basic)

Quoting the python-docx documentation:
The Document.add_picture() method adds a specified picture to the end of the document in a paragraph of its own. However, by digging a little deeper into the API you can place text on either side of the picture in its paragraph, or both.
When we "dig a little deeper", we discover the Run.add_picture() API.
Here is an example of its use:
from docx import Document
from docx.shared import Inches
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
r.add_picture('/tmp/foo.jpg')
r.add_text(' do you like it?')
document.save('demo.docx')

Well, I don't know if this will apply to you, but here is what I've done to place an image at a specific spot in a docx document:
I created a base docx document (template document). In this file, I've inserted some tables without borders, to be used as placeholders for images. When creating the document, first I open the template, and update the file creating the images inside the tables. So the code itself is not much different from your original code, the only difference is that I'm creating the paragraph and image inside a specific table.
from docx import Document
from docx.shared import Inches
doc = Document('addImage.docx')
tables = doc.tables
p = tables[0].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('resized.png',width=Inches(4.0), height=Inches(.7))
p = tables[1].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('teste.png',width=Inches(4.0), height=Inches(.7))
doc.save('addImage.docx')

Here's my solution. Its advantage over the first answer is that it surrounds the picture with a title (with style Heading 1) and a section for additional comments. Note that you have to do the insertions in the reverse of the order in which they appear in the Word document.
This snippet is particularly useful if you want to programmatically insert pictures in an existing document.
from docx import Document
from docx.shared import Inches
# ------- initial code -------
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
picPath = 'D:/Development/Python/aa.png'
r.add_picture(picPath)
r.add_text(' do you like it?')
document.save('demo.docx')
# ------- improved code -------
document = Document()
p = document.add_paragraph('Picture bullet section', 'List Bullet')
p = p.insert_paragraph_before('')
r = p.add_run()
r.add_picture(picPath)
p = p.insert_paragraph_before('My picture title', 'Heading 1')
document.save('demo_better.docx')

This adapts the answer written by Robᵩ while allowing more flexible input from the user.
My assumption is that the HTML field mentioned by Kais Dkhili (original enquirer) is already loaded in docx.Document(). So...
Identify where the related HTML text is in the document.
import re  # regex module
from docx.shared import Inches
img_tag = re.compile(r'%\(profile_img\)s')  # declare pattern
for _p in document.paragraphs:
    # search, not match: the placeholder need not be at the start of the paragraph
    if img_tag.search(_p.text):
        img_paragraph = _p
        # if and only if; suggesting img_paragraph a list and
        # use append method instead for full document search
        break  # lose the break if want full document search
Replace the placeholder identified as img_tag = '%(profile_img)s' with the desired image.
The following code assumes the text contains only a single run.
It may be changed accordingly if that is not the case.
temp_text = img_tag.split(img_paragraph.text)
img_paragraph.runs[0].text = temp_text[0]
_r = img_paragraph.add_run()
_r.add_picture('profile_img.png', width = Inches(1.25))
img_paragraph.add_run(temp_text[1])
And done. Call document.save() once it is finalised.
In case you are wondering what to expect from the temp_text...
[In]
img_tag.split(img_paragraph.text)
[Out]
['This is my ', ' do you like it?']

I spent a few hours on this. If you need to add images to a template doc file using Python, the best solution is to use the python-docx-template library.
Documentation is available here
Examples available in here

This is a variation on a theme. Letting I be the paragraph number in the specific document, then:
from docx.shared import Cm
p = doc.paragraphs[I].insert_paragraph_before('\n')
p.add_run().add_picture('Fig01.png', width=Cm(15))

Filter sentences based on search words using python?

I got a text file containing many sentences. I want to query the text file with search words and return those sentences that contain the query words.
Effort so far:
h = input("Enter search word: ")
with open("file.txt") as openfile:
    for line in openfile:
        for part in line.split():
            if h in part:
                print (part)
file.txt contains these sentences
On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.
You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks.
When you create pictures, charts, or diagrams, they also coordinate with your current document look.
You can easily change the formatting of selected text in the document text by choosing a look for the selected text from the Quick Styles gallery on the Home tab.
You can also format text directly by using the other controls on the Home tab.
Most controls offer a choice of using the look from the current theme or using a format that you specify directly.
To change the overall look of your document, choose new Theme elements on the Page Layout tab. To change the looks available in the Quick Style gallery, use the Change Current Quick Style Set command.
Both the Themes gallery and the Quick Styles gallery provide reset commands so that you can always restore the look of your document to the original contained in your current template.
Output: for the search word 'galleries' it prints 'galleries' twice, but I need it to return the sentences.
How can I search for multiple words and return the sentences containing that combination (not necessarily an n-gram or in order)? For example, if I type 'overall' as one word and 'Layout' as another search word, it should return the following sentence. Search words are case insensitive.
To change the overall look of your document, choose new Theme elements on the Page Layout tab.
HELP!

This is for multiple search words:
myfile = "queryfile.txt"
search_wordlist = input("Enter search words, separated by a comma\n")
mylist = search_wordlist.split(",")
with open(myfile) as openfile:
    for line in openfile:
        for term in mylist:
            if term in line:
                print(line)
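The question also asks for sentences (not lines) that contain all of the search words, ignoring case. Here is a rough sketch of that variant. The function name matching_sentences is made up, and the sentence splitter is deliberately naive: it breaks after ".", "!" or "?" followed by whitespace, so abbreviations would confuse it.

```python
import re

def matching_sentences(text, search_words):
    """Return the sentences of `text` that contain every search word (case-insensitive)."""
    # Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = [w.strip().lower() for w in search_words if w.strip()]
    return [s for s in sentences if all(w in s.lower() for w in wanted)]

text = (
    "To change the overall look of your document, choose new Theme elements "
    "on the Page Layout tab. To change the looks available in the Quick Style "
    "gallery, use the Change Current Quick Style Set command."
)
for sentence in matching_sentences(text, ["overall", "Layout"]):
    print(sentence)
```

With the sample text only the first sentence is printed, because the second one contains neither search word. Replacing all() with any() gives "at least one word" behaviour instead.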

Split word document by regex, and then group like headings into their own objects

I have a docx, which I read into jupyter like so:
### Import libraries
import docx2txt
import os
import re
import pandas
import docx
### Read document
file_text = docx2txt.process("big_document.docx")
In this document, there are multiple pages with the same headers. I want to search for these headers, and then group all like headers into their own objects. In the following chunk, the first thirty pages of my document all have the same header, EXAMPLE ONE (it's not in a header format, just the one unique identifying string on each page that matches the other 29 pages):
### Loop to get appropriate sections, according to the re.findall()
for i in range(0, 30):
    match = re.findall(r'EXAMPLE\sONE', file_text)
    print(match[i])
The re.findall() finds every instance of EXAMPLE ONE, but it only returns those two words 30 times. If I sub in re.split(), and set the range accordingly, it returns the whole document (several hundred pages).
### Loop to get appropriate sections, according to the re.split()
for i in range(0, 30):
    match = re.split(r'EXAMPLE\sONE', file_text)
    print(match[i])
    # still returns whole document, instead of just the 30 pages with the chosen header
How do I set the code so it only returns the pages with the appropriate headers, and only those pages? I think re.split() is my tool, but I can't make it work.
The document has multiple headers, going up to EXAMPLE SEVEN, and I was going to make a for loop for each, and return an object. Thanks

I do not think you will be able to get the matching page for a given header since, if I'm not wrong, docx won't return an 'end-of-page' character that would let you mark the end of the content you want.
What you could do, however, is to use a regex like this to get all the content before a certain header:
match = re.search('^((.|\n)+)EXAMPLE\nTWO', file_text, flags=re.MULTILINE)
print(match.group(1))
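If the goal is one object per header, another option is to let re.split keep the headers by wrapping the pattern in a capturing group, and then pair each header with the text that follows it. This is only a sketch on a made-up sample string. The function name group_by_header is hypothetical, and it assumes every header looks like EXAMPLE followed by one upper-case word.

```python
import re
from collections import defaultdict

def group_by_header(text, header_pattern=r"EXAMPLE\s[A-Z]+"):
    """Map each header string to the list of text chunks that follow it."""
    # With a capturing group, re.split keeps the matched headers:
    # [text_before_first_header, header1, body1, header2, body2, ...]
    # (header_pattern itself must not contain capturing groups).
    parts = re.split("({})".format(header_pattern), text)
    groups = defaultdict(list)
    for header, body in zip(parts[1::2], parts[2::2]):
        groups[header].append(body.strip())
    return dict(groups)

sample = "EXAMPLE ONE\npage 1 text\nEXAMPLE ONE\npage 2 text\nEXAMPLE TWO\npage 31 text"
groups = group_by_header(sample)
print(groups["EXAMPLE ONE"])
# prints ['page 1 text', 'page 2 text']
```

Each value is a list with one entry per page that carried the header, so the thirty EXAMPLE ONE pages would end up together under one key.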

from docx2python import docx2python
from docx2python.iterators import iter_paragraphs
from collections import defaultdict
import re
text = docx2python('path_to_file.docx')
groups = defaultdict(list)
for par in iter_paragraphs(text.document):
    header = re.search(r'EXAMPLE\s[A-Z]+', par)
    if header:
        open_group = groups[header.group()]
    open_group.append(par)
