Is there a method to read PDF files line by line? - Python

I have a PDF file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the PDF file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast, but it does not preserve the gaps between elements. Results are jumbled.
pdfminer - PDFPageInterpreter with LAParams. This works well but is slow: at least 2 seconds per page, and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser

# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(), filetypes=[("PDF files", "*.pdf")])
print(file_path)  # print that path
file_data = parser.from_file(file_path)  # parse data from the file
text = file_data['content']  # get the file's text content
# split the document into pages by a string that always appears at the top of each page
by_page = text.split('... Information')
for i in range(1, len(by_page)):  # loop page by page
    info = by_page[i]  # get one page worth of data from the pdf
    reformatted = info.replace("\n", "&")  # replace the newlines with "&" to make it more readable
    print("Page: ", i)  # print page number
    print(reformatted, "\n\n")  # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the PDF to be read left to right. Also, a pure Python solution would be a bonus. I don't want my end users to be forced to install Java (I think the tika and tabula-py methods depend on Java).
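One possible way to get the left-to-right, top-to-bottom order described above, sketched under assumptions: if an extractor can report each word together with its position, the visual lines can be rebuilt by grouping words with a similar vertical position and sorting each group by its horizontal position. The sketch below is plain Python. It assumes word tuples shaped like (x0, y0, x1, y1, text) with y growing downwards; the function name group_words_into_lines and the y_tolerance parameter are hypothetical names introduced for illustration, and the sample coordinates are made up.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Rebuild visual lines from (x0, y0, x1, y1, text) word boxes.

    Assumption: y grows downwards (top of the page has the smallest y).
    """
    lines = []  # each entry: [reference_y, [(x0, text), ...]]
    # Visit words from top to bottom so each word only needs to be
    # compared with the most recently started line.
    for x0, y0, x1, y1, text in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(y0 - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x0, text))
        else:
            lines.append([y0, [(x0, text)]])
    # Within each line, order the words left to right.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


# Made-up sample: two columns whose words sit on (almost) the same rows.
sample_words = [
    (300, 100, 340, 112, "right"),
    (50, 101, 90, 113, "left"),
    (50, 130, 90, 142, "second"),
    (300, 129, 340, 141, "row"),
]
for line in group_words_into_lines(sample_words):
    print(line)
```

The tolerance decides how far apart two words may sit vertically and still count as the same line, so it would need tuning to the font size of the actual document.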

I did this for .docx with the code below, where txt is the text of the .docx. Hope this helps.
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)

Related

How can I extract the outline of a PDF using Python?

I wish to read the outline of a paper in .pdf format. The expected output is a list of section titles like ['abstract', 'Introduction', ...]. The section titles can be identified by the following characteristics: 1) bold and larger font size, 2) all nouns starting with capital letters, and 3) appearing immediately after a line break \n.
The solutions I have tried include:
pypdf2 with reader.outline
reader = PyPDF2.PdfReader('path/to/my/pdf')
print(reader.outline)
pymupdf with doc.get_toc()
doc = fitz.open('path/to/my/pdf')
toc = doc.get_toc()
However, both give me an empty list.
I am currently using the re library to extract the section titles, but the results include additional information such as references and table contents.
import re
re.findall(r'(\[turnpage\]|\n)([A-Z][^.]*?)(\n[A-Z0-9][^\s]*?)', text)
For a clearer understanding of the results produced by the code, please refer to this link
If reader.outline by pypdf gives an empty result, there is no outline specified as metadata.
There can still be an outline specified as normal text. However, detecting / parsing that would require custom work on your side. You can use the text extraction as a basis:
https://pypdf.readthedocs.io/en/latest/user/extract-text.html
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
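As a rough sketch of what that custom work might look like, the snippet below picks out bold text that is larger than the body font, matching the first characteristic listed in the question. It is plain Python and assumes the text has already been extracted into span dictionaries with "text", "size" and "flags" keys, similar to the spans PyMuPDF reports from page.get_text("dict"). The function name find_headings, the bold bit value and the sample spans are assumptions made for illustration.

```python
from collections import Counter

BOLD_FLAG = 16  # assumption: this bit of a span's "flags" marks bold text


def find_headings(spans):
    """Return the text of spans that are bold and larger than the body font."""
    if not spans:
        return []
    # Treat the font size that covers the most characters as the body size.
    chars_per_size = Counter()
    for span in spans:
        chars_per_size[round(span["size"], 1)] += len(span["text"])
    body_size = chars_per_size.most_common(1)[0][0]
    return [
        span["text"].strip()
        for span in spans
        if span["size"] > body_size
        and span["flags"] & BOLD_FLAG
        and span["text"].strip()
    ]


# Made-up sample spans for illustration.
sample_spans = [
    {"text": "Abstract", "size": 14.0, "flags": 16},
    {"text": "We study text extraction from PDF files in detail.", "size": 10.0, "flags": 0},
    {"text": "Introduction", "size": 14.0, "flags": 20},
    {"text": "Bold body words", "size": 10.0, "flags": 16},
]
print(find_headings(sample_spans))
```

References and table captions set in the body size are skipped by the size check, which is the kind of noise the regex approach picked up.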

Unable to separate the passages, as no separation character is being displayed

I am trying to implement passage retrieval on PDF files. For easy navigation I want to include the page number and the passage that the result belongs to (mostly the passage number), like below:
query: "some query was asked"
results: "one result was displayed"
file_name: "name of file"
source: Page_no-2, passage_no:3
I have a couple of PDF files where we can separate the passages based on some recognizable patterns. However, I am facing a challenge with some PDF files where no proper pattern can be found.
When I open the PDF in Chrome there are line gaps between the passages, but when I read it with fitz (PyMuPDF), no line gap is displayed and every line and every passage is separated by just one "\n".
I tried pdfminer, pdftotext, and other libraries, but no luck.
My code:
import fitz
from more_itertools import *
doc = fitz.open('IT_past.pdf')
single_doc = doc.load_page(0) # put here the page number
text=single_doc.get_text('text', sort=True)
text
Result:
screenshot of the page
-Full pdf
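One idea that might help here, sketched under assumptions: since the visible gaps are vertical whitespace rather than characters, the passages could be recovered from the line positions instead of from the text. The plain-Python sketch below assumes each line is available as a (y_top, text) pair with y growing downwards, which would have to come from a coordinate-aware extraction mode rather than the plain 'text' output. The function name split_passages, the gap_factor parameter and the sample data are hypothetical.

```python
from statistics import median


def split_passages(lines, gap_factor=1.5):
    """Group (y_top, text) lines into passages using the line pitch.

    A new passage starts when the distance to the previous line is larger
    than gap_factor times the median distance between consecutive lines.
    """
    lines = sorted(lines, key=lambda line: line[0])
    if not lines:
        return []
    pitches = [cur[0] - prev[0] for prev, cur in zip(lines, lines[1:])]
    typical = median(pitches) if pitches else 0.0
    passages = [[lines[0][1]]]
    for pitch, (_, text) in zip(pitches, lines[1:]):
        if pitch > gap_factor * typical:
            passages.append([text])  # unusually large jump: new passage
        else:
            passages[-1].append(text)
    return [" ".join(parts) for parts in passages]


# Made-up sample: 12-unit line pitch, with one 24-unit jump between passages.
sample_lines = [
    (100, "First passage, line one."),
    (112, "First passage, line two."),
    (136, "Second passage, line one."),
    (148, "Second passage, line two."),
    (160, "Second passage, line three."),
]
for passage_no, passage in enumerate(split_passages(sample_lines), start=1):
    print(passage_no, passage)
```

The index from enumerate gives the passage number wanted for the source field, and the page number is whatever page the lines were read from.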

How can I extract unformatted, table-like text from PDFs using Python?

I have a scenario where I have PDFs with a letterhead and a table-like body of text. I have tried using pdfminer, but I'm struggling to figure out how to approach my problem.
An example of the format of one of my PDFs
Specifically, pdfminer reads the data starting from the letterhead up until the table header. It then reads the table header in a row-like fashion, from left to right. From there it's just beyond messy.
Here is the Python code that converts the PDF to text:
from pdfminer.high_level import extract_text

text = extract_text('./quote2.pdf')
print(text)
# write the extracted text to a file; the with block closes the file afterwards
with open("results2.txt", "w") as f:
    f.write(text)
And here is a snippet of what the output looks like:
... letter head info
ITEM #
DESCRIPTION
561347
55 PCs-792.00 LB
6061-T651 PLATE AMS 4027
4 S/C 6" SQUARE
CUTTING PLATE SAW ALUM
PACKAGING SKIDDING
SHIP VIA : OUR TRUCK
Quotation
DATE:
CUSTOMER NUMBER:
QUOTE NUMBER:
FOB:
4/1/2022
319486
957242
Destination
SHIP TO:
The idea was to use regex to extract the relevant numbers. As you can see, it read the first 2 records for the columns ITEM and DESCRIPTION, but from there it starts back up from the letterhead, and it's even messier below.
Is there perhaps a way to separate the letterhead from the rest of the body as a starting step? I'm very new to Python and not sure how to get what I want. Help much appreciated!
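A possible starting step, sketched under assumptions: if the words can be extracted with their positions instead of as one flat string, the letterhead can be cut off with a vertical threshold and the remaining words can be arranged into rows and columns. The plain-Python sketch below assumes (x0, y0, text) word tuples with y growing downwards. The function name table_rows, the body_top and column_starts parameters, and all coordinates in the sample are hypothetical; only the sample strings are taken from the output above.

```python
from bisect import bisect_right


def table_rows(words, body_top, column_starts, y_tolerance=3.0):
    """Drop everything above body_top, then arrange the rest into rows.

    words: (x0, y0, text) tuples, y growing downwards.
    column_starts: sorted x positions at which each column begins.
    """
    body = sorted((w for w in words if w[1] >= body_top), key=lambda w: (w[1], w[0]))
    rows = []  # each entry: (reference_y, [(x0, text), ...])
    for x0, y0, text in body:
        if not rows or abs(y0 - rows[-1][0]) > y_tolerance:
            rows.append((y0, []))
        rows[-1][1].append((x0, text))
    table = []
    for _, items in rows:
        cells = [[] for _ in column_starts]
        for x0, text in sorted(items):
            # Index of the last column that starts at or left of the word.
            col = max(bisect_right(column_starts, x0) - 1, 0)
            cells[col].append(text)
        table.append([" ".join(cell) for cell in cells])
    return table


# Sample words; the coordinates are invented for illustration.
sample_words = [
    (40, 20, "Quotation"),  # letterhead, above body_top
    (40, 200, "ITEM"), (70, 200, "#"), (150, 201, "DESCRIPTION"),
    (40, 220, "561347"), (150, 220, "55"), (170, 221, "PCs-792.00"), (230, 220, "LB"),
]
for row in table_rows(sample_words, body_top=190, column_starts=[40, 150]):
    print(row)
```

With rows in this shape, a regex only has to look at one cell at a time instead of at the interleaved text.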

How to erase text from PDF using Python

I'm creating a python script to edit text from PDFs.
I have this Python code which allows me to add text into specific positions of a PDF file.
import io

import PyPDF2
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

packet = io.BytesIO()
# create a new PDF with ReportLab
can = canvas.Canvas(packet, pagesize=letter)
# insert text at a specific position
can.drawString(300, 115, "Hello world")
can.save()
# move to the beginning of the BytesIO buffer
packet.seek(0)
new_pdf = PyPDF2.PdfFileReader(packet)
# read your existing PDF
existing_pdf = PyPDF2.PdfFileReader(open("original.pdf", "rb"))
num_pages = existing_pdf.getNumPages()
output = PyPDF2.PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(num_pages - 1)  # get the last page of the original pdf
page.mergePage(new_pdf.getPage(0))  # merge my created text with my PDF
# add all pages from the original pdf into the output pdf
for n in range(num_pages):
    output.addPage(existing_pdf.getPage(n))
# finally, write "output" to a real file
with open("output.pdf", "wb") as output_stream:
    output.write(output_stream)
My problem: I want to replace the text in a specific position of my original PDF with my custom text. A way of writing blank characters would do the trick but I couldn't find anything that does this.
PS.: It must be Python code because I will need to deploy this as a .exe file later and I only know how to do that using Python code.
A general-purpose algorithm for replacing text in a PDF is a difficult problem. I'm not saying it can't ever be done, because I've demonstrated doing so with the Adobe PDF Library, albeit with a very simple input file with no complications, but I'm not sure that PyPDF2 has the facilities required to do so. In part, just finding the text can be a challenge.
You (or, more realistically, your PDF library) have to parse the page contents and keep track of the changes to the graphics state, specifically changes to the current transformation matrix in case the text is in a Form XObject, changes to the text transformation matrix, and changes to the font. You have to use the font resource to get character widths to figure out where the text cursor may be positioned after inserting a string. You may need to handle standard-14 fonts, which don't contain that information in their font resources (the application, i.e. your program, is expected to know their metrics).
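To give a feel for the character-width bookkeeping, here is a minimal sketch of how the horizontal advance of a string might be computed for simple horizontal text. It assumes glyph widths expressed in thousandths of a text-space unit, scaled by the font size, plus optional character spacing, word spacing (applied to spaces only) and horizontal scaling; it ignores TJ position adjustments and vertical writing. The function name text_advance and the widths in the sample table are illustrative assumptions, not metrics read from a real font resource.

```python
import math


def text_advance(text, widths, font_size, char_spacing=0.0, word_spacing=0.0, h_scale=1.0):
    """Horizontal displacement of the text cursor after showing `text`.

    widths maps each character to its glyph width in 1/1000 units.
    """
    total = 0.0
    for ch in text:
        tx = widths[ch] / 1000.0 * font_size + char_spacing
        if ch == " ":
            tx += word_spacing  # word spacing only applies to the space character
        total += tx * h_scale
    return total


# Illustrative widths (assumed values, in 1/1000 units).
sample_widths = {"H": 722, "e": 556, "l": 222, "o": 556, " ": 278}
advance = text_advance("Hello", sample_widths, font_size=12)
print(round(advance, 3))
```

Text following the edited string would need to be repositioned by the difference between the old and new advance if it is supposed to stay where it was.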
After all that, removing the text is easy if you don't need to break up a Tj or TJ (show text) instruction into different parts. Preventing the text after from shifting, if that's what's desired, may require inserting a new Tm instruction to reposition the text after to where it would have been.
Inserting new text can be challenging. If you want to stay consistent with the font being used and it is embedded and subset, it may not necessarily contain the glyphs you need for your text insertion. And after insertion, you then have to decide whether you need to reflow the text that comes after the text you inserted.
And lastly, you will need your PDF library to save all the changes. Quite frankly, using Adobe Acrobat's Redaction features would likely be a more cost-effective way of doing this than trying to program it from scratch.
If you want to do a poor man's redaction with ReportLab and PyPDF2, you would create your replacement content with ReportLab.
Given a Canvas c, a rectangle (x, y, width, height) indicating an area, a text string, and a point (text_x, text_y) where the text string would be inserted, you would then:
# set the fill color to white
c.setFillColorRGB(1, 1, 1)
# draw a filled rectangle over the area to hide (stroke=0 avoids a black outline)
c.rect(x, y, width, height, stroke=0, fill=1)
# change the fill color back to black
c.setFillColorRGB(0, 0, 0)
# draw the replacement text
c.drawString(text_x, text_y, text_string)
Save the PDF document you've created to a temporary file.
Open this PDF document and the document you want to modify using PyPDF2's PdfFileReader (call the reader for the document to modify srcDoc). Create a PdfFileWriter object and call it modifiedDoc. Get page 0 of the temporary PDF and call it updatePage. Get page n of the other document and call it toModifyPage.
toModifyPage.mergePage(updatePage)
After you are done updating pages:
modifiedDoc.cloneDocumentFromReader(srcDoc)
modifiedDoc.write(outStream)
Again, if you go this route, a user might still see the original text before it gets covered up with the new content, and text extraction would likely pull out both the original and new text for that area, and possibly intermingle it to something unintelligible.

Python PyPDF2 join pages

I have a PDF with a big table split across pages, so I need to join the per-page tables into one big table on a single large page.
Is this possible with PyPDF2 or another library?
Cheers
I'm just working on something similar: it takes an input PDF, and via a config file you can set the final pattern of the single pages.
It is implemented with PyPDF2, but it still has issues with some PDF files (I have to dig deeper).
https://github.com/Lageos/pdf-stitcher
In principle adding a page right to another one works like:
import PyPDF2  # uses the legacy method names from PyPDF2 versions before 3.0

# example page indices (zero-based); adjust to the pages you want to join
page_number = 0
second_page_number = 1

with open('input.pdf', 'rb') as input_file:
    # load input pdf
    input_pdf = PyPDF2.PdfFileReader(input_file)
    # first page: the PyPDF2 PageObject that the second page is merged onto
    output_pdf = input_pdf.getPage(page_number)
    # get second page PyPDF2 PageObject
    second_pdf = input_pdf.getPage(second_page_number)
    # dimensions for offset from loaded page (adding it to the right)
    offset_x = output_pdf.mediaBox[2]
    offset_y = 0
    # add second page to first one
    output_pdf.mergeTranslatedPage(second_pdf, offset_x, offset_y, expand=True)
    # write finished pdf while the input file is still open
    with open('output.pdf', 'wb') as out_file:
        write_pdf = PyPDF2.PdfFileWriter()
        write_pdf.addPage(output_pdf)
        write_pdf.write(out_file)
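Stacking pages vertically needs an offset_y instead. You can get the amount from offset_y = output_pdf.mediaBox[3] (the page height). Note that PDF coordinates have their origin at the bottom left with y increasing upwards, so a positive offset_y places the merged page above the first one.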
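For a table that continues over several pages, the offsets for stacking all pages top to bottom could be computed up front. The sketch below is plain Python and only does the arithmetic; it assumes PDF user space with the origin at the bottom left and y pointing up, so the first page ends up with the largest offset_y. The function name stacking_offsets and the sample page sizes are hypothetical, and the resulting translations would still have to be applied with a merge call such as the one shown above.

```python
def stacking_offsets(page_sizes):
    """Return one (offset_x, offset_y) per page so pages stack top to bottom.

    page_sizes: list of (width, height) in PDF units, in reading order.
    Assumption: origin at the bottom left, y increasing upwards.
    """
    total_height = sum(height for _, height in page_sizes)
    offsets = []
    used = 0
    for _, height in page_sizes:
        used += height
        # The bottom edge of this page sits `used` units below the top.
        offsets.append((0, total_height - used))
    return offsets


# Made-up sample: two full pages followed by a shorter last page.
sample_sizes = [(595, 842), (595, 842), (595, 400)]
print(stacking_offsets(sample_sizes))
```

The combined page would need a height equal to the sum of the page heights and a width equal to the widest page.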
My understanding is that this is quite hard. See here and here.
The problem seems to be that tables aren't very well represented in pdfs but are simply made from absolutely positioned lines (see first link above).
Here are two possible workarounds (not sure if they will do it for you):
You could print multiple pages on one page and scale the page to make it readable.
Open the PDF with Inkscape or something similar. Once ungrouped, you should have access to the individual elements that make up the tables and be able to combine them in the way that suits you.
EDIT
Have a look at LibreOffice Draw, another vector package. I just opened a PDF in it, and it seems to preserve some of the PDF structure and allows editing individual elements.
EDIT 2
Have a look at pdftables which might help.
PDFTables helps with extracting tables from PDF files.
I haven't tried it though... might have some time a bit later to see if I can get it to work.
