PyPDF2 split pdf keeping original pdf format - python

I am splitting a PDF file by page ranges. My code works, but the new PDF file does not preserve the formatting of the original.
How can I split a PDF file without losing the original formatting?
from PyPDF2 import PdfFileReader, PdfFileWriter
# split range (1-based, inclusive)
pgi = 30  # start
pgf = 37  # end
pdf_document = "test.pdf"
pdf = PdfFileReader(pdf_document)
pdf_writer = PdfFileWriter()
for page in range(pgi - 1, pgf):
    current_page = pdf.getPage(page)
    pdf_writer.addPage(current_page)
with open(f'test-{pgi}-{pgf}.pdf', "wb") as out:
    pdf_writer.write(out)
PS: I don't have this problem when I use Adobe or MasterPdfEditor software.
The image below shows the original and new pdf format.
[Image: original and new PDF format]
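The loop in the question turns the 1-based, inclusive page range pgi..pgf into zero-based indices with range(pgi-1, pgf). A minimal sketch of that conversion with a bounds check is shown here; the helper name page_indices and the 40-page total are hypothetical and not part of PyPDF2. It only covers the range arithmetic and does not address the formatting difference itself.

```python
def page_indices(first_page, last_page, total_pages):
    """Return zero-based indices for the 1-based inclusive range first_page..last_page."""
    if not 1 <= first_page <= last_page <= total_pages:
        raise ValueError(
            f"invalid range {first_page}-{last_page} for a {total_pages}-page document"
        )
    return list(range(first_page - 1, last_page))


# Pages 30..37 of a hypothetical 40-page document
print(page_indices(30, 37, 40))  # [29, 30, 31, 32, 33, 34, 35, 36]
```

Each returned index could then be passed to pdf.getPage(), as in the loop above.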

Related

Convert PDF to HTML via PyMuPDF

For pages with tabular data in landscape format, the words in the HTML output overlap. For pages in portrait format, the conversion is successful. Any ideas how to fix that?
Here is an example of the PDF converted to HTML in landscape format:
https://i.stack.imgur.com/twbzw.png
https://i.stack.imgur.com/Ln56P.png
import ntpath
from pathlib import Path
import fitz  # PyMuPDF
in_path = "path/to/file.pdf"  # placeholder: path of the PDF to convert
doc = fitz.open(in_path)  # open document
out = open(in_path + ".html", "wb")  # open text output
for page in doc:  # iterate the document pages
    page.read_contents()
    text = page.get_text('html', clip=None).encode("utf8")
    out.write(text)  # write text of page
    out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
out.close()

PyPDF2 does not always extract text from pdf file

I am trying to extract text from a PDF file using the Python package PyPDF2. Below is the code I wrote, and it works fine for most PDF files. However, for this file I get the following warning and no text is extracted at all: PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
import PyPDF2
# Import and read pdf file
file_name = 'path/to/file.pdf'
pdfFileObj = open(file_name, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=True)
# Extract text from pdf file
pageObj = pdfReader.getPage(1)
text = pageObj.extractText()
print(text)

Extracting PDF text into a text file using Python - Extraction error

I want to first extract all the text from 1 pdf file and store it into one text file.
Here is my code:
import PyPDF2
from pathlib import Path
with Path('C:/Users/Lui/Desktop/Test/file1.pdf').open(mode='rb') as pdf_file, open('Extracted/extractPDF.txt', 'w') as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    print(number_of_pages)
    for page_number in range(number_of_pages):  # use xrange in Py2
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        print(page_content)
        text_file.write(page_content)
The pdf looks like this:
However, the text file that is created looks different: words and spacing are missing.
What am I doing wrong? My goal is to then loop through 1,000 PDFs, so I'm trying to get one example working first.
Try using pdftotext:
import pdftotext
# Load your PDF
with open(filename, "rb") as f:
    pdf = pdftotext.PDF(f)
# If it's password-protected
#with open("secure.pdf", "rb") as f:
#    pdf = pdftotext.PDF(f, "secret")
# How many pages?
#print(len(pdf))
# Iterate over all the pages
#for page in pdf:
#    print(page)
# Read all the text into one string
data = "\n\n".join(pdf)
print(data)
This package tends to work far better for text extraction and should help you out.
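Since the stated goal is to loop over about 1,000 PDFs, the single-file extraction can be wrapped in a small batch driver. This is only a sketch under assumptions: the names extract_folder and extract_with_pdftotext are hypothetical, the extractor is passed in as a function so any library can be plugged in, and a file that fails is skipped and reported instead of stopping the run.

```python
from pathlib import Path


def extract_folder(pdf_dir, out_dir, extract_text):
    """Write one .txt file per PDF in pdf_dir; return the names of files that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            text = extract_text(pdf_path)
        except Exception as exc:  # keep going if one file is unreadable
            failed.append(pdf_path.name)
            print(f"skipped {pdf_path.name}: {exc}")
            continue
        (out_dir / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")
    return failed


def extract_with_pdftotext(pdf_path):
    """Extractor built on the pdftotext package used in the answer above."""
    import pdftotext  # imported here so the driver itself has no third-party dependency

    with open(pdf_path, "rb") as f:
        return "\n\n".join(pdftotext.PDF(f))
```

With the paths from the question, a call might look like extract_folder('C:/Users/Lui/Desktop/Test', 'Extracted', extract_with_pdftotext).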

PyPDF2 corrupts file when watermarking

I have been trying to speed up our date stamping process by adding a stamp as a watermark to PDFs through PyPDF2. I found the code below online as I'm pretty new to coding.
When I run this it seems to work, but the file is corrupted and won't open. Does anyone have any ideas where I am going wrong?
from PyPDF2 import PdfFileWriter, PdfFileReader
def create_watermark(input_pdf, output_pdf, watermark):
    watermark_obj = PdfFileReader(watermark, False)
    watermark_page = watermark_obj.getPage(0)
    pdf_reader = PdfFileReader(input_pdf)
    pdf_writer = PdfFileWriter()
    # Watermark all the pages
    for page in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page)
        page.mergePage(watermark_page)
        pdf_writer.addPage(page)
    with open(input_pdf, 'wb') as out:
        pdf_writer.write(out)
if __name__ == '__main__':
    input_pdf = "C:\\Users\\A***\\OneDrive - ***\\Desktop\\Invoice hold\\Test\\1.pdf"
    output_pdf = "C:\\Users\\A***\\OneDrive - ***\\Desktop\\Invoice hold\\Test\\1 WM.pdf"
    watermark = "C:\\Users\\A***\\OneDrive - ***\\Desktop\\Invoice hold\\WM.pdf"
    create_watermark(input_pdf, output_pdf, watermark)
If you want to save the PDF file under the name output_pdf, try this:
with open(output_pdf, 'wb') as result:
    pdf_writer.write(result)
Your code:
with open(input_pdf, 'wb') as out:
    pdf_writer.write(out)
This overwrites input_pdf, so if anything goes wrong while writing, the original PDF file will be damaged.
With this change applied to your code, I was able to insert the watermark.
I also recommend checking that the PDF file is not already damaged.
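The risk mentioned above, a failed write leaving a damaged file behind, can be reduced by writing to a temporary file first and moving it into place only after the write has finished. The following is a minimal sketch using only the standard library; the helper name write_via_temp_file is hypothetical and not part of PyPDF2.

```python
import os
import tempfile


def write_via_temp_file(target_path, write_func):
    """Call write_func(file_obj) on a temp file, then move it over target_path."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            write_func(tmp)
        os.replace(tmp_path, target_path)  # only reached if writing succeeded
    except BaseException:
        os.unlink(tmp_path)  # drop the partial file, leave the target untouched
        raise
```

Inside create_watermark, the final write could then be expressed as write_via_temp_file(output_pdf, pdf_writer.write), since pdf_writer.write accepts an open binary file object.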

Convert the page data extracted from pdf file into csv file using PyPDF2

I have extracted data from a PDF file, but I am unable to convert it into a CSV file.
import PyPDF2 as PDF
PDFfile = open("path", "rb")
pdfread = PDF.PdfFileReader(PDFfile)
page = pdfread.getPage(0)  # the method is getPage, with a capital P
Page_content = page.extractText()
After extracting the text from a particular page, I want to export it to a CSV file. How can I do this?
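One possible approach, sketched with the standard csv module: treat every non-empty line of the extracted text as a row and split it into columns. The column rule used here (two or more whitespace characters separate columns) is purely an assumption about the layout; text extracted from a PDF may not preserve column spacing, so the pattern would need adjusting to the actual output. The function name text_to_csv is hypothetical.

```python
import csv
import re


def text_to_csv(text, csv_file):
    """Write one CSV row per non-empty line, splitting columns on runs of 2+ whitespace."""
    writer = csv.writer(csv_file)
    rows = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        writer.writerow(re.split(r"\s{2,}", line))  # assumed column separator
        rows += 1
    return rows
```

With the variable from the question, this could be called as text_to_csv(Page_content, f), where f comes from open("page0.csv", "w", newline="", encoding="utf-8"); the file name page0.csv is only an example.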
