How to attach a file to an existing pdf? - python

I have a PDF created with LaTeX and I need to attach a file to it.
I've tried the code below. The file does get attached, but the pages of the output PDF are blank. I have also tried merging the output file with the original file, but then the attachment is lost.
import PyPDF2 as pdf2
from PyPDF2 import PdfFileReader, PdfFileWriter

output = pdf2.PdfFileWriter()

# Copy the pages of the existing PDF into the writer
with open("test.pdf", "rb") as f:
    input_pdf = pdf2.PdfFileReader(f)
    output.appendPagesFromReader(input_pdf)

# Attach the XML file as raw bytes
with open("test.xml", "rb") as xmlFile:
    output.addAttachment("test.xml", xmlFile.read())

with open("test2.pdf", "ab") as f:
    output.write(f)
Solution (partial):
I added this to the LaTeX file that generates the PDF, and the file was embedded in (attached to) the PDF. However, I couldn't find a way to do it in Python.
\documentclass{article}
\usepackage{embedfile}
\begin{document}
\embedfile{test-ansi.tex}
text
\end{document}

Related

I can not open the pdf file downloaded from the website

I use requests to scrape a PDF file from a website. The file is saved, but it cannot be opened. Here is my code:
import requests

url = "https://dl.acm.org/doi/pdf/10.1145/3397271.3401063"
html = requests.get(url)
path = Mypath  # destination file path (placeholder from the question)
with open(path, 'wb') as f:
    f.write(html.content)
And when I try to open the PDF file, I get an error:
We can't open this file. Something went wrong.
I think there is something wrong with the downloaded file?
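One thing worth checking first is whether the saved bytes are a PDF at all. PDF files start with the marker `%PDF-`, and a site that blocks scripted downloads may return an HTML page instead. A minimal sketch of that check follows; `looks_like_pdf` is a hypothetical helper name, not part of requests.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes appear to begin with a PDF header."""
    # Some viewers accept the header anywhere in the first 1024 bytes,
    # so search that prefix instead of requiring it at offset 0.
    return b"%PDF-" in data[:1024]


print(looks_like_pdf(b"%PDF-1.7\n..."))          # True
print(looks_like_pdf(b"<!DOCTYPE html><html>"))  # False
```

With the code from the question, calling `looks_like_pdf(html.content)` before writing the file would show whether the server actually sent a PDF. If it returns False, inspecting `html.status_code` and the start of `html.text` is a reasonable next step.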

PyPDF2 does not always extract text from pdf file

I am trying to extract text from a PDF file using the Python package PyPDF2. Below is the code I wrote for it, and it works fine for most PDF files. However, for this file I get the following warning and no text gets extracted at all:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
import PyPDF2
# Import and read pdf file
file_name = 'path/to/file.pdf'
pdfFileObj = open(file_name, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=True)
# Extract text from pdf file
pageObj = pdfReader.getPage(1)
text = pageObj.extractText()
print(text)

Could not find an existing file while converting from .tex to PDF in Python

I need to convert a LaTeX file to PDF. I tried to do it as described in the documentation:
import subprocess, os
from tex import latex2pdf,tex2pdf
from latex import build_pdf
from pdflatex import PDFLaTeX

with open('sometexfile.tex','w') as file:
    file.write('\\documentclass{article}\n')
    file.write('\\begin{document}\n')
    file.write('Hello Palo Alto!\n')
    file.write('\\end{document}\n')

with open('sometexfile.tex', 'rb') as f:
    file_to_read = f.read()

pdfl = PDFLaTeX.from_binarystring(file_to_read , 'sometexfile')
pdf = pdfl.create_pdf()
And I got an error:
line 20, in
pdf = pdfl.create_pdf() The system cannot find the file specified
How should I solve my problem?
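That Windows message often means that a program Python tried to launch was not found, rather than the .tex file itself. Assuming the pdflatex package runs an external `pdflatex` executable (worth confirming against its documentation), a quick check is whether that executable is on PATH. This is only a sketch; `engine_path` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def engine_path(name: str = "pdflatex") -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


path = engine_path()
if path is None:
    print("pdflatex was not found on PATH; install a TeX distribution or fix PATH")
else:
    print(f"pdflatex found at {path}")
```

If the executable is missing, installing a TeX distribution (or adding its bin directory to PATH) and restarting the Python session would be the first thing to try.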

PyPDF2 split pdf keeping original pdf format

I am splitting a PDF file by page ranges. My code works, but the new PDF file does not preserve the format of the original PDF.
How can I split a PDF file without losing the original format?
from PyPDF2 import PdfFileReader, PdfFileWriter

# split range (1-based, inclusive)
pgi = 30  # start
pgf = 37  # end

pdf_document = "test.pdf"
pdf = PdfFileReader(pdf_document)
pdf_writer = PdfFileWriter()

for page in range(pgi-1, pgf):
    current_page = pdf.getPage(page)
    pdf_writer.addPage(current_page)

with open(f'test-{pgi}-{pgf}.pdf', "wb") as out:
    pdf_writer.write(out)
PS: I don't have this problem when I use Adobe or MasterPdfEditor software.
The image below shows the original and new pdf format.

How do I know my file is attached in my PDF using PyPDF2?

I am trying to attach an .exe file into a PDF using PyPDF2.
I ran the code below, but my PDF file is still the same size.
I don't know if my file was attached or not.
from PyPDF2 import PdfFileWriter, PdfFileReader
writer = PdfFileWriter()
reader = PdfFileReader("doc1.pdf")
# check whether reading the PDF works
print("doc1 has %d pages" % reader.getNumPages())
writer.addAttachment("doc1.pdf", "client.exe")
What am I doing wrong?
First of all, you have to use the PdfFileWriter class properly.
You can use appendPagesFromReader to copy the pages from the source PDF ("doc1.pdf") to the output PDF (e.g. "out.pdf"). Then, for addAttachment, the first parameter is the filename of the file to attach and the second parameter is the attachment data (it's not clear from the docs, but it has to be a bytes-like sequence). To get the attachment data, open the .exe file in binary mode and read() it. Finally, you need to call write to actually save the PdfFileWriter object to a PDF file.
Here is a working example:
from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)

with open("client.exe", "rb") as exe:
    writer.addAttachment("client.exe", exe.read())

with open("out.pdf", "wb") as f:
    writer.write(f)
Next, to check whether attaching was successful, you can use os.stat(path).st_size to compare the file sizes (in bytes) before and after attaching the .exe file.
Here is the same example with checking for file sizes:
(I'm using Python 3.6+ for f-strings)
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)

with open("client.exe", "rb") as exe:
    writer.addAttachment("client.exe", exe.read())

with open("out.pdf", "wb") as f:
    writer.write(f)

# Check result
print(f"size of SOURCE: {os.stat('doc1.pdf').st_size}")
print(f"size of EXE: {os.stat('client.exe').st_size}")
print(f"size of OUTPUT: {os.stat('out.pdf').st_size}")
The above code prints out
size of SOURCE: 42942
size of EXE: 989744
size of OUTPUT: 1031773
...which sort of shows that the .exe file was added to the PDF.
Of course, you can also check it manually by opening the PDF in Adobe Reader.
As a side note, I am not sure why you want to attach .exe files to a PDF. It seems you can attach them, but Adobe treats them as security risks and it may not be possible to open them. You can use the same code above to attach another PDF file (or other documents) instead of an executable file, and it should still work.
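The size comparison is an indirect check. A slightly more direct heuristic, sketched below, is to scan the raw output file for the `/EmbeddedFiles` key that PDF uses for attachments. This assumes the key is stored uncompressed in the file, which may not hold for PDFs written with compressed object streams; `has_attachment_marker` is a hypothetical helper name, not a PyPDF2 function.

```python
def has_attachment_marker(pdf_path: str) -> bool:
    """Heuristic: True if the raw file bytes contain the /EmbeddedFiles key."""
    with open(pdf_path, "rb") as f:
        return b"/EmbeddedFiles" in f.read()
```

With the files from the example, `has_attachment_marker("out.pdf")` would be expected to return True, and `has_attachment_marker("doc1.pdf")` False as long as the source has no attachments of its own. A False result is not proof that nothing is attached, for the reason given above.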
