I have a PDF with a big table split across pages, so I need to join the per-page tables into one big table on a single large page.
Is this possible with PyPDF2 or another library?
Cheers
I'm just working on something similar: it takes an input PDF, and a config file sets the final layout of the single pages.
It is implemented with PyPDF2 but still has issues with some PDF files (I have to dig deeper).
https://github.com/Lageos/pdf-stitcher
In principle, adding a page to the right of another one works like this:
import PyPDF2
# indices of the two pages to stitch (example values)
page_number = 0
second_page_number = 1
with open('input.pdf', 'rb') as input_file:
    # load input pdf
    input_pdf = PyPDF2.PdfFileReader(input_file)
    # first page: the PyPDF2 PageObject the second page is merged onto
    output_pdf = input_pdf.getPage(page_number)
    # get second page PyPDF2 PageObject
    second_pdf = input_pdf.getPage(second_page_number)
    # dimensions for offset from loaded page (adding it to the right)
    offset_x = output_pdf.mediaBox[2]
    offset_y = 0
    # add second page to first one
    output_pdf.mergeTranslatedPage(second_pdf, offset_x, offset_y, expand=True)
    # write finished pdf while the input file is still open
    with open('output.pdf', 'wb') as out_file:
        write_pdf = PyPDF2.PdfFileWriter()
        write_pdf.addPage(output_pdf)
        write_pdf.write(out_file)
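Adding a page below needs an offset_y instead. The amount is the page height, output_pdf.mediaBox[3] for the first page. Since PDF y coordinates grow upward, a positive offset_y puts the second page above the first; a negative offset of the second page's height, offset_y = -second_pdf.mediaBox[3], puts it below.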
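The offset arithmetic above generalises to a grid of pages. The following is a minimal sketch in plain Python, not code taken from pdf-stitcher: grid_offsets is a hypothetical helper, and it assumes that all pages have the same width and height and that the origin is the bottom-left corner of the stitched page.

```python
def grid_offsets(page_width, page_height, rows, cols):
    """Return (offset_x, offset_y) for each page of a rows x cols grid.

    Pages are listed row by row, starting at the top-left cell.
    Hypothetical helper for illustration; assumes equally sized pages.
    """
    offsets = []
    for row in range(rows):
        for col in range(cols):
            offset_x = col * page_width
            # row 0 is the top row, so it sits highest above the origin
            offset_y = (rows - 1 - row) * page_height
            offsets.append((offset_x, offset_y))
    return offsets


# two A4-sized pages (595 x 842 points) stacked vertically
print(grid_offsets(595, 842, rows=2, cols=1))  # [(0, 842), (0, 0)]
```

Each pair could then serve as the x and y offsets when merging the corresponding page.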
My understanding is that this is quite hard. See here and here.
The problem seems to be that tables aren't very well represented in pdfs but are simply made from absolutely positioned lines (see first link above).
Here are two possible workarounds (not sure if they will do it for you):
You could print multiple pages on one page and scale the page to make it readable.
Open the PDF with Inkscape or something similar. Once ungrouped, you should have access to the individual elements that make up the tables and be able to combine them in whatever way suits you.
EDIT
Have a look at LibreOffice Draw, another vector package. I just opened a PDF in it, and it seems to preserve some of the PDF structure and to allow editing individual elements.
EDIT 2
Have a look at pdftables which might help.
PDFTables helps with extracting tables from PDF files.
I haven't tried it though... might have some time a bit later to see if I can get it to work.
Related
How do you merge two slides saved as PDFs? Specifically, I want to know how to add a second slide to a first slide. I tried copying the code from a Python Udemy course while changing the variable names, I found one solution online that looked far from concise, and I glanced at a Stack Overflow thread that ended with there maybe not being a solution. Running the Udemy code, I got an error four times saying that one of my PDFs was corrupted, and I never got a slide added to another slide. The expected output is a PDF file with two slides, one from each PDF file. Here's a very rough excerpt of my code (attached). Note: the practice presentations I tried to merge are called sept-1.pdf and second-page.pdf, and I changed the username in the file path to remove my nickname.
Goal: Add the one and only slide from PDF 2 to the end of PDF 1.
Import modules
!pip install PyPDF4
import PyPDF4
Change directory to desktop
import os
os.getcwd()
os.chdir('/Users/mac_username_name/Desktop')
os.getcwd()
Stuff after changing the directory to the desktop
Adding to my desired PDF
with open('sept-1.pdf', 'rb') as f:
    pdf_reader = PyPDF4.PdfFileReader(f)
    first_page = pdf_reader.getPage(0)
    pdf_writer = PyPDF4.PdfFileWriter()
    pdf_writer.addPage(first_page)
    # write while the input is still open; leaving the with block closes the output
    with open("newer.pdf", "wb") as pdf_output:
        pdf_writer.write(pdf_output)
I'm on Python 3, using PyPDF2, and in order to add page numbers to a newly generated PDF (which I do using reportlab), I merge the two PDF files page by page in the following way:
from PyPDF2 import PdfFileWriter, PdfFileReader
def merge_pdf_files(first_pdf_fp, second_pdf_fp, target_fp):
    """
    Merges two PDF files into a target final PDF file.
    Args:
        first_pdf_fp: the first PDF file path.
        second_pdf_fp: the second PDF file path.
        target_fp: the target PDF file path.
    """
    pdf1 = PdfFileReader(first_pdf_fp)
    pdf2 = PdfFileReader(second_pdf_fp)
    assert (pdf1.getNumPages() == pdf2.getNumPages())
    final_pdf_writer = PdfFileWriter()
    for i in range(pdf1.getNumPages()):
        number_page = pdf1.getPage(i)
        content_page = pdf2.getPage(i)
        content_page.mergePage(number_page)
        final_pdf_writer.addPage(content_page)
    with open(target_fp, "wb") as final_os:
        final_pdf_writer.write(final_os)
But this is very slow. Is there a faster and cleaner way to merge everything at once using PyPDF2?
I do not have enough 'reputation' to comment. But since I was going to post an answer I made it long.
Normally when people want to 'merge' documents they mean 'combining' them, that is, concatenating or appending one PDF at the end of the other (or somewhere in between). But based on the code you present, it seems you meant overlaying one PDF over another, right? In other words, you want page 1 from both pdf1 and pdf2 to be combined into a single page as part of a new PDF.
If so, you could use this (modified from an example used to illustrate watermarking). It still overlays one page at a time. But pdfrw is said to be much faster than PyPDF2 and is supposed to work well with reportlab. I haven't compared the speeds, so I'm not sure whether this will actually be faster than what you already have.
from pdfrw import PdfReader, PdfWriter, PageMerge
p1 = PdfReader("file1")
p2 = PdfReader("file2")
for page in range(len(p1.pages)):
    # overlay the matching page of file2 on this page of file1
    merger = PageMerge(p1.pages[page])
    merger.add(p2.pages[page]).render()
writer = PdfWriter()
writer.write("output.pdf", p1)
Try this.
You can use PyPDF2's PdfFileMerger class.
To concatenate files, use its append method:
from PyPDF2 import PdfFileMerger
pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']
merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(pdf)
merger.write("result.pdf")
merger.close()
Maybe the answer to "Is there a way to speed up PDF page merging..." will help you; it uses multiprocessing, which takes 100% of the processor.
I am using pdf2image to change pdfs to jpgs in about 1600 folders. I have looked around and adapted code from many SO answers, but this one section seems to be overproducing jpgs in certain folders (hard to tell which).
In one particular case, using an Adobe Acrobat tool to make pdfs creates 447 jpgs (correct amount) but my script makes 1059. I looked through and found some pdf pages are saved as jpgs multiple times and inserted into the page sequences of other pdf files.
For example:
PDF A has 1 page and creates PDFA_page_1.jpg.
PDF B has 44 pages and creates PDFB_page_1.jpg through ....page_45.jpg because PDF A shows up again as page_10.jpg. If this is confusing, let me know.
I have tried messing with the index portion of the loop (specifically, taking the +1 away, using pages instead of page, and storing the file name in a variable rather than building it directly inside the .save and .move calls).
I also tried using the fmt='jpg' parameter in pdf2image.py but was unable to produce the correct naming scheme because I am unsure how to iterate the page numbers without the for page in pages loop.
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf") and pdf_file.startswith("602024"):
        #Convert function from pdf2image
        pages = convert_from_path(pdf_file, 72, output_folder=final_directory)
        print(pages)
        pdf_file = pdf_file[:-4]
        for page in pages:
            #save with designated naming scheme <pdf file name> + page index
            jpg_name = "%s-page_%d.jpg" % (pdf_file,pages.index(page)+1)
            page.save(jpg_name, "JPEG")
            #Moves jpg to the mini_jpg folder
            shutil.move(jpg_name, 'mini_jpg')
            #no_Converted += 1
# Delete ppm files
dir_name = final_directory
ppm_remove_list = os.listdir(dir_name)
for ppm_file in ppm_remove_list:
    if ppm_file.endswith(".ppm"):
        os.remove(os.path.join(dir_name, ppm_file))
There are no error messages, just 2 - 3 times as many jpgs as I expected in just SOME cases. Folders with many single-page pdfs do not experience this problem, nor do folders with a single multi-page pdf. Some folders with multiple multi-page pdfs also function correctly.
If you can create a reproducible example, feel free to open an issue on the official repository. I am not sure that I understand how that could happen: https://github.com/Belval/pdf2image
Do provide PDF examples; otherwise, I can't test.
As an aside, instead of pages.index use for i, page in enumerate(pages) and page number will be i + 1.
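A small sketch of why enumerate is the safer choice: list.index returns the position of the first element that compares equal, so two equal entries get the same number, and it rescans the list on every call. Plain strings stand in for the page images below; that is an assumption for illustration, and whether two pdf2image pages actually compare equal depends on their content.

```python
pages = ["cover", "blank", "table", "blank"]  # stand-ins for page images

# index() finds the first equal element, so both "blank" pages get number 2
by_index = [pages.index(page) + 1 for page in pages]

# enumerate() counts positions, so every page gets its own number
by_enumerate = [i + 1 for i, page in enumerate(pages)]

print(by_index)      # [1, 2, 3, 2]
print(by_enumerate)  # [1, 2, 3, 4]
```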
I have a PDF file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the PDF file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast but does not give gaps between the elements. Results are jumbled.
pdfminer - PDFPageInterpreter with LAParams. This works well but is slow: at least 2 seconds per page, and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula-py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re
# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(),filetypes=[("PDF files", "*.pdf")])
print(file_path) # print that path
file_data = parser.from_file(file_path) # Parse data from file
text = file_data['content'] # Get files text content
by_page = text.split('... Information') # split up the document into pages by string that always appears on the
# top of each page
for i in range(1,len(by_page)): # loop page by page
    info = by_page[i] # get one page worth of data from the pdf
    reformated = info.replace("\n", "&") # I replace the new lines with "&" to make it more readable
    print("Page: ",i) # print page number
    print(reformated,"\n\n") # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).
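The reading order described above (top to bottom, then left to right within a line) can be sketched in pure Python once each text fragment comes with page coordinates. This is only an illustration under assumptions: the (x, y, text) tuples are made-up sample data, reading_order and line_tolerance are hypothetical names, y is taken to grow upward as in PDF, and getting the coordinates out of a layout-aware extractor is not shown.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right."""
    lines = []  # each entry: [line_y, [(x, text), ...]]
    # highest y first, because PDF y coordinates grow upward
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            # close enough to the previous fragment: same visual line
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # within each line, order fragments by x and join them
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


sample = [
    (300, 700, "right column, line 1"),
    (50, 700, "left column, line 1"),
    (50, 686, "left column, line 2"),
    (300, 685, "right column, line 2"),
]
for line in reading_order(sample):
    print(line)
```

The tolerance lets fragments whose baselines differ by a point or so still count as one line; a suitable value would depend on the document.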
I did this for a .docx file with the code below, where txt is the text of the .docx. Hope this helps.
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)
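To see what the pattern does, here is a small self-contained run on a made-up sample string (the sample text is an assumption; the pattern and replacement are the ones above). It inserts a blank line after each sentence-ending mark, plus an optional closing quote, that is followed by whitespace.

```python
import re

# sentence end (. ? !), an optional closing quote, then one whitespace character
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')

txt = 'Hello world. How are you? "Fine." Good.'
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)
```

The final "Good." is left alone because no whitespace follows its period.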
Sorry, I'm a noob in Python.
I need to create a PDF file without using any existing PDF files (purely create a new one).
I have been googling, and most of the results are about merging two PDFs or creating a new file from a particular page of another file. What I want to achieve is a report page (with a chart), but as a first, simple step: how do I insert a string into my PDF file (maybe "hello world")?
This is my code to make a new PDF file with a single blank page:
from pyPdf import PdfFileReader, PdfFileWriter
op = PdfFileWriter()
# here to add blank page
op.addBlankPage(200,200)
#how to add string here, and insert it to my blank page ?
ops = open("document-output.pdf", "wb")
op.write(ops)
ops.close()
You want "pisa" or "reportlab" for generating arbitrary PDF documents, not "pypdf".
http://www.xhtml2pdf.com/doc/pisa-en.html
http://www.reportlab.org
Also check out the pyfpdf library. I've used the PHP port of this library for a few years, and it's quite flexible, allowing you to work with flowable text, lines, rectangles, and images.
http://code.google.com/p/pyfpdf