Split a PDF into groups of several pages with pikepdf in Python

I need to split a PDF file into groups of pages, with the group size specified by the user. For example, I have a PDF with 20 pages and I want to split it into groups of 5 pages. The output would be 4 PDFs of 5 pages each.
I read the pikepdf documentation, and its example only splits a file into single pages, so I would end up with 20 single-page PDFs.
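from pikepdf import Pdf

pdf = Pdf.open('../tests/resources/fourpages.pdf')
for n, page in enumerate(pdf.pages):
    # A new destination file is created for every page, hence one page per output
    dst = Pdf.new()
    dst.pages.append(page)
    dst.save(f'{n:02d}.pdf')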
This is the code from the pikepdf documentation. I made it work, but as I said, the output is just single-page PDFs. I tried changing it a bit with a nested while loop, but it didn't work.
I think it's weird that it doesn't allow splitting into chunks of more than one page. Maybe there is something obvious that I'm not seeing.
I thought about splitting the file into single pages and then merging them again in groups of the desired size, but that doesn't seem very efficient.
For now I'm not allowed to use any library other than pikepdf. Is there a way to achieve this?
Thanks in advance.
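One possible approach, as a minimal sketch rather than a definitive answer: the documentation example creates a new destination file for every page, but nothing prevents appending several pages to the same destination before saving it. The sketch below uses only the pikepdf calls already shown in the question (Pdf.open, Pdf.new, pages.append, save). The names chunk_ranges and split_pdf, the prefix parameter, and the output file naming are hypothetical and chosen only for illustration.

```python
def chunk_ranges(total_pages, pages_per_file):
    """Return one range of 0-based page indices per output file."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        range(start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]


def split_pdf(src_path, pages_per_file, prefix="part"):
    """Write consecutive groups of pages to prefix_00.pdf, prefix_01.pdf, ...

    Hypothetical helper: the naming scheme is an assumption, not pikepdf's.
    """
    # Imported here so that chunk_ranges has no third-party dependency.
    from pikepdf import Pdf

    written = []
    # Keep the source open until every destination file has been saved.
    with Pdf.open(src_path) as pdf:
        for n, indices in enumerate(chunk_ranges(len(pdf.pages), pages_per_file)):
            dst = Pdf.new()
            for i in indices:
                dst.pages.append(pdf.pages[i])
            name = f"{prefix}_{n:02d}.pdf"
            dst.save(name)
            written.append(name)
    return written
```

With a 20-page file, split_pdf('input.pdf', 5) would be expected to write four files of 5 pages each. If the page count is not a multiple of the group size, the last file simply holds the remaining pages.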

Related

ReportLab: edit a certain page after creating several pages

I want to edit a certain page while creating a PDF with ReportLab only. (There are some solutions with PyPDF2, but I want to use ReportLab only - if it is possible).
Description of what I am doing / trying to do:
I am creating a PDF file that gives the reader a good overview of certain data. I get the data from a server. I know the data structure, but the amount of data I get varies from PDF to PDF. That's why some PDFs will be 20 pages long and some can be 50+ pages.
After getting all the data and creating a beautiful PDF (this work is done by now), I want to go back to page 2 of this PDF and create my own, very individual table of contents.
But I can't find anywhere how to edit a certain page after creating several new pages.
What I've done so far to try to solve my problem:
I read the documentation
I checked stackoverflow
I checked git-repos
Help would be really appreciated. If it is not possible with ReportLab to edit a certain page after other pages have been added, I am thinking about using PyPDF2 instead. I have little to no experience with PyPDF2 so far, so if you have some good links you can send me, I'd be very thankful.

Deleting pages in a word document using python-docx

I have a .docx template I use containing tables with placeholder names that I replace with data from a .xlsx file using openpyxl and python-docx. For example I have a "tool" placeholder that is replaced with the contents of cell A1 in the .xlsx.
It's a dumb workaround, but that's life when I can't get the business to use .xlsx to begin with.
The template is 126 pages, and I use anything from 1 to 126 depending on the part being documented. Currently I remove the unused pages manually. Is there a way to remove, for example, pages 10 through 126 using Python?
All the pages are identical to start with, so another solution would be to generate the correct number of pages at the beginning. I've tried various ways of doing that, but I can never get the logo picture to copy correctly.

Grabbing an article from a pdf file - Python

I have more than 5000 PDF files with at least 15 and at most 20 pages each. I used PyPDF2 to find out which of the 5000 PDF files contain the keyword I am looking for, and on which page.
Now I have the following data:
I was wondering if there is a way for me to get the specific article on the specific page using this data. I know now which filenames to check and which page.
Thanks a lot.
There is a library called tika. It can extract the text from a PDF. If you split your PDF so that only the page in question remains, you can then use:
from tika import parser
parsed_page = parser.from_file('sample.pdf')
print(parsed_page['content'])
NOTE: This library requires Java to be installed on the system.

Scraping large and complex PDF tables

I've been trying to scrape some data off of PDFs regarding 2020 election results in California for my own morbid curiosity.
I need to scrape many tables that appear across many pages. In some cases, the rows will continue onto the next page, and additional columns will appear on other pages as well. I've included a link to one example. I'm comfortable with R, but I can also use Python if that will be better for scraping. I haven't found many resources indicating how to deal with tables that carry onto additional pages for either language though. I need to get these tables into a CSV or XLSX format.
Thank you in advance!
In this example, Pages 15-28 should be one table.
https://www.co.tehama.ca.us/images/images/Elections/StatementOfVotesCastNOV2020v2excel.pdf
I was able to get the entire table using the following procedure.
Open the PDF in MS Word, not Adobe Acrobat. Word will convert the document.
After the conversion has completed, select all. (Both may take some time.)
Paste into a blank Excel worksheet. Save and enjoy.

PDF File Manipulation (open a large pdf file, find a keyword, then save in which page was found, and then split those pages and merge them in one pdf)

I'm working on a project for a friend of mine.
I want to find one specific keyword in a large PDF file (40-60 pages and above). The keyword appears on multiple pages and is duplicated in other places in the file. I then want to keep in memory which pages the keyword was found on, split those pages out of the original PDF file and, lastly, merge them together.
I'm thinking about using PDFMiner or PyPDF2 (I'm open to other suggestions as well).
I've already written most of the code, but I can't figure out a good and efficient way to search the file for that keyword, because it occurs in several places in the same file. I also need to make sure that the data I extract from the original file contains no duplicates and that nothing is missed.
Thanks in advance.
Did you try splitting the PDF file into a couple of blocks and searching for the keyword in each block with multithreading? That should be faster.
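Before parallelising anything, a plain single pass over the pages may already be enough for files of this size. The following is a rough sketch under stated assumptions: it uses pypdf (the successor to the PyPDF2 package mentioned in the question) and assumes the PDF has an extractable text layer. The function names find_keyword_pages and extract_keyword_pages are hypothetical. Because every page is visited exactly once, a page is recorded only once even if the keyword occurs on it several times.

```python
def find_keyword_pages(page_texts, keyword):
    """Return the 0-based indices of pages whose text contains keyword.

    The match is case-insensitive. Pages with no extractable text
    (None or an empty string) are skipped.
    """
    needle = keyword.casefold()
    return [
        index
        for index, text in enumerate(page_texts)
        if text and needle in text.casefold()
    ]


def extract_keyword_pages(src_path, keyword, dst_path):
    """Copy every page that mentions keyword into one new PDF file."""
    # Third-party dependency (assumption: pypdf is installed).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    texts = [page.extract_text() for page in reader.pages]
    hits = find_keyword_pages(texts, keyword)

    writer = PdfWriter()
    for index in hits:
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as handle:
        writer.write(handle)
    return hits
```

The returned list holds the matching page indices in their original order, which makes it possible to check afterwards that nothing was skipped or extracted twice.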
