How do I reduce the size of a PDF merge - python

I'm trying to merge groups of PDFs (up to 1,000 per group). Of the roughly 100,000 PDFs created, they need to be grouped at a practice/market level, with one merged PDF output per group containing a varying number of the individual PDFs.
My PDF file creation and loop work fine, but when it comes to merging, I'm running into file size issues.
I tried doing this with pypdf (roughly the following), but the output file sizes are way too large:
from pypdf import PdfWriter

def merge_pdfs(paths, output):
    # Append every input PDF into one writer, then write the merged file
    writer = PdfWriter()
    for path in paths:
        writer.append(path)
    writer.write(output)
Is there an alternative to pypdf that also allows me to create read-only PDFs of a smaller size?
I've tried PDFtk, Ghostscript, and PyMuPDF to no avail.

It sounds like your files perhaps come from the same source or are generated in the same way, and therefore will have common internal parts, for example the same font data in each.
Try:
cpdf -squeeze in.pdf -o out.pdf
on the output. You could do the initial merge with cpdf too, but it's not required.
If it must be done directly in Python, pycpdflib can do it with squeezeInMemory.
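If shelling out to the cpdf command-line tool is acceptable, a minimal sketch of that post-processing step could look like the following (it assumes the cpdf binary is on your PATH; the file names are placeholders):
import subprocess

def squeeze(in_path, out_path):
    # Run cpdf's -squeeze pass to deduplicate shared objects (e.g. repeated font data)
    subprocess.run(["cpdf", "-squeeze", in_path, "-o", out_path], check=True)

squeeze("merged.pdf", "merged_small.pdf")  # placeholder file names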

Related

Extracting text from two column pdf using python

I am trying to extract text from a two-column PDF. With PyPDF2 and pdfplumber it reads the page horizontally, i.e. after reading the first line of column one it goes on to read the first line of column two instead of the second line of column one. I have also tried this code from githubusercontent as it is, but I have the same issue. I also saw How to extract text from two column pdf with Python?, but I don't want to convert to images as I have thousands of pages.
Any help will be appreciated. Thanks!
You can check this blog, which uses PyMuPDF to extract text from two-column PDFs like research papers:
https://towardsdatascience.com/read-a-multi-column-pdf-using-pymupdf-in-python-4b48972f82dc
From what I have tested so far, it works quite well. I would highly recommend the "blocks" option.
import fitz  # PyMuPDF

# Extract the page text as layout blocks instead of the default plain 'text' mode
with fitz.open(DIGITIZED_FILE_PATH) as doc:
    for page in doc:
        text = page.get_text("blocks")
        print(text)
Note: It does not work for scanned images. It works only for searchable pdf files.
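If you need strict column order, the "blocks" output also gives coordinates for each block, so you can sort the blocks around the page midline yourself. A rough sketch (not from the linked blog; "paper.pdf" is a placeholder, and the fixed midline split assumes a simple two-column layout):
import fitz  # PyMuPDF

def two_column_text(page):
    # Each block is (x0, y0, x1, y1, text, block_no, block_type); block_type 0 means text
    blocks = [b for b in page.get_text("blocks") if b[6] == 0]
    mid = page.rect.width / 2
    left = sorted([b for b in blocks if b[0] < mid], key=lambda b: b[1])
    right = sorted([b for b in blocks if b[0] >= mid], key=lambda b: b[1])
    return "\n".join(b[4] for b in left + right)

with fitz.open("paper.pdf") as doc:
    for page in doc:
        print(two_column_text(page))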

Sort labeled dataset with labels in separate csv file?

I have a dataset of ~3500 images, with the labels of each image in a csv file. The csv file has two columns: the first one contains the exact name of the image file (i.e. 00001.jpg) and the second column contains the label of the image. There are a total of 7 different labels.
How can I sort the images from one huge folder to 7 different folders (each image in its respective category) in an efficient manner? Does anyone have a script that can do this?
Also, is there any way I can do this with Google Drive? I've already uploaded the dataset to Drive in order to use with Colab soon, so I don't want to have to do it again (takes ~2.5 hours).
I'm not sure about performance; there are probably better ways, but this would be my take on the problem (not tested, so it might need small adjustments).
I'm assuming the images are in a subfolder images/, while the CSV and the script are in the root folder. Furthermore, I'm assuming the CSV is named images.csv and that its columns are titled file and label.
import pandas as pd
import os

df = pd.read_csv('images.csv')
for _, row in df.iterrows():
    f = row['file']
    l = row['label']
    os.makedirs(f'images/{l}', exist_ok=True)  # create the label folder if it doesn't exist yet
    os.replace(f'images/{f}', f'images/{l}/{f}')
I don't know what Google Drive would make of it, but as long as you can run it on a Google-Drive-synced folder, I don't see why this would be an issue.
Note: if you test it, you may want to do so on a copy of the files, in case I screwed up...
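As for Colab: since the dataset is already uploaded to Drive, one option (a sketch, assuming the images and images.csv sit in a Drive folder called dataset; adjust the paths to match) is to mount Drive in the notebook and run the same loop against the mounted path:
import os
from google.colab import drive

drive.mount('/content/drive')               # Drive becomes visible under /content/drive
os.chdir('/content/drive/MyDrive/dataset')  # hypothetical folder holding images/ and images.csv
# ...then run the loop above unchanged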

Split pack of text files into multiple subsets according to the content of the files

I have a lot of PDF, DOC[X], TIFF and other files (scans from a shared folder). Each file is converted into a pack of text files: one text file per page.
Each pack of files could contain multiple documents (for example, three contracts), and the document kind is not limited to contracts.
While processing a pack of files I don't know in advance what kinds of documents it contains, and it's possible that one pack contains multiple document kinds (contracts, invoices, etc.).
I'm looking for possible approaches to solve this programmatically.
I've tried to search for something like this, but without any success.
UPD: I tried to create a binary classifier with scikit-learn and am now looking for another solution.
Since these are "scans", at its basis this sounds like something that could be approached with computer vision; however, that is currently far above my own level of programming.
Projects like SimpleCV may be a good starting point:
http://www.simplecv.org/
Or possibly you could get away with OCR-reading the "scans" and working from the contents. pytesseract seems popular for this type of task:
https://pypi.org/project/pytesseract/
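For reference, a minimal pytesseract call on a single page image might look like this (the file name is a placeholder, and it requires the Tesseract binary to be installed separately):
from PIL import Image
import pytesseract

# Extract plain text from one scanned page image
text = pytesseract.image_to_string(Image.open("page_001.png"))  # placeholder file name
print(text)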
However, that still leaves the question of how you tell your program that a given part of the text means "this is 3 separate contracts". Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or otherwise? That will be the main factor in how complex a problem you are trying to solve.
The best solution was to create a binary classifier (SGDClassifier) and train it on the classes first-page and not-first-page. Each item in the dataset was trimmed to 100 tokens (words).
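For reference, a minimal sketch of that kind of first-page classifier (the toy training data below is made up, and the TfidfVectorizer choice is an assumption, not necessarily the exact setup described above):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy placeholders: the real training data is the page texts and their labels
pages = ["CONTRACT No 17 between the parties", "continued from the previous page", "INVOICE 2023-001 issued to"]
labels = [1, 0, 1]  # 1 = first-page, 0 = not-first-page

trimmed = [" ".join(p.split()[:100]) for p in pages]  # trim each page to its first 100 tokens
clf = make_pipeline(TfidfVectorizer(), SGDClassifier())
clf.fit(trimmed, labels)

# A new pack can then be split into documents at every page predicted to be a first page
print(clf.predict(["AGREEMENT No 3 between the parties"]))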

how to have indexed files in storage in python

I have a huge dataset of images, which I am processing one by one.
All the images are stored in a single folder.
My approach:
I have tried reading all the filenames into memory, and whenever a call for a certain index comes in, I load the corresponding image.
The problem is that, because the dataset is so huge, it is not even possible to keep the paths and names of the files in memory.
Is it possible to have an indexed file on storage, from which one can read the file name at a certain index?
Thanks a lot.
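One common way to get what the question describes (purely a sketch of the general idea, not a tested answer; the record width and file name are assumptions) is to write the filenames once into an index file with fixed-width records, so the name at position i can be read with a single seek instead of keeping the whole list in memory:
RECORD = 256  # fixed record width in bytes; must be larger than the longest file name

def build_index(names, index_path="names.idx"):
    # Pad every name to RECORD bytes so that record i starts at byte i * RECORD
    with open(index_path, "wb") as f:
        for name in names:
            f.write(name.encode("utf-8").ljust(RECORD, b"\0"))

def name_at(i, index_path="names.idx"):
    # Seek straight to the i-th record and strip the padding
    with open(index_path, "rb") as f:
        f.seek(i * RECORD)
        return f.read(RECORD).rstrip(b"\0").decode("utf-8")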

Pandas: efficiently write thousands of small files

Here is my problem.
I have a single big CSV file containing a bit more than 100M rows, which I need to divide into much smaller files (if needed I can add more details). At the moment I'm reading the big CSV in chunks, doing some computations to determine how to subdivide each chunk, and finally writing (appending) to the output files with
df.to_csv(outfile, float_format='%.8f', index=False, mode='a', header=header)
(the header variable is True the first time I write to 'outfile', otherwise it is False).
While running the code I noticed that the total on-disk space taken by the smaller files was on track to become more than three times the size of the single big CSV.
So here are my questions:
is this behavior normal? (probably it is, but I'm asking just in case)
is it possible to reduce the size of the files? (different file formats?) [SOLVED through compression, see update below and comments]
are there file formats better suited than CSV for this situation?
Please note that I don't have an extensive knowledge of programming, I'm just using Python for my thesis.
Thanks in advance to whoever will help.
UPDATE: thanks to @AshishAcharya and @PatrickArtner I learned how to use compression while writing and reading the CSV. Still, I'd like to know if there are any file types that may be better than CSV for this task.
NEW QUESTION (maybe a stupid one): does appending work on compressed files?
UPDATE 2: using the compression option I noticed something that I don't understand. To determine the size of folders I was taught to use the du -hs <folder> command, but running it on the folder containing the compressed files or on the one containing the uncompressed files gives the same value of '3.8G' (both were created using the same first 5M rows of the big CSV). From the file explorer (Nautilus), instead, I get about 590 MB for the folder with the uncompressed CSVs and 230 MB for the other. What am I missing?
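For reference, the chunk-and-append pattern described above, with gzip compression applied at write time, might look roughly like this (the file names, chunk size and the column used for splitting are placeholders, not the actual computation):
import os
import pandas as pd

os.makedirs('out', exist_ok=True)                        # placeholder output folder
seen = set()                                             # groups that already have a header

for chunk in pd.read_csv('big.csv', chunksize=1_000_000):    # placeholder name and chunk size
    for key, part in chunk.groupby('group'):                 # 'group' stands in for the real split logic
        outfile = f'out/{key}.csv.gz'
        # appending to a .gz adds a new gzip member; gzip readers decompress them as one stream
        part.to_csv(outfile, float_format='%.8f', index=False, mode='a',
                    header=key not in seen, compression='gzip')
        seen.add(key)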
