(Python) Converting Hundreds of PNGs to a Single PDF

I have a folder with 452 images (.png) that I'm trying to merge into a single PDF file using Python. Each image is labelled with its intended page number, e.g. "1.png", "2.png", ..., "452.png".
This code was technically successful, but inserted the pages out of the intended order.
import os

import img2pdf
from PIL import Image

with open("output.pdf", 'wb') as f:
    f.write(img2pdf.convert([i for i in os.listdir('.') if i.endswith(".png")]))
I also tried reading the image data as binary, converting it, and writing it to the PDF, but this yields a 94 MB one-page PDF.
import img2pdf
from PIL import Image

with open("output.pdf", 'wb') as f:
    for i in range(1, 453):
        img = Image.open(f"{i}.png")
        pdf_bytes = img2pdf.convert(img)
        f.write(pdf_bytes)
Any help would be appreciated, I've done quite a bit of research, but have come up short. Thanks in advance.

but inserted the pages out of the intended order
I suspect that the intended order is "in numerical order of file name", i.e. 1.png, 2.png, 3.png, and so forth.
This can be solved with:
with open("output.pdf", 'wb') as f:
f.write(img2pdf.convert(sorted([i for i in os.listdir('.') if i.endswith(".png")], key=lambda fname: int(fname.rsplit('.',1)[0]))))
This is a slightly modified version of your first attempt that simply sorts the file names numerically (as your second attempt tried to do) before batch-converting them into the PDF.
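If the folder might ever contain PNGs whose names are not purely numeric (a stray cover.png, say), a slightly more defensive variant of the same idea is to filter on an all-digits pattern before sorting. This is a sketch, not part of the original answer:
import os
import re

import img2pdf

# Keep only names that are entirely digits before the .png extension,
# then sort them numerically.
pngs = [i for i in os.listdir('.') if re.fullmatch(r"\d+\.png", i)]
pngs.sort(key=lambda fname: int(fname[:-4]))

with open("output.pdf", 'wb') as f:
    f.write(img2pdf.convert(pngs))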

Related

Is there any feasible solution to read WOT battle results .dat files?

I am new here, trying to solve an interesting question about World of Tanks. I have heard that every battle's data is stored on the client's disk in the Wargaming.net folder, and I want to do some batch data analysis of our clan's battle performance.
It is said that these .dat files are a kind of JSON file, so I tried a couple of lines of Python to read one, but failed.
import json
f = open('ex.dat', 'r', encoding='unicode_escape')
content = f.read()
a = json.loads(content)
print(type(a))
print(a)
f.close()
The code is very simple and obviously fails. Could anyone tell me what is actually in these files?
Added on Feb. 9th, 2022
After trying another snippet in a Jupyter Notebook, it seems like something can be read out of the .dat files:
import io
import struct

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    fbuff = io.BufferedReader(f)
    N = len(fbuff.read())
    print('byte length: ', N)

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    data = struct.unpack('b' * N, f.read(1 * N))
The result is a tuple of integers, but I have no idea how to deal with it now.
Here's how you can parse some parts of it.
import pickle
import zlib

file = '4402905758116487.dat'
cache_file = open(file, 'rb')  # This could be improved to not keep the file open.

# When loading pickles written by Python 2 into Python 3, you need to use the
# "bytes" or "latin1" encoding.
legacyBattleResultVersion, brAllDataRaw = pickle.load(cache_file, encoding='bytes', errors='ignore')
arenaUniqueID, brAccount, brVehicleRaw, brOtherDataRaw = brAllDataRaw

# The data stored inside the pickled file is itself a compressed pickle.
vehicle_data = pickle.loads(zlib.decompress(brVehicleRaw), encoding='latin1')
account_data = pickle.loads(zlib.decompress(brAccount), encoding='latin1')
brCommon, brPlayersInfo, brPlayersVehicle, brPlayersResult = pickle.loads(zlib.decompress(brOtherDataRaw), encoding='latin1')

# Lastly, you can print any of these and see a lot of data inside.
The result contains a mixture of further binary blobs as well as some data captured from the replays.
This is not a complete solution but it's a decent start to parsing these files.
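To orient yourself, a quick inspection pass over the unpickled objects from the snippet above could look like this (a minimal sketch reusing the variable names from that snippet):
# Inspect what kinds of objects came out before interpreting single values.
print(legacyBattleResultVersion)
print(arenaUniqueID)
print(type(account_data), type(vehicle_data))
print(type(brCommon), type(brPlayersInfo), type(brPlayersVehicle), type(brPlayersResult))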
First, you can look at the replay file itself in a text editor, but it won't be readable until the binary code at the beginning of the file is cleaned out. Then there is a ton of info that you have to read in and figure out, which is the stats for each player in the game. Only after that comes the part that drives the actual replay, and you don't need that stuff.
You can grab the player IDs and tank IDs from the WoT developer area API if you want.
After loading the pickle files as gabzo mentioned, you will see that it is simply a list of values, and without knowing what each value refers to, it's hard to make sense of it. The identifiers for the values can be extracted from your game installation:
import zipfile

WOT_PKG_PATH = "Your/Game/Path/res/packages/scripts.pkg"
BATTLE_RESULTS_PATH = "scripts/common/battle_results/"

archive = zipfile.ZipFile(WOT_PKG_PATH, 'r')
for file in archive.namelist():
    if file.startswith(BATTLE_RESULTS_PATH):
        archive.extract(file)
You can then decompile the Python files (for example with uncompyle6) and go through the code to find the identifiers for the values.
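As a rough sketch of that step (assuming pip install uncompyle6 and the extraction path from the snippet above), the uncompyle6 command-line tool can decompile the extracted files in bulk:
import glob
import os
import subprocess

# Decompile every extracted .pyc into a "decompiled" directory using the
# uncompyle6 CLI (its -o flag sets the output location).
os.makedirs("decompiled", exist_ok=True)
pyc_files = glob.glob("scripts/common/battle_results/**/*.pyc", recursive=True)
subprocess.run(["uncompyle6", "-o", "decompiled"] + pyc_files, check=True)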
One thing to note is that the list of values for the main pickle objects (like brAccount from gabzo's code) always has a checksum as its first value. You can use this to check whether you have the right order and the correct identifiers for the values. The way these checksums are generated can be seen in the decompiled Python files.
I have been tackling this problem for some time (albeit in Rust): https://github.com/dacite/wot-battle-results-parser/tree/main/datfile_parser.

Saving images to a folder in the same directory Python

I have written this code that takes a URL and downloads the image. I was wondering how I can save the images to a specific folder in the same directory.
For example, I have a folder named images in the same directory, and I want to save all the images to that folder.
Below is my code:
import csv
import requests

with open('try.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for rows in csv_reader:
        image_resp = requests.get(rows[0])
        with open(rows[1], 'wb') as image_downloader:
            image_downloader.write(image_resp.content)
Looking for this?
import os

with open(os.path.join("images", rows[1]), 'wb') as fd:
    for chunk in image_resp.iter_content(chunk_size=128):
        fd.write(chunk)
See official docs here: os.path.join
Requests-specific details are from https://2.python-requests.org/en/master/user/quickstart/#raw-response-content
PS: You might want to use a connection pool and/or multiprocessing instead of the plain for rows in csv_reader loop, in order to issue many concurrent requests.
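A rough sketch of that suggestion, using requests.Session for connection pooling and a thread pool for concurrency (this assumes try.csv is laid out as in the question: URL in the first column, file name in the second):
import csv
import os
from concurrent.futures import ThreadPoolExecutor

import requests

session = requests.Session()  # one Session reuses TCP connections (pooling)

def download(row):
    url, fname = row[0], row[1]
    resp = session.get(url)
    with open(os.path.join("images", fname), "wb") as fd:
        fd.write(resp.content)

with open("try.csv", "r") as csv_file:
    rows = list(csv.reader(csv_file))

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(download, rows))  # consume results so errors surface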
Saving an image can be achieved by using the .save() method of the Pillow library's Image module in Python.
Example:
from PIL import Image

file = "C:/Users/ABC/MyPic.jpg"  # selects the jpg file 'MyPic'
img = Image.open(file)
img.save('Desktop/MyPics/new_img.jpg')  # saves it to the MyPics folder as 'new_img'
Image.save(fp, format=None, **params):
Saves this image under the given filename. If no format is specified, the format to use is determined from the filename extension, if possible.
Keyword options can be used to provide additional instructions to the writer. If a writer doesn’t recognise an option, it is silently ignored. The available options are described in the image format documentation for each writer.
You can use a file object instead of a filename. In this case, you must always specify the format. The file object must implement the seek, tell, and write methods, and be opened in binary mode.
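As a small illustration of the file-object case described above (the format argument must then be given explicitly), here is a sketch using an in-memory buffer; the file name is just a placeholder:
from io import BytesIO

from PIL import Image

img = Image.open("MyPic.jpg")   # any image file will do
buf = BytesIO()
img.save(buf, format="JPEG")    # format is mandatory with a file object
jpeg_bytes = buf.getvalue()     # the encoded JPEG data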

Is there a faster way to merge two files rather than page by page?

I'm on Python 3, using PyPDF2, and in order to add page numbers to a newly generated PDF (which I create using reportlab) I merge the two PDF files page by page in the following way:
from PyPDF2 import PdfFileWriter, PdfFileReader

def merge_pdf_files(first_pdf_fp, second_pdf_fp, target_fp):
    """
    Merges two PDF files into a target final PDF file.

    Args:
        first_pdf_fp: the first PDF file path.
        second_pdf_fp: the second PDF file path.
        target_fp: the target PDF file path.
    """
    pdf1 = PdfFileReader(first_pdf_fp)
    pdf2 = PdfFileReader(second_pdf_fp)
    assert pdf1.getNumPages() == pdf2.getNumPages()
    final_pdf_writer = PdfFileWriter()
    for i in range(pdf1.getNumPages()):
        number_page = pdf1.getPage(i)
        content_page = pdf2.getPage(i)
        content_page.mergePage(number_page)
        final_pdf_writer.addPage(content_page)
    with open(target_fp, "wb") as final_os:
        final_pdf_writer.write(final_os)
But this is very slow. Is there a faster and cleaner way to do the merge all at once using PyPDF2?
I do not have enough 'reputation' to comment, but since I was going to post an answer anyway, I made it long.
Normally when people want to 'merge' documents, they mean combining them, that is, concatenating or appending one PDF to the end of the other (or somewhere in between). But based on the code you present, it seems you mean overlaying one PDF over another, right? In other words, you want page 1 of pdf1 and page 1 of pdf2 combined into a single page of a new PDF.
If so, you could use the following (modified from an example used to illustrate watermarking). It still overlays one page at a time, but pdfrw is known to be very fast compared to PyPDF2 and is supposed to work well with reportlab. I haven't compared the speeds, so I'm not sure whether this will actually be faster than what you already have:
from pdfrw import PdfReader, PdfWriter, PageMerge

p1 = PdfReader("file1")
p2 = PdfReader("file2")

for page in range(len(p1.pages)):
    merger = PageMerge(p1.pages[page])
    merger.add(p2.pages[page]).render()

writer = PdfWriter()
writer.write("output.pdf", p1)
Try this: you can use PyPDF2's PdfFileMerger class. Using file concatenation, you can concatenate files with the append method. (Note that this concatenates whole documents one after another; it does not overlay two pages into one the way your mergePage code does.)
from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
merger.close()
Maybe the answer in Is there a way to speed up PDF page merging will help you; it uses multiprocessing and drives the processor at 100%.
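That multiprocessing idea can be sketched roughly like this (an untested outline, not the linked answer's code): split the page range across worker processes, let each worker overlay its slice into a partial PDF, then concatenate the partial files. On Windows it must run under an if __name__ == "__main__": guard.
from concurrent.futures import ProcessPoolExecutor
from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter

def merge_slice(args):
    # Overlay pages [start, stop) of the two inputs into a partial PDF.
    first_fp, second_fp, start, stop, part_fp = args
    pdf1, pdf2 = PdfFileReader(first_fp), PdfFileReader(second_fp)
    writer = PdfFileWriter()
    for i in range(start, stop):
        page = pdf2.getPage(i)
        page.mergePage(pdf1.getPage(i))
        writer.addPage(page)
    with open(part_fp, "wb") as f:
        writer.write(f)
    return part_fp

def parallel_merge(first_fp, second_fp, target_fp, workers=4):
    n = PdfFileReader(first_fp).getNumPages()
    step = -(-n // workers)  # ceiling division
    jobs = [(first_fp, second_fp, s, min(s + step, n), "part_%d.pdf" % s)
            for s in range(0, n, step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(merge_slice, jobs))  # order is preserved
    merger = PdfFileMerger()
    for part in parts:
        merger.append(part)
    merger.write(target_fp)
    merger.close()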

Python Splitting PDF Leaves Meta-Text From Other Pages

The following code successfully splits a large PDF file into smaller PDFs of 2 pages each. However, if I look into one of the resulting files, I see meta-text from the others.
This is used to split the PDF into smaller ones:
import numpy as np
from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open(path + "multi.pdf", "rb"))
r = np.arange(inputpdf.numPages)
r2 = [(r[i], r[i + 1]) for i in range(0, len(r), 2)]

for i in r2:
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i[0]))
    output.addPage(inputpdf.getPage(i[1]))
    with open(path + "document-page %s.pdf" % i[0], "wb") as outputStream:
        output.write(outputStream)
This is used to get the meta-text of one of the resulting files (PyPDF2 will not read it):
import pdfx
path = path + 'document-page 8.pdf'
pdf = pdfx.PDFx(path)
pdf.get_text()
My issues with this are:
The process is super slow, and all I want is a 10-digit number in the upper-right corner of the first page. Can I somehow get just that part?
When looking at the result, it has text from adjacent pages of the original PDF file (which is why I'm calling it "meta-text"). Why is that? Can this be resolved?
Update:
pdf.get_references_count()
...shows 20 (there should only be 2)
Thanks in advance!

Combine two lists of PDFs one to one using Python

I have created a series of PDF documents (maps) using data driven pages in ESRI ArcMap 10. There is a page 1 and page 2 for each map, generated from separate *.mxd files. So I have one list of PDF documents containing page 1 of each map and one list containing page 2 of each map. For example: map1_001.pdf, map1_002.pdf, map1_003.pdf... map2_001.pdf, map2_002.pdf, map2_003.pdf... and so on.
I would like to append these maps, pages 1 and 2, together so that both page 1 and 2 are together in one PDF per map. For example: mapboth_001.pdf, mapboth_002.pdf, mapboth_003.pdf... (they don't have to go into a new pdf file (mapboth), it's fine to append them to map1)
For each map1_*.pdf
Walk through the directory and append the map2_*.pdf whose number (where the * is) in the file name matches
There must be a way to do it using python. Maybe with a combination of arcpy, os.walk or os.listdir, and pyPdf and a for loop?
for pdf in os.walk(datadirectory):
    ??
Any ideas? Thanks kindly for your help.
A PDF file is structured differently from a plain text file. Simply putting two PDF files together wouldn't work, as the files' structure and contents could be overwritten or corrupted. You could certainly write your own merger, but that would take a fair amount of time and intimate knowledge of how a PDF is structured internally.
That said, I would recommend that you look into pyPDF. It supports the merging feature that you're looking for.
This should properly find and collate all the files to be merged; it still needs the actual .pdf-merging code.
Edit: I have added pdf-writing code based on the pyPdf example code. It is not tested, but should (as nearly as I can tell) work properly.
Edit2: realized I had the map-numbering crossways; rejigged it to merge the right sets of maps.
import collections
import glob
import re

# probably need to install this module -
# pip install pyPdf
from pyPdf import PdfFileWriter, PdfFileReader

def group_matched_files(filespec, reg, keyFn, dataFn):
    res = collections.defaultdict(list)
    reg = re.compile(reg)
    for fname in glob.glob(filespec):
        data = reg.match(fname)
        if data is not None:
            res[keyFn(data)].append(dataFn(data))
    return res

def merge_pdfs(fnames, newname):
    fnames = list(fnames)  # materialize first; join() would otherwise consume a generator
    print("Merging {} to {}".format(",".join(fnames), newname))
    # create new output pdf
    newpdf = PdfFileWriter()
    # for each file to merge
    for fname in fnames:
        with open(fname, "rb") as inf:
            oldpdf = PdfFileReader(inf)
            # for each page in the file
            for pg in range(oldpdf.getNumPages()):
                # copy it to the output file
                newpdf.addPage(oldpdf.getPage(pg))
    # write finished output
    with open(newname, "wb") as outf:
        newpdf.write(outf)

def main():
    matches = group_matched_files(
        "map*.pdf",
        r"map(\d+)_(\d+)\.pdf$",
        lambda d: "{}".format(d.group(2)),
        lambda d: "map{}_".format(d.group(1))
    )
    for map, pages in matches.items():
        merge_pdfs((page + map + '.pdf' for page in sorted(pages)),
                   "merged{}.pdf".format(map))

if __name__ == "__main__":
    main()
I don't have any test PDFs to try combining, but I tested with a cat command on text files.
You can try this out (I'm assuming a Unix-based system): merge.py
import os
import re

files = os.listdir("/home/user/directory_with_maps/")
files = [x for x in files if re.search("map1_", x)]

while len(files) > 0:
    current = files[0]
    search = re.search(r"_(\d+)\.pdf", current)
    if search:
        name = search.group(1)
        cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=FULLMAP_%s.pdf %s map2_%s.pdf" % (name, current, name)
        os.system(cmd)
    files.remove(current)
Basically it goes through and grabs the map1 list, then walks through the numbers, assuming the matching map2 files exist. (A counter padded with zeros would get a similar effect; see the sketch below.)
Test the gs command first, though; I just grabbed it from http://hints.macworld.com/article.php?story=2003083122212228.
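The counter idea mentioned above could look roughly like this (a sketch reusing the gs command from the answer; it assumes three-digit numbering and that map1_/map2_ files come in matching pairs):
import os

# Zero-pad a running number (1 -> "001") and merge each pair with gs,
# stopping at the first missing file.
n = 1
while True:
    name = str(n).zfill(3)
    first, second = "map1_%s.pdf" % name, "map2_%s.pdf" % name
    if not (os.path.exists(first) and os.path.exists(second)):
        break
    cmd = ("gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite "
           "-sOutputFile=FULLMAP_%s.pdf %s %s" % (name, first, second))
    os.system(cmd)
    n += 1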
There are examples of how to do this on the pdfrw project page at Google Code:
http://code.google.com/p/pdfrw/wiki/ExampleTools
