Python PyPDF2 seek of closed file Error - python

I am making a PDF splitter and at first it seemed to work fine. But when I try to use multiple page regions, I keep getting this error: ValueError: seek of closed file.
If I omit pdf_file.close() the error stops, but all the PDFs created have no pages.
My code is here:
from PyPDF2 import PdfFileReader, PdfFileWriter

counter = 1
pdf_file = open(fileName2, 'rb')
pdf_reader = PdfFileReader(pdf_file)
pdf_writer = PdfFileWriter()
output_file2, _ = QtWidgets.QFileDialog.getSaveFileName(self, "Save file", fileName2_c2 + "_splited", "Folder will be created")
os.makedirs(r'{}'.format(output_file2 + "\\{}_splited".format(fileName2_c2)))
for z in list_pdf_split:
    try:
        pdf_file = open(fileName2, 'rb')
    except:
        print("error")
    print(z)
    c_z = z.split("-")
    for i in range(int(c_z[0]), int(c_z[1]) + 1):
        print(i)
        pdf_writer.addPage(pdf_reader.getPage(i - 1))
    output_file = open(output_file2 + "\\{}_splited".format(fileName2_c2) + "{}".format(counter) + ".pdf", 'wb')
    pdf_reader = PdfFileReader(pdf_file)
    pdf_writer = PdfFileWriter()
    pdf_writer.write(output_file)
    output_file.close()
    counter += 1
    pdf_file.close()

Your logic doesn't make much sense, in multiple places.
First, the problem you're asking about. Look at what you do with pdf_file and pdf_reader:
Open a file as pdf_file.
Create a PdfFileReader attached to pdf_file as pdf_reader.
Reopen the same file as pdf_file. This releases the old file, causing it to become garbage, so it will soon (usually immediately) get closed.
Repeatedly call getPage(i-1) on pdf_reader, which is probably attached to a closed file the first time through, and definitely is every time after that.
Create a new PdfFileReader with the file we opened in step 3, as pdf_reader.
Close the pdf_file you just opened, so pdf_reader is definitely now referencing a closed file.
Repeat steps 2-6.
You either need to do step 4 before step 3, or after step 5, or you need to have two different pdf_file variables so you can open the new one while still using the old one. I'm not sure which of the three you want, but as-is, you're reading from a closed file.
But I think it would be simpler to reorganize things to eliminate step 1: instead of trying to open files before the loop and then reopening them at the end of each iteration, just open them at the start of the loop, right where you need them.
Second, your writer is just as confused. Look at what you do with output_file and pdf_writer:
Create PdfFileWriter as pdf_writer.
Repeatedly add pages to it.
Open an output file as output_file.
Create a new PdfFileWriter as pdf_writer, discarding everything you wrote to the old one.
Write out the now-empty pdf_writer to output_file.
Repeat steps 2-5.
Again, you need to do step 5 somewhere else, probably before step 4. But, again, it's probably a lot simpler to reorganize things to eliminate step 1.

Sorry, I suppose I was too fast answering this question.
I moved pdf_writer and pdf_reader to the beginning of the for loop, as the old ones seem to hold on to the stream used for writing the PDF.

Related

Editing of txt files not saving when I concatenate them

I am fairly new to programming, so bear with me!
We have a task at school where we have to clean up three text files ("Balance1", "Saving", and "Withdrawal") and append them together into a new file. These files are just names and sums of money listed line by line, but some of it is jumbled. This is my code for the first file, Balance1:
with open('Balance1.txt', 'r+') as f:
    f_contents = f.readlines()
    # Then I start cleaning up the lines. Here I edit Anna's savings to an integer.
    f_contents[8] = "Anna, 600000"
    # Here I delete the blank lines and edit in the 50000 to Philip.
    del f_contents[3]
    del f_contents[3]
In the original text file Anna's savings is written like this: "Anna, six hundred thousand", and we have to make it look clean, so it's rather "NAME, SUM" (as an integer). When I print this as a list it looks good, but after I have done this with all three files I try to append them together into a file called "Balance.txt" like this:
filenames = ["Balance1.txt", "Saving.txt", "Withdrawal.txt"]
with open("Balance.txt", "a") as outfile:
    for filename in filenames:
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)
When I check the new text file "Balance" it has appended them together, but just as they were in the beginning and not with my edits. So it is not "cleaned up". Can anyone help me understand why this happens, and what I have to do so it appends the edited and clean versions?
In the first part, where you do the "editing" of the Balance1.txt file, this is what happens:
You open the file in read mode
You load the data into memory
You edit the in-memory data
And voila.
You never persisted the changes to any file on the disk. So when in the second part you read the content of all the files, you will read the data that was originally there.
So if you want to concatenate the edited data, you have 2 choices:
Pre-process the data by creating 3 final correct files (editing Balance1.txt and persisting it to another file, say Balance1_fixed.txt) and then in the second part, concatenate: ["Balance1_fixed.txt", "Saving.txt", "Withdrawal.txt"]. Total of 4 data file openings, more IO.
Use only the second loop you have, and correct the contents before writing them to the outfile. You can use readlines() as you did first, edit the specific lines, and then use writelines(). Total of 3 data file openings, less IO than the previous option.
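The second option can be sketched as below. This is a minimal sketch, and the `edits` mapping (filename → {0-based line index → replacement line}) is my own illustrative structure, not part of the assignment:

```python
def concat_cleaned(filenames, out_path, edits):
    """Concatenate text files, applying per-file line replacements on the way.

    edits: {filename: {line_index: replacement_line}} - replacement lines
    should include their trailing newline.
    """
    with open(out_path, "w") as outfile:
        for name in filenames:
            with open(name) as infile:
                lines = infile.readlines()
            # Apply the edits for this file, if any.
            for idx, replacement in edits.get(name, {}).items():
                lines[idx] = replacement
            # Drop blank lines as part of the cleanup.
            lines = [ln for ln in lines if ln.strip()]
            outfile.writelines(lines)
```

This way the cleaned data is written straight into "Balance.txt" without any intermediate `_fixed` files.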

PyPDF2 doesn't display barcode (pdf417) after using update_page_form_field_values()

I'm trying to modify a pdf found here
https://www.uscis.gov/sites/default/files/document/forms/i-130.pdf
It has a barcode at the bottom that I need to keep. However if I use the update_page_form_field_values() function the pdf doesn't show the barcode. How do I prevent this?
from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader("i-130.pdf")
pager = reader.getPage(0)
field_dict = {
    "Pt2Line4b_GivenName[0]": "Mark"
}
writer = PdfFileWriter()
writer.add_page(pager)
writer.update_page_form_field_values(pager, fields=field_dict)
with open("newfile.pdf", "wb") as fh:
    writer.write(fh)
I've tried modifying the basic fields by accessing the form annotations directly, but then I have issues with all the forms showing up. You can run this code snippet separately, but it can't update the GivenName field directly:
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import NameObject, IndirectObject, BooleanObject

reader = PdfFileReader("i-130.pdf")
pager = reader.getPage(0)
annot3 = pager['/Annots'][18].get_object()
annot3.update({NameObject("/V"): NameObject("Mark")})
writer = PdfFileWriter()
writer._root_object.update({NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})
need_appearances = NameObject("/NeedAppearances")
writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
with open("newfile.pdf", "wb") as fh:
    writer.write(fh)
Unfortunately, I can't really give you any details on why this is so and what negative effects might result from approaching it this way, BUT...
The issue is driven by the self.set_need_appearances_writer() call at the top of the update_page_form_field_values function definition. To test, I commented that line out in my PyPDF2 distribution (see how-to below, but be advised, this is NOT a good long term solution), and the following code:
from PyPDF2 import PdfFileReader, PdfFileWriter

rdr = PdfFileReader("i-130.pdf")
fields_dict = {
    "Pt2Line4b_GivenName[0]": "Mark"
}
wtr = PdfFileWriter()
wtr.append_pages_from_reader(rdr)
# wtr.set_need_appearances_writer()
wtr.update_page_form_field_values(wtr.pages[0], fields_dict)
with open("i-130-prefilled.pdf", "wb") as ofile:
    wtr.write(ofile)
produces output:
The bad news is that the resultant PDF is quite a bit smaller than the original (~150 KB smaller), so even if it looks like an identical copy with a few fields filled out, there's clearly something more that's lacking. One of those "things" is the font in which the field values are rendered. The original version rendered entered values in a fixed width font like courier new, but the modified version displays a standard "sans" font. Conversely, you can replace the append_pages_from_reader call with clone_document_from_reader in which case you'll end up with a prepopulated version that DOES preserve the font, but the file size is ~300 KB larger than the original. No matter how you slice it, something is "not the same".
Bottom line:
I'm not sure I'd go with any solution without understanding what happens "under the hood" with that set_need_appearances_writer call. As such, I'd advise direct engagement with the good folks who picked up the baton and started actively working on the PyPDF2 library again earlier this year. You can find them at https://github.com/py-pdf/PyPDF2.
As promised, here's how you can test the solution in your own environment (but please don't do this and call it a day ;)):
navigate to the dist folder for your PyPDF2 install and open _writer.py (in VSCode, you can right click on the call to update_page_form_field_values and select 'Go to Definition')
comment out the offending line as shown below:

Is there any feasible solution to read WOT battle results .dat files?

I am new here, trying to solve one of my pet questions about World of Tanks. I heard that every battle's data is saved on the client's disk in the Wargaming.net folder, and I want to do batch data analysis of our clan's battle performances.
It is said that these .dat files are a kind of JSON file, so I tried a couple of lines of Python code to read them, but failed.
import json
f = open('ex.dat', 'r', encoding='unicode_escape')
content = f.read()
a = json.loads(content)
print(type(a))
print(a)
f.close()
The code is very simple and obviously fails to make it. Well, could anyone tell me the truth about that?
Added on Feb. 9th, 2022
After trying another snippet in a Jupyter Notebook, it seems something can be read from the .dat files:
import struct
import numpy as np
import matplotlib.pyplot as plt
import io

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    fbuff = io.BufferedReader(f)
    N = len(fbuff.read())
    print('byte length: ', N)

with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    data = struct.unpack('b' * N, f.read(1 * N))
The result is a tuple of values, but I have no idea how to deal with it now.
Here's how you can parse some parts of it.
import pickle
import zlib
file = '4402905758116487.dat'
cache_file = open(file, 'rb') # This can be improved to not keep the file opened.
# Converting pickle items from python2 to python3 you need to use the "bytes" encoding or "latin1".
legacyBattleResultVersion, brAllDataRaw = pickle.load(cache_file, encoding='bytes', errors='ignore')
arenaUniqueID, brAccount, brVehicleRaw, brOtherDataRaw = brAllDataRaw
# The data stored inside the pickled file will be a compressed pickle again.
vehicle_data = pickle.loads(zlib.decompress(brVehicleRaw), encoding='latin1')
account_data = pickle.loads(zlib.decompress(brAccount), encoding='latin1')
brCommon, brPlayersInfo, brPlayersVehicle, brPlayersResult = pickle.loads(zlib.decompress(brOtherDataRaw), encoding='latin1')
# Lastly you can print all of these and see a lot of data inside.
The response contains a mixture of more binary files as well as some data captured from the replays.
This is not a complete solution but it's a decent start to parsing these files.
First you can look at the replay file itself in a text editor, but it won't show the code at the beginning of the file that has to be cleaned out. Then there is a ton of info that you have to read in and figure out; it is the stats for each player in the game. Then comes the part that has to do with the actual replay, which you don't need.
You can grab the player IDs and tank IDs from WoT developer area API if you want.
After loading the pickle files like gabzo mentioned, you will see that it is simply a list of values, and without knowing what each value refers to, it's hard to make sense of it. The identifiers for the values can be extracted from your game installation:
import zipfile

WOT_PKG_PATH = "Your/Game/Path/res/packages/scripts.pkg"
BATTLE_RESULTS_PATH = "scripts/common/battle_results/"

archive = zipfile.ZipFile(WOT_PKG_PATH, 'r')
for file in archive.namelist():
    if file.startswith(BATTLE_RESULTS_PATH):
        archive.extract(file)
You can then decompile the Python files (e.g. with uncompyle6) and go through the code to find the identifiers for the values.
One thing to note is that the list of values for the main pickle objects (like brAccount from gabzo's code) always has a checksum as the first value. You can use this to check whether you have the right order and the correct identifiers for the values. The way these checksums are generated can be seen in the decompiled python files.
I have been tackling this problem for some time (albeit in Rust): https://github.com/dacite/wot-battle-results-parser/tree/main/datfile_parser.

How can I remove every other page of a pdf with python?

I downloaded a pdf where every other page is blank and I'd like to remove the blank pages. I could do this manually in a pdf tool (Adobe Acrobat, Preview.app, PDFPen, etc.) but since it's several hundred pages I'd like to do something more automated. Is there a way to do this in python?
One way is to use PyPDF4; in your terminal first run
pip install pypdf4
Then create a .py script file similar to this:
# pdf_strip_every_other_page.py
from PyPDF4 import PdfFileReader, PdfFileWriter

number_of_pages = 500  # total pages in the original PDF
output_writer = PdfFileWriter()

with open("/path/to/original.pdf", "rb") as inputfile:
    pdfOne = PdfFileReader(inputfile)
    for i in range(0, number_of_pages):
        if i % 2 == 0:
            page = pdfOne.getPage(i)
            output_writer.addPage(page)
    with open("/path/to/output.pdf", "wb") as outfile:
        output_writer.write(outfile)
Note: you'll need to change the paths to what's appropriate for your scenario.
Obviously this script is rather crude and could be improved, but wanted to share it for anyone else wanting a quick way to deal with this scenario.

pyPdf PdfFileReader vs PdfFileWriter

I have the following code:
import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 12/Practice files"
input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
output_PDF = PdfFileWriter()
for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))
output_file_name = os.path.join(path, "Output/portion.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()
Until now I was just reading from PDFs, and later I learned to write from PDF to txt... but now this.
Why does PdfFileReader differ so much from PdfFileWriter?
Can someone explain this? I would expect something like:
import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 12/Practice files"
input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
output_file_name = os.path.join(path, "out Pride and Prejudice.pdf")
output_file = PdfFileWriter(file(output_file_name, "wb"))
for page_num in range(1, 4):
    page = input_file.getPage(page_num)
    output_file.addPage(page_num)
    output_file.write(page)
Any help???
Thanks
EDIT 0: What does .addPage() do?
for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))
Does it just create 3 BLANK pages?
EDIT 1: Can someone explain what happens when:
1) output_PDF = PdfFileWriter()
2) output_PDF.addPage(input_file.getPage(page_num))
3) output_PDF.write(output_file)
The 3rd one passes a JUST CREATED(!) object to output_PDF. Why?
The issue is basically the PDF Cross-Reference table.
It's a somewhat tangled spaghetti monster of references to pages, fonts, objects, elements, and these all need to link together to allow for random access.
Each time a file is updated, it needs to rebuild this table. The file is created in memory first so this only has to happen once, further decreasing the chances of torching your file.
output_PDF = PdfFileWriter()
This creates the space in memory for the PDF to go into. (to be pulled from your old pdf)
output_PDF.addPage(input_file.getPage(page_num))
add the page from your input pdf, to the PDF file created in memory (the page you want.)
output_PDF.write(output_file)
Finally, this writes the object stored in memory to a file, building the header and cross-reference table and linking everything together all hunky-dory.
Edit: Presumably, the JUST CREATED flag signals PyPDF to start building the appropriate tables and link things together.
--
in response to the why vs .txt and csv:
When you're copying from a text or CSV file, there's no existing data structures to comprehend and move to make sure things like formatting, image placement, and form data (input sections, etc) are preserved and created properly.
Most likely, it's done because PDFs aren't exactly linear - the "header" is actually at the end of the file.
If the file was written to disk every time a change was made, your computer needs to keep pushing that data around on the disk. Instead, the module (probably) stores the information about the document in an object (PdfFileWriter), and then converts that data into your actual PDF file when you request it.
