Add pdf metadata with accents in python - python

I want to change the metadata of the pdf file using this code:
from PyPDF2 import PdfFileReader, PdfFileWriter
title = "Vice-présidence pour l'éducation"
fin = open(filename, 'rb')
reader = PdfFileReader(fin)
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
metadata = reader.getDocumentInfo()
metadata.update({'/Title':title})
writer.addMetadata(metadata)
fout = open(filename, 'wb')
writer.write(fout)
fin.close()
fout.close()
It works fine if the title is in English (no accents), but when it has accents I get the following error:
TypeError: createStringObject should have str or unicode arg
How can I add a title with accents to the metadata?
Thank you

This error is raised only when the string argument passed to the library's createStringObject(string) function has the wrong type.
The function accepts a string or bytes value, checked against these definitions in utils.py:
import builtins
bytes_type = type(bytes()) # Works the same in Python 2.X and 3.X
string_type = getattr(builtins, "unicode", str)
I can only reproduce your error if I rewrite your code with an obviously wrong type like this (code is rewritten using with statement but only the commented line is important):
from PyPDF2 import PdfFileReader, PdfFileWriter
with open(inputfile, "rb") as fr, open(outputfile, "wb") as fw:
    reader = PdfFileReader(fr)
    writer = PdfFileWriter()
    writer.appendPagesFromReader(reader)
    metadata = reader.getDocumentInfo()
    # metadata.update({'/Title': "Vice-présidence pour l'éducation"})
    metadata.update({'/Title': [1, 2, 3]}) # <- wrong type here !
    writer.addMetadata(metadata)
    writer.write(fw)
It seems that the type of your string title = "Vice-présidence pour l'éducation" does not match whatever bytes_type or string_type resolves to. Either your title variable has an unexpected type (which I cannot see in your code, perhaps because it was reduced to an MCVE), or bytes_type and string_type do not resolve to the types the library authors intended (this could be a bug in the library or a broken installation; it is hard for me to tell).
Without reproducible code, it's hard to provide a solution, but hopefully this points you in the right direction. It may be enough to convert your string to whatever type bytes_type or string_type resolves to. Other solutions would have to be made on the library side or would simply be hacks.
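A minimal sketch of the type check suggested above, assuming Python 3. The helper name ensure_text is hypothetical and not part of PyPDF2; it only shows one way to normalize a title to a text string before handing it to the metadata dictionary, and to fail early with a clearer message if the value is neither text nor bytes.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a text string, decoding bytes if necessary.

    ensure_text is a hypothetical helper, not part of PyPDF2.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# Simulate a title that arrived as UTF-8 bytes (for example, read from a file).
title = ensure_text("Vice-présidence pour l'éducation".encode("utf-8"))
print(type(title).__name__, title)
```

Printing type(title) right before the metadata.update call is also a quick way to see which type actually reaches the library.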

Related

How to use PyPDF2 in a script?

import PyPDF2
from PyDF2 import PdfFileReader, PdfFileWriter
file_path="sample.pdf"
pdf = PdfFileReader(file_path)
with open("sample.pdf", "w") as f:'
for page_num in range(pdf.numPages):
pageObj = pdf.getPage(page_num)
try:
txt = pageObj.extractText()
txt = DocumentInformation.author
except:
pass
else:
f.write(txt)
f.close()
Error Received:
ModuleNotFoundError: No module named 'PyPDF2'
Writing my first ever script where I want to scan in a PDF then extract the text and write it to a txt file. I was trying to use pyPDF2 but I'm not sure how to use it in a script like this.
EDIT: I had success importing the os & sys like so.
import os
import sys
There are multiple issues:
from PyDF2 import ...: A typo. You meant PyPDF2 instead of PyDF2
PdfFileWriter was imported, but never used (side-note: It's PdfReader and PdfWriter in the latest version of PyPDF2)
with open("sample.pdf", "w") as f:': A syntax error
Lacking indentation of the next lines
Side-note: Did you know that you can simply write for page in pdf.pages?
DocumentInformation.author is wrong. I guess you meant pdf.metadata.author
You overwrite the txt variable - I don't understand why you don't use it before you re-assign it.
Maybe this is what you want:
from PyPDF2 import PdfReader
def get_text(pdf_file_path: str) -> str:
    text = ""
    reader = PdfReader(pdf_file_path)
    for page in reader.pages:
        text += page.extract_text()
    return text

text = get_text("example.pdf")
with open("example.txt", "w") as f:
    f.write(text)
Installation issues
In case you have installation issues, maybe the docs on installing PyPDF2 can help you?
If you execute your script in the console as python your_script_name.py you might want to check the output of
python -c "import PyPDF2; print(PyPDF2.__version__)"
That should show your PyPDF2 version. If it doesn't, the Python environment you're using doesn't have PyPDF2 installed. Please note that your system might have arbitrarily many Python environments.
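The same check can be sketched from inside a script, using only the standard library. The function name module_available is made up for this illustration; the point is to print which interpreter is running and whether that interpreter can find the package.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate the module `name`."""
    return importlib.util.find_spec(name) is not None


# sys.executable is the path of the interpreter running this script.
print("interpreter:", sys.executable)
print("PyPDF2 importable:", module_available("PyPDF2"))
```

If the interpreter path printed here differs from the one pip installed into, that would explain a ModuleNotFoundError.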

How do I know my file is attached in my PDF using PyPDF2?

I am trying to attach an .exe file into a PDF using PyPDF2.
I ran the code below, but my PDF file is still the same size.
I don't know if my file was attached or not.
from PyPDF2 import PdfFileWriter, PdfFileReader
writer = PdfFileWriter()
reader = PdfFileReader("doc1.pdf")
# check whether it works
print("doc1 has %d pages" % reader.getNumPages())
writer.addAttachment("doc1.pdf", "client.exe")
What am I doing wrong?
First of all, you have to use the PdfFileWriter class properly.
You can use appendPagesFromReader to copy pages from the source PDF ("doc1.pdf") to the output PDF (ex. "out.pdf"). Then, for addAttachment, the 1st parameter is the filename of the file to attach and the 2nd parameter is the attachment data (it's not clear from the docs, but it has to be a bytes-like sequence). To get the attachment data, you can open the .exe file in binary mode, then read() it. Finally, you need to use write to actually save the PdfFileWriter object to an actual PDF file.
Here is a working example:
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
with open("client.exe", "rb") as exe:
    writer.addAttachment("client.exe", exe.read())
with open("out.pdf", "wb") as f:
    writer.write(f)
Next, to check if attaching was successful, you can use os.stat(...).st_size to compare the file sizes (in bytes) before and after attaching the .exe file.
Here is the same example with checking for file sizes:
(I'm using Python 3.6+ for f-strings)
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
with open("client.exe", "rb") as exe:
    writer.addAttachment("client.exe", exe.read())
with open("out.pdf", "wb") as f:
    writer.write(f)
# Check result
print(f"size of SOURCE: {os.stat('doc1.pdf').st_size}")
print(f"size of EXE: {os.stat('client.exe').st_size}")
print(f"size of OUTPUT: {os.stat('out.pdf').st_size}")
The above code prints out
size of SOURCE: 42942
size of EXE: 989744
size of OUTPUT: 1031773
...which sort of shows that the .exe file was added to the PDF.
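To make the "sort of shows" a little more concrete, here is a small arithmetic sketch using the three sizes printed above. The function name attachment_growth_ratio is invented for this example, and the check is only a heuristic: the writer re-serializes the PDF, so the sizes are not expected to add up exactly.

```python
def attachment_growth_ratio(source_size, attachment_size, output_size):
    """Return the output's growth as a fraction of the attachment's size."""
    return (output_size - source_size) / attachment_size


# Sizes in bytes, taken from the printed output above.
ratio = attachment_growth_ratio(42942, 989744, 1031773)
print(f"output grew by {ratio:.1%} of the attachment size")
```

A ratio close to 1 is consistent with the attachment having been embedded; a ratio near 0 would suggest it was not.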
Of course, you can also check manually by opening the PDF in Adobe Reader.
As a side note, I am not sure why you want to attach .exe files to a PDF. It seems you can attach them, but Adobe treats them as security risks and may not allow them to be opened. You can use the same code above to attach another PDF file (or another document) instead of an executable, and it should still work.

How do I edit a YAML file through python using ruamel.yaml and not pyyaml?

I found a similar question here on StackOverflow, but when I tried to adapt the code [Anthon's response] to my situation, I noticed that it does not actually edit the YAML file. Also, I need to use ruamel.yaml, not PyYAML. Many of the answers I've reviewed use PyYAML.
import sys
import ruamel.yaml
yaml = ruamel.yaml.YAML()
# yaml.preserve_quotes = True
with open('elastic.yml') as fp:
    data = yaml.load(fp)
data['cluster.name'] = 'BLABLABLABLABLA'
data['node.name'] = 'HEHEHEHEHEHEHE'
yaml.dump(data, sys.stdout)
This code outputs the file with the correct edits; however, when I actually open the file (elastic.yml), the original content is unchanged.
This is my first experience with ruamel.yaml and I would rather stick with this because I've noticed PyYAML does not keep comments.
The YAML file after I run the python code:
cluster.name: my-application
# Use a descriptive name for the node:
node.name: HappyNode
The output to the console after I run the python code:
cluster.name: BLABLABLABLABLA
# Use a descriptive name for the node:
node.name: HEHEHEHEHEHEHE
I've tried adding this to the bottom of the code to ensure it writes to the file, as described here [Matheus Portela's response], but with no luck:
with open('elastic.yml', 'w') as f:
    yaml.dump(data, f)
I get the following error:
data['cluster.name'] = 'BLABLABLABLABLA'
TypeError: 'NoneType' object does not support item assignment
Assuming that the (unchanged) elastic.yml is your input, you can run:
import ruamel.yaml
file_name = 'elastic.yml'
yaml = ruamel.yaml.YAML()
with open(file_name) as fp:
    data = yaml.load(fp)
data['cluster.name'] = 'BLABLABLABLABLA'
data['node.name'] = 'HEHEHEHEHEHEHE'
with open(file_name, 'w') as fp:
    yaml.dump(data, fp)
# display file
with open(file_name) as fp:
    print(fp.read(), end='')
to get the following output:
cluster.name: BLABLABLABLABLA
# Use a descriptive name for the node:
node.name: HEHEHEHEHEHEHE
Since the program prints the content of the file after writing it, you can be sure the file has changed.
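The answer does not say why the question's second attempt failed with a TypeError. One possible explanation, offered as an assumption because the full failing script is not shown: if elastic.yml was opened in 'w' mode before it was loaded (or an earlier run failed after opening it), the file would have been emptied, and loading an empty YAML document yields None. The sketch below uses only the standard library and a temporary stand-in file to show the truncation itself.

```python
import os
import tempfile

# Create a small stand-in for elastic.yml in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "elastic.yml")
with open(path, "w") as fp:
    fp.write("cluster.name: my-application\n")

size_before = os.path.getsize(path)

# Opening in 'w' mode truncates the file immediately, before anything is written.
with open(path, "w"):
    pass

size_after = os.path.getsize(path)
print(size_before, size_after)
```

This is why the order matters: load the data first, close the file, and only then reopen it for writing.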

Converting a .csv.gz to .csv in Python 2.7

I have read the documentation and a few additional posts on SO and other various places, but I can't quite figure out this concept:
When you call csvFilename = gzip.open(filename, 'rb') then reader = csv.reader(open(csvFilename)), is that reader not a valid csv file?
I am trying to solve the problem outlined below, and am getting a coercing to Unicode: need string or buffer, GzipFile found error on line 41 and 7 (highlighted below), leading me to believe that the gzip.open and csv.reader do not work as I had previously thought.
Problem I am trying to solve
I am trying to take a results.csv.gz and convert it to a results.csv so that I can turn the results.csv into a python dictionary and then combine it with another python dictionary.
File 1:
alertFile = payload.get('results_file')
alertDataCSV = rh.dataToDict(alertFile) # LINE 41
alertDataTotal = rh.mergeTwoDicts(splunkParams, alertDataCSV)
Calls File 2:
import gzip
import csv
def dataToDict(filename):
    csvFilename = gzip.open(filename, 'rb')
    reader = csv.reader(open(csvFilename)) # LINE 7
    alertData={}
    for row in reader:
        alertData[row[0]]=row[1:]
    return alertData

def mergeTwoDicts(dictA, dictB):
    dictC = dictA.copy()
    dictC.update(dictB)
    return dictC
*edit: also forgive my non-PEP style of naming in Python
gzip.open returns a file-like object (same as what plain open returns), not the name of the decompressed file. Simply pass the result directly to csv.reader and it will work (the csv.reader will receive the decompressed lines). csv does expect text though, so on Python 3 you need to open it to read as text (on Python 2 'rb' is fine, the module doesn't deal with encodings, but then, neither does the csv module). Simply change:
csvFilename = gzip.open(filename, 'rb')
reader = csv.reader(open(csvFilename))
to:
# Python 2
csvFile = gzip.open(filename, 'rb')
reader = csv.reader(csvFile) # No reopening involved
# Python 3
csvFile = gzip.open(filename, 'rt', newline='') # Open in text mode, not binary, no line ending translation
reader = csv.reader(csvFile) # No reopening involved
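For a self-contained illustration of the Python 3 variant, the sketch below builds a tiny gzipped CSV in a temporary directory and reads it back into a dictionary, following the shape of the question's dataToDict. The sample rows and the snake_case name data_to_dict are made up for this example.

```python
import csv
import gzip
import os
import tempfile


def data_to_dict(filename):
    """Map the first column of a gzipped CSV to the remaining columns."""
    # Text mode with newline='' lets csv.reader handle line endings itself.
    with gzip.open(filename, "rt", newline="") as csv_file:
        return {row[0]: row[1:] for row in csv.reader(csv_file) if row}


# Build a tiny sample archive so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "results.csv.gz")
with gzip.open(sample_path, "wt", newline="") as fh:
    csv.writer(fh).writerows([["host", "web01", "up"], ["count", "3"]])

alert_data = data_to_dict(sample_path)
print(alert_data)
```

Note that no decompressed results.csv is ever written to disk; the reader consumes the decompressed lines directly.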
The following worked for me for python==3.7.9:
import gzip
my_filename = "my_compressed_file.csv.gz"
with gzip.open(my_filename, 'rt') as gz_file:
    data = gz_file.read() # read decompressed data
with open(my_filename[:-3], 'wt') as out_file:
    out_file.write(data) # write decompressed data
my_filename[:-3] strips the trailing .gz to get the actual filename, so that the output file does not get a random filename.
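If the archive is large, reading it into memory in one go may be undesirable. A possible variation, sketched here with the standard library's shutil.copyfileobj, copies the decompressed bytes in chunks instead. The helper name gunzip and the sample data are assumptions made for this example.

```python
import gzip
import os
import shutil
import tempfile


def gunzip(src_path):
    """Decompress src_path (ending in .gz) next to itself; return the new path."""
    dst_path = src_path[:-3]  # strip the trailing ".gz"
    with gzip.open(src_path, "rb") as gz_file, open(dst_path, "wb") as out_file:
        shutil.copyfileobj(gz_file, out_file)  # copies in chunks
    return dst_path


# Create a small sample archive so the sketch runs on its own.
sample = os.path.join(tempfile.mkdtemp(), "my_compressed_file.csv.gz")
with gzip.open(sample, "wb") as fh:
    fh.write(b"a,b\n1,2\n")

csv_path = gunzip(sample)
print(csv_path)
```

Working in binary mode also sidesteps any question about the file's text encoding, since the bytes are copied unchanged.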

python unicode csv export using pyramid

I'm trying to export mongodb that has non ascii characters into csv format.
Right now, I'm dabbling with pyramid and using pyramid.response.
from pyramid.response import Response
from mycart.Member import Member
#view_config(context="mycart:resources.Member", name='', request_method="POST", permission = 'admin')
def member_export( context, request):
    filename = 'member-'+time.strftime("%Y%m%d%H%M%S")+".csv"
    download_path = os.getcwd() + '/MyCart/mycart/static/downloads/'+filename
    member = Members(request)
    my_list = [['First Name,Last Name']]
    record = member.get_all_member( )
    for r in record:
        mystr = [ r['fname'], r['lname']]
        my_list.append(mystr)
    with open(download_path, 'wb') as f:
        fileWriter = csv.writer(f, delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for l in my_list:
            print(l)
            fileWriter.writerow(l)
    size = os.path.getsize(download_path)
    response = Response(content_type='application/force-download', content_disposition='attachment; filename=' + filename)
    response.app_iter = open(download_path , 'rb')
    response.content_length = size
    return response
In mongoDB, first name is showing 王, when I'm using print, it too is showing 王. However, when I used excel to open it up, it shows random stuff - ç¾…
However, when I tried to view it in shell
$ more member-20130227141550.csv
It managed to display the non ascii character correctly.
How should I rectify this problem?
I'm not a Windows guy, so I am not sure whether the problem may be with your code or with excel just not handling non-ascii characters nicely. But I have noticed that you are writing your file with python csv module, which is notorious for headaches with unicode.
Other users have reported success with using unicodecsv as a replacement for the csv module. Perhaps you could try dropping in this module as a csv writer and see if your problem magically goes away.
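The unicodecsv suggestion targets Python 2. On Python 3, a different technique that is often tried for this symptom is to write the file with the "utf-8-sig" encoding, which prepends a byte order mark that Excel can use to recognize the file as UTF-8. The sketch below is an alternative to the answer's suggestion, not a reproduction of it; the sample rows are made up, and whether it resolves the display problem depends on the Excel version in use.

```python
import csv
import os
import tempfile

rows = [["First Name", "Last Name"], ["王", "小明"]]  # made-up sample data

csv_path = os.path.join(tempfile.mkdtemp(), "member.csv")
# "utf-8-sig" writes a byte order mark before the UTF-8 encoded data.
with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)

# Read the raw bytes back to confirm the mark and the encoded name are present.
with open(csv_path, "rb") as f:
    raw = f.read()
print(raw[:3])
```

The file still displays correctly in a UTF-8 terminal, since the byte order mark is invisible there.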
