pyPdf: illegal UTF-16 surrogate

I have a pdf file that breaks pyPdf: http://tovotu.de/tests/test.pdf
This is the sample script:
from pyPdf import PdfFileWriter, PdfFileReader
outputPdf = PdfFileWriter()
inpdf = open("test.pdf","rb")
inputPdf = PdfFileReader(inpdf)
[outputPdf.addPage(x) for x in inputPdf.pages]
with open("output.pdf","wb") as outpdf:
    outputPdf.write(outpdf)
Error output is here: http://pastebin.com/0m38zhjQ
The error is the same when using PyPDF2 from GitHub. pdftk handles this PDF just like any other. Note that writing fails, but reading seems to work just fine!
Can you at least point me to the exact part of the pdf that causes that error? A workaround would be even nicer :)

Looks like a bug in PyPDF2. In this section:
if string.startswith(codecs.BOM_UTF16_BE):
    retval = TextStringObject(string.decode("utf-16"))
    retval.autodetect_utf16 = True
it assumes that any string starting with (0xFE, 0xFF) can be decoded as UTF-16. Your file contains a bytestring that begins that way but then contains invalid UTF-16.
The simplest fix is to comment out that if and unconditionally use the branch marked "# This is probably a big performance hit here".
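A gentler variant of that fix would keep the UTF-16 fast path and only fall back when decoding fails. The sketch below is not PyPDF2's actual code: decode_pdf_string is a hypothetical helper written for Python 3, and latin-1 merely stands in for whatever decoding the library's slower branch performs.

```python
import codecs


def decode_pdf_string(raw):
    """Decode a PDF string, tolerating a BOM that is followed by bad UTF-16."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        try:
            # The utf-16 codec reads the BOM and decodes as big-endian.
            return raw.decode("utf-16")
        except UnicodeDecodeError:
            # The bytes only looked like UTF-16 (e.g. an illegal surrogate).
            pass
    # Assumed fallback: latin-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1")
```

With this shape, a string that merely starts with 0xFE 0xFF no longer aborts the whole write; it is decoded byte by byte instead.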

Related

Nltk in xls files

There is no problem accessing the file but while reading I get the following error
from nltk.corpus.reader import WordListCorpusReader
reader= WordListCorpusReader("C:\\Users\samet\\nltk_data\\corpora\\bilgi\samet",
["politika.xls"])
a = reader.words()
print (a)

You'll want to make sure the file you're trying to load (politika.xls) is saved with utf-8 encoding. First I'll detail how I replicated your error, then I'll show an approach to solve it.
I was able to replicate your error as follows:
1. Create a new text document, "temp.txt".
2. Open it, add a few lines of random text, save and close it.
3. Rename "temp.txt" to "temp.xls".
4. Open "temp.xls".
5. Save as... "temp.xlsx".
6. Close the file.
7. Rename "temp.xlsm" to "politika.xls".
8. Try running your code (with the path corrected).
9. Receive your error: "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte"
There may be a more straightforward approach, but from the above error condition, this worked to fix it:
1. Create a backup copy of "politika.xls".
2. Rename "politika.xls" to "old_politika.xls".
3. Create a new text file "politika.txt".
   (Steps 3.1 - 3.4 may or may not be needed.)
   3.1. Open "politika.txt".
   3.2. Save as...
   3.3. Select Encoding >> (either ANSI or UTF-8 should work).
   3.4. Save and close the file.
4. Rename "politika.txt" to "politika.csv".
5. Open up "old_politika.xls".
6. Select and copy the data.
7. Open up "politika.csv".
8. Paste the data. Save and exit.
9. Rename "politika.csv" to "politika.xls".
10. Run your program. (See below for code / potential correction.)
Also, you'll want to fix your directory path. Make sure you escape every "\" in the path by writing it as "\\". You were missing a "\" in front of "\samet" in two places. Corrected code below:
from nltk.corpus.reader import WordListCorpusReader
reader= WordListCorpusReader("C:\\Users\\samet\\nltk_data\\corpora\\bilgi\\samet",
["politika.xls"])
a = reader.words()
print (a)
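The manual re-save steps above can also be scripted. This is only a sketch for Python 3: reencode_to_utf8 is a hypothetical helper, and cp1254 (Windows Turkish) is a guess at the original encoding, so substitute whatever your file really uses. It only helps when the file is plain text carrying an .xls name; a genuine binary Excel workbook would have to be exported to text first.

```python
def reencode_to_utf8(src_path, dst_path, src_encoding="cp1254"):
    """Read text stored in src_encoding and write the same text as UTF-8."""
    # newline="" leaves the original line endings untouched in both directions.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

For example, reencode_to_utf8("old_politika.xls", "politika.xls") would replace the copy-and-paste steps, assuming the guessed encoding is right.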
I hope this helps.

How do I fix an error in my python code where it is complaining about a string being wanted?

I was actually using code from a course at Udacity.com on Data Wrangling. The code file is very short so I was able to copy what they did and I still get an error. They use python 2.7.x. The course is about a year old, so maybe something about the functions or modules in the 2.7 branch has changed. I mean the code used by the instructors works.
I know that using the csv module or function would solve the issue but they want to demonstrate the use of a custom parse function. In addition, they are using the enumerate function. Here is the link to the gist.
This should be very simple and basic and that is why it is frustrating me. I know they are reading the file, which is a csv file, as binary, with the "rb" parameter to the line
with open("file.csv", "rb") as f:

You don't have matching characters in your csv file and the dictionaries in your test function. In particular, in your csv file you are using an em dash (U+2014) and in your firstline and tenthline dictionaries you are using a hyphen-minus (U+002D).
hex(ord(d[0]['US Chart Position'].decode('utf-8')))
'0x2014' # output: code point for the em dash character in csv file
hex(ord(firstline['US Chart Position']))
'0x2d' # output: code point for hyphen-minus
To fix it, just copy and paste the — character from the csv in your gist into the dictionaries in your source code to replace the - characters.
Make sure to include this comment at the top of your file:
# -*- coding: utf-8 -*-
This will ensure that Python knows to expect non-ascii characters in the source code.
Alternatively, you could replace all the — (em dash) characters in the csv file with hyphens:
sed 's/—/-/g' beatles-diskography.csv > beatles-diskography2.csv
Then, remember to use the new file name in your source code.
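If sed is not available, the same replacement can be done in Python. A minimal sketch for Python 3 (the interactive session above is Python 2); replace_em_dashes and normalize_csv are hypothetical helper names, and the file is assumed to be UTF-8 encoded.

```python
EM_DASH = "\u2014"      # the character found in the csv file
HYPHEN_MINUS = "\u002d"  # the plain "-" used in the dictionaries


def replace_em_dashes(text):
    """Return text with every em dash replaced by a hyphen-minus."""
    return text.replace(EM_DASH, HYPHEN_MINUS)


def normalize_csv(src_path, dst_path):
    """Copy a UTF-8 file, replacing em dashes along the way."""
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(replace_em_dashes(text))
```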

pygame image to base64

I capture screen of my pygame program like this
data = pygame.image.tostring(pygame.display.get_surface(),"RGB")
How can I convert it into a base64 string WITHOUT saving it to the HDD? It's important that nothing is written to disk. I know I can save it to a file and then encode the file to base64, but I can't seem to encode it "on the fly".
thanks

If you want, you can save it to a StringIO, which is basically a virtual file stored as a string.
However, I'd really recommend using the base64 module, which has a function called base64.b64encode. It handles your 'on the fly' requirement well.
Code example:
import base64
data = pygame.image.tostring(pygame.display.get_surface(),"RGB")
base64data = base64.b64encode(data)
Happy coding!

Actually, pygame.image.tostring() is a pretty strange function (I really don't understand the binary string it returns, and I can't find anything that can process it correctly).
There seems to be an enhancement issue on this at pygame bitbucket:
(https://bitbucket.org/pygame/pygame/issue/48/add-optional-format-argument-to)
I got around it like this:
import base64
import cStringIO
data = cStringIO.StringIO()
pygame.image.save(pygame.display.get_surface(), data)
data = base64.b64encode(data.getvalue())
So in the end you get a valid, correct base64 string, and it seems to work. I'm not sure about the image format yet, though; I will add more info tomorrow.
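For context on why the tostring() output is hard to consume: as far as I understand it, that string is just packed pixel bytes with no header, so base64-encoding it works, but the receiver has to know the width, height and pixel format to rebuild the image. A small Python 3 sketch without pygame, where raw_rgb is a hand-made stand-in for what a 2x2 surface would produce in "RGB" format:

```python
import base64

# Assumed stand-in for pygame.image.tostring(surface, "RGB") on a 2x2 surface:
# three bytes (red, green, blue) per pixel, row by row, no header.
width, height = 2, 2
raw_rgb = bytes([255, 0, 0,   0, 255, 0,
                 0, 0, 255,   255, 255, 255])

# Encoding needs no file at all; the result is plain ASCII text.
encoded = base64.b64encode(raw_rgb).decode("ascii")

# Decoding gives the same raw bytes back; the size must be known separately
# to split them into pixels again.
decoded = base64.b64decode(encoded)
pixels = [tuple(decoded[i:i + 3]) for i in range(0, len(decoded), 3)]
```

Saving through pygame.image.save into an in-memory buffer, as above, avoids this because the image file format carries its own dimensions.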

Putting gzipped data into a script as a string

I snagged a Lorem Ipsum generator last week, and I admit, it's pretty cool.
My question: can someone show me a tutorial on how the author of the above script was able to post the contents of a gzipped file into their code as a string? I keep getting examples of gzipping a regular file, and I'm feeling kind of lost here.
For what it's worth, I have another module that is quite similar (it generates random names, companies, etc), and right now it reads from a couple different text files. I like this approach better; it requires one less sub-directory in my project to place data into, and it also presents a new way of doing things for me.
I'm quite new to streams, IO types, and the like. Feel free to dump the links on my lap. Snippets are always appreciated too.

Assuming you are in a *nix environment, you just need gzip and a base64 encoder to generate the string. Let's assume your content is in file.txt; for the purpose of this example, I created a file with that name containing random bytes.
So you need to compress it first:
$ gzip file.txt
That will generate a file.txt.gz file that you now need to embed into your code. To do that, you need to encode it. A common way to do so is to use Base64 encoding, which can be done with the base64 program:
$ base64 file.txt.gz
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
Now you have all you need to use the contents of that file in your Python script:
from cStringIO import StringIO
from base64 import b64decode
from gzip import GzipFile
# this is the variable with your file's contents
gzipped_data = """
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
"""
# we now decode the file's content from the string and unzip it
orig_file_desc = GzipFile(mode='r',
                          fileobj=StringIO(b64decode(gzipped_data)))
# get the original's file content to a variable
orig_file_cont = orig_file_desc.read()
# and close the file descriptor
orig_file_desc.close()
Obviously, your program will depend on the base64, gzip and cStringIO python modules.
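The same string can also be produced from Python instead of the shell. This is a sketch for Python 3 (the code above is Python 2); embed_as_base64 and load_embedded are hypothetical helper names, not part of any library.

```python
import base64
import gzip


def embed_as_base64(data):
    """Compress bytes with gzip and return base64 text ready to paste into code."""
    compressed = gzip.compress(data)
    # encodebytes() breaks the output into short lines, like the base64 program.
    return base64.encodebytes(compressed).decode("ascii")


def load_embedded(text):
    """Reverse embed_as_base64: base64-decode, then gunzip."""
    # b64decode() ignores the newlines in the pasted block by default.
    return gzip.decompress(base64.b64decode(text))


original = b"The quick brown fox jumps over the lazy dog\n" * 20
pasted = embed_as_base64(original)
restored = load_embedded(pasted)
```

Printing pasted gives the block to put between the triple quotes; load_embedded then replaces the GzipFile/StringIO combination shown above.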

I'm not sure exactly what you're asking, but here's a stab...
The author of lipsum.py has included the compressed data inline in their code as chunks of Base64 encoded text. Base64 is an encoding mechanism for representing binary data using printable ASCII characters. It can be used for including binary data in your Python code. It is more commonly used to include binary data in email attachments...the next time someone sends you a picture or PDF document, take a look at the raw message and you'll see very much the same thing.
Python's base64 module provides routines for converting between base64 and binary representations of data...and once you have the binary representation of the data, it doesn't really matter how you got it, whether by reading it from a file or by decoding a string embedded in your code.
Python's gzip module can be used to decompress data. It expects a file-like object...and Python provides the StringIO module to wrap strings in the right set of methods to make them act like files. You can see that in lipsum.py in the following code:
sample_text_file = gzip.GzipFile(mode='rb',
    fileobj=StringIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED)))
This is creating a StringIO object containing the binary representation of the base64 encoded value stored in DEFAULT_SAMPLE_COMPRESSED.
All the modules mentioned here are described in the documentation for the Python standard library.
In general, I wouldn't recommend including data inline in your code like this unless the data is small and relatively static. Otherwise, ship it as a separate file inside your Python package, which makes it easier to edit and to track changes.
Have I answered the right question?

How about this: Zips and encodes a string, prints it out encoded, then decodes and unzips it again.
from StringIO import StringIO
import base64
import gzip
contents = 'The quick brown fox jumps over the lazy dog'
zip_text_file = StringIO()
zipper = gzip.GzipFile(mode='wb', fileobj=zip_text_file)
zipper.write(contents)
zipper.close()
enc_text = base64.b64encode(zip_text_file.getvalue())
print enc_text
sample_text_file = gzip.GzipFile(mode='rb',
    fileobj=StringIO(base64.b64decode(enc_text)))
DEFAULT_SAMPLE = sample_text_file.read()
sample_text_file.close()
print DEFAULT_SAMPLE

Old question, but I recently had to do this for AWS logs. In Python 3, use BytesIO instead of StringIO:
import base64
import gzip
from io import BytesIO
DEFAULT_SAMPLE_COMPRESSED = "Some base 64 encoded and gzip compressed string"
sample_text_file = gzip.GzipFile(
    mode='rb',
    fileobj=BytesIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED))
)
binary_text = sample_text_file.read()  # The decompressed content, as bytes
text = binary_text.decode()  # Decode the bytes (UTF-8 by default) into a str

Processing a Django UploadedFile as UTF-8 with universal newlines

In my django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and be encoded in a variety of encodings (ASCII, UTF-8).
For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):
Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which as I understand is a flag meaning 'this file is UTF-8').
Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
Complicating the issue is that I wish to pass the file in to the python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get django playing nice with UTF-8 I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)
I have tried using StringIO, mmap, codecs, and a wide variety of other ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded different errors, and none so far has been perfect. This shows some of the code that I feel came the closest:
import csv
import codecs
class CSVParser:
    def __init__(self, file):
        # 'file' is assumed to be an InMemoryUploadedFile object.
        dialect = csv.Sniffer().sniff(codecs.EncodedFile(file, "utf-8").read(1024))
        file.open()  # seek to 0
        self.reader = csv.reader(codecs.EncodedFile(file, "utf-8"),
                                 dialect=dialect)
        try:
            self.field_names = self.reader.next()
        except StopIteration:
            # The file was empty - this is not allowed.
            raise ValueError('Unrecognized format (empty file)')
        if len(self.field_names) <= 1:
            # This probably isn't a CSV file at all.
            # Note that the csv module will (incorrectly) parse ALL files, even
            # binary data. This will catch most such files.
            raise ValueError('Unrecognized format (too few columns)')
    # Additional methods snipped, unrelated to issue
Please note that I haven't spent much time on the actual parsing algorithm, so it may be wildly inefficient; right now I'm more concerned with getting the encoding to work as expected.
The problem is that the results are also not encoded, despite being wrapped in the Unicode codecs.EncodedFile file wrapper.
EDIT: It turns out, the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. It turns out the reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!

As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with python encoding.
If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.
I also noticed that, at least for InMemoryUploadedFiles, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the file descriptor. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.
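On Python 3 the csv module works on text, so a different wrapper is needed there. The following is only a sketch under assumptions: read_csv_rows is a hypothetical helper, the upload is taken to be any binary file-like object (io.BytesIO stands in for the uploaded file here), and "utf-8-sig" is chosen because it strips the \xef\xbb\xbf marker when present and still reads plain UTF-8.

```python
import csv
import io


def read_csv_rows(binary_file, encoding="utf-8-sig"):
    """Decode a binary file-like object and return its CSV rows as lists of str."""
    binary_file.seek(0)  # start from the beginning, whatever was read before
    # newline="" hands the line endings to the csv module untranslated, which
    # is how it expects to receive \n, \r\n and \r terminated files.
    text_file = io.TextIOWrapper(binary_file, encoding=encoding, newline="")
    try:
        return list(csv.reader(text_file))
    finally:
        # Detach so the wrapper does not close the underlying upload.
        text_file.detach()


# Stand-in for an upload: UTF-8 with a BOM and Windows line endings.
upload = io.BytesIO(b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,K\xc3\xb6ln\r\n")
rows = read_csv_rows(upload)
```

The explicit seek(0) plays the same role as the file.open() call described above.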

I use the csv.DictReader and it appears to be working well. I attached my code snippet, but it is basically the same as another answer here.
import csv as csv_mod
import codecs
file = request.FILES['file']
dialect = csv_mod.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
file.open()
csv = csv_mod.DictReader( codecs.EncodedFile(file,"utf-8"), dialect=dialect )

For CSV and Excel upload to django, this site may help.