Character encoding error when reading a PDF - Python

I need to read this PDF.
I am using the following code:
from PyPDF2 import PdfFileReader
f = open('myfile.pdf', 'rb')
reader = PdfFileReader(f)
content = reader.getPage(0).extractText()
f.close()
content = ' '.join(content.replace('\xa0', ' ').strip().split())
print(content)
However, the encoding comes out wrong; it prints:
Resultado da Prova de Sele“‰o do...
But I expected
Resultado da Prova de Seleção do...
How can I solve this?
I'm using Python 3.

The PyPDF2 extractText method returns Unicode, so you may need to just explicitly encode it. For example, explicitly encoding the Unicode into UTF-8:
# -*- coding: utf-8 -*-
correct = u'Resultado da Prova de Seleção do...'
print(correct.encode(encoding='utf-8'))
You're on Python 3, so you have Unicode under the hood, and Python 3 defaults to UTF-8. But I wonder if you need to specify a different encoding based on your locale.
# Show installed locales
import locale
from pprint import pprint
pprint(locale.locale_alias)
If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.
For example, a quick and dirty comparison of the good and bad strings you posted:
# -*- coding: utf-8 -*-
# Python 3.4
incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'
print("Incorrect String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in incorrect:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )

print("\n" * 2)
print("Correct String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in correct:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )
Relevant Output:
b'\xe2\x80\x9c' 8220
b'\xe2\x80\xb0' 8240
b'\xc3\xa7' 231
b'\xc3\xa3' 227
If you're getting code point 8220 where 231 (hex(231) == '0xe7') should be, then you're getting bad data back from PyPDF.

What I have tried is to replace the specific " ' " Unicode character with "’", which solves this issue. Please let me know if you still fail to generate the PDF with this approach.
text = text.replace("'", "’")
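If several characters come out wrong, the same replacement idea can be extended with a translation table, since you're on Python 3. This is only a sketch for this particular PDF: the two mappings below are assumptions based on the sample strings in the question, not a general fix.
# Map the mojibake characters observed in the question's sample back to the
# characters that were expected (this mapping is specific to this PDF).
fixes = str.maketrans({
    '\u201c': 'ç',  # LEFT DOUBLE QUOTATION MARK appeared where ç was expected
    '\u2030': 'ã',  # PER MILLE SIGN appeared where ã was expected
})
content = content.translate(fixes)
print(content)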

Related

UnicodeDecodeError: ('unknown', u'\xe0', 0, 1, '')

temp = "à la Carte"
print type(temp)
utemp = unicode(temp)
The code above results in an error.
My goal is to process the temp string and use find to check whether it contains a specific string, but I cannot process it due to the error:
UnicodeDecodeError: ('unknown', u'\xe0', 0, 1, '')
You need to specify the encoding: otherwise unicode() doesn't know what \xe0 means, because that is encoding-specific.
>>> temp = "à la Carte"
>>> utemp = unicode(temp,encoding="Windows-1252")
>>> utemp
u'\xe0 la Carte'
>>> print utemp
à la Carte
In Python 2, an ordinary string literal cannot hold such Unicode characters, so even if the parser manages to get through it, it is still an error. That's why there is a unicode literal type. So to make it work, first you have to declare the encoding of the Python file, and second, use a unicode literal, like this:
# -*- coding: utf-8 -*-
temp = u"à la Carte"
print type(temp)
utemp = unicode(temp)
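Once temp is a unicode object, the substring check mentioned in the question works as usual. A minimal sketch (the search term here is just an example):
# -*- coding: utf-8 -*-
temp = u"à la Carte"
if temp.find(u"à la") != -1:   # find() returns -1 when the substring is absent
    print "found it"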

Python re.findall fails at UTF-8 while rest of script succeeds

I have this script that reads a large amount of text files written in Swedish (frequently with the letters åäö). It prints everything just fine from the dictionary if I loop over d and dictionary[d]. However, the regular expression (built from the raw_input with u'.*' appended) fails to return the UTF-8 text properly.
# -*- coding: utf8 -*-
from os import listdir
import re
import codecs
import sys
print "Välkommen till SOU-sök!"
search_word = raw_input("Ange sökord: ")
dictionary = {}
for filename in listdir("20tal"):
    with open("20tal/" + filename) as currentfile:
        text = currentfile.read()
        dictionary[filename] = text

for d in dictionary:
    result = re.findall(search_word + u'.*', dictionary[d], re.UNICODE)
    if len(result) > 0:
        print "Filnament är:\n %s \noch sökresultatet är:\n %s" % (d, result)
Edit: The output is as follows:
If I input:
katt
I get the following output:
Filnament är: Betänkande och förslag angående vissa ekonomiska spörsmål berörande enskilda järnvägar - SOU 1929:2.txt
och sökresultatet är:
['katter, r\xc3\xa4ntor m. m.', 'katter m- m., men exklusive r \xc3\xa4 nor m.', 'kattemedel subventionerar', av totalkostnaderna, ofta \xe2\x80\x94 med eller utan', 'kattas den nuvarande bilparkens kapitalv\xc3\xa4rde till 500 milj.
Here, the Filename d is printed correctly but not the result of the re.findall
In Python 2.x, unicode list items normally print escaped (the list shows each item's repr) unless you loop through them or join them; maybe try something such as this:
result = ', '.join(result)
if len(result) > 0:
    print(u"Filnament är:\n %s \noch sökresultatet är:\n %s" % (d, result.decode('utf-8')))
Input:
katt
Result:
katter, räntor m. m. katter m- m., men exklusive r ä nor m. kattemedel subventionerar av totalkostnaderna, ofta — med eller utan kattas den nuvarande bilparkens kapitalvärde till 500 milj
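An alternative sketch, not from the answers above, assuming the .txt files and the terminal are UTF-8 encoded: decode everything to unicode up front, so re.findall sees å, ä and ö as single characters instead of UTF-8 byte pairs.
# -*- coding: utf-8 -*-
import codecs
import re
from os import listdir

search_word = raw_input("Ange sökord: ").decode('utf-8')  # assumes a UTF-8 terminal
for filename in listdir(u"20tal"):                 # unicode argument -> unicode names
    with codecs.open(u"20tal/" + filename, encoding='utf-8') as currentfile:
        text = currentfile.read()                  # already unicode
    result = re.findall(search_word + u'.*', text, re.UNICODE)
    if result:
        print u"Filnament är:\n %s \noch sökresultatet är:\n %s" % (filename, u', '.join(result))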
The way file names are normalized is file system and OS dependent. Your particular regex may not match the normalization method correctly. Hence, consider this solution by remram:
import fnmatch
import os
import sys
import unicodedata

def myglob(pattern, directory=u'.'):
    pattern = unicodedata.normalize('NFC', pattern)
    results = []
    enc = sys.getfilesystemencoding()
    for name in os.listdir(directory):
        if isinstance(name, bytes):
            try:
                name = name.decode(enc)
            except UnicodeDecodeError:
                # Filenames that are not proper unicode won't match any pattern
                continue
        if fnmatch.filter([unicodedata.normalize('NFC', name)], pattern):
            results.append(name)
    return results
I faced a similar problem here: Filesystem independent way of using glob.glob and regular expressions with unicode filenames in Python


Replace Specialchars in Python

I need to replace special characters in the filenames. I'm trying to do this at the moment with translate, but it's not really working well, and I hope you have an idea how to do this. It's to make a clean playlist; I've got a bad MP3 player in my car which can't handle umlauts or special characters.
My code so far
# -*- coding: utf-8 -*-
import os
import sys
import id3reader
pfad = os.path.dirname(sys.argv[1])+"/"
ordner = ""
table = {
    0xe9: u'e',
    0xe4: u'ae',
    ord(u'ö'): u'oe',
    ord(u'ü'): u'ue',
    ord(u'ß'): u'ss',
    0xe1: u'ss',
    0xfc: u'ue',
}

def replace(s):
    return ''.join(c for c in s if (c.isalpha() or c == " " or c == "-"))

fobj_in = open(sys.argv[1])
fobj_out = open(sys.argv[1] + ".new", "w")

for line in fobj_in:
    if (line.rstrip()[0:1] == "#" or line.rstrip()[0:1] == " "):
        print line.rstrip()[0:1]
    else:
        datei = pfad + line.rstrip()
        #print datei
        id3info = id3reader.Reader(datei)
        dateiname = str(id3info.getValue('performer')) + " - " + str(id3info.getValue('title'))
        #print dateiname
        arrPfad = line.split('/')
        dateiname = replace(dateiname[0:60])
        print dateiname
        # dateiname = dateiname.translate(table) + ".mp3"
        ordner = arrPfad[0] + "/" + dateiname
        # os.rename(datei, pfad + ordner)
        fobj_out.write(ordner + "\r\n")
fobj_in.close()
I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 37: ordinal not in range(128)
If I try to use translate on the id3 title, I get: TypeError: expected a character buffer object
If I need to get rid of non-ASCII characters, I often use:
>>> unicodedata.normalize("NFKD", u"spëcïälchärs").encode('ascii', 'ignore')
'specialchars'
which tries to convert characters to the ASCII part of their normalized Unicode decomposition.
Bad thing is, it throws away everything it does not know, and is not smart enough to transliterate umlauts (to ue, ae, etc).
But it might help you to at least play those mp3s.
Of course, you are free to do your own str.translate first, and wrap the result in this, to eliminate every non-ASCII character still left. In fact, if your replace is correct, this will solve your problem. I'd suggest you take a look at str.translate and str.maketrans, though.
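A minimal Python 2 sketch of that combination, assuming the input is already a unicode string and using a trimmed-down umlaut table like the one in the question: transliterate the known umlauts first, then strip whatever non-ASCII characters remain.
# -*- coding: utf-8 -*-
import unicodedata

table = {
    ord(u'ä'): u'ae',
    ord(u'ö'): u'oe',
    ord(u'ü'): u'ue',
    ord(u'ß'): u'ss',
}

def to_ascii(name):
    name = name.translate(table)                 # unicode.translate accepts a dict in Python 2
    name = unicodedata.normalize("NFKD", name)   # decompose remaining accented characters
    return name.encode('ascii', 'ignore')        # drop anything that is still non-ASCII

print to_ascii(u"Motörhead - Süßes Gift")        # Motoerhead - Suesses Gift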

Python UTF-16 CSV reader

I have a UTF-16 CSV file which I have to read. Python csv module does not seem to support UTF-16.
I am using Python 2.7.2. The CSV files I need to parse are huge, running into several GBs of data.
Answers to John Machin's questions below:
print repr(open('test.csv', 'rb').read(100))
Output with test.csv having just abc as content
'\xff\xfea\x00b\x00c\x00'
I think the CSV file was created on a Windows machine in the USA. I am using Mac OS X Lion.
If I use the code provided by phihag with a test.csv containing one record, this is the sample test.csv content used. Below is the print repr(open('test.csv', 'rb').read(1000)) output:
'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
Code by phihag
import codecs
import csv
with open('test.csv', 'rb') as f:
    sr = codecs.StreamRecoder(f, codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'), codecs.getreader('utf-16'), codecs.getwriter('utf-16'))
    for row in csv.reader(sr):
        print row
Output of the above code
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']
expected output is
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']
At the moment, the csv module does not support UTF-16.
In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:
# Python 3.x only
import csv
with open('utf16.csv', 'r', encoding='utf16') as csvf:
    for line in csv.reader(csvf):
        print(line)  # do something with the line
In Python 2.x, you can recode the input:
# Python 2.x only
import codecs
import csv

class Recoder(object):
    def __init__(self, stream, decoder, encoder, eol='\r\n'):
        self._stream = stream
        self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
        self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
        self._buf = ''
        self._eol = eol
        self._reachedEof = False

    def read(self, size=None):
        r = self._stream.read(size)
        raw = self._decoder.decode(r, size is None)
        return self._encoder.encode(raw)

    def __iter__(self):
        return self

    def __next__(self):
        if self._reachedEof:
            raise StopIteration()
        while True:
            line, eol, rest = self._buf.partition(self._eol)
            if eol == self._eol:
                self._buf = rest
                return self._encoder.encode(line + eol)
            raw = self._stream.read(1024)
            if raw == '':
                self._decoder.decode(b'', True)
                self._reachedEof = True
                return self._encoder.encode(self._buf)
            self._buf += self._decoder.decode(raw)

    next = __next__

    def close(self):
        return self._stream.close()

with open('test.csv', 'rb') as f:
    sr = Recoder(f, 'utf-16', 'utf-8')
    for row in csv.reader(sr):
        print (row)
open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:
try:
    from io import BytesIO
except ImportError:  # Python < 2.6
    from StringIO import StringIO as BytesIO
import csv

with open('utf16.csv', 'rb') as binf:
    c = binf.read().decode('utf-16').encode('utf-8')

for line in csv.reader(BytesIO(c)):
    print(line)  # do something with the line
The Python 2.x csv module documentation example shows how to handle other encodings.
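For reference, that recipe is roughly along these lines: feed the csv module UTF-8-encoded byte strings and decode each cell back to unicode afterwards (a sketch from memory; see the linked documentation for the exact version).
import csv

def utf_8_encoder(unicode_csv_data):
    # csv in Python 2 only copes with byte strings, so encode each line first
    for line in unicode_csv_data:
        yield line.encode('utf-8')

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data), dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode the UTF-8 cells back to unicode
        yield [unicode(cell, 'utf-8') for cell in row]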
I would strongly suggest that you recode your file(s) to UTF-8. Under the very likely condition that you don't have any Unicode characters outside the BMP, you can take advantage of the fact that UTF-16 is a fixed-length encoding to read fixed-length blocks from your input file without worrying about straddling block boundaries.
Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:
print repr(open('thefile.csv', 'rb').read(100))
Four possible ways of encoding u'abc'
\xfe\xff\x00a\x00b\x00c -> utf_16
\xff\xfea\x00b\x00c\x00 -> utf_16
\x00a\x00b\x00c -> utf_16_be
a\x00b\x00c\x00 -> utf_16_le
If you have any trouble with this step, edit your question to include the results of the above print repr()
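A small helper, not part of the answer, that applies the table above by sniffing the first bytes (a rough heuristic, assuming the file starts with ASCII text when there is no BOM):
def sniff_utf16_variant(path):
    # Guess which of the four layouts above the file uses.
    head = open(path, 'rb').read(4)
    if head[:2] in ('\xff\xfe', '\xfe\xff'):
        return 'utf_16'      # BOM present; the codec picks the byte order
    if head[:1] == '\x00':
        return 'utf_16_be'   # high byte first, no BOM
    if head[1:2] == '\x00':
        return 'utf_16_le'   # low byte first, no BOM
    return None              # probably not UTF-16 at all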
Step 2: Here's a Python 2.X recode-UTF-16*-to-UTF-8 script:
import sys
infname, outfname, enc = sys.argv[1:4]
fi = open(infname, 'rb')
fo = open(outfname, 'wb')
BUFSIZ = 64 * 1024 * 1024
first = True
while 1:
    buf = fi.read(BUFSIZ)
    if not buf: break
    if first and enc == 'utf_16':
        bom = buf[:2]
        buf = buf[2:]
        enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
        # KeyError means file doesn't start with a valid BOM
        first = False
    fo.write(buf.decode(enc).encode('utf8'))
fi.close()
fo.close()
Other matters:
You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.
The <85> being treated as end of record is a bit of a worry. Looks like 0x85 is being recognised as NEL (C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think that it means?
Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net
Update based on 1-line sample data provided.
This confirms my suspicions. Read this. Here's a quote from it:
... the C1 control characters ... are rarely used directly, except on specific platforms such as OpenVMS. When they turn up in documents, Web pages, e-mail messages, etc., which are ostensibly in an ISO-8859-n encoding, their code positions generally refer instead to the characters at that position in a proprietary, system-specific encoding such as Windows-1252 or the Apple Macintosh ("MacRoman") character set that use the codes provided for representation of the C1 set with a single 8-bit byte to instead provide additional graphic characters
This code:
s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
s2 = s1.decode('utf16')
print 's2 repr:', repr(s2)
from unicodedata import name
from collections import Counter
non_ascii = Counter(c for c in s2 if c >= u'\x80')
print 'non_ascii:', non_ascii
for c in non_ascii:
    print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
    c2 = c.encode('latin1').decode('cp1252')
    print "to:   U+%04X %s" % (ord(c2), name(c2, "<no name>"))

s3 = u''.join(
    c.encode('latin1').decode('1252') if u'\x80' <= c < u'\xA0' else c
    for c in s2
)
print 's3 repr:', repr(s3)
print 's3:', s3
produces the following (Python 2.7.2 IDLE, Windows 7):
s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
from: U+0085 <no name>
to: U+2026 HORIZONTAL ELLIPSIS
from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
to: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
from: U+0096 <no name>
to: U+2013 EN DASH
s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
s3: 1,2,G,S,H für e – m …,,I
Which do you think is a more reasonable interpretation of \x96: SPA, i.e. Start of Protected Area (used by block-oriented terminals), or EN DASH?
Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.
Just open your file with codecs.open, like this:
import codecs, csv
stream = codecs.open(<yourfile.csv>, encoding="utf-16")
reader = csv.reader(stream)
And work through your program with unicode strings, as you should do anyway if you are processing text.
