I need to read this PDF.
I am using the following code:
from PyPDF2 import PdfFileReader
f = open('myfile.pdf', 'rb')
reader = PdfFileReader(f)
content = reader.getPage(0).extractText()
f.close()
content = ' '.join(content.replace('\xa0', ' ').strip().split())
print(content)
However, the encoding is incorrect, it prints:
Resultado da Prova de Sele“‰o do...
But I expected
Resultado da Prova de Seleção do...
How to solve it?
I'm using Python 3
The PyPDF2 extractTest method returns UniCode. So you many need to just explicitly encode it. For example, explicitly encoding the Unicode into UTF-8.
# -*- coding: utf-8 -*-
correct = u'Resultado da Prova de Seleção do...'
print(correct.encode(encoding='utf-8'))
You're on Python 3, so you have Unicode under the hood, and Python 3 defaults to UTF-8. But I wonder if you need to specify a different encoding based on your locale.
# Show installed locales
import locale
from pprint import pprint
pprint(locale.locale_alias)
If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.
For example, a quick and dirty comparison of the good and bad strings you posted:
# -*- coding: utf-8 -*-
# Python 3.4
incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'
print("Incorrect String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in incorrect:
print(
'{}{}{}'.format(
char.encode(encoding='utf-8'),
' ' * 20, # Hack; Byte objects don't have __format__
ord(char)
)
)
print("\n" * 2)
print("Correct String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in correct:
print(
'{}{}{}'.format(
char.encode(encoding='utf-8'),
' ' * 20, # Hack; Byte objects don't have __format__
ord(char)
)
)
Relevant Output:
b'\xe2\x80\x9c' 8220
b'\xe2\x80\xb0' 8240
b'\xc3\xa7' 231
b'\xc3\xa3' 227
If you're getting code point 231, (>>>hex(231) # '0xe7) then you're getting back bad data back from PyPDF.
what i have tried is to replace specific " ' " unicode with "’" which solves this issue. Please let me know if u still failed to generate pdf with this approach.
text = text.replace("'", "’")
I have a XML file with Russian text:
<p>все чашки имеют стандартный посадочный диаметр - 22,2 мм</p>
I use xml.etree.ElementTree to do manipulate it in various ways (without ever touching the text content). Then, I use ElementTree.tostring:
info["table"] = ET.tostring(table, encoding="utf8") #table is an Element
Then I do some other stuff with this string, and finally write it to a file
f = open(newname, "w")
output = page_template.format(**info)
f.write(output)
f.close()
I wind up with this in my file:
<p>\xd0\xb2\xd1\x81\xd0\xb5 \xd1\x87\xd0\xb0\xd1\x88\xd0\xba\xd0\xb8 \xd0\xb8\xd0\xbc\xd0\xb5\xd1\x8e\xd1\x82 \xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xb4\xd0\xb0\xd1\x80\xd1\x82\xd0\xbd\xd1\x8b\xd0\xb9 \xd0\xbf\xd0\xbe\xd1\x81\xd0\xb0\xd0\xb4\xd0\xbe\xd1\x87\xd0\xbd\xd1\x8b\xd0\xb9 \xd0\xb4\xd0\xb8\xd0\xb0\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80 - 22,2 \xd0\xbc\xd0\xbc</p>
How do I get it encoded properly?
You use
info["table"] = ET.tostring(table, encoding="utf8")
which returns bytes. Then later you apply that to a format string, which is a str (unicode), if you do that you'll end up with a representation of the bytes object.
etree can return an unicode object instead if you use:
info["table"] = ET.tostring(table, encoding="unicode")
The problem is that ElementTree.tostring returns a binary object and not an actual string. The answer to this is:
info["table"] = ET.tostring(table, encoding="utf8").decode("utf8")
Try this - with output parameter being just the Russian string without utf-8 encoding.
import codecs
#output=u'все чашки имеют стандартный посадочный диаметр'
with codecs.open(newname, "w", "utf-16") as stream: #or utf-8
stream.write(output + u"\n")
I'm trying to convert characters in one list into characters in another list at the same index in Japanese (zenkaku to hangaku moji, for those interested), and I can't get the comparison to work. I am decoding into utf-8 before I compare (decoding into ascii broke the program), but the comparison doesn't ever return true. Does anyone know what I'm doing wrong? Here's the code (indents are a little wacky due to SO's editor):
#!C:\Python27\python.exe
# coding=utf-8
import os
import shutil
import sys
zk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
hk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
def main():
if len(sys.argv) > 1:
filename = sys.argv[1]
else:
print("Please specify a file to check.")
return
try:
f = open(filename, 'r')
except IOError as e:
print("Sorry! The file doesn't exist.")
return
filecontent = f.read()
f.close()
#y = zk[29]
#print y.decode('utf-8')
for f in filecontent:
for z in zk:
if f == z.decode('utf-8'):
print f
print filename
if __name__ == "__main__":
main()
Am I missing a step?
Several.
zk = [
u'。',
u'、',
u'「',
...
...
f = codecs.open(filename, 'r', encoding='utf-8')
...
I'll let you work out the rest now that the hard work's been done.
Make sure that zk and hk lists contain Unicode strings. Either use Unicode literals e.g., u'a' or decode them at runtime:
fromutf8 = lambda s: s.decode('utf-8') if not isinstance(s, unicode) else s
zk = map(fromutf8, zk)
hk = map(fromutf8, hk)
You could use unicode.translate() to convert characters in one list into characters in another list at the same index:
import codecs
translation_table = dict(zip(map(ord,zk), hk))
with codecs.open(sys.argv[1], encoding='utf-8') as f:
for line in f:
print line.translate(translation_table),
You need to convert everything to the same form, and the form is Unicode strings. Unicode strings have no encoding in the sense .encode() or .decode(). When having a non-unicode string, it is actually a stream of bytes that expresses the value in some encoding. When converting to Unicode, you have to .decode(). When storing Unicode string to a sequence of bytes, you have to .encode() the abstraction to concrete bytes.
This way, when loading Unicode strings from an UTF-8 encoded file, or you have to read it into the old strings (non Unicode, sequences of bytes) and then .decode('utf-8'), or you can use `codecs.open(..., encoding='utf-8') -- then you get Unicode strings automatically.
The form # coding=utf-8 is not the usual, but it is OK... if the editor (I mean the tool that you use to write the text) also thinks this way. Then the old strings are displayed by the editor correctly. In the case they should be .decode('utf-8')d to get Unicode. Old strings with ASCII characters only in the same source can also be converted to Unicode using the .decode('utf-8').
To summarize: you are de coding from bytes to Unicode, and you are en coding the Unicode strings into sequence of bytes. It seems from the question that you are doing the opposite.
The following is completely wrong:
for f in filecontent:
for z in zk:
if f == z.decode('utf-8'):
print f
because the filecontent is the result of f.read(). This way it is a sequence of bytes. The f in the loop is one byte. The z.decode('utf-8') returns one Unicode character. They cannot be compared. (By the way, the f is a kind of misleading name for a byte value.)