Pdf encoded in iso-8859-1 - python

How can I read text from a PDF file encoded in 'iso-8859-1' in Python?
I am trying to convert PDF to text using textract in Python, but with certain files I get "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 11: invalid continuation byte". I think the file is ISO-8859-1 encoded.
File "/home/kanika/mypython/lib/python3.5/site-packages/textract/parsers/__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)
File "/home/kanika/mypython/lib/python3.5/site-packages/textract/parsers/utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "/home/kanika/mypython/lib/python3.5/site-packages/textract/parsers/txt_parser.py", line 9, in extract
return stream.read()
File "/home/kanika/mypython/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 11: invalid continuation byte

Try this. It should work if you want to use textract:
import textract
text = textract.process("yourFile.pdf")
Here text will contain all the text in the PDF.
You can then write it to a new .txt file as you wish.
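If the extracted data is not valid UTF-8, one option is to decode it explicitly. Below is a minimal sketch, assuming the data is available as bytes. The helper name decode_with_fallback is made up for illustration and is not part of textract:

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "iso-8859-1")) -> str:
    """Try each codec in order and return the first successful decode."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Only reached if every codec in the list rejected the data.
    raise ValueError("none of the given encodings could decode the data")


# Sample bytes that are valid ISO-8859-1 but not valid UTF-8.
sample = "caf\u00e9 \u00e2".encode("iso-8859-1")
text = decode_with_fallback(sample)
print(text)
```

Note that iso-8859-1 maps every possible byte to a character, so the fallback never fails. If the real encoding is something else, the result will be readable but wrong in places.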

Related

Error when retrieving saved object using pickle

I am working with the MESA agent-based modelling package and using pickle to save the state of my intermediate model. But when I retrieve the saved model, execution ends with this error:
File "/home/demonwolf/PycharmProjects/pythonProject1/main.py", line 281, in <module>
empty_model = pickle.load(f)
File "/home/demonwolf/anaconda3/envs/ABM/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Any help would be appreciated.
Thanks in advance.
The file (the f argument in pickle.load(f)) should be opened in binary read ("rb") mode, not the default text ("r") mode:
import pickle
with open("path/to/your/pickle.bin", "rb") as f:
    empty_model = pickle.load(f)
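For reference, here is a small round-trip sketch showing binary mode on both the write and the read side. The dictionary state is only a stand-in for a real model object, and the file name is arbitrary:

```python
import os
import pickle
import tempfile

state = {"step": 42, "agents": ["a", "b"]}  # stand-in for a model object

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:   # pickle writes bytes, so open with "wb"
        pickle.dump(state, f)
    with open(path, "rb") as f:   # and read them back with "rb"
        restored = pickle.load(f)

print(restored == state)
```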

Want to upload a sqlite.db file to a swift container using python swiftclient and always get a utf-8 error

I am trying to upload a sqlite.db (binary) file to a Swift container using swiftclient in my Python code.
import swiftclient
swift_conn.put_object
File "/usr/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 43: invalid start byte
The code I am using is:
import swiftclient
bmdatabase = "./logs/test.db"
with open(bmdatabase, 'r') as bmdatabase_file:
    # remote
    correctbmdatabasename = bmdatabase.replace("./logs/", "")
    swift_conn.put_object(container_name, correctbmdatabasename,
                          contents=bmdatabase_file.read())
I finally found it myself: to read a binary file, I have to open it with 'rb', like this:
import swiftclient
bmdatabase = "./logs/test.db"
with open(bmdatabase, 'rb') as bmdatabase_file:
    # remote
    correctbmdatabasename = bmdatabase.replace("./logs/", "")
    swift_conn.put_object(container_name, correctbmdatabasename,
                          contents=bmdatabase_file.read())
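To see what 'rb' gives you, here is a small self-contained sketch with no Swift connection involved. The table name logs and the variable payload are made up for illustration. It creates a throwaway SQLite file and reads it back as raw bytes, which is the kind of value passed as contents above:

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "test.db")

    # Create a tiny database so there is a real binary file to read.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, msg TEXT)")
    conn.commit()
    conn.close()

    # Binary mode returns the bytes untouched; no text decoding is attempted.
    with open(db_path, "rb") as db_file:
        payload = db_file.read()

print(type(payload).__name__, payload[:15])
```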

how to set proper encoding for json.load

I have been trying to load JSON this way:
data = json.load(f)
For some reason that JSON file has Windows-1251 encoding, so trying to open it causes an error:
File "./labelme2voc.py", line 252, in main
data = json.load(f)
File "/home/dex/anaconda3/lib/python3.6/json/__init__.py", line 296, in load
return loads(fp.read(),
File "/home/dex/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 81: invalid continuation byte
How can I fix that? json.load doesn't seem to have an option for specifying the encoding.
Try this:
import json
filename = ...  # specify filename here
# Windows-1251 is 'cp1251' in Python ('cp1252' is a different, Western European code page)
with open(filename, encoding='cp1251') as f:
    data = json.loads(f.read())
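A runnable sketch of the same idea, assuming the file really is Windows-1251 (Python codec name cp1251). The sample data and file name are made up. It shows two equivalent routes: letting open() do the decoding, or decoding the raw bytes yourself before parsing:

```python
import json
import os
import tempfile

original = {"label": "\u043a\u043e\u0442"}  # three Cyrillic letters as sample text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")

    # Write the file as Windows-1251, keeping non-ASCII characters literal.
    with open(path, "w", encoding="cp1251") as f:
        json.dump(original, f, ensure_ascii=False)

    # Option 1: open() decodes the text, json.load parses the stream.
    with open(path, encoding="cp1251") as f:
        data = json.load(f)

    # Option 2: read raw bytes and decode explicitly before parsing.
    with open(path, "rb") as f:
        data_from_bytes = json.loads(f.read().decode("cp1251"))

print(data == data_from_bytes)
```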

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte

I am using hfcca to calculate cyclomatic complexity for C++ code. hfcca is a simple Python script (https://code.google.com/p/headerfile-free-cyclomatic-complexity-analyzer/). When I try to run the script to generate the output as an XML file, I get the following error:
Traceback (most recent call last):
File "./hfcca.py", line 802, in <module>
main(sys.argv[1:])
File "./hfcca.py", line 798, in main
print(xml_output([f for f in r], options))
File "./hfcca.py", line 798, in <listcomp>
print(xml_output([f for f in r], options))
File "/x/home06/smanchukonda/PREFIX/lib/python3.3/multiprocessing/pool.py", line 652, in next
raise value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte
Please help me with this.
It looks like the file contains characters encoded in latin1, and those bytes are not valid utf8. The file utility can be useful for figuring out what encoding a file should be treated as, e.g.:
monk@monk-VirtualBox:~$ file foo.txt
foo.txt: UTF-8 Unicode text
Here's what the byte means in latin1:
>>> b'\xe2'.decode('latin1')
'â'
The easiest fix is probably to convert the files to utf8.
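One way to do that conversion in Python is sketched below, assuming the source files really are latin1. The function name convert_to_utf8 and the file names are made up for illustration:

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Re-encode a text file from src_encoding to UTF-8."""
    # newline="" keeps the original line endings unchanged in both directions.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.cpp")
    dst = os.path.join(tmp, "legacy_utf8.cpp")

    # A source file containing latin1 bytes (0xef and 0xe2) in a comment.
    with open(src, "wb") as f:
        f.write(b"// na\xefve comment with \xe2\nint main() { return 0; }\n")

    convert_to_utf8(src, dst)

    with open(dst, "rb") as f:
        converted = f.read()

print(converted.decode("utf-8"))
```

Writing to a separate destination file keeps the original intact in case the guess about the source encoding turns out to be wrong.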
I had the same problem when rendering Markup("""yyyyyy"""), and I solved it using an online tool which removed the 'bad' characters: https://pteo.paranoiaworks.mobi/diacriticsremover/
It is a nice tool and even works offline.

Getting UnicodeDecodeError while accessing csv file

Input file : chars.csv :
4,,x,,2,,9.012,2,,,,
6,,y,,2,,12.01,±4,,,,
7,,z,,2,,14.01,_3,,,,
When I try to parse this file, I get this error even after specifying utf-8 encoding.
>>> f=open('chars.csv',encoding='utf-8')
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.2/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 36: invalid start byte
How to correct this error?
Version: Python 3.2.3
Your input file is clearly not UTF-8 encoded, so you have at least these options:
Use f=open('chars.csv', encoding='utf-8', errors='ignore') if the file is mostly UTF-8 and you don't mind some small data loss. For other values of the errors parameter, check the manual.
Simply use the proper encoding, such as latin-1, if you know it.
This is not UTF-8 encoding. The UTF-8 encoding of ± is \xC2\xB1 and that of U+0083 is \xC2\x83. As RobertT suggested, try Latin-1:
f=open('chars.csv',encoding='latin-1')
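To compare the options side by side, here is a short sketch that works on one row of the sample data held in memory, so no file is needed. The variable names are made up for illustration:

```python
import csv
import io

# One row from chars.csv, with the plus-minus sign stored as the single byte 0xb1.
raw = b"6,,y,,2,,12.01,\xb14,,,,\n"

# Strict UTF-8 decoding fails on the 0xb1 byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = raw.decode("utf-8", errors="ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD for it
latin = raw.decode("latin-1")                     # maps 0xb1 to the plus-minus sign

# Parse the correctly decoded text with the csv module.
row = next(csv.reader(io.StringIO(latin)))
print(strict_ok, row[7])
```

With errors='ignore' the value in that column becomes plain 4, which is the small data loss mentioned above. Decoding as latin-1 keeps the sign.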
