Process unicode strings in Python

I am using the fastText pre-trained model based on English Wikipedia. It works as expected...
https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_english.ipynb
But when I try the same code with some other language, I get an error as shown on this page...
https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_marathi.ipynb
The error is related to unicode:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 15: invalid start byte
I tried to open the file in raw binary mode. I changed the function load_words_raw in the load.py file:
with open(file_path, 'rb') as f:
And now I get a different error:
ValueError: could not convert string to float: b'\x00l\x02'
I have no idea how to handle this.

You should change the second line of the notebook file to point to the .vec file:
#!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.vec.gz
instead of the .bin file:
#!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.bin.gz
The .vec file holds plain-text word vectors, while the .bin file is fastText's binary model format, which a text loader cannot decode as UTF-8.
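As a quick check that the .vec file loads, here is a minimal sketch; it assumes gensim is installed, and the query word is only an illustrative placeholder:
from gensim.models import KeyedVectors

# .vec files are plain text ("word v1 v2 ... vN" per line), so a text
# loader can read them; .bin files are fastText's binary format and
# trip a UnicodeDecodeError in any reader that expects UTF-8 text.
model = KeyedVectors.load_word2vec_format('cc.mr.300.vec', binary=False)
print(model.most_similar('some_word', topn=5))  # hypothetical query word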

byte 0x80 in position 15: there is a possibility that the file is encoded in UTF-16. Try this:
with open(path, encoding='utf-16') as f:
    # your logic
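If you are not sure which encoding the file uses, one option is to sniff the byte order mark first; a minimal sketch (path being whatever file you are loading):
# peek at the first two bytes to guess between UTF-16 and UTF-8
with open(path, 'rb') as f:
    head = f.read(2)
# UTF-16 files usually start with a byte order mark, FF FE or FE FF
encoding = 'utf-16' if head in (b'\xff\xfe', b'\xfe\xff') else 'utf-8'
with open(path, encoding=encoding) as f:
    data = f.read()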

Try this one:
with open('crawl-D.txt', 'r', encoding='utf8') as file:
    data = file.read()
data will contain the whole file as a single string; parse the floats with float().
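Putting the pieces together for a fastText .vec file, where the first line typically holds the vocabulary size and dimension and every other line is a word followed by its float components, a sketch might look like this (the filename comes from the snippet above):
with open('crawl-D.txt', 'r', encoding='utf8') as file:
    vocab_size, dim = map(int, file.readline().split())  # header line
    vectors = {}
    for line in file:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:]]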

Related

How to decode this byte string from a file?

I used code provided by someone on Stack Overflow to read a binary file. The code is:
with open('OutputFile', 'rb') as f:
    text = f.read()
I printed text and it appeared as below. Now I need to decode it into binary data, much like 01010010....
I used text.decode('utf-8') to decode the text, but it showed this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 0: invalid start byte
How do I solve this issue? How do I read the data as binary strings?
b'\x88\xc6\xe7\x1a3\'1(\xd2\xb7*\xa5{a\xac\x0f\xabf\\\xe9Z\xa4\xb0\x116v\xe0}\xc5\xb8\xe1P\xf1\x01\xd7\xf63\x11\xec\xe7\xb7\xbeG\xaa\xdf\xd87\xdaK\x9c?\xe8\x0e\x84\xce|%\xcb\xc0\x1d\xe4\xe4\x02#c\xb1\xd0o\x1da\x87\nW\xb9\xc9\xb2\x08\x1c\xffP,4\x86\'\xecx$\x05\xa2\x1b\x8d\xc9\xe9\x12(t\xcba\xa5\xc9H\xb5[X\xf9\xdd\xa7q\x8e\x92r\xe2T\xe0a*\x13$\xdaS\x8bx\r\xc1\xa9~\xb7\xd8-\xef&\xad\xa8\xdd\xe5\x1d#\x99`\xa0\xb7\xbc\xe1\x96;:#SG\xea\xd2]\xfc\xc02\xb9\x01\xbe%\xbb?\x99\x0e\xa0{7Z\xa3\xa4\xcb\xe4\xe0C%1\xaf\xcb\x1e\xb3\xc6\xb1\xd7\xf8,,\x08\xc0\'\xbf\xde\xb5\xe3)~$\x05\x8c>\x88\xb8`\xce\xe3\x1a\x97zs\x05\x91\xcd\xee\xb9^S\x8c\x8f=\x0f\xe6\xf5TZ\xb24c\xf0YZ\xac\xf9\x87\x05,\x04\xf7>\x0c\xf6c/t\xbayB\x06\x0cd\x0f\x15\x1eZ\x9c#\xb7\\-\x16A\x06#\r\x12\x19\x85YV\xb3\x7f$\xc4}\xab\xda\xf5\xebO\xcf#/\x1ea\xa7\x03E\xb3\xef4\x11\x05hCJF(\x93\r\xb9\xa6\x84\x15\x8a\xda\xbe\x12\xff\xd2\xa5\x19y\xea\xb5H\xbd\x97\xc8\x81\xd5\'\xadN\xd8s\x0c\x0f\x97\xcb3d\xfa\xf6&\n\xdc\xd5\xd4\x15\x87\x08\xcb\xeb\xb4\x07\xf8)IE\xfd\x1am_C\xf2x\x04a\xa8\xdc\xb3G\xa4\xeaq6O\xe6D\xb4]d\x93\x95`\xe6W\xe2w\xc8^\t\xdc\x13aJE\xafU];V\x1f\xda\x96\xc8t\xdfk\x96\xc6\xd5\xc0B.\xeb\xac)<\xa7F\xce\x0c\xf1\xf1v\x18\xba\xf6^#0\x14\x1c#\xc7r\x86\xc4\xd6\x0e\xca\x94c\xf8m!\xeb57\xc7P\t\x1a\xed\xc8#7h\xc2\x03\xd9M\xdf\xdd\x05\x7f\xecS\x1c\xd4\xca\x84\xf5\xb3\xe5<\x1f\xb5\x05\xd8$\x8dC_J\n\x89\xe7\x8b\xb7\x00\x95\xe9\x8ct!\xe8\xf3\x82|\x9f|6ORa_J\x9c*\xf9\x0b\x1emV\x91\x93#\x91+\x18^wfK\x01\xc8\xd7&[\x13\xeb[\xb8\x0b\xf0.8\xb1\t)#\xc5H\xa0O0H"n\xeaQ\xc0p\xdaZm\xf0A^\xed)m\xd2\xef[$\xce\x9d\xd9\x97\xd3K\xdc\x1c\xeb\x17\xc4\x0e-\xc7p\xe5\x7f\xcf\xa5l\x95q\xb9\xe7WB2\xb5\x8c\xf3\xdf_\xea\x02\x1cx5\x92>\xf1\xcec\x9b\x01\xd3\xa8\x89\xb6\x85O\x04\xd4W\xe5\xfa\t\xe6-\xaa?r\x166\xe6\xed\x80\xf4\xe2\xe6\x83\x0f\xae!\xc7C\xff#'
The data is already in binary format internally, since you read the file in mode "rb". Here's how to print it out as 0s and 1s:
with open('OutputFile', 'rb') as f:
    data = f.read()
binary = ''.join('{:08b}'.format(byte) for byte in data)
print(binary)
Output:
100010001100011011100111000110100011001100100111001100010010100011010010101101110010101010100101011110110110000110101100000011111010101101100110010111001110100101011010101001001011000000010001001101100111011011100000011111011100010110111000111000010101000011110001000000011101011111110110001100110001000111101100111001111011011110111110010001111010101011011111110110000011011111011010010010111001110000111111111010000000111010000100110011100111110000100101110010111100000000011101111001001110010000000010001000110110001110110001110100000110111100011101011000011000011100001010010101111011100111001001101100100000100000011100111111110101000000101100001101001000011000100111111011000111100000100100000001011010001000011011100011011100100111101001000100100010100001110100110010110110000110100101110010010100100010110101010110110101100011111001110111011010011101110001100011101001001001110010111000100101010011100000011000010010101000010011001001001101101001010011100010110111100000001101110000011010100101111110101101111101100000101101111011110010011010101101101010001101110111100101000111010010001110011001011000001010000010110111101111001110000110010110001110110011101001000000010100110100011111101010110100100101110111111100110000000011001010111001000000011011111000100101101110110011111110011001000011101010000001111011001101110101101010100011101001001100101111100100111000000100001100100101001100011010111111001011000111101011001111000110101100011101011111111000001011000010110000001000110000000010011110111111110111101011010111100011001010010111111000100100000001011000110000111110100010001011100001100000110011101110001100011010100101110111101001110011000001011001000111001101111011101011100101011110010100111000110010001111001111010000111111100110111101010101010001011010101100100011010001100011111100000101100101011010101011001111100110000111000001010010110000000100111101110011111000001100111101100110001100101111011101001011101001111001010000100000011000001100011001000000111100010101000111100101101010011100001000111011011101011100001011010001011001000001000001100100000000001101000100100001100110000101010110010101011010110011011111110010010011000100011111011010101111011010111101011110101101001111110011110100000000101111000111100110000110100111000000110100010110110011111011110011010000010001000001010110100001000011010010100100011000101000100100110000110110111001101001101000010000010101100010101101101010111110000100101111111111010010101001010001100101111001111010101011010101001000101111011001011111001000100000011101010100100111101011010100111011011000011100110000110000001111100101111100101100110011011001001111101011110110001001100000101011011100110101011101010000010101100001110000100011001011111010111011010000000111111110000010100101001001010001011111110100011010011011010101111101000011111100100111100000000100011000011010100011011100101100110100011110100100111010100111000100110110010011111110011001000100101101000101110101100100100100111001010101100000111001100101011111100010011101111100100001011110000010011101110000010011011000010100101001000101101011110101010101011101001110110101011000011111110110101001011011001000011101001101111101101011100101101100011011010101110000000100001000101110111010111010110000101001001111001010011101000110110011100000110011110001111100010111011000011000101110101111011001011110010000000011000000010100000111000010001111000111011100101000011011000100110101100000111011001010100101000110001111111000011011010010000111101011001101010011011111000111010100000000100100011010111011011100100000100011001
1011101101000110000100000001111011001010011011101111111011101000001010111111111101100010100110001110011010100110010101000010011110101101100111110010100111100000111111011010100000101110110000010010010001101010000110101111101001010000010101000100111100111100010111011011100000000100101011110100110001100011101000010000111101000111100111000001001111100100111110111110000110110010011110101001001100001010111110100101010011100001010101111100100001011000111100110110101010110100100011001001100100011100100010010101100011000010111100111011101100110010010110000000111001000110101110010011001011011000100111110101101011011101110000000101111110000001011100011100010110001000010010010100101000000110001010100100010100000010011110011000001001000001000100110111011101010010100011100000001110000110110100101101001101101111100000100000101011110111011010010100101101101110100101110111101011011001001001100111010011101110110011001011111010011010010111101110000011100111010110001011111000100000011100010110111000111011100001110010101111111110011111010010101101100100101010111000110111001111001110101011101000010001100101011010110001100111100111101111101011111111010100000001000011100011110000011010110010010001111101111000111001110011000111001101100000001110100111010100010001001101101101000010101001111000001001101010001010111111001011111101000001001111001100010110110101010001111110111001000010110001101101110011011101101100000001111010011100010111001101000001100001111101011100010000111000111010000111111111101000000
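To go the other way, from the 0/1 string back to the original bytes, the 8-bit groups can be re-parsed with int(..., 2); a small sketch continuing from the code above:
# rebuild the bytes from the bit string, 8 bits at a time
data_again = bytes(int(binary[i:i + 8], 2) for i in range(0, len(binary), 8))
assert data_again == data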

'utf-8' codec can't decode byte 0x8a in position 170: invalid start byte

I am trying to do this:
fh = request.FILES['csv']
fh = io.StringIO(fh.read().decode())
reader = csv.DictReader(fh, delimiter=";")
This always fails with the error in the title, and I have spent almost 8 hours on it.
Here is my understanding: I am using Python 3, so the file fh is in bytes. I am decoding it into a string and putting it in memory via StringIO, then trying to read it as dicts with csv.DictReader(). It fails at the decode step.
I also tried io.StringIO(fh.read().decode('utf-8')), but I get the same error.
What am I missing? :/
The error occurs because there is some non-ASCII byte sequence in the file that can't be decoded. One simple way to handle such strings is with the encode()/decode() functions, as follows (if a is the bytes value containing the non-ASCII data):
a.decode('utf-8')
Also, you could try opening the file as:
with open('filename', 'r', encoding='utf-8') as f:
    # your code, using f as the file handle
Use 'rb' if your file is binary.
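Applied to the upload from the question, that might look like the sketch below; it assumes a Django-style request object as in the question, and cp1252 is only a guess at a fallback encoding:
import csv
import io

raw = request.FILES['csv'].read()  # bytes from the upload
try:
    text = raw.decode('utf-8')
except UnicodeDecodeError:
    text = raw.decode('cp1252')  # guess: a Windows/Excel export
reader = csv.DictReader(io.StringIO(text), delimiter=';')
for row in reader:
    print(row)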

Error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools
An error occurred when compiling "process.py" on the above site.
python tools/process.py --input_dir data --operation resize --output_dir data2/resize
data/0.jpg -> data2/resize/0.png
Traceback (most recent call last):
  File "tools/process.py", line 235, in <module>
    main()
  File "tools/process.py", line 167, in main
    src = load(src_path)
  File "tools/process.py", line 113, in load
    contents = open(path).read()
  File "/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the cause of the error?
Python's version is 3.5.2.
Python tries to convert a byte-array (a bytes which it assumes to be a utf-8-encoded string) to a unicode string (str). This process of course is a decoding according to utf-8 rules. When it tries this, it encounters a byte sequence which is not allowed in utf-8-encoded strings (namely this 0xff at position 0).
Since you did not provide any code we could look at, we only could guess on the rest.
From the stack trace we can assume that the triggering action was the reading from a file (contents = open(path).read()). I propose to recode this in a fashion like this:
with open(path, 'rb') as f:
contents = f.read()
That b in the mode specifier in the open() states that the file shall be treated as binary, so contents will remain a bytes. No decoding attempt will happen this way.
Use this solution if you want to strip out (ignore) the offending characters and return the string without them. Only use this if your need is to strip them, not convert them:
with open(path, encoding="utf8", errors='ignore') as f:
With errors='ignore' you'll just lose some characters. But if you don't care about them, as they seem to be extra characters originating from the bad formatting and programming of the clients connecting to my socket server, then it's an easy, direct solution.
Use the encoding ISO-8859-1 to solve the issue.
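For example, a minimal sketch (path is a placeholder for the file in question):
# ISO-8859-1 maps every possible byte to a character, so this never
# raises a UnicodeDecodeError (though non-Latin-1 text may come out wrong)
with open(path, encoding='iso-8859-1') as f:
    contents = f.read()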
Had an issue similar to this; I ended up using UTF-16 to decode. My code is below.
with open(path_to_file, 'rb') as f:
    contents = f.read()
contents = contents.decode("utf-16").rstrip("\n")
contents = contents.split("\r\n")
This reads the raw file contents as bytes, decodes them from UTF-16, and then separates them into lines.
I've come across this thread when suffering the same error. After doing some research I can confirm this is an error that happens when you try to decode a UTF-16 file with UTF-8.
With UTF-16, the first character (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF, and the second the other.
Heavily edited after I found out the real answer
It simply means that one chose the wrong encoding to read the file.
On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.
I had a similar issue with PNG files, and I tried the solutions above without success. This one worked for me in Python 3.8: open the file in binary mode,
with open(path, "rb") as f:
and use only
base64.b64decode(a)
instead of
base64.b64decode(a).decode('utf-8')
since the result of the base64 decode is raw image bytes, not UTF-8 text.
This is caused by reading the file with a different encoding from the one it was written with. Python decodes text files with a default encoding, which does not work the same way on all platforms. Here is an encoding that may help if 'utf-8' does not work:
import csv

with open(path, newline='', encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile)
It should work if you change the encoding here. You can also find other encodings in the standard-encodings list, if the one above doesn't work for you.
Those getting similar errors while handling data frames with Pandas can use the following solution, for example:
df = pd.read_csv("File path", encoding='cp1252')
I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not manage to overcome this issue using other encoder types. But instead of using
pd.read_csv(filename, delimiter=';')
I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
which just seems working fine for me.
Note: in the open() function, use 'r' instead of 'rb'. 'rb' returns a bytes object, which is what causes this decoder error in the first place, and it is the same problem inside read_csv(). 'r' returns str, which is what we need since the data is in a .csv file, and with the default encoding='utf-8' parameter read_csv() can parse it easily.
If you are receiving data from a serial port, make sure you are using the right baud rate (and the other configs): decoding with UTF-8 while the serial config is wrong will produce this same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To check your serial port config on Linux, use: stty -F /dev/ttyUSBX -a
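As a sketch of the reading side, assuming pyserial and a placeholder port name:
import serial  # assumption: pyserial is installed

# with a mismatched baud rate the bytes arrive garbled and fail to decode
ser = serial.Serial('/dev/ttyUSB0', baudrate=9600, timeout=1)
raw = ser.readline()
text = raw.decode('utf-8', errors='replace')  # keep going despite bad bytes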
I had a similar issue and searched all over the internet for this problem.
If you have this problem, just copy your HTML code into a new HTML file and use the normal <meta charset="UTF-8">, and it will work.
Just create a new HTML file in the same location and use a different name.
Check the path of the file to be read. My code kept giving me errors until I changed the path to the present working directory. The error was:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
If you are on a Mac, check for a hidden file, .DS_Store. After removing that file my program worked.
I had a similar problem.
Solved it by:
import io
with io.open(filename, 'r', encoding='utf-8') as fn:
    lines = fn.readlines()
However, I had another problem. Some html files (in my case) were not utf-8, so I received a similar error. When I excluded those html files, everything worked smoothly.
So, except from fixing the code, check also the files you are reading from, maybe there is an incompatibility there indeed.
You have to use the encoding latin1 to read this file, as it contains some special characters; use the code snippet below to read the file.
The problem here is the encoding type. When Python can't decode the data being read, it gives an error.
You can use latin1 or other encoding values.
I say try and test to find the right one for your dataset.
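A minimal sketch of such a snippet (the original was not included in the answer; the filename is a placeholder):
import pandas as pd

# latin1 accepts every byte value, so the read itself will not raise
df = pd.read_csv('your_file.csv', encoding='latin1')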
I had the same issue when processing a file generated on Linux. It turned out to be related to files containing question marks.
Following code worked in my case:
df = pd.read_csv(filename,sep = '\t', encoding='cp1252')
If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise do it programmatically at the OS level.

'utf-8' codec can't decode byte 0x80

I'm trying to download the BVLC-trained model and I'm stuck with this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 110: invalid start byte
I think it's because of the following function (complete code)
# Closure-d function for checking SHA1.
def model_checks_out(filename=model_filename, sha1=frontmatter['sha1']):
    with open(filename, 'r') as f:
        return hashlib.sha1(f.read()).hexdigest() == sha1
Any idea how to fix this?
You are opening a file that is not UTF-8 encoded, while the default encoding for your system is set to UTF-8.
Since you are calculating a SHA1 hash, you should read the data as binary instead. The hashlib functions require you pass in bytes:
with open(filename, 'rb') as f:
    return hashlib.sha1(f.read()).hexdigest() == sha1
Note the addition of b in the file mode.
See the open() documentation:
mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. [...] In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
and from the hashlib module documentation:
You can now feed this object with bytes-like objects (normally bytes) using the update() method.
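Since the hash object can be fed incrementally with update(), large files don't have to be read in one go; a minimal sketch:
import hashlib

def sha1_of_file(path, chunk_size=8192):
    # hash the file in chunks so big model files need not fit in memory
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()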
You didn't specify to open the file in binary mode, so f.read() is trying to read the file as a UTF-8-encoded text file, which doesn't seem to be working. But since we take the hash of bytes, not of strings, it doesn't matter what the encoding is, or even whether the file is text at all: just open it, and then read it, as a binary file.
>>> with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
Traceback (most recent call last):
  File "<ipython-input-3-fdba09d5390b>", line 1, in <module>
    with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
  File "/home/dsm/sys/pys/Python-3.5.1-bin/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 10: invalid start byte
but
>>> with open("test.h5.bz2","rb") as f: print(hashlib.sha1(f.read()).hexdigest())
21bd89480061c80f347e34594e71c6943ca11325
Since there is not a single hint in the documentation nor in the source code, I have no clue why, but using the b char (I guess for binary) totally works (tf version: 1.1.0):
image_data = tf.gfile.FastGFile(filename, 'rb').read()
For more information, check out: gfile

UnicodeDecodeError: save to file in python

I want to read a file, find something in it and save the result, but when I try to save it, I get an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Code to save to file:
fileout.write((key + ';' + nameDict[key]+ ';'+src + alt +'\n').decode('utf-8'))
What can I do to fix it?
Thank you
You are trying to concatenate unicode values with byte strings, then turn the result to unicode, to write it to a file object that most likely only takes byte strings.
Don't mix unicode and byte strings like that.
Open the file to write to with io.open() to automate encoding Unicode values, then handle only unicode in your code:
import io

with io.open(filename, 'w', encoding='utf8') as fileout:
    # code gathering stuff from BeautifulSoup
    fileout.write(u'{};{};{}{}\n'.format(key, nameDict[key], src, alt))
You may want to check out the csv module to handle writing out delimiter-separated values. If you do go that route, you'll have to explicitly encode your columns:
import csv

with open(filename, 'wb') as fileout:
    writer = csv.writer(fileout, delimiter=';')
    # code gathering stuff from BeautifulSoup
    row = [key, nameDict[key], src + alt]
    writer.writerow([c.encode('utf8') for c in row])
If some of this data comes from other files, make sure you also decode to Unicode first; again, io.open() to read these files is probably the best option, to have the data decoded to Unicode values for you as you read.
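For what it's worth, in Python 3 the built-in open() handles the encoding itself and the csv module accepts str rows directly; a sketch using the same names as above:
import csv

with open(filename, 'w', encoding='utf8', newline='') as fileout:
    writer = csv.writer(fileout, delimiter=';')
    # code gathering stuff from BeautifulSoup
    writer.writerow([key, nameDict[key], src + alt])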
