I am extracting the content from different file types into a csv file.
I am currently trying to extract from the file type 'm'.
That's my extraction function:
def extract_m(f):  # f is the file
    with open(f, encoding="utf8") as text:
        lines = text.read()
    lines = cleaning(lines)
    return lines
This code works until certain characters appear in the document. Then my program raises UnicodeDecodeError: 'utf-8' codec can't decode byte
In some other file types the program was crashing when I tried to write the extracted data into the csv file. To fix that I used the cleaning() function which replaced the troublesome characters.
But now the program crashes at the line lines = text.read()
So the program cannot go into the cleaning() function.
I tried
text = f.read().decode(errors='replace')
But then I get the Error AttributeError: 'str' object has no attribute 'decode'
I don't know why my function cannot open the file anymore.
Edit: You were all correct. One of the files is encoded in cp1252.
When I put errors='replace' in open(), the file opens without error, but the text contains '�' symbols.
Try this instead:
def extract_m(f):  # f is the file
    with open(f, encoding="utf8", errors='replace') as text:
        lines = text.read()
    lines = cleaning(lines)
    return lines
The errors argument accepts the following values (from the docs):
'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
'surrogateescape' will represent any incorrect bytes as low surrogate code units ranging from U+DC80 to U+DCFF. These surrogate code units will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
'backslashreplace' replaces malformed data by Python’s backslashed escape sequences.
'namereplace' (also only supported when writing) replaces unsupported characters with \N{...} escape sequences.
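To see what these handlers actually do when decoding, here is a minimal sketch. The sample bytes are an assumption chosen for illustration (the word "café" saved as cp1252, which is not valid UTF-8); they are not taken from the asker's files.

```python
# "café" as a cp1252 editor would save it: the byte 0xe9 is not valid UTF-8 here.
raw = "café".encode("cp1252")

# Only handlers that apply to decoding are shown; 'xmlcharrefreplace' and
# 'namereplace' are for writing, and 'strict' would simply raise.
for handler in ("ignore", "replace", "backslashreplace", "surrogateescape"):
    decoded = raw.decode("utf-8", errors=handler)
    print(f"{handler:>16}: {decoded!r}")

# surrogateescape keeps enough information to get the original bytes back.
escaped = raw.decode("utf-8", errors="surrogateescape")
assert escaped.encode("utf-8", errors="surrogateescape") == raw
```

Note that 'replace' produces the '�' symbol mentioned in the edit above, so it hides the problem rather than recovering the original character.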
I have a Prophet model stored in a Google Cloud Storage folder, and now I want to read this model in my code to run a prediction pipeline. The model object was stored as JSON using this link https://facebook.github.io/prophet/docs/additional_topics.html
For this, I first download the JSON object locally from the bucket and then try to use the model_from_json() method. However, I keep getting the error below:
import json
from google.cloud import bigquery, storage
from prophet.serialize import model_to_json, model_from_json
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob('/GCSpath/to/.json')
blob.download_to_filename('mymodel.json') # download the file locally
with open('mymodel.json', 'r') as fin: m = model_from_json(json.load(fin))
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/python/3.7.11/lib/python3.7/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/Users/python/3.7.11/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I tried the method specified here too but it still does not work - Downloading a file from google cloud storage inside a folder
What is the correct way to save and load Prophet models?
The error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte indicates that the contents of your file are not valid UTF-8.
This means that the file contains bytes that cannot be decoded as UTF-8, for example text with Cyrillic or other non-ASCII characters that was saved in a different encoding. Check this here for a reference on the difference between Unicode and UTF, you will find some examples too.
I would recommend checking your files for such characters and removing them. The error message also gives the position where the problem was found, so you could start from there.
On the other hand, if reviewing file by file and removing characters is not viable, you could also try opening your files in binary.
Instead of using 'r' in the open() command:
with open('mymodel.json', 'r') as fin: m = model_from_json(json.load(fin))
Try using 'rb':
with open('mymodel.json', 'rb') as fin: m = model_from_json(json.load(fin))
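This may solve your problem, since opening a file in binary mode skips the text decoding done by open(). Note that json.load() still has to decode the bytes itself (it accepts UTF-8, UTF-16, or UTF-32 input), so this only helps if the file really contains JSON in one of those encodings. You may find more information about file reading in Python here, and more about how or why to read files in binary here.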
I think the error message is quite clear: 'utf-8' cannot decode the data in your file.
When you use open(), which is a Python built-in function, it takes an "encoding" argument whose default is platform dependent (UTF-8 on your system, as the error shows).
You need to find the encoding that matches the data in your file and pass it as encoding=your-encoding-code.
Hope this helps!
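Before choosing an encoding, it can help to look at the first bytes of the downloaded file without decoding them. The sketch below is only a heuristic under stated assumptions: file_head and looks_like_json_text are hypothetical helper names, and the pickle file is written purely to produce a non-JSON file that starts with byte 0x80. Whether the asker's file is actually a pickle is not known from the question.

```python
import json
import os
import pickle
import tempfile

def file_head(path, count=16):
    """Return the first `count` bytes of a file, read without any decoding."""
    with open(path, "rb") as fh:
        return fh.read(count)

def looks_like_json_text(head):
    """Heuristic: JSON objects, arrays and strings start with '{', '[' or '"'."""
    # Skip an optional UTF-8 BOM and leading whitespace before checking.
    stripped = head.lstrip(b"\xef\xbb\xbf \t\r\n")
    return stripped[:1] in (b"{", b"[", b'"')

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "model.json")
    other_path = os.path.join(tmp, "model.bin")
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump({"growth": "linear"}, fh)
    with open(other_path, "wb") as fh:
        fh.write(pickle.dumps({"growth": "linear"}))  # starts with byte 0x80
    json_head = file_head(json_path)
    other_head = file_head(other_path)

print(json_head, looks_like_json_text(json_head))
print(other_head, looks_like_json_text(other_head))
```

If the head does not look like JSON, no choice of encoding will make json.load() succeed, and the object in the bucket is probably not the JSON file you expected.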
I have a .xlsx file and transformed it into a .csv file. Then I'm uploading the .csv file to a Python script I wrote, but an error is thrown.
Since the file is uploaded through HTTP, I'm accessing it with file = request.files['file']. This returns an object of type FileStorage. I then try to read it with a StringIO object as follows:
io.StringIO(file.stream.read().decode("UTF8"), newline=None)
I'm getting the following error:
TypeError: initial_value must be str or None, not bytes
I also tried to read the file of FileStorage object this way:
file_data = file.read().decode("utf-8")
and I'm getting the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 97: invalid start byte
It may be worth noting that I am able to read the file directly, i.e. as a csv file, with the following code:
with open('file_path', 'r') as file:
    csv_reader = csv.reader(file, delimiter=";")
    ...
But since I'm getting the file from an upload button, i.e. an input HTML element of type file, as mentioned above, I get a FileStorage object, which I am not able to read.
Does anyone have an idea how I could approach this?
Thank you in advance!
It could be that it's not encoded in utf-8. Try decoding it as latin-1 instead:
file_data = file.read().decode("latin-1")
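Putting the pieces together, here is a minimal sketch of decoding uploaded bytes and feeding them to csv.reader. It assumes the bytes come from something like file.read() on the FileStorage object; rows_from_upload is a hypothetical helper name, and the sample data is invented so the example runs on its own.

```python
import csv
import io

def rows_from_upload(data, encodings=("utf-8", "latin-1"), delimiter=";"):
    """Decode uploaded bytes with the first encoding that works, then parse as CSV."""
    for name in encodings:
        try:
            text = data.decode(name)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings could decode the data")
    # StringIO needs a str, which is why the bytes are decoded first.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

# In a Flask view this would be: data = request.files['file'].read()
sample = "name;city\nJürgen;Zürich\n".encode("latin-1")  # 'ü' becomes byte 0xfc
rows = rows_from_upload(sample)
print(rows)
```

Because latin-1 maps every byte to a character, it never raises. That makes it a convenient last resort, but it also means a wrong guess goes unnoticed.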
https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools
An error occurred when running "process.py" from the above site.
python tools/process.py --input_dir data --operation resize --output_dir data2/resize
data/0.jpg -> data2/resize/0.png
Traceback (most recent call last):
File "tools/process.py", line 235, in <module>
main()
File "tools/process.py", line 167, in main
src = load(src_path)
File "tools/process.py", line 113, in load
contents = open(path).read()
File "/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the cause of the error?
Python's version is 3.5.2.
Python tries to convert a byte-array (a bytes which it assumes to be a utf-8-encoded string) to a unicode string (str). This process of course is a decoding according to utf-8 rules. When it tries this, it encounters a byte sequence which is not allowed in utf-8-encoded strings (namely this 0xff at position 0).
Since you did not provide any code we could look at, we can only guess at the rest.
From the stack trace we can assume that the triggering action was the reading from a file (contents = open(path).read()). I propose to recode this in a fashion like this:
with open(path, 'rb') as f:
    contents = f.read()
That b in the mode specifier in the open() states that the file shall be treated as binary, so contents will remain a bytes. No decoding attempt will happen this way.
Use this solution if you want to strip out (ignore) the undecodable characters and get the string without them. Only use it if you need to strip them, not convert them.
with open(path, encoding="utf8", errors='ignore') as f:
With errors='ignore' you simply lose some characters. In my case I didn't care about them, since they were extra characters caused by badly formatted data from the clients connecting to my socket server.
In that situation it's an easy, direct solution.
reference
Use encoding format ISO-8859-1 to solve the issue.
I had an issue similar to this and ended up using UTF-16 to decode. My code is below.
with open(path_to_file, 'rb') as f:
    contents = f.read()
# Decode first: rstrip("\n") with a str argument fails on a bytes object.
contents = contents.decode("utf-16").rstrip("\n")
contents = contents.split("\r\n")
This reads the file contents as raw bytes, decodes them as UTF-16, and then splits the text into lines.
I came across this thread while suffering from the same error. After doing some research I can confirm that this error happens when you try to decode a UTF-16 file as UTF-8.
With UTF-16 the first character (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF and the second, the other.
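The BOM check described above can be sketched with the constants in the standard codecs module. encoding_from_bom is a hypothetical helper name; it only recognises files that actually start with a BOM and returns None otherwise.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def encoding_from_bom(head):
    """Return a codec name if `head` starts with a known BOM, else None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None

utf16_bytes = "héllo".encode("utf-16")   # the 'utf-16' codec writes a BOM
print(encoding_from_bom(utf16_bytes))
print(encoding_from_bom("héllo".encode("utf-8")))
print(encoding_from_bom(b"\xff\xd8\xff\xe0"))  # start of a JPEG, not text at all
```

The last case matters for this thread: a leading 0xff does not have to be a BOM. JPEG files also start with FF D8, and those should be opened in binary mode rather than decoded.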
Heavily edited after I found out the real answer
It simply means that one chose the wrong encoding to read the file.
On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.
I had a similar issue with PNG files, and I tried the solutions above without success.
This one worked for me in Python 3.8:
with open(path, "rb") as f:
use only
base64.b64decode(a)
instead of
base64.b64decode(a).decode('utf-8')
This is due to the encoding used when reading the file. By default, Python decodes text files with a platform-dependent encoding, so the same code may not work on every platform.
If 'utf-8' does not work, try another encoding, for example:
with open(path, newline='', encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile)
It should work if you change the encoding here. If cp1252 doesn't work for you either, you can find other encodings in the standard-encodings list.
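The "try utf-8, then something else" idea can be sketched as a small helper. read_csv_rows is a hypothetical name and the candidate list is an assumption; adjust it to the encodings you actually expect. The demo file is created only so the example is self-contained.

```python
import csv
import os
import tempfile

def read_csv_rows(path, encodings=("utf-8", "cp1252")):
    """Return (encoding_used, rows), trying each candidate encoding in order."""
    for name in encodings:
        try:
            with open(path, newline="", encoding=name) as csvfile:
                return name, list(csv.reader(csvfile))
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError(f"could not decode {path!r} with any of {encodings}")

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as fh:
        # In cp1252 the euro sign is the single byte 0x80, which is invalid UTF-8.
        fh.write("price,note\r\n5,€ café\r\n".encode("cp1252"))
    used, rows = read_csv_rows(demo_path)

print(used, rows)
```

Trying UTF-8 first is deliberate: it is strict and rejects most non-UTF-8 input, whereas a single-byte codec such as cp1252 accepts almost anything and can silently produce wrong characters.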
If you get a similar error while loading data frames with Pandas, use the following solution.
Example:
df = pd.read_csv("File path", encoding='cp1252')
I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not manage to overcome this issue using other encoder types. But instead of using
pd.read_csv(filename, delimiter=';')
I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
which seems to work fine for me.
Note: in open(), use 'r' rather than 'rb'. With 'r', Python decodes the file itself and hands read_csv() a text stream, so pandas no longer applies its own default of UTF-8. Since no encoding is passed to open(), the platform's default encoding is used, which is presumably why this worked for my file.
If you are receiving data from a serial port, make sure you are using the right baud rate (and the other settings): decoding as UTF-8 with the wrong configuration will produce the same error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To check your serial port configuration on Linux, use: stty -F /dev/ttyUSBX -a
I had a similar issue and searched all over the internet for this problem.
If you have this problem, just copy your HTML code into a new HTML file and use the normal <meta charset="UTF-8">
and it will work.
Just create the new HTML file in the same location under a different name.
Check the path of the file to be read. My code kept on giving me errors until I changed the path name to present working directory. The error was:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
If you are on a Mac, check for a hidden file, .DS_Store. After removing the file my program worked.
I had a similar problem.
Solved it by:
import io
with io.open(filename, 'r', encoding='utf-8') as fn:
    lines = fn.readlines()
However, I had another problem. Some html files (in my case) were not utf-8, so I received a similar error. When I excluded those html files, everything worked smoothly.
So, apart from fixing the code, also check the files you are reading from; the incompatibility may really be there.
You have to read this file with the latin1 encoding, as it contains some special characters.
The problem here is the encoding type. When Python can't convert the data being read, it raises an error.
You can use latin1 or other encoding values.
I suggest trying and testing to find the right one for your dataset.
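One caveat with latin1 is worth illustrating: it maps every possible byte to a character, so it never raises, even when it is the wrong encoding. The sketch below uses invented sample text to show the effect; it is an illustration, not code from the answers above.

```python
# UTF-8 bytes for a string with two non-ASCII characters.
data = "naïve café".encode("utf-8")

as_latin1 = data.decode("latin-1")  # never raises, but the text is garbled
as_utf8 = data.decode("utf-8")      # the correct decoding for these bytes

print(repr(as_latin1))
print(repr(as_utf8))

# latin-1 is lossless at the byte level: encoding again restores the input.
assert as_latin1.encode("latin-1") == data

# Every byte value from 0 to 255 decodes without error.
all_bytes = bytes(range(256)).decode("latin-1")
```

So "no exception" with latin1 does not prove the encoding is right. Check a few non-ASCII characters in the result by eye before trusting it.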
I had the same issue when processing a file generated on Linux. It turned out to be related to files containing question marks.
Following code worked in my case:
df = pd.read_csv(filename,sep = '\t', encoding='cp1252')
If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise do it programmatically at the OS level.
I'm trying to download a BVLC-trained model and I'm stuck with this error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 110: invalid start byte
I think it's because of the following function (complete code)
# Closure-d function for checking SHA1.
def model_checks_out(filename=model_filename, sha1=frontmatter['sha1']):
    with open(filename, 'r') as f:
        return hashlib.sha1(f.read()).hexdigest() == sha1
Any idea how to fix this?
You are opening a file that is not UTF-8 encoded, while the default encoding for your system is set to UTF-8.
Since you are calculating a SHA1 hash, you should read the data as binary instead. The hashlib functions require you pass in bytes:
with open(filename, 'rb') as f:
    return hashlib.sha1(f.read()).hexdigest() == sha1
Note the addition of b in the file mode.
See the open() documentation:
mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. [...] In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
and from the hashlib module documentation:
You can now feed this object with bytes-like objects (normally bytes) using the update() method.
You didn't specify to open the file in binary mode, so f.read() is trying to read the file as a UTF-8-encoded text file, which doesn't seem to be working. But since we take the hash of bytes, not of strings, it doesn't matter what the encoding is, or even whether the file is text at all: just open it, and then read it, as a binary file.
>>> with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
Traceback (most recent call last):
File "<ipython-input-3-fdba09d5390b>", line 1, in <module>
with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
File "/home/dsm/sys/pys/Python-3.5.1-bin/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 10: invalid start byte
but
>>> with open("test.h5.bz2","rb") as f: print(hashlib.sha1(f.read()).hexdigest())
21bd89480061c80f347e34594e71c6943ca11325
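For large model files, reading everything with f.read() holds the whole file in memory. A common variation, sketched here, feeds the hash in chunks through update(). sha1_of_file and the chunk size are illustrative choices, not part of the original script, and the demo file is created only so the example runs on its own.

```python
import hashlib
import os
import tempfile

def sha1_of_file(path, chunk_size=65536):
    """Hash a file in binary mode, one chunk at a time."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

payload = b"\x80\xb8\xff not text at all" * 10000
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as fh:
        fh.write(payload)
    chunked = sha1_of_file(demo_path)

# The chunked digest matches hashing the same bytes in one call.
print(chunked == hashlib.sha1(payload).hexdigest())
```

No decoding happens anywhere in this path, so bytes such as 0x80 or 0xb8 cannot trigger a UnicodeDecodeError.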
I couldn't find a single hint in the documentation or the source code, so I have no clue why, but using the b flag (I guess for binary) totally works (tf-version: 1.1.0):
image_data = tf.gfile.FastGFile(filename, 'rb').read()
For more information, check out: gfile