I'm trying to read a CSV over SFTP using pysftp/Paramiko. My code looks like this:
input_conn = pysftp.Connection(hostname, username, password)
file = input_conn.open("Data.csv")
file_contents = list(csv.reader(file))
But when I do this, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 23: invalid start byte
I know that this means the file is expected to be in UTF-8 encoding but isn't. The strange thing is, if I download the file and then use my code to open the file, I can specify the encoding as "macroman" and get no error:
with open("Data.csv", "r", encoding="macroman") as csvfile:
    file_contents = list(csv.reader(csvfile))
The Paramiko docs say that the encoding of a file is meaningless over SFTP because it treats all files as bytes – but then, how can I get Python's CSV module to recognize the encoding if I use Paramiko to open the file?
If the file is not huge, so that holding it in memory twice is not a problem, you can download it and convert the contents in memory:
with io.BytesIO() as bio:
    input_conn.getfo("Data.csv", bio)
    bio.seek(0)
    with io.TextIOWrapper(bio, encoding='macroman') as f:
        file_contents = list(csv.reader(f))
Partially based on Convert io.BytesIO to io.StringIO to parse HTML page.
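The decoding step can be tried without an SFTP server. The following is a minimal sketch, assuming the downloaded bytes are already in hand: the helper name read_csv_bytes and the sample data are hypothetical, and the sample bytes stand in for what getfo() would write into the buffer.

```python
import csv
import io


def read_csv_bytes(raw, encoding):
    """Decode raw CSV bytes with the given encoding and return the rows."""
    with io.BytesIO(raw) as bio:
        # newline="" lets the csv module handle line endings itself.
        with io.TextIOWrapper(bio, encoding=encoding, newline="") as f:
            return list(csv.reader(f))


# Hypothetical sample standing in for the bytes fetched over SFTP.
sample = "name,city\nSeñor,Logroño\n".encode("mac_roman")
rows = read_csv_bytes(sample, "mac_roman")
print(rows)
```

The same bytes are not valid UTF-8, which is why decoding them with the default codec fails while naming the encoding explicitly works.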
I have an HKSCS dataset that I am trying to read in Python 3 with the code below:
encoding = 'big5hkscs'
lines = []
num_errors = 0
for line in open('file.txt'):
    try:
        lines.append(line.decode(encoding))
    except UnicodeDecodeError as e:
        num_errors += 1
It throws the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 0: invalid start byte. It seems there is a non-UTF-8 character in the dataset that the code is not able to decode.
I tried adding errors='ignore' to this line:
lines.append(line.decode(encoding, errors='ignore'))
But that does not solve the problem.
Can anyone suggest a fix?
If a text file contains text encoded with an encoding that is not the default encoding, the encoding must be specified when opening the file to avoid decoding errors:
encoding = 'big5hkscs'
path = 'file.txt'
with open(path, 'r', encoding=encoding) as f:
    for line in f:
        pass  # do something with line
Alternatively, the file may be opened in binary mode, and text decoded afterwards:
encoding = 'big5hkscs'
path = 'file.txt'
with open(path, 'rb') as f:
    for line in f:
        decoded = line.decode(encoding)
        # do something with decoded text
In the question, the file is opened without specifying an encoding, so its contents are automatically decoded with the default encoding, apparently UTF-8 in this case.
Looks like if I do NOT add the except clause except UnicodeDecodeError as e, it works fine
encoding = 'big5hkscs'
lines = []
path = 'file.txt'
with open(path, encoding=encoding, errors='ignore') as f:
    for line in f:
        line = '\t' + line
        lines.append(line)
The answer of snakecharmerb is correct, but possibly you need an explanation.
You didn't say so in the original question, but I assume the error is raised on the for line. On that line the file is already being decoded as UTF-8 (probably the default in your environment, so not Windows), and later you try to decode it again. So the error is not about decoding big5hkscs, but about opening the file as text.
As in the good answer of snakecharmerb (second part), you should open the file as binary, so that nothing is decoded on read, and then decode each line yourself with line.decode(encoding). Note: iterating over a file opened in binary mode still yields one line at a time, as bytes objects. This way you can still catch the errors, e.g. to write a message. Otherwise the normal way is to decode at open() time, but then you lose the ability to fall back and to give users a better error message (e.g. with the line number).
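A minimal sketch of that binary-mode approach, reporting the line number of every line that fails to decode. The helper name decode_lines and the sample bytes are hypothetical; the sample deliberately contains one line that is not valid big5hkscs.

```python
import os
import tempfile


def decode_lines(path, encoding):
    """Return (decoded_lines, errors); errors holds (line_number, message)."""
    lines, errors = [], []
    with open(path, "rb") as f:
        # Iterating a binary file yields bytes objects, one per line.
        for number, raw in enumerate(f, start=1):
            try:
                lines.append(raw.decode(encoding))
            except UnicodeDecodeError as exc:
                errors.append((number, str(exc)))
    return lines, errors


# Hypothetical sample: two valid lines around one invalid line (0xFF 0xFF).
data = "香港\n".encode("big5hkscs") + b"\xff\xff\n" + "中文\n".encode("big5hkscs")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)

lines, errors = decode_lines(tmp.name, "big5hkscs")
os.remove(tmp.name)
print(lines, [number for number, _ in errors])
```

Unlike errors='ignore', this keeps a record of which lines were skipped, so they can be inspected afterwards.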
I have a .xlsx file that I converted into a .csv file. I then upload the .csv file to a Python script I wrote, but an error is thrown.
Since the file is uploaded through HTTP, I access it with file = request.files['file']. This returns an object of type FileStorage. I then try to read it with a StringIO object as follows:
io.StringIO(file.stream.read().decode("UTF8"), newline=None)
I'm getting the following error:
TypeError: initial_value must be str or None, not bytes
I also tried to read the file of FileStorage object this way:
file_data = file.read().decode("utf-8")
and I'm getting the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 97: invalid start byte
It may be worth noting that I am able to read the file directly, i.e. as a CSV file on disk, with the following code:
with open('file_path', 'r') as file:
    csv_reader = csv.reader(file, delimiter=";")
    ...
But since I'm getting the file from an upload button, i.e. an input HTML element of type file, as mentioned above, I get a FileStorage object, which I'm not able to read.
Does anyone have an idea how I could approach this?
Thank you in advance!
It could be that it's not encoded in UTF-8. Try decoding it as latin-1 instead:
file_data = file.read().decode("latin-1")
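A minimal sketch of that idea, assuming the uploaded bytes have already been read from the FileStorage object. The helper name rows_from_upload, the fallback order, and the sample bytes are assumptions for illustration. Note that latin-1 maps every possible byte to a character, so it never raises; if the file is really in some other 8-bit encoding, the result will decode without error but contain wrong characters.

```python
import csv
import io


def rows_from_upload(raw, encodings=("utf-8", "latin-1"), delimiter=";"):
    """Decode uploaded bytes with the first encoding that works, then parse."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings matched")
    # StringIO needs str, which is why the bytes are decoded first.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


# Hypothetical stand-in for file.read() on the uploaded file.
raw = b"Name;Ort\nM\xfcller;K\xf6ln\n"
rows = rows_from_upload(raw)
print(rows)
```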
I am trying to do this:
fh = request.FILES['csv']
fh = io.StringIO(fh.read().decode())
reader = csv.DictReader(fh, delimiter=";")
This always fails with the error in the title, and I have spent almost 8 hours on it.
Here is my understanding:
I am using Python 3, so the file fh is in bytes. I am decoding it into a string and putting it in memory via StringIO.
Then csv.DictReader() tries to read it into memory as dicts. It is failing here:
I also tried io.StringIO(fh.read().decode('utf-8')), but got the same error.
What am I missing?
The error occurs because the file contains some non-ASCII character that cannot be decoded with the codec in use. One simple way to avoid this error is to decode such data explicitly with decode(), as follows (if a is the bytes object containing the non-ASCII character):
a.decode('utf-8')
Also, you could try opening the file as:
with open('filename', 'r', encoding='utf-8') as f:
    ...  # your code using f as the file object
Use 'rb' if your file is binary.
I'm trying to read a file and when I'm reading it, I'm getting a unicode error.
def reading_File(self, text):
    url_text = "Text1.txt"
    with open(url_text) as f:
        content = f.read()
Error:
    content = f.read()# Read the whole file
  File "/home/soft/anaconda/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 404: ordinal not in range(128)
Why is this happening? I'm trying to run the same on Linux system, but on Windows it runs properly.
According to the question,
i'm trying to run the same on Linux system, but on Windows it runs properly.
Since we know from the question and some of the other answers that the file's contents are neither ASCII nor UTF-8, it's a reasonable guess that the file is encoded with one of the 8-bit encodings common on Windows.
As it happens, 0x92 maps to the character 'RIGHT SINGLE QUOTATION MARK' in the cp125* encodings, used in US and Latin/European regions.
So the file should probably be opened like this:
# Python3
with open(url_text, encoding='cp1252') as f:
    content = f.read()
# Python2
import codecs
with codecs.open(url_text, encoding='cp1252') as f:
    content = f.read()
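The claim about byte 0x92 can be checked directly. This is a small sketch using only the standard library; it decodes the single byte from the error message rather than the asker's actual file.

```python
import unicodedata

raw = b"\x92"

# The byte is valid in neither ASCII nor UTF-8.
for codec in ("ascii", "utf-8"):
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        print(codec, "fails:", exc.reason)

# In cp1252 it is a typographic apostrophe.
char = raw.decode("cp1252")
name = unicodedata.name(char)
print(repr(char), name)
```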
There can be two reasons for that to happen:
1. The file contains text encoded with an encoding different from 'ascii' and, according to your comments on other answers, from 'utf-8'.
2. The file doesn't contain text at all; it is binary data.
In case 1 you need to figure out how the text was encoded and use that encoding to open the file:
open(url_text, encoding=your_encoding)
In case 2 you need to open the file in binary mode:
open(url_text, 'rb')
It looks like the default encoding in your environment is ASCII, whereas Python 3 usually defaults to UTF-8. You can open the file with an explicit encoding:
open(file, encoding='utf-8')
Check your system default encoding,
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
If it's not UTF-8, reset the encoding of your system.
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
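Note that sys.stdout.encoding describes the output stream, while open() without an encoding argument uses the locale's preferred encoding. A minimal sketch for inspecting both (the variable name default_for_open is only illustrative):

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= argument is given.
default_for_open = locale.getpreferredencoding(False)

print("open() default:", default_for_open)
print("stdout:", sys.stdout.encoding)
```

If the first value is an ASCII variant, passing encoding= explicitly to open() avoids depending on the environment at all.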
You can use codecs.open to fix this issue with the correct encoding:
import codecs
with codecs.open(filename, 'r', 'utf8') as ff:
    content = ff.read()
I want to read a file, find something in it, and save the result, but when I try to save it I get an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Code to save to file:
fileout.write((key + ';' + nameDict[key]+ ';'+src + alt +'\n').decode('utf-8'))
What can I do to fix it?
Thank you
You are trying to concatenate unicode values with byte strings, then turn the result to unicode, to write it to a file object that most likely only takes byte strings.
Don't mix unicode and byte strings like that.
Open the file to write to with io.open() to automate encoding Unicode values, then handle only unicode in your code:
import io
with io.open(filename, 'w', encoding='utf8') as fileout:
    # code gathering stuff from BeautifulSoup
    fileout.write(u'{};{};{}{}\n'.format(key, nameDict[key], src, alt))
You may want to check out the csv module to handle writing out delimiter-separated values. If you do go that route, you'll have to explicitly encode your columns:
import csv
with open(filename, 'wb') as fileout:
    writer = csv.writer(fileout, delimiter=';')
    # code gathering stuff from BeautifulSoup
    row = [key, nameDict[key], src + alt]
    writer.writerow([c.encode('utf8') for c in row])
If some of this data comes from other files, make sure you also decode to Unicode first; again, io.open() to read these files is probably the best option, to have the data decoded to Unicode values for you as you read.
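The snippets above target Python 2. As a rough Python 3 equivalent, the file can be opened in text mode with an explicit encoding, and the csv module then needs no manual encoding of columns. This is a sketch under assumptions: the sample values for key, nameDict, src and alt are hypothetical, and a temporary path is used in place of filename.

```python
import csv
import os
import tempfile

# Hypothetical sample values standing in for the scraped data.
key, src, alt = "k1", "img.png", " café"
nameDict = {"k1": "Zoë"}

path = os.path.join(tempfile.mkdtemp(), "out.csv")

# Text mode with an explicit encoding; newline="" is what the csv docs advise.
with open(path, "w", encoding="utf8", newline="") as fileout:
    writer = csv.writer(fileout, delimiter=";")
    writer.writerow([key, nameDict[key], src + alt])

# Read the file back to check what was written.
with open(path, encoding="utf8", newline="") as f:
    written = f.read()
print(repr(written))
```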