This question already has an answer here:
Reading CSV file into Pandas from SFTP server via Paramiko fails with "'utf-8' codec can't decode byte ... in position ....: invalid start byte"
(1 answer)
Closed 1 year ago.
I'm trying to work around paramiko's strict utf-8 decoding. I want to open the file in binary mode and read it into a dataframe line by line. How can I do that?
remote_file = sftp.open(remoteName, "rb")
for line in remote_file:
    print(line.decode("utf8", "ignore"))
I tested this on my server, and here is what I see.
This code
remote_file = sftp.open(remoteName)
print(remote_file.read())
reads the data as bytes, even if I don't set bytes mode ("rb").
This code
remote_file = sftp.open(remoteName)
print(remote_file.readlines())
normally reads the data as strings, but can read it as bytes when I set bytes mode ("rb").
It seems that when I use read_csv(remote_file), pandas uses some inner wrapper that automatically decodes with utf-8, even if I set bytes mode ("rb"), and setting encoding in read_csv can't change it.
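For illustration, a minimal sketch of the call that fails (using the same handle as above):
remote_file = sftp.open(remoteName, "rb")
df = pd.read_csv(remote_file, encoding='latin1')
# per the observation above, pandas' inner wrapper still decodes
# the paramiko file as utf-8, so this raises UnicodeDecodeError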
But I can use read() with io.StringIO to convert it manually, e.g. with latin1:
import io

remote_file = sftp.open(remoteName)
data = remote_file.read()        # raw bytes
text = data.decode('latin1')     # decode manually
#text = remote_file.read().decode('latin1')
file_obj = io.StringIO(text)     # file-like object over the decoded text
df = pd.read_csv(file_obj)
#df = pd.read_csv(io.StringIO(text))
EDIT:
Based on the answer to the previous question, it also works with io.BytesIO and the encoding argument in read_csv.
import io

remote_file = sftp.open(remoteName)
data = remote_file.read()          # raw bytes
file_obj = io.BytesIO(data)        # file-like object over the raw bytes
df = pd.read_csv(file_obj, encoding='latin1')
#df = pd.read_csv(io.BytesIO(data), encoding='latin1')
Related
So I'm trying to get a CSV file with requests and save it to my project:
import requests
import pandas as pd
import csv

def get_and_save_countries():
    url = 'https://www.trackcorona.live/api/countries'
    r = requests.get(url)
    data = r.json()
    data = data["data"]
    with open("corona/dash_apps/finished_apps/apicountries.csv", "w", newline="") as f:
        title = "location,country_code,latitude,longitude,confirmed,dead,recovered,updated".split(",")
        cw = csv.DictWriter(f, title, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        cw.writeheader()
        cw.writerows(data)
I've managed that, but when I try this:
get_data.get_and_save_countries()
df = pd.read_csv("corona\\dash_apps\\finished_apps\\apicountries.csv")
I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
And I have no idea why. Any help is welcome. Thanks.
Try:
with open("corona/dash_apps/finished_apps/apicountries.csv", "w", newline="", encoding='utf-8') as f:
that is, explicitly specify the encoding with encoding='utf-8'.
When you write to a file, the default encoding is locale.getpreferredencoding(False). On Windows that is usually not UTF-8, and even on Linux the terminal could be configured to something other than UTF-8. Pandas defaults to utf-8 when reading, so pass encoding='utf-8' to open as well.
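You can check which default your system would use with a quick snippet:
import locale
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Windows systems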
I just converted something I did in JS (Node) to Python (a Flask web server): connecting to a secured FTP service, then reading and parsing CSV files from there, because I know it is faster in Python.
I managed to do almost everything, but I'm having a hard time parsing the CSV file properly.
So this is my code:
import urllib.request
import csv
import json
import pysftp
import pandas as pd
cnopts = pysftp.CnOpts()
cnopts.hostkeys = None
name = 'username'
password = 'pass'
host = 'hostURL'
path = ""
with pysftp.Connection(host=host, username=name, password=password, cnopts=cnopts) as sftp:
    for filename in sftp.listdir():
        if filename.endswith('.csv'):
            file = sftp.open(filename)
            csvFile = file.read()
I got to the part where I can see the content of the CSV file, but I can't parse it well (I need it formatted as an array of objects).
I tried to parse it with:
with open(csvFile, 'rb') as csv_file:
    print(csv_file)
    cr = csv.reader(csv_file, delimiter=",")  # "," is the default
    rows = list(cr)
and with this:
Past = pd.read_csv(csvFile, encoding='cp1252')
print(Past)
but I got errors like:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 748: invalid start byte
and
OSError: Expected file path name or file-like object, got <class 'bytes'> type
I'm really kinda stuck right now.
(One more question, not important, but I just wanted to know whether I can retrieve a file from the FTP based on the latest date, because sometimes there can be more than one file in a repository.)
If you don't mind using Pandas (and Numpy)
Pandas' read_csv accepts a file path or a file object (docs). More specifically, it mentions:
By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.
In that sense, using either filename or file from your example should work.
Given this, if using pandas option, try replacing your code with:
df = pd.read_csv(filename, encoding='cp1252') # assuming this is the correct encoding
print(df.head()) # optional, prints top 5 entries
df is now a Pandas DataFrame. To transform a DataFrame into an array of objects, try the to_numpy method (docs):
arr = df.to_numpy() # returns numpy array from DataFrame
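As for the side question about fetching only the newest file: a minimal sketch, assuming pysftp's listdir_attr(), which returns entries carrying st_mtime timestamps:
import io
# pick the most recently modified .csv on the server
csv_attrs = [a for a in sftp.listdir_attr() if a.filename.endswith('.csv')]
latest = max(csv_attrs, key=lambda a: a.st_mtime)
df = pd.read_csv(io.BytesIO(sftp.open(latest.filename).read()), encoding='cp1252')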
I am currently reading the documentation for the io module: https://docs.python.org/3.5/library/io.html?highlight=stringio#io.TextIOBase
Maybe it is because I don't know Python well enough, but in most cases I just don't understand their documentation.
I need to save the data in addresses_list to a csv file and serve it to the user via https. So all of this must happen in-memory. This is the code for it and currently it is working fine.
addresses = Abonnent.objects.filter(exemplare__gt=0)
addresses_list = list(addresses.values_list(*fieldnames))
csvfile = io.StringIO()
csvwriter_unicode = csv.writer(csvfile)
csvwriter_unicode.writerow(fieldnames)
for a in addresses_list:
    csvwriter_unicode.writerow(a)
csvfile.seek(0)
export_data = io.BytesIO()
myzip = zipfile.ZipFile(export_data, "w", zipfile.ZIP_DEFLATED)
myzip.writestr("output.csv", csvfile.read())
myzip.close()
csvfile.close()
export_data.close()
# serve the file via https
Now the problem is that I need the content of the csv file to be encoded in cp1252 rather than utf-8. Traditionally I would just write f = open("output.csv", "w", encoding="cp1252") and dump all the data into it. But with in-memory streams it doesn't work that way: neither io.StringIO() nor io.BytesIO() takes an encoding= parameter.
This is where I have trouble understanding the documentation:
The text stream API is described in detail in the documentation of TextIOBase.
And the documentation of TextIOBase says this:
encoding=
The name of the encoding used to decode the stream’s bytes into strings, and to encode strings into bytes.
But io.StringIO(encoding="cp1252") just throws: TypeError: 'encoding' is an invalid keyword argument for this function.
So how can I use TextIOBase's encoding parameter with StringIO? Or how does this work in general? I am so confused.
StringIO deals only with strings/text. It doesn't know anything about encodings or bytes. The easiest way to do what you want is probably something like:
f = StringIO()
f.write("Some text")
# Old-ish way:
f.seek(0)
my_bytes = f.read().encode("cp1252")
# Alternatively
my_bytes = f.getvalue().encode("cp1252")
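Applied back to the original zip example, that could look like this sketch (reusing the names from the question):
export_data = io.BytesIO()
myzip = zipfile.ZipFile(export_data, "w", zipfile.ZIP_DEFLATED)
# encode the accumulated text to cp1252 before it goes into the archive
myzip.writestr("output.csv", csvfile.getvalue().encode("cp1252"))
myzip.close()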
Reading text from io.BytesIO (in-memory streams) using io.TextIOWrapper, including encoding and error handling (Python 3): this does what io.StringIO can't.
Sample code:
>>> import io
>>> import chardet
>>> # my bytes, single german umlaut
... bts = b'\xf6'
>>>
>>> # try reading as utf-8 text and on error replace
... my_encoding = 'utf-8'
>>> fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='replace')
>>> fh.read()
'�'
>>>
>>> # try reading as utf-8 text with strict error handling
... fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='strict')
>>> # catch exception
... try:
... fh.read()
... except UnicodeDecodeError as err:
... print('"%s"' % err)
... # try to get encoding
... my_encoding = chardet.detect(err.object)['encoding']
... print("correct encoding is %s" % my_encoding)
...
"'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte"
correct encoding is windows-1252
>>> # retry with detected encoding
... fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='strict')
>>> fh.read()
'ö'
My CSV was originally created by Excel. Anticipating encoding anomalies, I opened and re-saved the file with UTF-8 BOM encoding using Sublime Text.
Imported into the notebook:
filepath = "file:///Volumes/PASSPORT/Inserts/IMAGETRAC/csv/universe_wcsv.csv"
uverse = sc.textFile(filepath)
header = uverse.first()
data = uverse.filter(lambda x:x<>header)
Formatted my fields:
fields = header.replace(" ", "_").replace("/", "_").split(",")
Structured the data:
import csv
from StringIO import StringIO
from collections import namedtuple
Products = namedtuple("Products", fields, verbose=True)
def parse(row):
    reader = csv.reader(StringIO(row))
    row = reader.next()
    return Products(*row)
products = data.map(parse)
If I then do products.first(), I get the first record as expected. However, if I want to, say, see the count by brand and so run:
products.map(lambda x: x.brand).countByValue()
I still get a UnicodeEncodeError-related Py4JJavaError:
File "<ipython-input-18-4cc0cb8c6fe7>", line 3, in parse
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in
position 125: ordinal not in range(128)
How can I fix this code?
The csv module in legacy Python versions doesn't support Unicode input. Personally, I would recommend using the Spark csv data source:
df = spark.read.option("header", "true").csv(filepath)
fields = [c.strip().replace(" ", "_").replace("/", "_") for c in df.columns]
df.toDF(*fields).rdd
For most applications Row objects should work as well as namedtuple (Row extends tuple and provides similar attribute getters), but you can easily convert one into the other.
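For example, a hedged sketch of that conversion, assuming the Products namedtuple from the question:
rows = df.toDF(*fields).rdd
products = rows.map(lambda row: Products(*row))  # Row unpacks like a tuple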
You could also try reading the data without decoding:
uverse = sc.textFile(filepath, use_unicode=False)
and decoding fields manually after initial parsing:
(data
.map(parse)
.map(lambda prod: Products(*[x.decode("utf-8") for x in prod])))
Related question Reading a UTF8 CSV file with Python
In this first example we save two Unicode strings in a file while delegating to codecs the task of encoding them.
# -*- coding: utf-8 -*-
import codecs
cities = [u'Düsseldorf', u'天津市']
with codecs.open("cities", "w", "utf-8") as f:
    for c in cities:
        f.write(c)
We now do the same thing, first saving the two names to redis, then reading them back and saving what we've read to a file. Because what we've read is already in utf-8 we skip decoding/encoding for that part.
# -*- coding: utf-8 -*-
import redis
r_server = redis.Redis('localhost') #, decode_responses = True)
cities_tag = u'Städte'
cities = [u'Düsseldorf', u'天津市']
for city in cities:
    r_server.sadd(cities_tag.encode('utf8'),
                  city.encode('utf8'))

with open(u'someCities.txt', 'w') as f:
    while r_server.scard(cities_tag.encode('utf8')) != 0:
        city_utf8 = r_server.srandmember(cities_tag.encode('utf8'))
        f.write(city_utf8)
        r_server.srem(cities_tag.encode('utf8'), city_utf8)
How can I replace the line
r_server = redis.Redis('localhost')
with
r_server = redis.Redis('localhost', decode_responses = True)
to avoid the wholesale introduction of .encode/.decode when using redis?
I'm not sure that there is a problem.
If you remove all of the .encode('utf8') calls in your code it produces a correct file, i.e. the file is the same as the one produced by your current code.
>>> r_server = redis.Redis('localhost')
>>> r_server.keys()
[]
>>> r_server.sadd(u'Hauptstädte', u'東京', u'Godthåb',u'Москва')
3
>>> r_server.keys()
['Hauptst\xc3\xa4dte']
>>> r_server.smembers(u'Hauptstädte')
set(['Godth\xc3\xa5b', '\xd0\x9c\xd0\xbe\xd1\x81\xd0\xba\xd0\xb2\xd0\xb0', '\xe6\x9d\xb1\xe4\xba\xac'])
This shows that keys and values are UTF8 encoded, therefore .encode('utf8') is not required. The default encoding for the redis module is UTF8. This can be changed by passing an encoding when creating the client, e.g. redis.Redis('localhost', encoding='iso-8859-1'), but there's no reason to.
If you enable response decoding with decode_responses=True then the responses will be converted to unicode using the client connection's encoding. This just means that you don't need to explicitly decode the returned data, redis will do it for you and give you back a unicode string:
>>> r_server = redis.Redis('localhost', decode_responses=True)
>>> r_server.keys()
[u'Hauptst\xe4dte']
>>> r_server.smembers(u'Hauptstädte')
set([u'Godth\xe5b', u'\u041c\u043e\u0441\u043a\u0432\u0430', u'\u6771\u4eac'])
So, in your second example where you write data retrieved from redis to a file, if you enable response decoding then you need to open the output file with the desired encoding. If this is the default encoding then you can just use open(). Otherwise you can use codecs.open() or manually encode the data before writing to the file.
import codecs
cities_tag = u'Hauptstädte'
with codecs.open('capitals.txt', 'w', encoding='utf8') as f:
    while r_server.scard(cities_tag) != 0:
        city = r_server.srandmember(cities_tag)
        f.write(city + '\n')
        r_server.srem(cities_tag, city)
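The manual-encoding alternative mentioned above could look like this sketch, writing to a binary-mode file (assuming decode_responses=True, so city comes back as unicode):
with open('capitals.txt', 'wb') as f:
    while r_server.scard(cities_tag) != 0:
        city = r_server.srandmember(cities_tag)
        f.write(city.encode('utf8') + b'\n')  # encode manually before writing
        r_server.srem(cities_tag, city)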