When using python pandas.read_csv on Azure, the encoding is not changing - python

When reading a CSV file with Python pandas and trying to change the encoding (because of some German letters), Azure seems to always keep the same encoding (presumably the default).
Whatever I've done, I always get the same error in the Azure portal:
'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
The same error appears even if I set utf-16, latin1, cp1252, etc.
with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
    for i in sftp.listdir_attr():
        with sftp.open(i.filename) as f:
            df = pd.read_csv(f, delimiter=';', encoding='cp1252')
By the way, testing this locally on a Windows machine works fine.
Full error:
Result: Failure Exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte Stack: File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py",
line 355, in _handle__invocation_request call_result = await self._loop.run_in_executor(
File "/usr/local/lib/python3.8/concurrent/futures/thread.py",
line 57, in run result = self.fn(*self.args, **self.kwargs) File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py",
line 542, in __run_sync_func return func(**params)
File "/home/site/wwwroot/ce_etl/etl_main.py",
line 141, in main df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py",
line 311, in wrapper return func(*args, **kwargs)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py",
line 586, in read_csv return _read(filepath_or_buffer, kwds)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py",
line 488, in _read return parser.read(nrows)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py",
line 1047, in read index, columns, col_dict = self._engine.read(nrows)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py",
line 223, in read chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx",
line 801, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx",
line 880, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx",
line 1026, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx",
line 1080, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx",
line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx",
line 1217, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx",
line 1396, in pandas._libs.parsers._string_box_utf8

You can set the encoding explicitly, as below:
read_csv('file', encoding = "ISO-8859-1")
Also, if we want to detect the file's encoding automatically and pass it to read_csv, we can do it as below:
import chardet
result = chardet.detect(open(r'C:\test.csv', 'rb').read())  # or .readline() if the file is large
df = pd.read_csv(r'C:\test.csv', encoding=result['encoding'])
Refer to read_csv in the pandas documentation.

I found a solution. Basically, sftp.open keeps utf-8 by default. Why the encoding can't be changed via the read_csv argument on Azure Linux remains an open question.
Reading the file into a bytes buffer with sftp.getfo and then parsing it into a DataFrame works fine:
ba = io.BytesIO()
sftp.getfo(i.filename, ba)
ba.seek(0)
f = io.TextIOWrapper(ba, encoding='cp1252')
df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str,
                 error_bad_lines=False)
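For comparison, here is a minimal sketch of the same idea without getfo (an untested variant, assuming the remote file fits comfortably in memory; sftp and i.filename are the names from the question): read the raw bytes from sftp.open, decode them explicitly as cp1252, and hand pandas a text buffer, so the result no longer depends on any default encoding.
import io
import pandas as pd

# Hedged variant of the fix above: decode the raw bytes yourself before parsing.
with sftp.open(i.filename, 'rb') as remote:
    raw = remote.read()
df = pd.read_csv(io.StringIO(raw.decode('cp1252')), delimiter=';', dtype=str)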

Related

How to resolve error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 55: invalid continuation byte

I am having some kind of encoding error when using SQLAlchemy with mysql+mysqlconnector.
I found a lot of similar questions answered on Stack Overflow, but none of them seem to work for me. I think I have tried everything. I have tried setting charset=utf8 in the connection string.
I have tried setting the character set using SQL.
src_conn.execute("SET CHARACTER SET utf8mb4")
This is the initial error I got:
ExceptionTraceback (most recent call last):
File "transfer_binlog.py", line 73, in <module>
events = events.fetchall()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/result.py", line 1216, in fetchall
e, None, None, self.cursor, self.context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1478, in _handle_dbapi_exception
util.reraise(*exc_info)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/result.py", line 1211, in fetchall
l = self.process_rows(self._fetchall_impl())
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/result.py", line 1161, in _fetchall_impl
return self.cursor.fetchall()
File "/usr/local/lib/python3.7/site-packages/mysql/connector/cursor.py", line 990, in fetchall
row, self.description))
File "/usr/local/lib/python3.7/site-packages/mysql/connector/conversion.py", line 407, in row_to_python
result[i] = self._cache_field_types[field_type](row[i], field)
File "/usr/local/lib/python3.7/site-packages/mysql/connector/conversion.py", line 567, in _STRING_to_python
return value.decode(self.charset)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 55: invalid continuation byte
And here is an excerpt of the code:
src_connect = create_engine('mysql+mysqlconnector://user:pw@host/core?charset=utf8', isolation_level="READ UNCOMMITTED")
src_conn = src_connect.connect()
events = src_conn.execute("SHOW BINLOG EVENTS")
x = events.fetchall()
Is there anything I could have missed, or are sqlalchemy or the mysql+mysqlconnector driver buggy?
Any ideas are appreciated.
EDIT: Following @GordThompson's advice I switched to mysql+pymysql.
The error is now (quite similar):
ExceptionTraceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1246, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 581, in do_execute
cursor.execute(statement, parameters)
File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 170, in execute
result = self._query(query)
File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 328, in _query
conn.query(q)
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 517, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 732, in _read_query_result
result.read()
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 1082, in read
self._read_result_packet(first_packet)
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 1152, in _read_result_packet
self._read_rowdata_packet()
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 1190, in _read_rowdata_packet
rows.append(self._read_row_from_packet(packet))
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 1206, in _read_row_from_packet
data = data.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 55: invalid continuation byte
EDIT 2
@GordThompson it says
'collation_connection','utf8mb4_general_ci'
'collation_database','utf8_general_ci'
'collation_server','utf8_general_ci'
I also tried SHOW VARIABLES LIKE 'char%'; which gives me
'character_set_client','utf8mb4'
'character_set_connection','utf8mb4'
'character_set_database','utf8'
'character_set_filesystem','binary'
'character_set_results','utf8mb4'
'character_set_server','latin1'
'character_set_system','utf8'
'character_sets_dir','/rdsdbbin/mysql-5.6.44.R1/share/charsets/'
It is strange that character_set_server shows latin1, maybe that is the issue since I guess SHOW BINLOG EVENTS is a server feature.
I then changed it like this: SET SESSION character_set_server=utf8; but the problem remains.
And just to be clear, I ran the above statements in MySQL Workbench and not from Python.
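For reference, here is a minimal sketch of the connection-string approach with the pymysql driver (user, pw and host are placeholders from the question; whether it helps depends on what bytes the binlog rows actually contain):
from sqlalchemy import create_engine

# Sketch: pass the charset explicitly to the DBAPI driver via the URL.
src_connect = create_engine(
    'mysql+pymysql://user:pw@host/core?charset=utf8mb4',
    isolation_level='READ UNCOMMITTED',
)
src_conn = src_connect.connect()
events = src_conn.execute('SHOW BINLOG EVENTS')
rows = events.fetchall()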

Problems with pd.read_csv

I have Anaconda 3 on Windows 10. I am using pd.read_csv() to load csv files but I get error messages. To begin with I tried df = pd.read_csv('C:\direct_marketing.csv') which worked and the file was imported.
Then I tried df = pd.read_csv('C:\tutorial.csv') and I received the following error message:
Traceback (most recent call last):
File "<ipython-input-3-ce208cc2684f>", line 1, in <module>
df = pd.read_csv('C:\tutorial.csv')
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 315, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 645, in __init__
self._make_engine(self.engine)
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 799, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1213, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas\parser.pyx", line 358, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3427)
File "pandas\parser.pyx", line 628, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:6861)
OSError: File b'C:\tutorial.csv' does not exist
Then I moved the file to a new folder, renamed it, and again used read_csv() to import it:
df = pd.read_csv('C:\Users\test.csv')
This time I received a different error message:
File "<ipython-input-5-03c6d380c174>", line 1
df = pd.read_csv('C:\Users\test.csv')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Could you help me understand what is going on and how to handle this situation?
Thanks a lot!
Try escaping the backslashes:
df = pd.read_csv('C:\\Users\\test.csv')
Try using a double backslash '\\' instead of a single '\'. It might have taken it as an escape character.
Another option is to add r before the path i.e. df = pd.read_csv(r'C:\Users\test.csv')
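As another option (plain Python, not specific to pandas): forward slashes or pathlib also sidestep the backslash-escape problem on Windows, for example:
from pathlib import Path
import pandas as pd

# Forward slashes are accepted by the Windows file APIs, so nothing needs escaping.
df = pd.read_csv('C:/Users/test.csv')
# pathlib builds the path without worrying about separators at all.
df = pd.read_csv(Path('C:/Users') / 'test.csv')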

€ sign - xlsxwriter error

I'm trying to get some data from a web page and write it into an xlsx file. Everything seems fine, but an encoding error is apparently raised while the data is being written to the xlsx file, during CLOSING of the file.
ERROR:
Traceback (most recent call last):
File "C:/Users/Milano/PycharmProjects/distrelec/crawler.py", line 429, in <module>
temp_file_to_xlsx()
File "C:/Users/Milano/PycharmProjects/distrelec/crawler.py", line 119, in temp_file_to_xlsx
wb.close()
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 295, in close
self._store_workbook()
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 518, in _store_workbook
xml_files = packager._create_package()
File "C:\Python27\lib\site-packages\xlsxwriter\packager.py", line 134, in _create_package
self._write_workbook_file()
File "C:\Python27\lib\site-packages\xlsxwriter\packager.py", line 174, in _write_workbook_file
workbook._assemble_xml_file()
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 464, in _assemble_xml_file
self._write_sheets()
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 1455, in _write_sheets
self._write_sheet(worksheet.name, id_num, worksheet.hidden)
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 1472, in _write_sheet
self._xml_empty_tag('sheet', attributes)
File "C:\Python27\lib\site-packages\xlsxwriter\xmlwriter.py", line 80, in _xml_empty_tag
self.fh.write("<%s/>" % tag)
File "C:\Python27\lib\codecs.py", line 694, in write
return self.writer.write(data)
File "C:\Python27\lib\codecs.py", line 357, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
To find out where the problem is, I've edited the codecs module:
def write(self, object):
    """ Writes the object's contents encoded to self.stream.
    """
    try:
        data, consumed = self.encode(object, self.errors)
        self.stream.write(data)
    except:
        print object
        print repr(object)
        raise Exception
The output is:
<sheet name="Android PC–APC" sheetId="42" r:id="rId42"/>
'<sheet name="Android PC\xe2\x80\x93APC" sheetId="42" r:id="rId42"/>'
temp_file_to_xlsx()
File "C:/Users/Milano/PycharmProjects/distrelec/crawler.py", line 119, in temp_file_to_xlsx
wb.close()
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 295, in close
self._store_workbook()
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 518, in _store_workbook
xml_files = packager._create_package()
File "C:\Python27\lib\site-packages\xlsxwriter\packager.py", line 134, in _create_package
self._write_workbook_file()
File "C:\Python27\lib\site-packages\xlsxwriter\packager.py", line 174, in _write_workbook_file
workbook._assemble_xml_file()
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 464, in _assemble_xml_file
self._write_sheets()
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 1455, in _write_sheets
self._write_sheet(worksheet.name, id_num, worksheet.hidden)
File "C:\Python27\lib\site-packages\xlsxwriter\workbook.py", line 1472, in _write_sheet
self._xml_empty_tag('sheet', attributes)
File "C:\Python27\lib\site-packages\xlsxwriter\xmlwriter.py", line 80, in _xml_empty_tag
self.fh.write("<%s/>" % tag)
File "C:\Python27\lib\codecs.py", line 699, in write
return self.writer.write(data)
File "C:\Python27\lib\codecs.py", line 363, in write
raise Exception
Exception
What should I do with that please?
You have to decode your input data with the correct encoding, which seems to be 'utf-8'.
You may want to look at this:
Example: Simple Unicode with Python 2
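A minimal sketch of that suggestion for this case (Python 2; 'out.xlsx' is just a placeholder filename, and the assumption is that the scraped sheet name arrives as a UTF-8 byte string like the one shown in the output above):
# -*- coding: utf-8 -*-
import xlsxwriter

wb = xlsxwriter.Workbook('out.xlsx')
raw_name = 'Android PC\xe2\x80\x93APC'   # UTF-8 bytes scraped from the page
sheet_name = raw_name.decode('utf-8')    # u'Android PC\u2013APC'
ws = wb.add_worksheet(sheet_name)        # pass unicode, not bytes, to xlsxwriter
wb.close()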

Encoding with pandas.read_csv when file name has accents

I'm trying to load a CSV with pandas, but am running into a problem if the file name has accents. It's clearly an encoding problem, but although read_csv lets you set encoding for text within the file, I can't figure out how to encode the file name properly.
input_file = r'C:\...\Datasets\%s\Provinces\Points\%s.csv' % (country, province)
self.locs = pandas.read_csv(input_file,sep=',',skipinitialspace=True)
The CSV file is Anzoátegui.csv. When I get the error, input_file is:
input_file = 'C:\\...\Datasets\Venezuela\Provinces\Points\Anzoátegui.csv
Error code:
OSError: File b'C:\\PF2\\QGIS Valmiera\\Datasets\\Venezuela\\Provinces\\Points\\Anzo\xc3\xa1tegui.csv' does not exist
So maybe it's converting my string to bytes? I tried using io.StringIO(input_file) as well, which puts the correct file name as a column header on an empty DataFrame:
Empty DataFrame
Columns: [C:\PF2\QGIS Valmiera\Datasets\Venezuela\Provinces\Points\Anzoátegui.csv]
Index: []
Any ideas on how to get this file to load? Unfortunately I can't just strip out accents, as I have to interface with software that requires the proper name, and I have a ton of files to format (not just the one). Thanks!
Edit: Full error
Traceback (most recent call last):
File "C:\PF2\eclipse-standard-kepler-SR2-win32-x86_64\eclipse\plugins\org.python.pydev_3.3.3.201401272249\pysrc\pydevd_comm.py", line 891, in doIt
result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)
File "C:\PF2\eclipse-standard-kepler-SR2-win32-x86_64\eclipse\plugins\org.python.pydev_3.3.3.201401272249\pysrc\pydevd_vars.py", line 486, in evaluateExpression
result = eval(compiled, updated_globals, frame.f_locals)
File "<string>", line 1, in <module>
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 404, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 486, in __init__
self._make_engine(self.engine)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 594, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 952, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "parser.pyx", line 330, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3040)
File "parser.pyx", line 557, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:5387)
OSError: File b'C:\\PF2\\QGIS Valmiera\\Datasets\\Venezuela\\Provinces\\Points\\Anzo\xc3\xa1tegui.csv' does not exist
Ok folks, I got a little lost in dependency hell, but it turns out that this issue was fixed in pandas 0.14.0. Install the updated version to get files named with accents to import correctly.
Comments at github.
Thanks for the input!
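If it helps, a quick way to confirm which pandas version is actually installed before and after upgrading:
import pandas as pd
# The accented-filename issue is reported fixed in 0.14.0, so older versions may still fail.
print(pd.__version__)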

python mechanize file upload UnicodeDecode error

So I have a little script that I would like to use to upload some PDFs to my citation site of choice (citeulike.org).
The thing is, it's not working. It does this:
so want to upload /Users/willwade/Dropbox/Papers/price_promoting_643127.pdf to 12589610
Traceback (most recent call last):
File "citeuupload.py", line 167, in <module>
cureader.parseUserBibTex()
File "citeuupload.py", line 160, in parseUserBibTex
self.uploadFileToCitation(b['citeulike-article-id'],self.localpapers+fileorfalse)
File "citeuupload.py", line 138, in uploadFileToCitation
resp = self.browser.submit()
File "build/bdist.macosx-10.8-intel/egg/mechanize/_mechanize.py", line 541, in submit
File "build/bdist.macosx-10.8-intel/egg/mechanize/_mechanize.py", line 203, in open
File "build/bdist.macosx-10.8-intel/egg/mechanize/_mechanize.py", line 230, in _mech_open
File "build/bdist.macosx-10.8-intel/egg/mechanize/_opener.py", line 193, in open
File "build/bdist.macosx-10.8-intel/egg/mechanize/_urllib2_fork.py", line 344, in _open
File "build/bdist.macosx-10.8-intel/egg/mechanize/_urllib2_fork.py", line 332, in _call_chain
File "build/bdist.macosx-10.8-intel/egg/mechanize/_urllib2_fork.py", line 1142, in http_open
File "build/bdist.macosx-10.8-intel/egg/mechanize/_urllib2_fork.py", line 1115, in do_open
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 955, in request
self._send_request(method, url, body, headers)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 989, in _send_request
self.endheaders(body)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 951, in endheaders
self._send_output(message_body)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 809, in _send_output
msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 544: ordinal not in range(128)
and the code:
def uploadFileToCitation(self, artid, file):
    print 'so want to upload', file, ' to ', artid
    self.browser.open('http://www.citeulike.org/user/' + cUser + '/article/' + artid)
    self.browser.select_form(name="fileupload_frm")
    self.browser.form.add_file(open(file, 'rb'), 'application/pdf', file, name='file')
    try:
        resp = self.browser.submit()
        self.wait_for_api_limit()
    except mechanize.HTTPError, e:
        print 'error'
        print e.getcode()
        print resp.read()
        exit()
NB: I can see it's reading in the file correctly (and it does exist). Also note that I'm doing this elsewhere
self.browser = mechanize.Browser()
self.browser.set_handle_robots(False)
self.browser.addheaders = [
    ("User-agent", 'me@me.com citeusyncpy/1.0'),
]
Full code is here
Try checking this similar question.
To clarify, the message is constructed in httplib from the method, URL, headers, etc. If any of these is Unicode, the whole string gets converted to Unicode (I presume this is normal Python behavior). Then if you try to append a UTF-8 string you get the error I described in the original question...
From the looks of it, it's a problem with encoding that a proper header can fix.
Also you can check this issue.
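Building on that explanation, a hedged sketch of the kind of change it implies for the code above (assumption: the filename may arrive as a unicode string; forcing it to a plain UTF-8 byte string keeps httplib from mixing unicode and bytes when it assembles the request body; file and self.browser are the names from the question):
# Python 2 sketch: make sure the filename handed to mechanize is a byte string.
if isinstance(file, unicode):
    file = file.encode('utf-8')
self.browser.form.add_file(open(file, 'rb'), 'application/pdf', file, name='file')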
