Problems with pd.read_csv - python

I have Anaconda 3 on Windows 10. I am using pd.read_csv() to load csv files but I get error messages. To begin with I tried df = pd.read_csv('C:\direct_marketing.csv') which worked and the file was imported.
Then I tried df = pd.read_csv('C:\tutorial.csv') and I received the following error message:
Traceback (most recent call last):
File "<ipython-input-3-ce208cc2684f>", line 1, in <module>
df = pd.read_csv('C:\tutorial.csv')
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 315, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 645, in __init__
self._make_engine(self.engine)
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 799, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Users\Alexandros_7\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1213, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas\parser.pyx", line 358, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3427)
File "pandas\parser.pyx", line 628, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:6861)
OSError: File b'C:\tutorial.csv' does not exist
Then I moved the file to a new folder, renamed it, and again used read_csv() to import it:
df = pd.read_csv('C:\Users\test.csv')
This time I received a different error message:
File "<ipython-input-5-03c6d380c174>", line 1
df = pd.read_csv('C:\Users\test.csv')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Could you help me understand what is going on and how to handle this situation?
Thanks a lot!

Try escaping the backslashes:
df = pd.read_csv('C:\\Users\\test.csv')

Try using two backslashes ('\\') instead of a single one ('\'). Python is probably treating the single backslash as the start of an escape sequence.

Another option is to add r before the path to make it a raw string, i.e. df = pd.read_csv(r'C:\Users\test.csv')
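To see why the three paths in the question behaved differently, here is a small sketch using only the standard library (the file name comes from the question, the variable names are mine). In an ordinary string literal '\t' is a tab, so 'C:\tutorial.csv' names a file that does not exist, and '\U' starts an eight-digit Unicode escape, which is why 'C:\Users\test.csv' does not even compile. '\d' is not a recognized escape, so the first path happened to survive unchanged.

```python
from pathlib import PureWindowsPath

plain = 'C:\tutorial.csv'      # '\t' is read as a TAB, so this is not the intended path
raw = r'C:\tutorial.csv'       # raw string: the backslash stays a backslash
escaped = 'C:\\tutorial.csv'   # doubled backslash: same text as the raw string
forward = 'C:/tutorial.csv'    # forward slashes avoid the problem entirely

has_tab = '\t' in plain
same_text = raw == escaped
file_name = PureWindowsPath(forward).name

print(repr(plain))
print(has_tab, same_text, file_name)
```

Any of raw, escaped or forward can be passed to pd.read_csv; which one to use is mostly a matter of taste.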


When python pandas.read_csv on azure, encoding is not changing

I am reading a CSV file with pandas and trying to change the encoding because the file contains some German letters, but it seems Azure always keeps the same encoding (the default, I assume).
Whatever I do, I always get the same error in the Azure portal:
'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
The same error appears even if I set utf-16, latin1, cp1252, etc.
with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
    for i in sftp.listdir_attr():
        with sftp.open(i.filename) as f:
            df = pd.read_csv(f, delimiter=';', encoding='cp1252')
By the way, it works fine when I test this locally on a Windows machine.
Full error:
Result: Failure
Exception: UnicodeDecodeError: 'utf-8' codec cant decode byte 0xc4 in position 0: invalid continuation byte
Stack:
File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 355, in _handle__invocation_request
call_result = await self._loop.run_in_executor(
File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 542, in __run_sync_func
return func(**params)
File "/home/site/wwwroot/ce_etl/etl_main.py", line 141, in main
df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 488, in _read
return parser.read(nrows)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 1047, in read
index, columns, col_dict = self._engine.read(nrows)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 880, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1080, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1217, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1396, in pandas._libs.parsers._string_box_utf8
You can pass an explicit encoding, for example:
pd.read_csv('file', encoding="ISO-8859-1")
If you would rather detect the file's own encoding and pass it to read_csv, you can use chardet:
import chardet
with open(r'C:\test.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large
df = pd.read_csv(r'C:\test.csv', encoding=result['encoding'])
See the read_csv entry in the pandas documentation for details.
I found a solution. Apparently sftp.open uses utf-8 by default. Why read_csv on Azure Linux can't override the encoding remains an open question.
Reading the file into an in-memory object with sftp.getfo and then parsing that into a DataFrame works fine:
ba = io.BytesIO()
sftp.getfo(i.filename, ba)
ba.seek(0)
f = io.TextIOWrapper(ba, encoding='cp1252')
df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str,
                 error_bad_lines=False)
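For anyone who wants to try the BytesIO/TextIOWrapper pattern without an SFTP server, here is a minimal sketch using only the standard library. The sample bytes are made up, and csv.reader stands in for pd.read_csv so the snippet has no dependencies. It also shows where the original message comes from: 0xc4 is 'Ä' in cp1252, but in UTF-8 it announces a two-byte sequence that never arrives.

```python
import csv
import io

# Made-up sample: semicolon-separated text with German letters, stored as cp1252
raw = 'Name;Stadt\nÄnne;Köln\n'.encode('cp1252')

try:
    raw.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    # Record where and why UTF-8 decoding gave up
    utf8_error = (exc.start, exc.reason)

# Same pattern as the answer: bytes in a BytesIO, decoded by a TextIOWrapper
buffer = io.BytesIO(raw)
text = io.TextIOWrapper(buffer, encoding='cp1252', newline='')
rows = list(csv.reader(text, delimiter=';'))

print(utf8_error)
print(rows)
```

Because the wrapper already yields decoded text, the parser downstream no longer has to guess the encoding.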

Pandas can't read in csv file produced by Numbers?

I have a comma delimited csv file that is exported by Mac Numbers, and I am trying to read it into a dataframe, but get an error message:
df = pd.read_csv('game.csv', dtype={"rating": str}, error_bad_lines='ignore', encoding='utf8', sep=',')
The error message is:
Traceback (most recent call last):
File "/Users/congminmin/nlp/data_collection/crawler/data/game/test.py", line 5, in <module>
df = pd.read_csv('game_app_apple.missing.url.csv', dtype={"rating": str}, error_bad_lines='ignore', encoding='utf8', sep=',')
File "/Users/congminmin/.venv/data_collection/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Users/congminmin/.venv/data_collection/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/Users/congminmin/.venv/data_collection/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/Users/congminmin/.venv/data_collection/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Users/congminmin/.venv/data_collection/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 426, in pandas._libs.parsers.TextReader.__cinit__
ValueError: invalid literal for int() with base 10: 'ignore'
Is my CSV invalid? It was produced by Numbers. Even when I removed the dtype parameter, I got the same error. When I removed error_bad_lines='ignore', I got the following error instead:
File "pandas/_libs/parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 929, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2071, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
The CSV exported by Numbers is comma delimited. I want to read it into a dataframe and write it out as tab delimited, but I ran into the problem above.
Additional data: the original data is Chinese, and 'rating' in the code above is actually a translation of '评分' in the real data below:
I had to take a screenshot, since the text is flagged as spam by Stack Overflow:
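Two observations, offered as guesses since the file itself is not shown. First, error_bad_lines expects True or False, so the string 'ignore' is what triggers the int() ValueError. Second, "Expected 1 fields in line 3, saw 2" means the parser settled on a single column from the first lines, which can happen if the export starts with a one-cell line such as a table title. The sketch below uses invented sample data and only the standard library: it counts fields per line, keeps the rows with the most common width, and writes them tab delimited.

```python
import csv
import io
from collections import Counter

# Invented stand-in for the export: one title line, then the real table
sample = 'Table 1\n名称,评分\nchess,4.5\ngo,4.8\n'

rows = list(csv.reader(io.StringIO(sample)))
field_counts = [len(row) for row in rows]

# Treat the most common width as the width of the real table
width = Counter(field_counts).most_common(1)[0][0]
table = [row for row in rows if len(row) == width]

# Write the surviving rows as tab-delimited text
out = io.StringIO()
csv.writer(out, delimiter='\t', lineterminator='\n').writerows(table)
tab_text = out.getvalue()

print(field_counts)
print(tab_text)
```

If the field counts do show a stray first line, the pandas equivalent would be something like pd.read_csv(..., skiprows=1) followed by df.to_csv(..., sep='\t').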

Reading the CSV file using Pandas results in an error

I have a problem reading the csv file using pandas. Here is the code:
import pandas as pd
import numpy as np
df=pd.read_csv('item.csv')
Here is the error:
Traceback (most recent call last):
File "Script.py", line 3, in <module>
df=pd.read_csv('item.csv')
File "C:\Python38\lib\site-packages\pandas\io\parsers.py", line 605, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Python38\lib\site-packages\pandas\io\parsers.py", line 457, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Python38\lib\site-packages\pandas\io\parsers.py", line 814, in __init__
self._engine = self._make_engine(self.engine)
File "C:\Python38\lib\site-packages\pandas\io\parsers.py", line 1045, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "C:\Python38\lib\site-packages\pandas\io\parsers.py", line 1893, in __init__
self._reader = parsers.TextReader(self.handles.handle, **kwds)
File "pandas\_libs\parsers.pyx", line 518, in pandas._libs.parsers.TextReader.__cinit__
File "pandas\_libs\parsers.pyx", line 620, in pandas._libs.parsers.TextReader._get_header
File "pandas\_libs\parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1943, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 11843: invalid start byte
Try:
import pandas as pd
import numpy as np
df=pd.read_csv(r'item.csv')
If you specify an encoding, it will work. The link below lists the standard encodings available in Python in case you need a different one.
data = pd.read_csv(filename, encoding= 'unicode_escape')
https://docs.python.org/3/library/codecs.html#standard-encodings
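A note on choosing the encoding: 'unicode_escape' decodes the bytes as Latin-1 and also interprets backslash escapes, so it silences the error but may not give the intended characters. A rough alternative is to try a short list of likely encodings in order. The helper below is hypothetical (the name, the candidate list and the sample bytes are mine), and it only tells you which encoding decodes cleanly, not which one is correct; latin-1 in particular never fails.

```python
def first_working_encoding(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up bytes containing 0xc1, the byte from the error message
sample = b'item,price\nJam\xc1n,3\n'
chosen = first_working_encoding(sample)
decoded = sample.decode(chosen)

print(chosen)
print(decoded)
```

The chosen name can then be passed along as pd.read_csv('item.csv', encoding=chosen), after checking that the decoded text looks right.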

Pandas throws ParserError on one computer but not on another

Here's the code I have, which works perfectly fine on my friend's computer:
#!/usr/bin/python
import pandas as pd
df = pd.read_csv("report.csv")
df = df.drop("Agent Name", axis=1)
df.to_csv("agent_report_updated.csv")
Here's the error I receive on mine:
Traceback (most recent call last):
File "./agent_calls_report.py", line 10, in <module>
df = pd.read_csv("report.csv")
File "/usr/lib/python3.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3.7/site-packages/pandas/io/parsers.py", line 446, in _read
data = parser.read(nrows)
File "/usr/lib/python3.7/site-packages/pandas/io/parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "/usr/lib/python3.7/site-packages/pandas/io/parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 34 fields in line 3, saw 35
Any idea why this would work on one computer and not another? Edit: I've confirmed that we are using the same versions of both Python (3.7.1) and Pandas, the only difference is that he has a Mac while I'm on Linux.
I believe this is an encoding problem.
Try this:
import pandas as pd
df = pd.read_csv("report.csv",encoding='cp1252')
df = df.drop("Agent Name", axis=1)
df.to_csv("agent_report_updated.csv")
You can also try other encodings, for example utf-8 instead of cp1252.
Here is a list of encodings used.
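Whatever the underlying cause turns out to be, the error says line 3 has 35 fields where 34 were expected, so it may help to list the offending lines on both machines and compare. A minimal sketch with the standard library follows; the function name and the sample text are made up for illustration.

```python
import csv
import io


def ragged_lines(text, delimiter=','):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    return [(reader.line_num, len(row)) for row in reader if len(row) != len(header)]


# Invented sample: the last row has an extra, unquoted comma
sample = 'Agent Name,Calls,Notes\nAnn,3,ok\nBob,5,late, again\n'
bad = ragged_lines(sample)

print(bad)
```

For a real file, read its contents with open("report.csv", newline='') and pass the text to the function. If the two machines report different lines, the files themselves probably differ.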

Encoding with pandas.read_csv when file name has accents

I'm trying to load a CSV with pandas, but am running into a problem if the file name has accents. It's clearly an encoding problem, but although read_csv lets you set encoding for text within the file, I can't figure out how to encode the file name properly.
input_file = r'C:\...\Datasets\%s\Provinces\Points\%s.csv' % (country, province)
self.locs = pandas.read_csv(input_file,sep=',',skipinitialspace=True)
The CSV file is Anzoátegui.csv. When I get the error, input_file is:
input_file = 'C:\\...\Datasets\Venezuela\Provinces\Points\Anzoátegui.csv'
Error message:
OSError: File b'C:\\PF2\\QGIS Valmiera\\Datasets\\Venezuela\\Provinces\\Points\\Anzo\xc3\xa1tegui.csv' does not exist
So maybe it's converting my string to bytes? I tried using io.StringIO(input_file) as well, which puts the correct file name as a column header on an empty DataFrame:
Empty DataFrame
Columns: [C:\PF2\QGIS Valmiera\Datasets\Venezuela\Provinces\Points\Anzoátegui.csv]
Index: []
Any ideas on how to get this file to load? Unfortunately I can't just strip out accents, as I have to interface with software that requires the proper name, and I have a ton of files to format (not just the one). Thanks!
Edit: Full error
Traceback (most recent call last):
File "C:\PF2\eclipse-standard-kepler-SR2-win32-x86_64\eclipse\plugins\org.python.pydev_3.3.3.201401272249\pysrc\pydevd_comm.py", line 891, in doIt
result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)
File "C:\PF2\eclipse-standard-kepler-SR2-win32-x86_64\eclipse\plugins\org.python.pydev_3.3.3.201401272249\pysrc\pydevd_vars.py", line 486, in evaluateExpression
result = eval(compiled, updated_globals, frame.f_locals)
File "<string>", line 1, in <module>
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 404, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 486, in __init__
self._make_engine(self.engine)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 594, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 952, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "parser.pyx", line 330, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3040)
File "parser.pyx", line 557, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:5387)
OSError: File b'C:\\PF2\\QGIS Valmiera\\Datasets\\Venezuela\\Provinces\\Points\\Anzo\xc3\xa1tegui.csv' does not exist
Ok folks, I got a little lost in dependency hell, but it turns out that this issue was fixed in pandas 0.14.0. Install the updated version to get files named with accents to import correctly.
Comments at github.
Thanks for the input!
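For anyone who cannot upgrade, one workaround that may help (an assumption on my part, not something confirmed in this thread) is to let Python open the file and pass the open handle instead of the path, since read_csv accepts file-like objects. The sketch below uses a temporary folder and csv.reader as a stand-in for pandas so it runs anywhere. It also shows that the bytes in the OSError are simply the UTF-8 encoding of the file name.

```python
import csv
import os
import tempfile

name = 'Anzo\u00e1tegui.csv'        # \u00e1 is the accented "a" in the file name
utf8_name = name.encode('utf-8')    # the bytes shown in the OSError

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, name)
    # Create a tiny CSV with the accented name (sample content is made up)
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write('x,y\n1,2\n')
    # Let Python open the file, then hand the open handle to the parser
    with open(path, encoding='utf-8', newline='') as f:
        rows = list(csv.reader(f))

print(utf8_name)
print(rows)
```

With pandas the last step would be pandas.read_csv(f, sep=',', skipinitialspace=True) inside the same with block.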
