ValueError: Missing column provided to 'parse_dates': 'CRASH_DATE, CRASH_TIME' - python

I built a Streamlit app that works fine when I run it locally.
But after I pushed it to Heroku, I got this ValueError from parse_dates:
ValueError: Missing column provided to 'parse_dates': 'CRASH_DATE, CRASH_TIME'
Traceback:
File "/app/.heroku/python/lib/python3.6/site-packages/streamlit/script_runner.py", line 324, in _run_script
exec(code, module.__dict__)
File "/app/app.py", line 35, in <module>
data = load_data(100000)
File "/app/.heroku/python/lib/python3.6/site-packages/streamlit/caching.py", line 591, in wrapped_func
return get_or_create_cached_value()
File "/app/.heroku/python/lib/python3.6/site-packages/streamlit/caching.py", line 575, in get_or_create_cached_value
return_value = func(*args, **kwargs)
File "/app/app.py", line 27, in load_data
data = pd.read_csv(DATA_URL, nrows= rows, parse_dates=[['CRASH_DATE', 'CRASH_TIME']])
File "/app/.heroku/python/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/app/.heroku/python/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/app/.heroku/python/lib/python3.6/site-packages/pandas/io/parsers.py", line 948, in __init__
self._make_engine(self.engine)
File "/app/.heroku/python/lib/python3.6/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/app/.heroku/python/lib/python3.6/site-packages/pandas/io/parsers.py", line 2068, in __init__
self._validate_parse_dates_presence(self.names)
File "/app/.heroku/python/lib/python3.6/site-packages/pandas/io/parsers.py", line 1546, in _validate_parse_dates_presence
f"Missing column provided to 'parse_dates': '{missing_cols}'"
This is the code to read the csv:
DATA_URL = ("https://github.com/chairielazizi/streamlit-collision/blob/master/Motor_Vehicle_Collisions_-_Crashes.csv")
#st.cache(persist=True)
def load_data(rows):
    data = pd.read_csv(DATA_URL, nrows=rows, parse_dates=[['CRASH_DATE', 'CRASH_TIME']])
    # data.seek(0)
    data.dropna(subset=['LATITUDE', 'LONGITUDE'], inplace=True)
    lowercase = lambda x: str(x).lower()
    data.rename(lowercase, axis='columns', inplace=True)
    data.rename(columns={'crash_date_crash_time': 'date/time'}, inplace=True)
    return data
data = load_data(100000)
I tried changing the Streamlit and pandas versions but still got the error. The error only happens when I set DATA_URL to the CSV file stored on GitHub; it works fine if I point it at my local file.
The code and the CSV file for the project are here:
https://github.com/chairielazizi/streamlit-collision

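When you reference a file on GitHub, you need to make sure you are accessing the "raw" version, not the page served by the GitHub interface. Otherwise pandas receives an HTML page instead of the CSV, so the CRASH_DATE and CRASH_TIME columns are not found. Adding the ?raw=true parameter to your URL should work: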
https://github.com/chairielazizi/streamlit-collision/blob/master/Motor_Vehicle_Collisions_-_Crashes.csv?raw=true
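As a rough illustration of the same idea, the helper below rewrites a GitHub "blob" page URL into its raw.githubusercontent.com equivalent, which serves the file contents directly. The name to_raw_github_url is made up for this sketch, and it assumes the URL follows the usual github.com/<user>/<repo>/blob/<branch>/<path> layout with a branch name that contains no slashes.

```python
from urllib.parse import urlsplit

def to_raw_github_url(url):
    """Turn a github.com 'blob' page URL into a raw-content URL.

    Hypothetical helper for illustration; URLs that do not match the
    expected layout are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <user>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    user, repo, _, branch, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, branch, *path])

BLOB_URL = ("https://github.com/chairielazizi/streamlit-collision/"
            "blob/master/Motor_Vehicle_Collisions_-_Crashes.csv")
RAW_URL = to_raw_github_url(BLOB_URL)
print(RAW_URL)
```

RAW_URL can then be passed to pd.read_csv in place of DATA_URL.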

Related

Can't open some .csv file using read_csv()

I am writing a Python function to open two .csv files and make changes to the data inside. I am using pandas and pd.read_csv('text') to open the files. The function works for one .csv file. However, when I try it on a different, smaller .csv file, the file cannot even be opened.
This is part of the error I am getting when I try to open the .csv file.
Traceback (most recent call last):
File "C:\Users\...\Downloads\test\test.py", line 3, in <module>
df = pd.read_csv('data2.csv')
File "C:\Users\...\AppData\Roaming\Python\Python311\site-packages\pandas\util\_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "C:\Users\...\AppData\Roaming\Python\Python311\site-packages\pandas\util\_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "C:\Users\...\AppData\Roaming\Python\Python311\site-packages\pandas\io\parsers\readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\...\AppData\Roaming\Python\Python311\site-packages\pandas\io\parsers\readers.py", line 611, in _read
return parser.read(nrows)
File "C:\Users\...\AppData\Roaming\Python\Python311\site-packages\pandas\io\parsers\readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "C:\Users\...\AppData\Roaming\Python\Python311\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 230, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 8836, saw 5
This is the code I am using to access the .csv files.
import pandas as pd
df = pd.read_csv('test.csv')
All the files are in the correct folders and the file paths are all correct. Any help is appreciated, thanks
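The message says that line 8836 contains 5 fields where the parser expected 4, so the file is found and opened but at least one row is malformed (often an unquoted comma inside a value). A minimal sketch for locating such rows with the standard csv module follows; find_irregular_rows and the sample data are made up for illustration, and it assumes a comma-delimited file.

```python
import csv
import io

def find_irregular_rows(lines, expected_fields):
    """Return (line_number, field_count) for rows whose field count differs.

    `lines` is any iterable of text lines, such as an open file handle.
    """
    bad = []
    reader = csv.reader(lines)
    for row in reader:
        # Blank lines come back as empty lists and are ignored here.
        if row and len(row) != expected_fields:
            bad.append((reader.line_num, len(row)))
    return bad

# Made-up sample: the third line has one field too many.
sample = io.StringIO("a,b,c,d\n1,2,3,4\n5,6,7,8,9\n")
print(find_irregular_rows(sample, expected_fields=4))
```

On a real file, pass an open handle such as open('data2.csv', newline=''). pandas 1.3 and later can also skip such rows with on_bad_lines='skip', at the cost of silently dropping data.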

Error while reading xlsm file by Pandas : "Conditional Formatting extension is not supported"

I want to read an .xlsm file with pandas:
pd.read_excel("data.xlsm", engine='openpyxl', sheet_name="sheet1")
But I get the following messages:
C:\Users\anaconda3\lib\site-packages\openpyxl\worksheet\_read_only.py:79: UserWarning: Unknown extension is not supported and will be removed
for idx, row in parser.parse():
C:\Users\anaconda3\lib\site-packages\openpyxl\worksheet\_read_only.py:79: UserWarning: Conditional Formatting extension is not supported and will be removed
for idx, row in parser.parse():
Another attempt: I saved the data file in .xlsx format and tried to read it with:
pd.read_excel("data.xlsx", engine='openpyxl', sheet_name="sheet1")
And this time, I get the following error:
File "C:\Users\AppData\Local\Temp\ipykernel_28028\1689108907.py", line 1, in <module>
data = pd.read_excel(data_original_filepath, engine='openpyxl', sheet_name=sheet_name)
File "C:\Users\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 457, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "C:\Users\anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 1419, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "C:\Users\anaconda3\lib\site-packages\pandas\io\excel\_openpyxl.py", line 525, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "C:\Users\anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 518, in __init__
self.book = self.load_workbook(self.handles.handle)
File "C:\Users\anaconda3\lib\site-packages\pandas\io\excel\_openpyxl.py", line 536, in load_workbook
return load_workbook(
File "C:\Users\anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 317, in load_workbook
reader.read()
File "C:\Users\anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 278, in read
self.read_workbook()
File "C:\Users\anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 150, in read_workbook
self.parser.parse()
File "C:\Users\anaconda3\lib\site-packages\openpyxl\reader\workbook.py", line 49, in parse
package = WorkbookPackage.from_tree(node)
File "C:\Users\anaconda3\lib\site-packages\openpyxl\descriptors\serialisable.py", line 83, in from_tree
obj = desc.from_tree(el)
File "C:\Users\anaconda3\lib\site-packages\openpyxl\descriptors\sequence.py", line 85, in from_tree
return [self.expected_type.from_tree(el) for el in node]
File "C:\Users\anaconda3\lib\site-packages\openpyxl\descriptors\sequence.py", line 85, in <listcomp>
return [self.expected_type.from_tree(el) for el in node]
File "C:\Users\anaconda3\lib\site-packages\openpyxl\descriptors\serialisable.py", line 103, in from_tree
return cls(**attrib)
TypeError: __init__() missing 1 required positional argument: 'id'
Any idea how to solve this issue?
In fact, I do have to read the .xlsm file. Converting it to .xlsx was only an experiment.
Please try this block of code:
import openpyxl
file = 'data.xlsm'
wb = openpyxl.load_workbook(file, data_only=True, read_only=False, keep_vba=True)
Also install the latest openpyxl from the openpyxl web page.
It works if you specify a sheet_name:
pd.read_excel("data.xlsm", sheet_name="sheet1")

Manipulate SQL dataframe with python script in Power BI

I'd like to run a simple Python script in Power BI on a SQL dataframe.
But the error seems to indicate that the SQL table has been read as a CSV file, and I don't know why the script treats the dataframe as a CSV file instead of the SQL dataframe that it is.
The Python script is:
import pandas as pd
dataset['COD-MARQ'] = dataset['COD-MARQ'].str.strip()
Any ideas on how I should proceed?
Thanks
Traceback (most recent call last):
File "PythonScriptWrapper.PY", line 7, in <module>
df1 = pandas.read_csv('input_df_da064532-6620-4e48-a091-ff580b127759.csv')
File "C:\Users\afalieres\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 686, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\afalieres\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 458, in _read
data = parser.read(nrows)
File "C:\Users\afalieres\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 1186, in read
ret = self._engine.read(nrows)
File "C:\Users\afalieres\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 2145, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1194, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1814, in pandas._libs.parsers._try_int64
MemoryError: Unable to allocate 64.0 KiB for an array with shape (8192,) and data type int64
Details:
DataSourceKind=Python
DataSourcePath=Python
Message=Ρŷтнőŋ şсŗĩрţ εггǿŗ.
Traceback (most recent call last):
File "PythonScriptWrapper.PY", line 7, in <module>
df1 = pandas.read_csv('input_df_da064532-6620-4e48-a091-ff580b127759.csv')
File "C:\Users\afalieres\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 686, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\afalieres\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 458, in _read
data = parser.read(nrows)
File "C:\Users\afalieres\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 1186, in read
ret = self._engine.read(nrows)
File "C:\Users\afalieres\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 2145, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 862, in pandas._libs.parsers.Tex...
ErrorCode=-2147467259
ExceptionType=Microsoft.PowerBI.Scripting.Python.Exceptions.PythonScriptRuntimeException
I'm not positive this is the problem, but it looks to me like dataset refers to the previous query step rather than the original source, which means it is no longer a SQL table by the time the script sees it. The traceback points the same way: PythonScriptWrapper.PY loads the step's data with pandas.read_csv from a temporary CSV file. You probably want to either import the original source using Python, or modify your script to treat dataset as whatever the Query Editor passes to the Python script (which I think is a pandas DataFrame).
On a separate note, in this particular case it seems unnecessary to use Python for a simple transformation that can just as easily be done natively in M.

Encoding with pandas.read_csv when file name has accents

I'm trying to load a CSV with pandas, but am running into a problem if the file name has accents. It's clearly an encoding problem, but although read_csv lets you set encoding for text within the file, I can't figure out how to encode the file name properly.
input_file = r'C:\...\Datasets\%s\Provinces\Points\%s.csv' % (country, province)
self.locs = pandas.read_csv(input_file,sep=',',skipinitialspace=True)
The CSV file is Anzoátegui.csv. When I get the error, the path is:
input_file = 'C:\\...\Datasets\Venezuela\Provinces\Points\Anzoátegui.csv'
Error code:
OSError: File b'C:\\PF2\\QGIS Valmiera\\Datasets\\Venezuela\\Provinces\\Points\\Anzo\xc3\xa1tegui.csv' does not exist
So maybe it's converting my string to bytes? I tried using io.StringIO(input_file) as well, which puts the correct file name as a column header on an empty DataFrame:
Empty DataFrame
Columns: [C:\PF2\QGIS Valmiera\Datasets\Venezuela\Provinces\Points\Anzoátegui.csv]
Index: []
Any ideas on how to get this file to load? Unfortunately I can't just strip out accents, as I have to interface with software that requires the proper name, and I have a ton of files to format (not just the one). Thanks!
Edit: Full error
Traceback (most recent call last):
File "C:\PF2\eclipse-standard-kepler-SR2-win32-x86_64\eclipse\plugins\org.python.pydev_3.3.3.201401272249\pysrc\pydevd_comm.py", line 891, in doIt
result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)
File "C:\PF2\eclipse-standard-kepler-SR2-win32-x86_64\eclipse\plugins\org.python.pydev_3.3.3.201401272249\pysrc\pydevd_vars.py", line 486, in evaluateExpression
result = eval(compiled, updated_globals, frame.f_locals)
File "<string>", line 1, in <module>
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 404, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 486, in __init__
self._make_engine(self.engine)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 594, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 952, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "parser.pyx", line 330, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3040)
File "parser.pyx", line 557, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:5387)
OSError: File b'C:\\PF2\\QGIS Valmiera\\Datasets\\Venezuela\\Provinces\\Points\\Anzo\xc3\xa1tegui.csv' does not exist
Ok folks, I got a little lost in dependency hell, but it turns out that this issue was fixed in pandas 0.14.0. Install the updated version to get files named with accents to import correctly.
Comments at github.
Thanks for the input!

File exists but python says 'does not exist'

I'm running this code, which reads 2 CSV files (one of them is train.csv). The code gives an error saying the file does not exist. However, the file does exist in the same location as the .py file. Can someone please help me with this? Thanks!
Reading dataset...
Traceback (most recent call last):
File "c:\Project_1\regression_2.py", line 163, in <module>
main(**args)
File "c:\Project_1\regression_2.py", line 80, in main
train_data = pd.read_csv(train)
File "c:\Python27\lib\site-packages\pandas\io\parsers.py", line 401, in parser_f
return _read(filepath_or_buffer, kwds)
File "c:\Python27\lib\site-packages\pandas\io\parsers.py", line 209, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "c:\Python27\lib\site-packages\pandas\io\parsers.py", line 509, in __init__
self._make_engine(self.engine)
File "c:\Python27\lib\site-packages\pandas\io\parsers.py", line 611, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "c:\Python27\lib\site-packages\pandas\io\parsers.py", line 893, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "parser.pyx", line 312, in pandas._parser.TextReader.__cinit__ (pandas\src\parser.c:2846)
File "parser.pyx", line 512, in pandas._parser.TextReader._setup_parser_source (pandas\src\parser.c:4893)
IOError: File train.csv does not exist
The variable is defined as follows:
def main(train='train.csv', test='test.csv', submit='logistic_pred.csv'):
    print "Reading dataset..."
    train_data = pd.read_csv(train)
    test_data = pd.read_csv(test)
You are opening a relative path, but your working directory is not what you think it is.
Use an absolute path instead:
train = os.path.join('c:/Documents and Settings', train)
Without an absolute path, Python uses the current working directory. What that directory is depends on how you ran your script, and is not something you should rely on.
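One way to avoid depending on the working directory, sketched below, is to build the path from the script's own location. path_next_to is a hypothetical helper name, and the sketch assumes the data file really does sit next to the script, as the question describes.

```python
import os

def path_next_to(script_file, name):
    """Return an absolute path to `name` in the same directory as `script_file`."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, name)

# In regression_2.py this would be path_next_to(__file__, 'train.csv');
# a literal script path is used here so the sketch runs anywhere.
train = path_next_to(os.path.join("project", "regression_2.py"), "train.csv")
print(train)
print(os.getcwd())  # the directory that relative paths are resolved against
```

Printing os.getcwd() at the top of the failing script is also a quick way to confirm which directory a bare 'train.csv' is being looked up in.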
