Python Error when reading data from .xls file - python

I need to read a few xls files into Python.The sample data file can be found through Link:data.file. I tried:
import pandas as pd
pd.read_excel('data.xls',sheet=1)
But it gives an error message:
ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' ->
LookupError: unknown encoding: unknown_codepage_21010 Traceback (most
recent call last):
File "", line 1, in
pd.read_excel('data.xls',sheet=1)
File "C:\Anaconda3\lib\site-packages\pandas\io\excel.py", line 113,
in read_excel
return ExcelFile(io, engine=engine).parse(sheetname=sheetname, **kwds)
File "C:\Anaconda3\lib\site-packages\pandas\io\excel.py", line 150,
in init
self.book = xlrd.open_workbook(io)
File "C:\Anaconda3\lib\site-packages\xlrd__init__.py", line 435, in
open_workbook
ragged_rows=ragged_rows,
File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 116, in
open_workbook_xls
bk.parse_globals()
File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 1170, in
parse_globals
self.handle_codepage(data)
File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 794, in
handle_codepage
self.derive_encoding()
File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 775, in
derive_encoding
_unused = unicode(b'trial', self.encoding)
File "C:\Anaconda3\lib\site-packages\xlrd\timemachine.py", line 30,
in
unicode = lambda b, enc: b.decode(enc)
LookupError: unknown encoding: unknown_codepage_21010
Anyone could help with this problem?
PS: I know if I open the file in windows excel, and resave it, the code could work, but I am looking for a solution without manual adjustment.

using the ExcelFile class, I was successfully able to read the file into python.
let me know if this helps!
import xlrd
import pandas as pd
xls = pd.ExcelFile(’C:\data.xls’)
xls.parse(’Index Constituents Data’, index_col=None, na_values=[’NA’])

The below worked for me.
import xlrd
my_xls = xlrd.open_workbook('//myshareddrive/something/test.xls',encoding_override="gb2312")

Related

ValueError: Worksheet index 0 is invalid, 0 worksheets found.Cannot open xlsx with pandas in python

Could someone help me figure out why my files dont open.
import pandas as pd
file = "C://Dev//20211103_logfile Box 2.8.xlsx"
temp=pd.read_excel(file)
Here is the full error!
PS C:\Dev> & C:/Users/keyur/AppData/Local/Programs/Python/Python39/python.exe c:/Dev/test_excel.py
C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\openpyxl\reader\workbook.py:88:
UserWarning: File contains an invalid specification for 20211103_logfile. This will be removed
warn(msg)
Traceback (most recent call last):
File "c:\Dev\test_excel.py", line 6, in <module>
temp=pd.read_excel(file)
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 372, in read_excel
data = io.parse(
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 1272, in parse
return self._reader.parse(
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 537, in parse
sheet = self.get_sheet_by_index(asheetname)
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_openpyxl.py", line 546, in get_sheet_by_index
self.raise_if_bad_sheet_by_index(index)
File "C:\Users\keyur\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 468, in raise_if_bad_sheet_by_index
raise ValueError(
ValueError: Worksheet index 0 is invalid, 0 worksheets found
PS C:\Dev>
There are problem with your excel,
try make a new excel and copy pase all data ,then try again ,this method works for me.

Why can't I access the excel file in python?

I'm trying to use some data that I have in an excel file. However, I'm getting an error saying that it doesn't find the file. I've looked up and the directory and the file name are correct, What am I doing wrong?
Here is the code:
import os
import pandas as pd
print(os.getcwd())
df = pd.read_excel(r'C:/Users/Eder/Desktop/TFG/Data/Interpolation_sample.xlsx',
index_col =0,parse_dates=True, sheet_name='sheet3')
And the answer from the console:
runcell(0, 'C:/Users/Eder/untitled0.py')
C:\Users\Eder\Desktop\TFG\Data
Traceback (most recent call last):
File "C:\Users\Eder\untitled0.py", line 14, in <module>
index_col =0,parse_dates=True, sheet_name='sheet3')
File "E:\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "E:\Anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 336, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "E:\Anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 1072, in __init__
content=path_or_buffer, storage_options=storage_options
File "E:\Anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 950, in inspect_excel_format
content_or_path, "rb", storage_options=storage_options, is_text=False
File "E:\Anaconda3\lib\site-packages\pandas\io\common.py", line 651, in get_handle
handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Eder\\Desktop\\TFG\\Data\\Interpolation_sample.xlsx'
I've figured out a way to solve the problem. I just changed the name of the file from 'Interpolation_sample' to 'Interpolation sample'. I don't know why, but the underscore in the file name is what was causing this error.

Reading in csv file as dataframe from hdfs

I'm using pydoop to read in a file from hdfs, and when I use:
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
print f.read()
It shows me the file in stdout.
Is there any way for me to read in this file as dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error is:
>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
self._make_engine(self.engine)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist
I know next to nothing about hdfs, but I wonder if the following might work:
with hd.open("/home/file.csv") as f:
df = pd.read_csv(f)
I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
pd.read_csv("/home/file.csv") would work if the regular Python file open works - i.e. it reads the file a regular local file.
with open("/home/file.csv") as f:
print f.read()
But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
you can use the following code to read csv from hdfs
import pandas as pd
import pyarrow as pa
hdfs_config = {
"host" : "XXX.XXX.XXX.XXX",
"port" : 8020,
"user" : "user"
}
fs = pa.hdfs.connect(hdfs_config['host'], hdfs_config['port'],
user=hdfs_config['user'])
df=pd.read_csv(fs.open("/home/file.csv"))
Use read instead open, it works
with hd.read("/home/file.csv") as f:
df = pd.read_csv(f)

TypeError when trying to open a workbook using openpyxl

I'm trying to use openpyxl to open and modify an existing excel workbook, but I can't even open the file without getting an error.
from openpyxl import load_workbook
ws = load_workbook('PO-Copy.xlsx')
I get a long TypeError as a result:
Traceback (most recent call last):
File "<module1>", line 6, in <module>
File "C:\Python27\Lib\site-packages\openpyxl\reader\excel.py", line 151, in load_workbook
_load_workbook(wb, archive, filename, read_only, keep_vba)
File "C:\Python27\Lib\site-packages\openpyxl\reader\excel.py", line 224, in _load_workbook
keep_vba=keep_vba)
File "C:\Python27\Lib\site-packages\openpyxl\reader\worksheet.py", line 308, in read_worksheet
fast_parse(ws, xml_source, shared_strings, style_table, color_index)
File "C:\Python27\Lib\site-packages\openpyxl\reader\worksheet.py", line 296, in fast_parse
parser.parse()
File "C:\Python27\Lib\site-packages\openpyxl\reader\worksheet.py", line 84, in parse
dispatcher[tag_name](element)
File "C:\Python27\Lib\site-packages\openpyxl\reader\worksheet.py", line 282, in parse_data_validation
dv = parser(tag)
File "C:\Python27\Lib\site-packages\openpyxl\worksheet\datavalidation.py", line 179, in parser
dv = DataValidation(**element.attrib)
TypeError: __init__() got an unexpected keyword argument 'errorStyle'
Has anyone else ran into this error? is there a fix I can use to keep going?
The ability to read DataValidation in existing files was added in openpyxl 2.1 but was limited to what DataValidation in Python supported. Work has started on supporting DataValidation fully and is available in the 2.2 branch at https://bitbucket.org/habub68/openpyxl

pandas HDFStore - how to reopen?

I created a file by using:
store = pd.HDFStore('/home/.../data.h5')
and stored some tables using:
store['firstSet'] = df1
store.close()
I closed down python and reopened in a fresh environment.
How do I reopen this file?
When I go:
store = pd.HDFStore('/home/.../data.h5')
I get the following error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 207, in __init__
self.open(mode=mode, warn=False)
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 302, in open
self.handle = _tables().openFile(self.path, self.mode)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 230, in openFile
return File(filename, mode, title, rootUEP, filters, **kwargs)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 495, in __init__
self._g_new(filename, mode, **params)
File "hdf5Extension.pyx", line 317, in tables.hdf5Extension.File._g_new (tables/hdf5Extension.c:3039)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "H5F.c", line 1582, in H5Fopen
unable to open file
File "H5F.c", line 1373, in H5F_open
unable to read superblock
File "H5Fsuper.c", line 334, in H5F_super_read
unable to find file signature
File "H5Fsuper.c", line 155, in H5F_locate_signature
unable to find a valid file signature
End of HDF5 error back trace
Unable to open/create file '/home/.../data.h5'
What am I doing wrong here? Thank you.
In my hands, following approach works best:
df = pd.DataFrame(...)
"write"
with pd.HDFStore('test.h5', mode='w') as store:
store.append('df', df, data_columns= df.columns, format='table')
"read"
with pd.HDFStore('test.h5', mode='r') as newstore:
df_restored = newstore.select('df')
You could try doing instead:
store = pd.io.pytables.HDFStore('/home/.../data.h5')
df1 = store['firstSet']
or use the read method directly:
df1 = pd.read_hdf('/home/.../data.h5', 'firstSet')
Either way, you should have pandas 0.12.0 or higher...
I had the same problem and finally fixed it by installing the pytables module (next to the pandas modules which I was using):
conda install pytables
which got me numexpr-2.4.3 and pytables-3.2.0
After that it worked. I am using pandas 0.16.2 under python 2.7.9

Categories

Resources