python pandas read_excel engine=openpyxl not closing file - python

I am loading a dataframe into pandas using following:
import pandas as pd
df_factor_histories=pd.read_excel("./eco_factor/eco_factor_test_data_builder.xlsx",
engine='openpyxl', sheet_name=0)
engine=openpyxl is required to enable read_excel to support newer Excel file formats (specifically in my case .xlsx rather than jusy .xls).
The dataframe loads just fine but the file is left open:
import psutil
p = psutil.Process()
print(p.open_files())
OUTPUT
[popenfile(path='C:\\Users\\xx\\.ipython\\profile_default\\history.sqlite', fd=-1),
popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
popenfile(path='C:\\Windows\\System32\\en-US\\kernel32.dll.mui', fd=-1),
popenfile(path='D:\\xxxxx\\data modelling\\eco_factor\\eco_factor_test_data_builder.xlsx', fd=-1)]
This Github Post suggests the bug is fixed - but not for me (running Anaconda/Jupyter).
Relevant versions I am running:
numpy 1.19.2
openpyxl 3.0.5
pandas 1.1.3
Python 3.7.4
I would appreciate some suggestions on how to close the files/best work around this, thanks

I suggest to remove engine='openpyxl' from your code. It isn't actually needed. I use the pd.read_excel without it and it works just fine even for .xlsx format.
Removing this will cause the default behavior for the engine parameter to take over. The engine will know which engine to use:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel
engine : str, default None
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :
“xlrd” supports old-style Excel files (.xls).
“openpyxl” supports newer Excel file formats.
“odf” supports OpenDocument file formats (.odf, .ods, .odt).
“pyxlsb” supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if openpyxl is installed, then openpyxl will be used.
Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.
Otherwise xlrd will be used and a FutureWarning will be raised. This case will raise a ValueError in a future version of pandas.

I was facing the same issue,
When i read the excel file using pandas with engine=openpyxl. It is not being closed. when i try to archive/ move the excel file using python, it was giving error,
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process
Also, once it is read by pandas, we are not able to edit or any other operation on excel file using Excel tool.
Following solution worked for me.
i am using:
python version 3.6.8.
pandas==0.25.1
openpyxl==3.0.7
import io
import pandas as pd
with open('path/to/input_excel_file.xlsx', "rb") as f:
file_io_obj = io.BytesIO(f.read())
df_input_file = pd.read_excel(file_io_obj, engine='openpyxl', sheet_name=None)

Related

I cant read parquet file by pandas read_parquet function

when I use pd.read_parquet to read a parquet file this error is displayed
my code:
import pandas as pd
df = pd.read_parquet("fhv_tripdata_2018-05.parquet")
error:
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 32094334800000000
I want to convert this file to csv:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2018-05.parquet
Please provide a minimal example, i.e., a small parquet file that generates the error.
It seems there are some open issues with this. Conventions are not compatible and apparently there are pitfalls in reading/writing dates in parquet via pandas. Thus, I propose a solution by directly using pyarrow:
import pyarrow.parquet as pq
table = pq.read_table('fhv_tripdata_2018-05.parquet')
table.to_pandas(timestamp_as_object=True)
// to csv
table.to_csv('data.csv')
Note the extra flag timestamp_as_object which prevents the overflow you observed.

Pandas won't read Excel XML 2003 file with `xls` extension

Can anyone please tell me why pandas won't read Excel XML 2003 file with xls extension? When I try to read it from my Python script, it throws an error:
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<html x'
I know the obvious reason: it is actually a XML file with fake xls extension. But I can still open it with Excel: a normal spread sheet. I think that means there is still a way to read it from pandas?
If no luck, can I convert this Excel XML 2003 with xls extension into a "real" xls file without the XML tags by using Python script? If so I can just add this section of code in front of the PANDAS code to read the converted xls file.
It should work if you make sure openpyxl is installed and explicitly tell Pandas to use that engine:
df = pd.read_excel("foo.xls", engine="openpyxl")
# ^^^^^^^^^^^^^^^^^
Pandas can use one of four underlying engines when ingesting Excel files. The one it uses for xls files doesn't support the newer formats:
If io is not a buffer or path, this must be set to identify io. Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb". Engine compatibility:
"xlrd" supports old-style Excel files (.xls).
"openpyxl" supports newer Excel file formats.
"odf" supports OpenDocument file formats (.odf, .ods, .odt).
"pyxlsb" supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if path_or_buffer is in xlsb format, pyxlsb will be used.
New in version 1.3.0.
Otherwise openpyxl will be used.
Changed in version 1.3.0.

Read XLSB File in Pandas Python

There are many questions on this, but there has been no simple answer on how to read an xlsb file into pandas. Is there an easy way to do this?
With the 1.0.0 release of pandas - January 29, 2020, support for binary Excel files was added.
import pandas as pd
df = pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
Notes:
You will need to upgrade pandas - pip install pandas --upgrade
You will need to install pyxlsb - pip install pyxlsb
Hi actually there is a way. Just use pyxlsb library.
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
df = []
with open_xlsb('some.xlsb') as wb:
with wb.get_sheet(1) as sheet:
for row in sheet.rows():
df.append([item.v for item in row])
df = pd.DataFrame(df[1:], columns=df[0])
UPDATE:
as of pandas version 1.0 read_excel() now can read binary Excel (.xlsb) files by passing engine='pyxlsb'
Source: https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html
Pyxlsb indeed is an option to read xlsb file, however, is rather limited.
I suggest using the xlwings package which makes it possible to read and write xlsb files without losing sheet formating, formulas, etc. in the xlsb file. There is extensive documentation available.
import pandas as pd
import xlwings as xw
app = xw.App()
book = xw.Book('file.xlsb')
sheet = book.sheets('sheet_name')
df = sheet.range('A1').options(pd.DataFrame, expand='table').value
book.close()
app.kill()
'A1' in this case is the starting position of the excel table.
To write to xlsb file, simply write:
sheet.range('A1').value = df
If you want to read a big binary file or any excel file with some ranges you can directly put at this code
range = (your_index_number)
first_dataframe = []
second_dataframe = []
with open_xlsb('Test.xlsb') as wb:
with wb.get_sheet('Sheet1') as sheet:
i=0
for row in sheet.rows():
if(i!=range):
first_dataframe.append([item.v for item in row])
i=i+1
else:
second_dataframe.append([item.v for item in row])
first_dataframe = pd.DataFrame(first_dataframe[1:], columns=first[0])
second_dataframe = pd.DataFrame(second_dataframe[:], columns=first.columns)
To be able to read xlsb files, it is necessary to have openpyxl installed.
As per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel
engine: str, default None
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :
“xlrd” supports old-style Excel files (.xls).
“openpyxl” supports newer Excel file formats.
“odf” supports OpenDocument file formats (.odf, .ods, .odt).
“pyxlsb” supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files.
When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if openpyxl is installed, then openpyxl will be used.
Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.
Otherwise xlrd will be used and a FutureWarning will be raised. This case will raise a ValueError in a future version of pandas.
xlsb reading without index_col:
import pandas as pd
dfcluster = pd.read_excel('c:/xml/baseline/distribucion.xlsb', sheet_name='Cluster', index_col=0, engine='pyxlsb')

importing an excel file to python

I have a basic question about importing xlsx files to Python. I have checked many responses about the same topic, however I still cannot import my files to Python whatever I try. Here's my code and the error I receive:
import pandas as pd
import xlrd
file_location = 'C:\Users\cagdak\Desktop\python_self_learning\Coursera\sample_data.xlsx'
workbook = xlrd.open_workbook(file_location)
Error:
IOError: [Errno 2] No such file or directory: 'C:\\Users\\cagdak\\Desktop\\python_self_learning\\Coursera\\sample_data.xlsx'
With pandas it is possible to get directly a column of an Excel file. Here is the code.
import pandas
df = pandas.read_excel('sample.xls')
#print the column names
print df.columns
#get the values for a given column
values = df['column_name'].values
#get a data frame with selected columns
FORMAT = ['Col_1', 'Col_2', 'Col_3']
df_selected = df[FORMAT]
You should use raw strings or escape your backslash instead, for example:
file_location = r'C:\Users\cagdak\Desktop\python_self_learning\Coursera\sample_data.xlsx'
or
file_location = 'C:\\Users\\cagdak\\Desktop\python_self_learning\\Coursera\\sample_data.xlsx'
go ahead and try this:
file_location = 'C:/Users/cagdak/Desktop/python_self_learning/Coursera/sample_data.xlsx'
As pointed out above Pandas supports reading of Excel spreadsheets using its read_excel() method. However, it is dependent upon a number of external libraries depending on which version Excel/odf is being accessed. It defaults to selecting one automatically, though one can be specified using the engine parameter. Here's an excerpt from the docs:
"xlrd" supports old-style Excel files (.xls).
"openpyxl" supports newer Excel file formats.
"odf" supports OpenDocument file formats (.odf, .ods, .odt).
"pyxlsb" supports Binary Excel files.
If the required library is not already installed you'll see an error message suggesting library you need to install.

Pandas excel reading buffer error (python 3)

I am having a problem reading an excel file from a download link using pandas. The excelString below loads correctly and looks like an excel file, but when trying to convert it to excel using pandas it says the file name is too long. Any assistance would be appreciated. This is a useful generic problem to solve for anyone accessing iShares index membership info.
import urllib
import pandas as pd
f = urllib.request.urlopen('https://www.ishares.com/us/239714/fund-download.dl')
excelString = f.read().decode('utf-8')
pd.ExcelFile(excelString)
The Error returned is OSError: [Errno 36] File name too long
Works fine for me using Python3 and pandas 0.16.2 - do you have the latest version?

Categories

Resources