Python: any lazy method for reading .xls files?

I know how to read .xls files with pandas. However, it returns all the data at once. I want to load data on demand; that is, I want a generator that returns the next row each time it is iterated. See this question for general files.
I know openpyxl can do this, following this webpage. However, it doesn't support old .xls files. It recommends using xlrd instead, but I don't know how to do what I want with that package.
The documentation explains how to do this sheet by sheet, but not row by row (my file has only one sheet).

pandas doesn't support lazy loading: it reads the whole file and keeps everything in memory.
Polars, an alternative to pandas, supports lazy loading.
Unfortunately, this isn't yet implemented for .xls files.
One workaround is to convert the Excel file to CSV and use the scan_csv function.
import polars as pl
pl.scan_csv("sample.csv")
<polars.internals.lazyframe.frame.LazyFrame object at 0x7f0ae95d1c00>
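If converting to CSV is not an option, a row generator can be sketched around xlrd for old .xls files. This is only a sketch under assumptions: it presumes xlrd is installed, and the names `iter_rows`, `iter_xls_rows` and `sample.xls` are made up for illustration. As far as I know, xlrd still parses a whole sheet when that sheet is first accessed, so this gives row-by-row iteration rather than truly lazy reading from disk; `on_demand=True` only defers loading of sheets you never ask for.

```python
def iter_rows(sheet):
    """Yield one row (a list of cell values) at a time from an xlrd-style sheet."""
    for row_index in range(sheet.nrows):
        yield sheet.row_values(row_index)


def iter_xls_rows(path, sheet_index=0):
    """Open an .xls file with xlrd and yield its rows one by one."""
    import xlrd  # third-party; imported here so iter_rows works without it

    # on_demand=True defers loading each sheet until it is requested.
    book = xlrd.open_workbook(path, on_demand=True)
    try:
        yield from iter_rows(book.sheet_by_index(sheet_index))
    finally:
        # Free the memory held by the workbook once iteration ends.
        book.release_resources()
```

Usage would look like `for row in iter_xls_rows("sample.xls"): ...`, where each `row` is a plain list of cell values.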

Related

Read and write a single cell in excel using Python

I am looking to replace the database (SQL) (around 50,000 rows by 50 columns) for my app with Excel. I need to update a single cell in Excel without loading the whole workbook and then saving it again (I am using openpyxl), as that is computationally very expensive. I need an alternative that will help me save execution time.
I have tried Excel APIs like xlwings, but I need an alternative to such APIs.

I cannot comment yet, so I will "answer". Why would you replace a database with Excel? It sounds crazy to me. There are plenty of other persistent storage formats out there: pickle, HDF5, pyarrow-based formats, CSV, etc. I used the Feather format for a while; it is super fast and pandas can read it natively.
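To illustrate why a database suits this access pattern, here is a minimal sketch using Python's built-in sqlite3 module, which is not one of the formats named above. The table name `records`, its columns, and the helper `update_cell` are hypothetical. The point is only that a single-cell update is one statement, with no need to load and rewrite the whole dataset.

```python
import sqlite3


def update_cell(conn, row_id, column, value):
    """Update one cell in the hypothetical `records` table."""
    # Column names cannot be bound as parameters, so check against a whitelist.
    if column not in {"name", "score"}:
        raise ValueError(f"unknown column: {column}")
    conn.execute(f"UPDATE records SET {column} = ? WHERE id = ?", (value, row_id))
    conn.commit()


conn = sqlite3.connect(":memory:")  # pass a file path instead for persistent storage
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO records (id, name, score) VALUES (?, ?, ?)",
    [(1, "a", 1.0), (2, "b", 2.0)],
)
conn.commit()

# Change a single cell; the other rows are left untouched.
update_cell(conn, 2, "score", 9.5)
```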

Load only a single sheet from a large workbook with Python

Using Python 3.7. I have several .xlsx workbooks with 34 sheets each, most of which have conditional formatting and charts, but all I'm actually after is a cell with specified text that's somewhere on the first sheet of each book. The workbook is not protected but the sheet is, and I don't know the password, so I can't use pandas.read_excel; using openpyxl/load_workbook, it takes ages to load and I get lots of errors about it not being able to handle conditional formatting etc. I then have to search the sheet for the text.
Is there an easy, quick way of loading just the first sheet (or a named sheet)? The pandas code is very quick and easy, but I can't use it :(

I'm not completely sure about this, but I can recommend trying the "read-only" mode of openpyxl:
https://openpyxl.readthedocs.io/en/stable/optimized.html
It does not load the full file into memory but reads it in so-called "lazy" mode, so you can iterate until you reach the cell you need.
It also lets you start reading from a specific sheet.
Note that in this mode you must close the workbook explicitly.
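A rough sketch of that approach, assuming openpyxl is installed: `load_workbook(..., read_only=True)`, `iter_rows(values_only=True)` and `close()` are openpyxl calls, while the helper names `find_cell` and `find_in_first_sheet` are made up here. Whether read-only mode copes with the protected sheet and the conditional formatting in your files is something I have not verified.

```python
def find_cell(rows, text):
    """Return the 1-based (row, column) of the first cell equal to text, or None.

    Positions are counted from the first row and first cell yielded by `rows`.
    """
    for row_number, row in enumerate(rows, start=1):
        for column_number, value in enumerate(row, start=1):
            if value == text:
                return row_number, column_number
    return None


def find_in_first_sheet(path, text):
    """Search the first sheet of an .xlsx workbook lazily for a cell value."""
    from openpyxl import load_workbook  # third-party

    workbook = load_workbook(path, read_only=True)
    try:
        sheet = workbook.worksheets[0]  # or workbook["Sheet name"] for a named sheet
        # values_only=True yields plain tuples of values instead of cell objects.
        return find_cell(sheet.iter_rows(values_only=True), text)
    finally:
        # Read-only workbooks keep the file open until closed explicitly.
        workbook.close()
```

Because `find_cell` returns as soon as it finds a match, rows after the matching one are never read.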

How do I use python pandas to read an already opened excel sheet

Suppose I have an Excel sheet already open, make some changes in the file, and use pd.read_excel to create a dataframe based on that sheet. I understand that the dataframe will only reflect the data in the last saved version of the Excel file, so I would have to save the sheet first for the pandas dataframe to take the change into account.
Is there any way for pandas or other Python packages to read an opened Excel file and refresh its data in real time (without saving or closing the file)?

Have you tried the mitosheet package? It doesn't answer your question directly, but it lets you work on pandas dataframes as you would on Excel sheets. That way, you can edit the data on the fly as in Excel and still get a pandas dataframe as a result (while it generates the code to perform the same operations in Python). Does this help?

There is no way to do this. The unsaved changes are not written to disk, so pandas cannot read them from disk.

Be careful not to over-engineer. That being said:
Depending on your use case, if this is really needed, I could theoretically imagine a Robotic Process Automation (RPA) tool such as BluePrism, UiPath or PowerAutomate continuously loading live data from Excel into a Python environment as a pandas DataFrame and then changing it.
This would have to be a really important process, though; otherwise licensing RPA is not worth it here.

df = pd.read_excel("path")
If you run the program in the Spyder IDE, you can see the data in the variable explorer.

Reading specific excel sheet into Python pandas dataframe without loading the whole workbook

pd.read_excel(filepath, sheetname) and openpyxl.Workbook(filepath) load the whole workbook. I have a 30MB file on a server and it takes a long time to load. I was wondering if there was a method to load a specific sheet as a pandas dataframe without opening the whole workbook.
I have looked into a couple of methods. Most recently, converting the workbook to a .zip and reading the .xml files. I know that people have suggested using lxml. I'm not sure how to go from here.
Is there a better way of doing this?
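The zip-and-XML idea can be sketched with the standard library alone, streaming one worksheet with `xml.etree.ElementTree.iterparse` instead of lxml. This is a sketch under several assumptions: the sheet is stored at `xl/worksheets/sheet1.xml` (the real path should be resolved from `xl/workbook.xml` and its relationships file), the function names are made up, all values come back as strings, empty cells are simply absent so columns may not line up, and inline strings are not handled.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by SpreadsheetML elements inside an .xlsx archive.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def load_shared_strings(archive):
    """Return the workbook's shared-string table as a list (empty if absent)."""
    if "xl/sharedStrings.xml" not in archive.namelist():
        return []
    with archive.open("xl/sharedStrings.xml") as handle:
        root = ET.parse(handle).getroot()
    # Each <si> entry may hold several <t> text runs; join them.
    return [
        "".join(t.text or "" for t in si.iter(NS + "t"))
        for si in root.iter(NS + "si")
    ]


def iter_sheet_rows(path, member="xl/worksheets/sheet1.xml"):
    """Yield each row of one worksheet as a list of strings, without
    reading the other sheets in the archive."""
    with zipfile.ZipFile(path) as archive:
        strings = load_shared_strings(archive)
        with archive.open(member) as handle:
            for _event, elem in ET.iterparse(handle, events=("end",)):
                if elem.tag != NS + "row":
                    continue
                values = []
                for cell in elem.iter(NS + "c"):
                    value = cell.find(NS + "v")
                    text = value.text if value is not None else None
                    # t="s" means the value is an index into the shared strings.
                    if cell.get("t") == "s" and text is not None:
                        text = strings[int(text)]
                    values.append(text)
                yield values
                elem.clear()  # drop the parsed row's children to limit memory use
```

The rows could then be handed to `pd.DataFrame(list(iter_sheet_rows(filepath)))`, with type conversion left to the caller.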

XLRD vs Win32 COM performance comparison

I have a huge Excel (.xls) file that I have to read data from. I tried using the xlrd library, but it is pretty slow. I then found out that converting the Excel file to CSV manually and reading the CSV file is orders of magnitude faster.
But I cannot ask my client to save the .xls as CSV manually every time before importing the file. So I thought of converting the file on the fly, before reading it.
Has anyone done any benchmarking as to which procedure is faster:
Open the Excel file with the xlrd library and save it as a CSV file, or
Open the Excel file with the win32com library and save it as a CSV file?
I am asking because the slowest part is opening the file, so if I can get a performance boost from using win32com, I would gladly try it.

If you need to read the file frequently, I think it is better to save it as CSV. Otherwise, just read it on the fly.
As for performance, I think win32com is faster. However, considering cross-platform compatibility, I think xlrd is better.
win32com is also more powerful. With it, you can handle Excel in every way (e.g. reading/writing cells or ranges).
However, if you are just looking for a quick file conversion, I think pandas.read_excel also works.
I am using another package, xlwings, so I am also interested in a comparison among these packages.
In my opinion:
I would use pandas.read_excel for quick file conversion.
If more processing in Excel is needed, I would choose win32com.
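I have no benchmark numbers to offer, but the xlrd side of the comparison can be timed with a small sketch like the one below. It assumes xlrd is installed; the function names are made up, and the win32com variant is not shown because it needs Windows with Excel installed. Note that xlrd returns dates as floats, so the CSV may need post-processing.

```python
import csv
import time


def write_rows_to_csv(rows, csv_path):
    """Write an iterable of rows to csv_path and return the number of rows written."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


def xls_to_csv(xls_path, csv_path, sheet_index=0):
    """Convert one sheet of an .xls file to CSV with xlrd; return elapsed seconds."""
    import xlrd  # third-party

    start = time.perf_counter()
    book = xlrd.open_workbook(xls_path)
    sheet = book.sheet_by_index(sheet_index)
    # Feed rows to the CSV writer one at a time via a generator expression.
    write_rows_to_csv((sheet.row_values(i) for i in range(sheet.nrows)), csv_path)
    return time.perf_counter() - start
```

Timing the same file through a win32com "save as CSV" call with `time.perf_counter()` would give a like-for-like comparison on your own data.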
