Faster way to read Excel files to pandas dataframe

I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
def OTT_read(xl,site_name):
    df = pd.read_excel(xl.io,site_name,skiprows=2,parse_dates=0,index_col=0,
                       usecols=[0,1,2],header=None,
                       names=['date_time','%s_depth'%site_name,'%s_temp'%site_name])
    return df
def make_OTT_df(FILEDIR,OTT_FILE):
    xl = pd.ExcelFile(FILEDIR + OTT_FILE)
    site_names = xl.sheet_names
    df_list = [OTT_read(xl,site_name) for site_name in site_names]
    return site_names,df_list
FILEDIR='c:/downloads/'
OTT_FILE='OTT_Data_All_stations.xlsx'
site_names_OTT,df_list_OTT = make_OTT_df(FILEDIR,OTT_FILE)

As others have suggested, reading CSV is faster. So if you are on Windows and have Excel installed, you could call a VBScript to convert the Excel file to CSV and then read the CSV. I tried the script below and it took about 30 seconds.
import pandas as pd
from subprocess import call
# create a list with the sheet numbers you want to process
sheets = [str(n) for n in range(1, 6)]
# convert each sheet to csv and then read it using read_csv
df = {}
excel = 'C:\\Users\\rsignell\\OTT_Data_All_stations.xlsx'
for sheet in sheets:
    csv = 'C:\\Users\\rsignell\\test' + sheet + '.csv'
    call(['cscript.exe', 'C:\\Users\\rsignell\\ExcelToCsv.vbs', excel, csv, sheet])
    df[sheet] = pd.read_csv(csv)
Here's a little snippet of Python to create the ExcelToCsv.vbs script:
# write the VBScript to a file
vbscript = """if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If
csv_format = 6
Set objFSO = CreateObject("Scripting.FileSystemObject")
src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))
Dim oExcel
Set oExcel = CreateObject("Excel.Application")
Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate
oBook.SaveAs dest_file, csv_format
oBook.Close False
oExcel.Quit
"""
# open in text mode and write the string directly (works on Python 3)
with open('ExcelToCsv.vbs', 'w', encoding='utf-8') as f:
    f.write(vbscript)
This answer benefited from "Convert XLS to CSV on command line" and "csv & xlsx files import to pandas data frame: speed issue".

I used xlsx2csv to convert the Excel file to CSV in memory, and this cut the read time roughly in half.
from xlsx2csv import Xlsx2csv
from io import StringIO
import pandas as pd
def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df

If you have fewer than 65536 rows (in each sheet), you can try xls instead of xlsx. In my experience xls is faster than xlsx. It is difficult to compare with csv because the result depends on the number of sheets.
Although this is not an ideal solution (xls is an old, proprietary binary format), I have found it useful if you are working with a lot of sheets, with internal formulas whose values are updated often, or if for whatever reason you would really like to keep Excel's multi-sheet functionality (instead of separate csv files).

In my experience, pandas read_excel() works fine with Excel files that have multiple sheets. As suggested in "Using Pandas to read multiple worksheets", if you set sheet_name to None it reads every sheet and returns a dictionary of DataFrames keyed by sheet name.
The part that takes time is the text parsing in your code. A 14MB Excel file with 5 sheets is not that much. I have a 20.1MB Excel file with 46 sheets, each with more than 6000 rows and 17 columns, and read_excel took this long:
import time
import datetime as dt
import pandas as pd
t0 = time.time()
def parse(datestr):
    y,m,d = datestr.split("/")
    return dt.date(int(y),int(m),int(d))
data = pd.read_excel("DATA (1).xlsx", sheet_name=None, encoding="utf-8", skiprows=1, header=0, parse_dates=[1], date_parser=parse)
t1 = time.time()
print(t1 - t0)
## result: 37.54169297218323 seconds
In the code above, data is a dictionary of 46 DataFrames.
As others suggested, using read_csv() can help because reading a .csv file is faster. Bear in mind, though, that because .xlsx files are compressed, the equivalent .csv files may be larger and hence slower to read. If you want to convert your file to comma-separated values using Python (VBScript code is given in Rich Signell's answer), see "Convert xlsx to csv".

I know this is old, but in case anyone else is looking for an answer that doesn't involve VB: pandas read_csv() is faster, but you don't need a VB script to get a csv file.
Open your Excel file and save it in *.csv (comma separated value) format.
Under Tools you can select Web Options, and under the Encoding tab you can change the encoding to whatever works for your data. I ended up using Windows, Western European because Windows UTF encoding is "special", but there are lots of ways to accomplish the same thing. Then use the encoding argument in pd.read_csv() to specify your encoding.
Encoding options are listed here

I encourage you to do the comparison yourself and see which approach is appropriate in your situation.
For instance, if you are processing a lot of XLSX files and are only going to ever read each one once, you may not want to worry about the CSV conversion. However, if you are going to read the CSVs over and over again, then I would highly recommend saving each of the worksheets in the workbook to a csv once, then read them repeatedly using pd.read_csv().
Below is a simple script that will let you compare Importing XLSX Directly, Converting XLSX to CSV in memory, and Importing CSV. It is based on Jing Xue's answer.
Spoiler alert: If you are going to read the file(s) multiple times, it's going to be faster to convert the XLSX to CSV.
I did some testing with some files I'm working on, and here are my results:
5,874 KB xlsx file (29,415 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]: 0:00:31.75
Elapsed time for [Convert XLSX to CSV in mem]: 0:00:22.19
Elapsed time for [Import CSV file]: 0:00:00.21
********************
202,782 KB xlsx file (990,832 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]: 0:17:04.31
Elapsed time for [Convert XLSX to CSV in mem]: 0:12:11.74
Elapsed time for [Import CSV file]: 0:00:07.11
YES! the 202MB file really did take only 7 seconds compared to 17 minutes for the XLSX!!!
If you're ready to set up your own test, just open your XLSX in Excel and save one of the worksheets to CSV. For a final solution, you would obviously need to loop through the worksheets to process each one.
You will also need to pip install rich pandas xlsx2csv.
from rich import print
import pandas as pd
from datetime import datetime
from xlsx2csv import Xlsx2csv
from io import StringIO
def timer(name, startTime = None):
    if startTime:
        print(f"Timer: Elapsed time for [{name}]: {datetime.now() - startTime}")
    else:
        startTime = datetime.now()
        print(f"Timer: Starting [{name}] at {startTime}")
    return startTime
def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df
xlsxFileName = "MyBig.xlsx"
sheetName = "Sheet1"
csvFileName = "MyBig.csv"
startTime = timer(name="Import XLSX with Pandas")
df = pd.read_excel(xlsxFileName, sheet_name=sheetName)
timer("Import XLSX with Pandas", startTime)
startTime = timer(name="Convert XLSX to CSV first")
df = read_excel(path=xlsxFileName, sheet_name=sheetName)
timer("Convert XLSX to CSV first", startTime)
startTime = timer(name="Import CSV")
df = pd.read_csv(csvFileName)
timer("Import CSV", startTime)

There's no reason to open Excel if you're willing to deal with a slow conversion once.
Read the data into a dataframe with pd.read_excel().
Dump it into a csv right away with the dataframe's to_csv() method.
This avoids both Excel and Windows-specific calls. In my case the one-time hit was worth the hassle. I got a ☕.
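The "convert once, then read the CSV" idea can be sketched as a small caching helper. This is only a sketch under assumptions: the function name read_sheet_cached, the cache folder, the file naming scheme and the modification-time check are all made up for illustration and are not part of the answer above. It also assumes the sheet name is safe to use in a file name.

```python
from pathlib import Path

import pandas as pd


def read_sheet_cached(xlsx_path, sheet_name, cache_dir="csv_cache"):
    """Read one worksheet, converting it to CSV only when the cache is stale."""
    xlsx_path = Path(xlsx_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <workbook stem>__<sheet name>.csv
    csv_path = cache_dir / f"{xlsx_path.stem}__{sheet_name}.csv"

    # Fast path: reuse the CSV if it is at least as new as the workbook.
    if csv_path.exists() and csv_path.stat().st_mtime >= xlsx_path.stat().st_mtime:
        return pd.read_csv(csv_path)

    # Slow path, paid once per workbook change.
    df = pd.read_excel(xlsx_path, sheet_name=sheet_name)
    df.to_csv(csv_path, index=False)
    return df
```

The first call for a sheet is as slow as pd.read_excel(); later calls only pay the pd.read_csv() cost. Note that dtypes inferred from the CSV may differ from those read from Excel (dates in particular come back as strings unless you parse them).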

Related

how do i assemble bunch of excel files into one or more using python

I have around 10k .csv files named in sequence (data0, data1, and so on). I want to combine them into one master sheet in a single file, or at least into a couple of sheets, using Python, because I think there is a limit of around 1070000 rows in one Excel file.
import pandas as pd
import os
master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        master_df = pd.concat([master_df, pd.read_csv(file)])
master_df.to_csv('master file.CSV', index=False)

A few things to note:
Check the content of your csv files first. Columns are easily mismatched when a csv contains free text (for example a ; inside a field). You can also try changing the csv engine:
df = pd.read_csv(csvfilename, sep=';', encoding='utf-8', engine='python')
If you want to combine everything into one sheet, concatenate into one dataframe first, then call to_excel:
df = pd.concat([df, sh_tmp], axis=0, sort=False)
Note: concat or append is a straightforward way to combine data, but with 10k files it becomes a performance problem. If you run into that, collect the dataframes in a list and concatenate once instead of calling pd.concat inside the loop.
Excel has a maximum row limit (1048576 rows per sheet), and 10k files would easily exceed it. You could write the output to a csv file instead, or split it across multiple .xlsx files.
Update for the third point:
You can group the data first (1000k rows each), then write the groups to sheets one by one.
row_limit = 1000000
master_df['group'] = master_df.index // row_limit
# the context manager saves the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter(path_out) as writer:
    for gr in range(0, master_df['group'].max() + 1):
        master_df.loc[master_df['group'] == gr].to_excel(writer, sheet_name='Sheet' + str(gr), index=False)
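The performance note above (build a list, concatenate once) might look like the sketch below. The function name combine_csvs and the "data*.csv" pattern are assumptions chosen to match the file names in the question; adjust them to your folder.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(folder, pattern="data*.csv"):
    """Read every matching CSV and concatenate them in a single step."""
    # Note: sorted() orders names as text, so data10 comes before data2.
    files = sorted(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
    frames = [pd.read_csv(f) for f in files]
    # One concat at the end avoids re-copying the growing frame for each file.
    return pd.concat(frames, ignore_index=True, sort=False)
```

Because ignore_index=True gives the result a fresh 0..n-1 index, the master_df.index // row_limit grouping shown above can be applied to it directly.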

Pandas: How to write custom time duration format to Excel file with pd.ExcelWriter via Openpyxl

Using the Openpyxl engine for Pandas via pd.ExcelWriter, I'd like to know if there is a way to specify a (custom) Excel duration format for elapsed time.
The format I would like to use is: [hh]:mm:ss which should give a time like: 01:01:01 for 1 hour, 1 minute, 1 second.
I want to write from a DataFrame into this format so that Excel can recognize it when I open the spreadsheet file in the Excel application, after writing the file.
Here is my current demo code, taking a duration of two datetime.now() timestamps:
import pandas as pd
from time import sleep
from datetime import datetime
start_time = datetime.now()
sleep(1)
end_time = datetime.now()
elapsed_time = end_time - start_time
df = pd.DataFrame([[elapsed_time]], columns=['Elapsed'])
with pd.ExcelWriter('./sheet.xlsx') as writer:
    df.to_excel(writer, engine='openpyxl', index=False)
Note that in this implementation, type(elapsed_time) is <type 'datetime.timedelta'>.
The code will create an Excel file with approximately the value 0.0000116263657407407 in the column of "Elapsed". In Excel's time/date format, the value 1.0 equals 1 full day, so this is roughly 1 second of that 1 day.
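To make the arithmetic concrete, here is a small sketch of the conversion between a timedelta and Excel's day fraction. The helper names are made up for illustration, the code is written for Python 3, and rounding to whole seconds for display is an assumption about how the [hh]:mm:ss format behaves.

```python
from datetime import timedelta

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def timedelta_to_excel_days(td):
    """Convert a timedelta to the fraction of a day that Excel stores."""
    return td.total_seconds() / SECONDS_PER_DAY


def excel_days_to_hms(days):
    """Render a day fraction like [hh]:mm:ss, rounded to whole seconds."""
    total_seconds = int(round(days * SECONDS_PER_DAY))
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    # Hours are not wrapped at 24, matching the bracketed [hh] format.
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

With these helpers, the stored value 0.0000116263657407407 corresponds to about 1.0045 seconds, and excel_days_to_hms() renders it as 00:00:01.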
If I select the Custom category under Format > Cells > Number (CMD + 1) and specify the custom format [hh]:mm:ss for the cell, the value is displayed as a duration.
This is the format I want to see every time I open the file in Excel after writing it.
However, I have looked around for solutions, and I cannot find a way to inherently tell pd.ExcelWriter, df.to_excel, or Openpyxl how to format the datetime.timedelta object in this way.
The Openpyxl documentation gives some very sparse indications:
"Handling timedelta values
Excel users can use number formats resembling [h]:mm:ss or [mm]:ss to display time interval durations, which openpyxl considers to be equivalent to timedeltas in Python. openpyxl recognizes these number formats when reading XLSX files and returns datetime.timedelta values for the corresponding cells.
When writing timedelta values from worksheet cells to file, openpyxl uses the [h]:mm:ss number format for these cells."
How can I accomplish my goal of writing Excel-interpretable time (durations) in the format [hh]:mm:ss?
To achieve this, I am not tied to the current method of creating a datetime.timedelta object via datetime.now(). If it's possible to achieve this objective by using or converting to a datetime object (or similar) and formatting it, I would like to know how.
NB: I am using Python 2 with its latest pandas version 0.24.2 (and the openpyxl version installed with pip is the latest, 2.6.4). I hope that is not a problem as I cannot upgrade to Python 3 and later versions of pandas right now.

It was some time ago I worked on this, but the below solution worked for me in Python 2.7.18 using Pandas 0.24.2 and Openpyxl 2.6.4 from PyPi.
As stated in the question comments, later versions may solve this more elegantly (and there might furthermore be a more elegant way to do it in the old versions I use):
If writing to a new Excel file:
writer = pd.ExcelWriter('./sheet.xlsx', engine='openpyxl')
# Writes dataFrame to Writer Sheet, including column header
df.to_excel(writer, sheet_name='Sheet1', index=False)
# Selects which Sheet in Writer to manipulate
sheet = writer.sheets['Sheet1']
# Formats specific cell with desired duration format
cell = 'A2'
sheet[cell].number_format = '[hh]:mm:ss'
# Writes to file on disk
writer.save()
writer.close()
If writing to an existing Excel file:
from openpyxl import load_workbook
file = './sheet.xlsx'
writer = pd.ExcelWriter(file, engine='openpyxl')
# Loads content from existing Sheet in file
workbook = load_workbook(file)
writer.book = workbook #writer.book potentially needs to be explicitly stated like this
writer.sheets = {sheet.title: sheet for sheet in workbook.worksheets}
sheet = writer.sheets['Sheet1']
# Writes dataFrame to Writer Sheet, below the last existing row, excluding column header
df.to_excel(writer, sheet_name='Sheet1', startrow=sheet.max_row, index=False, header=False)
# Updates the row count again, and formats specific cell with desired duration format
# (the last cell in column A)
cell = 'A' + str(sheet.max_row)
sheet[cell].number_format = '[hh]:mm:ss'
# Writes to file on disk
writer.save()
writer.close()
The above code can of course easily be abstracted into one function handling writing to both new files and existing files, and extended to managing any number of different sheets or columns, as needed.
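For readers on Python 3 with a recent pandas, where writer.save() is no longer available, the "new file" case might be sketched as follows. This is not the code from the answer: the function name write_durations and its parameters are assumptions for illustration, and the sketch relies on ExcelWriter being used as a context manager so that the file is saved on exit.

```python
import pandas as pd


def write_durations(df, path, column_letter="A", sheet_name="Sheet1"):
    """Write df to a new workbook and display one column as [hh]:mm:ss."""
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)
        sheet = writer.sheets[sheet_name]
        # Row 1 holds the header, so formatting starts at row 2.
        for cell in sheet[column_letter][1:]:
            cell.number_format = "[hh]:mm:ss"
```

For example, write_durations(pd.DataFrame({"Elapsed": [elapsed_time]}), "./sheet.xlsx") formats every data cell in column A instead of a single hard-coded cell such as A2.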

Python save Excel .xlsx to CSV/XML and also save styling Information for conversion back into .xlsx

My Python program converts Excel files (.xlsx) into a CSV file using pandas' read_excel and to_csv functions, and at some point in the future the CSV is converted back into an Excel file. Maintaining the data is fine, but of course all of the formatting and styling is gone. So I could use some help capturing that information so it can be reapplied after converting the CSV back into an Excel file.
import pandas as pd
import xlsxwriter
EXCEL_PATH_FROM = r'C:\absolute\path\to\excel.xlsx'
EXCEL_PATH_TO = r'C:\absolute\path\to\other\excel.xlsx'
CSV_PATH = r'C:\absolute\path\to\csv.csv'
# read excel and convert to csv
def saveData():
    read_excel = pd.read_excel(EXCEL_PATH_FROM)
    print("writing csv...")
    read_excel.to_csv(CSV_PATH, index=None, header=True)
# get csv data and import that data into an excel file
def createFromData():
    csv = pd.read_csv(CSV_PATH)
    # the context manager saves the workbook on exit
    with pd.ExcelWriter(EXCEL_PATH_TO, engine='xlsxwriter') as excel:
        csv.to_excel(excel, index=None)
Some ideas I had were to save the Excel file as XML and insert the format and style information as attributes or something, or to create both a CSV and an XML file from the Excel file (one for data and one for styling). One problem I have is figuring out how to access that information.
Are there currently any packages that support Python 3 (currently using 3.8) that could help simplify this process? I dug through openpyxl's documentation, and it has some stylesheet classes that I don't think are meant to be used directly; I couldn't figure out how to use them.

Python: load excel header without loading remaining data

I am working with very big Excel files, which take a long time to load with Pandas in Python. Before processing the data, the user has to select quite a few options related to the data, which only require the names of the columns in each dataset. It is very inconvenient for the user to have to wait, sometimes for minutes, until the data is loaded before being able to select the necessary options, and then to let the program do the actual processing for another few minutes.
So, my question is: is there a way to load only the data header from an Excel file with Python? In a way I think of it as an alternate version to the "skiprows" parameter in the read_excel Pandas function, where instead of skipping rows in the beginning of the data, I would like to skip rows at the end of the data. I want to emphasize that my goal is to reduce the time Python takes to load the files. I also know there are ways to do this with csv files, but unfortunately it didn't help me.
Thank you for the help!

You can try to use the sxl module (https://pypi.org/project/sxl/). Here is the code I tried for a large excel file (around 75,000 rows) and the timing results:
from datetime import datetime
startTime = datetime.now()
import pandas as pd
import sxl
startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx')
print("Time taken to load whole data with pandas read excel is {}".format((datetime.now() - startTime)))
startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx', nrows = 5)
print("Time taken with top 5 rows with pandas read excel is {}".format((datetime.now() - startTime)))
startTime = datetime.now()
wb = sxl.Workbook('\\Big_Excel.xlsx')
ws = wb.sheets[1]
data = ws.head(5)
print("Time taken to load top 5 rows using sxl is {}".format((datetime.now() - startTime)))
Pandas read_excel loads the whole data into memory, so there is not much of a difference in timing. Here are the outputs from the above:
Time taken to load whole data with pandas read excel is 0:00:49.174538
Time taken with top 5 rows with pandas read excel is 0:00:44.478523
Time taken to load top 5 rows using sxl is 0:00:00.671717
I hope this helps!!

You can use either the 'skipfooter' parameter or the 'nrows' parameter, with both .xlsx and .csv files. However, the two cannot be used together.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, skipfooter = 99999)
This skips the last 99999 rows of the sheet and loads only the remaining rows from the top, including the header.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, nrows= 5)
This loads only the first 5 rows, plus the header.
Also refer to this Stack Overflow question.

from dask import dataframe as dd
df = dd.read_csv("filename")
Trust me, it's fast. I am reading an 800 MB file.

Processing large XLSX file in python

I have a large xlsx Excel file (56mb, 550k rows) from which I tried to read the first 10 rows. I tried using xlrd, openpyxl, and pyexcel-xlsx, but they always take more than 35 mins because they load the whole file into memory.
I unzipped the Excel file and found out that the xml which contains the data I need is 800mb unzipped.
When you load the same file in Excel it takes 30 seconds. I'm wondering why it takes that much time in Python?

Use openpyxl's read-only mode to do this.
You'll be able to work with the relevant worksheet instantly.
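A minimal sketch of what that could look like for the "first 10 rows" case in the question. The function name read_first_rows and its parameters are made up for illustration; load_workbook(..., read_only=True) and iter_rows(values_only=True) are openpyxl calls.

```python
from itertools import islice

from openpyxl import load_workbook


def read_first_rows(path, sheet_name=None, n_rows=10):
    """Return the first n_rows of a worksheet as tuples of cell values."""
    wb = load_workbook(path, read_only=True)
    try:
        ws = wb[sheet_name] if sheet_name else wb.worksheets[0]
        # In read-only mode rows are streamed, so stopping early is cheap.
        return list(islice(ws.iter_rows(values_only=True), n_rows))
    finally:
        wb.close()  # read-only workbooks keep the file handle open
```

Calling read_first_rows(path, n_rows=1) is also one way to fetch only the header row of a large sheet.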

Here it is, I found a solution. The fastest way to read an xlsx sheet.
A 56mb file with over 500k rows and 4 sheets took 6s to process.
import zipfile
from bs4 import BeautifulSoup
paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'
file = zipfile.ZipFile(filename, "r")
# collect [sheet name, path inside the archive] pairs from the workbook manifest
for name in file.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(file.read(name), 'html.parser')
        sheets = data.find_all('sheet')
        for sheet in sheets:
            paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])
# stream the raw XML of the wanted sheet line by line
for path in paths:
    if path[0] == mySheet:
        with file.open(path[1]) as reader:
            for row in reader:
                print(row)  ## do whatever you want with your data
Enjoy and happy coding.
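One caveat with the snippet above: it builds the worksheet path from the sheetId attribute, which does not always match the file name inside the archive. A more defensive lookup, sketched here with only the standard library, resolves each sheet through xl/_rels/workbook.xml.rels. The function name sheet_paths is made up for illustration, and the namespaces assume a regular (transitional) .xlsx file.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def sheet_paths(xlsx_path):
    """Map each worksheet name to the path of its XML part in the archive."""
    with zipfile.ZipFile(xlsx_path) as archive:
        rels = ET.fromstring(archive.read("xl/_rels/workbook.xml.rels"))
        targets = {
            rel.get("Id"): rel.get("Target")
            for rel in rels.iter(f"{{{PKG_REL_NS}}}Relationship")
        }
        workbook = ET.fromstring(archive.read("xl/workbook.xml"))
        paths = {}
        for sheet in workbook.iter(f"{{{MAIN_NS}}}sheet"):
            target = targets[sheet.get(f"{{{DOC_REL_NS}}}id")]
            # Targets are usually relative to xl/, occasionally absolute.
            if target.startswith("/"):
                path = target.lstrip("/")
            else:
                path = posixpath.normpath(posixpath.join("xl", target))
            paths[sheet.get("name")] = path
        return paths
```

The returned path can then be passed to ZipFile.open() exactly as path[1] is used above.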

The load time you're experiencing is directly related to the io speed of your memory chip.
When pandas loads an Excel file, it makes several copies of the data in memory, because the file cannot be read directly into a table: an .xlsx file is a compressed archive of XML that has to be unpacked and parsed first.
As a workaround, I'd suggest:
load your Excel file on a virtual machine with specialized hardware (for example, one of the instance types AWS has to offer)
save your file in csv format for local use.
For even better performance, use an optimized storage format such as Parquet.
For a deeper dive, check out this article I've written: Loading Ridiculously Large Excel Files in Python
