Processing large XLSX file in Python

I have a large xlsx Excel file (56 MB, 550k rows) from which I tried to read just the first 10 rows. I tried xlrd, openpyxl, and pyexcel-xlsx, but they always take more than 35 minutes because they load the whole file into memory.
I unzipped the Excel file and found that the XML which contains the data I need is 800 MB unzipped.
Loading the same file in Excel takes about 30 seconds, so I'm wondering why it takes so much longer in Python?

Use openpyxl's read-only mode to do this.
You'll be able to work with the relevant worksheet instantly.
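For illustration, here is a minimal sketch of read-only mode (the file name, sheet name, and 10-row limit are assumptions based on the question); rows are streamed lazily, so only the cells you iterate over are loaded:
from openpyxl import load_workbook

wb = load_workbook('xlfile.xlsx', read_only=True)
ws = wb['Sheet Name']  # or wb.active

for row in ws.iter_rows(min_row=1, max_row=10, values_only=True):
    print(row)

wb.close()  # read-only workbooks keep the file handle open, so close explicitly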

Here it is; I found a solution. This is the fastest way I know to read an xlsx sheet: a 56 MB file with over 500k rows and 4 sheets took about 6 seconds to process.
import zipfile

from bs4 import BeautifulSoup

paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'

file = zipfile.ZipFile(filename, "r")

# Map sheet names to their worksheet XML paths inside the archive
for name in file.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(file.read(name), 'html.parser')
        sheets = data.find_all('sheet')
        for sheet in sheets:
            paths.append([sheet.get('name'),
                          'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])

# Stream the raw XML of the sheet we care about, line by line
for path in paths:
    if path[0] == mySheet:
        with file.open(path[1]) as reader:  # the with block closes the stream for us
            for row in reader:
                print(row)  # each line is raw XML -- do whatever you want with it
Enjoy and happy coding.
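The loop above prints raw XML lines rather than cell values. If you need actual values without loading the whole sheet, one possible follow-up (a sketch, not part of the original solution) is to stream the worksheet through a pull parser; note that shared strings are not resolved here, so text cells come back as indexes into sharedStrings.xml, and the sheet path is an assumption:
import zipfile
import xml.etree.ElementTree as ET

NS = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

with zipfile.ZipFile('xlfile.xlsx') as zf:
    with zf.open('xl/worksheets/sheet1.xml') as stream:
        for _, elem in ET.iterparse(stream):
            if elem.tag == NS + 'row':
                values = [v.text for v in elem.iter(NS + 'v')]
                print(values)  # do whatever you want with the row values
                elem.clear()   # free memory as rows are consumed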

The load time you're experiencing is directly related to the I/O and memory speed of your hardware.
When pandas loads an Excel file, it makes several in-memory copies of the data, since the file structure can't be consumed directly (an xlsx file is a compressed archive of XML that has to be unpacked and parsed first).
As a workaround, I'd suggest:
load your Excel file through a virtual machine with specialized hardware (here's what AWS has to offer)
save your file to CSV format for local use
for even better performance, use an optimized on-disk format such as Parquet
For a deeper dive, check out this article I've written: Loading Ridiculously Large Excel Files in Python
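As a rough sketch of the convert-once idea (file names are placeholders; to_parquet() requires pyarrow or fastparquet to be installed):
import pandas as pd

df = pd.read_excel('big_file.xlsx')       # slow, one-time cost

df.to_csv('big_file.csv', index=False)    # portable plain-text copy
df.to_parquet('big_file.parquet')         # columnar format, much faster to reload

df_fast = pd.read_parquet('big_file.parquet')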

Related

How to improve my append and read Excel for loop in Python

Hope you can help me.
I have a folder with several .xlsx files of similar structure (note that some of the files might be bigger than 50 MB). I want to combine them all and (eventually) send them to a database. But before that, I need to improve the performance of this block of code, because sometimes it takes a lot of time to process all those files.
The code in question is this:
df_list = []
for file in location:
    df_list.append(pd.read_excel(file, header=0, engine='openpyxl'))
df_concat = pd.concat(df_list)
Any suggestions?
Somewhere I read that converting Excel files to CSV might improve the performance, but should I do that before appending the files or after everything is concatenated?
And considering df_list is a list, can I do that conversion?
I've found a solution with xlsx2csv
import glob
import re
import subprocess

import pandas as pd

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')

for xlsx in list_of_xlsx:
    # Extract the file name (group 2 of the regex)
    filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
    # Set up the call for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # On Windows use shell=True
    except Exception:
        print('Failed with {}'.format(xlsx))

outputcsv = './data/bigcsv.csv'  # filepath + filename of the output csv
listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # make sure there are 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))

bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)
I tried to make this work for me but had no success. Maybe you can get it working for you? Or does anyone have any other ideas?
Reading Excel files with pandas is quite slow, as you stated; you should have a look at this answer. It basically uses a VBScript, run before the Python script, to convert the Excel file to a CSV file, which is much faster for the Python script to read.
To be more specific and answer the second part of your question: you should convert the Excel files to CSV before loading them with pandas. The read_excel function is the slow part.
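A minimal sketch of that workflow, assuming each workbook has a single sheet and the directory paths below (both are placeholders): convert each file to CSV once, then every later run only touches the CSVs.
import glob
import os

import pandas as pd

xlsx_dir = './data/xlsx/'
csv_dir = './data/csv/'

# One-time conversion: skip files that already have a CSV counterpart
for xlsx in glob.glob(xlsx_dir + '*.xlsx'):
    csv = os.path.join(csv_dir, os.path.splitext(os.path.basename(xlsx))[0] + '.csv')
    if not os.path.exists(csv):
        pd.read_excel(xlsx, header=0).to_csv(csv, index=False)

# Fast path: load the CSVs and concatenate them
df_concat = pd.concat(
    (pd.read_csv(f) for f in glob.glob(csv_dir + '*.csv')),
    ignore_index=True,
)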

Accessing Excel files directly from RAM using Excel Writer

In the documentation for pd.ExcelWriter we see the following code snippet:
You can store Excel file in RAM:
import io

import pandas as pd

df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    df.to_excel(writer)
My question is: how can we access the Excel file back from the buffer? I want the base64-encoded version of the same Excel file without saving it to disk, which is why I am thinking of keeping it in RAM. Can someone please help with this?
Thanks for your time
Solution: Was able to access the file using buffer.getvalue().
In the snippet you provided, the Excel file has been written to the buffer the same way as if it would have been stored on disk.
Therefore you can read it back in a similar way as if you were reading from file:
pd.read_excel(buffer.getvalue())
More on how BytesIO behaves:
Create an excel file from BytesIO using python
Difference between `open` and `io.BytesIO` in binary streams
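A minimal sketch of both steps (reading the workbook back and getting the base64 string the asker wanted), reusing the names from the snippet above:
import base64
import io

import pandas as pd

df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    df.to_excel(writer)

# Read the workbook back straight from the in-memory buffer
buffer.seek(0)
df_roundtrip = pd.read_excel(buffer, index_col=0)

# Base64-encode the raw xlsx bytes without ever touching the disk
b64_xlsx = base64.b64encode(buffer.getvalue()).decode('ascii')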

Is there any way of accelerating file read/write in Pandas?

I am having trouble reading and writing moderately sized Excel files in pandas. I have 5 files, each around 300 MB. I need to combine these files into one, do some processing, and then save the result (preferably as Excel):
import pandas as pd
f1 = pd.read_excel('File_1.xlsx')
f2 = pd.read_excel('File_2.xlsx')
f3 = pd.read_excel('File_3.xlsx')
f4 = pd.read_excel('File_4.xlsx')
f5 = pd.read_excel('File_5.xlsx')
FULL = pd.concat([f1,f2,f3,f4,f5], axis=0, ignore_index=True, sort=False)
FULL.to_excel('filename.xlsx', index=False)
But unfortunately the read takes way too much time (around 15 minutes or so), and the write used up 100% of memory (on my 16 GB RAM PC) and was taking so long that I was forced to interrupt the program.
Is there any way I could accelerate both read/write?
This post defines a nice function, append_df_to_excel().
You can use that function to read the files one by one and append their content to the final excel file. This will save you RAM since you are not going to keep all the files in memory at once.
files = ['File_1.xlsx', 'File_2.xlsx', ...]
for file in files:
    df = pd.read_excel(file)
    append_df_to_excel('filename.xlsx', df)
Depending on your input files, you may need to pass some extra arguments to the function. Check the linked post for extra info.
Note that you could use df.to_csv() with mode='a' to append to a csv file. Most of the time you can swap excel files for csv easily. If this is also your case, I would suggest this method instead of the custom function.
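A minimal sketch of the CSV append approach, assuming all files share the same columns (file names are placeholders):
import pandas as pd

files = ['File_1.xlsx', 'File_2.xlsx', 'File_3.xlsx']
output = 'combined.csv'

for i, file in enumerate(files):
    df = pd.read_excel(file)
    # Write the header only for the first file, then append without it
    df.to_csv(output, mode='w' if i == 0 else 'a',
              header=(i == 0), index=False)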
Not ideal (and dependent on use case), but I've always found it much quicker to load up the XLSX (in Excel) and save it as a CSV file, just because I tend to do multiple reads on the data, and in the long run the time spent waiting for repeated XLSX loads outweighs the one-time cost of converting the file.

Faster way to read Excel files to pandas dataframe

I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd

def OTT_read(xl, site_name):
    df = pd.read_excel(xl.io, site_name, skiprows=2, parse_dates=0, index_col=0,
                       usecols=[0, 1, 2], header=None,
                       names=['date_time', '%s_depth' % site_name, '%s_temp' % site_name])
    return df

def make_OTT_df(FILEDIR, OTT_FILE):
    xl = pd.ExcelFile(FILEDIR + OTT_FILE)
    site_names = xl.sheet_names
    df_list = [OTT_read(xl, site_name) for site_name in site_names]
    return site_names, df_list

FILEDIR = 'c:/downloads/'
OTT_FILE = 'OTT_Data_All_stations.xlsx'
site_names_OTT, df_list_OTT = make_OTT_df(FILEDIR, OTT_FILE)
As others have suggested, CSV reading is faster. So if you are on Windows and have Excel, you could call a VBScript to convert the Excel file to CSV and then read the CSV. I tried the script below and it took about 30 seconds.
import pandas as pd
from subprocess import call

# create a list with the sheet numbers you want to process
sheets = map(str, range(1, 6))

# convert each sheet to csv and then read it using read_csv
df = {}
excel = 'C:\\Users\\rsignell\\OTT_Data_All_stations.xlsx'
for sheet in sheets:
    csv = 'C:\\Users\\rsignell\\test' + sheet + '.csv'
    call(['cscript.exe', 'C:\\Users\\rsignell\\ExcelToCsv.vbs', excel, csv, sheet])
    df[sheet] = pd.read_csv(csv)
Here's a little snippet of python to create the ExcelToCsv.vbs script:
# write the VBScript to a file
vbscript = """if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate
oBook.SaveAs dest_file, csv_format
oBook.Close False
oExcel.Quit
"""

# write it out in text mode (the original used .encode('utf-8') with a
# text-mode file handle, which fails on Python 3)
with open('ExcelToCsv.vbs', 'w') as f:
    f.write(vbscript)
This answer benefited from Convert XLS to CSV on command line and csv & xlsx files import to pandas data frame: speed issue
I used xlsx2csv to convert the Excel file to CSV in memory, and this helped cut the read time roughly in half.
from xlsx2csv import Xlsx2csv
from io import StringIO
import pandas as pd

def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df
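Usage is then just a call to the helper (the file and sheet names here are hypothetical):
df = read_excel('MyBig.xlsx', 'Sheet1')
print(df.head())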
If you have fewer than 65536 rows (in each sheet) you can try xls (instead of xlsx). In my experience xls is faster than xlsx. It is difficult to compare to csv because it depends on the number of sheets.
Although this is not an ideal solution (xls is an old, proprietary binary format), I have found it useful if you are working with a lot of sheets, internal formulas with values that are often updated, or if for whatever reason you would really like to keep the Excel multi-sheet functionality (instead of separate CSV files).
In my experience, pandas read_excel() works fine with Excel files that have multiple sheets. As suggested in Using Pandas to read multiple worksheets, if you set sheet_name to None it will automatically read every sheet into a DataFrame and output a dictionary of DataFrames keyed by sheet name.
But the reason it takes time is probably the text parsing in your code. A 14 MB Excel file with 5 sheets is not that much. I have a 20.1 MB Excel file with 46 sheets, each with more than 6000 rows and 17 columns, and using read_excel it took the time shown below:
import datetime as dt
import time
import pandas as pd

t0 = time.time()

def parse(datestr):
    y, m, d = datestr.split("/")
    return dt.date(int(y), int(m), int(d))

data = pd.read_excel("DATA (1).xlsx", sheet_name=None, encoding="utf-8",
                     skiprows=1, header=0, parse_dates=[1], date_parser=parse)
t1 = time.time()
print(t1 - t0)
## result: 37.54169297218323 seconds
In code above data is a dictionary of 46 Dataframes.
As others suggested, using read_csv() can help because reading a .csv file is faster. But consider that because .xlsx files use compression, .csv files might be larger and hence slower to read. But if you want to convert your file to comma-separated values using Python (VB code is offered by Rich Signell), you can use: Convert xlsx to csv
I know this is old, but in case anyone else is looking for an answer that doesn't involve VB: pandas read_csv() is faster, but you don't need a VB script to get a CSV file.
Open your Excel file and save as *.csv (comma separated value) format.
Under tools you can select Web Options and under the Encoding tab you can change the encoding to whatever works for your data. I ended up using Windows, Western European because Windows UTF encoding is "special" but there's lots of ways to accomplish the same thing. Then use the encoding argument in pd.read_csv() to specify your encoding.
Encoding options are listed here
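For example (the file name is a placeholder; cp1252 is the code page behind Excel's "Windows, Western European" option):
import pandas as pd

df = pd.read_csv('my_export.csv', encoding='cp1252')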
I encourage you to do the comparison yourself and see which approach is appropriate in your situation.
For instance, if you are processing a lot of XLSX files and are only going to ever read each one once, you may not want to worry about the CSV conversion. However, if you are going to read the CSVs over and over again, then I would highly recommend saving each of the worksheets in the workbook to a csv once, then read them repeatedly using pd.read_csv().
Below is a simple script that will let you compare Importing XLSX Directly, Converting XLSX to CSV in memory, and Importing CSV. It is based on Jing Xue's answer.
Spoiler alert: If you are going to read the file(s) multiple times, it's going to be faster to convert the XLSX to CSV.
I did some testing with some files I'm working on, and here are my results:
5,874 KB xlsx file (29,415 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]: 0:00:31.75
Elapsed time for [Convert XLSX to CSV in mem]: 0:00:22.19
Elapsed time for [Import CSV file]: 0:00:00.21
********************
202,782 KB xlsx file (990,832 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]: 0:17:04.31
Elapsed time for [Convert XLSX to CSV in mem]: 0:12:11.74
Elapsed time for [Import CSV file]: 0:00:07.11
YES! The 202 MB file really did take only 7 seconds compared to 17 minutes for the XLSX!!!
If you're ready to set up your own test, just open your XLSX in Excel and save one of the worksheets to CSV. For a final solution, you would obviously need to loop through the worksheets to process each one.
You will also need to pip install rich pandas xlsx2csv.
from rich import print
import pandas as pd
from datetime import datetime
from xlsx2csv import Xlsx2csv
from io import StringIO

def timer(name, startTime=None):
    if startTime:
        print(f"Timer: Elapsed time for [{name}]: {datetime.now() - startTime}")
    else:
        startTime = datetime.now()
        print(f"Timer: Starting [{name}] at {startTime}")
    return startTime

def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df

xlsxFileName = "MyBig.xlsx"
sheetName = "Sheet1"
csvFileName = "MyBig.csv"

startTime = timer(name="Import XLSX with Pandas")
df = pd.read_excel(xlsxFileName, sheet_name=sheetName)
timer("Import XLSX with Pandas", startTime)

startTime = timer(name="Convert XLSX to CSV first")
df = read_excel(path=xlsxFileName, sheet_name=sheetName)
timer("Convert XLSX to CSV first", startTime)

startTime = timer(name="Import CSV")
df = pd.read_csv(csvFileName)
timer("Import CSV", startTime)
There's no reason to open Excel if you're willing to deal with the slow conversion once.
Read the data into a dataframe with pd.read_excel()
Dump it into a csv right away with pd.to_csv()
This avoids both Excel and Windows-specific calls. In my case the one-time hit was worth the hassle. I got a ☕.

Converting a folder of Excel files into CSV files/Merge Excel Workbooks

I have a folder with a large number of Excel workbooks. Is there a way to convert every file in this folder into a CSV file using Python's xlrd, xlutils, and XlsxWriter?
I would like the newly converted CSV files to have the extension '_convert.csv'.
OTHERWISE...
Is there a way to merge all the Excel workbooks in the folder to create one large file?
I've been searching for ways to do both, but nothing has worked...
Using pywin32, this will find all the .xlsx files in the indicated directory and open and resave them as .csv. It is relatively easy to figure out the right commands with pywin32...just record an Excel macro and perform the open/save manually, then look at the resulting macro.
import os
import glob
import win32com.client

xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
for f in glob.glob('tmp/*.xlsx'):
    fullname = os.path.abspath(f)
    xl.Workbooks.Open(fullname)
    xl.ActiveWorkbook.SaveAs(Filename=fullname.replace('.xlsx', '.csv'),
                             FileFormat=win32com.client.constants.xlCSVMSDOS,
                             CreateBackup=False)
    xl.ActiveWorkbook.Close(SaveChanges=False)
I will give it a try with my library pyexcel:
from pyexcel import Book, BookWriter
import glob
import os

for f in glob.glob("your_directory/*.xlsx"):
    fullname = os.path.abspath(f)
    converted_filename = fullname.replace(".xlsx", "_converted.csv")
    book = Book(f)
    converted_csvs = BookWriter(converted_filename)
    converted_csvs.write_book_reader(book)
    converted_csvs.close()
If you have an xlsx file with more than one sheet, you will get more than one csv file generated. The naming convention is: "file_converted_%s.csv" % your_sheet_name. The script will save all converted csv files in the same directory where you had the xlsx files.
In addition, if you want to merge all in one, it is super easy as well.
from pyexcel.cookbook import merge_all_to_a_book
import glob
merge_all_to_a_book(glob.glob("your_directory/*.xlsx"), "output.xlsx")
If you want to do more, please read the tutorial
Have a look at OpenOffice's Python library; I suspect OpenOffice can handle MS document files.
Python has no native support for Excel files.
Sure. Iterate over your files using something like glob and feed them into one of the modules you mention. With xlrd, you'd use open_workbook to open each file by name. That will give you back a Book object. You'll then want to have nested loops that iterate over each Sheet object in the Book, each row in the Sheet, and each Cell in the Row. If your rows aren't too wide, you can append each Cell in a Row into a Python list and then feed that list to the writerow method of a csv.writer object.
Since it's a high-level question, this answer glosses over some specifics like how to call xlrd.open_workbook and how to create a csv.writer. Hopefully googling for examples on those specific points will get you where you need to go.
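For what it's worth, here is a minimal sketch of that nested-loop approach (the glob pattern and the '_convert.csv' naming follow the question, but the rest is an assumption; also note that xlrd 2.x dropped .xlsx support, so this needs an older xlrd or .xls inputs):
import csv
import glob

import xlrd

for path in glob.glob('*.xls*'):
    book = xlrd.open_workbook(path)
    out_path = path.rsplit('.', 1)[0] + '_convert.csv'
    with open(out_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        for sheet in book.sheets():
            for row_idx in range(sheet.nrows):
                # row_values() returns a plain list of cell values for the row
                writer.writerow(sheet.row_values(row_idx))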
You can use this function to read the data from each file
import xlrd

def getXLData(Filename, min_row_len=1, get_datemode=False, sheetnum=0):
    Data = []
    book = xlrd.open_workbook(Filename)
    sheet = book.sheets()[sheetnum]
    rowcount = 0
    while rowcount < sheet.nrows:
        row = sheet.row_values(rowcount)
        if len(row) >= min_row_len:
            Data.append(row)
        rowcount += 1
    if get_datemode:
        return Data, book.datemode
    return Data
and this function to write the data after you combine the lists together
import csv

def writeCSVFile(filename, data, headers=[]):
    if headers:
        temp = [headers]
        temp.extend(data)
        data = temp
    # text mode with newline="" for Python 3 (the original "wb" was Python 2 style)
    f = open(filename, "w", newline="")
    writer = csv.writer(f)
    writer.writerows(data)
    f.close()
Keep in mind you may have to re-format the data, especially if there are dates or integers in the Excel files since they're stored as floating point numbers.
Edited to add code calling the above functions:
import glob

filelist = glob.glob("*.xls*")
alldata = []
headers = []
for filename in filelist:
    data = getXLData(filename)
    headers = data.pop(0)  # omit this line if files do not have a header row
    alldata.extend(data)
writeCSVFile("Output.csv", alldata, headers)
