Importing large Excel file to Python

I am trying to import an Excel (.xlsx) file into the Spyder IDE. Everything works fine when I import small files using openpyxl, but for this particular file (around 30 MB and 800k rows) my system crashes.
Following is the part of code that imports:
from openpyxl import load_workbook
wb = load_workbook(filename=path + 'cleaned_noTC_s_PERNO_Date.xlsx', data_only=True)
Can anyone please let me know what is wrong with this method and what else can I use to import the stated file?

Try using the excellent pandas library; it has very robust Excel reading functionality and is pretty good with memory in my experience:
import pandas as pd
xl = pd.read_excel("file.xlsx")

It sounds like you're running out of memory. If you don't need to edit the file then you can use read_only mode, otherwise you'll need more memory.

execution process in Jupyter Notebook

I have some questions about how Jupyter Notebook executes lines of Python code. (Sorry for not being able to upload a code image; my reputation level is too low.)
There is a CSV file named 'train.csv', and I load it into a variable named 'titanic_df':
import pandas as pd
titanic_df=pd.read_csv('train.csv')
print(titanic_df)
This runs fine when executed. However, my question is:
import pandas as pd
# titanic_df=pd.read_csv('train.csv')
print(titanic_df)
This also runs, contrary to my intention. Even though I commented out the CSV-reading step, print(titanic_df) still prints the data.
When I run the same code in plain Python installed on my computer, the second snippet fails, so I guess Jupyter Notebook executes code differently. How does Jupyter Notebook work?
Jupyter can be somewhat confusing at first, but I will explain what's going on here.
Here is the sequence of events when the following code is run in Jupyter:
import pandas as pd
titanic_df=pd.read_csv('train.csv')
print(titanic_df)
The first line imports the pandas module and loads it into memory, making it available for use. The second line calls the pd.read_csv function from the pandas module and binds the result, a DataFrame, to the name titanic_df.
Both the pandas module and the titanic_df variable now live in the kernel's memory, and they stay there until the kernel is restarted.
Therefore, to answer this question: when the pd.read_csv line of code is commented out like so:
# titanic_df=pd.read_csv('train.csv')
nothing is removed from memory. Pandas is still loaded, and, more importantly, the titanic_df variable created by the earlier run still exists in the kernel's namespace. The only thing that changes is that the commented line is not executed again when you re-run the cell; everything already in memory remains available, which is why print(titanic_df) still shows the data.
Even if the first line of code were also commented out, the pandas module and the titanic_df variable would still remain in memory and ready to use. But if the kernel is restarted, neither would be reloaded until the code that creates them is run again.
Also, be aware of what restarting the kernel does. If you were to comment out the first line of code but not the second, and then select "Restart kernel and run all cells" in Jupyter, two things would happen: the pandas module would not be loaded, and the pd.read_csv line would raise an error, because your code calls a pandas function but the pandas module was never imported.
Note that a saved Jupyter Notebook file does not re-run its cells when it is reopened: the outputs from previous runs are displayed, but the kernel starts empty, and the cells must be executed again before their variables exist.

Writing pandas data to Excel with efficient memory usage [duplicate]

I am successfully writing dataframes to Excel using df.to_excel(). Unfortunately, this is slow and consumes gobs of memory. The larger the dataset, the more memory it consumes, until (with the largest datasets I need to deal with) the server starves for resources.
I found that using the df.to_csv() method instead offers the convenient chunksize=nnnn argument. This is far faster and consumes almost no extra memory. Wonderful! I'll just write initially to .csv, then convert the .csv to .xlsx in another process. I tried doing it with openpyxl, but every solution I found iterated through the csv data one row at a time, appending to a Workbook sheet, e.g.
import csv
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
with open(temporary_filepath, 'r') as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save(output_filepath)  # openpyxl's save() needs a target path
This works, but when I watch my resource monitor, it consumes just as much memory and is just as slow (I now assume the original df.to_excel() was doing the same thing internally). So this approach didn't get me out of the woods after all.
I had assumed I could do something that would be the equivalent of File | Save As, but in Python, e.g. read the whole csv into an openpyxl Workbook and save it to a file all in one go, without iterating, but either this is not possible or I can't find the documentation on it.
Given a very large Pandas dataframe and a requirement to output .xlsx (not .csv), what is the best approach for low memory consumption? Can it be done efficiently with Pandas or Openpyxl, or is there a better tool for the job?
Update: Looks like pyexcel has a Save As method that might do the trick. I would prefer not to add yet another spreadsheet lib to the stack if possible, but will do so if there is no equivalent in pandas or openpyxl. Has anyone used that successfully?
You could probably use the pyexcelerate library - https://github.com/kz26/PyExcelerate. They have posted benchmarks on their GitHub repo:
from pyexcelerate import Workbook
values = [df.columns] + list(df.values)  # header row followed by data rows
wb = Workbook()
wb.new_sheet('data_sheet_name', data=values)
wb.save('data.xlsx')
The pyexcelerate response is exactly what I asked about, so I accepted that answer, but just wanted to post an update that we found an alternate solution that's possibly even easier. Sharing here in case it's useful.
Pandas now prefers xlsxwriter over openpyxl: if it is installed and you do not specify the engine, xlsxwriter will be used by default (or of course you can specify it explicitly). In my experiments, xlsxwriter was 4x more memory efficient than openpyxl at the task of writing to Excel. This is not an infinitely scalable solution - it's still conceivable that one could receive a dataset so large that it overwhelms memory even with this optimization - but it's extremely easy: just pip install xlsxwriter and you get a 4x reduction in memory use when calling df.to_excel(), with no code changes (in my case).

Possibility of Corruption: Reading Excel Files with Pandas

We are in the design phase for a product. The idea is that the code will read a list of values from Excel into SQL.
The requirements are as follows:
Workbook may be accessed by multiple users outside of our program
Workbook must remain accessible (i.e. not be corrupted) should something bad occur while our program is running
Program will be executed when no users are in the file
Right now we are considering using pandas in a simple manner as follows:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('File.xlsx', sheet_name='Sheet1')
"""Some code to write df in to SQL"""
If this code goes offline with the Excel still open, is there ANY possibility that the file will remain locked somewhere in my program or be corrupted?
To clarify, we envision something catastrophic like the server crashing or losing power.
I searched around but couldn't find a similar question; please redirect me if necessary.
I also read through Pandas read_excel documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
With the code you provide, from my reading of the pandas and xlrd code, the given file will only be opened in read mode. That should mean, to the best of my knowledge, that there is no more risk in what you're doing than in reading the file any other way - and you have to read it to use it, after all.
If this doesn't sufficiently reassure you, you could minimize the time the file is open and, more importantly, not expose your file to external code, by handing pandas a BytesIO object instead of a path:
import io
import pandas as pd
# Read the file fully, then close it; pandas works on the in-memory copy.
with open('File.xlsx', 'rb') as f:
    data = io.BytesIO(f.read())
df = pd.read_excel(data, sheet_name='Sheet1')
# etc
This way your file will only be open for the time it takes to read it into memory, and pandas and xlrd will only be working with a copy of the data.

How can I open (pop up) an Excel file for the user using openpyxl in Python?

I am using the openpyxl library to write data into an Excel file, and I want this file to be opened automatically (to pop up) for the user when it has finished writing. I've looked around but found nothing beyond saving the file, which I have already done.
I think one possibility would be os.startfile (note that it is available on Windows only):
import os
os.startfile('your_excel_file.xlsx')

How to open an Excel instance in Python on a Mac?

I think this question has been asked before, but it's not clear: in the original question the user referred to excel.exe, which is a Windows executable and does not exist on Mac.
I need to open a new Excel instance from Python on a Mac. Which module should I import?
I'm a newbie; I have finished learning the Python language, but I have trouble understanding the documentation.
If all you need to do is launch Excel, the best way is to use LaunchServices.
If you have PyObjC (which you do if you're using the Python that Apple pre-installs on 10.6 and later; otherwise, you may have to install it):
from AppKit import NSWorkspace  # NSWorkspace lives in AppKit, not Foundation
ws = NSWorkspace.sharedWorkspace()
ws.launchApplication_('Microsoft Excel')
If not, you can always use the open tool:
import subprocess
subprocess.check_call(['open', '-a', 'Microsoft Excel'])
Either way, you're effectively launching Excel the same way as if the user double-clicked the app icon in Finder.
If you want to make Excel do something simple like open a specific document, that's not much harder. Look at the NSWorkspace or open documentation to see how to do whatever you want.
If you actually want to control Excel—e.g., open a document, make some changes, and save it—you'll want to use its AppleScript interface.
Apple's recommended way of doing that is via ScriptingBridge, or using a dual-language approach (write AppleScripts and execute them via NSAppleScript, which in Python you do through PyObjC). However, I'd probably use appscript (get the code from here). Despite the fact that it's been abandoned by its original creator, is only sparsely maintained, and will probably stop working with some future OS X version, it's still much better than the official solutions.
Here's a sample (untested, because I don't have Excel here):
import appscript
excel = appscript.app('Microsoft Excel')
excel.workbooks[1].column[2].row[2].formula.set('=A2+1')
From the comments it is not completely clear whether you need to 'update' an Excel file with data and just assume that you need Excel to do so, or whether you need to change some Excel files to include new data.
It is usually much easier, and certainly faster (with respect to execution speed), to go with 'updating' an Excel file without starting Excel. However, updating is not quite the right word: you have to read in the file and write it out anew. You can of course overwrite the original file, so it looks like an update.
For 'updating' you can use the trio xlrd, xlwt, xlutils if the files you work with are .xls files (Excel 2003). IIRC xlwt does not support .xlsx for writing (but xlrd can read those files).
For .xlsx files I use openpyxl.
Both approaches are good enough for writing things like data, formulas, and basic formatting.
If you have existing Excel files which you use as 'templates' with information that would get lost if you read/write using one of the above packages, then you have to go with updating the file in Excel. I had to do so because I had no easy way to include Visual Basic macros and very specific formatting specified by a client. And sometimes it is just easier to visually setup a spreadsheet and then just fill the cells programmatically. But this was all done on Windows.
If you really have to drive Excel on the Mac because you need to use existing files as templates, I suggest you look at AppleScript. Or, if it is an option, look at the OpenOffice/LibreOffice PyUNO interface.
