We are in the design phase for a product. The idea is that the code will read a list of values from Excel into SQL.
The requirements are as follows:
Workbook may be accessed by multiple users outside of our program
Workbook must remain accessible (i.e. not be corrupted) should something bad occur while our program is running
Program will be executed when no users are in the file
Right now we are considering using pandas in a simple manner as follows:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('File.xlsx', sheet_name='Sheet1')
"""Some code to write df in to SQL"""
If this code dies with the Excel file still open, is there ANY possibility that the file will remain locked somewhere by my program, or be corrupted?
To clarify, we envision something catastrophic like the server crashing or losing power.
I searched around but couldn't find a similar question; please redirect me if necessary.
I also read through Pandas read_excel documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
With the code you provide, from my reading of the pandas and xlrd code, the given file will only be opened in read mode. That should mean, to the best of my knowledge, that there is no more risk in what you're doing than in reading the file any other way - and you have to read it to use it, after all.
If this doesn't sufficiently reassure you, you could minimize the time the file is open and, more importantly, not expose your file to external code, by handing pandas a BytesIO object instead of a path:
import io
import pandas as pd

# Read the raw bytes and close the file handle immediately.
with open('File.xlsx', 'rb') as f:
    data = io.BytesIO(f.read())

df = pd.read_excel(data, sheet_name='Sheet1')
# etc
This way your file will only be open for the time it takes to read it into memory, and pandas and xlrd will only be working with a copy of the data.
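If you also want a picture of the "write df into SQL" step, SQLAlchemy's to_sql is one option; a minimal sketch, where the connection string and table name are placeholders and the SQLAlchemy/pyodbc choice is just an assumption on my part:
from sqlalchemy import create_engine

# Hypothetical connection string - adjust driver/server/database for your setup.
engine = create_engine(
    'mssql+pyodbc://SERVER/DATABASE?driver=ODBC+Driver+17+for+SQL+Server&Trusted_Connection=yes'
)

# df is the DataFrame produced by read_excel above.
df.to_sql('target_table', engine, if_exists='append', index=False)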
Related
Context: I am attempting to automate a report that is rather complicated (not conceptually, just in the sheer volume of things to keep track of). The method I settled on after a lot of investigation was to:
Create a template xlsx file which has a couple of summary pages containing formulas pointing at other (raw data) sheets within the file.
Pull data from SQL Server and insert it into the template file, overwriting the raw data sheet with relevant data.
Publish the report (most likely this will just be moving the xlsx file to a new directory).
Obviously, I have spent a lot of time looking at other people's solutions to this issue (as this topic has been discussed a lot). The issue I have found, however, is that (at least in my search) none of the methods proposed have worked for me; my belief is that the previously correct responses are no longer relevant in current versions of pandas etc. Rather than linking to the dozens of articles attempting to answer this question, I will explain the issues I have had with various solutions.
Using openpyxl instead of xlsxwriter - this resulted in a "BadZipFile: File is not a zip file" error, which as I understand it pertains to the pandas version, or rather the fix (mode='a') does not work due to the pandas version (I believe anything beyond 1.2 has this issue).
Using a helper function - this does not work either; it also throws the BadZipFile error.
Below is a heavily redacted version of the code which should give all the required detail.
#Imports
import os
import pyodbc
import numpy as np
import shutil
import pandas as pd
import datetime
from datetime import date
from openpyxl import load_workbook

# Set database connection variables.
cnxn = pyodbc.connect(*Credentials*)
cursor = cnxn.cursor()

# script is the (redacted) SQL query string.
df = pd.read_sql_query(script, cnxn)

# The writer setup was redacted; presumably something like
# writer = pd.ExcelWriter(template_path, engine='openpyxl', mode='a')
df.to_excel(writer, sheet_name='Some Sheet', index=False)
writer.close()
Long story short: I am finding it very frustrating that what should be very, very simple is turning into a multi-day exercise. Please, if anyone has experience with this and could offer some insight, I would be very grateful.
Finally, I have to admit that I am quite new to using Python, though I have not found the transition too difficult until today. Most of the issues I have been having are easily solvable (for me), with the exception of this one. If there is something I have somehow completely missed, put me on the right track and I will not be a bother.
Okay, so I found that I was in fact incorrect (big surprise), specifically in my statement that the helper function does not work. It does work; the ZipFile issue was most likely caused by some form of protection on the workbook. The funny thing is, I was able to get it working with a new workbook, but when I changed the name of the new workbook it started throwing the ZipFile error again. After a while of creating new files and trying different things, I eventually got it to work.
Two things I would note about the helper function:
It is not particularly efficient, at least not in the way I have set it up. I replaced all instances of 'to_excel' with 'append_df_to_excel' from the helper function. Doing this resulted in the run time going from about 1-2 minutes to well over 10. I will do some more testing to see why this might be (I will post back if I find something interesting), but it is something to watch for when using larger datasets.
Not an issue as such, but to get this to work as expected I had to alter the function slightly. Specifically, in order to use the truncate feature in my situation, I needed to move the 'truncate' section above the 'firstrow' section. In my situation it made more sense to do that, rather than to specify the start row prior to truncating the sheet.
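For anyone on a recent pandas (1.3 or later), much of what the helper function does is now built into ExcelWriter itself; a minimal sketch, where the template path, sheet name and sample data are placeholders:
import pandas as pd

# Hypothetical template and data - replace with your own.
template_path = 'report_template.xlsx'
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# mode='a' appends to the existing workbook; if_sheet_exists='replace' drops and
# rewrites the named sheet, leaving the summary sheets in the template alone.
with pd.ExcelWriter(template_path, engine='openpyxl', mode='a',
                    if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='Raw Data', index=False)
The 'replace' option rewrites the whole sheet, so it plays roughly the same role as the helper's truncate-then-write behaviour.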
Hope this helps anyone running into the same issue.
Lesson learned: as always, the information is out there; it's just a matter of actually paying close attention and trying things out, rather than copy-pasting and scratching your head when things aren't working.
I have a situation where multiple sources will need to read from the same (small in size) data source, possibly at the same time. For example, multiple different computers calling a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function; it must be loaded in the function.
Is there a data source that can handle this effectively? A pandas dataframe was an acceptable format for the information that needs to be read, so I tried storing that dataframe in an sqlite3 database since, according to the sqlite3 website, sqlite3 databases can handle concurrent reads. Unfortunately, it is failing too often. I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would work/be effective? I tried scouring the internet for whether or not something as simple as an Excel file plus the pandas read_excel function could handle this type of concurrency, but I could not find any information. I tried an experiment of using a multiprocessing pool to simultaneously load the same very large (i.e. 1-minute load) Excel file and it did not crash, but of course that is not exactly a perfect experiment.
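For context, a sketch of that kind of concurrency test (the file name and pool size here are placeholders, not my exact code):
import multiprocessing as mp
import pandas as pd

def load_file(_):
    # Each worker process opens and reads the same workbook independently.
    df = pd.read_excel('shared_data.xlsx', sheet_name='Sheet1')
    return len(df)

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        row_counts = pool.map(load_file, range(4))
    print(row_counts)  # one row count per concurrent read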
Thanks!
You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file.
Also take a look at processing large xlsx file in python
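A minimal sketch of read-only mode (the file name is a placeholder):
from openpyxl import load_workbook

# read_only=True streams rows lazily instead of building the whole object tree in memory.
wb = load_workbook('shared_data.xlsx', read_only=True, data_only=True)
ws = wb.active

for row in ws.iter_rows(values_only=True):
    pass  # process each row tuple here

wb.close()  # important in read-only mode: it releases the file handle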
I have a large BCP file (12GB) that I have imported into dask and done some data wrangling on, and I now wish to load the result into SQL Server. The file has been reduced from 40+ columns to 8 columns, and I wish to find the best method to import it. I have tried using the following:
import sqlalchemy as sa
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from urllib.parse import quote_plus
pbar = ProgressBar()
pbar.register()
#windows authentication
#to_sql_uri = quote_plus(engine)
ddf.to_sql('test',
           uri='mssql+pyodbc://TEST_SERVER/TEST_DB?driver=SQL Server?Trusted_Connection=yes',
           if_exists='replace', index=False)
This method is taking too long (3 days and counting). I had suspected this may be the case, so I also tried to write to a BCP file with the intention of using SQL BCP, but again this is taking a number of days:
df_train_grouped.compute().to_csv(r"F:\TEST_FILE.bcp", sep='\t')
I am relatively new to dask and can't seem to find an easy to follow example on the most efficient method to do this.
There is no need for you to use compute; it materialises the dataframe into memory and is likely the bottleneck for you. You can instead do
df_train_grouped.to_csv(r"F:\TEST_FILE*.bcp", sep='\t')
which will create a number of output files in parallel - which is probably exactly what you want.
Note that profiling will determine whether your process is IO bound (e.g., by the disc itself), in which case there is nothing you can do, or whether one of the process-based schedulers (ideally the distributed scheduler) can help with GIL-holding tasks.
Changing to a multiprocessing scheduler as follows improved performance in this particular case:
import dask

dask.config.set(scheduler='processes')  # override the default with the multiprocessing scheduler
df_train_grouped.to_csv(r"F:\TEST_FILE*.bcp", sep='\t', chunksize=1000000)
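If the processes scheduler is still not enough, the distributed scheduler mentioned above is worth a try; a sketch, assuming dask.distributed is installed:
from dask.distributed import Client, LocalCluster

# Start a local cluster of worker processes; tune workers/threads to your machine.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)  # registering the client makes it the default scheduler

df_train_grouped.to_csv(r"F:\TEST_FILE*.bcp", sep='\t')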
This question already has an answer here: Pandas to_excel - How to make it faster (1 answer). Closed 2 years ago.
I am successfully writing dataframes to Excel using df.to_excel(). Unfortunately, this is slow and consumes gobs of memory. The larger the dataset, the more memory it consumes, until (with the largest datasets I need to deal with) the server starves for resources.
I found that using the df.to_csv() method instead offers the convenient chunksize=nnnn argument. This is far faster and consumes almost no extra memory. Wonderful! I'll just write initially to .csv, then convert the .csv to .xlsx in another process. I tried doing it with openpyxl, but every solution I found iterated through the csv data one row at a time, appending to a Workbook sheet, e.g.
with open(temporary_filepath, 'r') as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save(output_filepath)  # wb/ws are an openpyxl Workbook/worksheet created earlier; save() needs a target path (output_filepath is a placeholder)
This works, but when I watch my resource monitor, it consumes just as much memory and is just as slow (I now assume the original df.to_excel() was doing the same thing internally). So this approach didn't get me out of the woods after all.
I had assumed I could do something that would be the equivalent of File | Save As, but in Python, e.g. read the whole csv into an openpyxl Workbook and save it to a file all in one go, without iterating, but either this is not possible or I can't find the documentation on it.
Given a very large Pandas dataframe and a requirement to output .xlsx (not .csv), what is the best approach for low memory consumption? Can it be done efficiently with Pandas or Openpyxl, or is there a better tool for the job?
Update: It looks like pyexcel has a Save As method that might do the trick. I'd prefer not to add yet another spreadsheet lib to the stack if possible, but will do so if there is no equivalent in pandas or openpyxl. Has anyone used that successfully?
You could probably use the pyexcelerate library - https://github.com/kz26/PyExcelerate. They have posted benchmarks on their GitHub repo:
from pyexcelerate import Workbook

# Build a list of rows: the header row first, then the DataFrame's values.
values = [df.columns] + list(df.values)

wb = Workbook()
wb.new_sheet('data_sheet_name', data=values)  # write the whole sheet in one call
wb.save('data.xlsx')
The pyexcelerate response is exactly what I asked about, so I accepted that answer, but just wanted to post an update that we found an alternate solution that's possibly even easier. Sharing here in case it's useful.
Pandas now prefers xlsxwriter over openpyxl: if it is installed and you do not specify the engine, xlsxwriter will be used by default (or of course you can specify it explicitly). In my experiments, xlsxwriter was 4x more memory-efficient than openpyxl at the task of writing to Excel. This is not an infinitely scalable solution - it is still conceivable that one could receive a dataset so large that it overwhelms memory even with this optimization - but it is extremely easy: just pip install xlsxwriter and you get a 4x improvement in memory use when calling df.to_excel(), with no code changes (in my case).
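For anyone who prefers to name the engine explicitly rather than rely on the default selection, a sketch is below; the second form, which passes xlsxwriter's constant_memory option through engine_kwargs, is an assumption on my part and needs a reasonably recent pandas:
import pandas as pd

# Explicitly select the xlsxwriter engine (file and sheet names are placeholders).
df.to_excel('output.xlsx', sheet_name='Sheet1', engine='xlsxwriter', index=False)

# If memory is still tight, xlsxwriter's constant_memory mode flushes rows to
# disk as they are written instead of keeping the whole sheet in memory.
with pd.ExcelWriter('output.xlsx', engine='xlsxwriter',
                    engine_kwargs={'options': {'constant_memory': True}}) as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)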
I am trying to import an Excel (.xlsx) file into the Spyder IDE. Everything works fine when I import small files using openpyxl, but for this particular file (around 30MB and 800k rows) my system crashes.
Following is the part of the code that does the import:
from openpyxl import load_workbook
wb = load_workbook(filename=path + 'cleaned_noTC_s_PERNO_Date.xlsx', data_only=True)
Can anyone please let me know what is wrong with this method and what else can I use to import the stated file?
Try using the excellent pandas library; it has very robust Excel-reading functionality and is pretty good with memory in my experience.
See here:
import pandas as pd
xl = pd.read_excel("file.xlsx")
It sounds like you're running out of memory. If you don't need to edit the file, you can use read_only mode; otherwise you'll need more memory.
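A sketch of that for the file in the question, keeping the data_only flag from the original snippet:
from openpyxl import load_workbook

# read_only=True avoids building the full in-memory model of all 800k rows;
# rows are then streamed via ws.iter_rows() rather than loaded all at once.
wb = load_workbook(filename=path + 'cleaned_noTC_s_PERNO_Date.xlsx',
                   read_only=True, data_only=True)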