pyExcelerator has problems reading some files - python

I've got a problem using pyExcelerator when reading some XLS files.
There are some Python scripts I wrote that use this library to parse XLS files and populate a database with the information.
The templates for the files these scripts parse may vary, and I sometimes reconfigure the script to handle them. With one of the templates I ran into a problem: pyExcelerator just raises an exception:
Traceback (most recent call last):
  File "/home/* * */parsexls.py", line 64, in handle_label
    parser.parse()
  File "/home/* * */parsers.py", line 335, in parse
    self.contents = pyExcelerator.parse_xls(self.file_record.file, self.encoding)
  File "/usr/local/lib/python2.6/dist-packages/pyExcelerator/ImportXLS.py", line 327, in parse_xls
    ole_streams = CompoundDoc.Reader(filename).STREAMS
  File "/usr/local/lib/python2.6/dist-packages/pyExcelerator/CompoundDoc.py", line 67, in __init__
    self.__build_short_sectors_data()
  File "/usr/local/lib/python2.6/dist-packages/pyExcelerator/CompoundDoc.py", line 256, in __build_short_sectors_data
    dentry_start_sid, stream_size) = self.dir_entry_list[0]
IndexError: list index out of range
Some of the problem XLS files contained empty sheets, and removing these sheets helped, but many of the files can't be handled even without empty sheets. There's nothing extraordinary in these files and they contain no formulas or pictures - just strings, numbers and dates.
As far as I can see, pyExcelerator has been abandoned by its author :(
Any suggestions on fixing this issue are much appreciated.

I'm the author of xlrd. It reads XLS files and is not a fork of anything. I maintain a package called xlwt, which writes XLS files and is a fork of pyExcelerator. The parse_xls functionality in pyExcelerator was regarded as deprecated and was removed from xlwt. Use xlrd instead.
Given the traceback that you reproduced, it looks like the file may be corrupted. What it is doing there happens well before the sheet data is parsed. What software produces these files? Can you open them with Excel or OpenOffice.org's Calc or Gnumeric? xlrd may give you a more meaningful error message. You may like to send me (insert_punctuation('sjmachin', 'lexicon', 'net')) copies of your failing file(s); please include some with and some without empty sheets. By the way, what are you using to remove empty sheets? What error message do you get from pyExcelerator when processing files with empty sheets?
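For illustration, a minimal xlrd sketch of reading such a file (the file name is a placeholder; the encoding_override argument is only needed for old files with a missing or odd codepage):
import xlrd

book = xlrd.open_workbook('problem_file.xls')    # placeholder path; encoding_override='cp1251' is available for odd codepages
sheet = book.sheet_by_index(0)
for row_idx in range(sheet.nrows):
    print(sheet.row_values(row_idx))              # list of cell values for each row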

You might wish to give xlrd a try... it started (I believe) as a fork of pyExcelerator, so incorporating it requires few code changes, and it is actively maintained:
http://pypi.python.org/pypi/xlrd
Project website
General info, release notes and history from the documentation


workbook save failing, not sure why

I apologize for the length of this. I am a relative neophyte to Excel VBA and even more junior with Python. I have run into an issue with an error that occasionally occurs in Python using openpyxl (I'm just trying that for the first time).
Background: I have a series of Python scripts (12) running and querying an API to gather data and populate 12 different, though similar, workbooks. Separately, I have an equal number of Excel instances periodically looking for that data and doing near-real-time analysis and reporting. Another Python script looks for key information to be reported from the spreadsheets and will text it to me when identified. The problem seems to occur between the data-gathering Python scripts and a copy command in the data analysis workbooks.
The way the Python data-gathering scripts "talk" to the analysis workbooks is via the sheets they build in their workbooks. The existing VBA in the analysis workbooks will copy the data workbooks to another directory (so that they can be opened and manipulated without impacting their use by the Python scripts) and then interpret and copy the data into the Excel analysis workbook. Although I recently tested a method to read the data directly from those Python-created workbooks without opening them, the VBA will require some major surgery to convert to that method, and that is likely not going to happen soon.
TL;DR: There are data workbooks and analysis workbooks. Python builds the data workbooks, and the analysis workbooks use VBA to copy the data workbooks to another directory and load specific data from the copied data workbooks. There is a one-to-one correspondence between the data and analysis workbooks.
Based on the above, I believe that the only "interference" that occurs with the data workbooks is when the macro in the analysis workbook copies the workbook. I thought this would be a relatively safe level of interference, but it apparently is not.
The copy is done in VBA with this set of commands (the actual VBA sub is about 500 lines):
fso.CopyFile strFromFilePath, strFilePath, True
where fso is set thusly:
Set fso = CreateObject("Scripting.FileSystemObject")
and the strFromFilePath and strFilePath both include a fully qualified file name (with their respective paths). This has not generated any errors on the VBA side.
The data is copied about once a minute (though it varies from 40 seconds to about 5 minutes) and seems to work fine from a VBA perspective.
What fails is the Python side, about 1% of the time (which is probably 12 or fewer times daily). While that seems small, the associated data capture process halts until I notice and restart it. This means anywhere from 1 to all 12 of the data capture processes will fail at some point each day.
Here is what a failure looks like:
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
monitor('DLD',1,13,0)
File "<string>", line 794, in monitor
File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\site-packages\openpyxl\workbook\workbook.py", line 407, in save
save_workbook(self, filename)
File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\site-packages\openpyxl\writer\excel.py", line 291, in save_workbook
archive = ZipFile(filename, 'w', ZIP_DEFLATED, allowZip64=True)
File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1239, in __init__
self.fp = io.open(file, filemode)
PermissionError: [Errno 13] Permission denied: 'DLD20210819.xlsx'
and I believe it occurs as a result of the following lines of Python code (which come after a while statement with various if conditions to populate the worksheets). The Python script itself is about 200 lines long:
time.sleep(1) # no idea why wb.save sometimes fails; trying a delay
wb.save(FileName)
Notice, I left in one of the attempts to correct this. I have tried waiting as much as 3 seconds with no noticeable difference.
I admit I have no idea how to detect errors thrown by OpenPyXl and am quite unskilled at python error handling, but I had tried this code yesterday:
retries = 1
success = False
while not success and retries < 3:
    try:
        wb.save
        success = True
    except PermissionError as saveerror:
        print('>>> Save Error: ', saveerror)
        wait = 3
        print('=== Waiting %s secs and re-trying... ===' % wait)
        #sys.stdout.flush()
        time.sleep(wait)
        retries += 1
My review of the output tells me that the except code never executed while testing the data capture routine over 3000 times. However, the "save" also never happened so the analysis spreadsheets did not receive any information until later when the python code saved the workbook and closed it.
I also tried adding a wb.close after setting the success variable to true, but got the same results.
I am considering either rewriting the VBA to try to grab the data directly from the unopened data workbooks without first copying them (which actually sounds more dangerous) or using an external synching tool to copy them outside of VBA (which could potentially cause exactly the same problem).
Does anyone have an idea of what may be happening and how to address it? It works nearly all the time but just fails several times a day.
Can someone help me to better understand how to trap the error thrown by OpenPyXl so that I can have it retry rather than just abending?
Any suggestions are appreciated. Thank you for reading.
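One observation on the retry snippet above, offered as a possible explanation rather than a definitive diagnosis: wb.save is referenced but never called, so the try block performs no save and can never raise PermissionError. Below is a minimal sketch of a retry loop that actually invokes the save; the workbook and FileName are stand-ins for the ones built by the data-gathering script, and the delay and retry count are arbitrary:
import time
from openpyxl import Workbook

wb = Workbook()                    # stand-in for the workbook populated earlier
FileName = 'DLD20210819.xlsx'      # stand-in for the real target path

for attempt in range(3):
    try:
        wb.save(FileName)          # note the parentheses: wb.save alone does nothing
        break                      # saved successfully, stop retrying
    except PermissionError as saveerror:
        print('>>> Save Error:', saveerror)
        time.sleep(3)              # give the VBA copy time to release the file
else:
    print('=== Giving up after 3 attempts ===')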
Not sure if this is the best way, but the comment from simpleApp gave me an idea that I may want to use a technique I used elsewhere in the VBA. Since I am new to these tools, perhaps someone can suggest a cleaner approach, but I am going to try using a semaphore file to signal when I am copying the file to alert the python script that it should avoid saving.
In the below I am separating out the directory, the prefix, and the suffix. The prefix would be different for each of the 12 or more instances I am running, and I have not figured out where I want to put these files nor what suffix I should use, so I made them variables.
For example, in the VBA I will have something like this to create a file saying currently available:
Dim strSemaphoreFolder As String
Dim strFilePrefix As String
Dim strFileDeletePath As String
Dim strFileInUseName As String
Dim strFileAvailableName As String
Dim strSemaphoreFileSuffix As String
Dim fso As Scripting.FileSystemObject
Dim fileTemp As TextStream
Set fso = CreateObject("Scripting.FileSystemObject")
strSemaphoreFileSuffix = ".txt"
strSemaphoreFolder = "c:\temp\monitor\"
strFilePrefix = "RJD"
strFileDeletePath = strSemaphoreFolder & strFilePrefix & "*" & strSemaphoreFileSuffix
' Clean up remnants from prior activities
If Len(Dir(strFileDeletePath)) > 0 Then
    Kill strFileDeletePath
End If
' files should be gone
' Set the In-use and Available Names
strFileInUseName = strFilePrefix & "InUse" & strSemaphoreFileSuffix
strFileAvailableName = strFilePrefix & "Available" & strSemaphoreFileSuffix
' Create an available file
Set fileTemp = fso.CreateTextFile(strSemaphoreFolder & strFileAvailableName, True)
fileTemp.Close
' available file should be there
Then, when I am about to copy the file, I will briefly change the filename to indicate that the file is in use, perform the potentially problematic copy and then change it back with something like this:
' Temporarily name the semaphore file to "In Use"
Name strSemaphoreFolder & strFileAvailableName As strSemaphoreFolder & strFileInUseName
fso.CopyFile strFromFilePath, strFilePath, True
' After copying the file name it back to "Available"
Name strSemaphoreFolder & strFileInUseName As strSemaphoreFolder & strFileAvailableName
Over in the Python script, before I do the wb.save command, I will insert a check to see whether the file indicates that it is available or in use with something like this:
prefix = 'RJD'
directory = 'c:\\temp\\monitor\\'
suffix = '.txt'
filepathname = directory + prefix + 'Available' + suffix
while not os.path.isfile(filepathname):
    time.sleep(1)
wb.save(FileName)
Does this seem like it would work?
I am thinking that it should avoid the failure if I have properly identified it as an attempt to save the file in the Python script while the VBA script is telling the operating system to copy it.
Thoughts?
afterthoughts:
Using the technique I described, I probably need to create the "Available" semaphore file in the Python script and simply assume it will be there in the VBA script since the Python script is collecting the data and may be doing so before the VBA is even started.
A better alternative may be to simply check for the existence of the "In Use" file which will never be there unless the VBA wants it there, like this:
while os.path.isfile(directory + prefix + 'InUse' + suffix):
    time.sleep(1)
wb.save(FileName)

Py Pd, try/except fooled by Excel and reserved names ('CON.xls')

I'm sharing an experience that might save other users some time in the future.
It happened while reading a list of .xls files in Python/pandas to append them to a dataframe. The code is below. If a file in the list is missing, the code notices via a try/except. One of these files is named 'CON.xls', and that file was missing.
When the loop was executed, the try/except apparently did not work. The program just hangs and nothing happens. This occurs only when the file is named 'CON.xls'; the code was OK with all the other file names in the list.
I then tried to create a 'CON.xls' file by saving it directly from Excel, and even Excel refused to accept the name. 'CON.xls' is a reserved file name.
Try/except was apparently not detecting this kind of issue, or was not the right approach in this case:
import pandas as pd

def db_to_df(list_of_file_names):
    return_df = pd.DataFrame([])
    for file_name in list_of_file_names:
        try:
            df = pd.DataFrame([])
            df = pd.read_excel(file_name + '.xls')
            return_df = pd.concat([return_df, df])
            print('\tFile added: ', file_name)
        except:
            print('\nERROR: ', file_name, '\n')
    return return_df
The above was with Windows 7 and a very old 2003 .xls format; not sure about other versions.
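One way to avoid the hang, sketched under the assumption that skipping such files is acceptable: test the base name against the Windows reserved device names before calling read_excel (RESERVED_NAMES and is_reserved are names made up for this example):
import os

# Windows reserves these base names as device files; opening 'CON.xls' can block forever.
RESERVED_NAMES = {'CON', 'PRN', 'AUX', 'NUL'} \
    | {'COM%d' % i for i in range(1, 10)} \
    | {'LPT%d' % i for i in range(1, 10)}

def is_reserved(file_name):
    stem = os.path.splitext(os.path.basename(file_name))[0]
    return stem.upper() in RESERVED_NAMES

# inside the loop, before pd.read_excel:
#     if is_reserved(file_name + '.xls'):
#         print('\nSKIPPED (reserved name): ', file_name, '\n')
#         continue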

Pandas suddenly cannot open Excel file (can't find workbook in OLE2 compound document)

I have a script that reads an .xlsx Excel file that was working fine until a week ago. The error message is:
xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document
By debugging the script, I've found the whole stack:
C:\MyFolder\MyScript.py", line 42, in PandasReadExcel
ef=pd.read_excel(excfile,sheetname,header,skiprows)
File "C:\Python\Python36\lib\site-packages\pandas\io\excel.py", line 191, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Python\Python36\lib\site-packages\pandas\io\excel.py", line 249, in __init__
self.book = xlrd.open_workbook(io)
File "C:\Python\Python36\lib\site-packages\xlrd\__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows, File "C:\Python\Python36\lib\site-packages\xlrd\book.py", line 87, in open_workbook_xls
ragged_rows=ragged_rows,
File "C:\Python\Python36\lib\site-packages\xlrd\book.py", line 595, in biff2_8_load
raise XLRDError("Can't find workbook in OLE2 compound document")
xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document
By reviewing similar cases here and on GitHub, I've found that this error usually occurs with .xlsm files or password-protected files. But the Excel workbook in question is not password protected and is an .xlsx file. Unluckily, I don't know the person who changes the file; it is updated regularly by a team that runs laboratory analyses, so I have no idea what they changed in it. All I know is that I can open and edit that file with no problem.
Some threads suggest updating the pandas or xlrd version (I am using pandas 0.19.2), which I want to avoid, since the script runs on a remote server and updating the version could affect the proper working of other scripts that depend on this routine.
I thank anybody who has any clue about how to solve this problem.
After months of struggling with this error, I learned that the files in question are being edited using an older version of Microsoft Office (namely Office 2007, in this very case). So I decided to implement a clumsy workaround:
Just open the files using a compatible Excel version and save a copy in a different folder; then open the copy using pandas' read_excel function, and it should load normally!
To automate this task I wrote a PowerShell script that just opens the original file and saves the copy. This script must be executed according to how often the data is updated:
$FileName = "\\path\to\the\source\file.xlsx"
$FileNameCopy = "\\path\to\the\copy\file.xlsx"
$xl = New-Object -comobject Excel.Application
# repeat this for every file concerned
$wb = $xl.Workbooks.open("$FileName",3)
$wb.SaveAs($FileNameCopy)
$wb.Close($False)
$xl.Quit()
Now I can have my data loaded normally again.
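For completeness, the Python side then simply reads the re-saved copy; a minimal sketch (the UNC path mirrors the variables in the PowerShell script above):
import pandas as pd

# read the copy that the PowerShell script re-saved with a newer Excel version
df = pd.read_excel(r'\\path\to\the\copy\file.xlsx')
print(df.head())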

Sound files in PsychoPy won't load

I'm currently working on building an experiment in PsychoPy (v1.82.01 stand-alone). I started on the project several months ago with an older version of PsychoPy.
It worked great and I ran some pilot subjects. We have since adjusted the stimuli sounds and it won’t run.
It looks like there is an issue with referencing the sound file, but I can’t figure out what’s going on.
I recreated the first part of the experiment with a single file rather than a loop so that it would be easier to debug. The sound file is referenced using:
study_sound = sound.Sound(u'2001-1.ogg', secs=-1)
When I run it, I get this output:
Running: /Users/dkbjornn/Desktop/Test/test.py
2016-04-29 14:05:43.164 python[65267:66229207] ApplePersistenceIgnoreState: Existing state will not be touched. New state will be written to /var/folders/9f/3kr6zwgd7rz95bcsfw41ynw40000gp/T/org.psychopy.PsychoPy2.savedState
0.3022 WARNING Movie2 stim could not be imported and won't be available
sndinfo: failed to open the file.
Traceback (most recent call last):
File "/Users/dkbjornn/Desktop/Test/test.py", line 84, in <module>
study_sound = sound.Sound(u'2001-1.ogg', secs=-1)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 380, in __init__
self.setSound(value=value, secs=secs, octave=octave, hamming=hamming)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 148, in setSound
self._setSndFromFile(value)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 472, in _setSndFromFile
start=self.startTime, stop=self.stopTime)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/pyolib/tables.py", line 1420, in setSound
saved data to u'/Users/dkbjornn/Desktop/Test/data/99_test_2016_Apr_29_1405_1.csv'
_size, _dur, _snd_sr, _snd_chnls, _format, _type = sndinfo(path)
TypeError: 'NoneType' object is not iterable
The important thing here is the "sndinfo: failed to open the file." message. Most likely, PsychoPy cannot find your file on disk. Check the following:
Is the file 2001-1.ogg in the same folder as your experiment? Not in a subfolder? Or have you accidentally changed your path, e.g. using os.chdir?
Is it actually called 2001-1.ogg? Any differences in uppercase/lowercase, spaces, etc. all count.
Alternatively, there's something in the particular way the .ogg was saved that causes the problem, even though the Sound class can read a large set of different sound codecs. Try exporting the sound file in other formats, e.g. .mp3 or .wav.
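A quick way to check the first two points, as a sketch to run from the experiment's folder (or paste at the top of test.py):
import os

print(os.getcwd())                    # the folder the script is actually running from
print(os.listdir('.'))                # is '2001-1.ogg' listed, spelled exactly like this?
print(os.path.isfile('2001-1.ogg'))   # True only if the name and path match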

Saving a temporary file

I'm using xlwt in Python to create an Excel spreadsheet. You could interchange this for almost anything else that generates a file; it's what I want to do with the file that's important.
from xlwt import *
w = Workbook()
#... do something
w.save('filename.xls')
I have two use cases for the file: I stream it out to the user's browser, or I attach it to an email. In both cases the file only needs to exist for the duration of the web request that generates it.
What I'm getting at, and the reason for starting this thread, is that saving to a real file on the filesystem has its own hurdles (preventing overwrites, cleaning up the file once done). Is there somewhere I could "save" it where it lives only in memory and only for the duration of the request?
cStringIO
(or mmap if it should be mutable)
Generalising the answer, as you suggested: If the "anything else that generates a file" won't accept a file-like object as well as a filepath, then you can reduce the hassle by using tempfile.NamedTemporaryFile
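For example, a sketch of the tempfile.NamedTemporaryFile route for a library that insists on a real path (delete=False is used so the path can be reopened by name, which also works on Windows; the resulting bytes can then be streamed to the browser or attached to an email):
import os
import tempfile
from xlwt import Workbook

w = Workbook()
ws = w.add_sheet('Sheet1')
ws.write(0, 0, 'hello')

tmp = tempfile.NamedTemporaryFile(suffix='.xls', delete=False)
tmp.close()                      # close the handle so xlwt can write to the path
try:
    w.save(tmp.name)             # the library only ever sees a throwaway path
    with open(tmp.name, 'rb') as f:
        payload = f.read()       # bytes to stream to the browser or email out
finally:
    os.remove(tmp.name)          # nothing is left on disk after the request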
