Context: I am attempting to automate a report that is rather complicated (not conceptually, just in the sheer volume of things to keep track of). After a lot of investigation, the method I settled on was to:
1. Create a template .xlsx file which has a couple of summary pages containing formulas pointing at other (raw data) sheets within the file.
2. Pull data from SQL Server and insert it into the template file, overwriting the raw data sheet with the relevant data.
3. Publish the report (most likely this will just be moving the .xlsx file to a new directory).
Obviously, I have spent a lot of time looking at other people's solutions to this issue, as the topic has been discussed a lot. The problem I have found, however, is that (at least in my search) none of the methods proposed have worked for me. My belief is that the previously correct answers are no longer relevant in current versions of pandas etc. Rather than linking to the dozens of articles attempting to answer this question, I will explain the issues I have had with the various solutions.
Using openpyxl instead of xlsxwriter - This resulted in a "BadZipFile: File is not a zip file" error. As I understand it, this pertains to the pandas version, or rather the usual fix (mode='a') does not work due to the pandas version (I believe anything beyond 1.2 has this issue); a sketch of that pattern follows below.
Helper function - This does not work either; it also throws the BadZipFile error.
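For reference, the mode='a' pattern mentioned above is roughly the following (a sketch, not my actual code: the template path is a placeholder, and if_sheet_exists requires pandas 1.4 or later):

import pandas as pd

# df is the DataFrame pulled from SQL Server (defined elsewhere).
# Append to the existing template, replacing only the raw-data sheet
# while leaving the summary sheets and their formulas intact.
with pd.ExcelWriter('template.xlsx', engine='openpyxl', mode='a',
                    if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='Some Sheet', index=False)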
Below is a heavily redacted version of the code which should give all the required detail.
#Imports
import os
import shutil
import datetime
from datetime import date

import numpy as np
import pandas as pd
import pyodbc
from openpyxl import load_workbook

# Set database connection variables.
cnxn = pyodbc.connect(*Credentials*)
cursor = cnxn.cursor()

# 'script' holds the (redacted) SQL query text.
df = pd.read_sql_query(script, cnxn)

# The creation of 'writer' was redacted above; presumably something like
# writer = pd.ExcelWriter(template_path, engine='openpyxl', mode='a')
df.to_excel(writer, sheet_name='Some Sheet', index=False)
writer.close()
Long story short, I am finding it very frustrating that what should be very simple is turning into a multi-day exercise. Please, if anyone has experience with this and could offer some insight, I would be very grateful.
Finally, I have to admit that I am quite new to Python, though I had not found the transition too difficult until today. Most of the issues I have been having are easily solvable (by me), with the exception of this one. If there is something I have somehow completely missed, put me on the right track and I will not be a bother.
Okay, so I found that I was in fact incorrect (big surprise); that is, my statement that the helper function does not work. It does work: the ZipFile issue was most likely caused by some form of protection on the workbook. The funny thing is, I was able to get it working with a new workbook, but when I changed the name of the new workbook it started throwing the ZipFile error again. After a while of creating new files and trying different things, I eventually got it to work.
Two things I would note about the helper function:
It is not particularly efficient, at least not in the way I have set it up. I replaced all instances of to_excel with append_df_to_excel from the helper function. Doing this caused the run time to go from about 1-2 minutes to well over 10. I will do some more testing to see why this might be (I will post back if I find something interesting), but it is something to watch for when using larger datasets.
Not an issue as such, but to get this to work as expected, I had to alter the function slightly. Specifically, in order to use the truncate feature in my situation, I needed to move the 'truncate' section above the 'startrow' section. In my situation it made more sense to do that, rather than to specify the start row prior to truncating the sheet. A sketch of the reordered helper follows.
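For reference, here is a minimal sketch of such a helper with the truncate step moved ahead of the startrow calculation. It assumes pandas 1.5 or later (where ExcelWriter in append mode supports if_sheet_exists and the sheet list is read live from the workbook) and is a reconstruction, not the exact helper from the original answer:

import pandas as pd

def append_df_to_excel(filename, df, sheet_name='Sheet1', startrow=None,
                       truncate_sheet=False, **to_excel_kwargs):
    # Open the existing workbook in append mode; 'overlay' writes into
    # an existing sheet instead of raising an error.
    with pd.ExcelWriter(filename, engine='openpyxl', mode='a',
                        if_sheet_exists='overlay') as writer:
        book = writer.book
        # Truncate FIRST, so the startrow default below is computed
        # against the emptied sheet rather than the old contents.
        if truncate_sheet and sheet_name in book.sheetnames:
            idx = book.sheetnames.index(sheet_name)
            book.remove(book.worksheets[idx])
            book.create_sheet(sheet_name, idx)
            startrow = 0  # write from the top of the now-empty sheet
        # Default: append below whatever is already on the sheet.
        if startrow is None and sheet_name in book.sheetnames:
            startrow = book[sheet_name].max_row
        if startrow is None:
            startrow = 0
        df.to_excel(writer, sheet_name=sheet_name, startrow=startrow,
                    **to_excel_kwargs)

Calling it then mirrors a normal to_excel call, e.g. append_df_to_excel('report.xlsx', df, sheet_name='Some Sheet', truncate_sheet=True, index=False).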
Hope this helps anyone running into the same issue.
Lesson learned: as always, the information is out there, it's just a matter of actually paying close attention and trying things out, rather than copy-pasting and scratching your head when things aren't working.
I have a Python script, which a few people will be running, that uses xlwings to read live data from a sheet:
book = xw.Book(path + "book_to_read.xlsm")
This works fine on my machine and on some others, but it seems to consistently throw the same error for most users: "Call was rejected by callee."
I understand that Excel will throw this error if it is being given "too much" to do, but that doesn't explain why the bug only appears for some people.
As far as I can tell, everyone is using the same version of Python (3.10) and the same version of Excel (2016, 64-bit), and has all the relevant dependencies installed. The only difference I can see is the choice of IDE, but I don't see how that could be relevant.
Clearly I'm being stupid and missing something obvious, but I would appreciate pointers on what might be inconsistent between instances.
EDIT: It seems the issue only occurs for users who have multiple Excel books open with a lot of work/macros. Is there a way to work around this?
xlwings is not able to connect to an Excel file if that file (or even another one) is currently being edited or is calculating. This means that if you have many heavy files computing in the background, xlwings cannot start.
If a file is being edited, the program simply freezes while Excel is busy and continues when editing is done. If, on the other hand, the file is busy calculating, the error "Call was rejected by callee" is raised.
The error happens during the creation of the Book instance:
import xlwings as xw
wb = xw.Book(wb_folder + EXCEL_FILENAME)
A possible workaround is to wrap this call in a try/except and retry: attempt to open the file up to 1000 times, and stop the script if it never succeeds. As soon as the file is successfully opened, continue as usual:
import xlwings as xw

wb = None
t = 1000
while t > 0 and wb is None:
    try:
        wb = xw.Book(wb_folder + EXCEL_FILENAME)
    except Exception as str_error:
        # Excel is busy; log the attempt and try again.
        print(t, str_error)
        t -= 1
        # Optionally pause here (after "import time") to give Excel
        # a chance to finish calculating:
        # time.sleep(0.5)
if wb is None:
    raise RuntimeError("Could not open the workbook after 1000 attempts")
In order to give Excel more time to finish its calculation, you can also add a time.sleep(duration_in_seconds) inside the loop (note that time.sleep takes seconds, not milliseconds).
When connected, my first action is to set the calculation mode to manual:
wb.app.calculation = 'manual'
[...]
wb.app.calculation = 'automatic'
I put it back to automatic at the end of the script. The issue is that it can take a while before the script is actually able to connect to the file. If all files are set to manual before running the script, the issue never happens. Unfortunately, the users have massive Excel files with many cells recomputed every second, so these files are permanently recalculating.
So far, I didn't find a better way to avoid this issue.
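One small hardening worth adding (my suggestion, not part of the original answer): restore the calculation mode in a try/finally block, so that a crash halfway through the script doesn't leave the workbook stuck in manual calculation:

import xlwings as xw

wb = xw.Book(wb_folder + EXCEL_FILENAME)  # e.g. via the retry loop above
app = wb.app
app.calculation = 'manual'  # pause recalculation while we work
try:
    ...  # read and write the sheets here
finally:
    # Runs even if the body raises, so Excel is always restored.
    app.calculation = 'automatic'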
Computer Specs:
Windows 10
Python 3.7.3
Excel Version 2107 (Build 14228.20250) from Microsoft 365
xlwings 0.23.3
Say I have two files: demo.py
# demo.py
from pathlib import Path
for i in range(5):
    exec(Path('another_file.txt').read_text())
and another_file.txt (note the indent)
    print(i)
Is it possible to make python demo.py run?
N.B. This is useful when using Page (or wxFormBuilder, or PyQt's Designer) to generate a GUI layout, where skeletons of the callback functions are generated automatically. The skeletons have to be modified, but at the same time each regeneration overwrites them, so the code snippets have to be copied back in. Anyway, you know what I am talking about if you have used Page, wxFormBuilder, or PyQt's Designer.
You can solve the basic problem by removing the indent:
from pathlib import Path
import textwrap
for i in range(5):
    exec(textwrap.dedent(Path('another_file.txt').read_text()))
There are still two rather major problems with this:
There are serious security implications here. You're running code without including it in your project. The idea that you can worry about security and other issues later will cause you pain at a later point. You'll see similar advice on this site about avoiding SQL injection. That later date may never come, and even if it does, there's a very real chance you won't remember or correctly identify all of the problems. It's absolutely better to avoid the problems in the first place.
Also, with dynamic code like this, you run the very real risk of hitting a syntax error whose call stack tells you nothing about where the code came from. It's not bad for simple cases like this, but as you add more and more complexity to a project like this, you'll likely find you're adding more scaffolding to help you debug the issues you run into, rather than spending your time adding features.
And, combining these two problems can be fun. It's contrived, but if you changed the for loop to a while loop like this:
i = 0
while i < 5:
    exec(textwrap.dedent(Path('another_file.txt').read_text()))
    i += 1
And then modified the text file to be this:
    print(i)
    i += 1
It's trivial to understand why it no longer runs the 5 times you expect (each pass now increments i twice), but as both "sides" of this project get more complex, figuring out the interplay between the elements will get much harder.
In short, don't use exec (or eval). Future you will thank past you for making your life easier.
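If you do go down this road anyway, one mitigation (my addition, not part of the original answer) is to compile() the snippet with its real filename, so that syntax errors and tracebacks at least point at the right file:

from pathlib import Path
import textwrap

path = Path('another_file.txt')
src = textwrap.dedent(path.read_text())
# Passing the real filename means a SyntaxError or a runtime traceback
# will name 'another_file.txt' instead of '<string>'.
code = compile(src, str(path), 'exec')
for i in range(5):
    exec(code)

Compiling once outside the loop also avoids re-parsing the file on every iteration.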
I know that xlwings uses Windows COM and such, and based on this: https://support.microsoft.com/en-us/help/196776/office-automation-using-visual-c (8th question):
A common cause of speed problems with Automation is with repetitive reading and writing of data. This is typical for Excel Automation clients.
And that's exactly what I am doing: tons of reading and writing. Later on I can see that EXCEL.EXE is taking up 50% of my CPU, and my Python script has essentially stopped (it just halts there, though python.exe still shows up in Task Manager).
Now, is there any way to deal with this? I ask because, continuing the quote above, Microsoft says:
However, most people aren't aware that this data can usually be written or read all at once using SAFEARRAY.
So I guess there's a way to work like this in Python using xlwings?
Please note that there are things I can't do with other libraries, like getting the values of the cells as the user sees them; all I get is formulas. So I guess xlwings is the way to go. Thanks.
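For what it's worth, xlwings does batch range transfers: reading or writing a whole block through a single .value access is one COM round-trip, which is essentially the SAFEARRAY-style bulk transfer the Microsoft article describes. A sketch (the workbook and sheet names are made up):

import xlwings as xw

wb = xw.Book('report.xlsx')      # hypothetical workbook
sht = wb.sheets['Data']          # hypothetical sheet

# Slow: one COM round-trip per cell.
# col = [sht.range((row, 1)).value for row in range(1, 10001)]

# Fast: one COM round-trip for the whole block. .value on a multi-cell
# range returns nested lists of evaluated cell values (what the user
# sees, not the formulas).
data = sht.range('A1:C10000').value

# Writing a 2-D list back is likewise a single bulk transfer.
sht.range('E1').value = data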
I have a ridiculously simple Python script that uses the arcpy module. I turned it into a script tool in ArcMap and am running it that way. It works just fine; I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and so far it has taken days. Does anyone have any suggestions on how to speed it up?
To save time, I already created the columns that will be populated in the table before running the script. "back" represents the second comma-separated value in the "back_pres_dist" column, and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist    back    dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env

inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1)  # the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2)  # the empty dist column to be populated

arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
arcpy.AddMessage("dist column updated.")
arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some of the data into memory might speed things up, but I'm not sure how to do that with Python (when I used R, it took forever to read the data into memory, and it was a nightmare trying to write back out to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. Seems you should be able to more directly read your *.dbf file with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit context managers via the with/as keywords (PEP 343; Raymond Hettinger has a good video on YouTube, too) and iterators in general (see PEPs 234 and 255), which load only one record at a time; a sketch follows.
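As a concrete illustration of the iteration idea (a sketch I haven't run against data of this size, using the field names from the question; arcpy.da requires ArcGIS 10.1 or later): a single arcpy.da.UpdateCursor pass streams one record at a time and touches the table once, instead of two full CalculateField passes:

import arcpy

input_table = arcpy.GetParameterAsText(0)
fields = ['back_pres_dist', 'back', 'dist']

# The cursor is a context manager and an iterator: one record in
# memory at a time, one write per record.
with arcpy.da.UpdateCursor(input_table, fields) as cursor:
    for row in cursor:
        parts = row[0].split(',')
        row[1] = parts[1]          # back
        row[2] = parts[3]          # dist (use float(parts[3]) if the field is numeric)
        cursor.updateRow(row)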
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competing I/O requests. IPython is an add-on for Python that has a pretty easy-to-use, high-level package, "parallel", if you want an easy place to start. There are lots of pretty good videos on YouTube from PyCon 2012, including a three-hour one where the parallel material starts at about 2:13:00.
I have never used Python before, most of my programming has been in MATLAB and Unix. However, recently I have been given a new assignment that involves fixing an old PyEPL program written by a former employee (I've tried contacting him directly but he won't respond to my e-mails). I know essentially nothing about Python, and though I am picking it up, I thought I'd just quickly ask for some advice here.
Anyway, there are two issues at hand here, really. The first is this segment of the code:
exp = Experiment()
exp.setBreak()
vt = VideoTrack("video")
at = AudioTrack("audio")
kt = KeyTrack("key")
log = LogTrack("session")
clk = PresentationClock()
I understand what this is doing; it creates a series of tracking files in the directory after the program is run. However, I have searched a bunch of online tutorials and can't find a reference to any of these calls. Maybe I'm not searching the right places or something, but I cannot find ANYTHING about this.
What I need to do is modify the
log = LogTrack("session")
segment of the code, so that all of the session.log files go into a new directory, separate from the other log files. But I also need to find a way not only to concatenate them into a single session.log file, but also to add a new column to that file containing the subject number (the program is meant to be run by multiple subjects to collect data).
I am not asking anyone to do my work for me, but if anyone could give me some pointers, or any sort of advice, I would greatly appreciate it.
Thanks
I would first check whether there is a line in the code like
from some_module_name import *
This could easily explain why you can call these functions (classes, actually). It will also tell you which file to look in to modify the code for LogTrack.
Edit:
So, a little digging finds that LogTrack is part of PyEPL's textlog module, and the other classes come from other modules. Somewhere in this person's code should be lines something like:
from pyepl.display import VideoTrack
from pyepl.sound import AudioTrack
from pyepl.textlog import LogTrack
...
This means that these are classes specific to PyEPL. There are a few ways you could go about modifying how they work: you can modify the source of the LogTrack class so that it operates differently, or, perhaps easier, simply subclass LogTrack and override some of its methods.
Either of these will require a fairly thorough understanding of how this class operates.
In any case, I would download the source from here, open up the code/textlog.py file, and start reading how LogTrack works.
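As for the second half of the task, concatenating the per-subject session.log files and adding a subject-number column doesn't need PyEPL at all. Here is a sketch using only the standard library; it assumes a hypothetical layout of logs/<subject_id>/session.log with tab-delimited lines, so adjust the paths and delimiter to the real layout:

import csv
from pathlib import Path

log_root = Path('logs')  # hypothetical directory with one folder per subject

with open('all_sessions.log', 'w', newline='') as out:
    writer = csv.writer(out, delimiter='\t')
    for subj_dir in sorted(log_root.iterdir()):
        log_file = subj_dir / 'session.log'
        if not log_file.exists():
            continue
        with open(log_file, newline='') as f:
            for row in csv.reader(f, delimiter='\t'):
                # Prepend the subject number, taken from the folder name.
                writer.writerow([subj_dir.name] + row)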