Python 3 openpyxl UserWarning: Data Validation extension not supported

Python 3 openpyxl UserWarning: Data Validation extension not supported - python

So this is my first time that I'm attempting to read from an Excel file and I'm trying to do so with the openpyxl module. My aim is to collate a dictionary with a nested list as its value. However, when I get this warning when I try to run it:
UserWarning: Data Validation extension is not supported and will be removed
warn(msg)
I don't know where I'm going wrong. Any help would be much appreciated. Thanks
import openpyxl
try:
wb = openpyxl.load_workbook("Grantfundme Master London.xlsx")
except FileNotFoundError:
print("File could not be found.")
sheet = wb["FUNDS"]
database = {}
for i in range(250):#this is the number of keys I want in my dictionary so loop through rows
charity = sheet.cell(row=i + 1, column=1).value
area_of_work = []
org = []
funding = sheet.cell(row=i + 1, column=14).value
for x in range(8, 13): # this loops through columns with info I need
if sheet.cell(row=i +1, column=x).value !="":
area_of_work.append(sheet.cell(row=i +1, column=x).value)
for y in range(3, 6): # another column loop
if sheet.cell(row=i +1, column=y).value !="":
org.append(sheet.cell(row=i +1, column=y).value)
database[charity] = [area_of_work,org, funding]
try:
f = open("database.txt", "w")
f.close()
except IOError:
print("Ooops. It hasn't written to the file")
For those asking here is a screenshot of the exception:
(

Excel has a feature called Data Validation (in the Data Tools section of the Data tab in my version) where you can pick from a list of rules to limit the type of data that can be entered in a cell. This is sometimes used to create dropdown lists in Excel. This warning is telling you that this feature is not supported by openpyxl, and those rules will not be enforced. If you want the warning to go away, you can click on the Data Validation icon in Excel, then click the Clear All button to remove all data validation rules and save your workbook.

Sometimes simply clearing the Data Validation rules in the Workbook is not a viable solution - perhaps other users rely on the rules, or maybe they are locked for editing, etc.
The error can be ignored using a simple filter, and the Workbook can remain untouched, as:
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
In practice, this might look like:
import pandas as pd
import warnings
def load_data(path: str):
"""Load data from an Excel file."""
warnings.simplefilter(action='ignore', category=UserWarning)
return pd.read_excel(path)
Note:
Just remember to reset warnings, else all other UserWarnings will be ignored as well.

Thanks, for the screenshot! Without seeing the actual excel workbook it's hard to say exactly what it is complaining about.
If you notice the screenshot references line 322 of the reader worksheet module. It looks like it is telling you the data valadation extension to the OOXML standard is not supported by the openpyxl library. It's appears to be saying it found parts of the data valadation extension in your workbook and that will be lost when parsing the workbook with the openpyxl extention.

Related

openpyxl blocking excel file after first read

I am trying to overwrite a value in a given cell using openpyxl. I have two sheets. One is called Raw, it is populated by API calls. Second is Data that is fed off of Raw sheet. Two sheets have exactly identical shape (cols/rows). I am doing a comparison of the two to see if there is a bay assignment in Raw. If there is - grab it to Data sheet. If both Raw and Data have the value in that column missing - then run a complex Algo (irrelevant for this question) to assign bay number based on logic.
I am having problems with rewriting Excel using openpyxl.
Here's example of my code.
data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
no_bay_res = data_df[data_df['Bay assignment'].isnull()].reset_index() #grab rows where there is no bay assignment in a specific column
book = load_workbook("Algo Build v23test.xlsx")
sheet = book["MondayData"]
for index, reservation in no_bay_res.iterrows():
idx = int(reservation['index'])
if pd.isna(raw_df.iloc[idx, 13]):
continue
else:
value = raw_df.iat[idx,13]
data_df.iloc[idx, 13] = value
sheet.cell(idx+2, 14).value = int(value)
book.save("Algo Build v23test.xlsx")
book.close()
print(value) #302
Now the problem is that it seems that book.close() is not working. Book is still callable in python. Now, it overwrites Excel totally fine. However, if I try to run these two lines again
data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
I am getting datasets full of NULL values, except for the value that was replaced. (attached the image).
However, if I open that Excel file manually from the folder and save it (CTRL+S) and try running the code again - it works properly. Weirdest problem.
I need to loop the code above for Monday-Sunday, so I need it to be able to read the data again without manually resaving the file.

Due to some reason, pandas will read all the formulas as NaN after the file been used in the script by openpyxl until the file has been opened, saved and closed. Here's the code that helps do that within the script. However, it is rather slow.
import xlwings as xl
def df_from_excel(path, sheet_name):
app = xl.App(visible=False)
book = app.books.open(path)
book.save()
app.kill()
return pd.read_excel(path, sheet_name)

I got the same problem, the only workaround I found is to terminate the excel.exe manually from taskmanager. After that everything went fine.

Read dbf with arcpy in PyCharm?

I have exported an ArcGIS Desktop 10.7 table into a dbf file.
Now I want to do some GIS calculation in standalone Python.
Therefore I have started a PyCharm project referencing the ArcGIS Python interpreter and hence am able to import arcpy into my main.py.
Problem is: I don't want to pip install other modules, but I don't know how to correctly read the dbf table with arcpy.
#encoding=utf-8
import arcpy
path=r"D:\test.dbf"
sc=arcpy.SearchCursor(path) # Does not work: IOError exception str() failed
tv=arcpy.mapping.TableView(path) # Does not work either: StandaloneObject invalid data source or table
The dbf file is correct, it can be read into ArcGIS.
Can someone please give me an idea, how to read the file standalone with arcpy?

Using pandas
Python from ArcMap comes with some modules. You can load the data into a pandas.DataFrame and work with this format. Pandas is well-documented and there is a lot of already asked question about it all over the web. It's also super easy to do groupby or table manipulations.
import pandas as pd
import arcpy
def read_arcpy_table(self, table, fields='*', null_value=None):
"""
Transform a table from ArcMap into a pandas.DataFrame object
table : Path the table
fields : Fields to load - '*' loads all fields
null_value : choose a value to replace null values
"""
fields_type = {f.name: f.type for f in arcpy.ListFields(table)}
if fields == '*':
fields = fields_type.keys()
fields = [f.name for f in arcpy.ListFields(table) if f.name in fields]
fields = [f for f in fields if f in fields_type and fields_type[f] != 'Geometry'] # Remove Geometry field if FeatureClass to avoid bug
# Transform in pd.Dataframe
np_array = arcpy.da.FeatureClassToNumPyArray(in_table=table,
field_names=fields,
skip_nulls=False,
null_value=null_value)
df = self.DataFrame(np_array)
return df
# Add the function into the loaded pandas module
pd.read_arcpy_table = types.MethodType(read_arcpy_table, pd)
df = pd.read_arcpy_table(table='path_to_your_table')
# Do whatever calculations need to be done
Using cursor
You can also use arcpy cursors and dict for simple calculation.
There are simple example on this page on how to use correctly cursors :
https://desktop.arcgis.com/fr/arcmap/10.3/analyze/arcpy-data-access/searchcursor-class.htm

My bad,
after reading the Using cursor approach, I figured out that using the
sc=arcpy.SearchCursor(path) # Does not work: IOError exception str() failed
approach was correct, but at the time around 3 AM, I was a little bit exhausted and missed the typo in the path that caused the error. Nevertheless, a more descriptive error message e.g. IOError could not open file rather than IOError exception str() failed would have solved my mistake as acrGIS newbie.. : /

How to return the PrintArea from Excel in Python

I'm trying to create a Python script (I'm using Python 3.7.3 with UTF-8 encoding on Windows 10 64-bit with Microsoft Office 365) that exports user selected worksheets to PDF, after the user has selected the Excel-files.
The Excel-files contain a lot of different settings for page setup and each worksheet in each Excel-file has a different page setup.
The task is therefore that I need to read all current variables regarding page setup to be able to assign them to the related variables for export.
The problem is when I'm trying to get Excel to return the current print area of the worksheet, which I can't figure out.
As far as I understand I need to be able to read the current print area, to be able to set it for the export.
The Excel-files are a mixture of ".xlxs" and ".xlsm".
I've tried using all kind of different methods from the Excel VBA documentation, but nothing has worked so far e.g. by adding ".Range" and ".Address" etc.
I've also tried the ".UsedRange", but there is no significant difference in the cells that I can search for and I can't format them in a specific way so I can't use this.
I've also tried using the "IgnorePrintAreas = False" variable in the "ExportAsFixedFormat"-function, but that didn't work either.
#This is some of the script.
#I've left out irrelevant parts (dialogboxes etc.) just to make it shorter
#Import pywin32 and open Excel and selected workbook.
import win32com.client as win32
excel = win32.gencache.EnsureDispatch("Excel.Application")
excel.Visible = False
wb = excel.Workbooks.Open(wb_path)
#Select the 1st worksheet in the workbook
#This is just used for testing
wb.Sheets([1]).Select()
#This is the line I can't get to work
ps_prar = wb.ActiveSheet.PageSetup.PrintArea
#This is just used to test if I get the print area
print(ps_prar)
#This is exporting the selected worksheet to PDF
wb.Sheets([1]).Select()
wb.ActiveSheet.ExportAsFixedFormat(0, pdf_path, Quality = 0, IncludeDocProperties = True, IgnorePrintAreas = False, OpenAfterPublish = True)
#This closes the workbook and the Excel-file (although Excel sometimes still exists in Task Manager
wb.Close()
wb = None
excel.Quit()
excel = None
If I leave the code as above and try and open a test Excel-file (.xlxs) with a small PrintArea (A1:H8) the print function just gives me a blank line.
If I add something to .PrintArea (as mentioned above) I get 1 of 2 errors:
"TypeError: 'str' object is not callable".
or
"ps_prar = wb.ActiveSheet.PageSetup.PrintArea.Range
AttributeError: 'str' object has no attribute 'Range'"
I'm hoping someone can help me in this matter - thanks, in advance.

try
wb = excel.Workbooks.OpenXML(wb_path)
insead of
wb = excel.Workbooks.Open(wb_path)
My problem was with a german version of ms-office. It works now. Check here https://social.msdn.microsoft.com/Forums/de-DE/3dce9f06-2262-4e22-a8ff-5c0d83166e73/excel-api-interne-namen?forum=officede

Is there a way for 2 scripts to write to the same file?

I have limited python experience, but determined to learn. I am trying to create a script that would write some data inputs to excel until stopped. It is very straightforward when a single person is using it but the problem is that 2 people will be using it at once.
I am thinking about making it simple and just having 2 exact same scripts running at the same time, but the problem comes in when the file is going to be saved. If I have two files being saved with the same name, one is going to overwrite the other and the data will be lost. Is there a way to have the scripts create files with different names without having to manually change the code? (This would eventually be scaled to up to 20 computers running it)
The loop looks like:
import xlwt
from xlwt import Workbook
wb = Workbook()
s1 = wb.add_sheet('Sheet 1')
data = []
while user != '0':
user = input('Scan ID Badge: ')
data.append(user)
order = input('Scan order: ')
data.append(order)
item = input('Scan item barcode: ')
data.append(item)
for i in range(len(data)):
s1.write(row,i,data[i])
wb.save('OrderData.xls')
data = []
row += 1

If you want to use a tabular form of data storage anyways, you could switch to a real database and on interval create an excel-like summary of the db file.

If you know all of the users using this script will be using machines with different network names, you could include the computer name in the XLS name:
import platform
filename = 'AssociateEfficiencyTemp-' + platform.node() + '.xls'
# ...
wb.save(filename)
(You can also use getpass.getuser() to (try and) get the username of the user running the script.)
You can then write another script that reads all of the separate files (glob.glob('AssociateEfficiencyTemp-*.xls') etc.) and combines them.
(I would suggest using another format than .xls for the intermediary files though, such as plain text files of JSON lines.)

Get formula from Excel cell with python xlrd

I have to port an algorithm from an Excel sheet to python code but I have to reverse engineer the algorithm from the Excel file.
The Excel sheet is quite complicated, it contains many cells in which there are formulas that refer to other cells (that can also contains a formula or a constant).
My idea is to analyze with a python script the sheet building a sort of table of dependencies between cells, that is:
A1 depends on B4,C5,E7 formula: "=sqrt(B4)+C5*E7"
A2 depends on B5,C6 formula: "=sin(B5)*C6"
...
The xlrd python module allows to read an XLS workbook but at the moment I can access to the value of a cell, not the formula.
For example, with the following code I can get simply the value of a cell:
import xlrd
#open the .xls file
xlsname="test.xls"
book = xlrd.open_workbook(xlsname)
#build a dictionary of the names->sheets of the book
sd={}
for s in book.sheets():
sd[s.name]=s
#obtain Sheet "Foglio 1" from sheet names dictionary
sheet=sd["Foglio 1"]
#print value of the cell J141
print sheet.cell(142,9)
Anyway, It seems to have no way to get the formul from the Cell object returned by the .cell(...) method.
In documentation they say that it is possible to get a string version of the formula (in english because there is no information about function name translation stored in the Excel file). They speak about formulas (expressions) in the Name and Operand classes, anyway I cannot understand how to get the instances of these classes by the Cell class instance that must contains them.
Could you suggest a code snippet that gets the formula text from a cell?

[Dis]claimer: I'm the author/maintainer of xlrd.
The documentation references to formula text are about "name" formulas; read the section "Named references, constants, formulas, and macros" near the start of the docs. These formulas are associated sheet-wide or book-wide to a name; they are not associated with individual cells. Examples: PI maps to =22/7, SALES maps to =Mktng!$A$2:$Z$99. The name-formula decompiler was written to support inspection of the simpler and/or commonly found usages of defined names.
Formulas in general are of several kinds: cell, shared, and array (all associated with a cell, directly or indirectly), name, data validation, and conditional formatting.
Decompiling general formulas from bytecode to text is a "work-in-progress", slowly. Note that supposing it were available, you would then need to parse the text formula to extract the cell references. Parsing Excel formulas correctly is not an easy job; as with HTML, using regexes looks easy but doesn't work. It would be better to extract the references directly from the formula bytecode.
Also note that cell-based formulas can refer to names, and name formulas can refer both to cells and to other names. So it would be necessary to extract both cell and name references from both cell-based and name formulas. It may be useful to you to have info on shared formulas available; otherwise having parsed the following:
B2 =A2
B3 =A3+B2
B4 =A4+B3
B5 =A5+B4
...
B60 =A60+B59
you would need to deduce the similarity between the B3:B60 formulas yourself.
In any case, none of the above is likely to be available any time soon -- xlrd priorities lie elsewhere.

Update: I have gone and implemented a little library to do exactly what you describe: extracting the cells & dependencies from an Excel spreadsheet and converting them to python code. Code is on github, patches welcome :)
Just to add that you can always interact with excel using win32com (not very fast but it works). This does allow you to get the formula. A tutorial can be found here [cached copy] and details can be found in this chapter [cached copy].
Essentially you just do:
app.ActiveWorkbook.ActiveSheet.Cells(r,c).Formula
As for building a table of cell dependencies, a tricky thing is parsing the excel expressions. If I remember correctly the Trace code you mentioned does not always do this correctly. The best I have seen is the algorithm by E. W. Bachtal, of which a python implementation is available which works well.

So I know this is a very old post, but I found a decent way of getting the formulas from all the sheets in a workbook as well as having the newly created workbook retain all the formatting.
First step is to save a copy of your .xlsx file as .xls
-- Use the .xls as the filename in the code below
Using Python 2.7
from lxml import etree
from StringIO import StringIO
import xlsxwriter
import subprocess
from xlrd import open_workbook
from xlutils.copy import copy
from xlsxwriter.utility import xl_cell_to_rowcol
import os
file_name = '<YOUR-FILE-HERE>'
dir_path = os.path.dirname(os.path.realpath(file_name))
subprocess.call(["unzip",str(file_name+"x"),"-d","file_xml"])
xml_sheet_names = dict()
with open_workbook(file_name,formatting_info=True) as rb:
wb = copy(rb)
workbook_names_list = rb.sheet_names()
for i,name in enumerate(workbook_names_list):
xml_sheet_names[name] = "sheet"+str(i+1)
sheet_formulas = dict()
for i, k in enumerate(workbook_names_list):
xmlFile = os.path.join(dir_path,"file_xml/xl/worksheets/{}.xml".format(xml_sheet_names[k]))
with open(xmlFile) as f:
xml = f.read()
tree = etree.parse(StringIO(xml))
context = etree.iterparse(StringIO(xml))
sheet_formulas[k] = dict()
for _, elem in context:
if elem.tag.split("}")[1]=='f':
cell_key = elem.getparent().get(key="r")
cell_formula = elem.text
sheet_formulas[k][cell_key] = str("="+cell_formula)
sheet_formulas
Structure of Dictionary 'sheet_formulas'
{'Worksheet_Name': {'A1_cell_reference':'cell_formula'}}
Example results:
{u'CY16': {'A1': '=Data!B5',
'B1': '=Data!B1',
'B10': '=IFERROR(Data!B12,"")',
'B11': '=IFERROR(SUM(B9:B10),"")',

It seems that it is impossible now to do what you want with xlrd. You can have a look at this post for the detailed description of why it is so difficult to implement the functionality you need.
Note that the developping team does a great job for support at the python-excel google group.

I know this post is a little late but there's one suggestion that hasn't been covered here. Cut all the entries from the worksheet and paste using paste special (OpenOffice). This will convert the formulas to numbers so there's no need for additional programming and this is a reasonable solution for small workbooks.

Ye! With win32com it's works for me.
import win32com.client
Excel = win32com.client.Dispatch("Excel.Application")
# python -m pip install pywin32
file=r'path Excel file'
wb = Excel.Workbooks.Open(file)
sheet = wb.ActiveSheet
#Get value
val = sheet.Cells(1,1).value
# Get Formula
sheet.Cells(6,2).Formula

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.