This is similar to my earlier question, getting formatting data in openpyxl. The only real difference is that now I really want to use the optimized workbook for the speed increase.
Basically I can't figure out how to retrieve formatting details when I use the optimized reader. Here's a toy sample, the comments explain what I'm seeing on the print statements. Am I doing something wrong? Is there a better way to retrieve formatting details?
Also, if anyone knows of a different Excel reader for Python that supports xlsx and retrieving formatting, I'm open to changing! (I've already tried xlrd, and while that does support xlsx in the newer builds, it doesn't yet support formatting.)
from openpyxl import Workbook
from openpyxl.reader.excel import load_workbook
from openpyxl.style import Color, Fill
#this is all setup
wb = Workbook()
dest_filename = 'c:\\temp\\test.xlsx'
ws = wb.worksheets[0]
ws.title = 'test'
ws.cell('A1').value = 'foo'
ws.cell('A1').style.font.bold = True
ws.cell('B1').value = 'bar'
ws.cell('B1').style.fill.fill_type = Fill.FILL_SOLID
ws.cell('B1').style.fill.start_color.index = Color.YELLOW
wb.save(filename = dest_filename )
#setup complete
book = load_workbook( filename = dest_filename, use_iterators = True )
sheet = book.get_sheet_by_name('test')
for row in sheet.iter_rows():
    for cell in row:
        print cell.coordinate
        print cell.internal_value
        print cell.style_id #returns different numbers here (1, and 2 in case anyone is interested)
        print sheet.get_style(cell.coordinate).font.bold #returns False for both
        print sheet.get_style(cell.coordinate).fill.fill_type #returns none for both
        print sheet.get_style(cell.coordinate).fill.start_color.index #returns FFFFFFFF (white I believe) for both
        print
import openpyxl
print openpyxl.__version__ #returns 1.6.2
The style_id appears to be the index at which the style information can be found in the workbook's shared_styles (book.shared_styles or sheet.parent.shared_styles).
In some workbooks this works flawlessly, but in other workbooks I've found that the style_id is greater than the length of shared_styles, which gives me "Out of range" exceptions when I try to access those styles.
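For readers on a current openpyxl release, here is a minimal sketch of reading formatting in read-only mode. It assumes Python 3 and a recent openpyxl in which read-only cells expose font and fill directly, so it does not apply to the 1.6.2 release used above. The helper names build_sample and read_formatting, and the temporary file path, are made up for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font, PatternFill


def build_sample(path):
    """Write a two-cell workbook: A1 bold, B1 with a solid yellow fill."""
    wb = Workbook()
    ws = wb.active
    ws.title = "test"
    ws["A1"] = "foo"
    ws["A1"].font = Font(bold=True)
    ws["B1"] = "bar"
    ws["B1"].fill = PatternFill(fill_type="solid", start_color="FFFFFF00")
    wb.save(path)


def read_formatting(path, sheet_name):
    """Return {coordinate: (bold, fill_type, fill_rgb)} using read-only mode."""
    book = load_workbook(path, read_only=True)
    try:
        found = {}
        for row in book[sheet_name].iter_rows():
            for cell in row:
                if cell.value is None:
                    continue  # skip empty placeholder cells
                fill = cell.fill
                found[cell.coordinate] = (
                    bool(cell.font.bold),
                    fill.fill_type,
                    fill.start_color.rgb,
                )
        return found
    finally:
        book.close()  # read-only workbooks keep the file open until closed


# Hypothetical location for the sample file.
sample_path = os.path.join(tempfile.mkdtemp(), "test.xlsx")
build_sample(sample_path)
formats = read_formatting(sample_path, "test")
```

Here formats maps each coordinate to a tuple, so formats["A1"][0] tells you whether A1 is bold and formats["B1"][1] gives the fill type of B1.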
Related
What I want is to use openpyxl to write a value I get from len() or dups() to an Excel cell.
Here are my imports:
import xlwings as xw
Here is the code:
#Load workbook
app = xw.App(visible = False)
wb = xw.Book(FilePath)
RawData_ws = wb.sheets['Raw Data']
Sheet1 = wb.sheets['Sheet 1']
RawData_ws['A1'] = (len(df.index))
Sheet1['B7'] = (len(df.index) - tot_dups)
RawData_ws['A2'] = (len(df.index)) #This one is after removing duplicate values
Tot_dups:
tot_dups = len(df.index)
I want the values of the different len() calls to be written to the specific cells.
So, I already found the solution.
Change:
RawData_ws['A1'] = (len(df.index))
For:
RawData_ws['A1'].value = (len(df.index))
Using openpyxl, I am trying to read data from an Excel-Workbook and write data to this same Excel-Workbook. Getting data from the Excel-Workbook works fine, but writing data into the Excel-Workbook does not work. With the code below I get the value from Cell A1 in Sheet1 and print it. Then I try to put some values into the cells A2 and A3. This does not work.
from openpyxl import Workbook
from openpyxl import load_workbook
wb = load_workbook("testexcel.xlsm")
ws1 = wb.get_sheet_by_name("Sheet1")
#This works:
print ws1.cell(row=1, column=1).value
#This doesn't work:
ws1['A2'] = "SomeValue1"
#This doesn't work either:
ws1.cell(row=3, column=1).value = "SomeValue2"
I am sure the code is correct ... What is going wrong here?
I believe you are missing a save function. Try adding the additional line below.
from openpyxl import Workbook
from openpyxl import load_workbook
wb = load_workbook("testexcel.xlsm")
ws1 = wb.get_sheet_by_name("Sheet1")
#This works:
print ws1.cell(row=1, column=1).value
#This doesn't work:
ws1['A2'] = "SomeValue1"
#This doesn't work either:
ws1.cell(row=3, column=1).value = "SomeValue2"
#Add this line
wb.save("testexcel.xlsm")
Use this to write a value:
ws1.cell(row=1, column=1,value='Hey')
On the other hand, the following will read the value:
ws1.cell(row=1, column=1).value
While saving the workbook, try giving the full path.
For example: wb1.save(filename=r"C:\Users\7000027842\Downloads\test.xlsx")
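The answers above can be put together in one short sketch. It assumes Python 3 and a current openpyxl release, and it builds a throwaway .xlsx in a temporary folder to stand in for the existing workbook; the path and the extra "SomeValue3" cell are illustrative only.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Stand-in for an existing file; the path and sheet name are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "testexcel.xlsx")
seed = Workbook()
seed.active.title = "Sheet1"
seed.active["A1"] = "original"
seed.save(path)

wb = load_workbook(path)
ws1 = wb["Sheet1"]
ws1["A2"] = "SomeValue1"                        # assignment by coordinate
ws1.cell(row=3, column=1).value = "SomeValue2"  # assignment via .value
ws1.cell(row=4, column=1, value="SomeValue3")   # value passed to cell()
wb.save(path)                                   # nothing reaches disk before this

# Reload from disk to confirm that the edits were persisted.
saved = [row[0] for row in load_workbook(path)["Sheet1"].iter_rows(values_only=True)]
```

All three ways of assigning a value only change the workbook in memory; it is the save() call that writes them out. For a macro-enabled .xlsm file like the one in the question, load_workbook also accepts keep_vba=True, which is meant to preserve the macros when the file is saved again.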
I am hoping you can help me. I'm sure it's likely a small thing to fix, when one knows how.
In my workshop, neither I nor my colleagues can make 'find and replace all' changes via the front-end of our database. The boss just denies us that level of access. If we need to make changes to dozens or perhaps hundreds of records it must all be done by copy-and-paste or similar means. Craziness.
I am trying to make a workaround to that with Python 2 and in particular libraries such as Pandas, pyautogui and xlrd.
I have researched several StackOverflow threads and have managed thus far to write some code that works well at reading a given XL file. In production, this will be a file exported from a found data set in the database GUI front-end and will be just a single column of 'Article Numbers' for the items in the computer workshop. This will always have an Excel column header, e.g.
ANR
51234
34567
12345
...
All the record numbers are five-digit numbers.
We also have the means of scanning items with an IR scanner to a 'Workflow' app on the iPad we have and automatically making an XL file out of that list of scanned items.
The XL file here could look something similar to this.
56788
12345
89012
...
It differs in that there is no column header. All XL files have their data 'anchored' at cell A1 on 'Sheet1', and again just a single column will be used. No unnecessary complications here!
Here is the script anyway. When it is fully working, system arguments will be supplied to it. For now, let's pretend that we need to change records so that their 'RAM' value goes from "2GB" to "2 GB".
import xlrd
import string
import re
import pandas as pd
field = "RAM"
value = "2 GB"
myFile = "/Users/me/folder/testArticles.xlsx"
df = pd.read_excel(myFile)
myRegex = "^[0-9]{5}$"
# data collection and putting into lists.
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
formatted = []
deDuped = []
# removing any possible XL headers, setting all values to strings
# that look like five-digit ints, apply a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')
    # remove the decimal point
    # Searching for the header will cause a database front-end problem.
    cellValue = cellValue[:-2]
    cellValue = cellValue.translate(None, string.letters)
    # making sure only valid article numbers get through
    # blank rows etc can take a hike
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)
# weeding out any possible dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)
#main code block
for i in deDuped:
    #lots going on here involving pyautogui
    #making sure of no error running searches, checking for warnings, moving/tabbing around DB front-end etc
    #if all goes to plan
    #removing that record number from the excel file and saving the change
    #so that if we run the script again for the same XL file
    #we don't needlessly update an already OK record again.
    df = df[~df['ANR'].astype(str).str.startswith(i)]
    df.to_excel(myFile, index=False)
What I would really like to find out is how I can run the script so that it "doesn't care" about the presence or absence of the column header.
df = df[~df['ANR'].astype(str).str.startswith(i)]
This appears to be the line of code that it all hangs on. I've made several changes to the line in different combinations, but my script always crashes.
If a column header ("ANR" in my case) is essential for this particular pandas method, is there a straightforward way of inserting a column header into an XL file that lacks one in the first place, i.e. the XL files that come from the IR scanner and the 'Workflow' app on the iPad?
Thanks guys!
UPDATE
As suggested by Patrick, I've tried implementing some code to check whether cell "A1" has a header or not. Partial success: I can put "ANR" in cell A1 if it's missing, but I lose whatever was there in the first place.
import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy
import openpyxl
# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
cell_a1 = sheet.cell_value(rowx=0, colx=0)
if cell_a1 == "ANR":
    print "has header"
else:
    wb = openpyxl.load_workbook(filename= myFile)
    ws = wb['Sheet1']
    ws['A1'] = "ANR"
    wb.save(myFile)
#re-open XL file again etc etc.
I found this new block of code over at writing to existing workbook using xlwt. In this instance the contributor actually used openpyxl.
I think I got it fixed for myself.
Still a tiny bit messy but seems to be working. Added an 'if/else' clause to check the value of cell A1 and to take action accordingly. Found most of the code for this at how to append data using openpyxl python to excel file from a specified row? - using the suggestion for openpyxl
import pyperclip
import xlrd
import pyautogui
import string
import re
import os
import pandas as pd
import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy
field = "RAM"
value = "2 GB"
myFile = "/Users/me/testSerials.xlsx"
df = pd.read_excel(myFile)
myRegex = "^[0-9]{5}$"
# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
cell_a1 = sheet.cell_value(rowx=0, colx=0)
if cell_a1 == "ANR":
    print "has header"
else:
    headers = ['ANR']
    # use the path held in myFile, not the literal string 'myFile'
    workbook_name = myFile
    wb = Workbook()
    page = wb.active
    # page.title = 'companies'
    page.append(headers) # write the headers to the first line
    workbook = xlrd.open_workbook(workbook_name)
    sheet = workbook.sheet_by_index(0)
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
    for records in data:
        page.append(records)
    wb.save(filename=workbook_name)
#then load the data all over again, this time with inserted header
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
formatted = []
deDuped = []
# removing any possible XL headers, setting all values to strings that look like five-digit ints, apply a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')
    # remove the decimal point
    cellValue = cellValue[:-2]
    # cellValue = cellValue.translate(None, ".0")
    cellValue = cellValue.translate(None, string.letters)
    # making sure any valid ANRs get through
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)
# ------------------------------------------
# weeding out any possible dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)
# ref - https://stackoverflow.com/questions/48942743/python-pandas-to-remove-rows-in-excel
df = pd.read_excel(myFile)
print df
for i in deDuped:
    #pyautogui code is run here...
    #if all goes to plan update the XL file
    df = df[~df['ANR'].astype(str).str.startswith(i)]
    df.to_excel(myFile, index=False)
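As an alternative to rewriting the file to add a header, the cleaning step itself can be made indifferent to whether a header is present. The sketch below is not the code from the post: it assumes Python 3, uses only the standard library, and the function name clean_article_numbers is made up for illustration. It takes the raw values of the single column (as xlrd would return them, with numbers as floats) and keeps only unique five-digit article numbers, so a header such as "ANR" is dropped just like any other non-matching value.

```python
import re

# Same pattern as myRegex in the script above: exactly five digits.
ARTICLE_RE = re.compile(r"^[0-9]{5}$")


def clean_article_numbers(column_values):
    """Return unique five-digit article numbers as strings, in first-seen order."""
    seen = set()
    cleaned = []
    for raw in column_values:
        # Numeric cells typically arrive as floats (51234.0); make them ints first
        # so that no ".0" suffix has to be sliced off afterwards.
        if isinstance(raw, float) and raw.is_integer():
            raw = int(raw)
        text = str(raw).strip()
        # Headers, blanks and anything else that is not five digits are skipped.
        if ARTICLE_RE.match(text) and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


with_header = clean_article_numbers(["ANR", 51234.0, 34567.0, "", 12345.0, 51234.0])
without_header = clean_article_numbers([56788.0, 12345.0, 89012.0, 123.0])
```

Because the regex check rejects the header, the same call works for the database export (with "ANR" in A1) and for the scanner file (without it).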
Why is openpyxl reading every row and column dimension as None? This is the case regardless of whether the table was created via openpyxl or within Microsoft Excel.
import openpyxl
wb = openpyxl.load_workbook(r'C:\data\MyTable.xlsx')
ws = wb.active
print ws.row_dimensions[1].height
print ws.column_dimensions['A'].width
This prints None and None. These aren't hidden columns/rows. They clearly have dimensions when viewed in Excel.
I know that loading the workbook with iterators will prevent the dimension dictionaries from being created, but that results in key errors, and I'm not using iterators here.
Is there an alternative way to determine the width/height of a cell/row/column?
===============SOLUTION=================
Thanks to Charlie, I realized that the following is the best way to get a list of all row heights:
import openpyxl
wb = openpyxl.load_workbook(r'C:\Data\Test.xlsx')
ws = wb.active
rowHeights = [ws.row_dimensions[i+1].height for i in range(ws.max_row)]
rowHeights = [15 if rh is None else rh for rh in rowHeights]
RowDimension and ColumnDimension objects exist only when the defaults are to be overwritten. So ws.row_dimensions[1].height will always be None until it is assigned a value.
The default values are: {'defaultRowHeight': '15', 'baseColWidth': '10'}
With openpyxl 3.0.4, the source code (from openpyxl.worksheet.dimensions import SheetFormatProperties) shows the defaults as {'defaultRowHeight': 15, 'baseColWidth': 8}:
# dimensions.py
class SheetFormatProperties(Serialisable):
    ...
    def __init__(self,
                 baseColWidth=8, # <-----------------
                 defaultColWidth=None,
                 defaultRowHeight=15, # <------------
                 customHeight=None,
                 zeroHeight=None,
                 thickTop=None,
                 thickBottom=None,
                 outlineLevelRow=None,
                 outlineLevelCol=None,
                ):
        self.baseColWidth = baseColWidth
        self.defaultColWidth = defaultColWidth
        self.defaultRowHeight = defaultRowHeight
        self.customHeight = customHeight
        self.zeroHeight = zeroHeight
        ...
An example:
from openpyxl import load_workbook
from openpyxl.worksheet.dimensions import SheetFormatProperties
from openpyxl.worksheet.worksheet import Worksheet
wb = load_workbook('xxx.xlsx')
for sheet in [wb[sheet_name] for sheet_name in wb.sheetnames]:
    sheet: Worksheet
    sheet_prop: SheetFormatProperties = sheet.sheet_format
    default_width = sheet_prop.baseColWidth
    default_height = sheet_prop.defaultRowHeight
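The two answers can be combined: take the explicit row height where one has been set, and otherwise fall back to the sheet's own default instead of a hard-coded 15. This is a minimal sketch assuming Python 3 and a current openpyxl release; the helper name effective_row_heights is made up, and the workbook is built in memory purely for illustration.

```python
from openpyxl import Workbook


def effective_row_heights(ws):
    """Row heights for rows 1..max_row, using the sheet default where unset."""
    default = ws.sheet_format.defaultRowHeight
    heights = []
    for index in range(1, ws.max_row + 1):
        explicit = ws.row_dimensions[index].height
        heights.append(default if explicit is None else explicit)
    return heights


wb = Workbook()
ws = wb.active
ws["A3"] = "data"                  # makes max_row == 3
ws.row_dimensions[2].height = 30   # only row 2 overrides the default
heights = effective_row_heights(ws)
```

In this example only the second entry of heights comes from an explicit setting; the other two come from the sheet format default.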
I'm having an issue with saving an Excel file in openpyxl.
I'm trying to create a processing script which would grab data from one Excel file and dump it into a dump Excel file, so that after some tweaking around with formulas in Excel, I will have all of the processed data in the dump Excel file. My current code is as follows.
from openpyxl import load_workbook
import os
import datetime
from openpyxl.cell import get_column_letter, Cell, column_index_from_string, coordinate_from_string
dump = dumplocation
desktop = desktoplocation
date = datetime.datetime.now().strftime("%Y-%m-%d")
excel = load_workbook(dump+date+ ".xlsx", use_iterators = True)
sheet = excel.get_sheet_by_name("Sheet1")
try:
    query = raw_input('How many rows of data is there?\n')
except ValueError:
    print 'Not a number'
#sheetname = raw_input('What is the name of the worksheet in the data?\n')
for filename in os.listdir(desktop):
    if filename.endswith(".xlsx"):
        print filename
        data = load_workbook(filename, use_iterators = True)
        ws = data.get_sheet_by_name(name = '17270115')
        #copying data from excel to data excel
        n=16
        for row in sheet.iter_rows():
            for cell in row:
                for rows in ws.iter_rows():
                    for cells in row:
                        n+=1
                        if (n>=17) and (n<=32):
                            cell.internal_value = cells.internal_value
#adding column between time in UTC and the data
column_index = 1
new_cells = {}
sheet.column_dimensions = {}
for coordinate, cell in sheet._cells.iteritems():
    column_letter, row = coordinate_from_string(coordinate)
    column = column_index_from_string(column_letter)
    # shifting columns
    if column >= column_index:
        column += 1
    column_letter = get_column_letter(column)
    coordinate = '%s%s' % (column_letter, row)
    # it's important to create new Cell object
    new_cells[coordinate] = Cell(sheet, column_letter, row, cell.value)
sheet.cells = new_cells
#setting columns to be hidden
for coordinate, cell in sheet._cells.iteritems():
    column_letter, row = coordinate_from_string(coordinate)
    column = column_index_from_string(column_letter)
    if (column<=3) and (column>=18):
        column.set_column(column, options={'hidden': True})
A lot of my code is messy I know since I just started Python two or three weeks ago. I also have a few outstanding issues which I can deal with later on.
It doesn't seem like a lot of people are using openpyxl for my purposes.
I tried using the normal Workbook module but that didn't seem to work because you can't iterate in the cell items. (which is required for me to copy and paste relevant data from one excel file to another)
UPDATE: I realised that openpyxl can only create workbooks but can't edit current ones. So I have decided to change tack and edit the new workbook after I have transferred the data into it. I have resorted to going back to Workbook to transfer the data:
from openpyxl import Workbook
from openpyxl import worksheet
from openpyxl import load_workbook
import os
from openpyxl.cell import get_column_letter, Cell, column_index_from_string, coordinate_from_string
dump = "c:/users/y.lai/desktop/data/201501.xlsx"
desktop = "c:/users/y.lai/desktop/"
excel = Workbook()
sheet = excel.add_sheet
try:
    query = raw_input('How many rows of data is there?\n')
except ValueError:
    print 'Not a number'
#sheetname = raw_input('What is the name of the worksheet in the data?\n')
for filename in os.listdir(desktop):
    if filename.endswith(".xlsx"):
        print filename
        data = load_workbook(filename, use_iterators = True)
        ws = data.get_sheet_by_name(name = '17270115')
        #copying data from excel to data excel
        n=16
        q=0
        for x in range(6,int(query)):
            for s in range(65,90):
                for cell in Cell(sheet,chr(s),x):
                    for rows in ws.iter_rows():
                        for cells in rows:
                            q+=1
                            if q>=5:
                                n+=1
                                if (n>=17) and (n<=32):
                                    cell.value = cells.internal_value
But this still doesn't seem to work:
Traceback (most recent call last):
  File "xxx\Desktop\xlspostprocessing.py", line 40, in <module>
    for cell in Cell(sheet,chr(s),x):
  File "xxx\AppData\Local\Continuum\Anaconda\lib\site-packages\openpyxl\cell.py", line 181, in __init__
    self._shared_date = SharedDate(base_date=worksheet.parent.excel_base_date)
AttributeError: 'function' object has no attribute 'parent'
I went through the API, but I'm overwhelmed by the code in there, so I couldn't make much sense of it. To me it looks like I have used the Cell module wrongly. I read the definition of Cell and its attributes, which is why I use chr(s) to produce the letters of the alphabet from A onwards.
You can iterate using the standard Workbook mode. use_iterators=True has been renamed read_only=True to emphasise what this mode is used for (on demand reading of parts).
Your code as it stands cannot work with this method as the workbook is read-only and cell.internal_value is always a read only property.
However, it looks like you're not getting that far because there is a problem with your Excel files. You might want to submit a bug with one of the files. Also the mailing list might be a better place for discussion.
You could try using xlrd and xlwt instead of openpyxl, but you might find that exactly what you are looking to do is already available in xlutils. All three are from python-excel.
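To illustrate the point about standard mode, here is a minimal sketch of copying a block of values from one worksheet to another and hiding a column, without read-only mode and without touching internal attributes. It assumes Python 3 and a recent openpyxl release in which cell.column is an integer; the helper name copy_block, the sample data and the column offset are all made up for illustration and are not taken from the question's files.

```python
from openpyxl import Workbook


def copy_block(src_ws, dst_ws, min_row, max_row, min_col, max_col, col_offset=0):
    """Copy values from a block of src_ws into dst_ws, shifted right by col_offset."""
    rows = src_ws.iter_rows(min_row=min_row, max_row=max_row,
                            min_col=min_col, max_col=max_col)
    for row in rows:
        for cell in row:
            dst_ws.cell(row=cell.row, column=cell.column + col_offset,
                        value=cell.value)


# In-memory stand-ins for the source data file and the dump file.
source = Workbook().active
for values in (["time", 1, 2], ["t0", 10, 20], ["t1", 11, 21]):
    source.append(values)

dump_wb = Workbook()
dump = dump_wb.active
# Shift everything one column to the right, leaving column A free.
copy_block(source, dump, min_row=2, max_row=3, min_col=1, max_col=3, col_offset=1)
dump.column_dimensions["A"].hidden = True   # hide a column via its dimension
copied = [[c.value for c in row] for row in dump.iter_rows(min_row=2, max_row=3)]
```

Writing with an offset while copying avoids having to rebuild the cell dictionary afterwards to make room for an extra column. With real files, the source would come from load_workbook and the result would be written with dump_wb.save().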