Openpyxl - Remove formatting from all sheets in an Excel file - python

I have files with a lot of inconsistent formatting, and I'm trying to write a function that removes all formatting from an xlsx file.
Some people here suggested using "cell.fill = PatternFill(fill_type=None)" to clear the formatting from a given cell.
import openpyxl as xl
from openpyxl.styles import PatternFill

path = r'C:\Desktop\Python\Openpyxl\Formatted.xlsx'
wb = xl.load_workbook(filename=path)

def removeFormatting(file):
    # file is the worksheet name
    ws = wb[file]
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is None:
                cell.fill = PatternFill(fill_type=None)
    wb.save(path)

for s in wb.sheetnames:
    removeFormatting(s)
But this won't change anything. If the cells are empty but colored, openpyxl still sees them as non-empty.
Following this post:
Openpyxl check for empty cell
"The problem with ws.max_column and ws.max_row is that it will count blank columns as well, thus defeating the purpose."
@bhaskar was right.
When I try to get the max column, I get the same value for every sheet as for the first sheet.
col = []
for sheet in wb.worksheets:
    col.append(sheet.max_column)
So even if the sheets have different dimensions, a cell with a background color or any other formatting is treated as a valid non-empty cell.
Does anyone know how to solve this?
Thanks in advance!

This function removes styles from all cells in a given worksheet (passed as ws object).
So you can open your file, iterate over all worksheets and apply this function to each one:
def removeFormatting(ws):
    # ws is not the worksheet name, but the worksheet object
    for row in ws.iter_rows():
        for cell in row:
            cell.style = 'Normal'
If you also want to check info about how to define and apply named styles, take a look here:
https://openpyxl.readthedocs.io/en/stable/styles.html#cell-styles-and-named-styles
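As a rough sketch of how this could be applied to a whole file: the wrapper strip_workbook and the helper data_bounds below are hypothetical names, not part of openpyxl. data_bounds is included because resetting styles does not by itself change what ws.max_row and ws.max_column report; scanning for cells that actually hold a value is one way to measure the real extent of the data. The sketch assumes a recent openpyxl in which cell.column is an integer.

```python
def remove_formatting(ws):
    """Reset every existing cell of a worksheet object to the 'Normal' style."""
    for row in ws.iter_rows():
        for cell in row:
            cell.style = 'Normal'


def data_bounds(ws):
    """Return (last_row, last_col) over cells that hold a value, or (0, 0)."""
    last_row = last_col = 0
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is not None:
                # Assumes cell.row and cell.column are 1-based integers
                last_row = max(last_row, cell.row)
                last_col = max(last_col, cell.column)
    return last_row, last_col


def strip_workbook(src_path, dst_path):
    """Hypothetical wrapper: clean every sheet of src_path and save to dst_path."""
    # Imported here so the two helpers above need no third-party package
    from openpyxl import load_workbook

    wb = load_workbook(src_path)
    for ws in wb.worksheets:
        remove_formatting(ws)
    wb.save(dst_path)  # save once, after all sheets are processed
```

Saving to a separate dst_path keeps the original file untouched in case the result is not what you expected.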

Related

openpyxl - How to retrieve multiple columns of one row from an Excel file?

I'm looking for the correct way to print cell values from one desired row, but I only want the cell values that are in specific columns. e.g. columns 'A','C','F' etc.
Below is the code I currently have, I get no errors with this but am at a loss with how to progress with my desired outcome.
path_file = 'Readfrom.xlsx'
spreadsheet = load_workbook(path_file)
spreadsheet = spreadsheet.active
desire_columns = spreadsheet['A']
disire_row = spreadsheet['5']
for cell in disire_row:
    print(cell.value)
from openpyxl import load_workbook

path_file = 'Readfrom.xlsx'
wb = load_workbook(path_file)  # Workbook object (variable renamed)
ws = wb.active  # Worksheet object (variable renamed)
desired_columns = ['A', 'C', 'F']  # Letters of the columns to read
desired_row = 5  # Number of the row to read
for column in desired_columns:  # Iterate over the column letters
    cell = ws[f"{column}{desired_row}"]  # Build the cell reference, e.g. "A5"
    print(cell.value)
Read through your pseudocode step by step and you will find the answer.
I have also renamed your variables; I hope you don't mind. Use common variable names as much as possible so that other programmers can easily understand your code. The spreadsheet variable is especially confusing, as it is first assigned the workbook and later the worksheet. Think about what happens if you later decide to save the workbook (remember to back up the Excel file), or if you have multiple sheets in the workbook.
Also read up on the "for loop" to further your study.
https://www.w3schools.com/python/python_for_loops.asp
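If you need this lookup in more than one place, it can be wrapped in a small helper. This is only a sketch: pick_from_row is a made-up name, and it assumes ws is an openpyxl worksheet (or anything that returns an object with a .value attribute when indexed with a reference such as "A5").

```python
def pick_from_row(ws, row, columns):
    """Map each column letter to the value stored in that column of `row`."""
    return {column: ws[f"{column}{row}"].value for column in columns}


# Possible usage with openpyxl (file name taken from the question):
# from openpyxl import load_workbook
# ws = load_workbook('Readfrom.xlsx').active
# print(pick_from_row(ws, 5, ['A', 'C', 'F']))
```

Returning a dict keeps each value attached to its column letter, which is easier to work with than a bare list of printed values.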
You can also read data from Excel using pandas:
import pandas as pd
df = pd.read_excel('tmp.xlsx', index_col=0)
# A cell value can be accessed this way (rowIndex is the zero-based row position)
print(df['ColumnName'].iloc[rowIndex])
# This can also be done in a loop.
for index, row in df.iterrows():
    print(row['ColumnName'])
Refer to this link for further details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

Using openpyxl in Python, create a new spreadsheet containing rows that have a specific substring

I have an Excel spreadsheet. I want the rows of the spreadsheet that contain the substring "sigma" to be put into a new spreadsheet. I am using openpyxl in Python. I am stuck because something is wrong with my code. Please help.
import openpyxl
file = "C:\\Users\\1847\\Desktop\\DTO_aligned_to_EC_9_7_21_version.xlsx"
wb = openpyxl.load_workbook(file, read_only=True)
ws = wb.active
w_new = openpyxl.Workbook()
w_new.save(C:\\Users\\1847\\Desktop\\DTO_aligned_to_EC_sigma.xlsx)
for row in ws.rows:
    for cell in row:
        if cell.value is not None and "sigma" in cell.value:
            wb_new.save.append()
Once I find the cells, I want to append the row that contains these cells into a new worksheet which I am calling "w_save"
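One possible way to do this is sketched below. The function names are invented for illustration and the paths are left as parameters. Two things differ from the attempt above: the substring test is applied only to text cells (numeric cells would raise a TypeError with "in"), and the new workbook is saved once, after all matching rows have been appended to its worksheet.

```python
def rows_containing(rows, needle):
    """Yield each row (a sequence of cell values) where a text cell contains needle."""
    for row in rows:
        if any(isinstance(value, str) and needle in value for value in row):
            yield row


def copy_matching_rows(src_path, dst_path, needle):
    """Hypothetical helper: copy matching rows of the active sheet to a new file."""
    # Imported here so rows_containing itself needs no third-party package
    import openpyxl

    wb = openpyxl.load_workbook(src_path, read_only=True)
    ws = wb.active
    wb_new = openpyxl.Workbook()
    ws_new = wb_new.active
    # values_only=True yields plain tuples of values instead of cell objects
    for row in rows_containing(ws.iter_rows(values_only=True), needle):
        ws_new.append(row)
    wb_new.save(dst_path)  # save once, after all rows have been appended
```

A call would look like copy_matching_rows(file, new_file, "sigma"), where new_file is whatever output path you choose.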

OpenPyXL set number_format for the whole column

I'm exporting some data to Excel and I've successfully implemented formatting each populated cell in a column when exporting into Excel file with this:
import openpyxl
from openpyxl.utils import get_column_letter
wb = openpyxl.Workbook()
ws = wb.active
# Add rows to worksheet
for row in data:
    ws.append(row)
# Disable formatting numbers in columns from column `D` onwards
# Need to do this manually for every cell
for col in range(3, ws.max_column+1):
    for cell in ws[get_column_letter(col)]:
        cell.number_format = '#'
# Export data to Excel file...
But this only formats the populated cells in each column. The other cells in the column still have General formatting.
How can I set all empty cells in these columns to # so that anyone who edits cells in these columns of the exported Excel file will not have problems entering, say, phone numbers as actual numbers?
For openpyxl you must always set the styles for every cell individually. If you set them for the column, then Excel will apply them when it creates new cells, but styles are always still applied to individual cells.
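A minimal sketch of the column-level setting mentioned above, assuming ws is an openpyxl worksheet whose column_dimensions entries accept a number_format (the helper name is made up). As noted, this only affects cells that Excel creates later; cells that are already populated still need their own format.

```python
def set_column_number_format(ws, letters, number_format):
    """Set a default number format on whole columns, given their letters."""
    for letter in letters:
        ws.column_dimensions[letter].number_format = number_format


# Possible usage with openpyxl:
# import openpyxl
# wb = openpyxl.Workbook()
# ws = wb.active
# set_column_number_format(ws, ['D', 'E'], '@')  # '@' is Excel's Text format
# wb.save('export.xlsx')  # hypothetical file name
```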
Because you iterate over the rows of those columns only up to the last populated row, those are the only cells that get reformatted. While you can't reformat a whole column this way, you can at least set the format up to a specific row:
last_cell = 100
for col in range(3, ws.max_column+1):
    for row in range(1, last_cell + 1):  # + 1 so that row last_cell is included
        # '#' shows digits only; use '@' if you want Excel's Text format
        ws.cell(column=col, row=row).number_format = '#'
This formats every cell in each column up to last_cell. It's not exactly what you need, but it's close enough.
Conditional formatting is a workaround for applying a number format to an entire column. To apply a thousands separator to a whole column, this worked for me:
from openpyxl.formatting.rule import Rule
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.styles.numbers import NumberFormat
diff_style = DifferentialStyle(numFmt = NumberFormat(numFmtId='4',formatCode='#,##0.00'))
rule1 = Rule(type="expression", dxf=diff_style)
rule1.formula = ["=NOT(ISBLANK($H2))"]  # column on which the thousands separator is to be applied
work_sheet.conditional_formatting.add("$H2:$H500001", rule1)  # provide a range of cells

Get column data by Column name and sheet name

Is there a way to access all rows in a column of a specific sheet using Python xlrd?
e.g:
workbook = xlrd.open_workbook('ESC data.xlsx', on_demand=True)
sheet = workbook.sheet['sheetname']
arrayofvalues = sheet['columnname']
Or do I have to create a dictionary myself?
The Excel file is pretty big, so I would love to avoid iterating over all the column names/sheets.
Yes, you are looking for the col_values() worksheet method. Instead of
arrayofvalues = sheet['columnname']
you need to do
arrayofvalues = sheet.col_values(columnindex)
where columnindex is the number of the column (counting from zero, so column A is index 0, column B is index 1, etc.). If you have a descriptive heading in the first row (or first few rows) you can give a second parameter that tells which row to start from (again, counting from zero). For example, if you have one header row, and thus want values starting in the second row, you could do
arrayofvalues = sheet.col_values(columnindex, 1)
Please check out the tutorial for a reasonably readable discussion of the xlrd package. (The official xlrd documentation is harder to read.)
Also note that (1) while you are free to use the name arrayofvalues, what you are really getting is a Python list, which technically isn't an array, and (2) the on_demand workbook parameter has no effect when working with .xlsx files, which means xlrd will attempt to load the entire workbook into memory regardless. (The on_demand feature works for .xls files.)
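Since col_values() takes an index rather than a name, looking a column up by its heading needs one extra step. The sketch below is one way to do it; column_by_header is a hypothetical helper, and it assumes the headings sit in a single row and that sheet offers the row_values() and col_values() methods of an xlrd sheet.

```python
def column_by_header(sheet, header, header_row=0):
    """Return the values below `header` in an xlrd-style sheet.

    Raises ValueError if no cell in the header row equals `header`.
    """
    headers = sheet.row_values(header_row)
    colx = headers.index(header)  # zero-based column index of the heading
    return sheet.col_values(colx, header_row + 1)


# Possible usage with xlrd (names taken from the question):
# import xlrd
# workbook = xlrd.open_workbook('ESC data.xlsx')
# sheet = workbook.sheet_by_name('sheetname')
# arrayofvalues = column_by_header(sheet, 'columnname')
```

Only the header row is scanned to find the index, so the other columns are not iterated over.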
This script transforms an xls file into a list of dictionaries,
where each dict in the list represents a row:
import xlrd
workbook = xlrd.open_workbook('esc_data.xlsx', on_demand=True)
worksheet = workbook.sheet_by_index(0)
first_row = []  # Header
for col in range(worksheet.ncols):
    first_row.append(worksheet.cell_value(0, col))
# Transform the worksheet into a list of dictionaries
data = []
for row in range(1, worksheet.nrows):
    elm = {}
    for col in range(worksheet.ncols):
        elm[first_row[col]] = worksheet.cell_value(row, col)
    data.append(elm)
print(data)

How to detect if a cell is empty when reading Excel files using the xlrd library?

I handle Excel files using the functions row_values and col_values:
import xlrd
workbook = xlrd.open_workbook( filename )
sheet_names = workbook.sheet_names()
for sheet_name in sheet_names:
    sheet = workbook.sheet_by_name( sheet_name )
    # ...
    row_values = sheet.row_values( rownum )
    # ...
    col_values = sheet.col_values( colnum )
For example, I get col_values as a list. What if there is an empty cell in some column? Say cell (1,1) is not empty, cell (1,2) is empty and cell (1,3) is not empty. How can I detect that cell (1,2) is empty?
Is it true that I get a list with an empty string as the value of an empty cell (for most well-known programs that generate Excel files)?
You could be explicit and check that sheet.cell_type(rowno, colno) in (xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_BLANK), but the docs state that the value will be u'' in those cases anyway.
Instead of using row_values, you could also use row(n), which returns a list of Cell objects that have .value and .ctype attributes.
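Both checks could be sketched roughly as follows. The function names are made up, and the numeric codes 0 and 6 are the values of xlrd.XL_CELL_EMPTY and xlrd.XL_CELL_BLANK, copied in only so the sketch has no import; with xlrd imported, pass the constants themselves.

```python
# Values of xlrd.XL_CELL_EMPTY and xlrd.XL_CELL_BLANK (assumed not to change)
EMPTY_TYPES = (0, 6)


def empty_columns_in_row(sheet, rowx, empty_types=EMPTY_TYPES):
    """Return the column indexes of empty or blank cells in row `rowx`."""
    return [colx for colx, cell in enumerate(sheet.row(rowx))
            if cell.ctype in empty_types]


def empty_positions(values):
    """Return the indexes of empty strings in a row_values/col_values list."""
    return [index for index, value in enumerate(values) if value == '']


# Possible usage with an xlrd sheet:
# print(empty_columns_in_row(sheet, 1))
# print(empty_positions(sheet.col_values(colnum)))
```

The type-based check separates a truly empty cell from a text cell that happens to hold an empty string, which the value-based check cannot do.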
