Get column data by Column name and sheet name

Is there a way to access all rows of a column in a specific sheet using Python's xlrd?
For example:
workbook = xlrd.open_workbook('ESC data.xlsx', on_demand=True)
sheet = workbook.sheet['sheetname']
arrayofvalues = sheet['columnname']
Or do I have to build a dictionary myself?
The Excel file is pretty big, so I would love to avoid iterating over all the column names and sheets.

Yes, you are looking for the col_values() worksheet method. Instead of
arrayofvalues = sheet['columnname']
you need to do
arrayofvalues = sheet.col_values(columnindex)
where columnindex is the number of the column (counting from zero, so column A is index 0, column B is index 1, etc.). If you have a descriptive heading in the first row (or first few rows) you can give a second parameter that tells which row to start from (again, counting from zero). For example, if you have one header row, and thus want values starting in the second row, you could do
arrayofvalues = sheet.col_values(columnindex, 1)
Please check out the tutorial for a reasonably readable discussion of the xlrd package. (The official xlrd documentation is harder to read.)
Also note that (1) while you are free to use the name arrayofvalues, what you are really getting is a Python list, which technically isn't an array, and (2) the on_demand workbook parameter has no effect when working with .xlsx files, which means xlrd will attempt to load the entire workbook into memory regardless. (The on_demand feature works for .xls files.)
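To look a column up by its header text rather than by index, one option is to read the header row once and translate the name into an index for col_values(). The sketch below assumes the headers sit in the first row; column_by_header and read_column are hypothetical helper names, not part of xlrd. Also note that xlrd 2.0 and later reads only .xls files, so an .xlsx workbook would need an older xlrd or another library.

```python
def column_by_header(sheet, header, header_row=0):
    """Return the values below `header` in an xlrd-style sheet.

    `sheet` only needs the xlrd Sheet methods row_values() and col_values().
    Assumes the header text appears in row `header_row` (0-based).
    """
    headers = sheet.row_values(header_row)
    try:
        col_index = headers.index(header)
    except ValueError:
        raise KeyError(f"no column named {header!r}") from None
    # Start one row below the header so the header itself is not returned
    return sheet.col_values(col_index, header_row + 1)


def read_column(path, sheet_name, header):
    """Open a workbook with xlrd and read one named column from one sheet."""
    import xlrd  # imported lazily; xlrd 2.x reads only .xls files

    workbook = xlrd.open_workbook(path, on_demand=True)
    sheet = workbook.sheet_by_name(sheet_name)
    return column_by_header(sheet, header)
```

With this in place, read_column('ESC data.xls', 'sheetname', 'columnname') would return a plain Python list of the cell values under that header.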

This script transforms an Excel file into a list of dictionaries,
where each dictionary in the list represents one row:
import xlrd
workbook = xlrd.open_workbook('esc_data.xlsx', on_demand=True)
worksheet = workbook.sheet_by_index(0)
first_row = []  # Header
for col in range(worksheet.ncols):
    first_row.append(worksheet.cell_value(0, col))
# Transform the worksheet into a list of dictionaries
data = []
for row in range(1, worksheet.nrows):
    elm = {}
    for col in range(worksheet.ncols):
        elm[first_row[col]] = worksheet.cell_value(row, col)
    data.append(elm)
print(data)

Related

openpyxl - How to retrieve multiple columns of one row from an Excel file?

I'm looking for the correct way to print cell values from one desired row, but I only want the cell values that are in specific columns. e.g. columns 'A','C','F' etc.
Below is the code I currently have. I get no errors with it, but I am at a loss as to how to get my desired outcome.
path_file = 'Readfrom.xlsx'
spreadsheet = load_workbook(path_file)
spreadsheet = spreadsheet.active
desire_columns = spreadsheet['A']
disire_row = spreadsheet['5']
for cell in disire_row:
    print(cell.value)

from openpyxl import load_workbook
path_file = 'Readfrom.xlsx'
wb = load_workbook(path_file)  # Changed variable name
ws = wb.active  # Changed variable name
desired_columns = ['A', 'C', 'F']  # Letters of the columns to read
desired_row = 5  # Number of the row to read
for column in desired_columns:  # Iterate over the column letters
    cell = ws[f"{column}{desired_row}"]  # Build the cell reference, e.g. "A5"
    print(cell.value)
Read through your pseudocode step by step and the answer becomes clear.
I have also corrected your variable names; I hope you don't mind. Use conventional variable names as much as possible so that other programmers can easily understand your code. The spreadsheet variable is especially confusing, as it is first assigned the workbook and then reassigned to a worksheet. Think about what happens if you later decide to save the workbook (remember to back up the Excel file), or if you have multiple sheets in the workbook.
Also read "for loop" to further your study.
https://www.w3schools.com/python/python_for_loops.asp
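As a small sketch of the same idea in reusable form, the helper below collects the values of chosen columns from one row into a list. cells_in_row and read_row are hypothetical names; the only openpyxl behavior relied on is that indexing a worksheet with a reference such as "A5" returns a cell with a .value attribute.

```python
def cells_in_row(ws, row, columns):
    """Return the values found at the given column letters in one row.

    `ws` is expected to behave like an openpyxl worksheet, where
    ws["A5"] returns a cell object exposing .value.
    """
    return [ws[f"{column}{row}"].value for column in columns]


def read_row(path, row, columns):
    """Load the active sheet of a workbook and read the selected cells."""
    from openpyxl import load_workbook  # imported lazily, only needed here

    wb = load_workbook(path)
    return cells_in_row(wb.active, row, columns)
```

For the question above, read_row('Readfrom.xlsx', 5, ['A', 'C', 'F']) would return the three values from row 5; empty cells come back as None.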

You can read data from Excel using pandas:
import pandas as pd
df = pd.read_excel('tmp.xlsx', index_col=0)
# A cell value can be accessed this way
print(df['ColumnName'].iloc[rowIndex])
# This can also be done in a loop
for index, row in df.iterrows():
    print(row['ColumnName'])
Refer to this link for further details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

How to extract all the column names from the required table with Excel (or maybe Python)

Now I have table1 as below:
A       B       C
First   Second  Second
Second  Third   Forth
I now want to output a list like this (the first column is already known):
First   A
Second  A  B  C
Third   B
Forth   C
How can I output this using Excel (or Python, if that is much easier)?

While I prefer GordonAitchJay's answer, as it uses an existing library, most (if not all) spreadsheet editors can export sheets in .csv format (comma-separated values). From there you could read the first line, which contains the column names, and split it on commas.
Off the top of my head, it should look something like this:
columnNames = None
with open(filePath, "r", encoding="utf-8") as contents:
    # Read only the first line, drop the trailing newline, and split on commas
    columnNames = contents.readline().rstrip("\n").split(",")
# Do your magic.
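Splitting on commas breaks when a column name itself contains a comma inside quotes. A slightly more robust sketch uses the standard library's csv module to parse the header row; read_column_names is a hypothetical helper name and the UTF-8 default is an assumption about the exported file.

```python
import csv


def read_column_names(file_path, encoding="utf-8"):
    """Return the header row of a CSV file as a list of column names."""
    # newline="" lets the csv module handle line endings itself
    with open(file_path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        # next() with a default returns [] for an empty file
        return next(reader, [])
```

Only the first record is read, so the rest of a large export is never loaded into memory.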

This can be done with Python and the openpyxl package. The code below reads an existing spreadsheet, table1.xlsx, and creates a new spreadsheet, table1_columns.xlsx, listing the unique values and the columns they were found in. With your input, the code produces your output.
import os
from collections import defaultdict
import openpyxl
input_filename = r"table1.xlsx"
output_filename = r"table1_columns.xlsx"
# Read the input spreadsheet file
wb = openpyxl.load_workbook(input_filename)
# Use the first spreadsheet in the workbook
ws = wb.worksheets[0]
# Iterate over every cell in each column, and read the values into a dict
# which has a default value of an empty set.
# The keys are the cell values, e.g. "First"
# The values are sets of column letters the cell value was found in, e.g. {"A"}
values = defaultdict(set)
for cells in ws.iter_cols():
    for cell in cells:
        values[cell.value].add(cell.column_letter)
# Create a new spreadsheet
wb = openpyxl.Workbook()
ws = wb.active
# Iterate over the dict, writing each value and its column letters to a new row
for row_num, (value, column_letters) in enumerate(values.items(), start=1):
    # The first column is always the value, e.g. "First"
    ws.cell(row=row_num, column=1).value = value
    # As sets are an unordered data type, it first needs to be sorted. If not,
    # the second row could be "Second B A C" instead of "Second A B C"
    column_letters = sorted(column_letters)
    for column_num, column_letter in enumerate(column_letters, start=2):
        ws.cell(row=row_num, column=column_num).value = column_letter
# Save the spreadsheet to disk
wb.save(output_filename)
# Launch the new spreadsheet (os.startfile is available on Windows only)
os.startfile(output_filename)

Openpyxl - Remove formatting from all sheets in an Excel file

I have files with a lot of weird formatting, and I'm trying to create a function that removes all formatting from an xlsx file.
Some people here suggested using "cell.fill = PatternFill(fill_type=None)" to clear the formatting from a given cell.
import openpyxl as xl
from openpyxl.styles import PatternFill
path = r'C:\Desktop\Python\Openpyxl\Formatted.xlsx'
wb = xl.load_workbook(filename=path)
def removeFormatting(file):
    ws = wb[file]
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is None:
                cell.fill = PatternFill(fill_type=None)
    wb.save(path)
for s in wb.sheetnames:
    removeFormatting(s)
But this doesn't change anything. If the cells are empty but colored, openpyxl still sees them as non-empty.
Following this post:
Openpyxl check for empty cell
"The problem with ws.max_column and ws.max_row is that it will count blank columns as well, thus defeating the purpose."
@bhaskar was right.
When I try to get the max column, I get the same value for all the sheets as for the first sheet.
col = []
for sheet in wb.worksheets:
    col.append(sheet.max_column)
So even if the sheets have different dimensions, a cell with a background color or any other formatting is treated as a valid non-empty cell.
Does anyone know how to solve this?
Thanks in advance!

This function removes styles from all cells in a given worksheet (passed as a ws object).
So you can open your file, iterate over all worksheets, and apply this function to each one:
def removeFormatting(ws):
    # ws is not the worksheet name, but the worksheet object
    for row in ws.iter_rows():
        for cell in row:
            cell.style = 'Normal'
If you also want to check how to define and apply named styles, take a look here:
https://openpyxl.readthedocs.io/en/stable/styles.html#cell-styles-and-named-styles
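The question also notes that max_row and max_column count cells that are formatted but empty. One way around that, sketched here under assumptions, is to scan for the last cell that actually holds a value. used_bounds is a hypothetical helper; it assumes cells expose .value, .row and a numeric .column, as they do in openpyxl 2.6 and later.

```python
def used_bounds(ws):
    """Return (last_row, last_column) of the cells that hold a value.

    `ws` is expected to behave like an openpyxl worksheet: iter_rows()
    yields rows of cells exposing .value, .row and .column (1-based ints).
    Returns (0, 0) when no cell has a value.
    """
    last_row = last_col = 0
    for row in ws.iter_rows():
        for cell in row:
            # Styled-but-empty cells have value None and are ignored
            if cell.value is not None:
                last_row = max(last_row, cell.row)
                last_col = max(last_col, cell.column)
    return last_row, last_col
```

This visits every cell once, so on very large sheets it is slower than reading ws.max_column, but the result reflects content rather than formatting.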

OpenPyXL set number_format for the whole column

I'm exporting some data to Excel and I've successfully implemented formatting each populated cell in a column when exporting into Excel file with this:
import openpyxl
from openpyxl.utils import get_column_letter
wb = openpyxl.Workbook()
ws = wb.active
# Add rows to worksheet
for row in data:
    ws.append(row)
# Disable formatting numbers in columns from column `D` onwards
# Need to do this manually for every cell
for col in range(3, ws.max_column+1):
    for cell in ws[get_column_letter(col)]:
        cell.number_format = '#'
# Export data to Excel file...
But this only formats the populated cells in each column. The other cells in the column still have General formatting.
How can I set all empty cells in these columns to # so that anyone who edits them in the exported Excel file will not have problems entering, let's say, phone numbers as actual numbers?

For openpyxl you must always set the styles for every cell individually. If you set them for the column, then Excel will apply them when it creates new cells, but styles are always still applied to individual cells.

Because you iterate over the cells of those columns only up to the sheet's last populated row, those are the only cells that get reformatted. While you can't reformat a whole column, you can at least set the format for a fixed range of cells:
last_cell = 100
for col in range(3, ws.max_column+1):
    for row in range(1, last_cell):
        ws.cell(column=col, row=row).number_format = '#'  # Apply the '#' number format
This formats every cell in each column up to, but not including, row last_cell. It's not exactly what you need, but it's close enough.

Conditional formatting can be used as a workaround to put number formatting on an entire column. For applying a thousands separator to an entire column, this worked for me:
from openpyxl.formatting.rule import Rule
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.styles.numbers import NumberFormat
diff_style = DifferentialStyle(numFmt=NumberFormat(numFmtId='4', formatCode='#,##0.00'))
rule1 = Rule(type="expression", dxf=diff_style)
rule1.formula = ["=NOT(ISBLANK($H2))"]  # column on which the thousands separator is to be applied
work_sheet.conditional_formatting.add("$H2:$H500001", rule1)  # provide a range of cells

How to read Excel Workbook (pandas)

First, I want to say that I am not an expert by any means. I am fairly well versed, but I carry the burden of a busy schedule and of not having learned Python at a younger age, as I should have!
Question:
I have a workbook that will on occasion have more than one worksheet. When reading in the workbook, I will not know the number of sheets or their names. The data arrangement will be the same on every sheet, with some columns going by the name 'Unnamed'. The problem is that everything I try or find online uses pandas.ExcelFile to gather all the sheets, which is fine, but I need to be able to skip 4 rows, read only the 42 rows after that, and parse specific columns. Although the sheets might have the exact same structure, the column names might be the same or different, and I would like the sheets to be merged.
So here is what I have:
import pandas as pd
from openpyxl import load_workbook
# Load in the file location and name
cause_effect_file = r'C:\Users\Owner\Desktop\C&E Template.xlsx'
# Set up the ability to write dataframe to the same workbook
book = load_workbook(cause_effect_file)
writer = pd.ExcelWriter(cause_effect_file)
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
# Get the file skip rows and parse columns needed
xl_file = pd.read_excel(cause_effect_file, skiprows=4, parse_cols = 'B:AJ', na_values=['NA'], convert_float=False)
# Loop through the sheets loading data in the dataframe
dfi = {sheet_name: xl_file.parse(sheet_name)
       for sheet_name in xl_file.sheet_names}
# Remove columns labeled as un-named
for col in dfi:
    if r'Unnamed' in col:
        del dfi[col]
# Write dataframe to sheet so we can see what the data looks like
dfi.to_excel(writer, "PyDF", index=False)
# Save it back to the book
writer.save()
The link to the file I am working with is below:
Excel File

Try modifying the following based on your specific needs:
import pandas as pd
xls = pd.ExcelFile(path)
Then iterate over all the available data sheets:
frames = []
for x in range(len(xls.sheet_names)):
    # usecols replaces the parse_cols argument, which newer pandas no longer accepts
    a = xls.parse(x, header=4, usecols='B:AJ')
    a["Sheet Name"] = xls.sheet_names[x]
    frames.append(a)
# DataFrame.append was removed in pandas 2.0, so combine the pieces with concat
df = pd.concat(frames)
You can adjust the header row and the columns to read for each sheet. I added a column that indicates the name of the data sheet each row came from.
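As a hedged alternative sketch, pandas.read_excel can load every sheet in one call when sheet_name=None, returning a dict keyed by sheet name. The helper names merge_sheets and read_all_sheets are hypothetical, and the code assumes the fifth row of each sheet holds the column headers, followed by the 42 data rows mentioned in the question.

```python
import pandas as pd


def merge_sheets(frames):
    """Stack a {sheet_name: DataFrame} mapping into a single DataFrame."""
    cleaned = []
    for sheet_name, frame in frames.items():
        # Drop the placeholder columns that pandas labels "Unnamed: N"
        keep = [c for c in frame.columns if not str(c).startswith("Unnamed")]
        frame = frame[keep].copy()
        # Record which sheet each row came from
        frame["Sheet Name"] = sheet_name
        cleaned.append(frame)
    # Columns missing from a sheet are filled with NaN in the merged result
    return pd.concat(cleaned, ignore_index=True, sort=False)


def read_all_sheets(path):
    """Read columns B:AJ from every sheet: skip 4 rows, then header + 42 rows."""
    frames = pd.read_excel(
        path, sheet_name=None, skiprows=4, nrows=42, usecols="B:AJ"
    )
    return merge_sheets(frames)
```

Keeping the merge step separate from the file read means the column clean-up can be checked on small hand-built DataFrames before pointing it at the real workbook.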

You probably want to look at using read_only mode in openpyxl. This will allow you to load only the sheets you're interested in and look at only the cells you're interested in.
If you want to work with pandas DataFrames, you'll have to create these yourself, but that shouldn't be too hard.
