OpenPyXL - How to query cell borders? - python

New to both python and openpyxl.
Writing a py script to glom through a ton of Excel workbooks/sheets, and need to find certain cells identified by their border formatting.
I see several examples online of how to set cell borders, but I need to read them.
Specifically, I wish to identify table boundaries when the data within the table is inconsistent, but the table borders are always present. So, I need to identify the cells with:
* top / left borders
* top / right borders
* bottom / left borders
* bottom / right borders
(thin borders). There is only one such table per worksheet.
Could some kind maven point me to a code sample? I would provide my code thus far, but honestly I have no idea how to begin. My code for looping through each worksheet is:
for row in range(1, ws.max_row + 1):
    for col in range(1, ws.max_column + 1):
        tmp = NumToAlpha(col)
        ref = str(tmp) + str(row)
        hasTopBorder = ws[ref].??????     <=== how do I get a boolean here?
        hasLeftBorder = ws[ref].??????    <=== how do I get a boolean here?
        hasRightBorder = ws[ref].??????   <=== how do I get a boolean here?
        hasBottomBorder = ws[ref].??????  <=== how do I get a boolean here?
        if hasTopBorder and hasLeftBorder and not hasRightBorder and not hasBottomBorder:
            tableTopLeftCell = tmp + str(row)
        elif hasTopBorder and not hasLeftBorder and hasRightBorder and not hasBottomBorder:
            tableTopRightCell = tmp + str(row)
        elif not hasTopBorder and hasLeftBorder and not hasRightBorder and hasBottomBorder:
            tableBottomLeftCell = tmp + str(row)
        elif not hasTopBorder and not hasLeftBorder and hasRightBorder and hasBottomBorder:
            tableBottomRightCell = tmp + str(row)
        if tableTopLeftCell != "" and tableTopRightCell != "" and tableBottomLeftCell != "" and tableBottomRightCell != "":
            break
    if tableTopLeftCell != "" and tableTopRightCell != "" and tableBottomLeftCell != "" and tableBottomRightCell != "":
        break
Comments/suggestions for streamlining this novice code welcome and gratefully received.
Update:
By querying a cell like this:
tst = sheet['Q17'].border
I see that I get this type of result - but how do I use it? Or convert it into the desired boolean?

Here's one way.
I used "is not None" because the border style could be thin, double, etc.
def getCellBorders(ws, cellRef):
    # Returns a string with a letter for each side of the cell that has
    # any border style set: T (top), L (left), R (right), B (bottom).
    tmp = ws[cellRef].border
    brdrs = ''
    if tmp.top.style is not None: brdrs += 'T'
    if tmp.left.style is not None: brdrs += 'L'
    if tmp.right.style is not None: brdrs += 'R'
    if tmp.bottom.style is not None: brdrs += 'B'
    return brdrs

refs = {}
nowInmyTable = False
for row in range(1, ws.max_row + 1):
    for col in range(1, ws.max_column + 1):
        tmp = NumToAlpha(col)
        cellRef = str(tmp) + str(row)
        cellBorders = getCellBorders(ws, cellRef)
        if ('T' in cellBorders) or ('L' in cellBorders) or ('R' in cellBorders) or ('B' in cellBorders):
            if 'myTableTopLeftCell' not in refs:
                if ('T' in cellBorders) and ('L' in cellBorders):
                    refs['myTableTopLeftCell'] = (row, col)
                    nowInmyTable = True
            if nowInmyTable and ('L' not in cellBorders):
                if 'myTableBottomLeftCell' not in refs:
                    refs['myTableBottomLeftCell'] = (row - 1, col)
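As a quick check of the helper (assuming ws is your worksheet object, as above), you can print which sides of a single cell carry a border:
print(getCellBorders(ws, 'Q17'))  # e.g. 'TL' if Q17 has top and left borders set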

To identify whether cell "Q17" has a border:
from openpyxl.styles.borders import Border, Side

if sheet['Q17'].border.left.style == "thin":
    print("Cell Q17 has a thin left border")

I found a way around using the border object directly, by converting it to JSON and reading the border style value:
import json

t = sheet.cell(1, 1).border
f = json.dumps(t, default=lambda x: x.__dict__)  # serialize the Border object
r = json.loads(f)                                # back to a plain dict
s = r['left']['style']
print(s)  # prints the value of the left border style
if s == 'thin':
    pass  # do specific action
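For comparison, the same value is also available directly on the Border object, without the JSON round trip:
s = sheet.cell(1, 1).border.left.style
print(s)  # same left border style value as above (e.g. 'thin' or None)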

Related

Checking if a cell has a value; if it does, jump to the next row and write the value

I am trying to write a script that, when exporting values to a sheet, checks whether the cell already has a value and, if so, jumps to the next row. I can't seem to make it work; the output always writes 2 to the same cell.
if ws1.cell(column=1, row=xrow).value is None:
    sd = ws1.cell(column=1, row=xrow).value
    ws1.cell(column=1, row=xrow).value = 2
else:
    xrow = xrow + 1
    ws1.cell(column=1, row=xrow).value = 2
wb.save(dest_filename)
Welcome to Stack Overflow, @blaspas. What I have understood is that you are looking for an empty row to add data to, and otherwise to just continue.
The code below works for that:
import openpyxl

wb = openpyxl.load_workbook("file.xlsx")
ws = wb.active
max_row_val = ws.max_row
col = 1
for rows in range(1, max_row_val + 1):
    if ws.cell(rows, col).value is None:
        ws.cell(rows, col).value = 2
    else:
        ws.cell(rows, col).value = 2
wb.save("file.xlsx")

Load a spreadsheet, copy a row and paste it in a different location

How can I copy a row, for example from D51 to F51, and paste these values into the row T20 to AF20?
I know how to load a spreadsheet
workbook = load_workbook(output)
sheet = workbook.active
But I don't know how to iterate in a loop to get this:
sheet["T2"] = "=D6"
sheet["U2"] = "=E6"
sheet["V2"] = "=F6"
sheet["W2"] = "=G6"
sheet["X2"] = "=H6"
sheet["Y2"] = "=I6"
sheet["Z2"] = "=J6"
sheet["AA2"] = "=K6"
sheet["AB2"] = "=L6"
sheet["AC2"] = "=M6"
sheet["AD2"] = "=N6"
sheet["AE2"] = "=O6"
sheet["AF2"] = "=P6"
You can achieve this by using the code below.
Note that the file output.xlsx is opened, updated and saved. The function num_to_excel_col is borrowed from here.
This will update columns 20 (T) onwards for the next 15 columns (all in row 2) with the text "=D6", "=E6", etc. The num_to_excel_col function converts a column number to the equivalent Excel column string (e.g. 27 is converted to AA).
import openpyxl

workbook = openpyxl.load_workbook('output.xlsx')
ws = workbook.active

def num_to_excel_col(n):
    # Converts a 1-based column number to its Excel letter form (1 -> A, 27 -> AA).
    if n < 1:
        raise ValueError("Number must be positive")
    result = ""
    while True:
        if n > 26:
            n, r = divmod(n - 1, 26)
            result = chr(r + ord('A')) + result
        else:
            return chr(n + ord('A') - 1) + result

outcol = 4  # paste from column 'D'
for col in range(20, 35):  # column 20 is T; do this for the next 15 columns
    txt = "=" + num_to_excel_col(outcol) + "6"
    print(txt)
    ws.cell(row=2, column=col).value = txt
    outcol += 1

workbook.save("output.xlsx")

Using merge to detect differences in Pandas DataFrame

I am using merge to join two data frames together. The two data frames are data from a database table taken on two different dates, and I need to work out what changed. The number of rows will be different, but I just want to join the newer data set to the older data set with an inner join and see what changed.
At the moment, I am taking advantage of the _x and _y suffixes on the merged columns, and the .columns data, to compare the fields for differences after the merge.
There must be an easier way to do this. I did try the new compare() method in pandas 1.1.0, but it doesn't seem to like frames with a different shape (i.e. a different number of rows in my case), rendering it useless to me.
import pandas as pd

def get_changed_records(df_old, df_new, file_info):
    join_keys = file_info["compare_col"].split(",")
    old_file_name = file_info["old_file_name"]
    new_file_name = file_info["new_file_name"]
    print("Changed Records: JOIN DATA FRAMES ON COLUMNS: previous file ", old_file_name, " new file name ", new_file_name)
    columns = df_new.columns
    df_merged = df_new.merge(df_old, how='inner', on=join_keys, indicator=True)
    changed_records = []
    for idx, row in df_merged.iterrows():
        changes = []
        for col in columns:
            if col not in join_keys:
                after_col = col + '_x'
                before_col = col + '_y'
            else:
                after_col = col
                before_col = col
            after_val = row[after_col]
            before_val = row[before_col]
            changed = False
            if pd.isnull(before_val) or pd.isnull(after_val):
                if not pd.isnull(before_val) and pd.isnull(after_val):
                    changed = True
                if pd.isnull(before_val) and not pd.isnull(after_val):
                    changed = True
                if pd.isnull(before_val) and pd.isnull(after_val):
                    changed = False
            elif after_val != before_val:
                print("COLUMN_CHANGE: ", col, " before ", before_val, " after ", after_val)
                changed = True
            if changed:
                print('-' * 50)
                print('-Adding changes to result...')
                changes.append(['COLUMN_CHANGE', col, before_val, after_val, row, join_keys])
                print(changes)
        if len(changes) > 0:
            changed_records.append(changes)
    print("changed records ", len(changed_records))
    print(changed_records)
    return changed_records
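One possible simplification (a sketch only, assuming the join keys identify rows uniquely and both frames share the same columns) is to compare the suffixed column pairs in a vectorized pass per column instead of iterating row by row:
import pandas as pd

def get_changed_columns(df_old, df_new, join_keys):
    # Inner-join the two snapshots, then compare each non-key column's
    # new/old values in one vectorized pass per column.
    merged = df_new.merge(df_old, how='inner', on=join_keys, suffixes=('_new', '_old'))
    value_cols = [c for c in df_new.columns if c not in join_keys]
    changes = {}
    for col in value_cols:
        new_vals = merged[col + '_new']
        old_vals = merged[col + '_old']
        # Treat NaN == NaN as "no change"; any other mismatch counts as changed.
        differs = ~((new_vals == old_vals) | (new_vals.isna() & old_vals.isna()))
        changes[col] = merged.loc[differs, join_keys + [col + '_old', col + '_new']]
    return changes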

Reconciling an array slicer

I've built a function to cut the extraneous garbage out of text entries. It uses an array slicer. I now need to reconcile the lines that have been removed by my cleanup function, so that lines_lost + lines_kept = total lines. Source code below:
def header_cleanup(entry_chunk):
    # Removes duplicate headers due to page-continuations
    entry_chunk = entry_chunk.replace("\r\n\r\n", "\r\n")
    headers = lines[1:5]
    lines[:] = [x for x in lines if not any(header == x for header in headers)]
    lines = headers + lines
    return "\n".join(lines)
How could I count the lines that do not show up in lines after the slice/mutation, i.e:
original_length = len(lines)
lines = lines.remove_garbage
garbage = lines.garbage_only_plz
if len(lines) + len(garbage) == original_length:
    print("Good!")
else:
    print("Bad! ;(")
Final answer ended up looking like this:
import sys

def header_cleanup(entry_chunk):
    lines = entry_chunk.replace("\r\n\r\n", "\r\n").split("\r\n")
    line_length = len(lines)
    headers = lines[1:5]
    saved_lines = []
    bad_lines = []
    saved_lines[:] = [x for x in lines if not any(header == x for header in headers)]
    bad_lines[:] = [x for x in lines if any(header == x for header in headers)]
    total_lines = len(saved_lines) + len(bad_lines)
    if total_lines == line_length:
        print("Yay!")
    else:
        print("Boo.")
        print(f"{rando_trace_info}")  # placeholder for whatever trace info is useful
        sys.exit()
    final_lines = headers + saved_lines
    return "\n".join(final_lines)
Okokokokok - I know you're thinking: that's redundant, but it's required. Open to edits after solution for anything more pythonic. Thanks for consideration.
Don't reuse the lines variable; use a different variable so you can still get the garbage out of the original lines.
clean_lines = remove_garbage(lines)
garbage = garbage_only(lines)
if len(clean_lines) + len(garbage) == len(lines):
    print("Good!")
else:
    print("Bad!")
You might want to have a single function that returns both:
clean_lines, garbage = filter_garbage(lines)
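A minimal sketch of such a function (filter_garbage is a hypothetical name; here the "garbage" test is the duplicated-header check from the question, with the headers passed in explicitly):
def filter_garbage(lines, headers):
    # Split lines into kept lines and duplicated-header lines in one pass,
    # so the two lists always add up to len(lines).
    clean_lines, garbage = [], []
    for line in lines:
        if line in headers:
            garbage.append(line)
        else:
            clean_lines.append(line)
    return clean_lines, garbage

clean_lines, garbage = filter_garbage(lines, headers)
assert len(clean_lines) + len(garbage) == len(lines)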

Extract values from string

I want to extract certain values from a string in python.
snp_1_881627 AA=G;ALLELE=A;DAF_GLOBAL=0.473901;GENE_TRCOUNT_AFFECTED=1;GENE_TRCOUNT_TOTAL=1;SEVERE_GENE=ENSG00000188976;SEVERE_IMPACT=SYNONYMOUS_CODON;TR_AFFECTED=FULL;ANNOTATION_CLASS=REG_FEATURE,SYNONYMOUS_CODON,ACTIVE_CHROM,NC_TRANSCRIPT_VARIANT,NC_TRANSCRIPT_VARIANT;A_A_CHANGE=.,L,.,.,.;A_A_LENGTH=.,750,.,.,.;A_A_POS=.,615,.,.,.;CELL=GM12878,.,GM12878,.,.;CHROM_STATE=.,.,11,.,.;EXON_NUMBER=.,16/19,.,.,.;GENE_ID=.,ENSG00000188976,.,ENSG00000188976,ENSG00000188976;GENE_NAME=.,NOC2L,.,NOC2L,NOC2L;HGVS=.,c.1843N>T,.,n.3290N>T,n.699N>T;REG_ANNOTATION=H3K36me3,.,.,.,.;TR_BIOTYPE=.,PROTEIN_CODING,.,PROCESSED_TRANSCRIPT,PROCESSED_TRANSCRIPT;TR_ID=.,ENST00000327044,.,ENST00000477976,ENST00000483767;TR_LENGTH=.,2790,.,4201,1611;TR_POS=.,1893,.,3290,699;TR_STRAND=.,-1,.,-1,-1
Output:
GENE_ID GENE_NAME EXON_NUMBER SEVERE_IMPACT
snp_1_881627 ENSG00000188976 NOC2L 16/19 SYNONYMOUS_CODON
If the string has values for each of those variables (GENE_ID, GENE_NAME, EXON_NUMBER), then output them; otherwise output "NA" (the variables don't exist or their values don't exist). In some cases, these variables don't exist in the string.
Which string method should I use to accomplish this? Should I split my string before extracting any values? I have 10k rows to extract values from, one for each snp_*:
string=string.split(';')
P.S. I am a newbie in Python.
There are two general strategies for this - split and regex.
To use split, first split off the row label (snp_1_881627):
rowname, data = row.split()
Then, you can split data into the individual entries using the ; separator:
data = data.split(';')
Since you need to get the value of certain keys, we can turn it into a dictionary:
dataDictionary = {}
for entry in data:
    entry = entry.split('=')
    dataDictionary[entry[0]] = entry[1] if len(entry) > 1 else None
Then you can simply check if the keys are in dataDictionary, and if so grab their values.
Using split is nice in that it will index everything in the data string, making it easy to grab whichever ones you need.
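For example, with rowname and dataDictionary built as above, the requested fields (defaulting to "NA" when a key is missing or empty, per the question) can be printed as one tab-separated line; note the raw values here may themselves be comma-separated lists, as in the sample data:
keywords = ['GENE_ID', 'GENE_NAME', 'EXON_NUMBER', 'SEVERE_IMPACT']
values = [dataDictionary.get(key) or 'NA' for key in keywords]  # missing/empty -> 'NA'
print(rowname + '\t' + '\t'.join(values))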
If the ones you need will not change, then regex might be a better option:
>>> import re
>>> re.search('(?<=GENE_ID=)[^;]*', 'onevalue;GENE_ID=SOMETHING;othervalue').group()
'SOMETHING'
Here I'm using a "lookbehind" to match one of the keywords, then grabbing the value from the match using group(). Putting your keywords into a list, you could find all the values like this:
import re
...
keywords = ['GENE_ID', 'GENE_NAME', 'EXON_NUMBER', 'SEVERE_IMPACT']
desiredValues = {}
for keyword in keywords:
    match = re.search('(?<={}=)[^;]*'.format(keyword), string_to_search)
    desiredValues[keyword] = match.group() if match else DEFAULT_VALUE
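To match the question's requirement, DEFAULT_VALUE would simply be the string "NA", and the collected values can then be emitted as one tab-separated line:
DEFAULT_VALUE = 'NA'  # per the question: emit "NA" when a keyword is absent
print('\t'.join(desiredValues[k] for k in keywords))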
I think this is going to be the solution you are looking for.
#input
user_in = 'snp_1_881627 AA=G;ALLELE=A;DAF_GLOBAL=0.473901;GENE_TRCOUNT_AFFECTED=1;GENE_TRCOUNT_TOTAL=1;SEVERE_GENE=ENSG00000188976;SEVERE_IMPACT=SYNONYMOUS_CODON;TR_AFFECTED=FULL;ANNOTATION_CLASS=REG_FEATURE,SYNONYMOUS_CODON,ACTIVE_CHROM,NC_TRANSCRIPT_VARIANT,NC_TRANSCRIPT_VARIANT;A_A_CHANGE=.,L,.,.,.;A_A_LENGTH=.,750,.,.,.;A_A_POS=.,615,.,.,.;CELL=GM12878,.,GM12878,.,.;CHROM_STATE=.,.,11,.,.;EXON_NUMBER=.,16/19,.,.,.;GENE_ID=.,ENSG00000188976,.,ENSG00000188976,ENSG00000188976;GENE_NAME=.,NOC2L,.,NOC2L,NOC2L;HGVS=.,c.1843N>T,.,n.3290N>T,n.699N>T;REG_ANNOTATION=H3K36me3,.,.,.,.;TR_BIOTYPE=.,PROTEIN_CODING,.,PROCESSED_TRANSCRIPT,PROCESSED_TRANSCRIPT;TR_ID=.,ENST00000327044,.,ENST00000477976,ENST00000483767;TR_LENGTH=.,2790,.,4201,1611;TR_POS=.,1893,.,3290,699;TR_STRAND=.,-1,.,-1,-1'
#set some empty vars
user_in = user_in.split(';')
final_output = ""
GENE_ID_FOUND = False
GENE_NAME_FOUND = False
EXON_NUMBER_FOUND = False
GENE_ID_OUTPUT = ''
GENE_NAME_OUTPUT = ''
EXON_NUMBER_OUTPUT = ''
SEVERE_IMPACT_OUTPUT = ''
for x in range(0, len(user_in)):
    if x == 0:
        first_line_count = 0
        first_line_print = ''
        while user_in[0][first_line_count] != " ":
            first_line_print += user_in[0][first_line_count]
            first_line_count += 1
        final_output += first_line_print + "\t"
    else:
        if user_in[x][0:11] == "SEVERE_GENE":
            GENE_ID_OUTPUT += user_in[x][12:] + "\t"
            GENE_ID_FOUND = True
        if user_in[x][0:9] == "GENE_NAME":
            GENE_NAME_OUTPUT += user_in[x][10:] + "\t"
            GENE_NAME_FOUND = True
        if user_in[x][0:11] == "EXON_NUMBER":
            EXON_NUMBER_OUTPUT += user_in[x][12:] + "\t"
            EXON_NUMBER_FOUND = True
        if user_in[x][0:13] == "SEVERE_IMPACT":
            SEVERE_IMPACT_OUTPUT += user_in[x][14:] + "\t"
if GENE_ID_FOUND:
    final_output += GENE_ID_OUTPUT
else:
    final_output += "NA"
if GENE_NAME_FOUND:
    final_output += GENE_NAME_OUTPUT
else:
    final_output += "NA"
if EXON_NUMBER_FOUND:
    final_output += EXON_NUMBER_OUTPUT
else:
    final_output += "NA"
final_output += SEVERE_IMPACT_OUTPUT
print(final_output)
