How to detect "strikethrough" style from an xlsx file in R / Python

I need to check for data with "strikethrough" formatting when importing an Excel file into R.
Is there any method to detect it?
Both R and Python approaches are welcome.

R solution
The tidyxl package can help you.
Example: temp.xlsx, with data in A1:A4 of the first sheet. Below is an Excel screenshot:
library(tidyxl)
formats <- xlsx_formats( "temp.xlsx" )
cells <- xlsx_cells( "temp.xlsx" )
strike <- which( formats$local$font$strike )
cells[ cells$local_format_id %in% strike, 2 ]
# A tibble: 2 x 1
# address
# <chr>
# 1 A2
# 2 A4

I present below a small sample program that filters out text with strikethrough applied, using the openpyxl package (I tested it on version 2.5.6 with Python 3.7.0). Sorry it took so long to get back to you.
import openpyxl as opx
from openpyxl.styles import Font

def ignore_strikethrough(cell):
    if cell.font.strike:
        return False
    else:
        return True

wb = opx.load_workbook('test.xlsx')
ws = wb.active
colA = ws['A']
fColA = filter(ignore_strikethrough, colA)
for i in fColA:
    print("Cell {0}{1} has value {2}".format(i.column, i.row, i.value))
    print(i.col_idx)
I tested it on a new workbook with the default worksheets, with the letters a, b, c, d, e in the first five rows of column A, where I had applied strikethrough formatting to b and d. This program filters out the cells in column A which have strikethrough applied to the font, and then prints the cell, row and value of the remaining ones. The col_idx property returns the 1-based numeric column value.

I found a method below:
# Assuming rows 1 - 10 of column A have value A, and the 5th A contains "strikethrough"
from openpyxl import load_workbook

TEST_wb = load_workbook(filename='TEST.xlsx')
TEST_wb_s = TEST_wb.active

for i in range(1, TEST_wb_s.max_row + 1):
    ck_range_A = TEST_wb_s['A' + str(i)]
    if ck_range_A.font.strikethrough == True:
        print('YES')
    else:
        print('NO')
But it doesn't tell the location (in this case the row number), which makes it hard to know where the "strikethrough" is when there are many results. How can I collect the results into a vector/list instead?
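One way to "vectorize" this is to collect the matching row numbers into a list in a single pass instead of printing YES/NO per cell. A small sketch of that idea; the in-memory demo workbook below stands in for the question's TEST.xlsx:

```python
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font

def strikethrough_rows(ws, col='A'):
    """Return the row numbers in the given column whose font is struck through."""
    return [row for row in range(1, ws.max_row + 1)
            if ws[f'{col}{row}'].font.strike]

# Demo on an in-memory workbook: ten 'A' values, the 5th struck through.
wb = Workbook()
ws = wb.active
for r in range(1, 11):
    ws[f'A{r}'] = 'A'
ws['A5'].font = Font(strike=True)

print(strikethrough_rows(ws))  # [5]
# For the real file: strikethrough_rows(load_workbook('TEST.xlsx').active)
```

The list tells you exactly which rows carry the formatting, so you can report or filter them in one step.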

Related

Openpyxl: TypeError - Concatenation of several columns into one cell per row

I am new to openpyxl and cannot figure out what the reason for my error is. I hope you can see the problem and show me what to change!
What I want to do:
I want to concatenate the cells from columns F to M per row and put the concatenated value into column E like below.
(The rows do not always fill up from column F to column M, since per row are different kind of signals. But I have not put an if clause for this yet. This is just an information about the structure.)
Input:
A B C D E F G H .. M
....... E1 90 2A .. 26
....... 0 80 F8 ..
Output:
A B C D E F G H .. M
....... E1902A..26
....... 080F8..
What I did so far (Code):
theFile = openpyxl.load_workbook('T013.xlsx')
allSheetNames = theFile.sheetnames
print("All sheet names {}".format(theFile.sheetnames))
sheet = theFile.active

# loop to concatenate
for i, row in enumerate(sheet.rows, 1):
    for column in range(6, 13):  # column F-M
        sRow = str(i)
        Ecell = sheet['E' + sRow]
        ref = sheet["F:M"]  # range of cells
        for cell in range(ref):
            values = str(cell.value)  # collect data
            Ecell.value = ''.join(values)  # write values
Which kind of error I get (complete Traceback):
C:\Users\..\Desktop\Practical Part\CAN Python>python ConcatenateHEX.py
All sheet names ['T013']
Traceback (most recent call last):
File "ConcatenateBIN.py", line 38, in <module>
for cell in range(ref):
TypeError: 'tuple' object cannot be interpreted as an integer
I already tried to change the 'ref' variable but the error is always the same!
Could you please support me? Thank you very much!
EDIT (2/10/2020):
Further, I want to use the function for all rows which are too many to write down. Therefore I came up with this change:
def concat_f_to_m():
    for row_value in range(1, sheet.max_row + 1):
        values = []
        del values[:]
        for row in sheet.iter_rows(min_col=6, max_col=14, min_row=row_value, max_row=row_value):
            for cell in row:
                if cell.value != None:
                    values.append(str(cell.value))
                else:
                    del values[:]
                    break
        # print(values)
        sheet[f'E{row_value}'].value = ''.join(values)

concat_f_to_m()
I cannot overcome the issue that all the values from row 1 to row n are concatenated into the cell for row_value (e.g. for row_value=13, all values from rows 1 to 13 end up in cell E13). I wanted to iterate over row_value in order to go through all rows, but somehow that does not work. Could you give me a hint on how to concatenate through all rows by joining the values list at each row? Thank you!
Using openpyxl, I created a little function that will do what you want for a line at a time:
import openpyxl

theFile = openpyxl.load_workbook('T013.xlsx')
allSheetNames = theFile.sheetnames
print("All sheet names: {}".format(theFile.sheetnames))
sheet = theFile.active

def concat_f_to_m(row_value):
    values = []
    for row in sheet.iter_rows(min_col=6, max_col=13, min_row=row_value, max_row=row_value):
        for cell in row:
            if cell.value is None:
                pass
            else:
                values.append(str(cell.value))
    sheet[f'E{row_value}'].value = ''.join(values)

concat_f_to_m(1)
concat_f_to_m(2)
theFile.save('T013.xlsx')
Output:
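To cover every row without calling the function once per row, the same logic can be wrapped in a loop up to sheet.max_row. A sketch of that variant; the in-memory demo workbook stands in for T013.xlsx:

```python
from openpyxl import Workbook

def concat_f_to_m_all(sheet):
    """Join the non-empty values of columns F-M (6-13) into column E, row by row."""
    for r in range(1, sheet.max_row + 1):
        values = [str(cell.value)
                  for row in sheet.iter_rows(min_col=6, max_col=13,
                                             min_row=r, max_row=r)
                  for cell in row
                  if cell.value is not None]
        sheet[f'E{r}'].value = ''.join(values)

# Demo on an in-memory workbook, laid out like the question's input:
wb = Workbook()
ws = wb.active
ws['F1'] = 'E1'; ws['G1'] = '90'; ws['H1'] = '2A'
ws['F2'] = '0';  ws['G2'] = '80'
concat_f_to_m_all(ws)
print(ws['E1'].value, ws['E2'].value)  # E1902A 080
```

Because the values list is rebuilt inside the loop for each row, earlier rows cannot leak into later ones, which was the issue in the question's edit.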

Merging inconsistent data in text files into a single excel spreadsheet

I have a large number of text files with data; each file can be imported into excel separately. However, while most of the columns are the same between the files, in many files there's a column or two added/missing so when I merge all the text files and put it into excel, many columns of data are shifted.
I can make a 'master list' of all the possible data entries, but I'm not exactly sure how to tell excel to put certain types of data in specific columns.
For instance, if I have two files that look like:
Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
and
LastName Name Age Year Color Size
Lily James 17 2021 green 0
How would I go about merging them like this in excel:
LastName Name Age Year Food Color Size
na Bob na 2018 Cake Blue na
na Charlie na 2017 Figs Red na
Lily James 17 2021 na green 0
Question: Merging inconsistent data in text files into a single excel spreadsheet
This solution uses the following built-ins and modules:
Set Types
Lists
CSV File Reading and Writing
Mapping Types — dict
The core of this solution is to normalize the column names using a set() object, and
the parameter .DictWriter(..., extrasaction='ignore') to handle the inconsistent columns.
The output format is CSV, which MS-Excel can read.
The given data, separated by blanks:
text1 = """Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
"""
text2 = """LastName Name Age Year Color Size
Lily James 17 2021 green 0
"""
Open the three files and get the headers.
Aggregate all column names, dropping duplicates using a set().
Create a DictReader object for each of the in_* files.
Note: Replace io.StringIO(... with open(<Path to file>)
import csv
import io

with io.StringIO(text1) as in_text1, \
     io.StringIO(text2) as in_text2, \
     io.StringIO() as out_csv:

    columns = set()
    reader = []
    for n, fh in enumerate([in_text1, in_text2]):
        fieldnames = fh.readline().rstrip().split()
        columns.update(fieldnames)
        reader.append(csv.DictReader(fh, delimiter=' ', fieldnames=fieldnames))
Create a DictWriter object using the normalized column names.
The parameter extrasaction='ignore', handle the inconsistent columns.
Note: The column order is not guaranteed. If you need a defined order, sort the list(columns) to your needs before assigning to fieldnames=.
    writer = csv.DictWriter(out_csv, fieldnames=list(columns), extrasaction='ignore')
    writer.writeheader()
Loop all DictReader objects reading all lines and write it to the target .csv file.
    for dictReader in reader:
        for _dict in dictReader:
            writer.writerow(_dict)
Output:
print(out_csv.getvalue())
Color,LastName,Year,Food,Age,Name,Size
Blue,,2018,Cake,,Bob,
Red,,2017,Figs,,Charlie,
green,Lily,2021,,17,James,0
Tested with Python: 3.4.2
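As an alternative to the csv module, pandas handles the inconsistent columns directly: pd.concat aligns frames by column name and fills missing cells with NaN. A sketch under the question's assumptions (whitespace-separated files; io.StringIO stands in for the real file paths):

```python
import io
import pandas as pd

text1 = """Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
"""
text2 = """LastName Name Age Year Color Size
Lily James 17 2021 green 0
"""

# Read each file (replace io.StringIO(...) with open(<path>)) and concatenate;
# columns are aligned by name and missing cells become NaN.
frames = [pd.read_csv(io.StringIO(t), sep=r"\s+") for t in (text1, text2)]
merged = pd.concat(frames, ignore_index=True, sort=False)
merged.to_csv('merged.csv', index=False)  # a CSV that opens directly in Excel
```

With sort=False the column order follows the order of first appearance across the files, matching the behaviour of the set-based solution above.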
If you're happy to work with the text files directly in Excel, this will work, though it may need some refinement from you.
I understand it's probably not what you're looking for, but it provides another option.
Open the Visual Basic editor, add a new module and copy the below code and paste in ...
Public Sub ReadAndMergeTextFiles()
    Dim strSrcFolder As String, strFileName As String, strLine As String, strPath As String, bFirstLine As Boolean
    Dim arrHeaders() As String, lngHeaderIndex As Long, arrFields, i As Long, objDestSheet As Worksheet, bFound As Boolean
    Dim objLastHeader As Range, x As Long, lngLastColumn As Long, lngHeaderCol As Long, arrHeaderCols() As Long
    Dim lngWriteRow As Long

    lngLastColumn = 1
    lngWriteRow = 2
    Application.EnableEvents = False
    Application.ScreenUpdating = False

    ' Change the sheet name being assigned to your destination worksheet name.
    ' Alternatively, display a prompt that asks for the sheet or simply uses the active sheet.
    Set objDestSheet = Worksheets("Result")

    With Application.FileDialog(msoFileDialogFolderPicker)
        .Title = "Select Source Folder"
        .Show
        If .SelectedItems.Count = 1 Then
            objDestSheet.Cells.Clear
            strSrcFolder = .SelectedItems(1)
            strFileName = Dir(strSrcFolder & "\*.txt")
            Do While Len(strFileName) > 0
                strPath = strSrcFolder & "\" & strFileName
                Open strPath For Input As #1
                bFirstLine = True
                Do Until EOF(1)
                    Line Input #1, strLine
                    arrFields = Split(strLine, vbTab, , vbTextCompare)
                    lngHeaderIndex = -1
                    For i = 0 To UBound(arrFields)
                        If bFirstLine Then
                            ' Loop through the header fields already written to the destination worksheet and find a match.
                            For x = 1 To objDestSheet.Columns.Count
                                bFound = False
                                If Trim(objDestSheet.Cells(1, x)) = "" Then Exit For
                                If UCase(objDestSheet.Cells(1, x)) = UCase(arrFields(i)) Then
                                    lngHeaderCol = x
                                    bFound = True
                                    Exit For
                                End If
                            Next
                            If Not bFound Then
                                objDestSheet.Cells(1, lngLastColumn) = arrFields(i)
                                lngHeaderCol = lngLastColumn
                                lngLastColumn = lngLastColumn + 1
                            End If
                            lngHeaderIndex = lngHeaderIndex + 1
                            ReDim Preserve arrHeaderCols(lngHeaderIndex)
                            arrHeaderCols(lngHeaderIndex) = lngHeaderCol
                        Else
                            ' Write out each value into the column found.
                            objDestSheet.Cells(lngWriteRow, arrHeaderCols(i)) = "'" & arrFields(i)
                        End If
                    Next
                    If Not bFirstLine Then
                        lngWriteRow = lngWriteRow + 1
                    End If
                    bFirstLine = False
                Loop
                Close #1
                strFileName = Dir
            Loop
            objDestSheet.Columns.AutoFit
        End If
    End With

    Application.ScreenUpdating = True
    Application.EnableEvents = True
End Sub
... I did some basic testing with the data you provided and it seemed to work. If for some reason it fails on the data you're using and you can't work it out, let me know and I'll put a fix in.
Some points ...
The order of the columns depends on the order of your files and which columns appear first. Of course, that could be enhanced upon, but it is what it is for now.
It assumes all files are in the one folder and all files end in .txt.
The separator within each file is assumed to be a TAB.
Let me know if that helps.

Python using openpyxl, load_workbook, grab pattern, and print value of cell

I'm still new to Python; I'm exploring and learning, and today I'm working with openpyxl. How do I print a cell value (in this case a date) based on the last position of my pattern? I have a worksheet I am trying to grab data from row by row, since my data set is fewer than 15 rows total. The data looks as follows:
Apps, 11/3/2017, 11/4/2017, 11/5/2017, 11/6/2017, 11/7/2017, 11/8/2017...
app_1, a, a, a, b, c,d,e...
app_2, a,b,c,d,e....
app_3, a,a,a,a,b,b,b,b,b,b,c,c,c,c,d,d,d,d,e,e,e,e....
I am able to grab each row, store it in a variable, get the value of a cell, and get cell coordinates. For example, I want to 1) look at the Apps column, 2) note app_1, 3) locate the letter pattern "aaa", 4) note the last coordinate of "aaa" and print the date 11/5/2017 to a new cell. Here is a sample of my code so far:
from openpyxl import load_workbook

wb = load_workbook('example.xlsx')
ws1 = wb['Sheet1']
row1 = ws1.iter_rows(min_row=1, max_row=1)
row2 = ws1.iter_rows(min_row=2, max_row=2)
row3 = ws1.iter_rows(min_row=3, max_row=3)

def row_one_apps_dates():  # I have this for each row
    for row in row1:
        for cell in row:
            if cell.value == 'a':
                print(cell.column, cell.row)  # print(cell.coordinate)

def row_two():
    for row in row2:
        for cell in row:
            if cell.value == 'a' or 'b' or 'c' or 'd' or 'e':
                print(cell.column, cell.row)  # print(cell.coordinate)
My Output:
A 2
B 2
C 2
D 2
E 2
F 2
Thank you for any help as I continue to grow in python.
UPDATE1:
At this point I am trying to retrieve the last value, but keep getting multiple results instead of the last one. Here is my code:
for row in row2:
    for cell in row:
        if cell.value == 'a':
            result = cell.column
            last_result = result[-1]
            print(last_result)
OUTPUT:
B
C
D
E
F
G
None
I want to retrieve G only, and will later add 1, making it G1; then grab the value of G1, which would be a date.
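To keep only the last match, let the loop overwrite a single variable, then read the date from row 1 of that column afterwards. A sketch of that approach (note: in current openpyxl, cell.column is a number and cell.column_letter is the letter, unlike the older version that produced the question's output):

```python
from openpyxl import Workbook

def last_match_date(ws, value='a', data_row=2, header_row=1):
    """Return (column_letter, header value) for the last cell in data_row equal to value."""
    last_col = None
    for row in ws.iter_rows(min_row=data_row, max_row=data_row):
        for cell in row:
            if cell.value == value:
                # keep overwriting; after the loop this holds the last match
                last_col = cell.column_letter
    if last_col is None:
        return None
    return last_col, ws[f'{last_col}{header_row}'].value

# Demo laid out like the question's sheet: dates in row 1, pattern in row 2.
wb = Workbook()
ws = wb.active
ws.append(['11/3/2017', '11/4/2017', '11/5/2017', '11/6/2017'])
ws.append(['a', 'a', 'a', 'b'])
print(last_match_date(ws))  # ('C', '11/5/2017')
```

The function name and parameters are illustrative; the key idea is that only the final assignment to last_col survives the loop.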

openpyxl - change width of n columns

I am trying to change the column width for n number of columns.
I am able to do this for rows as per the below code.
rowheight = 2
while rowheight < 601:
    ws.row_dimensions[rowheight].height = 4
    rowheight += 1
The problem I have is that columns are in letters and not numbers.
As pointed out by ryachza the answer was to use an openpyxl utility, however the utility to use is get_column_letter and not column_index_from_string as I want to convert number to letter and not visa versa.
Here is the working code
from openpyxl.utils import get_column_letter

# Start changing width from column C onwards
column = 3
while column < 601:
    i = get_column_letter(column)
    ws.column_dimensions[i].width = 4
    column += 1
To get the column index, you should be able to use:
i = openpyxl.utils.column_index_from_string(?)
And then:
ws.column_dimensions[i].width = ?
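The two utilities are inverses of each other, which is easy to check in isolation (a quick sketch):

```python
from openpyxl.utils import get_column_letter, column_index_from_string

print(get_column_letter(3))           # C
print(column_index_from_string('C'))  # 3
print(get_column_letter(27))          # AA
```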

Read last valid rows in csv using python

I am reading from my csv file using Python, but I want to read only specific (the last valid) rows from the tail of the csv. There is a catch: the function should return the entire row only when it is valid. Can anyone help me out with this?
Below is my csv file looks like:
Sr. Add A B C D
0 0013A20040D6A141 -308.1 -307.6 -307.7 -154.063
1 0013A20040DC889A -308.7 -311.7 -311.7 -154.263
2 0013A20040DC88C3 -310.1 -310.1 -310.2 -154.863
3 0013A20040D6A141 -308.2 -306.8 -307.7 -153.863
4 0013A20040DC889A -308.7 -311.4 -311.1 -153.263
5 0013A20040DC88C3 -- -- -- --
6 0013A20040D6A141 -308.7 -308.3 -305.2 -154.663
and the code I am trying is:
import struct
from collections import deque

def last_data(address):
    i = sum(1 for line in open("filename.csv", 'r'))
    print i  # number of lines in csv
    cache = {}  # dict that saves the last data for a particular address
    n = 3
    with open("filename.csv", 'r') as f:
        q = deque(f, 3)  # last 3 lines of the file
    qp = [''] * n
    if i + 1 >= n:  # check whether the number of lines is greater than the number of addresses
        for k in range(n):
            qp[k] = q[k].split(',')
            if address == str(qp[k][1]):  # check for the particular address in the row
                # if the row has data, put it into a dict with the address as key
                # and the columns 'A', 'C' as nested keys
                cache.update({address: {'A': struct.pack('>l', int(float(qp[k][3]) * 10)),
                                        'C': struct.pack('>l', int(float(qp[k][4]) * 10))}})
    return cache[address]['A'], cache[address]['C']
last_data('0013A20040DC88C3') returns the 5th row, which has invalid data, where I want it to return the 2nd row. Can anybody tell me how to do this?
With pandas it would look like this:
Note: Python 2.7 code. Change the import of StringIO on Python 3.
import pandas as pd
from StringIO import StringIO
input = """Sr. Add A B C D
0 0013A20040D6A141 -308.1 -307.6 -307.7 -154.063
1 0013A20040DC889A -308.7 -311.7 -311.7 -154.263
2 0013A20040DC88C3 -310.1 -310.1 -310.2 -154.863
3 0013A20040D6A141 -308.2 -306.8 -307.7 -153.863
4 0013A20040DC889A -308.7 -311.4 -311.1 -153.263
5 0013A20040DC88C3 -- -- -- --
6 0013A20040D6A141 -308.7 -308.3 -305.2 -154.663
"""
buffer = StringIO(input)
df = pd.read_csv(buffer, delim_whitespace=True, na_values=["--"])
# you can customize the behaviour here, e.g. how many invalid values are ok per row.
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
df = df.dropna()
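Once the invalid rows are dropped, taking the last remaining entry per address gives exactly the lookup the question describes. A sketch continuing the same pandas approach (groupby(...).last() keeps the final valid row for each Add):

```python
import io
import pandas as pd

data = """Sr. Add A B C D
0 0013A20040D6A141 -308.1 -307.6 -307.7 -154.063
1 0013A20040DC889A -308.7 -311.7 -311.7 -154.263
2 0013A20040DC88C3 -310.1 -310.1 -310.2 -154.863
3 0013A20040D6A141 -308.2 -306.8 -307.7 -153.863
4 0013A20040DC889A -308.7 -311.4 -311.1 -153.263
5 0013A20040DC88C3 -- -- -- --
6 0013A20040D6A141 -308.7 -308.3 -305.2 -154.663
"""

df = pd.read_csv(io.StringIO(data), sep=r"\s+", na_values=["--"])
# drop invalid rows first, then keep the last valid row per address
last_valid = df.dropna().groupby("Add").last()
print(last_valid.loc["0013A20040DC88C3", "A"])  # -310.1
```

For '0013A20040DC88C3' this skips the invalid 5th row and returns the 2nd row's values, as the question asks.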
