OpenPyXL - ReadOnly: How to skip empty rows without knowing when they occur? - python

I'm pretty new to programming, so please bear with me if my code is not nice and the answer is too obvious. :)
I want to parse an Excel file into a dictionary so I can later access the values via key. I won't know how the Excel file is structured before parsing it, so I can't simply code it to skip a certain empty row, since the empty rows will occur at random.
For this, I am using Python 3 and OpenPyXL (read-only mode). This is my code:
from openpyxl import load_workbook
import pprint

# path to file
c = "test.xlsx"
wb = load_workbook(filename=c, read_only=True, data_only=True)
# dictionary the file is parsed into
data = {}
# list of worksheet names
wsname = []
# values in rows per worksheet
valuename = []
# starting at these odd numbers since pprint orders the keys strangely when 1s and 10s are mixed
# counter for row
k = 9
# counter for column
i = 10
# splits the name of the xlsx file from ".xlsx"
workbook = c.split(".")[0]
data[workbook] = {}
for ws in wb.worksheets:
    # takes the worksheet name and appends it to the wsname list
    wsname.append(ws.title)
    wsrealname = wsname.pop()
    worksheet = wsrealname
    data[workbook][worksheet] = {}
    for row in ws.rows:
        k += 1
        for cell in row:
            # reads the value per row and column
            data[workbook][worksheet]["Row: " + str(k) + " Column: " + str(i)] = cell.value
            i += 1
        i = 10
    k = 9
pprint.pprint(data)
And with this I get output like this:
{'test': {'Worksheet1': {'Row: 10 Column: 10': None,
                         'Row: 10 Column: 11': None,
                         'Row: 10 Column: 12': None,
                         'Row: 10 Column: 13': None,
                         'Row: 11 Column: 10': None,
                         'Row: 11 Column: 11': 'Test1',
                         'Row: 11 Column: 12': None,
                         'Row: 11 Column: 13': None}}}
This is the output I want, except that in this example I want to skip the whole Row 10, since all its values are None and the row is therefore empty.
As mentioned, I don't know when empty rows will occur, so I can't just hardcode a certain row to be skipped. In read-only mode, if you print(row), an empty row contains just EmptyCell objects, like this:
(<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
I tried to let my program check with set() whether the row "values" are all duplicates:
if len(set(row)) == 1:
    ...
but that doesn't solve the issue, since I get this error message:
TypeError: unhashable type: 'ReadOnlyCell'
If I compare each cell.value with None and exclude all Nones, I get this output:
{'test': {'Worksheet1': {'Row: 11 Column: 11': 'Test1'}}}
which is not what I want, since I only want to skip cells if the whole row is empty. The output should look like this:
{'test': {'Worksheet1': {'Row: 11 Column: 10': None,
                         'Row: 11 Column: 11': 'Test1',
                         'Row: 11 Column: 12': None,
                         'Row: 11 Column: 13': None}}}
So, could you please help in figuring out how to skip cells only if the complete row (and therefore all cells) is empty?
Thanks a lot!

from openpyxl.cell.read_only import EmptyCell

for row in ws:
    empty = all(isinstance(cell, EmptyCell) for cell in row)  # or check whether the cell value is None
NB: in read-only mode, avoid repeated single-cell lookups such as ws['A1'], as each one forces the library to parse the worksheet again and again.
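Applied to the loop from the question, the emptiness check could look like this minimal sketch (using the value-is-None variant, which covers both EmptyCell and ReadOnlyCell; the ws, data, workbook, worksheet and counter names are taken from the question):
for row in ws.rows:
    k += 1
    # skip the row entirely when every cell in it is empty
    if all(cell.value is None for cell in row):
        continue
    for cell in row:
        data[workbook][worksheet]["Row: " + str(k) + " Column: " + str(i)] = cell.value
        i += 1
    i = 10
Note that k still increments for skipped rows, so the row numbers in the keys stay aligned with the spreadsheet.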

Just create a custom generator that yields only the non-empty rows:
def iter_rows_with_data(worksheet):
    for row in worksheet.iter_rows(values_only=True):
        if any(row):
            yield row
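One caveat: any(row) also treats 0 and empty strings as empty, so a row containing only zeros would be skipped; testing against None avoids that. And since the generator yields plain value tuples, enumerate can keep track of the original row numbers. A minimal sketch, reusing the names from the question:
for row_number, row in enumerate(ws.iter_rows(values_only=True), start=1):
    if not any(v is not None for v in row):
        continue  # every cell in this row is None, so skip it
    for col_number, value in enumerate(row, start=1):
        data[workbook][worksheet]["Row: %d Column: %d" % (row_number, col_number)] = value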

Related

Comma separated strings to excel cell with python

I'd like to push the contents of a string into an xls file. The contents are:
abc,[1,2],abc/er/t_y,def,[3,4],def/er/t_d,ghi,ghi/tr/t_p,jkl,[5],jkl/tr/t_m_n,nop,nop/tr/t_k
This is my sample code (using xlwt):
import xlwt

workbook = xlwt.Workbook()
sh = workbook.add_sheet("Sheet1")

def exporttoexcel():
    print("I am in print excel")
    rowCount = 1
    for row in finalvalue:  # in finalvalue, abc,[1,2],abc/er/ty... is stored as type str
        colCount = 0
        for column in row.split(","):
            sh.write(rowCount, colCount, column)
            colCount += 1
        rowCount += 1
    workbook.save("myxl.xls")

exporttoexcel()
While ingesting the data into Excel, there are a few rules to follow:
- the column headers are main, ids, UI
- each cell has one value, except ids (ids may or may not be there)
- after three columns it should move to the next row
- the second column, i.e. **ids**, should hold only ids, and if none are available it should be left blank
How do I push the data into Excel so that it looks similar to this, following the above rules?
| A | B | C |
1|main|ids|UI|
2|abc |1,2|abc/tr/t_y|
3|def |3,4|def/tr/t_d|
4|ghi | |ghi/tr/t_p|
5|jkl |5 |jkl/tr/t_m_n|
6|nop | |nop/tr/t_k|
Use a regular expression to check whether a value is wrapped in []:
import re
m = re.search(r"\[([\d,]+)\]", column)  # note: \w+ would not match the comma in "[1,2]"
If your problem is how to break up the input string into something you can process with your code:
import re

content = 'abc,[1,2],abc/er/ty,def,[3,4],def/er/td,ghi,ghi/tr/tp,jkl,[5],jkl/tr/tm,nop,nop/tr/tk'
finalvalue = []
for match in re.finditer(r"(\w+),(\[\d+(?:,\d+)*\],)?([\w/]+)", content):
    finalvalue.append((
        match.group(1),
        None if match.group(2) is None else match.group(2)[1:-2],
        match.group(3)
    ))
print(finalvalue)
Result:
[('abc', '1,2', 'abc/er/ty'), ('def', '3,4', 'def/er/td'), ('ghi', None, 'ghi/tr/tp'), ('jkl', '5', 'jkl/tr/tm'), ('nop', None, 'nop/tr/tk')]
Note: rows are no longer stored as strings but as tuples, so you can simplify your code a bit.
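Putting the two answers together, a minimal sketch of the writing side (still xlwt, as in the question; it assumes finalvalue holds the (main, ids, UI) tuples produced by the regex above, and writes a blank cell when ids is None):
import xlwt

workbook = xlwt.Workbook()
sh = workbook.add_sheet("Sheet1")

# header row, per the rules in the question
for col, header in enumerate(("main", "ids", "UI")):
    sh.write(0, col, header)

# one tuple per spreadsheet row; a missing ids entry stays blank
for row_num, (main, ids, ui) in enumerate(finalvalue, start=1):
    sh.write(row_num, 0, main)
    sh.write(row_num, 1, "" if ids is None else ids)
    sh.write(row_num, 2, ui)

workbook.save("myxl.xls")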

Insert dict into dataframe with loop

I'm fetching data from an API with a for loop, but only the last row is showing. If I put a print statement in place of the d = assignment, I get all the records for some reason. How do I populate a DataFrame with all the values?
I tried a for loop and append but keep getting the wrong results:
import pandas as pd
import requests

for x in elements:
    url = "https://my/api/v2/item/" + str(x["number"]) + "/"
    get_data = requests.get(url)
    get_data_json = get_data.json()
    d = {'id': [x["enumber"]],
         'name': [x["value1"]["name"]],
         'adress': [value2["adress"]],
         'stats': [get_data_json["stats"][5]["rating"]]
         }
    df = pd.DataFrame(data=d)
df.head()
Result:
id name order adress rating
Only the last row is showing, probably because df is overwritten on every iteration until the last element. Should I put another for loop somewhere, or is there some obvious solution that I cannot see?
Put all your data into a list of dictionaries and convert it to a DataFrame once, at the very end.
At the top of your code write:
all_data = []
Then in your loop, after d = {...}, write:
all_data.append(d)
Finally, at the end (after the loop has finished):
df = pd.DataFrame(all_data)
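Put together, the loop could look like this sketch (the x["number"], x["enumber"], x["value1"] and x["value2"] keys are lifted from the question's snippet and are assumptions about the API payload):
import pandas as pd
import requests

all_data = []
for x in elements:
    url = "https://my/api/v2/item/" + str(x["number"]) + "/"
    get_data_json = requests.get(url).json()
    all_data.append({
        'id': x["enumber"],
        'name': x["value1"]["name"],
        'adress': x["value2"]["adress"],
        'stats': get_data_json["stats"][5]["rating"],
    })

df = pd.DataFrame(all_data)  # one row per element
print(df.head())
Note that the one-element lists around each value are no longer needed: when pandas is given a list of dicts, it builds one row per dict on its own.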

Compare columns from different excel files and add a column at the beginning of each with the output

I want to start by saying that I'm not an Excel expert, so I kindly ask for some help.
Let's assume that I have 3 Excel files: main.xlsx, 1.xlsx and 2.xlsx. In all of them I have a column named Serial Numbers. I have to:
look up all serial numbers in 1.xlsx and 2.xlsx and verify whether they are in main.xlsx.
If a serial number is found:
in the last column of main.xlsx, on the same row as the serial number that was found, write OK + name_of_the_file_in_which_it_was_found; else, write NOK. At the same time, write OK or NOK in the last column of 1.xlsx and 2.xlsx depending on whether the serial number was found.
Note: the serial number can be in different columns in 1.xlsx and 2.xlsx.
Example:
main.xlsx
name date serial number phone status
a b abcd c <-- ok,2.xlsx
b c 1234 d <-- ok,1.xlsx
c d 3456 e <-- ok,1.xlsx
d e 4567 f <-- NOK
e f g <-- skip, don't write anything to the status column
1.xlsx
name date serial number phone status
a b 1234 c <-- OK (because it is found in main)
b c lala d <-- NOK (because it is not found in main)
c d 3456 e <-- OK (because it is found in main)
d e jjjj f <-- NOK (because it is not found in main)
e f g <-- skip, don't write anything to the status column
2.xlsx
name date serial number phone status
a b c <-- skip, don't write anything to the status column
b c abcd d <-- OK (because it is found in main)
c d 4533 e <-- NOK (because it is not found in main)
d e jjjj f <-- NOK (because it is not found in main)
e f g <-- skip, don't write anything to the status column
Now, I tried doing this in Python, but I couldn't figure out how to write to the status column (I tried using DataFrames) on the same line where the serial number was found. Any help would be much appreciated (or at least some guidance).
My problem is not finding the duplicates, but rather keeping track of the rows (to write the status next to the correct serial number) and writing to the Excel file at the specified column (the status column).
My try:
import pandas as pd

get_main = pd.ExcelFile('main.xlsx')
get_1 = pd.ExcelFile('1.xlsx')
get_2 = pd.ExcelFile('2.xlsx')
sheet1_from_main = get_main.parse(0)
sheet1_from_1 = get_1.parse(0)
sheet1_from_2 = get_2.parse(0)
column_from_main = sheet1_from_main.iloc[:, 2].real
column_from_main_py = []
for x in column_from_main:
    column_from_main_py.append(x)
column_from_1 = sheet1_from_1.iloc[:, 2].real
column_from_1_py = []
for y in column_from_1:
    column_from_1_py.append(y)
column_from_2 = sheet1_from_2.iloc[:, 2].real
column_2_py = []
for z in column_from_2:
    column_2_py.append(z)
Suggested edit:
import pandas as pd

get_main = pd.read_excel('main.xls', sheet_name=0)
get_1 = pd.read_excel('1.xls', sheet_name=0)
get_2 = pd.read_excel('2.xls', sheet_name=0)
column_from_main = get_main.loc[:, 'Serial No.']
column_from_main_py = column_from_main.tolist()
column_from_1 = get_1.loc[:, 'SERIAL NUMBER']
column_from_1_py = column_from_1.tolist()
column_from_2 = get_2.loc[:, 'S/N']
column_from_2_py = column_from_2.tolist()
# Tried to put example data at a specific column
df = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})
writer = pd.ExcelWriter('first.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
workbook = writer.book
worksheet = writer.sheets['Sheet1']
worksheet.set_column('M:M', None, None)
writer.close()
First off, you can skip using ExcelFile and parse by using pd.read_excel(filename, sheet_name=0).
As far as your columns go, try accessing them by name, not by index, and instead of using a for loop to build a list, use the tolist method. So instead of column_from_main = sheet1_from_main.iloc[:, 2].real you could say:
column_from_main = get_main.loc[:, 'serial number']
column_from_main_py = column_from_main.tolist()
Do the same for your other files as well. This will remove any issues with the serial number column being indexed differently and will operate faster.
As to your comment about not being able to write to 'Status' properly, can you show your code that you tried? I'd be more than happy to help, but it's nice to see what you've done to this point.
For checking the values in main against the other two files you will want to iterate over the lists you created and check if each value in the main list is in either of the other lists. Within that loop you can then assign the value of status based on whether the serial number in main is present in one, none, or both:
get_main['status'] = ''
get_1['status'] = ''
get_2['status'] = ''
for num in column_from_main_py:
    if num not in column_from_1_py and num not in column_from_2_py:
        get_main.loc[get_main['serial number'] == num, 'status'] = 'NOK'
    elif num in column_from_1_py and num not in column_from_2_py:
        get_main.loc[get_main['serial number'] == num, 'status'] = 'OK,1.xlsx'
        get_1.loc[get_1['serial number'] == num, 'status'] = 'OK'
    elif num not in column_from_1_py and num in column_from_2_py:
        get_main.loc[get_main['serial number'] == num, 'status'] = 'OK,2.xlsx'
        get_2.loc[get_2['serial number'] == num, 'status'] = 'OK'
The get_main.loc lines are where you set the OK or NOK value in the status column: .loc finds the index where some condition is true and then lets you change the value of a specific column at that index. Once you have gone through the main list, you can look through the lists for 1 and 2 to find serial numbers that aren't in main. Similarly:
for num in column_from_1_py:
    if num not in column_from_main_py:
        get_1.loc[get_1['serial number'] == num, 'status'] = 'NOK'
for num in column_from_2_py:
    if num not in column_from_main_py:
        get_2.loc[get_2['serial number'] == num, 'status'] = 'NOK'
That will set your NOK values, and you should be good to go ahead and export the dataframes to Excel (or csv, hdf, sql, etc.), and that should do it.
There are lots of ways you can index and select data in pandas depending on what you want to do. I recommend reading the Indexing and Selecting Data page in the docs as it has been a great reference for me.
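As a side note, a hedged, vectorized alternative: pandas' Series.isin can replace the per-value loops entirely (column names as in the question; rows with a blank serial number keep the empty status):
sn = get_main['serial number']
in_1 = sn.isin(column_from_1_py)
in_2 = sn.isin(column_from_2_py)

get_main.loc[sn.notna() & ~in_1 & ~in_2, 'status'] = 'NOK'
get_main.loc[sn.notna() & in_1, 'status'] = 'OK,1.xlsx'
get_main.loc[sn.notna() & in_2, 'status'] = 'OK,2.xlsx'  # a serial present in both files ends up labelled 2.xlsx

# the same pattern marks the smaller files against main
sn_1 = get_1['serial number']
get_1.loc[sn_1.notna() & sn_1.isin(column_from_main_py), 'status'] = 'OK'
get_1.loc[sn_1.notna() & ~sn_1.isin(column_from_main_py), 'status'] = 'NOK'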
Note that the input files provided in the question aren't the actual input files being used. After obtaining the real input files, the following information/script was constructed. The script below will not work for the question exactly as asked.
To use the following example, first install petl and openpyxl (for the xlsx files):
pip install openpyxl
pip install petl
Script:
import petl

main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
non_serial_rows = petl.select(main, lambda rec: rec['serial number'] is None)
serial_rows = petl.select(main, lambda rec: rec['serial number'] is not None)
main_join_one = petl.join(serial_rows, petl.cut(one, ['serial number']), key='serial number')
main_join_one_file = petl.addfield(main_join_one, 'file', 'ok, 1.xlsx')
main_join_two = petl.join(serial_rows, petl.cut(two, ['serial number']), key='serial number')
main_join_two_file = petl.addfield(main_join_two, 'file', 'ok, 2.xlsx')
stacked_joins = petl.stack(main_join_two_file, main_join_one_file)
nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
output_main = petl.stack(stacked_joins, non_serial_rows, nok_rows)
main_final = output_main

def main_compare(table):
    non_serial_rows = petl.select(table, lambda rec: rec['serial number'] is None)
    serial_rows = petl.select(table, lambda rec: rec['serial number'] is not None)
    ok_rows = petl.join(serial_rows, petl.cut(main, ['serial number']), key='serial number')
    ok_rows = petl.addfield(ok_rows, 'file', 'OK')
    nok_rows = petl.antijoin(serial_rows, petl.cut(main, ['serial number']), key='serial number')
    nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
    return petl.stack(ok_rows, nok_rows, non_serial_rows)

one_final = main_compare(one)
two_final = main_compare(two)

petl.toxlsx(main_final, 'mainNew.xlsx')
print(petl.lookall(main_final))
petl.toxlsx(one_final, '1New.xlsx')
print(petl.lookall(one_final))
petl.toxlsx(two_final, '2New.xlsx')
print(petl.lookall(two_final))
Output (text on the console, plus the actual modified xlsx files)

ignoring null value columns in python

I have a .txt file which has three columns in it:
id          ImplementationAuthority.email     AssignedEngineer.email
ALU02034116 bin.a.chen#shan.cn                bin.a.chen#ell.com.cn
ALU02035113                                   Guolin.Pan#ell.com.cn
ALU02034116 bin.a.chen#ming.com.cn            Guolin.Pan#ell.com.cn
ALU02022055 fria-sha-qdv#list.com
ALU02030797 fria-che-equipment-1#phoenix.com  Balagopal.Velusamy#phoenix.com
I need to create two lists comprising the values under the columns ImplementationAuthority.email and AssignedEngineer.email. This works perfectly when the columns have complete values (i.e. no null values), but the values get mixed up when a column contains null values.
aengg = []
iauth = []
with open('test.txt') as f:
    for i, row in enumerate(f):
        columns = row.split()
        if len(columns) == 3:
            aengg.append(columns[2])
            iauth.append(columns[1])
print(aengg)
print(iauth)
I tried it with this code and it worked perfectly for complete column values.
Can anyone please tell me a solution for the null values?
It seems you don't have a separator, so I use the number of spaces to decide where a value belongs and fill the blank with a None.
Try this:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
aengg = []
iauth = []
with open('C:\\temp\\test.txt') as f:
    for i, row in enumerate(f):
        columns = row.split()
        if len(columns) == 2:
            # when there are more than 17 spaces between the two elements, I consider
            # the second one the third element of the row, so I add a None between them
            if row.index(columns[1]) > 17:
                columns.insert(1, None)
            # when there are fewer than 17 spaces between the two elements, I consider
            # the second one the second element of the row, so I add a None at the tail
            else:
                columns.append(None)
        print(columns)
        aengg.append(columns[2])
        iauth.append(columns[1])
print(aengg)
print(iauth)
Here is the output.
['id', 'ImplementationAuthority.email', 'AssignedEngineer.email']
['ALU02034116', 'bin.a.chen#shan.cn', 'bin.a.chen#ell.com.cn']
['ALU02035113', None, 'Guolin.Pan#ell.com.cn']
['ALU02034116', 'bin.a.chen#ming.com.cn', 'Guolin.Pan#ell.com.cn']
['ALU02022055', 'fria-sha-qdv#list.com', None]
['ALU02030797', 'fria-che-equipment-1#phoenix.com', 'Balagopal.Velusamy#phoenix.com']
['AssignedEngineer.email', 'bin.a.chen#ell.com.cn', 'Guolin.Pan#ell.com.cn', 'Guolin.Pan#ell.com.cn', None, 'Balagopal.Velusamy#phoenix.com']
['ImplementationAuthority.email', 'bin.a.chen#shan.cn', None, 'bin.a.chen#ming.com.cn', 'fria-sha-qdv#list.com', 'fria-che-equipment-1#phoenix.com']
You need to place a 'null' or 0 as a placeholder.
The interpreter would otherwise read Guolin.Pan#ell.com.cn in the second row as the second column.
Try this:
id ImplementationAuthority.email AssignedEngineer.email
ALU02034116 bin.a.chen#shan.cn bin.a.chen#ell.com.cn
ALU02035113 null Guolin.Pan#ell.com.cn
ALU02034116 bin.a.chen#ming.com.cn Guolin.Pan#ell.com.cn
ALU02022055 fria-sha-qdv#list.com null
ALU02030797 fria-che-equipment-1#phoenix.com Balagopal.Velusamy#phoenix.com
And then append the values after checking for null:
with open('test.txt') as f:
    for i, row in enumerate(f):
        columns = row.split()
        if len(columns) == 3:
            if columns[2] != "null":
                aengg.append(columns[2])
            if columns[1] != "null":
                iauth.append(columns[1])
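For what it's worth, a hedged alternative that avoids both the placeholder edit and the hand-rolled space counting: if the columns in test.txt are aligned at fixed offsets (as in the sample), pandas can infer the column boundaries and hand back NaN for the blanks.
import pandas as pd

# read_fwf infers the fixed-width column spans from the data
df = pd.read_fwf('test.txt')
iauth = df['ImplementationAuthority.email'].tolist()  # NaN where the cell is blank
aengg = df['AssignedEngineer.email'].tolist()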

Populating google spreadsheet by row, not by cell

I have a spreadsheet whose values I want to populate with values from dictionaries within a list. I wrote a for loop that updates cell by cell, but it is too slow and I often get a gspread.httpsession.HTTPError. I am trying to write a loop that updates row by row instead. This is what I have:
lstdic = [
    {'Amount': 583.33, 'Notes': '', 'Name': 'Jone', 'isTrue': False},
    {'Amount': 58.4, 'Notes': '', 'Name': 'Kit', 'isTrue': False},
    {'Amount': 1083.27, 'Notes': 'Nothing', 'Name': 'Jordan', 'isTrue': True}
]
Here is my cell-by-cell loop:
headers = wks.row_values(1)
for k in range(len(lstdic)):
    for key in headers:
        cell = wks.find(key)
        cell_value = lstdic[k][key]
        wks.update_cell(cell.row + 1 + k, cell.col, cell_value)
What it does is find the header that corresponds to each key in the list of dictionaries and update the cell under it. On the next iteration the row is increased by one, so it updates cells in the same columns but in the next row. This is too slow, and I want to update by row. My attempt:
headers = wks.row_values(1)
row = 2
for k in range(len(lstdic)):
    cell_list = wks.range('B%s:AA%s' % (row, row))
    for key in headers:
        for cell in cell_list:
            cell.value = lstdic[k][key]
    row += 1
    wks.update_cells(cell_list)
This one updates each row quickly, but with the same value: the third nested for loop assigns the same value to every cell in the row. I am breaking my head trying to figure out how to assign the right values to the cells. Help appreciated.
P.S. By the way, I am using headers because I want the values to appear in the Google spreadsheet in a certain order.
The following code is similar to Koba's answer but writes the full sheet at once instead of per row. This is even faster:
# sheet_data is a list of lists representing a matrix of data, headers being the first row.
# first make sure the worksheet is the right size
worksheet.resize(len(sheet_data), len(sheet_data[0]))

cell_matrix = []
rownumber = 1
for row in sheet_data:
    # max 26 columns this way; a two-character column letter would be needed beyond that, which I didn't need
    cellrange = 'A{row}:{letter}{row}'.format(row=rownumber, letter=chr(len(row) + ord('a') - 1))
    # get the row of cells from the worksheet
    cell_list = worksheet.range(cellrange)
    columnnumber = 0
    for cell in row:
        cell_list[columnnumber].value = row[columnnumber]
        columnnumber += 1
    # append cell_list, which represents all cells in one row, to the full matrix
    cell_matrix = cell_matrix + cell_list
    rownumber += 1
# write the full matrix to the worksheet all at once
worksheet.update_cells(cell_matrix)
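A small aside on the chr() arithmetic: it only reaches column Z. If the sheet might be wider, gspread ships a helper for A1 notation that could replace the cellrange line above (a sketch, assuming a reasonably recent gspread):
from gspread.utils import rowcol_to_a1

# rowcol_to_a1(1, 28) == 'AB1', so this also works past column Z
cellrange = 'A{row}:{end}'.format(row=rownumber, end=rowcol_to_a1(rownumber, len(row)))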
I ended up writing the following loop, which fills the spreadsheet by row amazingly fast:
headers = wks.row_values(1)
row = 2  # start from the second row, because the first row holds the headers
for k in range(len(lstdic)):
    values = []
    cell_list = wks.range('B%s:AB%s' % (row, row))  # make sure the row range equals the length of the values list
    for key in headers:
        values.append(lstdic[k][key])
    for i in range(len(cell_list)):
        cell_list[i].value = values[i]
    wks.update_cells(cell_list)
    print("Updating row " + str(k + 2) + '/' + str(len(lstdic) + 1))
    row += 1
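For completeness, newer gspread releases (roughly 3.3 and later) can write a whole block of values in a single call, which removes the per-cell bookkeeping entirely. A hedged sketch, assuming the header order from the question and that the data should start at B2:
# one inner list per spreadsheet row, values in header order
rows = [[d[key] for key in headers] for d in lstdic]
wks.update(range_name='B2', values=rows)  # a single API request for the whole block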
