I am attempting to compress some code I previously wrote in Python. I have some drawn-out code that loops through a number of lookup tables in an Excel workbook. There are about 20 sheets in the workbook that contain lookup tables. I want to loop through the values in each lookup table and add them to their own list. My existing code looks like this:
test1TableList = []
for row in arcpy.SearchCursor(r"Z:\Excel\LOOKUP_TABLES.xlsx\LookupTable1$"):
    test1TableList.append(row.Code)

test2TableList = []
for row in arcpy.SearchCursor(r"Z:\Excel\LOOKUP_TABLES.xlsx\LookupTable2$"):
    test2TableList.append(row.Code)

test3TableList = []
for row in arcpy.SearchCursor(r"Z:\Excel\LOOKUP_TABLES.xlsx\LookupTable3$"):
    test3TableList.append(row.Code)

test4TableList = []
for row in arcpy.SearchCursor(r"Z:\Excel\LOOKUP_TABLES.xlsx\LookupTable4$"):
    test4TableList.append(row.Code)

test5TableList = []
for row in arcpy.SearchCursor(r"Z:\Excel\LOOKUP_TABLES.xlsx\LookupTable5$"):
    test5TableList.append(row.Code)
yadda yadda
I want to compress that code (maybe into a function).

Issues to resolve:

1. Sheet names are all different. I need to loop through each sheet in the Excel workbook in order to a) grab the sheet object and b) use the sheet name as part of the Python list variable name.
2. I want each list to remain in memory for use further along in the code.
I've been trying something like this, which works, but the Python list variables don't seem to stay in memory:
import arcpy, openpyxl
from openpyxl import load_workbook, Workbook

wb = load_workbook(r"Z:\Excel\LOOKUP_TABLES.xlsx")
for i in wb.worksheets:
    filepath = r"Z:\Excel\LOOKUP_TABLES.xlsx" + "\\" + i.title + "$"
    varList = []
    with arcpy.da.SearchCursor(filepath, '*') as cursor:
        for row in cursor:
            varList.append(row[0])
    # This is the area I am struggling with. I can't seem to find a way to return
    # each list into memory. I've tried the following code to dynamically create
    # variable names from the name of the sheet so that each list has its own
    # variable. After the code has run, I'd just like to set a print statement
    # (i.e. print variablename1) which will return the list contained in the variable.
    newList = str(i.title) + "List"
    newList2 = list(varList)
    print newList + " = " + str(newList2)
I've been working on this for a while and I have no doubt that, at this point, I am overthinking my solution, but I'm at a block. Any recommendations are welcome!
Not sure if it is the best fit for you, but you could use pandas to import your sheets into dataframes.
from pandas.io.excel import ExcelFile

filename = 'linreg.xlsx'
xl = ExcelFile(filename)
for sheet in xl.sheet_names:
    df = xl.parse(sheet)
    print df
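If you want each sheet's data to stay around for later (the second issue in the question), one option is to keep the parsed dataframes in a dict keyed by sheet name; a minimal sketch along the same lines, reusing the placeholder filename above:

from pandas.io.excel import ExcelFile

filename = 'linreg.xlsx'  # placeholder filename from the example above
xl = ExcelFile(filename)
# one dataframe per sheet, addressable later by the sheet's name
frames = {sheet: xl.parse(sheet) for sheet in xl.sheet_names}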
Instead of a proliferation of separate lists, use a dictionary to collect the data per sheet:
import arcpy
from openpyxl import load_workbook

wb = load_workbook(r"Z:\Excel\LOOKUP_TABLES.xlsx")
sheets = {}
for i in wb.worksheets:
    filepath = r"Z:\Excel\LOOKUP_TABLES.xlsx" + "\\" + i.title + "$"
    with arcpy.da.SearchCursor(filepath, '*') as cursor:
        sheets[i.title] = [row[0] for row in cursor]
print sheets
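Each list then stays in memory for the rest of the script and is addressable by sheet name rather than by a dynamically built variable name, e.g.:

print sheets['LookupTable1']  # the list built from that sheet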
I have been trying to get the names of the files in a folder on my computer, open an Excel worksheet, and write the file names into a specific column. However, it fails with the following error: "TypeError: Value must be a list, tuple, range or generator, or a dict. Supplied value is <class 'str'>".
The code is:
from openpyxl import load_workbook
import os
import glob

os.chdir("/content/drive/MyDrive/picture")
ox = []
for file in glob.glob("*.*"):
    for j in range(0, 15):
        replaced_text = file.replace('.JPG', '')
        ox.append(replaced_text)
oxx = ['K', ox]  # K is a column
file1 = load_workbook(filename='/content/drive/MyDrive/Default.xlsx')
sheet1 = file1['Enter Data Draft']
for item in oxx:
    sheet1.append(item)
I've taken a slightly different approach, but looking at your code, the problem is with the looping.
The problem.
for item in oxx:
    sheet1.append(item)
When looping over the items in oxx, there are two items: 'K', and then a list of filenames (15 copies of each). openpyxl expects a different data structure for append(): each value passed to it must be a list, tuple, range, generator, or dict representing one row (see the documentation).
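For example, if each filename were meant to land in its own row of the sheet, a sketch of valid append() calls would be:

# append() takes one row per call; wrap each value in a list
for name in ox:
    sheet1.append([name])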
The solution
So, not knowing what other data you might have on the worksheet, I've changed the approach to hopefully satisfy the expected outcome.
I got the following to work as expected.
from openpyxl import load_workbook
import os
import glob

os.chdir("/content/drive/MyDrive/picture")
ox = []
for file in glob.glob("*.*"):
    for j in range(0, 15):  # I've kept this in here assuming you wanted to list the file name 15 times?
        replaced_text = file.replace('.JPG', '')
        ox.append(replaced_text)

file_dir = '/content/drive/MyDrive/Default.xlsx'
file1 = load_workbook(filename=file_dir)
sheet1 = file1['Enter Data Draft']

# If you were appending to the bottom of a list that was already there, use this:
# last_row = len(sheet1['K'])
# else use this:
last_row = 1  # Excel starts at 1; adjust if you had a header in that column

for counter, item in enumerate(ox):
    # K is the 11th column.
    sheet1.cell(row=(last_row + counter), column=11).value = item

# Need to save the file or changes won't be reflected.
file1.save(file_dir)
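As a side note, if you would rather not hard-code 11 for column K, openpyxl can derive the index from the column letter:

from openpyxl.utils import column_index_from_string

col_k = column_index_from_string('K')  # 11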
So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
So I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file; I want to append to the end of the .xlsx file. The biggest catch in all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, so I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
wb.save(outfile_path)  # without a save, none of the appended rows are written back
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will list the entirety of the program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what are actually good Python practices vs. not.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing the lists that I will store looped data in
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read in the files to a dataframe; main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # lists to store where headers are found in the dataframe
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare: insensitive_compare is a case-insensitive
                # matching helper defined elsewhere in the project
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)  # store the row where it is in that dataframe
                        column_list.append(name)  # store the column where it is in that dataframe
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turns the dataframe into a set of booleans where it's True if
    # there's something there
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            # I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store the actual data in my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into Excel like: [screenshot of the desired layout omitted]
So the comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concatenated into a dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export the result using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to Excel ended up looking something like this: [screenshot omitted]
Hopefully this can help someone else out too.
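As a side note, pd.concat also accepts the whole list of frames in a single call, so the accumulation loop above could likely be collapsed (a sketch, same variables as before):

to_xlsx_df = pd.concat(main_df)
to_xlsx_df.to_excel(outfile_path)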
I am hoping you can help me - I'm sure it's likely a small thing to fix, when one knows how.
In my workshop, neither I nor my colleagues can make 'find and replace all' changes via the front-end of our database. The boss just denies us that level of access. If we need to make changes to dozens or perhaps hundreds of records it must all be done by copy-and-paste or similar means. Craziness.
I am trying to make a workaround for that with Python 2, and in particular libraries such as pandas, pyautogui and xlrd.
I have researched several StackOverflow threads and have managed thus far to write some code that works well at reading a given XL file. In production, this will be a file exported from a found data set in the database GUI front-end, and will be just a single column of 'Article Numbers' for the items in the computer workshop. This will always have an Excel column header, e.g.:
ANR
51234
34567
12345
...
All the record numbers are five-digit numbers.
We also have the means of scanning items with an IR scanner to a 'Workflow' app on the iPad we have and automatically making an XL file out of that list of scanned items.
The XL file here could look something like this:
56788
12345
89012
...
It differs in that there is no column header. All XL files have their data 'anchored' at cell A1 on 'Sheet1', and again just a single column will be used. No unnecessary complications here!
Here is the script anyway. When it is fully working, system arguments will be supplied to it. For now, let's pretend that we need to change records to have their 'RAM' value changed from "2GB" to "2 GB".
import xlrd
import string
import re
import pandas as pd

field = "RAM"
value = "2 GB"
myFile = "/Users/me/folder/testArticles.xlsx"
df = pd.read_excel(myFile)
myRegex = "^[0-9]{5}$"

# data collection and putting into lists.
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

formatted = []
deDuped = []

# removing any possible XL headers, setting all values to strings
# that look like five-digit ints, applying a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')
    # remove the decimal point
    # Searching for the header will cause a database front-end problem.
    cellValue = cellValue[:-2]
    cellValue = cellValue.translate(None, string.letters)
    # making sure only valid article numbers get through
    # blank rows etc. can take a hike
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)

# weeding out any possible dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)

# main code block
for i in deDuped:
    # lots going on here involving pyautogui:
    # making sure of no errors running searches, checking for warnings,
    # moving/tabbing around the DB front-end, etc.
    # If all goes to plan:
    # remove that record number from the Excel file and save the change,
    # so that if we run the script again for the same XL file
    # we don't needlessly update an already-OK record again.
    df = df[~df['ANR'].astype(str).str.startswith(i)]
    df.to_excel(myFile, index=False)
What I would really like to find out is how I can run the script so that it "doesn't care" about the presence or absence of the column header.
df = df[~df['ANR'].astype(str).str.startswith(i)]
appears to be the line of code this all hangs on. I've made several changes to the line, in different combinations, but my script always crashes.
If a column header ("ANR" in my case) is essential for this particular pandas method, is there a straightforward way of inserting a column header into an XL file if it lacks one in the first place - i.e. the XL files that come from the IR scanner and the 'Workflow' app on the iPad?
Thanks guys!
UPDATE
I've tried, as suggested by Patrick, implementing some code to check whether cell A1 has a header or not. Partial success: I can put "ANR" in cell A1 if it's missing, but I lose whatever was there in the first place.
import xlrd
import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy
import openpyxl

# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
cell_a1 = sheet.cell_value(rowx=0, colx=0)
if cell_a1 == "ANR":
    print "has header"
else:
    wb = openpyxl.load_workbook(filename=myFile)
    ws = wb['Sheet1']
    ws['A1'] = "ANR"  # this overwrites whatever was in A1
    wb.save(myFile)
    # re-open XL file again etc.
I found this new block of code over at writing to existing workbook using xlwt. In this instance the contributor actually used openpyxl.
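A way to add the header without clobbering whatever was in A1 might be to shift the existing data down first with openpyxl's insert_rows; a minimal sketch, assuming openpyxl 2.4+ where that method exists:

from openpyxl import load_workbook

wb = load_workbook(filename=myFile)
ws = wb['Sheet1']
if ws['A1'].value != "ANR":
    ws.insert_rows(1)   # push the existing data down one row
    ws['A1'] = "ANR"    # write the header into the now-empty first cell
    wb.save(myFile)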
I think I got it fixed for myself.
Still a tiny bit messy, but it seems to be working. Added an if/else clause to check the value of cell A1 and take action accordingly. Found most of the code for this at "how to append data using openpyxl python to excel file from a specified row?", using the suggestion for openpyxl.
import pyperclip
import xlrd
import pyautogui
import string
import re
import os
import pandas as pd
import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy

field = "RAM"
value = "2 GB"
myFile = "/Users/me/testSerials.xlsx"
df = pd.read_excel(myFile)
myRegex = "^[0-9]{5}$"

# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
cell_a1 = sheet.cell_value(rowx=0, colx=0)
if cell_a1 == "ANR":
    print "has header"
else:
    headers = ['ANR']
    workbook_name = myFile  # was the literal string 'myFile'; use the actual path
    wb = Workbook()
    page = wb.active
    # page.title = 'companies'
    page.append(headers)  # write the headers to the first line
    workbook = xlrd.open_workbook(workbook_name)
    sheet = workbook.sheet_by_index(0)
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
    for records in data:
        page.append(records)
    wb.save(filename=workbook_name)

# then load the data all over again, this time with the inserted header
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

formatted = []
deDuped = []

# removing any possible XL headers, setting all values to strings that
# look like five-digit ints, applying a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')
    # remove the decimal point
    # cellValue = cellValue.translate(None, ".0")
    cellValue = cellValue[:-2]
    cellValue = cellValue.translate(None, string.letters)
    # making sure any valid ANRs get through
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)

# ------------------------------------------
# weeding out any possible dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)

# ref - https://stackoverflow.com/questions/48942743/python-pandas-to-remove-rows-in-excel
df = pd.read_excel(myFile)
print df
for i in deDuped:
    # pyautogui code is run here...
    # if all goes to plan, update the XL file
    df = df[~df['ANR'].astype(str).str.startswith(i)]
    df.to_excel(myFile, index=False)
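An alternative that sidesteps the header question entirely might be to have pandas read the file headerless and name the column itself; a sketch (untested against these files, reusing the column name from the rest of the script):

import pandas as pd

# read with no header assumption and supply the column name ourselves
df = pd.read_excel(myFile, header=None, names=['ANR'])
# if row one really was a header, it now appears as the string 'ANR' in
# the data, so keep only values that look like five-digit article numbers
df = df[df['ANR'].astype(str).str.match(r'^\d{5}(\.0)?$')]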
I need to search an Excel sheet for cells containing some pattern. It takes more time than I can handle. The most optimized code I could write is below. Since the data patterns usually come row after row, I use iter_rows(row_offset=x). Unfortunately, the code below takes an increasingly long time to find the given pattern on each pass of the for loop (starting from milliseconds and getting up to almost a minute). What am I doing wrong?
import openpyxl
import datetime
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "test_sheet"

print("Generating quite big excel file")
for i in range(1, 10000):
    for j in range(1, 20):
        ws.cell(row=i, column=j).value = "Cell[{},{}]".format(i, j)

print("Saving test excel file")
wb.save('test.xlsx')

def FindXlCell(search_str, last_r):
    t = datetime.datetime.utcnow()
    for row in ws.iter_rows(row_offset=last_r):
        for cell in row:
            if search_str == cell.value:
                print(search_str, last_r, cell.row, datetime.datetime.utcnow() - t)
                last_r = cell.row
                return last_r
    print("record not found ", search_str, datetime.datetime.utcnow() - t)
    return 1

wb = openpyxl.load_workbook("test.xlsx", data_only=True)
t = datetime.datetime.utcnow()
ws = wb["test_sheet"]
last_row = 1

print("Parsing excel file in a loop for 3 cells")
for i in range(1, 100, 1):
    last_row = FindXlCell("Cell[0,0]", last_row)
    last_row = FindXlCell("Cell[1000,6]", last_row)
    last_row = FindXlCell("Cell[6000,6]", last_row)
Looping over a worksheet multiple times is inefficient. The reason the search gets progressively slower looks to be that more and more memory is used on each loop. This is because last_row = FindXlCell("Cell[0,0]", last_row) means that the next search will create new cells at the end of the rows: openpyxl creates cells on demand, because rows can be technically empty while the cells in them are still addressable. At the end of your script the worksheet has a total of 598000 rows, but you always start searching from A1.
If you wish to search a large file for text multiple times then it would probably make sense to create a matrix keyed by the text with the coordinates being the value.
Something like:
matrix = {}
for row in ws:
    for cell in row:
        matrix[cell.value] = (cell.row, cell.col_idx)
In a real-world example you'd probably want to use a defaultdict to be able to handle multiple cells with the same text.
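A minimal sketch of that defaultdict variant:

from collections import defaultdict

matrix = defaultdict(list)
for row in ws:
    for cell in row:
        # keep every coordinate holding this text, not just the last one seen
        matrix[cell.value].append((cell.row, cell.col_idx))

print(matrix["Cell[1000,6]"])  # all (row, column) pairs with that value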
This could be combined with read-only mode for a minimal memory footprint. Except, of course, if you want to edit the file.
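For example, loading the question's test file read-only might look like:

wb = openpyxl.load_workbook("test.xlsx", read_only=True, data_only=True)
ws = wb["test_sheet"]
# iterate ws as above to build the lookup matrix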
I'm trying to use the openpyxl module to take a spreadsheet, see if there are empty cells in a certain column (in this case, column E), and then copy the rows that contain those empty cells to a new spreadsheet. The code runs without traceback, but the resulting file won't open. What's going on?
Here's my code:
# import the openpyxl module
import openpyxl

# First create a new workbook & sheet
newwb = openpyxl.Workbook()
newwb.save('TESTINGTHISTHING.xlsx')
newsheet = newwb.get_sheet_by_name('Sheet')

# open the original file
wb = openpyxl.load_workbook('OriginalWorkbook.xlsx')
# create a sheet object
sheet = wb.get_sheet_by_name('Sheet1')

# Find out how many cells of a certain column are left blank,
# and what rows they're in
count = 0
listofrows = []
for row in range(2, sheet.get_highest_row() + 1):
    company = sheet['E' + str(row)].value
    if company == None:
        listofrows.append(row)
        count += 1
print listofrows
print count

# Put the values of the rows with blank company names into the new sheet
# (note: j is reset to 0 on every pass, so as written this always copies
# the first flagged row's column A value)
for i in range(len(listofrows)):
    j = 0
    newsheet['A' + str(i + 1)] = sheet['A' + str(listofrows[j])].value
    j += 1

newwb.save('TESTINGTHISTHING.xlsx')
Please help!
I just ran your program with a mock document and was able to open my output file without problems. Your issue probably lies with your Excel or openpyxl version.
Please provide your software versions, in addition to your source document, so I can look further into the issue.
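For instance, a quick way to report the installed openpyxl version:

import openpyxl
print openpyxl.__version__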
You can always update openpyxl from the command line with:

c:\Python27\Scripts> pip install openpyxl --upgrade