I am writing a Python script that uses the openpyxl module to open an Excel sheet containing a column of data, which I'd then like to put into a list for later use. I've written a simple function to complete this task, but I wasn't sure whether there was a more effective approach. Here is what I have right now:
import pyautogui
import openpyxl

wb = openpyxl.load_workbook('120.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')

def dataGrab(sheet):
    counter = 0
    aData = []
    while 1 > 0:
        cell = "A" + str(counter)
        print(cell)
        wo = sheet[cell].value
        print(wo)
        if wo is None:
            break
        aData.append(wo)
        counter += 1
    print(aData)
    return aData

aData = dataGrab(sheet)
So in essence, the workbook is opened and then an infinite loop kicks off. Inside the loop I build the cell identifier ("A" + counter) to make strings like "A1" and so on, pass that to the line that gets the value of the cell, and append that value to an existing list. The loop runs until wo is None.
This works perfectly fine and yields no errors; I am mostly looking for suggestions to improve the function, or to know whether I "did it right". Thank you for your feedback!
PS: The line wb = openpyxl.load_workbook('120.xlsx') takes a good 30-45 seconds to run. Any way to make that go faster? Not a priority.
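For comparison, here is a minimal sketch of the same task written against openpyxl's own iteration API, assuming the data sits in column A of 'Sheet1' and ends at the first empty cell. values_only requires openpyxl 2.6 or newer, and read_only=True typically makes load_workbook much faster on large files, which also addresses the PS:

import openpyxl

# read_only avoids building the full in-memory model, which usually
# cuts a 30-45 second load time substantially
wb = openpyxl.load_workbook('120.xlsx', read_only=True)
sheet = wb['Sheet1']  # wb['name'] replaces the deprecated get_sheet_by_name()

def data_grab(sheet):
    a_data = []
    # iterate over column A only, getting plain values rather than Cell objects
    for (value,) in sheet.iter_rows(min_col=1, max_col=1, values_only=True):
        if value is None:  # stop at the first empty cell, as in the original
            break
        a_data.append(value)
    return a_data

a_data = data_grab(sheet)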
My project is to process several Excel files. To do this, I would like to create a single file that gathers some of the data from those files, so that I have my database; the goal is to produce graphs from these data, all automatically.
I wrote this program in Python. However, it takes 20 minutes to run. How can I optimize it?
In addition, some files contain identical variables. I would like the identical variables not to be repeated in the final file. How can I do that?
Here is my program:
import os
import xlrd
import xlsxwriter
from xlrd import open_workbook

wc = xlrd.open_workbook("U:\\INSEE\\table-appartenance-geo-communes-16.xls")
sheet0 = wc.sheet_by_index(0)

# creation of the output workbook
with xlsxwriter.Workbook('U:\\INSEE\\Department61.xlsx') as bdd:
    dept61 = bdd.add_worksheet('deprt61')
    folder_path = "U:\\INSEE\\2013_telechargement2016"
    col = 8
    constante3 = 0
    lastCol = 0
    listeV = list()
    for path, dirs, files in os.walk(folder_path):
        for filename in files:
            filename = os.path.join(path, filename)
            wb = xlrd.open_workbook(filename, '.xls')
            sheet1 = wb.sheet_by_index(0)
            lastRow = sheet1.nrows
            lastCol = sheet1.ncols
            colDep = None
            firstRow = None
            for ligne in range(0, lastRow):
                for col2 in range(0, lastCol):
                    if sheet1.cell_value(ligne, col2) == 'DEP':
                        colDep = col2
                        firstRow = ligne
                        break
                if colDep is not None:
                    break
            col = col - colDep - 2 - constante3
            constante3 = 0
            for nCol in range(colDep + 2, lastCol):
                constante = 1
                for ligne in range(firstRow, lastRow):
                    if sheet1.cell(ligne, colDep).value == '61':
                        Q = (sheet1.cell(firstRow, nCol).value in listeV)
                        if Q == False:
                            V = sheet1.cell(firstRow, nCol).value
                            listeV.append(V)
                            dept61.write(0, col + nCol, sheet1.cell(firstRow, nCol).value)
                            for ligne in range(ligne, lastRow):
                                if sheet1.cell(ligne, colDep).value == '61':
                                    dept61.write(constante, col + nCol, sheet1.cell(ligne, nCol).value)
                                    constante = constante + 1
                        elif Q == True:
                            constante3 = constante3 + 1  # I have a problem here. I would like to count the number of variables that already exist, but I get huge numbers.
                        break
            col = col + lastCol
    bdd.close()
Thank you in advance for your help. :)
This one may be too broad for SO, so here are some pointers for where you can optimise. Maybe add a sample screenshot of what the sheets look like.
Regarding if sheet1.cell_value(ligne, col2) == 'DEP': can 'DEP' occur multiple times in a sheet? If it will definitely occur only once, and that occurrence is where you get the values for both colDep and firstRow, then break out of both loops: add a break to end the inner loop, then check a flag value and break out of the outer loop before it iterates again. Like so:
colDep = None    # initialise to None
firstRow = None  # initialise to None
for ligne in range(0, lastRow):
    for col2 in range(0, lastCol):
        if sheet1.cell_value(ligne, col2) == 'DEP':
            colDep = col2
            firstRow = ligne
            break  # break out of the `col2 in range(0,lastCol)` loop
    if colDep is not None:  # or just `if colDep:` if colDep will never be 0.
        break  # break out of the `ligne in range(0,lastRow)` loop
I think the range in for ligne in range(0,lastRow): in your write-to-bdd blocks should start at firstRow, since you know that rows 0 to firstRow-1 of sheet1 are empty; you have just scanned them to find the header:
for ligne in range(firstRow, lastRow):
That will avoid wasting time reading empty header rows.
Other considerations for cleaner code:
Use the with xlsxwriter.Workbook('U:\\INSEE\\Department61.xlsx') as bdd: syntax for clarity.
Always use double backslashes (\\) inside Windows path strings, even when the backslash does not precede an escape character: 'U:\\INSEE\\Department61.xlsx'
You've used sheet1.cell_value() as well as sheet1.cell().value for your read operations. Pick one, unless you need the extended Cell info in the value == '61' case.
Read PEP-8 for how to write more readable code.
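One more aside on the duplicate check Q = (sheet1.cell(firstRow, nCol).value in listeV): membership tests on a list scan every element, so with many columns a set is a better container for listeV. A small self-contained sketch (the header values here are made up):

# deduplicating headers with a set: `in` on a set is O(1) on average,
# versus O(n) on a list such as listeV
seen = set()
headers = ['POP', 'DEP', 'POP', 'REG']  # illustrative header values

for h in headers:
    if h in seen:
        continue  # already written: skip instead of re-writing the column
    seen.add(h)
    print('new column: ' + h)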
The task consists of creating a function, def fetchcolvalue(kids_agefile, kidname), that searches for the name of the kid in the CSV file and gives his age.
The CSV file is structured like this:
Nicholas,12
Matthew,6
Lorna,12
Michael,8
Sebastian,8
Joseph,10
Ahmed,15
while the code that I tried is this:
def fetchcolvalue(kids_agefile, kidname):
    import csv
    file = open(kids_agefile, 'r')
    ct = 0
    for row in csv.reader(file):
        while True:
            print(row[0])
            if row[ct] == kidname:
                break
The frustrating thing is that it doesn't give me any error, just an infinite loop; I think that's where I'm going wrong.
So far, what I've learnt from the book is only loops (while and for) and if-elif-else chains, besides basic CSV and file manipulation, so I can't really figure out how to solve the problem with only those tools.
Please note that the function has to work with a generic two-column CSV file, not only the kids' one.
The while True in your loop is going to make you loop forever (no variables change within it). Just remove it:
for row in csv.reader(file):
    if row[ct] == kidname:
        break
else:
    print("{} not found".format(kidname))
The csv file is iterated over, and as soon as row[ct] equals kidname the loop breaks.
I would add the else clause so you know when the file has been completely scanned without finding the kid's name (just to expose a little-known usage of else after a for loop: if no break is encountered, execution falls into the else branch).
EDIT: you could do it in one line using any and a generator expression:
any(kidname == row[ct] for row in csv.reader(file))
It will return True if any first cell matches, and it is probably faster too.
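A hedged usage note: the one-liner only tells you whether the name exists, not the age, so it suits an existence check. For instance, assuming the sample data above is saved as kids.csv (the filename is illustrative):

import csv

with open('kids.csv', newline='') as f:
    # `row and` guards against blank lines in the file
    found = any(row and row[0] == 'Ahmed' for row in csv.reader(f))
print(found)  # True, since 'Ahmed' appears in the first column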
This should work. In your example the for loop sets row to the first row of the file, then starts the while loop; the while loop never updates row, so it runs forever. Just remove the while loop:
def fetchcolvalue(kids_agefile, kidname):
    import csv
    file = open(kids_agefile, 'r')
    ct = 0
    for row in csv.reader(file):
        if row[ct] == kidname:
            print(row[1])
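Building on both answers, here is a minimal sketch of a version that returns the value instead of printing it and closes the file via a context manager; it works for any two-column CSV, as the question requires (returning None for a missing key is my own convention, not from the original post):

import csv

def fetchcolvalue(filename, key):
    # look up `key` in the first column and return the matching second column
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            if row and row[0] == key:
                return row[1]
    return None  # key not found anywhere in the file

print(fetchcolvalue('kids.csv', 'Lorna'))  # -> '12'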
I am trying to write a simple script that takes a csv as input and writes it into a single spreadsheet document. I have it working, but the script is slow: it takes around 10 minutes to write circa 350 lines into two worksheets.
Here is the script I have:
#!/usr/bin/python
import json, sys
import gspread
from oauth2client.client import SignedJwtAssertionCredentials

json_key = json.load(open('client_secrets.json'))
scope = ['https://spreadsheets.google.com/feeds']

# change to True to see debug messages
DEBUG = False

def updateSheet(csv, sheet):
    linelen = 0
    counter1 = 1  # starting column in spreadsheet: A
    counter2 = 1  # starting row in spreadsheet: 1
    counter3 = 0  # helper for iterating through line entries
    credentials = SignedJwtAssertionCredentials(json_key['client_email'], json_key['private_key'], scope)
    gc = gspread.authorize(credentials)
    wks = gc.open("Test Spreadsheet")
    worksheet = wks.get_worksheet(sheet)
    if worksheet is None:
        if sheet == 0:
            worksheet = wks.add_worksheet("First Sheet", 1, 8)
        elif sheet == 1:
            worksheet = wks.add_worksheet("Second Sheet", 1, 8)
        else:
            print "Error: spreadsheet does not exist"
            sys.exit(1)
    worksheet.resize(1, 8)
    for i in csv:
        line = i.split(",")
        linelen = len(line) - 1
        if counter3 > linelen:
            counter3 = 0
        if counter1 > linelen:
            counter1 = 1
        if DEBUG:
            print "entry length (starting from 0): ", linelen
            print "line: ", line
            print "counter1: ", counter1
            print "counter3: ", counter3
        while counter3 <= linelen:
            if DEBUG:
                print "writing line: ", line[counter3]
            worksheet.update_cell(counter2, counter1, line[counter3].rstrip('\n'))
            counter3 += 1
            counter1 += 1
        counter2 += 1
    worksheet.resize(counter2, 8)
I am a sysadmin, so I apologize in advance for the rough code.
Anyway, the script takes the csv line by line, splits on commas, and writes cell by cell, hence the time it takes. The idea is for cron to execute this once a day, removing older entries and writing new ones; that is why I use resize().
Now, I am wondering whether there is a better way to take a whole csv line and write it to the sheet with each value in its own cell, avoiding the cell-by-cell writes I have now? That would significantly reduce the execution time.
Thanks!
Yes, this can be done. I upload in chunks of 100 lines by 12 rows and it handles them fine, though I'm not sure how well this scales to something like a whole csv in one go. Also be aware that the default length of a sheet is 1000 rows, and you will get an error if you try to reference a row outside of that range (so use add_rows beforehand to ensure there is space). Simplified example:
data_to_upload = [[1, 2], [3, 4]]
column_names = ['', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
                'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA']

# To make it dynamic, assuming that all rows contain the same number of elements
cell_range = 'A1:' + str(column_names[len(data_to_upload[0])]) + str(len(data_to_upload))
cells = worksheet.range(cell_range)

# Flatten the nested list ('cells' will not by default accept xy indexing);
# flatten() is assumed to turn [[1, 2], [3, 4]] into [1, 2, 3, 4]
flattened_data = flatten(data_to_upload)

# Go by the length of flattened_data, not cells: if you chunk large data into
# blocks, all excess cells will take an empty value, whereas doing it the
# other way around raises an index-out-of-range error
for x in range(len(flattened_data)):
    # decode assumes byte-string values; see the note below about special characters
    cells[x].value = flattened_data[x].decode('utf-8')

worksheet.update_cells(cells)
If your rows are of different lengths, then clearly you would need to insert the appropriate number of empty strings into cells so that the two lists don't get out of sync. I use decode for convenience: I kept crashing on special characters, so it seems best to just have it in.
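Putting the two together, here is a hedged sketch of how the question's CSV loop might map onto a single batched update. It assumes the same era of gspread as the answer (range() with A1 notation plus update_cells()) and at most 26 columns, so a plain letter computation suffices; the function name is my own:

def upload_csv_lines(worksheet, csv_lines):
    # csv_lines: an iterable of raw comma-separated strings, as in updateSheet()
    rows = [line.rstrip('\n').split(',') for line in csv_lines]
    n_rows = len(rows)
    n_cols = max(len(r) for r in rows)
    worksheet.resize(n_rows, n_cols)
    cells = worksheet.range('A1:%s%d' % (chr(ord('A') + n_cols - 1), n_rows))
    # pad short rows with '' so the flat list stays in sync with the cell list
    flat = [row[i] if i < len(row) else '' for row in rows for i in range(n_cols)]
    for cell, value in zip(cells, flat):
        cell.value = value
    worksheet.update_cells(cells)  # one write call instead of one per cell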
I'm an inexperienced coder working in Python. I wrote a script to automate a process in which certain information is ripped from a webpage and pasted into a new Excel spreadsheet. I've written and executed the code, but the spreadsheet I designated to receive the data is completely empty. Worst of all, there is no traceback error. Would you help me find the problem in my code? And how do you generally solve your own problems when no traceback is provided?
import xlsxwriter, urllib.request, string

def main():
    # gets the URL for the expert page
    open_sesame = urllib.request.urlopen('https://aries.case.com.pl/main_odczyt.php?strona=eksperci')
    # reads the expert page
    readpage = open_sesame.read()
    # opens up a new file in excel
    workbook = xlsxwriter.Workbook('expert_book.xlsx')
    # adds worksheet to file
    worksheet = workbook.add_worksheet()
    # initializing the variables used to move names and dates
    # in the excel spreadsheet
    boxcoA = ""
    boxcoB = ""
    # initializing expert attribute variables and lists
    expert_name = ""
    url_ticker = 0
    name_ticker = 0
    raw_list = []
    url_list = []
    name_list = []
    date_list = []
    # this loop goes through and finds all the lines
    # that contain the expert URL and name and saves them to raw_list::
    # raw_list loop
    for i in readpage:
        i = str(i)
        if i.startswith('<tr><td align=left><a href='):
            raw_list += i
    # this loop goes through the lines in raw_list and extracts
    # the name of the expert, saving it to a list::
    # name_list loop
    for n in raw_list:
        name_snip = n.split('target=_blank>','</a></td><')[1]
        name_list += name_snip
    # this loop fills a list with the dates the profiles were last updated::
    # date_list loop
    for p in raw_list:
        url_snipoff = p[28:]
        url_snip = url_snipoff.split('"')[0]
        url_list += url_snip
        expert_url = 'https://aries.case.com.pl/' + url_list[url_ticker]
        open_expert = urllib2.openurl(expert_url)
        read_expert = open_expert.read()
        for i in read_expert:
            if i.startswith('<p align=left><small>Last update:'):
                update = i.split('Last update:','</small>')[1]
                open_expert.close()
                date_list += update
    # now that we have a list of expert names and a list of profile update dates
    # we can work on populating the excel spreadsheet
    # this operation will iterate just as long as the list is long
    # meaning that it will populate the excel spreadsheet
    # with all of the names and dates that we wanted
    for z in raw_list:
        boxcoA = string('A', z)
        boxcoB = string('B', z)
        worksheet.write(boxcoA, name_list[z])
        worksheet.write(boxcoB, date_list[z])
    workbook.close()
    print('Operation Complete')

main()
The lack of a traceback only means your code raises no exceptions. It does not mean your code is logically correct.
I would look for logic errors by adding print statements, or using a debugger such as pdb or pudb.
One problem I notice with your code is that the first loop presumes that i is a line, whereas it is actually a single element of the response body: read() returns bytes, so iterating over readpage yields one integer at a time, and str(i) turns each into a number string that can never match your startswith test. You might find splitlines() more useful.
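A hedged sketch of that suggestion (the utf-8 encoding is an assumption; errors='replace' papers over any mismatch): decode the response once, split it into real lines, and append whole lines. Note that raw_list += i extends the list with individual characters, whereas append adds the line as one element:

import urllib.request

open_sesame = urllib.request.urlopen('https://aries.case.com.pl/main_odczyt.php?strona=eksperci')
readpage = open_sesame.read().decode('utf-8', errors='replace')  # bytes -> str, once

raw_list = []
for line in readpage.splitlines():  # iterate over lines, not single bytes
    if line.startswith('<tr><td align=left><a href='):
        raw_list.append(line)  # append the whole line, not its characters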
If there is no traceback, then no exception is being raised; that does not mean the program did what you intended.
Most likely something has gone wrong in your scraping/parsing code, and raw_list or the other lists aren't being populated.
Try printing out the data that should be written to the worksheet in the last loop to see whether there is any data to write at all.
If you aren't writing data to the worksheet, then it will be empty.
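For instance, a tiny sketch of that check, dropped into main() just before the final write loop (the list names come from the question's code):

# if nothing prints here, the parsing loops above never populated the lists
for z, (name, date) in enumerate(zip(name_list, date_list)):
    print(z, repr(name), repr(date))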
I am writing a program to analyze some of our invoice data. Basically, I need to take an array containing each individual invoice we sent out over the past year and break it down into twelve arrays containing the invoices for each month, using the dateSeperate() function, so that monthly_transactions[0] returns January's transactions, monthly_transactions[1] returns February's, and so forth.
I've managed to get it working so that dateSeperate() returns monthly_transactions[0] as the January transactions. However, once all of the January data is entered, I attempt to append to the monthly_transactions array on line 44, and this just causes the program to break and become unresponsive. The code still executes and doesn't return an error, but Python freezes and I have to force quit out of it.
I've been writing to the global array monthly_transactions. dateSeperate() runs fine as long as I don't include the last else statement; if I do include it, monthly_transactions[0] returns an array containing all of the January invoices, and then Python freezes. The issue arises in that last else statement.
Can anyone shed any light on this?
I have written a program that defines all of the arrays I'm going to be using (yes, I know global arrays aren't good; I'm a marketer trying to learn programming, so any input you could give me on how to improve this would be much appreciated):
import csv
line_items = []
monthly_transactions = []
accounts_seperated = []
Then I import all of my data and place it into the line_items array
def csv_dict_reader(file_obj):
    global board_info
    reader = csv.DictReader(file_obj, delimiter=',')
    for line in reader:
        item = []
        item.append(line["company id"])
        item.append(line["user id"])
        item.append(line["Amount"])
        item.append(line["Transaction Date"])
        item.append(line["FIrst Transaction"])
        line_items.append(item)

if __name__ == "__main__":
    with open("ChurnTest.csv") as f_obj:
        csv_dict_reader(f_obj)
# formats the transaction date data to make it more readable
def dateFormat():
    for i in range(len(line_items)):
        ddmmyyyy = line_items[i][3]
        yyyymmdd = ddmmyyyy[6:] + "-" + ddmmyyyy[:2] + "-" + ddmmyyyy[3:5]
        line_items[i][3] = yyyymmdd
# takes the line_items array and splits it into the new array monthly_transactions,
# where each value holds one month of data
def dateSeperate():
    for i in range(len(line_items)):
        # if there are no values in monthly_transactions, add the first line item
        if len(monthly_transactions) == 0:
            test = []
            test.append(line_items[i])
            monthly_transactions.append(test)
        # check whether the line item's year and month match a value already in the monthly_transactions array
        else:
            for j in range(len(monthly_transactions)):
                line_year = line_items[i][3][:2]
                line_month = line_items[i][3][3:5]
                array_year = monthly_transactions[j][0][3][:2]
                array_month = monthly_transactions[j][0][3][3:5]
                #print(line_year, array_year, line_month, array_month)
                # if it does, add that line item to that month
                if line_year == array_year and line_month == array_month:
                    monthly_transactions[j].append(line_items[i])
                # otherwise, create a new sub-array for that month
                else:
                    monthly_transactions.append(line_items[i])

dateFormat()
dateSeperate()
print(monthly_transactions)
I would really, really appreciate any thoughts or feedback you guys could give me on this code.
Based on the comments on the OP, your csv_dict_reader function seems to do exactly what you want it to do, at least inasmuch as it appends data from its argument csv file to the top-level variable line_items. You said yourself that if you print out line_items, it shows the data that you want.
"But appending doesn't work." I take it you mean that appending line_items to monthly_transactions isn't being done. The reason is that you didn't tell the program to do it! The appending you're talking about is done as part of your dateSeperate function, but you still need to call that function.
I'm not sure exactly how you want to use your dateFormat and dateSeperate functions, but in order to use them you need to call them somewhere in your main code, i.e. dateFormat() and dateSeperate().
EDIT: You've created the potential for an endless loop in the last else: section, which appends to monthly_transactions whenever the line and array year/month aren't equal. This is a problem because it happens inside the loop for j in range(len(monthly_transactions)):, and that loop will never reach the end if the length of monthly_transactions grows by 1 on every pass.
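To make the fix concrete, here is a minimal sketch of a grouping approach that never appends to the list being iterated over. It groups on the 'yyyy-mm' prefix that dateFormat() produces; the function name date_separate and the dictionary-based design are my own, not from the original post:

from collections import defaultdict

def date_separate(line_items):
    by_month = defaultdict(list)
    for item in line_items:
        key = item[3][:7]  # 'yyyy-mm' prefix of the date set by dateFormat()
        by_month[key].append(item)
    # one sub-list per month, in chronological order: index 0 is the earliest
    # month present, matching the intent of monthly_transactions[0]
    return [by_month[k] for k in sorted(by_month)]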