My project involves processing several Excel files. I would like to create a single file that gathers selected data from the source files, so that it serves as my database. The goal is to produce graphs from this data, all automatically.
I wrote this program in Python. However, it takes 20 minutes to run. How can I optimize it?
In addition, some files contain identical variables. I would like these identical variables not to be repeated in the final file. How can I do that?
Here is my program:
import os
import xlrd
import xlsxwriter
from xlrd import open_workbook

wc = xlrd.open_workbook("U:\\INSEE\\table-appartenance-geo-communes-16.xls")
sheet0 = wc.sheet_by_index(0)

# creation
with xlsxwriter.Workbook('U:\\INSEE\\Department61.xlsx') as bdd:
    dept61 = bdd.add_worksheet('deprt61')
    folder_path = "U:\\INSEE\\2013_telechargement2016"
    col = 8
    constante3 = 0
    lastCol = 0
    listeV = list()

    for path, dirs, files in os.walk(folder_path):
        for filename in files:
            filename = os.path.join(path, filename)
            wb = xlrd.open_workbook(filename, '.xls')
            sheet1 = wb.sheet_by_index(0)
            lastRow = sheet1.nrows
            lastCol = sheet1.ncols

            colDep = None
            firstRow = None
            for ligne in range(0, lastRow):
                for col2 in range(0, lastCol):
                    if sheet1.cell_value(ligne, col2) == 'DEP':
                        colDep = col2
                        firstRow = ligne
                        break
                if colDep is not None:
                    break

            col = col - colDep - 2 - constante3
            constante3 = 0
            for nCol in range(colDep + 2, lastCol):
                constante = 1
                for ligne in range(firstRow, lastRow):
                    if sheet1.cell(ligne, colDep).value == '61':
                        Q = (sheet1.cell(firstRow, nCol).value in listeV)
                        if Q == False:
                            V = sheet1.cell(firstRow, nCol).value
                            listeV.append(V)
                            dept61.write(0, col + nCol, sheet1.cell(firstRow, nCol).value)
                            for ligne in range(ligne, lastRow):
                                if sheet1.cell(ligne, colDep).value == '61':
                                    dept61.write(constante, col + nCol, sheet1.cell(ligne, nCol).value)
                                    constante = constante + 1
                        elif Q == True:
                            constante3 = constante3 + 1  # I have a problem here. I would like to count the number of variables that already exist, but I get huge numbers.
                            break
            col = col + lastCol
    bdd.close()
Thank you in advance for your help. :)
This one may be too broad for SO, so here are some pointers for where you can optimise. Maybe add a sample screenshot of what the sheets look like.
Regarding if sheet1.cell_value(ligne, col2) == 'DEP': can 'DEP' occur multiple times in a sheet? If it will definitely occur only once, and that is where you get your values for both colDep and firstRow, then break out of both loops: add a break to end the inner loop, then check a flag value and break out of the outer loop before it iterates again. Like so:
colDep = None    # initialise to None
firstRow = None  # initialise to None
for ligne in range(0, lastRow):
    for col2 in range(0, lastCol):
        if sheet1.cell_value(ligne, col2) == 'DEP':
            colDep = col2
            firstRow = ligne
            break  # break out of the `col2 in range(0, lastCol)` loop
    if colDep is not None:  # or just `if colDep:` if colDep will never be 0.
        break  # break out of the `ligne in range(0, lastRow)` loop
I think the range in for ligne in range(0,lastRow): in your write-to-bdd blocks should start at firstRow since you know that 0 to firstRow-1 will be empty in sheet1 which you've just read to look for the header.
for ligne in range(firstRow, lastRow):
That will avoid wasting time reading empty header rows.
Other considerations for cleaner code:
Use the with xlsxwriter.Workbook('U:\\INSEE\\Department61.xlsx') as bdd: syntax for clarity,
and always use double backslashes \\ inside strings, even when not preceding a control character: 'U:\\INSEE\\Department61.xlsx'.
You've used sheet1.cell_value() as well as sheet1.cell().value for your read operations. Pick one, unless you needed extended Cell info in the value=='61' case.
Read PEP-8 for how to write more readable code.
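On the duplicate-variable point from your question (and the counting problem flagged in your code comment): one common approach is to keep the headers you have already written in a set, test membership once per column rather than once per data row, and increment the duplicate counter only in that per-column test. A tiny standalone illustration; the header names below are made up, not taken from your files:

import collections  # not required, just the standard library

seen_headers = set()   # headers already written to the output sheet
skipped = 0            # how many duplicate columns were skipped

sheets_headers = [
    ["CODGEO", "P13_POP", "P13_LOG"],   # header row of a first file (illustrative names)
    ["CODGEO", "P13_POP", "P13_MEN"],   # a second file repeats two of them
]

for headers in sheets_headers:
    for header in headers:
        if header in seen_headers:
            skipped += 1   # each duplicate column is counted once, not once per data row
            continue
        seen_headers.add(header)
        # ...write the header (and later its column values) to the output sheet here...

print(sorted(seen_headers))  # ['CODGEO', 'P13_LOG', 'P13_MEN', 'P13_POP']
print(skipped)               # 2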
I have two large .csv files in which I would like to compare two columns row by row using either csv DictReader or maybe even pandas.
I need to check that all the rows of a particular column are identical in both files. I've seen some suggestions here, but none worked in my situation. The problem appears to be incorrect iteration over the second opened file, even though the files are identical.
I've done search-and-modify tasks really quickly with openpyxl, but since the csv files are several hundred MB in size, converting the csv to Excel, even at runtime, doesn't seem like a good decision.
Here is what I have right now code-wise:
import csv


class CsvCompareTester:
    work_csv_path = None
    test_csv_path = None

    @staticmethod
    def insert_file_paths():
        print()
        print('Enter the full absolute path of the WORK .csv file:')
        CsvCompareTester.work_csv_path = input()
        print('Enter the full absolute path of the TEST .csv file:')
        CsvCompareTester.test_csv_path = input()

    @staticmethod
    def compare_files(work_csv_file, test_csv_file):
        work_csv_obj = csv.DictReader(work_csv_file, delimiter=";")
        test_csv_obj = csv.DictReader(test_csv_file, delimiter=";")
        for work_row in work_csv_obj:
            for test_row in test_csv_obj:
                if work_row == test_row:
                    print('ALL CLEAR')
                    print(str(work_row))
                    print(str(test_row))
                    print()
                else:
                    print("STRINGS DON'T MATCH")
                    print(str(work_row))
                    print(str(test_row))
                    print()


if __name__ == "__main__":
    csv_tester = CsvCompareTester()
    csv_tester.insert_file_paths()

    with open(CsvCompareTester.work_csv_path) as work_file:
        with open(CsvCompareTester.test_csv_path) as test_file:
            csv_tester.compare_files(work_file, test_file)
How do I iterate over the .csv file rows while also being able to address particular rows and columns by key or value (which could definitely reduce the number of useless iterations)?
For some reason, in the code above, no row string from the first file matches the corresponding one from the second file. The files are identical and have the same order of entries; I've double-checked it.
Why isn't the second file iterated over like the first, from beginning to end?
The problem is with how you're looping over the files. The way you have it, an attempt is made to compare each row of the first file to every row of the second one. You instead need to fetch rows from both files in lock-step, and a good way to do that is with the built-in zip() function.
So do this instead:
    @staticmethod
    def compare_files(work_csv_file, test_csv_file):
        work_csv_obj = csv.DictReader(work_csv_file, delimiter=";")
        test_csv_obj = csv.DictReader(test_csv_file, delimiter=";")
#        for work_row in work_csv_obj:
#            for test_row in test_csv_obj:
        for work_row, test_row in zip(work_csv_obj, test_csv_obj):
            if work_row == test_row:
                print('ALL CLEAR')
                print(str(work_row))
                print(str(test_row))
                print()
            else:
                print("STRINGS DON'T MATCH")
                print(str(work_row))
                print(str(test_row))
                print()
By the way, even though it may not be causing any problems yet, I also noticed you're not opening the two files in the way shown in the csv.DictReader documentation: you left out the newline='' argument.
Here's the proper way of doing that:
if __name__ == "__main__":
    csv_tester = CsvCompareTester()
    csv_tester.insert_file_paths()

#    with open(CsvCompareTester.work_csv_path) as work_file:
#        with open(CsvCompareTester.test_csv_path) as test_file:
    with open(CsvCompareTester.work_csv_path, newline='') as work_file:
        with open(CsvCompareTester.test_csv_path, newline='') as test_file:
            csv_tester.compare_files(work_file, test_file)
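One thing to be aware of: zip() stops at the end of the shorter input, so if one file has extra rows they are silently ignored. If you also want a length mismatch to be reported, itertools.zip_longest can be used instead. A small standalone sketch of the idea, with made-up row dictionaries standing in for the two DictReader objects:

from itertools import zip_longest

# Stand-in rows; in the real script these would come from the two DictReaders.
work_rows = [{'id': '1', 'value': 'a'}, {'id': '2', 'value': 'b'}, {'id': '3', 'value': 'c'}]
test_rows = [{'id': '1', 'value': 'a'}, {'id': '2', 'value': 'b'}]

for work_row, test_row in zip_longest(work_rows, test_rows):
    if work_row is None or test_row is None:
        print('LENGTH MISMATCH')       # one file ran out of rows before the other
    elif work_row == test_row:
        print('ALL CLEAR')
    else:
        print("STRINGS DON'T MATCH")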
The exercise consists of creating a function (taking two arguments) that searches for the name of a kid in the CSV file and gives his age.
The CSV file is structured like this:
Nicholas,12
Matthew,6
Lorna,12
Michael,8
Sebastian,8
Joseph,10
Ahmed,15
The code that I tried is this:
def fetchcolvalue(kids_agefile, kidname):
    import csv
    file = open(kids_agefile, 'r')
    ct = 0
    for row in csv.reader(file):
        while True:
            print(row[0])
            if row[ct] == kidname:
                break
The frustrating thing is that it doesn't give me any error, just an infinite loop: I think that's where I'm going wrong.
So far, what I have learnt from the book is only loops (while and for) and if-elif-else branches, besides basic CSV and file manipulation operations, so I can't really figure out how to solve the problem with only those tools.
Please note that the function has to work with a generic two-column CSV file, not only the kids' one.
The while True in your loop is going to make you loop forever (no variables are changed within the loop). Just remove it:
for row in csv.reader(file):
    if row[ct] == kidname:
        break
else:
    print("{} not found".format(kidname))
The csv file is iterated over, and as soon as row[ct] equals kidname the loop breaks.
I would add an else clause so you know when the file has been completely scanned without finding the kid's name (just to show a little-known use of else after a for loop: if no break is encountered, execution goes into the else branch).
EDIT: you could do it in one line using any and a generator expression:
any(kidname == row[ct] for row in csv.reader(file))
This will return True if any first-column cell matches kidname, and it is probably faster too.
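If you want the function to actually return the age rather than just report whether the name exists, the same generator-expression idea works with next(). A minimal sketch, assuming the two-column layout shown in the question (the file name 'kids.csv' is just an example):

import csv

def fetchcolvalue(kids_agefile, kidname):
    # Return the second-column value of the first row whose first column
    # equals kidname, or None if the name is not found.
    with open(kids_agefile, 'r', newline='') as file:
        return next((row[1] for row in csv.reader(file) if row[0] == kidname), None)

print(fetchcolvalue('kids.csv', 'Lorna'))  # -> '12' with the sample data above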
This should work. In your example the for loop sets row to the first row of the file and then starts the while loop; the while loop never updates row, so it runs forever. Just remove the while loop:
def fetchcolvalue(kids_agefile, kidname):
    import csv
    file = open(kids_agefile, 'r')
    ct = 0
    for row in csv.reader(file):
        if row[ct] == kidname:
            print(row[1])
I am writing a Python script that uses the OpenPyXL module to open an Excel sheet that has a column of data, which I'd then like to put into a list for later use. I've created a simple function to complete this task, but I wasn't sure whether there is a more effective way to approach it. Here is what I have right now:
import pyautogui
import openpyxl

wb = openpyxl.load_workbook('120.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')

def dataGrab(sheet):
    counter = 0
    aData = []
    while 1 > 0:
        cell = "A" + str(counter)
        print(cell)
        wo = sheet[cell].value
        print(wo)
        if wo is None:
            break
        aData.append(wo)
        counter += 1
    print(aData)
    return aData

aData = dataGrab(sheet)
So in essence the workbook is opened and then an infinite loop kicks off. Inside this loop I build the cell identifier ("A" + counter) to produce things like "A1" and so on, which is passed to the line that gets the value of the cell; that value is then appended to an existing list. The loop keeps going until wo is None.
This works perfectly fine and yields no errors; I am mostly looking for suggestions to improve the function, or to know whether I "did it right". Thank you for your feedback!
PS: The line wb = openpyxl.load_workbook('120.xlsx') takes a good 30-45 seconds to run. Any way to make that go faster? Not a priority.
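For what it's worth, two things that often help with this kind of task are iterating over the column with iter_rows instead of building coordinate strings by hand, and opening the workbook in read-only mode, which can speed up load_workbook considerably for large files. A rough sketch, assuming a reasonably recent openpyxl (values_only needs version 2.6 or later); whether read-only mode actually helps depends on the workbook:

import openpyxl

def data_grab(path, sheet_name='Sheet1'):
    # read_only=True streams the file instead of loading it fully into memory,
    # which usually makes load_workbook much faster for large workbooks.
    wb = openpyxl.load_workbook(path, read_only=True)
    sheet = wb[sheet_name]
    a_data = []
    # Walk down column A until the first empty cell.
    for (value,) in sheet.iter_rows(min_col=1, max_col=1, values_only=True):
        if value is None:
            break
        a_data.append(value)
    return a_data

a_data = data_grab('120.xlsx')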
I am trying to write a simple script that will take a csv as input and write it into a single spreadsheet document. I have it working now; however, the script is slow. It takes around 10 minutes to write roughly 350 lines across two worksheets.
Here is the script I have:
#!/usr/bin/python

import json, sys
import gspread
from oauth2client.client import SignedJwtAssertionCredentials

json_key = json.load(open('client_secrets.json'))
scope = ['https://spreadsheets.google.com/feeds']

# change to True to see Debug messages
DEBUG = False

def updateSheet(csv, sheet):
    linelen = 0
    counter1 = 1  # starting column in spreadsheet: A
    counter2 = 1  # starting row in spreadsheet: 1
    counter3 = 0  # helper for iterating through line entries

    credentials = SignedJwtAssertionCredentials(json_key['client_email'], json_key['private_key'], scope)
    gc = gspread.authorize(credentials)

    wks = gc.open("Test Spreadsheet")
    worksheet = wks.get_worksheet(sheet)
    if worksheet is None:
        if sheet == 0:
            worksheet = wks.add_worksheet("First Sheet", 1, 8)
        elif sheet == 1:
            worksheet = wks.add_worksheet("Second Sheet", 1, 8)
        else:
            print "Error: spreadsheet does not exist"
            sys.exit(1)

    worksheet.resize(1, 8)

    for i in csv:
        line = i.split(",")
        linelen = len(line) - 1

        if (counter3 > linelen):
            counter3 = 0
        if (counter1 > linelen):
            counter1 = 1

        if (DEBUG):
            print "entry length (starting from 0): ", linelen
            print "line: ", line
            print "counter1: ", counter1
            print "counter3: ", counter3

        while (counter3 <= linelen):
            if (DEBUG):
                print "writing line: ", line[counter3]
            worksheet.update_cell(counter2, counter1, line[counter3].rstrip('\n'))
            counter3 += 1
            counter1 += 1

        counter2 += 1

    worksheet.resize(counter2, 8)
I am a sysadmin, so I apologize in advance for the rough code.
Anyway, the script takes the csv line by line, splits on commas, and writes cell by cell, hence the time it takes. The idea is to have cron execute this once a day; it removes older entries and writes new ones, which is why I use resize().
Now, I am wondering whether there is a better way to take a whole csv line and write it to the sheet with each value in its own cell, avoiding the cell-by-cell writes I have now. That would significantly reduce the execution time.
Thanks!
Yes, this can be done. I upload in chunks of 100 rows by 12 columns and it handles that fine; I'm not sure how well it scales for something like a whole csv in one go, though. Also be aware that the default length of a sheet is 1000 rows, and you will get an error if you try to reference a row outside of this range (so use add_rows beforehand to ensure there is space). Simplified example:
data_to_upload = [[1, 2], [3, 4]]

column_names = ['', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
                'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA']

# To make it dynamic, assuming that all rows contain the same number of elements
cell_range = 'A1:' + str(column_names[len(data_to_upload[0])]) + str(len(data_to_upload))

cells = worksheet.range(cell_range)

# Flatten the nested list. 'cells' will not by default accept xy indexing.
flattened_data = [value for row in data_to_upload for value in row]

# Go based on the length of flattened_data, not cells.
# If you chunk large data into blocks, all excess cells will take an empty value;
# doing it the other way around will give an index out of range.
for x in range(len(flattened_data)):
    cells[x].value = flattened_data[x].decode('utf-8')

worksheet.update_cells(cells)
If your rows are of different lengths, then clearly you need to insert the appropriate number of empty strings so that the two lists don't get out of sync. I use decode for convenience because I kept crashing on special characters, so it seems best to just have it in.
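A minimal sketch of that padding step (standalone; pad_rows is just an illustrative helper name, not part of gspread):

def pad_rows(rows, fill=''):
    # Pad every row with empty strings so they all have the same length,
    # keeping the flattened list aligned with the rectangular cell range.
    width = max(len(row) for row in rows)
    return [row + [fill] * (width - len(row)) for row in rows]

print(pad_rows([[1, 2, 3], [4], [5, 6]]))
# -> [[1, 2, 3], [4, '', ''], [5, 6, '']]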
I have an Excel file, a.xls, in which there are 2000 rows containing one data item each, like this:
RowNum Item
1 'A'
2 'B'
3 'C'
.
.
.
2000 'xyz'
I have another file, b.xls, which contains about 6,300,000 rows of data. In this file there are occurrences of the items from a.xls. I need to pick all the rows from b.xls corresponding to each item in a.xls and store them in separate files called A.csv, B.csv, etc.
I did it using multi-threading, but it takes a long time to execute. Can anybody help me reduce the run time?
This is the code I have used; the following function is started in a thread:
import csv, os, sys
from decimal import Decimal
from datetime import datetime

def parseFromFile(pTickerList):
    global gSearchList
    lSearchList = gSearchList

    for lTickerName in pTickerList:
        c = csv.writer(open("op-new/" + lTickerName + ".csv", "wb"))
        c.writerow(["Ticker Name", "Time Stamp", "Price", "Size"])

        for line in lSearchList:
            lSplittedLine = line.split(",")
            lTickerNameFromSearchFile = lSplittedLine[0].strip()

            if lTickerNameFromSearchFile[0] == "#":
                continue
            if ord(lTickerName[0]) < ord(lTickerNameFromSearchFile[0]):
                break
            elif ord(lTickerName[0]) > ord(lTickerNameFromSearchFile[0]):
                continue

            if lTickerNameFromSearchFile == lTickerName:
                lTimeStamp = Decimal(float(lSplittedLine[1]))
                lPrice = lSplittedLine[2]
                lSize = lSplittedLine[4]

                try:
                    if str(lTimeStamp)[len(str(lTimeStamp)) - 2:] == "60":
                        lTimeStamp = str(lTimeStamp)[:len(str(lTimeStamp)) - 2] + "59.9"
                    if str(lTimeStamp).find(".") >= 0:
                        lTimeStamp = float(str(lTimeStamp).split(".")[0] + "." + str(lTimeStamp).split(".")[1][0])
                        lTimeStamp1 = "%.1f" % float(lTimeStamp)
                        lHumanReadableTimeStamp = datetime.strptime(str(lTimeStamp1), "%Y%m%d%H%M%S.%f")
                    else:
                        lHumanReadableTimeStamp = datetime.strptime(str(lTimeStamp), "%Y%m%d%H%M%S")
                except Exception, e:
                    exc_type, exc_obj, exc_tb = sys.exc_info()
                    fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
                    print(exc_type, fname, exc_tb.tb_lineno)
                    print line
                    print lTimeStamp
                    raw_input()

                c.writerow([lTickerNameFromSearchFile, lHumanReadableTimeStamp, lPrice, lSize])
It's hard to look through your code and fully understand it, because it references variables differently from your explanation, but I believe this approach will help you.
Start by reading all of a.csv into a set containing the values you want to look up; sets in Python have very fast lookup times. This will also help because, judging from your code above, you do a lot of repeated computation in your inner loop.
Then read through b.csv, checking each row against the a.csv set. Whenever you find a match, write the row out to the corresponding A.csv, B.csv, and so on.
The big speedups over your current setup come from removing the repeated calculations in your inner loop and from removing the need for threads. Because a.csv is only 2000 lines, it will be extremely fast to read.
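A minimal standalone sketch of that approach (file names follow the discussion above; the assumption that the item sits in the first column of both files is mine):

import csv

# Read every item from a.csv into a set for fast membership checks
# (assumption: the item is in the first column).
with open('a.csv', newline='') as a_file:
    wanted = {row[0].strip() for row in csv.reader(a_file) if row}

out_files = {}   # item -> open file handle
writers = {}     # item -> csv.writer for that file

# Stream through the big file once, writing matching rows as we go
# (assumption: the item is also in the first column of b.csv).
with open('b.csv', newline='') as b_file:
    for row in csv.reader(b_file):
        if not row or row[0].startswith('#'):
            continue
        key = row[0].strip()
        if key in wanted:
            if key not in writers:
                out_files[key] = open(key + '.csv', 'w', newline='')
                writers[key] = csv.writer(out_files[key])
            writers[key].writerow(row)

for f in out_files.values():
    f.close()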
Let me know if you want me to expand on any part of this.