I am trying to write a simple script that will take a CSV file as input and write it into a single spreadsheet document. I have it working now; however, the script is slow. It takes around 10 minutes to write circa 350 lines across two worksheets.
Here is the script I have:
#!/usr/bin/python
import json, sys
import gspread
from oauth2client.client import SignedJwtAssertionCredentials

json_key = json.load(open('client_secrets.json'))
scope = ['https://spreadsheets.google.com/feeds']
# change to True to see Debug messages
DEBUG = False

def updateSheet(csv, sheet):
    linelen = 0
    counter1 = 1  # starting column in spreadsheet: A
    counter2 = 1  # starting row in spreadsheet: 1
    counter3 = 0  # helper for iterating through line entries
    credentials = SignedJwtAssertionCredentials(json_key['client_email'], json_key['private_key'], scope)
    gc = gspread.authorize(credentials)
    wks = gc.open("Test Spreadsheet")
    worksheet = wks.get_worksheet(sheet)
    if worksheet is None:
        if sheet == 0:
            worksheet = wks.add_worksheet("First Sheet", 1, 8)
        elif sheet == 1:
            worksheet = wks.add_worksheet("Second Sheet", 1, 8)
        else:
            print "Error: spreadsheet does not exist"
            sys.exit(1)
    worksheet.resize(1, 8)
    for i in csv:
        line = i.split(",")
        linelen = len(line) - 1
        if counter3 > linelen:
            counter3 = 0
        if counter1 > linelen:
            counter1 = 1
        if DEBUG:
            print "entry length (starting from 0): ", linelen
            print "line: ", line
            print "counter1: ", counter1
            print "counter3: ", counter3
        while counter3 <= linelen:
            if DEBUG:
                print "writing line: ", line[counter3]
            worksheet.update_cell(counter2, counter1, line[counter3].rstrip('\n'))
            counter3 += 1
            counter1 += 1
        counter2 += 1
    worksheet.resize(counter2, 8)
I am a sysadmin, so I apologize in advance for the messy code.
Anyway, the script takes the CSV line by line, splits it by comma and writes cell by cell, hence it takes time to write. The idea is to have cron execute this once a day; it will remove older entries and write new ones, which is why I use resize().
Now, I am wondering if there is a better way to take a whole CSV line and write it into the sheet with each value in its own cell, avoiding writing cell by cell like I have now? This would significantly reduce the time it takes to execute.
Thanks!
Yes, this can be done. I upload in chunks of 100 rows by 12 columns and it handles that fine; I'm not sure how well this scales, though, for something like a whole CSV in one go. Also be aware that the default length of a sheet is 1000 rows and you will get an error if you try to reference a row outside of this range (so use add_rows beforehand to ensure there is space). A simplified example:
data_to_upload = [[1, 2], [3, 4]]
column_names = ['', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
                'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA']

# To make it dynamic, assuming that all rows contain the same number of elements
cell_range = 'A1:' + str(column_names[len(data_to_upload[0])]) + str(len(data_to_upload))

cells = worksheet.range(cell_range)

# Flatten the nested list, since 'cells' will not accept x,y indexing by default.
flattened_data = [item for row in data_to_upload for item in row]

# Iterate over the length of flattened_data, not cells:
# if you chunk large data into blocks, any excess cells simply keep an empty value,
# whereas going the other way around would raise an index-out-of-range error.
for x in range(len(flattened_data)):
    cells[x].value = flattened_data[x]

worksheet.update_cells(cells)
If your rows are of different lengths then clearly you would need to insert the appropriate number of empty strings into cells to ensure that the two lists don't get out of sync. When my data are byte strings I also call .decode('utf-8') on each value before assigning it, because I kept crashing on special characters; note that decode only applies to strings, not to the integers in this toy example.
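Applied to the question above, a minimal sketch might look like the following (csv_lines stands in for the CSV lines the original script iterates over; worksheet and column_names are reused from the example, and the padding handles rows of different lengths):

rows = [line.rstrip('\n').split(',') for line in csv_lines]

# Pad shorter rows with empty strings so the flattened list stays in sync
# with the rectangular cell range.
width = max(len(r) for r in rows)
padded = [r + [''] * (width - len(r)) for r in rows]

cell_range = 'A1:' + column_names[width] + str(len(padded))
cells = worksheet.range(cell_range)
flattened = [value for row in padded for value in row]

for cell, value in zip(cells, flattened):
    cell.value = value

worksheet.update_cells(cells)  # one API call instead of one call per cell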
The code I am running so far is as follows:
import os
import math
import statistics
def main():
    infile = open('USPopulation.txt', 'r')
    values = infile.read()
    infile.close()
    index = 0
    while index < len(values):
        values(index) = int(values(index))
        index += 1
    print(values)

main()
The text file contains 41 rows of numbers each entered on a single line like so:
151868
153982
156393
158956
161884
165069
168088
etc.
My task is to create a program which shows the average change in population during the time period, the year with the greatest increase in population during the time period, and the year with the smallest increase in population (from the previous year) during the time period.
The code will print each of the text file's entries on a single line, but upon trying to convert to int for use with the statistics package I am getting the following error:
values(index) = int(values(index))
SyntaxError: can't assign to function call
The values(index) = int(values(index)) line was taken from my reading as well as from resources on Stack Overflow.
You can change values = infile.read() to values = list(infile.read()) and it will be output as a list instead of a string.
One thing that tends to happen when reading a file like this is that at the end of every line there is an invisible '\n' that marks a new line within the text file. An easy way to split the contents into lines is, instead of values = list(infile.read()), to use values = values.split('\n'), which splits the string on line breaks (as long as values has previously been assigned).
The while loop that you have can also easily be replaced with a for loop, using len(values) as the end.
The values(index) = int(values(index)) part is a decent idea, but indexing uses square brackets, not parentheses: inside the loop you can use values[i] = int(values[i]) to turn each entry into an integer, so that values becomes a list of integers.
How I would personally set it up would be:

import os
import math
import statistics

def main():
    infile = open('USPopulation.txt', 'r')
    values = infile.read()
    infile.close()
    values = values.strip().split('\n')  # splits based off of lines; strip() guards against a trailing blank line
    for i in range(0, len(values)):  # loop over the length of values and turn each entry into an integer
        values[i] = int(values[i])
    changes = []
    # Use a for loop to get the changes between each number.
    for i in range(0, len(values) - 1):  # the -1 avoids an indexing error when reading values[i+1] at the end
        changes.append(values[i + 1] - values[i])  # the difference between the current number and the next
    print('The max change :', max(changes), 'The minimal change :', min(changes))
    # There is one 'change' for each consecutive pair of values, so
    # changes.index(max(changes)) gives the position of the biggest change,
    # and the population with the same index is where that change started.
    print('A change of :', max(changes), 'Happened at', values[changes.index(max(changes))])
    print('A change of :', min(changes), 'Happened at', values[changes.index(min(changes))])  # same as above, just with the minimum
    # If you wanted to print the second number, you would do values[changes.index(min(changes)) + 1]

main()
If you need any clarification on anything I did in the code, just ask.
I personally would use numpy for reading a text file. In your case I would do it like this:

import numpy as np

def main():
    values = np.loadtxt('USPopulation.txt')  # parses the file straight into an array of numbers
    maxpop = values.max()  # note: np.argmax would return the *index* of the max, not the value
    minpop = values.min()
    print(f'maximum population = {maxpop} and minimum population = {minpop}')

main()
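The question also asks for the average change and the years with the greatest and smallest increases, which numpy's diff handles directly. A minimal sketch, assuming the first line of the file corresponds to some known starting year (1950 here is purely a placeholder):

import numpy as np

values = np.loadtxt('USPopulation.txt')
changes = np.diff(values)  # year-over-year differences
start_year = 1950          # placeholder; substitute the real first year of your data

print(f'average change = {changes.mean():.1f}')
print(f'greatest increase in {start_year + 1 + int(changes.argmax())}')
print(f'smallest increase in {start_year + 1 + int(changes.argmin())}')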
I am trying to create a program to compare a gene file with gene panels.
The gene panel file is in CSV format and has chromosome, gene, start location and end locations.
The patient's file has chromosome, mutations and the location.
So I made a loop to pass the gene panel information to a function where the comparison is done, returning a list of matching items.
The function works great when I call it with manual data, but it does not do the comparison inside the loop.
import vcf
import os, sys

records = open('exampleGenePanel.csv')
read = vcf.Reader(open('examplePatientFile.vcf', 'r'))

# function to find mutations in the patient's sequence
def findMutations(gn, chromo, start, end):
    start = int(start)
    end = int(end)
    for each in read:
        CHROM = each.CHROM
        if CHROM != chromo:
            continue
        POS = each.POS
        if POS < start:
            continue
        if POS > end:
            continue
        REF = each.REF
        ALT = each.ALT
        print(gn, CHROM, POS, REF, ALT)
        list.append([gn, CHROM, POS, REF, ALT])
    return list

gene = records.readlines()
list = []
y = len(gene)
x = 1
while x < 3:
    field = gene[x].split(',')
    gname = field[0]
    chromo = field[1]
    gstart = field[2]
    gend = field[3]
    findMutations(gname, chromo, gstart, gend)
    x = x + 1

if not list:
    print('Mutation not found')
else:
    print(len(list), ' Mutations found')
    print(list)
I want to get the details of matching mutations in the list.
This works as expected when I pass the data manually to the function,
e.g. findMutations('TESTGene','chr8','146171437','146229161'),
but it doesn't compare when passed through the loop.
The problem is that findMutations attempts to read from read each time it is called, but after the first call, read has already been consumed and there is nothing left. I suggest reading the contents of read once, before calling the function, and saving the records in a list; findMutations can then iterate over that list each time it is called.
It would also be a good idea to use a name other than list for your result list, since that name shadows the Python built-in. Finally, it would be better to have findMutations return its result list rather than append to a global.
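A minimal sketch of that restructuring (the file names and the PyVCF Reader are taken from the question; results is an illustrative replacement for the name list, and this version walks every panel row rather than just the first two):

import vcf

# Read the VCF records once, up front, so every call can iterate over them.
patient_records = list(vcf.Reader(open('examplePatientFile.vcf', 'r')))
results = []

def findMutations(gn, chromo, start, end):
    start = int(start)
    end = int(end)
    found = []
    for record in patient_records:
        if record.CHROM == chromo and start <= record.POS <= end:
            found.append([gn, record.CHROM, record.POS, record.REF, record.ALT])
    return found

with open('exampleGenePanel.csv') as panel:
    next(panel)  # skip the header line, as the original loop did by starting at x = 1
    for line in panel:
        gname, chromo, gstart, gend = line.strip().split(',')[:4]
        results.extend(findMutations(gname, chromo, gstart, gend))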
My project is to process different Excel files. To do this, I would like to create a single file that contains some data from the previous files, all in order to build my database. The goal is to obtain graphs of these data, all of it automatically.
I wrote this program in Python. However, it takes 20 minutes to run. How can I optimize it?
In addition, I have identical variables in some files, so I would like the identical variables not to be repeated in the final file. How can I do that?
Here is my program:
import os
import xlrd
import xlsxwriter
from xlrd import open_workbook

wc = xlrd.open_workbook("U:\\INSEE\\table-appartenance-geo-communes-16.xls")
sheet0 = wc.sheet_by_index(0)

# creation of the output workbook
with xlsxwriter.Workbook('U:\\INSEE\\Department61.xlsx') as bdd:
    dept61 = bdd.add_worksheet('deprt61')
    folder_path = "U:\\INSEE\\2013_telechargement2016"
    col = 8
    constante3 = 0
    lastCol = 0
    listeV = list()
    for path, dirs, files in os.walk(folder_path):
        for filename in files:
            filename = os.path.join(path, filename)
            wb = xlrd.open_workbook(filename, '.xls')
            sheet1 = wb.sheet_by_index(0)
            lastRow = sheet1.nrows
            lastCol = sheet1.ncols
            colDep = None
            firstRow = None
            for ligne in range(0, lastRow):
                for col2 in range(0, lastCol):
                    if sheet1.cell_value(ligne, col2) == 'DEP':
                        colDep = col2
                        firstRow = ligne
                        break
                if colDep is not None:
                    break
            col = col - colDep - 2 - constante3
            constante3 = 0
            for nCol in range(colDep + 2, lastCol):
                constante = 1
                for ligne in range(firstRow, lastRow):
                    if sheet1.cell(ligne, colDep).value == '61':
                        Q = (sheet1.cell(firstRow, nCol).value in listeV)
                        if Q == False:
                            V = sheet1.cell(firstRow, nCol).value
                            listeV.append(V)
                            dept61.write(0, col + nCol, sheet1.cell(firstRow, nCol).value)
                            for ligne in range(ligne, lastRow):
                                if sheet1.cell(ligne, colDep).value == '61':
                                    dept61.write(constante, col + nCol, sheet1.cell(ligne, nCol).value)
                                    constante = constante + 1
                        elif Q == True:
                            constante3 = constante3 + 1  # I have a problem here. I would like to count the number of variables that already exist, but I find huge numbers.
                        break
            col = col + lastCol
    # the with statement closes the workbook automatically, so no explicit bdd.close() is needed
Thank you in advance for your help. :)
This one may be too broad for SO, so here are some pointers for where you can optimise. Maybe add a sample screenshot of what the sheets look like.
With regard to if sheet1.cell_value(ligne, col2) == 'DEP': can 'DEP' occur multiple times in a sheet? If it will definitely occur only once, and that occurrence is where you get your values for both colDep and firstRow, then break out of both loops: add a break to end the inner loop, then check for a flag value and break out of the outer loop before iterating further. Like so:
colDep = None    # initialise to None
firstRow = None  # initialise to None
for ligne in range(0, lastRow):
    for col2 in range(0, lastCol):
        if sheet1.cell_value(ligne, col2) == 'DEP':
            colDep = col2
            firstRow = ligne
            break  # break out of the `col2 in range(0,lastCol)` loop
    if colDep is not None:  # or just `if colDep:` if colDep will never be 0.
        break  # break out of the `ligne in range(0,lastRow)` loop
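An alternative to the flag-and-break pattern, offered only as a sketch, is to wrap the search in a small helper function and use return to leave both loops at once (find_dep is a hypothetical name):

def find_dep(sheet, last_row, last_col):
    # Returns (column, row) of the first 'DEP' cell, or (None, None) if absent.
    for ligne in range(0, last_row):
        for col2 in range(0, last_col):
            if sheet.cell_value(ligne, col2) == 'DEP':
                return col2, ligne
    return None, None

colDep, firstRow = find_dep(sheet1, lastRow, lastCol)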
I think the range in for ligne in range(0,lastRow): in your write-to-bdd blocks should start at firstRow, since you know that rows 0 to firstRow-1 will be empty in sheet1, which you've just read to look for the header:
for ligne in range(firstRow, lastRow):
That will avoid wasting time reading empty header rows.
Other considerations for cleaner code:
Use the with xlsxwriter.Workbook('U:\\INSEE\\Department61.xlsx') as bdd: syntax for clarity.
And always use double backslashes \\ inside Windows path strings, even when not preceding a control character: 'U:\\INSEE\\Department61.xlsx'
You've used sheet1.cell_value() as well as sheet1.cell().value for your read operations. Pick one, unless you needed extended Cell info in the value=='61' case.
Read PEP-8 for how to write more readable code.
I am writing a Python script that leverages the OpenPyXL module to open an Excel sheet that has a column of data which I'd then like to put into a list for future use. I've created a simple function to complete this task, but I wasn't sure whether there was a more effective way to approach it. Here is what I have right now:
import pyautogui
import openpyxl

wb = openpyxl.load_workbook('120.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')

def dataGrab(sheet):
    counter = 0
    aData = []
    while 1 > 0:
        cell = "A" + str(counter)
        print(cell)
        wo = sheet[cell].value
        print(wo)
        if wo is None:
            break
        aData.append(wo)
        counter += 1
    print(aData)
    return aData

aData = dataGrab(sheet)
So in essence, the workbook is opened and then an infinite loop kicks off. While in this loop I build the cell identifier ("A" + counter) to make strings like "A1" and so on, which are passed to the line that gets the value of the cell; that value is then appended to an existing list. This loops until wo is None.
This works perfectly fine and yields no errors; I am mostly looking for suggestions to improve the function, or to know if I "did it right". Thank you for your feedback!
PS: The line wb = openpyxl.load_workbook('120.xlsx') takes a good 30 to 45 seconds to run. Any way to make that go faster? Not a priority.
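One way to tighten this up, offered as a sketch rather than a definitive implementation: openpyxl can iterate a column directly, which avoids building coordinate strings by hand, and read-only mode usually makes load_workbook noticeably faster on large files (data_grab is an illustrative rename):

import openpyxl

# read_only=True streams the file instead of loading it fully,
# which usually cuts the load time for big workbooks.
wb = openpyxl.load_workbook('120.xlsx', read_only=True)
sheet = wb['Sheet1']

def data_grab(sheet):
    # Walk column A from row 1 until the first empty cell, collecting values.
    column_a = []
    for row in sheet.iter_rows(min_col=1, max_col=1, min_row=1):
        value = row[0].value
        if value is None:
            break
        column_a.append(value)
    return column_a

aData = data_grab(sheet)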
I'm an inexperienced coder working in Python. I wrote a script to automate a process where certain information would be ripped from a webpage and then copied into a new Excel spreadsheet. I've written and executed the code, but the Excel spreadsheet I've designated to receive the data is completely empty. Worst of all, there is no traceback error. Would you help me find the problem in my code? And how do you generally solve your own problems when not provided with a traceback error?
import xlsxwriter, urllib.request, string

def main():
    # gets the URL for the expert page
    open_sesame = urllib.request.urlopen('https://aries.case.com.pl/main_odczyt.php?strona=eksperci')
    # reads the expert page
    readpage = open_sesame.read()
    # opens up a new file in excel
    workbook = xlsxwriter.Workbook('expert_book.xlsx')
    # adds worksheet to file
    worksheet = workbook.add_worksheet()
    # initializing the variables used to move names and dates
    # in the excel spreadsheet
    boxcoA = ""
    boxcoB = ""
    # initializing expert attribute variables and lists
    expert_name = ""
    url_ticker = 0
    name_ticker = 0
    raw_list = []
    url_list = []
    name_list = []
    date_list = []
    # this loop goes through and finds all the lines
    # that contain the expert URL and name and saves them to raw_list::
    # raw_list loop
    for i in readpage:
        i = str(i)
        if i.startswith('<tr><td align=left><a href='):
            raw_list += i
    # this loop goes through the lines in raw_list and extracts
    # the name of the expert, saving it to a list::
    # name_list loop
    for n in raw_list:
        name_snip = n.split('target=_blank>','</a></td><')[1]
        name_list += name_snip
    # this loop fills a list with the dates the profiles were last updated::
    # date_list loop
    for p in raw_list:
        url_snipoff = p[28:]
        url_snip = url_snipoff.split('"')[0]
        url_list += url_snip
        expert_url = 'https://aries.case.com.pl/' + url_list[url_ticker]
        open_expert = urllib2.openurl(expert_url)
        read_expert = open_expert.read()
        for i in read_expert:
            if i.startswith('<p align=left><small>Last update:'):
                update = i.split('Last update:','</small>')[1]
        open_expert.close()
        date_list += update
    # now that we have a list of expert names and a list of profile update dates
    # we can work on populating the excel spreadsheet
    # this operation will iterate just as long as the list is long
    # meaning that it will populate the excel spreadsheet
    # with all of our names and dates that we wanted
    for z in raw_list:
        boxcoA = string('A',z)
        boxcoB = string('B',z)
        worksheet.write(boxcoA, name_list[z])
        worksheet.write(boxcoB, date_list[z])
    workbook.close()
    print('Operation Complete')

main()
The lack of a traceback only means your code raises no exceptions; it does not mean your code is logically correct.
I would look for logic errors by adding print statements, or by using a debugger such as pdb or pudb.
One problem I notice is that the first loop presumes that i is a line, whereas it is actually a single element of readpage (and since read() returns bytes, each element is an integer byte value, so startswith never matches). You might find splitlines() more useful.
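A minimal sketch of that change (assuming the page decodes as UTF-8; decode first so you iterate over text lines rather than raw bytes):

readpage = open_sesame.read().decode('utf-8')  # bytes -> str
for line in readpage.splitlines():  # iterate over lines, not characters
    if line.startswith('<tr><td align=left><a href='):
        raw_list.append(line)  # append the whole line; += would add it one character at a time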
If there is no traceback, then no exception is being raised.
Most likely something has gone wrong with your scraping/parsing code, and raw_list or the other lists aren't being populated.
Try printing out the data that should be written to the worksheet in the last loop, to see whether there is any data to write.
If you aren't writing data to the worksheet, then it will be empty.