I have two Google spreadsheets:
QC - many columns. I want to check whether the value in column 4 of each row appears anywhere in the second spreadsheet (lastEdited); if it does, the script should put 'Bingo!' in column 14 of the same row where the value was found.
lastEdited - a single, very long column of values.
I achieve that with the following code:
# access the documents on Drive
QC = gc.open_by_key("FIRST KEY").sheet1
lastEdited = gc.open_by_key("SECOND KEY").sheet1

# get values from columns and convert to lists
QC_PEID = QC.col_values(4)
lastEdited_PEID = lastEdited.col_values(1)

# iterate by rows and check if the value from each row appears in the second document
for value in QC_PEID:
    ind = QC_PEID.index(value)
    if value in lastEdited_PEID:
        QC.update_cell(ind, 14, 'Bingo!')
So it does the job, but very slowly (about 5 minutes). I am concerned about the speed because I have to perform this operation on about 50 spreadsheets (avg. 6000 rows each).
I tried to remove the element from the second list when found (it can only appear once) with the following code in the loop:
for value in QC_PEID:
    ind = QC_PEID.index(value)
    if value in lastEdited_PEID:
        QC.update_cell(ind, 14, 'Bingo!')
        lastEdited_PEID.remove(value)  # the added line
I thought it would make things faster because the reference list gets shorter, but surprisingly it takes even longer.
What could I do to make the process quicker?
Since gspread is a wrapper around the Google Sheets REST API, each operation you perform on a spreadsheet translates into an HTTP request to the API. Most of the time this is the slowest part of the code. If you want to improve performance, you need to figure out how to reduce the number of interactions with the API.
In your code sample each col_values() call makes a single HTTP request. This is good. But then, while iterating over the cell values, you call update_cell() in a loop:
for value in QC_PEID:
    ind = QC_PEID.index(value)
    if value in lastEdited_PEID:
        QC.update_cell(ind, 14, 'Bingo!')  # this makes 2 HTTP requests each time
update_cell() makes two HTTP requests to the API (one to retrieve the information needed to update the cell and another to actually send the update). You need to avoid calling this method in your loop.
A better idea is to collect all updates and send them in a batch. This is what update_cells() method is for.
update_cells() needs a list of Cell objects to do the batch update. You can get those by calling Worksheet.range().
This is what comes to mind:
# A utility function
def col_cells(worksheet, col):
    """Returns the range of cells in column `col` of `worksheet`."""
    start_cell = worksheet.get_addr_int(1, col)
    end_cell = worksheet.get_addr_int(worksheet.row_count, col)
    return worksheet.range('%s:%s' % (start_cell, end_cell))

QC_PEID = QC.col_values(4)
lastEdited_PEID = set(lastEdited.col_values(1))  # a set makes the 'in' lookup much faster

column_14_cells = col_cells(QC, 14)

has_updates = False
# iterate by rows and check if the value from each row appears in the second document
for i, value in enumerate(QC_PEID):
    if value in lastEdited_PEID:
        has_updates = True
        column_14_cells[i].value = 'Bingo!'

if has_updates:
    QC.update_cells(column_14_cells)
I didn't run the code. Beware of typos.
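If you prefer to skip the helper, the same column range can be requested in A1 notation (column 14 is column N); a rough, untested sketch along the same lines:

QC_PEID = QC.col_values(4)
lastEdited_PEID = set(lastEdited.col_values(1))

# fetch every cell of column N (14) in a single request, e.g. 'N1:N6000'
column_14_cells = QC.range('N1:N%d' % QC.row_count)

for i, value in enumerate(QC_PEID):
    if value in lastEdited_PEID:
        column_14_cells[i].value = 'Bingo!'

# one batch write instead of one request per matching cell
QC.update_cells(column_14_cells)

Both variants assume the values in column 4 start in row 1, so list index i lines up with spreadsheet row i + 1.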
Related
I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one for every 15 seconds) from the 27th of January, containing momentary delay data for vehicles on active trips in the network. Unfortunately there is a lot of overhead / duplicate data, so I want to parse out just the relevant parts to do visualizations on.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below, it performs really slowly. It has been running for over 1.5 hours now (on my 2019 MacBook Pro 15") and isn't finished yet.
How can I optimize / improve this Python parser?
Or should I reduce the number of files, i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np
directory = '../data/tripu/27/'
datapoints = np.zeros((0, 3), int)
read_trips = set()

# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
    try:
        # Uncompress and parse protobuff-file using gtfs_realtime_pb2
        with gzip.open(directory + filename, 'rb') as file:
            response = file.read()
            feed = gtfs_realtime_pb2.FeedMessage()
            feed.ParseFromString(response)
            print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))

            for trip in feed.entity:
                if trip.trip_update.trip.trip_id not in read_trips:
                    try:
                        if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
                            print("\t", "Adding delays for", len(trip.trip_update.stop_time_update), "stops, on trip_id", trip.trip_update.trip.trip_id)
                            for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
                                # Store the delay data point (arrival difference of two ascending nodes)
                                delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay - trip.trip_update.stop_time_update[i].arrival.delay)
                                # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                                ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
                                key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
                                # Append data to numpy array
                                datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
                            read_trips.add(trip.trip_update.trip.trip_id)
                    except KeyError:
                        continue
                else:
                    continue
    except OSError:
        continue
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)
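If you want to see the difference for yourself, here is a small illustrative timing sketch (not from the original question) comparing the two approaches on synthetic rows:

import time
import numpy as np

N = 20000

start = time.time()
arr = np.zeros((0, 3), int)
for i in range(N):
    arr = np.append(arr, np.array([[i, i, i]]), axis=0)  # copies the whole array every time
print("np.append in a loop:", time.time() - start, "s")

start = time.time()
rows = []
for i in range(N):
    rows.append((i, i, i))            # O(1) amortized per append
arr2 = np.array(rows, dtype=int)      # single conversion at the end
print("list append + np.array:", time.time() - start, "s")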
I still think the parse routine is your bottleneck (even if it does come from Google), but all those '.'s were killing me! (And they do slow performance down somewhat.) I also converted your i, i+1 indexing to two iterators zipping through the list of updates, which is a slightly more advanced style of walking a list; the cur_update/next_update names also make it easier to keep straight which element is being referenced where. Finally, I removed the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
    this_trip_update = trip.trip_update
    this_trip_id = this_trip_update.trip.trip_id
    if this_trip_id not in read_trips:
        try:
            if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
                print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
                      this_trip_id)
                # create two iterators to walk through the list of updates
                cur_updates = iter(this_trip_update.stop_time_update)
                nxt_updates = iter(this_trip_update.stop_time_update)
                # advance the nxt_updates iter so it is one ahead of cur_updates
                next(nxt_updates)
                for cur_update, next_update in zip(cur_updates, nxt_updates):
                    # Store the delay data point (arrival difference of two ascending nodes)
                    delay = int(next_update.arrival.delay - cur_update.arrival.delay)
                    # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                    ts = int(next_update.arrival.time)
                    key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
                    # Append data to numpy array
                    datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
                read_trips.add(this_trip_id)
        except KeyError:
            continue
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
(This probably is more appropriate for CodeReview, but I hardly ever go there.)
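As an aside, the same pairwise walk can be written without explicit iterators by zipping a sequence with a slice of itself; a small generic sketch of the pattern:

# Generic form of the pairwise pattern: pair each element with its successor.
values = [3, 7, 12, 20]
for cur, nxt in zip(values, values[1:]):
    print(nxt - cur)   # prints 4, 5, 8

# Applied to the updates above (slicing should also work on the repeated protobuf field):
# updates = this_trip_update.stop_time_update
# for cur_update, next_update in zip(updates, updates[1:]):
#     ...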
I'm trying to push data into my Google Sheet with the following code. How can I change it so that it writes starting at the 2nd row, in the correct column based on the header that I've created?
First code:
class Header:
    def __init__(self):
        self.No_DOB_Y = 1
        self.No_DOB_M = 2
        self.No_DOB_D = 3
        self.Paid_too_much_little = 4
        self.No_number_of_ins = 5
        self.No_gender = 6
        self.No_first_login = 7
        self.No_last_login = 8
        self.Too_young_old = 9

    def __repr__(self):
        return str(self.__dict__)

    def add_col(self, name):
        setattr(self, name, max(self.__dict__.values()) + 1)

anomali_header = Header()
2nd part of code (NEW):
# No_gender
a = list(df.loc[df['gender'].isnull()]['id'])
#print(a)
cells = sh3.range(1, 1, len(a), 1)
for i, cell in enumerate(cells):
    cell.value = a[i]
sh3.update_cells(cells)
At the moment it writes into cell A1. As you can see, the code puts the results into the first available cells starting at A1; I essentially want them to appear below my anomali_header column "No_gender", but I'm not sure how to link the 1st part of the code to the 2nd part.
Thanks to v25, the code below works, but rather than going through the checks one by one, I wanted to create a loop that goes through all of them.
I'm trying to run the code below, but it seems I get an error when I use the loop.
Error:
TypeError: 'list' object cannot be interpreted as an integer
Code:
# No_DOB_Y
a = list(df.loc[df['Year'].isnull()]['id'])
# No number of ins
b = list(df.loc[df['number of ins'].isnull()]['id'])
# No_gender
c = list(df.loc[df['gender'].isnull()]['id'])

# Updating anomalies to sheet
condition = [a, b, c]
column = [1, 2, 3]

for j in range(column, condition):
    cells = sh3.range(2, column, len(condition)+1, column)
    for i, cell in enumerate(cells):
        cell.value = condition[i]
    print('end of check')
    sh3.update_cells(cells)
You need to change the range() parameters:
first_row (int) – Row number
first_col (int) – Column number
last_row (int) – Row number
last_col (int) – Column number
So something like:
cells=sh3.range(2, 6, len(a)+1, 6)
Or you could issue the range as a string:
cells=sh3.range('F2:F' + str(len(a)+1))
These numbers may not be perfect, but this should change the positioning. You might need to tweak the digits slightly ;)
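To make the coordinates concrete: with, say, three ids in a, sh3.range(2, 6, len(a)+1, 6) selects rows 2 through 4 of column 6, i.e. the same cells as 'F2:F4':

a = ['id1', 'id2', 'id3']                 # three anomaly ids to write
cells = sh3.range(2, 6, len(a) + 1, 6)    # rows 2..4 of column 6, i.e. 'F2:F4'
for i, cell in enumerate(cells):
    cell.value = a[i]
sh3.update_cells(cells)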
UPDATE:
I've encountered an error using a loop; I've updated my original post.
TypeError: 'list' object cannot be interpreted as an integer
This is happening because the function range() which you use in the for loop (not to be confused with sh3.range(), which is a different function altogether) expects integers, but you're passing it lists.
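If you wanted to keep your two parallel lists, the direct fix for the TypeError would be to pair them with zip() instead of range(); a minimal sketch of that (the column numbers should be whatever positions your headers actually occupy):

condition = [a, b, c]
column = [1, 5, 6]   # column numbers matching each check

for col, values in zip(column, condition):
    if not values:                     # nothing to write for this check
        continue
    cells = sh3.range(2, col, len(values) + 1, col)
    for i, cell in enumerate(cells):
        cell.value = values[i]
    sh3.update_cells(cells)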
However, a simpler way to implement this would be to create a list of tuples which map the strings to column integers, then loop based on this. Something like:
col_map = [('Year', 1),
           ('number of ins', 5),
           ('gender', 6)]

for col_tup in col_map:
    df_list = list(df.loc[df[col_tup[0]].isnull()]['id'])
    cells = sh3.range(2, col_tup[1], len(df_list)+1, col_tup[1])
    for i, cell in enumerate(cells):
        cell.value = df_list[i]
    sh3.update_cells(cells)
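To tie this back to your Header class instead of hard-coding the column numbers, you could look them up with getattr(); a sketch assuming the attribute names line up with your checks:

col_map = [('Year', 'No_DOB_Y'),
           ('number of ins', 'No_number_of_ins'),
           ('gender', 'No_gender')]

for df_col, header_attr in col_map:
    col_num = getattr(anomali_header, header_attr)   # e.g. 'No_gender' -> 6
    df_list = list(df.loc[df[df_col].isnull()]['id'])
    if not df_list:                                   # nothing to write for this check
        continue
    cells = sh3.range(2, col_num, len(df_list) + 1, col_num)
    for i, cell in enumerate(cells):
        cell.value = df_list[i]
    sh3.update_cells(cells)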
OK, so I have two xlsx sheets; both have a list of SIM card numbers in their second column (index 1). Using xlrd I have extracted that data and successfully printed the contents of both columns to my PowerShell terminal as two lists, along with the number of elements in each.
The first sheet (theirSheet) has 454 entries, the second (ourSheet) has 361. I need to find the 93 that don't exist in the second sheet and put them into a third list (unpaidSims). I could do this manually of course, but I would like to automate this task for the future, when I inevitably need to do it again, so I am trying to write this Python script.
Considering Python agrees that I have a list of 454 entries and a list of 361 entries, I thought I just needed a list comparison. I researched that on Stack Overflow and tried 3 different solutions, but each time the resulting third list (unpaidSims) still has 454 entries, meaning it hasn't removed the entries that also appear in the smaller list. Please advise.
from os.path import join, dirname, abspath
import xlrd
theirBookFileName = join(dirname(dirname(abspath(__file__))), 'pycel', 'theirBook.xlsx')
ourBookFileName = join(dirname(dirname(abspath(__file__))), 'pycel', 'ourBook.xlsx')
theirBook = xlrd.open_workbook(theirBookFileName)
ourBook = xlrd.open_workbook(ourBookFileName)
theirSheet = theirBook.sheet_by_index(0)
ourSheet = ourBook.sheet_by_index(0)
theirSimColumn = theirSheet.col(1)
ourSimColumn = ourSheet.col(1)
numColsTheirSheet = theirSheet.ncols
numRowsTheirSheet = theirSheet.nrows
numColsOurSheet = ourSheet.ncols
numRowsOurSheet = ourSheet.nrows
# First Attempt at the comparison, but fails and returns 454 entries from the bigger list
unpaidSims = [d for d in theirSimColumn if d not in ourSimColumn]
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
print "\nWe are expecting 93 entries in this new list"
# Second Attempt at the comparison, but fails and returns 454 entries from the bigger list
s = set(ourSimColumn)
unpaidSims = [x for x in theirSimColumn if x not in s]
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
# Third Attempt at the comparison, but fails and returns 454 entries from the bigger list
unpaidSims = tuple(set(theirSimColumn) - set(ourSimColumn))
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
According to the xlrd Documentation, the col method returns "a sequence of the Cell objects in the given column".
It doesn't mention anything about comparison of Cell objects. Digging into the source, it appears that they didn't code any comparison methods into the class. As such, the Python documentation states that the objects will be compared by "object identity". In other words, the comparison will be False unless they are the exact same instance of the Cell class, even if the values they contain are identical.
You need to compare the values of the Cells instead. For example:
unpaidSims = set(sim.value for sim in theirSimColumn) - set(sim.value for sim in ourSimColumn)
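Plugged into your script, that comparison would look roughly like this (one caveat: xlrd returns numeric cells as floats, so if one workbook stores the SIMs as text and the other as numbers you may also need to normalise both sides, e.g. with str(int(v))):

theirSims = set(cell.value for cell in theirSimColumn)
ourSims = set(cell.value for cell in ourSimColumn)

unpaidSims = theirSims - ourSims
print(unpaidSims)
print(len(unpaidSims))   # should now be 93 rather than 454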
I have some pandas groupby functions that write data to file, but for some reason I'm getting redundant rows written to the file. Here's the code:
# This function gets applied to each item in the dataframe
def item_grouper(df):
    # Get the frequency of each tag applied to the item
    tag_counts = df['tag'].value_counts()
    # Get the most frequent tag (or tags, assuming a tie)
    max_tags = tag_counts[tag_counts == tag_counts.max()]
    # Get the total number of annotations for the item
    total_anno = len(df)
    # Now, process each user who tagged the item
    return df.groupby('uid').apply(user_grouper, total_anno, max_tags, tag_counts)
# This function gets applied to each user who tagged an item
def user_grouper(df, total_anno, max_tags, tag_counts):
    # subtract the user's annotations from the total annotations for the item
    total_anno = total_anno - len(df)
    # calculate weight
    weight = np.log10(total_anno)
    # check if user has used (one of) the top tag(s), and adjust max_tag_count
    if len(np.intersect1d(max_tags.index.values, df['iid'])) > 0:
        max_tag_count = float(max_tags[0] - 1)
    else:
        max_tag_count = float(max_tags[0])
    # for each annotation...
    for i, row in df.iterrows():
        # calculate raw score
        raw_score = (tag_counts[row['tag']] - 1) / max_tag_count
        # write to file
        out.write('\t'.join(map(str, [row['uid'], row['iid'], row['tag'], raw_score, weight])) + '\n')
    return df
So, one grouping function groups the data by iid (item id), does some processing, and then groups each sub-dataframe by uid (user_id), does some calculation, and writes to an output file. Now, the output file should have exactly one line per row in the original dataframe, but it doesn't! I keep getting the same data written to file multiple times. For instance, if I run:
out = open('data/test','w')
df.head(1000).groupby('iid').apply(item_grouper)
out.close()
The output should have 1000 lines (the code only writes one line per row in the dataframe), but the result output file has 1,997 lines. Looking at the file shows the exact same lines written multiple (2-4) times, seemingly at random (i.e. not all lines are double-written). Any idea what I'm doing wrong here?
See the docs on apply. Pandas will call the function twice on the first group (to decide between a fast and a slow code path), so the side effects of the function (IO) will happen twice for the first group.
Your best bet here is probably to iterate over the groups directly, like this:
for group_name, group_df in df.head(1000).groupby('iid'):
    item_grouper(group_df)
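If you want to see the double call for yourself, a quick probe like the following (behaviour of the pandas versions affected by this; newer releases no longer re-run the function) should show the first iid appearing twice:

seen = []

def probe(group):
    seen.append(group.name)   # .name holds the iid of the group being processed
    return group

df.head(1000).groupby('iid').apply(probe)
print(seen[:3])   # on affected pandas versions the first iid shows up twice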
I agree with chrisb's diagnosis of the problem. As a cleaner fix, consider having your user_grouper() function not write anything itself, but instead return the computed values. With a structure like:
def user_grouper(df, ...):
    (...)
    df['max_tag_count'] = some_calculation
    return df

results = df.groupby(...).apply(user_grouper, ...)

for i, row in results.iterrows():
    # calculate raw score
    raw_score = (tag_counts[row['tag']] - 1) / row['max_tag_count']
    # write to file
    out.write('\t'.join(map(str, [row['uid'], row['iid'], row['tag'], raw_score, weight])) + '\n')
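One detail to watch: weight and tag_counts in that final loop are per-group values, so they also need to come out of user_grouper(), for example as extra columns. A rough, untested sketch of that idea, keeping item_grouper() as you already have it:

def user_grouper(df, total_anno, max_tags, tag_counts):
    total_anno = total_anno - len(df)
    if len(np.intersect1d(max_tags.index.values, df['iid'])) > 0:
        max_tag_count = float(max_tags[0] - 1)
    else:
        max_tag_count = float(max_tags[0])
    # carry the per-group values out as columns instead of writing to file here
    df['weight'] = np.log10(total_anno)
    df['raw_score'] = (df['tag'].map(tag_counts) - 1) / max_tag_count
    return df

results = df.head(1000).groupby('iid').apply(item_grouper)

with open('data/test', 'w') as out:
    for _, row in results.iterrows():
        out.write('\t'.join(map(str, [row['uid'], row['iid'], row['tag'],
                                      row['raw_score'], row['weight']])) + '\n')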
I have a Google Spreadsheet which I'm populating with values using a Python script and the gdata library. If I run the script more than once, it appends new rows to the worksheet; I'd like the script to first clear all the data from the rows before populating them, so that I have a fresh set of data every time I run it. I've tried using:
UpdateCell(row, col, value, spreadsheet_key, worksheet_id)
but short of running two for loops like this, is there a cleaner way? Also, this loop seems to be horrendously slow:
for x in range(2, 45):
    for i in range(1, 5):
        self.GetGDataClient().UpdateCell(x, i, '',
                                         self.spreadsheet_key,
                                         self.worksheet_id)
Not sure if you got this sorted out or not, but regarding speeding up the clearing out of current data, try using a batch request. For instance, to clear out every single cell in the sheet, you could do:
cells = client.GetCellsFeed(key, wks_id)
batch_request = gdata.spreadsheet.SpreadsheetsCellsFeed()

# Iterate through every cell in the CellsFeed, replacing each one with ''
# Note that this does not make any calls yet - it all happens locally
for i, entry in enumerate(cells.entry):
    entry.cell.inputValue = ''
    batch_request.AddUpdate(cells.entry[i])

# Now send the entire batchRequest as a single HTTP request
updated = client.ExecuteBatch(batch_request, cells.GetBatchLink().href)
If you want to do things like save the column headers (assuming they are in the first row), you can use a CellQuery:
# Set up a query that starts at row 2
query = gdata.spreadsheet.service.CellQuery()
query.min_row = '2'

# Pull just those cells
no_headers = client.GetCellsFeed(key, wks_id, query=query)
batch_request = gdata.spreadsheet.SpreadsheetsCellsFeed()

# Iterate through every cell in the CellsFeed, replacing each one with ''
# Note that this does not make any calls yet - it all happens locally
for i, entry in enumerate(no_headers.entry):
    entry.cell.inputValue = ''
    batch_request.AddUpdate(no_headers.entry[i])

# Now send the entire batchRequest as a single HTTP request
updated = client.ExecuteBatch(batch_request, no_headers.GetBatchLink().href)
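If you only need to blank the block your original loop was touching (rows 2-44, columns 1-4), the query can be bounded on both axes; a sketch assuming CellQuery's max_row / min_col / max_col / return_empty attributes (untested):

# Limit the query to the block the original loop was touching: rows 2-44, columns 1-4
query = gdata.spreadsheet.service.CellQuery()
query.min_row = '2'
query.max_row = '44'
query.min_col = '1'
query.max_col = '4'
query.return_empty = 'true'   # include empty cells so every cell in the block is in the feed

block = client.GetCellsFeed(key, wks_id, query=query)
batch_request = gdata.spreadsheet.SpreadsheetsCellsFeed()
for i, entry in enumerate(block.entry):
    entry.cell.inputValue = ''
    batch_request.AddUpdate(block.entry[i])

updated = client.ExecuteBatch(batch_request, block.GetBatchLink().href)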
Alternatively, you could use this approach to update your cells as well (perhaps more in line with what you want). The link to the documentation provides a basic way to do that, which is (copied from the docs in case the link ever changes):
import gdata.spreadsheet
import gdata.spreadsheet.service
client = gdata.spreadsheet.service.SpreadsheetsService()
# Authenticate ...
cells = client.GetCellsFeed('your_spreadsheet_key', wksht_id='your_worksheet_id')
batchRequest = gdata.spreadsheet.SpreadsheetsCellsFeed()
cells.entry[0].cell.inputValue = 'x'
batchRequest.AddUpdate(cells.entry[0])
cells.entry[1].cell.inputValue = 'y'
batchRequest.AddUpdate(cells.entry[1])
cells.entry[2].cell.inputValue = 'z'
batchRequest.AddUpdate(cells.entry[2])
cells.entry[3].cell.inputValue = '=sum(3,5)'
batchRequest.AddUpdate(cells.entry[3])
updated = client.ExecuteBatch(batchRequest, cells.GetBatchLink().href)