I've built a program to fill up a databank and, for the time being, it's working. Basically, the program makes a request to the app I'm using (via REST API), gets back the data I want, and then manipulates it into a form the databank accepts.
The problem is that the GET method makes the algorithm too slow: because I'm accessing the details of particular entries, I have to make one request per entry. I have close to 15,000 requests to make, and each row in the bank takes about one second to create.
Is there any way to make these requests faster? How can I improve the performance of this method? And, by the way, any tips on measuring the performance of the code?
Thanks in advance!!
Here's the code:
# Retrieving all the IDs I want the detailed info for
abc_ids = serializers.serialize('json', modelExample.objects.all(), fields=('id',))
abc_ids = json.loads(abc_ids)
abc_ids_size = len(abc_ids)

# Had to declare these guys up here because at the end of the code I use them in the
# functions that create and update the bank, and Python was complaining about
# "referenced before assignment". Picked placeholder values for them.
age = 0
time_to_won = 0
data = '2016-01-01 00:00:00'

# First loop -> request the detailed info of each ABC
for x in range(0, abc_ids_size):
    id = abc_ids[x]['fields']['id']
    response = requests.get(
        'https://api.example.com/v3/abc/' + str(id) + '?api_token=123123123')
    info = response.json()
    dealx = dict(info)
    # Second loop -> picking the info I want to UPDATE and CREATE in the bank
    for key, result in dealx['data'].items():
        # Relevant only for ModelExample -> UPDATE
        if key == 'age':
            result = dict(result)
            age = result['total_seconds']
        # Relevant only for ModelExample -> UPDATE
        elif key == 'average_time_to_won':
            result = dict(result)
            time_to_won = result['total_seconds']
        # Relevant for Model_Example2 -> CREATE
        # Storing a date here to use further on in a datetime manipulation
        if key == 'add_time':
            data = str(result)
        elif key == 'time_stage':
            # Each stage has a total of seconds that the user stayed in it.
            y = result['times_in_stages']
            # The user can be in any stage he wants; there's no rule about the
            # order, but there is a record of the order he chose.
            z = result['order_of_stages']
            # Creating a list to fill with all the stages' info and use in bulk_create.
            data_set = []
            index = 0
            # One repetition per stage in the list.
            for elemento in range(0, len(z)):
                data_set_i = {}
                # The index defines the order of the stages.
                index = index + 1
                for key_1, result_1 in y.items():
                    if int(key_1) == z[elemento]:
                        data_set_i['stage_id'] = int(z[elemento])
                        data_set_i['index'] = int(index)
                        data_set_i['abc_id'] = id
                        # Datetime manipulation
                        if result_1 == 0 and index == 1:
                            data_set_i['add_date'] = data
                        else:
                            # Base date: the order's add_time for the first stage,
                            # otherwise the previous stage's computed date.
                            base = data if index == 1 else data_set[elemento - 1]['add_date']
                            data_t = datetime.strptime(base, "%Y-%m-%d %H:%M:%S")
                            data_sum = data_t + timedelta(seconds=result_1 + 3)
                            # strftime zero-pads every field, replacing the manual
                            # formaters.DateNine() concatenation (which left the hour unpadded)
                            data_set_i['add_date'] = data_sum.strftime("%Y-%m-%d %H:%M:%S")
                        data_set.append(data_set_i)
            Model_Example2_List = [Model_Example2(**vals) for vals in data_set]
            Model_Example2.objects.bulk_create(Model_Example2_List)
    ModelExample.objects.filter(abc_id=id).update(age=age, time_to_won=time_to_won)
If the bottleneck is the network request itself, there isn't much you can do except perhaps use gzip or deflate compression. With requests, though:
The gzip and deflate transfer-encodings are automatically decoded for you.
If you want to be doubly sure, you can add the following header to the GET request:
{ 'Accept-Encoding': 'gzip,deflate' }
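With requests, passing that header looks like this (url being whichever endpoint you are hitting):

response = requests.get(url, headers={'Accept-Encoding': 'gzip,deflate'})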
The other alternative is to use threading and have many requests operate in parallel; since the work is I/O-bound, this is a good option if you have plenty of bandwidth.
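A minimal sketch of that approach using concurrent.futures, reusing the URL pattern from the question; ids and max_workers are placeholders to adjust:

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(abc_id):
    url = 'https://api.example.com/v3/abc/' + str(abc_id) + '?api_token=123123123'
    return requests.get(url).json()

# 20 workers is a guess; tune it to what the API (and your bandwidth) tolerates.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch, ids))
# results holds one decoded JSON payload per id, in the same order as ids.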
Lastly, there are lots of different ways to profile Python, including the cProfile + KCachegrind combo.
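For a quick start with cProfile alone (run_sync is a stand-in for whatever function drives the 15000 requests):

import cProfile
import pstats

cProfile.run('run_sync()', 'stats.prof')  # write profiling data to a file
pstats.Stats('stats.prof').sort_stats('cumulative').print_stats(20)  # 20 slowest calls

The resulting stats.prof file can also be converted for KCachegrind with the pyprof2calltree tool.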
I wrote some code to let the user check the percentage of the money they spent (compared to the money they earned). Almost every step performs normally, until the final part.
a_c[('L'+row_t)].value returns:
=<Cell 'Sheet1'.B5>/<Cell 'Sheet1'.J5>
but I expected it to be an actual value.
Code:
st_column = st_column_r.capitalize()
row_s = str(a_c.max_row)
row_t = str(a_c.max_row + 1)
row = int(row_t)
a_c[('J'+row_t)] = ('=SUM(I2,J'+row_s+')')  # total income
errorprevention = a_c[('J'+row_t)].value
a_c[(st_column+row_t)] = ('=SUM('+(st_column+'2')+','+(st_column+row_s)+')')
a_c['L'+row_t].number_format = FORMAT_PERCENTAGE_00
if errorprevention != 0:
    a_c[('L'+row_t)] = ('='+str(a_c[(st_column+row_t)])+'/'+str(a_c[('J'+row_t)]))
    print('Among past expenditures, the ratio of the ' + inputtype[st_column] + ' category to total income is: ' + a_c[('L'+row_t)].value)
Try changing the formula creation to use each cell's .coordinate attribute, which returns just the reference (e.g. 'B5'), instead of str(cell), which returns the cell's repr:
a_c[('L' + row_t)].value = '=' + a_c[(st_column + row_t)].coordinate + '/' + a_c[('J' + row_t)].coordinate
or use an f-string:
a_c[('L' + row_t)].value = f"={a_c[(st_column + row_t)].coordinate}/{a_c[('J' + row_t)].coordinate}"
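A quick illustration of the difference, as a minimal sketch with a fresh workbook:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
cell = ws['B5']
print(str(cell))        # <Cell 'Sheet1'.B5> -- this is what ended up in your formula
print(cell.coordinate)  # B5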
I am trying to call a specific function, passing it each row of a particular column in a DataFrame (df), and then store that function's return value in another DataFrame (df3).
This is the function:
import pandas as pd
import datetime
import re  # enables carrier lookup

def recognize_delivery_service(tracking):
    service = None
    usps_pattern = [
        '^(94|93|92|94|95)[0-9]{20}$',
        '^(94|93|92|94|95)[0-9]{22}$',
        '^(70|14|23|03)[0-9]{14}$',
        '^(M0|82)[0-9]{8}$',
        '^([A-Z]{2})[0-9]{9}([A-Z]{2})$'
    ]
    ups_pattern = [
        '^(1Z)[0-9A-Z]{16}$',
        '^(T)+[0-9A-Z]{10}$',
        '^[0-9]{9}$',
        '^[0-9]{26}$'
    ]
    fedex_pattern = [
        '^[0-9]{20}$',
        '^[0-9]{15}$',
        '^[0-9]{12}$',
        '^[0-9]{22}$'
    ]
    usps = "(" + ")|(".join(usps_pattern) + ")"
    fedex = "(" + ")|(".join(fedex_pattern) + ")"
    ups = "(" + ")|(".join(ups_pattern) + ")"
    if re.match(usps, tracking) is not None:
        service = 'USPS'
    elif re.match(ups, tracking) is not None:
        service = 'UPS'
    elif re.match(fedex, tracking) is not None:
        service = 'FedEx'
    return service

# Pass each tracking number to the function
temp_track = None
temp_service = None
for i in range(len(df)):
    temp_track = df.iloc[i, 44]  # tracking num located in column 44
    temp_service = recognize_delivery_service(temp_track)  # pass track num to function
    df3.iloc[i, 0] = temp_service  # store service value in column 0 of df3
This is one of many attempts I've made, but I can't get the function to return the proper service, nor store it in df3.
I don't fully understand what this function should do, but I think pandas.DataFrame.apply might be relevant:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
If you could give an example of the input and desired output, I could help you more.
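As a minimal sketch, assuming (as in your loop) that the tracking numbers sit in column 44 of df and that df3 should simply hold one service per row:

df3 = pd.DataFrame()
df3['service'] = df.iloc[:, 44].astype(str).apply(recognize_delivery_service)

The astype(str) guards against NaN values, which re.match would otherwise choke on.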
I've been working for around a week to learn SimPy for a discrete-event simulation I have to run. I've done my best, but I'm just not experienced enough to figure it out quickly. I am dying. Please help.
The system in question goes like this:
order arrives -> resource_1 (there are 2) performs take_order -> order broken into items -> resource_2 (there are 10) performs process_item
My code runs and performs the simulation, but I'm having a lot of trouble getting the queues on the resources to work: queues do not build up on either resource when I run it, and I cannot find the reason why. I try resource.get_queue and get empty lists. There should absolutely be queues, as the orders arrive faster than they can be processed.
I think it has something to do with the logic for requesting resources, but I can't figure it out. Here's how I've structured the code:
import simpy
import random
import numpy as np

total_items = []
total_a = []
total_b = []
total_c = []
order_Q = []
item_Q = []
skipped_visits = []
order_time_dict = {}
order_time_dict2 = {}
total_order_time_dict = {}
var = []

class System:
    def __init__(self, env, num_resource_1, num_resource_2):
        self.env = env
        self.resource_1 = simpy.Resource(env, num_resource_1)
        self.resource_2 = simpy.Resource(env, num_resource_2)

    def take_order(self, order):
        self.time_to_order = random.triangular(30/60, 60/60, 120/60)
        arrive = self.env.now
        yield self.env.timeout(self.time_to_order)

    def process_item(self, item):
        total_process_time = 0
        current = env.now
        order_num = item[1][0]
        for i in range(1, item[1][1]):
            if 'a' in item[0]:
                total_process_time += random.triangular(.05, 7/60, 1/6)  # bagging time only
                # here edit order time w x
            if 'b' in item[0]:
                total_process_time += random.triangular(.05, .3333, .75)
            if 'c' in item[0]:
                total_process_time += random.triangular(.05, 7/60, 1/6)
        # the following is handling time: getting to the station, waiting on the car
        # to arrive at the window after finishing, handing to the customer
        total_process_time += random.triangular(.05, 10/60, 15/60)
        item_finish_time = current + total_process_time
        if order_num in order_time_dict2.keys():
            start = order_time_dict2[order_num][0]
            if order_time_dict2[order_num][1] < item_finish_time:
                order_time_dict2[order_num] = (start, item_finish_time)
        else:
            order_time_dict2[order_num] = (current, item_finish_time)
        yield self.env.timeout(total_process_time)

class Order:
    def __init__(self, order_dict, order_num):
        self.order_dict = order_dict
        self.order_num = order_num
        self.order_stripped = {}
        for x, y in list(self.order_dict.items()):
            if x != 'total':
                if y != 0:
                    # gives dictionary format {item: (order number, number of items)},
                    # but only including items actually in the order
                    self.order_stripped[x] = (order_num, y)
        self.order_list = list(self.order_stripped.items())

def generate_order(num_orders):
    print('running generate_order')
    a_demand = .1914 ** 3
    a_stdev = 43.684104
    b_demand = .1153
    b_stdev = 28.507782
    c_demand = .0664
    c_stdev = 15.5562624349
    num_a = abs(round(np.random.normal(a_demand)))
    num_b = abs(round(np.random.normal(b_demand)))
    num_c = abs(round(np.random.normal(c_demand)))
    total = num_orders
    total_a.append(num_a)
    total_b.append(num_b)
    total_c.append(num_c)
    total_num_items = num_a + num_b + num_c
    total_items.append(total_num_items)
    order_dict = {'num_a': num_a, 'num_b': num_b, 'num_c': num_c, 'total': total}
    return order_dict

def order_process(order_instance, system):
    enter_system_at = system.env.now
    print("order " + str(order_instance.order_num) + " arrives at " + str(enter_system_at))
    if len(system.resource_1.get_queue) > 1:
        print("WORKING HERE ******************")
    if len(system.resource_1.get_queue) <= 25:
        with system.resource_1.request() as req:
            order_Q.append(order_instance)
            yield req
            yield env.process(system.take_order(order_instance))
            order_Q.pop()
        enter_workstation_at = system.env.now
        print("order num " + str(order_instance.order_num) + " enters workstation at " + str(enter_workstation_at))
        for item in order_instance.order_list:
            item_Q.append(item)
            with system.resource_2.request() as req:
                yield req
                yield env.process(system.process_item(item))
                if len(system.resource_2.get_queue) > 1:
                    var.append(1)
                item_Q.pop()
        leave_workstation_at = system.env.now
        print("Order num " + str(order_instance.order_num) + " leaves at " + str(leave_workstation_at))
        order_time_dict[order_instance.order_num] = leave_workstation_at - enter_workstation_at
        total_order_time_dict[order_instance.order_num] = leave_workstation_at - enter_system_at
    else:
        skipped_visits.append(1)

def setup(env):
    system = System(env, 2, 15)
    order_num = 0
    while True:
        next_order = random.expovariate(3.5)  # 3.5 is the order arrival rate (lambda)
        yield env.timeout(next_order)
        order_num += 1
        env.process(order_process(Order(generate_order(order_num), order_num), system))

env = simpy.Environment()
env.process(setup(env))
env.run(until=15*60)
print("1: \n", order_time_dict)
I think you are looking at the wrong queue.
The API for getting the queued requests of a resource is just the attribute queue, so try using
len(system.resource_1.queue)
get_queue and put_queue come from the base class and are used to derive new resource classes.
But beware: they are not what any reasonable person would assume, and I find this confusing too. The docs say that requesting a resource is modeled as "putting a process' token into the resource", which means that when you call request(), the process is put into the put_queue, not the get_queue. And with a Resource, release() always succeeds immediately, so its queue (which is the get_queue) is always empty.
I think queue is just an alias for the put_queue, but queue is much less confusing.
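A toy sketch showing the attribute in action (one server, three users; not your model, just the mechanism):

import simpy

def user(env, res, name):
    with res.request() as req:
        # an unfulfilled request sits in res.queue from the moment it is made
        print(env.now, name, 'sees queue length', len(res.queue))
        yield req
        yield env.timeout(5)

env = simpy.Environment()
res = simpy.Resource(env, capacity=1)
for i in range(3):
    env.process(user(env, res, 'user%d' % i))
env.run()
# prints queue lengths 0, 1, 2 -- len(res.get_queue) would have been 0 every time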
I have the following Python script. In it, I am iterating through a CSV file which has rows and rows of loyalty cards. In many cases, there is more than one entry per card. I am currently looping through each row, then using loc to find all other instances of the card in the current row, so I can combine them together to post to an API. What I'd like to do, however, is remove all the rows I've just merged once that post is done, so that the iteration doesn't hit them again.
That's the part I'm stuck on. Any ideas? Essentially, I want to remove all the rows in card_list from csv before I go to the next iteration. That way, even though there might be five rows with the same card number, I only process that card once. I tried using
csv = csv[csv.card != row.card]
at the end of the loop, thinking it might regenerate the DataFrame without any rows matching the card just processed, but it didn't work.
import urllib3
import json
import pandas as pd
import os
import time
import pyfiglet
from datetime import datetime
import array as arr

for row in csv.itertuples():
    dt = datetime.now()
    vouchers = []
    if minutePassed(time.gmtime(lastrun)[4]):
        print('Getting new token...')
        token = get_new_token()
        lastrun = time.time()
    print('processing ' + str(int(row.card)))
    card_list = csv.loc[csv['card'] == int(row.card)]
    print('found ' + str(len(card_list)) + ' vouchers against this card')
    for row in card_list.itertuples():
        print('appending card ' + str(int(row.card)) + ' voucher ' + str(row.voucher))
        vouchers.append(row.voucher)
    print('vouchers, ', vouchers)
    encoded_data = json.dumps({
        "store_id": row.store,
        "transaction_id": "11111",
        "card_number": int(row.card),
        "voucher_instance_ids": vouchers
    })
    print(encoded_data)
    number += 1
    r = http.request('POST', lcs_base_path + 'customer/auth/redeem-commit', body=encoded_data, headers={'x-api-key': api_key, 'Authorization': 'Bearer ' + token})
    response_data = json.loads(r.data)
    if (r.status == 200):
        print(str(dt) + ' ' + str(number) + ' done. processing card:' + str(int(row.card)) + ' voucher:' + str(row.voucher) + ' store:' + str(row.store) + ' status: ' + response_data['response_message'] + ' request:' + response_data['lcs_request_id'])
    else:
        print(str(dt) + ' ' + str(number) + ' done. failed to commit ' + str(int(row.card)) + ' voucher:' + str(row.voucher) + ' store:' + str(row.store) + ' status: ' + response_data['message'])
        new_row = {'card': row.card, 'voucher': row.voucher, 'store': row.store, 'error': response_data['message']}
        failed_csv = failed_csv.append(new_row, ignore_index=True)
        failed_csv.to_csv(failed_csv_file, index=False)
    csv = csv[csv.card != row.card]

print('script completed')
print(str(len(failed_csv)) + ' failed vouchers will be saved to failed_commits.csv')
print("--- %s seconds ---" % (time.time() - start_time))
First rule of thumb: never mutate what you are iterating over. Also, I think itertuples is the wrong tool here. Let's use groupby instead, which visits each card exactly once, so you no longer need to drop processed rows at all:
for card, card_list in csv.groupby('card'):
    # card_list now contains all the rows that have one specific card,
    # exactly like `card_list` in your code
    print('processing', card)
    print('found', len(card_list), 'vouchers against this card')
    # again, `itertuples` is overkill here -- REMOVE IT
    # for row in card_list.itertuples():
    encoded_data = json.dumps({
        "store_id": card_list['store'].iloc[0],  # same as `row.store`
        "transaction_id": "11111",
        "card_number": int(card),
        "voucher_instance_ids": list(card_list['voucher'])  # same as `vouchers`
    })
    # ... rest of your POST / error handling here
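One side note on the error handling: DataFrame.append, which the original script uses for failed_csv, was removed in pandas 2.0. Collecting the failed rows in a plain list and building the frame once at the end is both portable and faster (a sketch, reusing names from the question):

failed_rows = []
for card, card_list in csv.groupby('card'):
    # ... build encoded_data and POST as above ...
    if r.status != 200:
        failed_rows.append({'card': card,
                            'vouchers': list(card_list['voucher']),
                            'store': card_list['store'].iloc[0],
                            'error': response_data['message']})
failed_csv = pd.DataFrame(failed_rows)
failed_csv.to_csv(failed_csv_file, index=False)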
I need to slice very long strings (DNA sequences) in Python. Currently I'm using this:
new_seq = clean_seq[start:end]
I'm slicing about every 20,000 characters and taking slices roughly 1,000 characters long. It's a 250 MB file containing a few strings, each identified by an ID, and this method is taking too long.
The sequence string comes from the Biopython module:
def fasta_from_ann(annotation, sequence, feature, windows, output_fasta):
    df_gff = pd.read_csv(annotation, index_col=False, sep='\t', header=None)
    df_gff.columns = ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attribute']
    fasta_seq = SeqIO.parse(sequence, 'fasta')
    buffer = []
    for record in fasta_seq:
        df_exctract = df_gff[(df_gff.seqname == record.id) & (df_gff.feature == feature)]
        for k, v in df_exctract.iterrows():
            clean_seq = ''.join(str(record.seq).splitlines())
            if int(v.start) - windows < 0:
                start = 0
            else:
                start = int(v.start) - windows
            if int(v.end) + windows > len(clean_seq):
                end = len(clean_seq)
            else:
                end = int(v.end) + windows
            new_seq = clean_seq[start:end]
            new_id = record.id + "_from_" + str(v.start) + "_to_" + str(v.end) + "_feature_" + v.feature
            desc = "attribute: " + v.attribute + " strand: " + v.strand
            seq = SeqRecord(Seq(new_seq), id=new_id, description=desc)
            buffer.append(seq)
        print(record.id)
    SeqIO.write(buffer, output_fasta, "fasta")
Maybe there's a more memory-friendly way to accomplish this.
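One thing that stands out in the function itself: clean_seq is rebuilt from record.seq on every GFF row, even though it only depends on record. A sketch of hoisting that work out of the inner loop (same names as above; the slicing itself is already cheap):

for record in fasta_seq:
    # build the cleaned sequence once per record, not once per annotation row
    clean_seq = ''.join(str(record.seq).splitlines())
    df_exctract = df_gff[(df_gff.seqname == record.id) & (df_gff.feature == feature)]
    for k, v in df_exctract.iterrows():
        start = max(int(v.start) - windows, 0)
        end = min(int(v.end) + windows, len(clean_seq))
        new_seq = clean_seq[start:end]
        # ... build and buffer the SeqRecord exactly as before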