I've been working for about a week to learn SimPy for a discrete-event simulation I have to run. I've done my best, but I'm just not experienced enough to figure it out quickly. I am dying. Please help.
The system in question goes like this:
order arrives -> resource_1 (there are 2) performs take_order -> order broken into items -> resource_2 (there are 10) performs process_item
My code runs and performs the simulation, but I'm having a lot of trouble getting the queues on the resources to work: no queue builds up on either resource when I run it, and I can't find the reason why. When I check resource.get_queue I get empty lists. There should absolutely be queues, as orders arrive faster than they can be processed.
I think it has something to do with the logic for requesting resources, but I can't figure it out. Here's how I've structured the code:
import simpy
import random
import numpy as np

total_items = []
total_a = []
total_b = []
total_c = []
order_Q = []
item_Q = []
skipped_visits = []
order_time_dict = {}
order_time_dict2 = {}
total_order_time_dict = {}
var = []

class System:
    def __init__(self, env, num_resource_1, num_resource_2):
        self.env = env
        self.resource_1 = simpy.Resource(env, num_resource_1)
        self.resource_2 = simpy.Resource(env, num_resource_2)

    def take_order(self, order):
        self.time_to_order = random.triangular(30/60, 60/60, 120/60)
        arrive = self.env.now
        yield self.env.timeout(self.time_to_order)

    def process_item(self, item):
        total_process_time = 0
        current = env.now
        order_num = item[1][0]
        for i in range(1, item[1][1]):
            if 'a' in item[0]:
                total_process_time += random.triangular(.05, 7/60, 1/6)  # bagging time only
                # here edit order time w x
            if 'b' in item[0]:
                total_process_time += random.triangular(.05, .3333, .75)
            if 'c' in item[0]:
                total_process_time += random.triangular(.05, 7/60, 1/6)
        # the following is handling time: getting to station, waiting on car to arrive at window after finished, handing to cust
        total_process_time += random.triangular(.05, 10/60, 15/60)
        item_finish_time = current + total_process_time
        if order_num in order_time_dict2.keys():
            start = order_time_dict2[order_num][0]
            if order_time_dict2[order_num][1] < item_finish_time:
                order_time_dict2[order_num] = (start, item_finish_time)
        else:
            order_time_dict2[order_num] = (current, item_finish_time)
        yield self.env.timeout(total_process_time)

class Order:
    def __init__(self, order_dict, order_num):
        self.order_dict = order_dict
        self.order_num = order_num
        self.order_stripped = {}
        for x, y in list(self.order_dict.items()):
            if x != 'total':
                if y != 0:
                    self.order_stripped[x] = (order_num, y)  # gives the format {item: (order number, number of items)}, but only for items actually in the order
        self.order_list = list(self.order_stripped.items())

def generate_order(num_orders):
    print('running generate_order')
    a_demand = .1914 ** 3
    a_stdev = 43.684104
    b_demand = .1153
    b_stdev = 28.507782
    c_demand = .0664
    c_stdev = 15.5562624349
    num_a = abs(round(np.random.normal(a_demand)))
    num_b = abs(round(np.random.normal(b_demand)))
    num_c = abs(round(np.random.normal(c_demand)))
    total = num_orders
    total_a.append(num_a)
    total_b.append(num_b)
    total_c.append(num_c)
    total_num_items = num_a + num_b + num_c
    total_items.append(total_num_items)
    order_dict = {'num_a': num_a, 'num_b': num_b, 'num_c': num_c, 'total': total}
    return order_dict

def order_process(order_instance, system):
    enter_system_at = system.env.now
    print("order " + str(order_instance.order_num) + " arrives at " + str(enter_system_at))
    if len(system.resource_1.get_queue) > 1:
        print("WORKING HERE ******************")
    if len(system.resource_1.get_queue) <= 25:
        with system.resource_1.request() as req:
            order_Q.append(order_instance)
            yield req
            yield env.process(system.take_order(order_instance))
            order_Q.pop()
        enter_workstation_at = system.env.now
        print("order num " + str(order_instance.order_num) + " enters workstation at " + str(enter_workstation_at))
        for item in order_instance.order_list:
            item_Q.append(item)
            with system.resource_2.request() as req:
                yield req
                yield env.process(system.process_item(item))
                if len(system.resource_2.get_queue) > 1:
                    var.append(1)
                item_Q.pop()
        leave_workstation_at = system.env.now
        print("Order num " + str(order_instance.order_num) + " leaves at " + str(leave_workstation_at))
        order_time_dict[order_instance.order_num] = leave_workstation_at - enter_workstation_at
        total_order_time_dict[order_instance.order_num] = leave_workstation_at - enter_system_at
    else:
        skipped_visits.append(1)

def setup(env):
    system = System(env, 2, 15)
    order_num = 0
    while True:
        next_order = random.expovariate(3.5)  # exponential inter-arrival times, rate (lambda) = 3.5
        yield env.timeout(next_order)
        order_num += 1
        env.process(order_process(Order(generate_order(order_num), order_num), system))

env = simpy.Environment()
env.process(setup(env))
env.run(until=15*60)
print("1: \n", order_time_dict)
I think you are looking at the wrong queue.
The API for getting the queued requests on a resource is just the queue attribute, so try using
len(system.resource_1.queue)
get_queue and put_queue come from the base class and are used when deriving new resource classes.
They are not what a reasonable person would assume, though, and I find this confusing too, but the docs say that requesting a resource is modeled as "putting a process' token into the resource". That means when you call request() the request goes into the put_queue, not the get_queue. And with a Resource, a release always succeeds immediately, so its get_queue is always empty.
I think queue is just an alias for put_queue, but queue is much less confusing.
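For anyone who wants to see the difference, here is a minimal, self-contained sketch (not the model from the question; the names are illustrative) in which a single server gets one arrival per time unit but needs five time units per customer, so Resource.queue grows while Resource.get_queue stays empty:

import simpy

def customer(env, counter):
    with counter.request() as req:
        yield req
        yield env.timeout(5)  # service takes much longer than the arrival gap

def arrivals(env, counter):
    while True:
        env.process(customer(env, counter))
        print(env.now, "queue:", len(counter.queue), "get_queue:", len(counter.get_queue))
        yield env.timeout(1)  # one new arrival per time unit

env = simpy.Environment()
counter = simpy.Resource(env, capacity=1)
env.process(arrivals(env, counter))
env.run(until=10)

Checking len(system.resource_1.queue) and len(system.resource_2.queue) in order_process should show the build-up you expect.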
Related
For my code I need help with "mean_times[]".
I want to take the corresponding list items from "op_time", "ltime", and "ptime": add up all the first elements and divide by three, then add up all the second elements and divide by three, and so on for as many tasks as the user inputs.
The way I wrote my code is static: I only allow for three instances of this, but the code lets the user input as many tasks as they want. I think I need the code to append to "mean_times[]" as many times as there are user inputs.
Meaning, take "(float_op_time[0] + float_ltime[0] + float_ptime[0]) / 3" and append that to mean_times[], then do the same for the second element, and so on for as many tasks as there are.
I'm not sure of the logic, or what sort of loop I need, to compute each result and append it to mean_times[].
import numpy as np
from tabulate import *

tasks = []
t = input("Label the TASKS (a, b, c, ..., entering STOP would end the process): ")
tasks.append(t)
while t.lower() != 'stop':
    t = input("Label the TASKS (a, b, c, ..., entering STOP would end the process): ")
    tasks.append(t)
del tasks[-1]

op_time = []
o = input("Enter OPTIMISTIC time for each task: ")
op_time.append(o)
while o.lower() != 'stop':
    o = input("Enter OPTIMISTIC time for each task: ")
    op_time.append(o)
del op_time[-1]

ltime = []
lt = input("Enter MOST LIKELY time for each task: ")
ltime.append(lt)
while lt.lower() != 'stop':
    lt = input("Enter MOST LIKELY time for each task: ")
    ltime.append(lt)
del ltime[-1]

ptime = []
p = input("Enter PESSIMISTIC time for each task: ")
ptime.append(p)
while p.lower() != 'stop':
    p = input("Enter PESSIMISTIC time for each task: ")
    ptime.append(p)
del ptime[-1]

array_op_time = np.array(op_time)
float_op_time = array_op_time.astype(float)
array_ltime = np.array(ltime)
float_ltime = array_ltime.astype(float)
array_ptime = np.array(ptime)
float_ptime = array_ptime.astype(float)

mean_times = [(float_op_time[0] + float_ltime[0] + float_ptime[0]) / 3, (float_op_time[1] + float_ltime[1] + float_ptime[1]) / 3, (float_op_time[2] + float_ltime[2] + float_ptime[2]) / 3]
array_mean_times = np.array(mean_times)
float_mean_times = array_mean_times.astype(float)

# some logic to append to mean_times

table = {"Task": tasks, "Optimistic": op_time, "Most Likely": ltime, "Pessimistic": ptime, "Mean Duration": mean_times}
print(tabulate(table, headers='keys', tablefmt='fancy_grid', stralign='center', numalign='center'))
Fixed the program
import numpy as np
from tabulate import tabulate

tasks = []
counter = 1
t = input("Label the TASKS (a, b, c, ..., entering STOP would end the process): ")
tasks.append(t)
while t.lower() != 'stop':
    t = input("Label the TASKS (a, b, c, ..., entering STOP would end the process): ")
    tasks.append(t)
    counter = counter + 1
del tasks[-1]

op_time = []
for x in range(counter - 1):
    o = input("Enter OPTIMISTIC time for each task: ")
    op_time.append(o)

ltime = []
for x in range(counter - 1):
    lt = input("Enter MOST LIKELY time for each task: ")
    ltime.append(lt)

ptime = []
for x in range(counter - 1):
    p = input("Enter PESSIMISTIC time for each task: ")
    ptime.append(p)

array_op_time = np.array(op_time)
float_op_time = array_op_time.astype(float)
array_ltime = np.array(ltime)
float_ltime = array_ltime.astype(float)
array_ptime = np.array(ptime)
float_ptime = array_ptime.astype(float)

mean_times = []
for x in range(0, counter - 1):
    m_time = (float(ltime[x]) + float(ptime[x]) + float(op_time[x])) / 3
    mean_times.append(m_time)

array_mean_times = np.array(mean_times)
float_mean_times = array_mean_times.astype(float)

table = {"Task": tasks, "Optimistic": op_time, "Most Likely": ltime, "Pessimistic": ptime, "Mean Duration": mean_times}
print(tabulate(table, headers='keys', tablefmt='fancy_grid', stralign='center', numalign='center'))
You keep the data in numpy arrays, so you can do it in one line, without any indexing:
float_mean_times = (float_op_time + float_ltime + float_ptime) / 3
This directly gives a new numpy array, so it doesn't need np.array(...).astype(float).
And if you have the values as normal lists, then you would need a for-loop with zip():
mean_times = []
for o, l, t in zip(float_op_time, float_ltime, float_ptime):
    average = (o + l + t) / 3
    mean_times.append(average)

mean_times = np.array(mean_times).astype(float)
Frankly, I see one problem: the user may give a different number of values for the different arrays, in which case the first version will not work at all and the second version will only calculate results for the shortest list.
I would rather use one while-loop to make sure the user enters the same number of values for all the lists.
import numpy as np
from tabulate import *

tasks = []
op_time = []
ltime = []
ptime = []

while True:
    t = input("Label for single TASK: ")
    if t == '!stop':
        break

    o = input("Enter OPTIMISTIC time for this single task: ")
    if o == '!stop':
        break

    lt = input("Enter MOST LIKELY time for this single task: ")
    if lt == '!stop':
        break

    p = input("Enter PESSIMISTIC time for this single task: ")
    if p == '!stop':
        break

    # it will add data only if the user answered all the questions
    # (and it will skip the task if the user stops at some question)
    tasks.append(t)
    op_time.append(o)
    ltime.append(lt)
    ptime.append(p)

# ---

float_op_time = np.array(op_time).astype(float)
float_ltime = np.array(ltime).astype(float)
float_ptime = np.array(ptime).astype(float)

float_mean_times = (float_op_time + float_ltime + float_ptime) / 3

table = {
    "Task": tasks,
    "Optimistic": op_time,
    "Most Likely": ltime,
    "Pessimistic": ptime,
    "Mean Duration": float_mean_times,
}

print(tabulate(table, headers='keys', tablefmt='fancy_grid', stralign='center', numalign='center'))
How can a non-parallelised version of this code run much faster than the parallel version, when I am using a Manager object to share a list across processes? I did this to avoid serialisation, and I don't need to edit the list.
I return an 800,000 row data set from Oracle, convert it to a list and store it in shared memory using Manager.list().
I am iterating over each column in the query results, in parallel, to obtain some statistics (I know I could do it in SQL).
Main code:
import cx_Oracle
import csv
import os
import glob
import datetime
import multiprocessing as mp
import get_column_stats as gs;
import pandas as pd
import pandas.io.sql as psql

def get_data():
    print("Starting Job: " + str(datetime.datetime.now()));

    manager = mp.Manager()

    # Step 1: Init multiprocessing.Pool()
    pool = mp.Pool(mp.cpu_count())
    print("CPU Count: " + str(mp.cpu_count()))

    dsn_tns = cx_Oracle.makedsn('myserver.net', '1521', service_name='PARIELGX');
    con = cx_Oracle.connect(user='fred', password='password123', dsn=dsn_tns);

    stats_results = [["OWNER","TABLE","COLUMN_NAME","RECORD_COUNT","DISTINCT_VALUES","MIN_LENGTH","MAX_LENGTH","MIN_VAL","MAX_VAL"]];

    sql = "SELECT * FROM ARIEL.DIM_REGISTRATION_SET"

    cur = con.cursor();
    print("Start Executing SQL: " + str(datetime.datetime.now()));
    cur.execute(sql);
    print("End SQL Execution: " + str(datetime.datetime.now()));

    print("Start SQL Fetch: " + str(datetime.datetime.now()));
    rs = cur.fetchall();
    print("End SQL Fetch: " + str(datetime.datetime.now()));

    print("Start Creation of Shared Memory List: " + str(datetime.datetime.now()));
    lrs = manager.list(list(rs))  # shared memory list
    print("End Creation of Shared Memory List: " + str(datetime.datetime.now()));

    col_names = [];
    for field in cur.description:
        col_names.append(field[0]);

    #print(col_names)
    #print('-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-')
    #print(rs)
    #print('-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-')
    #print(lrs)

    col_index = 0;

    print("Start In-Memory Iteration of Dataset: " + str(datetime.datetime.now()));

    # we go through every field
    for field in cur.description:
        col_names.append(field[0]);

    # start at column 0
    col_index = 0;

    # iterate through each column, to gather stats from each column using parallelisation
    pool_results = pool.map_async(gs.get_column_stats_rs, [(lrs, col_name, col_names) for col_name in col_names]).get()

    for i in pool_results:
        stats_results.append(i)

    # Step 3: Don't forget to close
    pool.close()

    print("End In-Memory Iteration of Dataset: " + str(datetime.datetime.now()));
    # end filename for

    cur.close();

    outfile = open('C:\jupyter\Experiment\stats_dim_registration_set.csv', 'w');
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL, lineterminator='\n');
    writer.writerows(stats_results);
    outfile.close()

    print("Ending Job: " + str(datetime.datetime.now()));

get_data();
Code being called in parallel:
import sys

def get_column_stats_rs(args):
    # rs is a list recordset of the results
    rs, col_name, col_names = args

    col_index = col_names.index(col_name)

    sys.stdout = open("col_" + col_name + ".out", "a")

    print("Starting Iteration of Column: " + col_name)
    max_length = 0
    min_length = 100000  # arbitrarily large number!!
    max_value = ""
    min_value = "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"  # arbitrarily large value!!
    distinct_value_count = 0

    has_values = False  # does the column have any non-null values
    has_null_values = False

    row_count = 0

    # create a dictionary into which we can add the individual items present in each row of data;
    # a dictionary will not let us add the same value more than once, so we can simply count the
    # dictionary values at the end
    distinct_values = {}

    row_index = 0

    # go through every row, for the current column being processed, to gather the stats
    for val in rs:
        row_value = val[col_index]
        row_count += 1

        if row_value is None:
            value_length = 0
        else:
            value_length = len(str(row_value))

        if value_length > max_length:
            max_length = value_length

        if value_length < min_length:
            if value_length > 0:
                min_length = value_length

        if row_value is not None:
            if str(row_value) > max_value:
                max_value = str(row_value)

            if str(row_value) < min_value:
                min_value = str(row_value)

        # capture distinct values
        if row_value is None:
            row_value = "Null"
            has_null_values = True
        else:
            has_values = True

        distinct_values[row_value] = 1

        row_index += 1
    # end row for

    distinct_value_count = len(distinct_values)

    if has_values == False:
        distinct_value_count = None
        min_length = None
        max_length = None
        min_value = None
        max_value = None
    elif has_null_values == True and distinct_value_count > 0:
        distinct_value_count -= 1

    if min_length == 0 and max_length > 0 and has_values == True:
        min_length = max_length

    print("Ending Iteration of Column: " + col_name)

    return ["ARIEL", "DIM_REGISTRATION_SET", col_name, row_count, distinct_value_count, min_length, max_length,
            strip_crlf(str(min_value)), strip_crlf(str(max_value))]
Helper function:
def strip_crlf(value):
    return value.replace('\n', ' ').replace('\r', '')
I am using a Manager.list() object to share state across the processes:
lrs = manager.list(list(rs)) # shared memory list
And I am passing the list to the map_async() call:
pool_results = pool.map_async(gs.get_column_stats_rs, [(lrs, col_name, col_names) for col_name in col_names]).get()
Manager overhead is adding to your runtime. Also, you're not actually using shared memory here; you're using a multiprocessing Manager, which is known to be slower than shared memory or a single-threaded implementation. If you don't need synchronisation in your code, meaning that you're not modifying the shared data, just skip the manager and use shared memory objects directly.
https://docs.python.org/3.7/library/multiprocessing.html
Server process managers are more flexible than using shared memory objects because they can be made to support arbitrary object types. Also, a single manager can be shared by processes on different computers over a network. They are, however, slower than using shared memory.
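If the statistics only need read access, one option (a rough sketch with illustrative names, not a drop-in for the script above) is to drop the Manager and hand the fetched rows to each worker once through a Pool initializer, so that inside the workers the rows are a plain local list with no proxy round-trip per element. With the fork start method the rows are inherited without pickling; with spawn they are pickled once per worker, which is still far cheaper than a proxy call per element access.

import multiprocessing as mp

_rows = None  # set in each worker by init_worker

def init_worker(rows):
    global _rows
    _rows = rows

def column_stats(args):
    col_index, col_name = args
    values = [row[col_index] for row in _rows]  # plain list access, no proxy calls
    non_null = [v for v in values if v is not None]
    return [col_name, len(values), len(set(non_null))]

if __name__ == '__main__':
    rs = [(1, 'a'), (2, 'b'), (3, None)]  # stand-in for cur.fetchall()
    col_names = ['ID', 'NAME']            # stand-in for cur.description
    with mp.Pool(mp.cpu_count(), initializer=init_worker, initargs=(rs,)) as pool:
        print(pool.map(column_stats, list(enumerate(col_names))))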
I've built a program to fill up a databank and, for the time being, it's working. Basically, the program makes a request to the app I'm using (via REST API), returns the data I want, and then manipulates it into an acceptable form for the databank.
The problem is that the GET requests make the algorithm too slow: because I'm accessing the details of particular entries, I have to make one request per entry. I have something close to 15,000 requests to do, and each row in the bank is taking about 1 second to be made.
Is there any way to make these requests faster? How can I improve the performance of this method? And, by the way, any tips on measuring the performance of the code?
Thanks in advance!!
here's the code:
# Retrieving all the IDs I want to get the detailed info for
abc_ids = serializers.serialize('json', modelExample.objects.all(), fields=('id'))
abc_ids = json.loads(abc_ids)
abc_ids_size = len(abc_ids)

# Had to declare these guys right here because at the end of the code I use them in the functions to create and update the bank,
# and Python was complaining that I used them before assignment. Picked arbitrary initial values for them.
age = 0
time_to_won = 0
data = '2016-01-01 00:00:00'

# First loop -> request the detailed info of each ABC
for x in range(0, abc_ids_size):
    id = abc_ids[x]['fields']['id']
    response = requests.get(
        'https://api.example.com/v3/abc/' + str(id) + '?api_token=123123123')
    info = response.json()
    dealx = dict(info)

    # Second loop -> picking the info I want to update and create in the bank
    for key, result in dealx['data'].items():
        # Relevant only for ModelExample -> UPDATE
        if key == 'age':
            result = dict(result)
            age = result['total_seconds']
        # Relevant only for ModelExample -> UPDATE
        elif key == 'average_time_to_won':
            result = dict(result)
            time_to_won = result['total_seconds']
        # Relevant for Model_Example2 -> CREATE
        # Storing a date here to use later in a datetime manipulation
        if key == 'add_time':
            data = str(result)
        elif key == 'time_stage':
            # Each stage has a total of seconds that the user stayed in it.
            y = result['times_in_stages']
            # The user can be in any stage he wants, there's no rule about the order,
            # but there's a record of the order he chose.
            z = result['order_of_stages']
            # Creating a list to fill up with all the stages' info and use in the bulk_create.
            data_set = []
            index = 0
            # Setting the number of repetitions based on the number of stages in the list.
            for elemento in range(0, len(z)):
                data_set_i = {}
                # The index is to define the order of the stages.
                index = index + 1
                for key_1, result_1 in y.items():
                    if int(key_1) == z[elemento]:
                        data_set_i['stage_id'] = int(z[elemento])
                        data_set_i['index'] = int(index)
                        data_set_i['abc_id'] = id
                        # Datetime manipulation
                        if result_1 == 0 and index == 1:
                            data_set_i['add_date'] = data
                        # I know that I totally repeated the code here, I was trying to get this part shorter
                        # but I could not get it right.
                        elif result_1 > 0 and index == 1:
                            data_t = datetime.strptime(data, "%Y-%m-%d %H:%M:%S")
                            data_sum = data_t + timedelta(seconds=result_1)
                            data_sum += timedelta(seconds=3)
                            data_nova = str(data_sum.year) + '-' + str(formaters.DateNine(data_sum.month)) + '-' + str(formaters.DateNine(data_sum.day)) + ' ' + str(data_sum.hour) + ':' + str(formaters.DateNine(data_sum.minute)) + ':' + str(formaters.DateNine(data_sum.second))
                            data_set_i['add_date'] = str(data_nova)
                        else:
                            data_t = datetime.strptime(data_set[elemento - 1]['add_date'], "%Y-%m-%d %H:%M:%S")
                            data_sum = data_t + timedelta(seconds=result_1)
                            data_sum += timedelta(seconds=3)
                            data_nova = str(data_sum.year) + '-' + str(formaters.DateNine(data_sum.month)) + '-' + str(formaters.DateNine(data_sum.day)) + ' ' + str(data_sum.hour) + ':' + str(formaters.DateNine(data_sum.minute)) + ':' + str(formaters.DateNine(data_sum.second))
                            data_set_i['add_date'] = str(data_nova)
                        data_set.append(data_set_i)
            Model_Example2_List = [Model_Example2(**vals) for vals in data_set]
            Model_Example2.objects.bulk_create(Model_Example2_List)

    ModelExample.objects.filter(abc_id=id).update(age=age, time_to_won=time_to_won)
If the bottleneck is in your network requests, there isn't much you can do except perhaps use gzip or deflate compression, but with requests...
The gzip and deflate transfer-encodings are automatically decoded for you.
If you want to be doubly sure, you can add the following header to the GET request:
{'Accept-Encoding': 'gzip,deflate'}
The other alternative is to use threading and have many requests operate in parallel, a good option if you have lots of bandwidth and multiple cores.
Lastly, there are lots of different ways to profile Python, including the cProfile + kcachegrind combo.
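As a rough illustration of the threading suggestion (the endpoint and token are the placeholders from the question, and fetch_detail is a hypothetical helper), a thread pool lets many of those GETs overlap instead of running strictly one after another:

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_detail(abc_id):
    resp = requests.get('https://api.example.com/v3/abc/' + str(abc_id),
                        params={'api_token': '123123123'},
                        headers={'Accept-Encoding': 'gzip,deflate'})
    return abc_id, resp.json()

ids = [1, 2, 3]  # in the real code this would be the ~15000 ids from abc_ids
with ThreadPoolExecutor(max_workers=20) as pool:
    for abc_id, info in pool.map(fetch_detail, ids):
        print(abc_id, len(info))  # hand `info` to the existing row-building logic

For measuring where the time goes, running the script under python -m cProfile -o out.prof and inspecting the output is usually enough to confirm that the requests dominate.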
I am trying to optimize this code; as of right now it runs 340 requests in 10 minutes, and I am trying to get 1800 requests done in 30 minutes, since according to the Amazon API I can run one request every second. Can I use multithreading with this code to increase the number of runs?
Also, right now I am reading the full data set in the main function; should I split it up, and how can I figure out how many items each thread should take?
def newhmac():
    return hmac.new(AWS_SECRET_ACCESS_KEY, digestmod=sha256)

def getSignedUrl(params):
    hmac = newhmac()
    action = 'GET'
    server = "webservices.amazon.com"
    path = "/onca/xml"
    params['Version'] = '2013-08-01'
    params['AWSAccessKeyId'] = AWS_ACCESS_KEY_ID
    params['Service'] = 'AWSECommerceService'
    params['Timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    key_values = [(urllib.quote(k), urllib.quote(v)) for k, v in params.items()]
    key_values.sort()
    paramstring = '&'.join(['%s=%s' % (k, v) for k, v in key_values])
    urlstring = "http://" + server + path + "?" + \
        ('&'.join(['%s=%s' % (k, v) for k, v in key_values]))
    hmac.update(action + "\n" + server + "\n" + path + "\n" + paramstring)
    urlstring = urlstring + "&Signature=" + \
        urllib.quote(base64.encodestring(hmac.digest()).strip())
    return urlstring

def readData():
    data = []
    with open("ASIN.csv") as f:
        reader = csv.reader(f)
        for row in reader:
            data.append(row[0])
    return data

def writeData(data):
    with open("data.csv", "a") as f:
        writer = csv.writer(f)
        writer.writerows(data)

def main():
    data = readData()
    filtData = []
    i = 0
    count = 0
    while (i < len(data) - 10):
        if (count % 4 == 0):
            time.sleep(1)
        asins = ','.join([data[x] for x in range(i, i + 10)])
        params = {'ResponseGroup': 'OfferFull,Offers',
                  'AssociateTag': '4chin-20',
                  'Operation': 'ItemLookup',
                  'IdType': 'ASIN',
                  'ItemId': asins}
        url = getSignedUrl(params)
        resp = requests.get(url)
        responseSoup = BeautifulSoup(resp.text)
        quantity = ['' if product.amount is None else product.amount.text for product in responseSoup.findAll("offersummary")]
        price = ['' if product.lowestnewprice is None else product.lowestnewprice.formattedprice.text for product in responseSoup.findAll("offersummary")]
        prime = ['' if product.iseligibleforprime is None else product.iseligibleforprime.text for product in responseSoup("offer")]
        for zz in zip(asins.split(","), price, quantity, prime):
            print zz
            filtData.append(zz)
        print i, len(filtData)
        i += 10
        count += 1
    writeData(filtData)
    threading.Timer(1.0, main).start()
If you are using Python 3.2+ you can use the concurrent.futures library to make it easy to launch tasks in multiple threads. For example, here I am simulating 10 URL-parsing jobs running in parallel, each of which takes 1 second; run synchronously it would have taken 10 seconds, but with a thread pool of 10 workers it takes about 1 second:
import time
from concurrent.futures import ThreadPoolExecutor

def parse_url(url):
    time.sleep(1)
    print(url)
    return "done."

st = time.time()

with ThreadPoolExecutor(max_workers=10) as executor:
    for i in range(10):
        future = executor.submit(parse_url, "http://google.com/%s" % i)

print("total time: %s" % (time.time() - st))
Output:
http://google.com/0
http://google.com/1
http://google.com/2
http://google.com/3
http://google.com/4
http://google.com/5
http://google.com/6
http://google.com/7
http://google.com/8
http://google.com/9
total time: 1.0066466331481934
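A rough sketch of how the same idea could be applied to the question's code (it reuses getSignedUrl and readData from the question, so treat it as an outline rather than a drop-in replacement): build the 10-ASIN batches up front, hand them to a small thread pool, and pause between submissions to stay near the one-request-per-second limit:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_batch(asins):
    params = {'ResponseGroup': 'OfferFull,Offers',
              'AssociateTag': '4chin-20',
              'Operation': 'ItemLookup',
              'IdType': 'ASIN',
              'ItemId': asins}
    resp = requests.get(getSignedUrl(params))
    return asins, resp.text  # parse with BeautifulSoup as in the question

data = readData()
batches = [','.join(data[i:i + 10]) for i in range(0, len(data), 10)]

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = []
    for batch in batches:
        futures.append(executor.submit(fetch_batch, batch))
        time.sleep(1)  # roughly one submission per second for the API limit
    results = [f.result() for f in futures]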
I have some code (this is not the full file):
chunk_list = []

def makeFakeTransactions(store_num, num_transactions):
    global chunk_list
    startTime = datetime.now()
    data_load_datetime = startTime.isoformat()
    data_load_name = "Faked Data v2.2"
    data_load_path = "data was faked"
    index_list = []
    number_of_stores = store_num + 10
    number_of_terminals = 13
    for month in range(1, 13):
        number_of_days = 30
        extra_day_months = [1, 3, 5, 7, 8, 10, 12]
        if month == 2:
            number_of_days = 28
        elif month in extra_day_months:
            number_of_days = 31
        for day in range(1, number_of_days + 1):
            for store in range(store_num, number_of_stores):
                operator_id = "0001"
                operator_counter = 1
                if store < 11:
                    store_number = "0000" + str(store)
                else:
                    store_number = "000" + str(store)
                for terminal in range(1, number_of_terminals + 1):
                    if terminal < 10:
                        terminal_id = str(terminal) + "000"
                    else:
                        terminal_id = str(terminal) + "00"
                    transaction_type = "RetailTransaction"
                    transaction_type_code = "Transaction"
                    transaction_date = date(2015, month, day)
                    transaction_date_str = transaction_date.isoformat()
                    transaction_time = time(random.randint(0, 23), random.randint(0, 59))
                    transaction_datetime = datetime.combine(transaction_date, transaction_time)
                    transaction_datetime_str = transaction_datetime.isoformat()
                    max_transactions = num_transactions
                    for transaction_number in range(0, max_transactions):
                        inactive_time = random.randint(80, 200)
                        item_count = random.randint(1, 15)
                        sequence_number = terminal_id + str(transaction_number)
                        transaction_datetime = transaction_datetime + timedelta(0, ring_time + special_time + inactive_time)
                        transaction_summary = {}
                        transaction_summary["transaction_type"] = transaction_type
                        transaction_summary["transaction_type_code"] = transaction_type_code
                        transaction_summary["store_number"] = store_number
                        transaction_summary["sequence_number"] = sequence_number
                        transaction_summary["data_load_path"] = data_load_path
                        index_list.append(transaction_summary.copy())
                        operator_counter += 10
                        operator_id = '{0:04d}'.format(operator_counter)
    chunk_list.append(index_list)

if __name__ == '__main__':
    store_num = 1
    process_number = 6
    num_transactions = 10
    p = multiprocessing.Pool(process_number)
    results = [p.apply(makeFakeTransactions, args=(store_num, num_transactions,)) for store_num in xrange(1, 30, 10)]
    results = [p.apply(elasticIndexing, args=(index_list,)) for index_list in chunk_list]
I have a global variable chunk_list that gets appended to at the end of my makeFakeTransactions function and basically it's a list of lists. However, when I do a test print of chunk_list after the 3 processes for makeFakeTransactions, the chunk_list shows up empty, even though it should've been appended to 3 times. Am I doing something wrong regarding global list variables in multiprocessing? Is there a better way to do this?
Edit: makeFakeTransactions appends a dictionary copy to index_list and once all the dictionaries are appended to index_list, it appends index_list to the global variable chunk_list.
First, your code isn't actually running in parallel. According to the docs, p.apply blocks until the result is ready, so you are running your tasks sequentially on the process pool. You need an asynchronous call such as p.apply_async (or p.map_async) to kick off a task without waiting for it to complete.
Second, as was said in a comment, global state isn't shared between processes. You can use shared memory, but in this case it is much simpler to transfer the result back from the worker process. Since you don't use chunk_list for anything other than collecting the results, you can just send each result back after computation and collect them in the calling process. This is easy with multiprocessing.Pool: just return the result from your worker function:
return index_list
This will make p.apply() return index_list, and p.apply_async() will return an AsyncResult that yields index_list when you call AsyncResult.get(). Since you're already using a list comprehension, the modifications are small:
p = multiprocessing.Pool(process_number)
async_results = [p.apply_async(makeFakeTransactions, args = (store_num, num_transactions,)) for store_num in xrange(1, 30, 10)]
results = [ar.get() for ar in async_results]
You can simplify it down to one step by using p.map, which effectively does what those previous two lines do. Note that p.map blocks until all results are available.
from functools import partial

p = multiprocessing.Pool(process_number)
results = p.map(partial(makeFakeTransactions, num_transactions=num_transactions), xrange(1, 30, 10))
Since p.map expects a function of a single argument, num_transactions is bound ahead of time with functools.partial. A plain lambda would not work here, because Pool has to pickle the function it sends to the worker processes and lambdas can't be pickled.
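For completeness, here is a minimal self-contained sketch of the pattern (make_chunk is a stand-in for makeFakeTransactions, not the code from the question): each worker builds and returns its own list, and the parent collects the returned lists instead of relying on a global:

import multiprocessing

def make_chunk(store_num):
    # stand-in worker: build and return this process's index_list
    return [{"store_num": store_num, "transaction_number": n} for n in range(3)]

if __name__ == '__main__':
    p = multiprocessing.Pool(3)
    async_results = [p.apply_async(make_chunk, args=(store_num,)) for store_num in range(1, 30, 10)]
    chunk_list = [ar.get() for ar in async_results]
    p.close()
    p.join()
    print(len(chunk_list))  # 3 chunks, collected in the parent process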