Global list not updating when using multiprocessing in Python

I have some code (this is not the full file):
chunk_list = []

def makeFakeTransactions(store_num, num_transactions):
    global chunk_list
    startTime = datetime.now()
    data_load_datetime = startTime.isoformat()
    data_load_name = "Faked Data v2.2"
    data_load_path = "data was faked"
    index_list = []
    number_of_stores = store_num + 10
    number_of_terminals = 13
    for month in range(1, 13):
        number_of_days = 30
        extra_day_months = [1, 3, 5, 7, 8, 10, 12]
        if month == 2:
            number_of_days = 28
        elif month in extra_day_months:
            number_of_days = 31
        for day in range(1, number_of_days + 1):
            for store in range(store_num, number_of_stores):
                operator_id = "0001"
                operator_counter = 1
                if store < 11:
                    store_number = "0000" + str(store)
                else:
                    store_number = "000" + str(store)
                for terminal in range(1, number_of_terminals + 1):
                    if terminal < 10:
                        terminal_id = str(terminal) + "000"
                    else:
                        terminal_id = str(terminal) + "00"
                    transaction_type = "RetailTransaction"
                    transaction_type_code = "Transaction"
                    transaction_date = date(2015, month, day)
                    transaction_date_str = transaction_date.isoformat()
                    transaction_time = time(random.randint(0, 23), random.randint(0, 59))
                    transaction_datetime = datetime.combine(transaction_date, transaction_time)
                    transaction_datetime_str = transaction_datetime.isoformat()
                    max_transactions = num_transactions
                    for transaction_number in range(0, max_transactions):
                        inactive_time = random.randint(80, 200)
                        item_count = random.randint(1, 15)
                        sequence_number = terminal_id + str(transaction_number)
                        transaction_datetime = transaction_datetime + timedelta(0, ring_time + special_time + inactive_time)
                        transaction_summary = {}
                        transaction_summary["transaction_type"] = transaction_type
                        transaction_summary["transaction_type_code"] = transaction_type_code
                        transaction_summary["store_number"] = store_number
                        transaction_summary["sequence_number"] = sequence_number
                        transaction_summary["data_load_path"] = data_load_path
                        index_list.append(transaction_summary.copy())
                        operator_counter += 10
                        operator_id = '{0:04d}'.format(operator_counter)
    chunk_list.append(index_list)
if __name__ == '__main__':
    store_num = 1
    process_number = 6
    num_transactions = 10
    p = multiprocessing.Pool(process_number)
    results = [p.apply(makeFakeTransactions, args=(store_num, num_transactions,)) for store_num in xrange(1, 30, 10)]
    results = [p.apply(elasticIndexing, args=(index_list,)) for index_list in chunk_list]
I have a global variable chunk_list that gets appended to at the end of my makeFakeTransactions function and basically it's a list of lists. However, when I do a test print of chunk_list after the 3 processes for makeFakeTransactions, the chunk_list shows up empty, even though it should've been appended to 3 times. Am I doing something wrong regarding global list variables in multiprocessing? Is there a better way to do this?
Edit: makeFakeTransactions appends a dictionary copy to index_list and once all the dictionaries are appended to index_list, it appends index_list to the global variable chunk_list.

First, your code isn't actually running in parallel. According to the docs, p.apply blocks until the call completes, so you are running your tasks sequentially on the process pool. You need an asynchronous variant such as p.apply_async to kick off a task without waiting for it to complete.
Second, as was said in a comment, global state isn't shared between processes. You can use shared memory, but in this case it is much simpler to transfer the result back from the worker process. Since you don't use chunk_list for anything other than collecting the results, you can just send each result back after computation and collect them in the calling process. This is easy with multiprocessing.Pool: you just return the result from your worker function:
return index_list
This makes p.apply() return index_list, and p.apply_async() return an AsyncResult that yields index_list via AsyncResult.get(). Since you're already using a list comprehension, the modifications are small:
p = multiprocessing.Pool(process_number)
async_results = [p.apply_async(makeFakeTransactions, args = (store_num, num_transactions,)) for store_num in xrange(1, 30, 10)]
results = [ar.get() for ar in async_results]
You can simplify this down to one step by using p.map, which effectively does what those previous two lines do. Note that p.map blocks until all results are available.
import functools

p = multiprocessing.Pool(process_number)
results = p.map(functools.partial(makeFakeTransactions, num_transactions=num_transactions), xrange(1, 30, 10))
Since p.map expects a single-argument function, the extra num_transactions argument is bound with functools.partial; a lambda would not work here, because lambdas cannot be pickled for transfer to the worker processes.
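For reference, a minimal self-contained sketch of the whole pattern (Python 3 syntax; make_fake_transactions is a simplified stand-in for the real worker, which returns its index_list instead of appending to a global):
import multiprocessing

def make_fake_transactions(store_num, num_transactions):
    # Stand-in for the real worker: build and return the per-store result
    # instead of appending to a module-level list.
    return [{"store_num": store_num, "transaction": n} for n in range(num_transactions)]

if __name__ == '__main__':
    num_transactions = 10
    pool = multiprocessing.Pool(6)
    async_results = [pool.apply_async(make_fake_transactions, args=(store_num, num_transactions))
                     for store_num in range(1, 30, 10)]
    chunk_list = [ar.get() for ar in async_results]  # one result list per store batch
    pool.close()
    pool.join()
    print(len(chunk_list))  # 3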

Related

Why isn't my Simpy resource keeping a queue?

I've been working for around a week to learn SimPy for a discrete simulation I have to run. I've done my best, but I'm just not experienced enough to figure it out quickly. I am dying. Please help.
The system in question goes like this:
order arrives -> resource_1 (there are 2) performs take_order -> order broken into items -> resource_2 (there are 10) performs process_item
My code runs and performs the simulation, but I'm having a lot of trouble getting the queues on the resources to function. As in, queues do not build up on either resource when I run it, and I cannot find the reason why. I try resource.get_queue and get empty lists. There should absolutely be queues, as the orders arrive faster than they can be processed.
I think it has something to do with the logic for requesting resources, but I can't figure it out. Here's how I've structured the code:
import simpy
import random
import numpy as np

total_items = []
total_a = []
total_b = []
total_c = []
order_Q = []
item_Q = []
skipped_visits = []
order_time_dict = {}
order_time_dict2 = {}
total_order_time_dict = {}
var = []

class System:
    def __init__(self, env, num_resource_1, num_resource_2):
        self.env = env
        self.resource_1 = simpy.Resource(env, num_resource_1)
        self.resource_2 = simpy.Resource(env, num_resource_2)

    def take_order(self, order):
        self.time_to_order = random.triangular(30/60, 60/60, 120/60)
        arrive = self.env.now
        yield self.env.timeout(self.time_to_order)

    def process_item(self, item):
        total_process_time = 0
        current = env.now
        order_num = item[1][0]
        for i in range(1, item[1][1]):
            if 'a' in item[0]:
                total_process_time += random.triangular(.05, 7/60, 1/6)  # bagging time only
                # here edit order time w x
            if 'b' in item[0]:
                total_process_time += random.triangular(.05, .3333, .75)
            if 'c' in item[0]:
                total_process_time += random.triangular(.05, 7/60, 1/6)
        # the following is handling time: getting to station, waiting on car to arrive at window after finished, handing to cust
        total_process_time += random.triangular(.05, 10/60, 15/60)
        item_finish_time = current + total_process_time
        if order_num in order_time_dict2.keys():
            start = order_time_dict2[order_num][0]
            if order_time_dict2[order_num][1] < item_finish_time:
                order_time_dict2[order_num] = (start, item_finish_time)
        else:
            order_time_dict2[order_num] = (current, item_finish_time)
        yield self.env.timeout(total_process_time)

class Order:
    def __init__(self, order_dict, order_num):
        self.order_dict = order_dict
        self.order_num = order_num
        self.order_stripped = {}
        for x, y in list(self.order_dict.items()):
            if x != 'total':
                if y != 0:
                    self.order_stripped[x] = (order_num, y)  # this gives dictionary format {item: (order number, number items)} but only including items in order
        self.order_list = list(self.order_stripped.items())

def generate_order(num_orders):
    print('running generate_order')
    a_demand = .1914 ** 3
    a_stdev = 43.684104
    b_demand = .1153
    b_stdev = 28.507782
    c_demand = .0664
    c_stdev = 15.5562624349
    num_a = abs(round(np.random.normal(a_demand)))
    num_b = abs(round(np.random.normal(b_demand)))
    num_c = abs(round(np.random.normal(c_demand)))
    total = num_orders
    total_a.append(num_a)
    total_b.append(num_b)
    total_c.append(num_c)
    total_num_items = num_a + num_b + num_c
    total_items.append(total_num_items)
    order_dict = {'num_a': num_a, 'num_b': num_b, 'num_c': num_c, 'total': total}
    return order_dict

def order_process(order_instance, system):
    enter_system_at = system.env.now
    print("order " + str(order_instance.order_num) + " arrives at " + str(enter_system_at))
    if len(system.resource_1.get_queue) > 1:
        print("WORKING HERE ******************")
    if len(system.resource_1.get_queue) <= 25:
        with system.resource_1.request() as req:
            order_Q.append(order_instance)
            yield req
            yield env.process(system.take_order(order_instance))
            order_Q.pop()
        enter_workstation_at = system.env.now
        print("order num " + str(order_instance.order_num) + " enters workstation at " + str(enter_workstation_at))
        for item in order_instance.order_list:
            item_Q.append(item)
            with system.resource_2.request() as req:
                yield req
                yield env.process(system.process_item(item))
            if len(system.resource_2.get_queue) > 1:
                var.append(1)
            item_Q.pop()
        leave_workstation_at = system.env.now
        print("Order num " + str(order_instance.order_num) + " leaves at " + str(leave_workstation_at))
        order_time_dict[order_instance.order_num] = leave_workstation_at - enter_workstation_at
        total_order_time_dict[order_instance.order_num] = leave_workstation_at - enter_system_at
    else:
        skipped_visits.append(1)

def setup(env):
    system = System(env, 2, 15)
    order_num = 0
    while True:
        next_order = random.expovariate(3.5)  # where 20 is order arrival mean (lambda)
        yield env.timeout(next_order)
        order_num += 1
        env.process(order_process(Order(generate_order(order_num), order_num), system))

env = simpy.Environment()
env.process(setup(env))
env.run(until=15*60)
print("1: \n", order_time_dict)
I think you are looking at the wrong queue.
The API for getting the queued requests of a resource is just the queue attribute, so try using
len(system.resource_1.queue)
get_queue and put_queue come from the base class and are used for deriving new resource classes.
Admittedly, they are not what a reasonable person would assume, and I find this confusing too, but the docs say:
Requesting a resource is modeled as "putting a process' token into the resource", which means that when you call request() the process is put into the put_queue, not the get_queue. And with a Resource, release always succeeds immediately, so its queue (which is the get_queue) is always empty.
I think queue is just an alias for the put_queue, but queue is much less confusing.
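For illustration, a minimal sketch (not the poster's model) showing that waiting requests appear in resource.queue while the get_queue stays empty:
import simpy

def user(env, resource, name):
    with resource.request() as req:
        yield req
        # At the moment this user acquires the resource, any later arrivals
        # are still waiting in resource.queue (the put_queue).
        print(name, "acquired at t =", env.now, "waiting requests:", len(resource.queue))
        yield env.timeout(5)  # hold the resource so later arrivals have to queue

env = simpy.Environment()
resource = simpy.Resource(env, capacity=1)
for i in range(3):
    env.process(user(env, resource, "user%d" % i))
env.run()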

Can the CSV data be saved in a dictionary for multiprocessing, to save I/O operations and speed up the process?

I have around 1500 CSV files with OHLC data of stocks, each containing 90,000-100,000 rows.
Below is the multiprocessing code that processes each of the files (with a number of iterations). When I tried to use 16 processes, my system started to hang a bit. I am fairly sure this is because of heavy I/O (since the system has to open each and every file). Is it a good idea to load all 1500 CSV files into one dictionary and then run the code? Would that reduce the run time, or would it make the hanging worse?
Also, the system works fine with 10 processes.
Here is what the OHLC data looks like (screenshot omitted).
import numpy as np
import pandas as pd
import os
import multiprocessing
import datetime
import itertools
import time
import warnings
warnings.filterwarnings('ignore')

# bank nifty
bn_futures = pd.read_csv('E:\\Tanmay\\Data\\Bank Nifty Index\\BankNifty_Futures GFDL 2011-2020.csv')
bn_futures['Date_time'] = bn_futures['Date'] + ' ' + bn_futures['Time']
bn_futures['Date_time'] = pd.to_datetime(bn_futures['Date_time'], format='%Y-%m-%d %H:%M:%S')
bn_futures = bn_futures[bn_futures['Date_time'].dt.date > datetime.date(2016,5,26)]
req_cols = [x for x in bn_futures.columns if 'Unnamed' not in x]
bn_futures = bn_futures[req_cols]
bn_futures['straddle'] = round(bn_futures['Close'],-2)
bn_futures['straddle'] = bn_futures['straddle'].astype(int)
bn_futures['straddle'] = bn_futures['straddle'].astype(str)
bn_futures['Date'] = bn_futures['Date_time'].dt.date
dates = list(set(bn_futures['Date'].to_list()))
dates.sort()

option_files1 = os.listdir('E:\\\\2nd Set\\')
option_files = []
for i in option_files1:
    if datetime.datetime.strptime(i.split('.')[0],'%Y-%m-%d').date() >= datetime.date(2016,5,27):
        option_files.append(i)

def time_loop(start_time, end_time, timeframe):
    start_datetime = datetime.datetime.combine(datetime.datetime.today().date(), start_time)
    end_datetime = datetime.datetime.combine(datetime.datetime.today().date(), end_time)
    difference = int((((end_datetime - start_datetime).total_seconds())/60)/timeframe)
    final_time_list = []
    for i in range(difference):
        final_time_list.append((start_datetime + datetime.timedelta(minutes=i*timeframe)).time())
    return final_time_list

entry_time_list = time_loop(datetime.time(9,19), datetime.time(15,19), 5)
sl_list = np.arange(1.1, 2, 0.1)
# sl_list = list(range(1.1,2,0.1))
paramlist = list(itertools.product(entry_time_list, sl_list))

def strategy(main_entry_time, sl):
    print(main_entry_time, sl)
    main_dict = {}
    for file in option_files:
        date = datetime.datetime.strptime(file.split('.')[0],'%Y-%m-%d').date()
        try:
            # reading current date bn futures
            bn = bn_futures[bn_futures['Date'] == date]
            # reading main time bn futures
            b = bn[bn['Date_time'].dt.time == main_entry_time]
            straddle_value = b['straddle'].iloc[0]
            df = pd.read_csv('E:\\Tanmay\\Data\\Bank nifty Intraday All expiries\\2nd Set\\' + file)
            df['Date_time'] = pd.to_datetime(df['Date_time'], format='%Y-%m-%d %H:%M:%S')
            h = [k for k in df.columns if 'Un' not in k]
            df = df[h]
            total_df = df[(df['Ticker'].str.contains(straddle_value)) & (df['Expiry_number'] == 0) & (df['W/M'] == 'W')]
            option_types = ['CE','PE']
            for option in option_types:
                option_df = total_df[(total_df['Ticker'].str.contains(option)) & (total_df['Date_time'].dt.time == main_entry_time)]
                entry_price = option_df['Close'].iloc[0]
                strike = option
                entry_time = main_entry_time
                trade_df = total_df[(total_df['Ticker'].str.contains(option)) & (total_df['Date_time'].dt.time > main_entry_time)]
                trade_df.sort_values(by='Date_time', inplace=True)
                for t in trade_df.index:
                    if trade_df['Date_time'][t].time() > entry_time:
                        if trade_df['High'][t] > entry_price * sl:
                            exit_price = entry_price * sl
                            exit_time = trade_df['Date_time'][t].time()
                            profit = entry_price - exit_price - 0.02 * entry_price
                            main_dict['SL_'+str(sl)+'entry_time_'+str(main_entry_time)+'entry_date_'+str(date)+'_'+option] = {'Entry_date':str(date),'Entry_time':entry_time,'Strike':str(straddle_value)+option,'Entry_price':entry_price,'Exit_price':exit_price,'exit_time':exit_time,'profit':profit,'Reason':'SL'}
                            break
                        if trade_df['Date_time'][t].time() >= datetime.time(15,14,0):
                            exit_price = trade_df['Close'][t]
                            exit_time = trade_df['Date_time'][t].time()
                            profit = entry_price - exit_price - 0.02 * entry_price
                            main_dict['SL_'+str(sl)+'entry_time_'+str(main_entry_time)+'entry_date_'+str(date)+'_'+option] = {'Entry_date':str(date),'Entry_time':entry_time,'Strike':str(straddle_value)+option,'Entry_price':entry_price,'Exit_price':exit_price,'exit_time':exit_time,'profit':profit,'Reason':'EOD'}
                            break
        except Exception as yy:
            pass
    final_dict = dict(main_dict)
    final_df = pd.DataFrame(final_dict)
    final_df = final_df.transpose()
    final_df.to_csv('SL_'+str(sl)+'entry_time_'+str(main_entry_time).replace(':','')+'entry_date_'+str(date)+'.csv')

if __name__ == '__main__':
    start_time = time.time()
    # mgr = multiprocessing.Manager()
    # main_dict = mgr.dict()
    total_data = paramlist
    p = multiprocessing.Pool(processes=10)
    p.starmap(strategy, total_data)
    p.close()
Before you can improve the multiprocessing performance, you should make sure your serial implementation is as efficient as it can be. Have you done that?
Your strategy method is currently re-reading every option file for each element of total_data passed to it. This is highly inefficient, and moreover it may be contributing significantly to whatever is stalling your I/O (it depends on caching, which I discuss later). What if the data were put in a database and read up front, or perhaps stored in a dictionary initialized at the beginning?
As for strategy writing out the CSV file, it should instead return the input parameters and final_df back to the main process so that the main process can do all the I/O. For this, imap_unordered with a suitable chunksize argument is better suited, because the main process can then write out results as they become available. Because we are no longer using the starmap method, strategy will now be passed a tuple that has to be unpacked:
import numpy as np
import pandas as pd
import os
import multiprocessing
import datetime
import itertools
import time
import warnings
warnings.filterwarnings('ignore')

# bank nifty
if __name__ == '__main__':
    bn_futures = pd.read_csv('E:\\Tanmay\\Data\\Bank Nifty Index\\BankNifty_Futures GFDL 2011-2020.csv')
    bn_futures['Date_time'] = bn_futures['Date'] + ' ' + bn_futures['Time']
    bn_futures['Date_time'] = pd.to_datetime(bn_futures['Date_time'], format='%Y-%m-%d %H:%M:%S')
    bn_futures = bn_futures[bn_futures['Date_time'].dt.date > datetime.date(2016,5,26)]
    req_cols = [x for x in bn_futures.columns if 'Unnamed' not in x]
    bn_futures = bn_futures[req_cols]
    bn_futures['straddle'] = round(bn_futures['Close'],-2)
    bn_futures['straddle'] = bn_futures['straddle'].astype(int)
    bn_futures['straddle'] = bn_futures['straddle'].astype(str)
    bn_futures['Date'] = bn_futures['Date_time'].dt.date
    dates = list(set(bn_futures['Date'].to_list()))
    dates.sort()

    option_files1 = os.listdir('E:\\\\2nd Set\\')
    option_files = []
    for i in option_files1:
        if datetime.datetime.strptime(i.split('.')[0],'%Y-%m-%d').date() >= datetime.date(2016,5,27):
            option_files.append(i)

    def time_loop(start_time, end_time, timeframe):
        start_datetime = datetime.datetime.combine(datetime.datetime.today().date(), start_time)
        end_datetime = datetime.datetime.combine(datetime.datetime.today().date(), end_time)
        difference = int((((end_datetime - start_datetime).total_seconds())/60)/timeframe)
        final_time_list = []
        for i in range(difference):
            final_time_list.append((start_datetime + datetime.timedelta(minutes=i*timeframe)).time())
        return final_time_list

    entry_time_list = time_loop(datetime.time(9,19), datetime.time(15,19), 5)
    sl_list = np.arange(1.1, 2, 0.1)
    # sl_list = list(range(1.1,2,0.1))
    paramlist = list(itertools.product(entry_time_list, sl_list))

def init_pool_processes(o_f):
    global option_files
    option_files = o_f

def strategy(tpl):
    main_entry_time, sl = tpl  # unpack tuple
    print(main_entry_time, sl)
    main_dict = {}
    for file in option_files:
        date = datetime.datetime.strptime(file.split('.')[0],'%Y-%m-%d').date()
        try:
            # reading current date bn futures
            bn = bn_futures[bn_futures['Date'] == date]
            # reading main time bn futures
            b = bn[bn['Date_time'].dt.time == main_entry_time]
            straddle_value = b['straddle'].iloc[0]
            df = pd.read_csv('E:\\Tanmay\\Data\\Bank nifty Intraday All expiries\\2nd Set\\' + file)
            df['Date_time'] = pd.to_datetime(df['Date_time'], format='%Y-%m-%d %H:%M:%S')
            h = [k for k in df.columns if 'Un' not in k]
            df = df[h]
            total_df = df[(df['Ticker'].str.contains(straddle_value)) & (df['Expiry_number'] == 0) & (df['W/M'] == 'W')]
            option_types = ['CE','PE']
            for option in option_types:
                option_df = total_df[(total_df['Ticker'].str.contains(option)) & (total_df['Date_time'].dt.time == main_entry_time)]
                entry_price = option_df['Close'].iloc[0]
                strike = option
                entry_time = main_entry_time
                trade_df = total_df[(total_df['Ticker'].str.contains(option)) & (total_df['Date_time'].dt.time > main_entry_time)]
                trade_df.sort_values(by='Date_time', inplace=True)
                for t in trade_df.index:
                    if trade_df['Date_time'][t].time() > entry_time:
                        if trade_df['High'][t] > entry_price * sl:
                            exit_price = entry_price * sl
                            exit_time = trade_df['Date_time'][t].time()
                            profit = entry_price - exit_price - 0.02 * entry_price
                            main_dict['SL_'+str(sl)+'entry_time_'+str(main_entry_time)+'entry_date_'+str(date)+'_'+option] = {'Entry_date':str(date),'Entry_time':entry_time,'Strike':str(straddle_value)+option,'Entry_price':entry_price,'Exit_price':exit_price,'exit_time':exit_time,'profit':profit,'Reason':'SL'}
                            break
                        if trade_df['Date_time'][t].time() >= datetime.time(15,14,0):
                            exit_price = trade_df['Close'][t]
                            exit_time = trade_df['Date_time'][t].time()
                            profit = entry_price - exit_price - 0.02 * entry_price
                            main_dict['SL_'+str(sl)+'entry_time_'+str(main_entry_time)+'entry_date_'+str(date)+'_'+option] = {'Entry_date':str(date),'Entry_time':entry_time,'Strike':str(straddle_value)+option,'Entry_price':entry_price,'Exit_price':exit_price,'exit_time':exit_time,'profit':profit,'Reason':'EOD'}
                            break
        except Exception as yy:
            pass
    # final_dict = dict(main_dict)  # why make a copy?
    final_df = pd.DataFrame(main_dict)
    final_df = final_df.transpose()
    # final_df.to_csv('SL_'+str(sl)+'entry_time_'+str(main_entry_time).replace(':','')+'entry_date_'+str(date)+'.csv')
    return (main_entry_time, sl, date, final_df)

def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

if __name__ == '__main__':
    start_time = time.time()
    # mgr = multiprocessing.Manager()
    # main_dict = mgr.dict()
    total_data = paramlist
    POOL_SIZE = 10
    p = multiprocessing.Pool(processes=POOL_SIZE, initializer=init_pool_processes, initargs=(option_files,))
    chunksize = compute_chunksize(len(total_data), POOL_SIZE)
    results = p.imap_unordered(strategy, total_data, chunksize=chunksize)
    for main_entry_time, sl, date, final_df in results:
        final_df.to_csv('SL_'+str(sl)+'entry_time_'+str(main_entry_time).replace(':','')+'entry_date_'+str(date)+'.csv')
    p.close()
    p.join()
Since you are running under Windows, as I mentioned before, code at global scope will be executed by each pool process as part of its initialization, so it is inefficient to have code at global scope that is not required by your worker function strategy and that is not contained within an if __name__ == '__main__': block. So that is what I have done. Since your worker function does need to reference option_files (for the time being, until the issue I raised at the start is addressed), I have used the initializer and initargs arguments of the Pool constructor so that after the option_files list is created once, by the main process only, it is copied to each process in the pool and used to initialize a global variable option_files.
But I cannot stress enough that you should figure out a way of eliminating the repeated reading of the files in the option_files list. Ideally, you would build a dictionary of some sort that can be passed as another argument to init_pool_processes, so that each pool process gets access to a copy of the dictionary once it is constructed; a sketch of that idea follows. What might save you is that Windows caches file data; depending on the cache size and the size of the CSV files, the I/O bottleneck may not be as big a problem as it would otherwise be.
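A sketch of that caching idea (illustrative only: the toy DataFrames stand in for reading the real CSV files once with pd.read_csv, and csv_cache is a hypothetical name; note that the cache is serialized to every worker at pool start-up, so memory use grows with the pool size):
import multiprocessing
import pandas as pd

def init_pool_processes(cache):
    # Runs once per worker: stash the preloaded data in a global so tasks
    # do an in-memory lookup instead of re-reading CSVs from disk.
    global csv_cache
    csv_cache = cache

def strategy(tpl):
    main_entry_time, sl = tpl
    # Stand-in for the real per-file logic: just count rows per cached frame.
    return main_entry_time, sl, {name: len(df) for name, df in csv_cache.items()}

if __name__ == '__main__':
    # Hypothetical stand-in for reading the ~1500 option files exactly once up front.
    cache = {'2016-05-%02d.csv' % d: pd.DataFrame({'Close': range(100)}) for d in range(27, 31)}
    params = [(t, sl) for t in ('09:19', '09:24') for sl in (1.1, 1.2)]
    with multiprocessing.Pool(processes=4, initializer=init_pool_processes, initargs=(cache,)) as pool:
        for result in pool.imap_unordered(strategy, params):
            print(result)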
In the meanwhile, you could experiment with forcing "single threading" of the reading and writing of data by using a multiprocessing.Lock with the following changes. If Windows caches all the reads after the first time you read the option files, it will probably not make too big a difference. The posted code above, which already single-threads the writing, should help, however.
if __name__ == '__main__':
    start_time = time.time()
    # mgr = multiprocessing.Manager()
    # main_dict = mgr.dict()
    total_data = paramlist
    POOL_SIZE = 10
    io_lock = multiprocessing.Lock()
    p = multiprocessing.Pool(processes=POOL_SIZE, initializer=init_pool_processes, initargs=(option_files, io_lock))
    chunksize = compute_chunksize(len(total_data), POOL_SIZE)
    results = p.imap_unordered(strategy, total_data, chunksize=chunksize)
    for main_entry_time, sl, date, final_df in results:
        with io_lock:
            final_df.to_csv('SL_'+str(sl)+'entry_time_'+str(main_entry_time).replace(':','')+'entry_date_'+str(date)+'.csv')
    p.close()
    p.join()
And:
def init_pool_processes(o_f, lock):
    global option_files, io_lock
    option_files = o_f
    io_lock = lock
And finally:
def strategy(tpl):
    ...
    straddle_value = b['straddle'].iloc[0]
    with io_lock:
        df = pd.read_csv('E:\\Tanmay\\Data\\Bank nifty Intraday All expiries\\2nd Set\\' + file)

Python multiprocessing slower processing a list, even when using shared memory

How can a non-parallelised version of this code run much faster than a parallel version, when I am using a Manager object to share a list across processes? I did this to avoid any serialisation, and I don't need to edit the list.
I return an 800,000-row data set from Oracle, convert it to a list and store it in shared memory using Manager.list().
I am iterating over each column in the query results, in parallel, to obtain some statistics (I know I could do it in SQL).
Main code:
import cx_Oracle
import csv
import os
import glob
import datetime
import multiprocessing as mp
import get_column_stats as gs;
import pandas as pd
import pandas.io.sql as psql

def get_data():
    print("Starting Job: " + str(datetime.datetime.now()));
    manager = mp.Manager()
    # Step 1: Init multiprocessing.Pool()
    pool = mp.Pool(mp.cpu_count())
    print("CPU Count: " + str(mp.cpu_count()))
    dsn_tns = cx_Oracle.makedsn('myserver.net', '1521', service_name='PARIELGX');
    con = cx_Oracle.connect(user='fred', password='password123', dsn=dsn_tns);
    stats_results = [["OWNER","TABLE","COLUMN_NAME","RECORD_COUNT","DISTINCT_VALUES","MIN_LENGTH","MAX_LENGTH","MIN_VAL","MAX_VAL"]];
    sql = "SELECT * FROM ARIEL.DIM_REGISTRATION_SET"
    cur = con.cursor();
    print("Start Executing SQL: " + str(datetime.datetime.now()));
    cur.execute(sql);
    print("End SQL Execution: " + str(datetime.datetime.now()));
    print("Start SQL Fetch: " + str(datetime.datetime.now()));
    rs = cur.fetchall();
    print("End SQL Fetch: " + str(datetime.datetime.now()));
    print("Start Creation of Shared Memory List: " + str(datetime.datetime.now()));
    lrs = manager.list(list(rs))  # shared memory list
    print("End Creation of Shared Memory List: " + str(datetime.datetime.now()));
    col_names = [];
    for field in cur.description:
        col_names.append(field[0]);
    #print(col_names)
    #print('-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-')
    #print(rs)
    #print('-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-')
    #print(lrs)
    col_index = 0;
    print("Start In-Memory Iteration of Dataset: " + str(datetime.datetime.now()));
    # we go through every field
    for field in cur.description:
        col_names.append(field[0]);
    # start at column 0
    col_index = 0;
    # iterate through each column, to gather stats from each column using parallelisation
    pool_results = pool.map_async(gs.get_column_stats_rs, [(lrs, col_name, col_names) for col_name in col_names]).get()
    for i in pool_results:
        stats_results.append(i)
    # Step 3: Don't forget to close
    pool.close()
    print("End In-Memory Iteration of Dataset: " + str(datetime.datetime.now()));
    # end filename for
    cur.close();
    outfile = open('C:\jupyter\Experiment\stats_dim_registration_set.csv','w');
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL, lineterminator='\n');
    writer.writerows(stats_results);
    outfile.close()
    print("Ending Job: " + str(datetime.datetime.now()));

get_data();
Code being called in parallel:
def get_column_stats_rs(args):
    # rs is a list recordset of the results
    rs, col_name, col_names = args
    col_index = col_names.index(col_name)
    sys.stdout = open("col_" + col_name + ".out", "a")
    print("Starting Iteration of Column: " + col_name)
    max_length = 0
    min_length = 100000  # abitrarily large number!!
    max_value = ""
    min_value = "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"  # abitrarily large number!!
    distinct_value_count = 0
    has_values = False  # does the column have any non-null values
    has_null_values = False
    row_count = 0
    # create a dictionary into which we can add the individual items present in each row of data
    # a dictionary will not let us add the same value more than once, so we can simply count the
    # dictionary values at the end
    distinct_values = {}
    row_index = 0
    # go through every row, for the current column being processed to gather the stats
    for val in rs:
        row_value = val[col_index]
        row_count += 1
        if row_value is None:
            value_length = 0
        else:
            value_length = len(str(row_value))
        if value_length > max_length:
            max_length = value_length
        if value_length < min_length:
            if value_length > 0:
                min_length = value_length
        if row_value is not None:
            if str(row_value) > max_value:
                max_value = str(row_value)
            if str(row_value) < min_value:
                min_value = str(row_value)
        # capture distinct values
        if row_value is None:
            row_value = "Null"
            has_null_values = True
        else:
            has_values = True
            distinct_values[row_value] = 1
        row_index += 1
    # end row for
    distinct_value_count = len(distinct_values)
    if has_values == False:
        distinct_value_count = None
        min_length = None
        max_length = None
        min_value = None
        max_value = None
    elif has_null_values == True and distinct_value_count > 0:
        distinct_value_count -= 1
    if min_length == 0 and max_length > 0 and has_values == True:
        min_length = max_length
    print("Ending Iteration of Column: " + col_name)
    return ["ARIEL", "DIM_REGISTRATION_SET", col_name, row_count, distinct_value_count, min_length, max_length,
            strip_crlf(str(min_value)), strip_crlf(str(max_value))]
Helper function:
def strip_crlf(value):
    return value.replace('\n', ' ').replace('\r', '')
I am using a Manager.list() object to share state across the processes:
lrs = manager.list(list(rs)) # shared memory list
And am passing the list in the map_async() method:
pool_results = pool.map_async(gs.get_column_stats_rs, [(lrs, col_name, col_names) for col_name in col_names]).get()
Manager overhead is adding to your runtime. Also, you are not actually using shared memory here: you are using a multiprocessing manager, which is known to be slower than shared memory or a single-threaded implementation. If you don't need synchronisation in your code, meaning that you're not modifying the shared data, just skip the manager and use shared memory objects directly.
https://docs.python.org/3.7/library/multiprocessing.html
"Server process managers are more flexible than using shared memory objects because they can be made to support arbitrary object types. Also, a single manager can be shared by processes on different computers over a network. They are, however, slower than using shared memory."
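For illustration, a minimal sketch of the "shared memory without a manager" approach (Python 3.8+ multiprocessing.shared_memory; the random integer array is a hypothetical stand-in for the fetched recordset, since a raw shared-memory block holds flat numeric buffers such as NumPy arrays rather than arbitrary Oracle row tuples):
import numpy as np
from multiprocessing import Pool, shared_memory

def init_worker(shm_name, shape, dtype):
    # Attach to the existing shared-memory block once per worker process.
    global shm, data
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)

def column_stats(col_index):
    # Each task reads one column straight out of shared memory; nothing is copied per task.
    col = data[:, col_index]
    return col_index, int(col.min()), int(col.max()), len(np.unique(col))

if __name__ == '__main__':
    rows = np.random.randint(0, 1000, size=(800_000, 9))   # stand-in for cur.fetchall()
    shm = shared_memory.SharedMemory(create=True, size=rows.nbytes)
    shared = np.ndarray(rows.shape, dtype=rows.dtype, buffer=shm.buf)
    shared[:] = rows                                        # copy the data into shared memory once
    with Pool(initializer=init_worker, initargs=(shm.name, rows.shape, rows.dtype)) as pool:
        print(pool.map(column_stats, range(rows.shape[1])))
    shm.close()
    shm.unlink()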

Making the for loop, and its use of lists and dictionaries, more efficient

Can anyone please suggest a way to make the following loop more efficient?
The program mainly populates four time frames from the main lists 'val_rt_xxx'.
Each time frame is populated according to the frame time variables fr_1 .. fr_4.
# minimum time of the list of times - 'time_rt_min'
mn_time_rt = min(time_rt_min)

# The four time frames as dictionaries
frame_t = {}
frame_t['1'] = {}
frame_t['2'] = {}
frame_t['3'] = {}
frame_t['4'] = {}

# Creating empty lists in each frame dictionary
for fr in frame_t.keys():
    frame_t[fr]['rt_long'] = []
    frame_t[fr]['rt_lat'] = []
    frame_t[fr]['rt_data'] = []
    frame_t[fr]['rt_time'] = []

# The frames are defined with respect to mn_time_rt, the minimum time
fr_1 = mn_time_rt
fr_2 = mn_time_rt + datetime.timedelta(minutes=15)
fr_3 = mn_time_rt + datetime.timedelta(minutes=30)
fr_4 = mn_time_rt + datetime.timedelta(minutes=45)

# populating the lists in dictionaries with appropriate values
# from the main lists 'val_rt_xxx' according to the time frames
for i in range(len(val_rt_time)):
    if val_rt_time[i] <= fr_1:
        frame_t['1']['rt_long'].append(val_rt_long[i])
        frame_t['1']['rt_lat'].append(val_rt_lat[i])
        frame_t['1']['rt_data'].append(val_rt_data[i])
        frame_t['1']['rt_time'].append(val_rt_time[i])
    elif val_rt_time[i] > fr_1 and val_rt_time[i] <= fr_2:
        frame_t['2']['rt_long'].append(val_rt_long[i])
        frame_t['2']['rt_lat'].append(val_rt_lat[i])
        frame_t['2']['rt_data'].append(val_rt_data[i])
        frame_t['2']['rt_time'].append(val_rt_time[i])
    elif val_rt_time[i] > fr_2 and val_rt_time[i] <= fr_3:
        frame_t['3']['rt_long'].append(val_rt_long[i])
        frame_t['3']['rt_lat'].append(val_rt_lat[i])
        frame_t['3']['rt_data'].append(val_rt_data[i])
        frame_t['3']['rt_time'].append(val_rt_time[i])
    elif val_rt_time[i] > fr_3 and val_rt_time[i] <= fr_4:
        frame_t['4']['rt_long'].append(val_rt_long[i])
        frame_t['4']['rt_lat'].append(val_rt_lat[i])
        frame_t['4']['rt_data'].append(val_rt_data[i])
        frame_t['4']['rt_time'].append(val_rt_time[i])
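A possible restructuring (just a sketch, assuming the val_rt_* lists are index-aligned, datetime is already imported, and bisect is acceptable): compute which frame each record falls into once per record, then append to that frame's lists in a single block.
import bisect
import datetime

# Frame boundaries relative to the minimum time (same as fr_1 .. fr_4 above).
mn_time_rt = min(time_rt_min)
boundaries = [mn_time_rt + datetime.timedelta(minutes=15 * k) for k in range(4)]

# One dictionary of lists per frame, built in a single comprehension.
frame_t = {str(k + 1): {'rt_long': [], 'rt_lat': [], 'rt_data': [], 'rt_time': []}
           for k in range(4)}

for lng, lat, data, t in zip(val_rt_long, val_rt_lat, val_rt_data, val_rt_time):
    # bisect_left returns the index of the first boundary >= t, i.e. the frame index.
    idx = bisect.bisect_left(boundaries, t)
    if idx >= 4:          # later than fr_4: skipped, as in the original elif chain
        continue
    frame = frame_t[str(idx + 1)]
    frame['rt_long'].append(lng)
    frame['rt_lat'].append(lat)
    frame['rt_data'].append(data)
    frame['rt_time'].append(t)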

Why is my code getting an NZEC run time error?

Question source: SPOJ, problem ORDERS
def swap(ary, idx1, idx2):
    tmp = ary[idx1]
    ary[idx1] = ary[idx2]
    ary[idx2] = tmp

def mkranks(size):
    tmp = []
    for i in range(1, size + 1):
        tmp = tmp + [i]
    return tmp

def permutations(ordered, movements):
    size = len(ordered)
    for i in range(1, size):  # The leftmost one never moves
        for j in range(0, int(movements[i])):
            swap(ordered, i-j, i-j-1)
    return ordered

numberofcases = input()
for i in range(0, numberofcases):
    sizeofcase = input()
    tmp = raw_input()
    movements = ""
    for i in range(0, len(tmp)):
        if i % 2 != 1:
            movements = movements + tmp[i]
    ordered = mkranks(sizeofcase)
    ordered = permutations(ordered, movements)
    output = ""
    for i in range(0, sizeofcase - 1):
        output = output + str(ordered[i]) + " "
    output = output + str(ordered[sizeofcase - 1])
    print output
Having made your code a bit more Pythonic (but without altering its flow/algorithm):
def swap(ary, idx1, idx2):
    ary[idx1], ary[idx2] = [ary[i] for i in (idx2, idx1)]

def permutations(ordered, movements):
    size = len(ordered)
    for i in range(1, len(ordered)):
        for j in range(movements[i]):
            swap(ordered, i-j, i-j-1)
    return ordered

numberofcases = input()
for i in range(numberofcases):
    sizeofcase = input()
    movements = [int(s) for s in raw_input().split()]
    ordered = [str(i) for i in range(1, sizeofcase+1)]
    ordered = permutations(ordered, movements)
    output = " ".join(ordered)
    print output
I see it runs correctly in the sample case given at the SPOJ URL you indicate. What is your failing case?
