Explanation of what I'm trying to accomplish:
I have a dataframe to iterate over, looking for some condition given a variable.
I have a list of variables, and I iterate over this df using multiprocessing. I pop(0) every time a process starts.
Now I need to add one more level, but I can't understand how to do it.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
import decimal
import multiprocessing
from multiprocessing import Pool, Manager
import itertools
#dataframe
columns = ['A', 'B', 'C', 'D']
data = np.array([np.random.randint(1, 10_000, size=750)]*4).T
df = pd.DataFrame(data, columns= columns)
print(df)
# Creating a list of tuples to apply a given function
a = np.arange(5,20, 1)
b = np.arange(1.01, 1.10, 0.01)
d = np.arange(0.95, 0.99, 0.01)
c = list(itertools.product(a, b, d))
list_of_tuples = []
dic = {}
for x in c:
    dic[(x)] = x
for key, value in dic.items():
    uno, due, tre = value[0], value[1], value[2]
    list_of_tuples.append((uno, due, tre))
print(len(dic)) #checking size of dictionary
print(len(list_of_tuples), len(df)) #checking if size match
maximum = max(dic, key=dic.get) #maximum key inside dictionary
print(maximum, dic[maximum])
new_dic = {}
i = 1
#look_back_period = (len(df) // 10)
#print(look_back_period)
c = 0
"""chunks is the only way where I could use pool.map, it should be a list of list"""
chunks = [list_of_tuples[i::len(list_of_tuples)] for i in range(len(list_of_tuples))]
print(len(chunks[0]))
#this manager is needed to have every process append to the same Dict the result of the
# function that is given below
manager = Manager()
new_dic = manager.dict()
def multi_prova(list_of_tuples):
    list_results = []
    given1, given2, given3 = list_of_tuples.pop(0)
    #sliding_window = df.iloc[0 : c + look_back_period, : ]
    for row in df.itertuples():
        result = (given1 / row.A).round(2)
        list_results.append(result)
        new_dic[str(given1)+', ' + str(given2)+', ' + str(given2)] = result
time1 = time.time()
if __name__ == "__main__":
    try:
        pool = Pool() # Make the Pool of workers
        results = pool.map(multi_prova, chunks) #Open the urls in their own threads
        pool.close() #close the pool and wait for the work to finish
        pool.join()
    except:
        print('error')
time2 = time.time()
print(time2 - time1)
#In my original code len(new_dic) matched len(dic); here it is 750 vs 150, and I don't know why
print(new_dic)
print(len(new_dic))
Shouldn't len(new_dic) == len(dic)?
There are 750 rows, and a result for every row to append to the dictionary.
So there are two problems:
Why is len(new_dic) not 750?
And on top of that, I would like a sliding window that iterates over a slice of the dataframe and builds a dictionary of lists of lists with the results of every slice of the df, while c + look_back_period < len(df).
Hope I was clear enough.
A big hug to anyone who can contribute.
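For reference, here is a minimal sketch (not the original code; the function name evaluate is illustrative, and df and list_of_tuples are assumed to be the objects defined above) of how pool.map can return one result per parameter tuple directly, so every call contributes exactly one entry without a Manager dict or manual chunking:
def evaluate(params):
    given1, given2, given3 = params
    # df is inherited from the module level by the worker processes
    results = [round(given1 / row.A, 2) for row in df.itertuples()]
    return (given1, given2, given3), results

if __name__ == "__main__":
    with Pool() as pool:
        collected = dict(pool.map(evaluate, list_of_tuples))
    print(len(collected))  # one entry per parameter tuple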
I simply want to distribute N items into n cells; both N and n can be large, so I would rather not loop over random draws as here:
import numpy as np
nitems = 100
ncells = 3
cells = np.zeros(ncells, dtype=int)
for _ in range(nitems):
    dest = np.random.randint(ncells)
    cells[dest] += 1
print(cells)
In this case, the output is:
[31 34 35]
(the sum is always N)
Is there any faster way?
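For comparison, the loop above can also be vectorized directly with numpy (a sketch of the same per-item assignment, just without the Python-level loop; it keeps the same distribution as the original loop):
import numpy as np

nitems = 100
ncells = 3
# draw a destination cell for every item at once, then count occurrences per cell
dests = np.random.randint(ncells, size=nitems)
cells = np.bincount(dests, minlength=ncells)
print(cells, cells.sum())  # the counts always sum to nitems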
An answer to the question follows (I have to thank #pjs here for his help). I think it is the fastest, and probably the shortest and most space-efficient, one possible:
from numpy import *
from time import sleep
g_nitems = 10000
g_ncells = 10
g_nsamples = 10000
def genDist(nitems, ncells):
    r = sort(random.randint(0, nitems+1, ncells-1))
    return concatenate((r,[nitems])) - concatenate(([0],r))
# Some stats
test = zeros(g_ncells, dtype=int)
Max = zeros(g_ncells, dtype=int)
for _ in range(g_nsamples):
    tmp = genDist(g_nitems, g_ncells)
    print(tmp.sum(), tmp, end='\r')
    # print(_, end='\r')
    # sleep(0.5)
    test += tmp
    for i in range(g_ncells):
        if tmp[i] > Max[i]:
            Max[i] = tmp[i]
print("\n", Max)
print(test//g_nsamples)
On my machine, your code with a timeit took 151 microseconds. The following took 11 microseconds:
import numpy as np
nitems = 100
ncells = 3
values = np.random.randint(0,ncells,nitems)
cells = np.array_split(values,3)
lengths= [ len(cell) for cell in cells ]
print(lengths,np.sum(lengths))
The result of the print is [34, 33, 33] 100.
The magic here is using numpy to do the splitting, but note that this method will split as close to uniform as possible.
If you want the partitioning done randomly:
import numpy as np
nitems = 100
ncells = 3
values = np.random.randint(0,ncells,nitems)
ind_split = [ np.random.randint(0,nitems) ]
ind_split.append(np.random.randint(ind_split[-1],nitems))
cells = np.array_split(values,ind_split)
lengths= [ len(cell) for cell in cells ]
print(lengths,np.sum(lengths))
This takes advantage of numpy.array_split taking indices of where to perform the split as an argument (rather than the number of near-uniform partitions).
You haven't specified that the counts have to have any particular distribution as long as they add up to N, so the following will work as requested:
import numpy as np
nitems = 100
ncells = 3
range_array = [np.random.randint(nitems + 1) for _ in range(ncells - 1)] + [0, nitems]
range_array.sort()
cells = [range_array[i + 1] - range_array[i] for i in range(ncells)]
print(cells)
It generates an ordered set of random values between 0 and nitems, then takes successive differences to generate the desired number of cell counts.
The complexity is O(ncells) rather than O(nitems), so it should be more efficient when there are substantially more items than cells.
I am trying to build an algorithm which first builds the power set of around 100 symbols, excluding the null set and repeated elements.
Then, for each item in the power set, it reads the data files and evaluates the Sharpe ratio (return/risk).
The results are appended to a list, and at the end the program reports the combination of symbols with the highest Sharpe ratio.
Following is the code:
import pandas as pd
import numpy as np
import math
from itertools import chain, combinations
import operator
import time as t
#ASSUMPTION
#EQUAL ALLOCATION OF RESOURCES
t0 = t.time()
start_date = '2016-06-01'
end_date = '2017-08-18'
allocation = 170000
usesymbols=['PAEL','TPL','SING','DCL','POWER','FCCL','DGKC','LUCK',
'THCCL','PIOC','GWLC','CHCC','MLCF','FLYNG','EPCL',
'LOTCHEM','SPL','DOL','NRSL','AGL','GGL','ICL','AKZO','ICI',
'WAHN','BAPL','FFC','EFERT','FFBL','ENGRO','AHCL','FATIMA',
'EFOODS','QUICE','ASC','TREET','ZIL','FFL','CLOV',
'BGL','STCL','GGGL','TGL','GHGL','OGDC','POL','PPL','MARI',
'SSGC','SNGP','HTL','PSO','SHEL','APL','HASCOL','RPL','MERIT',
'GLAXO','SEARL','FEROZ','HINOON','ABOT','KEL','JPGL','EPQL',
'HUBC','PKGP','NCPL','LPL','KAPCO','TSPL','ATRL','BYCO','NRL','PRL',
'DWSM','SML','MZSM','IMSL','SKRS','HWQS','DSFL','TRG','PTC','TELE',
'WTL','MDTL','AVN','NETSOL','SYS','HUMNL','PAKD',
'ANL','CRTM','NML','NCL','GATM','CLCPS','GFIL','CHBL',
'DFSM','KOSM','AMTEX','HIRAT','NCML','CTM','HMIM',
'CWSM','RAVT','PIBTL','PICT','PNSC','ASL',
'DSL','ISL','CSAP','MUGHAL','DKL','ASTL','INIL']
cost_matrix = []
def data(symbols):
    dates=pd.date_range(start_date,end_date)
    df=pd.DataFrame(index=dates)
    for symbol in symbols:
        df_temp=pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)),usecols=['Date','Close'],
                            parse_dates=True,index_col='Date',na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df=df.join(df_temp)
    df=df.fillna(method='ffill')
    df=df.fillna(method='bfill')
    return df
def mat_alloc_auto(symbols):
    n = len(symbols)
    mat_alloc = np.zeros((n,n), dtype='float')
    for i in range(0,n):
        mat_alloc[i,i] = allocation / n
    return mat_alloc
def compute_daily_returns(df):
    """Compute and return the daily return values."""
    daily_returns=(df/df.shift(1))-1
    df=df.fillna(value=0)
    daily_returns=daily_returns[1:]
    daily_returns = np.array(daily_returns)
    return daily_returns
def port_eval(matrix_alloc,daily_return_matrix):
    risk_free = 0
    amount_matrix = [allocation]
    return_mat = np.dot(daily_return_matrix,matrix_alloc)
    return_mat = np.sum(return_mat, axis=1, keepdims=True)
    return_mat = np.divide(return_mat,amount_matrix)
    mat_average = np.mean(return_mat)
    mat_std = np.std(return_mat, ddof=1)
    sharpe_ratio = ((mat_average-risk_free)/mat_std) * math.sqrt(252)
    return return_mat, sharpe_ratio, mat_average
def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))
power_set = list(powerset(usesymbols))
len_power = len(power_set)
sharpe = []
for j in range(0, len_power):
    df_01 = data(power_set[j])
    matrix_allocation = mat_alloc_auto(power_set[j])
    daily_return_mat = compute_daily_returns(df_01)
    return_matrix, sharpe_ratio_val, matrix_average = port_eval(matrix_allocation, daily_return_mat)
    sharpe.append(sharpe_ratio_val)
max_index, max_value = max(enumerate(sharpe), key=operator.itemgetter(1))
print('Maximum sharpe ratio occurs from ',power_set[max_index], ' value = ', max_value)
t1=t.time()
print('exec time is ', t1-t0, 'seconds')
The above code gets killed with a SIGKILL (signal 9) error.
After some research I understood that this is because the process allocates too much memory, putting pressure on the OS.
So I tried running the same code on an HP Z600 workstation, but it takes a lot of time and the machine freezes.
My question is: how can I make my code efficient enough to get results quickly?
Running on Python 3.3
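For context, an illustrative calculation (not part of the original post): list(powerset(usesymbols)) with roughly 100 symbols can never fit in memory, because the number of non-empty subsets is 2**100 - 1. Iterating the generator lazily at least avoids materialising the list up front, although enumerating every subset remains infeasible:
from itertools import chain, combinations, islice

def powerset(iterable):
    # same helper as in the code above
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))

symbols = list(range(100))       # stand-in for the ~100 tickers above
print(2 ** len(symbols) - 1)     # 1267650600228229401496703205375 non-empty subsets
for subset in islice(powerset(symbols), 5):
    print(subset)                # consume lazily instead of building a list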
I am attempting to create an efficient algorithm to pull all of the similar elements between two lists. The problem is twofold. First, I cannot seem to find any such algorithm online. Second, there should be a more efficient way.
By 'similar elements', I mean two elements that are equal in value (be it string, int, whatever).
Currently, I am taking a greedy approach:
sorting the lists that are being compared,
comparing each element in the shorter list to each element in the larger list,
saving the last index visited (since largeList and smallList are sorted), and
continuing from that previous index (largeIndex).
The run time seems to average O(n log(n)); this can be seen by running the test cases listed after the block of code below.
Right now, my code looks like this:
def compare(small,large,largeStart,largeEnd):
    for i in range(largeStart, largeEnd):
        if small==large[i]:
            return [1,i]
        if small<large[i]:
            if i!=0:
                return [0,i-1]
            else:
                return [0, i]
    return [0,largeStart]
def determineLongerList(aList, bList):
    if len(aList)>len(bList):
        return (aList, bList)
    elif len(aList)<len(bList):
        return (bList, aList)
    else:
        return (aList, bList)
def compareElementsInLists(aList, bList):
    import time
    startTime = time.time()
    holder = determineLongerList(aList, bList)
    sameItems = []
    iterations = 0
    ##########################################
    smallList = sorted(holder[1])
    smallLength = len(smallList)
    smallIndex = 0
    largeList = sorted(holder[0])
    largeLength = len(largeList)
    largeIndex = 0
    while (smallIndex<smallLength):
        boolean = compare(smallList[smallIndex],largeList,largeIndex,largeLength)
        if boolean[0]==1:
            #`compare` returns 1 as True
            sameItems.append(smallList[smallIndex])
            oldIndex = largeIndex
            largeIndex = boolean[1]
        else:
            #else no match and possible new index
            oldIndex = largeIndex
            largeIndex = boolean[1]
        smallIndex+=1
        iterations = largeIndex-oldIndex+iterations+1
    print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
    print('RATIO:\t\t'+str(iterations/(smallLength*largeLength))+'\n')
    return sameItems
And here are some test cases:
def testLargest():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0,1000000):
        ran = randint(0,1000000)
        lis.append(ran)
    lis2 = []
    for i in range(0,1000000):
        ran = randint(0,1000000)
        lis2.append(ran)
    timeTaken = time.time()-start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time()-start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')
#testLargest()
'''
One rendition of testLargest:
******************************************
CREATING LISTS TOOK: 21.009342908859253
******************************************
RAN 999998 OUT OF 1000000000000 POSSIBLE
RATIO: 9.99998e-07
COMPARING LISTS TOOK: 13.99990701675415
NUMBER OF SAME ITEMS: 632328
******************************************
'''
def testLarge():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0,1000000):
        ran = randint(0,100)
        lis.append(ran)
    lis2 = []
    for i in range(0,1000000):
        ran = randint(0,100)
        lis2.append(ran)
    timeTaken = time.time()-start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time()-start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')
testLarge()
If you are just searching for all elements which are in both lists, you should use data types meant to handle such tasks. In this case, sets or bags would be appropriate. These are internally represented by hashing mechanisms which are even more efficient than searching in sorted lists.
(collections.Counter represents a suitable bag.)
If you do not care about duplicate elements, then sets are fine.
a = set(listA)
print(a.intersection(listB))
This will print all elements that are in both listA and listB (without duplicate output for duplicate input elements).
import collections
a = collections.Counter(listA)
b = collections.Counter(listB)
print(a & b)
This will print each common element together with how often it occurs in both lists (the minimum of the two counts).
I didn't do any measuring, but I'm pretty sure these solutions are way faster than your hand-rolled attempt.
To convert a counter into a list of all represented elements again, you can use list(c.elements()).
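Putting those two pieces together, a minimal sketch (reusing listA and listB from the question) that yields the common elements, with multiplicity, as a plain list:
import collections

# multiset intersection, then expand the counts back into individual elements
common = collections.Counter(listA) & collections.Counter(listB)
print(list(common.elements()))  # each shared value repeated min(count in A, count in B) times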
Using IPython's timeit magic, your approach doesn't compare favourably with a plain set() intersection.
Setup:
import random
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]
Compare Elements:
%%timeit -n 1000
compareElementsInLists(alist, blist)
1000 loops, best of 3: 1.9 ms per loop
Vs Set Intersection
%%timeit -n 1000
set(alist) & set(blist)
1000 loops, best of 3: 104 µs per loop
Just to make sure we get the same results:
>>> compareElementsInLists(alist, blist)
[8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791]
>>> set(alist) & set(blist)
{8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791}
I'm trying to solve a problem where I store the locations and counts of substrings of a given length. As the strings can be long (genome sequences), I'm trying to use multiple processes to speed things up. While the program runs, the variables that store the results seem to lose all their information once the worker processes end.
import numpy
import multiprocessing
from multiprocessing.managers import BaseManager, DictProxy
from collections import defaultdict, namedtuple, Counter
from functools import partial
import ctypes as c
class MyManager(BaseManager):
    pass
MyManager.register('defaultdict', defaultdict, DictProxy)
def gc_count(seq):
    return int(100 * ((seq.upper().count('G') + seq.upper().count('C') + 0.0) / len(seq)))
def getreads(length, table, counts, genome):
    genome_len = len(genome)
    for start in range(0,genome_len):
        gc = gc_count(genome[start:start+length])
        table[ (length, gc) ].append( (start) )
        counts[length,gc] +=1
if __name__ == "__main__":
    g = 'ACTACGACTACGACTACGCATCAGCACATACGCATACGCATCAACGACTACGCATACGACCATCAGATCACGACATCAGCATCAGCATCACAGCATCAGCATCAGCACTACAGCATCAGCATCAGCATCAG'
    genome_len = len(g)
    mgr = MyManager()
    mgr.start()
    m = mgr.defaultdict(list)
    mp_arr = multiprocessing.Array(c.c_double, 10*101)
    arr = numpy.frombuffer(mp_arr.get_obj())
    count = arr.reshape(10,101)
    pool = multiprocessing.Pool(9)
    partial_getreads = partial(getreads, table=m, counts=count, genome=g)
    pool.map(partial_getreads, range(1, 10))
    pool.close()
    pool.join()
    for i in range(1, 10):
        for j in range(0,101):
            print count[i,j]
    for i in range(1, 10):
        for j in range(0,101):
            print len(m[(i,j)])
The loops at the end will only print out 0.0 for each element in count and 0 for each list in m, so somehow I'm losing all the counts. If I print the counts within the getreads(...) function, I can see that the values are being increased. Conversely, printing len(table[ (length, gc) ]) in getreads(...) or len(m[(i,j)]) in the main body only results in 0.
You could also formulate your problem as a map-reduce problem, which would avoid sharing the data among multiple processes (I guess it would also speed up the computation). You would just need to return the resulting table and counts from the function (the map step) and combine the results from all the processes (the reduce step).
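A rough sketch of that map-reduce shape (illustrative only; it reuses gc_count and g from your question, and every worker builds purely local results that the parent then merges):
import multiprocessing
from collections import defaultdict
from functools import partial

import numpy

def getreads_local(length, genome):
    # "map" step: build local results and return them instead of mutating shared state
    table = defaultdict(list)
    counts = numpy.zeros((10, 101))
    for start in range(len(genome)):
        gc = gc_count(genome[start:start + length])
        table[(length, gc)].append(start)
        counts[length, gc] += 1
    return table, counts

if __name__ == "__main__":
    pool = multiprocessing.Pool(9)
    results = pool.map(partial(getreads_local, genome=g), range(1, 10))
    pool.close()
    pool.join()
    # "reduce" step: merge the per-process tables and count arrays
    merged_table = defaultdict(list)
    merged_counts = numpy.zeros((10, 101))
    for table, counts in results:
        for key, positions in table.items():
            merged_table[key].extend(positions)
        merged_counts += counts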
Going back to your original question...
At the bottom of the Managers documentation there is a relevant note about
modifications to mutable values or items in dict and list proxies: basically, you
need to re-assign the modified object to the container proxy.
l = table[ (length, gc) ]
l.append( (start) )
table[ (length, gc) ] = l
There is also a relevant Stack Overflow post about combining Pool.map with Array.
Taking both into account you can do something like:
def getreads(length, table, genome):
    genome_len = len(genome)
    arr = numpy.frombuffer(mp_arr.get_obj())
    counts = arr.reshape(10,101)
    for start in range(0,genome_len):
        gc = gc_count(genome[start:start+length])
        l = table[ (length, gc) ]
        l.append( (start) )
        table[ (length, gc) ] = l
        counts[length,gc] +=1
if __name__ == "__main__":
    g = 'ACTACGACTACGACTACGCATCAGCACATACGCATACGCATCAACGACTACGCATACGACCATCAGATCACGACATCAGCATCAGCATCACAGCATCAGCATCAGCACTACAGCATCAGCATCAGCATCAG'
    genome_len = len(g)
    mgr = MyManager()
    mgr.start()
    m = mgr.defaultdict(list)
    mp_arr = multiprocessing.Array(c.c_double, 10*101)
    arr = numpy.frombuffer(mp_arr.get_obj())
    count = arr.reshape(10,101)
    pool = multiprocessing.Pool(9)
    partial_getreads = partial(getreads, table=m, genome=g)
    pool.map(partial_getreads, range(1, 10))
    pool.close()
    pool.join()
    arr = numpy.frombuffer(mp_arr.get_obj())
    count = arr.reshape(10,101)