Optimizing loop for millions of entry selections

Optimizing loop for millions of entry selections - python

I have a python anonymisation mechanism that rely on generating fake data from existing attributes.
Those attributes are accessible in the domain D which is an array of 16 sets, each set representing values possible for each attributes.
the attributes are ['uid', 'trans_id', 'trans_date', 'trans_type', 'operation', 'amount', 'balance', 'k_symbol', 'bank', 'acct_district_id', 'frequency', 'acct_date', 'disp_type', 'cli_district_id', 'gender', 'zip']
Some attributes have very few values (gener is M or F), some are unique (uid) and can have 1260000 different values.
The fake data is generated as tuples of randomly selected attributes inside the domain.
I have to generate nearly 2 million tuples.
The first implementation of this was:
def beta_step(I, V, beta, n, m, D):
r = approx_binomial(m - n, beta)
print("r = " + str(r))
i = 0
while i < r:
t = []
for attribute in D:
a_j = choice(list(attribute))
t.append(a_j)
if t not in V + I:
V.append(t)
i += 1
This took around 0,5s for each tuple.
Note that I and V are existing lists (with initialy respectively 1200000 and 800000 tuples)
I already found out that I could speed-up things by converting D to a 2D array once and for all, in order not to convert sets in list on each run
for attribute in D:
a_j = choice(attribute)
t.append(a_j)
This gets me down to 0.2s by tuple.
I also tried looping fewer times and generating multiple tuples at a time like so:
def beta_step(I, V, beta, n, m, D):
D = [list(attr) for attr in D ] #Convert D in 2D list
r = approx_binomial(m - n, beta)
print("r = " + str(r))
i = 0
NT = 1000 #Number of tuples generated at a time
while i < r:
T = [[] for j in range(NT)]
for attribute in D:
a_j = choices(attribute,k=min(NT,r-i))
for j in range(len(a_j)):
T[j].append(a_j[j])
for t in T:
if t not in V + I:
V.append(t)
i += 1
But this takes around 220s for 1000 tuples so it is not faster than before.
I have timed the different parts and it seems that it is the last for loop that takes most of the time (Around 217s).
Is there any way I could speed things up in order not to run it for 50 hours?
=======================
EDIT : I implemented #Larri suggestion like that :
def beta_step(I, V, beta, n, m, D):
D = [list(attr) for attr in D ] #Convert D in list of lists
I = set(tuple(t) for t in I)
V = set(tuple(t) for t in V)
r = approx_binomial(m - n, beta)
print("r = " + str(r))
i = 0
print('SIZE I', len(I))
print('SIZE V', len(V))
NT = 1000 #Number of tuples to generate at each pass
while i < r:
T = [[] for j in range(min(NT,r-i))]
for attribute in D:
a_j = choices(attribute,k=min(NT,r-i))
for j in range(len(a_j)):
T[j].append(a_j[j])
new_T = set(tuple(t) for t in T) - I
size_V_before = len(V)
V.update(new_T)
size_V_after = len(V)
delta_V = size_V_after-size_V_before
i += delta_V
return [list(t) for t in V]
it now takes about 0s to add elements to V
In total, adding the 1680000 tuples took 91s
However, converting back to a 2d array takes 200s, is there a way to make it faster that doesn't involve rewritting the whole program to work on sets ?

For the last for loop at least, consider converting to sets instead of using arrays. That allows you to use set.update() method without having to check if t is already included in V. This is assuming that you can incorporate the A in the logic somehow. From the given code I can't see any reference to A.
So you can change it to something like V.update(T). The i would then be the delta of len(V) before and after the operation.

Related

Creating and updating 3 index matrix in Python

I'm trying to create a three index matrix that contains 1 value (V) for every node of a numerical spatial mesh (xyz) (real world problem: the electrostatic potential created by finite-paralell- plates in a point of space). This matrix initially has to be filled with zeros except for some specific points (where the plates and space limits are) and then iteratively update the value at each node according to the following 7-point stencil method (j k and l indices of the x y z coordinates respectively):
V[j,k,l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] +V[j, k, l-1])/6
(i. e., replace the value of a node with the average of the other 6 neighbouring nodes)
I've tried np.zeros and np.meshgrid but I think maybe I just simply have a serious conceptual and basic gap regarding arrays since nothing seems to do what I want. Any orientation would be really appreciated and sorry if I did not explain myself correctly. Here some code I've tried:
V1 = 10
V2 = -5
Mx = 101
My = 151
Mz = 301
V = np.zeros([Mx, My, Mz]).astype(int)
V[46, 51:101, 101:201] = V1 #the values of these nodes should stay fixed throughout iteration
V[56, 51:101, 101:201] = V2 #the values of these nodes should stay fixed throughout iteration
V[1,:,:] =V[100,:,:] =V[:,1,:] =V[:,150,:] =V[:,:,1] =V[:,:,300] = 0 #the values of these nodes should stay fixed throughout iteration
for j in V:
for k in j:
for l in k:
V[j, k, l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] +V[j, k, l-1])/6
(Update after help from user kcw78)
Implementing the proposed code and trying to implement a while loop that keeps going until error falls below tolerance or the error in two consecutives cycles is the same. The statement of the assignment says more specifically:
"As many of these cycles will be completed as needed for the error to fall below a certain prescribed tolerance, rtol. And what is a good measure of the error here? We will use the maximum value of the local residual, defined as the (absolute value of the) difference between the potential value at the central node and the arithmetic average of the other values in the stencil. As a extra safeguard, we will also compare the errors of any two successive cycles and stop the relaxation if they become equal. A better solution is no longer possible."
Now trying the code below, but not sure if it's trapped in an infinite while loop or just takes a lot of time since I have to stop it after 20 minutes without producing any output (also not sure if maybe I should use .all() instead of .any()):
import numpy as np
V1 = 10
V2 = -5
Mx = 101
My = 151
Mz = 301
rtol = 10**-2
V1_set = { (46,k,l) for k in range(51,101,1) for l in range(101,201,1) }
V2_set = { (56,k,l) for k in range(51,101,1) for l in range(101,201,1) }
V = np.zeros((Mx, My, Mz))
Vnew = np.copy(V)
V[46, 51:101, 101:201] = V1
V[56, 51:101, 101:201] = V2
V[1,:,:] =V[100,:,:] =V[:,1,:] =V[:,150,:] =V[:,:,1] =V[:,:,300] = 0
check_set = set().union(V1_set,V2_set)
error = np.zeros((Mx, My, Mz))
errornew = np.zeros((Mx, My, Mz))
while float(errornew.any()) < rtol or error.any() != errornew.any():
V = Vnew
error = errornew
for j in range(1,V.shape[0]-1):
for k in range(1,V.shape[1]-1):
for l in range(1,V.shape[2]-1):
if (j,k,l) not in check_set:
Vnew[j, k, l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] +V[j, k, l-1])/6
errornew[j, k, l] = abs(Vnew[j, k, l]-V[j, k, l])

If I understand your question, you will need 2 changes:
First you need additional variables to check the positions that are fixed thru the iteration. I added sets with (j,k,l) tuples to do this. So you can follow my logic, I initially created 3 sets; 1 each for these indices: 1) fixed V1 (V1_set), 2) fixed V2 (V2_set) and 3) boundary (zero_set), then union all 3 sets into a single set (called check_set). You could start with a single set and update as you add. Side note: your code has V[1,:,:] = 0, but I think you really want V[0,:,:] = 0. Let me know if I interpreted that incorrectly.
Second, you need to loop on the axis length in each direction(attributes are V.shape[0], V.shape[1], V.shape[2]). Inside the loop I check each (i,j,k) against check_set, and only calculate anew V1[j, k, l] value if it is NOT in the set.
See code below:
V1 = 10
V2 = -5
Mx = 101
My = 151
Mz = 301
V1_set = { (46,k,l) for k in range(51,101,1) for l in range(101,201,1) }
V2_set = { (56,k,l) for k in range(51,101,1) for l in range(101,201,1) }
zero_set = set()
zero_set.update( { (0,k,l) for k in range(My) for l in range(Mz) } )
zero_set.update( { (100,k,l) for k in range(My) for l in range(Mz) } )
zero_set.update( { (j,0,l) for j in range(Mx) for l in range(Mz) } )
zero_set.update( { (j,150,l) for j in range(Mx) for l in range(Mz) } )
zero_set.update( { (j,k,0) for j in range(Mx) for k in range(My) } )
zero_set.update( { (j,k,300) for j in range(Mx) for k in range(My) } )
check_set = set().union(V1_set,V2_set,zero_set)
V = np.zeros((Mx, My, Mz)).astype(int)
V[46, 51:101, 101:201] = V1 #the values of these nodes should stay fixed throughout iteration
V[56, 51:101, 101:201] = V2 #the values of these nodes should stay fixed throughout iteration
V[1,:,:] =V[100,:,:] =V[:,1,:] =V[:,150,:] =V[:,:,1] =V[:,:,300] = 0 #the values of these nodes should stay fixed throughout iteration
for j in range(V.shape[0]):
for k in range(V.shape[1]):
for l in range(V.shape[2]):
if (j,k,l) not in check_set:
V[j, k, l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] +V[j, k, l-1])/6
After posting the solution above, it occurred to me that the ranges used in zero_set are intended to avoid the first/last (array boundary) indices. If so, there is no need for zero_set. You can handle this by modifying the range arguments as shown below:
check_set = set().union(V1_set,V2_set)
for j in range(1,V.shape[0]-1):
for k in range(1,V.shape[1]-1):
for l in range(1,V.shape[2]-1):
if (j,k,l) not in check_set:
V[j, k, l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] +V[j, k, l-1])/6
Additional observations to consider:
I noticed you created array V with .astype(int). Are you sure
that's what you want (and not floats)? In general, your calculations will not return integer values.
The way your code is written, you are changing the values of
V[j,k,l] as you go. So, you are using updated values of V[j,k,l]
for j,k,l less than the current j,k,l, and previous V[j,k,l] values for j,k,l greater than the current j,k,l.
Finally, I assume you are going to iterate thru this calculation until the change between 2 cycles is "acceptably small". If so, you need to have 2 copies of the array ("old" and "new") to take the difference. Take care to use .copy() when copying to create a new/different np.array object.

This is an updated answer based on new information and code added to initial post. You have at least 1 problem with your logic. The if (j,k,l) not in check_set: block skips over (j,k,l) values that you want to hold constant. As a result, you don't calculate Vnew at these points. That will cause problems calculation the change with each iteration (and will give the wrong result).
Also, I think you need V = Vnew.copy(). Otherwise, V and Vnew reference the same object.
Here is my simple approach to iterate with a hardcoded error tolerance.
check_set = set().union(V1_set,V2_set)
Vi = V.copy()
Vn = np.zeros((Mx, My, Mz))
diff = max(abs(V1), abs(V2))
i = 1
print('Start Cycle#',i,'; diff =',diff)
while diff > 0.25:
for j in range(1,V.shape[0]-1):
for k in range(1,V.shape[1]-1):
for l in range(1,V.shape[2]-1):
if (j,k,l) in check_set:
Vn[j, k, l] = Vi[j, k, l]
else:
Vn[j, k, l] = (Vi[j+1, k, l] + Vi[j-1, k, l] + Vi[j, k+1, l] + Vi[j, k-1, l] + Vi[j, k, l+1] +Vi[j, k, l-1])/6
diff = max(abs(np.amax(Vn-Vi)), abs(np.amin(Vn-Vi)))
print('Cycle#',i,'completed; diff =',diff)
i += 1
Vi = Vn.copy()
This implementation will "converge" in 10 iterations. However, this only checks the error between two successive cycles is less than a hard coded tolerance (similar to the second part of the desired error check).
I did NOT implement the first error check: "use the maximum value of the local residual, defined as the (absolute value of the) difference between the potential value at the central node and the arithmetic average of the other values in the stencil." I am not 100 % sure of the intent. Is the stencil the 6 points around [j,k,l]? If so, I think you need a similar calculation AFTER you calculate the new Vn values, something like this:
error[j, k, l] = abs(Vn[j, k, l] - (Vn[j+1, k, l] + Vn[j-1, k, l] + Vn[j, k+1, l] + Vn[j, k-1, l] + Vn[j, k, l+1] +Vn[j, k, l-1])/6 )

Efficiency of finding mismatched patterns

I'm working on a simple bioinformatics problem. I have a working solution, but it is absurdly inefficient. How can I increase my efficiency?
Problem:
Find patterns of length k in the string g, given that the k-mer can have up to d mismatches.
And these strings and patterns are all genomic--so our set of possible characters is {A, T, C, G}.
I'll call the function FrequentWordsMismatch(g, k, d).
So, here are a few helpful examples:
FrequentWordsMismatch('AAAAAAAAAA', 2, 1) → ['AA', 'CA', 'GA', 'TA', 'AC', 'AG', 'AT']
Here's a much longer example, if you implement this and want to test:
FrequentWordsMisMatch('CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC', 10, 2) → ['GCACACAGAC', 'GCGCACACAC']
With my naive solution, that second example could easily take ~60 seconds, though the first one is pretty quick.
Naive solution:
My idea was to, for every k-length segment in g, find every possible "neighbor" (e.g. other k-length segments with up to d mismatches) and add those neighbors as keys to a dictionary. I then count how many times each one of those neighbor kmers show up in the string g, and record those in the dictionary.
Obviously that's a kinda shitty way to do that, since the amount of neighbors scales like crazy as k and d increase, and having to scan through the strings with each of those neighbors makes this implementation terribly slow. But alas, that's why I'm asking for help.
I'll put my code below. There're definitely a lot of novice mistakes to unpack, so thanks for your time and attention.
def FrequentWordsMismatch(g, k, d):
'''
Finds the most frequent k-mer patterns in the string g, given that those
patterns can mismatch amongst themselves up to d times
g (String): Collection of {A, T, C, G} characters
k (int): Length of desired pattern
d (int): Number of allowed mismatches
'''
counts = {}
answer = []
for i in range(len(g) - k + 1):
kmer = g[i:i+k]
for neighborkmer in Neighbors(kmer, d):
counts[neighborkmer] = Count(neighborkmer, g, d)
maxVal = max(counts.values())
for key in counts.keys():
if counts[key] == maxVal:
answer.append(key)
return(answer)
def Neighbors(pattern, d):
'''
Find all strings with at most d mismatches to the given pattern
pattern (String): Original pattern of characters
d (int): Number of allowed mismatches
'''
if d == 0:
return [pattern]
if len(pattern) == 1:
return ['A', 'C', 'G', 'T']
answer = []
suffixNeighbors = Neighbors(pattern[1:], d)
for text in suffixNeighbors:
if HammingDistance(pattern[1:], text) < d:
for n in ['A', 'C', 'G', 'T']:
answer.append(n + text)
else:
answer.append(pattern[0] + text)
return(answer)
def HammingDistance(p, q):
'''
Find the hamming distance between two strings
p (String): String to be compared to q
q (String): String to be compared to p
'''
ham = 0 + abs(len(p)-len(q))
for i in range(min(len(p), len(q))):
if p[i] != q[i]:
ham += 1
return(ham)
def Count(pattern, g, d):
'''
Count the number of times that the pattern occurs in the string g,
allowing for up to d mismatches
pattern (String): Pattern of characters
g (String): String in which we're looking for pattern
d (int): Number of allowed mismatches
'''
return len(MatchWithMismatch(pattern, g, d))
def MatchWithMismatch(pattern, g, d):
'''
Find the indicies at which the pattern occurs in the string g,
allowing for up to d mismatches
pattern (String): Pattern of characters
g (String): String in which we're looking for pattern
d (int): Number of allowed mismatches
'''
answer = []
for i in range(len(g) - len(pattern) + 1):
if(HammingDistance(g[i:i+len(pattern)], pattern) <= d):
answer.append(i)
return(answer)
More tests
FrequentWordsMismatch('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1) → ['ATGC', 'ATGT', 'GATG']
FrequentWordsMismatch('AGTCAGTC', 4, 2) → ['TCTC', 'CGGC', 'AAGC', 'TGTG', 'GGCC', 'AGGT', 'ATCC', 'ACTG', 'ACAC', 'AGAG', 'ATTA', 'TGAC', 'AATT', 'CGTT', 'GTTC', 'GGTA', 'AGCA', 'CATC']
FrequentWordsMismatch('AATTAATTGGTAGGTAGGTA', 4, 0) → ["GGTA"]
FrequentWordsMismatch('ATA', 3, 1) → ['GTA', 'ACA', 'AAA', 'ATC', 'ATA', 'AGA', 'ATT', 'CTA', 'TTA', 'ATG']
FrequentWordsMismatch('AAT', 3, 0) → ['AAT']
FrequentWordsMismatch('TAGCG', 2, 1) → ['GG', 'TG']

The problem description is ambiguous in several ways, so I'm going by the examples. You seem to want all k-length strings from the alphabet (A, C, G, T} such that the number of matches to contiguous substrings of g is maximal - where "a match" means character-by-character equality with at most d character inequalities.
I'm ignoring that your HammingDistance() function makes something up even when inputs have different lengths, mostly because it doesn't make much sense to me ;-) , but partly because that isn't needed to get the results you want in any of the examples you gave.
The code below produces the results you want in all the examples, in the sense of producing permutations of the output lists you gave. If you want canonical outputs, I'd suggest sorting an output list before returning it.
The algorithm is pretty simple, but relies on itertools to do the heavy combinatorial lifting "at C speed". All the examples run in well under a second total.
For each length-k contiguous substring of g, consider all combinations(k, d) sets of d distinct index positions. There are 4**d ways to fill those index positions with letters from {A, C, G, T}, and each such way is "a pattern" that matches the substring with at most d discrepancies. Duplicates are weeded out by remembering the patterns already generated; this is faster than making heroic efforts to generate only unique patterns to begin with.
So, in all, the time requirement is O(len(g) * k**d * 4**d) = O(len(g) * (4*k)**d, where k**d is, for reasonably small values of k and d, an overstated standin for the binomial coefficent combinations(k, d). The important thing to note is that - unsurprisingly - it's exponential in d.
def fwm(g, k, d):
from itertools import product, combinations
from collections import defaultdict
all_subs = list(product("ACGT", repeat=d))
all_ixs = list(combinations(range(k), d))
patcount = defaultdict(int)
for starti in range(len(g)):
base = g[starti : starti + k]
if len(base) < k:
break
patcount[base] += 1
seen = set([base])
basea = list(base)
for ixs in all_ixs:
saved = [basea[i] for i in ixs]
for newchars in all_subs:
for i, newchar in zip(ixs, newchars):
basea[i] = newchar
candidate = "".join(basea)
if candidate not in seen:
seen.add(candidate)
patcount[candidate] += 1
for i, ch in zip(ixs, saved):
basea[i] = ch
maxcount = max(patcount.values())
return [p for p, c in patcount.items() if c == maxcount]
EDIT: Generating Patterns Uniquely
Rather than weed out duplicates by keeping a set of those seen so far, it's straightforward enough to prevent generating duplicates to begin with. In fact, the following code is shorter and simpler, although somewhat subtler. In return for less redundant work, there are layers of recursive calls to the inner() function. Which way is faster appears to depend on the specific inputs.
def fwm(g, k, d):
from collections import defaultdict
patcount = defaultdict(int)
alphabet = "ACGT"
allbut = {ch: tuple(c for c in alphabet if c != ch)
for ch in alphabet}
def inner(i, rd):
if not rd or i == k:
patcount["".join(base)] += 1
return
inner(i+1, rd)
orig = base[i]
for base[i] in allbut[orig]:
inner(i+1, rd-1)
base[i] = orig
for i in range(len(g) - k + 1):
base = list(g[i : i + k])
inner(0, d)
maxcount = max(patcount.values())
return [p for p, c in patcount.items() if c == maxcount]

Going on your problem description alone and not your examples (for the reasons I explained in the comment), one approach would be:
s = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGC"\
"CGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGG"\
"CCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCG"\
"GTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACAC"\
"ACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"
def frequent_words_mismatch(g,k,d):
def num_misspellings(x,y):
return sum(xx != yy for (xx,yy) in zip(x,y))
seen = set()
for i in range(len(g)-k+1):
seen.add(g[i:i+k])
# For each unique sequence, add a (key,bin) pair to the bins dictionary
# (The bin is initialized to a list containing only the sequence, for now)
bins = {seq:[seq,] for seq in seen}
# Loop again through the unique sequences...
for seq in seen:
# Try to fit it in *all* already-existing bins (based on bin key)
for bk in bins:
# Don't re-add seq to it's own bin
if bk == seq: continue
# Test bin keys, try to find all appropriate bins
if num_misspellings(seq, bk) <= d:
bins[bk].append(seq)
# Get a list of the bin keys (one for each unique sequence) sorted in order of the
# number of elements in the corresponding bins
sorted_keys = sorted(bins, key= lambda k:len(bins[k]), reverse=True)
# largest_bin_key will be the key of the largest bin (there may be ties, so in fact
# this is *a* key of *one of the bins with the largest length*). That is, it'll
# be the sequence (found in the string) that the most other sequences (also found
# in the string) are at most d-distance from.
largest_bin_key = sorted_keys[0]
# You can return this bin, as your question description (but not examples) indicate:
return bins[largest_bin_key]
largest_bin = frequent_words_mismatch(s,10,2)
print(len(largest_bin)) # 13
print(largest_bin)
The (this) largest bin contains:
['CGGCCGCCGG', 'GGGCCGGCGG', 'CGGCCGGCGC', 'AGGCGGCCGG', 'CAGGCGCCGG',
'CGGCCGGCCG', 'CGGTAGCCGG', 'CGGCGGCCGC', 'CGGGCGCCGG', 'CCGGCGCCGG',
'CGGGCCCCGG', 'CCGCCGGCGG', 'GGGCCGCCGG']
It's O(n**2) where n is the number of unique sequences and completes on my computer in around 0.1 seconds.

Finding the subsequence that starts and ends with the same number with the maximum sum

I have to make a program that takes as input a list of numbers and returns the sum of the subsequence that starts and ends with the same number which has the maximum sum (including the equal numbers in the beginning and end of the subsequence in the sum). It also has to return the placement of the start and end of the subsequence, that is, their index+1. The problem is that my current code runs smoothly only while the length of list is not that long. When the list length extends to 5000 the program does not give an answer.
The input is the following:
6
3 2 4 3 5 6
The first line is for the length of the list. The second line is the list itself, with list items separated by space. The output will be 12, 1, 4, because as you can see there is 1 equal pair of numbers (3): the first and fourth element, so the sum of elements between them is 3 + 2 + 4 + 3 = 12, and their placement is first and fourth.
Here is my code.
length = int(input())
mass = raw_input().split()
for i in range(length):
mass[i]=int(mass[i])
value=-10000000000
b = 1
e = 1
for i in range(len(mass)):
if mass[i:].count(mass[i])!=1:
for j in range(i,len(mass)):
if mass[j]==mass[i]:
f = mass[i:j+1]
if sum(f)>value:
value = sum(f)
b = i+1
e = j+1
else:
if mass[i]>value:
value = mass[i]
b = i+1
e = i+1
print value
print b,e

This should be faster than your current approach.
Rather than searching through mass looking for pairs of matching numbers we pair each number in mass with its index and sort those pairs. We can then use groupby to find groups of equal numbers. If there are more than 2 of the same number we use the first and last, since they will have the greatest sum between them.
from operator import itemgetter
from itertools import groupby
raw = '3 5 6 3 5 4'
mass = [int(u) for u in raw.split()]
result = []
a = sorted((u, i) for i, u in enumerate(mass))
for _, g in groupby(a, itemgetter(0)):
g = list(g)
if len(g) > 1:
u, v = g[0][1], g[-1][1]
result.append((sum(mass[u:v+1]), u+1, v+1))
print(max(result))
output
(19, 2, 5)
Note that this code will not necessarily give the maximum sum between equal elements in the list if the list contains negative numbers. It will still work correctly with negative numbers if no group of equal numbers has more than two members. If that's not the case, we need to use a slower algorithm that tests every pair within a group of equal numbers.
Here's a more efficient version. Instead of using the sum function we build a list of the cumulative sums of the whole list. This doesn't make much of a difference for small lists, but it's much faster when the list size is large. Eg, for a list of 10,000 elements this approach is about 10 times faster. To test it, I create an array of random positive integers.
from operator import itemgetter
from itertools import groupby
from random import seed, randrange
seed(42)
def maxsum(seq):
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
result = []
a = sorted((u, i) for i, u in enumerate(seq))
for _, g in groupby(a, itemgetter(0)):
g = list(g)
if len(g) > 1:
u, v = g[0][1], g[-1][1]
result.append((sums[v+1] - sums[u], u+1, v+1))
return max(result)
num = 25000
hi = num // 2
mass = [randrange(1, hi) for _ in range(num)]
print(maxsum(mass))
output
(155821402, 21, 24831)
If you're using a recent version of Python you can use itertools.accumulate to build the list of cumulative sums. This is around 10% faster.
from itertools import accumulate
def maxsum(seq):
sums = [0] + list(accumulate(seq))
result = []
a = sorted((u, i) for i, u in enumerate(seq))
for _, g in groupby(a, itemgetter(0)):
g = list(g)
if len(g) > 1:
u, v = g[0][1], g[-1][1]
result.append((sums[v+1] - sums[u], u+1, v+1))
return max(result)
Here's a faster version, derived from code by Stefan Pochmann, which uses a dict, instead of sorting & groupby. Thanks, Stefan!
def maxsum(seq):
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
where = {}
for i, x in enumerate(seq, 1):
where.setdefault(x, [i, i])[1] = i
return max((sums[j] - sums[i - 1], i, j)
for i, j in where.values())
If the list contains no duplicate items (and hence no subsequences bound by duplicate items) it returns the maximum item in the list.
Here are two more variations. These can handle negative items correctly, and if there are no duplicate items they return None. In Python 3 that could be handled elegantly by passing default=None to max, but that option isn't available in Python 2, so instead I catch the ValueError exception that's raised when you attempt to find the max of an empty iterable.
The first version, maxsum_combo, uses itertools.combinations to generate all combinations of a group of equal numbers and thence finds the combination that gives the maximum sum. The second version, maxsum_kadane uses a variation of Kadane's algorithm to find the maximum subsequence within a group.
If there aren't many duplicates in the original sequence, so the average group size is small, maxsum_combo is generally faster. But if the groups are large, then maxsum_kadane is much faster than maxsum_combo. The code below tests these functions on random sequences of 15000 items, firstly on sequences with few duplicates (and hence small mean group size) and then on sequences with lots of duplicates. It verifies that both versions give the same results, and then it performs timeit tests.
from __future__ import print_function
from itertools import groupby, combinations
from random import seed, randrange
from timeit import Timer
seed(42)
def maxsum_combo(seq):
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
where = {}
for i, x in enumerate(seq, 1):
where.setdefault(x, []).append(i)
try:
return max((sums[j] - sums[i - 1], i, j)
for v in where.values() for i, j in combinations(v, 2))
except ValueError:
return None
def maxsum_kadane(seq):
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
where = {}
for i, x in enumerate(seq, 1):
where.setdefault(x, []).append(i)
try:
return max(max_sublist([(sums[j] - sums[i-1], i, j)
for i, j in zip(v, v[1:])], k)
for k, v in where.items() if len(v) > 1)
except ValueError:
return None
# Kadane's Algorithm to find maximum sublist
# From https://en.wikipedia.org/wiki/Maximum_subarray_problem
def max_sublist(seq, k):
max_ending_here = max_so_far = seq[0]
for x in seq[1:]:
y = max_ending_here[0] + x[0] - k, max_ending_here[1], x[2]
max_ending_here = max(x, y)
max_so_far = max(max_so_far, max_ending_here)
return max_so_far
def test(num, hi, loops):
print('\nnum = {0}, hi = {1}, loops = {2}'.format(num, hi, loops))
print('Verifying...')
for k in range(5):
mass = [randrange(-hi // 2, hi) for _ in range(num)]
a = maxsum_combo(mass)
b = maxsum_kadane(mass)
print(a, b, a==b)
print('\nTiming...')
for func in maxsum_combo, maxsum_kadane:
t = Timer(lambda: func(mass))
result = sorted(t.repeat(3, loops))
result = ', '.join([format(u, '.5f') for u in result])
print('{0:14} : {1}'.format(func.__name__, result))
loops = 20
num = 15000
hi = num // 4
test(num, hi, loops)
loops = 10
hi = num // 100
test(num, hi, loops)
output
num = 15000, hi = 3750, loops = 20
Verifying...
(13983131, 44, 14940) (13983131, 44, 14940) True
(13928837, 27, 14985) (13928837, 27, 14985) True
(14057416, 40, 14995) (14057416, 40, 14995) True
(13997395, 65, 14996) (13997395, 65, 14996) True
(14050007, 12, 14972) (14050007, 12, 14972) True
Timing...
maxsum_combo : 1.72903, 1.73780, 1.81138
maxsum_kadane : 2.17738, 2.22108, 2.22394
num = 15000, hi = 150, loops = 10
Verifying...
(553789, 21, 14996) (553789, 21, 14996) True
(550174, 1, 14992) (550174, 1, 14992) True
(551017, 13, 14991) (551017, 13, 14991) True
(554317, 2, 14986) (554317, 2, 14986) True
(558663, 15, 14988) (558663, 15, 14988) True
Timing...
maxsum_combo : 7.29226, 7.34213, 7.36688
maxsum_kadane : 1.07532, 1.07695, 1.10525
This code runs on both Python 2 and Python 3. The above results were generated on an old 32 bit 2GHz machine running Python 2.6.6 on a Debian derivative of Linux. The speeds for Python 3.6.0 are similar.
If you want to include groups that consist of a single non-repeated number, and also want to include the numbers that are in groups as a "subsequence" of length 1, you can use this version:
def maxsum_kadane(seq):
if not seq:
return None
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
where = {}
for i, x in enumerate(seq, 1):
where.setdefault(x, []).append(i)
# Find the maximum of the single items
m_single = max((k, v[0], v[0]) for k, v in where.items())
# Find the maximum of the subsequences
try:
m_subseq = max(max_sublist([(sums[j] - sums[i-1], i, j)
for i, j in zip(v, v[1:])], k)
for k, v in where.items() if len(v) > 1)
return max(m_single, m_subseq)
except ValueError:
# No subsequences
return m_single
I haven't tested it extensively, but it should work. ;)

Can I parallelize this nested for loop? (Python 3.5)

I recently found out that the bottleneck of my code is the following block. N is of order 10,000, and L (10,000)^2. RQ_func is just a function that takes indices (tuples) and returns float V and dictionary sp_dist of {index : probability} format.
Is there a way I can parallelize this code? I have access to cluster computing from which I can use up to 20 cores at a time and would like to use the option.
R = np.empty((L,))
Q = scipy.sparse.lil_matrix((L, N))
traverser = 0 # Populate R and Q by traversing the array
for s_index in state_indices:
for a_index in action_indices:
V, sp_dist = RQ_func(s_index, a_index)
R[traverser] = V
for sp_index, prob in sp_dist.items():
Q[traverser, sp_index] = prob
traverser += 1

How can I prevent adding the same value (With another key) into a dictionary?

I need to fill a dictionary with pairs key-value given by the next code:
for i in range(1,n+1):
d = {}
Ri = Vector([#SomeCoordinates])
for k in range(1,n+1):
Rk = Vector([#SomeCoordinates])
if i != k:
d['R'+str(i)+str(k)] = (Rk-Ri).mod # Distance between Ri and Rk
else:
None
""" Since (Rk-Ri).mod gives me the distance between two points (i and k),
it's meaningless to calc the distance if i == k. """
Here's the problem:
'Rik' represents the same distance as 'Rki' and I don't want to add a distance twice.
Then, I tried with this code:
if i != k and ( ('R'+str(i)+str(k)) and ('R'+str(k)+str(i)) ) not in d:
d['R'+str(i)+str(k)] = (Rk-Ri).mod
else:
None
but the problem is still there.
When I "print d" I get R12 but also R21 (And the same with every pair of numbers " i k ").
What can I do?

You could use the following:
d = {}
for i in range(1, n + 1):
Ri = Vector([#SomeCoordinates]).
for k in range(i + 1, n + 1):
Rk = Vector([#SomeCoordinates])
d[i, k] = d[k, i] = (Rk - Ri).mod
This way we ensure we'll take only a pair (by enforcing k > i) and then we can assign to the dictionary the distance for both (i, k) and (k, i).
I used d[i, k] instead of d['R' + str(i) + str(k)] because the latter has the following disadvantage: We can't infer for example, if d['R123'] refers to (12, 3) or (1, 23).
Also, I moved dictionary initialisation (d = {}) outside both loops, because it's initialised for each i.

If I undertand you correctly, you are looking for all the combinations of two elements. You can use itertools.combinations to autoamtically generate all such combinations with no duplicates.
d = {}
for i, k in itertools.combinations(range(1, n+1), 2):
Ri = Vector([SomeCoordinates])
Rk = Vector([SomeCoordinates])
d['R'+str(i)+str(k)] = (Rk-Ri).mod
You could even make it a dict comprehension (although it may be a bit long):
d = {'R'+str(i)+str(k)] : (Vector([SomeCoordinates]) - Vector([SomeCoordinates])).mod
for i, k in itertools.combinations(range(1, n+1), 2)}
Or, to do the (possibly expensive) calculation of Vector([SomeCoordinates]) just once for each value of i or k, try this (thanks to JuniorCompressor for pointing this out):
R = {i: Vector([SomeCoordinates]) for i in range(1, n+1)}
d = {(i, k): (R[i] - R[k]).mod for i, k in itertools.combinations(range(1, n+1), 2)}
Also, as others have noted, 'R'+str(i)+str(k) is not a good key, as it will be impossible to distinguish between e.g. (1,23) and (12,3), as both end up as 'R123'. I suggest you just use the tuple (i,k) instead.

You might always put the smaller value first so that a previous entry is automatically overwritten:
if i != k:
key = str(i) + "," + str(k) if i < k else str(k) + "," + str(i)
d['R'+key] = (Rk-Ri).mod
(I assume that your script only needs the distance values, not information from the current keys.)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Optimizing loop for millions of entry selections - python

Related

Creating and updating 3 index matrix in Python

Efficiency of finding mismatched patterns

Finding the subsequence that starts and ends with the same number with the maximum sum

Can I parallelize this nested for loop? (Python 3.5)

How can I prevent adding the same value (With another key) into a dictionary?

Categories

Resources