How to make this nested summation efficient in python? - python

I have a problem making an efficient function to be used in a root-finding algorithm. I need a function that is a triple summation over labels, which in turn contains a triple summation over pairs of lists. I have already tried several implementations, such as using nested lists, using dictionaries, splitting the inner triple summation into two double summations (with intermediate lists), keeping the 'm' and 'n' dictionaries/lists apart, using itertools.izip() to iterate over R and E together, and probably a few others I'm forgetting.
The idea is that I need to be able to discriminate the labels for other functions, so I need an efficient way to store, access and sum over these sets of numbers.
Now, this function is part of an iteration, so the first time most of these lists are empty. I then need to use a root-finding algorithm (pretty simple) for each value in the E-dictionary. After the root finding (which depends on this function), the lists are refilled with its solutions. This means that in the second iteration each list will contain on the order of 1000 numbers. After that, the root finding is used again with this new function, giving (after rebinning) 1000 new numbers in each list.
Clearly I have a problem if this root finding already takes several minutes in the first iteration. I have a specific implementation for the first iteration (reduced to 3 loops over the labels, because the lists are empty or hold only one value) which finds all these roots in two seconds.
How can I do this summation efficiently, while still being able to discriminate between the various lists?
Thanks in advance
Note: this code is not my most beautiful attempt, but it is the clearest about what I'm trying to accomplish.
E = {}
R = {}
for i in labels:
    E[i] = {'m': [i], 'n': []}  # the label happens to be the value that's in here
    E[-i] = {'m': [], 'n': [-i]}
    R[i] = {'m': [1], 'n': []}
    R[-i] = {'m': [], 'n': [1]}

def function(A, En):
    temp = 0
    for a in E:
        if not(ACTIVATE) or a != A:
            for b in E:
                for c in E:
                    if not(ACTIVATE) or b != c:
                        for i in xrange(len(R[a]['n'])):
                            for j in xrange(len(R[b]['m'])):
                                for k in xrange(len(R[c]['m'])):
                                    temp += R[a]['n'][i]*R[b]['m'][j]*R[c]['m'][k]/(En - (-E[a]['n'][i] + E[b]['m'][j] + E[c]['m'][k]))
                        for i in xrange(len(R[a]['m'])):
                            for j in xrange(len(R[b]['n'])):
                                for k in xrange(len(R[c]['n'])):
                                    temp += R[a]['m'][i]*R[b]['n'][j]*R[c]['n'][k]/(En + (E[a]['m'][i] - E[b]['n'][j] - E[c]['n'][k]))
    return .5*temp
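One way to attack the speed problem (a sketch of my own, not from the post; it assumes each R/E list can be converted to a 1-D NumPy float array once per outer (a, b, c) step) is to replace the innermost triple loop with broadcasting, which builds the whole grid of terms at once:

import numpy as np

def inner_sum(Rn_a, En_a, Rm_b, Em_b, Rm_c, Em_c, En):
    # Equivalent of one i/j/k triple loop above: build the full
    # (len_a, len_b, len_c) grid of numerators and denominators at C speed.
    # Each argument is a 1-D array, e.g. Rn_a = np.asarray(R[a]['n'], dtype=float).
    num = Rn_a[:, None, None] * Rm_b[None, :, None] * Rm_c[None, None, :]
    den = En - (-En_a[:, None, None] + Em_b[None, :, None] + Em_c[None, None, :])
    return (num / den).sum()

Empty lists simply produce a zero-size grid whose sum is 0.0, so the mostly-empty first iteration still works.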

Related

Optimally selecting n datapoints from k

Problem statement:
I have 32k strings that consist of 13 characters. Each character can take 3 values (a, b or c). I need to select n strings from the 32k that satisfy the following:
select a minimal number of strings, such that every string within the 32k differs by no more than 2 characters from at least one selected string
This means that the count of strings that needs to be selected is variable. Also, the strings are not randomly generated, so the average difference is less than 2/3*13 - meaning that the eventual count of strings to be selected is not astronomical.
What I tried so far:
Clustering with k++ initialization and then k-means using Hamming distance, but this did not yield the desired outcome, although the problem resembles a clustering problem, in the sense that we are practically looking for cluster centers whose members lie within a radius of 2.
What I am thinking of is simply selecting the string which has the most other strings within a distance of 1 and then of 2, removing all of these from the 32k, and repeating the calculation until no strings are left (a sketch of this greedy idea follows the question). But this is likely to be a suboptimal solution, e.g. this way I would select more strings than the minimum required, I believe (selecting additional strings is a cost).
Question:
What other algorithms should I consider or think of? Thanks!
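Here is a minimal sketch of the greedy idea described in the question (my own illustration, with hypothetical names; it is a set-cover heuristic, so it carries no minimality guarantee):

def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def greedy_centers(strings, radius=2):
    # Repeatedly pick the string covering the most still-uncovered strings,
    # where "covers" means Hamming distance <= radius. O(n^2 * m) per round.
    uncovered = set(strings)
    centers = []
    while uncovered:
        best = max(strings,
                   key=lambda s: sum(hamming(s, t) <= radius for t in uncovered))
        centers.append(best)
        uncovered = {t for t in uncovered if hamming(best, t) > radius}
    return centers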
Here are examples of each method from my previous post. I always have trouble working code into my posts, so I did this separately. The first method computes the percentage that the strings are identical; the second method returns the number of differences.
string1 = ('abcbacaacbaab')
string2 = ('abcbacaacbbbb')
from difflib import SequenceMatcher
a=string1
b=string2
x = SequenceMatcher(a=a,b=b).ratio()
print(x)
#output: 0.8462
# OR (I used pip3 install jellyfish first)
import jellyfish
x = jellyfish.damerau_levenshtein_distance(a, b)
print(x)
# output: 2
You might be able to use one of the types of 'fuzzy string matching' explained at:
https://miguendes.me/python-compare-strings#how-to-compare-two-strings-for-similarity-fuzzy-string-matching
There's "difflib" which computes a ratio of the differences. (You're in luck, your strings are all the same length.)
There's also something called "jellyfish" that returns a character count of the differences. It sounds like an interesting assignment, good luck!
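Since the strings are all the same length, a plain Hamming distance (a small sketch of my own, not part of difflib or jellyfish) returns the same count of differing positions directly:

def hamming(a, b):
    # Positions at which two equal-length strings differ.
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming('abcbacaacbaab', 'abcbacaacbbbb'))
# output: 2, matching the jellyfish result above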
If I understood correctly, you want the minimum subset such that every element outside the subset differs by no more than two characters from some element in the subset (please let me know if I misunderstood the problem).
If that is the problem, there is a simple ad hoc algorithm that solves it in O(m * max(n, k)), where n is the total number of elements in the set (32000 in this case), m is the number of characters of an element of the set (13 in this case) and k is the size of the alphabet (3 in this case).
You can precalculate the quantity of each unique character of the alphabet in each column in O(m * max(n, k)). It's O(m * k) for initialization of the precalculation matrix and O(m * n) to actually calculate it.
Each column can vote for the removal of a string if the count of that string's character in that column equals the total number of strings in the initial set. Notice that a column can vote in O(1) using the precalculation. For each string, iterate through its columns and let each column vote. If you get three votes, you are sure the string needs to be kicked out of the set, so there is no need to continue iterating through the columns; just go to the next string. Otherwise, the string needs to remain: append it to the answer.
A python code is attached:
def solve(s: list[str], n: int = 32000, m: int = 13, k: int = 3) -> list[str]:
    pre_calc = [[0 for j in range(k)] for i in range(m)]
    ans = []
    for i in range(n):
        for j in range(m):
            pre_calc[j][ord(s[i][j]) - ord('a')] += 1
    for i in range(n):
        votes_cnt = 0
        remove = False
        for j in range(m):
            if pre_calc[j][ord(s[i][j]) - ord('a')] == n:
                votes_cnt += 1
                if votes_cnt == 3:
                    remove = True
                    break
        if remove is False:
            ans.append(s[i])
    if len(ans) == 0:
        ans.append(s[0])
    return ans

how to calculate the minimum unfairness sum of a list

I have tried to summarize the problem statement like this:
Given n, k and an array (a list) arr, where n = len(arr) and k is an integer in [1, n].
For an array (or list) myList, the unfairness sum is defined as the sum of the absolute differences between all possible pairs (combinations of 2 elements each) in myList.
To explain: if myList = [1, 2, 5, 5, 6], then its unfairness sum is
US = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6| = 26
(Note that elements are considered unique by their index in the list, not by their values.)
If you actually need to look at the problem statement, It's HERE
My Objective
given n, k, arr (as described above), find the Minimum Unfairness Sum out of all the unfairness sums of the possible sub arrays, with the constraint that each len(sub array) = k [which is a good thing to make our lives easy, I believe :)]
what I have tried
well, there is a lot to be added in here, so I'll try to be as short as I can.
My first approach was this, where I used itertools.combinations to get all the possible combinations and statistics.variance to check the spread of the data (yeah, I know, I'm a mess).
Before you see the code below: do you think variance and unfairness sum are perfectly related (I know they are strongly related), i.e. does the sub array with minimum variance have to be the sub array with the MUS?
You only have to check the LetMeDoIt(n, k, arr) function. If you need an MCVE, check the second code snippet below.
from itertools import combinations as cmb
from statistics import variance as varn

def LetMeDoIt(n, k, arr):
    v = []
    s = []
    subs = [list(x) for x in list(cmb(arr, k))]  # getting all sub arrays from arr in a list
    i = 0
    for sub in subs:
        if i != 0:
            var = varn(sub)  # the variance thingy
            if float(var) < float(min(v)):
                v.remove(v[0])
                v.append(var)
                s.remove(s[0])
                s.append(sub)
            else:
                pass
        elif i == 0:
            var = varn(sub)
            v.append(var)
            s.append(sub)
            i = 1
    final = []
    f = list(cmb(s[0], 2))  # getting list of all pairs (after determining sub array with least MUS)
    for r in f:
        final.append(abs(r[0]-r[1]))  # calculating the MUS in my messy way
    return sum(final)
The above code works fine for n < 30 but raises a MemoryError beyond that.
In Python chat, Kevin suggested I try a generator, which is memory efficient (it really is); but as a generator also produces those combinations on the fly as we iterate over them, it was estimated to take over 140 hours (:/) for n = 50, k = 8.
I posted the same as a question on SO HERE (you might want to have a look to understand me properly; it has discussions and an answer by fusion which led me to my second approach, a better one (I should say fusion's approach xD)).
Second Approach
from itertools import combinations as cmb

def myvar(arr):  # a function to calculate variance
    l = len(arr)
    m = sum(arr)/l
    return sum((i-m)**2 for i in arr)/l

def LetMeDoIt(n, k, arr):
    sorted_list = sorted(arr)  # i think sorting the array makes it easy to get the sub array with MUS quickly
    variance = None
    min_variance_sub = None
    for i in range(n - k + 1):
        sub = sorted_list[i:i+k]
        var = myvar(sub)
        if variance is None or var < variance:
            variance = var
            min_variance_sub = sub
    final = []
    f = list(cmb(min_variance_sub, 2))  # again getting all possible pairs in my messy way
    for r in f:
        final.append(abs(r[0] - r[1]))
    return sum(final)

def MainApp():
    n = int(input())
    k = int(input())
    arr = list(int(input()) for _ in range(n))
    result = LetMeDoIt(n, k, arr)
    print(result)

if __name__ == '__main__':
    MainApp()
This code works perfectly for n up to 1000 (maybe more), but terminates due to a time out (5 seconds is the limit on the online judge :/ ) for n beyond 10000 (the biggest test case has n = 100000).
=====
How would you approach this problem to take care of all the test cases within the given time limit (5 sec)? (The problem was listed under algorithms & dynamic programming.)
(For your reference, you can have a look at:
successful submissions (py3, py2, C++, Java) on this problem by other candidates, so that you can explain that approach for me and future visitors;
an editorial by the problem setter explaining how to approach the question;
a solution code by the problem setter himself (py2, C++);
input data (test cases) and expected output.)
Edit 1:
For future visitors of this question, the conclusions I have reached so far are:
that variance and unfairness sum are not perfectly related (only strongly related), which implies that among many lists of integers, the list with minimum variance doesn't always have to be the list with the minimum unfairness sum. If you want to know why, I asked that as a separate question on Math Stack Exchange HERE, where one of the mathematicians proved it for me xD (it's worth taking a look, because it was unexpected; a small counterexample is sketched below).
As far as the question is concerned overall, you can read the answers by archer and Attersson below (I'm still trying to figure out a naive approach to carry this out; it shouldn't be far off by now, though).
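For instance, here is a small concrete counterexample (my own, not from the linked proof): B has the lower population variance but the higher unfairness sum.

from itertools import combinations
from statistics import pvariance

def unfairness_sum(xs):
    return sum(abs(a - b) for a, b in combinations(xs, 2))

A = [0, 0, 0, 0, 10]
B = [0, 5, 5, 6, 10]
print(pvariance(A), unfairness_sum(A))  # 16.0  40
print(pvariance(B), unfairness_sum(B))  # 10.16 42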
Thank you for any help or suggestions :)
You must work on your list SORTED and check only sublists with consecutive elements. This is because, necessarily, any sublist that includes at least one non-consecutive element will have a higher unfairness sum.
For example if the list is
[1,3,7,10,20,35,100,250,2000,5000] and you want to check sublists of length 3, then the solution must be one of [1,3,7], [3,7,10], [7,10,20], etc.
Any other sublist, e.g. [1,3,10], will have a higher unfairness sum, because 10 > 7 and therefore all its differences with the rest of the elements will be larger than 7's.
The same holds for [1,7,10] (non-consecutive on the left side), since 1 < 3.
Given that, you only have to check consecutive sublists of length k, which reduces the execution time significantly.
Regarding coding, something like this should work:
import itertools

def myvar(array):
    return sum([abs(i[0]-i[1]) for i in itertools.combinations(array, 2)])

def minsum(n, k, arr):
    l = sorted(arr)  # the list must be sorted, as argued above
    res = myvar(l[:k])  # start from the first subarray instead of a huge sentinel
    for i in range(1, n - k + 1):
        res = min(res, myvar(l[i:i+k]))
    return res
I see this question still has no complete answer, so I will sketch a correct algorithm that will pass the judge. I will not write the code, in order to respect the purpose of the HackerRank challenge, and since we already have working solutions.
The original array must be sorted. This has a complexity of O(N log N).
At this point you can check only consecutive subarrays, as non-consecutive ones will result in a worse (or equal, but never better) "unfairness sum". This is also explained in archer's answer.
The last passage, finding the minimum "unfairness sum", can be done in O(N). You need to calculate the US for every consecutive k-long subarray. The mistake is recalculating it from scratch at every step, which is O(k) and brings this passage to O(k*N). It can instead be updated in O(1), as the editorial you posted shows, including the mathematical formulae. It requires initializing a cumulative (prefix-sum) array after step 1 (done in O(N), with O(N) space complexity too).
It works but terminates due to time out for n beyond 10000.
(from a comment on archer's answer)
To explain step 3, think about k = 100. You are scrolling the N-long array, and on the first iteration you calculate the US for the subarray from element 0 to 99 as usual, requiring 100 passages. The next step needs you to calculate the same for a subarray that differs from the previous one by only one element: 1 to 100. Then 2 to 101, etc.
If it helps, think of it like a snake: one block is removed and one is added.
There is no need to perform the whole O(k) scroll. Just work out the maths as explained in the editorial and you will do it in O(1).
So the final complexity will asymptotically be O(N log N), due to the initial sort.
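To make step 3 concrete, here is a sketch of one possible O(1) update (my own illustration, not the problem setter's code). Sliding the window from [i, i+k-1] to [i+1, i+k] on the sorted array changes the unfairness sum by (k-1)*(x[i] + x[i+k]) - 2*S, where S is the sum of the k-1 shared elements, read off a prefix-sum array:

def min_unfairness_sum(arr, k):
    x = sorted(arr)                      # step 1: O(N log N)
    n = len(x)
    prefix = [0] * (n + 1)               # cumulative array: O(N) time and space
    for i, v in enumerate(x):
        prefix[i + 1] = prefix[i] + v
    # US of the first window, computed directly once:
    # in a sorted window, element x[j] contributes with weight (2j - k + 1).
    us = sum(x[j] * (2 * j - k + 1) for j in range(k))
    best = us
    for i in range(n - k):               # slide: drop x[i], add x[i+k]
        shared = prefix[i + k] - prefix[i + 1]   # sum of x[i+1..i+k-1]
        us += (k - 1) * (x[i] + x[i + k]) - 2 * shared
        best = min(best, us)
    return best

For arr = [1, 2, 5, 5, 6] and k = 3 this returns 2, the unfairness sum of the window [5, 5, 6].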

Python Iterating through nested list using list comprehension

I'm working on Project Euler, problem 11, which involves finding the greatest product of all possible combinations of four adjacent numbers in a grid. I've split the numbers into a nested list and used a list comprehension to slice the relevant numbers, like this:
if x+4 <= len(matrix[x]): #check right
    my_slice = [int(matrix[x][n]) for n in range(y,y+4)]
...and so on for the other cardinal directions. So far, so good. But when I get to the diagonals things get problematic. I tried to use two ranges like this:
if x+4 <= len(matrix[x]) and y-4 >=0:# check up, right
    my_slice = [int(matrix[m][n]) for m,n in ((range(x,x+4)),range(y,y+4))]
But this yields the following error:
<ipython-input-53-e7c3ebf29401> in <listcomp>(.0)
48 if x+4 <= len(matrix[x]) and y-4 >=0:# check up, right
---> 49 my_slice = [int(matrix[m][n]) for m,n in ((range(x,x+4)),range(y,y+4))]
ValueError: too many values to unpack (expected 2)
My desired indices for x, y values of [0, 0] would be ['0,0', '1,1', '2,2', '3,3']. This does not seem all that different from using the enumerate function to iterate over a list, but clearly I'm missing something.
P.S. My apologies for my terrible variable nomenclature; I'm a work in progress.
You do not need to use two ranges; simply use one and apply it twice:
my_slice = [int(matrix[m][m-x+y]) for m in range(x,x+4)]
Since your n is supposed to be attached to range(y,y+4), we know that there will always be a difference of y-x between m and n. So instead of using two variables, we can compensate for the difference ourselves.
Or, in case you still wish to use two range(..) constructs, you can use zip(..), which takes multiple iterables, consumes them concurrently, and emits tuples:
my_slice = [int(matrix[m][n]) for m,n in zip(range(x,x+4),range(y,y+4))]
But I think this will not improve performance because of the tuple packing and unpacking overhead.
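A quick runnable check of the zip version on a toy grid (my own example) confirms it picks up the (0,0), (1,1), (2,2), (3,3) diagonal the question asks about:

matrix = [['1', '2', '3', '4'],
          ['5', '6', '7', '8'],
          ['9', '10', '11', '12'],
          ['13', '14', '15', '16']]
x = y = 0
my_slice = [int(matrix[m][n]) for m, n in zip(range(x, x+4), range(y, y+4))]
print(my_slice)  # [1, 6, 11, 16]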
[int(matrix[x+d][n+d]) for d in range(4)] for one diagonal.
[int(matrix[x+d][n-d]) for d in range(4)] for the other.
Btw, better to use standard matrix index names, i.e., row i and column j, not x and y; it's confusing. I think you even confused yourself: for example, your if x+4 <= len(matrix[x]) tests x against the second dimension's length but uses it to index the first dimension. Huh?

Algos - Delete Extremes From A List of Integers in Python?

I want to eliminate extremes from a list of integers in Python. I'd say that my problem is one of design. Here's what I cooked up so far:
listToTest = [120, 130, 140, 160, 200]

def function(l):
    length = len(l)
    for x in xrange(0, length - 1):
        if l[x] < (l[x+1] - l[x]) * 4:
            l.remove(l[x+1])
    return l

print function(listToTest)
So the output of this should be: 120, 130, 140, 160, without 200, since that's way too far ahead of the others.
And this works, provided 200 is the last element or there's only one extreme. However, it gets problematic with a list like this:
listToTest = [120,200,130,140,160,200]
Or
listToTest = [120,130,140,160,200,140,130,120,200]
So, the output for the last list should be: 120,130,140,160,140,130,120. 200 should be gone, since it's a lot bigger than the "usual", which revolved around ~130-140.
To illustrate it, here's an image:
Obviously, my method doesn't work. Some thoughts:
- I need to somehow compare the pair x, x+1 with the next pair: see if the next pair has a bigger difference than the last pair; if it does, the pair with the bigger difference should have its larger element eliminated, and then this should be done again recursively. I think I should also have an "acceptable difference", so the algorithm knows when a difference is acceptable and stops, rather than recursing until I end up with only 2 values.
I tried writing it, but no luck so far.
You can use statistics here, eliminating values that fall beyond n standard deviations from the mean:
import numpy as np

test = [120, 130, 140, 160, 200, 140, 130, 120, 200]
n = 1
mean, std = np.mean(test), np.std(test)  # compute once, not per element
output = [x for x in test if abs(x - mean) < std * n]
# output is [120, 130, 140, 160, 140, 130, 120]
Your problem statement is not clear. If you simply want to remove the max and min, then that is a simple O(N) operation with 2 extra slots of memory, which is O(1). This is achieved by retaining the current min/max values and comparing each entry in the list against them in turn.
If you want the min/max K items, it is still an O(N + K log K) operation with O(K) extra memory. This is achieved with two priority queues of size K: one for the mins, one for the maxes.
Or did you intend a different output/outcome from your algorithm?
UPDATE: the OP has updated the question; it appears they want a moving (windowed) average and to delete outliers.
The following is an online algorithm -i.e. it can handle streaming data http://en.wikipedia.org/wiki/Online_algorithm
We can retain a moving average: let's say you keep K entries for the average.
Then create a linked list of size K and pointers to the head and tail. Handling items within the first K entries needs to be thought out separately; after the first K retained items, the algorithm can proceed as follows:
Check the next item in the input list against the running K-average. If the value exceeds the acceptable ratio threshold, put its list index into a separate "deletion queue" list. Otherwise, update the running windowed sum as follows:
(a) remove the head entry from the linked list and subtract its value from the running sum
(b) add the latest list entry as the tail of the linked list and add its value to the running sum
(c) recalculate the running average as the running sum /K
Now, how do we handle the first K entries, i.e., before we have a properly initialized running sum?
You will need to make some hard-coded decisions here. One possibility:
run through the first K + 2D entries (D << K);
keep the D max/min values;
remove those D max/min values from that list.
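Here is a minimal sketch of that windowed scheme (entirely illustrative: the names, the one-sided ratio test, and the simplified first-K seeding are my own, not from the answer above):

from collections import deque

def flag_outliers(values, k=4, threshold=1.4):
    window = deque(values[:k], maxlen=k)   # seed with the first K items
    running_sum = float(sum(window))
    kept, deletion_queue = list(values[:k]), []
    for idx in range(k, len(values)):
        avg = running_sum / k              # (c) the running K-average
        v = values[idx]
        if avg and v / avg > threshold:    # exceeds the acceptable ratio
            deletion_queue.append(idx)     # flag it; leave the window alone
        else:
            running_sum += v - window[0]   # (a) + (b) in one step
            window.append(v)               # the deque drops the old head itself
            kept.append(v)
    return kept, deletion_queue

print(flag_outliers([120, 130, 140, 160, 200, 140, 130, 120, 200]))
# ([120, 130, 140, 160, 140, 130, 120], [4, 8])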

Python speeding up the search for a value in a dictionary of ranges

I have a file with a column of values I would like to compare against a dictionary that contains two values that together form a range.
for instance:
File A:
Chr1 200 ....
Chr3 300
File B:
Chr1 200 300 ...
Chr2 300 350 ...
For now I created a dictionary of values for File B:
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]  # chromosome name from the first column
    Ranges[Chr].append(LineB)
For the comparison:
for Line in MethylationFile:
    Line = Line.strip("\n")
    Info = Line.split("\t")
    Chr = Info[0]
    Location = int(Info[1])
    Annotation = ""
    for i, r in enumerate(Ranges[Chr]):
        n = i + 1
        while (n < len(Ranges[Chr])):
            if (int(Ranges[Chr][i][1]) <= Location <= int(Ranges[Chr][i][2])):
                Annotation = '\t'.join(Ranges[Chr][i][4:])
            n += 1
    OutFile.write(Line + '\t' + Annotation + '\n')
If I leave the while loop in, the program does not seem to run (or is probably running too slowly to produce results), since I have over 7,000 values in each dictionary. If I change the while loop to an if statement, the program runs, but at an incredibly slow pace.
I'm looking for a way to make this program faster and more efficient.
Dictionaries are great when you want to look up a key by exact match. In particular, the hash of the lookup key has to be the same as the hash of the stored key.
If your ranges are consistent, you could fake this by writing a hash function that returns the same value for a range, and for every value within that range. But if they're not, this hash function would have to keep track of all of the known ranges, which takes you back to the same problem you're starting with.
In that case, the right data structure here is probably some kind of sorted collection. If you only need to build up the collection, and then use it many times without ever modifying it, just sorting a list and using the bisect module will do it for you. If you need to modify the collection after creation, you'll want something built around a binary tree or B-tree variant of some kind, like blist or bintrees.
This will reduce the time to find a range from N/2 to log2(N). So, if you've got 10000 ranges, instead of 5000 comparisons, you'll do 14.
While we're at it, it would help to convert the range start and stop values to ints once, instead of doing it each time. Also, if you want to use the stdlib bisect, you unfortunately can't pass a key to most functions, so let's reorganize the ranges into comparable order too. So:
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]
    Ranges[Chr].append((int(LineB[1]), int(LineB[2]), LineB[0]))

for r in Ranges.values():
    r.sort()
Now, instead of this loop:
for i, r in enumerate(Ranges[Chr]):
    # ...
Do this:
import bisect

i = bisect.bisect(Ranges[Chr], (Location, Location, None))
if i:
    r = Ranges[Chr][i-1]
    if r[0] <= Location < r[1]:
        pass  # do whatever you wanted with r
    else:
        pass  # there is no range that includes Location
else:
    pass  # Location is before all ranges
You have to be careful thinking about bisect, and it's possible I've got this wrong on the first attempt, so… read the docs on what it does, and experiment with your data (printing out the results of the bisect function), before trusting this.
If your ranges can overlap, and you want to be able to find all ranges that contain a value rather than just one, you'll need a bit more than this to keep things efficient. There's no way to fully-order overlapping ranges, so bisect won't cut it.
If you're expecting more than log N matches per average lookup, you can do it with two sorted lists and bisect.
But otherwise, you need a more complex data structure, and more complex code. For example, if you can spare N^2 space, you can keep the time at log N by having, for each range in the first list, a second list, sorted by end, of all the values with a matching start.
And at this point, I think it's getting complex enough that you want to look for a library to do it for you.
However, you might want to consider a different solution.
If you use numpy or a database instead of pure Python, this can't cut the algorithmic complexity from N to log N… but it can cut the constant overhead by a factor of 10 or so, which may be good enough. In fact, if you're doing tons of searches on a medium-small list, it may even be better.
Plus, it looks a lot simpler, and once you get used to array operations or SQL, it may even be more readable. So:
RangeArrays = [np.array([a[:2] for a in value]) for value in Ranges]
… or, if Ranges is a dict mapping strings to values, instead of a list:
RangeArrays = {key: np.array([a[:2] for a in value]) for key, value in Ranges.items()}
Then, instead of this:
for i, r in enumerate(Ranges[Chr]):
    # ...
Do:
comparisons = Location < RangeArrays[Chr]
matches = comparisons[:, 0] < comparisons[:, 1]
indices = matches.nonzero()[0]
for index in indices:
    r = Ranges[Chr][index]
    # Do stuff with r
(You can of course make things more concise, but it's worth doing it this way and printing out all of the intermediate steps to see why it works.)
Or, using a database:
cur = db.execute('''SELECT Start, Stop, Chr FROM Ranges
                    WHERE Start <= ? AND Stop > ?''', (Location, Location))
for (Start, Stop, Chr) in cur:
    pass  # do stuff
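For completeness, here is a minimal sqlite3 setup of my own (the table and column names mirror the snippet above; the data and index are illustrative) that the query could run against:

import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE Ranges (Start INTEGER, Stop INTEGER, Chr TEXT)')
db.executemany('INSERT INTO Ranges VALUES (?, ?, ?)',
               [(200, 300, 'Chr1'), (300, 350, 'Chr2')])
db.execute('CREATE INDEX idx_ranges ON Ranges (Start, Stop)')

Location = 250
cur = db.execute('SELECT Start, Stop, Chr FROM Ranges '
                 'WHERE Start <= ? AND Stop > ?', (Location, Location))
print(cur.fetchall())  # [(200, 300, 'Chr1')]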
