I would like to check which mathematical expressions are equal.
I want to do this using Python; I tried it with SymPy.
My idea was to use simplify in order to reduce the expressions such that a pair that is equal will be reduced to the same expression.
Then, in two nested for loops, I subtract all the expressions from each other and check whether the result equals zero.
Unfortunately, no subtraction results in zero, which is very improbable to be correct.
I think that probably the simplify function does not really do what I need.
Is there a function in sympy to check if two expressions are indeed mathematically equal?
This is my code so far:
from sympy import *
a = symbols ('a')
b = symbols ('b')
n = symbols ('n')
m = symbols ('m')
x1=simplify(log(a,n**(log(b,a))))
x2=simplify(((a**n)/(b**m))**(1/b))
x3=simplify(b**(n*log(a)))
x4=simplify(log(b,n))
x5=simplify(a**((n-m)/b))
x6=simplify(n*(log(a)+log(b)))
x7=simplify(log((a**n)*(b**n)))
x8=simplify(a**(log(b**n)))
L=[x1,x2,x3,x4,x5,x6,x7,x8]
for i in range (0 , 6):
    for k in range (i+1 , 7):
        print(L[i]-L[k])
The a.equals(b) method will try really hard (including using random values for variables) to show that a == b. But be aware that two expressions might only be equal for a given range of values. So it might be better to indicate that your symbols are, for example, positive or integer as in Symbol('n', integer=True) or Symbol('a', positive=True). If you do that then simplify(a - b) will more likely reduce to 0 as will a.equals(b).
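For instance, a small sketch using the x6/x7 pair from the question: with positive assumptions, the log identities apply and the difference reduces to zero.

```python
from sympy import symbols, log, simplify

# With positive symbols, identities like log(x*y) = log(x) + log(y) are valid
a, b, n = symbols('a b n', positive=True)
x6 = n*(log(a) + log(b))
x7 = log((a**n)*(b**n))
print(simplify(x6 - x7))   # 0
print(x6.equals(x7))       # True
```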
posify is a function which can replace symbols with symbols having positive assumptions; see below how x6 and x7 simplify when symbols are positive:
>>> from sympy import posify
>>> dif = x6 - x7
>>> dif.simplify() == 0
False
>>> posify(dif)[0].simplify() # [0] gets the positive-symbol expression
0
You can also make numerical substitutions yourself using x._random(lo,LO,hi,HI), where (lo, hi) are the lower and upper limits for the real part of the number and (LO, HI) are the same for the imaginary part; e.g. x._random(0,0,1,0) will give a random value that is real, between 0 and 1. Create a replacement dictionary, replace the values, and check the absolute value of the difference between a and b. Something like this (using the loop as you presented it above):
for i in range (0 , 6):
    for k in range (i+1 , 7):
        v = L[i]-(L[k])
        reps = {i: i._random(0,0,1,0) for i in v.free_symbols}
        v = v.xreplace(reps).n()
        if abs(v) < 1e-9:
            print(L[i],L[k],abs(v))
Another way to check if functions are equal would be to evaluate them at maybe a few thousand points and check the outputs.
from sympy import *
def generateOutput(L, x):
    # x -> list of points to evaluate functions at (maybe randomly generated?)
    # L -> input list of functions
    # returns list of outputs of L[i] applied to x
    ...
a = symbols ('a')
b = symbols ('b')
n = symbols ('n')
m = symbols ('m')
x1=simplify(log(a,n**(log(b,a))))
x2=simplify(((a**n)/(b**m))**(1/b))
x3=simplify(b**(n*log(a)))
x4=simplify(log(b,n))
x5=simplify(a**((n-m)/b))
x6=simplify(n*(log(a)+log(b)))
x7=simplify(log((a**n)*(b**n)))
x8=simplify(a**(log(b**n)))
L=[x1,x2,x3,x4,x5,x6,x7,x8]
outputs = generateOutput(L, points)  # points: your chosen list of evaluation points
# Compare outputs
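One hedged way to flesh out the stub: substitute random positive values for the symbols and compare rows. The helper name generate_output, the sampled points, and the tolerance are my assumptions, not part of the original answer.

```python
import random
from sympy import symbols, log

a, b, n, m = symbols('a b n m', positive=True)
exprs = [n*(log(a) + log(b)), log((a**n)*(b**n))]  # x6 and x7 from the question

def generate_output(expr_list, points):
    # evaluate every expression at every substitution point
    return [[float(e.subs(p).evalf()) for p in points] for e in expr_list]

random.seed(0)
points = [{s: random.uniform(1, 2) for s in (a, b, n, m)} for _ in range(5)]
out = generate_output(exprs, points)
# x6 and x7 agree at every sampled point, up to floating point noise
print(all(abs(u - v) < 1e-9 for u, v in zip(out[0], out[1])))  # True
```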
From the docs:
The Eq function (from sympy.core.relational) looks like it is what you want. Note that if it is given more complex arguments, you will have to simplify to get a result (see last code example in link).
Note: Those for loops don't look right. The first one will only go through indices 0-5 and the second only through i+1 to 6, so the last item in the list will be skipped completely.
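The fix is simply to drive both ranges off len(L); a sketch with a stand-in list:

```python
# Using len(L) for both bounds visits every unordered pair, including the last item
L = ['x1', 'x2', 'x3', 'x4']  # stand-in for the sympy expressions
pairs = []
for i in range(len(L)):
    for k in range(i + 1, len(L)):
        pairs.append((L[i], L[k]))
print(len(pairs))  # 6 pairs for 4 items
```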
Related
Is there a way to remove specific elements of an array that meet certain criteria (for example, a condition on the values), using numpy.delete, a boolean mask, or any other numpy method?
For example:
import numpy as np
arr = np.random.chisquare(6, 10)
array([4.61518458, 4.80728541, 4.59749491, 3.44053946, 5.52507358,
       7.97092747, 2.01946678, 6.26877508, 3.68286537, 2.06759469])
Now for test purposes I would like to know if I can use some numpy function to remove all elements that are divisible by the given value k
>>> np.delete(arr, 1, 0)
[4.61518458 4.59749491 3.44053946 5.52507358 7.97092747 2.01946678
6.26877508 3.68286537 2.06759469]
the delete(arr, 1, 0) call only removes the value at that position. Is there a way to delete multiple values based on a condition or an anonymous (lambda) function like the one I mentioned above?
Yes, this is part of numpy's magic indexing. You can use a comparison operator (or any elementwise function that returns booleans) to produce an array of booleans, with True for the ones to keep and False for the ones to toss. So, for example, to keep all the elements less than 5:
selections = array < 5
array = array[selections]
That will only keep the elements where selections is True.
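A minimal sketch with made-up values:

```python
import numpy as np

arr = np.array([4.6, 7.9, 2.0, 6.2, 3.6])
mask = arr < 5           # boolean array: [True, False, True, False, True]
print(arr[mask])         # keeps 4.6, 2.0 and 3.6
```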
Of course, since all your values are floats, they aren't going to be divisible by an integer k, but that's another story.
For doing such division, based on the answer of Tim:
k = 6 # a number
array = array[array % k == 0]
Since you're looking at floating point division and will therefore be subject to numerical limitations, there should be no expectation that the result of the division will be perfect. Instead, I would suggest that you accept removing all the numbers that are almost divisible by k.
For your problem I would set a threshold and use np.logical_and:
arr[np.logical_and(arr % k > threshold, (k - (arr % k)) > threshold)]
Explanation
Consider the following problem:
k = 1.0000002300000000450001000101
x = np.array([k * i for i in range(1,10)] + [0.5,])
#array([1.00000023, 2.00000046, 3.00000069, 4.00000092, 5.00000115,
# 6.00000138, 7.00000161, 8.00000184, 9.00000207, 0.5])
In theory, all the numbers but the last one (0.5) should be divisible by k exactly. In reality, numerical precision limits that capability (if you really want to dig into why, I'd refer to the link above on floating point arithmetic)
np.where(x%k==0)
#(array([0, 1, 2, 3, 5, 7], dtype=int64),)
x[x%k==0]
#array([1.00000023, 2.00000046, 3.00000069, 4.00000092, 6.00000138,
# 8.00000184])
We've missed a few that we would like to have been caught (x[4], x[6] and x[8], with values of 5*k 6*k and 9*k). If we look at the modular division itself, we see that the missed numbers are almost 0 or almost k (we expect the last one since 0.5%k==0.5):
x[x%k!=0]%k
#array([1.00000023e+00, 4.44089210e-16, 1.00000023e+00, 5.00000000e-01])
So the best we can do is find a work around where we look for cases that are close enough. Noting that the differences above are O(2**-51), we can use 2**-50 as our threshold in this case but for practical purposes we can probably be a bit more lenient.
You also mention you want to eliminate the values that are divisible, so we want to keep the values where x%k > threshold and k-x%k > threshold:
threshold = 2**-50
x[np.logical_and((x % k) > threshold, (k - (x % k)) > threshold)]
#array([0.5])
If you wanted to keep them, then you'd use the opposite inequalities and use np.logical_or:
x[np.logical_or((x % k) < threshold, (k - (x % k)) < threshold)]
#array([1.00000023, 2.00000046, 3.00000069, 4.00000092, 5.00000115,
# 6.00000138, 7.00000161, 8.00000184, 9.00000207])
I'm trying to compare the elements of a list to a lower and upper limit without using for and while loops. My teacher wants us to use the map and reduce functions. The issue is that the upper limit is a parameter tied to a function and I can't use it in another function unless I make it a global variable (which kinda defeats the purpose of it being a parameter). With that being said, the comparison has to be done in the function in which my constant h is in. However, since I have to use map and reduce to somehow cycle through the indexes of my list, I have to exit the function and go to another one (since map and reduce use functions). The code is the following:
def fct(n,l,h):
    nX = n * l ; nY = n * h
    t = list(map(lambda n: "#",range(nX * nY)))
From here, I would like to compare the elements in t to my limits:
if 0 <= n <= h-1:
    # Do something
elif ...
Here, n is the value being compared (in other words n = t[i]). Is there anyway to obtain n using map and reduce?
Depends on what the "do something" part would be:
from functools import reduce

def fct(n,l,h):
    nX = n * l ; nY = n * h
    t = list(map(lambda n: n,range(nX * nY)))
    return reduce(lambda a,b : a+b, map(lambda a : 0 <= a <= h-1,t))
Here I assumed the #Do something part is a sum: the map turns each element into a boolean for the range check, and the reduce adds them up, effectively counting the elements within bounds.
def operation(item, lower, upper):
    if lower <= item <= upper-1:
        ...
        # Do something
        # return value
    # do other things, return other values

def iterate(numbers, lower, upper):
    return list( map( lambda item: operation(item, lower, upper), numbers) )
From my understanding of what you wrote, you will be iterating through items in a list (presumably of integers) and checking whether each item fits into given lower and upper bounds; depending on whether it fits, you'll perform one action or another. Break it down into separate functions: one that performs the comparison and then acts on it, and another that iterates through the list, applying that function to each item. You can use map to iterate through each number in the list, applying the operation function to each element.
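Putting map and reduce together, a sketch that counts the in-range elements without explicit loops (the counting behaviour is my choice of "do something"):

```python
from functools import reduce

def count_in_range(numbers, lower, upper):
    # map flags each element with 1 or 0; reduce adds the flags up
    flags = map(lambda n: 1 if lower <= n <= upper - 1 else 0, numbers)
    return reduce(lambda a, b: a + b, flags, 0)

print(count_in_range([0, 3, 7, 12], 0, 8))  # 3: the elements 0, 3 and 7
```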
I have a task where I have a list of certain values: l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]. I have a formula for computing a kind of probability on this list as the following (the probability is high in case there is many different values in the list and low if there are few kind of values):
$p = - \sum_{i=1}^m f_i \log_m f_i$
where m is the length of the list, $f_i$ is the frequency of the ith element of the list.
I want to code this in Python with the following:
from math import log
from collections import Counter
-sum([loc*log(loc, len(set(l))) for loc in Counter(l).values()])
But I somehow suspect that this is not the right way. Any better idea?
Additionally: I do not understand the negative sign in the formula, what is the explanation of this?
Here an alternative way to calculate the Entropy of the list using numpy:
import numpy as np
arr = np.array(l)
elem, c = np.unique(arr, return_counts=True)
# occurrences to probabilities
pc = c / c.sum()
# calculate the entropy (and account for log_m)
entropy = -np.sum(pc * np.log(pc)) * (1/np.log(len(c)))
Although the numpy array is a better solution, in case you don't want to use numpy:
You would do better to save the Counter and use len(Counter) instead of len(set(l)), so that it isn't recalculated in every iteration. len(Counter) equals len(set(l)) (I assume you use CPython 3.x).
If you don't get the desired result, then probably your formula is wrong
In your code you use len(set(l)) rather than len(l), and you iterate over the frequencies, not over the list, which is not what your formula describes.
You don't need to wrap the expression inside sum within a list since you only need to iterate over it once (Generator expressions vs. list comprehensions)
EDIT: As to why you get a negative result, this is expected.
You sum terms f[i] * log(f[i]) >= 0, where:
f[i] >= 1: the frequency of the ith element of the list
log(f[i]) >= 0, because f[i] >= 1 (the base of the log doesn't matter).
And then you take the negative of that sum, so the result will always be less than or equal to 0.
from math import log
from collections import Counter
l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]
f = Counter(l)
# This is from your code
p1 = -sum(f[e] * log(f[e], len(f)) for e in f)
# This is from your formula
p2 = -sum(f[e] * log(f[e], len(l)) for e in l)
print(p1, p2)
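As a side note (an assumption on my part, since the question's formula uses raw counts): if $f_i$ was meant to be the relative frequency count/m, the normalized Shannon entropy looks like this and comes out non-negative, between 0 and 1:

```python
from math import log
from collections import Counter

l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]
counts = Counter(l)
m = len(l)
# probabilities sum to 1; using the number of distinct values as the log base
# normalizes the entropy into [0, 1]
p = -sum((c / m) * log(c / m, len(counts)) for c in counts.values())
print(p)  # roughly 0.87
```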
I have some strings,
['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
These strings partially overlap each other. If you manually overlapped them you would get:
SGALWDVPSPV
I want a way to go from the list of overlapping strings to the final compressed string in python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using biopython and sequence aligners than I would like. I have some simple short strings and just want to properly merge them in a simple way.
Does anyone have any advice on a nice way to do this in python? Thanks!
Here is a quick sorting solution:
s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x:s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])]+b
Output:
'SGALWDVPSPV'
This program sorts s by the value of the index of the first character of each element, in an attempt to find the string that will maximize the overlap distance between the first element and the desired output.
My proposed solution with a more challenging test list:
#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']
for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0,len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]
print(strFrag)
print(strFrag[0])
Basically the code compares every string/fragment to the first in list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. Code assumes that there are no unfillable gaps between strings/fragments (Otherwise answer may not result in longest possible assembly. Can be solved by randomizing the starting string/fragment). Also assumes that the reverse complement is not present (poor assumption with contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirements (changing bestMatch[0] value) to prevent false matches. Last assumption is that all matches are exact. To enable flexibility in permitting mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.
To determine the overlap of two strings a and b, you can check if any prefix of b is a suffix of a. You can then use that check in a simple loop, aggregating the result and slicing the next string in the list according to the overlap.
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
def overlap(a, b):
    return max(i for i in range(len(b)+1) if a.endswith(b[:i]))

res = lst[0]
for s in lst[1:]:
    o = overlap(res, s)
    res += s[o:]
print(res) # SGALWDVPSPV
Or using reduce:
from functools import reduce # Python 3
print(reduce(lambda a, b: a + b[overlap(a,b):], lst))
This is probably not super-efficient, with complexity of about O(n k), with n being the number of strings in the list and k the average length per string. You can make it a bit more efficient by only testing whether the last char of the presumed overlap of b is the last character of a, thus reducing the amount of string slicing and function calls in the generator expression:
def overlap(a, b):
    # default=0 guards against max() raising on an empty generator
    # when not even the last characters match
    return max((i for i in range(len(b)) if b[i-1] == a[-1] and a.endswith(b[:i])), default=0)
Here's my solution which borders on brute force from the OP's perspective. It's not bothered by order (threw in a random shuffle to confirm that) and there can be non-matching elements in the list, as well as other independent matches. Assumes overlap means not a proper subset but independent strings with elements in common at the start and end:
from collections import defaultdict
from random import choice, shuffle
def overlap(a, b):
    """ get the maximum overlap of a & b plus where the overlap starts """
    overlaps = []
    for i in range(len(b)):
        for j in range(len(a)):
            if a.endswith(b[:i + 1], j):
                overlaps.append((i, j))
    return max(overlaps) if overlaps else (0, -1)
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV', 'NONSEQUITUR']
shuffle(lst) # to verify order doesn't matter
overlaps = defaultdict(list)
while len(lst) > 1:
    overlaps.clear()
    for a in lst:
        for b in lst:
            if a == b:
                continue
            amount, start = overlap(a, b)
            overlaps[amount].append((start, a, b))
    maximum = max(overlaps)
    if maximum == 0:
        break
    start, a, b = choice(overlaps[maximum])  # pick one among equals
    lst.remove(a)
    lst.remove(b)
    lst.append(a[:start] + b)
print(*lst)
OUTPUT
% python3 test.py
NONSEQUITUR SGALWDVPSPV
%
Computes all the overlaps and combines the largest overlap into a single element, replacing the original two, and starts process over again until we're down to a single element or no overlaps.
The overlap() function is horribly inefficient and likely can be improved but that doesn't matter if this isn't the type of matching the OP desires.
Once the peptides grow to 20 amino acids, cdlane's code chokes and spams multiple incorrect answers at various amino acid lengths.
Try adding the AA sequence 'VPSGALWDVPS' (with or without 'D') and the code starts to fail its task, because the N- and C-termini grow and no longer reflect what Adam Price is asking for. The output is 'SGALWDVPSGALWDVPSPV', and thus 100% incorrect despite the effort.
Tbh imo there is only one 100% answer, and that is to use BLAST and its protein search page, or BLAST in the BioPython package. Or adapt cdlane's code to account for AA gaps, substitutions and AA additions.
Dredging up an old thread, but had to solve this myself today.
For this specific case, where the fragments are already in order and each overlaps the next by the same amount (in this case 1), the following fairly simple concatenation works, though it might not be the world's most robust solution:
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
reference = "SGALWDVPSPV"
string = "".join([i[0] for i in lst] + [lst[-1][1:]])
reference == string
True
Using a single random number and a list, how would you return a random slice of that list?
For example, given the list [0,1,2] there are seven possibilities of random contiguous slices:
[ ]
[ 0 ]
[ 0, 1 ]
[ 0, 1, 2 ]
[ 1 ]
[ 1, 2]
[ 2 ]
Rather than getting a random starting index and a random end index, there must be a way to generate a single random number and use that one value to figure out both starting index and end/length.
I need it that way, to ensure these 7 possibilities have equal probability.
Simply fix one order in which you would sort all possible slices, then work out a way to turn an index in that list of all slices back into the slice endpoints. For example, the order you used could be described by
The empty slice is before all other slices
Non-empty slices are ordered by their starting point
Slices with the same starting point are ordered by their endpoint
So the index 0 should return the empty list. Indices 1 through n should return [0:1] through [0:n]. Indices n+1 through n+(n-1)=2n-1 would be [1:2] through [1:n]; 2n through n+(n-1)+(n-2)=3n-3 would be [2:3] through [2:n] and so on. You see a pattern here: the last index for a given starting point is of the form n+(n-1)+(n-2)+(n-3)+…+(n-k), where k is the starting index of the sequence. That's an arithmetic series, so that sum is (k+1)(2n-k)/2=(2n+(2n-1)k-k²)/2. If you set that term equal to a given index, and solve that for k, you get some formula involving square roots. You could then use the ceiling function to turn that into an integral value for k corresponding to the last index for that starting point. And once you know k, computing the end point is rather easy.
But the quadratic equation in the solution above makes things really ugly. So you might be better off using some other order. Right now I can't think of a way which would avoid such a quadratic term. The order Douglas used in his answer doesn't avoid square roots, but at least his square root is a bit simpler due to the fact that he sorts by end point first. The order in your question and my answer is called lexicographical order, his would be called reverse lexicographical and is often easier to handle since it doesn't depend on n. But since most people think about normal (forward) lexicographical order first, this answer might be more intuitive to many and might even be the required way for some applications.
Here is a bit of Python code which lists all sequence elements in order, and does the conversion from index i to endpoints [k:m] the way I described above:
from math import ceil, sqrt
n = 3
print("{:3} []".format(0))
for i in range(1, n*(n+1)//2 + 1):
    b = 1 - 2*n
    c = 2*(i - n) - 1
    # solve k^2 + b*k + c = 0
    k = int(ceil((- b - sqrt(b*b - 4*c))/2.))
    m = k + i - k*(2*n-k+1)//2
    print("{:3} [{}:{}]".format(i, k, m))
The - 1 term in c doesn't come from the mathematical formula I presented above. It's more like subtracting 0.5 from each value of i. This ensures that even if the result of sqrt is slightly too large, you won't end up with a k which is too large. So that term accounts for numeric imprecision and should make the whole thing pretty robust.
The term k*(2*n-k+1)//2 is the last index belonging to starting point k-1, so i minus that term is the length of the subsequence under consideration.
You can simplify things further. You can perform some computation outside the loop, which might be important if you have to choose random sequences repeatedly. You can divide b by a factor of 2 and then get rid of that factor in a number of other places. The result could look like this:
from math import ceil, sqrt
n = 3
b = n - 0.5
bbc = b*b + 2*n + 1
print("{:3} []".format(0))
for i in range(1, n*(n+1)//2 + 1):
    k = int(ceil(b - sqrt(bbc - 2*i)))
    m = k + i - k*(2*n-k+1)//2
    print("{:3} [{}:{}]".format(i, k, m))
It is a little strange to give the empty list equal weight with the others. It is more natural for the empty list to be given weight 0 or n+1 times the others, if there are n elements on the list. But if you want it to have equal weight, you can do that.
There are n*(n+1)/2 nonempty contiguous sublists. You can specify these by the end point, from 0 to n-1, and the starting point, from 0 to the endpoint.
Generate a random integer x from 0 to n*(n+1)/2.
If x=0, return the empty list. Otherwise, x is uniformly distributed from 1 through n(n+1)/2.
Compute e = floor(sqrt(2*x)-1/2). This takes the values 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, etc.
Compute s = (x-1) - e*(e+1)/2. This takes the values 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, ...
Return the interval starting at index s and ending at index e.
(s,e) takes the values (0,0),(0,1),(1,1),(0,2),(1,2),(2,2),...
import random
import math
n=10
x = random.randint(0, n*(n+1)//2)
if (x==0):
    print(list(range(n))[0:0])  # empty set
    exit()
e = int(math.floor(math.sqrt(2*x)-0.5))
s = int(x-1 - (e*(e+1)/2))
print(list(range(n))[s:e+1])  # starting at s, ending at e, inclusive
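Enumerating every x for a small n is a quick sanity check that each of the seven outcomes appears exactly once (the check itself is mine, not part of the answer):

```python
from math import floor, sqrt

n = 3
seen = []
for x in range(0, n * (n + 1) // 2 + 1):
    if x == 0:
        seen.append('empty')
        continue
    # same formulas as above: e is the end point, s the start point
    e = int(floor(sqrt(2 * x) - 0.5))
    s = int(x - 1 - e * (e + 1) // 2)
    seen.append((s, e))
print(seen)  # ['empty', (0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2)]
```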
First create all possible slice indexes.
[0:0], [1:1], etc are equivalent, so we include only one of those.
Finally you pick a random index couple, and apply it.
import random
l = [0, 1, 2]
combination_couples = [(0, 0)]
length = len(l)
# Creates all index couples.
for j in range(1, length+1):
    for i in range(j):
        combination_couples.append((i, j))
print(combination_couples)
rand_tuple = random.sample(combination_couples, 1)[0]
final_slice = l[rand_tuple[0]:rand_tuple[1]]
print(final_slice)
To ensure we got them all:
for i in combination_couples:
    print(l[i[0]:i[1]])
Alternatively, with some math...
For a length-3 list the possible index numbers are 0 through 3, that is n=4. You choose 2 of them, that is k=2. The first index has to be smaller than the second, therefore we need to calculate the combinations as described here.
from math import factorial as f
def total_combinations(n, k=2):
    result = 1
    for i in range(1, k+1):
        result *= n - k + i
    result //= f(k)  # integer division keeps the result an int
    # We add plus 1 since we included [0:0] as well.
    return result + 1
print(total_combinations(n=4)) # Prints 7 as expected.
there must be a way to generate a single random number and use that one value to figure out both starting index and end/length.
It is difficult to say which method is best, but if you're only interested in binding a single random number to your contiguous slice, you can use modulo.
Given a list l and a single random number r, you can get your contiguous slice like that:
l[r % len(l) : some_sparkling_transformation(r) % len(l)]
where some_sparkling_transformation(r) is essential. It depends on your needs, but since I don't see any special requirements in your question it could be, for example:
l[r % len(l) : (2 * r) % len(l)]
The most important thing here is that both the left and right edges of the slice are correlated to r. This makes it a problem to define contiguous slices that won't follow any observable pattern. The example above (with 2 * r) produces slices that are always empty lists or follow a pattern of [a : 2 * a].
Let's use some intuition. We know that we want to find a good random representation of the number r in the form of a contiguous slice. It turns out that we need to find two numbers, a and b, that are respectively the left and right edges of the slice. Assuming that r is a good random number (we like it in some way), we can say that a = r % len(l) is a good approach.
Let's now try to find b. The best way to generate another nice random number will be to use random number generator (random or numpy) which supports seeding (both of them). Example with random module:
import random
def contiguous_slice(l, r):
    random.seed(r)
    a = int(random.uniform(0, len(l)+1))
    b = int(random.uniform(0, len(l)+1))
    a, b = sorted([a, b])
    return l[a:b]
Good luck and have fun!