Indexing across multiple intervals - python

I am trying to extract the n-th element from a set of multiple intervals. I am currently dealing with genome sequences. Assume we have a gene with a gap in the middle. The position of this gene within the whole DNA is:
gene = [100,110], [130,140]
# representing the lists [100,101,...,109] and [130, 131,...,139]
# the gene spans over these entries of the DNA, so it looks like -gene-gap-gene-
Now, for a position within the gene (e.g. 10th position), I want to find the corresponding position on the whole DNA (which would be 109 in this example).
The function should do the following:
function(gene, 9)
> 109
function(gene, 10)
> 130
My approach is to explicitly generate the two sequences, concatenate them and take the n-th element of this list. However, for large lists (as they happen to occur), this is very inefficient.
Can anyone think of a simple way?
Thanks in advance!

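A generic solution that works for any number of gaps in the gene (note that n is counted from 1 here):
gene = [[100,110], [130,140]]
def function(gene, n):
    # n is 1-based: n = 1 is the first position of the gene
    for span in gene:
        span_len = span[1] - span[0]
        if n <= span_len:
            # n falls inside this span
            return n + span[0] - 1
        else:
            # skip this span and continue with the remaining offset
            n -= span_len
print(function(gene,10))  # 109
print(function(gene,11))  # 130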

Your function can be given both lists, and the sizes of the lists tell you which list to index and where.
Take function(gene, 10) and function(gene, 11) as an example:
10 <= len(list1) but 11 > len(list1), so for 11 you know you need to access the second list. The right element is at index 11 - len(list1) - 1, which is index 0 of the second list.
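The idea in both answers can be sketched without building the lists, using only the span lengths. This version assumes 0-based offsets (as in the question's function(gene, 9) example) and half-open spans [start, end); the name gene_to_dna is made up for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

def gene_to_dna(gene, n):
    """Map a 0-based offset n within the gene to a position on the DNA.

    Assumes each span is half-open: [start, end).
    """
    # Cumulative span lengths, e.g. [10, 20] for [[100,110], [130,140]]
    ends = list(accumulate(end - start for start, end in gene))
    if not 0 <= n < ends[-1]:
        raise IndexError("offset lies outside the gene")
    i = bisect_right(ends, n)               # index of the span containing n
    offset_before = ends[i - 1] if i else 0  # gene positions in earlier spans
    return gene[i][0] + n - offset_before

gene = [[100, 110], [130, 140]]
print(gene_to_dna(gene, 9))   # 109
print(gene_to_dna(gene, 10))  # 130
```

If many lookups are made against the same gene, the cumulative lengths could be computed once and reused, so each lookup is a single binary search.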

Related

How do I fix code that calculates the amount of combinations in the partitions of a set?

I am working on a code in Python 2 that partitions a set of 13 elements using integer partitions, then evaluating the different combinations they can have (order does not matter). I have seen the ways people do this by using recursive functions to calculate every partition in a set retroactively, but for what I'm working on I'm taking a different approach.
I'm working with the logic that the different ways a set can be partitioned is determined by the integer partitions of a set. For a set of 4 elements, it can be partitioned in these ways:
[1,1,1,1]
[1,1,2]
[2,2]
[1,3]
[4]
Every number stands for the length of a subset in the partition. Using this info, I can then calculate all of the combinations that can be used with these different integer partitions. If I add the number of combinations from each partition together, I should receive the Bell number (the number of possible partitions in a set). For a list of 4 elements, the Bell number should be 15.
My code runs through the subset lengths in each partition, sets the length of the set to n and the subset length to r, then calculates the combinations in the specific subset. When it goes to the next subset, it subtracts the previous r from n to account for it lessening the amount of combinations available, as n gets smaller when a subset is already defined.
My code, however, is lackluster. When inputting 4 as the length of the set, it outputs 16 (instead of 15). When inputting 5, it outputs 48 (instead of 52). When inputting 13, it outputs 102,513 (instead of 27,644,437). I need it to be exact rather than an estimate.
This is in part because of if elem != 1: not properly accounting for a list of all ones or a list of one subset. It's also in part because it doesn't account for repeats of a combination when appearing in a subset. In [2,2] for a list of 4 elements, it considers the subset to contain 6 combinations when in reality it contains 3.
I'm stuck on how to solve this issue, as I only know enough Python to get by. The way the code currently outputs is how I prefer it to output, obviously without the errors.
The recursive function that calculates the integer partitions is from Nicolas Blanc, and the rest was coded by myself. Important links: Bell number, Partition of a set
import math
in_par = []
stack = []
bell = 0
def partitions(remainder, start_number = 1):
    if remainder == 0:
        in_par.append(list(stack))
        #print stack
    else:
        for nb_to_add in range(start_number, remainder+1):
            stack.append(nb_to_add)
            partitions(remainder - nb_to_add, nb_to_add)
            stack.pop()
x = partitions(13) # <------- input element count here
for part in in_par:
    part.reverse()
    combinations = 0
    n = 13 # <------- input element count here
    for i,elem in enumerate(part):
        r = elem
        combo = 0
        if elem != 1:
            if i != (len(part) - 1):
                combo = math.factorial(n) / (math.factorial(r) * math.factorial(n-r))
        n = n - elem
        combinations = combinations + combo
    bell = bell + combinations
    part.append([combinations])
    print part
#print str(bell)
print "Bell Number: " + str(bell)
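One way to avoid the double counting described above is to count, for each integer partition, n! divided by the factorial of every subset length and by the factorial of the number of times each length repeats. The sketch below is written in Python 3 rather than the Python 2 of the question, and the function names are made up for illustration.

```python
from collections import Counter
from math import factorial

def integer_partitions(n, smallest=1):
    """Yield the integer partitions of n as non-decreasing lists."""
    if n == 0:
        yield []
        return
    for first in range(smallest, n + 1):
        for rest in integer_partitions(n - first, first):
            yield [first] + rest

def set_partitions_with_shape(shape):
    """Count the set partitions whose subset lengths are given by shape."""
    count = factorial(sum(shape))
    for size in shape:
        count //= factorial(size)          # order inside a subset is irrelevant
    for repeats in Counter(shape).values():
        count //= factorial(repeats)       # equal-length subsets are interchangeable
    return count

def bell(n):
    """Sum the counts over all integer partitions of n."""
    return sum(set_partitions_with_shape(p) for p in integer_partitions(n))

print(set_partitions_with_shape([2, 2]))  # 3
print(bell(4))                            # 15
```

For the shape [2,2] this gives 4! / (2! * 2!) / 2! = 3, which matches the count the question expects.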

Select a random subset of indices, with minimum consecutive count

I would like to select a random subset of indices from a numpy array with the caveat that I need each randomly selected index to be part of a consecutive "cluster" of at least three indices in a row.
For example, if I have an array that contains 25 items
a = np.arange(0,25)
I want to make sure that no index is selected without including at least two neighboring indices. So, for example, if I was looking for a subset of length 12, the following two options both fulfill this.
# this has 3 consecutive, followed by 5 consecutive, followed by 4 consecutive
rand_subset_1 = [0,1,2,9,10,11,12,13,18,19,20,21]
# this has 6 consecutive, followed by 3 consecutive, followed by 3 consecutive
rand_subset_2 = [3,4,5,6,7,8,14,15,16,22,23,24]
Attempted Answer
I tried to figure this out initially by dividing a into lists of three.
a_mod = np.array([0,1,2],[3,4,5],[6,7,8],...[21,22,23])
and then using np.random.choice(a_mod, subset_length/3, replace=False)
However this doesn't solve my problem, for two reasons.
I want to be able to input arrays with lengths that don't have to be divisible by three.
I don't mind if the subset indices are in cluster sizes that also aren't divisible by three. I just need the cluster to have at least three consecutive indices.
Clarification Edit:
Is there a method that guarantees every number in the subset of indices is part of a "cluster" of consecutive numbers? Ideally this wouldn't limit the cluster size to a multiple of a particular integer (which is where I got stuck in my attempted solution above), but would allow clusters of random lengths with a specified minimum cluster size.
Thanks in advance for any help with this problem!
Use the following function.
It selects an index at random and adds its two neighbouring indices, giving one run of three.
After that, it draws the remaining indices at random from the indices not already selected.
import numpy as np

def select_consequtive_index(a, m, n = 3):
    # a: array of indices
    # m: number of indices to select
    # n: minimum consecutive count (the seed run below is fixed at three)
    output = []
    x = np.random.choice(a)
    if x == a[0]:
        output += [x, x+1, x+2]
    elif x == a[-1]:
        output += [x-2, x-1, x]
    else:
        output += [x-1, x, x+1]
    # draw the remaining m - n indices from those not selected yet
    output += np.random.choice(list(set(a) - set(output)), m - n, replace = False).tolist()
    output = np.array(output)
    output.sort()
    return output
Code sample:
a = np.arange(0, 25)
print(select_consequtive_index(a, m = 12, n = 3))
The result is as follows.
[ 3 4 7 8 9 10 11 12 17 21 22 24]
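The sample result above still contains isolated indices such as 17 and 24, because only the first three indices are chosen as a run. A possible alternative, sketched here under the assumption that any valid subset is acceptable (it does not sample uniformly over all valid subsets), is to pick the run sizes and the gaps between runs at random. The function name random_clustered_indices and its parameters are made up for illustration, and it uses the standard library random module rather than numpy.

```python
import random

def random_clustered_indices(length, m, min_run=3, rng=random):
    """Return m sorted indices in range(length), all in runs of >= min_run."""
    if not min_run <= m <= length:
        raise ValueError("need min_run <= m <= length")
    k = rng.randint(1, m // min_run)       # number of runs
    sizes = [min_run] * k
    for _ in range(m - k * min_run):       # hand out the extra indices
        sizes[rng.randrange(k)] += 1
    gaps = [0] * (k + 1)                   # unselected slots before/between/after runs
    for _ in range(length - m):
        gaps[rng.randrange(k + 1)] += 1
    indices, pos = [], 0
    for gap, size in zip(gaps, sizes):     # the trailing gap is not needed
        pos += gap
        indices.extend(range(pos, pos + size))
        pos += size
    return indices

print(random_clustered_indices(25, 12))
```

A gap of zero simply merges two neighbouring runs into a longer one, which still satisfies the minimum. The returned list can be wrapped in np.array to index a numpy array.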

Finding locations of repeated 0 in a binary list

I've got a binary list returned from a k means classification with k = 2, and I am trying to 1) identify the number of 0,0,0,... substrings of a given length - say a minimum of length 3, and 2) identify the start and end locations of these sublists, so in a list: L = [1,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0], the outputs would ideally be: number = 2 and start_end_locations = [[2,6],[13,15]].
The lists I'm working with are tens of thousands of elements long, so I need to find a computationally fast way of performing this operation. I've seen many posts using groupby from itertools, but I can't find a way to apply them to my task.
Thanks in advance for your suggestions!
Craft a regular expression that matches your pattern: three or more zeros.
Concatenate the list items into a string.
Using re.finditer and the match objects' start() and end() methods, construct a list of indices.
Concatenating the list into a string could be the most expensive part; you won't know until you try. finditer should be pretty quick. This requires more than one pass through the data but is probably low effort to code.
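The regular-expression steps above can be sketched as follows. It assumes every item in the list is exactly 0 or 1, so each item maps to one character; the function name zero_runs_regex is made up for illustration.

```python
import re

def zero_runs_regex(bits, min_len=3):
    """Return [start, end] (inclusive) for each run of >= min_len zeros."""
    text = "".join(map(str, bits))              # e.g. "1100000111001000"
    pattern = re.compile("0{%d,}" % min_len)    # min_len or more zeros
    # match.end() is exclusive, so subtract 1 for an inclusive end index
    return [[m.start(), m.end() - 1] for m in pattern.finditer(text)]

L = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0]
runs = zero_runs_regex(L)
print(len(runs), runs)  # 2 [[2, 6], [13, 15]]
```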
This will probably be better - a single pass through the list but you need to pay attention to the logic - more effort to code.
iterate over the list using enumerate
    when you find a zero
        capture its index and
        set a flag indicating you are tracking zeros
    when you find a one
        if you are tracking zeros
            capture the index
            if the length of consecutive zeros meets your criteria capture the start and end indices for that run of zeros
            reset flags and intermediate variables as necessary
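A bit different from the word version:
def g(a):
    y = []
    criteria = 3          # minimum run length
    start,end = 0,0
    prev = 1              # treat the position before the list as a one
    for i,n in enumerate(a):
        if not n: # n is zero
            end = i
            if prev: # previous item was a one, so a new run starts here
                start = i
        else:
            if not prev and end - start + 1 >= criteria:
                y.append((start,end))
        prev = n
    # a run that reaches the end of the list is never closed by a one
    if not prev and end - start + 1 >= criteria:
        y.append((start,end))
    return y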
You can use zip() to detect indexes of the 1,0 and 0,1 breaks in sequence. Then use zip() on the break indexes to form ranges and extract the ones that start with a zero and span at least 3 positions.
def getZeroStreaks(L,minSize=3):
    breaks = [i for i,(a,b) in enumerate(zip(L,L[1:]),1) if a!=b]
    return [[s,e-1] for s,e in zip([0]+breaks,breaks+[len(L)])
            if e-s>=minSize and not L[s]]
output:
L = [1,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0]
print(getZeroStreaks(L))
[[2, 6], [13, 15]]
from timeit import timeit
t = timeit(lambda:getZeroStreaks(L*1000),number=100)/100
print(t) # 0.0018 sec for 16,000 elements
The function can be generalized to find streaks of any value in a list:
def getStreaks(L,N=0,minSize=3):
    breaks = [i for i,(a,b) in enumerate(zip(L,L[1:]),1) if (a==N)!=(b==N)]
    return [[s,e-1] for s,e in zip([0]+breaks,breaks+[len(L)])
            if e-s>=minSize and L[s]==N]
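Since the question mentions groupby from itertools, here is a minimal sketch of how it could be applied to the same task. The function name zero_runs_groupby is made up for illustration, and no claim is made about how its speed compares with the versions above.

```python
from itertools import groupby

def zero_runs_groupby(bits, min_len=3):
    """Return [start, end] (inclusive) for each run of >= min_len zeros."""
    runs, pos = [], 0
    for value, group in groupby(bits):      # consecutive equal items
        length = sum(1 for _ in group)
        if value == 0 and length >= min_len:
            runs.append([pos, pos + length - 1])
        pos += length                       # start index of the next group
    return runs

L = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0]
print(zero_runs_groupby(L))  # [[2, 6], [13, 15]]
```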

How to create a list and initialize only some of its values in Python (in comparison to C++)

I need to create a list of size N and then initialize only the (N-1)th and (N-2)th elements. So if the size of the list is 5, only the elements at index 3 and 4 should be set.
I know how to do it in C++, but is there any way to implement it in Python?
For example, in C++:
int *n = new int[5];
n[3] = 20;
n[4] = 10;
// printing the array shows garbage values at index 0, 1 and 2, followed by 20 10, the values we initialized
How can I do this in Python, or anything similar to it?
In Python, a list must be initialized with values.
The closest thing you can do:
N = 5
lst = [0] * (N-2) + [20, 10]
This:
1. Fills the first N-2 elements of a list with the default value 0
2. Sets the values for the last two elements
3. Concatenates the zeros and the last-two-elements sub-lists of stages 1 & 2
In Python:
array = []
length = 5
for i in range(length):
    array.append(0)
array[3] = 20
array[4] = 10
Edit: As pointed out by kabanus, a more efficient way to do this would be:
array = [0] * length
instead of the for loop.
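If the untouched slots should look visibly "empty" rather than hold 0, a placeholder such as None is a common choice. A minimal sketch, where the helper name make_sparse_list and its parameters are made up for illustration:

```python
def make_sparse_list(size, values_by_index, placeholder=None):
    """Build a list of the given size, setting only the listed indices."""
    lst = [placeholder] * size
    for index, value in values_by_index.items():
        lst[index] = value
    return lst

lst = make_sparse_list(5, {3: 20, 4: 10})
print(lst)  # [None, None, None, 20, 10]
```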

Sorting Technique Python

I'm trying to create a sorting technique that sorts a list of numbers. It compares two numbers: the first is the first number in the list, and the other is the number at an index taken from the sequence 2^k - 1.
2^k - 1 = [1,3,7, 15, 31, 63...]
For example, if I had a list [1, 4, 3, 6, 2, 10, 8, 19]
The length of this list is 8. So the program should find the largest number in the 2^k - 1 list that is less than 8; in this case that is 7.
So now it will compare the first number in the list (1) with the number 7 positions after it in the same list (19). If the first number is greater than the second, it will swap their positions.
After this step, it will continue on to 4 and the 7th number after that, but that doesn't exist, so now it should compare with the 3rd number after 4, because 3 is the next number in 2^k - 1.
So it should compare 4 with 2 and swap if they are not in the right place. This should go on until I reach 1 in 2^k - 1, at which point the list will finally be sorted.
I need help getting started on this code.
So far, I've written a small piece of code that makes the 2^k - 1 list, but that's as far as I've gotten.
a = []
for i in range(10):
    a.append(2**(i+1) -1)
print(a)
EXAMPLE:
Consider sorting the sequence V = 17,4,8,2,11,5,14,9,18,12,7,1. The skipping sequence 1, 3, 7, 15, … yields r=7 as the biggest value which fits, so looking at V, the first sparse subsequence = 17,9, so as we pass along V we produce 9,4,8,2,11,5,14,17,18,12,7,1 after the first swap, and 9,4,8,2,1,5,14,17,18,12,7,11 after using r=7 completely. Using a=3 (the next smaller term in the skipping sequence), the first sparse subsequence = 9,2,14,12, which when applied to V gives 2,4,8,9,1,5,12,17,18,14,7,11, and the remaining a = 3 sorts give 2,1,8,9,4,5,12,7,18,14,17,11, and then 2,1,5,9,4,8,12,7,11,14,17,18. Finally, with a = 1, we get 1,2,4,5,7,8,9,11,12,14,17,18.
You might wonder, given that at the end we do a sort with no skips, why this might be any faster than simply doing that final step as the only step at the beginning. Think of it as a comb going through the sequence -- notice that in the earlier steps we’re using coarse combs to get distant things in the right order, using progressively finer combs until at the end our fine-tuning is dealing with a nearly-sorted sequence needing little adjustment.
p = 0
x = len(V)  # length of V, used to pick the gap from a
for j in a:  # for every element in a (1, 3, 7, ...)
    if x >= j:  # if the length is greater than or equal to the current value
        p = j  # remember j as p
That finds the distance at which the first number in the list should be compared. Now I need to write something that keeps doing this until the distance is out of range, then switches to 3 and then to 1, checking the smaller distances until the list is sorted.
The sorting algorithm you're describing is called comb sort. In fact, the simpler bubble sort is a special case of comb sort where the gap is always 1 and never changes.
Since you're stuck on how to start this, here's what I recommend:
Implement the bubble sort algorithm first. The logic is simpler, which makes it much easier to reason about as you write it.
Once you've done that, you have the important algorithmic structure in place, and from there it's just a matter of adding the gap length calculation into the mix. This means computing the gap length with your particular formula. You then modify the loop control index and the inner comparison index to use the calculated gap length.
After each iteration of the loop you decrease the gap length (in effect making the comb finer) by some scaling amount.
The last step would be to experiment with different gap lengths and formulas to see how they affect the algorithm's efficiency.
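The steps above can be sketched as follows: a bubble-sort pass generalised to compare items a fixed gap apart, repeated for each gap from the 2^k - 1 sequence, largest first. This is one possible reading of the question, not the only way to implement it; the function name gap_sort is made up for illustration, and the intermediate lists it produces are not guaranteed to match the worked example in the question step by step.

```python
def gap_sort(values):
    """Return a sorted copy of values using gaps 2**k - 1, largest first."""
    data = list(values)
    # Build the gaps 1, 3, 7, ... that are smaller than the list length
    gaps = []
    k = 1
    while 2 ** k - 1 < len(data):
        gaps.append(2 ** k - 1)
        k += 1
    for gap in reversed(gaps):
        swapped = True
        while swapped:                      # repeat passes until nothing moves
            swapped = False
            for i in range(len(data) - gap):
                if data[i] > data[i + gap]:
                    data[i], data[i + gap] = data[i + gap], data[i]
                    swapped = True
    return data

V = [17, 4, 8, 2, 11, 5, 14, 9, 18, 12, 7, 1]
print(gap_sort(V))  # [1, 2, 4, 5, 7, 8, 9, 11, 12, 14, 17, 18]
```

With the gap fixed at 1 the inner loops are exactly bubble sort, which is why the final pass always leaves the list sorted regardless of what the larger gaps did.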
