Count interval length with overlap in a large dataset - Python

I have a list of tuples representing genomic intervals, where each tuple contains the start and end coordinates of an interval. I want to compute the total length covered by all the intervals, counting overlapping regions only once. The intervals may not be sorted, and an interval may overlap several others, which makes it hard to decide which overlaps have already been counted. For example:
# data sorted by start position
[(3, 9), (3, 5), (6, 9)]
The last interval overlaps the first one but not the second, so the correct total length is 6 (not 9 or 10). How can I solve this efficiently in Python?
Note that the dataset is large and may contain a lot of intervals.

Order by start, keep track of how far you've reached:
intervals = [(3, 9), (3, 5), (6, 9)]
total = 0
reached = float('-inf')
for start, end in sorted(intervals):
    if end > reached:
        start = max(start, reached)
        total += end - start
        reached = end
print(total)
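The same idea can be wrapped in a reusable function, which makes it easy to check against a few inputs. This is only a sketch: the name union_length is made up for illustration, and it assumes each interval is a (start, end) pair with start <= end whose length is end - start.

```python
def union_length(intervals):
    """Total length covered by the intervals, counting overlaps once."""
    total = 0
    reached = float("-inf")  # rightmost coordinate covered so far
    for start, end in sorted(intervals):
        if end > reached:
            # Only the part to the right of `reached` is new coverage.
            total += end - max(start, reached)
            reached = end
    return total


print(union_length([(3, 9), (3, 5), (6, 9)]))
```

Sorting dominates the cost, so the whole computation is O(n log n) in the number of intervals; the loop itself is a single pass.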

Related

How to get a given number by performing operations on a given list - Python

I have been trying to work out how to solve this. If you have any ideas, can you help out?
The problem is:
A list such as [3,4,9] and a number such as 87 are given. The only operation allowed is multiplication, and the number of steps doesn't matter.
After some rough work, we can say that we need 3 operations to get the answer:
3 x 3 = 9
4 x 15 = 60
9 x 2 = 18
Adding the results of these operations gives 87.
Another example:
If the list is [2,4,8] and the given value is 16, we perform a single operation:
2 x 2 = 4
So 4+4+8=16, and the number of steps required is 1.
I have tried to write a program, but it can only perform a single operation on the whole list at a time:
a=int(input())
b=list(map(int,input().split()))
h=0
p=len(b)
for i in range(2,100):
    d=b[0:p]
    c=[i*x for x in d]
    if(sum(c)==a):
        print(i)
        break
    if(sum(c)>a):
        p-=1
print("Not possible to obtain the sum with single operation")
The above code can only apply a single multiplier to the whole list. I could write 3 nested for loops if the list had length 3, but the length of the list varies. What should I do in such cases?
I want the code to display the minimum number of operations required on the given list to obtain the given answer, and also display the required operations if possible.
Thank you

I want the code to display the minimum number of operations required to perform on the given list in order to obtain the given answer and also display the required operations if possible.
Assumptions
The target number is a positive integer
(Non-positive integers can be handled as well, by multiplying everything by minus one. But not having to deal with that helps thinking clearly about it)
The coefficients for the multiplications are non-negative integers
(Negative integers would mean we need to multiply them with a negative number to reach the same effect. We could do that, I just did not want to think about this case)
The numbers in the list are integers
(Otherwise the problem would either be unsolvable, or always solvable in one operation)
Using a list element just once also counts as a multiplication operation, 1*element
(That assumption could easily be removed)
Approach
Given target and a number_list, we want to find a way to compute target by multiplying list elements with coefficients and summing them up. For some reason, I found it simpler to think about this in the opposite direction (but both should work, I think). That is, I subtract coefficient * element from target and end up with the same problem again, but with a smaller target. This is the essence of a recursive problem.
It could technically happen that we subtract a big number from target and end up with something unsolvable, while subtracting a smaller number would result in something solvable. So we need to consider all possible coefficients, for all possible list elements.
Possible coefficients are those that are greater than zero and small enough that coefficient * element is not greater than target.
If you think of all the possible paths as a tree of decisions (where the edges are all the possible choices to make), we have two options for walking through that tree: either depth-first (follow one path all the way down, then the next, and so on) or breadth-first. I chose the latter because it allows us to stop a bit sooner: the first path to succeed will be pretty short.
In fact, I believe it will even be optimal in the number of operations, because any options considered later are branches of the tree that are at least as long. However, I wrote the code a bit more defensively in case this belief is not true. Instead of aborting as soon as I find one path that works, I continue to work through the queue but ignore all paths that are already at least as long as the current best one.
To know which operations were done, I track the history as a list of tuples containing each list element and the coefficient that corresponds to it.
Result
Run on your example with number_list = [3, 4, 9] and target = 87, I get:
Target 87 was reached using 1 multiplication operations.
history: [(3, 29)]
And we can check that 3*29 is indeed 87 and needed only a single operation. Note that this is better than your suggestion of three operations :D
Code
Written for Python 3.8
#!/usr/bin/env python3
# https://stackoverflow.com/questions/70721453/how-to-get-a-given-number-by-performing-operations-on-a-given-list-python?noredirect=1#comment125023761_70721453
from dataclasses import dataclass
def main():
    number_list = [3, 4, 9]
    target = 87
    result_state = solve(number_list, target)
    if result_state is None:
        print("That was not solvable!")
    else:
        print(f"Target {target} was reached using {result_state.n_ops_done} multiplication operations.")
        print(f"history: {result_state.history}")
@dataclass
class State:
    """
    Class to keep track of number of operations done in one
    BFS path.
    """
    n_ops_done: int = 0
    current_target: int = 0
    # optional but nice for seeing which operations were taken:
    # A list of the (element, coefficient) pairs used.
    history: list = None
def solve(number_list, target):
    """
    Finds the smallest number of multiplication operations required to
    compute `target` as a sum of products c*el where el can be any list element
    of `number_list` and c has to be a positive integer.
    returns `None` if this is not possible with the given numbers,
    otherwise returns a `State` object.
    """
    # create a queue for Breadth-First Search
    queue = [State(current_target=target, history=[])]
    best_state = None
    # while the queue is not empty, we work through items in it, one after
    # the other. In insertion order.
    # Extract from the start, add to the end.
    while queue:
        state = queue.pop(0)
        if state.current_target == 0:
            # found a path that works. Is it optimal?
            if (best_state is None) or (best_state.n_ops_done > state.n_ops_done):
                best_state = state
        if (best_state is not None) and (best_state.n_ops_done <= state.n_ops_done):
            # no need to proceed, it will not be optimal.
            continue
        for el in number_list:
            maximal_coefficient = state.current_target // el
            for coeff in range(maximal_coefficient, 0, -1):
                # create new state to add to queue.
                queue.append(State(
                    n_ops_done=state.n_ops_done + 1,
                    current_target=state.current_target - (coeff * el),
                    history=state.history.copy() + [(el, coeff)]
                ))
    return best_state
if __name__ == "__main__":
    main()
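As a possible variation on the search above (not the code from this answer), the queue can be a collections.deque so that removing from the front is O(1), and a set of already-queued remaining targets keeps the same subproblem from being expanded twice. Because breadth-first search reaches every remaining target first along a shortest path, the first state with remaining target 0 is returned directly. The function name min_operations is hypothetical, and the sketch assumes a non-negative target and strictly positive list elements.

```python
from collections import deque


def min_operations(number_list, target):
    """Return a shortest list of (element, coefficient) pairs whose
    products sum to `target`, or None if there is no such list."""
    queue = deque([(target, [])])
    seen = {target}  # remaining targets that were already queued
    while queue:
        remaining, history = queue.popleft()
        if remaining == 0:
            return history
        for el in number_list:
            # Try the largest coefficient first, as in the original code.
            for coeff in range(remaining // el, 0, -1):
                nxt = remaining - coeff * el
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, history + [(el, coeff)]))
    return None


print(min_operations([3, 4, 9], 87))
```

Since there are at most target + 1 distinct remaining values, the seen set bounds the number of queue entries, which matters for larger targets.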

This solution uses itertools.product(range(1, target + 1), repeat=len(lst)) to create a list of all possible combinations of factors (e.g. (1,1,1) (1,1,2)...). Be careful with longer lists and high target numbers, as the list of factors can get very long, quickly.
import itertools
def get_factors(lst, target):
    factors = itertools.product(range(1, target + 1), repeat=len(lst))
    results = []
    for fac in factors:
        sum_ = sum(i*j for i, j in zip(lst, fac))
        if sum_ == target:
            results.append(fac)
    return results
lst = [3, 4, 9]
results = get_factors(lst, 87)
print(results)
This finds all possible solutions. As we can see, there are some that only require 2 steps:
[(1, 3, 8), (1, 12, 4), (2, 9, 5), (2, 18, 1), (3, 6, 6), (3, 15, 2), (4, 3, 7), (4, 12, 3), (5, 9, 4), (6, 6, 5), (6, 15, 1), (7, 3, 6), (7, 12, 2), (8, 9, 3), (9, 6, 4), (10, 3, 5), (10, 12, 1), (11, 9, 2), (12, 6, 3), (13, 3, 4), (14, 9, 1), (15, 6, 2), (16, 3, 3), (18, 6, 1), (19, 3, 2), (22, 3, 1)]
To find the minimum amount of steps, we have to find the result with the most "1"s:
ones = max(fac.count(1) for fac in results)
print("Steps required:", len(lst) - ones)
# Steps required: 2
This leaves a lot of room for optimization: it computes far more combinations than necessary.

Scattered people with heights

There is a line of scattered people and we need to restore the order.
We know:
How tall each of them is
The number of taller people who were in front of them
This information is contained in a set of
Person {
    int height;
    int tallerAheadCount;
}
I’ve tried sorting it multiple ways, but no luck.
What I managed to figure out is that the shortest person’s tallerAheadCount should match the original index, but this doesn't work in a for loop with sorted heights.
Sorting by tallerAheadCount, and then by height gives us a relatively close answer, but the higher the tallerAheadCount the more incorrect it seems to get. I can’t figure out a rule to merge the shorter people to lower tallerAheadCount sorted lines.
How would you go about this?

Little to go on, but I suppose something like this could be what you're after.
This is probably not the most efficient implementation, and not sure it covers all cases (e.g. duplicates), but it's a start:
import random
# dummy data [(height, taller_ahead_count), ...]
original_line = [
    (10, 0), (12, 0), (3, 2), (8, 2), (9, 2), (5, 4), (1, 6), (4, 5), (2, 7)]
# "scatter" the people
scattered_line = original_line.copy()
random.shuffle(scattered_line)
# restore the original line order based on the taller_ahead_count
descending_height = sorted(scattered_line, key=lambda x: x[0], reverse=True)
restored_line = []
for height, taller_ahead_count in descending_height:
    taller_count = 0
    j = 0
    while taller_count < taller_ahead_count and j < len(restored_line):
        if restored_line[j] > height:
            taller_count += 1
        j += 1
    restored_line.insert(j, height)
# verify result
assert [height for height, __ in original_line] == restored_line
The basic idea is as follows:
We iterate over people in order of descending height, i.e. from tallest to shortest. That way, at each iteration, we are certain that all people taller than the current person are already in the restored line. We can then count the number of taller people to find the spot where the current person should be inserted.
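If all heights are distinct (an assumption this sketch relies on, and which the code above does not require), everyone already placed is strictly taller than the current person, so the counting loop reduces to inserting at index taller_ahead_count. The function name restore_line is made up for illustration.

```python
def restore_line(people):
    """people: iterable of (height, taller_ahead_count) pairs.
    Assumes distinct heights and consistent counts."""
    line = []
    # Tallest first: when a person is placed, the line holds only taller people,
    # so their index in the line equals the number of taller people ahead.
    for height, taller_ahead_count in sorted(people, reverse=True):
        line.insert(taller_ahead_count, height)
    return line


scattered = [(2, 7), (5, 4), (10, 0), (1, 6), (9, 2), (12, 0), (4, 5), (3, 2), (8, 2)]
print(restore_line(scattered))
```

list.insert is linear in the list length, so this is still O(n²) in the worst case; it is shorter rather than asymptotically faster.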

Get the deepest coverage site from a list of continuous span

What does a continuous span look like?
A continuous span is represented by a tuple, (start, end).
E.g. (2, 8) refers to a region that starts at 2 and ends at 8.
What does deepest coverage mean?
For a list of spans, eg [(0, 4), (2, 8), (5, 10), (6, 9)], the pileup result will be:
│ 0.......4
│     2...........8
│           5.........10
│             6.....9
└────────────────────────
  0 1 2 3 4 5 6 7 8 9 10 ...
The deepest coverage is 3, and it is reached over the span (6, 8).
In this case, the expected return value is (6, 8).
My solution
I don't know how to represent a continuous span, so I break each span into a list of numbers and try to find the most common one in the Counter result.
from collections import Counter
import numpy as np
density = Counter()
for start, end in SPAN_LIST:
    density.update(
        np.round(np.arange(start, end, 0.01), 2)
    )
most_dense_site, most_dense_count = density.most_common()[0]
The result might not be accurate, and it is extremely slow for a large list (about billions of items).
I know that if I increase the precision the result will be more accurate, but it will also use more memory.
Is there a better way to speed up the process and make the result more accurate?

Elaborating on the comment section:
The solution is to go through all starts and ends of ranges, mixed together, in order, "sweeping" through these points. We will consider them events and we will keep track of how many ranges we currently visit. An event triggered by a start of a range will increase the count of currently visited ranges. An event triggered by an end of a range will decrease the count of currently visited ranges.
(The code below assumes the ranges are half-open, including starts but not ends.)
Playground: https://ideone.com/fOAOXr
def deepest_coverage(span_list):
    if not span_list:
        raise ValueError("The given list must be non-empty")
    events = []
    for start, end in span_list:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()
    ret = None
    most_visited = currently_visited = 0
    for i in range(len(events)):
        currently_visited += events[i][1]
        if currently_visited > most_visited:
            most_visited = currently_visited
            ret = events[i][0], events[i+1][0]
    return ret
print(deepest_coverage([(0, 4), (2, 8), (5, 10), (6, 9)]))
Output:
(6, 8)
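The half-open assumption comes from how the event tuples sort: at the same position, (pos, -1) sorts before (pos, 1), so a span ending at a point is closed before another span starting there is opened. A small sketch that returns only the maximum depth makes this visible; the name max_depth is hypothetical and not part of the answer's code.

```python
def max_depth(span_list):
    """Maximum number of half-open spans [start, end) covering any point."""
    events = []
    for start, end in span_list:
        events.append((start, 1))
        events.append((end, -1))
    # Tuples sort by position first, then by delta, so ends precede starts
    # at the same position.
    events.sort()
    depth = best = 0
    for _, delta in events:
        depth += delta
        best = max(best, depth)
    return best


print(max_depth([(0, 4), (4, 8)]))                    # touching spans
print(max_depth([(0, 4), (2, 8), (5, 10), (6, 9)]))   # the example above
```

For closed spans, where touching endpoints should count as overlapping, the starts would have to sort before the ends at equal positions instead.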
Resources:
Wikipedia: Sweep line algorithm
Software Engineering Stack Exchange: Job scheduling: algorithm to find the maximum number of overlapping jobs?
Maximum number of overlapping intervals – Merge Overlapping Intervals – Max Task Load

I wrote this solution in Python using the sweep line algorithm.
The span list is flattened into a position list, where each position carries a score: +1 for every span starting there and -1 for every span ending there.
Start points and end points at the same position are merged, and positions whose score cancels to zero are dropped, which improves performance and fixes a wrong-position bug.
Looping over the position list, we record the deepest span together with its count.
from collections import defaultdict
def get_deepest_coverage(span_list):
    """Get the span with the deepest coverage."""
    pos_dict = defaultdict(int)
    for start, end in span_list:
        pos_dict[start] += 1
        pos_dict[end] -= 1
    pos_list = sorted((k, v) for k, v in pos_dict.items() if v)
    deepest_start = deepest_end = None
    deepest_count = current_count = 0
    for index, (pos, score) in enumerate(pos_list):
        current_count += score
        if current_count > deepest_count:
            deepest_count = current_count
            deepest_start, deepest_end = pos, pos_list[index + 1][0]
    return deepest_start, deepest_end, deepest_count
print(get_deepest_coverage([(2, 8), (5, 7), (7, 20)]))
Thanks @miloszlakomy for all the materials, and for the nice solution.
(It took me a whole afternoon to write this snippet, only to find that @miloszlakomy had already posted an answer here.)
And thanks to xxx for repeatedly downvoting all my posts, without which I wouldn't have had the motivation to finish this problem by myself.

How does a for loop work in tuples

I am new to Python and am having a hard time understanding the following code. It would be great if someone could explain it. I have two lists of tuples. Specifically, I can't understand how the for loop works here, and what weight_cost[index][0] means.
ratios=[(3, 0.75), (2, 0.5333333333333333), (0, 0.5), (1, 0.5)]
weight_cost=[(8, 4), (10, 5), (15, 8), (4, 3)]
best_combination = [0] * number
best_cost = 0
weight = 0
for index, ratio in ratios:
    if weight_cost[index][0] + weight <= capacity:
        weight += weight_cost[index][0]
        best_cost += weight_cost[index][1]
        best_combination[index] = 1

A good practice to get into when you're trying to understand a piece of code is removing the irrelevant parts so you can see just what the code you care about is doing. This is often referred to as an MCVE.
With your code snippet there are several things we can clean up to make the behavior we're interested in clearer.
We can remove the loop's contents, and simply print the values
We can remove the second tuple and the other variables, which we are no longer using
Leaving us with:
ratios=[(3, 0.75), (2, 0.5333333333333333), (0, 0.5), (1, 0.5)]
for index, ratio in ratios:
    print('index: %s, ratio %s' % (index, ratio))
Now we can drop this into the REPL and experiment:
>>> ratios=[(3, 0.75), (2, 0.5333333333333333), (0, 0.5), (1, 0.5)]
>>> for index, ratio in ratios:
...     print('index: %s, ratio %s' % (index, ratio))
...
index: 3, ratio 0.75
index: 2, ratio 0.5333333333333333
index: 0, ratio 0.5
index: 1, ratio 0.5
You can see clearly now exactly what it's doing - looping over every tuple of the list in order, and extracting the first and second values from the tuple into the index and ratio variables.
Try experimenting with this - what happens if you make one of the tuples of size 1, or 3? What if you only specify one variable in the loop, rather than two? Can you specify more than two variables?
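For anyone who wants to run those experiments as a script rather than in the REPL, here is one possible sketch. The helper try_unpack and the sample data are made up for illustration.

```python
pairs = [(3, 0.75), (2, 0.5)]

# One loop variable: each item is the whole tuple, nothing is unpacked.
whole = [item for item in pairs]

# Two loop variables: each tuple is unpacked into its two elements.
unpacked = [(index, ratio) for index, ratio in pairs]


def try_unpack(items):
    """Loop with two variables and report the exception type, if any."""
    try:
        for index, ratio in items:
            pass
    except ValueError as exc:
        return type(exc).__name__
    return "ok"


# A tuple with too few or too many elements cannot be unpacked into two names.
too_short = try_unpack([(3,)])
too_long = try_unpack([(3, 0.75, "extra")])
print(whole, unpacked, too_short, too_long)
```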

The for loop iterates through each tuple in the list, assigning its element at index 0 to index and its element at index 1 to ratio.
It then looks up the corresponding entry in weight_cost, which is a tuple, and takes element 0 of that tuple. If this value plus weight is less than or equal to capacity, we move into the if block.
Inside the block, index is used in the same way to access items in weight_cost and best_combination.
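To watch this happen, the snippet from the question can be run with values filled in for the two names it leaves undefined. The values number = 4 and capacity = 20 are assumptions chosen only for this illustration; they do not come from the question.

```python
ratios = [(3, 0.75), (2, 0.5333333333333333), (0, 0.5), (1, 0.5)]
weight_cost = [(8, 4), (10, 5), (15, 8), (4, 3)]  # (weight, cost) per item

number = len(weight_cost)  # assumed: one slot per item
capacity = 20              # assumed value, not given in the question

best_combination = [0] * number
best_cost = 0
weight = 0
for index, ratio in ratios:
    item_weight, item_cost = weight_cost[index]  # same as [index][0] and [index][1]
    if item_weight + weight <= capacity:
        weight += item_weight
        best_cost += item_cost
        best_combination[index] = 1

print(weight, best_cost, best_combination)
```

With these assumed values, items 3 and 2 fit (weights 4 and 15), while items 0 and 1 would exceed the capacity and are skipped.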

Given coordinates of pawns on a chess board, Count 'safe pieces', Python

Given the coordinates of pawns on a chess board, represented as a set of strings, e.g. {"b4", "d4", "f4", "c3", "e3", "g5", "d2"}, where rows are digits and columns are letters, determine the number of protected pawns, i.e. the number of pawns with other pawns diagonally behind them on the board.
I am trying to teach myself Python and have been at this task for hours. Any help would be greatly appreciated.
Here is my embarrassingly messy attempt:
def safe_pawns(pawns):
    count = 0
    cols = "abcdefgh"
    positions = {'a':[],'b':[],'c':[],'d':[],'e':[],'f':[],'g':[],'h':[]}
    for i in positions:
        for j in pawns:
            if i in j:
                positions[i].append(int(j[1]))
    for k in range(len(cols)-1):
        for l in positions[cols[k+1]]:
            if l +1 or l-1 in positions[cols[k]]:
                count +=1
    return count

I'm willing to bet this is your problem:
if l +1 or l-1 in positions[cols[k]]:
This doesn't mean "if l+1 is in that slot, or l-1 is in that slot". If you meant that (and you almost certainly did), you have to say that:
if l+1 in positions[cols[k]] or l-1 in positions[cols[k]]:
(There are various ways to write it indirectly, too, like if {l+1, l-1}.intersection(positions[cols[k]]), but I think the explicit version is the obvious way here.)
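A short check shows how Python actually groups the original condition: `in` binds more tightly than `or`, so it is evaluated as (l + 1) or ((l - 1) in ...). The variable names below are placeholders for illustration.

```python
l = 3
column = []  # an empty column: nothing can possibly be "in" it

# Parsed as (l + 1) or ((l - 1) in column); l + 1 is 4, which is truthy,
# so the membership test is never even evaluated.
buggy = l + 1 or l - 1 in column

# Each side is now its own membership test.
fixed = l + 1 in column or l - 1 in column

print(buggy, fixed)
```

Because l + 1 is non-zero for every valid row number, the original condition is always truthy and every pawn gets counted.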

First, using letters for columns is going to cause you problems once you start doing arithmetic, because you can't just do 'b' - 1. It will be easier if you convert your set from a set of strings like 'a1' into a set of tuples like (1, 1). (Or you could zero-index, but that is outside the scope here I think).
Second, let's assume you now have a pawns set {(2, 4), (3, 3), (4, 2), (4, 4), (5, 3), (6, 4), (7, 5)}. You don't need that much looping. You can actually get the set of protected pieces (I'm assuming you're the player at the "bottom" of the board?) using a set expression:
{(x,y) for (x,y) in s if (x-1,y-1) in s or (x+1,y-1) in s}
And you'll find the size of that is 6.
Note: the input conversion expression I used was:
s = {("abcdefgh".find(let) + 1, int(num)) for let, num in pawns}
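Putting the conversion and the set expression together gives a complete function. This is a sketch under a few assumptions: every input string is exactly two characters (a column letter from a to h followed by a row digit), and "behind" means one row lower, as in the expression above.

```python
def safe_pawns(pawns):
    """Count pawns that have another pawn diagonally behind them."""
    # Convert strings like "b4" into (column, row) tuples like (2, 4).
    s = {("abcdefgh".find(col) + 1, int(row)) for col, row in pawns}
    protected = {
        (x, y) for (x, y) in s
        if (x - 1, y - 1) in s or (x + 1, y - 1) in s
    }
    return len(protected)


print(safe_pawns({"b4", "d4", "f4", "c3", "e3", "g5", "d2"}))
```

Set membership tests are O(1) on average, so the whole count is linear in the number of pawns.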
