Removing similar items from 2 different lists - python

Is there a more Pythonic way to execute this code?
sim_inits = [1,100, 12, 3520, 1250]
prod_inits = [2, 101, 13, 14, 3521, 1500]
for t in range(len(sim_inits)-1):
    sim_loop_done = False
    for s in sim_inits[:]:
        if sim_loop_done == True:
            continue
        prod_loop_done = False
        for p in prod_inits[:]:
            if prod_loop_done == True:
                continue
            if abs(s-p) < 3:
                sim_inits.remove(s)
                prod_inits.remove(p)
                sim_loop_done = True
                prod_loop_done = True
print sim_inits
print prod_inits
Output:
[1250]
[14, 1500]
I'm trying to loop over both lists and the moment I find a match (defined by a difference less than 3), I want to move to the next item. I do NOT want 14 removed from prod_inits because the 12 from sim_inits was removed against the 13 in prod_inits.
The above code works, I was just wondering if it can be done more efficiently.

You can skip one of the loops, and you can use break instead of continue to get out of the other one early, without the cumbersome flags you're currently using.
List slicing is fairly expensive - especially in the case of prod_inits, where you duplicate the entire list just to remove one element from it. It is cheaper to iterate by index and use pop() instead of remove() to remove that index. Similarly, we can step through sim_inits with a while loop instead of a for loop, because that lets us account for the elements we remove as we go (the s -= 1 below exists for exactly this reason).
sim_inits = [1, 100, 12, 3520, 1250]
prod_inits = [2, 101, 13, 14, 3521, 1500]
s = 0
while s < len(sim_inits):
    for p in range(len(prod_inits)):
        if abs(sim_inits[s] - prod_inits[p]) < 3:
            sim_inits.pop(s)
            prod_inits.pop(p)
            s -= 1
            break
    s += 1
print(sim_inits)
print(prod_inits)
After running this code locally:
>>> print(sim_inits)
[1250]
>>> print(prod_inits)
[14, 1500]
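If you prefer not to mutate the original lists, here is a minimal sketch of the same greedy first-match idea that builds the leftovers instead (my own variant, not part of the original answer):
sim_inits = [1, 100, 12, 3520, 1250]
prod_inits = [2, 101, 13, 14, 3521, 1500]

unmatched_sim = []
remaining_prod = prod_inits[:]          # work on a copy we are allowed to shrink
for s in sim_inits:
    for i, p in enumerate(remaining_prod):
        if abs(s - p) < 3:
            remaining_prod.pop(i)       # consume the matched prod value
            break
    else:                               # no prod value matched this s
        unmatched_sim.append(s)

print(unmatched_sim)    # [1250]
print(remaining_prod)   # [14, 1500]
The for/else here plays the same role as the flags in the question: the else branch runs only when the inner loop finishes without a break.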

Related

Function which eliminates specific elements from a list in an efficient way

I want to create a function (without using libraries) which takes as input three integer numbers (> 0) (a, b, c), for example:
a = 6
b = 6
c = 3
and returns a list containing c elements (so in this case the returned list should contain 3 elements), taken from a list of a numbers (so in this case the initial list is [1,2,3,4,5,6]). The c elements of the returned list have to be the ones that manage to remain in the initial list after repeatedly removing a number every b positions (wrapping around the end) until only c elements are left.
So for a = 6, b = 6 and c = 3 the function should do something like this:
1) initial_list = [1,2,3,4,5,6]
2) first_change = [2,3,4,5,6] #the number after the 6th (**b**) is 1 because after the last element you keep counting returning to the first one, so 1 is canceled
3) second_change = [2,4,5,6] #you don't start to count positions from the start but from the first number after the eliminated one, so new number after the 6th is 3 and so 3 is canceled
4) third_change = [2,4,5] #now the number of elements in the list is equal to **c**
Notice that if, when counting, you end up finishing the elements from the list, you keep the counting and return to the first element of the list.
I made this function:
def findNumbers(a,b,c):
    count = 0
    dictionary = {}
    countdown = a
    for x in range(1,a+1):
        dictionary[x] = True
    while countdown > c:
        for key,value in dictionary.items():
            if value == True:
                count += 1
                if count == b+1:
                    dictionary[key] = False
                    count = 0
                    countdown -= 1
    return [key for key in dictionary.keys() if dictionary[key] == True]
It works in some cases, like the above example, but it doesn't work every time.
For example:
findNumbers(1000,1,5)
returns:
[209, 465, 721, 977] #wrong result
instead of:
[209, 465, 721, 849, 977] #right result
and for bigger numbers, like:
findNumbers(100000, 200000, 5)
it takes too much time to even finish. I don't know whether the problem is the inefficiency of my algorithm or something in the code that causes trouble for Python. I would like to know a different approach to this situation that is both more efficient and works in every case. Can anyone give me some hints/ideas?
Thank you in advance for your time. And let me know if you need more explanations and/or examples.
You can keep track of the index of the last list item deleted like this:
def findNumbers(a, b, c):
    l = list(range(1, a + 1))
    i = 0
    for n in range(a - c):
        i = (i + b) % (a - n)
        l.pop(i)
    return l
so that findNumbers(6, 6, 3) returns:
[2, 4, 5]
and findNumbers(1000, 1, 5) returns:
[209, 465, 721, 849, 977]
and findNumbers(100000, 200000, 5) returns:
[10153, 38628, 65057, 66893, 89103]
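For intuition, here is the worked trace of findNumbers(6, 6, 3) (my own annotation of the code above; the list length at step n is a - n, which is what the modulo uses):
# n = 0: i = (0 + 6) % 6 = 0 -> pop index 0 (value 1) -> [2, 3, 4, 5, 6]
# n = 1: i = (0 + 6) % 5 = 1 -> pop index 1 (value 3) -> [2, 4, 5, 6]
# n = 2: i = (1 + 6) % 4 = 3 -> pop index 3 (value 6) -> [2, 4, 5]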
I thought I could be recursive about the problem, so I wrote this:
def func(a,b,c):
    d = [i+1 for i in range(a)]
    def sub(d,b,c):
        if c == 0: return d
        else:
            k = b % len(d)
            d.pop(k)
            d = d[k:] + d[:k]
            return sub(d,b,c-1)
    return sub(d,b,a-c)
so that func(6,6,3) successfully returns [2, 4, 5], and func(1000,1,5) should return [209, 465, 721, 849, 977] - but unfortunately it ends with an error instead.
It turns out that for values of a > 995, the following error is raised:
RecursionError: maximum recursion depth exceeded while calling a Python object
There was no need to try func(100000,200000,5) - lesson learnt.
Still, rather than dispose of the code, I decided to share it. It can serve as a cautionary example about recursive thinking.
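For reference, a minimal sketch of the same rotate-and-pop logic written as a plain loop, so that deep inputs don't hit the recursion limit (my own rewrite; func_iterative is a hypothetical name):
def func_iterative(a, b, c):
    # Same rotate-and-pop idea as the recursive sub() above, but iterative.
    d = [i + 1 for i in range(a)]
    for _ in range(a - c):
        k = b % len(d)
        d.pop(k)
        d = d[k:] + d[:k]   # rotate so counting restarts after the removed spot
    return sorted(d)        # the rotations scramble the order; sort to restore it

print(func_iterative(6, 6, 3))      # [2, 4, 5]
print(func_iterative(1000, 1, 5))   # should match the first answer above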

speed up regex finditer for large dataset

I am trying to find positions of a match (N or -) in a large dataset.
The number of matches per string (3 million letters) is around 300,000. I have 110 strings to search in the same file so I made a loop using re.finditer to match and report position of each match but it is taking very long time. Each string (DNA sequence) is composed of only six characters (ATGCN-). Only 17 strings were processed in 11 hours. The question is what can I do to speed up the process?
The part of the code I am talking about is:
for found in re.finditer(r"[-N]", DNA_sequence):
    position = found.start() + 1
    positions_list.append(position)
    positions_set = set(positions_list)
    all_positions_set = all_positions_set.union(positions_set)
count += 1
print(str(count) + '\t' + record.id + '\t' + 'processed')
output_file.write(record.id + '\t' + str(positions_list) + '\n')
I also tried to use re.compile as I googled and found that it could improve performance but nothing changed (match = re.compile('[-N]'))
If you have roughly 300k matches - you are re-creating increasingly larger sets that contain exactly the same elements as the list you are already adding to:
for found in re.finditer(r"[-N]", DNA_sequence):
    position = found.start() + 1
    positions_list.append(position)
    positions_set = set(positions_list)    # 300k times ... why? why at all?
You can instead simply use the list you got anyway and put that into your all_positions_set after you found all of them:
all_positions_set = all_positions_set.union(positions_list) # union takes any iterable
That should reduce the memory by more than 50% (sets are more expensive than lists) and also cut down on the runtime significantly.
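Put together, the corrected per-record part could look roughly like this (a sketch with a stand-in sequence; the real DNA_sequence comes from your file):
import re

DNA_sequence = "ATGCN-ATGCN-"   # stand-in for a real 3-million-letter sequence
all_positions_set = set()

positions_list = [m.start() + 1 for m in re.finditer(r"[-N]", DNA_sequence)]
all_positions_set = all_positions_set.union(positions_list)  # one union per record
print(positions_list)   # [5, 6, 11, 12]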
I am unsure what is faster, but you could even skip using regex:
t = "ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-"
pos = []
for idx,c in enumerate(t):
if c in "N-":
pos.append(idx)
print(pos) # [4, 5, 10, 11, 16, 17, 22, 23, 28, 29, 34, 35, 40, 41, 46, 47]
and instead use enumerate() on your string to find the positions - you would need to test whether that is actually faster.
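The same idea fits in a comprehension if you need the 1-based positions used elsewhere in the question (again a sketch with a stand-in sequence):
DNA_sequence = "ATGCN-ATGCN-"   # stand-in; use the real sequence here
positions_list = [idx + 1 for idx, c in enumerate(DNA_sequence) if c in "N-"]
print(positions_list)   # [5, 6, 11, 12]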
Regarding not using regex: I actually did that, and my modified script now runs in less than 45 seconds using the following helper function:
def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start + 1
        start += len(sub)
So the new coding part is:
N_list = list(find_all(DNA_sequence, 'N'))
dash_list = list(find_all(DNA_sequence, '-'))
positions_list = N_list + dash_list
all_positions_set = all_positions_set.union(positions_list)
count += 1
print(str(count) + '\t' +record.id+'\t'+'processed')
output_file.write(record.id+'\t'+str(sorted(positions_list))+'\n')

Find two numbers from a list that add up to a specific number

This is super bad and messy, I am new to this, please help me.
Basically, I was trying to find two numbers from a list that add up to a target number.
I have set up an example with lst = [2, 4, 6, 10] and a target value of target = 8. The answer in this example would be (2, 6) and (6, 2).
Below is my code but it is long and ugly and I am sure there is a better way of doing it. Can you please see how I can improve from my code below?
from itertools import product, permutations
numbers = [2, 4, 6, 10]
target_number = 8
two_nums = (list(permutations(numbers, 2)))
print(two_nums)
result1 = (two_nums[0][0] + two_nums[0][1])
result2 = (two_nums[1][0] + two_nums[1][1])
result3 = (two_nums[2][0] + two_nums[2][1])
result4 = (two_nums[3][0] + two_nums[3][1])
result5 = (two_nums[4][0] + two_nums[4][1])
result6 = (two_nums[5][0] + two_nums[5][1])
result7 = (two_nums[6][0] + two_nums[6][1])
result8 = (two_nums[7][0] + two_nums[7][1])
result9 = (two_nums[8][0] + two_nums[8][1])
result10 = (two_nums[9][0] + two_nums[9][1])
my_list = (result1, result2, result3, result4, result5, result6, result7, result8, result9, result10)
print (my_list)
for i in my_list:
    if i == 8:
        print("Here it is:" + str(i))
For every number in the list, you can look for its complement (the number that, added to the current one, gives the required target sum). If it exists, report the pair and exit; otherwise move on.
This would look like the following:
numbers = [2, 4, 6, 10]
target_number = 8
for i, number in enumerate(numbers[:-1]):  # note 1
    complementary = target_number - number
    if complementary in numbers[i+1:]:  # note 2
        print("Solution Found: {} and {}".format(number, complementary))
        break
else:  # note 3
    print("No solutions exist")
which produces:
Solution Found: 2 and 6
Notes:
You do not have to check the last number; if there was a pair you would have already found it by then.
Notice that the membership check (which is quite costly for lists) is optimized because it only considers the slice numbers[i+1:]; the previous numbers have already been checked. A positive side-effect of the slicing is that a single 4 in the list does not, by itself, produce a pair for a target value of 8.
This is an excellent setup for explaining the misunderstood and often confusing use of else in for-loops: the else triggers only if the loop was not ended abruptly by a break.
If the e.g., 4 - 4 solution is acceptable to you even when having a single 4 in the list you can modify as follows:
numbers = [2, 4, 6, 10]
target_number = 8
for i, number in enumerate(numbers):
    complementary = target_number - number
    if complementary in numbers[i:]:
        print("Solution Found: {} and {}".format(number, complementary))
        break
else:
    print("No solutions exist")
A list comprehension will work well here. Try this:
from itertools import permutations
numbers = [2, 4, 6, 10]
target_number = 8
solutions = [pair for pair in permutations(numbers, 2) if sum(pair) == 8]
print('Solutions:', solutions)
Basically, this list comprehension looks at all the pairs that permutations(numbers, 2) returns, but only keeps the ones whose total sum equals 8.
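If you only want each unordered pair reported once, itertools.combinations is a drop-in swap for permutations here (my suggestion, not part of the original answer):
from itertools import combinations

numbers = [2, 4, 6, 10]
target_number = 8

solutions = [pair for pair in combinations(numbers, 2) if sum(pair) == target_number]
print('Solutions:', solutions)   # Solutions: [(2, 6)]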
The simplest general way to do this is to iterate over your list and for each item iterate over the rest of the list to see if it adds up to the target value. The downside of this is it is an O(n^2) operation. I don't know off the top of my head if there is a more efficient solution. I'm not 100% sure my syntax is correct, but it should look something like the following:
done = False
for i, val in enumerate(numbers):
    if val >= target_number:
        continue
    for j, val2 in enumerate(numbers[i+1:], i+1):
        if val + val2 == target_number:
            print("Here it is: " + str(i) + "," + str(j))
            done = True
            break
    if done:
        break
Of course you should create this as a function that returns your result instead of just printing it. That would remove the need for the "done" variable.
If you are trying to find the answer for multiple integers with a long list that has duplicate values, I would recommend using frozenset. The "checked" answer will only get the first answer and then stop.
import numpy as np

numbers = np.random.randint(0, 100, 1000)
target = 17

def adds_to_target(base_list, target):
    return_list = []
    for i in range(len(base_list)):
        return_list.extend([list((base_list[i], b)) for b in base_list if (base_list[i] + b) == target])
    return set(map(frozenset, return_list))
# sample output
{frozenset({7, 10}),
frozenset({4, 13}),
frozenset({8, 9}),
frozenset({5, 12}),
frozenset({2, 15}),
frozenset({3, 14}),
frozenset({0, 17}),
frozenset({1, 16}),
frozenset({6, 11})}
1) In the first for loop, lists containing two integers that sum to the target value are added to "return_list" i.e. a list of lists is created.
2) Then frozenset takes out all duplicate pairs.
%timeit adds_to_target(numbers, target_number)
# 312 ms ± 8.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can do it in one line with a list comprehension, like below:
from itertools import permutations
numbers = [2, 4, 6, 10]
target_number = 8
two_nums = (list(permutations(numbers, 2)))
result = [i for i in two_nums if i[0] + i[1] == target_number]
# result == [(2, 6), (6, 2)]
If you want a way to do this efficiently without itertools -
numbers = [1,3,4,5,6,2,3,4,1]
target = 5
number_dict = {}
pairs = []
for num in numbers:
    number_dict[num] = number_dict.get(num, 0) + 1
    complement = target - num
    if complement in number_dict.keys():
        pairs.append((num, complement))
        number_dict.pop(num)
        number_dict.pop(complement)
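For the inputs above, printing the result should give something like this (based on tracing the loop by hand, so treat it as illustrative):
print(pairs)   # [(4, 1), (2, 3), (1, 4)]
Each pop() consumes both members of a found pair, which is what keeps already-matched values from being paired again.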
It is this simple :)
def func(array, target):
    flag = 0
    for x in array:
        for y in array:
            if (target - x) == y and x != y:
                print(x, y)
                flag = 1
                break
        if flag == 1:
            break
list_ = [1, 2, 4, 6, 8]
num = 10
for number in list_:
    num_add = number
    for number_ in list_:
        if number_ + num_add == num and number_ != num_add:
            print(number_, num_add)
n is the desired sum and L is the list. You enter the outer loop and, from that index to the end of the list, iterate through the inner loop. If L[i] and L[j] add up to n and L[i] != L[j], print the pair.
numbers = [1, 2, 3, 4, 9, 8, 5, 10, 20, 30, 6]

def two_no_summer(n, L):
    for i in range(0, len(L)):
        for j in range(i, len(L)):
            if (L[i] + L[j] == n) & (L[i] != L[j]):
                print(L[i], L[j])
Execution: https://i.stack.imgur.com/Wu47x.jpg

Merging Overlapping Intervals in Python [duplicate]

I am trying to solve a question in which overlapping intervals need to be merged.
The question is:
Given a collection of intervals, merge all overlapping intervals.
For example, Given [1,3],[2,6],[8,10],[15,18], return [1,6],[8,10],[15,18].
I tried my solution:
# Definition for an interval.
# class Interval:
#     def __init__(self, s=0, e=0):
#         self.start = s
#         self.end = e

class Solution:
    def merge(self, intervals):
        """
        :type intervals: List[Interval]
        :rtype: List[Interval]
        """
        start = sorted([x.start for x in intervals])
        end = sorted([x.end for x in intervals])
        merged = []
        j = 0
        new_start = 0
        for i in range(len(start)):
            if start[i] < end[j]:
                continue
            else:
                j = j + 1
                merged.append([start[new_start], end[j]])
                new_start = i
        return merged
However, it is clearly missing the last interval:
Input : [[1,3],[2,6],[8,10],[15,18]]
Answer :[[1,6],[8,10]]
Expected answer: [[1,6],[8,10],[15,18]]
Not sure how to include the last interval as overlap can only be checked in forward mode.
How to fix my algorithm so that it works till the last slot?
Your code implicitly already assumes the starts and ends to be sorted, so that sort could be left out. To see this, try the following intervals:
intervals = [[3,9],[2,6],[8,10],[15,18]]
start = sorted([x[0] for x in intervals])
end = sorted([x[1] for x in intervals])  # mimicking your start/end lists
merged = []
j = 0
new_start = 0
for i in range(len(start)):
    if start[i] < end[j]:
        continue
    else:
        j = j + 1
        merged.append([start[new_start], end[j]])
        new_start = i
print(merged)  # [[2, 9], [8, 10]]
Anyway, the best way to do this is probably recursion, here shown for a list of lists instead of Interval objects.
def recursive_merge(inter, start_index=0):
    for i in range(start_index, len(inter) - 1):
        if inter[i][1] > inter[i+1][0]:
            new_start = inter[i][0]
            new_end = max(inter[i][1], inter[i+1][1])  # keep the larger end in case of nesting
            inter[i] = [new_start, new_end]
            del inter[i+1]
            return recursive_merge(inter.copy(), start_index=i)
    return inter

sorted_on_start = sorted(intervals)
merged = recursive_merge(sorted_on_start.copy())
print(merged)  # [[2, 10], [15, 18]]
I know the question is old, but in case it might help, I wrote a Python library to deal with (sets of) intervals. Its name is portion, and it makes it easy to merge intervals:
>>> import portion as P
>>> inputs = [[1,3],[2,6],[8,10],[15,18]]
>>> # Convert each input to an interval
>>> intervals = [P.closed(a, b) for a, b in inputs]
>>> # Merge these intervals
>>> merge = P.Interval(*intervals)
>>> merge
[1,6] | [8,10] | [15,18]
>>> # Output as a list of lists
>>> [[i.lower, i.upper] for i in merge]
[[1,6],[8,10],[15,18]]
Documentation can be found here: https://github.com/AlexandreDecan/portion
We can sort the intervals by their first value and build the merged result in place, in the same list, by checking the intervals one by one instead of appending to another list. We advance i over every interval, and interval_index points at the interval currently being extended.
x = [[1,3],[2,6],[8,10],[15,18]]
#y = [[1,3],[2,6],[8,10],[15,18],[19,25],[20,26],[25,30],[32,40]]

def merge_intervals(intervals):
    sorted_intervals = sorted(intervals, key=lambda x: x[0])
    interval_index = 0
    #print(sorted_intervals)
    for i in sorted_intervals:
        if i[0] > sorted_intervals[interval_index][1]:
            interval_index += 1
            sorted_intervals[interval_index] = i
        else:
            sorted_intervals[interval_index] = [sorted_intervals[interval_index][0],
                                                max(sorted_intervals[interval_index][1], i[1])]
    #print(sorted_intervals)
    return sorted_intervals[:interval_index+1]

print(merge_intervals(x))  # --> [[1, 6], [8, 10], [15, 18]]
#print("------------------------------")
#print(merge_intervals(y))  # --> [[1, 6], [8, 10], [15, 18], [19, 30], [32, 40]]
This is very old now, but in case anyone stumbles across this, I thought I'd throw in my two cents, since I wasn't completely happy with the answers above.
I'm going to preface my solution by saying that when I work with intervals, I prefer to convert them to python3 ranges (probably an elegant replacement for your Interval class) because I find them easy to work with. However, you need to remember that ranges are half-open like everything else in Python, so the stop coordinate is not "inside" of the interval. Doesn't matter for my solution, but something to keep in mind.
My own solution:
# Start by converting the intervals to ranges.
my_intervals = [[1, 3], [2, 6], [8, 10], [15, 18]]
my_ranges = [range(start, stop) for start, stop in my_intervals]

# Next, define a check which will return True if two ranges overlap.
# The double inequality approach means that your intervals don't
# need to be sorted to compare them.
def overlap(range1, range2):
    if range1.start <= range2.stop and range2.start <= range1.stop:
        return True
    return False

# Finally, the actual function that returns a list of merged ranges.
def merge_range_list(ranges):
    ranges_copy = sorted(ranges.copy(), key=lambda x: x.stop)
    ranges_copy = sorted(ranges_copy, key=lambda x: x.start)
    merged_ranges = []
    while ranges_copy:
        range1 = ranges_copy[0]
        del ranges_copy[0]
        merges = []  # This will store the position of ranges that get merged.
        for i, range2 in enumerate(ranges_copy):
            if overlap(range1, range2):  # Use our premade check function.
                range1 = range(min([range1.start, range2.start]),  # Overwrite with merged range.
                               max([range1.stop, range2.stop]))
                merges.append(i)
        merged_ranges.append(range1)
        # Time to delete the ranges that got merged so we don't use them again.
        # This needs to be done in reverse order so that the index doesn't move.
        for i in reversed(merges):
            del ranges_copy[i]
    return merged_ranges

print(merge_range_list(my_ranges))  # --> [range(1, 6), range(8, 10), range(15, 18)]
Make pairs for every endpoint: (value; kind = +/-1 for start or end of interval).
Sort them by value. In case of a tie, choose the pair with -1 first if you need to merge intervals with coinciding ends, like 0-1 and 1-2.
Set CurrCount = 0, walk through the sorted list, adding kind to CurrCount.
Start a new resulting interval when CurrCount becomes nonzero, and finish the interval when CurrCount becomes zero.
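A minimal sketch of that sweep (my own illustration, using kind = -1 for a start and +1 for an end, so that a plain tuple sort already processes the -1 events first on ties and therefore merges touching intervals):
def merge_by_endpoints(intervals):
    # Endpoint events: (value, kind) with kind = -1 for a start, +1 for an end.
    events = sorted((v, k) for s, e in intervals for v, k in ((s, -1), (e, +1)))
    merged, curr_count, curr_start = [], 0, None
    for value, kind in events:
        if curr_count == 0:        # a new merged interval begins here
            curr_start = value
        curr_count += kind
        if curr_count == 0:        # every opened interval has closed again
            merged.append([curr_start, value])
    return merged

print(merge_by_endpoints([[1, 3], [2, 6], [8, 10], [15, 18]]))
# [[1, 6], [8, 10], [15, 18]]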
Late to the party, but here is my solution. I typically find recursion with an invariant easier to conceptualize. In this case, the invariant is that the head is always merged, and the tail is always waiting to be merged, and you compare the last element of head with the first element of tail.
One should definitely use sorted with the key argument rather than using a list comprehension.
Not sure how efficient this is with slicing and concatenating lists.
def _merge(head, tail):
    if tail == []:
        return head
    a, b = head[-1]
    x, y = tail[0]
    do_merge = b > x
    if do_merge:
        head_ = head[:-1] + [(a, max(b, y))]
        tail_ = tail[1:]
        return _merge(head_, tail_)
    else:
        head_ = head + tail[:1]
        tail_ = tail[1:]
        return _merge(head_, tail_)

def merge_intervals(lst):
    if len(lst) <= 1:
        return lst
    lst = sorted(lst, key=lambda x: x[0])
    return _merge(lst[:1], lst[1:])
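A quick usage check (my own; note that a merged interval comes back as a tuple because of the (a, max(b, y)) construction, while untouched intervals stay as lists):
print(merge_intervals([[1, 3], [2, 6], [8, 10], [15, 18]]))
# [(1, 6), [8, 10], [15, 18]]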

Find mean of first 9 numbers then the next 9 numbers and so on from for loop

I have a for loop that gives me the following output.
0.53125
0.4375
0.546875
0.578125
0.75
0.734375
0.640625
0.53125
0.515625
0.828125
0.5
0.484375
0.59375
0.59375
0.734375
0.71875
0.609375
0.484375
.
.
.
How do I find the mean of the first 9 values, the next 9 values and so on and store them into a list like [0.58,0.20,...]? I have tried a lot of things but the values seem to be incorrect. What is the correct way of doing this?
What I did:
matchedRatioList = []
matchedRatio = 0
i = 0
for feature in range(90):
    featureToCompare = featuresList[feature]
    number = labelsList[feature]
    match = difflib.SequenceMatcher(None, featureToCompare, imagePixList)
    matchingRatio = match.ratio()
    print(matchingRatio)
    matchedRatio += matchingRatio
    if i == 8:
        matchedRatioList.append(matchedRatio / 9)
        i = 0
        matchedRatio = 0
    i += 1
Once you have the list of numbers you can calculate the average of each group of 9 numbers using list comprehensions:
from statistics import mean
numbers = [0.53125, 0.4375, 0.546875, 0.578125, 0.75, 0.734375, 0.640625,
           0.53125, 0.515625, 0.828125, 0.5, 0.484375, 0.59375, 0.59375,
           0.734375, 0.71875, 0.609375, 0.484375]
group_len = 9
matched_ratios = [mean(group) for group in [numbers[i:i+group_len]
                                            for i in range(0, len(numbers), group_len)]]
print(matched_ratios)
# [0.5850694444444444, 0.6163194444444444]
Your solution is close. Start with i = 1 and check for i == 9
matchedRatioList = []
matchedRatio = 0
i = 1  # change here
for feature in range(90):
    ...
    matchedRatio += matchingRatio
    if i == 9:  # change here
        matchedRatioList.append(matchedRatio / 9)
        i = 0
        matchedRatio = 0
    i += 1
I do not know what you have tried so far, but I can present you with one solution to the problem.
Save the values from your for-loop into a buffer list. Use an if-statement with iterator % 9 == 0 inside the for-loop, so that a piece of code runs only on every 9th value.
Inside that if-statement, append the mean of your buffer to a separate output list and reset the buffer; repeating this gives the grouping you want.
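A minimal sketch of that idea (my own illustration; values stands in for the numbers your loop prints):
values = [0.53125, 0.4375, 0.546875, 0.578125, 0.75, 0.734375, 0.640625, 0.53125, 0.515625]

buffer = []    # holds up to 9 values at a time
means = []     # one mean per completed group of 9
for iterator, value in enumerate(values, start=1):
    buffer.append(value)
    if iterator % 9 == 0:          # every 9th value, flush the buffer
        means.append(sum(buffer) / len(buffer))
        buffer = []

print(means)   # [0.5850694444444444] for the 9 sample values above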
Try this (with the check adjusted to i == 9 so that every group really contains nine values; a holds the values produced by your loop):
a = []  # fill with the values produced by your loop
r = []
c = 0
i = 1
for b in a:
    c += b
    if i == 9:
        r.append(c / 9)
        c = 0
        i = 0
    i += 1
Since nobody has used reduce so far :)
import functools

l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
m = []
for i in range(9, len(l) + 1, 9):
    m.append(functools.reduce(lambda x, y: x + y, l[i-9:i]) / 9)
print(m)
Using the mean function from Python's statistics module.
import statistics

# Sample values list I created.
values_list = list()
for i in range(1, 82):
    values_list.append(i)

mean_list = list()
for i in range(0, len(values_list), 9):
    mean_list.append(statistics.mean(values_list[i:i+9]))

for i in mean_list:
    print(i)
This is the simplest way in which you can do it.
https://docs.python.org/3/library/statistics.html#statistics.mean
One-line solution given loop output in numbers:
[float(sum(a))/len(a) for a in zip(*[iter(numbers)]*9)]
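For anyone puzzled by the zip(*[iter(numbers)]*9) idiom, here is a small demonstration of the chunking it performs (my own illustration):
numbers = list(range(1, 19))           # 18 values, i.e. two complete groups of 9
it = iter(numbers)
groups = list(zip(*[it] * 9))          # the same iterator is consumed 9 items at a time
print(groups)
# [(1, 2, 3, 4, 5, 6, 7, 8, 9), (10, 11, 12, 13, 14, 15, 16, 17, 18)]
print([float(sum(g)) / len(g) for g in groups])   # [5.0, 14.0]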
Putting ideas from the other answers together, this could be the whole program:
from statistics import mean
matching_ratios = (difflib.SequenceMatcher(None, feature, imagePixList).ratio()
                   for feature in featuresList[:90])
matchedRatioList = [mean(group) for group in zip(*[matching_ratios] * 9)]
