Python finding patterns within large group of numbers? [duplicate] - python

This question already has answers here:
Finding a repeating sequence at the end of a sequence of numbers
(5 answers)
Closed 9 years ago.
I'm working with a list of lists that have the periods of continued fractions for non-perfect square roots in each of them.
What I'm trying to do with them is to check the size of the largest repeating pattern in each list.
Some of the lists for example:
[
[1,1,1,1,1,1....],
[4,1,4,1,4,1....],
[1,2,10,1,2,10....],
[1,1,1,1,1,4,1,4,1,20,9,8,1,1,1,1,1,4,1,4,1,20,9,8....],
[2,2,2,4,2,2,2,4....],
[1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15,1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15....],
[1,1,1,3,28,1,1,1,3,28,67,25,1,1,1,3,28,1,1,1,3,28,67,25....]
]
The two similar methods that I've been working with are:
def lengths(seq):
    for i in range(len(seq),1,-1):
        if seq[0:i] == seq[i:i*2]:
            return i

def lengths(seq):
    for i in range(1,len(seq)-1):
        if seq[0:i] == seq[i:i*2]:
            return i
Both compare the first i elements of the list with the next i elements, for different values of i.
The problem with the first one is that it returns the wrong answer for a single repeating digit, because it starts with the largest candidate and sees only the one large pattern.
The problem with the second is that there are nested patterns, as in the sixth and seventh example lists, and it is satisfied
with the nested repetition and overlooks the rest of the pattern.

Works (caught a typo in 4th element of your sample)
>>> seq_l = [
... [1,1,1,1,1,1],
... [4,1,4,1,4,1],
... [1,2,10,1,2,10],
... [1,1,1,1,1,4,1,4,1,20,9,8,1,1,1,1,1,4,1,4,1,20,9,8],
... [2,2,2,4,2,2,2,4,2,2,2,4,2,2,2,4],
... [1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15,1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15],
... [1,1,1,3,28,1,1,1,3,28,67,25,1,1,1,3,28,1,1,1,3,28,67,25]
... ]
>>>
>>> def rep_len(seq):
...     s_len = len(seq)
...     for i in range(1,s_len-1):
...         if s_len%i == 0:
...             j = s_len//i  # integer division, so this also works on Python 3
...             if seq == j*seq[:i]:
...                 return i
...
>>> [rep_len(seq) for seq in seq_l]
[1, 2, 3, 12, 4, 18, 12]
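Note that rep_len assumes the list holds a whole number of repetitions. If the list may be cut off mid-period, one possible variation (a sketch, not part of the answer above; the name smallest_period is made up here) is to look for the smallest shift at which the list agrees with itself:

```python
def smallest_period(seq):
    """Return the smallest p such that seq[i] == seq[i - p] for all i >= p.

    Hypothetical helper: unlike rep_len, it tolerates a truncated final
    repetition. Falls back to len(seq) when nothing repeats.
    """
    for p in range(1, len(seq)):
        # Shifting the list by p must reproduce the overlapping part.
        if seq[p:] == seq[:-p]:
            return p
    return len(seq)
```

On a very short sample a coincidental match near the end can be reported as a period, so this is only as reliable as the amount of data you feed it.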

If it's not unfeasible to convert your lists to strings, using regular expressions would make this a trivial task.
import re
lists = [
[1,1,1,1,1,1],
[4,1,4,1,4,1],
[1,2,10,1,2,10],
[1,1,1,1,1,4,1,4,1,20,9,8,1,1,1,1,1,4,1,4,1,20,9,8], #I think you had a typo in this one...
[2,2,2,4,2,2,2,4],
[1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15,1,1,1,13,21,45,3,3,1,16,4,1,4,1,1,1,24,15],
[1,1,1,3,28,1,1,1,3,28,67,25,1,1,1,3,28,1,1,1,3,28,67,25]
]
for l in lists:
    s = "x".join(str(i) for i in l)
    print(s)
    match = re.match(r"^(?P<foo>.*)x?(?P=foo)", s)
    if match:
        print(match.group('foo'))
    else:
        print("****")
    print()
(?P<foo>.*) creates a group known as "foo" and (?P=foo) matches that. Since regular expressions are greedy, you get the longest match by default. The "x?" just allows for a single x in the middle to handle even/odd lengths.

You could probably use a collections.defaultdict(int) to keep counts of all the sublists, unless you know there are some sublists you don't care about. Convert the sublists to tuples before making them dictionary keys.
You might be able to get somewhere using a series of bloom filters though, if space is tight. You'd have one bloom filter for subsequences of length 1, another for subsequences of length 2, etc. Then the largest bloom filter that gets a collision has your maximum length sublist.
http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/

I think you just have to check two levels of sequences at once: 0..i == i..i*2 and 0..i/2 != i/2..i.
def lengths(seq):
    for i in range(len(seq),1,-1):
        # i//2 keeps the slice indices integers on Python 3
        if seq[0:i] == seq[i:i*2] and seq[0:i//2] != seq[i//2:i]:
            return i
If the two halves of 0..i are equal then it means that you are actually comparing two concatenated patterns with each other.

Starting with the first example method, you could recursively search the sub pattern.
def lengths(seq):
    for i in range(len(seq)-1,1,-1):
        if seq[0:i] == seq[i:i*2]:
            j = lengths(seq[0:i]) # Search pattern for sub pattern
            if j < i and i % j == 0: # Found a smaller pattern; further, a longer repeated
                                     # pattern length must be a multiple of the shorter pattern length
                n = i/j # Number of pattern repetitions (might change to // if using Py3K)
                for k in range(1, n): # Check that all the smaller patterns are the same
                    if seq[0:j] != seq[j*n:j*(n+1)]: # Stop when we find a mismatch
                        return i # Not a repetition of smaller pattern
                else: return j # All the sub-patterns are the same, return the smaller length
            else: return i # No smaller pattern
I get the feeling this solution isn't quite correct, but I'll do some testing and edit it as necessary. (Quick note: Shouldn't the initial for loop start at len(seq)-1? If not, you compare seq[0:len] to seq[len:len], which seems silly, and would cause the recursion to loop infinitely.)
Edit: Seems sorta similar to the top answer in the related question senderle posted, so you'd best just go read that. ;)

Related

Extracting the subsequence of maximum length from a sequence [PYTHON] [duplicate]

This question already has an answer here:
Longest increasing unique subsequence
(1 answer)
Closed 6 years ago.
I have a sequence of values [1,2,3,4,1,5,1,6,7], and I have to find the longest subsequence of increasing length. However, the function needs to stop counting once it reaches a number lower than the previous one. The answer in this sequence in that case is [1,2,3,4]. As it has 4 values before being reset. How would I write the Python code for this?
Note: Finding the "longest increasing subsequence" seems to be a common challenge and so searching online I find a lot of solutions that would count for the entire length of the sequence, and return a subsequence of increasing values, ignoring any decrease, so in this case it would return [1,2,3,4,5,6,7]. That is not what I'm looking for.
It needs to count each subsequence, and reset the count upon reaching a number lower than the previous one. It then needs to compare all the subsequences counted, and return the longest one.
Thanks in advance.
Consider a function that generates all possible ascending subsequences: you would start with an empty list and add items until one element is less than (or equal to?) the previous one, at which point you save (yield) the subsequence and restart with a new subsequence.
One implementation using a generator could be this:
def all_ascending_subsequences(sequence):
    #use an iterator so we can pull out the first element without slicing
    seq = iter(sequence)
    try: #NOTE 1
        last = next(seq) # grab the first element from the sequence
    except StopIteration: # or if there are none just return
        #yield [] #NOTE 2
        return
    sub = [last]
    for value in seq:
        if value > last: #or check if their difference is exactly 1 etc.
            sub.append(value)
        else: #end of the subsequence, yield it and reset sub
            yield sub
            sub = [value]
        last = value
    #after the loop we send the final subsequence
    yield sub
Two notes about the handling of empty sequences:
NOTE 1: To finish a generator a StopIteration needs to be raised, so we could just let the one from next(seq) propagate out. However, when from __future__ import generator_stop is in effect (the default behaviour since Python 3.7) it would cause a RuntimeError, so to be future compatible we need to catch it and explicitly return.
NOTE 2: As I've written it, passing an empty list to all_ascending_subsequences would generate no values, which may not be the desired behaviour. Feel free to uncomment the yield [] to generate an empty list when passed an empty list.
Then you can just get the longest by calling max on the result with key=len
b = [1,2,3,4,1,5,1,6,7]
result = max(all_ascending_subsequences(b),key=len)
print("longest is", result)
#print(*all_ascending_subsequences(b))
b = [4,1,6,3,4,5,6,7,3,9,1,0]
def findsub(a):
    start = -1
    count = 0
    cur = a[0]
    for i, n in enumerate(a):
        if n == cur+1:  # compare values with ==, not identity with `is`
            if start == -1:
                start = i - 2
                count=1
            count+=1
            cur = n
        if n < cur and count > 1:
            return [a[j] for j in range(start,start+count+1)]
print(findsub(b))
A somewhat sloppy algorithm, but I believe it does what you want. Usually I would not have simply shown you code, but I suspect that is what you wanted, and I hope you can learn from it and create your own from what you learn.
A slightly better looking way, because I didn't like that:
b = [1,2,0,1,2,3,4,5]
def findsub(a):
    answer = [a[0]]
    for i in a[1:]:
        if answer[-1] + 1 == i:  # compare values with ==, not identity with `is`
            answer.append(i)
        if not len(answer) > 1:
            answer = [i]
        elif i < answer[-1] and len(answer) > 1:
            return answer
    return answer
print(findsub(b))
You need to do the following:
1. Create a function W that, given a list, returns the index of the first item that breaks the strictly increasing run at the start of the list (equivalently, the length of that run).
   For example, given the lists [1,2,3,4,1,2,3], [4,2,1], [5,6,7,8], it should return 4, 1, 4, respectively.
2. Create a variable maxim and set its value to 0.
3. Repeatedly do the following until your list is empty:
   - Call your function W on your list, and call the return value x.
   - If x is greater than maxim, set maxim to x. At this point, if you wish to store this sequence, you can use list-slice notation to get the portion of your list that contains it.
   - Delete that portion of your list from the list.
4. Finally, print maxim and, if you were storing the parts of your list containing the longest sequence, print the last one you stored.
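The steps above could be sketched roughly as follows. The function names are my own, and instead of deleting from the list the sketch advances a start index, which amounts to the same thing:

```python
def increasing_prefix_length(values, start=0):
    """The W step: length of the strictly increasing run beginning at `start`."""
    if start >= len(values):
        return 0
    end = start + 1
    while end < len(values) and values[end] > values[end - 1]:
        end += 1
    return end - start


def longest_increasing_run(values):
    """Return the longest strictly increasing contiguous run (first one on ties)."""
    best_start, maxim = 0, 0
    start = 0
    while start < len(values):
        x = increasing_prefix_length(values, start)
        if x > maxim:
            best_start, maxim = start, x
        start += x  # equivalent to deleting the run just measured
    return values[best_start:best_start + maxim]
```

Calling longest_increasing_run([1,2,3,4,1,5,1,6,7]) gives [1, 2, 3, 4], the run the question asks for.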

How to check if N can be expressed as sum of two other numbers in specific list

I have a list:
l = [1,3,4,6,7,8,9,11,13,...]
and a number n.
How do I efficiently check if the number n can be expressed as the sum of two numbers (repeats are allowed) within the list l.
If the number is in the list, it does not count unless it can be expressed as the sum of two numbers (e.g. for l = [2,3,4], 3 would not count, but 4 would).
This, embarrassingly, is what I've tried:
def is_sum_of_2num_inlist(n, num_list):
    # a list (not a filter object) so it can be looped over twice on Python 3
    num_list = [x for x in num_list if x < n]
    for num1 in num_list:
        for num2 in num_list:
            if num1+num2 == n:
                return True
    return False
Thanks
def summable(n, l):
    for v in l:
        l_no_v = l[:]
        l_no_v.remove(v)
        if n - v in l_no_v:
            return True
    return False
EDIT: Explanation...
itertools.combinations is a nice way to get all possible answers, but it's ~4x slower than this version, since this is a single loop that bails out once it gets to a possible solution.
This loops over the values in l and makes a copy of l with v removed, so that we don't add v to itself (i.e. no false positive if n = 4; l = [2, 1]). Then it subtracts v from n, and if that value is in the copy then there are two numbers that sum up to n. If you want to return those numbers instead of returning True, just return v, n - v.
Although you can check this by running through the list twice, I would recommend for performance converting the list to a set, since x in set() searches in constant time on average.
Since n can be the sum of the same number twice, all you have to do is iterate through the set once and check if n - i also occurs in the set.
Something like the following should work.
>>> def is_sum_of_numbers(n, numbers):
...     for i in numbers:
...         if n - i in numbers:
...             return True
...     return False
...
>>>
>>>
>>> numbers = {2,7,8,9}
>>> is_sum_of_numbers(9, numbers) # 2 + 7
True
>>> is_sum_of_numbers(5, numbers)
False
>>> is_sum_of_numbers(18, numbers) # 9 + 9
True
If the list is ordered, you could use two variables to go through the list, one starting at the beginning and one at the end. If the sum of the two values is greater than N, you move the variable at the end to the value that precedes it; if the sum is less than N, you move the variable at the beginning to the following value in the list. If the sum is N, you've found the two values. You can stop when the two variables meet each other.
If the list is not ordered, you start from the beginning of the list and use a variable x to go through it. You'll need a second structure such as a hash set. At every step you look up in the hash set whether the value N-x is in there. If it is, you've found the two numbers that add up to N. If it isn't, you add x to the hash set and assign to x the following value. I recommend using a hash set because both looking up and inserting are O(1) on average.
Both algorithms are linear.
I'm sorry I couldn't write directly the code in python because I don't use it.
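For what it's worth, the two approaches described in this answer might look roughly like this in Python. This is only a sketch with function names I made up; since the question allows using the same number twice, the two pointers are allowed to land on the same element and the hash-set version adds x before looking up N-x:

```python
def has_pair_sum_sorted(sorted_numbers, n):
    """Two-pointer scan; assumes sorted_numbers is in ascending order."""
    lo, hi = 0, len(sorted_numbers) - 1
    while lo <= hi:  # lo == hi is allowed because repeats count
        total = sorted_numbers[lo] + sorted_numbers[hi]
        if total == n:
            return True
        if total < n:
            lo += 1  # need a bigger sum
        else:
            hi -= 1  # need a smaller sum
    return False


def has_pair_sum_unsorted(numbers, n):
    """Single pass with a set of the values seen so far."""
    seen = set()
    for x in numbers:
        seen.add(x)  # added first so that x may pair with itself
        if n - x in seen:
            return True
    return False
```

If using the same element twice should not count, change the loop condition to lo < hi and do the lookup before the add.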
As I said in the comment HERE there's a video in which your problem is solved
If I got the OP's concern, then:
As the question says repeats are allowed within the list l, I think this process is good, though a bit slower. So if you need to count the occurrences as well as check that a solution exists, go with this answer; but if you only want a boolean existence check, go with the others, purely for performance reasons.
You can use itertools.combinations. It will give you all the combinations, not permutations. Now you can just use the sum function to get the sum.
from itertools import combinations
l = [1,3,4,6,7,8,9,11,13]
checks = [4,6] #these are the numbers to check
for chk in checks:
    for sm in combinations(l,2):
        if chk == sum(sm): #sum(sm) means sum((1,3)) for the first pass of the loop
            print(chk, sm) #Do something with the match, e.g. print it

Incorrect output when finding number of repeated digits in an integer

I am trying to find the number of duplicate (repeated) digits in a number, for each unique digit. eg. For 21311243 there would be 3 1's and 2 2's and 2 3's, so I would just need [3,2,2] where order doesn't matter. I am trying to do this as follows, where number_list = ['2','1','3','1','1','2','4', '3']. The code below works for the above number and I get that repeated_numbers = [2, 2, 3] as expected. However, for 213112433 repeated_numbers = [2, 3, 3, 2] for some reason and I don't know why this is and how to rectify the code below as such:
repeated_numbers = [] #The number of repeated digits for each unique digit
for a in number_list:
    i = 0
    for b in number_list:
        if (a == b):
            i = i + 1
    if i > 1: #If the particular digit is repeated
        repeated_numbers.append(i)
        number_list.remove(a) #Get rid of the digit that gets repeated from the list
Also, I am open to better ways of doing this as my way has O(n^2) complexity, but need that number_list is considered as a list of characters.
Removing elements in a list while iterating on it is not a good idea.
Look at this related question: Remove items from a list while iterating
I suggest using list comprehension to solve your problem.
As far as the specific bug in your program, I think you meant to remove all occurrences of a specific number when performing remove, but that's not what happens:
Look at: https://docs.python.org/2/tutorial/datastructures.html
list.remove(x)
Remove the first item from the list whose value is x. It is an error
if there is no such item.
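One way to avoid mutating the list altogether is to count first and filter afterwards. A minimal sketch using collections.Counter (the function name is just for illustration, and this is a swapped-in technique rather than a fix to the loop above):

```python
from collections import Counter


def repeated_digit_counts(number_list):
    """Return the count of each digit that occurs more than once.

    number_list is a list of single-character strings, as in the question.
    Counts come out in order of each digit's first appearance.
    """
    counts = Counter(number_list)  # one pass over the list
    return [c for c in counts.values() if c > 1]
```

Since the list is never modified while it is being read, no digit gets skipped or counted twice.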

Multiple mismatches in DNA search sequence regex

I have written this barbaric script to create permutations of a string of characters that contain n (up to n=4) $'s in all possible combinations of positions within the string. I will eventually .replace('$','(\\w)') to use for mismatches in a dna search sequence. Because of the way I wrote the script, some of the permutations have less than the requested number of $'s. I then wrote a script to remove them, but it doesn't seem to be effective, and each time I run the removal script, it removes more of the unwanted permutations. In the code pasted below, you will see that I test the function with a simple sequence with 4 mismatches. I then run a series of removal scripts that count how many expressions are removed each time...in my experience, it takes about 8 times to remove all expressions with less than 4 wild-card $'s. I have a couple questions about this:
Is there a built in function for searches with 'n' mismatches? Maybe even in biopython? So far, I've seen the Paul_McGuire_regex function:
Search for string allowing for one mismatch in any location of the string,
which seems only to generate 1 mismatch. I must admit, I don't fully understand all of the code in the remaining functions on that page, as I am a very new coder.
Since I see this as a good exercise for me, is there a better way to write this entire script?...Can I iterate Paul_McGuire_regex function as many times as I need?
Most perplexing to me, why won't the removal script work 100% the first time?
Thanks for any help you can provide!
def Mismatch(Search,n):
    List = []
    SearchL = list(Search)
    if n > 4:
        return("Error: Maximum of 4 mismatches")
    for i in range(0,len(Search)):
        if n == 1:
            SearchL_i = list(Search)
            SearchL_i[i] = '$'
            List.append(''.join(SearchL_i))
        if n > 1:
            for j in range (0,len(Search)):
                if n == 2:
                    SearchL_j = list(Search)
                    SearchL_j[i] = '$'
                    SearchL_j[j] = '$'
                    List.append(''.join(SearchL_j))
                if n > 2:
                    for k in range(0,len(Search)):
                        if n == 3:
                            SearchL_k = list(Search)
                            SearchL_k[i] = '$'
                            SearchL_k[j] = '$'
                            SearchL_k[k] = '$'
                            List.append(''.join(SearchL_k))
                        if n > 3:
                            for l in range(0,len(Search)):
                                if n ==4:
                                    SearchL_l = list(Search)
                                    SearchL_l[i] = '$'
                                    SearchL_l[j] = '$'
                                    SearchL_l[k] = '$'
                                    SearchL_l[l] = '$'
                                    List.append(''.join(SearchL_l))
    counter=0
    for el in List:
        if el.count('$') < n:
            counter+=1
            List.remove(el)
    return(List)
List_RE = Mismatch('abcde',4)
counter = 0
for el in List_RE:
    if el.count('$') < 4:
        List_RE.remove(el)
        counter+=1
print("Filter2="+str(counter))
We can do away with questions 2 and 3 by answering question 1, but understanding question 3 is important so I'll do that first and then show how you can avoid it entirely:
Question 3
As to question 3, it's because when you loop over a list in python and make changes to it within the loop, the list that you loop over changes.
From the python docs on control flow (for statement section):
It is not safe to modify the sequence being iterated over in the loop
(this can only happen for mutable sequence types, such as lists).
Say your list is [a,b,c,d] and you loop through it with for el in List.
Say el is currently a and you do List.remove(el).
Now, your list is [b,c,d]. However, the iterator points to the second element in the list (since it's done the first), which is now c.
In essence, you've skipped b. So the problem is that you are modifying the list you are iterating over.
There are a few ways to fix this: if your List is not expensive to duplicate, you could make a copy. So iterate over List[:] but remove from List.
But suppose it's expensive to make copies of List all the time.
Then what you do is iterate over it backwards. Note the reversed below:
for el in reversed(List):
    if el.count('$') < n:
        counter+=1
        List.remove(el)
return(List)
In the example above, suppose we iterate backwards over List.
The iterator starts at d, and then goes to c.
Suppose we remove c, so that List=[a,b,d].
Since the iterator is going backwards, it now points to element b, so we haven't skipped anything.
Basically, this avoids modifying bits of the list you have yet to iterate over.
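To see the skipping for yourself, here is a small self-contained sketch. The sample strings are made up for the demonstration; the last line shows a third option, building a new filtered list instead of removing in place:

```python
patterns = ["$$$$e", "$$$de", "ab$$$", "$$c$$", "abcde", "a$$$$"]

# Buggy: removing from the list we are looping over.
buggy = list(patterns)
for el in buggy:
    if el.count("$") < 4:
        buggy.remove(el)  # shifts later items left, so the next one is skipped

# "ab$$$" has only three '$' but survives, because it slid into the
# slot the iterator had already passed when "$$$de" was removed.
survivors = [el for el in buggy if el.count("$") < 4]

# Safe: build a new list and leave the original untouched.
filtered = [el for el in patterns if el.count("$") >= 4]
```

This is the same effect as in the question: each extra pass of the removal loop catches a few more of the items that the previous pass skipped.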
Questions 1 & 2
If I understand your question correctly, you basically want to choose n out of m positions, where m is the length of the string (abcde), and place a '$' in each of these n positions.
In that case, you can use the itertools module to do that.
import itertools
def Mismatch(Search,n):
    SearchL = list(Search)
    List = [] # hold output
    # generate every combination of n indices to replace with '$'
    idxs = itertools.combinations(range(len(SearchL)),n)
    # for each combination `idx` in idxs, replace chars[i] with '$' for each i in idx:
    for idx in idxs:
        chars = SearchL[:] # make a copy (named chars to avoid shadowing the built-in str)
        for i in idx:
            chars[i]='$'
        List.append( ''.join(chars) ) # convert back to string
    return List
Let's look at how this works:
Turn the Search string into a list so it can be modified, and create an empty List to hold the results.
idxs = itertools.combinations(range(len(SearchL)),n) says "find all subsets of length n in the set [0,1,2,3,...,length-of-search-string -1]".
Try
idxs = itertools.combinations(range(5),4)
for idx in idxs:
    print(idx)
to see what I mean.
Each element of idxs is a tuple of n indices from 0 to len(SearchL)-1 (e.g. (0,1,2,4)). Replace the i'th character of a copy of SearchL with a '$' for each i in the tuple.
Convert the result back into a string and add it to List.
As an example:
Mismatch('abcde',3)
['$$$de', '$$c$e', '$$cd$', '$b$$e', '$b$d$', '$bc$$', 'a$$$e', 'a$$d$', 'a$c$$', 'ab$$$']
Mismatch('abcde',4) # note, the code you had made lots of duplicates.
['$$$$e', '$$$d$', '$$c$$', '$b$$$', 'a$$$$']
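As for actually searching with these patterns, I'm not aware of a built-in for it, but here is a rough sketch of how the wildcard strings could be turned into regexes and run over a sequence. The function names are invented for this example, and it uses \w as the wildcard, as in the question's planned .replace:

```python
import itertools
import re


def mismatch_patterns(search, n):
    """Yield regex strings in which exactly n positions of `search` are wildcards."""
    for idxs in itertools.combinations(range(len(search)), n):
        chars = [re.escape(c) for c in search]
        for i in idxs:
            chars[i] = r"\w"  # wildcard also matches the original character
        yield "".join(chars)


def find_with_mismatches(search, n, text):
    """Return sorted start offsets where `search` occurs in `text`
    with at most n mismatched characters."""
    hits = set()
    for pattern in mismatch_patterns(search, n):
        # The lookahead makes overlapping occurrences visible to finditer.
        for m in re.finditer("(?=" + pattern + ")", text):
            hits.add(m.start())
    return sorted(hits)
```

Because a wildcard can also match the original letter, wildcarding exactly n positions finds everything with n or fewer mismatches. The number of patterns grows quickly with the length of the search string, so for long sequences a direct character-by-character comparison of each window may well be faster.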

Count the number of occurrences of a given item in a (sorted) list?

I'm asked to create a method that returns the number of occurrences of a given item in a list. I know how to write code to find a specific item, but how can I code it so that it counts the number of occurrences of any given item?
For example if I have a list [4, 6, 4, 3, 6, 4, 9] and I type something like
s1.count(4), it should return 3 or s1.count(6) should return 2.
I'm not allowed to use any built-in functions though.
In a recent assignment, I was asked to count the number of occurrences that sub string "ou" appeared in a given string, and I coded it
def count_pattern(astr):
    if len(astr) < 2:
        return 0
    else:
        return (astr[:2] == "ou") + count_pattern(astr[1:])
Would something like this work??
def count(self, item):
    num=0
    for i in self.s_list:
        if i in self.s_list:
            num[i] +=1
def __str__(self):
    return str(self.s_list)
If this list is already sorted, the "most efficient" method -- in terms of Big-O -- would be to perform a binary search with a count-forward/count-backward if the value was found.
However, for an unsorted list as in the example, then the only way to count the occurrences is to go through each item in turn (or sort it first ;-). Here is some pseudo-code, note that it is simpler than the code presented in the original post (there is no if x in list or count[x]):
set count to 0
for each element in the list:
    if the element is what we are looking for:
        add one to count
Happy coding.
If I told you to count the number of fours in the following list, how would you do it?
1 4 2 4 3 8 2 1 4 2 4 9 7 4
You would start by remembering no fours yet, and add 1 for each element that equals 4. To traverse a list, you can use a for statement. Given an element of the list el, you can check whether it is four like this:
if el == 4:
    # TODO: Add 1 to the counter here
In response to your edit:
You're currently testing if i in self.s_list:, which doesn't make any sense since i is an element of the list and therefore always present in it.
When adding to a number, you simply write num += 1. Brackets are only necessary if you want to access the values of a list or dictionary.
Also, don't forget to return num at the end of the function so that somebody calling it gets the result back.
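Putting those three fixes together, the method might look something like this. The class name and constructor are guesses on my part, since the question only shows count and __str__:

```python
class SortedList:
    """Hypothetical wrapper class; only s_list, count and __str__ come from the question."""

    def __init__(self, items):
        self.s_list = list(items)

    def count(self, item):
        num = 0
        for element in self.s_list:
            if element == item:  # compare against the item we were asked about
                num += 1         # plain integer counter, no brackets
        return num               # hand the result back to the caller

    def __str__(self):
        return str(self.s_list)
```

With s1 = SortedList([4, 6, 4, 3, 6, 4, 9]), s1.count(4) gives 3 and s1.count(6) gives 2.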
Actually the most efficient method in terms of Big-O would be O(log n). @pst's method would result in O(log n + s), where s is the number of occurrences, which could become linear if the array is made up of equal elements.
The way to achieve O(log n) would be to use 2 binary searches (which gives O(2log n), but we discard constants, so it is still O(log n)) that are modified to not have an equality test, therefore making all searches unsuccessful. However, on an unsuccessful search (low > high) we return low.
In the first search, if the middle element is greater than your search term, recurse into the lower part of the array, else recurse into the higher part. In the second search, reverse the binary comparison.
The first search yields the right boundary of the run of equal elements and the second search yields the left boundary. Simply subtract to get the number of occurrences.
Based on algorithm described in Skiena.
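A sketch of the two-search idea, written with loops over half-open ranges rather than the recursive form, so treat it as one possible variation and not as the book's code; the function names are mine:

```python
def lower_bound(items, target):
    """Index of the first element >= target in a sorted list."""
    low, high = 0, len(items)
    while low < high:
        mid = (low + high) // 2
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid
    return low


def upper_bound(items, target):
    """Index of the first element > target in a sorted list."""
    low, high = 0, len(items)
    while low < high:
        mid = (low + high) // 2
        if items[mid] <= target:
            low = mid + 1
        else:
            high = mid
    return low


def count_sorted(items, target):
    """Occurrences of target in a sorted list, using two binary searches."""
    return upper_bound(items, target) - lower_bound(items, target)
```

Neither search stops early on an equal element, which is what keeps the cost logarithmic even when the whole list is one repeated value. The list must already be sorted.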
This seems like a homework... anyways. Try list.count(item). That should do the job.
Third or fourth element here:
http://docs.python.org/tutorial/datastructures.html
Edit:
try something else like:
bukket = dict()
for elem in astr:
    if elem not in bukket.keys():
        bukket[elem] = 1
    else:
        bukket[elem] += 1
You can now get all the distinct elements of the list with bukket.keys() and the corresponding number of occurrences with bukket[key].
So you can test it:
import random
l = []
for i in range(0,200):
    l.append(random.randint(0,20))
print(l)
l.sort()
print(l)
bukket = dict()
for elem in l:
    if elem not in bukket.keys():
        bukket[elem] = 1
    else:
        bukket[elem] += 1
print(bukket)
