Finding a repeated substring of length at least k in a string using a function - Python

I just started working with functions and I'm trying to build one that finds a repeated substring whose length is at least k and returns the results as a tuple that contains a dict.
The keys need to be the substrings and the values how many times each was repeated, and then the length of the substring is added to the tuple.
I just started, but I didn't really know how to continue. This is what I tried to do:
import collections

def longest_repeat(string, k):
    if isinstance(k, int) and isinstance(string, str):
        # slide a window of length k across the string
        slices = []
        a = 0
        b = k
        for _ in range(len(string) - k + 1):
            slices.append(string[a:b])
            a += 1
            b += 1
        # keep only the substrings that occur more than once
        repeated = [item for item, count in collections.Counter(slices).items() if count > 1]
        repeated_dict = dict(zip(repeated, [0 for x in range(0, len(repeated))]))
        for key in repeated_dict:
            repeated_dict[key] = slices.count(key)
        return repeated_dict
I'm sorry if it's a little bit messed up; I didn't really have a direction, and I tried to reuse another function I built, but it didn't really work. I can clarify more if needed.
The output should be something like this:
longest_repeated("ATAATACATAATA", 5)
output: ({'ATAATA': 2}, 6)
Really appreciate any kind of help! Thanks!

You can try the re module:

import re

def longest_repeated(s, k):
    m = re.findall(f"(.{{{k},}})(?=.*\\1)", s)
    if m:
        mx = max(m, key=len)
        return {mx: s.count(mx)}, len(mx)
Some tests:
print(longest_repeated("ATAATACATAATA", 5))
({'ATAATA': 2}, 6)
print(longest_repeated("XXXXXATAATACATAATAXXXXX", 5))
({'ATAATA': 2}, 6)
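If you'd rather avoid regex entirely, here is a minimal sliding-window sketch of the same idea (my own variant, not part of the answer above): count every window of length at least k with collections.Counter and keep the longest one that repeats. Note it enumerates O(n²) windows, so it only suits short strings, and it counts overlapping occurrences, unlike s.count:

from collections import Counter

def longest_repeated_windows(s, k):
    # count every substring of length >= k by its start/end positions
    counts = Counter(s[i:j] for i in range(len(s))
                     for j in range(i + k, len(s) + 1))
    repeats = [sub for sub, n in counts.items() if n > 1]
    if not repeats:
        return None
    best = max(repeats, key=len)
    return {best: counts[best]}, len(best)

print(longest_repeated_windows("ATAATACATAATA", 5))  # ({'ATAATA': 2}, 6)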

Related

Remove similar items in a list in Python

How do you remove similar items in a list in Python, but only for a given item? Example:
l = list('need')
If 'e' is the given item then
l = list('nd')
The set() function will not do the trick, since it will remove all duplicates.
count() and remove() are not efficient.
Use filter, assuming you write a function that decides which items you want to keep in the list. For your example:

def pred(x):
    return x != "e"

l = list("need")
l = list(filter(pred, l))
Assuming given = 'e' and l = list('need'):

for i in range(l.count(given)):
    l.remove(given)
If you just want to remove 'e' from the words in a list, you can use regex re.sub(). If you also want a count of how many occurrences of 'e' were removed from each word, you can use re.subn(). The first will give you the strings in a list. The second will give you a list of tuples (string, n), where n is the number of occurrences replaced.
import re
lst = list(('need','feed','seed','deed','made','weed','said'))
j = [re.sub('e','',i) for i in lst]
k = [re.subn('e','',i) for i in lst]
The output for j and k are :
j = ['nd', 'fd', 'sd', 'dd', 'mad', 'wd', 'said']
k = [('nd', 2), ('fd', 2), ('sd', 2), ('dd', 2), ('mad', 1), ('wd', 2), ('said', 0)]
If you want to count the total changes made, just iterate through k and sum it. There are other, simpler ways too. You can simply use regex:

re.subn('e', '', ''.join(lst))[1]

This will give you the total number of items replaced in the list.
List comprehension method. Not sure if the size/complexity is less than that of count and remove.

def scrub(l, given):
    return [i for i in l if i not in given]

Filter method; again, I'm not sure.

def filter_by(l, given):
    return list(filter(lambda x: x not in given, l))

Brute force with recursion, but there are a lot of potential downfalls. Still an option. Again, I don't know the size/complexity.

def bruteforce(l, given):
    try:
        l.remove(given[0])
        return bruteforce(l, given)
    except ValueError:
        return bruteforce(l, given[1:])
    except IndexError:
        return l
For those of you curious as to the actual time taken by the above methods, I've taken the liberty of testing them below!
Below is the timing method I've chosen to use.

import datetime

def timer(func, name):
    print("-------{}-------".format(name))
    try:
        start = datetime.datetime.now()
        x = func()
        end = datetime.datetime.now()
        print((end - start).microseconds)
    except Exception as e:
        print("Failed: {}".format(e))
    print("\r")
The dataset we are testing against, where l is our original list, q is the items we want to remove, and r is our expected result:
l = list("need"*50000)
q = list("ne")
r = list("d"*50000)
For posterity, I've added the count/remove method the OP was against. (For good reason!)

def count_remove(l, given):
    for i in given:
        for x in range(l.count(i)):
            l.remove(i)
    return l
All that's left to do is test!
timer(lambda: scrub(l, q), "List Comp")
assert(scrub(l,q) == r)
timer(lambda: filter_by(l, q), "Filter")
assert(filter_by(l,q) == r)
timer(lambda : count_remove(l, q), "Count/Remove")
assert(count_remove(l,q) == r)
timer(lambda: bruteforce(l, q), "Bruteforce")
assert(bruteforce(l,q) == r)
And our results
-------List Comp-------
10000
-------Filter-------
28000
-------Count/Remove-------
199000
-------Bruteforce-------
Failed: maximum recursion depth exceeded
The recursion method failed with the larger dataset, but we expected this. I tested on smaller datasets too, where recursion is marginally slower; I thought it would be faster.
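As a side note, a more robust way to time these in modern Python is the timeit module. Here is a minimal sketch (my addition, not part of the original benchmark); it only times scrub, which does not mutate its input, so the same list can safely be reused across calls:

import timeit

def scrub(l, given):
    return [i for i in l if i not in given]

l = list("need" * 50000)
q = list("ne")

# average seconds per call over 10 calls
t = timeit.timeit(lambda: scrub(l, q), number=10)
print(t / 10)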

Remove adjacent duplicates given a condition

I'm trying to write a function that takes a string and, given an integer, removes all adjacent duplicates beyond that integer and outputs the remaining string. I have this function right now, which removes all the duplicates in a string, and I'm not sure how to add the integer constraint:
def remove_duplicates(string):
    s = set()
    chars = []  # renamed from `list` to avoid shadowing the built-in
    for i in string:
        if i not in s:
            s.add(i)
            chars.append(i)
    return ''.join(chars)

string = "abbbccaaadddd"
print(remove_duplicates(string))
This outputs
abc
What I would want is a function like
def remove_duplicates(string, int):
    ...

where, if for the same string I input int=2, I want to keep at most 2 adjacent copies of each character instead of removing them all. The output should be
abbccaadd
I'm also concerned about run time and complexity for very large strings, so if my initial approach is bad, please suggest a different approach. Any help is appreciated!
Not sure I understand your question correctly. I think that, given m repetitions of a character, you want to remove up to k*n duplicates such that k*n < m.
You could try this, using groupby:
>>> from itertools import groupby
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for k, g in groupby(string) for c in k * (len(list(g)) % n or n))
'abccadd'
Here, k * (len(list(g)) % n or n) means len(g) % n repetitions, or n if that number is 0.
Oh, you changed it... now my original answer with my "interpretation" of your output actually works. You can use groupby together with islice to get at most n characters from each group of duplicates.
>>> from itertools import groupby, islice
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for _, g in groupby(string) for c in islice(g, n))
'abbccaadd'
Create groups of letters, but compute the length of each group, capped by your parameter. Then rebuild the groups and join:

import itertools

def remove_duplicates(string, maxnb):
    groups = ((k, min(len(list(v)), maxnb)) for k, v in itertools.groupby(string))
    return "".join(itertools.chain.from_iterable(v * k for k, v in groups))

string = "abbbccaaadddd"
print(remove_duplicates(string, 2))

This prints:

abbccaadd
It can be a one-liner as well (cover your eyes!):

return "".join(itertools.chain.from_iterable(v * k for k, v in ((k, min(len(list(v)), maxnb)) for k, v in itertools.groupby(string))))

I'm not sure about the min(len(list(v)), maxnb) repeat value, which can be adapted to suit your needs with a modulo (like len(list(v)) % maxnb), etc.
You should avoid using int as a variable name, as it shadows the Python built-in type.
Here is a vanilla function that does the job:

def deduplicate(string: str, threshold: int) -> str:
    res = ""
    last = ""
    count = 0
    for c in string:
        if c != last:
            # a new run starts: reset the counter
            count = 0
            last = c
        if count < threshold:
            # keep at most `threshold` copies of each run
            res += c
        count += 1
    return res
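For reference, a quick check against the example from the question (my addition):

print(deduplicate("abbbccaaadddd", 2))  # abbccaadd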

Find all Occurrences of Every Substring in a String

I am trying to find all occurrences of sub-strings in a main string (of all lengths). My function takes one string and then returns a dictionary of every sub-string (which occurs more than once, of course) and how many times it occurs (format of the dictionary: {substring: # of occurrences, ...}). I am using collections.Counter(s) to help me with it.
Here is my function:
from collections import Counter

def patternFind(s):
    patterns = {}
    for index in range(1, len(s) + 1)[::-1]:
        d = nChunks(s, step=index)
        parts = dict(Counter(d))
        patterns.update({elem: parts[elem] for elem in parts.keys() if parts[elem] > 1})
    return patterns

def nChunks(iterable, start=0, step=1):
    return [iterable[i:i+step] for i in range(start, len(iterable), step)]
I have a string, data, with about 2500 random letters (in a random order). However, there are 2 strings inserted into it (at random points). Say this string is 'TEST'. data.count('TEST') returns 2. However, patternFind(data)['TEST'] gives me a KeyError. Therefore, my program does not detect the two strings in it.
What have I done wrong? Thanks!
Edit: My method of creating test instances:

# assumed imports (Python 2): randint/choice from random, uppercase from string
from random import randint, choice
from string import uppercase

def createNewTest():
    n = randint(500, 2500)
    x, y = randint(500, n), randint(500, n)
    s = ''
    for i in range(n):
        s += choice(uppercase)
        if i == x or i == y:
            s += "TEST"
    return s
Using Regular Expressions
Apart from the count() method you described, regex is an obvious alternative:

import re

needle = r'TEST'
haystack = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklagh'
pattern = re.compile(needle)
print len(re.findall(pattern, haystack))
Short Cut
If you need to build a dictionary of substrings, you can possibly do it with only a subset of those strings. Assuming you know the needle you are looking for in the data, you only need the dictionary of substrings of data that are the same length as the needle. This is very fast.

from collections import Counter

needle = "TEST"

def gen_sub(s, len_chunk):
    for start in range(0, len(s) - len_chunk + 1):
        yield s[start:start+len_chunk]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
parts = Counter([sub for sub in gen_sub(data, len(needle))])
print parts[needle]
Brute Force: building a dictionary of all substrings
If you need a count of all possible substrings, this works, but it is very slow:

from collections import Counter

def gen_sub(s):
    for start in range(0, len(s)):
        for end in range(start + 1, len(s) + 1):
            yield s[start:end]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhz'
parts = Counter([sub for sub in gen_sub(data)])
print parts['TEST']
Substring generator adapted from this: https://stackoverflow.com/a/8305463/1290420
While jurgenreza has explained why your program didn't work, the solution is still quite slow. If you only examine substrings s for which you know that s[:-1] repeats, you get a much faster solution (typically a hundred times faster or more):

from collections import defaultdict

def pfind(prefix, sequences):
    collector = defaultdict(list)
    for sequence in sequences:
        collector[sequence[0]].append(sequence)
    for item, matching_sequences in collector.items():
        if len(matching_sequences) >= 2:
            new_prefix = prefix + item
            yield (new_prefix, len(matching_sequences))
            for r in pfind(new_prefix, [sequence[1:] for sequence in matching_sequences]):
                yield r

def find_repeated_substrings(s):
    s0 = s + " "
    return pfind("", [s0[i:] for i in range(len(s))])

If you want a dict, you call it like this:

result = dict(find_repeated_substrings(s))

On my machine, for a run with 2247 elements, it took 0.02 sec, while the original (corrected) solution took 12.72 sec.
(Note that this is a rather naive implementation; using indexes instead of substrings should be even faster.)
Edit: The following variant works with other sequence types (not only strings). Also, it doesn't need a sentinel element.

from collections import defaultdict

def pfind(s, length, ends):
    collector = defaultdict(list)
    if ends[-1] >= len(s):
        del ends[-1]
    for end in ends:
        if end < len(s):
            collector[s[end]].append(end)
    for key, matching_ends in collector.items():
        if len(matching_ends) >= 2:
            end = matching_ends[0]
            yield (s[end - length: end + 1], len(matching_ends))
            for r in pfind(s, length + 1, [end + 1 for end in matching_ends if end < len(s)]):
                yield r

def find_repeated_substrings(s):
    return pfind(s, 0, list(range(len(s))))
This still has the problem that very long substrings will exceed recursion depth. You might want to catch the exception.
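For example, a minimal sketch of that exception handling (my addition; on Python 3.5+ the exception is RecursionError, older versions raise RuntimeError), which keeps whatever was collected before the recursion limit was hit:

def find_repeated_substrings_safe(s):
    results = {}
    try:
        for sub, count in find_repeated_substrings(s):
            results[sub] = count
    except RecursionError:
        # a very long repeat exceeded the recursion limit;
        # return the partial results gathered so far
        pass
    return results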
The problem is in your nChunks function. It does not give you all the chunks that are necessary.
Let's consider a test string:

s = '1test2345test'

For chunks of size 4, your nChunks function gives this output:

>>> nChunks(s, step=4)
['1tes', 't234', '5tes', 't']

But what you really want is:

>>> def nChunks(iterable, start=0, step=1):
...     return [iterable[i:i+step] for i in range(len(iterable)-step+1)]
>>> nChunks(s, step=4)
['1tes', 'test', 'est2', 'st23', 't234', '2345', '345t', '45te', '5tes', 'test']

You can see that this way there are two 'test' chunks, and your patternFind(s) will work like a charm:

>>> patternFind(s)
{'tes': 2, 'st': 2, 'te': 2, 'e': 2, 't': 4, 'es': 2, 'est': 2, 'test': 2, 's': 2}
Here you can find a solution that uses a recursive wrapper around string.find() that searches for all the occurrences of a substring in a main string.
The collectallchuncks() function returns a defaultdict with all the substrings as keys and, for each substring, a list of all the indexes where it is found in the main string.
import collections

# Minimum substring size, may be 1
MINSIZE = 3

# Recursive wrapper around str.find()
def recfind(p, data, pos, acc):
    res = data.find(p, pos)
    if res == -1:
        return acc
    else:
        acc.append(res)
        return recfind(p, data, res + 1, acc)

def collectallchuncks(data):
    res = collections.defaultdict(str)
    size = len(data)
    for base in xrange(size):
        for seg in xrange(MINSIZE, size - base + 1):
            chunk = data[base:base+seg]
            if data.count(chunk) > 1:
                res[chunk] = recfind(chunk, data, 0, [])
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks(data)
    print 'TEST', allchuncks['TEST']
    print 'hklag', allchuncks['hklag']
EDIT: If you just need the number of occurrences of each substring in the main string, you can easily obtain it by getting rid of the recursive function:
import collections

MINSIZE = 3

def collectallchuncks2(data):
    res = collections.defaultdict(str)
    size = len(data)
    for base in xrange(size):
        for seg in xrange(MINSIZE, size - base + 1):
            chunk = data[base:base+seg]
            cnt = data.count(chunk)
            if cnt > 1:
                res[chunk] = cnt
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks2(data)
    print 'TEST', allchuncks['TEST']
    print 'hklag', allchuncks['hklag']

Longest common prefix using buffer?

If I have an input string and an array:
s = "to_be_or_not_to_be"
pos = [15, 2, 8]
I am trying to find the longest common prefix between the consecutive elements of the array pos referencing the original s. I am trying to get the following output:
longest = [3,1]
The way I obtained this is by computing the longest common prefix of the following pairs:
s[15:], which is _be, and s[2:], which is _be_or_not_to_be, giving 3 ( _be )
s[2:], which is _be_or_not_to_be, and s[8:], which is _not_to_be, giving 1 ( _ )
However, if s is huge, I don't want to create multiple copies when I do something like s[x:]. After hours of searching, I found the buffer function, which maintains only one copy of the input string, but I wasn't sure about the most efficient way to utilize it in this context. Any suggestions on how to achieve this?
Here is a method without buffer which doesn't copy, as it only looks at one character at a time:

from itertools import islice, izip

s = "to_be_or_not_to_be"
pos = [15, 2, 8]
length = len(s)
for start1, start2 in izip(pos, islice(pos, 1, None)):
    pref = 0
    for pos1, pos2 in izip(xrange(start1, length), xrange(start2, length)):
        if s[pos1] == s[pos2]:
            pref += 1
        else:
            break
    print pref
# prints 3 1
I use islice, izip, and xrange in case you're talking about potentially very long strings.
I also couldn't resist this "one-liner", which doesn't even require any indexing:

[next((i for i, (a, b) in
       enumerate(izip(islice(s, start1, None), islice(s, start2, None)))
       if a != b),
      length - max((start1, start2)))
 for start1, start2 in izip(pos, islice(pos, 1, None))]
One final method, using os.path.commonprefix:
[len(commonprefix((buffer(s, n), buffer(s, m)))) for n, m in zip(pos, pos[1:])]
>>> import os
>>> os.path.commonprefix([s[i:] for i in pos])
'_'
Let Python manage memory for you. Don't optimize prematurely.
To get the exact output you could do (as @agf suggested):

print [len(commonprefix([buffer(s, i) for i in adj_indexes]))
       for adj_indexes in zip(pos, pos[1:])]
# -> [3, 1]
I think your worrying about copies is unfounded. See below:
>>> s = "how long is a piece of string...?"
>>> t = s[12:]
>>> print t
a piece of string...?
>>> id(t[0])
23295440
>>> id(s[12])
23295440
>>> id(t[2:20]) == id(s[14:32])
True
Unless you're copying the slices and leaving references to the copies hanging around, I wouldn't think it could cause any problem.
edit: There are technical details with string interning and stuff that I'm not really clear on myself. But I'm sure that a string slice is not always a copy:
>>> x = 'google.com'
>>> y = x[:]
>>> x is y
True
I guess the answer I'm trying to give is: just let Python manage its memory itself to begin with; you can look at memory buffers and views later if needed. And if this is already a real problem for you, update your question with details of what the actual problem is.
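As a side note for Python 3 readers (my addition; the buffer built-in was removed in Python 3): the closest modern equivalent is a memoryview over a bytes object, whose slices are zero-copy:

s = b"to_be_or_not_to_be"
pos = [15, 2, 8]
view = memoryview(s)

def common_prefix_len(a, b):
    # compare byte by byte; slicing a memoryview makes no copies
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

print([common_prefix_len(view[i:], view[j:]) for i, j in zip(pos, pos[1:])])
# -> [3, 1]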
One way of doing this using buffer is given below. However, there could be much faster ways.

s = "to_be_or_not_to_be"
pos = [15, 2, 8]
lcp = []
length = len(pos) - 1
for index in range(0, length):
    pre = buffer(s, pos[index])
    cur = buffer(s, pos[index+1], pos[index+1] + len(pre))
    count = 0
    shorter, longer = min(pre, cur), max(pre, cur)
    for i, c in enumerate(shorter):
        if c != longer[i]:
            break
        else:
            count += 1
    lcp.append(count)

print
print lcp

Python: Check the occurrences in a list against a value

lst = [1,2,3,4,1]

I want to know that 1 occurs twice in this list. Is there an efficient way to do this?
lst.count(1) would return the number of times it occurs. If you're going to be counting items in a list, O(n) is what you're going to get.
The general function on lists is list.count(x); it will return the number of times x occurs in a list.
Are you asking whether every item in the list is unique?
len(set(lst)) == len(lst)
Whether 1 occurs more than once?
lst.count(1) > 1
Note that the above is not maximally efficient, because it won't short-circuit -- even if 1 occurs twice, it will still count the rest of the occurrences. If you want it to short-circuit you will have to write something a little more complicated.
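For example, a minimal short-circuiting version (my sketch, not from the answer above) stops scanning as soon as enough occurrences have been seen:

def occurs_at_least(lst, x, n=2):
    # return True as soon as x has been seen n times
    seen = 0
    for item in lst:
        if item == x:
            seen += 1
            if seen >= n:
                return True
    return False

print(occurs_at_least([1, 2, 3, 4, 1], 1))  # True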
Whether the first element occurs more than once?
lst[0] in lst[1:]
How often each element occurs?
import collections
collections.Counter(lst)
Something else?
For multiple occurrences, this gives you the index of each occurrence:

>>> lst=[1,2,3,4,5,1]
>>> tgt=1
>>> found=[]
>>> for index, suspect in enumerate(lst):
...     if(tgt==suspect):
...         found.append(index)
...
>>> print len(found), "found at index:",", ".join(map(str,found))
2 found at index: 0, 5
If you want the count of each item in the list:

>>> lst=[1,2,3,4,5,2,2,1,5,5,5,5,6]
>>> count={}
>>> for item in lst:
...     count[item]=lst.count(item)
...
>>> count
{1: 2, 2: 3, 3: 1, 4: 1, 5: 5, 6: 1}
def valCount(lst):
    res = {}
    for v in lst:
        try:
            res[v] += 1
        except KeyError:
            res[v] = 1
    return res

u = [x for x, y in valCount(lst).iteritems() if y > 1]

u is now a list of all values which appear more than once.
Edit:
#katrielalex: thank you for pointing out collections.Counter, of which I was not previously aware. It can also be written more concisely using a collections.defaultdict, as demonstrated in the following tests. All three methods are roughly O(n) and reasonably close in run-time performance (using collections.defaultdict is in fact slightly faster than collections.Counter).
My intention was to give an easy-to-understand response to what seemed a relatively unsophisticated request. Given that, are there any other senses in which you consider it "bad code" or "done poorly"?
import collections
import random
import time

def test1(lst):
    res = {}
    for v in lst:
        try:
            res[v] += 1
        except KeyError:
            res[v] = 1
    return res

def test2(lst):
    res = collections.defaultdict(lambda: 0)
    for v in lst:
        res[v] += 1
    return res

def test3(lst):
    return collections.Counter(lst)

def rndLst(lstLen):
    r = random.randint
    return [r(0, lstLen) for i in xrange(lstLen)]

def timeFn(fn, *args):
    st = time.clock()
    res = fn(*args)
    return time.clock() - st

def main():
    reps = 5000
    res = []
    tests = [test1, test2, test3]
    for t in xrange(reps):
        lstLen = random.randint(10, 50000)
        lst = rndLst(lstLen)
        res.append([lstLen] + [timeFn(fn, lst) for fn in tests])
    res.sort()
    return res
And the results, for random lists containing up to 50,000 items, are shown in the plot (vertical axis is time in seconds; horizontal axis is number of items in the list).
Another way to get all items that occur more than once:

lst = [1, 2, 3, 4, 1]
d = {}
for x in lst:
    d[x] = x in d

print d[1]                    # True
print d[2]                    # False
print [x for x in d if d[x]]  # [1]
You could also sort the list, which is O(n log n), then check adjacent elements for equality, which is O(n). The result is O(n log n) overall. This has the disadvantage of requiring the entire list be sorted before possibly bailing out when a duplicate is found.
For a large list with relatively rare duplicates, this could be about the best you can do. The best way to approach this really depends on the size of the data involved and its nature.
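A minimal sketch of that sort-then-scan approach (my addition):

def has_duplicate_sorted(lst):
    # O(n log n) sort, then O(n) comparison of adjacent elements
    s = sorted(lst)
    return any(a == b for a, b in zip(s, s[1:]))

print(has_duplicate_sorted([1, 2, 3, 4, 1]))  # True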
