Determine whether string contained within another string in python - python

I am looking to determine whether a string is fully contained at the start of any string in a list of other strings. For example, if I had the string cde and the list of strings:
['ab', 'bce', 'cdef']
then it would determine that cde is contained at the start of cdef.
I'm also looking to go the other way around, i.e. if I had the term abc, to identify that ab from the above list is contained at the start of it.
Now obviously this is trivial to set up with a for loop, checking each instance with the function startswith; however, this is not scalable with a very large list of possibilities to check.
While checking each instance is O(n) [and hence very slow if you have 100,000 possibilities], I am looking for a way of checking in O(1). It feels like if the "list" were pre-sorted, I could simply extract the nearest match, but I'm not sure how.
Clarification:
I am solely looking for cases where there is a perfect match at the start of the string (i.e. the whole of the search term is included).
I will be looking up multiple search terms (thus, while initially sorting the data may not be quick, the up-front cost would be recouped on subsequent lookups).
Ideally it would return every possible match (i.e. if cdef and cdefg were in the list, and I looked up cde, then both would be returned).
I use the term "list" loosely, as in a collection of terms.

It's not possible in O(1) with a plain list, since a linear scan has to go over the entire array. If the array is sorted, then you can do a binary search for your string and check whether the element at that position starts with your string. That operation is O(log n).
import bisect

# return the index of the string starting with the prefix
# or None if no such string is in the list
def search(a, prefix):
    i = bisect.bisect_left(a, prefix)
    isAtStart = (i < len(a) and a[i].startswith(prefix))
    return i if isAtStart else None

search(['ab', 'bce', 'cdef'], 'bc')
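The question also asks for every match and for the reverse lookup. A rough sketch of both follows; the names search_all and prefixes_of are made up for illustration and are not part of the answer above. It relies on the fact that, in a sorted list, all strings sharing a prefix sit next to each other, and it assumes the terms are also kept in a set for the reverse direction.

```python
import bisect

def search_all(sorted_terms, prefix):
    """Return every term in sorted_terms that starts with prefix."""
    i = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    # Terms sharing a prefix are contiguous in sorted order,
    # so walk forward until the prefix stops matching.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

def prefixes_of(term_set, word):
    """Return every term in term_set that word starts with."""
    # One set lookup per leading slice of word, independent of the set's size.
    return [word[:k] for k in range(1, len(word) + 1) if word[:k] in term_set]

terms = sorted(['ab', 'bce', 'cdef', 'cdefg'])
print(search_all(terms, 'cde'))        # ['cdef', 'cdefg']
print(prefixes_of(set(terms), 'abc'))  # ['ab']
```

The forward lookup costs O(log n) plus the number of matches; the reverse lookup depends only on the length of the search term.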

Related

How to iterate through a list and use .isdigit to verify if a list of strings is made up of only digits or not

Write a function that takes, as an argument, a list, identified by the variable aList. If the list only contains elements containing digits (either as strings or as integers), return the string formed by concatenating all of the elements in the list (see the example that follows). Otherwise, return a string indicating the length of the list, as specified in the examples that follow.
I am just starting to learn how to code and this is my first CS class.
def amIDigits(aList):
    for element in range(aList):
        if element in aList.isdigit():
            bList=[]
            bList.append(aList)
    return str(bList)

amIDigits(["hello", 23]) should return the string "The length of the input is 2."
amIDigits(["10", "111"]) should return the string "10111"
If I understand it right, the output should be the joined digits even if the elements are not strings. So a good way is to use the all function (it returns True if all elements of an iterable are true) and check whether every element of the list, converted to a string, consists only of digits. If so, return the join of all elements of the list converted to strings. Otherwise, return the length of the list using an f-string (the f prefix marks a formatted string literal, and each {} is replaced by the value of the expression inside it).
code:
def amIDigits(aList):
    if all([str(i).isdigit() for i in aList]):
        return ''.join(map(str,aList))
    else:
        return f'The length of the input is {len(aList)}.'

print(amIDigits(['hello', 23]))
print(amIDigits(['10', '111']))
print(amIDigits([55, 33]))
output:
The length of the input is 2.
10111
5533
First, I highly recommend having someone literally sit down and have you walk them through your thought process. It is more useful as a learner to debug your thought process than to have someone give you the answer.
One thing I noticed is that you created your empty list, bList, inside the for block. This will not work. You need to create an empty list to store things in before you begin looping through the old list, otherwise you will be overwriting your new list every time it loops. So right now, your bList.append() statement is appending onto a fresh empty list every time it runs. (You will only ever keep whatever was appended on the last pass.)
Another problem is that you use the range() function, but you don't need to. You want to look at each element inside the list. range() creates a sequence of numbers from 0 up to (but not including) the number inside the parentheses; see the range() documentation. Your code tries to pass a list into range(), so it is invalid.
The "for blank in blank" statement breaks up whatever list is in the second blank and goes through each of its elements one at a time. For the duration of the for statement, the first blank is the name of the variable that refers to the element being looked at. For example:
apples = ["Granny Smith","Red Delicious","Green"]
for apple in apples:
    eat(apple) #yum!
The for in statement is more naturally spoken as "for each blank in blank:"
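Putting those two fixes together, a loop-based version could look roughly like the sketch below. The name am_i_digits and the early return are my own choices, not something the assignment prescribes; it assumes integers should count as digits once converted with str().

```python
def am_i_digits(a_list):
    pieces = []  # created once, before the loop starts
    for element in a_list:  # iterate over the elements directly, no range()
        text = str(element)  # so that 23 is treated like "23"
        if not text.isdigit():
            # one non-digit element is enough to decide the outcome
            return f"The length of the input is {len(a_list)}."
        pieces.append(text)
    return "".join(pieces)

print(am_i_digits(["hello", 23]))  # The length of the input is 2.
print(am_i_digits(["10", "111"]))  # 10111
```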

Is there any algorithm that can be applied to this program?

I am doing an internship, writing a program to do gene matching.
For example:
File "A" contains some strings of gene type. (the original data is not sorted)
rs17760268
rs10439884
rs4911642
rs157640
rs1958589
rs10886159
rs424232
....
and file "B" contains 900 thousand rs numbers like the ones above (also not sorted).
My program now gets correct results, but I would like to make it more efficient.
Is there any algorithm that can be applied to this program?
BTW, I will try to make my program do multi-processing and see if it gets better performance.
pseudocode:
read File "A" by string, append to A[]
A[] = rs numbers from File "A"
read File "B" by string
for gene_B in file_B_reader:
    for gene_A in A:
        if gene_A == gene_B:
            #append to result[]
I don't think there's a need to sort anything first.
Process larger list B into a hashmap or hashset, O(n) amortized
Iterate over list A and remove from A if not in B, O(m)
return A
Total: O(n + m)
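In Python, the steps above might be sketched like this. The function name common_ids and the sample rs numbers are placeholders, and reading the two files into lists is assumed to have happened already.

```python
def common_ids(ids_a, ids_b):
    """Return the entries of ids_a that also occur in ids_b, in ids_a's order."""
    seen_in_b = set(ids_b)  # O(n) amortized to build
    # each membership test on a set is O(1) on average, so this pass is O(m)
    return [rs for rs in ids_a if rs in seen_in_b]

a = ["rs17760268", "rs10439884", "rs4911642"]
b = ["rs4911642", "rs999", "rs17760268"]
print(common_ids(a, b))  # ['rs17760268', 'rs4911642']
```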
Though your explanation is quite unclear, I guess that you are appending the A values to a list. Use a dictionary instead, and you can look up A much more efficiently.
From the description it appears you want result[] to contain rs strings that are in both A and B (aka Intersection).
Your algorithm is O(n*m), but you could easily improve this by sorting both files first (O(n*logn) for comparison-based sorts) and then reading from both at the same time, advancing the position in whichever one has the lower current rs number and adding matches to result[] as you go.
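A minimal sketch of that sort-then-merge idea, assuming both files fit in memory as lists of strings (the function name sorted_intersection is made up here). The rs numbers are compared as plain strings; any ordering works as long as both lists are sorted the same way.

```python
def sorted_intersection(ids_a, ids_b):
    a, b = sorted(ids_a), sorted(ids_b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return result

print(sorted_intersection(["rs3", "rs1", "rs2"], ["rs2", "rs9", "rs3"]))  # ['rs2', 'rs3']
```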

Random DNA mutation Generator

I'd like to create a dictionary of dictionaries for a series of mutated DNA strands, with each dictionary demonstrating the original base as well as the base it has mutated to.
To elaborate, what I would like to do is create a generator that allows one to input a specific DNA strand and have it crank out 100 randomly generated strands with a mutation frequency of 0.66% (this applies to each base, and each base can mutate to any other base). Then I would like to create a series of dictionaries, where each dictionary details the mutations that occurred in a specific randomly generated strand. I'd like the keys to be the original base, and the values to be the new mutated base. Is there a straightforward way of doing this? So far, I've been experimenting with a loop that looks like this:
import random

#yields a strand with an A-->T mutation frequency of 0.66%
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)

dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.0066)
print("DNA now:", newDNA)
But I can only yield one strand with this code, and it only focuses on A-->T mutations. I'm also not sure how to tie the dictionary into this. Could someone show me a better way of doing this? Thanks.
It sounds like there are two parts to your issue. The first is that you want to mutate your DNA sequence several times, and the second is that you want to gather some additional information about the mutations in a data structure of some kind. I'll handle each of those separately.
Producing 100 random results from the same source string is pretty easy. You can do it with an explicit loop (for instance, in a generator function), but you can just as easily use a list comprehension to run a single-mutation function over and over:
results = [mutate(original_string) for _ in range(100)]
Of course, if you make the mutate function more complicated, this simple code may not be appropriate. If it returns some kind of more sophisticated data structure, rather than just a string, you may need to do some additional processing to combine the data in the format you want.
As for how to build those data structures, I think the code you have already is a good start. You'll need to decide how exactly you're going to be accessing your data, and then let that guide you to the right kind of container.
For instance, if you just want to have a simple record of all the mutations that happen to a string, I'd suggest a basic list that contains tuples of the base before and after the mutation. On the other hand, if you want to be able to efficiently look up what a given base mutates to, a dictionary with lists as values might be more appropriate. You could also include the index of the mutated base if you wanted to.
Here's a quick attempt at a function that returns the mutated string along with a list of tuples recording all the mutations:
import random

bases = "ACGT"

def mutate(orig_string, mutation_rate=0.0066):
    result = []
    mutations = []
    for base in orig_string:
        if random.random() < mutation_rate:
            new_base = bases[bases.index(base) - random.randint(1, 3)] # negatives are OK
            result.append(new_base)
            mutations.append((base, new_base))
        else:
            result.append(base)
    return "".join(result), mutations
The trickiest bit of this code is how I'm picking the replacement of the current base. The expression bases[bases.index(base) - random.randint(1, 3)] does it all in one go. Let's break down the different bits. bases.index(base) gives the index of the current base in the global bases string at the top of the code. Then I subtract a random offset from this index (random.randint(1, 3)). The new index may be negative, but that's OK, as when we use it to index back into the bases string (bases[...]), negative indexes count from the right, rather than the left. Because the offset is never 0 or 4, the result is always one of the other three bases.
Here's how you could use it:
string = "ATGT"
results = [mutate(string) for _ in range(100)]
for result_string, mutations in results:
    if mutations: # skip writing out unmutated strings
        print(result_string, mutations)
For short strings, like "ATGT" you're very unlikely to get more than one mutation, and even one is pretty rare. The loop above tends to print between 2 and 4 results on each run (that is, more than 95% of length-four strings are not mutated at all). Longer strings will have mutations more often, and it's more plausible that you'll see multiple mutations in one string.
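If you would rather have the dictionary-with-lists-as-values form mentioned earlier, one possible way to get it is to group the list of (base, new_base) tuples after the fact. The helper name group_mutations is hypothetical; it only assumes the tuple format returned by the function above.

```python
from collections import defaultdict

def group_mutations(mutations):
    """Turn [(original, mutated), ...] into {original: [mutated, ...]}."""
    grouped = defaultdict(list)
    for original, mutated in mutations:
        grouped[original].append(mutated)
    return dict(grouped)  # plain dict, so missing keys raise KeyError again

print(group_mutations([("A", "T"), ("G", "C"), ("A", "G")]))
# {'A': ['T', 'G'], 'G': ['C']}
```

A plain {original: mutated} dictionary would lose information whenever the same base mutates more than once in a strand, which is why the values are lists here.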

Check if string in strings

I have a huge list containing many strings like:
['xxxx','xx','xy','yy','x',......]
Now I am looking for an efficient way to remove all strings that are present within another string. For example, 'xx' and 'x' fit in 'xxxx'.
As the dataset is huge, I was wondering if there is an efficient method for this besides
if a in b:
The complete code: With maybe some optimization parts:
for x in range(len(taxlistcomplete)):
    if delete == True:
        x = x - 1
        delete = False
    for y in range(len(taxlistcomplete)):
        if taxlistcomplete[x] in taxlistcomplete[y]:
            if x != y:
                print x,y
                print taxlistcomplete[x]
                del taxlistcomplete[x]
                delete = True
                break
    print x, len(taxlistcomplete)
An updated version of the code:
for x in enumerate(taxlistcomplete):
    if delete == True:
        #If element is removed, I need to step 1 back and continue looping.....
        delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                print x[1],y[1]
                print taxlistcomplete[x]
                del taxlistcomplete[x[0]]
                delete = True
                break
    print x, len(taxlistcomplete)
It is now implemented with enumerate, but I am wondering whether this is more efficient, and how to implement the delete step so that I have less to search through as well.
Just a short thought...
Basically, what I would like to see:
If an element does not match any other element in the list, write it to a file.
Thus if 'xxxxx' not in 'xx','xy','wfirfj',etc... print/save
A new, simple version, as I don't think I can optimize it much further anyway:
print 'comparison'
file = open('output.txt','a')
for x in enumerate(taxlistcomplete):
    delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                taxlistcomplete[x[0]] = ''
                delete = True
                break
    if delete == False:
        file.write(str(x))
x in <string> is fast, but checking each string against all other strings in the list will take O(n^2) time. Instead of shaving a few cycles by optimizing the comparison, you can achieve huge savings by using a different data structure so that you can check each string in just one lookup: For two thousand strings, that's two thousand checks instead of four million.
There's a data structure called a "prefix tree" (or trie) that allows you to very quickly check whether a string is a prefix of some string you've seen before. Google it. Since you're also interested in strings that occur in the middle of another string x, index all substrings of the form x, x[1:], x[2:], x[3:], etc. (So: only n substrings for a string of length n). That is, you index substrings that start in position 0, 1, 2, etc. and continue to the end of the string. That way you can just check if a new string is an initial part of something in your index.
You can then solve your problem in O(n) time (in the number of strings, treating the string lengths as bounded) like this:
Order your strings in order of decreasing length. This ensures that no string could be a substring of something you haven't seen yet. Since you only care about length, you can do a bucket sort in O(n) time.
Start with an empty prefix tree and loop over your ordered list of strings. For each string x, use your prefix tree to check whether it is a substring of a string you've seen before. If not, add its substrings x, x[1:], x[2:] etc. to the prefix tree.
Deleting in the middle of a long list is very expensive, so you'll get a further speedup if you collect the strings you want to keep into a new list (the actual string is not copied, just the reference). When you're done, delete the original list and the prefix tree.
If that's too complicated for you, at least don't compare everything with everything. Sort your strings by size (in decreasing order), and only check each string against the ones that have come before it. This will give you a 50% speedup with very little effort. And do make a new list (or write to a file immediately) instead of deleting in place.
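The prefix-tree steps above can be sketched as follows. This is only an illustration under some assumptions of mine: the trie is a plain nested dict, exact duplicates are collapsed with set() first, and an ordinary sort stands in for the bucket sort. The name remove_contained is made up.

```python
def remove_contained(strings):
    """Keep only the strings that are not substrings of another string."""
    trie = {}  # nested dicts, one level per character
    kept = []
    # Longest first, so anything that could contain s is already indexed.
    for s in sorted(set(strings), key=lambda t: (-len(t), t)):
        node = trie
        for ch in s:
            if ch not in node:
                break  # s is not a prefix of any indexed suffix
            node = node[ch]
        else:
            continue  # walked all of s: it occurs inside a kept string
        kept.append(s)
        # Index every suffix s, s[1:], s[2:], ... of the kept string.
        for start in range(len(s)):
            node = trie
            for ch in s[start:]:
                node = node.setdefault(ch, {})
    return kept

print(remove_contained(['xxxx', 'xx', 'xy', 'yy', 'x']))  # ['xxxx', 'xy', 'yy']
```

Indexing every suffix uses memory proportional to the square of each kept string's length, so this fits lists of many short strings better than lists of a few very long ones.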
Here is a simple approach, assuming you can identify a character (I will use '$' in my example) that is guaranteed not to be in any of the original strings:
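result = ''
# visit longer strings first, so anything that could contain a string is already in result
for substring in sorted(taxlistcomplete, key=len, reverse=True):
    if substring not in result:
        result += '$' + substring
taxlistcomplete = result.split('$')[1:]  # [1:] drops the empty piece before the first '$'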
This leverages Python's internal optimizations for substring searching by just making one big string to substring-search :)
Here is my suggestion. First I sort the elements by length, because the shorter a string is, the more likely it is to be a substring of another string. Then I run through the sorted list once and, for each element el, check only the elements after it (which are the same length or longer). If el is a substring of one of them, it is dropped; otherwise it is kept in a new list, which avoids removing items from a list while looping over it.
By sorting the list first, we destroy the order of elements in the list. So if the order is important, then you can't use this solution.
Edit: I assume there are no identical elements in the list, so two different elements of the same length can never contain each other.
a = ["xyy", "xx", "zy", "yy", "x"]
a.sort(key=len)
kept = []
for i, el in enumerate(a):
    # only later (same length or longer) elements can contain el
    if not any(el in el2 for el2 in a[i + 1:]):
        kept.append(el)
a = kept
Using a list comprehension with the in operator is a concise and Pythonic way to select the strings that contain a given substring:
[element for element in arr if 'xx' in element]

Automatic dictionary generation - Python

I am trying to create a program which outputs all strings of length n that can be built from a set of characters whilst avoiding a defined substring of length k. For example:
Derive all possible strings, up to a length of 5 characters, that can be generated from an initial empty set, where each step can go to either A or B, but the string cannot contain the substring "AAB".
i.e. base case of [""] is the empty set.
The dictionary would be - A:{A}, B:{A,B}
From the empty set we can go to A, and we can go to B. We can not go to a B after an A but we can go to an A after a B. And both A and B can access themselves
example output: a,b,aa,bb,ba,aaa,bbb,baa,bba ... etc
How would I go about prompting a user to define a substring to avoid, and from that generate a dictionary which abides by these rules?
Any help or clarification would be gratefully received.
Regards,
rkhad
The itertools module has a useful function called permutations():
(from http://docs.python.org/library/itertools.html#itertools.permutations)
itertools.permutations(iterable[, r])
Return successive r length permutations of elements in the iterable.
If r is not specified or is None, then r defaults to the length of the
iterable and all possible full-length permutations are generated.
Permutations are emitted in lexicographic sort order. So, if the input
iterable is sorted, the permutation tuples will be produced in sorted
order.
List comprehensions provide an easy way to filter generated permutations like this, but beware that if you are storing permutations of a large string, you will quickly get a very large list. You may therefore want to use a set to whittle your list down to non-duplicates. Also, you may find the function sorted useful if you intend to iterate through your "paths" in lexicographic order. Lastly, the in operator, when applied to strings, checks for a substring (x in y checks if x is a substring of y).
>>> from itertools import permutations
>>> perms = [''.join(p) for p in permutations('AAAABBBB', 4)]
>>> len(perms)
1680
>>> len(set(perms))
16
>>> filtered = [p for p in sorted(set(perms)) if 'AB' not in p]
>>> filtered
['AAAA', 'BAAA', 'BBAA', 'BBBA', 'BBBB']
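Since letters may repeat in the strings the question asks for, itertools.product (a different function from the permutations() used above) may be a closer fit. Here is a small sketch with the same in filter; the function name strings_avoiding and its parameters are made up for illustration.

```python
from itertools import product

def strings_avoiding(alphabet, forbidden, max_len):
    """All strings over alphabet with 1..max_len letters that do not contain forbidden."""
    result = []
    for length in range(1, max_len + 1):
        # product with repeat=length yields every letter sequence of that length
        for letters in product(alphabet, repeat=length):
            candidate = "".join(letters)
            if forbidden not in candidate:
                result.append(candidate)
    return result

found = strings_avoiding("AB", "AAB", 3)
print(len(found))      # 13
print("AAB" in found)  # False
```

This still builds every candidate before filtering, so the work grows exponentially with max_len.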
I'm working on my dissertation right now too, in the area of Formal Languages. The concept of substring membership can be represented by a very simple regular grammar which corresponds to a deterministic finite automaton. To jog your memory:
http://en.wikipedia.org/wiki/Regular_grammar
http://en.wikipedia.org/wiki/Finite-state_machine
When you look into these you will find that you need to somehow keep track of the current "state" of your computation if you want it to have different "dictionaries" at different phases. I encourage you to read the wikipedia articles, and ask me some follow-up questions as I'd be happy to help you work through this.
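To make the "state" idea a bit more concrete, here is one possible sketch of such an automaton in Python. Everything in it is my own assumption rather than something from the articles: state k means "the last k letters typed match the first k letters of the forbidden substring", the transition table is the generated dictionary, and the names build_transitions and generate are hypothetical. It assumes the forbidden substring is non-empty.

```python
def build_transitions(alphabet, forbidden):
    """Map (state, letter) -> next state for the substring-avoiding automaton."""
    transitions = {}
    for state in range(len(forbidden)):
        matched = forbidden[:state]
        for letter in alphabet:
            text = matched + letter
            # next state = length of the longest suffix of text
            # that is also a prefix of forbidden
            k = len(text)
            while k > 0 and not text.endswith(forbidden[:k]):
                k -= 1
            transitions[(state, letter)] = k
    return transitions

def generate(alphabet, forbidden, max_len):
    """All strings with 1..max_len letters that never reach the forbidden state."""
    transitions = build_transitions(alphabet, forbidden)
    dead = len(forbidden)  # reaching this state means forbidden was just completed
    results = []
    frontier = [("", 0)]  # (string so far, current state)
    for _ in range(max_len):
        next_frontier = []
        for text, state in frontier:
            for letter in alphabet:
                nxt = transitions[(state, letter)]
                if nxt != dead:
                    next_frontier.append((text + letter, nxt))
        results.extend(text for text, _ in next_frontier)
        frontier = next_frontier
    return results

print(build_transitions("AB", "AAB")[(2, "B")])  # 3, the forbidden state
print(len(generate("AB", "AAB", 3)))             # 13
```

To take the substring from a user, something like forbidden = input("Substring to avoid: ") could be passed straight into generate. Unlike filtering after the fact, this never extends a string once it would contain the forbidden substring.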
