The question is this:
Given an array of strings, remove each string that is an anagram of an earlier string, then return the remaining array in sorted order.
Example
str = ['code', 'doce', 'ecod', 'framer', 'frame']
code and doce are anagrams. Remove doce from the array and keep the first occurrence code in the array.
code and ecod are anagrams. Remove ecod from the array and keep the first occurrence code in the array.
code and framer are not anagrams. Keep both strings in the array.
framer and frame are not anagrams due to the extra r in framer. Keep both strings in the array.
Order the remaining strings in ascending order: ['code','frame','framer'].
The solution code is:
def checkForAnagrams(word, arr):
    # Checking if the word has an anagram in the sliced array.
    for x in arr:
        if (sorted(word) == sorted(x)):
            return True
    return False

def funWithAnagrams(text):
    limit = len(text)
    text.reverse()
    # Creating a copy of the list which will be modified,
    # and will not affect the array slicing during the loop.
    final_text = list(text)
    # Looping through the list in reverse since we're eliminating
    # the second anagram we find from the original list order.
    count = 0
    for i in range(0, limit):
        if text[i+1:] and checkForAnagrams(text[i], text[i+1:]):
            final_text.pop(i - count)
            count += 1
    return sorted(final_text)
In the function funWithAnagrams, what does the text[i+1:] check at the start of the condition accomplish? It seems that
if checkForAnagrams(text[i], text[i+1:]):
would have produced the same output.
PS: This is not my code. I found it online and would like to know how the output changes if the text[i+1:] check is removed.
The if text[i+1:] checks whether the sliced list is empty. In the final iteration of the for loop the slice is empty, so the condition short-circuits to False and checkForAnagrams(text[i], text[i+1:]) never executes. Note that the guard is not needed for correctness: iterating over an empty list does not raise an error. The for loop in checkForAnagrams simply runs zero times and the function returns False, so removing the guard produces the same output.
A tidier way to write this code is to remove the if text[i+1:] and replace range(0, limit) with range(0, limit-1), so that text[i+1:] is never empty (it has one element in the last iteration of the for loop). This skips one redundant check per iteration, although the difference is not noticeable for code this small.
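As a hedged alternative sketch (not the code from the question), the same result can be obtained in a single pass by remembering the sorted-letter signature of every word already kept. The name remove_anagrams is hypothetical.

```python
def remove_anagrams(words):
    """Keep the first word of each anagram group, then sort the survivors."""
    seen = set()   # sorted-letter signatures of the words kept so far
    kept = []
    for word in words:
        signature = "".join(sorted(word))
        if signature not in seen:
            seen.add(signature)
            kept.append(word)
    return sorted(kept)


print(remove_anagrams(['code', 'doce', 'ecod', 'framer', 'frame']))
# prints ['code', 'frame', 'framer']
```

Because the signatures live in a set, each word needs one lookup instead of a comparison against every other word, and the input list is not reversed or modified.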
Related
I have the following program, which checks whether two lists are the same. However, the originalword list prints correctly the first time but not the second, so the equality check at the end does not work.
https://trinket.io/python3/b3b7827717
Can anyone spot the error? If so, could a solution be posted
a) using lists
b) Not introducing any new skills (e.g. string slicing)
def palindromechecker():
    print("----Palindrome Checker---")
    word=input("Enter word:")
    #empty list for reversed word
    originalword=[]
    reversedword=[]
    #put each letter in word in the list originallist
    for i in range(len(word)):
        originalword.append(word[i])
    print("Print original word in order:",originalword)
    #reverse the word
    for i in range(len(word)):
        reversedword.append(originalword.pop())
    print("Reversed word:",reversedword)
    print("Original word:",originalword)
    #are original word and reversed word the same?
    if originalword==reversedword:
        print("--Palindrome Found--")
    else:
        print("--Not a Palindrome---")

palindromechecker()
The originalword.pop() call in the second loop empties the originalword list. A better way is to reverse the list with a slice:
reversedword = originalword[::-1]
This works without a loop.
Or you can index from the end instead of popping:
for i in range(len(word)):
    reversedword.append(originalword[-1-i])
The .pop() method updates the existing list by removing its last element, so it exhausts your originalword list.
I read that you don't want to use string[::-1]. Here is a similar but not the same workaround. You have to run the loop in reverse.
#reverse the word
for i in range(len(word)-1, -1, -1):
    reversedword.append(originalword[i])
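Putting the index-based reversal into a complete check might look like the sketch below. It sticks to lists and loops, as the question asked; the function name is_palindrome and the choice to return a boolean instead of printing are assumptions made for illustration.

```python
def is_palindrome(word):
    """Return True if word reads the same forwards and backwards."""
    originalword = []
    for i in range(len(word)):
        originalword.append(word[i])

    # Read originalword from the back by index; pop() is avoided so the
    # original list is still intact for the comparison below.
    reversedword = []
    for i in range(len(word) - 1, -1, -1):
        reversedword.append(originalword[i])

    return originalword == reversedword


print(is_palindrome("level"))   # prints True
print(is_palindrome("python"))  # prints False
```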
Is it possible to iterate through a list multiple times? Basically, I have a list of strings and I am looking for the longest superstring. The strings are all the same size, and each one overlaps another by at least half its length. I want to check whether the superstring I am building starts with or ends with part of each sequence in the list. When I find a match, I want to add that element to my superstring, delete the element from the list, and then loop over the list again and again until it is empty.
sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
halfway= len(sequences[0])/2
genome=sequences[0] # this is the string that will be added onto throughout the loop
sequences.remove(sequences[0])
for j in range(len(sequences)):
    for sequence in sequences:
        front=[]
        back=[]
        for i in range(halfway,len(sequence)):
            if genome.endswith(sequence[:i]):
                genome=genome+sequence[i:]
                sequences.remove(sequence)
            elif genome.startswith(sequence[-i:]):
                genome=sequence[:i]+genome
                sequences.remove(sequence)
            '''
            elif not genome.startswith(sequence[-i:]) or not genome.endswith(sequence[:i]):
                sequences.remove(sequence) # this doesnt seem to work want to get rid of
                #sequences that are in the middle of the string and
                #already accounted for
            '''
This works when I don't use the final elif statement, and it gives me the correct answer ATTAGACCTGCCGGAATAC. However, when I do this with a larger list of strings, I am still left with strings in the list that I expected to be empty. Also, is the last loop even necessary if I am only looking for strings to add onto the front and back of the superstring (genome in my code)?
try this:
sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
sequences.reverse()
genome = sequences.pop(-1) # this is the string that will be added onto throughout the loop
unrelated = []
while(sequences):
    sequence = sequences.pop(-1)
    if sequence in genome: continue
    found=False
    for i in range(3,len(sequence)):
        if genome.endswith(sequence[:i]):
            genome=genome+sequence[i:]
            found = True
            break
        elif genome.startswith(sequence[-i:]):
            genome=sequence[:-i]+genome # prepend only the part before the overlap
            found = True
            break
    if not found:
        unrelated.append(sequence)
print(genome)
#ATTAGACCTGCCGGAATAC
print(sequences)
#[]
print(unrelated)
#[]
I do not know if you are guaranteed to not have multiple unrelated sequences in the same batch, so I allowed for the unrelated. If that is not necessary, feel free to remove.
Python's handling of deleting from the front of a list is very inefficient, so I reversed the list and pull from the back. The reversal might not be necessary depending on the full data (it is with your example data).
I pop from the sequences list while there are sequences available, to avoid removing elements from a list while iterating through it. I then check whether the sequence is already in the final genome. If it is not, I go on to the endswith / startswith checks. If a match is found, I slice it into genome, set the found flag, and break out of the for loop.
If the sequence is not already contained and a partial match is not found, it gets put into unrelated.
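The pop-and-check approach above tries each sequence only once, so a sequence that does not overlap the genome yet is set aside for good. A minimal sketch of a variant that keeps retrying until a full pass makes no progress is shown here. The function name assemble and the min_overlap parameter are hypothetical, and the sketch assumes min_overlap is at least 1.

```python
def assemble(sequences, min_overlap):
    """Greedily grow a superstring, starting from the first sequence."""
    remaining = list(sequences)          # work on a copy of the input
    genome = remaining.pop(0)
    progress = True
    while remaining and progress:
        progress = False
        for seq in list(remaining):      # snapshot, because remaining is mutated
            if seq in genome:            # already covered by the genome
                remaining.remove(seq)
                progress = True
                continue
            # Try the longest overlap first, down to min_overlap characters.
            for k in range(len(seq) - 1, min_overlap - 1, -1):
                if genome.endswith(seq[:k]):
                    genome += seq[k:]              # append the non-overlapping tail
                elif genome.startswith(seq[-k:]):
                    genome = seq[:-k] + genome     # prepend the non-overlapping head
                else:
                    continue
                remaining.remove(seq)
                progress = True
                break
    return genome, remaining


print(assemble(['ATTAGACCTG', 'CCTGCCGGAA', 'AGACCTGCCG', 'GCCGGAATAC'], 5))
# prints ('ATTAGACCTGCCGGAATAC', [])
```

Anything still in the returned list could not be attached with the required overlap, which plays the same role as the unrelated list above.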
This is how I ended up solving it. I realized that all you need to do is find out which string is the start of the superstring. Since we know that the sequences overlap by half or more, I looked for the sequence whose first half isn't contained in any of the other sequences. From there I looped over the list a number of times equal to the length of the list and looked for sequences whose beginning matched the end of the genome. When I found one, I added the sequence onto the genome (superstring), removed it from the list, and continued iterating. When working with a list of 50 sequences that have a length of 1000 this code takes around .806441 to run.
def moveFirstSeq(seqList): # move the first sequence in genome to the end of list
    d={}
    for seq in seqList:
        count=0
        for seq1 in seqList:
            if seq==seq1:
                pass
            if seq[0:len(seq)/2] not in seq1:
                count+=1
        d[seq]= count
    sorted_values=sorted(d.values())
    first_sequence=''
    for k,v in d.items():
        if v==sorted_values[-1]:
            first_sequence=k
    seqList.remove(first_sequence)
    seqList.append(first_sequence)
    return seqList

seq= moveFirstSeq(sequences)
genome = seq.pop(-1) # added first sequence to genome and removed from list
for j in range(len(sequences)): # looping over the list amount of times equal to the length of the sequence list
    for sequence in sequences:
        for i in range(len(sequence)/2,len(sequence)):
            if genome.endswith(sequence[:i]):
                genome=genome+sequence[i:] # adding onto the superstring and
                sequences.remove(sequence) #removing it from the sequence list
print genome , seq
In my code I'm having a problem because I cannot compare two lists the way I want. What I am trying to do is compare the first elements of the two inputs and, if they are not the same, move on to the next index of the longer input (guess1). After I finish comparing the first element, I want to compare the second element, and so on. In other words, first check (A-C)(A-A)(A-T), then (C-A)(C-T), and then (T-T).
I want to end up with the list (A,T), because of the ATT part of guess1.
However, I am stuck because I always find ACT and not A and T.
Where am I going wrong? I would be very glad if you could enlighten me.
Edit:
What I'm trying to do is look for the best similarity within the longer list guess1, and find the most similar part, which is ATT.
GUESS1="CATTCG"
GUESS2="ACT"
if len(str(GUESS1))>len(str(GUESS2)):
DNA_input_list=list((GUESS1))
DNA_input1_list=list((GUESS2))
common_elements=[]
i=0
while i<len(DNA_input1_list)-1:
j=0
while j<len(DNA_input_list)-len(DNA_input1_list):
if DNA_input_list[i] == DNA_input1_list[j]:
common_elements.append(DNA_input1_list[j])
i+=1
j+=1
if j>len(DNA_input1_list)-1:
break
print(common_elements)
As far as I understand, you want to find a shorter string inside a longer string and, if it is not found, remove an element from the shorter string and repeat the search.
You can use Python's string find method for that. For example, "CATTCG".find('ACT') returns -1 because there is no substring ACT. You can then remove an element from the front of the shorter string using the slice operator [x:] and repeat the search like this:
>>> for x in range(len('ACT')):
...     if "CATTCG".find('ACT'[x:]) > -1 :
...         print("CATTCG".find('ACT'[x:]))
...         print("Match found for " + 'ACT'[x:])
In this code, a range of offsets is generated first, i.e. [0, 1, 2]; this is the number of items we're going to slice off from the beginning.
In the second line we do the slicing with 'ACT'[x:] (for x == 0 we get 'ACT', for x == 1 we get 'CT' and for x == 2 we get 'T').
The last two lines print out the position and the string that matched.
If I have understood everything correctly, you want to return the longest substring of GUESS2 that is included in GUESS1.
I would use something like this.
common_elements = ''
for count in range(1, len(GUESS2) + 1):
    if GUESS2[:count] in GUESS1:
        common_elements = GUESS2[:count]
print(common_elements) #if a function, return common_elements
Loop over every prefix length of the search string, from one character up to the whole string.
Then check whether that prefix is included in the other string.
If so, save it to a variable and print/return it after the loop has finished. Note that this only tests substrings that start at the first character of GUESS2.
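If the match is allowed to start anywhere in the shorter string, every substring has to be tried, not only the prefixes. A rough sketch of that idea follows; the function name longest_shared_substring and its parameter names are hypothetical, and it simply tries candidates from longest to shortest.

```python
def longest_shared_substring(short, long_):
    """Return the longest substring of `short` that occurs in `long_`."""
    # Try the longest candidates first so the first hit is the answer.
    for length in range(len(short), 0, -1):
        for start in range(len(short) - length + 1):
            candidate = short[start:start + length]
            if candidate in long_:
                return candidate
    return ""   # the two strings share no characters


print(longest_shared_substring("ACT", "CATTCG"))
# prints A
```

For the strings in the question the result is only 'A', because neither 'AC' nor 'CT' occurs in 'CATTCG'.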
I am writing code for a class in which I have to check whether one string is a substring of another using nested loops.
Basically my teacher wants to demonstrate how the in operator works, as in:
'ana' in 'banana' will return True.
The goal of the program is to write a function with 2 parameters,
substring(subStr,fullStr)
that prints a sentence saying whether subStr is a substring of fullStr. My program is as follows:
def substring(subStr,fullStr):
    tracker=""
    for i in (0,(len(fullStr)-1)):
        for j in (0,(len(subStr)-1)):
            if fullStr[i]==subStr[j]:
                tracker=tracker+subStr[j]
                i+=1
                if i==(len(fullStr)-1):
                    break
    if tracker==subStr:
        print "Yes",subStr,"is a substring of",fullStr
When I called the function in the interpreter with substring("ana","banana"), it printed a traceback for line 5 saying string index out of range:
if fullStr[i]==subStr[j]:
I'm banging my head trying to find the error. Any help would be appreciated
There are a few separate issues.
You are not resetting tracker in every iteration of the outer loop. This means that the leftovers from previous iterations contaminate later iterations.
You are not using range, and are instead looping over a tuple of just 0 and the last index of each string.
You increment the outer counter i inside the inner loop, which skips checks that the outer loop should perform.
You are not doing the bounds check correctly before trying to index into the outer string.
Here is a corrected version.
def substring(subStr,fullStr):
    for i in range(0,(len(fullStr))):
        tracker=""
        for j in range(0,(len(subStr))):
            if i + j >= len(fullStr):
                break
            if fullStr[i+j]==subStr[j]:
                tracker=tracker+subStr[j]
        if tracker==subStr:
            print "Yes",subStr,"is a substring of",fullStr
            return

substring("ana", "banana")
First off, your loops should be
for i in xrange(0,(len(fullStr))):
for example. i in (0, len(fullStr)-1) will have i take on the value of 0 the first time around, then take on len(fullStr)-1 the second time. I assume by your algorithm you want it to take on the intermediate values as well.
Now as for the error, consider i on the very last pass of the for loop. i is going to be equal to len(fullStr)-1. When we execute i+=1, i becomes equal to len(fullStr). This does not fulfill the condition i==len(fullStr)-1, so we do not break, we loop, and we crash. It would be better if you either made it if i>=len(fullStr)-1 or checked for i==len(fullStr)-1 before your if fullStr[i]==subStr[j]: statement.
Lastly, though not related to the question specifically, you do not reset tracker each time you stop checking a certain match. You should place tracker = "" after the for i in xrange(0,(len(fullStr))): line. You also do not check whether tracker is correct after looping through the list starting at i, nor do you break from the loop when you get a mismatch (instead you continue and possibly pick up more letters that match, but not consecutively).
Here is a fully corrected version:
def substring(subStr,fullStr):
    for i in xrange(0,(len(fullStr))):
        tracker="" #this is going to contain the consecutive matches we find
        for j in xrange(0,(len(subStr))):
            if i==(len(fullStr)): #end of i; no match.
                break
            if fullStr[i]==subStr[j]: #okay, looks promising, check the next letter to see if it is a match,
                tracker=tracker+subStr[j]
                i+=1
            else: #found a mismatch, leave inner loop and check what we have so far.
                break
        if tracker==subStr:
            print "Yes",subStr,"is a substring of",fullStr
            return #we already know it is a substring, so we don't need to check the rest
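The versions above are written for Python 2 (print statement, xrange). As a hedged Python 3 sketch of the same nested-loop idea, the function below returns a boolean instead of printing and limits the start positions so the index can never run past the end of the string. The name is_substring is hypothetical.

```python
def is_substring(sub_str, full_str):
    """Nested-loop equivalent of `sub_str in full_str`."""
    # Only start positions that leave room for all of sub_str are tried,
    # so full_str[i + j] can never run past the end.
    for i in range(len(full_str) - len(sub_str) + 1):
        matched = True
        for j in range(len(sub_str)):
            if full_str[i + j] != sub_str[j]:
                matched = False
                break          # mismatch: move on to the next start position
        if matched:
            return True
    return False


print(is_substring("ana", "banana"))  # prints True
print(is_substring("nab", "banana"))  # prints False
```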
I have a huge list containing many strings like:
['xxxx','xx','xy','yy','x',......]
Now I am looking for an efficient way to remove all strings that are contained within another string. For example, 'xx' and 'x' both fit in 'xxxx'.
As the dataset is huge, I was wondering if there is a more efficient method for this besides
if a in b:
The complete code, with maybe some parts that could be optimized:
for x in range(len(taxlistcomplete)):
    if delete == True:
        x = x - 1
        delete = False
    for y in range(len(taxlistcomplete)):
        if taxlistcomplete[x] in taxlistcomplete[y]:
            if x != y:
                print x,y
                print taxlistcomplete[x]
                del taxlistcomplete[x]
                delete = True
                break
    print x, len(taxlistcomplete)
An updated version of the code:
for x in enumerate(taxlistcomplete):
    if delete == True:
        #If element is removed, I need to step 1 back and continue looping.....
        delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                print x[1],y[1]
                print taxlistcomplete[x]
                del taxlistcomplete[x[0]]
                delete = True
                break
    print x, len(taxlistcomplete)
This is now implemented with enumerate, but I am wondering whether it is more efficient, and how to implement the delete step so that I have less to search through as well.
Just a short thought...
Basically what I would like to see...
if element does not match any other elements in list write this one to a file.
Thus if 'xxxxx' not in 'xx','xy','wfirfj',etc... print/save
A new simple version, as I don't think I can optimize it much further anyway...
print 'comparison'
file = open('output.txt','a')
for x in enumerate(taxlistcomplete):
    delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                taxlistcomplete[x[0]] = ''
                delete = True
                break
    if delete == False:
        file.write(str(x))
x in <string> is fast, but checking each string against all other strings in the list will take O(n^2) time. Instead of shaving a few cycles by optimizing the comparison, you can achieve huge savings by using a different data structure so that you can check each string in just one lookup: For two thousand strings, that's two thousand checks instead of four million.
There's a data structure called a "prefix tree" (or trie) that allows you to very quickly check whether a string is a prefix of some string you've seen before. Google it. Since you're also interested in strings that occur in the middle of another string x, index all substrings of the form x, x[1:], x[2:], x[3:], etc. (So: only n substrings for a string of length n). That is, you index substrings that start in position 0, 1, 2, etc. and continue to the end of the string. That way you can just check if a new string is an initial part of something in your index.
You can then solve your problem in time roughly linear in the number of strings (for strings of bounded length) like this:
Order your strings in order of decreasing length. This ensures that no string could be a substring of something you haven't seen yet. Since you only care about length, you can do a bucket sort in O(n) time.
Start with an empty prefix tree and loop over your ordered list of strings. For each string x, use your prefix tree to check whether it is a substring of a string you've seen before. If not, add its substrings x, x[1:], x[2:] etc. to the prefix tree.
Deleting in the middle of a long list is very expensive, so you'll get a further speedup if you collect the strings you want to keep into a new list (the actual string is not copied, just the reference). When you're done, delete the original list and the prefix tree.
If that's too complicated for you, at least don't compare everything with everything. Sort your strings by size (in decreasing order), and only check each string against the ones that have come before it. This will give you a 50% speedup with very little effort. And do make a new list (or write to a file immediately) instead of deleting in place.
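The two steps above can be sketched with nested dictionaries standing in for the prefix tree. This is a minimal illustration rather than a tuned implementation: the names remove_contained and _has_path are hypothetical, and inserting every suffix costs time and memory proportional to the square of each string's length.

```python
def _has_path(trie, s):
    """Return True if s is a prefix of some suffix stored in the trie."""
    node = trie
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return True


def remove_contained(strings):
    """Keep only the strings that are not contained in a string kept earlier."""
    trie = {}   # nested dicts: one level per character
    kept = []
    # Longest first, so a string can only be contained in one seen before it.
    for s in sorted(strings, key=len, reverse=True):
        if _has_path(trie, s):
            continue    # s occurs inside a kept string (or duplicates one)
        kept.append(s)
        # Index every suffix s, s[1:], s[2:], ... of the kept string.
        for start in range(len(s)):
            node = trie
            for ch in s[start:]:
                node = node.setdefault(ch, {})
    return kept


print(remove_contained(['xxxx', 'xx', 'xy', 'yy', 'x']))
# prints ['xxxx', 'xy', 'yy']
```

The survivors are collected in a new list, as recommended above, and come back ordered by decreasing length rather than in their original order.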
Here is a simple approach, assuming you can identify a character (I will use '$' in my example) that is guaranteed not to be in any of the original strings:
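result = ''
# Process longer strings first so that a short string is always checked
# against the longer strings that might contain it.
for substring in sorted(taxlistcomplete, key=len, reverse=True):
    if substring not in result: result += '$' + substring
taxlistcomplete = result.split('$')[1:] # [1:] drops the empty string before the first '$'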
This leverages Python's internal optimizations for substring searching by just making one big string to substring-search :)
Here is my suggestion. First I sort the elements by length, because the shorter a string is, the more likely it is to be a substring of another string. Then I have two for loops, where I run through the list and keep only the elements el that are not a substring of any other element. Note that the outer for loop only passes each element once.
By sorting the list first, we destroy the original order of the elements. So if the order is important, you can't use this solution.
Edit: I assume there are no identical elements in the list, so that when el == el2 it's because it's the same element.
a = ["xyy", "xx", "zy", "yy", "x"]
a.sort(key=len)
kept = []  # build a new list rather than removing from a while iterating over it
for el in a:
    for el2 in a:
        if el in el2 and el != el2:
            break  # el is contained in another string, so drop it
    else:
        kept.append(el)  # no other string contains el
a = kept
Using a list comprehension (note the in) is a fast and Pythonic way to filter the list. For example, this keeps only the elements that contain 'xx':
[element for element in arr if 'xx' in element]