I have a huge list containing many strings like:
['xxxx','xx','xy','yy','x',......]
Now I am looking for an efficient way that removes all strings that are present within another string. For example 'xx' 'x' fit in 'xxxx'.
As the dataset is huge, I was wondering if there is an efficient method for this beside
if a in b:
The complete code: With maybe some optimization parts:
for x in range(len(taxlistcomplete)):
    if delete == True:
        x = x - 1
        delete = False
    for y in range(len(taxlistcomplete)):
        if taxlistcomplete[x] in taxlistcomplete[y]:
            if x != y:
                print x,y
                print taxlistcomplete[x]
                del taxlistcomplete[x]
                delete = True
                break
    print x, len(taxlistcomplete)
An updated version of the code:
for x in enumerate(taxlistcomplete):
    if delete == True:
        #If element is removed, I need to step 1 back and continue looping.....
        delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                print x[1],y[1]
                print taxlistcomplete[x]
                del taxlistcomplete[x[0]]
                delete = True
                break
    print x, len(taxlistcomplete)
This is now implemented with enumerate, but I am wondering whether it is actually more efficient, and how to implement the delete step so that I also have less to search through.
Just a short thought...
Basically what I would like to see...
if element does not match any other elements in list write this one to a file.
Thus if 'xxxxx' not in 'xx','xy','wfirfj',etc... print/save
A new, simple version, as I don't think I can optimize it much further anyway...
print 'comparison'
file = open('output.txt','a')
for x in enumerate(taxlistcomplete):
    delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                taxlistcomplete[x[0]] = ''
                delete = True
                break
    if delete == False:
        file.write(str(x))
x in <string> is fast, but checking each string against all other strings in the list will take O(n^2) time. Instead of shaving a few cycles by optimizing the comparison, you can achieve huge savings by using a different data structure so that you can check each string in just one lookup: For two thousand strings, that's two thousand checks instead of four million.
There's a data structure called a "prefix tree" (or trie) that allows you to very quickly check whether a string is a prefix of some string you've seen before. Google it. Since you're also interested in strings that occur in the middle of another string x, index all substrings of the form x, x[1:], x[2:], x[3:], etc. (So: only n substrings for a string of length n). That is, you index substrings that start in position 0, 1, 2, etc. and continue to the end of the string. That way you can just check if a new string is an initial part of something in your index.
You can then solve your problem in O(n) time like this:
Order your strings in order of decreasing length. This ensures that no string could be a substring of something you haven't seen yet. Since you only care about length, you can do a bucket sort in O(n) time.
Start with an empty prefix tree and loop over your ordered list of strings. For each string x, use your prefix tree to check whether it is a substring of a string you've seen before. If not, add its substrings x, x[1:], x[2:] etc. to the prefix tree.
Deleting in the middle of a long list is very expensive, so you'll get a further speedup if you collect the strings you want to keep into a new list (the actual string is not copied, just the reference). When you're done, delete the original list and the prefix tree.
If that's too complicated for you, at least don't compare everything with everything. Sort your strings by size (in decreasing order), and only check each string against the ones that have come before it. This will give you a 50% speedup with very little effort. And do make a new list (or write to a file immediately) instead of deleting in place.
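A minimal sketch of this simpler variant, assuming exact duplicates should also be dropped (the function name `remove_contained_simple` is made up for illustration). It is enough to compare against the strings already kept: if a string is contained in a dropped string, it is also contained in whatever kept string caused that drop.

```python
def remove_contained_simple(strings):
    """Keep only strings that are not contained in a longer (or equal) kept string."""
    kept = []
    # Longest first: anything that could contain s has already been seen.
    for s in sorted(strings, key=len, reverse=True):
        if not any(s in longer for longer in kept):
            kept.append(s)  # collect into a new list, no in-place deletes
    return kept


print(remove_contained_simple(['xxxx', 'xx', 'xy', 'yy', 'x']))
```

This is still quadratic in the worst case, but the inner loop only runs over the surviving strings.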
Here is a simple approach, assuming you can identify a character (I will use '$' in my example) that is guaranteed not to be in any of the original strings:
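result = ''
# Longest first, so each string is only tested against longer (or equal) ones.
for substring in sorted(taxlistcomplete, key=len, reverse=True):
    if substring not in result:
        result += '$' + substring
# result starts with '$', so drop the empty first field after splitting.
taxlistcomplete = result.split('$')[1:]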
This leverages Python's internal optimizations for substring searching by just making one big string to substring-search :)
Here is my suggestion. First I sort the elements by length, because the shorter a string is, the more likely it is to be a substring of another string. Then I use two nested for loops: I run through the list and drop every element el that is a substring of another element. Note that the outer for loop passes over each element only once.
By sorting the list first, we destroy the order of elements in the list. So if the order is important, you can't use this solution.
Edit: I assume there are no identical elements in the list, so that when el == el2 it's because it's the same element.
a = ["xyy", "xx", "zy", "yy", "x"]
a.sort(key=len)
kept = []
for i, el in enumerate(a):
    # After sorting, only later (longer or equal) elements can contain el.
    for el2 in a[i + 1:]:
        if el in el2 and el != el2:
            break
    else:
        kept.append(el)  # no other element contains el
a = kept
Using a list comprehension with the in operator is a fast and Pythonic way to filter a list by substring. For example, this keeps only the elements that contain 'xx':
[element for element in arr if 'xx' in element]
Related
The question is this:
Given an array of strings, remove each string that is an anagram of an earlier string, then return the remaining array in sorted order.
Example
str = ['code', 'doce', 'ecod', 'framer', 'frame']
code and doce are anagrams. Remove doce from the array and keep the first occurrence code in the array.
code and ecod are anagrams. Remove ecod from the array and keep the first occurrence code in the array.
code and framer are not anagrams. Keep both strings in the array.
framer and frame are not anagrams due to the extra r in framer. Keep both strings in the array.
Order the remaining strings in ascending order: ['code','frame','framer'].
The solution code is:
def checkForAnagrams(word, arr):
    # Checking if the word has an anagram in the sliced array.
    for x in arr:
        if (sorted(word) == sorted(x)):
            return True
    return False

def funWithAnagrams(text):
    limit = len(text)
    text.reverse()
    # Creating a copy of the list which will be modified,
    # and will not affect the array slicing during the loop.
    final_text = list(text)
    # Looping through the list in reverse since we're eliminating
    # the second anagram we find from the original list order.
    count = 0
    for i in range(0, limit):
        if text[i+1:] and checkForAnagrams(text[i], text[i+1:]):
            final_text.pop(i - count)
            count += 1
    return sorted(final_text)
I want to understand how the text[i+1:] check in the function funWithAnagrams is useful.
if checkForAnagrams(text[i], text[i+1:]):
would have achieved the same output.
PS: This is not my code. I found it online and really want to know how removing the text[i+1:] check would affect the output.
The if text[i+1:] checks whether the sliced list is non-empty. An empty list is falsy, so in the final iteration of the for loop the condition short-circuits and checkForAnagrams(text[i], text[i+1:]) is never called. Removing the check would not cause an error, though: iterating over an empty list simply runs the loop body zero times, so checkForAnagrams would return False and the output would be the same. The check only saves one function call.
A cleaner way to write this code would be to remove the if text[i+1:] and instead replace range(0, limit) with range(0, limit-1), so that text[i+1:] is never empty (it will have one element in the last iteration of the for loop). This avoids the extra truthiness check on every iteration, although it doesn't make a noticeable difference for code this small.
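To make the point about the empty slice concrete, here is a small sketch. The first part reuses the logic of checkForAnagrams from the question (renamed `check_for_anagrams` here) and calls it with an empty list. The second part, `fun_with_anagrams`, is not the code from the question: it is an alternative written for illustration that remembers the sorted letters of each word in a set, so each word is inspected only once.

```python
def check_for_anagrams(word, arr):
    """Same logic as checkForAnagrams in the question."""
    for x in arr:
        if sorted(word) == sorted(x):
            return True
    return False


# Iterating over an empty list is safe: the loop body never runs.
print(check_for_anagrams('code', []))


def fun_with_anagrams(words):
    """Alternative sketch: keep the first word of every anagram group."""
    seen = set()   # sorted-letter signatures already encountered
    kept = []
    for word in words:
        signature = ''.join(sorted(word))
        if signature not in seen:
            seen.add(signature)
            kept.append(word)
    return sorted(kept)


print(fun_with_anagrams(['code', 'doce', 'ecod', 'framer', 'frame']))
```

Unlike the original, this version does not reverse or otherwise modify the list passed in.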
Is it possible to iterate through a list multiple times? Basically, I have a list of strings and I am looking for the longest superstring. All of the strings in the list are the same size, and each overlaps with another by at least half its length. I want to check whether the superstring I'm building starts with or ends with part of each sequence in the list. When I find a match, I want to add that element to my superstring, delete the element from the list, and then loop over the list again and again until it is empty.
sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
halfway= len(sequences[0])/2
genome=sequences[0] # this is the string that will be added onto throughout the loop
sequences.remove(sequences[0])
for j in range(len(sequences)):
    for sequence in sequences:
        front=[]
        back=[]
        for i in range(halfway,len(sequence)):
            if genome.endswith(sequence[:i]):
                genome=genome+sequence[i:]
                sequences.remove(sequence)
            elif genome.startswith(sequence[-i:]):
                genome=sequence[:i]+genome
                sequences.remove(sequence)
            '''
            elif not genome.startswith(sequence[-i:]) or not genome.endswith(sequence[:i]):
                sequences.remove(sequence) # this doesnt seem to work want to get rid of
                                           #sequences that are in the middle of the string and
                                           #already accounted for
            '''
This works when I don't use the final elif statement and gives me the correct answer, ATTAGACCTGCCGGAATAC. However, when I do this with a larger list of strings, I am still left with strings in a list that I expected to be empty. Also, is the last loop even necessary if I am only looking for strings to add onto the front and back of the superstring (genome in my code)?
try this:
sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
sequences.reverse()
genome = sequences.pop(-1) # this is the string that will be added onto throughout the loop
unrelated = []
while(sequences):
    sequence = sequences.pop(-1)
    if sequence in genome: continue
    found=False
    for i in range(3,len(sequence)):
        if genome.endswith(sequence[:i]):
            genome=genome+sequence[i:]
            found = True
            break
        elif genome.startswith(sequence[-i:]):
            # prepend only the part of sequence that precedes the overlap
            genome=sequence[:-i]+genome
            found = True
            break
    if not found:
        unrelated.append(sequence)
print(genome)
#ATTAGACCTGCCGGAATAC
print(sequences)
#[]
print(unrelated)
#[]
I do not know if you are guaranteed to not have multiple unrelated sequences in the same batch, so I allowed for the unrelated. If that is not necessary, feel free to remove.
Python's handling of deleting from the front of a list is very inefficient, so I reversed the list and pull from the back. The reversal might not be necessary depending on the full data (it is with your example data).
I pop from the sequences list while there are sequences available, to avoid removing elements from a list while iterating through it. I then check whether the sequence is already in the final genome. If it is not, I go on to the endswith / startswith checks. If a match is found, I splice the sequence into genome, set the found flag, and break out of the for loop.
If the sequence is not already contained and a partial match is not found, it gets put into unrelated.
This is how I ended up solving it. I realized that all you need to do is find out which string is the start of the superstring. Since we know that the sequences overlap by half their length or more, I looked for the sequence whose first half wasn't contained in any of the other sequences. From there I looped over the list a number of times equal to its length, looking for sequences whose beginning matched the end of the genome. When I found one, I added the sequence onto the genome (superstring), removed it from the list, and continued iterating. When working with a list of 50 sequences that have a length of 1000 this code takes around .806441 to run.
def moveFirstSeq(seqList): # move the first sequence in genome to the end of list
    d={}
    for seq in seqList:
        count=0
        for seq1 in seqList:
            if seq==seq1:
                pass
            if seq[0:len(seq)/2] not in seq1:
                count+=1
        d[seq]= count
    sorted_values=sorted(d.values())
    first_sequence=''
    for k,v in d.items():
        if v==sorted_values[-1]:
            first_sequence=k
    seqList.remove(first_sequence)
    seqList.append(first_sequence)
    return seqList

seq= moveFirstSeq(sequences)
genome = seq.pop(-1) # added first sequence to genome and removed from list
for j in range(len(sequences)): # looping over the list amount of times equal to the length of the sequence list
    for sequence in sequences:
        for i in range(len(sequence)/2,len(sequence)):
            if genome.endswith(sequence[:i]):
                genome=genome+sequence[i:] # adding onto the superstring and
                sequences.remove(sequence) #removing it from the sequence list
print genome , seq
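For readers on Python 3, the approach described in this answer could be sketched as below. This is a rewrite for illustration, not the author's code: the function name `assemble` is made up, it assumes all reads have the same length and overlap their neighbour by at least half that length, and it assumes exactly one read has a first half that appears in no other read. It tries the longest overlap first and restarts the scan after every extension, so the list is never modified while it is being iterated.

```python
def assemble(reads):
    """Greedy left-to-right assembly (illustrative sketch)."""
    reads = list(reads)            # work on a copy
    half = len(reads[0]) // 2      # // keeps this an int in Python 3

    # The starting read is the one whose first half occurs in no other read.
    start = None
    for i, read in enumerate(reads):
        others = reads[:i] + reads[i + 1:]
        if not any(read[:half] in other for other in others):
            start = read
            break
    if start is None:
        raise ValueError('no unambiguous starting read found')

    reads.remove(start)
    genome = start
    extended = True
    while reads and extended:
        extended = False
        for read in reads:
            # Try the longest possible overlap first, down to half the length.
            for k in range(len(read) - 1, half - 1, -1):
                if genome.endswith(read[:k]):
                    genome += read[k:]
                    reads.remove(read)
                    extended = True
                    break
            if extended:
                break              # restart the scan over the shortened list
    return genome, reads           # leftover reads could not be attached


genome, leftover = assemble(['ATTAGACCTG', 'CCTGCCGGAA', 'AGACCTGCCG', 'GCCGGAATAC'])
print(genome, leftover)
```

Any reads that never match the end of the genome are returned in the second value instead of being silently dropped.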
In my code I'm having a problem because I cannot compare two lists the way I want. What I'm trying to do is compare the first elements of the inputs first, and if they are not the same, move on to the next index of the longer input, GUESS1. After I finish comparing the first element, I want to compare the second element, and so on. That is, first check (A-C), (A-A), (A-T), then (C-A), (C-T), and then (T-T)...
I want to get the list (A, T) because of the ATT part of GUESS1.
However, I'm stuck because I always get ACT, not A and T.
Where am I going wrong? I would be very glad if you could enlighten me.
Edit:
What I'm trying to do is look for the best similarity within the longer list, GUESS1, and find the most similar part, ATT.
GUESS1="CATTCG"
GUESS2="ACT"
if len(str(GUESS1))>len(str(GUESS2)):
    DNA_input_list=list((GUESS1))
    DNA_input1_list=list((GUESS2))
common_elements=[]
i=0
while i<len(DNA_input1_list)-1:
    j=0
    while j<len(DNA_input_list)-len(DNA_input1_list):
        if DNA_input_list[i] == DNA_input1_list[j]:
            common_elements.append(DNA_input1_list[j])
            i+=1
        j+=1
        if j>len(DNA_input1_list)-1:
            break
print(common_elements)
As far as I understand, you want to find a shorter string inside a longer string and, if it is not found, remove an element from the shorter string and repeat the search.
You can use Python's string find method for that, i.e. "CATTCG".find('ACT'). This call returns -1 because there is no substring ACT. You can then remove an element from the shorter string using the slice operator and repeat the search like this:
>>> for x in range(len('ACT')):
...     if "CATTCG".find('ACT'[x:]) > -1 :
...         print("CATTCG".find('ACT'[x:]))
...         print("Match found for " + 'ACT'[x:])
In the code here, a range of offsets is generated first, i.e. [0, 1, 2]; this is the number of items we're going to slice off from the beginning.
In the second line we do the slicing with 'ACT'[x:] (for x == 0 we get 'ACT', for x == 1 we get 'CT', and for x == 2 we get 'T').
The last two lines print out the position and the string that matched.
If I have understood everything correctly, you want to return the longest substring of GUESS2 that is included in GUESS1.
I would use something like this. Note that it only tries substrings that start at the beginning of GUESS2.
common_elements = ''
for count in range(1, len(GUESS2) + 1):
    if GUESS2[:count] in GUESS1:
        common_elements = GUESS2[:count]
print(common_elements) #if a function, return common_elements
The loop runs once for each prefix length of the search string.
In each pass it checks whether that prefix is included in the other string.
If so, it saves the prefix to a variable, which is printed/returned after the loop has finished.
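If the match is allowed to start anywhere in the shorter string, not only at its beginning, a brute-force search over all of its substrings is one option. The sketch below is an illustration under that assumption (the function name `longest_common_substring` is made up here). It tries candidates from longest to shortest and returns the first one found, which is fine for short inputs like these but slow for long ones.

```python
def longest_common_substring(a, b):
    """Return the longest substring of the shorter input found in the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    # Try the longest candidates first so the first hit is the answer.
    for size in range(len(short), 0, -1):
        for start in range(len(short) - size + 1):
            candidate = short[start:start + size]
            if candidate in long_:
                return candidate
    return ''  # nothing in common


print(longest_common_substring("CATTCG", "ACT"))
print(longest_common_substring("CATTCG", "GATTA"))
```

When several substrings of the same length match, the leftmost one in the shorter string is returned.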
A function I created takes a list of strings (a long list of long sequences) as an argument. Initially, I want to make sure all strings are of equal length. Of course, I could do it by iterating over all sequences in a loop and checking the length. But I am wondering: is there any way to do it faster or more efficiently?
I've tried looking at the unittest module, but I am not sure whether it would suit here. Alternatively, I was thinking about creating a list of len(string) for all strings using a list comprehension and then checking whether all elements are the same. However, this seems like a lot of effort.
my_list = [ ... ]
FIXED_SIZE = 100 # Length of each string which should be equal
result = all(len(my_string) == FIXED_SIZE for my_string in my_list)
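This may help you. If all strings are the same length, the output will be True, otherwise False.
str_list = ['ilo', 'jak']
# list() is needed on Python 3, where map returns an iterator that cannot be indexed.
str_len = list(map(len, str_list))
all(each_len == str_len[0] for each_len in str_len)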
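When the expected length is not known in advance, one way to write the check without building an intermediate list is sketched below (the function name `all_same_length` is made up for illustration). It takes the length of the first string as the reference and stops at the first mismatch. Every string still has to be looked at once in the worst case, but len() on a string is a constant-time call, so the check is cheap even for long sequences.

```python
def all_same_length(strings):
    """Return True if every string has the same length (True for an empty input)."""
    it = iter(strings)
    first = next(it, None)
    if first is None:
        return True               # nothing to compare
    expected = len(first)
    # all() short-circuits on the first string with a different length.
    return all(len(s) == expected for s in it)


print(all_same_length(['ilo', 'jak']))
print(all_same_length(['ilo', 'ja']))
```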
I am creating a program for a high school course, and our teacher is very specific about what is allowed in our programs. We use Python 2.x and he only allows if statements, while loops, functions, boolean values, and lists. I am working on a project that will print the reversal of a string, then print the same reversal again without the numbers in it, but I cannot figure it out. Help please. What I have so far is this:
def reverse_str(string):
    revstring =('')
    length=len(string)
    i = length - 1
    while i>=0:
        revstring = revstring + string[i]
        i = i - 1
    return revstring

def strip_digits(string):
    l = [0,1,2,3,4,5,6,7,8,9]
    del (l) rev_string

string = raw_input("Enter a string->")
new_str = rev_str(string)
print new_str
I cannot figure out how to use the "del" function properly. How do I delete any of the items in the list from the reversed string? Thanks.
In general, you have two options for a task like this:
(1) Iterate through the items in your list, deleting the ones that you do not want to keep.
(2) Iterate through the items in your list, copying the ones that you do want to keep to a new list. Return the new list.
Now, although I would normally prefer option (2), that won't help with your specific question about del. To delete an item at index x from a list a, the following syntax will do it:
del a[x]
That will shift all the elements past index x to the left to close the gap left by deleting the element. You will have to take this shift into account if you're iterating through all the items in the list.
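As a rough sketch of how that shift can be handled, the function below deletes digits from a list of characters with del and only advances the index when nothing was deleted. It is an illustration, not a reference solution: it assumes that converting the string with list() is allowed under the course rules, and it stores the digits as one-character strings because the characters of a string are never equal to integers.

```python
def strip_digits(string):
    digits = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    chars = list(string)          # strings are immutable, so work on a list
    i = 0
    while i < len(chars):
        if chars[i] in digits:
            del chars[i]          # later items shift left, so stay at index i
        else:
            i = i + 1             # only move on when nothing was deleted
    # Rebuild a string from the characters that are left.
    result = ''
    i = 0
    while i < len(chars):
        result = result + chars[i]
        i = i + 1
    return result


print(strip_digits('a1b22c'))
```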
The str type in Python is immutable (it cannot be altered in place) and does not support item deletion with del.
Map the characters of the string to a list and delete the elements you want and reconstruct the string.
OR
Iterate through the string elements whilst building a new one, omitting numbers.
Correct usage of del is:
>>> a = [1, 2, 3]
>>> del a[1]
>>> a
[1, 3]
You could iterate back over the string, copying it again but not copying the digits... It would be interesting for you to also figure out the Pythonic way to do everything you're not allowed to. Both methods are good to know.
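A minimal sketch of the copying approach, using only while loops, if statements, functions, and a list, might look like this. The list name DIGITS and the sample input are assumptions made for the example. The digits are stored as one-character strings so they can be compared with the characters of the input.

```python
DIGITS = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']


def reverse_str(string):
    revstring = ''
    i = len(string) - 1
    while i >= 0:
        revstring = revstring + string[i]
        i = i - 1
    return revstring


def strip_digits(string):
    result = ''
    i = 0
    while i < len(string):
        if string[i] not in DIGITS:   # copy everything except digits
            result = result + string[i]
        i = i + 1
    return result


reversed_text = reverse_str('abc123def')
print(reversed_text)                 # fed321cba
print(strip_digits(reversed_text))   # fedcba
```

Because a new string is built instead of deleting characters, there is no index shifting to worry about.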