Looping over list multiple times - python

Is it possible to iterate through a list multiple times? Basically, I have a list of strings and I am looking for the longest superstring. Each string in the list overlaps another by at least half of its length, and they are all the same size. I want to check whether the superstring I'm building starts with or ends with part of each sequence in the list. When I find a match, I want to add that element to my superstring, delete the element from the list, and then loop over the list again and again until it is empty.
sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
halfway= len(sequences[0])//2
genome=sequences[0] # this is the string that will be added onto throughout the loop
sequences.remove(sequences[0])
for j in range(len(sequences)):
    for sequence in sequences:
        front=[]
        back=[]
        for i in range(halfway,len(sequence)):
            if genome.endswith(sequence[:i]):
                genome=genome+sequence[i:]
                sequences.remove(sequence)
            elif genome.startswith(sequence[-i:]):
                genome=sequence[:i]+genome
                sequences.remove(sequence)
            '''
            elif not genome.startswith(sequence[-i:]) or not genome.endswith(sequence[:i]):
                sequences.remove(sequence) # this doesnt seem to work want to get rid of
                                           #sequences that are in the middle of the string and
                                           #already accounted for
            '''
This works when I don't use the final elif statement and gives me the correct answer, ATTAGACCTGCCGGAATAC. However, when I do this with a larger list of strings, I am still left with strings in the list, which I expected to be empty. Also, is the last loop even necessary if I am only looking for strings to add onto the front and back of the superstring (genome in my code)?

Try this:
sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
sequences.reverse()
genome = sequences.pop(-1) # this is the string that will be added onto throughout the loop
unrelated = []
while(sequences):
    sequence = sequences.pop(-1)
    if sequence in genome: continue
    found=False
    for i in range(3,len(sequence)):
        if genome.endswith(sequence[:i]):
            genome=genome+sequence[i:]
            found = True
            break
        elif genome.startswith(sequence[-i:]):
            genome=sequence[:-i]+genome # prepend only the part that does not overlap
            found = True
            break
    if not found:
        unrelated.append(sequence)
print(genome)
#ATTAGACCTGCCGGAATAC
print(sequences)
#[]
print(unrelated)
#[]
I do not know whether you are guaranteed not to have unrelated sequences in the same batch, so I allowed for them with the unrelated list. If that is not necessary, feel free to remove it.
Deleting from the front of a Python list is inefficient because every remaining element has to shift, so I reversed the list and pop from the back. The reversal might not be necessary depending on the full data (it is with your example data).
I pop from the sequences list while there are sequences available, which avoids removing elements from a list while iterating through it. I then check whether the sequence is already contained in the genome. If it is not, I run the endswith / startswith checks. If a match is found, I splice the sequence into genome, set the found flag, and break out of the for loop.
If the sequence is not already contained and no partial match is found, it is put into unrelated.
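The reason for avoiding removal during iteration can be seen in a small sketch. The list contents and the names items and visited are made up for illustration; they are not from the question.

```python
# Removing from a list while a for loop walks over it shifts the later
# elements left, so the loop silently skips every other element.
items = ['a', 'b', 'c', 'd']
visited = []
for item in items:
    visited.append(item)
    items.remove(item)

print(visited)  # ['a', 'c']
print(items)    # ['b', 'd']

# Iterating over a copy leaves the loop unaffected by the removals.
items = ['a', 'b', 'c', 'd']
visited_all = []
for item in list(items):
    visited_all.append(item)
    items.remove(item)

print(visited_all)  # ['a', 'b', 'c', 'd']
print(items)        # []
```

This is one likely reason the original loop leaves strings behind on larger inputs.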

This is how I ended up solving it. I realized that all you need to do is find out which string is the start of the superstring. Since we know that the sequences overlap by half their length or more, I looked for the sequence whose first half isn't contained in any of the other sequences. From there I looped over the list as many times as the list has elements, looking for sequences whose beginning matches the end of the genome. When I found one, I added it onto the genome (superstring), removed it from the list, and continued iterating. When working with a list of 50 sequences that have a length of 1000, this code takes around 0.806441 to run.
def moveFirstSeq(seqList): # move the first sequence in genome to the end of list
    d={}
    for seq in seqList:
        count=0
        for seq1 in seqList:
            if seq==seq1:
                continue # do not compare a sequence with itself
            if seq[0:len(seq)//2] not in seq1:
                count+=1
        d[seq]= count
    sorted_values=sorted(d.values())
    first_sequence=''
    for k,v in d.items():
        if v==sorted_values[-1]:
            first_sequence=k
    seqList.remove(first_sequence)
    seqList.append(first_sequence)
    return seqList
seq= moveFirstSeq(sequences)
genome = seq.pop(-1) # added first sequence to genome and removed from list
for j in range(len(sequences)): # looping over the list amount of times equal to the length of the sequence list
    for sequence in sequences:
        for i in range(len(sequence)//2,len(sequence)):
            if genome.endswith(sequence[:i]):
                genome=genome+sequence[i:] # adding onto the superstring and
                sequences.remove(sequence) #removing it from the sequence list
                break # stop checking a sequence that has already been used
print(genome, seq)
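The same idea can be sketched without removing from a list that is being iterated. This is only an illustration under the assumptions stated in the question (equal-length sequences, overlaps of at least half the length); the function names find_start and assemble are hypothetical, and trying the longest overlap first is a design choice of this sketch.

```python
def find_start(seqs, half):
    """Return the sequence whose first half occurs in no other sequence."""
    for i, s in enumerate(seqs):
        if not any(s[:half] in t for j, t in enumerate(seqs) if j != i):
            return s
    raise ValueError("no unique starting sequence found")


def assemble(sequences):
    """Greedily extend the genome to the right until nothing more fits."""
    half = len(sequences[0]) // 2
    remaining = list(sequences)          # work on a copy
    genome = find_start(remaining, half)
    remaining.remove(genome)
    progress = True
    while remaining and progress:
        progress = False
        for seq in list(remaining):      # iterate over a snapshot
            if seq in genome:            # already covered by the genome
                remaining.remove(seq)
                progress = True
                continue
            # Try the longest overlap first, down to half the length.
            for i in range(len(seq) - 1, half - 1, -1):
                if genome.endswith(seq[:i]):
                    genome += seq[i:]
                    remaining.remove(seq)
                    progress = True
                    break
    return genome, remaining


genome, leftover = assemble(
    ['ATTAGACCTG', 'CCTGCCGGAA', 'AGACCTGCCG', 'GCCGGAATAC'])
print(genome)    # ATTAGACCTGCCGGAATAC
print(leftover)  # []
```

The while loop stops as soon as a full pass adds nothing, so sequences that never fit are returned in leftover instead of causing extra passes.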

Related

Need help to understand fun with anagram code

The question is this:
Given an array of strings, remove each string that is an anagram of an earlier string, then return the remaining array in sorted order.
Example
str = ['code', 'doce', 'ecod', 'framer', 'frame']
code and doce are anagrams. Remove doce from the array and keep the first occurrence code in the array.
code and ecod are anagrams. Remove ecod from the array and keep the first occurrence code in the array.
code and framer are not anagrams. Keep both strings in the array.
framer and frame are not anagrams due to the extra r in framer. Keep both strings in the array.
Order the remaining strings in ascending order: ['code','frame','framer'].
The solution code is:
def checkForAnagrams(word, arr):
    # Checking if the word has an anagram in the sliced array.
    for x in arr:
        if (sorted(word) == sorted(x)):
            return True
    return False
def funWithAnagrams(text):
    limit = len(text)
    text.reverse()
    # Creating a copy of the list which will be modified,
    # and will not affect the array slicing during the loop.
    final_text = list(text)
    # Looping through the list in reverse since we're eliminating
    # the second anagram we find from the original list order.
    count = 0
    for i in range(0, limit):
        if text[i+1:] and checkForAnagrams(text[i], text[i+1:]):
            final_text.pop(i - count)
            count += 1
    return sorted(final_text)
In the function funWithAnagrams, how is the text[i+1:] in the if condition useful? It seems to me that
if checkForAnagrams(text[i], text[i+1:]):
would have achieved the same output.
PS: This is not my code. I found it online and really want to know how removing the text[i+1:] check would affect the output.
The if text[i+1:] checks whether the sliced list is empty. It is empty only in the final iteration of the for loop, where it evaluates to False, so checkForAnagrams(text[i], text[i+1:]) is never called. Strictly speaking, no error would occur without it: looping over an empty list simply does nothing, so checkForAnagrams would return False and the output would be the same. The check only saves one function call.
A cleaner way to write this code would be to remove the if text[i+1:] and instead replace range(0, limit) with range(0, limit-1), so that text[i+1:] is never empty (it has one element in the last iteration of the for loop). This removes one check per iteration, although it makes no noticeable difference for code this small.
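For comparison, here is a minimal sketch of a different technique that avoids the slicing question altogether: remember the sorted letters of every word already kept. This is not the code from the question; the name fun_with_anagrams and the seen set are introduced only for illustration.

```python
def fun_with_anagrams(words):
    """Drop every word that is an anagram of an earlier word, then sort."""
    seen = set()   # sorted-letter signatures of the words kept so far
    kept = []
    for word in words:
        signature = ''.join(sorted(word))
        if signature not in seen:
            seen.add(signature)
            kept.append(word)
    return sorted(kept)


print(fun_with_anagrams(['code', 'doce', 'ecod', 'framer', 'frame']))
# ['code', 'frame', 'framer']
```

Each word is compared against a set instead of against a slice of the list, so there is no empty-slice case to think about.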

How to iterate through a list and use .isdigit to verify if a list of strings is made up of only digits or not

Write a function that takes, as an argument, a list, identified by the variable aList. If the list only contains elements containing digits (either as strings or as integers), return the string formed by concatenating all of the elements in the list (see the example that follows). Otherwise, return a string indicating the length of the list, as specified in the examples that follow.
I am just starting to learn how to code and this is my first CS class.
def amIDigits(aList):
    for element in range(aList):
        if element in aList.isdigit():
            bList=[]
            bList.append(aList)
    return str(bList)
amIDigits(["hello", 23]) should return the string "The length of the input is 2."
amIDigits(["10", "111"]) should return the string "10111"
If I understand it right, the output should be the joined digits even if the elements are not strings. So the simplest way is to use the all function (it returns True if every element of an iterable is true) and check whether each element, converted to a string, consists only of digits. If so, return the join of all elements of the list converted to strings. Otherwise, return the length of the list using an f-string (the f prefix enables formatting, and each {} is replaced by the value of the expression inside it).
code:
def amIDigits(aList):
    if all(str(i).isdigit() for i in aList):
        return ''.join(map(str,aList))
    else:
        return f'The length of the input is {len(aList)}.'
print(amIDigits(['hello', 23]))
print(amIDigits(['10', '111']))
print(amIDigits([55, 33]))
output:
The length of the input is 2.
10111
5533
First, I highly recommend having someone literally sit down and have you walk them through your thought process. It is more useful as a learner to debug your thought process than to have someone give you the answer.
One thing I noticed is that you create your empty list, bList, inside the for block. This will not work. You need to create the empty list before you begin looping through the old list; otherwise you overwrite your new list every time the loop runs. Right now, your bList.append() statement appends onto a fresh empty list on every iteration, so bList never holds more than whatever was appended last.
Another problem is that you use the range() function, but you don't need to. You want to look at each element inside the list. range() creates a sequence of numbers from 0 up to, but not including, the number inside the parentheses. Your code passes a list into range(), which raises a TypeError.
The "for blank in blank" statement takes whatever list is in the second blank and goes through its elements one at a time. For the duration of the for statement, the first blank is the name of the variable that refers to the element being looked at. For example:
apples = ["Granny Smith","Red Delicious","Green"]
for apple in apples:
    eat(apple) #yum! (eat is a made-up function, just for illustration)
The for in statement is more naturally spoken as "for each blank in blank:"
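Putting those points together, a loop-based version might look like the sketch below. It is one possible reading of the assignment, not the only one; the names am_i_digits and pieces are illustrative.

```python
def am_i_digits(a_list):
    pieces = []                    # created once, before the loop starts
    for element in a_list:         # look at each element directly, no range()
        text = str(element)        # integers such as 23 become "23"
        if not text.isdigit():
            # One non-digit element is enough to decide the answer.
            return 'The length of the input is {}.'.format(len(a_list))
        pieces.append(text)
    return ''.join(pieces)


print(am_i_digits(['hello', 23]))   # The length of the input is 2.
print(am_i_digits(['10', '111']))   # 10111
```

Note that str.isdigit() returns False for strings such as "-5" or "3.2", so negative numbers and floats count as non-digit elements here.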

Shuffling with constraints on pairs

I have n lists, each of length m. Assume n*m is even. I want to get a randomly shuffled list of all the elements, under the constraint that the elements in locations i, i+1, where i=0,2,...,n*m-2, never come from the same list. Edit: other than this constraint, I do not want to bias the distribution of random lists. That is, the solution should be equivalent to a completely random shuffle that is repeated until the constraint holds.
example:
list1: a1,a2
list2: b1,b2
list3: c1,c2
allowed: b1,c1,c2,a2,a1,b2
disallowed: b1,c1,c2,b2,a1,a2
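The reference behaviour described in the question (reshuffle until the constraint holds) can be sketched directly. The function name constrained_shuffle and the rng parameter are assumptions made for this illustration, and the sketch assumes at least two lists, otherwise it would never terminate.

```python
import random


def constrained_shuffle(lists, rng=random):
    """Shuffle all items; retry until no pair (0,1), (2,3), ... shares a list."""
    # Tag every item with the index of the list it came from.
    tagged = [(idx, item) for idx, lst in enumerate(lists) for item in lst]
    while True:
        rng.shuffle(tagged)
        pairs_ok = all(tagged[i][0] != tagged[i + 1][0]
                       for i in range(0, len(tagged) - 1, 2))
        if pairs_ok:
            return [item for _, item in tagged]


example = [['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2']]
print(constrained_shuffle(example))
```

Because every rejected shuffle is discarded entirely, each valid arrangement is equally likely, which makes this a useful baseline for checking faster methods, at the cost of an unpredictable number of retries.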
A possible solution is to think of your result as m chunks of items, each chunk having length n. If, for each chunk, you randomly select exactly one item from each list, you will never hit a dead end. Just make sure that the first item in each chunk (except the first chunk) comes from a different list than the last element of the previous chunk.
You can also pick items iteratively at random, always making sure you pick from a different list than the previous item, but then you can hit dead ends.
Finally, another possible solution is to fill each position sequentially at random, but only from those items which "can be put there", that is, items whose placement violates no constraint and still leaves at least one possible solution.
A variation of the second approach above that avoids dead ends: at each step you choose twice. First, randomly choose an item. Second, randomly choose where to place it. At the kth step there are k possible places to put the item (the new item can be inserted between two existing items). Naturally, you only choose from allowed places.
Money!
arrange your lists into a list of lists
save each item in the list as a tuple with the list index in the list of lists
loop n*m times
on even turns - flatten into one list and just rand pop - yield the item and the item group
on odd turns - temporarily remove the last item group and pop as before - in the end add the removed group back
important - how to avoid deadlocks?
a deadlock can occur if all the remaining items are from one group only.
to avoid that, check in each iteration the lengths of all the lists
and check if the longest list is longer than the sum of all the others.
if true - pull for that list
that way you are never left with only one list full
here's a gist with an attempt to solve this in python
https://gist.github.com/YontiLevin/bd32815a0ec62b920bed214921a96c9d
A very quick and simple method I am trying is:
random shuffle
loop over the pairs in the list:
    if pair is bad:
        loop over the pairs in the list:
            if both elements of the new pair are different than the bad pair:
                swap the second elements
                break
Will this always find a solution? Will the solutions have the same distribution as naive shuffling until finding a legit solution?
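To make the pseudocode above concrete, here is one possible reading of it in Python. The names shuffle_with_repair and rng are hypothetical, and "different than the bad pair" is interpreted as "neither element comes from the list that the bad pair comes from". The sketch only shows the mechanics; it does not settle whether the resulting distribution matches naive reshuffling.

```python
import random


def shuffle_with_repair(lists, rng=random):
    """Shuffle once, then fix each bad pair by swapping second elements."""
    tagged = [(idx, item) for idx, lst in enumerate(lists) for item in lst]
    rng.shuffle(tagged)
    for i in range(0, len(tagged) - 1, 2):
        if tagged[i][0] != tagged[i + 1][0]:
            continue                      # pair is already fine
        bad = tagged[i][0]                # list index shared by the bad pair
        for j in range(0, len(tagged) - 1, 2):
            # Partner pair must contain no element from the bad list.
            if tagged[j][0] != bad and tagged[j + 1][0] != bad:
                tagged[i + 1], tagged[j + 1] = tagged[j + 1], tagged[i + 1]
                break
    return [item for _, item in tagged]


example = [['a1', 'a2', 'a3', 'a4'], ['b1', 'b2', 'b3', 'b4']]
print(shuffle_with_repair(example))
```

With at least two lists of equal length, a simple count suggests a suitable partner pair always exists: apart from the bad pair, only m-2 elements of that list remain, which is fewer than the number of other pairs.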

Using split after a set statement in Python

I have a list of words (equivalent to about two full sentences) and I want to split it into two parts: one part containing 90% of the words and another part containing 10% of them. After that, I want to print a list of the unique words within the 10% list, lexicographically sorted. This is what I have so far:
pos_90 = (90*len(words)) // 100 #list with 90% of the words
pos_90 = pos_90 + 1 #I incremented the number by 1 in order to use it as an index
pos_10 = (10*len(words)) // 100 #list with 10% of the words
list_90 = words[:pos_90] #Creation of the 90% list
list_10 = words[pos_10:] #Creation of the 10% list
uniq_10 = set(list_10) #List of unique words out of the 10% list
split_10 = uniq_10.split()
sorted_10 = split_10.sort()
print(sorted_10)
I get an error saying that split cannot be applied to set, so I assume my mistake must be in the last lines of code. Any idea about what I'm missing here?
split only makes sense when converting from one long str to a list of the components of said str. If the input was in the form 'word1 word2 word3', yes, split would convert that str to ['word1', 'word2', 'word3'], but your input is a set, and there is no sane way to "split" a set like you seem to want; it's already a bag of separated items.
All you really need to do is convert your set back to a sorted list. Replace:
split_10 = uniq_10.split()
sorted_10 = split_10.sort()
with either:
sorted_10 = list(uniq_10)
sorted_10.sort() # NEVER assign the result of .sort(); it's always going to be None
or the simpler one-liner that encompasses both listifying and sorting:
sorted_10 = sorted(uniq_10) # sorted, unlike list.sort, returns a new list
The final option is generally the most Pythonic approach to converting an arbitrary iterable to list and sorting that new list, returning the result. It doesn't mutate the input, doesn't rely on the input being a specific type (set, tuple, list, it doesn't matter), and it's simpler to boot. You only use list.sort() when you already have a known list, and don't mind mutating it.
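As a complete sketch of the whole task: the function name split_90_10 and the sample words are made up, and the code assumes the 10% part is meant to be the tail of the list, words[cut:], which is not the same slice as words[pos_10:] in the question.

```python
def split_90_10(words):
    """Return the first 90% of the words and the sorted unique words of the rest."""
    cut = (90 * len(words)) // 100     # index where the last 10% starts
    head = words[:cut]
    tail = words[cut:]
    return head, sorted(set(tail))     # sorted() accepts the set directly


words = ['x'] * 27 + ['pear', 'apple', 'pear']
head, unique_tail = split_90_10(words)
print(len(head))       # 27
print(unique_tail)     # ['apple', 'pear']
```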

Check if string in strings

I have a huge list containing many strings like:
['xxxx','xx','xy','yy','x',......]
Now I am looking for an efficient way that removes all strings that are present within another string. For example 'xx' 'x' fit in 'xxxx'.
As the dataset is huge, I was wondering if there is an efficient method for this beside
if a in b:
The complete code: With maybe some optimization parts:
for x in range(len(taxlistcomplete)):
    if delete == True:
        x = x - 1
        delete = False
    for y in range(len(taxlistcomplete)):
        if taxlistcomplete[x] in taxlistcomplete[y]:
            if x != y:
                print(x, y)
                print(taxlistcomplete[x])
                del taxlistcomplete[x]
                delete = True
                break
    print(x, len(taxlistcomplete))
An updated version of the code:
for x in enumerate(taxlistcomplete):
    if delete == True:
        #If element is removed, I need to step 1 back and continue looping.....
        delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                print(x[1], y[1])
                print(taxlistcomplete[x[0]])
                del taxlistcomplete[x[0]]
                delete = True
                break
    print(x, len(taxlistcomplete))
Now implemented with the enumerate, only now I am wondering if this is more efficient and howto implement the delete step so I have less to search in as well.
Just a short thought...
Basically what I would like to see...
if element does not match any other elements in list write this one to a file.
Thus if 'xxxxx' not in 'xx','xy','wfirfj',etc... print/save
A new simple version as I dont think I can optimize it much further anyway...
print('comparison')
file = open('output.txt','a')
for x in enumerate(taxlistcomplete):
    delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                taxlistcomplete[x[0]] = ''
                delete = True
                break
    if delete == False:
        file.write(str(x))
file.close()
x in <string> is fast, but checking each string against all other strings in the list will take O(n^2) time. Instead of shaving a few cycles by optimizing the comparison, you can achieve huge savings by using a different data structure so that you can check each string in just one lookup: For two thousand strings, that's two thousand checks instead of four million.
There's a data structure called a "prefix tree" (or trie) that allows you to very quickly check whether a string is a prefix of some string you've seen before. Google it. Since you're also interested in strings that occur in the middle of another string x, index all substrings of the form x, x[1:], x[2:], x[3:], etc. (So: only n substrings for a string of length n). That is, you index substrings that start in position 0, 1, 2, etc. and continue to the end of the string. That way you can just check if a new string is an initial part of something in your index.
You can then solve your problem in O(n) time like this:
Order your strings in order of decreasing length. This ensures that no string could be a substring of something you haven't seen yet. Since you only care about length, you can do a bucket sort in O(n) time.
Start with an empty prefix tree and loop over your ordered list of strings. For each string x, use your prefix tree to check whether it is a substring of a string you've seen before. If not, add its substrings x, x[1:], x[2:] etc. to the prefix tree.
Deleting in the middle of a long list is very expensive, so you'll get a further speedup if you collect the strings you want to keep into a new list (the actual string is not copied, just the reference). When you're done, delete the original list and the prefix tree.
If that's too complicated for you, at least don't compare everything with everything. Sort your strings by size (in decreasing order), and only check each string against the ones that have come before it. This will give you a 50% speedup with very little effort. And do make a new list (or write to a file immediately) instead of deleting in place.
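The suffix-indexing idea can be sketched with nested dictionaries standing in for the prefix tree. This is an illustrative sketch rather than a tuned implementation; the name remove_contained is hypothetical, duplicates are collapsed with set(), and the order of the result is not preserved.

```python
def remove_contained(strings):
    """Keep only the strings that are not substrings of another string."""
    trie = {}    # nested dicts, one level per character
    kept = []
    # Longest first: a string can only be contained in one already indexed.
    for s in sorted(set(strings), key=len, reverse=True):
        # s is a substring of an indexed string exactly when it is a
        # prefix of one of the indexed suffixes.
        node = trie
        for ch in s:
            node = node.get(ch)
            if node is None:
                break
        else:
            continue             # every character matched: s is covered
        kept.append(s)
        # Index every suffix s[0:], s[1:], s[2:], ...
        for start in range(len(s)):
            node = trie
            for ch in s[start:]:
                node = node.setdefault(ch, {})
    return kept


print(sorted(remove_contained(['xxxx', 'xx', 'xy', 'yy', 'x'])))
# ['xxxx', 'xy', 'yy']
```

Indexing all suffixes of a string of length k stores on the order of k*k characters, so for very long strings a proper suffix tree or suffix automaton would be more memory-friendly.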
Here is a simple approach, assuming you can identify a character (I will use '$' in my example) that is guaranteed not to be in any of the original strings:
result = ''
# Longest first, so each string is checked against every string that could contain it
for substring in sorted(taxlistcomplete, key=len, reverse=True):
    if substring not in result:
        result += '$' + substring
taxlistcomplete = result.split('$')[1:] # [1:] drops the empty string before the first '$'
This leverages Python's internal optimizations for substring searching by just making one big string to substring-search :)
Here is my suggestion. First I sort the elements by length, because the shorter a string is, the more likely it is to be a substring of another string. Then I have two loops: I run through the list and drop every element el that is a substring of another element. Note that the outer for loop only passes each element once.
By sorting the list first, we destroy the original order of the elements. So if the order is important, you can't use this solution.
Edit: I assume there are no identical elements in the list, so that when el == el2, it's because it's the same element.
a = ["xyy", "xx", "zy", "yy", "x"]
a.sort(key=len)
kept = [] # build a new list instead of removing while iterating
for el in a:
    # keep el only if no other element contains it
    if not any(el in el2 and el != el2 for el2 in a):
        kept.append(el)
a = kept
A list comprehension with the in operator is a concise, Pythonic way to filter a list. For example, this keeps only the elements that contain 'xx':
[element for element in arr if 'xx' in element]
