Is there any algorithm that can be applied to this program?

I am an intern writing a program to do gene matching.
For example:
File "A" contains some strings of gene identifiers (the original data is not sorted):
rs17760268
rs10439884
rs4911642
rs157640
rs1958589
rs10886159
rs424232
....
and file "B" contains 900 thousands of rs number like above (also not sorted)
My program now can get correct results, but I would like to make it more efficient.
Is there any algorithm that can be applied to this program?
BTW, I will try to make my program do multi-processing and see if it gets better performance.
Pseudocode:
# read file "A" line by line; A[] holds the rs numbers from file "A"
with open("A") as file_A:
    A = [line.strip() for line in file_A]

# read file "B" line by line and compare each rs number against all of A
result = []
with open("B") as file_B_reader:
    for gene_B in file_B_reader:
        gene_B = gene_B.strip()
        for gene_A in A:
            if gene_A == gene_B:
                result.append(gene_A)  # append matches to result[]

I don't think there's a need to sort anything first.
Process larger list B into a hashmap or hashset, O(n) amortized
Iterate over list A and remove from A if not in B, O(m)
return A
Total: O(n + m)
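A minimal sketch of those steps (assuming one rs number per line, with the file names taken from the question):

with open("B") as file_b:                  # build the hash set: O(n)
    b_set = {line.strip() for line in file_b}

with open("A") as file_a:                  # one O(1) lookup per line of A: O(m)
    result = [line.strip() for line in file_a if line.strip() in b_set]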

Though your explanation is quite unclear, I gather that you are appending the A values to a list. Use a dictionary (or set) instead, and you can look values up much more efficiently.

From the description it appears you want result[] to contain rs strings that are in both A and B (aka Intersection).
Your algorithm is O(n*m), but you could easily improve this by sorting both files first (O(n log n) for comparison-based sorts), then reading from both at the same time, advancing the position in whichever file has the lower current rs number and adding matches to result[] along the way.
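A minimal sketch of that merge step, assuming both files have already been read into lists a and b:

def sorted_intersection(a, b):
    a, b = sorted(a), sorted(b)        # O(n log n) + O(m log m)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:               # match: record it, advance both sides
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:              # advance the side with the lower rs number
            i += 1
        else:
            j += 1
    return result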


Is there a better way than a while loop to perform this function?

I was attempting some python exercises and I hit the 5s timeout on one of the tests. The function is pre-populated with the parameters and I am tasked with writing code that is fast enough to run within the max timeframe of 5s.
There are N dishes in a row on a kaiten belt, with the i-th dish being of type D[i]. Some dishes may be of the same type as one another. The N dishes will arrive in front of you, one after another in order, and for each one you'll eat it as long as it isn't the same type as any of the previous K dishes you've eaten. You eat very fast, so you can consume a dish before the next one gets to you. Any dishes you choose not to eat as they pass will be eaten by others.
Determine how many dishes you'll end up eating.
Issue
The code "works" but is not fast enough.
Code
The idea here is to add the D[i] entry if it is not in the pastDishes list (which is capped at size K).
from typing import List
# Write any import statements here

def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    numDishes = 0
    pastDishes = []
    i = 0
    while i < N:
        if D[i] not in pastDishes:
            numDishes += 1
            pastDishes.append(D[i])
            if len(pastDishes) > K:
                pastDishes.pop(0)
        i += 1
    return numDishes
Is there a more effective way?
After much trial and error, I have finally found a solution that is fast enough to pass the final case in the puzzle you are working on. My previous code was very neat and quick; however, I have now found a module with a tool that makes this much faster. It's from collections, just as deque is, and it is called Counter.
This was my original code:
def getMaximumEatenDishCount(N: int, D: list, K: int) -> int:
    numDishes = lastMod = 0
    pastDishes = [0] * K
    for Dval in D:
        if Dval in pastDishes:
            continue
        pastDishes[lastMod] = Dval
        numDishes, lastMod = numDishes + 1, (lastMod + 1) % K
    return numDishes
I then implemented Counter like so:
from typing import List
# Write any import statements here
from collections import Counter

def getMaximumEatenDishCount(N: int, D: 'list[int]', K: int) -> int:
    eatCount = lastMod = 0
    pastDishes = [0] * K
    eatenCounts = Counter({0: K})
    for Dval in D:
        if Dval in eatenCounts:
            continue
        eatCount += 1
        eatenCounts[Dval] += 1
        val = pastDishes[lastMod]
        if eatenCounts[val] <= 1:
            eatenCounts.pop(val)
        else:
            eatenCounts[val] -= 1
        pastDishes[lastMod] = Dval
        lastMod = (lastMod + 1) % K
    return eatCount
This ended up working quite well. I'm sure you can make it less clunky, but it should work fine on its own.
Some explanation of what I am doing:
While loops are typically marginally faster than for loops, but since I need to access the value at an index multiple times, a for loop is actually better in this situation. You can see I also initialised the list to the maximum size it needs to be and am overwriting values instead of popping and appending, which saves a lot of time. Additionally, as pointed out by @outis, another small improvement was made by using the modulo operator in conjunction with the lastMod variable, which removes the need for an additional if statement. The Counter is essentially a special dict object that holds a hashable as the key and an int as the value. I use the fact that lastMod is an index to what would normally be accessed through list.pop(0) to find the object that needs to be removed or decremented in the Counter.
Note that it is not considered 'pythonic' to assign multiple variables on one line; however, I believe it adds a slight performance boost, which is why I have done it. This can be argued, though; see this post.
If anyone else is interested the problem that we were trying to solve, it can be found here: https://www.facebookrecruiting.com/portal/coding_puzzles/?puzzle=958513514962507
Can we use an appropriate data structure? If so:
Data structures
It seems like an ordered set, which you have to shrink to meet a capacity restriction of K.
To meet that, if the capacity is exceeded (len(ordered_set) > K) we have to remove the first n items, where n = len(ordered_set) - K. Ideally the removal performs in O(1).
However, since removal from a set happens in an unordered fashion, we first transform it to a list: a list containing the unique elements in the order of appearance in their original sequence.
From that ordered list we can then remove the first n elements.
For example: the function lru returns the least-recently-used items for a sequence seq limited by capacity-limit k.
To obtain the length we can simply call len() on that LRU return value:
maximumEatenDishCount = len(lru(seq, k))
See also:
Does Python have an ordered set?
Fastest way to get sorted unique list in python?
Using set for uniqueness (up to Python 3.6)
def lru(seq, k):
    return list(set(seq))[:k]
Using dict for uniqueness (since Python 3.6)
Same mechanics as above, but using the preserved insertion order of dicts since 3.7:
using OrderedDict explicitly
from collections import OrderedDict

def lru(seq, k):
    return list(OrderedDict.fromkeys(seq).keys())[:k]
using dict factory-method:
def lru(seq, k):
    return list(dict.fromkeys(seq).keys())[:k]
using dict-comprehension:
def lru(seq, k):
    return list({i: 0 for i in seq}.keys())[:k]
See also:
The order of keys in dictionaries
Using ordered dictionary as ordered set
How do you remove duplicates from a list whilst preserving order?
Real Python: OrderedDict vs dict in Python: The Right Tool for the Job
As the problem is an exercise, exact solutions are not worked through in detail; instead, strategies are described, each followed by a minimal illustrative sketch.
There are at least a couple potential approaches:
Use a data structure that supports fast containment testing (a set in use, if not in name) limited to the K most recently eaten dishes. Fortunately, since dict preserves insertion order in newer Python versions and testing key containment is fast, it will fit the bill. dict requires that keys be hashable, but since the problem uses ints to represent dish types, that requirement is met.
With this approach, the algorithm in the question remains unchanged.
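For illustration, a minimal sketch of this first approach, using a plain dict (insertion-ordered since Python 3.7) as the set of the last K eaten dish types:

def getMaximumEatenDishCount(N, D, K):
    window = {}                              # dish type -> None; at most K entries
    eaten = 0
    for d in D:
        if d in window:                      # same type as one of the last K eaten
            continue
        window[d] = None
        eaten += 1
        if len(window) > K:                  # evict the oldest eaten type
            del window[next(iter(window))]
    return eaten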
Rather than checking whether the next dish type is any of the last K dishes, check whether the last time the next dish was eaten is within K of the current plate count. If it is, skip the dish. If not, eat the dish (update both the record of when the next dish was last eaten and the current dish count). In terms of data structures, the program will need to keep a record of when any given dish type was last eaten (initialized to -K-1 to ensure that the first time a dish type is encountered it will be eaten; defaultdict can be very useful for this).
With this approach, the algorithm is slightly different. The code ends up being slightly shorter, as there's no shortening of the data structure storing information about the dishes as there is in the original algorithm.
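And a minimal sketch of this second approach, using the defaultdict and the -K-1 sentinel described above:

from collections import defaultdict

def getMaximumEatenDishCount(N, D, K):
    last_eaten = defaultdict(lambda: -K - 1)  # "never eaten" sentinel
    eaten = 0
    for d in D:
        if eaten - last_eaten[d] >= K:        # not within K of the current count
            eaten += 1
            last_eaten[d] = eaten             # record when this type was eaten
    return eaten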
There are two takeaways from the latter approach that might be applied when solving other problems:
More broadly, reframing a problem (such as from "the dish is in the last K dishes eaten" to "the dish was last eaten within K dishes of now") can result in a simpler approach.
Less broadly, sometimes it's more efficient to work with a flipped data structure, swapping keys/indices and values.
Approach & takeaway 2 both remind me of a substring search algorithm (the name escapes me) that uses a table of positions in the needle (the string to search for) of where each character first appears (for characters not in the string, the table has the length of the string); when a mismatch occurs, the algorithm uses the table to align the substring with the mismatching character, then starts checking at the start of the substring. It's not the most efficient string search algorithm, but it's simple and more efficient than the naive algorithm. It's similar to but simpler and less efficient than the skip search algorithm, which uses the positions of every occurrence of each character in the needle.
from typing import List
# Write any import statements here
from collections import deque, Counter

def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    q = deque()
    cnt = 0
    dish_counter = Counter()
    for d in D:
        if dish_counter[d] == 0:
            cnt += 1
            q.append(d)
            dish_counter[d] += 1
            if len(q) == K + 1:
                remove = q.popleft()
                dish_counter[remove] -= 1
    return cnt

Merging and sorting n strings in O(n)

I recently was given a question in a coding challenge where I had to merge n strings of alphanumeric characters and then sort the new merged string, allowing only alphabetical characters in the sorted string. Now, this would be fairly straightforward except for the added caveat that the algorithm had to be O(n) (it didn't specify whether this meant time or space complexity, or both).
My initial approach was to concatenate the strings into a new one, adding only alphabetical characters, and then sort at the end. I wanted to come up with a more efficient solution, but I was given less time than I was initially told. There isn't any sorting algorithm (that I know of) which runs in O(n) time, so the only thing I can think of is that I could increase the space complexity and use a sorted hashtable (e.g. a C++ map) to store the counts of each character and then print the hashtable in sorted order. But as this would require possibly printing n characters n times, I think it would still run in quadratic time. Also, I was using Python, which I don't think has a way to keep a dictionary sorted (maybe it does).
Is there any way this problem could have been solved in O(n) time and/or space complexity?
Your counting sort is the way to go: build a simple count table for the 26 letters in order. Iterate through your two strings, counting letters, ignoring non-letters. This is one pass of O(n). Now, simply go through your table, printing each letter the number of times indicated. This is also O(n), since the sum of the counts cannot exceed n. You're not printing n letters n times each: you're printing a total of n letters.
1. Concatenate your strings (not really needed; you can also count chars in the individual strings).
2. Create an array with length equal to the total number of charcodes.
3. Read through your concatenated string and count occurrences in the array made at step 2.
4. Reading through the char frequency array, build up an output array with the right number of repetitions of each char.
Since each step is O(n), the whole thing is O(n).
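A minimal sketch of these steps, assuming the input is restricted to ASCII letters and case is folded to lowercase (adjust if case matters):

def merge_and_sort(strings):
    counts = [0] * 26                        # one slot per letter a-z
    for s in strings:                        # the single O(n) counting pass
        for ch in s:
            if ch.isalpha():
                counts[ord(ch.lower()) - ord('a')] += 1
    # emit each letter the counted number of times; total output is O(n)
    return ''.join(chr(ord('a') + i) * counts[i] for i in range(26))

print(merge_and_sort(["ab1c", "ca!d"]))      # prints "aabccd"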
If I've understood the requirement correctly, you're simply sorting the characters in the string?
I.e. ADFSACVB becomes AABCDFSV?
If so then the trick is to not really "sort". You have a fixed (and small) number of values. So you can simply keep a count of each value and generate your result from that.
E.g. Given ABACBA
In the first pass, increment a counter in an array indexed by character. This produces:
[A] == 3
[B] == 2
[C] == 1
In the second pass, output each character the number of times indicated by its counter: AAABBC
In summary, you're told to sort, but thinking outside the box, you really want a counting algorithm.
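The same counting idea can be written compactly with collections.Counter (a sketch; sorted() here touches at most 26 distinct keys, so the pass over the input is still O(n)):

from collections import Counter

def counting_sort_letters(s):
    counts = Counter(ch for ch in s if ch.isalpha())
    # at most 26 distinct keys, so this sort is O(1) relative to n
    return ''.join(ch * counts[ch] for ch in sorted(counts))

print(counting_sort_letters("ABACBA"))  # prints "AAABBC"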

Determine whether string contained within another string in python

I am looking to determine whether a string is fully contained at the start of any of a list of other strings. For example, if I had the string cde and the list of strings:
['ab', 'bce', 'cdef']
then it would be determined that cde is contained at the start of cdef.
I'm also looking to go the other way around, i.e. given the term abc, to identify that ab from the above list is contained at the start of it.
Now obviously this is trivial to set up with a for loop, checking each instance with the function startswith; however, this is not scalable with a very large list of possibilities to check.
While checking each instance is O(n) [and hence very slow if you have 100,000 possibilities], I am looking for a way of checking in O(1)... it feels like if the "list" were pre-sorted, one could simply extract the nearest match, but I'm not sure how.
Clarification:
I am solely looking for cases where there is a perfect match at the start of the string (i.e. the whole of the search term is included).
I will be looking up multiple search terms (thus, while initially sorting the data may not be quick, the sunk cost would pay off on subsequent lookups).
Ideally it would return every possible match (i.e. if cdef and cdefg were in the list, looking up cde would return both).
I use the term "list" loosely, as in a collection of terms.
It's not possible in O(1), since by definition you have to go over the entire array. If the array is sorted then you can do a binary search for your string, and then check if the element at that position starts with your string. That operation is O(log n).
import bisect

# Return the index of the string starting with the prefix,
# or None if no such string is in the list.
def search(a, prefix):
    i = bisect.bisect_left(a, prefix)
    isAtStart = (i < len(a) and a[i].startswith(prefix))
    return i if isAtStart else None

search(['ab', 'bce', 'cdef'], 'bc')
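Since the clarification asks for every possible match, the same idea extends naturally: in a sorted list, all strings sharing the prefix form a contiguous run starting at the insertion point. A sketch (search_all is a hypothetical helper, not part of bisect):

import bisect

def search_all(a, prefix):
    i = bisect.bisect_left(a, prefix)
    matches = []
    while i < len(a) and a[i].startswith(prefix):  # walk the contiguous run
        matches.append(a[i])
        i += 1
    return matches

print(search_all(['ab', 'bce', 'cdef', 'cdefg'], 'cde'))  # ['cdef', 'cdefg']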

Random DNA mutation Generator

I'd like to create a dictionary of dictionaries for a series of mutated DNA strands, with each dictionary recording the original base as well as the base it has mutated to.
To elaborate, I would like to create a generator that takes a specific DNA strand as input and cranks out 100 randomly generated strands that have a mutation frequency of 0.66% (this applies to each base, and each base can mutate to any other base). Then, I would like to create a series of dictionaries, where each dictionary details the mutations that occurred in a specific randomly generated strand. I'd like the keys to be the original bases, and the values to be the new mutated bases. Is there a straightforward way of doing this? So far, I've been experimenting with a loop that looks like this:
import random

# yields a strand with an A->T mutation frequency of 0.66%
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)

dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.0066)
print("DNA now:", newDNA)
But I can only yield one strand with this code, and it only covers A->T mutations. I'm also not sure how to tie the dictionary into this. Could someone show me a better way of doing this? Thanks.
It sounds like there are two parts to your issue. The first is that you want to mutate your DNA sequence several times, and the second is that you want to gather some additional information about the mutations in a data structure of some kind. I'll handle each of those separately.
Producing 100 random results from the same source string is pretty easy. You can do it with an explicit loop (for instance, in a generator function), but you can just as easily use a list comprehension to run a single-mutation function over and over:
results = [mutate(original_string) for _ in range(100)]
Of course, if you make the mutate function more complicated, this simple code may not be appropriate. If it returns some kind of more sophisticated data structure, rather than just a string, you may need to do some additional processing to combine the data in the format you want.
As for how to build those data structures, I think the code you have already is a good start. You'll need to decide how exactly you're going to be accessing your data, and then let that guide you to the right kind of container.
For instance, if you just want to have a simple record of all the mutations that happen to a string, I'd suggest a basic list that contains tuples of the base before and after the mutation. On the other hand, if you want to be able to efficiently look up what a given base mutates to, a dictionary with lists as values might be more appropriate. You could also include the index of the mutated base if you wanted to.
Here's a quick attempt at a function that returns the mutated string along with a list of tuples recording all the mutations:
bases = "ACGT"
def mutate(orig_string, mutation_rate=0.0066):
result = []
mutations = []
for base in orig_string:
if random.random() < mutation_rate:
new_base = bases[bases.index(base) - random.randint(1, 3)] # negatives are OK
result.append(new_base)
mutations.append((base, new_base))
else:
result.append(base)
return "".join(result), mutations
The most tricky bit of this code is how I'm picking the replacement for the current base. The expression bases[bases.index(base) - random.randint(1, 3)] does it all in one go. Let's break down the different bits. bases.index(base) gives the index of the previous base in the global bases string at the top of the code. Then I subtract a random offset from this index (random.randint(1, 3)). The new index may be negative, but that's OK: when we use it to index back into the bases string (bases[...]), negative indexes count from the right rather than the left.
Here's how you could use it:
string = "ATGT"
results = [mutate(string) for _ in range(100)]
for result_string, mutations in results:
if mutations: # skip writing out unmutated strings
print(result_string, mutations)
For short strings, like "ATGT" you're very unlikely to get more than one mutation, and even one is pretty rare. The loop above tends to print between 2 and 4 results on each run (that is, more than 95% of length-four strings are not mutated at all). Longer strings will have mutations more often, and it's more plausible that you'll see multiple mutations in one string.
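If you prefer the dictionary layout discussed above (original base as key), here is a sketch that folds one run's (old, new) tuples into a dict of lists, assuming the mutate() defined earlier:

from collections import defaultdict

def mutations_as_dict(mutations):
    lookup = defaultdict(list)            # original base -> list of new bases
    for old, new in mutations:
        lookup[old].append(new)
    return dict(lookup)

mutated, mutations = mutate("ATGTCGTACGTTTGACGTAGAG")
print(mutations_as_dict(mutations))       # e.g. {'A': ['T'], 'G': ['C']}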

Check if string in strings

I have a huge list containing many strings like:
['xxxx','xx','xy','yy','x',......]
Now I am looking for an efficient way to remove all strings that are present within another string. For example, 'xx' and 'x' fit in 'xxxx'.
As the dataset is huge, I was wondering if there is an efficient method for this besides
if a in b:
The complete code, with maybe some parts to optimize:
for x in range(len(taxlistcomplete)):
    if delete == True:
        x = x - 1
        delete = False
    for y in range(len(taxlistcomplete)):
        if taxlistcomplete[x] in taxlistcomplete[y]:
            if x != y:
                print x, y
                print taxlistcomplete[x]
                del taxlistcomplete[x]
                delete = True
                break
print x, len(taxlistcomplete)
An updated version of the code:
for x in enumerate(taxlistcomplete):
    if delete == True:
        # If an element is removed, I need to step 1 back and continue looping...
        delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                print x[1], y[1]
                print taxlistcomplete[x]
                del taxlistcomplete[x[0]]
                delete = True
                break
print x, len(taxlistcomplete)
Now implemented with enumerate; I am wondering whether this is more efficient, and how to implement the delete step so that I have less to search through as well.
Just a short thought...
Basically what I would like to see...
if an element does not match any other element in the list, write it to a file.
Thus if 'xxxxx' not in 'xx','xy','wfirfj',etc... print/save
A new simple version, as I don't think I can optimize it much further anyway...
print 'comparison'
file = open('output.txt', 'a')
for x in enumerate(taxlistcomplete):
    delete = False
    for y in enumerate(taxlistcomplete):
        if x[1] in y[1]:
            if x[1] != y[1]:
                taxlistcomplete[x[0]] = ''
                delete = True
                break
    if delete == False:
        file.write(str(x))
x in <string> is fast, but checking each string against all other strings in the list will take O(n^2) time. Instead of shaving a few cycles by optimizing the comparison, you can achieve huge savings by using a different data structure so that you can check each string in just one lookup: For two thousand strings, that's two thousand checks instead of four million.
There's a data structure called a "prefix tree" (or trie) that allows you to very quickly check whether a string is a prefix of some string you've seen before. Google it. Since you're also interested in strings that occur in the middle of another string x, index all substrings of the form x, x[1:], x[2:], x[3:], etc. (So: only n substrings for a string of length n). That is, you index substrings that start in position 0, 1, 2, etc. and continue to the end of the string. That way you can just check if a new string is an initial part of something in your index.
You can then solve your problem in O(n) time like this:
Order your strings in order of decreasing length. This ensures that no string could be a substring of something you haven't seen yet. Since you only care about length, you can do a bucket sort in O(n) time.
Start with an empty prefix tree and loop over your ordered list of strings. For each string x, use your prefix tree to check whether it is a substring of a string you've seen before. If not, add its substrings x, x[1:], x[2:] etc. to the prefix tree.
Deleting in the middle of a long list is very expensive, so you'll get a further speedup if you collect the strings you want to keep into a new list (the actual string is not copied, just the reference). When you're done, delete the original list and the prefix tree.
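A rough sketch of this scheme, using nested dicts as the prefix tree (illustrative, not tuned; indexing every suffix costs extra memory):

def filter_substrings(strings):
    root = {}                                          # prefix tree over suffixes of kept strings
    kept = []
    for s in sorted(strings, key=len, reverse=True):   # longest first
        # s is a substring of a kept string iff s is a prefix of an indexed suffix
        node, contained = root, True
        for ch in s:
            if ch not in node:
                contained = False
                break
            node = node[ch]
        if not contained:
            kept.append(s)
            for i in range(len(s)):                    # index s, s[1:], s[2:], ...
                node = root
                for ch in s[i:]:
                    node = node.setdefault(ch, {})
    return kept

print(filter_substrings(['xxxx', 'xx', 'xy', 'yy', 'x']))  # ['xxxx', 'xy', 'yy']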
If that's too complicated for you, at least don't compare everything with everything. Sort your strings by size (in decreasing order), and only check each string against the ones that have come before it. This will give you a 50% speedup with very little effort. And do make a new list (or write to a file immediately) instead of deleting in place.
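A sketch of that simpler fallback, collecting survivors into a new list instead of deleting in place:

def filter_simple(strings):
    kept = []                                          # already-checked, longer-or-equal strings
    for s in sorted(strings, key=len, reverse=True):
        if not any(s in k for k in kept):
            kept.append(s)
    return kept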
Here is a simple approach, assuming you can identify a character (I will use '$' in my example) that is guaranteed not to be in any of the original strings:
result = ''
for substring in taxlistcomplete:
    if substring not in result:
        result += '$' + substring
taxlistcomplete = result.split('$')
This leverages Python's internal optimizations for substring searching by just making one big string to substring-search :)
Here is my suggestion. First I sort the elements by length, because obviously the shorter a string is, the more likely it is to be a substring of another string. Then I have two for loops, where I run through the list and remove every element of which el is a substring. Note that the first for loop passes each element only once.
By sorting the list first, we destroy the order of elements in the list. So if the order is important, you can't use this solution.
Edit: I assume there are no identical elements in the list, so that when el == el2, it's the same element.
a = ["xyy", "xx", "zy", "yy", "x"]
a.sort(key=len)
for el in a:
for el2 in a:
if el in el2 and el != el2:
a.remove(el2)
Using a list comprehension (note the in) is the fastest and most Pythonic way of solving your problem:
[element for element in arr if 'xx' in element]
