My code seems to have some sort of error when computing the 'Q' variable in the following code. Counting by hand and checking against the values I should obtain, Q should be equal to 3; in my case it is equal to 4. It also seems like my code is not properly printing the positions at which the characters differ between the two strings. I am struggling to solve this.
s1 = 'GAGACTACGACTAGAGCTAGACGGTACAC'
s2 = 'CAGGCCACTACCCGAGTTGGACAGAATAC'
P1 = 0
P2 = 0
sites = []
for i in range(len(s1)):
    if s1[i] == 'A' and s2[i] == 'G':
        P2 += 1
        z = (i + 1)
        sites.append(z)
    if s1[i] == 'G' and s2[i] == 'A':
        P2 += 1
        z = (i + 1)
        sites.append(z)
    if s1[i] == 'C' and s2[i] == 'T':
        P1 += 1
        z = (i + 1)
        sites.append(z)
    if s1[i] == 'T' and s2[i] == 'C':
        P1 += 1
        z = (i + 1)
        sites.append(z)
P = P1 + P2
print('L length of sequence:', len(s1))
print('P1 transitional difference between pyrimidines (c-t):', P1, '/', len(s1))
print('P2 transitional difference between purines (a-g):', P2, '/', len(s1))
x = len([i for i in range(len(s1)) if s1[i] != s2[i]])
Q = x - P
print('Q transversions', Q, '/', len(s1), '\n')
print('transitions', (P1 + P2))
print('number of different sites', x)
print('locations at which sites differ', sites)
Output:
L length of sequence: 29
P1 transitional difference between pyrimidines (c-t): 4 / 29
P2 transitional difference between purines (a-g): 3 / 29
Q transversions 4 / 29
transitions 7
number of different sites 11
locations at which sites differ [4, 6, 12, 17, 19, 23, 27]
Proper values:
The answer here was simply my mistake; the code works completely fine, but I mistyped a letter in s2.
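For reference, a more compact way to compute the same quantities is to walk both strings with zip and classify each differing position; this is just a sketch using the sequences as posted (which, per the note above, contain the typo), not a change to the original approach:

purines = {'A', 'G'}
pyrimidines = {'C', 'T'}

P1 = P2 = Q = 0
sites = []
for pos, (x, y) in enumerate(zip(s1, s2), start=1):
    if x == y:
        continue
    sites.append(pos)                  # every position where the strings differ
    if {x, y} <= purines:
        P2 += 1                        # purine transition (A <-> G)
    elif {x, y} <= pyrimidines:
        P1 += 1                        # pyrimidine transition (C <-> T)
    else:
        Q += 1                         # transversion

print('transitions', P1 + P2, 'transversions', Q, 'differing sites', sites)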
Related
I have millions of DNA clone reads, and a few of them are misreads or errors. I want to separate out the clean reads only.
For a non-biological background: a DNA clone consists of only four characters (A, T, C, G) in various permutations/combinations. Any character, symbol, or sign other than "A", "T", "C", and "G" in a DNA read is an error.
Is there any (fast/high-throughput) way in Python to separate out the clean reads only?
Basically, I want a way to keep only the strings that contain nothing but the characters "A", "T", "C", and "G".
Edit
correct_read_clone: "ATCGGTTCATCGAATCCGGGACTACGTAGCA"
misread_clone: "ATCGGNATCGACGTACGTACGTTTAAAGCAGG" or "ATCGGTT#CATCGAATCCGGGACTACGTAGCA" or "ATCGGTTCATCGAA*TCCGGGACTACGTAGCA" or "AT?CGGTTCATCGAATCCGGGACTACGTAGCA" etc
I have tried the for loop below:
check_list = ['A', 'T', 'C', 'G']
for i in clone:
    if i not in check_list:
        continue
The problem with this for loop is that it iterates over the string and matches character by character, which makes the process slow. To clean millions of clones, this delay is very significant.
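A sketch of a per-read check that avoids an explicit Python-level character loop: a read is clean exactly when its set of distinct characters is a subset of {'A', 'T', 'C', 'G'} (here clones stands for your list of reads):

VALID_BASES = frozenset("ATCG")

def is_clean(clone):
    # set(clone) is built in one pass in C; the subset test then only
    # touches the handful of distinct characters, not every position again.
    return set(clone) <= VALID_BASES

clean_reads = [clone for clone in clones if is_clean(clone)]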
If these are the nucleotide sequences with an error in 2 of them,
a = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
b = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
c = 'ATUACTGAGTCAGTACGT'
d = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
e = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
f = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
g = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
test = a, b, c, d, e, f, g
misread_counter = 0
correct_read_clone = []
for clone in test:
    if len(set(list(clone))) <= 4:
        correct_read_clone.append(clone)
    else:
        misread_counter += 1

print(f'Unclean sequences: {misread_counter}')
print(correct_read_clone)
Output:
Unclean sequences: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']
This way the for loop only has to look at each full sequence in the list of clones once, rather than looping over each character of every sequence.
Or, if you want to know which ones have the errors, you can make two lists:
misread_clone = []
correct_read_clone = []
for clone in test:
    bases = len(set(list(clone)))
    if bases > 4:
        misread_clone.append(clone)
    else:
        correct_read_clone.append(clone)
print(f'misread sequences count: {len(misread_clone)}')
print(correct_read_clone)
Output:
misread sequences count: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']
I don't think you're going to get too many significant improvements for this. Most operations on a string are going to be O(N), and there isn't much you can do to get that to O(log N) or O(1). Checking each character against ACTG also costs something per character, leading to a worst case of O(n*m), where n and m are the lengths of the string and of ACTG.
One thing you could do is cast the string into a set, which is O(N), check whether the length of the set is more than 4 (which should be impossible if the only characters are ACTG), and if not, loop through the set and do the check against ACTG. Note that a clone could be a string such as "AACCAA!!", which results in the set {'A', 'C', '!'}; its length is less than or equal to 4, but the read is still unclean/incorrect.
clones = [ "ACGTATCG", "AGCTGACGAT", "AGTACGATCAGT", "ACTGAGTCAGTACGT", "AGTACGTACGATCAGTACGT", "AAACCS", "AAACCCCCGGGGTTTS"]
for clone in clones:
    chars = set(clone)
    if len(chars) > 4:
        print(f"unclean: {clone}")
    else:
        for char in chars:
            if char not in "ACTG":
                print(f"unclean: {clone}")
                break
        else:
            print(f"clean: {clone}")
Since len() on a set is O(1), that could potentially skip the need to check against ACTG at all. If the length is less than or equal to 4, the check is O(n*m) again, but now n is guaranteed to be at most 4 while m stays at 4. The final process becomes O(n) rather than O(n*m), where n and m are the lengths of the set and of ACTG. Since you are now checking a set, and anything other than ACTG makes the read unclean, n is capped at 4. This means that no matter how large the original string is, doing the ACTG check on the set is at worst O(4*4) and thus essentially O(1) (Big O notation is about scale rather than exact values).
However, whether or not this is actually faster depends on the length of the original string. It may end up taking more time if the original string is short; that is unlikely, since the string would have to be very short, but it can happen.
You may save more time by tackling the number of entries, which you have noted is very large: if possible, consider splitting the work into smaller groups and running them concurrently. At the end of the day, though, none of these will change the asymptotic cost. They reduce the time taken, since you cut out a constant factor or run several checks at the same time, but it is still O(N*M), with N and M being the number and length of the strings, and there isn't anything that can really change that.
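Since whether the set-based check is actually faster depends on the string length, a rough timeit sketch can compare the two checks; the read length and repeat count here are made up:

import timeit

clone = "ACGT" * 250_000              # hypothetical 1,000,000-base clean read
VALID = frozenset("ACTG")

def char_loop(seq):
    for ch in seq:
        if ch not in "ACTG":
            return False
    return True

def set_check(seq):
    return set(seq) <= VALID

print("char loop:", timeit.timeit(lambda: char_loop(clone), number=10))
print("set check:", timeit.timeit(lambda: set_check(clone), number=10))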
try this:
def is_clean_read(read):
    for char in read:
        if char not in ['A', 'T', 'C', 'G']:
            return False
    return True
reads = [ "ACGTATCG", "AGCTGACGAT", "AGTACGATCAGT", "ACTGAGTCAGTACGT", "AGTACGTACGATCAGTACGT"]
clean_reads = [read for read in reads if is_clean_read(read)]
print(clean_reads)
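Along the same lines, a compiled regular expression can classify a whole read in one call instead of a Python-level loop; this is an alternative sketch, not a drop-in from the answer above:

import re

CLEAN_READ = re.compile(r"[ATCG]+")   # the whole read must consist of these bases

def is_clean_read_re(read):
    return CLEAN_READ.fullmatch(read) is not None

reads = ["ACGTATCG", "ATCGGN", "AT?CG"]
print([read for read in reads if is_clean_read_re(read)])  # ['ACGTATCG']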
OK, stealing from the answer https://stackoverflow.com/a/75393987/9877065 by Shorn, I tried to add multiprocessing. You can play with the length of my orfs list in the first part of the code, and then try changing number_of_processes = XXX to different values, from 1 up to your system max (multiprocessing.cpu_count()). Code:
import time
from multiprocessing import Pool
from datetime import datetime
a = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
b = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
c = 'ATUACTGAGTCAGTACGT'
d = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
e = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
f = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
g = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
aa = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
bb = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
cc = 'ATUACTGAGTCAGTACGT'
dd = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
ee = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
ff = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
gg = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
aaa = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
bbb = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
ccc = 'ATUACTGAGTCAGTACGT'
ddd = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
eee = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
fff = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
ggg = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
kkk = 'AAAAAAAAAAAAAAAAAAAAAAkkkkkkkkkkkkk'
clones = [a, b, c, d, e, f, g, aa, bb, cc, dd, ee, ff,gg, aaa, bbb, ccc, ddd, eee, fff, ggg, kkk]
clones_2 = clones
clones_2.extend(clones)
clones_2.extend(clones)
clones_2.extend(clones)
clones_2.extend(clones)
clones_2.extend(clones)
clones_2.extend(clones)
# clones_2.extend(clones)
# clones_2.extend(clones)
#print(clones_2, len(clones_2))
def check(clone):
    # WARNING: the sleep below increases CPU time vs I/O time
    # time.sleep(1)
    if len(set(clone)) > 4:
        print(f"unclean: {clone}")
    else:
        for char in clone:
            if char not in "ACTG":
                print(f"unclean: {clone}")
                break
        else:
            print(f"clean: {clone}")
begin = datetime.now()

number_of_processes = 4
p = Pool(number_of_processes)

list_a = []
cnt_atgc = 0
while True:
    for i in clones_2:
        try:
            list_a.append(i)
            cnt_atgc += 1
            if cnt_atgc == number_of_processes:
                result = p.map(check, list_a)
                p.close()
                p.join()
                p = Pool(number_of_processes)
                cnt_atgc = 0
                list_a = []
            else:
                continue
        except:
            print('SKIPPED !!!')
    if len(list_a) > 0:
        p = Pool(number_of_processes)
        result = p.map(check, list_a)
        p.close()
        p.join()
        break
    else:
        print('FINITO !!!!!!!!!!')
        break

print('done')
print(datetime.now() - begin)
I have to preload a list containing the orfs to be multiprocessed at each iteration. Despite that, it can at least cut the execution time in half on my machine. I am not sure how stdout influences the speed of the multiprocessing (and how to cope with result order; see python multiprocess.Pool show results in order in stdout).
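For comparison, a minimal sketch of letting Pool.map do the batching itself via chunksize, assuming the same check function and clones_2 list defined above; the chunk size is an arbitrary value to tune:

from multiprocessing import Pool, cpu_count

if __name__ == "__main__":
    # Pool.map splits the input into chunks for the workers and keeps
    # result order, so the manual batching loop above is not required.
    with Pool(cpu_count()) as p:
        results = p.map(check, clones_2, chunksize=64)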
Can anyone explain why my code for a HackerRank example is timing out? I'm new to the whole idea of code efficiency based on processing time. The code seems to work on small sets, but once I start testing cases using large datasets it times out. I've provided a brief explanation of the method and its purpose for context. If you could point out any functions I'm using that might consume a large amount of runtime, that would be great.
Complete the migratoryBirds function below.
Params: arr: an array of tallies of species of birds sighted by index.
For example, arr = [Type1 = 1, Type2 = 4, Type3 = 4, Type4 = 4, Type5 = 5, Type6 = 3]
Return the lowest type of the mode of sightings. In this case 4 sightings is the
mode. Type2 is the lowest type that has the mode. So return integer 2.
def migratoryBirds(arr):
    # list of counts of occurrences of bird types with the same
    # number of sightings
    bird_count_mode = []
    for i in range(1, len(arr) + 1):
        occurr_count = arr.count(i)
        bird_count_mode.append(occurr_count)
    most_common_count = max(bird_count_mode)
    common_count_index = bird_count_mode.index(most_common_count) + 1
    # Find the first occurrence of that common_count_index in arr
    # lowest_type_bird = arr.index(common_count_index) + 1
    # Expect Input: [1,4,4,4,5,3]
    # Expect Output: [1 0 1 3 1 0], 3, 4
    return bird_count_mode, most_common_count, common_count_index
P.S. Thank you for the edit Chris Charley. I just tried to edit it at the same time
Use collections.Counter() to create a dictionary that maps species to their counts. Get the maximum count from this, then get all the species with that count. Then search the list for the first element of one of those species.
import collections

def migratoryBirds(arr):
    species_counts = collections.Counter(arr)
    most_common_count = max(species_counts.values())
    most_common_species = {species for species, count in species_counts.items()
                           if count == most_common_count}
    for species in arr:
        if species in most_common_species:
            return species
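Called on the sample sightings from the question (a list of type IDs rather than tallies), the version above returns the most frequently sighted type:

print(migratoryBirds([1, 4, 4, 4, 5, 3]))  # 4 (type 4 was sighted three times)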
I am trying to find the positions of a match (N or -) in a large dataset.
The number of matches per string (3 million letters) is around 300,000. I have 110 strings to search in the same file, so I made a loop using re.finditer to match and report the position of each match, but it is taking a very long time. Each string (DNA sequence) is composed of only six characters (ATGCN-). Only 17 strings were processed in 11 hours. What can I do to speed up the process?
The part of the code I am talking about is:
for found in re.finditer(r"[-N]", DNA_sequence):
    position = found.start() + 1
    positions_list.append(position)
    positions_set = set(positions_list)
all_positions_set = all_positions_set.union(positions_set)
count += 1
print(str(count) + '\t' + record.id + '\t' + 'processed')
output_file.write(record.id + '\t' + str(positions_list) + '\n')
I also tried re.compile, as I googled and found that it could improve performance, but nothing changed (match = re.compile('[-N]')).
If you have roughly 300k matches - you are re-creating increasingly larger sets that contain exactly the same elements as the list you are already adding to:
for found in re.finditer(r"[-N]", DNA_sequence):
    position = found.start() + 1
    positions_list.append(position)
    positions_set = set(positions_list) # 300k times ... why? why at all?
You can instead simply use the list you got anyway and put that into your all_positions_set after you found all of them:
all_positions_set = all_positions_set.union(positions_list) # union takes any iterable
That should reduce the memory by more than 50% (sets are more expensive than lists) and also cut down the runtime significantly.
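Putting that together, the loop from the question could look roughly like this; record, DNA_sequence, output_file, count and all_positions_set come from the surrounding code that was not shown:

positions_list = []
for found in re.finditer(r"[-N]", DNA_sequence):
    positions_list.append(found.start() + 1)   # only append to the list here

all_positions_set = all_positions_set.union(positions_list)  # one union per record
count += 1
print(str(count) + '\t' + record.id + '\t' + 'processed')
output_file.write(record.id + '\t' + str(positions_list) + '\n')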
I am unsure what is faster, but you could even skip using regex:
t = "ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-"
pos = []
for idx,c in enumerate(t):
if c in "N-":
pos.append(idx)
print(pos) # [4, 5, 10, 11, 16, 17, 22, 23, 28, 29, 34, 35, 40, 41, 46, 47]
and instead use enumerate() on your string to find the positions; you would need to test whether that is faster.
Regarding not using regex: I actually did that, and my modified script now runs in less than 45 seconds using a defined function:
def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start + 1
        start += len(sub)
So the new coding part is:
N_list = list(find_all(DNA_sequence, 'N'))
dash_list = list(find_all(DNA_sequence, '-'))
positions_list = N_list + dash_list
all_positions_set = all_positions_set.union(positions_list)
count += 1
print(str(count) + '\t' +record.id+'\t'+'processed')
output_file.write(record.id+'\t'+str(sorted(positions_list))+'\n')
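As a quick sanity check of the generator on a short made-up string, the 1-based positions come out as expected:

print(list(find_all("ATGCN-ATGCN-", "N")))  # [5, 11]
print(list(find_all("ATGCN-ATGCN-", "-")))  # [6, 12]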
I have 2 files I converted to list of lists format. Short examples
a
c1 165.001 17593685
c2 1650.94 17799529
c3 16504399 17823261
b
1 rs3094315 **0.48877594** *17593685* G A
1 rs12562034 0.49571378 768448 A G
1 rs12124819 0.49944228 776546 G A
Using a 'for' loop, I tried to find the common values of these lists, but I can't get the loop working. I need this because I have to get the value that is adjacent to the value common to the two lists (in this example it is 0.48877594, since 17593685 is common to 'a' and 'b'). My attempts, which completely froze:
for i in a:
    if i[2] == [d[3] for d in b]:
        print(i[0], i[2] + d[2])
or
for i in a and d in b:
    if i[2] == d[3]
        print(i[0], i[2] + d[2]
Overall I need to get the first file with a new column, which will be that bold adjacent value. It is my first month of programming and I can't work out the logic. Thanks in advance!
+++
List's original format:
a = [['c1', '165.001', '17593685'], ['c2', '1650.94', '17799529'], ['c3', '16504399', '17823261']]
[['c1', '16504399', '17593685.1\n'], ['c2', '16504399', '17799529.1\n'], ['c3', '16504399', '17823261.\n']]
++++ My original data
Two or more people can have DNA segments that are the same, because they were inherited from a common ancestor. File 'a' contains the following columns:
SegmentID, start of segment, end of segment, IDs of individuals that share this segment (from 2 to infinity). Example (just a small part, since the real list has > 1000 rows of segments 'c'; the number of individuals can differ):
c1 16504399 17593685 19N 19N.0 19N 19N.0 182AR 182AR.0 182AR 182AR.0 6i 6i.1 6i 6i.1 153A 153A.1 153A 153A.1
c2 14404399 17799529 62BB 62BB.0 62BB 62BB.0 55k 55k.0 55k 55k.0 190k 190k.0 190k 190k.0 51A 51A.1 51A 51A.1 3A 3A.1 3A 3A.1 38k 38k.1 38k 38k.1
c3 1289564 177953453 164Bur 164Bur.0 164Bur 164Bur.0 38BO 38BO.1 38BO 38BO.1 36i 36i.1 36i 36i.1 100k 100k.1 100k 100k.1
file b:
This one always has 6 columns, but the number of rows is more than 100 million, so only part of it:
1 rs3094315 0.48877594 16504399 G A
1 rs12562034 0.49571378 17593685 A G
1 rs12124819 0.49944228 14404399 G A
1 rs3094221 0.48877594 17799529 G A
1 rs12562222 0.49571378 1289564 A G
1 rs121242223 0.49944228 177953453 G A
So, I need to compare a[1] with b[3] and, if they are equal,
print(a[1], b[3]), because b[3] is the position of the segment too, but in another measurement system. That is what I can't do.
Taking a leap (because the question isn't really clear), I think you are looking for the product of a, b, e.g.:
In []:
for i in a:
    for d in b:
        if i[2] == d[3]:
            print(i[0], i[2] + d[2])
Out[]:
c1 175936850.48877594
You can do the same with itertools.product():
In []:
import itertools as it

for i, d in it.product(a, b):
    if i[2] == d[3]:
        print(i[0], i[2] + d[2])
Out[]:
c1 175936850.48877594
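Since file b has over 100 million rows, a different option (not part of the answer above) is to index b by position once, so each row of a becomes a dictionary lookup instead of a scan over b; a sketch assuming the list-of-lists shown in the question:

# Build a lookup from position (b's 4th column) to the adjacent value once: O(len(b)).
position_to_value = {d[3]: d[2] for d in b}

for i in a:
    if i[2] in position_to_value:
        print(i[0], i[2] + position_to_value[i[2]])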
It would be much faster to leave your data as strings and search:
for a_line in [_ for _ in a.split('\n') if _]:  # skip blank lines
    search_term = a_line.strip().split()[-1]    # get search term
    term_loc_in_b = b.find(search_term)         # get search term location in file b
    if term_loc_in_b != -1:                     # -1 means term not found
        # split b once just before search term starts
        value_in_b = b[:term_loc_in_b].strip().rsplit(maxsplit=1)[-1]
        print(value_in_b)
    else:
        print('{} not found'.format(search_term))
If the file size is large, you might consider using mmap to search b.
mmap.find requires bytes, e.g. 'search_term'.encode()
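A rough sketch of that idea, assuming file b sits on disk as b.txt (the filename and search term are made up):

import mmap

with open('b.txt', 'rb') as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    loc = mm.find('17593685'.encode())  # mmap.find works on bytes
    if loc != -1:
        # take the token just before the match, as in the string version above
        value_in_b = mm[:loc].strip().rsplit(maxsplit=1)[-1]
        print(value_in_b.decode())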
When running the code below with the input ('Zoe', 14), I get 8 as the result. Running the 'Finding Buckets' code in the Online Python Tutor, also with ('Zoe', 14) and with def hash_string included, the result is 2 when that code finishes. Why? Or, in other words, do the other two defs cause that result?
In the 'Finding Buckets' code there are 3 defs. I exchanged the order of those defs and the results are the same. Does the order really not matter?
def hash_string(keyword,buckets):
    out = 0
    for s in keyword:
        out = (out + ord(s)) % buckets
    return out
Online Python Tutor "Finding Buckets":
1  def hashtable_get_bucket(table,keyword):
2      return table[hash_string(keyword,len(table))]
3
4  def hash_string(keyword,buckets):
5      out = 0
6      for s in keyword:
7          out = (out + ord(s)) % buckets
8      return out
9
10 def make_hashtable(nbuckets):
11     table = []
12     for unused in range(0,nbuckets):
13         table.append([])
14     return table
15 table = [[['Francis', 13], ['Ellis', 11]], [], [['Bill', 17],
16 ['Zoe', 14]], [['Coach', 4]], [['Louis', 29], ['Rochelle', 4], ['Nick', 2]]]
17 print hashtable_get_bucket(table, "Zoe")
def hashtable_get_bucket(table,keyword):
    return table[hash_string(keyword,len(table))]

def hash_string(keyword,buckets):
    out = 0
    for s in keyword:
        out = (out + ord(s)) % buckets
    return out

def make_hashtable(nbuckets):
    table = []
    for unused in range(0,nbuckets):
        table.append([])
    return table
Here is the comment from the notes:
Function hashtable_get_bucket returns the bucket containing the given keyword, from the hash
table, passed in as the first argument.
If you remember structure of a hash table, you will find out that it is composed of n buckets, one of
which needs to be returned by the hashtable_get_bucket function. Index of the bucket, which
would eventually contain the given keyword (in case the keyword will be present in the hash table),
is computed and returned by already defined function hash_string.
The function hash_string will in turn take the keyword and number of buckets as its
arguments. First argument (the keyword) is straightforward, since it was passed directly to
hashtable_get_bucket function by its caller. The second argument (number of buckets) can be
computed using len function on the hashmap (recall how hashmap is composed of n buckets).
Both functions do exactly the same thing.
But in the online part hash_string('Zoe', 5) is called, not hash_string('Zoe', 14).
Where does the 5 come from?
In line 2 there is:
hash_string(keyword, len(table))
with len(table) being 5 (the table defined on lines 15-16 has five buckets).
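A quick check with the hash_string function above confirms that the two bucket counts explain the two results:

print(hash_string("Zoe", 5))   # 2 -> the bucket index used by hashtable_get_bucket
print(hash_string("Zoe", 14))  # 8 -> the result you got when passing 14 directly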