Randomly extracting substrings of equal length from the main string - python

I have a string as follows.
>>> str_main = 'ATGCAGCACTAGGCAGCACTATGAAGCACTATGCTGCACT'
>>> len(str_main)
40
I want to extract three substrings from str_main such that each substring contains 20 characters.
These substrings can start anywhere in the main string, so there will naturally be some overlap between them.
I found some solutions but they do not provide random substring extraction from the main string.
Desired output might be:
substr_1='ATGCAGCACTAGGCAGCACT'
substr_2='CACTATGAAGCACTATGCTG'
substr_3='CACTAGGCAGCACTATGAAG'
They are randomly extracted from the main string. I should be able to extract as many substrings as I want, since overlap is allowed.

We can write a function and use it three times like this:
import random
def get_random_str(main_str, substr_len):
    # Randomly select an "idx" such that "idx + substr_len <= len(main_str)".
    idx = random.randrange(0, len(main_str) - substr_len + 1)
    return main_str[idx : idx + substr_len]
main_str='ATGCAGCACTAGGCAGCACTATGAAGCACTATGCTGCACT'
print(get_random_str(main_str, 20))
print(get_random_str(main_str, 20))
print(get_random_str(main_str, 20))
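Since overlap is allowed, you can draw as many substrings as you like by calling the function repeatedly, for example:
substrings = [get_random_str(main_str, 20) for _ in range(10)]  # any count you want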

It's a matter of slicing your string:
str_main_1[:20] or str_main_1[2:22]
Try out something like this:
for i in range(0, len(str_main_1) - 20 + 1):
    print(str_main_1[i:i+20])

Since each substring must be 20 characters, the maximum value for the lower bound of the slice is the length of the string minus 20: the slice str_main[x:x+20] requires x + 20 <= len(str_main), and random.randint is inclusive on both ends.
lower_bound_max = len(str_main) - 20
Then you just need to generate random numbers between 0 and this value to get the lower bound of your random slice, and add 20 to get the upper bound.
import random
lower_bound_max = len(str_main) - 20
for _ in range(3):  # repeat 3 times
    x = random.randint(0, lower_bound_max)
    print(str_main[x:x+20])
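If you also want the start positions to be distinct (as in the desired output above), random.sample over the valid start indices is a minimal sketch of one way to do it:
import random
starts = random.sample(range(len(str_main) - 20 + 1), 3)  # 3 distinct start indices
print([str_main[x:x+20] for x in starts])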

Related

Python pandas counting the number of unique string sources for substrings

Let's say I have a list of 5 character strings like:
AAAAB
BBBBA
BBBBA
ABBBB
And I want to find and count every possible 4 character substring and keep track of the number of unique 5 character strings they come from. Meaning: while BBBB is found in three source strings, there are only two unique sources.
Example output:
  substring  repeats  unique sources
0      AAAA        1               1
1      AAAB        1               1
2      BBBB        3               2
3      BBBA        2               1
4      ABBB        1               1
I have managed to do that on a small scale with just Python, a dictionary that gets updated, and two lists for comparing already existing substrings and full length strings. However, when applying that to my full data set (~160 000 full length strings (12 character) producing 150 million substrings (4 character)) the constant dictionary updating and list comparison process is too slow (my script's been running for a week now).
Counting the number of substrings present across all full length strings is easily and cheaply accomplished in both Python and pandas.
So my question is: How do I efficiently count and update the count for unique full length sources for substrings in my DataFrame?
TLDR: Here is an attempt that takes an estimated ~2 hours on my computer for the scale of data you describe.
import numpy as np
import pandas as pd

def substring_search(fullstrings, sublen=4):
    '''
    fullstrings: array-like of strings
    sublen: length of substring to search
    '''
    # PART 1: FIND SUBSTRINGS
    # length of full strings; assumes all are the same length
    strsize = len(fullstrings[0])
    # get unique strings and their number of occurrences
    strs, counts = np.unique(fullstrings, return_counts=True)
    fullstrings = pd.DataFrame({'string': strs,
                                'count': counts})
    unique_n = len(fullstrings)
    # create an array to hold substrings
    substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
    substrings = pd.Series(substrings)
    # slice to find each substring
    c = 0
    while c + sublen <= strsize:
        sliced = fullstrings['string'].str.slice(c, c + sublen)
        s = c * unique_n
        e = s + unique_n
        substrings[s:e] = sliced
        c += 1
    # take the set of substrings, save in output df
    substrings = np.unique(substrings)
    output = pd.DataFrame({'substrings': substrings,
                           'repeats': 0,
                           'unique_sources': 0})
    # PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS
    for i, s in enumerate(output['substrings']):
        # check which full strings contain each substring
        idx = fullstrings['string'].str.contains(s)
        count = fullstrings['count'][idx].sum()
        output.loc[i, 'repeats'] = count
        output.loc[i, 'unique_sources'] = idx.sum()
    print('Finished!')
    return output
Applied to your example:
>>> example = ['AAAAB', 'BBBBA', 'BBBBA', 'ABBBB']
>>> substring_search(example)
substrings repeats unique_sources
0 AAAA 1 1
1 AAAB 1 1
2 ABBB 1 1
3 BBBA 2 1
4 BBBB 3 2
Explanation
The basic idea in the above code is to loop over all the unique substrings, and (for each of them) check against the list of full strings using pandas str methods. This saves one for loop (i.e. you don't loop over each full string for each substring). The other idea is to only check unique full strings (in addition to unique substrings); you save the number of occurrences of each full string beforehand and correct the count at the end.
The basic structure is:
1. Get the unique strings in the input, and record the number of times each occurs.
2. Find all unique substrings in the input (I do this using pandas.Series.str.slice).
3. Loop over each substring, and use pandas.Series.str.contains to (element-wise) check the full strings. Since these are unique and we know the number of times each occurred, we can fill both repeats and unique_sources.
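To make step 3 concrete, here is a toy sketch (reusing the example data from the question) of the core trick: count the unique full strings once, then weight every match by its multiplicity:
import numpy as np
import pandas as pd

full = ['AAAAB', 'BBBBA', 'BBBBA', 'ABBBB']
strs, counts = np.unique(full, return_counts=True)   # unique strings + multiplicities
df = pd.DataFrame({'string': strs, 'count': counts})

mask = df['string'].str.contains('BBBB')             # which unique strings contain it
print(df['count'][mask].sum(), mask.sum())           # repeats = 3, unique sources = 2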
Testing
Here is code I used to create larger input data:
import string
import numpy as np

n = 100
size = 12
letters = list(string.ascii_uppercase[:20])
bigger = [''.join(np.random.choice(letters, size)) for i in range(n)]
So bigger is n size-length strings:
['FQHMHSOIEKGO',
'FLLNCKAHFISM',
'LDKKRKJROIRL',
...
'KDTTLOKCDMCD',
'SKLNSAQQBQHJ',
'TAIAGSIEQSGI']
With modified code that prints the progress (posted below), I tried with n=150000 and size=12, and got this initial output:
Starting main loop...
5%, 344.59 seconds
10.0%, 685.28 seconds
So 10 * 685 seconds / 60 (seconds/minute) ≈ 114 minutes. Two hours is not ideal, but practically more useful than one week. I don't doubt that there is some much cleverer way to do this, but maybe this can be helpful if nothing else is posted.
If you do use this code, you may want to verify that the results are correct with some smaller examples. One thing I was unsure of is whether you want to know merely whether a substring appears in each full string (i.e. contains), or the number of times it appears in each full string (i.e. count). That would at least hopefully be a small change.
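If it is the per-string occurrence count you want, the body of PART 2 could be swapped for something like this sketch (pandas.Series.str.count counts non-overlapping matches):
# Sketch: occurrences within each full string instead of mere presence.
per_string = fullstrings['string'].str.count(s)
output.loc[i, 'repeats'] = (per_string * fullstrings['count']).sum()
output.loc[i, 'unique_sources'] = (per_string > 0).sum()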
Here is the additional code for printing progress while doing the search; there are just additional statements in #PART 2:
import time

def substring_search_progress(fullstrings, sublen=4):
    '''
    fullstrings: array-like of strings
    sublen: length of substring to search
    '''
    # PART 1: FIND SUBSTRINGS
    # length of full strings; assumes all are the same length
    strsize = len(fullstrings[0])
    # get unique strings and their number of occurrences
    strs, counts = np.unique(fullstrings, return_counts=True)
    fullstrings = pd.DataFrame({'string': strs,
                                'count': counts})
    unique_n = len(fullstrings)
    # create an array to hold substrings
    substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
    substrings = pd.Series(substrings)
    # slice to find each substring
    c = 0
    while c + sublen <= strsize:
        sliced = fullstrings['string'].str.slice(c, c + sublen)
        s = c * unique_n
        e = s + unique_n
        substrings[s:e] = sliced
        c += 1
    # take the set of substrings, save in output df
    substrings = np.unique(substrings)
    output = pd.DataFrame({'substrings': substrings,
                           'repeats': 0,
                           'unique_sources': 0})
    # PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS
    # for marking progress
    total = len(output)
    every = 5
    progress = every
    # main loop
    print('Starting main loop...')
    start = time.time()
    for i, s in enumerate(output['substrings']):
        # progress
        if (i / total * 100) > progress:
            now = round(time.time() - start, 2)
            print(f'{progress}%, {now} seconds')
            progress = (((i / total * 100) // every) + 1) * every
        # check which full strings contain each substring
        idx = fullstrings['string'].str.contains(s)
        count = fullstrings['count'][idx].sum()
        output.loc[i, 'repeats'] = count
        output.loc[i, 'unique_sources'] = idx.sum()
    print('Finished!')
    return output

How to find the frequent sub-sequence of base pairs of a given length

Find the most frequent sub-sequence of base pairs of a given length.
Provided that the string and the length are given.
Example:
>>> most_freq_seq("AAGTTAGTCA", 3)
"AGT"
Can someone explain what "sub-sequence of base pair" means?
In this code, if there are zero repetitions, the first seq_len characters of the sequence are printed. You can change that by editing the second line of the code to
most_seq = "no repetition"
or something similar.
def most_freq_seq(sequence, seq_len):
    most_seq = sequence[0:seq_len]
    number = 1
    # +1 so that the final window is checked as well
    for i in range(0, len(sequence) - seq_len + 1):
        val = sequence.count(sequence[i:i+seq_len])
        if val > number:
            number = val
            most_seq = sequence[i:i+seq_len]
    print(most_seq)

most_freq_seq("AAGTTAGTCA", 3)
You can use the Counter class from collections combined with zip to get the subsequences:
from collections import Counter
def most_freq_seq(seq, count):
    # zip the sequence against its own shifted copies to get all windows of length `count`
    counts = Counter("".join(s) for s in zip(*(seq[i:] for i in range(count))))
    return counts.most_common(1)[0]
output:
r,c = most_freq_seq("AAGTTAGTCA", 3)
print(r,c)
# 'AGT' 2
I know DNA sequences are very long but I believe this will provide results in reasonable time.
For a sequence of 10 million entries, I get the result in roughly 3 seconds using random data:
import random
import time
sequence = "".join(random.choice("ACGT") for _ in range(10_000_000))
size = 7
start = time.time()
seq,count = most_freq_seq(sequence,size)
print(seq,count,time.time()-start)
# CCCAATT 704 3.12
The input you're being given is a fragment of DNA. Each letter corresponds to one of four "base pairs": https://en.wikipedia.org/wiki/Base_pair
A "sub-sequence of base pairs" is therefore simply a substring -- that is, a smaller string that appears in the input string. In the example you're given, "AGT" is a substring that appears twice (which is more often than any other substring that's 3 characters long, making it the most frequent):
AAGTTAGTCA
 AGT AGT
Your task is to implement the most_freq_seq function that will produce the most frequent substring of a given length from within the given string. The fact that it's a string of DNA and that the characters are "base pairs" doesn't really affect the implementation.
Good luck!

python string slicing with alternating strides

Is there an elegant way to perform python slicing with more than one stride?
For example, given an input string, create a string which contains the characters at positions 1, 4, 6, 9, 11, 14, 16, 19, and so forth.
Input example:
s = "abcdefhijklmnopqrstuvwxyz"
Output:
out = "behkmpruwz"
Here is a regex solution which might meet your requirements:
s = "abcdefhijklmnopqrstuvwxyz"
output = re.sub(r'.(.)..(.)', '\\1\\2', s)
print(s)
print(output)
This prints:
abcdefhijklmnopqrstuvwxyz
behkmpruwz
The pattern matches five characters at a time, capturing the second and fifth characters in capture groups \1 and \2. Then, it just replaces those five characters with the two single captured characters.
This happens to work perfectly for your input string, because its length is exactly a multiple of 5. Note that my pattern won't do any replacements to any trailing 1 to 4 characters beyond the last multiple of 5; they would be left in the output unchanged.
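If the length were not a multiple of 5 and you wanted the trailing remainder dropped instead, one sketch is to add an alternative that consumes the tail and use a function as the replacement:
import re
# The second alternative eats a trailing run of 1-4 characters; its groups are
# None there, so the lambda substitutes an empty string for it.
output = re.sub(r'.(.)..(.)|.{1,4}$',
                lambda m: (m.group(1) or '') + (m.group(2) or ''),
                s)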
If you do not want to use external libraries, then one general solution is to calculate the correct index yourself and join the corresponding characters.
start1 = 0
start2 = 1
stride1 = 5
stride2 = 3
result = ''.join([s[i + j]
                  for i in range(start1, len(s), stride1)
                  for j in range(start2, stride1, stride2)])
If you do not mind using libraries such as numpy, then you can make the input into N-d arrays (in this case, a 2-D matrix) and apply advanced slicing on multiple axes.
import numpy as np
start1 = 0
start2 = 1
stride1 = 5
stride2 = 3
s_mat = np.array([*s]).reshape(stride1, -1) # Reshape the input into a 5 by 5 matrix
result_mat = s_mat[start1:, start2::stride2].flatten() # Apply slicing and flatten the result into a 1-D array
result = ''.join(result_mat) # Merge the output array into a string
I tried to simplify the loops like this (note that both slices step by 5, offset at 1 and 4). Not sure if it's the perfect fit.
stride_1_seq = s[1::5]
stride_2_seq = s[4::5]
extracted_str = "".join(map(lambda x, y: x + y, stride_1_seq, stride_2_seq))
This should work, if the intervals are configured well.

Weighted random numbers in Python from a list of values

I am trying to create a list of 10,000 random numbers between 1 and 1000. But I want 80-85% of the numbers to come from the same category (I mean some 100 numbers out of these should make up about 80% of the list of random numbers) and the rest to appear around 15-20% of the time. Any idea if this can be done in Python/NumPy/SciPy. Thanks.
This can be easily done using one call to random.randint() to select a list and another call to random.choice() on the correct list. I'll assume the list frequent contains 100 elements which are to be chosen 80 percent of the time and rare contains 900 elements to be chosen 20 percent of the time.
import random
a = random.randint(1, 5)
if a == 1:
    # Case for rare numbers (probability 1/5 = 20%)
    choice = random.choice(rare)
else:
    # Case for frequent numbers (probability 4/5 = 80%)
    choice = random.choice(frequent)
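On Python 3.6+, random.choices accepts per-element weights directly, so the whole list can be drawn in one call. A sketch, assuming the same 100/900 split between frequent and rare:
import random

frequent = list(range(1, 101))   # 100 numbers carrying ~80% of the probability mass (assumed)
rare = list(range(101, 1001))    # the other 900 numbers

population = frequent + rare
weights = [80 / len(frequent)] * len(frequent) + [20 / len(rare)] * len(rare)
numbers = random.choices(population, weights=weights, k=10_000)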
Here's an approach -
import numpy as np

a = np.arange(1, 1001)  # Input array to extract numbers from

# Select 100 random unique numbers from the input array and also store the leftovers
p1 = np.random.choice(a, size=100, replace=False)
p2 = np.setdiff1d(a, p1)

# Get random indices for indexing into p1 and p2
p1_idx = np.random.randint(0, p1.size, 8000)
p2_idx = np.random.randint(0, p2.size, 2000)

# Index, concatenate, and randomize their positions
out = np.random.permutation(np.hstack((p1[p1_idx], p2[p2_idx])))
Let's verify after a run -
In [78]: np.in1d(out, p1).sum()
Out[78]: 8000
In [79]: np.in1d(out, p2).sum()
Out[79]: 2000
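NumPy can also do the weighting in a single call via the p argument of np.random.choice; a sketch reusing a, p1 and p2 from above:
# Spread 80% of the probability mass over p1 and 20% over p2, then sample directly.
probs = np.where(np.in1d(a, p1), 0.8 / p1.size, 0.2 / p2.size)
out = np.random.choice(a, size=10000, p=probs)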

Using a 3-d list in python to store and modify the count of an element based on its occurrence

I have a file containing hexadecimal data, in the format 55 12 3d 4e aa etc., in each line. I have a list of unique lengths obtained from scanning the entire file. Also, I am getting the count of these unique lengths by scanning the entire file. For example:
Line 1: 55,76,2e,44,12,54,aa,76,23,3e
Line 2: 76,2e,44,54,76,23
Line 3: 55,23,65,78,90,e4,aa,67
and so on
My expected output, after including the process mentioned below, will look something like:
Length - 10 Count - 20000
EscLength - 2 Count - 10000
EscLength - 7 Count - 8000
Length - 4 Count - 5000
EscLength - 5 Count - 3000
EscLength - 7 Count - 2000
The original data contains "escape contents". I have an algorithm to detect them and get the length of the updated data. Now, I have to find the unique lengths of the escape contents for each unique original length, and then get the count of each escape content corresponding to the unique lengths of the original data.
I tried to build a 3-d list in python that will contain [original length, length of escape content, count of the length of escape content]. I could not update the count. It gets complicated when I find the unique values and update/insert the counts.
for i in xrange(len(escCount_list)):
    strspl = str(escCount_list[i]).split(',')
    for strs in strspl:
        if strs == len(packet_list) - len(afterEsc_list) and strspl[0] == len(packet_list):
            escCount_list[i].append(len(packet_list))
            escCount_list[i].append(len(packet_list) - len(afterEsc_list))
            storeCount = escCount_list[i]
            escCount_list[i].append(storeCount + 1)
        if strspl[0] == len(packet_list):
            escCount_list[i].append(len(packet_list))
            escCount_list[i].append(len(packet_list) - len(afterEsc_list))
            storeCount = escCount_list[i]
            escCount_list[i].append(storeCount + 1)
When I encounter a new length, I try to include the values like this:
escCount_list[rows].append(len(packet_list)-len(afterEsc_list))
escCount_list[rows].append(len(packet_list))
escCount_list[rows].append(1)
rows = rows + 1
cols = cols + 1
It looks like I have made my script very messy. I had to address loads of issues, and I am doing everything in one script because of the requirement I have. Is there a way to modify this, or an easier way to solve this problem? PS: Currently I am getting a list index out of range exception at the escCount_list[rows].append() line.
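As a sketch of a simpler structure than the 3-d list: a collections.Counter keyed by (original length, escape length) pairs sidesteps the manual update logic entirely. Here remove_escapes stands in for the existing escape-detection algorithm and 'data.txt' is a placeholder filename:
from collections import Counter

pair_counts = Counter()
with open('data.txt') as f:                  # placeholder filename
    for line in f:
        packet = line.strip().split(',')     # e.g. ['55', '76', '2e', ...]
        after_esc = remove_escapes(packet)   # hypothetical: your existing algorithm
        pair_counts[(len(packet), len(packet) - len(after_esc))] += 1

for (orig_len, esc_len), n in pair_counts.most_common():
    print('Length - %d EscLength - %d Count - %d' % (orig_len, esc_len, n))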
