python string slicing with alternating strides

Is there an elegant way to perform python slicing with more than one stride?
For example, given an input string, create a string which contains the characters at positions 1, 4, 6, 9, 11, 14, 16, 19, and so forth.
Input example:
s = "abcdefhijklmnopqrstuvwxyz"
Output:
out = "behkmpruwz"

Here is a regex solution which might meet your requirements:
import re

s = "abcdefhijklmnopqrstuvwxyz"
output = re.sub(r'.(.)..(.)', r'\1\2', s)
print(s)
print(output)
This prints:
abcdefhijklmnopqrstuvwxyz
behkmpruwz
The pattern matches five characters at a time, capturing the second and fifth in groups \1 and \2, and replaces each five-character block with just those two captured characters.
This happens to work cleanly for your input because its length is an exact multiple of 5. Note that any trailing 1 to 4 characters beyond the last full block of 5 are left unchanged by the pattern.
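Since the kept positions are exactly those congruent to 1 or 4 modulo 5, a plain comprehension with a modulo test is another stdlib-only way to express this (a sketch of the same idea, not tied to the regex):

```python
s = "abcdefhijklmnopqrstuvwxyz"

# keep characters whose index is 1 or 4 modulo 5
out = ''.join(c for i, c in enumerate(s) if i % 5 in (1, 4))
print(out)  # behkmpruwz
```

Unlike the regex, this also handles strings whose length is not a multiple of 5 without leaving a tail behind.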

If you do not want to use external libraries, one general solution is to compute the correct indices yourself and join the corresponding characters.
start1 = 0
start2 = 1
stride1 = 5
stride2 = 3
result = ''.join([s[i + j] for i in range(start1, len(s), stride1)
                           for j in range(start2, stride1, stride2)])
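With start1=0, start2=1, stride1=5, stride2=3, the two ranges visit offsets 1 and 4 inside each block of five, i.e. absolute indices 1, 4, 6, 9, 11, ... A quick check of the index pattern, using the same parameters:

```python
s = "abcdefhijklmnopqrstuvwxyz"
start1, start2, stride1, stride2 = 0, 1, 5, 3

indices = [i + j for i in range(start1, len(s), stride1)
                 for j in range(start2, stride1, stride2)]
print(indices[:4])  # [1, 4, 6, 9]
result = ''.join(s[i] for i in indices)
```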
If you do not mind using libraries such as numpy, then you can make the input into N-d arrays (in this case, a 2-D matrix) and apply advanced slicing on multiple axes.
import numpy as np
start1 = 0
start2 = 1
stride1 = 5
stride2 = 3
s_mat = np.array([*s]).reshape(stride1, -1) # Reshape the input into a 5 by 5 matrix
result_mat = s_mat[start1:, start2::stride2].flatten() # Apply slicing and flatten the result into a 1-D array
result = ''.join(result_mat) # Merge the output array into a string

I tried to simplify the loops like this. Not sure if it's the perfect fit.
stride_1_seq = s[1::5]
stride_2_seq = s[4::5]
extracted_str = "".join(map(lambda x, y: x + y, stride_1_seq, stride_2_seq))
This should work if the strides are configured well: both slices step by 5 (starting at offsets 1 and 4), and map interleaves them pairwise.

Related

string to complex matrix representation in python

I have the following example string:
s ="1 1+i\n1-i 0"
Now I have to turn this string into a complex matrix. I am aware of the np.matrix() function, but it is not designed for a complex matrix. Maybe some of you can provide ideas on how to move forward. I also tried splitting at \n, but then I have two arrays, each containing exactly one element ("1 1+i" and "1-i 0"). The result should be:
np.array([[1, complex(1,1)], [complex(1, -1), 0]])
Thanks in advance!
Your question has two parts.
First, we want to convert the string into a list of complex numbers (as strings), and then convert this list into the actual complex numbers.
import re
import numpy as np

s = "1 1+i\n1-i 0"
# \s+ already matches spaces and newlines; Python spells the imaginary unit 'j'
complex_strings = re.split(r"\s+", s.replace('i', 'j'))
complex_numbers = [complex(x) for x in complex_strings]
m = np.array(complex_numbers).reshape(2, 2)
print(m)
This prints:
[[1.+0.j 1.+1.j]
 [1.-1.j 0.+0.j]]
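For inputs that are not guaranteed to be 2 x 2, splitting on lines first avoids hard-coding the reshape (a sketch along the same lines):

```python
import numpy as np

s = "1 1+i\n1-i 0"

# one row per line, one complex number per whitespace-separated token
rows = [[complex(tok.replace('i', 'j')) for tok in line.split()]
        for line in s.splitlines()]
m = np.array(rows)
print(m.shape)  # (2, 2)
```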

numpy - conditional change with closest elements

In a numpy array, I want to replace every occurrence of 4 where the element above it and the element to its left are both 5.
so for instance :
0000300
0005000
0054000
0000045
0002050
Should become :
0000300
0005000
0058000
0000045
0002000
I'm sorry I can't share what I tried, that's a very specific question.
I've had a look at things like
map[map == 4] = 8
And np.where() but I have really no idea about how to check nearby elements of a specific value.
This might seem tricky, but an AND between three shifted boolean masks of the matrix will work: shift the x == 5 mask one row down, shift another copy of it one column to the right, and take x == 4 as the third mask.
first_array = np.zeros(x.shape,dtype=bool)
second_array = np.zeros(x.shape,dtype=bool)
equals_5 = x == 5
equals_4 = x == 4
first_array[1:] = equals_5[:-1] # shift down
second_array[:,1:] = equals_5[:,:-1] # shift right
third_array = equals_4 # put it as it is.
# and operation on the 3 arrays above
results = np.logical_and(np.logical_and(first_array,second_array),third_array)
x[results] = 8
Now results is the needed logical array.
This is an O(n) algorithm, but it scales badly if the requested pattern is very complex; not that it's not doable.
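Applied to the example grid from the question, the three-mask AND picks out exactly the 4 whose upper and left neighbours are both 5 (a small self-contained check of the technique above):

```python
import numpy as np

x = np.array([[0, 0, 0, 0, 3, 0, 0],
              [0, 0, 0, 5, 0, 0, 0],
              [0, 0, 5, 4, 0, 0, 0],
              [0, 0, 0, 0, 0, 4, 5],
              [0, 0, 0, 2, 0, 5, 0]])

equals_5 = x == 5
equals_4 = x == 4
down = np.zeros(x.shape, dtype=bool)
right = np.zeros(x.shape, dtype=bool)
down[1:] = equals_5[:-1]         # 5-mask shifted one row down
right[:, 1:] = equals_5[:, :-1]  # 5-mask shifted one column right
x[down & right & equals_4] = 8   # only the 4 at row 2, col 3 qualifies
```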

Optimally selecting n datapoints from k

Problem statement:
I have 32k strings that consist of 13 characters. Each character can take 3 values (a, b or c). I need to select n strings from the 32k that satisfy the following:
select a minimal set of strings such that the selected strings do not differ from any other string within the 32k by more than 2 characters
This means that the count of strings that needs to be selected is variable. Also, the strings are not randomly generated, so the average difference is less than 2/3*13 - meaning that the eventual count of strings to be selected is not astronomical.
What I tried so far:
Clustering with k++ initialization and then k-means using Hamming distance, but this did not yield the desired outcome, although the problem resembles a clustering problem in the sense that we are practically looking for cluster centers whose members lie within a radius of 2.
What I am thinking of is simply selecting the string that has the most other strings within a distance of 1, then 2; afterwards remove all of these from the 32k and repeat the calculation until no strings are left. But this is likely a suboptimal solution, e.g. I believe I would select more strings this way than the required minimum (selecting additional strings is a cost).
Question:
What other algorithms should I consider or think of? Thanks!
Here are examples of each method from my previous post. I always have trouble working code into my posts, so I did this separately. The first method computes the percentage that the strings are identical; the second method returns the number of differences.
string1 = ('abcbacaacbaab')
string2 = ('abcbacaacbbbb')
from difflib import SequenceMatcher
a = string1
b = string2
x = SequenceMatcher(a=a, b=b).ratio()
print(x)
#output: 0.8462

#OR (I used pip3 install jellyfish first)
import jellyfish
x = jellyfish.damerau_levenshtein_distance(a, b)
print(x)
#output: 2
You might be able to use one of the types of 'fuzzy string matching' explained at:
https://miguendes.me/python-compare-strings#how-to-compare-two-strings-for-similarity-fuzzy-string-matching
There's "difflib", which computes a similarity ratio. (You're in luck: your strings are all the same length.)
There's also something called "jellyfish" that returns a character count of the differences. It sounds like an interesting assignment, good luck!
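Since all the strings here are the same length (13), a plain Hamming distance (the count of differing positions) is also easy to compute without any library; a minimal sketch of that metric:

```python
def hamming(a, b):
    # number of positions at which two equal-length strings differ
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming('abcbacaacbaab', 'abcbacaacbbbb'))  # 2
```

For equal-length strings this agrees with the "number of differences" intuition and avoids the edit-distance machinery entirely.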
If I understood correctly, you want the minimum subset such that all elements in the subset differ by no more than two characters from the elements outside the subset (please let me know if I misunderstood the problem).
If that is the problem, there is a simple ad hoc algorithm that solves it in O(m * max(n, k)), where n is the total number of elements in the set (32000 in this case), m is the number of characters of an element of the set (13 in this case) and k is the size of the alphabet (3 in this case).
You can precalculate the quantity of each unique character of the alphabet in each column in O(m * max(n, k)). It's O(m * k) for initialization of the precalculation matrix and O(m * n) to actually calculate it.
Each column can vote for the removal of a string if the count of that string's character in that column equals the number of strings in the initial set. Notice that a column can vote in O(1) using the precalculation. For each string, iterate through its columns and let each column vote. If you get three votes, you are sure the string needs to be kicked out of the set, so there is no need to continue iterating through the columns; just go to the next string. Otherwise the string needs to remain; append it to the answer.
A python code is attached:
def solve(s: list[str], n: int = 32000, m: int = 13, k: int = 3) -> list[str]:
    pre_calc = [[0 for j in range(k)] for i in range(m)]
    ans = []
    for i in range(n):
        for j in range(m):
            pre_calc[j][ord(s[i][j]) - ord('a')] += 1
    for i in range(n):
        votes_cnt = 0
        remove = False
        for j in range(m):
            if pre_calc[j][ord(s[i][j]) - ord('a')] == n:
                votes_cnt += 1
                if votes_cnt == 3:
                    remove = True
                    break
        if remove is False:
            ans.append(s[i])
    if len(ans) == 0:
        ans.append(s[0])
    return ans

Python pandas counting the number of unique string sources for substrings

Let's say I have a list of 5 character strings like:
AAAAB
BBBBA
BBBBA
ABBBB
And I want to find and count every possible 4-character substring and keep track of the number of unique 5-character strings they come from. Meaning that while BBBB is found in three source strings, only two of them are unique.
Example output:
substring repeats unique sources
0 AAAA 1 1
1 AAAB 1 1
2 BBBB 3 2
3 BBBA 2 1
4 ABBB 1 1
I have managed to do that on a small scale with just Python, a dictionary that gets updated, and two lists for comparing already-existing substrings and full-length strings. However, when applying that to my full data set (~160,000 full-length strings (12 characters) producing 150 million substrings (4 characters)), the constant dictionary updating and list comparison is too slow (my script has been running for a week now).
Counting the number of substrings present across all full length strings is easily and cheaply accomplished in both Python and pandas.
So my question is: How do I efficiently count and update the count for unique full length sources for substrings in my DataFrame?
TLDR: Here is an attempt that takes an estimated ~2 hours on my computer for the scale of data you describe.
import numpy as np
import pandas as pd
def substring_search(fullstrings, sublen=4):
    '''
    fullstrings: array like of strings
    sublen: length of substring to search
    '''
    # PART 1: FIND SUBSTRINGS
    # length of full strings, assumes all are same
    strsize = len(fullstrings[0])
    # get unique strings, # occurrences
    strs, counts = np.unique(fullstrings, return_counts=True)
    fullstrings = pd.DataFrame({'string': strs,
                                'count': counts})
    unique_n = len(fullstrings)
    # create array to hold substrings
    substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
    substrings = pd.Series(substrings)
    # slice to find each substring
    c = 0
    while c + sublen <= strsize:
        sliced = fullstrings['string'].str.slice(c, c + sublen)
        s = c * unique_n
        e = s + unique_n
        substrings[s:e] = sliced
        c += 1
    # take the set of substrings, save in output df
    substrings = np.unique(substrings)
    output = pd.DataFrame({'substrings': substrings,
                           'repeats': 0,
                           'unique_sources': 0})
    # PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS
    for i, s in enumerate(output['substrings']):
        # check which fullstrings contain each substring
        idx = fullstrings['string'].str.contains(s)
        count = fullstrings['count'][idx].sum()
        output.loc[i, 'repeats'] = count
        output.loc[i, 'unique_sources'] = idx.sum()
    print('Finished!')
    return output
Applied to your example:
>>> example = ['AAAAB', 'BBBBA', 'BBBBA', 'ABBBB']
>>> substring_search(example)
substrings repeats unique_sources
0 AAAA 1 1
1 AAAB 1 1
2 ABBB 1 1
3 BBBA 2 1
4 BBBB 3 2
Explanation
The basic idea in the above code is to loop over all the unique substrings, and (for each of them) check against the list of full strings using pandas str methods. This saves one for loop (i.e. you don't loop over each full string for each substring). The other idea is to only check unique full strings (in addition to unique substrings); you save the number of occurrences of each full string beforehand and correct the count at the end.
The basic structure is:
Get the unique strings in the input, and record the number of times each occurs.
Find all unique substrings in the input (I do this using pandas.Series.str.slice)
Loop over each substring, and use pandas.Series.str.contains to (element-wise) check the full strings. Since these are unique and we know the number of times each occurred, we can fill both repeats and unique_sources.
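The same two counts can also be sketched with stdlib containers only, in a single pass over the input; at the scale described, the per-substring source sets are the memory cost to watch (an illustrative sketch, not the pandas approach used in this answer):

```python
from collections import Counter, defaultdict

strings = ['AAAAB', 'BBBBA', 'BBBBA', 'ABBBB']
repeats = Counter()
sources = defaultdict(set)

for full in strings:
    for i in range(len(full) - 3):
        sub = full[i:i + 4]
        repeats[sub] += 1       # every occurrence, duplicates included
        sources[sub].add(full)  # unique full strings only

print(repeats['BBBB'], len(sources['BBBB']))  # 3 2
```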
Testing
Here is code I used to create larger input data:
import string
import numpy as np

n = 100
size = 12
letters = list(string.ascii_uppercase[:20])
bigger = [''.join(np.random.choice(letters, size)) for i in range(n)]
So bigger is a list of n strings of length size:
['FQHMHSOIEKGO',
'FLLNCKAHFISM',
'LDKKRKJROIRL',
...
'KDTTLOKCDMCD',
'SKLNSAQQBQHJ',
'TAIAGSIEQSGI']
With modified code that prints the progress (posted below), I tried with n=150000 and size=12, and got this initial output:
Starting main loop...
5%, 344.59 seconds
10.0%, 685.28 seconds
So, extrapolating, 10 * 685 seconds / 60 (seconds/minute) = ~114 minutes. 2 hours is not ideal, but practically more useful than 1 week. I don't doubt that there is some much cleverer way to do this, but maybe this can be helpful if nothing else is posted.
If you do use this code, you may want to verify that the results are correct with some smaller examples. One thing I was unsure of is whether you want to count whether a substring just appears in each full string (i.e. contains), or if you wanted the number of times it appears in a full string (i.e. count). That would at least hopefully be a small change.
Here is the additional code for printing progress while doing the search; there are just additional statements in #PART 2:
import time

def substring_search_progress(fullstrings, sublen=4):
    '''
    fullstrings: array like of strings
    sublen: length of substring to search
    '''
    # PART 1: FIND SUBSTRINGS
    # length of full strings, assumes all are same
    strsize = len(fullstrings[0])
    # get unique strings, # occurrences
    strs, counts = np.unique(fullstrings, return_counts=True)
    fullstrings = pd.DataFrame({'string': strs,
                                'count': counts})
    unique_n = len(fullstrings)
    # create array to hold substrings
    substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
    substrings = pd.Series(substrings)
    # slice to find each substring
    c = 0
    while c + sublen <= strsize:
        sliced = fullstrings['string'].str.slice(c, c + sublen)
        s = c * unique_n
        e = s + unique_n
        substrings[s:e] = sliced
        c += 1
    # take the set of substrings, save in output df
    substrings = np.unique(substrings)
    output = pd.DataFrame({'substrings': substrings,
                           'repeats': 0,
                           'unique_sources': 0})
    # PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS
    # for marking progress
    total = len(output)
    every = 5
    progress = every
    # main loop
    print('Starting main loop...')
    start = time.time()
    for i, s in enumerate(output['substrings']):
        # progress
        if (i / total * 100) > progress:
            now = round(time.time() - start, 2)
            print(f'{progress}%, {now} seconds')
            progress = (((i / total * 100) // every) + 1) * every
        # check which fullstrings contain each substring
        idx = fullstrings['string'].str.contains(s)
        count = fullstrings['count'][idx].sum()
        output.loc[i, 'repeats'] = count
        output.loc[i, 'unique_sources'] = idx.sum()
    print('Finished!')
    return output

Randomly extracting substrings of equal length from the main string

I have a string as follows.
str_main='ATGCAGCACTAGGCAGCACTATGAAGCACTATGCTGCACT'
len(str_main)
40
I want to extract three substrings from str_main such that each substring contains 20 characters.
These substrings can start anywhere in the main string, so there will obviously be overlap between the substrings.
I found some solutions but they do not provide random substring extraction from the main string.
Desired output might be:
substr_1='ATGCAGCACTAGGCAGCACT'
substr_2='CACTATGAAGCACTATGCTG'
substr_3='CACTAGGCAGCACTATGAAG'
They are randomly extracted from the main string. I should be able to extract as many strings as I want, since overlap is allowed.
We can write a function and use it three times like this:
import random
def get_random_str(main_str, substr_len):
    # Randomly select an "idx" such that "idx + substr_len <= len(main_str)".
    idx = random.randrange(0, len(main_str) - substr_len + 1)
    return main_str[idx : (idx + substr_len)]
main_str='ATGCAGCACTAGGCAGCACTATGAAGCACTATGCTGCACT'
print(get_random_str(main_str, 20))
print(get_random_str(main_str, 20))
print(get_random_str(main_str, 20))
It's a matter of slicing your string:
str_main_1[:20] or str_main_1[2:22]
Try out something like this:
for i in range(0, len(str_main_1) - 19):
    print(str_main_1[i:i+20])
Since each substring must be 20 characters, the maximum value for the lower bound of the substring is the length of the string minus 20 (random.randint is inclusive on both ends, and a slice starting at len - 20 ends exactly at the last character).
lower_bound_max = len(str_main) - 20
Then you just need to generate random numbers between 0 and this value to get the lower bound of your random slice, and add 20 to get the upper bound.
import random
lower_bound_max = len(str_main) - 20
for _ in range(3):  # repeat 3 times
    x = random.randint(0, lower_bound_max)
    print(str_main[x:x+20])
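If you additionally want the substrings to begin at distinct positions, random.sample over the valid start indices is a small variation on the answers above (a sketch; the function name is my own):

```python
import random

def distinct_random_substrings(s, k, n):
    # n distinct start positions, each leaving room for a k-character slice
    starts = random.sample(range(len(s) - k + 1), n)
    return [s[i:i + k] for i in starts]

str_main = 'ATGCAGCACTAGGCAGCACTATGAAGCACTATGCTGCACT'
subs = distinct_random_substrings(str_main, 20, 3)
```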
