Find and use multiple occurrences of a string in a string - python

I recently started using Python and wrote some simple scripts.
Now I have this question.
I have this string:
mystring = 'AAAABBAAABBAAAACCAAAACCAAAA'
and I have the following strings:
String_A = 'BB'
String_B = 'CC'
I would like to get all possible substrings that start with String_A and end with String_B (that is a bit vague, so the desired output is shown below).
output:
BBAAABBAAAACCAAAACC
BBAAABBAAAACC
BBAAAACCAAAACC
BBAAAACC
I am able to count the number of occurrences of String_A and String_B in mystring using
mystring.count()
And I am able to print out one specific output (the one with the first occurrence of String_A and the first occurrence of String_B), by doing the following:
if String_A in mystring:
    String_B_End = mystring.index(String_B) + len(String_B)
    output = mystring[mystring.index(String_A):String_B_End]
    print(output)
This works perfectly but only gives me the following output:
BBAAABBAAAACC
How can I get all the specified output strings from mystring?
Thanks in advance!

If I understand the intention of your question correctly you can use the following code:
>>> import re
>>> mystring = 'AAAABBAAABBAAAACCAAAACCAAAA'
>>> String_A = 'BB'
>>> String_B = 'CC'
>>> def find_occurrences(s, a, b):
...     a_is = [m.start() for m in re.finditer(re.escape(a), s)]  # All indexes of a in s
...     b_is = [m.start() for m in re.finditer(re.escape(b), s)]  # All indexes of b in s
...     result = [s[i:j+len(b)] for i in a_is for j in b_is if j>i]
...     return result
...
>>> find_occurrences(mystring, String_A, String_B)
['BBAAABBAAAACC', 'BBAAABBAAAACCAAAACC', 'BBAAAACC', 'BBAAAACCAAAACC']
This uses the "find all occurrences of a substring" code from this answer.
In its current form the code does not find overlapping occurrences: if mystring = 'BBB' and you look for the substring 'BB', it only returns the index 0. To account for overlapping occurrences, change the lines that collect the indexes to use a lookahead, e.g. a_is = [m.start() for m in re.finditer("(?={})".format(re.escape(a)), s)]
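As a minimal sketch of the difference between the two patterns, the helper below collects start indexes with or without overlaps. The name find_indexes and its overlapping flag are illustrative only and are not part of the answer above.

```python
import re

def find_indexes(s, sub, overlapping=False):
    """Return the start indexes of sub in s (hypothetical helper)."""
    pattern = re.escape(sub)
    if overlapping:
        # A zero-width lookahead matches at every position where sub starts,
        # without consuming characters, so overlapping hits are reported too.
        pattern = "(?={})".format(pattern)
    return [m.start() for m in re.finditer(pattern, s)]

print(find_indexes("BBB", "BB"))                    # [0]
print(find_indexes("BBB", "BB", overlapping=True))  # [0, 1]
```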

Well, first you need to get the indexes of String_A and String_B in the text. See this:
s = mystring
[i for i in range(len(s)-len(String_A)+1) if s[i:i+len(String_A)]==String_A]
It returns [4, 9], i.e. the indexes of 'BB' in mystring. Do the same for String_B, which gives [15, 21].
Then you do this:
[(i, j) for i in [4, 9] for j in [15, 21] if i < j]
This line combines each starting location with each ending location and ensures that the starting location occurs before the ending location. The i < j would not be essential for this particular example, but in general you should have it. The result is [(4, 15), (4, 21), (9, 15), (9, 21)].
Then you just convert the start and end indices to substrings:
[s[a:b+len(String_B)] for a, b in [(4, 15), (4, 21), (9, 15), (9, 21)]]
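The three steps can be combined into one function. This is only a sketch: the names all_indexes and substrings_between are made up for illustration, and the indexes are computed instead of being hard-coded.

```python
def all_indexes(s, sub):
    """Every index at which sub starts in s (overlaps included)."""
    return [i for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i)]

def substrings_between(s, start, end):
    """All substrings that begin at an occurrence of start and
    finish at a later occurrence of end."""
    starts = all_indexes(s, start)
    ends = all_indexes(s, end)
    # Pair every start with every end that comes after it.
    return [s[i:j + len(end)] for i in starts for j in ends if i < j]

mystring = 'AAAABBAAABBAAAACCAAAACCAAAA'
print(substrings_between(mystring, 'BB', 'CC'))
# ['BBAAABBAAAACC', 'BBAAABBAAAACCAAAACC', 'BBAAAACC', 'BBAAAACCAAAACC']
```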

Related

How can I get a substring from a string in python using a subset of string?

Let's say that I have this string:
a = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'
My goal is to create a function that gets me any substrings that start with 'af' and end with 'kh'.
In this example, I would get 2 substrings:
'afhkiojojojhohkh' and 'afjgibibfoobfbobobfbafnongokh'
I would also like to get the length of these substrings and their location within the larger string.
I have thought about using a for loop but I did not get very far. Any help is very much appreciated.
Thanks.
Using the built-in module re for regular expressions:
import re
text = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'
# tuples of the form (substr, (start, end), length)
matches = [(match.group(0), match.span(), int.__rsub__(*match.span()),) for match in re.finditer(r'(af.*?kh)', text)]
longest = max(matches, key=lambda pairs: pairs[-1])
print(matches)
print(longest)
EDIT
If the := operator is supported (Python 3.8+), the terms in the list comprehension can be simplified like this:
(match.group(0), pos:=match.span(), int.__rsub__(*pos))
You can use nested searches looking for the start and end:
A full function with dynamic start and end (you can change start and end values) would look like:
def find(inp, start, end):
    ls = len(start)
    le = len(end)
    start_and_len = []
    for i in range(len(inp)-ls+1):
        if inp[i:i+ls] == start:
            for j in range(i, len(inp)-le+1):
                if inp[j:j+le] == end:
                    # (str, start index, len)
                    start_and_len.append((inp[i:j+le], i, j+le-i,))
    return start_and_len
# Use as
>>> a = 'afafaf---khkhkh'
>>> find(a, 'af', 'kh')
[('afafaf---kh', 0, 11),
('afafaf---khkh', 0, 13),
('afafaf---khkhkh', 0, 15),
('afaf---kh', 2, 9),
('afaf---khkh', 2, 11),
('afaf---khkhkh', 2, 13),
('af---kh', 4, 7),
('af---khkh', 4, 9),
('af---khkhkh', 4, 11)]
# Your given example, with more matches
>>> a = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'
>>> find(a, 'af', 'kh')
[('afhkiojojojhohkh', 4, 16),
('afhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokh', 4, 61),
('afjgibibfoobfbobobfbafnongokh', 36, 29),
('afnongokh', 56, 9)]
I was able to modify @cards's answer and come up with the following function, which answered my question above.
import re
a = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'
def substring(string, start, end):
    matches = [(match.group(0), match.span(), int.__rsub__(*match.span()),) for match in re.finditer(rf'({start}.*?{end})', string)]
    return matches
substring(a, 'af', 'kh')
[('afhkiojojojhohkh', (4, 20), 16),
('afjgibibfoobfbobobfbafnongokh', (36, 65), 29)]
If anybody can come up with the same answer using a for loop, feel free to post it. Thanks to all for their help!
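One way to get the same result with an explicit loop is sketched below, using str.find instead of a regular expression. The function name substring_loop is hypothetical. It assumes the same semantics as the non-greedy pattern above (shortest match, no overlaps), with one difference: str.find also searches across newlines, which `.` does not do by default.

```python
def substring_loop(string, start, end):
    """Shortest non-overlapping substrings from start to end,
    returned as (substring, (begin, stop), length) tuples."""
    matches = []
    pos = 0
    while True:
        i = string.find(start, pos)           # next occurrence of start
        if i == -1:
            break
        j = string.find(end, i + len(start))  # first end after that start
        if j == -1:
            break
        stop = j + len(end)
        matches.append((string[i:stop], (i, stop), stop - i))
        pos = stop                            # continue after this match
    return matches

a = 'ashfafhkiojojojhohkhgiobbboddbbgoifbafjgibibfoobfbobobfbafnongokhofgoon'
print(substring_loop(a, 'af', 'kh'))
# [('afhkiojojojhohkh', (4, 20), 16), ('afjgibibfoobfbobobfbafnongokh', (36, 65), 29)]
```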

Trying to find a clever way to find indices of keywords in a given string

I know there are plenty of topics about finding indices of given keywords in strings, but my case is a bit different.
I have 2 inputs: one is a string and the other is a mapping list (or whatever you want to call it):
s = "I am awesome and I love you"
mapping_list = "1 1 2 3 1 2 3"
Each word will always map onto a digit in the mapping list. Now I want to find the start indices of all words that map to a given number, say 1.
In the above case, it will return [0, 2, 17] (Thanks @rahlf23)
My current approach would be zipping each word with a digit by doing
zip(mapping_list.split(' '), s.split(' '))
which gives me
('1', 'I')
('1', 'am')
('2', 'awesome')
('3', 'and')
('1', 'I')
('2', 'love')
('3', 'you')
and then iterate through the list, find "1", use the word to generate a regex, search for its indices and append them to a list. Rinse and repeat.
However, this seems really inefficient, especially if s gets really long.
I'm wondering if there's a better way to deal with it.
You could map the words to their len and use itertools.accumulate, although you have to add 1 to each length (for the spaces) and add an initial 0 for the start of the first word.
>>> import collections, itertools
>>> words = "I am awesome and I love you".split()
>>> mapping = list(map(int, "1 1 2 3 1 2 3".split()))
>>> start_indices = list(itertools.accumulate([0] + [len(w)+1 for w in words]))
>>> start_indices
[0, 2, 5, 13, 17, 19, 24, 28]
The last element is not used. Then, zip and iterate the pairs and collect them in a dictionary.
>>> d = collections.defaultdict(list)
>>> for x, y in zip(mapping, start_indices):
...     d[x].append(y)
...
>>> dict(d)
{1: [0, 2, 17], 2: [5, 19], 3: [13, 24]}
Alternatively, you could also use a regular expression like \b\w (word-boundary followed by word-character) to find each position a word starts, then proceed as above.
>>> import re
>>> s = "I am awesome and I love you"
>>> [m.start() for m in re.finditer(r"\b\w", s)]
[0, 2, 5, 13, 17, 19, 24]
import re
# Find the indices of all the word starts (assumes single spaces between words)
word_starts = [0] + [m.start()+1 for m in re.finditer(' ', s)]
# Break the mapping list into an actual list
mapping = mapping_list.split(' ')
# Find the indices in the mapping list we care about
word_indices = [i for i, e in enumerate(mapping) if e == '1']
# Map those indices onto the word start indices
word_starts_at_indices = [word_starts[i] for i in word_indices]
# Or you can do the last line the fancy way:
# word_starts_at_indices = operator.itemgetter(*word_indices)(word_starts)
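For reference, the same idea can be wrapped in a single function. This is a sketch under the assumption that words are separated by whitespace and that mapping_list has one label per word; the name word_start_indices is made up for illustration.

```python
import re

def word_start_indices(s, mapping_list, target):
    """Start index of every word in s whose label in mapping_list equals target."""
    # \S+ matches each run of non-whitespace characters, i.e. each word.
    starts = [m.start() for m in re.finditer(r"\S+", s)]
    labels = mapping_list.split()
    return [start for start, label in zip(starts, labels) if label == target]

s = "I am awesome and I love you"
mapping_list = "1 1 2 3 1 2 3"
print(word_start_indices(s, mapping_list, "1"))  # [0, 2, 17]
```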

How to elegantly transform '1-3,6-8' to '1 2 3 6 7 8' within a list?

Problem
Background:
I have a list of ~10,000 lists containing irregular data which needs to be transformed to a specific format. This data will be ingested into a pandas dataframe after transformation.
TL;DR: How can I elegantly transform the strings in a list that match the following regex?
Regex
'\d{1,3}-\d{1,3},\d{1,3}-\d{1,3}'
Example:
'1-3,6-8' to '1 2 3 6 7 8'
Current Solution:
Using list comprehensions required multiple type casts to transform the string and is unfit to be a lasting solution.
import re
pat = re.compile(r'\d{1,3}-\d{1,3},\d{1,3}-\d{1,3}')
row = ['sss-www,ddd-eee', '1-3,6-8', 'XXXX', '0-2,3-7','234','1,5']
lst = [((str(list(range(int(x.split(',')[0].split('-')[0]),
                        int(x.split(',')[0].split('-')[1]) + 1))).strip('[]').replace(',', '') + ' '
         + str(list(range(int(x.split(',')[1].split('-')[0]),
                          int(x.split(',')[1].split('-')[1]) + 1))).strip('[]').replace(',', '')))
       if pat.match(str(x)) else x for x in row]
Result
['sss-www,ddd-eee', '1 2 3 6 7 8', 'XXXX', '0 1 2 3 4 5 6 7', '234', '1,5']
It's easier if you capture the groups.
Then convert the list of groups to integers and process them two at a time in a list comprehension, chained with itertools.chain:
import re,itertools
pat = re.compile(r'(\d{1,3})-(\d{1,3}),(\d{1,3})-(\d{1,3})')
z='1-3,6-8'
groups = [int(x) for x in pat.match(z).groups()]
print(list(itertools.chain(*(list(range(groups[i],groups[i+1]+1)) for i in range(0,len(groups),2)))))
result:
[1, 2, 3, 6, 7, 8]
Not sure you'd call that "elegant", though. It remains complicated, mostly because range and itertools.chain return lazy objects that have to be converted to lists explicitly.
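To tie the capture-group idea back to the list in the question, here is a sketch that leaves non-matching items untouched and joins the expanded numbers into a string. The names RANGE_PAIR and expand are hypothetical, and fullmatch is used instead of match on the assumption that the whole item must have the 'a-b,c-d' shape.

```python
import re

RANGE_PAIR = re.compile(r'(\d{1,3})-(\d{1,3}),(\d{1,3})-(\d{1,3})')

def expand(item):
    """Turn 'a-b,c-d' into 'a ... b c ... d'; return anything else unchanged."""
    m = RANGE_PAIR.fullmatch(item)
    if m is None:
        return item
    a, b, c, d = map(int, m.groups())
    numbers = [*range(a, b + 1), *range(c, d + 1)]  # both ranges are inclusive
    return ' '.join(str(n) for n in numbers)

row = ['sss-www,ddd-eee', '1-3,6-8', 'XXXX', '0-2,3-7', '234', '1,5']
print([expand(x) for x in row])
# ['sss-www,ddd-eee', '1 2 3 6 7 8', 'XXXX', '0 1 2 3 4 5 6 7', '234', '1,5']
```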
Several ways to do this, here is mine:
import re
txt = '1-3,6-8'
# Safer to use a raw string
pat = re.compile(r'(\d{1,3})-(\d{1,3}),(\d{1,3})-(\d{1,3})')
m = pat.match(txt)
if m:
    start1, end1, start2, end2 = m.groups()
    result = [i for i in range(int(start1), int(end1)+1)]
    result += [i for i in range(int(start2), int(end2)+1)]
    print(result)
Gives:
[1, 2, 3, 6, 7, 8]
I'm assuming Python 3 here (as stated in the question).
Python 2 could use:
result = range(int(start1), int(end1)+1)
result += range(int(start2), int(end2)+1)
I assume that you want to handle longer sequences as well, like 1-10,15,23-25? You don't really need regular expressions for this, regular string processing functions will work well.
def parse_sequence(seq):
    result = []
    for part in seq.split(','):
        points = [int(s) for s in part.split('-')]
        if len(points) == 2:
            result.extend(range(points[0], points[1]+1))
        elif len(points) == 1:
            result.append(points[0])
        else:
            raise ValueError('invalid sequence')
    return result
Here is my solution:
import re
from itertools import chain
s = '1-3, 6 - 8, 12-14, 20 -22'
rslt = list(chain(*[range(int(tup[0]), int(tup[1]) + 1)
for tup in re.findall(r'(\d+)\s*?-\s*?(\d+)', s)]))
Output:
In [43]: rslt
Out[43]: [1, 2, 3, 6, 7, 8, 12, 13, 14, 20, 21, 22]
Step by step:
In [44]: re.findall(r'(\d+)\s*?-\s*?(\d+)', s)
Out[44]: [('1', '3'), ('6', '8'), ('12', '14'), ('20', '22')]
In [45]: [range(int(tup[0]),int(tup[1])+1) for tup in re.findall(r'(\d+)\s*?-\s*?(\d+)', s)]
Out[45]: [range(1, 4), range(6, 9), range(12, 15), range(20, 23)]
It depends on exactly what data you're expecting to see. In general, the best way to do this is to write a function that parses the string in chunks. Something like:
import re
def parse(string):
    chunks = string.split(',')
    for chunk in chunks:
        match = re.match(r'(\d+)-(\d+)', chunk)
        if match:
            start = int(match.group(1))
            end = int(match.group(2))
            # yield every number in the inclusive range
            yield from range(start, end + 1)
        else:
            yield int(chunk)
s_tmp = s.split(",")
[n for x in s_tmp for n in range(int(x.split("-")[0]), int(x.split("-")[1]) + 1)]
Apologies if there are syntax errors; I'm typing this from my phone. Basically: split by ",", then split by "-", then unpack the entries from range.

python string slicing with a list

Here is my list:
liPos = [(2,5),(8,9),(18,22)]
The first item of each tuple is the starting position and the second is the ending position.
Then I have a string like this:
s = "I hope that I will find an answer to my question!"
Now, considering my liPos list, I want to format the string by removing the chars between each starting and ending position (including the characters at those positions) given in the tuples. Here is the result that I want:
"I  tt I will an answer to my question!"
So basically, I want to remove the chars between 2 and 5 (including 2 and 5), then between 8,9 (including 8 and 9) and finally between 18,22 (including 18 and 22).
Any suggestion?
This assumes that liPos is already sorted. If it is not, use sorted(liPos, reverse=True) in the for loop.
liPos = [(2,5),(8,9),(18,22)]
s = "I hope that I will find an answer to my question!"
for begin, end in reversed(liPos):
    s = s[:begin] + s[end+1:]
print(s)
Here is an alternative method that constructs a new list of slice tuples to include, and then joins only those portions of the string.
from itertools import chain, zip_longest  # izip_longest on Python 2
# second slice index needs to be increased by one, do that when creating liPos
liPos = [(a, b+1) for a, b in liPos]
result = "".join(s[b:e] for b, e in zip_longest(*[iter(chain([0], *liPos))]*2))
To make this slightly easier to understand, here are the slices generated by zip_longest:
>>> list(zip_longest(*[iter(chain([0], *liPos))]*2))
[(0, 2), (6, 8), (10, 18), (23, None)]
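The same "keep the slices between the spans" idea can also be written as a plain loop, which may be easier to follow. This is only a sketch: remove_spans is a hypothetical name, and it assumes each tuple is an inclusive (start, end) pair as in the question.

```python
def remove_spans(s, spans):
    """Return s without the characters in each inclusive (start, end) span."""
    pieces = []
    cursor = 0                       # first index not yet handled
    for begin, end in sorted(spans):
        pieces.append(s[cursor:begin])  # text kept before this span
        cursor = max(cursor, end + 1)   # skip past the removed span
    pieces.append(s[cursor:])        # text kept after the last span
    return "".join(pieces)

liPos = [(2, 5), (8, 9), (18, 22)]
s = "I hope that I will find an answer to my question!"
print(remove_spans(s, liPos))
```

Note that the spaces at positions 1 and 6 both survive, so the result contains two consecutive spaces after the leading "I".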
liPos = [(2,5),(8,9),(18,22)]
s = "I hope that I will find an answer to my question!"
exclusions = set().union(* (set(range(t[0], t[1]+1)) for t in liPos) )
pruned = ''.join(c for i,c in enumerate(s) if i not in exclusions)
print(pruned)
Here is one compact possibility:
"".join(s[i] for i in range(len(s)) if not any(start <= i <= end for start, end in liPos))
This ... is a quick stab at the problem. There may be a better way, but it's a start at least.
>>> liPos = [(2,5),(8,9),(18,22)]
>>>
>>> toRemove = [i for x, y in liPos for i in range(x, y + 1)]
>>>
>>> toRemove
[2, 3, 4, 5, 8, 9, 18, 19, 20, 21, 22]
>>>
>>> s = "I hope that I will find an answer to my question!"
>>>
>>> s2 = ''.join([c for i, c in enumerate(s) if i not in toRemove])
>>>
>>> s2
'I  tt I will an answer to my question!'

unexpected list appearing in python loop

I am new to python and have the following piece of test code featuring a nested loop and I'm getting some unexpected lists generated:
import pybel
import math
import openbabel
search = ["CCC","CCCC"]
matches = []
#n = 0
#b = 0
print(search)
for n in search:
    print("n=", n)
    smarts = pybel.Smarts(n)
    allmol = [mol for mol in pybel.readfile("sdf", "zincsdf2mols.sdf.txt")]
    for b in allmol:
        matches = smarts.findall(b)
        print(matches, "\n")
Essentially, the list "search" is a couple of strings I am looking to match in some molecules and I want to iterate over both strings in every molecule contained in allmol using the pybel software. However, the result I get is:
['CCC', 'CCCC']
n= CCC
[(1, 2, 28), (1, 2, 4), (2, 4, 5), (4, 2, 28)]
[]
n= CCCC
[(1, 2, 4, 5), (5, 4, 2, 28)]
[]
This is as expected, except for a couple of extra empty lists slotted in, which are messing me up, and I cannot see where they are coming from. They appear after the "\n", so they are not an artefact of smarts.findall(). What am I doing wrong?
Thanks for any help.
allmol has 2 items and so you're looping twice with matches being an empty list the second time.
Notice how the newline is printed after each list; changing that "\n" to "<-- matches" may clear things up for you:
print(matches, "<-- matches")
# or, more commonly:
print("matches:", matches)
Perhaps it is supposed to end like this:
for b in allmol:
    matches.append(smarts.findall(b))
print(matches, "\n")
Otherwise I'm not sure why you'd initialise matches to an empty list.
If that is the case, you can instead write
matches = [smarts.findall(b) for b in allmol]
print(matches)
Another possibility is that the file ends with an empty line:
for b in allmol:
    if not b.strip(): continue
    matches.append(smarts.findall(b))
print(matches, "\n")
