Python multiple substring index in string - python

Given the following list of sub-strings:
sub = ['ABC', 'VC', 'KI']
Is there a way to get the index of each of these substrings in the following string, if they exist?
s = 'ABDDDABCTYYYYVCIIII'
So far I have tried:
for i in re.finditer('VC', s):
    print(i.start, i.end)
However, re.finditer does not take multiple arguments.
Thanks

You can join those patterns together using |:
import re
sub = ['ABC', 'VC', 'KI']
s = 'ABDDDABCTYYYYVCIIII'
# escape each substring so it is matched literally, then join with |
r = '|'.join(re.escape(x) for x in sub)
for i in re.finditer(r, s):
    print(i.start(), i.end())
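If you also need to know which substring matched at each position, the match object carries that. A minimal sketch along the same lines: find_all is a hypothetical helper name, and sorting the alternatives longest-first is my own precaution so that a shorter substring cannot shadow a longer one that starts with it.

```python
import re

def find_all(s, subs):
    """Hypothetical helper: return (substring, start, end) for every match."""
    # Longest alternatives first, so 'AB' would not win over 'ABC'.
    ordered = sorted(subs, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(x) for x in ordered))
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(s)]

hits = find_all('ABDDDABCTYYYYVCIIII', ['ABC', 'VC', 'KI'])
print(hits)  # [('ABC', 5, 8), ('VC', 13, 15)]
```

Like any finditer-based approach, this reports non-overlapping matches only.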

You could map over the find string method.
s = 'ABDDDABCTYYYYVCIIII'
sub = ['ABC', 'VC', 'KI']
print(*map(s.find, sub))
# Output 5 13 -1

How about using list comprehension with str.find?
s = 'ABDDDABCTYYYYVCIIII'
sub = ['ABC', 'VC', 'KI']
results = [s.find(pattern) for pattern in sub]
print(*results) # 5 13 -1
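Note that str.find only reports the first occurrence and uses -1 for "not found". A small sketch, assuming you want to keep only the substrings that actually occur (the names first_index and found are just illustrative):

```python
s = 'ABDDDABCTYYYYVCIIII'
sub = ['ABC', 'VC', 'KI']

# str.find returns -1 when the substring is absent
first_index = {pattern: s.find(pattern) for pattern in sub}

# keep only the substrings that were found
found = {pattern: idx for pattern, idx in first_index.items() if idx != -1}
print(found)  # {'ABC': 5, 'VC': 13}
```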

Another approach with re: if a substring can occur at multiple indices, this may be better, because the list of indices is saved for each key. When a substring is not found, it won't be in the dict.
import re
s = 'ABDDDABCTYYYYVCIIII'
sub = ['ABC', 'VC', 'KI']
# precompile regex pattern, escaping each substring so it is matched literally
subpat = '|'.join(re.escape(x) for x in sub)
pat = re.compile(rf'({subpat})')
matches = dict()
for m in pat.finditer(s):
    # append starting index of found substring to value of matched substring
    matches.setdefault(m.group(0),[]).append(m.start())
print(f"{matches=}")
print(f"{'KI' in matches=}")
print(f"{matches['ABC']=}")
Outputs:
matches={'ABC': [5], 'VC': [13]}
'KI' in matches=False
matches['ABC']=[5]

A substring may occur more than once in the main string (although it doesn't in the sample data). One could use a generator based around a string's built-in find() function like this:
Note that the source string has been modified to demonstrate repetition:
sub = ['ABC', 'VC', 'KI']
s = 'ABCDDABCTYYYYVCIIII'
def find(s, sub):
    for _sub in sub:
        offset = 0
        while (idx := s[offset:].find(_sub)) >= 0:
            yield _sub, idx + offset
            offset += idx + 1
for ss, start in find(s, sub):
    print(ss, start)
Output:
ABC 0
ABC 5
VC 13
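The same idea can be written without slicing, since str.find accepts an optional start position. This is only a sketch of that variant (find_all_positions is a made-up name); it avoids copying the tail of the string on every iteration and works on Python versions without the := operator.

```python
def find_all_positions(s, subs):
    """Yield (substring, index) for every occurrence, overlapping ones included."""
    for needle in subs:
        start = s.find(needle)
        while start != -1:
            yield needle, start
            # resume the search one character after the last hit
            start = s.find(needle, start + 1)

positions = list(find_all_positions('ABCDDABCTYYYYVCIIII', ['ABC', 'VC', 'KI']))
print(positions)  # [('ABC', 0), ('ABC', 5), ('VC', 13)]
```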

Just use the string index method:
list_ = ['ABC', 'VC', 'KI']
s = 'ABDDDABCTYYYYVCIIII'
for i in list_:
    if i in s:
        print(s.index(i))

Related

String manipulation with python and storing alphabets and digits in separate lists

For a given string s='ab12dc3e6' I want to put the letter runs and the digit runs into two different lists. That is, the output I am trying to achieve is temp1=['ab','dc','e'] and temp2=['12','3','6'].
I am not able to do so with the following code. Can someone provide an efficient way to do it?
S = "ab12dc3e6"
temp=list(S)
x=''
temp1=[]
temp2=[]
for i in range(len(temp)):
    while i<len(temp) and (temp[i] and temp[i+1]).isdigit():
        x+=temp[i]
        i+=1
    temp1.append(x)
    if not temp[i].isdigit():
        break
You can also solve this without any imports:
S = "ab12dc3e6"
def get_adjacent_by_func(content, func):
    """Returns a list of elements from content that fulfill func(...)"""
    result = [[]]
    for c in content:
        if func(c):
            # add to last inner list
            result[-1].append(c)
        elif result[-1]: # last inner list is filled
            # add new inner list
            result.append([])
    # return only non empty inner lists
    return [''.join(r) for r in result if r]
print(get_adjacent_by_func(S, str.isalpha))
print(get_adjacent_by_func(S, str.isdigit))
Output:
['ab', 'dc', 'e']
['12', '3', '6']
You can use a regex that groups letters and digits, then append each group to its list:
import re
S = "ab12dc3e6"
pattern = re.compile(r"([a-zA-Z]*)(\d*)")
temp1 = []
temp2 = []
for match in pattern.finditer(S):
    # extract words
    # don't append empty match
    if match.group(1):
        temp1.append(match.group(1))
        print(match.group(1))
    # extract numbers
    # don't append empty match
    if match.group(2):
        temp2.append(match.group(2))
        print(match.group(2))
print(temp1)
print(temp2)
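If you do not need to process the two kinds of runs in a single pass, re.findall with one pattern per list is shorter. A sketch, assuming ASCII letters only as in the pattern above:

```python
import re

S = "ab12dc3e6"

# each call returns every non-overlapping run matching its pattern
temp1 = re.findall(r"[a-zA-Z]+", S)
temp2 = re.findall(r"\d+", S)

print(temp1)  # ['ab', 'dc', 'e']
print(temp2)  # ['12', '3', '6']
```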
Your code does nothing for isalpha - you also run into IndexError on
while i<len(temp) and (temp[i] and temp[i+1]).isdigit():
for i == len(temp)-1.
You can use itertools.takewhile and the correct string methods of str.isdigit and str.isalpha to filter your string down:
from itertools import takewhile, cycle
S = "ab12dc3e6"
# switch between the two test methods
c = cycle([str.isalpha, str.isdigit])
r = {}
# assumes S contains only letters and digits; any other character
# would never be consumed and the loop would not terminate
while S:
    what = next(c) # get next method to use
    k = ''.join(takewhile(what, S))
    S = S[len(k):]
    r.setdefault(what.__name__, []).append(k)
print(r)
Output:
{'isalpha': ['ab', 'dc', 'e'],
'isdigit': ['12', '3', '6']}
This essentially creates a dictionary where each separate list is stored under the function's name.
To get the lists, use r["isalpha"] or r["isdigit"].
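A related option is itertools.groupby, which splits the string into consecutive runs by a key function. The sketch below is my own variant, not part of the answer above; it assumes every non-digit character should be counted as a letter.

```python
from itertools import groupby

S = "ab12dc3e6"
letters, digits = [], []

# groupby yields (key, run) for each stretch of characters sharing the same key
for is_digit, run in groupby(S, key=str.isdigit):
    target = digits if is_digit else letters
    target.append(''.join(run))

print(letters)  # ['ab', 'dc', 'e']
print(digits)   # ['12', '3', '6']
```

Unlike the takewhile loop, this does not care whether the string starts with a letter or a digit.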

How to do slicing in strings in python?

I am trying to slice the string "abcdeeefghij" so that, whatever the input, the output is a list in which no element contains a repeated letter.
In this case: [abcde,e,efghij].
Another example: if the input is "aaabcdefghiii", the expected output is [a,a,abcdefghi,i,i].
I also want to find the length of the longest element in the list. I tried the logic below:
max_str = max(len(sub_strings[0]),len(sub_strings[1]),len(sub_strings[2]))
print(max_str) #output - 6
which yields 6 as the output, but I presume this logic is not generic. Can someone suggest a generic way to print the length of the longest string?
Here is how:
s = "abcdeeefghij"
l = ['']
for c in s: # For each character in s
    if c in l[-1]: # If the character is already in the last string in l
        l.append('') # Add a new string to l
    l[-1] += c # Add the character to the last string in l (new or existing)
print(l)
Output:
['abcde', 'e', 'efghij']
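For the second part of the question (a generic way to get the longest length), max works on any number of elements. A sketch that wraps the loop above in a function; split_no_repeats is a hypothetical name, and default=0 is my choice for the empty-input case:

```python
def split_no_repeats(s):
    """Split s so that no piece contains the same character twice."""
    parts = []
    for c in s:
        # start a new piece at the very beginning or when c would repeat
        if not parts or c in parts[-1]:
            parts.append('')
        parts[-1] += c
    return parts

parts = split_no_repeats("abcdeeefghij")
longest_len = max((len(p) for p in parts), default=0)
longest = max(parts, key=len, default='')

print(parts)        # ['abcde', 'e', 'efghij']
print(longest_len)  # 6
print(longest)      # efghij
```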
Use a regular expression:
import re
rx = re.compile(r'(\w)\1+')
strings = ['abcdeeefghij', 'aaabcdefghiii']
lst = [[part for part in rx.split(item) if part] for item in strings]
print(lst)
Which yields
[['abcd', 'e', 'fghij'], ['a', 'bcdefgh', 'i']]
You would loop over the characters in the input and start a new string if there is an existing match, otherwise join them onto the last string in the output list.
input_ = "aaabcdefghiii"
output = []
for char in input_:
    if not output or char in output[-1]:
        output.append("")
    output[-1] += char
print(output)
To avoid repeating a letter within a list element, you can greedily track which letters are already in the current element. Append the current element to your answer as soon as you detect a repeated letter.
from collections import defaultdict
s = input()
ans = []
d = defaultdict(int)
cur = ""
for i in s:
    if d[i]:
        ans.append(cur)
        cur = i # start again since there is a repetition
        d = defaultdict(int)
        d[i] = 1
    else:
        cur += i # append to cur since no repetition yet
        d[i] = 1
if cur: # handling the last part
    ans.append(cur)
print(ans)
An input of aaabcdefghiii produces ['a', 'a', 'abcdefghi', 'i', 'i'] as expected.

Splitting a string based on a certain set of words

I have a list of strings like such,
['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
Given a keyword list like ['for', 'or', 'and'], I want to parse the list into another list where, if a keyword occurs in a string, that string is split into multiple parts.
For example, the above set would be split into
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
Currently I've split each inner string by underscore and have a for loop looking for an index of a key word, then recombining the strings by underscore. Is there a quicker way to do this?
>>> [re.split(r"_(?:f?or|and)_", s) for s in l]
[['happy_feet'],
['happy_hats', 'cats'],
['sad_fox', 'mad_banana'],
['sad_pandas', 'happy_cats', 'people']]
To combine them into a single list, you can use
result = []
for s in l:
    result.extend(re.split(r"_(?:f?or|and)_", s))
>>> pat = re.compile("_(?:%s)_"%"|".join(sorted(split_list,key=len)))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
will give you the desired output for the example dataset provided (here split_list is the keyword list and data is the list of strings; chain.from_iterable is needed to flatten the per-string results into one list).
Actually, with the _ delimiters you don't really need to sort by length, so you could just do
>>> pat = re.compile("_(?:%s)_"%"|".join(split_list))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
You could use a regular expression:
from itertools import chain
import re
pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
result = list(chain.from_iterable(pattern.split(w) for w in input_list))
The pattern is dynamically created from your list of keywords. The string 'happy_hats_for_cats' is split on '_for_':
>>> re.split(r'_for_', 'happy_hats_for_cats')
['happy_hats', 'cats']
but because we actually produced a set of alternatives (using the | metacharacter) you get to split on any of the keywords:
>>> re.split(r'_(?:for|or|and)_', 'sad_pandas_and_happy_cats_for_people')
['sad_pandas', 'happy_cats', 'people']
Each split result gives you a list of strings (just one if there was nothing to split on); using itertools.chain.from_iterable() lets us treat all those lists as one long iterable.
Demo:
>>> from itertools import chain
>>> import re
>>> keywords = ['for', 'or', 'and']
>>> input_list = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
>>> pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
>>> list(chain.from_iterable(pattern.split(w) for w in input_list))
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
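The surrounding underscores in the pattern are what keep a keyword from matching inside a longer word. A small sketch to illustrate; split_on_keywords is a hypothetical wrapper around the code above, and 'old_forest_or_lake' is a made-up input:

```python
import re
from itertools import chain

def split_on_keywords(strings, keywords):
    """Split each string on _keyword_ and flatten the results into one list."""
    alternatives = '|'.join(re.escape(w) for w in keywords)
    pattern = re.compile(r'_(?:{})_'.format(alternatives))
    return list(chain.from_iterable(pattern.split(s) for s in strings))

parts = split_on_keywords(['old_forest_or_lake', 'happy_hats_for_cats'],
                          ['for', 'or', 'and'])
# 'forest' is left alone because 'for' is not followed by an underscore there
print(parts)  # ['old_forest', 'lake', 'happy_hats', 'cats']
```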
Another way of doing this, using only built-in methods, is to replace every occurrence of the keywords in ['for', 'or', 'and'] in each string with a replacement string, for example _1_ (it could be any string that does not otherwise occur), and then, at the end of each iteration, split on this replacement string:
l = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
replacement_s = '_1_'
lookup = ['for', 'or', 'and']
lookup = [x.join('_'*2) for x in lookup] #Changing to: ['_for_', '_or_', '_and_']
results = []
for i,item in enumerate(l):
    for s in lookup:
        if s in item:
            l[i] = l[i].replace(s,'_1_')
    results.extend(l[i].split('_1_'))
OUTPUT:
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']

Python - Split sorted string of characters when character is different than previous character

I have a string of characters which I know to be sorted. Example:
myString = "aaaabbbbbbcccddddd"
I want to split this item into a list at the point when the character I am on is different than its preceding character, as shown below:
splitList = ["aaaa","bbbbbb","ccc","ddddd"]
I am working in Python 3.4.
Thanks!
In [293]: import itertools
In [294]: myString = "aaaabbbbbbcccddddd"
In [295]: [''.join(g) for i,g in itertools.groupby(myString)]
Out[295]: ['aaaa', 'bbbbbb', 'ccc', 'ddddd']
myString = "aaaabbbbbbcccddddd"
result = []
for i,s in enumerate(myString):
    l = len(result)
    if l == 0 or s != myString[i-1]:
        result.append(s)
    else:
        result[l-1] = result[l-1] + s
print(result)
Output:
['aaaa', 'bbbbbb', 'ccc', 'ddddd']
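A regular expression with a backreference gives the same result. This is just a sketch of an alternative; note that '.' does not match a newline by default, so it assumes the string contains none.

```python
import re

myString = "aaaabbbbbbcccddddd"

# (.) captures one character, \1* extends the match while that character repeats
splitList = [m.group(0) for m in re.finditer(r'(.)\1*', myString)]
print(splitList)  # ['aaaa', 'bbbbbb', 'ccc', 'ddddd']
```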

How to insert items from one list into another at positions that do not contain alphanumeric text

These are my lists:
i=["a","b"]
j=["abc","(3)","ab & ac", "(1,4)","xyz"]
and I want my output to be like this:
j=["abc","a","ab & ac","b","xyz"]
and I tried this:
val=0
for item in j:
    if item.isalpha():
        pass
    else:
        elem=i[val]
        j.replace(item, elem)
        val=val+1
How can I insert items from one list into the other at the positions that do not contain alphanumeric text?
Assuming that "ab & ac" is not alphanumeric (because of the & and whitespaces) and that you made a typo, this will do the trick.
def removeNonAlpha(i,j):
    indexI = 0
    indexJ = 0
    while indexJ < len(j):
        if not j[indexJ].isalnum():
            j[indexJ] = i[indexI]
            indexI += 1
        indexJ += 1
    return j
>>> i=["a","b", "c"]
>>> j=["abc","(3)","ab & ac", "(1,4)","xyz"]
>>> removeNonAlpha(i,j)
['abc', 'a', 'b', 'c', 'xyz']
This code also assumes that you have enough elements in i to make complete replacements for j.
If for some special reasons you need to allow & signs (which would imply that you would also need to allow the whitespaces) here is the alternative:
def removeNonAlpha(i,j):
    indexI = 0
    indexJ = 0
    while indexJ < len(j):
        if not j[indexJ].replace('&', '').replace(' ', '').isalnum():
            j[indexJ] = i[indexI]
            indexI += 1
        indexJ += 1
    return j
>>> i=["a","b"]
>>> j=["abc","(3)","ab & ac", "(1,4)","xyz"]
>>> removeNonAlpha(i,j)
['abc', 'a', 'ab & ac', 'b', 'xyz']
This keeps every element of list j that has a letter in it (string.letters is Python 2 only; on Python 3 use string.ascii_letters):
import string
[s for s in j if any(c in string.ascii_letters for c in s)]
If you have a character or string that doesn't occur in any of the strings, you can concatenate the list into a single string using the string .join method, then use a regular expression and the re.sub function to do the replacement. After that, you can use the .split method to divide the string back into a list:
>>> import re
>>> i=["a","b"]; j=["abc","(3)","ab & ac", "(1,4)","xyz"]
>>> js = "|".join(j) # merge j into one string
>>> print(js)
abc|(3)|ab & ac|(1,4)|xyz
>>> print(re.sub(r"\(.*?\)", i[0], js))
abc|a|ab & ac|a|xyz
>>> print(re.sub(r"\(.*?\)", i[0], js, count=1))
abc|a|ab & ac|(1,4)|xyz
>>> for r in i:
...     js = re.sub(r"\(.*?\)", r, js, count=1)
...
>>> print(js)
abc|a|ab & ac|b|xyz
That for loop at the end shows you how to do it. The parenthesized fields will be filled in, one at a time, from left to right. To put it back into a list:
jnew = js.split("|")
...and you're done.
It's not clear what your definition of "alphanumeric" is. Your example uses isalpha, which "ab & ac" fails, as DSM pointed out. If isalpha is an acceptable definition of "alphanumeric", and it's OK to modify the lists, then I propose the following:
for index, item in enumerate(j):
    if not item.isalpha():
        j[index] = i.pop(0) if i else None
There's probably a list comprehension version of this, but it would be messy.
Note that the above code produces the following result, given your sample inputs:
['abc', 'a', 'b', None, 'xyz']
That's because there aren't actually enough items in i to cover all the non-alphanumeric members of j, so I used None in that case.
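If you would rather not modify either list, an iterator with next(..., None) does the same job inside a list comprehension. This is only a sketch: fill_placeholders and the is_placeholder test are my own assumptions, here treating an element wrapped in parentheses as a slot to fill, which matches the sample data.

```python
def fill_placeholders(replacements, items, is_placeholder):
    """Return a new list with each placeholder swapped for the next replacement."""
    it = iter(replacements)
    # next(it, None) is only evaluated for placeholders; None means we ran out
    return [next(it, None) if is_placeholder(x) else x for x in items]

def in_parentheses(x):
    # assumption: a placeholder looks like "(3)" or "(1,4)"
    return x.startswith('(') and x.endswith(')')

i = ["a", "b"]
j = ["abc", "(3)", "ab & ac", "(1,4)", "xyz"]
result = fill_placeholders(i, j, in_parentheses)
print(result)  # ['abc', 'a', 'ab & ac', 'b', 'xyz']
```

Passing a different predicate, such as lambda x: not x.isalpha(), reproduces the isalpha-based behaviour above without consuming i.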
