String matching with substitution using Python - python

I have a string that I need to match against a sequence, and I want to determine how many times it occurs in that sequence.
It has the following conditions:
The sequence can contain only the valid characters A, C, G and T, so a sequence could be ACGTGTCTG.
The string could be ACGnkG,
where n can be replaced by A or G
and k can be replaced by C or T.
How can we find whether the string matches the sequence by substituting valid values for n and k?
Is there a regular expression for this?

re.findall(pattern, string) will return a list with all matches for pattern in string. len(...) will return the number of items in a list.

If you want to count occurrences of the pattern:
count_regex = sum(1 for _ in re.finditer(r'ACG[AG][CT]G', s))
If you want to count occurrences of the fixed string that first matches the pattern:
m = re.search(r'ACG[AG][CT]G', s)
count_fixed = s.count(m.group(0), m.start(0)) if m else 0
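The hard-coded pattern above can also be generated from the template string. The following is a minimal sketch, assuming the placeholder letters are exactly n and k as described in the question; the names `SUBSTITUTIONS`, `build_pattern` and `count_matches` are hypothetical helpers, not part of any library. It also shows one way to count overlapping matches with a lookahead, since `re.finditer` on the plain pattern only reports non-overlapping ones.

```python
import re

# Hypothetical mapping from placeholder letters to the bases they may stand for.
SUBSTITUTIONS = {"n": "AG", "k": "CT"}

def build_pattern(template, substitutions=SUBSTITUTIONS):
    """Turn a template such as 'ACGnkG' into a regex such as 'ACG[AG][CT]G'."""
    parts = []
    for ch in template:
        if ch in substitutions:
            # A placeholder becomes a character class of its allowed bases.
            parts.append("[" + substitutions[ch] + "]")
        else:
            # Any other character must match literally.
            parts.append(re.escape(ch))
    return "".join(parts)

def count_matches(template, seq, overlapping=False):
    """Count matches of the template in seq, optionally including overlaps."""
    pattern = build_pattern(template)
    if overlapping:
        # A zero-width lookahead lets matches start inside earlier matches.
        pattern = "(?=" + pattern + ")"
    return sum(1 for _ in re.finditer(pattern, seq))

print(build_pattern("ACGnkG"))                    # ACG[AG][CT]G
print(count_matches("ACGnkG", "ACGACGTTACGGTG"))  # 2
```

With the sequence from the question, `count_matches("ACGnkG", "ACGTGTCTG")` returns 0, because the character after ACG is T, which n cannot stand for.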

Related

Regex to find specific letter before a condition Python

I just want to find all characters (other than A) which are followed by triple A, i.e., have AAA to the right. I don’t want to include the triple A in the output and just want the character immediately preceding AAA
result = []
s = 'ACAABAACAAABACDBADDDFSDDDFFSSSASDAFAAACBAAAFASD'
pattern = "r'(\w[BF])(?!AAA)'"
for item in re.finditer(pattern, s):
    result.append(item.group())
print(result)
I used the pattern r'(\w[BF])(?!AAA)' but it didn't work.
I just need to find the letters in []:
'ACAABAA[C]AAABACDBADDDFSDDDFFSSSASDA[F]AAAC[B]AAAFASD'
In your example, you want to match a single character to the left of a triple A. Using \w[BF] matches at least 2 characters: 1 word character followed by either B or F.
The negative lookahead asserts that what is directly to the right is not a triple A, but you want the opposite.
You can match a single B-Z and assert that what is directly to the right is AAA:
[B-Z](?=AAA)
import re
result = []
s = 'ACAABAACAAABACDBADDDFSDDDFFSSSASDAFAAACBAAAFASD'
pattern = r'[B-Z](?=AAA)'
for item in re.finditer(pattern, s):
    result.append(item.group())
print(result)
Output
['C', 'F', 'B']
You could also use re.findall
import re
s = 'ACAABAACAAABACDBADDDFSDDDFFSSSASDAFAAACBAAAFASD'
pattern = r'[B-Z](?=AAA)'
result = re.findall(pattern, s)
print(result)
[^A](?=A{3})
Here I use positive lookahead.
Here is a solution to your problem, which captures the letter in group 1 and the triple A in group 2:
pattern = r"([B-Z]{1})(A{3})"
for item in re.finditer(pattern, s):
    result.append(item.group(1))
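The two lookahead patterns suggested above are not interchangeable: `[B-Z]` only accepts uppercase letters other than A, while `[^A]` accepts any character that is not A, including digits and lowercase letters. A small sketch on a made-up input string (the string `s` below is an assumption for illustration, not the one from the question):

```python
import re

# Made-up input that contains a lowercase letter and a digit before 'AAA'.
s = 'xAAA1AAABAAA'

# Only uppercase B-Z immediately before 'AAA'.
upper_only = re.findall(r'[B-Z](?=AAA)', s)

# Any character except 'A' immediately before 'AAA'.
non_a = re.findall(r'[^A](?=A{3})', s)

print(upper_only)  # ['B']
print(non_a)       # ['x', '1', 'B']
```

Which one fits depends on whether the input can contain anything other than uppercase letters.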

Finding exact number of characters in word

I'm looking for a way to find words with the exact number of the given character.
For example:
If we have this input: ['teststring1','strringrr','wow','strarirngr'] and we are looking for 4 r characters
It will return only ['strringrr','strarirngr'] because they are the words with 4 letters r in it.
I decided to use regex and read the documentation and I can't find a function that satisfies my needs.
I tried with [r{4}] but it apparently returns any word with letters r in it.
Please help
Something like this:
import collections
def map_characters(string):
    # Count how many times each character occurs in the string
    characters = collections.defaultdict(lambda: 0)
    for char in string:
        characters[char] += 1
    return characters
items = ['teststring1','strringrr','wow','strarirngr']
for item in items:
    characters_map = map_characters(item)
    # if any character occurs at least 4 times in the string
    # we print it
    if max(characters_map.values()) >= 4:
        print(item)
# in the result it outputs strringrr and strarirngr
# because these words have 4 r letters
You can use str.count() to count the occurrences of a character, combined with list comprehensions to create a new list:
myArray = ['teststring1','strringrr','wow','strarirngr']
letter = "r"
amount = 4
filtered = [item for item in myArray if item.count(letter) == amount]
print(filtered) # ['strringrr', 'strarirngr']
If you wanted to make this reusable (to look for different letters or different amounts), you could pack it into a function:
def filterList(stringList, pattern, occurrences):
    return [item for item in stringList if item.count(pattern) == occurrences]
myArray = ['teststring1','strringrr','wow','strarirngr']
letter = "r"
amount = 4
print(filterList(myArray, letter, amount)) # ['strringrr', 'strarirngr']
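Square brackets define a character class, which matches any single character in the set; for example, [abc] matches a, b or c. In your case, [r{4}] is a class containing the characters r, {, 4 and }, so any single r is a match. Without the brackets, r{4} matches four consecutive r characters, which is still not the same as four r characters anywhere in the word.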
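A quick way to see how the bracketed and unbracketed patterns behave is to run them on small inputs. This is only an illustrative sketch; the test strings are made up.

```python
import re

# Inside square brackets, '{', '4' and '}' are ordinary characters,
# so the class matches each of these characters individually.
char_class_hits = re.findall(r'[r{4}]', 'r{4}x')
print(char_class_hits)  # ['r', '{', '4', '}']

# Without brackets, r{4} needs four r characters in a row.
consecutive = re.search(r'r{4}', 'strringrr')
print(consecutive)  # None: the four r's are split into two pairs

run_of_four = re.search(r'r{4}', 'arrrrb')
print(run_of_four.group(0))  # rrrr
```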
Since you asked about using regex, you could use the following, which requires exactly four r characters with any number of other characters around them:
import re
l = ['teststring1', 'strringrr', 'wow', 'strarirngr']
[word for word in l if re.fullmatch(r'[^r]*(?:r[^r]*){4}', word)]
output: ['strringrr', 'strarirngr']

How can we remove word with repeated single character?

I am trying to reduce words that consist of a single repeated character to that one character, using regex in Python. For example:
good => good
gggggggg => g
What I have tried so far is the following:
re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
The problem with the above solution is that it changes good to god, and I only want to change words that consist of a single repeated character.
A better approach here is to use a set
def modify(s):
    # Create a set from the string
    c = set(s)
    # If you have only one character in the set, convert set to string
    if len(c) == 1:
        return ''.join(c)
    # Else return original string
    else:
        return s
print(modify('good'))
print(modify('gggggggg'))
If you want to use regex, mark the start and end of the string in the regex with ^ and $ (inspired by bobblebubble's comment):
import re
def modify(s):
    # Create the sub string with a regex which only matches if a single character is repeated
    # Marking the start and end of string as well
    out = re.sub(r'^([a-z])\1+$', r'\1', s)
    return out
print(modify('good'))
print(modify('gggggggg'))
The output will be
good
g
If you do not want to use a set in your method, this should do the trick:
def simplify(s):
    l = len(s)
    if l > 1 and s.count(s[0]) == l:
        return s[0]
    return s
print(simplify('good'))
print(simplify('abba'))
print(simplify('ggggg'))
print(simplify('g'))
print(simplify(''))
output:
good
abba
g
g
Explanation:
You compute the length of the string.
You count the number of characters that are equal to the first one and compare the count with the length of the string.
Depending on the result, you return the first character or the whole string.
You can use a trim function.
Take a look at this example (C#):
"ggggggg".Trim('g');
Note that Trim removes every leading and trailing 'g', so for a string made up only of that character the result is the empty string rather than a single g.
Update:
For characters in the middle of the string, use this function (thanks to this answer).
In C#:
public static string RemoveDuplicates(string input)
{
    return new string(input.ToCharArray().Distinct().ToArray());
}
In Python:
used = set()
unique = [x for x in mylist if x not in used and (used.add(x) or True)]
However, I think none of these answers handles a case like aaaaabbbbbcda: this string has an a at the end which does not appear in the result (abcd). For this kind of situation, use this function, which I wrote:
In:
def unique(s):
    used = set()
    ret = list()
    s = list(s)
    for x in s:
        if x not in used:
            ret.append(x)
            # Forget earlier characters so that only consecutive repeats are dropped
            used = set()
        used.add(x)
    return ret
print(unique('aaaaabbbbbcda'))
out:
['a', 'b', 'c', 'd', 'a']
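Collapsing each run of consecutive repeated characters, as the function above does, can also be sketched with `itertools.groupby` or with a backreference in `re.sub`. The function names below are hypothetical. Note that both variants collapse runs inside a word as well, so 'good' becomes 'god', which is the behaviour the original question wanted to avoid.

```python
import re
from itertools import groupby

def collapse_runs(s):
    """Keep one character from each run of consecutive equal characters."""
    # groupby yields one (key, group) pair per run of equal characters.
    return ''.join(key for key, _group in groupby(s))

def collapse_runs_regex(s):
    """Same idea with a regex: a character followed by repeats of itself."""
    return re.sub(r'(.)\1+', r'\1', s, flags=re.DOTALL)

print(collapse_runs('aaaaabbbbbcda'))        # abcda
print(collapse_runs_regex('aaaaabbbbbcda'))  # abcda
```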

Python find position of last digit in string

I have a string of characters with no specific pattern. I have to look for some specific words and then extract some information.
Currently I am stuck at finding the position of the last number in a string.
So, for example if:
mystring="The total income from company xy was 12320 for the last year and 11932 in the previous year"
I want to find out the position of the last number in this string.
So the result should be "2" in position "70".
You can do this with a regular expression. Here's a quick attempt:
>>> import re
>>> mo = re.match(r'.+([0-9])[^0-9]*$', mystring)
>>> print(mo.group(1), mo.start(1))
2 69
This is a 0-based position, of course.
You can pass a generator expression to next() that walks the enumerated string from the end and stops at the first digit it finds:
>>> next(i for i,j in list(enumerate(mystring,1))[::-1] if j.isdigit())
70
Or using regex :
>>> import re
>>>
>>> m=re.search(r'(\d)[^\d]*$',mystring)
>>> m.start()+1
70
Collect all the whitespace-separated tokens that consist only of digits into a list and pop the last one. Note that this yields the last number in the string (11932 here), not the position of its last digit.
array = [int(s) for s in mystring.split() if s.isdigit()]
last_number = array.pop()
This avoids a regex and is arguably more readable.
def find_last(s):
    temp = list(enumerate(s))
    temp.reverse()
    for pos, ch in temp:
        try:
            return pos, int(ch)
        except ValueError:
            continue
You could reverse the string and get the first match with a simple regex:
import re
s = mystring[::-1]
m = re.search(r'\d', s)
# 1-based position of the last digit in the original string
pos = len(s) - m.start(0)
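The answers above differ in whether they report a 0-based or a 1-based position, and most of them fail if the string contains no digit at all. A minimal sketch that makes both points explicit; `last_digit_position` is a hypothetical helper name, and it relies on `str.isdigit`, which also accepts some non-ASCII digit characters.

```python
def last_digit_position(s):
    """Return (index, digit) for the last digit in s, or None if there is none.

    The index is 0-based; add 1 for a 1-based position.
    """
    # Walk the indices from the end so the first hit is the last digit.
    for index in range(len(s) - 1, -1, -1):
        if s[index].isdigit():
            return index, s[index]
    return None

mystring = ("The total income from company xy was 12320 for the last year "
            "and 11932 in the previous year")
print(last_digit_position(mystring))        # (69, '2')
print(last_digit_position("no digits"))     # None
```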

Python: Finding Regex occurrence for variable char

I know, for example, that if I want to find lengths of all the occurrences of consecutive 'a's
in input = "1111aaaaa11111aaaaaaa111aaa", I can do
[len(s) for s in re.findall(r'a+', input)]
However, I'm not sure how to do this with a char variable. For instance,
CHAR = 'a'
[len(s) for s in re.findall(r'??????', input)] # Trying to find occurrences of CHARs..
Is there a way to do this??
Here is a general solution that should work for strings of any length:
CHAR = 'a'
[len(s) for s in re.findall(r'(?:{})+'.format(re.escape(CHAR)), input)]
Or an alternative using itertools (single character only):
import itertools
[sum(1 for _ in g) for k, g in itertools.groupby(input) if k == CHAR]
I think what you're asking for is:
[len(s) for s in re.findall(r'{}+'.format(CHAR), input)]
Except of course that this won't work if CHAR is a special value, like \. If that's an issue:
[len(s) for s in re.findall(r'{}+'.format(re.escape(CHAR)), input)]
If you want to match two or more instead of one or more, the syntax for that is {2,}. As the docs say:
{m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab…
That gets a little ugly when we're using {} for string formatting, so let's switch to %-formatting:
[len(s) for s in re.findall(r'%s{2,}' % (re.escape(CHAR),), input)]
… or just simple concatenation:
[len(s) for s in re.findall(re.escape(CHAR) + r'{2,}', input)]
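The variants above can be folded into one small helper. This is a sketch under the assumption that you want the number of repetitions of a token in each run; `run_lengths` and its `min_repeats` parameter are hypothetical names. It uses %-formatting so the `{m,}` quantifier does not clash with `str.format` braces, and wraps the escaped token in a non-capturing group so multi-character tokens work too.

```python
import re

def run_lengths(text, token, min_repeats=1):
    """Return the repetition count of each run of `token` in `text`."""
    if not token:
        raise ValueError("token must be a non-empty string")
    # re.escape makes special characters such as '.' or '\\' match literally.
    pattern = '(?:%s){%d,}' % (re.escape(token), min_repeats)
    # Divide by the token length so multi-character tokens are counted per repeat.
    return [len(m.group(0)) // len(token) for m in re.finditer(pattern, text)]

text = "1111aaaaa11111aaaaaaa111aaa"
print(run_lengths(text, 'a'))                 # [5, 7, 3]
print(run_lengths(text, 'a', min_repeats=4))  # [5, 7]
print(run_lengths('a..b...', '.'))            # [2, 3]
```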
