Python: Finding Regex occurrence for variable char - python

I know, for example, that if I want to find lengths of all the occurrences of consecutive 'a's
in input = "1111aaaaa11111aaaaaaa111aaa", I can do
[len(s) for s in re.findall(r'a+', input)]
However, I'm not sure how to do this with a char variable. For instance,
CHAR = 'a'
[len(s) for s in re.findall(r'??????', input)] # Trying to find occurrences of CHARs..
Is there a way to do this??

Here is a general solution that should work for strings of any length:
CHAR = 'a'
[len(s) for s in re.findall(r'(?:{})+'.format(re.escape(CHAR)), input)]
Or an alternative using itertools (single character only):
import itertools
[sum(1 for _ in g) for k, g in itertools.groupby(input) if k == CHAR]

I think what you're asking for is:
[len(s) for s in re.findall(r'{}+'.format(CHAR), input)]
Except of course that this won't work if CHAR is a regex metacharacter, like \. If that's an issue:
[len(s) for s in re.findall(r'{}+'.format(re.escape(CHAR)), input)]
If you want to match two or more instead of one or more, the syntax for that is {2,}. As the docs say:
{m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab…
That gets a little ugly when we're using {} for string formatting, so let's switch to %-formatting:
[len(s) for s in re.findall(r'%s{2,}' % (re.escape(CHAR),), input)]
… or just simple concatenation:
[len(s) for s in re.findall(re.escape(CHAR) + r'{2,}', input)]
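The variants above can be folded into a single helper. This is only a sketch: the name run_lengths and the min_len parameter are made up for illustration and are not part of the re module.

```python
import re

def run_lengths(text, char, min_len=1):
    """Return the length of every run of `char` in `text` with at least `min_len` repetitions."""
    # re.escape protects characters such as '\\' or '.' that are special in a regex;
    # the non-capturing group lets `char` be longer than one character.
    pattern = r'(?:%s){%d,}' % (re.escape(char), min_len)
    # Each match is a whole run, so divide by len(char) to count repetitions.
    return [len(m) // len(char) for m in re.findall(pattern, text)]

text = "1111aaaaa11111aaaaaaa111aaa"
print(run_lengths(text, 'a'))             # [5, 7, 3]
print(run_lengths(text, 'a', min_len=4))  # [5, 7]
print(run_lengths("1..2...3", '.'))       # [2, 3]
```

Using %-formatting here avoids the clash between {} placeholders and the {m,n} quantifier mentioned above.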

Related

How can we remove word with repeated single character?

I am trying to collapse words that consist of a single repeated character using regex in Python, for example:
good => good
gggggggg => g
What I have tried so far is following
re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
The problem with the above solution is that it changes good to god, and I only want to collapse words made up of a single repeated character.
A better approach here is to use a set
def modify(s):
    # Create a set from the string
    c = set(s)
    # If the set has only one character, convert the set to a string
    if len(c) == 1:
        return ''.join(c)
    # Else return the original string
    else:
        return s
print(modify('good'))
print(modify('gggggggg'))
If you want to use regex, mark the start and end of the string in the regex with ^ and $ (inspired by @bobblebubble's comment):
import re
def modify(s):
    # Substitute with a regex that only matches if a single character is repeated,
    # marking the start and end of the string as well
    out = re.sub(r'^([a-z])\1+$', r'\1', s)
    return out
print(modify('good'))
print(modify('gggggggg'))
The output will be
good
g
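If the input is a whole sentence rather than a single word, the ^ and $ anchors can be swapped for word boundaries. A minimal sketch, assuming lowercase ASCII words as in the question; collapse_repeated_words is a hypothetical name.

```python
import re

def collapse_repeated_words(text):
    # \b ... \b restricts each match to a complete word made of one repeated letter,
    # so 'good' is left alone while 'gggggggg' becomes 'g'
    return re.sub(r'\b([a-z])\1+\b', r'\1', text)

print(collapse_repeated_words('good gggggggg zzz moon'))  # good g z moon
```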
If you do not want to use a set in your method, this should do the trick:
def simplify(s):
    l = len(s)
    if l > 1 and s.count(s[0]) == l:
        return s[0]
    return s
print(simplify('good'))
print(simplify('abba'))
print(simplify('ggggg'))
print(simplify('g'))
print(simplify(''))
output:
good
abba
g
g
Explanations:
Compute the length of the string.
Count the characters that are equal to the first one and compare that count with the string length.
Depending on the result, return the first character or the whole string.
You can use the Trim method:
take a look at this example:
"ggggggg".Trim('g');
Update:
and for characters which are in the middle of the string use this function, thanks to this answer
in C#:
public static string RemoveDuplicates(string input)
{
    return new string(input.ToCharArray().Distinct().ToArray());
}
in python:
used = set()
unique = [x for x in mylist if x not in used and (used.add(x) or True)]
But I think none of these answers handles a string like aaaaabbbbbcda, which has an a at the end that does not appear in the result (abcd). For this kind of situation, use this function, which I wrote:
In:
def unique(s):
    used = set()
    ret = list()
    s = list(s)
    for x in s:
        if x not in used:
            ret.append(x)
            # Start a new run: forget the characters seen before it
            used = set()
        used.add(x)
    return ret
print(unique('aaaaabbbbbcda'))
out:
['a', 'b', 'c', 'd', 'a']
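The same "one entry per run" result can also be sketched with itertools.groupby from the standard library, which groups equal adjacent items. The helper name collapse_runs is an assumption made for this example.

```python
from itertools import groupby

def collapse_runs(s):
    # groupby yields one (key, group) pair per run of equal adjacent characters
    return [key for key, _group in groupby(s)]

print(collapse_runs('aaaaabbbbbcda'))    # ['a', 'b', 'c', 'd', 'a']
print(''.join(collapse_runs('good')))    # god
```

Note that, like the function above, this collapses runs inside any string, so it also turns good into god.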

Python find position of last digit in string

I have a string of characters with no specific pattern. I have to look for some specific words and then extract some information.
Currently I am stuck at finding the position of the last number in a string.
So, for example if:
mystring="The total income from company xy was 12320 for the last year and 11932 in the previous year"
I want to find out the position of the last number in this string.
So the result should be "2" in position "70".
You can do this with a regular expression, here's a quick attempt:
>>> mo = re.match('.+([0-9])[^0-9]*$', mystring)
>>> print(mo.group(1), mo.start(1))
2 69
This is a 0-based position, of course.
You can pass next() a generator expression that walks the enumerated string from the end:
>>> next(i for i,j in list(enumerate(mystring,1))[::-1] if j.isdigit())
70
Or using regex :
>>> import re
>>>
>>> m=re.search(r'(\d)[^\d]*$',mystring)
>>> m.start()+1
70
Save all the numbers from the string in a list and pop the last one off it (this gives the last number, 11932 here, not its position):
array = [int(s) for s in mystring.split() if s.isdigit()]
lastdigit = array.pop()
It is faster than a regex approach and looks more readable than it.
def find_last(s):
    temp = list(enumerate(s))
    temp.reverse()
    for pos, ch in temp:
        try:
            # int() raises ValueError for anything that is not a digit
            return (pos, int(ch))
        except ValueError:
            continue
You could reverse the string and get the first match with a simple regex:
s = mystring[::-1]
m = re.search(r'\d', s)
pos = len(s) - m.start(0)  # 1-based position in the original string
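For comparison, here is a sketch without regex or reversed copies that scans indices from the end. The function name last_digit_position is hypothetical, and the example assumes plain ASCII digits.

```python
def last_digit_position(s):
    """Return (index, digit) for the last digit in s, or None if s has no digit."""
    for index in range(len(s) - 1, -1, -1):
        if s[index].isdigit():
            return index, s[index]
    return None

mystring = ("The total income from company xy was 12320 for the last year "
            "and 11932 in the previous year")
print(last_digit_position(mystring))  # (69, '2')
```

The index is 0-based, so add 1 to get the position 70 asked for in the question.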

Regular expression to replace a character on odd repeated occurrences in Python

I can't get a regular expression to replace a character on odd repeated occurrences in Python.
Example:
char = ``...```.....``...`....`````...`
to
``...``````.....``...``....``````````...``
Runs with an even number of occurrences should not be replaced.
for example:
>>> import re
>>> s = "`...```.....``...`....`````...`"
>>> re.sub(r'((?<!`)(``)*`(?!`))', r'\1\1', s)
'``...``````.....``...``....``````````...``'
Maybe I'm old fashioned (or my regex skills aren't up to par), but this seems to be a lot easier to read:
import re
def double_odd(regex, string):
    """
    Look for groups that match the regex. Double every odd-numbered match
    (the 1st, 3rd, ...) and leave the others unchanged.
    """
    count = [0]
    def _double(match):
        count[0] += 1
        return match.group(0) if count[0] % 2 == 0 else match.group(0) * 2
    return re.sub(regex, _double, string)
s = "`...```.....``...`....`````...`"
print(double_odd('`', s))
print(double_odd('`+', s))
It seems that I might have been a little confused about what you were actually looking for. Based on the comments, this becomes even easier:
def odd_repl(match):
    """
    double a match (all of the matched text) when the length of the
    matched text is odd
    """
    g = match.group(0)
    return g*2 if len(g)%2 == 1 else g
re.sub(regex, odd_repl, your_string)
This may be not as good as the regex solution, but works:
In [101]: s1=re.findall(r'`{1,}',char)
In [102]: s2=re.findall(r'\.{1,}',char)
In [103]: fill=s1[-1] if len(s1[-1])%2==0 else s1[-1]*2
In [104]: "".join("".join((x if len(x)%2==0 else x*2,y)) for x,y in zip(s1,s2))+fill
Out[104]: '``...``````.....``...``....``````````...``'
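Putting the callback idea together end to end, a runnable sketch might look like this. It assumes the character to double is a single character; double_odd_runs is a made-up name, and the input is assembled piece by piece only to keep the backticks readable.

```python
import re

def double_odd_runs(text, char):
    """Double every run of the single character `char` whose length is odd."""
    pattern = re.escape(char) + '+'
    def repl(match):
        run = match.group(0)
        # Even-length runs are returned unchanged
        return run * 2 if len(run) % 2 == 1 else run
    return re.sub(pattern, repl, text)

tick = '`'
s = (tick + '...' + tick * 3 + '.....' + tick * 2 + '...'
     + tick + '....' + tick * 5 + '...' + tick)
print(double_odd_runs(s, tick))
```

The runs of length 1, 3, 2, 1, 5, 1 become runs of length 2, 6, 2, 2, 10, 2, which matches the target string in the question.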

string matching with substitution using PYTHON

I have a string that I need to match against a sequence, and I want to determine how many times it is found in that sequence.
It has the following conditions:
The sequence can contain only the valid characters ACGT, so it could be ACGTGTCTG
The string could be ACGnkG
where n can be replaced by A or G
and k can be replaced by C or T
How can we find whether the string matches the sequence by substituting valid values for n and k?
Is there a regular expression for this?
re.findall(pattern, string) will return a list with all matches for pattern in string. len(...) will return the number of items in a list.
If you want to count occurrences of the pattern:
count_regex = sum(1 for _ in re.finditer(r'ACG[AG][CT]G', s))
If you want to count occurrences of the fixed string that first matches the pattern:
m = re.search(r'ACG[AG][CT]G', s)
count_fixed = s.count(m.group(0), m.start(0)) if m else 0
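Rather than writing the character classes by hand, the pattern can be generated from the template string. The sketch below uses only the two substitution rules stated in the question (n for A or G, k for C or T); the names WILDCARDS, to_regex and count_matches are assumptions for illustration.

```python
import re

# Substitution rules exactly as given in the question
WILDCARDS = {'n': '[AG]', 'k': '[CT]'}

def to_regex(template):
    """Translate a template such as 'ACGnkG' into a regex such as 'ACG[AG][CT]G'."""
    return ''.join(WILDCARDS.get(ch, re.escape(ch)) for ch in template)

def count_matches(template, sequence):
    # finditer reports non-overlapping matches, scanning left to right
    return sum(1 for _ in re.finditer(to_regex(template), sequence))

print(to_regex('ACGnkG'))                             # ACG[AG][CT]G
print(count_matches('ACGnkG', 'TTACGACGTTACGGTGAA'))  # 2
```

In the example sequence, the two matches are ACGACG and ACGGTG.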

In Python, how do I create a string of n characters in one line of code?

I need to generate a string with n characters in Python. Is there a one line answer to achieve this with the existing Python library? For instance, I need a string of 10 letters:
string_val = 'abcdefghij'
To simply repeat the same letter 10 times:
string_val = "x" * 10 # gives you "xxxxxxxxxx"
And if you want something more complex, like n random lowercase letters, it's still only one line of code (not counting the import statements and defining n):
from random import choice
from string import ascii_lowercase
n = 10
string_val = "".join(choice(ascii_lowercase) for i in range(n))
The first ten lowercase letters are string.ascii_lowercase[:10] (string.lowercase[:10] in Python 2), if you have imported the standard library module string previously, of course;-).
Other ways to "make a string of 10 characters": 'x'*10 (all the ten characters will be lowercase xs;-), ''.join(chr(ord('a')+i) for i in range(10)) (the first ten lowercase letters again; use xrange in Python 2), etc, etc;-).
if you just want any letters:
'a'*10 # gives 'aaaaaaaaaa'
if you want consecutive letters (up to 26):
''.join(['%c' % x for x in range(97, 97+10)]) # gives 'abcdefghij'
Why "one line"? You can fit anything onto one line.
Assuming you want them to start with 'a', and increment by one character each time (with wrapping > 26), here's a line:
>>> mkstring = lambda x: "".join(map(chr, (ord('a')+(y%26) for y in range(x))))
>>> mkstring(10)
'abcdefghij'
>>> mkstring(30)
'abcdefghijklmnopqrstuvwxyzabcd'
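The same wrapping behaviour can be sketched without the ord/chr arithmetic by cycling over the alphabet. The name first_n_letters is hypothetical.

```python
from itertools import cycle, islice
from string import ascii_lowercase

def first_n_letters(n):
    # cycle repeats a..z endlessly; islice takes the first n letters
    return ''.join(islice(cycle(ascii_lowercase), n))

print(first_n_letters(10))  # abcdefghij
print(first_n_letters(30))  # abcdefghijklmnopqrstuvwxyzabcd
```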
This might be a little off the question, but for those interested in the randomness of the generated string, my answer would be:
import os
import string
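def _pwd_gen(size=16):
    chars = string.ascii_letters  # string.letters in Python 2
    chars_len = len(chars)
    # In Python 3, iterating over bytes yields ints, so ord() is not needed;
    # scale each byte (0-255) to an index into chars
    return "".join(chars[int(b / 256. * chars_len)] for b in os.urandom(size))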
See these answers and random.py's source for more insight.
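On Python 3.6 and later, the standard library secrets module offers another way to draw letters from the operating system's random source. This is a sketch of an alternative, not the approach of the answer above; random_letters is a made-up name.

```python
import secrets
import string

def random_letters(size=16):
    # secrets.choice picks each character using the OS-provided random source
    return ''.join(secrets.choice(string.ascii_letters) for _ in range(size))

print(random_letters())    # 16 random ASCII letters, different on every call
print(random_letters(10))
```

This sidesteps the byte-scaling arithmetic entirely, which may be easier to read.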
If you can use repeated letters, you can use the * operator:
>>> 'a'*5
'aaaaa'
