Replace a random subset of regex matches - python

re.sub(pattern, replacement, text)
re.sub replaces every match in a given string text, unless you pass a count argument, in which case it replaces the first count matches. Neither is the behaviour I'm aiming for: instead of replacing the first count matches, I want to replace a random subset of matches (count is then the subset size).
Is there a straightforward way to achieve this? The only solution I thought of is making use of re.finditer, getting all match objects, randomly subsetting them, and then replacing manually with the help of the match objects (although I'm not quite sure of a good way to implement that last step), like...
pattern = "ab"
text = "ab ab ab"
replacement = "ba"
count = 2
match_objects = random.sample(list(re.finditer(pattern, text)), count)
...

I would do it like this:
import re, random
def randsub(pat, repl, text, n):
    matches = random.sample(list(re.finditer(pat, text)), n)
    # Replace from the last match to the first, so that earlier
    # replacements do not shift the offsets of later ones
    for m in sorted(matches, key=lambda m: -m.start()):
        text = text[:m.start()] + repl + text[m.end():]
    return text

for i in range(10):
    print(randsub("a{2,3}", "b", "aa|aaa|aa", 2))
b|b|aa
aa|b|b
b|b|aa
b|aaa|b
aa|b|b
b|aaa|b
b|b|aa
b|b|aa
aa|b|b
b|b|aa
So, you first get the list of matches (as you do in your question). However, you can't just substitute these sequentially: once you substitute one, the others' indices will be off. So we sort them from last to first in the string.
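To see what goes wrong otherwise, here is a deliberately broken variant that replaces in ascending order (values taken from the example above):
import re

text = "aa|aaa|aa"
matches = list(re.finditer("a{2,3}", text))[:2]  # the first two matches
for m in matches:  # ascending order: wrong
    text = text[:m.start()] + "b" + text[m.end():]
print(text)  # 'b|abaa' -- the second replacement worked on stale offsets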

Maybe count the matches, then pick which ones to replace, then use re.sub?
matches = len(re.findall(pattern, text))
pick = [1] * count + [0] * (matches - count)
random.shuffle(pick)
text = re.sub(pattern, lambda m: replacement if pick.pop() else m.group(0), text)
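For concreteness, here is that idea run end to end with the values from the question (which two matches get replaced varies between runs):
import re
import random

pattern = "ab"
text = "ab ab ab"
replacement = "ba"
count = 2

matches = len(re.findall(pattern, text))
pick = [1] * count + [0] * (matches - count)
random.shuffle(pick)
print(re.sub(pattern, lambda m: replacement if pick.pop() else m.group(0), text))
# e.g. 'ba ab ba'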

Related

how to split string between different separators in python

I want to pick out substrings from a string like personne01166+30-90; the output should look like: +30 and -90.
The strings can be like: 'personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75', etc.
I tried using
string.split('+')[len(string.split('+')) - 1].split('+')[0]
but the output must be the two corresponding numbers.
Here is how you can use a list comprehension and re.findall:
import re
s = ['personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75']
print([re.findall(r'[+-]\d+', i) for i in s])
Output:
[['+0', '-30'], ['+0', '+0'], ['+60', '-75']]
re.findall(r'[+-]\d+', i) finds all occurrences of the pattern '[+-]\d+' in the string i.
[+-] matches either + or -; \d+ matches one or more consecutive digits.
If you know the interesting part always comes after + then you can simply split twice.
numbers = string.split('+', 1)[1]
if '+' in numbers:
    this, that = numbers.split('+')
elif '-' in numbers:
    this, that = numbers.split('-')
    that = '-' + that  # restore the sign that split removed
else:
    raise ValueError('Could not parse %s' % string)
Perhaps a regex-based approach makes more sense, though:
import re
m = re.search(r'([-+]\d+)([-+]\d+)$', string)
if m:
    this, that = m.groups()
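Put together with the question's sample strings, a minimal runnable version of that regex approach looks like this:
import re

strings = ['personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75']
for string in strings:
    m = re.search(r'([-+]\d+)([-+]\d+)$', string)
    if m:
        this, that = m.groups()
        print(this, that)
# +0 -30
# +0 +0
# +60 -75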

replace certain not known words in a string

I'm looking for a more elegant solution to replace some words, not known up front, in a string, except for not, and, and or:
(The input below is only an example; it could be anything, but it will always be evaluable with eval().)
input: (DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A
output: (self.DEFINE_A or not(self.DEFINE_B and not (self.DEFINE_C))) and self.DEFINE_A
I created a solution, but it looks kind of strange. Is there a cleaner way?
s = '(DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A'
words = re.findall(r'[\w]+|[()]*|[ ]*', s)
for index, word in enumerate(words):
    w = re.findall('^[a-zA-Z_]+$', word)
    if w and w[0] not in ['and', 'or', 'not']:
        z = 'self.' + w[0]
        words[index] = z
new = ''.join(str(x) for x in words)
print(new)
Will print correctly:
(self.DEFINE_A or not(self.DEFINE_B and not (self.DEFINE_C))) and self.DEFINE_A
First of all, you can match only words by using a simple \w+. Then, using a negative lookahead, you can exclude the ones you don't want. Now all that's left to do is use re.sub directly with that pattern:
s = '(DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A'
new = re.sub(r"(?!and|not|or)\b(\w+)", r"self.\1", s)
print(new)
Which will give:
(self.DEFINE_A or not(self.DEFINE_B and not (self.DEFINE_C))) and self.DEFINE_A
If the names of your "variables" are always capitalized, this simplifies the pattern a bit and makes it much more efficient. Simply use:
new = re.sub(r"([A-Z\d_]+)", r"self.\1", s)
This is not only a simpler pattern (better for readability), but also much more efficient: on this example it takes only 70 regex-engine steps, compared to 196 for the original.

Insert Colon between each element of a list python

I have a list
[u'asvbsMasd', u'abdhesMrty', u'ahdksC', u'ahdeO', u'ahdnL', u'ahddsS',]
Now I want my list to have a colon inserted wherever a lowercase letter is followed by an uppercase letter:
[u'asvbs:Masd', u'abdhes:Mrty', u'ahdks:C', u'ahde:O', u'ahdn:L', u'ahdds:S']
You need to write a regex that matches <lowercase><uppercase> pair:
>>> import re
>>> r = re.compile(r'([a-z])([A-Z])')
Note that the letters themselves are marked as groups via (). If you have a regex matching the pair with the neighbouring letters as two separate groups, you may just use substitution (\1 and \2 are the places where the matched groups are put into the substitution string):
>>> r.sub(r'\1:\2', u'asvbsMasd')
u'asvbs:Masd'
Then you can use list comprehension to apply that substitution to each element of a list:
>>> l = [u'asvbsMasd', u'abdhesMrty', u'ahdksC', u'ahdeO', u'ahdnL', u'ahddsS']
>>> [r.sub(r'\1:\2', s) for s in l]
[u'asvbs:Masd', u'abdhes:Mrty', u'ahdks:C', u'ahde:O', u'ahdn:L', u'ahdds:S']
Or if you want it wrapped into a function:
import re
re_lowerupper = re.compile(r'([a-z])([A-Z])')
def add_colons(l):
    # the module-level pattern is readable here without a global statement
    return [re_lowerupper.sub(r'\1:\2', s) for s in l]
print add_colons([u'asvbsMasd', u'abdhesMrty', u'ahdksC', u'ahdeO', u'ahdnL', u'ahddsS'])
You may of course simplify it just to a single lambda, like in the next example.
One important disclaimer, as I see you use Unicode strings: there is no easy way of matching an arbitrary Unicode uppercase or lowercase character with the re module. There is no shorthand defined for it like there is for matching any digit (\d) or any alphanumeric character (\w). If you need to match diacritics too, you may need to list the lowercase and uppercase diacritics of your language explicitly in the regex, like:
re_lower = ur'[a-zßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿāăąćĉčēĕėęěğģĥĩīĭįĵķļľŀłņňŋōŏőœŕŗřśŝşţťũūŭůűųŵŷźžǎǐǒǔǖǘǚǜǩǫǵǹȟȧȩȯȳəḅḋḍḑḟḡḣḥḧḩḱḳṃṕṗṙṛṡṣṫṭṽẁẃẅẇẉẍẏẑẓẗẘẙạẹẽịọụỳỵỹ]'
re_upper = ur'[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞĀĂĄĆĈČĒĔĖĘĚĞĢĤĨĪĬĮİĴĶĻĽĿŁŅŇŊŌŎŐŒŔŖŘŚŜŞŢŤŨŪŬŮŰŲŴŶŸŹŽƏǍǏǑǓǕǗǙǛǨǪǴǸȞȦȨȮȲḄḊḌḐḞḠḢḤḦḨḰḲṂṔṖṘṚṠṢṪṬṼẀẂẄẆẈẌẎẐẒẠẸẼỊỌỤỲỴỸ]'
re_lowerupper = re.compile('(%s)(%s)' % (re_lower, re_upper))
add_colons = lambda l: [re_lowerupper.sub(r'\1:\2', s) for s in l]
This should do the job for the Latin script European languages.
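For instance, a quick check with made-up sample words containing diacritics (assuming the extended pattern above and a UTF-8 source encoding, since this section's code is Python 2):
# -*- coding: utf-8 -*-
for s in add_colons([u'straßeNeu', u'ähnlichÖde']):
    print s
# straße:Neu
# ähnlich:Öde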

Split a string using a list of strings as a pattern

Consider an input string :
mystr = "just some stupid string to illustrate my question"
and a list of strings indicating where to split the input string:
splitters = ["some", "illustrate"]
The output should look like
result = ["just ", "some stupid string to ", "illustrate my question"]
I wrote some code which implements the following approach. For each of the strings in splitters, I find its occurrences in the input string, and insert something which I know for sure would not be a part of my input string (for example, this '!!'). Then I split the string using the substring that I just inserted.
for s in splitters:
    mystr = re.sub(r'(%s)' % s, r'!!\1', mystr)
result = re.split('!!', mystr)
This solution seems ugly, is there a nicer way of doing it?
Splitting with re.split will always remove the matched string from the output (NB, this is not quite true, see the edit below). Therefore, you must use positive lookahead expressions ((?=...)) to match without removing the match. However, re.split ignores empty matches, so simply using a lookahead expression doesn't work. Instead, you will lose one character at each split at minimum (even trying to trick re with "boundary" matches (\b) does not work). If you don't care about losing one whitespace / non-word character at the end of each item (assuming you only split at non-word characters), you can use something like
re.split(r"\W(?=some|illustrate)")
which would give
["just", "some stupid string to", "illustrate my question"]
(note that the spaces after just and to are missing). You could then programmatically generate these regexes using str.join. Note that each of the split markers is escaped with re.escape so that special characters in the items of splitters do not affect the meaning of the regular expression in any undesired ways (imagine, e.g., a ) in one of the strings, which would otherwise lead to a regex syntax error).
the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
Edit (hat tip to @Arkadiy): Grouping the actual match, i.e. using (\W) instead of \W, returns the non-word characters inserted into the list as separate items. Joining every two subsequent items would then produce the list as desired as well. Then, you can also drop the requirement of having a non-word character by using (.) instead of \W:
the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest(the_split[::2], the_split[1::2], fillvalue='')]
Because normal text and auxiliary characters alternate, the_split[::2] contains the normal split text and the_split[1::2] the auxiliary characters. Then, itertools.izip_longest is used to combine each text item with the corresponding removed character, pairing the last item (which has no removed character) with fillvalue, i.e. ''. Then, each of these tuples is joined using "".join(x). Note that this requires itertools to be imported (you could of course do this in a simple loop, but itertools provides very clean solutions to these things). Also note that itertools.izip_longest is called itertools.zip_longest in Python 3.
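To make the intermediate steps concrete, here is a quick run of that version (Python 3, hence zip_longest):
import re
import itertools

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
print(the_split)
# ['just', ' ', 'some stupid string to', ' ', 'illustrate my question']

the_actual_split = ["".join(x) for x in itertools.zip_longest(the_split[::2], the_split[1::2], fillvalue='')]
print(the_actual_split)
# ['just ', 'some stupid string to ', 'illustrate my question']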
This leads to further simplification of the regular expression, because instead of using auxiliary characters, the lookahead can be replaced with a simple matching group ((some|interesting) instead of (.)(?=some|interesting)):
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
Here, the slice indices on the_raw_split have swapped, because now each matched splitter must be joined to the text that follows it instead of the text in front of it. Also note the [""] + part, which pairs the first text item with "" to keep the order intact.
(end of edit)
Alternatively, you can (if you want) use string.replace instead of re.sub for each splitter (I think that is a matter of preference in your case, but in general it is probably more efficient):
for s in splitters:
    mystr = mystr.replace(s, "!!" + s)
Also, if you use a fixed token to indicate where to split, you do not need re.split, but can use string.split instead:
result = mystr.split("!!")
What you could also do (instead of relying on the replacement token not to be in the string anywhere else or relying on every split position being preceded by a non-word character) is finding the split strings in the input using string.find and using string slicing to extract the pieces:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Here, [i for i in (string.find(s) for s in splitters) if i > 0] generates a list of positions where the splitters can be found, for all splitters that are in the string (for this, i < 0 is excluded) and not right at the beginning (where we (possibly) just split, so i == 0 is excluded as well). If there are any left in the string, we yield (this is a generator function) everything up to (excluding) the first splitter (at min(split_positions)) and replace the string with the remaining part. If there are none left, we yield the last part of the string and exit the function. Because this uses yield, it is a generator function, so you need to use list to turn it into an actual list.
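A quick check with the question's data (list() drains the generator):
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
print(list(split(mystr, splitters)))
# ['just ', 'some stupid string to ', 'illustrate my question']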
Note that you could also replace each yield with a call to some_list.append (provided you defined some_list earlier) and return some_list at the very end; I do not consider that to be very good code style, though.
TL;DR
If you are OK with using regular expressions, use
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
else, the same can also be achieved using string.find with the following split function:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Not especially elegant but avoiding regex:
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
indexes = [0] + [mystr.index(s) for s in splitters] + [len(mystr)]
indexes = sorted(list(set(indexes)))
print [mystr[i:j] for i, j in zip(indexes[:-1], indexes[1:])]
# ['just ', 'some stupid string to ', 'illustrate my question']
I should acknowledge here that a little more work is needed if a word in splitters occurs more than once, because str.index finds only the location of the first occurrence of the word...
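If repeats do need to be handled, one possible fix (a sketch, not part of the original answer) is to collect every occurrence with str.find before slicing:
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

indexes = {0, len(mystr)}
for s in splitters:
    start = mystr.find(s)
    while start != -1:  # record every occurrence, not just the first
        indexes.add(start)
        start = mystr.find(s, start + 1)
indexes = sorted(indexes)
print([mystr[i:j] for i, j in zip(indexes[:-1], indexes[1:])])
# ['just ', 'some stupid string to ', 'illustrate my question']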

Number of regex matches

I'm using the finditer function in the re module to match some things and everything is working.
Now I need to find out how many matches I've got. Is it possible without looping through the iterator twice? (one to find out the count and then the real iteration)
Some code:
imageMatches = re.finditer("<img src=\"(?P<path>[-/\w\.]+)\"", response[2])
# <Here I need to get the number of matches>
for imageMatch in imageMatches:
    doStuff
Everything works, I just need to get the number of matches before the loop.
If you know you will want all the matches, you could use the re.findall function. It will return a list of all the matches. Then you can just do len(result) for the number of matches.
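For example (the pattern here is just an illustration):
import re

result = re.findall(r'.ython', 'python jython dython')
print(len(result))  # 3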
If you always need to know the length, and you just need the content of the match rather than the other info, you might as well use re.findall. Otherwise, if you only need the length sometimes, you can use e.g.
matches = re.finditer(...)
...
matches = tuple(matches)
to store the iteration of the matches in a reusable tuple. Then just do len(matches).
Another option, if you just need to know the total count after doing whatever with the match objects, is to use
matches = enumerate(re.finditer(...))
which will return an (index, match) pair for each of the original matches. So then you can just store the first element of each tuple in some variable.
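A minimal sketch of that idea (the pattern and string are made-up examples; starting enumerate at 1 makes the index double as a running count):
import re

count = 0
for count, match in enumerate(re.finditer(r'.ython', 'python jython dython'), start=1):
    pass  # do the real work with match here
print(count)  # 3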
But if you need the length first of all, and you need match objects as opposed to just the strings, you should just do
matches = tuple(re.finditer(...))
# An example of counting matched groups
import re

pattern = re.compile(r'(\w+).(\d+).(\w+).(\w+)', re.IGNORECASE)
search_str = "My 11 Char String"
res = re.match(pattern, search_str)
print(len(res.groups()))  # len = 4
print(res.group(1))  # My
print(res.group(2))  # 11
print(res.group(3))  # Char
print(res.group(4))  # String
If you find you need to stick with finditer(), you can simply use a counter while you iterate through the iterator.
Example:
>>> from re import *
>>> pattern = compile(r'.ython')
>>> string = 'i like python jython and dython (whatever that is)'
>>> iterator = finditer(pattern, string)
>>> count = 0
>>> for match in iterator:
...     count += 1
...
>>> count
3
If you need the features of finditer() (for example, the match objects rather than just the matched strings), use this method.
I know this is a little old, but here is a concise function for counting regex matches:
import re

def regex_cnt(string, pattern):
    return len(re.findall(pattern, string))

string = 'abc123'
regex_cnt(string, '[0-9]')  # returns 3
For those moments when you really want to avoid building lists:
import re
import operator
from functools import reduce
# the initial value 0 keeps reduce from raising a TypeError on zero matches
count = reduce(operator.add, (1 for _ in re.finditer(my_pattern, my_string)), 0)
Sometimes you might need to operate on huge strings. This might help.
If you are using the finditer method, the simplest way to count the matches is to initialize a counter and increment it for each match.
