There is an issue I have been thinking about for some time.
I have replacement rules for a string transformation job. I am learning regex and slowly finding the correct patterns, so that part is no problem. However, there are many rules, and I could not fit them into a single expression, so now the replacements interfere with each other. Let me give a simple example.
Imagine I want to replace every 'a' with 'o' in a string.
I also want to replace every 'o' with 'k' in the same string. There is no ordering between the rules, so if I apply the first rule first, the converted 'a's then become 'k', which is not my intention: all conversions must have the same priority. How can I overcome this issue?
I use re.sub(), but I think the same issue exists for the str.replace() method.
All help appreciated, thank you!
Don't use str.replace(), use str.translate().
Here is how to do it with Python 2:
from string import maketrans
s = 'aoaoaoaoa'
trans_table = maketrans('ao', 'ok')
print s.translate(trans_table)
Output
okokokoko
It's a little different for Python 3:
s = 'aoaoaoaoa'
trans_table = {ord(k):v for k,v in zip('ao', 'ok')}
print(s.translate(trans_table))
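In Python 3 the table can also be built with the static method str.maketrans, which mirrors the Python 2 call. A minimal sketch (the variable names are only illustrative):

```python
s = 'aoaoaoaoa'

# str.maketrans maps each character of the first argument to the
# character at the same position in the second argument.
table = str.maketrans('ao', 'ok')

# translate() looks up every character of the ORIGINAL string once,
# so the 'o' produced from an 'a' is never translated again.
result = s.translate(table)
print(result)
```

This only covers single-character keys; rules with longer keys need a different approach.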
I have had a similar challenge and ended up replacing the first character with a placeholder, then replacing the second character, and in a third pass replacing the placeholder with the desired character. Not fancy, but it worked every time.
Replace the 'a' with '$', replace the 'o' with 'k', replace the '$' with 'o'.
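The three passes could be sketched as below. The function name and the default placeholder are assumptions made for illustration; the important caveat is that the placeholder must not already occur in the input, which the sketch checks explicitly.

```python
def swap_with_placeholder(text, placeholder='$'):
    # Hypothetical helper: a -> o and o -> k without the rules interfering.
    if placeholder in text:
        raise ValueError('placeholder already occurs in the text')
    return (
        text.replace('a', placeholder)   # pass 1: park every 'a'
            .replace('o', 'k')           # pass 2: only original 'o's are left
            .replace(placeholder, 'o')   # pass 3: parked 'a's become 'o'
    )

print(swap_with_placeholder('aoaoaoaoa'))
```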
We can solve it by the following code:
a --> ao; o --> k (a --> ao --> ak); ak --> o
string_test = "aaaoakkokkooao"
print (string_test.replace("a", "ao").replace("o", "k").replace("ak", "o"))
Try this (works for python2 and python3)
RULES = { 'a': 'o', 'o':'k'} # a->o, o->k, ... no precedence
source = 'Hello I am ok'
dest = "".join(RULES.get(c,c) for c in source)
print (dest)
You can easily add rules.
It also works if there are "loops" (for example, adding 'k': 'a' would create the loop a -> o -> k -> a).
The big problem is that it doesn't use regular expressions (as your question asks for). It could if your regular expressions each matched exactly one character and were all mutually exclusive. If that is the case, then you would not really need regular expressions (my solution above would be enough). What do you do if two regular expressions match (with different lengths)? Which one do you use (since you do not want any priorities)?
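For literal keys longer than one character, one common way to keep the "single pass, no precedence between rules" behaviour is to join the keys into one alternation and look up each match in the dictionary. This is only a sketch: the name replace_all is hypothetical, the keys are treated as literal strings rather than patterns, and the tie-break "longer key wins" is an assumption, since the question leaves that case open.

```python
import re

def replace_all(text, rules):
    # Try longer keys first so 'ab' is preferred over 'a' at the same position.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # Each match is replaced exactly once; replacements are never rescanned.
    return pattern.sub(lambda m: rules[m.group(0)], text)

print(replace_all('Hello I am ok', {'a': 'o', 'o': 'k'}))
```

The sketch assumes a non-empty rules dictionary; an empty one would produce an empty pattern.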
In Python, I am trying to find the last position in an arbitrary string that matches a given pattern, which is specified as a negated character set. For example, with the string uiae1iuae200 and the pattern "not a digit" (in Python regex: [^0-9]), I need 8 (the index of the last 'e' before the '200') as the result.
What is the most pythonic way to achieve this?
It is a little tricky to quickly find the best-suited method in the Python docs (method docs sit somewhere in the middle of the corresponding page, like re.search() in the re page). The best way I found quickly is re.search(), but the current form must be a suboptimal way of doing it:
import re
string = 'uiae1iuae200' # the string to investigate
len(string) - 1 - re.search(r'[^0-9]', string[::-1]).start()  # 8
I am not satisfied with this for two reasons:
- a) I need to reverse the string with [::-1] before using it, and
- b) I also need to reverse the resulting position (subtracting it from len(string)) because the string was reversed before.
There must be better ways to do this, likely even with the result of re.search().
I am aware of re.search(...).end() as an alternative to .start(), but re.search() seems to split the result into groups, and I did not quickly find a non-cumbersome way to apply it to the last matched group. Without specifying a group, .start(), .end(), etc. always seem to refer to the first group, which carries no position information about the last match. Selecting a group, however, seems to require saving the return value in a temporary variable first (which prevents neat one-liners), since I would need both to select the last group and then to call .end() on it.
What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.
Update
The solution should also work in corner cases, like 123 (no position matches the regex), the empty string, etc. It should not crash, e.g. because of selecting the last index of an empty list. However, as even my ugly attempt above would need more than one line for this, I guess a one-liner might be impossible (simply because one needs to check the return value of re.search() or re.finditer() before handling it). For this reason I'll accept pythonic multi-line solutions to this question.
You can use re.finditer to extract start positions of all matches and return the last one from list. Try this Python code:
import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])
Prints:
8
Edit:
To make the solution behave properly for all kinds of input, here is the updated code. It now takes two lines, because a check is needed: if the list is empty it prints None, otherwise the index value:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    lst = [m.start() for m in re.finditer(r'\D', s)]
    print(s, '-->', lst[-1] if len(lst) > 0 else None)
Prints the following, where if no such index is found then prints None instead of index:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
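The same finditer idea can be written without building the whole list of positions, by just remembering the most recent match. A minimal sketch; the function name last_match_start is an assumption, not something from the answer above:

```python
import re

def last_match_start(pattern, string):
    # Walk over all non-overlapping matches and keep only the latest one.
    last = None
    for last in re.finditer(pattern, string):
        pass
    # 'last' is still None when the pattern never matched (e.g. '' or '123').
    return last.start() if last is not None else None

for s in ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']:
    print(repr(s), '-->', last_match_start(r'\D', s))
```

This works for any pattern, not only single-character ones, and never indexes into an empty list.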
Edit 2:
As the OP stated in the post, \d was only an example, which is why I came up with a solution that works with any general regex. But if this problem really only involves \d, then there is a better solution that needs no list comprehension at all: a regex that finds the last occurrence of a non-digit character directly. We can use the regex .*(\D) to find the last non-digit and print its index with the following Python code:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    m = re.match(r'.*(\D)', s)
    print(s, '-->', m.start(1) if m else None)
Prints each string with the index of its last non-digit character, or None if there is none:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
And as you can see, this code doesn't need to use any list comprehension and is better as it can just find the index by just one regex call to match.
But in case OP indeed meant it to be written using any general regex pattern, then my above code using comprehension will be needed. I can even write it as a function that can take the regex (like \d or even a complex one) as an argument and will dynamically generate a negative of passed regex and use that in the code. Let me know if this indeed is needed.
To me it seems that you just want the last position which matches a given pattern (in this case the "not a number" pattern).
This is as pythonic as it gets:
import re
string = 'uiae1iuae200'
pattern = r'[^0-9]'
match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)
Output:
8
Or the exact same as a function and with more test cases:
import re
def last_match(pattern, string):
    match = re.match(fr'.*({pattern})', string)
    return match.end(1) - 1 if match else None
cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]
for pattern, string in cases:
    print(f'{pattern}, {string}: {last_match(pattern, string)}')
Output:
[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4
This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)), but it's pretty straightforward and probably not too inefficient.
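import re

def last_match(pattern, string):
    # Try ever longer suffixes; the first one that matches at its start wins.
    for i in range(1, len(string) + 1):
        substring = string[-i:]
        if re.match(pattern, substring):
            return len(string) - i
    return None  # no suffix matched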
The idea is to iterate over the suffixes of string from the shortest to the longest, and to check if it matches pattern.
Since we're checking from the end, we know for sure that the first substring we meet that matches the pattern is the last.
I'm trying to find all single letters between ! and !.
For example, the string !abc! should return three matches: a, b, c.
I tried the regex !([a-z])+!, but it returns just one match: c. !(([a-z])+)! also doesn't help.
import re
s = '!abc!'
print(re.findall(r'!([a-z])+!', s))
UPD: Needless to say, it should also work with the strings like !abcdef!. The number of characters between the delimiters is not fixed.
You should place the capture group around ([a-z]+), including the entire repeated term. Then, you may use list() to convert the match into a list of individual letters.
import re
s = '!abc!'
result = re.findall(r'!([a-z]+)!', s)
print(list(result[0]))
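The reason the original attempt returns only c is that a repeated capture group keeps just its last iteration. A small sketch with the built-in re module that shows this and then collects the letters of every !...! run; the helper name letters_between_bangs is hypothetical:

```python
import re

# A repeated group such as ([a-z])+ only remembers its final iteration.
print(re.findall(r'!([a-z])+!', '!abc!'))  # ['c']

def letters_between_bangs(s):
    letters = []
    # Capture each whole run of letters, then split it into characters.
    for run in re.findall(r'!([a-z]+)!', s):
        letters.extend(run)
    return letters

print(letters_between_bangs('!abcdef!'))
```

Because findall consumes both delimiters of a match, two runs that share a single ! (as in !abc!def!) would not both be found by this sketch.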
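(?<=!.*)\w(?=.*!) should return the result you want, each character individually. Note that Python's built-in re module rejects variable-width lookbehinds like this one, so it needs an engine that supports them, such as the third-party regex module.
Okay, I'm answering my own question. Found a solution, thanks to this answer.
First off, the third-party regex module is needed, because the regex below uses the \G anchor and \K, which the built-in re module does not support.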
Here is the regex:
(?:!|\G(?!^))\K([a-z])(?=(?:[a-z])*!)
Works like a charm.
import regex
s = '!abcdef!'
print(regex.findall(r'(?:!|\G(?!^))\K([a-z])(?=(?:[a-z])*!)', s))
Prints ['a', 'b', 'c', 'd', 'e', 'f'].
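I looked into your problem; you can follow this logic in your expression:
s = '!abc!'
print(re.findall(r'!([a-z])([a-z])([a-z])!',s))
Each character gets its own group, so the letters are returned individually. Note that this only works for exactly three letters between the delimiters, and findall returns them as a single tuple: [('a', 'b', 'c')].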
I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But it gives the wrong output for the first string. I also tried another one (here):
(.)(.*?)*?\1
But this one again gives the wrong output. What's going wrong here?
The Python code is a single print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s)) # s is the string
It can be solved without regular expressions, like below:
>>> s1 = 'abca'
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> s = 'abbcabb'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'
re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
import re
s = 'abbcabb'
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)  # c
Regular expressions don't seem to be the ideal solution:
- they don't handle overlapping matches, so a loop is needed (like in this answer), and strings are created over and over (performance suffers)
- they're overkill here, we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expressions and without O(n**2) complexity, in only O(n), using collections.Counter:
- first count the characters of the string, which is easy and quick
- then filter the string, testing each character's count against the counter we just created
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
EDIT: based on the comment exchange: if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach for very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds as long of a pattern for subpattern A that still allows for the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.
The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything in the middle, until that same character is encountered again.
So, for abbcabb the "sandwiched" portion is bbc, between the two a's.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps the "last" odd occurrence of a character and not the first.
For the "first" occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts. Pseudocode: count[letter] % 2 == 1
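The counter-based variant hinted at above could look like the following. It is a sketch under the assumption that "odd count" is the intended rule and that each surviving character should appear once, at its first position; the function name keep_odd_first is hypothetical.

```python
from collections import Counter

def keep_odd_first(s):
    counts = Counter(s)          # one O(n) pass to count every character
    seen = set()
    out = []
    for ch in s:
        # Keep a character only at its first occurrence, and only if it
        # occurs an odd number of times in the whole string.
        if counts[ch] % 2 == 1 and ch not in seen:
            seen.add(ch)
            out.append(ch)
    return ''.join(out)

print(keep_odd_first('abbcabb'))
print(keep_odd_first('abca'))
```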
When I tried to transform the string into a dict-like form, I met this problem
s = '&a: 12, &b:13, &c:14, &d: 15' # the string I want to convert
Before converting it, I tried to find all the matched results at first so I used
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
result = dict_form.findall(s)
print(result) # [('&a:', ' 12, &b:13, &c:14')]
It's quite unexpected, and a little bit messy
But when I tried another way to match the string:
dict_form1 = re.compile(r'(&[a-zA-Z]*:)([^,]*)')
result = dict_form1.findall(s)
print(result) # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14'), ('&d:', ' 15')]
This time, I get a better one with key and item separately stored in a tuple.
The only change I made was replacing (.*), with ([^,]*).
I thought the first one would match anything until it reaches a comma,
and the second one would match anything but a comma.
What's the difference?
In the first instance:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
the (.*) operator is greedy. This means it will match everything up to the last comma, which is why you see the match extend up to &c:14.
In the second instance, by excluding the comma, you force the match to stop at a comma: it's like saying "match everything until we hit a comma". This gives the matching behavior you were expecting in the first place.
As has been said, .* is greedy and tries to match as much as possible. To make it non-greedy, add a question mark (?), as in .*?. In your code:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*?),')
result = dict_form.findall(s)
print(result)
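To compare the three variants discussed here side by side, the snippet below runs each pattern on the string from the question. One thing it makes visible is that the non-greedy version still requires a trailing comma, so it drops the last pair:

```python
import re

s = '&a: 12, &b:13, &c:14, &d: 15'

# Greedy: .* runs to the LAST comma, so there is only one match.
greedy = re.findall(r'(&[a-zA-Z]*:)(.*),', s)
# Non-greedy: .*? stops at the next comma, but '&d: 15' has no comma after it.
lazy = re.findall(r'(&[a-zA-Z]*:)(.*?),', s)
# Negated class: the value simply cannot contain a comma; no comma is required.
negated = re.findall(r'(&[a-zA-Z]*:)([^,]*)', s)

print(greedy)
print(lazy)
print(negated)
```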
Another maybe easier solution is to just use string splits instead of regex:
result = [_s.split(':') for _s in s.split(',')]
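Since the stated goal was a dict-like form, the split-based approach can be taken one step further. A minimal sketch, assuming the keys should keep their & prefix and that surrounding whitespace is not wanted:

```python
s = '&a: 12, &b:13, &c:14, &d: 15'

pairs = {}
for item in s.split(','):
    # Split on the first colon only, then trim the stray spaces.
    key, value = item.split(':', 1)
    pairs[key.strip()] = value.strip()

print(pairs)
```

The values stay strings here; converting them with int() would be a further assumption about the data.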
Trying to strip the "0b1" from the left end of a binary number.
The following code strips the entire binary string (not good):
>>> bbn = '0b1000101110100010111010001' # converted with bin(2**24 + 2**24/11)
>>> aan=bbn.lstrip("0b1") #Try stripping all left-end junk at once.
>>> print aan #oops all gone.
''
So I did the .lstrip() in two steps:
>>> bbn = '0b1000101110100010111010001' # Same fraction expansion
>>> aan=bbn.lstrip("0b")# Had done this before.
>>> print aan #Extra "1" still there.
'1000101110100010111010001'
>>> aan=aan.lstrip("1")# If at first you don't succeed...
>>> print aan #YES!
'000101110100010111010001'
What's the deal?
Thanks again for solving this in one simple step. (see my previous question)
The strip family treat the arg as a set of characters to be removed. The default set is "all whitespace characters".
You want:
if strg.startswith("0b1"):
    strg = strg[3:]
No. Stripping removes all characters in the sequence passed, not just the literal sequence. Slice the string if you want to remove a fixed length.
In Python 3.9 and later you can use bbn.removeprefix('0b1').
(Actually this question has been mentioned as part of the rationale in PEP 616.)
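The difference between the two behaviours can be seen side by side in the sketch below. The wrapper name strip_prefix is hypothetical; it uses str.removeprefix where available and falls back to the startswith-and-slice idiom on older Python versions.

```python
def strip_prefix(text, prefix):
    # str.removeprefix exists from Python 3.9 on.
    if hasattr(text, 'removeprefix'):
        return text.removeprefix(prefix)
    # Fallback for older versions: slice only if the literal prefix is there.
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

bbn = '0b1000101110100010111010001'
print(repr(bbn.lstrip('0b1')))       # every leading '0', 'b' or '1' is removed
print(strip_prefix(bbn, '0b1'))      # only the literal prefix is removed
```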
This is the way lstrip works. It removes any of the characters in the parameter, not necessarily the string as a whole. In the first example, since the input consisted of only those characters, nothing was left.
lstrip removes any of the characters in the string you pass. So, as well as the initial 0b1, it removes all the zeros and ones that follow. Hence it is all gone!
@Harryooo: lstrip only takes characters off the left-hand end. So, because there's only one 1 before the first 0, it removes just that one. If the number started 0b11100101..., calling a.lstrip('0b').lstrip('1') would remove the first three ones, so you'd be left with 00101.
>>> i = 0b1000101110100010111010001
>>> print(bin(i))
0b1000101110100010111010001
>>> print(format(i, '#b'))
0b1000101110100010111010001
>>> print(format(i, 'b'))
1000101110100010111010001
From the standard documentation (see the documentation for the built-in function bin()):
bin(x)
Convert an integer number to a binary string prefixed with “0b”. The result is a valid Python expression. If x is not a Python int object, it has to define an __index__() method that returns an integer. Some examples:
>>> bin(3)
'0b11'
>>> bin(-10)
'-0b1010'
If the prefix “0b” is desired or not, you can use either of the following ways.
>>> format(14, '#b'), format(14, 'b')
('0b1110', '1110')
>>> f'{14:#b}', f'{14:b}'
('0b1110', '1110')
See also format() for more information.