Find all matches between delimiters [duplicate] - python

This question already has an answer here:
Regex - capture all repeated iteration
(1 answer)
Closed 3 years ago.
I'm trying to find all single letters between ! and !.
For example, the string !abc! should return three matches: a, b, c.
I tried the regex !([a-z])+!, but it returns just one match: c. !(([a-z])+)! also doesn't help.
import re
s = '!abc!'
print(re.findall(r'!([a-z])+!', s))
UPD: Needless to say, it should also work with the strings like !abcdef!. The number of characters between the delimiters is not fixed.

You should place the capture group around ([a-z]+), including the entire repeated term. Then, you may use list() to convert the match into a list of individual letters.
import re
s = '!abc!'
result = re.findall(r'!([a-z]+)!', s)
print(list(result[0]))  # ['a', 'b', 'c']
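The original pattern returns only c because a repeated capture group keeps just its last iteration. As a minimal sketch building on the approach above (the helper name letters_between is hypothetical, not part of the original answer), the captured run can be split into letters for every delimited group in the string, using only the standard re module:

```python
import re

def letters_between(s):
    """Return every letter found inside !...! groups, in order."""
    # Capture the whole run of letters, then split each run into characters.
    runs = re.findall(r'!([a-z]+)!', s)
    return [ch for run in runs for ch in run]

print(letters_between('!abc!'))
print(letters_between('!abcdef! and !xy!'))
```

This assumes that only lowercase ASCII letters appear between the delimiters, as in the question.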

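(?<=!.*)\w(?=.*!) should return the result you want, each character individually. Note that the lookbehind here is variable-width, which Python's built-in re module rejects; it needs an engine that supports variable-length lookbehind, such as the third-party regex module.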

Okay, I'm answering my own question. Found a solution, thanks to this answer.
First off, an alternative regex module is needed, because the regex below uses the \G anchor.
Here is the regex:
(?:!|\G(?!^))\K([a-z])(?=(?:[a-z])*!)
Works like a charm.
import regex
s = '!abcdef!'
print(regex.findall(r'(?:!|\G(?!^))\K([a-z])(?=(?:[a-z])*!)', s))
Prints ['a', 'b', 'c', 'd', 'e', 'f'].

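I looked into your problem. Give each character its own capture group in the expression:
import re
s = '!abc!'
print(re.findall(r'!([a-z])([a-z])([a-z])!', s))
Each character is captured by its own group, so the letters come back individually (as one tuple per match). Note that this only works when there are exactly three letters between the delimiters.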

Related

Regex in python: combining 2 regex expressions into one

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements, that contain numbers and elements, that end with dots.
So I want to delete '35','7,000','10,000','mr.','rev.'
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?
This should work:
reg = re.compile(r'[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it matches any string that contains a letter or a dot, not just strings ending with a dot.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also incomplete, as it is missing a "+" and a ",", which I included in the above solution.
Here is a demo.
Also, if you assume that elements which end with a dot might contain some numbers, the complete solution should be:
reg = re.compile(r'[a-zA-Z0-9]+\.|[0-9,]+')
If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
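As a rough sketch of how this pattern could be applied to the list from the question (the names drop and kept are illustrative only), re.search can act as a filter inside a list comprehension:

```python
import re

a = ['35', 'years', 'opened', '7,000', 'churches', 'rev.', 'mr.', 'brandt',
     'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church',
     'every', '10,000', 'persons']

# Drop an element if it ends with a dot or contains a digit anywhere.
drop = re.compile(r'\.$|\d')
kept = [word for word in a if not drop.search(word)]
print(kept)
```

Because re.search scans the whole string, no anchoring at the start is needed; only the trailing-dot alternative is anchored with $.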
You could use:
(?:[^\d\n]*\d)|.*\.$
See a demo on regex101.com.
Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
    # Skip strings made only of digits and commas, and strings ending with a dot.
    if not re.search(r'^(?:[\d,]+|.*\.)$', s):
        b.append(s)
print(b)
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']

How can I use two patterns in re.search()? [duplicate]

This question already has answers here:
Python RegEx that matches char followed/preceded by same char but uppercase/lowercase
(2 answers)
Closed 3 years ago.
I have the following string: string = DdCcaaBbbB. I want to delete every pair made of the same letter in opposite cases, that is, every pair of the form xX or Xx, where x is any letter.
I want to delete them one by one: in the example, first Dd, then Cc, then Bb and finally bB.
What I have done so far is:
for letter in string.lower():
    try:
        string = string.replace(re.search(letter + letter.upper(), string).group(),'')
    except:
        try:
            string = string.replace(re.search(letter.upper() + letter, string).group(),'')
        except:
            pass
But I am sure this is not the most pythonic way to do it. What has come up to my mind, and thus the question, is if I could combine the two patterns I am searching for. Any other suggestion or improvement is more than welcome!
I think you can do a case-insensitive regex search to find all combinations of the same two letters, then have a function check if they're of the xX or Xx format before deciding if it should be replaced (by nothing) or left alone.
import re
def replacer(match):
    text = match.group()
    # Remove the pair only when the two letters differ in case.
    if (text[0].islower() and text[1].isupper()) or (text[0].isupper() and text[1].islower()):
        return ""
    return text
string = "DdCcaaBbbB"
pattern = r'([a-z])\1'
new_string = re.sub(pattern, replacer, string, flags=re.IGNORECASE)
There is a downside to this approach. Because the regex is matching case-insensitively, it won't let you test overlapping matches. So if you have an input string like 'BBbb', it will match the two capital Bs and the two lowercase bs and not replace either pair, and it won't check the Bb pair in the middle.
Unfortunately I don't think regex can solve that problem, since it has no way to transform cases in the middle of its search. We're already a bit beyond the bounds of the most basic regular expression specifications, since we need to use a backreference to even get as far as we did.
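If overlapping pairs matter, one alternative that avoids regex entirely is a stack-based scan. This is a different technique from the re.sub() approach above, offered only as a sketch; the function name remove_case_pairs is made up for illustration. It assumes that pairs which become adjacent after an earlier deletion should also be removed, which the question does not state explicitly.

```python
def remove_case_pairs(text):
    """Delete adjacent xX / Xx pairs, including pairs exposed by earlier deletions."""
    stack = []
    for ch in text:
        # A pair is the same letter in opposite cases: the characters differ,
        # but they are equal once both are lowercased.
        if stack and stack[-1] != ch and stack[-1].lower() == ch.lower():
            stack.pop()
        else:
            stack.append(ch)
    return ''.join(stack)

print(remove_case_pairs('DdCcaaBbbB'))
print(remove_case_pairs('BBbb'))
```

Each character is pushed and popped at most once, so the scan is linear in the length of the string.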

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', '\g<2>', s)) # s is the string
It can be solved without a regular expression, as below:
>>> s = 'abbcabb'
>>> s1 = 'abca'
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'
re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
Regular expressions don't seem to be the ideal solution here:
they don't handle overlapping matches, so a loop is needed (like in this answer), and it creates new strings over and over (performance suffers)
they're overkill here; we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because the first group is not greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack only as far as it must, keeping subpattern A as long as possible while still allowing the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.
The site explains it well, hover and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything in the middle, sandwiched until that same character is encountered again.
So, for abbcabb, the "sandwiched" portion is the bbc between the two a's.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps each odd-count character at the position of its "last" unpaired occurrence, not its first.
For the "first" known occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts. Pseudocode: count[letter] % 2 == 1
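The odd-count condition in the pseudocode could be sketched as follows. The function name keep_odd_first is hypothetical, and the sketch assumes that "first" means keeping each odd-count character once, at the position where it first appears:

```python
from collections import Counter

def keep_odd_first(s):
    """Keep characters with an odd total count, once each, in first-seen order."""
    counts = Counter(s)   # one pass to count every character
    seen = set()
    out = []
    for ch in s:
        if counts[ch] % 2 == 1 and ch not in seen:
            seen.add(ch)
            out.append(ch)
    return ''.join(out)

print(keep_odd_first('abbcabb'))
print(keep_odd_first('abca'))
```

For a string such as 'abaa' this gives 'ab', whereas the list-based loop above gives 'ba', which illustrates the first versus last distinction.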

Regular expression help to find space after a long string

My code is as follows:
list = re.findall(r"PROGRAM S\d\d", contents)
If I print the list, I only get the match up to S51, but I want to capture everything.
I want to find everything like "PROGRAM S51_Mix_Station". I know how to match the digits, but I don't know how to match everything up to the next space, because there is usually a space after the last character.
Thanks in advance.
You can also use \w+:
import re
s = "PROGRAM S51_Mix_Station"
new_data = re.findall(r'^PROGRAM\s\w+\_\w+_\w+', s)
final_data = new_data[0] if new_data else new_data
Output:
'PROGRAM S51_Mix_Station'
OK, thanks. I found another solution:
lista = re.findall(r"PROGRAM S\d\d\S+", contents)
The \S+ matches the run of non-whitespace characters that follows the digits.
You could use this:
list = re.findall(r"PROGRAM S\d\d[^ ]*", contents)
This would match PROGRAM S followed by two digits, then followed by any number of non-space characters. If you wanted to stop at any whitespace character rather than only at spaces, then @Wiktor's comment would be better, i.e. use PROGRAM S\d\d\S*.
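A short sketch of the \S* variant on input containing several program names follows. The sample text in contents is made up for illustration; the real contents would come from the asker's file:

```python
import re

# Made-up sample input; the real `contents` comes from the asker's data.
contents = "PROGRAM S51_Mix_Station started\nPROGRAM S07_Pump stopped"

# \S* consumes everything up to the next whitespace character (space, tab, newline).
programs = re.findall(r"PROGRAM S\d\d\S*", contents)
print(programs)
```

Using a raw string (the r prefix) keeps the backslashes in \d and \S from being interpreted by Python before the regex engine sees them.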

Avoid string replace repetition [duplicate]

This question already has answers here:
How can I make multiple replacements in a string using a dictionary?
(8 answers)
Closed 6 years ago.
There is an issue I have been thinking about for some time.
I have replacement rules for a string transformation job. I am learning regex and slowly finding the correct patterns; this is no problem. However, there are many rules, and I could not express them in a single expression. Now the replacements are overlapping. Let me give you a simple example.
Imagine I want to replace every 'a' with 'o' in a string.
I also want to replace every 'o' with 'k' in the same string. However, there is no order, so if I apply the first rule first, the converted 'a's will then become 'k', which is not my intention, because all conversions must have the same priority or precedence. How can I overcome this issue?
I use re.sub(), but I think the same issue exists for the string.replace() method.
All help appreciated, Thank you !
Don't use str.replace(), use str.translate().
Here is how to do it with Python 2:
from string import maketrans
s = 'aoaoaoaoa'
trans_table = maketrans('ao', 'ok')
print s.translate(trans_table)
Output
okokokoko
It's a little different for Python 3:
s = 'aoaoaoaoa'
trans_table = {ord(k):v for k,v in zip('ao', 'ok')}
print(s.translate(trans_table))
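In Python 3 the table can also be built with the static method str.maketrans, which mirrors the Python 2 version more closely. A minimal sketch, using the same sample string as above:

```python
s = 'aoaoaoaoa'

# str.maketrans maps each character of the first string to the character
# at the same position in the second string: a -> o, o -> k.
trans_table = str.maketrans('ao', 'ok')
result = s.translate(trans_table)
print(result)
```

Because translate looks up each original character exactly once, an 'a' that becomes 'o' is never converted again to 'k'.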
I have had a similar challenge and ended up by replacing the first character with a place holder. I then replaced the 2nd character. The third pass was to replace the place holder with the desired character. Not fancy but worked every time.
Replace the 'a' with '$', replace the 'o' with 'k', replace the '$' with 'o'.
We can solve it by the following code:
a --> ao; o --> k (a --> ao --> ak); ak --> o
string_test = "aaaoakkokkooao"
print (string_test.replace("a", "ao").replace("o", "k").replace("ak", "o"))
Try this (works for python2 and python3)
RULES = { 'a': 'o', 'o':'k'} # a->o, o->k, ... no precedence
source = 'Hello I am ok'
dest = "".join(RULES.get(c,c) for c in source)
print (dest)
You can easily add rules.
It also works if there are "loops" (example, add 'k':'a' would make loop a -> o -> k -> a ).
The big problem is that it doesn't use regular expressions (as your OP asks for). It could if your regular expressions were all for exactly one character, and were all mutually exclusive. If it is the case, then you would not really need regular expressions (my above solution would be enough). What do you do if two regular expressions match (different lengths)? Which one do you use (since you do not want any priorities)?
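For literal multi-character rules, one possible regex-based sketch is a single re.sub() pass over an alternation of all keys, with a callback that looks up the replacement. The helper name multi_replace is hypothetical, and trying longer keys first is an assumption about how to settle matches of different lengths, not something the question specifies:

```python
import re

def multi_replace(text, rules):
    """Apply all literal replacement rules in a single left-to-right pass."""
    if not rules:
        return text
    # Longer keys first, so 'ab' is tried before 'a' at the same position.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # Each match is replaced once; replaced text is never rescanned.
    return pattern.sub(lambda m: rules[m.group(0)], text)

print(multi_replace('aoaoaoaoa', {'a': 'o', 'o': 'k'}))
print(multi_replace('Hello I am ok', {'a': 'o', 'o': 'k'}))
```

This handles literal strings only; rules that are themselves regular expressions would need a different lookup in the callback.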
