How to extract strings between two fixed marks repeatedly

How to extract strings between two fixed marks repeatedly - python

suppose we have a string
text="xaxbx"
We try to get everything between "x". In this case the answer should be "a" and "b"
But when I try
result=re.findall('x(.*?)x',text)
I only get "a", but not "b"
Is there a solution for more generalized situations such as
text="xaxbxcxdxexfx"?
Thanks!

That is, because you "consume" the x's by matching them directly. Look up lookahead and lookbehind. Using these features you get the correct solution:
(?<=x).*?(?=x)
Try it on regex101, you can test example strings there and they explain each part of the regex.

In re.findall('x(.*?)x',text) the characters "x" are consumed in the process of making a match. You can use lookahead and lookbehind instead:
import re
text="xaxbx"
re.findall(r"(?<=x)[^x]+(?=x)", text)
It gives:
['a', 'b']

Another approach is to use regular expression grouping:
import re
text = "xaxbxcxdxexfx"
re.findall("x([^x]+)", text)
Output:
['a', 'b', 'c', 'd', 'e', 'f']

Related

python/regex: match letter only or letter followed by number

I want to split this string 'AB4F2D' in ['A', 'B4', 'F2', 'D'].
Essentially, if character is a letter, return the letter, if character is a number return previous character plus present character (luckily there is no number >9 so there is never a X12).
I have tried several combinations but I am not able to find the correct one:
def get_elements(input_string):
patterns = [
r'[A-Z][A-Z0-9]',
r'[A-Z][A-Z0-9]|[A-Z]',
r'\D|\D\d',
r'[A-Z]|[A-Z][0-9]',
r'[A-Z]{1}|[A-Z0-9]{1,2}'
]
for p in patterns:
elements = re.findall(p, input_string)
print(elements)
results:
['AB', 'F2']
['AB', 'F2', 'D']
['A', 'B', 'F', 'D']
['A', 'B', 'F', 'D']
['A', 'B', '4F', '2D']
Can anyone help? Thanks

\D\d?
One problem with yours is that you put the shorter alternative first, so the longer one never gets a chance. For example, the correct version of your \D|\D\d is \D\d|\D. But just use \D\d?.

Use Extended Groups
There is special syntax for python regexes allowing you to match ahead without consuming the characters (and much more).
Here is a pattern I would come up with using that:
[A-Z](?![0-9])|[A-Z][0-9]
This matches everything in just one pattern. There might be simpler ways to match it, but I find this to be the most flexible if you want to adjust it later. Read it like this: greedily match a letter if the next character is not a digit. If that is not the case, match a letter followed by a digit.
More info in the docs. If you want to test around I recommend using a regex tester like this and make sure to select python syntax.

Regex expression for a given string

I have a small issue i am running into. I need a regular expression that would split a passed string with numbers separately and anything chunk of characters within square brackets separately and regular set of string separately.
for example if I have a strings that resembles
s = 2[abc]3[cd]ef
i need a list with lst = ['2','abc','3','cd','ef']
I have a code so far that has this..
import re
s = "2[abc]3[cd]ef"
s_final = ""
res = re.findall("(\d+)\[([^[\]]*)\]", s)
print(res)
This is outputting a list of tuples that looks like this.
[('2', 'abc'), ('3', 'cd')]
I am very new to regular expression and learning.. Sorry if this is an easy one.
Thanks!

The immediate fix is getting rid of the capturing groups and using alternation to match either digits or chars other than square bracket chars:
import re
s = "2[abc]3[cd]ef"
res = re.findall(r"\d+|[^][]+", s)
print(res)
# => ['2', 'abc', '3', 'cd', 'ef']
See the regex demo and the Python demo. Details:
\d+ - one or more digits
| - or
[^][]+ - one or more chars other than [ and ]
Other solutions that might help are:
re.findall(r'\w+', s)
re.findall(r'\d+|[^\W\d_]+', s)
where \w+ matches one or more letters, digits, underscores and some more connector punctuation with diacritics and [^\W\d_]+ matches any one or more Unicode letters.
See this Python demo.

Don't try a regex that will find all part in the string, but rather a regex that is able to match each block, and \w (meaning [a-zA-Z0-9_]) feats well
s = "2[abc]3[cd]ef"
print(re.findall(r"\w+", s)) # ['2', 'abc', '3', 'cd', 'ef']
Or split on brackets
print(re.split(r"[\[\]]", s)) # ['2', 'abc', '3', 'cd', 'ef ']

Regex is intended to be used as a Regular Expression, your string is Irregular.
regex is being mostly used to find a specific pattern in a long text, text validation, extract things from text.
for example, in order to find a phone number in a string, I would use RegEx, but when I want to build a calculator and I need to extract operators/digits I would not, but I would rather want to write a python code to do that.

Split string by position not character

We know that anchors, word boundaries, and lookaround match at a position, rather than matching a character.
Is it possible to split a string by one of the preceding ways with regex (specifically in python)?
For example consider the following string:
"ThisisAtestForchEck,Match IngwithPosition."
So i want the following result (the sub-strings that start with uppercase letter but not precede by space ):
['Thisis', 'Atest', 'Forch' ,'Eck,' ,'Match Ingwith' ,'Position.']
If i split with grouping i get:
>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atchingwith', 'P', 'osition.']
And this is the result with look-around :
>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,MatchingwithPosition.']
Note that if i want to split by sub-strings that start with uppercase and are preceded by a space, e.g.:
['Thisis', 'Atest', 'Forch' ,'Eck,' ,'Match ', Ingwith' ,'Position.']
I can use re.findall, viz.:
>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
But what about the first example: is it possible to solve it with re.findall?

A way with re.findall:
re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)
When you decide to change your approach from split to findall, the first job consists to reformulate your requirements: "I want to split the string on each uppercase letter non preceded by a space" => "I want to find one or more substrings separed by space that begins with an uppercase letter except from the start of the string (if the string doesn't start with an uppercase letter)"

(?<!\s)(?=[A-Z])
You can use this to split with regex module as re does not support split at 0 width assertions.
import regex
x="ThisisAtestForchEck,Match IngwithPosition."
print regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1)
or
print [i for i in regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1) if i]
See demo.
https://regex101.com/r/sJ9gM7/65

I know this might be less convenient because of the tuple nature of the result. But I think that this findall finds what you need:
re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]
This can be used in the following list comprehension to give the desired output:
[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
And here is a hack that uses split:
re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

try capture using this pattern
([A-Z][a-z]*(?: [A-Z][a-z]*)*)
Demo

Remove duplicate chars using regex?

Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all duplicate chars (i.e. a,z) with that respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes

>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then entire found portion is replaced with a single occurrence of the found letter.
On side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".

In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
s="ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
s= re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print s # prints 'abcdef'

A solution including all category:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['

Decompose a Python string into its characters

I want to break a Python string into its characters.
sequenceOfAlphabets = list( string.uppercase )
works.
However, why does not
sequenceOfAlphabets = re.split( '.', string.uppercase )
work?
All I get are empty, albeit expected count of elements

The '.' matches every character and re.split returns everything that wasn't matched, that's why you're getting the empty list.
Using list is usually the way to handle something like this but if you want to use regular expressions just use re.findall
sequenceOfAlphabets = re.findall( '.', string.uppercase )
That should give you ['A', 'B', 'C', .... ,'Z']

Because the delimiter character used by split does not appear in the resulting list. This allows it be used like:
re.split(',', "foo,bar,baz")
['foo', 'bar', 'baz']
Also, you will find the resulting list from your split code actually contains one extra element, since split returns one more than the number of delimiters found. The above has two commas, so it returns a three-element list.

If you can do something with both a built-in function and with regexes, then usually the built-in approach will be faster and more legible.
The regex world is a maze of twisty little passages, populated by purveyors of almost-truths like """The '.' matches every character""" ... which it does, but only when you use the re.DOTALL flag. This information is not cunningly concealed in the fine print of the documentation; it's right there as the FIRST entry of "special characters":
'.'
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
>>> import re
>>> re.findall(".", "fu\nbar")
['f', 'u', 'b', 'a', 'r']
>>>

Just an FYI, this also works:
sequenceOfAlphabets = [a for a in string.uppercase]
...but that does exactly what list() would do so I don't think it would be any faster (I could be wrong).

You can also create an empty set and use the update method, like so:
destroy_string = set()
destroy_string.update('Stack Overflow')
destroy_string
{'k', ' ', 'S', 'c', 'v', 'o', 'r', 't', 'w', 'e', 'f', 'O', 'l', 'a'}
Albeit, it will become unordered and the duplicates will be lost in the set, however, this is still a valid way to decompose a string into a set of its individual members.

From the documentation:
If capturing parentheses are used in
pattern, then the text of all groups
in the pattern are also returned as
part of the resulting list.
Also note:
If there are capturing groups in the
separator and it matches at the start
of the string, the result will start
with an empty string. The same holds
for the end of the string.
So, use re.split( '(.)', string.uppercase)[1:-1] instead.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to extract strings between two fixed marks repeatedly - python

suppose we have a string text="xaxbx" We try to get everything between "x". In this case the answer should be "a" and "b" But when I try result=re.findall('x(.*?)x',text) I only get "a", but not "b" Is there a solution for more generalized situations such as text="xaxbxcxdxexfx"? Thanks!

That is, because you "consume" the x's by matching them directly. Look up lookahead and lookbehind. Using these features you get the correct solution: (?<=x).*?(?=x) Try it on regex101, you can test example strings there and they explain each part of the regex.

In re.findall('x(.*?)x',text) the characters "x" are consumed in the process of making a match. You can use lookahead and lookbehind instead: import re text="xaxbx" re.findall(r"(?<=x)[^x]+(?=x)", text) It gives: ['a', 'b']

Another approach is to use regular expression grouping: import re text = "xaxbxcxdxexfx" re.findall("x([^x]+)", text) Output: ['a', 'b', 'c', 'd', 'e', 'f']

Related

python/regex: match letter only or letter followed by number

Regex expression for a given string

Split string by position not character

Remove duplicate chars using regex?

Decompose a Python string into its characters

Categories

Resources