How to get all substrings between some delimiters in python [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 3 years ago.
I am trying to get all the substrings that are enclosed by certain delimiters. My issue is that I also need the character at the end of the previous occurrence, since it is also the opening delimiter of the next substring. The strings need to be between any of these characters: . , / , ? , = , - , _
I have tried this regular expression
pattern = re.compile(r"""[./?=\-_][^./?=\-_]+[./?=\-_]""")
In this example:
-facebook=chat.messenger?
I am not able to get the substring =chat.
I am only getting -facebook= and .messenger?

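It looks like the overlap is what is causing the trouble: re.findall returns non-overlapping matches, so the = that closes -facebook= cannot also open =chat. If you use the third-party regex module (a backwards-compatible alternative to re, installable with pip install regex), you can do
import regex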
delimiters = r'[./?=\-_]'
pattern = delimiters + r'[a-z]+' + delimiters
s = '-facebook=chat.messenger?'
print(regex.findall(pattern, s, overlapped=True))
# ['-facebook=', '=chat.', '.messenger?']
Notice that this assumes all characters are lowercase with [a-z], and that [./?=\-_] is the list of delimiters you specified.
Hope this helps!
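If installing a third-party package is not an option, the standard re module can emulate overlapping matches by wrapping the whole pattern in a lookahead that contains a capturing group. The following is a minimal sketch, assuming the delimiter set from the question; the names DELIMS and overlapping_pattern are illustrative, and [^...]+ is used instead of [a-z]+ so that uppercase letters and digits between delimiters are accepted as well.

```python
import re

# Delimiter set from the question; DELIMS is a hypothetical helper name.
DELIMS = r"./?=\-_"

# A lookahead consumes no characters, so the closing delimiter of one match
# is still available as the opening delimiter of the next one.
overlapping_pattern = re.compile(rf"(?=([{DELIMS}][^{DELIMS}]+[{DELIMS}]))")

s = "-facebook=chat.messenger?"
substrings = overlapping_pattern.findall(s)
print(substrings)
# ['-facebook=', '=chat.', '.messenger?']
```

Because the pattern itself matches an empty string at each position, findall returns the content of the capturing group rather than the (empty) full match.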

My guess is that this expression might be what we might want to start with:
((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)
Demo
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)"
test_str = "-facebook=chat.messenger?"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Related

Why re.findall does not find the match in this case? [duplicate]

This question already has answers here:
re.findall not returning full match?
(6 answers)
Closed 12 months ago.
I'm trying to reconstruct an example string from a given regular expression
test_re = r'\s([0-9A-Z]+\w*)\s+\S*[Aa]lloy\s'
However, the code below only gives ['1AZabc']
import re
txt = " 1AZabc sdfsdfAlloy "
test_re = r'\s([0-9A-Z]+\w*)\s+\S*[Aa]lloy\s'
# test_re = r'\s+\S*[Aa]lloy\s'
x = re.findall(test_re,txt)
print(x)
Why is the content after the space (the part matching \s+ and what follows) not captured by re? What is a simple and valid example string that matches test_re?

Your code works and finds the whole match - you just misunderstand how capturing groups change what findall returns:
# code partially generated by regex101.com to demonstrate the issue
# see https://regex101.com/r/Gngy0r/1
import re
regex = r"\s([0-9A-Z]+\w*)\s+\S*?[Aa]lloy\s"
test_str = " 1AZabc sdfsdfAlloy "
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# use findall and print its results
print(re.findall(regex, test_str))
Output:
# full match that you got
Match 1 was found at 0-20: 1AZabc sdfsdfAlloy
# and what was captured
Group 1 found at 1-7: 1AZabc
# findall only gives you the groups ...
['1AZabc']
Either remove the ( ) so that findall returns the full match, or put everything you are interested in inside a single ( ):
regex = r"\s([0-9A-Z]+\w*\s+\S*?[Aa]lloy)\s"
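The effect of groups on re.findall can be seen side by side. This is a small sketch using the string from the question; the variable names are illustrative only.

```python
import re

txt = " 1AZabc sdfsdfAlloy "

# No capturing group: findall returns the full text of each match.
no_group = re.findall(r"\s[0-9A-Z]+\w*\s+\S*[Aa]lloy\s", txt)

# One capturing group: findall returns only that group's text.
one_group = re.findall(r"\s([0-9A-Z]+\w*)\s+\S*[Aa]lloy\s", txt)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"\s([0-9A-Z]+\w*)\s+(\S*[Aa]lloy)\s", txt)

# A non-capturing group (?:...) groups without changing what is returned.
non_capturing = re.findall(r"\s(?:[0-9A-Z]+\w*)\s+\S*[Aa]lloy\s", txt)

print(no_group)       # [' 1AZabc sdfsdfAlloy ']
print(one_group)      # ['1AZabc']
print(two_groups)     # [('1AZabc', 'sdfsdfAlloy')]
print(non_capturing)  # [' 1AZabc sdfsdfAlloy ']
```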

RegEx for matching email in URLs

I have a regex in my django code but I don't know what it means actually. Here is my regex :
r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$',
Could you give me some examples which match with this regex?

RegEx Circuit
You can visualize your expressions in jex.im.
You can also test/modify/change your expressions in regex101.com.
Basically, your expression would match:
email/some_alphanumeric[A-Z0-9]_special_chars_##$*some_alphanumeric_special_chars_#$*.some_alphanumeric_special_chars_#$*
Demo
If you wish to match:
myurl/email/blabla#blabla.com
You can modify it to:
myurl\/email\/([^#\s]+#[^#\s]+\.[^#\s]+)
Python Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"myurl\/email\/([^#\s]+#[^#\s]+\.[^#\s]+)"
test_str = "myurl/email/blabla#blabla.com"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

In addition:
r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$'
This regex is used in a Django URL pattern.
URL example: email/test#gmail.com/
^email/ = a constant prefix at the start of the URL
(?P<email>...) = a named group; Django passes the captured text to the view as the email argument
[^#\s] = any character except # and whitespace (\s)
#[^#\s] = a literal # followed by any character except # and whitespace (\s)
\. = matches a literal "."
[^#\s] = any character except # and whitespace (\s)
+ = one or more of the preceding character class
/$ = a trailing slash, then the end of the URL
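Outside Django, the same pattern can be tried directly with the re module. This is a minimal sketch, assuming the # characters are literal, as they are written in this thread; extract_email and the sample paths are made up for illustration.

```python
import re

EMAIL_URL = re.compile(r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$')

def extract_email(path):
    """Return the text captured by the named group 'email', or None."""
    m = EMAIL_URL.search(path)
    return m.group('email') if m else None

print(extract_email('email/test#gmail.com/'))   # test#gmail.com
print(extract_email('email/test#gmail/'))       # None: no dot after the #
print(extract_email('email/te st#gmail.com/'))  # None: contains whitespace
print(extract_email('email/test#gmail.com'))    # None: trailing slash missing
```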

Python findall Regex function catches only some text

I'm still new at Regex, and I've been trying to implement a Gmail validation algorithm in my Python program.
This is my Regex
mail_address = "hello.89#gmail.com"
result = re.findall(r'\w+[\w.]+(#gmail.com){1}', mail_address)
print (str(result))
The first char must be alphanumeric (\w+), from there it catches every set of chars ([\w.]+), followed by only one instance of #gmail.com
This is what it prints:
['#gmail.com']
But it should print
['hello.89#gmail.com']
What am I doing wrong?
EDIT: Here's the Regex I chose:
\A(\w+[\w.]+#gmail\.com)\Z

Just alter the parentheses so that it includes all of your desired output:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
I have slightly altered your expression: the gmail.com part is now a plain string rather than a group. Additionally, you don't need to convert the result to a string, and you don't need to repeat a group exactly once with {1}.
That being said, in the end, you'd end up having:
import re
mail_address = "hello.89#gmail.com"
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
print (result)
# ['hello.89#gmail.com']

The problem is in the parentheses, as Jan mentioned. Your regex can also be simplified to this:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
Demo: https://regex101.com/r/Z5EGbZ/1
The {1} quantifier after #gmail.com is redundant.
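For validation, as opposed to extraction, the whole string has to match, which is what the \A...\Z anchors in the question's edit express. A minimal sketch of the same idea using re.fullmatch follows; is_gmail is a hypothetical helper name, and the # separator is kept literally as it appears in this thread.

```python
import re

# Escaping the dot makes it match a literal "." instead of any character.
GMAIL_RE = re.compile(r'\w+[\w.]+#gmail\.com')

def is_gmail(address):
    """Return True only if the entire string matches the pattern."""
    return GMAIL_RE.fullmatch(address) is not None

print(is_gmail("hello.89#gmail.com"))          # True
print(is_gmail("hello.89#gmailxcom"))          # False: the dot is not optional
print(is_gmail("see hello.89#gmail.com now"))  # False: extra text around it
```

With fullmatch no group is needed at all, so the question of what findall returns does not arise.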

This should work using your original regex, because match.group() returns the whole match rather than only the captured group:
import re
regex = r"\w+[\w.]+(#gmail.com){1}"
test_str = "hello.89#gmail.com"
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

Find string after "task-" in a long substring using regex

I have a list of files with the patterns sub-*_task-XYZabc_run-*_bold.json and sub-*_task-PQRghu_bold.json, for example:
sub-03_task-dis_run-01_bold.json
sub-03_task-dis_run-02_bold.json
sub-03_task-dis_run-03_bold.json
sub-03_task-dis_run-04_bold.json
sub-03_task-dis_run-05_bold.json
sub-03_task-dis_run-06_bold.json
sub-03_task-fb_run-01_bold.json
sub-03_task-fb_run-02_bold.json
sub-03_task-fb_run-03_bold.json
sub-03_task-fb_run-04_bold.json
I intend to find all different task names from the filename. In the above example, dis and fb are the two tasks.
What kind of regex should I use to find TASKNAME from task-TASKNAME in a given filename?

The following regex should do it:
(?<=task-).*?(?=_)
In Python:
import re
regex = r"(?<=task-).*?(?=_)"
str = """sub-03_task-dis_run-01_bold.json
sub-03_task-dis_run-02_bold.json
sub-03_task-dis_run-03_bold.json
sub-03_task-dis_run-04_bold.json
sub-03_task-dis_run-05_bold.json
sub-03_task-dis_run-06_bold.json
sub-03_task-fb_run-01_bold.json
sub-03_task-fb_run-02_bold.json
sub-03_task-fb_run-03_bold.json
sub-03_task-fb_run-04_bold.json"""
matches = re.finditer(regex, str)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("{match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
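Since the goal is the set of different task names rather than one line of output per file, the matches can be collected into a set. This is a minimal sketch, assuming the filenames are available as a list of strings; filenames, TASK_RE, and unique_tasks are illustrative names, and only a few of the files from the question are listed.

```python
import re

filenames = [
    "sub-03_task-dis_run-01_bold.json",
    "sub-03_task-dis_run-02_bold.json",
    "sub-03_task-fb_run-01_bold.json",
    "sub-03_task-fb_run-02_bold.json",
]

# Same lookbehind/lookahead pattern as above, compiled once.
TASK_RE = re.compile(r"(?<=task-).*?(?=_)")

task_names = set()
for name in filenames:
    match = TASK_RE.search(name)
    if match:  # skip names that have no task- part
        task_names.add(match.group())

unique_tasks = sorted(task_names)
print(unique_tasks)
# ['dis', 'fb']
```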

My regex works on regex101 but doesn't work in python? [duplicate]

This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 1 year ago.
So I need to match strings that are surrounded by |. So, the pattern should simply be r"\|([^\|]*)\|", right? And yet:
>>> pattern = r"\|([^\|]*)\|"
>>> re.match(pattern, "|test|")
<_sre.SRE_Match object at 0x10341dd50>
>>> re.match(pattern, " |test|")
>>> re.match(pattern, "asdf|test|")
>>> re.match(pattern, "asdf|test|1234")
>>> re.match(pattern, "|test|1234")
<_sre.SRE_Match object at 0x10341df30>
It's only matching on strings that begin with |? It works just fine on regex101 and this is python 2.7 if it matters. I'm probably just doing something dumb here so any help would be appreciated. Thanks!

re.match only looks for a match at the beginning of the string. In your case, you just need the matching element, correct? Then you can use re.search or re.findall, which will find the match anywhere in the string:
>>> re.search(pattern, " |test|").group(0)
'|test|'
>>> re.findall(pattern, " |test|")
['test']
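The difference can be checked on the strings from the question. This is a short sketch; samples and results are illustrative names.

```python
import re

pattern = re.compile(r"\|([^\|]*)\|")

samples = ["|test|", " |test|", "asdf|test|1234"]

for text in samples:
    at_start = pattern.match(text)   # succeeds only if the match starts at index 0
    anywhere = pattern.search(text)  # scans forward for the first match
    print(repr(text), bool(at_start), anywhere.group(1) if anywhere else None)

# (match succeeded, search succeeded) for each sample
results = [(bool(pattern.match(t)), bool(pattern.search(t))) for t in samples]
print(results)
# [(True, True), (False, True), (False, True)]
```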

To reproduce in Python what runs on https://regex101.com/, click on Code Generator on the left-hand side. This will show you the code their website is using. From there you can play around with flags, or with the function you need from re.
Note:
https://regex101.com/ uses re.MULTILINE as default flag
https://regex101.com/ uses re.finditer as default method
import re
regex = r"where"
test_str = "select * from table where t=3;"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
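The re.MULTILINE flag mentioned in the note above matters whenever the pattern uses ^ or $. A minimal sketch of the difference, with made-up sample text:

```python
import re

text = "first line\nsecond line"

# Without the flag, ^ matches only at the very start of the string.
default_hits = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches immediately after each newline.
multiline_hits = re.findall(r"^\w+", text, re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second']
```

A pattern that behaves differently on the website and in a script is therefore worth re-running with the same flags and the same function in both places.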

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only
at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does
by default).
Document
