Find string after "task-" in a long substring using regex - python

I have a list of files matching the patterns sub-*_task-XYZabc_run-*_bold.json and sub-*_task-PQRghu_bold.json, for example:
sub-03_task-dis_run-01_bold.json
sub-03_task-dis_run-02_bold.json
sub-03_task-dis_run-03_bold.json
sub-03_task-dis_run-04_bold.json
sub-03_task-dis_run-05_bold.json
sub-03_task-dis_run-06_bold.json
sub-03_task-fb_run-01_bold.json
sub-03_task-fb_run-02_bold.json
sub-03_task-fb_run-03_bold.json
sub-03_task-fb_run-04_bold.json
I want to find all the distinct task names in the filenames. In the above example, dis and fb are the two tasks.
What kind of regex should I use to find TASKNAME from task-TASKNAME in a given filename?

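The following regex should do it:
(?<=task-).*?(?=_)
The lookbehind (?<=task-) starts the match right after task-, .*? lazily consumes the name, and the lookahead (?=_) stops it before the next underscore.
In Python: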
import re

regex = r"(?<=task-).*?(?=_)"
# Avoid naming the variable `str`, which would shadow the built-in type.
test_str = """sub-03_task-dis_run-01_bold.json
sub-03_task-dis_run-02_bold.json
sub-03_task-dis_run-03_bold.json
sub-03_task-dis_run-04_bold.json
sub-03_task-dis_run-05_bold.json
sub-03_task-dis_run-06_bold.json
sub-03_task-fb_run-01_bold.json
sub-03_task-fb_run-02_bold.json
sub-03_task-fb_run-03_bold.json
sub-03_task-fb_run-04_bold.json"""
# Print the task name found in each filename, one per line.
for match in re.finditer(regex, test_str):
    print(match.group())
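The loop above prints one task name per file, so names repeat. To get only the distinct names, the matches can be de-duplicated. The following is a minimal sketch, assuming the filenames are already in a Python list; `TASK_RE` and `unique_tasks` are hypothetical names, and a capture group is used here in place of the lookarounds.

```python
import re

# Hypothetical pattern name: captures the text between "task-" and the next "_".
TASK_RE = re.compile(r"task-([^_]+)_")

def unique_tasks(filenames):
    """Return the distinct task names, in order of first appearance."""
    seen = []
    for name in filenames:
        m = TASK_RE.search(name)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
    return seen

files = [
    "sub-03_task-dis_run-01_bold.json",
    "sub-03_task-dis_run-02_bold.json",
    "sub-03_task-fb_run-01_bold.json",
    "sub-03_task-fb_run-02_bold.json",
]
print(unique_tasks(files))  # ['dis', 'fb']
```

A list is used rather than a set so the order of first appearance is preserved; for large inputs a set alongside the list would avoid the linear membership test.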

Related

Select only regex match from a continuous string

I want to use this regex
r"Summe\d+\W\d+"
to match this string
150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung
but I want to only filter out this specific part
Summe50,90
I can select the entire string with this regex but I'm not sure how to filter out only the matching part
Here is the function it is in, where I am trying to get the amount from a PDF:
def get_amount(url):
    data = requests.get(url)
    with open('/Users/derricdonehoo/code/derric-d/price-processor/exmpl.pdf', 'wb') as f:
        f.write(data.content)
    pdfFileObj = open('exmpl.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    text = pageObj.extractText().split()
    regex = re.compile(r"Summe\d+\W\d+")
    matches = list(filter(regex.search, text))
    for i in range(len(matches)):
        matchString = '\n'.join(matches)
        print(matchString)
As described above, I would like guidance on how best to extract just the matching portion of this string, preferably allowing for varying numbers of characters on either side, though that is not a priority.
Thanks!
This simple expression might work here:
(Summe.+?)•
Test
import re
regex = r"(Summe.+?)•"
test_str = "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    # The whole match includes the trailing bullet.
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Group 1 holds only the wanted part, without the bullet.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
What you are missing is a convenient way to "grab" your match.
import re
text = "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"
match = re.search(r"Summe\d+\W\d+", text)
if match:
    res = match.group()
>>> print(res)
Summe50,90
Note that group accepts an index to return a specific group from inside your regex, but since this pattern has no groups (which are written as (...) in the regex), you simply call it without arguments.
If you want to find all occurrences of the pattern, use re.findall:
import re
text = "150,90‡50,90‡8,13‡Summe50,90•50,90•Summe8,13•Kreditkartenzahlung"
matches = re.findall(r"Summe\d+\W\d+", text)
>>> print(matches)
['Summe50,90', 'Summe8,13']
In this case a list of all matches (already strings, not Match objects) is returned. If the pattern contains one capture group, the list holds that group's text instead; with several groups it holds tuples, one per match, containing all the groups.
See the documentation for re.search and re.findall.
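How the return value of re.findall changes with the number of capture groups is easier to see side by side. This is a small illustration on the same sample string; the grouped variants of the pattern are made up for the comparison and are not part of the original answer.

```python
import re

text = "150,90‡50,90‡8,13‡Summe50,90•50,90•Summe8,13•Kreditkartenzahlung"

# No groups: each item is the whole match.
whole = re.findall(r"Summe\d+\W\d+", text)

# One group: each item is just that group's text.
amounts = re.findall(r"Summe(\d+\W\d+)", text)

# Two groups: each item is a tuple of the groups.
parts = re.findall(r"Summe(\d+)\W(\d+)", text)

print(whole)    # ['Summe50,90', 'Summe8,13']
print(amounts)  # ['50,90', '8,13']
print(parts)    # [('50', '90'), ('8', '13')]
```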
This is what you want. Your regex is correct, but you must retrieve the match after searching for it.
import re
regex = re.compile(r"Summe\d+\W\d+")
text = ["150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"]
matches = []
for t in text:
    m = regex.search(t)
    if m:
        # group(0) is the matched text itself, not the whole input string.
        matches.append(m.group(0))
print(matches)
re.search returns a Match object on success and None on failure; that object contains all the information about the match. To get the whole match, call Match.group().
\W matches any single non-word character, so it would accept separators such as • or ‡ as well as the comma. A more specific pattern is
regex = r'Summe\d+,\d{2}'
which matches the first 50,90 after Summe.
If the separating comma is too specific (because it might come as a dot) you can use a character set:
regex = r'Summe\d+[,.]\d{2}'
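Once the separator is pinned down with a character set, the matched text can be turned into a number. The following is a minimal sketch, assuming the amount always has two decimal places; `AMOUNT_RE` and `parse_sum` are hypothetical names introduced for illustration.

```python
import re
from decimal import Decimal

# Hypothetical pattern: digits after "Summe", with "," or "." as the decimal separator.
AMOUNT_RE = re.compile(r"Summe(\d+)[,.](\d{2})")

def parse_sum(text):
    """Return the first amount after "Summe" as a Decimal, or None if absent."""
    m = AMOUNT_RE.search(text)
    if m is None:
        return None
    # Rebuild the number with "." so Decimal can parse it.
    return Decimal(f"{m.group(1)}.{m.group(2)}")

sample = "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"
print(parse_sum(sample))  # 50.90
```

Decimal is chosen over float here because the value is a monetary amount; that is a design preference, not something the original answers require.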

How to get all substrings between some delimiters in python [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 3 years ago.
I am trying to get all the substrings that are enclosed by certain delimiters. My issue is that I also need the character at the end of the previous occurrence, so that it can start the next match. The strings need to be between any of these characters: . , / , ? , = , - , _
I have tried this regular expression
pattern = re.compile(r"""[./?=\-_][^./?=\-_]+[./?=\-_]""")
In this example:
-facebook=chat.messenger?
I am not able to get the substring =chat.
I am only getting -facebook= and .messenger?
It looks like the overlap is what's causing the trouble. If you use the third-party regex module (installable with pip install regex), you can do
import regex
delimiters = r'[./?=\-_]'
pattern = delimiters + r'[a-z]+' + delimiters
s = '-facebook=chat.messenger?'
print(regex.findall(pattern, s, overlapped=True))
# ['-facebook=', '=chat.', '.messenger?']
Notice that this assumes all characters between the delimiters are lowercase letters ([a-z]), and that [./?=\-_] is the list of delimiters you specified.
Hope this helps!
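If installing a third-party package is not an option, the standard re module can produce the same overlapping matches by wrapping the pattern in a lookahead that contains a capture group, since a lookahead consumes no characters. This is a minimal sketch using the delimiter set from the question; `DELIMS`, `OVERLAP_RE` and `delimited_parts` are names made up for illustration.

```python
import re

DELIMS = r"[./?=\-_]"
# The lookahead (?=...) matches without consuming text, so the closing
# delimiter of one hit can also open the next one.
OVERLAP_RE = re.compile(rf"(?=({DELIMS}[^./?=\-_]+{DELIMS}))")

def delimited_parts(s):
    """Return every delimiter-enclosed substring, including overlapping ones."""
    return OVERLAP_RE.findall(s)

print(delimited_parts("-facebook=chat.messenger?"))
# ['-facebook=', '=chat.', '.messenger?']
```

Unlike the [a-z]+ version, this sketch accepts any run of non-delimiter characters between the delimiters, which follows the pattern in the question more closely.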
This expression might be a reasonable starting point:
((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)"
test_str = "-facebook=chat.messenger?"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

RegEx for matching email in URLs

I have a regex in my Django code, but I don't actually know what it means. Here is my regex:
r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$',
Could you give me some examples that match this regex?
RegEx Circuit
You can visualize your expressions at jex.im, and test or modify them at regex101.com.
Basically, your expression would match:
email/some_alphanumeric[A-Z0-9]_special_chars_##$*some_alphanumeric_special_chars_#$*.some_alphanumeric_special_chars_#$*
If you wish to match:
myurl/email/blabla#blabla.com
You can modify it to:
myurl\/email\/([^#\s]+#[^#\s]+\.[^#\s]+)
Python Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"myurl\/email\/([^#\s]+#[^#\s]+\.[^#\s]+)"
test_str = "myurl/email/blabla#blabla.com"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
In addition:
r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$'
This regex is used in a Django URL pattern.
URL example: email/test#gmail.com/
^email/ = literal text at the start of the URL
(?P<email>...) = a named group; the text it captures is available under the name email
[^#\s]+ = one or more characters other than # and whitespace (\s)
# = a literal #
[^#\s]+ = again, one or more characters other than # and whitespace
\. = a literal "."
[^#\s]+ = one or more characters other than # and whitespace
/$ = a trailing slash at the end of the URL
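The breakdown above can be checked outside Django by compiling the same pattern with re and trying a few paths. This is a minimal sketch; `EMAIL_URL_RE` and `captured_email` are hypothetical names, and the sample paths other than email/test#gmail.com/ are invented for the comparison.

```python
import re

# Same pattern as in the question; "#" is the literal separator it expects.
EMAIL_URL_RE = re.compile(r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$')

def captured_email(path):
    """Return the text captured by the named group, or None when the path does not match."""
    m = EMAIL_URL_RE.match(path)
    return m.group('email') if m else None

samples = [
    "email/test#gmail.com/",   # matches
    "email/test#gmail.com",    # no trailing slash
    "email/a b#c.com/",        # whitespace before the "#"
    "email/test#gmail/",       # no "." after the "#"
]
for path in samples:
    print(path, "->", captured_email(path))
```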

Python findall Regex function catches only some text

I'm still new at Regex, and I've been trying to implement a Gmail validation algorithm in my Python program.
This is my Regex
mail_address = "hello.89#gmail.com"
result = re.findall(r'\w+[\w.]+(#gmail.com){1}', mail_address)
print (str(result))
The first char must be alphanumeric (\w+), from there it catches every set of chars ([\w.]+), followed by only one instance of #gmail.com
This is what it prints:
['#gmail.com']
But it should print
['hello.89#gmail.com']
What am I doing wrong?
EDIT: Here's the Regex I chose:
\A(\w+[\w.]+#gmail\.com)\Z
Just alter the parentheses so that it includes all of your desired output:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
I have slightly altered your expression insofar as the gmail.com part is now only a string. Additionally, you don't need to convert the results to string plus you don't need to repeat a group just once.
That being said, in the end, you'd end up having:
import re
mail_address = "hello.89#gmail.com"
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
print (result)
# ['hello.89#gmail.com']
The problem is in the parentheses, as Jan mentioned. Your regex can also be simplified to this:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
Demo: https://regex101.com/r/Z5EGbZ/1
The {1} quantifier after #gmail.com is redundant.
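The effect of the parentheses on re.findall can be seen by running the original pattern next to a variant with a non-capturing group. The sketch below also shows re.fullmatch as one possible way to validate the whole string; `is_gmail` is a hypothetical helper name, and the escaped dot in the variants is an adjustment not present in the answers above.

```python
import re

mail_address = "hello.89#gmail.com"

# One capture group: findall returns only what the group captured.
captured = re.findall(r'\w+[\w.]+(#gmail.com){1}', mail_address)

# Non-capturing group (?:...): findall returns the whole match.
whole = re.findall(r'\w+[\w.]+(?:#gmail\.com)', mail_address)

def is_gmail(address):
    """Validate the whole string, not just a part of it."""
    return re.fullmatch(r'\w+[\w.]+#gmail\.com', address) is not None

print(captured)                       # ['#gmail.com']
print(whole)                          # ['hello.89#gmail.com']
print(is_gmail(mail_address))         # True
print(is_gmail("x " + mail_address))  # False
```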
This should work with your regex unchanged, because match.group() returns the whole match regardless of the capture groups:
import re
regex = r"\w+[\w.]+(#gmail.com){1}"
test_str = "hello.89#gmail.com"
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

My regex works on regex101 but doesn't work in python? [duplicate]

This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 1 year ago.
So I need to match strings that are surrounded by |. So, the pattern should simply be r"\|([^\|]*)\|", right? And yet:
>>> pattern = r"\|([^\|]*)\|"
>>> re.match(pattern, "|test|")
<_sre.SRE_Match object at 0x10341dd50>
>>> re.match(pattern, " |test|")
>>> re.match(pattern, "asdf|test|")
>>> re.match(pattern, "asdf|test|1234")
>>> re.match(pattern, "|test|1234")
<_sre.SRE_Match object at 0x10341df30>
It's only matching on strings that begin with |? It works just fine on regex101 and this is python 2.7 if it matters. I'm probably just doing something dumb here so any help would be appreciated. Thanks!
re.match only matches at the beginning of the string. In your case, you just need the matching element, correct? In that case you can use something like re.search or re.findall, which will find that match anywhere in the string:
>>> re.search(pattern, " |test|").group(0)
'|test|'
>>> re.findall(pattern, " |test|")
['test']
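The difference between the three entry points can be summarised in a few lines. This is a minimal sketch using the pattern from the question; `first_piped` is a hypothetical helper name.

```python
import re

pattern = r"\|([^\|]*)\|"

def first_piped(text):
    """Return the first |...|-delimited value anywhere in text, or None."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# re.match only tries position 0, so leading text prevents a match.
print(re.match(pattern, "asdf|test|1234"))   # None
# re.search scans forward until the pattern fits.
print(first_piped("asdf|test|1234"))         # test
# re.fullmatch requires the pattern to cover the entire string.
print(re.fullmatch(pattern, "|test|1234"))   # None
```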
To reproduce code that runs on https://regex101.com/, click Code Generator on the left-hand side. This shows you what the website is using. From there you can experiment with the flags, or with the function you need from re.
Note:
https://regex101.com/ uses re.MULTILINE as default flag
https://regex101.com/ uses re.finditer as default method
import re
regex = r"where"
test_str = "select * from table where t=3;"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
Source: the Python re module documentation.
