Python findall Regex function catches only some text


I'm still new at Regex, and I've been trying to implement a Gmail validation algorithm in my Python program.
This is my Regex
mail_address = "hello.89#gmail.com"
result = re.findall(r'\w+[\w.]+(#gmail.com){1}', mail_address)
print (str(result))
The first character must be alphanumeric (\w+); after that it matches any run of word characters or dots ([\w.]+), followed by exactly one instance of #gmail.com.
This is what it prints:
['#gmail.com']
But it should print
['hello.89#gmail.com']
What am I doing wrong?
EDIT: Here's the Regex I chose:
\A(\w+[\w.]+#gmail\.com)\Z

Just alter the parentheses so that it includes all of your desired output:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
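I have slightly altered your expression so that the #gmail.com part is now a plain literal rather than a group. Additionally, you don't need to convert the result to a string, and the {1} quantifier is redundant because a group is matched exactly once by default. You may also want to escape the dot (gmail\.com), since an unescaped . matches any character.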
That being said, in the end, you'd end up having:
import re
mail_address = "hello.89#gmail.com"
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
print (result)
# ['hello.89#gmail.com']
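The reason moving the parentheses helps is that re.findall returns the captured group, not the whole match, whenever the pattern contains exactly one capturing group. A minimal sketch comparing three variants on the address from the question (the variable names are illustrative, and the dot is escaped here so that it only matches a literal dot):

```python
import re

mail_address = "hello.89#gmail.com"

# One capturing group around the suffix: findall returns only that group.
suffix_only = re.findall(r'\w+[\w.]+(#gmail\.com)', mail_address)

# Group around the whole pattern: findall returns the full address.
whole = re.findall(r'(\w+[\w.]+#gmail\.com)', mail_address)

# Non-capturing group (?:...): findall falls back to the whole match.
non_capturing = re.findall(r'\w+[\w.]+(?:#gmail\.com)', mail_address)

print(suffix_only)    # ['#gmail.com']
print(whole)          # ['hello.89#gmail.com']
print(non_capturing)  # ['hello.89#gmail.com']
```

If the parentheses are only needed for grouping, the non-capturing form avoids changing what findall returns.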

The problem is in the parentheses, as Jan mentioned. Your regex can also be simplified to this:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
Demo: https://regex101.com/r/Z5EGbZ/1
The {1} quantifier after #gmail.com is redundant.

This should work, using your regex unchanged:
import re
regex = r"\w+[\w.]+(#gmail.com){1}"
test_str = "hello.89#gmail.com"
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches, start=1):
    # match.group() returns the whole match, not just the captured group
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
check online compiler

Related

Select only regex match from a continuous string

I want to use this regex
r"Summe\d+\W\d+"
to match this string
150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung
but I want to only filter out this specific part
Summe50,90
I can select the entire string with this regex, but I'm not sure how to extract only the matching part.
Here is the function it is in, where I am trying to get the amount from a PDF:
def get_amount(url):
    data = requests.get(url)
    with open('/Users/derricdonehoo/code/derric-d/price-processor/exmpl.pdf', 'wb') as f:
        f.write(data.content)
    pdfFileObj = open('exmpl.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    text = pageObj.extractText().split()
    regex = re.compile(r"Summe\d+\W\d+")
    matches = list(filter(regex.search, text))
    for i in range(len(matches)):
        matchString = '\n'.join(matches)
        print(matchString)
As described above, I would like guidance on how best to extract just the matching portion of this string, preferably with varying lengths of characters on either side, but that's not a priority.
thanks!!

My guess is that this simple expression might likely work here,
(Summe.+?)•
Test
import re
regex = r"(Summe.+?)•"
test_str = "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
Demo

What you are missing is a convenient way to "grab" your match.
import re
text = "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"
match = re.search(r"Summe\d+\W\d+", text)
if match:
    res = match.group()
    print(res)
    # Summe50,90
Note that group accepts an index to return a specific group from your regex, but since this pattern has no capture groups (the parts wrapped in (...)), you call it without arguments to get the whole match.
If you want to find all occurrences of the pattern, use re.findall:
import re
text = "150,90‡50,90‡8,13‡Summe50,90•50,90•Summe8,13•Kreditkartenzahlung"
matches = re.findall(r"Summe\d+\W\d+", text)
print(matches)
# ['Summe50,90', 'Summe8,13']
In this case a list with all matches (already strings, not Match objects) will be returned. Again, if you use capture groups, a list of tuples will be returned where each tuple contains all the groups for a match.
Read about the methods - re.search and re.findall
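To illustrate the point about capture groups, here is a small sketch that runs the same text through re.findall with and without groups. The grouped pattern is an assumed variation on the one above, splitting the amount into the parts before and after the separator:

```python
import re

text = "150,90‡50,90‡8,13‡Summe50,90•50,90•Summe8,13•Kreditkartenzahlung"

# No groups: each item is the full matched string.
plain = re.findall(r"Summe\d+\W\d+", text)

# Two capturing groups: each item is a tuple with one entry per group.
parts = re.findall(r"Summe(\d+)\W(\d+)", text)

print(plain)  # ['Summe50,90', 'Summe8,13']
print(parts)  # [('50', '90'), ('8', '13')]
```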

This is what you want, your regex is correct but you must get the match after searching for it.
regex = re.compile(r"Summe\d+\W\d+")
text = ["150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"]
matches = []
for t in text:
    m = regex.search(t)
    if m:
        matches.append(m.group(0))
print(matches)
re.search returns a Match object on success, None on failure, and that object contains all the information about your matching regex. To get the whole match you call Match.group().

\W matches any single non-word character, which is looser than you need here. Being explicit about the decimal comma is safer:
regex = r'Summe\d+,\d{2}'
should match the first 50,90 after Summe.
If the separating comma is too specific (because it might come as a dot) you can use a character set:
regex = r'Summe\d+[,.]\d{2}'
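As a quick check of the character-set idea, the sketch below wraps the amount in a capture group so that only the number is returned. The second and third sample strings are made up for illustration; only the first comes from the question:

```python
import re

# One or more digits, a comma or a dot, then exactly two digits.
amount_re = re.compile(r"Summe(\d+[,.]\d{2})")

samples = [
    "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung",
    "Summe1234.50•rest",   # hypothetical input with a dot separator
    "no total here",       # hypothetical input without a match
]

amounts = []
for s in samples:
    m = amount_re.search(s)
    # search returns None when nothing matches, so guard before calling group
    amounts.append(m.group(1) if m else None)

print(amounts)  # ['50,90', '1234.50', None]
```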

How to get all substrings between some delimiters in python [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 3 years ago.
I am trying to get all the substrings that lie between certain delimiters. My issue is that I also need the character at the end of the last occurrence. The strings need to be between any of these characters: . / ? = - _
I have tried this regular expression
pattern = re.compile(r"""[./?=\-_][^./?=\-_]+[./?=\-_]""")
In this example:
-facebook=chat.messenger?
I am not able to get the substring =chat.
I am only getting -facebook= and .messenger?

Looks like the overlap is what's causing some of the drama. If you use the third-party regex module (installable with pip install regex), which supports overlapping matches, you can do
import regex
delimiters = r'[./?=\-_]'
pattern = delimiters + r'[a-z]+' + delimiters
s = '-facebook=chat.messenger?'
print(regex.findall(pattern, s, overlapped=True))
# ['-facebook=', '=chat.', '.messenger?']
Notice that this assumes all characters are lowercase with [a-z], and that [./?=\-_] is the list of delimiters you specified.
Hope this helps!
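If installing a third-party package is not an option, overlapping matches can also be approximated with the standard re module by wrapping the pattern in a lookahead, which is the approach the linked duplicate describes. A minimal sketch, reusing the asker's delimiter set:

```python
import re

s = '-facebook=chat.messenger?'

# The lookahead (?=...) consumes no characters, so the closing delimiter of
# one match is still available as the opening delimiter of the next.
# The capturing group inside it records the text that was looked at.
pattern = re.compile(r'(?=([./?=\-_][^./?=\-_]+[./?=\-_]))')

overlapping = pattern.findall(s)
print(overlapping)
# ['-facebook=', '=chat.', '.messenger?']
```

Unlike the [a-z]+ version above, this keeps the original negated class, so uppercase letters and digits between delimiters are also accepted.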

My guess is that this expression might be what we might want to start with:
((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)
Demo
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)"
test_str = "-facebook=chat.messenger?"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Regex doesn't stop evaluating after matching with first rule with OR operator

I am having issues with regex matching in Python. I have a string as follows:
test_str = ("ICD : 12123575.007787. 098.3,\n"
"193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")
My regular expression has two main groups joined with |, and it is as follows:
regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)"
Let's call them (A | B), where A = ((?<=ICD\s:\s).*\n.*) and B = ((?<=ICD\s).*). According to the documentation, | works in such a way that if A matches, B is not tried.
My problem is that when I apply the regular expression above to test_str, it matches B but not A. If I search with A alone (i.e. ((?<=ICD\s:\s).*\n.*)), then test_str does match A. So my question is: why does the A|B expression not match group A and stop there? Here is my Python code:
import re
regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)"
test_str = ("ICD : 12123575.007787. 098.3,\n"
"193235.1, 132534.0, 17707.1,1777029, V40‚0, 5612356,9899\n")
matches = re.search(regex, test_str)
if matches:
    print ("Match was found at {start}-{end}: {match}".format(
        start = matches.start(),
        end = matches.end(),
        match = matches.group()))
    for groupNum in range(0, len(matches.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(
            groupNum = groupNum,
            start = matches.start(groupNum),
            end = matches.end(groupNum),
            group = matches.group(groupNum)))
output:
Match was found at 4-29: : 12123575.007787. 098.3,
Group 1 found at -1--1: None
Group 2 found at 4-29: : 12123575.007787. 098.3,
Python Fiddle
Sorry if this is hard to follow. I don't understand why group 1 is reported as "Group 1 found at -1--1: None", i.e. not matched. Please let me know what the reason could be.

This happens because the regex engine searches for a match from left to right, and the right half of the regex can match at an earlier position. This is because the left expression has a longer lookbehind: (?<=ICD\s:\s) requires two more characters than (?<=ICD\s).
test_str = "ICD : 12123575.007787. 098.3,\n"
#                 ^ left half of the regex matches here
#               ^ right half of the regex matches here
To put it another way, your regexes are essentially like (?<=.{3}) and (?<=.). If you tried re.search(r'(?<=.{3})|(?<=.)', some_text), it's clear that the right side of the regex would match first, because its lookbehind is shorter.
You can fix this by preventing the right half of the regex from matching too early by adding a negative lookahead:
regex = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s)(?!:\s).*)"
#                                          ^^^^^^^
test_str = "ICD : 12123575.007787. 098.3,\n"
#                 ^ left half of the regex matches here
#                   right half of the regex doesn't match at all
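The effect of the lookahead can be checked by running both patterns against the same input and comparing where each match starts and which group is filled. In this sketch the second line of the test string is shortened for readability; the patterns are the ones from the question and from the fix above:

```python
import re

# First line as in the question; second line shortened for this sketch.
test_str = ("ICD : 12123575.007787. 098.3,\n"
            "193235.1, 132534.0\n")

original = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s).*)"
fixed = r"((?<=ICD\s:\s).*\n.*)|((?<=ICD\s)(?!:\s).*)"

m_orig = re.search(original, test_str)
m_fixed = re.search(fixed, test_str)

# Original: branch B succeeds at index 4, before A could succeed at index 6.
print(m_orig.start(), m_orig.group(1), repr(m_orig.group(2)))

# Fixed: B is rejected at index 4, so scanning continues and A matches at 6.
print(m_fixed.start(), repr(m_fixed.group(1)), m_fixed.group(2))
```

The alternation still prefers A over B, but only among branches tried at the same starting position; the earliest position at which any branch matches wins first.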

My regex works on regex101 but doesn't work in python? [duplicate]

This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 1 year ago.
So I need to match strings that are surrounded by |. So, the pattern should simply be r"\|([^\|]*)\|", right? And yet:
>>> pattern = r"\|([^\|]*)\|"
>>> re.match(pattern, "|test|")
<_sre.SRE_Match object at 0x10341dd50>
>>> re.match(pattern, " |test|")
>>> re.match(pattern, "asdf|test|")
>>> re.match(pattern, "asdf|test|1234")
>>> re.match(pattern, "|test|1234")
<_sre.SRE_Match object at 0x10341df30>
It's only matching on strings that begin with |? It works just fine on regex101 and this is python 2.7 if it matters. I'm probably just doing something dumb here so any help would be appreciated. Thanks!

re.match will want to match the string starting at the beginning. In your case, you just need the matching element, correct? In that case you can use something like re.search or re.findall, which will find that match anywhere in the string:
>>> re.search(pattern, " |test|").group(0)
'|test|'
>>> re.findall(pattern, " |test|")
['test']
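The difference can be seen side by side with the strings from the question. This is a small sketch written for Python 3 (the question used 2.7, where the behaviour of match and search is the same); the results list is just a convenience for printing:

```python
import re

pattern = r"\|([^|]*)\|"

results = []
for s in ["|test|", " |test|", "asdf|test|1234"]:
    at_start = re.match(pattern, s)    # only tries to match at index 0
    anywhere = re.search(pattern, s)   # scans forward for the first match
    results.append((s, at_start is not None, anywhere.group(1)))

for row in results:
    print(row)
# ('|test|', True, 'test')
# (' |test|', False, 'test')
# ('asdf|test|1234', False, 'test')
```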

To reproduce what https://regex101.com/ runs, click on Code Generator on the left-hand side. This shows the code their website is using. From there you can play around with the flags, or with whichever function you need from re.
Note:
https://regex101.com/ uses re.MULTILINE as default flag
https://regex101.com/ uses re.finditer as default method
import re
regex = r"where"
test_str = "select * from table where t=3;"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
Document

Remove leading zeros in middle of string with regex

I have a large number of strings in the format YYYYYYYYXXXXXXXXZZZZZZZZ, where X, Y, and Z are numbers of fixed length, eight digits each. Now, the problem is that I need to parse out the middle sequence of integers and remove any leading zeroes. Unfortunately, the only way to determine where each of the three sequences begins and ends is to count the number of digits.
I am currently doing it in two steps, i.e:
m = re.match(
    r"(?P<first_sequence>\d{8})"
    r"(?P<second_sequence>\d{8})"
    r"(?P<third_sequence>\d{8})",
    string)
second_sequence = m.group(2)
second_sequence = second_sequence.lstrip("0")
Which does work, and gives me the right results, e.g.:
112233441234567855667788 --> 12345678
112233440012345655667788 --> 123456
112233001234567855667788 --> 12345678
112233000012345655667788 --> 123456
But is there a better method? Is it possible to write a single regular expression which matches the second sequence, without the leading zeros?
I guess I am looking for a regex which does the following:
Skips over the first eight digits.
Skips any leading zeros.
Captures anything after that, up to the point where there are sixteen characters behind and eight in front.
The above solution does work, as mentioned, so the purpose of this problem is more to improve my knowledge of regex. I appreciate any pointers.

This is a typical case of "useless use of regular expressions".
Your strings are fixed-length. Just cut them at the appropriate positions.
s = "112233440012345655667788"
int(s[8:16])
# -> 123456

I think it's simpler not to use regex.
result = my_str[8:16].lstrip('0')

Agree with the other answers here that regex isn't really required. If you really want to use regex, then \d{8}0*(\d*)\d{8} should do it.
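To compare the two approaches, the sketch below runs this regex and the plain slice from the earlier answers over the four sample strings from the question. Using fullmatch is an assumption made here so that strings of the wrong length are rejected rather than partially matched:

```python
import re

# Input strings and expected middle values, taken from the question.
samples = {
    "112233441234567855667788": "12345678",
    "112233440012345655667788": "123456",
    "112233001234567855667788": "12345678",
    "112233000012345655667788": "123456",
}

middle_re = re.compile(r"\d{8}0*(\d*)\d{8}")

results = []
for s, expected in samples.items():
    via_regex = middle_re.fullmatch(s).group(1)   # regex approach
    via_slice = s[8:16].lstrip("0")               # fixed-position slicing
    results.append((via_regex, via_slice, expected))
    print(s, "-->", via_regex, via_slice)
```

Both approaches return an empty string when the middle sequence is all zeros, so callers that need a number should handle that case explicitly.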

Just to show that it is possible with regex:
https://regex101.com/r/8RSxaH/2
# CODE AUTO GENERATED BY REGEX101.COM (SEE LINK ABOVE)
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\d{8})((?:0*)(\d{,8}))(?=\d{8})"
test_str = ("112233441234567855667788\n"
"112233440012345655667788\n"
"112233001234567855667788\n"
"112233000012345655667788")
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Although you don't really need it to do what you're asking
