Inconsistency between regex and python search - python

I'm writing a small regex that captures all the text before the numbers.
https://regex101.com/r/JhIiG9/2
import re
regex = "^(.*?)(\d*([-.]\d*)*)$"
message = "Myteeeeext 0.366- 0.3700"
result = re.search(regex, message)
print(result.group(1))
https://www.online-python.com/a7smOJHBwp
When I run this regex in Python, the first group is not just Myteeeeext: I'm getting Myteeeeext 0.366- instead, but in regex101 it shows only Myteeeeext.

Try this regex: [^\d.-]+
It matches the text that precedes the numbers.
import re
regex = r"[^\d.-]+"  # raw string, so \d reaches the regex engine unchanged
message = "Myteeeeext 0.366- 0.3700 notMyteeeeext"
result = re.search(regex, message)
print(f"'{result.group()}'")
Outputs:
'Myteeeeext '
Let me know if that works for you.

Your regex:
regex = "^(.*?)(\d*([-.]\d*)*)$"
doesn't allow for the numbers part to have any spaces, but your search string:
message = "Myteeeeext 0.366- 0.3700"
does have a space after the dash, so this part of your regex:
(.*?)
matches up to the second number.
It doesn't look like your test string in the regex101.com example you gave has a space, so that's why your results are different.
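To make the difference concrete, here is a small sketch that runs the original pattern against the message with and without the space. The relaxed pattern at the end is only one possible variant (my assumption, not something from the question) that lets the numeric part contain whitespace:

```python
import re

original = r"^(.*?)(\d*([-.]\d*)*)$"
# Hypothetical variant: the numeric tail may contain digits, dots, dashes and spaces.
relaxed = r"^(.*?)\s*([\d.\s-]*)$"

with_space = "Myteeeeext 0.366- 0.3700"
without_space = "Myteeeeext 0.366-0.3700"

# The space forces the lazy group to keep growing until only "0.3700" is left.
g_space = re.search(original, with_space).group(1)
# Without the space the numeric part matches in one piece.
g_nospace = re.search(original, without_space).group(1)
# The relaxed pattern tolerates the space inside the numeric part.
g_relaxed = re.search(relaxed, with_space).group(1)

print(repr(g_space))    # 'Myteeeeext 0.366- '
print(repr(g_nospace))  # 'Myteeeeext '
print(repr(g_relaxed))  # 'Myteeeeext'
```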

Related

Having a problem with Python Regex: Prints "None" when printing "matches". Regex works in tester

I'm supposed to extract groups of text from a file with a top ten list: name, rank, etc. for each. You can see the file and the regex here https://regex101.com/r/fXK5YV/1. It works in there and you can see the capturing groups.
import re
pattern = '''
(?P<list><li\sclass="regular-search-result">(.|\n)*?(?<=\<span class=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span class=\"review-count rating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.match(yelp_html)
This prints None.
There is definitely text inside of yelp_html.
What am I doing wrong?
I see a few issues:
You're not using a raw string (prefix the string with an r), which means Python interprets the backslashes itself instead of passing them to the regex. In particular, \n becomes an actual newline character, which re.VERBOSE then ignores as whitespace.
Because you compile with re.VERBOSE, unescaped whitespace in the pattern is ignored. The line breaks in your multiline string are harmless, but the literal spaces inside parts such as <span class="indexed-biz-name"> are dropped as well, so write them as \s.
match() only matches at the very start of the string; use search() or finditer() to find matches anywhere in yelp_html.
import re
pattern = r'''
(?P<list><li\sclass=\"regular-search-result\">(.|\n)*?(?<=\<span\sclass=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span\sclass=\"review-count\srating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})
(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.finditer(yelp_html)
for item in matches:
    print(item.group('rank', 'name', 'stars', 'numrevs', 'price'))
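Here is a much smaller sketch of the same pitfalls. The html string and the name group are made up for illustration; only re.compile, re.VERBOSE, match and search are real API:

```python
import re

# Made-up sample: the interesting tag is not at the start of the string.
html = 'header <span class="name">Joe</span>'

# In VERBOSE mode the unescaped space is dropped, so this looks for '<spanclass=...'.
broken = re.compile(r'''<span class="name">(?P<name>.+?)</span>''', re.VERBOSE)

fixed = re.compile(r'''
    <span\sclass="name">   # \s survives VERBOSE mode
    (?P<name>.+?)</span>
''', re.VERBOSE)

print(broken.search(html))               # None: the literal space was ignored
print(fixed.match(html))                 # None: match() only tries position 0
print(fixed.search(html).group('name'))  # Joe
```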

Removing square brackets from output and stopping regex after first match

Python noob here. I'm trying to fix two problems with my current code.
I'm trying to remove the square brackets from my list output
I can't figure out how to stop regex after the first match
For the first problem I've tried a number of different solutions but without success.
str()
"".join()
.replace("[]", "")
page_content = "carrots apples001 carrots apples002 apples003"
define_words = 'apples'
parsed_content = str((re.findall(r"([^.]*?%s[^.]*\.)" % define_words, page_content)))
I'm currently getting the following output
[apples001][][][][][apples002 apples003]
When I should be getting
apples001
Any help would be much appreciated and sorry about the messy code!
You can try the following:
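import re
Test_String = "carrots apples001 carrots apples002 apples003"
# The trailing .* consumes the rest of the string, so findall returns a single match
Regex_Pattern = r'(apples\S\S\S).*'
match = re.findall(Regex_Pattern, Test_String)
print(''.join(match))  # apples001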
Instead of using re.findall, you could use re.search to search for the first location where the pattern produces a match.
To match the word apples and the following digits, you could use:
\bapples\d+\b
\b Word boundary to prevent being part of a larger word
apples\d+ Match apples followed by 1+ digits
\b Word boundary
Your code could look like:
import re
page_content = "carrots apples001 carrots apples002 apples003"
define_words = 'apples'
parsed_content = (re.search(r"\b%s\d+\b" % define_words, page_content).group())
print(parsed_content) # apples001
If define_words can start with a non-word character, you might use (?<!\S)%s\d+ instead. The lookbehind asserts that the match is not preceded by a non-whitespace character, so it must start at the beginning of the string or right after whitespace.
parsed_content = (re.search(r"(?<!\S)%s\d+" % define_words, page_content).group())
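One more thing to consider when interpolating define_words into a pattern: if it can contain regex metacharacters, passing it through re.escape keeps them literal. A minimal sketch, where the c++ sample text and the first_hit helper are hypothetical:

```python
import re

def first_hit(word, text):
    """Return the first 'word followed by digits' token, or None if there is none."""
    # re.escape makes any metacharacters in word match literally.
    pattern = r"(?<!\S)%s\d+" % re.escape(word)
    match = re.search(pattern, text)
    return match.group() if match else None

page_content = "carrots apples001 carrots apples002 apples003"
print(first_hit("apples", page_content))         # apples001
print(first_hit("c++", "see c++11 and c++14"))   # c++11
print(first_hit("pears", page_content))          # None
```

Checking the result of re.search before calling .group() also avoids an AttributeError when nothing matches.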

Regex match " character

I am trying to get data using regex but just don't know how to match the character " in the regex. I have tried the following:
text = "value=1.211.1.1"
regex = r'''\w+\=(\d+\.\d+\.\d+\.\d+)'''
match_result = 1.211.1.1
However when my text is:
text = value=""value=1.211.1.1""
I am not able to get the match. I tried the following but it doesn't work. How can I determine whether the " character is in a given string?
regex = r'''\w+\=\"(\d+\.\d+\.\d+\.\d+)\"'''
Your question is a little confusing but is this perhaps what you're after?
import re
s = '"value="1.211.1.1"'
# Raw string: quotes need no escaping inside a character class
m = re.match(r'''['"]*\w+=['"]?(\d+\.\d+\.\d+\.\d+)['"]*''', s)
print(m.group(1))
Output
1.211.1.1
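As for checking whether the " character occurs in a string at all, the in operator is enough. A small sketch, assuming the quotes around the value are optional (the sample strings are illustrative):

```python
import re

quoted = 'value="1.211.1.1"'
plain = 'value=1.211.1.1'

# Membership test: no regex needed to detect a double quote.
print('"' in quoted)  # True
print('"' in plain)   # False

# "? makes the quote optional on either side of the dotted number.
pattern = re.compile(r'\w+="?(\d+(?:\.\d+){3})"?')
quoted_value = pattern.search(quoted).group(1)
plain_value = pattern.search(plain).group(1)
print(quoted_value, plain_value)  # 1.211.1.1 1.211.1.1
```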

How can I "divide" words with regular expressions?

I have a sentence in which every token has a / in it. I want to just print what I have before the slash.
What I have now is basic:
text = less/RBR.....
return re.findall(r'\b(\S+)\b', text)
This obviously just prints the text, how do I cut off the words before the /?
I'm assuming you want all the characters before the slash from every word that contains a slash. For the input string "match/this but nothing here but another/one", for example, you would want the results match and another.
With regex:
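import re
my_string = "match/this but nothing here but another/one"
result = re.findall(r"\b(\w*?)/\w*?\b", my_string)
print(result)
Without regex:
# Skip words that contain no slash at all
result = [word.split("/")[0] for word in my_string.split() if "/" in word]
print(result)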
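If every token in the sentence has the word/TAG shape, as in the question, str.partition is another option. This is only a sketch; the tagged sentence and the words_before_slash name are made up:

```python
def words_before_slash(sentence):
    """Return the part before the first '/' of every whitespace-separated token."""
    words = []
    for token in sentence.split():
        head, slash, _tail = token.partition("/")
        if slash:  # keep only tokens that actually contain a slash
            words.append(head)
    return words

tagged = "less/RBR than/IN ten/CD"  # hypothetical sample input
print(words_before_slash(tagged))   # ['less', 'than', 'ten']
```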
Simple and straight-forward:
rx = r'^[^/]+'
# anchor it to the beginning
# the class says: match everything not a forward slash as many times as possible
In Python this would be:
import re
text = "less/RBR....."
print(re.match(r'[^/]+', text))
This prints a match object; to get the matched text, call .group(0) on it:
print(re.match(r'[^/]+', text).group(0))
# less
This should also work
\b([^\s/]+)(?=/)\b
Python Code
import re
p = re.compile(r'\b([^\s/]+)(?=/)\b')
test_str = "less/RBR/...."
print(re.findall(p, test_str))

regular expression to extract part of email address

I am trying to use a regular expression to extract the part of an email address between the "#" sign and the "." character. This is how I am currently doing it, but can't get the right results.
company = re.findall('^From:.+#(.*).',line)
Gives me:
['#iupui.edu']
I want to get rid of the .edu
To match a literal . in your regex, you need to use \., so your code should look like this:
company = re.findall(r'^From:.+#(.*)\.', line)
#                                   ^^ the literal dot has to be escaped here
Note that this will always match up to the last occurrence of . in your string, because (.*) is greedy. If you want to stop at the first occurrence, you need to exclude any . from your capturing group:
company = re.findall(r'^From:.+#([^.]*)\.', line)
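To see the difference between the greedy group and the negated character class, here is a short sketch on a made-up line with two dots after the # (the address itself is hypothetical):

```python
import re

line = "From: x#mail.iupui.edu"  # hypothetical sample line

# Greedy (.*) backtracks only as far as the last dot.
greedy = re.findall(r'^From:.+#(.*)\.', line)
# [^.]* cannot cross a dot, so it stops at the first one.
first_dot = re.findall(r'^From:.+#([^.]*)\.', line)

print(greedy)     # ['mail.iupui']
print(first_dot)  # ['mail']
```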
You can try this:
(?<=\#)(.*?)(?=\.)
A simple example would be:
>>> import re
>>> re.findall(r".*(?<=#)(.*?)(?=\.)", "From: atc#moo.com")
['moo']
>>> re.findall(r".*(?<=#)(.*?)(?=\.)", "From: atc#moo-hihihi.com")
['moo-hihihi']
This matches the hostname regardless of how the line begins: the leading .* is greedy, so the capture starts right after the last # in the line.
You could just split and find:
s = " abc.def#ghi.mn I"
s = s.split("#", 1)[-1]
print(s[:s.find(".")])
Or use split alone if the string is not always going to contain a match:
s = s.split("#", 1)[-1].split(".", 1)[0]
If it always does, then find will be the fastest:
i = s.find("#")
s = s[i+1:s.find(".", i)]
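If you want the string-method approach to report a missing separator or dot explicitly, a small helper built on str.partition is one way to do it. This is a sketch; company_from is a made-up name and the sample lines are illustrative:

```python
def company_from(line, sep="#"):
    """Return the text between the first sep and the next '.', or None if either is missing."""
    _before, found, rest = line.partition(sep)
    if not found:
        return None  # no separator in the line
    name, dot, _after = rest.partition(".")
    return name if dot else None  # None when no dot follows the separator

print(company_from("From: atc#moo.com"))  # moo
print(company_from(" abc.def#ghi.mn I"))  # ghi
print(company_from("no separator here"))  # None
```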
