Python forcing hex escape into regex statement [duplicate] - python

I am not sure why this is not working. Perhaps I am missing something with Python regex.
Here is my regex and an example string that I want it to match:
PHONE_REGEX = "<(.*)>phone</\1>"
EXAMPLE = "<bar>phone</bar>"
I tested this match in isolation and it failed. I used an online regex tester and it matched. Am I simply missing something that is particular to Python regex?
Thanks!

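You have to mark the string as a raw string by putting an r in front of the regex. In an ordinary string literal, Python turns \1 into the character with code 1 (an octal escape) before the re module ever sees it: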
m = re.match(r"<(.*)>phone</\1>", "<bar>phone</bar>")
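To see the difference concretely, here is a minimal sketch comparing the two literals. The names plain, raw and text are only illustrative and do not come from the question.

```python
import re

plain = "<(.*)>phone</\1>"   # "\1" is an octal escape: the single character chr(1)
raw = r"<(.*)>phone</\1>"    # backslash and "1" reach the regex engine intact

text = "<bar>phone</bar>"

print("\x01" in plain)                # True: the backreference is already gone
print(len(plain), len(raw))           # 15 16
print(re.match(plain, text))          # None
print(re.match(raw, text).group(1))   # bar
```

Online testers usually take the pattern as typed, with no string-literal layer in between, which would explain why the same pattern matched there.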

Related

Regex in python for validating mail

I am learning regex for validating an email id, for which the regex I made is the following:
regex = r'^([a-zA-Z0-9_\\-\\.]+)#([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,})$'
Here someone#...com is showing as valid, but it should not be. How can I fix this?
I would recommend the regular expression suggested on this site, which correctly reports that the email someone#...com is invalid. I quickly wrote up an example using their suggestion below. Happy coding!
>>> import re
>>> email = "someone#...com"
>>> regex = re.compile(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
>>> print(re.match(regex, email))
None
The reason it matches someone#...com is that the dot is inside the character class in #([a-zA-Z0-9_\\-\\.]+), which is repeated one or more times, so it can also match "...".
What you can do is place the dot after the character class, and use that whole part in a repeating group.
If you put the - at the end you don't have to escape it.
Note that the character class at the start also contains a dot.
^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$
Regex demo
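As a quick check, the pattern above can be run against a few addresses. This is only a sketch: the sample addresses other than someone#...com are made up for illustration, and "#" is kept as the separator because that is what the thread uses.

```python
import re

# Pattern from the answer above: the dot sits outside the character class,
# so every dot in the domain must be preceded by at least one other character.
pattern = re.compile(r"^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$")

for candidate in ["someone#...com", "someone#example.com", "someone#mail.example.org"]:
    print(candidate, bool(pattern.match(candidate)))
# someone#...com False
# someone#example.com True
# someone#mail.example.org True
```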

Python: Extract values after decimal using regex

I am given a string that is a number, for example "44.87" or "44.8796". I want to extract everything after the decimal point (.). I tried to use regex in Python code but was not successful. I am new to Python 3.
import re
s = "44.123"
re.findall(".","44.86")
Something like s.split('.')[1] should work
If you would like to use regex try:
import re
s = "44.123"
regex_pattern = r"(?<=\.).*"
matched_string = re.findall(regex_pattern, s)
(?<=...) is a positive lookbehind: the match must be immediately preceded by the specified character, which is not included in the result
\. is an escaped period
.* means "match everything after the period"
This online regex tool is a helpful way to test your regex as you build it. You can confirm this solution there! :)
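For comparison, here is a small sketch that puts the approaches from these answers side by side. The str.partition variant is an extra option not mentioned above; it returns an empty string instead of raising when the input has no period.

```python
import re

s = "44.8796"

after_split = s.split(".")[1]                       # raises IndexError if there is no "."
after_regex = re.search(r"(?<=\.)\d+", s).group()   # digits right after a period
after_partition = s.partition(".")[2]               # "" if there is no "."

print(after_split, after_regex, after_partition)    # 8796 8796 8796
print(repr("44".partition(".")[2]))                 # ''
```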

The way to unescape escaped regex pattern Python

I'm trying to unescape the escaped regex pattern to apply it to a string.
The pattern is dynamic, so I don't know exactly what it will look like, but during testing I ran into one problem. The string with the escaped regex pattern looks like this:
\\d{4}
I've written a simple regex that replaces every combination of a backslash and a character with just the character, and I'm applying it this way:
sub(r"\\(.)", "\\1", escaped_pattern)
But what it gives me afterwards is d{4} not \d{4} as I expect.
I've tried using raw strings for repl and escaping/unescaping it, but it still doesn't return what I expect. I would appreciate any help.
EDIT
escaped_pattern = settings.reg_exp
regexp = sub(r"\\(.)", "\\1", escaped_pattern)
search(regexp, string_to_regexp).group()[0]
Based on your update, I'm pretty sure that you would get exactly your desired output if you just stopped trying to unescape it.
import re
s1 = "1234astring"
matches = re.search("\\d{4}", s1)
matches.group(0)
"1234"
matches.group()[0]
"1"
Try r"\\\\(.)" as the search pattern and '\\\1' as the substitution pattern.
It works OK here: https://regex101.com/r/M3ikqj/1
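The following sketch illustrates why unescaping may be unnecessary. It assumes, without confirmation from the question, that settings.reg_exp already holds a single backslash and that the doubled backslash only appears when the string is shown with repr. The escaped_pattern literal below stands in for that setting.

```python
import re

escaped_pattern = "\\d{4}"       # source literal: the string holds ONE backslash

print(repr(escaped_pattern))     # '\\d{4}'  (repr doubles the backslash)
print(escaped_pattern)           # \d{4}
print(len(escaped_pattern))      # 5

# Stripping "backslash + char" removes the only backslash there is.
print(re.sub(r"\\(.)", "\\1", escaped_pattern))             # d{4}

# Used directly, the string is already a valid pattern.
print(re.search(escaped_pattern, "1234astring").group())    # 1234
```

If the stored value really does contain two backslashes, printing it and checking its length as above would show that.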

REGEX: Parsing n digits with non numeric word boundaries

I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script to parse some XML files, but have run into a bit of a speed bump. Here is an example of my XML:
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
<...> is unimportant and non relevant xml code. Focus primarily on the CustomerID and OrderId.
My issue lies in parsing a string, similar to the above statement. I have a regexParse definition that works perfectly. However it is not intuitive. I need to match only the part of the string that contains 44444444.
My Current setup is:
searchPattern = '>\d{8}</CustomerId'
Great! It works, but I want to do it the right way. My thinking is: 1) find 8 digits; 2) if a non-numeric word boundary after them matches CustomerId, return the digits.
Idea:
searchPattern = '\bd{16}\b'
My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you can either help me out with my issue, or point me in the right path (in words of a guide or something along the lines). Any help is appreciated.
Mods if this is in the wrong area apologies, I wanted to post this in the Python discussion because I am not sure if Python regex supports this functionality.
Thanks again all,
darcmasta
txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""
import re
pattern = r"<(\w+)>(\d+)<"
print(re.findall(pattern, txt))
# output: [('OrderId', '123456'), ('CustomerId', '44444444')]
You might consider using a lookbehind assertion in your regex to make it easy for a human to read:
import re
a = re.compile("(?<=OrderId>)\\d{6}")
a.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['123456']
b = re.compile("(?<=CustomerId>)\\d{8}")
b.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['44444444']
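Since the question asks about checking for CustomerId both before and after the digits, one possible extension is to pair the lookbehind with a lookahead. This is a sketch only; the names xml and customer_id are illustrative.

```python
import re

xml = "<OrderId>123456</OrderId><CustomerId>44444444</CustomerId>"

# Lookbehind checks the opening tag, lookahead checks the closing tag;
# neither is part of the returned match.
customer_id = re.compile(r"(?<=<CustomerId>)\d{8}(?=</CustomerId>)")

print(customer_id.findall(xml))   # ['44444444']
```

Python's re module requires a fixed-width lookbehind, which a literal tag name satisfies.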
You should be using raw string literals:
searchPattern = r'\b\d{16}\b'
The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').
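A short sketch of the backspace issue, using 8 digits to match the CustomerId example from the question. The variable names are illustrative.

```python
import re

text = "<CustomerId>44444444</CustomerId>"

plain_b = "\b"    # one character: backspace (0x08)
raw_b = r"\b"     # two characters: backslash + "b", a regex word boundary

print(len(plain_b), len(raw_b))   # 1 2
print(plain_b == "\x08")          # True

# With real backspace characters around the digits, nothing in the text matches.
print(re.findall("\b" + r"\d{8}" + "\b", text))   # []

# With a raw string, \b is a word boundary and the digits are found.
print(re.findall(r"\b\d{8}\b", text))             # ['44444444']
```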

