I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script at the moment to parse some XML files, but have run into a bit of a speed bump. Here is an example of my XML:
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
<...> is unimportant, irrelevant XML. Focus primarily on the CustomerId and OrderId.
My issue lies in parsing a string similar to the one above. I have a regexParse definition that works perfectly. However, it is not intuitive. I need to match only the part of the string that contains 44444444.
My current setup is:
searchPattern = '>\d{8}</CustomerId'
Great! It works, but I want to do it the right way. My thinking is: 1) find 8 digits; 2) if the non-numeric text after the word boundary that follows the digits matches CustomerId, return the digits.
Idea:
searchPattern = '\bd{16}\b'
My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you could either help me out with my issue or point me in the right direction (a guide or something along those lines). Any help is appreciated.
Mods if this is in the wrong area apologies, I wanted to post this in the Python discussion because I am not sure if Python regex supports this functionality.
Thanks again all,
darcmasta
txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""
import re
pattern = r"<(\w+)>(\d+)<"
print(re.findall(pattern, txt))
#output [('OrderId', '123456'), ('CustomerId', '44444444')]
You might consider using a lookbehind assertion in your regex to make it easier for a human to read:
import re
a = re.compile("(?<=OrderId>)\\d{6}")
a.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['123456']
b = re.compile("(?<=CustomerId>)\\d{8}")
b.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['44444444']
You should be using raw string literals:
searchPattern = r'\b\d{16}\b'
The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').
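A minimal sketch of the difference, assuming the 8-digit CustomerId from the sample XML rather than the 16 digits in the pattern above (the variable names are illustrative only):

```python
import re

text = "<CustomerId>44444444</CustomerId>"

# In a plain literal "\b" is one character (backspace, 0x08);
# in a raw literal it stays as backslash + 'b', which re reads as a word boundary.
print(len("\b"), len(r"\b"))  # 1 2

# Built piecewise so the backspace characters are explicit.
broken = "\b" + r"\d{8}" + "\b"   # backspace, 8 digits, backspace
fixed = r"\b\d{8}\b"              # word boundary, 8 digits, word boundary

broken_result = re.findall(broken, text)
fixed_result = re.findall(fixed, text)

print(broken_result)  # []
print(fixed_result)   # ['44444444']
```

The broken pattern finds nothing because the text contains no backspace characters.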
I am learning regex for validating an email id, for which the regex I made is the following:
regex = r'^([a-zA-Z0-9_\\-\\.]+)#([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,})$'
Here someone#...com is showing as valid but it should not be. How can I fix this?
I would recommend the regular expression suggested on this site, which correctly reports the email someone#...com as invalid. I quickly wrote up an example using their suggestion below. Happy coding!
>>> import re
>>> email = "someone#...com"
>>> regex = re.compile(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
>>> print(re.match(regex, email))
None
The reason it matches someone#...com is that the dot is inside the character class in #([a-zA-Z0-9_\\-\\.]+), which is repeated one or more times, so it can also match ...
What you can do is place the dot after the character class, and use that whole part in a repeating group.
If you put the - at the end you don't have to escape it.
Note that that character class at the start also has a dot.
^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$
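As a rough check of the pattern above, here is a small sketch that keeps the # separator exactly as it appears in this thread. The sample addresses other than someone#...com are made up for illustration:

```python
import re

# Each domain label must contain at least one character before its dot,
# so a run of bare dots can no longer match.
pattern = re.compile(r"^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$")

samples = ["someone#...com", "someone#example.com", "someone#mail.example.com"]
results = {s: bool(pattern.match(s)) for s in samples}

for address, ok in results.items():
    print(address, "->", "valid" if ok else "invalid")
```

Only the first sample is rejected, because the repeated group needs a non-empty label in front of every dot.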
I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
cleaned = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", cleaned).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
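To make the contrast concrete, here is a small sketch using only the standard re module. The helper name extract_account is hypothetical and not from the original post:

```python
import re

# The built-in re module rejects a variable-width lookbehind at compile time.
try:
    re.compile(r"(?<=ORIG\s?:\s?/\s?)[A-Z0-9]+")
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("lookbehind rejected:", exc)


def extract_account(line):
    """Return the token after 'ORIG:/' (optional single spaces allowed), or None."""
    match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", line)
    return match.group(1) if match else None


print(extract_account("ORIG : / AB123"))   # AB123
print(extract_account("ORIG:/AB123"))      # AB123
print(extract_account("no account here"))  # None
```

The optional whitespace simply moves out of the lookbehind and into the consumed part of the match, and the capture group picks out the value.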
In the case you described, you need to use capture groups:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is a sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print(re.search(p, test_str).group(1))
Unless you need overlapping matches, using capturing groups instead of a lookbehind is rather straightforward.
print(re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", test_str))
You can directly use findall which will return all the groups in the regex if present.
When using re.findall as in my example below, is there a way to include the final four characters (.JPG)? As they may be lowercase or uppercase, I can't just stitch it together with another string and be certain it will be correct. (In reality it's a list of dozens/hundreds of JPGs, some uppercase and some lowercase.)
I actually found the answer to this about 2 weeks ago but have since lost it (despite a lot of Googling).
I've done a lot of searching/reading and apologize if this exact problem has been asked before.
import re
examplestring = '/home/folder/image.JPG 200x400 20/12/2018'
print(re.findall(r'^(.*?).jpg', examplestring, flags=re.IGNORECASE))
Actual output:
['/home/folder/image']
I'm wanting the output to be:
['/home/folder/image.JPG']
Firstly, make sure to escape the dot since it's a special character in regex.
Either include .jpg in the group
^(.*?\.jpg)
or don't use a group at all
^.*?\.jpg
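Both options can be tried against the string from the question. This is only a quick sketch, and the variable names with_group and without_group are illustrative:

```python
import re

examplestring = '/home/folder/image.JPG 200x400 20/12/2018'

# Option 1: the escaped extension sits inside the capturing group.
with_group = re.findall(r'^(.*?\.jpg)', examplestring, flags=re.IGNORECASE)

# Option 2: no group at all, so findall returns the whole match.
without_group = re.findall(r'^.*?\.jpg', examplestring, flags=re.IGNORECASE)

print(with_group)     # ['/home/folder/image.JPG']
print(without_group)  # ['/home/folder/image.JPG']
```

Because the match is taken from the input text, the original case of the extension is preserved.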
Method 1
Maybe,
(?i)\S+\.jpg
or
(?i)\S+\.jpe?g
in case there are also .jpeg files, might simply work OK.
We can include additional boundaries, such as a start anchor, if necessary.
Also, the expression does not work if there are any spaces in the directory names or filenames.
Method 2
If there are horizontal spaces in the image path, then
(?i)^[^\r\n]+\.jpg
or
(?i)^[^\r\n]+\.jpe?g
would be some options to explore.
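A short sketch comparing the two methods on paths that contain spaces. The sample paths are invented for illustration, and the m (MULTILINE) flag is an addition here so that ^ can match at the start of every line of a multi-line string:

```python
import re

# Hypothetical input: the first path contains spaces, the second does not.
lines = (
    "/home/my folder/holiday image.JPG 200x400 20/12/2018\n"
    "/home/folder/image.jpeg 100x100 01/01/2019\n"
)

# Method 1: \S+ cannot cross a space, so the first path is cut short.
method1 = re.findall(r'(?i)\S+\.jpe?g', lines)

# Method 2: everything from the line start up to the last .jpg/.jpeg on that line.
method2 = re.findall(r'(?im)^[^\r\n]+\.jpe?g', lines)

print(method1)  # ['image.JPG', '/home/folder/image.jpeg']
print(method2)  # ['/home/my folder/holiday image.JPG', '/home/folder/image.jpeg']
```

Note that Method 2 is greedy, so it would overshoot if the text after the filename happened to contain another .jpg on the same line.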
Test
import re
string = '''
/home/folder/image.JPG 200x400 20/12/2018
/home/folder/image.jpg 200x400 20/12/2018
/home/folder/image.jpeg 200x400 20/12/2018
'''
expression = r'(?i)\S+\.jpe?g'
print(re.findall(expression, string))
Output
['/home/folder/image.JPG', '/home/folder/image.jpg', '/home/folder/image.jpeg']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
I want to find matches between a tweet and a list of strings containing words, phrases, and emoticons. Here is my code:
words = [':)','and i','sleeping','... :)','! <3','facebook']
regex = re.compile(r'\b%s\b|(:\(|:\))+' % '\\b|\\b'.join(words), flags=re.IGNORECASE)
I keep receiving this error:
error: unbalanced parenthesis
Apparently there is something wrong with the code and it cannot match emoticons. Any idea how to fix it?
I tried the below and it stopped throwing the error:
words = [':\)','and i','sleeping','... :\)','! <3','facebook']
The re module has a function escape that takes care of correct escaping of words, so you could just use
words = map(re.escape, [':)','and i','sleeping','... :)','! <3','facebook'])
Note that word boundaries might not work as you expect when used with words that don't start or end with actual word characters.
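Here is a minimal sketch of both points, using the word list from the question. The tweet text is made up, and the alternation is built without \b so that the emoticons can match:

```python
import re

words = [':)', 'and i', 'sleeping', '... :)', '! <3', 'facebook']

# Unescaped, ':)' contributes a stray ')' and compilation fails.
try:
    re.compile('|'.join(words))
    compiled_unescaped = True
except re.error:
    compiled_unescaped = False

# re.escape turns every entry into a pattern that matches it literally.
escaped = [re.escape(w) for w in words]
literal = re.compile('|'.join(escaped), flags=re.IGNORECASE)

tweet = "Sleeping in and I love it :)"  # hypothetical sample tweet
found = literal.findall(tweet)
print(found)  # ['Sleeping', 'and I', ':)']

# The word-boundary caveat: \b needs a word character on one side,
# so an emoticon surrounded by spaces or string ends never matches.
with_boundaries = re.search(r'\b' + re.escape(':)') + r'\b', 'hi :)')
print(with_boundaries)  # None
```

Without boundaries, an entry such as 'and i' will also match inside longer words, so some other way of delimiting the plain words may still be needed.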
While words has all the necessary formatting, re treats ( and ) as special characters. You need to write \( or \) so that they are interpreted as the literal ASCII characters 40 and 41 rather than as grouping syntax. Since you didn't understand what @Nicarus was saying, you need to use this:
words = [r':\)','and i','sleeping',r'... :\)','! <3','facebook']
Note: I'm only spelling it out because this doesn't seem like a school assignment, for all the people who might want to criticize this. Also, look at the documentation before going to Stack Overflow. It explains everything.
I've been asked to write a regular expression that can catch multi-domain email addresses and to implement it in Python. I came up with the following regular expression (and code; the emphasis is on the regex, though), which I think is correct:
import re
regex = r'\b[\w|\.|-]+#([\w]+\.)+\w{2,4}\b'
input_string = "hey my mail is abc#def.ghi"
match=re.findall(regex,input_string)
print(match)
Now when I run this (using a very simple address), it doesn't catch it!
Instead it shows an empty list as the output. Can somebody tell me where I went wrong in the regular expression literal?
Here's a simple one to start you off with
regex = r'\b[\w.-]+?#\w+?\.\w+?\b'
re.findall(regex,input_string) # ['abc#def.ghi']
The problem with your original one is that you don't need the | operator inside a character class ([..]). Just write [\w|\.|-] as [\w.-] (If the - is at the end, you don't need to escape it).
Next there are way too many variations on legitimate domain names. Just look for at least one period surrounded by word characters after the # symbol:
#\w+?\.\w+?\b
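If multi-level domains still need to be matched, one possible variant (a sketch, not the answerer's pattern) keeps the repeated domain part but makes the group non-capturing. This matters because re.findall returns the captured group's text instead of the whole match whenever the pattern contains a capturing group. The second address in the sample string is invented for illustration:

```python
import re

input_string = "hey my mail is abc#def.ghi and also x.y#mail.example.org"

# With a capturing group, findall returns only the last repetition of the group.
capturing = re.findall(r'\b[\w.-]+#(\w+\.)+\w{2,4}\b', "hey my mail is abc#def.ghi")
print(capturing)  # ['def.']

# With (?:...) the group no longer captures, so findall returns the full addresses.
multi_domain = re.compile(r'\b[\w.-]+#(?:\w+\.)+\w{2,4}\b')
found = multi_domain.findall(input_string)
print(found)  # ['abc#def.ghi', 'x.y#mail.example.org']
```

The # separator is kept exactly as it appears in the thread.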