I'm trying to create a regex to catch all hexadecimal colors in a string literal. I'm using Python 3, and here's what I have:
import re
pattern = re.compile(r"#[a-fA-F\d]{3}([a-fA-F\d]{3})?")
However, when I call findall on "#abcdef", here's what I get:
>>> re.findall(pattern, "#abcdef")
['def']
Can someone explain why I get that? I actually need to get ['#abcdef'].
Thank you in advance
According to http://regex101.com, this regex looks for:
# followed by three characters (each a through f, A through F, or a digit), then an optional capturing group of three more such characters. If that group is present, its contents are what is returned from the match.
If you only need to match the full six-digit form as a whole, I would recommend this instead:
#[a-fA-F\d]{6}
Thanks to Andrej Kesely, I got the answer to my question, that is:
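When the pattern contains a capturing group, findall returns the group's contents instead of the whole match.
To avoid this, make the group non-capturing by changing the regex from: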
r"#[a-fA-F\d]{3}([a-fA-F\d]{3})?"
to:
r"#[a-fA-F\d]{3}(?:[a-fA-F\d]{3})?"
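A minimal sketch of the difference, using the two patterns from this thread. The second sample string is made up for illustration:

```python
import re

text = "#abcdef"

# With a capturing group, findall returns only what the group captured.
capturing = re.compile(r"#[a-fA-F\d]{3}([a-fA-F\d]{3})?")
print(capturing.findall(text))      # ['def']

# With a non-capturing group (?:...), findall returns the whole match.
non_capturing = re.compile(r"#[a-fA-F\d]{3}(?:[a-fA-F\d]{3})?")
print(non_capturing.findall(text))  # ['#abcdef']

# The non-capturing version still accepts three-digit colors.
# (Illustrative input, not from the original question.)
mixed = non_capturing.findall("color: #fff; background: #1a2B3c")
print(mixed)                        # ['#fff', '#1a2B3c']
```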
I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
cleaned = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", cleaned).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
In the case you described, use a capture group instead of the lookbehind:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is a sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print(re.search(p, test_str).group(1))
Unless you need overlapping matches, using a capturing group instead of a lookbehind is straightforward.
print(re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", test_str))
You can use findall directly; when the pattern contains groups, it returns the groups' contents.
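As a rough sketch of how the capture-group approach copes with inconsistent OCR spacing, the following runs the same pattern over several variants at once. The sample text and the names `sample` and `acct_pattern` are made up for illustration:

```python
import re

# Hypothetical OCR output with inconsistent spacing around ':' and '/'.
sample = "ORIG:/AB123 ... ORIG : / CD456 ... ORIG :/EF789"

# \s? tolerates an optional whitespace character at each position;
# the capture group keeps only the value that follows the prefix.
acct_pattern = re.compile(r"ORIG\s?:\s?/\s?([A-Z0-9]+)")

accounts = acct_pattern.findall(sample)
print(accounts)  # ['AB123', 'CD456', 'EF789']

# search returns None when nothing matches, so guard before calling group.
m = acct_pattern.search("no account here")
value = m.group(1) if m else None
print(value)     # None
```

No preprocessing pass over the text is needed here, because the optional whitespace is absorbed by the pattern itself.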
I'm trying to unescape an escaped regex pattern so that I can apply it to a string.
The pattern is dynamic, so I don't know exactly what it will look like, but during testing I ran into one problem. The string with the escaped regex pattern looks like this:
\\d{4}
I've written a simple regex that replaces every backslash-plus-character pair with just the character.
I'm applying it this way:
sub(r"\\(.)", "\\1", escaped_pattern)
But the result is d{4}, not \d{4} as I expect.
I've tried using raw strings for repl and escaping/unescaping it, but it still doesn't return what I expect. I would appreciate any help.
EDIT
escaped_pattern = settings.reg_exp
regexp = sub(r"\\(.)", "\\1", escaped_pattern)
search(regexp, string_to_regexp).group()[0]
Based on your update, I'm pretty sure you would get exactly your desired output if you just stopped trying to unescape it.
import re
s1 = "1234astring"
matches = re.search("\\d{4}", s1)
matches.group(0)
"1234"
matches.group()[0]
"1"
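To see why the unescaping step removes too much, here is a small sketch. It assumes the setting holds the same value as the Python literal "\\d{4}"; if the stored value really contains two backslashes, the last case applies instead:

```python
import re

# In Python source, "\\d{4}" is a five-character string: \ d { 4 }
escaped_pattern = "\\d{4}"
print(len(escaped_pattern))  # 5

# It is already a usable pattern, so no unescaping is needed.
found = re.search(escaped_pattern, "1234astring").group(0)
print(found)                 # 1234

# The substitution strips the only backslash the pattern has.
stripped = re.sub(r"\\(.)", "\\1", escaped_pattern)
print(stripped)              # d{4}

# Only a string that really contains two backslashes needs unescaping.
doubled = r"\\d{4}"          # six characters: \ \ d { 4 }
unescaped = re.sub(r"\\(.)", r"\1", doubled)
print(unescaped)             # \d{4}
```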
Try r"\\\\(.)" in search pattern and '\\\1' in substitution pattern.
works OK here: https://regex101.com/r/M3ikqj/1
How can I get the value from the following strings using one regular expression?
/*##debug_string:value/##*/
or
/*##debug_string:1234/##*/
or
/*##debug_string:http://stackoverflow.com//##*/
The result should be
value
1234
http://stackoverflow.com/
Reading between the lines of your pattern, try:
re.findall(r"/\*##debug_string:(.*?)/##\*/", your_string)
Note that your variations cannot work because you didn't escape the *. In regular expressions, * means a repetition of the previous character/group. If you really mean the * character, you must use \*.
import re
print(re.findall(r"/\*##debug_string:(.*?)/##\*/", "/*##debug_string:value/##*/"))
print(re.findall(r"/\*##debug_string:(.*?)/##\*/", "/*##debug_string:1234/##*/"))
print(re.findall(r"/\*##debug_string:(.*?)/##\*/", "/*##debug_string:http://stackoverflow.com//##*/"))
Executes as:
['value']
['1234']
['http://stackoverflow.com/']
EDIT: Ok I see that you can have a URL. I've amended the pattern to take it into account.
Use this regex:
[^:]+:([^/]+)
And use capture group #1 for your value.
Live Demo: http://www.rubular.com/r/FxFnpfPHFn
Your regex will be something like: .*:(.*)/.+. Group 1 will be what you are looking for. However this is a REALLY inclusive regex, you might want to post some more details so that you can create some more restrictions.
Assuming that the format stays consistent:
re.findall('debug_string:([^\/]+)\/##', string)
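The patterns suggested in this thread behave differently on the URL case. A quick sketch comparing the lazy pattern with the slash-excluding one on the three inputs from the question (the names `lazy` and `no_slash` are just labels for this example):

```python
import re

samples = [
    "/*##debug_string:value/##*/",
    "/*##debug_string:1234/##*/",
    "/*##debug_string:http://stackoverflow.com//##*/",
]

# Lazy match: take as little as possible up to the closing "/##*/".
lazy = re.compile(r"/\*##debug_string:(.*?)/##\*/")

# Slash-excluding match: the value may not contain "/" at all.
no_slash = re.compile(r"debug_string:([^/]+)/##")

lazy_results = [lazy.findall(s) for s in samples]
no_slash_results = [no_slash.findall(s) for s in samples]

print(lazy_results)      # [['value'], ['1234'], ['http://stackoverflow.com/']]
print(no_slash_results)  # [['value'], ['1234'], []]
```

The slash-excluding pattern is fine as long as the value never contains a slash; for values such as URLs, the lazy pattern is the safer choice of the two.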
I am trying to learn Python and regex at the same time, and I am having trouble working out how to match up to the end of a string and replace the matched part.
So, I have a string like so:
ss="this_is_my_awesome_string/mysuperid=687y98jhAlsji"
What I want is to first find 687y98jhAlsji (I do not know this content beforehand) and then replace it with myreplacedstuff, like so:
ss="this_is_my_awesome_string/mysuperid=myreplacedstuff"
Ideally, I'd like a regex that finds the contents after mysuperid= (up to the end of the string) and then replaces them with .replace or .sub, if that makes sense.
I would appreciate any guidance on this.
You can try this:
re.sub(r'[^=]+$', 'myreplacedstuff', ss)
The idea is to use a character class that excludes the delimiter (here =) and to anchor the pattern with $
explanation:
[^=] is a character class and means all characters that are not =
[^=]+ one or more characters from this class
$ end of the string
Since the regex engine works from the left to the right, only characters that are not an = at the end of the string are matched.
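A short sketch of the anchored substitution on the string from the question. The second variant is an assumption about what may be wanted when the value itself could contain an =; it anchors on the key name instead:

```python
import re

ss = "this_is_my_awesome_string/mysuperid=687y98jhAlsji"

# Replace the run of non-'=' characters at the end of the string.
by_class = re.sub(r"[^=]+$", "myreplacedstuff", ss)
print(by_class)

# Alternative sketch: keep the "mysuperid=" prefix via group 1 and
# replace everything after it. \g<1> avoids ambiguity with the text
# that follows the group reference.
by_key = re.sub(r"(mysuperid=).*$", r"\g<1>myreplacedstuff", ss)
print(by_key)
```

Both calls produce this_is_my_awesome_string/mysuperid=myreplacedstuff for this input.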
You can use regular expressions:
>>> import re
>>> mymatch = re.search(r'mysuperid=(.*)', ss)
>>> ss.replace(mymatch.group(1), 'replacing_stuff')
'this_is_my_awesome_string/mysuperid=replacing_stuff'
You should probably use @Casimir's answer though. It looks cleaner, and I'm not that good at regex :p.
I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In Perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In Python, I simply wrote the above regex as '\w+#(tickets\.)?company\.com', expecting the same result. However, support#company.com isn't found at all, and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use a non-capturing group, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
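Another option, sketched here, is to keep the original pattern and read the whole match from finditer, whose match objects expose group(0) regardless of how many groups the pattern has. The addresses are written exactly as they appear in the question, and the variable names are illustrative:

```python
import re

addresses = "support#company.com, 1234567#tickets.company.com"

# Same pattern as in the question; re.IGNORECASE mirrors Perl's /i flag.
pattern = re.compile(r"\w+#(tickets\.)?company\.com", re.IGNORECASE)

# findall returns the group's contents; an unmatched optional group
# shows up as an empty string, which is easy to mistake for "not found".
groups_only = pattern.findall(addresses)
print(groups_only)    # ['', 'tickets.']

# finditer yields match objects; group(0) is always the whole match.
whole_matches = [m.group(0) for m in pattern.finditer(addresses)]
print(whole_matches)  # ['support#company.com', '1234567#tickets.company.com']
```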
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are asking for. In both languages, your regex captures only "tickets." when it is present. You probably want something like this
#!/usr/bin/python
import re
# Outer group captures the whole address; (?:...) keeps "tickets." out of the result.
regex = re.compile(r"(\w+#(?:tickets\.)?company\.com)")
a = [
    "foo#company.com",
    "foo#tickets.company.com",
    "foo#ticketsacompany.com",
    "foo#compant.org",
]
for string in a:
    print(regex.findall(string))