How can I find all Markdown links using regular expressions? - python

In Markdown there is two ways to place a link, one is to just type the raw link in, like: http://example.com, the other is to use the ()[] syntax: (Stack Overflow)[http://example.com ].
I'm trying to write a regular expression that can match both of these, and, if it's the second match to also capture the display string.
So far I have this:
(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])
Debuggex Demo
But this doesn't seem to match either of my two test cases in Debuggex:
http://example.com
(Example)[http://example.com]
Really not sure why the first one isn't matched at the very least, is it something to do with my use of the named group? Which, if possible I'd like to keep using because this is a simplified expression to match the link and in the real example it is too long for me to feel comfortable duplicating it in two different places in the same pattern.
What am I doing wrong? Or is this not doable at all?
EDIT: I'm doing this in Python so will be using their regex engine.

The reason your pattern doesn't work is here: (?<=\((.*)\)\[) since the re module of Python doesn't allow variable length lookbehind.
You can obtain what you want in a more handy way using the new regex module of Python (since the re module has few features in comparison).
Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])
An online demo
pattern details:
(?| # open a branch reset group
# first case there is only the url
(?<txt> # in this case, the text and the url
(?<url> # are the same
(?:ht|f)tps?://\S+(?<=\P{P})
)
)
| # OR
# the (text)[url] format
\( ([^)]+) \) # this group will be named "txt" too
\[ (\g<url>) \] # this one "url"
)
This pattern uses the branch reset feature (?|...|...|...) that allows to preserve capturing groups names (or numbers) in an alternation. In the pattern, since the ?<txt> group is opened at first in the first member of the alternation, the first group in the second member will have the same name automatically. The same for the ?<url> group.
\g<url> is a reference to the named subpattern ?<url> (like an alias, in this way, no need to rewrite it in the second member.)
(?<=\P{P}) checks if the last character of the url is not a punctuation character (useful to avoid the closing square bracket for example). (I'm not sure of the syntax, it may be \P{Punct})

Related

Does this regex fail, or do I need to modify the regex to support "optional followed by"?

I am trying the following regex: https://regex101.com/r/5dlRZV/1/, I am aware, that I am trying with \author and not \maketitle
In python, I try the following:
import re
text = str(r'
\author{
\small
}
\maketitle
')
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
for m in p.finditer(text):
print(m.group())
Python freezes, I am suspecting that this has something to do with my pattern, and the SRE fails.
EDIT: Is there something wrong with my regex? Can it be improved to actually work? Still I get the same results on my machine.
EDIT 2: Can this be fixed somehow so the pattern supports optional followed by ?: or ?= look-heads? So that one can capture both?
After reading the heading, "Parentheses Create Numbered Capturing Groups", on this site: https://www.regular-expressions.info/brackets.html, I managed to find the answer which is:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.

Pandas extract text notation

I'm new to Pandas, using it for a class, and I can't for the life of me find a resource that shows the notation used in pandas when representing text in the extract function. For example:
movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
I know this is telling the extract function to extract everything inside the parentheses from examples done in class, but I don't understand which symbols mean what inside the extract function. Is there a resource that can explain what these symbols mean? Thank you.
In General
The string argument of the .str.extract is a Regular Expression (regex), which is a language used for pattern matching and feature extraction in strings. If you go to the section called "Regular Expression Patterns" in the previous link you can find the meaning of the special control characters.
This Example
What specifically that regex string means is:
match any character, ., zero or more times, *, until a parenthesis, \(, then extract all the content in the parentheses, (.*), then close parenthesis, \), then any character zero or more times, .*, again.
Essentially this will match any string like: 'xxx(message)xxxx' or '(message)' or 'xx(message)' or '(message)x' and extract the 'message'.
Notes on Pandas and Regex
An important part of regular expressions (in general, but particularly for use in pandas with .str.extract) is capturing groups. You can 'capture' or grab part of a string by enclosing the pattern for that part inside of parenthesis. Note that these are the unescaped (no preceding slash - the inner set) parentheses in the regex and not the actual parentheses that appear in the string itself, e.g. in 'xxx(message)xxx'.
Check out the docs on .str.extract for a few examples of using regex with capturing groups in pandas:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html

Regex optional order of capturing group

I have simple, but tricky question about regex (using in python), which i have did not find answer for anywhere here on google. Is there any "trick" how to make two capture groups in optional order? Let's say we have following:
.*abc.*
What i want is to match also this:
.*acb.*
I know i could use
.*abc|acb.*
but the problem is, that if we have something more complicated then abc, code is very long. Is not there any workaround to say e.g. "match last two capturing groups (or symbols, etc.) in any order?
I don't really get what is this in-any-order thing that would make the regex shorter. On the other hand, I can show you how to make this readable, even if you have tons of options.
import re
pattern = """
.* # match from starting the line
(?: # A non-capturing group starts so we can list lots of alternatives
abc| # alternative 1
acb # alternative 2
) # end of alternatives
.* # then match everything up to the end of the line
"""
re.search(pattern, 'qqabcqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqacbqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqaSDqq', re.VERBOSE) # does not return a match
So what did we just see here?
The """ ... """ construct is a convenient way to define multiline strings in python.
Then the re.VERBOSE skips the whitespaces and comments. As the manual says:
Whitespace within the pattern is ignored, except when in a character
class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
This two things let you add structure and comments to your regex. Here is another great example.
With standard regular expressions you can define patterns without order. Example:
[cdgjow]
Of course this example refers to characters.
Alternative sequences must be specified using "|". Example:
abc|cba
There is no way to express what you would like to express in classic regular expression syntax. Regular expression syntax has no syntactic elements to express what you would like to express. It's lacking this feature. You have to rely on "manually" specifying your alternatives. It's not a limit of the automaton constructed from regular expressions but of the regular expression syntax itself.
That means: You will have to construct the regular expression you require by yourself with all variants possible. There are two ways how to do this:
Do it manually. Take your time, be careful, built the correct regex manually.
Do it programmatically. Write some code that generates the regex you require.
If you do it manually consider #TamasRev answer. (Thanks #TamasRev! Nice answer!) But if I were you I'd build the regex programmatically. (For things like that programming has been invented for anyway :-) )

How can I avoid selecting email Ids with a particular domain name with regex

I have a list of e-mail ids among which I have to select only those which do not have ruba.com as domain name with regex. For examples, if I have ads#gmail.com, dgh#rubd.com and ert#ruba.com, then my regular expression should select first two Ids. What should be the regular expression for this problem?
I have tried with two expressions:
[a-zA-Z0-9_.+-]+#[^(ruba)]+.[a-zA-Z0-9-.]+
and
[a-zA-Z0-9_.+-]+#[^r][^u][^b][^a]+.[a-zA-Z0-9-.]+
None of the above two was able to fulfill my requirement.
I assume that by email ID you mean the part before the # symbol, otherwise that would be a full email address.
.+(?=#)(?!#ruba\.com)
. the dot character is a special symbol for regex engines
and it is used to capture everything
* also known as Kleene plus says you want to capture one or more instances of the preceding symbol, in our case .; basically you are saying "give me every char"
(?=#) is a positive lookahead, i.e. a special search feature that makes sure that what follows is #; I'm using it to take the cursor to the position of # and "stop" capturing, otherwise + would go on indefinitely
(?!#ruba\.com) is a negative lookahead, i.e. a special search feature that makes sure that what follows is not (!) #ruba\.com; I'm escaping the dot not to confuse it with the capture-all symbol I was talking before
Live demo here.
You could use a negative lookahead to ensure that you do not match the domain ruba.com.
The negative lookahead: (?!rubd) will match against anything that you want to exclude. Also, because emails typically have more than word characters (such as hyphens and periods), you would be better off using [\w\.\-] rather than just \w.
^[\w\.\-]+#(?!rubd)[\w\.\-]+\.(?:com|net|org|edu)$
DEMO

Python Regular Expressions to match option of strings

I am new to Python and Python RE.
I am trying to make a parser for ARM assembly code. I want to make regular expression for matching conditional branch instructions which could be:
beq, bne, blt, bgt
I tried a regular expression of the form
'b[eq|ne|lt|gt]'
But this does not match. Can someone please help me with this?
You should be using parentheses for options, not square brackets:
b(eq|ne|lt|gt)
And you'd usually want a non-capture group:
b(?:eq|ne|lt|gt)
And you can also make it a little more optimised too:
b(?:eq|ne|[lg]t)
Square brackets will be understood as being any of the characters or range of characters. So [eq|ne|lt|gt] effectively means either one of e, q, |, n, e (again, so it becomes redundant), etc.
Try the following pattern: b(?:eq|ne|lt|gt)
[] Character set: Will only match any one character inside the brackets. You can specify a range of characters by using the metacharacter -, eg: [a-e] or even negate the expression by using the metacharacter ^, eg: [^aeiou]
() Capturing parentesis: Used for grouping part & for creating number capturing group, you can disable this feature by using the following char-set ?: within the capturing parentesis, eg(?:)
As mentioned above, you should be using the capturing parentesis for more than one character to be matched, so, that is why your pattern using brackets did not match your string.
Please note that using the non capturing parentesis was meant to no save any data being matched, however you can remove the metacharacters ?: in order to capture the group.
As python performs perl compatible regular expression engine, you are able to use named captured groups & numbered backreferences, the main advantage of using it, is to keep your expression easy to maintain, read, edit, etc.
Eg:
(?P<opcode>b(?:eq|ne|lt|gt)) - Will capture the match of your pattern b(?:eq|ne|lt|gt) into the backreference name opcode

Categories

Resources