Regex to match MediaWiki template without certain named parameter - python

I’ll get to the point: I need a regex that matches any template out of a list that does not have a date parameter. Assuming that my (singleton for now) list of templates is “stub”, only the first two items below should be matched:
{{stub}}
{{stub|param}}
{{stub|date=a}}
{{stub|param|date=a}}
{{stub|date=a|param}}
{{stub|param|date=a|param}}
Note: “param” means any number of parameters there.
Additionally, it would be nice if it could also match if the date parameter is blank, but this is not required.
The current regex I have so far is
{{((?:stub|inaccurate)(?!(?:\|.*?\|)*?\|date=.*?(?:\|.*?)*?)(?:\|.*?)*?)}}
However it matches the fourth and sixth items in the list above.
Note: (?:stub|inaccurate) is just to make sure the template is either a stub or inaccurate template.
Note 2: the flavor of regex here is Python 2.7 module RE.

Since you are using Python, you have the luxury of an actual parser:
import mwparserfromhell
wikicode = mwparserfromhell.parse('{{stub|param|date=a|param}}')
for template in wikicode.filter_templates():
    if template.get('date')...
That will remain accurate even if the template contains something you would not have expected ({{stub| date=a}}, {{stub|<!--<newline>-->date=a}}, {{stub|foo={{bar}}|date=a}} etc.). The classic answer on the dangers of using regular expressions to parse complex markup applies to wikitext as well.

I think it's enough to have a negative look-ahead, which tries to match date at any position?
{{((?:stub|inaccurate)(?!.*\|date=).*)}}
If templates with a blank date parameter should also be matched (that is, the equals sign is immediately followed by | or }}), then use
{{((?:stub|inaccurate)(?!.*\|date=[^|}]).*)}}
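As a rough check of the two lookahead patterns above, the sketch below runs them over the six sample templates from the question. The braces are escaped here to be safe, and the names SAMPLES, NO_DATE, NO_FILLED_DATE and matching are illustrative, not part of the original answer. It assumes one template per string, since the greedy .* would otherwise span several templates on the same line.

```python
import re

# The six sample templates from the question.
SAMPLES = [
    '{{stub}}',
    '{{stub|param}}',
    '{{stub|date=a}}',
    '{{stub|param|date=a}}',
    '{{stub|date=a|param}}',
    '{{stub|param|date=a|param}}',
]

# Reject any template that has a date parameter at any position.
NO_DATE = re.compile(r'\{\{((?:stub|inaccurate)(?!.*\|date=).*)\}\}')

# Reject only templates whose date parameter has a non-empty value.
NO_FILLED_DATE = re.compile(r'\{\{((?:stub|inaccurate)(?!.*\|date=[^|}]).*)\}\}')


def matching(pattern, candidates):
    """Return the candidates in which the pattern finds a match."""
    return [c for c in candidates if pattern.search(c)]


without_date = matching(NO_DATE, SAMPLES)
blank_ok = matching(
    NO_FILLED_DATE,
    ['{{stub|date=}}', '{{stub|date=|param}}', '{{stub|date=a}}'],
)
print(without_date)
print(blank_ok)
```

Only the first two samples should survive the first pattern, and the second pattern should additionally accept the two blank-date templates.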

Related

Regular expression in python to get the last occurrence of a file extension in a URL or path

Given a long URL or path, how do I get the last file extension in it? For example, consider these two strings:
url = 'https://image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.jpg?x=2'
path = './image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.abc.jpg'
The last extension is jpg and comes after the last . and before the following non-alphanumerics or end-of-string.
There are similar questions to mine but I can't find an exact match.
re.search(r'\.(\w+)(?!.*\.)', url).group(1)
Use a negative lookahead so that the match is the one that is not followed by any further dot.
Parsing rules are different for filenames and URLs, so don't write a single regex to handle both: it's not simple and not worth your time.
Instead, first test what type of object you are looking at, i.e. whether it is or is not a URL. This could be as simple as: if it starts with http://, it is a URL; if not, it is not a URL.
Then apply the specific rule to the specific type.
Always make use of standard tools, they have often already figured out the corner cases or things you will forget.
The URL parser: https://docs.python.org/3/library/urllib.parse.html
Then, for files use: os.path.splitext(path)
in the standard python library: https://docs.python.org/3/library/os.path.html
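A minimal sketch of that approach, using only the standard library. The helper name last_extension and the prefix test for http:// or https:// are assumptions made for illustration; real input may need a stricter check.

```python
import os.path
from urllib.parse import urlsplit


def last_extension(location):
    """Return the last file extension of a URL or a path, without the dot.

    Hypothetical helper: anything starting with http:// or https:// is
    treated as a URL, everything else as a filesystem path.
    """
    if location.startswith(('http://', 'https://')):
        # Keep only the path component; this drops the query and fragment.
        location = urlsplit(location).path
    # splitext splits on the last dot of the final path component.
    return os.path.splitext(location)[1].lstrip('.')


url = 'https://image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.jpg?x=2'
path = './image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.abc.jpg'
print(last_extension(url))
print(last_extension(path))
```

Both calls should give jpg for the two strings from the question.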

How can I find all Markdown links using regular expressions?

In Markdown there are two ways to place a link: one is to just type the raw link in, like http://example.com; the other is to use the ()[] syntax: (Stack Overflow)[http://example.com ].
I'm trying to write a regular expression that can match both of these, and, if it's the second match to also capture the display string.
So far I have this:
(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])
But this doesn't seem to match either of my two test cases in Debuggex:
http://example.com
(Example)[http://example.com]
Really not sure why the first one isn't matched at the very least, is it something to do with my use of the named group? Which, if possible I'd like to keep using because this is a simplified expression to match the link and in the real example it is too long for me to feel comfortable duplicating it in two different places in the same pattern.
What am I doing wrong? Or is this not doable at all?
EDIT: I'm doing this in Python so will be using their regex engine.
The reason your pattern doesn't work is here: (?<=\((.*)\)\[) since the re module of Python doesn't allow variable length lookbehind.
You can obtain what you want in a more handy way using the new regex module of Python (since the re module has few features in comparison).
Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])
pattern details:
(?| # open a branch reset group
# first case there is only the url
(?<txt> # in this case, the text and the url
(?<url> # are the same
(?:ht|f)tps?://\S+(?<=\P{P})
)
)
| # OR
# the (text)[url] format
\( ([^)]+) \) # this group will be named "txt" too
\[ (\g<url>) \] # this one "url"
)
This pattern uses the branch reset feature (?|...|...|...) that allows to preserve capturing groups names (or numbers) in an alternation. In the pattern, since the ?<txt> group is opened at first in the first member of the alternation, the first group in the second member will have the same name automatically. The same for the ?<url> group.
\g<url> is a reference to the named subpattern ?<url> (like an alias, in this way, no need to rewrite it in the second member.)
(?<=\P{P}) checks if the last character of the url is not a punctuation character (useful to avoid the closing square bracket for example). (I'm not sure of the syntax, it may be \P{Punct})
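If only the standard re module is available, a different technique avoids both the lookbehind and the branch reset: build the pattern from a shared URL string, so the URL subpattern is written once, and give the two alternatives different group names. The sketch below is only an illustration; the simplified URL pattern and the names URL, LINK and find_links are assumptions, and it keeps the (text)[url] form used in the question.

```python
import re

# Assumed, simplified URL subpattern: scheme, then anything up to
# whitespace or a closing square bracket.
URL = r'(?:ht|f)tps?://[^\s\]]+'

# First alternative: (text)[url]. Second alternative: a raw url.
LINK = re.compile(
    r'\((?P<txt>[^)]+)\)\[(?P<href>' + URL + r')\s*\]'
    r'|(?P<raw>' + URL + r')'
)


def find_links(text):
    """Return (display text, url) pairs; a raw link is its own text."""
    links = []
    for m in LINK.finditer(text):
        if m.group('raw') is not None:
            links.append((m.group('raw'), m.group('raw')))
        else:
            links.append((m.group('txt'), m.group('href')))
    return links


sample = 'Visit http://example.com or (Stack Overflow)[http://example.com ]'
print(find_links(sample))
```

Unlike the (?<=\P{P}) check above, this simplified URL pattern does not strip trailing punctuation from a raw link.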

Django URL Reg-Ex

Hi all,
How does this expression actually work?
urlpatterns = patterns('',
url(r'^get/(?P<app_id>\d+)/$', 'app.views.app'),
...
)
I understand what it does, at least to map a url entered by the user to the app() function in the app's view page. I also understand it is a regular expression that ends up taking the id of the app and mapping it to the url. But where is this function going? What is going on with the r'^...?P /$ (I get the d+ is a digit regex, of the id itself, but that's about it).
I also understand this url function draws from the django.conf.urls module.
Perhaps my misunderstanding is more buried in my lack of regex experience. Nonetheless, I need help! I do not like using things I do not understand, and I am guilty.
Let's take a look: r'^get/(?P<app_id>\d+)/$'
The r prefix makes this a raw string literal: every character between the quotes, including backslashes, is taken as a literal character.
The ^ character anchors the match at the beginning of the string. For example, forget/123/ won't match the expression because it doesn't start with get. If the ^ weren't there, it would match, because the pattern would only require that get/... appears somewhere in the string.
The $ character anchors the match at the end of the string. If it were absent, get/123/xd would also match the expression, which is not desired.
(?P<name>...) is a way to give a name/alias to a group in the expression.
You should read Python's regular expressions documentation. It's very good to know about regular expressions because they're very useful.
Hope this helps!
r just changes how the following string literal is interpreted. Backslashes (\) are not treated as the start of escape sequences, which means that the regex in the string will be used as is.
^ at the beginning and $ at the end match the beginning and the end of the string respectively.
(?P<name>...) is a named capturing group - it helps you to cut a part of the url and pass it as a parameter into the view. See more in the django named groups docs.
Hope that helps.
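To see the anchors and the named group in action outside Django, here is a small sketch with plain re. The functions app and dispatch are stand-ins invented for illustration; this simplifies the idea of passing named groups to a view as keyword arguments and is not Django's actual resolver code.

```python
import re

pattern = re.compile(r'^get/(?P<app_id>\d+)/$')
unanchored = re.compile(r'get/(?P<app_id>\d+)/')


def app(request, app_id):
    # Stand-in for app.views.app; the named group arrives as a keyword argument.
    return 'app %s' % app_id


def dispatch(path):
    """Call the view if the path matches, otherwise return None."""
    match = pattern.search(path)
    if match is None:
        return None
    # groupdict() gives {'app_id': '123'} for 'get/123/'.
    return app(None, **match.groupdict())


print(dispatch('get/123/'))      # matches; app_id is the string '123'
print(dispatch('forget/123/'))   # rejected by ^
print(dispatch('get/123/xd'))    # rejected by $
print(unanchored.search('forget/123/') is not None)  # matches without anchors
```

Note that the captured value is a string, so the view receives '123' rather than the integer 123.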

Regular Expressions Dependent on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change the previously matched groups before using them, or have I just asked yet another question not meant for RE?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
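The post-processing idea could look roughly like the sketch below: a regex reads only the length prefix, and ordinary Python code slices the payload and checks its length. The function name read_length_prefixed is made up for illustration.

```python
import re


def read_length_prefixed(text):
    """Return the payload of a 'LenOfStr:Str' string, or None if invalid."""
    m = re.match(r'(\d+):', text)
    if m is None:
        return None
    length = int(m.group(1))
    start = m.end()
    payload = text[start:start + length]
    # Reject input that is shorter than the announced length.
    if len(payload) != length:
        return None
    return payload


print(read_length_prefixed('5:5:str'))
print(read_length_prefixed('9:abc'))
```

The first call should return 5:str and the second None, since only three characters follow the prefix.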
Well this can be done with a regex (if I understand the question):
>>> import re
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'

Parsing FIX protocol in regex?

I need to parse log files that contain FIX protocol messages.
Each line contains header information (timestamp, logging level, endpoint), followed by a FIX payload.
I've used regex to parse the header information into named groups. E.g.:
(?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (?P<process_id>\d{4}/\d{1,2})\s*(?P<logging_level>\w*)\s*(?P<endpoint>\w*)\s*
I then come to the FIX payload itself (^A is the separator between each tag) e.g:
8=FIX.4.2^A9=61^A35=A...^A11=blahblah...
I need to extract specific tags from this (e.g. "A" from 35=, or "blahblah" from 11=), and ignore all the other stuff - basically I need to ignore anything before "35=A", and anything after up to "11=blahblah", then ignore anything after that etc.
I do know there are libraries that might be able to parse each and every tag (http://source.kentyde.com/fixlib/overview), however, I was hoping for a simple approach using regex here if possible, since I really only need a couple of tags.
Is there a good way in regex to extract the tags I require?
Cheers,
Victor
No need to split on "\x01" then regex then filter. If you wanted just tags 34,49 and 56 (MsgSeqNum, SenderCompId and TargetCompId) you could regex:
dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", raw_msg))
Simple regexes like this will work if you know your sender does not have embedded data that could cause a bug in any simple regex. Specifically:
No Raw Data fields (actually a combination of data length and raw data, like RawDataLength, RawData (95/96) or XmlDataLen, XmlData (212/213))
No encoded fields for unicode strings like EncodedTextLen, EncodedText (354/355)
To handle those cases takes a lot of additional parsing. I use a custom python parser but even the fixlib code you referenced above gets these cases wrong. But if your data is clear of these exceptions the regex above should return a nice dict of your desired fields.
Edit: I've left the above regex as-is but it should be revised so that the final match element be (?=\x01). The explanation can be found in #tropleee's answer here.
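The reason for the lookahead is that the original pattern consumes the trailing separator, so a wanted tag that directly follows another wanted tag has no leading \x01 left to match. The sketch below compares both variants on a made-up sample message; the message contents and the helper name extract_tags are illustrative assumptions, not real log data.

```python
import re

SOH = '\x01'  # the FIX field separator, shown as ^A in some editors


def extract_tags(raw_msg, tags):
    """Return {tag: value} for the requested tags, using a lookahead."""
    alternatives = '|'.join(re.escape(t) for t in tags)
    pattern = r'(?:^|\x01)(' + alternatives + r')=(.*?)(?=\x01)'
    return dict(re.findall(pattern, raw_msg))


# Fabricated sample message, for illustration only.
raw_msg = SOH.join(
    ['8=FIX.4.2', '9=61', '35=A', '34=7', '49=SENDER', '56=TARGET', '11=blahblah']
) + SOH

# Original pattern: the trailing separator is consumed, so tag 49 is skipped.
consuming = dict(re.findall('(?:^|\x01)(34|49|56)=(.*?)\x01', raw_msg))

# Revised pattern: the trailing separator is only looked at, not consumed.
lookahead = extract_tags(raw_msg, ['34', '49', '56'])

print(consuming)
print(lookahead)
```

With this sample, the consuming pattern finds only tags 34 and 56, while the lookahead version finds all three.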
^A is actually \x{01}; that's just how it shows up in vim. In Perl, I had done this via a split on hex 1 and then a split on "=": after the second split, value [0] of the array is the Tag and value [1] is the Value.
Use a regex tool like expresso or regexbuddy.
Why don't you split on ^A and then match ([^=]+)=(.*) for each one, putting them into a hash? You could also filter with a switch that by default won't add the tags you're uninterested in and that has a fall through for all the tags you are interested in.
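In Python, the split-based approach might be sketched as follows, with a dict playing the role of the hash and a set of wanted tags replacing the switch. The function name parse_fix and the sample message are assumptions for illustration, and the same caveat about raw data and encoded fields applies.

```python
def parse_fix(raw_msg, wanted=None):
    """Split a FIX payload on \\x01 and return {tag: value}.

    If wanted is given, only those tags are kept.
    """
    fields = {}
    for field in raw_msg.split('\x01'):
        if not field:
            continue  # skip the empty piece after the final separator
        # Split on the first '=' only, so values may contain '='.
        tag, sep, value = field.partition('=')
        if sep and (wanted is None or tag in wanted):
            fields[tag] = value
    return fields


# Fabricated sample message, for illustration only.
sample = '8=FIX.4.2\x019=61\x0135=A\x0111=blahblah\x01'
print(parse_fix(sample, {'35', '11'}))
```

Calling parse_fix without the second argument returns every tag in the message.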
