repeat only once character regex in python - python

I want to match repeat character(/) only once in the middle of the string. At the start it must be http://www.website/shop. Now I am using this code and it matches all html. How can I restrict it.
'^[http://www.website/shop/].*.html$'
With this regex, no result
^[http://]\W*(/){2}\W
Valid
http://www.website/shop/men.html
http://www.website/shop/women.html
Invalid
http://www.website/shop/men/footwear.html
http://www.website/shop/men/causal.html
http://www.website/shop/women/footwear.html

This regex should work
"http://www.website/shop/\w*.html"
... it won't match if there's another slash after the slashes in .../shop/

Related

Python / regex - Extract string between nth and nth character

I have multiple url values, for ex:
https://www.happy.com/de/article/98238811/poppers
https://www.happy.com/sr
https://www.happy.com/en/forum/ocean-liveliness
I want to extract the values between the 3rd and 4th slash (if 4th slash exists) (ex: de, sr, en)
between the 4th and 5th slash (ex: article, forum)
I'm terrible at regex, I've tried this [\/]*[^\/]+[\/]([^\/]+) but it seems to get all values including www.happy. which I don't want.
I agree with the other answers/comments that Split function is easier, but if you insist on regex you have the \K operator in Python"s regex that discards the match portion to the left.
So, ^(?:.*?\/){3}\K.*?(?=\/|$) will search for three slashes from the beginning of each line, then discard it from the match, do a non-greedy match .*? to pick up the result you want, then do a lookahead to stop your match on a slash or end of line, whichever is encountered first. The lookahead will not be included in the match.
Make sure you include the RE.M flag if you are scanning multiple examples at once so ^ and $ match begin/end of line as well as begin/end of string.
In such case you might not even need regular expression. Just simply split the string by slash. And check the returned chunks.
For example.
>>> "https://www.happy.com/de/article/98238811/poppers".split('/')[3]
'de'
>>> "https://www.happy.com/de/article/98238811/poppers".split('/')[4]
'article'

How to handle " in Regex Python

I am trying to grab fary_trigger_post in the code below using Regex. However, I don't understand why it always includes " in the end of the matched pattern, which I don't expect.
Any idea or suggestion?
re.match(
r'-instance[ "\']*(.+)[ "\']*$',
'-instance "fary_trigger_post" '.strip(),
flags=re.S).group(1)
'fary_trigger_post"'
Thank you.
The (.+) is greedy and grabs ANY character until the end of the input. If you modified your input to include characters after the final double quote (e.g. '-instance "fary_trigger_post" asdf') you would find the double quote and the remaining characters in the output (e.g. fary_trigger_post" asdf). Instead of .+ you should try [^"\']+ to capture all characters except the quotes. This should return what you expect.
re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', '-instance "fary_trigger_post" '.strip(), flags=re.S).group(1)
Also, note that I modified the end of the expression to use .* which will match any characters following the last quote.
Here's what I'd use in your matching string, but it's hard to provide a better answer without knowing all your cases:
r'-instance\s+"(.+)"\s*$'
When you try to get group 1 (i.e. (.+)) regex will follow this match to the end of string, as it can match . (any character) 1 or more times (but it will take maximum amount of times). I would suggest use the following pattern:
'-instance[ "\']*(.+)["\']+ *$'
This will require regex to match all spaces in the end and all quoutes separatelly, so that it won't be included into group 1

How to match substring or whole string

In Python regex, how would I match only the facebook.com...777 substrings given either string? I don't want the ?sfnsn=mo at the end.
I have (?<=https://m\.)([^\s]+) to match everything after the https://m.. I also have (?=\?sfnsn) to match every thing in front of ?sfnsn.
How do I combine the regex to only return the facebook.com...777 part for either string.
have: https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777?sfnsn=mo
want: facebook.com/story.php?story_fbid=123456789&id=7777777777
have: https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777
want: facebook.com/story.php?story_fbid=123456789&id=7777777777
Here's what I was messing around with https://regex101.com/r/WYz5dn/2
(?<=https://m\.)([^\s]+)(?=\?sfnsn)
You could use a capturing group instead of a positive lookbehind and match either ?sfnsn or the end of the string.
https://m\.(\S*?)(?:\?sfnsn|$)
Regex demo
Using the lookarounds, the pattern could be:
(?<=https://m\.)\S*?(?=\?sfnsn|$)
Regex demo
Putting a ? at the end works, since the last grouped lookahead may or may not exist, we put a question mark after it:
(?<=https://m\.)([^\s]+)(?=\?sfnsn)?

Match LaTeX reserved characters with regex

I have an HTML to LaTeX parser tailored to what it's supposed to do (convert snippets of HTML into snippets of LaTeX), but there is a little issue with filling in variables. The issue is that variables should be allowed to contain the LaTeX reserved characters (namely # $ % ^ & _ { } ~ \). These need to be escaped so that they won't kill our LaTeX renderer.
The program that handles the conversion and everything is written in Python, so I tried to find a nice solution. My first idea was to simply do a .replace(), but replace doesn't allow you to match only if the first is not a \. My second attempt was a regex, but I failed miserably at that.
The regex I came up with is ([^\][#\$%\^&_\{\}~\\]). I hoped that this would match any of the reserved characters, but only if it didn't have a \ in front. Unfortunately, this matches ever single character in my input text. I've also tried different variations on this regex, but I can't get it to work. The variations mainly consisted of removing/adding slashes in the second part of the regex.
Can anyone help with this regex?
EDIT Whoops, I seem to have included the slashes as well. Shows how awake I was when I posted this :) They shouldn't be escaped in my case, but it's relatively easy to remove them from the regexes in the answers. Thanks all!
The [^\] is a character class for anything not a \, that is why it is matching everything. You want a negative lookbehind assertion:
((?<!\)[#\$%\^&_\{\}~\\])
(?<!...) will match whatever follows it as long as ... is not in front of it. You can check this out at the python docs
The regex ([^\][#\$%\^&_\{\}~\\]) is matching anything that isn't found between the first [ and the last ], so it should be matching everything except for what you want it to.
Moving around the parenthesis should fix your original regex ([^\\])[#\$%\^&_\{\}~\\].
I would try using regex lookbehinds, which won't match the character preceding what you want to escape. I'm not a regex expert so perhaps there is a better pattern, but this should work (?<!\\)[#\$%\^&_\{\}~\\].
If you're looking to find special characters that aren't escaped, without eliminating special chars preceded by escaped backslashes (e.g. you do want to match the last backslash in abc\\\def), try this:
(?<!\\)(\\\\)*[#\$%\^&_\{\}~\\]
This will match any of your special characters preceded by an even number (this includes 0) of backslashes. It says the character can be preceded by any number of pairs of backslashes, with a negative lookbehind to say those backslashes can't be preceded by another backslash.
The match will include the backslashes, but if you stick another in front of all of them, it'll achieve the same effect of escaping the special char, anyway.

python regular expressions to return complete result

How can I get what was matched from a python regular expression?
re.match("^\\\w*", "/welcome")
All python returns is a valid match; but I want the entire result returned; how do I do that?
Just use re.findall function.
>>> re.findall("a+", 'sdaaddaa')
['aa', 'aa']
You could use a group.
res = re.search("^(\\\w*)", "/welcome")
if res:
res.group(1);
Calling the group() method of the returned match object without any arguments will return the matched portion of the string.
The regular expression "^\\\w*" will match a string beginning with a backslash followed by 0 or more w characters. The string you are searching begins with a forward slash so your regex won't match. That's why you aren't getting anything back.
Note that your regex, if you printed out the string contains \\w. The \\ means match a single backslash then the w means match a literal w. If you want a backslash followed by a word character then you will need to escape the first backslash and the easiest way would be to use a raw string r"^\\\w*" would match "\\welcome" but still not match "/welcome".
Notice that you're "^" says you're string has to start at the beginning of a line. RegexBuddy doesn't tell that to you by default.
Maybe you want to tell us what exactly are you trying to find?

Categories

Resources