Why is this regex code not working - python

Working on a Django app
Here's my urls.py
urlpatterns = [url(r'^(?P<leagueId>[0-9]+)/(?P<year>[0-9]+)/(?P<team>[\S]+)/$', views.team_detail, name="team_detail"),]
An example url would be along the lines of:
http://localhost:8000/123456/2017/Johnny%20Rocket/
I tried playing around with Pythex, but I couldn't get the urls to match
Note: The variables passed are /{number}/{year}/{name}
The name can consist of alphanumeric characters and whitespaces.

The character class \S matches anything except whitespace, and %20 is decoded to a space before it is matched against the regex.
To match alphanumerical characters and whitespace, you can use [\w\s].
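A quick check with the re module, outside Django, illustrates the difference. This is only a sketch: the path string is an assumed example of what the pattern sees once %20 has been decoded, and the variable names are made up for illustration.

```python
import re

# Path as it looks after %20 has been decoded to a space (assumed example).
path = "123456/2017/Johnny Rocket/"

original = r'^(?P<leagueId>[0-9]+)/(?P<year>[0-9]+)/(?P<team>[\S]+)/$'
fixed = r'^(?P<leagueId>[0-9]+)/(?P<year>[0-9]+)/(?P<team>[\w\s]+)/$'

# [\S]+ stops at the space, so the trailing "/$" can never match.
print(re.match(original, path))  # None

# [\w\s]+ accepts word characters and whitespace.
match = re.match(fixed, path)
print(match.group("team"))  # Johnny Rocket
```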

I would use the following regex:
r"(?P<leagueId>\d+)/(?P<year>\d+)/(?P<team>[^/]+)/$"
A few small changes:
took out the ^ which means beginning of line
changed the [0-9] to \d
But the big one is the team section. I like to use negated character classes. You basically want to match everything up to the next /, and that is what [^/]+ will do for you.
You may want to limit the year to a range like:
\d{2,4}
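As a rough illustration, the suggested pattern with the optional \d{2,4} limit can be tried against a decoded path using plain re. The sample paths are assumptions, not taken from a real request.

```python
import re

# Pattern from the answer, with the year limited to 2-4 digits.
pattern = re.compile(r"(?P<leagueId>\d+)/(?P<year>\d{2,4})/(?P<team>[^/]+)/$")

match = pattern.search("123456/2017/Johnny Rocket/")
print(match.groupdict())
# {'leagueId': '123456', 'year': '2017', 'team': 'Johnny Rocket'}

# A five-digit year is rejected because \d{2,4} must be followed by "/".
rejected = pattern.search("123456/20170/Johnny Rocket/")
print(rejected)  # None
```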


How to handle " in Regex Python

I am trying to grab fary_trigger_post in the code below using a regex. However, I don't understand why the match always includes a trailing ", which I don't expect.
Any idea or suggestion?
re.match(
r'-instance[ "\']*(.+)[ "\']*$',
'-instance "fary_trigger_post" '.strip(),
flags=re.S).group(1)
'fary_trigger_post"'
Thank you.
The (.+) is greedy and grabs ANY character until the end of the input. If you modified your input to include characters after the final double quote (e.g. '-instance "fary_trigger_post" asdf') you would find the double quote and the remaining characters in the output (e.g. fary_trigger_post" asdf). Instead of .+ you should try [^"\']+ to capture all characters except the quotes. This should return what you expect.
re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', '-instance "fary_trigger_post" '.strip(), flags=re.S).group(1)
Also, note that I modified the end of the expression to use .* which will match any characters following the last quote.
Here's what I'd use in your matching string, but it's hard to provide a better answer without knowing all your cases:
r'-instance\s+"(.+)"\s*$'
Group 1, i.e. (.+), is greedy: . matches any character one or more times, and the engine takes as many characters as it can, so the group runs to the end of the string. I would suggest using the following pattern:
'-instance[ "\']*(.+)["\']+ *$'
This requires the regex to match the closing quotes and the trailing spaces separately, outside the group, so that they won't be included in group 1.
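To compare the suggestions side by side, here is a small sketch that runs the original pattern and the three proposed ones against the input from the question. The dictionary labels are made up for illustration.

```python
import re

text = '-instance "fary_trigger_post" '.strip()

# Labels are illustrative; the patterns are the ones quoted in this thread.
patterns = {
    "original (greedy .+)": r'-instance[ "\']*(.+)[ "\']*$',
    "negated class": r'-instance[ "\']*([^"\']+)[ "\'].*$',
    "explicit quotes": r'-instance\s+"(.+)"\s*$',
    "required closing quote": r'-instance[ "\']*(.+)["\']+ *$',
}

results = {
    name: re.match(pattern, text, flags=re.S).group(1)
    for name, pattern in patterns.items()
}
for name, value in results.items():
    print(f"{name}: {value!r}")
```

Only the original pattern keeps the trailing quote in group 1; the other three return fary_trigger_post for this input.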

Python using re to match string in a specific pattern

I am trying to use python re to match a string with a specific pattern.
The problem I have is with this expected sentence:
It is X. not X
X can be anything: a word, a bunch of words, a number, or digits.
The pattern I build is:
It is \w+. not \w+
just using
string.replace("X", "\w+")
It works if X is a word, or bunch of words, or int, but not for digits. How can I build my pattern in order to match everything in this pattern?
The . is a special character in a regular expression that will match any character. So .+ will match one or more characters.
r"It is .+\. not .+"
Note that the period is escaped as \. because in that case you want to match an actual period.
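A short comparison of the two patterns, as a sketch. The sample sentences are assumptions chosen to cover a word, an integer, a decimal, and several words.

```python
import re

naive = re.compile(r"It is \w+. not \w+")    # pattern from the question
escaped = re.compile(r"It is .+\. not .+")   # pattern from this answer

samples = [
    "It is red. not blue",
    "It is 42. not 7",
    "It is 3.14. not 2.71",
    "It is a dog. not a cat",
]

for sample in samples:
    # Print whether each pattern matches the whole sentence.
    print(sample, bool(naive.fullmatch(sample)), bool(escaped.fullmatch(sample)))
```

The \w+ version fails on the decimal and multi-word samples, while the .+ version accepts all four.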
Because .+ won't work in some cases, for example
It is quote. not a double-quote
It is a dog. not a cat
I would use this one instead :
(?<=It is ).+(?=\.)|(?<=not ).+$
Explanation
(?<=It is ).+(?=\.) Any consecutive characters preceded by It is and followed by a period
| OR
(?<=not ).+$ Any consecutive characters preceded by not and followed by the end-of-line anchor
(?<=It is ).*(?=\.)|(?<=not ).*$
I figured it out: I can use str.replace("X", "(\w+|\d+\.\d+)") to solve the problem. I hope this helps others with the same issue.
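A minimal sketch of that replace-based approach follows. The helper name build_pattern is hypothetical, and running the template through re.escape first is an added assumption (so that the literal period in the template is not treated as a wildcard); it is not part of the original snippet.

```python
import re

def build_pattern(template, placeholder="X", sub=r"(\w+|\d+\.\d+)"):
    """Hypothetical helper: turn a template like 'It is X. not X' into a regex."""
    # Escape the literal text first (assumption, not in the original post),
    # then swap each placeholder for the sub-pattern.
    escaped = re.escape(template)
    return re.compile(escaped.replace(placeholder, sub))

pattern = build_pattern("It is X. not X")

print(pattern.fullmatch("It is red. not blue").groups())
print(pattern.fullmatch("It is 3.14. not 2.71").groups())
```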

What does the regex [^\s]*? mean?

I am starting to learn how to write a Python spider to download some pictures from the web, and I found the following code. I know some basic regex.
I know \.jpg means .jpg and | means or. What does [^\s]*? on the first line mean, and why is \s used?
And what's the difference between the two regexes?
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
Alright, so to answer your first question, I'll break down [^\s]*?.
The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.
In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
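The effect of the lazy quantifier can be seen in a small sketch. The sample text and its URL are made up for illustration.

```python
import re

# Made-up sample with two extensions in a row.
text = "http://a.example/x.jpg.png and more"

lazy = re.search(r"http:[^\s]*?(\.jpg|\.png|\.gif)", text)
greedy = re.search(r"http:[^\s]*(\.jpg|\.png|\.gif)", text)
shorthand = re.search(r"http:\S*?(\.jpg|\.png|\.gif)", text)

print(lazy.group(0))       # http://a.example/x.jpg
print(greedy.group(0))     # http://a.example/x.jpg.png
print(shorthand.group(0))  # http://a.example/x.jpg
```

The lazy version stops at the first extension it can reach, the greedy one runs to the last, and \S*? behaves the same as [^\s]*?.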
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
http:foo.bar.png
http:.png
Both of these are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:
http:// .jpg
http://foo bar.png
These are equally invalid as URLs. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with both http and https, as well as files that end in both variations of jpg.
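As a rough check, the three patterns can be run against the problem strings above plus one well-formed URL. The https://example.com address is an assumed sample, and fullmatch is used here only to keep the comparison simple.

```python
import re

first = re.compile(r"http:[^\s]*?(\.jpg|\.png|\.gif)")
second = re.compile(r"http://.*?(\.jpg|\.png|\.gif)")
better = re.compile(r"https?://\S+\.(jpe?g|png|gif)")

candidates = [
    "http:foo.bar.png",
    "http:.png",
    "http:// .jpg",
    "http://foo bar.png",
    "https://example.com/cat.jpeg",  # assumed sample URL
]

for candidate in candidates:
    # For each string, show which of the three patterns accept it in full.
    verdicts = [bool(p.fullmatch(candidate)) for p in (first, second, better)]
    print(repr(candidate), verdicts)
```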

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1=re.search(r'_abc(.*)xyz',line,re.DOTALL)
print a1.group(1)
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
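The effect of the flag can be seen in a small sketch. The sample text is a shortened, made-up stand-in for the file in the question.

```python
import re

# Shortened stand-in for the multi-line input (assumed sample).
text = "foo_abc:\nsome line\nanother line\nxyz"

# Without DOTALL, . cannot cross the newlines, so nothing matches.
without_flag = re.search(r"_abc:(.*)xyz", text)
print(without_flag)  # None

# With DOTALL, . also matches newlines and the group spans all lines.
with_flag = re.search(r"_abc:(.*)xyz", text, re.DOTALL)
print(repr(with_flag.group(1)))  # '\nsome line\nanother line\n'
```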
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
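If only the text between the two markers is wanted, a capturing group can be added. The following is a sketch of that variant, not the code from this answer: the name find_inner is hypothetical, and it uses a lazy .*? so that it stops at the first occurrence of the suffix, which differs from the greedy version. The sample text is a shortened copy of the file from the question.

```python
import re

def find_inner(prefix, suffix, text):
    """Hypothetical variant: return only the text between prefix and suffix."""
    # Lazy .*? stops at the first suffix; the group excludes both markers.
    pattern = r"{}(.*?){}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None

sample = "sjaskdjajldlj_abc:\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\nxyz"
inner = find_inner("_abc:", "xyz", sample)
print(repr(inner))
```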

python regex pattern to extract value between two characters

I am trying to extract an id number from urls in the form of
http://www.domain.com/some-slug-here/person/237570
http://www.domain.com/person/237570
either one of these urls could also have params on them
http://www.domain.com/some-slug-here/person/237570?q=some+search+string
http://www.domain.com/person/237570?q=some+search+string
I have tried the following expressions to capture the id value of '237570' from the above urls. Each one kind of works, but none works across all four url scenarios.
(?<=person\/)(.*)(?=\?)
(?<=person\/)(.*)(?=\?|\z)
(?<=person\/)(.*)(?=\??*)
What I am seeing is that it gets the 237570 but also includes the ? and the characters that come after it in the url. How can I say "stop capturing when you hit a ?, a /, or the end of the string"?
String:
http://www.domain.com/some-slug-here/person/1234?q=some+search+string
http://www.domain.com/person/3456?q=some+search+string
http://www.domain.com/some-slug-here/person/5678
http://www.domain.com/person/7890
Regexp:
person\/(\d{1,})
Output:
>>> regex.findall(string)
[u'1234', u'3456', u'5678', u'7890']
Don't use .* to match the ID. . will match any character (except for line breaks, unless you use the DOTALL option). Just match a bunch of digits: (.*) --> (\d+)
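Applied to the four URLs from the question, the digits-only approach might look like the sketch below. The variable names are illustrative.

```python
import re

urls = [
    "http://www.domain.com/some-slug-here/person/237570",
    "http://www.domain.com/person/237570",
    "http://www.domain.com/some-slug-here/person/237570?q=some+search+string",
    "http://www.domain.com/person/237570?q=some+search+string",
]

# \d+ stops at the first non-digit, so ?, / and end of string all end the ID.
id_pattern = re.compile(r"person/(\d+)")

ids = [id_pattern.search(url).group(1) for url in urls]
print(ids)  # ['237570', '237570', '237570', '237570']
```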
