I'm trying to build my first Django-powered blog, but I'm stuck at one point.
I'm trying to grab a permanent link from the visited URL in order to display a single post.
The permanent link I'm using looks like this:
http://127.0.0.1:8000/blog/20-feb-2012/a-nice-post/
I'd like to grab both the date and the slug from this URL and pass them into a view's function.
I've made this regular expression:
(r'^blog/(?P<day>\d{2})-/(?P<month>\w{3})-/(?P<year>\d{4})/(P?<slug>[-\w]+)/$','blog.views.single_post'),
This is in the urls.py file, but it doesn't seem to work.
What's wrong with this regular expression?
You have included slashes between the day-month-year. Remove them.
(r'^blog/(?P<day>\d{2})-(?P<month>\w{3})-(?P<year>\d{4})/(?P<slug>[-\w]+)/$','blog.views.single_post'),
Without checking anything else, you have P? instead of ?P in the slug part.
For starters, you have extra slashes in your regexp, for example here: (?P<month>\w{3})-/(?P<year>\d{4}). You also have P? instead of ?P at the end.
In addition, I thought you might want to have a working regexp example. So I tested this one and it works for /blog/20-feb-2012/a-nice-post/:
r'^blog/(?P<day>\d{2})-(?P<month>\w{3})-(?P<year>\d{4})/(?P<slug>[-\w]+)/$'
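The difference between the two patterns can be checked outside Django with Python's re module. This is only a quick sketch: the test path omits the leading slash on the assumption that Django matches URL patterns against the path without it, and the variable names are made up for illustration.

```python
import re

# Pattern from the question: stray slashes after the hyphens and "P?" instead of "?P".
broken = r'^blog/(?P<day>\d{2})-/(?P<month>\w{3})-/(?P<year>\d{4})/(P?<slug>[-\w]+)/$'
# Corrected pattern from the answers.
fixed = r'^blog/(?P<day>\d{2})-(?P<month>\w{3})-(?P<year>\d{4})/(?P<slug>[-\w]+)/$'

path = 'blog/20-feb-2012/a-nice-post/'

broken_match = re.match(broken, path)
fixed_match = re.match(fixed, path)

# The named groups are what Django would pass to the view as keyword arguments.
captured = fixed_match.groupdict() if fixed_match else None
print(broken_match)
print(captured)
```

Note that the broken pattern still compiles, because (P?<slug>...) is read as an ordinary group containing an optional letter P followed by the literal text <slug>, so the mistake shows up as a silent non-match rather than an error.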
So I've been working on a web crawler to parse readable content out of a news site I like, and I've been using regex pretty heavily in Python 2. I visited https://regexr.com/ to double-check that I had the correct expression for this use case, but I keep getting results that differ from what regexr shows. Here is the expression:
re.compile(ur"[\s\S\]*<p.*>([\s\S]+?)<\/p>")
And here is the html I am attempting to match
</figcaption></figure><p>Researchers at MIT and several other
institutions have developed a method for making photonic ...
It doesn't end up getting closed for some time, but the program doesn't grab this section at all, and only after the in
ygen levels</a>, and even blood pressure.</p>
does it begin to grab the html (EDIT: p elements). I guess I am confused by the inconsistencies with different regex engines, and I am trying to figure out when and where to modify my syntax, in this case to grab the entire p element, but also generally. This is my first time posting here, so I may have this formatted incorrectly, but thank you all in advance. Been lurking for a while now.
The expression [\s\S]* will match everything, and so will go straight past the beginning of the tag.
Within the tag, your expression p.* is greedy, and will not stop at the nearest closing bracket. Use .*? for non-greedy.
You seem to have a number of other syntax errors in the regex also. Cut and paste a valid regex.
In general it is much easier and less error-prone to use a proper HTML parsing library, even for quite simple tasks. See for example the parsers in lxml.
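The greedy versus non-greedy behaviour described above can be seen on a small sample. This is a sketch in Python 3 with invented markup, not the asker's actual page; the second pattern is one possible rewrite, not the only one.

```python
import re

# Invented sample: two paragraphs on one line, the first with an attribute.
html = '</figure><p class="lead">First paragraph.</p><p>Second one.</p>'

# Greedy: "<p.*>" runs on to the last ">" that still allows a match,
# so the first paragraph is swallowed and only the second is captured.
greedy = re.search(r'<p.*>([\s\S]+?)</p>', html).group(1)

# "[^>]*" cannot cross the end of the opening tag, and "*?" stops at the
# nearest closing tag, so each paragraph is captured separately.
lazy = re.findall(r'<p[^>]*>([\s\S]*?)</p>', html)

print(greedy)
print(lazy)
```

Even the second pattern is fragile (it would also accept tags such as <pre>), which is part of why a real parser is the safer choice.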
Perhaps it's because you don't have a closing parenthesis ) in your regular expression?
Try starting with this, then build it out:
import re
s = """</figcaption></figure><p>Researchers at MIT and several other
institutions have developed a method for making photonic</p>"""
r = re.compile(r"<p>([\w\W ]*)</p>")
a = r.search(s)
print(a.group(1))
Note that you don't have to escape the forward slash.
In this case, I ended up getting the response I desired with @marekful's expression substituted into the regex mentioned in the post. Thank you all for the assistance!
re.compile(ur"[\s\S\]*?<p[^>]*>([\w\W])*</\p>")
I want SOMETHING to be optional, so basically I want www.site.com/endpoint to redirect to www.site.com/something/endpoint automatically. Is there a way to do it on one line?
Right now I am doing:
url(r'^SOMETHING/endpoint$', 'endpoint', name='endpoint'),
url(r'^endpoint$', RedirectView.as_view(url='SOMETHING/endpoint')),
Cheers.
Why do you want to do this on one line? You're talking about two different URLs that do two different things—one is doing an HTTP redirect and the other is rendering a view. Two lines is the right way to go.
Writing a broader regex to cover both URLs will allow you to use the same view for both, but will not cause a redirect (that is, it will not change the URL to SOMETHING/endpoint).
I believe you are asking how to make any URL ending in "endpoint" redirect.
To accomplish this, change your redirect URL regex to r"endpoint$". The caret anchors a pattern to the start of the string, so dropping it lets the pattern match anywhere.
This regex will match any URL ending in "endpoint", e.g. foo/endpoint, bar/endpoint
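The two regex ideas from the answers above (a broader pattern with an optional prefix, and an unanchored pattern) can be compared directly with Python's re module. This is a sketch only; the sample paths are made up, and re.search is used on the assumption that it mirrors how the patterns would be applied to a path.

```python
import re

# Optional non-capturing group: "SOMETHING/" may or may not be present.
optional_prefix = re.compile(r'^(?:SOMETHING/)?endpoint$')
# No caret: anything that merely ends in "endpoint" matches.
unanchored = re.compile(r'endpoint$')

paths = ['endpoint', 'SOMETHING/endpoint', 'foo/endpoint', 'myendpoint']

optional_results = {p: bool(optional_prefix.search(p)) for p in paths}
unanchored_results = {p: bool(unanchored.search(p)) for p in paths}

print(optional_results)
print(unanchored_results)
```

The unanchored pattern also accepts myendpoint, which may be broader than intended. Neither pattern performs a redirect by itself; they only decide which paths reach the view.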
I am trying to write a script to scrape a website, and am using this one (http://www.theericwang.com/scripts/eBayRead.py).
I however want to use it to crawl sites other than ebay, and to customize to my needs.
I am fairly new to python and have limited re experience.
I am unsure of what this line achieves.
for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
Could someone please give me some pointers?
Is there anything else I need to consider if I port this for other sites?
In general, parsing HTML is best done with a library such as BeautifulSoup, which takes care of virtually all of the heavy lifting for you, leaving you with more intuitive code. Also, read @Tadeck's link below: regex and HTML shouldn't be mixed if it can be avoided (to put it lightly).
As for your question, that line uses a regular expression to find matching patterns in a text (in this case, HTML). re.findall() is a function that returns a list of matches, so if we focus on just that call:
re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
r indicates that the following will be interpreted 'raw', meaning that characters like backslashes, etc., will be interpreted literally.
href="([^"]+)
The parentheses indicate a group (the part of the match we care about), and [^"]+ means 'match one or more characters that aren't a double quote'. As you can probably guess, this group captures the URL of the link.
.*class="vip"
The .* matches anything (well, almost anything) 0 or more times (which here could include other tags, the closing quote of the link, whitespace, etc.). Nothing special with class="vip" - it just needs to appear.
title=\'([^\']+)', lines):
Here you see an escaped quote and then another group like the one above. This time, we capture everything between the two apostrophes after the title attribute.
The end result of this is you are iterating through a list of all matches, and those matches are going to look something like (my_matched_link, my_matched_title), which are passed into for url, title, after which further processing is done.
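To make the walkthrough above concrete, here is the same pattern run against a couple of invented lines shaped the way the pattern expects (double-quoted href, then class="vip", then a single-quoted title). The markup and URLs are assumptions for illustration, not real eBay output.

```python
import re

# Hypothetical markup; each link sits on its own line because "." does not
# match a newline, which keeps ".*" from running across links.
lines = (
    '<a href="http://example.com/item/1" class="vip" title=\'First item\'>x</a>\n'
    '<a href="http://example.com/item/2" class="vip" title=\'Second item\'>y</a>\n'
)

# With two groups in the pattern, findall returns a list of 2-tuples.
matches = re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines)

for url, title in matches:
    print(url, '->', title)
```

If a site puts the attributes in a different order or uses different quotes, the list simply comes back empty, which is the main thing to watch for when porting this to other sites.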
I am not sure if this answers your question, but you could consider Scrapy (http://scrapy.org) for crawling various websites. It is a nice framework that provides a lot of flexibility and is easy to customize to specific needs.
Regular expressions are bad for parsing HTML
The above is the main idea I would like to communicate to you. For why, see this question: RegEx match open tags except XHTML self-contained tags.
In short, the HTML text can change (e.g. a new attribute is added, or the order of attributes changes) in ways that web browsers interpret as exactly the same document, yet those changes completely break your script.
HTML should be parsed using specialized HTML parsers or web scrapers. They know which differences are significant and which are not.
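As a small illustration of that point, the sketch below uses the standard library's html.parser instead of the third-party tools named in this thread, purely to stay self-contained. The class name LinkCollector and the sample markup are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, title) pairs for every <a> tag carrying the class "vip"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        # A class attribute may hold several space-separated names.
        if 'vip' in (attributes.get('class') or '').split():
            self.links.append((attributes.get('href'), attributes.get('title')))


# Two equivalent links: different attribute order, different quote style.
html = (
    '<a href="/item/1" class="vip" title="First">x</a>'
    "<a title='Second' class='vip big' href='/item/2'>y</a>"
)

parser = LinkCollector()
parser.feed(html)
print(parser.links)
```

Both links are found even though the second one would defeat a regex that expects href first and a single-quoted title last.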
What to use for scraping?
There are multiple solutions, but one of the most notable is Scrapy. Try it, you may start to love it.
I'm trying to build a small wiki, but I'm having problems writing the regex rules for them.
What I'm trying to do is that every page should have an edit page of its own, and when I press submit on the edit page, it should redirect me to the wiki page.
I want to have the following urls in my application:
http://example.com/<page_name>
http://example.com/_edit/<page_name>
My URLConf has the following rules:
url(r'(_edit/?P<page_name>(?:[a-zA-Z0-9_-]+/?)*)', views.edit),
url(r'(?P<page_name>(^(?:_edit?)?:[a-zA-Z0-9_-]+/?)*)', views.page),
But they're not working for some reason.
How can I make this work?
It seems that one - or both - match the same things.
For a more concise approach, I'd define the edit URL as:
http://example.com/<pagename>/edit
This is more clear and guessable in my humble opinion.
Then, remember that Django loops over your URL patterns in the order you defined them and stops at the first one that matches the incoming request. So the order in which they are defined is really important.
Coming to the answer to your question:
^(?P<page_name>[\w]+)$ matches a request to any /PageName
Always remember the leading caret and the trailing dollar sign. They say that the URL must start and end exactly where the regexp does; without them, any leading or trailing characters would still let the regexp match (while you likely want to return a 404 in that case).
^_edit/(?P<page_name>[\w]+)$ matches the edit URL (or use ^(?P<page_name>[\w]+)/edit$ if you prefer the user-friendly style commonly referred to as REST URLs, although RESTfulness is a concept that has nothing to do with URL style).
To summarize, put the following in your urls:
url(r'^(?P<page_name>[\w]+)$', views.page),
url(r'^_edit/(?P<page_name>[\w]+)$', views.edit),
You can easily keep particular characters out of your URLs by replacing \w with a character set of your own.
To learn more about Django URL dispatching, read the Django documentation on the URL dispatcher.
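The first-match behaviour and the effect of the anchors can be tried out without Django. The following is only a rough imitation of the dispatcher: resolve and the string view names are hypothetical stand-ins, and paths are given without the leading slash on the assumption that Django strips it before matching.

```python
import re

# The two patterns from the answer, in the same order.
urlpatterns = [
    (re.compile(r'^(?P<page_name>[\w]+)$'), 'page'),
    (re.compile(r'^_edit/(?P<page_name>[\w]+)$'), 'edit'),
]


def resolve(path):
    """Return (view_name, kwargs) for the first pattern that matches, else None."""
    for pattern, view_name in urlpatterns:
        match = pattern.search(path)
        if match:
            return view_name, match.groupdict()
    return None  # Django would answer with a 404 here


print(resolve('HomePage'))
print(resolve('_edit/HomePage'))
print(resolve('HomePage/extra'))
print(resolve('_edit'))
```

Because \w also matches an underscore, a bare _edit resolves to the page view with _edit as the page name, which is the kind of case a custom character set could rule out.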
Note: Regexps are as dangerous as they are powerful, especially in network-facing code. Keep them simple, and be sure you really understand what you are defining; otherwise your web application may be exposed to security issues.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
Please try the following URLs, that are simpler:
url(r'_edit/(?P<page_name>[\w-]+)', views.edit),
url(r'(?P<page_name>[\w-]+)', views.page),
I had a quick question about Django URL configuration, and I guess REGEX as well. I have a small configuration application that takes in environments and artifacts in the following way:
url(r'^env/(?P<env>\w+)/artifact/(?P<artifact>\w+)/$', 'config.views.ipview', name="bothlist"),
Now, this works fine, but what I would like to do is allow additional optional parameters, such as a verbose mode or a no-formatting mode. I know how to handle this just fine in the views, but I can't wrap my head around the regex.
The call would be something like
GET /env/<env>/artifact/<artifact>/<opt:verbose>/<opt:noformat>
Any help would be appreciated, thanks!
-Shawn
I wouldn't put such options into the URL path. As you said, they are optional and only change the output, so they don't belong in the path.
Your initial regex already matches URLs like the following, because Django matches patterns against the path only, not the query string:
/env/<env>/artifact/<artifact>/?verbose=1&noformat=1
IMHO this is a much better use of URLs.
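Reading such options is then a matter of looking at the query string. The sketch below uses the standard library's urllib.parse so that it runs on its own; inside a Django view the equivalent lookup would go through request.GET. The function name read_options, the sample env and artifact values, and the convention that "1" means enabled are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def read_options(url):
    """Return (verbose, noformat) flags taken from the query string of url."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; a missing key means "off".
    verbose = query.get('verbose', ['0'])[0] == '1'
    noformat = query.get('noformat', ['0'])[0] == '1'
    return verbose, noformat


print(read_options('/env/prod/artifact/web/?verbose=1&noformat=1'))
print(read_options('/env/prod/artifact/web/?verbose=1'))
print(read_options('/env/prod/artifact/web/'))
```

The URL pattern stays exactly as it was, and each option can be omitted independently without any extra regex branches.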