I'm working on a website with a list page where the user chooses from a list of similar elements, and I want a separate page to open based on the element they choose.
I have the list set up, and each element in it is identified by a random string of 15 characters drawn from [0-9], [a-z] and [A-Z].
Example of a URL for an element of the list: http://127.0.0.1:8000/view?s=fkiscl49gtisceg
where s is the identifier (kind of like how YouTube videos have a separate link).
However, I can't figure out how to make Django ignore the ?s=fkiscl49gtisceg part of the URL. I've written the path like this now:
path('view/(?P<s>[\w]{15})', element_display, name='s'),
Django, however, tells me that the page was not found... How do I fix this?
The principle is simply that Django takes no account of the query string, so you should not have it in your pattern. The URL pattern should just be path('view', ...); the value of s is then available in the view via request.GET.
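For illustration, a minimal sketch (the views module layout and route name here are assumptions, not from your project):

# urls.py
from django.urls import path
from .views import element_display

urlpatterns = [
    path('view', element_display, name='element_display'),
]

# views.py
from django.http import HttpResponse

def element_display(request):
    s = request.GET.get('s')  # e.g. 'fkiscl49gtisceg'
    return HttpResponse('You asked for element %s' % s)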
Is there some way to pass Scrapy a list of URLs to not crawl?
I found the LinkExtractor class, which has a deny parameter, but it's regex-based. I just want to pass it a list of URLs.
Background: I have a scrapy crawler based on the SitemapSpider class that can extract and index all the links from an XML sitemap no problem.
Since I am using it to index sites on a daily basis (it’s crawling job postings), I only want it to look at new pages. It saves server burden and index time for me to not look at previously indexed pages.
I’ve tried passing a list of links that I’ve previously indexed, but I get an error that the list of links are not regex objects. It could be that I just don’t know how to convert url strings to regex objects.
To convert a string to a regular expression that matches only that string, use re.escape to escape any regex metacharacters, then re.compile to compile the resulting string into a regex.
import re
url_list = ['http://example.com/', 'http://example.net/']
deny_list = [re.compile(re.escape(url)) for url in url_list]
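The compiled patterns can then be passed straight to the extractor, since deny accepts a single regex or a list of regexes:

from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor(deny=deny_list)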
As a side note, this may or may not help for your use case, but if you know the list of denied URLs in advance (i.e. when writing your program), you can use an external tool to compile them directly into an efficient regex. I'm particularly fond of regexp-opt, which is built into Emacs.
For instance, if we know in advance that we want to block the two URLs above (example.com and example.net), then we can do this in Emacs:
(regexp-opt '("http://example.com/" "http://example.net"))
which gives us this
\(?:http://example\.\(?:com/\|net\)\)
and that's a much more efficient regex than checking the two URLs separately. But this only works if you know the URLs in advance, as I don't know offhand of any Python tool that precompiles match lists like this. (I'm sure one exists; I just don't know of it.)
Option #1
The link extractor also has a process_value parameter, which can be set to a callable that filters certain URLs: return None if the extracted URL is in your list of URLs, otherwise return the URL unchanged.
docs
Example:
previously_indexed = [....]  # list of urls

def link_filter(value):
    if value in previously_indexed:
        return None
    return value

linkextractor = LinkExtractor(..., process_value=link_filter)
Option #2
Since you are using the sitemap spider, you can override the spider's sitemap_filter method to filter out the URLs that have been previously indexed.
docs
For example:
from scrapy.spiders import SitemapSpider

class FilteredSitemapSpider(SitemapSpider):
    name = 'filtered_sitemap_spider'
    allowed_domains = ['example.com']
    sitemap_urls = ['http://example.com/sitemap.xml']
    previously_indexed = [....]

    def sitemap_filter(self, entries):
        for entry in entries:
            if entry['loc'] not in self.previously_indexed:
                yield entry
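Note that sitemap_filter is applied to the entries parsed from each sitemap before any requests are scheduled, and entry['loc'] is the URL from the sitemap's <loc> element, so pages filtered out here are never fetched at all, which is exactly what you want for saving server load.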
I am working with Twitter links and I have extracted the links from Twitter's API.
https://twitter.com/shiromiru36/status/1302713597521403904/photo/1
The above is an example of links that I have.
I would like to get front part of the link only, I want to delete
/photo/1
So that I can get
https://twitter.com/shiromiru36/status/1302713597521403904
Currently I extract the link by cutting off the number of characters that I don't want:
url = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'
url[:-8]
I would like to ask if there is any way to extract the link by finding its own pattern. Since the unwanted part comes after the 2nd-to-last "/", I am thinking I could find the 2nd-to-last "/" first and then delete everything after it.
Thank you.
You could do something like this:
'/'.join(url.split('/')[:-2])
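An equivalent way is to split from the right at most twice and keep the first piece:

url = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'
print(url.rsplit('/', 2)[0])
# https://twitter.com/shiromiru36/status/1302713597521403904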
Try this
url = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'
url = url[::-1].split('/', 2)[-1][::-1]  # reverse, split off the last two path segments, reverse back
print(url)
https://twitter.com/shiromiru36/status/1302713597521403904
I'm using Selenium with the find_element_by_xpath method to do some web scraping, and I'm having trouble with a path that changes across pages. I know how the path is written, but one string within the path changes during my loop, and I would like to know how I can use a regex to solve it.
I have this code for one of the pages, but as I go through all the pages the string "NUMBER" below changes:
browser.find_element_by_xpath(re.compile('//*[@id="exhibDetail:exhib"]/section[3]/div[2]/div/div[2]/div/div/div[NUMBER]/div')).click()
I want to know if it was possible to use regex in order to say that it has to click whatever the "NUMBER" as long as the rest of the path is the same so I tried this but I'm not sure about the syntax and how to use regex here:
browser.find_element_by_xpath('//*[@id="exhibDetail:exhib"]/section[3]/div[2]/div/div[2]/div/div/div[' + re.compile("^[1-9]\d*$") + ']/div').click()
browser.find_element_by_xpath(re.compile('^//*[@id="exhibDetail:exhib"]/section[3]/div[2]/div/div[2]/div/div/div[')).click()
browser.find_element_by_xpath('//*[@id="exhibDetail:exhib"]/section[3]/div[2]/div/div[2]/div/div/div[1]/div').click()
browser.find_element_by_xpath('//*[@id="exhibDetail:exhib"]/section[3]/div[2]/div/div[2]/div/div/div[9]/div').click()
browser.find_element_by_xpath('//*[@id="exhibDetail:exhib"]/section[3]/div[2]/div/div[2]/div/div/div[4]/div').click()
browser.find_element_by_xpath('//*[@id="exhibDetail:exhib"]/section[3]/div[2]/div/div[2]/div/div/div[10]/div').click()
browser.find_element_by_xpath('//*[@id="exhibDetail:exhib"]/section[3]/div[2]/div/div[2]/div/div/div[6]/div').click()
The index changes from page to page more or less randomly, not incrementally one by one. How do I solve this problem?
Welcome to SO.
If you are trying to pass NUMBER as part of the XPath in your loop, then you can do the below.
If NUMBER is an integer:
browser.find_element_by_xpath("//*[@id='exhibDetail:exhib']/section[3]/div[2]/div/div[2]/div/div/div[%i]/div" % NUMBER).click()
If NUMBER is a string:
browser.find_element_by_xpath("//*[@id='exhibDetail:exhib']/section[3]/div[2]/div/div[2]/div/div/div[%s]/div" % NUMBER).click()
I want to know if it was possible to use regex in order to say that it has to click whatever the "NUMBER" as long as the rest of the path is the same
If you want to select those div elements regardless of their position (which is what the predicates [1], [2], etc. are testing), then just don't use the predicates at all:
//*[#id="exhibDetail:exhib"]/section[3]/div[2]/div/div[2]/div/div/div/div
So, I know this sounds a bit odd, but basically here is my HTML example:
$400 + free shipping</title>
<link>https://www.dealnews.com/Samsung-50-4-K-HDR-LED-Smart-TV-for-400-free-shipping/17336849.html?iref=rss-dealnews-editors-choice</link>
<description><img src='http://c.dlnws.com/image/upload/f_auto,t_large,q_auto/content/vdiy8a75wg8v7bo92dhq'
I only want to capture the URL of items that have a dollar sign some way before them, e.g. everything after $.... then (URL).
At the moment my regex is this:
img src='([^']+)'.*
This grabs EVERY img src; however, like I said, I only want images that have the "$" sign before them. Essentially, I don't want any images that aren't to do with a product on this HTML page.
Looking at the HTML example you provided, it seems your product images are directly preceded by a <description> HTML tag. It takes less processing power (and time) to use a non-capturing group directly before the desired URL rather than looking back all the way to a potential (but not guaranteed) $ sign. If you use the <description> tag exclusively for products, then this regular expression will suit your needs:
(?:<description><img src=')([^']+)
Other things to consider:
Make sure to add the Global and Multiline modifiers if you require this check for multiple lines across your HTML code.
If you need to take HTML entities into account and allow a combination of HTML entities alongside parsed HTML, consider adding an OR alternation for them in your regex. For example, to allow both &lt; and < before the img tag use:
(?:<description>(?:&lt;|<)img src=')([^']+) and if we take into account the opening and closing entities of the description tag as well, we end up with this: (?:(?:&lt;|<)description(?:&gt;|>)(?:&lt;|<)img src=')([^']+)
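As a quick illustration of the simple version in Python (re.findall is already "global" in the sense that it returns every match; re.MULTILINE would only matter if you anchored with ^ or $):

import re

# sample snippet from the question
html = "<description><img src='http://c.dlnws.com/image/upload/f_auto,t_large,q_auto/content/vdiy8a75wg8v7bo92dhq'"

pattern = re.compile(r"(?:<description><img src=')([^']+)")
print(pattern.findall(html))
# ['http://c.dlnws.com/image/upload/f_auto,t_large,q_auto/content/vdiy8a75wg8v7bo92dhq']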
I have a script that parses an HTML page for all the links within it. I am getting all of them fine, but I have a list of domains I want to compare them against. A sample list contains
list=['www.domain.com', 'sub.domain.com']
But I may have a list of links that look like
http://domain.com
http://sub.domain.com/some/other/page
I can strip off the http:// just fine, but in the two example links I just posted, they both should match. The first I would like to match against the www.domain.com, and the second, I would like to match against the subdomain in the list.
Right now I am using urllib2 to fetch the HTML. What are my options for completing this task?
You might consider stripping 'www.' from the list and doing something as simple as:
url = 'domain.com/'
for domain in list:
    if url.startswith(domain):
        ... do something ...
Or trying both won't hurt either, I suppose:
url = 'domain.com/'
for domain in list:
    domain_minus_www = domain
    if domain_minus_www.startswith('www.'):
        domain_minus_www = domain_minus_www[4:]
    if url.startswith(domain) or url.startswith(domain_minus_www):
        ... do something ...
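Since your links start with http://, a more robust alternative (a sketch using only the standard library; on Python 2 the same function lives in the urlparse module) is to parse the hostname out of each link and compare hostnames directly, stripping 'www.' on both sides:

from urllib.parse import urlparse

domains = {'domain.com', 'sub.domain.com'}  # 'www.' already stripped

def matches(url):
    host = urlparse(url).hostname or ''  # needs a scheme like http:// to find the host
    if host.startswith('www.'):
        host = host[4:]
    return host in domains

print(matches('http://domain.com'))                      # True
print(matches('http://sub.domain.com/some/other/page'))  # True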