I have a city card that I display with this link:
/<country_name>/<city_name>
I can access this card from two different links:
/<country_name>/
/<country_name>/pick-cities
When the card is opened I want to be able to return to the URL visited before (one of the two above).
I have used a target variable to which I assign the referrer:
target = request.META['HTTP_REFERER']
But the problem is that after I perform some actions on the card, request.META['HTTP_REFERER'] becomes the card's own URL (like history.back() in JavaScript).
Is there another way to link back to the previously visited URL?
You could save request.META['HTTP_REFERER'] in the session and use it when needed.
Store it only when the referrer differs from the card's own URL, so it won't be overwritten when you perform actions on the card.
Something like
def city_card(request):
    # do your stuff
    referrer = request.META.get('HTTP_REFERER')
    # only store the referrer when it is not the card's own URL,
    # so actions performed on the card do not overwrite it
    if referrer and referrer != request.path:
        request.session['REFERRER'] = referrer
    # use request.session['REFERRER'] when you want to redirect there
NOTE: You may have to use something other than request.path for the actual URL comparison, since HTTP_REFERER contains a full URL.
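When you actually want to go back, a minimal sketch could look like this (the view name and the fallback URL are assumptions, not part of your code):

from django.shortcuts import redirect

def back_from_city_card(request, country_name, city_name):
    # hypothetical "back" view: fall back to the country page if nothing was stored
    back_url = request.session.get('REFERRER', '/{}/'.format(country_name))
    return redirect(back_url)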
I need to allow some pages to have an arbitrary URL that does not depend on the structure of the site.
For example I have structure:
/
/blog
/blog/blogpost1
/blog/blogpost2
But, for example, I need to change the URL from /blog/blogpost2 to /some/blogpost/url1
To do this, I decided to let the main page of the site handle any URL.
class IndexPage(RoutablePageMixin, Page):
    ...

    @route(r'^(?P<path>.*)/$')
    def render_page_with_special_path(self, request, path, *args, **kwargs):
        pages = Page.objects.not_exact_type(IndexPage).specific()
        for page in pages:
            if hasattr(page, 'full_path'):
                if page.full_path == path:
                    return page.serve(request)
        # some logic
But now, if the path is not found, I need to hand the request back to the standard handler. How can I do that?
This isn't possible with RoutablePageMixin; Wagtail treats URL routing and page serving as two distinct steps, and once it has identified the function responsible for serving the page (which, for RoutablePageMixin, is done by checking the URL against the route given in @route), there's no way to go back to the URL routing step.
However, it can be done by overriding the page's route() method, which is the low-level mechanism used by RoutablePageMixin. Your version would look something like this:
from wagtail.core.url_routing import RouteResult

class IndexPage(Page):
    def route(self, request, path_components):
        # reconstruct the original URL path from the list of path components
        path = '/'
        if path_components:
            path += '/'.join(path_components) + '/'

        pages = Page.objects.not_exact_type(IndexPage).specific()
        for page in pages:
            if hasattr(page, 'full_path'):
                if page.full_path == path:
                    return RouteResult(page)

        # no match found, so revert to the default routing mechanism
        return super().route(request, path_components)
I tried to get all the products from this website, but somehow I don't think I chose the best method, because some of them are missing and I can't figure out why. It's not the first time I've gotten stuck on something like this.
The way I'm doing it now is like this:
go to the index page of the website
get all the categories from there (A-Z 0-9)
access each of the above categories and recursively go through all the subcategories from there until I reach the products page
when I reach the products page, check if the product has more SKUs. If it has, get the links. Otherwise, that's the only SKU.
Now, the code below works, but it just doesn't get all the products and I don't see any reason why it would skip some. Maybe the way I approached everything is wrong.
from lxml import html
from random import randint
from string import ascii_uppercase
from time import sleep
from requests import Session

INDEX_PAGE = 'https://www.richelieu.com/us/en/index'
session_ = Session()


def retry(link):
    wait = randint(0, 10)
    try:
        return session_.get(link).text
    except Exception as e:
        print('Retrying product page in {} seconds because: {}'.format(wait, e))
        sleep(wait)
        return retry(link)


def get_category_sections():
    au = list(ascii_uppercase)
    au.remove('Q')
    au.remove('Y')
    au.append('0-9')
    return au


def get_categories():
    html_ = retry(INDEX_PAGE)
    page = html.fromstring(html_)
    sections = get_category_sections()
    for section in sections:
        for link in page.xpath("//div[@id='index-{}']//li/a/@href".format(section)):
            yield '{}?imgMode=m&sort=&nbPerPage=200'.format(link)


def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)

    # recurse into the subcategory carousel, if present
    for link in page.xpath(
        '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
    ):
        yield from dig_up_products(link)

    # product links on the current listing page
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href'):
        yield link

    # follow the "next page" link in the pagination bar
    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)


def check_if_more_products(tree):
    more_prods = [
        all_prod
        for all_prod in tree.xpath("//div[@id='pm2_prodTableForm']//tbody/tr/td[1]//a/@href")
    ]
    if not more_prods:
        return False
    return more_prods


def main():
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            product_page = retry(product_link)
            product_tree = html.fromstring(product_page)
            more_products = check_if_more_products(product_tree)
            if not more_products:
                print(product_link)
            else:
                for sku_product_link in more_products:
                    print(sku_product_link)


if __name__ == '__main__':
    main()
Now, the question might be too generic but I wonder if there's a rule of thumb to follow when someone wants to get all the data (products, in this case) from a website. Could someone please walk me through the whole process of discovering what's the best way to approach a scenario like this?
If your ultimate goal is to scrape the entire product listing for each category, it may make sense to target the full product listings for each category on the index page. This program uses BeautifulSoup to find each category on the index page and then iterates over each product page under each category. The final output is a list of namedtuples storing each category name together with the current page link and the full product titles for that link:
url = "https://www.richelieu.com/us/en/index"
import urllib
import re
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import itertools
s = soup(str(urllib.urlopen(url).read()), 'lxml')
blocks = s.find_all('div', {'id': re.compile('index\-[A-Z]')})
results_data = {[c.text for c in i.find_all('h2', {'class':'h1'})][0]:[b['href'] for b in i.find_all('a', href=True)] for i in blocks}
final_data = []
category = namedtuple('category', 'abbr, link, products')
for category1, links in results_data.items():
for link in links:
page_data = str(urllib.urlopen(link).read())
print "link: ", link
page_links = re.findall(';page\=(.*?)#results">(.*?)</a>', page_data)
if not page_links:
final_page_data = soup(page_data, 'lxml')
final_titles = [i.text for i in final_page_data.find_all('h3', {'class':'itemHeading'})]
new_category = category(category1, link, final_titles)
final_data.append(new_category)
else:
page_numbers = set(itertools.chain(*list(map(list, page_links))))
full_page_links = ["{}?imgMode=m&sort=&nbPerPage=48&page={}#results".format(link, num) for num in page_numbers]
for page_result in full_page_links:
new_page_data = soup(str(urllib.urlopen(page_result).read()), 'lxml')
final_titles = [i.text for i in new_page_data.find_all('h3', {'class':'itemHeading'})]
new_category = category(category1, link, final_titles)
final_data.append(new_category)
print final_data
The output will contain results in the following format:
[category(abbr=u'A', link='https://www.richelieu.com/us/en/category/tools-and-shop-supplies/workshop-accessories/tool-accessories/sander-accessories/1058847', products=[u'Replacement Plate for MKT9924DB Belt Sander', u'Non-Grip Vacuum Pads', u'Sandpaper Belt 2\xbd " x 14" for Compact Belt Sander PC371 or PC371K', u'Stick-on Non-Vacuum Pads', u'5" Non-Vacuum Disc Pad Hook-Face', u'Sanding Filter Bag', u'Grip-on Vacuum Pads', u'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x 10.79 cm (3" x 4-1/4")', u'4" Abrasive for Finishing Tool', u'Sander Backing Pad for RO 150 Sander', u'StickFix Sander Pad for ETS 125 Sander', u'Sub-Base Pad for Stocked Sanders', u'(5") Non-Vacuum Disc Pad Vinyl-Face', u'Replacement Sub-Base Pads for Stocked Sanders', u"5'' Multi-Hole Non-Vaccum Pad", u'Sander Backing Pad for RO 90 DX Sander', u'Converting Sanding Pad', u'Stick-On Vacuum Pads', u'Replacement "Stik It" Sub Base', u'Drum Sander/Planer Sandpaper'])....
To access each attribute, call like so:
categories = [i.abbr for i in final_data]
links = [i.link for i in final_data]
products = [i.products for i in final_data]
I believe the benefit of using BeautifulSoup in this instance is that it provides a higher level of control over the scraping and is easily modified. For instance, should the OP change his mind regarding which facets of the product/index he would like to scrape, only simple changes to the find_all parameters should be needed, as the general structure of the code above centers around each product category from the index page.
First of all, there is no definite answer to your generic question of how one would know whether the data already scraped is all the available data. That is at least website-specific and is rarely actually revealed. Plus, the data itself might be highly dynamic. On this website, though, you can more or less use the product counters to verify the number of results found.
Your best bet here would be to debug: use the logging module to print out information while scraping, then analyze the logs and look for why a particular product went missing and what caused that.
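For example, a minimal sketch of wiring the standard logging module into the retry() helper from the question (the log file name and format are just placeholders; randint, sleep and session_ come from your existing code):

import logging

logging.basicConfig(
    filename='scrape.log',
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s',
)

def retry(link):
    wait = randint(0, 10)
    try:
        logging.debug('Fetching %s', link)
        return session_.get(link).text
    except Exception as e:
        logging.warning('Retrying %s in %s seconds because: %s', link, wait, e)
        sleep(wait)
        return retry(link)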
Some of the ideas I currently have:
could it be that retry() is the problematic part - that session_.get(link).text does not raise an error, yet the response does not contain the actual data either?
I think the way you extract category links is correct and I don't see you missing categories on the index page
dig_up_products() is questionable: when you extract links to the subcategories, you have the carouselSegment2b id used in the XPath expression, but I see that on at least some of the pages (like this one) the id value is carouselSegment1b. In any case, I would probably do just //h2[contains(., "CATEGORIES")]/following-sibling::div//li//a/@href here
I also don't like the imgWrapper class being used to find a product link (could it be that products missing images are skipped?). Why not just //ul[@id="prodResult"]/li//a/@href? This would, though, bring in some duplicates, which you can address separately. But you can also look for the link in the "info" section of the product container: //ul[@id="prodResult"]/li//div[contains(@class, "infoBox")]//a/@href (see the sketch after this list).
There can also be an anti-bot, anti-web-scraping strategy deployed that may temporarily ban your IP and/or User-Agent, or even obfuscate the response. Check for that too.
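Putting the two selector suggestions together, a looser dig_up_products() might look roughly like this (a sketch, untested against the live site):

def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)

    # subcategories: don't depend on the exact carousel id
    for link in page.xpath(
        '//h2[contains(., "CATEGORIES")]/following-sibling::div//li//a/@href'
    ):
        yield from dig_up_products(link)

    # product links taken from the "info" box rather than the image wrapper
    for link in page.xpath(
        '//ul[@id="prodResult"]/li//div[contains(@class, "infoBox")]//a/@href'
    ):
        yield link

    # pagination: unchanged from the original
    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)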
As pointed out by @mzjn and @alecxe, some websites employ anti-scraping measures. To hide their intentions, scrapers should try to mimic a human visitor.
One particular way for websites to detect a scraper is to measure the time between subsequent page requests, which is why scrapers typically keep a (random) delay between requests.
Besides, hammering a web server that is not yours without giving it some slack is not considered good netiquette.
From Scrapy's documentation:
RANDOMIZE_DOWNLOAD_DELAY
Default: True
If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY) while fetching requests from the same website.
This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.
The randomization policy is the same used by wget --random-wait option.
If DOWNLOAD_DELAY is zero (default) this option has no effect.
Oh, and make sure the User-Agent string in your HTTP request resembles that of an ordinary web browser.
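Outside of Scrapy, the same two ideas (a randomized delay and a browser-like User-Agent) can be bolted onto the requests session from the question; the header string and delay values below are only examples:

from random import uniform
from time import sleep
from requests import Session

session_ = Session()
# present a browser-like User-Agent instead of the default python-requests one
session_.headers['User-Agent'] = (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
)

def polite_get(link, base_delay=1.0):
    # wait between 0.5x and 1.5x the base delay, mirroring the
    # RANDOMIZE_DOWNLOAD_DELAY behaviour quoted above
    sleep(uniform(0.5 * base_delay, 1.5 * base_delay))
    return session_.get(link).text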
Further reading:
https://exposingtheinvisible.org/guides/scraping/
Sites not accepting wget user agent header
I am trying to get the names of the members of a group I belong to. I am able to get the names on the first page, but I'm not sure how to go to the next page:
My Code:
import json
import urllib2

url = 'https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>'
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
for each in data['data']:
    print each['name']
Using the code above I am successfully getting all the names on the first page, but the question is: how do I go to the next page?
In the Graph API Explorer output screen I can see that the response also contains paging information.
What change does my code need to keep going to the next pages and get the names of ALL members of the group?
The JSON returned by the Graph API is telling you where to get the next page of data, in data['paging']['next']. You could give something like this a try:
def printNames():
    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)
    for each in data['data']:
        print each['name']
    return data['paging']['next']  # Return the URL to the next page of data

url = 'https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>'
url = printNames()
print "====END OF PAGE 1===="
url = printNames()
print "====END OF PAGE 2===="
You would need to add checks, for instance ['paging']['next'] will only be available in your JSON object if there is a next page, so you might want to modify your function to return a more complex structure to convey this information, but this should give you the idea.
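A sketch of such a loop, which keeps following data['paging']['next'] until it disappears (same urllib2/json calls as in the question; the function name is just an example):

import json
import urllib2

def get_all_names(start_url):
    names = []
    url = start_url
    while url:
        data = json.load(urllib2.urlopen(url))
        names.extend(member['name'] for member in data['data'])
        # 'paging'/'next' is only present when there is another page
        url = data.get('paging', {}).get('next')
    return names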
I've written a function in Python that gets all the links on a page.
Then, I run that function for all of the links that first function returned.
My question is, if I were to keep on doing this using CNN as my starting point, how would I know when I had crawled all (or most) of CNN's webpages?
Here's the code for the crawler.
from mechanize import Browser

base_url = "http://www.cnn.com"
title = "cnn"
my_file = open(title + ".txt", "w")

def crawl(site):
    seed_url = site
    br = Browser()
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.open(seed_url)
    link_bank = []
    for link in br.links():
        if link.url[0:4] == "http":
            link_bank.append(link.url)
        if link.url[0] == "/":
            url = link.url
            if url.find(".com") == -1:
                if url.find(".org") == -1:
                    link_bank.append(base_url + link.url)
                else:
                    link_bank.append(link.url)
            else:
                link_bank.append(link.url)
        if link.url[0] == "#":
            link_bank.append(base_url + link.url)
    link_bank = list(set(link_bank))
    for link in link_bank:
        my_file.write(link + "\n")
    return link_bank

my_file.close()
I did not specifically look into your code, but you should look up how to implement a breadth-first-search, and additionally store already visited URLs in a set. If you find a new URL in the currently visited page, append it to the list of URLs to visit, if it wasn't in the set already.
You might need to ignore the query string (everything after the question mark in a URL).
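A rough sketch of that approach, where get_links(url) stands in for something like the crawl() function above and max_pages is just a safety cap:

from collections import deque

def crawl_bfs(seed_url, get_links, max_pages=10000):
    visited = set()
    queue = deque([seed_url])
    while queue and len(visited) < max_pages:
        # drop the fragment and query string so near-duplicate URLs collapse
        url = queue.popleft().split('#')[0].split('?')[0]
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return visited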
The first thing that comes to mind is to keep a set of visited links. Each time you request a link, add it to the set; before requesting a link, check that it is not already in the set.
Another point is that you are actually reinventing the wheel here: the Scrapy web-scraping framework has a link-extracting mechanism built in, and it's worth using.
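A minimal sketch of what that could look like (spider name, domain and yielded fields are assumptions):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CnnSpider(CrawlSpider):
    name = 'cnn'
    allowed_domains = ['cnn.com']
    start_urls = ['http://www.cnn.com']

    # follow every in-domain link; Scrapy deduplicates requests for us
    rules = [Rule(LinkExtractor(), callback='parse_page', follow=True)]

    def parse_page(self, response):
        yield {'url': response.url}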
Hope that helps.
I'm using scrapy to crawl a multilingual site. For each object, versions in three different languages exist. I'm using the search as a starting point. Unfortunately the search contains URLs in various languages, which causes problems when parsing.
Therefore I'd like to preprocess the URLs before they get sent out. If they contain a specific string, I want to replace that part of the URL.
My spider extends CrawlSpider. I looked at the docs and found the make_requests_from_url(url) method, which led to this attempt:
def make_requests_from_url(self, url):
    """
    Override the original function to make sure only German URLs are
    being used. If French or Italian URLs are detected, they're
    rewritten.
    """
    if '/f/suche' in url:
        self.log('French URL was rewritten: %s' % url)
        url = url.replace('/f/suche/pages/', '/d/suche/seiten/')
    elif '/i/suche' in url:
        self.log('Italian URL was rewritten: %s' % url)
        url = url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return super(MyMultilingualSpider, self).make_requests_from_url(url)
But that does not work for some reason. What would be the best way to rewrite URLs before requesting them? Maybe via a rule callback?
Probably worth noting an example since it took me about 30 minutes to figure it out:
rules = [
    Rule(SgmlLinkExtractor(allow=(all_subdomains,)), callback='parse_item', process_links='process_links')
]

def process_links(self, links):
    for link in links:
        link.url = "something_to_prepend%ssomething_to_append" % link.url
    return links
As you already extend CrawlSpider you can use process_links() to process the URLs extracted by your link extractors (or process_request() if you prefer working at the Request level), as detailed here.
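Applied to the question, the same hook can perform the rewrites from the make_requests_from_url() attempt; a sketch, assuming the same URL patterns as above:

def process_links(self, links):
    # rewrite French/Italian search URLs to their German equivalents
    # before the requests are scheduled
    for link in links:
        if '/f/suche' in link.url:
            self.log('French URL was rewritten: %s' % link.url)
            link.url = link.url.replace('/f/suche/pages/', '/d/suche/seiten/')
        elif '/i/suche' in link.url:
            self.log('Italian URL was rewritten: %s' % link.url)
            link.url = link.url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return links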