Scraping a paginated website: Scraping page 2 gives back page 1 results - python

I am using the get method of the requests library in Python to scrape information from a website which is organized into pages (i.e. paginated, with numbers at the bottom).
Page 1 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian
I am able to extract the data that I need from the first page, but when I feed my code the URL for the second page, I get the same data as the first page. After carefully analyzing my code, I am certain the issue is not my code logic but the way the second page URL is structured.
So my question is: how can I get my code to work as I want? I suspect it is a question of parameters, but I am not 100% sure. If it is indeed parameters that I need to pass to the request, I would appreciate some guidance on how to break them down. My page 2 link is attached below.
Thanks.
Page 2 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
Note: The pages are not really links per se.

It looks like the platform is ASP.NET and the pagination links are driven by JS. I seriously doubt you will have it easy with Python, since BeautifulSoup is an HTML parser/extractor, so if you really want to use this site, I would suggest looking into Selenium or even PhantomJS, since they fully replicate the browser.
But in this particular case you are lucky, because there's a legacy website version which doesn't use modern bells and whistles :)
http://legacy.realfood.tesco.com/recipes/search.html?st=vegetarian&cr=False&page=3&srt=searchRelevance
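For example, a quick sketch of fetching one page of results from the legacy site with requests and BeautifulSoup (the page parameter does all the work; treat this as a starting point, not something tested against the live site):

import requests
from bs4 import BeautifulSoup

# The legacy search endpoint takes the page number as a plain query parameter
params = {'st': 'vegetarian', 'cr': 'False', 'page': 3, 'srt': 'searchRelevance'}
resp = requests.get('http://legacy.realfood.tesco.com/recipes/search.html', params=params)
soup = BeautifulSoup(resp.content, 'html.parser')
print(soup.title)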

It looks like the pagination of this site is handled by the query parameters passed in the second URL you posted, ie:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
The query string is URL-encoded: %3D is = and %26 is &. It might be more readable like this:
q='selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian'
For example, if you wanted to pull back the fifth page of Vegetarian Recipes the URL would look like this:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D5%26perpage%3D30%26DietaryOption%3DVegetarian'
You can keep incrementing the page number until you get a page with no results.
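Decoding and rebuilding that string is straightforward with urllib.parse. One caveat worth knowing: everything after the # is a URL fragment, which the site's JS interprets client-side and requests never sends to the server, which is exactly why fetching the page 2 link returns page 1 data. A small sketch:

from urllib.parse import quote, unquote

frag = "selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian"
print(unquote(frag))
# selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian

# Rebuild the link for an arbitrary page number
n = 5
link = ("https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='"
        + quote("selectedobjecttype=RECIPES&page=%d&perpage=30&DietaryOption=Vegetarian" % n, safe="")
        + "'")
print(link)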

What about this?
from bs4 import BeautifulSoup
import urllib.request

# Note: the fragment-based URLs on the main site won't paginate over plain HTTP,
# so this uses the legacy URL from the answer above, which takes page as a query parameter
for numb in range(1, 11):
    url = ("http://legacy.realfood.tesco.com/recipes/search.html"
           "?st=vegetarian&cr=False&page=%d&srt=searchRelevance" % numb)
    resp = urllib.request.urlopen(url)
    soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        print(link['href'])
Hopefully it works for you. I can't test it because my office blocks these kinds of things. I'll try it when I get home tonight to see if it does what it should do...

Related

Python Requests only pulling half of intended tags

I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_='eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_='eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and every sheriff from Curtis Landers onward is missing (I tried pasting the full output of page.content but it's too long).
My best guess from reading this answer is that the website has javascripts that load the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page to get the rest of it to first load.
However, if you look at the website, it's very simple, so as a novice part of me thinks that there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is WordPress-generated, and WordPress can set up delayed JavaScript on even simple websites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just Requests? Is there any way to tell when web pages will require a web driver and when Requests will not be enough?
2) I'm thinking one step ahead here: if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results of every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
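If you do end up on Selenium, the usual pattern for scroll-triggered lazy loading is to keep scrolling until the page height stops growing, then hand the rendered source to BeautifulSoup. A minimal sketch (the class name is taken from the question's code; the fixed sleep is a crude assumption, an explicit wait would be more robust):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://oregonsheriffs.org/about-ossa/meet-your-sheriffs')

# Scroll until the document height stops changing, i.e. nothing new gets lazy-loaded
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazy loader time to fetch the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, 'html.parser')
sheriff_names = soup.find_all('a', class_='eg-sheriff-list-skin-element-1')
driver.quit()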

How can I scrape a site with multiple pages using beautifulsoup and python?

I am trying to scrape a website. This is a continuation of this
soup.findAll is not working for table
I was able to obtain the needed data, but the site has multiple pages which vary by the day; some days it can be 20 pages, and 33 pages on another. I was trying to implement this solution, by obtaining the last page element, from How to scrape the next pages in python using Beautifulsoup,
but when I got to the pager div on the site I want to scrape, I found this format:
<a class="ctl00_cph1_mnuPager_1" href="javascript:__doPostBack('ctl00$cph1$mnuPager','32')">32</a>
<a class="ctl00_cph1_mnuPager_1">33</a>
How can I scrape all the pages on the site, given that the number of pages changes daily?
By the way, the page URL does not change as you move between pages.
BS4 will not solve this issue, because it can't run JS.
First, you can try Scrapy and this answer.
Alternatively, you can use Selenium for it.
I would learn how to use Selenium -- it's simple and effective in handling situations where BS4 won't do the job.
You can use it to log into sites, enter keys into search boxes, and click buttons on the screen. Not to mention, you can watch what it's doing with a browser.
I use it even when I'm doing something in BS4, to better monitor the progress of a scraping project.
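A minimal Selenium sketch for an ASP.NET __doPostBack pager like the one in the question (the URL is a placeholder, the fixed sleep is a crude stand-in for an explicit wait, and the pager class comes from the markup you posted):

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com/the-page-to-scrape')  # placeholder URL

page = 1
while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ... extract the table data for the current page from soup here ...
    page += 1
    # Each pager link carries its page number as visible text; when there is
    # no link for the next page, we have reached the last one
    links = driver.find_elements(
        By.XPATH, "//a[@class='ctl00_cph1_mnuPager_1' and text()='%d']" % page)
    if not links:
        break
    links[0].click()  # triggers __doPostBack and reloads the table
    time.sleep(2)

driver.quit()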
Like some people have mentioned, you might want to look at Selenium. I wrote a blog post about doing something like this a while back: http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/
Now things are much better with headless Chrome and Firefox.
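For completeness, a tiny sketch of running Chrome headless (no visible browser window, so it works on servers and in containers):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder URL
print(driver.title)
driver.quit()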
Okay, so if I'm understanding correctly, there's an undetermined number of pages that you want to scrape? I had a similar issue, if that's the case. Look at the inspected page and see if there is an element that doesn't exist there but exists on the pages with content.
In my for loop I used something like this:
import requests
from bs4 import BeautifulSoup

base_url = 'url here'
# 5000 is just a large number that what I was searching for wouldn't reach
for n in map(str, range(1, 5000)):
    url = base_url + n  # n is the page number at the end of my URL
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # <figure> is the element that didn't exist after the pages with content finished
    figure = soup.find_all("figure")
    if not figure:
        # break out of the page iterations and jump to my other listing
        # in another URL after there wasn't any content left on the last page
        break
I hope this helps some, or helps cover what you needed.

Simulate clicking a link when scraping with Python and BeautifulSoup

After reading for years, this is my first SO question. Thanks in advance for the help!
I'm looking to scrape content from articles on the Forbes website. Take this as an example page: http://www.forbes.com/sites/katevinton/2015/09/22/google-microsoft-qualcomm-and-baidu-announce-joint-investment-cloudflare/. When an article is loaded directly, the page source becomes a mess of JavaScript that is hard to parse. However, when I click on the 'print' button, it appends "/print/" to the URL and gives me a page I have no problem parsing with BeautifulSoup.
When I enter the URL with "/print/" appended, it redirects to the non-"/print/" page. I only get to the actual "/print/" page when I click on the button. Thus, my question is: how can I simulate clicking that print button programmatically to get to the page BeautifulSoup can scrape? Poking around, people seem to recommend mechanize for simulating browser actions, but I'm not sure what I'd be trying to do with it in this case. Or is there a better way to scrape this data entirely?
I appreciate any help you can offer!
You need to request it with the referer header set, so something like this would work:
import requests
url = "http://www.forbes.com/sites/samsungbusiness/2015/09/23/how-your-car-is-becoming-the-next-hot-tech-gadget/print/"
# Sending the article page as the referer stops the redirect away from /print/
print(requests.get(url, headers={"referer": url.replace("print/", "")}).content)

Python Requests Getting HTML Code from Wrong Page

So I started using Python's requests library to look at the HTML code of some websites. I would do r = requests.get(url) to get all the information I need.
However, I noticed that this doesn't always work. For example, I'm using steamcommunity.com to get some market data, so I used the URL http://steamcommunity.com/market/search. This brings up the first page of items out of over 7,000 pages. To get a different page, I used this URL: http://steamcommunity.com/market/search#p4_quantity_desc. This should take you to the 4th page of the website, and it does if you put it in your browser. But when I go to read the HTML code, I get the same code from both URLs, even though they should point to different pages with different code.
Any help is greatly appreciated! Thanks!
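Note: as with the Tesco question above, everything after the # is a URL fragment. The site's JS reads it client-side, and requests never sends it to the server, so both URLs fetch identical HTML. A quick sketch showing what actually gets requested:

from urllib.parse import urlsplit

url = 'http://steamcommunity.com/market/search#p4_quantity_desc'
parts = urlsplit(url)
print(parts.path)      # /market/search -- the only part the server sees
print(parts.fragment)  # p4_quantity_desc -- never leaves the browser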

Python Scraping fb comments from a website

I have been trying to scrape Facebook comments using Beautiful Soup on the website page below.
import BeautifulSoup
import urllib2
import re
url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)
fb_comment = soup("div", {"class":"postText"}).find(text=True)
print fb_comment
The output is a null set. However, I can clearly see that the Facebook comment is within those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering if the approach is correct and where I am going wrong.)
Like Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it thanks to Selenium (http://seleniumhq.org), then use BeautifulSoup on the output.
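A minimal sketch of that Selenium-then-BeautifulSoup hand-off (the postText class comes from the question's code; the fixed sleep is a crude assumption for the AJAX comments to render):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
time.sleep(5)  # wait for the dynamically loaded comments

soup = BeautifulSoup(driver.page_source, 'html.parser')
for comment in soup.find_all('div', {'class': 'postText'}):
    print(comment.get_text())
driver.quit()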
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
The parts of the page you are looking for are not included in the source file. Use a browser and you can see this for yourself by opening the page source.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.
