Selenium and Goodreads' pagination - python

I'm trying to extract information from Goodreads. The problem is that if I go to a URL like:
https://www.goodreads.com/shelf/show/programming?page=2
with the Selenium Chrome webdriver or with BeautifulSoup, it still shows the first page instead of the second one.
Example with the Chrome webdriver: (screenshot omitted)
While in a normal browser, it displays these books instead: (screenshot omitted)

Looks like that happens because you're not logged in in your Selenium session; you'll have to log in and save the cookies between restarts.
Take a look at these Stack Overflow answers to understand how to extract cookies.
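A minimal sketch of that idea, persisting the cookies with pickle between runs (the cookie file name and sign-in URL here are placeholders, and the login itself is done by hand on the first run):

import os
import pickle
from selenium import webdriver

driver = webdriver.Chrome()

if os.path.exists("goodreads_cookies.pkl"):
    # Cookies can only be set for the current domain, so load the site first,
    # then add the saved cookies and navigate to the paginated shelf.
    driver.get("https://www.goodreads.com")
    for cookie in pickle.load(open("goodreads_cookies.pkl", "rb")):
        driver.add_cookie(cookie)
    driver.get("https://www.goodreads.com/shelf/show/programming?page=2")
else:
    # First run: log in by hand, then save the session cookies for next time.
    driver.get("https://www.goodreads.com/user/sign_in")
    input("Log in in the browser window, then press Enter here...")
    pickle.dump(driver.get_cookies(), open("goodreads_cookies.pkl", "wb"))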

Related

Scrapy/BeautifulSoup simulating 'clicking' a button in order to load a section of the website

To give a very simple example, let's take this site: https://www.cardmarket.com/en/Magic/Products/Booster-Boxes/Modern-Horizons-2-Collector-Booster-Box
As you can see, in order to load more listings you need to press the blue "SHOW MORE RESULTS" button, a few times at that. In a nutshell: is there a way to "click" this button using Scrapy or BeautifulSoup in order to gain access to all of the listings on that site? If so, how do I do that? If not, what are the most efficient tools that can do so, so that I can scrape the site? I've heard of Selenium, but I've also heard it's much slower than Scrapy/BeautifulSoup, so I'd prefer to do this with those two, or with another tool.
This seems like a good use case for Selenium. You could use it to simulate a browser session and then hand the page source off to Beautiful Soup as needed.
Try something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# Desired URL
url = "https://www.cardmarket.com/en/Magic/Products/Booster-Boxes/Modern-Horizons-2-Collector-Booster-Box"

# Create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(url)

# Find the "load more" button and click it
python_button = driver.find_element(By.ID, "loadMoreButton")
python_button.click()

# Hand the rendered page source off to Beautiful Soup
soup = BeautifulSoup(driver.page_source, "html.parser")
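Since the button has to be pressed several times, you could also wrap the click in a loop and stop once the button disappears (a sketch, still assuming the button's id is "loadMoreButton"; the implicit wait is lowered first so the final missing-button lookup doesn't block for 30 seconds):

import time
from selenium.common.exceptions import NoSuchElementException

driver.implicitly_wait(2)  # shorten the wait so the last lookup fails fast
while True:
    try:
        driver.find_element(By.ID, "loadMoreButton").click()
        time.sleep(2)  # give the AJAX call time to append the new listings
    except NoSuchElementException:
        break  # button gone: all listings are loaded
soup = BeautifulSoup(driver.page_source, "html.parser")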
If You Want To Avoid Selenium:
The "Load More" button on the site you've linked uses AJAX requests to load more data. If you really want to avoid Selenium, you could try using the requests library to replicate the AJAX request the button makes when it is clicked.
You'll need to monitor the network tab in your browser's developer tools to figure out the request URL and the necessary headers. It will likely take some fiddling to get it just right.
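A rough sketch of that approach (the endpoint, headers, and parameters below are hypothetical placeholders; copy the real ones from the network tab):

import requests

# Values copied from the XHR request in the network tab (placeholders here)
ajax_url = "https://www.cardmarket.com/en/Magic/AjaxAction"  # hypothetical endpoint
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # marks the request as AJAX
}
payload = {"page": 2}  # hypothetical pagination parameter

response = requests.post(ajax_url, headers=headers, data=payload)
response.raise_for_status()
print(response.text)  # the fragment the button would have loaded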
Potentially Relevant:
Simulating ajax request with python using requests lib
I see that this website loads content using AJAX, which is also known as "dynamic page loading", so instead of using "resource heavy" Selenium you can get this done with Requests + bs4.
To start, open up the web page and wait for it to finish its initial load, then press Ctrl+Shift+I to open the inspector, go to the "Network" tab, and click the "Load more" button to load more content. You'll see the AJAX request show up in the network list (screenshot omitted).
If you look at its response, it is base64 encoded. Copy the request as cURL (screenshot omitted).
Now that you have the cURL request in your clipboard, you can easily convert it to Python code using a cURL-to-Python converter website or Postman. There you have it.
You can then base64-decode the response and parse it.
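A minimal sketch of that last step (the URL is a placeholder; use the request your converter generates from the copied cURL command):

import base64
import requests
from bs4 import BeautifulSoup

# The AJAX request reconstructed from the copied cURL command (placeholder URL)
response = requests.get("https://example.com/ajax/load-more", params={"page": 2})

# Decode the base64-encoded body, then parse the HTML inside it
decoded = base64.b64decode(response.text).decode("utf-8")
soup = BeautifulSoup(decoded, "html.parser")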

How to get cookies from all users' web-browsers with Python?

This question is an extension of How to get cookies from web-browser with Python?
I would like to extract cookies from the same URL for 3 accounts logged into Chrome. I've already tried the code from the previous question, and it returns only the cookies from a single user. I have already researched the library, and apparently it does not have this functionality.
This is the code I'm using:
import browser_cookie3
from selenium import webdriver

driver = webdriver.Chrome('C:/Users/pedro/Desktop/chromedriver')

# Load cookies for the given domain from Chrome's cookie store
cookies = browser_cookie3.chrome(domain_name='my/url')
print(cookies._cookies)
Look into the selenium module. It lets you run Chrome (in Selenium terms, a webdriver) from Python.
If you start the Chrome browser with Selenium (and store the driver in a variable), you can log in as you normally would.
Then cookies = driver.get_cookies() gives you the session's cookies as a list of dictionaries, ready to use.
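A minimal sketch of that flow, repeated once per account (the sign-in URL is a placeholder; each webdriver.Chrome() starts with a fresh temporary profile, so the three logins don't collide):

from selenium import webdriver

all_cookies = {}
for account in ["user1", "user2", "user3"]:
    driver = webdriver.Chrome()
    driver.get("https://example.com/sign_in")  # placeholder URL
    input(f"Log in as {account}, then press Enter here...")
    all_cookies[account] = driver.get_cookies()  # list of cookie dicts
    driver.quit()  # fresh browser profile for the next account

print(all_cookies)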

Selenium Chrome web driver inconsistently executes JS scripts on webpages

I'm trying to scrape articles on PubChem, such as this one. PubChem requires browsers to have JavaScript enabled; otherwise it redirects to a page with virtually no content that says "This application requires Javascript. Please turn on Javascript in order to use this application". To get around this, I used the Chrome webdriver from the Selenium library to obtain the HTML that PubChem generates with JavaScript.
And it does that about half the time. The rest of the time it does not render the full HTML and redirects to the JavaScript warning page. How do I make the script retrieve the JS version of the site consistently?
I've also tried to overcome this issue with PhantomJS, but PhantomJS somehow does not work on my machine after installation.
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
url = "https://pubchem.ncbi.nlm.nih.gov/compound/"
browser.get(url)

# Grab the JS-rendered body and parse it (html5lib must be installed for this parser)
innerHTML = browser.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(innerHTML, "html5lib")
There are no error messages whatsoever. The only issue is that the scraper sometimes cannot obtain the JS-rendered page as expected. Thank you so much!
Answering my own question, because why not.
You need to quit your browser with
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
# stuff
browser.quit()
and do so right after the last operation that involves the browser, as you risk having the browser cache affect your outputs in the next runs of the script.
Hope that whoever has this issue finds this helpful!
UPDATE EDIT:
So closing the browser does increase the frequency of success, but doesn't make it consistent. Another thing that helped it work more often was running
sudo purge
in the terminal (a macOS command that flushes the disk cache). I'm still not getting consistent results, however. If anyone has an idea of how to do this without brute force (i.e. opening and closing the WebDriver until it renders the proper page), please let me know! Many thanks
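Until a cleaner fix turns up, the brute-force approach can at least be automated. A sketch of a retry loop that re-opens the driver until the rendered page no longer contains the JS warning (the warning text comes from the question above; the attempt cap is an arbitrary choice):

from selenium import webdriver

def get_rendered_html(url, max_attempts=5):
    for attempt in range(max_attempts):
        browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
        try:
            browser.get(url)
            html = browser.execute_script("return document.body.innerHTML")
        finally:
            browser.quit()  # always release the browser between attempts
        if "This application requires Javascript" not in html:
            return html  # looks like the fully rendered page
    raise RuntimeError("Page still not rendered after %d attempts" % max_attempts)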

Webdriver: how to get a page that has already been opened

driver = webdriver.Chrome()
driver.get(url)
A webdriver is used to open a web page.
But what if I turn that around: first open a web page manually, and then use the webdriver to access its source. Is that feasible?
(This is my first time asking a question here; I did not find a forum on Python, so I don't know if this is the right place!)
For example:
First, open stackoverflow.com manually in Firefox.
Then use Python's webdriver to get its source code.
Is there a way?
(My English is not good; this is automatic translation.)
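Out of the box Selenium cannot attach to a browser window that was opened by hand. With Chrome there is a known workaround: start Chrome yourself with remote debugging enabled, then point Selenium at that debugging port (a sketch; the port number 9222 is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# First start Chrome manually from a terminal with, e.g.:
#   chrome --remote-debugging-port=9222
# then browse to the page you want and run this script to attach to it.
options = Options()
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
driver = webdriver.Chrome(options=options)

print(driver.page_source)  # source of whatever page is currently open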

How to get html that I see in inspect element?

I'm programming a web-scraper app in Python. The website I want to scrape loads its data with JS.
How can I get the source that I see in "inspect element"?
With JavaScript-rendered pages pycurl will not work; you need Selenium to get the stuff you need.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("your_url")
html = driver.page_source  # the JS-rendered HTML, as seen in "inspect element"
Make sure you have Firefox (or another browser Selenium supports) installed, along with its matching driver (geckodriver, in Firefox's case).
