I'm trying to crawl a website which returns an error message saying that my JS is disabled and I might be a bot. I reproduced the same behaviour in a web browser: with JavaScript disabled I get the same response, but with JavaScript enabled the original response comes through unchanged, so the original response itself is not dependent on JS.
So I was wondering if I can tell the web/HTTP server that my JS is enabled and I'm not a bot.
Is this possible with the Python requests library, or any other Python library for that matter?
And yes, I've set the User-Agent header, and all the other headers too: Host, Language, Connection, etc.
If the site is just checking whether JavaScript can be executed (by running some JS), use Selenium to fetch the page, then use BeautifulSoup to parse the page source that Selenium got.
from bs4 import BeautifulSoup
from selenium import webdriver

# let a real browser execute the page's JavaScript
driver = webdriver.Firefox()
driver.get('http://your-site/url')

# grab the rendered source and hand it to BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
...
Here is the site I am trying to scrape data from:
https://www.onestopwineshop.com/collection/type/red-wines
import requests
from bs4 import BeautifulSoup

url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
# print(response.text)
soup = BeautifulSoup(response.content, 'lxml')
That's the code I have so far.
The HTML content I get from BeautifulSoup is different from what I see in the browser's inspector.
My guess is that they are preventing me from getting their data because they detected I am not accessing the site with a browser. If so, is there any way to bypass that?
(Update) Attempt with selenium:
from selenium import webdriver
import time

# use a raw string so the backslashes in the Windows path are not treated as escapes
path = r"C:\Program Files (x86)\chromedriver.exe"

# start web browser
browser = webdriver.Chrome(path)

# navigate to the page
url = "https://www.onestopwineshop.com/collection/type/red-wines"
browser.get(url)

# sleep the required amount to let the page load
time.sleep(3)

# get source code
html = browser.page_source

# close web browser
browser.close()
Update 2: (screenshot of the page loaded with DevTools)
Any website whose content is loaded after the initial page load is unavailable to BS4 with your current method. This is because that content is loaded with an AJAX call via JavaScript, and the requests library is unable to parse and run JS code.
To get at it you will have to look at something like Selenium, which controls a browser via Python (or other languages). There is a separate driver for each browser, i.e. Firefox, Chrome, etc.
Personally I use Chrome, so the drivers can be found here:
https://chromedriver.chromium.org/downloads
Download the correct driver for your version of Chrome.
Install selenium via pip.
Create a scrape.py file and put the driver in the same folder.
Then, to get the HTML string to use with BS4:
from selenium import webdriver
import time

# start web browser
browser = webdriver.Chrome()

# navigate to the page
browser.get('http://selenium.dev/')

# sleep the required amount to let the page load
time.sleep(2)

# get source code
html = browser.page_source

# close web browser
browser.close()
You should then be able to use the html variable with BS4
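For example, a minimal continuation (the selector here is just an illustration):
from bs4 import BeautifulSoup

# parse the rendered source that selenium returned
soup = BeautifulSoup(html, 'html.parser')

# e.g. print every link on the page
for a in soup.find_all('a', href=True):
    print(a['href'])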
I'll actually turn my comment into an answer, because it is a solution to your problem:
As others said, this page is loaded dynamically, but there are ways to retrieve the data without running JavaScript. In your case, you want to look at the "Network" tab of your dev tools and filter for "fetch" requests.
This could be particularly interesting for you:
You don't need Selenium or BeautifulSoup at all; you can just use requests and parse the JSON, if you are good enough ;)
Here is a working cURL request: curl 'https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines' -H 'tenant: one-stop-wine-shop'
You get an error if you don't add the tenant header.
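If you would rather do it from Python, here is a minimal sketch of the same request with the requests library (the URL and the tenant header come straight from the cURL command above; the key names used to walk the JSON are assumptions, so inspect the response first):
import requests

url = "https://api.commerce7.com/v1/product/for-web"
params = {"collectionSlug": "red-wines"}
headers = {"tenant": "one-stop-wine-shop"}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
data = response.json()

# "products" and "title" are assumed key names - check data.keys() first
for product in data.get("products", []):
    print(product.get("title"))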
And that's it: no HTML parsing, no waiting for the page to load, no JavaScript running. Much more powerful than the Selenium solution.
I have recently started learning web scraping with Scrapy and, as a practice, I decided to scrape a weather data table from this URL.
By inspecting the table element of the page, I copied its XPath into my code, but I only get an empty list when running it. I tried to check which tables are present in the HTML using this code:
from scrapy import Selector
import requests
import pandas as pd
url = 'https://www.wunderground.com/history/monthly/OIII/date/2000-5'
html = requests.get(url).content
sel = Selector(text=html)
table = sel.xpath('//table')
It only returns one table, and it is not the one I wanted.
After some research, I found out that it might have something to do with JavaScript rendering in the page source code, and that Python requests can't handle JavaScript.
After going through a number of SO Q&As, I came upon the requests-html library, which can apparently handle JS execution, so I tried acquiring the table with this snippet:
from requests_html import HTMLSession
from scrapy import Selector

session = HTMLSession()
resp = session.get('https://www.wunderground.com/history/monthly/OIII/date/2000-5')

# render() executes the page's JavaScript (downloads Chromium on first use)
resp.html.render()

html = resp.html.html
sel = Selector(text=html)
tables = sel.xpath('//table')
print(tables)
But the result doesn't change. How can I acquire that table?
Problem
Multiple problems may be at play here: not only JavaScript execution, but also HTML5 APIs, cookies, the user agent, etc.
Solution
Consider using Selenium with a headless Chrome or Firefox web driver. Using Selenium with a web driver ensures that the page is loaded as intended. Headless mode means you can run your code without spawning the GUI browser; you can, of course, disable headless mode to see what's being done to the page in real time, and even add a breakpoint so that you can debug beyond pdb in the browser's console.
Example Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.wunderground.com/history/monthly/OIII/date/2000-5")

# Selenium 4 style; there are several APIs to locate elements available.
tables = driver.find_elements(By.XPATH, '//table')
print(tables)

driver.quit()
References
Selenium Github: https://github.com/SeleniumHQ/selenium
Selenium (Python) Documentation: https://selenium-python.readthedocs.io/getting-started.html
Locating Elements: https://selenium-python.readthedocs.io/locating-elements.html
You can use the scrapy-splash plugin to make Scrapy work with Splash (Scrapinghub's JavaScript rendering browser).
Using Splash you can render JavaScript and also trigger user events like mouse clicks.
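A minimal sketch of a spider using SplashRequest (this assumes a Splash instance running on localhost:8050 and the scrapy-splash middlewares enabled in settings.py, as described in the plugin's README; the URL and XPath are placeholders):
import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        # render the page in Splash and wait for the JS to finish
        yield SplashRequest(
            "https://example.com",
            callback=self.parse,
            args={"wait": 2},
        )

    def parse(self, response):
        # response now contains the JS-rendered HTML
        for row in response.xpath("//table//tr"):
            yield {"cells": row.xpath(".//td/text()").getall()}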
I was hoping someone could help me understand what's going on:
I'm using Selenium with the Firefox browser to download a PDF (I need Selenium to log in to the corresponding website):
le = browser.find_elements_by_xpath('//*[@title="Download PDF"]')
time.sleep(5)
if le:
    pdf_link = le[0].get_attribute("href")
    browser.get(pdf_link)
The code does download the PDF, but after that the browser just stays idle.
This seems to be related to the following browser settings:
fp.set_preference("pdfjs.disabled", True)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
If I disable the first, it doesn't hang but opens the PDF instead of downloading it. If I disable the second, a "Save As" pop-up window shows up. Could someone explain how to handle this?
For me, the best way to solve this was to let Firefox render the PDF in the browser via pdf.js and then send a subsequent fetch via the Python requests library with Selenium's cookies attached. More explanation below:
There are several ways to render a PDF via Firefox + Selenium. If you're using a recent version of Firefox, it will most likely render the PDF via pdf.js so you can view it inline. This isn't ideal, because now we can't download the file.
You can disable pdf.js via Selenium options, but this will likely lead to the issue in this question, where the browser gets stuck. This might be because of an unknown MIME type, but I'm not totally sure. (There's another StackOverflow answer that says this is also due to Firefox versions.)
However, we can bypass this by passing Selenium's cookie session to requests.session().
Here's a toy example:
import requests
from selenium import webdriver

pdf_url = "/url/to/some/file.pdf"

# set up the driver; pass your configured Options/profile here
driver = webdriver.Firefox()

# do whatever you need to do to auth/login/click/etc.

# navigate to the PDF URL in case the PDF link issues a
# redirect, because requests.session() does not persist cookies
driver.get(pdf_url)

# get the URL from Selenium
current_pdf_url = driver.current_url

# create a requests session
session = requests.session()

# add Selenium's cookies to requests
selenium_cookies = driver.get_cookies()
for cookie in selenium_cookies:
    session.cookies.set(cookie["name"], cookie["value"])

# Note: if headers are also important, you'll need to use
# something like seleniumwire to get the headers from Selenium

# finally, re-send the request with requests.session
pdf_response = session.get(current_pdf_url)

# access the bytes response from the session
pdf_bytes = pdf_response.content
I highly recommend using seleniumwire over regular selenium, because it extends Python Selenium to let you inspect request headers, wait for requests to finish, use proxies, and much more.
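As a taste of what that looks like, a small sketch using seleniumwire's documented .requests attribute (the URL is a placeholder):
from seleniumwire import webdriver  # drop-in replacement for selenium's webdriver

driver = webdriver.Firefox()
driver.get("https://example.com")

# every request the browser issued, with responses attached
for request in driver.requests:
    if request.response:
        print(request.url, request.response.status_code)

driver.quit()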
Here is my Python code:
import pandas as pd
import pandas_datareader.data as web
import bs4 as bs
import urllib.request as ul
from selenium import webdriver
from matplotlib import style  # needed for style.use below

style.use('ggplot')
driver = webdriver.PhantomJS(executable_path='C:\\Phantomjs\\bin\\phantomjs.exe')

def getBondRate():
    #driver.deleteAllCookies();
    url = "https://www.marketwatch.com/investing/index/tnx?countrycode=xx"
    driver.get(url)
    driver.implicitly_wait(10)
    html = driver.page_source
    return html

bondRate = getBondRate()
print(bondRate)
A few days back it was reading perfectly fine from MarketWatch. Now the body tag comes back empty. Is Selenium not loading the page?
Do you require the HTML tags as well? If not, you can try retrieving just the text of the body tag. Here's how I would do it using Java:
String src=driver.findElement(By.tagName("body")).getText();
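The Python equivalent would be something like:
from selenium.webdriver.common.by import By

# only the visible text of the page, no tags
src = driver.find_element(By.TAG_NAME, "body").text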
As per the URL https://www.marketwatch.com/investing/index/tnx?countrycode=xx, the behavior you are observing is pretty much justified.
I have taken your code and, with a simple tweak, tried to extract the page_source with PhantomJS as well as ChromeDriver. It is observed that whichever WebDriver variant you use, the WebDriver fingerprints are getting detected and a fingerprinting error is raised as follows:
Error details:
Failed to load resource: the server responded with a status of 404 (Not Found)
kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=058cbc6a-f8b8-f175-ca68-8c2e0fd6a4e3:1 Fingerprinting error
name: Error
message: Error issuing AJAX request (status code: 404)
stack: Error: Error issuing AJAX request (status code: 404)
at XMLHttpRequest.N.a.onreadystatechange (https://www.marketwatch.com/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=058cbc6a-f8b8-f175-ca68-8c2e0fd6a4e3:1:1884)
DevTools failed to parse SourceMap: https://www.marketwatch.com/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/fingerprint.js.map
You can find relevant discussions in:
Browser Automation with Selenium: Fingerprints, recognizability and traceability?
Can a website detect when you are using selenium with chromedriver?
Selenium Webdriver is detectable
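If you want to experiment anyway, a couple of commonly suggested ChromeOptions tweaks are sketched below; they hide some of the obvious automation markers (like navigator.webdriver) but are in no way guaranteed to get past fingerprinting scripts like the one above:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# hide the navigator.webdriver flag that many detection scripts check
options.add_argument("--disable-blink-features=AutomationControlled")
# drop the "Chrome is being controlled by automated test software" switch
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options)
driver.get("https://www.marketwatch.com/investing/index/tnx?countrycode=xx")
print(driver.page_source[:500])
driver.quit()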
I am working on a URL using Python.
If I click the URL, I am able to get the Excel file.
But if I run the following code, it gives me weird output:
>>> import urllib2
>>> urllib2.urlopen('http://intranet.stats.gov.my/trade/download.php?id=4&var=2012/2012%20MALAYSIA%27S%20EXPORTS%20BY%20ECONOMIC%20GROUPING.xls').read()
Output:
"<script language=javascript>window.location='2012/2012 MALAYSIA\\'S EXPORTS BY ECONOMIC GROUPING.xls'</script>"
Why is urllib2 not able to read the content?
Take a look using an HTTP listener (or even Google Chrome Developer Tools): there's a redirect done with JavaScript when you get to the page.
You will need to access the initial URL, parse the result, and then fetch the actual URL.
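If you want to stay with plain HTTP, here is a minimal sketch of that approach with requests (it assumes the redirect always arrives as the window.location assignment shown above):
import re
import requests
from urllib.parse import urljoin

start_url = ("http://intranet.stats.gov.my/trade/download.php"
             "?id=4&var=2012/2012%20MALAYSIA%27S%20EXPORTS%20BY%20ECONOMIC%20GROUPING.xls")

resp = requests.get(start_url)

# pull the target out of: window.location='...' (quotes may be escaped as \')
match = re.search(r"window\.location\s*=\s*'((?:[^'\\]|\\.)*)'", resp.text)
if match:
    target = match.group(1).replace("\\'", "'")
    # resolve the (possibly relative) target against the original URL
    file_resp = requests.get(urljoin(start_url, target))
    with open("exports.xls", "wb") as f:
        f.write(file_resp.content)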
Alternatively, @Kai in this question seems to have found an answer to JavaScript redirects using the Selenium module:
from selenium import webdriver
import time

driver = webdriver.Firefox()
link = "http://yourlink.com"
driver.get(link)

# this waits for the new page to load
while link == driver.current_url:
    time.sleep(1)

redirected_url = driver.current_url