How do I scrape website HTML after JavaScript has run? - python

So I was trying to scrape a website. When I scrape it, the result isn't the same as what I see when I right-click and view page source in Mozilla Firefox or Google Chrome.
The code I used:
import urllib
page = urllib.urlopen("http://www.google.com/search?q=python")
#or any other website that uses search
python = page.read()
print python
It turns out that the code only fetches the 'raw' web page, which isn't what I wanted. For websites like this, I want the HTML after the JavaScript etc. has run, so that the result is the same as when you right-click and view the source code in your browser.
Is there any other way of doing this?

It's not exactly a raw page; it's an error page that Google returned to you.
At the top of the printed output it says:
Your client does not have permission to get URL /search?q=python from this server.
If you were to change your page variable to
page = urllib.urlopen("http://volt.al/")
you'd see the JavaScript in the returned HTML.
Try it out with different pages to see what you get back.
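For instance, a quick check along those lines, sketched in Python 3 (where urllib.urlopen became urllib.request.urlopen):

import urllib.request

page = urllib.request.urlopen("http://volt.al/")
html = page.read().decode("utf-8", errors="replace")

# The <script> tags show up in the raw HTML, but nothing in them has been executed;
# running them requires a real browser engine (e.g. via Selenium).
print(html[:1000])   # first 1000 characters of the raw page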

Related

How to retrieve translation and pronunciation from Papago website using Python? [duplicate]

I am trying to create a script to download an ebook into a PDF. When I try to use BeautifulSoup to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled JavaScript in Chrome, and this same piece of code works for a page like a Stack Overflow answer page. What could be blocking JavaScript on this page, and how can I bypass it?
My code for reference:
import bs4
import requests

# Fetch the page; note that requests does not execute any JavaScript
url = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
url.raise_for_status()
soup = bs4.BeautifulSoup(url.text, "html.parser")
elems = soup.select("p")        # every <p> element in the static HTML
print(elems[0].getText())       # text of the first paragraph
The problem is that the page actually contains no content: to load the content, it needs to run some JS code. The requests.get method does not run JS; it just loads the basic HTML.
What you need to do is emulate a browser, i.e. 'open' the page, run the JS, and then scrape the content. One way to do it is to use a browser driver as described here: https://stackoverflow.com/a/57912823/9805867
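For example, a minimal sketch of that approach with Selenium driving Chrome (this assumes selenium and a matching chromedriver are installed; the fixed sleep is a crude stand-in for a proper explicit wait):

import time

import bs4
from selenium import webdriver

url = ("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/"
       "?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2"
       "%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")

driver = webdriver.Chrome()   # assumes chromedriver is on your PATH
driver.get(url)
time.sleep(5)                 # crude wait for the JS to render the content

# driver.page_source is the DOM *after* JavaScript has run
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elems = soup.select("p")
if elems:
    print(elems[0].getText())

driver.quit()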

Web scraping Twitter pages Using Python

Using my Twitter developer credentials I get Twitter API data from news channels. Now I want to access the source of the news using the URL embedded in the Twitter API data.
I tried using BeautifulSoup and requests to get the content of the Twitter page.
But I keep getting the error 'We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?'
I cleaned the browser cache and tried every browser, but got the same response. Please help me solve this problem.
from bs4 import BeautifulSoup
import requests

url = 'https://twitter.com/i/web/status/1283691878588784642'
# get contents from url (requests does not execute any JavaScript)
content = requests.get(url).content
# parse the raw HTML into a soup object
soup = BeautifulSoup(content, 'lxml')
Do you get the error 'We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?' when running the script, or when visiting the page with your GUI web browser?
Have you tried getting your data through legacy Twitter?
If you get this error when running the script, there is nothing you can do on the browser side, like clearing the browser cache. The only way to get around this problem is to find another way to access the Twitter page in your Python program.
From my experience, the easiest way around this problem is to use geckodriver with Firefox through Selenium, so Twitter gets all the browser features it needs.
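A minimal sketch of that approach, assuming selenium and geckodriver are installed (the wait for an <article> element is a guess at how rendered tweets are marked up and may need adjusting):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()   # assumes geckodriver is on your PATH
driver.get('https://twitter.com/i/web/status/1283691878588784642')

# Wait up to 15 s for some rendered content to appear
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, 'article'))
)

html = driver.page_source      # the DOM after JavaScript has run
driver.quit()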

Python BeautifulSoup && Request to scrape search engines

I'm a little confused about how to do this. I'm not sure if this is correct, but I'm trying to query a search via a URL. I've tried doing this:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://duckduckgo.com/dogs?ia=meanings'
session = requests.session()
r = session.get(url)
soup = bs(r.content, 'html.parser')
I get some html back from the response; however, when I look for all the links it comes up with nothing besides the original search url.
links = soup.find_all('a')
for link in links:
    print(link)
When I do the search in a browser and inspect the HTML, all the links exist, but for some reason they are not coming back to me via my request.
Does anyone have any ideas? I'm trying to build a web-scraping application, and I thought this would be something really easy that I could incorporate into my terminal.
The problem is that the search results and most of the page are dynamically loaded with the help of JavaScript code being executed by the browser. requests only downloads the initial static HTML page; it has no JS engine, since it is not a browser.
You basically have 3 main options:
use the DuckDuckGo API (there is a Python wrapper; maybe there is a better one, please recheck) - this option is preferred
load the page in a real browser through selenium and then parse the HTML, which is now the same complete HTML that you see in your browser
try to explore what requests are made to load the page and mimic them in your BeautifulSoup + requests code. This is the hardest and most fragile approach, and may involve complex logic and JavaScript code parsing (a sketch follows below).
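To illustrate the third option, a hedged sketch; the endpoint below is purely hypothetical, and the real URL and parameters would have to be copied from the browser's Network panel:

import requests

# Hypothetical internal endpoint spotted in the Network panel;
# substitute whatever dev tools actually show for your search.
api_url = 'https://duckduckgo.com/some/internal/endpoint'
params = {'q': 'dogs'}
headers = {'User-Agent': 'Mozilla/5.0'}   # some endpoints reject non-browser clients

r = requests.get(api_url, params=params, headers=headers)
r.raise_for_status()
data = r.json()   # assuming the endpoint returns JSON
print(data)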

Server errors while web scraping https in python

I'm trying to scrape this site https://propaccess.trueautomation.com/ClientDB/Property.aspx?prop_id=17471
I can type the address directly into my URL bar and I get the results I want, but when I scrape it in Python, I only get the source code for a "runtime error" page.
I'm thinking it might have something to do with HTTPS, because I can scrape pages served in the clear, like craigslist.
My code is as follows,
import urllib
import re

domain = "https://propaccess.trueautomation.com/ClientDB/Property.aspx?prop_id=17471"
htmlfile = urllib.urlopen(domain)
htmltext = htmlfile.read()
print htmltext
I'm new to Python, but not to the internet. I was assuming that if I could type the URL into the browser with success, I'd be able to fetch the same URL from Python. That seems not to be the case, and I don't have a clue why.
Thanks.
Mike
Update: If I browse to said URL in a browser I have never used to visit this page, I get the "runtime error" page.
I cannot access the page you linked.
It seems like you are in an authenticated session, and your Python code, of course, has no idea what's going on; it will thus get the "permission denied" sort of result.
If so, you probably want to pass the session cookie when you make the request.
The Requests library hopefully will do what you need.
(http://docs.python-requests.org/en/latest/user/advanced/#session-objects)
Hint: when you do a scraping job, view the web page in incognito mode. How the page looks there will be exactly how it looks to your Python environment.
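A minimal sketch of that idea with a requests Session (which page actually establishes the session cookie is an assumption here):

import requests

session = requests.Session()

# Visit an entry page first so the server can set its session cookies
# (that this is the right entry point is an assumption).
session.get("https://propaccess.trueautomation.com/ClientDB/")

# The same Session object automatically re-sends those cookies.
resp = session.get(
    "https://propaccess.trueautomation.com/ClientDB/Property.aspx",
    params={"prop_id": "17471"},
)
resp.raise_for_status()
print(resp.text[:500])   # first 500 characters of the HTML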

Scrapy: scraping website where targeted items are populated using document.write

I am trying to scrape a website where the targeted items are populated using the document.write method. How can I get the full browser-rendered HTML of the website in Scrapy?
You can't do this in Scrapy alone, as Scrapy will not execute the JavaScript code.
What you can do:
Rely on a headless browser like Selenium, which will execute the JavaScript. Afterwards, use XPath (or simple DOM access) as before to query the web page after the scripts have run (a sketch follows this list).
Understand where the contents come from, and load and parse that source directly instead. Chrome Dev Tools / Firebug might help you with that; have a look at the "Network" panel, which shows the fetched data.
Especially look for JSON, sometimes also XML.
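A hedged sketch of the first option, running Chrome headless via Selenium (the URL and XPath below are hypothetical placeholders; this assumes selenium and chromedriver are installed):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")   # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
driver.get("https://example.com/page-using-document-write")   # hypothetical URL

# Anything written via document.write is now part of the rendered DOM
items = driver.find_elements(By.XPATH, "//div[@class='item']")   # hypothetical XPath
for item in items:
    print(item.text)

driver.quit()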
