How to retrieve translation and pronunciation from Papago website using Python? [duplicate]

I am trying to create a script to download an ebook as a PDF. When I use BeautifulSoup to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled JavaScript in Chrome, and this same piece of code works on a page like a Stack Overflow answer page. What could be blocking JavaScript on this page, and how can I bypass it?
My code for reference:
import bs4
import requests

url = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
url.raise_for_status()
soup = bs4.BeautifulSoup(url.text, "html.parser")
elems = soup.select("p")
print(elems[0].getText())

The problem is that the page itself contains no content; the content is loaded by JavaScript code that runs after the page is opened. requests.get does not run JavaScript, it only fetches the basic HTML.
What you need to do is emulate a browser, i.e. 'open' the page, let the JavaScript run, and then scrape the content. One way to do it is to use a browser driver, as described here: https://stackoverflow.com/a/57912823/9805867

Related

Unable to scrape dynamic web page

I am trying to scrape the table found at https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&1_Filter-Family=595&2_StatusCodeText=4
I tried using BeautifulSoup, but it is unable to parse the info located inside the "body" tag; I get an empty output when I try to parse the table.
How can I work around this?
This page uses JavaScript to add the data, but BeautifulSoup/lxml can't run JavaScript. If you turn off JavaScript in your browser and reload the page, you will see exactly what BeautifulSoup/lxml can get.
You may need Selenium to control a web browser, which can run JavaScript.
Or you can use DevTools in Chrome/Firefox (Network tab) to find the URL that the JavaScript (AJAX/XHR) uses to download the data, and then try that URL with `requests` and `BeautifulSoup`.
I found it uses this URL:
https://ark.intel.com/libs/apps/intel/support/ark/advancedFilterSearch?productType=873&1_Filter-Family=595&2_StatusCodeText=4&forwardPath=/content/www/us/en/ark/search/featurefilter.html&pageNo=1
I didn't check whether requests needs special settings (i.e. cookies, headers) to get it.
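For illustration, a sketch of fetching that endpoint directly with requests. The browser-like User-Agent header is an assumption (the site may require more, such as cookies), and the `tr` selector assumes the endpoint returns an HTML table fragment:

```python
import bs4
import requests

url = ("https://ark.intel.com/libs/apps/intel/support/ark/advancedFilterSearch"
       "?productType=873&1_Filter-Family=595&2_StatusCodeText=4"
       "&forwardPath=/content/www/us/en/ark/search/featurefilter.html&pageNo=1")
headers = {"User-Agent": "Mozilla/5.0"}   # pretend to be a browser

resp = requests.get(url, headers=headers)
resp.raise_for_status()

soup = bs4.BeautifulSoup(resp.text, "html.parser")
for row in soup.select("tr"):             # one line of text per table row
    print(row.get_text(" ", strip=True))
```

If this comes back empty or with an error, inspect the request in the Network tab again and copy any headers or cookies the browser sends.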
You can also use Puppeteer to 'control' the dynamic web page and then scrape the rendered HTML with BeautifulSoup.
See the examples here: https://github.com/puppeteer/puppeteer/tree/master/examples

Python BeautifulSoup && Request to scrape search engines

I'm a little confused about how to do this. I'm not sure if this is correct, but I'm trying to run a search query via a URL. I've tried this:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://duckduckgo.com/dogs?ia=meanings'
session = requests.session()
r = session.get(url)
soup = bs(r.content, 'html.parser')
I get some HTML back from the response; however, when I look for all the links, it comes up with nothing besides the original search URL.
links = soup.find_all('a')
for link in links:
    print(link)
When I do the search in a browser and inspect the HTML, all the links exist, but for some reason they are not coming back to me via my request.
Does anyone have any ideas? I'm trying to build a web-scraping application, and I thought this would be something really easy that I could incorporate into my terminal.
The problem is that the search results and most of the page are loaded dynamically by JavaScript code that the browser executes. requests only downloads the initial static HTML page; it has no JS engine, since it is not a browser.
You have basically three main options:
use the DuckDuckGo API (there is a Python wrapper; maybe there is a better one, please recheck) - this option is preferred
load the page in a real browser through Selenium and then parse the HTML, which is now the same complete HTML that you see in your browser
explore what requests are made to load the page and mimic them in your BeautifulSoup+requests code. This is the hardest and most fragile approach, and it may involve complex logic and JavaScript-code parsing.
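As an illustration of avoiding the JavaScript entirely: DuckDuckGo also serves a JavaScript-free HTML frontend that plain requests can fetch. The `html.duckduckgo.com` endpoint and the `result__a` selector below reflect that no-JS markup and are assumptions that may change:

```python
import bs4
import requests

resp = requests.get(
    "https://html.duckduckgo.com/html/",   # no-JS version of the search page
    params={"q": "dogs"},
    headers={"User-Agent": "Mozilla/5.0"}, # a browser-like UA avoids some blocks
)
resp.raise_for_status()

soup = bs4.BeautifulSoup(resp.text, "html.parser")
for a in soup.select("a.result__a"):       # result links in the no-JS markup
    print(a.get_text(strip=True), a.get("href"))
```

Scraping search engines may still violate their terms of service, which is another reason the official API route is preferred.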

Scrapy: scraping website where targeted items are populated using document.write

I am trying to scrape a website where the targeted items are populated using the document.write method. How can I get the fully browser-rendered HTML of the website in Scrapy?
You can't do this directly, as Scrapy will not execute the JavaScript code.
What you can do:
Rely on a browser-automation tool like Selenium, which will execute the JavaScript. Afterwards, use XPath (or simple DOM access) as before to query the web page once the scripts have run.
Understand where the contents come from, and load and parse that source directly instead. Chrome DevTools / Firebug can help you with that; have a look at the "Network" panel, which shows the fetched data.
Look especially for JSON, and sometimes XML.
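A sketch of the second approach: once the Network panel reveals the endpoint that the document.write data comes from, you can skip the browser entirely and fetch it yourself. The URL and field names below are hypothetical placeholders:

```python
import requests

# Endpoint discovered in the DevTools Network panel (hypothetical)
resp = requests.get("https://example.com/api/items.json")
resp.raise_for_status()

items = resp.json()                 # parse the JSON body into Python objects
for item in items:
    # "name" and "price" stand in for whatever fields the real API returns
    print(item["name"], item["price"])
```

This is usually faster and more robust than driving a browser, as long as the endpoint stays stable.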

How do I scrape website HTML after Javascript has run?

So I was trying to scrape a website. When I scrape it, the result isn't the same as what you see when you right-click and view the page source in Firefox or Google Chrome.
The code I used:
import urllib.request

page = urllib.request.urlopen("http://www.google.com/search?q=python")
# or any other website that uses search
python = page.read().decode("utf-8", errors="replace")
print(python)
It turns out that the code only fetches the 'raw' web page, which isn't what I wanted. For websites like this, I want the HTML after the JavaScript etc. has run, so that the result is the same as if you right-clicked and viewed the source code in your browser.
Is there any other way of doing this?
It's not exactly a raw page; it's an error page from Google. At the top of the printed output it says:
Your client does not have permission to get URL /search?q=python from this server.
If you were to change your page variable to
page = urllib.request.urlopen("http://volt.al/")
you'd see the JavaScript.
Try it out with different pages to see what you get.
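The permission error itself can often be avoided by sending a browser-like User-Agent header, since Google blocks the default urllib client string. A Python 3 sketch; note this still returns only the static HTML, not the page after JavaScript has run:

```python
import urllib.request

req = urllib.request.Request(
    "http://www.google.com/search?q=python",
    headers={"User-Agent": "Mozilla/5.0"},  # look like a browser, not a script
)
with urllib.request.urlopen(req) as page:
    html = page.read().decode("utf-8", errors="replace")

print(html[:200])  # first 200 characters of the response
```

To get the post-JavaScript HTML you would still need a real browser engine, e.g. via Selenium, as described in the other answers on this page.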

Viewing whole webpage with beautifulsoup

I am scraping a website, but it only shows a portion of the content; at the bottom it has a "view more" button. Is there any way to view everything on the webpage via Python?
BeautifulSoup just parses the returned HTML. It doesn't execute JavaScript, which is often used to load new content or to modify the existing webpage after it has loaded.
You'll need to execute the JavaScript, which requires more than just an HTML parser; you basically need to drive a browser. There are a few Python packages that can do this:
Selenium
Ghost.py
