Website always flags it as using an outdated browser - python

I am trying to scrape https://anichart.net/ in order to build a schedule from the information on it. The problem is that the site always detects an outdated browser (it shows a link to http://outdatedbrowser.com).
<div class=noscript>We're sorry but AniChart requires Javascript.
<br>Please enable Javascript or <a
href=http://outdatedbrowser.com>upgrade to a modern web browser</a>.
</div></noscript><div class="noscript modern-browser" style="display:
none">Sorry, AniChart requires a modern browser.<br>Please <a
href=http://outdatedbrowser.com>upgrade to a newer web browser</a>.</div>
I have tried a regular request and have also tried forcing the user agent, shown below.
import requests

url = 'https://anichart.net/Winter-2019'
headers = {'User-Agent': 'Chrome/72.0.3626.109'}  # note: only a partial UA string
page = requests.get(url, headers=headers)
print(page.content)
I understand that the site uses JavaScript and that the requests module won't render the JavaScript-generated portion of the site unless I pair it with other tools, or potentially Selenium. My browsers are up to date, so this should not be returning an outdated-browser result.
This was working just fine a few days ago, but they recently updated their site, so they may have added something that blocks automated requests.
Edit:
Selenium code below:
from selenium import webdriver

url = 'https://anichart.net/Winter-2019'
website = webdriver.Chrome()  # requires chromedriver on PATH
website.get(url)
print(website.page_source)  # HTML as initially served

# DOM after JavaScript has run
html_after_JS = website.execute_script("return document.body.innerHTML")
print(html_after_JS)

The problem is not the browser detection.
requests simply does not render JavaScript (as you seem to know already), and most sites nowadays use front-end JavaScript libraries to render content. Some sites also use JavaScript detection to prevent bots from scraping their pages.
You will instead need a tool like Selenium, which opens a headless "modern" browser of your choice, and you can scrape the page from there. But you have not shown that code, so it might make sense to ask about that instead.
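For reference, a minimal headless-Chrome sketch (assuming chromedriver is installed; the five-second wait is an arbitrary placeholder for the page's render time):
import time
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get('https://anichart.net/Winter-2019')
time.sleep(5)  # crude wait for client-side rendering to finish
html = driver.page_source  # DOM after JavaScript has run
driver.quit()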
Or, better yet, they have an API - https://github.com/AniList/ApiV2-GraphQL-Docs
The AniList & AniChart websites themselves run on the API, so everything you can do on the sites, you can do via the API.
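A minimal sketch of querying that API with requests (the endpoint is from the linked docs; the query fields are illustrative, so check the schema there):
import requests

# Illustrative GraphQL query; see the linked docs for the full schema.
query = '''
query ($season: MediaSeason, $year: Int) {
  Page(perPage: 10) {
    media(season: $season, seasonYear: $year, type: ANIME) {
      title { romaji }
      startDate { year month day }
    }
  }
}
'''
variables = {'season': 'WINTER', 'year': 2019}
resp = requests.post('https://graphql.anilist.co',
                     json={'query': query, 'variables': variables})
print(resp.json())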

Related

Retrieving all network requests required to load a webpage using python

Let's say I am making a Python request:
import requests

url = "https://www.google.com"
r = requests.get(url)
Is there any method for getting all the network requests needed to load such a website, for example those listed in the inspect-element tool in Chrome? I believe I could achieve the same effect using Selenium, but is there any library or method I could use to simply get all the network requests/responses when requesting a URL?
Selenium Wire may be worth a try. I haven't been able to find much else in this space either.
https://github.com/wkeeling/selenium-wire
Selenium Wire extends Selenium's Python bindings to give you access to the underlying requests made by the browser. You author your code in the same way as you do with Selenium, but you get extra APIs for inspecting requests and responses and making changes to them on the fly.
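A quick sketch of what that looks like (assuming Chrome and chromedriver are installed; the attribute names follow the Selenium Wire README):
from seleniumwire import webdriver  # note: seleniumwire, not selenium

driver = webdriver.Chrome()
driver.get('https://www.google.com')

# Every request the browser made while loading the page
for request in driver.requests:
    if request.response:
        print(request.url,
              request.response.status_code,
              request.response.headers['Content-Type'])
driver.quit()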
This article describes more HTTP Request packages that may have similar capabilities or related extensions.
https://www.twilio.com/blog/5-ways-http-requests-python

JavaScript Disabled error while web scraping Twitter in Python with BeautifulSoup

I am new to this world of web scraping.
I was trying to scrape twitter with BeautifulSoup in Python.
Here's my code :
from bs4 import BeautifulSoup
import requests
request = requests.get("https://twitter.com/mybmc").text
soup = BeautifulSoup(request, 'html.parser')
print(soup.prettify())
But I am getting a large output which is not the Twitter page I am looking for; instead there is an error container (see the output screenshot) which says JavaScript is disabled in this browser. I tried changing my default browser to Chrome, Firefox, and Microsoft Edge, but the output was the same.
What should I do in this case?
Twitter here seems to be specifically trying to prevent scraping of the front end, probably with the view that you should use their REST API to fetch the same data. It has nothing to do with your default browser: requests.get sends a python-requests user agent and does not execute JavaScript at all.
I'd suggest using a different page to practice on, or, if it must be the Twitter front page, consider using Selenium, perhaps with a standalone container, to scrape it.
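A sketch of the standalone-container route (this assumes a selenium/standalone-chrome container is already running on port 4444, which is the project's default; verify the address for your setup):
from selenium import webdriver

# e.g. started with: docker run -d -p 4444:4444 selenium/standalone-chrome
options = webdriver.ChromeOptions()
driver = webdriver.Remote(
    command_executor='http://localhost:4444/wd/hub',
    options=options,
)
driver.get('https://twitter.com/mybmc')
print(driver.page_source)
driver.quit()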

web scraping vaadin python

I'm trying to scrape a site built with Vaadin using Python. This is the code I use:
import requests

requests.get('http://rnb.osim.ro/?pn=').text
but this is the result, which contains no useful information:
<noscript>
You have to enable javascript in your browser to use an application built with Vaadin.
</noscript>
</div>
<script type="text/javascript" src="./VAADIN/vaadinBootstrap.js"></script>
<script type="text/javascript">//<!
Do you know how I can get the data I need from a vaadin site?
This is happening because requests can't execute the JavaScript inside the website. As the name suggests, requests just makes an HTTP request; it is not a browser that can run JS and work with front-end frameworks (e.g. Angular, React). To scrape these modern websites I personally recommend the scrapy library. It's designed specifically for scraping and can cope with JavaScript to a limited extent; and where it can't, you can still use the Selenium web driver to fully emulate a browser.
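For illustration, a minimal scrapy spider for this URL might look like the sketch below (the selector is a placeholder; note that plain scrapy will still only see the pre-JS bootstrap page, so the Selenium fallback above remains relevant):
import scrapy

class RnbSpider(scrapy.Spider):
    name = 'rnb'
    start_urls = ['http://rnb.osim.ro/?pn=']

    def parse(self, response):
        # Placeholder extraction; the real Vaadin content is rendered
        # client-side, so this only sees the server-sent bootstrap HTML.
        yield {'title': response.css('title::text').get()}
You would run it with scrapy runspider rnb_spider.py -o out.json.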
If you are already familiar with requests, you may also find requests-html useful. If you just want the rendered HTML and don't need to interact with the page (clicking buttons, paging down, etc.), then you can use this option.
Your question is ideal for this demonstration. The following code fully renders the HTML you want.
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://rnb.osim.ro/?pn=')
r.html.render(sleep=5)  # downloads Chromium on first use, then executes the page's JS
print(r.html.html)

missing part of html when getting html by requests in python [duplicate]

I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (included in the HTML source). What this function does on the site is that when you press a button, it outputs some HTML. How can I "press" this button with Python code? Can scrapy help me? I captured the POST request with Firebug, but when I try to pass it on the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we will need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).
Options like Mechanize or urllib2 will not work, since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')  # page_source fetches page after rendering is complete
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
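In Python terms, that "host a browser, click the button, interrogate the DOM" approach looks roughly like this with Selenium (the URL and the button's id are hypothetical; substitute the real ones from your page):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com/page-with-button')  # hypothetical URL

# Hypothetical selector; point it at the real button on your page
driver.find_element(By.ID, 'show-more').click()

# Interrogate the DOM after the click-triggered JS has run
generated_html = driver.execute_script("return document.body.innerHTML")
driver.quit()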
I use WebKit, which is the browser renderer behind Chrome and Safari. There are Python bindings to WebKit through Qt. And here is a full example to execute JavaScript and extract the final HTML.
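The classic version of that pattern used PyQt4's QWebPage; a rough present-day equivalent with PyQt5's QtWebEngine (my assumption, not the example the answer linked) looks like:
import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage

class Render(QWebEnginePage):
    """Load a URL, run its JavaScript, and capture the final HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        super().__init__()
        self.html = None
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()  # block until quit() is called below

    def _on_load_finished(self, ok):
        self.toHtml(self._store_html)  # toHtml is asynchronous

    def _store_html(self, html):
        self.html = html
        self.app.quit()

page = Render('http://example.com')  # hypothetical URL
print(page.html)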
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader/middleware handler able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple.

python open web page and get source code

We have developed a web-based application, with user login etc., and we developed a Python application that has to get some data from this page.
Is there any way for Python to communicate with the system default browser?
Our main goal is to open a webpage with the system browser and get the HTML source code from it. We tried the Python webbrowser module and opened the web page successfully, but could not get the source code. We also tried urllib2, but in that case I think we would have to reuse the system default browser's cookies etc., and I don't want to do that, for security reasons.
https://pypi.python.org/pypi/selenium
You can try Selenium; it was made for testing, but nothing prevents you from using it for other purposes.
If your website is navigable without JavaScript, then you could try Mechanize or zope.testbrowser. These tools offer a higher-level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie-based authentication with HTML forms for login, for example.
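A sketch of that login flow with Mechanize (the URL and form field names are hypothetical; match them to your actual login form):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt for this internal app
br.open('https://example.com/login')  # hypothetical login URL

br.select_form(nr=0)  # assumes the login form is the first form on the page
br['username'] = 'me'       # hypothetical field names
br['password'] = 'secret'
br.submit()  # Mechanize keeps the session cookie for later requests

html = br.response().read()  # source of the page after login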
Have a look at the nltk module; it has some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm; they're pretty widely used modules, so you can find lots of hints out there :)
