JavaScript Disabled error while web scraping Twitter in Python with BeautifulSoup

I am new to this world of web scraping.
I was trying to scrape Twitter with BeautifulSoup in Python.
Here's my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get("https://twitter.com/mybmc").text
soup = BeautifulSoup(html_text, 'html.parser')
print(soup.prettify())
But I am getting a large output which is not the Twitter page I am looking for; instead there is an error container which says JavaScript is disabled in this browser. I tried changing my default browser to Chrome, Firefox, and Microsoft Edge, but the output was the same.
What should I do in this case?

Twitter here seems to be specifically trying to prevent scraping of its front end, probably with the view that you should use their REST API to fetch that same data. It has nothing to do with your default browser: requests.get identifies itself with a python-requests User-Agent and, more importantly, requests does not execute JavaScript at all.
I'd suggest using a different page to practice on, or, if it must be the Twitter front page, consider using Selenium, perhaps with a standalone container, to scrape it.
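A minimal sketch of that approach, assuming Selenium 4+ (which manages the ChromeDriver binary itself) and a local Chrome install; the five-second sleep is a crude placeholder, not a robust readiness check:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # no visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://twitter.com/mybmc")
    time.sleep(5)  # crude wait for client-side JavaScript to render
    print(driver.page_source[:500])  # the DOM after JavaScript has run
finally:
    driver.quit()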

Related

BeautifulSoup request is returning an empty list from LinkedIn.com/jobs

I'm new to BeautifulSoup and web scraping, so please bear with me.
I'm using BeautifulSoup to pull all job post cards from LinkedIn with the title "Security Engineer". After using inspect element on https://www.linkedin.com/jobs/search/?keywords=security%20engineer on an individual job post card, I believe I have found the correct 'li' portion for the class. The code works, but it's returning an empty list '[ ]'. I don't want to use any APIs because this is an exercise for me to learn web scraping. Thank you for your help. Here's my code so far:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('li', class_="jobs-search-results__list-item occludable-update p0 relative ember-view")
print(jobs)
As @baduker mentioned, using plain requests won't do all the heavy lifting that browsers do.
Whenever you open a page in your browser, the browser renders the visuals, makes extra network calls, and runs JavaScript. The first thing it does is load the initial response, which is what you're doing with requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer'). The page you see in your browser is the result of many, many more requests.
The reason your list is empty is that the HTML you get back is very minimal. You can print it out to the console and compare it to what the browser shows.
To make things easier, instead of using requests you can use Selenium, which is essentially a library for programmatically controlling a browser. Selenium will make all those requests for you like a normal browser and let you access the page source as you were expecting it to look.
This is a good place to start, but your scraper will be slow. There are things you can do in Selenium to speed it up, like running in headless mode (i.e. not rendering the page graphically), but it won't be as fast as figuring out how to do it on your own with requests.
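A hedged sketch of that headless approach, assuming Selenium 4+ and Chrome; the class name comes from the question and may change whenever LinkedIn redeploys:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")  # skip graphical rendering for speed
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.linkedin.com/jobs/search/?keywords=security%20engineer")
    # Wait for at least one job card to appear instead of sleeping blindly.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "li.occludable-update"))
    )
    cards = driver.find_elements(By.CSS_SELECTOR, "li.occludable-update")
    print(len(cards))
finally:
    driver.quit()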
If you want to do it using requests, you're going to need to do a lot of snooping through the requests, maybe using a tool like Postman, to see how to simulate the necessary steps to get the data from whatever page.
For example some websites have a handshake process when logging in.
A website I've worked on goes like this (sketched in code after the list):
(Step 0, really) Set up request headers, because the site doesn't seem to respond unless a User-Agent header is included
Fetch the initial HTML and get a unique key from a hidden element in a <form>
Using this key, make a POST request to the URL from that form
Get a session id key from the response
Set up another POST request that combines username, password, and session id. The URL was in some JavaScript function, but I found it using the network inspector in the devtools
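For illustration only, a sketch of that handshake with requests; every URL, field name, and key below is a hypothetical placeholder, since the real ones come from snooping the site's own traffic:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"  # step 0: site ignores bare clients

html = session.get("https://example.com/login").text  # step 1: initial HTML
form = BeautifulSoup(html, "html.parser").find("form")
key = form.find("input", {"type": "hidden"})["value"]  # hidden per-visit key

resp = session.post("https://example.com" + form["action"], data={"key": key})  # step 2
session_id = resp.json()["sessionid"]  # step 3: session id from the response

session.post(  # step 4: username + password + session id
    "https://example.com/auth",  # URL found via the devtools network inspector
    data={"username": "me", "password": "secret", "sessionid": session_id},
)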
So really: I work strictly with Selenium if the site is too complicated and I'm only getting the data once or not very often. I'll go through the heavy stuff if I'm building a scraper for an API that others will use frequently.
Hope any of this made sense to you. Happy scraping!

BeautifulSoup Returns vs View-Source from Chrome (Zillow)

I have been trying to scrape the code from Zillow, but BeautifulSoup gives much less code than view-source in Chrome. Here is my code:
from bs4 import BeautifulSoup
import requests
url='https://www.zillow.com/homedetails/49-Mountain-St-Hartford-CT-06106/58139903_zpid/'
html=requests.get(url)
bs = BeautifulSoup(html.text, "html.parser")
print(bs)
The results show that the body contains very little. However, if you copy the URL and view the source in Chrome, you see a lot more. Could someone show how to scrape the full contents of the body on Zillow? I saw "Please verify you're a human to continue" in the results; how do I handle that?
I think your basic problem is that Zillow will load a lot of additional data after the first page request and use that data to populate the page. Zillow may also do things to discourage web scraping (such as the captcha you're seeing).
How to do this well is a huge topic and not one easily answered in a Stack Overflow question. You can look at this page for a list of resources that may be helpful to you as a scraper - https://github.com/niespodd/browser-fingerprinting
You can also open the network tab in your browser's developer tools (F12 in Chrome, then the Network tab). There you can see the outgoing requests and the responses; you can find the data you want in the responses and study the requests to work out how to fetch it yourself.
As for the "verify you are a human" check: a good captcha today will not have the answer parsable on the client side, and it defeats most attempts at merely modifying request headers. So you may want to use a Selenium-driven browser instead of the requests library alone; that way you can beat the captcha manually and then let your scraper do its work.
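A hedged sketch of that manual hand-off, assuming Selenium 4+ and Chrome; the window must stay visible (not headless) so you can solve the captcha yourself:
from selenium import webdriver

driver = webdriver.Chrome()  # visible window so the captcha can be solved by hand
driver.get("https://www.zillow.com/homedetails/49-Mountain-St-Hartford-CT-06106/58139903_zpid/")

# Solve any captcha in the open window, then come back to the terminal.
input("Press Enter once the page has loaded past the captcha... ")

html = driver.page_source  # the fully rendered DOM
driver.quit()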

Web scraping Twitter pages Using Python

Using my Twitter developer credentials, I pull tweets from news channels through the Twitter API. Now I want to access the source of the news using the URL embedded in the Twitter API data.
I tried using BeautifulSoup and requests to get the content of the Twitter page.
But I continuously get the error 'We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?'
I cleared the browser cache and tried every browser, but got the same response. Please help me solve this problem.
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/i/web/status/1283691878588784642'
# get contents from url
content = requests.get(url).content
# get soup
soup = BeautifulSoup(content,'lxml')
Do you get the error 'We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?' when running the script or when visiting it with your GUI web browser?
Have you tried getting your data through legacy?
If you get this error when running the script, there is nothing you can do in the browser, like clearing the cache; the script never touches your browser. The only way around the problem is to find another way to access the Twitter page from your Python program.
From my experience, the easiest way around this problem is to use geckodriver with Firefox through Selenium, so Twitter gets all the browser features it expects.
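A minimal sketch, assuming Selenium 4+ (which locates the geckodriver binary itself) and a local Firefox install:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()  # drives Firefox through geckodriver
try:
    driver.get("https://twitter.com/i/web/status/1283691878588784642")
    time.sleep(5)  # crude wait for the JavaScript UI to render
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print(soup.title)
finally:
    driver.quit()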

Webscrape a site with authentication and Sign In with Google

I need to scrape a site with authentication, and I'm planning on using my Google account to do so.
So far I've done:
import requests
from bs4 import BeautifulSoup
url = "https://url.com/login"
r = requests.get(url)
When I tried to follow the Sign In with Google button, I realized that there's no href link within the HTML.
Anyone can help me with this?
Thanks!
BeautifulSoup only parses HTML; the HTTP requests come from requests, which talks to the site directly from Python with no browser involved. That means flows that depend on a real browser, like Sign In with Google (the button is wired up in JavaScript, which is why there's no href), won't work with requests alone.
Take a look at Selenium, a library that lets you drive a real browser from Python.
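A hedged sketch of that approach; the URL is the placeholder from the question and the button selector is hypothetical, so inspect the real page first (automated Google logins are also frequently blocked, which is why the login itself is left to you):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://url.com/login")  # placeholder URL from the question

# Hypothetical selector -- inspect the real page to find the actual button.
driver.find_element(By.XPATH, "//button[contains(., 'Sign in with Google')]").click()

# Complete the Google login by hand in the opened window, then continue.
input("Press Enter after signing in... ")
html = driver.page_source
driver.quit()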

Website always flags it as using an outdated browser

I am trying to scrape the site https://anichart.net/ in order to build a schedule from its information. The problem is that the site always detects an outdated browser (it links to http://outdatedbrowser.com).
<noscript><div class="noscript">We're sorry but AniChart requires Javascript.
<br>Please enable Javascript or <a href="http://outdatedbrowser.com">upgrade to a modern web browser</a>.
</div></noscript>
<div class="noscript modern-browser" style="display: none">Sorry, AniChart requires a modern browser.
<br>Please <a href="http://outdatedbrowser.com">upgrade to a newer web browser</a>.</div>
I have tried a regular request and have also tried forcing the user agent, shown below.
import requests

url = 'https://anichart.net/Winter-2019'
headers = {'User-agent': 'Chrome/72.0.3626.109'}
page = requests.get(url, headers=headers)
print(page.content)
I understand that the site uses JavaScript and the requests module won't see the JavaScript-generated portion of the site unless I use other tools with it, potentially Selenium. My browsers are up to date, so this should not be returning an outdated-browser result.
This was working just fine a few days ago but it looks like they did just update their site so they may have added something that prevents automated requests on the site.
Edit:
Selenium Code below:
from selenium import webdriver
url = 'https://anichart.net/Winter-2019'
website = webdriver.Chrome()
website.get(url)
print(website.page_source)
html_after_JS = website.execute_script("return document.body.innerHTML")
print(html_after_JS)
The problem is not the browser detection.
requests simply does not render JavaScript (as you seem to know already), and most sites nowadays use front-end JavaScript libraries to render content. Some sites also use JavaScript detection to prevent bots from scraping their pages.
You instead will need to use a tool like Selenium, which can open a headless, "modern" browser of your choice, and you can scrape the page from there; the Selenium code in your edit is the right direction.
Or, better yet, they have an API: https://github.com/AniList/ApiV2-GraphQL-Docs
The AniList and AniChart websites themselves run on the API, so everything you can do on the sites, you can do via the API.
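For example, a minimal sketch against the GraphQL endpoint (https://graphql.anilist.co, per the docs above; the field names are from memory of the public schema, so verify them against the documentation):
import requests

query = """
query ($season: MediaSeason, $year: Int) {
  Page(perPage: 10) {
    media(season: $season, seasonYear: $year, type: ANIME) {
      title { romaji }
      startDate { year month day }
    }
  }
}
"""
variables = {"season": "WINTER", "year": 2019}

resp = requests.post(
    "https://graphql.anilist.co",
    json={"query": query, "variables": variables},
)
for media in resp.json()["data"]["Page"]["media"]:
    print(media["title"]["romaji"], media["startDate"])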
