I want to get the "Problems Solved" count from a HackerEarth profile page, for example,
https://www.hackerearth.com/#babe
When I inspect the element in the browser, I can see the value. But when I view the page source, I cannot find the class dark-weight 700. I think the content is loaded via JavaScript, which is why Python's bs4 library returns None.
I do not want to use Selenium because it opens a new browser window. I am doing all of this in Django, so I want everything to be processed in the backend without any interruption and to return only the number of problems solved, that is, 119.
Fortunately the data is loaded via a publicly available API (/users/pagelets/babe/coding-data/ for this user), so you can get the info with requests and bs4.
import requests
from bs4 import BeautifulSoup
user = 'babe'
url = 'https://www.hackerearth.com/users/pagelets/{}/coding-data/'.format(user)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
problems_solved = soup.find(string='Problems Solved').find_next().text
print(problems_solved)  # prints: 119
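Since the question mentions Django, here is a minimal sketch of how that scrape could sit inside a view so everything runs in the backend; the view name and the JSON response shape are illustrative, not part of the original answer.
import requests
from bs4 import BeautifulSoup
from django.http import JsonResponse

def problems_solved(request, user):
    # Fetch the same pagelet as above and return only the count.
    url = 'https://www.hackerearth.com/users/pagelets/{}/coding-data/'.format(user)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    solved = soup.find(string='Problems Solved').find_next().text
    return JsonResponse({'problems_solved': solved})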
So I am trying to create a small script that gets the view count of a YouTube video and prints it. However, with this code, printing the text variable just gives "None". Is there a way to get the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scope ytd-video-view-count-renderer"})
print(text)
To see why, you should use wget or curl to fetch a copy of that page and look at it, or use "view source" from your browser. That's what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through JavaScript. requests doesn't interpret JavaScript. If you need to do this, you'll need to use something like Selenium to run a real browser with a JavaScript interpreter built in.
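For example, a rough Selenium sketch with headless Chrome; the CSS selector here is a guess at YouTube's rendered markup (it is not confirmed by this answer) and may need adjusting:
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')  # run without opening a visible window
driver = webdriver.Chrome(options=options)
driver.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
sleep(5)  # crude wait for the JavaScript to render the page

# "span.view-count" is an assumed selector for the rendered view counter.
views = driver.find_element(By.CSS_SELECTOR, "span.view-count").text
print(views)
driver.quit()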
I have been trying to scrape a website using the requests and BeautifulSoup Python libraries.
The problem is that I'm getting the HTML of the page, but the content of the body tag is empty, even though in the Inspect panel in the browser it isn't.
Can anyone explain why this is happening and what I can do to get the content of the body?
Here is my code:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
Here is the inspect panel of the website:
And here is the output of my code:
Thank you :)
There are two reasons your code might not work. The first is that the website requires additional header or cookie information, which you can find using the browser's Inspect tool and pass via
requests.get(url, headers=headers, cookies=cookies)
where headers and cookies are dictionaries.
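For illustration, a minimal sketch with placeholder values (the real header and cookie names have to be copied from the Inspect/Network tool for the site in question):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}           # placeholder header
cookies = {'session': 'value-from-your-browser'}  # placeholder cookie
r = requests.get('https://example.com', headers=headers, cookies=cookies)
print(r.status_code)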
The other reason, which I believe is the case here, is that the content is loaded dynamically via JavaScript after the page is built, and what you get back is only the initially loaded page.
To also give you a solution, I attach an example using Selenium, which simulates a whole browser and is therefore served the full website. Selenium does have a bit of setup overhead, but you can easily google it.
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
driver = webdriver.Firefox()
driver.get(url)
sleep(10)  # give the JavaScript time to load the content
content = driver.page_source
soup = BeautifulSoup(content, 'lxml')
If you want the browser simulation to be invisible, you can add
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options)
which will make it run in the background.
Instead of Firefox, you can use pretty much any browser with the appropriate driver.
A Linux-based setup example can be found here: Link
Even though I find the use of Selenium easier for beginners, that site bothered me, so I figured out a pure-requests way that I also want to share.
Process:
When you look at the network traffic after loading the website, you find a lot of outgoing GET requests. Assuming you are interested in the products that are loaded, I found a call, right above the product images being loaded from Amazon S3, going to
https://client-il.rexail.com/client/public/public-catalog?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A
the important part being
https://client-il.rexail.com/client/public/public-catalog?s_jwe=[...]
Upon opening that URL, I found it is indeed a JSON catalog of the products. However, the s_jwe token is dynamic, and without it the JSON doesn't load.
Now, investigating the initially loaded URL and searching for s_jwe, you will find
<script>
window.customerStore = {store: angular.fromJson({"id":26,"name":"\u05de\u05e9\u05e7 \u05d4\u05e8 \u05e4\u05e8\u05d7\u05d9\u05dd","imagePath":"images\/stores\/26\/88aa6827bcf05f9484b0dafaedf22b0a.png","secondaryImagePath":"images\/stores\/4d5d1f54038b217244956071ca62312d.png","thirdImagePath":"images\/stores\/26\/2f9294180e7d656ba7280540379869ee.png","fourthImagePath":"images\/stores\/26\/bd2861565b18613497a6ce66903bf9eb.png","externalWebTrackingAccounts":"[{\"accountType\":\"googleAnalytics\",\"identifier\":\"UA-130110792-1\",\"primaryDomain\":\"ecomeshek.co.il\"},{\"accountType\":\"facebookPixel\",\"identifier\":\"3958210627568899\"}]","worksWithStoreCoupons":false,"performSellingUnitsEstimationLearning":false}), s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"};
const externalWebTrackingAccounts = angular.fromJson(customerStore.store.externalWebTrackingAccounts);
</script>
containing
s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"
So to summarize: even though the initial page does not contain the products, it does contain the token and the product URL.
Now you can extract the two and call the product catalog directly as such:
FINAL CODE:
import requests
import re
import json
s = requests.Session()
initial_url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
initial_site = s.get(url=initial_url).content.decode('utf-8')
jwe = re.findall(r's_jwe:.*"(.*)"', initial_site)
product_url = "https://client-il.rexail.com/client/public/public-catalog?s_jwe=" + jwe[0]
products_site = s.get(url=product_url).content.decode('utf-8')
products = json.loads(products_site)["data"]
print(products[0])
There is a little bit of fine-tuning required with the decoding, but I am sure you can manage that. ;)
This of course is the leaner way of scraping that website, but as I hopefully showed, scraping is always a bit of playing Sherlock Holmes.
Any questions, glad to help.
I am trying to scrape this web page using Python's requests library.
But I am not able to download the complete HTML source code. When I use my web browser to inspect elements, it shows the complete HTML, which I believe could be used for scraping, but when I access this URL from Python, the HTML tags that hold the data simply disappear and I am not able to scrape them. Here is my sample code:
import urllib.request
from bs4 import BeautifulSoup as BS

url = 'https://www.udemy.com/topic/financial-analysis/?lang=en'
user_agent = 'my-user-agent'

request = urllib.request.Request(url, headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BS(html, 'html.parser')
Can anybody please help me out? Thanks.
The page is likely being built by JavaScript, meaning the site sends over the same source you are pulling with urllib, and then the browser executes the JavaScript, modifying the source to render the page you are seeing.
You will need to use something like Selenium, which will open the page in a browser, render the JS, and then return the source, e.g.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
driver.page_source # or driver.execute_script("return document.body.innerHTML;")
I recommend using the stdlib module urllib2 (Python 2); it will allow you to comfortably get web resources.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
AND...
For parsing the code, have a look at BeautifulSoup.
Thanks to you both. @blakebrojan, I tried your method, but it opened a new Chrome instance and displayed the result there. What I want is to get the source code inside my code and scrape data from it. Here is the code:
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Lenovo\\Desktop\\chrome-driver\\chromedriver.exe')
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
html = driver.page_source
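If the extra Chrome window is the only problem, Chrome can be run headless so the whole scrape stays in the background; a sketch using the chromedriver path from your snippet:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True  # no visible browser window
driver = webdriver.Chrome('C:\\Users\\Lenovo\\Desktop\\chrome-driver\\chromedriver.exe', options=options)
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'html.parser')  # parse the rendered source as usual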
I've been trying to create a simple web scraper to collect the book titles from a 100-bestseller list on Amazon. I've used this code before on another site with no problems. But for some reason, it scrapes the first page fine and then prints the same results for the following iterations.
I'm not sure if it has something to do with how Amazon creates its URLs or not. When I manually enter the "#2" (and beyond) at the end of the URL in the browser, it navigates fine.
(Once the scrape is working I plan on dumping the data in csv files. But for now, print to the terminal will do.)
import requests
from bs4 import BeautifulSoup

for i in range(5):
    url = "https://smile.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_nav_kstore_4_158591011#{}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    for book in soup.find_all('div', class_='zg_itemWrapper'):
        title = book.find('div', class_='p13n-sc-truncate')
        name = book.find('a', class_='a-link-child')
        price = book.find('span', class_='p13n-sc-price')
        print(title)
        print(name)
        print(price)

    print("END")
This is a common problem you have to face: some sites load their data asynchronously (with AJAX). Those are XMLHttpRequests, which you can see in the Network tab of your DOM inspector. Usually such websites load the data from a different endpoint, often with a POST method, and to deal with that you can use the urllib or requests library.
In this case the request uses a GET method, and you can scrape the data from this URL with no need to extend your code: https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_pg_3?_encoding=UTF8&pg=3&ajax=1 where you only change the pg parameter.
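A sketch of that approach, looping over the pg parameter; the item selectors are taken from your snippet and may need adjusting for the AJAX response:
import requests
from bs4 import BeautifulSoup

base = ("https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction"
        "/zgbs/digital-text/6361470011/ref=zg_bs_pg_{pg}?_encoding=UTF8&pg={pg}&ajax=1")

for pg in range(1, 6):
    r = requests.get(base.format(pg=pg))
    soup = BeautifulSoup(r.content, "lxml")
    for book in soup.find_all('div', class_='zg_itemWrapper'):
        title = book.find('div', class_='p13n-sc-truncate')
        price = book.find('span', class_='p13n-sc-price')
        print(title.get_text(strip=True) if title else None,
              price.get_text(strip=True) if price else None)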
I'm doing web scraping as part of an academic project, where it's important that all links are followed through to the actual content. Annoyingly, there are some important error cases with "social media management" sites, where users post their links to detect who clicks on them.
For instance, consider this link on linkis.com, which links to http:// + bit.ly + /1P1xh9J (separated link due to SO posting restrictions), which in turn links to http://conservatives4palin.com. The issue arises as the original link at linkis.com does not automatically redirect forward. Instead, the user has to click the cross in the top right corner to go to the original URL.
Furthermore, there seems to be different variations (see e.g. linkis.com link 2, where the cross is at the bottom left of the website). These are the only two variations I've found, but there might be more. Note that I'm using a web scraper very similar to this one. The functionality to go through to the actual link does not need to be stable/functioning over time as this is a one-time academic project.
How do I automatically go on to the original URL? Would the best approach be to design a regex that finds the relevant link?
In many cases, you will have to use browser automation to scrape web pages that generate their content with JavaScript; scraping the HTML returned by a GET request will not yield the result you want. You have two options here:
Try to work your way around all the additional JavaScript requests to get the content you want, which can be very time-consuming.
Use browser automation, which lets you open a real browser and automate its tasks; you can use Selenium for that.
I have been developing bots and scrapers for years now, and if the webpage you are requesting relies heavily on JavaScript, you should use something like Selenium.
Here is some code to get you started with selenium:
from selenium import webdriver
#Create a chrome browser instance, other drivers are also available
driver = webdriver.Chrome()
#Request a page
driver.get('http://linkis.com/conservatives4palin.com/uGXam')
#Select elements on the page and trigger events
#Selenium also supports XPath and CSS selectors
#Click the element with the given id
driver.find_element_by_id('some_id').click()
The common architecture these websites follow is that they show the target site in an iframe. The sample code below works for both of the cases you mention.
In order to get the final URL you can do something like this:
import requests
from bs4 import BeautifulSoup

urls = ["http://linkis.com/conservatives4palin.com/uGXam", "http://linkis.com/paper.li/gsoberon/jozY2"]
response_data = []

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    short_url = soup.find("iframe", {"id": "source_site"})['src']
    response_data.append(requests.get(short_url).url)

print(response_data)
Judging by the two websites you gave, I think you can try the following code to get the original URL, since in both cases it is hidden in a piece of JavaScript (the main scraper code I am using is from the question you posted):
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser

import requests, re
from contextlib import closing

CHUNKSIZE = 1024
reurl = re.compile("\"longUrl\":\"(.*?)\"")
buffer = ""
htmlp = HTMLParser()

with closing(requests.get("http://linkis.com/conservatives4palin.com/uGXam", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer = "".join([buffer, chunk])
        match = reurl.search(buffer)
        if match:
            print(htmlp.unescape(match.group(1)).replace('\\',''))
            break
Say you're able to grab the href attribute/value:
s = 'href="/url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J"'
then you need to do the following:
import urllib.parse

s = s.partition('http')                      # ('href="/url/go/?url=', 'http', '%3A%2F%2Fbit.ly%2F1P1xh9J"')
s = s[1] + urllib.parse.unquote(s[2][0:-1])  # drop the trailing quote and decode the rest
s = urllib.parse.unquote(s)
and s will now be a string containing the original bit.ly link!
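An alternative, assuming the target really sits in a url query parameter as in the string above, is to let urllib.parse pull it out; a small sketch:
from urllib.parse import urlparse, parse_qs

s = 'href="/url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J"'
href = s.split('"')[1]               # /url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J
qs = parse_qs(urlparse(href).query)  # parse_qs decodes the percent-encoding
print(qs['url'][0])                  # http://bit.ly/1P1xh9J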
try the following code:
import requests
url = 'http://'+'bit.ly'+'/1P1xh9J'
realsite = requests.get(url)  # requests follows the redirects automatically
print(realsite.url)           # the final URL after all redirects
it prints the desired output:
http://conservatives4palin.com/2015/11/robert-tracinski-the-climate-change-inquisition-begins.html?utm_source=twitterfeed&utm_medium=twitter
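If you also want to see the intermediate hops, requests keeps the redirect chain on the response; a small sketch:
import requests

r = requests.get('http://' + 'bit.ly' + '/1P1xh9J')
for hop in r.history:            # each intermediate redirect response, in order
    print(hop.status_code, hop.url)
print(r.url)                     # the final resolved URL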