Not able to download the complete source code of a web page - Python

I am trying to scrape this web page using the Python requests library.
But I am not able to download the complete HTML source code. When I inspect elements in my web browser, it shows the complete HTML, which I believe could be used for scraping, but when I access this URL from Python, the HTML tags that hold the data are simply missing, so I cannot scrape the data from them. Here is my sample code:
from bs4 import BeautifulSoup as BS
import urllib.request
url = 'https://www.udemy.com/topic/financial-analysis/?lang=en'
user_agent = 'my-user-agent'
request = urllib.request.Request(url, headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BS(html, 'html.parser')
Can anybody please help me out? Thanks.

The page is likely being built by JavaScript: the site sends the same source you are pulling with urllib, and the browser then executes the JavaScript, modifying the DOM to render the page you are seeing.
You will need to use something like Selenium, which will open the page in a browser, render the JS, and then return the rendered source, e.g.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
driver.page_source # or driver.execute_script("return document.body.innerHTML;")
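If you would rather not have a browser window pop up, Chrome can be run headless and the rendered source handed straight to BeautifulSoup. A minimal sketch (it assumes Selenium with chromedriver available on your PATH, and the CSS class in the select call is only a placeholder; inspect the rendered page for Udemy's real class names):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
options = Options()
options.add_argument("--headless")                 # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
soup = BeautifulSoup(driver.page_source, "html.parser")
for card in soup.select(".course-card"):           # ".course-card" is a placeholder selector
    print(card.get_text(strip=True))
driver.quit()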

I recommend the stdlib module urllib2 (Python 2 only; its Python 3 counterpart is urllib.request), which lets you fetch web resources comfortably.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
For parsing the fetched code, have a look at BeautifulSoup.
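For example, a minimal sketch combining the two (Python 2 because of urllib2; on Python 3 you would use urllib.request instead):
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen("http://google.de")
soup = BeautifulSoup(response.read(), "html.parser")
print(soup.title.text)  # just to show the parse worked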

Thanks to you both. @blakebrojan, I tried your method, but it opened a new Chrome instance and displayed the result there. What I want is to get the source code inside my code and scrape data from it. Here is the code:
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Lenovo\\Desktop\\chrome-driver\\chromedriver.exe')
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
html=driver.page_source

Related

How can I fetch the source code from a website that blocks requests made with Python bs4 or Selenium?

I want to scrape the data present on this website "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105". I tried using Beautiful Soup and Selenium.
The first approach:
import requests
from bs4 import BeautifulSoup
url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'lxml')
print(soup)
This did not give the output I was expecting; the fetched page contains something like "Sorry, something about your browser or browsing activity made us think you were a robot."
The second approach:
from selenium import webdriver
url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
PATH = r"C:\Users\Vinay Edula\Desktop\xxxxxxxx\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get(url)
This approach works fine for one or two pages on that site, but after that the website blocks the requests.
Approach 3:
import time
import webbrowser
chrome_path = 'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe %s'
for i in range(10):
    url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105&cursor=" + str(i*10) + "&limit=10"
    webbrowser.get(chrome_path).open(url)
    time.sleep(10)
This code works fine: it opens the site in Chrome without any error or blocking, but I don't know how to fetch the source code from it.
When my Python code tries to fetch the page, or when Selenium accesses it from its guest browser, I get the error. When I open the webpage manually, or via Python's webbrowser module, I can see the contents. So how can I solve this problem? My final aim is to fetch the contents of this paginated site: https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105.
Any solution for this problem will be highly appreciated.
You can use the page_source property of Selenium as follows:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html = browser.page_source  # the HTML after the browser has executed the page's JavaScript

Why am I getting an empty body tag content when trying to use web scraping using the requests library?

I have been trying to scrape a website using the requests and BeautifulSoup Python libraries.
The problem is that I get the HTML of the web page, but the body tag content is empty, while in the inspect panel on the website it isn't.
Can anyone explain why this happens and what I can do to get the content of the body?
Here is my code:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
Here is the inspect panel of the website:
And here is the output of my code:
Thank you :)
There are two reasons why your code might not work. The first is that the website requires additional header or cookie information, which you can find using the browser's inspect tool and then pass via
requests.get(url, headers=headers, cookies=cookies)
where headers and cookies are dictionaries.
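For example (the header and cookie values below are placeholders; copy the real ones from the request your browser sends, visible in the Network tab of the developer tools):
import requests
url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
headers = {
    'User-Agent': 'Mozilla/5.0',                 # placeholder, use your browser's value
    'Accept-Language': 'en-US,en;q=0.9',
}
cookies = {'session_id': 'copied-from-browser'}  # placeholder cookie name and value
resp = requests.get(url, headers=headers, cookies=cookies)
print(resp.status_code, len(resp.text))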
The other reason, which I believe applies here, is that the content is loaded dynamically via JavaScript after the page is built, and what you get is only the initially loaded page.
To also give you a solution, I attach an example using Selenium, which simulates a whole browser and therefore does serve the full website. Selenium has a bit of a setup overhead, but you can easily google it.
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
driver = webdriver.Firefox()
driver.get(url)
sleep(10)  # give the JavaScript time to load the content
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
If you want the browser simulation to be invisible, you can add
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options)
which will make it run in the background.
Instead of Firefox, you can use pretty much any browser with the appropriate driver.
A Linux-based setup example can be found here: Link
Even though I find Selenium easier for beginners, that site bothered me, so I figured out a pure requests approach that I also want to share.
Process:
Looking at the network traffic after loading the website, you find a lot of outgoing GET requests. Assuming you are interested in the products being loaded, I found a call, right above the product images being loaded from Amazon S3, going to
https://client-il.rexail.com/client/public/public-catalog?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A
the important part being
https://client-il.rexail.com/client/public/public-catalog?s_jwe=[...]
Opening that URL, I found it to be indeed a JSON listing of the products. However, the s_jwe token is dynamic, and without it the JSON doesn't load.
Investigating the initially loaded URL and searching for s_jwe, you will find
<script>
window.customerStore = {store: angular.fromJson({"id":26,"name":"\u05de\u05e9\u05e7 \u05d4\u05e8 \u05e4\u05e8\u05d7\u05d9\u05dd","imagePath":"images\/stores\/26\/88aa6827bcf05f9484b0dafaedf22b0a.png","secondaryImagePath":"images\/stores\/4d5d1f54038b217244956071ca62312d.png","thirdImagePath":"images\/stores\/26\/2f9294180e7d656ba7280540379869ee.png","fourthImagePath":"images\/stores\/26\/bd2861565b18613497a6ce66903bf9eb.png","externalWebTrackingAccounts":"[{\"accountType\":\"googleAnalytics\",\"identifier\":\"UA-130110792-1\",\"primaryDomain\":\"ecomeshek.co.il\"},{\"accountType\":\"facebookPixel\",\"identifier\":\"3958210627568899\"}]","worksWithStoreCoupons":false,"performSellingUnitsEstimationLearning":false}), s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"};
const externalWebTrackingAccounts = angular.fromJson(customerStore.store.externalWebTrackingAccounts);
</script>
containing
s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"
So to summarize: even though the initial page does not contain the products, it does contain the token and the product URL.
Now you can extract the two and call the product catalog directly, like this:
FINAL CODE:
import requests
import re
import json
s = requests.Session()
initial_url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
initial_site = s.get(initial_url).content.decode('utf-8')
jwe = re.findall(r's_jwe:.*"(.*)"', initial_site)  # pull the dynamic token out of the inline script
product_url = "https://client-il.rexail.com/client/public/public-catalog?s_jwe=" + jwe[0]
products_site = s.get(product_url).content.decode('utf-8')
products = json.loads(products_site)["data"]  # the catalog is JSON with a top-level "data" key
print(products[0])
There is a little bit of fine-tuning required with the decoding, but I am sure you can manage that. ;)
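For instance, requests can decode the JSON for you, which skips the manual .content.decode step; a sketch continuing from the session and product_url above, and assuming, as above, that the endpoint returns JSON with a top-level "data" key:
products = s.get(product_url).json()["data"]
print(products[0])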
This of course is the leaner way of scraping that website, but as I hopefully showed, scraping is always a bit of playing Sherlock Holmes.
Any questions, glad to help.

How can I view web page content that is generated using angular JS?

I have an app that uses requests to analyze and act on web page text. But it does not seem to work on this page, which is likely built with Angular: https://bio.tools/bowtie, in that the source HTML is different from the actual content. I am trying to collect the DOI that is referenced on the page (10.1186/gb-2009-10-3-r25), but when requests picks up the HTML source, the DOI is not there.
I've heard that Google is able to parse pages that are generated using JavaScript. How do they do it? Any tips on viewing the DOI information with Python?
You probably need an engine that runs the JavaScript of the HTTP response for you (like a browser does). You can use Selenium for this and then parse the HTML it returns with BeautifulSoup.
Example:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://bio.tools/bowtie"
path = "path/to/chrome/webdriver"
browser = webdriver.Chrome(path) # Can also be Firefox, etc.
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
...
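From there you could, for example, pull the DOI out of the rendered text with a regular expression; a sketch, assuming the DOI appears as plain text once the JavaScript has run:
import re
text = soup.get_text(" ", strip=True)
match = re.search(r"10\.\d{4,9}/[^\s\"<>]+", text)  # generic DOI pattern
if match:
    print(match.group(0))  # should print 10.1186/gb-2009-10-3-r25 if the page rendered
browser.quit()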

BeautifulSoup - Can't get the content of the page

I've been using BeautifulSoup for a while and haven't had many problems.
But now I'm trying to scrape a site that gives me some trouble.
My code is this:
import requests
from bs4 import BeautifulSoup
preSoup = requests.get('https://www.betbrain.com/football/world/')
soup = BeautifulSoup(preSoup.content, "lxml")
print(soup)
The content I get seems to be some sort of script and/or the API they're connected to, but not the real content of the webpage I see in the browser.
I can't reach the games, for example. Does anyone know a way around it?
Thank you
requests only gets the HTML and doesn't run the JS; you have to use a webdriver for that.
You can use Chrome, Firefox, etc. I use PhantomJS because it runs in the background, it is a "headless" browser. Below you will find some example code that will help you understand how to use it.
from bs4 import BeautifulSoup
import time
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("https://www.betbrain.com/football/world/")
time.sleep(5)  # give it some time to load the js
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for i in soup.findAll("span", {"class": "Participant1"}):
    print(i.text)
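Note that PhantomJS support has since been deprecated in Selenium, so with a recent Selenium you would use a headless Chrome or Firefox instead; the rest of the code stays the same. A sketch with headless Chrome (it assumes chromedriver is installed):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)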

Error when using BeautifulSoup in Python

I'm a newbie to HTML parsers. I'm trying to parse the source code of the webpage at http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student and get the answer_count.
I tried it in the following way:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student'
q = urllib2.urlopen(url)
soup = BeautifulSoup(q)
divs = soup.find_all('div', class_='answer_count')
But the list divs comes back empty. Why is that? Where am I going wrong? How do I get the result '2 Answers'?
Maybe you don't see the same page as us in your browser (because you are logged in, or similar).
When I look at the webpage you provided in Google Chrome, 'answer_count' appears nowhere in the source code. So if Google Chrome doesn't find it, BeautifulSoup won't either.
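You can check this yourself by testing whether the class name even occurs in the raw HTML the server returns; a quick sketch using the same urllib2 fetch as in the question:
import urllib2
url = 'http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student'
html = urllib2.urlopen(url).read()
print('answer_count' in html)  # False means the class is only added later by JavaScript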
