I am trying to scrape some data from a JavaScript-driven website, but even with Selenium I can't get at the rendered content.
from discord.ext import commands
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.common.by import By
options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome('chromedriver', options=options)
driver.get('http://mc164.boxtoplay.com:65248')
print(driver.page_source.encode("utf-8"))
The output:
b'<html lang="en"><head>\n\n\t<title>Minecraft Dynamic Map</title>\n\n\t<meta charset="utf-8">\n\t<meta name="keywords" content="minecraft, map, dynamic">\n\t<meta name="description" content="Minecraft Dynamic Map">\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no">\n\t<!-- These 2 lines make us fullscreen on apple mobile products - remove if you don\'t like that -->\n\t<meta name="apple-mobile-web-app-capable" content="yes">\n\t<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">\t\n\n\t<link rel="icon" href="images/dynmap.ico" type="image/ico">\n\n\t<script type="text/javascript" src="js/jquery-3.5.1.js?_=3.3.2-696"></script>\n\t<link rel="stylesheet" type="text/css" href="css/leaflet.css?_=3.3.2-696">\n\t<script type="text/javascript" src="js/leaflet.js?_=3.3.2-696"></script>\n <!-- FOr source debug on leaflet \t<script type="text/javascript" src="js/leaflet-src.js?_=3.3.2-696"></script> -->\n\t<script type="text/javascript" src="js/custommarker.js?_=3.3.2-696"></script>\n\n\t<script type="text/javascript" src="js/dynmaputils.js?_=3.3.2-696"></script>\n\t<script type="text/javascript" src="js/sidebarutils.js?_=3.3.2-696"></script>\n\n\t<!--<link rel="stylesheet" type="text/css" href="css/embedded.css" media="screen" />-->\n\t<link rel="stylesheet" type="text/css" href="css/standalone.css?_=3.3.2-696" media="screen">\n\t<link rel="stylesheet" type="text/css" href="css/dynmap_style.css?_=3.3.2-696" media="screen">\n\t<!-- <link rel="stylesheet" type="text/css" href="css/override.css" media="screen" /> -->\n\n\t<script type="text/javascript" src="version.js?_=3.3.2-696"></script>\n\t<script type="text/javascript" src="js/jquery.json.js?_=3.3.2-696"></script>\n\t<script type="text/javascript" src="js/jquery.mousewheel.js?_=3.3.2-696"></script>\n\t<script type="text/javascript" src="js/minecraft.js?_=3.3.2-696"></script>\n\t<script type="text/javascript" 
src="js/map.js?_=3.3.2-696"></script>\n\t<script type="text/javascript" src="js/hdmap.js?_=3.3.2-696"></script>\n\t<script type="text/javascript" src="standalone/config.js?_=3.3.2-696"></script>\n\n\t<script type="text/javascript">\n\t\t\t$(document).ready(function() {\n\t\t\t\twindow.dynmap = new DynMap($.extend({\n\t\t\t\t\tcontainer: $(\'#mcmap\')\n\t\t\t\t}, config));\n\t\t\t});\n\t</script>\n\n</head>\n<body>\n<noscript>\n For full functionality of this site it is necessary to enable JavaScript.\n Here are the \n instructions how to enable JavaScript in your web browser.\n</noscript>\n\n\t<div id="mcmap"></div>\n\n</body></html>'
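I suspect the JavaScript has not run (or has not finished running) by the time the page source is read in headless mode: the output still shows an empty <div id="mcmap"></div>. Try running without headless mode to compare.
If the app is built with JavaScript, you could also try a JavaScript-based framework to capture the data.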
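To check whether timing is the problem (the source being read before the scripts have filled #mcmap), you can wait until the container has children before reading it. This is only a sketch under assumptions: I am assuming Dynmap adds child elements to #mcmap once it starts, and wait_for_children and fetch_map_source are names made up for this example. Selenium ships its own WebDriverWait for this kind of polling; the hand-written loop just keeps the helper usable with any driver-like object.

```python
import time


def wait_for_children(driver, css_selector, timeout=15.0, poll=0.5):
    """Poll until something matches css_selector, then return page_source.

    `driver` is expected to behave like a Selenium WebDriver; only
    find_elements() and page_source are used.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the value of Selenium's By.CSS_SELECTOR constant.
        if driver.find_elements("css selector", css_selector):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "nothing matched %r within %s seconds" % (css_selector, timeout)
            )
        time.sleep(poll)


def fetch_map_source(url):
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    # Assumes a Selenium version that can locate chromedriver on its own,
    # or a chromedriver binary on PATH.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the map script adds child nodes to #mcmap when it starts.
        return wait_for_children(driver, "#mcmap *")
    finally:
        driver.quit()
```

Calling fetch_map_source('http://mc164.boxtoplay.com:65248') would then either return the populated source or raise TimeoutError, which at least tells you whether the scripts ever ran.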
Related
I am trying to work out how data is loaded on this page: investorscout.co/investors.
I looked at the Network tab to see how the data gets from the backend onto the page. I also checked the WS (WebSocket) requests, but no luck.
I am confused as to how the site is able to display the data when none of the requests in the Network tab shows it.
I aim to fetch the data using requests and bs4.
Sending a GET request to the page https://investorscout.co/investors returns a response that contains little more than references to external JavaScript files. The content you see in the browser is built dynamically by that JavaScript, so it is not present in the HTML that requests receives.
I would suggest using Selenium instead, as you will not be able to scrape the rendered content otherwise.
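To confirm that the raw response really carries no data, you can check whether the app's mount point has any content in it. This is a rough sketch using only the standard library; MountPointProbe and mount_point_is_empty are names invented here, and the sample string is a trimmed stand-in for the response rather than the full page.

```python
from html.parser import HTMLParser


class MountPointProbe(HTMLParser):
    """Note whether the element with a given id has any child or text."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.found = False
        self.empty = None  # stays None if the element never appears
        self._watching = False

    def handle_starttag(self, tag, attrs):
        if self._watching:
            # A child element appeared before the mount point was closed.
            self.empty, self._watching = False, False
        elif not self.found and dict(attrs).get("id") == self.element_id:
            self.found, self._watching = True, True

    def handle_data(self, data):
        if self._watching and data.strip():
            # Non-whitespace text counts as content too.
            self.empty, self._watching = False, False

    def handle_endtag(self, tag):
        if self._watching:
            # The element was closed before anything showed up inside it.
            self.empty, self._watching = True, False


def mount_point_is_empty(html, element_id):
    probe = MountPointProbe(element_id)
    probe.feed(html)
    probe.close()
    return probe.empty


static_html = (
    "<body><noscript>You need to enable JavaScript to run this app.</noscript>"
    '<div id="root"></div></body>'
)
print(mount_point_is_empty(static_html, "root"))
```

An empty mount point in the response from requests, combined with a populated one in the browser, is a strong hint that the data arrives through script-driven calls after load.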
HTML code of page for reference:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta
name="viewport"
content="width=device-width,initial-scale=1,shrink-to-fit=no"
/>
<meta name="theme-color" content="#000000" />
<link rel="manifest" href="/manifest.json" />
<link rel="”shortcut" icon” href="”/favicon.ico”" />
<title>Investor Scout</title>
<script>
!function(n,u){n._rwq=u,n[u]=n[u]||function(){(n[u].q=n[u].q||[]).push(arguments)}}(window,"rewardful")
</script>
<script
async
src="https://r.wdfl.co/rw.js"
data-rewardful="76b542"
></script>
<script type="text/javascript">
var _iub=_iub||[];_iub.csConfiguration={consentOnContinuedBrowsing:!1,ccpaAcknowledgeOnDisplay:!0,whitelabel:!1,lang:"en",siteId:2020596,enableCcpa:!0,countryDetection:!0,cookiePolicyId:26558236,banner:{acceptButtonDisplay:!0,customizeButtonDisplay:!0,acceptButtonColor:"#0073CE",acceptButtonCaptionColor:"white",customizeButtonColor:"#DADADA",customizeButtonCaptionColor:"#4D4D4D",rejectButtonColor:"#0073CE",rejectButtonCaptionColor:"white",position:"float-top-center",textColor:"black",backgroundColor:"white"}}
</script>
<script
type="text/javascript"
src="//cdn.iubenda.com/cs/ccpa/stub.js"
></script>
<script
type="text/javascript"
src="//cdn.iubenda.com/cs/iubenda_cs.js"
charset="UTF-8"
async
></script>
<script
defer="defer"
src="https://use.fontawesome.com/releases/v5.3.1/js/all.js"
></script>
<script type="text/javascript">
window.__lo_site_id=176375,function(){var t=document.createElement("script");t.type="text/javascript",t.async=!0,t.src="https://d10lpsik1i8c69.cloudfront.net/w.js";var e=document.getElementsByTagName("script")[0];e.parentNode.insertBefore(t,e)}()
</script>
<script type="text/javascript">
window.$crisp=[],window.CRISP_WEBSITE_ID="95efad36-fefd-4cf1-ae4b-a3bb5a61360c",d=document,s=d.createElement("script"),s.src="https://client.crisp.chat/l.js",s.async=1,d.getElementsByTagName("head")[0].appendChild(s)
</script>
<link href="/static/css/main.fc05b0f9.chunk.css" rel="stylesheet" />
</head>
<body>
<noscript>You need to enable JavaScript to run this app.</noscript>
<div id="root"></div>
<script>
!function(e){function t(t){for(var n,i,l=t[0],f=t[1],a=t[2],p=0,s=[];p<l.length;p++)i=l[p],Object.prototype.hasOwnProperty.call(o,i)&&o[i]&&s.push(o[i][0]),o[i]=0;for(n in f)Object.prototype.hasOwnProperty.call(f,n)&&(e[n]=f[n]);for(c&&c(t);s.length;)s.shift()();return u.push.apply(u,a||[]),r()}function r(){for(var e,t=0;t<u.length;t++){for(var r=u[t],n=!0,l=1;l<r.length;l++){var f=r[l];0!==o[f]&&(n=!1)}n&&(u.splice(t--,1),e=i(i.s=r[0]))}return e}var n={},o={1:0},u=[];function i(t){if(n[t])return n[t].exports;var r=n[t]={i:t,l:!1,exports:{}};return e[t].call(r.exports,r,r.exports,i),r.l=!0,r.exports}i.m=e,i.c=n,i.d=function(e,t,r){i.o(e,t)||Object.defineProperty(e,t,{enumerable:!0,get:r})},i.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},i.t=function(e,t){if(1&t&&(e=i(e)),8&t)return e;if(4&t&&"object"==typeof e&&e&&e.__esModule)return e;var r=Object.create(null);if(i.r(r),Object.defineProperty(r,"default",{enumerable:!0,value:e}),2&t&&"string"!=typeof e)for(var n in e)i.d(r,n,function(t){return e[t]}.bind(null,n));return r},i.n=function(e){var t=e&&e.__esModule?function(){return e.default}:function(){return e};return i.d(t,"a",t),t},i.o=function(e,t){return Object.prototype.hasOwnProperty.call(e,t)},i.p="/";var l=this["webpackJsonpinvestor-scout"]=this["webpackJsonpinvestor-scout"]||[],f=l.push.bind(l);l.push=t,l=l.slice();for(var a=0;a<l.length;a++)t(l[a]);var c=f;r()}([])
</script>
<script src="/static/js/2.540fc93a.chunk.js"></script>
<script src="/static/js/main.9ac00620.chunk.js"></script>
</body>
</html>
I am trying to access a website and I keep getting the "access denied" message. I have googled and searched all over this, and everything points to using a "User Agent". I have added my user agent and it is not working. Here is my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
webpage = str('https://www.kroger.com/account/')
options = Options()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
options.add_argument('user-agent={0}'.format(user_agent))
driver = webdriver.Chrome('/Path/chromedriver', options=options)
driver.get(webpage)
create = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div/main/section/section/section/button[2]')
create.click()
When I click on the create account page (I am using this to demonstrate the error I am getting) it takes me to this page:
And after refreshing it gives me the error.
The web page will load, but as soon as I do anything (Sometimes even manually tabbing between the boxes) it will kick me off and take me to the Access Denied page. Any way to resolve this?
EDIT: I have added code to click the "Create Account" button so to show the error that I am getting, and I have also added a photo of the page it sends me to before hitting the Error page.
It's not clear under which circumstances you are hitting the Access Denied page. However, I ran your use case, and here are my observations:
Code Block:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Pass both switches in one call; a second call with the same option name would overwrite the first.
options.add_experimental_option("excludeSwitches", ["enable-logging", "enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.kroger.com/account/')
print(driver.page_source)
Browser Snapshot:
Console Output:
<html lang="en-us" data-react-helmet="lang" class="hydrated"><head>
<meta charset="utf-8"><style data-styles="">kds-tooltippable{visibility:hidden}.hydrated{visibility:inherit}</style>
<meta name="viewport" content="width=device-width,initial-scale=1">
<meta name="google-site-verification" content="mLDjWodVihPJXvMAL0-8hcbuNludulWFtLJ5FFFMbyk">
<meta name="apple-itunes-app" content="app-id=403901186">
<meta name="theme-color" content="#0067b1">
<iframe src="javascript:void(0)" title="" style="width: 0px; height: 0px; border: 0px; display: none;"></iframe><script src="https://apis.google.com/_/scs/apps-static/_/js/k=oz.gapi.en_US.hc3rLxj9u8o.O/m=auth2/rt=j/sv=1/d=1/ed=1/am=wQE/rs=AGLTcCMtAagp6kGxB19Nep_bTJunj37kww/cb=gapi.loaded_0" async=""></script><script type="text/javascript" src="https://www.kroger.com/resources/0f06f8547303cb204a2ba5ee8d0c2be4f278e07179439"></script><script type="text/javascript" src="/ruxitagentjs_ICA27SVfghjqrtux_10197200717183318.js" data-dtconfig="rid=RID_-461653321|rpid=-319708356|domain=kroger.com|reportUrl=/rb_7571065c-f052-471e-a3d7-f99d529548bb|app=81222ad3b2deb1ef|agentId=215b1e64d6441901|ssc=1|featureHash=ICA27SVfghjqrtux|vcv=1|rdnt=0|uxrgce=1|cuc=49xey1j6|md=mdcc1=cabTest,mdcc3=bdocument.referrer,mdcc4=bs.visitorID,mdcc6=bs.transactionID,mdcc7=cs_ecid,mdcc8=adiv[data-qa^e^dqsubmit-error^dq] .kds-Message-content,mdcc9=bs_dtm.pageName,mdcc10=cStoreCode,mdcc11=cStoreZipCode,mdcc12=cStoreLocalName,mdcc13=dutm_medium,mdcc14=dutm_campaign,mdcc15=dutm_content,mdcc16=dutm_source,mdcc17=bkrgrData.payload.metaData.campaignID,mdcc18=bsearchCID,mdcc19=ali[data-qa^e^dqCartEstimatedTotal-subTotal^dq],mdcc20=bnavigator.userAgent,mdcc21=cloggedin|lastModification=1597688640250|dtVersion=10197200717183318|tp=500,50,0,1|uxdcw=1500|agentUri=/ruxitagentjs_ICA27SVfghjqrtux_10197200717183318.js"></script><link rel="search" type="application/opensearchdescription+xml" href="/osd.xml" title="Kroger">
<link rel="manifest" href="/site.webmanifest">
<link rel="apple-touch-icon" href="/apple-touch-icon.png">
<title>Kroger</title>
<script src="/sa/kroger-header.d2aa6e624b99b8e4993b.js" defer=""></script>
<script src="/sa/#kroger/account-sign-in.5570149badf101ae09f5.js" defer=""></script>
<script src="/sa/coupons~main.e313a51a37dbad8980b1.js" defer=""></script>
<script src="/sa/products~main.6fd57a24319d5b8ad376.js" defer=""></script>
<script src="/sa/redux~main.b54055dbff5d2dbea98b.js" defer=""></script>
<script src="/sa/internal~main.b6c4b86585460ad2d826.js" defer=""></script>
<script src="/sa/kds~main.a4e3dbc91309d0b1dbb5.js" defer=""></script>
<script src="/sa/time~main.689d89c867b93785cd58.js" defer=""></script>
<script src="/sa/react~main.0f8c529ae5985d95333e.js" defer=""></script>
<script src="/sa/compat~main.a1504007c3b3afabc8e0.js" defer=""></script>
<script src="/sa/common~main.f64c9b672d7d0a00c2d7.js" defer=""></script>
<script src="/sa/vendors~main.725b80732ad8d3325d46.js" defer=""></script>
<script src="/sa/main.240039c3d849b8bd33bc.js" defer=""></script>
<link data-react-helmet="true" rel="canonical" href="https://www.kroger.com/signin">
<link rel="stylesheet" href="/sa/vendors~main.d3cc9575af.css">
<link rel="stylesheet" href="/sa/internal~main.00555b7772.css">
<link rel="stylesheet" href="/sa/products~main.a1bfd3c28a.css">
<link rel="stylesheet" href="/sa/coupons~main.c36bbd64b9.css">
<link rel="stylesheet" href="/sa/kroger-header.75a650a0c2.css">
.
.
.
<div id="ZN_dnk7EnVUuZidS97"></div>
<noscript><img src="https://www.kroger.com/akam/11/pixel_29e0b938?a=dD0zZGYzZWUxMjAzZDM3ZmRlYjA3YjExYjRkM2Y2MDlmOWJlOGUxNDY1JmpzPW9mZg==" style="visibility: hidden; position: absolute; left: -999px; top: -999px;" /></noscript><script type="text/javascript">var _cf = _cf || []; _cf.push(['_setFsp', true]); _cf.push(['_setBm', true]); _cf.push(['_setAu', '/resources/0f06f85473rn244317954ff2256514de']); </script><script type="text/javascript" src="/resources/0f06f85473rn244317954ff2256514de"></script>
<div id="kds-Portal-toast" class="kds-Portal pointer-events-none undefined"><div class="kds-ToastGroup"></div></div><iframe sandbox="allow-scripts allow-same-origin" title="Adobe ID Syncing iFrame" id="destination_publishing_iframe_kroger_0" name="destination_publishing_iframe_kroger_0_name" src="https://kroger.demdex.net/dest5.html?d_nsid=0#https%3A%2F%2Fwww.kroger.com" style="display: none; width: 0px; height: 0px;"></iframe></body></html>
This is my first foray into Selenium. Apologies in advance if this is a stupid/trivial question.
I am trying to scrape information from a webpage. With Python/Selenium I am able to log on to the site and get to the page with the information I need. Once that page is displayed, I run:
time.sleep(20)
html_source = driver.page_source
print(html_source)
The source that gets printed is different from both of these:
right-click and select View Page Source, and
right-click and select This Frame, then View Frame Source.
The required information is in the View Frame source. All of this is in Firefox.
What do I need to do to get to the Frame Source? There is no frame name in the Frame Source.
Additional information below:
When I right click and select view page source I get the below:
<!DOCTYPE html><html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>xxxxxxx Portal</title>
<base href="https://website.org/page/">
<link rel="shortcut icon" href="images/logos/xxxxxxx.ico">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1"><script type="text/javascript" src="https://website.org/page/security/csrf.js"> </script><script type="text/javascript" src="https://website.org/page/security/csrf/execute.js"> </script><script>
function pushFocus()
{
frameDetail.focus();
}
function addInProgressPanel(doc)
{
var d = doc.createElement('div');
d.id="inProgressPane";
d.className="freezeOn";
var tbl = doc.createElement("table");
var row = tbl.insertRow(-1);
var oi = doc.createElement("img");
oi.src= 'https://website.org/page/'+ "images/actions/loading2.gif";
var td = doc.createElement("td");
td.className="detailFormField";
td.bgcolor="red";
td.appendChild(oi);
row.appendChild(td);
td = doc.createElement("td");
td.className="inProcessing";
td.appendChild(doc.createTextNode("Your Request is Being Processed ..."));
row.appendChild(td);
d.appendChild(tbl);
doc.body.appendChild(d);
return d;
}
function inProgressScreen(type)
{
var ws = frames["frameDetail"];
if(!ws) return true;
var ips = ws.document.getElementById("inProgressPane");
if(ips)
{
if(type) ips.className = 'freezeOn';
else ips.className = 'freezeOff';
}else if(type)
ips = addInProgressPanel(ws.document);
}
</script></head>
<frameset id="main" framespacing="0" frameborder="0">
<frame id="frameDetail" name="frameDetail" scrolling="auto" marginwidth="0" marginheight="0" src="portal/portal.xsl?x=portal.PortalOutline&lang=en&mode=notices">
</frameset>
</html>
When I right click and select This Frame, View Frame source I get
<!DOCTYPE html><html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<base href="https://website.org/xxxxxx/">
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<title>xxxxxxxx Portal</title>
<link rel="stylesheet" type="text/css" href="styles/portal/menu.css">
<link rel="stylesheet" type="text/css" href="styles/portal/header.css">
<link rel="stylesheet" type="text/css" href="styles/portal/footer.css">
<link rel="stylesheet" type="text/css" href="styles/portal/jquery-ui-1.8.7.portal.css">
<link rel="stylesheet" type="text/css" href="styles/portal/fg.menu.css">
<link rel="stylesheet" type="text/css" href="styles/portal/portal.css">
<link rel="stylesheet" type="text/css" href="styles/icons.css">
<link rel="stylesheet" type="text/css" href="styles/portal/notifications.css"><script type="text/javascript" src="https://website.org/xxxxxxxx/security/csrf.js"> </script><script type="text/javascript" src="https://website.org/xxxxxxxx/security/csrf/execute.js"> </script><script src="scripts/widgets/common.js"></script><script src="scripts/controller.js"></script><script src="scripts/portal.js"></script><script src="scripts/jquery/jquery-1.7.2.min.js"></script><script type="text/javascript" src="https://website.org/xxxxxxxx/security/csrf/jquery.js"> </script><script src="scripts/jquery/jquery-ui-1.8.16.min.js"></script><script src="scripts/jquery/fg.menu.js"></script><script src="portal/lang/datePickerLanguage.jsp?lang=en"></script><script src="portal/portal.js"></script><script src="portal/portalNoShim.js"></script><script>
Lots more code here. Did not paste as it was too long. There is no frame name other than the reference to iSessionFrame below:
</script><script language="javascript" src="portal/grades.js"></script></div>
</div>
</div>
<div id="footer">
<table id="language"><select id="locale" style="width:175px"></select></table>
</div>
</div><iframe id="iSessionFrame" name="iSessionFrame" width="0" height="0" src="https://website.org/xxxxxx/white.jsp" style="visibility:hidden;"></iframe></body>
</html>
Q: What do I need to do to get to the Frame Source?
A: First switch to the frame you want using switch_to, then read .page_source to get that frame's HTML source.
Note: take a look at the Selenium docs, specifically the section on moving between windows and frames.
Code:
# "frameDetail" is the frame's id and name, not a tag name, so pass it to switch_to.frame directly.
driver.switch_to.frame("frameDetail")
html_source = driver.page_source
You could try switching to the frame by its ID:
driver.switch_to.frame("iSessionFrame")
html_source = driver.page_source
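If you need to read the frame and then carry on in the top-level document, it can help to wrap the switch in a context manager so you always end up back where you started. A minimal sketch, assuming a Selenium driver whose switch_to.frame accepts the frame's name or id; inside_frame and frame_source are hypothetical helper names, not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame for the duration of the with-block."""
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        # Always return to the top-level document, even on errors.
        driver.switch_to.default_content()


def frame_source(driver, frame_reference):
    """Return the HTML of one frame and leave the driver at the top level."""
    with inside_frame(driver, frame_reference):
        return driver.page_source
```

With the page from the question, frame_source(driver, "frameDetail") should give the markup you see under View Frame Source, while driver.page_source afterwards is the outer frameset again.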
I'm new to web scraping and have run into a problem.
I tried to extract the list of states from this site, 'https://www.iso.org/obp/ui/#iso:code:3166:JP', using Python, Selenium and PhantomJS, but all I got was the output below.
<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=11;chrome=1">
<style type="text/css">html, body {height:100%;margin:0;}</style>
<link rel="shortcut icon" type="image/vnd.microsoft.icon" href="./../VAADIN/themes/obp/favicon.ico">
<link rel="icon" type="image/vnd.microsoft.icon" href="./../VAADIN/themes/obp/favicon.ico">
<link rel="stylesheet" type="text/css" href="./../VAADIN/themes/obp/styles.css"><script type="text/javascript" src="./../VAADIN/widgetsets/org.iso.obp.ui.widgetset.applicationWidgetset/org.iso.obp.ui.widgetset.applicationWidgetset.nocache.js?1444641834593"></script><script src="https://www.iso.org/obp/VAADIN/widgetsets/org.iso.obp.ui.widgetset.applicationWidgetset/913365F3A38F531CF0D09D8744F3A155.cache.js"></script></head>
<body scroll="auto" class=" v-generated-body">
<div id="obpui-105541713" class=" v-app obp">
<div class=" v-app-loading"></div>
<noscript>
You have to enable javascript in your browser to use an application built with Vaadin.
</noscript>
</div>
<script type="text/javascript" src="./../VAADIN/vaadinBootstrap.js"></script>
<script type="text/javascript">//<![CDATA[
if (!window.vaadin) alert("Failed to load the bootstrap javascript: ./../VAADIN/vaadinBootstrap.js");
vaadin.initApplication("obpui-105541713",{"heartbeatInterval":300,"versionInfo":{"vaadinVersion":"7.3.10"},"vaadinDir":"./../VAADIN/","authErrMsg":{"message":"Take note of any unsaved data, and <u>click here<\/u> or press ESC to continue.","caption":"Authentication problem"},"widgetset":"org.iso.obp.ui.widgetset.applicationWidgetset","theme":"obp","comErrMsg":{"message":"Take note of any unsaved data, and <u>click here<\/u> or press ESC to continue.","caption":"Communication problem"},"serviceUrl":".","standalone":true,"sessExpMsg":{"message":"Take note of any unsaved data, and <u>click here<\/u> or press ESC key to continue.","caption":"Session Expired"}});
//]]></script>
</body></html>
Here is my Python code:
from selenium import webdriver
target_url = 'https://www.iso.org/obp/ui/#iso:code:3166:JP'
driver = webdriver.PhantomJS()
driver.get(target_url)
print(driver.page_source)
Is there any solution for this?
I'm using BS4 with python2.7. Here's the start of my code (Thanks root):
from bs4 import BeautifulSoup
import urllib2
f=urllib2.urlopen('http://yify-torrents.com/browse-movie')
html=f.read()
soup=BeautifulSoup(html)
When I print html, its contents are the same as the source of the page viewed in Chrome. When I print soup, however, it cuts out the entire body and leaves me with this (the contents of the head tag):
<!DOCTYPE html>
<html>
<head>
<title>Browse Movie - YIFY Torrents</title>
<meta charset="utf-8">
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="YIFY-Torrents.com - The official YIFY Torrents website. Here you will be able to browse and download all YIFY rip movies in excellent DVD, 720p, 1080p and 3D quality, all at the smallest file size." name="description"/>
<meta content="torrents, yify, movies, movie, download, 720p, 1080p, 3D, browse movies, yify-torrents" name="keywords"/>
<link href="http://static.yify-torrents.com/yify.ico" rel="shortcut icon"/>
<link href="http://yify-torrents.com/rss" rel="alternate" title="YIFY-Torrents RSS feed" type="application/rss+xml"/>
<link href="http://static.yify-torrents.com/assets/css/styles.css?1353330463" rel="stylesheet" type="text/css"/>
<link href="http://static.yify-torrents.com/assets/css/colorbox.css?1327223987" rel="stylesheet" type="text/css"/>
<script src="http://static.yify-torrents.com/assets/js/jquery-1.6.1.min.js?1327224013" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.validate.min.js?1327224011" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.colorbox-min.js?1327224010" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/form.js?1349683447" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/common.js?1353399801" type="text/javascript"></script>
<script>
var webRoot = 'http://yify-torrents.com/';
var IsLoggedIn = 0 </script>
<!--[if !IE]><!--><style type="text/css">#content input.field:focus, #content textarea:focus{border: 1px solid #47bc15 !important;}</style></meta></head></html>
Where am I going wrong?!
I had the same problem, and switching to the more lenient html5lib parser solved it:
soup = BeautifulSoup(html, 'html5lib')
You need to install html5lib:
pip install html5lib
or
easy_install html5lib
You can read more about different parsers (pros and cons) for Beautiful Soup here:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
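If you are not sure which parsers are installed on a given machine, you can pick one at run time. A small sketch under assumptions: it is written for Python 3, pick_parser and make_soup are names I made up, and the preference order is my guess at what suits broken markup rather than a Beautiful Soup recommendation. The parser names themselves ("html5lib", "lxml", "html.parser") are the ones Beautiful Soup accepts.

```python
from importlib.util import find_spec


def pick_parser(preferred=("html5lib", "lxml", "html.parser")):
    """Return the first parser name whose backing module can be found."""
    for name in preferred:
        # Each of these parser names is also the name of the module behind it.
        if find_spec(name) is not None:
            return name
    raise LookupError("none of the requested parsers is installed")


def make_soup(html):
    # Imported lazily so pick_parser can be used without bs4 installed.
    from bs4 import BeautifulSoup

    return BeautifulSoup(html, pick_parser())
```

Naming the parser explicitly also means the same script parses the page the same way on every machine, instead of depending on whichever parser happens to be installed.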