BeautifulSoup, where are you putting my HTML? - python

I'm using BS4 with Python 2.7. Here's the start of my code (thanks, root):
from bs4 import BeautifulSoup
import urllib2
f=urllib2.urlopen('http://yify-torrents.com/browse-movie')
html=f.read()
soup=BeautifulSoup(html)
When I print html, its contents match the page source as viewed in Chrome. When I print soup, however, the entire body is cut out and I am left with only this (the contents of the head tag):
<!DOCTYPE html>
<html>
<head>
<title>Browse Movie - YIFY Torrents</title>
<meta charset="utf-8">
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="YIFY-Torrents.com - The official YIFY Torrents website. Here you will be able to browse and download all YIFY rip movies in excellent DVD, 720p, 1080p and 3D quality, all at the smallest file size." name="description"/>
<meta content="torrents, yify, movies, movie, download, 720p, 1080p, 3D, browse movies, yify-torrents" name="keywords"/>
<link href="http://static.yify-torrents.com/yify.ico" rel="shortcut icon"/>
<link href="http://yify-torrents.com/rss" rel="alternate" title="YIFY-Torrents RSS feed" type="application/rss+xml"/>
<link href="http://static.yify-torrents.com/assets/css/styles.css?1353330463" rel="stylesheet" type="text/css"/>
<link href="http://static.yify-torrents.com/assets/css/colorbox.css?1327223987" rel="stylesheet" type="text/css"/>
<script src="http://static.yify-torrents.com/assets/js/jquery-1.6.1.min.js?1327224013" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.validate.min.js?1327224011" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.colorbox-min.js?1327224010" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/form.js?1349683447" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/common.js?1353399801" type="text/javascript"></script>
<script>
var webRoot = 'http://yify-torrents.com/';
var IsLoggedIn = 0 </script>
<!--[if !IE]><!--><style type="text/css">#content input.field:focus, #content textarea:focus{border: 1px solid #47bc15 !important;}</style></meta></head></html>
Where am I going wrong?!

I had the same problem, and switching to the html5lib parser, which parses markup the way a browser does, solved it:
soup = BeautifulSoup(html, 'html5lib')
You need to install html5lib:
pip install html5lib
or
easy_install html5lib
You can read more about different parsers (pros and cons) for Beautiful Soup here:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
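If it is unclear which parsers are installed on a given machine, a small helper can choose one before the soup is built. This is only a sketch: pick_parser is a hypothetical name, and the preference order (html5lib first, then lxml, then the built-in html.parser) is an assumption, not something Beautiful Soup prescribes.

```python
import importlib.util


def pick_parser(preferred=("html5lib", "lxml")):
    """Return the first name in `preferred` whose module is installed.

    Falls back to "html.parser", which ships with Python itself.
    """
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


if __name__ == "__main__":
    # Usage with Beautiful Soup (requires bs4 to be installed):
    #   soup = BeautifulSoup(html, pick_parser())
    print(pick_parser())
```

Passing the parser name explicitly also keeps the result consistent across machines, since Beautiful Soup otherwise uses whichever parser it happens to find.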

Related

How to get raw html with absolute links paths when using 'requests-html'

When making a request using the requests library to https://stackoverflow.com
import requests
page = requests.get(url='https://stackoverflow.com')
print(page.content)
I get the following:
<!DOCTYPE html>
<html class="html__responsive html__unpinned-leftnav">
<head>
<title>Stack Overflow - Where Developers Learn, Share, & Build Careers</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196">
<link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a">
<link rel="image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a">
..........
This source has absolute paths. But when I request the same URL using requests-html with JS rendering:
from requests_html import HTMLSession
with HTMLSession() as session:
    page = session.get('https://stackoverflow.com')
    page.html.render()
    print(page.content)
I get the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>StackOverflow.org</title>
<script type="text/javascript" src="lib/jquery.js"></script>
<script type="text/javascript" src="lib/interface.js"></script>
<script type="text/javascript" src="lib/window.js"></script>
<link href="lib/dock.css" rel="stylesheet" type="text/css" />
<link href="lib/window.css" rel="stylesheet" type="text/css" />
<link rel="icon" type="image/gif" href="favicon.gif"/>
..........
The links here are relative paths.
How can I get the source with absolute paths, as with requests, when using requests-html with JS rendering?
This should probably be a feature request for the requests-html developers. For now, however, we can achieve it with this hackish solution:
from requests_html import HTMLSession
from lxml import etree

with HTMLSession() as session:
    html = session.get('https://stackoverflow.com').html
    html.render()
    # iterate over all links
    for link in html.pq('a'):
        if "href" in link.attrib:
            # Make links absolute
            link.attrib["href"] = html._make_absolute(link.attrib["href"])
    # Print html with only absolute links
    print(etree.tostring(html.lxml).decode())
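We change the html object's underlying lxml tree by iterating over all links and rewriting each location to an absolute URL with the html object's private _make_absolute function. Note that this only touches href attributes on a elements; src attributes and link hrefs are left as they are.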
The module's documentation distinguishes between absolute and relative links.
Quote:
Grab a list of all links on the page, in absolute form (anchors excluded):
r.html.absolute_links
Could you try this statement?
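If neither approach fits, relative URLs can also be resolved by hand. The sketch below uses only the standard library; LinkCollector and absolute_urls are hypothetical names, and it assumes the URLs of interest live in href and src attributes. It ignores any base element in the page, which a complete solution would need to honor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect every href/src value and resolve it against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # urljoin leaves absolute URLs untouched and resolves relative ones
                self.urls.append(urljoin(self.base_url, value))


def absolute_urls(html_text, base_url):
    """Return all href/src URLs found in html_text, in absolute form."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.urls
```

For example, absolute_urls(page_html, 'https://stackoverflow.com/') would turn a value such as lib/dock.css into a full URL under that base.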

div Web Scrape on site output returns None

I was trying to scrape the past multipliers from https://roobet.com/crash, but when I run the program there are no results. What's the problem? The code is below.
from bs4 import BeautifulSoup
import requests

source = requests.get('https://roobet.com/crash').text
soup = BeautifulSoup(source, 'lxml')
title = soup.find('title').text
results = soup.find_all('div', attrs={'class': 'jss75'})
for i in results:
    multi = i.find('span', attrs={"class": "jss75"})
    if multi is not None:
        # .text belongs on the tag, not on the return value of print()
        print('multi:', multi.text)
Thanks for the help!
Take a look at the returned source and you may understand why you cannot find the result you are looking for.
<!DOCTYPE html>
<html lang="en">
<head>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-563FCQS');</script>
<!-- End Google Tag Manager -->
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="preconnect" href="https://fonts.googleapis.com/" crossorigin>
<title>Roobet | Crypto's Fastest Growing Casino</title>
<meta name="description" content="Roobet, crypto's fastest growing casino. Hop on in, chat to others and play exciting games - Come and join the fun!">
<base href="/">
<meta name="theme-color" content="#191b31" />
<link rel="icon" type="image/png" href="images/favicon.png">
<link rel="manifest" href="/manifest.json" />
<script src="https://cdn.onesignal.com/sdks/OneSignalSDK.js" async ></script>
<script src="https://maps.googleapis.com/maps/api/js?key=AIzaSyCXI19SE-ZWv_ZyW7gGMzCTf4TGfOA3Sdk&libraries=places"></script>
<script src="https://tekhou5-dk2.pragmaticplay.net/gs2c/common/js/lobby/GameLib.js" />
<script>
var OneSignal = window.OneSignal || [];
OneSignal.push(function() {
OneSignal.init({
appId: "29c72f64-e7e6-408c-99b2-d86a84c6a9cb",
notifyButton: {
enable: false,
autoResubscribe: true,
},
welcomeNotification: {
disable: true
}
});
});
</script>
<link href="0.aafac69fdc9eee2864e9.css" rel="stylesheet"><link href="app.aafac69fdc9eee2864e9.css" rel="stylesheet"></head>
<body>
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-563FCQS"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<div id="root"></div>
<div id="modalRoot"></div>
<div id="loader">
<div class="loaderLogo">
<img src="/images/logo.svg" />
</div>
</div>
<script type="text/javascript" src="vendors.37e373e3e07a018e2e49.bundle.js"></script><script type="text/javascript" src="locale.9c51b6a88780f5e87cd3.bundle.js"></script><script type="text/javascript" src="app.7bee5f919f764925b254.bundle.js"></script></body>
<script>(function(){var w=window;var ic=w.Intercom;if(typeof ic==="function"){ic('reattach_activator');ic('update',intercomSettings);}else{var d=document;var i=function(){i.c(arguments)};i.q=[];i.c=function(args){i.q.push(args)};w.Intercom=i;function l(){var s=d.createElement('script');s.type='text/javascript';s.async=true;s.src='https://widget.intercom.io/widget/gcr7bzde';var x=d.getElementsByTagName('script')[0];x.parentNode.insertBefore(s,x);}if(w.attachEvent){w.attachEvent('onload',l);}else{w.addEventListener('load',l,false);}}})()</script>
<script src="https://intaggr.softswiss.net/public/sg.js"></script>
<script type="text/javascript" src="https://www.google.com/recaptcha/api.js?render=6LdG97YUAAAAAHMcbX2hlyxQiHsWu5bY8_tU-2Y_"></script>
<script type="text/javascript">
if (typeof window.grecaptcha !== 'undefined') {
grecaptcha.ready(function() {
grecaptcha.execute('6LdG97YUAAAAAHMcbX2hlyxQiHsWu5bY8_tU-2Y_', {action: 'homepage'});
})
}
</script>
</html>
When you inspect the element on the website, the div containing the multipliers you are looking for is there: <div class="jss75">. In the source above, however, the body of the HTML file contains only script imports, which generate the HTML you are looking for.
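A rough way to spot this situation before writing any selectors is to measure how much visible text the raw response actually contains. The following is a heuristic sketch, not a definitive test: BodyTextCounter, looks_js_rendered and the 50-character threshold are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class BodyTextCounter(HTMLParser):
    """Count visible text characters inside <body>, ignoring scripts and styles."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only text that a reader would see counts toward the total
        if self.in_body and self.skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_js_rendered(html_text, min_chars=50):
    """Return True when the body holds almost no text of its own."""
    counter = BodyTextCounter()
    counter.feed(html_text)
    counter.close()
    return counter.text_chars < min_chars
```

If looks_js_rendered(source) returns True, the content is most likely filled in by JavaScript after the page loads, so parsing the raw response will not find it.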
Some of the data you are looking for might be contained in the other files retrieved by the website (open dev tools, go to the Network tab and reload). The recentNumbers file looks like it might contain what you need (I'm not familiar with the website): it contains many data points labeled crashPoint, which look like the multipliers you are looking for.
https://api.roobet.com/crash/recentNumbers
If this isn't what you're looking for, I can take a deeper look; otherwise, as I say, check out the Network tab and all the data it pulls in.
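Pulling the values out of such a response could look roughly like this. The exact shape of the JSON is not shown above, so the sketch makes no assumption about nesting and simply searches the decoded data for every crashPoint key; find_values and fetch_crash_points are hypothetical names, and the endpoint may have changed or may require extra headers.

```python
import json
from urllib.request import urlopen


def find_values(node, key):
    """Recursively yield every value stored under `key` in decoded JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            else:
                yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def fetch_crash_points(url="https://api.roobet.com/crash/recentNumbers"):
    """Download the JSON document and collect all crashPoint values."""
    # Network call: not guaranteed to succeed against the live site
    with urlopen(url) as response:
        return list(find_values(json.load(response), "crashPoint"))
```

Keeping the search separate from the download makes it easy to try find_values on a response saved from the Network tab first.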

bs4 returns empty array when i try to scrape some elements [duplicate]

Hello World,
I'm new to Python and am trying to scrape a JavaScript-rendered page: https://search.gleif.org/#/search/
Below is the result from my code (using requests):
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<title>LEI Search 2.0</title>
<link href="/static/icons/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://fonts.googleapis.com/css?family=Open+Sans:200,300,400,600,700,900&subset=cyrillic,cyrillic-ext,greek,greek-ext,latin-ext,vietnamese" rel="stylesheet"/>
<link href="/static/css/main.045139db483277222eb714c1ff8c54f2.css" rel="stylesheet"/></head>
<body>
<div id="app"></div>
<script src="/static/js/manifest.2ae2e69a05c33dfc65f8.js" type="text/javascript"></script>
<script src="/static/js/vendor.6bd9028998d5ca3bb72f.js" type="text/javascript"></script>
<script src="/static/js/main.5da23c5198041f0ec5af.js" type="text/javascript"></script>
</body>
</html>
The question:
Instead of retrieving script tags like the one above:
src="/static/js/manifest.2ae2e69a05c33dfc65f8.js" type="text/javascript"
I would like to get the content of the table so that I can store it.
[Image: the table that I want to scrape]
The following code uses Selenium for Python.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

country = []
legal_name = []
lei = []
driver = webdriver.Chrome()
driver.implicitly_wait(5)
for page in range(1, 30395):
    driver.get('https://search.gleif.org/#/search/fulltextFilterId=LEIREC_FULLTEXT&currentPage=' + str(page) + '&perPage=50&expertMode=false#results-section')
    time.sleep(5)  # give the JavaScript time to render the table
    # find_elements(By.XPATH, ...) replaces the older find_elements_by_xpath helper
    country += [el.get_attribute('innerHTML') for el in driver.find_elements(By.XPATH, '//*[@class="table-cell country"]/a')]
    legal_name += [el.get_attribute('innerHTML') for el in driver.find_elements(By.XPATH, '//*[@class="table-cell legal-name"]/a')]
    lei += [el.get_attribute('innerHTML') for el in driver.find_elements(By.XPATH, '//*[@class="table-cell lei"]/a')]
If the site requires logging in (replace the locators with the ones for the respective elements):
driver.find_element(By.ID, "UserName").send_keys("xxxx")
driver.find_element(By.NAME, "Password").send_keys("yyyy")
driver.find_element(By.CLASS_NAME, "loginButton").click()
To get the page content:
print(driver.page_source)
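A fixed time.sleep(5) on every page either wastes time or is too short when the site is slow. Selenium ships WebDriverWait for this purpose; the sketch below only illustrates the underlying idea of polling until a condition holds, using the standard library alone. wait_until is a hypothetical helper, and the default timeout and polling interval are arbitrary.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns something truthy.

    Returns that truthy value, or raises TimeoutError once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(poll)
```

With a driver, the condition could be a lambda that performs the element lookup: an empty list is falsy, so the loop keeps polling until the table cells exist and then returns them.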

Scraping webpage

I am trying to write a Python script to scrape data from this webpage. I am trying to scrape the data from the second table ('class': 'char-pico-table') and am using this script to do so:
import requests

def getPICO(url):
    r = requests.get(url)
    print(r.content)
However, this prints this:
b'<!DOCTYPE html>\n<html class="view">\n <head>\n <title>RobotReviewer: Automating evidence synthesis</title>\n <meta charset="utf-8">\n <meta name="viewport" content="width=device-width, initial-scale=1.0">\n <meta name="google" content="notranslate">\n\n <link rel="stylesheet" type="text/css" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">\n <link rel="stylesheet" type="text/css" href="/css/main.css">\n <link rel="stylesheet alternative prefetch" type=text/css href="/css/report.css">\n\n <!-- Preload examples -->\n <link rel="prefetch" href="/report_view/Tvg0-pHV2QBsYpJxE2KW-/html">\n <link rel="prefetch" href="/report_view/_fzGUEvWAeRsqYSmNQbBq/html">\n <link rel="prefetch" href="/report_view/HBkzX1I3Uz_kZEQYeqXJf/html">\n\n <!-- / Preload examples -->\n\n\n <script src="/scripts/modernizr.js"></script>\n <script src="/scripts/spa/scripts/vendor/pdfjs/pdf.js"></script>\n <script src="/scripts/spa/scripts/vendor/compatibility.js"></script>\n <script data-main="/scripts/main" src="/scripts/require.js"></script>\n\n <script>\n PDFJS.disableWebGL = false;\n CSRF_TOKEN = "1508009356##6a03b1bf519972b27a0d871ae4823eb3a3366c0c";\n </script>\n </head>\n\n <body>\n <nav id="top-bar" class="top-bar" data-topbar role="navigation">\n <div>\n <ul class="title-area">\n <li class="name">\n <h1><img src="/img/logo.svg" width="190px"></h1>\n </li>\n </ul>\n\n <section class="top-bar-section">\n <ul class="right">\n <li>About</li>\n </ul>\n </section>\n </div>\n </nav>\n\n <div id="breadcrumbs"></div>\n\n <main id="main"></main>\n\n\n </body>\n</html>'
This is not the output I see when I view the page in my browser; it contains none of the data that I wish to scrape. Why is that?
When viewed in a web browser, the page looks like this:
[Image: expected output]
Based on the comment from @Shahin, I wrote the following code, which returned the data as JSON, from which I could easily extract what I needed.
import json
import requests
# id holds the report identifier, presumably the one that appears in the report_view URL
result = json.loads(requests.get('https://robot-reviewer.vortext.systems/report_view/'+id+'/json').content)
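Deriving the JSON address from a report page address could be sketched as follows. This assumes that report URLs follow the /report_view/<id>/html pattern visible in the prefetch links of the response above; report_json_url is a hypothetical helper, not part of any RobotReviewer API.

```python
import re

BASE = "https://robot-reviewer.vortext.systems/report_view/"


def report_json_url(report_url):
    """Build the JSON endpoint URL for a report_view page URL."""
    match = re.search(r"/report_view/([^/]+)", report_url)
    if match is None:
        raise ValueError("no report id found in {!r}".format(report_url))
    # The identifier is the path segment right after /report_view/
    return BASE + match.group(1) + "/json"
```

The returned URL can then be passed to requests.get in place of the string concatenation.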

Python selenium print frame source

This is my first foray into Selenium. Apologies in advance if this is a stupid/trivial question.
I am trying to scrape information from a webpage. With Python/Selenium I am able to log on to the site and get to the page with the information I need. After the page I need is displayed, I am issuing
time.sleep(20)
html_source = driver.page_source
print html_source
The "source" that gets printed is different from both of these:
the result of right-clicking and selecting View Page Source, and
the result of right-clicking and selecting This Frame, View Frame Source.
The required information is in the View Frame source. All of this is in Firefox.
What do I need to do to get to the Frame Source? There is no frame name in the Frame Source.
Additional information below:
When I right click and select view page source I get the below:
<!DOCTYPE html><html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>xxxxxxx Portal</title>
<base href="https://website.org/page/">
<link rel="shortcut icon" href="images/logos/xxxxxxx.ico">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1"><script type="text/javascript" src="https://website.org/page/security/csrf.js"> </script><script type="text/javascript" src="https://website.org/page/security/csrf/execute.js"> </script><script>
function pushFocus()
{
frameDetail.focus();
}
function addInProgressPanel(doc)
{
var d = doc.createElement('div');
d.id="inProgressPane";
d.className="freezeOn";
var tbl = doc.createElement("table");
var row = tbl.insertRow(-1);
var oi = doc.createElement("img");
oi.src= 'https://website.org/page/'+ "images/actions/loading2.gif";
var td = doc.createElement("td");
td.className="detailFormField";
td.bgcolor="red";
td.appendChild(oi);
row.appendChild(td);
td = doc.createElement("td");
td.className="inProcessing";
td.appendChild(doc.createTextNode("Your Request is Being Processed ..."));
row.appendChild(td);
d.appendChild(tbl);
doc.body.appendChild(d);
return d;
}
function inProgressScreen(type)
{
var ws = frames["frameDetail"];
if(!ws) return true;
var ips = ws.document.getElementById("inProgressPane");
if(ips)
{
if(type) ips.className = 'freezeOn';
else ips.className = 'freezeOff';
}else if(type)
ips = addInProgressPanel(ws.document);
}
</script></head>
<frameset id="main" framespacing="0" frameborder="0">
<frame id="frameDetail" name="frameDetail" scrolling="auto" marginwidth="0" marginheight="0" src="portal/portal.xsl?x=portal.PortalOutline&lang=en&mode=notices">
</frameset>
</html>
When I right click and select This Frame, View Frame source I get
<!DOCTYPE html><html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<base href="https://website.org/xxxxxx/">
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<title>xxxxxxxx Portal</title>
<link rel="stylesheet" type="text/css" href="styles/portal/menu.css">
<link rel="stylesheet" type="text/css" href="styles/portal/header.css">
<link rel="stylesheet" type="text/css" href="styles/portal/footer.css">
<link rel="stylesheet" type="text/css" href="styles/portal/jquery-ui-1.8.7.portal.css">
<link rel="stylesheet" type="text/css" href="styles/portal/fg.menu.css">
<link rel="stylesheet" type="text/css" href="styles/portal/portal.css">
<link rel="stylesheet" type="text/css" href="styles/icons.css">
<link rel="stylesheet" type="text/css" href="styles/portal/notifications.css"><script type="text/javascript" src="https://website.org/xxxxxxxx/security/csrf.js"> </script><script type="text/javascript" src="https://website.org/xxxxxxxx/security/csrf/execute.js"> </script><script src="scripts/widgets/common.js"></script><script src="scripts/controller.js"></script><script src="scripts/portal.js"></script><script src="scripts/jquery/jquery-1.7.2.min.js"></script><script type="text/javascript" src="https://website.org/xxxxxxxx/security/csrf/jquery.js"> </script><script src="scripts/jquery/jquery-ui-1.8.16.min.js"></script><script src="scripts/jquery/fg.menu.js"></script><script src="portal/lang/datePickerLanguage.jsp?lang=en"></script><script src="portal/portal.js"></script><script src="portal/portalNoShim.js"></script><script>
Lots more code here. Did not paste as it was too long. There is no frame name other than the reference to iSessionFrame below:
</script><script language="javascript" src="portal/grades.js"></script></div>
</div>
</div>
<div id="footer">
<table id="language"><select id="locale" style="width:175px"></select></table>
</div>
</div><iframe id="iSessionFrame" name="iSessionFrame" width="0" height="0" src="https://website.org/xxxxxx/white.jsp" style="visibility:hidden;"></iframe></body>
</html>
Q: What do I need to do to get to the Frame Source?
A: First switch to the frame you want using switch_to, then read .page_source to get that frame's HTML source.
Note: take a look at the Selenium docs, specifically the section on moving between windows and frames.
Code:
# frameDetail is the frame's id and name, not a tag name, so pass it directly
driver.switch_to.frame("frameDetail")
driver.page_source
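You could try switching to the frame using its ID. In the markup above, iSessionFrame sits inside frameDetail, so switch to frameDetail first:
driver.switch_to.frame("frameDetail")
driver.switch_to.frame("iSessionFrame")
driver.page_source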
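Because the driver stays inside a frame until told otherwise, it can help to wrap the switch so that it is always undone. This is a minimal sketch, assuming a driver object that exposes Selenium's switch_to.frame, switch_to.default_content and page_source; inside_frame and frame_source are hypothetical helper names.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame for the duration of a `with` block, then switch back."""
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        # Return to the top-level document even if the body raises
        driver.switch_to.default_content()


def frame_source(driver, frame_reference):
    """Return the HTML source of the given frame."""
    with inside_frame(driver, frame_reference):
        return driver.page_source
```

For the page in the question, frame_source(driver, "frameDetail") would be the call to try.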
