In my code I'm trying to get the first line of text from a webpage into a variable in Python. At the moment I'm using urlopen to fetch the whole page for each link I want to read. How do I read only the first line of words on the webpage?
My code:
import urllib2
import numpy as np  # needed for np.arange below

line_number = 10
id = np.arange(1, 5)
for n in id:
    link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()
I want to extract the text "Old car" from the following HTML code of the webpage:
<html>
<head>
<link rel="stylesheet">
<style>
.norm { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
.norm:Visited { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
.norm:Hover { font-family: arial; font-size: 8.5pt; color : #000000; text-decoration : underline; }
</style>
</head>
<body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>
Use XPath. It's exactly what we need.
XPath, the XML Path Language, is a query language for selecting nodes from an XML document.
The lxml Python library will help us with this. It's one of many: libxml2, ElementTree, and PyXML are some of the other options for this kind of parsing.
Using XPath
Something like the following, based on your existing code, will work:
import urllib2
import numpy as np  # needed for np.arange below
from lxml import html

line_number = 10
id = np.arange(1, 5)
for n in id:
    link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()
    tree = html.fromstring(l)
    print tree.xpath("//b/text()")[0]
The XPath query //b/text() basically says "get the text from the <b> elements on the page". The tree.xpath function call returns a list, and we select the first element using [0]. Easy.
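One caveat: because tree.xpath returns a plain list, indexing with [0] raises an IndexError on a page that has no <b> elements. A minimal guard, as a sketch rather than part of the original code, might look like this:

results = tree.xpath("//b/text()")
if results:  # the list is empty when the page has no <b> elements
    print results[0]
else:
    print "no <b> text found on this page"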
An aside about Requests
The Requests library is the state-of-the-art when it comes to reading webpages in code. It may save you some headaches later.
The complete program might look like this:
from lxml import html
import requests
for nn in range(1, 6):
    page = requests.get("http://www.cv.edu/id=%d" % nn)
    tree = html.fromstring(page.text)
    print tree.xpath("//b/text()")[0]
Caveats
The URLs didn't work for me, so you might have to tinker a bit. The concept is sound, though.
Reading from the webpages aside, you can use the following to test the XPath:
from lxml import html
tree = html.fromstring("""<html>
<head>
<link rel="stylesheet">
</head>
<body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>""")
print tree.xpath("//b/text()")[0]  # "Old car"
If you are going to do this on many different webpages that might be written differently, you might find that BeautifulSoup is helpful.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
As you can see at the bottom of the Quick Start section, it should be possible for you to extract all the text from the page and then take whichever line you are interested in.
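A rough sketch of that approach, assuming the same page structure as in the question and that the line you want is the first non-empty one:

from bs4 import BeautifulSoup

html_doc = "<html><body><b>Old car</b><br>ID: 02910<br></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# get_text(separator="\n") flattens the document to plain text with one
# line per tag boundary, so splitlines() recovers individual lines
lines = [line.strip() for line in soup.get_text(separator="\n").splitlines() if line.strip()]
print lines[0]  # "Old car"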
Keep in mind that this will only work for HTML text. Some webpages use javascript extensively, and requests/BeautifulSoup will not be able to read content provided by the javascript.
Using Requests and BeautifulSoup - Python returns tag with no text
See also an issue I have had in the past, which was clarified by user avi: Want to pull a journal title from an RCSB Page using python & BeautifulSoup
<iframe id="xyz" src="https://www.XXXXXX.com/" allowfullscreen="yes" style="width: 100%; height: 100%;">
#document
<!DOCTYPE html>
<html>...</html> // a whole new HTML document
</iframe>
I tried the below code, but I am not able to access the inner HTML content. Please guide.
docu = driver.find_element_by_xpath("//*[@id='asdfghg']").find_element_by_tag_name("iframe")
print(docu.get_attribute("innerHTML"))
Not sure if you are after a particular element or need the full source; the lines below may help you:
from selenium import webdriver
driver = webdriver.Chrome(executable_path="C:\\driver\\chromedriver.exe")
driver.get('https://yoururl')
# HTML Source before getting in frame
print(driver.page_source)
# Switch to Frame
driver.switch_to.frame('yourframeID')
# HTML Source after getting in frame
print(driver.page_source)
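If you need to come back out of the frame afterwards, for example to inspect other frames, the standard Selenium calls are:

# Switch back to the top-level document
driver.switch_to.default_content()

# Or, for nested frames, move up just one level
driver.switch_to.parent_frame()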
I'm trying to generate a dynamic HTML email and send it through a Python script using the MIMEMultipart package.
I've created some HTML code and am trying to pass in some variables from the Python script to be used within the HTML:
open('emailContent.html').read().format(p1="help")
When running this with just a basic HTML file (without any CSS) it works and the variable is passed in.
However, when the HTML has a <style> tag I get the following error:
KeyError: '\n fill'
The HTML file looks something like this:
<html>
<head>
<style type="text/css">
.chart-text {
fill: #000;
-moz-transform: translateY(0.1em);
-ms-transform: translateY(0.1em);
-webkit-transform: translateY(0.1em);
transform: translateY(0.1em);
}
</style>
</head>
<body>
<p class="chart-text">This is a test please {p1:}
</body>
</html>
I'm guessing Python is unable to parse the HTML style tags, as when these are removed this works. I need the styling as this will format the email. Any recommendations would be much appreciated.
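For what it's worth, the KeyError here comes from str.format itself, not from HTML parsing: format treats every { and } in the template as a replacement field, so the CSS rule .chart-text { fill: ... is read as a field named '\n    fill'. Doubling the braces escapes them; a minimal sketch of the idea, with the CSS trimmed down:

template = """<style>
.chart-text {{
  fill: #000;
}}
</style>
<p class="chart-text">This is a test please {p1}</p>"""

# Double braces {{ }} are emitted as literal { and } by str.format,
# so only {p1} remains a replacement field
print template.format(p1="help")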
I wrote the following Python code:
from bs4 import BeautifulSoup
import urllib2
url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(),"html.parser")
freq=soup.find('div', attrs={'id':'frequenz'})
print freq
The result is:
<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>
When I look at this site with a web browser, the web page shows dynamic content, not the string 'tempsensor'. The temperature value is automatically refreshed every second, so something in the web page is replacing the string 'tempsensor' with a numerical value automatically.
My problem is now: How can I get Python to show the updated numerical value? How can I obtain the value of the automatic update to tempsensor in BeautifulSoup?
Sorry, no. This is not possible with BeautifulSoup alone.
The problem is that BS4 is not a complete web browser. It is only an HTML parser. It doesn't parse CSS, nor Javascript.
A complete web browser does at least four things:
Connects to web servers, fetches data
Parses HTML content and CSS formatting and presents a web page
Parses Javascript content, runs it.
Provides for user interaction for things like Browser Navigation, HTML Forms and an events API for the Javascript program
Still not sure? Look at your code: BS4 does not even include the first step, fetching the web page; to do that you had to use urllib2.
Dynamic sites usually include Javascript that runs in the browser and periodically updates contents. BS4 doesn't provide that, so you won't see the updates, and never will by using only BS4. Why? Because item (3) above, downloading and executing the Javascript program, is not happening. It would be happening in IE, Firefox, or Chrome, and that's why those browsers show the dynamic content while BS4-only scraping does not.
PhantomJS and CasperJS provide a more mechanized browser that often can run the JavaScript codes enabling dynamic websites. But CasperJS and PhantomJS are programmed in server-side Javascript, not Python.
Apparently, some people are using a browser built into PyQt4 for these kinds of dynamic screen-scraping tasks, isolating part of the DOM and sending that to BS4 for parsing. That might allow for a Python solution.
In comments, @Cyphase suggests that the exact data you want might be available at a different URL, in which case it might be fetched and parsed with urllib2/BS4. This can be determined by careful examination of the Javascript running at the site; in particular, look for setTimeout and setInterval, which schedule updates, or for ajax or jQuery's .load function fetching data from the back end. Javascript that updates dynamic content will usually only fetch data from back-end URLs of the same web site. If the page uses jQuery, $('#frequenz') refers to the div, and by searching for this in the JS you may find the code that updates it. Without jQuery the update would probably use document.getElementById('frequenz').
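If you do find such a back-end URL, the fetch-and-parse step stays simple. A minimal sketch, assuming a purely hypothetical endpoint that returns the sensor value as plain text:

import urllib2

# Hypothetical back-end URL; find the real one by reading the site's Javascript
backend_url = 'http://www.example.com/get_frequenz'
value = urllib2.urlopen(backend_url).read().strip()
print value  # the current reading, with no HTML parsing needed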
You're missing a tiny bit of code:
from bs4 import BeautifulSoup
import urllib2
url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')
freq = soup.find('div', attrs={'id':'frequenz'})
print freq.string # Added .string
This should do it:
freq.text.strip()
As in
>>> html = '<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.text.strip()
u'tempsensor'
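A note on the difference between the two suggestions above: .string returns a tag's single child string and is None when the tag holds more than one child node, while .text (equivalent to get_text()) concatenates all descendant strings. For the div above, which holds only 'tempsensor', both work:

>>> soup.find('div').string  # would be None if the div contained nested tags
u'tempsensor'
>>> soup.find('div').text.strip()
u'tempsensor'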
I am trying to search through all the html of websites that I reach using selenium webdriver. In selenium, when I have an iframe, I must switch to the iframe and then switch back to the main html to search for other iframes.
However, with nested iframes, this can be quite complicated. I must switch to an iframe, search it for iframes, then switch to one iframe found, search IT for iframes, then to go to another iframe I must switch to the main frame, then have my path saved to switch back to where I was before, etc.
Unfortunately, many pages I've found have iframes within iframes within iframes (and so on).
Is there a simple algorithm for this? Or a better way of doing it?
Finding iframes solely by HTML element tag or attributes (including ID) appears to be unreliable.
On the other hand, recursively searching by iframe indexes works relatively fine.
def find_all_iframes(driver):
    iframes = driver.find_elements_by_xpath("//iframe")
    for index, iframe in enumerate(iframes):
        # Your sweet business logic applied to iframe goes here.
        driver.switch_to.frame(index)
        find_all_iframes(driver)
        driver.switch_to.parent_frame()
I was not able to find a website with several layers of nested frames to fully test this concept, but I was able to test it on a site with just one layer of nested frames, so this might require a bit of debugging to deal with deeper nesting. Note that this code switches to frames by index, so it does not depend on the iframes having name attributes.
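Usage is just a matter of loading a page and kicking off the recursion from the top-level document; a minimal sketch, with a placeholder URL:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://yoururl')       # placeholder URL
driver.switch_to.default_content()  # make sure we start at the top-level document
find_all_iframes(driver)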
I believe that using a recursive function along these lines will solve the issue for you, and here's an example data structure to go along with it:
def frame_search(path):
    framedict = {}
    for child_frame in browser.find_elements_by_tag_name('frame'):
        child_frame_name = child_frame.get_attribute('name')
        framedict[child_frame_name] = {'framepath': path, 'children': {}}
        xpath = '//frame[@name="{}"]'.format(child_frame_name)
        browser.switch_to.frame(browser.find_element_by_xpath(xpath))
        framedict[child_frame_name]['children'] = frame_search(framedict[child_frame_name]['framepath'] + [child_frame_name])
        # ... do something involving this child_frame ...
        browser.switch_to.default_content()
        if len(framedict[child_frame_name]['framepath']) > 0:
            for parent in framedict[child_frame_name]['framepath']:
                parent_xpath = '//frame[@name="{}"]'.format(parent)
                browser.switch_to.frame(browser.find_element_by_xpath(parent_xpath))
    return framedict
You'd kick it off by calling: frametree = frame_search([]), and the framedict would end up looking something like this:
frametree =
{'child1': {'framepath': [], 'children': {'child1.1': {'framepath': ['child1'], 'children': {...etc}}}},
 'child2': {'framepath': [], 'children': {'child2.1': {'framepath': ['child2'], 'children': {...etc}}}}}
A note: The reason that I wrote this to use attributes of the frames to identify them, instead of just using the result of the find_elements method, is that I've found in certain scenarios Selenium will throw a stale element reference exception after a page has been open for too long, and those element references are no longer useful. Obviously, the frame's attributes are not going to change, so it's a bit more stable to use the xpath. Hope this helps.
You can nest one iFrame into another iFrame by positioning, then re-positioning, each one at its own area of the screen using absolutely positioned DIV wrappers, remembering always to put the larger iFrame FIRST, then define the position of the SMALLER iFrame SECOND, as in the following FULL example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Daneiella Oddie, Australian Ballet Dancer, dancing to Bach-Gounod's Ave Maria</title>
</head>
<body bgcolor="#ffffcc">
<DIV style="position: absolute; top:0px; left:0px; width:0px; height:0px"></div>
<DIV style="position: absolute; top:10px; left:200px; width:900px; height:500px">
<iframe width="824" height="472" src="http://majordomoers.me/Videos/DanielaOddiDancingToBack_GounodsAveMaria.mp4" frameborder="0" allowfullscreen></iframe>
</div>
<DIV style="position: absolute; top:0px; left:0px; width:0px; height:0px"></div>
<DIV style="position: absolute; top:10px; left:0px; width:50px; height:50px">
<iframe src="http://majordomoers.me/Videos/LauraUllrichSingingBach_GounodsAveMaria.mp4" frameborder="0" allowfullscreen></iframe>
</div>
<DIV style="position: absolute; top:0px; left:0px; width:0px; height:0px"></div>
<DIV style="position: absolute; top:470px; left:10px; width:1050px; height:30px">
<br><font face="Comic Sans MS" size="3" color="red">
<li><b>Both Videos will START automatically...but the one with the audio will precede the dancing by about 17 seconds. You should keep
<li>both videos at the same size as presented here. In all, just lean back and let it all unfold before you, each in its own time.</li></font>
</div>
<br>
</body>
</html>
You can use the code below to get the nested frame hierarchy. Change the getAttribute call according to your DOM structure.
static Stack<String> stackOfFrames = new Stack<>();
....
....
public static void getListOfFrames(WebDriver driver) {
    List<WebElement> iframes = driver.findElements(By.xpath("//iframe|//frame"));
    int numOfFrames = iframes.size();
    for (int i = 0; i < numOfFrames; i++) {
        stackOfFrames.push(iframes.get(i).getAttribute("id"));
        System.out.println("Current Stack => " + stackOfFrames);
        driver.switchTo().frame(i);
        getListOfFrames(driver);
        driver.switchTo().parentFrame();
        stackOfFrames.pop();
    }
}
I want to know if it is possible to display the values of hidden tags. I'm using urllib and BeautifulSoup but I can't seem to get what I want.
The HTML code I'm using is written below (saved as hiddentry.html):
<html>
<head>
<script type="text/javascript">
//change hidden elem value
function changeValue()
{
document.getElementById('hiddenElem').value = 'hello matey!';
}
//this will verify if i have successfully changed the hiddenElem's value
function printHidden()
{
document.getElementById('displayHere').innerHTML = document.getElementById('hiddenElem').value;
}
</script>
</head>
<body>
<div id="hiddenDiv" style="position: absolute; left: -1500px">
<!--i want to find the value of this element right here-->
<span id="hiddenElem"></span>
</div>
<span id="displayHere"></span>
<script type="text/javascript">
changeValue();
printHidden();
</script>
</body>
</html>
What I want to print is the value of the element with id hiddenElem.
To do this I tried using the urllib and BeautifulSoup combo. The code I used is:
from BeautifulSoup import BeautifulSoup
import urllib2
import urllib
mysite = urllib.urlopen("http://localhost/hiddentry.html")
soup = BeautifulSoup(mysite)
print soup.prettify()
print '\n\n'
areUthere = soup.find(id="hiddenElem").find(text=True)
print areUthere
What I am getting as output, though, is None.
Any ideas? Is what I am trying to accomplish even possible?
BeautifulSoup parses the HTML that it gets from the server. If you want to see generated values, you need to somehow execute the embedded Javascript on the page before passing the string to BeautifulSoup. Once you run the Javascript, you'll pass the modified DOM HTML to BeautifulSoup.
As far as browser emulation:
this combo from the creator of jQuery looks interesting
SO question bringing the browser to the server
and SO question headless internet browser
Using browser emulation, you should be able to pull down the base HTML, execute the Javascript, and then take the modified DOM HTML and feed it into BeautifulSoup.
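A minimal sketch of that pipeline, using Selenium as the emulated browser (one option among several; the links above suggest headless alternatives) and pointed at the local test file from the question:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://localhost/hiddentry.html")

# page_source reflects the DOM after the inline scripts have run,
# so hiddenElem should now contain the generated value
soup = BeautifulSoup(driver.page_source, "html.parser")
print soup.find(id="hiddenElem").find(text=True)  # expected: "hello matey!"
driver.quit()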