I'm trying to generate a dynamic HTML email and send it through a Python script using the MIMEMultipart class.
I've created some HTML and I'm trying to pass some variables into it from the Python script:
open('emailContent.html').read().format(p1="help")
When running this with a basic HTML file (without any CSS) it works and the variable is passed in.
However, when the HTML has a <style> tag, I get the following error:
KeyError: '\n fill'
The HTML file looks something like this:
<html>
  <head>
    <style type="text/css">
      .chart-text {
        fill: #000;
        -moz-transform: translateY(0.1em);
        -ms-transform: translateY(0.1em);
        -webkit-transform: translateY(0.1em);
        transform: translateY(0.1em);
      }
    </style>
  </head>
  <body>
    <p class="chart-text">This is a test please {p1}</p>
  </body>
</html>
I'm guessing Python is tripping over the braces in the CSS, since removing the <style> block makes this work. I need the styling, as it formats the email. Any recommendations would be much appreciated.
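For context on the error: str.format treats every brace pair in the string as a replacement field, so the first CSS rule (.chart-text { fill: ... }) is read as a placeholder named '\n        fill', hence the KeyError. A minimal sketch of one common workaround (my suggestion, not from the original post) is string.Template, whose $-style placeholders leave CSS braces alone; here the HTML would contain $p1 instead of {p1}:

from string import Template

# emailContent.html now contains $p1 instead of {p1};
# Template only reacts to $-placeholders, so CSS braces are safe
with open('emailContent.html') as f:
    html = Template(f.read()).substitute(p1="help")

Alternatively, you can keep str.format and double every literal brace in the CSS ({{ and }}), but that gets tedious as the stylesheet grows.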
I have a project using the Python screen-scraping framework Scrapy. I created a spider that loads all <script> tags and processes the second one, because in the test data I gathered, the data I need was in the second <script> tag.
But now I have a problem: some pages contain the data I want in other script tags (#3 or #4). A further obstacle is that the JSON I want is usually on the second line of the second script tag, but depending on the page it could also be on the 3rd or 4th line.
Consider this simple HTML file:
<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <p>
      This is a text
    </p>
    <script type="text/javascript">
      var myJSON = {
        a: "a",
        b: 42
      }
    </script>
  </body>
</html>
I can access myJSON.b and get 42 if I open this page in my browser (Firefox), go to the developer tools, and run console.log(myJSON.b).
So my question is: how can I extract a JavaScript variable or JSON from a Scrapy-fetched page?
I had run into a similar issue before and I solved it by extracting the text in the script tag using something like (based on your sample HTML file):
response.xpath('//script/text()')
After that I used a regular expression to extract the required data in JSON format. So, using the selector above and your sample HTML, something close to:
pattern = r'i-suck-at-regular-expressions'
json_data = response.xpath('//script/text()').re_first(pattern)
Next, you should be able to use the json library to load the data as a python dictionary like so:
json.loads(json_data)
And it should return something similar to:
{"a": "a", "b": 42}
I need a URL to be opened specifically in the IE browser.
I know the following code is wrong, but I don't know what else to try.
How can I achieve that through Python?
template
<a href="{% url 'ie' %}">Open in IE</a>
urls.py
url(r'^open-ie$', views.open_in_ie, name='ie'),
views.py
import webbrowser

def open_in_ie(request):
    ie = webbrowser.get('iexplore')
    return ie.open('https://some-link.com')
Again, I know this is wrong and that it tries to open the IE browser at the server level. Any advice? Thank you!
Short answer: You can't.
Long answer: If the user is viewing your website in IE, you can open links in other browsers. But if the user is using any other browser (Firefox, Chrome, etc.), all links will open in the same browser; you can't access other browsers. So in your case the answer is no, because you are trying to open IE from some other browser.
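Note that webbrowser in the question's view runs on the server, not on the visitor's machine. As a sketch of the closest server-side equivalent (my suggestion, not part of the original answer), a plain redirect opens the link in whatever browser the visitor is already using:

from django.shortcuts import redirect

def open_link(request):
    # This opens in the visitor's current browser;
    # the server cannot choose which browser that is
    return redirect('https://some-link.com')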
Here is the code to open another browser from IE if you are interested:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
  <head>
    <title>HTA Test</title>
    <hta:application applicationname="HTA Test" scroll="yes" singleinstance="yes">
    <script type="text/javascript">
      function openURL()
      {
        var shell = new ActiveXObject("WScript.Shell");
        shell.run("http://www.google.com");
      }
    </script>
  </head>
  <body>
    <input type="button" onclick="openURL()" value="Open Google">
  </body>
</html>
Code from here
I'm trying to grab the content of the following url:
https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html
My goal is to grab the content (source code) of the webpage as seen by the visitor, i.e. after all the JavaScript has rendered.
To do so I used the example mentioned here: http://techstonia.com/scraping-with-phantomjs-and-python.html
That example works on my server, but the challenge is to also have it work for Polymer-based SPA sites like the one mentioned, which are fully JavaScript-rendered websites.
My code looks like:
import platform
from bs4 import BeautifulSoup
from selenium import webdriver

# PhantomJS executables have different extensions
# under different operating systems
if platform.system() == 'Windows':
    PHANTOMJS_PATH = './phantomjs.exe'
else:
    PHANTOMJS_PATH = './phantomjs'

# Here we use the headless browser PhantomJS, but it can be
# replaced with browser = webdriver.Firefox(), which is good for debugging.
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')
print (browser.page_source)
The issue is that it delivers the following result:
<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<meta content="width=device-width, minimum-scale=1.0, initial-scale=1.0, user-scalable=yes" name="viewport">
<title>Single page app using Polymer</title>
<script async="" src="//www.google-analytics.com/analytics.js"></script><script src="/webcomponents.min.js"></script>
<!-- vulcanized version of imported elements --
see "elements.html" for unvulcanized list of imports. -->
<link href="vulcanized.html" rel="import">
<link href="styles.css" rel="stylesheet" shim-shadowdom="">
</link></link></meta></meta></head>
<body fullbleed="" unresolved="">
<template id="t" is="auto-binding">
<!-- Route controller. -->
<flatiron-director autohash="" route="{{route}}"></flatiron-director>
<!-- Keyboard nav controller. -->
<core-a11y-keys id="keys" keys="up down left right space space+shift" on-keys-pressed="{{keyHandler}}" target="{{parentElement}}"></core-a11y-keys>
<core-scaffold id="scaffold">
<nav>
<core-toolbar>
<span>Single Page Polymer</span>
</core-toolbar>
<core-menu on-core-select="{{menuItemSelected}}" selected="{{route}}" selectedmodel="{{selectedPage}}" valueattr="hash">
<template repeat="{{page, i in pages}}">
<paper-item hash="{{page.hash}}" noink="">
<core-icon icon="label{{route != page.hash ? '-outline' : ''}}"></core-icon>
{{page.name}}
</paper-item>
</template>
</core-menu>
</nav>
<core-toolbar flex="" tool="">
<div flex="">{{selectedPage.page.name}}</div>
<core-icon-button icon="refresh"></core-icon-button>
<core-icon-button icon="add"></core-icon-button>
</core-toolbar>
<div center-center="" fit="" horizontal="" layout="">
<core-animated-pages id="pages" on-tap="{{cyclePages}}" selected="{{route}}" transitions="slide-from-right" valueattr="hash">
<template repeat="{{page, i in pages}}">
<section center-center="" hash="{{page.hash}}" layout="" vertical="">
<div>{{page.name}}</div>
</section>
</template>
</core-animated-pages>
</div>
</core-scaffold>
</template>
<script src="app.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-43475701-2', 'auto'); // ebidel's
ga('create', 'UA-39334307-1', 'auto'); // pp.org
ga('send', 'pageview');
</script>
</body></html>
As you can see, this is far from the real result you get when looking at the page in your browser.
The questions I have: what am I doing wrong and, if possible, where should I look for the solution?
I think you are missing something from the Selenium WebDriver docs.
You can get the content of a dynamic page, but you have to make sure that the element you are searching for is present and visible on the page:
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')

# Get the content of the first slide
res1 = browser.find_element_by_xpath('//*[@id="pages"]/section[1]/div')

# Save a screenshot so you can see why it is failing (if it is)
browser.save_screenshot('screen_test.png')

# Print the text within the div
print (res1.text)
If you also need the text of the other slides, you need to click (using the webdriver) wherever makes the second slide visible before getting the text from it.
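Because the page builds its DOM with JavaScript, the element may not exist yet at the moment find_element_by_xpath runs. Here is a sketch using Selenium's explicit-wait API (my addition; the 10-second timeout is arbitrary) that blocks until the slide is visible:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS()
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')

# Wait up to 10 seconds for the first slide's div to become visible
res1 = WebDriverWait(browser, 10).until(
    EC.visibility_of_element_located((By.XPATH, '//*[@id="pages"]/section[1]/div'))
)
print (res1.text)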
In my code I'm trying to get the first line of text from a webpage into a variable in Python. At the moment I'm using urlopen to get the whole page for each link I want to read. How do I read only the first line of words on the webpage?
My code:
import urllib2
import numpy as np

line_number = 10
id = (np.arange(1,5))

for n in id:
    link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()
I want to extract the text "Old car" from the following HTML code of the webpage:
<html>
  <head>
    <link rel="stylesheet">
    <style>
      .norm { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration: none; }
      .norm:Visited { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration: none; }
      .norm:Hover { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration: underline; }
    </style>
  </head>
  <body>
    <b>Old car</b><br>
    <sup>13</sup>CO <font color="red">v = 0</font><br>
    ID: 02910<br>
    <p>
    <p><b>CDS</b></p>
Use XPath. It's exactly what we need.
XPath, the XML Path Language, is a query language for selecting nodes from an XML document.
The lxml Python library will help us with this. It's one of many: libxml2, ElementTree, and PyXML are some of the other options. There are many, many libraries to do this type of thing.
Using XPath
Something like the following, based on your existing code, will work:
import urllib2
import numpy as np
from lxml import html

line_number = 10
id = (np.arange(1,5))

for n in id:
    link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()
    tree = html.fromstring(l)
    print tree.xpath("//b/text()")[0]
The XPath query //b/text() basically says "get the text from the <b> elements on a page." The tree.xpath function call returns a list, and we select the first one using [0]. Easy.
An aside about Requests
The Requests library is the state-of-the-art when it comes to reading webpages in code. It may save you some headaches later.
The complete program might look like this:
from lxml import html
import requests
for nn in range(1, 6):
    page = requests.get("http://www.cv.edu/id=%d" % nn)
    tree = html.fromstring(page.text)
    print tree.xpath("//b/text()")[0]
Caveats
The urls didn't work for me, so you might have to tinker a bit. The concept is sound, though.
Reading from the webpages aside, you can use the following to test the XPath:
from lxml import html
tree = html.fromstring("""<html>
<head>
<link rel="stylesheet">
</head>
<body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>""")
print tree.xpath("//b/text()")[0]  # "Old car"
If you are going to do this on many different webpages that might be written differently, you might find that BeautifulSoup is helpful.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
As you can see at the bottom of the Quick Start, it should be possible to extract all the text from the page and then take whatever line you are interested in.
Keep in mind that this will only work for HTML text. Some webpages use JavaScript extensively, and requests/BeautifulSoup will not be able to read content provided by the JavaScript.
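As a rough sketch of that approach (my own example, reusing the sample HTML from the question), you can either target the tag directly or flatten the page to text and pick a line:

from bs4 import BeautifulSoup

page = "<html><body><b>Old car</b><br>ID: 02910<br></body></html>"
soup = BeautifulSoup(page, "html.parser")

# Target the tag directly rather than counting lines
print soup.find("b").get_text()  # Old car

# Or flatten everything to text and take the first line
print soup.get_text("\n").splitlines()[0]  # Old car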
Using Requests and BeautifulSoup - Python returns tag with no text
See also an issue I have had in the past, which was clarified by user avi: Want to pull a journal title from an RCSB Page using python & BeautifulSoup
I want to know if it is possible to display the values of hidden tags. I'm using urllib and BeautifulSoup, but I can't seem to get what I want.
The HTML code I'm using is below (saved as hiddentry.html):
<html>
  <head>
    <script type="text/javascript">
      // change the hidden element's value
      function changeValue()
      {
        document.getElementById('hiddenElem').value = 'hello matey!';
      }

      // this will verify that I have successfully changed hiddenElem's value
      function printHidden()
      {
        document.getElementById('displayHere').innerHTML = document.getElementById('hiddenElem').value;
      }
    </script>
  </head>
  <body>
    <div id="hiddenDiv" style="position: absolute; left: -1500px">
      <!-- I want to find the value of this element right here -->
      <span id="hiddenElem"></span>
    </div>
    <span id="displayHere"></span>
    <script type="text/javascript">
      changeValue();
      printHidden();
    </script>
  </body>
</html>
What I want to print is the value of the element with id hiddenElem.
To do this I tried using the urllib and BeautifulSoup combo. The code I used is:
from BeautifulSoup import BeautifulSoup
import urllib2
import urllib
mysite = urllib.urlopen("http://localhost/hiddentry.html")
soup = BeautifulSoup(mysite)
print soup.prettify()
print '\n\n'
areUthere = soup.find(id="hiddenElem").find(text=True)
print areUthere
What I am getting as output, though, is None.
Any ideas? Is what I am trying to accomplish even possible?
BeautifulSoup parses the HTML it gets from the server. If you want to see generated values, you need to somehow execute the embedded JavaScript on the page before passing the string to BeautifulSoup. Once you run the JavaScript, you'll pass the modified DOM HTML to BeautifulSoup.
As far as browser emulation:
this combo from the creator of jQuery looks interesting
SO question bringing the browser to the server
and SO question headless internet browser
Using browser emulation, you should be able to pull down the base HTML, run the emulation to execute the JavaScript, and then take the modified DOM HTML and jam it into BeautifulSoup.
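A minimal sketch of that pipeline, assuming hiddentry.html is still served at the localhost URL from the question and using Selenium as the emulator (my choice; any of the options above would do). Note that hiddenElem's value is a JavaScript property and never lands in the markup, but the copy that printHidden() writes into displayHere does:

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://localhost/hiddentry.html")

# page_source reflects the DOM after the inline scripts have run
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()

print soup.find(id="displayHere").get_text()  # hello matey!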