Fetch page with Scrapy, execute JS and extract variable - python

I have a project using the Python screen-scraping framework Scrapy. I created a spider that loads all <script> tags and processes the second one, because in the test data I gathered, the data I need was in the second <script> tag.
But now I have a problem: some pages contain the data I want in another script tag (#3 or #4). A further obstacle is that the JSON I want is usually on the second line of that script tag, but depending on the page it can also be on the 3rd or 4th line.
Consider this simple HTML file:
<html>
<head>
<title> Test </title>
</head>
<body>
<p>
This is a text
</p>
<script type="text/javascript">
var myJSON = {
a: "a",
b: 42
}
</script>
</body>
</html>
I can access myJSON.b and get 42 if I open this page in my browser (Firefox), open the developer tools, and run console.log(myJSON.b).
So my question is: how can I extract a JavaScript variable or JSON from a Scrapy-fetched page?

I had run into a similar issue before and I solved it by extracting the text in the script tag using something like (based on your sample HTML file):
response.xpath('//script/text()')
After that I used a regular expression to extract the required data in JSON format. So, using the selector above and your sample HTML, something close to:
pattern = r'i-suck-at-regular-expressions'
json_data = response.xpath('//script/text()').re_first(pattern)
Next, you should be able to use the json library to load the data as a Python dictionary like so:
json.loads(json_data)
And it should return something similar to:
{'a': 'a', 'b': 42}
One caveat: json.loads() only accepts valid JSON, and your sample declares the object with unquoted keys (a: "a"), so you will need to quote the keys (or otherwise normalize the literal) before parsing.

Related

Selenium raw page source

I am trying to get the source code of a particular site with the help of Selenium with:
Python code:
driver.page_source
But it returns the source only after the browser has already parsed and re-encoded it.
The raw file:
<html>
<head>
<title>AAAAAAAA</title>
</head>
<body>
</body>
When I press 'View page source' in Chrome, I see the correct raw source, without the re-encoding.
How can this be achieved?
You can try using JavaScript instead of the built-in Python property to get the page source.
javascriptPageSource = driver.execute_script("return document.body.outerHTML;")
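Note that driver.page_source and the JavaScript snippet above both serialize the already-parsed DOM, so neither can return the literal bytes the server sent. One hedged alternative is to re-fetch the URL outside the browser, carrying over the browser's cookies in case the page depends on the session:

import requests

session = requests.Session()
# Copy the browser's cookies so the re-fetch shares its session.
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

raw_html = session.get(driver.current_url).text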

Python error when using open('index.html').read().format()

I'm trying to generate a dynamic HTML email and send it through a Python script using the email package's MIMEMultipart.
I've created some HTML and am trying to pass variables into it from the Python script:
open('emailContent.html').read().format(p1="help")
When I run this with a basic HTML file (without any CSS) it works and the variable is passed in.
However, when the HTML has a <style> tag I get the following error:
KeyError: '\n fill'
The HTML file looks something like this:
<html>
<head>
<style type = text/css>
.chart-text {
fill: #000;
-moz-transform: translateY(0.1em);
-ms-transform: translateY(0.1em);
-webkit-transform: translateY(0.1em);
transform: translateY(0.1em);
}
</style>
</head>
<body>
<p class="chart-text">This is a test please {p1:}
</body>
</html>
I'm guessing Python is unable to parse the HTML style tags, since this works when they are removed. I need the styling, as it formats the email. Any recommendations would be much appreciated.
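The <style> tag is only the trigger, not the cause: str.format() treats every {...} in the string as a replacement field, so the CSS rule body ({ fill: #000; ... }) is parsed as a field named '\n    fill', hence the KeyError. Two standard fixes: double every literal brace in the file ({{ and }}), or switch to string.Template, which only substitutes $-prefixed names and leaves braces alone. A minimal sketch of the latter, assuming the {p1:} placeholder in the HTML is changed to $p1:

from string import Template

# $p1 is substituted; the CSS braces pass through untouched.
html = Template(open('emailContent.html').read()).substitute(p1="help")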

Can't scrape the links to next pages when using xPath selectors, returns empty. (Using Scrapy)

I am using Scrapy and trying to scrape this url. When I request any data about the products on the page, I can get it. But the div with the paginator class and id="paginator1" is returned empty, even though it is a table with references to the next pages. I have tried using XPath selectors for the table as well as CSS selectors, but both return empty.
This is what I tried, using CSS:
In [29]: response.css('span a::attr(href)').extract()
Out[29]:
['/registration/formregistration/new',
'/',
'/catalog/solntsezaschitnye_ochki',
'http://wezom.com.ua/prodvizhenie']
and
In [31]: response.xpath('//*[@id="paginator1"]/table/tbody/tr[1]/td[2]/span')
Out[31]: []
The pagination is generated using JavaScript, as you can see in the HTML:
<div class="paginator" id="paginator1"></div>
<div class="paginator_pages">Страниц: 14</div>
<script type="text/javascript">
/*pag1 = new Paginator("id div", vsego stranic, kol-vo na stranice, tekuchay stranica, "url");*/
pag1 = new Paginator("paginator1", 14, 10, 1, "/catalog/s_o_u_l_/page/", "/catalog/s_o_u_l_");
</script>
You can extract all of the relevant information out of the <script> block:
import ast

# Take the <script> block that sets up the paginator.
script = response.xpath('//script[contains(text(), "paginator1")]/text()').extract()[0].strip()
# The second line holds the call; keep just its argument tuple.
paginator = script.splitlines()[1].strip().split('new Paginator')[1].rstrip(';')
# The call has six arguments: holder id, total pages, pages per span,
# current page, page URL prefix, and base URL.
paginatorHolderId, pagesTotal, pagesSpan, pageCurrent, pageUrl, baseUrl = ast.literal_eval(paginator)
You can then build the pagination URLs according to the logic in the pagination script (or just see what the URLs look like).
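For instance, a hedged sketch using the values unpacked above; the <prefix><page-number> URL pattern is an inference from the pageUrl argument, so verify it against the live site:

# Absolute URLs for pages 2..pagesTotal.
page_urls = [response.urljoin('{}{}'.format(pageUrl, page))
             for page in range(2, pagesTotal + 1)]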
If you take a look at the actual HTML source (response.text), you will see the same snippet quoted above: the div is indeed empty, and is being populated through JavaScript.
You have two options to get those links:
Generate them yourself (should be fairly easy)
Use something to run the JavaScript for you (e.g. a headless browser; see the sketch below)
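For the second option, a minimal sketch using Selenium as the JavaScript runner; the Chrome driver, the example URL, and the assumption that the Paginator widget renders <a> tags inside the div are all mine, not the site's:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com/catalog/s_o_u_l_')

# Once the Paginator script has run, the div holds the real page links.
links = [a.get_attribute('href')
         for a in driver.find_elements(By.CSS_SELECTOR, '#paginator1 a')]
driver.quit()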

Website - Load Data Intensive Content After the Entire Page Loads

I'll get right into it. What I have is a div that takes a while to load because it involves calling APIs and indexing content. The div takes much longer to load than the rest of the page. What I would like to do is load the entire page first and then, once the data is fetched, have the div load on its own, perhaps with a loading animation in place while this happens. I'm just wondering what the best way to accomplish this would be.
I'm not sure if this is relevant for this question but I am using Google App Engine in the Python environment.
Thank you!!
Start fetching the data after the DOM loads, for example from a script at the end of the body:
<html>
<body>
...
<script>
// load data here!
</script>
</body>
</html>
Alternatively you can use the jQuery DOM-ready event to load your data after the DOM elements have loaded:
<html>
<script>
// Requires jQuery to be included first.
$(document).ready(function() {
// DOM is loaded, get data here
});
</script>
<body>
</body>
</html>
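A hedged sketch of the whole pattern, assuming a hypothetical /slow-content handler on your App Engine app that returns the div's HTML once the API calls and indexing finish:

<div id="slow-div">
<img src="spinner.gif" alt="Loading...">
</div>
<script>
$(document).ready(function() {
// Swap the spinner for the real content once the slow endpoint responds.
$.get('/slow-content', function(html) {
$('#slow-div').html(html);
});
});
</script>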

Using urllib and BeautifulSoup to find values inside "hidden" tags

I want to know if it is possible to display the values of hidden tags. I'm using urllib and BeautifulSoup, but I can't seem to get what I want.
The HTML code I'm using is written below (saved as hiddentry.html):
<html>
<head>
<script type="text/javascript">
//change hidden elem value
function changeValue()
{
document.getElementById('hiddenElem').value = 'hello matey!';
}
//this will verify if i have successfully changed the hiddenElem's value
function printHidden()
{
document.getElementById('displayHere').innerHTML = document.getElementById('hiddenElem').value;
}
</script>
</head>
<body>
<div id="hiddenDiv" style="position: absolute; left: -1500px">
<!--i want to find the value of this element right here-->
<span id="hiddenElem"></span>
</div>
<span id="displayHere"></span>
<script type="text/javascript">
changeValue();
printHidden();
</script>
</body>
</html>
What I want to print is the value of the element with id hiddenElem.
To do this I tried the urllib and BeautifulSoup combo. The code I used is:
from BeautifulSoup import BeautifulSoup
import urllib2
import urllib
mysite = urllib.urlopen("http://localhost/hiddentry.html")
soup = BeautifulSoup(mysite)
print soup.prettify()
print '\n\n'
areUthere = soup.find(id="hiddenElem").find(text=True)
print areUthere
What I am getting as output, though, is None.
Any ideas? Is what I am trying to accomplish even possible?
BeautifulSoup parses the HTML exactly as it comes from the server. If you want to see generated values, you need to somehow execute the embedded JavaScript on the page before passing the string to BeautifulSoup; once you have run the JavaScript, you can pass the modified DOM HTML to BeautifulSoup.
As far as browser emulation:
this combo from the creator of jQuery looks interesting
SO question bringing the browser to the server
and SO question headless internet browser
Using browser emulation, you should be able to pull down the base HTML, execute the JavaScript, and then take the modified DOM HTML and feed it into BeautifulSoup.
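One way to do that in practice is to drive a real browser from Python. A minimal sketch using Selenium; webdriver.Chrome() and the localhost URL are assumptions, any driver will do:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://localhost/hiddentry.html')

# changeValue() sets a JavaScript *property* on the <span>. Properties are
# not serialized back into the markup, so even the rendered page source
# will not show it; read it by running JavaScript in the page instead.
value = driver.execute_script(
    "return document.getElementById('hiddenElem').value;")
print(value)  # hello matey!

# The text printHidden() wrote into #displayHere *is* in the DOM, so that
# one can be read from page_source with BeautifulSoup:
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id='displayHere').get_text())  # hello matey!

driver.quit()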
