Find some value in the JavaScript in the response - Python

I have a URL, www.example.com/test.
When I visit this URL with RoboBrowser, I find some JS in the response, and it contains something like this:
var token = _.unescape('<input name="__RequestVerificationToken" type="hidden" value="wi5U8xXijdXRrPR4aG84OAjSLsuS1YqTV4X7VLDnWeuwr72D39H-KXBsyG7eZEZPT7YXW7GF26IiQBrW0vcEZd5Bqrjof_CVEUFRTDPS4rx68Opmi6juZXnGDEtb9nsBXxM4Why2WNlflqFM6purXw2" />');
aw.antiforgeryToken[$(token).attr('name')] = $(token).val();
I want to get 'wi5U8xXijdXRrPR4aG84OAjSLsuS1YqTV4X7VLDnWeuwr72D39H-KXBsyG7eZEZPT7YXW7GF26IiQBrW0vcEZd5Bqrjof_CVEUFRTDPS4rx68Opmi6juZXnGDEtb9nsBXxM4Why2WNlflqFM6purXw2'
I tried this:
browser = RoboBrowser()
browser.open('https://www.example.com/test')
result = browser.find('script', {'name': '__RequestVerificationToken'})
This gives None.
So how can I do this?
thanks

browser.find works on HTML, and since the value you want is inside a JS call, we can't use it here.
So the other options are:
Use a regex (viz. a bit hardcoded, in my opinion): find the parent node that contains the node with the data you want, and then pull out the string, i.e. 'wi5U8xXijdXRrPR4aG84OAjSLsuS1YqTV4X7VLDnWeuwr72D39H-KXBsyG7eZEZPT7YXW7GF26IiQBrW0vcEZd5Bqrjof_CVEUFRTDPS4rx68Opmi6juZXnGDEtb9nsBXxM4Why2WNlflqFM6purXw2', via a regex.
Use lxml.html (XPath): this is the other way, which I may prefer. Either import lxml.html or from lxml import html (one and the same thing).
Here is a rough sketch of it:
data = lxml.html.fromstring(parsedData)
stuff = data.xpath('XPath to your data')
You can find more here: Can I parse xpath using python, selenium and lxml? And have a look in the docs as well.
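For the regex route, here is a minimal sketch. It assumes the hidden-input markup appears literally in the response text, as in the snippet above, and that RoboBrowser exposes the raw response as browser.response:
import re
from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('https://www.example.com/test')

# Pull the value attribute of the hidden __RequestVerificationToken input
# straight out of the raw page text, since it lives inside a script block
# and browser.find() on the parsed HTML can't reach it.
match = re.search(r'name="__RequestVerificationToken"[^>]*value="([^"]+)"',
                  browser.response.text)
if match:
    print(match.group(1))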
I hope I was helpful.
cheers.

Related

Making external links open in a new window in Wagtail

I recently implemented adding target="_blank" to external links like this:
@hooks.register('after_edit_page')
def do_after_page_edit(request, page):
    if hasattr(page, "body"):
        soup = BeautifulSoup(page.body)
        for a in soup.findAll('a'):
            if hasattr(a, "href"):
                a["target"] = "_blank"
        page.body = str(soup)
        page.body = page.body.replace("<html><head></head><body>", "")
        page.body = page.body.replace("</body></html>", "")
        page.body = page.body.replace("></embed>", "/>")
        page.save()

@hooks.register('construct_whitelister_element_rules')
def whitelister_element_rules():
    return {
        'a': attribute_rule({'href': check_url, 'target': True}),
    }
Problems:
Beautiful Soup messes with the output, adding html, head and body tags (see: Don't put html, head and body tags automatically, beautifulsoup).
It also messes with embed tags (see: How to get BeautifulSoup 4 to respect a self-closing tag?).
Hence my crappy "fix" of manually replacing parts of the output with blank strings.
Question:
What is the correct and best way to do this?
Starting with Wagtail v2.5, there is an API to do customisations like this as part of Wagtail’s rich text processing: Rewrite handlers, with the register_rich_text_features hook.
Here is an example of using this new API to make a rewrite handler that sets a target="_blank" attribute to all external links:
from django.utils.html import escape
from wagtail.core import hooks
from wagtail.core.rich_text import LinkHandler

class NewWindowExternalLinkHandler(LinkHandler):
    # This specifies to do this override for external links only.
    # Other identifiers are available for other types of links.
    identifier = 'external'

    @classmethod
    def expand_db_attributes(cls, attrs):
        href = attrs["href"]
        # Let's add the target attr, and also rel="noopener" + noreferrer fallback.
        # See https://github.com/whatwg/html/issues/4078.
        return '<a href="%s" target="_blank" rel="noopener noreferrer">' % escape(href)

@hooks.register('register_rich_text_features')
def register_external_link(features):
    features.register_link_type(NewWindowExternalLinkHandler)
In this example I'm also adding rel="noopener" to fix a known security issue with target="_blank".
Compared to previous solutions to this problem, this new approach is the most reliable: it’s completely server-side and only overrides how links are rendered on the site’s front-end rather than how they are stored, and only relies on documented APIs instead of internal ones / implementation details.
I have been struggling with the same problem and couldn't achieve it using wagtailhooks. My initial solution was to manipulate the content in base.html using a filter. A filter that cuts pieces of code works perfectly when placed in the content block, for example:
{{ self.body|cut:' href="http:' }}
The filter above deletes parts of the content, but unfortunately 'replace' is not available as a filter (I'm using Python 3.x). Therefore my next approach was building a custom filter to get 'replace' as a filter option. Long story short: it partly worked, but only if the content was converted from the original StreamValue datatype to string. This conversion resulted in content with all HTML tags shown, so the replacement did not produce working HTML. I couldn't get the content back to StreamValue again, and no other Python datatype remedied the issue.
Eventually jQuery got the job done for me:
$(document).ready(function(){
    $('a[href^="http://"]').attr('target', '_blank');
});
This code adds target="_blank" to each link whose href starts with 'http://', so all internal links stay in the existing tab. It needs to be placed at the end of your base.html (or similar), and of course you need to load jQuery before you run it.
Got my answer from here.
I don't know if jQuery is the correct and best way to do it, but it works like a charm for me with minimal coding.

How to replace local path with global path into href attributes, scraping in python

I'm trying to scrape some HTML code from this site. When I print all the content, some links (I want only "Table of Contents" and "Printer-friendly Version") have a relative path like "../etc" inside their href.
When I print the scraped code, I need to replace the local (relative) path in each href with the global (absolute) one, so that clicking my scraped links reaches the right webpage. If that operation isn't possible, is there a way to write the right path into the hrefs I need to handle?
#!C:/Python27/python
from lxml import etree
import requests
q = "http://www.dlib.org/dlib/november14/giannakopoulos/11giannakopoulos.html"
page = requests.get(q)
tree = etree.HTML(page.text)
element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
content = etree.tostring(element[0])
print "Content-type: text\n\n"
print content.strip()
In the resulting HTML, just add an xml:base to the root of the document that's the same as the URL of the original page, or any other base URI where you have the same resources.
XPath cannot change the document for you; you will need to do that on the document itself. etree has a property to set the base URI, but I don't know if it will output that when you print the results. The most obvious way to replace this is to use XSLT, which is also supported by lxml.
Setting the base URI will have effect on elements that respect the base URI. Older browsers may not properly respect it, in which case you can use <base>.
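If you'd rather rewrite the hrefs themselves than rely on base URIs, lxml.html can do it directly with make_links_absolute (a different technique than xml:base, but it solves the stated problem). A minimal sketch, reusing the question's URL and XPath:
from lxml import html
import requests

q = "http://www.dlib.org/dlib/november14/giannakopoulos/11giannakopoulos.html"
page = requests.get(q)

# Parse with lxml.html, then rewrite every relative href/src against
# the page's own URL before serializing the fragment we care about.
tree = html.fromstring(page.text)
tree.make_links_absolute(q)

element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
print(html.tostring(element[0], encoding='unicode').strip())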

How to copy all the text from url (like [Ctrl+A][Ctrl+C] with webbrowser) in python?

I know there is an easy way to copy all the source of a URL, but that's not my task. I need to save exactly all the text (just as a web browser user would copy it) to a *.txt file.
Is parsing the HTML source unavoidable, or is there a better way?
I think it is impossible if you don't parse at all. I guess you could use HTMLParser (http://docs.python.org/2/library/htmlparser.html) and just keep the data between the tags, but you will most likely get many more elements than you want.
Getting exactly the same as [Ctrl-C] would be very hard without parsing, because of things like style="display: hidden;", which hides text; handling that would require full parsing of the HTML, JavaScript and CSS of both the document and its resource files.
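A minimal sketch of that HTMLParser route (Python 2, to match the linked docs), collecting the text nodes and nothing else; html is assumed to hold the page source you already fetched:
from HTMLParser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)

parser = TextExtractor()
parser.feed(html)
text = ' '.join(parser.chunks)
Note this still keeps the contents of script and style blocks, which a browser copy would not include.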
Parsing is required. Don't know if there's a library method. A simple regex:
from re import sub
text = sub(r"<[^>]+>", " ", html)
This requires many improvements, but it's a starting point.
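A couple of those improvements, sketched: drop script and style blocks before stripping the remaining tags, then collapse whitespace. Still a rough approximation, not real parsing:
import re

def html_to_text(html):
    # Script/style contents are not visible text, so remove those blocks whole.
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    # Strip the remaining tags, then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()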
In Python, the BeautifulSoup module is great for parsing HTML and well worth a look. To get the text from a webpage, it's just a case of:
#!/usr/bin/env python
#
import urllib2
from bs4 import BeautifulSoup
url = 'http://python.org'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
# you can refine this even further if needed... i.e. soup.body.div.get_text()
text = soup.body.get_text()
print text

How can I parse HTML code with "html written" URL in Python?

I am starting to program in Python, and have been reading a couple of posts which say that I should use an HTML parser to get a URL from text rather than re.
I have the source code, which I got from page.read() with urllib and urlopen.
Now, my problem is that the parser is removing the URL part from the text.
Also, if I read correctly, with var = page.read(), var is stored as a string?
How can I tell it to give me the text between two "tags"? The URL is always between flv= and ;, so it doesn't start with href, which is what the parsers look for, and it doesn't contain http:// either.
I have read many posts, but it seems they all look for href in the code.
Do I have it all completely wrong?
Thank you!
You could consider implementing your own search / grab. In pseudocode, it would look a little like this:
find location of 'flv=' in HTML = location_start
find location of ';' in HTML = location_end
grab everything in between: HTML[location_start : location_end]
You should be able to implement this in Python.
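A minimal translation of that pseudocode, assuming html holds the page source from page.read():
start = html.find('flv=') + len('flv=')
end = html.find(';', start)
url = html[start:end]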
Good luck!

How can I iterate over specific elements in HTML file and replace them?

I need to do a seemingly simple thing in Python which turned out to be quite complex. What I need to do is:
Open an HTML file.
Match all instances of a specific HTML element, for example table.
For each instance, extract the element as a string, pass that string to an external command which will do some modifications, and finally replace the original element with a new string returned from the external command.
I can't simply do a re.sub(), because in each case the replacement string is different and based on the original string.
Any suggestions?
You could use Beautiful Soup to do this.
Although for what you need, something simpler like lxml.etree would work fine.
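For instance, a minimal lxml.etree sketch of that route, assuming transform() is your wrapper around the external command and returns a well-formed fragment:
from lxml import etree

tree = etree.parse('page.html', etree.HTMLParser())
# list() the matches first, since we mutate the tree while walking it.
for table in list(tree.iter('table')):
    new_html = transform(etree.tostring(table))
    table.getparent().replace(table, etree.fromstring(new_html))
tree.write('page_out.html', method='html')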
Sounds like you want BeautifulSoup. Likely, you'd want to do something like:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
tables = soup.find_all('table')
for table in tables:
    contents = str(table.contents)
    new_contents = transform(contents)
    table.replaceWith(new_contents)
Alternatively, you may be looking for something closer to soup.replace_with
EDIT: Updated to the eventual solution.
I have found that parsing HTML via BeautifulSoup or any other such parser gets complex when you need to handle different pages with different structures, which are sometimes not well-formed, use JavaScript manipulation, etc. The best solution in such cases is to directly access the browser DOM and query and modify the nodes. You can easily do that in a headless browser like PhantomJS.
e.g. here is a phantomjs script
var page = require('webpage').create();
page.content = '<html><body><table><tr><td>1</td><td>2</td></tr></table></html>';
page.evaluate(function () {
    var elems = document.getElementsByTagName('td');
    for (var i = 0; i < elems.length; i++) {
        elems[i].innerHTML = '!' + elems[i].innerHTML + '!';
    }
});
console.log(page.content);
phantom.exit();
It changes the text of every td, and the output is:
<html><head></head><body><table><tbody><tr><td>!1!</td><td>!2!</td></tr></tbody></table></body></html>
