Making external links open in a new window in wagtail - python

I recently implemented adding target="_blank" to external links like this:
@hooks.register('after_edit_page')
def do_after_page_edit(request, page):
    if hasattr(page, "body"):
        soup = BeautifulSoup(page.body)
        for a in soup.findAll('a'):
            if hasattr(a, "href"):
                a["target"] = "_blank"
        page.body = str(soup)
        page.body = page.body.replace("<html><head></head><body>", "")
        page.body = page.body.replace("</body></html>", "")
        page.body = page.body.replace("></embed>", "/>")
        page.save()

@hooks.register('construct_whitelister_element_rules')
def whitelister_element_rules():
    return {
        'a': attribute_rule({'href': check_url, 'target': True}),
    }
Problems:
Beautiful soup messes with the output, adding html, head & body tags - Don't put html, head and body tags automatically, beautifulsoup
It also messes with the embed tags - How to get BeautifulSoup 4 to respect a self-closing tag?
Hence my crappy "fix" manually replacing parts of the output with blank strings.
Question:
What is the correct and best way to do this?

Starting with Wagtail v2.5, there is an API to do customisations like this as part of Wagtail’s rich text processing: Rewrite handlers, with the register_rich_text_features hook.
Here is an example of using this new API to make a rewrite handler that sets a target="_blank" attribute to all external links:
from django.utils.html import escape
from wagtail.core import hooks
from wagtail.core.rich_text import LinkHandler

class NewWindowExternalLinkHandler(LinkHandler):
    # This specifies to do this override for external links only.
    # Other identifiers are available for other types of links.
    identifier = 'external'

    @classmethod
    def expand_db_attributes(cls, attrs):
        href = attrs["href"]
        # Let's add the target attr, and also rel="noopener" + noreferrer fallback.
        # See https://github.com/whatwg/html/issues/4078.
        return '<a href="%s" target="_blank" rel="noopener noreferrer">' % escape(href)

@hooks.register('register_rich_text_features')
def register_external_link(features):
    features.register_link_type(NewWindowExternalLinkHandler)
In this example I'm also adding rel="noopener" to fix a known security issue with target="_blank".
Compared to previous solutions to this problem, this new approach is the most reliable: it’s completely server-side and only overrides how links are rendered on the site’s front-end rather than how they are stored, and only relies on documented APIs instead of internal ones / implementation details.

Have been struggling with the same problem and couldn’t achieve it using wagtailhooks. My initial solution was to manipulate the content in base.html, using a filter. The filter to cut pieces of code works perfectly when placed in the content block, example:
{{ self.body|cut:' href="http:' }}
The above filter deletes parts of the content, but unfortunately 'replace' is not available as a filter (I'm using Python 3.x). Therefore my next approach was building a custom filter to provide 'replace' as a filter option. Long story short: it partly worked, but only if the content was converted from the original 'StreamValue' datatype to 'string'. This conversion resulted in content with all HTML tags shown, so the replacement did not result in working HTML. I couldn't get the content back to StreamValue again, and no other Python datatype remedied the issue.
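For reference, the kind of custom 'replace' filter described above looks roughly like this (a sketch only; the module path, filter name and argument format are illustrative, and as noted it still runs into the StreamValue-to-string problem):
from django import template

register = template.Library()

@register.filter
def replace(value, args):
    # Usage: {{ self.body|replace:"old,new" }} - replaces 'old' with 'new'
    old, new = args.split(',')
    return str(value).replace(old, new)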
Eventually jQuery got the job done for me:
$(document).ready(function(){
    $('a[href^="http://"]').attr('target', '_blank');
});
This code adds target="_blank" to each link containing 'http://', so all internal links stay in the existing tab. It needs to be placed at the end of your base.html (or similar), and of course you need to load jQuery before you run it.
Got my answer from here.
Don't know if jQuery is the correct and best way to do it, but it works like a charm for me with minimal coding.


I am trying to insert a variable via tags into a url

Complete beginner here.
I am very appreciative of any support.
When I build the URL from scratch and insert it (the built URL) into the element using .innerHTML, I get a 404 error.
However, when I manually copy and paste the concatenated URL result (from the browser's view-source/inspect page) back into my code, it runs.
My HTML element I use to insert the URL:
<div id="url_string">
</div>
The file I am trying to insert -
'a_pic.jpg'
Approach A (hardcoding) - this runs, but is not what I require:
<div id="url_string">
    <a href="{{url_for('download_file', filename='a_pic.jpg')}}">
        <i class="material-icons" style="font-size:36px">attachment</i>
    </a>
</div>
Approach B (building the URL) - I get an error here.
Here I receive the URL from the server and pass it into JavaScript as data.filename. From console.log, my URL variable (y) is coming through.
The paperclip attachment symbol comes up on the recipient's page (which indicates to me that the concatenated string variable was inserted into the chat page on hearing a socket ping, as planned).
socket.on("sent_file", data => {
var x = "<a href=\"{{url_for(\'download_file\', filename=\'"
var y = `${data.filename}`
var z = "\')}}\"><i class=\"material-icons\" style=\"font-size:36px\">attachment</i></a>"
var entire_url_var = x + y + z
document.querySelector("#url_string").innerHTML = entire_url_var
console.log(y)
console.log(entire_url_var)
})
Here is the concatenated result of the above code (copied from the browser)
<i class="material-icons" style="font-size:36px">attachment</i>
From what I can tell, it is identical to the hardcoded one in approach A.
However, clicking the attachment link that appears when running it, I get a 404 error (The requested URL was not found on the server).
What I have done so far:
I have tried many variations, such as putting the id on the a link rather than the surrounding div (and adjusting the concatenated string accordingly), amongst many others.
I suspect I am missing something obvious.
I have spent many hours on this and read quite a number of similar questions without managing to solve it yet (the other questions seem syntax-related, while I'm not certain whether mine is). I appreciate any support and respect your time.
Thank you.
While you are correct in that you are generating the same link in the two cases, the difference is who is interpreting the link.
In your approach A, it is Jinja2, the template engine from Flask, that handles your HTML. So you give it this:
<i class="material-icons" style="font-size:36px">attachment</i>
And Jinja2 will notice that there is Python code in between the {{ ... }}. So it will execute this code and replace it with the result. So by the time this HTML snippet reaches the browser, you'll have something like this instead:
<i class="material-icons" style="font-size:36px">attachment</i>
(Note that the generated href for your image might be different; you have to look at the HTML source for approach A to see exactly how this URL looks.)
In your approach B you are doing everything in the browser, so your Python server is not available. Here you have to render the URL strictly using JavaScript, you can't rely on helper functions from Flask such as url_for().
So you need to generate the href in the way it looks in the browser. It would be something like this:
var x = "<a href=\"/downloads/"
var y = `${data.filename}`
var z = "\"><i class=\"material-icons\" style=\"font-size:36px\">attachment</i></a>"
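For context, the "/downloads/" prefix used above is an assumption about how the download_file route is registered on the Flask side; a hypothetical sketch of such a route (check your own app and the rendered HTML from approach A for the real prefix):
from flask import Flask, send_from_directory

app = Flask(__name__)

# Hypothetical route: if download_file is registered like this, then
# url_for('download_file', filename='a_pic.jpg') renders to '/downloads/a_pic.jpg',
# which is the prefix hardcoded in the JavaScript above.
@app.route('/downloads/<path:filename>')
def download_file(filename):
    return send_from_directory('uploads', filename)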

Losing data when scraping with Python?

UPDATE (4/10/2018):
So I found that my problem was that the information wasn't available in the source code, which means I have to use Selenium.
UPDATE:
I played around with this problem a bit more. What I did was, instead of running soup, I just took pageH, decoded it into a string and made a text file out of it, and I found that the '{{ optionTitle }}' or '{{priceFormat (showPrice, session.currency)}}' came from the template section stated separately in the HTML file. Which I THINK means that I was just looking in the wrong place. I am still unsure, but that's what I think.
So now I have a new question. After having looked at the text file, I am now realizing that the information necessary is not even in pageH. At the place where it should give me the information I am looking for, it says instead:
<bread-crumbs :location="location" :product-name="product.productName"></bread-crumbs>
<product-info ref="productInfo" :basic="product" :location="location" :prod-info="prodInfo"></product-info>
What does this mean?/Is there a way to get through this to get to the information?
ORIGINAL QUESTION:
I am trying to collect the names/prices of products off a website. I am unsure if the data is being lost because of the HTML parser or because of BeautifulSoup, but what is happening is that once I do get to the position I want to be in, what is returned instead of the specific name/price is '{{ optionTitle }}' or '{{priceFormat (showPrice, session.currency)}}'. After I get the URL using pageH = urllib.request.urlopen(), the code that gives this result is:
pageS = soup(pageH, "html.parser")
pageB = pageS.body
names = pageB.findAll("h4")
optionTitle = names[3].get_text()
optionPrice = names[5].get_text()
Because this didn't work, I tried going about it a different way and looked for more specific tags, but the section of the code that mattered just does not show. It completely disappears. Is there something I can do to get the specific names/prices or is this a security measure that I cannot work through?
The {{}} syntax looks like Angular. Try Requests-HTML to do the rendering (by using render()) and get the content afterwards. Example shown below:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
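If Requests-HTML doesn't fit and you go the Selenium route mentioned in the update, a minimal sketch could look like this (the URL is a placeholder, and it assumes chromedriver is installed and on your PATH):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()                      # assumes chromedriver is on your PATH
driver.get('http://example.com/product-page')    # placeholder URL
# driver.page_source holds the HTML after the page's JavaScript has rendered,
# so the real names/prices appear instead of '{{ optionTitle }}' placeholders.
pageS = BeautifulSoup(driver.page_source, "html.parser")
names = pageS.body.findAll("h4")
driver.quit()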

Extract Text from HTML div using Python and lxml

I'm trying to get python to extract text from one spot of a website. I've identified the HTML div:
<div class="number">76</div>
which is in:
...div/div[1]/div/div[2]
I'm trying to use lxml to extract the '76' from that, but can't get a return out of it other than:
[]
Here's my code:
from lxml import html
import requests

url = 'https://sleepiq.sleepnumber.com/#/##1'
values = {'username': 'my@gmail.com',
          'password': 'mypassword'}

page = requests.get(url, data=values)
tree = html.fromstring(page.content)
hr = tree.xpath('//div[@class="number"]/text()')
print hr
Any suggestions? I feel this should be pretty easy, thanks in advance!
Update: the element I want is not contained in the page.content from requests.get
Updated Update: It looks like this is not logging me in to the page where the content I want is. It is only getting the login screen content.
Have you tried printing your page.content to make sure your requests.get is retrieving the content you want? That is often where things break. And your empty list returned off the xpath search indicates "not found."
Assuming that's okay, your parsing is close. I just tried the following, which is successful:
from lxml import html
tree = html.fromstring('<body><div class="number">76</div></body>')
number = tree.xpath('//div[@class="number"]/text()')[0]
number now equals '76'. Note the [0] indexing, because xpath always returns a list of what's found. You have to dereference to find the content.
A common gotcha here is that the XPath text() function isn't as inclusive or straightforward as it might seem. If there are any sub-elements to the div--e.g. if the text is really <div class="number"><strong>76</strong></div> then text() will return an empty list, because the text belongs to the strong not the div. In real-world HTML--especially HTML that's ever been cut-and-pasted from a word processor, or otherwise edited by humans--such extra elements are entirely common.
While it won't solve all known text management issues, one handy workaround is to use the // multi-level indirection instead of the / single-level indirection to text:
number = ''.join(tree.xpath('//div[@class="number"]//text()'))
Now, regardless of whether there are sub-elements or not, the total text will be concatenated and returned.
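To illustrate the difference on a small made-up snippet:
from lxml import html

tree = html.fromstring('<body><div class="number"><strong>76</strong></div></body>')
tree.xpath('//div[@class="number"]/text()')              # [] - the text belongs to <strong>, not the div
''.join(tree.xpath('//div[@class="number"]//text()'))    # '76' - // descends into sub-elements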
Update: OK, if your problem is logging in, you probably want to try a requests.post (rather than .get) at minimum. In simpler cases, just that change might work. In others, the login needs to be done on a separate page from the page you want to retrieve/scrape. In that case, you probably want to use a session object:
with requests.Session() as session:
    # First POST to the login page
    landing_page = session.post(login_url, data=values)
    # Now make authenticated request within the session
    page = session.get(url)
    # ...use page as above...
This is a bit more complex, but shows the logic for a separate login page. Many sites (e.g. WordPress sites) require this. Post-authentication, they often take you to pages (like the site home page) that aren't interesting content (though they can be scraped to identify whether the login was successful). This altered login workflow doesn't change any of the parsing techniques, which work as above.
Beautiful Soup (http://www.pythonforbeginners.com/beautifulsoup/web-scraping-with-beautifulsoup) will help you out.
Another way:
http://docs.python-guide.org/en/latest/scenarios/scrape/
I'd use plain regex over xml tools in this case. It's easier to handle.
import re
import requests

url = 'http://sleepiq.sleepnumber.com/#/user/-9223372029758346943##2'
values = {'email-email': 'my@gmail.com', 'password-clear': 'Combination',
          'password-password': 'mypassword'}

page = requests.get(url, data=values, timeout=5)

m = re.search(r'(\w*)(<div class="number">)(.*)(<\/div>)', page.content)
# m = re.search(r'(\w*)(<title>)(.*)(<\/title>)', page.content)
if m:
    print(m.group(3))
else:
    print('Not found')

Want to pull a journal title from an RCSB Page using python & BeautifulSoup

I am trying to get specific information about the original citing paper in the Protein Data Bank given only the 4 letter PDBID of the protein.
To do this I am using the Python libraries requests and BeautifulSoup. To try and build the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting command+s and saving the HTML to my desktop).
First things to note:
1) The url for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48
2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.
3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.
4) Searching through the HTML, one finds the journal title located inside a form here:
<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">
    <p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>
    <p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Skjeldal, L.');">Skjeldal, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Gran, L.');">Gran, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Sletten, K.');">Sletten, K.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Volkman, B.F.');">Volkman, B.F.</a></p>
    <p>
        <b>Journal:</b>
        (2002)
        <span class="se_journal">Arch.Biochem.Biophys.</span>
        <span class="se_journal"><b>399: </b>142-148</span>
    </p>
A lot more is in the form but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys", is located within a span tag with class "se_journal".
And so I wrote the following code:
def JournalLookup():
    PDBID = '1K48'

    import requests
    from bs4 import BeautifulSoup

    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % PDBID)
    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")
Ideally I'd be able to use find instead of findAll as these are the only two in the document, but I used findAll to at least verify I'm getting an empty list. I assumed that it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.
After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the requests doc does not include the lines I want at all.
Does anybody know why this is the case, and what I could possibly do to fix it?
Thanks.
The content you are interested in is generated by JavaScript. It's easy to find out: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. It also displays a friendly message:
"This browser is either not Javascript enabled or has it turned off.
This site will not function correctly without Javascript."
For JavaScript-driven pages, you cannot use Python Requests. There are some alternatives, one being dryscrape.
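A rough sketch of that alternative, assuming dryscrape's Session API and its WebKit dependencies are installed (illustrative only, not tested against this site):
import dryscrape
from bs4 import BeautifulSoup

session = dryscrape.Session()
session.visit('http://www.rcsb.org/pdb/explore.do?structureId=1K48')
rendered = session.body()                  # HTML after the page's JavaScript has run
doc = BeautifulSoup(rendered)
Journal = doc.findAll('span', class_="se_journal")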
PS: Do not import libraries/modules within a function. Python does not recommend it, and PEP 8 says that:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
This SO question explains why it's not the recommended way to do it.
The Python package PyPDB can do this task. The repository can be found here, but it is also available on PyPI:
pip install pypdb
For your application, the function describe_pdb takes a four-character PDB ID as an input and returns a dictionary containing the metadata associated with the entry:
my_desc = describe_pdb('4lza')
There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza') or to get (and parse) the entire raw .pdb file using get_pdb_file('4lza', filetype='cif', compression=True).
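A short usage sketch, assuming describe_pdb is importable from the top-level pypdb package as in the version described above (the exact fields present vary by entry):
from pypdb import describe_pdb

my_desc = describe_pdb('4lza')
# The field names below are the ones mentioned above; use .get() since
# not every entry carries all of them.
print(my_desc.get('title'))
print(my_desc.get('citation_authors'))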

How can I emulate ":contains" using BeautifulSoup?

I'm working on a project where I need to do a bit of scraping. The project is on Google App Engine, and we're currently using Python 2.5. Ideally, we would use PyQuery, but due to running on App Engine and Python 2.5, this is not an option.
I've seen questions like this one on finding an HTML tag with certain text, but they don't quite hit the mark.
I have some HTML that looks like this:
<div class="post">
<div class="description">
This post is about Wikipedia.org
</div>
</div>
<!-- More posts of similar format -->
In PyQuery, I could do something like this (as far as I know):
s = pq(html)
s(".post:contains('This post is about Wikipedia.org')")
# returns all posts containing that text
Naively, I had thought that I could do something like this in BeautifulSoup:
soup = BeautifulSoup(html)
soup.findAll(True, "post", text=("This post is about Google.com"))
# []
However, that yielded no results. I changed my query to use a regular expression, and got a bit further, but still no luck:
soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))
# []
It works if I omit Google.com, but then I need to do all the filtering manually. Is there any way to emulate :contains using BeautifulSoup?
Alternatively, is there some PyQuery-like library that works on App Engine (on Python 2.5)?
From the BeautifulSoup docs (emphasis mine):
"text is an argument that lets you search for NavigableString objects
instead of Tags"
That is to say, your code:
soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))
Is not the same as:
regex = re.compile('.*This post is about.*Google.com.*')
[post for post in soup.findAll(True, 'post') if regex.match(post.text)]
The reason you have to remove the Google.com is that there's a NavigableString object in the BeautifulSoup tree for "This post is about", and another one for "Google.com", but they're under different elements.
Incidentally, post.text exists but is not documented, so I wouldn't rely on that either, I wrote that code by accident! Use some other means of smushing together all the text under post.
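For instance, one way of gathering all the text under a post is to join its NavigableStrings yourself (a sketch using the old findAll(text=True) idiom, given the Python 2.5 / BeautifulSoup 3 constraint; html is the markup from the question):
import re
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3 on Python 2.5

soup = BeautifulSoup(html)
regex = re.compile('This post is about.*Wikipedia\.org')

matching_posts = []
for post in soup.findAll(True, 'post'):
    # Concatenate every NavigableString under the post, however deeply nested
    full_text = ''.join(post.findAll(text=True))
    if regex.search(full_text):
        matching_posts.append(post)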
