I'm working on a project where I need to do a bit of scraping. The project is on Google App Engine, and we're currently using Python 2.5. Ideally, we would use PyQuery, but because we're running on App Engine with Python 2.5, that is not an option.
I've seen questions like this one on finding an HTML tag with certain text, but they don't quite hit the mark.
I have some HTML that looks like this:
<div class="post">
<div class="description">
This post is about <a href="http://google.com">Google.com</a>
</div>
</div>
<!-- More posts of similar format -->
In PyQuery, I could do something like this (as far as I know):
s = pq(html)
s(".post:contains('This post is about Wikipedia.org')")
# returns all posts containing that text
Naively, I had thought that I could do something like this in BeautifulSoup:
soup = BeautifulSoup(html)
soup.findAll(True, "post", text=("This post is about Google.com"))
# []
However, that yielded no results. I changed my query to use a regular expression, and got a bit further, but still no luck:
soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))
# []
It works if I omit Google.com, but then I need to do all the filtering manually. Is there any way to emulate :contains using BeautifulSoup?
Alternatively, is there some PyQuery-like library that works on App Engine (on Python 2.5)?
From the BeautifulSoup docs (emphasis mine):
"text is an argument that lets you search for NavigableString objects
instead of Tags"
That is to say, your code:
soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))
Is not the same as:
regex = re.compile('.*This post is about.*Google.com.*')
[post for post in soup.findAll(True, 'post') if regex.match(post.text)]
The reason you have to remove the Google.com is that there's a NavigableString object in the BeautifulSoup tree for "This post is about", and another one for "Google.com", but they're under different elements.
Incidentally, post.text exists but is not documented, so I wouldn't rely on it either; I wrote that code by accident! Use some other means of smushing together all the text under post.
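As a rough sketch (assuming BeautifulSoup 3, which runs on Python 2.5), you could join every NavigableString under each post before matching:
import re
from BeautifulSoup import BeautifulSoup

html = '''<div class="post">
<div class="description">
This post is about <a href="http://google.com">Google.com</a>
</div>
</div>'''

soup = BeautifulSoup(html)
regex = re.compile(r'This post is about.*Google\.com')

# findAll(text=True) collects every NavigableString under the tag,
# so text split across child elements is matched as one string
posts = [post for post in soup.findAll('div', 'post')
         if regex.search(''.join(post.findAll(text=True)))]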
Complete beginner here.
I am very appreciative of any support.
When I build the URL from scratch and insert it (the built URL) into the page using .innerHTML, I get a 404 error.
However, when I manually copy and paste the concatenated URL result (from the browser's view-source/inspect page) back into my code, it runs.
The HTML element I use to insert the URL:
<div id="url_string">
</div>
The file I am trying to insert -
'a_pic.jpg'
Approach A (hardcoding) - this runs, though it's not what I require.
<div id="url_string">
<a href="{{url_for('download_file', filename='a_pic.jpg')}}">
<i class="material-icons" style="font-size:36px">attachment</i>
</a>
</div>
Approach B (building the URL) - I get an error here.
Here I receive the URL from the server and pass it into JavaScript as data.filename. From console.log, my URL variable (y) is coming through.
The link's paperclip file symbol comes up on the recipient's page (which indicates to me that the concatenated string variable was inserted into the chat page on hearing a socket ping, as planned).
socket.on("sent_file", data => {
var x = "<a href=\"{{url_for(\'download_file\', filename=\'"
var y = `${data.filename}`
var z = "\')}}\"><i class=\"material-icons\" style=\"font-size:36px\">attachment</i></a>"
var entire_url_var = x + y + z
document.querySelector("#url_string").innerHTML = entire_url_var
console.log(y)
console.log(entire_url_var)
})
Here is the concatenated result of the above code (copied from the browser)
<i class="material-icons" style="font-size:36px">attachment</i>
From what I can tell, it is identical to the hardcoded one in approach A.
However, clicking the attachment link which appears on running it, I get a 404 error (The requested URL was not found on the server).
What I have done so far:
I have tried many variations, such as putting the id on the a link rather than the surrounding div (and adjusting the concatenated string accordingly), amongst many others.
I suspect I am missing something obvious.
I have spent many hours on this and read quite a number of similar questions without managing to solve it just yet (the other questions seem syntax-related, while I'm not certain whether mine is). I appreciate any support and respect your time.
Thank you.
While you are correct that you are generating the same link in the two cases, the difference is who interprets the link.
In your approach A, it is Jinja2, the template engine from Flask, that handles your HTML. So you give it this:
<i class="material-icons" style="font-size:36px">attachment</i>
And Jinja2 will notice that there is Python code between the {{ ... }}, so it will execute this code and replace it with the result. By the time this HTML snippet reaches the browser, you'll have something like this instead:
<i class="material-icons" style="font-size:36px">attachment</i>
(Note that the generated href for your image might be different; you have to look at the HTML source for approach A to see exactly how this URL looks.)
In your approach B you are doing everything in the browser, so your Python server is not available. Here you have to render the URL strictly in JavaScript; you can't rely on helper functions from Flask such as url_for().
So you need to generate the href in the way it looks in the browser. It would be something like this:
var x = "<a href=\"/downloads/"
var y = `${data.filename}`
var z = "\"><i class=\"material-icons\" style=\"font-size:36px\">attachment</i></a>"
I recently implemented adding target="_blank" to external links like this:
@hooks.register('after_edit_page')
def do_after_page_edit(request, page):
    if hasattr(page, "body"):
        soup = BeautifulSoup(page.body)
        for a in soup.findAll('a'):
            if hasattr(a, "href"):
                a["target"] = "_blank"
        page.body = str(soup)
        page.body = page.body.replace("<html><head></head><body>", "")
        page.body = page.body.replace("</body></html>", "")
        page.body = page.body.replace("></embed>", "/>")
        page.save()
@hooks.register('construct_whitelister_element_rules')
def whitelister_element_rules():
    return {
        'a': attribute_rule({'href': check_url, 'target': True}),
    }
Problems:
Beautiful Soup messes with the output, adding html, head & body tags (see: Don't put html, head and body tags automatically, beautifulsoup).
It also messes with the embed tags (see: How to get BeautifulSoup 4 to respect a self-closing tag?).
Hence my crappy "fix" manually replacing parts of the output with blank strings.
Question:
What is the correct and best way to do this?
Starting with Wagtail v2.5, there is an API to do customisations like this as part of Wagtail’s rich text processing: Rewrite handlers, with the register_rich_text_features hook.
Here is an example of using this new API to make a rewrite handler that sets a target="_blank" attribute to all external links:
from django.utils.html import escape
from wagtail.core import hooks
from wagtail.core.rich_text import LinkHandler
class NewWindowExternalLinkHandler(LinkHandler):
    # This specifies to do this override for external links only.
    # Other identifiers are available for other types of links.
    identifier = 'external'

    @classmethod
    def expand_db_attributes(cls, attrs):
        href = attrs["href"]
        # Let's add the target attr, and also rel="noopener" + noreferrer fallback.
        # See https://github.com/whatwg/html/issues/4078.
        return '<a href="%s" target="_blank" rel="noopener noreferrer">' % escape(href)


@hooks.register('register_rich_text_features')
def register_external_link(features):
    features.register_link_type(NewWindowExternalLinkHandler)
In this example I'm also adding rel="noopener" to fix a known security issue with target="_blank".
Compared to previous solutions to this problem, this new approach is the most reliable: it's completely server-side, it only overrides how links are rendered on the site's front-end rather than how they are stored, and it relies only on documented APIs rather than internals / implementation details.
Have been struggling with the same problem and couldn't achieve it using wagtail hooks. My initial solution was to manipulate the content in base.html using a filter. The filter to cut pieces of code works perfectly when placed in the content block, for example:
{{ self.body|cut:' href="http:' }}
The above filter deletes parts of the content, but unfortunately 'replace' is not available as a filter (I'm using Python 3.x). Therefore my next approach was building a custom filter to make 'replace' available as a filter option. Long story short: it partly worked, but only if the content was converted from the original 'StreamValue' datatype to 'string'. This conversion resulted in content with all HTML tags shown, so the replacement did not produce working HTML. I couldn't get the content back to StreamValue again, and no other Python datatype remedied the issue.
Eventually jQuery got the job done for me:
$(document).ready(function(){
    $('a[href^="http://"]').attr('target', '_blank');
});
This code adds target="_blank" to each link whose href starts with 'http://', so all internal (relative) links stay in the existing tab. It needs to be placed at the end of your base.html (or similar), and of course you need to load jQuery before you run it.
Got my answer from here.
Don't know if jQuery is the correct and best way to do it, but it works like a charm for me with minimal coding.
I am trying to get specific information about the original citing paper in the Protein Data Bank given only the 4 letter PDBID of the protein.
To do this I am using the Python libraries requests and BeautifulSoup. To build the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting command+s and saving the HTML to my desktop).
First things to note:
1) The url for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48
2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.
3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.
4) Searching through the HTML, one finds the journal title located inside a form here:
<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">
<p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>
<p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Skjeldal, L.');">Skjeldal, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Gran, L.');">Gran, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Sletten, K.');">Sletten, K.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Volkman, B.F.');">Volkman, B.F.</a></p>
<p>
<b>Journal:</b>
(2002)
<span class="se_journal">Arch.Biochem.Biophys.</span>
<span class="se_journal"><b>399: </b>142-148</span>
</p>
A lot more is in the form, but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys.", is located within a span tag with class "se_journal".
And so I wrote the following code:
def JournalLookup():
    PDBID = '1K48'
    import requests
    from bs4 import BeautifulSoup
    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % PDBID)
    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")
Ideally I'd be able to use find instead of findAll, as these are the only two such spans in the document, but I used findAll to at least verify that I was getting an empty list. I assumed it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.
After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the document returned by requests does not include the lines I want at all.
Does anybody know why this is the case, and what I could possibly do to fix it?
Thanks.
The content you are interested in is rendered by JavaScript. It's easy to verify: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:
"This browser is either not Javascript enabled or has it turned off.
This site will not function correctly without Javascript."
For JavaScript-driven pages, you cannot use Python Requests. There are some alternatives, one being dryscrape.
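As an untested sketch (assuming dryscrape and its webkit-server dependency are installed):
import dryscrape
from bs4 import BeautifulSoup

# dryscrape drives a headless WebKit, so the page's JavaScript runs
session = dryscrape.Session()
session.visit('http://www.rcsb.org/pdb/explore.do?structureId=1K48')

# session.body() returns the HTML after scripts have executed,
# so the client-side rendered spans should now be present
soup = BeautifulSoup(session.body())
journal = soup.findAll('span', class_="se_journal")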
PS: Do not import libraries/modules within a function. Python does not recommend it, and PEP 8 says:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
This SO question explains why it's not the recommended way to do it.
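Applied to your function, that advice would look something like this (a sketch; the scraping behavior is otherwise unchanged):
# Imports at the top of the module, per PEP 8
import requests
from bs4 import BeautifulSoup

def journal_lookup(pdbid='1K48'):
    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % pdbid)
    doc = BeautifulSoup(req.content)
    return doc.findAll('span', class_="se_journal")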
The Python package PyPDB can do this task. The repository can be found here, but it is also available on PyPI:
pip install pypdb
For your application, the function describe_pdb takes a four-character PDB ID as an input and returns a dictionary containing the metadata associated with the entry:
from pypdb import describe_pdb

my_desc = describe_pdb('4lza')
There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza') or to get (and parse) the entire raw .pdb file using get_pdb_file('4lza', filetype='cif', compression=True).
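A short sketch of those fallbacks, using only the functions named above (the exact fields available vary by entry, so check the returned data):
from pypdb import get_all_info, get_pdb_file

# Broader metadata record, for entries where describe_pdb lacks the journal
all_info = get_all_info('4lza')

# Or fetch the raw structure file and parse the citation records yourself
raw = get_pdb_file('4lza', filetype='cif', compression=True)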
I have an HTML page that displays a few values. I also have a little app that displays data from some other pages I have, but these other pages are JSON, not HTML.
I want to consume these values from the HTML page, convert to JSON, then output.
The reason I want to do this is so that I can simply reuse my code, and just change the URL, or even dynamically create it.
I made the HTML page as plain as possible, so as to strip out all the junk in order to make the regex more basic.
Here is the HTML:
<div class="BlockA">
<h4>BlockA</h4>
<div class="name">John Smith</div>
<div class="number">2</div>
<div class="name">Paul Peterson</div>
<div class="number">14</div>
</div>
<div class="BlockB">
<h4>BlockB</h4>
<div class="name">Steve Jones</div>
<div class="number">5</div>
</div>
Both blocks will have varying numbers of elements, depending on a few factors.
Here is my python:
def index(request, toGet="xyz"):
    file = urllib2.urlopen("http://www.mysite.com/mypage?data=" + toGet)
    data = file.read()
    dom = parseString(data)
    rows = dom.getElementsByTagName("BlockA")[0]
    readIn = ""
    for row in rows:
        readIn = readIn + json.dumps(
            {'name': row.getAttribute("location"),
             'number': row.getAttribute("number")},
            sort_keys=True,
            indent=4) + ","
    response_generator = "[" + readIn[:-1] + "]"
    return HttpResponse(response_generator)
So this is basically reading the values (actually, the source is XML in this case), looping through them, and outputting all the values.
If someone can point me in the right direction, it would be much appreciated. For example, reading in the tags like "BlockA" and then the tags "name" and "number".
Thanks.
If you truly need to parse an HTML page in Python, you should be using Beautiful Soup. I question whether you really should be doing this, though. Are the HTML pages and JSON outputs using the same Django instance? Are they all part of the same project?
If they are part of the same project, then you can use something like django-piston, which is a RESTful framework for Python. It will allow you to define the data that should be exposed and output it in multiple formats such as HTML/Django template, JSON, XML, or YAML. You can also create your own emitters to output a different format.
That way, you can expose a particular URL as a regular template, or get the same data as JSON, which will be much easier to parse than HTML.
Sorry if I'm misunderstanding your problem. But it really does sound like you want to expose a view as several different formats, and a RESTful framework will help with that.
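If you do end up scraping the HTML, here is a minimal sketch with Beautiful Soup (assuming the markup shown in the question, and that the name/number divs always come in pairs):
import json
from BeautifulSoup import BeautifulSoup

def block_to_json(html, block_class):
    soup = BeautifulSoup(html)
    block = soup.find('div', block_class)
    names = [div.string for div in block.findAll('div', 'name')]
    numbers = [div.string for div in block.findAll('div', 'number')]
    # Pair each name with the number that follows it in the block
    return json.dumps(
        [{'name': n, 'number': num} for n, num in zip(names, numbers)],
        sort_keys=True, indent=4)

# block_to_json(data, 'BlockA') ->
# [{"name": "John Smith", "number": "2"}, {"name": "Paul Peterson", "number": "14"}]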
I'm trying to use BeautifulSoup to parse some HTML in Python. Specifically, I'm trying to create two arrays of soup objects: one for the dates of postings on a website, and one for the postings themselves. However, when I use findAll on the div class that matches the postings, only the initial tag is returned, not the text inside the tag. On the other hand, my code works just fine for the dates. What is going on??
# store all texts of posts
texts = soup.findAll("div", {"class":"quote"})
# store all dates of posts
dates = soup.findAll("div", {"class":"datetab"})
The first line above returns only
<div class="quote">
which is not what I want. The second line returns
<div class="datetab">Feb<span>2</span></div>
which IS what I want (pre-refining).
I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really really desperate.
Which version of BeautifulSoup are you using? Version 3.1.0 performs significantly worse on real-world (read: invalid) HTML than 3.0.8. This code works with 3.0.8:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://harvardfml.com/")
soup = BeautifulSoup(page)
for incident in soup.findAll('span', { "class" : "quote" }):
    print incident.contents
That site is powered by Tumblr. Tumblr has an API.
There's a Python wrapper for the Tumblr API that you can use to read posts.
from tumblr import Api

api = Api('harvardfml.com')
freq = {}
posts = api.read()
for post in posts:
    pass  # do something with each post here
As for your bogus findAll: without the actual source code of your program, it is hard to see what is wrong.