I am trying to make an offline copy of this website: ieeghn. Part of this task is to download all CSS/JS files that are referred to, using Beautiful Soup, and rewrite every external link so it points to the newly downloaded local resource.
At the moment I simply use the string replace method, but I don't think this is efficient, as I do it inside a loop; snippet below:
local_content = ''
for res in soup.findAll('link', {'rel': 'stylesheet'}):
    if not str(res['href']).startswith('data:'):
        original_res = res['href']
        res['href'] = some_function_to_download_css()
        local_content = local_content.replace(original_res, res['href'])
I only download non-embedded resources, i.e. ones whose href does not start with data:. But the problem is that local_content = local_content.replace(original_res, res['href']) seems to let me rewrite only one external resource into a local one. The rest still refer to the online version of the resource.
I am guessing that because local_content is a very long string (have a look at the ieeghn source), this doesn't work out well.
How do you properly replace parts of a string for a given pattern?
Or do I have to store it to a file first and modify it there?
EDITED
I found that the problem was in this line of code:
original_res = res['href']
BeautifulSoup somehow sanitizes the href string: in my case, & gets changed to &amp;. Since I am trying to replace the original href with the path of the newly downloaded local file, str.replace() simply won't find the original value. Either I have to find a way to get the original href, or simply handle this case. Got to say, having the original href would be the best way.
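One workaround I could try (a rough sketch reusing the names from my snippet above; some_function_to_download_css() is the same assumed helper) is to replace both the unescaped href and its entity-escaped form, whichever one actually appears in local_content:
from xml.sax.saxutils import escape  # turns & back into &amp; (also < and >)

for res in soup.findAll('link', {'rel': 'stylesheet'}):
    href = res.get('href', '')
    if href.startswith('data:'):
        continue
    local_path = some_function_to_download_css()  # assumed helper from the question
    res['href'] = local_path
    # Try both spellings of the original href.
    local_content = local_content.replace(href, local_path)
    local_content = local_content.replace(escape(href), local_path)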
You're already replacing the content, in a way...
res['href'] = some_function_to_download_css()
...updates the href attribute of the res node in BeautifulSoup's representation of the HTML tree.
To make it more efficient, you could cache the URLs of CSS files you've already downloaded, and consult the cache before downloading the file. Once you're done (and if you're OK with BS's attribute ordering/indentation/etc.), you can get the string representation of the tree with str(soup).
Reference: http://beautiful-soup-4.readthedocs.org/en/latest/#changing-tag-names-and-attributes
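A minimal sketch of that idea, reusing the names from the question (some_function_to_download_css is assumed here to accept the URL and return the local path):
downloaded = {}  # cache: original URL -> local file path

for res in soup.findAll('link', {'rel': 'stylesheet'}):
    href = res.get('href', '')
    if href.startswith('data:'):
        continue
    if href not in downloaded:
        downloaded[href] = some_function_to_download_css(href)
    res['href'] = downloaded[href]  # update the attribute in the tree

# Serialize the modified tree once, instead of patching a long string.
local_content = str(soup)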
I am trying to extract the contents of a specific tag on a webpage by using lxml, namely on Indeed.com.
Example page: link
I am trying to extract the company name and position name. Chrome shows that the company name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/span[1]"
and the position name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/b/font"
This bit of code tries to extract those values from a locally saved and parsed copy of the page:
import lxml.html as h
xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)
However, the print commands output empty lists, meaning nothing was extracted!
What is going on? Am I using the right tags? I don't think these are dynamically generated since the page loads normally with javascript disabled.
I would really appreciate any help with getting those two values extracted.
Try it like this:
company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
['The Habitat Company']
['Janitor-A (Scattered Sites)']
Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:
Select the text of the child span, /span[@class='company']/text(), to get the company name.
/b[@class='jobtitle']//text() is a bit more convoluted, since the job title is embedded in a font tag, but we can just select any descendant text with //text() to get the position.
An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")[0].text_content()
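Putting it together as a small, self-contained sketch against the locally saved copy from the question:
import lxml.html as h

xslt_root = h.parse("Temp/IndeedPosition.html")

company = xslt_root.xpath(
    "//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath(
    "//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")

print(company[0].strip() if company else "company not found")
print(position[0].text_content().strip() if position else "position not found")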
Despite your assumption, it seems that the content on the page is loaded dynamically and is thus not present at load time.
This means you can't access the elements from your downloaded HTML file (if you do not believe me, try to look for job-content in the actual file on your computer, which will only contain placeholders and descriptors).
It seems you would have to use technologies like Selenium to perform this task.
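For illustration only, a minimal Selenium sketch (the URL is a placeholder, the XPath is the one from the other answer, and a suitable WebDriver has to be installed):
from lxml import html
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.indeed.com/viewjob?jk=...")  # placeholder job URL

# Parse the rendered DOM instead of the raw downloaded HTML.
tree = html.fromstring(driver.page_source)
company = tree.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
print(company)

driver.quit()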
Again, I want to stress that whatever you are doing (automatically) is a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.
I'd like to create a variable whose data is the title extracted from a URL, without using any external module.
I'm new to Python, so if you can, please explain what every part of the code does.
Thanks.
PS: I'm using Python 3.
PS2: I mean the title tag of its HTML.
Let html be an HTML string (say, the HTML source of this particular page). You can find the opening and closing tags with str.find(). The string is converted to lower case to allow a case-insensitive search.
start = html.lower().find('<title>') + len('<title>')
end = html.lower().find('</title>')
You can then extract the part of the HTML string between the tags:
html[start:end]
#'How can I extract the title from a URL in Python without using any...'
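Putting that together using only the standard library (urllib.request ships with Python 3, so no external module is involved); the URL is just a placeholder:
from urllib.request import urlopen

url = "https://example.com/"  # placeholder
html = urlopen(url).read().decode('utf-8', errors='replace')

# Case-insensitive search for the <title> ... </title> pair.
lower = html.lower()
start = lower.find('<title>') + len('<title>')
end = lower.find('</title>')

title = html[start:end].strip()
print(title)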
Assuming by "title", you mean the title of a resource: take a URL like https://www.foo.com/bar/baz/resource.jpg. You need to split it into a list along the /s, then take the last item in that list. The code
url = "https://www.foo.com/bar/baz/resource.jpg"
print(url.split('/')[-1])
gives the output
resource.jpg
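If the URL can also carry a query string or fragment, urlparse (standard library) can strip those before taking the last path segment; a small sketch:
from urllib.parse import urlparse

url = "https://www.foo.com/bar/baz/resource.jpg?size=large#top"
print(urlparse(url).path.split('/')[-1])  # resource.jpg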
I recently implemented adding target="_blank" to external links like this:
@hooks.register('after_edit_page')
def do_after_page_edit(request, page):
    if hasattr(page, "body"):
        soup = BeautifulSoup(page.body)
        for a in soup.findAll('a'):
            if hasattr(a, "href"):
                a["target"] = "_blank"
        page.body = str(soup)
        page.body = page.body.replace("<html><head></head><body>", "")
        page.body = page.body.replace("</body></html>", "")
        page.body = page.body.replace("></embed>", "/>")
        page.save()

@hooks.register('construct_whitelister_element_rules')
def whitelister_element_rules():
    return {
        'a': attribute_rule({'href': check_url, 'target': True}),
    }
Problems:
Beautiful soup messes with the output, adding html, head & body tags - Don't put html, head and body tags automatically, beautifulsoup
It also messes with the embed tags - How to get BeautifulSoup 4 to respect a self-closing tag?
Hence my crappy "fix" manually replacing parts of the output with blank strings.
Question:
What is the correct and best way to do this?
Starting with Wagtail v2.5, there is an API to do customisations like this as part of Wagtail’s rich text processing: Rewrite handlers, with the register_rich_text_features hook.
Here is an example of using this new API to make a rewrite handler that sets a target="_blank" attribute to all external links:
from django.utils.html import escape
from wagtail.core import hooks
from wagtail.core.rich_text import LinkHandler

class NewWindowExternalLinkHandler(LinkHandler):
    # This specifies to do this override for external links only.
    # Other identifiers are available for other types of links.
    identifier = 'external'

    @classmethod
    def expand_db_attributes(cls, attrs):
        href = attrs["href"]
        # Let's add the target attr, and also rel="noopener" + noreferrer fallback.
        # See https://github.com/whatwg/html/issues/4078.
        return '<a href="%s" target="_blank" rel="noopener noreferrer">' % escape(href)

@hooks.register('register_rich_text_features')
def register_external_link(features):
    features.register_link_type(NewWindowExternalLinkHandler)
In this example I'm also adding rel="noopener" to fix a known security issue with target="_blank".
Compared to previous solutions to this problem, this new approach is the most reliable: it’s completely server-side and only overrides how links are rendered on the site’s front-end rather than how they are stored, and only relies on documented APIs instead of internal ones / implementation details.
I have been struggling with the same problem and couldn't achieve it using Wagtail hooks. My initial solution was to manipulate the content in base.html using a filter. The filter to cut pieces of code works perfectly when placed in the content block, for example:
{{ self.body|cut:' href="http:' }}
The above filter deletes parts of the content, but unfortunately 'replace' is not available as a filter (I'm using Python 3.x). Therefore my next approach was building a custom filter to make 'replace' available as a filter option. Long story short: it partly worked, but only if the content was converted from the original 'StreamValue' datatype to 'string'. This conversion resulted in content with all HTML tags shown, so the replacement did not produce working HTML. I couldn't get the content back to StreamValue again, and no other Python datatype remedied the issue.
Eventually JQuery got the job done for me:
$(document).ready(function(){
    $('a[href^="http://"]').attr('target', '_blank');
});
This code adds target="_blank" to each link whose href starts with 'http://', so all internal (relative) links stay in the existing tab. It needs to be placed at the end of your base.html (or similar), and of course you need to load jQuery before you run it.
Got my answer from here.
Don't know if jQuery is the correct and best way to do it, but it works like a charm for me with minimal coding.
I am trying to scrape some HTML code from this site. When I print the content, some links (I only want "Table of Contents" and "Printer-friendly Version") have an href containing a relative path such as "../etc".
When I print the scraped code I need to replace the relative path in the href with the absolute one, so that I can reach the right webpage by clicking on my scraped link. If that operation is not feasible, is there a way to write the right path into the hrefs I need to handle?
#!C:/Python27/python
from lxml import etree
import requests
q = "http://www.dlib.org/dlib/november14/giannakopoulos/11giannakopoulos.html"
page = requests.get(q)
tree = etree.HTML(page.text)
element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
content = etree.tostring(element[0])
print "Content-type: text\n\n"
print content.strip()
In the resulting HTML, just add an xml:base to the root of the document that's the same as the URL of the original page, or any other base URI where you have the same resources.
XPath cannot change the document for you; you will need to do that on the document itself. Etree has a property to set the base URI, but I don't know if it will output that when you print the results. The most obvious way to replace this is to use XSLT, which is also supported by lxml.
Setting the base URI will have an effect on elements that respect the base URI. Older browsers may not properly respect it, in which case you can use <base>.
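As a rough sketch of both options (Python 3 here; on Python 2 urljoin lives in the urlparse module), applied to the extracted element from the question:
from urllib.parse import urljoin

import requests
from lxml import etree

q = "http://www.dlib.org/dlib/november14/giannakopoulos/11giannakopoulos.html"
tree = etree.HTML(requests.get(q).text)
element = tree.xpath('./body/form/table[3]/tr/td/table[5]')[0]

# Option 1: set the base URI (lxml exposes this as the .base property;
# it should come out as an xml:base attribute when serialized).
element.base = q

# Option 2 (more browser-friendly): rewrite every relative href to an absolute URL.
for a in element.iter('a'):
    href = a.get('href')
    if href:
        a.set('href', urljoin(q, href))

print(etree.tostring(element, pretty_print=True).decode())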
I'm trying to use Beautiful Soup to isolate a specific <table> element and put it in a new file. The table has an id, ModelTable, and I can find it using soup.select("#ModelTable") ("soup" being the imported file).
However, I'm having trouble figuring out how to get the element into a new file. Simply writing it to a new file (as in write(soup.select("#ModelTable"))) doesn't work, as it's not a string object, and converting it with str() results in a string enclosed in brackets.
Ideally I'd like to be able to export the isolated element after running it through .prettify() so that I can get a good HTML file right off the bat. I know I must be missing something obvious... any hints?
You need to iterate over the contents of the returned object. Your question also taught me that BS4's .select uses CSS selectors, which is fantastic.
with open('file_output.html', 'w') as f:
    for tag in soup.select("#ModelTable"):
        f.write(tag.prettify())
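If you only expect a single match, select_one (available in newer BeautifulSoup 4 releases) returns the tag directly instead of a list, which avoids the iteration; a small variant:
table = soup.select_one("#ModelTable")
if table is not None:
    with open('file_output.html', 'w') as f:
        f.write(table.prettify())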