Is HTML or CSS open source like PHP or Python?
For example, PHP itself:
PHP source link on GitHub
You must understand the difference between the specification (the documentation, the idea) and the implementation (the interpreter, compiler, or engine). See https://softwareengineering.stackexchange.com/questions/238724/what-license-is-html-released-under for more detailed answers.
HTML
HTML only provides tags around text; other than that, HTML files are practically just text files with extra markers. These tags are parsed by your browser's parser and displayed in a certain way. The easiest example is probably the <strong></strong>, <i></i> and <b></b> tags, whose only purpose is to change how the text within them looks. But the strong tag looks different in different browsers: some make the text bold, while others apply other changes. This shows that the browser really is the interpreter, and there are not always strict rules on what a tag must do. The current standard (HTML5) is maintained by a group called the WHATWG. See https://www.w3.org/html/ and the WHATWG's own website https://html.spec.whatwg.org/multipage/. They have a GitHub repository at https://github.com/whatwg/html, and I believe you can make changes if they accept your pull requests.
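To see that the parser is what gives tags their meaning, here is a minimal sketch using Python's built-in html.parser module; what happens for each tag is entirely up to the code consuming the events:
from html.parser import HTMLParser

class TagLister(HTMLParser):
    # The parser just reports markers; *we* decide what they mean.
    def handle_starttag(self, tag, attrs):
        print("start tag:", tag)
    def handle_data(self, data):
        if data.strip():
            print("text:", data.strip())

TagLister().feed("<strong>bold? maybe</strong> the <i>interpreter</i> decides")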
CSS
CSS is also not really a programming language. CSS is specified by the CSS Working Group, whose members are publicly listed: https://www.w3.org/Style/CSS/members. If you scroll through the list you will see that most of these people are engineers from big tech companies. There is a GitHub repository for the CSS drafts, used for people to post issues: https://github.com/w3c/csswg-drafts. I do believe you can create a pull request there as well, although I am unsure.
Summary
In short, yes, they are basically open source. However, changes to either of these repositories, even when accepted and merged, will not take effect until browsers have implemented them.
Note
I am not an expert by any means; I googled most of my information, so I could be wrong about multiple things.
Related
I am relatively new to web scraping/crawlers and was wondering about two issues that arise when a previously parsed DOM element is no longer found in the fetched webpage:
1- Is there a clever way to detect that the page has changed? I have read that it's possible to store and compare hashes, but I am not sure how effective that is.
2- In case a parsed element is no longer found in the fetched webpage, and assuming we know the same DOM element still exists somewhere in the DOM tree in a different location, is there a way to traverse the DOM tree efficiently without having to visit all of its nodes?
I am trying to find out how experienced developers deal with those two issues and would appreciate insights/hints/strategies on how to manage them.
Thank you in advance.
I didn't see this in your tag list, so I thought I'd mention it before anything else: BeautifulSoup, a tool designed specifically for web scraping.
Web scraping is a messy process. Unless there's some long-standing regularity or direct relationship with the web site, you can't really rely on anything remaining static in the web page - certainly not when you scale to millions of web pages.
With that in mind:
There's no one-size-fits-all solution. Some ideas:
Use RSS, if available.
Split your scraping into crude categories, where some categories have either implied or explicit timestamps (e.g. news sites) you can use to trigger an update on your end.
You already mentioned this, but hashing works quite well and is relatively cheap in terms of storage. Another idea here is to hash not the entire page but only the dynamic elements or the elements of interest (see the sketch after this list).
Fetch HEAD, if available.
Download and store the previous and current versions of the files, then compare them with a utility like diff.
Use a third-party service to detect a change and trigger a "refresh" on your end.
Obviously each of the above has its pros and cons in terms of processing, storage, and memory requirements.
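As a minimal sketch of two of these ideas (hashing only the element of interest, plus a cheap HEAD check), assuming requests and BeautifulSoup are installed; the URL, selector, and stored ETag are placeholders:
import hashlib
import requests
from bs4 import BeautifulSoup

def content_hash(url, css_selector="div#main"):
    # Hash only the part of the page you care about, not the whole document.
    html = requests.get(url, timeout=10).text
    node = BeautifulSoup(html, "lxml").select_one(css_selector)
    return hashlib.sha256(node.encode()).hexdigest() if node else None

def maybe_changed(url, last_etag):
    # Cheap first pass: many servers expose an ETag via a HEAD request.
    head = requests.head(url, timeout=10)
    return head.headers.get("ETag") != last_etag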
As of version 4.x, BeautifulSoup lets you plug in different HTML parsers, notably lxml. Note that BeautifulSoup itself does not expose XPath; if you want XPath queries, you can use lxml directly. Either way, this will definitely be more efficient than traversing the entire tree manually in a loop.
Alternatively (and likely even more efficient) is using CSS selectors. These are more flexible because they don't depend on the content being in the same place; of course, this assumes the content you're interested in keeps the same CSS classes and attributes.
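A minimal sketch contrasting the two lookups; the markup, selector, and XPath expression are just examples:
from bs4 import BeautifulSoup
import lxml.html

html = "<div class='post'><span class='author'>ada</span></div>"

# CSS selector via BeautifulSoup: survives the element moving elsewhere
# in the tree, as long as its class stays the same.
soup = BeautifulSoup(html, "lxml")
author = soup.select_one("span.author")

# XPath via lxml directly (BeautifulSoup itself does not expose XPath).
tree = lxml.html.fromstring(html)
author_text = tree.xpath("//span[@class='author']/text()")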
Hope this helps!
I am trying to grab content from Wikipedia and use the HTML of the article. Ideally I would also like to be able to alter the content (e.g. hide certain infoboxes).
I am able to grab page content using mwclient:
>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Samuel_Pepys']
>>> print(page.text())
{{Redirect|Pepys}}
{{EngvarB|date=January 2014}}
{{Infobox person
...
But I can't see a relatively simple, lightweight way to translate this wikicode into HTML using Python.
Pandoc is too much for my needs.
I could just scrape the original page using Beautiful Soup but that doesn't seem like a particularly elegant solution.
mwparserfromhell might help in the process, but I can't quite tell from the documentation if it gives me anything I need and don't already have.
I can't see an obvious solution on the Alternative Parsers page.
What have I missed?
UPDATE: I wrote up what I ended up doing, following the discussion below.
page="""<html>
your pretty html here
<div id="for_api_content">%s</div>
</html>"""
Now you can grab your raw content with your API and just call
generated_page = page % api_content
This way you can design any HTML you want and just insert the API content in a designated spot.
The APIs you are using are designed to return raw content, so it's up to you to style how you want that raw content displayed.
UPDATE
Since you showed me the actual output you are dealing with, I realize your dilemma. Luckily for you, there are modules that already parse the wikitext and convert it to HTML for you.
There is one called mwlib that will parse the wiki markup and output HTML, PDF, etc. You can install it with pip using the install instructions. This is probably one of your better options, since it was created in cooperation between the Wikimedia Foundation and PediaPress.
Once you have it installed you can use the writer method to do the dirty work.
def writer(env, output, status_callback, **kwargs): pass
Here are the docs for this module: http://mwlib.readthedocs.org/en/latest/index.html
And you can set attributes on the writer object to set the filetype (HTML, PDF, etc).
writer.description = 'PDF documents (using ReportLab)'
writer.content_type = 'application/pdf'
writer.file_extension = 'pdf'
writer.options = {
    'coverimage': {
        'param': 'FILENAME',
        'help': 'filename of an image for the cover page',
    }
}
I don't know what the rendered HTML looks like, but I would imagine it's close to the actual wiki page. And since it's rendered in code, I'm sure you have control over modifications as well.
I would go with HTML parsing; the page content is reasonably semantic (class="infobox" and such), and there are classes explicitly meant to demarcate content that should not be displayed in alternative views (the first rule of the print stylesheet might be interesting).
That said, if you really want to manipulate the wikitext, the best way is to fetch it, use mwparserfromhell to drop the templates you don't like, and then use the parse API to get the modified HTML. Or use the Parsoid API, a partial reimplementation of the parser that returns XHTML/RDFa, which is richer in semantic elements.
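A minimal sketch of the drop-the-templates step with mwparserfromhell; the page and template names are just examples from the question:
import mwclient
import mwparserfromhell

site = mwclient.Site('en.wikipedia.org')
wikitext = site.pages['Samuel_Pepys'].text()

code = mwparserfromhell.parse(wikitext)
for template in code.filter_templates():
    # Drop the infobox (or any other template you don't want).
    if template.name.matches('Infobox person'):
        code.remove(template)

stripped_wikitext = str(code)  # feed this to the parse API to get HTML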
At any rate, trying to set up a local wikitext->HTML converter is by far the hardest way you can approach this task.
The MediaWiki API contains a (perhaps confusingly named) parse action that in effect renders wikitext into HTML. I find that mwclient's faithful mirroring of the API structure sometimes actually gets in the way; it can be simpler to just use requests to call the API to "parse" (i.e. render) a page given its title.
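For instance, a minimal sketch calling the public API endpoint with requests (the User-Agent value is a placeholder):
import requests

resp = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'parse',     # renders wikitext to HTML
        'page': 'Samuel_Pepys',
        'prop': 'text',
        'format': 'json',
    },
    headers={'User-Agent': 'my-scraper/0.1 (me@example.com)'},
)
article_html = resp.json()['parse']['text']['*']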
I am using lxml.html.clean to sanitize HTML. It removes 'style' attributes from all tags, and for my purposes I need the style attributes to not be removed.
Before I start allowing them, I am keen to understand whether there are any security implications of not removing style attributes when sanitizing HTML.
Any insight on this matter from the gurus out there will be highly appreciated.
(Please note my application allows an end user to create HTML which is saved in a backend database and then rendered on pages. 'clean' above works well in removing any malicious HTML (e.g. JavaScript) before it is saved for subsequent rendering.)
Per the feedparser HTML sanitisation page, 'style' is not present in the list of 'safe_attrs'.
(Also, sorry if this is a pedestrian question. I am new to HTML sanitization and the related security aspects.)
Under normal conditions, styles are not a security risk per se. However, there are several circumstances in which a malicious style could be a liability:
form and input elements with style="display:none" can auto-populate with data in some browsers, causing users to unknowingly submit extra data.
style="display:block" or another display style might break a layout expecting inline or another style.
If your layout engine wants to maintain a particular visual style, allowing style information in the style attribute will give authors/posters more latitude than intended for choosing styles. (What if they decide they want 2000pt font?)
Style attributes can sometimes load other stylesheets through the @import mechanism, or cause URLs to be loaded via background and similar properties. Unless the sanitizer commits to also sanitizing the CSS code, this will be a potential vector for injection.
Since the only reason you'd want to sanitize is that the source is potentially untrusted or insecure, it's assumed that letting the source set their own styles is not desired.
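If you do decide to allow style, a minimal sketch with lxml's Cleaner, extending safe_attrs; note that in recent lxml versions the clean module lives in the separate lxml_html_clean package, and per the caveats above the CSS values themselves would still need vetting:
from lxml.html import defs
from lxml.html.clean import Cleaner

cleaner = Cleaner(
    safe_attrs=frozenset(defs.safe_attrs) | {'style'},  # keep style=...
    scripts=True,      # still strip <script> tags
    javascript=True,   # and javascript: URLs / CSS expressions
)
clean_html = cleaner.clean_html('<p style="color:red" onclick="x()">hi</p>')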
The more research I do, the grimmer the outlook becomes.
I am trying to flat-save, or statically save, a webpage with Python. This means merging all the styles into inline properties and changing all links to absolute URLs.
I've tried nearly every free conversion website, API, and even libraries on GitHub. None are that impressive. The best Python implementation I could find for flattening styles is https://github.com/davecranwell/inline-styler. I adapted that slightly for Flask, but the generated file isn't that great. Here's how it looks:
Obviously, it should look better. Here's what it should look like:
https://dzwonsemrish7.cloudfront.net/items/3U302I3Y1H0J1h1Z0t1V/Screen%20Shot%202012-12-19%20at%205.51.44%20PM.png?v=2d0e3d26
It seems like a never-ending struggle dealing with malformed HTML, unrecognized CSS properties, Unicode errors, etc. So does anyone have a suggestion for a better way to do this? I understand I can go to File -> Save in my local browser, but when I'm trying to do this en masse and extract a particular XPath, that's not really viable.
It looks like Evernote's Web Clipper uses iframes, but that seems more complicated than I think it should be. At least the clippings look decent on Evernote.
After walking away for a while, I managed to install a Ruby library that flattens the CSS much, much better than anything else I've used. It's the library behind the very slow web interface at http://premailer.dialect.ca/
Thank goodness they released the source on GitHub; it's the best, hands down.
https://github.com/alexdunae/premailer
It flattens styles, creates absolute URLs, works with a URL or a string, and can even create plain-text email templates. I'm very impressed with this library.
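For the Python-minded, a minimal sketch assuming the Python port of the same library (pip install premailer), which exposes a transform() helper:
from premailer import transform

flattened = transform(
    '<style>p { color: red }</style><p>hello</p>',
    base_url='https://example.com/',  # used to absolutize relative URLs
)
print(flattened)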
Update Nov 2013
I ended up writing my own bookmarklet that works purely client-side. It is compatible with WebKit and Firefox only. It recurses through each node, adds inline styles, and then sends the flattened HTML to the clippy.in API to save to the user's dashboard.
Client Side Bookmarklet
It sounds like inline styles might be a deal-breaker for you, but if not, I suggest taking another look at Evernote Web Clipper. The desktop app has an Export HTML feature for web clips. The output is a bit messy, as you'd expect with inline styles, but I've found the markup to be a reliable representation of the saved page.
Regarding inline vs. external styles: for something like this, I don't see any way around inline styles if you're doing a lot of pages from different sites, where class names would have conflicting style rules.
You mentioned that Web Clipper uses iframes, but I haven't found this to be the case for the HTML output. You'd likely have to embed the static page as an iframe if you're re-publishing it on another site (legally, I assume), but otherwise that shouldn't be an issue.
Some automation would certainly help so you could go straight from the browser to the HTML output, and perhaps for relocating the saved images to a single repo with updated src links in the HTML. If you end up working on something like this, I'd be grateful to try it out myself.
Is there an API or systematic way of stripping irrelevant parts of a web page while scraping it with Python? For instance, take this very page: the only important parts are the question and the answers, not the sidebar column, header, etc. One can guess at things like that, but is there any smart way of doing it?
There's the approach from the Readability bookmarklet, with at least two Python implementations available:
decruft
python-readability
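A minimal sketch of the second one, assuming the readability-lxml package (pip install readability-lxml); the URL is a placeholder:
import requests
from readability import Document

html = requests.get('https://example.com/article', timeout=10).text
doc = Document(html)
print(doc.title())         # best-guess title of the page
main_html = doc.summary()  # just the "important part", as cleaned HTML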
In general, no. In specific cases, if you know something about the structure of the site you are scraping, you can use a tool like Beautiful Soup to manipulate the DOM.
One approach is to compare the structure of multiple webpages that share the same template. In this case you would compare multiple SO questions. Then you can determine which content is static (template boilerplate, useless) or dynamic (page-specific, useful).
This field is known as wrapper induction. Unfortunately, it is harder than it sounds!
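A very rough sketch of the comparison idea (not real wrapper induction): fetch two pages built from the same template and keep only the text blocks that differ. The URLs are placeholders, and requests/BeautifulSoup are assumed:
import requests
from bs4 import BeautifulSoup

def text_blocks(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "lxml")
    return list(soup.stripped_strings)

a = text_blocks("https://stackoverflow.com/questions/11111111")
b = text_blocks("https://stackoverflow.com/questions/22222222")

static = set(a) & set(b)                      # shared -> template chrome
dynamic = [t for t in a if t not in static]   # page-specific content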
This GitHub project solves your problem, but it's in Java. It may be worth a look: goose