Selenium Python Firefox, disable images but show the placeholder

I have followed a few other SO threads on how to disable image loading in Firefox. However, the page looks very messed up after disabling images. Is there a way to show the image placeholders so that the page looks structurally similar to the page with images?

No, this can't be done easily. As this answer explains, if you're not actually requesting the image from the server and getting a response, the browser can't know how big the placeholder should be, so it assumes a size of 0x0.
As usual there are lots of alternatives and workarounds, but at that point you have to decide whether the benefit of not downloading images is really worth the effort of rewriting the page to replace images with fixed-size <div>s, rewriting the image requests through a proxy server, adding aggressive caching, and so on.
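For reference, the preference-based blocking that those threads describe usually looks something like this (a minimal sketch; the Options API assumes a reasonably recent Selenium release):

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    # 2 = block all images; because no response ever arrives, the
    # placeholders collapse to 0x0, which is exactly the layout
    # problem described above
    options.set_preference("permissions.default.image", 2)

    driver = webdriver.Firefox(options=options)
    driver.get("https://example.com")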

In Django, how can I load images based on screen size?

Apologies for not having any specific broken code here. I already know, from a different question here, that the first thing I would try won't work, and I have only a vague idea of something that might work but is likely not the best way to do it.
I'm building a website for a photographer, so it's important that I load the best-looking photos the user is capable of seeing. The starting file size for the images is a few MB, but the model uses Pillow to save down-scaled copies. There are times when I want a full-screen image at high resolution, but I want to serve a smaller image if the user is on mobile, for example.
What I would have done was load the images from CSS background-image with media queries, but I understand that I can't use template tags in CSS.
My next guess would be to build two separate versions of each template and have the views render a different template based on the user-agent of the request, but putting that much trust in the request headers strikes me as fragile; the functionality could break as easily as a new browser release. There's a better way, right?
Why not CSS? In this case, the images are dynamic content. The photographer doesn't know web development, and I'm not going to update the site for him every time he adds or removes images or blog posts, so the site is a lite CMS: he can freely add or remove images from galleries. The Django view/template then finds and serves the images from the queryset. Instead of referring to specific images and their specific smaller versions, I'm asking the server to serve whatever images currently belong to a given queryset in the database, in either their smaller or larger versions, with the explicit goal of not making the user download resolution they can't see.
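A minimal sketch of the kind of model described above (field names and the size cap are hypothetical; the real model presumably has gallery relations and so on), where Pillow writes a down-scaled copy alongside the original on save:

    from io import BytesIO

    from django.core.files.base import ContentFile
    from django.db import models
    from PIL import Image

    class GalleryImage(models.Model):
        full = models.ImageField(upload_to="photos/full/")
        small = models.ImageField(upload_to="photos/small/", blank=True)

        def save(self, *args, **kwargs):
            super().save(*args, **kwargs)
            if self.full and not self.small:
                img = Image.open(self.full)
                img.thumbnail((1024, 1024))  # assumed cap for small screens
                buf = BytesIO()
                img.convert("RGB").save(buf, format="JPEG", quality=85)
                name = self.full.name.rsplit("/", 1)[-1]
                # saving the field re-saves the model; the `not self.small`
                # guard above keeps this from recursing
                self.small.save(name, ContentFile(buf.getvalue()), save=True)

With both URLs available per object, the template can emit them in an img srcset attribute and let the browser pick a size, which avoids any user-agent sniffing.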

Display the first page of a child PDF file in Plone

I have a parent file type that is folderish, and I would like to include a thumbnail of the first page of a child pdf in the template. Can someone roughly outline the tools and process you can imagine would achieve this, so that I can investigate further?
Extracting the first page of a PDF can be done with Ghostscript.
collective.pdfpeek contains an example script that builds a Ghostscript command and stores the resulting images; it could, by the way, solve your problem right away :-)
Until a few days ago I would have recommended against using it, since it was a little buggy, but they recently shipped a new version, so give it a try! I'm not sure whether they support Dexterity (DX) content types yet.
So the workflow for you should be:
Upload a PDF.
Subscribe to modified/created events.
Create an image of the first page using Ghostscript (a command sketch follows below, or check collective.pdfpeek).
Store it as a blob (NamedBlobImage) on your uploaded PDF.
Also implement some queueing, as collective.pdfpeek does, so that Ghostscript commands don't block all your threads.
OR
Give collective.pdfpeek a shot!
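For the Ghostscript step, here is a minimal sketch (the gs flags come from the standard Ghostscript CLI; the helper name is hypothetical):

    import subprocess

    def render_first_page(pdf_path, png_path, dpi=150):
        # render only page 1 of the PDF to a PNG at the given resolution
        subprocess.check_call([
            "gs",
            "-dNOPAUSE", "-dBATCH", "-dSAFER",
            "-sDEVICE=png16m",
            "-dFirstPage=1", "-dLastPage=1",
            "-r%d" % dpi,
            "-sOutputFile=%s" % png_path,
            pdf_path,
        ])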
BTW: IMHO, at a large scale, preview generation for PDFs needs to be implemented as a service that stores/manages the images for you.

Best way to programmatically save a webpage to a Static HTML File

The more research I do, the more grim the outlook becomes.
I am trying to flat-save, or static-save, a webpage with Python. This means merging all the styles into inline properties and changing all links to absolute URLs.
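The absolute-URL half is the easier part; a sketch with BeautifulSoup and urljoin (the tag/attribute list is illustrative, not exhaustive):

    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def absolutize(html, base_url):
        # rewrite relative hrefs/srcs against the page's base URL
        soup = BeautifulSoup(html, "html.parser")
        for tag, attr in (("a", "href"), ("img", "src"),
                          ("link", "href"), ("script", "src")):
            for el in soup.find_all(tag):
                if el.get(attr):
                    el[attr] = urljoin(base_url, el[attr])
        return str(soup)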
I've tried nearly every free conversion website, API, and even libraries on GitHub. None are that impressive. The best Python implementation I could find for flattening styles is https://github.com/davecranwell/inline-styler. I adapted it slightly for Flask, but the generated file isn't that great.
Obviously, it should look better. For reference, here is a screenshot of what it should look like:
https://dzwonsemrish7.cloudfront.net/items/3U302I3Y1H0J1h1Z0t1V/Screen%20Shot%202012-12-19%20at%205.51.44%20PM.png?v=2d0e3d26
It seems like a never-ending struggle dealing with malformed HTML, unrecognized CSS properties, Unicode errors, etc. Does anyone have a suggestion for a better way to do this? I understand I can go to File -> Save in my local browser, but when I'm trying to do this en masse and extract a particular XPath, that's not really viable.
It looks like Evernote's Web Clipper uses iframes, which seems more complicated than I think it should be, but at least the clippings look decent in Evernote.
After walking away for a while, I managed to install a Ruby library that flattens the CSS much better than anything else I've used. It's the library behind the very slow web interface at http://premailer.dialect.ca/
Thank goodness they released the source on GitHub; it's hands down the best.
https://github.com/alexdunae/premailer
It flattens styles, creates absolute URLs, works with a URL or a string, and can even create plain-text email templates. I'm very impressed with this library.
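If you'd rather stay in Python, there is also a Python port of Premailer on PyPI (pip install premailer); a minimal sketch, assuming its transform helper:

    from premailer import transform

    with open("page.html") as f:
        html = f.read()

    # inlines <style> rules onto each element and resolves relative
    # URLs against base_url
    flattened = transform(html, base_url="http://example.com/")

    with open("page.flat.html", "w") as f:
        f.write(flattened)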
Update Nov 2013
I ended up writing my own bookmarklet that works purely client-side. It is compatible with WebKit and Firefox only. It recurses through each node, adds inline styles, then sends the flattened HTML to the clippy.in API to save to the user's dashboard.
Client Side Bookmarklet
It sounds like inline styles might be a deal-breaker for you, but if not, I suggest taking another look at Evernote Web Clipper. The desktop app has an Export HTML feature for web clips. The output is a bit messy as you'd expect with inline styles, but I've found the markup to be a reliable representation of the saved page.
Regarding inline vs. external styles, for something like this I don't see any way around inline if you're doing a lot of pages from different sites where class names would have conflicting style rules.
You mentioned that Web Clipper uses iframes, but I haven't found that to be the case for the HTML output. You'd likely have to embed the static page as an iframe if you're re-publishing it on another site (legally, I assume), but otherwise that shouldn't be an issue.
Some automation would certainly help so you could go straight from the browser to the HTML output, and perhaps for relocating the saved images to a single repo with updated src links in the HTML. If you end up working on something like this, I'd be grateful to try it out myself.

Getting image sizes like Facebook link scraper

I'm implementing my own link scraper to copy Facebook's technique as closely as possible (unless someone has a ready-made lib for me...).
According to the many answers on SO, Facebook's process for determining the image to associate with a shared link involves searching for several recognized meta tags and then, if those are not found, stepping through the images on the page and returning a list of appropriately sized ones (at least 50px by 50px, with a maximum aspect ratio of 3:1, and in PNG, JPEG, or GIF format, according to this answer).
My question is: how does Facebook get the size information of the images? Is it loading all the images for each shared link and inspecting them? Is there a more efficient way to do this? (My backend is Python.)
(Side note: Would it make sense to use a client-side instead of server-side approach?)
Is there a more efficient way to do this?
The most common "web" graphic formats (JPEG, GIF, PNG) contain the width and height in the header (or at least in the first block, in PNG's case).
So if the remote web server accepts range requests, it's possible to request only the first X bytes of an image resource, instead of the whole thing, to get the desired information.
(This is what Facebook's scraper does for HTML pages, too. It's quite common to see in the debugger that a request was answered with HTTP status code 206 Partial Content, meaning Facebook asked only for the first X (kilo)bytes, enough to cover the meta elements in the head, and the web server was able to give them only that.)
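A sketch of that approach in Python: issue a Range request for the first few kilobytes and let Pillow's incremental parser read the dimensions out of the header (requests and Pillow are assumed; servers that ignore Range simply send the full body, so the read is capped either way):

    import requests
    from PIL import ImageFile

    def probe_image_size(url, max_bytes=4096):
        # ask for just the first few KB; compliant servers answer 206
        resp = requests.get(
            url,
            headers={"Range": "bytes=0-%d" % (max_bytes - 1)},
            stream=True,
            timeout=10,
        )
        parser = ImageFile.Parser()
        read = 0
        for chunk in resp.iter_content(1024):
            parser.feed(chunk)
            read += len(chunk)
            if parser.image:
                # Pillow has parsed enough header to know width/height
                resp.close()
                return parser.image.size
            if read >= max_bytes:
                break
        resp.close()
        return None

Usage: probe_image_size("https://example.com/photo.jpg") returns, e.g., (1024, 768), or None if the header didn't fit in the probed range.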

Retrieving media (images, videos etc.) from links in Perl

Similar to Reddit's r/pics subreddit, I want to aggregate media from various sources. Some sites use the oEmbed spec to expose media on the page, but not all sites do. I was browsing through Reddit's source because essentially they 'scrape' the links that users submit, retrieve images, videos, etc., and create thumbnails that are displayed alongside the links on their site. I would like to do something similar. I looked at their code [1], and it seems they have custom scrapers for each domain they recognize, plus a generic Scraper class that uses simple logic to get images from any domain (it retrieves the web page, parses the HTML, and determines the largest image on the page, which it then uses to generate a thumbnail).
Since it's open source, I could probably reuse the code for my application, but unfortunately I have chosen Perl, as this is a hobby project and I'm trying to learn the language. Is there a Perl module with similar functionality? If not, is there a Perl module similar to the Python Imaging Library? It would be handy to determine image sizes without actually downloading the whole image, and to generate thumbnails.
Thanks!
[1] https://github.com/reddit/reddit/blob/master/r2/r2/lib/scraper.py
Image::Size is the specialised module for determining image sizes in various formats. It should be enough to read the first 1000 octets or so of a resource into a buffer (enough for the diverse image headers) and operate on that. I have not tested this.
I do not know of any general scraping module with an API for HTTP range requests (to avoid downloading the whole image resource), but it is easy to subclass WWW::Mechanize.
Try PerlMagick; installation instructions are listed there as well.
