I'm implementing my own link scraper to copy Facebook's technique as closely as possible (unless someone has a ready-made lib for me...).
According to the many answers on SO, Facebook's process for determining the image to associate with a shared link involves searching for several recognized meta tags and then, if those are not found, stepping through the images on the page and returning a list of appropriately sized ones (at least 50px by 50px, with a maximum aspect ratio of 3:1, and in PNG, JPEG, or GIF format, according to this answer).
My question is: how does Facebook get the size information of the images? Is it loading every image for each shared link and inspecting it? Is there a more efficient way to do this? (My backend is Python.)
(Side note: Would it make sense to use a client-side instead of server-side approach?)
Is there a more efficient way to do this?
The most common web graphic formats (JPEG, GIF, PNG) store the width and height in the file header (or, for PNG, in the first chunk).
So if the remote web server accepts range requests, it is possible to request only the first X bytes of an image resource instead of the whole thing to get the desired information.
(This is what Facebook's scraper does for HTML pages, too. It's quite common to see in a debugger that the request was answered with HTTP status code 206 Partial Content, meaning Facebook asked for only the first X (kilo)bytes, enough to cover the meta elements in the head, and the web server was able to return just that.)
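A minimal sketch of that idea in Python (my own code, not Facebook's): fetch only the first chunk of the image and let Pillow parse the header. The URL and the 32 KB cutoff are placeholders.

    from io import BytesIO

    import requests
    from PIL import Image

    def probe_image_size(url, first_bytes=32 * 1024):
        # Ask for just the first chunk; a server that honours ranges replies
        # with 206 Partial Content, one that doesn't simply sends everything.
        resp = requests.get(url, headers={"Range": f"bytes=0-{first_bytes - 1}"},
                            timeout=10)
        try:
            img = Image.open(BytesIO(resp.content))  # parses only the header
            return img.format, img.size
        except Exception:
            # Truncated before the size information, or not an image at all.
            return None

    print(probe_image_size("https://example.com/photo.jpg"))  # placeholder URL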
I am working on continuous printing of receipts on a thermal printer. To do this, I need to generate a PDF to send to the printer. The printer uses a 58mm roll of paper.
If the content is broken into multiple pages of fixed height, the last page will often have a lot of vertical blank space at the end, and the printer will just push out a lot of blank paper unnecessarily. I tried cropping and merging the pages into a single page, but this is highly inefficient (it takes at least 4 seconds, which is not acceptable).
The only solution I can think of is to generate a PDF with all the content on a single page, with a page width of 58mm and the page height set dynamically based on the generated content.
I tried PyPDF2, reportlab and a few other libraries, but all the libraries I tried require setting the exact page size before any elements are placed.
Any ideas how this can be done?
What you are trying to do is needlessly burdensome unless you take advantage of the thermal receipt printer's own features, so I recommend rethinking your approach and switching to character (printer-font) printing.
If you still want to continue the way you've been going, these articles may be helpful:
text printed twice on the same page
Resize pdf pages in Python
For example, each time you add a PDF, you can create a blank page whose height is the sum of the existing PDF's height and the new PDF's height, then merge both onto that blank page; repeating this grows the page height dynamically.
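A rough sketch of that idea using pypdf (PyPDF2's successor); the function name is mine and exact API details vary between versions, so treat this as a starting point rather than a finished implementation:

    from pypdf import PdfReader, PdfWriter, Transformation

    def stack_below(top_page, bottom_page):
        """Return a writer holding one tall page: bottom_page below top_page."""
        width = max(float(top_page.mediabox.width),
                    float(bottom_page.mediabox.width))
        height = (float(top_page.mediabox.height)
                  + float(bottom_page.mediabox.height))

        writer = PdfWriter()
        tall = writer.add_blank_page(width=width, height=height)

        # PDF's origin is the bottom-left corner, so shift the earlier
        # content up by the height of the newly appended part.
        tall.merge_transformed_page(
            top_page,
            Transformation().translate(ty=float(bottom_page.mediabox.height)))
        tall.merge_transformed_page(bottom_page, Transformation())
        return writer

Whether this ends up faster than the 4-second crop-and-merge attempt from the question would need measuring.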
Below is my initial answer. I'll leave it as information on making use of the thermal receipt printer's own features.
Fundamentally, the printer's graphics buffer is finite, so you can't do what you want with a single arbitrarily long page.
The size of the buffer depends on the printer, so read the specifications of the printer you are using carefully.
Image data must be split into chunks no larger than the printer's maximum buffer size.
Response to comment:
It is probably a vendor-supplied device driver or library that reconciles the printer's characteristics with the OS's requirements.
With such a vendor driver it may well be possible.
In other words, the vendor's driver internally splits the data as described above, which makes it look to the application as if long pages were supported.
However, if you use ESC/POS control sequences directly, or a generic library that doesn't take care of this, that won't happen.
By the way, if the content is not a PDF or an image, you don't need the decorations of a desktop document printer, and you limit yourself to the printer's built-in fonts, then you can print up to the full length of the paper.
In short, this works whenever the printed content does not have to be rendered as graphics data.
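For what it's worth, the python-escpos library speaks ESC/POS directly, so a character-only receipt needs no page height at all. A minimal example (the USB vendor/product IDs are placeholders for your own device):

    from escpos.printer import Usb

    # Placeholder IDs; substitute the ones for your printer.
    p = Usb(0x04b8, 0x0202)
    p.text("RECEIPT\n")
    p.text("1x Coffee        3.50\n")
    p.text("TOTAL            3.50\n")
    p.cut()  # feed and cut right after the last line; no blank paper tail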
Apologies for not having any specific broken code here. I already know from a different question here that what I would try won't work, and I have a vague idea of something that might work, but it's likely not the best way to do it.
I'm building a website for a photographer, so it's important that I load the best-looking photos the user is capable of seeing. The starting file size for the images is a few MB, but the model uses Pillow to save down-scaled copies. There are times when I want a full-screen image at high resolution, but I want to serve a smaller image if the user is on mobile, for example.
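For concreteness, here is one common shape for such a model; the names are mine, since the question doesn't show its actual code:

    from io import BytesIO

    from django.core.files.base import ContentFile
    from django.db import models
    from PIL import Image

    class Photo(models.Model):
        full = models.ImageField(upload_to="photos/full/")
        small = models.ImageField(upload_to="photos/small/", blank=True)

        def save(self, *args, **kwargs):
            # On first save, derive a bounded-size JPEG copy of the original.
            if self.full and not self.small:
                img = Image.open(self.full).convert("RGB")
                img.thumbnail((1024, 1024))  # fit within 1024x1024, keep aspect
                buf = BytesIO()
                img.save(buf, format="JPEG", quality=85)
                name = self.full.name.rsplit("/", 1)[-1]
                self.small.save(name, ContentFile(buf.getvalue()), save=False)
            super().save(*args, **kwargs)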
What I would have done was load the images from CSS background-image with media queries, but I understand that I can't use template tags in CSS.
My next guess would be to build two separate versions of each template and have the views render one or the other based on the request's user-agent, but putting that much trust in the request headers strikes me as a poor solution, and the functionality could break with something as simple as a new browser release. There's a better way, right?
Why not CSS? In this case, the images are dynamic content. The photographer doesn't know web development, and I'm not going to update the site for him every time he adds or removes images or blog posts, so the site is a lite CMS: he can freely add or remove images from galleries, and the Django view/template then finds and serves the images from the queryset. Instead of referring to specific images and their specific smaller versions, I'm asking the server to serve whatever images currently belong to a given queryset in the database, serving either the smaller or larger version of each, with the explicit goal of not making the user download resolution they can't see.
I've been trying for about a week to automate image extraction from a PDF. Unfortunately, the answers I found here were of no help. I've seen multiple variations on the same code using PyPDF2, all with ['/XObject'] in them, which results in a KeyError.
What I'm looking for seems to be hiding in streams, which I can't find in PyPDF2's dictionary (even after recursively exploring the whole structure, calling .getObject() on every indirect object I can find).
Using PyPDF2, I've written one page of the PDF to a file and opened it in Notepad++, finding several streams with the /FlateDecode filter.
pdfrw was slightly more helpful, allowing me to use PdfReader(path).pages[page].Contents.stream to get a stream (though I have no clue how to get the others).
Using zlib, I decompressed it, and got something starting with:
/Part <</MCID 0 >>BDC
(It also contains a lot of floating-point numbers, both positive and negative)
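For reference, the steps described above reconstructed in code (pdfrw plus zlib); the filename is a placeholder, and this assumes a single FlateDecode content stream:

    import zlib
    from pdfrw import PdfReader

    page = PdfReader("document.pdf").pages[0]  # placeholder filename
    # Note: page.Contents can also be an array of several stream objects.
    raw = page.Contents.stream                 # pdfrw exposes streams as str
    data = zlib.decompress(raw.encode("latin-1"))  # FlateDecode is plain zlib
    print(data[:200])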
From what I could find, BDC has something to do with Ghostscript.
At this point I gave up and decided to ask for help.
Is there a Python tool that can, at the very least, extract all streams (and identify the FlateDecode filter)?
And is there a way for me to identify what's hidden in there? I expected the magic bytes of some image format, which this clearly isn't. How do I parse this result further to find any image that could be hidden in there?
I'm looking for something I can apply to any PDF that displays properly: some tool to parse the streams further, or at least help me make sense of them, or even a reference that will help me understand what's going on.
Edit: it seems, as noted by Patrick, that I was barking up the wrong tree. I went to the streams because I couldn't find any XObjects when opening the PDF in Notepad++ or when running the various Python scripts used to parse PDFs. I managed to find what I suspect are the images: no XObject tags, but what seems like a stream tag, though the information is not compressed.
Unless you are looking to extract inline images, which aren't that common, the content stream is not the place to look for images. The more common case is streams of type XObject, subtype Image, which are usually found in a page's Resources->XObject dictionary (see sections 7.3.3, 7.8.3, and 8.9.5 of the PDF Reference indicated by @mkl).
Alternatively, Image XObjects can also be found inside Form XObjects (subtype Form, meaning they have their own content streams) via their own Resources->XObject dictionary, so the search for Image XObjects can be recursive.
An Image XObject can also have an SMask (soft mask), which is itself an Image XObject. Form XObjects are also used in tiling patterns, and so could conceivably contain Image XObjects (though those aren't that common either), or in an annotation's normal appearance (though Image XObjects are less commonly used within such annotations, except maybe 3D or multimedia annotations).
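A rough, untested sketch of that recursive search with pypdf (the dictionary keys are from the PDF spec; the function name and filename are mine):

    from pypdf import PdfReader

    def find_image_xobjects(resources, found=None):
        """Collect (name, object) pairs for Image XObjects, recursing into Forms."""
        found = [] if found is None else found
        xobjects = resources.get("/XObject")
        if xobjects is None:
            return found
        for name, ref in xobjects.get_object().items():
            xobj = ref.get_object()
            subtype = xobj.get("/Subtype")
            if subtype == "/Image":
                found.append((name, xobj))
            elif subtype == "/Form":
                inner = xobj.get("/Resources")  # Forms carry their own resources
                if inner is not None:
                    find_image_xobjects(inner.get_object(), found)
        return found

    reader = PdfReader("document.pdf")  # placeholder filename
    for page in reader.pages:
        resources = page.get("/Resources")
        if resources is not None:
            for name, img in find_image_xobjects(resources.get_object()):
                print(name, img.get("/Filter"), img.get("/Width"), img.get("/Height"))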
I have followed a few other SO threads on how to disable image loading in Firefox. However, the page looks very messed up after disabling images. Is there a way to show image placeholders so the page looks structurally similar to the page with images?
No, this can't be done easily. As this answer explains, if you're not actually requesting the image from the server and getting a response, the browser can't know how big the placeholder should be, so it assumes a size of {0,0}.
As usual there are lots of alternatives and workarounds, but at that point you have to decide whether the benefit of not having to download images is really worth the effort of rewriting the page to replace images with fixed-size <div>s, rewriting the image requests through a proxy server, adding aggressive caching, and so on.
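For instance, the div-replacement idea might look roughly like this with BeautifulSoup, assuming the page's <img> tags carry explicit width/height attributes (many don't, which is exactly the sizing problem described above):

    from bs4 import BeautifulSoup

    def replace_images_with_placeholders(html):
        soup = BeautifulSoup(html, "html.parser")
        for img in soup.find_all("img"):
            w = img.get("width", "100")   # arbitrary fallback size
            h = img.get("height", "100")
            div = soup.new_tag("div")
            div["style"] = (f"width:{w}px;height:{h}px;"
                            "background:#ddd;display:inline-block;")
            img.replace_with(div)  # keep the layout slot, drop the request
        return str(soup)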
Similar to Reddit's r/pic sub-reddit, I want to aggregate media from various sources. Some sites use the OEmbed spec to expose media on the page, but not all sites do. I was browsing through Reddit's source, because they essentially 'scrape' the links that users submit, retrieve images, videos, etc., and create thumbnails that are then displayed alongside the links on their site. I would like to do something similar. Looking at their code [1], it seems they have custom scrapers for each domain they recognize, plus a generic Scraper class that uses simple logic to get images from any domain (basically, it retrieves the web page, parses the HTML, and determines the largest image on the page, which it then uses to generate a thumbnail).
Since it's open source, I could probably reuse the code for my application, but unfortunately I have chosen Perl, as this is a hobby project and I'm trying to learn the language. Is there a Perl module with similar functionality? If not, is there a Perl module similar to the Python Imaging Library? It would be handy to determine image sizes without downloading the whole image, and to generate thumbnails.
Thanks!
[1] https://github.com/reddit/reddit/blob/master/r2/r2/lib/scraper.py
Image::Size is the specialised module for determining image sizes in various formats. It should be enough to read the first 1000 octets or so of a resource into a buffer (enough to cover the various image headers) and operate on that. I have not tested this.
I don't know of any general scraping module with an API for HTTP range requests to avoid downloading the whole image resource, but it should be easy to subclass WWW::Mechanize.
Try PerlMagick; installation instructions are also listed there.