How do I grab a thumbnail screenshot of many websites? - python

I have a list of 2500 websites and need to grab a thumbnail screenshot of them. How do I do that?
I could try to parse the sites with either Perl or Python; Mechanize would be a good fit. But I am not very experienced with Perl.

Here is a Perl solution:
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new();
$mech->get('http://google.com');
my $png = $mech->content_as_png();
From the docs:
Returns the given tab or the current page rendered as a PNG image.
All parameters are optional. $tab defaults to the current tab. If coordinates are given, that rectangle will be cut out. The coordinates should be a hash with the four usual entries: left, top, width, height.
This is specific to WWW::Mechanize::Firefox.
Currently, the data transfer between Firefox and Perl is done
Base64-encoded. It would be beneficial to find what's necessary to
make JSON handle binary data more gracefully.
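If you'd rather stay in Python, a headless browser can do the same job. Below is a minimal sketch using Selenium with headless Chrome; it assumes chromedriver is installed and on your PATH, and the output file names are illustrative:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")            # run without a visible window
options.add_argument("--window-size=1280,1024")
driver = webdriver.Chrome(options=options)

urls = ["http://google.com", "http://example.com"]  # your 2500 sites go here
for i, url in enumerate(urls):
    try:
        driver.get(url)
        time.sleep(2)                          # crude wait for the page to settle
        driver.save_screenshot("site_%d.png" % i)
    except Exception as exc:
        print("failed on %s: %s" % (url, exc))
driver.quit()
To turn the full-size screenshots into thumbnails afterwards, Pillow's Image.thumbnail() can downscale the saved PNGs.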

Related

How to generate all PDF with all content on a single page?

I am working on continuous printing of receipts on a thermal printer. To do this I need to generate a PDF to send to the printer. The printer uses a 58 mm roll of paper.
If the content is broken into multiple pages of fixed height, the last page will often have a lot of vertical blank space at the end, and the printer will then unnecessarily push out a lot of blank paper. I tried cropping and merging pages into a single page, but this is highly inefficient (it takes at least 4 seconds, which is not acceptable).
The only solution I can think of is to generate a PDF with all the content on a single page, with a page width of 58 mm and the page height set dynamically based on the generated content.
I tried PyPDF2, ReportLab and a few other libraries, but all the libraries I tried require setting the exact page size before any elements are put into place.
Any ideas how this can be done?
What you want to do is needlessly burdensome unless you take advantage of the thermal receipt printer's own features, so I recommend rethinking the approach and switching to character-code printing.
If you still want to continue the way you've been doing, these articles may be helpful.
text printed twice on the same page
Resize pdf pages in Python
For example, each time you add a PDF, you can create a blank page whose height is the total of the existing PDF and the PDF to be added, and then merge both PDFs onto that blank page, repeating this to dynamically expand the page height.
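A rough sketch of that idea, assuming pypdf (the maintained successor to PyPDF2); the 58 mm width in points and the file names are illustrative, not from the question:
from pypdf import PdfReader, PdfWriter, PageObject, Transformation

def stack_pages(paths, out_path, width_pt=164.4):  # 58 mm is about 164.4 pt
    pages = [PdfReader(p).pages[0] for p in paths]
    heights = [float(p.mediabox.height) for p in pages]
    tall = PageObject.create_blank_page(width=width_pt, height=sum(heights))
    y = sum(heights)
    for page, h in zip(pages, heights):
        y -= h  # place each part below the previous one
        tall.merge_transformed_page(page, Transformation().translate(tx=0, ty=y))
    writer = PdfWriter()
    writer.add_page(tall)
    with open(out_path, "wb") as f:
        writer.write(f)

stack_pages(["receipt_part1.pdf", "receipt_part2.pdf"], "receipt.pdf")
Each part is translated downwards by the running height, so the parts stack top to bottom on the single tall page.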
Below is my original answer.
I will leave it as information on how to utilize the features of the thermal receipt printer.
Ultimately, the printer's graphics data buffer is finite, so you can't send the whole job as a single image.
The size of the buffer depends on the printer, so read the specifications of the printer you are using carefully.
Image data must be split into pieces no larger than the printer's maximum buffer size.
Response to comment:
It is probably a vendor-supplied device driver or library that reconciles the printer's characteristics with the OS's requirements.
It may be possible if you use such a vendor-made device driver.
In other words, the vendor's driver internally splits the data as described above, making the application appear to support long pages.
However, if you use ESC/POS control sequences directly, or a generic library that does not handle this, that won't happen.
By the way, if the print content is not a PDF or an image, you don't need the decorations of a desktop document printer, and you restrict yourself to the printer's built-in fonts, you can print up to the full length of the paper.
In short, the approach works as long as the printed content does not need to be rendered as graphic data.
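For the character-code route, a minimal sketch with the python-escpos library might look like this (assumptions: python-escpos is installed, and the USB vendor/product IDs are placeholders for your printer):
from escpos.printer import Usb

# Placeholder IDs; look up your printer's with lsusb or the vendor docs.
p = Usb(0x04b8, 0x0202)
p.text("My Shop\n")
p.text("Item A          1.00\n")
p.text("Item B          2.50\n")
p.text("TOTAL           3.50\n")
p.cut()
Since each line is sent as character codes, nothing has to be buffered as graphics and the receipt can be arbitrarily long.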

Extract Geometry Elements from PDF by OCG (by Layer)

So I've spent the good majority of a month on this issue. I'm looking for a way to extract geometry elements (polylines, text, arcs, etc.) from a vectorized PDF, organised by the file's OCGs (Optional Content Groups), which are basically PDF layers. Using PDFMiner I was able to extract geometry (LTCurves, LTTextBoxes, LTLines, etc.); using PyPDF2, I was able to see how many OCGs were in the PDF, though I was not able to access the geometry associated with each OCG. There were a few hacky scripts I've seen and tried online that might have solved this problem, but to no avail. I even resorted to opening the raw PDF data in a text editor and haphazardly removing parts of it to see if I could come up with some custom parsing technique, but again to no avail. Adobe's PDF manual is minimal at best, so that was no help when I was attempting to create a parser. Does anyone know a solution to this?
At this point, I'm open to a solution in any language, using any OS (though I would prefer a solution using Python 3 on Windows or Linux), as long as it is open source / free.
Can anyone here help end this rabbit hole of darkness? Much appreciated!
A PDF document consists of two "types" of data. There is an object-oriented "structure" to the document that divides it into pages and carries metadata (like, for instance, the list of Optional Content Groups), and there is a stream-oriented list of marking operators that actually "draw" content onto the page.
The fact that there are OCGs, their names, and a bit about them is stored in the object-oriented content, and can be extracted by parsing the object content fairly easily. But the membership of the OCGs is NOT stored in the object structure. It can only be found by parsing the content stream. A group of marking operators is a member of a particular OCG when it is preceded by the content operator /OC /optionalcontentgroupname BDC and followed by the operator EMC.
Parsing a content stream is a less than trivial task. There are many tools out there that will do this for you. I would not, myself, attempt to build such a parser from scratch. There is little value in reinventing the wheel.
The complete syntax of PDF is available from many sources. Search the web for "PDF Specification 1.7" or "ISO 32000-1:2008". It is a daunting document to read, but it supplies all the information needed to create both an object parser and a content parser.
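As an illustration, here is a rough sketch that walks a page's content stream with pikepdf and collects the operators that fall inside /OC marked-content spans (assumptions: pikepdf is installed, only the top-level page stream is walked, and Form XObject streams are not followed):
import pikepdf

pdf = pikepdf.open("sample.pdf")
page = pdf.pages[0]

stack = []        # one entry per open marked-content span; True if it is an /OC span
inside_ocg = []   # instructions that occur inside at least one /OC span
for instr in pikepdf.parse_content_stream(page):
    op = str(getattr(instr, "operator", ""))  # inline images carry no plain operator
    if op in ("BMC", "BDC"):
        is_oc = op == "BDC" and str(instr.operands[0]) == "/OC"
        stack.append(is_oc)
    elif op == "EMC" and stack:
        stack.pop()
    elif any(stack):
        inside_ocg.append(instr)

print(len(inside_ocg), "operators inside optional content spans")
The second operand of the BDC operator names the OCG (directly or via the page's /Properties resource), so filtering for one particular layer is a matter of resolving that name.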
If your PDF is organized in OCG layers, then you can use the gdal_translate command from GDAL.
Use the following command to list all available OCG layers in your PDF file:
gdalinfo "sample.pdf" -mdd LAYERS
Then use the following command to extract the particular layer:
gdal_translate "sample.pdf" -of PNG sample.png --config GDAL_PDF_LAYERS "your_specific_layer_name"
More details are in the GDAL PDF driver documentation.
Hey @pythonic_programmer, I am able to use the Python library pdflayers to disable the default view (visible/not visible) of a layer in a new PDF file:
https://pypi.org/project/pdflayers/
Essentially, it disables the default state of the layer in the PDF file: https://helpx.adobe.com/acrobat/using/pdf-layers.html
Any layer that is not visible will not be rendered when the PDF document is processed (by default).

Finding and identifying streams in PDF using python

I've been trying for about a week to automate image extraction from a PDF. Unfortunately, the answers I found here were of no help. I've seen multiple variations on the same code using PyPDF2, all with ['/XObject'] in them, which results in a KeyError.
What I'm looking for seems to be hiding in streams, which I can't find in PyPDF2's dictionary (even after recursively exploring the whole structure, calling .getObject() on every indirect object I can find).
Using PyPDF2 I've written one page of the PDF to a file and opened it in Notepad++, to find some streams with the /FlateDecode filter.
pdfrw was slightly more helpful, allowing me to use PdfReader(path).pages[page].Contents.stream to get a stream (no clue how to get the others).
Using zlib, I decompressed it, and got something starting with:
/Part <</MCID 0 >>BDC
(It also contains a lot of floating-point numbers, both positive and negative)
From what I could find, BDC has something to do with Ghostscript.
At this point I gave up and decided to ask for help.
Is there a python tool to, at least, extract all streams (and identify FlateDecode tag?)
And is there a way for me to identify what's hidden in there? I expected the start tag of some image format, which this clearly isn't. How do I further parse this result to find any image that could be hidden in there?
I'm looking for something I can apply to any PDF that's displayed properly. Some tool to further parse, or at least help me make sense of the streams, or even a reference that will help me understand what's going on.
Edit: it seems, as noted by Patrick, that I was barking up the wrong tree. I went to streams since I couldn't find any XObjects when opening the PDF in Notepad++, or when running the various Python scripts used to parse PDFs. I managed to find what I suspect are the images, with no XObject tags, but with what seems like a stream tag, though the information is not compressed.
Unless you are looking to extract inline images, which aren't that common, the content stream is not the place to look for images. The more common case is Streams of type XObject, of subtype Image, which are usually found in a page's Resource->XObject dictionary (see sections 7.3.3, 7.8.3, and 8.9.5 of the PDF Reference mentioned by @mkl).
Alternatively, Image XObjects can also be found in Form XObjects (subtype Form, which indicates they have their own content streams) in their own Resource->XObject dictionary, so the search for Image XObjects can be recursive.
An Image XObject can also have a soft mask (SMask), which is itself its own Image XObject. Form XObjects are also used in tiling patterns, and so could conceivably contain Image XObjects (but those aren't that common either), or in an annotation's normal appearance (but Image XObjects are less commonly used within such annotations, except maybe 3D or multimedia annotations).
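A rough sketch of that search, assuming pypdf; it walks the page's Resource->XObject dictionary and recurses into Form XObjects as described above (the file name is illustrative):
from pypdf import PdfReader

def find_images(xobjects, found=None):
    found = [] if found is None else found
    if xobjects is None:
        return found
    for name, ref in xobjects.get_object().items():
        obj = ref.get_object()  # resolve indirect references
        subtype = obj.get("/Subtype")
        if subtype == "/Image":
            found.append((name, obj))
        elif subtype == "/Form":
            # Form XObjects carry their own resources, so recurse into them
            resources = obj.get("/Resources")
            if resources is not None:
                find_images(resources.get_object().get("/XObject"), found)
    return found

page = PdfReader("sample.pdf").pages[0]
resources = page.get("/Resources")
images = find_images(resources.get_object().get("/XObject") if resources else None)
for name, img in images:
    print(name, img.get("/Width"), img.get("/Height"), img.get("/Filter"))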

Getting image sizes like Facebook link scraper

I'm implementing my own link scraper to copy Facebook's technique as closely as possible (unless someone has a ready made lib for me...).
According to the many answers on SO, Facebook's process for determining the image to associate with a shared link involves searching for several recognized meta tags and then, if those are not found, stepping through the images on the page and returning a list of appropriately sized ones (at least 50px by 50px, have a maximum aspect ratio of 3:1, and in PNG, JPEG or GIF format according to this answer)
My question is, how does Facebook get the size information of the images? Is it loading all the images for each shared link and inspecting them? Is there a more efficient way to do this? (My backend is Python.)
(Side note: Would it make sense to use a client-side instead of server-side approach?)
Is there a more efficient way to do this?
Most common “web” graphic formats – JPEG, GIF, PNG – contain info about the width & height in the header (or at least in the first block, for PNG).
So if the remote web server is accepting range requests it’d be possible to only request the first X bytes of an image resource instead of the whole thing to get the desired information.
(This is what Facebook's scraper does for HTML pages, too. It's quite common to see in a debugger that the request was answered with HTTP status code 206 Partial Content, meaning Facebook asked for only the first X (k)bytes (enough for the meta elements in head), and the web server was able to serve just that.)
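A minimal sketch of that idea in Python, assuming requests and Pillow are installed and the server honors Range requests (the URL is hypothetical):
import io
import requests
from PIL import Image

def probe_image_size(url, first_bytes=8192):
    # ask for only the first chunk; a 206 response means the server complied
    resp = requests.get(url, headers={"Range": "bytes=0-%d" % (first_bytes - 1)},
                        timeout=10)
    try:
        img = Image.open(io.BytesIO(resp.content))
        return img.format, img.size  # Pillow reads these from the header alone
    except OSError:
        return None  # header not in the first chunk, or unsupported format

print(probe_image_size("https://example.com/logo.png"))
If the server ignores the Range header you simply receive the whole file (status 200), so the code still works, just less efficiently.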

How can/should I break an html document into parts using Python? (Techno- and logically)

I have an HTML document I'm trying to break into separate, smaller chunks. Say, take each <h3> header and turn it into its own separate file, using only the HTML within that chunk (along with the html, head and body tags).
I am using Python's Beautiful Soup, which I am new to, but it seems easy to use for simple tasks such as this (any better suggestions, like lxml or minidom?). So:
1) How do I 'parse all <h3>s and turn each one into a separate doc'? Anything from pointers in the right direction to code snippets to online documentation (I found quite little for Soup) will be appreciated.
2) Logically, finding the tag won't be enough: I need to physically 'cut it out' and put it in a separate file (and remove it from the original). Perhaps parsing the text lines instead of the nodes would be easier (albeit super-ugly: parsing raw text out of an already formed structure...)?
3) Similarly related: suppose I want to delete a certain attribute from all tags of a given type (like deleting the alignment attribute from all images). This seems easy, but I've failed; any help would be appreciated!
Thanks for any help!
Yes, you use BeautifulSoup or lxml. Both have methods to find the nodes you want to extract. You can then also recreate HTML from the node objects, and hence save that HTML to new files.
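A minimal sketch with BeautifulSoup 4 covering all three points; the file names are illustrative:
from bs4 import BeautifulSoup

with open("input.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# 1) + 2): collect each <h3> and its following siblings up to the next <h3>
chunks = []
for h3 in soup.find_all("h3"):
    parts = [h3]
    for sib in h3.find_next_siblings():
        if sib.name == "h3":
            break
        parts.append(sib)
    chunks.append(parts)

# appending a node to the new soup also removes it from the original tree
for i, parts in enumerate(chunks):
    doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
    for node in parts:
        doc.body.append(node)
    with open("chunk_%d.html" % i, "w", encoding="utf-8") as f:
        f.write(str(doc))

# 3): delete an attribute from every tag of a given type
for img in soup.find_all("img"):
    img.attrs.pop("align", None)  # no error if the attribute is absent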
