How to generate a PDF with all content on a single page? - python

I am working on continuous printing of receipts on a thermal printer. To do this I need to generate a PDF to send to the printer. The printer uses a 58mm roll of paper.
If the content is broken into multiple pages of fixed height, the last page will often have a lot of vertical blank space at the end, and the printer will then unnecessarily push out a lot of blank paper. I tried cropping and merging pages into a single page, but this is highly inefficient (it takes at least 4 seconds, which is not acceptable).
The only solution I can think of is to generate a PDF with all content on a single page, with a page width of 58mm and the page height set dynamically based on the generated content.
I tried PyPDF2, reportlab and a few other libraries, but all of them require setting the exact page dimensions before placing any elements.
Any ideas how this can be done?

What you are trying to do is needlessly burdensome unless you take advantage of the thermal receipt printer's own features, so I recommend rethinking your approach and switching to character-code printing.
If you still want to continue the way you've been going, these articles may be helpful.
text printed twice on the same page
Resize pdf pages in Python
For example, each time you add a PDF, you can create a blank page whose height is the sum of the existing PDF's height and the height of the PDF to be added, then merge both PDFs onto that blank page; repeating this grows the page height dynamically.
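Here is a minimal sketch of that merging idea using pypdf, the maintained successor to PyPDF2 (the file names are illustrative; it stacks every page of an existing multi-page receipt onto one tall page):

from pypdf import PdfReader, PdfWriter, PageObject, Transformation

# Assumes "receipt.pdf" already holds the content split across
# fixed-height pages that all share the 58mm width.
reader = PdfReader("receipt.pdf")
width = reader.pages[0].mediabox.width
total_height = sum(page.mediabox.height for page in reader.pages)

# One blank page tall enough to hold every source page stacked vertically.
combined = PageObject.create_blank_page(width=width, height=total_height)

# PDF coordinates start at the bottom-left, so place pages top-down.
y = total_height
for page in reader.pages:
    y -= page.mediabox.height
    combined.merge_transformed_page(page, Transformation().translate(0, y))

writer = PdfWriter()
writer.add_page(combined)
with open("single_page.pdf", "wb") as f:
    writer.write(f)

Note this is the same merge-based approach you found slow, so profile it on your data before committing to it.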
Below is my initial answer. I will leave it as information on how to use the features of the thermal receipt printer.
Ultimately, the printer's graphics buffer is finite, so you can't do exactly what you want.
The size of the buffer depends on the printer, so read the specification of the printer you are using carefully.
Image data must be split into pieces no larger than the printer's maximum buffer size.
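As a minimal sketch of that splitting, using Pillow and a made-up MAX_BUFFER_ROWS limit (check your printer's specification for the real value):

from PIL import Image

MAX_BUFFER_ROWS = 2048  # printer-specific; read it from the printer's spec sheet

receipt = Image.open("receipt.png")  # illustrative file name
width, height = receipt.size

# Crop the tall receipt image into chunks the buffer can hold.
chunks = [
    receipt.crop((0, top, width, min(top + MAX_BUFFER_ROWS, height)))
    for top in range(0, height, MAX_BUFFER_ROWS)
]
for index, chunk in enumerate(chunks):
    chunk.save(f"chunk_{index}.png")  # send these to the printer one by one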
Response to comment:
It is probably a vendor-made device driver or library that reconciles the characteristics of the printer with the requirements of the OS. It may be possible if you use a device driver made by such a vendor. In other words, the vendor's driver internally splits the data as described above, making it appear to the application that long pages are supported.
However, if you use ESC/POS control sequences directly, or a generic library that doesn't take care of this, that won't happen.
By the way, if the print content is not a PDF or an image, you don't need the decorations of a desktop document printer, and you restrict yourself to the printer's built-in fonts, you can print up to the full length of the paper.
In short, it works as long as the printed content never has to be expanded into graphic data.
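For that character-mode route, here is a hedged sketch using the python-escpos library (my own suggestion, not something named in the answer above; the USB vendor and product IDs are printer-specific placeholders):

from escpos.printer import Usb

# Replace the IDs with your printer's actual USB vendor/product IDs.
printer = Usb(0x04b8, 0x0202)

printer.text("Store name\n")        # printed with the built-in printer font
printer.text("Item A      1.00\n")
printer.text("Item B      2.50\n")
printer.cut()                       # feed and cut at the end of the receipt

Because only character codes are sent, nothing is rasterized and the receipt length is limited only by the paper roll.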

Related

Finding and identifying streams in PDF using python

I've been trying for about a week to automate image extraction from a PDF. Unfortunately, the answers I found here were of no help. I've seen multiple variations on the same code using PyPDF2, all with ['/XObject'] in them, which results in a KeyError.
What I'm looking for seems to be hiding in streams, which I can't find in PyPDF2's dictionary (even after recursively exploring the whole structure, calling .getObject() on every indirect object I can find).
Using PyPDF2 I've written one page of the PDF to a file and opened it in Notepad++, finding some streams with the /FlateDecode filter.
pdfrw was slightly more helpful, allowing me to use PdfReader(path).pages[page].Contents.stream to get a single stream (no clue how to get the others).
Using zlib, I decompressed it, and got something starting with:
/Part <</MCID 0 >>BDC
(It also contains a lot of floating-point numbers, both positive and negative)
From what I could find, BDC has something to do with Ghostscript.
At this point I gave up and decided to ask for help.
Is there a Python tool to, at least, extract all streams (and identify the FlateDecode tag)?
And is there a way for me to identify what's hidden in there? I expected the start tag of some image format, which this clearly isn't. How do I further parse this result to find any image that could be hidden in there?
I'm looking for something I can apply to any PDF that's displayed properly. Some tool to further parse, or at least help me make sense of the streams, or even a reference that will help me understand what's going on.
Edit: it seems, as noted by Patrick, that I was barking up the wrong tree. I went to streams since I couldn't find any xObjects when opening the PDF in Notepad++, or when running the various python scripts used to parse PDFs. I managed to find what I suspect are the images, with no xObject tags, but with what seems like a stream tag - though the information is not compressed.
Unless you are looking to extract inline images, which aren't that common, the content stream is not the place to look for images. The more common case is streams of type XObject and subtype Image, which are usually found in a page's Resources->XObject dictionary (see sections 7.3.8, 7.8.3, and 8.9.5 of the PDF Reference indicated by @mkl).
Alternately, Image XObjects can also be found in Form XObjects (subtype Form, which indicates they have their own content streams) in their own Resources->XObject dictionary, so the search for Image XObjects can be recursive.
An Image XObject can also have a soft mask (/SMask), which is itself an Image XObject. Form XObjects are also used in Tiling Patterns, and so could conceivably contain Image XObjects (but those aren't that common either), or in an Annotation's Normal Appearance (but Image XObjects are less commonly used within such Annotations, except maybe 3D or multimedia annotations).
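To make the recursion concrete, here is a hedged sketch with pypdf (the function name find_images is my own; the traversal follows the Resources->XObject structure described above, ignoring soft masks and patterns for brevity):

from pypdf import PdfReader

def find_images(resources, found=None):
    # Walk a /Resources dictionary, descending into Form XObjects.
    if found is None:
        found = []
    xobjects = resources.get("/XObject")
    if xobjects is None:
        return found
    for name, ref in xobjects.get_object().items():
        xobj = ref.get_object()
        subtype = xobj.get("/Subtype")
        if subtype == "/Image":
            found.append((name, xobj))
        elif subtype == "/Form" and "/Resources" in xobj:
            find_images(xobj["/Resources"].get_object(), found)
    return found

reader = PdfReader("input.pdf")  # illustrative file name
for page in reader.pages:
    if "/Resources" in page:
        for name, image in find_images(page["/Resources"].get_object()):
            print(name, image.get("/Filter"), image.get("/Width"), image.get("/Height"))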

Display first page of child file pdf in plone

I have a parent file type that is folderish, and I would like to include a thumbnail of the first page of a child pdf in the template. Can someone roughly outline the tools and process you can imagine would achieve this, so that I can investigate further?
Extracting the first page of a PDF can be done with Ghostscript.
collective.pdfpeek contains an example script that builds a Ghostscript command and stores the images; a sketch of the same idea follows below. collective.pdfpeek, by the way, could solve your problem right away :-)
Until a few days ago I would have recommended against using it, since it was a little buggy, but they recently shipped a new version, so give it a try! I'm not sure whether they now support DX or not.
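The collective.pdfpeek script itself isn't reproduced here; this is a hedged sketch of the same idea, shelling out to Ghostscript from Python (the gs flags are standard, but the file names and 150 dpi resolution are illustrative):

import subprocess

def first_page_thumbnail(pdf_path, png_path, dpi=150):
    # Render only page 1 of the PDF to a 24-bit PNG via Ghostscript.
    subprocess.run(
        [
            "gs", "-q", "-dSAFER", "-dBATCH", "-dNOPAUSE",
            "-sDEVICE=png16m",
            "-dFirstPage=1", "-dLastPage=1",
            f"-r{dpi}",
            f"-sOutputFile={png_path}",
            pdf_path,
        ],
        check=True,
    )

first_page_thumbnail("document.pdf", "thumbnail.png")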
So the workflow for you should be:
Upload a PDF.
Subscribe to modified/created events.
Create an image of the first page using Ghostscript (see the sketch above, or collective.pdfpeek).
Store it as a blob (NamedBlobImage) on the uploaded PDF.
Also implement some queueing, as collective.pdfpeek does, so Ghostscript commands don't block all your threads.
OR
Give collective.pdfpeek a shot!
BTW:
IMHO, at a larger scale, preview generation for PDFs needs to be implemented as a service which stores/manages the images for you.

Getting image sizes like Facebook link scraper

I'm implementing my own link scraper to copy Facebook's technique as closely as possible (unless someone has a ready made lib for me...).
According to the many answers on SO, Facebook's process for determining the image to associate with a shared link involves searching for several recognized meta tags and then, if those are not found, stepping through the images on the page and returning a list of appropriately sized ones (at least 50px by 50px, with a maximum aspect ratio of 3:1, and in PNG, JPEG or GIF format, according to this answer).
My question is, how does Facebook get the size information of the images? Is it loading all images for each shared link and inspecting them? Is there a more efficient way to do this? (My backend is Python.)
(Side note: Would it make sense to use a client-side instead of server-side approach?)
Is there a more efficient way to do this?
Most common “web” graphic formats – JPEG, GIF, PNG – contain the width and height in the header (or at least in the first block, in PNG's case).
So if the remote web server accepts range requests, it is possible to request only the first X bytes of an image resource instead of the whole thing to get the desired information.
(This is what Facebook's scraper does for HTML pages, too – it's quite common to see in the debugger that a request was answered with HTTP status code 206 Partial Content, meaning Facebook said they're only interested in the first X (K)bytes (for the meta elements in head) and the web server was able to deliver only that.)
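A hedged sketch of that range-request trick in Python (PNG and GIF keep their dimensions at fixed header offsets; JPEG would need a scan of its SOF markers, omitted here; the URL is illustrative):

import struct
import requests

def image_size(url, probe_bytes=1024):
    # Ask for just the first KB; servers that honor ranges answer 206.
    response = requests.get(url, headers={"Range": f"bytes=0-{probe_bytes - 1}"})
    data = response.content
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        # IHDR chunk: big-endian uint32 width and height at byte offset 16.
        return struct.unpack(">II", data[16:24])
    if data[:6] in (b"GIF87a", b"GIF89a"):
        # Logical screen descriptor: little-endian uint16 pair at offset 6.
        return struct.unpack("<HH", data[6:10])
    return None  # unknown format, or a JPEG we didn't parse

print(image_size("https://example.com/image.png"))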

Python/Javascript: WYSIWYG html editor - Handle large documents fast and/or design theory

Background:
I am writing an ebook editing program in python. Currently it utilizes a source-code view for editing, and I would like to port it over to a wysiwyg view for editing. The best (only?) html renderer I could find for python was webkit (I am using the PyQt version).
Question:
How do I accomplish wysiwyg editing? The requirements/issues are as follows:
An ebook may be up to 10,000 paragraphs / 1,000,000 characters.
PyQt Webkit (ContentEditable): No problem.
PyQt Webkit (TinyMce, etc): Takes forever to open them!
The format is <body><p>...</p><p>...</p>...</body>. The body element contains only paragraphs; there are no divs, etc. (but within a paragraph there may be spans, links, etc.). Editing must take place with no significant delays as far as the user is concerned.
PyQt Webkit (ContentEditable): If you try deleting text across multiple paragraphs, it takes forever!! My understanding is that this is because it resets the common parent of the elements being changed - i.e. the entire body element, since two different paragraphs are being deleted/merged. But there should be no need for this - it should only need to delete/merge/change those individual paragraphs!
I am open to implementing my own wysiwyg editing, but for the life of me I can't figure out how to delete/cut/paste/merge/change the html code correctly. I searched online for articles about html wysiwyg design theory, and came up dry.
Thanks!
Can I suggest a completely different approach? Since your ebook is only <p></p>:
Split the text on <p></p> to get an indexed array of all your paragraphs.
Build your own pagination system, and fill the screen with N paragraphs that automatically pull enough text to show from the indexed array.
When you make a selection, use [paragraph index + character index within the paragraph] for the selection start/end.
Then implement cut/copy/paste/delete/undo/redo based on those assumptions; see the sketch below.
(Note: when you make a selection, since the start point is saved, you can safely change the text on the screen / pagination until the selection ends.)
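A minimal sketch of that paragraph-indexed model; the class and method names are my own illustration, not an existing API:

import re

class ParagraphBuffer:
    def __init__(self, html_body):
        # Index every paragraph's inner HTML once, up front.
        self.paragraphs = re.findall(r"<p>(.*?)</p>", html_body, re.DOTALL)

    def page(self, start, count):
        # Return enough paragraphs to fill one screen.
        return self.paragraphs[start:start + count]

    def delete_range(self, start_para, start_char, end_para, end_char):
        # Touch only the affected paragraphs and merge the boundary pair,
        # instead of rebuilding the entire body element.
        head = self.paragraphs[start_para][:start_char]
        tail = self.paragraphs[end_para][end_char:]
        self.paragraphs[start_para:end_para + 1] = [head + tail]

buf = ParagraphBuffer("<body><p>one</p><p>two</p><p>three</p></body>")
buf.delete_range(0, 2, 2, 2)  # delete from "one"[2:] through "three"[:2]
print(buf.paragraphs)         # ['onree']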

How do I grab a thumbnail screenshot of many websites?

I have a list of 2500 websites and need to grab a thumbnail screenshot of them. How do I do that?
I could try to parse the sites with either Perl or Python; Mechanize would be a good fit. But I am not so experienced with Perl.
Here is a Perl solution:
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();  # drives a running Firefox instance
$mech->get('http://google.com');            # load the page in the browser
my $png = $mech->content_as_png();          # render the current tab as PNG data
From the docs:
Returns the given tab or the current page rendered as PNG image.
All parameters are optional. $tab defaults to the current tab. If the coordinates are given, that rectangle will be cut out. The coordinates should be a hash with the four usual entries, left, top, width, height.
This is specific to WWW::Mechanize::Firefox.
Currently, the data transfer between Firefox and Perl is done Base64-encoded. It would be beneficial to find what's necessary to make JSON handle binary data more gracefully.
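If you'd rather stay in Python, here is a hedged sketch using Selenium with headless Firefox instead of WWW::Mechanize::Firefox (a different tool, swapped in deliberately; the URL list is illustrative):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # no visible browser window
driver = webdriver.Firefox(options=options)

sites = ["https://google.com", "https://example.com"]  # your 2500 URLs
for index, url in enumerate(sites):
    driver.get(url)
    driver.save_screenshot(f"thumb_{index}.png")  # full-window PNG

driver.quit()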
