I have a parent file type that is folderish, and I would like to include a thumbnail of the first page of a child pdf in the template. Can someone roughly outline the tools and process you can imagine would achieve this, so that I can investigate further?
Extracting the first page of a PDF can be done with Ghostscript.
This is an example script which forms a Ghostscript command and stores the images. I took this from collective.pdfpeek, which, by the way, could solve your problem right away :-)
Until a few days ago I would have recommended not using it, since it was a little buggy, but they recently shipped a new version, so give it a try! I'm not sure whether they now support DX or not.
So the workflow for you should be:
Upload a PDF.
Subscribe to the modified/creation events.
Create an image of the first page using Ghostscript (check my command, or collective.pdfpeek).
Store it as a blob (NamedBlobImage) on your uploaded PDF.
Also implement some queueing, like collective.pdfpeek does, so that Ghostscript commands do not block all your threads.
OR
Give collective.pdfpeek a shot!
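As a rough sketch of the Ghostscript step in the workflow above: the helper below builds a `gs` command that renders only page 1 to a PNG and runs it with a timeout. The function names, the 72 dpi default and the 60 second timeout are my own assumptions for illustration; this is not the code from collective.pdfpeek.

```python
import subprocess


def first_page_command(pdf_path, png_path, dpi=72):
    """Build a Ghostscript command that renders only page 1 of a PDF to PNG."""
    return [
        "gs",
        "-q",                 # suppress the startup banner
        "-dSAFER",            # restrict file access from within the document
        "-dBATCH",            # exit when processing is done
        "-dNOPAUSE",          # do not prompt between pages
        "-sDEVICE=png16m",    # 24-bit colour PNG output
        "-dFirstPage=1",
        "-dLastPage=1",
        "-r%d" % dpi,         # output resolution (illustrative default)
        "-sOutputFile=%s" % png_path,
        pdf_path,
    ]


def render_first_page(pdf_path, png_path, timeout=60):
    """Run Ghostscript; raises CalledProcessError or TimeoutExpired on failure."""
    subprocess.run(first_page_command(pdf_path, png_path),
                   check=True, timeout=timeout)
    return png_path
```

The resulting PNG bytes would then be what you store in the NamedBlobImage field. Running `render_first_page` from a queue worker rather than from the event subscriber itself keeps the request threads free.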
BTW:
imho on a large scale the preview generation for pdfs needs to be implemented as a service, which stores/manages the images for you.
Related
I'd like to generate thumbnails from various "document" file formats such as odt, doc(x) and ppt(x) but also mp4, psd, tiff (and possibly others) from a Python application. As far as I know for each of these formats there is at least one open source application which can generate preview images/thumbnails (e.g. LibreOffice, ffmpeg) or at least extract embedded thumbnails (e.g. imagemagick).
My main problem is that each of these applications/libraries uses different command line options, so I'm looking for a Python library (or a unified CLI tool) which provides a high-level API to generate a thumbnail with specified dimensions and quality level given a filename, and which calls the appropriate external tool (ideally including catching exceptions, segfaults and timeouts). Bonus points if it can generate multiple thumbnails if requested (e.g. one per page, page X-Y, every Z seconds but at most N images).
Does anyone know such a library/utility? (Boundary condition: The files may contain sensitive material or might be quite big so this must work without any network communication, using an external web service is not possible.)
If there is no such thing in Python, a locally installable web service would be fine as well.
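To make the idea of such a wrapper concrete, here is a minimal sketch of the dispatch-by-extension approach, assuming `ffmpeg` and LibreOffice's `soffice` are on the PATH. The function names, the extension sets and the 120 second timeout are hypothetical; dimensions, quality and multi-page output are left out.

```python
import os
import subprocess

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi"}
OFFICE_EXTENSIONS = {".odt", ".doc", ".docx", ".ppt", ".pptx"}


def thumbnail_command(source, out_dir):
    """Return (argv, expected_output_path) for the tool that handles `source`."""
    stem, ext = os.path.splitext(os.path.basename(source))
    ext = ext.lower()
    if ext in VIDEO_EXTENSIONS:
        target = os.path.join(out_dir, stem + ".jpg")
        # Grab a single frame one second into the video.
        argv = ["ffmpeg", "-y", "-ss", "1", "-i", source,
                "-frames:v", "1", target]
        return argv, target
    if ext in OFFICE_EXTENSIONS:
        # LibreOffice writes <stem>.png into out_dir (typically the first page).
        target = os.path.join(out_dir, stem + ".png")
        argv = ["soffice", "--headless", "--convert-to", "png",
                "--outdir", out_dir, source]
        return argv, target
    raise ValueError("no thumbnailer registered for %r" % ext)


def make_thumbnail(source, out_dir, timeout=120):
    """Run the external tool; return the output path, or None if the tool fails."""
    argv, target = thumbnail_command(source, out_dir)
    try:
        subprocess.run(argv, check=True, timeout=timeout,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # Covers non-zero exits (including crashes), hangs and missing binaries.
        return None
    return target if os.path.exists(target) else None
```

A tool killed by a signal (for example a segfault) shows up as a non-zero return code, so `check=True` catches it along with ordinary errors. Everything runs locally, which matches the no-network constraint.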
I ended up writing my own library (named anythumbnailer, MIT license) which worked well enough for my immediate needs. The library is not what I envisioned (only basic thumbnailing, no support for dimensions, …) but it can generate thumbnails for doc(x), xls(x), ppt(x), videos and pdf on Linux with the help of ffmpeg and LibreOffice.
You can look at preview-generator. preview-generator is a library for generating previews (thumbnails, pdf, text and json overviews) for all your file-based content. The module gives you access to jpeg, pdf, text, html and json previews of virtually any kind of file. It also includes a cache mechanism, so you do not have to care about preview storage.
The more research I do, the more grim the outlook becomes.
I am trying to Flat Save, or Static Save a webpage with Python. This means merging all the styles to inline properties, and changing all links to absolute URLs.
I've tried nearly every free conversion website, api, and even libraries on github. None are that impressive. The best python implementation I could find for flattening styles is https://github.com/davecranwell/inline-styler. I adapted that slightly for Flask, but the generated file isn't that great. Here's how it looks:
Obviously, it should look better. Here's what it should look like:
https://dzwonsemrish7.cloudfront.net/items/3U302I3Y1H0J1h1Z0t1V/Screen%20Shot%202012-12-19%20at%205.51.44%20PM.png?v=2d0e3d26
It seems like a never-ending struggle dealing with malformed HTML, unrecognized CSS properties, Unicode errors, etc. So does anyone have a suggestion on a better way to do this? I understand I can go to File -> Save in my local browser, but when I am trying to do this en masse and extract a particular XPath, that's not really viable.
It looks like Evernote's web clipper uses iFrames, but that seems more complicated than I think it should be. But at least the clippings look decent on Evernote.
After walking away for a while, I managed to install a ruby library that flattens the CSS much much better than anything else I've used. It's the library behind the very slow web interface here http://premailer.dialect.ca/
Thank goodness they released the source on Github, it's the best hands down.
https://github.com/alexdunae/premailer
It flattens styles, creates absolute urls, works with a URL or string, and can even create plain text email templates. Very impressed with this library.
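For anyone who wants to stay in Python, the absolute-URL half of that job can be sketched with the standard library alone. This is only an illustration, not what premailer does internally: it rewrites `href` and `src` attributes and does no style inlining at all. The class and function names are made up for the example.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

URL_ATTRS = {"href", "src"}


class AbsoluteLinkRewriter(HTMLParser):
    """Re-emit HTML with href/src values resolved against a base URL."""

    def __init__(self, base_url):
        # Keep entities as written instead of decoding them into text.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.parts = []

    def _open_tag(self, tag, attrs, end=">"):
        rendered = [tag]
        for name, value in attrs:
            if value is None:          # bare attribute such as "disabled"
                rendered.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            rendered.append('%s="%s"' % (name, escape(value, quote=True)))
        self.parts.append("<" + " ".join(rendered) + end)

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open_tag(tag, attrs, end=" />")

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def absolutize(html_text, base_url):
    """Return html_text with relative href/src attributes made absolute."""
    parser = AbsoluteLinkRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Usage would be something like `absolutize(page_source, "http://example.com/dir/page.html")`. URLs inside CSS (`url(...)`) or `srcset` attributes are not touched by this sketch.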
Update Nov 2013
I ended up writing my own bookmarklet that works purely client side. It is compatible with WebKit and Firefox only. It recurses through each node and adds inline styles, then sends the flattened HTML to the clippy.in API to save to the user's dashboard.
Client Side Bookmarklet
It sounds like inline styles might be a deal-breaker for you, but if not, I suggest taking another look at Evernote Web Clipper. The desktop app has an Export HTML feature for web clips. The output is a bit messy as you'd expect with inline styles, but I've found the markup to be a reliable representation of the saved page.
Regarding inline vs. external styles, for something like this I don't see any way around inline if you're doing a lot of pages from different sites where class names would have conflicting style rules.
You mentioned that Web Clipper uses iFrames, but I haven't found this to be the case for the HTML output. You'd likely have to embed the static page as an iFrame if you're re-publishing on another site (legally I assume), but otherwise that shouldn't be an issue.
Some automation would certainly help so you could go straight from the browser to the HTML output, and perhaps for relocating the saved images to a single repo with updated src links in the HTML. If you end up working on something like this, I'd be grateful to try it out myself.
What I am trying to accomplish is to allow users to view information in the Django admin console and let them save and print out a PDF of the information in front of them, based upon however they sorted/filtered the data.
I have seen a lot of documentation on ReportLab, but mostly for just drawing lines and whatnot. How can I simply output the admin results to a PDF, if that is even possible? I am open to other suggestions if ReportLab is not the ideal way to get this done.
Thanks in advance.
You are better off using some kind of html2pdf tool, because you already have the HTML there.
If html2pdf doesn't do what you need, you can do everything you want to do with ReportLab. Have a look at the ReportLab manual, in particular the parts on Platypus. This is a part of the ReportLab library that allows you to build PDFs out of objects representing page parts (paragraphs, tables, frames, layouts, etc.).
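A minimal sketch of the Platypus route, assuming the `reportlab` package is installed. The function names and the styling are illustrative choices of mine; the import is done inside `write_pdf` so the data-preparation helper can be used and tested without ReportLab.

```python
def table_data(headers, rows):
    """Turn a header list and an iterable of row tuples into Platypus table data."""
    return [list(headers)] + [[str(cell) for cell in row] for row in rows]


def write_pdf(path, headers, rows):
    """Render the rows as a single grid table; requires the reportlab package."""
    from reportlab.lib import colors
    from reportlab.lib.pagesizes import A4
    from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

    # repeatRows=1 repeats the header row when the table spans several pages.
    table = Table(table_data(headers, rows), repeatRows=1)
    table.setStyle(TableStyle([
        ("GRID", (0, 0), (-1, -1), 0.5, colors.grey),
        ("BACKGROUND", (0, 0), (-1, 0), colors.lightgrey),
    ]))
    SimpleDocTemplate(path, pagesize=A4).build([table])
```

In a Django admin action, `rows` could plausibly come from something like `queryset.values_list("id", "name")`, where the field names are placeholders for your own model's fields.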
Similar to Reddit's r/pic sub-reddit, I want to aggregate media from various sources. Some sites use OEmbed specs to expose media on the page but not all sites do it. I was browsing through Reddit's source because essentially they 'scrape' links that users submit, retrieve images, videos etc. They create thumbnails which are then displayed along the link on their site. Now, I would like to do something similar and I looked at their code[1] and it seems that they have custom scrapers for each domain that they recognize and then they have a generic Scraper class that uses simple logic to get images from any domain (basically they retrieve the web-page, parse the html and then determine the largest image on the page which they then use to generate a thumbnail).
Since it's open source I can probably reuse the code for my application but unfortunately I have chosen Perl as this is a hobby project and I'm trying to learn Perl. Is there a Perl module which has similar functionality? If not, is there a Perl module that is similar to Python Imaging Library? It would be handy to determine the image sizes without actually downloading the whole image & thumbnail generation.
Thanks!
[1] https://github.com/reddit/reddit/blob/master/r2/r2/lib/scraper.py
Image::Size is the specialised module for determining image sizes from various formats. It should be enough to read the first 1000 octets or so from a resource, which covers the diverse image headers, into a buffer and operate on that. I have not tested this.
I do not know any general scraping module that has an API for HTTP range requests in order to avoid downloading the whole image resource, but it is easy to subclass WWW::Mechanize.
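To show why a small prefix is enough, here is a sketch of header-only size detection for PNG and GIF. It is written in Python purely for illustration; in Perl, Image::Size does the equivalent job. The function name is hypothetical and JPEG, which needs a scan through its segments, is deliberately left out.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_size_from_header(head):
    """Return (width, height) from the leading bytes of a PNG or GIF, else None."""
    if head[:8] == PNG_SIGNATURE and head[12:16] == b"IHDR" and len(head) >= 24:
        # IHDR stores width and height as big-endian 32-bit integers.
        return struct.unpack(">II", head[16:24])
    if head[:6] in (b"GIF87a", b"GIF89a") and len(head) >= 10:
        # The GIF logical screen size is two little-endian 16-bit integers.
        return struct.unpack("<HH", head[6:10])
    return None
```

The `head` buffer could be fetched with an HTTP request carrying a `Range: bytes=0-1023` header; servers are free to ignore that header and send the whole resource, so the client should still stop reading after the prefix.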
Try PerlMagick; installation instructions are also listed there.
I want to make a Python script available as a service on the net. The script, which is my first 'proper' Python program, takes a txt file as argument and writes an image into the work directory. So:
How difficult is it for somebody who is new to Python and web development?
How much work is it?
Do I need a framework (Django, cherryPy, web2py)?
Are there good tutorials?
How do I keep the server from being compromised?
What are my next steps?
==> What is the easiest way?
In the end it is enough, if it is a white page, with some text, and a button, which when clicked, opens a file dialog. After the txt is processed, the server should just return the image, which was written on the hard drive. Already I have access to a server which has Ubuntu installed through a friend.
[update]
Thanks for all your answers. After reading them I want to stress again, that I want to have it as minimal as possible. Srikar's suggestion sounds like the easiest one:
Put it in executable directory of your OS (commonly known as CGI
path). Provide a simple HTML form & upon form submission hit this
script which executes & returns back the image you want to display.
Any objections or comments? Do you know any tutorials for that?
[update2]
I found this SO answer: "File Sharing Site in Python". Is this a sensible approach?
It's not too difficult. Actually, it sounds like a good first project.
That's too subjective to answer. Anywhere from an hour to days.
No, you don't need one, but I'd use one if I were you. They abstract away some of the stuff you really don't care about, and you'll learn a tool you can use again in the future.
Plenty. If you want a real rundown of how Python works for the web, read the HOWTO from Python.org. If you just want to learn how to do this one project, pick a framework and do their tutorial.
This question is so broad and complex that I'm not going to try to answer it. Search this site, or Google, for questions like that.
Your next step should be to pick a framework; I've used Django successfully. Just download it, follow the installation instructions, and work your way through their tutorial; it should tell you everything you need to know to do what you want. If you still have questions once you've learned how to do the basics, come back and ask again!
Edit: The answer to that other question will certainly work for you. There, they just receive a GET request and respond with data from a Python file. You need to receive a GET request, respond with an HTML page (easy enough), then respond to a POST request that includes an uploaded file (slightly more complicated) and run your python routine on the uploaded file and then respond with the created image (or a link to it).
Take a look at this page which includes a simple Python script to do file uploads. You should easily be able to modify it to do what you want.
How difficult is it for somebody who is new to Python and web development?
Depends on your level of knowledge.
How much work is it?
Depends on which method you choose to solve the problem.
Do I need a framework (Django, cherryPy, web2py)?
Not necessarily - you could get started by using the CGI (http://docs.python.org/library/cgi.html)
Are there good tutorials?
Yes, there are plenty. The Python docs are an excellent place to start.
How do I keep the server from being compromised?
Again, depends on the method you choose to solve the problem, although there are commonalities.
What are my next steps?
Dare I say it again, choose a method, read the docs, have a play!
If it's just as simple as you have described it, then you might not even need Django. You could simply use CGI scripting. All of these design decisions depend on the following:
You need (or foresee) a SQL storage?
or a Content-Management-System?
Will you need multiple-user support?
Do you need tight security?
Do you need different privileges for different users?
Do you need an Admin to manage your site?
If the answer to at least 60% of the above questions is yes, then you might consider Django. Otherwise, just write a Python script. Put it in the executable directory of your OS (commonly known as the CGI path). Provide a simple HTML form, and upon form submission hit this script, which executes and returns the image you want to display. So, it all depends on the features you need...
In the end, I created what I needed with Flask.
They have a well-documented pattern / tutorial on Uploading Files. The tutorial is understandable even for people with little Python and web experience.
Getting a first working version took me 2 hours, and the resulting code was only 50 lines. This includes starting the web server, having an HTML file/form with file upload, and serving a file back to the user.
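For readers who want a starting point, here is a sketch along the lines of that upload pattern, assuming Flask is installed. It is not my original 50 lines: `UPLOAD_DIR`, `render_image` and the 1 MiB limit are placeholders, and `render_image` stands in for the existing txt-to-image script.

```python
import os

ALLOWED_EXTENSIONS = {"txt"}
UPLOAD_DIR = "/tmp/uploads"   # hypothetical work directory


def allowed_file(filename):
    """Accept only names with a whitelisted extension."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


def render_image(txt_path):
    """Placeholder for the existing script: must return the path of the image it wrote."""
    raise NotImplementedError


def create_app():
    # Imported here so the helpers above work without Flask installed.
    from flask import Flask, abort, request, send_file
    from werkzeug.utils import secure_filename

    app = Flask(__name__)
    app.config["MAX_CONTENT_LENGTH"] = 1024 * 1024  # reject uploads over 1 MiB

    @app.route("/", methods=["GET", "POST"])
    def upload():
        if request.method == "POST":
            uploaded = request.files.get("file")
            if uploaded is None or not allowed_file(uploaded.filename or ""):
                abort(400)
            os.makedirs(UPLOAD_DIR, exist_ok=True)
            # secure_filename strips path components from the client-supplied name.
            txt_path = os.path.join(UPLOAD_DIR, secure_filename(uploaded.filename))
            uploaded.save(txt_path)
            return send_file(render_image(txt_path))
        return ('<form method="post" enctype="multipart/form-data">'
                '<input type="file" name="file">'
                '<input type="submit" value="Upload"></form>')

    return app
```

`create_app().run()` starts the development server. The extension whitelist, the size cap and `secure_filename` are the basic precautions from the upload tutorial; they are a starting point for the "don't get compromised" question, not a complete answer.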