Separate a PowerPoint presentation into a set of slides using Python

I have .ppt files containing presentations and I need to split a single .ppt file into the slides that it is made of and to store each slide as an image. Any way of achieving this?
Any help would be appreciated. Thanks in advance!

With Aspose.Slides for Python, you can easily save each presentation slide to an image. The following code example shows you how to do this:
import aspose.slides as slides
import aspose.pydrawing as draw

with slides.Presentation("example.ppt") as presentation:
    for slide_index, slide in enumerate(presentation.slides):
        # Convert the current slide to an image at 100% scale.
        slide_image = slide.get_thumbnail(1, 1)
        # Save the slide image to a PNG file.
        image_path = "slide_{}.png".format(slide_index + 1)
        slide_image.save(image_path, draw.imaging.ImageFormat.png)
This is a paid library, but you can get a temporary license or use trial mode to evaluate all of its features for managing presentations. You can also see the conversion results without writing any code by using the Online PowerPoint Converter, which is based on this library.
Alternatively, you can use Aspose.Slides Cloud SDK for Python. The code example below shows you how to do the same using Aspose.Slides Cloud:
import asposeslidescloud
from asposeslidescloud.apis.slides_api import SlidesApi
from asposeslidescloud.models import *

slides_api = SlidesApi(None, "my_client_id", "my_client_secret")

with open("example.ppt", "rb") as file_stream:
    # Convert all presentation slides to PNG images at 100% scale.
    result_path = slides_api.convert(file_stream, SlideExportFormat.PNG)
    print("A ZIP file with slide images was saved to " + result_path)
This is also a paid, REST-based product, but you can make 150 free API calls per month for managing presentations. I work as a Support Developer at Aspose, and we will be glad to answer your questions about these products on the Aspose.Slides forum.

Related

How to find and remove watermarks in pdf using python?

I am currently using python to remove watermarks in PDF files. For example, I have a file like this:
The green shape on the center of the page is the watermark. I think it's not stored in the PDF in text form, because I can't find that text by simply searching using Edge browser (which can read PDF files).
Also, I cannot find the watermark by image. I extracted all images from the PDF using PyMuPDF, and the watermark (which was supposed to appear on each page) is not to be found.
The code I used for extracting is like this:
document = fitz.open(self.input)
for each_page in document:
    image_list = each_page.getImageList()
    for image_info in image_list:
        pix = fitz.Pixmap(document, image_info[0])
        png = pix.tobytes()  # the picture as PNG bytes
        if png == watermark_image:
            document._deleteObject(image_info[0])
document.save(out_filename)
So how do I find and remove the watermark using python's libraries? How is the watermark stored inside a PDF?
Are there any other "magic" libraries that can do this task, other than PyMuPDF?
For anyone interested in the details, see the solution provided here.
Removal of the type of watermark used in this file works with PyMuPDF's low-level code interface. There is no direct, specialized high-level API for doing this.
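Since there is no high-level API, here is a rough sketch of the low-level idea: many watermarks are written into the page content stream as marked-content blocks tagged /Artifact. The helper below strips such a block from a decompressed content stream with a regex (the /Artifact tag and the toy stream are assumptions about how a given file marks its watermark; in PyMuPDF you would fetch each stream with doc.xref_stream(xref) and write the filtered bytes back with doc.update_stream(xref, data) before saving):

```python
import re

def strip_artifact_blocks(content: bytes) -> bytes:
    """Remove marked-content blocks tagged /Artifact (a BDC ... EMC pair)
    from a decompressed PDF content stream."""
    # Non-greedy match from the /Artifact BDC operator to its closing EMC.
    pattern = re.compile(rb"/Artifact\b.*?\bBDC\b.*?\bEMC\b", re.DOTALL)
    return pattern.sub(b"", content)

# A toy content stream: normal text, then a watermark marked as an artifact.
stream = (
    b"BT /F1 12 Tf (Hello) Tj ET\n"
    b"/Artifact << /Type /Watermark >> BDC\n"
    b"0 1 0 rg 100 100 200 200 re f\n"
    b"EMC\n"
    b"BT /F1 12 Tf (World) Tj ET\n"
)
cleaned = strip_artifact_blocks(stream)
```

Note this only works if the watermark really is marked as an artifact and the stream is not compressed at the point you edit it; nested marked-content blocks would also defeat the simple non-greedy regex.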

Is there any way in OCR/tesseract/OpenCV for extracting text from a particular region of an image?

I’m setting up a new invoice extraction method using AI. I am able to recognize "Total"/"Company Details" regions in invoice images, but I need help extracting the data from a particular recognized region by specifying an area in the image (Xmin, Xmax, Ymin, Ymax).
AWS recently launched a service called Textract that does exactly what you are trying to achieve.
Blog post + example: https://aws.amazon.com/blogs/machine-learning/automatically-extract-text-and-structured-data-from-documents-with-amazon-textract/
You can provide images, PDFs and Excel files, and it extracts and transforms any text into objects. I haven't used the service yet, but I plan to over the weekend.
Python example below:
import boto3

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "simple-document-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

# print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
It looks like you are new to this, so let me give you a quick walkthrough of the terms used in your question.
OCR (optical character recognition) is the general concept.
Tesseract is a library specialized in OCR.
OpenCV is an image-processing library that helps with object detection and recognition.
Yes, you can extract text from an image with the tesseract library, ideally at 300 dpi or more,
but if the font is very new or unknown to the system, you should first train the tesseract model on that font.
Also keep in mind that if you can box the text (crop it to a bounding box) before calling tesseract, it will work more accurately.
Terms like box image and dpi may be unfamiliar, but they are pivotal concepts for this work.
My suggestion: if you want to extract text or digits from an image, go step by step.
Process the image by enhancing its quality.
Detect the region you want to extract.
Find the contour and its area.
Pass it to the box-image step and tune the parameters.
Finally, give it to Tesseract.
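The region part of the question can also be handled after OCR rather than before: run pytesseract.image_to_data (which returns word bounding boxes) on the whole image, then keep only the words inside the (Xmin, Ymin, Xmax, Ymax) region. A minimal sketch of that filtering step, with toy data standing in for real pytesseract output (the dict layout matches image_to_data, but the values here are made up):

```python
def words_in_region(ocr, xmin, ymin, xmax, ymax):
    """Keep OCR words whose bounding boxes lie inside the given region.

    `ocr` follows the dict layout of pytesseract.image_to_data:
    parallel lists under 'text', 'left', 'top', 'width', 'height'.
    """
    kept = []
    for i, word in enumerate(ocr["text"]):
        if not word.strip():
            continue  # skip empty detections
        left, top = ocr["left"][i], ocr["top"][i]
        right, bottom = left + ocr["width"][i], top + ocr["height"][i]
        if left >= xmin and top >= ymin and right <= xmax and bottom <= ymax:
            kept.append(word)
    return " ".join(kept)

# Toy data shaped like pytesseract's output: two words near the top
# of the page and one word much further down.
ocr = {
    "text": ["Total", "123.45", "Footer"],
    "left": [10, 60, 10],
    "top": [10, 10, 200],
    "width": [40, 50, 60],
    "height": [12, 12, 12],
}
```

With real input you would build `ocr` via `pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)` and feed in the region you detected.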

How to format and print Dash dashboard to PDF?

I've just created a dashboard in Dash, which fits automatically to my computer monitor and looks half decent. I now have the problem that I have to produce a way to print to PDF.
I've seen the Vanguard example which uses external CSS to format into an A4-size, and then a print button (using external javascript?). Is there a more 'Dash-native' way of doing this that doesn't require me to learn another programming language (I only know Python)?
There doesn't seem to be anything in the Dash User Guide https://dash.plot.ly/; all I've found is this https://plot.ly/python/pdf-reports/, which describes how to print individual Plotly figures to PDF.
Ideally, it would format as it is now online (online layout) without losing space down the sides, and then I could use another layout (pdf layout) which could be printed out.
Any links or suggestions on how to proceed with this would be very much appreciated!
You can get a PDF from Plotly if you first save the plot as a PNG image (the "Download as png" option in the figure's toolbar),
and then use this code to convert the PNG to a PDF:
from PIL import Image
# Save your plot as png file (option "Download as png" in upper-left corner)
PNG_FILE = 'horizontal-bar.png'
PDF_FILE = 'horizontal-bar.pdf'
rgba = Image.open(PNG_FILE)
rgb = Image.new('RGB', rgba.size, (255, 255, 255)) # white background
rgb.paste(rgba, mask=rgba.split()[3]) # paste using alpha channel as mask
rgb.save(PDF_FILE, 'PDF', resolution=100.0)
If you want to avoid the manual step, you can use the webkit2png library (check here for how). This lets you skip clicking the "Download as png" button and instead convert from HTML to PNG, and then to PDF.

Count Images in a pdf document through python

Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document through Python?
Using pdfimages from poppler-utils
You might want to take a look at pdfimages from the poppler-utils package.
I have taken the sample pdf from - Sample PDF
On running the following command, images present in the pdf are extracted -
pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image
Some of the images extracted from this brochure are -
Extracted Image1
Extracted Image 2
So, you can use Python's subprocess module to execute this command and then extract all the images.
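As a sketch of that, pdfimages -list prints one table row per embedded image, so you can count rows instead of extracting files. The sample listing below is illustrative (hand-written in the shape pdfimages produces), and the command itself only works where poppler-utils is installed:

```python
import subprocess

def count_images(pdf_path):
    """Count embedded images by parsing `pdfimages -list` output.
    Requires poppler-utils to be installed."""
    output = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_image_count(output)

def parse_image_count(listing):
    """Count data rows in the `pdfimages -list` table, skipping the
    two header lines (column names and the dashed separator)."""
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    return max(len(lines) - 2, 0)

# Illustrative output in the shape `pdfimages -list` produces.
sample = """\
page   num  type   width height color comp bpc  enc interp  object ID
--------------------------------------------------------------------
   1     0 image    800   600  rgb     3   8  jpeg   no        10 0
   2     1 image    640   480  rgb     3   8  image  no        14 0
"""
```

Note that `pdfimages -list` counts every image XObject, including masks and images you might not consider "pictures", so it inherits the same caveats as extracting.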
Note: There are some drawbacks to this method. It generates images in ppm format, not jpg. Also, some additional images might be extracted, which might actually not be images in the pdf.
Using pdfminer
If you want to do this using pdfminer, take a look at this blog post -
Extracting Text & Images from PDF Files
Pdfminer allows you to traverse through the layout of a particular pdf page. The following image shows the layout objects as well as the tree structure generated by pdfminer -
Layout Objects and Tree Structure
Image Source - Pdfminer Docs
Thus, extracting LTFigure objects can help you extract / count images in the pdf document.
Note: Please note that both of these methods might not be accurate, and their accuracy is highly dependent on the type of pdf document you are dealing with.
I don't think this can be done directly. However, I have done something similar using the following approach:
Using ghostscript to convert pdf to page images.
On each page use computer vision (OpenCV) to extract the area of interest(in your case images).
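A sketch of step 1 via subprocess (the gs flags are standard ghostscript options; the file names are placeholders):

```python
import subprocess

def ghostscript_cmd(pdf_path, out_pattern="page-%03d.png", dpi=150):
    """Build a ghostscript command that renders each PDF page to a PNG."""
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE",           # run non-interactively and exit
        "-sDEVICE=png16m",                # 24-bit colour PNG output device
        "-r{}".format(dpi),               # render resolution in dpi
        "-sOutputFile={}".format(out_pattern),
        pdf_path,
    ]

cmd = ghostscript_cmd("input.pdf")
# subprocess.run(cmd, check=True)  # uncomment where ghostscript is installed
```

The resulting page PNGs can then be fed to OpenCV for step 2.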

Color in image gets dull after saving it in OpenCV

I am using the opencv module to read and write the image. Here is the code; below it is the image I am reading, and the second image is the result after saving it to disk using cv2.imwrite().
import cv2
img = cv2.imread('originalImage.jpg')
cv2.imwrite('test.jpg',img)
It is clearly visible that the colors are duller in the second image. Is there any workaround for this problem, or am I missing some setting parameters?
I have done a bit of research on the point Mark raised about the ICC profile. I have figured out a way to handle this with Python's PIL module; here is the code that worked for me. I have also learned to use the PNG file format rather than JPEG for lossless conversion.
from PIL import Image

img = Image.open('originalImage.jpg')
img.save('test.jpg', icc_profile=img.info.get('icc_profile'))
I hope this will help others as well.
The difference is that the initial image (on the left in the diagram) has an attached ICC profile whereas the second one (on the right) does not.
I obtained the above image by running the ImageMagick utility called identify like this:
identify -verbose first.jpg > 1.txt
identify -verbose second.jpg > 2.txt
Then I ran the brilliant opendiff tool (which is part of macOS) like this:
opendiff [12].txt
You can extract the ICC profile from the first image also with ImageMagick like this:
convert first.jpg profile.icc
Your first input image has an ICC profile in its metadata, which is an optional attribute that many devices do not embed in the first place. The ICC profile basically performs a sort of color correction, with correction coefficients calculated for each unique device during calibration.
Modern web browsers and image-viewing utilities take this ICC profile information into account before rendering the image to the screen, which is why the two images differ.
Unfortunately, OpenCV does not read the ICC profile from the image metadata, so it performs no color correction.
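As a hedged sketch of working around this with Pillow instead of OpenCV: read the profile from img.info and pass it back on save. This is done in memory here, and the helper name is my own:

```python
import io

from PIL import Image

def resave_keeping_icc(src_bytes: bytes, fmt: str = "PNG") -> bytes:
    """Re-save an image, passing any embedded ICC profile through."""
    img = Image.open(io.BytesIO(src_bytes))
    icc = img.info.get("icc_profile")  # None if the image has no profile
    out = io.BytesIO()
    if icc:
        img.save(out, fmt, icc_profile=icc)
    else:
        img.save(out, fmt)
    return out.getvalue()

# Build a small test image in memory; it has no ICC profile attached,
# so the helper simply re-saves it unchanged.
buf = io.BytesIO()
Image.new("RGB", (8, 8), (200, 30, 30)).save(buf, "PNG")
result = resave_keeping_icc(buf.getvalue())
```

If the source image does carry a profile, the same call preserves it, which is what keeps the colors from going dull in viewers that honor ICC data.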
