How can I pull information from Excel into PowerPoint using Python and keep the format? - python

I've written a script with Python's xlrd and python-pptx to read each workbook in a directory and pull the information from each sheet into a table on a PowerPoint slide. It works okay if the Excel table is small, but I don't know what will be in these Excel files, and the table becomes illegible when there are too many rows and columns. My main problem arose when an Excel file had graphs instead of cells and the script couldn't read it. So I tried using pyscreenshot to open the document and take a screenshot, but this seems slow and unnecessary. I'd like to make a slide in the PowerPoint look exactly as it would in Excel, but with the ability to add and change things.
# import libraries and modules
# Note: xlrd 2.0+ reads only .xls files; .xlsx needs xlrd < 2.0 or another reader such as openpyxl.
import xlrd
from pptx import Presentation
from pptx.util import Inches, Pt
import time
import glob
import os

start = time.time()
prs = Presentation()

# title slide
title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)
shapes = slide.shapes
title = slide.shapes.title
subtitle = slide.placeholders[1]
title.text = "Dashboard Generator"
subtitle.text = "made with Python-pptx and xlrd"

for filename in glob.glob(os.path.join("C:/Users/penelope/Desktop/PMO/myfiles/", '*.xlsx')):
    print(filename)
    file_location = filename
    try:
        workbook = xlrd.open_workbook(file_location)
        nsheets = workbook.nsheets
        for n in range(0, nsheets):
            sheet = workbook.sheet_by_index(n)
            print("sheet:", sheet)
            rows = sheet.nrows
            cols = sheet.ncols
            c = cols
            r = rows
            if c > 0:
                print(c, r)
                # one "Title Only" slide per non-empty sheet
                slide = prs.slides.add_slide(prs.slide_layouts[5])
                shapes = slide.shapes
                title = slide.shapes.title
                title.text = "Table testing"
                left = Inches(0.0)
                top = Inches(2.0)
                width = Inches(6.0)
                height = Inches(4.0)
                num = 10.0/c  # spread the columns over the 10-inch slide width
                table = shapes.add_table(rows, cols, left, top, width, height).table
                for i in range(0, c):
                    table.columns[i].width = Inches(num)
                for i in range(0, r):
                    for e in range(0, c):
                        table.cell(i, e).text = str(sheet.cell_value(i, e))
                        cell = table.rows[i].cells[e]
                        paragraph = cell.text_frame.paragraphs[0]
                        paragraph.font.size = Pt(11)
    except Exception as exc:
        # report which file failed and why instead of hiding the cause
        print("Error!", filename, exc)

prs.save('powerpointfile1.pptx')
end = time.time()
print(end - start)
And this is my screenshot script:
import os
import time
import pyscreenshot as ImageGrab
from PIL import Image
if __name__ == "__main__":
    os.system('start excel.exe "C:/Users/penelope/Desktop/PMO/TestCase.xlsx"')
    time.sleep(3)
    im = ImageGrab.grab(bbox=(24, 210, 1800, 990))
    im.save("image7.png")
    img = Image.open('image7.png')
    img.show()

Well, you've chosen a hard problem. Certainly all the times I've attempted this sort of thing I've ended up abandoning the effort.
The fundamental explanation I formed was that Excel (and Word) are "flowed" document environments. That is, when you run out of room on one page, it flows to the next. PowerPoint, on the other hand, is a page-by-page exhibit layout environment. Each slide is independent of the rest (evidenced by the ability to reorder slides freely), each meant to be shown all at once, and not scrolled. This leads to each slide being self-contained, which means constrained to a single "page".
There's a limit to how much information one can place on a slide and still have it communicate. Generally less is better. So, perhaps it's not a surprise all my early efforts there ended in frustration :) I also concluded that an effective "dashboard" slide would require very skillful layout, and extreme restraint on content length, probably requiring specific (human) summarization effort (not just copying from a "database").
Regarding the charts bit, those theoretically can be moved to PowerPoint and I've even seen it done, but it's technically quite challenging. There is no API support for it in python-pptx. This historical issue on the GitHub repo may give some idea what was involved. Not for the faint of heart I expect :)
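If the source tables can't be summarized by hand, one partial workaround for the per-slide limit described above is to split each sheet into fixed-size chunks and give every chunk its own slide. What follows is a minimal sketch, not a tested solution. It assumes a cap of 15 body rows per slide (an arbitrary number to tune by eye), a rectangular list of rows whose first row is the header, and two helper names, `paginate_rows` and `add_table_slides`, that are made up for illustration:

```python
from typing import List, Sequence


def paginate_rows(rows: Sequence[Sequence], max_rows: int = 15) -> List[List[Sequence]]:
    """Split rows (first row = header) into chunks of at most max_rows body rows.

    The header row is repeated at the top of every chunk.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    if not rows:
        return []
    header, body = rows[0], list(rows[1:])
    if not body:
        return [[header]]
    return [[header] + body[i:i + max_rows] for i in range(0, len(body), max_rows)]


def add_table_slides(prs, rows, max_rows=15):
    """Add one "Title Only" slide with a table per chunk (needs python-pptx)."""
    from pptx.util import Inches, Pt  # imported lazily so paginate_rows works alone

    for chunk in paginate_rows(rows, max_rows):
        slide = prs.slides.add_slide(prs.slide_layouts[5])
        table = slide.shapes.add_table(
            len(chunk), len(chunk[0]), Inches(0), Inches(2), Inches(10), Inches(4)
        ).table
        for r, row in enumerate(chunk):
            for c, value in enumerate(row):
                cell = table.cell(r, c)
                cell.text = str(value)
                cell.text_frame.paragraphs[0].font.size = Pt(11)
```

This only addresses long sheets. Wide sheets would need the columns split in a similar way, and neither step replaces the human summarization mentioned above.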

Related

How can I get pytesseract to adjust its targeting parameters in real time depending on where my data has shifted?

I'm trying to use pytesseract to read data from a web page. However, it's not always in the same exact place. Sometimes, I end up on an application where data has shifted to the left or right a little bit, because there's lack of information on the application or because one of the cells contains too much information.
Here's a sample image:
[image: target]
Here, I'm trying to read and print the first line containing the text Brazil(BZ) (and eventually, I'll have it read the other two lines as well). However, this information is again, scattered. So, I'm currently targeting it like this:
import pytesseract
import os
import cv2
import numpy as np
import pyautogui
import configparser
pytesseract.pytesseract.tesseract_cmd = os.path.join(os.environ['ProgramFiles'], 'Tesseract-OCR', 'tesseract.exe')
config = configparser.ConfigParser()
config.read('config.ini')
l1 = config['info']['l1']
l2 = config['info']['l2']
l3 = config['info']['l3']
l4 = config['info']['l4']
x1, y1 = int(l1), int(l2)
x2, y2 = int(l3), int(l4)
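try:
    # pyautogui returns a PIL image in RGB order, so convert from RGB (not BGR)
    screenshot = np.array(pyautogui.screenshot())
    cropped_image1 = screenshot[y1:y2, x1:x2]
    gray_image1 = cv2.cvtColor(cropped_image1, cv2.COLOR_RGB2GRAY)
    text1 = pytesseract.image_to_string(gray_image1)
    print(text1)
except Exception as exc:
    # show the failure instead of silently ignoring it
    print(exc)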
My config.ini file has the following:
[info]
l1 = 189
l2 = 1038
l3 = 537
l4 = 1115
This is the area on the page I'm targeting, and it's failing to target the correct area on some applications. Is there a way to maybe search for the text "Country" then go down a little bit, and print whatever is below it, regardless of where it is on my screen?
Would I perhaps do better with Selenium? I don't know if it can read HTML data from web pages in real time AFTER I process an application and end up on a different part of the site. Can it do that?
I tried targeting my data of interest but it fails because information is shifting depending on what information an applicant put in their application.
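One way to avoid fixed coordinates, sketched here rather than tested against the page in question, is to ask Tesseract for word positions with `pytesseract.image_to_data` and crop relative to wherever the label "Country" is found. The helper names `region_below` and `read_below_label` and the offsets (`height=60`, `width=350`) are assumptions to adjust for the real layout:

```python
from typing import Dict, List, Optional, Tuple


def region_below(data: Dict[str, List], label: str,
                 height: int = 60, width: int = 350) -> Optional[Tuple[int, int, int, int]]:
    """Return (x1, y1, x2, y2) of a box directly under the first word equal to label.

    data is the dict produced by pytesseract.image_to_data(..., output_type=Output.DICT).
    Returns None when the label is not found.
    """
    for i, word in enumerate(data["text"]):
        if word.strip().lower() == label.lower():
            x1 = int(data["left"][i])
            y1 = int(data["top"][i]) + int(data["height"][i])
            return (x1, y1, x1 + width, y1 + height)
    return None


def read_below_label(image, label: str = "Country") -> str:
    """OCR the area under label in an RGB numpy image (needs pytesseract)."""
    import pytesseract  # imported lazily so region_below can be used without Tesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    box = region_below(data, label)
    if box is None:
        return ""
    x1, y1, x2, y2 = box
    return pytesseract.image_to_string(image[y1:y2, x1:x2])
```

With the script above, this would be called as `read_below_label(np.array(pyautogui.screenshot()))`. It only works if Tesseract recognizes the label as a separate word on the screenshot.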

Cropping the Mediabox does not work for some pdfs

I wrote a little script which shall blank out the lower half of a PDF document. The document itself shall remain the same size, but the lower half shall be just white.
(This is to remove the "instructions" part from parcel labels of German parcel companies like DHL and Hermes.)
To do this, I take the PDF page, adjust the Mediabox, and then merge this page onto a new, blank page.
Fortunately, this works as intended with the PDFs I need it for. However, I also tried a few other PDFs and for some, it just does not work. It copies over the complete PDF. This happens for example, when my code is given this file: https://www.veeam.com/veeam_backup_product_overview_ds.pdf
Here is the code:
import pypdf # PyPDF2, 3 and 4 are deprecated. PyPDF is currently in active development
reader = pypdf.PdfReader(source_filename)
writer = pypdf.PdfWriter()
# get first page
page = reader.pages[0]
# create new page
new_page = pypdf.PageObject.create_blank_page( None, width = page.mediabox.width, height = page.mediabox.height )
# crop original
page.mediabox.bottom = ( page.mediabox.top - page.mediabox.bottom ) / 2 + page.mediabox.bottom
# merge original into empty new page
new_page.merge_page( page )
writer.add_page(new_page)
with open(output_file, "wb") as fp:
    writer.write(fp)
Can anyone explain why it does not work sometimes?

Replacing a word with another word, and replacing an image with another image in a PDF file through python, is this possible?

I need to replace K words with K other words in every PDF file under a certain path, and on top of this I need to replace every logo with another logo. I have around 1000 PDF files, so I do not want to use Adobe Acrobat and edit one file at a time. How can I start this?
Replacing words seems at least doable as long as there is a decent PDF reader one can access through Python (note that I want to do this task in Python). Replacing an image might be more difficult. I will most likely have to find the dimensions of the current image and resize the replacement image dynamically while the program runs through these PDF files.
Hi, so I've written down some code regarding this:
from pikepdf import Pdf, PdfImage, Name
import os
import glob
from PIL import Image
import zlib
example = Pdf.open(r'...\Likelihood.pdf')
PagesWithImages = []
ImageCodesForPages = []
# Grab all the pages and all the images in every page.
for i in example.pages:
    if len(list(i.images.keys())) >= 1:
        PagesWithImages.append(i)
        ImageCodesForPages.append(list(i.images.keys()))
pdfImages = []
for i, j in zip(PagesWithImages, ImageCodesForPages):
    for x in j:
        pdfImages.append(i.images[x])
# Replace every single page using random image, ensure that the dimensions remain the same?
for i in pdfImages:
    pdfimage = PdfImage(i)
    rawimage = pdfimage.obj
    im = Image.open(r'...\panda.jpg')
    pillowimage = pdfimage.as_pil_image()
    print(pillowimage.height)
    print(pillowimage.width)
    im = im.resize((pillowimage.width, pillowimage.height))
    im.show()
    rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
    rawimage.ColorSpace = Name("/DeviceRGB")
So just one problem, it doesn't actually replace anything. If you're wondering why and how I wrote this code I actually got it from this documentation:
https://buildmedia.readthedocs.org/media/pdf/pikepdf/latest/pikepdf.pdf
Start at Page 53
I essentially put all the pdfImages into a list, as 1 page can have multiple images. In conjunction with this, the last for loop essentially tries to replace all these images whilst maintaining the same width and height size. Also note, the file path names I changed here and it definitely is not the issue.
Again Thank You
I have figured out what I was doing wrong. So for anyone that wants to actually replace an image with another image in place on a PDF file what you do is:
from pikepdf import Pdf, PdfImage, Name
from PIL import Image
import zlib
example = Pdf.open(filepath, allow_overwriting_input=True)
PagesWithImages = []
ImageCodesForPages = []
# Grab all the pages and all the images in every page.
for i in example.pages:
    imagelists = list(i.images.keys())
    if len(imagelists) >= 1:
        for x in imagelists:
            rawimage = i.images[x]
            pdfimage = PdfImage(rawimage)
            rawimage = pdfimage.obj
            pillowimage = pdfimage.as_pil_image()
            # /DeviceRGB expects 3 bytes per pixel, so force RGB mode
            im = Image.open(imagePath).convert("RGB")
            im = im.resize((pillowimage.width, pillowimage.height))
            rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
            rawimage.ColorSpace = Name("/DeviceRGB")
            rawimage.Width, rawimage.Height = pillowimage.width, pillowimage.height
# write the changes back to the input file
example.save()
Essentially, I changed the arguments in the Pdf.open call so that the input file can be overwritten. I also added the last line, which actually saves the result.
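For the original goal of roughly 1000 files, the per-file code above can be wrapped in a function and applied to every PDF in a folder. This is a rough sketch: it assumes a hypothetical `replace_images(path)` function that holds the pikepdf code above, and `process_pdfs` is likewise a name made up for illustration. Files that raise an error are reported and skipped so that one bad PDF does not stop the run:

```python
from pathlib import Path
from typing import Callable, List


def process_pdfs(folder: str, replace_fn: Callable[[Path], None]) -> List[Path]:
    """Call replace_fn on every *.pdf directly inside folder.

    Returns the paths that were processed without an exception.
    """
    done = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            replace_fn(pdf_path)
        except Exception as exc:
            # keep going; report the file so it can be inspected by hand
            print(f"skipped {pdf_path.name}: {exc}")
            continue
        done.append(pdf_path)
    return done
```

Because the code above saves over the input file, it is safer to run this on a copy of the folder first.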

Adding extra space to pages in PDF

I have a couple of PDFs I want to add a few inches on one side to give myself more room for handwritten comments in a notes app. Basically, I want to give myself more room to scribble on the sides of the pages (lecture scripts).
The pages should not be scaled, I simply want the contents to stay at the same spot from the upper left corner, but add more space at the right and maybe at the bottom.
Is there a good way to do this either using one of the Python PDF libs or using a command line tool?
Can I simply add extra space to the Media box, or do I need to do something else?
OK, the following code seems to work.
Had to set mediaBox and cropBox to get the desired result.
# Uses the PyPDF2 API from before version 3.0 (PdfFileReader, mediaBox, addPage).
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf = PdfFileReader("org.pdf")
writer = PdfFileWriter()
factor = 1.3
for page in pdf.pages:
    # move the right edge of both boxes outwards; the contents stay in place
    x, y = page.mediaBox.lowerRight
    page.mediaBox.lowerRight = ((factor * float(x)), float(y))
    x, y = page.cropBox.lowerRight
    page.cropBox.lowerRight = ((factor * float(x)), float(y))
    writer.addPage(page)
with open("out.pdf", "wb") as out_f:
    writer.write(out_f)
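Multiplying the x coordinate only gives the intended width when the box starts at x = 0. Below is a small sketch of the arithmetic for the general case, written with the newer `pypdf` names (`PdfReader`, `PdfWriter`, `mediabox`, `cropbox`, `RectangleObject`) instead of the PyPDF2 ones above. `widened_right` and `widen_pdf` are hypothetical helper names, and the `pypdf` part has not been run against real files:

```python
def widened_right(left: float, right: float, factor: float) -> float:
    """Return the new right edge so the box is `factor` times as wide, left edge fixed."""
    return left + factor * (right - left)


def widen_pdf(src: str, dst: str, factor: float = 1.3) -> None:
    """Widen every page of src to the right and write the result to dst (needs pypdf)."""
    from pypdf import PdfReader, PdfWriter  # imported lazily; widened_right needs no library
    from pypdf.generic import RectangleObject

    writer = PdfWriter()
    for page in PdfReader(src).pages:
        # read both boxes before changing either one
        originals = [("mediabox", page.mediabox), ("cropbox", page.cropbox)]
        for name, box in originals:
            new_right = widened_right(float(box.left), float(box.right), factor)
            setattr(page, name, RectangleObject(
                (float(box.left), float(box.bottom), new_right, float(box.top))))
        writer.add_page(page)
    with open(dst, "wb") as out_f:
        writer.write(out_f)
```

Extra room at the bottom could be added the same way by lowering the bottom edge instead of moving the right one.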

Insert several images at once using macro scripting in LibreOffice

I am writing a macro for LibreOffice Writer in Python.
I need to insert several images in one document, one after another, with minimal space in between them.
The following code inserts all the images in the same area, so all of them overlap.
I need to advance the cursor below the inserted image every time a new image is inserted.
I have tried the cursor.gotoEnd(), cursor.goDown() and other such methods but none seem to work.
How do I make this work?
import glob

def InsertAll():
    desktop = XSCRIPTCONTEXT.getDesktop()
    doc = desktop.loadComponentFromURL('private:factory/swriter', '_blank', 0, ())
    text = doc.getText()
    cursor = text.createTextCursor()
    file_list = glob.glob('/path/of/your/dir/*.png')
    for f in file_list:
        img = doc.createInstance('com.sun.star.text.TextGraphicObject')
        img.GraphicURL = 'file://' + f
        text.insertTextContent(cursor, img, False)
        cursor.gotoEnd(False)  # <- doesn't advance the cursor downwards
    return None
Insert a paragraph break after each image:
from com.sun.star.text.ControlCharacter import PARAGRAPH_BREAK
text.insertTextContent(cursor, img, False)
text.insertControlCharacter(cursor, PARAGRAPH_BREAK, False)
cursor.gotoEnd(False)
This will separate the images... by a paragraph
Andrew's book is a basic source for solving many OpenOffice scripting problems: +1
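A side note on the `'file://' + f` line in the question: plain concatenation leaves spaces and other special characters unescaped. Here is a minimal sketch using only the standard library. `to_file_url` is a hypothetical helper name, and inside a macro `uno.systemPathToFileUrl` serves the same purpose:

```python
from pathlib import Path


def to_file_url(path: str) -> str:
    """Convert a filesystem path to a percent-encoded file:// URL."""
    return Path(path).absolute().as_uri()
```

In the loop above this would be used as `img.GraphicURL = to_file_url(f)`.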
