Substitute string in .docx file by jpg with python

Substitute string in .docx file by jpg with python - python

Sorry for my bad English.
I'm trying to substitute a string in .docx file by a .jpg file. First I convert JPEG to BMP and move it to clipboard follow Copy PIL/PILLOW Image to Windows Clipboard, then use Find.Execute with "^c" to substitute a special string in docx file.
The substitution works well, but it paste a image with width of 15.42cm to the .docx file. I've tried to resize it with im.resize, but it ends with a large blur image rather than a small one. How could I make it smaller?
I'm using python2.7.2 and Win7. Thanks a lot.
from win32com.client import Dispatch
from cStringIO import StringIO
import win32clipboard
import win32com
from PIL import Image
def setImageToClipboard(clip_type, data):
win32clipboard.OpenClipboard()
win32clipboard.EmptyClipboard()
win32clipboard.SetClipboardData(clip_type, data)
win32clipboard.CloseClipboard()
filepath = 'd:/tmp.jpg'
im = Image.open(filepath)
#im = im.resize((10, 10))
output = StringIO()
im.convert("RGB").save(output, "BMP")
data = output.getvalue()[14:]
output.close()
w = win32com.client.Dispatch('Word.Application')
w.Visible = 1
w.DisplayAlerts = 0
doc = w.Documents.Open("d:/clipboard_test.docx")
search = "TEST"
setImageToClipboard(win32clipboard.CF_DIB, data)
w.Selection.Find.ClearFormatting()
w.Selection.Find.Replacement.ClearFormatting()
w.Selection.Find.Execute("TEST", False, True, False, False, False, True, 1, True, ReplaceWith="^c", Replace=2000)
doc.SaveAs("d:/clipboard_test2.docx")
doc.Close()
w.Quit()

You can do this much more cleanly and without actually needing MS-Word installed or starting it up by using python-docx.
You will need to read in your existing document, find the text that you need to replace and then use document.add_picture('monty-truth.png', width=Inches(1.25))
rather than using word and pretending to paint.
The only other option is to have python select the image that you have pasted and set its properties within word.

Related

Cannot save multiple files with PIL save method

I have modified a vk4 converter to allow for the conversion of several .vk4 files into .jpg image files. When ran, IDLE does not give me an error, but it only manages to convert one file before ending the process. I believe the issue is that image.save() only seems to affect a single file and I have been unsuccessful in looping that command to extend to all other files in the directory.
Code:
import numpy as np
from PIL import Image
import vk4extract
import os
os.chdir(r'path\to\directory')
root = ('.\\')
vkimages = os.listdir(root)
for img in vkimages:
if (img.endswith('.vk4')):
with open(img, 'rb') as in_file:
offsets = vk4extract.extract_offsets(in_file)
rgb_dict = vk4extract.extract_color_data(offsets, 'peak', in_file)
rgb_data = rgb_dict['data']
height = rgb_dict['height']
width = rgb_dict['width']
rgb_matrix = np.reshape(rgb_data, (height, width, 3))
image = Image.fromarray(rgb_matrix, 'RGB')
image.save('sample.jpeg', 'JPEG')
How do I prevent the converted files from being overwritten while using the PIL module?
Thank you.

It is saving every file, but since you are always providing the same name to each file (image.save('sample.jpeg', 'JPEG')), only the last one will be saved and all the other ones will be overwritten. You need to specify different names to every file. There are several ways of doing it. One is adding the index when looping using enumerate():
for i, img in enumerate(vkimages):
and then using the i on the name of the file when saving:
image.save(f'sample_{i}.jpeg', 'JPEG')
Another way is to use the original filename and replace the extension. From your code, it looks like the files are .vk4 files. So another possibility is to save with the same name but replacing .vk4 to .jpeg:
image.save(img.replace('.vk4', '.jpeg'), 'JPEG')

How do I add an image from a list in python using docx?

I wrote a code that takes a screenshot that I want to paste into a word document using docx. So far I have to save the image as a png file. The relevant part of my code is:
from docx import Document
import pyautogui
import docx
doc = Document()
images = []
img = pyautogui.screenshot(region = (some region))
images.append(img)
img.save(imagepath.png)
run =doc.add_picture(imagepath.png)
run
I would like to be able to add the image without saving it. Is it possible to do this using docx?

Yes, according to add_picture — Document objects — python-docx 0.8.10 documentation, add_picture can import data from a stream as well.
As per Screenshot Functions — PyAutoGUI 1.0.0 documentation, screenshot() produces a PIL/Pillow image object which can be save()'d with a BytesIO() as destination to produce a compressed image data stream in memory.
So that'll be:
import io
imdata = io.BytesIO()
img.save(imdata, format='png')
imdata.seek(0)
doc.add_picture(imdata)
del imdata # cannot reuse it for other pictures, you need a clean buffer each time
# can use .truncate(0) then .seek(0) instead but this is probably easier

Extract images from word document using Python

How can i extract images/logo from word document using python and store them in a folder. Following code converts docx to html but it doesn't extract images from the html. Any pointer/suggestion will be of great help.
profile_path = <file path>
result=mammoth.convert_to_html( profile_path)
f = open(profile_path, 'rb')
b = open(profile_html, 'wb')
document = mammoth.convert_to_html(f)
b.write(document.value.encode('utf8'))
f.close()
b.close()

You can use the docx2txt library, it will read your .docx document and export images to a directory you specify (must exist).
!pip install docx2txt
import docx2txt
text = docx2txt.process("/path/your_word_doc.docx", '/home/example/img/')
After execution you will have the images in /home/example/img/ and the variable text will have the document text. They would be named image1.png ... imageN.png in order of appearance.
Note: Word document must be in .docx format.

Extract all the images in a docx file using python
1. Using docxtxt
import docx2txt
#extract text
text = docx2txt.process(r"filepath_of_docx")
#extract text and write images in Temporary Image directory
text = docx2txt.process(r"filepath_of_docx",r"Temporary_Image_Directory")
2. Using aspose
import aspose.words as aw
# load the Word document
doc = aw.Document(r"filepath")
# retrieve all shapes
shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
imageIndex = 0
# loop through shapes
for shape in shapes :
shape = shape.as_shape()
if (shape.has_image) :
# set image file's name
imageFileName = f"Image.ExportImages.{imageIndex}_{aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)}"
# save image
shape.image_data.save(imageFileName)
imageIndex += 1

Native without any lib
To extract the source Images from the docx (which is a variation on a zip file) without distortion or conversion.
shell out to OS and run
tar -m -xf DocxWithImages.docx word/media
You will find the source images Jpeg, PNG WMF or others in the word media folder extracted into a folder of that name. These are the unadulterated source embedment's without scale or crop.
You may be surprised that the visible area may be larger then any cropped version used in the docx itself, and thus need to be aware that Word does not always crop images as expected (A source of embarrassing redaction failure)

Look the Alderven's answer at Extract all the images in a docx file using python
The zipfile works for more image formats than the docx2txt. For example, EMF images are not extracted by docx2txt but can be extracted by zipfile.

Tesseract could not recognize text from a pdf file

I tried to use Tesseract in Python to OCR some PDFs. The workflow is to convert a PDF to a series of images first using wand, then send them to Tesseract based on this example. I applied this to 5 PDFs but found it failed to convert one (completely failed). It works fine to convert PDF to Tiff. Thus, I guess maybe something needs to be tuned in the OCR process? Or any other tools I should use to deal with this situation? I tried xpdfbin-win-3.04 which worked on this PDF but did not work as well as Tesseract on the other PDFs...
Screenshot of failed PDF
Output text
Code
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()2
pth_str = "C:/Users/TH/Desktop/OCR_test/"
fname_list = ["999437-Asb_1-34.pdf"]
for each_file in fname_list:
print each_file
req_image = []
final_text = []
# convert to tiff
image_pdf = Image(filename=pth_str+each_file, resolution=600)
image_tif = image_pdf.convert('tiff')
for img in image_tif.sequence:
img_page = Image(image=img)
req_image.append(img_page.make_blob('tiff'))
# begin OCR
for img in req_image:
txt = tool.image_to_string(
PI.open(io.BytesIO(img)),
lang=lang,
builder=pyocr.builders.TextBuilder()
)
final_text.append(txt.encode('ascii','ignore'))

Add text to existing PDF document in Python

I'm trying to convert a pdf to the same size as my pdf which is an A4 page.
convert my_pdf.pdf -density 300x300 -page A4 my_png.png
The resulting png file, however, is 595px × 842px which should be the resolution at 72 dpi.
I was thinking of using PIL to write some text on some of the pdf fields and convert it back to PDF. But currently the image is coming out wrong.
Edit: I was approaching the problem from the wrong angle. The correct approach didn't include imagemagick at all.

After searching around some I finally found the solution:
It turns out that this was the correct approach after all.
Yet, i feel that it wasn't verbose enough.
It appears that the poster probably took it from here (same variable names etc).
The idea: create new blank PDF with Reportlab which only contains a text string.
Then merge/add it as a watermark using pyPdf.
from pyPdf import PdfFileWriter, PdfFileReader
import StringIO
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
packet = StringIO.StringIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.drawString(100,100, "Hello world")
can.save()
#move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(file("mypdf.pdf", "rb"))
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(0)
page.mergePage(new_pdf.getPage(0))
output.addPage(page)
# finally, write "output" to a real file
outputStream = file("/home/joe/newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
Hope this helps somebody else.

I just tried the solution above, but I had quite some troubles to get it running in Python3. So, I would like to share my modifications. The adapted code looks as follows:
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
packet = io.BytesIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.drawString(100, 100, "Hello world")
can.save()
# move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(open("mypdf.pdf", "rb"))
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
page = existing_pdf.getPage(0)
page2 = new_pdf.getPage(0)
page.mergePage(page2)
output.addPage(page)
# finally, write "output" to a real file
outputStream = open("newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
Now the page.mergePage throws an error. Turns out to be a porting error in pypdf2. Please refer to this question for the solution: Porting to Python3: PyPDF2 mergePage() gives TypeError

You should look at Add text to Existing PDF using Python and also Python as PDF Editing and Processing Framework. These will point you in the right direction.
If you do what you've proposed in the question, when you export back to .pdf, it will really just be an image file embedded in a .pdf, it won't be text.

pdfrw will let you take existing PDFs and place them as form XObjects (similar to images) on a reportlab canvas. There are some examples for this in the pdfrw examples/rl1 subdirectory on github. Disclaimer -- I am the pdfrw author.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Substitute string in .docx file by jpg with python - python

Related

Cannot save multiple files with PIL save method

How do I add an image from a list in python using docx?

Extract images from word document using Python

Tesseract could not recognize text from a pdf file

Add text to existing PDF document in Python

Categories

Resources