Removing formatted images from a word document using Python - python

A colleague gave me a .docx Microsoft Word file with 90 images in it that need to be extracted so they can be turned into flashcards. I tried the Python module "docx2txt", which worked OK but only extracted 34 images. On further inspection, I found the reason: when my coworker made the original file, he took screenshots of PowerPoint slides that each held about 4-6 of the images. He then put each screenshot in Word and used the built-in Word trimming tool, copying the picture several times and trimming each copy down to the individual picture he needed on a particular line of the document.
docx2txt copied the picture files to my designated directory perfectly, but did not keep the formatting: any picture he had inserted and "trimmed down" to size was copied as the full image.
Does anyone know of a way to keep the formatting so I don't have to go through and manually copy 90 pictures one by one? Perhaps converting to a .pdf file and using a PDF-related module? Or is there another Python library that will keep the picture formatting? Thanks for any help you can provide! I'm somewhat of a beginner with Python, but love it when I can get it to automate stuff... even if it ends up taking longer to figure out how to do it than just boring myself to death saving the photos manually, lol.

https://support.microsoft.com/en-us/topic/reduce-the-file-size-of-a-picture-in-microsoft-office-8db7211c-d958-457c-babd-194109eb9535
Important: Cropped parts of the picture are not removed from the file, and can potentially be seen by others; including search engines if the cropped image is posted online. Only the Office desktop apps have the ability to remove cropped areas from the underlying image file.
Follow the relevant section for desktop Office (Windows or Mac); as the note above says, this CANNOT be done in Office for the web.
Go to "Other kinds of cropping".
Important: If you delete cropped areas and later change your mind, you can click the Undo button to restore them. Deletions can [ONLY] be undone until the file is saved.
So make a backup copy of the file first.
Select the picture or pictures (if you want all of them selected, CTRL + A highlights everything).
Then follow the instructions:
Picture Tools > Format, and in the Adjust group, click Compress Pictures.
Be sure that the "Delete cropped areas of pictures" check box is selected.
DEselect the "Apply only to this picture" check box.
Double-check a few pictures manually to verify all is well, then save a copy.
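If you would rather stay in Python, here is a minimal sketch of the same idea: read the crop that Word stored for each picture and apply it yourself. It assumes the crop is recorded in word/document.xml as an `a:srcRect` element next to the picture's `a:blip`, with `l`, `t`, `r`, `b` given in 1/1000 of a percent. The function names `read_crops` and `pixel_box` are made up for illustration, and only the standard library is used.

```python
import zipfile
import xml.etree.ElementTree as ET

# DrawingML / relationship namespaces used inside word/document.xml
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
PIC = "http://schemas.openxmlformats.org/drawingml/2006/picture"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def read_crops(docx_path):
    """Return one (relationship_id, (l, t, r, b)) tuple per picture.

    The four values are the fractions (0..1) trimmed from the left, top,
    right and bottom edges. Assumption: crops live in <a:srcRect>.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    crops = []
    for fill in root.iter(f"{{{PIC}}}blipFill"):
        blip = fill.find(f"{{{A}}}blip")
        if blip is None:
            continue
        rect = fill.find(f"{{{A}}}srcRect")
        attrs = rect.attrib if rect is not None else {}
        # Missing attributes mean "nothing trimmed on that side".
        fractions = tuple(int(attrs.get(k, 0)) / 100000 for k in ("l", "t", "r", "b"))
        crops.append((blip.get(f"{{{R}}}embed"), fractions))
    return crops


def pixel_box(width, height, crop):
    """Convert trim fractions into a (left, upper, right, lower) pixel box."""
    l, t, r, b = crop
    return (round(width * l), round(height * t),
            round(width * (1 - r)), round(height * (1 - b)))
```

Each relationship id can be mapped to a file under word/media/ through word/_rels/document.xml.rels, and the box from `pixel_box` has the (left, upper, right, lower) layout that Pillow's `Image.crop` expects, so a loop over `read_crops` should yield one cropped file per picture, including the ones that share a screenshot.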

Related

How to attach images using pymupdf

I have a PDF in which 2 pages contain a total of 6 attachment boxes: clicking one lets you choose an image file, which is then inserted into the PDF. I want to do this using Python. I tried PyMuPDF, which reports each box as a button widget, but I don't know how to use PyMuPDF to upload the images automatically. I tried several techniques, but none helped, so I had to remove the attachment boxes. Can anyone please help me out here? I saw in the documentation that adding an annot_file or an embedded file might do the trick, but I am not sure and am confused. Has anyone ever done this?
After clicking on one of the image icons
Among the techniques I tried were a few methods of attaching files using annotations and update_file on the document object, in case this helps clarify the problem. Thanks

I cannot find a way to extract underlined text; can't it be done with pdfminer.six?

I am trying to extract underlined text from a PDF using Python, but I have not been able to find a correct solution. Can anyone help with this, please?
In a PDF there are no struck-through or underlined fonts, so the best you could hope for is a flag at the start and end, like in Rich Text. Commonly a line in paperspace is placed over or under the image or text characters. This is often done later (like highlighting) as an "Annotation", so you are looking for rectangles with narrow height.
The pdfminer.six maintainers acknowledge that the best they can do is close this issue; see https://github.com/pdfminer/pdfminer.six/issues/237
You could look for StrikeOut or Underline annotation objects; a script showing how that may be done is available at https://github.com/0xabu/pdfannots
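The "rectangles with narrow height" heuristic can be sketched independently of any PDF library. The snippet below is only an illustration under stated assumptions: words and rectangles have already been extracted as (x0, y0, x1, y1) boxes in PDF coordinates with y growing upward (the convention pdfminer.six layout objects use), and the names `is_thin` and `underlined_words`, as well as the tolerance values, are made up and would need tuning per document.

```python
def is_thin(rect, max_height=1.5):
    """True for a flat, wide rectangle that could be a drawn rule."""
    x0, y0, x1, y1 = rect
    height, width = y1 - y0, x1 - x0
    return height <= max_height and width > height


def underlined_words(words, rects, tolerance=2.0, max_height=1.5):
    """Return the texts of words that have a thin rectangle near their bottom edge.

    words: list of (text, (x0, y0, x1, y1)); rects: list of (x0, y0, x1, y1).
    Assumption: y grows upward, so a word's bottom edge is its y0.
    """
    rules = [r for r in rects if is_thin(r, max_height)]
    hits = []
    for text, (wx0, wy0, wx1, wy1) in words:
        for rx0, ry0, rx1, ry1 in rules:
            # Horizontal overlap must cover at least half of the word.
            overlap = min(wx1, rx1) - max(wx0, rx0)
            # The rule's vertical centre must sit close to the word's bottom.
            near_bottom = abs((ry0 + ry1) / 2 - wy0) <= tolerance
            if overlap >= 0.5 * (wx1 - wx0) and near_bottom:
                hits.append(text)
                break
    return hits
```

A rule through the middle of a word (a strike-through) fails the `near_bottom` check, which is how the sketch tells the two apart. Underlines stored as annotations rather than drawn rectangles would not show up this way and need the annotation route instead.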

Identify the edited location in the PDF modified by online editor www.ilovepdf.com using Python

I have an SBI bank statement PDF which is tampered/forged. Here is the link for the PDF.
This PDF is edited using online editor www.ilovepdf.com. The edited part is the first entry under the 'Credit' column. Original entry was '2,412.00' and I have modified it to '12.00'.
Is there any programmatic way either using Python or any other opensource technology to identify the edited/modified location/area of the PDF (i.e. BBOX(Bounding Box) around 12.00 credit entry in this PDF)?
2 things I already know:
Metadata (Info or XMP metadata) is not useful. The modify date in the metadata doesn't tell you whether the PDF was merely compressed or actually edited, because it changes in both cases. It also doesn't give the location of the edit.
The PyMuPDF SPANS JSON object is also not useful, as the edited entry doesn't come at the end of the SPANS JSON; instead it appears in the proper order of the text inside the PDF. Here is the SPAN JSON file generated from PyMuPDF.
Kindly let me know if anyone has any opensource solution to resolve this problem.
iLovePDF completely changes the whole text in the document. You can even see this: open the original and the manipulated PDFs in two Acrobat Reader tabs and switch back and forth between them, and you'll see nearly all letters move a bit.
Internally, iLovePDF also rewrote the PDF completely according to its own preferences, and the edit fits in perfectly.
Thus, no, you cannot recognize the manipulated text based on this document alone, because technically it is a completely different, completely new document.

Python - Split pdf or powerpoint by pixel location?

I will explain my dilemma first: I have several thousand PowerPoint files (.ppt) that I need to extract the text from. The problem is that the text is disorganized in the file, and when read as a complete page it makes no sense for what I need (in the example it would read: line 1, line 3, line 2, line 4, line 5).
I was using tika to read the files initially. I then thought that if I converted them to PDF using glob and win32com.client I would have better luck, but the result is basically the same. The picture here is an example of what the text is like.
So my idea now is to section the PDF or PPT by pixel location (saving the sections to separate temp files if needed, then opening and reading them that way), so I can keep things in order and get what I need. Although the text moves around within each box, the black outline boxes are always roughly in the same location.
I cannot find anything to split an individual PDF page though, only to combine multiple pages into a single page. Does anyone have an idea how to go about doing this?
I need to read the text in box one together (line 1 and line 2) and load into a dictionary or some other container, and the same for the second box. For reference there is only one slide in the powerpoint.
Allow me to provide the answer as a general guideline:
Only .pptx files are glorified .zip files; the older .ppt format is a binary container, not a zip archive.
Convert your .ppt files into .pptx files.
Use 7-zip or WinZip to open a .pptx and understand the structure.
Each slide should now have a .xml file full of tags you can parse.
For example you will find tags for each text box with tags for that box's text nested inside.
Also: python-pptx
Mass convert by tweaking this VBA code: Link for VBA
Or using PowerShell: Link for [PowerShell]
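As a rough sketch of the XML route, the snippet below reads one slide from a .pptx with the standard library and groups the text per shape, sorted by the shape's position. It assumes the usual PresentationML layout (`p:sp` shapes holding `p:txBody`, text runs in `a:t`, position in `p:spPr/a:xfrm/a:off`); the function name `slide_text_boxes` is made up for illustration, and shapes that inherit their position from the layout are simply placed at (0, 0).

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
}


def slide_text_boxes(pptx_path, slide="ppt/slides/slide1.xml"):
    """Return the slide's text boxes as dicts, ordered top-to-bottom, left-to-right."""
    with zipfile.ZipFile(pptx_path) as archive:
        root = ET.fromstring(archive.read(slide))
    boxes = []
    for shape in root.iter(f"{{{NS['p']}}}sp"):
        body = shape.find("p:txBody", NS)
        if body is None:
            continue
        # One string per paragraph; runs inside a paragraph are joined.
        lines = [
            "".join(t.text or "" for t in para.iter(f"{{{NS['a']}}}t"))
            for para in body.findall("a:p", NS)
        ]
        off = shape.find("p:spPr/a:xfrm/a:off", NS)
        # Offsets are in EMU; shapes without an explicit offset fall back to (0, 0).
        x, y = (int(off.get("x")), int(off.get("y"))) if off is not None else (0, 0)
        boxes.append({"x": x, "y": y, "lines": [line for line in lines if line]})
    return sorted(boxes, key=lambda box: (box["y"], box["x"]))
```

Because each box keeps its own lines together, the text of box one (line 1 and line 2) stays separate from box two regardless of the order in which the shapes are stored in the file. python-pptx exposes the same information through shape objects if you prefer a library.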

Batch process colour profiles of images to any profile needed with Python

I'm working on a project that needs to process images from one folder and output them to another. The changes vary, but the main one is that the colour profiles need to be changed. Everything I have seen so far can only convert to sRGB or similar, whereas I need either to be able to add a profile or to have an extensive or full list of profiles to convert to. For example, one of the profiles I'll need to use is eciRGB v2.
Please help me automate this in Python (I can't use Photoshop...).
You can do this with ImageMagick from the terminal.
Download the colour profile (eciRGB_v2.icc) and make sure you know the path to it; I recommend keeping it in the same directory as your images.
example set up
Then open Terminal at that directory and run this code:
convert image.tif -profile eciRGB_v2.icc output.tif
example of result
Link to download ImageMagick: https://imagemagick.org/index.php
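Since the question asks for Python, the same command can be driven from a small script. This is a minimal sketch, assuming ImageMagick is installed and reachable as `convert` (ImageMagick 7 installs usually expose it as `magick` instead, hence the `binary` parameter); the function names `build_command` and `convert_folder` and the `*.tif` pattern are illustrative choices, not part of ImageMagick.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, profile, binary="convert"):
    """Build the ImageMagick call: <binary> src -profile profile dst."""
    return [binary, str(src), "-profile", str(profile), str(dst)]


def convert_folder(in_dir, out_dir, profile, pattern="*.tif",
                   binary="convert", dry_run=False):
    """Run the profile conversion for every matching image in in_dir.

    Returns the list of commands; with dry_run=True nothing is executed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for src in sorted(Path(in_dir).glob(pattern)):
        cmd = build_command(src, out_dir / src.name, profile, binary)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if ImageMagick fails.
            subprocess.run(cmd, check=True)
    return commands
```

A call such as `convert_folder("input", "output", "eciRGB_v2.icc")` would then process a whole folder; running it first with `dry_run=True` shows the commands without touching any file.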
