I have a .docx template containing tables with placeholder names that I replace with data from an .xlsx file using openpyxl and python-docx. For example, I have a "tool" placeholder that gets replaced with the contents of cell A1 in the .xlsx.
It's a dumb workaround, but that's life when I can't get the business to use .xlsx to begin with.
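For context, the replacement step looks roughly like this (file names are placeholders; it assumes each placeholder sits inside a single run):

```python
# Sketch of the placeholder-replacement step described above.
# "tool" and cell A1 are from the question; file names are made up.
from docx import Document
from openpyxl import load_workbook

wb = load_workbook("data.xlsx")
value = str(wb.active["A1"].value)          # contents of cell A1

doc = Document("template.docx")
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                # naive replace; assumes the placeholder is not split
                # across multiple runs
                for run in paragraph.runs:
                    run.text = run.text.replace("tool", value)

doc.save("output.docx")
```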
The template is 126 pages, and I use anywhere from 1 to 126 of them depending on the part being documented. Currently I remove the unused pages manually. Is there a way to remove, for example, pages 10 through 126 using Python?
All the pages are identical to start with, so another solution would be to generate the correct number of pages at the beginning. I've tried various ways of doing that, but I can never get the logo picture to copy correctly.
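Something like this is what I have in mind for the removal route; a sketch, untested against this template, and it assumes every page ends with an explicit page break rather than flowing naturally:

```python
# Keep the first N pages by counting hard page breaks and dropping
# every body block after the cut-off. Assumes one explicit page break
# at the end of each page in the template.
from docx import Document
from docx.oxml.ns import qn

def keep_first_pages(src, dst, pages_to_keep):
    doc = Document(src)
    body = doc.element.body
    breaks_seen = 0
    for child in list(body):
        if child.tag == qn('w:sectPr'):       # keep section properties
            continue
        if breaks_seen >= pages_to_keep:      # past the cut-off: drop block
            body.remove(child)
            continue
        for br in child.iter(qn('w:br')):     # count page breaks in block
            if br.get(qn('w:type')) == 'page':
                breaks_seen += 1
    doc.save(dst)

keep_first_pages("template.docx", "trimmed.docx", 9)  # drop pages 10-126
```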
I have thousands of scanned field books as PDFs. Each has a unique filename. In a spreadsheet I have metadata for each, where each row has:
index number, filename, info1, info2, info3, info4, etc.
filename is the exact file name of the PDF. info1 is just an example of a metadata field, such as 'Year' or whatever. There are only about 8 fields or so, and not every PDF is relevant to all of them.
I assume there should be a reasonable way to create a database, MySQL or other, by reading the spreadsheet (which I can just save as .csv or .txt or something). This part I am sure I can handle.
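For what it's worth, here is a minimal sketch of that step with SQLite (standard library, no MySQL server needed); the file, table, and column names are made up:

```python
# Load the exported CSV into a local SQLite database.
import csv
import sqlite3

conn = sqlite3.connect("fieldbooks.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS books "
    "(idx INTEGER, filename TEXT, info1 TEXT, info2 TEXT, info3 TEXT, info4 TEXT)"
)
with open("metadata.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                        # skip the header row
    rows = [row[:6] for row in reader]  # first six columns, for illustration
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()
```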
I want to be able to look up/search for a PDF file by entering various search terms against the metadata and get a list of results, in a web interface or a custom window, and be able to click on a result to open the file. Basically a typical search window with predefined fields you can fill in to get results, like at an old-school library terminal.
I have decent coding skills in Python, mostly math, but some file handling as well. I'm looking for guidance on what tools and approach I should take. My short-term goal is to be able to query, find files, and open whatever results. Long term, I want to be able to share this with the public so they can search and find stuff.
After trying to figure out what to search for online, I am obviously at a loss. How do you suggest I do this, and what tools or libraries should I use? I cannot find an example of this online; I'm not sure how to word it.
The actual data stuff could be done with pandas:
read the Excel file into pandas
perform the search on the pandas DataFrame, e.g. using df.query() (sketched below)
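A minimal sketch of that route; the file name and the "Year" column are invented for illustration:

```python
import pandas as pd

df = pd.read_excel("metadata.xlsx")    # or pd.read_csv("metadata.csv")

# e.g. all field books from 1950 whose info2 mentions "creek"
hits = df.query("Year == 1950")
hits = hits[hits["info2"].str.contains("creek", case=False, na=False)]
print(hits[["filename", "Year", "info2"]])
```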
But this does not give you a GUI. For that you could go for a web app, using the Flask or Django framework. That, however, one does not master overnight :)
This is a good course to learn that kind of stuff: https://www.edx.org/course/cs50s-web-programming-with-python-and-javascript?index=product&queryID=01efddd992de28a8b1b27d136111a2a8&position=3
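To give you an idea of the shape of such an app, here is a bare-bones Flask sketch; the routes, field names, and the ./pdfs folder are all invented for illustration:

```python
# Minimal search page: one form field, filter the DataFrame, list
# matching filenames as clickable links that serve the PDF.
from flask import Flask, request, send_from_directory
import pandas as pd

app = Flask(__name__)
df = pd.read_csv("metadata.csv")

@app.route("/")
def search():
    term = request.args.get("q", "")
    # match the term against every metadata column, case-insensitively
    mask = df.apply(lambda row: term.lower() in str(row.values).lower(), axis=1)
    hits = df[mask] if term else df.iloc[0:0]
    items = "".join(
        f'<li><a href="/files/{name}">{name}</a></li>' for name in hits["filename"]
    )
    return f'<form>Search: <input name="q" value="{term}"></form><ul>{items}</ul>'

@app.route("/files/<path:name>")
def open_pdf(name):
    return send_from_directory("pdfs", name)  # assumes PDFs live in ./pdfs

if __name__ == "__main__":
    app.run(debug=True)
```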
I have an SBI bank statement PDF which has been tampered with/forged. Here is the link for the PDF.
This PDF was edited using the online editor www.ilovepdf.com. The edited part is the first entry under the 'Credit' column: the original entry was '2,412.00' and I modified it to '12.00'.
Is there any programmatic way, either using Python or any other open-source technology, to identify the edited/modified location/area of the PDF (i.e. a bounding box around the 12.00 credit entry)?
Two things I already know:
Metadata (Info or XMP metadata) is not useful. The modify date in the metadata doesn't confirm whether the PDF was merely compressed or actually edited; it changes in both cases. It also doesn't give the location of the edit.
The PyMuPDF spans JSON object is also not useful, as the edited entry doesn't come at the end of the spans JSON; instead it appears in the proper order of the text inside the PDF. Here is the span JSON file generated from PyMuPDF.
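For reference, this is roughly how both of the above were gathered with PyMuPDF (the file name is a placeholder):

```python
import fitz  # PyMuPDF

doc = fitz.open("statement.pdf")
print(doc.metadata)                            # Info dict, incl. modDate

for page in doc:
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):
            for span in line["spans"]:         # the spans mentioned above
                print(span["bbox"], span["text"])
```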
Kindly let me know if anyone has an open-source solution to this problem.
iLovePDF completely changes the whole text in the document. You can even see this: just open the original and the manipulated PDFs in two Acrobat Reader tabs and switch back and forth between them, and you'll see nearly all letters move a bit.
Internally iLovePDF also rewrote the PDF completely according to its own preferences, and the edit fits in perfectly.
Thus, no, you cannot recognize the manipulated text based on this document alone, because it is technically a completely different, completely new document.
I've been trying to scrape some data off of PDFs regarding 2020 election results in California for my own morbid curiosity.
I need to scrape many tables that appear across many pages. In some cases, the rows will continue onto the next page, and additional columns will appear on other pages as well. I've included a link to one example. I'm comfortable with R, but I can also use Python if that will be better for scraping. I haven't found many resources indicating how to deal with tables that carry onto additional pages for either language though. I need to get these tables into a CSV or XLSX format.
Thank you in advance!
In this example, pages 15-28 should be one table.
https://www.co.tehama.ca.us/images/images/Elections/StatementOfVotesCastNOV2020v2excel.pdf
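In case a scripted route helps: a sketch with pdfplumber (a library not mentioned in this thread, so an assumption on my part) that stitches the table spread over pages 15-28 into one CSV:

```python
# Concatenate the per-page pieces of one logical table into a CSV.
import csv
import pdfplumber

rows = []
with pdfplumber.open("StatementOfVotesCastNOV2020v2excel.pdf") as pdf:
    for page in pdf.pages[14:28]:          # pages 15-28, zero-indexed
        table = page.extract_table()
        if table:
            rows.extend(table)             # append this page's rows

with open("votes.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

You would still need to drop the repeated per-page header rows and line up the extra columns that appear on later pages by hand.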
I was able to get the entire table using the following procedure.
Open the PDF in MS Word, not Adobe Acrobat. Word will convert the document.
After the conversion has completed, select all. (Both steps may take some time.)
Paste into a blank Excel worksheet. Save and enjoy.
The header, in this case, contains a table with a cell containing a company logo and a few other cells with alphanumeric IDs. The footer has pagination and a timestamp.
I've looked at using templates (docxtpl) and python-docx 0.8.10, but for headers/footers with graphics or tables there doesn't seem to be a solution. I'm not sure yet how to use templates! The python-docx documentation doesn't cover non-text content in headers.
This project, my first with .docx files, requires that I read thousands of Word documents and transcribe them into similar modified files. Everything works except this final step with the headers/footers.
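One workaround I've seen sketched, which leans on python-docx underscore internals (so treat it as an assumption, not a supported API): deep-copy the header's XML from the source document and re-register the image relationships so the logo survives.

```python
# Clone a header (tables, logo and all) from one document to another.
import copy
from docx import Document
from docx.oxml.ns import qn

IMAGE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"

src_hdr = Document("source.docx").sections[0].header
dst_doc = Document("target.docx")
dst_hdr = dst_doc.sections[0].header
dst_hdr.is_linked_to_previous = False    # give it its own definition

# reach the underlying <w:hdr> elements and their packaging parts
src_el, src_part = src_hdr.paragraphs[0]._p.getparent(), src_hdr.paragraphs[0].part
dst_el, dst_part = dst_hdr.paragraphs[0]._p.getparent(), dst_hdr.paragraphs[0].part

for child in list(dst_el):               # clear the old header body
    dst_el.remove(child)
for child in src_el:                     # deep-copy paragraphs and tables
    dst_el.append(copy.deepcopy(child))

# the copied logo still references an image relationship of the SOURCE
# part; register the image with the destination part and re-point it
for blip in dst_el.iter(qn('a:blip')):
    image_part = src_part.related_parts[blip.get(qn('r:embed'))]
    blip.set(qn('r:embed'), dst_part.relate_to(image_part, IMAGE_REL))

dst_doc.save("result.docx")
```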
I'm working on a project for a friend of mine.
I want to find one specific keyword that appears on multiple pages (and has duplicates in other places) in a large PDF file (40-60 pages and above), record in memory which pages the keyword was found on, then split those pages out of the original PDF file and, lastly, merge them together.
I'm thinking about using PDFMiner or PyPDF2 (I'm open to other suggestions as well).
I've already written the code for most of it, but I can't figure out a good and efficient way to search the file and find that keyword, because the keyword appears in other places in the same file, and to make sure that the data I want to extract from the original file isn't duplicated and that all of it was extracted.
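Roughly the shape I'm aiming for, sketched with PyPDF2's current API (file names and the keyword are placeholders). Since each page is visited exactly once, each matching page gets added only once, however many times the keyword appears on it:

```python
from PyPDF2 import PdfReader, PdfWriter

def extract_keyword_pages(src_path, dst_path, keyword):
    reader = PdfReader(src_path)
    writer = PdfWriter()
    hit_pages = []
    for index, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if keyword in text:
            hit_pages.append(index)     # remember where it was found
            writer.add_page(page)       # each page is added at most once
    with open(dst_path, "wb") as f:     # the merged output file
        writer.write(f)
    return hit_pages

found = extract_keyword_pages("input.pdf", "matches.pdf", "KEYWORD")
print("keyword found on pages:", [i + 1 for i in found])
```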
Thanks in Advance.
Did you try splitting the PDF file into a couple of blocks and searching for the keyword in each block with multithreading? That should be faster.
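A rough sketch of what I mean, assuming PyPDF2 (note that for CPU-bound text extraction a process pool may parallelize better than threads in Python):

```python
from concurrent.futures import ThreadPoolExecutor
from PyPDF2 import PdfReader

def search_block(args):
    path, start, stop, keyword = args
    reader = PdfReader(path)             # each worker opens its own reader
    return [i for i in range(start, stop)
            if keyword in (reader.pages[i].extract_text() or "")]

def parallel_search(path, keyword, block_size=10, workers=4):
    total = len(PdfReader(path).pages)
    blocks = [(path, start, min(start + block_size, total), keyword)
              for start in range(0, total, block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(search_block, blocks)
    return sorted(page for block in results for page in block)

print(parallel_search("input.pdf", "KEYWORD"))
```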