How to extract all files from a p7m file

How to extract all files from a p7m file - python

I have a bunch of p7m files (used to digitally sign some files, usually pdf files) and I would like some help to find a way to extract the content. I know how to iterate a process over the files in a folder using Python, I need help just with the extraction part.
I tried with PyPDF2.PdfFileReader.decrypt() but I get a "EOF marker not found" error because apparently PyPDF2 cannot manage encrypted files.
I saw somebody used the mime library, but that is way above my level honestly.
Thank you

Related

pdf in python which consist data from .xlsx file and png image

I wanted to create a pdf using Python 3x.
The pdf should have some text data which is stored in a .xlsx file i.e.., it should read data from .xlsx file and write into the .pdf file.
Along with that, the pdf should have a png image of passport size.
I have come up with two basic ideas which are:-
First one is by writing a program which create a text file in which all required data from the pdf will be written along with the png image. After that the program will convert it into a pdf file.
Second one is by writing a program which will create the pdf file and write the data from .xlsx file as well as insert the image too into the pdf file.
I don't know whether these ideas can be used or not and how it can be used but after going through some researches on GFG, Stack overflow..., I have got totally confused and ended up asking this problem on this platform.
I have tried some modules like PIL, FPDF, reportlab,.. and am successfully able to create a pdf file with either texts or images but unable to combine both in the same text file.
Also I am confused in deciding which idea I should implement.
What I need from you guys is the answer of few of my questions which are:-
Are the ideas I mentioned above(second one specially) practically possible?
Can I make a program which imports data from file as well as png image into the same pdf. What modules and functions will be used there and how.
Please provide the code with comments or defining/elaborating the work of function used.
I hope I will get the desired result soon. Meanwhile I will try to solve it out by myself.

How to detect and separate Corrupt/Unreadable PDFs and password protected PDFs from a directory using python?

I have a directory containing about ~ 1,00,000 multipage PDFs.
I want to separate Corrupt/Unreadable and Password protected PDFs from this directory using python.
Need a good and fast solution as I might need to do it for large number of files in future.
Thanks in advance.

You can try to use PyPDF2. Loop over all files in the directory using os.listdir() and try opening each one, and store the name of each one that gives you an error. You can also place them in two different directories depending on whether opening a file gives you an error using simple try/except.

How to convert docx file to .chm using Python

I want to convert contents(text, images, links) of docx file to .chm file using Python. Can anyone please suggest how to do.
I tried to read the docx file content using docx2txt
https://github.com/ankushshah89/python-docx2txt package. But I am not sure how to read the images and links in the file.
Can someone please suggest how to read each content separately and convert it to .chm file.

You maybe warned this has a learn curve.
You need to extract all sections from your Word document into clean HTML files including the graphic files.
Please try to Save Word as HTML. But I think this don't make clean HTML.
You need the Microsoft Htmlhelp compiler for creating Chm files. I recommend using a converter tool or a Help Authoring Tool (Hat) for your task.
Search by Google for such tool "DoctoChm" and give it a try for your needs.

I recently needed to convert some resumes to plain text. There are any number of use cases for wanting to extract readable text from binary formats.
you can see the url 'http://davidmburke.com/2014/02/04/python-convert-documents-doc-docx-odt-pdf-to-plain-text-without-libreoffice/'

Using files I downloaded with python

So I want to download a bunch of clinical trial information from clinicaltrials.gov. They have a system that lets you download searches by using a custom URL. The url format is https://clinicaltrials.gov/ct2/results/download_fields?cond=&term=genentech&locn=pennsylvania&down_count=1000&down_fmt=xml
First of all how do I download that file using python? I'm assuming its something like
file = requests.get('https://clinicaltrials.gov/ct2/results/download_fields?cond=&term=genentech&locn=pennsylvania&down_count=1000&down_fmt=xml')
Then can I also rename the file and put it in my working directory?
In the end I would like to process about three to four hundred downloads and parse the files for certain information. I think I can handle that part but getting all the files into my working directory is what I'm having trouble with now.
Any help would be greatly appreciated.
Thanks!

Can you modify only a text string in an XML file and still maintain integrity and functionality of .docx encasement?

I want to enter data into a Microsoft Excel Spreadsheet, and for that data to interact and write itself to other documents and webforms.
With success, I am pulling data from an Excel spreadsheet using xlwings. Right now, I’m stuck working with .docx files. The goal here is to write the Excel data into specific parts of a Microsoft Word .docx file template and create a new file.
My specific question is:
Can you modify just a text string(s) in a word/document.xml file and still maintain the integrity and functionality of its .docx encasement? It seems that there are numerous things that can change in the XML code when making even the slightest change to a Word document. I've been working with python-docx and lxml, but I'm not sure if what I seek to do is possible via this route.
Any suggestions or experiences to share would be greatly appreciated. I feel I've read every article that is easily discoverable through a google search at least 5 times.
Let me know if anything needs clarification.
Some things to note:
I started getting into coding about 2 months ago. I’ve been doing it intensively for that time and I feel I’m picking up the essential concepts, but there are severe gaps in my knowledge.
Here are my tools:
Yosemite 10.10,
Microsoft Office 2011 for Mac

You probably need to be more specific, but the short answer is, in principle, yes.
At a certain level, all python-docx does is modify strings in the XML. A couple things though:
The XML you create needs to remain well-formed and valid according to the schema. So if you change the text enclosed in a <w:t> element, for example, that works fine. Conversely, if you inject a bunch of random XML at an arbitrary point in one of the .xml parts, that will corrupt the file.
The XML "files", known as parts that make up a .docx file are contained in a Zip archive known as a package. You must unpackage and repackage that set of parts properly in order to have a valid .docx file afterward. python-docx takes care of all those details for you, but if you're going directly at the .docx file you'll need to take care of that yourself.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.