Does PyPDF2 take any safety measures when opening an unsafe file? - python

I'm wanting to use PyPDF2 (source, docs), but first wanted to make sure that it would be safe to use. I'm unable to find anything in it's docs. I want to use it to make sure that uploaded files are valid PDFs. Users are validated, but I'm concerned about them still being able to unknowingly upload something unsafe. Is there any way that PyPDF2 would be able to tell, even if it is a PDF, that it is unsafe?

Is there any way that PyPDF2 would be able to tell, even if it is a
PDF, that it is unsafe?
No, because PyPDF2 does not contain any security scanning functionality. Any content which is harmful to your system may, or may not, pass through PyPDF and continue to be of danger to your system depending on what other precautions you take.
As jpmc26 said PyPDF is simply a parser/generator, so it is highly unlikely that the contents of a PDF could pose a security thread to PyPDF itself.

If you're concerned about validity of pdfs, if you try to manipulate a pdf with PyPDF2 that's not a valid pdf then it will likely return an error. As for checking the contents of the pdf, the library itself doesn't do that, but you can write methods for checking the contents for certain patterns, analyze the stream, and find other ways to check it yourself. The best way to start with that would be to create an invalid pdf yourself and find what things you would want to look for. It also has some password validation, though I've honestly not dealt with that part of the library. PyPDF2 is a pretty powerful tool if you can learn how to use it effectively!

PyPDF2 doesn't execute parts of the PDF. It just parses it.
Bad things that can happen:
Infinite loops
Slow parsing e.g. due to quadratic runtime
We work hard on fixing those issues whenever we note them.
Another topic certainly is supply chain vulnerabilities. PyPDF2 is among the top 1% of packages on PyPI and thus the maintainers are required to use security keys. I review all PRs and I would not allow anything that allows code execution from the PDF itself / opens network connections / looks suspicious.
FYI: I'm the current maintainer of PyPDF2.

Related

How to automate SAS enterprise guide reports with Python Script?

I tried with SASpy but it's not working. I am able to open the SAS .egp file but not able to run the multiple scripts within in sequence.
import os, sys, subprocess
def OpenProject(sas_exe, egp_path):
sasExe = sas_exe
sasEGpath = egp_path
subprocess.call([sasExe, sasEGpath])
sas_exe = path\path\
egp_path = path\path\path\
OpenProject(sas_exe, egp_path)
This depends a bit on exactly what the workflow is. A few side notes, then the full solution.
First: EGP is not really intended to store production processes, in my opinion. EGP should really be used for development, then production is done with .sas (text) files. EGP can directly store the nodes as .sas files; ask a new question about that if you want to know more, but it's pretty easy to figure out. Best practice is to have EGP save the code modules as .sas files, then run those - SASPy will easily do that for you.
Second: If you use SAS's built-in Git connectivity, then you can do this a bit more easily I suspect. Consider doing that if you already use Git for your other processes. Again, then you end up with a .sas file, and can directly run that via SASPy.
So: how can you do this in Python, with the assumption you do have to use the .egp itself, without too many different moving parts? The key here is the .egp format. EGP is a container file, which is actually a .zip format container that has in it, among other things, all of the SAS code you want to run, as text. Text in xml format, but still, text.
You can write a python program that opens the .egp as a .zip file, using the zipfile library, and then use xml.etree.ElementTree to parse the project.xml file inside that project. Exactly what you do from there depends on your particular details, and is well out of scope for a Stack Overflow answer, but if you do better visually you can simply rename the .egp to .zip and then open in unzip program of your choice, then browse project.xml in your text editor, and find the nodes and code related to those nodes.
You can then extract the .sas code as text, and submit it directly via SASPy, or extract it to a .sas file and then submit that however you prefer (SASPy or something else).
I do something similar to this for a project - I don't actually run code from it, I'm just parsing it to verify that the correct programs were synced from the EGP to production - but it would be trivial to actually submit the code from what I've written, which is about 50 lines of code total. I may write a SGF paper this year or next year on this topic, in which case I'll try and remember to submit it here - or you can head over to my github page and see if it's there (in the future!).

saving back to a docx xml file with python

I am trying to preform find and replace on docx files, whilst still maintaining formatting.
From reading up on this, the best way seems to be preforming the find/replace on the xml file of the document.
I can load in the xml file and find/replace on it, but unsure how to write it back.
docx:
Hello {text}!
python:
import zipfile
zip = zipfile.ZipFile(open('test.docx', 'rb'))
xmlString = zip.read('word/document.xml').decode('utf-8')
xmlString.replace('{text}', 'world')
What you are trying is dangerous, because you are processing a high lever docx file at a lower level. If you really want to do it, just use the hints from overwriting file in ziparchive as suggested by #shahvishal.
But unless you fully know all the details of docx format, my advice is : do not do that. Suppose there is for any reason in an internal field or attribute the string {text}. You are likely to change the file in an unexpected way leading immediately or even worse later to the destruction of the file (Word being no longer able to process it).
If you do your processing on a Windows machine with an installed word, you certainly could try to use automation to process the file with Microsoft Word. Unfortunately, I only did that long time ago and cannot give useful links. You will need:
general knowledge on pywin30 module
sufficient knowledge on the Automation interface of MS/WORD. Hopefully, its documentation is nice with many examples provided you have a full installation of Microsoft Office including macro help
You're really going to want to use a library for reading/writing docx files rather than trying to just deal with them as raw XML. A cursory search came up with the pypi module docx but I haven't used this module so I can't endorse it:
https://pypi.python.org/pypi/docx/0.2.4
I've had the (unfortunate) experience of dealing with the manipulation of MS Office documents from other programming languages, and spending the time to find good libraries really paid off.
The old saying goes "don't reinvent the wheel" and I think that's definitely true when manipulating non-trivial file formats. If a somewhat mature library exists to do the job, use it!
You would need to replace the file in the zip archive. There is no "simple" way of achieving this. The following is a question that should help:
overwriting file in ziparchive

Is there a reliable python library for taking a BibTex entry and outputting it into specific formats?

I'm developing using Python and Django for a website. I want to take a BibTex entry and output it in a view in 3 different formats, MLA, APA, and Chicago. Is there a library out there that already does this or am I going to have to manually do the string formatting?
There are the following projects:
BibtexParser
Pybtex
Pybliographer
BabyBib
If you need complex parsing and output, Pybtex is recommended. Example:
>>> from pybtex.database.input import bibtex
>>> parser = bibtex.Parser()
>>> bib_data = parser.parse_file('examples/foo.bib')
>>> bib_data.entries.keys()
[u'ruckenstein-diffusion', u'viktorov-metodoj', u'test-inbook', u'test-booklet']
>>> print bib_data.entries['ruckenstein-diffusion'].fields['title']
Predicting the Diffusion Coefficient in Supercritical Fluids
Good luck.
Having tried them, all of these projects are bad, for various reasons: terrible APIs, bad documentation, and a failure to parse valid BibTeX files. The implementation you want doesn't show up in most Google searches, from my own searching: it's biblib. This text from the README should sell it:
There are a lot of BibTeX parsers out there. Most of them are complete nonsense based on some imaginary grammar made up by the module's author that is almost, but not quite, entirely unlike BibTeX's actual grammar. BibTeX has a grammar. It's even pretty simple, though it's probably not what you think it is. The hardest part of BibTeX's grammar is that it's only written down in one place: the BibTeX source code.
The accepted answer of using pybtex is fraught with danger as Pybtex does not preserve the bibtex format of even simple bibtex files. (https://bitbucket.org/pybtex-devs/pybtex/issues/130/need-to-specially-represent-bibtex-markup)
Pybtex is therefore losing bibtex information when reading and re-writing a simple .bib file without making any changes. Users should be very careful following the recommendations to use pybtex.
I will try biblib as well and report back but the accepted answer should be edited to not recommend pybtex.
Edit:
I was able to import the data using Bibtex Parser, without any loss of data. However, I had to compile from https://github.com/sciunto-org/python-bibtexparser as the version installed via pip was bugged at the time. Users should verify that pip is getting the latest version.
As for exporting, once the data has been imported via BibTex Parser, it's in a dictionary, and can be exported as the user desires. BibTex Parser does not have built in functions for exporting in common formats. As I did not need this functionality, I didn't specifically test it. However, once imported into a dictionary, the string output can be converted to any citation format rather easily.
Here, pybtex and a custom style file can help. I used the style file provided by the journal and compiled in LaTeX instead, but PyBtex has python style files (but also allows ingesting .sty files). So I would recommend taking the Bibtex Parser input and transferring it to PyBtex (or similar) for outputting in a certain style.
The closest thing I know of is the pybtex package

What's a good document standard to use programmatically?

I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is zipped in an xml file, so I can edit that one but there's no command line utilities or any way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However it loses formatting features and looks awful. I'm surprised there isn't a well known library which lets you edit a variety of document formats and convert them into other formats, am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson
Have you looked into using LaTeX documents?
They are perfect to use programatically (compiling documents? You gotta love that...), and you have several Python frameworks you can use such as plasTeX and PyTex.
Exporting a LaTeX documents to PDF is almost immediate.
Since you're already using PyQt anyway, it might be worth looking at Qt's built-in RTF processing module which looks decent. Here's the documentation on detailed content manipulation including inserting tables. Also the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
edit: included more links.
You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.
I don't know the kind of odience of your program, Tex is good and i would go with it.
Another possible choice is Excel format, parsing it with xlrd.
I've used it a couple of time and it's pretty straightforward.
Excel file is a good for the following reasons:
Well known format easy to edit
You could prepare a predefined template with constrains and table
Creating XML documents, transforming them to XSL/fo and rendering with Fop or RenderX. If you use docbook as the primary input, there are toolchains freely available for converting that to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but is does deliver and can be embedded in an application, AFAICT.
Creating docbook is very straightforward as it has a wide range of semantic tags, table support etc to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free flow documents with lots of text.
For filling in the blanks kind of documents, a regular reporting engine may be a better fit, or some straighforward XSL stylesheets spitting out the XSL-fo directly.

What are the different options for processing uploaded PDF files in a Django application?

Our Django application needs to do a few things with uploaded PDF files:
Verify that the file is a PDF and isn't corrupted
Check that the file isn't encrypted
Count the number of pages
We run into problems with one unfortunately popular application that's idea of an unencrypted PDF export is actually an encrypted PDF file, just with a blank password. We've been working with PyPDF to date, which is unable to read those files because the encryption is non-standard. The application exporting these files is quite popular among our users, which is a pain.
Another application exported files with a bad MIME type (something other than application/pdf), so whatever we end up using needs to be able to cope with silly choking points like that.
Is there an actively maintained, robust PDF library anywhere that we could utilize? Even PDFtk, a CLI utility that a couple people have been recommending, was last updated in 2006.
Any help is appreciated.
Update: To clarify, it can be free or paid-for. Suggest whatever you think is the best option.
PDFlib is excellent, but costs money. You didn't say it had to be free, though implicitly somehow I assume you want it to be! :)

Categories

Resources