Using pandoc and pandoc-citeproc with Jupyter notebooks

I am developing an infrastructure where developers can document their verification tests using Jupyter notebooks. One part of the infrastructure will be a python script that can convert their .ipynb files to .html files to provide public-facing documentation of their tests.
Using the nbconvert module does most of what I want, but I would like to allow citations and references in the final HTML file. I can use pypandoc to generate HTML text that converts the citations to proper inlined syntax and adds a References section:
import nbformat
import pypandoc
from nbconvert import MarkdownExporter
# Read the notebook from disk and export it to Markdown
with open('SimpleExample.ipynb', encoding='utf-8') as f:
    notebook = nbformat.read(f, as_version=4)
exporter = MarkdownExporter()
(body, resources) = exporter.from_notebook_node(notebook)
# Let pandoc-citeproc resolve the citations and append a reference list
filters = ['pandoc-citeproc']
extra_args = ['--bibliography=ref.bib',
              '--reference-links',
              '--csl=MWR.csl']
new_body = pypandoc.convert_text(body,
                                 'html',
                                 'md',
                                 filters=filters,
                                 extra_args=extra_args)
The problem is that this generated HTML loses all of the considerable formatting and other capabilities provided by nbconvert.HTMLExporter.
My question is, is there a straightforward way to merge the results of nbconvert.HTMLExporter and pypandoc.convert_text() such that I get mostly the former, with inline citations and a Reference section added from the latter?

I don't know that this necessarily counts as "straightforward" but I was able to come up with a solution. It involves writing a class that inherits from nbconvert.preprocessors.Preprocessor and implements the preprocess(self, nb, resources) method. Here is what preprocess() does:
1. Loop over every cell in the notebook and store a set of citation keys (these are of the form [@bibtex_key]).
2. Create a short body of text consisting of only these citation keys, each separated by '\n\n'.
3. Use the pandoc conversion above to generate HTML text from this short body of text. If num_cite is the number of citations, the first num_cite lines of the generated text will be the inline versions of the citations (e.g. '(Author, Year)'); the remaining lines will be the content of the references section.
4. Go back through each cell and substitute the inline text of each citation for its key.
5. Add a cell to the notebook with ## References.
6. Add a cell to the notebook with the content of the references section.
Now, when an HTMLExporter, using this Preprocessor, converts a notebook, the results will have inline citations, a reference section, and all of the formatting you expect from the HTMLExporter.
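The key-collection and substitution steps can be sketched with the standard library alone. This is only an illustration, not the code from the answer: it assumes pandoc's [@key] citation syntax, treats cells as plain dicts shaped like notebook JSON, and replaces the pandoc call with a hand-written mapping. The names collect_keys, pandoc_input and substitute are hypothetical.

```python
import re

# Matches a single pandoc-style citation such as [@smith2020]
CITE_RE = re.compile(r"\[@([^\]\s;]+)\]")


def collect_keys(cells):
    """Return the citation keys found in markdown cells, in first-seen order."""
    keys = []
    for cell in cells:
        if cell["cell_type"] != "markdown":
            continue
        for key in CITE_RE.findall(cell["source"]):
            if key not in keys:
                keys.append(key)
    return keys


def pandoc_input(keys):
    """Build the short body of text that is handed to pandoc."""
    return "\n\n".join("[@{}]".format(key) for key in keys)


def substitute(cells, inline_by_key):
    """Replace each citation key with its rendered inline text, in place."""
    def repl(match):
        # Leave unknown keys untouched rather than dropping them
        return inline_by_key.get(match.group(1), match.group(0))

    for cell in cells:
        if cell["cell_type"] == "markdown":
            cell["source"] = CITE_RE.sub(repl, cell["source"])
    return cells


cells = [
    {"cell_type": "markdown", "source": "As shown in [@doe2010] and [@roe2015]."},
    {"cell_type": "code", "source": "x = '[@not_a_citation]'"},
    {"cell_type": "markdown", "source": "See [@doe2010] again."},
]
keys = collect_keys(cells)
# Stand-in for the pandoc output (an assumption for this sketch): in a real
# preprocessor these strings would be parsed from the converted text.
inline = {"doe2010": "(Doe 2010)", "roe2015": "(Roe 2015)"}
substitute(cells, inline)
```

In an actual Preprocessor subclass, the same logic would run inside preprocess(self, nb, resources) over nb.cells, followed by appending the two reference cells.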

Related

Extracting text from MS Word Document uploaded through FileUpload from ipyWidgets in Jupyter Notebook

I am trying to let a user upload an MS Word file and then run a function that takes a string as its input argument. I am uploading the Word file through FileUpload, but I get an encoded object back. I am unable to decode it as UTF-8, and upload.value or upload.data just returns encoded text.
Any ideas how I can extract the content from the uploaded Word file?
> import ipywidgets as widgets
> upload = widgets.FileUpload()
> upload
> # I select the file I want to upload
> upload.value  # Returns encoded text
> upload.data  # Returns encoded text
> # Previously upload['content'] worked, but I read this no longer works in ipywidgets 8.0

Modern MS Word files (.docx) are actually zip files.
The text (but not the page headers) is inside an XML document called word/document.xml in the zip file.
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can also read existing ones. Example from here.
>>> import docx
>>> doc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...
Note that this only extracts the text from paragraphs, not, for example, the text from tables.
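To make the zip-file point concrete, here is a minimal sketch that reads word/document.xml with the standard library only. The function name docx_paragraphs is hypothetical, and the approach is deliberately naive: it joins the text runs of every paragraph element it finds (including those nested in tables) and ignores headers, footers and footnotes. For real documents python-docx is likely the more robust choice.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(raw_bytes):
    """Return the text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Because it takes raw bytes, the same function can be fed the contents of an uploaded file directly, without writing anything to disk.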
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. upload.data is actually a list holding the raw bytes of each uploaded file, see here. So do something like:
rawdata = upload.data[0]
(Note that this format has changed across versions of ipywidgets. The above example is from the documentation of the latest version. Read the documentation for the version you are using, or inspect the data in IPython, and adjust accordingly.)
One way is to write rawdata to e.g. foo.docx and open that. That would certainly work, but it does seem somewhat inelegant.
Alternatively, docx.Document can work with file-like objects, so you could create an io.BytesIO object and use that.
Like this:
import io
import docx
foo = io.BytesIO(rawdata)
doc = docx.Document(foo)

After some tweaking based on @Roland Smith's great suggestions, the following code finally worked:
import io
import ipywidgets as widgets
from docx import Document
upload = widgets.FileUpload()
upload
# After selecting a file in the widget:
rawdata = upload.data[0]
test = io.BytesIO(rawdata)
doc = Document(test)
for p in doc.paragraphs:
    print(p.text)
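Since the layout of the uploaded data differs between ipywidgets versions, a small helper can hide the difference. The sketch below is based on my reading of the ipywidgets documentation and should be treated as an assumption: in 7.x, upload.value is taken to be a dict keyed by file name whose entries hold the bytes under 'content', and in 8.x a tuple of dict-like entries whose 'content' is a memoryview. The helper name first_upload_bytes is hypothetical.

```python
def first_upload_bytes(value):
    """Return the raw bytes of the first uploaded file, or None if nothing was uploaded."""
    if not value:
        return None
    if isinstance(value, dict):
        # Assumed ipywidgets 7.x shape: {filename: {'metadata': ..., 'content': bytes}}
        entry = next(iter(value.values()))
    else:
        # Assumed ipywidgets 8.x shape: ({'name': ..., 'content': memoryview, ...}, ...)
        entry = value[0]
    # bytes() copies a memoryview and returns an equal object for bytes input
    return bytes(entry["content"])
```

With this in place, something like Document(io.BytesIO(first_upload_bytes(upload.value))) should work in either version, provided the assumed shapes match the installed release.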

Python library for dynamic documents

I want to write a script that generates reports for each team in my unit, where each report uses the same template but is filled with the numbers specific to that team. The report should be in a format like .pdf that non-programmers know how to open and read. This is in many ways similar to rmarkdown for R, but the reports I want to generate are based on data from code already written in Python.
The solution I am looking for does not need to export directly to pdf. It can export to markdown and then I know how to convert. I do not need any fancier formatting than what markdown provides. It does not need to be markdown, but I know how to do everything else in markdown, if I only find a way to dynamically populate numbers and text in a markdown template from python code.
What I need is something similar to the code block below, but on a bigger scale, and instead of printing the output on screen it would be saved to a file (.md or .pdf) that can then be shared with each team.
user = {'name':'John Doe', 'email':'jd@example.com'}
print('Name is {}, and email is {}'.format(user["name"], user["email"]))
So the desired functionality, heavily influenced by my previous experience using rmarkdown, would look something like the code block below, where the template is a string or a file read as a string, with placeholders that will be populated from variables (or dicts or objects) from the Python code. Then the output can be saved and shared with the teams.
user = {'name':'John Doe', 'email':'jd@example.com'}
template = 'Name is `user["name"]`, and email is `user["email"]`'
output = render(template, user)
When trying to find a rmarkdown equivalent in python, I have found a lot of pointers to Jupyter Notebook which I am familiar with, and very much like, but it is not what I am looking for, as the point is not to share the code, only a rendered output.

Since this question was up-voted I want to answer my own question, as I found a solution that was perfect for me. In the end I shared these reports in a repo, so I write the reports in markdown and do not convert them to PDF. The reason I still think this is an answer to my original question is that this works similarly to creating markdown in Rmarkdown, which was the core of my question, and markdown can easily be converted to PDF.
I solved this by using a templating library meant for backend-generated HTML pages. I happened to use jinja2, but there are many other options.
First you need a template file in markdown. Let's say this is template.md:
## Overview
**Name:** {{repo.name}}<br>
**URL:** {{repo.url}}
| Branch name | Days since last edit |
|---|---|
{% for branch in repo.branches %}
|{{branch[0]}}|{{branch[1]}}|
{% endfor %}
Then use this in your Python script:
from jinja2 import Template
# Create a dict with all the data that will populate the template
repo = {
    'name': 'training-kit',
    'url': 'https://github.com/github/training-kit',
    'branches': [
        ['master', 15],
        ['dev', 2],
    ],
}
# Render the template
with open('template.md', 'r', encoding='utf-8') as file:
    template = Template(file.read(), trim_blocks=True)
rendered_file = template.render(repo=repo)
# Write the output file
with open('report.md', 'w', encoding='utf-8') as output_file:
    output_file.write(rendered_file)
If you are OK with your dynamic document being in markdown, you are done and the report is written to report.md. If you want a PDF, you can use pandoc to convert it.
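For very simple reports with no loops or conditionals, the standard library may be enough. The following is a minimal sketch, not part of the solution above: it uses string.Template with $name-style placeholders, and render is just the hypothetical function name from the question.

```python
from string import Template


def render(template_text, values):
    """Fill $placeholders in template_text from the values mapping."""
    # substitute() raises KeyError if a placeholder has no matching key
    return Template(template_text).substitute(values)


user = {'name': 'John Doe', 'email': 'jd@example.com'}
output = render('Name is $name, and email is $email', user)
```

The trade-off is that string.Template has no loops, so something like the branch table in template.md would still need jinja2 or manual string building.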

I would strongly recommend installing and using the pyFPDF library, which enables you to write and export PDF files directly from Python. The library was ported from PHP and offers the same functionality as its PHP variant.
1.) Clone and install pyFPDF
Git-Bash:
git clone https://github.com/reingart/pyfpdf.git
cd pyfpdf
python setup.py install
2.) After successful installation, you can use Python code much as you would work with FPDF in PHP, for example:
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.set_xy(0, 0)
pdf.set_font('arial', 'B', 13.0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt="Hello", border=0)
pdf.output('myTest.pdf', 'F')
For more Information, take a look at:
https://pypi.org/project/fpdf/
To work with pyFPDF clone repo from: https://github.com/reingart/pyfpdf
pyFPDF Documentation:
https://pyfpdf.readthedocs.io/en/latest/Tutorial/index.html

Parsing YAML out of a Markdown file

I am working with some legacy code that I have inherited (i.e., many of these design decisions were not mine).
The code takes a directory organized into subdirectories with markdown files, and compiles them into one large markdown file (using Markdown-PP: https://github.com/jreese/markdown-pp). Then it converts this file into HTML (using pandoc: https://pandoc.org/), and finally into a PDF (using wkhtmltopdf: https://wkhtmltopdf.org/).
The problem that I am running into is that many of the original markdown files have YAML metadata headers. When stitched together by Markdown-PP, the large markdown ends up with numerous YAML metadata blocks interspersed throughout. Most of this metadata is lost when converting into HTML because of the way pandoc processes YAML (many of the headers use the same key names, and pandoc combines the separate YAML headers and only preserves the first value of the corresponding key).
I originally had no YAML appearing in the HTML, but was able to change this by correctly modifying the HTML template for pandoc. But I only get the first value for each corresponding key. It was not clear if there was a way around this in pandoc, so I instead looked into trying to process the YAML into HTML before the pandoc step. I have tried parsing the YAML in the combined markdown using PyYAML (yaml.load_all()) but only get the first YAML block to appear.
An example of a YAML block:
---
author: foo
size_minimum: 100
time_req_minutes: 120
# and so on
---
The issue is that each of the 20+ modules in the final document has this associated metadata.
To try to parse the YAML, I used code borrowed, with a few modifications, from this post: Is it possible to use PyYAML to read a text file written with a "YAML front matter" block inside?
import yaml
def get_yaml(f):
    # Return the text of a leading '---' block, or '' if the file does not start with one
    pointer = f.tell()
    if f.readline() != '---\n':
        f.seek(pointer)
        return ''
    readline = iter(f.readline, '')
    readline = iter(readline.__next__, '---\n')  # underscores needed for Python 3?
    return ''.join(readline)
# Removed sys.argv, not sure what it was doing
filepath = 'combined.md'  # placeholder path to the combined markdown file
with open(filepath, encoding='UTF-8') as f:
    # load_all returns a generator over all YAML documents, so wrap it in list();
    # the Loader argument is required by recent PyYAML versions
    config = list(yaml.load_all(get_yaml(f), Loader=yaml.SafeLoader))
    text = f.read()
    print("TEXT from", f)
    #print(text)
    print("CONFIG from", f)
    print(config)
But even this only resulted in the first YAML block being read and output.
I would like to be able to parse the YAML from the large markdown files, and replace it in the correct place with the corresponding HTML. I am just not sure if these (or any) packages are capable of doing so. It may be that I just need to manually change the YAML to HTML in the original Markdown files (time intensive, but I could probably already be done with it if I had started that way).

What about this library: https://github.com/eyeseast/python-frontmatter
It parses both the front-matter and the Markdown in the file, placing the Markdown part in the content attribute of the resulting object.
Works with both front-matter containing and front-matterless (is there such a word?) files.
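Because the combined file contains many metadata blocks rather than a single leading one, it may help to pull out every block first and parse each on its own. The sketch below is a naive, standard-library-only illustration and not a feature of python-frontmatter: it assumes every line consisting solely of '---' delimits a YAML block, so a Markdown horizontal rule written as '---' would confuse it. The name split_yaml_blocks is hypothetical.

```python
import re

# A block opens with a line containing only '---' and closes at the next such line
BLOCK_RE = re.compile(r"^---[ \t]*\n(.*?)\n---[ \t]*$", re.MULTILINE | re.DOTALL)


def split_yaml_blocks(text):
    """Return (list of raw YAML block bodies, text with the blocks removed)."""
    blocks = [match.group(1) for match in BLOCK_RE.finditer(text)]
    stripped = BLOCK_RE.sub("", text)
    return blocks, stripped


combined = (
    "---\nauthor: foo\n---\n# Module one\n\n"
    "---\nauthor: bar\nsize_minimum: 100\n---\n# Module two\n"
)
blocks, markdown_only = split_yaml_blocks(combined)
```

Each entry in blocks could then be passed to yaml.safe_load individually, which avoids the key collisions that occur when all headers are merged. To put generated HTML back in the right place, BLOCK_RE.sub also accepts a function that receives each match and returns its replacement.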

pdfminer doesn't extract data from filled-out pdf form

I'm trying to use pdfminer to extract the filled-out contents in a pdf form. The instructions for accessing the pdf are:
Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1073757&parDT_END=99991231
Click "Create Report" next to the fourth report from the top (i.e., Banking Organization Systemic Risk Report (FR Y-15))
Click "Your request for a financial report is ready"
To extract the contents in blue, I copied code from this post:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
filename = 'FRY15_1073757_20160630.PDF'
with open(filename, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    # Iterate over the top-level entries of the AcroForm Fields array
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        print('{0}: {1}'.format(name, value))
This didn't extract the data fields as expected -- nothing was printed. I tried the same code on another pdf and it worked so I suspect the failure might have to do with the security setting of the first pdf, which is shown below
For the second pdf on which the code worked, the security setting shows "Allowed" for all the actions. I also tried using pdfminer's pdf2txt.py functionality (see here) but the filled-out data in the fields in the original pdf form (which is what I want) was not in the converted text file; only the "flat" non-fillable part of the pdf was converted. Interestingly, if I use Adobe Reader's Save As Text to convert the pdf to a text file, the fillable part was in the converted text file. This is what I've been doing to get around the failed code.
Any idea how I can extract data directly from the pdf form? Thanks.

I can only explain what the problem is but cannot present a solution because I have no working Python knowledge.
Your code iterates over the immediate children of the AcroForm Fields array and expects them to represent the form fields.
While this expectation is often fulfilled, it actually only covers a special case: form fields are arranged in a tree structure with that Fields array as the root element, and your sample document contains a large tree.
Thus, you have to descend into the structure, not merely iterate over the immediate children of Fields, to find all form fields.
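A recursive descent of that kind might look like the sketch below. It is written against plain dicts so it runs without pdfminer; the function name walk_fields and its resolve parameter are hypothetical, and the handling of unnamed children is a simplifying assumption (children without a 'T' entry are treated as widget annotations and skipped).

```python
def walk_fields(field_refs, resolve=lambda obj: obj, parent_name=""):
    """Yield (fully_qualified_name, value) for every terminal field in the tree."""
    for ref in field_refs:
        field = resolve(ref)
        partial = field.get("T")
        if partial is None:
            # No partial name: assumed to be a widget annotation, not a field
            continue
        if isinstance(partial, bytes):
            partial = partial.decode("utf-8", errors="replace")
        name = parent_name + "." + partial if parent_name else partial
        kids = resolve(field.get("Kids", []))
        named_kids = [kid for kid in kids if resolve(kid).get("T") is not None]
        if named_kids:
            # Intermediate node: descend into its child fields
            yield from walk_fields(named_kids, resolve, name)
        else:
            yield name, field.get("V")


# Toy field tree standing in for a resolved AcroForm structure
tree = [
    {"T": "form", "Kids": [
        {"T": "page1", "Kids": [
            {"T": "name", "V": "Acme", "Kids": [{"Subtype": "Widget"}]},
            {"T": b"total", "V": 42},
        ]},
    ]},
    {"T": "date", "V": "2016-06-30"},
]
result = list(walk_fields(tree))
```

With pdfminer, the same function could presumably be called as walk_fields(resolve1(doc.catalog['AcroForm'])['Fields'], resolve=resolve1), so that indirect object references are resolved at each level.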

How do you change the code example font size in LaTeX PDF output with Sphinx?

I find the default code example font in the PDF generated by Sphinx to be far too large.
I've tried getting my hands dirty in the generated .tex file inserting font size commands like \tiny above the code blocks, but it just makes the line above the code block tiny, not the code block itself.
I'm not sure what else to do - I'm an absolute beginner with LaTeX.

I worked it out. Pygments uses a \begin{Verbatim} block to denote code snippets, which uses the fancyvrb package. The documentation I found (warning: PDF) mentions a formatcom option for the verbatim block.
Pygments' latex writer source indicates an instance variable, verboptions, is stapled to the end of each verbatim block and Sphinx' latex bridge lets you replace the LatexFormatter.
At the top of my conf.py file, I added the following:
from sphinx.highlighting import PygmentsBridge
from pygments.formatters.latex import LatexFormatter
class CustomLatexFormatter(LatexFormatter):
    def __init__(self, **options):
        super(CustomLatexFormatter, self).__init__(**options)
        self.verboptions = r"formatcom=\footnotesize"
PygmentsBridge.latex_formatter = CustomLatexFormatter
\footnotesize was my preference, but a list of sizes is available here

To change LaTeX output options in Sphinx, set the relevant latex_elements key in the build configuration file; documentation on this is located here.
To change the font size for all fonts use pointsize.
E.g.
latex_elements = {
    'pointsize': '10pt'
}
To change other LaTeX settings that are listed in the documentation, use preamble or use a custom document class in latex_documents.
E.g.
mypreamble = '''customlatexstuffgoeshere
'''
latex_elements = {
    'papersize': 'letterpaper',
    'pointsize': '11pt',
    'preamble': mypreamble
}
From reading the Sphinx source code, the code in LatexWriter by default sets code snippets with the \code LaTeX command.
So what you want to do is replace \code with a suitable alternative.
This is done by including a LaTeX command like \newcommand{\code}[1]{\texttt{\tiny{#1}}} either as part of the preamble or as part of a custom document class for Sphinx that gets set in latex_documents as the documentclass key. An example Sphinx document class is available here.
Other than just making it smaller with \tiny, you can modify the latex_documents document class or the latex_elements preamble to use the LaTeX package listings for fancier code formatting, as in the StackOverflow question here.
The package setup from the linked post would go in a custom document class, and a redefinition similar to \newcommand{\code}[1]{\begin{lstlisting} #1 \end{lstlisting}} would be part of the preamble.
Alternatively, you could write a Sphinx extension that extends the default LaTeX writer with a custom LaTeX writer of your choosing, though that is significantly more effort.
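Newer Sphinx releases may make this simpler. As far as I understand the latex_elements documentation, Sphinx 1.8 and later accept an 'fvset' key whose value is passed to the fancyvrb package, which typesets the code blocks. The conf.py fragment below is a sketch under that assumption; check the documentation for your Sphinx version before relying on it.

```python
# conf.py fragment (assumes Sphinx >= 1.8, where latex_elements accepts 'fvset')
latex_elements = {
    'pointsize': '10pt',
    # Handed to fancyvrb; sets the font size used inside code blocks
    'fvset': r'\fvset{fontsize=\footnotesize}',
}
```

Any other LaTeX size command, such as \small or \scriptsize, can be used in place of \footnotesize.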
Other relevant StackOverflow questions include
Creating Math Macros with Sphinx
How do I disable colors in LaTeX output generated from sphinx?
sphinx customization of latexpdf output?

You can add a modified Verbatim command to your preamble (note that in this case the font size is changed to \tiny):
\renewcommand{\Verbatim}[1][1]{%
% list starts new par, but we don't want it to be set apart vertically
\bgroup\parskip=0pt%
\smallskip%
% The list environment is needed to control perfectly the vertical
% space.
\list{}{%
\setlength\parskip{0pt}%
\setlength\itemsep{0ex}%
\setlength\topsep{0ex}%
\setlength\partopsep{0pt}%
\setlength\leftmargin{10pt}%
}%
\item\MakeFramed {\FrameRestore}%
\tiny % <---------------- To be changed!
\OriginalVerbatim[#1]%
}
