I am working with some legacy code that I have inherited (i.e., many of these design decisions were not mine).
The code takes a directory organized into subdirectories with markdown files, and compiles them into one large markdown file (using Markdown-PP: https://github.com/jreese/markdown-pp). Then it converts this file into HTML (using pandoc: https://pandoc.org/), and finally into a PDF (using wkhtmltopdf: https://wkhtmltopdf.org/).
The problem that I am running into is that many of the original markdown files have YAML metadata headers. When stitched together by Markdown-PP, the large markdown file ends up with numerous YAML metadata blocks interspersed throughout. Most of this metadata is lost in the conversion to HTML because of the way pandoc processes YAML: many of the headers use the same key names, and pandoc merges the separate YAML headers, preserving only the first value for each key.
Originally no YAML appeared in the HTML at all, but I was able to change this by modifying pandoc's HTML template; even so, I still only get the first value for each key. It was not clear whether there was a way around this in pandoc, so I instead looked into converting the YAML to HTML before the pandoc step. I have tried parsing the YAML in the combined markdown using PyYAML (yaml.load_all()), but only the first YAML block appears.
An example of a YAML block:
---
author: foo
size_minimum: 100
time_req_minutes: 120
# and so on
---
The issue is that each of the 20+ modules in the final document has this associated metadata.
To try to parse the YAML, I used code borrowed (with a few modifications) from this post: Is it possible to use PyYAML to read a text file written with a "YAML front matter" block inside?
import yaml

# (I removed the sys.argv handling from the original post; I'm not sure
# what it was doing.)
def get_yaml(f):
    pointer = f.tell()
    if f.readline() != '---\n':
        f.seek(pointer)
        return ''
    readline = iter(f.readline, '')
    # stop at the closing '---' (readline.__next__ is the Python 3
    # spelling; in Python 2 it was readline.next)
    readline = iter(readline.__next__, '---\n')
    return ''.join(readline)
with open(filepath, encoding='UTF-8') as f:
    # load_all() to get all the YAML documents; the Loader argument is
    # required by recent PyYAML, and list() because load_all() returns
    # a generator
    config = list(yaml.load_all(get_yaml(f), Loader=yaml.SafeLoader))
    text = f.read()
    print("TEXT from", f)
    #print(text)
    print("CONFIG from", f)
    print(config)
But even this only resulted in the first YAML block being read and output (in hindsight this makes sense: get_yaml() stops at the first closing ---, so yaml.load_all() only ever sees a single document).
I would like to be able to parse the YAML from the large markdown file and replace it, in the correct place, with the corresponding HTML. I am just not sure whether these (or any) packages are capable of doing so. It may be that I simply need to manually change the YAML to HTML in the original markdown files (time intensive, but I could probably already be done with it if I had started that way).
What about this library: https://github.com/eyeseast/python-frontmatter
It parses both the front-matter and the Markdown in the file, placing the Markdown part in the content attribute of the resulting object.
Works with both front-matter containing and front-matterless (is there such a word?) files.
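A basic usage sketch ('module.md' is an illustrative file name):
import frontmatter

post = frontmatter.load('module.md')
print(post['author'])    # metadata keys behave like dict keys
print(post.metadata)     # the whole YAML header as a dict
print(post.content)      # the markdown body with the header stripped
Note that it only parses the front matter at the top of a file. For the combined document, which contains many YAML blocks, you would still need to split it into per-module chunks first, or rewrite the blocks in place before the pandoc step. A minimal sketch of the latter, assuming every block is fenced by --- lines at the start of a line and holds a simple key: value mapping (the HTML rendering and file names are illustrative):
import re
import yaml

# matches one fenced YAML block: an opening '---' line, the YAML body,
# and a closing '---' line
YAML_BLOCK = re.compile(r'^---\s*\n(.*?)^---\s*\n', re.M | re.S)

def yaml_to_html(match):
    # render one metadata block as an HTML definition list
    meta = yaml.safe_load(match.group(1))
    items = ''.join('<dt>{}</dt><dd>{}</dd>'.format(k, v) for k, v in meta.items())
    return '<dl>{}</dl>\n'.format(items)

with open('combined.md', encoding='utf-8') as f:
    text = f.read()

# replace every YAML block with HTML before handing the file to pandoc
with open('combined_nometa.md', 'w', encoding='utf-8') as f:
    f.write(YAML_BLOCK.sub(yaml_to_html, text))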
Related
I am working on a PDF file dedupe project and have analyzed many Python libraries that read files, generate a hash value, and compare it with the next file to detect duplicates (similar to the logic below, or using Python's filecmp module). But the issue I found with this logic is that if a PDF is generated twice from the same source DOCX (Save to PDF), the outputs are not considered duplicates, even when the content is exactly the same. Why does this happen? Is there any other logic that reads the content and creates a unique hash value based on the actual content?
import hashlib

def calculate_hash_val(path, blocks=65536):
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        # read in fixed-size blocks; the original read() call ignored the
        # blocks argument and slurped the whole file at once
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()
One of the things that happens is that metadata, including the time of creation, is saved to the file. It is invisible in the rendered PDF, but it will make the hashes different.
Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
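Alternatively, if you want a digest based on the actual content rather than the raw bytes, you can hash the extracted text instead. A minimal sketch, assuming the third-party pypdf package (pip install pypdf); note that text extraction can still differ between PDF producers, so normalizing whitespace helps:
import hashlib
from pypdf import PdfReader  # third-party: pip install pypdf

def content_hash(path):
    # hash the extracted text, so metadata such as the creation
    # timestamp no longer affects the digest
    reader = PdfReader(path)
    text = "".join(page.extract_text() or "" for page in reader.pages)
    normalized = " ".join(text.split())  # collapse whitespace differences
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()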
I am trying to allow a user to upload an MS Word file, and then I run a certain function that takes a string as its input argument. I am uploading the Word file through FileUpload; however, I am getting an encoded object. I am unable to decode it as UTF-8 bytes, and upload.value or upload.data just returns the encoded content.
Any ideas how I can extract the content from the uploaded Word file?
upload = widgets.FileUpload()
upload
# I select the file I want to upload
upload.value  # returns encoded content
upload.data   # returns encoded content
# Previously upload['content'] worked, but I read this no longer works
# in ipywidgets 8.0
Modern MS Word files (.docx) are actually zip files.
The text (but not the page headers) is inside an XML document called word/document.xml in the zip file.
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. Example from here.
>>> import docx
>>> gkzDoc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in gkzDoc.paragraphs:
...     fullText.append(paragraph.text)
...
Note that this will only extract the text from paragraphs, not, for example, the text inside tables (see the sketch below).
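If you also need the table text, the tables have to be walked explicitly; a short sketch using the same document object:
# tables are not part of doc.paragraphs, so walk them separately
for table in gkzDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            fullText.append(cell.text)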
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. upload.data is a list holding the raw bytes of each uploaded file, see here. So do something like:
rawdata = upload.data[0]
(Note that this format has changed across different versions of ipywidgets. The above example is from the documentation of the latest version. Read the relevant version of the documentation, or inspect the data in IPython, and adjust accordingly.)
You could write rawdata to e.g. foo.docx and open that. That would certainly work, but it seems somewhat inelegant.
docx.Document can work with file-like objects. So you could create an io.BytesIO object, and use that.
Like this:
import io

foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
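If you are on ipywidgets 8 (which the question mentions), the raw bytes are reached differently: upload.value is a tuple of dicts and the file content is a memoryview. A hedged sketch:
import io
import docx
import ipywidgets as widgets

upload = widgets.FileUpload(accept='.docx')
upload  # display the widget and pick a file before running the rest

# ipywidgets 8: the file content lives under the 'content' key
rawdata = bytes(upload.value[0]['content'])
doc = docx.Document(io.BytesIO(rawdata))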
Tweaking @Roland Smith's great suggestions, the following code finally worked:
import io
from docx import Document
import ipywidgets as widgets

upload = widgets.FileUpload()
upload
rawdata = upload.data[0]
test = io.BytesIO(rawdata)
doc = Document(test)
for p in doc.paragraphs:
    print(p.text)
Python 3.7 with Camelot 0.7.3. Currently, Camelot exports the converted file with 'page--table-' appended to the file name; we have very specific file name requirements for our application, and I'm trying to export the file without that extra string appended to the file name. Is this possible? The documentation does not mention anything about how to get around this.
The documentation does not mention anything about how to get around this.
I'm not sure what you mean. https://camelot-py.readthedocs.io/en/master/ says:
Here’s how you can extract tables from PDF files. Check out the PDF
used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
tables.export exports all the tables in the PDF to separate files, so it needs to distinguish them by filename.
If you only need to export a specific table, use the example further down on the page:
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
This passes the filename unchanged to pandas.DataFrame.to_csv, as can be seen in https://github.com/camelot-dev/camelot/blob/master/camelot/core.py#L571.
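So if you need full control over every filename, you can skip tables.export and loop over the tables yourself; a small sketch (the naming scheme is purely illustrative):
import camelot

tables = camelot.read_pdf('foo.pdf', pages='all')
for i, table in enumerate(tables):
    # to_csv passes the name through unchanged, so nothing is appended
    table.to_csv('my_report_{}.csv'.format(i))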
I want to write a script that generates reports for each team in my unit, where each report uses the same template but the numbers specific to each team are filled in for each report. The report should be in a format like .pdf that non-programmers know how to open and read. This is in many ways similar to rmarkdown for R, but the reports I want to generate are based on data from code already written in Python.
The solution I am looking for does not need to export directly to PDF. It can export to markdown; I know how to convert from there. I do not need any fancier formatting than what markdown provides. It does not even need to be markdown, but I know how to do everything else in markdown if I can only find a way to dynamically populate numbers and text in a markdown template from Python code.
What I need is something similar to the code block below, but on a bigger scale, and instead of printing the output on screen it would be saved to a file (.md or .pdf) that can then be shared with each team.
user = {'name':'John Doe', 'email':'jd@example.com'}
print('Name is {}, and email is {}'.format(user["name"], user["email"]))
So the desired functionality, heavily influenced by my previous experience using rmarkdown, would look something like the code block below, where the template is a string (or a file read as a string) with placeholders that are populated from variables (or dicts or objects) in the Python code. The output can then be saved and shared with the teams.
user = {'name':'John Doe', 'email':'jd@example.com'}
template = 'Name is `user["name"]`, and email is `user["email"]`'
output = render(template, user)
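(For what it's worth, a templating engine such as jinja2, which the answer below settles on, provides almost exactly this render call; a minimal sketch:)
from jinja2 import Template

user = {'name': 'John Doe', 'email': 'jd@example.com'}
template = Template('Name is {{ user.name }}, and email is {{ user.email }}')
print(template.render(user=user))
# Name is John Doe, and email is jd@example.com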
When trying to find a rmarkdown equivalent in python, I have found a lot of pointers to Jupyter Notebook which I am familiar with, and very much like, but it is not what I am looking for, as the point is not to share the code, only a rendered output.
Since this question was upvoted, I want to answer my own question, as I found a solution that was perfect for me. In the end I shared these reports in a repo, so I write the reports in markdown and do not convert them to PDF. The reason I still think this answers my original question is that it works similarly to creating markdown in Rmarkdown, which was the core of my question, and markdown can easily be converted to PDF.
I solved this by using a templating library for backend-generated HTML pages. I happened to use jinja2, but there are many other options.
First you need a template file in markdown. Let's say this is template.md:
## Overview
**Name:** {{repo.name}}<br>
**URL:** {{repo.url}}
| Branch name | Days since last edit |
|---|---|
{% for branch in repo.branches %}
|{{branch[0]}}|{{branch[1]}}|
{% endfor %}
And then you use it in your Python script:
from jinja2 import Template
import codecs

# create a dict with all the data that will populate the template
# (jinja2 resolves {{repo.name}} via item lookup on a dict)
repo = {}
repo['name'] = 'training-kit'
repo['url'] = 'https://github.com/github/training-kit'
repo['branches'] = [
    ['master', 15],
    ['dev', 2],
]

# render the template
with open('template.md', 'r') as file:
    template = Template(file.read(), trim_blocks=True)
rendered_file = template.render(repo=repo)

# output the file
output_file = codecs.open("report.md", "w", "utf-8")
output_file.write(rendered_file)
output_file.close()
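For reference, the rendered report.md then looks something like this:
## Overview
**Name:** training-kit<br>
**URL:** https://github.com/github/training-kit
| Branch name | Days since last edit |
|---|---|
|master|15|
|dev|2|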
If you are OK with your dynamic doc being in markdown, you are done: the report is written to report.md. If you want PDF, you can use pandoc to convert (e.g. pandoc report.md -o report.pdf).
I would strongly recommend installing and using the pyFPDF library, which enables you to write and export PDF files directly from Python. The library was ported from PHP and offers the same functionality as its PHP variant.
1.) Clone and install pyFPDF
Git-Bash:
git clone https://github.com/reingart/pyfpdf.git
cd pyfpdf
python setup.py install
2.) After successful installation, you can use Python code much as you would work with fpdf in PHP:
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.set_xy(0, 0)
pdf.set_font('arial', 'B', 13.0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt="Hello", border=0)
pdf.output('myTest.pdf', 'F')
For more Information, take a look at:
https://pypi.org/project/fpdf/
To work with pyFPDF clone repo from: https://github.com/reingart/pyfpdf
pyFPDF Documentation:
https://pyfpdf.readthedocs.io/en/latest/Tutorial/index.html
I am using h5py to save data (float numbers) in groups. In addition to the data itself, I need to include an additional file (an .xml file containing necessary information) within the HDF5 file. How do I do this? Is my approach wrong?
import h5py

f = h5py.File('filename.h5', 'w')
f.create_dataset('/data/1', data=numpy_array_1)
f.create_dataset('/data/2', data=numpy_array_2)
.
.
My h5 tree should look like this:
/
/data
/data/1 (numpy_array_1)
/data/2 (numpy_array_2)
.
.
/morphology.xml (?)
One option is to add it as a variable-length string dataset.
http://code.google.com/p/h5py/wiki/HowTo#Variable-length_strings
E.g.:
import h5py

xmldata = """<xml>
<something>
<else>Text</else>
</something>
</xml>
"""

# Write the xml file...
f = h5py.File('test.hdf5', 'w')
# h5py.special_dtype replaces the old h5py.new_vlen API
str_type = h5py.special_dtype(vlen=str)
ds = f.create_dataset('something.xml', shape=(1,), dtype=str_type)
ds[:] = xmldata
f.close()

# Read the xml file back...
f = h5py.File('test.hdf5', 'r')
print(f['something.xml'][0])
If you just need to attach the XML file to the hdf5 file, you can add it as an attribute to the hdf5 file.
with open('morphology.xml', 'rb') as xmlfh:
    h5f.attrs['xml'] = xmlfh.read()
You can access the xml file then like this:
h5f.attrs['xml']
Note also that you can't store attributes larger than 64K, so you may want to compress the data before attaching it; have a look at the compression libraries in Python's standard library (e.g. zlib).
However, this doesn't make the information in the XML file very accessible. If you want to associate the metadata of each dataset to some metadata in the XML file, you could map it as you need using an XML library like lxml. You can also add each field of the XML data as a separate attribute so that you can query datasets by XML field, this all depends on what you have in the XML file. Try to think about how you would like to retrieve the data later.
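For example, a rough sketch of that per-field mapping, assuming the lxml package (the element-to-attribute scheme here is purely illustrative):
import h5py
from lxml import etree

tree = etree.parse('morphology.xml')
with h5py.File('filename.h5', 'a') as h5f:
    ds = h5f['/data/1']
    # copy each leaf element's text into a dataset attribute,
    # keyed by the element's tag name
    for elem in tree.iter():
        if len(elem) == 0 and elem.text and elem.text.strip():
            ds.attrs[elem.tag] = elem.text.strip()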
You may also want to create a group for each XML file together with its datasets and put it all in a single HDF5 file. I don't know how large the files you are managing are; YMMV.