How could one read a markdown table from a text block and import it into a variable in a code block in the same PowerShell notebook?
I attempted to import it with PowerShell:
$b=Import-Csv -Path <> -Delimiter '|'
Couldn't figure out how to point the -Path parameter to the text block in the same notebook. Using a .ipynb file in Azure Data Studio.
I believe the functionality you are looking for is not possible. As a workaround, I would suggest storing the cell markdown as a variable in Python first and using that variable to populate the printed markdown cell. Here is an example; I believe it will work in any notebook built on top of IPython:
# running this cell in your notebook will print the variable as Markdown
from IPython.display import display, Markdown

mymd = "# Some markdown"
display(Markdown(mymd))
Update: If you are worried that representing multi-line markdown is too complicated, you have two good options. First, use triple quotes so that the line breaks are read as part of the string:
mymd = """
| First Header | Second Header |
| ------------- | ------------- |
| Content Cell | Content Cell |
| Content Cell | Content Cell |
"""
Option 2: Put your markdown in a file and read it to a string:
with open("somefile.md") as f:
    mymd = f.read()
Either option would benefit from a well documented and carefully followed workflow but would work well for this use case.
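In either case, once the table is in a string, a short parsing sketch like the following (my illustration, not part of the workaround above) would get it into a variable as a list of dicts:
rows = [[cell.strip() for cell in line.strip().strip('|').split('|')]
        for line in mymd.strip().splitlines()]
header = rows[0]
table = [dict(zip(header, row)) for row in rows[2:]]  # rows[1] is the |---|---| separator
print(table)  # [{'First Header': 'Content Cell', ...}, ...]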
As per the comment on the question, the .ipynb file appears to contain JSON-formatted text.
A quote about JSON from Wikipedia:
JSON is an open-standard file format or data interchange format that
uses human-readable text to transmit data objects consisting of
attribute–value pairs and array data types (or any other serializable
value). It is a very common data format, with a diverse range of
applications, such as serving as replacement for XML in AJAX systems.
JSON is a language-independent data format. It was derived from
JavaScript, but many modern programming languages include code to
generate and parse JSON-format data. The official Internet media type
for JSON is application/json. JSON filenames use the extension .json.
PowerShell also has its own cmdlets for managing JSON: ConvertTo-Json and ConvertFrom-Json.
The ConvertFrom-Json (and ConvertTo-Json) cmdlet doesn't have a -Path parameter; instead, it converts from an -InputObject variable or stream. If the information comes from a file, you can use the Get-Content cmdlet to retrieve your data first:
$Data = Get-Content -Path .\YourFile.ipynb | ConvertFrom-Json
If your file is actually not provided through a file system but from a web page on the internet (which I suspect), you need to rely on the Invoke-WebRequest cmdlet, or, if it concerns a web API, on the Invoke-RestMethod cmdlet. For these cmdlets you need to figure out and supply more details, like the URL you need to address.
I am trying to allow the user to upload an MS Word file, and then I run a certain function that takes a string as an input argument. I am uploading the Word file through FileUpload, however I am getting an encoded object. I am unable to decode it as UTF-8 bytes, and using upload.value or upload.data just returns encoded text.
Any ideas how I can extract the content from the uploaded Word file?
upload = widgets.FileUpload()
upload
# I select the file I want to upload
upload.value  # Returns coded text
upload.data   # Returns coded text
# Previously upload['content'] worked, but I read this no longer works in ipywidgets 8.0
Modern MS Word files (.docx) are actually ZIP archives.
The text (but not the page headers) is stored inside an XML document called word/document.xml in the ZIP file.
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. For example:
>>> import docx
>>> doc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...
Note that this will only extract the text from paragraphs, not e.g. the text from tables.
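If you also need the text from tables, python-docx exposes those separately; a minimal sketch (my illustration):
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])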
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. upload.data is a list holding the raw bytes of each uploaded file (see the ipywidgets documentation), so do something like:
rawdata = upload.data[0]
(Note that this format has changed across different versions of ipywidgets. The above example is from the documentation of the latest version. Read the relevant version of the documentation, or investigate the data in IPython, and adjust accordingly.)
Write rawdata to e.g. foo.docx and open that. That would certainly work, but it does seem somewhat inelegant.
docx.Document can work with file-like objects. So you could create an io.BytesIO object, and use that.
Like this:
foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
Tweaking @Roland Smith's great suggestions, the following code finally worked:
import io
import ipywidgets as widgets
from docx import Document

upload = widgets.FileUpload()
upload

# after selecting the file in the widget:
rawdata = upload.data[0]
test = io.BytesIO(rawdata)
doc = Document(test)
for p in doc.paragraphs:
    print(p.text)
I want to write a script that generates reports for each team in my unit, where each report uses the same template but the numbers specific to each team are used for each report. The report should be in a format like .pdf that non-programmers know how to open and read. This is in many ways similar to rmarkdown for R, but the reports I want to generate are based on data from code already written in Python.
The solution I am looking for does not need to export directly to PDF. It can export to markdown, and then I know how to convert. I do not need any fancier formatting than what markdown provides. It does not even need to be markdown; I know how to do everything else in markdown, if I can only find a way to dynamically populate numbers and text in a markdown template from Python code.
What I need is something similar to the code block below, but on a bigger scale, and instead of printing the output on screen it would be saved to a file (.md or .pdf) that can then be shared with each team.
user = {'name': 'John Doe', 'email': 'jd@example.com'}
print('Name is {}, and email is {}'.format(user["name"], user["email"]))
So the desired functionality, heavily influenced by my previous experience with rmarkdown, would look something like the code block below, where the template is a string or a file read as a string, with placeholders that are populated from variables (or dicts or objects) in the Python code. Then the output can be saved and shared with the teams.
user = {'name': 'John Doe', 'email': 'jd@example.com'}
template = 'Name is `user["name"]`, and email is `user["email"]`'
output = render(template, user)
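For simple cases, the standard library can already approximate this hypothetical render() call; a minimal sketch using string.Template (my illustration, not part of the original question):
from string import Template

user = {'name': 'John Doe', 'email': 'jd@example.com'}
template = Template('Name is $name, and email is $email')
output = template.substitute(user)
print(output)  # Name is John Doe, and email is jd@example.com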
When trying to find an rmarkdown equivalent in Python, I have found a lot of pointers to Jupyter Notebook, which I am familiar with and very much like, but it is not what I am looking for, as the point is not to share the code, only a rendered output.
Since this question was up-voted I want to answer my own question, as I found a solution that was perfect for me. In the end I shared these reports in a repo, so I write the reports in markdown and do not convert them to PDF. The reason I still think this answers my original question is that it works similarly to creating markdown in Rmarkdown, which was the core of my question, and markdown can easily be converted to PDF.
I solved this by using a library for backend generated HTML pages. I happened to use jinja2 but there are many other options.
First you need a template file in markdown. Let's say this is template.md:
## Overview
**Name:** {{repo.name}}<br>
**URL:** {{repo.url}}
| Branch name | Days since last edit |
|---|---|
{% for branch in repo.branches %}
|{{branch[0]}}|{{branch[1]}}|
{% endfor %}
And then you use it in your Python script:
from jinja2 import Template
import codecs

# create a dict with all the data that will populate the template
repo = {
    'name': 'training-kit',
    'url': 'https://github.com/github/training-kit',
    'branches': [
        ['master', 15],
        ['dev', 2],
    ],
}

# render the template
with open('template.md', 'r') as file:
    template = Template(file.read(), trim_blocks=True)
rendered_file = template.render(repo=repo)

# output the file
output_file = codecs.open("report.md", "w", "utf-8")
output_file.write(rendered_file)
output_file.close()
If you are OK with your dynamic doc being in markdown, you are done and the report is written to report.md. If you want PDF, you can use pandoc to convert.
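For the conversion step, a minimal sketch that shells out to pandoc (assuming pandoc and a PDF engine such as LaTeX are installed):
import subprocess

# convert the rendered markdown report to PDF via pandoc
subprocess.run(["pandoc", "report.md", "-o", "report.pdf"], check=True)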
I would strongly recommend installing and using the pyFPDF library, which enables you to write and export PDF files directly from Python. The library was ported from PHP and offers the same functionality as its PHP variant.
1.) Clone and install pyFPDF
Git-Bash:
git clone https://github.com/reingart/pyfpdf.git
cd pyfpdf
python setup.py install
2.) After successful installation, you can use Python code similar to how you'd work with FPDF in PHP:
from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
pdf.set_xy(0, 0)
pdf.set_font('arial', 'B', 13.0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt="Hello", border=0)
pdf.output('myTest.pdf', 'F')  # 'F' writes the PDF to a local file
For more information, take a look at:
https://pypi.org/project/fpdf/
To work with pyFPDF clone repo from: https://github.com/reingart/pyfpdf
pyFPDF Documentation:
https://pyfpdf.readthedocs.io/en/latest/Tutorial/index.html
I am reading a JSON file in a Dataflow pipeline using beam.io.ReadFromText. When I pass its output to a DoFn class (via ParDo), it arrives as an element. I want to use this JSON file's content in my class; how do I do this?
Content of the JSON file:
{"query": "select * from tablename", "Unit": "XX", "outputFileLocation": "gs://test-bucket/data.csv", "location": "US"}
Here I want to use each of its values, like query, Unit, location and outputFileLocation, in class Query():
p | beam.io.ReadFromText(file_pattern=user_options.inputFile) | 'Executing Query' >> beam.ParDo(Query())
My class:
class Query(beam.DoFn):
    def process(self, element):
        # do something using the content available in element
        ...
I don't think it is possible with the current set of IOs.
The reason is that a multiline JSON file requires parsing the complete file to identify a single JSON block. That would only be possible without parallelism while reading; however, since file-based IOs run on multiple workers in parallel, using a partitioning logic and a line delimiter, parsing multiline JSON is not possible.
If you have multiple smaller files, then you can instead read those files separately and emit the parsed JSON. You can further use a reshuffle to evenly distribute the data for the downstream operations.
The pipeline would look something like this.
Get File List -> Reshuffle -> Read content of individual files and emit the parsed json -> Reshuffle -> Do things.
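A rough sketch of that shape in code (my illustration; ReadFileAndParse is a hypothetical DoFn written for this example, and the file pattern is a placeholder):
import json
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.io.filesystems import FileSystems

class ReadFileAndParse(beam.DoFn):
    def process(self, file_metadata):
        # read the whole file, then parse it as a single JSON document
        with FileSystems.open(file_metadata.path) as f:
            yield json.loads(f.read())

with beam.Pipeline() as p:
    (p
     | 'Get file list' >> fileio.MatchFiles('gs://test-bucket/*.json')
     | 'Reshuffle files' >> beam.Reshuffle()
     | 'Read and parse' >> beam.ParDo(ReadFileAndParse())
     | 'Reshuffle records' >> beam.Reshuffle()
     | 'Executing Query' >> beam.ParDo(Query()))  # do things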
I am working with some legacy code that I have inherited (i.e., many of these design decisions were not mine).
The code takes a directory organized into subdirectories with markdown files, and compiles them into one large markdown file (using Markdown-PP: https://github.com/jreese/markdown-pp). Then it converts this file into HTML (using pandoc: https://pandoc.org/), and finally into a PDF (using wkhtmltopdf: https://wkhtmltopdf.org/).
The problem that I am running into is that many of the original markdown files have YAML metadata headers. When stitched together by Markdown-PP, the large markdown ends up with numerous YAML metadata blocks interspersed throughout. Most of this metadata is lost when converting into HTML because of the way pandoc processes YAML (many of the headers use the same key names, and pandoc combines the separate YAML headers and only preserves the first value of the corresponding key).
I originally had no YAML appearing in the HTML, but was able to change this by modifying pandoc's HTML template. Still, I only get the first value for each corresponding key. It was not clear whether there is a way around this in pandoc, so I instead looked into processing the YAML into HTML before the pandoc step. I have tried parsing the YAML in the combined markdown using PyYAML (yaml.load_all()), but only get the first YAML block.
An example of a YAML block:
---
author: foo
size_minimum: 100
time_req_minutes: 120
# and so on
---
The issue is that each one of the 20+ modules in the final document has this associated metadata.
To try to parse the YAML, I was using code borrowed (with a few modifications) from this post: Is it possible to use PyYAML to read a text file written with a "YAML front matter" block inside?
import yaml

def get_yaml(f):
    pointer = f.tell()
    if f.readline() != '---\n':
        f.seek(pointer)
        return ''
    readline = iter(f.readline, '')
    readline = iter(readline.__next__, '---\n')  # __next__ needed for Python 3
    return ''.join(readline)

with open(filepath, encoding='UTF-8') as f:
    # load_all to get all the YAML documents; the Loader argument is required by
    # recent PyYAML, and list() because load_all returns a generator
    config = list(yaml.load_all(get_yaml(f), Loader=yaml.SafeLoader))
    text = f.read()
    print("TEXT from", f)
    #print(text)
    print("CONFIG from", f)
    print(config)
But even this only resulted in the first YAML block being read and output.
I would like to be able to parse the YAML from the large markdown file and replace it in the correct place with the corresponding HTML. I am just not sure whether these (or any) packages are capable of doing so. It may be that I need to manually change the YAML to HTML in the original markdown files (time intensive, but I could probably already be done with it if I had started that way).
What about this library: https://github.com/eyeseast/python-frontmatter
It parses both the front-matter and the Markdown in the file, placing the Markdown part in the content attribute of the resulting object.
Works with both front-matter containing and front-matterless (is there such a word?) files.
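A minimal sketch of its usage (the filename is a placeholder):
import frontmatter

post = frontmatter.load('somefile.md')
print(post.metadata)  # the YAML front matter as a dict
print(post.content)   # the Markdown body without the front matter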
I have hundreds of gigs of EVTX security event logs that I want to parse for specific Event IDs (4624) and usernames (joe) based on those Event IDs. I have attempted to use a PowerShell cmdlet like below:
Get-WinEvent -FilterHashtable @{Path="mypath.evtx"; ProviderName="securitystuffprovider"; ID=4624}
I know I can pass a variable containing a list of all my evtx files to the -Path parameter, but I am unable to filter based on a subset of the message of the EVTX. Also, this takes an incredibly long time to parse just one evtx file, much less 150 or so. I know there is a Python package to parse EVTX, but I am not sure how that would look, as the python-evtx parser doesn't provide great examples of importing and using the package itself. I cannot extract all of the data into CSV, as that would take too much disk space. Any ideas on how to do this would be amazing. Thanks.
Use -Path with the -FilterXPath parameter, and then filter using an XPath expression like so:
$Username = 'jdoe'
$XPathFilter = "*[System[(EventID=4624)] and EventData[Data[@Name='SubjectUserName'] and (Data='$Username')]]"
Get-WinEvent -Path C:\path\to\log\files\*.evtx -FilterXPath $XPathFilter
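If you do want to try the python-evtx route mentioned in the question, a minimal sketch of iterating records looks like this (the path and the naive substring checks are illustrative; a real filter would parse the XML):
from Evtx.Evtx import Evtx

with Evtx("mypath.evtx") as log:
    for record in log.records():
        xml = record.xml()  # each event record as an XML string
        if "4624" in xml and "joe" in xml:
            print(xml)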