I would like to extract out the source code verbatim from code directives in a restructuredtext string.
What follows is my first attempt at doing this, but I would like to know if there is a better (i.e. more robust, or more general, or more direct) way of doing it.
Let's say I have the following rst text as a string in python:
s = '''
My title
========

Use this to square a number.

.. code:: python

    def square(x):
        return x**2

and here is some javascript too.

.. code:: javascript

    foo = function() {
        console.log('foo');
    }
'''
To get the two code blocks, I could do
from docutils.core import publish_doctree
doctree = publish_doctree(s)
source_code = [child.astext() for child in doctree.children
               if 'code' in child.attributes['classes']]
Now source_code is a list with just the verbatim source code from the two code blocks. I could also use the attributes attribute of child to find out the code types too, if necessary.
It does the job, but is there a better way?
Your solution will only find code blocks at the top level of a document, and it may return false positives if the class "code" is used on other elements (unlikely, but possible). I would also check the element/node's type, specified in its .tagname attribute.
There's a "traverse" method on nodes (and a document/doctree is just a special node) that does a full traversal of the document tree. It will look at all elements in a document and return only those that match a user-specified condition (a function that returns a boolean). Here's how:
def is_code_block(node):
    return (node.tagname == 'literal_block'
            and 'code' in node.attributes['classes'])

code_blocks = doctree.traverse(condition=is_code_block)
source_code = [block.astext() for block in code_blocks]
This can be further simplified to:
from docutils import nodes
source_code = [block.astext() for block in doctree.traverse(nodes.literal_block)
               if 'code' in block.attributes['classes']]
I'm trying to parse a markdown document with a regex to find if there is a title in the document (# title).
I've managed to achieve this with the regex (?m)^#{1}(?!#) (.*). The problem is that my markdown can also contain code sections, where the # title format can show up as a comment.
My idea was to look for the # title, but not to match it if one of the preceding lines is a ```language fence.
Here is a text example where I need to only match # my title and not the # helloworld.py below, especially if # my title is missing (which is what I need to find out) :
<!--
.. title: Structuring a Python application
.. medium: yes
.. devto: yes
-->
# my title
In this short article I will explain all the different ways of structuring a Python application, from a quick script to a more complex web application.
## Single python file containing all the code
```python
#!/usr/bin/env python
# helloworld.py
test
This could get real messy with regex. But since it seems like you'll be using python anyway - this can be trivial.
import re

mkdwn = '''<!--
.. title: Structuring a Python application
.. medium: yes
.. devto: yes
-->
# my title
In this short article I will explain all the different ways of structuring a Python application, from a quick script to a more complex web application.
## Single python file containing all the code
```python
#!/usr/bin/env python
# helloworld.py
test'''
'''Get the first occurrence of a substring that
you're 100% certain **will not be present** before the title
but **will be present** in the document after the title (if the title exists)
'''
idx = mkdwn.index('```')
# Now, try to extract the title using regex, starting from the string start but ending at `idx`
title_match = re.search(r'^# (.+)', mkdwn[:idx],flags=re.M)
# Get the 1st group if a match was found, else empty string
title = title_match.group(1) if title_match else ''
print(title)
You can also reduce this
title_match = re.search(r'^# (.+)', mkdwn[:idx],flags=re.M)
# Get the 1st group if a match was found, else empty string
title = title_match.group(1) if title_match else ''
into a one liner, if you're into that sort of thing-
title = getattr(re.search(r'^# (.+)', mkdwn[:idx],flags=re.M), 'group', lambda _: '')(1)
getattr will return the attribute group if present (i.e when match was found) - otherwise it'll just return that dummy function(lambda _: '') which takes a dummy argument and returns an empty string, to be assigned to title.
The returned function is then called with the argument 1, which returns the 1st group if a match was found. If a match wasn't found, well the argument doesn't matter, it just returns an empty string.
Output
my title
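If the document may contain no code fence at all (in which case mkdwn.index would raise a ValueError), or a title that only appears after the first fence, a line-by-line scan that tracks whether it is inside a fence might be more forgiving. This is only a sketch: find_title is a hypothetical helper name, and it assumes fences are lines starting with ``` or ~~~.

```python
import re

FENCE = re.compile(r'^\s*(```|~~~)')   # opening or closing code fence
TITLE = re.compile(r'^# (.+)')         # level-1 heading only

def find_title(markdown_text):
    """Return the first level-1 heading outside fenced code, or '' if none."""
    in_fence = False
    for line in markdown_text.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence    # toggle on every fence line
            continue
        if not in_fence:
            match = TITLE.match(line)
            if match:
                return match.group(1).strip()
    return ''
```

Called as find_title(mkdwn), it should return the title when one exists outside the code and an empty string otherwise.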
This is a task for three regular expressions. Screen all code fragments temporarily with the first one, process markdown with the second one, unscreen code with the third.
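A rough sketch of those three steps, assuming fenced code is delimited by ``` and that the NUL character never appears in the document (both are my assumptions; screen and unscreen are hypothetical helper names):

```python
import re

CODE_RE = re.compile(r'```.*?```', re.S)       # 1: closed fenced code fragments
TITLE_RE = re.compile(r'^# (.+)$', re.M)       # 2: the actual markdown pattern
TOKEN_RE = re.compile(r'\x00CODE(\d+)\x00')    # 3: placeholders to restore

def screen(text):
    """Replace every code fragment with a placeholder; return text and store."""
    store = {}

    def stash(match):
        key = len(store)
        store[key] = match.group(0)
        return f'\x00CODE{key}\x00'

    return CODE_RE.sub(stash, text), store

def unscreen(text, store):
    """Put the stored code fragments back in place of their placeholders."""
    return TOKEN_RE.sub(lambda m: store[int(m.group(1))], text)

doc = "# my title\n\n```python\n# helloworld.py\n```\ntext\n"
screened, store = screen(doc)
titles = TITLE_RE.findall(screened)    # only sees text outside the code
restored = unscreen(screened, store)
```

Note that a fence which is never closed is not matched by CODE_RE here, so its contents would not be screened.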
"Screening" means storing code fragments in a dictionary and replacing with some special markdown with dictionary key.
I have written a Python extension for markdown, based on InlineProcessor, which matches correctly wherever the pattern appears:
Custom extension:
from markdown.util import AtomicString, etree
from markdown.extensions import Extension
from markdown.inlinepatterns import InlineProcessor

RE = r'(#)(\S{3,})'

class MyPattern(InlineProcessor):
    def handleMatch(self, m, data):
        tag = m.group(2)
        el = etree.Element("a")
        el.set('href', f'/{tag}')
        el.text = AtomicString(f'#{tag}')
        return el, m.start(0), m.end(0)

class MyExtension(Extension):
    def extendMarkdown(self, md, md_globals):
        # If processed by attr_list extension, not by this one
        md.inlinePatterns.register(MyPattern(RE, md), 'my_tag', 200)

def makeExtension(*args, **kwargs):
    return MyExtension(*args, **kwargs)
IN: markdown('foo #bar')
OUT: <p>foo <a href="/bar">#bar</a></p>
But my extension is breaking a native feature called attr_list in extra of python markdown.
IN: ### Title {style="color:#FF0000;"}
OUT: <h3>Title {style="color:#FF0000;"}</h3>
I'm not sure I correctly understand how Python-Markdown registers and applies patterns to the text. I tried to register my pattern with a high number to put it at the end of the process, md.inlinePatterns.register(MyPattern(RE, md), 'my_tag', 200), but it doesn't do the job.
I have looked at the source code of the attr_list extension and it uses a Treeprocessor-based class. Do I need to base MyPattern on Treeprocessor rather than InlineProcessor? Is there a way to avoid applying my tag to an element that has already been matched by another extension (here: attr_list)?
You need a stricter regular expression which won't result in false matches. Or perhaps you need to alter the syntax you use so that it doesn't clash with other legitimate text.
First of all, the order of events is correct. Using your example input:
### Title {style="color:#FF0000;"}
When the InlineProcessor gets it, so far it has been processed to this:
<h3>Title {style="color:#FF0000;"}</h3>
Notice that the block level tags are now present (<h3>), but the attr_list has not been processed. And that is your problem. Your regular expression is matching #FF0000;"} and converting that to a link: #FF0000;"}.
Finally, after all InlineProcessors are done, the attr_list Treeprocessor is run, but with the link in the middle, it doesn't recognize the text as a valid attr_list and ignores it (as it should).
In other words, your problem has nothing to do with order at all. You can't run an inline processor after the attr_list TreeProcessor, so you need to explore other alternatives. You have at least two options:
Rewrite your regular expression to not have false matches. You might want to try using word boundaries or something.
Reconsider your proposed new syntax. #bar is a pretty indistinct syntax which is likely to reoccur elsewhere in the text and result in false matches. Perhaps you could require it to be wrapped in brackets or use some character other than a hash.
Personally, I would strongly suggest the second option. Reading some text with #bar in it, it would not be obvious to me that it is a link. However, [#bar] (or similar) would be much clearer.
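To make the two options concrete, here is a small sketch using plain re, outside of Python-Markdown. Both patterns are illustrative assumptions rather than anything the library prescribes: STRICT_HASH refuses a # that directly follows a word character, colon or quote, and BRACKETED requires the [#bar] form.

```python
import re

# Option 1: reject '#' that directly follows a word character, ':' or a quote
STRICT_HASH = re.compile(r'(?<![\w:"\'])#(\w{3,})\b')
# Option 2: require the tag to be wrapped in brackets
BRACKETED = re.compile(r'\[#(\w{3,})\]')

samples = [
    'foo #bar',
    '### Title {style="color:#FF0000;"}',
    'see [#bar] here',
]
for text in samples:
    print(text, STRICT_HASH.findall(text), BRACKETED.findall(text))
```

The first pattern would still fire on something like "color: #FF0000" with a space after the colon, which is one more argument for the bracketed syntax.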
I need to filter a rather long (but very regular) set of .html files to modify a few constructs only if they appear in text elements.
One good example is to change <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>.
I can easily parse my files with html.parser, but it's unclear how to generate the result file, which should be as similar to the input as possible (no reformatting).
I had a look at beautiful-soup, but it really seems too big for this (supposedly?) simple task.
Note: I do not need/want to serve the .html files to a browser of any kind; I just need them updated (possibly in-place) with (slightly) changed content.
UPDATE:
Following @soundstripe's advice, I wrote the following code:
import bs4
from re import sub

def handle_html(html):
    sp = bs4.BeautifulSoup(html, features='html.parser')
    for e in list(sp.strings):
        s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
        if s != e:
            e.replace_with(s)
    return str(sp).encode()
raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
new = handle_html(raw)
print(raw)
print(new)
Unfortunately BeautifulSoup tries to be too smart for its (and my) own good:
b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'
i.e. it transforms the plain & into &amp;, thus breaking the &ldquo; entity (notice I'm working with bytes objects, not strings. Is that relevant?).
How can I fix this?
I don't know why you wouldn't use BeautifulSoup. Here's an example that replaces your quotes like you're asking.
import re
import bs4
raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his “good” side! He has <i>none</i>!<div></p>"""
soup = bs4.BeautifulSoup(raw, features='html.parser')
def replace_quotes(s):
    # use the function's own argument (the original referenced the loop variable e)
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', s)

for e in list(soup.strings):
    # wrapping the new string in a BeautifulSoup() call to correctly parse entities
    new_string = bs4.BeautifulSoup(replace_quotes(e), features='html.parser')
    e.replace_with(new_string)
# use the soup.encode() formatter keyword to specify you want html entities in your output
new = soup.encode(formatter='html')
print(raw)
print(new)
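If leaving the markup untouched matters more than anything else (BeautifulSoup also closed the stray <div> tags in the output above), a different route is to subclass the standard library's html.parser.HTMLParser and simply re-emit what it sees. This is a sketch of that alternative, not what the code above does: QuoteRewriter and rewrite_quotes are hypothetical names, and the &ldquo;/&rdquo; replacement is my reading of the desired output.

```python
import re
from html.parser import HTMLParser

class QuoteRewriter(HTMLParser):
    """Re-emit the input, changing straight quote pairs in text nodes only."""

    def __init__(self):
        # keep entities as separate events so they can be written back verbatim
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())   # original tag text, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f'</{tag}>')                # note: tag name is lowercased

    def handle_data(self, data):
        self.out.append(re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', data))

    def handle_entityref(self, name):
        self.out.append(f'&{name};')

    def handle_charref(self, name):
        self.out.append(f'&#{name};')

    def handle_comment(self, data):
        self.out.append(f'<!--{data}-->')

    def handle_decl(self, decl):
        self.out.append(f'<!{decl}>')

def rewrite_quotes(html):
    parser = QuoteRewriter()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```

Caveats: a quoted phrase that contains an entity or a tag arrives as several text chunks and is not matched, and text inside script or style elements is rewritten too, so those would need to be skipped in a real run.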
I am working on a small project and need to parse publicly available XML data. My goal is to write the data to a MySQL database for further processing.
XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml
XML structure (example):
<parkingAreaStatus>
<parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="2[Zeil]" version="1.0"/>
<parkingAreaStatusTime>2018-02-04T01:30:00.000+01:00</parkingAreaStatusTime>
</parkingAreaStatus>
<parkingAreaStatus>
<parkingAreaOccupancy>0.34625</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]" version="1.0"/>
</parkingAreaStatus>
Using the code
import csv
import pymysql
import urllib.request
url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"
from lxml.objectify import parse
from lxml import etree
from urllib.request import urlopen
locations_root = parse(urlopen(url)).getroot()
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)
print(*locations)
I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.
Playing around with the code, I got the impression that only the first block is parsed: I was able to fill the list with the "parkingAreaOccupancy" value of the block whose "parkingAreaReference" has id="2[Zeil]" by using the code
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)
print(*locations)
-> 0.2533602
which is not the expected outcome
-> 0.2533602
-> 0.34625
My question is:
What is the best way to get a matrix I can work with further, containing all blocks together with the corresponding values from the XML document?
Example output:
A = [[ID:2[Zeil],0.2533602,stable,2018-02-04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-04T01:30:00.000+01:00]]
or in general
A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]
After hours of research I hope for some tips from your site
Thank you in advance,
TR
You can just use etree directly and find the elements of interest with an XPath query (1). One important thing to note is that your XML has a default namespace declared at the root element:
xmlns="http://datex2.eu/schema/2/2_0"
By definition, the element on which a default namespace is declared, and all of its descendant elements without a prefix, belong to that default namespace (unless another default namespace is declared on one of the descendants, which is not the case in your XML). This is why the following code defines a prefix d that refers to the default namespace URI, and uses that prefix to find every element we need to get information from:
# etree, urlopen and url are the names already imported/defined in the question
root = etree.parse(urlopen(url)).getroot()
ns = {'d': 'http://datex2.eu/schema/2/2_0'}
parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)
for pa in parking_area:
    area_ref = pa.find('d:parkingAreaReference', ns)
    occupancy = pa.find('d:parkingAreaOccupancy', ns)
    trend = pa.find('d:parkingAreaOccupancyTrend', ns)
    status_time = pa.find('d:parkingAreaStatusTime', ns)
    print(area_ref.get('id'), occupancy.text, trend.text, status_time.text)
Below is the output of the demo code above. Instead of printing, you can store this information in whatever data structure you like:
2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00
5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00
1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00
3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00
4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00
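For the list-of-lists ("matrix") shape asked for in the question, something along these lines might do. It is a sketch using the standard library's xml.etree.ElementTree instead of lxml, run against a small inline sample rather than the live feed; the root wrapper element and the collect_rows name are made up for the example. Missing child elements end up as None instead of raising.

```python
import xml.etree.ElementTree as ET

NS = {'d': 'http://datex2.eu/schema/2/2_0'}

# Hypothetical, shortened stand-in for the real document
SAMPLE = """<root xmlns="http://datex2.eu/schema/2/2_0">
  <parkingAreaStatus>
    <parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>
    <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
    <parkingAreaReference targetClass="ParkingArea" id="2[Zeil]" version="1.0"/>
    <parkingAreaStatusTime>2018-02-04T01:30:00.000+01:00</parkingAreaStatusTime>
  </parkingAreaStatus>
  <parkingAreaStatus>
    <parkingAreaOccupancy>0.34625</parkingAreaOccupancy>
    <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
    <parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]" version="1.0"/>
  </parkingAreaStatus>
</root>"""

def collect_rows(root):
    """Return one [id, occupancy, trend, status_time] list per status block."""
    rows = []
    for pa in root.iterfind('.//d:parkingAreaStatus', NS):
        ref = pa.find('d:parkingAreaReference', NS)
        rows.append([
            ref.get('id') if ref is not None else None,
            pa.findtext('d:parkingAreaOccupancy', namespaces=NS),
            pa.findtext('d:parkingAreaOccupancyTrend', namespaces=NS),
            pa.findtext('d:parkingAreaStatusTime', namespaces=NS),
        ])
    return rows

rows = collect_rows(ET.fromstring(SAMPLE))
```

All values are kept as strings here; converting the occupancy to float before inserting into the database is left to the caller.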
1) some references on XPath :
XPath 1.0 spec: The most trustworthy reference on XPath 1.0
XPath syntax: Gentler introduction to basic XPath expressions
quick question... i can create/parse a chunk of html using libxml2dom, etc...
however, is there a way to somehow display the xpath used to generate/extract the html chunk.. i'm assuming that there's some method/way of doing this that i can't find..
ex:
import libxml2dom
d = libxml2dom.parseString(s, html=1)
##
hdr="//div[3]/table[1]/tr/th"
thdr_ = d.xpath(hdr)
print "lent = ",len(thdr_)
at this point, thdr_ is an array/list of objects.. each of which points to a chunk of html (if you will)
i'm trying to figure out if there's a way to get, say, the xpath for say, the thdr_[x] element/item of the list...
ie:
thdr_[0]=//div[3]/table[1]/tr[0]/th
thdr_[1]=//div[3]/table[1]/tr[1]/th
thdr_[2]=//div[3]/table[1]/tr[2]/th
.
.
.
any thoughts/comments..
thanks
-tom
I did this by iterating each node and comparing the textContent with my expected text. For fuzzy comparisons I used the SequenceMatcher class from difflib.
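A rough illustration of that approach, using the standard library's xml.etree.ElementTree as a stand-in for libxml2dom (so the API differs from the question's). element_paths and find_paths_by_text are hypothetical helper names, the 0.8 threshold is arbitrary, and only each element's own leading text is compared, which is a simplification of textContent.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

def element_paths(root):
    """Yield (element, positional path) for every element under root."""
    def walk(elem, path):
        yield elem, path
        counts = {}
        for child in elem:
            # XPath positions are 1-based and counted per tag name
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f'{path}/{child.tag}[{counts[child.tag]}]')
    yield from walk(root, f'/{root.tag}')

def find_paths_by_text(root, expected, threshold=0.8):
    """Return the paths of elements whose text is close enough to expected."""
    hits = []
    for elem, path in element_paths(root):
        text = (elem.text or '').strip()
        if text and SequenceMatcher(None, text, expected).ratio() >= threshold:
            hits.append(path)
    return hits

doc = ET.fromstring(
    '<div><table><tr><th>Name</th></tr><tr><th>Price (USD)</th></tr></table></div>'
)
paths = find_paths_by_text(doc, 'Price USD')
```

If lxml is available, its ElementTree objects also provide a getpath(element) method that returns such a path directly.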