Regex : Don't match if other text is found before - python

I'm trying to parse a markdown document with a regex to find if there is a title in the document (# title).
I've managed to achieve this with the regex (?m)^#{1}(?!#) (.*). The problem is that my markdown can also contain code sections, where the # title format can appear as a comment.
My idea is to find the # title, but not match it if a ```language line appears earlier in the document.
Here is a text example where I need to match only # my title and not the # helloworld.py below, especially if # my title is missing (which is what I need to find out):
<!--
.. title: Structuring a Python application
.. medium: yes
.. devto: yes
-->
# my title
In this short article I will explain all the different ways of structuring a Python application, from a quick script to a more complex web application.
## Single python file containing all the code
```python
#!/usr/bin/env python
# helloworld.py
test

This could get real messy with regex. But since it seems like you'll be using python anyway - this can be trivial.
import re
mkdwn = '''<!--
.. title: Structuring a Python application
.. medium: yes
.. devto: yes
-->
# my title
In this short article I will explain all the different ways of structuring a Python application, from a quick script to a more complex web application.
## Single python file containing all the code
```python
#!/usr/bin/env python
# helloworld.py
test'''
'''Get the first occurrence of a substring that
you're 100% certain **will not be present** before the title
but **will be present** in the document after the title (if the title exists)
'''
idx = mkdwn.index('```')
# Now, try to extract the title using regex, starting from the string start but ending at `idx`
title_match = re.search(r'^# (.+)', mkdwn[:idx],flags=re.M)
# Get the 1st group if a match was found, else empty string
title = title_match.group(1) if title_match else ''
print(title)
You can also reduce this
title_match = re.search(r'^# (.+)', mkdwn[:idx],flags=re.M)
# Get the 1st group if a match was found, else empty string
title = title_match.group(1) if title_match else ''
into a one-liner, if you're into that sort of thing:
title = getattr(re.search(r'^# (.+)', mkdwn[:idx],flags=re.M), 'group', lambda _: '')(1)
getattr will return the attribute group if it is present (i.e. when a match was found). Otherwise it returns the dummy function (lambda _: ''), which takes a dummy argument and returns an empty string to be assigned to title.
The returned function is then called with the argument 1, which returns the 1st group if a match was found. If a match wasn't found, well the argument doesn't matter, it just returns an empty string.
Output
my title

This is a task for three regular expressions. Screen all code fragments temporarily with the first one, process the markdown with the second one, and unscreen the code with the third.
"Screening" means storing the code fragments in a dictionary and replacing each one with a special placeholder that contains its dictionary key.

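The screening idea from the answer above could be sketched roughly as follows. This is only an illustration under assumptions: the placeholder format and the names PLACEHOLDER, screen, unscreen and find_title are made up for the example, and an unclosed fence is treated as running to the end of the document.

```python
import re

# Hypothetical placeholder format; any token that cannot occur in the document works.
PLACEHOLDER = "\x00CODE{}\x00"

# 1: fenced code blocks (an unclosed fence is assumed to run to the end of the text)
FENCE_RE = re.compile(r"^```.*?(?:^```[ \t]*$|\Z)", re.M | re.S)
# 2: a level-1 title ("# title", but not "## title")
TITLE_RE = re.compile(r"^# (.+)$", re.M)
# 3: the placeholders left behind by screen()
RESTORE_RE = re.compile(r"\x00CODE(\d+)\x00")


def screen(text):
    """Replace each fenced block with a placeholder; return (screened_text, store)."""
    store = {}

    def stash(match):
        key = len(store)
        store[key] = match.group(0)
        return PLACEHOLDER.format(key)

    return FENCE_RE.sub(stash, text), store


def unscreen(text, store):
    """Put the stored code fragments back in place of their placeholders."""
    return RESTORE_RE.sub(lambda m: store[int(m.group(1))], text)


def find_title(text):
    """Return the first level-1 title outside code blocks, or '' if there is none."""
    screened, _ = screen(text)
    match = TITLE_RE.search(screened)
    return match.group(1) if match else ""
```

If only the title is needed, the unscreen step can be skipped; it matters when the processed markdown has to be written back out.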
Related

Python's ElementTree, how to create links in a paragraph

I have a website I'm building running off Python 2.7 and using ElementTree to build the HTML on the fly. I have no problem creating the elements and appending them to the tree. It's where I have to insert links in the middle of a large paragraph that I am stumped. This is easy when it's done in text, but this is doing it via XML. Here's what I mean:
Sample text:
lawLine = "..., a vessel as defined in Section 21 of the Harbors and Navigation Code which is inhabited and designed for habitation, an inhabited floating home as defined in subdivision (d) of Section 18075.55 of the Health and Safety Code, ..."
To add that text to the HTML as H4-style text, I typically use:
h4 = ET.Element('h4')
htmlTree.append(h4)
h4.text = lawLine
I need to add links at the word "Section" and the numbers associated with it, but I can't simply create a new element "a" in the middle of a paragraph and add it to the HTML tree, so I'm trying to build that piece as text, then do ET.fromstring and append it to the tree:
thisLawType = 'PC'
matches = re.findall(r'Section [0-9.]*', lawLine)
if matches:
    lawLine = """<h4>{0}</h4>""".format(lawLine)
    for thisMatch in matches:
        thisMatchLinked = """{2}""".format(thisLawType, thisMatch.replace('Section ',''), thisMatch)
        lawLine = lawLine.replace(thisMatch, thisMatchLinked)
    htmlBody.append(ET.fromstring(lawLine))
I am getting "xml.etree.ElementTree.ParseError: not well-formed" errors when I do ET.fromstring. Is there a better way to do this in ElementTree? I'm sure there are better extensions out there, but my work environment is limited to Python 2.7 and the standard library. Any help would be appreciated. Thanks!
Evan
The XML you are generating is indeed not well formed, because of the & in thisMatchLinked. It's one of the special characters which need to be escaped (see an interesting explanation here).
So try replacing & with &amp; and see if it works.
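As a rough illustration of the escaping fix, here is a standard-library sketch. The URL scheme and the names law_line and link_section are assumptions, since the real link format is not shown in the question. It uses xml.sax.saxutils.quoteattr to escape the & inside the href before parsing.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

law_line = "a vessel as defined in Section 21 of the Harbors and Navigation Code"


def link_section(match):
    # Hypothetical URL scheme; the real query string is not shown in the question.
    url = "/law?code=PC&section=" + match.group(1)
    # quoteattr escapes & as &amp; and wraps the value in quotes.
    return "<a href={0}>{1}</a>".format(quoteattr(url), match.group(0))


markup = "<h4>{0}</h4>".format(re.sub(r"Section ([0-9.]+)", link_section, law_line))
h4 = ET.fromstring(markup)  # parses, because the & is now &amp;
```

The parsed attribute value comes back with a plain & again. If the paragraph text itself can contain &, < or >, it would need escaping as well (for example with xml.sax.saxutils.escape) before being wrapped in markup.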

Markdown: Processing order for registered Pattern

I have written a Python extension for markdown, based on InlineProcessor, which correctly matches when the pattern appears:
Custom extension:
from markdown.util import AtomicString, etree
from markdown.extensions import Extension
from markdown.inlinepatterns import InlineProcessor
RE = r'(#)(\S{3,})'
class MyPattern(InlineProcessor):
    def handleMatch(self, m, data):
        tag = m.group(2)
        el = etree.Element("a")
        el.set('href', f'/{tag}')
        el.text = AtomicString(f'#{tag}')
        return el, m.start(0), m.end(0)
class MyExtension(Extension):
    def extendMarkdown(self, md, md_globals):
        # If processed by attr_list extension, not by this one
        md.inlinePatterns.register(MyPattern(RE, md), 'my_tag', 200)
def makeExtension(*args, **kwargs):
    return MyExtension(*args, **kwargs)
IN: markdown('foo #bar')
OUT: <p>foo <a href="/bar">#bar</a></p>
But my extension breaks a built-in feature called attr_list, part of the extra extension of Python-Markdown.
IN: ### Title {style="color:#FF0000;"}
OUT: <h3>Title {style="color:#FF0000;"}</h3>
I'm not sure I correctly understand how Python-Markdown registers and applies patterns to the text. I tried registering my pattern with a high number to put it at the end of the process, md.inlinePatterns.register(MyPattern(RE, md), 'my_tag', 200), but it doesn't do the job.
I have looked at the source code of the attr_list extension, and it uses a Treeprocessor-based class. Do I need a class based on Treeprocessor rather than InlineProcessor for MyPattern, so that my tag is not applied to text that has already been matched by another extension (here, attr_list)?
You need a stricter regular expression which won't result in false matches. Or perhaps you need to alter the syntax you use so that it doesn't clash with other legitimate text.
First of all, the order of events is correct. Using your example input:
### Title {style="color:#FF0000;"}
When the InlineProcessor gets it, so far it has been processed to this:
<h3>Title {style="color:#FF0000;"}</h3>
Notice that the block level tags are now present (<h3>), but the attr_list has not been processed. And that is your problem. Your regular expression is matching #FF0000;"} and converting that to a link: #FF0000;"}.
Finally, after all InlineProcessors are done, the attr_list Treeprocessor is run, but with the link in the middle, it doesn't recognize the text as a valid attr_list and ignores it (as it should).
In other words, your problem has nothing to do with order at all. You can't run an inline processor after the attr_list TreeProcessor, so you need to explore other alternatives. You have at least two options:
Rewrite your regular expression to not have false matches. You might want to try using word boundaries or something.
Reconsider your proposed new syntax. #bar is a pretty indistinct syntax which is likely to reoccur elsewhere in the text and result in false matches. Perhaps you could require it to be wrapped in brackets or use some character other than a hash.
Personally, I would strongly suggest the second option. Reading some text with #bar in it, it would not be obvious to me that it is a link. However, [#bar] (or similar) would be much clearer.
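To see the difference between the two syntaxes, here is a small sketch using only the re module. The BRACKETED pattern is just one hypothetical way to spell the [#bar] idea, not something Python-Markdown defines. It is assumed that the same expression could be passed to the existing InlineProcessor, reading the tag from m.group(1) instead of m.group(2).

```python
import re

LOOSE = re.compile(r'(#)(\S{3,})')         # pattern from the question
BRACKETED = re.compile(r'\[#(\w{3,})\]')   # hypothetical stricter syntax: [#tag]

heading = '### Title {style="color:#FF0000;"}'
sentence = 'foo [#bar] baz'

# The loose pattern grabs part of the attribute list, the false match described above.
loose_hit = LOOSE.search(heading).group(0)

# The bracketed pattern ignores the attribute list and still finds the intended tag.
bracketed_in_heading = BRACKETED.search(heading)
bracketed_tag = BRACKETED.search(sentence).group(1)
```

With the bracketed form, the attribute list reaches the attr_list Treeprocessor untouched, because nothing inside it looks like a tag.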

Not picking up all XML ElementTree sub-sub elements in python

I am trying to capture all the claim text in a bunch of XML patent files, but I am having trouble with tags nested within <claim-text>. Sometimes there's another <claim-text> inside, and sometimes there is also a <claim-ref> interrupting the text. In my output, the text gets cut off. Usually there are over 10 claims. I am trying to get only the text inside the claim text.
I've already looked and tried the following but these don't work:
xml elementree missing elements python and
How to get all sub-elements of an element tree with Python ElementTree?
I've included a snippet here as it does get quite long to capture all.
My code for this is below (where fullname is the file name and directory).
for _, elem in iterparse(fullname):
    description = '' # reset to empty string at beginning of each loop
    abtext = '' # reset to empty string at beginning of each loop
    claimtext= '' # reset to empty string
    if elem.tag == 'claims':
        for node4 in tree.findall('.//claims/claim/claim-text'):
            claimtext = claimtext + node4.text
        f.write('\n\nCLAIMTEXT\n\n\n')
        f.write(smart_str(claimtext) + '\n\n')
    #put row in df
    row = dict(zip(['PATENT_ID', 'CLASS', 'ABSTRACT', 'DESCRIPTION','CLAIMS'], [data,cat,abtext,description,claimtext]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)
So the resulting problem is twofold: a) I only get one of the texts printed to the file, and b) nothing comes into the dataframe at all. I'm not sure if that's part of the same problem or two separate problems. I can get the claims to print into a file and that works fine, but it skips some of the text.
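One likely cause of the cut-off text is that an element's .text attribute only holds the text before its first child element, so anything after a nested <claim-text> or <claim-ref> is left out. A possible sketch with itertext(), which walks all nested text, is below. The sample XML and the name claim_texts are assumptions, since the question's snippet is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, since the snippet from the question is not shown.
SAMPLE = """<claims>
  <claim id="c1"><claim-text>1. A widget comprising:
    <claim-text>a frame; and</claim-text>
    <claim-text>a handle.</claim-text>
  </claim-text></claim>
  <claim id="c2"><claim-text>2. The widget of <claim-ref idref="c1">claim 1</claim-ref>, wherein the handle is steel.</claim-text></claim>
</claims>"""


def claim_texts(claims_elem):
    """Return one whitespace-normalised string per <claim>, including nested text."""
    return [
        " ".join("".join(claim.itertext()).split())
        for claim in claims_elem.iter("claim")
    ]


root = ET.fromstring(SAMPLE)
texts = claim_texts(root)
```

Each entry in texts could then be joined and written to the file or placed in the CLAIMS column.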

How do I preserve new lines when extracting text from html using lxml.text_content()

I am trying to learn to use Whoosh. I have a large collection of HTML documents I want to search. I discovered that the text_content() method creates some interesting problems. For example, I might have some text organized in a table that looks like this:
<html><table><tr><td>banana</td><td>republic</td></tr><tr><td>stateless</td><td>person</td></table></html>
When I take the original string, build the tree, and then use text_content() to get the text in the following manner
mytree = html.fromstring(myString)
text = mytree.text_content()
The results have no spaces (as should be expected)
'bananarepublicstatelessperson'
I tried to insert new lines using string.replace()
myString = myString.replace('</tr>','</tr>\n')
I confirmed that the new line was present
'<html><table><tr><td>banana</td><td>republic</td></tr>\n<tr><td>stateless</td><td>person</td></table></html>'
but when I run the same code from above the line feeds are not present. Thus the resulting text_content() looks just like above.
This is a problem for me because I need to be able to separate words. I thought I could add non-breaking spaces after each td, and line breaks after rows as well as after body elements etc., to get text that reasonably conforms to my original source.
I will note that I did some more testing and found that line breaks inserted after paragraph tag closes were preserved. But there is a lot of text in the tables that I need to be able to search.
Thanks for any assistance
You could use this solution:
import re
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
>>> striphtml('I Want This <b>text!</b>')
'I Want This text!'
Found here: using python, Remove HTML tags/formatting from a string
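Note that replacing tags with an empty string would still glue the table cells together. A variation that keeps words apart, sketched under the assumption that a regex is good enough for this markup (it is not a real HTML parser), replaces each tag with a space and turns row ends into newlines. The names html_to_text, TAG_RE and ROW_END_RE are made up for the example.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')
ROW_END_RE = re.compile(r'</tr\s*>', re.I)


def html_to_text(data):
    """Strip tags, keeping a space between cells and a newline between rows."""
    data = ROW_END_RE.sub('\n', data)
    # Replace every remaining tag with a space, then collapse repeated whitespace.
    lines = (' '.join(TAG_RE.sub(' ', line).split()) for line in data.split('\n'))
    return '\n'.join(line for line in lines if line)


table = ('<html><table><tr><td>banana</td><td>republic</td></tr>'
         '<tr><td>stateless</td><td>person</td></table></html>')
text = html_to_text(table)
```

For the sample table this yields one line per row with the cell contents separated by single spaces.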

What stronger alternatives are there to difflib?

I am working on a script that needs to be able to track revisions. The general idea is to give it a list of tuples where the first entry is the name of a field (i.e. "title" or "description" etc.), the second entry is the first version of that field, and the third entry is the revised version. So something like this:
[("Title", "The first version of the title", "The second version of the title")]
Now, using python docx I want my script to create a word file that will show the original version, and the new version with the changes bolded. Example:
Original Title:
This is the first version of the title
Revised Title:
This is the second version of the title
The way that this is done in python docx is to create a list of tuples, where the first entry is the text, and the second one is the formatting. So the way to create the revised title would be this:
paratext = [("This is the ", ''),("second",'b'),(" version of the title",'')]
Having recently discovered difflib, I figured this would be a pretty easy task. And indeed, for simple word replacements such as the sample above, it is, and can be done with the following function:
def revFinder(str1,str2):
    s = difflib.SequenceMatcher(None, str1, str2)
    matches = s.get_matching_blocks()[:-1]
    paratext = []
    for i in range(len(matches)):
        print "------"
        print str1[matches[i][0]:matches[i][0]+matches[i][2]]
        print str2[matches[i][1]:matches[i][1]+matches[i][2]]
        paratext.append((str2[matches[i][1]:matches[i][1]+matches[i][2]],''))
        if i != len(matches)-1:
            print ""
            print str1[matches[i][0]+matches[i][2]:matches[i+1][0]]
            print str2[matches[i][1]+matches[i][2]:matches[i+1][1]]
            if len(str2[matches[i][1]+matches[i][2]:matches[i+1][1]]) > len(str1[matches[i][0]+matches[i][2]:matches[i+1][0]]):
                paratext.append((str2[matches[i][1]+matches[i][2]:matches[i+1][1]],'bu'))
            else:
                paratext.append((str1[matches[i][0]+matches[i][2]:matches[i+1][0]],'bu'))
    return paratext
The problems come when I want to do anything else. For example, changing 'teh' to 'the' produces t h e h (without the spaces, I couldn't figure out the formatting). Another issue is that extra text appended to the end is not shown as a change (or at all).
So, my question to all of you is: what alternatives are there to difflib that are powerful enough to handle more complicated text comparisons, or how can I use difflib better so that it works for what I want? Thanks in advance
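Staying within difflib, one option that might cover both the 'teh' case and appended text is to compare word tokens rather than characters and to read SequenceMatcher.get_opcodes() instead of get_matching_blocks(). The sketch below is only an illustration: the name rev_finder and the choice of 'b' as the style for changed runs are assumptions.

```python
import difflib
import re


def rev_finder(old, new):
    """Return (text, style) runs for `new`, marking changed words with 'b'."""
    # Split into words and whitespace runs so that whole words are compared.
    old_tokens = re.findall(r'\s+|\S+', old)
    new_tokens = re.findall(r'\s+|\S+', new)
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    runs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        text = ''.join(new_tokens[j1:j2])
        if not text:
            continue  # pure deletion: nothing to show in the revised text
        runs.append((text, '' if op == 'equal' else 'b'))
    return runs


title_runs = rev_finder("This is the first version of the title",
                        "This is the second version of the title")
```

Because get_opcodes() covers both sequences from start to end, trailing insertions show up as an 'insert' opcode rather than being dropped.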
