Parsing Stackoverflow-like text box in Python - python

I have a <textarea> where the user enters his text. The text can contain special chars which I need to parse and replace with HTML tags for display purposes.
For example:
Bolded text will be entered as: *some text* and parsed to: <strong>some text</strong>.
URL will be entered as: #some text | to/url# and parsed to: <a href="to/url">some text</a>
What's the best way to parse this text input?
Regex? (I don't have any experience with regex)
Some Python library?
Or should I write my own parser, "reading" the input and applying logic where needed?

The emphasis element of the language you describe looks like Markdown.
You should consider just using Markdown, as is. There is a Python module that parses it too.

The best way depends on exactly what your input "language" is. If it has the same sort of nested structures as HTML, you don't want to do it with regular expressions. (Obligatory link: RegEx match open tags except XHTML self-contained tags)
Are you inventing your own little markup language?
If you are: why? Why not use one of the already existing ones, such as Markdown or reST, for which parsers already exist?
If you aren't: why are you writing your own parser? Isn't there one already?
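If you do stay with the flat, non-nested markup from the question, a regex-based approach is workable. The following is only a minimal sketch under that assumption; the names render, BOLD and LINK are made up for illustration, and the exact rules (no line breaks inside a marker, links handled before bold) are guesses about what the asker wants.

```python
import html
import re

# Hypothetical patterns for the two constructs from the question.
BOLD = re.compile(r"\*([^*\n]+)\*")              # *some text*
LINK = re.compile(r"#([^#|\n]+)\|([^#\n]+)#")    # #some text | to/url#


def render(text):
    # Escape &, < and > first so user input cannot inject markup.
    # quote=False leaves quotes alone, so no "&#39;" entities are
    # produced that the "#" link syntax could trip over.
    escaped = html.escape(text, quote=False)

    def link(match):
        label = match.group(1).strip()
        # Quotes are escaped by hand because the URL ends up in an attribute.
        url = match.group(2).strip().replace('"', "&quot;")
        return '<a href="{}">{}</a>'.format(url, label)

    linked = LINK.sub(link, escaped)
    return BOLD.sub(r"<strong>\1</strong>", linked)


print(render("*some text* and #some text | to/url#"))
```

The sketch does not validate the URL scheme, so a real application would still need to reject things like javascript: links before display.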

You can have a look at some existing libraries for parsing wiki text:
http://remysharp.com/2008/04/01/wiki-to-html-using-javascript/
This one seems to work with the same format you've defined.
Headings: ! Heading1 text !! Heading2 text !!! Heading3 text
Bold: Bolded Text
Italic: Italicized Text
Underline: +Underlined Text+
http://randomactsofcoding.blogspot.co.uk/2009/08/parsewikijs-javascript-wiki-parsing.html
Or this one that has a really simple API and allows for checking if the given text is actually a wiki text.
UPDATED - Added python wiki parsers:
Having a look at a list of wiki parsers from here.
Media wiki-parser seems to be a good python parser that generates html from wiki markup:
https://github.com/peter17/mediawiki-parser

Related

Remove html tags in python without HTML formatters

I am trying to remove HTML tags from text in Python. The issue is with the format of the tags present. Ex:
[click internet options div on the right]
div - is the HTML tag
Expected:
[click internet options on the right]
It does not use a format like <>. Currently I keep a manually created list of HTML tags and remove them with a "not in" check. Is there a better way to clean this? P.S. I am not asking for code as such; any suggestions on the approach would be great.
You can use a regular expression, but you will need a list of the HTML tags you want to remove. Take a look at the re.sub documentation; it will help you write your regexp, like this one:
re.sub(r"(div|section|aside)", "", toCheck)
The first parameter is the pattern, the second the replacement (in this case nothing) and then, the third, the string to check.
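One caveat with the pattern above is that it also matches inside longer words (the "div" in "divider") and leaves a doubled space behind. A possible refinement, sketched here with the hypothetical names TAG_NAMES and strip_tag_names, is to add word boundaries and swallow the whitespace in front of the tag name:

```python
import re

# Assumed list of tag names to drop; extend as needed.
TAG_NAMES = ("div", "section", "aside")

# \b keeps the match to whole words; \s* removes the space before the tag name.
TAG_PATTERN = re.compile(r"\s*\b(?:" + "|".join(TAG_NAMES) + r")\b")


def strip_tag_names(text):
    return TAG_PATTERN.sub("", text)


print(strip_tag_names("[click internet options div on the right]"))
```

This still removes ordinary uses of words such as "section", so it only suits text where those words appear solely as tag names.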

How to handle HTML entities in parsed text - Python

I have a parsed text that contains HTML versions of different symbols like quotation marks or dashes.
This is what one string looks like:
Introduction &#8211 First page&#8218s content
And I would like to achieve this:
Introduction - First page's content
Is there any library or common solution that converts the HTML entities in any string? Or would I need to write a function that replaces the HTML entities with the proper characters?
I already checked these answers, but I would rather need something that works with a simple Python string that contains html entities.
The html module doesn't require anything special from the string. It just works:
>>> import html
>>> html.unescape('Introduction &#8211 First page&#8218s content')
'Introduction – First page‚s content'
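Note that the result contains an en dash and a low quotation mark, not the plain "-" and "'" shown in the question. If ASCII punctuation is really what is wanted, one option is to translate the decoded characters afterwards. This is a sketch only; the mapping table and the function name are illustrative assumptions and cover just a few common characters.

```python
import html

# Assumed mapping from typographic punctuation to ASCII look-alikes.
ASCII_PUNCTUATION = str.maketrans({
    "\u2013": "-",  # en dash (what &#8211 decodes to)
    "\u2014": "-",  # em dash
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201a": "'",  # single low-9 quotation mark (what &#8218 decodes to)
})


def unescape_to_ascii_punctuation(text):
    # Decode the entities first, then swap the decoded characters.
    return html.unescape(text).translate(ASCII_PUNCTUATION)


print(unescape_to_ascii_punctuation("Introduction &#8211 First page&#8218s content"))
```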
Try
print unicode(x)
or
print x.encode('ascii')

How to remove tags from a string in python using regular expressions? (NOT in HTML)

I need to remove tags from a string in python.
<FNT name="Century Schoolbook" size="22">Title</FNT>
What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in python. I'm using this particularly for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags for two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.
This should work:
import re
re.sub('<[^>]*>', '', mystring)
To everyone saying that regexes are not the correct tool for the job:
The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.
I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
Please avoid using regex. Even though a regex will work on your simple string, you will run into problems in the future if you get a more complex one.
You can use BeautifulSoup's get_text() method.
from bs4 import BeautifulSoup
text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid a warning
print(soup.get_text())
Searching this regex and replacing it with an empty string should work.
/<[A-Za-z\/][^>]*>/
Example (from python shell):
>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print(re.sub(r'<[A-Za-z/][^>]*>', '', my_string))
Title
If it's only for parsing and retrieving value, you might take a look at BeautifulStoneSoup.
If the source text is well-formed XML, you can use the stdlib module ElementTree:
import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print(element.text)  # 'Title'
If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.

Regular expression to match closing HTML tags

I'm working on a small Python script to clean up HTML documents. It works by accepting a list of tags to KEEP and then parsing through the HTML code, trashing tags that are not in the list. I've been using regular expressions to do it, and I've been able to match opening tags and self-closing tags but not closing tags.
The pattern I've been experimenting with to match closing tags is </(?!a)>. This seems logical to me, so why is it not working? The (?!a) should match on anything that is NOT an anchor tag (note that the "a" can be anything; it's just an example).
Edit: AGG! I guess the regex didn't show!
Read:
RegEx match open tags except XHTML self-contained tags
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Repent.
Use a real HTML parser, like BeautifulSoup.
<TAG\b[^>]*>(.*?)</TAG>
Matches the opening and closing pair of a specific HTML tag.
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
Will match the opening and closing pair of any HTML tag.
See here.
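As for why the pattern in the question fails: (?!a) is a zero-width lookahead, so it consumes no characters, and </(?!a)> can only ever match the literal text "</>". If a regex is used anyway, the lookahead has to be followed by something that consumes the tag name. A sketch, with closing_tags_except as a hypothetical helper name:

```python
import re


def closing_tags_except(keep):
    # Build a pattern for closing tags whose name is NOT in `keep`.
    names = "|".join(re.escape(name) for name in keep)
    # The lookahead rejects kept names; the character classes after it
    # then consume the actual tag name.
    return re.compile(
        r"</(?!(?:%s)\s*>)[A-Za-z][A-Za-z0-9]*\s*>" % names,
        re.IGNORECASE,
    )


pattern = closing_tags_except(["a"])
print(pattern.sub("", "<a>x</a><b>y</b></abbr>"))
```

The \s*> inside the lookahead matters: without it, keeping "a" would also protect every tag whose name merely starts with "a", such as </abbr>.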
Don't use regex to parse HTML. It will only give you headaches.
Use an XML parser instead. Try BeautifulSoup or lxml.
You may also consider using the html parser that is built into python (Documentation for Python 2 and Python 3)
This will help you home in on the specific area of the HTML Document you would like to work on - and use regular expressions on it.
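With the built-in parser, the keep-list cleanup from the question could be sketched roughly as follows. The class and function names (TagFilter, keep_only) are made up for this example, and it assumes the text content of dropped tags should be preserved.

```python
from html import escape
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Re-emit a document, dropping every tag that is not in `keep`."""

    def __init__(self, keep):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling the handlers.
        if tag in self.keep:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, no separate end tag.
        if tag in self.keep:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Character references arrive decoded, so escape them again.
        self.parts.append(escape(data, quote=False))


def keep_only(html_text, keep):
    parser = TagFilter(keep)
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(keep_only('<div><a href="x">link</a> and <b>bold</b></div>', ["a"]))
```

Comments, doctype declarations and the contents of script or style elements are not treated specially here, so a real cleanup script would need extra handlers for those.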

Parse URL from plain text

How can I parse URLs from any give plain text (not limited to href attributes in tags)?
Any code examples in Python will be appreciated.
You could use a Regular Expression to parse the string.
Look in this previously asked question:
What’s the cleanest way to extract URLs from a string using Python?
See Jan Goyvaerts' blog.
So a Python code example could look like
result = re.findall(r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]", subject, re.IGNORECASE)
