How to handle HTML entities in parsed text - Python

I have parsed text that contains HTML versions of different symbols, such as quotation marks or dashes.
This is what one string looks like:
Introduction &#8211 First page&#8218s content
And I would like to achieve this:
Introduction - First page's content
Is there a library or common solution that converts the HTML entities in any string, or do I need to write a function that replaces each entity with the proper character?
I already checked these answers, but I need something that works on a plain Python string that contains HTML entities.

The html module doesn't require anything special from the string. It just works:
>>> import html
>>> html.unescape('Introduction &#8211 First page&#8218s content')
'Introduction – First page‚s content'
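Note that html.unescape returns the typographic characters themselves (an en dash and a low single quotation mark), not the ASCII hyphen and apostrophe shown in the question. If plain ASCII is what you want, one possible follow-up is a translation table. This is only a sketch: ASCII_EQUIVALENTS and unescape_to_ascii are hypothetical names, and the choice of replacements is an assumption about the desired output.

```python
import html

# Hypothetical mapping from common typographic characters to ASCII.
# Extend it with whatever characters actually occur in your data.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2013": "-",   # en dash (&#8211;)
    "\u2014": "-",   # em dash (&#8212;)
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201a": "'",   # single low-9 quotation mark (&#8218;)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def unescape_to_ascii(text):
    """Decode HTML entities, then swap typographic characters for ASCII."""
    return html.unescape(text).translate(ASCII_EQUIVALENTS)


print(unescape_to_ascii("Introduction &#8211 First page&#8218s content"))
```

Characters that are not in the table pass through unchanged, so nothing is silently dropped.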

Try (Python 2 syntax):
print unicode(x)
or
print x.encode('ascii')

Related

Python regex doesn't work on string

I have an HTML file that I process using lxml and BeautifulSoup (convert from HTML to text). Somehow, the ill-formed HTML below makes it into the text and I'd like to remove it. I tried matching something like "<.+>" in the text string, but it doesn't work. The string I want to remove is this:
string = """ .trb_m_b:befoe{ctent:'Hide comments'}.trb_c_so{padding-top:10px;min-height:500px}||<div class="trb_c_so" data-role=c_container><div class="s_comments" data-sitename="ffff" data-content-id="jksjkj7878787" data-type=promo-comment data-publisher="ronctt"></div></div>"""
The exact code I tried on it is:
import re
pattern = re.compile(r'<.+>')
if pattern.search(string):
    print("Found")
However, that regex doesn't match the string, although it should.
Why would that be?
Thanks.
EDIT. It looks like the problem is not with the regular expressions, but with something very bizarre. I have this string in a list, it's the last item. When I loop through it the first time, for some reason, the program never hits it. The second time, however, it does. I don't understand the reason for it.
EDIT2. It turns out the problem was that I was trying to remove elements in a loop (if they matched the regex), which is not permitted. I rewrote the code to use a list comprehension, and now it works fine.
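For anyone hitting the same thing as in EDIT2: removing items from a list while iterating over it makes the loop skip elements, which is why the last item was never visited on the first pass. A minimal sketch of the list-comprehension version follows; the function name drop_markup_items and the sample data are made up for illustration.

```python
import re

TAG_PATTERN = re.compile(r"<.+>")


def drop_markup_items(items):
    """Return a new list without the items that contain tag-like text."""
    # Building a new list avoids mutating `items` while looping over it.
    return [item for item in items if not TAG_PATTERN.search(item)]


texts = ["first paragraph", "<div>left-over markup</div>", "last paragraph"]
print(drop_markup_items(texts))
```

The original list is left untouched, so it can still be inspected afterwards if needed.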
I believe what you want is this:
import re
data = re.findall(r"<(.*?)>", string)
Your HTML is not a single complete tag. If you really want to match the exact string you gave, you can use this:
re.findall(r"\.trb_m_b.*?></div></div>", string)
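To make the difference between the greedy pattern from the question and the non-greedy one above concrete, here is a small comparison. The sample string is invented for the example.

```python
import re

sample = '<div class="a"><span>hi</span></div>'

# Greedy: .+ runs to the LAST '>' on the line, so everything is one match.
greedy = re.findall(r"<.+>", sample)

# Non-greedy: .*? stops at the FIRST '>', so each tag is matched separately.
lazy = re.findall(r"<(.*?)>", sample)

print(greedy)
print(lazy)
```

Both patterns do find the markup; they only differ in how much each match covers.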

Extract All urls from a string in python

Given a string of text which could possibly contain multiple urls all starting with http://
for example:
someString = "Text amongst words and links http://www.text.com more text more text another http http://www.word.com"
How can I extract all the urls from a string like the one above?
Leaving just
http://www.text.com
http://www.word.com
This should work:
>>> import re
>>> for url in re.findall(r'(http://\S+)', someString): print(url)
...
http://www.text.com
http://www.word.com
You want regular expressions.
In Python: https://docs.python.org/2/library/re.html
Regular expression to evaluate: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
It shouldn't take you long from there.
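As a rough sketch of where that leads, here is a small helper that also accepts https:// and trims punctuation that tends to stick to the end of a URL in running text. The name extract_urls and the set of trailing characters to strip are assumptions, not part of the answers above.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text):
    """Return every http:// or https:// URL found in text, in order."""
    # \S+ also swallows sentence punctuation right after a URL,
    # so strip the most common trailing characters.
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(text)]


someString = ("Text amongst words and links http://www.text.com more text "
              "more text another http http://www.word.com")
print(extract_urls(someString))
```

The bare word "http" in the sample is ignored because the pattern requires the "://" part.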

Use Python re to get rid of links

Say I have a string that looks like Boston–Cambridge–Quincy, MA–NH MSA
How can I use re to get rid of links and get only the Boston–Cambridge–Quincy, MA–NH MSA part?
I tried something like match = re.search(r'<.+>(\w+)<.+>', name_tmp) but it's not working.
re.sub(r'<a[^>]+>(.*?)</a>', r'\1', text)
Note that parsing HTML this way is, in general, rather dangerous. However, it seems that you are parsing MediaWiki-generated links, where it is safe to assume that the links are always similarly formatted, so you should be fine with that regular expression.
You can also use the bleach module (https://pypi.python.org/pypi/bleach), which wraps HTML sanitizing tools and lets you quickly strip HTML from text.
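Here is a self-contained sketch of the regular-expression approach, using only the standard library. The sample input is an assumption about what the linked string looks like, and strip_links is a hypothetical helper name.

```python
import re

# \b keeps the pattern from also matching tags such as <abbr>.
LINK_PATTERN = re.compile(r"<a\b[^>]*>(.*?)</a>", re.DOTALL)


def strip_links(text):
    """Replace each <a ...>label</a> with just its label."""
    return LINK_PATTERN.sub(r"\1", text)


# Assumed input; the real markup may carry different attributes.
sample = '<a href="/wiki/Boston" title="Boston">Boston–Cambridge–Quincy, MA–NH MSA</a>'
print(strip_links(sample))
```

This only handles simple, non-nested links, which matches the caveat above about regular expressions and HTML.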

Parsing Stackoverflow-like text box in Python

I have a <textarea> where the user enters his text. The text can contain special chars which I need to parse and replace with HTML tags for display purposes.
For example:
Bolded text will be entered as: *some text* and parsed to: <strong>some text</strong>.
URL will be entered as: #some text | to/url# and parsed to: <a href="to/url">some text</a>
What's the best way to parse this text input?
Regex? (I don't have any experience with regex)
Some Python library?
Or should I write my own parser, "reading" the input and applying logic where needed?
The emphasis element of the language you describe looks like Markdown.
You should consider just using Markdown, as is. There is a Python module that parses it too.
The best way depends on exactly what your input "language" is. If it has the same sort of nested structures as HTML, you don't want to do it with regular expressions. (Obligatory link: RegEx match open tags except XHTML self-contained tags)
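For the flat, non-nested case described in the question (bold and links only), regular expressions are workable. The following is a sketch under that assumption, not a full parser: TOKEN and render are hypothetical names, and the URL is escaped but not validated.

```python
import html
import re

# One pattern for both constructs: *bold* and #label | url#
TOKEN = re.compile(
    r"\*(?P<bold>[^*\n]+)\*"
    r"|#(?P<label>[^#|\n]+)\|(?P<url>[^#|\n]+)#"
)


def render(text):
    """Convert the two markers to HTML and escape everything else."""
    parts = []
    pos = 0
    for match in TOKEN.finditer(text):
        # Escape the plain text between the previous match and this one.
        parts.append(html.escape(text[pos:match.start()]))
        if match.group("bold") is not None:
            parts.append("<strong>%s</strong>" % html.escape(match.group("bold")))
        else:
            label = html.escape(match.group("label").strip())
            # Escaped only; a real application should also check the URL scheme.
            url = html.escape(match.group("url").strip(), quote=True)
            parts.append('<a href="%s">%s</a>' % (url, label))
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


print(render("*some text*"))
print(render("#some text | to/url#"))
```

Escaping the user's text before it reaches the page matters here, since the input comes straight from a textarea. As soon as the markup needs nesting, an existing parser is the safer choice.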
Are you inventing your own little markup language?
If you are: why? Why not use one of the already existing ones, such as Markdown or reST, for which parsers already exist?
If you aren't: why are you writing your own parser? Isn't there one already?
You can have a look at some existing libraries for parsing wiki text:
http://remysharp.com/2008/04/01/wiki-to-html-using-javascript/
This one seems to work with the same format you've defined.
Headings: ! Heading1 text !! Heading2 text !!! Heading3 text
Bold: Bolded Text
Italic: Italicized Text
Underline: +Underlined Text+
http://randomactsofcoding.blogspot.co.uk/2009/08/parsewikijs-javascript-wiki-parsing.html
Or this one that has a really simple API and allows for checking if the given text is actually a wiki text.
UPDATED - Added Python wiki parsers:
Have a look at a list of wiki parsers from here.
mediawiki-parser seems to be a good Python parser that generates HTML from wiki markup:
https://github.com/peter17/mediawiki-parser

Parse URL from plain text

How can I parse URLs from any given plain text (not limited to href attributes in tags)?
Any code examples in Python will be appreciated.
You could use a Regular Expression to parse the string.
Look in this previously asked question:
What’s the cleanest way to extract URLs from a string using Python?
See Jan Goyvaerts' blog.
So a Python code example could look like
import re
result = re.findall(r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]", subject, re.IGNORECASE)
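A runnable sketch of that idea: the character classes only list uppercase letters, so the pattern is compiled with re.IGNORECASE here. The helper name find_urls and the sample sentence are made up for the example.

```python
import re

URL_RE = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,  # the classes above spell out A-Z only
)


def find_urls(subject):
    """Return all URL-like substrings of subject, in order of appearance."""
    return URL_RE.findall(subject)


text = "See www.example.com and https://docs.python.org/3/library/re.html, thanks."
print(find_urls(text))
```

The final character class excludes punctuation such as commas and full stops, so a URL at the end of a clause is returned without the trailing mark.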
