Use Python re to get rid of links - python

Say I have a string where the text Boston–Cambridge–Quincy, MA–NH MSA is partly wrapped in HTML <a>…</a> links.
How can I use re to get rid of the links and keep only the Boston–Cambridge–Quincy, MA–NH MSA part?
I tried something like match = re.search(r'<.+>(\w+)<.+>', name_tmp), but it isn't working.

re.sub(r'<a[^>]+>(.*?)</a>', r'\1', text)
Note that parsing HTML with regular expressions is risky in general. However, it seems that you are handling MediaWiki-generated links, where it is safe to assume the links are always similarly formatted, so you should be fine with that regular expression.
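For example, a quick sketch (the href and title here are made up for illustration):
import re
# hypothetical input: the MSA text with one part wrapped in a MediaWiki-style link
text = '<a href="/wiki/Boston" title="Boston">Boston</a>–Cambridge–Quincy, MA–NH MSA'
print(re.sub(r'<a[^>]+>(.*?)</a>', r'\1', text))
# Boston–Cambridge–Quincy, MA–NH MSA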

You can also use the bleach module (https://pypi.python.org/pypi/bleach), which wraps HTML-sanitizing tools and lets you quickly strip HTML from text.
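A minimal sketch of that approach, assuming bleach is installed (the input string is hypothetical; an empty allowed-tags set plus strip=True removes every tag):
import bleach
# hypothetical link-wrapped input; empty allowed-tags set + strip=True strips all tags
text = '<a href="/wiki/Boston" title="Boston">Boston</a>–Cambridge–Quincy, MA–NH MSA'
print(bleach.clean(text, tags=set(), strip=True))
# Boston–Cambridge–Quincy, MA–NH MSA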

Related

Processing an HTML file using Python

I wanted to remove all the tags in an HTML file. For that I used Python's re module.
For example, consider the line <h1>Hello World!</h1>. I want to retain only "Hello World!". In order to remove the tags, I used re.sub('<.*>', '', string). For obvious reasons the result I get is an empty string (the regexp finds the first and last angle brackets and removes everything in between). How can I get around this issue?
You can make the match non-greedy: '<.*?>'
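For example, applied to the line from the question:
import re
print(re.sub('<.*?>', '', '<h1>Hello World!</h1>'))
# Hello World!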
You also need to be careful: HTML is a crafty beast and can thwart your regexes.
Parse the HTML using BeautifulSoup, then only retrieve the text.
Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy
Off-topic: the regular-expression approach is error-prone; it cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
Use a parser, either lxml or BeautifulSoup:
import lxml.html
print(lxml.html.fromstring(mystring).text_content())
Related questions:
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
Beautiful Soup is great for parsing HTML!
You might not need it now, but it's worth learning to use; it will help you in the future too.

HTML code processing

I want to process some HTML code and remove the tags as in the example:
"<p><b>This</b> is a very interesting paragraph.</p>" results in "This is a very interesting paragraph."
I'm using Python; do you know of any framework I could use to remove the HTML tags?
Thanks!
This question may help you: Strip HTML from strings in Python
No matter what solution you choose, I'd recommend avoiding regular expressions. They can be slow when processing large strings, they might not work due to invalid HTML, and stripping HTML with regex isn't always secure or reliable.
BeautifulSoup
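For example, a minimal sketch applied to the string from the question:
from bs4 import BeautifulSoup
html = "<p><b>This</b> is a very interesting paragraph.</p>"
print(BeautifulSoup(html, 'html.parser').get_text())
# This is a very interesting paragraph.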
import libxml2
text = "<p><b>This</b> is a very interesting paragraph.</p>"
root = libxml2.parseDoc(text)
print(root.content)
# This is a very interesting paragraph.
Depending on your needs, you could just use the regular expression <(.|\n)*?> and replace all matches with empty strings. This works fine for one-off, manual cases, but if you're building this as an application feature you'll need a more robust and secure option.
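In Python, that could look something like this (using the paragraph from the question):
import re
html = "<p><b>This</b> is a very interesting paragraph.</p>"
print(re.sub(r'<(.|\n)*?>', '', html))
# This is a very interesting paragraph.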
You can use lxml.

How to remove tags from a string in Python using regular expressions? (NOT in HTML)

I need to remove tags from a string in python.
<FNT name="Century Schoolbook" size="22">Title</FNT>
What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in Python. I'm using this particularly for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags for two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.
This should work:
import re
mystring = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
print(re.sub('<[^>]*>', '', mystring))  # Title
To everyone saying that regexes are not the correct tool for the job:
The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.
I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
Please avoid using regex. Even though a regex will work on your simple string, you'd run into problems down the road with more complex input.
You can use BeautifulSoup's get_text() feature.
from bs4 import BeautifulSoup
text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text, 'html.parser')
print(soup.get_text())
Searching this regex and replacing it with an empty string should work.
/<[A-Za-z\/][^>]*>/
Example (from python shell):
>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print(re.sub(r'<[A-Za-z\/][^>]*>', '', my_string))
Title
If it's only for parsing and retrieving value, you might take a look at BeautifulStoneSoup.
If the source text is well-formed XML, you can use the stdlib module ElementTree:
import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print(element.text)  # Title
If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.

Parse URL from plain text

How can I parse URLs from any given plain text (not limited to href attributes in tags)?
Any code examples in Python will be appreciated.
You could use a Regular Expression to parse the string.
Look in this previously asked question:
What’s the cleanest way to extract URLs from a string using Python?
See Jan Goyvaerts' blog.
So a Python code example could look like
result = re.findall(r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]", subject, re.IGNORECASE)
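A quick usage sketch (the sample text here is made up for illustration):
import re
subject = "Docs at https://example.org/guide and the ftp.example.com mirror"  # hypothetical sample text
result = re.findall(r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]", subject, re.IGNORECASE)
print(result)
# ['https://example.org/guide', 'ftp.example.com']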

python regular expression to parse div tags

A question about Python regular expressions.
I would like to match a div block like
<div class="leftTail"><ul class="hotnews">any news stuff</ul></div>
I was thinking of a pattern like
p = re.compile(r'<div\s+class=\"leftTail\">[^(div)]+</div>')
but it doesn't seem to work properly.
With another pattern,
p = re.compile(r'<div\s+class=\"leftTail\">[\W|\w]+</div>')
I got much more than I expected: it matches everything up to the last </div> tag in the file.
Thanks for any help.
You might want to consider graduating to an actual HTML parser. I suggest you give Beautiful Soup a try. There are many crazy ways for HTML to be formatted, and the regular expressions may not work correctly all the time, even if you write them correctly.
Don't use regular expressions to parse XML or HTML. You'll never be able to get it to work correctly for nested divs.
Try this:
p = re.compile(r'<div\s+class=\"leftTail\">.*?</div>')
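A quick check of that pattern against the single-line example from the question:
import re
html = '<div class="leftTail"><ul class="hotnews">any news stuff</ul></div>'
p = re.compile(r'<div\s+class=\"leftTail\">.*?</div>')
print(p.findall(html))
# ['<div class="leftTail"><ul class="hotnews">any news stuff</ul></div>']
Note that without re.DOTALL the . won't match newlines, so this only helps when the div sits on one line, and as noted above it still breaks on nested divs.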
