How can I parse URLs from any give plain text (not limited to href attributes in tags)?
Any code examples in Python will be appreciated.
You could use a Regular Expression to parse the string.
Look in this previously asked question:
What’s the cleanest way to extract URLs from a string using Python?
See Jan Goyvaerts' blog.
So a Python code example could look like
result = re.findall(r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$]", subject)
Related
I have a parsed text what contains HTML versions of different symbols like quotation marks or dashes.
This is how one string looks like:
Introduction – First page‚s content
And I would like to achive this:
Introduction - First page's content
Is there any library or common solution that changes the HTML entities in any string? Or I would need to write a function which replace the html to the proper string?
I already checked these answers, but I would rather need something that works with a simple Python string that contains html entities.
html module doesn't require anything special from the string. It just works:
>>> import html
>>> html.unescape('Introduction – First page‚s content')
'Introduction – First page‚s content'
Try
print unicode(x)
or
print x.encode('ascii')
Say I have a string looks like Boston–Cambridge–Quincy, MA–NH MSA
How can I use re to get rid of links and get only the Boston–Cambridge–Quincy, MA–NH MSA part?
I tried something like match = re.search(r'<.+>(\w+)<.+>', name_tmp) but not working.
re.sub('<a[^>]+>(.*?)</a>', '\\1', text)
Note that parsing HTML in general is rather dangerous. However it seems that you are parsing MediaWiki generated links where it is safe to assume that the links are always similar formatted, so you should be fine with that regular expression.
You can also use the bleach module https://pypi.python.org/pypi/bleach , which wraps html sanitizing tools and lets you quickly strip text of html
I wanted to remove all the tags in HTML file. For that I used re module of python.
For example, consider the line <h1>Hello World!</h1>.I want to retain only "Hello World!". In order to remove the tags, I used re.sub('<.*>','',string). For obvious reasons the result I get is an empty string (The regexp identifies the first and last angle brackets and removes everything in between). How could I get over this issue?
You can make the match non-greedy: '<.*?>'
You also need to be careful, HTML is a crafty beast, and can thwart your regexes.
Parse the HTML using BeautifulSoup, then only retrieve the text.
make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy
off-topic: the approach that uses regular expressions is error prone. it cannot handle cases when angle brackets do not represent tags. I recommend http://lxml.de/
Use a parser, either lxml or BeautifulSoup:
import lxml.html
print lxml.html.fromstring(mystring).text_content()
Related questions:
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
Beautiful Soup is great for parsing html!
You might not require it now, but it's worth learning to use it. Will help you in the future too.
I need to remove tags from a string in python.
<FNT name="Century Schoolbook" size="22">Title</FNT>
What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in python. I'm using this particularly for ArcMap, a GIS program. It has it's own tags for its layout elements, and I just need to remove the tags for two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.
This should work:
import re
re.sub('<[^>]*>', '', mystring)
To everyone saying that regexes are not the correct tool for the job:
The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.
I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
Please avoid using regex. Eventhough regex will work on your simple string, but you'd get problem in the future if you get a complex one.
You can use BeautifulSoup get_text() feature.
from bs4 import BeautifulSoup
text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text)
print(soup.get_text())
Searching this regex and replacing it with an empty string should work.
/<[A-Za-z\/][^>]*>/
Example (from python shell):
>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print re.sub('<[A-Za-z\/][^>]*>', '', my_string)
Title
If it's only for parsing and retrieving value, you might take a look at BeautifulStoneSoup.
If the source text is well-formed XML, you can use the stdlib module ElementTree:
import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print element.text # 'Title'
If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.
a question about python regular expression.
I would like to match a div block like
<div class="leftTail"><ul class="hotnews">any news stuff</ul></div>
I was thinking a pattern like
p = re.compile(r'<div\s+class=\"leftTail\">[^(div)]+</div>')
but it seems not working properly
another pattern
p = re.compile(r'<div\s+class=\"leftTail\">[\W|\w]+</div>')
i got much more than i think, it gets all the stuff until the last tag in the file.
Thanks for any help
You might want to consider graduating to an actual HTML parser. I suggest you give Beautiful Soup a try. There are many crazy ways for HTML to be formatted, and the regular expressions may not work correctly all the time, even if you write them correctly.
Don't use regular expressions to parse XML or HTML. You'll never be able to get it to work correctly for nested divs.
try this:
p = re.compile(r'<div\s+class=\"leftTail\">.*?</div>')