Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does.
Can this be done in a one liner regular expression?
Edit: by body of text I just meant plain text - no HTML
You can load the document up with a DOM/HTML parsing library ( see html5lib ), grab all text nodes, match them against a regular expression and replace the text nodes with a regex replacement of the URI with anchors around it using a PCRE such as:
/(https?:[;\/?\\#&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\#\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g
I'm quite sure you can scourge through and find some sort of utility that does this, I can't think of any off the top of my head though.
Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?
import re
urlfinder = re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+):[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:#&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]")
def urlify2(value):
return urlfinder.sub(r'\1', value)
call urlify2 on a string and I think that's it if you aren't dealing with a DOM object.
I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:
_urlfinderregex = re.compile(r'http([^\.\s]+\.[^\.\s]*)+[^\.\s]{2,}')
def linkify(text, maxlinklength):
def replacewithlink(matchobj):
url = matchobj.group(0)
text = unicode(url)
if text.startswith('http://'):
text = text.replace('http://', '', 1)
elif text.startswith('https://'):
text = text.replace('https://', '', 1)
if text.startswith('www.'):
text = text.replace('www.', '', 1)
if len(text) > maxlinklength:
halflength = maxlinklength / 2
text = text[0:halflength] + '...' + text[len(text) - halflength:]
return '<a class="comurl" href="' + url + '" target="_blank" rel="nofollow">' + text + '<img class="imglink" src="/images/linkout.png"></a>'
if text != None and text != '':
return _urlfinderregex.sub(replacewithlink, text)
else:
return ''
You'll need to get a link out image, but that's pretty easy. This is specifically for user submitted text like comments which I assume is usually what people are dealing with.
/\w+:\/\/[^\s]+/
When you say "body of text" do you mean a plain text file, or body text in an HTML document? If you want the HTML document, you will want to use Beautiful Soup to parse it; then, search through the body text and insert the tags.
Matching the actual URLs is probably best done with the urlparse module. Full discussion here: How do you validate a URL with a regular expression in Python?
Gmail is a lot more open, when it comes to URLs, but it is not always right either. e.g. it will make www.a.b into a hyperlink as well as http://a.b but it often fails because of wrapped text and uncommon (but valid) URL characters.
See appendix A. A. Collected BNF for URI for syntax, and use that to build a reasonable regular expression that will consider what surrounds the URL as well. You'd be well advised to consider a couple of scenarios where URLs might end up.
Related
I try to clean some HTML data with regular expression in python. Given the input string with HTML tags, I want to remove tags and its content if the content contains space. The requirements is like below:
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = regexProcess(inputString)
print outputString
>>I want to remove not sole <code>word</code>
The regex re.sub("<code>.+?</code>", " ", inputString) can only remove all tags, how to improve it or are there some other methods?
Thanks in advance.
Using regex with HTML is fraught with various issues, that is why you should be aware of all possible consequences. So, your <code>.+?</code> regex will only work in case the <code> and </code> tags are on one line and if there are no nested <code> tags inside them.
Assuming there are no nested code tags you might extend your current approach:
import re
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = re.sub("<code>(.+?)</code>", lambda m: " " if " " in m.group(1) else m.group(), inputString, flags=re.S)
print(outputString)
The re.S flag will enable . to match line breaks and a lambda will help to perform a check against each match: any code tag that contains a whitespace in its node value will be turned into a regular space, else it will be kept.
See this Python demo
A more common way to parse HTML in Python is to use BeautifulSoup. First, parse the HTML, then get all the code tags and then replace the code tag if the nodes contains a space:
>>> from bs4 import BeautifulSoup
soup = BeautifulSoup('I want to remove <code>tag with space</code> not sole <code>word</code>', "html.parser")
>>> for p in soup.find_all('code'):
if p.string and " " in p.string:
p.replace_with(" ")
>>> print(soup)
I want to remove not sole <code>word</code>
bad idea to parse HTML with regex. However if your HTML is simple enough you could do this:
re.sub(r"<code>[^<]*\s[^<]*</code>", " ", inputString)
We're looking for at least a space somewhere, to be able to make it work with code tags on the same line, I've added filtering on < char (it has no chance to be in a tag, since even escaping it is <).
Ok, it's still a hack, a proper html parser is preferred.
small test:
inputString = "<code>hello </code> <code>world</code> <code>hello world</code> <code>helloworld</code>"
I get:
<code>world</code> <code>helloworld</code>
You can used to remove tags according to open and close tags also .
inputString = re.sub(r"<.*?>", " ", inputString)
In my case it is working .
Enjoy ...
I have the following code.
def render_markdown(markdown):
"Replaces markdown links with plain text"
# non greedy
# also includes images
RE_ANCHOR = re.compile(r"\[[^\[]*?\]\(.+?\)")
# create a mapping
mapping = load_mapping()
anchors = RE_ANCHOR.findall(markdown)
counter = -1
while len(anchors) != 0:
for link in anchors:
counter += 1
text, href = link.split('](')[:2]
text = '-=-' + text[1:] + '-=-'
text = text.replace(' ', '_') + '_' + str(counter)
href = href[: -1]
mapping[text] = href
markdown = markdown.replace(link, text)
anchors = RE_ANCHOR.findall(markdown)
return markdown, mapping
However the markdown function does not replace all the links. Almost none are replaced.I looked around on SO and found a lot of questions pertaining to this. The problems found were of the type:
abc.replace(x, y) instead of abc = abc.replace(x, y)
I am doing that but the string is not being replaced
It looks like the cause is that your regex isn't matching the text you expected. So the loop is empty.
Try posting some sample markdown, run your code, and add print statements so that you can see all the intermediate results (especially the anchors list). With that in hand, debugging will be much easier :-)
I don't understand why you are using replace when you are already using regex. The re library gives you the tools to do what you want without needing to locate the string twice (once with the regex and once with replace).
For example, MatchObject contains the start and end positions of the matched text. You could use text slicing to do your own string substitutions. But even that is unnecessary as you can use re.sub and have the re library do the substitution for you when a match is found. You just need to define a callable which accepts the MathObject and returns the text to replace it.
def render_markdown(markdown):
"Replaces markdown links with plain text"
RE_ANCHOR = re.compile(r"\[[^\[]*?\]\(.+?\)")
mapping = load_mapping()
def replace_link(m):
# process your link here...
mapping[text] = href
return text
return RE_ANCHOR.sub(replace_link, markdown)
And if you wanted to make a few additions to your regular expression, you could have the regex break up your link into parts which would be accessible as groups on the match object. For example:
RE_ANCHOR = re.compile(r"\[([^\[]*?)\]\((.+?)\)")
# ...
text = m.group(1)
link = m.group(2)
All I did was add a set of parentheses around each of the text and link (inside the brackets). Although, I expect your regex is not sophisticated enough to match all possible links found within Markdown documents. For example, the Python-Markdown library permits at least six levels of nested brackets inside the "text" portion of the link. And don't forget about titles defined in a link (as (url "title")). But that is just scratching the surface. Besides, that would be a separate question.
I am trying to teach myself Python by writing a very simple web crawler with it.
The code for it is here:
#!/usr/bin/python
import sys, getopt, time, urllib, re
LINK_INDEX = 1
links = [sys.argv[len(sys.argv) - 1]]
visited = []
politeness = 10
maxpages = 20
def print_usage():
print "USAGE:\n./crawl [-politeness <seconds>] [-maxpages <pages>] seed_url"
def parse_args():
#code for parsing arguments (works fine so didnt need to be included here)
def crawl():
global links, visited
url = links.pop()
visited.append(url)
print "\ncurrent url: %s" % url
response = urllib.urlopen(url)
html = response.read()
html = html.lower()
raw_links = re.findall(r'<a href="[\w\.-]+"', html)
print "found: %d" % len(raw_links)
for raw_link in raw_links:
temp = raw_link.split('"')
if temp[LINK_INDEX] not in visited and temp[LINK_INDEX] not in links:
links.append(temp[LINK_INDEX])
print "\nunvisited:"
for link in links:
print link
print "\nvisited:"
for link in visited:
print link
parse_args()
while len(visited) < maxpages and len(links) > 0:
crawl()
time.sleep(politeness)
print "politeness = %d, maxpages = %d" % (politeness, maxpages)
I created a small test network in the same working directory of about 10 pages that all link together in various ways, and it seems to work fine, but when I send it out onto the actual internet by itself, it is unable to parse links from files it gets.
It is able to get the html code fine, because I can print that out, but it seems that the re.findall() part is not doing what it is supposed to, because the links list never gets populated. Have I maybe written my regex wrong? It worked fine to find strings like <a href="test02.html" and then parse the link from that, but for some reason, it isn't working for actual web pages. It might be the http part perhaps that is throwing it off?
I've never used regex with Python before so I'm pretty sure that this is the problem. Can anyone give me any idea how express the pattern I am looking for better? Thanks!
The problem is with your regex. There are a whole bunch of ways I could write a valid HTML anchor that your regex wouldn't match. For example, there could be extra whitespace, or line breaks in it, and there are other attributes that could exist that you haven't taken into account. Also, you take no account of different case. For example:
foo
foo
<a class="bar" href="foo">foo</a>
None of these would be matched by your regex.
You probably want something more like this:
<a[^>]*href="(.*?)"
This will match an anchor tag start, followed by any characters other than > (so that we're still matching inside the tag). This might be things like a class or id attribute. The value of the href attribute is then captured in a capture group, which you can extract by
match.group(1)
The match for the href value is also non-greedy. This means it will match the smallest match possible. This is because otherwise if you have other tags on the same line, you'll match beyond what you want to.
Finally, you'll need to add the re.I flag to match in a case insensitive way.
Your regexp doesn't match all valid values for the href attributes, such as path with slashes, and so on. Using [^"]+ (anything different from the closing double quote) instead of [\w\.-]+ would help, but it doesn't matter because… you should not parse HTML with regexps to begin with.
Lev already mentionned BeautifulSoup, you could also look at lxml. It will work better that any hand-crafted regexp you could write.
You probably want this:
raw_links = re.findall(r'<a href="(.+?)"', html)
Use the brackets to indicate what you want returned, otherwise you get the whole match including the <a href=... bit. Now you get everything until the closing quote mark, due to the use of a non-greedy +? operator.
A more discriminating filter might be:
raw_links = re.findall(r'<a href="([^">]+?)"', html)
this matches anything except a quote and a terminating bracket.
These simple RE's will match to URL's that have been commented, URL-like literal strings inside bits of javascript, etc. So be careful about using the results!
I need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
I can get it when I use
start = '<font color="red">'
end = '</font>'
expression = start + '(.*?)' + end
match = re.compile(expression).search(web_source_code)
needed_info = match.group(1)
, but then I have to pick to fetch either <font> or <span>, failing, when the site uses the other tag.
How do I modify the regular expression so it would always succeed?
Don't parse HTML with regex.
Regex is not the right tool to use for this problem. Look up BeautifulSoup or lxml.
You can join two alternatives with a vertical bar:
start = '<font color="red">|<span style="font-weight:bold;">'
end = '</font>|</span>'
since you know that a font tag will always be closed by </font>, a span tag always by </span>.
However, consider also using a solid HTML parser such as BeautifulSoup, rather than rolling your own regular expressions, to parse HTML, which is particularly unsuitable in general for getting parsed by regular expressions.
Although regular expressions are not your best choice for parsing HTML.
For the sake of education, here is a possible answer to your question:
start = '<(?P<tag>font|tag) color="red">'
end = '</(?P=tag)>'
expression = start + '(.*?)' + end
expression = '(<font color="red">(.*?)</font>|<span style="font-weight:bold;">(.*?)</span>)'
match = re.compile(expression).search(web_source_code)
needed_info = match.group(2)
This would get the job done but you shouldn't really be using regex to parse html
Regex and HTML are not such a good match, HTML has too many potential variations that will trip up your regex. BeautifulSoup is the standard tool to employ here, but I find pyparsing can be just as effective, and sometimes even simpler to construct when trying to locate a particular tag relative to a particular previous tag.
Here is how to address your question using pyparsing:
html = """ need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
<font color="white">but not this info</font> and
<span style="font-weight:normal;">dont want this either</span>
"""
from pyparsing import *
font,fontEnd = makeHTMLTags("FONT")
# only match <font> tags with color="red"
font.setParseAction(withAttribute(color="red"))
# only match <span> tags with given style
span,spanEnd = makeHTMLTags("SPAN")
span.setParseAction(withAttribute(style="font-weight:bold;"))
# define full match patterns, define "body" results name for easy access
fontpattern = font + SkipTo(fontEnd)("body") + fontEnd
spanpattern = span + SkipTo(spanEnd)("body") + spanEnd
# now create a single pattern, matching either of the other patterns
searchpattern = fontpattern | spanpattern
# call searchString, and extract body element from each match
for text in searchpattern.searchString(html):
print text.body
Prints:
needed-info-here
needed-info-here
I haven't used Python, but if you make expressions equal to the following, it should work:
/(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close><\/(font|span)>)/gi
Then just access your needed info with the name "info".
PS - I also agree about the "not parsing HTML with regex" rule, but if you know that it will appear in either font or span tags, then so be it...
Also, why use the font tag? I haven't used a font tag since I learned CSS.
I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?
Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
title = title_search.group(1)
Note that starting in Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and re-use it in the condition's body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
title = match.group(1)
# hello
Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
May I recommend you to Beautiful Soup. Soup is a very good lib to parse all of your html document.
soup = BeatifulSoup(html_doc)
titleName = soup.title.name
Try:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
The provided pieces of code do not cope with Exceptions
May I suggest
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns an empty string by default if the pattern has not been found, or the first match.
I'd think this should suffice:
#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
... assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However ...
Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a of extra, and redundant work when various HTML, SGML and XML parsers are already in the standard libraries).
If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml ... which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.
The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: White space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I go along with others recommending an HTML parser - not only, but also to handle non-standard use of HTML tags.
I needed something to match package-0.0.1 (name, version) but want to reject an invalid version such as 0.0.010.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
name, version = match.groups()
print(f'Name: {name}')
print(f'Version: {version}')
else:
raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1
Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing and (?<=<title>).+(?=<\/title>) works great. It will only match whats between parentheses so you don't have to do the whole group thing.