I have the following code.
def render_markdown(markdown):
    "Replaces markdown links with plain text"
    # non greedy
    # also includes images
    RE_ANCHOR = re.compile(r"\[[^\[]*?\]\(.+?\)")
    # create a mapping
    mapping = load_mapping()
    anchors = RE_ANCHOR.findall(markdown)
    counter = -1
    while len(anchors) != 0:
        for link in anchors:
            counter += 1
            text, href = link.split('](')[:2]
            text = '-=-' + text[1:] + '-=-'
            text = text.replace(' ', '_') + '_' + str(counter)
            href = href[:-1]
            mapping[text] = href
            markdown = markdown.replace(link, text)
        anchors = RE_ANCHOR.findall(markdown)
    return markdown, mapping
However, the function does not replace all the links; almost none are replaced. I looked around on SO and found a lot of questions pertaining to this. The problems found were of the type:
abc.replace(x, y) instead of abc = abc.replace(x, y)
I am doing that but the string is not being replaced
It looks like the cause is that your regex isn't matching the text you expected, so the anchors list is empty and the loop body never runs.
Try posting some sample markdown, run your code, and add print statements so that you can see all the intermediate results (especially the anchors list). With that in hand, debugging will be much easier :-)
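For instance, a minimal way to inspect what the regex is actually matching (the sample markdown here is made up for illustration):

import re

RE_ANCHOR = re.compile(r"\[[^\[]*?\]\(.+?\)")
sample = "See [the docs](https://example.com/docs) and [home](https://example.com)."
anchors = RE_ANCHOR.findall(sample)
print(anchors)  # prints every matched link so you can compare against expectations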
I don't understand why you are using replace when you are already using regex. The re library gives you the tools to do what you want without needing to locate the string twice (once with the regex and once with replace).
For example, a MatchObject contains the start and end positions of the matched text, so you could use string slicing to do your own substitutions. But even that is unnecessary, as you can use re.sub and have the re library do the substitution for you when a match is found. You just need to define a callable which accepts the MatchObject and returns the text to replace it with.
def render_markdown(markdown):
    "Replaces markdown links with plain text"
    RE_ANCHOR = re.compile(r"\[[^\[]*?\]\(.+?\)")
    mapping = load_mapping()

    def replace_link(m):
        # process your link here...
        mapping[text] = href
        return text

    return RE_ANCHOR.sub(replace_link, markdown)
And if you wanted to make a few additions to your regular expression, you could have the regex break up your link into parts which would be accessible as groups on the match object. For example:
RE_ANCHOR = re.compile(r"\[([^\[]*?)\]\((.+?)\)")
# ...
text = m.group(1)
link = m.group(2)
All I did was add a set of parentheses around each of the text and link parts (inside the brackets). Although I expect your regex is not sophisticated enough to match all possible links found within Markdown documents. For example, the Python-Markdown library permits at least six levels of nested brackets inside the "text" portion of the link. And don't forget about titles defined in a link (as (url "title")). But that is just scratching the surface. Besides, that would be a separate question.
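To illustrate that last point, here is a hedged sketch (not a complete Markdown link parser) of how the regex could be extended with a third group for an optional quoted title:

import re

# group 1 = link text, group 2 = URL, group 3 = optional quoted title (or None)
RE_ANCHOR = re.compile(r'\[([^\[]*?)\]\(\s*(\S+?)(?:\s+"([^"]*)")?\s*\)')

m = RE_ANCHOR.search('[docs](https://example.com "the title")')
print(m.group(1), m.group(2), m.group(3))  # docs https://example.com the title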
Related
I am trying to clean some HTML data with regular expressions in Python. Given an input string with HTML tags, I want to remove the tags and their content if the content contains a space. The requirement is like below:
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = regexProcess(inputString)
print outputString
>>I want to remove not sole <code>word</code>
The regex re.sub("<code>.+?</code>", " ", inputString) can only remove all tags, how to improve it or are there some other methods?
Thanks in advance.
Using regex with HTML is fraught with issues, so you should be aware of the possible consequences. Your <code>.+?</code> regex will only work if the <code> and </code> tags are on one line and there are no nested <code> tags inside them.
Assuming there are no nested code tags you might extend your current approach:
import re
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = re.sub("<code>(.+?)</code>", lambda m: " " if " " in m.group(1) else m.group(), inputString, flags=re.S)
print(outputString)
The re.S flag will enable . to match line breaks and a lambda will help to perform a check against each match: any code tag that contains a whitespace in its node value will be turned into a regular space, else it will be kept.
See this Python demo
A more common way to parse HTML in Python is to use BeautifulSoup. First, parse the HTML, then get all the code tags and then replace the code tag if the nodes contains a space:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('I want to remove <code>tag with space</code> not sole <code>word</code>', "html.parser")
>>> for p in soup.find_all('code'):
...     if p.string and " " in p.string:
...         p.replace_with(" ")
>>> print(soup)
I want to remove not sole <code>word</code>
It is a bad idea to parse HTML with regex. However, if your HTML is simple enough, you could do this:
re.sub(r"<code>[^<]*\s[^<]*</code>", " ", inputString)
We're looking for at least one space inside the tag. To make it work with several code tags on the same line, I've added filtering on the < character (it has no chance to appear inside a tag's content, since even its escaped form is &lt;).
OK, it's still a hack; a proper HTML parser is preferred.
small test:
inputString = "<code>hello </code> <code>world</code> <code>hello world</code> <code>helloworld</code>"
I get:
<code>world</code> <code>helloworld</code>
You can also remove tags generically, matching any opening or closing tag:
inputString = re.sub(r"<.*?>", " ", inputString)
In my case it is working.
Enjoy ...
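For reference, a quick run of this approach on the question's input (note that it strips every tag but keeps the text inside them, which is not quite the output the question asked for):

import re

inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
print(re.sub(r"<.*?>", " ", inputString))
# I want to remove  tag with space  not sole  word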
This might be a silly question, but I'm just trying to learn!
I'm trying to build a simple email search tool to learn more about python. I'm modifying some open source code to parse the email address:
emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
Then I'm writing the results into a spreadsheet using the CSV module.
Since I'd like to keep the domain extension open to almost anything, my results include image filenames that happen to have an email-like format:
example: forbes@2x-302019213j32.png
How can I exclude matches ending in "png" from re.findall?
Code:
def scrape(self, page):
    try:
        request = urllib2.Request(page.url.encode("utf8"))
        html = urllib2.urlopen(request).read()
    except Exception, e:
        return
    emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
    for email in emails:
        if email not in self.emails:  # if not a duplicate
            self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
            self.emails.append(email)
You are already acting inside an if check ... just make the exclusion part of that check ... that will be much, much easier than trying to exclude it from the regex:
if email not in self.emails and not email.endswith("png"):  # if not a duplicate
    self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
    self.emails.append(email)
I know Joran already gave you a response, but here's another way to do it with Python regex that I found cool.
There is a (?!...) matching pattern (a negative lookahead) that essentially says: "if, at this point in the string, the contained pattern would match, then the overall match fails."
If that was a bad explanation, the Python documentation does a much better job: https://docs.python.org/2/howto/regex.html#lookahead-assertions
Also, here is a working example:
y = r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.(?!png)[a-zA-Z]*)'
s = 'forbes@2x-302019213j32.png'
re.findall(y, s) # Will return an empty list
s2 = 'myname@email2018529391230.net'
re.findall(y, s2) # Will return a list with s2 string
s3 = s + ' ' + s2 # Concatenates the two e-mail-formatted strings
re.findall(y, s3) # Will only return s2 string in list
Lots of ways to do this, but my favorite is:
pat = re.compile(r'''
    [A-Za-z0-9\.\+_-]+    # one or more of: letters, digits, . + _ -
    @[A-Za-z0-9\._-]+     # a literal @ followed by more of the same
    \.png                 # if png, DON'T CAPTURE
    |([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)
                          # if not png, CAPTURE''', flags=re.X)
Since regexes are evaluated left-to-right, if a string starts to match then it will match the left side of the | first. If the string ends in .png, then it will consume that string but NOT capture it. If it DOESN'T end in .png, the right side of the | will begin to consume it and WILL capture it. For a more in-depth conversation of this trick, see here. To use these do:
matches = filter(None,pat.findall(html))
Any string matched by the left side (e.g. all the png files that are matched but NOT part of a capturing group) will show up as an empty string in your findall. filter(None, iterable) removes all the empty strings from your iterable, leaving you with only the data you want.
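An illustrative run (the addresses are made up; on Python 3, wrap filter in list() to materialize the result):

html = 'a@site.com banner@cdn-02.assets.png c@mail.org'
print(pat.findall(html))                       # ['a@site.com', '', 'c@mail.org']
print(list(filter(None, pat.findall(html))))   # ['a@site.com', 'c@mail.org']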
Alternatively, you can filter after you grab everything
pat = re.compile(r'''[A-Za-z0-9\.\+_-]+#[A-Za-z0-9\._-]+\.[a-zA-Z]*''')
# same regex you have currently
matches = filter(lambda x: not x.endswith('png'), pat.findall(html))
Note that further on, you should really make self.emails a set. It doesn't seem to need to keep its ordering, and set lookup is WAY faster than list lookup. Remember to use set.add instead of list.append though.
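A minimal sketch of that change (the names here stand in for the question's attributes and are illustrative only):

seen = set()                                   # set membership tests are O(1) on average
for email in emails:
    if email not in seen and not email.endswith('png'):
        seen.add(email)                        # set.add instead of list.append
        csvwriter.writerow([title, url, email])  # stands in for the real row write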
I am really a noob with regular expressions; I tried to do this on my own but I couldn't understand from the manuals how to approach it. I am trying to find all img tags of a given content. I wrote the below, but it's returning None:
content = i.content[0].value
prog = re.compile(r'^<img')
result = prog.match(content)
print result
any suggestions?
Multipurpose solution:
image_re = re.compile(r"""
(?P<img_tag><img)\s+ #tag starts
[^>]*? #other attributes
src= #start of src attribute
(?P<quote>["''])? #optional open quote
(?P<image>[^"'>]+) #image file name
(?(quote)(?P=quote)) #close quote
[^>]*? #other attributes
> #end of tag
""", re.IGNORECASE|re.VERBOSE) #re.VERBOSE allows to define regex in readable format with comments
image_tags = []
for match in image_re.finditer(content):
    image_tags.append(match.group("img_tag"))

# print found image_tags
for image_tag in image_tags:
    print image_tag
As you can see, the regex definition contains
(?P<group_name>regex)
constructs. These let you access the found groups by group_name instead of by number, which is better for readability. So, if you want to collect all src attributes of the img tags, just write:
for match in image_re.finditer(content):
    image_tags.append(match.group("image"))
After this, the image_tags list will contain the src attributes of the image tags.
Also, if you need to parse HTML, there are tools designed exactly for that purpose, such as lxml, which supports XPath expressions.
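For example, a minimal lxml sketch (assuming the HTML string is in content, as above):

from lxml import html

tree = html.fromstring(content)
img_elements = tree.xpath('//img')   # all <img> elements
srcs = tree.xpath('//img/@src')      # just the src attribute values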
I don't know Python but assuming it uses normal Perl compatible regular expressions...
You probably want to look for "<img[^>]+>" which is: "<img", followed by anything that is not ">", followed by ">". Each match should give you a complete image tag.
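In Python that translates to something like this (the sample HTML is made up):

import re

content = '<p><img src="a.png" alt="A"> text <IMG SRC="b.jpg"></p>'
print(re.findall(r'<img[^>]+>', content, re.IGNORECASE))
# ['<img src="a.png" alt="A">', '<IMG SRC="b.jpg">']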
Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does.
Can this be done in a one liner regular expression?
Edit: by body of text I just meant plain text - no HTML
You can load the document with a DOM/HTML parsing library (see html5lib), grab all text nodes, match them against a regular expression, and replace each text node with a regex replacement of the URI wrapped in anchors, using a PCRE such as:
/(https?:[;\/?\\#&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\#\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g
I'm quite sure you can scour around and find some sort of utility that does this; I can't think of any off the top of my head though.
Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?
import re
urlfinder = re.compile(r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}|((news|telnet|nntp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\.)([-A-Za-z0-9\.]+)(:[0-9]*)?/[-A-Za-z0-9_\$\.\+\!\*\(\),;:@&=\?/~\#\%]*[^]'\.}>\),\"]")

def urlify2(value):
    # wrap each matched URL in an anchor tag
    return urlfinder.sub(r'<a href="\g<0>">\g<0></a>', value)
call urlify2 on a string and I think that's it if you aren't dealing with a DOM object.
I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:
_urlfinderregex = re.compile(r'http([^\.\s]+\.[^\.\s]*)+[^\.\s]{2,}')

def linkify(text, maxlinklength):
    def replacewithlink(matchobj):
        url = matchobj.group(0)
        text = unicode(url)
        if text.startswith('http://'):
            text = text.replace('http://', '', 1)
        elif text.startswith('https://'):
            text = text.replace('https://', '', 1)
        if text.startswith('www.'):
            text = text.replace('www.', '', 1)
        if len(text) > maxlinklength:
            halflength = maxlinklength / 2
            text = text[0:halflength] + '...' + text[len(text) - halflength:]
        return '<a class="comurl" href="' + url + '" target="_blank" rel="nofollow">' + text + '<img class="imglink" src="/images/linkout.png"></a>'

    if text != None and text != '':
        return _urlfinderregex.sub(replacewithlink, text)
    else:
        return ''
You'll need to get a link out image, but that's pretty easy. This is specifically for user submitted text like comments which I assume is usually what people are dealing with.
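An example call (Python 2, to match the code above; the URL is made up):

text = 'see http://www.example.com/some/very/long/path for details'
print linkify(text, 20)
# prints the text with the URL wrapped in <a> markup; the visible link text
# is shortened to 'example.co.../long/path'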
/\w+:\/\/[^\s]+/
When you say "body of text" do you mean a plain text file, or body text in an HTML document? If you want the HTML document, you will want to use Beautiful Soup to parse it; then, search through the body text and insert the tags.
Matching the actual URLs is probably best done with the urlparse module. Full discussion here: How do you validate a URL with a regular expression in Python?
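A hedged sketch of that combination (the function name is mine; Python 3, where the module lives at urllib.parse, while Python 2 names it urlparse):

import re
from urllib.parse import urlparse

def find_urls(text):
    # loose regex pass, then sanity-check each candidate with urlparse
    candidates = re.findall(r'\S+://\S+', text)
    valid = []
    for c in candidates:
        parts = urlparse(c)
        if parts.scheme in ('http', 'https') and parts.netloc:
            valid.append(c)
    return valid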
Gmail is a lot more open when it comes to URLs, but it is not always right either. E.g. it will make www.a.b into a hyperlink as well as http://a.b, but it often fails because of wrapped text and uncommon (but valid) URL characters.
See Appendix A, Collected BNF for URI, for the syntax, and use that to build a reasonable regular expression that also considers what surrounds the URL. You'd be well advised to consider a couple of scenarios where URLs might end up.
I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?
Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search will return None if it doesn't find a result, so don't call group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
title = title_search.group(1)
Note that starting in Python 3.8, with the introduction of assignment expressions (PEP 572) (the := operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and re-using it in the condition's body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
title = match.group(1)
# hello
Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
May I recommend Beautiful Soup? It is a very good library for parsing your whole HTML document.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
title = soup.title.string
The provided pieces of code do not cope with the no-match case: re.search returns None, and calling .group(1) on None raises an AttributeError.
May I suggest
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns the first match, or an empty string by default if the pattern was not found.
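A quick demonstration of both cases (the sample strings are made up):

import re

for s in ('<title>hello</title>', 'no title here'):
    title = getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE),
                    'groups', lambda: [u""])()[0]
    print(repr(title))  # 'hello', then ''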
I'd think this should suffice:
#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
... assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However ...
Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra and redundant work when various HTML, SGML and XML parsers are already in the standard libraries.)
If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml ... which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.
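A small sketch of the lxml route, including the soupparser fallback for tag soup (the fallback requires BeautifulSoup to be installed):

from lxml import html
from lxml.html import soupparser

doc = html.fromstring('<html><head><title>Strict parse</title></head></html>')
print(doc.findtext('.//title'))       # Strict parse

soup_doc = soupparser.fromstring('<title>Tag soup</title>')
print(soup_doc.findtext('.//title'))  # Tag soup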
The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: whitespace is allowed inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I go along with the others recommending an HTML parser, not only for this case but also to handle non-standard uses of HTML tags.
I needed something to match package-0.0.1 (name, version) but wanted to reject an invalid version such as 0.0.010.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
name, version = match.groups()
print(f'Name: {name}')
print(f'Version: {version}')
else:
raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1
Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing, and (?<=<title>).+(?=</title>) works great. It will only match what's between the tags, so you don't have to do the whole group thing.
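For completeness, a quick demo (the sample HTML is made up):

import re

html = '<head><title>Page Title</title></head>'
m = re.search(r'(?<=<title>).+(?=</title>)', html)
if m:
    print(m.group(0))  # Page Title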