Remove all links with a specific protocol with RegEx - python

I want to remove all links that start with the protocols "example://" and "example_two://" from a text and replace them with a substitute. All other links should be left untouched.
The following regex replaces all links, even though I limit the link types:
(\<a).+?(example|example_two)?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))(.+?)</a>+"
Does anyone have a suggestion for what I need to change in the regex to make it work as expected?

Depending on which characters are allowed in the link, this should work:
import re
link1 = r'<a href=example_two://path.com/to/something?and_some_parameters=1234&and_ano%20ther_one=asdf />'
link2 = r'<a> href="example_two://path.com/to/something?and_some_parameters=1234&and_another_one=asdf"</a>'
pattern = re.compile(r"(?P<before>(?:(?P<opening><a>)|<a).*)(?:example|example_two)://[a-zA-Z0-9_/.=%?&$:;#,<>]*(?P<after>.*(?(opening)</a>|/>))")
print(pattern.sub(r"\g<before>https://stackoverflow.com\g<after>", link1))
print(pattern.sub(r"\g<before>https://example.com\g<after>", link2))
# Prints:
# <a href=https://stackoverflow.com />
# <a> href="https://example.com"</a>
This captures whatever precedes the link in the before group and whatever follows it in the after group, then substitutes the complete match in pattern.sub. The replacement is the concatenation of the before and after groups, with the replacement link in the middle.
What's more, the closing tag is conditioned on the opening tag. If the opening tag is <a>, the matched closing tag is </a>, otherwise a /> is matched.
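To see the conditional construct on its own, here is a minimal sketch that is separate from the link pattern above. The group name `open` and the sample strings are made up for illustration: a closing parenthesis is required only when the optional opening one was matched.

```python
import re

# (?(open)\)) requires ")" only if the named group "open" took part in the match.
pattern = re.compile(r"^(?P<open>\()?\d+(?(open)\))$")

samples = ["42", "(42)", "(42", "42)"]
results = {s: bool(pattern.match(s)) for s in samples}
print(results)
# {'42': True, '(42)': True, '(42': False, '42)': False}
```

The link pattern uses the same idea with two branches: `</a>` when `<a>` was matched, `/>` otherwise.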

Related

Regex for replacing markdown links with <a> ONLY if link does not contain http

I'm working on a python application that tries to reuse markdown from github pages and I need to handle links a particular way. I currently use the following to convert markdown links to what I need:
filterString = '([^\!]|^)\[(.+)\]\((.+)\)'
filteredPage = re.sub(filterString, r"""<a href='#' onclick='requestPage("\3");'>\2</a>""", pageContent)
This converts something like:
[Getting Started](gettingStarted.md)
to
<a href='#' onclick='requestPage("gettingStarted.md");'>Getting Started</a>
This currently ignores links for images (preceded by a '!') which is as desired. The problem is that some of the markdown contains external links that I do not want converted.
I want to use the substitution on:
[Getting Started](gettingStarted.md)
but not either of these:
![Getting Started](gettingStarted.png)
[Getting Started](https://www.gettingstarted.com)
I've seen examples of matching things that don't begin with something, but since I'm trying to match at a certain position (i.e., inside parentheses that follow bracketed text not preceded by a !), I'm not sure how to exclude matches on 'http'.
You could use a negative lookahead right after matching the opening parenthesis, \((?!http), to assert that what follows is not http.
If you want to match between [] and (), you could also use negated character classes so that the match does not run past the closing bracket or parenthesis.
If you make the first group non-capturing with (?:, the replacement can use group \2 and group \1:
(?:[^\!]|^)\[([^\[\]]+)\]\((?!http)([^()]+)\)
For example
import re
# Raw string so the backslashes reach the regex engine unchanged
filterString = r'(?:[^\!]|^)\[([^\[\]]+)\]\((?!http)([^()]+)\)'
strings = [
    "[Getting Started](gettingStarted.md)",
    "![Getting Started](gettingStarted.png)",
    "[Getting Started](https://www.gettingstarted.com)"
]
for pageContent in strings:
    # \2 is the link target, \1 is the link text
    filteredPage = re.sub(filterString, r"""<a href='#' onclick='requestPage("\2");'>\1</a>""", pageContent)
    print(filteredPage)
Output
<a href='#' onclick='requestPage("gettingStarted.md");'>Getting Started</a>
![Getting Started](gettingStarted.png)
[Getting Started](https://www.gettingstarted.com)
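One thing to keep in mind: (?!http) rejects anything that merely starts with the letters http, including a local file. The sketch below compares it with a stricter lookahead; the file name httpGuide.md is a hypothetical example, and whether you need the stricter form depends on your file names.

```python
import re

# Loose: rejects any target beginning with "http".
loose = re.compile(r"\((?!http)([^()]+)\)")
# Stricter: rejects only targets beginning with "http://" or "https://".
strict = re.compile(r"\((?!https?://)([^()]+)\)")

target = "(httpGuide.md)"  # hypothetical local file whose name starts with "http"
loose_match = loose.search(target)
strict_match = strict.search(target)
external = strict.search("(https://www.gettingstarted.com)")

print(loose_match)            # None
print(strict_match.group(1))  # httpGuide.md
print(external)               # None
```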

how to look behind in regex without matching a pattern itself?

Let's say we want to extract the link from a tag like this:
input:
<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>
desired output:
http://www.google.com/home/etc
The first solution is to capture the link with a group, using the regex href=[\'"]?([^\'" >]+)
But what I want to achieve is to match only the link that is preceded by href. Trying (?=href\")... (lookahead assertion: matches without consuming) still matches the href itself.
This is a regex-only question.
One of many regex based solutions would be a capturing group:
>>> re.search(r'href="([^"]*)"', s).group(1)
'http://www.google.com/home/etc'
[^"]* matches any number of characters other than ".
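If you specifically want the match itself to be the URL without a capturing group, a lookbehind is what the question is asking about. A minimal sketch, assuming the attribute is always written as href=" with a double quote (Python's re only accepts fixed-width lookbehinds) and assuming the sample string below is what the input looks like:

```python
import re

# Assumed sample input for illustration.
s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

# (?<=href=") asserts that href=" precedes the match without consuming it.
match = re.search(r'(?<=href=")[^"]*', s)
link = match.group(0)
print(link)
# http://www.google.com/home/etc
```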
A solution could be:
(?:href=)('|")(.*)\1
(?:href=) is a non-capturing group: the engine uses href= during matching but does not store it in a group. If you try this in a regex tester, you will see that no group holds it.
Besides, every pair of plain parentheses creates a capturing group. As a consequence, ('|") defines group #1 and the URL you want will be in group #2. How you retrieve this depends on the programming language.
At the end, \1 is a backreference: it matches the same text that group #1 matched (in this case ") and so provides the closing delimiter for the URL.
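In Python this could look roughly like the sketch below. Note one deliberate change: it uses the lazy (.*?) instead of (.*), because the greedy form runs on to the last matching quote when the tag has more attributes. The sample tags are made up for illustration.

```python
import re

samples = [
    '<a href="http://www.google.com/home/etc" title="home">',
    "<a href='page.html'>",
]

# Group 1 is the opening quote, group 2 the URL, \1 the matching closing quote.
lazy = re.compile(r"""(?:href=)('|")(.*?)\1""")
greedy = re.compile(r"""(?:href=)('|")(.*)\1""")

urls = [lazy.search(tag).group(2) for tag in samples]
overmatched = greedy.search(samples[0]).group(2)

print(urls)         # ['http://www.google.com/home/etc', 'page.html']
print(overmatched)  # http://www.google.com/home/etc" title="home
```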
Make yourself comfortable with a parser, e.g. with BeautifulSoup.
With this, it could be achieved with
from bs4 import BeautifulSoup
html = """<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>"""
soup = BeautifulSoup(html, "html5lib")
print(soup.find('a').text)
# some text
BeautifulSoup supports a number of selectors including CSS selectors.
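If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser. This is not the BeautifulSoup approach above, just an alternative; the class name HrefCollector and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


html = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'
collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)
# ['http://www.google.com/home/etc']
```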

Python regular expression grabbing paragraphs from old HTML

I am working on transferring old content from a website, written in some old HTML, to their new WordPress site. I am using Python to do this. I am trying to:
1. Get the content from the old HTML pages using urllib.request
2. Use a regular expression to grab the text of HTML <p> elements that have classes that identify them as the body of the text
3. Use XML-RPC methods to upload the content to the new WordPress site.
I'm ok with #1 and #3. The problem I am having is with #2, writing the regular expression to capture the content.
The content is in paragraphs with varying formats. Below are two representative paragraphs whose content I am trying to extract with a regular expression.
Paragraph #1
<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;
mso-bidi-font-size:10.0pt'>We have no need to fear the future." So said
bishop-elect H. George Anderson at a news conference immediately following his election as
bishop of the Evangelical Lutheran Church in America. "[The
future] belongs­ to God, untouched by human hands." At the beginning of a
new ministry of leadership and pastoral oversight, such words from a bishop are
obviously designed to project confidence and a profound sense of trust in the
mission of the Church. They are words designed to inspire and empower the
people of God for ministry.<o:p></o:p></span></p>
Paragraph #2
<p class=BODY><span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'>Ages
ago, another prophet of the people stood at his station and peered into the
future. The<span style="mso-spacerun: yes"> </span>prophet Habakkuk poised on
the rampart, scanned the horizon for the approaching enemy he knew was coming.
As he waited, Habakkuk prayed to God asking why God was unresponsive to all
this violence and destruction. In Habakkuk chapter 2 the prophet records God's
answer to his questions about the future. God says to the fearful one, "For
there is still a vision for the appointed time;… If it seems to tarry, wait for
it; it will surely come, it will not delay…the righteous live by faith"
(2:3-4).<o:p></o:p></span></p>
Ideally my regular expression would identify content paragraphs by their class of BODY or bodyDC. Once it has identified a paragraph containing text content, it would ignore all the HTML elements preceding and following the text content, and simply grab the text content.
The regular expression I have so far is still a work in progress:
post_content_re = re.compile(r'<p class=(body\w*)(.*?>)(<.*?>)*([a-z])', re.IGNORECASE)
My explanation for my regular expression parts:
class=(body\w*) should match either BODY or bodyDC, but it doesn't, it only matches BODY, and I don't know why
(.*?>) match the remaining attributes in the paragraph element
(<.*?>)* match 0 or more html elements enclosed in <> after the paragraph element
([a-z]) The content I am trying to get would be after any HTML elements. Right now I'm just testing for one letter, not the full paragraph text, because I'm still testing.
The matches I am getting all look like this:
BODY- but I expected BODY or bodyDC
> - this is the closing > of the p element with class BODY
<span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'> - this is the span element after the P element
A - this is the first letter after the span element
So essentially, my RE is matching paragraphs like Paragraph #2 above, but not like Paragraph #1. I'm not sure why, and I'm stuck.
Thank you for any help.
I would follow a two step approach to this problem.
first collect all the paragraphs of interest
second extract the text from each paragraph
First
Parse out all the paragraphs that have the desired class.
<p\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=(['"]?)(?:body|bodydc)\1(?:\s|>)(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)*(?=<\/p>)
This regex will do the following:
find all the paragraph tags of the given class up to, but not including, the closing </p>
avoid some odd edge cases like <span onmouseover=" </p> ">
due to regex limitations, this will not work with nested paragraph tags like <p>outside paragraph<p>inside paragraph</p>more text in the outside</p>
Second
Extract the raw text from each paragraph
(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)
This regex will do the following:
match both the raw text and tags
place the raw text into capture group 1
avoid difficult edge cases
While (as someone commented) you should not parse HTML like this, for this one-off job this kind of solution might just work.
Your regex is not working for the first paragraph because . does not match newlines, and you have a newline inside your tag. You can use tricks like [\S\s] to match all characters, including newlines.
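A small sketch of that difference, using a shortened stand-in for the first paragraph (the style values here are placeholders, not the real ones). re.DOTALL is the flag that makes . match newlines as well:

```python
import re

# Shortened stand-in: the span tag contains a newline, like Paragraph #1.
text = "<p class=bodyDC style='a'><span style='b;\nc'>Hello</span></p>"

without = re.search(r"<span.*?>", text)               # . stops at the newline
with_flag = re.search(r"<span.*?>", text, re.DOTALL)  # . now matches newlines too
alt = re.search(r"<span[\s\S]*?>", text)              # the [\S\s] trick, no flag

print(without)                 # None
print(repr(with_flag.group(0)))
print(alt.group(0) == with_flag.group(0))  # True
```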
This one does not remove the tags at the end of the paragraph, but I hope it still helps:
for g1, g2, content in re.findall("<p (class=bodyDC|class=BODY)[^><]*>(<[\S\s]*?>)*([\S\s]*?)<\\/p>", str1):
    print(content)
Bit of explanation:
<p (class=bodyDC|class=BODY)[^><]*> matches the opening paragraph tag
<p: the beginning of the tag
(class=bodyDC|class=BODY): one of the two class attributes
[^><]*: any other attributes inside the tag
>: the end of the tag
(<[\S\s]*?>)* matches any number of tags
<: the beginning of the tag
[\S\s]*?: any other attributes (could have also used [^><]*)
>: end of tag
([\S\s]*?) matches any text. This is group 3, this is basically the content. (Plus the tags at the end of it.)
<\/p> matches the closing paragraph tag. (Note that in the code it actually appears as <\\/p>, because the backslash has to be escaped in the python string.)
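If you also want to drop the tags left at the end of the content, one option is a second pass that strips anything tag-shaped. This is only a rough sketch for a one-off job: the sample text is abbreviated from the question, the names paragraph_re and tag_re are mine, and it assumes no > appears inside attribute values.

```python
import re

# Abbreviated sample, plus one paragraph that should be skipped.
str1 = """<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;
mso-bidi-font-size:10.0pt'>We have no need to fear the future.<o:p></o:p></span></p>
<p class=BODY><span style='font-size:14.0pt'>Ages
ago, another prophet stood at his station.<o:p></o:p></span></p>
<p class=footer>skip me</p>"""

# Step 1: grab everything between the opening tag and the closing </p>.
paragraph_re = re.compile(r"<p\s+class=(?:bodyDC|BODY)\b[^<>]*>([\s\S]*?)</p>")
# Step 2: remove every remaining tag ([^>] also matches newlines).
tag_re = re.compile(r"<[^>]*>")

texts = []
for inner in paragraph_re.findall(str1):
    plain = tag_re.sub("", inner)
    texts.append(" ".join(plain.split()))  # collapse line breaks and extra spaces

print(texts)
# ['We have no need to fear the future.', 'Ages ago, another prophet stood at his station.']
```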

Python regex: remove certain HTML tags and the contents in them

If I have a string that contains this:
<p><span class=love><p>miracle</p>...</span></p><br>love</br>
And I want to remove the string:
<span class=love><p>miracle</p>...</span>
and maybe some other HTML tags. At the same time, the other tags and their contents should be preserved.
The result should be like this:
<p></p><br>love</br>
How can I do this using a regex pattern?
What I have tried:
r=re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('',s)
but it will leave the
</span>
Can you help me do this with the re module for now? I will learn an HTML parser next.
First things first: Don’t parse HTML using regular expressions
That being said, if there is no additional span tag within that span tag, then you could do it like this:
text = re.sub('<span class=love>.*?</span>', '', text)
On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is).
The expression you have tried, <span class=love>.*?(?=</span>), is already quite good. The problem is that the lookahead (?=</span>) only checks that </span> follows; it never consumes it. So the match stops immediately before the closing span tag, which is therefore left in the result. You could manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span>, but that's not really necessary: the .*? is a non-greedy expression. It will try to match as little as possible. So in .*?</span> the .*? only matches until the first closing span is found, where it immediately stops.
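Here is a quick side-by-side of the two patterns on the string from the question, as a minimal sketch:

```python
import re

s = "<p><span class=love><p>miracle</p>...</span></p><br>love</br>"

# The lookahead checks for </span> but leaves it in the string.
with_lookahead = re.sub(r"<span class=love>.*?(?=</span>)", "", s)
# Matching </span> directly removes it together with the rest.
consuming = re.sub(r"<span class=love>.*?</span>", "", s)

print(with_lookahead)  # <p></span></p><br>love</br>
print(consuming)       # <p></p><br>love</br>
```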

BeautifulSoup, simple regex issue

I just hit a snag with regex and have no idea why this isn't working.
Here is what BeautifulSoup doc says:
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]
Here is my html:
Aouate</span><span class="pos_text pos3_l_4">
and I'm trying to match the span tag (last position).
>>> if soup.find(class_=re.compile("pos_text pos3_l_\d{1}")):
...     print("Yes")
# prints nothing - indicating there is no such pattern in the html
So, I'm just repeating the BS4 docs, except my regex is not working. Sure enough, if I replace the \d{1} with 4 (as originally in the HTML), it succeeds.
Try "\\d" in your regex. It's probably interpreting "\d" as trying to escape 'd'.
Alternatively, a raw string ought to work. Just put an 'r' in front of the regex, like this:
re.compile(r"pos_text pos3_l_\d{1}")
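For what it's worth, \d is not a recognized string escape in Python, so as far as I know it reaches the regex engine intact even without the r prefix (newer Python versions warn about it). The raw string is still the safer habit, because other sequences such as \b are string escapes. A small sketch with a made-up sample sentence:

```python
import re

# r"\d" and "\\d" are the same two-character string: backslash, d.
same = r"\d" == "\\d"

# "\b" in a normal string is a backspace character, not a word boundary.
plain = re.search("\bcat\b", "a cat sat")
raw = re.search(r"\bcat\b", "a cat sat")

print(same)          # True
print(plain)         # None
print(raw.group(0))  # cat
```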
I'm not entirely sure, but this worked for me:
soup.find(attrs={'class':re.compile('pos_text pos3_l_\d{1}')})
You are matching not a single class but a specific combination of classes in a specific order.
From the documentation:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
So you should probably first match for pos_text and then, within those results, match the other class with a regexp.
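The per-class idea can be sketched without BeautifulSoup at all: split the class attribute into individual classes and test each one. This only illustrates the logic, it is not BS4 API; the variable names and the second sample value are assumptions.

```python
import re

pos_re = re.compile(r"pos3_l_\d")


def matches(class_attr):
    """True if the attribute has the class pos_text and a class like pos3_l_<digit>."""
    classes = class_attr.split()  # the class attribute is a space-separated list
    return "pos_text" in classes and any(pos_re.fullmatch(c) for c in classes)


has_match = matches("pos_text pos3_l_4")
reordered = matches("pos3_l_4 pos_text")  # order no longer matters
no_match = matches("pos_text pos3_r_4")   # hypothetical non-matching value

print(has_match, reordered, no_match)
# True True False
```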
