Python regular expression grabbing paragraphs from old HTML


I am working on transferring old content from a website, written in some old HTML, to its new WordPress site. I am using Python to do this. I am trying to:
1. Get the content from the old HTML pages using urllib.request.
2. Use a regular expression to grab the text of HTML <p> elements whose classes identify them as the body of the text.
3. Use XML-RPC methods to upload the content to the new WordPress site.
I'm OK with #1 and #3. The problem I am having is with #2, writing the regular expression to capture the content.
The content is in paragraphs of varying format. Below are two representative paragraphs whose content I am trying to extract with a regular expression.
Paragraph #1
<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;
mso-bidi-font-size:10.0pt'>We have no need to fear the future." So said
bishop-elect H. George Anderson at a news conference immediately following his election as
bishop of the Evangelical Lutheran Church in America. "[The
future] belongs to God, untouched by human hands." At the beginning of a
new ministry of leadership and pastoral oversight, such words from a bishop are
obviously designed to project confidence and a profound sense of trust in the
mission of the Church. They are words designed to inspire and empower the
people of God for ministry.<o:p></o:p></span></p>
Paragraph #2
<p class=BODY><span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'>Ages
ago, another prophet of the people stood at his station and peered into the
future. The<span style="mso-spacerun: yes"> </span>prophet Habakkuk poised on
the rampart, scanned the horizon for the approaching enemy he knew was coming.
As he waited, Habakkuk prayed to God asking why God was unresponsive to all
this violence and destruction. In Habakkuk chapter 2 the prophet records God's
answer to his questions about the future. God says to the fearful one, "For
there is still a vision for the appointed time;… If it seems to tarry, wait for
it; it will surely come, it will not delay…the righteous live by faith"
(2:3-4).<o:p></o:p></span></p>
Ideally my regular expression would identify content paragraphs by their class of BODY or bodyDC. Once it has identified a paragraph containing text content, it would ignore all the HTML elements preceding and following the text content, and simply grab the text content.
The regular expression I have so far is still a work in progress:
post_content_re = re.compile(r'<p class=(body\w*)(.*?>)(<.*?>)*([a-z])', re.IGNORECASE)
My explanation for my regular expression parts:
class=(body\w*) should match either BODY or bodyDC, but it doesn't, it only matches BODY, and I don't know why
(.*?>) match the remaining attributes in the paragraph element
(<.*?>)* match 0 or more html elements enclosed in <> after the paragraph element
([a-z]) The content I am trying to get would be after any HTML elements. Right now I'm just testing for one letter, not the full paragraph text, because I'm still testing.
The matches I am getting all look like this:
BODY- but I expected BODY or bodyDC
> - this is the closing > of the p element with class BODY
<span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'> - this is the span element after the P element
A - this is the first letter after the span element
So essentially, my RE is matching paragraphs like Paragraph #2 above, but not like Paragraph #1. I'm not sure why, and I'm stuck.
Thank you for any help.

I would follow a two-step approach to this problem:
1. First, collect all the paragraphs of interest.
2. Second, extract the text from each paragraph.
First
Parse out all the paragraphs that have the desired class.
<p\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=(['"]?)(?:body|bodydc)\1(?:\s|>)(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)*(?=<\/p>)
This regex (used with the case-insensitive flag, since the class names in the pattern are lowercase) will do the following:
find all the paragraph tags of the given class, up to but not including the closing </p>
avoid some odd edge cases, such as <span onmouseover=" </p> ">
due to regex limitations, it will not work with nested paragraph tags like <p>outside paragraph<p>inside paragraph</p>more text in the outside</p>
Second
Extract the raw text from each paragraph
(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)
This regex will do the following:
match both the raw text and tags
place the raw text into capture group 1
avoid difficult edge cases
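For anyone who wants to try the two steps from Python, here is a minimal sketch. The names ATTR, PARAGRAPH_RE, TEXT_OR_TAG_RE and extract_paragraph_texts are made up for illustration, and the sample HTML is a shortened version of the paragraphs in the question. It assumes re.IGNORECASE for the class names, and in the second pattern it uses [^<]+ instead of [^<]* so that the text alternative never produces an empty match. The backslash in <\/p is dropped because Python does not need it.

```python
import re

# One chunk of tag content (plain character or attribute value),
# copied from the patterns above.
ATTR = r"""(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)"""

# Step 1: a <p> with class body or bodydc, up to (not including) </p>.
PARAGRAPH_RE = re.compile(
    r"""<p\s*""" + ATTR + r"""*?\sclass=(['"]?)(?:body|bodydc)\1(?:\s|>)"""
    + r"""(?:([^<]*)|<(?!/p)""" + ATTR + r"""*>)*(?=</p>)""",
    re.IGNORECASE,
)

# Step 2: either a run of raw text (captured in group 1) or a whole tag.
# Assumption: [^<]+ rather than [^<]* so the text branch is never empty.
TEXT_OR_TAG_RE = re.compile(r"""([^<]+)|<(?!/p)""" + ATTR + r"""*>""")


def extract_paragraph_texts(html):
    """Return the text of each body/bodyDC paragraph, whitespace-normalized."""
    texts = []
    for match in PARAGRAPH_RE.finditer(html):
        # findall returns group 1 for every match; tags contribute ''.
        pieces = TEXT_OR_TAG_RE.findall(match.group(0))
        texts.append(" ".join("".join(pieces).split()))
    return texts


sample = (
    "<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;\n"
    "mso-bidi-font-size:10.0pt'>We have no need to fear\n"
    "the future.<o:p></o:p></span></p>\n"
    "<p class=MsoNormal>Navigation</p>\n"
    '<p class=BODY><span style="mso-spacerun: yes">Ages ago.</span></p>'
)

for text in extract_paragraph_texts(sample):
    print(text)
```

The paragraph with class MsoNormal is skipped, and line breaks inside the text are collapsed to single spaces; drop the split/join step if the original line breaks matter.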

While (as someone commented) you should not parse HTML like this, for this one-off job this kind of solution might just work.
Your regex is not working for the first paragraph because . does not match newlines, and you have a newline inside your tag. You can use tricks like [\S\s] to match all characters, including newlines.
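A quick way to see the newline issue, as a small sketch (the variable names are only for illustration; the tag is the two-line span from Paragraph #1):

```python
import re

# The opening span of Paragraph #1 contains a line break.
tag = "<span style='font-size:14.0pt;\nmso-bidi-font-size:10.0pt'>"

dot_default = re.fullmatch(r"<.*?>", tag)            # "." stops at the newline
any_char = re.fullmatch(r"<[\S\s]*?>", tag)          # [\S\s] also matches newlines
dot_all = re.fullmatch(r"<.*?>", tag, re.DOTALL)     # the DOTALL flag does the same

print(dot_default is None, any_char is not None, dot_all is not None)
```

Passing re.DOTALL is an alternative to the [\S\s] trick if you are happy for every . in the pattern to match newlines.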
The following pattern does not remove the tags at the end of the paragraph, but I hope it still helps:
for g1, g2, content in re.findall(r"<p (class=bodyDC|class=BODY)[^><]*>(<[\S\s]*?>)*([\S\s]*?)<\/p>", str1):
    print(content)
A bit of explanation:
<p (class=bodyDC|class=BODY)[^><]*> matches the opening paragraph tag
    <p: the beginning of the tag
    (class=bodyDC|class=BODY): one of the two class attributes
    [^><]*: any other attributes inside the tag
    >: the end of the tag
(<[\S\s]*?>)* matches any number of tags
    <: the beginning of the tag
    [\S\s]*?: the rest of the tag, including any attributes (could have also used [^><]*)
    >: end of tag
([\S\s]*?) matches any text. This is group 3, which is the content (plus the tags at the end of it).
<\/p> matches the closing paragraph tag. (The backslash before / is optional in Python. In a raw string it is written with a single backslash, as here; in a non-raw string literal it would have to be doubled, as <\\/p>.)
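If the leftover tags in the content are a problem, one possible follow-up is to strip them in a second pass. This is only a sketch: PARA_RE, TAG_RE and paragraph_texts are illustrative names, and the simple <[^>]+> pattern assumes no attribute value contains a > character.

```python
import re

# Same pattern as above, written as a raw string.
PARA_RE = re.compile(
    r"<p (class=bodyDC|class=BODY)[^><]*>(<[\S\s]*?>)*([\S\s]*?)</p>"
)
# Assumption: tags never contain ">" inside an attribute value.
TAG_RE = re.compile(r"<[^>]+>")


def paragraph_texts(html):
    """Return the tag-free, whitespace-normalized text of each matched paragraph."""
    texts = []
    for _cls, _last_tag, content in PARA_RE.findall(html):
        without_tags = TAG_RE.sub("", content)
        texts.append(" ".join(without_tags.split()))
    return texts


html = (
    "<p class=BODY><span style='a'>Ages\nago, the"
    '<span style="mso-spacerun: yes"> </span>prophet.<o:p></o:p></span></p>'
)
print(paragraph_texts(html))
```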

Related

Remove all links with a specific protocol with RegEx

I want to remove all links in a text that start with the protocols "example://" and "example_two://" and replace them with a substitute. All other links should remain untouched.
The following regex replaces all links, even though I limit the link types:
(\<a).+?(example|example_two)?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))(.+?)</a>+"
Does anyone have a suggestion for how to change the regex so that it works as expected?

Depending on which characters are allowed in the link, this should work:
import re
link1 = r'<a href=example_two://path.com/to/something?and_some_parameters=1234&and_ano%20ther_one=asdf />'
link2 = r'<a> href="example_two://path.com/to/something?and_some_parameters=1234&and_another_one=asdf"</a>'
pattern = re.compile(r"(?P<before>(?:(?P<opening><a>)|<a).*)(?:example|example_two)://[a-zA-Z0-9_/.=%?&$:;#,<>]*(?P<after>.*(?(opening)</a>|/>))")
print(pattern.sub(r"\g<before>https://stackoverflow.com\g<after>", link1))
print(pattern.sub(r"\g<before>https://example.com\g<after>", link2))
# Prints:
# <a href=https://stackoverflow.com />
# <a> href="https://example.com"</a>
This takes whatever is before the link and puts it in the before group, puts whatever comes after the link in the after group, and then substitutes the complete match in pattern.sub. The replacement is a concatenation of the before and after groups, with the replacement link in the middle.
What's more, the closing tag is conditioned on the opening tag. If the opening tag is <a>, the matched closing tag is </a>, otherwise a /> is matched.

Capture text groups in paragraph Regex

I want to capture headers and their corresponding values from a paragraph.
Example:
Paragraph: "INTRODUCTION: There was a beautiful village. CONCLUSION: End of the story"
Regex used: \b([A-Z]+(?:\s+[A-Z]+)*):\s*(.*?)(?=\s*\b(?:[A-Z]+(?:\s+[A-Z]+)*):|$)
Output: [('INTRODUCTION', 'There was a beautiful village'), ('CONCLUSION', 'End of the story')]
This works fine. But sometimes I get input like this:
Paragraph: "There was a beautiful garden once. INTRODUCTION: There was a beautiful village. CONCLUSION: End of the story"
Output expected: [('FreeText', 'There was a beautiful garden once'), ('INTRODUCTION', 'There was a beautiful village'), ('CONCLUSION', 'End of the story')]
How can I handle the case where some free text comes before the first header? Any help is appreciated.

I don't think you'll be able to use a look-behind assertion in most cases, because the free text will be of unknown length.
You could try matching from the beginning of the text (if the paragraph you've shown us is typically how the string looks).
This will match the cases without free text:
^\b([A-Z]+(?:\s+[A-Z]+)*):\s*(.*?)(?=\s*\b(?:[A-Z]+(?:\s+[A-Z]+)*):|$)
If you have cases that are in the form:
^\b([A-Z][a-z]+[.])
then you know you have a case that starts with free text, because there is no all-capitalized keyword at the start. So you could add something like this to the start of your regex and have two different cases for matching.
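As an alternative to maintaining two patterns, here is a rough sketch of a different approach: find the first header with the question's pattern and treat whatever precedes it as free text. HEADER, SECTION_RE and split_sections are names invented for this example, and the 'FreeText' label is taken from the expected output in the question. Note that the captured values keep their trailing period.

```python
import re

# An all-caps header of one or more words (from the question's pattern).
HEADER = r"[A-Z]+(?:\s+[A-Z]+)*"
SECTION_RE = re.compile(rf"\b({HEADER}):\s*(.*?)(?=\s*\b{HEADER}:|$)")


def split_sections(paragraph):
    """Return (header, value) pairs, with leading free text labeled 'FreeText'."""
    sections = []
    first = SECTION_RE.search(paragraph)
    # Everything before the first header (or the whole string) is free text.
    lead = paragraph[: first.start()] if first else paragraph
    if lead.strip():
        sections.append(("FreeText", lead.strip()))
    sections.extend(SECTION_RE.findall(paragraph))
    return sections


paragraph = (
    "There was a beautiful garden once. INTRODUCTION: There was a "
    "beautiful village. CONCLUSION: End of the story"
)
print(split_sections(paragraph))
```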

Regular Expressions: Find Names in String using Python

I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.
This is my string:
<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'
I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns for the extractions, that is okay with me.
I have tried "\s([A-Za-z0-9]{1}.+?)," and other alterations of that regex to get the data I want but I have had no success. Any help is appreciated.
The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\s).
Here is another string as an example:
<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>

An alternative approach would be to parse the string with an HTML parser, like lxml.
For example, you can use the xpath to find everything between a b tag with Carson Daly text and br tag by checking preceding and following siblings:
from lxml.html import fromstring
l = [
    """<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]
for html in l:
    tree = fromstring(html)
    results = ''
    # Select every node between the <b>Carson Daly</b> element and the <br>.
    for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if not isinstance(element, str):
            # Element node: take its text.
            results += element.text.strip()
        else:
            # Text node: drop the leading colon.
            text = element.strip(':')
            if text:
                results += text.strip()
    print(results.split(', '))
It prints:
['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']

If you want to do it in regex (and with all the disclaimers on that topic), the following regex works with your strings. However, do note that you need to retrieve your matches from capture Group 1. In the online demo, make sure you look at the Group 1 captures in the bottom right pane. :)
<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))
Basically, with the left alternations (separated by |) we match everything we don't want, then the final parentheses on the right capture what we do want.
This is an application of this question about matching a pattern except in certain situations (read that for implementation details including links to Python code).
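In Python, retrieving the Group 1 captures could look like the sketch below. NAME_RE and extract_names are illustrative names; matches where Group 1 did not participate (the tags we want to skip) are filtered out. Note that the pattern also captures the first name on the line, since it is followed by a comma.

```python
import re

# Left alternatives swallow tags; the final group captures the names.
NAME_RE = re.compile(
    r"<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))"
)


def extract_names(line):
    """Return the Group 1 captures, ignoring matches that only consumed tags."""
    return [m.group(1) for m in NAME_RE.finditer(line) if m.group(1)]


line1 = "<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'"
line2 = "<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"

print(extract_names(line1))
print(extract_names(line2))
```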

Regex quantifiers

I'm new to regex and this is stumping me.
In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds but I still can't piece together the right regex. I'd expect facebook.com\/.*?sk=info to work but it captures too much. Can you guys help?
<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress">7508 15th Avenue, Brooklyn, New York 11228</span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">

As much as I love regex, this is an html parsing task:
>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> soup = BeautifulSoup(html, 'html.parser')
>>> pred = lambda tag: tag.attrs['href'].endswith('sk=info')
>>> [tag.attrs['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']

This works :)
facebook\.com\/[^>]*?sk=info
With only .*, the match starts at the first facebook.com and continues until sk=info. Since there is another facebook.com in between, the match spans both.
The one thing in between that you don't want is a > (or <, among other characters), so changing "anything" to "anything but a >" finds the facebook.com closest to the sk=info, as you want.
And yes, regex should only be used on HTML for basic tasks. Otherwise, use a parser.
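A small sketch of the difference, using a made-up, shortened HTML snippet with two facebook.com links (the URLs and variable names are invented for the example):

```python
import re

# Hypothetical markup: an earlier facebook.com link, then the one we want.
html = (
    '<a href="https://www.facebook.com/other">x</a>'
    '<a class="title" href="https://www.facebook.com/pages/Example/1?id=1&sk=info">'
)

# Lazy ".*?" still starts at the FIRST facebook.com and runs to sk=info.
lazy = re.search(r"facebook\.com/.*?sk=info", html).group()

# "[^>]*?" cannot cross a ">", so the first link fails and the second matches.
bounded = re.search(r"facebook\.com/[^>]*?sk=info", html).group()

print(lazy)
print(bounded)
```

The lazy quantifier only controls where the match ends, not where it starts, which is why restricting the allowed characters is what fixes it.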

Why your pattern doesn't work:
Your pattern doesn't work because the regex engine tries your pattern from left to right in the string.
When the regex engine meets the first facebook.com\/ in the string, the .*? that follows lets it add every character (including ", > and spaces) to the possible match until it finds sk=info, since . can match any character except a newline.
This is why fejese suggests replacing the dot with [^"] and aliteralmind suggests replacing it with [^>]: either change makes the pattern fail at this first position in the string.
Using an HTML parser is the easiest way to deal with HTML. However, for a one-off match or search/replace, note that while an HTML parser provides safety and simplicity, it has a performance cost, since the whole document tree has to be loaded for a single task.

The problem is that you have another facebook.com part. You can restrict the .* so that it does not match ", which keeps the match within one attribute:
facebook\.com\/[^"]*sk=info

Python regex: remove certain HTML tags and the contents in them

If I have a string that contains this:
<p><span class=love><p>miracle</p>...</span></p><br>love</br>
And I want to remove the string:
<span class=love><p>miracle</p>...</span>
and maybe some other HTML tags. At the same time, the other tags and their contents should be preserved.
The result should be like this:
<p></p><br>love</br>
How can I do this using a regex pattern?
What I have tried:
r=re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('',s)
but it leaves the closing tag behind:
</span>
Can you help me do this with the re module this time? I will learn an HTML parser next.

First things first: Don’t parse HTML using regular expressions
That being said, if there is no additional span tag within that span tag, then you could do it like this:
text = re.sub('<span class=love>.*?</span>', '', text)
On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is).
The expression you have tried, <span class=love>.*?(?=</span>), is already quite good. The problem is that the lookahead (?=</span>) never consumes what it looks ahead for, so the match stops immediately before the closing span tag. You could now manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span>, but that's not really necessary: .*? is a non-greedy expression that tries to match as little as possible. So in .*?</span> the .*? only matches up to the first closing span tag, where it stops.
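To make the difference concrete, here is a short sketch run against the string from the question (the variable names are just for illustration):

```python
import re

s = "<p><span class=love><p>miracle</p>...</span></p><br>love</br>"

# The lookahead checks for </span> but does not consume it, so it is left behind.
with_lookahead = re.sub(r"<span class=love>.*?(?=</span>)", "", s)

# Matching </span> directly removes the closing tag as well.
consuming = re.sub(r"<span class=love>.*?</span>", "", s)

print(with_lookahead)  # <p></span></p><br>love</br>
print(consuming)       # <p></p><br>love</br>
```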
