Regex - Combining an 'or' with a 'look-behind' - python

Sorry about the confusing title. I am trying to figure out a simple Regex problem, but cannot figure out what the solution is.
I have a HTML snippet from a larger HTML document.
<td class="grade">100.0</td>
<td class="teacher">Mathias, Jordan</td>
Other Regex separates the two, giving them those class-names. I use a positive look-ahead to check for a . or a , (period or comma), and assign them the class of grade or teacher (respectively).
The problem comes later, when I want to check if the code in-between these tags is blank.
i.e. : <td class="grade"></td>
I would like to use a positive look-behind to check if the class is either grade or teacher (grade|teacher). In addition, I would like to check that there is truly nothing in between the >< (conjunction of the empty tags).
So far, this is what I have: (?<=.*(teacher|grade)*.+>?)[^.](?=</td>)
NOTE: This is in Python

Instead of pre-processing your HTML, trust in BeautifulSoup and use regular expression searches:
soup.find_all('td', text=re.compile(','))
This finds all <td> elements whose direct text contains a comma.
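If you do stay with the re module, note that Python's re only supports fixed-width look-behind, so a look-behind containing .* or alternatives of different lengths will not compile. A minimal sketch that captures the class name instead is shown here; the sample string and the choice to treat whitespace-only cells as blank are assumptions, not part of the question.

```python
import re

# Hypothetical sample; the class names come from the question.
html = ('<td class="grade"></td>'
        '<td class="teacher">Mathias, Jordan</td>'
        '<td class="teacher"> </td>')

# Capture the class with a group instead of a look-behind.
# \s* also treats whitespace-only cells as blank (an assumption).
EMPTY_CELL = re.compile(r'<td class="(grade|teacher)">\s*</td>')

# With a single capture group, findall returns the captured class names.
empty_classes = EMPTY_CELL.findall(html)
print(empty_classes)
```

Each entry in empty_classes is the class of one blank cell, so the same pattern answers both "is it grade or teacher" and "is it empty".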

Related

Regular expression to extract info from HTML file

I would like to use a regular expression to extract the following text from an HTML file: ">ABCDE</A></td><td>
I need to extract: ABCDE
Could anybody please help me with the regular expression that I should use?
Leaning on this, https://stackoverflow.com/a/40908001/11450166
(?<=(<A>))[A-Za-z]+(?=(<\/A>))
That expression works fine if your tag is exactly <A>...</A>.
This other one matches the form of your input:
(?<=(>))[A-Za-z]+(?=(<\/A>))
You can try using this regular expression in your specific example:
/">(.*)<\/A><\/td><td>/g
Tested on string:
Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum
extracts:
">ABCDE</A></td><td>
Then it's all a matter of extracting the substring from each match using any programming language. This can be done removing first 2 characters and last 13 characters from the matching string from regex, so that you can extract ABCDE only.
I also tried:
/">([^<]*)<\/A><\/td><td>/g
It has the same effect, but it will not match when there is additional HTML code inside that region. ([^<]*) is a negated character class that does not match < characters, so it will not catch other tag elements there. This can be useful for finer control when you are searching for specific text and need to filter out nested HTML code.
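In Python, the capture group already holds the wanted text, so no characters need to be trimmed afterwards. A small sketch using the test string from above; the variable names are only illustrative.

```python
import re

text = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum'

# Capture-group version: findall returns only what the parentheses matched.
captured = re.findall(r'">([^<]*)</A></td><td>', text)

# Look-around version: the match itself is only the letters.
looked = re.findall(r'(?<=>)[A-Za-z]+(?=</A>)', text)

print(captured, looked)
```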

Python regular expression grabbing paragraphs from old HTML

I am working on transferring old content from a website, written in some old HTML, to their new WordPress site. I am using Python to do this. I am trying to
1. get the content from the old HTML pages using urllib.request
2. use a regular expression to grab the text of HTML <p> elements that have classes that identify them as the body of the text
3. use XML-RPC methods to upload the content to the new WordPress site.
I'm ok with #1 and #3. The problem I am having is with #2, writing the regular expression to capture the content.
The content is in paragraphs of varying format. Below are two representative paragraphs whose content I am trying to extract with a regular expression.
Paragraph #1
<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;
mso-bidi-font-size:10.0pt'>We have no need to fear the future." So said
bishop-elect H. George Anderson at a news conference immediately following his election as
bishop of the Evangelical Lutheran Church in America. "[The
future] belongs­ to God, untouched by human hands." At the beginning of a
new ministry of leadership and pastoral oversight, such words from a bishop are
obviously designed to project confidence and a profound sense of trust in the
mission of the Church. They are words designed to inspire and empower the
people of God for ministry.<o:p></o:p></span></p>
Paragraph #2
<p class=BODY><span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'>Ages
ago, another prophet of the people stood at his station and peered into the
future. The<span style="mso-spacerun: yes"> </span>prophet Habakkuk poised on
the rampart, scanned the horizon for the approaching enemy he knew was coming.
As he waited, Habakkuk prayed to God asking why God was unresponsive to all
this violence and destruction. In Habakkuk chapter 2 the prophet records God's
answer to his questions about the future. God says to the fearful one, "For
there is still a vision for the appointed time;… If it seems to tarry, wait for
it; it will surely come, it will not delay…the righteous live by faith"
(2:3-4).<o:p></o:p></span></p>
Ideally my regular expression would identify content paragraphs by their class of BODY or bodyDC. Once it has identified a paragraph containing text content, it would ignore all the HTML elements preceding and following the text content, and simply grab the text content.
The regular expression I have so far is still a work in progress:
post_content_re = re.compile(r'<p class=(body\w*)(.*?>)(<.*?>)*([a-z])', re.IGNORECASE)
My explanation for my regular expression parts:
class=(body\w*) should match either BODY or bodyDC, but it doesn't, it only matches BODY, and I don't know why
(.*?>) match the remaining attributes in the paragraph element
(<.*?>)* match 0 or more html elements enclosed in <> after the paragraph element
([a-z]) The content I am trying to get would be after any HTML elements. Right now I'm just testing for one letter, not the full paragraph text, because I'm still testing.
The matches I am getting all look like this:
BODY- but I expected BODY or bodyDC
> - this is the closing > of the p element with class BODY
<span style='font-size:14.0pt;mso-bidi-font-size:10.0pt'> - this is the span element after the P element
A - this is the first letter after the span element
So essentially, my RE is matching paragraphs like Paragraph #2 above, but not like Paragraph #1. I'm not sure why, and I'm stuck.
Thank you for any help.
I would follow a two step approach to this problem.
first collect all the paragraphs of interest
second extract the text from each paragraph
First
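Parse out all the paragraphs that have the desired class. The pattern spells the class names in lowercase, so it needs the case-insensitive flag to match BODY and bodyDC.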
<p\s*(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=(['"]?)(?:body|bodydc)\1(?:\s|>)(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)*(?=<\/p>)
This regex will do the following:
find all the paragraph tags of the given class, up to but not including the closing </p>
avoid some odd edge cases, such as <span onmouseover=" </p> ">
due to regex limitations, this will not work with nested paragraph tags like <p>outside paragraph<p>inside paragraph</p>more text in the outside</p>
Second
Extract the raw text from each paragraph
(?:([^<]*)|<(?!\/p)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>)
This regex will do the following:
match both the raw text and tags
place the raw text into capture group 1
avoid difficult edge cases
While (as someone commented) you should not parse HTML like this, for this one-off job this kind of solution might just work.
Your regex is not working for the first paragraph because . does not match newlines, and you have a newline inside your tag. You can use tricks like [\S\s] to match all characters, including newlines.
This one does not remove the tags at the end of the paragraph, but I hope it still helps:
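import re
# str1 holds the HTML source; backslashes are doubled because this is not a raw string
for g1, g2, content in re.findall("<p (class=bodyDC|class=BODY)[^><]*>(<[\\S\\s]*?>)*([\\S\\s]*?)<\\/p>", str1):
    print(content)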
Bit of explanation:
<p (class=bodyDC|class=BODY)[^><]*> matches the opening paragraph tag
<p: the beginning of the tag
(class=bodyDC|class=BODY): one of the two class attributes
[^><]*: any other attributes inside the tag
>: the end of the tag
(<[\S\s]*?>)* matches any number of tags
<: the beginning of the tag
[\S\s]*?: any other attributes (could have also used [^><]*)
>: end of tag
([\S\s]*?) matches any text. This is group 3, this is basically the content. (Plus the tags at the end of it.)
<\/p> matches the closing paragraph tag. (Note that in the code it actually appears as <\\/p>, because the backslash has to be escaped in the python string.)
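To also drop the tags left at the end of the content, one option is a second pass that strips every remaining tag. The following is a rough sketch under stated assumptions: the sample HTML is a shortened stand-in for the two paragraphs in the question, the helper name paragraph_texts is made up for illustration, and the tag pattern assumes no attribute value contains a > character.

```python
import re

# Shortened stand-ins for the two sample paragraphs (assumed data).
html = """<p class=bodyDC style='text-indent:12.0pt'><span style='font-size:14.0pt;
mso-bidi-font-size:10.0pt'>We have no need to fear
the future.<o:p></o:p></span></p>
<p class=BODY><span style='font-size:14.0pt'>Ages
ago, another prophet<span style="mso-spacerun: yes"> </span>stood.<o:p></o:p></span></p>"""

# Step 1: grab everything between the opening <p ...> and </p>.
# re.DOTALL lets . cross the newlines inside the paragraph.
PARAGRAPH = re.compile(r"<p\s+class=(?:bodyDC|BODY)\b[^>]*>(.*?)</p>", re.DOTALL)

# Step 2: any remaining tag; [^>] is a negated class, so it spans newlines too.
TAG = re.compile(r"<[^>]+>")


def paragraph_texts(source):
    """Return the plain text of each body paragraph, whitespace-normalised."""
    texts = []
    for inner in PARAGRAPH.findall(source):
        plain = TAG.sub("", inner)
        texts.append(" ".join(plain.split()))
    return texts


texts = paragraph_texts(html)
for line in texts:
    print(line)
```

Collapsing whitespace with split and join also removes the line breaks that the old HTML has in the middle of sentences.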

Python regex: remove certain HTML tags and the contents in them

If I have a string that contains this:
<p><span class=love><p>miracle</p>...</span></p><br>love</br>
And I want to remove the string:
<span class=love><p>miracle</p>...</span>
and maybe some other HTML tags, while the remaining tags and their contents are preserved.
The result should be like this:
<p></p><br>love</br>
How can I do this with a regex pattern?
What I have tried:
r=re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('',s)
but it leaves the
</span>
Can you help me do this with the re module this time? I will learn an HTML parser next.
First things first: Don’t parse HTML using regular expressions
That being said, if there is no additional span tag within that span tag, then you could do it like this:
text = re.sub('<span class=love>.*?</span>', '', text)
On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is).
The expression you have tried, <span class=love>.*?(?=</span>), is already quite good. The problem is that the lookahead (?=</span>) only checks that the closing tag follows; it never consumes it. So the match stops immediately before the closing span tag, and the tag is left behind. You could manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span>, but that’s not really necessary: the .*? is a non-greedy expression. It will try to match as little as possible. So in .*?</span> the .*? will only match until a closing span is found, where it immediately stops.
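The difference between the two patterns can be checked directly on the string from the question; the variable names here are only illustrative.

```python
import re

s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'

# The lookahead checks for </span> but does not consume it, so it stays in the result.
with_lookahead = re.sub(r'<span class=love>.*?(?=</span>)', '', s)

# Matching </span> as part of the pattern removes it together with the content.
consuming = re.sub(r'<span class=love>.*?</span>', '', s)

print(with_lookahead)
print(consuming)
```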

Can I have a non-greedy regex with dotall?

I would like to match dotall and non-greedy. This is what I have:
img(.*?)(onmouseover)+?(.*?)a
However, this is not being non-greedy. This data is not matching as I expected:
<img src="icon_siteItem.gif" alt="siteItem" title="A version of this resource is available on siteItem" border="0"></a><br><br></td><td rowspan="4" width="20"></td></tr><tr><td>An activity in which students find other more specific adjectives to
describe a range of nouns, followed by writing a postcard to describe a
nice holiday without using the word 'nice'.</td></tr><tr><td>From the resource collection: Drafting </td></tr><tr><td><abbr style="border-bottom:0px" title="Key Stage 3">thing</abbr> | <abbr style="border-bottom:0px" title="Key Stage 4">hello</abbr> | <abbr style="border-bottom:0px" title="Resources">Skills</abbr></td></tr></tbody></table></div></div></td></tr><tr><td><div style="padding-left: 30px"><div><table style="" bgcolor="#DFE7EE" border="0" cellpadding="0" cellspacing="5" width="100%"><tbody><tr valign="top"><td rowspan="4" width="60"><img name="/attachments/3700.pdf" onmouseover="ChangeImageOnRollover(this,'/application/files/images/attach_icons/rollover_pdf.gif')" onmouseout="ChangeImageOnRollover(this,'/application/files/images/attach_icons/small_pdf.gif')" src="small_pdf.gif" alt="Download Recognising and avoiding ambiguity in PDF format" title="Download in PDF format" style="vertical-align: middle;" border="0"><br>790.0 k<br>
and I cannot understand why.
What I think I am stating in the above regex is:
start with "img", then allow 0 or more any character including new line, then look for at least 1 "onmouseover", then allow 0 or more any character including new line, then an "a"
Why doesn't this work as I expected?
KEY POINT: dotall must be enabled
It is being non-greedy.
It is your understanding of non-greedy that is not correct.
A regex will always try to match.
Let me show a simplified example of what non-greedy actually means(as suggested by a comment):
re.findall(r'a*?bc*?', 'aabcc', re.DOTALL)
This will match:
as few repetitions of 'a' as possible (in this case 2)
followed by a 'b'
and as few repetitions of 'c' as possible (in this case 0)
so the only match is 'aab'.
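For comparison, a quick sketch running the same pattern with greedy and non-greedy quantifiers on the same input:

```python
import re

# Non-greedy: a*? still has to take both a's to reach the b; c*? takes nothing.
lazy = re.findall(r'a*?bc*?', 'aabcc')

# Greedy: a* and c* take as many characters as they can.
greedy = re.findall(r'a*bc*', 'aabcc')

print(lazy, greedy)
```

The non-greedy a*? cannot skip the leading a's, because the match starts at the leftmost position where any match is possible.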
And just to conclude:
Don't use regex to parse HTML. There are libraries that were made for the job. re is not one of them.
First of all, your regex looks a little funky: you're saying match "img", then any number of characters, "onmouseover" at least once, but possibly repeated (e.g. "onmouseoveronmouseoveronmouseover"), followed by any number of characters, followed by "a".
This should match from img src="icon_ all the way to onmouseover="Cha. That's probably not what you want, but it's what you asked for.
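This behaviour can be reproduced on a shortened version of the data. The sketch below uses a trimmed, assumed sample, and the second pattern is only one possible way to keep the match inside a single tag (it assumes no attribute value contains a > character).

```python
import re

# Trimmed stand-in for the data in the question (assumed sample).
data = ('<img src="icon_siteItem.gif" border="0"></a><br>\n'
        '<td><img name="/attachments/3700.pdf" '
        'onmouseover="ChangeImageOnRollover(this)" src="small_pdf.gif">')

# Original pattern: the match starts at the FIRST "img", because the regex
# engine takes the leftmost possible start, then expands lazily from there.
m = re.search(r'img(.*?)(onmouseover)+?(.*?)a', data, re.DOTALL)
print(m.group(0))

# One alternative: [^>] cannot cross a tag boundary, so the match stays in one tag.
m2 = re.search(r'<img\b[^>]*?\bonmouseover=[^>]*>', data)
print(m2.group(0))
```

Non-greedy only limits how far each quantifier expands; it does not make the engine look for the shortest match overall.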
Second, and this is significantly more important:
DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.
And in case you didn't understand it the first time, let me repeat it in italics:
DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.
Finally, let me link you to the canonical grimoire on the subject:
You can't parse [X]HTML with a regex

Python regex: how to extract inner data from regex

I want to extract data from such regex:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found a related question, "extract contents of regex",
but in my case I should iterate somehow.
As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
     ^^^^^^^^^           ^^^^^^^^^^^           ^^^^^           ^^^^^^^^^^^
     group 1             group 2               group 3         group 4
And then access the groups by number. (Use just the first line; the line with the '^'s and the line naming the groups are only there to help you see the capture groups as specified by the parentheses.)
import re
dataPattern = re.compile(r"<td>[a-zA-Z]+</td>... etc.")
match = dataPattern.search(htmlstring)  # compiled patterns have search(), not find()
field1 = match.group(1)
field2 = match.group(2)
and so on. But you should know that using re's to crack HTML source is one of the paths toward madness. Many surprises can lurk in your input HTML that are perfectly valid HTML but will easily defeat your re:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
"&nbsp;" spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
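If you do stay with re for a quick job, iterating over all matches can be sketched with finditer. The input string below is hypothetical, and the escaped dot (\.) assumes a literal decimal point was intended where the question's pattern has a bare dot.

```python
import re

# Hypothetical input with two table rows.
html = ("<tr><td>apple</td><td>1.50</td><td>3</td><td>4.50</td></tr>"
        "<tr><td>pear</td><td>2.25</td><td>2</td><td>4.50</td></tr>")

# Same four capture groups as above, with the decimal point escaped.
ROW = re.compile(
    r"<td>([a-zA-Z]+)</td><td>(\d+\.\d+)</td><td>(\d+)</td><td>(\d+\.\d+)</td>"
)

# finditer yields one match object per row; groups() returns all four fields.
rows = [m.groups() for m in ROW.finditer(html)]
for name, price, count, total in rows:
    print(name, price, count, total)
```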
As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas however: this is really of limited use in the context of real HTML, just as others pointed out.
Further, be aware that the regular expression string left over after this replacement probably does not parse what you want: it will not automatically match the text the original pattern matched, minus the <td> tags.
