I'm practicing regular expressions on an HTML file. My goal is to fetch the title of the file:
<tittle>Popular baby names</tittle>
I tried something like this:
pattern = re.compile(r'>.+<')
and instead of what I'm looking for I'm getting:
((1791, 1794), '>?<')
((2544, 2547), '>1<')
((2605, 2608), '>2<')
I've read that the dot matches any character except a newline, which makes me wonder why it isn't working.
If you want to capture only what's inside the tag, use a capturing group ().
import re
s = '<tittle>Popular baby names</tittle> some text <title>Other title</title> <strong>bold</strong>'
re.findall(r'>([\w\s]+)</', s)
# ['Popular baby names', 'Other title', 'bold']
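To see why the original pattern overshoots, note that the greedy `.+` grabs as much as it can; a minimal sketch:

```python
import re

s = '<tittle>Popular baby names</tittle> <b>x</b>'

# Greedy .+ runs from the first '>' to the last '<' on the line:
print(re.search(r'>.+<', s).group())  # '>Popular baby names</tittle> <b>x<'

# A character class that excludes angle brackets, plus a capturing
# group anchored on the closing '</', keeps only the tag contents:
print(re.findall(r'>([^<>]+)</', s))  # ['Popular baby names', 'x']
```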
Related
I want to write a regular expression which pulls out the following from between, but not including, <p> and </p> tags:
it begins with the word "Cash", has some stuff I don't want, then a number, $336,008.
From:
<p>Cash nbs $13,000</p>
I want "Cash" and if available (maybe 0 or nothing) the number.
What I have so far gets me everything inside of the <p> tags including the tags themselves.
\<p\>(|.*Cash.*|.*Total.*\$\d+.*)\<\/p\>
If you're using re.findall, just wrap what you want to capture in ().
>>> test = '<p>Cash nbs $13,000</p>'
>>> re.findall(r'<p>(Cash).+?\$[\d,]+</p>', test)
['Cash']
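If you also want the dollar amount when it is present, one option (a sketch, not the only way) is to make the amount an optional capture group:

```python
import re

# Optional second group: findall returns ('Cash', '$13,000') when an
# amount is present and ('Cash', '') when it is not.
pattern = re.compile(r'<p>(Cash)(?:[^$<]*(\$[\d,]+))?</p>')

print(pattern.findall('<p>Cash nbs $13,000</p>'))  # [('Cash', '$13,000')]
print(pattern.findall('<p>Cash</p>'))              # [('Cash', '')]
```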
I have the following regular expression to extract song names from a certain website:
<h2 class="chart-row__song">(.*?)</h2>
It displays the results below:
Where an escaped entity such as &#039; appears in the output, there is an apostrophe on the website the song name was extracted from.
How would I go about changing my regular expression to handle those characters?
TIA
As stated in the comments, you can't do that using a regex alone. You need to unescape the HTML entities present in the match separately.
import re
import html
regex = re.compile(r'<h2 class="chart-row__song">(.*?)</h2>')
result = [html.unescape(s) for s in regex.findall(mystring)]
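Here is the whole thing end to end on a made-up sample string (the chart markup here is assumed for illustration, not taken from the actual site):

```python
import re
import html

# Hypothetical sample input; the real page markup may differ.
mystring = '<h2 class="chart-row__song">Don&#039;t Stop Believin&#039;</h2>'

regex = re.compile(r'<h2 class="chart-row__song">(.*?)</h2>')

# html.unescape turns entities like &#039; back into the characters
# they stand for.
result = [html.unescape(s) for s in regex.findall(mystring)]
print(result)  # ["Don't Stop Believin'"]
```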
I'd like to identify the characters within a string that are located relatively to a string I search for.
In other words, if I search for 'Example Text' in the below string, I'd like to identify the immediate characters that come before and after 'Example Text', including the '<' and '>'.
For example, if I searched the below string for 'Example Text', I'd like the function to return <h3> and </h3>, since those are the characters that come immediately before and after it.
String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music & Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
I do not believe you are asking the right question here. I think what you're actually aiming for is:
Given a piece of text, how can I capture the html element that encapsulates it
Very different problem and one that should NEVER be solved with a regex. If you want to know why, just google it.
As far as that other question goes and capturing the relevant html tag, I would recommend using lxml. The docs can be found here. For your use case you could do the following:
>>> from lxml import etree
>>> from io import StringIO
>>> your_string = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music & Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
>>> parser = etree.HTMLParser()
>>> document = etree.parse(StringIO(your_string), parser)
>>> elements = document.xpath('//*[text()="Example Text"]')
>>> elements[0].tag
'h3'
I believe it can be done with BeautifulSoup:
from bs4 import BeautifulSoup

String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music & Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
soup = BeautifulSoup(String, 'html.parser')
query = 'Example Text'
for elem in soup(string=query):
    print(str(elem.parent).replace(query, ''))
Reasons to not use regex:
Difficulty in defining number of characters to return before and after match.
If you match for tags, what do you do if the searched-for text is not immediately surrounded by tags?
Obligatory: Tony the Pony says so
If you're parsing HTML/XML, use an HTML/XML parser. lxml is a good one, I personally prefer using BeautifulSoup, as it uses lxml for some of its heavy lifting, but has other features as well, and is more user-friendly, especially for quick matches.
You can use the regex <[^>]*> to match a tag, then use groups defined with parentheses to separate your match into the blocks that you want:
m = re.search("(<[^>]*>)Example Text(<[^>]*>)", String)
m.groups()
Out[7]: ('<h3>', '</h3>')
I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.
This is my string:
<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'
I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns for the extractions, that is okay with me.
I have tried "\s([A-Za-z0-9]{1}.+?)," and other variations of that regex to get the data I want, but I have had no success. Any help is appreciated.
The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\s).
Here is another string as an example:
<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>
An alternative approach would be to parse the string with an HTML parser, like lxml.
For example, you can use the xpath to find everything between a b tag with Carson Daly text and br tag by checking preceding and following siblings:
from lxml.html import fromstring

l = [
    """<b>Carson Daly</b>: Ben Schwartz, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]

for html in l:
    tree = fromstring(html)
    results = ''
    for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if not isinstance(element, str):
            results += element.text.strip()
        else:
            text = element.strip(':')
            if text:
                results += text.strip()
    print(results.split(', '))
It prints:
['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
If you want to do it in regex (and with all the disclaimers on that topic), the following regex works with your strings. However, do note that you need to retrieve your matches from capture Group 1. In the online demo, make sure you look at the Group 1 captures in the bottom right pane. :)
<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))
Basically, with the left alternations (separated by |) we match everything we don't want, then the final parentheses on the right capture what we do want.
This is an application of this question about matching a pattern except in certain situations (read that for implementation details including links to Python code).
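Applied in Python, that looks something like the following. The tag-matching alternatives capture nothing, so findall yields empty strings for them, and those get filtered out:

```python
import re

# Left alternatives consume tags and tag pairs (capturing nothing);
# the final group captures comma-delimited names.
pattern = re.compile(r'<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))')

s = '<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>'
names = [m for m in pattern.findall(s) if m]
print(names)  # ['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
```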
How to find all words except the ones in tags using RE module?
I know how to find something, but how do I do it the opposite way? I can write a pattern for what to search for, but actually I want to match every word except the tags themselves and everything inside them.
So far I managed this:
f = open (filename,'r')
data = re.findall(r"<.+?>", f.read())
Well, it prints everything inside <> tags, but how do I make it find every word except what's inside those tags?
I tried ^ at the start of a pattern inside [], but then symbols such as . are treated literally, without their special meaning.
I also managed to solve this by splitting the string on '''\= <>"''', then checking the whole string for words that are inside <> tags (like align, right, td etc.), and appending words that are not inside <> tags to another list. But that's a bit of an ugly solution.
Is there some simple way to search for every word except anything that's inside <> and these tags themselves?
So let's say the string is 'hello 123 <b>Bold</b> <p>end</p>';
with re.findall, the result would be:
['hello', '123', 'Bold', 'end']
Using a regex for this kind of task is not the best idea, as you cannot make it work for every case.
One solution that should catch most such words is the regex pattern
\b\w+\b(?![^<]*>)
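A quick check against the example string from the question (a sketch; as noted above, it will not survive every corner of real HTML):

```python
import re

s = 'hello 123 <b>Bold</b> <p>end</p>'

# The negative lookahead rejects a word if a '>' can be reached without
# first crossing a '<' -- i.e. the word sits inside a tag.
print(re.findall(r'\b\w+\b(?![^<]*>)', s))  # ['hello', '123', 'Bold', 'end']
```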
If you want to avoid using a regular expression, BeautifulSoup makes it very easy to get just the text from an HTML document:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, 'html.parser')
text = "".join(soup.find_all(string=True))
From there, you can get the list of words with split:
words = text.split()
Something like re.compile(r'<[^>]+>').sub('', string).split() should do the trick.
You might want to read this post about processing context-free languages using regular expressions.
Strip out all the tags (using your original regex), then match words.
The only weakness is if there are <s in the strings other than as tag delimiters, or the HTML is not well formed. In that case, it is better to use an HTML parser.
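The strip-then-split approach, spelled out on the question's example string:

```python
import re

s = 'hello 123 <b>Bold</b> <p>end</p>'

# Remove the tags first, then split what is left into words.
words = re.sub(r'<[^>]+>', '', s).split()
print(words)  # ['hello', '123', 'Bold', 'end']
```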