I am using line.rfind() to find a certain line in an HTML page, and then I split the line to pull out individual numbers. For example:
position1 = line.rfind('Wed')
This finds this particular line of html code:
<strong class="temp">79<span>°</span></strong><span class="low"><span>Lo</span> 56<span>°</span></span>
First I want to pull out the '79', which is done with the following code:
if position1 > 0:
    self.high0 = lines[line_number + 4].split('<span>')[0].split('">')[-1]
This works perfectly. The problem I am encountering is extracting the '56' from that line of HTML. I can't split between '<span>' and '</span>' since the first '<span>' in the line comes after the '79'. Is there a way to tell the script to look for the second occurrence of '<span>'?
Thanks for your help!
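For what it's worth, str.split already hands you every occurrence at once, so indexing into the resulting list reaches the second '<span>' directly. A quick sketch against the quoted line (note that <span class="low"> is not a split point, since it doesn't literally match '<span>'):

```python
line = ('<strong class="temp">79<span>°</span></strong>'
        '<span class="low"><span>Lo</span> 56<span>°</span></span>')

parts = line.split('<span>')
# parts[0] ends with the high temp; parts[2] is the chunk after the
# second literal '<span>', i.e. 'Lo</span> 56'
high = parts[0].split('">')[-1]               # '79'
low = parts[2].split('</span>')[-1].strip()   # '56'
print(high, low)  # 79 56
```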
Concerns about parsing HTML with regex aside, I've found that regex tends to be fairly useful for grabbing information from limited, machine-generated HTML.
You can pull out both values with a regex like this:
import re
matches = re.findall(r'<strong class="temp">(\d+).*?<span>Lo</span> (\d+)', lines[line_number+4])
if matches:
    high, low = matches[0]
Consider this quick-and-dirty: if you rely on it for real work, you may want to use a proper parser like BeautifulSoup.
import re
html = """
<strong class="temp">79<span>°</span></strong><span class="low"><span>Lo</span> 56<span>°</span></span>
"""
numbers = re.findall(r"\d+", html, re.X|re.M|re.S)
print(numbers)
--output:--
['79', '56']
With BeautifulSoup:
from bs4 import BeautifulSoup
html = """
<strong class="temp">
79
<span>°</span>
</strong>
<span class="low">
<span>Lo</span>
56
<span>°</span>
</span>
"""
soup = BeautifulSoup(html, "html.parser")
low_span = soup.find('span', class_="low")
for string in low_span.stripped_strings:
    print(string)
--output:--
Lo
56
°
I have this string:
<div class"ewSvNa"><a class="ugP" href="link">Description</a><span data-testid=""><small>$</small><span>0,00</span></div>
and this regex /ewS.*?ugP\".*?f=\"(.*?)\">(.*?)<.*?<s.*?n>(.*?)</g. The result is:
Group 1 = 'link'
Group 2 = 'Description'
Group 3 = '0,00'
My question is: is it possible to have the result of Group 3 as '$0,00'?
Thank u guys =]]]]]
It's recommended not to use regex to parse HTML - instead use a proper parser such as Beautiful Soup.
Then your code becomes:
from bs4 import BeautifulSoup
text = '<div class"ewSvNa"><a class="ugP" href="link">Description</a><span data-testid=""><small>$</small><span>0,00</span></div>'
soup = BeautifulSoup(text, "html.parser")
amount = soup.select_one('span[data-testid]').get_text()
# '$0,00'
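If you do want to stay with a regex, one option is to capture the currency symbol and the amount in separate groups and join them afterwards - a sketch against the string from the question (the <small>/<span> anchors here are my own choice, not the original pattern):

```python
import re

text = ('<div class"ewSvNa"><a class="ugP" href="link">Description</a>'
        '<span data-testid=""><small>$</small><span>0,00</span></div>')

# group 1 = the symbol inside <small>, group 2 = the amount in the next <span>
m = re.search(r'<small>(.*?)</small><span>(.*?)</span>', text)
amount = m.group(1) + m.group(2) if m else None
print(amount)  # $0,00
```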
Consider the following HTML:
<li>
<a href="url">
<b>This</b>
" is "
<b>a</b>
" test "
<b>string</b>
"!"
</a>
</li>
I would like to extract all the text inside the <a> tag except "!". In other words, the text contained between the first opening <b> and the last closing </b>: This is a test string.
from bs4 import BeautifulSoup
html = '''
<li>
<a href="url">
<b>This</b>
" is "
<b>a</b>
" test "
<b>string</b>
"!"
</a>
</li>
'''
soup = BeautifulSoup(html, "html.parser")
anchor = soup.a
Note that the number of <b> tags and strings without tags varies so next or next_sibling won't work.
Is there an easier way to do this?
Edit:
Ideally, I would like a method that works even if I have more than one string not enclosed in tags after the last </b>.
Try the code below
result = ''.join([i.strip().replace('"', '') for i in anchor.strings if i.strip()][:-1])
print(result)
output
'This is a test string'
Based on your question and comments, I think getting the indexes of substrings and operating on a whole subset of the HTML could do what you need.
Let's create a function to retrieve all of the indexes of a substring first (see the answer by @AkiRoss):
def findall(p, s):
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i + 1)
Then use this to find occurrences of <b> and </b>.
opening_b_occurrences = [i for i in findall('<b>', html)]
# has the value of [21, 40, 58]
closing_b_occurrences = [i for i in findall('</b>', html)]
# has the value of [28, 44, 67]
Now you can use that information to get a substring of HTML to do your text extraction on:
first_br = opening_b_occurrences[0]
last_br = closing_b_occurrences[-1] # getting the last one from list
text_inside_br = html[first_br:last_br]
The text in text_inside_br should now be '<b>This</b>\n" is "\n<b>a</b>\n" test "\n<b>string'. You can clean it now, for example by appending '</b>' back to it and using BeautifulSoup to extract the values, or just using a regex.
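To finish that cleanup without tracking every index, a greedy slice from the first <b> to the last </b> also works - a sketch using str.find/str.rfind in place of the findall helper above:

```python
import re

html = '''
<li>
<a href="url">
<b>This</b>
" is "
<b>a</b>
" test "
<b>string</b>
"!"
</a>
</li>
'''

# slice from the first '<b>' up to (and including) the last '</b>'
fragment = html[html.find('<b>'):html.rfind('</b>')] + '</b>'
# drop the <b>/</b> tags and the quote marks, then normalise whitespace
cleaned = re.sub(r'</?b>|"', '', fragment)
result = ' '.join(cleaned.split())
print(result)  # This is a test string
```

Because the slice runs to the last '</b>', this still works when several untagged strings follow the final <b> tag.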
In Python I have a list with string items that look like this:
My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>
The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>
...
And what I want to do is to substitute (in every list item) the href syntax leaving only the link as text, so my list would look like:
My website is WEBSITE1
The link is LINK1
...
I was thinking about matching and replacing this regex:
<a href="(.*?)" target='_blank'><u>(.*?)</u></a>
with:
(.*?)
but it doesn't work, and it seems too complex. Is there an easy way to get a list object with cleaned items as output?
You can also process the string with an HTML parser, e.g. BeautifulSoup and its replace_with() - finding all a elements in a string and replacing them with the link texts:
>>> from bs4 import BeautifulSoup
>>> l = [
... """My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>""",
... """The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>"""
... ]
>>> for item in l:
... soup = BeautifulSoup(item, "html.parser")
... for a in soup("a"):
... a.replace_with(a.text)
... print(str(soup))
...
My website is WEBSITE1
The link is LINK1
Or, as pointed out by @user3100115 in the comments, just getting the text of the "soup" object also works on your sample data:
>>> for item in l:
... print(BeautifulSoup(item, "html.parser").get_text())
...
My website is WEBSITE1
The link is LINK1
This regex seems to work:
<a\s+href\s*=\s*"([^"]+).*
Python Code
p = re.compile(r'<a\s+href\s*=\s*"([^"]+).*')
test_str = ["My website is <a href=\"WEBSITE1\" target='_blank'><u>WEBSITE1</u></a>", "The link is <a href=\"LINK1\" target='_blank'><u>LINK1</u></a>"]
for x in test_str:
    print(re.sub(p, r"\1", x))
If I had to use a regex I would use something like
<a href.*?><u>(.*?)<\/u><\/a>
and then replace in a list comprehension
pattern = re.compile(r'<a href.*?><u>(.*?)<\/u><\/a>')
print([re.sub(pattern, r"\1", string) for string in my_list])
But consider using beautifulsoup or another html parser, as pointed out in other answers, which will provide you with a more generic solution
Regex Explanation
<a href.*?> match an a href tag, non greedy, up to the first closing bracket
<u> match the u tag
(.*?) match the string you want to keep
<\/u><\/a> match the closing tags
Retrieve the parenthesized capture group in your re.sub:
>>> s = """
My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>
The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>
"""
>>> re.sub("<a href=\"(.*?)\" target='_blank'><u>(.*?)</u></a>", r'\1', s)
'\nMy website is WEBSITE1 \nThe link is LINK1 \n'
Make sure the replacement string is a proper r-escaped (raw) string, or Python will interpret \1 as the escape character \x01 rather than a backreference.
As your input is a list (let's assume its name is s):
>>> for i in range(0,len(s)):
... s[i] = re.sub("<a href=\"(.*?)\" target='_blank'><u>(.*?)</u></a>", r'\1', s[i])
>>> s
['My website is WEBSITE1', 'The link is LINK1']
If this is done regularly or on a large list, you can compile the regex once before the loop.
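A precompiled variant of that loop might look like this (same pattern as above, compiled once outside the comprehension):

```python
import re

pattern = re.compile(r"""<a href="(.*?)" target='_blank'><u>(.*?)</u></a>""")

links = ["""My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>""",
         """The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>"""]

# replace each whole <a ...><u>...</u></a> element with its href (group 1)
links = [pattern.sub(r'\1', item) for item in links]
print(links)  # ['My website is WEBSITE1', 'The link is LINK1']
```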
Please clarify: your title says to remove HTML href tags, but in your example you also remove the u tags.
The answer can be simplified if we are guaranteed that there are no HTML tags other than a and u (or if we want to remove all tags). In that case we can search for anything between < and >, or for anything between <a (or </a) and >. My answer assumes this, so it will be invalid otherwise.
import re
S = (
    """My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>""",
    """The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>""",
)
RE1 = re.compile(r"</?[^>]*>")
RE2 = re.compile(r"</?[aA][^>]*>")
for s in S:
    s1 = RE1.sub("", s)  # remove all tags
    s2 = RE2.sub("", s)  # remove only <a> and </a> tags
    print(s)
    print(s1)
    print(s2)
    print("")
When run, it produces
My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>
My website is WEBSITE1
My website is <u>WEBSITE1</u>

The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>
The link is LINK1
The link is <u>LINK1</u>
The first line is the original string, the second has all HTML tags removed, and the third has only the a tags removed. I did not include a third choice: removing only a tags that have an href attribute.
I am trying to extract the text between <th> tags from an HTML table. The following code explains the problem
searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):
    print(i.group(1))
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to catch the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.
Why not use an HTML parser instead - BeautifulSoup, for example:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']
Also note that str is a bad choice for a variable name - you are shadowing a built-in str.
You can reduce the regex to a single pattern with one capturing group:
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
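A quick check of that single-group pattern against the string from the question - the \b also keeps it from matching a <thead> tag:

```python
import re

searchstr = '<th class="c1">data 1</th><th>data 2</th>'
# \b matches between 'th' and either '>' or a space, but not inside 'thead'
p = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
print(p.findall(searchstr))  # ['data 1', 'data 2']
```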
I have the below URL and would like to extract prices. For that I load the page into BeautifulSoup:
soup = bs(content, 'lxml')
for e in soup.find_all(class_="totalPrice"):
Now I get an element that looks like this (a single bs4.element.Tag):
<td class="totalPrice" colspan="3">
<div data-component="track" data-hash="OLNYSRfCbdWGffSRe" data-stage="1" data-track="view"></div>
Total: £145
</td>
How can I create another find expression that will extract the 145? Is there a way to search for "Total" and then get the text just next to it?
URL with original content that I extract
Use a regex!
>>> import re
>>> search_text = 'blah Total: result'
>>> result = re.findall(r'Total: (.*)', search_text)
>>> result
['result']
If you want to be more general and capture anything that looks like currency, try this on text that actually contains the price, e.g. 'Total: £145':
>>> result = re.findall(r': (£\d*)', 'Total: £145')
>>> result
['£145']
This will get you the currency symbol £ plus the following digits.
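Applied to the cell from the question, a search anchored on 'Total' can pull out just the number. The exact whitespace around the text is an assumption about what get_text() returns for that td:

```python
import re

td_text = '\n\nTotal: £145\n'  # roughly what e.get_text() yields for the cell
m = re.search(r'Total:\s*£(\d+)', td_text)
print(m.group(1) if m else None)  # 145
```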
You can get the text from the tag
text = e.get_text()
and you have a normal string, Total: £145, so you can split it
text.split(' ')   # ['Total:', '£145']
slice it
text[8:]   # '145'
or use a regular expression, etc.
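Tying the pieces together, a minimal end-to-end sketch - the td markup is copied from the question, and 'html.parser' stands in for lxml so the snippet needs no extra dependency:

```python
import re
from bs4 import BeautifulSoup

content = '''
<td class="totalPrice" colspan="3">
<div data-component="track" data-hash="OLNYSRfCbdWGffSRe" data-stage="1" data-track="view"></div>
Total: £145
</td>
'''

soup = BeautifulSoup(content, 'html.parser')
for e in soup.find_all(class_='totalPrice'):
    # get_text() flattens the cell, then the regex grabs the digits after 'Total:'
    m = re.search(r'Total:\s*£(\d+)', e.get_text())
    if m:
        print(m.group(1))  # 145
```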