I am aware that using regex to parse HTML is technically incorrect, but I found this out too far into the project (it is for coursework in which I have already stated that I will use regex, so it is too late to go back on that now).
I'm trying to write a Python program that takes an HTML document, extracts the numbers that follow the card-count class and appends them to a list. The problem is that, rather than finding only the first match each time it runs, it seems to find the first one together with every other match identical to it, and so on. Here is some example HTML and my regex:
<span class="card-count">1</span>
<span class="card-name">Garruk Relentless</span>
</span>
<span class="row">
<span class="card-count">2</span>
<span class="card-name">Jace, the Mind Sculptor</span>
</span>
</div>
<div class="sorted-by-creature clearfix element">
<h5>Creature (16)</h5>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Noble Hierarch</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Stoneforge Mystic</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">True-Name Nemesis</span>
</span>
</div>
<div class="sorted-by-sorcery clearfix element">
<h5>Sorcery (3)</h5>
<span class="row">
<span class="card-count">3</span>
<span class="card-name">Ponder</span>
</span>
And the Python code is:
import re

card_number_list=[]
number_of_cards=int(0)
#find out how many of x cards there are in the deck
def card_number_regex(card_number_list):
    global number_of_cards
    global html
    number_in_set= re.search("card-count.*",html)
    get_rid= re.search("card-count.*",html).group(0)
    html=html.replace(get_rid,"")
    number_in_set=number_in_set.group(0)
    html=html.replace(number_in_set, "")
    number_in_set=number_in_set.replace('card-count">',"")
    number_in_set=number_in_set.replace('</span>', "")
    card_number_list.append(number_in_set)
    number_in_set_int=int(number_in_set)
    print(number_in_set_int)
    number_of_cards=(number_of_cards+number_in_set_int)
    return number_of_cards
while number_of_cards<75:
    card_number_regex(card_number_list)
The output I get when I run this is
1
2
4
3
While many seem to bash your choice of regex for this task, I would argue that it is not too difficult for your specific goal, so here is an actual answer to what you asked.
import re
a = html
b = re.findall('<span class="card-count">(.*?)</span>',a)
print(b[0])
That regex returns the contents of your card-count elements as a list, and taking the first index retrieves only the match you want.
Obviously this would work less well for other use cases, but since you only ever want the first occurrence in the HTML document, it does not matter that the list contains all of them, even those inside another div tag.
And as others have said, I don't see why you wouldn't use a regular HTML parser for this.
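As a side note on the symptom in the question: str.replace() removes every occurrence of the matched text, so once the first card-count">4</span> is stripped, all the other identical 4s disappear with it. If the goal is the full list plus a running total, a minimal sketch could look like the following. The sample_html string is a shortened, assumed stand-in for the real page, and the pattern assumes the counts are always plain digits.

```python
import re

# Shortened stand-in for the real document (an assumption for illustration).
sample_html = '''
<span class="row">
<span class="card-count">1</span>
<span class="card-name">Garruk Relentless</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Noble Hierarch</span>
</span>
'''

# findall returns every captured group in document order, duplicates included.
card_number_list = [
    int(n)
    for n in re.findall(r'<span class="card-count">(\d+)</span>', sample_html)
]
number_of_cards = sum(card_number_list)

print(card_number_list)   # [1, 4, 4]
print(number_of_cards)    # 9
```

Because nothing is removed from the document, repeated counts are kept and no loop over a shrinking string is needed.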
Related
I'm doing web scraping for the first time, using Scrapy to get some prices from a website. The problem is that I don't know how to get the price because it is inside the node's content attribute, and this is my first time with XPath, so I'm a little confused. Let me give an example:
<span class="list d-block">
<span class="value" content="1250">
<span class="sr-only">
Precio reducido de
</span>
<span class="price-original">
<span class="">
$1.250
</span>
(Normal)
</span>
<span class="sr-only">
(Oferta)
</span>
</span>
</span>
I need to get the content, in this case "1250", from @class="value".
Any help will be great!
As I understand it, you want to get the content attribute's value. Here is the XPath:
'//span[@class="<value>"]/@content'
On the XML that you posted, this XPath should work:
string(//span[@class='value']/@content)
Please find this tutorial for details on xpath.
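To try the attribute lookup without a Scrapy project, here is a small sketch that uses Python's standard-library xml.etree.ElementTree instead of Scrapy's selectors (a swapped-in tool, chosen only because it needs no installation). ElementTree supports only a subset of XPath, so the element is located with the [@class='value'] predicate and the attribute is then read with get(). The snippet is a condensed copy of the markup from the question.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the posted markup; it is well-formed, so an XML parser accepts it.
snippet = '''
<span class="list d-block">
  <span class="value" content="1250">
    <span class="sr-only">Precio reducido de</span>
    <span class="price-original"><span class="">$1.250</span> (Normal)</span>
  </span>
</span>
'''.strip()

root = ET.fromstring(snippet)

# Find the first descendant <span> whose class attribute is exactly "value".
value_span = root.find(".//span[@class='value']")
content = value_span.get("content") if value_span is not None else None

print(content)  # 1250
```

In a Scrapy callback the equivalent would typically be a response.xpath(...) call with the full expression shown above, ending in /@content.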
I need help selecting the price from some HTML code. I have already extracted the title of a movie and now need to extract the price. I tried using a lookbehind regular expression, but I get an error when I use \n.* because "A quantifier inside a lookbehind makes it non-fixed width". I need the first and the second price in the text.
Regex I have tried:
(?<=Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)
and:
Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)
But neither works.
Text:
<a class="blue_link" href="http://www.ebgames.com.au/Games/sjbeiub108723">Hello:</a>
<div class="hi">
<p>Including <a class="blue_link">
<p>Price$<data1>40.00</p>
Please help, and thank you :)
You can use this regex with the DOTALL flag:
import re
r = r"The Durrells: Series 2.+\$(\d+\.\d+).+\$(\d+\.\d+)"
text = ''' <a class="blue_link fn url" href="http://www.fishpond.com.au/Movies/Durrells-Series-2-Keeley-Hawes/5014138609450">The Durrells: Series 2</a>
<div class="by">
<p>Starring <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Keeley+Hawes">Keeley Hawes</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Milo+Parker">Milo Parker</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Josh+O%27Connor">Josh O'Connor</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Daisy+Waterstone">Daisy Wat...</a></p>
<div class="productSearch-metainfo">
DVD (UK), May 2017 </div>
</div>
</div></td>
<td align="right" style="vertical-align:top;"><div class="productSearch-price-container">
<span class="rrp-label">Elsewhere</span> <s>$30.53</s> <span class="productSpecialPrice"><b>$27.46</b></span> <div style="white-space:nowrap;"> <span class="you_save">Save 10%</span> </div><span class="free-shipping">with Free Shipping!</span></div>
'''
print(re.findall(r, text, re.DOTALL))
Output:
[('30.53', '27.46')]
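The lookbehind error arises because Python's re module only accepts fixed-width lookbehinds, so a quantifier such as .* cannot appear inside (?<=...). A capturing group avoids the problem. The sketch below is a variation on the same idea using non-greedy .*? so that each group stops at the nearest price. The sample text and its two prices are invented for illustration, loosely following the structure in the question.

```python
import re

# Invented sample with two prices after the title link (an assumption).
sample = '''<a class="blue_link" href="#">Hello</a>
<div class="hi">
<p>Including <a class="blue_link">someone</a></p>
</div>
<s>$40.00</s> <b>$35.50</b>
'''

# DOTALL lets "." cross line breaks; the groups capture the digits after each "$".
pattern = re.compile(r"Hello</a>.*?\$(\d+\.\d+).*?\$(\d+\.\d+)", re.DOTALL)

match = pattern.search(sample)
if match:
    first_price, second_price = match.groups()
    print(first_price, second_price)  # 40.00 35.50
```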
I have a problem. My aim is to parse the data up to a certain point and then stop parsing.
<span itemprop="address">
Some address
</span>
<i class="fa fa-signal">
</i>
...
</p>
</div>
</div>
<div class="search_pagination" id="pagination">
<ul class="pagination">
</ul>
</div>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="panel" itemscope="" itemtype="http://schema.org/WPSideBar">
<h2 class="heading_a" itemprop="name">
Top-10 today
</h2> #a lot of tags after that moment
I want to get all the values from <span itemprop="address"> (there are a lot of them before this point) up to the Top-10 today heading.
You can actually let BeautifulSoup parse only the tags you are interested in via SoupStrainer:
from bs4 import BeautifulSoup, SoupStrainer
only_addresses = SoupStrainer("span", itemprop="address")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_addresses)
If, however, you have some "addresses" before "Top-10 today" and some after it, and you are only interested in those that come before, you can write a custom search function:
def search_addresses(tag):
    return tag.name == "span" and tag.get("itemprop") == "address" and \
        tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)
addresses = soup.find_all(search_addresses)
It does not look trivial, but the idea is simple: for every "address" we use find_next() to check whether a "Top-10 today" heading exists after it.
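For comparison, the same "collect until a marker" idea can be sketched with the standard library's html.parser instead of BeautifulSoup. This is a different technique from the answer above, shown only as a dependency-free illustration. The AddressCollector class and the sample markup are hypothetical, and the sketch assumes address spans contain no nested span elements.

```python
from html.parser import HTMLParser


class AddressCollector(HTMLParser):
    """Collects text of <span itemprop="address"> until stop_text is seen."""

    def __init__(self, stop_text):
        super().__init__()
        self.stop_text = stop_text
        self.addresses = []
        self._in_address = False
        self._done = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._done:
            return  # ignore everything after the marker
        if tag == "span" and dict(attrs).get("itemprop") == "address":
            self._in_address = True
            self._buffer = []

    def handle_data(self, data):
        if self._done:
            return
        if self.stop_text in data:
            self._done = True
        elif self._in_address:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._in_address and tag == "span":
            self.addresses.append("".join(self._buffer).strip())
            self._in_address = False


# Hypothetical sample: two addresses before the heading, one after it.
sample = """
<span itemprop="address"> First address </span>
<span itemprop="address"> Second address </span>
<h2 class="heading_a" itemprop="name"> Top-10 today </h2>
<span itemprop="address"> Sidebar address </span>
"""

collector = AddressCollector("Top-10 today")
collector.feed(sample)
print(collector.addresses)  # ['First address', 'Second address']
```

The parser still reads the rest of the input, but every callback returns early once the marker text has been seen, so nothing after the heading is collected.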
I am trying to parse an awful HTML page with BeautifulSoup to retrieve a few pieces of information. The code below:
import bs4
with open("smartradio.html") as f:
    html = f.read()
soup = bs4.BeautifulSoup(html)
x = soup.find_all("div", class_="ue-alarm-status", playerid="43733")
print(x)
extracts the fragments I would like to analyze further:
[<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>, <div alarmid="ea510709" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 2: </div>
<div>allumé</div>
<div>7:30</div>
</div>
<div>
<div class="ue-alarm-dow">Sa </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>]
I am interested in retrieving:
the hour (line 5 and 14)
the string (days in French) under <div class="ue-alarm-dow">
I believe that for the days it is enough to repeat a find() or find_all(). I am mentioning that because while it grabs the right information, I am not sure that this is the right way to parse the file with BeautifulSoup (but at least it works):
for y in x:
    z = y.find("div", class_="ue-alarm-dow")
    print(z.text)
# output:
# Lu, Ma, Me, Je, Ve
# Sa
I do not know how to get to the hour, though. Is there a way to navigate the tree by path (in the sense that I know that the hour is under the second <div>, three <div> deep)? Or should I do it differently?
You can also rely on the allumé text and get the next sibling div element:
y.find('div', text=u'allumé').find_next_sibling('div').text
or, in a similar manner, relying on the class of the previous div:
y.find('div', class_='ue-alarm-edit').find_next_siblings('div')[1].text
or, using regular expressions:
y.find('div', text=re.compile(r'\d+:\d+')).text
or, just get the div by index:
y.find_all('div')[4].text
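The regular-expression variant can also be tried on the raw fragment without BeautifulSoup, which is handy for checking the pattern in isolation. This is only a sketch: it assumes the time always sits alone in a bare <div> and that the days sit in the ue-alarm-dow div, as in the first fragment from the question, which is copied below.

```python
import re

# First alarm fragment from the question, used as test input.
fragment = '''<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">
<div>
<div class="ue-alarm-edit ue-link">Réveil 1: </div>
<div>allumé</div>
<div>7:00</div>
</div>
<div>
<div class="ue-alarm-dow">Lu, Ma, Me, Je, Ve </div>
<div class="ue-alarm-delete ue-link">Supprimer</div>
</div>
</div>'''

# A bare <div> whose only content looks like H:MM or HH:MM.
hours = re.findall(r"<div>\s*(\d{1,2}:\d{2})\s*</div>", fragment)

# The days string, with trailing whitespace stripped.
days = [d.strip() for d in re.findall(r'<div class="ue-alarm-dow">(.*?)</div>', fragment)]

print(list(zip(hours, days)))  # [('7:00', 'Lu, Ma, Me, Je, Ve')]
```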
similarlist = res.find_all_next("div", class_="result-wrapper")
for item in similarlist:
    print(item)
This returns:
<div class="result-wrapper">
<div class="row-fluid result-row">
<div class="span6 result-left">
<p>
<a class="tooltipLink warn-cs" data-original-title="Listen" href="..." rel="tooltip"><i class="..."></i></a>
<a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
</p>
</div>
<div class="span6 result-right row-fluid">
<span class="span9">
<a class="muted-link" href="/dictionary/english-german/do-a-poo">to do a poo</a>, <a class="muted-link" href="/dictionary/english-german/pooh">to pooh</a>
</span>
</div>
</div>
</div>
When I choose to print item.get_text() instead, I get
abgeneigt machen
to disincline
abhängig machen
2137
to predicate
Absenker machen
to layer
So basically a lot of new lines between the list items that I don't need. Is this because of the <p> tags? How do I get rid of them?
Yes, between tags the HTML contains whitespace (including newlines) too.
You can easily collapse all multi-line whitespace with a regular expression:
import re
re.sub(r'\n\s*\n', r'\n\n', item.get_text().strip(), flags=re.M)
This removes any whitespace (newlines, spaces, tabs, etc.) between two newlines.
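To see the effect of the substitution, here is a small sketch on an invented string that imitates the get_text() output from the question. The last lines show an alternative that drops blank lines entirely, which may be closer to what is wanted if each entry should sit on its own line.

```python
import re

# Invented stand-in for item.get_text(): entries separated by runs of blank lines.
raw_text = "\n\n\nabgeneigt machen\n\n\n   \n\nto disincline\n\n"

# Collapse each run of blank or whitespace-only lines into a single blank line.
collapsed = re.sub(r'\n\s*\n', r'\n\n', raw_text.strip(), flags=re.M)
print(repr(collapsed))  # 'abgeneigt machen\n\nto disincline'

# Alternative: keep only the non-empty lines.
lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
print(lines)  # ['abgeneigt machen', 'to disincline']
```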
You can use the strip() function in Python:
item.get_text().strip()