Regular expression to select text while skipping a few lines (Python)

I need help extracting a price from some HTML. I have already extracted the title of a movie, and now I need to extract the price. I tried a lookbehind regular expression, but when I use \n.* I get the error "A quantifier inside a lookbehind makes it non-fixed width". I need the first and the second price in the text.
Regex I have tried:
(?<=Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)
and:
Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)
But neither works.
Text:
<a class="blue_link" href="http://www.ebgames.com.au/Games/sjbeiub108723">Hello:</a>
<div class="hi">
<p>Including <a class="blue_link">
<p>Price$<data1>40.00</p>
Please help, and thank you :)

You can use this regex with the DOTALL flag:
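import re
# Raw string so \$ and \d reach the regex engine unchanged; .+? is non-greedy so the first two prices after the title are captured
r = r"The Durrells: Series 2.+?\$(\d+\.\d+).+?\$(\d+\.\d+)"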
text = ''' <a class="blue_link fn url" href="http://www.fishpond.com.au/Movies/Durrells-Series-2-Keeley-Hawes/5014138609450">The Durrells: Series 2</a>
<div class="by">
<p>Starring <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Keeley+Hawes">Keeley Hawes</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Milo+Parker">Milo Parker</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Josh+O%27Connor">Josh O'Connor</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Daisy+Waterstone">Daisy Wat...</a></p>
<div class="productSearch-metainfo">
DVD (UK), May 2017 </div>
</div>
</div></td>
<td align="right" style="vertical-align:top;"><div class="productSearch-price-container">
<span class="rrp-label">Elsewhere</span> <s>$30.53</s> <span class="productSpecialPrice"><b>$27.46</b></span> <div style="white-space:nowrap;"> <span class="you_save">Save 10%</span> </div><span class="free-shipping">with Free Shipping!</span></div>
'''
print(re.findall(r, text, re.DOTALL))
Output:
[('30.53', '27.46')]
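The lookbehind attempt in the question fails because Python's re module only accepts fixed-width lookbehind patterns; a capture group after the title avoids the lookbehind entirely. Below is a minimal sketch of that idea wrapped in a helper. The function name first_two_prices and the shortened sample HTML are assumptions made for illustration, not part of the original answer.

```python
import re

def first_two_prices(html, title):
    """Return the first two $ prices that follow `title`, or None (hypothetical helper)."""
    # re.escape protects any regex metacharacters in the title.
    # .*? is non-greedy, so with DOTALL it crosses lines but stops at the nearest price.
    pattern = re.escape(title) + r".*?\$(\d+\.\d+).*?\$(\d+\.\d+)"
    match = re.search(pattern, html, re.DOTALL)
    return match.groups() if match else None

# Shortened, made-up sample in the shape of the page from the answer
sample = '''<a class="blue_link" href="#">The Durrells: Series 2</a>
<div class="by">
<p>Starring <a class="blue_link">Keeley Hawes</a></p>
</div>
<s>$30.53</s> <b>$27.46</b> <i>$5.00</i>
'''

prices = first_two_prices(sample, "The Durrells: Series 2")
print(prices)
```

Because both gaps are non-greedy, a third price later in the text is ignored, and a title that does not occur yields None instead of raising.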

Related

How can I parse an HTML file with Python and Beautiful Soup to get a tag nested under another tag?

My HTML file contains the same tag (<span class="fna">) multiple times. To tell the occurrences apart I need to look at the enclosing tag: I want the <span class="fna"> that sits under <span id="field-value-reporter">.
In Beautiful Soup I can only filter on the tag itself, e.g. soup.find_all("span", {"class": "fna"}). This extracts the data for every <span class="fna">, but I only need the one contained in <span id="field-value-reporter">.
Example html tags:
<div class="value">
<span id="field-value-reporter">
<div class="vcard vcard_287422" >
<a class="email " href="/user_profile?user_id=287422" >
<span class="fna">Chris Pearce (:cpearce)
</span>
</a>
</div>
</span>
</div>
<div class="value">
<span id="field-value-triage_owner">
<div class="vcard vcard_27780" >
<a class="email " href="/user_profile?user_id=27780">
<span class="fna">Justin Dolske [:Dolske]
</span>
</a>
</div>
</span>
</div>
Use soup.select:
soup.select('#field-value-reporter a > span') # select every span that is a direct child of an a tag inside the element whose id is field-value-reporter
>>> [<span class="fna">Chris Pearce (:cpearce)</span>]
soup.select takes a CSS selector, which is, in my opinion, much more capable than the default element search that comes with BeautifulSoup. Note that the results are always returned as a list containing every element that matches.
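As a rough end-to-end sketch of the selector approach, the snippet below parses a trimmed copy of the question's markup with the built-in "html.parser" backend. The variable names and the span.fna selector are illustrative choices; select_one is used because only a single element is expected.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question
html_doc = """
<div class="value">
<span id="field-value-reporter">
<div class="vcard vcard_287422">
<a class="email" href="/user_profile?user_id=287422">
<span class="fna">Chris Pearce (:cpearce)
</span>
</a>
</div>
</span>
</div>
<div class="value">
<span id="field-value-triage_owner">
<div class="vcard vcard_27780">
<a class="email" href="/user_profile?user_id=27780">
<span class="fna">Justin Dolske [:Dolske]
</span>
</a>
</div>
</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Descendant selector: any span.fna anywhere below #field-value-reporter
reporter = soup.select_one("#field-value-reporter span.fna")
reporter_name = reporter.get_text(strip=True) if reporter else None

# For comparison: without the id prefix, both spans match
all_names = [span.get_text(strip=True) for span in soup.select("span.fna")]

print(reporter_name)
print(all_names)
```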

How to make this regex only find the first match

I am aware that using regex to parse HTML is technically incorrect, but I found this out too far into the project (it is coursework for which I have already stated that I will use regex, so it is too late to go back on that now).
I'm trying to write a Python program that takes an HTML document, extracts the numbers that follow the card-count class, and appends them to a list. The problem is that, rather than handling only the first match each time it runs, it seems to consume the first match and every other match identical to it, and so on. Here is some example HTML and my regex:
<span class="card-count">1</span>
<span class="card-name">Garruk Relentless</span>
</span>
<span class="row">
<span class="card-count">2</span>
<span class="card-name">Jace, the Mind Sculptor</span>
</span>
</div>
<div class="sorted-by-creature clearfix element">
<h5>Creature (16)</h5>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Noble Hierarch</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Stoneforge Mystic</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">True-Name Nemesis</span>
</span>
</div>
<div class="sorted-by-sorcery clearfix element">
<h5>Sorcery (3)</h5>
<span class="row">
<span class="card-count">3</span>
<span class="card-name">Ponder</span>
</span>
And the python code is:
card_number_list=[]
number_of_cards=int(0)
#find out how many of x cards there are in the deck
def card_number_regex(card_number_list):
    global number_of_cards
    global html
    number_in_set= re.search("card-count.*",html)
    get_rid= re.search("card-count.*",html).group(0)
    html=html.replace(get_rid,"")
    number_in_set=number_in_set.group(0)
    html=html.replace(number_in_set, "")
    number_in_set=number_in_set.replace('card-count">',"")
    number_in_set=number_in_set.replace('</span>', "")
    card_number_list.append(number_in_set)
    number_in_set_int=int(number_in_set)
    print(number_in_set_int)
    number_of_cards=(number_of_cards+number_in_set_int)
    return number_of_cards
while number_of_cards<75:
    card_number_regex(card_number_list)
The output I get when I run this is
1
2
4
3
While many seem to rather bash on your choice to use regex for this task, I would argue that it does not seem too difficult for your specific goal and will provide an actual answer for what you asked for.
import re
a = html
b = re.findall('<span class="card-count">(.*?)</span>',a)
print(b[0])
That regex returns the contents of all your card-count elements as a list, and index 0 retrieves only the first match.
Obviously this would work less well for other use cases, but since you only ever want the first occurrence in the HTML document, it does not matter that the list contains all of them, even those inside another div tag.
And as others have said, I don't see why you wouldn't use a regular HTML parser for this.
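The repeated counts go missing in the question's code because html.replace(match, "") deletes every identical occurrence of the matched text, not just the first one. A minimal sketch that leaves the HTML untouched is shown below: re.search gives only the first match, while re.findall gives every count in document order. The shortened sample and the names COUNT_RE, first_count, and counts are illustrative assumptions.

```python
import re

# Shortened sample in the shape of the question's HTML
html = '''<span class="row">
<span class="card-count">1</span>
<span class="card-name">Garruk Relentless</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Noble Hierarch</span>
</span>'''

COUNT_RE = re.compile(r'<span class="card-count">(\d+)</span>')

# re.search stops at the first match in the document
first = COUNT_RE.search(html)
first_count = int(first.group(1)) if first else None

# re.findall returns every captured count, duplicates included
counts = [int(n) for n in COUNT_RE.findall(html)]
total = sum(counts)

print(first_count, counts, total)
```

If the original remove-as-you-go loop has to stay, str.replace accepts a third argument limiting the number of replacements, so html.replace(get_rid, "", 1) would remove only the first occurrence.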

Remove HTML after some point in Beautiful Soup

I have a problem. My aim is to parse the data up to a certain point and then stop parsing.
<span itemprop="address">
Some address
</span>
<i class="fa fa-signal">
</i>
...
</p>
</div>
</div>
<div class="search_pagination" id="pagination">
<ul class="pagination">
</ul>
</div>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="panel" itemscope="" itemtype="http://schema.org/WPSideBar">
<h2 class="heading_a" itemprop="name">
Top-10 today
</h2> #a lot of tags after that moment
I want to get all the values from <span itemprop="address"> (there are many of them earlier in the page) up to the point where Top-10 today appears.
You can actually let BeautifulSoup parse only the tags you are interested in via SoupStrainer:
from bs4 import BeautifulSoup, SoupStrainer
only_addresses = SoupStrainer("span", itemprop="address")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_addresses)
If, however, some "addresses" come before "Top-10 today" and some come after it, and you only want the ones before it, you can write a custom search function:
def search_addresses(tag):
    return tag.name == "span" and tag.get("itemprop") == "address" and \
        tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)
addresses = soup.find_all(search_addresses)
It does not look trivial, but the idea is simple - we are using find_next() for every "address" to check if "Top-10 today" heading exists after it.

Find a paragraph and find a string inside this paragraph with REGEX

My HTML page contains some lines like this:
<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldnt match</p>
some text
<a class ="b"> some text </a>
</div>
I want to extract the text inside <p class="match">, but only when it is inside a div that also contains <a class="a">.
What I've done so far is below (I first find the div blocks that contain <a class="a"> and then iterate over the results to find the sentence inside <p class="match">):
import re
file_to_r = open("a")
regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)
regex_match = re.compile(r'<p class="match">(.+)</p>')
for m in regex_div.findall(file_to_r.read()):
    print(regex_match.findall(m))
But I wonder whether there is another (still efficient) way to do it in one step?
Use an HTML Parser, like BeautifulSoup.
Find the a tag with a class and then find previous sibling - p tag with class match:
from bs4 import BeautifulSoup
data = """
<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldn't match</p>
some text
<a class ="b"> some text </a>
</div>
"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning
a = soup.find('a', class_='a')
print(a.find_previous_sibling('p', class_='match').text)
Prints:
this sentence should match
Also see why you should avoid using regex for parsing HTML here:
RegEx match open tags except XHTML self-contained tags
You should use an HTML parser, but if you still want a regex you can use something like this:
<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>
<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))
You can use this.
See demo.
http://regex101.com/r/lK9iD2/7
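For comparison, here is a hedged single-pass sketch using only the standard re module. It is a variation on the patterns above rather than either answerer's exact regex: [^<]* keeps the captured text inside the paragraph, and the (?:(?!</div>).)*? part prevents the match from running past the closing tag of the current div. It assumes the simple, non-nested layout shown in the question.

```python
import re

data = """
<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldn't match</p>
some text
<a class ="b"> some text </a>
</div>
"""

pattern = re.compile(
    r'<div>\s*<p class="match">([^<]*)</p>'  # capture the paragraph text
    r'(?:(?!</div>).)*?'                      # skip ahead, but never past </div>
    r'<a class\s*=\s*"a"',                    # the div must contain <a class="a">
    re.DOTALL,
)

matches = [text.strip() for text in pattern.findall(data)]
print(matches)
```

A pattern like this breaks as soon as divs are nested or attributes are reordered, which is the practical argument for the parser-based answer.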

BeautifulSoup - getting rid of paragraph whitespace/line breaks

similarlist = res.find_all_next("div", class_="result-wrapper")
for item in similarlist:
    print(item)
This returns:
<div class="result-wrapper">
<div class="row-fluid result-row">
<div class="span6 result-left">
<p>
<a class="tooltipLink warn-cs" data-original-title="Listen" href="..." rel="tooltip"><i class="..."></i></a>
<a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
</p>
</div>
<div class="span6 result-right row-fluid">
<span class="span9">
<a class="muted-link" href="/dictionary/english-german/do-a-poo">to do a poo</a>, <a class="muted-link" href="/dictionary/english-german/pooh">to pooh</a>
</span>
</div>
</div>
</div>
When I choose to print item.get_text() instead, I get
abgeneigt machen
to disincline
abhängig machen
2137
to predicate
Absenker machen
to layer
So basically a lot of new lines between the list items that I don't need. Is this because of the <p> tags? How do I get rid of them?
Yes, between tags the HTML contains whitespace (including newlines) too.
You can easily collapse all multi-line whitespace with a regular expression:
import re
re.sub(r'\n\s*\n', r'\n\n', item.get_text().strip(), flags=re.M)
This collapses any run of whitespace (newlines, spaces, tabs, etc.) between two newlines, leaving a single blank line.
You can also use the strip() function in Python:
item.get_text().strip()
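If no blank lines are wanted at all, one option is to drop empty lines outright. The sketch below works on a plain string standing in for the result of item.get_text(); the sample text and variable names are made up for illustration, and two equivalent approaches are shown.

```python
import re

# Stand-in for item.get_text(): entries separated by blank and whitespace-only lines
raw = "\n\n  abgeneigt machen\n\n\n   \n to disincline \n\n\n"

# Approach 1: strip each line and keep only the non-empty ones
lines = (line.strip() for line in raw.splitlines())
cleaned = "\n".join(line for line in lines if line)

# Approach 2: replace every whitespace run that contains a newline with one newline
collapsed = re.sub(r'\s*\n\s*', '\n', raw.strip())

print(cleaned)
```

Both variants keep the spaces inside a line (such as the one in "abgeneigt machen") and only remove whitespace around line breaks.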
