BeautifulSoup - getting rid of paragraph whitespace/line breaks - python

similarlist = res.find_all_next("div", class_="result-wrapper")
for item in similarlist:
    print(item)
This returns:
<div class="result-wrapper">
<div class="row-fluid result-row">
<div class="span6 result-left">
<p>
<a class="tooltipLink warn-cs" data-original-title="Listen" href="..." rel="tooltip"><i class="..."></i></a>
<a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
</p>
</div>
<div class="span6 result-right row-fluid">
<span class="span9">
<a class="muted-link" href="/dictionary/english-german/do-a-poo">to do a poo</a>, <a class="muted-link" href="/dictionary/english-german/pooh">to pooh</a>
</span>
</div>
</div>
</div>
When I choose to print item.get_text() instead, I get
abgeneigt machen
to disincline
abhängig machen
2137
to predicate
Absenker machen
to layer
So basically a lot of new lines between the list items that I don't need. Is this because of the <p> tags? How do I get rid of them?

Yes, between tags the HTML contains whitespace (including newlines) too.
You can easily collapse all multi-line whitespace with a regular expression:
import re
re.sub(r'\n\s*\n', r'\n\n', item.get_text().strip(), flags=re.M)
This collapses any run of whitespace (newlines, spaces, tabs, etc.) between two newlines into a single blank line.
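As a rough sketch of that substitution on a plain string: the sample text and the collapse_blank_lines helper below are made up for illustration and stand in for item.get_text(). The re.M flag is left out because the pattern has no ^ or $ anchors for it to affect. Passing a single newline as the replacement is a variation that drops the blank lines entirely.

```python
import re

# Hypothetical stand-in for item.get_text(): entries separated by
# runs of newlines, spaces and tabs.
raw = "\n\nabgeneigt machen\n\n \n\t\nto disincline\n\n\n\nabhängig machen\n"

def collapse_blank_lines(text, replacement="\n\n"):
    """Replace every whitespace run that spans two or more newlines."""
    return re.sub(r"\n\s*\n", replacement, text.strip())

one_blank_line = collapse_blank_lines(raw)        # same replacement as in the answer
no_blank_lines = collapse_blank_lines(raw, "\n")  # variation: drop the blank lines too

print(no_blank_lines)
```

A single newline between two entries is left alone in both cases, because the pattern needs at least two newlines to match.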

You can use the strip() function in Python:
item.get_text().strip()
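Note that strip() only trims the two ends of the string, so blank lines in the middle survive. A minimal sketch, using a made-up sample string in place of item.get_text(), that strips each line and drops the empty ones:

```python
# Hypothetical stand-in for item.get_text().
text = "\n  Aa machen \n\n\n   to do a poo, to pooh  \n\n"

# strip() alone: leading/trailing whitespace is gone, inner blank lines remain.
ends_only = text.strip()

# Strip every line and keep only the non-empty ones.
cleaned_lines = [line.strip() for line in text.splitlines() if line.strip()]
cleaned = "\n".join(cleaned_lines)

print(cleaned)
```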

Related

Python3.7: RegEx for string between strings on multiple lines?

I would like to find 30,850 in:
<div class='user-information__achievements-heading' data-test-points-title>
Points
</div>
<div class='user-information__achievements-data' data-test-points-count>
30,850
</div>
</div>
with:
^(?!<div class='user-information__achievements-data' data-test-points-count>
|<.div>)(.*)$
(returns nothing)
How come ^(?!START\-OF\-FIELDS|END\-OF\-FIELDS)(.*)$ does work for:
START-OF-FIELDS
<div>
Line A
END-OF-FIELDS
(returns <div>)?
You can also search the text with bs4:
from bs4 import BeautifulSoup
tx = """
<div class='user-information__achievements-heading' data-test-points-title>
Points
</div>
<div class='user-information__achievements-data' data-test-points-count>
30,850
</div>
</div>
"""
bs = BeautifulSoup(tx,"lxml")
result = bs.find("div",{"class":"user-information__achievements-data"}).text
print(result.strip()) # 30,850
Besides, I totally agree that you should never parse HTML with re (and it's really fun to read, btw). But if you only have this piece of text and need a quick re.search, a simple r'\d+,\d+' would do:
import re
s = '''<div class='user-information__achievements-heading' data-test-points-title>
Points
</div>
<div class='user-information__achievements-data' data-test-points-count>
30,850
</div>
</div>'''
re.search(r'\d+,\d+', s)
<re.Match object; span=(179, 185), match='30,850'>
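No need for regex, just do:
lines = s.splitlines()
# Must equal the opening-tag line of s exactly, including any leading whitespace.
i = "<div class='user-information__achievements-data' data-test-points-count>"
# The value is on the line right after the opening tag.
print(lines[lines.index(i) + 1].lstrip())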
Output:
30,850
You want re.DOTALL because by default . doesn't match newlines (line breaks).
re.compile(YOUR_REGEX, flags=re.S)
You can also prepend your regex with (?s) for the same effect.
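A short sketch of the difference, assuming a trimmed copy of the snippet from the question; the pattern and variable names are illustrative only:

```python
import re

html = """<div class='user-information__achievements-data' data-test-points-count>
30,850
</div>"""

pattern = r"data-test-points-count>(.*?)</div>"

# Default behaviour: "." cannot cross the newline after the opening tag.
without_flag = re.search(pattern, html)

# With re.S (alias of re.DOTALL) "." also matches newlines.
with_flag = re.search(pattern, html, flags=re.S)

# The inline form (?s) at the start of the pattern has the same effect.
with_inline = re.search("(?s)" + pattern, html)

print(without_flag)
print(with_flag.group(1).strip())
```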

Regular expression selecting a text with skipping a few lines

I need help selecting the price from some HTML code. I have already extracted the title of a movie, and I now need to extract the price. I have tried using a lookbehind regular expression, but I get an error when I use \n.* because "A quantifier inside a lookbehind makes it non-fixed width". I need the first and the second price in the text.
Regex I have tried:
(?<=Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)
and:
Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)
But neither works.
Text:
<a class="blue_link" href="http://www.ebgames.com.au/Games/sjbeiub108723">Hello:</a>
<div class="hi">
<p>Including <a class="blue_link">
<p>Price$<data1>40.00</p>
Pls help and thank you :)
You can use this regex with the DOTALL flag:
import re
r = r"The Durrells: Series 2.+\$(\d+\.\d+).+\$(\d+\.\d+)"
text = ''' <a class="blue_link fn url" href="http://www.fishpond.com.au/Movies/Durrells-Series-2-Keeley-Hawes/5014138609450">The Durrells: Series 2</a>
<div class="by">
<p>Starring <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Keeley+Hawes">Keeley Hawes</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Milo+Parker">Milo Parker</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Josh+O%27Connor">Josh O'Connor</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Daisy+Waterstone">Daisy Wat...</a></p>
<div class="productSearch-metainfo">
DVD (UK), May 2017 </div>
</div>
</div></td>
<td align="right" style="vertical-align:top;"><div class="productSearch-price-container">
<span class="rrp-label">Elsewhere</span> <s>$30.53</s> <span class="productSpecialPrice"><b>$27.46</b></span> <div style="white-space:nowrap;"> <span class="you_save">Save 10%</span> </div><span class="free-shipping">with Free Shipping!</span></div>
'''
print(re.findall(r, text, re.DOTALL))
Output:
[('30.53', '27.46')]
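Python's re module also rejects a lookbehind that contains a quantifier, so the usual workaround is to match the leading text normally and capture only the part you need. A minimal sketch, assuming a shortened, made-up snippet in text rather than the real page:

```python
import re

# Hypothetical, shortened markup with two prices after the title link.
text = """<a class="blue_link">Hello</a>
<div class="hi">
<s>$30.53</s> <b>$27.46</b>
"""

# A variable-width lookbehind does not compile in Python's re module.
try:
    re.compile(r"(?<=Hello</a>.*)\$")
    lookbehind_compiles = True
except re.error:
    lookbehind_compiles = False

# Instead: consume the title, then capture the two prices in groups.
# re.DOTALL lets ".*?" run across the line breaks in between.
match = re.search(r"Hello</a>.*?\$(\d+\.\d+).*?\$(\d+\.\d+)", text, flags=re.DOTALL)
prices = match.groups() if match else ()

print(lookbehind_compiles, prices)
```

The lazy .*? stops at the first dollar amount after the title instead of relying on backtracking from the end of the text.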

how to make this regex only find the first match

I am aware that using regex to parse HTML is technically incorrect, but I found this out too far into the project (it's for some coursework where I have already stated that I am going to use regex, so it's too late to go back on that now).
I'm trying to write a Python program that takes an HTML document, strips out the numbers contained in the card-count class, and appends them to a list. The problem is that rather than finding only the first match each time it runs, it seems to find the first one and all the others that are identical to it, and so on. Here is some example HTML and my regex:
<span class="card-count">1</span>
<span class="card-name">Garruk Relentless</span>
</span>
<span class="row">
<span class="card-count">2</span>
<span class="card-name">Jace, the Mind Sculptor</span>
</span>
</div>
<div class="sorted-by-creature clearfix element">
<h5>Creature (16)</h5>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Noble Hierarch</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Stoneforge Mystic</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">True-Name Nemesis</span>
</span>
</div>
<div class="sorted-by-sorcery clearfix element">
<h5>Sorcery (3)</h5>
<span class="row">
<span class="card-count">3</span>
<span class="card-name">Ponder</span>
</span>
And the python code is:
card_number_list=[]
number_of_cards=int(0)
#find out how many of x cards there are in the deck
def card_number_regex(card_number_list):
    global number_of_cards
    global html
    number_in_set= re.search("card-count.*",html)
    get_rid= re.search("card-count.*",html).group(0)
    html=html.replace(get_rid,"")
    number_in_set=number_in_set.group(0)
    html=html.replace(number_in_set, "")
    number_in_set=number_in_set.replace('card-count">',"")
    number_in_set=number_in_set.replace('</span>', "")
    card_number_list.append(number_in_set)
    number_in_set_int=int(number_in_set)
    print(number_in_set_int)
    number_of_cards=(number_of_cards+number_in_set_int)
    return number_of_cards
while number_of_cards<75:
    card_number_regex(card_number_list)
The output I get when I run this is
1
2
4
3
While many seem to rather bash on your choice to use regex for this task, I would argue that it does not seem too difficult for your specific goal and will provide an actual answer for what you asked for.
import re
a = html
b = re.findall('<span class="card-count">(.*?)</span>',a)
print(b[0])
That regex should give the contents of your card-count classes in a list, and by using the first index you retrieve only the match you want your regex to find.
Obviously this would work less well for other use cases, but as you seem to know that you only ever want the first occurrence in the HTML document, it does not matter that the list contains all of them, even when they are in another div tag, etc.
And as others have said, I don't see why you wouldn't use a regular html parser for this.
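One likely reason the original loop prints 1, 2, 4, 3 is that str.replace removes every occurrence of the matched text, so all identical card-count lines disappear at once. A small sketch, assuming a shortened, made-up copy of the deck markup in html, that uses re.search when only the first match is wanted and re.findall for the whole list:

```python
import re

# Hypothetical, shortened copy of the deck markup.
html = """<span class="row">
<span class="card-count">1</span>
<span class="card-name">Garruk Relentless</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Noble Hierarch</span>
</span>"""

CARD_COUNT = re.compile(r'<span class="card-count">(\d+)</span>')

# re.search stops at the first match.
first_match = CARD_COUNT.search(html)
first_count = int(first_match.group(1)) if first_match else None

# re.findall returns every count, duplicates included, without editing html.
card_number_list = [int(n) for n in CARD_COUNT.findall(html)]
number_of_cards = sum(card_number_list)

print(first_count, card_number_list, number_of_cards)
```

Because nothing is removed from html, repeated counts such as the two 4s are both kept.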

Edit text from html with BeautifulSoup

I'm currently trying to extract the html elements which have a text on their own and wrap them with a special tag.
For example, my HTML looks like this:
<ul class="myBodyText">
<li class="fields">
This text still has children
<b>
Simple Text
</b>
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello <br/>
World
</li>
</ul>
</div>
</li>
</ul>
I'm trying to wrap tags only around the text, so I can parse it further at a later time, so I tried to make it look like this:
<ul class="bodytextAttributes">
<li class="field">
[Editable]This text still has children[/Editable]
<b>
[Editable]Simple Text[/Editable]
</b>
<div class="sectionFields">
<ul class="section">
<li style="padding-left: 10px;">
[Editable]Hello [/Editable]<br/>
[Editable]World[/Editable]
</li>
</ul>
</div>
</li>
</ul>
My script so far, which iterates just fine, but the placement of the edit placeholders isn't working and I have currently no idea how I can check this:
def parseSection(node):
    b = str(node)
    changes = set()
    tag_start, tag_end = extractTags(b)
    # index 0 is the element itself
    for cell in node.findChildren()[1:]:
        if cell.findChildren():
            cell = parseSection(cell)
        else:
            # safe to extract with regular expressions, only 1 standardized tag created by BeautifulSoup
            subtag_start, subtag_end = extractTags(str(cell))
            changes.add((str(cell), "[/EditableText]{0}[EditableText]{1}[/EditableText]{2}[EditableText]".format(subtag_start, str(cell.text), subtag_end)))
    text = extractText(b)
    for change in changes:
        text = text.replace(change[0], change[1])
    return bs("{0}[EditableText]{1}[/EditableText]{2}".format(tag_start, text, tag_end), "html.parser")
The script generates following:
<ul class="myBodyText">
[EditableText]
<li class="fields">
This text still has children
[/EditableText]
<b>
[EditableText]
Simple Text
[/EditableText]
</b>
[EditableText]
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello [/EditableText]
<br/>
[EditableText][/EditableText]
<br/>
[EditableText]
World
</li>
</ul>
</div>
</li>
[/EditableText]
</ul>
How can I check this and fix it? I'm grateful for every possible answer.
There is a built-in replace_with() method that fits the use case nicely:
soup = BeautifulSoup(data, "html.parser")
# Visit every text node that is not just whitespace and wrap it in place.
for node in soup.find_all(text=lambda x: x.strip()):
    node.replace_with("[Editable]{}[/Editable]".format(node))
print(soup.prettify())

Find a paragraph and find a string inside this paragraph with REGEX

I have some lines like this inside an HTML page:
<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldnt match</p>
some text
<a class ="b"> some text </a>
</div>
I want to extract the lines inside the <p class="match">, but only when they are inside a div containing <a class="a">.
What I've done so far is below (I first find the div blocks with <a class="a"> inside, then iterate over the result to find the sentence inside a <p class="match">):
import re
file_to_r = open("a")
regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)
regex_match = re.compile(r'<p class="match">(.+)</p>')
for m in regex_div.findall(file_to_r.read()):
    print(regex_match.findall(m))
but I wonder if there is another (still efficient) way to do it in one pass?
Use an HTML Parser, like BeautifulSoup.
Find the a tag with class a, and then find its previous sibling, the p tag with class match:
from bs4 import BeautifulSoup
data = """
<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldn't match</p>
some text
<a class ="b"> some text </a>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
a = soup.find('a', class_='a')
print(a.find_previous_sibling('p', class_='match').text)
Prints:
this sentence should match
Also see why you should avoid using regex for parsing HTML here:
RegEx match open tags except XHTML self-contained tags
You should use an HTML parser, but if you still want a regex you can use something like this:
<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>
<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))
You can use this.
See demo.
http://regex101.com/r/lK9iD2/7
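If you stay with regex, a two-step variant can be easier to reason about than one large pattern: split the page into div blocks with a lazy match, then test each block separately. This is only a sketch under the assumption that the div elements are not nested and carry no attributes; the compiled pattern names are made up for illustration.

```python
import re

data = """
<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldnt match</p>
some text
<a class ="b"> some text </a>
</div>
"""

# Lazy ".*?" keeps each match inside a single <div>...</div> block.
DIV_BLOCK = re.compile(r"<div>(.*?)</div>", re.DOTALL)
# Tolerates optional spaces around "=", as in the second block's link.
LINK_A = re.compile(r'<a\s+class\s*=\s*"a"')
PARAGRAPH = re.compile(r'<p class="match">(.*?)</p>', re.DOTALL)

sentences = []
for block in DIV_BLOCK.findall(data):
    if LINK_A.search(block):
        sentences.extend(p.strip() for p in PARAGRAPH.findall(block))

print(sentences)
```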
