Excluding unwanted results of findAll using BeautifulSoup - python

Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:
<p class="review_comment">
So, using the simple code as follows,
content = page.read()
soup = BeautifulSoup(content)
results = soup.find_all("p", "review_comment")
I am happily parsing the text that is living here:
<p class="review_comment">
This place is terrible!</p>
The bad news is that every 30 or so times the soup.find_all gets a match, it also matches and grabs something that I really don't want, which is a user's old review that they've since updated:
<p class="review_comment">
It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more »</a></p>
In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas.
I've been trying to alter the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more »</a> link.
I've drowned in regular-expression matching limbo with no success.
I can't seem to take advantage of the class="show-archived" attribute.
Any ideas would be greatly appreciated. Thanks in advance.

Is this what you are seeking?
for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
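For reference, here is a minimal self-contained sketch of the same filter. The sample markup is an assumption pieced together from the question (in particular, that the "Read more »" link sits inside the archived paragraph), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup assumed from the question: one current review and one
# archived review that contains the "show-archived" link.
html = """
<p class="review_comment">This place is terrible!</p>
<p class="review_comment">It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more »</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the paragraphs that do NOT contain a show-archived element.
reviews = [
    p.get_text(strip=True)
    for p in soup.find_all("p", "review_comment")
    if p.find(class_="show-archived") is None
]

print(reviews)
```

The check is done on each matched paragraph after the search, so the find_all() arguments stay as simple as in the question.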

Related

Not able to scrape data in "div" class on WSJ pages

I am trying to scrape text content from articles on the WSJ site. For example, consider the following HTML source:
<div class="article-content ">
<p>BEIRUT—
Carlos Ghosn,
who is seeking to clear his name in Lebanon, would face a very different path to vindication here, where endemic corruption and the former auto executive’s widespread popularity could influence the outcome of a potential trial. </p> <p>Mr. Ghosn, the former chief of auto makers
I am using the following code:
res = requests.get(url)
html = BeautifulSoup(res.text, "lxml")
classid = "article-content "
item = html.find_all("div", {"class":classid})
This returns an empty result. I saw a few other posts where people suggested adding delays and similar workarounds, but none of these work in my case. I plan on using the scraped text for some ML projects.
I have a subscription to WSJ and am logged in when running the above script.
Any help with this will be much appreciated! Thanks
Your code worked fine for me. Just make sure that you are searching for the correct 'classid'. I don't think this will make a difference, but you can try using this as an alternative:
item = html.find_all("div", class_ = classid)
One thing you can do is confirm the presence of the element by checking with JavaScript in the browser console. Often, background requests are made to serve the page, so you might see the element in the page even though it is the result of a request to a different URL or is generated inside a script.
Try using select, with 'lxml' set as the parser:
content = [p.text for p in html.select('.article-content p')]
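A small offline sketch of the two lookups suggested above may help separate parsing problems from fetching problems. The sample markup and its placeholder paragraphs are hypothetical, and html.parser is used here only so the snippet has no extra dependency. Since class is a whitespace-separated list of names, the trailing space in the attribute is not part of the class name, so searching for "article-content" without it is the safer choice.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page; note the trailing space
# inside the class attribute, as in the question.
sample = """
<div class="article-content ">
<p>First paragraph.</p> <p>Second paragraph.</p>
</div>
"""

html = BeautifulSoup(sample, "html.parser")

# Search by class name without the trailing space.
items = html.find_all("div", class_="article-content")

# Equivalent CSS-selector lookup, going straight to the paragraphs.
content = [p.text for p in html.select(".article-content p")]

if not items:
    # If this branch runs on the real page, the element is probably not in
    # the HTML that was fetched (for example, it is rendered by a script).
    print("article-content div not found in the fetched HTML")

print(content)
```

If both lookups work on a saved copy of the page but not on res.text, the difference is in what the server returned to the script, not in the BeautifulSoup call.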

Beautifulsoup - get text not between specific tags (after </span> but before <br>)?

I've looked around and found solutions that worked, or were supposed to work, for this exact question, but they do not work in this situation. Does anyone know why they would work there and not here? Or simply show me what I'm doing wrong, and I can work out the difference.
Keep in mind that I'm only giving a snippet of the HTML; it contains much more with the same span and class='boldText'. I specifically want the tag with Status: as its text, then the text that follows it.
import bs4
html1 = '''<span class="boldText"><b>Date:</b> </span>12/04/2018<br/>
<span class="boldText"><b>Name:</b> </span>Aaron Rodgers<br/>
<span class="boldText"><b>Status:</b> </span>Questionable<br/><br/>
<br/>
<br/><br/><br/>'''
soup = bs4.BeautifulSoup(html1,'html.parser')
status = soup.find(text='Status:').next_sibling
I'm just trying to get the text 'Questionable', so I'm looking for this output:
>>> print(status)
Questionable
The problem is that the string "Status:" is the only child of the b tag, so it has no siblings and next_sibling is None. It's easier to see when formatted like this:
<span class="boldText">
    <b>Status:</b>
</span>
Questionable
<br/>
See how the b sits inside the span? The string "Questionable" is actually a sibling of the parent span, so you need to navigate to it as follows:
print(soup.find('b', string='Status:').parent.next_sibling)
# => 'Questionable'
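Since the question mentions many more spans with the same class, here is a sketch that collects every label and the text following it into a dictionary. It assumes each value is a plain text node directly after its span, as in the snippet; the name fields is only illustrative.

```python
import bs4

html1 = '''<span class="boldText"><b>Date:</b> </span>12/04/2018<br/>
<span class="boldText"><b>Name:</b> </span>Aaron Rodgers<br/>
<span class="boldText"><b>Status:</b> </span>Questionable<br/><br/>'''

soup = bs4.BeautifulSoup(html1, 'html.parser')

fields = {}
for span in soup.find_all('span', class_='boldText'):
    # The label is the span's own text, e.g. "Status:" -> "Status".
    label = span.get_text(strip=True).rstrip(':')
    # The value is the text node that follows the span, not the b tag.
    value = span.next_sibling
    if isinstance(value, bs4.NavigableString):
        fields[label] = value.strip()

print(fields['Status'])
```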

Python code to keep only a set of html tags in a input string

I have text like this:
<div>
<script></script>
<h1>name</h1>
<p> Description </p>
<i> italic </i>
</div>
I want to remove all html tags except h tags and p tags. For this I'm trying to make a more generic method like this:
def strip_tags(text, a_list_of_tags_to_not_remove)
Using the following Beautiful Soup code I can remove all the HTML tags, but it doesn't let me keep a list of tags while removing the others.
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html).text
Can I do this using Beautiful Soup, or is there another Python library that can do this?
Yes, you can.
You can use .find_all([]) to find all the tags you don't care about, then call .unwrap() to get rid of them while keeping the content.
You can use the find_all function:
soup.find_all(['h1', 'p'])
to get a list of the tags you need, instead of having to find all the tags you don't want.
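Putting the unwrap() suggestion together, a minimal sketch of the strip_tags function from the question could look like this. The parameter name keep is illustrative, and the sketch assumes that removed tags should leave their text behind (so any script contents would stay in the output as plain text).

```python
from bs4 import BeautifulSoup

def strip_tags(text, keep):
    """Remove every tag whose name is not in `keep`, keeping its contents."""
    soup = BeautifulSoup(text, "html.parser")
    # find_all(True) matches every tag in the document.
    for tag in soup.find_all(True):
        if tag.name not in keep:
            # unwrap() replaces the tag with its children.
            tag.unwrap()
    return str(soup)

raw_html = "<div><h1>name</h1><p> Description </p><i> italic </i></div>"
cleaned = strip_tags(raw_html, ["h1", "p"])
print(cleaned)
```

If unwanted tags should disappear together with their contents, decompose() could be used in place of unwrap() for those tags.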

Pattern match not working as expected python

I was playing around with pattern matching on the HTML source of different sites when I noticed something weird. I used this pattern:
pat = r'<div class="id-app-orig-desc">.*</div>'
I used it on an app page of the Play Store (I picked a random app). As I understand it, it should give just what's between the div tags (i.e. the description), but that does not happen. It gives everything from the first occurrence of the pattern to the end of the page, completely ignoring the closing tags in between. Does anyone know what's happening?
And when I check the length of the list returned, it's just 1.
First of all, do not parse HTML with regex; use a specialized tool, an HTML parser. For example, BeautifulSoup:
from bs4 import BeautifulSoup
data = """
<div>
<div class="id-app-orig-desc">
Do not try to get me with a regex, please.
</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.find('div', {'class': 'id-app-orig-desc'}).text.strip())
Prints:
Do not try to get me with a regex, please.
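As for what was happening in the question, a likely explanation is that .* is greedy: it matches as much as possible, so the match runs to the last </div> it can reach and not the first one. The sketch below shows this on a hypothetical single-line page; the lazy .*? variant is included only to illustrate the difference, not as a recommendation over a parser.

```python
import re

# Hypothetical one-line page with a second div after the description.
page = ('<div class="id-app-orig-desc">Description</div>'
        '<div class="footer">Footer</div>')

# Greedy: .* extends to the LAST </div>, so one long match is returned.
greedy = re.findall(r'<div class="id-app-orig-desc">.*</div>', page)

# Lazy: .*? stops at the FIRST </div> after the opening tag.
lazy = re.findall(r'<div class="id-app-orig-desc">(.*?)</div>', page)

print(len(greedy), greedy[0] == page)
print(lazy)
```

Even the lazy version breaks on nested div elements, which is one reason the parser approach above is more robust.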

How to skip through same tags in BeautifulSoup - Python

I am currently writing code for scrapers and am becoming more and more of a fan of Python, and especially BeautifulSoup.
Still, when parsing HTML I came across a difficult part that I could only handle in a not-so-beautiful way.
I want to scrape HTML code, and in particular the following snippet:
<div class="title-box">
<h2>
<span class="result-desc">
Search results <strong>1</strong>-<strong>10</strong> out of <strong>10,009</strong> about <strong>paul mccartney</strong>Create email Alert
</span>
</h2>
</div>
So what I do is I identify the div by using:
comment = TopsySoup.find('div', attrs={'class' : 'title-box'})
Then the ugly part comes in. To catch the number I want, 10,009, I use:
catcher = comment.strong.next.next.next.next.next.next.next
Can somebody tell me if there is a nicer way?
How about comment.find_all('strong')[2].text?
It can actually be shortened as comment('strong')[2].text, since calling a Tag object as though it is a function is the same as calling find_all on it.
>>> comment('strong')[2].text
'10,009'
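Here is a self-contained sketch of that approach on the snippet from the question, with an extra step (my addition, not part of the answer above) that turns the matched text into an integer. The names html and total are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="title-box">
<h2>
<span class="result-desc">
Search results <strong>1</strong>-<strong>10</strong> out of <strong>10,009</strong> about <strong>paul mccartney</strong>Create email Alert
</span>
</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
comment = soup.find('div', attrs={'class': 'title-box'})

# The third <strong> inside the box holds the total number of results.
total_text = comment.find_all('strong')[2].text

# Drop the thousands separator before converting to an integer.
total = int(total_text.replace(',', ''))
print(total)
```

Indexing by position assumes the page always lists the range first and the total third, as in the snippet.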
