Convert a BeautifulSoup ResultSet into a list of strings - python

I'm trying to scrape the details of the reviews from here into a CSV using Python. Each movie has a star rating, shown as a series of icons with the class 'icon-star-full' or 'icon-star-half'. I'm trying to write a function that converts these icons into a numerical value.
The code that I have so far returns a bs4.element.ResultSet, with each element a Tag:
[<i class="icon-star-full"></i>, <i class="icon-star-full"></i>]
I want to convert that into a list of strings, like
['<i class="icon-star-full"></i>', '<i class="icon-star-full"></i>']
I've tried soup_obj.text and soup_obj.content, but both return empty strings.
This is my code:
from bs4 import BeautifulSoup
import requests
result = requests.get(url='http://www.rogerebert.com/reviews')
result_content = result.content
soup_obj = BeautifulSoup(result_content, 'html5lib')
wrapper_class = soup_obj.find('div', id='review-list')
for x in wrapper_class.find_all('figure'):
    convoluted_rating = x.find('span', class_='star-rating').find_all('i')
    print(convoluted_rating)
I've seen this, but it returns a list of None values, like so:
[None,None]

You can iterate over the ResultSet and call tag.prettify:
tags = []
for x in wrapper_class.find_all('figure'):
    tags.extend(
        (i.prettify() for i in x.find('span', class_='star-rating').find_all('i'))
    )
print(tags)
['<i class="icon-star-full">\n</i>\n',
'<i class="icon-star-full">\n</i>',
'<i class="icon-star-full">\n</i>\n',
...
]
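If the newlines that prettify() adds are unwanted, calling str() on each Tag is an alternative worth trying. The sketch below also shows one possible way to turn the icons into a number, which was the question's stated goal. The HTML snippet, the STAR_VALUES mapping and the star_rating helper are assumptions made for illustration; they are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the icons shown in the question.
html = '''
<span class="star-rating">
  <i class="icon-star-full"></i><i class="icon-star-full"></i><i class="icon-star-half"></i>
</span>
'''
soup = BeautifulSoup(html, 'html.parser')
icons = soup.find('span', class_='star-rating').find_all('i')

# str() serialises each Tag as-is, without prettify()'s added newlines.
tag_strings = [str(icon) for icon in icons]

# Assumed scoring: a full star counts 1.0, a half star 0.5.
STAR_VALUES = {'icon-star-full': 1.0, 'icon-star-half': 0.5}

def star_rating(icons):
    """Sum the value of every known star class found on the given <i> tags."""
    return sum(
        STAR_VALUES.get(css_class, 0.0)
        for icon in icons
        for css_class in icon.get('class', [])
    )

rating = star_rating(icons)
print(tag_strings)
print(rating)
```

Reading the class attribute directly avoids parsing the serialised strings a second time.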

Related

Removing quotes from re.findall output

I am trying to remove the quotes from my re.findall output using Python 3. I tried suggestions from various forums, but they didn't work as expected, so I decided to ask here myself.
My code:
import requests
from bs4 import BeautifulSoup
import re
import time
price = []
while True:
    url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    data = soup.prettify()
    for p in data:
        match = re.findall(r'\d*\.?\d+', data)
    print("ETH/USDT", match)
    price.append(match)
    break
The output of match is:
['143.19000000']. I would like it to be [143.1900000] instead, but I cannot figure out how to do this.
Another problem is that every value is appended to the list price as its own single-element list, so price ends up as, for example, [[a], [b], [c]]. I would like it to be [a, b, c]. I am having trouble solving these two problems.
Thanks :)
Parse the response from requests.get() as JSON, rather than using BeautifulSoup:
import requests
url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
response = requests.get(url)
response.raise_for_status()
data = response.json()
print(data["price"])
To get floats instead of strings:
float_match = [float(el) for el in match]
To get a list instead of a list of lists:
for el in float_match:
    price.append(el)
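Putting the two suggestions together, a minimal sketch might look like the following. It uses a hard-coded sample string in place of the live request so that it runs offline; the "symbol" field in the sample payload and the three-iteration loop are assumptions for illustration.

```python
import json

# Sample payload standing in for the body returned by requests.get(url).
# Only the "price" key is confirmed by the answer above; "symbol" is assumed.
sample_body = '{"symbol": "ETHUSDT", "price": "143.19000000"}'

price = []
for _ in range(3):  # stand-in for three polling iterations
    data = json.loads(sample_body)
    # Append the number itself rather than a one-element list.
    price.append(float(data["price"]))

print(price)
```

Because each iteration appends a float rather than the list returned by re.findall, the result is a flat list of numbers.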

Get all elements that match a specific attribute value, but match any tag or attribute name with BeautifulSoup

Is it possible to get all elements that match a specific attribute value, regardless of tag name or attribute name, with BeautifulSoup? If so, does anyone know how to do it?
Here's an example of how I'm trying to do it:
from bs4 import BeautifulSoup
import requests
text_to_match = 'https://s3-ap-southeast-2.amazonaws.com/bettss3/images/003obzt0t_w1200_h1200.jpg'
url = 'https://www.betts.com.au/item/37510-command.html?colour=chocolate'
r = requests.get(url)
bs = BeautifulSoup(r.text, features="html.parser")
possibles = bs.find_all(None, {None: text_to_match})
print(possibles)
This gives me an empty list [].
If I replace {None: text_to_match} with {'href': text_to_match} this example will give some results as expected. I'm trying to figure out how to do this without specifying the attribute's name, and only matching the value.
You can call find_all with no restrictions and filter out the tags that don't match your needs, like this:
text_to_match = 'https://s3-ap-southeast-2.amazonaws.com/bettss3/images/003obzt0t_w1200_h1200.jpg'
url = 'https://www.betts.com.au/item/37510-command.html?colour=chocolate'
r = requests.get(url)
bs = BeautifulSoup(r.text, features="html.parser")
tags = [tag for tag in bs.find_all() if text_to_match in str(tag)]
print(tags)
This solution is a bit clumsy, as you might get some irrelevant tags (for example, every ancestor of a matching tag also contains the text). You can make your text a bit more attribute-specific:
text_to_match = r'="https://s3-ap-southeast-2.amazonaws.com/bettss3/images/003obzt0t_w1200_h1200.jpg"'
which is a bit closer to the string representation of a tag with an attribute.
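Another option is to pass a function to find_all and inspect each tag's attrs dictionary directly, so that only attribute values are compared. This is a sketch under assumptions: the HTML, the example.com URL and the has_attr_value helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same URL appears under two different attribute
# names and once as plain text.
html = '''
<a href="https://example.com/img.jpg">link</a>
<img data-src="https://example.com/img.jpg">
<p class="note">https://example.com/img.jpg</p>
'''
text_to_match = 'https://example.com/img.jpg'
soup = BeautifulSoup(html, 'html.parser')

def has_attr_value(tag):
    """Return True if any attribute of the tag equals text_to_match."""
    for value in tag.attrs.values():
        # Multi-valued attributes such as class are stored as lists.
        values = value if isinstance(value, list) else [value]
        if text_to_match in values:
            return True
    return False

matches = soup.find_all(has_attr_value)
names = [tag.name for tag in matches]
print(names)
```

Unlike the str(tag) check, this does not match parent tags or tags that merely contain the value as text.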

Extracting Embedded <span> in Python using BeautifulSoup

I am trying to extract a value from a span, but the span has another span nested inside it. How do I get the value of only the outer span rather than both?
from bs4 import BeautifulSoup
some_price = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"})
some_price.span
# that code returns this:
'''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
# BUT I only want the $289 part, not the 99 associated with it
After making this adjustment:
some_price.span.text
the interpreter returns
$28999
Would it be possible to somehow remove the '99' at the end? Or to only extract the first part of the span?
Any help/suggestions would be appreciated!
You can access the desired value through the .contents attribute of the outer span tag:
from bs4 import BeautifulSoup as soup
html = '''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
result = soup(html, 'html.parser').find('span').contents[0]
Output:
'$289'
Thus, in the context of your original div lookup:
result = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"}).span.contents[0]
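Indexing contents[0] relies on the text being the first child of the span. A slightly more explicit variant, sketched below, asks for the first text node that is a direct child. Combining the two parts into a float assumes that the nested span holds the cents, which the question does not state.

```python
from bs4 import BeautifulSoup

html = '<span>$289<span class="rightEndPrice_6y_hS">99</span></span>'
outer = BeautifulSoup(html, 'html.parser').find('span')

# First text node that is a direct child of the outer span; text inside
# nested tags is skipped because recursive=False.
dollars = outer.find(string=True, recursive=False)
cents = outer.find('span').get_text()

# Assumed interpretation: the nested span holds the cents of the price.
amount = float(dollars.strip().lstrip('$') + '.' + cents)
print(dollars, amount)
```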

Extract number from a website using beautifulsoup?

The following python code:
from bs4 import BeautifulSoup
div = '<div class="hm"><span class="xg1">查看:</span> 15660<span class="pipe">|</span><span class="xg1">回复:</span> 435</div>'
soup = BeautifulSoup(div, "lxml")
hm = soup.find("div", {"class": "hm"})
print(hm)
The output that I want is the two numbers in this div:
15660
435
How can I extract these numbers using BeautifulSoup?
Call soup.find_all with a regex:
>>> import re
>>> list(map(str.strip, soup.find_all(text=re.compile(r'\b\d+\b'))))
Or,
>>> [x.strip() for x in soup.find_all(text=re.compile(r'\b\d+\b'))]
['15660', '435']
If you need integers instead of strings, call int inside the list comprehension -
>>> [int(x.strip()) for x in soup.find_all(text=re.compile(r'\b\d+\b'))]
[15660, 435]
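A variation on the same idea is to run the regex over the div's combined text rather than over individual text nodes. The sketch below uses the built-in html.parser instead of lxml purely to avoid an extra dependency; that substitution is an assumption, not part of the answer above.

```python
import re
from bs4 import BeautifulSoup

div = '<div class="hm"><span class="xg1">查看:</span> 15660<span class="pipe">|</span><span class="xg1">回复:</span> 435</div>'
hm = BeautifulSoup(div, 'html.parser').find('div', class_='hm')

# Pull every run of digits out of the div's combined text and convert to int.
numbers = [int(n) for n in re.findall(r'\d+', hm.get_text())]
print(numbers)
```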

Hashtags python html

I want to extract all the hashtags from a given website:
For example, "I love #stack overflow because #people are very #helpful!"
This should pull the 3 hashtags into a table.
The website I am targeting has a table in which each #tag is followed by a description, for example:
#love this hashtag speaks about love
This is my work:
#import the library used to query a website
import urllib2
#specify the url
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)
#import the Beautiful Soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
print(soup.prettify())
s = soup.get_text()
import re
re.findall(r"#(\w+)", s)
I have an issue with the output, which looks like this:
[u'eeeeee',
u'333333',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'AASTGrandRoundsacute'
The output concatenates the hashtag with the first word of its description. For the example I mentioned above, the output is 'lovethis'.
How can I extract only the one word after the hashtag?
Thank you
I think there's no need to use regex to parse the text you get from the page; you can use BeautifulSoup itself for that. I'm using Python 3.6 in the code below, just to show the entire code, but the important line is hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'}). Notice that every hashtag in the table sits in a td tag whose id attribute is tweetchatlist_hashtag, so calling .findAll is the way to go here:
import requests
import re
from bs4 import BeautifulSoup
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
page = requests.get(wiki).text
soup = BeautifulSoup(page, "lxml")
hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'})
Now let's have a look at the first item of our list:
>>> hashtags[0]
<td id="tweetchatlist_hashtag" itemprop="location">#AASTGrandRounds</td>
So we see that what we really want is the value of title attribute of a:
>>> hashtags[0].a['title']
'#AASTGrandRounds'
To get a list of all hashtags, use a list comprehension:
>>> lst = [hashtag.a['title'] for hashtag in hashtags]
If you are not used to list comprehension syntax, the line above is equivalent to this:
>>> lst = []
>>> for hashtag in hashtags:
...     lst.append(hashtag.a['title'])
lst is then the desired output. Here are the first 20 items of the list:
>>> lst[:20]
['#AASTGrandRounds', '#abcDrBchat', '#addictionchat', '#advocacychat', '#AetnaMyHealthy', '#AlzChat', '#AnatQ', '#anzOTalk', '#AskAvaility', '#ASPChat', '#ATtalk', '#autchat', '#AXSChat', '#ayacsm', '#bcceu', '#bccww', '#BCSM', '#benurse', '#BeTheDifference', '#bioethx']
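As for why the regex in the question glued the hashtag to the first word of its description: get_text() joins the text of adjacent cells with no separator by default. The sketch below reproduces this on a made-up table row (the HTML is an assumption based on the question's #love example) and shows how passing a separator keeps the words apart.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical table row modelled on the question's description.
html = '<table><tr><td>#love</td><td>this hashtag speaks about love</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# Default get_text() concatenates cell texts directly: '#lovethis hashtag ...'
joined = re.findall(r'#(\w+)', soup.get_text())

# A separator inserts a space between the text of neighbouring cells.
separated = re.findall(r'#(\w+)', soup.get_text(separator=' '))

print(joined, separated)
```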
