Parsing figure name on Beautiful Soup - python

This is my first time posting, so please be gentle.
I'm extracting data from TripAdvisor. The review ratings are rendered with a figure that is represented like this:
<span class="ui_bubble_rating bubble_40"></span>
As you can see, there is a "40" in the end that represents 4 stars. The same happens with "20" (2 stars) etc...
How can I obtain the "ui_bubble_rating bubble_40"?
Thank you in advance...

I'm not sure if this is the most efficient way of doing that, but here's how I'd do it. Note that class is a reserved word in Python, so BeautifulSoup spells the keyword argument class_:
tags = soup.find_all(class_=re.compile(r"bubble_\d\d"))
The tags variable will then include every tag in the page whose class matches the regex bubble_\d\d. The results are Tag objects rather than strings, so extract the number from the class attribute:
stars = tags[0]["class"][-1].split("_")[1]
If you want to be fancy, you can use a list comprehension to extract the number from every tag:
stars = [tag["class"][-1].split("_")[1] for tag in tags]
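Putting it together, here's a minimal runnable sketch; the HTML snippet is made up to mimic the TripAdvisor markup:

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet mimicking the TripAdvisor rating markup
html = '''
<div><span class="ui_bubble_rating bubble_40"></span></div>
<div><span class="ui_bubble_rating bubble_20"></span></div>
'''

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all(class_=re.compile(r"bubble_\d\d"))

# A tag's "class" attribute is a list, e.g. ['ui_bubble_rating', 'bubble_40'];
# take the bubble_NN token and keep the digits after the underscore
stars = [tag["class"][-1].split("_")[1] for tag in tags]
print(stars)  # ['40', '20']
```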

I am not sure what kind of data you are trying to scrape,
but you can obtain that span tag like so (I tested it and left some prints in):
from urllib.request import urlopen  # Python 3; on Python 2 it is: from urllib import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("YOUR_REVIEWS_URL")
bs1 = BeautifulSoup(html, 'lxml')
for s in bs1.findAll("span", {"class": "ui_bubble_rating bubble_40"}):
    print(s)
More generic way (scrape all ratings, i.e. anything matching bubble_[0-9]{2}):
toFind = re.compile("bubble_[0-9]{2}")
for s in bs1.findAll("span", {"class": toFind}):
    print(s)
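To convert a match into a numeric star count, one possible sketch, assuming the bubble_NN convention means NN = stars × 10 (the snippet stands in for the real page):

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet standing in for the real page
html = '<span class="ui_bubble_rating bubble_35"></span>'
soup = BeautifulSoup(html, "html.parser")

toFind = re.compile("bubble_([0-9]{2})")
for s in soup.findAll("span", {"class": toFind}):
    # Find the bubble_NN token among the tag's classes and divide by 10
    token = next(c for c in s["class"] if toFind.search(c))
    stars = int(toFind.search(token).group(1)) / 10
    print(stars)  # 3.5
```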
Hope that answers your question

Related

How to pull quotes and authors from a website in Python?

I have written the following code to pull the quotes from the webpage:
#importing python libraries
from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_colwidth', 500)
import time
import requests
import random
from lxml import html
#collect first page of quotes
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
#create a BeautifulSoup object
soup=BeautifulSoup(page.content, 'html.parser')
soup
print(soup.prettify())
#find all quotes on the page
soup.find_all('ol')
#pull just the quotes and not the superfluous data
Quote = soup.find(id='post-')
Quote_list = Quote.find_all('ol')
Quote_list
At this point, I want to show just the text in a list, without the <li> or <ol> tags.
I've tried using the .get_text() attribute but I get an error saying
ResultSet object has no attribute 'get_text'
How can I get only the text to return?
This is only for the first page of quotes - there is a second page which I am going to need to pull the quotes from. I will also need to present the data in a table with a column for the quotes and a column for the author from both pages.
Help is greatly appreciated... I'm still new to learning python and I've been working to this point for 8 hours on this code and feel so stuck/discouraged.
The 'html.parser' backend seems to have a bit of a problem even with what I think is now the correct code, but after switching to 'lxml', which this was not using before, it seems to be working:
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
quotes.extend([li.get_text()
               for ordered_list in ordered_lists
               for li in ordered_list.find_all('li')
               ])
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
Prints:
22
“By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It’s Columbus meet Columbo―starry-eyed explorers and skeptical detectives.”
--------------------------------------------------------------------------------
“Once you have a certain amount of math/stats and hacking skills, it is much better to acquire a grounding in one or more subjects than in adding yet another programming language to your hacking skills, or yet another machine learning algorithm to your math/stats portfolio…. Clients will rather work with some data scientist A who understands their specific field than with another data scientist B who first needs to learn the basics―even if B is better in math/stats/hacking.”
Alternate Coding
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
for ordered_list in ordered_lists:
    for li in ordered_list.find_all('li'):
        quotes.append(li.get_text())
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
The find_all() method can take a list of elements to search for, or a function that determines which elements should be returned. To print the results you need get_text(), but it only works on a single Tag, not on the ResultSet that find_all() returns, so you have to loop over the results and call get_text() on each element to extract its text.
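To illustrate with a tiny, self-contained example (markup made up for demonstration):

```python
from bs4 import BeautifulSoup

html = "<ol><li>first</li><li>second</li></ol>"
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")  # a ResultSet of Tag objects
# items.get_text() would raise AttributeError; call get_text() per element
texts = [li.get_text() for li in items]
print(texts)  # ['first', 'second']
```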
Use this code to get all your quotes (updated and working):
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup = bs(page.text, 'html.parser')
# Q holds all of your quotes with HTML tagging, but you can loop over it with get_text()
Q = soup.find('div', {"id": "post-"}).find_all('ol')
quotes = []
for every_quote in Q:
    quotes.append(every_quote.get_text())
print(quotes[0])  # Get the first quote
Use quotes[0], quotes[1], ... to get 1st, 2nd, and so on quotes!
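As for the table with a quote column and an author column that the question asks about, here is a hedged sketch with pandas; the lists below are made up for illustration, and how you split each scraped string into quote and author will depend on how the page formats them:

```python
import pandas as pd

# Hypothetical parallel lists; in practice you would build these while scraping
quotes = ["By definition all scientists are data scientists.",
          "Data really powers everything that we do."]
authors = ["Monica Rogati", "Jeff Weiner"]

df = pd.DataFrame({"quote": quotes, "author": authors})
print(df)
```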

BeautifulSoup won't parse Article element

I'm working on parsing this web page.
I've got table = soup.find("div",{"class","accordions"}) to get just the fixtures list (and nothing else) however now I'm trying to loop through each match one at a time. It looks like each match starts with an article element tag <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">
However for some reason when I try to use matches = table.findAll("article",{"role","article"})
and then print the length of matches, I get 0.
I've also tried to say matches = table.findAll("article",{"about","/fixture/arsenal"}) but get the same issue.
Is BeautifulSoup unable to parse tags, or am I just using it wrong?
Try this:
matches = table.findAll('article', attrs={'role': 'article'})
The reason is that findAll's second positional argument must be a dictionary of attributes; {"role", "article"} is a set, which BeautifulSoup does not interpret as an attribute filter. Refer to the bs4 docs.
You need to pass the attributes as a dictionary. There are three ways in which you can get the data you want.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')
matches = soup.find_all('article', {'role': 'article'})
print(len(matches))
# 16
Or, this is also the same:
matches = soup.find_all('article', role='article')
But both these methods return some extra article tags that don't contain the Arsenal fixtures. So, if you want to find them using /fixture/arsenal, you can use CSS selectors. (Using find_all won't work, as you need a partial match on the attribute value.)
matches = soup.select('article[about^="/fixture/arsenal"]')
print(len(matches))
# 13
Also, have a look at the keyword arguments. It'll help you get what you want.
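A self-contained demonstration of the prefix selector; the markup below is made up to mimic the fixtures page:

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking the fixtures page
html = '''
<article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city"></article>
<article role="article" about="/news/some-story"></article>
'''
soup = BeautifulSoup(html, "html.parser")

# ^= matches attribute values that start with the given prefix
fixtures = soup.select('article[about^="/fixture/arsenal"]')
print(len(fixtures))  # 1
```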

Python findall regex issue

So, essentially my main issue comes from the regex part of findall. I'm trying to web-scrape some information, but I can't get any data to come out correctly. I thought (\S+ \S+) was the capturing part and that I'd be extracting anything between <li> and </li> in the HTML, but print(data) gives an empty list. I realize I'd need a \S+ for every word in each list item, so how would I go about this? And how would I get it to print each of the matching parts of the HTML?
INPUT: Just the website. Mikky Ekko - Time
OUTPUT: In this case, it should be album titles (i.e. Mikky Ekko - Time)
import urllib.request
from re import findall
url = "http://rnbxclusive.se"
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = str(html)
data = findall("<li>(\S+ \S+)</li>.*", htmlStr)
print(data)
for item in data:
    print(item)
Use lxml:
import lxml.html
doc = lxml.html.fromstring(response.read())
for li in doc.findall('.//li'):
    print(li.text_content())
Try this:
<li>([^><]*)<\/li>
It will capture the contents of each <li> tag ([^><]* matches any run of characters containing no angle brackets). See the demo:
http://regex101.com/r/dZ1vT6/55
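For instance, applied to a made-up HTML string:

```python
import re

html = "<li>Mikky Ekko - Time</li><li>Another Artist - Song</li>"
# [^><]* matches any run of characters with no angle brackets,
# so each match stops at the closing </li>
data = re.findall(r"<li>([^><]*)</li>", html)
print(data)  # ['Mikky Ekko - Time', 'Another Artist - Song']
```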

Python 3 Beautiful Soup Data type incompatibility issue

Hello there stack community!
I'm having an issue that I can't seem to resolve since it looks like most of the help out there is for Python 2.7.
I want to pull a table from a webpage and then just get the linktext and not the whole anchor.
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = 'http://www.craftcount.com/category.php?cat=5'
html = urlopen(url).read()
soup = BeautifulSoup(html)
alltables = soup.findAll("table")
## This bit captures the input from the previous sequence
results = []
for link in alltables:
    rows = link.findAll('a')
    ## Find just the names
    top100 = re.findall(r">(.*?)<\/a>", rows)
    print(top100)
When I run it, I get: "TypeError: expected string or buffer"
Up to the second to the last line, it does everything correctly (when I swap out 'print(top100)' for 'print(rows)').
As an example of the response I get:
<a href="…">michellechangjewelry</a>
And I just need to get:
michellechangjewelry
According to pythex.org, my (ir)regular expression should work, so I wanted to see if anyone out there knew how to do that. As an additional issue, it looks like most people like to go the other way, that is, from having the full text and only wanting the URL part.
Finally, I'm using BeautifulSoup out of "convenience", but I'm not beholden to it if you can suggest a better package to narrow down the parsing to the linktext.
Many thanks in advance!!
BeautifulSoup results are not strings; they are Tag objects, mostly.
To get the text of the <a> tags, use the .string attribute:
for table in alltables:
    link = table.find('a')
    top100 = link.string
    print(top100)
This finds the first <a> link in a table. To find all text of all links:
for table in alltables:
    links = table.find_all('a')
    top100 = [link.string for link in links]
    print(top100)
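A quick self-contained check of the difference between the two accessors (markup made up to resemble the page):

```python
from bs4 import BeautifulSoup

html = '<table><tr><td><a href="#">michellechangjewelry</a></td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
# .string works when the tag has exactly one text child;
# .get_text() also flattens any nested markup
print(link.string)      # michellechangjewelry
print(link.get_text())  # michellechangjewelry
```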

Edit content in BeautifulSoup ResultSet

My goal in the end is to add up the number within the BeautifulSoup ResultSet down here:
[<span class="u">1,677</span>, <span class="u">114</span>, <span class="u">15</span>]
<class 'BeautifulSoup.ResultSet'>
Therefore, end up with:
sum = 1806
But it seems like the usual techniques for iterating through a list do not work here.
In the end I know that I have to extract the numbers, delete the commas, and then add them up. But I am kinda stuck, especially with extracting the numbers.
Would really appreciate some help. Thanks
Seems like the usual iterating techniques are working for me. Here is my code:
from bs4 import BeautifulSoup
# or `from BeautifulSoup import BeautifulSoup` if you are using BeautifulSoup 3
text = "<html><head><title>Test</title></head><body><span>1</span><span>2</span></body></html>"
soup = BeautifulSoup(text, 'html.parser')  # pass a parser explicitly to avoid a warning
spans = soup.findAll('span')
total = sum(int(span.string) for span in spans)
print(total)
# 3
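Note that the numbers in the question contain thousands separators ("1,677"), which int() rejects; strip the commas first. The markup below mirrors the question's spans:

```python
from bs4 import BeautifulSoup

html = ('<span class="u">1,677</span>'
        '<span class="u">114</span>'
        '<span class="u">15</span>')
soup = BeautifulSoup(html, 'html.parser')

# Remove the comma thousands separators before converting to int
total = sum(int(span.string.replace(",", "")) for span in soup.findAll('span'))
print(total)  # 1806
```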
What have you tried? Do you have any error message we might be able to work with?
