How do I parse values like this with BeautifulSoup? - python

I am trying to parse some information from a website but ran into a little problem: the information I need won't print out and just shows [] when I expect the values (3, for example, from the source code provided). I would need some help to get it working. Hope someone here can help me out and assist in the matter.
Best regards.
import re
import requests
from bs4 import BeautifulSoup
url_to_parse = "https://www.webpage.com"
response = requests.get(url_to_parse)
response_text = response.text
soup = BeautifulSoup(response_text, 'lxml')
#print(soup.prettify())
ragex = re.compile('c76a6')
content_lis = soup.find_all('button', attrs={'class': ragex})
print(content_lis)
source: <button class="c76a6" type="button" data-test-name="valueButton"><span class="_5a5c0" data-test-name="value">3</span></button>

find_all returns a list, so to get an item you have to index into it or loop through the matches, which is unnecessary if you know the target is unique. In that case use find, which returns only the first match, and then read the .text attribute to get just the value:
import re
import requests
from bs4 import BeautifulSoup
url_to_parse = "https://www.webpage.com"
response = requests.get(url_to_parse)
response_content = response.content
soup = BeautifulSoup(response_content, 'lxml')
# print(soup.prettify())
regex = re.compile('c76a6')
content_list = soup.find('button',{'class': regex})
print(content_list.text)
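For comparison, here is a minimal sketch of the find_all route mentioned above (using the same placeholder URL and class from the question): either index into the returned list or loop over it and read .text from each match.
import re
import requests
from bs4 import BeautifulSoup
url_to_parse = "https://www.webpage.com"  # placeholder URL from the question
soup = BeautifulSoup(requests.get(url_to_parse).text, 'lxml')
regex = re.compile('c76a6')
buttons = soup.find_all('button', attrs={'class': regex})  # list of every match
if buttons:
    print(buttons[0].text)  # index the first match
for button in buttons:      # or loop over all matches
    print(button.text)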

Related

Can't scrape <h3> tag from page

Seems like I can scrape any tag and class except h3 on this page. It keeps returning None or an empty list. I'm trying to get this h3 tag:
...on the following webpage:
https://www.empireonline.com/movies/features/best-movies-2/
And this is the code I use:
from bs4 import BeautifulSoup
import requests
from pprint import pprint
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll(name="h3", class_="jsx-4245974604")
movies_text = []
for item in movies:
    result = item.getText()
    movies_text.append(result)
print(movies_text)
Can you please help with the solution for this problem?
As other people mentioned, this is dynamic content, which only gets generated when the page runs in a browser. Therefore you can't find the class "jsx-4245974604" with BS4.
If you print out your "soup" variable you can actually see that it isn't there. But if you simply want to get the names of the movies, you can just use another part of the HTML in this case.
The movie name is in the alt attribute of the image (and actually also in many other parts of the HTML).
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll("img", class_="jsx-952983560")
movies_text = []
for item in movies:
    result = item.get('alt')
    movies_text.append(result)
print(movies_text)
If you run into this issue in the future, remember to print out the initial HTML you get with soup and check by eye whether the information you need is actually there, as in the sketch below.
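As a rough sketch of that check (using the URL and class names from this question), you can test whether a class name even appears in the raw HTML before trying to select it:
import requests
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
web_html = requests.get(URL).text
# False means the element is rendered by JavaScript and won't be found by BS4 in the raw HTML
print("jsx-4245974604" in web_html)  # the h3 class that fails
print("jsx-952983560" in web_html)   # the img class used in the answer above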

Cannot extract span element with BeautifulSoup

See below. I am using BeautifulSoup to try and extract this value. What I've tried:
pg = requests.get(websitelink)
soup = BeautifulSoup(pg.content, 'html.parser'
value = soup.find('span',{'class':'wall-header__item_count'}).text
I've tried find and find_all, and it returns a NoneType. For whatever reason the wall-header item count cannot be found with these methods, even though it appears in the HTML. How can I get this value? Thanks!
I'm assuming you want to get the number of total items. The number is stored within the HTML page inside a <script> tag. BeautifulSoup doesn't see it, but you can use the re/json modules to extract it:
import re
import json
import requests
url = "https://www.nike.com/w"
html_doc = requests.get(url).text
data = re.search(r"window\.INITIAL_REDUX_STATE=(\{.*\})", html_doc).group(1)
data = json.loads(data)
# uncomment this to print all data;
# print(json.dumps(data, indent=4))
print("Total items:", data["Wall"]["pageData"]["totalResources"])
Prints (in my case, for my country):
Total items: 5600
You forgot to close the parentheses:
soup = BeautifulSoup(pg.content, 'html.parser'
should be:
soup = BeautifulSoup(pg.content, 'html.parser')

Python Beautiful soup: Target a particular element

I am trying to scrape a particular part of a website (https://flightmath.com/from-CDG-to-BLR) but I am unable to target the element that I need.
Below is the part of the html
<h2 style="background-color:#7DC2F8;padding:10px"><i class="fa fa-plane"></i>
flight distance = <strong>4,866</strong> miles</h2>
This is my code
dist = soup.find('h2', attrs={'class': 'fa fa-plane'})
I just want to target the "4,866" part.
I would be really grateful if someone can guide me on this.
Thanks in advance.
attrs={'class': '...'} requires an exact class attribute value (not a combination of classes). Instead, use the soup.select_one method to select by an extended CSS rule:
from bs4 import BeautifulSoup
import requests
url = 'https://flightmath.com/from-CDG-to-BLR'
html_data = requests.get(url).content
soup = BeautifulSoup(html_data, 'html.parser')
dist = soup.select_one('h2 i.fa-plane + strong')
print(dist.text) # 4,866
In case of interest: the value is hard-coded into the HTML (for a flight-speed calculation), so you could also regex out a more precise value with the following. You can use round() to get the value shown on the page.
import requests, re
urls = ['https://flightmath.com/from-CDG-to-BOM', 'https://flightmath.com/from-CDG-to-BLR', 'https://flightmath.com/from-CDG-to-IXC']
p = re.compile(r'flightspeed\.min\.value\/60 \+ ([0-9.]+)')
with requests.Session() as s:
    for url in urls:
        print(p.findall(s.get(url).text)[0])
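And a small sketch of the round() step mentioned above, applied to a single URL (assuming the regex captures the precise distance as a plain number):
import requests, re
p = re.compile(r'flightspeed\.min\.value\/60 \+ ([0-9.]+)')
html_text = requests.get('https://flightmath.com/from-CDG-to-BLR').text
precise = float(p.findall(html_text)[0])
print(precise)         # precise value embedded in the page's calculation
print(round(precise))  # rounding should match the figure shown on the page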
Find the tag by class name and then use find_next() to find the strong tag.
from bs4 import BeautifulSoup
import requests
url = 'https://flightmath.com/from-CDG-to-BLR'
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, 'html.parser')
dist = soup.find('i',class_='fa-plane').find_next('strong')
print(dist.text)

How to extract certain parts of an HTML paragraph

I am new to web scraping and regular expressions and am facing a problem here. One of my scripts gives me output in HTML, but I need to extract a certain part out of the paragraph, not the complete paragraph. I need help with this. Below is my code.
import mechanize
from bs4 import BeautifulSoup
import urllib2
br = mechanize.Browser()
response = br.open("http://www.consultadni.info/index.php")
br.select_form(name="form1")
br['APE_PAT']='PATRICIO'
br['APE_MAT']='GAMARRA'
br['NOMBRES']='MARCELINA'
req=br.submit().read()
soup = BeautifulSoup(req, "lxml")
for link in soup.findAll("a"):
    sub = link.get("href")
    soup1 = BeautifulSoup(sub, "lxml")
    print soup1.find_all('p')
Output on screen:
[<p>/</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880</p>]
[<p>http://www.infocorpperuconsultatusdeudas.blogspot.com/2015/05/infocorp-consulta-gratis-tu-reporte-de.html?ref=dnionline</p>]
What I need: 30/06/1980 & 40631880
For Python 2.7, try it this way:
from urlparse import parse_qs
result = set()
for link in soup.find_all("a"):
    sub = parse_qs(link.get("href"))
    if "id2" in sub:
        result.add((sub["id2"][0], sub["dni3"][0]))
print result
Clean way to parse URLs (Python 3):
from urllib import parse
URL = "datospersonales.php?nc=PATRICIO GAMARRA MARCELINA&dni1=40772568&dni2=12405868&id1=12a40a58a68&id2=30/06/1980&dni3=40631880"
query_parts = parse.parse_qs(parse.urlparse(URL).query)
print(query_parts["id2"][0], query_parts["dni3"][0])
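Running this prints 30/06/1980 40631880, the two values asked for above.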

How to crawl the description for sfglobe using python

I am trying to use Python and BeautifulSoup to get this page from the sfglobe website: http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore.
This is the code:
import urllib2
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = urllib2.urlopen(url)
html = req.read()
soup = BeautifulSoup(html)
desc = soup.find('span', class_='articletext intro')
Could anyone help me to solve this problem?
From the question title, I'm assuming that the only thing you want is the description of the article, which can be found in a <meta> tag within the HTML <head>.
You were on the right track, but I'm not exactly sure why you did:
desc = soup.find('span', class_='articletext intro')
Regardless, I came up with something using requests (see http://stackoverflow.com/questions/2018026/should-i-use-urllib-or-urllib2-or-requests) rather than urllib2
import requests
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html)
tag = soup.find(attrs={'name':'description'}) # find meta tag w/ description
desc = tag['content'] # the description text lives in the 'content' attribute
print desc
If that isn't what you are looking for, please clarify so I can try and help you more.
EDIT: after some clarification, I pieced together why you were originally using desc = soup.find('span', class_='articletext intro').
Maybe this is what you are looking for:
import requests
from bs4 import BeautifulSoup, NavigableString
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html)
body = soup.find('span', class_='articletext intro')
# remove script tags
[s.extract() for s in body('script')]
text = ""
# iterate through non-script elements in the content body
for stuff in body.select('*'):
    # get contents of tags; .contents returns a list
    content = stuff.contents
    # check that the list has text content (i.e. isn't empty) AND is a NavigableString, not a tag
    if len(content) == 1 and isinstance(content[0], NavigableString):
        text += content[0]
print text
