Print URL from two different BeautifulSoup outputs - Python

I am scraping a few URLs in batch using BeautifulSoup.
Here is my script (only relevant stuff):
import urllib2
from bs4 import BeautifulSoup
quote_page = 'https://example.com/foo/bar'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
url_box = soup.find('div', attrs={'class': 'player'})
print url_box
This gives two different kinds of output depending on the HTML of the URL (about half the pages give the first output and the rest give the second).
Here's the first kind of output:
<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>
And here's the other:
<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>
I want to extract the image URL, which is the poster attribute in the first case and the src attribute in the second.
Any ideas how I can do that so the same script extracts that URL from either kind of output?
P.S. The first output also has an mp4 link, which I do not need.

You can use the get() method to read the value of an attribute from the targeted tag.
You should be able to do something like this:
if url_box.find('video'):
    url = url_box.find('video').get('poster')
    mp4 = url_box.find('span').get('data-url')
if url_box.find('img'):
    url = url_box.find('img').get('src')
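Putting the pieces together, a slightly more defensive sketch (assuming url_box is the <div class="player"> found in the original script, and that either tag may be missing) could look like this:
video = url_box.find('video')
img = url_box.find('img')

if video is not None:
    url = video.get('poster')  # image URL from the <video> poster attribute
elif img is not None:
    url = img.get('src')       # image URL from the <img> src attribute
else:
    url = None                 # neither layout matched

print(url)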

Decide which version you are dealing with and split accordingly:
firstVersion = '''<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>'''
secondVersion = '''<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>'''
def extractImageUrl(htmlInput):
    imageUrl = ""
    if "poster" in htmlInput:
        imageUrl = htmlInput.split('poster="')[1].split('"')[0]
    elif "src" in htmlInput:
        imageUrl = htmlInput.split('src="')[1].split('"')[0]
    return imageUrl
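For example, called on the two sample strings above, or on str(url_box) from the original script (the function expects a plain string, not a soup object), it should return the image URL in both cases:
print(extractImageUrl(firstVersion))   # https://example.com/path/to/jpg/random.jpg
print(extractImageUrl(secondVersion))  # https://example.com/path/to/jpg/random.jpg
print(extractImageUrl(str(url_box)))   # same function works on the soup result converted to a string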

Related

How to scrape for <span title>?

I have been trying to scrape indeed.com and, when doing so, I ran into a problem. When scraping for the titles of the positions, on some results I get 'new' because there is a span before the position name labeled 'new'. I have tried researching and trying different things but still haven't gotten anywhere, so I'm asking for help. The position names live within the span title tags, but when I scrape for 'span' I sometimes get 'new' first because it grabs the first span it sees. I have tried to exclude it several ways but haven't had any luck.
Indeed Source Code:
<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class = "label">new</span>
</div>
<span title="Freight Stocker"> Freight Stocker </span>
</h2>
</div>
Code I Tried:
import requests
from bs4 import BeautifulSoup
def extract(page):
    headers = {''}
    url = f'https://www.indeed.com/jobs?l=Bakersfield%2C%20CA&start={page}&vjk=42cee666fbd2fae9'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    divs = soup.find_all('div', class_='heading4 color-text-primary singleLineTitle tapItem-gutter')
    for item in divs:
        res = item.find('span').text
        print(res)
    return

c = extract(0)
transform(c)
Results:
new
Hourly Warehouse Ope
Immediate FT/PT Open
Service Cashier/Rece
new
Cannabis Sales Repreresentative
new
new
new
new
new
You can use a CSS selector, .resultContent span[title], which will select all <span> tags that have a title attribute inside an element with class resultContent.
To use a CSS selector, use the select() method instead of .find():
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tag in soup.select(".resultContent span[title]"):
    print(tag.text)
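If you prefer to keep the original transform() structure, a sketch along the same lines (assuming the same div class as in the question) could be:
def transform(soup):
    divs = soup.find_all('div', class_='heading4 color-text-primary singleLineTitle tapItem-gutter')
    for item in divs:
        # select only the <span> that carries a title attribute, skipping the "new" label span
        title_span = item.select_one('span[title]')
        if title_span is not None:
            print(title_span['title'])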

Python BeautifulSoup unable to find tags under certain levels

I was trying to extract website titles and photo links from this link
I used the code below:
webpage = requests.get(url)
page = bs(webpage.content, "html.parser")
content = page.find("div", {"class": "end__viewer_container"})
However, when I try to extract the HTML contents, I can only print down to this level:
<div id="cont" class= "end __viewer_container" role= "main" style="padding-top: 43px;">
For everything inside/under it, for example:
title = page.find("h3", {"class": "se_textarea"})
photo_link = page.find("div", {"class": "se_viewArea"})
When I try to find them, Python only returns None; I am not able to extract anything under this level.
When I try to print
content = page.find("div", {"class": "end __viewer_container"})
I am able to see the elements that I want to find, but not able to extract them.
Also, I suspect that there is a script under it:
<div class="end __viewer_container" id="cont" role="main">
<!-- 컨텐츠 내용 {{ -->
<script id="__clipContent" type="x-clip-content">
<div id="SEDOC-1613466923574--228885683" class="se_doc_viewer se_body_wrap se_theme_default "
data-docversion="1.0">
I am wondering whether this script is what prevents me from extracting anything inside/under it.
Or is there any other way to extract the title and the photo links?
Thanks
They put the HTML inside a script tag (<script type="x-clip-content" id="__clipContent">...); this is why you can't select the element.
The solution is to get the inner HTML of #__clipContent and then re-parse it with BeautifulSoup:
webpage = requests.get(url)
soup = BeautifulSoup(webpage.text, 'html.parser')
clipContent = soup.select_one('#__clipContent')
innerHTML = clipContent.decode_contents()
newSoup = BeautifulSoup(innerHTML, 'html.parser')
title = newSoup.select_one("h3.se_textarea")
photo_links = newSoup.select("div.se_viewArea img")
print('Title: ', title.text.strip())
for img in photo_links:
    print(img['data-src'])

How to extract the href attribute value from an a tag with Beautiful Soup?

This is the part of the HTML that I am extracting on the platform, and it contains the snippet I want to get: the value of the href attribute of the <a> tag with the class "bookTitle".
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
After logging in using the mechanize library, I have this piece of code to try to extract it, but it returns the name of the book (as the code asks for the text); I have tried several ways to get only the href value, but none have worked so far.
from bs4 import BeautifulSoup as bs4
from requests import Session
from lxml import html
import Downloader as dw
import requests
import mechanize as mc  # br below is a logged-in mechanize Browser created elsewhere (not shown)

def getGenders(browser: mc.Browser, url: str, name: str) -> None:
    res = browser.open(url)
    aux = res.read()
    html2 = bs4(aux, 'html.parser')
    with open(name, "w", encoding='utf-8') as file2:
        file2.write(str(html2))

getGenders(br, "https://www.goodreads.com/shelf/show/art", "gendersBooks.html")

with open("gendersBooks.html", "r", encoding='utf8') as file:
    contents = file.read()

bsObj = bs4(contents, "lxml")
aux = open("books.text", "w", encoding='utf8')
officials = bsObj.find_all('a', {'class': 'booktitle'})
for text in officials:
    print(text.get_text())
    aux.write(text.get_text().format())
aux.close()
file.close()
Can you try this? (Sorry if it doesn't work, I am not at a PC with Python right now.)
for text in officials:
    print(text['href'])
BeautifulSoup works just fine with the HTML code that you provided. If you want to get the text of a tag, you simply use ".text"; if you want to get the href, you use ".get('href')", or, if you are sure the tag has an href value, you can use "['href']".
Here is a simple, easy-to-understand example with your HTML code snippet:
from bs4 import BeautifulSoup
html_code = '''
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
'''
soup = BeautifulSoup(html_code, 'html.parser')
tag = soup.find('a', {'class':'bookTitle'})
# - Book Title -
title = tag.text
print(title)
# - Href Link -
href = tag.get('href')
print(href)
I don't know why you downloaded the HTML, saved it to disk, and then opened it again. If you just want to get some tag values, downloading the HTML, saving it to disk, and then reopening it is totally unnecessary; you can save the HTML to a variable and then pass that variable to BeautifulSoup.
Now, I see that you imported the requests library but used mechanize instead; as far as I know, requests is the easiest and most modern library for getting data from web pages in Python. I also see that you imported Session from requests; a Session is not necessary unless you want to make multiple requests and keep the connection open with the server for faster subsequent requests.
Also, if you open a file with the "with" statement, you are using Python context managers, which handle closing the file, so you don't have to close the file at the end (see the small example below).
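A minimal illustration of that point (the file name here is just a placeholder):
# the file is closed automatically when the with-block exits,
# even if an exception is raised inside it
with open('books.txt', 'w', encoding='utf-8') as out:
    out.write('some text')
# no out.close() needed here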
So, to simplify your code without saving the downloaded HTML to disk, I would write it like this:
from bs4 import BeautifulSoup
import requests

url = 'https://www.goodreads.com/shelf/show/art/gendersBooks.html'
html_source = requests.get(url).content
soup = BeautifulSoup(html_source, 'html.parser')

# - To get the tag that we want -
tag = soup.find('a', {'class': 'bookTitle'})

# - Extract Book Title -
title = tag.text

# - Extract href from Tag -
href = tag.get('href')
Now, if you have multiple "a" tags with the same class name ('a', {'class': 'bookTitle'}), you can do it like this.
Get all the "a" tags first:
a_tags = soup.find_all('a', {'class': 'bookTitle'})
and then scrape each book tag's info and append it to a books list:
books = []
for a in a_tags:
    try:
        title = a.text
        href = a.get('href')
        books.append({'title': title, 'href': href})  # <-- add each book dict to the books list
        print(title)
        print(href)
    except:
        pass
To understand your code better, I advise you to read these related links:
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests:
https://requests.readthedocs.io/en/master/
Python Context Manager:
https://book.pythontips.com/en/latest/context_managers.html
https://effbot.org/zone/python-with-statement.htm

How to extract or scrape data from an HTML page, but from the element itself

Currently I use lxml to parse the HTML document and get the data from the HTML elements, but there is a new challenge: one piece of data, a rating, is stored inside the HTML element itself rather than between tags.
https://i.stack.imgur.com/bwGle.png
<p data-rating="3">
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
</p>
It's easy to extract text between tags, but I have no idea how to extract a value from within the tag itself.
What do you suggest?
Challenge: I want to extract "3".
URL:https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
Br,
Gabriel.
If I understand your question and comments correctly, the following should extract all the rating in that page:
import lxml.html
import requests
BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL)
root = lxml.html.fromstring(html.text)
targets = root.xpath('//p[./span[@class]]/@data-rating')
For example:
targets[0]
Output:
3
Try the script below:
from bs4 import BeautifulSoup
import requests
import re

BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all("div", {"class": "ratings"}):
    # get all children of the tag
    for h in tag.children:
        # convert to string data type
        s = h.encode('utf-8').decode("utf-8")
        # find the tag with data-rating and get the text after the keyword
        m = re.search('(?<=data-rating=)(.*)', s)
        # check if not None
        if m:
            # print the text after data-rating and remove the last char
            print(m.group()[:-1])
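Since the rating is stored as an attribute on the <p> tag itself, another option is to read the attribute directly with BeautifulSoup; a minimal sketch, assuming the same page structure:
from bs4 import BeautifulSoup
import requests

BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")

# select every <p> that carries a data-rating attribute and print its value
for p in soup.find_all("p", attrs={"data-rating": True}):
    print(p["data-rating"])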

Isolate src attribute from soup return in Python

I am using Python3 with BeautifulSoup to get a certain div from a webpage. My end goal is to get the img src's url from within this div so I can pass it to pytesseract to get the text off the image.
The img doesn't have any classes or unique identifiers so I am not sure how to use BeautifulSoup to get just this image every time. There are several other images and their order changes from day to day. So instead, I just got the entire div that surrounds the image. The div information doesn't change and is unique, so my code looks like this:
weather_today = soup.find("div", {"id": "weather_today_content"})
Thus, my script currently returns the following:
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
Now I just need to figure out how to pull just the src into a string so I can then pass it to pytesseract to download the image and use OCR to pull further information.
I am unfamiliar with regex but have been told this is the best method. Any assistance would be greatly appreciated. Thank you.
Find the 'img' element inside the 'div' element you found, then read the 'src' attribute from it.
from bs4 import BeautifulSoup
html ="""
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
weather_today = soup.find("div", {"id": "weather_today_content"})
print (weather_today.find('img')['src'])
Outputs:
/database/img/weather_today.jpg?ver=2018-08-01
You can use a CSS selector, which is built into BeautifulSoup (the select() and select_one() methods):
data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('div#weather_today_content img')['src'])
Prints:
/database/img/weather_today.jpg?ver=2018-08-01
The selector div#weather_today_content img means: select the <div> with id=weather_today_content and, within this <div>, select an <img>.
