How to get link and title using Beautiful Soup 4

How to get link and title using Beautiful Soup 4 - python

html=
"""<div class="slick-list"><div class="slick-track" style="width: 1380px; opacity: 1; transform: translate3d(0px, 0px, 0px);"><div data-index="0" class="slick-slide slick-active slick-current" tabindex="-1" aria-hidden="false" style="outline: none; width: 230px;"><div><div data-courseid="567828" class="course-discovery-unit--card-margin--2TVw4 merchandising-course-card--card--2UfMa"><a href="/course/complete-python-bootcamp/" data-purpose="merchandising-course-card-body-567828" target="_self" class="merchandising-course-card--mask--2-b-d"><div class="merchandising-course-card--card-header--89z8L"><img class="merchandising-course-card--course-image--3G7Kh" alt="" width="240" height="135" src="https://img-a.udemycdn.com/course/240x135/567828_67d0.jpg" srcset="https://img-a.udemycdn.com/course/240x135/567828_67d0.jpg 1x, https://img-a.udemycdn.com/course/480x270/567828_67d0.jpg 2x"></div><div class="merchandising-course-card--card-body--3OpAH"><div><div class="merchandising-course-card--course-title--2Ob4m" data-purpose="course-card-title">Complete Python Bootcamp: Go from zero to hero in Python 3</div>"""
I want to extract link and title output:
title=Complete Python Bootcamp: Go from zero to hero in Python 3
link=/course/complete-python-bootcamp/
Here is my code:
data=soup.findAll("div",{"class":"slick-list"})
print(data)
for link in data:
for a in link.findAll("a"):
print(a.title,a.href)

from bs4 import BeautifulSoup
html="""<div class="slick-list"><div class="slick-track" style="width: 1380px; opacity: 1; transform: translate3d(0px, 0px, 0px);"><div data-index="0" class="slick-slide slick-active slick-current" tabindex="-1" aria-hidden="false" style="outline: none; width: 230px;"><div><div data-courseid="567828" class="course-discovery-unit--card-margin--2TVw4 merchandising-course-card--card--2UfMa"><a href="/course/complete-python-bootcamp/" data-purpose="merchandising-course-card-body-567828" target="_self" class="merchandising-course-card--mask--2-b-d"><div class="merchandising-course-card--card-header--89z8L"><img class="merchandising-course-card--course-image--3G7Kh" alt="" width="240" height="135" src="https://img-a.udemycdn.com/course/240x135/567828_67d0.jpg" srcset="https://img-a.udemycdn.com/course/240x135/567828_67d0.jpg 1x, https://img-a.udemycdn.com/course/480x270/567828_67d0.jpg 2x"></div><div class="merchandising-course-card--card-body--3OpAH"><div><div class="merchandising-course-card--course-title--2Ob4m" data-purpose="course-card-title">Complete Python Bootcamp: Go from zero to hero in Python 3</div>"""
soup = BeautifulSoup(html, 'html.parser')
print('title='+soup.find("div",{"data-purpose":"course-card-title"}).text)
print('link='+soup.find("a").get('href'))
I hope this answers your question.

I working solution based on your code (and using findAll):
from bs4 import BeautifulSoup
html= """<div class="slick-list"><div class="slick-track" style="width: 1380px; opacity: 1; transform: translate3d(0px, 0px, 0px);"><div data-index="0" class="slick-slide slick-active slick-current" tabindex="-1" aria-hidden="false" style="outline: none; width: 230px;"><div><div data-courseid="567828" class="course-discovery-unit--card-margin--2TVw4 merchandising-course-card--card--2UfMa"><a href="/course/complete-python-bootcamp/" data-purpose="merchandising-course-card-body-567828" target="_self" class="merchandising-course-card--mask--2-b-d"><div class="merchandising-course-card--card-header--89z8L"><img class="merchandising-course-card--course-image--3G7Kh" alt="" width="240" height="135" src="https://img-a.udemycdn.com/course/240x135/567828_67d0.jpg" srcset="https://img-a.udemycdn.com/course/240x135/567828_67d0.jpg 1x, https://img-a.udemycdn.com/course/480x270/567828_67d0.jpg 2x"></div><div class="merchandising-course-card--card-body--3OpAH"><div><div class="merchandising-course-card--course-title--2Ob4m" data-purpose="course-card-title">Complete Python Bootcamp: Go from zero to hero in Python 3</div>"""
soup = BeautifulSoup(html, 'html.parser')
data=soup.findAll("div",{"class":"slick-list"})
#print(data)
for div in data:
for a in div.findAll("a"):
print(div.text,a.get('href'))

Related

How do I get a word through Selenium?

I'd like to extract and use the red alphabet of the code below through 'Selenium', so please give me some advice on how to do it
The alphabet changes randomly on every try
<td>
<input type="text" name="WKey" id="As_wkey" value="" maxlength="10" class="inputType" style="width: 300px;" title="password" />
<span id="myspam" style="padding: 2px;">
<span style="font-size: 12pt; font-weight: bold; color: red;">H</span>123
<span style="font-size: 12pt; font-weight: bold; color: red;">R</span>
<span style="font-size: 12pt; font-weight: bold; color: red;">8</span>6789
</span>
(type red word.)
</td>
here is my code
red_characters_element = driver.find_element(By.ID, 'myspam')
red_characters_elements = red_characters_element.find_elements(by = By.CSS_SELECTOR, value="span[style='font-size: 12pt; font-weight: bold; color: red;']")
print(red_characters_elements)
result []

Given all the Red colored alphabets are inside of the <span> tag. You can retrieve it using tag.
red_characters_element = driver.find_element(By.ID, 'myspam')
red_characters_elements = red_characters_element.find_elements(By.TAG_NAME, 'span')
for red_character in red_characters_elements:
print(red_character.text)
Results :
H
R
8

If you need only red letters, you can try using the java script inside the selenium:
driver.execute_script('return document.querySelectorAll("[style*=red]")')
You get an array of objects where the style has the attribute "red", with the for loop you can get the values or anything else

Extracting text from generic span tag nested within multiple div tags

I need to grab the text (shown as a date) in the following piece of html code. The code a1v2v3 changes depending on the page, so I cannot use that as a reference or use a css selector.
Relevant HTML:
<div class="mvp-collapse-content-box">
<div data-v-a1v2v3="" class="mvp-row"><div data-v-a1v2v3="" class="mvp-tag mvp-tag-default mvp-tag-checked" style="margin-left: -16px; visibility: hidden;">
<!----> <span class="mvp-tag-text">LIVE</span>
<!----></div><span data-v-a1v2v3="" style="display: inline-block;">
2019.06.12 17:09
<br data-v-a1v2v3="">
Full HTML:
<div class="mvp-collapse-content-box">
<div data-v-a1v2v3="">
<div data-v-a1v2v3="" class="mvp-collapse mvp-collapse-simple">
<div data-v-a1v2v3="" class="mvp-collapse-item mvp-collapse-item-active" style="padding-left: 6px;">
<div class="mvp-collapse-header">
<!----> <div data-v-a1v2v3="" class="mvp-row-flex mvp-row-flex-middle">
<div data-v-a1v2v3="" class="mvp-col mvp-col-span-18 mvp-col-span-xs-16 mvp-col-span-sm-18 mvp-col-span-md-18 mvp-col-span-lg-18"><div data-v-a1v2v3="" class="mvp-tag mvp-tag-blue mvp-tag-checked" style="margin-left: -16px;">
<!----> <span class="mvp-tag-text mvp-tag-color-white">LIVE</span>
<!----></div><div data-v-a1v2v3="" class="versionAndMemo">
<span data-v-a1v2v3="" style="display: inline-block; line-height: 26px; vertical-align: middle; margin: 0px 1px; font-weight: bold; font-size: 14px;">1.2.3.44</span>
<!----></div></div>
<div data-v-a1v2v3="" class="mvp-col mvp-col-span-6 mvp-col-span-xs-8 mvp-col-span-sm-6 mvp-col-span-md-6 mvp-col-span-lg-6"><div data-v-a1v2v3="" style="display: inline-block; float: right; margin-right: 6px;"><i data-v-a1v2v3="" class="lal la-download" style="font-size: 1.8em; margin-top: 8px; display: table-cell; vertical-align: middle;"></i></div>
<div data-v-a1v2v3="" style="float: right; margin-right: 22px;"><i data-v-a1v2v3="" class="lal la-link" style="font-size: 1.8em; margin-top: 8px; display: table-cell; vertical-align: middle;"></i></div></div></div></div> <div class="mvp-collapse-content" style="" data-old-padding-top="" data-old-padding-bottom="" data-old-overflow="">
<div class="mvp-collapse-content-box">
<div data-v-a1v2v3="" class="mvp-row"><div data-v-a1v2v3="" class="mvp-tag mvp-tag-default mvp-tag-checked" style="margin-left: -16px; visibility: hidden;">
<!----> <span class="mvp-tag-text">LIVE</span>
<!----></div><span data-v-a1v2v3="" style="display: inline-block;">
2019.06.12 17:09
<br data-v-a1v2v3="">
Here is what I have so far:
page = requests.get(app, headers=headers, cookies=cookies).text
soup = BeautifulSoup(page, 'html.parser')
for spantime in soup.findAll("div", {"class": "mvp-collapse-content-box"}):
print(spantime)
But nothing is being printed. I have also tried adding the following:
page = requests.get(app, headers=headers, cookies=cookies).text
soup = BeautifulSoup(page, 'html.parser')
for spantime in soup.findAll("div", {"class": "mvp-collapse-content-box"}):
print(spantime.text)
for span in spantime.find_all('span', recursive=True):
print(span.text)
But the neither of them prints anything. I have a feeling that it might have something to do with the mvp-collapse-content-box class that I've used - some of the div tags with that class do not necessarily have span tags, as shown in the Full HTML.

Use find_next() to find the span tag and use text property.
from bs4 import BeautifulSoup
html='''<div class="mvp-collapse-content-box">
<div data-v-a1v2v3="" class="mvp-row"><div data-v-a1v2v3="" class="mvp-tag mvp-tag-default mvp-tag-checked" style="margin-left: -16px; visibility: hidden;">
<!----> <span class="mvp-tag-text">LIVE</span>
<!----></div><span data-v-a1v2v3="" style="display: inline-block;">
2019.06.12 17:09
<br data-v-a1v2v3="">'''
soup=BeautifulSoup(html,'html.parser')
div=soup.find('div',class_="mvp-collapse-content-box")
print(div.find_next('span').find_next('span').text.strip())
Output:
2019.06.12 17:09

You can always select a child tag like this:
div = soup.find("div", { "class" : "mvp-collapse-content-box" })
spans = div.findChildren("span" , recursive=False)
for span in spans:
print span

Python.'NoneType' object has no attribute 'b'

I am getting an error when i want to get coupon code using beautifulsoap
This is a part of page:
<ul class="nc-nav__promo-modal--global-links"><div class="nc-nav__promo-modal--global-divider"></div>
<li><div><span style="font-weight: 500; top: 0">35% off our favorite wear-now styles.* Online only. Use code <b style="font-weight: 700; top: 0">HISUMMER.</b></span></div><button type="button" class="nc-nav__promo-modal--global-details-button" aria-describedby="dialogDetailsBtn-0">Details</button></li>
<li><div><span style="font-weight: 500; top: 0">35% off our favorite wear-now styles.* Online only. Use code <b style="font-weight: 700; top: 0">MAY20.</b></span></div><button type="button" class="nc-nav__promo-modal--global-details-button" aria-describedby="dialogDetailsBtn-0">Details</button></li>
</ul>
This is my Code:
def parse(self, response):
self.mongo.GetAllDocuments()
soup = BeautifulSoup(response.text, 'html.parser')
url,off,coupon,itemtype = "","","",""
containersC=soup.select(".nc-nav__promo-modal--global-links > li")
for itemC in containersC:
coupon = itemC.a.div.span.b.text
I am getting the following error:
AttributeError: 'NoneType' object has no attribute 'b'

Your code is assuming that the structure of the HTML is the same for all instances. If the b (or any other element) is missing, you will get that error. One approach would be to first test for the presence of a b tag before attempting to print it, for example:
from bs4 import BeautifulSoup
html = """ <ul class="nc-nav__promo-modal--global-links"><div class="nc-nav__promo-modal--global-divider"></div>
<li><div><span style="font-weight: 500; top: 0">35% off our favorite wear-now styles.* Online only. Use code <b style="font-weight: 700; top: 0">HISUMMER.</b></span></div><button type="button" class="nc-nav__promo-modal--global-details-button" aria-describedby="dialogDetailsBtn-0">Details</button></li>
<li><div><span style="font-weight: 500; top: 0">35% off our favorite wear-now styles.* Online only. Use code <b style="font-weight: 700; top: 0">MAY20.</b></span></div><button type="button" class="nc-nav__promo-modal--global-details-button" aria-describedby="dialogDetailsBtn-0">Details</button></li>
</ul>"""
soup = BeautifulSoup(html, 'html.parser')
for li_tag in soup.select(".nc-nav__promo-modal--global-links > li"):
b_tag = li_tag.find('b')
if b_tag:
print(b_tag.text)
For your HTML, this gives:
HISUMMER.
MAY20.

compute the average height and the average width of div tag

I have need to get the average div height and width of an html doc.
I have try this solution but it doesn't work:
import numpy as np
average_width = np.mean([div.attrs['width'] for div in my_doc.get_div() if 'width' in div.attrs])
average_height = np.mean([div.attrs['height'] for div in my_doc.get_div() if 'height' in div.attrs])
print average_height,average_width
the get_div method return the list of all div retrieved by the find_all method of beautifulSoup
here is an example :
print my_doc.get_div()[1]
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;">
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">Journal of Infection (2015)
</span>
<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">xx</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">, 1</span>
<span style="font-family: EICMDD+AdvPS44A44B; font-size:7px">e</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">4
<br/>
</span>
</div>
when i get the attributes, it works perfectly
print my_doc.get_div()[1].attrs
{u'style': u'position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;'}
but when i try to get the value
print my_doc.get_div()[1].attrs['width']
I get an error :
KeyError: 'width'
but i don't understand because when i check the type :
print type(my_doc.get_div()[1].attrs)
it's a dictionary , <type 'dict'>

There may be better way-
Way -1
Below is my tested code to extract width and height.
from bs4 import BeautifulSoup
html_doc = '''<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;">
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">Journal of Infection (2015)
</span>
<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">xx</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">, 1</span>
<span style="font-family: EICMDD+AdvPS44A44B; font-size:7px">e</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">4
<br/>
</span>
</div>'''
soup = BeautifulSoup(html_doc,'html.parser')
my_att = [i.attrs['style'] for i in soup.find_all("div")]
dd = ''.join(my_att).split(";")
dd_cln= filter(None, dd)
dd_cln= [i.strip() for i in dd_cln ]
my_dict = dict(i.split(':') for i in dd_cln)
print my_dict['width']
Way-2
Use regular expression as described here.
Working code-
import numpy as np
import re
from bs4 import BeautifulSoup
html_doc = '''<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;">
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">Journal of Infection (2015)
</span>
<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">xx</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">, 1</span>
<span style="font-family: EICMDD+AdvPS44A44B; font-size:7px">e</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">4
<br/>
</span>
</div>'''
soup = BeautifulSoup(html_doc,'html.parser')
my_att = [i.attrs['style'] for i in soup.find_all("div")]
css = ''.join(my_att)
print css
width_list = map(float,re.findall(r'(?<=width:)(\d+)(?=px;)', css))
height_list = map(float,re.findall(r'(?<=height:)(\d+)(?=px;)', css))
print np.mean(height_list)
print np.mean(width_list)

Locating tags via styles - using Python 2 and BeautifulSoup 4

I am trying to use BeautifulSoup 4 to extract text from specific tags in an HTML Document. I have HTML that has a bunch of div tags like the following:
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;">
<span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">
Futures Daily Market Report for Financial Gas
<br/>
21-Jul-2015
<br/>
</span>
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:54px; top:135px; width:46px; height:10px;">
<span style="font-family: FIPXQM+Arial-BoldMT; font-size:10px">
COMMODITY
<br/>
</span>
</div>
I am trying to get the text from all span tags that are in any div tag that has a style of "left:54px".
I can get a single div if i use:
soup = BeautifulSoup(open(extracted_html_file))
print soup.find_all('div',attrs={"style":"position:absolute; border: textbox 1px solid; "
"writing-mode:lr-tb; left:42px; top:90px; "
"width:195px; height:24px;"})
It returns:
[<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;"><span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">Futures Daily Market Report for Financial Gas
<br/>21-Jul-2015
<br/></span></div>]
But that only gets me the one div that exactly matches that styling. I want all divs that match only the "left:54px" style.
To do this, I've tried a few different ways:
soup = BeautifulSoup(open(extracted_html_file))
print soup.find_all('div',style='left:54px')
print soup.find_all('div',attrs={"style":"left:54px"})
print soup.find_all('div',attrs={"left":"54px"})
But all these print statements return empty lists.
Any Ideas?

You can pass in a regular expression instead of a string according to the documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
So I would try this:
import re
soup = BeautifulSoup(open(extracted_html_file))
soup.find_all('div', style = re.compile('left:54px'))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to get link and title using Beautiful Soup 4 - python

Related

How do I get a word through Selenium?

Extracting text from generic span tag nested within multiple div tags

Python.'NoneType' object has no attribute 'b'

compute the average height and the average width of div tag

Locating tags via styles - using Python 2 and BeautifulSoup 4

Categories

Resources