I am struggling to figure out which element I need to point Beautiful Soup at in order to scrape the value of the 'amount' tag, which in this code sample is "1,56".
I am pasting below a code excerpt of the webpage I want to scrape:
<td class="line-content">
<span class="html-tag">
<div
<span class="html-attribute-name">
class
</span>
='
<span class="html-attribute-value">
the-price
</span>
'
<span class="html-attribute-name">
style
</span>
='
<span class="html-attribute-value">
margin-top:20px;
</span>
'>
</span>
</td>
</tr>
<tr>
<td class="line-number" value="447">
</td>
<td class="line-content">
<span class="html-tag">
<span
<span class="html-attribute-name">
class
</span>
='
<span class="html-attribute-value">
currency
</span>
'>
</span>
€
<span class="html-tag">
</span>
</span>
<span class="html-tag">
<span
<span class="html-attribute-name">
class
</span>
='
<span class="html-attribute-value">
amount
</span>
'>
</span>
1,56
<span class="html-tag">
</span>
</span>
</td>
</tr>
Would you kindly enlighten me?
I am really grateful for any help.
You can target the amount for example like this (data is your HTML string):
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
span_with_amount = soup.find(lambda tag: tag.name == 'span' and tag.get_text(strip=True) == 'amount')
value = span_with_amount.parent.find_next_sibling(text=True)
print(value.strip())
Prints:
1,56
First we find the <span> whose text is "amount", and then we take the text node that follows the parent of this <span>.
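The markup in the question looks like the browser's view-source rendering of the page rather than the page itself. If the real HTML is what that listing suggests (an assumption; page_html below is a hand-written reconstruction, not the actual page), the value could probably be selected directly by class. A minimal sketch:

```python
from bs4 import BeautifulSoup

# Hypothetical reconstruction of the real markup implied by the view-source excerpt.
page_html = """
<div class='the-price' style='margin-top:20px;'>
  <span class='currency'>€</span>
  <span class='amount'>1,56</span>
</div>
"""

soup = BeautifulSoup(page_html, "html.parser")

# Look for the span with class "amount" inside the price container.
amount_tag = soup.select_one("div.the-price span.amount")
amount = amount_tag.get_text(strip=True) if amount_tag else None
print(amount)
```

The None check is there because select_one returns None when nothing matches, which is what you would see if the live page is structured differently from this guess.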
I can't solve this problem. I want to extract only the 'href' value from the matching <a> tag, but I can't get it to work with my code.
Here is the HTML structure and my code.
[
<li class="first impact"><div class="B_MyAd_"></div>
*<a class="goodsBox-info" href="http://barogo.alba.co.kr/">**
<span class="logo"> <img alt="Baro Inc"src="//imagelogo.alba.kr/data_image2/logo/brand/
20200916174910805.gif"/> </span> <span class="company">Baro.Inc</span> <span class="title"><span>New Rider Recruition</span></span> <span class="wrap"> <span class="local">National</span> <span class="pay"><span class="payLetter">confirm</span> <span class="payIcon talk"></span></span> </span> </a>
*<a class="brandHover" href="http://barogo.alba.co.kr/" </a>*</li>, . ,,,,,,.
]
=================================================================================
browser.get("http://www.alba.co.kr/")
alba = BeautifulSoup(browser.page_source, 'html.parser')
brands = list(alba.find_all('li', {"class" : "impact"}, 'a.goodsBox-info'))
#print(brands)
for b in brands :
    if b('a', {"class" : "brandHover"}) :
        b = b('a')["href"]
        print(b)
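One possible approach, sketched under the assumption that each li.impact item contains an a.brandHover link as in the excerpt above: select the anchors with a CSS selector and read href from each individual tag. The html string here is a trimmed, hand-made stand-in for browser.page_source.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for browser.page_source (assumption: same structure as the excerpt).
html = """
<ul>
  <li class="first impact">
    <a class="goodsBox-info" href="http://barogo.alba.co.kr/"><span class="company">Baro.Inc</span></a>
    <a class="brandHover" href="http://barogo.alba.co.kr/"></a>
  </li>
</ul>
"""

alba = BeautifulSoup(html, "html.parser")

# One href per item: the a.brandHover link inside each li.impact.
links = [a["href"] for a in alba.select("li.impact a.brandHover[href]")]
print(links)
```

Note that b('a') returns a list-like ResultSet, so indexing it with "href" fails; the attribute has to be read from a single tag.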
I need to create a dictionary in Python that pairs each 'data-value' with its size (for example, US: 6Y).
Page source code looks like this:
<div data-header-element=".b-userMenu" data-sticky-header="true" id="js-size-pick">
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem js-tooltipHtml js-tooltip_rm" data-carturl="/cart/add?id=545896443" data-tip=" <span> US: 3,5Y </span>
<span> EU: 35,5 </span>
<span> CM: 22,5 </span>
" data-value="545896443">
35,5
</a>
<span class="js-tooltipContent g-dn">
<span> US: 3,5Y </span>
<span> EU: 35,5 </span>
<span> CM: 22,5 </span>
</span>
</p>
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem js-tooltipHtml js-tooltip_rm" data-carturl="/cart/add?id=545895979" data-tip=" <span> US: 4Y </span>
<span> EU: 36 </span>
<span> CM: 23 </span>
" data-value="545895979">
36
</a>
<span class="js-tooltipContent g-dn">
<span> US: 4Y </span>
<span> EU: 36 </span>
<span> CM: 23 </span>
</span>
</p>
Do you have any idea how to solve this? I tried looping over the elements with class="m-productDescr_sizeItem".
You'll have to iterate through the span tags. Keep in mind with dictionaries, you can't have duplicate keys. So I created a list of dictionaries here since there are duplicate keys:
html = '''<div data-header-element=".b-userMenu" data-sticky-header="true" id="js-size-pick">
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem js-tooltipHtml js-tooltip_rm" data-carturl="/cart/add?id=545896443" data-tip=" <span> US: 3,5Y </span>
<span> EU: 35,5 </span>
<span> CM: 22,5 </span>
" data-value="545896443">
35,5
</a>
<span class="js-tooltipContent g-dn">
<span> US: 3,5Y </span>
<span> EU: 35,5 </span>
<span> CM: 22,5 </span>
</span>
</p>
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem js-tooltipHtml js-tooltip_rm" data-carturl="/cart/add?id=545895979" data-tip=" <span> US: 4Y </span>
<span> EU: 36 </span>
<span> CM: 23 </span>
" data-value="545895979">
36
</a>
<span class="js-tooltipContent g-dn">
<span> US: 4Y </span>
<span> EU: 36 </span>
<span> CM: 23 </span>
</span>
</p>'''
Given that html:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span',{'class':'js-tooltipContent g-dn'})
dict_list = []
for span in spans:
    alpha = span.find_all('span')
    id_temp = span.parent()[0]['data-value']
    for each in alpha:
        temp_dict = {}
        values = each.text.strip().split(':')
        k = values[0].strip()
        v = values[1].strip()
        temp_dict.update({'size':v, 'id':id_temp})
        dict_list.append(temp_dict)
Output:
print (dict_list)
[{'size': '3,5Y', 'id': '545896443'}, {'size': '35,5', 'id': '545896443'}, {'size': '22,5', 'id': '545896443'}, {'size': '4Y', 'id': '545895979'}, {'size': '36', 'id': '545895979'}, {'size': '23', 'id': '545895979'}]
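If a single dictionary keyed by data-value is preferred, the sizes could be nested per id instead. This is only a sketch on a shortened copy of the markup (the data-tip and data-carturl attributes are dropped here for brevity); sizes_by_id is a name made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """
<div id="js-size-pick">
  <p class="m-productDescr_sizeItem">
    <a class="m-productDescr_sizeBtn js-sizeItem" data-value="545896443">35,5</a>
    <span class="js-tooltipContent g-dn">
      <span> US: 3,5Y </span>
      <span> EU: 35,5 </span>
      <span> CM: 22,5 </span>
    </span>
  </p>
  <p class="m-productDescr_sizeItem">
    <a class="m-productDescr_sizeBtn js-sizeItem" data-value="545895979">36</a>
    <span class="js-tooltipContent g-dn">
      <span> US: 4Y </span>
      <span> EU: 36 </span>
      <span> CM: 23 </span>
    </span>
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

sizes_by_id = {}
for item in soup.select("p.m-productDescr_sizeItem"):
    button = item.select_one("a[data-value]")
    sizes = {}
    # Each inner span reads like "US: 3,5Y"; split once on the colon.
    for span in item.select("span.js-tooltipContent > span"):
        system, _, size = span.get_text(strip=True).partition(":")
        sizes[system.strip()] = size.strip()
    sizes_by_id[button["data-value"]] = sizes

print(sizes_by_id)
```

Because each data-value appears once, it works as a unique key, and the US/EU/CM values sit side by side under it.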
I have a piece of HTML code below:
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
I want to extract "Portugal" from it. Note that the span class is dynamic: it is not always class="country-flag-small flag-113" but changes with the country generated for this div block.
To get the player1 and 1357, I am using the following cumbersome code:
player1info = soup.findAll('div', attrs={'class':'user-tagline'})[0].text.split("\n")
player1 = player1info[1]
pscore1 = player1info[2].replace('(','').replace(')', '')
I would appreciate it if someone could share a better solution. Thank you in advance.
UPDATE:
Now that the initial div info is extracted, I would like to extend this to the entire row. Here is the row:
<tr board-popover="" fen="r1bk2r1/1p2n3/pN6/1B1qQp2/P2Pp2p/1P6/2P2PPP/R3K1R1 b Q -" flip-board="1" highlight-squares="c4b6">
<td>
<a class="clickable-link td-user" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<span class="time-control">
<i class="icon-rapid">
</i>
</span>
<div class="user-tagline ">
<span class="username " data-avatar="https://betacssjs.chesscomfiles.com/bundles/web/images/noavatar_l.1c5172d5.gif" data-country="Portugal" data-enabled="true" data-flag="113" data-joined="Joined Jun 19, 2016" data-logged="Online 6 hrs ago" data-membership="basic" data-name="Atikinounette" data-popup="hover" data-title="" data-username="Atikinounette">
Atikinounette
</span>
<span class="user-rating">
(1357)
</span>
<span class="country-flag-small flag-113" tip="Portugal">
</span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="https://images.chesscomfiles.com/uploads/v1/user/28196414.83e31ff1.50x50o.3a6f77e4aa44.jpeg" data-country="Indonesia" data-enabled="true" data-flag="70" data-joined="Joined May 15, 2016" data-logged="Online Nov 7, 2017" data-membership="basic" data-name="belemnarmada" data-popup="hover" data-title="" data-username="belemnarmada">
belemnarmada
</span>
<span class="user-rating">
(1387)
</span>
<span class="country-flag-small flag-70" tip="Indonesia">
</span>
</div>
</a>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">
1
</span>
<span class="game-result">
0
</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost">
</i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
30 min
</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
25
</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
Aug 9, 2017
</a>
</td>
<td class="text-center miniboard">
<input class="checkbox" game-checkbox="" game-id="2249663029" game-is-live="true" ng-model="model.gameIds[2249663029].checked" type="checkbox"/>
</td>
</tr>
The information needed is:
player's info (answer provided by #balderman already got that)
game-result (1, 0)
playing time (30 min in this row)
total moves (25)
playing date (Aug 9, 2017)
Thank you so much.
How about the code below?
The idea is that the user attributes are the three spans under each div, so the code finds those spans and extracts the data.
from bs4 import BeautifulSoup
html = '''<html><body> <div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div></body></html>'''
soup = BeautifulSoup(html, 'html.parser')
users = soup.findAll('div', attrs={'class': 'user-tagline'})
for user in users:
    # The three spans hold, in order: user name, rating, country flag.
    user_properties = user.findAll('span')
    for idx, prop in enumerate(user_properties):
        if idx == 0:
            print('user name: {}'.format(prop.text))
        elif idx == 1:
            print('user rating: {}'.format(prop.text))
        elif idx == 2:
            print('user country: {}'.format(prop.attrs['tip']))
Output
user name: player1
user rating: (1357)
user country: Portugal
user name: player2
user rating: (1387)
user country: Indonesia
This is a more readable solution:
div1 = soup.select("div.user-tagline")[0]
player1 = div1.select_one("span.username").text
pscore1 = div1.select_one("span.user-rating").text.strip("()")
country1 = div1.select_one("span.country-flag-small")["tip"]
To extract the data from all divs, loop over the result of select() instead of taking index 0.
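A sketch of that loop, using the two-player snippet from the question. The players list and its dictionary keys are names chosen for this example only.

```python
from bs4 import BeautifulSoup

html = """
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

players = []
for div in soup.select("div.user-tagline"):
    players.append({
        "name": div.select_one("span.username").get_text(strip=True),
        # Drop the surrounding parentheses and convert to int.
        "rating": int(div.select_one("span.user-rating").get_text(strip=True).strip("()")),
        # The flag-NNN class changes per country, so match only the stable class.
        "country": div.select_one("span.country-flag-small")["tip"],
    })

print(players)
```

Matching on country-flag-small alone sidesteps the dynamic flag-113 / flag-70 part of the class.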
If you are interested only in the first div, you can go with this:
res = bsobj.find('div', {'class':'user-tagline'}).findAll('span')
print(res[0].text, res[1].text, res[2]['tip'])
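For the extra fields asked about in the UPDATE (game result, playing time, moves, date), something along these lines might work. It is a sketch against a trimmed copy of the row: the player cell is left out and the href values are replaced with "#", and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table row from the UPDATE (player cell omitted, hrefs shortened).
row_html = """
<table><tr>
<td><a class="clickable-link text-middle" href="#">
<div class="pull-left">
<span class="game-result"> 1 </span>
<span class="game-result"> 0 </span>
</div>
</a></td>
<td class="text-center"><a class="clickable-link" href="#"> 30 min </a></td>
<td class="text-right"><a class="clickable-link text-middle moves" href="#"> 25 </a></td>
<td class="text-right miniboard"><a class="clickable-link archive-date" href="#"> Aug 9, 2017 </a></td>
<td class="text-center miniboard"><input class="checkbox" type="checkbox"/></td>
</tr></table>
"""

soup = BeautifulSoup(row_html, "html.parser")
row = soup.find("tr")

results = [s.get_text(strip=True) for s in row.select("span.game-result")]
# The time link has no class of its own, so reach it through its td.
playing_time = row.select_one("td.text-center > a.clickable-link").get_text(strip=True)
moves = row.select_one("a.moves").get_text(strip=True)
played_on = row.select_one("a.archive-date").get_text(strip=True)

print(results, playing_time, moves, played_on)
```

The moves and date links carry their own classes (moves, archive-date), so those two are the easiest to target; the time cell relies on being the td.text-center that contains a link.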
Please have a look at the following HTML code:
<section class = "products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span> </span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span> </span>
</span>
</section>
In the products section there are 40 such blocks containing item prices. Every product has a current price, but not every product has an old price. When I try to read the item prices I also get the old prices, so I end up with 69 prices instead of 40. I am missing something, but since I am new to this I can't figure it out. Could someone please help? Thanks.
You can use a CSS selector to match the exact class name. For example, here, you can use span[class="price "] as the selector, and it won't match the old prices.
html = '''
<section class = "products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span>
</span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span>
</span>
</span>
</section>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for price in soup.select('span[class="price "]'):
    print(price.get_text(' ', strip=True))
Output:
Rs. 5,999
Or, you could also use a custom function to match the class name.
for price in soup.find_all('span', class_=lambda c: c == 'price '):
    print(price.get_text(' ', strip=True))
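Both variants above compare against the literal attribute string "price ", trailing space included, and whether that string survives parsing may depend on the Beautiful Soup version installed. A sketch that avoids relying on it tests class membership instead; is_current_price is a helper name invented for this example.

```python
from bs4 import BeautifulSoup

html = """
<section class="products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span>
</span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span>
</span>
</span>
</section>
"""

soup = BeautifulSoup(html, "html.parser")


def is_current_price(tag):
    """True for <span> tags that have class 'price' but not '-old'."""
    classes = tag.get("class") or []
    return tag.name == "span" and "price" in classes and "-old" not in classes


current_prices = [t.get_text(" ", strip=True) for t in soup.find_all(is_current_price)]
print(current_prices)
```

Since only the spans without -old are kept, 40 products should give 40 prices, assuming the live page follows the same class naming.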
I have created this for loop to find td items that start with 'td_threadtitle':
for item in posts:
    hello = item.find("td", {"id": lambda L: L and L.startswith('td_threadtitle')})
    print(hello)
But I get this error:
hello = item.find("td", {"id": lambda L: L and L.startswith('td_threadtitle')})
TypeError: slice indices must be integers or None or have an __index__ method
When I change the variable hello to this:
hello = item.find("td") , it works perfectly fine. Why does it throw that error when I try to specify the id?
EDIT:
This is how I created posts:
tableWithPosts = soup.find("body").find("div", attrs = {"align": "center"}).find("div", {"class" : "page"}).find("div", attrs = {"style" : "padding:0px 0px 0px 0px"}).find("center").find("form").find("table", {"id": "threadslist"})
posts = tableWithPosts.find("tbody", {"id": "threadbits_forum_75"})
Here is a portion of posts:
</a>
)
</span>
</div>
<div class="smallfont">
<span onclick="window.open('member.php?s=625e629b088a68126ca2d867c056b363&u=206824', '_self')" style="cursor:pointer">
thelavenhagen
</span>
</div>
</td>
<td class="alt2" title="Replies: 11, Views: 1,471">
<div class="smallfont" style="text-align:right; white-space:nowrap">
Thu, May-25-2017
<span class="time">
05:06:46 AM
</span>
<br/>
by
<a href="member.php?s=625e629b088a68126ca2d867c056b363&find=lastposter&t=581132" rel="nofollow">
westopher
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&p=1067660274#post1067660274">
<img alt="Go to last post" border="0" class="inlineimg" src="images/buttons/lastpost.gif"/>
</a>
</div>
</td>
<td align="center" class="alt1">
<a href="misc.php?do=whoposted&t=581132" onclick="who(581132); return false;">
11
</a>
</td>
<td align="center" class="alt2">
1,471
</td>
</tr>
<tr>
<td class="alt1" id="td_threadstatusicon_558556">
<img alt="" border="" id="thread_statusicon_558556" src="images/statusicon/thread_hot.gif"/>
</td>
<td class="alt2">
<img alt="" border="0" src="images/icons/icon1.gif"/>
</td>
<td class="alt1" id="td_threadtitle_558556" title="1996 E36 M3 Lux Dakar Yellow, 87,800 miles, special order without sunroof. Second owner, owned...">
<div>
<span style="float:right">
<a href="#" onclick="attachments(558556); return false">
<img alt="4 Attachment(s)" border="0" class="inlineimg" src="images/misc/paperclip.gif"/>
</a>
</span>
<span style="color: blue">
<b>
<u>
FS:
</u>
</b>
</span>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556" id="thread_title_558556">
1996 E36 M3 - Dakar Lux Slicktop
</a>
<span class="smallfont" style="white-space:nowrap">
(
<img alt="Multi-page thread" border="0" class="inlineimg" src="images/misc/multipage.gif"/>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556">
1
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556&page=2">
2
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556&page=3">
3
</a>
)
</span>
</div>
<div class="smallfont">
<span onclick="window.open('member.php?s=625e629b088a68126ca2d867c056b363&u=95931', '_self')" style="cursor:pointer">
yellowbee
</span>
</div>
</td>
<td class="alt2" title="Replies: 23, Views: 5,147">
<div class="smallfont" style="text-align:right; white-space:nowrap">
Thu, May-25-2017
<span class="time">
04:04:07 AM
</span>
<br/>
by
<a href="member.php?s=625e629b088a68126ca2d867c056b363&find=lastposter&t=558556" rel="nofollow">
mbausa
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&p=1067660244#post1067660244">
<img alt="Go to last post" border="0" class="inlineimg" src="images/buttons/lastpost.gif"/>
</a>
</div>
</td>
<td align="center" class="alt1">
<a href="misc.php?do=whoposted&t=558556" onclick="who(558556); return false;">
23
</a>
</td>
<td align="center" class="alt2">
5,147
</td>
</tr>
<tr>
<td class="alt1" id="td_threadstatusicon_580693">
<img alt="" border="" id="thread_statusicon_580693" src="images/statusicon/thread_hot.gif"/>
</td>
<td class="alt2">
<img alt="" border="0" src="images/icons/icon1.gif"/>
</td>
<td class="alt1" id="td_threadtitle_580693" title="Selling my wife's car. We have owned her for two years and have put over 20k hassle free miles on...">
<div>
<span style="color: blue">
<b>
<u>
FS:
</u>
</b>
</span>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=580693" id="thread_title_580693">
2011 BMW 740Li Alpine White M Package Dakota Brown Interior Weather-tech Mats
</a>
</div>
<div class="smallfont">
<span onclick="window.open('member.php?s=625e629b088a68126ca2d867c056b363&u=128641', '_self')" style="cursor:pointer">
911-AL
</span>
</div>
</td>
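Remove your for loop and try this instead:
hello = posts.find_all("td", {"id": lambda L: L and L.startswith('td_threadtitle')})
hello
This finds all td elements whose id starts with 'td_threadtitle'.
The error comes from the loop: iterating over posts also yields the text nodes between the tags (NavigableString objects), and on those, find is the plain str.find, whose second argument must be an integer. That is why item.find("td") works but passing a dict raises the TypeError.
hello will be a list of all the td elements (objects of <class 'bs4.element.Tag'>) whose id starts with 'td_threadtitle', and you can still access each one's div.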
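Here is a small self-contained sketch of the same idea. The table is a hand-made stand-in for the forum markup (ids kept, contents and one title shortened), and it also prints the node types you get when iterating over posts directly.

```python
from bs4 import BeautifulSoup

# Small hand-made stand-in for the threads table in the question.
html = """
<table id="threadslist"><tbody id="threadbits_forum_75">
<tr>
<td class="alt1" id="td_threadstatusicon_558556"></td>
<td class="alt1" id="td_threadtitle_558556"><div><a id="thread_title_558556">1996 E36 M3 - Dakar Lux Slicktop</a></div></td>
</tr>
<tr>
<td class="alt1" id="td_threadstatusicon_580693"></td>
<td class="alt1" id="td_threadtitle_580693"><div><a id="thread_title_580693">2011 BMW 740Li</a></div></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
posts = soup.find("tbody", {"id": "threadbits_forum_75"})

# Search the whole tbody at once; the lambda guards against tags with no id.
title_cells = posts.find_all("td", id=lambda value: value and value.startswith("td_threadtitle"))
titles = [cell.find("a").get_text(strip=True) for cell in title_cells]
print(titles)

# Iterating directly over posts yields text nodes as well as tags.
child_types = {type(child).__name__ for child in posts}
print(child_types)
```

A CSS attribute selector such as posts.select('td[id^="td_threadtitle"]') should be an equivalent way to express the prefix match.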