I need to create a dictionary in Python that contains 'data-value': '(each size, e.g. US: 6Y)' pairs.
The page source looks like this:
<div data-header-element=".b-userMenu" data-sticky-header="true" id="js-size-pick">
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem js-tooltipHtml js-tooltip_rm" data-carturl="/cart/add?id=545896443" data-tip=" <span> US: 3,5Y </span>
<span> EU: 35,5 </span>
<span> CM: 22,5 </span>
" data-value="545896443">
35,5
</a>
<span class="js-tooltipContent g-dn">
<span> US: 3,5Y </span>
<span> EU: 35,5 </span>
<span> CM: 22,5 </span>
</span>
</p>
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem js-tooltipHtml js-tooltip_rm" data-carturl="/cart/add?id=545895979" data-tip=" <span> US: 4Y </span>
<span> EU: 36 </span>
<span> CM: 23 </span>
" data-value="545895979">
36
</a>
<span class="js-tooltipContent g-dn">
<span> US: 4Y </span>
<span> EU: 36 </span>
<span> CM: 23 </span>
</span>
</p>
Do you have any idea how to solve this? I tried a loop like for size in 'class'= "m-productDescr_sizeItem".
You'll have to iterate through the span tags. Keep in mind that a dictionary can't have duplicate keys, so I created a list of dictionaries here instead, since the same keys repeat for every size:
html = '''<div data-header-element=".b-userMenu" data-sticky-header="true" id="js-size-pick">
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem js-tooltipHtml js-tooltip_rm" data-carturl="/cart/add?id=545896443" data-tip=" <span> US: 3,5Y </span>
<span> EU: 35,5 </span>
<span> CM: 22,5 </span>
" data-value="545896443">
35,5
</a>
<span class="js-tooltipContent g-dn">
<span> US: 3,5Y </span>
<span> EU: 35,5 </span>
<span> CM: 22,5 </span>
</span>
</p>
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem js-tooltipHtml js-tooltip_rm" data-carturl="/cart/add?id=545895979" data-tip=" <span> US: 4Y </span>
<span> EU: 36 </span>
<span> CM: 23 </span>
" data-value="545895979">
36
</a>
<span class="js-tooltipContent g-dn">
<span> US: 4Y </span>
<span> EU: 36 </span>
<span> CM: 23 </span>
</span>
</p>'''
Given that html:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# Each tooltip span wraps the US/EU/CM spans for one size button
spans = soup.find_all('span', {'class': 'js-tooltipContent g-dn'})
dict_list = []
for span in spans:
    alpha = span.find_all('span')
    # The <a> tag in the same <p> carries the data-value attribute
    id_temp = span.parent.find('a')['data-value']
    for each in alpha:
        temp_dict = {}
        # 'US: 3,5Y' -> ['US', ' 3,5Y']
        values = each.text.strip().split(':')
        k = values[0].strip()
        v = values[1].strip()
        temp_dict.update({'size': v, 'id': id_temp})
        dict_list.append(temp_dict)
Output:
print (dict_list)
[{'size': '3,5Y', 'id': '545896443'}, {'size': '35,5', 'id': '545896443'}, {'size': '22,5', 'id': '545896443'}, {'size': '4Y', 'id': '545895979'}, {'size': '36', 'id': '545895979'}, {'size': '23', 'id': '545895979'}]
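If what you actually want is a single dictionary keyed by data-value, one possible sketch is below. It assumes every data-value is unique on the page, uses a shortened copy of the HTML from the question, and the name sizes_by_id is made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the page source from the question (most attributes dropped)
html = '''<div id="js-size-pick">
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem" data-value="545896443">35,5</a>
<span class="js-tooltipContent g-dn">
<span> US: 3,5Y </span>
<span> EU: 35,5 </span>
<span> CM: 22,5 </span>
</span>
</p>
<p class="m-productDescr_sizeItem">
<a class="m-productDescr_sizeBtn js-sizeItem" data-value="545895979">36</a>
<span class="js-tooltipContent g-dn">
<span> US: 4Y </span>
<span> EU: 36 </span>
<span> CM: 23 </span>
</span>
</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# sizes_by_id is a hypothetical name: data-value -> {size system: size}
sizes_by_id = {}
for item in soup.select('p.m-productDescr_sizeItem'):
    link = item.select_one('a[data-value]')
    tooltip = item.select_one('span.js-tooltipContent')
    if link is None or tooltip is None:
        continue  # skip items that do not have the expected structure
    sizes = {}
    for span in tooltip.find_all('span'):
        # 'US: 3,5Y' -> ('US', ':', ' 3,5Y')
        system, _, value = span.get_text(strip=True).partition(':')
        sizes[system.strip()] = value.strip()
    sizes_by_id[link['data-value']] = sizes

print(sizes_by_id)
```

Because the outer keys are the data-value ids, nothing gets overwritten as long as the ids really are unique.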
I can't solve this problem. What I want to do is extract only the one 'href' value from the matching <a> tag, but I can't figure out how to do it with my code.
Here is the HTML structure and the code.
[
<li class="first impact"><div class="B_MyAd_"></div>
<a class="goodsBox-info" href="http://barogo.alba.co.kr/">
<span class="logo"> <img alt="Baro Inc" src="//imagelogo.alba.kr/data_image2/logo/brand/20200916174910805.gif"/> </span> <span class="company">Baro.Inc</span> <span class="title"><span>New Rider Recruition</span></span> <span class="wrap"> <span class="local">National</span> <span class="pay"><span class="payLetter">confirm</span> <span class="payIcon talk"></span></span> </span> </a>
<a class="brandHover" href="http://barogo.alba.co.kr/"></a></li>, ...
]
=================================================================================
browser.get("http://www.alba.co.kr/")
alba = BeautifulSoup(browser.page_source, 'html.parser')
brands = list(alba.find_all('li', {"class" : "impact"}, 'a.goodsBox-info'))
#print(brands)
for b in brands :
    if b('a', {"class" : "brandHover"}) :
        b = b('a')["href"]
        print(b)
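One thing that may be going wrong here: calling a tag like b('a') is shorthand for find_all(), so it returns a list, and a list cannot be indexed with ["href"]. A minimal sketch that uses find() to get a single tag instead is below. It runs against a cleaned-up, shortened copy of the <li> above rather than the live page, and the hrefs list is only there for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, cleaned-up copy of one <li> from the question (not the live page)
html = '''<ul>
<li class="first impact"><div class="B_MyAd_"></div>
<a class="goodsBox-info" href="http://barogo.alba.co.kr/">
<span class="company">Baro.Inc</span>
</a>
<a class="brandHover" href="http://barogo.alba.co.kr/"></a></li>
</ul>'''

alba = BeautifulSoup(html, 'html.parser')

hrefs = []
# class_='impact' also matches elements with several classes, e.g. "first impact"
for li in alba.find_all('li', class_='impact'):
    # find() returns one tag (or None), so ['href'] can be used on it directly
    link = li.find('a', class_='goodsBox-info')
    if link is not None and link.has_attr('href'):
        hrefs.append(link['href'])

print(hrefs)
```

With Selenium you would build the soup from browser.page_source as in the question; the loop itself would stay the same.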
I have an HTML document that is structured as follows:
<ul>
<li>
<span class="date">
2021.
</span>
</li>
<li>
<span class="date">
2020.
</span>
<span class="links">
</span>
</li>
</ul>
I want to insert a header in between the two entries, depending on whether a condition is met, using BeautifulSoup, but I am only able to place it inside the li tags:
current_date = None
for i, entry in enumerate(soup.findAll('li')):
    date = entry.find('span', {'class' : 'date'})
    item_date = date.text
    if item_date != current_date:
        current_date = item_date
        new_tag = soup.new_tag('h3', id='year_heading')
        new_tag.string = current_date
        entry.insert(0, new_tag)
The goal is to make it look as follows:
<ul>
<h3 id="year_heading">2021</h3>
<li>
<span class="date">
2021.
</span>
</li>
<h3 id="year_heading">2020</h3>
<li>
<span class="date">
2020.
</span>
<span class="links">
</span>
</li>
</ul>
But the current output is:
<ul>
<li><h3 id="year_heading">2021</h3>
<span class="date">
2021.
</span>
</li>
<li><h3 id="year_heading">2020</h3>
<span class="date">
2020.
</span>
<span class="links">
</span>
</li>
</ul>
This places my heading at the top of the li tag, so the bullet point ends up beside the header instead of beside the entry itself. Is there a good solution to this problem?
EDIT: added desired output
Try:
from bs4 import BeautifulSoup
html_doc = """<ul>
<li>
<span class="date">
2021.
</span>
</li>
<li>
<span class="date">
2020.
</span>
<span class="links">
</span>
</li>
</ul>"""
soup = BeautifulSoup(html_doc, "html.parser")

for span in soup.select("span.date"):
    # "2021." -> "2021"
    txt = span.get_text(strip=True).strip(".")
    ul = span.find_parent("ul")
    # Insert the heading into the <ul>, at the position of the parent <li>
    ul.insert(
        ul.contents.index(span.find_parent("li")),
        BeautifulSoup(
            '<h3 id="year_heading">{}</h3>\n'.format(txt), "html.parser"
        ),
    )

print(soup)
Prints:
<ul>
<h3 id="year_heading">2021</h3>
<li>
<span class="date">
2021.
</span>
</li>
<h3 id="year_heading">2020</h3>
<li>
<span class="date">
2020.
</span>
<span class="links">
</span>
</li>
</ul>
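Another option, if you want to keep the "only when the year changes" condition from the question, is insert_before(), which places the new tag as a sibling in front of the li. This is only a sketch: the sample list has a repeated year added to show the condition, and a class is used instead of an id on the assumption that you would rather not repeat the same id.

```python
from bs4 import BeautifulSoup

# Sample list with a repeated year, to show that only changes get a heading
html_doc = """<ul>
<li><span class="date">2021.</span></li>
<li><span class="date">2021.</span></li>
<li><span class="date">2020.</span></li>
</ul>"""

soup = BeautifulSoup(html_doc, "html.parser")

current_year = None
# find_all() returns a list up front, so editing the tree in the loop is safe
for li in soup.find_all("li"):
    date = li.find("span", class_="date")
    if date is None:
        continue
    year = date.get_text(strip=True).strip(".")
    if year != current_year:
        current_year = year
        heading = soup.new_tag("h3")
        heading["class"] = "year_heading"  # assumption: class instead of a repeated id
        heading.string = year
        # Put the heading in front of the <li>, as a sibling inside the <ul>
        li.insert_before(heading)

# Names of the direct children of the <ul>, in document order
order = [child.name for child in soup.ul.find_all(True, recursive=False)]
headings = [h.get_text() for h in soup.find_all("h3")]
print(order)
print(headings)
```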
I am struggling to figure out which element I need to point Beautiful Soup at in order to scrape the value of the 'amount' tag, which in this code sample is "1,56".
Below is an excerpt of the web page I want to scrape:
<td class="line-content">
<span class="html-tag">
<div
<span class="html-attribute-name">
class
</span>
='
<span class="html-attribute-value">
the-price
</span>
'
<span class="html-attribute-name">
style
</span>
='
<span class="html-attribute-value">
margin-top:20px;
</span>
'>
</span>
</td>
</tr>
<tr>
<td class="line-number" value="447">
</td>
<td class="line-content">
<span class="html-tag">
<span
<span class="html-attribute-name">
class
</span>
='
<span class="html-attribute-value">
currency
</span>
'>
</span>
€
<span class="html-tag">
</span>
</span>
<span class="html-tag">
<span
<span class="html-attribute-name">
class
</span>
='
<span class="html-attribute-value">
amount
</span>
'>
</span>
1,56
<span class="html-tag">
</span>
</span>
</td>
</tr>
Would you kindly enlighten me?
I would be really grateful for any help.
You can target the amount for example like this (data is your HTML string):
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
span_with_amount = soup.find(lambda tag: tag.name == 'span' and tag.get_text(strip=True) == 'amount')
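# string=True matches the next sibling that is a text node (text= is the older, deprecated spelling)
value = span_with_amount.parent.find_next_sibling(string=True)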
print(value.strip())
Prints:
1,56
First we find the <span> whose text is "amount", and then we take the text node that comes right after the parent of this <span>.
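Here is a self-contained sketch of the same idea that you can run without the real page. The data string is a hypothetical, trimmed-down excerpt, written on the assumption that the page is a "view source" listing where the original tags appear as escaped text (&lt; and &gt;) inside highlighting spans.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down excerpt of the "view source" markup
data = (
    '<td class="line-content">'
    '<span class="html-tag">&lt;span '
    '<span class="html-attribute-name">class</span>'
    '=\'<span class="html-attribute-value">amount</span>\'&gt;</span>'
    '1,56'
    '<span class="html-tag">&lt;/span&gt;</span>'
    '</td>'
)

soup = BeautifulSoup(data, 'html.parser')

# Step 1: the innermost <span> whose whole text is exactly "amount"
label = soup.find(
    lambda tag: tag.name == 'span' and tag.get_text(strip=True) == 'amount'
)

value = None
if label is not None:
    # Step 2: the text node that follows the parent of that <span>
    text_node = label.parent.find_next_sibling(string=True)
    if text_node is not None:
        value = text_node.strip()

print(value)
```

The None checks are there because find() and find_next_sibling() both return None when nothing matches.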
I have a piece of HTML code below:
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
I want to extract "Portugal" from it. Note that the span class is dynamic: it is not always class="country-flag-small flag-113", but changes with the country generated for this div block.
To get player1 and 1357, I am using the following cumbersome code:
player1info = soup.findAll('div', attrs={'class':'user-tagline'})[0].text.split("\n")
player1 = player1info[1]
pscore1 = player1info[2].replace('(','').replace(')', '')
It would be appreciated if someone could share a better solution. Thank you in advance.
UPDATE:
With the initial HTML div info extracted, I would now like to extract more, for the entire row. Here is the row:
<tr board-popover="" fen="r1bk2r1/1p2n3/pN6/1B1qQp2/P2Pp2p/1P6/2P2PPP/R3K1R1 b Q -" flip-board="1" highlight-squares="c4b6">
<td>
<a class="clickable-link td-user" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<span class="time-control">
<i class="icon-rapid">
</i>
</span>
<div class="user-tagline ">
<span class="username " data-avatar="https://betacssjs.chesscomfiles.com/bundles/web/images/noavatar_l.1c5172d5.gif" data-country="Portugal" data-enabled="true" data-flag="113" data-joined="Joined Jun 19, 2016" data-logged="Online 6 hrs ago" data-membership="basic" data-name="Atikinounette" data-popup="hover" data-title="" data-username="Atikinounette">
Atikinounette
</span>
<span class="user-rating">
(1357)
</span>
<span class="country-flag-small flag-113" tip="Portugal">
</span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="https://images.chesscomfiles.com/uploads/v1/user/28196414.83e31ff1.50x50o.3a6f77e4aa44.jpeg" data-country="Indonesia" data-enabled="true" data-flag="70" data-joined="Joined May 15, 2016" data-logged="Online Nov 7, 2017" data-membership="basic" data-name="belemnarmada" data-popup="hover" data-title="" data-username="belemnarmada">
belemnarmada
</span>
<span class="user-rating">
(1387)
</span>
<span class="country-flag-small flag-70" tip="Indonesia">
</span>
</div>
</a>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">
1
</span>
<span class="game-result">
0
</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost">
</i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
30 min
</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
25
</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
Aug 9, 2017
</a>
</td>
<td class="text-center miniboard">
<input class="checkbox" game-checkbox="" game-id="2249663029" game-is-live="true" ng-model="model.gameIds[2249663029].checked" type="checkbox"/>
</td>
</tr>
The information needed is:
player's info (the answer provided by balderman already covers that)
game-result (1, 0)
playing time (30 min in this row)
total moves (25)
playing date (Aug 9, 2017)
Thank you so much.
How about the code below?
The idea is that the user attributes are the 3 spans under each div, so the code loops over those spans and extracts the data.
from bs4 import BeautifulSoup
html = '''<html><body> <div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div></body></html>'''

soup = BeautifulSoup(html, 'html.parser')
users = soup.findAll('div', attrs={'class': 'user-tagline'})
for user in users:
    # The three spans hold, in order: name, rating, country flag
    user_properties = user.findAll('span')
    for idx, prop in enumerate(user_properties):
        if idx == 0:
            print('user name: {}'.format(prop.text))
        elif idx == 1:
            print('user rating: {}'.format(prop.text))
        elif idx == 2:
            print('user country: {}'.format(prop.attrs['tip']))
Output
user name: player1
user rating: (1357)
user country: Portugal
user name: player2
user rating: (1387)
user country: Indonesia
This is a more readable solution:
div1 = soup.select("div.user-tagline")[0]
player1 = div1.select_one("span.username").text
pscore1 = div1.select_one("span.user-rating").text
# the country is stored in the tip attribute, not in the text
country1 = div1.select_one("span.country-flag-small")["tip"]
To extract the data from all the divs, just use a loop and replace "0" with "i".
If you are interested only in the first div, you can go with this:
res = bsobj.find('div', {'class':'user-tagline'}).findAll('span')
print(res[0].text, res[1].text, res[2]['tip'])
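For the rest of the row asked about in the UPDATE (game result, playing time, moves, date), a possible sketch with CSS selectors is below. The row is a shortened, hypothetical copy of the one in the question (hrefs and most attributes dropped), text_of is a helper made up for this example, and the selectors assume the class names stay as shown.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the row: only the tags and classes used
# below are kept, and the real hrefs are replaced with "#".
row_html = '''<table><tr>
<td><a class="clickable-link td-user" href="#">
<div class="user-tagline ">
<span class="username ">Atikinounette</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username ">belemnarmada</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</a></td>
<td><a class="clickable-link text-middle" href="#">
<div class="pull-left">
<span class="game-result"> 1 </span>
<span class="game-result"> 0 </span>
</div>
</a></td>
<td class="text-center"><a class="clickable-link" href="#"> 30 min </a></td>
<td class="text-right"><a class="clickable-link text-middle moves" href="#"> 25 </a></td>
<td class="text-right miniboard"><a class="clickable-link archive-date" href="#"> Aug 9, 2017 </a></td>
</tr></table>'''


def text_of(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(row_html, 'html.parser')
row = soup.find('tr')

players = []
for tagline in row.select('div.user-tagline'):
    # The flag class changes per country, so look for the tip attribute instead
    flag = tagline.select_one('span[tip]')
    players.append({
        'name': text_of(tagline, 'span.username'),
        'rating': (text_of(tagline, 'span.user-rating') or '').strip('()'),
        'country': flag['tip'] if flag is not None else None,
    })

game = {
    'players': players,
    'results': [s.get_text(strip=True) for s in row.select('span.game-result')],
    'time': text_of(row, 'td.text-center a.clickable-link'),
    'moves': text_of(row, 'a.moves'),
    'date': text_of(row, 'a.archive-date'),
}
print(game)
```

For a whole table you would run the same extraction once per tr.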
I have the following html code:
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8/div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div>Text10</div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>
How can I find Text3, which is located right after the div element containing the string MyText?
You can use an lxml.html solution:
from lxml import html
source = """
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
...
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
tree = html.fromstring(source)
print(tree.xpath('//div[.="MyText"]/following-sibling::span/div/span/text()'))
If your structure is final (that is, it will not change), you can get the right value like this:
from bs4 import BeautifulSoup as bfs
html = """<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8/div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div>Text10</div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
soup = bfs(html, 'html.parser')
result = ''
for div0 in soup.find_all('div', {'class': 'aAAD'}):
    for div1 in div0.find_all('div', {'class': 'Bgbcca'}):
        if div1.get_text() == 'MyText':
            span = div0.find('span', {'class': 'hthtb'})
            if span:
                span_to_return = span.find('span', {'class': 'hthtb'})
                if span_to_return:
                    result = span_to_return.get_text()
print(result)
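A more compact variant of the same idea, as a sketch: find the label div by its exact string, then step to the span that follows it. It assumes the label text is exactly "MyText" with no surrounding whitespace, and it uses a shortened copy of the HTML.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question
html = '''<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb"><div><span class="hthtb">Text2</span></div></span>
</div>
<div class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

result = None
# string='MyText' only matches a tag whose single text child is exactly 'MyText'
label = soup.find('div', class_='Bgbcca', string='MyText')
if label is not None:
    # The value lives in the <span> that follows the label inside the same block
    value = label.find_next_sibling('span', class_='hthtb')
    if value is not None:
        result = value.get_text(strip=True)

print(result)
```

If the label can carry extra whitespace, the string match will miss it, and a function-based match is the safer choice.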
You can build a custom query function to pass into find():
def has_my_text(tag):
    found = tag.select_one('.Bgbcca')
    # important to assign the result to avoid calling
    # .get_text() on a NoneType, resulting in an error.
    if found:
        return found.get_text() == "MyText"
soup = bs4.... # assign your soup object
found = soup.find(has_my_text)
# <div class="Bgbcca">MyText</div>
# <span class="hthtb">
# <div>
# <span class="hthtb">Text3</span>
# </div>
# </span>
# </div>
# Note that your span class is nested, so we go two levels in
result = found.select_one('.hthtb').select_one('.hthtb').get_text()
# 'Text3'
# The line below also works if the outer span never has any text of its own
result = found.select_one('.hthtb').get_text().strip()
Note that find() and select_one() assume we only need the first match. If you need to handle multiple matches, you'll need to use find_all() and select() and change your code accordingly.
If you want to handle variable texts, you can define your function like this:
def has_my_text(tag, text):
    found = tag.select_one('.Bgbcca')
    if found:
        return found.get_text() == text
And wrap the function in a lambda when you call find(), like this:
txt = "MyText"
soup.find(lambda tag: has_my_text(tag, txt))
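To illustrate the multiple-match case, here is a rough sketch that uses find_all() with the same kind of function. The HTML is a made-up sample in which "MyText" appears twice ("Text13" is invented for the example), and has_label is a hypothetical variant of has_my_text that only accepts the aAAD containers, so that an outer wrapper is never returned by mistake.

```python
from bs4 import BeautifulSoup

# Made-up sample: the label "MyText" appears in two blocks
html = '''<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb"><div><span class="hthtb">Text3</span></div></span>
</div>
<div class="aAAD">
<div class="Bgbcca">Other</div>
<span class="hthtb"><div><span class="hthtb">Text5</span></div></span>
</div>
<div class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb"><div><span class="hthtb">Text13</span></div></span>
</div>
</div>'''


def has_label(tag, text):
    # Only the aAAD containers are candidates (assumption about the layout)
    if 'aAAD' not in tag.get('class', []):
        return False
    found = tag.select_one('.Bgbcca')
    return found is not None and found.get_text(strip=True) == text


soup = BeautifulSoup(html, 'html.parser')
txt = "MyText"

# find_all() returns every matching container instead of just the first
blocks = soup.find_all(lambda tag: has_label(tag, txt))

values = []
for block in blocks:
    # '.hthtb .hthtb' selects the inner span nested inside the outer one
    inner = block.select_one('.hthtb .hthtb')
    if inner is not None:
        values.append(inner.get_text(strip=True))

print(values)
```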