I am trying to extract a value from a span, but that span has another span embedded inside it. How do I get the value of only one span rather than both?
from bs4 import BeautifulSoup
some_price = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"})
some_price.span
# that code returns this:
'''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
# BUT I only want the $289 part, not the 99 associated with it
After making this adjustment:
some_price.span.text
the interpreter returns
$28999
Would it be possible to somehow remove the '99' at the end? Or to only extract the first part of the span?
Any help/suggestions would be appreciated!
You can access the desired value via the tag's .contents attribute:
from bs4 import BeautifulSoup as soup
html = '''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
result = soup(html, 'html.parser').find('span').contents[0]
Output:
'$289'
Thus, in the context of your original div lookup:
result = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"}).span.contents[0]
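Alternatively, if you'd rather not rely on positional indexing, find(string=True) returns the first bare text node of a tag directly. A minimal sketch using the HTML from above:

```python
from bs4 import BeautifulSoup

html = '<span>$289<span class="rightEndPrice_6y_hS">99</span></span>'
soup = BeautifulSoup(html, 'html.parser')

# string=True matches the first bare text node under the tag,
# so the nested <span> holding "99" is skipped
price = soup.find('span').find(string=True)
print(price)  # $289
```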
It is convenient to use "index-x" anchors to quickly locate a subsection in a page.
For instance,
https://docs.python.org/3/library/re.html#index-2
jumps to the third subsection of that page.
When I want to share the location of a subsection with others, how can I get its index conveniently?
For instance, how do I get the index of the {m,n} subsection without counting up from index-0?
With bs4 4.7.1 you can use :has and :contains to target a specific text string and return the index. Note that select_one returns the first match; use select with a list comprehension if you want to return all matches:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
index = soup.select_one('dl:has(.pre:contains("{m,n}"))')['id']
print(index)
Any version: if you want a dictionary that maps special characters to indices, the following works. Thanks to @zoe for spotting the error in my dictionary comprehension.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')]) for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}
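To see the shape of the two dictionaries without fetching the live page, here is a self-contained sketch on a trimmed, made-up fragment that mirrors the docs page structure:

```python
from bs4 import BeautifulSoup as bs

# Trimmed, made-up stand-in for the docs page: each indexed <dl> wraps
# <dt> entries whose <span class="pre"> holds the special-character string
html = '''
<dl id="index-1"><dt><span class="pre">*</span></dt></dl>
<dl id="index-2"><dt><span class="pre">{m,n}</span></dt></dl>
'''
soup = bs(html, 'html.parser')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')])
                 for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}
print(indices['{m,n}'])  # index-2
```

Looking up indices['{m,n}'] then gives you the anchor to share.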
You're looking for index-7.
You can download the HTML of the page and get all the possible values of index-something with the following code:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://docs.python.org/3/library/re.html')
soup = BeautifulSoup(r.content.decode(), 'html.parser')
result = [t['id'] for t in soup.find_all(id=re.compile(r'index-\d+'))]
print(result)
Output:
['index-0', 'index-1', 'index-2', 'index-3', 'index-4', 'index-5', 'index-6', 'index-7', 'index-8', 'index-9', 'index-10', 'index-11', 'index-12', 'index-13', 'index-14', 'index-15', 'index-16', 'index-17', 'index-18', 'index-19', 'index-20', 'index-21', 'index-22', 'index-23', 'index-24', 'index-25', 'index-26', 'index-27', 'index-28', 'index-29', 'index-30', 'index-31', 'index-32', 'index-33', 'index-34', 'index-35', 'index-36', 'index-37', 'index-38']
The t objects in the list comprehension contain the HTML of the tags whose id matches the regex.
Is it possible to extract data from within onclick attribute values like analysis(1644983), AsianOdds(1644983) and EuropeOdds(1644983)? I just want to show one number, as all of them are the same within this HTML code.
HTML
<td style="word-spacing:-3px" align="left"><a onclick="analysis(1644983)">析</a> <a onclick="AsianOdds(1644983)">亚</a> <a onclick="EuropeOdds(1644983)">欧</a></td>
Python Code
from bs4 import BeautifulSoup
soup = BeautifulSoup('<td style="word-spacing:-3px" align="left"><a onclick="analysis(1644983)">析</a> <a onclick="AsianOdds(1644983)">亚</a> <a onclick="EuropeOdds(1644983)">欧</a></td>', 'html.parser')
lines = soup.find_all('onclick')
for line in lines:
    print(line['analysis'])
Expected output
1644983
I tried to explain everything in the comments:
from bs4 import BeautifulSoup
html = '''<td style="word-spacing:-3px" align="left">
    <a onclick="analysis(1644983)">析</a>
    <a onclick="AsianOdds(1644983)">亚</a>
    <a onclick="EuropeOdds(1644983)">欧</a>
</td>'''
soup = BeautifulSoup(html, 'html.parser')
# Find all <a> elements
elements = soup.find_all('a')
# Loop over all found elements
for element in elements:
    # Disregard element if it doesn't contain onclick attribute
    if 'onclick' not in element.attrs:
        continue
    # Get attribute value
    value = element['onclick']
    # Disregard wrong elements
    if not value.startswith('analysis('):
        continue
    # Extract position of opening bracket
    position = value.index('(') + 1
    # Slice string so only content inside bracket is left
    value = value[position:-1]
    # Print result
    print(value)
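If the markup is too irregular for a tag-based walk, the number can also be pulled straight out of the raw HTML with a regular expression. A sketch, assuming the analysis(...) call always wraps the number you want:

```python
import re

# Made-up snippet mirroring the question's anchors
html = '''<a onclick="analysis(1644983)">析</a>
<a onclick="AsianOdds(1644983)">亚</a>
<a onclick="EuropeOdds(1644983)">欧</a>'''

# Capture the digits inside analysis(...)
match = re.search(r'analysis\((\d+)\)', html)
if match:
    print(match.group(1))  # 1644983
```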
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r, 'html.parser')
rate = soup.find_all('body')
print(rate)
print(type(soup))
I'm trying to capture values in containers such as data-bedrooms="3", specifically the values given in the quotations, but I have no idea what they are formally called or how to parse them.
Below is a sample of part of the printout for the "body", so I know the values are there; capturing the specific part is what I can't manage:
data-ratemaximum="$260" data-rateminimum="$220" data-rateunits="night" data-rawlistingnumber="576329" data-requestuuid="73bcfaa3-9637-40a8-801c-ae86f93caf39" data-searchpdptab="C" data-serverday="18" data-showbookingphone="False"
To obtain the value of an attribute, use rate['attr']. For example:
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r, "html.parser")
rate = soup.find('body')
print(rate['data-ratemaximum'])
print(rate['data-rateunits'])
print(rate['data-rawlistingnumber'])
print(rate['data-requestuuid'])
print(rate['data-searchpdptab'])
print(rate['data-serverday'])
print(rate['data-showbookingphone'])
You need to pick apart your result. It might be helpful to know that those things you seek are called attributes of a tag in HTML:
body_tag = rate[0]
data_bedrooms = body_tag.attrs['data-bedrooms']
The code above assumes you only have one <body> -- if you have more you will need to use a for loop on rate. You'll also possibly want to convert the value to an integer with int().
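As a self-contained illustration of the attribute lookup and the int() conversion (the HTML here is a made-up stand-in for the real listing page):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page's <body> tag
html = '<body data-bedrooms="3" data-rateunits="night"></body>'
soup = BeautifulSoup(html, 'html.parser')

body_tag = soup.find('body')
# Attribute values come back as strings, so convert where needed
data_bedrooms = int(body_tag.attrs['data-bedrooms'])
print(data_bedrooms)  # 3
```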
Not sure if you wanted only data-bedrooms from the soup object or not. I did some cursory checking of the output produced and was able to reason that the data-* items you mentioned were attributes rather than tags. If the document structure is consistent, you could probably locate the tag associated with each attribute and make finding these more efficient:
import re
# regex pattern for attribs
data_tag_pattern = re.compile(r'^data-')
# Create list of attribs
attribs_wanted = ("data-bedrooms data-rateminimum data-rateunits data-rawlistingnumber "
                  "data-requestuuid data-searchpdptab data-serverday data-showbookingphone").split()
# Search entire tree
for item in soup.findAll():
    # Use descendants to recurse downwards
    for child in item.descendants:
        try:
            for attribute in child.attrs:
                if data_tag_pattern.match(attribute) and attribute in attribs_wanted:
                    print("{}: {}".format(attribute, child[attribute]))
        except AttributeError:
            pass
This will produce output as so:
data-showbookingphone: False
data-bedrooms: 3
data-requestuuid: 2b6f4d21-8b04-403d-9d25-0a660802fb46
data-serverday: 18
data-rawlistingnumber: 576329
data-searchpdptab: C
hth!
I'm writing an analysis tool that counts how many children each HTML tag in the source code has.
I parsed the page with BeautifulSoup, and now I want to iterate over every tag in the page and count how many children it has.
What is the best way to iterate over all the tags? How can I, for example, get all the tags that do not have any children?
If you use find_all() with no arguments you can iterate over every tag.
You can get how many children a tag has by using len(tag.contents).
To get a list of all tags with no children:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('someHTMLFile.html', 'r'), 'html.parser')
body = soup.body
empty_tags = []
for tag in body.find_all():
    if len(tag.contents) == 0:
        empty_tags.append(tag)
print(empty_tags)
or...
empty_tags = [tag for tag in soup.body.find_all() if len(tag.contents) == 0]
You can count the tag's children by using the len() function.
meta_tags = soup.findAll('meta', property="article:tag")
if len(meta_tags) < 1:
    return False
I use BeautifulSoup for the same task, using the findChildren method of each element.
In the code below, fullData contains the HTML string of the webpage:
soup = BeautifulSoup(fullData, 'html.parser')
elements = soup.findAll()

def findElements(dataList, el):
    temp = el.findChildren()
    if len(temp) == 0:
        print(el.get_text())

tempResults = [findElements(dataList, el) for el in elements]
Hope this helps
Don't reinvent the wheel... especially not in ways that don't roll. BeautifulSoup does
count the children for you, unsurprisingly.
from bs4 import BeautifulSoup as BS
doc = BS('<html><head><title>Example</title></head><body><h1>The Truth</h1>'
         '<p>It is out there, Neo.</p></body></html>', 'html.parser')
print(len(doc.html))
# 2, head and body
print(len(list(x for x in doc.html.find_all())))
# 5, because find_all() finds... all?
print(len(list(x for x in doc.html.children)))
# 2, but instead of letting BeautifulSoup count it as it deems best,
# you actually gather the pieces and count them yourself
print(len(doc.html.contents))
# 2, functionally the same as the prior, just more readable
I've looked at the other BeautifulSoup same-level questions. It seems like mine is slightly different.
Here is the website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31
I'm trying to get the table on the right. Notice how the first row of the table expands into a detailed breakdown of that data. I don't want that data; I only want the top-level data. You can also see that the other rows can be expanded as well, just not in this case, so simply looping and skipping tr[2] might not work. I've tried this:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get(page)
r.encoding = 'gb2312'
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('div', class_='right1').findAll('tr', {"class": re.compile('list.*')})
but there are still more nested list* rows at other levels. How do I get only the first level?
Limit your search to direct children of the table element only by setting the recursive argument to False:
table = soup.find('div', class_='right1').table
rows = table.find_all('tr', {"class" : re.compile('list.*')}, recursive=False)
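A self-contained sketch of what recursive=False changes, using a made-up nested table in place of the live page:

```python
import re
from bs4 import BeautifulSoup

html = '''<table>
<tr class="list1"><td><table><tr class="list2"><td>inner</td></tr></table></td></tr>
<tr class="list3"><td>outer</td></tr>
</table>'''
table = BeautifulSoup(html, 'html.parser').table

# A recursive search also matches the row of the inner table
all_rows = table.find_all('tr', {'class': re.compile('list.*')})
# recursive=False only inspects direct children of the outer <table>
top_rows = table.find_all('tr', {'class': re.compile('list.*')}, recursive=False)
print(len(all_rows), len(top_rows))  # 3 2
```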
@MartijnPieters' solution is already perfect, but don't forget that BeautifulSoup allows you to use multiple attributes as well when locating elements. See the following code:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re
url = "http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31"
r = rq.get(url)
r.encoding = "gb2312"
soup = bsoup(r.content, "html.parser")
div = soup.find("div", class_="right1")
rows = div.find_all("tr", {"class":re.compile(r"list\d+"), "style":"cursor:pointer;"})
for row in rows:
    first_td = row.find_all("td")[0]
    print(first_td.get_text())
Notice how I also added "style":"cursor:pointer;". This is unique to the top-level rows and is not an attribute of the inner rows. This gives the same result as the accepted answer:
百度汇总
360搜索
新搜狗
谷歌
微软必应
雅虎
0
有道
其他
Hopefully this also helps.