I want to print an attribute value based on its name. Take, for example:
<META NAME="City" content="Austin">
I want to do something like this
soup = BeautifulSoup(f)  # f is some HTML containing the above meta tag
for meta_tag in soup("meta"):
    if meta_tag["name"] == "City":
        print(meta_tag["content"])
The above code gives a KeyError: 'name'. I believe this is because name is used by BeautifulSoup, so it can't be used as a keyword argument.
It's pretty simple, use the following:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<META NAME="City" content="Austin">')
>>> soup.find("meta", {"name":"City"})
<meta name="City" content="Austin" />
>>> soup.find("meta", {"name":"City"})['content']
'Austin'
theharshest answered the question but here is another way to do the same thing.
Also, in your example you have NAME in caps, and in your code you have name in lowercase.
s = '<div class="question" id="get attrs" name="python" x="something">Hello World</div>'
soup = BeautifulSoup(s, 'html.parser')
attributes_dictionary = soup.find('div').attrs
print(attributes_dictionary)
# prints: {'class': ['question'], 'id': 'get attrs', 'name': 'python', 'x': 'something'}
print(attributes_dictionary['class'][0])
# prints: question
print(soup.find('div').get_text())
# prints: Hello World
6 years late to the party but I've been searching for how to extract an html element's tag attribute value, so for:
<span property="addressLocality">Ayr</span>
I want "addressLocality". I kept being directed back here, but the answers didn't really solve my problem.
How I managed to do it eventually:
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs('<span property="addressLocality">Ayr</span>', 'html.parser')
>>> my_attributes = soup.find().attrs
>>> my_attributes
{u'property': u'addressLocality'}
As it's a dict, you can then also use keys and 'values'
>>> my_attributes.keys()
[u'property']
>>> my_attributes.values()
[u'addressLocality']
Hopefully it helps someone else!
theharshest's answer is the best solution, but FYI the problem you were encountering has to do with the fact that a Tag object in Beautiful Soup acts like a Python dictionary. If you access tag['name'] on a tag that doesn't have a 'name' attribute, you'll get a KeyError.
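To make the dictionary behaviour concrete, here is a minimal sketch. The extra charset meta tag is made up for illustration; it stands in for any tag that lacks a name attribute. Using .get() returns None for a missing attribute instead of raising KeyError:

```python
from bs4 import BeautifulSoup

# Sample markup is an assumption: the first tag has no name attribute,
# which is what makes meta_tag["name"] raise KeyError in the question.
html = '<meta charset="utf-8"><META NAME="City" content="Austin">'
soup = BeautifulSoup(html, "html.parser")

cities = []
for meta_tag in soup.find_all("meta"):
    # .get() returns None for a missing attribute instead of raising KeyError
    if meta_tag.get("name") == "City":
        cities.append(meta_tag["content"])

print(cities)
```

The parser lowercases attribute names, so NAME in the markup is looked up as name.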
The following works:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<META NAME="City" content="Austin">', 'html.parser')
metas = soup.find_all("meta")
for meta in metas:
    print(meta.attrs['content'], meta.attrs['name'])
You can also try this solution, which finds a value written in a span inside a table.
htmlContent:
<table>
<tr>
<th>
ID
</th>
<th>
Name
</th>
</tr>
<tr>
<td>
<span name="spanId" class="spanclass">ID123</span>
</td>
<td>
<span>Bonny</span>
</td>
</tr>
</table>
Python code
soup = BeautifulSoup(htmlContent, "lxml")
tables = soup.find_all("table")
for table in tables:
    storeValueRows = table.find_all("tr")
    # .string keeps the whitespace around the <th> text, so strip it before comparing
    thValue = storeValueRows[0].find_all("th")[0].string.strip()
    if thValue == "ID":  # with this condition I verify that this is the table I wanted
        # storeValueRows[1] is the <tr> at index 1, find_all("span")[0] is its first <span>, and .string is its text
        value = storeValueRows[1].find_all("span")[0].string
        value = value.strip()  # removes whitespace from the start and end of the string
        # find using an attribute:
        value = storeValueRows[1].find("span", {"name": "spanId"})['class']
        print(value)
        # this prints ['spanclass'], because class is a multi-valued attribute
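If tdd is the Tag parsed from '<td class="abc"> 75</td>', for example:
tdd = BeautifulSoup('<td class="abc"> 75</td>', 'html.parser').td
then in BeautifulSoup:
if tdd.has_attr('class'):
    print(tdd.attrs['class'][0])
Result: abc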
I'd like to retrieve information for a couple of players from transfermarkt.de, e.g. Manuel Neuer's birthday.
Here is what the relevant HTML looks like:
<tr>
<th>Geburtsdatum:</th>
<td>
27.03.1986
</td>
</tr>
I know I could get the date by using the following code:
soup = BeautifulSoup(source_code, "html.parser")
player_attributes = soup.find("table", class_ = 'auflistung')
rows = player_attributes.find_all('tr')
date_of_birth = re.search(r'([0-9]+\.[0-9]+\.[0-9]+)', rows[1].get_text(), re.M)[0]
but that is quite fragile. E.g. for Robert Lewandowski the date of birth is in a different position in the table, because the attributes shown on a player's profile differ. Is there a way to logically do the following?
find the tag with 'Geburtsdatum:' in it
get the text of the tag right after it
The more robust the better :)
BeautifulSoup lets you retrieve the next matching element in document order with the method find_next() (findNext() in older versions):
import re
from bs4 import BeautifulSoup
import requests
html = requests.get('https://www.transfermarkt.de/manuel-neuer/profil/spieler/17259', headers = {'User-Agent': 'Custom'})
soup = BeautifulSoup(html.text, "html.parser")
player_attributes = soup.find("table", class_ = 'auflistung')
rows = player_attributes.find_all('tr')
def get_table_value(rows, table_header):
    for row in rows:
        # find_all returns an empty list (never None) when nothing matches
        helpers = row.find_all(text=re.compile(table_header))
        for helper in helpers:
            return helper.find_next('td').get_text()
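The same idea can be tried offline. This is only a sketch: the table fragment is made up to resemble the question's HTML, and value_for_label is a hypothetical helper name. It looks up the th by its text, so the position of the row in the table no longer matters:

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment modelled on the question's HTML (not fetched from the site).
html = """
<table class="auflistung">
  <tr><th>Name:</th><td>Manuel Neuer</td></tr>
  <tr><th>Geburtsdatum:</th><td>
      27.03.1986
  </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

def value_for_label(soup, label):
    """Return the stripped text of the first <td> after the <th> containing label."""
    header = soup.find("th", string=re.compile(re.escape(label)))
    if header is None:
        return None  # label not present on this profile
    cell = header.find_next("td")
    return cell.get_text(strip=True) if cell else None

birthday = value_for_label(soup, "Geburtsdatum")
print(birthday)
```

Returning None for a missing label lets the caller handle profiles that do not show every attribute.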
I have HTML like:
<tr>
<td>Title:</td>
<td>Title value</td>
</tr>
I have to specify the <td> whose text precedes the one I want, and then grab the text of the following <td>. Something like: grab the text of the first <td> after the <td> that contains the text Title:. The result should be: Title value
I have a basic understanding of Python and BeautifulSoup, and I have no idea how to do this when there is no class to select on.
I have tried this:
row = soup.find_all('td', string='Title:')
text = str(row.nextSibling)
print(text)
and I receive error: AttributeError: 'ResultSet' object has no attribute 'nextSibling'
First of all, soup.find_all() returns a ResultSet containing all the td elements whose string is Title:.
For each element in the result set, you need to get the nextSibling separately. You should also loop until you reach a sibling whose tag is td, since other nodes (such as a NavigableString) can sit in between.
Example:
>>> from bs4 import BeautifulSoup
>>> s="""<tr>
... <td>Title:</td>
... <td>Title value</td>
... </tr>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> row = soup.find_all('td', string='Title:')
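>>> for r in row:
...     nextSib = r.nextSibling
...     # test for None first, otherwise .name raises AttributeError at the end of the siblings
...     while nextSib is not None and nextSib.name != 'td':
...         nextSib = nextSib.nextSibling
...     if nextSib is not None:
...         print(nextSib.text)
...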
Title value
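A shorter variant of the same loop, as a sketch: find_next_sibling("td") does the skipping of in-between text nodes for you. The wrapping table tag is added here only to make the fragment self-contained:

```python
from bs4 import BeautifulSoup

html = """<table><tr>
<td>Title:</td>
<td>Title value</td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

label = soup.find("td", string="Title:")
# find_next_sibling skips the whitespace text node between the two cells
value_cell = label.find_next_sibling("td") if label else None
value = value_cell.get_text(strip=True) if value_cell else None
print(value)
```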
Alternatively, you can use a library that supports XPath, such as lxml (xml.etree supports only a limited XPath subset); with XPath this is easy.
What you're intending to do is relatively easy with lxml using XPath. You can try something like this (the strings in angle brackets are placeholders to fill in):
from lxml import etree
tree = etree.parse('<your file>')
path_list = tree.xpath('//<xpath to td>')
for i in range(len(path_list)):
    # make sure a following element exists before reading it
    if path_list[i].text == '<What you want>' and i + 1 < len(path_list):
        your_text = path_list[i + 1].text
I want to get the data (name, city and address) located in a div tag from an HTML file like this:
<div class="mainInfoWrapper">
<h4 itemprop="name">name</h4>
<div>
city
Address
</div>
</div>
I don't know how to get the data I want from that specific tag.
I'm using Python with the BeautifulSoup library.
There are several <h4> tags in the source HTML, but only one <h4> with the itemprop="name" attribute, so you can search for that first. Then access the remaining values from there. Note that the following HTML is correctly reproduced from the source page, whereas the HTML in the question was not:
from bs4 import BeautifulSoup
# province and city are links (<a> tags) in the source page; their href values are omitted here
html = '''<div class="mainInfoWrapper">
    <h4 itemprop="name">
        NAME
    </h4>
    <div>
        <a>PROVINCE</a> - <a>CITY</a> ADDRESS
    </div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()
When run for the URL that you provided
import requests
from bs4 import BeautifulSoup
r = requests.get('http://goo.gl/sCXNp2')
soup = BeautifulSoup(r.content, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()
>>> print(name)
بیمارستان حضرت فاطمه (س)
>>> print(province)
تهران
>>> print(city)
تهران
>>> print(address)
یوسف آباد، خیابان بیست و یکم، جنب پارک شفق، بیمارستان ترمیمی پلاستیک فک و صورت
I'm not sure that the printed output is correct on my terminal, however, this code should produce the correct text for a properly configured terminal.
You can do it with the lxml.html module (lxml is a third-party package, not part of the standard library):
>>> s="""<div class="mainInfoWrapper">
... <h4 itemprop="name">name</h4>
... <div>
...
... city
...
... Address
... </div>
... </div>"""
>>>
>>> import lxml.html
>>> document = lxml.html.document_fromstring(s)
>>> print(document.text_content().split())
['name', 'city', 'Address']
And with BeautifulSoup, to get the text between your tags:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> print(soup.text)
To get the text from a specific tag, just use soup.find_all:
soup = BeautifulSoup(your_HTML_source, 'html.parser')
for line in soup.find_all('div', attrs={"class": "mainInfoWrapper"}):
    print(line.text)
If h4 is used only once, you can do this:
name = soup.find('h4', attrs={'itemprop': 'name'})
print(name.text)
parentdiv = name.find_parent('div', class_='mainInfoWrapper')
cityaddressdiv = name.find_next_sibling('div')
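Putting those pieces together on the HTML from the question, as a rough sketch. It assumes city and address are separated by whitespace inside the inner div, as in the question's snippet; a real page with multi-word addresses would need a different split:

```python
from bs4 import BeautifulSoup

# HTML as posted in the question.
html = """<div class="mainInfoWrapper">
<h4 itemprop="name">name</h4>
<div>
city
Address
</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")

name_tag = soup.find("h4", attrs={"itemprop": "name"})
name = name_tag.get_text(strip=True)

# The city/address text lives in the <div> that follows the <h4>.
info_div = name_tag.find_next_sibling("div")
parts = info_div.get_text().split()  # assumption: whitespace-separated values

print(name, parts)
```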
How would I, using BeautifulSoup, search for tags containing ONLY the attributes I search for?
For example, I want to find all <td valign="top"> tags.
The following code:
raw_card_data = soup.fetch('td', {'valign':re.compile('top')})
gets all of the data I want, but also grabs any <td> tag that has the attribute valign:top
I also tried:
raw_card_data = soup.findAll(re.compile('<td valign="top">'))
and this returns nothing (probably because of bad regex)
I was wondering if there was a way in BeautifulSoup to say "Find <td> tags whose only attribute is valign:top"
UPDATE
For example, if an HTML document contained the following <td> tags:
<td valign="top">.....</td><br />
<td width="580" valign="top">.......</td><br />
<td>.....</td><br />
I would want only the first <td> tag (<td valign="top">) to be returned.
As explained in the BeautifulSoup documentation, you may use this:
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})
EDIT :
To return tags that have only the valign="top" attribute, you can check for the length of the tag attrs property :
from bs4 import BeautifulSoup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all("td", {"valign" : "top"})
for result in results :
    # keep only the tags whose sole attribute is valign
    if len(result.attrs) == 1 :
        print(result)
That returns :
<td valign="top">.....</td>
You can pass a function (for example a lambda) to findAll, as explained in the documentation. In your case, to search for td tags whose only attribute is valign="top", use the following (tag.get avoids a KeyError on tags whose single attribute is not valign):
td_tag_list = soup.findAll(
    lambda tag: tag.name == "td" and
                len(tag.attrs) == 1 and
                tag.get("valign") == "top")
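Here is a self-contained sketch of the function-filter approach for bs4. The sample cells and the name only_valign_top are made up for illustration. Comparing the whole attrs dict covers both "has valign=top" and "has nothing else" in one test:

```python
from bs4 import BeautifulSoup

# Made-up cells covering the interesting cases.
html = """<table><tr>
<td valign="top">first</td>
<td width="580" valign="top">second</td>
<td>third</td>
<td width="580">fourth</td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

def only_valign_top(tag):
    # True only for <td> tags whose attribute dict is exactly {"valign": "top"}
    return tag.name == "td" and tag.attrs == {"valign": "top"}

matches = soup.find_all(only_valign_top)
texts = [td.get_text() for td in matches]
print(texts)
```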
If you only want to search by attribute name, with any value:
from bs4 import BeautifulSoup
import re
soup= BeautifulSoup(html.text,'lxml')
results = soup.findAll("td", {"valign" : re.compile(r".*")})
As per Steve Lorimer, it is better to pass True instead of a regex:
results = soup.findAll("td", {"valign" : True})
The easiest way to do this is with the new CSS style select method:
soup = BeautifulSoup(html)
results = soup.select('td[valign="top"]')
Just pass it as an argument of findAll:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""
... <html>
... <head><title>My Title!</title></head>
... <body><table>
... <tr><td>First!</td>
... <td valign="top">Second!</td></tr>
... </table></body><html>
... """)
>>>
>>> soup.findAll('td')
[<td>First!</td>, <td valign="top">Second!</td>]
>>>
>>> soup.findAll('td', valign='top')
[<td valign="top">Second!</td>]
Adding a combination of Chris Redford's and Amr's answer, you can also search for an attribute name with any value with the select command:
from bs4 import BeautifulSoup as Soup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = Soup(html, 'lxml')
results = soup.select('td[valign]')
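For comparison, a small sketch of what the two selectors return. The sample cells are made up. Note that an attribute selector does not exclude tags carrying additional attributes, so the "only this attribute" condition from the question still needs a length check on attrs:

```python
from bs4 import BeautifulSoup

html = """<table><tr>
<td valign="top">a</td>
<td width="580" valign="top">b</td>
<td valign="bottom">c</td>
<td>d</td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

any_valign = [td.get_text() for td in soup.select("td[valign]")]        # attribute present, any value
top_valign = [td.get_text() for td in soup.select('td[valign="top"]')]  # attribute equals "top"
# CSS alone cannot say "and no other attributes", so filter afterwards.
only_valign = [td.get_text() for td in soup.select('td[valign="top"]')
               if len(td.attrs) == 1]

print(any_valign, top_valign, only_valign)
```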
If you are looking to pull all tags where a particular attribute is present at all, you can use the same code as the accepted answer, but instead of specifying a value for the tag, just put True.
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : True})
This will return all td tags that have valign attributes. This is useful if your project involves pulling info from a tag like div that is used all over, but can handle very specific attributes that you might be looking for.
find using an attribute in any tag
<th class="team" data-sort="team">Team</th>
soup.find_all(attrs={"class": "team"})
<th data-sort="team">Team</th>
soup.find_all(attrs={"data-sort": "team"})
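The attrs dictionary matters most for names like data-sort, which are not valid Python identifiers and so cannot be written as keyword arguments. A minimal sketch follows; the second header cell is made up for illustration:

```python
from bs4 import BeautifulSoup

html = ('<table><tr><th class="team" data-sort="team">Team</th>'
        '<th data-sort="points">Points</th></tr></table>')
soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so the keyword form is class_
by_class = soup.find_all(class_="team")
# data-sort contains a hyphen, so it has to go through the attrs dict
by_data = soup.find_all(attrs={"data-sort": "team"})

print([th.get_text() for th in by_class], [th.get_text() for th in by_data])
```

Neither call names a tag, so both search every tag in the document.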
I'd like to know how to fix broken html tags before parsing it with Beautiful Soup.
In the following script the td> needs to be replaced with <td.
How can I do the substitution so Beautiful Soup can see it?
from BeautifulSoup import BeautifulSoup
s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""
a = BeautifulSoup(s)
left = []
right = []
for tr in a.findAll('tr'):
    l, r = tr.findAll('td')
    left.extend(l.findAll(text=True))
    right.extend(r.findAll(text=True))
print(left + right)
Edit (working):
I grabbed what should be a complete list of all HTML tags from the W3C to match against (the code needs import re). Try it out:
fixedString = re.sub(">\s*(\!--|\!DOCTYPE|\
a|abbr|acronym|address|applet|area|\
b|base|basefont|bdo|big|blockquote|body|br|button|\
caption|center|cite|code|col|colgroup|\
dd|del|dfn|dir|div|dl|dt|\
em|\
fieldset|font|form|frame|frameset|\
head|h1|h2|h3|h4|h5|h6|hr|html|\
i|iframe|img|input|ins|\
kbd|\
label|legend|li|link|\
map|menu|meta|\
noframes|noscript|\
object|ol|optgroup|option|\
p|param|pre|\
q|\
s|samp|script|select|small|span|strike|strong|style|sub|sup|\
table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
u|ul|\
var)>", "><\g<1>>", s)
bs = BeautifulSoup(fixedString)
Produces:
>>> print s
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>
>>> print re.sub(">\s*(\!--|\!DOCTYPE|\
a|abbr|acronym|address|applet|area|\
b|base|basefont|bdo|big|blockquote|body|br|button|\
caption|center|cite|code|col|colgroup|\
dd|del|dfn|dir|div|dl|dt|\
em|\
fieldset|font|form|frame|frameset|\
head|h1|h2|h3|h4|h5|h6|hr|html|\
i|iframe|img|input|ins|\
kbd|\
label|legend|li|link|\
map|menu|meta|\
noframes|noscript|\
object|ol|optgroup|option|\
p|param|pre|\
q|\
s|samp|script|select|small|span|strike|strong|style|sub|sup|\
table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
u|ul|\
var)>", "><\g<1>>", s)
<tr><td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>
This one should match broken ending tags as well (</endtag>):
re.sub(">\s*(/?)(\!--|\!DOCTYPE|\
a|abbr|acronym|address|applet|area|\
b|base|basefont|bdo|big|blockquote|body|br|button|\
caption|center|cite|code|col|colgroup|\
dd|del|dfn|dir|div|dl|dt|\
em|\
fieldset|font|form|frame|frameset|\
head|h1|h2|h3|h4|h5|h6|hr|html|\
i|iframe|img|input|ins|\
kbd|\
label|legend|li|link|\
map|menu|meta|\
noframes|noscript|\
object|ol|optgroup|option|\
p|param|pre|\
q|\
s|samp|script|select|small|span|strike|strong|style|sub|sup|\
table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
u|ul|\
var)>", "><\g<1>\g<2>>", s)
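The same pattern can be built from a list of tag names with a raw string, which avoids the backslash line continuations. This is only a sketch: TAGS is a shortened, illustrative list and repair is a hypothetical helper name. Like the regex above, it only catches a bare tag name that directly follows the > of a previous tag:

```python
import re

# Shortened, illustrative list; extend it with the full set of tag names as needed.
TAGS = ["table", "tbody", "td", "th", "tr", "div", "span", "p", "a"]

alternation = "|".join(re.escape(tag) for tag in TAGS)
# ">" ending the previous tag, optional whitespace, then a tag name
# (optionally preceded by "/") that is missing its leading "<".
BROKEN_TAG = re.compile(r">\s*(/?)(%s)>" % alternation)

def repair(html):
    # Re-insert the missing "<"; the whitespace between the tags is dropped.
    return BROKEN_TAG.sub(r"><\1\2>", html)

s = "<tr>\ntd>LABEL1</td><td>INPUT1</td>\n</tr>"
fixed = repair(s)
print(fixed)
```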
If td> -> <td> is the only thing you're concerned about, try the following (the lookbehind leaves the well-formed <td> and </td> tags untouched):
myString = re.sub(r'(?<![</])td>', '<td>', myString)
Do this before sending myString to BeautifulSoup. If there are other broken tags, give us some examples and we'll work on it :)
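A quick end-to-end check on the string from the question, as a sketch. A plain replacement of td> would also hit the td> inside every well-formed <td> and </td>, so this version uses a negative lookbehind to skip occurrences preceded by < or /:

```python
import re
from bs4 import BeautifulSoup

s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

# Only a "td>" that is NOT preceded by "<" or "/" is missing its opening bracket.
fixed = re.sub(r"(?<![</])td>", "<td>", s)

soup = BeautifulSoup(fixed, "html.parser")
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)
```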