from bs4 import BeautifulSoup
source_code = """
"""
soup = BeautifulSoup(source_code)
print soup.a['name'] #prints 'One'
Using BeautifulSoup, I can grab the first name attribute, which is One, but I am not sure how to print the second, which is Two.
Anyone able to help me out?
You should read the documentation. There you can see that soup.find_all returns a list,
so you can iterate over the list and, for each element, extract the attribute you are looking for. You could do something like this (not tested):
from bs4 import BeautifulSoup
soup = BeautifulSoup(source_code)
for item in soup.find_all('a'):
    print item['name']
To get any a child element other than the first, use find_all. For the second a tag:
print soup.find_all('a', recursive=False)[1]['name']
To stay on the same level and avoid a deep search, pass the argument: recursive=False
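To make the difference concrete, here is a small sketch. The markup is made up (the question's source_code is not shown), so treat the tag layout as an assumption. It compares the default recursive search with recursive=False and then picks the second top-level a tag by index:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a stand-in that only assumes two top-level <a> tags
# plus one nested <a>, since the original source_code is not available.
source_code = (
    '<a name="One">first</a>'
    '<div><a name="Nested">inner</a></div>'
    '<a name="Two">second</a>'
)
soup = BeautifulSoup(source_code, "html.parser")

# The default search is recursive and returns matches in document order.
all_names = [a["name"] for a in soup.find_all("a")]

# recursive=False only looks at direct children of the object being searched.
top_level = soup.find_all("a", recursive=False)
top_names = [a["name"] for a in top_level]

# The result behaves like a list, so the second match is at index 1.
second_name = top_level[1]["name"]

print(all_names)
print(top_names)
print(second_name)
```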
This will give you all the "a" tags:
>>> from BeautifulSoup import BeautifulSoup
>>> aTags = BeautifulSoup(source_code).findAll('a')
>>> for tag in aTags: print tag["name"]
...
One
Two
from bs4 import BeautifulSoup
import requests
yt_link = "https://www.youtube.com/watch?v=bKDdT_nyP54"
response = requests.get(yt_link)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.findAll('div', {'class': 'style-scope ytd-app'})
print(title)
It prints an empty list, [], and if I use the find() method instead, it prints None.
Why does this happen? Please help me, I am stuck here.
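Yes, the title is difficult to find because YouTube uses JavaScript and dynamic content rendering, so the div you are searching for is most likely not in the HTML that requests receives. What you can do is print soup first and look for the title in it: it appears in a meta tag, and you can extract it from there. This will probably work for any video URL.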
from bs4 import BeautifulSoup
import requests
yt_link = "https://www.youtube.com/watch?v=bKDdT_nyP54"
response = requests.get(yt_link)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('meta',attrs={"name":"title"})
print(title.get("content"))
output:
Akon - Smack That (Official Music Video) ft. Eminem
This happens because find() returns the first matching object, or None if nothing matches, while findAll() returns a list of matching objects, or an empty list if nothing matches.
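Here is a small offline sketch of those two return values, so you can see them without hitting YouTube. The HTML string and the "Example Video" title are invented for illustration. Only find(), find_all() and Tag.get() are real Beautiful Soup calls:

```python
from bs4 import BeautifulSoup

# Invented stand-in page: a meta title in the head, and no matching div.
html = (
    '<html><head><meta name="title" content="Example Video"></head>'
    '<body><p>placeholder</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Nothing matches this class, so find() gives None and find_all() gives [].
missing_one = soup.find("div", {"class": "no-such-class"})
missing_all = soup.find_all("div", {"class": "no-such-class"})

# Guard against None before reading an attribute from the result.
meta = soup.find("meta", attrs={"name": "title"})
title = meta.get("content") if meta is not None else None

print(missing_one)
print(missing_all)
print(title)
```

Checking for None first avoids an AttributeError on pages where the meta tag is absent.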
I'm new to Python. Here are some lines of Python code that print out all the article titles on http://www.nytimes.com/.
import requests
from bs4 import BeautifulSoup
base_url = 'http://www.nytimes.com'
r = requests.get(base_url)
soup = BeautifulSoup(r.text)
for story_heading in soup.find_all(class_="story-heading"):
    if story_heading.a:
        print(story_heading.a.text.replace("\n", " ").strip())
    else:
        print(story_heading.contents[0].strip())
What do .a and .text mean?
Thank you very much.
First, let's see what printing one story_heading alone gives us:
>>> story_heading
<h2 class="story-heading">Mortgage Calculator</h2>
To extract only the a tag, we access it using story_heading.a:
>>> story_heading.a
Mortgage Calculator
To get only the text inside the tag itself, and not its attributes, we use .text:
>>> story_heading.a.text
'Mortgage Calculator'
Here,
.a gives you the first anchor tag
.text gives you the text within the tag
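If you want to try this without fetching the live page, here is a rough sketch on a made-up snippet. The two headings and the "/calc" link are assumptions, not the real nytimes.com markup. It walks through both branches of the loop from the question:

```python
from bs4 import BeautifulSoup

# Made-up markup: one heading that wraps a link, one with plain text only.
html = (
    '<h2 class="story-heading"><a href="/calc">\nMortgage Calculator\n</a></h2>'
    '<h2 class="story-heading">Plain heading</h2>'
)
soup = BeautifulSoup(html, "html.parser")

titles = []
for story_heading in soup.find_all(class_="story-heading"):
    if story_heading.a:
        # .a is the first <a> inside the heading; .text is its text content.
        titles.append(story_heading.a.text.replace("\n", " ").strip())
    else:
        # No <a> here (.a is None), so fall back to the heading's first child.
        titles.append(story_heading.contents[0].strip())

print(titles)
```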
I seem to be doing something wrong. I have an HTML source that I pull using urllib. I then use BeautifulSoup's findAll on this HTML to collect the elements whose ID is in a specified list. This works for me, however the output is messy and includes line breaks ("\n").
Python: 2.7.12
BeautifulSoup: bs4
I have tried to use prettify() to correct the output but always get an error:
AttributeError: 'ResultSet' object has no attribute 'prettify'
import urllib
import re
from bs4 import BeautifulSoup
cfile = open("test.txt")
clist = cfile.read()
clist = clist.split('\n')
i = 0
while i < len(clist):
    url = "https://example.com/" + clist[i]
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    soup = BeautifulSoup(htmltext, "html.parser")
    soup = soup.findAll(id=["id1", "id2", "id3"])
    print soup.prettify()
    i += 1
I'm sure there is something simple I am overlooking with this line:
soup = soup.findAll(id=["id1", "id2", "id3"])
I'm just not sure what. Sorry if this is a stupid question. I've only been using Python and Beautiful Soup for a few days.
You are reassigning the soup variable to the result of .findAll(), which is a ResultSet object (basically, a list of tags) which does not have the prettify() method.
The solution is to keep the soup variable pointing to the BeautifulSoup instance.
You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects:
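A minimal sketch of that idea, using invented markup in place of the page you download (the ids are the ones from your question, everything else is an assumption): keep soup pointing at the parsed document, store the matches under a different name, and call prettify() on each tag.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the downloaded page.
htmltext = (
    '<div id="id1"><p>one</p></div>'
    '<div id="id2">two</div>'
    '<span id="other">skip</span>'
)
soup = BeautifulSoup(htmltext, "html.parser")

# Store the result under a new name so soup stays a BeautifulSoup object.
matches = soup.find_all(id=["id1", "id2", "id3"])

# prettify() exists on each Tag, not on the list-like ResultSet.
pretty_blocks = [tag.prettify() for tag in matches]

# If only the text is wanted, get_text(strip=True) drops surrounding whitespace.
texts = [tag.get_text(strip=True) for tag in matches]

for block in pretty_blocks:
    print(block)
print(texts)
```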
findAll returns a list of matching tags, so your code is equivalent to [tag1, tag2, ...].prettify(),
which will not work.
Is there a way to find an element using only the data attribute in html, and then grab that value?
For example, with this line inside an html doc:
<ul data-bin="Sdafdo39">
How do I retrieve Sdafdo39 by searching the entire html doc for the element that has the data-bin attribute?
A little bit more accurate
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]
This way, the list being iterated contains only the ul elements that have the attribute you are looking for.
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc)  # html_doc must be defined before it is parsed
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]
You can use the find_all method to get all the tags and then keep only those that have "data-bin" in their attributes. After that, you can simply extract the corresponding value, like this:
from bs4 import BeautifulSoup
html_doc = """<ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc)
print [item["data-bin"] for item in bs.find_all() if "data-bin" in item.attrs]
# ['Sdafdo39']
You could solve this with gazpacho in just a couple of lines:
First, import and turn the html into a Soup object:
from gazpacho import Soup
html = """<ul data-bin="Sdafdo39">"""
soup = Soup(html)
Then you can just search for the "ul" tag and extract the data-bin attribute:
soup.find("ul").attrs["data-bin"]
# Sdafdo39
As an alternative if one prefers to use CSS selectors via select() instead of find_all():
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
soup = BeautifulSoup(html_doc)
# Select
soup.select('ul[data-bin]')
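select() returns a list of tags rather than the attribute values, so one more step is needed to get the value itself. A short sketch, reusing the markup above with a closing tag added and an explicit parser (both are my additions):

```python
from bs4 import BeautifulSoup

html_doc = '<ul class="foo">foo</ul><ul data-bin="Sdafdo39"></ul>'
soup = BeautifulSoup(html_doc, "html.parser")

# Every <ul> that carries a data-bin attribute, whatever its value.
values = [ul["data-bin"] for ul in soup.select("ul[data-bin]")]

# select_one() returns the first match (or None), for any tag name here.
first = soup.select_one("[data-bin]")
first_value = first["data-bin"] if first is not None else None

print(values)
print(first_value)
```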
I am trying to access content in certain td tags with Python and BeautifulSoup. I can either get the first td tag meeting the criteria (with find), or all of them (with findAll).
Now, I could just use findAll, get them all, and get the content I want out of them, but that seems inefficient (even if I put limits on the search). Is there any way to go to a certain td tag meeting the criteria I want? Say the third, or the 10th?
Here's my code so far:
from __future__ import division
from __future__ import unicode_literals
from __future__ import print_function
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
br = Browser()
url = "http://finance.yahoo.com/q/ks?s=goog+Key+Statistics"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
td = soup.findAll("td", {'class': 'yfnc_tablehead1'})
for x in range(len(td)):
    var1 = td[x]
    var2 = var1.contents[0]
    print(var2)
Is there any way to go to a certain td tag meeting the criteria I want? Say the third, or the 10th?
Well...
all_tds = [td for td in soup.findAll("td", {'class': 'yfnc_tablehead1'})]
print(all_tds[3])  # indexes are zero-based, so this is the fourth match
...there is no other way..
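If scanning the whole document is the worry, one thing that may help is the limit argument of find_all, which stops the search after that many matches. The sketch below is only an illustration: nth_match is a hypothetical helper, not part of Beautiful Soup, and the table is made up rather than taken from the Yahoo page.

```python
from bs4 import BeautifulSoup

# Made-up table with five matching cells.
cells = "".join('<td class="yfnc_tablehead1">cell %d</td>' % i for i in range(1, 6))
html = "<table><tr>" + cells + "</tr></table>"
soup = BeautifulSoup(html, "html.parser")


def nth_match(node, n, *args, **kwargs):
    """Hypothetical helper: return the n-th (1-based) match, or None.

    limit=n makes find_all stop as soon as n matches have been collected.
    """
    found = node.find_all(*args, limit=n, **kwargs)
    return found[n - 1] if len(found) >= n else None


third = nth_match(soup, 3, "td", {"class": "yfnc_tablehead1"})
tenth = nth_match(soup, 10, "td", {"class": "yfnc_tablehead1"})

print(third.contents[0])
print(tenth)
```

The search still visits everything before the n-th match, so this only saves the work that would come after it.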
find and findAll are very flexible. The BeautifulSoup.findAll docs say:
5. You can pass in a callable object which takes a Tag object as its only argument, and returns a boolean. Every Tag object that findAll encounters will be passed into this object, and if the call returns True then the tag is considered to match.
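A small sketch of that callable form, written against bs4's find_all with invented markup. The function name is_ul_with_data_bin is just an illustrative choice:

```python
from bs4 import BeautifulSoup

# Invented markup: only the second element is a <ul> with a data-bin attribute.
html = (
    '<ul class="foo">foo</ul>'
    '<ul data-bin="Sdafdo39"></ul>'
    '<p data-bin="other">text</p>'
)
soup = BeautifulSoup(html, "html.parser")


def is_ul_with_data_bin(tag):
    # Called once per Tag; returning True marks the tag as a match.
    return tag.name == "ul" and tag.has_attr("data-bin")


matches = soup.find_all(is_ul_with_data_bin)
values = [tag["data-bin"] for tag in matches]

print(values)
```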