I'm new to Python. Here are a few lines of Python code that print out all the article titles on http://www.nytimes.com/:
import requests
from bs4 import BeautifulSoup

base_url = 'http://www.nytimes.com'
r = requests.get(base_url)
soup = BeautifulSoup(r.text)

for story_heading in soup.find_all(class_="story-heading"):
    if story_heading.a:
        print(story_heading.a.text.replace("\n", " ").strip())
    else:
        print(story_heading.contents[0].strip())
What do .a and .text mean?
Thank you very much.
First, let's see what printing one story_heading alone gives us:
>>> story_heading
<h2 class="story-heading"><a href="...">Mortgage Calculator</a></h2>
To extract only the a tag, we access it using story_heading.a:
>>> story_heading.a
<a href="...">Mortgage Calculator</a>
To get only the text inside the tag, and not its attributes, we use .text:
>>> story_heading.a.text
'Mortgage Calculator'
Here,
.a gives you the first anchor tag
.text gives you the text within the tag
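If it helps, here is a minimal, self-contained version of the same steps (the HTML snippet and href value are made up for illustration):

from bs4 import BeautifulSoup

html = '<h2 class="story-heading"><a href="/example">Mortgage Calculator</a></h2>'
soup = BeautifulSoup(html, 'html.parser')
heading = soup.find(class_="story-heading")
print(heading.a)       # the first anchor tag: <a href="/example">Mortgage Calculator</a>
print(heading.a.text)  # just the text: Mortgage Calculator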
Related
I am wondering how I would open another page in my list with BeautifulSoup. I have followed this tutorial, but it does not explain how to open another page on the list. Also, how would I open an "a href" that is nested inside of a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup

r = requests.get("")
soup = BeautifulSoup(r.content)

soup.find_all("a")

for link in soup.find_all("a"):
    print link.get("href")

for link in soup.find_all("a"):
    print link.text

for link in soup.find_all("a"):
    print link.text, link.get("href")

g_data = soup.find_all("div", {"class":"listing__left-column"})

for item in g_data:
    print item.contents

for item in g_data:
    print item.contents[0].text
    print link.get('href')

for item in g_data:
    print item.contents[0]
I am trying to collect the hrefs from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href attributes, then the following approach should work, based on the image you have posted:
import requests
from bs4 import BeautifulSoup

r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
Adding href=True to the find_all() ensures that only a elements that contain an href attribute are returned, removing the need to test for the attribute separately.
Just to warn you, you might find some websites will lock you out after one or two attempts, as they are able to detect that you are trying to access a site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it is still what you expect.
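If that happens, one common mitigation is to send a browser-like User-Agent header with the request; a minimal sketch (the URL and User-Agent string here are placeholders, not anything from the original post):

import requests

headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}
r = requests.get("http://www.mywebsite.com/search/", headers=headers)
print r.status_code  # check you are getting a 200 rather than an error page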
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup

# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60      # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
I had the same problem and would like to share my findings. I tried the accepted answer, and for some reason it did not work for me, but after some research I found something interesting.
You might need to find the attributes of the "href" link itself. You will need the exact class which contains the href link; in your case, I am thinking it is "class":"listing__left-column". Assign the matching elements to a variable, say results, for example:
import requests
from bs4 import BeautifulSoup

r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)

results = soup.find_all("div", {"class":"listing__left-column"})
for item in results:
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            a = link.attrs['href']
            print(a)
            print("")
I did this and was able to get into another link that was embedded in the home page.
I'm using Beautiful Soup. There is a tag like this:
<li><a href="..."> s.r.o., <small>small</small></a></li>
I want to get the text within the anchor <a> tag only, without anything from the <small> tag in the output; i.e. " s.r.o., ".
I tried find('li').text[0] but it does not work.
Is there a command in BS4 which can do that?
One option would be to get the first element from the contents of the a element:
>>> from bs4 import BeautifulSoup
>>> data = '<li><a href="..."> s.r.o., <small>small</small></a></li>'
>>> soup = BeautifulSoup(data)
>>> print soup.find('a').contents[0]
s.r.o.,
Another one would be to find the small tag and get the previous sibling:
>>> print soup.find('small').previous_sibling
s.r.o.,
Well, there are all sorts of alternative/crazy options also:
>>> print next(soup.find('a').descendants)
s.r.o.,
>>> print next(iter(soup.find('a')))
s.r.o.,
Use .children:
soup.find('a').children.next()
 s.r.o.,
(In Python 3, that would be next(soup.find('a').children).)
If you would like to loop over and print the contents of all the anchor tags in an HTML string or web page (for a web page, use urlopen from urllib to fetch it first), this works:
from bs4 import BeautifulSoup

data = '<li><a href="...">s.r.o., <small>small</small></a></li> <li><a href="...">2nd</a></li> <li><a href="...">3rd</a></li>'
soup = BeautifulSoup(data, 'html.parser')
a_tag = soup('a')

for tag in a_tag:
    print(tag.contents[0])  # .contents locates the text within the <a> tags
Output:
s.r.o.,
2nd
3rd
a_tag is a list containing all the anchor tags; collecting them in a list enables group editing (if more than one <a> tag is present).
>>> print(a_tag)
[<a href="...">s.r.o., <small>small</small></a>, <a href="...">2nd</a>, <a href="...">3rd</a>]
From the documentation, the text of a tag can be retrieved through its .string property. Note that .string returns None while the tag still has more than one child, which is why the <small> tag is removed with decompose() first:
soup = BeautifulSoup('<li><a href="..."> s.r.o., <small>small</small></a></li>')
res = soup.find('a')
res.small.decompose()
print(res.string)
# s.r.o.,
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring
I am trying to get the links from a news website page (from one of its archives). I wrote the following lines of code in Python:
main.py contains:
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print articletext
An example of the object in tag.contents[0]:
<a href="...">ITC to issue 1:1 bonus</a>
But on running it I am getting the following error:
File "C:\Python27\crawler\main.py", line 4, in <module>
    text = articletext.getArticle(url)
File "C:\Python27\crawler\articletext.py", line 23, in getArticle
    return getArticleText(htmltext)
File "C:\Python27\crawler\articletext.py", line 18, in getArticleText
    articletext += tag.contents[0]
TypeError: cannot concatenate 'str' and 'Tag' objects
Can someone help me sort it out? I am new to Python programming. Thanks and regards.
You are using link_dictionary vaguely. If you are not using it for reading purposes, then try the following code:
import re

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"  # URL from the question
br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)

articletext = ""
for tag_li in soup.findAll('li', attrs={"data-section":"Op-Ed"}):
    for link in tag_li.findAll('a'):
        urlnew = link.get('href')
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print re.sub('\s+', ' ', articletext, flags=re.M)
Note: re is the regular expression module; it is imported above with import re.
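For reference, here is a small standalone example of what that re.sub call does (the sample string is made up):

import re

text = "ITC  to issue\n1:1   bonus"
print re.sub('\s+', ' ', text)  # prints: ITC to issue 1:1 bonus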
I believe you may want to try accessing the text inside the list item like so:
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.string
Edited: General Comments on getting links from a page
Probably the easiest data type to use to gather a bunch of links and retrieve them later is a dictionary.
To get links from a page using BeautifulSoup, you could do something like the following:
from urllib.request import urlopen  # Python 3
from bs4 import BeautifulSoup

link_dictionary = {}
with urlopen(url_source) as f:  # url_source is the URL of the page you are scraping
    soup = BeautifulSoup(f)
    for link in soup.findAll('a'):
        link_dictionary[link.string] = link.get('href')
This will provide you with a dictionary named link_dictionary, where every key is a string that is simply the text between the <a> </a> tags, and every value is the value of the href attribute.
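For instance, to print every title/URL pair collected above (the actual contents depend on whichever page you fetched):

for title, href in link_dictionary.items():
    print('{} -> {}'.format(title, href))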
How to combine this with your previous attempt
Now, if we combine this with the problem you were having before, we could try something like the following:
link_dictionary = {}
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    for link in tag.findAll('a'):
        link_dictionary[link.string] = link.get('href')
If this doesn't make sense, or you have a lot more questions, you will need to experiment first and try to come up with a solution before asking a new, clearer question.
You might want to use the powerful XPath query language with the faster lxml module. As simple as that:
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/archive/web/2010/06/19/'
html = etree.HTML(urllib2.urlopen(url).read())
for link in html.xpath("//li[@data-section='Business']/a"):
    print '{} ({})'.format(link.text, link.attrib['href'])
Update for @data-section='Chennai'
#!/usr/bin/python
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
html = etree.HTML(urllib2.urlopen(url).read())
for link in html.xpath("//li[@data-section='Chennai']/a"):
    print '{} => {}'.format(link.text, link.attrib['href'])
from bs4 import BeautifulSoup

source_code = """
<a name="One">one</a>
<a name="Two">two</a>
"""
soup = BeautifulSoup(source_code)
print soup.a['name']  # prints 'One'
Using BeautifulSoup, I can grab the first name attribute, which is One, but I am not sure how I can print the second, which is Two.
Is anyone able to help me out?
You should read the documentation. There you can see that soup.find_all returns a list, so you can iterate over the list and, for each element, extract the attribute you are looking for. So you should do something like this (not tested here):
from bs4 import BeautifulSoup

soup = BeautifulSoup(source_code)
for item in soup.find_all('a'):
    print item['name']
To get any a child element other than the first, use find_all. For the second a tag:
print soup.find_all('a', recursive=False)[1]['name']
To stay on the same level and avoid a deep search, pass the argument: recursive=False
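A small made-up example of what the recursive flag changes:

from bs4 import BeautifulSoup

html = '<div><a name="direct">x</a><p><a name="nested">y</a></p></div>'
div = BeautifulSoup(html, 'html.parser').find('div')

for a in div.find_all('a'):                   # default: searches all descendants
    print a['name']                           # prints: direct, nested
for a in div.find_all('a', recursive=False):  # only direct children of the <div>
    print a['name']                           # prints: direct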
This will give you all the tags of "a":
>>> from BeautifulSoup import BeautifulSoup
>>> aTags = BeautifulSoup(source_code).findAll('a')
>>> for tag in aTags: print tag["name"]
...
One
Two
I'm trying to scrape all the inner html from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care, I just want to get the internal text.
For example, for:
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
How can I extract:
Red
Blue
Yellow
Light green
Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance - I want to deal with any that may occur.
Is there a 'just get the visible HTML' type of method in BeautifulSoup?
----UPDATE------
On advice, trying:
soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p', text=True)
for i, p_tag in enumerate(p_tags):
    print str(i) + p_tag
But that doesn't help - it prints out:
0Red
1
2Blue
3
4Yellow
5
6Light
7green
8
Short answer: soup.findAll(text=True)
This has already been answered, here on StackOverflow and in the BeautifulSoup documentation.
UPDATE:
To clarify, a working piece of code:
>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
... print ''.join(node.findAll(text=True))
Red
Blue
Yellow
Light green
The accepted answer is great but it is 6 years old now, so here's the current Beautiful Soup 4 version of this answer:
>>> txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))
Red
Blue
Yellow
Light green
I have stumbled upon this very same problem and wanted to share the 2019 version of this solution. Maybe it helps somebody out.
# importing the modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# setting up your BeautifulSoup object
webpage = urlopen("https://insertyourwebpage.com")
soup = BeautifulSoup(webpage.read(), features="lxml")

p_tags = soup.find_all('p')

for each in p_tags:
    print(str(each.get_text()))
Notice that we loop over the list of tags one by one and call the get_text() method on each, which strips the tags so that only the text is printed.
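As an aside, get_text() also accepts separator and strip arguments that can tidy up whitespace; a quick sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Light <b>green</b></p>", "html.parser")
print(soup.p.get_text())                           # 'Light green'
print(soup.p.get_text(separator=" ", strip=True))  # same text, with each piece trimmed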
Also:
it is better to use the updated 'find_all()' in bs4 than the older findAll()
urllib2 was replaced in Python 3 by urllib.request and urllib.error
Now your output should be:
Red
Blue
Yellow
Light green
Hope this helps someone looking for an updated solution.
Normally the data scraped from a website will contain tags. To avoid those tags and show only the text content, you can use the text attribute.
For example,
from BeautifulSoup import BeautifulSoup
import urllib2

url = urllib2.urlopen("https://www.python.org")
content = url.read()
soup = BeautifulSoup(content)

title = soup.findAll("title")
paragraphs = soup.findAll("p")
print paragraphs[1]       # second paragraph, with tags
print paragraphs[1].text  # second paragraph, without tags
In this example, I collect all the paragraphs from the Python site and display them with and without tags.
First, convert the html to a string using str. Then, use the following code with your program:
import re

x = str(soup.find_all('p'))
content = re.sub("<.*?>", "", x)  # re.sub already returns a string
This is called a regex (regular expression). This one removes the HTML tags themselves, i.e. anything between a < and the next >, leaving only the text that sat between the tags.
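For example, running the same substitution on a small made-up snippet:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Light <b>green</b></p>", "html.parser")
x = str(soup.find_all('p'))       # '[<p>Light <b>green</b></p>]'
content = re.sub("<.*?>", "", x)  # strips every <...> tag
print(content)                    # prints: [Light green]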