Extract link from URL using BeautifulSoup - Python

I am trying to get the web link from the following, using BeautifulSoup:
<div class="alignright single">
<a href="http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/" rel="next">Hadith on Clothing: Women should lower their garments to cover their feet »</a>
</div>
My code is as follows:
from bs4 import BeautifulSoup
import urllib2
url1 = "http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
nextlink = soup.findAll("div", {"class" : "alignright single"})
a = nextlink.find('a')
print a.get('href')
I get the following error, please help
a = nextlink.find('a')
AttributeError: 'ResultSet' object has no attribute 'find'

Use .find() if you want to find just one match:
nextlink = soup.find("div", {"class" : "alignright single"})
or loop over all matches:
for nextlink in soup.findAll("div", {"class" : "alignright single"}):
    a = nextlink.find('a')
    print a.get('href')
The latter part can also be expressed as:
a = nextlink.find('a', href=True)
print a['href']
where the href=True part only matches elements that have a href attribute, which means that you won't have to use a.get() because the attribute will be there (alternatively, no <a href="..."> link is found and a will be None).
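A minimal sketch of that difference, using made-up HTML (Python 3 syntax):

```python
from bs4 import BeautifulSoup

# Made-up two-link snippet: one <a> without href, one with.
soup = BeautifulSoup('<a>no link</a><a href="/next">next</a>', 'html.parser')

first, second = soup.find_all('a')
print(first.get('href'))    # None: .get() is safe when the attribute is missing
print(second['href'])       # /next

# href=True skips tags without the attribute entirely
print(soup.find('a', href=True)['href'])    # /next
```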
For the given URL in your question, there is only one such link, so .find() is probably most convenient. It may even be possible to just use:
nextlink = soup.find('a', rel='next', href=True)
if nextlink is not None:
    print nextlink['href']
with no need to find the surrounding div. The rel="next" attribute looks specific enough for your needs.
As an extra tip: make use of the response headers to tell BeautifulSoup what encoding to use for a page; the urllib2 response object can tell you what, if any, character set the server thinks the HTML page is encoded in:
response = urllib2.urlopen(url1)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
Quick demo of all the parts:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> response = urllib2.urlopen('http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/')
>>> soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
>>> soup.find('a', rel='next', href=True)['href']
u'http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/'

findAll() returns a ResultSet (a list of matches), so you need to index into it. Try this instead:
nextlink = soup.findAll("div", {"class" : "alignright single"})[0]
Or, since there's only one match, the find() method also ought to work:
nextlink = soup.find("div", {"class" : "alignright single"})
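Putting the fix together, a minimal self-contained sketch of the corrected flow (the HTML snippet and href below are stand-ins for the live page):

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched page content
html = '''
<div class="alignright single">
<a href="/2013/07/21/next-hadith/" rel="next">Next post »</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
nextlink = soup.find('div', {'class': 'alignright single'})  # find(), not findAll()
a = nextlink.find('a')
print(a.get('href'))   # /2013/07/21/next-hadith/
```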

Related

I can't get the a tag using BeautifulSoup, though I can get other tags

I'm just trying to get data from a webpage called "Elgiganten" (https://www.elgiganten.se/).
I want to get each product's name and URL. When I tried to get the a tag I got an empty list, but I could get the span tag, even though they were in the same div tag.
Here is the whole code:
from bs4 import BeautifulSoup
import requests
respons = requests.get("https://www.elgiganten.se")
soup = BeautifulSoup(respons.content, "lxml")
g_data = soup.find_all("div", {"class": "col-flex S-order-1"})
for item in g_data:
    print(item.contents[1].find_all("span")[0])
    print(item.contents[1].find_all("a", {"class": "product-name"}))
I hope someone can tell me why the a tag seems to be invisible, and how to fix the issue.
Go for the a tags directly. You can extract both the product name and the URL from that tag:
from bs4 import BeautifulSoup
import requests
respons = requests.get("https://www.elgiganten.se")
soup = BeautifulSoup(respons.content, "lxml")
g_data = soup.find_all("a", {"class": "product-name"}, href=True)
for item in g_data:
    print(item['title'], item['href'])
If you wish to stick to the way you started, the following is how you can achieve that:
import requests
from bs4 import BeautifulSoup
respons = requests.get("https://www.elgiganten.se")
soup = BeautifulSoup(respons.text,"lxml")
for item in soup.find_all(class_="mini-product-content"):
    product_name = item.find("span", class_="table-cell").text
    product_link = item.find("a", class_="product-name").get("href")
    print(product_name, product_link)
Try:
g_data = soup.find_all("a", class_="product-name")

How to get the full link using BeautifulSoup

The link.get("href") call is not returning the full link. The HTML file contains the full link, but link.get("href") returns:
"navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO"
sub_site = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"
response = urllib.request.urlopen(sub_site)
data = response.read()
soup = BeautifulSoup(data,'lxml')
for link in soup.find_all('a'):
    url = link.get("href")
    print(url)
Use select and it seems to print fine:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.fotoregistro.com.br/fotolivros/180-slim?cpmdsc=MOZAO')
soup = bs(r.content, 'lxml')
print([item['href'] for item in soup.select('.warp_lightbox')])
Use
print([item['href'] for item in soup.select('[href]')])
for all links.
Let me focus on the specific part of the HTML that causes your problem:
<a class='warp_lightbox' title='Comprar' href='//www.fotoregistro.com.br/
navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'><img src='
//sh.digipix.com.br/subhomes/_lojas_consumer/paginas/fotolivro/img/180slim/vitrine/classic_01_tb.jpg' alt='slim' />
</a>
You can get it by doing:
for link in soup.find_all('a', {'class': 'warp_lightbox'}):
    url = link.get("href")
    break
You find out that url is:
'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'
You can see two important patterns at the beginning of the string:
// which is a way to keep the current protocol (a protocol-relative URL);
\r which is the ASCII carriage return (CR) character.
When you print it, you simply lose this part:
//www.fotoregistro.com.br/\r
If you need the raw string, you can use repr in your for loop:
print(repr(url))
and you get:
//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
If you need the path, you can replace the initial part:
base = 'www.fotoregistro.com.br/'
for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href").replace('//www.fotoregistro.com.br/\r', base)
    print(url)
and you get:
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/preview=true/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
...
Without specifying the class:
for link in soup.find_all('a'):
    url = link.get("href")
    print(repr(url))
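If the goal is a usable absolute URL, here is a small sketch (assuming the site is served over https) that strips the stray \r and resolves the protocol-relative // prefix with the standard library:

```python
from urllib.parse import urljoin

# A shortened stand-in for one of the scraped href values
raw = '//www.fotoregistro.com.br/\rnavhome.php?lightbox&cpmdsc=MOZAO'

cleaned = raw.replace('\r', '')                               # drop the stray carriage return
full = urljoin('https://www.fotoregistro.com.br/', cleaned)   # resolve the // prefix
print(full)   # https://www.fotoregistro.com.br/navhome.php?lightbox&cpmdsc=MOZAO
```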

BeautifulSoup and remove entire tag

I'm working with BeautifulSoup. I would like the entire tag to be removed whenever I see an <a href> tag, but currently it is not.
For example, if I have:
<a href="/psf-landing/">
This is a test message
</a>
Currently, I get:
<a>
This is a test message
</a>
So, how can I get just:
This is a test message
Here is my code:
soup = BeautifulSoup(content_driver, "html.parser")
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
for titles in soup.findAll('a'):
    del titles['href']
tree = soup.prettify()
Try the .extract() method; in your code, you're only deleting an attribute, not the tag:
for titles in soup.findAll('a'):
    if titles.get('href') is not None:
        titles.extract()
Here you can see detailed examples: Dzone NLP examples
What you need is:
text = soup.get_text(strip=True)
Here is a full example:
from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
print (text)
You are looking for the unwrap() method. Have a look at the following snippet:
html = '''
<a href="/psf-landing/">
This is a test message
</a>'''
soup = BeautifulSoup(html, 'html.parser')
for el in soup.find_all('a', href=True):
    el.unwrap()
print(soup)
# This is a test message
Using href=True will match only the tags that have href as an attribute.
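To see how the two suggestions differ on the question's snippet: .extract() removes the tag together with its text, while unwrap() removes the tag but keeps the text.

```python
from bs4 import BeautifulSoup

html = '<p>Hello <a href="/psf-landing/">test message</a></p>'

soup = BeautifulSoup(html, 'html.parser')
soup.find('a').extract()      # tag and its text are both gone
print(soup)                   # <p>Hello </p>

soup = BeautifulSoup(html, 'html.parser')
soup.find('a').unwrap()       # tag is gone, text stays
print(soup)                   # <p>Hello test message</p>
```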

Python and BeautifulSoup Opening pages

I am wondering how I would open another page in my list with BeautifulSoup. I have followed this tutorial, but it does not explain how to open another page from the list. Also, how would I open an "a href" that is nested inside a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")
for link in soup.find_all("a"):
    print link.get("href")
for link in soup.find_all("a"):
    print link.text
for link in soup.find_all("a"):
    print link.text, link.get("href")
g_data = soup.find_all("div", {"class":"listing__left-column"})
for item in g_data:
    print item.contents
for item in g_data:
    print item.contents[0].text
    print link.get('href')
for item in g_data:
    print item.contents[0]
I am trying to collect the href's from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href attributes, then the following approach should work based on the image you have posted:
import requests
from bs4 import BeautifulSoup
r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
By adding href=True to the find_all(), it ensures that only a elements that contain an href attribute are returned therefore removing the need to test for it as an attribute.
Just to warn you: you might find some websites will lock you out after one or two attempts, as they can detect that you are accessing the site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it is still what you expect.
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup
# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60  # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
I had the same problem and would like to share my findings. I tried the answer above, but for some reason it did not work for me; after some research I found something interesting.
You might need to look at the attributes of the "href" link itself.
You need the exact class which contains the href link, in your case "class":"listing__left-column", and you can assign the matches to a variable, say all, for example:
from bs4 import BeautifulSoup
import requests

# Build soup from the page you are scraping
r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content, "html.parser")

all = soup.find_all("div", {"class": "listing__left-column"})
for item in all:
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            a = link.attrs['href']
            print(a)
            print("")
I did this and was able to get into another link which was embedded in the home page.

Retrieve the first href from a div tag

I need to retrieve the href containing /questions/20702626/javac1-8-class-not-found, but the output I get from the code below is //stackoverflow.com:
from bs4 import BeautifulSoup
import urllib2
url = "http://stackoverflow.com/search?q=incorrect+operator"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
for tag in soup.find_all('div'):
    if tag.get("class") == ['summary']:
        for tag in soup.find_all('div'):
            if tag.get("class") == ['result-link']:
                for link in soup.find_all('a'):
                    print link.get('href')
                    break
Instead of making nested loops, write a CSS selector:
for link in soup.select('div.summary div.result-link a'):
    print link.get('href')
Which is not only more readable, but also solves your problem. It prints:
/questions/11977228/incorrect-answer-in-operator-overloading
/questions/8347592/sizeof-operator-returns-incorrect-size
/questions/23984762/c-incorrect-signature-for-assignment-operator
...
/questions/24896659/incorrect-count-when-using-comparison-operator
/questions/7035598/patter-checking-check-of-incorrect-number-of-operators-and-brackets
Additional note: you might want to look into using the Stack Exchange API instead of the current web-scraping/HTML-parsing approach.
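As a sketch of what that would look like, the API's /2.3/search endpoint accepts intitle and site parameters; the query URL can be built with the standard library:

```python
from urllib.parse import urlencode

# Build the Stack Exchange API request equivalent to the scraped search page
base = "https://api.stackexchange.com/2.3/search"
params = {"intitle": "incorrect operator", "site": "stackoverflow"}
url = base + "?" + urlencode(params)
print(url)   # https://api.stackexchange.com/2.3/search?intitle=incorrect+operator&site=stackoverflow
```

Each item in the JSON response then carries a link field with the full question URL, so no HTML parsing is needed.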
