BeautifulSoup getting href [duplicate] - python

This question already has answers here:
retrieve links from web page using python and BeautifulSoup [closed]
(16 answers)
Closed 9 years ago.
I have the following soup:
<a href="some_url">next</a>
<span class="class">...</span>
From this I want to extract the href, "some_url"
I can do it if I only have one tag, but here there are two tags. I can also get the text 'next' but that's not what I want.
Also, is there a good description of the API somewhere with examples. I'm using the standard documentation, but I'm looking for something a little more organized.

You can use find_all in the following way to find every a element that has an href attribute, and print each one:
# Python 2, with the old BeautifulSoup 3 library (method name is findAll)
from BeautifulSoup import BeautifulSoup
html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''
soup = BeautifulSoup(html)
for a in soup.findAll('a', href=True):
    print "Found the URL:", a['href']
# The output would be:
# Found the URL: some_url
# Found the URL: another_url
# Python 3, with BeautifulSoup 4
from bs4 import BeautifulSoup
html = '''<a href="https://some_url.com">next</a>
<span class="class"><a href="https://some_other_url.com">another_url</a></span>'''
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])
# The output would be:
# Found the URL: https://some_url.com
# Found the URL: https://some_other_url.com
Note that if you're using an older version of BeautifulSoup (before version 4) the name of this method is findAll. In version 4, BeautifulSoup's method names were changed to be PEP 8 compliant, so you should use find_all instead.
If you want all tags with an href, you can omit the name parameter:
href_tags = soup.find_all(href=True)
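To make the effect of omitting the name parameter concrete, here is a minimal sketch with invented sample HTML: every element carrying an href attribute is matched, not just <a> tags.

```python
# Minimal sketch (invented sample HTML): find_all(href=True) with no tag
# name matches any element that has an href attribute.
from bs4 import BeautifulSoup

html = '''<a href="/page">a tag</a>
<link rel="stylesheet" href="style.css">
<area href="/map-region" alt="region">
<span>no href here</span>'''

soup = BeautifulSoup(html, 'html.parser')
hrefs = [tag['href'] for tag in soup.find_all(href=True)]
print(hrefs)  # ['/page', 'style.css', '/map-region']
```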

Related

How to scrape href with Python 3.5 and BeautifulSoup [duplicate]

This question already has answers here:
retrieve links from web page using python and BeautifulSoup [closed]
(16 answers)
Closed 6 years ago.
I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
That's my code:
#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)
I get [None, None, .... None, None] back.
I need a list with all the hrefs from the tags with class project-title.
Any ideas?
Try something like this:
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)
This will return all of the href values. As I can see from your link, a lot of the href attributes just contain "#". You can avoid these with a simple regex for proper links, or just ignore the "#" symbols.
project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you some trash links like /discover?ref=nav, so if you want to narrow it down use a proper regex for the links you need.
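As a sketch of that narrowing step (the sample hrefs and the pattern are assumptions for illustration, not taken from the actual Kickstarter page):

```python
# Hedged sketch: keep only absolute http(s) links and drop fragments ("#")
# and site-navigation paths such as "/discover?ref=nav".
import re

hrefs = ["#", "/discover?ref=nav",
         "https://www.kickstarter.com/projects/123/some-project",
         "http://example.com/other"]

pattern = re.compile(r'^https?://')  # absolute links only
project_href = [h for h in hrefs if pattern.match(h)]
print(project_href)
```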
EDIT:
To solve the problem you mentioned in the comments:
soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class': 'project-card-content'}):
    print(i.a['href'])
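Instead of discarding relative paths like /discover?ref=nav, they can also be resolved against the page URL with the standard library; a minimal sketch (the sample hrefs are invented):

```python
# Sketch: urljoin() turns relative hrefs into absolute URLs and leaves
# already-absolute URLs unchanged.
from urllib.parse import urljoin

base = "https://www.kickstarter.com/discover/advanced"
hrefs = ["/discover?ref=nav", "https://www.kickstarter.com/projects/123/x"]
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
```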

Python and BeautifulSoup Opening pages

I am wondering how I would open another page in my list with BeautifulSoup. I have followed this tutorial, but it does not tell us how to open another page on the list. Also, how would I open an "a href" that is nested inside a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")
for link in soup.find_all("a"):
    print link.get("href")
for link in soup.find_all("a"):
    print link.text
for link in soup.find_all("a"):
    print link.text, link.get("href")
g_data = soup.find_all("div", {"class": "listing__left-column"})
for item in g_data:
    print item.contents
for item in g_data:
    print item.contents[0].text
    print link.get('href')
for item in g_data:
    print item.contents[0]
I am trying to collect the href's from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href values, then the following approach should work, based on the image you have posted:
import requests
from bs4 import BeautifulSoup
r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
By adding href=True to the find_all(), it ensures that only a elements that contain an href attribute are returned, removing the need to test for it as an attribute.
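A small demonstration of that filter, using invented HTML: a named anchor has no href, so href=True leaves it out, while a plain find_all('a') would not.

```python
# Sketch (invented HTML): href=True skips <a> tags without an href.
from bs4 import BeautifulSoup

html = '<a name="top"></a><a href="/home">home</a>'
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.find_all('a')))             # 2
print(len(soup.find_all('a', href=True)))  # 1
```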
Just to warn you, you might find some websites will lock you out after one or two attempts, as they are able to detect that you are trying to access a site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it is still what you expect.
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup
# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60  # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
I had the same problem, and I would like to share my findings, because when I tried the answer above it did not work for me for some reason; after some research I found something interesting.
You might need to find the attributes of the "href" link itself:
You will need the exact class which contains the href link; in your case, I am thinking it is "class": "listing__left-column". Assign the result to a variable, say "all", for example:
from bs4 import BeautifulSoup
all = soup.find_all("div", {"class": "listing__left-column"})
for item in all:
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            a = link.attrs['href']
            print(a)
            print("")
I did this, and I was able to get into another link which was embedded in the home page.

Removing certain tags with beautifulsoup and python

Question
I am trying to remove style tags like <h2> and <div class=...> from my HTML file, which is being downloaded by BeautifulSoup. I do want to keep what the tags contain (like the text).
However, this does not seem to work.
What I have tried
for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find("div", {"class": "product_specifications bottom_l js_readmore_content"})
    print "<hr style='border-width:5px;'>"
    for style in table.find_all('style'):
        if 'style' in style.attrs:
            del style.attrs['style']
    print table
URLs I tried to work with:
Python HTML parsing with beautiful soup and filtering stop words
Remove class attribute from HTML using Python and lxml
BeautifulSoup Tag Removal
You can use decompose():
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
If you want to clear just the text, or to keep the removed element after taking it out of the tree, use clear() and extract() (described just above decompose() in the documentation).
You are looking for unwrap().
your_soup.tag.unwrap()
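A side-by-side sketch of the three removal styles, on invented HTML: decompose() deletes the tag and its contents, unwrap() removes the tag but keeps its contents (which is what the question asks for), and extract() removes the tag but returns it for reuse.

```python
# Sketch (invented HTML) comparing decompose(), unwrap(), and extract().
from bs4 import BeautifulSoup

def make_soup():
    return BeautifulSoup('<div><h2>Title</h2> text</div>', 'html.parser')

s1 = make_soup()
s1.h2.decompose()        # tag and contents are gone
print(s1)                # <div> text</div>

s2 = make_soup()
s2.h2.unwrap()           # tag is gone, contents remain
print(s2)                # <div>Title text</div>

s3 = make_soup()
removed = s3.h2.extract()  # tag is removed but returned
print(removed)             # <h2>Title</h2>
```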

How to use CSS selectors to retrieve specific links lying in some class using BeautifulSoup?

I am new to Python, and I am learning it for scraping purposes. I am using BeautifulSoup to collect links (i.e. the href of 'a' tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I am using Firebug to inspect the element and to get the CSS path, but this code returns nothing. I am looking for the fix, and also for suggestions on how to choose proper CSS selectors to retrieve desired links from any site. I wrote this piece of code:
from bs4 import BeautifulSoup
import requests
url = "http://allevents.in/lahore/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for link in soup.select('html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
    print link.get('href')
The page is not the most friendly in the use of classes and markup, but even so your CSS selector is too specific to be useful here.
If you want Upcoming Events, you want just the first <div class="events-horizontal">, then just grab the <div class="title"><a href="..."></a></div> tags, so the links on titles:
upcoming_events_div = soup.select_one('div.events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
    print(link['href'])
Note that you should not use r.text; use r.content and leave decoding to BeautifulSoup. See: Encoding issue of a character in utf-8
import bs4, requests
res = requests.get("http://allevents.in/lahore/")
soup = bs4.BeautifulSoup(res.text)
for link in soup.select('a[property="schema:url"]'):
    print link.get('href')
This code will work fine!!

Extract href from html

I am given the following html :
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> <A HREF="...">Acaryochloris_marina_MBIC11017_</A> Jun 12 2013
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> <A HREF="...">Acetobacter_pasteurianus_386B_u</A> Aug 8 2013
and many more...
I want to extract the href from here.
Here's my python script : (page_source contains the html)
soup = BeautifulSoup(page_source)
links = soup.find_all('a',attrs={'href': re.compile("^http://")})
for tag in links:
    link = tag.get('href', None)
    if link != None:
        print link
But this keeps returning the following error :
links = soup.find_all('A',attrs={'HREF': re.compile("^http://")})
TypeError: 'NoneType' object is not callable
You are using BeautifulSoup version 3, not version 4. soup.find_all is then not interpreted as a method, but as a search for the first <find_all> element. Because there is no such element, soup.find_all resolves to None.
Install BeautifulSoup 4 instead, the import is:
from bs4 import BeautifulSoup
BeautifulSoup 3 is instead imported as from BeautifulSoup import BeautifulSoup.
If you are sure you wanted to use BeautifulSoup 3 (not recommended), then use:
links = soup.findAll('a', attrs={'href': re.compile("^http://")})
As a side note, because you limit your search to <a> tags with a certain href value, there will always be an href attribute on the elements that are found. Using .get() and testing for None is entirely redundant. The following is equivalent:
links = soup.find_all('a',attrs={'href': re.compile("^http://")})
for tag in links:
    link = tag['href']
    print link
BeautifulSoup 4 also supports CSS selectors, which could make your query a little more readable still, removing the need for you to specify a regular expression:
for tag in soup.select('a[href^="http://"]'):
    link = tag['href']
    print link
Why not use the split command?
Iterate over all lines of the file and do something like that:
href = line.split("HREF=\"")[1].split("\"")[0]
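A worked example of that split() line, on one invented directory-listing line; note it assumes exactly the HREF="..." spelling and breaks on any variation, which is why a parser is usually the safer choice.

```python
# Sketch: string splitting pulls out the href from one invented line.
line = '<IMG SRC="x.gif"> <A HREF="http://example.com/dir/">dir</A>'
href = line.split('HREF="')[1].split('"')[0]
print(href)  # http://example.com/dir/
```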
