I'm having an issue trying to get beautiful soup to find an a href with a specific title and extract the href only.
I have the code below but cant seem to make it get the href only(whatever is between the open " and close ") based on the hyperlink text found in the in that href.
res = requests.get(website_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
temp_tag_href = soup.select_one("a[href*=some text]")
sometexthrefonly = temp_tag_href.attrs['href']
Effectively, i would like it to go through the entire html parsed in soup and only return what is between the href open " and close " because the
that hyperlink text is 'some text'.
so the steps would be:
1: parse html,
2: look at all the a hrefs tags,
3: find the href that has the hyperlink text 'some text',
4: output only what is in between the href " " (not including the
"") for that href
Any help will greatly be appreciated!
ahmed,
So after some quick refreshers on requests and researching the BeautifulSoup library, I think you'll want something like the following:
res = requests.get(website_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
link = list(filter(lambda x: x['href'] == 'some text', soup.find_all('a')))[0]
print(link['href']) # since you don't specify output to where, I'll use stdout for simplicity
As it turns out in the Beautiful Soup Documentation there is a convenient way to access whatever attributes you want from an html element using dictionary lookup syntax. You can also do all kinds of lookups using this library.
If you are doing web scraping, it may also be useful to try switching to a library that supports XPATH, which allows you to write powerful queries such as //a[#href="some text"][1] which will get you the first link with url equal to "some text"
this should do the work:
from BeautifulSoup import BeautifulSoup
html = '''next
<div>later</div>
<h3>later</h3>'''
soup = BeautifulSoup(html)
# iterate all hrefs
for a in soup.find_all('a', href=True):
print("Next HREF: %s" % a['href'])
if a['href'] == 'some_text':
print("Found it!")
Related
I will be analyzing a lot of sites with different htmls and I am trying to find all lines that contain specific text(inside html) using BeautifulSoup.
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
for text in soup.find_all():
if "price" in text:
print text
This approach doesn't work(even though "price" is mentioned over 40x in html). Maybe there is even better approach to do this?
Why don't let BeautifulSoup find you the nodes containing the desired text:
for node in soup.find_all(text=lambda x: x and "price" in x):
print(node)
With bs4 4.7.1 you can use :contains pseudo class with * to consider all elements. Obviously some repeats as parents may contain children with same text. Here I search for price.
import requests
from bs4 import BeautifulSoup
url = 'https://www.visitsealife.com/brighton/tickets/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
items = soup.select('*:contains(price)')
print(items)
print(len(items))
To extract all text from a given URL, you could just use something like:
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
for element in soup.findAll(['script', 'style']):
element.extract()
text = soup.get_text()
This will also remove possibly unwanted text inside script and style sections. You could then search for your required text using that.
you don't have to use Beautiful soup to find the specific text in the html instead you can use the request for that For ex:
r = requests.get(url)
if 'specific text' in r.content:
print r.content
I am wondering how would I open another page in my list with BeautifulSoup? I have followed this tutorial, but it does not tell us how to open another page on the list. Also how would I open a "a href" that is nested inside of a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")
for link in soup.find_all("a"):
print link.get("href")
for link in soup.find_all("a"):
print link.text
for link in soup.find_all("a"):
print link.text, link.get("href")
g_data = soup.find_all("div", {"class":"listing__left-column"})
for item in g_data:
print item.contents
for item in g_data:
print item.contents[0].text
print link.get('href')
for item in g_data:
print item.contents[0]
I am trying to collect the href's from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href tags, then the following approach should work based on the image you have posted:
import requests
from bs4 import BeautifulSoup
r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
print 'href: ', a_tag['href']
By adding href=True to the find_all(), it ensures that only a elements that contain an href attribute are returned therefore removing the need to test for it as an attribute.
Just to warn you, you might find some websites will lock you out after one or two attempts as they are able to detect that you are trying to access a site via a script, rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it it still as you expect.
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup
# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
print 'href: ', a_tag['href']
# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"
for a_tag in soup.find_all('a', class_='listing-name', href=True):
print '-' * 60 # Add a line of dashes
print 'href: ', a_tag['href']
request_href = requests.get(base_url + a_tag['href'])
print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
I had the same problem and I will like to share my findings because I did try the answer, for some reasons it did not work but after some research I found something interesting.
You might need to find the attributes of the "href" link itself:
You will need the exact class which contains the href link in your case, I am thinking="class":"listing__left-column" and equate it to a variable say "all" for example:
from bs4 import BeautifulSoup
all = soup.find_all("div", {"class":"listing__left-column"})
for item in all:
for link in item.find_all("a"):
if 'href' in link.attrs:
a = link.attrs['href']
print(a)
print("")
I did this and I was able to get into another link which was embedded in the home page
I need to retrieve is the href containing /questions/20702626/javac1-8-class-not-found. But the output I get for the code below is //stackoverflow.com:
from bs4 import BeautifulSoup
import urllib2
url = "http://stackoverflow.com/search?q=incorrect+operator"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
for tag in soup.find_all('div'):
if tag.get("class")==['summary']:
for tag in soup.find_all('div'):
if tag.get("class")==['result-link']:
for link in soup.find_all('a'):
print link.get('href')
break;
Instead of making nested loops, write a CSS selector:
for link in soup.select('div.summary div.result-link a'):
print link.get('href')
Which is not only more readable, but also solves your problem. It prints:
/questions/11977228/incorrect-answer-in-operator-overloading
/questions/8347592/sizeof-operator-returns-incorrect-size
/questions/23984762/c-incorrect-signature-for-assignment-operator
...
/questions/24896659/incorrect-count-when-using-comparison-operator
/questions/7035598/patter-checking-check-of-incorrect-number-of-operators-and-brackets
Additional note: you might want to look into using StackExchange API instead of the current web-scraping/HTML-parsing approach.
First, please check the image below so I can better explain my question:
I am trying to take a user input to select one of the links below "Course Search By Term".... (ie. Winter 2015).
The HTML opened shows the part of the code for this webpage. I would like to grab all the href links in the element , which consists of five term links I want. I am following the instructions from this website (www.gregreda.com/2013/03/03/web-scraping-101-with-python/), but it doesn't explain this part. Here is some code I have been trying.
from bs4 import BeautifulSoup
from urllib2 import urlopen
BASE_URL = "http://classes.uoregon.edu/"
def get_category_links(section_url):
html = urlopen(section_url).read()
soup = BeautifulSoup(html, "lxml")
pldefault = soup.find("td", "pldefault")
ul_links = pldefault.find("ul")
category_links = [BASE_URL + ul.a["href"] for i in ul_links.findAll("ul")]
return category_links
Any help is appreciated! Thanks. Or if you would like to see the website, its classes.uoregon.edu/
I would keep it simple and locate all links containing 2015 in the text and term in href:
for link in soup.find_all("a",
href=lambda href: href and "term" in href,
text=lambda text: text and "2015" in text):
print link["href"]
Prints:
/pls/prod/hwskdhnt.p_search?term=201402
/pls/prod/hwskdhnt.p_search?term=201403
/pls/prod/hwskdhnt.p_search?term=201404
/pls/prod/hwskdhnt.p_search?term=201406
/pls/prod/hwskdhnt.p_search?term=201407
If you want full URLs, use urlparse.urljoin() to join the links with a base url:
from urlparse import urljoin
...
for link in soup.find_all("a",
href=lambda href: href and "term" in href,
text=lambda text: text and "2015" in text):
print urljoin(url, link["href"])
This would print:
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201402
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201403
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201404
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201406
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201407
I am new to Python and I am learning it for scraping purposes I am using BeautifulSoup to collect links (i.e href of 'a' tag). I am trying to collect the links under the "UPCOMING EVENTS" tab of site http://allevents.in/lahore/. I am using Firebug to inspect the element and to get the CSS path but this code returns me nothing. I am looking for the fix and also some suggestions for how I can choose proper CSS selectors to retrieve desired links from any site. I wrote this piece of code:
from bs4 import BeautifulSoup
import requests
url = "http://allevents.in/lahore/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for link in soup.select( 'html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
print link.get('href')
The page is not the most friendly in the use of classes and markup, but even so your CSS selector is too specific to be useful here.
If you want Upcoming Events, you want just the first <div class="events-horizontal">, then just grab the <div class="title"><a href="..."></div> tags, so the links on titles:
upcoming_events_div = soup.select_one('div.events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
print(link['href'])
Note that you should not use r.text; use r.content and leave decoding to Unicode to BeautifulSoup. See Encoding issue of a character in utf-8
import bs4 , requests
res = requests.get("http://allevents.in/lahore/")
soup = bs4.BeautifulSoup(res.text)
for link in soup.select('a[property="schema:url"]'):
print link.get('href')
This code will work fine!!