Best way to get 'hrefs' from CSS selector in BeautifulSoup? - python

I'm writing a script that will, initially, scrape the data for all of the census blocks in a given census block group. To do that, though, I first need to be able to get a link to each of the block groups in a given tract. The tracts are defined in a list of URLs; each URL returns a page that lists the block groups under the CSS selector "div#rList3 a". When I run this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
tracts = ['http://www.usa.com/NY023970800.html', 'http://www.usa.com/NY023970900.html',
          'http://www.usa.com/NY023970600.html', 'http://www.usa.com/NY023970700.html',
          'http://www.usa.com/NY023970500.html']
class Scrape:
    def scrapeTracts(self):
        for i in tracts:
            html = urlopen(i)
            soup = BeautifulSoup(html.read(), 'lxml')
            bgs = soup.select("div#rList3 a")
            print(bgs)

s = Scrape()
s.scrapeTracts()
This gives me output that looks like: [<a href="/NY0239708001.html">NY0239708001</a>, ...] (with most of the links cut out for the sake of the length of this post). My question is: how can I get just the string after href, in this case /NY0239708001.html?

You can do this in one line:
bgs = [i.attrs.get('href') for i in soup.select("div#rList3 a")]
Output:
['/NY0239708001.html']
['/NY0239709001.html', '/NY0239709002.html', '/NY0239709003.html', '/NY0239709004.html']
['/NY0239706001.html', '/NY0239706002.html', '/NY0239706003.html', '/NY0239706004.html']
['/NY0239707001.html', '/NY0239707002.html', '/NY0239707003.html', '/NY0239707004.html', '/NY0239707005.html']
['/NY0239705001.html', '/NY0239705002.html', '/NY0239705003.html', '/NY0239705004.html']

Each node has an attrs dictionary which contains the attributes of that node...including CSS classes, or in this case, the href.
hrefs = []
for bg in bgs:
    hrefs.append(bg.attrs['href'])
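Note that bg.attrs['href'] raises a KeyError for any <a> tag that happens to have no href attribute; if that is a possibility, Tag.get (or attrs.get) returns None instead, so you can filter safely:
hrefs = [bg.get('href') for bg in bgs if bg.get('href') is not None]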

Related

Extracting product URLs from a search query on a website

If I were, for example, looking to track the price changes of MIDI keyboards on https://www.gear4music.com/Studio-MIDI-Controllers, I would need to extract the URLs of all the products shown in the search results and then loop through those URLs, extracting the price info for each product. I can obtain the price data for an individual product by hard-coding its URL, but I cannot find a way to automate getting the URLs of multiple products.
So far I have tried this,
from bs4 import BeautifulSoup
import requests
url = "https://www.gear4music.com/Studio-MIDI- Controllers"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
This does produce a list of URLs, but I cannot make out which ones relate specifically to the MIDI keyboards in the search query, which are the products I want price info for. Is there a better, more specific way to obtain the URLs of the products only, and not everything else in the HTML file?
There are many ways to obtain the product links. One way is to select all <a> tags that have a data-g4m-inv attribute:
import requests
from bs4 import BeautifulSoup
url = "https://www.gear4music.com/Studio-MIDI-Controllers"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for a in soup.select("a[data-g4m-inv]"):
    print("https://www.gear4music.com" + a["href"])
Prints:
https://www.gear4music.com/Recording-and-Computers/SubZero-MiniPad-MIDI-Controller/P6E
https://www.gear4music.com/Recording-and-Computers/SubZero-MiniControl-MIDI-Controller/P6D
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-MiniKey-25-Key-MIDI-Controller/JMR
https://www.gear4music.com/Keyboards-and-Pianos/Nektar-SE25/2XWA
https://www.gear4music.com/Keyboards-and-Pianos/Korg-nanoKONTROL2-USB-MIDI-Controller-Black/G8L
https://www.gear4music.com/Recording-and-Computers/SubZero-ControlKey25-MIDI-Keyboard/221Y
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-CommandKey25-Universal-MIDI-Controller/221X
...
Open the Chrome developer console and look at the div that corresponds to a product. From there, set a variable (say, products) equal to soup.find_all(that aforementioned div) and loop through the results, finding the <a> tags that are children of each element (or, alternatively, identify the title class and search that way).
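A minimal sketch of that idea; the "product-card" class name below is a placeholder, not Gear4music's real markup, so substitute whatever class the developer console shows for the product containers:
import requests
from bs4 import BeautifulSoup

url = "https://www.gear4music.com/Studio-MIDI-Controllers"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# "product-card" is a placeholder class name; replace it with the class you see
# on each product container in the developer console
for product in soup.find_all("div", class_="product-card"):
    link = product.find("a")  # first anchor inside the product container
    if link is not None and link.get("href"):
        print("https://www.gear4music.com" + link["href"])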

Scrape <div>/<span> from HTML page

I am trying to create a simple weather forecast with Python in Eclipse. So far I have written this:
from bs4 import BeautifulSoup
import requests
def weather_forecast():
    url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
    r = requests.get(url)  # GET request for the contents of the page
    print(r.content)  # Outputs the HTML code for the page
    soup = BeautifulSoup(r.content, 'html5lib')  # Parse the data with BeautifulSoup(HTML string, parser)
    min_max = soup.select('min-max.temperature')  # Select all spans with a "min-max-temperature" attribute
    print(min_max.prettify())
    table = soup.find('div', attrs={'daily-weather-list-item__temperature'})
    print(table.prettify())
The HTML page has elements like this (shown in a screenshot in the original post). I have found the path to the first temperature in the page's elements, but when I execute my code and print to see whether I have done it correctly, nothing is printed. My goal is to print a table with dates and corresponding temperatures, which seems like an easy task, but I do not know how to properly name the attribute or how to scrape them all from the HTML page in one iteration.
The <span> has two temperatures stored, one min and one max; here it just happens that they're the same.
I want to go into each <div class="daily-weather-list-item__temperature">, collect the two temperatures, and add them to a dictionary. How do I do this?
I have looked at this question on Stack Overflow but I couldn't figure it out:
Python BeautifulSoup - Scraping Div Spans and p tags - also how to get exact match on div name
You could use a dictionary comprehension. Loop over all the forecast items that have the class daily-weather-list-item, extract the date from the datetime attribute of each item's time tag, and use those dates as the keys; associate each key with the min/max info.
import requests
from bs4 import BeautifulSoup
def weather_forecast():
    url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
    r = requests.get(url)  # GET request for the contents of the page
    soup = BeautifulSoup(r.content, 'html5lib')
    temps = {i.select_one('time')['datetime']: i.select_one('.min-max-temperature').get_text(strip=True)
             for i in soup.select('.daily-weather-list-item')}
    return temps
weather_forecast()
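If you want the min and max as separate values rather than one concatenated string, one possibility is to pass a separator to get_text and split on it. This is only a sketch: it assumes the min-max span contains exactly two temperature strings, which is a guess about the markup, and it does not assume which one comes first.
import requests
from bs4 import BeautifulSoup

def weather_forecast_split():
    url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
    soup = BeautifulSoup(requests.get(url).content, 'html5lib')
    temps = {}
    for item in soup.select('.daily-weather-list-item'):
        date = item.select_one('time')['datetime']
        # the '/' separator keeps the inner temperature strings apart
        parts = item.select_one('.min-max-temperature').get_text('/', strip=True).split('/')
        temps[date] = parts  # e.g. ['2°', '-1°']; check the page to see which is min and which is max
    return temps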

How to use web scraping to get visible text on the webpage?

This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on one of the headings on the page (circled in a screenshot in the original post); a second screenshot shows how the webpage looks after clicking it.
I want to get the names of all the places displayed on the webpage, but I seem to be having trouble with this, as the URL doesn't change when the filter is applied.
I am using python urllib for this.
Here is my code:
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4 (BeautifulSoup), a Python module that lets you pull specific things out of webpages. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something other than the text, for example everything with a certain tag, you can also use bs4:
soup.find_all('p')  # Getting all p tags
soup.find_all('p', class_='Title')  # Getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
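For example, once you have identified the right tag and class in the browser's inspector, the lookup itself is short. The 'listing-title' class name below is a made-up placeholder, not TripAdvisor's real one; swap in whatever you actually find:
from bs4 import BeautifulSoup as bs

soup = bs(html, features='html5lib')  # html is the string you already downloaded above
# 'listing-title' is a placeholder; use the actual class on the place-name elements
for name_tag in soup.find_all('a', class_='listing-title'):
    print(name_tag.get_text(strip=True))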
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

BeautifulSoup/Python Problems Parsing Websites

I'm sure this may have been asked in the past, but I am attempting to parse a website (and hopefully eventually automate it to parse multiple websites at once), and it's not working properly. I may be having issues grabbing the appropriate tags or something, but essentially I want to go to this website, pull off all of the items from the lists on it (possibly with the hrefs intact, or in a separate document), and stick them into a file I can share in an easy-to-read format. So far this is my code:
url = "http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/" `
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
results = soup.find_all('div', class_="tab_content")
for element in results:
    title_elem = element.find('h1')
    h2_elem = element.find('h2')
    h3_elem = element.find('h3')
    href_elem = element.find('href')
    if None in (title_elem, h2_elem, h3_elem, href_elem):
        continue
    print(title_elem.text.strip())
    print(h2_elem.text.strip())
    print(h3_elem.text.strip())
    print(href_elem.text.strip())
    print()
I even attempted to write this for a table, but I get the same type of output, which is a bunch of empty elements:
for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        print(subtable)
Does anyone have any insight as to why this may be the case? If possible, I would also not be opposed to regex parsing, but the main goal here is to go into this site (and hopefully others like it), take the entire tables/lists/descriptions of the individual programs for each major, and write the information into an easy-to-read file.
A similar approach, in that I also combine bs4 with pandas, but here I test for the presence of the hyperlink class.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('.sc_courselist'):
    tbl = pd.read_html(str(table))[0]
    links_column = ['http://catalog.apu.edu' + i.select_one('.bubblelink')['href'] if i.select_one('.bubblelink') is not None else '' for i in table.select('td:nth-of-type(1)')]
    tbl['Links'] = links_column
    print(tbl)
With BeautifulSoup, an alternative to find/find_all is select_one/select. The latter two apply CSS selectors, with select_one returning the first match for the CSS selector passed in and select returning a list of all matches. "." is a class selector, meaning it selects elements with the specified class, e.g. sc_courselist or bubblelink. bubblelink is the class of the elements that carry the desired hrefs. These sit within the first column of each table, which is selected using td:nth-of-type(1).
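For a concrete sense of the equivalence (continuing with the table variable from the loop above):
first_link = table.select_one('.bubblelink')     # CSS class selector
same_link = table.find(class_='bubblelink')      # the equivalent find() call
first_cells = table.select('td:nth-of-type(1)')  # the first <td> of each row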

How to use BeautifulSoup to scrape links in HTML

I need to download a few links from an HTML page, but I don't need all of them, only the ones in a certain section of the webpage.
For example, on http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning, I need the links in the Debaters section. I plan to use BeautifulSoup, and I looked at the HTML of one of the links:
Data Collection Is Out of Control
Here's my code:
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
link_set = set()
for link in soup.find_all("a", class = "bl-bigger"):
    href = link.get('href')
    if href == None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)
for link in link_set:
    print link
This code is supposed to give me all the links with the bl-bigger class, but it actually returns nothing. Could anyone figure out what's wrong with my code or how to make it work?
Thanks
I don't see a bl-bigger class at all when I view the source in Chrome. Maybe that's why your code is not working?
Let's start by looking at the source. The whole Debaters section seems to be put within a div with class nytint-discussion-content. So, using BeautifulSoup, let's get that whole div first.
debaters_div = soup.find('div', class_="nytint-discussion-content")
Again learning from the source, it seems all the links are within a list, in li tags. Now all you have to do is find all the li tags and find the anchor tags within them. One more thing you can notice is that all the li tags have the class nytint-bylines-1.
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
list_items[0].find('a')
# Data Collection Is Out of Control
So, your whole code can be:
link_set = set()
response = requests.get(url)
html_data = response.text
soup = BeautifulSoup(html_data)
debaters_div = soup.find('div', class_="nytint-discussion-content")
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
for each_item in list_items:
    html_link = each_item.find('a').get('href')
    if html_link.startswith('/roomfordebate'):
        link_set.add(html_link)
Now link_set will contain all the links you want. From the link given in question, it will fetch 5 links.
PS: link_set contains only relative URIs, not full addresses, so I would prepend http://www.nytimes.com before adding those links to link_set. Just change the last line to:
link_set.add('http://www.nytimes.com' + html_link)
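An alternative to plain string concatenation, if you prefer, is urljoin, which also copes with hrefs that are already absolute:
from urllib.parse import urljoin  # on Python 2 (which this question's print statements suggest): from urlparse import urljoin

link_set.add(urljoin('http://www.nytimes.com', html_link))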
You need to call the method with a dict instead of a keyword argument (class is a reserved word in Python, so class="bl-bigger" is a syntax error):
soup.find("tagName", { "class" : "cssClass" })
or use .select method which executes CSS queries:
soup.select('a.bl-bigger')
Examples are in the docs; just search for the '.select' string. Also, instead of writing the entire script up front, you'll quickly get some working code by experimenting in the IPython interactive shell.
