Getting href links from a website using Python's Beautiful Soup module - python

I am trying to get the href links from this page, specifically the links to the pages of those respective clubs. My current code is as follows. I have not included the imports. If needed, I just did import requests and from bs4 import BeautifulSoup:
rsoLink = "https://illinois.campuslabs.com/engage/organizations?query=badminton"
page = requests.get(rsoLink)
beautifulPage = BeautifulSoup(page.content, 'html.parser')
for link in beautifulPage.findAll("a"):
    print(link.get('href'))
My output is empty, suggesting that the program did not find the links. When I looked at the HTML structure of the page, the "a" tags seem to be nested deep within the page's structure (they are inside an element which is within another element, which itself is inside yet another element). My question is how I would access the links then; do I have to go through all these elements?

The data you see on the page is loaded with JavaScript from a different URL, so BeautifulSoup doesn't see it. To load the data you can use the following example:
import json
import requests

url = "https://illinois.campuslabs.com/engage/api/discovery/search/organizations"
params = {"top": "10", "filter": "", "query": "badminton", "skip": "0"}

data = requests.get(url, params=params).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for v in data["value"]:
    print(
        "{:<50} {}".format(
            v["Name"],
            "https://illinois.campuslabs.com/engage/organization/" + v["WebsiteKey"],
        )
    )
Prints:
Badminton For Fun https://illinois.campuslabs.com/engage/organization/badminton4fun
Illini Badminton Intercollegiate Sports Club https://illinois.campuslabs.com/engage/organization/illinibadmintonintercollegiatesportsclub

If you take a look at the actual HTML returned by requests, you can see that none of the actual page content is loaded, suggesting that it's loaded client-side via Javascript, likely using an HTTP request to fetch the necessary data.
Here, the easiest solution would be to inspect the HTTP requests made by the site and look for an API endpoint that returns the organizations data. By checking the Network tab of Chrome DevTools, you can find this endpoint:
https://illinois.campuslabs.com/engage/api/discovery/search/organizations?top=10&filter=&query=badminton&skip=0
Here, you can see the JSON response for all of the organizations that are being loaded into the page by client-side JS. If you take a look at the JSON, you'll notice that a link isn't one of the keys returned, but it's easily constructed using the WebsiteKey key.
Putting all of this together:
import requests
import json

SEARCH_URL = "https://illinois.campuslabs.com/engage/api/discovery/search/organizations"
ORGANIZATION_URL = "https://illinois.campuslabs.com/engage/organization/"

search = "badminton"
resp = requests.get(
    SEARCH_URL,
    params={"top": 10, "filter": "", "query": search, "skip": 0}
)
organizations = json.loads(resp.text)["value"]
links = [ORGANIZATION_URL + organization["WebsiteKey"] for organization in organizations]
print(links)
Similar strategies can be used to find and use other API endpoints on the site, such as the organization categories.
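For instance, a minimal sketch of the same pattern against a hypothetical categories endpoint; the exact path and the JSON key names below are assumptions and would need to be confirmed in the Network tab:
import requests

# Hypothetical endpoint: substitute whatever URL the Network tab actually shows
# when the category filter loads; the path and keys below are assumptions.
CATEGORIES_URL = "https://illinois.campuslabs.com/engage/api/discovery/search/organizations/categories"

resp = requests.get(CATEGORIES_URL)
resp.raise_for_status()

# Inspect the JSON in DevTools to confirm the real key names.
for category in resp.json().get("value", []):
    print(category.get("Name"))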

Related

How can I get the source of chat without using selenium?

So my issue is that I want to get a user's id info from the chat.
The chat area I'm looking for looks like this...
<div id="chat_area" class="chat_area" style="will-change: scroll-position;">
<dl class="" user_id="asdf1234"><dt class="user_m"><em class="pc"></em> :</dt><dd id="1">blah blah</dd></dl>
asdf1234
...
What I want to do is get the part starting with <a href='javascript:'' user_id='asdf1234' ...
so that I can parse it and do some other stuff.
But this webpage is the one I'm currently using, and it cannot be proxied (webdriver by Selenium).
How can I extract that data from the chat?
It looks like you've got two separate problems here. I'd use both the requests and BeautifulSoup libraries to accomplish this.
Use your browser's developer tools, the network tab, to refresh the page and look for the request which responds with the HTML you want. Use the requests library to emulate this request exactly.
import requests
headers = {"name": "value"}
# Get case example.
response = requests.get("some_url", headers=headers)
# Post case example.
data = {"key": "value"}
response = requests.post("some_url", headers=headers, data=data)
Web scraping is always finicky; if this doesn't work, you're most likely going to need to use a requests Session. Another one-time hacky solution is just to set your cookies from the browser.
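A minimal sketch of both fallbacks, assuming the same placeholder URL as above and a cookie name/value copied out of your browser's DevTools:
import requests

# Reuse one Session so that cookies set by earlier responses are sent
# automatically on later requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# One-time hacky alternative: copy a cookie name/value pair from the browser's
# DevTools (Application/Storage tab) and set it on the session directly.
# Both the cookie name and value here are placeholders.
session.cookies.set("cookie_name_from_browser", "cookie_value_from_browser")

response = session.get("some_url")  # "some_url" is the same placeholder as above
print(response.status_code)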
Once you have made the request you can use BeautifulSoup to scrape your user id very easily.
from bs4 import BeautifulSoup
# Create BS parser.
soup = BeautifulSoup(response.text, 'lxml')
# Find all elements with the attribute "user_id".
find_results = soup.findAll("a", {"user_id" : True})
# Iterate results. Could also just index if you want the single user_id.
for result in find_results:
    user_id = result["user_id"]

Python web-scraping on a multi-layered website without [href]

I am looking for a way to scrape data from the student-accomodation website uniplaces: https://www.uniplaces.com/en/accommodation/berlin.
In the end, I would like to scrape particular information for each property, such as bedroom size, number of roommates, location. In order to do this, I will first have to scrape all property links and then scrape the individual links afterwards.
However, even after going through the console and using BeautifulSoup for the extraction of urls, I was not able to extract the urls leading to the separate listings. They don't seem to be included as a [href] and I wasn't able to identify the links in any other format within the html code.
This is the python code I used but it also didn't return anything:
from bs4 import BeautifulSoup
import urllib.request
resp = urllib.request.urlopen("https://www.uniplaces.com/accommodation/lisbon")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
    print(link['href'])
So my question is: If links are not included in http:// format or referenced as [href]: is there any way to extract the listings urls?
I would really appreciate any support on this!
All the best,
Hannah
If you look at the network tab, you'll find an API call to this url: https://www.uniplaces.com/api/search/offers?city=PT-lisbon&limit=24&locale=en_GB&ne=38.79507211908374%2C-9.046124472314432&page=1&sw=38.68769060641113%2C-9.327992453271463
which specifies the location PT-lisbon and the northeast (ne) and southwest (sw) bounds. From this response, you can get the id for each offer and append it to the current url; you can also get all the info shown on the webpage (price, description, etc.).
For instance :
import requests

resp = requests.get(
    url='https://www.uniplaces.com/api/search/offers',
    params={
        "city": 'PT-lisbon',
        "limit": '24',
        "locale": 'en_GB',
        # pass the coordinates with a literal comma; requests URL-encodes them itself
        "ne": '38.79507211908374,-9.046124472314432',
        "page": '1',
        "sw": '38.68769060641113,-9.327992453271463'
    })
body = resp.json()

base_url = 'https://www.uniplaces.com/accommodation/lisbon'
data = [
    (
        t['id'],                   # offer id
        base_url + '/' + t['id'],  # this is the offer page
        t['attributes']['accommodation_offer']['title'],
        t['attributes']['accommodation_offer']['price']['amount'],
        t['attributes']['accommodation_offer']['available_from']
    )
    for t in body['data']
]
print(data)

Problems accessing page with python requests

I'm trying to extract the sector of a stock for a ML classification project. If I go to the following page:
https://www.six-swiss-exchange.com/shares/security_info_en.html?id=CH0012221716CHF4
I get (on the screen) some information about this stock (it changes with the id code - I just picked the first one on the list). However, none of the information is available via a regular request. (The HTML page contains mostly JavaScript functions.)
What I need is on the "Shares Details" tab (ICB Supersector at the bottom of the page). Once again, nothing is available with a regular request. I looked into what happens when I click this tab, and the desired request is inside the url:
http://www.six-swiss-exchange.com/shares/info_details_en.html?id=CH0210483332CHF4&portalSegment=EQ&dojo.preventCache=1520360103852 HTTP/1.1
However, if I use this url directly, I get a 403 error from requests, but it works from a browser. I usually don't have any problems with this sort of thing, but in this case do I have to submit cookies or any other information to access that page? No login is required and it can be easily accessed from any browser.
I am thinking 1) make a first request to the url that works, 2) store the cookie they send you (I don't know how to do that really) and 3) make a second request to the desired url. Would this work?
I tried using requests.Session() but I'm not sure if this is the solution or if I implemented it properly.
If anyone has dealt with that sort of problem, I would love any pointers in solving this. Thanks.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.six-swiss-exchange.com'


def get_page_html(isin):
    params = {
        'id': isin,
        'portalSegment': 'EQ'
    }
    r = requests.get(
        '{}/shares/info_details_en.html'.format(BASE_URL),
        params=params
    )
    r.raise_for_status()
    return r.text


def get_supersector_info(soup):
    supersector = soup.find('td', text='ICB Supersector').next_sibling.a
    return {
        'link': urljoin(BASE_URL, supersector['href']),
        'text': supersector.text
    }


if __name__ == '__main__':
    page_html = get_page_html('CH0012221716CHF4')
    soup = BeautifulSoup(page_html, 'lxml')
    supersector_info = get_supersector_info(soup)
    # print the extracted link and text (matches the console output below)
    print(supersector_info['link'])
    print(supersector_info['text'])
Console:
https://www.six-swiss-exchange.com/search/quotes_en.html?security=C2700T
Industrial Goods & Services
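As a side note, if a plain request ever comes back with a 403 again, the cookie flow described in the question can be sketched with a Session; the browser-like User-Agent header is an assumption, and the site may require more than this:
import requests

session = requests.Session()
# A browser-like User-Agent is an assumption; some sites reject the default one.
session.headers.update({"User-Agent": "Mozilla/5.0"})

# 1) Request the page that already works so the server sets its session cookies.
session.get(
    "https://www.six-swiss-exchange.com/shares/security_info_en.html",
    params={"id": "CH0012221716CHF4"},
)

# 2) The cookies now live in session.cookies and are sent automatically.
# 3) Retry the details URL that returned 403 for a plain request.
r = session.get(
    "https://www.six-swiss-exchange.com/shares/info_details_en.html",
    params={"id": "CH0012221716CHF4", "portalSegment": "EQ"},
)
print(r.status_code)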

Crawl/scrape websites/webpages containing a specific text, with no prior information about any such websites/webpages

I have used Nutch and Scrapy. They need seed URLs to crawl. That means one should already be aware of the websites/webpages which would contain the text being searched for.
My case is different, I do not have the prior information about the websites/webpages which contain the text I am searching for. So I won't be able to use seed URLs to be crawled by tools such as nutch and scrapy.
Is there a way to crawl websites/webpages for a given text, without knowing any websites/webpages that would possibly contain that text?
You could parse the commoncrawl dataset. It contains billions of webpages. Their site contains examples on how to do it with MapReduce.
Other than that any web crawler needs to have some starting point.
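If you don't want to run MapReduce over the raw dumps, Common Crawl also exposes an HTTP index server you can query with requests; a minimal sketch follows. Note that the crawl ID is an assumption (it changes with each release), and the index looks up captures by URL pattern, not by text, so full-text search still means downloading and scanning the WET files.
import requests

# The crawl ID below is an assumption; current IDs are listed at
# https://index.commoncrawl.org/.
INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

resp = requests.get(INDEX_URL, params={"url": "example.com/*", "output": "json"})
# Each non-empty line is a JSON record describing one captured page.
for line in resp.text.splitlines():
    print(line)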
You can use the Google search API (https://developers.google.com/custom-search/json-api/v1/overview?csw=1) for 100 free queries/day. The search results will be in JSON format, which you can use to feed the links to your scraper.
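A minimal sketch of calling the Custom Search JSON API with requests; the API key and search engine ID are placeholders you create in the Google developer console:
import requests

API_KEY = "YOUR_API_KEY"      # placeholder: create one in the Google developer console
CX = "YOUR_SEARCH_ENGINE_ID"  # placeholder: ID of your custom search engine

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": CX, "q": "pizza"},
)
resp.raise_for_status()

# Each result item carries the page title and link, which can be fed to a scraper.
for item in resp.json().get("items", []):
    print(item["link"])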
Well, you can use the requests module to get the data.
In the example below I am requesting the Google search results page for the word "pizza".
import requests
url = 'http://www.google.com/search'
my_headers = { 'User-agent' : 'Mozilla/11.0' }
payload = { 'q' : 'pizza', 'start' : '0' }
r = requests.get( url, params = payload, headers = my_headers )
You can use the BeautifulSoup library to extract any type of information from the retrieved data (HTML data):
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.text, 'html.parser' )
Now if you want the text data you can use this function:
soup.getText()

Scraping second page in Python Yields Different Data Than Browsing to Second Page

I'm attempting to scrape some data from www.ksl.com/auto/ using Python Requests and Beautiful Soup. I'm able to get the results from the first search page but not subsequent pages. When I request the second page using the same URL Chrome constructs when I click the "Next" button on the page, I get a set of results that no longer matches my search query. I've found other questions on Stack Overflow that discuss Ajax calls that load subsequent pages, and using Chrome's Developer tools to examine the request. But, none of that has helped me with this problem -- which I've had on other sites as well.
Here is an example query that returns only Acuras on the site. When you advance in the browser to the second page, the URL is simply this: https://www.ksl.com/auto/search/index?page=1. When I use Requests to hit those two URLs, the second search results are not Acuras. Is there, perhaps a cookie that my browser is passing back to the server to preserve my filters?
I would appreciate any advice someone can give about how to get subsequent pages of the results I searched for.
Here is the simple code I'm using:
from requests import get
from bs4 import BeautifulSoup
page1 = get('https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B').text
page2 = get('https://www.ksl.com/auto/search/index?page=2').text
soup = BeautifulSoup(page1, 'html.parser')
listings = soup.findAll("div", { "class" : "srp-listing-body-right" })
listings[0] # An Acura - success!
soup2 = BeautifulSoup(page2, 'html.parser')
listings2 = soup2.findAll("div", { "class" : "srp-listing-body-right" })
listings2[0] # Not an Acura. :(
Try this. Create a Session object and then call the links. This will maintain your session with the server when you send a call to the next link.
import requests
from bs4 import BeautifulSoup
s = requests.Session() # Add this line
page1 = s.get('https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B').text
page2 = s.get('https://www.ksl.com/auto/search/index?page=1').text
Yes, the website uses cookies so that https://www.ksl.com/auto/search/index shows or extends your last search. More specifically, the search parameters are stored on the server for your particular session cookie, that is, the value of the PHPSESSID cookie.
However, instead of passing that cookie around, you can simply always do full queries (in the sense of the request parameters), each time using a different value for the page parameter.
https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B
https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=1&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B
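A minimal sketch of that approach, sending the same search parameters on every request and varying only page. The parameter names are taken from the URLs above; only the ones needed for the example are kept, which is an assumption about what the site requires:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.ksl.com/auto/search/index"

# Send the full search on every request and only vary the page number,
# so the results no longer depend on a server-side session.
for page in range(0, 3):
    params = {
        "keyword": "",
        "make[]": "Acura",
        "miles": 25,
        "newUsed[]": "All",
        "page": page,
    }
    html = requests.get(BASE_URL, params=params).text
    listings = BeautifulSoup(html, "html.parser").findAll(
        "div", {"class": "srp-listing-body-right"}
    )
    print(page, len(listings))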
