How to scrape 'Click to Display' fields with BeautifulSoup - python

I am trying to scrape the number and names of schools that basketball players receive offers from, using verbalcommits.com.
Using this page as an example: http://www.verbalcommits.com/players/jarrey-foster
It's easy to access the first offer (SMU), but all of the other offers are hidden behind the "Show other offers" button. When I inspect the page, I can see the offers, but my scraper doesn't reach them. I've been using the following:
import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("http://www.verbalcommits.com/players/jarrey-foster")  # opens the page
soup = BeautifulSoup(page, 'html.parser')  # parses the page into a BeautifulSoup object
schools = soup.body.findAll('span', {"class": "team_name"})
print(schools)
This returns the first span that has the team name in it, but not the rest of the spans that are hidden. What do I need to add to access the rest of the page that is hidden?

To elaborate on @furas's great answer: here is how you can extract the player id and make a second request to get the "closed offers". For this, we maintain a web-scraping session with requests:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    response = session.get("http://www.verbalcommits.com/players/jarrey-foster")

    # get the player id
    soup = BeautifulSoup(response.content, "html.parser")
    player_id = soup.select_one("h1.player-name").get("data-player-id")

    # get closed offers
    response = session.get("http://www.verbalcommits.com/player_divs/closed_offers", params={"player_id": player_id})
    soup = BeautifulSoup(response.content, "html.parser")

    # print team names
    for team in soup.select(".team_name"):
        print(team.get_text())
Prints team names for demonstration purposes:
UTEP
Sam Houston State
New Hampshire
Rice
Temple
Liberty
UL Lafayette

You can't get the other data because when you click the button, JavaScript loads it from the server from
http://www.verbalcommits.com/player_divs/closed_offers?player_id=17766&_=1475626846752
Now you can use this URL with BeautifulSoup to get the data.
I used Firebug in Firefox or Developer Tools in Chrome to find this URL.
EDIT: inside the HTML I found data-player-id="17766" - it is the first argument in the URL above. Maybe you can find the second argument too, so you could generate the URL in Python.
EDIT: I checked the URL
http://www.verbalcommits.com/player_divs/closed_offers?player_id=17766
and it gives the same data, so you don't need the second argument.

Related

Fetch all pages using a Python request, using Beautiful Soup

I tried to fetch all the product names from the web page, but I could only get 12.
When I scroll down, the page refreshes and loads more products.
How can I get all of them?
import requests
from bs4 import BeautifulSoup
import re

url = "https://www.outre.com/product-category/wigs/"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
items = soup.find_all("div", attrs={"class": "title-wrapper"})
for item in items:
    print(item.p.a.get_text())
Your code is fine. The thing is, on this website the products are loaded dynamically, so your request only gets the first 12 products.
You can check the developer console in your browser to track the Ajax calls made while browsing.
I did that, and it turns out a call is made to retrieve more products from the URL
https://www.outre.com/product-category/wigs/page/2/
So if you want all the products, you need to browse multiple pages. I suggest using a loop and running your code once per page.
N.B.: You can also check the website to see whether there is a more convenient place to get the products (e.g. not from the main page).
The page loads the products from a different URL via JavaScript, so Beautiful Soup doesn't see them. To get all pages, you can use the following example:
import requests
from bs4 import BeautifulSoup

url = "https://www.outre.com/product-category/wigs/page/{}/"

page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    titles = soup.select(".product-title")
    if not titles:
        break
    for title in titles:
        print(title.text)
    page += 1
Prints:
...
Wet & Wavy Loose Curl 18″
Wet & Wavy Boho Curl 20″
Nikaya
Jeanette
Natural Glam Body
Natural Free Deep

Scraping data not coming from the exact URL

I'm trying to scrape some monster infobox tables from the RuneScape wiki.
Some specific monsters have multiple levels, for example:
https://oldschool.runescape.wiki/w/Dwarf
You can switch between the levels by clicking the boxes on top of the infobox: "Level 7", "Level 10", and so on.
Once you click a level box, the URL changes to match the level.
But when I request the URL https://oldschool.runescape.wiki/w/Dwarf#Level_10, it brings back the data for the first level only (in this case https://oldschool.runescape.wiki/w/Dwarf#Level_7), and I can't scrape the other levels.
import requests
from bs4 import BeautifulSoup
url = 'https://oldschool.runescape.wiki/w/Dwarf#Level_20'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.content, 'html.parser')
soup_minfobox = soup.find_all('table', class_="infobox infobox-switch no-parenthesis-style infobox-monster")
print(soup_minfobox[0].text)
Output: Level 7Level 10Level 11Level 20DwarfReleased6 April 2001 (Update)MembersNoCombat level7Size1x1 ...
Excuse the makeshift code, but in the output you can see that the data at the end is from level 7, even though the URL is for level 20.
If you manually trigger the events (from the browser's console), you'll see that the infobox changes:
$("span[data-switch-anchor='#Level_7']").click();
$("span[data-switch-anchor='#Level_10']").click();
$("span[data-switch-anchor='#Level_11']").click();
$("span[data-switch-anchor='#Level_20']").click();
So you can use the above selectors and consult the answers provided in the following topic on how to invoke an event using BeautifulSoup:
invoking onclick event with beautifulsoup python
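As a concrete illustration, here is a minimal sketch of that idea using Selenium (an assumption on my part, since requests and BeautifulSoup alone cannot execute the click): it clicks the span whose data-switch-anchor matches the desired level and then parses the updated infobox:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://oldschool.runescape.wiki/w/Dwarf")

# trigger the same click that the page's JavaScript listens for
driver.find_element(By.CSS_SELECTOR, "span[data-switch-anchor='#Level_20']").click()

soup = BeautifulSoup(driver.page_source, "html.parser")
infobox = soup.find("table", class_="infobox-monster")
print(infobox.text)

driver.quit()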

Finding the correct elements for scraping a website

I am trying to scrape only certain articles from this main page. To be more specific, I only want articles from the sub-page Media, and from the sub-sub-pages Press releases, Governing Council decisions, Press conferences, Monetary policy accounts, Speeches, and Interviews, and only those that are in English.
Based on some tutorials and other Stack Overflow answers, I managed to put together code that scrapes absolutely everything from the website. My original idea was to scrape everything and just filter the output in a data frame later, but the website contains so much that the script always freezes after some time.
Getting the sub-links:
import requests
import re
from bs4 import BeautifulSoup

master_request = requests.get("https://www.ecb.europa.eu/")
base_url = "https://www.ecb.europa.eu"
master_soup = BeautifulSoup(master_request.content, 'html.parser')
master_atags = master_soup.find_all("a", href=True)
master_links = []
sub_links = {}

for master_atag in master_atags:
    master_href = master_atag.get('href')
    master_href = base_url + master_href
    print(master_href)
    master_links.append(master_href)

    sub_request = requests.get(master_href)
    sub_soup = BeautifulSoup(sub_request.content, 'html.parser')
    sub_atags = sub_soup.find_all("a", href=True)
    sub_links[master_href] = []

    for sub_atag in sub_atags:
        sub_href = sub_atag.get('href')
        sub_links[master_href].append(sub_href)
        print("\t" + sub_href)
One thing I tried was changing the base link to the sub-links - my idea was that maybe I could do it separately for every sub-page and later just put the links together, but that did not work. Another thing I tried was replacing the line that collects sub_atags with the following:
sub_atags = sub_soup.find_all("a", {'class': ['doc-title']}, href=True)
This seemed to partially solve my problem: even though it did not get only links from the sub-pages, it at least ignored links that do not have the 'doc-title' class, but it was still too much and some links were not retrieved correctly.
I also tried the following:
for master_atag in master_atags:
    master_href = master_atag.get('href')
    for href in master_href:
        master_href = [base_url + master_href if str(master_href).find(".en") in master_herf
        print(master_href)
I thought that because all hrefs for English documents have .en somewhere in them, this would give me only the links where .en occurs in the href, but this code gives me a syntax error for the print(master_href), which I don't understand, because print(master_href) worked before.
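For reference, a minimal sketch of what that filter could look like as valid Python (an illustration only, reusing the variables from the code above; the variable name english_links is mine):

# keep only links whose href contains ".en" (assumed to mark English documents)
english_links = [base_url + master_atag.get('href')
                 for master_atag in master_atags
                 if ".en" in master_atag.get('href')]
print(english_links)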
Next, I want to extract the following information from the sub-links. This part of the code works when I test it for a single link, but I never had a chance to try it on the code above since it won't finish running. Will this work once I manage to get the proper list of all links?
import pandas as pd

for link in sub_links:
    resp = requests.get(link)
    soup = BeautifulSoup(resp.content, 'html5lib')
    article = soup.find('article')
    title = soup.find('title')
    textdate = soup.find('h2')
    paragraphs = article.find_all('p')
    matches = re.findall('(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', str(textdate))
    for match in matches:
        print(match[0])
        datadate = match[0]

ecbdf = pd.DataFrame({"Article": [article], "Title": [title], "Text": [paragraphs], "date": datadate})
Going back to the scraping: since the first approach with Beautiful Soup did not work for me, I also tried to approach the problem differently.
import feedparser
from pandas.io.json import json_normalize
import pandas as pd
import requests

rss_url = 'https://www.ecb.europa.eu/home/html/rss.en.html'
ecb_feed = feedparser.parse(rss_url)
df_ecb_feed = json_normalize(ecb_feed.entries)
df_ecb_feed.head()
Here I run into the problem of not even being able to find the RSS feed URL in the first place. I viewed the page source, searched for "RSS", and tried every URL I could find that way, but I always get an empty dataframe.
I am a beginner at web scraping and at this point I don't know how to proceed or how to approach this problem. In the end, what I want to accomplish is to collect all articles from the sub-pages with their titles, dates, and authors, and put them into one dataframe.
The biggest problem you have with scraping this site is probably the lazy loading: using JavaScript, they load the articles from several HTML pages and merge them into the list. For details, look for index_include in the source code. This is problematic for scraping with only requests and BeautifulSoup, because what your soup instance gets from the request content is just the basic skeleton without the list of articles. Now you have two options:
Instead of the main article list page (Press Releases, Interviews, etc.), use the lazy-loaded lists of articles, e.g. /press/pr/date/2019/html/index_include.en.html. This will probably be the easier option, but you have to do it for each year you're interested in (see the sketch after this list).
Use a client that can execute JavaScript like Selenium to obtain the HTML instead of requests.
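To illustrate the first option, here is a minimal sketch using only requests and BeautifulSoup (an assumption on my part: it relies on the index_include URL pattern mentioned above and on the article list using the same span.doc-title anchors as the full page):

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ecb.europa.eu'
# lazy-loaded press-release list for a single year (assumed URL pattern; adjust the year)
year_index = f'{base_url}/press/pr/date/2019/html/index_include.en.html'

soup = BeautifulSoup(requests.get(year_index).content, 'html.parser')
for anchor in soup.select('span.doc-title > a[href]'):
    print(f'{base_url}{anchor["href"]}')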
Apart from that, I would suggest using CSS selectors for extracting information from the HTML. That way, you only need a few lines for the article extraction. Also, I don't think you have to filter for English articles if you use the index.en.html page for scraping, because it shows English by default and, additionally, other languages where available.
Here's an example I quickly put together. It can certainly be optimized, but it shows how to load the pages with Selenium and extract the article URLs and article contents:
from bs4 import BeautifulSoup
from selenium import webdriver

base_url = 'https://www.ecb.europa.eu'
urls = [
    f'{base_url}/press/pr/html/index.en.html',
    f'{base_url}/press/govcdec/html/index.en.html'
]

driver = webdriver.Chrome()

for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    for anchor in soup.select('span.doc-title > a[href]'):
        driver.get(f'{base_url}{anchor["href"]}')
        article_soup = BeautifulSoup(driver.page_source, 'html.parser')

        title = article_soup.select_one('h1.ecb-pressContentTitle').text
        date = article_soup.select_one('p.ecb-publicationDate').text
        paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
        content = '\n\n'.join(p.text for p in paragraphs)

        print(f'title: {title}')
        print(f'date: {date}')
        print(f'content: {content[0:80]}...')
I get the following output for the Press Releases page:
title: ECB appoints Petra Senkovic as Director General Secretariat and Pedro Gustavo Teixeira as Director General Secretariat to the Supervisory Board
date: 20 December 2019
content: The European Central Bank (ECB) today announced the appointments of Petra Senkov...
title: Monetary policy decisions
date: 12 December 2019
content: At today’s meeting the Governing Council of the European Central Bank (ECB) deci...

Unable to navigate Amazon pagination with Python and BS4

I've been trying to create a simple web scraper to collect the book titles from a 100-bestseller list on Amazon. I've used this code before on another site with no problems, but for some reason it scrapes the first page fine and then prints the same results for the following iterations.
I'm not sure whether it has something to do with how Amazon creates its URLs. When I manually append "#2" (and beyond) to the end of the URL in the browser, it navigates fine.
(Once the scraper is working I plan on dumping the data into CSV files, but for now printing to the terminal will do.)
import requests
from bs4 import BeautifulSoup

for i in range(5):
    url = "https://smile.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_nav_kstore_4_158591011#{}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    for book in soup.find_all('div', class_='zg_itemWrapper'):
        title = book.find('div', class_='p13n-sc-truncate')
        name = book.find('a', class_='a-link-child')
        price = book.find('span', class_='p13n-sc-price')
        print(title)
        print(name)
        print(price)

print("END")
This is a common problem: some sites load their data asynchronously (with Ajax). These are XMLHttpRequests that you can see in the Network tab of your browser's developer tools. Usually such websites load the data from a different endpoint with a POST request; to handle that you can use the urllib or requests library.
In this case the request is a plain GET, so you can scrape the data without extending your code from this URL: https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_pg_3?_encoding=UTF8&pg=3&ajax=1 where you only change the pg parameter.
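For example, a minimal sketch (assuming the pg and ajax query parameters behave as described and that the zg_itemWrapper markup is unchanged):

import requests
from bs4 import BeautifulSoup

base = ("https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction"
        "/zgbs/digital-text/6361470011/ref=zg_bs_pg_{page}")

for page in (1, 2, 3):
    # request each page of the list via the Ajax endpoint, changing only the pg parameter
    r = requests.get(base.format(page=page),
                     params={"_encoding": "UTF8", "pg": page, "ajax": 1})
    soup = BeautifulSoup(r.content, "lxml")
    for book in soup.find_all("div", class_="zg_itemWrapper"):
        title = book.find("div", class_="p13n-sc-truncate")
        print(title.get_text(strip=True) if title else None)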

CSS selectors to be used for scraping specific links

I am new to Python and working on a scraping project. I am using Firebug to copy the CSS path of the links I need. I am trying to collect the links under the "UPCOMING EVENTS" tab of http://kiascenehai.pk/, but only to learn how to get specific links.
I am looking for a fix for this problem and also for suggestions on how to retrieve specific links using CSS selectors.
from bs4 import BeautifulSoup
import requests

url = "http://kiascenehai.pk/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

for link in soup.select("html body div.body-outer-wrapper div.body-wrapper.boxed-mode div.main-outer-wrapper.mt30 div.main-wrapper.container div.row.row-wrapper div.page-wrapper.twelve.columns.b0 div.row div.page-wrapper.twelve.columns div.row div.eight.columns.b0 div.content.clearfix section#main-content div.row div.six.columns div.small-post-wrapper div.small-post-content h2.small-post-title a"):
    print link.get('href')
First of all, that page requires a city selection to be made (in a cookie). Use a Session object to handle this:
s = requests.Session()
s.post('http://kiascenehai.pk/select_city/submit_city', data={'city': 'Lahore'})
response = s.get('http://kiascenehai.pk/')
Now the response contains the actual page content, instead of a redirect to the city selection page.
Next, keep your CSS selector no larger than needed. There isn't much to go on in this page because it uses a grid layout, so we first need to zoom in on the right rows:
upcoming_events_header = soup.find('div', class_='featured-event')
upcoming_events_row = upcoming_events_header.find_next(class_='row')
for link in upcoming_events_row.select('h2 a[href]'):
    print link['href']
This is the co-founder of KiaSceneHai.pk; please don't scrape the website, a lot of effort goes into collecting the data. We offer access through our API; you can use the contact form to request access. Thank you.
