I'm trying to scrape some monster infobox tables from rswiki.
Some monsters have multiple levels, for example:
https://oldschool.runescape.wiki/w/Dwarf
You can switch between the different levels by clicking the boxes at the top of the infobox: "Level 7", "Level 10"...
Clicking a level box changes the URL to match that level.
But when I request the URL https://oldschool.runescape.wiki/w/Dwarf#Level_10, it only returns the data for the first level (i.e. https://oldschool.runescape.wiki/w/Dwarf#Level_7), and I can't scrape the other levels.
import requests
from bs4 import BeautifulSoup
url = 'https://oldschool.runescape.wiki/w/Dwarf#Level_20'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.content, 'html.parser')
soup_minfobox = soup.find_all('table', class_="infobox infobox-switch no-parenthesis-style infobox-monster")
print(soup_minfobox[0].text)
Output: Level 7Level 10Level 11Level 20DwarfReleased6 April 2001 (Update)MembersNoCombat level7Size1x1 ...
Excuse the makeshift code, but in the output you can see that it still ends with the level 7 data, even though the URL is for level 20.
If you manually trigger the events (from the browser's console), you'll see that the infobox changes:
$("span[data-switch-anchor='#Level_7']").click();
$("span[data-switch-anchor='#Level_10']").click();
$("span[data-switch-anchor='#Level_11']").click();
$("span[data-switch-anchor='#Level_20']").click();
So you can use the above selectors and consult the answers provided in the following topic on how to invoke an event using BeautifulSoup:
invoking onclick event with beautifulsoup python
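For instance, here is a minimal sketch using Selenium (not part of the original answer; it assumes a local Chrome installation and that the switch spans keep the data-switch-anchor attribute shown above) that clicks a level and then parses the updated infobox with BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a Chrome driver is available
driver.get('https://oldschool.runescape.wiki/w/Dwarf')

# Click the "Level 20" switch, mirroring the jQuery call above.
driver.find_element(By.CSS_SELECTOR, "span[data-switch-anchor='#Level_20']").click()

# Hand the updated HTML to BeautifulSoup and read the infobox as in the question.
soup = BeautifulSoup(driver.page_source, 'html.parser')
infobox = soup.find_all('table', class_="infobox infobox-switch no-parenthesis-style infobox-monster")
print(infobox[0].text)

driver.quit()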
I tried to fetch all of the product names from the web page, but I could only get 12.
If I scroll down the page, it refreshes and loads more products.
How can I get all of the information?
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.outre.com/product-category/wigs/"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
items = soup.find_all("div", attrs={"class":"title-wrapper"})
for item in items:
    print(item.p.a.get_text())
Your code is fine. The thing is that on this website the products are loaded dynamically, so your request only returns the first 12 products.
You can check the developer console in your browser to track the Ajax calls made while browsing.
I did that, and it turns out a call is made to the following URL to retrieve more products:
https://www.outre.com/product-category/wigs/page/2/
So if you want to get all the products you need to browse multiple pages. I suggest you use a loop and run your code once per page, as in the sketch below.
N.B.: You can also check the website to see if there is a more convenient place to get the products (other than the main page).
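For example, a minimal sketch reusing the selector from the question (the page count and the assumption that a missing page returns a non-200 status are mine, not from the original answer):
import requests
from bs4 import BeautifulSoup

url = "https://www.outre.com/product-category/wigs/page/{}/"

for page in range(1, 6):  # assumed number of pages; adjust as needed
    res = requests.get(url.format(page))
    if res.status_code != 200:  # stop once the site runs out of pages
        break
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.find_all("div", attrs={"class": "title-wrapper"}):
        print(item.p.a.get_text())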
The page loads the products from a different URL via JavaScript, so Beautiful Soup doesn't see them. To get all pages, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.outre.com/product-category/wigs/page/{}/"
page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    titles = soup.select(".product-title")
    if not titles:
        break
    for title in titles:
        print(title.text)
    page += 1
Prints:
...
Wet & Wavy Loose Curl 18″
Wet & Wavy Boho Curl 20″
Nikaya
Jeanette
Natural Glam Body
Natural Free Deep
I'm trying to scrape the following website:
https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs
I'm able to successfully scrape the events listed on the page with BeautifulSoup, using the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
dates = soup.find_all('div', {'class': 'event-b58f7990'})
month=[]
day=[]
for i in dates:
    md = i.find_all('div')
    month.append(md[0].text)
    day.append(md[1].text)
However, the issue I'm having is that I'm only able to scrape the first 18 events - the rest of the page is only available if the 'view all' button is clicked at the bottom. Is there a way in beautifulsoup, or otherwise, to simulate this button being clicked, so that I can scrape ALL of the data? I'd prefer to keep this in python as I'm doing most scraping with beautifulsoup. Thanks so much!
If you can work out the endpoint, or set an end point for the range in the following (with error handling for going too far), you can get a JSON response and parse out the info you require as follows. Depending on how many requests you are making, you may choose to re-use the connection with a Session.
import requests
import pandas as pd
url = 'https://www.bandsintown.com/upcomingEvents?came_from=257&sort_by_filter=Number+of+RSVPs&page={}&latitude=51.5167&longitude=0.0667'
results = []
for page in range(1, 20):
    data = requests.get(url.format(page)).json()
    for item in data['events']:
        results.append([item['artistName'], item['eventDate']['day'], item['eventDate']['month']])
df = pd.DataFrame(results)
print(df)
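A minimal variant of the above (not from the original answer) that re-uses the connection with requests.Session and, instead of hard-coding the end of the range, stops once a page comes back empty; the assumption that an exhausted page returns an empty 'events' list is mine:
import requests
import pandas as pd

url = 'https://www.bandsintown.com/upcomingEvents?came_from=257&sort_by_filter=Number+of+RSVPs&page={}&latitude=51.5167&longitude=0.0667'
results = []

with requests.Session() as session:
    page = 1
    while True:
        data = session.get(url.format(page)).json()
        events = data.get('events', [])
        if not events:  # assumed: an empty list means we went past the last page
            break
        for item in events:
            results.append([item['artistName'], item['eventDate']['day'], item['eventDate']['month']])
        page += 1

df = pd.DataFrame(results)
print(df)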
I'm trying to scrape the last part of this page through BeautifulSoup in python.
I want to retrieve all the companies listed in the bottom. Furthermore, the companies are ordered alphabetically, where the companies with titles starting with "A-F" appear under the first tab, then "G-N" under the second tab and so on. You have to click the tabs for the names to appear, so I'll loop through the different "name pages" and apply the same code.
I'm having trouble retrieving all the names of a single page, however.
When looking at the companies named "A-F" I can only retrieve the names of the first column of the table.
My code is:
from bs4 import BeautifulSoup as Soup
import requests
incl_page_url = "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url)
soup = Soup(page.content, "html.parser")
for header in soup.find("h2").next_siblings:
    try:
        for a in header.childGenerator():
            if str(type(a)) == "<class 'bs4.element.NavigableString'>":
                print(str(a))
    except:
        pass
As can be seen by running this, I only get the names from the first column.
Any help is very much appreciated.
Give this a shot and tell me this is not what you wanted:
from bs4 import BeautifulSoup
import requests
incl_page_url = "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url).text
soup = BeautifulSoup(page, "lxml")
for items in soup.select(".splitColumn p"):
    title = '\n'.join([item for item in items.strings])
    print(title)
Result:
3iGroup
8point3 Energy Partners
A
ABN AMRO
Accell Group
Accsys Technologies
Achmea
Acuity Brands
Adecco
Adidas
Adobe Systems
I am trying to scrape the number of schools, and the names of the schools, that basketball players get offers from on verbalcommits.com.
Using this page as an example: http://www.verbalcommits.com/players/jarrey-foster
It's easy to access the first offer (SMU) but all of the other offers are hidden behind the "Show other offers" button. When I inspect the page, I can see the offers but my scraper doesn't get to them. I've been using the following:
import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("http://www.verbalcommits.com/players/jarrey-foster")  # opens page
soup = BeautifulSoup(page, 'html.parser')  # makes page into a BS python object
schools = soup.body.findAll('span', {"class": "team_name"})
print(schools)
This returns the first span that has the team name in it, but not the rest of the spans that are hidden. What do I need to add to access the rest of the page that is hidden?
To elaborate on furas's great answer: here is how you can extract the player id and make a second request to get the "closed offers". For this, we are going to maintain a web-scraping session with requests:
import requests
from bs4 import BeautifulSoup
with requests.Session() as session:
    response = session.get("http://www.verbalcommits.com/players/jarrey-foster")

    # get the player id
    soup = BeautifulSoup(response.content, "html.parser")
    player_id = soup.select_one("h1.player-name").get("data-player-id")

    # get closed offers
    response = session.get("http://www.verbalcommits.com/player_divs/closed_offers", params={"player_id": player_id})
    soup = BeautifulSoup(response.content, "html.parser")

    # print team names
    for team in soup.select(".team_name"):
        print(team.get_text())
Prints team names for demonstration purposes:
UTEP
Sam Houston State
New Hampshire
Rice
Temple
Liberty
UL Lafayette
You can't get the other data because when you click the button, JavaScript reads it from the server from
http://www.verbalcommits.com/player_divs/closed_offers?player_id=17766&_=1475626846752
Now you can use this URL with BS to get the data.
I used Firebug in Firefox or Developer Tools in Chrome to find this URL.
EDIT: inside the HTML I found data-player-id="17766" - it is the first argument in the above URL. Maybe you can find the second argument so you can generate the URL using Python.
EDIT: I checked the URL
http://www.verbalcommits.com/player_divs/closed_offers?player_id=17766
and it gives the same data, so you don't need the second argument.
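For reference, a minimal sketch of that approach (the player id is hard-coded here; it can also be read from the data-player-id attribute as shown in the answer above):
import requests
from bs4 import BeautifulSoup

url = "http://www.verbalcommits.com/player_divs/closed_offers?player_id=17766"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# the returned HTML fragment contains the offers hidden behind the button
for team in soup.select(".team_name"):
    print(team.get_text())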
I'm trying to build a spider/web crawler for academic purposes to grab text from academic publications and append related links to a URL stack. I'm trying to crawl one website, called 'PubMed'. I can't seem to grab the links I need, though. Here is my code with an example page; this page should be representative of others in their database:
website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'
from bs4 import BeautifulSoup
import requests
r = requests.get(website)
soup = BeautifulSoup(r.content)
I have broken the HTML tree down into several variables just for readability, so that it can all fit within one screen width.
key_text = soup.find('div', {'class':'grid'}).find('div',{'class':'col twelve_col nomargin shadow'}).find('form',{'id':'EntrezForm'})
side_column = key_text.find('div', {'xmlns:xi':'http://www.w3.org/2001/XInclude'}).find('div', {'class':'supplemental col three_col last'})
side_links = side_column.find('div').findAll('div')[1].find('div', {'id':'disc_col'}).findAll('div')[1]
for link in side_links:
    print link
If you look at the HTML source code using Chrome's Inspect Element, there should be several other nested divs with links within 'side_links'. However, the above code produces the following error:
Traceback (most recent call last):
File "C:/Users/ballbag/Copy/web_scraping/google_search.py", line 22, in <module>
side_links = side_column.find('div').findAll('div')[1].find('div', {'id':'disc_col'}).findAll('div')[1]
IndexError: list index out of range
If you go to the URL, there is a column on the right called 'related links' containing the URLs that I wish to scrape, but I can't seem to get to them. There is a statement under the div I am trying to get into, and I suspect this has something to do with it. Can anyone help grab these links? I'd really appreciate any pointers.
The problem is that the side bar is loaded with an additional asynchronous request.
The idea here would be to:
maintain a web-scraping session using requests.Session
parse the url that is used for getting the side bar
follow that link and get the links from the div with class="portlet_content"
Code:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.ncbi.nlm.nih.gov'
website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'
# parse the main page and grab the link to the side bar
session = requests.Session()
soup = BeautifulSoup(session.get(website).content)
url = urljoin(base_url, soup.select('div#disc_col a.disc_col_ph')[0]['href'])
# parsing the side bar
soup = BeautifulSoup(session.get(url).content)
for a in soup.select('div.portlet_content ul li.brieflinkpopper a'):
    print a.text, urljoin(base_url, a.get('href'))
Prints:
The metabolite 5'-methylthioadenosine signals through the adenosine receptor A2B in melanoma. http://www.ncbi.nlm.nih.gov/pubmed/25087184
Down-regulation of methylthioadenosine phosphorylase (MTAP) induces progression of hepatocellular carcinoma via accumulation of 5'-deoxy-5'-methylthioadenosine (MTA). http://www.ncbi.nlm.nih.gov/pubmed/21356366
Quantitative analysis of 5'-deoxy-5'-methylthioadenosine in melanoma cells by liquid chromatography-stable isotope ratio tandem mass spectrometry. http://www.ncbi.nlm.nih.gov/pubmed/18996776
...
Cited in PMC http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/23265702/citedby/?tool=pubmed