I am trying to navigate a website using beautifulsoup. I open the first page and find the links I want to follow, but when I ask beautiful soup to open the next page, none of the HTML is parsed and it just returns this
<function scraper at 0x000001E3684D0E18>
I have tried opening the second page in its own script and it works just fine so the problem has to do with parsing a page from another page.
I have ~2000 links I need to go through so I created a function that goes through them. Here's my script so far
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import lxml
# The first webpage I'm parsing
my_url = 'https://mars.nasa.gov/msl/multimedia/raw/'
#calls the urlopen function from the request module of the urllib module
# AKA opens up the connection and grabs the page
uClient = uReq(my_url)
#imports the entire webpage from html format into python.
# If webpage has lots of data this can take a long time and take up a lot of
space or crash
page_html = uClient.read()
#closes the client
uClient.close()
#parses the HTML using bs4
page_soup = soup(page_html, "lxml")
#finds the categories for the types of images on the site, category 1 is
RHAZ
containers = page_soup.findAll("div", {"class": "image_list"})
RHAZ = containers[1]
# prints the links in RHAZ
links = []
for link in RHAZ.find_all('a'):
#removes unwanted characters from the link making it usable.
formatted_link = my_url+str(link).replace('\n','').split('>')
[0].replace('%5F\"','_').replace('amp;','').replace('<a href=\"./','')
links.append(formatted_link)
print (links[1])
# I know i should be defining a function here.. so ill give it a go.
def scraper():
pic_page = uReq('links[1]') #calls the first link in the list
page_open = uClient.read() #reads the page in a python accessible format
uClient.close() #closes the page after it's been stored to memory
soup_open = soup(page_open, "lxml")
print (soup_open)
print (scraper)
Do I need to clear the previously loaded HTML in beautifulsoup so I can open the next page? If so, how would I do this? Thanks for any help
You need to make requests from the urls scraped from first page...check this code.
from bs4 import BeautifulSoup
import requests
url = 'https://mars.nasa.gov/msl/multimedia/raw'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'lxml')
img_list = soup.find_all('div', attrs={'class': 'image_list'})
for i in img_list:
image = i.find_all('a')
for x in image:
href = x['href'].replace('.', '')
link = (str(url)+str(href))
req2 = requests.get(link)
soup2 = BeautifulSoup(req2.content, 'lxml')
img_list2 = soup2.find_all('div', attrs={
'class': 'RawImageUTC'})
for l in img_list2:
image2 = l.find_all('a')
for y in image2:
href2 = y['href']
print(href2)
Output:
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FLB_590315340EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FRB_590315340EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FLB_590315340EDR_T0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02172/opgs/edr/fcam/FRB_590315340EDR_T0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FLB_590214757EDR_F0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FRB_590214757EDR_F0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FLB_590214757EDR_T0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02171/opgs/edr/fcam/FRB_590214757EDR_T0722464FHAZ00341M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590149941EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FRB_590149941EDR_F0722464FHAZ00337M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134317EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134106EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134065EDR_S0722464FHAZ00214M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590134052EDR_S0722464FHAZ00222M_.JPG
http://mars.jpl.nasa.gov/msl-raw-images/proj/msl/redops/ods/surface/sol/02170/opgs/edr/fcam/FLB_590133948EDR_S0722464FHAZ00222M_.JPG
Related
I'm trying to scrape data from Robintrack but, I cannot get the data from the increase/decrease section. I can only scrape the home page data. Here is my Soup
import bs4
import requests
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
robin_track_url= 'https://robintrack.net/popularity_changes?changeType=increases'
#r = requests.get('https://robintrack.net/popularity_changes?changeType=increases')
#soup = bs4.BeautifulSoup(r.text, "xml")
#Grabs and downloads html page
uClient = uReq(robin_track_url)
page_html = uClient.read()
uClient.close()
#Does HTML parsing
page_soup = soup(page_html, "html.parser")
print("On Page")
print(page_soup.body)
stocks = page_soup.body.findAll("div", {"class":"ReactVirtualized__Table__row"})
print(len(stocks))
print(stocks)
Am I doing something wrong?
Your version not will be work because the data that you want to load is loaded via JS.
requests load the only static page.
if you want to get data that you want to do next:
requests.get('https://robintrack.net/api/largest_popularity_increases?hours_ago=24&limit=50&percentage=false&min_popularity=50&start_index=0').json()
I would like to scrape all running times (not just the first 10 results) from the data tables on https://www.ijsselsteinloop.nl/uitslagen-2019. However, the data that shows on the webpage does not show in de page source. Under every data table, there's a hyperlink ("hier"). These link to the full data table pages. But those links are also not in the page source.
Any suggestions or code snippets how to scrape this data (with Python and BeautifulSoup or Scrapy).
Use the same endpoint the page uses for that content. You can find this in the network tab of the browser.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.ijsselsteinloop.nl/uitslag/2019/index.html')
soup = bs(r.content, 'lxml')
links = ['https://www.ijsselsteinloop.nl/uitslag/2019/' + item['href'] for item in soup.select('[href^=uitslag]')]
for link in links:
table = pd.read_html(link)[0]
print(table)
You could use BeautifulSoup. First :
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")
And then use function find.All( to get every tr). And then use for loop , and type
again find('td') to get every row
very new to Python. The following code will only allow me to display individual p entries from the extracted website (the first entry, 0, being the current example).
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://en.wikipedia.org/wiki/Young_Thug"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
page_soup.findAll("p")
paragraphs = page_soup.findAll("p")
paragraph = paragraphs[0].text.strip()
print(paragraph)
For some reason, I can't grip the particular for argument I would need to display all of the p elements on the site in a single block of text.
The eventual goal of the above code snippet is a reading grade level app, hence the stripped down text. Any help would be appreciated, thank you!
I’m not near my laptop to include the output, but generally it would be:
paragraphs = page_soup.findAll("p")
for para in paragraphs:
print (para.text.strip())
My script below...
I feel like I'm missing one line of code to make this work properly. Using Reddit as a test source to scrap sport links.
# import libraries
import bs4
from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.reddit.com/r/BoxingStreams/comments/6w2vdu/mayweather_vs_mcgregor_archive_footage/'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
hyperli = page_soup.findAll("form")
filename = "sportstreams.csv"
f = open(filename, "w")
headers = "Sport Links"
f.write(headers)
for containli in hyperli:
link = containli.a["href"]
print(link)
f.write(str(link)+'\n')
f.close()
Everything works except that it only grabs the link from the first row [0]. If I don't use the code ["href"] then it adds all the (a href links) except that it also adds the word NONE to the CSV file. Using the
["href"] would (I hope) just add the http links and avoid adding the word NONE.
What am I missing here?
As explained in the documentation Navigating using tag names:
Using a tag name as an attribute will give you only the first tag by that name
...
If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all():
In your case, you could use page_soup.select("form a[href]") to find all the links in forms that have href attributes.
links = page_soup.select("form a[href]")
for link in links:
href = link["href"]
print(href)
f.write(href + "\n")
I'm using beautiful soup for the first time and the text from the span class is not being extracted. I'm not familiarized with HTML so I'm unsure as to why this happens, so it'd be great to understand.
I've used the code below:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.Close()
page_soup = soup(page_html, "html.parser")
content = page_soup.findAll("span",attrs={"data-item":"rate"})
With this code for index 0 it returns the following:
<span class="productdata" data-baserate-code="VRI" data-cc="AU" data-
item="rate" data-section="PHL" data-subsection="VR"></span>
However I'd expect something like this when I inspect via Chrome, which has the text such as the interest rate:
<span class="productdata" data-cc="AU" data-section="PHL" data-
subsection="VR" data-baserate-code="VRI" data-item="rate">5.20% p.a.</span>
Data you are trying to extract does not exists. It is loaded using JS after the page is loaded. Website uses a JSON api to load information on the page. So Beautiful soup can not find the data. Data can be viewed at following link that hits JSON API on the site and provides JSON data.
https://www.anz.com/productdata/productdata.asp?output=json&country=AU§ion=PHL
You can parse the json and get the data. Also for HTTP requests I would recommend requests package.
As others said, the content is JavaScript generated, you can use selenium together ChromeDriver to find the data you want with something like:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome")
items = driver.find_elements_by_css_selector("span[data-item='rate']")
itemsText = [item.get_attribute("textContent") for item in items]
>>> itemsText
['5.20% p.a.', '5.30% p.a.', '5.75% p.a.', '5.52% p.a.', ....]
As seen above, BeautifulSoup wasn't necessary at all, but you can use it instead to parse the page source and get the same results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.findAll("span",{"data-item":"rate"})
itemsText = [item.text for items in items]