I am trying to parse the HTML page of a popular music streaming web app with BeautifulSoup. I am using the find_all function to look for a particular CSS class.
Workflow looks like:
r = requests.get('URL')
soup = BeautifulSoup(r.content)
soup.find_all("Tag", class_="Class name here")
The output is an empty list, which tells me it's not finding the class I'm looking for.
Here is the kicker: when I open the developer tools/HTML page source code, I can traverse the tree and find the class I am looking for.
Any ideas why it's not being loaded? And can I load it into my Python instance?
Thank you,
P.S. if any of my semantics/verbiage is incorrect please feel free to edit. I am not a webdev, just an enthusiast. >_<
Related
I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I have tried the soup.find_all() method and printed the result to see if it works, but it gives me an empty list. Can you help me fix it?
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content,"lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and uses scripts to load its data. When you make a request with requests.get("https://www.youtube.com/feed/explore"), you only get the initial source document, which contains things like the head, meta tags, and scripts; in a browser, the real content appears only after those scripts have fetched data from the server and updated the DOM. BeautifulSoup does not execute JavaScript, so it never sees those changes. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class in the HTML you downloaded.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy).
For YouTube, you can use its API as well.
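Here is a minimal sketch of the Selenium route, assuming Selenium 4 and Chrome are installed; the ytd-video-renderer tag comes from your question, but the 15-second wait is a guess and I have not verified this against YouTube's current markup.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/feed/explore")

# Wait until the page's JavaScript has actually inserted at least one video renderer.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, "ytd-video-renderer"))
)

# Hand the rendered DOM to BeautifulSoup as usual.
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

trendings = soup.find_all("ytd-video-renderer")
print(len(trendings))

The same pattern (render first, parse afterwards) applies to any page that builds its content with JavaScript.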
I was trying to get some headlines from the New York Times website. I have two questions.
question 1:
This is my code, but it gives me no output. Does anyone know what I'd have to change?
import requests
from bs4 import BeautifulSoup
url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
a = soup.find_all(class_="balancedHeadline")
for story_heading in a:
    print(story_heading)
My second question:
As the HTML is not the same for all headlines (for example, there is a different class for the big headlines and the smaller ones), how would I cover all those different classes in my code so that it gives me all of the headlines as output?
Thanks in advance!
BeautifulSoup is a robust parsing library, but, unlike your browser, it does not evaluate JavaScript. The elements with the balancedHeadline class you were looking for are not present in the downloaded HTML document; they get added later, once the page's assets have downloaded and its JavaScript functions have run. You won't be able to find such a class using your current technique.
The answer to your second question is in the docs: a regex or a function would work, but you might find that passing in a list is simpler for your application.
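For instance, here is a quick sketch of the list approach; the class names other than balancedHeadline are made up for illustration, and, per the point above, any class that is injected by JavaScript still won't appear in the downloaded document.

import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com'
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# find_all accepts a list, so one call can match several headline classes.
for story_heading in soup.find_all(class_=["balancedHeadline", "story-heading", "css-headline"]):
    print(story_heading.get_text(strip=True))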
So I'm having a problem grabbing a page's HTML. For some reason, when I send a request to the site and then use html.fromstring(site.content), it grabs the HTML for some pages, but for others it just prints out <Element html at 0x7f6359db3368>.
Is there a reason for this? Is there something I can do to fix it? Is it some type of security? Also, I don't want to use things like Beautiful Soup or Scrapy yet; I want to learn some more before I decide to get into those libraries...
Maybe this will help a little:
import requests
from lxml import html
a = requests.get('https://www.python.org/')
b = html.fromstring(a.content)
d = b.xpath('.//*[@id="documentation"]/a') # XPath to the blue 'Documentation' link near the top of the screen
print(d) #prints [<Element a at 0x104f7f318>]
print(d[0].text) #prints Documentation
You can usually find the XPath with the Chrome developer tools while inspecting the HTML. I'd be happy to give more specific help if you want to post the website you're scraping and what you're looking for.
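As a small follow-up on the <Element html at 0x...> part of your question: fromstring() returns an Element object, so printing it only shows the object's repr, not the markup. A short sketch (not tested against the pages you are scraping) of how to see the actual HTML or text:

import requests
from lxml import html

a = requests.get('https://www.python.org/')
b = html.fromstring(a.content)

print(html.tostring(b, pretty_print=True)[:300])  # the serialized HTML, as bytes
print(b.text_content()[:300])                     # just the visible text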
I'm looking for a method to read the content behind a hyperlink on a website. Is that possible?
For example:
website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
hyperlink = _some_method_to_find_hyperlink(openwebsite)
get_context_from_hyper(hyperlink)
I was searching in Beautiful Soup, but I cannot find anything useful.
I was thinking that I could loop over the page to find the relevant hyperlinks and use urllib2 again, but the website is quite large and it would take ages.
You could try the Beautiful Soup package, which enables you to parse HTML and thus extract any tag you might be looking for.
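A minimal sketch of the idea, using requests instead of urllib2; the URL and the startswith filter are placeholders rather than anything from your question.

import requests
from bs4 import BeautifulSoup

website = "https://example.com"
soup = BeautifulSoup(requests.get(website).text, "html.parser")

# Collect every hyperlink, then fetch the content behind each one.
for a in soup.find_all("a", href=True):
    link = a["href"]
    if link.startswith("http"):        # skip relative and fragment links for simplicity
        page = requests.get(link)
        print(link, len(page.text))

If the site is large, you would still filter the hrefs down to the relevant ones before fetching, exactly as you suggest, so that you only download the pages you care about.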
I'm attempting to access the URLs of the different fish families from this website: http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon
I'd like to be able to run a script that opens the links of the given website to then be able to parse the information stored within the pages. I'm fairly new to web scraping, so any help will be appreciated. Thanks in advance!
This is what I have so far:
import urllib2
import re
from bs4 import BeautifulSoup
import time
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)
page = urllib2.urlopen('http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon').read()
soup = BeautifulSoup(page)
soup.prettify()
for fish in soup.findAll('a', href=True):
    print fish['href']
Scrapy is the perfect tool for this. It is a Python web scraping framework.
http://doc.scrapy.org/en/latest/intro/tutorial.html
You can pass in your URL with your search term and create rules for crawling.
For example, using a regex you would add a rule to scrape all links with the path /Summary, and then extract the information using XPath or Beautiful Soup.
Additionally, you can set up a rule to handle pagination automatically, i.e. in your example URL it could follow the Next link.
Basically, a lot of what you are trying to do comes packaged for free in Scrapy. I would definitely take a look into it.
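A rough sketch of what such a spider might look like; the allow pattern and the fields yielded are assumptions based on the /Summary links mentioned above, not something I have tested against the site.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FishSpider(CrawlSpider):
    name = "fish"
    start_urls = [
        "http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon"
    ]

    rules = (
        # Follow links whose URL contains "Summary" and hand each page to parse_summary.
        Rule(LinkExtractor(allow=r"Summary"), callback="parse_summary"),
    )

    def parse_summary(self, response):
        # Grab the page title as a stand-in for whatever fields you actually need.
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}

You could run it with scrapy runspider fish_spider.py -o fish.json, and add a pagination rule for the Next link once the basics work.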
If you're just writing a one-off script to grab all the data from this site, you can do something like:
fish_url_base = "http://www.fishbase.org/ComNames/%s"
fish_urls = [fish_url_base%a['href'] for a in soup.find_all('a')]
This gives you a list of links to traverse, which you can pass to urllib2.urlopen and BeautifulSoup:
for url in fish_urls:
    fish_soup = BeautifulSoup(urllib2.urlopen(url).read())
    # Do something with your fish_soup
(Note 1: I haven't tested this code; you might need to adjust the base URL to fit the href attributes so you get to the right site.)
(Note 2: I see you're using bs4 but calling findAll on the soup. findAll was right for BS3, but it is changed to find_all in bs4.)
(Note 3: If you're doing this for practical purposes rather than for learning / fun, there are more efficient ways of scraping, such as Scrapy, also mentioned here.)