No output after using requests and BeautifulSoup in PyCharm - python

I was trying to get some headlines from the New York Times website. I have two questions.
Question 1:
This is my code, but it gives me no output. Does anyone know what I'd have to change?
import requests
from bs4 import BeautifulSoup
url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
a = soup.find_all(class_="balancedHeadline")
for story_heading in a:
    print(story_heading)
My second question:
As the HTML is not the same for all headlines (there's a different class for the big headlines and the smaller ones, for example), how would I handle all those different classes in my code so that it outputs all of the headlines?
Thanks in advance!

BeautifulSoup is a robust parsing library, but unlike your browser it does not evaluate JavaScript. The elements with the balancedHeadline class you were looking for are not present in the downloaded HTML document; they get added later, after the page's assets have downloaded and its JavaScript functions have run. You won't be able to find that class with your current technique.
The answer to your second question is in the docs: find_all() accepts a regex, a function, or a list of values to match. A regex or a function would work, but you might find that passing in a list is simpler for your application, as in the sketch below.
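Here is a minimal sketch of that list approach. The smallHeadline class name is just an example for illustration, and, as noted above, it will only find anything if the classes are actually present in the HTML you fetched (e.g. on a server-rendered page or in the source produced by a browser-driving tool):
import requests
from bs4 import BeautifulSoup
url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
# find_all() accepts a list and matches any element carrying one of these classes
headlines = soup.find_all(class_=["balancedHeadline", "smallHeadline"])
for story_heading in headlines:
    print(story_heading.get_text(strip=True))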

Related

How can I fix the find_all error while web scraping?

I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I tried using the soup.find_all() method, and to see if it works I displayed the result, but it gives me an empty list. Can you help me fix it? Click here to see the specific part of the HTML code.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content,"lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and uses scripts to load its data. When you make a request with requests.get("https://www.youtube.com/feed/explore"), you only get the initial source file, which contains things like the head, meta tags, and scripts. In a browser, those scripts then load the data from the server; BeautifulSoup does not see any of those JavaScript-driven changes to the DOM. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class in the downloaded HTML.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy); a rough sketch follows. For YouTube, you can use its API as well.
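A minimal sketch of the Selenium route, assuming Chrome with a matching driver and the Selenium 4 API; the tag name mirrors the question and may stop matching whenever YouTube changes its markup:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.youtube.com/feed/explore")
driver.implicitly_wait(10)  # poll up to 10s for elements to appear as the scripts render them
# the dynamically inserted elements now exist in the rendered DOM
videos = driver.find_elements(By.CSS_SELECTOR, "ytd-video-renderer")
for video in videos:
    print(video.text)
driver.quit()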

How to read text off a website using python (Simple explanation)

I'm looking to make a program that can get the text off a website when given the website's URL. I would like to be able to get all the text between the tags. Everywhere I have looked online seems to overcomplicate this, and it involves some coding in C, which I am not well versed in. Below is a summary of what I would like the code to look like (best case scenario). If there's anything I can clarify or anything unclear in the question, please let me know in the comments.
import WebReader as WR
StringOfWebText = WR.getParagrahText("WebsiteURL")
You probably want to look into something like BeautifulSoup paired with requests. You can then extract text from a page with a simple solution like this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.text)
There's also tag searching and other useful features built into BS4 if you need more than the raw page text; a sketch of grabbing just the paragraph text follows.
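For instance, something closer to the getParagrahText() idea in the question might look like this (a sketch with an example URL; what comes back depends on the real site's markup):
import requests
from bs4 import BeautifulSoup
r = requests.get("https://example.com")
soup = BeautifulSoup(r.text, "html.parser")
# collect the text of every <p> element on the page
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print("\n".join(paragraphs))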

Python 'BeautifulSoup()' function: what does it actually do?

Python newbie here. I know two methods of fetching a URL and passing its contents to BeautifulSoup.
Method #1 USING REQUESTS
from bs4 import BeautifulSoup
import requests
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print soup.prettify()
Method #2 USING URLLIB/URLLIB2
from bs4 import BeautifulSoup
import urllib2
f = urllib2.urlopen(url)
page = f.read() #Some people skip this step.
soup = BeautifulSoup(page)
print soup.prettify()
I have following questions:
What exactly does the BeautifulSoup() function do? In one place it requires page.content and html.parser, and in another it only takes urllib2.urlopen(url).read() (as in the second example). This is very easy to cram but hard to understand what is actually going on. I have checked the official documentation and did not find it very helpful. (Please also comment on html.parser and page.content: why not just html and page, as in the second example?)
In Method #2 above, what difference does it make if I skip the f.read() call?
For experts these questions might be very simple, but I would really appreciate help with them. I have googled quite a lot but still haven't found the answers.
Thanks!
BeautifulSoup does not open URLs. It takes HTML and gives you the ability to work with it (and prettify the output, as you have done).
In both method #1 and method #2 you fetch the HTML with another library (requests or urllib2) and then hand the resulting HTML to Beautiful Soup.
This is why you need to read the content in method #2.
So I think you are looking in the wrong spot for documentation: you should be reading up on requests or urllib2 (I recommend requests myself).
BeautifulSoup is a python package to help you parse html.
The first argument is the raw HTML (or XML) text to parse, so it doesn't matter which package delivers it; a string of markup is a string of markup.
The second argument, html.parser in your first example, tells BeautifulSoup which underlying parser to use. The main options are the built-in html.parser, lxml, and html5lib; they do basically the same job but differ in speed and in how forgiving they are of malformed markup.
If you omit that second argument, BeautifulSoup picks the best parser it finds installed (lxml if available, otherwise html.parser) and prints a warning telling you which one it chose.
On your last question: there is no fundamental difference between calling f.read() yourself and letting BeautifulSoup do it. BeautifulSoup accepts either a string or an open file-like object and will call .read() on the latter for you; reading explicitly just makes the step obvious and lets you reuse the raw content.
As @Klaus said in a comment, you should really read the docs here.
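To make the f.read() point concrete, here is a small sketch (Python 3's urllib is assumed here; the question's Python 2 urllib2 behaves the same way):
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://example.com"
# explicit read: pass the HTML as a string
html = urlopen(url).read()
soup1 = BeautifulSoup(html, "html.parser")
# implicit read: pass the file-like response; BeautifulSoup calls .read() itself
soup2 = BeautifulSoup(urlopen(url), "html.parser")
print(soup1.title == soup2.title)  # same parsed document either way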

BeautifulSoup4: Missing Parsed Table Data

I'm trying to extract the Earnings Per Share data through BeautifulSoup 4 from this page.
When I parse the data, the table information is missing whether I use the default, lxml, or html5lib parsers. I believe this has something to do with JavaScript, and I have been trying to use PyV8 to transform the script into readable HTML for BS4, but I don't know where to go from here.
Do you know if this is in fact my issue? I have been reading many posts and it's been a very big headache for me today. Below is a quick example: the financeWrap div should include the table information, but BeautifulSoup shows that it is empty.
import requests
from bs4 import BeautifulSoup
url = "http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US"
response = requests.get(url)
soup_key_ratios = BeautifulSoup(response.content, 'html5lib')
financial_tables = soup_key_ratios.find("div", {"id":"financeWrap"})
print(financial_tables)
# Output: <div id="financeWrap">
# </div>
The issue is that you're trying to get data that the website loads via Ajax. If you go to the link you provided and look at the page source in your browser, you'll see that the data isn't there.
However, if you use a console/network inspector such as Firebug, you will see Ajax requests being made to another URL, which is something you may be able to fetch and parse with BeautifulSoup (perhaps; I haven't tried it or looked at the structure of the data).
Keep in mind that this is quite possibly against the website's ToS.
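A hypothetical sketch of that approach: find the Ajax/XHR URL in your browser's network tab and request it directly. The endpoint below is a placeholder, not Morningstar's real URL, and whether you get JSON or an HTML fragment depends on the site:
import requests
from bs4 import BeautifulSoup
ajax_url = "https://example.com/ajax/key-ratios?t=AAPL"  # placeholder endpoint, not a real one
response = requests.get(ajax_url)
if "json" in response.headers.get("Content-Type", ""):
    # many Ajax endpoints return JSON you can use directly
    print(response.json())
else:
    # others return an HTML fragment you can parse as usual
    fragment = BeautifulSoup(response.text, "html.parser")
    print(fragment.find_all("table"))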

Parsing data stored in URLs via BeautifulSoup?

I'm attempting to access the URLs of the different fish family from this website: http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon
I'd like to be able to run a script that opens the links on that page so I can then parse the information stored within those pages. I'm fairly new to web scraping, so any help would be appreciated. Thanks in advance!
This is what I have so far:
import urllib2
import re
from bs4 import BeautifulSoup
import time
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)
page = urllib2.urlopen('http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon').read()
soup = BeautifulSoup(page)
soup.prettify()
for fish in soup.findAll('a', href=True):
    print fish['href']
Scrapy is the perfect tool for this. It is a Python web-scraping framework.
http://doc.scrapy.org/en/latest/intro/tutorial.html
You can pass in your URL with your search term and create rules for crawling.
For example, using a regex you would add a rule to follow all links with the path /Summary and then extract the information using XPath or Beautiful Soup.
Additionally, you can set up a rule to automatically handle the pagination, i.e. in your example URL it could automatically follow the Next link.
Basically, a lot of what you are trying to do comes packaged for free in Scrapy; I would definitely take a look at it. A rough spider sketch is below.
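As a rough, untested sketch of that idea (the /Summary/ pattern and the selectors are assumptions about fishbase.org's layout, not verified; run it with scrapy runspider):
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class FishSpider(CrawlSpider):
    name = "fish"
    start_urls = [
        "http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon"
    ]
    # follow links whose path contains /Summary/ and parse each one;
    # a second Rule could be added to follow the "Next" pagination link
    rules = (
        Rule(LinkExtractor(allow=r"/Summary/"), callback="parse_summary", follow=True),
    )

    def parse_summary(self, response):
        # pull out whatever fields you need with CSS or XPath selectors
        yield {"url": response.url, "title": response.css("title::text").get()}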
If you're just writing a one-off script to grab all the data from this site, you can do something like:
fish_url_base = "http://www.fishbase.org/ComNames/%s"
fish_urls = [fish_url_base%a['href'] for a in soup.find_all('a')]
This gives you a list of links to traverse, which you can pass to urllib2.urlopen and BeautifulSoup:
for url in fish_urls:
    fish_soup = BeautifulSoup(urllib2.urlopen(url).read())
    # Do something with your fish_soup
(Note 1: I haven't tested this code; you might need to adjust the base URL to fit the href attributes so you get to the right site.)
(Note 2: I see you're using bs4 but calling findAll on the soup. findAll was right for BS3, but it is changed to find_all in bs4.)
(Note 3: If you're doing this for practical purposes rather than learning or fun, there are more efficient ways of scraping, such as Scrapy, mentioned in the other answer here.)
