I'm looking to make a program that can get the text off a website when given the website's URL. I would like to be able to get all of the text between the paragraph tags. Everywhere I have looked online seems to overcomplicate this, and it usually involves some coding in C, which I am not well versed in. To summarize, this is what I would like the code to look like (best case scenario). If there's anything unclear in the question or anything I can clarify, please let me know in the comments.
import WebReader as WR
StringOfWebText = WR.getParagraphText("WebsiteURL")
You probably want to look into something like BeautifulSoup paired with requests. You can then extract text from a page with a simple solution like this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.text)
There's also tag searching and other useful features built into BS4 if you need them.
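If you want something shaped like the getParagraphText helper in your question, a minimal sketch could look like this (WebReader and getParagraphText are just names the question made up, so I've used an ordinary function instead):
import requests
from bs4 import BeautifulSoup

def get_paragraph_text(url):
    # Fetch the page and join the text of every <p> tag on it
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    return "\n".join(p.get_text() for p in soup.find_all("p"))

print(get_paragraph_text("https://example.com"))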
I was trying to get some headlines from The New York Times website. I have two questions.
Question 1:
This is my code, but it gives me no output. Does anyone know what I'd have to change?
import requests
from bs4 import BeautifulSoup
url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
a = soup.find_all(class_="balancedHeadline")
for story_heading in a:
    print(story_heading)
My second question:
As the HTML is not the same for all headlines (for example, the big headlines use a different class than the smaller ones), how would I handle all those different classes in my code so that it gives me all of the headlines as output?
Thanks in advance!
BeautifulSoup is a robust parsing library, but unlike your browser, it does not evaluate JavaScript.
The elements with the balancedHeadline class that you were looking for are not present in the downloaded HTML document; they get added later, once the page's assets have downloaded and its JavaScript functions have run. You won't be able to find such a class using your current technique.
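You can check this for yourself against the raw response:
import requests

r = requests.get('https://www.nytimes.com')
# The class is injected by JavaScript after the page loads, so it should not
# appear anywhere in the raw HTML that requests sees
print("balancedHeadline" in r.text)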
The answer to your second question is in the docs. A regex or a function would work, but you might find that passing in a list of class names is simpler for your application.
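For example, find_all() accepts a list of class names and matches elements that have any of them:
# Reusing the soup from your code; the second class name is only a placeholder
# for whatever class the smaller headlines actually carry
headlines = soup.find_all(class_=["balancedHeadline", "smallHeadline"])
for story_heading in headlines:
    print(story_heading.get_text(strip=True))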
Python newbie here. I know two methods of fetching a URL and parsing it with BeautifulSoup.
Method #1 USING REQUESTS
from bs4 import BeautifulSoup
import requests
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print soup.prettify()
Method #2 USING URLLIB/URLLIB2
from bs4 import BeautifulSoup
import urllib2
f = urllib2.urlopen(url)
page = f.read() #Some people skip this step.
soup = BeautifulSoup(page)
print soup.prettify()
I have following questions:
What exactly does the BeautifulSoup() function do? In one place it takes page.content and 'html.parser', and in another it only takes urllib2.urlopen(url).read() (as in the second example). This is easy to memorize but hard to actually understand. I have checked the official documentation, which was not very helpful. (Please also comment on 'html.parser' and page.content: why not just html and page, like in the second example?)
In Method#2 as stated above, what difference does it make if I skip the f.read() command ?
For experts, these questions might be very simple, but I would really appreciate help on these. I have googled quite a lot but still not getting the answers.
Thanks !
BeautifulSoup does not open URLs. It takes HTML, and gives you the ability to prettify the output (as you have done).
In both method #1 and method #2 you are fetching the HTML with another library (either requests or urllib2) and then handing the resulting HTML to BeautifulSoup.
This is why you need to read the content in method #2.
Therefore, I think you are looking in the wrong spot for documentation. You should be reading up on how to use requests or urllib2 (I recommend requests myself).
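To make the split of responsibilities concrete: BeautifulSoup is perfectly happy with an HTML string you typed by hand, because it never goes near the network.
from bs4 import BeautifulSoup

# No network access here at all: BeautifulSoup only parses the string it is given
soup = BeautifulSoup("<html><body><p>hello</p></body></html>", "html.parser")
print(soup.p.text)   # hello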
BeautifulSoup is a Python package that helps you parse HTML.
The first argument it requires is just raw HTML (or XML) text that it can parse, so it doesn't matter which package delivers it, as long as it is valid markup.
The second argument, 'html.parser' in your first example, tells BeautifulSoup which underlying parser to use on that data. As far as I can tell the common choices are html.parser and lxml; they do basically the same thing, with different performance trade-offs.
If you omit that second argument, BeautifulSoup just picks the best parser it can find, which is usually lxml when it is installed.
On your last question I'm not entirely sure, but I think there is no fundamental difference between calling f.read() yourself and letting BeautifulSoup do it implicitly, though being explicit about it is arguably better practice.
Like @Klaus said in a comment, you should really read the docs here.
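As a small illustration of the f.read() point: BeautifulSoup accepts either the already-read string or the open file-like object returned by urlopen (it calls .read() on it for you), so both of these should produce the same soup. A quick sketch, assuming Python 2 and urllib2 as in your question:
import urllib2
from bs4 import BeautifulSoup

url = 'https://www.python.org/'
# Passing the open file-like object: BeautifulSoup calls .read() on it itself
soup_from_file = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
# Passing the already-read string: same result
soup_from_text = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
print soup_from_file.title.text
print soup_from_text.title.text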
This question is similar to the one asked here, but the answer was not of much help.
I am trying to extract comments from a webpage which uses Disqus, however I am not able to access the section.
This is what I have so far; it's not much:
import urllib
import urllib2,cookielib
from bs4 import BeautifulSoup
from IPython.display import HTML
site= "http://www.timesofmalta.com/articles/view/20161207/local/daphne-caruana-galizia-among-politicos-28-most-influential.633146"
hdr = {'User-Agent':'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page,"html.parser")
title = soup.title.text
print title
Any hints as to how I could attempt to tackle this?
I had the same issue while trying to download an infinite-scroll page in Java. After trying a million things, including Beautiful Soup, I realized the best way to tackle this problem was to debug with Chrome: catch the URL of the request that goes out as the dynamic content loads, and then find a way to generalize that URL so I can call it with different parameters.
So, for example, if you have the Chrome debugging console open when you trigger your infinite scroll, you will see an HTTP request (probably a GET) go out. If that URL has a structure like:
http://www.yourlink.com/get_comments/product/page_offset_numbertoload/
then you will be able to build the same HTTP request in Python, send it, and get back the response that contains the data you are looking for. Good luck man!
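A rough sketch of what that could look like (the URL is just the made-up pattern from above; the real endpoint might return JSON, HTML fragments, or something else entirely):
import requests

# Walk the made-up pagination endpoint page by page; the real URL and parameters
# come from whatever request you see in the Network tab of the Chrome dev tools
for offset in range(0, 100, 20):
    url = "http://www.yourlink.com/get_comments/product/{}/".format(offset)
    r = requests.get(url)
    print(r.status_code, r.text[:200])   # inspect what actually comes back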
So I'm having a problem grabbing a page's HTML. For some reason, when I send a request to a site and then use html.fromstring(site.content), it grabs some pages' HTML, but for others it just prints out <Element html at 0x7f6359db3368>.
Is there a reason for this? Something I can do to fix it? Is it some type of security? Also, I don't want to use things like Beautiful Soup or Scrapy yet... I want to learn some more before I decide to get into those libraries...
Maybe this will help a little:
import requests
from lxml import html
a = requests.get('https://www.python.org/')
b = html.fromstring(a.content)
d = b.xpath('.//*[@id="documentation"]/a') #XPath to the blue 'Documentation' link near the top of the page
print(d) #prints [<Element a at 0x104f7f318>]
print(d[0].text) #prints Documentation
You can usually find the XPath with the Chrome developer tools after inspecting the HTML. I'd be happy to give more specific help if you want to post the website you're scraping and what you're looking for.
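On the <Element html at 0x...> output itself: html.fromstring() always returns an Element object, and that is just how an Element prints; it is not an error or a security block. If you want to see the markup or the text again, ask for it explicitly, for example:
import requests
from lxml import html

a = requests.get('https://www.python.org/')
b = html.fromstring(a.content)
print(html.tostring(b)[:200])    # the serialized HTML, as bytes
print(b.text_content()[:200])    # just the text, with all the tags stripped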
I am attempting to scrape the comments counter from a Web page. The code is presented below.
When I ask it to print letters, the output is an empty list. Why is that happening?
import urllib2
from bs4 import BeautifulSoup
r2 = urllib2.urlopen("http://www.ign.com/articles/2016/01/03/steam-surpasses-12-million-concurrent-users").read()
soup2 = BeautifulSoup(r2)
letters = soup2.find_all("div",class_="fyre-comment-count")
print letters
The list is empty because there are no comments on that page. div#livefyre-comment is empty, and div.fyre-comment-count does not exist.
Up in the page's header, there is a suspicious script tag pulling JavaScript from http://cdn.livefyre.com/Livefyre.js. I don't know what Livefyre is, but I assume it sucks comments out of a database somewhere and inserts them into div#livefyre-comment or its surrounding div.article-comments. Presumably, div.fyre-comment-count will also appear somewhere in the DOM once the script is done.
This sort of... design decision is increasingly common on Web sites. To see what a Web page really looks like, browse it with both JavaScript and cookies off (and be prepared for the occasional "500 Internal Server Error" from sites that never imagined such hooliganism was possible).
I don't know enough about screen scraping to tell you where to go from here. You might be able to piece together a URL to fetch the comments (and their count) directly from Livefyre. I'd start by perusing the JavaScript functions they provide, and the data-settings attribute of div#livefyre-comment, which appears to be a JSON dictionary full of relevant parameters.
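If you want to poke at that yourself, here is a rough sketch (the div id and the data-settings attribute are just what I described above from inspecting the page; the layout of the JSON is a guess):
import json
import urllib2
from bs4 import BeautifulSoup

r2 = urllib2.urlopen("http://www.ign.com/articles/2016/01/03/steam-surpasses-12-million-concurrent-users").read()
soup2 = BeautifulSoup(r2, "html.parser")
placeholder = soup2.find("div", id="livefyre-comment")
if placeholder is not None and placeholder.has_attr("data-settings"):
    settings = json.loads(placeholder["data-settings"])
    # Look through this for site/article ids you could use against Livefyre directly
    print(settings)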
Your code is very close, almost right. You just missed a few things. Check the code below.
from bs4 import BeautifulSoup
import urllib2
r2 = urllib2.urlopen("http://www.ign.com/articles/2016/01/03/steam-surpasses-12-million-concurrent-users").read()
soup = BeautifulSoup(r2, 'html.parser')
for line in soup.find_all("div", class_="fyre-comment-count"):
    comments = ''.join(line.find_all(text=True))
    print (comments)