Crawl a news website and get the news content - python

I'm trying to download the text from a news website. The HTML is:
<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
<div class="field-item odd">
<p>"My Text" target="_blank">www.injuv.cl</a></strong></p> </div>
The output should be: My Text
I'm using the following python code:
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class': 'pane-content'})
print(p)
But the output of the code is "None". Do you know what is wrong with my code?

The problem is that you are not parsing the HTML, you are parsing the URL string:
html = "My URL"
parsed_html = BeautifulSoup(html)
Instead, you need to download the page source first. For example, in Python 2:
from urllib2 import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
In Python 3, it would be:
from urllib.request import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
Or, you can use the third-party "for humans"-style requests library:
import requests
html = requests.get("My URL").content
parsed_html = BeautifulSoup(html)
Also note that you should not be using BeautifulSoup version 3 at all - it is not maintained anymore. Replace:
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
with just:
from bs4 import BeautifulSoup
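Putting the pieces together, here is a minimal end-to-end sketch. It uses a hard-coded HTML fragment modelled on the question's markup so it runs offline; in the real code you would fetch the page with urlopen or requests first, as shown above:

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page source (simplified from the question)
html = '''
<div class="pane-content">
  <div class="field-items">
    <div class="field-item odd"><p>My Text</p></div>
  </div>
</div>
'''

parsed_html = BeautifulSoup(html, 'html.parser')
p = parsed_html.find('div', attrs={'class': 'pane-content'})
print(p.get_text(strip=True))  # -> My Text
```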

BeautifulSoup accepts a string of HTML. You need to retrieve the HTML from the page using the URL.
Check out urllib for making HTTP requests. (Or requests for an even simpler way.) Retrieve the HTML and pass that to BeautifulSoup like so:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Get the HTML
conn = urlopen("http://www.example.com")
html = conn.read()

# Give BeautifulSoup the HTML:
soup = BeautifulSoup(html, "html.parser")
From here, just parse as you attempted previously.
p = soup.find("div", attrs={'class':'pane-content'})
print(p)

Related

Not getting the entire <li> line using BeautifulSoup

I am using BeautifulSoup to extract the list items under the class "secondary-nav-main-links" from the https://www.champlain.edu/current-students web page. I thought my working code below would extract each entire "li" line, but the closing "/li" tag is placed on its own line. I included screen captures of the current output and the intended output. Any ideas? Thanks!!
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.champlain.edu/current-students')
bs = BeautifulSoup(html.read(), 'html.parser')
soup = bs.find(class_='secondary-nav secondary-nav-sm has-callouts')
for div in soup.find_all('li'):
    print(div)
Current output:
capture1
Intended output:
capture2
You can remove the newline character with str.replace, and you can unescape HTML entities such as &amp; with html.unescape:
str(div).replace('\n', '')
To replace &amp; with &, add this to the print statement:
import html
html.unescape(str(div))
So your code becomes (note the urlopen result is stored in page, so it does not shadow the html module):
from urllib.request import urlopen
from bs4 import BeautifulSoup
import html

page = urlopen('https://www.champlain.edu/current-students')
bs = BeautifulSoup(page.read(), 'html.parser')
soup = bs.find(class_='secondary-nav secondary-nav-sm has-callouts')
for div in soup.find_all('li'):
    print(html.unescape(str(div).replace('\n', '')))
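The cleanup step can be checked in isolation on a small stand-in fragment (the markup below is hypothetical, standing in for what str(div) produces):

```python
import html

# A hypothetical <li> fragment, as str(div) might render it
fragment = '<li><a href="/students">Students &amp; Staff</a>\n</li>'

# Drop the stray newline, then turn &amp; back into &
cleaned = html.unescape(fragment.replace('\n', ''))
print(cleaned)  # -> <li><a href="/students">Students & Staff</a></li>
```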

How to get the text in <script>

A while ago I used the following code to get window._sharedData;, but now the same code returns nothing. What should I do?
If I change script to div it works, but I need to use script.
code.py
from bs4 import BeautifulSoup
html1 = '<h1><script>window._sharedData;</script></h1>'
soup = BeautifulSoup(html1)
print(soup.find('script').text)
Add html.parser or lxml as the parser, and call .string instead of .text:
from bs4 import BeautifulSoup
html = '<h1><script>window._sharedData;</script></h1>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('script').string)
You should use BeautifulSoup(html1, 'lxml') instead of BeautifulSoup(html1). If the output is empty, use .string instead of .text. You can try it:
from bs4 import BeautifulSoup
html1 = '<h1><script>window._sharedData;</script></h1>'
soup = BeautifulSoup(html1, 'lxml')
print(soup.find('script').text)
or
print(soup.find('script').string)
Output will be:
window._sharedData;
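For context: in newer Beautiful Soup releases (4.9 and later), the contents of script and style tags are no longer treated as document text, so .text can come back empty while .string still returns the script body. A minimal check:

```python
from bs4 import BeautifulSoup

snippet = '<h1><script>window._sharedData;</script></h1>'
soup = BeautifulSoup(snippet, 'html.parser')
tag = soup.find('script')

# .string returns the tag's contents regardless of bs4 version
print(tag.string)  # -> window._sharedData;

# .text may be '' on bs4 >= 4.9, where script contents are not "text"
print(repr(tag.text))
```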

Why does Beautiful Soup not return the content?

I am using bs4 to scrape some results. I can see the HTML content in the page source, but when I try to fetch it using bs4 it does not return the content; instead it says "File does not exist".
from bs4 import BeautifulSoup
import requests
source = requests.get("https://result.smitcs.in/grade.php?subid=BA1106")
soup = BeautifulSoup(source.text, "html.parser")
marks_pre = soup.find("pre")
marks = marks_pre.find("div")
print(marks.prettify())
The above code returns
<div style="font-family: courier; line-height: 12px;font-size:
20px;background:white;"> File does not exist </div>
The above code works fine if I copy the source code from the web and save it locally as HTML file and then fetch it.
Try this:
from bs4 import BeautifulSoup
import requests
URL = "https://result.smitcs.in/grade.php?subid=BA1106"
PAGE = requests.get(URL)
# get HTML content
SOUP = BeautifulSoup(PAGE.content, 'lxml')
marks = SOUP.find("div")
print(marks.prettify())

Having trouble in getting page source with beautifulsoup

I am trying to get the HTML source of a web page using beautifulsoup.
import bs4 as bs
import requests
import urllib.request
sourceUrl='https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2.html'
source=urllib.request.urlopen(sourceUrl).read()
soup=bs.BeautifulSoup(source,'html.parser')
print(soup)
I want the HTML source of the page. This is what I am getting now:
'ps.store("siteSettings", {"title":"PakWheels Forums","contact_email":"sami.ullah#pakeventures.com","contact_url":"https://www.pakwheels.com/main/contact_us","logo_url":"https://www.pakwheels.com/assets/logo.png","logo_small_url":"/images/d-logo-sketch-small.png","mobile_logo_url":"data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz4NCjwhLS0gR2VuZXJhdG9yOiBBZG9iZSBJbGx1c3RyYXRvciAxNi4wLjAsIFNWRyBFeHBvcnQgUGx1Zy1JbiAuIFNWRyBWZXJzaW9uOiA2LjAwIEJ1aWxkIDApICAtLT4NCjwhRE9DVFlQRSBzdmcgUFVCTElDICItLy9XM0MvL0RURCBTVkcgMS4xLy9FTiIgImh0dHA6Ly93d3cudzMub3JnL0dyYXBoaWNzL1NWRy8xLjEvRFREL3N2ZzExLmR0ZCI+DQo8c3ZnIHZlcnNpb249IjEuMSIgaWQ9IkxheWVyXzEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgeG1sbnM6eGxpbms9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGxpbmsiIHg9IjBweCIgeT0iMHB4Ig0KCSB3aWR0aD0iMjQwcHgiIGhlaWdodD0iNjBweCIgdmlld0JveD0iMCAwIDI0MCA2MCIgZW5hYmxlLWJhY2tncm91bmQ9Im5ldyAwIDAgMjQwIDYwIiB4bWw6c3BhY2U9InByZXNlcnZlIj4NCjxwYXRoIGZpbGw9IiNGRkZGRkYiIGQ9Ik02LjkwMiwyMy4yODZDMzQuNzc3LDIwLjI2Miw1Ny4yNC'
Have a look at this code:
from urllib import request
from bs4 import BeautifulSoup

url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
Import everything you need correctly.

Beautifulsoup url loading error

So I am trying to get the content of this page using Beautiful Soup. I want to create a dictionary of all the CSS color names, and this seemed like a quick and easy way to access them. So naturally I did the quick basic:
from bs4 import BeautifulSoup as bs
url = 'http://www.w3schools.com/cssref/css_colornames.asp'
soup = bs(url)
For some reason I am only getting the URL in a p tag inside the body, and that's it:
>>> print soup.prettify()
<html>
<body>
<p>
http://www.w3schools.com/cssref/css_colornames.asp
</p>
</body>
</html>
why wont BeautifulSoup give me access to the information I need?
Beautifulsoup does not load a URL for you.
You need to pass in the full HTML page, which means you need to load it from the URL first. Here is a sample using the urllib2.urlopen function to achieve that:
from urllib2 import urlopen
from bs4 import BeautifulSoup as bs

url = 'http://www.w3schools.com/cssref/css_colornames.asp'
source = urlopen(url).read()
soup = bs(source)
Now you can extract the colours just fine:
css_table = soup.find('table', class_='reference')
for row in css_table.find_all('tr'):
    cells = row.find_all('td')
    if cells:
        print cells[0].a.text, cells[1].a.text
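The same extraction can be checked offline against a small stand-in table. The markup below is a hypothetical reduction of the w3schools page, and the snippet is written in Python 3:

```python
from bs4 import BeautifulSoup

# Hypothetical, reduced version of the colour reference table
html = '''
<table class="reference">
  <tr><th>Color Name</th><th>Hex</th></tr>
  <tr><td><a href="#">AliceBlue</a></td><td><a href="#">#F0F8FF</a></td></tr>
  <tr><td><a href="#">AntiqueWhite</a></td><td><a href="#">#FAEBD7</a></td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
colors = {}
css_table = soup.find('table', class_='reference')
for row in css_table.find_all('tr'):
    cells = row.find_all('td')
    if cells:  # the header row has <th> cells, not <td>, so it is skipped
        colors[cells[0].a.text] = cells[1].a.text

print(colors)  # -> {'AliceBlue': '#F0F8FF', 'AntiqueWhite': '#FAEBD7'}
```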
