Extremely simple Python 3 web scraper not working

I am working through a book "The Self-taught programmer" and am having trouble with some python code. I get the program to run without any errors. The problem is that there is no output whatsoever.
import urllib.request
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)

news = "https://news.google.com/"
Scraper(news).scrape()

Look at the last "if" statement. If there's no text "html" in the url, nothing gets printed. Try removing that and un-indenting:
class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            print("\n" + url)
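If the goal was to keep only links that really point at .html pages, the substring check ("html" in url) is loose: it also matches query strings and hostnames that merely contain "html". A stricter alternative (a small sketch, not from the original answer) is str.endswith:

```python
links = [
    "https://example.com/story.html",
    "https://example.com/rss",
    "./articles/topstories.html?hl=en",  # substring match would keep this
]

# Keep only URLs whose path actually ends in ".html".
kept = [url for url in links if url.endswith(".html")]
print(kept)  # ['https://example.com/story.html']
```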

Related

Printing an HTML Output in Python

I've been creating a program with a variety of uses. I call it the Electronic Database with Direct Yield (EDDY). One thing that I have been having the most trouble with is EDDY's Google search capabilities. EDDY will ask the user to give an input. EDDY will then edit the input slightly by replacing any spaces (' ') with plus signs ('+'), then go to the resulting URL (without opening a browser). It then copies the HTML from the webpage and is SUPPOSED to give the results and descriptions of the site, and, to clarify, without the HTML code.
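The space-to-plus edit described above is exactly what urllib.parse.quote_plus does (it also percent-escapes other unsafe characters). A minimal sketch:

```python
from urllib.parse import quote_plus

# quote_plus turns spaces into '+' and escapes other unsafe characters.
query = "python web scraping"
url = "http://www.google.com/search?q=" + quote_plus(query)
print(url)  # http://www.google.com/search?q=python+web+scraping
```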
This is what I have so far.
import urllib
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import requests

def cleanup(url):
    html_content = requests.get(url).text
    soup = BeautifulSoup(html_content, "lxml")
    length = len(soup.prettify()) - 1
    print(soup.prettify()[16800:length])
    print(soup.title.text)
    print(soup.body.text)

def eddysearch():
    headers = {'User-Agent': 'Chrome.exe'}
    reg_url = "http://www.google.com/search?q="
    print("Ready for query")
    query = input()
    if(query != "quit"):
        print("Searching for keyword: " + query)
        print("Please wait...")
        search = urllib.parse.quote_plus(query)
        url = reg_url + search
        req = Request(url=url, headers=headers)
        html = urlopen(req).read()
        cleanup(url)
        eddysearch()

eddysearch()
Can anyone help me out? Thanks in advance!
If you don't want to deal with an SSL certificate, you can just call .read():
# Python 2.7.x
import urllib
url = "http://stackoverflow.com"
f = urllib.urlopen(url)
print f.read()

# Python 3.x
import urllib.request
url = 'http://www.stackoverflow.com'
f = urllib.request.urlopen(url)
print(f.read())

Problem with Python web scraper in PyCharm. (Beginner)

I recently started learning Python. In the process of learning about web scraping, I followed an example to scrape from Google News. After running my code, I get the message "Process finished with exit code 0" with no results. If I change the url to "https://yahoo.com", I get results. Could anyone point out what, if anything, I am doing wrong?
Code:
import urllib.request
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)

news = "https://news.google.com/"
Scraper(news).scrape()
Try this out:
import urllib.request
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            else:
                print("\n" + url)

if __name__ == '__main__':
    news = "https://news.google.com/"
    Scraper(news).scrape()
Initially you were checking each link to see if it contained 'html' in it. I am assuming the example you were following was checking to see if the links ended in '.html'.
Beautiful Soup works really well, but you need to check the source code of the website you're scraping to get an idea of how the markup is laid out. DevTools in Chrome works really well for this; press F12 to get there quickly.
I removed:
    if "html" in url:
        print("\n" + url)
and replaced it with:
    else:
        print("\n" + url)
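As an aside, the None check can be avoided entirely: find_all accepts href=True, which skips anchor tags that have no href attribute at all. A sketch (the tiny HTML document is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<a href="/a.html">one</a><a>no href</a><a href="/b">two</a>'
sp = BeautifulSoup(html, "html.parser")

# href=True yields only <a> tags that actually carry an href attribute,
# so tag["href"] can never be None here.
hrefs = [tag["href"] for tag in sp.find_all("a", href=True)]
print(hrefs)  # ['/a.html', '/b']
```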

Python: Simple Web Crawler using BeautifulSoup4

I have been following TheNewBoston's Python 3.4 tutorials that use PyCharm, and am currently on the tutorial on how to create a web crawler. I simply want to download all of XKCD's comics. Using the archive, that seemed very easy. Here is my code, followed by TheNewBoston's.
Whenever I run the code, nothing happens. It runs through and says "Process finished with exit code 0". Where did I screw up?
TheNewBoston's Tutorial is a little dated, and the website used for the crawl has changed domains. I will comment the part of the video that seems to matter.
My code:
import requests
from urllib import request
from bs4 import BeautifulSoup

def download_img(image_url, page):
    name = str(page) + ".jpg"
    request.urlretrieve(image_url, name)

def xkcd_spirder(max_pages):
    page = 1
    while page <= max_pages:
        url = r'http://xkcd.com/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('div', {'img': 'src'}):
            href = link.get('href')
            print(href)
            download_img(href, page)
        page += 1

xkcd_spirder(5)
The comic is in the div with the id comic; you just need to pull the src from the img inside that div, then join it to the base url, and finally request the content and write it. I use the basename as the name to save the file under.
I also replaced your while with a range loop and did all the http requests just using requests:
import requests
from bs4 import BeautifulSoup
from os import path
from urllib.parse import urljoin  # python2 -> from urlparse import urljoin

def download_img(image_url, base):
    # path.basename(image_url)
    # http://imgs.xkcd.com/comics/tree_cropped_(1).jpg -> tree_cropped_(1).jpg
    with open(path.basename(image_url), "wb") as f:
        # image_url is a relative path, we have to join it to the base
        f.write(requests.get(urljoin(base, image_url)).content)

def xkcd_spirder(max_pages):
    base = "http://xkcd.com/"
    for page in range(1, max_pages + 1):
        url = base + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # we only want one image
        img = soup.select_one("#comic img")  # or .find('div', id='comic').img
        download_img(img["src"], base)

xkcd_spirder(5)
Once you run the code you will see we get the first five comics.

cron job fails in gae python

I have a script in Google Appengine that is started every 20 minutes by cron.yaml. This works locally, on my own machine. When I go (manually) to the url which starts the script online, it also works. However, the script always fails to complete online, on Google's instances, when cron.yaml is in charge of starting it.
The log shows no errors, only 2 debug messages:
D 2013-07-23 06:00:08.449
type(soup): <class 'bs4.BeautifulSoup'> END type(soup)
D 2013-07-23 06:00:11.246
type(soup): <class 'bs4.BeautifulSoup'> END type(soup)
Here's my script:
# coding: utf-8
import jinja2, webapp2, urllib2, re
from bs4 import BeautifulSoup as bs
from google.appengine.api import memcache
from google.appengine.ext import db

class Article(db.Model):
    content = db.TextProperty()
    datetime = db.DateTimeProperty(auto_now_add=True)
    companies = db.ListProperty(db.Key)
    url = db.StringProperty()

class Company(db.Model):
    name = db.StringProperty()
    ticker = db.StringProperty()

    @property
    def articles(self):
        return Article.gql("WHERE companies = :1", self.key())

def companies_key(companies_name=None):
    return db.Key.from_path('Companies', companies_name or 'default_companies')

def articles_key(articles_name=None):
    return db.Key.from_path('Articles', articles_name or 'default_articles')

def scrape():
    companies = memcache.get("companies")
    if not companies:
        companies = Company.all()
        memcache.add("companies", companies, 30)
    for company in companies:
        links = links(company.ticker)
        links = set(links)
        for link in links:
            if link is not "None":
                article_object = Article()
                text = fetch(link)
                article_object.content = text
                article_object.url = link
                article_object.companies.append(company.key())  # doesn't work.
                article_object.put()

def fetch(link):
    try:
        html = urllib2.urlopen(url).read()
        soup = bs(html)
    except:
        return "None"
    text = soup.get_text()
    text = text.encode('utf-8')
    text = text.decode('utf-8')
    text = unicode(text)
    if text is not "None":
        return text
    else:
        return "None"

def links(ticker):
    url = "https://www.google.com/finance/company_news?q=NASDAQ:" + ticker + "&start=10&num=10"
    html = urllib2.urlopen(url).read()
    soup = bs(html)
    div_class = re.compile("^g-section.*")
    divs = soup.find_all("div", {"class": div_class})
    links = []
    for div in divs:
        a = unicode(div.find('a', attrs={'href': re.compile("^http://")}))
        link_regex = re.search("(http://.*?)\"", a)
        try:
            link = link_regex.group(1)
            soup = bs(link)
            link = soup.get_text()
        except:
            link = "None"
        links.append(link)
    return links
...and the script's handler in main:
class ScrapeHandler(webapp2.RequestHandler):
    def get(self):
        scrape.scrape()
        self.redirect("/")
My guess is that the problem might be the double for loop in the scrape script, but I don't understand exactly why.
Update:
Articles are indeed being scraped (as many as there should be), and now there are no log errors, or even debug messages at all. Looking at the log, the cron job seemed to execute perfectly. Even so, Appengine's cron job panel says the cron job failed.
I'm pretty sure this error was due to a DeadlineExceededError, which I did not run into locally. My scrape() script now does its thing on fewer companies and articles, and no longer exceeds the deadline.
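A generic way to stay under a per-request deadline is to process only a fixed-size slice of companies per cron run and remember where to resume. This is a plain-Python sketch of the idea (the helper name and batch size are made up, and in practice the cursor would have to be persisted between runs, e.g. in memcache or the datastore):

```python
def next_batch(companies, cursor, batch_size=10):
    # Take at most batch_size companies starting at cursor; return the
    # batch plus the cursor the next cron run should resume from.
    batch = companies[cursor:cursor + batch_size]
    new_cursor = cursor + len(batch)
    if new_cursor >= len(companies):
        new_cursor = 0  # everything processed; start over next run
    return batch, new_cursor

companies = ["company%d" % i for i in range(25)]
batch, cursor = next_batch(companies, 20)
print(len(batch), cursor)  # 5 0
```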

How can I retrieve the page title of a webpage using Python?

How can I retrieve the page title of a webpage (title html tag) using Python?
Here's a simplified version of @Vinko Vrsalovic's answer:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string
NOTE:
soup.title finds the first title element anywhere in the html document
title.string assumes it has only one child node, and that child node is a string
For BeautifulSoup 4.x, use a different import:
from bs4 import BeautifulSoup
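Both notes above can be verified on a tiny document (a sketch using the bs4 import):

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>Example Domain</title></head><body></body></html>"
soup = BeautifulSoup(doc, "html.parser")

print(soup.title)         # the first <title> element in the document
print(soup.title.string)  # its single string child: Example Domain
```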
I'll always use lxml for such tasks. You could use beautifulsoup as well.
import lxml.html
t = lxml.html.parse(url)
print(t.find(".//title").text)
EDIT based on comment:
from urllib2 import urlopen
from lxml.html import parse
url = "https://www.google.com"
page = urlopen(url)
p = parse(page)
print(p.find(".//title").text)
No need to import other libraries; requests has this functionality built in.
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
The mechanize Browser object has a title() method. So the code from this post can be rewritten as:
from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print br.title()
This is probably overkill for such a simple task, but if you plan to do more than that, then it's saner to start from these tools (mechanize, BeautifulSoup) because they are much easier to use than the alternatives (urllib to get content and regexen or some other parser to parse html)
Links:
BeautifulSoup
mechanize
#!/usr/bin/env python
#coding:utf-8
from bs4 import BeautifulSoup
from mechanize import Browser
#This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data()
#This parses the content
soup = BeautifulSoup(data)
title = soup.find('title')
#This outputs the content :)
print title.renderContents()
Using HTMLParser:
from urllib.request import urlopen
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = "http://example.com/"
html_string = str(urlopen(url).read())
parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain
Use soup.select_one to target the title tag:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('url')
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)
Using regular expressions
import re
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'
soup.title.string actually returns a unicode string.
To convert it into a normal string, you need to do:
string = string.encode('ascii', 'ignore')
Here is a fault-tolerant HTMLParser implementation.
You can throw pretty much anything at get_title() without it breaking; if anything unexpected happens, get_title() will return None.
When Parser() downloads the page, it encodes it to ASCII regardless of the charset used in the page, ignoring any errors.
It would be trivial to change to_ascii() to convert the data into UTF-8 or any other encoding: just add an encoding argument and rename the function to something like to_encoding().
By default, HTMLParser() will break on broken html; it will even break on trivial things like mismatched tags. To prevent this behavior, I replaced HTMLParser()'s error method with a function that ignores the errors.
#-*-coding:utf8;-*-
#qpy:3
#qpy:console

'''
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    if is_string(data):
        data = data.encode('ascii', errors='ignore')
    elif is_bytes(data):
        data = data.decode('ascii', errors='ignore')
    else:
        data = str(data).encode('ascii', errors='ignore')
    return data

class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Ignore parse errors instead of raising; must be set before feed().
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return
        self.rec = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False

def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))
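The to_encoding() generalization suggested above might look like this (a sketch; the name and default encoding follow the answer's suggestion, mirroring to_ascii()'s str-to-bytes / bytes-to-str behavior):

```python
def to_encoding(data, encoding='utf-8'):
    # Mirror to_ascii(): str -> bytes, bytes -> str, anything else -> bytes,
    # dropping characters that cannot be represented in the target encoding.
    if isinstance(data, str):
        return data.encode(encoding, errors='ignore')
    if isinstance(data, bytes):
        return data.decode(encoding, errors='ignore')
    return str(data).encode(encoding, errors='ignore')

print(to_encoding('café'))          # b'caf\xc3\xa9'
print(to_encoding(b'caf\xc3\xa9'))  # café
```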
In Python 3, we can call the urlopen method from urllib.request and BeautifulSoup from the bs4 library to fetch the page title.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.google.com")
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)
Here we are using the fast 'lxml' parser.
Using lxml...
Getting it from the page's meta tags, per the Facebook OpenGraph protocol:
import lxml.html
html_doc = lxml.html.parse(some_url)
t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]
or using .xpath with lxml:
t = html_doc.xpath(".//title")[0].text
