I recently started learning Python. In the process of learning about web scraping, I followed an example to scrape from Google News. After running my code, I get the message "Process finished with exit code 0" with no results. If I change the URL to "https://yahoo.com" I get results. Could anyone point out what, if anything, I am doing wrong?
Code:
import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)


news = "https://news.google.com/"
Scraper(news).scrape()
Try this out:
import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            else:
                print("\n" + url)


if __name__ == '__main__':
    news = "https://news.google.com/"
    Scraper(news).scrape()
Initially you were checking each link to see if it contained 'html' in it. I am assuming the example you were following was checking to see if the links ended in '.html'.
Beautiful Soup works really well, but you need to check the source code of the website you're scraping to get an idea of how the markup is laid out. DevTools in Chrome works really well for this; press F12 to get there quickly.
I removed:
if "html" in url:
print("\n" + url)
and replaced it with:
else:
    print("\n" + url)
Recently I started getting acquainted with the web and, in particular, with web scrapers. To understand them better, I decided to implement a small program. I want to make a scraper that collects all the links that users leave in the comments on posts in the /r/Python subreddit.
Here is the code I got:
from bs4 import BeautifulSoup
import requests
from urllib.error import HTTPError


class Post:
    def __init__(self, thread, title, url, inner_links=None):
        if inner_links is None:
            inner_links = []
        self.thread = thread
        self.title = title
        self.url = url
        self.inner_links = inner_links


def get_new_posts_reddit(thread: str):
    reddit_url = 'https://www.reddit.com'
    html = requests.get(reddit_url + '/r/' + thread).content.decode('utf8')
    bs = BeautifulSoup(html, 'html.parser')
    posts = []
    try:
        for post_link in bs.find_all('a', class_='SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE'):
            posts.append(Post(thread, post_link.text, reddit_url + post_link['href']))
    except HTTPError:
        return []
    return posts


def get_inner_links(post: Post):
    html = requests.get(post.url).content.decode('utf8')
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', class_='_3t5uN8xUmg0TOwRCOGQEcU'):
        post.inner_links.append({'text': link.find_parent('div').text, 'link': link['href']})


python_posts = get_new_posts_reddit('Python')
for elem in python_posts:
    get_inner_links(elem)

with open('result.txt', 'w', encoding="utf8") as file:
    for elem in python_posts:
        file.write(str(elem.inner_links) + '\n')
The main problem is that sometimes this program works and sometimes it doesn't. That is, in one run out of five it will collect the first 7 posts from the subreddit and then find inner links in only one of those 7 posts. I think the problem might be that I am sending requests to the site too often, or something like that. Please help me figure this out.
I found out that the problem was that I was getting a page where the content hadn't loaded yet. I rewrote the parser with Selenium and everything worked.
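For reference, a minimal sketch of that Selenium approach, assuming Chrome is available and that Reddit still uses the same generated class names as in the question (they change frequently): the browser renders the JavaScript, we wait for the post links to appear, and only then hand the page source to BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get('https://www.reddit.com/r/Python')
    # wait until at least one post link has been rendered by JavaScript
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'a.SQnoC3ObvgnGjWt90zD9Z'))
    )
    bs = BeautifulSoup(driver.page_source, 'html.parser')
    for post_link in bs.find_all('a', class_='SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE'):
        print(post_link.text, post_link['href'])
finally:
    driver.quit()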
I am working through the book "The Self-Taught Programmer" and am having trouble with some Python code. I get the program to run without any errors. The problem is that there is no output whatsoever.
import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)


news = "https://news.google.com/"
Scraper(news).scrape()
Look at the last "if" statement. If there's no text "html" in the url, nothing gets printed. Try removing that and un-indenting:
class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            print("\n" + url)
I'm working with Beautiful Soup and would like to grab emails to a depth of my choosing in my web scraper. Currently, however, I am unsure why my web scraping tool is not working. Every time I run it, it does not populate the email list.
#!/usr/bin/python

from bs4 import BeautifulSoup, SoupStrainer
import re
import urllib
import threading


def step2():
    file = open('output.html', 'w+')
    file.close()
    # links already added
    visited = set()
    visited_emails = set()
    scrape_page(visited, visited_emails, 'https://www.google.com', 2)
    print('Webpages \n')
    for w in visited:
        print(w)
    print('Emails \n')
    for e in visited_emails:
        print(e)


# Run recursively
def scrape_page(visited, visited_emails, url, depth):
    if depth == 0:
        return
    website = urllib.urlopen(url)
    soup = BeautifulSoup(website, parseOnlyThese=SoupStrainer('a', email=False))
    emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", str(website))
    first = str(website).split('mailto:')
    for i in range(1, len(first)):
        print(first[i].split('>')[0])
    for email in emails:
        if email not in visited_emails:
            print('- got email ' + email)
            visited_emails.add(email)
    for link in soup:
        if link.has_attr('href'):
            if link['href'] not in visited:
                if link['href'].startswith('https://www.google.com'):
                    visited.add(link['href'])
                    scrape_page(visited, visited_emails, link['href'], depth - 1)


def main():
    step2()


main()
For some reason I'm unsure how to fix my code so it adds emails to the list. If you could give me some advice, it would be greatly appreciated. Thanks.
You just need to look for the hrefs that start with mailto:
emails = [a["href"] for a in soup.select('a[href^=mailto:]')]
I presume https://www.google.com is a placeholder for the actual site you are scraping as there are no mailto's to scrape on the google page. If there are mailto's in the source you are scraping then this will find them.
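If you want the bare addresses rather than the full mailto: hrefs, you can strip the prefix as you collect them. A small sketch (my addition, not part of the original answer), reusing the soup object from above:
emails = [a["href"].split("mailto:", 1)[1]
          for a in soup.select('a[href^="mailto:"]')]
print(emails)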
I have been following TheNewBoston's Python 3.4 tutorials that use PyCharm, and am currently on the tutorial on how to create a web crawler. I simply want to download all of XKCD's comics. Using the archive, that seemed very easy. Here is my code, followed by TheNewBoston's.
Whenever I run the code, nothing happens. It runs through and says "Process finished with exit code 0". Where did I screw up?
TheNewBoston's tutorial is a little dated, and the website used for the crawl has changed domains. I will comment on the part of the video that seems to matter.
My code:
import requests
from urllib import request
from bs4 import BeautifulSoup


def download_img(image_url, page):
    name = str(page) + ".jpg"
    request.urlretrieve(image_url, name)


def xkcd_spirder(max_pages):
    page = 1
    while page <= max_pages:
        url = r'http://xkcd.com/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('div', {'img': 'src'}):
            href = link.get('href')
            print(href)
            download_img(href, page)
        page += 1


xkcd_spirder(5)
The comic is in the div with the id comic; you just need to pull the src from the img inside that div, then join it to the base URL, and finally request the content and write it to a file. I use the basename of the image URL as the filename to save it under.
I also replaced your while loop with a range loop and did all the HTTP requests using requests:
import requests
from bs4 import BeautifulSoup
from os import path
from urllib.parse import urljoin  # python2 -> from urlparse import urljoin


def download_img(image_url, base):
    # path.basename(image_url)
    # http://imgs.xkcd.com/comics/tree_cropped_(1).jpg -> tree_cropped_(1).jpg
    with open(path.basename(image_url), "wb") as f:
        # image_url is a relative path, we have to join it to the base
        f.write(requests.get(urljoin(base, image_url)).content)


def xkcd_spirder(max_pages):
    base = "http://xkcd.com/"
    for page in range(1, max_pages + 1):
        url = base + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # we only want one image
        img = soup.select_one("#comic img")  # or .find('div', id='comic').img
        download_img(img["src"], base)


xkcd_spirder(5)
Once you run the code you will see we get the first five comics.
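One small caveat (my own note, not part of the original answer): select_one returns None when a page has no matching #comic img, which would make img["src"] raise a TypeError. A tiny standalone check, using a hypothetical example page:
import requests
from bs4 import BeautifulSoup

url = "http://xkcd.com/404"  # hypothetical example of a page without a regular comic image
soup = BeautifulSoup(requests.get(url).text, "html.parser")
img = soup.select_one("#comic img")
if img is None:
    print("no comic image found on", url)
else:
    print("comic image src:", img["src"])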
I started a little project. I am trying to scrape the URL http://pr0gramm.com/ and save the tags under a picture in a variable, but I am having problems doing so.
I am searching for this in the code
<a class="tag-link" href="/top/Flaschenkind">Flaschenkind</a>
I actually just need the part "Flaschenkind" to be saved, along with the following tags in that line.
This is my code so far
import requests
from bs4 import BeautifulSoup
url = "http://pr0gramm.com/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
links = soup.find_all("div", {"class" : "item-tags"})
print(links)
I sadly just get this output
[]
I already tried changing the URL to http://pr0gramm.com/top/, but I get the same output. I wonder if this happens because the site is built with JavaScript, so the data can't be scraped correctly?
The problem is that this is a dynamic site: all of the data you see is loaded via additional XHR calls to the website's JSON API. You need to simulate that in your code.
Working example using requests:
from urllib.parse import urljoin

import requests

base_image_url = "http://img.pr0gramm.com"

with requests.Session() as session:
    response = session.get("http://pr0gramm.com/api/items/get", params={"flags": 1, "promoted": "1"})
    posts = response.json()["items"]
    for post in posts:
        image_url = urljoin(base_image_url, post["image"])

        # get tags
        response = session.get("http://pr0gramm.com/api/items/info", params={"itemId": post["id"]})
        post_data = response.json()
        tags = [tag["tag"] for tag in post_data["tags"]]

        print(image_url, tags)
This would print the post image url as well as a list of post tags:
http://img.pr0gramm.com/2016/03/07/f693234d558334d7.jpg ['Datsun 1600 Wagon', 'Garage 88', 'Kombi', 'nur Oma liegt tiefer', 'rolladen', 'slow']
http://img.pr0gramm.com/2016/03/07/185544cda956679e.webm ['Danke Merkel', 'deeskalierte zeitnah', 'demokratie im endstadium', 'Fachkraft', 'Far Cry Primal', 'Invite is raus', 'typ ist nackt', 'VVS', 'webm', 'zeigt seine stange']
http://img.pr0gramm.com/2016/03/07/4a6719b33219fd87.jpg ['bmw', 'der Gerät', 'Drehmoment', 'für mehr Motorräder auf pr0', 'Motorrad']
...
First off, your URL is the JavaScript-enabled version of the site. They offer a static URL at www.pr0gramm.com/static/. There you'll find the content formatted more like your example suggests you expect.
Using this static version of the URL, I retrieved <a> tags with the code below, similar to yours. I removed the class filter. Python 2.7:
import bs4
import urllib2


def main():
    url = "http://pr0gramm.com/static/"
    try:
        fin = urllib2.urlopen(url)
    except:
        print "Url retrieval failed url:", url
        return None
    html = fin.read()
    bs = bs4.BeautifulSoup(html, "html5lib")
    links = bs.find_all("a")
    print links
    return None


if __name__ == "__main__":
    main()
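If the static page keeps the markup from your snippet (the <a> elements with class tag-link), you could narrow the search and pull just the tag text. A small sketch of that filter, under the assumption that the static page uses the same class names (Python 2.7, continuing inside main() after bs has been created):
# inside main(), after bs has been created
tag_links = bs.find_all("a", class_="tag-link")
tags = [link.get_text() for link in tag_links]
print tags  # e.g. [u'Flaschenkind', ...]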