This code block always ends up in the "except" branch. No specific error is shown in my terminal. What am I doing wrong?
Any help is appreciated!
from bs4 import BeautifulSoup
import csv
import urllib2
# get page source and create a BeautifulSoup object based on it
try:
    print("Fetching page.")
    page = urllib2.open("http://siph0n.net")
    soup = BeautifulSoup(page, 'lxml')
    # specify tags the parameters are stored in
    metaData = soup.find_all("a")
except:
    print("Error during fetch.")
    exit()
"No specific error is shown in my terminal"
That's because your bare except block is swallowing it. Either remove the try/except or print the exception in the except block:
try:
    ...
except Exception as ex:
    print(ex)
Note that catching the broad Exception type is generally a bad idea. Your except blocks should catch as specific an exception type as possible.
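For example, a hedged sketch of a narrower handler for the urllib2-based fetch above (URLError is what urllib2 raises for failed fetches, and HTTPError is its subclass):

import urllib2

try:
    print("Fetching page.")
    page = urllib2.urlopen("http://siph0n.net")
except urllib2.URLError as ex:
    # only network/HTTP failures are caught here; any other bug still surfaces with a full traceback
    print("Fetch failed: {}".format(ex))
    exit()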
You can use requests for getting the data.
from bs4 import BeautifulSoup
import requests
import csv
# get page source and create a BeautifulSoup object based on it
try:
    print("Fetching page.")
    page = requests.get("http://siph0n.net")
    soup = BeautifulSoup(page.text, 'lxml')
    # specify tags the parameters are stored in
    metaData = soup.find_all("a")
except Exception as ex:
    print(ex)
I am trying to web scrape the website "http://norumors.net/?post_type=rumors?post_type=rumors" to get only the news headlines and put them in a CSV file using BeautifulSoup and Python. This is the code I am using after looking into the HTML source "view-source:http://norumors.net/?post_type=rumors?post_type=rumors":
import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd
pagesToGet= 1
upperframe=[]
for page in range(1,pagesToGet+1):
    print('processing page :', page)
    url = 'http://norumors.net/?post_type=rumors/?page='+str(page)
    print(url)
    #an exception might be thrown, so the code should be in a try-except block
    try:
        #use the browser to get the url. This is suspicious command that might blow up.
        page = requests.get(url)  # this might throw an exception if something goes wrong.
    except Exception as e:  # this describes what to do if an exception is thrown
        error_type, error_obj, error_info = sys.exc_info()  # get the exception information
        print('ERROR FOR LINK:', url)  #print the link that cause the problem
        print(error_type, 'Line:', error_info.tb_lineno)  #print error info and line that threw the exception
        continue  #ignore this page. Abandon this and go back.
    time.sleep(2)
    soup = BeautifulSoup(page.text, 'html.parser')
    frame = []
    links = soup.find_all('li', attrs={'class':'o-listicle__item'})
    print(len(links))
    filename = "NEWS.csv"
    f = open(filename, "w", encoding='utf-8')
    headers = "Statement,Link\n"
    f.write(headers)
    for j in links:
        Statement = j.find("div", attrs={'class':'row d-flex'}).text.strip()
        # Link = "http://norumors.net/"
        Link += j.find("div", attrs={'class':'col-lg-4 col-md-4 col-sm-6 col-xs-6'}).find('a')['href'].strip()
        frame.append((Statement, Link))
        f.write(Statement.replace(",","^")+","+Link+","+Date.replace(",","^")+","+Source.replace(",","^")+","+Label.replace(",","^")+"\n")
    upperframe.extend(frame)
    f.close()
data = pd.DataFrame(upperframe, columns=['Statement','Link'])
data.head()
But after I run the code, the pandas DataFrame and the CSV file are empty. Any suggestion why that is, given that I want to get the text between the tags?
If I understand correctly, you want to get the text of the news headlines and the href links to those news items, and write them into a CSV file. The problem with your code is that for j in links: never executes, because soup.find_all('li', attrs={'class':'o-listicle__item'}) returns an empty list. You should be careful with the names and classes of the tags you are searching for. The code below gets the news texts and their links, and writes them to the CSV file using pd.DataFrame.
import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd
pagesToGet = 1
for page in range(1,pagesToGet+1):
    print('processing page :', page)
    url = 'http://norumors.net/?post_type=rumors/?page=' + str(page)
    print(url)
    #an exception might be thrown, so the code should be in a try-except block
    try:
        #use the browser to get the url. This is suspicious command that might blow up.
        page = requests.get(url)  # this might throw an exception if something goes wrong.
    except Exception as e:  # this describes what to do if an exception is thrown
        error_type, error_obj, error_info = sys.exc_info()  # get the exception information
        print('ERROR FOR LINK:', url)  #print the link that cause the problem
        print(error_type, 'Line:', error_info.tb_lineno)  #print error info and line that threw the exception
        continue  #ignore this page. Abandon this and go back.
    soup = BeautifulSoup(page.text, 'html.parser')
    texts = []
    links = []
    filename = "NEWS.csv"
    f = open(filename, "w", encoding='utf-8')
    Statement = soup.find("div", attrs={'class':'row d-flex'})
    divs = Statement.find_all("div", attrs={'class':'col-lg-4 col-md-4 col-sm-6 col-xs-6'})
    for div in divs:
        txt = div.find("img", attrs={'class':'rumor__thumb'})
        texts.append(txt['alt'])
        lnk = div.find("a", attrs={'class':'rumor--archive'})
        links.append(lnk['href'])
    data = pd.DataFrame(list(zip(texts, links)), columns=['Statement', 'Link'])
    data.to_csv(f, encoding='utf-8', index=False)
    f.close()
I wrote the following code. It's probably not the prettiest, but I tried. When I run it, it creates the links.txt file, but the script itself stops immediately without showing any error on the cmd. I researched BS4 and I really think this should work.
This was the initial script I was trying to get working so I could eventually change it to scrape only the links within the 'card' class, but since it cannot even scrape all the links, I want to understand what I did wrong.
import requests
import time
from bs4 import BeautifulSoup
import sys
sys.stdout = open("links.txt", "a")
for x in range(0, 10):
    try:
        URL = f'https://wesbite.com/downloads/{x}/'
        page = requests.get(URL)
        time.sleep(5)
        soup = BeautifulSoup(html, 'html.parser')
        links_with_text = []
        for a in soup.find_all('a', href=True):
            if a.text:
                links_with_text.append(a['href'])
        print(links_with_text)
    except:
        continue
Example of the card class I was eventually trying to scrape:
<div class="card-content">
    <div class="center">
        <a target="_blank" href="https://website.com/username/">username</a>
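For that eventual card-only filtering, a hedged sketch (assuming the card-content class from the snippet above and the placeholder URL from the script) could look like this:

from bs4 import BeautifulSoup
import requests

page = requests.get('https://wesbite.com/downloads/1/')  # placeholder URL from the question
soup = BeautifulSoup(page.text, 'html.parser')

card_links = []
# only collect hrefs of anchors that sit inside a div with class "card-content"
for card in soup.find_all('div', class_='card-content'):
    for a in card.find_all('a', href=True):
        if a.text:
            card_links.append(a['href'])
print(card_links)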
I took your suggestions, removed the except, and realized that my indentation was inconsistent. After fixing that and switching to page.text it seems to work. The code is below:
import requests
import time
from bs4 import BeautifulSoup
import sys
sys.stdout = open("links.txt", "a")
for x in range(0, 10):
    try:
        URL = f'https://wesbite.com/downloads/{x}/'
        page = requests.get(URL)
        time.sleep(5)
        soup = BeautifulSoup(page.text, 'html.parser')
        links_with_text = []
        for a in soup.find_all('a', href=True):
            if a.text:
                links_with_text.append(a['href'])
        print(links_with_text)
    except Exception as e:
        print('something went wrong')
The html variable in BeautifulSoup(html, 'html.parser') is not defined in the code you've posted; my guess is that this raises an exception, which is suppressed by your except block. Remove the try/except code and run it: exceptions are helpful information, and suppressing them this way prevents you from finding the problem.
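If you do keep a try/except around the loop, a hedged alternative to swallowing the error is to print the full traceback so the failing line stays visible, for example:

import traceback
import requests
from bs4 import BeautifulSoup

for x in range(0, 10):
    try:
        page = requests.get(f'https://wesbite.com/downloads/{x}/')  # placeholder URL from the question
        soup = BeautifulSoup(page.text, 'html.parser')
        print([a['href'] for a in soup.find_all('a', href=True) if a.text])
    except Exception:
        # dump the full stack trace instead of hiding the error, then move on to the next page
        traceback.print_exc()
        continue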
I'm building a script to scan a website, capture its URLs, and test whether each one works. The problem is that the script only picks up the URLs on the website's home page and leaves the others aside. How do I capture all pages linked from the site?
My code is below:
import urllib
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
page = urllib.request.urlopen("http://www.google.com/")
soup = BeautifulSoup(page.read(), features='lxml')
links = soup.findAll("a", attrs={'href': re.compile('^(http://)')})
for link in links:
    result = (link["href"])
    req = Request(result)
    try:
        response = urlopen(req)
        pass
    except HTTPError as e:
        if e.code != 200:
            # Stop, Error!
            with open("Document_ERROR.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.write('{} \n'.format(e.reason))
                archive.write('{}'.format(e.code))
                archive.close()
        else:
            # Enjoy!
            with open("Document_OK.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.close()
The main reason this doesn't work is that you put both the OK and ERROR-writes inside the except-block.
This means that only urls that actually raise an exception will be stored.
In general, my advice would be to sprinkle some print statements into the different stages of the script - or use an IDE that lets you step through the code at runtime, line by line. That makes issues like this much easier to debug.
PyCharm is free and allows you to do so. Give that a try.
So - I haven't worked with urllib but use requests a lot (python -m pip install requests). A quick refactor using that would look something like below:
import requests
from bs4 import BeautifulSoup
import re
url = "http://www.google.com"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a", attrs={'href': re.compile('^(http://)')})
for link in links:
    href = link["href"]
    print("Testing for URL {}".format(href))
    try:
        # since you only want to scan for status code, no need to pull the entire html of the site - use HEAD instead of GET
        r = requests.head(href)
        status = r.status_code
        # 404 etc will not yield an error
        error = None
    except Exception as e:
        # these exceptions will not have a status_code
        status = None
        error = e
    # store the finding in your files
    if status is None or status != 200:
        print("URL is broken. Writing to ERROR_Doc")
        # do your storing here of href, status and error
    else:
        print("URL is live. Writing to OK_Doc")
        # do your storing here
Hope this makes sense.
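As a hedged sketch of the storing step those comments leave open (re-using the Document_OK.txt and Document_ERROR.txt names from the question), you could factor it into a small helper and call it in place of the comments, e.g. store_result(href, status, error):

def store_result(href, status, error):
    # append live URLs to the OK document and everything else (bad status or exception) to the ERROR document
    if status == 200:
        with open("Document_OK.txt", "a") as archive:
            archive.write("{}\n".format(href))
    else:
        with open("Document_ERROR.txt", "a") as archive:
            archive.write("{} {} {}\n".format(href, status, error))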
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html,"html.parser")
for child in bsObj.find("table",{"id":"giftlist"}).children:
    print(child)
Could anyone tell me what's wrong with my code? :((( What should I do next?
You should put the code in a try-except block:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html,"html.parser")
try:
    for child in bsObj.find("table",{"id":"giftlist"}).children:
        print(child)
except AttributeError as e:
    print(e)
except:
    print("An error has occurred")
In your case, I visited the website and the id is not "giftlist", it's "giftList"; you made a typo, and that's why the find function returns a None object.
I'm not sure whether you have already solved this problem you posted 3 years ago, but I'm assuming you made a small mistake.
The id of the tag is not giftlist... It is giftList
Is your code from the book "Web Scraping with Python" from the O'Reilly series? I found the exact same code in that book, including this webpage, pythonscraping.com/pages/page3.html, which the author posted to give readers a place to practice. Btw, in the book it is also giftList, so I think you might have copied the code wrong.
Try this one now:
for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)
One option is to put the offending loop construct into a try, then handle the exception that is raised when find() returns None:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html,"html.parser")
try:
    for child in bsObj.find("table",{"id":"giftlist"}).children:
        print(child)
except AttributeError:
    # do what you want to do when bsObj.find() returns None
    pass
Or you could check the result for None before entering the loop:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html,"html.parser")
result = bsObj.find("table",{"id":"giftlist"})
if result:
    for child in result.children:
        print(child)
else:
    # do what you want to do when bsObj.find() returns None
    pass
It's a typo issue; I also ran into the same problem. On the web page, the id should be "giftList", not "giftlist".
It should work after you fix the id name. Try it.
I have been playing with the cfscrape module, which allows you to bypass the Cloudflare captcha protection on sites... I have accessed the page's contents but can't seem to get my code to work; instead the whole HTML is printed. I'm only trying to find keywords within the <span class="availability"> tag.
import urllib2
import cfscrape
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
import urllib2, sys
scraper = cfscrape.CloudflareScraper()
url = "http://www.sneakersnstuff.com/en/product/25698/adidas-stan-smith-gtx"
req = scraper.get(url).content
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print("hi")
    content = e.fp.read()
soup = BeautifulSoup(content, "lxml")
result = soup.find_all("span", {"class":"availability"})
I have omitted some irrelevant parts of the code.
try:
    page = urllib2.urlopen(req)
    content = page.read()
except urllib2.HTTPError, e:
    print("hi")
You should read the object returned by urlopen, which contains the HTML code, and you should assign the content variable before the except.
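For the original goal of reading the availability span, a hedged sketch that parses the cfscrape response directly (on the assumption that the Cloudflare-aware request already returns the page HTML, so no second urlopen call is needed) might look like:

import cfscrape
from bs4 import BeautifulSoup

scraper = cfscrape.CloudflareScraper()
url = "http://www.sneakersnstuff.com/en/product/25698/adidas-stan-smith-gtx"
content = scraper.get(url).content  # cfscrape behaves like a requests session, so .content is the raw HTML

soup = BeautifulSoup(content, "lxml")
# print only the text inside <span class="availability"> rather than the whole page
for span in soup.find_all("span", {"class": "availability"}):
    print(span.get_text(strip=True))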