Cloudflare scraping, finding elements - python

I have been playing with the cfscrape module, which lets you bypass the Cloudflare captcha protection on sites... I have accessed the page's contents, but I can't get my code to work: instead of the keywords, the whole HTML is printed. I'm only trying to find keywords within the <span class="availability"> tag.
import urllib2
import cfscrape
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
import urllib2, sys

scraper = cfscrape.CloudflareScraper()
url = "http://www.sneakersnstuff.com/en/product/25698/adidas-stan-smith-gtx"
req = scraper.get(url).content
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print("hi")
    content = e.fp.read()
soup = BeautifulSoup(content, "lxml")
result = soup.find_all("span", {"class":"availability"})
I have omitted some irrelevant parts of the code.

try:
    page = urllib2.urlopen(req)
    content = page.read()
except urllib2.HTTPError, e:
    print("hi")
You should read the object returned by urlopen, which contains the HTML code, and you should assign content inside the try block, before the except.
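That said, scraper.get(url).content already holds the page HTML, so calling urlopen on it again is unnecessary. A minimal sketch that hands it straight to BeautifulSoup (the print loop at the end is an assumption about what you want to do with the matches):

import cfscrape
from bs4 import BeautifulSoup

scraper = cfscrape.CloudflareScraper()
url = "http://www.sneakersnstuff.com/en/product/25698/adidas-stan-smith-gtx"
html = scraper.get(url).content  # already the raw HTML, no urlopen needed
soup = BeautifulSoup(html, "lxml")
for span in soup.find_all("span", {"class": "availability"}):
    print(span.get_text(strip=True))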

Related

My script does not find all the links - what should I do?

I'm building a script to scan a website, capture its URLs, and test whether each one works. The problem is that the script only collects the URLs on the site's home page and leaves the others aside. How do I capture the URLs of all pages linked from the site?
My code is below:
import urllib
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

page = urllib.request.urlopen("http://www.google.com/")
soup = BeautifulSoup(page.read(), features='lxml')
links = soup.findAll("a", attrs={'href': re.compile('^(http://)')})
for link in links:
    result = (link["href"])
    req = Request(result)
    try:
        response = urlopen(req)
        pass
    except HTTPError as e:
        if e.code != 200:
            # Stop, Error!
            with open("Document_ERROR.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.write('{} \n'.format(e.reason))
                archive.write('{}'.format(e.code))
                archive.close()
        else:
            # Enjoy!
            with open("Document_OK.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.close()
The main reason this doesn't work is that you put both the OK and ERROR-writes inside the except-block.
This means that only urls that actually raise an exception will be stored.
In general, my advice would be to spray some print statements into the different stages of the script - or use an IDE that lets you step through the code at runtime, line by line. That makes stuff like this much easier to debug.
PyCharm is free and allows you to do so. Give that a try.
So - I haven't worked with urllib, but I use requests a lot (python -m pip install requests). A quick refactor using it would look something like this:
import requests
from bs4 import BeautifulSoup
import re

url = "http://www.google.com"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a", attrs={'href': re.compile('^(http://)')})
for link in links:
    href = link["href"]
    print("Testing for URL {}".format(href))
    try:
        # since you only want the status code, there is no need to pull
        # the entire HTML of the site - use HEAD instead of GET
        r = requests.head(href)
        status = r.status_code
        # 404 etc. will not raise an exception here
        error = None
    except Exception as e:
        # these exceptions will not have a status_code
        status = None
        error = e
    # store the finding in your files
    if status is None or status != 200:
        print("URL is broken. Writing to ERROR_Doc")
        # do your storing here of href, status and error
    else:
        print("URL is live. Writing to OK_Doc")
        # do your storing here
Hope this makes sense.
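As for reaching pages beyond the home page, the usual approach is a breadth-first crawl that follows same-domain links until a limit is hit. A sketch along the same lines (the page limit, timeout, and same-domain check are my assumptions, not part of the original script):

import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_pages=50):
    # breadth-first crawl restricted to the starting domain
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException as e:
            print("Failed: {} ({})".format(url, e))
            continue
        print("Visited: {} ({})".format(url, r.status_code))
        soup = BeautifulSoup(r.text, "lxml")
        for a in soup.find_all("a", href=True):
            href = urljoin(url, a["href"])  # resolve relative links
            if urlparse(href).netloc == domain and href not in seen:
                seen.add(href)
                queue.append(href)

crawl("http://www.google.com")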

How can I print any website content? (Using something like my code)

I want to open a website, get its content, store it in a variable, and print it:
from urllib.request import urlopen
url = any_website
content = urlopen(url).read().decode('utf-8')
print(content)
The expected result is that I get what is written on the page.
In Python, there are several libraries you may be interested in. An example of printing the contents to get you started:
from bs4 import BeautifulSoup as soup
import requests

url = "https://en.wikipedia.org/wiki/List_of_multinational_corporations"
page = requests.get(url)
page_html = page.content
page_soup = soup(page_html, "html.parser")
print(page_soup)
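If you only want the readable text rather than the full markup, BeautifulSoup's get_text() strips the tags; for example:

print(page_soup.get_text(separator="\n", strip=True))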
With urlopen, you may try as below (in Python 3, urlopen lives in urllib.request):

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://en.wikipedia.org/wiki/List_of_multinational_corporations"
r = urlopen(url).read()
soup = BeautifulSoup(r, "html.parser")
print(type(soup))
print(soup.prettify()[0:1000])

Reading in Content From URLS in a File

I'm trying to get the sub-URLs under a main URL. However, as I print to check the content, I notice that I am only getting the HTML, not the URLs within it.
import urllib.request

file = 'http://example.com'
with urllib.request.urlopen(file) as url:
    collection = url.read().decode('UTF-8')
I think this is what you are looking for. You can use Python's Beautiful Soup library, and this code should work with Python 3:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_all_urls(url):
    page = urlopen(url)  # avoid naming this "open", which shadows the builtin
    url_html = BeautifulSoup(page, 'html.parser')
    for link in url_html.find_all('a'):
        links = str(link.get('href'))
        if links.startswith('http'):
            print(links)
        else:
            print(url + str(links))

get_all_urls('http://url.com')
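One caveat: url + links only produces a valid address for the simplest relative paths. urllib.parse.urljoin resolves relative links properly against the page URL; a sketch of the same function using it (example.com is a placeholder):

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_all_urls(url):
    url_html = BeautifulSoup(urlopen(url), 'html.parser')
    for link in url_html.find_all('a'):
        href = link.get('href')
        if href:
            # urljoin handles absolute links, leading slashes and ../ paths
            print(urljoin(url, href))

get_all_urls('http://example.com')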

Unnamed error using Urllib2 and Beautiful soup

This code block always ends up in the "except" branch. No specific error is shown in my terminal. What am I doing wrong?
Any help is appreciated!
from bs4 import BeautifulSoup
import csv
import urllib2

# get page source and create a BeautifulSoup object based on it
try:
    print("Fetching page.")
    page = urllib2.open("http://siph0n.net")
    soup = BeautifulSoup(page, 'lxml')
    # specify tags the parameters are stored in
    metaData = soup.find_all("a")
except:
    print("Error during fetch.")
    exit()
"No specific error is shown in my terminal"
That's because your except block is shadowing it. Either remove the try/except or print the exception in the except block:
try:
.
.
.
except Exception as ex:
print(ex)
Note that catching the general type Exception is generally a bad idea. Your except blocks should always catch the specific exception type as possible.
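In this script, printing the exception would reveal the actual culprit right away: urllib2 has no open function (the call should be urllib2.urlopen), so the try block raises an AttributeError on every run. With that fixed, the specific exception worth catching is urllib2.URLError; a sketch:

import urllib2

try:
    print("Fetching page.")
    page = urllib2.urlopen("http://siph0n.net")
except urllib2.URLError as ex:
    print("Error during fetch: {}".format(ex.reason))
    exit()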
You can use requests for getting the data.
from bs4 import BeautifulSoup
import requests
import csv

# get page source and create a BeautifulSoup object based on it
try:
    print("Fetching page.")
    page = requests.get("http://siph0n.net")
    soup = BeautifulSoup(page.content, 'lxml')
    # specify tags the parameters are stored in
    metaData = soup.find_all("a")
except Exception as ex:
    print(ex)

Crawl a news website and getting the news content

I'm trying to download the text from a news website. The HTML is:
<div class="pane-content">
  <div class="field field-type-text field-field-noticia-bajada">
    <div class="field-items">
      <div class="field-item odd">
        <p>"My Text" target="_blank">www.injuv.cl</a></strong></p> </div>
The output should be: My Text
I'm using the following python code:
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class':'pane-content'})
print(p)
But the output of the code is None. Do you know what is wrong with my code?
The problem is that you are not parsing the HTML, you are parsing the URL string:
html = "My URL"
parsed_html = BeautifulSoup(html)
Instead, you need to get/retrieve/download the source first, example in Python 2:
from urllib2 import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
In Python 3, it would be:
from urllib.request import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
Or, you can use the third-party "for humans"-style requests library:
import requests
html = requests.get("My URL").content
parsed_html = BeautifulSoup(html)
Also note that you should not be using BeautifulSoup version 3 at all - it is not maintained anymore. Replace:
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
with just:
from bs4 import BeautifulSoup
BeautifulSoup accepts a string of HTML. You need to retrieve the HTML from the page using the URL.
Check out urllib for making HTTP requests. (Or requests for an even simpler way.) Retrieve the HTML and pass that to BeautifulSoup like so:
import urllib
from bs4 import BeautifulSoup
# Get the HTML (Python 2 urllib shown here)
conn = urllib.urlopen("http://www.example.com")
html = conn.read()
# Give BeautifulSoup the HTML:
soup = BeautifulSoup(html)
From here, just parse as you attempted previously.
p = soup.find("div", attrs={'class':'pane-content'})
print(p)
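And to print just the text (the expected My Text output) instead of the whole tag, call get_text() on the result, guarding against the div not being found:

if p is not None:
    print(p.get_text(strip=True))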
