# Downloading all XKCD comics
import os

import bs4
import requests

url = "http://xkcd.com"
os.makedirs("xkcd", exist_ok=True)
while not url.endswith("#"):
    print("Downloading page %s..." % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)
    comicElem = soup.select("#comic img")
    if comicElem == []:
        print("Could not find comic image.")
    else:
        comicUrl = comicElem[0].get("src")
        # Download the image.
        print("Downloading image %s..." % comicUrl)
        res = requests.get(comicUrl)
        res.raise_for_status()
        imageFile = open(os.path.join("xkcd", os.path.basename(comicUrl)), "wb")
        for chunk in res.iter_content(None):
            imageFile.write(chunk)
        imageFile.close()
    prevLink = soup.select("a[rel=prev]")[0]
    url = "http://xkcd.com" + prevLink.get("href")
print("Done.")
The full code is above; the full output is below.
Downloading page http://xkcd.com...
C:/Users/emosc/PycharmProjects/RequestsLearning/main.py:38: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 38 of the file C:/Users/emosc/PycharmProjects/RequestsLearning/main.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
soup = bs4.BeautifulSoup(res.text)
Downloading image //imgs.xkcd.com/comics/rapid_test_results.png...
Traceback (most recent call last):
  File "C:/Users/emosc/PycharmProjects/RequestsLearning/main.py", line 46, in <module>
    res = requests.get(comicUrl)
  File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\sessions.py", line 528, in request
    prep = self.prepare_request(req)
  File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\sessions.py", line 456, in prepare_request
    p.prepare(
  File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\models.py", line 316, in prepare
    self.prepare_url(url, params)
  File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\models.py", line 390, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '//imgs.xkcd.com/comics/rapid_test_results.png': No schema supplied. Perhaps you meant http:////imgs.xkcd.com/comics/rapid_test_results.png?
I have never seen an image link like http:////imgs.xkcd.com/comics/rapid_test_results.png (only ones with two slashes, not four), and I don't know how to solve this error. I followed Automate the Boring Stuff with Python; this is the same code as in the book, but it throws this error when I try to scrape the site. Thanks for any help.
The http:// and https:// prefixes are both examples of schemas (URL schemes). Print each URL before you use it and check whether it starts with one of them; failing to supply http:// or https:// at the start of the URL produces exactly the error shown, so make sure the scheme is present.
You can add this to check what the URL starts with (note the variable is comicUrl in your code):

if comicUrl.startswith("//"):
    continue
Alternatively, wrap the line below in a try/except block so the loop keeps running:

res = requests.get(comicUrl)
Add the missing "http:" scheme, like this:

comicUrl = 'http:' + comicElem[0].get('src')
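More generally, a scheme-relative src can be resolved against the page URL with urllib.parse.urljoin. A minimal sketch (the example src value is taken from the traceback above):

```python
from urllib.parse import urljoin

# urljoin resolves scheme-relative and relative src values against the
# page URL, so "//imgs.xkcd.com/..." inherits the page's scheme.
page_url = "http://xkcd.com"
comic_src = "//imgs.xkcd.com/comics/rapid_test_results.png"
comic_url = urljoin(page_url, comic_src)
print(comic_url)  # http://imgs.xkcd.com/comics/rapid_test_results.png
```

This also keeps working if xkcd ever switches the src attribute to an absolute or path-relative URL.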
Related
I am making a webscraping program that goes through each URL in a list of URLs, opens the page with that URL, and extracts some information from the soup. Most of the time it works fine, but occasionally the program will stop advancing through the list but not terminate the program, show warnings/exceptions, or otherwise show signs of error. My code, stripped down to the relevant parts, looks like this:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs
# some code...
for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    page = urlopen(req)
    soup = bs(page, features="html.parser")
    # do some stuff with the soup...
When the program stalls, if I terminate it manually (using PyCharm), I get this traceback:
  File "/Path/to/my/file.py", line 48, in <module>
    soup = bs(page, features="html.parser")
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/__init__.py", line 266, in __init__
    markup = markup.read()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 454, in read
    return self._readall_chunked()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 564, in _readall_chunked
    value.append(self._safe_read(chunk_left))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 610, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1052, in recv_into
    return self.read(nbytes, buffer)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
Here's what I have tried and learned:
Added a check to make sure that the page status is always 200 when making the soup. The fail condition never occurs.
Added a print statement after the soup is created. This print statement does not trigger after stalling.
The URLs are always valid. This is confirmed by the fact that the program does not stall on the same URL every time, and further confirmed by a similar program of mine with nearly identical code that shows the same behavior on a different set of URLs.
I have tried running through this step-by-step with a debugger. The problem has not occurred in the 30 or so iterations I've checked manually, which may just be coincidence.
The page returns the correct headers when bs4 stalls. The problem seems to be isolated to the creation of the soup.
What could cause this behavior?
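The traceback shows the process blocked inside a socket read while BeautifulSoup drains the response body, so one common mitigation is to give urlopen a read timeout, which turns a silent stall into a catchable exception. A sketch only (the helper name and the 10-second value are my assumptions, and this treats the symptom rather than a confirmed root cause):

```python
import socket
from urllib.request import Request, urlopen

def fetch(url, timeout=10):
    # The timeout applies to the connect and to each socket read, so a
    # server that stops sending mid-body raises socket.timeout instead
    # of hanging forever inside markup.read().
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        return urlopen(req, timeout=timeout)
    except socket.timeout:
        return None  # caller can log the URL and move on
```

Note that connect failures may surface as urllib.error.URLError rather than socket.timeout, so you may want to catch both.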
I am working with a local html file in python, and I am trying to use lxml to parse the file. For some reason I can't get the file to load properly, and I'm not sure if this has to do with not having an http server set up on my local machine, etree usage, or something else.
My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/
This could be a related problem: Requests : No connection adapters were found for, error in Python3
Here is my code:
from lxml import html
import requests
page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)
test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')
print test
The traceback that I'm getting reads:
C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
  File "C:/Users/.../extract_html/extract.py", line 4, in <module>
    page = requests.get('C:\Users\...\sites\site_1.html')
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'
Process finished with exit code 1
You can see that it has something to do with a "connection adapter" but I'm not sure what that means.
If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server.
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
There is a better way to do it: use the parse function instead of fromstring (note the raw string, so the backslashes aren't treated as escapes):

tree = html.parse(r"C:\Users\...site_1.html")
print(html.tostring(tree))
You can also try Beautiful Soup:

from bs4 import BeautifulSoup

with open("filepath", encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")
So I'm using BeautifulSoup to build a webscraper to grab every ad on a Craigslist page. Here's what I've got so far:
import requests
from bs4 import BeautifulSoup, SoupStrainer
import bs4
page = "http://miami.craigslist.org/search/roo?query=brickell"
search_html = requests.get(page).text
roomSoup = BeautifulSoup(search_html, "html.parser")
ad_list = roomSoup.find_all("a", {"class":"hdrlnk"})
#print ad_list
ad_ls = [item["href"] for item in ad_list]
#print ad_ls
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
#print ad_urls
url_str = [str(unicode) for unicode in ad_urls]
# What's in url_str?
for url in url_str:
    print url
When I run this, I get:
miami.craigslist.org/mdc/roo/4870912192.html
miami.craigslist.org/mdc/roo/4858122981.html
miami.craigslist.org/mdc/roo/4870665175.html
miami.craigslist.org/mdc/roo/4857247075.html
miami.craigslist.org/mdc/roo/4870540048.html ...
This is exactly what I want: a list containing the URLs to each ad on the page.
My next step was to extract something from each of those pages; hence building another BeautifulSoup object. But I get stopped short:
for url in url_str:
    ad_html = requests.get(str(url)).text
Here we finally get to my question: What exactly is this error? The only thing I can make sense of is the last 2 lines:
Traceback (most recent call last):
  File "webscraping.py", line 24, in <module>
    ad_html = requests.get(str(url)).text
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 447, in request
    prep = self.prepare_request(req)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 378, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 303, in prepare
    self.prepare_url(url, params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 360, in prepare_url
    "Perhaps you meant http://{0}?".format(url))
requests.exceptions.MissingSchema: Invalid URL u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied. Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?
It looks like the issue is that all my links are preceded by u', so requests.get() isn't working. This is why you see me pretty much trying to force all the URLs into a regular string with str(). No matter what I do, though, I get this error. Is there something else I'm missing? Am I completely misunderstanding my problem?
Thanks much in advance!
Looks like you misunderstood the problem. The message:

u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied.
Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?

means the URL lacks http:// (the schema) at the front, so replacing

ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]

with

ad_urls = ["http://miami.craigslist.org" + ad for ad in ad_ls]

should do the job.
Hi, I have an Excel sheet with only one column, and I want to import that column into a list in Python.
It has 5 elements in that column, all containing a url like "http://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0".
My code
import requests
import csv
import xlrd
ls = []
ls1 = ['01.jpg','02.jpg','03.jpg','04.jpg','05.jpg','06.jpg']
wb = xlrd.open_workbook('Book1.xls')
ws = wb.sheet_by_name('Book1')
num_rows = ws.nrows - 1
curr_row = -1
while curr_row < num_rows:
    curr_row += 1
    row = ws.row(curr_row)
    ls.append(row)

for each in ls:
    urlFetch = requests.get(each)
    img = urlFetch.content
    for x in ls1:
        file = open(x, 'wb')
        file.write(img)
        file.close()
Now it is giving me this error:
Traceback (most recent call last):
  File "C:\Users\Prime\Documents\NetBeansProjects\Python_File_Retrieve\src\python_file_retrieve.py", line 18, in <module>
    urlFetch = requests.get(each)
  File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\api.py", line 65, in get
    return request('get', url, **kwargs)
  File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\sessions.py", line 461, in request
    resp = self.send(prep, **send_kwargs)
  File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\sessions.py", line 646, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '[text:'https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0']'
Please help.
Your problem isn't with reading the Excel file, but with parsing the content out of it. Notice that your error was thrown from the Requests library?
requests.exceptions.InvalidSchema: No connection adapters were found for <url>
From the error we learn that the URL you take from each cell in your Excel file, also has a [text: prefix -
'[text:'https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0']'
That's something that Requests cannot work with, because it doesn't know the protocol of the URL.
If you do

requests.get('https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0')

you get appropriate results.
What you need to do is extract just the URL from each cell.
If you're having problems with that, give us examples of the URLs in your Excel file.
For the urls in your spreadsheet, click on one of them and see what appears in the formula bar. I'm guessing it looks like this:
[text:'https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0']
Because in the stack trace, that's what it's printing out for the url.
Can you remove the brackets, quotes, and "text:" parts of this? That should fix it.
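The cleaner fix is on the reading side: ws.row(curr_row) returns a list of Cell objects, and it's the str() of a Cell that produces the [text:'…'] form. xlrd's cell_value(rowx, colx) accessor (or row[0].value) returns the raw cell contents instead. A minimal sketch, with a stand-in sheet object so it runs without an .xls file:

```python
def column_urls(sheet, col=0):
    # cell_value(rowx, colx) returns the cell's raw contents, so each
    # entry is a plain string that requests.get() accepts, rather than
    # the "[text:'...']" repr of a Cell object.
    return [sheet.cell_value(r, col) for r in range(sheet.nrows)]

class FakeSheet:
    # Stand-in for an xlrd sheet, purely so this sketch is runnable;
    # with xlrd you would pass wb.sheet_by_name('Book1') instead.
    nrows = 2
    _rows = [["http://example.com/a.jpg"], ["http://example.com/b.jpg"]]
    def cell_value(self, rowx, colx):
        return self._rows[rowx][colx]

print(column_urls(FakeSheet()))
# ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

With that list, requests.get(each) receives a plain URL string and no longer raises InvalidSchema.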
If I use urllib to load this URL (https://www.fundingcircle.com/my-account/sell-my-loans/), I get a 400 status error.
e.g. The following returns a 400 error
>>> import urllib
>>> f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
>>> print f.read()
However, if I copy and paste the url into my browser, I see a web page with the information that I want to see.
I have tried using a try/except and then reading the error, but the returned data just tells me that the page does not exist, e.g.:

import urllib
try:
    f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
except Exception as e:
    eString = e.read()
    print eString
Why can't Python load the page?
If Python is given a 400 status, that's because the server refuses to give you the page.
Why that is, is difficult to know, because servers are black boxes. But your browser gives the server more than just the URL: it also sends a set of HTTP headers, and most likely the server alters its behaviour based on the contents of one or more of those headers.
You need to look in your browser's development tools, see what your browser sends, and then try to replicate some of those headers from Python. Obvious candidates are the User-Agent header, followed by the Accept and Cookie headers.
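As a sketch of that (the header values below are assumptions copied from typical browser requests, not what your browser actually sends), you can build the request with the requests library and inspect what would go on the wire without performing any network I/O:

```python
import requests

# Hypothetical browser-like headers; copy the real ones from your
# browser's developer tools for the site in question.
headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html,application/xhtml+xml",
}
req = requests.Request(
    "GET",
    "https://www.fundingcircle.com/my-account/sell-my-loans/",
    headers=headers,
)
prepared = req.prepare()  # builds the request; nothing is sent yet
print(prepared.headers["User-Agent"])  # Mozilla/5.0
# requests.Session().send(prepared) would actually perform the fetch
```

Once the headers are right, a requests.Session also tracks cookies across the login flow for you.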
However, in this specific case, the server is responding with a 401 Unauthorized; you are given a login page. It does this both for the browser and Python:
>>> import urllib
>>> urllib.urlopen('https://www.fundingcircle.com/my-account/sell-my-loans/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 451, in open_https
    return self.http_error(url, fp, errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 372, in http_error
    result = method(url, fp, errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 683, in http_error_401
    errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 381, in http_error_default
    raise IOError, ('http error', errcode, errmsg, headers)
IOError: ('http error', 401, 'Unauthorized', <httplib.HTTPMessage instance at 0x1066f9a28>)
but Python's urllib doesn't have a handler for the 401 status code and turns that into an exception.
The response body contains a login form; you'll have to write code to log in here, and presumably track cookies.
That task would be a lot easier with more specialised tools. You could use robobrowser to load the page, parse the form and give you the tools to fill it out, then post the form for you and track the cookies required to keep you logged in. It is built on top of the excellent requests and BeautifulSoup libraries.