Hi I have an excel sheet with only 1 column and i want to import that column into a list in python.
It has 5 elements in that column, all containing a url like "http://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0".
My code
import requests
import csv
import xlrd
ls = []
ls1 = ['01.jpg','02.jpg','03.jpg','04.jpg','05.jpg','06.jpg']
wb = xlrd.open_workbook('Book1.xls')
ws = wb.sheet_by_name('Book1')
num_rows = ws.nrows - 1
curr_row = -1
while (curr_row < num_rows):
curr_row += 1
row = ws.row(curr_row)
ls.append(row)
for each in ls:
urlFetch = requests.get(each)
img = urlFetch.content
for x in ls1:
file = open(x,'wb')
file.write(img)
file.close()
Now it is giving me error:
Traceback (most recent call last):
File "C:\Users\Prime\Documents\NetBeansProjects\Python_File_Retrieve\src\python_file_retrieve.py", line 18, in <module>
urlFetch = requests.get(each)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\api.py", line 65, in get
return request('get', url, **kwargs)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\api.py", line 49, in request
response = session.request(method=method, url=url, **kwargs)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\sessions.py", line 461, in request
resp = self.send(prep, **send_kwargs)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\sessions.py", line 567, in send
adapter = self.get_adapter(url=request.url)
File "c:\Python34\lib\site-packages\requests-2.5.0-py3.4.egg\requests\sessions.py", line 646, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '[text:'https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0']'
Please Help
Your problem isn't with reading the Excel file, but with parsing the content out of it. Notice that your error was thrown from the Requests library?
requests.exceptions.InvalidSchema: No connection adapters were found for <url>
From the error we learn that the URL you take from each cell in your Excel file, also has a [text: prefix -
'[text:'https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0']'
That's something that Requests cannot work with, because it doesn't know the protocol of the URL.
If you do
requests.get('https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0')
You get appropriate results.
What you need to do is extract the URL only out of the cell.
If you're having problems with that, give us examples for URLs in the Excel file
For the urls in your spreadsheet, click on one of them and see what appears in the formula bar. I'm guessing it looks like this:
[text:'https://dl.dropboxusercontent.com/sh/hk7l7t1ead5bd7d/AAACc6yA_4MhwbaxX_dizyg3a/NT51-177/DPS_0321.jpg?dl=0']
Because in the stack trace, that's what it's printing out for the url.
Can you remove the brackets, quotes, and "text:" parts of this? That should fix it.
Related
#Downloading All XKCD Comics
url = "http://xkcd.com"
os.makedirs("xkcd", exist_ok=True)
while not url.endswith("#"):
print("Downloading page %s..." % url)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
comicElem = soup.select("#comic img")
if comicElem == []:
print("Could not find comic image.")
else:
comicUrl = comicElem[0].get("src")
#Download the image.
print('Downloading image %s...' % (comicUrl))
res = requests.get(comicUrl)
res.raise_for_status()
imageFile = open(os.path.join("xkcd", os.path.basename(comicUrl)),"wb")
for chunk in res.iter_content(None):
imageFile.write(chunk)
imageFile.close()
prevLink = soup.select("a[rel=prev]")[0]
url = "http://xkcd.com" + prevLink.get("href")
print("Done.")
Full code is stated above. Full output is stated below.
Downloading page http://xkcd.com...
C:/Users/emosc/PycharmProjects/RequestsLearning/main.py:38: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 38 of the file C:/Users/emosc/PycharmProjects/RequestsLearning/main.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
soup = bs4.BeautifulSoup(res.text)
Traceback (most recent call last):
File "C:/Users/emosc/PycharmProjects/RequestsLearning/main.py", line 46, in <module>
res = requests.get(comicUrl)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\sessions.py", line 528, in request
prep = self.prepare_request(req)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\sessions.py", line 456, in prepare_request
p.prepare(
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\models.py", line 316, in prepare
self.prepare_url(url, params)
File "C:\Users\emosc\PycharmProjects\RequestsLearning\venv\lib\site-packages\requests\models.py", line 390, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '//imgs.xkcd.com/comics/rapid_test_results.png': No schema supplied. Perhaps you meant http:////imgs.xkcd.com/comics/rapid_test_results.png?
Downloading image //imgs.xkcd.com/comics/rapid_test_results.png...
I have never ever seen an image link like (only with 2 backslashes not 4) http:////imgs.xkcd.com/comics/rapid_test_results.png this and BS4 recommends me to use that and I dont know how to solve this error. Typically followed Automate the Boring Stuff with Python book, same code as from that book but shoots this error when I try to scrape the site. Thanks for any help.
The http:// and https:// protocols are both examples of schemas. Print your URLs before every usage in your code, and check to see if those two are not 1:1 included at the start of your url. Failure to add http://url or https://url will lead to the error as shown, so make sure http:// is added.
You can add this to check what the URL starts with
if comicURL.startswith("//"):
continue
Put the portion of the code below in a try, except block, it'd run.
res = requests.get(comicUrl)
add "http:"
like this: comicUrl = 'http:'+comicElem[0].get('src')
I have an URL in the config file which I parsed using ConfigParser to get the requests
config.ini
[default]
root_url ='https://reqres.in/api/users?page=2'
FetchFeeds.py
import requests
from configparser import ConfigParser
import os
import requests
config = ConfigParser()
config.read(os.path.join(os.path.dirname(__file__), '../Config', 'config.ini'))
rootUrl = (config['default']['root_url'])
print(rootUrl)
response = requests.get(rootUrl)
Even the URL is printed properly 'https://reqres.in/api/users?page=2'but I am getting the following error on the requests module
Traceback (most recent call last):
File "C:/Users/sam/PycharmProjects/testProject/GET_Request/FetchFeeds.py", line 11, in <module>
response = requests.get(rootUrl)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\sessions.py", line 637, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\sessions.py", line 730, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for "'https://reqres.in/api/users?page=2'"
Process finished with exit code 1
Please note I checked carefully there are no white spaces
The main issue is it tries to find a URL like 'url.com' which doesn't exist (correct would be url.com), so the solution is to not put apostrophes in config.ini files:
[default]
root_url = https://reqres.in/api/users?page=2
Also think about if you configure the parameters rather in your code than in the config file and use this instead:
requests.get(config['default']['root_url'], params={'page': 2})
I am working with a local html file in python, and I am trying to use lxml to parse the file. For some reason I can't get the file to load properly, and I'm not sure if this has to do with not having an http server set up on my local machine, etree usage, or something else.
My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/
This could be a related problem: Requests : No connection adapters were found for, error in Python3
Here is my code:
from lxml import html
import requests
page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)
test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')
print test
The traceback that I'm getting reads:
C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
File "C:/Users/.../extract_html/extract.py", line 4, in <module>
page = requests.get('C:\Users\...\sites\site_1.html')
File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
adapter = self.get_adapter(url=request.url)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'
Process finished with exit code 1
You can see that it has something to do with a "connection adapter" but I'm not sure what that means.
If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server.
with open(r'C:\Users\...site_1.html', "r") as f:
page = f.read()
tree = html.fromstring(page)
There is a better way for doing it:
using parse function instead of fromstring
tree = html.parse("C:\Users\...site_1.html")
print(html.tostring(tree))
You can also try using Beautiful Soup
from bs4 import BeautifulSoup
f = open("filepath", encoding="utf8")
soup = BeautifulSoup(f)
f.close()
So I'm using BeautifulSoup to build a webscraper to grab every ad on a Craigslist page. Here's what I've got so far:
import requests
from bs4 import BeautifulSoup, SoupStrainer
import bs4
page = "http://miami.craigslist.org/search/roo?query=brickell"
search_html = requests.get(page).text
roomSoup = BeautifulSoup(search_html, "html.parser")
ad_list = roomSoup.find_all("a", {"class":"hdrlnk"})
#print ad_list
ad_ls = [item["href"] for item in ad_list]
#print ad_ls
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
#print ad_urls
url_str = [str(unicode) for unicode in ad_urls]
# What's in url_str?
for url in url_str:
print url
When I run this, I get:
miami.craigslist.org/mdc/roo/4870912192.html
miami.craigslist.org/mdc/roo/4858122981.html
miami.craigslist.org/mdc/roo/4870665175.html
miami.craigslist.org/mdc/roo/4857247075.html
miami.craigslist.org/mdc/roo/4870540048.html ...
This is exactly what I want: a list containing the URLs to each ad on the page.
My next step was to extract something from each of those pages; hence building another BeautifulSoup object. But I get stopped short:
for url in url_str:
ad_html = requests.get(str(url)).text
Here we finally get to my question: What exactly is this error? The only thing I can make sense of is the last 2 lines:
Traceback (most recent call last): File "webscraping.py", line 24,
in <module>
ad_html = requests.get(str(url)).text File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py",
line 65, in get
return request('get', url, **kwargs) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py",
line 49, in request
response = session.request(method=method, url=url, **kwargs) File
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py",
line 447, in request
prep = self.prepare_request(req) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py",
line 378, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks), File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py",
line 303, in prepare
self.prepare_url(url, params) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py",
line 360, in prepare_url
"Perhaps you meant http://{0}?".format(url)) requests.exceptions.MissingSchema: Invalid URL
u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied.
Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?
It looks like the issue is that all my links are preceded by u', so requests.get() isn't working. This is why you see me pretty much trying to force all the URLs into a regular string with str(). No matter what I do, though, I get this error. Is there something else I'm missing? Am I completely misunderstanding my problem?
Thanks much in advance!
Looks like you misundersood the problem
The message:
u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied.
Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?
means it lacks of http:// (the schema) before the url
so replacing
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
by
ad_urls = ["http://miami.craigslist.org" + ad for ad in ad_ls]
should do the job
I am using urllib2 in Python to scrape a webpage. However, the read() method does not return.
Here is the code I am using:
import urllib2
url = 'http://edmonton.en.craigslist.ca/kid/'
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(url, headers=headers)
f_webpage = urllib2.urlopen(request)
html = f_webpage.read() # <- does not return
I last ran the script a month ago and it was working fine then.
Note that the same script runs well for webpages of other categories on Edmonton Craigslist like http://edmonton.en.craigslist.ca/act/ or http://edmonton.en.craigslist.ca/eve/.
As requested in comments :)
Install requests by $ pip install requests
Use requests as the following:
>>> import requests
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = requests.get(url, headers=headers)
>>> request.ok
True
>>> request.text # content in string, similar to .read() in question
...
...
Disclaimer: this is not technically the answer to OP's question, but solves OP's problem as urllib2 is known to be problematic and requests library is born to solve such problems.
It returns (or more specifically, errors out) fine for me:
>>> import urllib2
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = urllib2.Request(url, headers=headers)
>>> f_webpage = urllib2.urlopen(request)
>>> html = f_webpage.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "/usr/lib/python2.7/httplib.py", line 541, in read
return self._read_chunked(amt)
File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
value.append(self._safe_read(amt))
File "/usr/lib/python2.7/httplib.py", line 647, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
socket.error: [Errno 104] Connection reset by peer
Chances are that Craigslist is detecting that you are a scraper and refusing to give you the actual page.
I met the similar problem with you. Part of my error information:
File "C:\Python27\lib\socket.py", line 380, in read
data = self._sock.recv(left)
File "C:\Python27\lib\httplib.py", line 573, in read
s = self.fp.read(amt)
File "C:\Python27\lib\socket.py", line 380, in read
data = self._sock.recv(left)
error: [Errno 10054]
I solve it by reading the buffer in small batches instead of reading directly.
def readBuf(fsrc, length=16*1024):
result=''
while 1:
buf = fsrc.read(length)
if not buf:
break
else:
result+=buf
return result
Instead of using html=f_webpage.read(), you can use html=readBuf(f_webpage) to scrape the webpage.