I am trying to write code that scrapes and downloads specific files from archive.org. When I run the program, I get this error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\ROMS\Gamecube\main.py", line 16, in <module>
response = requests.get(DOMAIN + file_link)
File "C:\Users\cycle\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\cycle\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\cycle\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\cycle\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "C:\Users\cycle\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='archive.org007%20-%20agent%20under%20fire%20%28usa%29.nkit.gcz', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x043979B8>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
This is my code:
from bs4 import BeautifulSoup as bs
import requests

DOMAIN = 'https://archive.org'
URL = 'https://archive.org/download/GCRedumpNKitPart1'
FILETYPE = '%28USA%29.nkit.gcz'

def get_soup(url):
    return bs(requests.get(url).text, 'html.parser')

for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    if FILETYPE in file_link:
        print(file_link)
        with open(link.text, 'wb') as file:
            response = requests.get(DOMAIN + file_link)
            file.write(response.content)
You simply forgot the / after https://archive.org, so you are building incorrect URLs: the domain and the file name get concatenated into one string, which is why the traceback shows a host like archive.org007%20-%20agent%20under%20fire%20%28usa%29.nkit.gcz.
Add / at the end of the domain:
DOMAIN = 'https://archive.org/'
or add the / when building the URL:
response = requests.get(DOMAIN + '/' + file_link)
or use urllib.parse.urljoin() to build the URL:
import urllib.parse
response = requests.get(urllib.parse.urljoin(DOMAIN, file_link))
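Putting it together, a minimal corrected version of the loop might look like this (a sketch based on the code above, using urljoin; the check that the anchor actually has an href is an extra guard not in the original):

from urllib.parse import urljoin

from bs4 import BeautifulSoup as bs
import requests

DOMAIN = 'https://archive.org'
URL = 'https://archive.org/download/GCRedumpNKitPart1'
FILETYPE = '%28USA%29.nkit.gcz'

def get_soup(url):
    return bs(requests.get(url).text, 'html.parser')

for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    if file_link and FILETYPE in file_link:    # skip anchors without an href
        file_url = urljoin(DOMAIN, file_link)  # e.g. https://archive.org/<href>
        print(file_url)
        response = requests.get(file_url)
        with open(link.text, 'wb') as file:
            file.write(response.content)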
Related
I have the following code. When I run it on Windows I can make requests through a specific NIC, as described in this answer, but when I run it on Arch Linux the request times out.
import requests
from requests_toolbelt.adapters import source

source = source.SourceAddressAdapter('10.100.89.75')

with requests.Session() as session:
    session.mount('http://', source)
    r = session.get("http://ifconfig.me")
    print(r.text)
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 600, in get
return self.request("GET", url, **kwargs)
File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.10/site-packages/requests/adapters.py", line 553, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='ifconfig.me', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f8e0ab379a0>, 'Connection to ifconfig.me timed out. (connect timeout=None)'))
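For reference, a minimal sketch of the same source-address binding, with two assumptions added on top of the snippet above: the adapter is mounted for both http:// and https://, and an explicit timeout is set so a routing or firewall problem fails fast instead of hanging indefinitely:

import requests
from requests_toolbelt.adapters import source

adapter = source.SourceAddressAdapter('10.100.89.75')  # address of the chosen NIC

with requests.Session() as session:
    # Mount the adapter for both schemes so every request uses the source address.
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    # A short timeout surfaces connectivity problems quickly instead of blocking.
    r = session.get('http://ifconfig.me', timeout=10)
    print(r.text)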
I'm trying to write a script in Python that sends an HTTP GET request to automatically generated URLs and records the response code and elapsed time for each. The URLs need not be valid; 400 responses are acceptable too.
script1.py
import sys
import requests

str1="http://www.googl"
str3=".com"
str2='a'

for x in range(0, 8):
    y = chr(ord(str2)+x)
    str_s=str1+y+str3
    r=requests.get(str_s)
    print(str_s, r.status_code, r.elapsed.total_seconds())
Error:
File "script1.py", line 12, in <module><br>
r=requests.get(str_s)<br>
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 72, in get<br>
return request('get', url, params=params, **kwargs)<br>
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request<br>
return session.request(method=method, url=url, **kwargs)<br>
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 508, in request<br>
resp = self.send(prep, **send_kwargs)<br>
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 618, in send<br>
r = adapter.send(request, **kwargs)<br>
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 508, in send<br>
raise ConnectionError(e, request=request)<br>
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.googla.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc44c891e50>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I just want to see the time taken to receive the response for each request.
Only one request has to be sent.
The response code does not matter.
I guess you want to get something like this:
import sys
import requests

str1="http://www.googl"
str3=".com"
str2='a'

for x in range(0, 8):
    y = chr(ord(str2)+x)
    str_s=str1+y+str3
    print('Connecting to ' + str_s)
    try:
        r = requests.get(str_s)
        print(str_s, r.status_code, r.elapsed.total_seconds())
    except requests.ConnectionError as e:
        print(" Failed to open url")
Using try...except, you can catch the exception that get raises and handle it gracefully.
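Since r.elapsed is only available when a response actually arrives, a small variation (a sketch, assuming you also want a rough timing for URLs that fail to connect) is to time the call yourself:

import time

import requests

str1 = "http://www.googl"
str3 = ".com"

for x in range(0, 8):
    str_s = str1 + chr(ord('a') + x) + str3
    start = time.time()
    try:
        r = requests.get(str_s, timeout=10)
        print(str_s, r.status_code, r.elapsed.total_seconds())
    except requests.exceptions.RequestException:
        # No response object here, so report the wall-clock time until failure.
        print(str_s, 'failed after', round(time.time() - start, 3), 'seconds')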
I am trying to scrape a website (~3000 pages) once a day. My code for the requests is:
#Return soup of page of Carsales ads
def getsoup(pagelimit,offset):
    url = "http://www.carsales.com.au/cars/results?q=%28Service%3D%5BCarsales%5D%26%28%28%28SiloType%3D%5BDemo%20and%20near%20new%20cars%5D%7CSiloType%3D%5BDealer%20used%20cars%5D%29%7CSiloType%3D%5BDemo%20and%20near%20new%20cars%5D%29%7CSiloType%3D%5BPrivate%20seller%20cars%5D%29%29&cpw=1&sortby={0}&limit={1}&offset={2}".format('Price',str(pagelimit),str(offset))
    #Sortby options: LastUpdated,Price,
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.text, "html5lib") #"html.parser"
    totalpages = int(soup.find("div", class_="pagination").text.split(' of ',1)[1].split('\n', 1)[0])
    currentpage = int(soup.find("div", class_="pagination").text.split('Page ',1)[1].split(' of', 1)[0])
    return (soup, totalpages, currentpage)

adscrape = []
#Run through all search result pages, appending ads to adscrape
while currentpage < totalpages:
    soup, totalpages, currentpage = getsoup(pagelimit,offset)
    print 'Page: {0} of {1}. Offset is {2}.'.format(currentpage,totalpages,offset)
    adscrape.extend(getpageads(soup,offset))
    offset = offset+pagelimit
    # sleep(1)
I have previously run it successfully with no sleep() call to rate limit. However, now I am getting errors midway through execution, whether or not the sleep(1) is active in the code:
...
Page: 1523 of 2956. Offset is 91320.
Page: 1524 of 2966. Offset is 91380.
Page: 1525 of 2956. Offset is 91440.
Traceback (most recent call last):
File "D:\Google Drive\pythoning\carsales\carsales_v2.py", line 82, in <module>
soup, totalpages, currentpage = getsoup(pagelimit,offset)
File "D:\Google Drive\pythoning\carsales\carsales_v2.py", line 28, in getsoup
r = requests.get(url, headers)
File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "C:\Python27\lib\site-packages\requests\adapters.py", line 423, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.carsales.com.au', port=80): Max retries exceeded with url: /cars/results?q=%28Service%3D%5BCarsales%5D%26%28%28%28SiloType%3D%5BDemo%20and%20near%20new%20cars%5D%7CSiloType%3D%5BDealer%20used%20cars%5D%29%7CSiloType%3D%5BDemo%20and%20near%20new%20cars%5D%29%7CSiloType%3D%5BPrivate%20seller%20cars%5D%29%29&cpw=1&sortby=Price&limit=60&offset=91500&user-agent=Mozilla%2F5.0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x000000001A097710>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
<<< Process finished. (Exit code 1)
I'm assuming this is due to hitting the server with too many requests in a certain time. If so, how can I avoid this without making the script take hours and hours to run? What is the normal practice for scraping websites to counter this issue?
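One common approach (a sketch of standard practice, not taken from the question above) is to reuse a single Session and let urllib3 retry transient failures with exponential backoff, so an occasional DNS or connection hiccup gets retried automatically instead of killing the run:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # older requests: from requests.packages.urllib3.util.retry import Retry

# Retry transient failures up to 5 times, waiting exponentially longer between attempts,
# and also retry on common temporary HTTP status codes.
retries = Retry(total=5, backoff_factor=0.5,
                status_forcelist=[429, 500, 502, 503, 504])

session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

def fetch(url):
    # A timeout keeps a single stuck request from stalling the whole scrape.
    return session.get(url, timeout=30)

Combined with a short sleep between pages, this keeps the request rate reasonable without the run taking hours.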
I have made a web crawler that takes thousands of URLs from a text file and then crawls the data on each webpage.
Since there are many URLs, some of them are broken.
That gives me this error:
Traceback (most recent call last):
File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 57, in <module>
crawl_data("http://www.foasdasdasdasdodily.com/r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show")
File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 18, in crawl_data
data = requests.get(url)
File "C:\Python27\lib\site-packages\requests\api.py", line 67, in get
return request('get', url, params=params, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 53, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "C:\Python27\lib\site-packages\requests\adapters.py", line 437, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.foasdasdasdasdodily.com', port=80): Max retries exceeded with url: /r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0310FCB0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
Here's my code:
def crawl_data(url):
    global connectString
    data = requests.get(url)
    response = str( data )
    if response != "<Response [200]>":
        return
    soup = BeautifulSoup(data.text,"lxml")
    titledb = soup.h1.string
But it still gives me the same exception. I simply want it to ignore the URLs from which there is no response and move on to the next URL.
You need to learn about exception handling. The easiest way to ignore these errors is to surround the code that processes a single URL with a try-except construct, making your code read something like:
try:
    <process a single URL>
except requests.exceptions.ConnectionError:
    pass
This means that if the specified exception occurs, your program will just execute the pass (do nothing) statement and move on to the next URL.
Use try-except:
def crawl_data(url):
    global connectString
    try:
        data = requests.get(url)
    except requests.exceptions.ConnectionError:
        return
    response = str( data )
    soup = BeautifulSoup(data.text,"lxml")
    titledb = soup.h1.string
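To tie it together, the calling loop might look like this (a sketch; the urls.txt input file and the outer try-except are assumptions, not from the question, and the outer except catches the broader RequestException so timeouts and malformed URLs are skipped as well):

with open('urls.txt') as f:  # hypothetical text file with one URL per line
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        crawl_data(url)
    except requests.exceptions.RequestException:
        # Broken or unreachable URL: skip it and continue with the next one.
        continue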
My backend is developed in Java and does all kinds of processing, and my frontend is developed using Python's Flask framework. I am using requests to send requests to and get responses from the Java APIs.
Following is the line in my code which does that:
req = requests.post(buildApiUrl.getUrl('user') + "/login", data=payload)
My problem is that whenever the Tomcat instance is not running or there is some issue with the Java APIs, I get the following error from requests:
ERROR:root:HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /MYAPP/V1.0/user/login (Caused by <class 'socket.error'>: [Errno 111] Connection refused)
Traceback (most recent call last):
File "/home/rahul/git/myapp/webapp/views/utils.py", line 31, in decorated_view
return_value = func(*args, **kwargs)
File "/home/rahul/git/myapp/webapp/views/public.py", line 37, in login
req = requests.post(buildApiUrl.getUrl('user') + "/login", data=payload)
File "/home/rahul/git/myapp/venv/local/lib/python2.7/site-packages/requests/api.py", line 88, in post
return request('post', url, data=data, **kwargs)
File "/home/rahul/git/myapp/venv/local/lib/python2.7/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/home/rahul/git/myapp/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 335, in request
resp = self.send(prep, **send_kwargs)
File "/home/rahul/git/myapp/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 438, in send
r = adapter.send(request, **kwargs)
File "/home/rahul/git/myapp/venv/local/lib/python2.7/site-packages/requests/adapters.py", line 327, in send
raise ConnectionError(e)
ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /MYAPP/V1.0/user/login (Caused by <class 'socket.error'>: [Errno 111] Connection refused)
I want to handle any such errors in my Flask app so that I can show a meaningful response on the web page instead of a blank screen. How can I achieve this?
Catch the exception requests.post raises using try-except:
try:
    req = requests.post(buildApiUrl.getUrl('user') + "/login", data=payload)
except requests.exceptions.RequestException:
    # Handle exception ..
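In a Flask view this could look roughly like the following (a sketch; the /login route, the form payload, and the error messages are assumptions, while buildApiUrl comes from the question's own code and is assumed to be importable):

import requests
from flask import Flask, request

app = Flask(__name__)

@app.route('/login', methods=['POST'])
def login():
    payload = request.form  # hypothetical: whatever the login form posts
    try:
        resp = requests.post(buildApiUrl.getUrl('user') + "/login",
                             data=payload, timeout=5)
    except requests.exceptions.ConnectionError:
        # Tomcat is down or unreachable: show a friendly message instead of a blank page.
        return "The service is temporarily unavailable. Please try again later.", 503
    except requests.exceptions.RequestException:
        # Any other requests failure (timeout, invalid URL, ...).
        return "Something went wrong while contacting the backend.", 502
    # Normal path: use resp as before.
    return resp.text, resp.status_code

Returning plain text keeps the sketch self-contained; in a real app you would render a template or flash a message instead.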