I am trying to do some web scraping. At first the code was working, but later it stopped. The code is:
import requests
import hashlib
from bs4 import BeautifulSoup
def sha512(x):
    m = hashlib.sha512(x.encode())
    return m.hexdigest()
session = requests.Session()
session.cookies["user-agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"
r = session.post("https://ringzer0ctf.com/login", data={"username":"myusername","password":"mypass"})
r = session.get("https://ringzeractf.com/challenges/13")
soup = BeautifulSoup(r.text, 'html.parser')
It gives an error like:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='ringzeractf.com', port=443): Max retries exceeded
with url: /challenges/13 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x04228490>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
The URL in your GET request is wrong: change ringzeractf to ringzer0ctf.
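A sketch of the corrected line (as an aside, note that a User-Agent belongs in session.headers; the original code sets it via session.cookies, which sends it as a cookie named "user-agent" rather than as a header):
# Send the User-Agent as a header instead of a cookie
session.headers["user-agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"
# Corrected host: ringzer0ctf, not ringzeractf
r = session.get("https://ringzer0ctf.com/challenges/13")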
I am new to Python and web scraping, but for the past two weeks I have been periodically scraping one website and successfully downloading images from it. I use different proxies and sometimes change them. But starting yesterday, all my proxies suddenly stopped working with a timeout error. I've tried a whole list of them and they all fail.
Could this be some kind of site protection against scraping? If so, is there a way to overcome it?
import requests
from bs4 import BeautifulSoup

header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
proxies = {
    "http": "http://188.114.99.153",
    "https": "http://180.94.69.66:8080"
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'
html = requests.get(url, headers=header, proxies=proxies, timeout=10).text
soup = BeautifulSoup(html, 'lxml')
Error message:
ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001536A8E7190>, 'Connection to 180.94.69.66 timed out. (connect timeout=10)'))
The code below GETs the URL, retrying up to 3 times on connection errors. The backoff factor applies an increasing delay between attempts, which helps avoid failing again when the problem is a periodic request quota.
Take a look at urllib3.util.retry.Retry; it has many options for fine-tuning retries.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'
session = requests.Session()
# Retry failed connection attempts up to 3 times, sleeping longer between each try
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
# Use the retrying adapter for both http and https URLs
session.mount('http://', adapter)
session.mount('https://', adapter)
html = session.get(url, headers=header).text
soup = BeautifulSoup(html, 'lxml')
print(soup)
Hello, I have a script that I run on my organization's internal network. It was supposed to run on the 1st but didn't, so I backed the data up to my local database so that I could run the script against the correct data. I changed the URL so it lines up with my local site, but it is not working and I get this error:
HTTPSConnectionPool(host='localhost', port=44345): Max retries exceeded with url: /logon.aspx (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc26acb86a0>: Failed to establish a new connection: [Errno 111] Connection refused',)
Here is how I set up my script to access the URL:
URL = "https://localhost:44345/logon.aspx"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
username="script"
password="password"
s = Session()
s.verify = False
s.headers.update(headers)
r = s.get(URL)
Why is my connection being refused? I can browse to the site in my web browser, so why am I getting a connection refused?
Since you are running on localhost, try the http protocol instead of https.
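A minimal change, assuming the local server accepts plain HTTP on the same port (adjust the port if it doesn't):
URL = "http://localhost:44345/logon.aspx"  # http instead of https
r = s.get(URL)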
I am trying to fix the following error, but I am not finding any solution. Can anyone help me with this?
When I run this code it sometimes works, but sometimes it displays the error below. Here is the code, followed by the error.
import requests
from bs4 import BeautifulSoup
import mysql.connector
mydb = mysql.connector.connect(host="localhost", user="root",passwd="", database="python_db")
mycursor = mydb.cursor()
#url="https://csr.gov.in/companyprofile.php?year=FY%202014-15&CIN=U01224KA1980PLC003802"
#query1 = "INSERT INTO csr_details(average_net_profit,csr_prescribed_expenditure,csr_spent,local_area_spent) VALUES()"
mycursor.execute("SELECT cin_no FROM tn_cin WHERE csr_status=0")
urls=mycursor.fetchall()
#print(urls)
def convertTuple(tup):
    str = ''.join(tup)
    return str

for url in urls:
    str = convertTuple(url[0])
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36', "Accept-Language": "en-US,en;q=0.9", "Accept-Encoding": "gzip, deflate"}
    csr_link = 'https://csr.gov.in/companyprofile.php?year=FY%202014-15&CIN='
    link = csr_link+str
    #print(link)
    response=requests.get(link, headers=headers)
    #print(response.status_code)
    bs=BeautifulSoup(response.text,"html.parser")
    div_table=bs.find('div', id = 'colfy4')
    if div_table is not None:
        fy_table = div_table.find_all('table', id = 'employee_data')
        if fy_table is not None:
            for tr in fy_table:
                td=tr.find_all('td')
                if len(td)>0:
                    rows=[i.text for i in td]
                    row1=rows[0]
                    row2=rows[1]
                    row3=rows[2]
                    row4=rows[3]
                    #cin_no=url[1]
                    #cin=convertTuple(url[1])
                    #result=cin_no+rows
                    mycursor.execute("INSERT INTO csr_details(cin_no,average_net_profit,csr_prescribed_expenditure,csr_spent,local_area_spent) VALUES(%s,%s,%s,%s,%s)",(str,row1,row2,row3,row4))
    #print(cin)
    #print(str)
    #var=1
    status_update="UPDATE tn_cin SET csr_status=%s WHERE cin_no=%s"
    data = ('1',str)
    mycursor.execute(status_update,data)
    #result=mycursor.fetchall()
    #print(result)
    mydb.commit()
I get the following error after running the above code:
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
The error
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
is usually caused on the server side and would normally fall under a 5xx status code. It simply indicates that the server closed the connection before a full response was delivered.
I believe it's likely caused by this line
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36', "Accept-Language": "en-US,en;q=0.9", "Accept-Encoding": "gzip, deflate"}
which in some cases causes issues with certain header values. You may simply try setting the header as
response=requests.get(link, headers={"User-Agent":"Mozilla/5.0"})
and see if that solves your problem.
See this answer for user-agents for a variety of browsers.
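Since the failure is intermittent, you could also combine this with the retry adapter shown in the earlier answer. A sketch, assuming retrying the GET is safe here (link is the URL built in your loop):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times on connection and read errors, with increasing delays between attempts
retry = Retry(connect=3, read=3, backoff_factor=1)
session.mount('https://', HTTPAdapter(max_retries=retry))
response = session.get(link, headers={"User-Agent": "Mozilla/5.0"})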
I've been trying to print time data from this site: clockofeidolon.com. I found that the hour, minutes and seconds are stored in <span class="big-x"> tags, and I have tried to get the data with this:
from bs4 import BeautifulSoup
from requests import Session
session = Session()
session.headers['user-agent'] = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
    '66.0.3359.181 Safari/537.36'
)
url = 'https://clockofeidolon'
response = session.get(url=url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
spans = soup.find('<span class="big')
print(data)
print([span.text for span in spans])
I keep getting authentication errors though:
socket.gaierror: [Errno 11001] getaddrinfo failed
This error occurs because you are trying to access a URL that doesn't exist (https://clockofeidolon) or that Python can't reach.
Look at this question, which explains what that error means:
"getaddrinfo failed", what does that mean?
The host clockofeidolon did not resolve to an IP. You were probably looking for clockofeidolon.com.
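A sketch of the corrected request, assuming the class names really do start with "big-" (and noting that if the clock digits are filled in by JavaScript, requests alone will not see them):
from bs4 import BeautifulSoup
from requests import Session

session = Session()
url = 'https://clockofeidolon.com'  # fully qualified host that resolves
response = session.get(url=url)
soup = BeautifulSoup(response.text, 'html.parser')
# Match every <span> that has a class starting with "big-"
spans = soup.find_all('span', class_=lambda c: c and c.startswith('big-'))
print([span.text for span in spans])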
I have the following Python 2.7 code:
import requests
from urllib3 import Retry
s = requests.Session()
http_retries = Retry(3)
https_retries = Retry(3)
http = requests.adapters.HTTPAdapter(max_retries=http_retries)
https = requests.adapters.HTTPAdapter(max_retries=https_retries)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36',}
s.mount('http://', http)
s.mount('https://', https)
response = s.get(URL, headers=headers, timeout=10)
I keep getting
Failed to establish a new connection: [Errno 101] Network is unreachable'
when I run the script from an Amazon AWS instance, but on another network it works fine.
Any idea why?