I'm using a script to grab download links from an HTML page (sent to me via email) and then download the files. The script had been working great for about six months, but last week I started getting a 403 error.
From what I've read and understand, the issue is that the site is blocking me because it thinks I'm a bot (can't deny that). But I'm not scraping the site's HTML, just trying to download a file using requests.get, and I only get this error from one specific site; downloads from other sites work fine.
I've tried setting headers={'User-Agent': 'Mozilla/5.0'}, but that didn't help.
Here's the function that downloads the file:
def download_file(dl_url, local_save_path):
    """Download URL to given path."""
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
    auth_check = requests.get(dl_url, auth=(username.get(), password.get()),
                              verify=False, headers={'User-Agent': user_agent})
    dnl_sum = 1024
    local_filename = dl_url.split('/')[-1]
    complete_name = os.path.join(local_save_path, local_filename)
    # Get file size
    r = requests.head(dl_url, auth=(username.get(), password.get()),
                      verify=False, headers={'User-Agent': user_agent})
    try:
        dl_file_size = int(r.headers['content-length'])
        file_size.set(str(int(dl_file_size * (10 ** -6))) + "MB")
        c = 1
    except KeyError:
        c = 0
    # NOTE the stream=True parameter
    print('1')
    r = requests.get(dl_url, stream=True, auth=(username.get(), password.get()),
                     verify=False, headers={'User-Agent': user_agent})
    print('2')
    while True:
        try:
            with open(complete_name, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024):
                    if chunk:  # filter out keep-alive new chunks
                        f.write(chunk)
                        f.flush()
                        if c == 1:
                            download_perc.set(percentage(dl_file_size, dnl_sum))
                        elif c == 0:
                            print(dnl_sum)
                        dnl_sum = os.path.getsize(complete_name)
        except FileNotFoundError:
            continue
        break
    return
Have you tried using a proxy?
You can use Tor; it gives you a dynamic IP address, so the website can't recognize you.
Try this: https://techoverflow.net/blog/2015/02/06/using-python-requests-over-tor/
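Following that link's approach, here is a minimal sketch of routing requests through a local Tor client. It assumes a Tor daemon is listening on its default SOCKS port 9050 and that the SOCKS extra is installed (`pip install requests[socks]`); `download_via_tor` is just an illustrative name:

```python
import requests

# Assumption: a local Tor client is running on its default SOCKS port 9050.
# The "socks5h" scheme makes DNS resolution happen through Tor as well.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def download_via_tor(url, path):
    """Fetch a URL through Tor, so the target site sees an exit-node IP."""
    r = requests.get(url, proxies=proxies, timeout=60)
    r.raise_for_status()
    with open(path, "wb") as f:
        f.write(r.content)
```

Restarting the Tor client gives you a new exit node, and hence a new IP, if the current one gets blocked.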
I am trying to download a large number of HTML pages from a certain website with the following Python code, using the requests package:
FROM = 547495
TO = 570000
for page_number in range(FROM, TO):
    url = DEFAULT_URL + str(page_number)
    response = requests.get(url)
    if response.status_code == 200:
        with open(str(page_number) + ".html", "wb") as file:
            file.write(response.content)
    time.sleep(0.5)
I put in the sleep(0.5) call so that the web server wouldn't think it's a DoS attack.
After about 20,000 pages, I started getting only 403 Forbidden HTTP status codes, and I can't download any more pages.
But if I try to open the same pages in my browser, they open fine, so I guess the web server didn't block me.
Does someone have an idea what caused it, and how I can handle it?
Thank you.
Make it look like it's your browser by sending its headers, and set a cookie ID if the site requires a session; here is an example. You can get the header values from the "Network" tab of your browser's developer tools while visiting the pages.
with requests.session() as sess:
    sess.headers["User-Agent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0"
    sess.get(url)  # first request just to obtain the session cookie
    sess.headers["Cookie"] = "eZSESSID={}".format(sess.cookies.get("eZSESSID"))
    for page_number in range(FROM, TO):
        response = sess.get(DEFAULT_URL + str(page_number))
        if response.status_code == 200:
            with open(str(page_number) + ".html", "wb") as file:
                file.write(response.content)
        time.sleep(0.5)
I have a website from which I want to download a PDF using requests; the website requires you to log in before you can access the PDF file.
I am using this script, but it isn't working. What is the problem? I used some code from another post but couldn't figure out how to resolve the issue.
import requests
import sys
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}
login_data = {
    'Email': 'My-email',
    'Password': 'My-password',
    'login': 'Login'
}
url = 'https://download-website'  # The website I want to download the file from
filename = 'filename.pdf'

# creating a connection to the pdf
print("Creating the connection ...")
with requests.session() as s:
    url1 = 'https://login-website/'  # The website I want to log in to
    r = s.get(url1, headers=headers, stream=True)
    soup = BeautifulSoup(r.content, 'html5lib')
    login_data['__RequestVerificationToken'] = soup.find('input', attrs={'name': '__RequestVerificationToken'})['value']
    r = s.post(url1, data=login_data, headers=headers, stream=True)

with requests.get(url, stream=True) as r:
    if r.status_code != 200:
        print("Could not download the file '{}'\nError Code : {}\nReason : {}\n\n".format(
            url, r.status_code, r.reason), file=sys.stderr)
    else:
        # Storing the file as a pdf
        print("Saving the pdf file :\n\"{}\" ...".format(filename))
        with open(filename, 'wb') as f:
            try:
                total_size = int(r.headers['Content-length'])
                saved_size_pers = 0
                moversBy = 8192 * 100 / total_size
                for chunk in r.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
                        saved_size_pers += moversBy
                        print("\r=>> %.2f%%" % (
                            saved_size_pers if saved_size_pers <= 100 else 100.0), end='')
                print(end='\n\n')
            except Exception:
                print("==> Couldn't save : {}\\".format(filename))
            f.flush()
        r.close()
r.close()
I can only guess, because I don't know the website's URL. Try writing the keys of the user data in lower case. If that doesn't work, use your browser's developer tools to find out what field names the site's login form actually expects.
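To see which field names the form expects, a small helper like this can list every named input in the login page's forms (an illustrative sketch; `list_form_fields` is a made-up name, and you would pass it the HTML of the login page you already fetch with `s.get(url1, ...)`):

```python
from bs4 import BeautifulSoup

def list_form_fields(html):
    """Return the name of every <input> inside each <form>, so the
    keys of login_data can be matched to what the server expects."""
    soup = BeautifulSoup(html, 'html.parser')
    names = []
    for form in soup.find_all('form'):
        for inp in form.find_all('input'):
            if inp.get('name'):
                names.append(inp['name'])
    return names
```

If the result contains, say, 'email' and 'password' rather than 'Email' and 'Password', that mismatch alone is enough for the login to fail silently.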
I have the code:
from urllib.request import urlopen
url = 'http://gmsh.info/bin/MacOSX/gmsh-4.5.2-MacOSX-sdk.tgz'
sdk = urlopen(url).read()
and the question: why does this download never end? The link is fine and works in browsers. I tried to set some headers like this:
from urllib import request
req = request.Request(url)
req.add_header('user-agent', "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11")
sdk = request.urlopen(req).read()
but this didn't help. Any ideas?
This is because the file is very big; try downloading it in chunks.
As shown in this example, it will work:
import urllib.request

filedata = urllib.request.urlopen('http://gmsh.info/bin/MacOSX/gmsh-4.5.2-MacOSX-sdk.tgz')
CHUNK = 1 * 1024
with open('test.zip', 'wb') as f:
    while True:
        chunk = filedata.read(CHUNK)
        if not chunk:
            break
        f.write(chunk)
I am working on a project for science class that scrapes skyward.smsd.org. The page opens in a pop-up, but when I visit the URL shown at the top of the page outside the pop-up, it says my session has expired, and I can't find a way around this. I am also getting an invalid syntax error at else: msg. Can anyone help me find a solution to these issues?
while True:
    import requests
    from bs4 import BeautifulSoup
    import time
    from time import sleep
    url = "https://skyward.smsd.org/scripts/wsisa.dll/WService=wsEAplus/sfcalendar002.w"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    from requests.packages.urllib3 import add_stderr_logger
    add_stderr_logger()
    s = requests.Session()
    s.headers['User-Agent'] = 'Mozilla/5.0'
    login = {login: 3078774, password: (MY PASSWORD)}
    login_response = s.post(url, data=login)
    for r in login_response.history:
        if r.status_code == 401:  # 401 means authentication failed
            sys.exit(1)  # abort
    pdf_response = s.get(pdf_url)  # Your cookies and headers are automatically included
    if str(soup).find("skyward") == -1:
        continue
        time.sleep(60)
    else:
        msg = 'Subject: This is the script talking, check Skyward'
        # Possibility to make this tell you exactly what has changed
        # A text feature that goes out daily for missing assignments
        fromaddr = '3078774#smsd.org'
        toaddrs = ['3078774#smsd.org']
        print('From: ' + fromaddr)
        print('To: ' + str(toaddrs))
        print('Message: ' + msg)
    break
I have this problem: I'm trying to create a Python script that downloads a web page and looks for some info.
This is the code:
import urllib.request

url_archive_of_nethys = "http://www.aonprd.com/Default.aspx"

def getMainPage():
    fp = urllib.request.urlopen(url_archive_of_nethys)
    mybytes = fp.read()
    mystr = mybytes.decode("utf8")
    fp.close()
    print(mystr)

def main():
    getMainPage()

if __name__ == "__main__":
    main()
But when I run it, I get:
<HTTPError 999: 'No Hacking'>
I also tried the curl command:
curl http://www.aonprd.com/Default.aspx
and it downloaded the page correctly.
I'm developing with Visual Studio and Python 3.6.
Any suggestion will be appreciated.
Thank you.
They probably detect your user agent and filter you out.
Try changing it:
req = urllib.request.Request(
    url,
    data=None,
    headers={'User-Agent': ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
                            "AppleWebKit/537.36 (KHTML, like Gecko) "
                            "Chrome/35.0.1916.47 Safari/537.36")})
fp = urllib.request.urlopen(req)