I have this problem: I'm trying to write a Python script that downloads a web page and looks for some information. This is the code:
import urllib.request

url_archive_of_nethys = "http://www.aonprd.com/Default.aspx"

def getMainPage():
    fp = urllib.request.urlopen(url_archive_of_nethys)
    mybytes = fp.read()
    mystr = mybytes.decode("utf8")
    fp.close()
    print(mystr)

def main():
    getMainPage()

if __name__ == "__main__":
    main()
but when I run it I get:
<HTTPError 999: 'No Hacking'>
I also tried the curl command:
curl http://www.aonprd.com/Default.aspx
and it downloaded the page correctly.
I'm developing with Visual Studio and Python 3.6.
Any suggestion would be appreciated, thank you.
They probably detect your user agent and filter you out.
Try changing it:
import urllib.request

req = urllib.request.Request(
    url,
    data=None,
    headers={'User-Agent': ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
                            "AppleWebKit/537.36 (KHTML, like Gecko) "
                            "Chrome/35.0.1916.47 Safari/537.36")})
fp = urllib.request.urlopen(req)
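Put together, a minimal version of the question's script with that header applied might look like this (the User-Agent string is just one example of a browser-like value; any realistic one should do):

```python
import urllib.request

# URL taken from the question; the fix is only the User-Agent header.
URL = "http://www.aonprd.com/Default.aspx"
BROWSER_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/35.0.1916.47 Safari/537.36")

def build_request(url):
    """Attach a browser-like User-Agent so the server doesn't reject urllib."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})

def get_main_page():
    # Context manager closes the connection even if decode/read fails.
    with urllib.request.urlopen(build_request(URL)) as fp:
        return fp.read().decode("utf-8")
```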
Related
I have some software that uses web scraping, but for some reason it doesn't seem to work. It's bizarre: when I run it in Google Colab the code works fine and the URLs can be opened and scraped, but when I run it in my web application (using python3 run.py in my console) it doesn't.
Here is the code that is returning errors :
b = searchgoogle(query, num)
c = []
print(b)
for i in b:
    extractor = extractors.ArticleExtractor()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/50.0.2661.102 Safari/537.36'
    }
    extractor = extractors.ArticleExtractor()
    req = Request(url=i, headers=headers)
    d = urlopen(req)
    try:
        if d.info()['content-type'].startswith('text/html'):
            print('its html')
            resp = requests.get(i, headers=headers)
            if resp.ok:
                doc = extractor.get_content(resp.text)
                c.append(comparetexts(text, doc, i))
            else:
                print(f'Failed to get URL: {resp.status_code}')
        else:
            print('its not html')
    except KeyError:
        print('its not html')
    print(i)
return c
The line returning the error is "d = urlopen(req)".
There is code above the section I posted here, but it has nothing to do with the errors. Anyway, thanks for your time!
(By the way, I checked my OpenSSL version in python3 and it says 'OpenSSL 1.1.1m 14 Dec 2021', so I think it's up to date.)
This happens because your web application's environment fails SSL certificate verification, so you can tell your script to skip SSL verification when making the request, as described here:
Python 3 urllib ignore SSL certificate verification
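For reference, a minimal sketch of that approach with urllib. This assumes the failure really is certificate verification; note that skipping verification is a security trade-off and should only be done for hosts you trust:

```python
import ssl
import urllib.request

def make_unverified_context():
    """Build an SSL context that skips certificate verification."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # don't match the certificate to the hostname
    ctx.verify_mode = ssl.CERT_NONE  # don't validate the certificate chain
    return ctx

def fetch_insecure(url):
    # WARNING: disables the protection TLS certificates provide.
    with urllib.request.urlopen(url, context=make_unverified_context()) as resp:
        return resp.read()
```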
I am working on a science project that scrapes skyward.smsd.org. The site opens in a pop-up, but when I go to the URL shown at the top of the page outside the pop-up, it says my session has expired, and I can't find a way around that. I am also getting an invalid syntax error at the else: msg line. Any help with either issue would be appreciated.
while True:
    import requests
    from bs4 import BeautifulSoup
    import time
    from time import sleep

    url = "https://skyward.smsd.org/scripts/wsisa.dll/WService=wsEAplus/sfcalendar002.w"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")

    from requests.packages.urllib3 import add_stderr_logger
    add_stderr_logger()
    s = requests.Session()
    s.headers['User-Agent'] = 'Mozilla/5.0'
    login = {login: 3078774, password: (MY PASSWORD)}
    login_response = s.post(url, data=login)
    for r in login_response.history:
        if r.status_code == 401:  # 401 means authentication failed
            sys.exit(1)  # abort
    pdf_response = s.get(pdf_url)  # Your cookies and headers are automatically included
    if str(soup).find("skyward") == -1:
        continue
    time.sleep(60)
    else:
        msg = 'Subject: This is the script talking, check Skyward'
        # Possibility to make this tell you exactly what has changed
        # A text feature that goes out daily for missing assignments
        fromaddr = '3078774#smsd.org'
        toaddrs = ['3078774#smsd.org']
        print('From: ' + fromaddr)
        print('To: ' + str(toaddrs))
        print('Message: ' + msg)
        break
I am trying to download a PDF, but I get the following error: HTTP Error 403: Forbidden.
I am aware that the server is blocking the request for whatever reason, but I can't seem to find a solution.
import urllib.request
import urllib.parse
import requests

def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)

try:
    url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
    print('initialized')
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }
    print('HDR received')
    req = urllib.request.Request(url, headers=hdr)
    print('Header sent')
    resp = urllib.request.urlopen(req)
    print('Request sent')
    respData = resp.read()
    download_pdf(url)
    print('Complete')
except Exception as e:
    print(str(e))
You seem to have realised this already: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. urllib.request.urlretrieve() doesn't let you change the HTTP headers, but you can use urllib.request.URLopener.retrieve() instead:
import urllib.request
opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')
N.B. You are using Python 3, and these functions are now considered part of the "Legacy interface"; URLopener is deprecated, so you should not use them in new code.
That aside, you are going to a lot of trouble simply to fetch a URL. Your code imports requests but never uses it; you should, because it is much easier than urllib. This works for me:
import requests

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
I'm using a script to grab download links from an HTML page (sent to me via mail) and then download the files. The script has been working fine for about six months, but last week I started getting a 403 error.
From what I've read, the issue is that the site is blocking me because it thinks I'm a bot (can't deny that), but I'm not scraping the site's HTML code, just trying to download a file using requests.get. I only get this error from one specific site; others download fine.
I've tried setting headers={'User-Agent': 'Mozilla/5.0'}, but that didn't help.
here's the function that downloads the file:
def download_file(dl_url, local_save_path):
    """Download URL to given path"""
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
    auth_check = requests.get(dl_url, auth=(username.get(), password.get()), verify=False, headers={'User-Agent': user_agent})
    dnl_sum = 1024
    local_filename = dl_url.split('/')[-1]
    complete_name = os.path.join(local_save_path, local_filename)
    # Get file size
    r = requests.head(dl_url, auth=(username.get(), password.get()), verify=False, headers={'User-Agent': user_agent})
    try:
        dl_file_size = int(r.headers['content-length'])
        file_size.set(str(int(int(r.headers['content-length']) * (10 ** -6))) + "MB")
        c = 1
    except KeyError:
        c = 0
    # NOTE the stream=True parameter
    print('1')
    r = requests.get(dl_url, stream=True, auth=(username.get(), password.get()), verify=False, headers={'User-Agent': user_agent})
    print('2')
    while True:
        try:
            with open(complete_name, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024):
                    if chunk:  # filter out keep-alive new chunks
                        f.write(chunk)
                        f.flush()
                        if c == 1:
                            download_perc.set(percentage(dl_file_size, dnl_sum))
                        elif c == 0:
                            print(dnl_sum)
                        dnl_sum = os.path.getsize(complete_name)
        except FileNotFoundError:
            continue
        break
    return
Have you tried using a proxy?
You can use Tor; it gives you a dynamic IP address, so the website can't recognize you.
Try this: https://techoverflow.net/blog/2015/02/06/using-python-requests-over-tor/
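A sketch of what that looks like with requests. It assumes a local Tor client listening on its default SOCKS port 9050, and requests installed with SOCKS support (pip install requests[socks]):

```python
import requests

# Route both HTTP and HTTPS through the local Tor SOCKS proxy.
# socks5h (rather than socks5) makes DNS resolution go through Tor too,
# so the target site never sees your resolver.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def get_via_tor(url):
    """Fetch a URL with the request routed through Tor."""
    return requests.get(url, proxies=TOR_PROXIES, timeout=60)
```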
I am trying to make a script that gets similar images from Google using a URL, reusing a part of this code.
The problem is that I want to get to this link, because from it I can reach the images themselves by clicking the "Search by image" link; but when I use the script, I get the exact same page without the "Search by image" link.
I would like to know why, and whether there is a way to fix it.
Thanks a lot in advance!
P.S. Here's the code:
import os
from urllib2 import Request, urlopen
from cookielib import LWPCookieJar

USER_AGENT = r"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)"
LOCAL_PATH = r"C:\scripts\google_search"
COOKIE_JAR_FILE = r".google-cookie"

class google_search(object):
    def cleanup(self):
        if os.path.isfile(self.cookie_jar_path):
            os.remove(self.cookie_jar_path)
        os.chdir(LOCAL_PATH)
        for html in os.listdir("."):
            if html.endswith(".html"):
                os.remove(html)

    def __init__(self, cookie_jar_path):
        self.cookie_jar_path = cookie_jar_path
        self.cookie_jar = LWPCookieJar(self.cookie_jar_path)
        self.counter = 0
        self.cleanup()
        try:
            cookie.load()
        except Exception:
            pass

    def get_html(self, url):
        request = Request(url=url)
        request.add_header("User-Agent", USER_AGENT)
        self.cookie_jar.add_cookie_header(request)
        response = urlopen(request)
        self.cookie_jar.extract_cookies(response, request)
        html_response = response.read()
        response.close()
        self.cookie_jar.save()
        return html_response

def main():
    url_2 = r"http://www.google.com/search?hl=en&q=http%3A%2F%2Fi.imgur.com%2FqGRxTNA.jpg&btnG=Google+Search"
    search = google_search(os.path.join(LOCAL_PATH, COOKIE_JAR_FILE))
    html_2 = search.get_html(url_2)

if __name__ == '__main__':
    main()
I tried something of that sort a few weeks back. My server used to reject my requests with a 404 because I was not setting a proper user agent.
In your case, you are also not setting the user agent properly. Here is my User-Agent header:
USER_AGENT = r"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"
PS: I hope you have read Google's terms and conditions; you might be violating them.