I have some software that uses web scraping, but for some reason it doesn't work. It's bizarre: when I run it in Google Colab the code works fine and the URLs can be opened and scraped, but when I run it in my web application (from my console, using python3 run.py) it fails.
Here is the code that is returning errors:
b = searchgoogle(query, num)
c = []
print(b)
for i in b:
    extractor = extractors.ArticleExtractor()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/50.0.2661.102 Safari/537.36'
    }
    extractor = extractors.ArticleExtractor()
    req = Request(url=i, headers=headers)
    d = urlopen(req)
    try:
        if d.info()['content-type'].startswith('text/html'):
            print('its html')
            resp = requests.get(i, headers=headers)
            if resp.ok:
                doc = extractor.get_content(resp.text)
                c.append(comparetexts(text, doc, i))
            else:
                print(f'Failed to get URL: {resp.status_code}')
        else:
            print('its not html')
    except KeyError:
        print('its not html')
        print(i)
return c
The line raising the error is "d = urlopen(req)".
There is code above the section I just posted, but it has nothing to do with the errors. Anyway, thanks for your time!
(By the way, I checked the OpenSSL version in python3 and it says 'OpenSSL 1.1.1m 14 Dec 2021', so I think it's up to date.)
This most likely happens because the Python environment running your web application cannot verify the server's SSL certificate, so you can tell your script to skip SSL verification when making the request, as described here:
Python 3 urllib ignore SSL certificate verification
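For reference, here is a minimal sketch of what skipping verification looks like with urllib. The helper names unverified_context and fetch are made up for illustration and are not from the question. Keep in mind that turning verification off removes protection against man-in-the-middle attacks, so treat it as a diagnostic step; if it makes the error go away, pointing Python at a valid CA bundle (for example via ssl.create_default_context(cafile=...)) is usually the safer long-term fix.

```python
import ssl
import urllib.request


def unverified_context():
    """Build an SSL context that does NOT verify certificates (insecure)."""
    ctx = ssl.create_default_context()
    # check_hostname has to be switched off before verify_mode can be CERT_NONE
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url, headers=None):
    """Open url with certificate verification disabled and return the body bytes."""
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req, context=unverified_context()) as resp:
        return resp.read()
```

In the question's loop this would mean replacing "d = urlopen(req)" with "d = urlopen(req, context=unverified_context())".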
Related
I have this problem:
I'm trying to create a script in Python to download a web site and look for some info.
This is the code:
import urllib.request

url_archive_of_nethys = "http://www.aonprd.com/Default.aspx"

def getMainPage():
    fp = urllib.request.urlopen(url_archive_of_nethys)
    mybytes = fp.read()
    mystr = mybytes.decode("utf8")
    fp.close()
    print(mystr)

def main():
    getMainPage()

if __name__ == "__main__":
    main()
But when I run it, I get:
<HTTPError 999: 'No Hacking'>
I also tried the curl command:
curl http://www.aonprd.com/Default.aspx
and it downloaded the page correctly.
I'm developing with Visual Studio and Python 3.6.
Any suggestions will be appreciated.
Thank you.
They probably detect your User-Agent and filter you out.
Try changing it:
req = urllib.request.Request(
    url,
    data=None,
    headers={'User-Agent': ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
                            "AppleWebKit/537.36 (KHTML, like Gecko) "
                            "Chrome/35.0.1916.47 Safari/537.36")})
fp = urllib.request.urlopen(req)
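To see why curl gets through while the script does not, it can help to look at the User-Agent that urllib sends by default. The sketch below is only illustrative: default_user_agent and browser_request are hypothetical helper names, and it performs no network access.

```python
import urllib.request

BROWSER_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/35.0.1916.47 Safari/537.36")


def default_user_agent():
    """Return the User-Agent string urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)["User-agent"]


def browser_request(url, user_agent=BROWSER_UA):
    """Build a Request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    print(default_user_agent())  # something of the form "Python-urllib/3.x"
    req = browser_request("http://www.aonprd.com/Default.aspx")
    print(req.get_header("User-agent"))
```

Note that urllib normalises header names with str.capitalize(), which is why the lookup key is "User-agent" rather than "User-Agent".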
I have been trying to log in to a website using Python 3.6, but it has proven more difficult than I originally anticipated. So far this is my code:
import urllib.request
import urllib.parse
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"
url = "https://www.pinterest.co.uk/login/"
data = {
    "email": "my#email",
    "password": "my_password"}
data = urllib.parse.urlencode(data)
data = data.encode("utf-8")
request = urllib.request.Request(url, headers = headers, data = data)
response = urllib.request.urlopen(request)
responseurl = response.geturl()
print(responseurl)
This throws a 403 (Forbidden) error, and I'm not sure why, since I have supplied my email and password and even changed the user agent. Am I just missing something simple, like a cookie jar?
If possible, is there a way to do this without the requests module? This is a challenge I have been given to complete with only built-in modules (I am allowed to get help, so I'm not cheating).
Most sites use a CSRF token or other means to block exactly what you are attempting to do. One possible workaround is to use a browser automation framework such as Selenium and log in through the site's UI.
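If you want to stay with built-in modules, the general pattern is roughly: keep cookies in a cookie jar, GET the login page first, copy any hidden form fields (this is where a CSRF token usually lives) and POST them back together with your credentials. Below is a rough sketch under those assumptions. HiddenInputParser, hidden_fields, build_opener_with_cookies and login are names made up for this example, and the field name csrf_token is only a placeholder. Sites that log in through JavaScript (which may well include the one in the question) will not work this way.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def build_opener_with_cookies():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, credentials, headers):
    """GET the login page, then POST its hidden fields plus the credentials."""
    req = urllib.request.Request(login_url, headers=headers)
    with opener.open(req) as resp:
        page = resp.read().decode("utf-8", errors="replace")
    form = hidden_fields(page)   # e.g. {"csrf_token": "..."} if the site uses one
    form.update(credentials)
    data = urllib.parse.urlencode(form).encode("utf-8")
    post = urllib.request.Request(login_url, data=data, headers=headers)
    return opener.open(post)
```

The same opener has to be reused for every later request, otherwise the session cookie set during login is lost.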
I have read plenty of similar, resolved questions on here, but I still can't see why my request is not printing the page behind the login form. I am testing against a simple web page where I am registered. Providing the credentials in my payload and keeping the cookie with requests.Session() should let me open my second URL, but instead I get the login form printed. I checked with Wireshark and Burp Suite, and everything looks normal when I run the script; the traffic looks the same as when I log in to the web page in a browser.
Here is the code:
# -*- coding: utf-8 -*-
import requests

url = 'http://www.chicago-cz.com/forum/login.php'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'}
payload = {
    "username": "User_321",
    "password": "S33cr3t",
}

with requests.Session() as s:
    p = s.post(url, headers=headers, data=payload)
    # print(p.text)
    # URL behind login (Inbox)
    r = s.get('http://www.chicago-cz.com/forum/privmsg.php?folder=inbox')
    print(r.content)
I am trying to download a PDF, but I get the following error: HTTP Error 403: Forbidden
I am aware that the server is blocking the request for whatever reason, but I can't seem to find a solution.
import urllib.request
import urllib.parse
import requests

def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)

try:
    url = ('http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf')
    print('initialized')
    hdr = {}
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }
    print('HDR recieved')
    req = urllib.request.Request(url, headers=hdr)
    print('Header sent')
    resp = urllib.request.urlopen(req)
    print('Request sent')
    respData = resp.read()
    download_pdf(url)
    print('Complete')
except Exception as e:
    print(str(e))
You seem to have already realised this: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. urllib.request.urlretrieve() doesn't allow you to change the HTTP headers; however, you can use urllib.request.URLopener.retrieve():
import urllib.request
opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')
N.B. You are using Python 3 and these functions are now considered part of the "Legacy interface", and URLopener has been deprecated. For that reason you should not use them in new code.
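If you would rather avoid the legacy interface while staying within the standard library, one option is to build a Request with your own headers and stream the response to disk yourself. This is only a sketch: the download helper and its default User-Agent value are assumptions made for illustration, not part of urllib.

```python
import shutil
import urllib.request


def download(url, filename, user_agent="Mozilla/5.0"):
    """Fetch url with a custom User-Agent and write the body to filename."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Copy the response to disk in chunks instead of reading it all into memory
    with urllib.request.urlopen(req) as resp, open(filename, "wb") as out:
        shutil.copyfileobj(resp, out)
    return filename
```

For the question's case this would be called as download(url, 'Test.pdf').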
The above aside, you are going to a lot of trouble to simply access a URL. Your code imports requests, but you don't use it - you should though because it is much easier than urllib. This works for me:
import requests
url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
I am trying to make a script that gets similar images from Google using a URL, based on part of this code.
The problem is that I want to get to this link, because from it I can reach the images themselves by clicking on the "search by image" link. But when I use the script, I get the exact same page without the "search by image" link.
I would like to know why this happens and whether there is a way to fix it.
Thanks a lot in advance!
P.S. Here's the code:
import os
from urllib2 import Request, urlopen
from cookielib import LWPCookieJar

USER_AGENT = r"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)"
LOCAL_PATH = r"C:\scripts\google_search"
COOKIE_JAR_FILE = r".google-cookie"

class google_search(object):
    def cleanup(self):
        if os.path.isfile(self.cookie_jar_path):
            os.remove(self.cookie_jar_path)
        os.chdir(LOCAL_PATH)
        for html in os.listdir("."):
            if html.endswith(".html"):
                os.remove(html)

    def __init__(self, cookie_jar_path):
        self.cookie_jar_path = cookie_jar_path
        self.cookie_jar = LWPCookieJar(self.cookie_jar_path)
        self.counter = 0
        self.cleanup()
        try:
            # was "cookie.load()", which referenced an undefined name
            self.cookie_jar.load()
        except Exception:
            pass

    def get_html(self, url):
        request = Request(url = url)
        request.add_header("User-Agent", USER_AGENT)
        self.cookie_jar.add_cookie_header(request)
        response = urlopen(request)
        self.cookie_jar.extract_cookies(response, request)
        html_response = response.read()
        response.close()
        self.cookie_jar.save()
        return html_response

def main():
    url_2 = r"http://www.google.com/search?hl=en&q=http%3A%2F%2Fi.imgur.com%2FqGRxTNA.jpg&btnG=Google+Search"
    search = google_search(os.path.join(LOCAL_PATH, COOKIE_JAR_FILE))
    html_2 = search.get_html(url_2)

if __name__ == '__main__':
    main()
I tried something of that sort a few weeks back. My server used to reject my requests with a 404 because I was not setting a proper user agent.
In your case, you are not setting the user agent properly. Here is the User-Agent header I use:
USER_AGENT = r"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"
PS: I hope you have read Google's terms and conditions. You might be violating them.
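For anyone on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, a roughly equivalent setup might look like the sketch below. make_opener and get_html are illustrative names, and whether a given User-Agent actually changes the page a site serves is something you would have to check yourself.

```python
import http.cookiejar
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36")


def make_opener(cookie_jar_path, user_agent=USER_AGENT):
    """Return an opener that sends user_agent and keeps cookies in an LWP jar."""
    jar = http.cookiejar.LWPCookieJar(cookie_jar_path)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replaces the default "Python-urllib/3.x" header for every request
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def get_html(opener, jar, url):
    """Fetch url through the opener and persist any cookies that were set."""
    with opener.open(url) as response:
        html = response.read()
    jar.save(ignore_discard=True)  # also keep session cookies on disk
    return html
```

The HTTPCookieProcessor takes over the add_cookie_header / extract_cookies calls that the original get_html made by hand.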