Python code to request the URL:
import requests

# Browser-like User-Agent, added to try to get around the blocking
agent = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
# Make the request to the link
response = requests.get('https://www.naukri.com/jobs-in-andhra-pradesh', headers=agent)
Output when printing the HTML:
<!DOCTYPE html>
<html>
<head>
<title>Naukri reCAPTCHA</title> # the title returned instead of the actual title of the URL I requested
<meta name="robots" content="noindex, nofollow">
<link rel="stylesheet" href="https://static.naukimg.com/s/4/101/c/common_v62.min.css" />
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
</head>
</html>
Using Google Cache along with a referer (in the header) will help you bypass the captcha.
Things to note:
Don't send more than 2 requests per second, or you may get blocked.
The result you receive is a cached copy, so this will not work if you are trying to scrape real-time data.
Example:
header = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" ,
'referer':'https://www.google.com/'
}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh",headers=header)
This gives:
>>> r.content
[Squeezed 2554 lines]
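The note above about staying under 2 requests per second can be sketched with a small throttle. This is only an illustration: the Throttle class and its parameter names are hypothetical helpers, not part of requests, and the 2 requests/sec figure is simply the limit suggested in this answer.

```python
import time


class Throttle:
    """Make successive calls to wait() at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous call: pause for the rest of the interval
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(max_per_second=2)
```

Calling throttle.wait() immediately before each requests.get(...) in a loop would then keep the request rate at or below the chosen limit.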
I have a script that used to work with urllib and now has to use requests. I have a URL that I use to put data into a database. The URL is
http://www.example.com/insert.php?network=testnet&id=1245100&c=2800203&lat=7555344
This URL worked through urllib (urlopen), but I get 403 Forbidden when doing it through requests.get:
HEADER = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36' }
headers = requests.utils.default_headers()
headers.update = ( HEADER,)
payload={'network':'testnet','id':'1245300','c':'2803824', 'lat':'7555457'}
response = requests.get("http://www.example.com/insert.php", headers=headers, params=payload)
print(f"Remote commit: {response.text}")
print(response.url)
The URL works in a browser and returns a simple JSON "ok" response.
The script produces:
Remote commit: <html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
http://www.example.com/insert.php?network=testnet&id=1245300&c=2803824&lat=7555457
Not sure what I am doing wrong.
Edit: changed https to http.
A 403 Forbidden response is sometimes related to an SSL/TLS certificate verification failure. Try calling requests.get with verify=False, as follows:
Fixing the SSL certificate issue
requests.get("https://www.example.com/insert.php?network=testnet&id=1245300&c=2803824&lat=7555457", verify=False)
Fixing the TLS certificate issue
Check out my answer related to the TLS certificate verification fix.
Somehow I had overcomplicated it. When I tried the absolute minimum, it worked:
import requests
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36' }
response = requests.get("http://www.example.com/insert.php?network=testnet&id=1245200&c=2803824&lat=7555457", headers=headers)
print(response.text)
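One likely reason the longer version failed is the line headers.update = ( HEADER,). It assigns a tuple to an attribute named update instead of calling the update method, so the browser User-Agent is never merged into the headers. The sketch below shows the difference. It uses collections.UserDict as a stand-in for the headers object (an assumption, so that it runs without requests or network access), and the header values are placeholders.

```python
from collections import UserDict

# Placeholder values; UserDict stands in for the object returned by
# requests.utils.default_headers() so that this runs offline.
browser = {"User-Agent": "Mozilla/5.0"}

broken = UserDict({"User-Agent": "python-requests/2.x"})
broken.update = (browser,)       # rebinds the attribute to a tuple; nothing is merged
print(broken["User-Agent"])      # python-requests/2.x

fixed = UserDict({"User-Agent": "python-requests/2.x"})
fixed.update(browser)            # calls the method, so the dict is merged in
print(fixed["User-Agent"])       # Mozilla/5.0
```

In the original script, the corresponding fix would be headers.update(HEADER), or simply passing the dict directly as in the minimal version above.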
How can I bypass a 503 with BS4?
Selenium takes a long time to run, so I would rather not use it.
Site to request
Changing the User-Agent did not help.
There is no loop in the code; the error occurs on the very first request.
header = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-language': 'en-US,en;q=0.8',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
I simply ran
import requests
r = requests.get("https://mangalib.me/")
and got a 503 too. In r.text I found this:
<noscript>
<h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
</noscript>
and later
<div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
<p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
</div>
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
<p data-translate="allow_5_secs" id="cf-spinner-allow-5-secs" >Please allow up to 5 seconds…</p>
So I suspect you need a tool with JavaScript support if you want to access this page.
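To tell this kind of JavaScript challenge page apart from an ordinary server error in a script, one option is to look for the markers quoted above. This is a rough sketch only: the function name and the marker list are my own assumptions taken from this one response, and the page markup may differ or change.

```python
# Marker strings copied from the response body quoted above; this list is an
# illustrative assumption, not an official or complete set.
CHALLENGE_MARKERS = (
    'data-translate="turn_on_js"',
    'id="cf-spinner-allow-5-secs"',
    'data-translate="process_is_automatic"',
)


def looks_like_js_challenge(status_code, body):
    """Return True if a 503 response body contains any of the known markers."""
    if status_code != 503:
        return False
    return any(marker in body for marker in CHALLENGE_MARKERS)
```

With the request above, it would be called as looks_like_js_challenge(r.status_code, r.text).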
I am trying to get the HTML of the Supreme main page in order to parse it.
Here is what I am trying:
import requests
from bs4 import BeautifulSoup
all_page = requests.get('https://www.supremenewyork.com/index', headers = {
'Upgrade-Insecure-Requests': '1',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}).text
all_page_html = BeautifulSoup(all_page,'html.parser')
print(all_page_html)
But instead of the HTML I expected, I get this response:
<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/><title>Supreme</title><meta content="Supreme. The official website of Supreme. EST 1994. NYC." name="description"/><meta content="telephone=no" name="format-detection"/><meta content="on" http-equiv="cleartype"/><meta content="notranslate" name="google"/><meta content="app-id=664573705" name="apple-itunes-app"/><link href="//www.google-analytics.com" rel="dns-prefetch"/><link href="//ssl.google-analytics.com" rel="dns-prefetch"/><link href="//d2flb1n945r21v.cloudfront.net" rel="dns-prefetch"/><script src="https://www.google.com/recaptcha/api.js">async defer</script><meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, user-scalable=no" id="viewport" name="viewport"/><link href="//d17ol771963kd3.cloudfront.net/assets/application-2000eb9ad53eb6df5a7d0fd8c85c0c03.css" media="all" rel="stylesheet"/><script \
etc.
Is this some kind of block, or am I missing something? I even added request headers, but I still get this type of response instead of a normal one.
Well, that's actually how the page is. The response says it's an HTML page with some CSS and JavaScript running. You should use "Inspect Element" to find the elements you want to grab, and maybe note the class they are stored in so you can find them more easily.
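As a rough illustration of finding elements by class once you have noted the class name in "Inspect Element", here is a sketch using only the standard library's html.parser. The class name product-name and the HTML snippet are made-up placeholders, not taken from the real page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect (tag, attributes) for every start tag whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names, or be absent
        classes = (attr_map.get("class") or "").split()
        if self.wanted in classes:
            self.matches.append((tag, attr_map))


# Placeholder markup and class name, for illustration only
finder = ClassFinder("product-name")
finder.feed('<div class="item"><a class="product-name bold" href="/x">Box Logo</a></div>')
print(finder.matches)
```

With BeautifulSoup, which the question already uses, the equivalent lookup would be all_page_html.find_all(class_="product-name").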
I would like to access and scrape the data from this link, where:
new_url='https://www.scopus.com/results/results.uri?sort=plf-f&src=s&imp=t&sid=2c816e0ea43cf176a59117097216e6d4&sot=b&sdt=b&sl=160&s=%28TITLE-ABS-KEY%28EEG%29AND+TITLE-ABS-KEY%28%22deep+learning%22%29+AND+DOCTYPE%28ar%29%29+AND+ORIG-LOAD-DATE+AFT+1591735287+AND+ORIG-LOAD-DATE+BEF+1592340145++AND+PUBYEAR+AFT+2018&origin=CompleteResultsEmailAlert&dgcid=raven_sc_search_en_us_email&txGid=cc4809850a0eff92f629c95380f9f883'
Accessing new_url via the following line
req = Request(url, headers={'User-Agent': 'Mozilla/5.9'})
produced the error
Webscraping: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop
so a new set of lines was drafted:
req = urllib.request.Request(new_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()
page_soup = soup(raw, 'html.parser')
print(page_soup.prettify())
No error is thrown, but
print(page_soup.prettify())
outputs text in an unrecognized format:
6�>�.�t1k�e�LH�.��]WO�?m�^#�
څ��#�h[>��!�H8����|����n(XbU<~�k�"���#g+�4�Ǻ�Xv�7�UȢB2�
�7�F8�XA��W\�ɚ��^8w��38�#'
SH�<_0�B���oy�5Bނ)E���GPq:�ќU�c���ab�h�$<ra�
;o�Q�a#ð�d\�&J3Τa�����:�I�etf�a���h�$(M�~���ua�$�
n�&9u%ҵ*b���w�j�V��P�D�'z[��������)
with a warning
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
I suspected this could be resolved by decoding it as UTF-8, as below:
req = urllib.request.Request(new_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()
with open(raw, 'r', encoding='utf-8') as f:
    page_soup = soup(f, 'html.parser')
print(page_soup.prettify())
However, the interpreter returns an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
May I know what the problem is? I would appreciate any insight.
Try using the requests library
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"}
with requests.Session() as s:
    r = s.get(new_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    print(soup.get_text())
You can still use cookies here.
Edit: Updated the code to show the use of headers. This tells the website you are a browser instead of a program. For further login operations, though, I would suggest using Selenium instead of requests.
If you want to use the urllib library, remove Accept-Encoding from the headers, since urllib does not decompress gzip-encoded responses for you (and, for simplicity, restrict Accept-Charset to just utf-8):
req = urllib.request.Request(new_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'utf-8;q=0.7,*;q=0.3','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
The result is:
<!DOCTYPE html>
<!-- Form Name: START -->
<html lang="en">
<!-- Template_Component_Name: id.start.vm -->
<head>
<meta charset="utf-8"/>
...etc.
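The byte 0x8b at position 1 in the UnicodeDecodeError is the second byte of the gzip magic number (0x1f 0x8b), which is consistent with the server honoring the Accept-Encoding: gzip header and sending a compressed body. If you would rather keep that header, a minimal sketch for handling it yourself follows. The decode_body helper is a hypothetical name, it only covers gzip (not the deflate or sdch encodings the original header also advertised), and the compressed bytes are simulated so the example runs offline.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream


def decode_body(raw, charset="utf-8"):
    """Decompress `raw` if it starts with the gzip magic number, then decode it to text."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")


# Simulated compressed response body (placeholder HTML), so no network is needed
compressed = gzip.compress('<!DOCTYPE html><html lang="en"></html>'.encode("utf-8"))
print(compressed[1] == 0x8B)
print(decode_body(compressed))
```

With the names from the question, this would be used as page_soup = soup(decode_body(raw), 'html.parser').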
I am trying to scrape an HTTP website using proxies, and when I try to extract the text, it shows "Web Page Blocked". How can I avoid this error?
My code is as follows:
import requests
from bs4 import BeautifulSoup

url = "http://campanulaceae.myspecies.info/"
proxy_dict = {
'http' : "174.138.54.49:8080",
'https' : "174.138.54.49:8080"
}
page = requests.get(url, proxies=proxy_dict)
soup = BeautifulSoup(page.text,'html.parser')
print(soup)
I get the output below when I try to print the text from the website.
<html>
<head>
<title>Web Page Blocked</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="NO-CACHE" http-equiv="PRAGMA"/>
<meta content="initial-scale=1.0" name="viewport"/>
........
<body bgcolor="#e7e8e9">
<div id="content">
<h1>Web Page Blocked</h1>
<p>Access to the web page you were trying to visit has been blocked in accordance with company policy. Please contact your system administrator if you believe this is in error.</p>
This is probably because you did not specify a User-Agent in the request headers.
Quite often, sites block requests that appear to come from robot-like sources.
Try it like this:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
page = requests.get(url, headers=headers, proxies=proxy_dict)