How can I bypass a 503 with BS4?
Selenium takes too long, so I would rather not use it.
The site I'm requesting is https://mangalib.me/.
Changing the user-agent did not help.
There is no loop in the code; the error comes back on the very first request.
header = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
I tried just
import requests

r = requests.get("https://mangalib.me/")
and got a 503 too. In r.text I found
<noscript>
<h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
</noscript>
and later
<div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
<p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
</div>
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
<p data-translate="allow_5_secs" id="cf-spinner-allow-5-secs" >Please allow up to 5 seconds…</p>
So I suspect you need a tool with JavaScript support if you want to access this page.
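For example, here is a minimal sketch using Playwright (my choice, not something the site requires; any JS-capable tool works, and Cloudflare may still challenge headless browsers):

# Requires: pip install playwright, then: playwright install chromium
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Let the Cloudflare JavaScript challenge run, then grab the rendered HTML
    page.goto("https://mangalib.me/", wait_until="networkidle")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(soup.title)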
I'm trying to get the page source for an Instagram post using the code below. Funnily enough, it worked a few times, but then it said I wasn't logged in (which changes the entire source code). Is there any way to get the source code you would see while logged in, without using automation tools like Selenium? That would be pretty slow.
from urllib.request import urlopen, Request

def getSource(rawLink):
    # Send a browser-like User-Agent so the request looks less like a bot
    req = Request(
        rawLink,
        data=None,
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
        }
    )
    with urlopen(req) as response:
        source = response.read().decode('utf-8')
    return source
link = "https://www.instagram.com/p/COF47v4HoC9/"
source = getSource(link)
print(source[0:100])
As you can see, the output line <html lang="en" class="no-js not-logged-in client-root"> indicates that I'm not logged in.
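One common workaround (an assumption on my part, not something Instagram documents) is to reuse the session cookie from a browser where you are already logged in and send it with the request:

# Sketch: pass the sessionid cookie copied from a logged-in browser session.
# "YOUR_SESSION_ID" is a placeholder; Instagram may still vary the markup.
from urllib.request import urlopen, Request

def getSourceLoggedIn(rawLink, session_id):
    req = Request(
        rawLink,
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
            "Cookie": "sessionid=" + session_id,  # copied from the browser's dev tools
        }
    )
    with urlopen(req) as response:
        return response.read().decode('utf-8')

source = getSourceLoggedIn("https://www.instagram.com/p/COF47v4HoC9/", "YOUR_SESSION_ID")
print(source[0:100])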
Python code to request the URL:
import requests

# Using a browser user-agent to work around the blocking issue
agent = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
# Making the request to the link
response = requests.get('https://www.naukri.com/jobs-in-andhra-pradesh', headers=agent)
Output when printing the HTML:
<!DOCTYPE html>
<html>
<head>
<title>Naukri reCAPTCHA</title> <!-- instead of the actual title of the page I requested -->
<meta name="robots" content="noindex, nofollow">
<link rel="stylesheet" href="https://static.naukimg.com/s/4/101/c/common_v62.min.css" />
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
</head>
</html>
Using Google Cache along with a referer (in the header) will help you bypass the captcha.
Things to note:
Don't send more than 2 requests/sec, or you may get blocked.
The result you receive is a cached copy, so this will not work well if you are trying to scrape real-time data.
Example:
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/"
}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)
This gives:
>>> r.content
[Squeezed 2554 lines]
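If you need several cached pages, a short delay between requests keeps you under that 2 requests/sec limit. A minimal sketch (the second URL is a made-up placeholder):

import time
import requests

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/"
}

pages = [
    "www.naukri.com/jobs-in-andhra-pradesh",
    "www.naukri.com/jobs-in-telangana",  # placeholder second page
]

for p in pages:
    r = requests.get("http://webcache.googleusercontent.com/search?q=cache:" + p, headers=header)
    print(r.status_code, len(r.content))
    time.sleep(0.5)  # at most 2 requests per second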
I am trying to scrape an HTTP website using proxies, and when I try to extract the text it shows "Web Page Blocked". How can I avoid this error?
My code is as follows:
import requests
from bs4 import BeautifulSoup

url = "http://campanulaceae.myspecies.info/"
proxy_dict = {
    'http': "174.138.54.49:8080",
    'https': "174.138.54.49:8080"
}
page = requests.get(url, proxies=proxy_dict)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
I get the output below when I try to print the text from the website.
<html>
<head>
<title>Web Page Blocked</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="NO-CACHE" http-equiv="PRAGMA"/>
<meta content="initial-scale=1.0" name="viewport"/>
........
<body bgcolor="#e7e8e9">
<div id="content">
<h1>Web Page Blocked</h1>
<p>Access to the web page you were trying to visit has been blocked in accordance with company policy. Please contact your system administrator if you believe this is in error.</p>
That's because you did not specify a user-agent in the request headers.
Quite often, sites block requests that come from robot-like sources.
Try it like this:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
page = requests.get(url, headers=headers, proxies=proxy_dict)
I want to send a POST request with form data to this URL: "https://www.createsend.com/t/securedsubscribe?token=" + token. The token changes every time, but I figured out a way to retrieve it. Still, I can't access the site.
When I check in the Chrome console, these headers are used:
authority:www.createsend.com
method:POST
path:/t/securedsubscribe?token=7B9BCC9AE0CD58E2170E07A7D79E679426EBC1D02FA06CA791557CA7ACC1155F5A0DDA83D987E06C
scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:fr
cache-control:max-age=0
content-length:174
content-type:application/x-www-form-urlencoded
cookie:__utma=38149500.1167947680.1522060606.1522060606.1522060606.1; __utmc=38149500; __utmz=38149500.1522060606.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmt=1; _ga=GA1.2.1167947680.1522060606; _gid=GA1.2.485326558.1522060607; ajs_group_id=null; __qca=P0-314313712-1522060607057; mp_mixpanel__c=0; ajs_user_id=%22C5EC08CADFFC107B-B6DC4E4B6840339E%22; ajs_anonymous_id=%222e3961d1-be64-4ecf-8481-c8a47295b130%22; __utmv=38149500.|1=user-type=user=1; __utmb=38149500.2.10.1522060606; mp_1c1eda798f92601aecaa904fe7b3520a_mixpanel=%7B%22distinct_id%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%2C%22mp_lib%22%3A%20%22Segment%3A%20web%22%2C%22%24search_engine%22%3A%20%22google%22%2C%22%24initial_referrer%22%3A%20%22https%3A%2F%2Fwww.google.be%2F%22%2C%22%24initial_referring_domain%22%3A%20%22www.google.be%22%2C%22mp_name_tag%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%2C%22id%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%7D; _uetsid=_uetcde93ba3; optimizelyEndUserId=oeu1522060693722r0.545521542891791; optimizelyBuckets=%7B%7D; optimizelySegments=%7B%22341521689%22%3A%22direct%22%2C%22341576276%22%3A%22gc%22%2C%22341588087%22%3A%22false%22%2C%222833930025%22%3A%22none%22%2C%225027931715%22%3A%22true%22%2C%225195510267%22%3A%22true%22%7D; intercom-lou-je5td1qt=1; intercom-session-je5td1qt=WmxjSDdDVlRYYTZUc1Z3bmlENTNhOXFJUzhvK2piZElBakFjbDI5dkpwS0hvWEFLVmMweWNHNER0Ujh3QTFKUy0tdGt0aitQQTNuQWoxZGhXMUllakhMQT09--ec3c7db367f57a30eb5b5818de90e43dc8cc39a6; __ssid=089a6e70-58c2-41bb-98d2-90ce7ada1d73
origin:http://tres-bien.com
referer:http://tres-bien.com/odehhasoidj
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36
This webpage uses the HTTP/2 protocol, so I use the hyper module to get through:
import requests
from hyper.contrib import HTTP20Adapter

session = requests.Session()
session.mount("https://www.createsend.com", HTTP20Adapter())
r = session.post(url, data=payload2, headers=header2)
print(r.text)
But I still get this error.
<HTML><HEAD><TITLE>Bad Request</TITLE>
<META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
<BODY><h2>Bad Request - Invalid Header</h2>
<hr><p>HTTP Error 400. The request has an invalid header name.</p>
</BODY></HTML>
The website form I'm submitting is at http://tres-bien.com/odehhasoidj, where you can check which POST requests are made.
This is my code:
header2 = {
    #'Host': 'www.createsend.com',
    'authority': 'www.createsend.com',
    'method': 'POST',
    'path': '/t/securedsubscribe?token=' + token,
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'fr',
    'cache-control': 'max-age=0',
    #'content-length': '182',
    'content-type': 'application/x-www-form-urlencoded',
    'origin': 'http://tres-bien.com',
    'referer': 'http://tres-bien.com/odehhasoidj',
    'upgrade-insecure-requests': '1',
    #'cookie': '__utma=38149500.1167947680.1522060606.1522060606.1522060606.1; __utmc=38149500; __utmz=38149500.1522060606.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmt=1; _ga=GA1.2.1167947680.1522060606; _gid=GA1.2.485326558.1522060607; ajs_group_id=null; __qca=P0-314313712-1522060607057; mp_mixpanel__c=0; ajs_user_id=%22C5EC08CADFFC107B-B6DC4E4B6840339E%22; ajs_anonymous_id=%222e3961d1-be64-4ecf-8481-c8a47295b130%22; __utmv=38149500.|1=user-type=user=1; __utmb=38149500.2.10.1522060606; mp_1c1eda798f92601aecaa904fe7b3520a_mixpanel=%7B%22distinct_id%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%2C%22mp_lib%22%3A%20%22Segment%3A%20web%22%2C%22%24search_engine%22%3A%20%22google%22%2C%22%24initial_referrer%22%3A%20%22https%3A%2F%2Fwww.google.be%2F%22%2C%22%24initial_referring_domain%22%3A%20%22www.google.be%22%2C%22mp_name_tag%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%2C%22id%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%7D; _uetsid=_uetcde93ba3; optimizelyEndUserId=oeu1522060693722r0.545521542891791; optimizelyBuckets=%7B%7D; optimizelySegments=%7B%22341521689%22%3A%22direct%22%2C%22341576276%22%3A%22gc%22%2C%22341588087%22%3A%22false%22%2C%222833930025%22%3A%22none%22%2C%225027931715%22%3A%22true%22%2C%225195510267%22%3A%22true%22%7D; intercom-lou-je5td1qt=1; intercom-session-je5td1qt=WmxjSDdDVlRYYTZUc1Z3bmlENTNhOXFJUzhvK2piZElBakFjbDI5dkpwS0hvWEFLVmMweWNHNER0Ujh3QTFKUy0tdGt0aitQQTNuQWoxZGhXMUllakhMQT09--ec3c7db367f57a30eb5b5818de90e43dc8cc39a6; __ssid=089a6e70-58c2-41bb-98d2-90ce7ada1d73',
    #'Authorization': token,
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36'
}
payload2 = {
    'cm-name': 'Namz Foll',
    'cm-zlutdu-zlutdu': email,
    'cm-f-qkytli': 'Adress Bilning',
    'cm-f-qkytld': '1030',
    'cm-f-qkytlh': 'Ciky',
    'cm-fo-qkytlk ': '3324398',
    'cm-f-qkytlu': '0412345408',
    'cm-fo-qkytry': '3324639'
}
url = "https://www.createsend.com/t/securedsubscribe?token=" + token
There is not enough information to tell, but authority, scheme, etc. are special headers that must be prefixed by a colon, as in :authority, :scheme, etc.
Please see "Pseudo Header Fields" here: https://www.rfc-editor.org/rfc/rfc7540#section-8.1.2.
These headers belong to HTTP/2, and you can't use them with the requests library, but you can use another library, hyper:
from hyper import HTTPConnection

# hyper speaks HTTP/2 natively, so the pseudo-headers are set for you
conn = HTTPConnection('http2bin.org:443')
conn.request('GET', '/get')
resp = conn.get_response()
print(resp.read())
But learning a new library can take time, and as we know, requests is much simpler; to see how you can use these headers with it, this link explains it very well.
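In fact, if you stay with requests, the usual fix is simply to leave the pseudo-header entries out of the dict, since requests derives the method, path, and host from the URL and the call itself. A minimal sketch (the token retrieval and the cm-* form fields are as in the question):

import requests

token = "..."    # retrieved the same way as in the question
payload2 = {}    # the cm-* form fields from the question

# Same headers as before, minus authority/method/path/scheme,
# which requests generates from the URL and the .post() call
header2 = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'fr',
    'content-type': 'application/x-www-form-urlencoded',
    'origin': 'http://tres-bien.com',
    'referer': 'http://tres-bien.com/odehhasoidj',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36'
}

url = "https://www.createsend.com/t/securedsubscribe?token=" + token
r = requests.post(url, data=payload2, headers=header2)
print(r.status_code, r.text[:200])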