Error when requesting page with requests.get in Python

I am trying to get the HTML of the Supreme main page in order to parse it.
Here is what I am trying:
import requests
from bs4 import BeautifulSoup

all_page = requests.get('https://www.supremenewyork.com/index', headers={
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}).text
all_page_html = BeautifulSoup(all_page, 'html.parser')
print(all_page_html)
But instead of the HTML I expected, I get this response:
<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/><title>Supreme</title><meta content="Supreme. The official website of Supreme. EST 1994. NYC." name="description"/><meta content="telephone=no" name="format-detection"/><meta content="on" http-equiv="cleartype"/><meta content="notranslate" name="google"/><meta content="app-id=664573705" name="apple-itunes-app"/><link href="//www.google-analytics.com" rel="dns-prefetch"/><link href="//ssl.google-analytics.com" rel="dns-prefetch"/><link href="//d2flb1n945r21v.cloudfront.net" rel="dns-prefetch"/><script src="https://www.google.com/recaptcha/api.js">async defer</script><meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, user-scalable=no" id="viewport" name="viewport"/><link href="//d17ol771963kd3.cloudfront.net/assets/application-2000eb9ad53eb6df5a7d0fd8c85c0c03.css" media="all" rel="stylesheet"/><script \
etc.
Is this some kind of block, or am I missing something? I even added request headers, but I still get this type of response instead of a normal one.

Well, that's actually how the page is. It is an HTML page with some CSS and JavaScript running on top of it. You should use "Inspect Element" to search for the elements you want to grab, and note down the class they are stored in so you can find them more easily.
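For example, once you've found the class with Inspect Element, you can pull those elements out with BeautifulSoup. A minimal sketch (the class name 'article-item' is a made-up placeholder, not Supreme's actual markup):
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.supremenewyork.com/index', headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}).text
soup = BeautifulSoup(html, 'html.parser')
# 'article-item' is a hypothetical class name; substitute the one you
# found with Inspect Element.
for element in soup.find_all('div', class_='article-item'):
    print(element.get_text(strip=True))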

Related

How to bypass 503 BS4 python

How can I bypass a 503 with BS4?
Selenium takes a long time to run, so I would rather not use it.
The site I am requesting is https://mangalib.me/.
Changing the user-agent did not help.
There is no loop in the code; the error occurs on the very first request.
header = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
I tried just:
import requests
r = requests.get("https://mangalib.me/")
and got a 503 too. In r.text I found:
<noscript>
<h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
</noscript>
and later
<div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
<p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
</div>
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
<p data-translate="allow_5_secs" id="cf-spinner-allow-5-secs" >Please allow up to 5 seconds…</p>
So I suspect you need a tool with JavaScript support if you want to access this page.
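One such tool is Playwright, which drives a real browser so the page's JavaScript challenge can actually run. A minimal sketch, assuming you have run pip install playwright and playwright install (Playwright is one option, not something the original answer prescribes):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Load the page in a real browser so its JavaScript can execute.
    page.goto("https://mangalib.me/")
    html = page.content()  # the rendered HTML after JS has run
    browser.close()
print(html[:200])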

Logging in to Instagram without Selenium

I'm trying to get the page source for an Instagram post using the code below. Funnily enough, it worked a few times, but then it said that I wasn't logged in (which changes the entire source code). Is there any way I can get the source code you would see while logged in, without using automation tools like Selenium, since that would be pretty slow?
from urllib.request import urlopen, Request

def getSource(rawLink):
    req = Request(
        rawLink,
        data=None,
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
        }
    )
    with urlopen(req) as response:
        source = response.read().decode('utf-8')
    return source

link = "https://www.instagram.com/p/COF47v4HoC9/"
source = getSource(link)
print(source[0:100])
As you can see, the output line <html lang="en" class="no-js not-logged-in client-root"> indicates that I'm not logged in.
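One approach that avoids Selenium (a sketch, not from the original post): reuse the session cookie from a browser where you are already logged in. Instagram keeps the login in a cookie named sessionid; the value below is a placeholder you would copy from your browser's developer tools.
import requests

# Placeholder: paste the real "sessionid" value from a browser where
# you are already logged in to Instagram.
cookies = {"sessionid": "PASTE_YOUR_SESSION_COOKIE_HERE"}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
}
r = requests.get("https://www.instagram.com/p/COF47v4HoC9/", headers=headers, cookies=cookies)
print(r.text[0:100])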

How to Bypass Google Recaptcha while scraping with Requests

Python code to request the URL:
import requests

# using a user-agent header to work around the blocking issue
agent = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
# making the request to the link
response = requests.get('https://www.naukri.com/jobs-in-andhra-pradesh', headers=agent)
Output when printing the HTML:
<!DOCTYPE html>
<html>
<head>
<title>Naukri reCAPTCHA</title> <!-- not the actual title of the page I requested -->
<meta name="robots" content="noindex, nofollow">
<link rel="stylesheet" href="https://static.naukimg.com/s/4/101/c/common_v62.min.css" />
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
</head>
</html>
Using Google Cache along with a referer (in the header) will help you bypass the captcha.
Things to note:
Don't send more than 2 requests/sec, or you may get blocked.
The result you receive is a cached copy, so this will not work if you are trying to scrape real-time data.
Example:
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    'referer': 'https://www.google.com/'
}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)
This gives:
>>> r.content
[2554 lines of output, truncated]
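To stay under the 2 requests/sec limit mentioned above when fetching several cached pages, a simple throttle is enough. A minimal sketch (the list of URLs is hypothetical):
import time
import requests

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    'referer': 'https://www.google.com/'
}
# Hypothetical list of cached pages to fetch.
urls = [
    "http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh",
]
for u in urls:
    r = requests.get(u, headers=header)
    print(u, r.status_code)
    time.sleep(0.5)  # keeps the rate at or below 2 requests per second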

Web scraping over HTTP shows "Web Page Blocked"

I am trying to scrape an HTTP website using proxies, and when I try to extract text it shows "Web Page Blocked". How can I avoid this error?
My code is as follows:
import requests
from bs4 import BeautifulSoup

url = "http://campanulaceae.myspecies.info/"
proxy_dict = {
    'http': "174.138.54.49:8080",
    'https': "174.138.54.49:8080"
}
page = requests.get(url, proxies=proxy_dict)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
I get the output below when I try to print text from the website.
<html>
<head>
<title>Web Page Blocked</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="NO-CACHE" http-equiv="PRAGMA"/>
<meta content="initial-scale=1.0" name="viewport"/>
........
<body bgcolor="#e7e8e9">
<div id="content">
<h1>Web Page Blocked</h1>
<p>Access to the web page you were trying to visit has been blocked in accordance with company policy. Please contact your system administrator if you believe this is in error.</p>
That's because you did not specify a user-agent in the request headers.
Quite often, sites block requests that come from robot-like sources.
Try it like this:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
page = requests.get(url, headers=headers, proxies=proxy_dict)

Python Requests POST DATA Error 400 Header name invalid

I want to send a POST request to this URL: "https://www.createsend.com/t/securedsubscribe?token=" + token. The token changes every time, but I have figured out a way to retrieve it. Still, I can't access the site.
When I check in the Chrome console, these headers are used:
authority:www.createsend.com
method:POST
path:/t/securedsubscribe?token=7B9BCC9AE0CD58E2170E07A7D79E679426EBC1D02FA06CA791557CA7ACC1155F5A0DDA83D987E06C
scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:fr
cache-control:max-age=0
content-length:174
content-type:application/x-www-form-urlencoded
cookie:__utma=38149500.1167947680.1522060606.1522060606.1522060606.1; __utmc=38149500; __utmz=38149500.1522060606.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmt=1; _ga=GA1.2.1167947680.1522060606; _gid=GA1.2.485326558.1522060607; ajs_group_id=null; __qca=P0-314313712-1522060607057; mp_mixpanel__c=0; ajs_user_id=%22C5EC08CADFFC107B-B6DC4E4B6840339E%22; ajs_anonymous_id=%222e3961d1-be64-4ecf-8481-c8a47295b130%22; __utmv=38149500.|1=user-type=user=1; __utmb=38149500.2.10.1522060606; mp_1c1eda798f92601aecaa904fe7b3520a_mixpanel=%7B%22distinct_id%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%2C%22mp_lib%22%3A%20%22Segment%3A%20web%22%2C%22%24search_engine%22%3A%20%22google%22%2C%22%24initial_referrer%22%3A%20%22https%3A%2F%2Fwww.google.be%2F%22%2C%22%24initial_referring_domain%22%3A%20%22www.google.be%22%2C%22mp_name_tag%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%2C%22id%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%7D; _uetsid=_uetcde93ba3; optimizelyEndUserId=oeu1522060693722r0.545521542891791; optimizelyBuckets=%7B%7D; optimizelySegments=%7B%22341521689%22%3A%22direct%22%2C%22341576276%22%3A%22gc%22%2C%22341588087%22%3A%22false%22%2C%222833930025%22%3A%22none%22%2C%225027931715%22%3A%22true%22%2C%225195510267%22%3A%22true%22%7D; intercom-lou-je5td1qt=1; intercom-session-je5td1qt=WmxjSDdDVlRYYTZUc1Z3bmlENTNhOXFJUzhvK2piZElBakFjbDI5dkpwS0hvWEFLVmMweWNHNER0Ujh3QTFKUy0tdGt0aitQQTNuQWoxZGhXMUllakhMQT09--ec3c7db367f57a30eb5b5818de90e43dc8cc39a6; __ssid=089a6e70-58c2-41bb-98d2-90ce7ada1d73
origin:http://tres-bien.com
referer:http://tres-bien.com/odehhasoidj
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36
This webpage uses the HTTP/2 protocol, so I use the hyper module to get through:
session.mount("https://www.createsend.com",HTTP20Adapter())
r = session.post(url , data=payload2 , headers=header2)
print(r.text)
But I still get this error:
<HTML><HEAD><TITLE>Bad Request</TITLE>
<META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
<BODY><h2>Bad Request - Invalid Header</h2>
<hr><p>HTTP Error 400. The request has an invalid header name.</p>
</BODY></HTML>
The form I'm submitting is at http://tres-bien.com/odehhasoidj, where you can check which POST requests are made.
This is my code:
header2 = {
    # 'Host': 'www.createsend.com',
    'authority': 'www.createsend.com',
    'method': 'POST',
    'path': '/t/securedsubscribe?token=' + token,
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'fr',
    'cache-control': 'max-age=0',
    # 'content-length': '182',
    'content-type': 'application/x-www-form-urlencoded',
    'origin': 'http://tres-bien.com',
    'referer': 'http://tres-bien.com/odehhasoidj',
    'upgrade-insecure-requests': '1',
    # 'cookie': '__utma=38149500.1167947680.1522060606.1522060606.1522060606.1; __utmc=38149500; __utmz=38149500.1522060606.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmt=1; _ga=GA1.2.1167947680.1522060606; _gid=GA1.2.485326558.1522060607; ajs_group_id=null; __qca=P0-314313712-1522060607057; mp_mixpanel__c=0; ajs_user_id=%22C5EC08CADFFC107B-B6DC4E4B6840339E%22; ajs_anonymous_id=%222e3961d1-be64-4ecf-8481-c8a47295b130%22; __utmv=38149500.|1=user-type=user=1; __utmb=38149500.2.10.1522060606; mp_1c1eda798f92601aecaa904fe7b3520a_mixpanel=%7B%22distinct_id%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%2C%22mp_lib%22%3A%20%22Segment%3A%20web%22%2C%22%24search_engine%22%3A%20%22google%22%2C%22%24initial_referrer%22%3A%20%22https%3A%2F%2Fwww.google.be%2F%22%2C%22%24initial_referring_domain%22%3A%20%22www.google.be%22%2C%22mp_name_tag%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%2C%22id%22%3A%20%22C5EC08CADFFC107B-B6DC4E4B6840339E%22%7D; _uetsid=_uetcde93ba3; optimizelyEndUserId=oeu1522060693722r0.545521542891791; optimizelyBuckets=%7B%7D; optimizelySegments=%7B%22341521689%22%3A%22direct%22%2C%22341576276%22%3A%22gc%22%2C%22341588087%22%3A%22false%22%2C%222833930025%22%3A%22none%22%2C%225027931715%22%3A%22true%22%2C%225195510267%22%3A%22true%22%7D; intercom-lou-je5td1qt=1; intercom-session-je5td1qt=WmxjSDdDVlRYYTZUc1Z3bmlENTNhOXFJUzhvK2piZElBakFjbDI5dkpwS0hvWEFLVmMweWNHNER0Ujh3QTFKUy0tdGt0aitQQTNuQWoxZGhXMUllakhMQT09--ec3c7db367f57a30eb5b5818de90e43dc8cc39a6; __ssid=089a6e70-58c2-41bb-98d2-90ce7ada1d73',
    # 'Authorization': token,
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36'
}
payload2 = {
    'cm-name': 'Namz Foll',
    'cm-zlutdu-zlutdu': email,
    'cm-f-qkytli': 'Adress Bilning',
    'cm-f-qkytld': '1030',
    'cm-f-qkytlh': 'Ciky',
    'cm-fo-qkytlk ': '3324398',
    'cm-f-qkytlu': '0412345408',
    'cm-fo-qkytry': '3324639'
}
url = "https://www.createsend.com/t/securedsubscribe?token=" + token
There is not enough information to tell for certain, but authority, method, path, and scheme are special HTTP/2 pseudo-header fields that must be prefixed with a colon, as in :authority, :scheme, and so on.
See "Pseudo-Header Fields" in RFC 7540: https://www.rfc-editor.org/rfc/rfc7540#section-8.1.2
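In practice, these pseudo-header fields are set by the HTTP/2 library itself from the URL and the request method, so sending them as literal header names like 'authority' is likely what triggers the "Invalid Header" error. A minimal sketch of the fix, reusing header2, payload2, url, and the hyper-mounted session from the question:
# The library derives the pseudo-headers from the URL and method itself;
# strip them out of the dict before sending.
pseudo = ('authority', 'method', 'path', 'scheme')
clean_headers = {k: v for k, v in header2.items() if k not in pseudo}
r = session.post(url, data=payload2, headers=clean_headers)
print(r.status_code)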
These headers belong to HTTP/2, and you can't use them with the requests library, but you can use another library, hyper:
from hyper import HTTPConnection
conn = HTTPConnection('http2bin.org:443')
conn.request('GET', '/get')
resp = conn.get_response()
print(resp.read())
But learning a new library can be time-consuming, and as we know, requests is much simpler; this link explains very well how you can use these headers.
