I am trying to scrape a website whose request headers contain some attributes that are new to me, such as :authority, :method, :path, :scheme.
{':authority': 'xxxx',
 ':method': 'GET',
 ':path': '/xxxx',
 ':scheme': 'https',
 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
 'accept-encoding': 'gzip, deflate, br',
 'accept-language': 'en-US,en;q=0.9',
 'cache-control': 'max-age=0',
 'cookie': 'GOOGLE_ABUSE_EXEMPTION=ID=0d5af55f1ada3f1e:TM=1533116294:C=r:IP=182.71.238.62-:S=APGng0u2o9IqL5wljH2o67S5Hp3hNcYIpw; 1P_JAR=2018-8-1-9',
 'upgrade-insecure-requests': '1',
 'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
 'x-client-data': 'CJG2yQEIpbbJAQjEtskBCKmdygEI2J3KAQioo8oBCIKkygE='}
I tried passing them as headers with the HTTP request but ended up with the error shown below.
ValueError: Invalid header name b':scheme'
Any help in understanding these fields, and guidance on how to use them when making the request, would be appreciated.
EDIT:
code added
import requests
url = 'https://www.google.co.in/search?q=some+text'
headers = {':authority': 'xxxx',
           ':method': 'GET',
           ':path': '/xxxx',
           ':scheme': 'https',
           'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'accept-encoding': 'gzip, deflate, br',
           'accept-language': 'en-US,en;q=0.9',
           'cache-control': 'max-age=0',
           'upgrade-insecure-requests': '1',
           'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
           'x-client-data': 'CJG2yQEIpbbJAQjEtskBCKmdygEI2J3KAQioo8oBCIKkygE='}
response = requests.get(url, headers=headers)
print(response.text)
Your error comes from here (Python's source code): http.client validates each header name before sending the request.
HTTP header names cannot contain a colon, as the RFC states.
:authority, :method, :path, :scheme are not HTTP headers; they are HTTP/2 pseudo-header fields that the browser's DevTools display alongside the real headers.
https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
':method':'GET'
defines the HTTP request method
https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods
and
:authority, :path, :scheme
are parts of the URI: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax
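Since requests derives the method, path, scheme and authority from the URL and the call itself, the fix is simply to drop the four pseudo-headers and keep the real ones. A minimal sketch:
import requests

url = 'https://www.google.co.in/search?q=some+text'
# Only real HTTP headers are passed; :authority, :method, :path and :scheme
# are derived from the URL and the .get() call.
headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'accept-language': 'en-US,en;q=0.9',
           'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.status_code)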
Related
I'm trying to scrape this website: https://triller.co/. I want to get information from profile pages like https://triller.co/#warnermusicarg, so I request the JSON URL that contains the information; in this case it's https://social.triller.co/v1.5/api/users/by_username/warnermusicarg
When I use requests.get() it works normally and I can retrieve all the information.
import requests
from urllib.parse import urlencode
url = 'https://social.triller.co/v1.5/api/users/by_username/warnermusicarg'
headers = {'authority': 'social.triller.co',
           'method': 'GET',
           'path': '/v1.5/api/users/by_username/warnermusicarg',
           'scheme': 'https',
           'accept': '*/*',
           'accept-encoding': 'gzip, deflate, br',
           'accept-language': 'ar,en-US;q=0.9,en;q=0.8',
           'authorization': 'Bearer eyJhbGciOiJIUzI1NiIsImlhdCI6MTY0MDc4MDc5NSwiZXhwIjoxNjkyNjIwNzk1fQ.eyJpZCI6IjUyNjQ3ODY5OCJ9.Ds-acbfcGSeUrGDSs47pBiT3b13Eb9SMcB8BF8OylqQ',
           'origin': 'https://triller.co',
           'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
           'sec-ch-ua-mobile': '?0',
           'sec-ch-ua-platform': '"Windows"',
           'sec-fetch-dest': 'empty',
           'sec-fetch-mode': 'cors',
           'sec-fetch-site': 'same-site',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = requests.get(url, headers=headers)
The problem arises when I try to use API proxy providers such as Webscraping.ai, ScrapingBee, etc.
api_key='my_api_key'
api_url='https://api.webscraping.ai/html?'
params = {'api_key': api_key, 'timeout': '20000', 'url':url}
proxy_url = api_url + urlencode(params)
response2 = requests.get(proxy_url, headers=headers)
This gives me this error:
2022-01-08 22:30:59 [urllib3.connectionpool] DEBUG: https://api.webscraping.ai:443 "GET /html?api_key=my_api_key&timeout=20000&url=https%3A%2F%2Fsocial.triller.co%2Fv1.5%2Fapi%2Fusers%2Fby_username%2Fwarnermusicarg&render_js=false HTTP/1.1" 502 91
{'status_code': 403, 'status_message': '', 'message': 'Unexpected HTTP code on the target page'}
What I tried to do:
1- I searched for the meaning of the 403 code in my API proxy provider's documentation; it says the api_key is wrong, but I'm 100% sure it's correct.
2- I changed to another API proxy provider, but I hit the same issue.
3- I had the same issue with twitter.com.
I don't know what to do.
Currently, the code in the question successfully returns a response with code 200, but there are 2 possible issues:
1- Some sites block datacenter proxies; try the proxy=residential API parameter (params = {'api_key': api_key, 'timeout': '20000', 'proxy': 'residential', 'url': url}).
2- Some of the headers in your headers parameter are unnecessary. Webscraping.AI uses its own set of headers to mimic the behavior of a normal browser, so setting a custom user-agent, accept-language, etc. may interfere with them and cause 403 responses from the target site. Use only the necessary headers; in your case that looks like just the authorization header. A trimmed call is sketched below.
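For example, a trimmed version of the proxied call, keeping the pattern from the question, might look like this (a sketch; the bearer token is a placeholder):
# Only the authorization header is forwarded; everything else is left to the service.
headers = {'authorization': 'Bearer my_token'}  # placeholder token
params = {'api_key': api_key, 'timeout': '20000', 'proxy': 'residential', 'url': url}
response2 = requests.get(api_url + urlencode(params), headers=headers)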
I don't know exactly what caused this error, but I tried using their webscraping_ai.ApiClient() instance as shown here and it worked:
import webscraping_ai

configuration = webscraping_ai.Configuration(
    host="https://api.webscraping.ai",
    api_key={'api_key': 'my_api_key'}
)
with webscraping_ai.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = webscraping_ai.HTMLApi(api_client)
    url_j = url  # str | URL of the target page
    timeout = 20000  # request timeout in milliseconds
    js = False  # no JavaScript rendering needed for a JSON endpoint
    proxy = 'datacenter'
    api_response = api_instance.get_html(url_j, headers=headers, timeout=timeout, js=js, proxy=proxy)
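Assuming get_html returns the page body as a string (it is the HTML endpoint), the profile JSON can then be parsed with json.loads(api_response).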
I am scraping a site with Scrapy, but some of its APIs do not return JSON data without the 'if-none-match' header.
I have a list of more than 100 APIs, so I want to set this header automatically so each call returns valid JSON. Does anybody know how to handle this, or is there another way around it?
Thanks in advance.
You can use the DEFAULT_REQUEST_HEADERS setting if you want to define headers for all requests:
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'If-None-Match': '*',
}
or the headers parameter for individual requests:
req = scrapy.Request(url, callback=self.parse, headers={'If-None-Match': '*'})
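If you prefer the per-request approach across your 100+ API URLs, a minimal spider sketch could look like this (the URLs here are placeholders):
import scrapy

class ApiSpider(scrapy.Spider):
    name = 'api'
    # Placeholder list; substitute your 100+ API URLs.
    api_urls = ['https://example.com/api/1', 'https://example.com/api/2']

    def start_requests(self):
        for url in self.api_urls:
            # Every generated request carries the If-None-Match header.
            yield scrapy.Request(url, callback=self.parse, headers={'If-None-Match': '*'})

    def parse(self, response):
        # response.json() needs Scrapy >= 2.2; use json.loads(response.text) otherwise.
        yield response.json()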
I'm a Python user, beginner level. I'm trying to follow these instructions for Basecamp 3. Documentation: https://github.com/basecamp/bc3-api
I've successfully gone through the authorization step and was able to retrieve the access token (which consists of 3 keys: access_token, expires_in and refresh_token).
Now I'm trying to pull some actual data from Basecamp, and the most basic call is to https://3.basecampapi.com/999999999/projects.json (with 999999999 being my account number, which I have).
The instructions have an example in curl: curl -H "Authorization: Bearer $ACCESS_TOKEN" -H 'User-Agent: MyApp (yourname@example.com)' https://3.basecampapi.com/999999999/projects.json
But I cannot translate this to Python. I tried many ways of passing the keys to the headers, but none work. Can anyone help me out?
Code:
url = "3.basecampapi.com/99999999/projects.json"
headers = {'Content-Type': 'application/json',
'User-Agent': 'MyApp (myemail#gmail.com)',
'access_token': 'Access_Token_String',
'expires_in': '1209600',
'refresh_token': 'Refresh_token_string'}
result = requests.post(url, headers=headers)
This is an old question, but posting an answer for anyone who happens to stumble upon this.
import requests

url = f'https://3.basecampapi.com/{ACCOUNT_ID}/projects.json'
headers = {'User-Agent': 'MyApp (myemail@gmail.com)',
           'Content-Type': 'application/json; charset=utf-8',
           'Authorization': f'Bearer {ACCESS_TOKEN}'}
response = requests.get(url, headers=headers)
Then view the output via response.json()
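For example, assuming the call succeeded, the project names can be listed like this (field names per the bc3-api docs):
# Each element of the returned JSON array is a project with a 'name' field.
for project in response.json():
    print(project['name'])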
I am dealing with a little error and I cannot find the solution. I authenticated to a page with the "inspect/network" Chrome tool open to see which web service is called and how. I found out this is used:
I have censored sensitive data related to the site. So, I have to make this same request using Python, but I always get error 500, and the log on the server side shows nothing helpful (only a Java traceback).
This is the code of the request
response = requests.post(url, data='username=XXXXX&password=XXXXXXX')
URL has the same string that you see in the image under the "General/Request URL" label.
Data has the same string that you see in the image under "Form Data".
It looks like a very simple request, but I cannot get it to work :(.
Best regards
If you want your request to appear as if it is coming from Chrome, then besides sending the correct data you need to specify headers as well. The reason you got a 500 error is probably that there are settings on the server side disallowing traffic from "non-browsers".
So in your case, you need to add headers:
headers = {'Accept': 'application/json, text/plain, */*',
           'Accept-Encoding': 'gzip, deflate',
           # ...... add more headers as needed
           'User-Agent': 'Mozilla/5.0 XXXXX...'  # this line tells the server what browser/agent is used for this request
           }
response = requests.post(url, data='username=XXXXX&password=XXXXXXX', headers=headers)
P.S. If you are curious, default headers from requests are:
>>> import requests
>>> session = requests.Session()
>>> session.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate',
'Accept': '*/*', 'User-Agent': 'python-requests/2.13.0'}
As you can see, the default User-Agent is python-requests/2.13.0, and some websites do block such traffic.
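If the same headers should apply to every request, one option is a requests.Session; a minimal sketch (url and the credentials string are placeholders from the question):
import requests

session = requests.Session()
# Every request made through this session inherits these headers.
session.headers.update({
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 XXXXX...',  # placeholder; use a full browser UA string
})
response = session.post(url, data='username=XXXXX&password=XXXXXXX')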
I'm using the Python requests library to GET and POST HTTP content. I have no problem using the GET function, but my POST function seems to fail or not do anything at all. From my understanding of the requests library, the POST function automatically encodes the data you send, but I'm not sure that's actually happening.
code:
data = 'hash='+hash+'&confirm=Continue+as+Free+User'
r = requests.post(url,data)
html = r.text
By checking the value of html I can tell that the returned response is that of the URL without the POST.
You're not taking advantage of how requests will encode it for you. To do so, you need to write your code this way:
data = {'hash': hash, 'confirm': 'Continue as Free User'}
r = requests.post(url, data)
html = r.text
I cannot test this for you, but this is how the encoding happens automatically. From the docstring of requests.post:
post(url, data=None, **kwargs)
Sends a POST request. Returns :class:`Response` object.
:param url: URL for the new :class:`Request` object.
:param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
:param \*\*kwargs: Optional arguments that ``request`` takes.
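You can also see the automatic encoding without sending anything over the network by preparing the request yourself; a small sketch (the URL and hash value are placeholders):
import requests

# Build and prepare the request locally to inspect what would be sent.
req = requests.Request('POST', 'http://example.com/download',
                       data={'hash': 'abc123', 'confirm': 'Continue as Free User'})
prepared = req.prepare()
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # hash=abc123&confirm=Continue+as+Free+User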
import requests
url = "http://computer-database.herokuapp.com/computers"
payload = "name=Hello11111122OKOK&introduced=1986-12-26&discontinued=2100-12-26&company=13"
headers = {
    'accept-language': "en-US,en;q=0.9,kn;q=0.8",
    'accept-encoding': "gzip, deflate",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    'content-type': "application/x-www-form-urlencoded",
    'cache-control': "no-cache",
    'postman-token': "3e5dabdc-149a-ff4c-a3db-398a7b52f9d5"
}
response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)