The site I am trying to scrape answers this request with a 302, as shown in the browser's network tools. I copied the network request as cURL and it works fine, but when I convert it to Python requests it just returns 200.
Curl:
curl -v 'https://my-site.com/CardAuthentication.aspx' \
-H 'Connection: keep-alive' \
-H 'Pragma: no-cache' \
-H 'Cache-Control: no-cache' \
-H 'Upgrade-Insecure-Requests: 1' \
-H 'Origin: https://my-site.com' \
-H 'Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryBN2XwBbcAwvRWZzk' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
-H 'Sec-GPC: 1' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Sec-Fetch-Mode: navigate' \
-H 'Sec-Fetch-User: ?1' \
-H 'Sec-Fetch-Dest: document' \
-H 'Referer: https://my-site.com/CardAuthentication.aspx' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Cookie: ASP.NET_SessionId=...; TS0192fa71=...; ServiceProvider=UID=...; .ASPXAUTH=...' \
--data-raw $'------WebKitFormBoundaryBN2XwBbcAwvRWZzk\r\ DATA \r\n' \
--compressed
This returns:
* Trying IP:443...
........
........
< HTTP/1.1 302 Found
< Cache-Control: no-cache
< Pragma: no-cache
< Content-Type: text/html; charset=utf-8
< Expires: -1
< Location: my-site.com/MemberDetails.aspx
< Date: Mon, 07 Feb 2022 09:54:49 GMT
< Content-Length: 61211
< Set-Cookie: TS0192fa71=...; Path=/; Domain=.my-site.com; Secure
Python:
import requests
import logging
logging.basicConfig(level=logging.DEBUG)
cookies = {
"ASP.NET_SessionId": ".....",
"TS0192fa71": ".....",
"ServiceProvider": ".....",
".ASPXAUTH": ".....",
}
headers = {
......
"Content-Type": "multipart/form-data; boundary=----WebKitFormBoundaryBN2XwBbcAwvRWZzk",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"Referer": "https://my-site.com/CardAuthentication.aspx",
"Accept-Language": "en-US,en;q=0.9",
}
data = {
"------WebKitFormBoundaryBN2XwBbcAwvRWZzk\r\nContent-Disposition: form-data; name": '"__EVENTTARGET"\r\n\r\n\r\n-- DATA
}
response = requests.post(
"https://my-site.com/CardAuthentication.aspx",
headers=headers,
cookies=cookies,
data=data,
)
The return code is 200, and the response history is empty.
Is there something wrong with requests library or in the way that request library is processing the data? How can I solve this?
This is the default requests behavior: it follows redirects automatically. Set allow_redirects to False if you do not want it to follow a 3xx response, for example:
import requests
r = requests.get("http://github.com", allow_redirects=False)
print(r.status_code) # 301
If you want to know more, read the requests.request docs.
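To see both behaviours side by side without touching the real site, here is a minimal sketch that runs a throwaway local server which answers every POST with a 302. The handler class, the demo() helper and the paths are stand-ins made up for illustration, not the asker's site; only allow_redirects, status_code, history and headers are actual requests features.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Local stand-in for a site that answers a POST with a 302."""

    def do_POST(self):
        # Read the request body so the connection is left in a clean state.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(302)
        self.send_header("Location", "/MemberDetails.aspx")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        body = b"member page"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def demo():
    # Port 0 lets the OS pick any free port.
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/CardAuthentication.aspx"
    session = requests.Session()
    session.trust_env = False  # ignore proxy settings for the local server
    try:
        followed = session.post(url, data={"field": "value"})
        raw = session.post(url, data={"field": "value"}, allow_redirects=False)
    finally:
        session.close()
        server.shutdown()
        server.server_close()
    return followed, raw


followed, raw = demo()
# Default: the redirect is followed, the 302 ends up in .history.
print(followed.status_code, [r.status_code for r in followed.history])
# allow_redirects=False: the 302 itself is returned.
print(raw.status_code, raw.headers["Location"])
```

If the history is empty and the status is 200 even with the default settings, the server never sent a redirect for that request, so the request body or cookies are the next thing to compare against the curl version.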
Related
I have a JOOMLA 3 system with Community Builder plugin. I need to update Users from a CSV file.
Here is the code I have tried:
if __name__ == "__main__":
    s = fn_login()
    r = s.get(cb_users)
    print(r)
    with open(IMPORT_FILE, 'rb') as f:
        files = {'files': f}
        r = s.post(post_url, files=files)
The session is OK. I have tried to work out what the POST message should contain, but something may be missing. Nothing happens when I send the POST, although I get a 200 response. What else do I need to consider?
Here is the request header:
curl 'https://3joomla.domain/administrator/index.php'
-X 'POST'
-H 'authority: 3joomla.domain'
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
-H 'accept-language: en-US,en;q=0.9,hu;q=0.8'
-H 'cache-control: max-age=0'
-H 'content-length: 6710'
-H 'content-type: multipart/form-data; boundary=----WebKitFormBoundaryALBLy00y6zolw7hC'
-H 'cookie: ....
-H 'origin: https://3joomla.domain'
-H 'referer: https://3joomla.domain/administrator/index.php'
-H 'sec-ch-ua: "Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"'
-H 'sec-ch-ua-mobile: ?0'
-H 'sec-ch-ua-platform: "Linux"'
-H 'sec-fetch-dest: document'
-H 'sec-fetch-mode: navigate'
-H 'sec-fetch-site: same-origin'
-H 'sec-fetch-user: ?1'
-H 'upgrade-insecure-requests: 1'
-H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
--compressed
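One way to find out what is missing is to build the request without sending it and compare it with the payload the browser shows. The sketch below is only an illustration: the CSV content and the extra form field are hypothetical placeholders, and the real field names have to be copied from the browser's request payload.

```python
import io

import requests

# Hypothetical CSV content and form field, for illustration only.
csv_file = io.BytesIO(b"username,email\njdoe,jdoe@example.com\n")
request = requests.Request(
    "POST",
    "https://3joomla.domain/administrator/index.php",
    files={"files": ("users.csv", csv_file, "text/csv")},
    data={"example_hidden_field": "example_value"},
)
# prepare() builds the headers and the multipart body without any network I/O.
prepared = request.prepare()

print(prepared.headers["Content-Type"])
print(prepared.body.decode("utf-8"))
```

With a logged-in session, s.prepare_request(request) does the same and also merges in the session cookies. A form like this usually carries hidden fields next to the file, so any field that shows up in the browser payload but not in this output is a candidate for what the server is waiting for.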
I can't parse the transcript of a video from https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript
The response that requests gets does not contain the span class where the text actually is. What could be the problem?
import requests
url = 'https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript'
page = requests.get(url)
print(page.content)
Is there any way to reach the transcript? Thank you.
I need to reach this
No attribute found
That's because the data is not loaded via the link you're using, but via a call to their GraphQL instance.
Using curl, you can fetch the data like so:
curl 'https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Referer: https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript' -H 'content-type: application/json' -H 'client-id: Zenith production' -H 'x-operation-name: Transcript' --output - | gzip -d
Note that the URL is urlencoded. In Python, you can use from urllib.parse import quote and call quote() to urlencode a string.
So simply translate the above curl command to Python.
There's no magic, simply set the correct headers.
If you're lazy, you can also use this online converter to convert a curl command to Python code.
This produces:
import requests
from requests.structures import CaseInsensitiveDict
url = "https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D"
headers = CaseInsensitiveDict()
headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
headers["Accept"] = "*/*"
headers["Accept-Language"] = "en-US,en;q=0.5"
headers["Accept-Encoding"] = "gzip, deflate, br"
headers["Referer"] = "https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript"
headers["content-type"] = "application/json"
headers["client-id"] = "Zenith production"
headers["x-operation-name"] = "Transcript"
resp = requests.get(url, headers=headers)
print(resp.content)
Output:
b'{"data":{"translation":{"id":"209255","language" ...
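The urlencoded query string in that URL does not have to be written by hand. Here is a small sketch that rebuilds it from plain Python objects with json.dumps and urllib.parse. The function name build_transcript_url is made up for this example, and the sha256Hash is simply copied from the URL above; it presumably identifies the stored query on TED's side and may change.

```python
import json
from urllib.parse import quote, urlencode


def build_transcript_url(talk_id, language="en"):
    """Rebuild the GraphQL URL from the curl command for a given talk id."""
    variables = {"id": talk_id, "language": language}
    extensions = {
        "persistedQuery": {
            "version": 1,
            # Copied from the URL above, not a value I can vouch for long term.
            "sha256Hash": "18f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6",
        }
    }
    params = {
        "operationName": "Transcript",
        # Compact separators reproduce the JSON exactly as the browser sends it.
        "variables": json.dumps(variables, separators=(",", ":")),
        "extensions": json.dumps(extensions, separators=(",", ":")),
    }
    # quote_via=quote percent-encodes every reserved character in the values.
    return "https://www.ted.com/graphql?" + urlencode(params, quote_via=quote)


url = build_transcript_url("alexis_nikole_nelson_a_flavorful_field_guide_to_foraging")
print(url)
```

Swapping the talk id (the last part of the talk's page URL) should give the URL for another transcript, assuming the same stored query is still valid.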
I ran the curl command in the terminal and got the response as JSON. But when I converted the curl command to Python code and ran the script, it returned an HTML page that looks like a captcha page.
Why can they detect the difference? Is there any way I can bypass that?
curl --location --request GET 'https://www.99.co/api/v2/web/search/listings?query_type=city&property_segments=residential&listing_type=sale&rental_type=unit&page_size=1&page_num=100&zoom=11&show_future_mrts=true&show_cluster_preview=true&show_internal_linking=true' \
--header 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0' \
--header 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
--header 'accept-language: en-US,en;q=0.5' \
--header 'if-none-match: W/"5cb28158470b6eb63b0db1f5dd2adda7d10c328c"' \
--header 'cookie: _xsrf=2|6be6c7ee|de091e966dba87e0b2c8637a63127715|1616479899' \
--header 'cookie: lotameid=undefined' \
--header 'cookie: started_since=1616480838' \
--header 'cookie: country=SG' \
--header 'cookie: nid=2de067d4-fed8-4c35-b7db-14f479997f8a' \
--header 'cookie: g_state={"i_p":1627551635034,"i_l":1}' \
--header 'upgrade-insecure-requests: 1' \
--header 'te: trailers' \
--header 'sec-gpc: 1'
import requests
url = "https://www.99.co/api/v2/web/search/listings?query_type=city&property_segments=residential&listing_type=sale&rental_type=unit&page_size=1&page_num=100&zoom=11&show_future_mrts=true&show_cluster_preview=true&show_internal_linking=true"
payload={}
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'accept-language': 'en-US,en;q=0.5',
'if-none-match': 'W/"5cb28158470b6eb63b0db1f5dd2adda7d10c328c"',
'cookie': '_xsrf=2|6be6c7ee|de091e966dba87e0b2c8637a63127715|1616479899',
'cookie': 'lotameid=undefined',
'cookie': 'started_since=1616480838',
'cookie': 'country=SG',
'cookie': 'nid=2de067d4-fed8-4c35-b7db-14f479997f8a',
'cookie': 'g_state={"i_p":1627551635034,"i_l":1}',
'upgrade-insecure-requests': '1',
'te': 'trailers',
'sec-gpc': '1'
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
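One difference between the two versions is worth checking: curl sends every --header it is given, but a Python dict literal keeps only the last value for a repeated key, so the headers dict above ends up with a single cookie entry (g_state). Whether that is what triggers the captcha page is a guess. The sketch below shows the collapse with two of the cookies and a hypothetical helper, cookie_header, that joins cookies into one header value.

```python
# A dict literal that repeats a key silently keeps only the last value.
headers = {
    "cookie": "country=SG",
    "cookie": "nid=2de067d4-fed8-4c35-b7db-14f479997f8a",
}
print(headers)


def cookie_header(cookies):
    """Join name/value pairs into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


cookies = {
    "lotameid": "undefined",
    "started_since": "1616480838",
    "country": "SG",
    "nid": "2de067d4-fed8-4c35-b7db-14f479997f8a",
}
merged = cookie_header(cookies)
print(merged)
```

Alternatively, the cookies dict can be passed as cookies=cookies to requests.get, which builds the Cookie header itself.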
This is bizarre: I get a different response from cURL and from Python with the requests module. Python is also very slow: cURL takes about 6 seconds, but Python takes 2 minutes.
My cURL command is:
curl -L -s -k --request GET \
--url "https://$1/api/maintenance/maintenance/dual_image_config" \
--header 'Cache-Control: no-cache' \
--header 'Accept-Encoding: gzip, deflate, sdch' \
--header 'Content-Type: application/json' \
--header 'Connection: keep-alive'
The cURL response is:
{ "state_1": "Active", "state_2": "stand-by", "current_active_image": "Image-1", "id": 1}
My Python code is:
import requests, sys
DEFAULT_HEADER = {"Accept-Encoding": "gzip, deflate, br",
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"Content-Type": "application/json",
"X-Requested-With": "XMLHttpRequest"}
url = "https://{0}/api/maintenance/maintenance/dual_image_config".format(sys.argv[1])
response = requests.get(url=url, data=None, headers=DEFAULT_HEADER, verify=False)
The Python response.text is:
'0121\r\n{ "state_1": "Active", "state_2": "stand-by", "current_active_image": "Image-1", "id": 1}\r\n0\r\n\r\n'
The response has 0121\r\n at the start and \r\n0\r\n\r\n at the end. I have tried changing the request headers and the encoding, but it did not help. Why is the response different, and why is it so slow?
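The extra bytes look like HTTP chunked transfer-encoding framing: a hexadecimal chunk size, CRLF, the chunk data, CRLF, and a final zero-size chunk. That would suggest the framing is reaching the caller undecoded, but this is an assumption about the cause and nothing in the post confirms it. As a rough sketch, a hypothetical helper decode_chunked that strips such framing could look like this:

```python
def decode_chunked(raw: bytes) -> bytes:
    """Strip chunked transfer-encoding framing from a raw body."""
    body = bytearray()
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry ";extension" parameters.
        size = int(raw[pos:line_end].split(b";", 1)[0], 16)
        if size == 0:
            break  # the zero-size chunk marks the end of the body
        start = line_end + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip the CRLF that closes the chunk
    return bytes(body)


# Made-up sample framed the same way as the response above.
payload = b'{"id": 1}'
framed = b"0009\r\n" + payload + b"\r\n0\r\n\r\n"
print(decode_chunked(framed))
```

This only works on the raw bytes with the original chunk sizes intact, for example response.content. It is a workaround for inspecting the data, not a fix for whatever makes the server respond this way.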
Here is the curl command:
curl 'http://ecard.bupt.edu.cn/User/ConsumeInfo.aspx'
-H 'Cookie: ASP.NET_SessionId=4mzmi0whscqg4humcs5fx0cf; .NECEID=1; .NEWCAPEC1=$newcapec$:zh-CN_CAMPUS; .ASPXAUTSSM=A91571BB81F620690376AF422A09EEF8A7C4F6C09978F36851B8DEDFA56C19C96A67EBD8BBD29C385C410CBC63D38135EFAE61BF49916E0A6B9AB582C9CB2EEB72BD9DE20D64A51C09474E676B188D937B84E601A981C02F51AA9C85A08EABCC1D30D7B9299E02A45D427B08A44AEBB0'
-H 'Origin: http://ecard.bupt.edu.cn'
-H 'Accept-Encoding: gzip, deflate'
-H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4'
-H 'Upgrade-Insecure-Requests: 1'
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
-H 'Content-Type: application/x-www-form-urlencoded'
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
-H 'Cache-Control: max-age=0' -H 'Referer: http://ecard.bupt.edu.cn/User/ConsumeInfo.aspx'
-H 'Connection: keep-alive'
--data '__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUINDEyMzA5NDkPFgIeCFNvcnRUeXBlBQNBU0MWAmYPZBYCAgMPZBYCAgMPZBYCAgQPPCsAEQMADxYEHgtfIURhdGFCb3VuZGceC18hSXRlbUNvdW50ZmQBEBYAFgAWAAwUKwAAFgJmD2QWAmYPZBYCAgEPZBYCAgEPDxYEHghDc3NDbGFzcwUKU29ydEJ0X0FzYx4EXyFTQgICZGQYAQUiY3RsMDAkQ29udGVudFBsYWNlSG9sZGVyMSRncmlkVmlldw88KwAMAQhmZBb11RlvPBlQn5fbuqe6uh8BRZJ2jUZ5U17xEnRM%2BnbF&__VIEWSTATEGENERATOR=D99777FC&__EVENTVALIDATION=%2FwEdAAgtbl4l0SdCQqYpyy3Ex39BEoHXgPeD%2Fa3eEKPr2bG0rY8WyuyQajjUbeopYM%2Fne68pyn2l0BPEWPPnyd6MfoDZVQGbFRIKjb%2FOTPkkCDIIaJ3X6W3VKO%2FV036Pdf6jf06OTFulzGhY80%2FZ1HbCJJ6LR5Lg6mLp72ibUCB3VipRJuK11qmW%2BSDe3IOvbK4oNUdV5%2FxmkXw25tTwfJ8P8OS2&ctl00%24ContentPlaceHolder1%24rbtnType=0&ctl00%24ContentPlaceHolder1%24txtStartDate=2016-09-17&ctl00%24ContentPlaceHolder1%24txtEndDate=2016-09-24&ctl00%24ContentPlaceHolder1%24btnSearch=%E6%9F%A5++%E8%AF%A2' --compressed
I want to use requests to get the response and analyse it, but I'm confused about how to do this.
Thanks!
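You can use the Python requests module. The answer would be something like:
import requests
# curl sends a POST when --data is given, so use requests.post here
response = requests.post('http://ecard.bupt.edu.cn/User/ConsumeInfo.aspx',
    headers={'Origin': 'http://ecard.bupt.edu.cn'},
    data={'__EVENTTARGET': '', 'ctl00$ContentPlaceHolder1$rbtnType': '0'},
    cookies={'.NECEID': '1'}  # cookie values must be strings
)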
I didn't include all the parameters, but I think this example is enough to work out how to send exactly the same request. The only thing I am not sure about is the --compressed flag: as I understand from the docs, it is specific to curl (correct me if I'm wrong).
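To fill in the remaining parameters without retyping them, the urlencoded --data string can be turned back into a dict with urllib.parse.parse_qsl. This is a sketch on a shortened copy of the string; the long __VIEWSTATE and __EVENTVALIDATION fields are left out here only to keep the example readable, and they would be parsed the same way.

```python
from urllib.parse import parse_qsl

# Shortened copy of the --data string from the curl command.
raw_data = (
    "__EVENTTARGET=&__EVENTARGUMENT="
    "&ctl00%24ContentPlaceHolder1%24rbtnType=0"
    "&ctl00%24ContentPlaceHolder1%24txtStartDate=2016-09-17"
    "&ctl00%24ContentPlaceHolder1%24txtEndDate=2016-09-24"
    "&ctl00%24ContentPlaceHolder1%24btnSearch=%E6%9F%A5++%E8%AF%A2"
)

# keep_blank_values=True keeps empty fields such as __EVENTTARGET.
form = dict(parse_qsl(raw_data, keep_blank_values=True))
for name, value in form.items():
    print(repr(name), "=", repr(value))
```

The resulting dict holds the decoded names and values, so it can be passed as data=form to requests.post, which urlencodes it again when sending.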
Here is a useful website to convert the curl to requests:
curl to request
Curl Command:
curl 'http://js.t.sinajs.cn/open/api/js/api/bundle.js?version=20150130.02' -H 'Pragma: no-cache' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36' -H 'Accept: */*' -H 'Referer: http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001375502200090e998439175cc4268b0ea703b3b4ed55e000' -H 'Connection: keep-alive' -H 'Cache-Control: no-cache' --compressed
Python requests:
import requests
headers = {
'Pragma': 'no-cache',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36',
'Accept': '*/*',
'Referer': 'http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001375502200090e998439175cc4268b0ea703b3b4ed55e000',
'Connection': 'keep-alive',
'Cache-Control': 'no-cache',
}
requests.get('http://js.t.sinajs.cn/open/api/js/api/bundle.js?version=20150130.02', headers=headers)