I am having an issue submitting an HTTP POST request. The purpose of this program is to scrape lyrics off a website and then use that string in a text summarizer. The problem is submitting the POST request to the summarizer's website: with the code below, the request does not seem to be submitted, and I just get the page back. I think it may be due to the content-type being different, but I am not sure.
My code:
import requests

def summarize(lyrics):
    url = 'http://www.freesummarizer.com'
    values = {'text': lyrics,
              'maxsentences': '1',
              'maxtopwords': '40',
              'email': 'your#email.com'}
    headers = {'User-Agent': 'Mozilla/5.0'}
    # Cookies copied from the browser; note they are never passed to requests.post() below.
    cookies = {'_jsuid': '777245265', '_ga':'GA1.2.164138903.1423973625', '__smToken':'elPdHJINsP5LvAYhia6OAA68', '__smListBuilderShown':'true', '_first_pageview':'1', '_gat':'1', '_eventqueue':'%7B%22heatmap%22%3A%5B%7B%22type%22%3A%22heatmap%22%2C%22href%22%3A%22%252F%22%2C%22x%22%3A324%2C%22y%22%3A1800%2C%22w%22%3A640%7D%5D%2C%22events%22%3A%5B%5D%7D', 'PHPSESSID':'28b0843d49700e134530fbe32ea62923', '__smSmartbarShown':'true'}
    r = requests.post(url, data=values, headers=headers)
    print(r.text)
My response headers:
'transfer-encoding': 'chunked'
'set-cookie': 'PHPSESSID=1f10ec11e6f9040cbb5a81e16bfcdf7f; path=/',
'expires': 'Thu, 19 Nov 1981 08:52:00 GMT'
'keep-alive': 'timeout=5, max=100'
'server': 'Apache'
'connection': 'Keep-Alive'
'pragma': 'no-cache'
'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0'
'date': 'Fri, 27 Feb 2015 18:38:41 GMT'
'content-type': 'text/html'
The request headers of a successful submission from the browser on this website:
Host: freesummarizer.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:35.0) Gecko/20100101 Firefox/35.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://freesummarizer.com/
Cookie: _jsuid=777245265; _ga=GA1.2.164138903.1423973625; __smToken=elPdHJINsP5LvAYhia6OAA68; __smListBuilderShown=true; _first_pageview=1; _gat=1; _eventqueue=%7B%22heatmap%22%3A%5B%7B%22type%22%3A%22heatmap%22%2C%22href%22%3A%22%252F%22%2C%22x%22%3A324%2C%22y%22%3A1800%2C%22w%22%3A640%7D%5D%2C%22events%22%3A%5B%5D%7D; PHPSESSID=28b0843d49700e134530fbe32ea62923; __smSmartbarShown=true
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 6044
Everything seems to be working just fine with requests.
But I think the issue here is that you are using the wrong tool for the job.
The tool I believe you are looking for is Selenium.
Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) also be automated as well.
You should absolutely take a look at this tool.
Selenium docs
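On the Content-Type question raised above, one way to check what requests would put on the wire is to prepare the request without sending it. This is only a minimal sketch: the form field names are copied from the question, the lyrics string is a placeholder, and nothing here confirms which fields the site actually expects.

```python
import requests

# Form fields copied from the question; the lyrics text is a placeholder.
values = {
    "text": "some lyrics to summarize",
    "maxsentences": "1",
    "maxtopwords": "40",
}

# Build the request but do not send it, so headers and body can be inspected.
prepared = requests.Request(
    "POST",
    "http://www.freesummarizer.com",
    data=values,
    headers={"User-Agent": "Mozilla/5.0"},
).prepare()

print(prepared.headers["Content-Type"])
print(prepared.body)
```

If the printed Content-Type is application/x-www-form-urlencoded, the header already matches the browser request, so the difference more likely lies elsewhere (form fields, cookies, or JavaScript on the page).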
I'm trying to log in to a website using the requests module. While creating a script to do so, I noticed that the payload used there is completely different from the conventional approach. The payload looks exactly like this: +åEMAIL"PASSWORD(0. The content type is content-type: application/grpc-web+proto.
The following is what I see in dev tools when I log in to that site manually:
General
--------------------------------------------------------
Request URL: https://grips-web.aboutyou.com/checkout.CheckoutV1/logInWithEmail
Request Method: POST
Status Code: 200
Remote Address: 104.18.9.228:443
Response Headers
--------------------------------------------------------
Referrer Policy: strict-origin-when-cross-origin
access-control-allow-credentials: true
access-control-allow-origin: https://www.aboutyou.cz
access-control-expose-headers: Content-Encoding, Vary, Access-Control-Allow-Origin, Access-Control-Allow-Credentials, Date, Content-Type, grpc-status, grpc-message
cf-cache-status: DYNAMIC
cf-ray: 67d009674f604a4d-SIN
content-encoding: gzip
content-type: application/grpc-web+proto
date: Wed, 11 Aug 2021 08:19:04 GMT
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
set-cookie: __cf_bm=a45185d4acac45725b46236884673503104a9473-1628669944-1800-Ab2Aos6ocz7q8B8v53oEsSK5QiImY/zqlTba/Y0FqpdsaQt2c10FJylcwTacmdovm6tjGd8hLdy/LidfFCtOj70=; path=/; expires=Wed, 11-Aug-21 08:49:04 GMT; domain=.aboutyou.com; HttpOnly; Secure; SameSite=None
vary: Origin
Request Headers
--------------------------------------------------------
:authority: grips-web.aboutyou.com
:method: POST
:path: /checkout.CheckoutV1/logInWithEmail
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: no-cache
content-length: 48
content-type: application/grpc-web+proto
origin: https://www.aboutyou.cz
pragma: no-cache
referer: https://www.aboutyou.cz/
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36
x-grpc-web: 1
Request Payload
--------------------------------------------------------
+åEMAIL"PASSWORD(0
This is what I've created so far (can't find any way to fill in the payload):
import requests
from bs4 import BeautifulSoup

start_url = 'https://www.aboutyou.cz/'
post_link = 'https://grips-web.aboutyou.com/checkout.CheckoutV1/logInWithEmail'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3',
    'content-type': 'application/grpc-web+proto',
    'origin': 'https://www.aboutyou.cz',
    'referer': 'https://www.aboutyou.cz/',
    'x-grpc-web': '1'
}

payload = {
}

with requests.Session() as s:
    s.headers.update(headers)
    r = s.post(post_link, data=payload)
    print(r.status_code)
    print(r.url)
Steps to log in to that site manually:
Go to this site
This is how to get the login form
Login form looks like this
How can I log in to that site using the requests module?
I don't think that you'll be able to use Python Requests to log in to your target site.
Your post_link url:
post_link = 'https://grips-web.aboutyou.com/checkout.CheckoutV1/logInWithEmail'
indicates a gRPC endpoint. gRPC requires HTTP/2, and Python Requests sends HTTP/1.1 requests only.
Additionally, I noted that the target site also uses Cloudflare, which is difficult to bypass with Python, especially when using Python Requests:
'Set-Cookie': '__cf_bm=11d867459fe0951da4157b475cf88eb3ab7658fb-1629229293-1800-AeFomlmROcmUYcRosxxcSnoJkGOW/WXjUe1WxK6SkM2eXIbnAqXRlpwOkpvOfONrbApJd4Qwj+a8+kOzLAfpHIE=; path=/; expires=Tue, 17-Aug-21 20:11:33 GMT; domain=.aboutyou.com; HttpOnly; Secure; SameSite=None', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '6805616b8facf1b2-ATL', 'Content-Encoding': 'gzip'}
Here are previous Stack Overflow questions on Python Requests with gRPC:
Can't make gRPC work with python requests rest api call
Send plain JSON to a gRPC server using python
I looked through the GitHub repository for Python Requests and saw that HTTP/2 has been a requested feature for almost 7 years.
During my research, I discovered HTTPX, an HTTP client for Python 3 that provides sync and async APIs and supports both HTTP/1.1 and HTTP/2. The documentation states that the package is stable but still considered a beta at this point. I would recommend trying HTTPX to see if it solves your issue with logging in to your target site.
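On the payload itself: a gRPC-Web request body is not a form dict but a binary frame, made of a 1-byte flag, a 4-byte big-endian length, and then the serialized protobuf message. That appears consistent with the capture above, where the visible body starts with "+" (byte value 43) and content-length is 48, that is 5 + 43. The sketch below shows only the framing; the message bytes are a hypothetical placeholder, since the site's protobuf schema is not known here.

```python
import struct


def frame_grpc_web(message: bytes) -> bytes:
    """Prefix a serialized protobuf message with the 5-byte gRPC-Web header."""
    # 1 byte: flags (0 = uncompressed data frame); 4 bytes: big-endian length.
    return struct.pack(">BI", 0, len(message)) + message


# Hypothetical placeholder; a real body would be the serialized login message.
message = b"\x00" * 43
body = frame_grpc_web(message)

print(len(body))
print(body[:5])
```

Bytes framed this way would be passed as data=body (bytes, not a dict) together with the content-type: application/grpc-web+proto header from the question.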
I'm trying to get the value of the API key available within the headers from this website. The value of the API key can be found using this link within the headers (once the page is reloaded).
In dev tools, I found the headers like the following where API key and value are present:
Accept: application/json
Content-Type: application/json
Referer: https://www.pinnacle.com/en/
Sec-Fetch-Mode: cors
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36
X-API-Key: CmX2KcMrXuFmNg6YFbmTxE0y9CIrOi0R
X-Device-UUID: 3a10d97d-5dc63d32-9b562999-2a023260
However, when I print the headers (using the second link), I get the following items, none of which is that API key.
{'Date': 'Tue, 20 Aug 2019 03:53:47 GMT', 'Content-Type': 'application/problem+json', 'Content-Length': '119', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=d43bcbb47c4b830f22e994d7311c5f37d1566273227; expires=Wed, 19-Aug-20 03:53:47 GMT; path=/; domain=.pinnacle.com; HttpOnly', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Methods': 'HEAD, GET, POST, PUT, DELETE, OPTIONS', 'Access-Control-Allow-Headers': 'Accept, Content-Type, X-API-Key, X-Device-UUID, X-Session, X-Language', 'Access-Control-Max-Age': '86400', 'Cache-Control': 'no-cache', 'CF-Cache-Status': 'MISS', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '50916c15eb6ee03b-DFW'}
I've tried with:
import requests
from bs4 import BeautifulSoup
link = 'https://guest.api.arcadia.pinnacle.com/0.1/sports/33/markets/live/straight'
res = requests.get(link)
print(res.headers)
How can I get the value of API key from that site?
Let's just break down how 'requests' works.
When you say:
res = requests.get(link)
It means you're sending a request to the API server, and you're supposed to provide the API key with that request. The key isn't something 'requests' receives in the response; it's something 'requests' needs in order to perform the request.
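As a rough illustration of that point, the key goes into the headers you send. The header names and values below are copied from the dev-tools capture in the question; the key may have been rotated since, so treat it as a placeholder. The request is only prepared, not sent.

```python
import requests

link = "https://guest.api.arcadia.pinnacle.com/0.1/sports/33/markets/live/straight"

# Values copied from the question's dev-tools capture; possibly stale.
headers = {
    "Accept": "application/json",
    "X-API-Key": "CmX2KcMrXuFmNg6YFbmTxE0y9CIrOi0R",
    "X-Device-UUID": "3a10d97d-5dc63d32-9b562999-2a023260",
}

# Prepare without sending, to confirm the key travels in the request headers.
prepared = requests.Request("GET", link, headers=headers).prepare()
print(prepared.headers["X-API-Key"])

# To actually call the API: res = requests.get(link, headers=headers)
```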
I'm still very new to coding (been coding for a week) so I am struggling with a very basic function.
I am trying to log in to a website using Python; however, I am having a hard time changing the Set-Cookie header.
See my current code below:
import requests
targetURL = "http://hostip/v2_Website/aspx/login.aspx"
headers = {
    "Host": "*host IP*",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
# proxies is defined elsewhere (not shown in the question)
response = requests.get(url=targetURL,
                        proxies=proxies,
                        headers=headers)
response_headers = response.headers
When I print the response.headers I get the following:
{'Cache-Control': 'no-cache, no-store', 'Pragma': 'no-cache,no-cache', 'Content-Length': '15671', 'Content-Type': 'text/html; charset=utf-8', 'Expires': '-1', 'Server': 'Microsoft-IIS/7.5', 'X-AspNet-Version': '2.0.50727', 'Set-Cookie': 'ASP.NET_SessionId=vq5q4lzlrqiiebbmxw341yic; path=/; HttpOnly, CookieLoginAttempts=5; expires=Tue, 14-Aug-2018 17:14:09 GMT; path=/', 'X-Powered-By': 'ASP.NET', 'Date': 'Tue, 14 Aug 2018 07:14:10 GMT', 'Connection': 'close'}
Obviously, when I use these headers in my HTTP POST it fails, because the POST carries a Set-Cookie header instead of a Cookie header.
My objectives are as follows:
Update/change the Set-Cookie key to Cookie
Then I would like to remove values that are not needed in the Cookie key
Add other keys and values
Ultimately I would like to change the headers to the following so I can use it for my POST in order for me to pass login credentials:
POST /Test server/aspx/Login.aspx?function=Welcome HTTP/1.1
Host: *Host IP*
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://*HostIP*/v2_Website/aspx/main.aspx?function=Welcome
Cookie: ASP.NET_SessionId=3vy0fy55xsmffhbotikrwh55; CookieLoginAttempts=5; Admin=false
Connection: close
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
Content-Length: 220
Is my objective above even possible? If so, how does one achieve it? I don't understand the process of modifying a dictionary I can't see.
Again, please note that I am still very green in the world of coding and trying to "think like a coder", so keeping responses a little less technical would be highly appreciated, just so I can understand your response and advice. Any help would be great!
I was able to find the answer after quite a bit of research.
Instead of trying to edit the headers manually, I did the following:
import requests
session_requests = requests.session()
login_url = "http://*host ip*/v2_Website/aspx/Login.aspx"
result = session_requests.get(login_url)
result = session_requests.post(login_url,
                               headers=dict(referer=login_url))
This pulls through the needed cookie and adds everything as needed. My headers come back as follows:
POST /_v2_Website/aspx/Login.aspx HTTP/1.1
Host: *host IP*
User-Agent: python-requests/2.18.4
Accept-Encoding: gzip, deflate
Accept: */*
Connection: close
referer: http://*hostIP*/v2_Website/aspx/Login.aspx
Cookie: ASP.NET_SessionId=3crqoo45hnn21anuqulmmr55; CookieLoginAttempts=5
Content-Length: 79
Content-Type: application/x-www-form-urlencoded
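To see why the session approach works, here is a small sketch that needs no server. The two cookies are placed in the session's jar by hand, standing in for what session.get(login_url) would store from the Set-Cookie header; the host name and the form fields are hypothetical placeholders.

```python
import requests

login_url = "http://example.com/v2_Website/aspx/Login.aspx"  # hypothetical host

session = requests.Session()

# Stand-in for what session.get(login_url) would store from Set-Cookie.
session.cookies.set("ASP.NET_SessionId", "3crqoo45hnn21anuqulmmr55")
session.cookies.set("CookieLoginAttempts", "5")

# Prepare the POST without sending it; the form fields are placeholders.
request = requests.Request(
    "POST",
    login_url,
    data={"username": "user", "password": "secret"},
    headers={"Referer": login_url},
)
prepared = session.prepare_request(request)

# The session turned the stored cookies into a Cookie request header.
print(prepared.headers["Cookie"])
```

The session does the Set-Cookie to Cookie conversion itself, so there is no need to edit the response headers dictionary.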
I am trying to make a GET request to a webpage, but I keep getting a 404 error using Python 2.7 with the requests package. However, using curl I get a successful response, and the page also loads in the browser.
Python
r = requests.get('https://www.ynet.co.il/articles/07340L-446694800.html')
r.status_code
404
r.headers
{'backend-cache-control': '', 'Content-Length': '20661', 'WAI': '02',
'X-me': '08', 'vg_id': '1', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding',
'Last-Modified': 'Sun, 20 May 2018 01:20:04 GMT', 'Connection': 'keep-alive',
'V-TTL': '47413', 'Date': 'Sun, 20 May 2018 14:55:21 GMT', 'VX-Cache': 'HIT',
'Content-Type': 'text/html; charset=UTF-8', 'Accept-Ranges': 'bytes'}
r.reason
'Not Found'
CURL
curl https://www.ynet.co.il/articles/07340L-446694800.html
The code is correct, it works for some other sites (see https://repl.it/repls/MemorableUpbeatExams ).
This site loads for me in the browser, so I confirm your issue.
It might be that they block Python requests, because they don't want their site scraped and analysed by bots, but they forgot to block curl.
What you are doing is probably violating www.ynet.co.il terms of use, and you shouldn't do that.
A 404 is displayed when:
The URL is incorrect and the response is actually accurate.
The URL contains trailing spaces.
The website may not like HTTP(S) requests coming from Python code. Change your headers, for example by adding "www." to your Referer URL or by sending a browser-like User-Agent.
resp = requests.get(r'http://www.xx.xx.xx.xx/server/rest/line/125')
or
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
result = requests.get('https://www.transfermarkt.co.uk', headers=headers)
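To make the User-Agent point concrete, the sketch below compares what requests sends by default with a browser-like header. Nothing is sent over the network; the URL is the one from the question, and whether the site really filters on this header is an assumption.

```python
import requests

url = "https://www.ynet.co.il/articles/07340L-446694800.html"

# What requests identifies itself as when no User-Agent is supplied.
print(requests.utils.default_user_agent())

browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    )
}

session = requests.Session()

# Prepare (but do not send) one request with defaults and one with the override.
default_prepared = session.prepare_request(requests.Request("GET", url))
custom_prepared = session.prepare_request(
    requests.Request("GET", url, headers=browser_headers)
)

print(default_prepared.headers["User-Agent"])
print(custom_prepared.headers["User-Agent"])
```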
I'm looking to write a script that can automatically download .zip files from the Bureau of Transportation Statistics Carrier Website, but I'm having trouble getting the same response headers as I can see in Chrome when I download the zip file. I'm looking to get a response header that looks like this:
HTTP/1.1 302 Object moved
Cache-Control: private
Content-Length: 183
Content-Type: text/html
Location: http://tsdata.bts.gov/103627300_T_T100_SEGMENT_ALL_CARRIER.zip
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
Date: Thu, 21 Apr 2016 15:56:31 GMT
However, when calling requests.post(url, data=params, headers=headers) with the same information that I can see in the Chrome network inspector I am getting the following response:
>>> res.headers
{'Cache-Control': 'private', 'Content-Length': '262', 'Content-Type': 'text/html', 'X-Powered-By': 'ASP.NET', 'Date': 'Thu, 21 Apr 2016 20:16:26 GMT', 'Server': 'Microsoft-IIS/8.5'}
It's got pretty much everything except it's missing the Location key that I need in order to download the .zip file with all of the data I want. Also the Content-Length value is different, but I'm not sure if that's an issue.
I think that my issue has something to do with the fact that when you click "Download" on the page it actually sends two requests that I can see in the Chrome network console. The first request is a POST request that yields an HTTP response of 302 and then has the Location in the response header. The second request is a GET request to the url specified in the Location value of the response header.
Should I really be sending two requests here? Why am I not getting the same response headers using requests as I do in the browser? FWIW I used curl -X POST -d /*my data*/ and got back this in my terminal:
<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found here.</body>
Really appreciate any help!
I was able to download the zip file that I was looking for by using almost all of the headers that I could see in the Google Chrome web console. My headers looked like this:
{'Connection': 'keep-alive', 'Cache-Control': 'max-age=0', 'Referer': 'http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=293', 'Origin': 'http://www.transtats.bts.gov', 'Upgrade-Insecure-Requests': '1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36', 'Cookie': 'ASPSESSIONIDQADBBRTA=CMKGLHMDDJIECMNGLMDPOKHC', 'Accept-Language': 'en-US,en;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Content-Type': 'application/x-www-form-urlencoded'}
And then I just wrote:
res = requests.post(url, data=form_data, headers=headers)
where form_data was copied from the "Form Data" section of the Chrome console. Once I got that request, I used the zipfile and io modules to parse the content of the response stored in res. Like this:
import zipfile, io
# extractall() writes the archive members to the current working directory
zipfile.ZipFile(io.BytesIO(res.content)).extractall()
and then the file was in the directory where I ran the Python code.
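The unzip step can be tried without any download. In this sketch a small archive is built in memory to stand in for res.content; the file name and its contents are made up for illustration, and a temporary directory is used instead of the current one.

```python
import io
import os
import tempfile
import zipfile

# Build a small zip in memory; this stands in for res.content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "carrier,passengers\nXX,100\n")
content = buffer.getvalue()

# Open the downloaded bytes as a zip and extract every member to a directory.
target_dir = tempfile.mkdtemp()
with zipfile.ZipFile(io.BytesIO(content)) as archive:
    names = archive.namelist()
    archive.extractall(target_dir)

print(names)
print(os.path.exists(os.path.join(target_dir, "data.csv")))
```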
Thanks to the users who answered on this thread.