I am trying to make a GET request to a webpage but I keep getting a 404 error using Python 2.7 with the requests package. However, using curl I get a successful response, and the page also works in the browser.
Python
r = requests.get('https://www.ynet.co.il/articles/0,7340,L-4466948,00.html')
r.status_code
404
r.headers
{'backend-cache-control': '', 'Content-Length': '20661', 'WAI': '02',
'X-me': '08', 'vg_id': '1', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding',
'Last-Modified': 'Sun, 20 May 2018 01:20:04 GMT', 'Connection': 'keep-alive',
'V-TTL': '47413', 'Date': 'Sun, 20 May 2018 14:55:21 GMT', 'VX-Cache': 'HIT',
'Content-Type': 'text/html; charset=UTF-8', 'Accept-Ranges': 'bytes'}
r.reason
'Not Found'
curl
curl https://www.ynet.co.il/articles/0,7340,L-4466948,00.html
The code is correct; it works for some other sites (see https://repl.it/repls/MemorableUpbeatExams ).
This site loads for me in the browser as well, so I can confirm your issue.
It might be that they block Python requests, because they don't want their site scraped and analysed by bots, but they forgot to block curl.
What you are doing is probably violating the www.ynet.co.il terms of use, and you shouldn't do that.
A 404 is typically returned when:
- The URL is incorrect, and the 404 response is actually accurate.
- There are trailing spaces in the URL.
- The website doesn't like HTTP(S) requests coming from Python code. In that case, try adding "www." to the URL, e.g.:
resp = requests.get(r'http://www.xx.xx.xx.xx/server/rest/line/125')
or send a browser-like User-Agent header:
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
result = requests.get('https://www.transfermarkt.co.uk', headers=headers)
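For example, applying the same User-Agent trick to the URL from the question might look like this (a sketch only; whether ynet actually filters on the User-Agent is an assumption):

import requests

# Some sites reject the default python-requests User-Agent, so pretend to be a browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
r = requests.get('https://www.ynet.co.il/articles/0,7340,L-4466948,00.html', headers=headers)
print(r.status_code, r.reason)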
Related
I am trying to log in to a website (using Python requests) by passing the username and password in a GET request.
But I am getting a 403 Forbidden error. I am able to log in to the website through the browser using the same credentials.
I have tried using different headers but nothing seems to be working.
I have tried all of the commented-out headers, and also the proxy. This is basically VB code we need to convert to Python; the VB version used a proxy, hence I tried using one.
I need to log in to the website and download a file, abc.csv.
code:
# Import library
import requests
# Define url, username, and password
url = 'https://example.com'
#r = requests.get(url, auth=(user, password))
#s = requests.Session()
#headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
#headers = {"Authorization" : "Basic" , "Pragma" : "no-cache","If-Modified-Since": "Thu, 01 Jun 1970 00:00:00 GMT",'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
#headers={'Server': 'nginx', 'Date': 'Thu, 23 Sep 2021 12:44:25 GMT', 'Content-Type': 'text/html; charset=iso-8859-1', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip'}
proxies= {"https":"http://10.10.10.10:8080"}
r = requests.get(url, proxies=proxies, headers=headers)
requests.post(url, data={"username":"user","password":"pwd"})
#s.get(url)
# Send an HTTP request to the server and save
# the HTTP response in a response object called r
with open("abc.csv", 'wb') as f:
    # Write the contents of the response (r.content)
    # to a new file in binary mode.
    f.write(r.content)
Thanks,
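For reference, the usual shape of a session-based login followed by a file download is roughly the following (a sketch only; the login URL, the form field names, and whether the site uses a plain form login at all are assumptions that need to be checked in the browser's network tab):

import requests

s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'})

# Hypothetical login endpoint and form field names -- take the real ones
# from the browser's network tab while logging in manually.
login = s.post('https://example.com/login', data={'username': 'user', 'password': 'pwd'})
login.raise_for_status()

# Download the file with the same session so the login cookies are reused.
r = s.get('https://example.com/abc.csv')
r.raise_for_status()
with open('abc.csv', 'wb') as f:
    f.write(r.content)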
I am trying to query trader portfolios, e.g. this one: https://www.etoro.com/people/jaynemesis/portfolio
I know the basic setup should be as follows:
import requests
import json
response = requests.get(url, headers=header)
data = response.json()
Analyzing the Request Headers tab, I set the following parameters as headers:
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
'Host': 'www.etoro.com',
'Referer': 'https://www.etoro.com/people/jaynemesis/portfolio/history'
}
and in Inspect -> Network tab -> Headers I found the following link, with a GET prefix indicating a GET-type request:
url = 'https://www.etoro.com/sapi/trade-data-real/live/public/portfolios?cid=3378352&client_request_id=7a29e39e-5324-4234-bac7-d54e8fe4b5a6'
When printing response.status_code, I get a status code of 512. What am I missing? Is it possible that it is not at all possible to query this data (is it perhaps blocked somehow)?
EDIT:
response.text returns "error":{"failureReason":"Something went wrong, please try again"}
response.headers returns a long dict starting with:
{'Date': 'Mon, 19 Oct 2020 14:33:14 GMT', 'Content-Length': '66', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid= ... }
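Putting those pieces together into one runnable snippet (the cid and client_request_id values are the ones captured above and are likely session-specific, so treat them as placeholders):

import requests

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
          'Host': 'www.etoro.com',
          'Referer': 'https://www.etoro.com/people/jaynemesis/portfolio/history'}

url = ('https://www.etoro.com/sapi/trade-data-real/live/public/portfolios'
       '?cid=3378352&client_request_id=7a29e39e-5324-4234-bac7-d54e8fe4b5a6')

response = requests.get(url, headers=header)
print(response.status_code)   # 512, as described above
print(response.text)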
I'm trying to get the value of the API key available within the headers from this website. The value of the API key can be found using this link within the headers (once the page is reloaded).
In dev tools, I found the headers like the following where API key and value are present:
Accept: application/json
Content-Type: application/json
Referer: https://www.pinnacle.com/en/
Sec-Fetch-Mode: cors
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36
X-API-Key: CmX2KcMrXuFmNg6YFbmTxE0y9CIrOi0R
X-Device-UUID: 3a10d97d-5dc63d32-9b562999-2a023260
However, when I print the response headers (using the second link), I get the following items, but not that API key.
{'Date': 'Tue, 20 Aug 2019 03:53:47 GMT', 'Content-Type': 'application/problem+json', 'Content-Length': '119', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=d43bcbb47c4b830f22e994d7311c5f37d1566273227; expires=Wed, 19-Aug-20 03:53:47 GMT; path=/; domain=.pinnacle.com; HttpOnly', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Methods': 'HEAD, GET, POST, PUT, DELETE, OPTIONS', 'Access-Control-Allow-Headers': 'Accept, Content-Type, X-API-Key, X-Device-UUID, X-Session, X-Language', 'Access-Control-Max-Age': '86400', 'Cache-Control': 'no-cache', 'CF-Cache-Status': 'MISS', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '50916c15eb6ee03b-DFW'}
I've tried with:
import requests
from bs4 import BeautifulSoup
link = 'https://guest.api.arcadia.pinnacle.com/0.1/sports/33/markets/live/straight'
res = requests.get(link)
print(res.headers)
How can I get the value of API key from that site?
Let's just break down how 'requests' works.
When you say:
res = requests.get(link)
It means you're sending the API server a request, and you're supposed to provide the API key here. X-API-Key isn't something requests receives back after a request; it's something requests needs in order to perform the request.
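In other words, the key goes into the request headers rather than coming out of the response. A sketch using the header values captured from dev tools in the question (keys like this are usually short-lived, so a stale value may simply be rejected):

import requests

link = 'https://guest.api.arcadia.pinnacle.com/0.1/sports/33/markets/live/straight'
headers = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'Referer': 'https://www.pinnacle.com/en/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    'X-API-Key': 'CmX2KcMrXuFmNg6YFbmTxE0y9CIrOi0R',         # value captured from dev tools
    'X-Device-UUID': '3a10d97d-5dc63d32-9b562999-2a023260',  # value captured from dev tools
}
res = requests.get(link, headers=headers)
print(res.status_code)
print(res.text)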
I'm looking to write a script that can automatically download .zip files from the Bureau of Transportation Statistics Carrier Website, but I'm having trouble getting the same response headers as I can see in Chrome when I download the zip file. I'm looking to get a response header that looks like this:
HTTP/1.1 302 Object moved
Cache-Control: private
Content-Length: 183
Content-Type: text/html
Location: http://tsdata.bts.gov/103627300_T_T100_SEGMENT_ALL_CARRIER.zip
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
Date: Thu, 21 Apr 2016 15:56:31 GMT
However, when calling requests.post(url, data=params, headers=headers) with the same information that I can see in the Chrome network inspector I am getting the following response:
>>> res.headers
{'Cache-Control': 'private', 'Content-Length': '262', 'Content-Type': 'text/html', 'X-Powered-By': 'ASP.NET', 'Date': 'Thu, 21 Apr 2016 20:16:26 GMT', 'Server': 'Microsoft-IIS/8.5'}
It's got pretty much everything except it's missing the Location key that I need in order to download the .zip file with all of the data I want. Also the Content-Length value is different, but I'm not sure if that's an issue.
I think that my issue has something to do with the fact that when you click "Download" on the page it actually sends two requests that I can see in the Chrome network console. The first request is a POST request that yields an HTTP response of 302 and then has the Location in the response header. The second request is a GET request to the url specified in the Location value of the response header.
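For context, requests follows redirects by default (the chain is visible in res.history and the final URL in res.url), so you don't normally need two explicit calls. To reproduce the browser's two-step flow explicitly, a sketch, assuming url, params, and headers are the values already copied from the network console, and carrier_data.zip is just an example output name:

import requests

# Step 1: POST the form, but keep the 302 visible instead of letting
# requests follow it automatically.
res = requests.post(url, data=params, headers=headers, allow_redirects=False)
print(res.status_code)                 # expect 302 if the server behaves as in the browser
zip_url = res.headers.get('Location')  # the .zip URL the browser is redirected to

# Step 2: fetch the zip file from the Location URL.
zip_res = requests.get(zip_url)
with open('carrier_data.zip', 'wb') as f:
    f.write(zip_res.content)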
Should I really be sending two requests here? Why am I not getting the same response headers using requests as I do in the browser? FWIW I used curl -X POST -d /*my data*/ and got back this in my terminal:
<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found here.</body>
Really appreciate any help!
I was able to download the zip file that I was looking for by using almost all of the headers that I could see in the Google Chrome web console. My headers looked like this:
{'Connection': 'keep-alive', 'Cache-Control': 'max-age=0', 'Referer': 'http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=293', 'Origin': 'http://www.transtats.bts.gov', 'Upgrade-Insecure-Requests': '1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36', 'Cookie': 'ASPSESSIONIDQADBBRTA=CMKGLHMDDJIECMNGLMDPOKHC', 'Accept-Language': 'en-US,en;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Content-Type': 'application/x-www-form-urlencoded'}
And then I just wrote:
res = requests.post(url, data=form_data, headers=headers)
where form_data was copied from the "Form Data" section of the Chrome console. Once I got that response, I used the zipfile and io modules to extract the content of the response stored in res. Like this:
import zipfile, io
z = zipfile.ZipFile(io.BytesIO(res.content))
z.extractall()  # extract the archive contents into the current working directory
and then the file was in the directory where I ran the Python code.
Thanks to the users who answered on this thread.
I am having an issue with submitting an HTTP POST request. The purpose of this program is to scrape the lyrics off a website and then run that string through a text summarizer. I am having an issue submitting the POST request on the summarizer's website. Currently, with the code below, it does not submit the request; it just returns the page. I think it may be due to the Content-Type being different, but I am not sure.
My code:
def summarize(lyrics):
    url = 'http://www.freesummarizer.com'

    values = {'text': lyrics,
              'maxsentences': '1',
              'maxtopwords': '40',
              'email': 'your#email.com'}

    headers = {'User-Agent': 'Mozilla/5.0'}

    cookies = {'_jsuid': '777245265', '_ga': 'GA1.2.164138903.1423973625', '__smToken': 'elPdHJINsP5LvAYhia6OAA68', '__smListBuilderShown': 'true', '_first_pageview': '1', '_gat': '1', '_eventqueue': '%7B%22heatmap%22%3A%5B%7B%22type%22%3A%22heatmap%22%2C%22href%22%3A%22%252F%22%2C%22x%22%3A324%2C%22y%22%3A1800%2C%22w%22%3A640%7D%5D%2C%22events%22%3A%5B%5D%7D', 'PHPSESSID': '28b0843d49700e134530fbe32ea62923', '__smSmartbarShown': 'true'}

    r = requests.post(url, data=values, headers=headers)
    print(r.text)
My response headers:
{'transfer-encoding': 'chunked',
 'set-cookie': 'PHPSESSID=1f10ec11e6f9040cbb5a81e16bfcdf7f; path=/',
 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'connection': 'Keep-Alive',
 'pragma': 'no-cache',
 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0',
 'date': 'Fri, 27 Feb 2015 18:38:41 GMT',
 'content-type': 'text/html'}
The request headers of a successful submission on this website look like this:
Host: freesummarizer.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:35.0) Gecko/20100101 Firefox/35.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://freesummarizer.com/
Cookie: _jsuid=777245265; _ga=GA1.2.164138903.1423973625; __smToken=elPdHJINsP5LvAYhia6OAA68; __smListBuilderShown=true; _first_pageview=1; _gat=1; _eventqueue=%7B%22heatmap%22%3A%5B%7B%22type%22%3A%22heatmap%22%2C%22href%22%3A%22%252F%22%2C%22x%22%3A324%2C%22y%22%3A1800%2C%22w%22%3A640%7D%5D%2C%22events%22%3A%5B%5D%7D; PHPSESSID=28b0843d49700e134530fbe32ea62923; __smSmartbarShown=true
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 6044
Everything seems to be working just fine with requests.
But I think the issue here is that you are using the wrong tool for the job.
The tool I believe you are looking for is Selenium.
Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) also be automated as well.
You should absolutely take a look at this tool.
Selenium docs
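A minimal sketch of that approach (assuming Selenium 4 with a Chrome driver available; the field names come from the values dict in the question, but the exact locators on freesummarizer.com would need to be checked against the page source):

from selenium import webdriver
from selenium.webdriver.common.by import By

lyrics = 'lyrics text to summarize'

driver = webdriver.Chrome()
driver.get('http://www.freesummarizer.com')

# Fill in the form; field names assumed to match the POST data from the question.
driver.find_element(By.NAME, 'text').send_keys(lyrics)
max_sentences = driver.find_element(By.NAME, 'maxsentences')
max_sentences.clear()
max_sentences.send_keys('1')

# Submit the form that contains the text field.
driver.find_element(By.NAME, 'text').submit()

print(driver.page_source)  # the page with the generated summary
driver.quit()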