I'm trying to parse an auction website with Python and requests.
But so far it returns a 403 Forbidden error (Cloudflare protection).
Here's my code below:
import requests
url = "https://www.interencheres.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
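As a quick diagnostic (a small sketch, not part of the original question), printing the status code and the Server header makes it easier to confirm the block is coming from Cloudflare:
import requests

url = "https://www.interencheres.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)

# A Cloudflare challenge typically comes back as a 403 with a
# "Server: cloudflare" header and an HTML challenge page as the body.
print(response.status_code)
print(response.headers.get("Server"))
print(response.text[:200])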
Good morning,
Since yesterday, I've been getting timeouts on requests to the eBay website. The code is simple:
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
html = requests.get("https://www.ebay.es", headers=headers).text
I tested with Google and it works. This is the response I receive from eBay:
'\nGateway Timeout - In read \n\nGateway Timeout\nThe proxy server did not receive a timely response from the upstream server.\nReference #1.477f1602.1645295618.7675ccad\n\n'
What happened or changed? How could I solve it?
Removing the headers should work. Perhaps they don't like that user agent for some reason.
import requests
# headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
headers = {}
url = "https://www.ebay.es"
response = requests.get(url, headers=headers)
html_text = response.text
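As a quick check (not in the original answer), printing the status code confirms whether the gateway timeout is gone:
print(response.status_code)  # expect 200 once the request goes through
print(html_text[:100])       # start of the returned HTML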
I am trying to scrape the following web page
https://jamanetwork.com/journals/jamaneurology/article-abstract/2696970
but I am getting an error.
import requests
from bs4 import BeautifulSoup

url = 'https://jamanetwork.com/journals/jamaneurology/article-abstract/2696970'
result = requests.get(url)
soup = BeautifulSoup(result.content, 'html.parser')
print(soup.prettify())
Result:
403 Forbidden: Request forbidden by administrative rules.
I can access the web page in a browser with no credentials, so I'm not sure why I get a 'Request forbidden' error while scraping.
As mentioned, you should add a User-Agent header to your request:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
You can check the headers your own browser sends by opening the dev tools and looking at the Network section. Read more about the User-Agent header.
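For comparison (a small sketch, not from the original answer), you can also print the headers requests sends when you don't set any yourself; the default python-requests User-Agent there is usually what gets blocked:
import requests

# Default headers used when no headers are passed explicitly.
# The User-Agent looks like "python-requests/2.x.y", which many
# sites reject outright.
print(requests.utils.default_headers())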
Example
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url = 'https://jamanetwork.com/journals/jamaneurology/article-abstract/2696970'
result = requests.get(url, headers=headers)
soup = BeautifulSoup(result.content, 'html.parser')
print(soup.prettify())
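A small follow-up (continuing from the snippet above, not part of the original answer): raising on error status makes a lingering block obvious instead of quietly parsing an error page:
# raise_for_status() throws requests.HTTPError on 4xx/5xx responses,
# so a remaining 403 fails loudly rather than being prettified as HTML.
result.raise_for_status()
print(result.status_code)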
I am able to open this URL in a browser and see the response in JSON format. However, when I use the requests module, the call never returns a response.
import requests
response = requests.get('https://api.nasdaq.com/api/calendar/earnings?date=2021-02-23')
What is wrong here?
This worked for me:
import requests

url = 'https://api.nasdaq.com/api/calendar/earnings?date=2021-02-23'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
response = requests.get(url, headers=headers)
Explanation
The site is blocking requests that come from Python's default client. Refer to the explanation here.
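Once the request goes through, the body can be decoded with response.json(). A minimal sketch (the payload structure isn't shown in the original answer, so only the top-level keys are printed):
import requests

url = 'https://api.nasdaq.com/api/calendar/earnings?date=2021-02-23'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
response = requests.get(url, headers=headers)

# Decode the JSON body and list the top-level keys rather than
# assuming any particular field names.
data = response.json()
print(list(data))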
When you add the request headers that appear when inspecting the call in Chrome's dev tools, the request works fine in Python:
import requests

headers = {
    "authority": "api.nasdaq.com",
    "scheme": "https",
    "path": "/api/calendar/earnings?date=2021-02-23",
    "pragma": "no-cache",
    "cache-control": "no-cache",
    "accept": "application/json, text/plain, */*",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
    "origin": "https://www.nasdaq.com",
    "sec-fetch-site": "same-site",
    "sec-fetch-mode": "cors",
    "sec-fetch-dest": "empty",
    "referer": "https://www.nasdaq.com/",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9,es;q=0.8,nl;q=0.7",
}
response = requests.get('https://api.nasdaq.com/api/calendar/earnings?date=2021-02-23', headers=headers)
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}
url = 'https://www.nseindia.com/api/chart-databyindex?index=ACCEQN'
r = requests.get(url, headers=headers)
data = r.json()
print(data)
# 'grapthData' (sic) is the key name as returned by the API
prices = data['grapthData']
print(prices)
It was working fine, but now it shows the error "Response [401]".
Well, it's all about the site's authentication requirements. The endpoint now requires a certain level of authorization to be accessed like this.
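A sketch of one common workaround (my assumption, not something stated in this answer): nseindia.com typically sets cookies on its main pages that the /api endpoints then expect, so visiting the homepage first within the same Session often restores access:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}

with requests.Session() as s:
    # Hit the main site first so the session picks up the cookies
    # the API endpoint appears to require.
    s.get('https://www.nseindia.com', headers=headers)
    r = s.get('https://www.nseindia.com/api/chart-databyindex?index=ACCEQN', headers=headers)
    print(r.status_code)
    if r.ok:
        # 'grapthData' is the key name used in the original question's code
        print(r.json().get('grapthData', [])[:5])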
Currently, I am trying to make a user generator for a website, and there is a fundamental problem I've been facing. The code below runs, but what it prints out is:
The page has expired due to inactivity. Please refresh and try again
I have seen some solutions involving the XSRF token, but either I am doing something wrong or the problem is not related to the token.
import requests
import bs4

with requests.Session() as s:
    s.get('http://www.watchill.org/register')
    token = s.cookies["XSRF-TOKEN"]
    agent = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 OPR/62.0.3331.116",
             "XSRF-TOKEN": token}
    r = s.post('http://www.watchill.org/register', headers=agent)
    print(bs4.BeautifulSoup(r.content, "html.parser"))
The problem is with your CSRF token: it is being sent in the wrong header.
I didn't check whether this code does what you're aiming for, but it does not return the "page expired" message:
import requests
from bs4 import BeautifulSoup

def getXsrf(cookies):
    # Look up the XSRF token in the session's cookie jar
    for cookie in cookies:
        if cookie.name == 'XSRF-TOKEN':
            return cookie.value

with requests.Session() as s:
    s.get('http://www.watchill.org/register')
    xsrf = getXsrf(s.cookies)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 OPR/62.0.3331.116"}
    # Send the token back in the X-XSRF-TOKEN header (note the leading "X-")
    headers['X-XSRF-TOKEN'] = xsrf
    r = s.post('http://www.watchill.org/register', headers=headers)
    print(BeautifulSoup(r.content, "html.parser"))
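Note the header name in the working version: the value is read from the XSRF-TOKEN cookie but sent back as X-XSRF-TOKEN, which is the header name Laravel-style backends (the "page has expired" message is Laravel's) typically look for.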