How to scrape website that comes up with 403 error? - python

I am trying to scrape the following web page
https://jamanetwork.com/journals/jamaneurology/article-abstract/2696970
but getting an error.
import requests
from bs4 import BeautifulSoup

url = 'https://jamanetwork.com/journals/jamaneurology/article-abstract/2696970'
result = requests.get(url)
soup = BeautifulSoup(result.content, 'html.parser')
print(soup.prettify())
Result:
403 Forbidden: Request forbidden by administrative rules.
The web page can be accessed with no credentials, so I am not sure why I get a 'Request forbidden' error when scraping.

As mentioned, you should add a user-agent header to your request:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
You can check the headers sent by your own browser by opening the dev tools and looking at the Network section. Read more about the user-agent header.
Example
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url = 'https://jamanetwork.com/journals/jamaneurology/article-abstract/2696970'
result = requests.get(url, headers=headers)
soup = BeautifulSoup(result.content, 'html.parser')
print(soup.prettify())
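As a quick sanity check, not part of the original answer, you can confirm the request actually succeeded before parsing:
result = requests.get(url, headers=headers)
print(result.status_code)  # expect 200; a 403 means the request is still being blocked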

Related

How to scrape web pages whose URL does not show page numbers

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.50'}
website = 'https://www.binance.com/en/markets'
# the same response is fetched once and re-parsed on every pass; the page number is never used
response = requests.get(website, headers=headers)
for page in range(0, 5):
    soup = BeautifulSoup(response.content, 'html.parser')
    results = soup.find_all('div', {'class': 'css-leyy1t'})
My objective is to scrape 'https://www.binance.com/en/markets', but every page shows the same URL; there is no page number in the URL that I can change in Python so that I can loop through and scrape all the pages.
I expected to see a URL like this: 'https://www.cars.com/shopping/results/?page=6&page_size=20&list_price_max=&makes[]=mercedes_benz&maximum_distance=20&models[]=&stock_type=cpo&zip='
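The markets page loads its rows with JavaScript, so the HTML returned by requests only contains the first batch and the URL never changes as you page through. One option, in the same spirit as the answers below, is to skip the HTML and call a data endpoint directly. The sketch below is only an illustration and assumes Binance's public REST endpoint /api/v3/ticker/24hr carries the market data you are after; check the Network tab in dev tools to see the exact request the page itself makes.
import requests

# sketch, not a verified answer: the public Binance REST API is assumed to carry
# the same market data the /en/markets page renders with JavaScript
url = 'https://api.binance.com/api/v3/ticker/24hr'
response = requests.get(url)
tickers = response.json()  # a list of dicts, one per trading pair

for ticker in tickers[:5]:
    print(ticker['symbol'], ticker['lastPrice'])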

Web scraping just returns None

I'm trying to make a pop-up program with the mir4 draco price, but the price returns None:
import requests
from bs4 import BeautifulSoup
urll = 'https://www.xdraco.com/coin/price/'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/86.0.4240.198 Safari/537.36"}
site = requests.get(urll, headers=headers)
soup = BeautifulSoup(site.content, 'html5lib')
price = soup.find('span', class_="amount")
print(price)
You won't be able to parse a site that is dynamically loaded with JS, as @jabbson mentioned.
This might be a way to get the data you want.
If you check the network requests made by the page, you will find that it calls a few different APIs. I found one that might have the info you're looking for. You can make POST requests to this API as shown below...
import requests
import json
headers = {'accept':'application/json, text/plain, */*','user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
html = requests.post('https://api.mir4global.com/wallet/prices/hydra/daily', headers=headers)
output = json.loads(html.text)
# 'output' is a dictionary. If we index the last element, we can get the latest data entry
print(output['Data'][-1])
OUTPUT:
{'CreatedDT': '2022-08-04 21:55:00', 'HydraPrice': '2.1301000000000001', 'HydraAmount': '13434', 'HydraPricePrev': '2.3336000000000001', 'HydraAmountPrev': '5972', 'HydraUSDWemixRate': '2.9401340627166839', 'HydraUSDKLAYRate': '0.29840511595654395', 'USDHydraRate': '6.2627795669928084'}
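If only the latest price is needed, it can be pulled straight out of that last entry; this small follow-up just indexes the 'HydraPrice' field shown in the output above and converts it to a float.
latest = output['Data'][-1]          # the dictionary printed above
price = float(latest['HydraPrice'])  # e.g. 2.1301 in the sample output
print(price)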

Python requests returning a different result than the original page (browser)

I am trying to build a simple web scraper to monitor Nike's site here in Brazil.
Basically, I want to track products that are in stock right now, to check when new products are added.
My problem is that when I navigate to https://www.nike.com.br/snkrs#estoque I see different products compared to what the python requests call returns.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
url = 'https://www.nike.com.br/snkrs#estoque'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(len(soup.find_all(class_='produto produto--comprar')))
This code gives me 40, but using the browser I can see 56 products https://prnt.sc/26jeo1i
The data comes from a different source and is spread across 3 pages.
import requests
from bs4 import BeautifulSoup
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

productList = []
for p in [1, 2, 3]:
    url = f'https://www.nike.com.br/Snkrs/Estoque?p={p}&demanda=true'
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    productList += soup.find_all(class_='produto produto--comprar')
Output:
print(len(productList))
56

Python Requests error 403 even with user agent

I'm trying to parse an auction website with Python and requests.
So far it returns a 403 Forbidden error from the Cloudflare protection.
Here is my code below:
import requests
url = "https://www.interencheres.com"
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
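Sites behind Cloudflare often block plain requests even when a browser user-agent is set, because the protection also checks cookies and a JavaScript challenge. One common workaround, sketched below as a suggestion rather than a guaranteed fix, is the cloudscraper package, which wraps requests and tries to pass the challenge for you.
import cloudscraper  # pip install cloudscraper

# cloudscraper exposes a requests-like session that attempts Cloudflare's JS challenge;
# whether it works depends on the protection level the site currently uses
scraper = cloudscraper.create_scraper()
response = scraper.get("https://www.interencheres.com")
print(response.status_code)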

Web scraping request stopped working, showing "Response [401]" in python?

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}
url = 'https://www.nseindia.com/api/chart-databyindex?index=ACCEQN'
r = requests.get(url, headers=headers)
data = r.json()
print(data)
prices = data['grapthData']
print(prices)
It was working fine, but now it shows the error "Response [401]".
This comes down to the site's authentication requirements: the endpoint now rejects requests that arrive without the cookies the site sets when you visit it in a browser, so a bare GET with only a user-agent header gets a 401.
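A common workaround, offered here as a sketch rather than a verified fix for the current site, is to use a requests.Session, hit the main page first so the session picks up the site's cookies, and then call the API endpoint with the same session.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}

with requests.Session() as session:
    # the first request is only there to pick up the cookies the site sets for browser visitors
    session.get('https://www.nseindia.com', headers=headers)
    r = session.get('https://www.nseindia.com/api/chart-databyindex?index=ACCEQN', headers=headers)
    data = r.json()

prices = data['grapthData']  # same key as in the original code
print(prices)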
