I have the following Python program that returns an HTTP 403: Forbidden error when I execute it.
Here's the code:
import os
import re
from bs4 import BeautifulSoup
#import urllib.request
#from urllib.request import request, urlopen
from urllib import request
import pandas as pd
import numpy as np
import datetime
import time
import openpyxl
for a in range(0, len(symbols), 1):
    """
    Attempt to get past the Forbidden error message.
    """
    #ua = UserAgent()
    url = "https://fintel.io/ss/us/" + symbols[a]
    """
    test urls:
    https://fintel.io/ss/us/A
    https://fintel.io/ss/us/BBY
    https://fintel.io/ss/us/WMT
    """
    print("Extracting Values for " + symbols[a] + ".")
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    try:
        page_request = request.Request(url, headers=header)
        page = request.urlopen(page_request)
        # old: page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, "html.parser", from_encoding="iso-8859-1")
The result I'm getting:
Extracting Values for A.
HTTP Error 403: Forbidden
Extracting Values for BBY.
HTTP Error 403: Forbidden
Extracting Values for WMT.
HTTP Error 403: Forbidden
Any suggestions on how to handle this?
That's not a BeautifulSoup-specific error; the website you're trying to scrape is probably protected by Cloudflare's anti-bot page. You can try using cfscrape to bypass this.
Your complaint is with Request rather than with BeautifulSoup: the website refuses to serve your requests. When I use that UA, it works fine and I receive a 200 document:
$ wget -U 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' -S https://fintel.io/ss/us/BBY
But then again, even using UA of 'Wget/1.20.3 (darwin19.0.0)' works just fine:
$ wget -S https://fintel.io/ss/us/BBY
It looks like you've been doing excessive crawling, and your IP is now being blocked by the web server.
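If rate limiting is in fact the cause, slowing down and retrying with exponential backoff is one common mitigation. This is a minimal sketch, not part of the original code: `fetch_with_backoff`, the stub fetcher, and the delay values are all illustrative assumptions.

```python
import time
from urllib.error import HTTPError

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url), retrying with exponentially growing delays on HTTP errors."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except HTTPError:
            if attempt == retries - 1:
                raise  # out of retries, re-raise the last error
            time.sleep(base_delay * (2 ** attempt))  # e.g. 1s, 2s, 4s, ...

# Demonstration with a stub fetcher that fails twice and then succeeds:
calls = []
def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise HTTPError(url, 403, "Forbidden", None, None)
    return "ok"

result = fetch_with_backoff(flaky_fetch, "https://fintel.io/ss/us/BBY", base_delay=0.01)
print(result)  # ok
```

Even with backoff, keep the request rate low; if the block is IP-based, only time (or asking the site operator) will lift it.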
I'm trying to automate a login using Python's requests module, but whenever I use a POST or GET request the server sends a 403 status code. The weird part is that I can access the same URL with any browser, but it just won't work with curl or requests.
Here is the code:
import requests
import lxml
from bs4 import BeautifulSoup
import os
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w")
FILE.write(ready)
FILE.close()
I'd appreciate any help or idea!
It's probably the /robots.txt that's blocking you.
Try overriding the user-agent with a custom one.
import requests
import lxml
from bs4 import BeautifulSoup
import os
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url, headers=headers).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w", encoding="utf-8")
FILE.write(ready)
FILE.close()
You also didn't specify the file encoding when opening the file.
I am trying to scrape some data from OLX. There are multiple pages, so I have added a loop. My code works for the first 4 iterations, but after i = 3 I get a 404 error, even though the pages beyond i = 3 exist and I can open them in the browser.
I tried changing the user agent, but that didn't work.
Site for page 4: https://www.olx.com.pk/mobiles_c1411/q-samsung-galaxy-s10?page=4&sorting=asc-price
Please help
import csv
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup
import time
import requests
headers = {'USER-AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
for i in range(0, 8, 1):
    page_html = requests.get('https://www.olx.com.pk/mobiles_c1411/q-samsung-galaxy-s10?page=' + str(i) + '&sorting=asc-price', headers=headers, timeout=10)
    print(page_html.status_code)
I was trying to get search results from a website, but I got a "Response [403]" message. I've found similar posts that solve the 403 error by adding headers to requests.post, but that didn't work for my problem. What should I do to correctly get the result I want?
from urllib.request import urlopen
import urllib.parse
import urllib.request
import requests
from bs4 import BeautifulSoup
url="https://www.metal-archives.com/"
html= urlopen(url)
print("The keyword you entered to search is: %s\n" % 'Bathory')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result=requests.post(url, data='Bathory', headers=headers)
print(result.content)
First of all, you don't need the headers, as you can see that you're getting status code 200:
>>> r = requests.get('https://www.metal-archives.com')
>>> r.status_code
200
If you want to search for anything, you can see that the url changes to
https://www.metal-archives.com/search?searchString=bathory
That means, you can directly format it using this:
>>> keyword = 'bathory'
>>> r = requests.get('https://www.metal-archives.com/search?searchString='+keyword)
>>> r.status_code
200
>>> 'bathory' in r.text
True
If you check the HTML, you'll find that the form method is GET (maybe that's why you get the 403 error):
<form id="search_form" action="https://www.metal-archives.com/search" method="get">
so all you need to do is construct the search URL:
#Music genre search
result=requests.get( "https://www.metal-archives.com/search?searchString={0}&type=band_genre".format("Bathory") )
#Band name search
result=requests.get( "https://www.metal-archives.com/search?searchString={0}&type=band_name".format("Bathory") )
Hi, I need to scrape a web page and extract the data-id attributes using a regular expression.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://clarity-project.info/tenders/?entity=38163425&offset=100")
bsObj = BeautifulSoup(html,"html.parser")
DataId = bsObg.findAll("data-id", {"skr":re.compile("data-id=[0-9,a-f]")})
for DataId in DataId:
    print(DataId["skr"])
when I run my program in Jupyter :
HTTPError: HTTP Error 403: Forbidden
It looks like the web server is asking you to authenticate before serving content to Python's urllib. However, they serve everything neatly to wget and curl and https://clarity-project.info/robots.txt doesn't seem to exist, so I reckon scraping as such is fine with them. Still, it might be a good idea to ask them first.
As for the code, simply changing the User Agent string to something they like better seems to work:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request
request = Request(
    'https://clarity-project.info/tenders/?entity=38163425&offset=100',
    headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'})
html = urlopen(request).read().decode()
(unrelated, you have another mistake in your code: bsObj ≠ bsObg)
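Incidentally, you can verify offline (without any network traffic) that the header is attached to the Request object before sending it:

```python
from urllib.request import Request

req = Request(
    'https://clarity-project.info/tenders/?entity=38163425&offset=100',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'})
# urllib stores header names in capitalized form, hence 'User-agent' here:
print(req.get_header('User-agent'))
```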
EDIT: added code below to answer an additional question from the comments:
What you seem to need is to find the value of the data-id attribute, no matter to which tag it belongs. The code below does just that:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
         '(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')
request = Request(url, headers={'User-Agent': agent})
html = urlopen(request).read().decode()
soup = BeautifulSoup(html, 'html.parser')
tags = soup.findAll(lambda tag: tag.get('data-id', None) is not None)
for tag in tags:
    print(tag['data-id'])
The key is to simply use a lambda expression as the parameter to the findAll function of BeautifulSoup.
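As a self-contained illustration of the lambda approach, here is the same technique applied to a made-up HTML snippet rather than the live page:

```python
from bs4 import BeautifulSoup

html = '<div data-id="a1"></div><span data-id="b2"></span><p>no attribute here</p>'
soup = BeautifulSoup(html, 'html.parser')
# Match any tag that carries a data-id attribute, regardless of tag name:
tags = soup.find_all(lambda tag: tag.get('data-id') is not None)
ids = [tag['data-id'] for tag in tags]
print(ids)  # ['a1', 'b2']
```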
The server is likely blocking your requests because of the default user agent. You can change this so that you will appear to the server to be a web browser. For example, a Chrome User-Agent is:
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
To add a User-Agent you can create a request object with the url as a parameter and the User-Agent passed in a dictionary as the keyword argument 'headers'.
See:
import urllib.request
r = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
html = urllib.request.urlopen(r)
You could try with this:
#!/usr/bin/env python
from bs4 import BeautifulSoup
import requests
url = 'your url here'
soup = BeautifulSoup(requests.get(url).text,"html.parser")
for i in soup.find_all('tr', attrs={'class': 'table-row'}):
    print('[Data id] => {}'.format(i.get('data-id')))
This should work!
I'm trying to access a URL using Python requests, but the results from requests.get are different from what is shown in a browser. The browser shows a list of hotel rates, but requests.get shows 'No results found'.
import requests
url = 'https://m.ihg.com/hotels/ihg/us/en/searchresults?destination=Atlanta%2C+GA%2C+United+States&checkInDay=01&checkInMonthYear=12016&checkOutDay=02&checkOutMonthYear=12016&rateCode=IMGOV&numberOfRooms=1&numberOfAdults=1&numberOfChildren=0&lat=33.748901&lng=-84.3881&corporateNumber=&installationCode=&travelType=&officialType=&dvqBranch=&dvqRank=&installationName='
headers = {'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25'}
response = requests.get(url)
response.text
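One detail worth noting in the snippet above: `headers` is defined but never passed to `requests.get(url)`, so the mobile user agent is never actually sent. Whether sending it changes the response is not verified here, but you can at least inspect offline what a request with the header attached would look like (the URL is shortened to its base for illustration):

```python
import requests

headers = {'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25'}
# Prepare (but do not send) the request to inspect what would go over the wire:
prepared = requests.Request('GET', 'https://m.ihg.com/hotels/ihg/us/en/searchresults',
                            headers=headers).prepare()
print(prepared.headers['User-Agent'])
```

To actually send the header, pass it explicitly: `requests.get(url, headers=headers)`.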