I was trying to get search results from a website, but I got a "Response [403]" message. I found a similar post that solved a 403 error by adding headers to requests.post, but that didn't work for my problem. What should I do to correctly get the result I want?
from urllib.request import urlopen
import urllib.parse
import urllib.request
import requests
from bs4 import BeautifulSoup
url = "https://www.metal-archives.com/"
html = urlopen(url)
print("The keyword you entered to search is: %s\n" % 'Bathory')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.post(url, data='Bathory', headers=headers)
print(result.content)
First of all, you don't need the headers, as you can see that you're already getting status code 200:
>>> r = requests.get('https://www.metal-archives.com')
>>> r.status_code
200
If you search for anything on the site, you can see that the URL changes to
https://www.metal-archives.com/search?searchString=bathory
That means you can build the search URL directly:
>>> keyword = 'bathory'
>>> r = requests.get('https://www.metal-archives.com/search?searchString='+keyword)
>>> r.status_code
200
>>> 'bathory' in r.text
True
If you check the HTML, you'll find that the form method is GET (which may be why your POST request gets a 403 error):
<form id="search_form" action="https://www.metal-archives.com/search" method="get">
so all you need to do is construct the search URL:
# Music genre search
result = requests.get("https://www.metal-archives.com/search?searchString={0}&type=band_genre".format("Bathory"))
# Band name search
result = requests.get("https://www.metal-archives.com/search?searchString={0}&type=band_name".format("Bathory"))
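If the keyword can contain spaces or special characters, it's safer to let requests build and encode the query string for you. A minimal sketch (not from the original answer, but using the same searchString and type parameters seen in the URLs above):

import requests

# Let requests URL-encode the keyword via params= instead of pasting it
# into the URL by hand.
result = requests.get(
    "https://www.metal-archives.com/search",
    params={"searchString": "Bathory", "type": "band_name"},
)
print(result.status_code)        # expect 200
print("Bathory" in result.text)  # expect True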
Related
Python error when using request get
Hello guys, I have this in my code:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)
And I'm getting this:
<Response [403]>
What could be the solution?
The url is 'https://www3.animeflv.net/anime/sailor-moon'
For your specific case, you can get around that by faking your User-Agent in the request headers.
import requests
url = 'https://www3.animeflv.net/anime/sailor-moon'
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
res = requests.get(url, headers=headers)
print(res.status_code)
200
Some websites try to block requests made with the Python requests library: by default, when you make a request from a Python script, your User-Agent is something like python-requests/2.x, but if you override it in the headers you can often get past that. Take a look at the fake-useragent library (https://pypi.org/project/fake-useragent/) for generating fake User-Agent strings.
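For example, a minimal sketch using fake-useragent (assuming the package is installed via pip install fake-useragent):

import requests
from fake_useragent import UserAgent  # pip install fake-useragent

# Generate a random browser-like User-Agent string and send it with the
# request instead of the default python-requests one.
ua = UserAgent()
headers = {'User-Agent': ua.random}
res = requests.get('https://www3.animeflv.net/anime/sailor-moon', headers=headers)
print(res.status_code)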
I have the following Python program that gives an HTTP 403: Forbidden error when I execute it.
Here's the code:
import os
import re
from bs4 import BeautifulSoup
#import urllib.request
#from urllib.request import request, urlopen
from urllib import request
import pandas as pd
import numpy as np
import datetime
import time
import openpyxl
for a in range(0, len(symbols), 1):
    """
    Attempt to get past the Forbidden error message.
    """
    #ua = UserAgent()
    url = "https://fintel.io/ss/us/" + symbols[a]
    """
    test urls:
    https://fintel.io/ss/us/A
    https://fintel.io/ss/us/BBY'
    https://fintel.io/ss/us/WMT'
    """
    print("Extracting Values for " + symbols[a] + ".")
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    try:
        page_request = request.Request(url, headers=header)
        page = request.urlopen(page_request)
        #old page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, "html.parser", from_encoding="iso-8859-1")
The result I'm getting:
Extracting Values for A.
HTTP Error 403: Forbidden
Extracting Values for BBY.
HTTP Error 403: Forbidden
Extracting Values for WMT.
HTTP Error 403: Forbidden
Any suggestions on how to handle this?
That's not a BeautifulSoup-specific error; the website you're trying to scrape is probably protected by Cloudflare's anti-bot page. You can try using cfscrape to bypass it.
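A minimal sketch of that suggestion (assuming pip install cfscrape, which also needs a JavaScript runtime such as Node.js available):

import cfscrape  # pip install cfscrape

# create_scraper() returns a requests.Session-like object that solves
# Cloudflare's anti-bot challenge before returning the page.
scraper = cfscrape.create_scraper()
page = scraper.get("https://fintel.io/ss/us/BBY")
print(page.status_code)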
Your issue is with urllib's request handling, rather than with BeautifulSoup.
The website refuses to serve your requests.
When I use that UA it works fine; I receive a 200 document:
$ wget -U 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' -S https://fintel.io/ss/us/BBY
But then again, even using UA of 'Wget/1.20.3 (darwin19.0.0)' works just fine:
$ wget -S https://fintel.io/ss/us/BBY
Looks like you've been doing excessive crawling, and your IP is now being blocked by the webserver.
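If that's the case, throttling the crawl may help once the block clears. A rough sketch (not part of the original answer; the symbol list and User-Agent mirror the question, and the 2-second delay is an arbitrary example value):

import time
from urllib import request

# Space the requests out so the crawl looks less aggressive to the server.
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
for symbol in ['A', 'BBY', 'WMT']:
    req = request.Request('https://fintel.io/ss/us/' + symbol, headers=header)
    try:
        page = request.urlopen(req)
        print(symbol, page.status)
    except Exception as exc:
        print(symbol, exc)
    time.sleep(2)  # pause between requests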
I am running the following code to parse an Amazon page using Beautiful Soup in Python, but when I run the print line, I keep getting None. I am wondering whether I am doing something wrong or if there's an explanation/solution to this. Any help will be appreciated.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-Bulletin-Board/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
Your code is absolutely correct.
There seems to be some issue with the parser that you have used (html.parser).
I used html5lib in place of html.parser and the code now works:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-BulletinBoard/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
title = soup.find(id='productTitle')
print(title)
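As a small follow-up (not part of the original answer), you can pull out just the visible text once the tag is found:

# Extract the product title text, guarding against None in case Amazon
# served a bot/captcha page instead of the product page.
if title is not None:
    print(title.get_text(strip=True))
else:
    print("productTitle not found")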
More info, not directly related to the answer:
Regarding the other answer given to this question: I wasn't asked for a captcha when visiting the page.
However, Amazon does change the response content if it detects that a bot is visiting the website: remove the headers from the requests.get() call and inspect page.text to see this.
The default headers added by the requests library identify the request as coming from a bot.
When requesting that page outside of a normal browser environment, it asked for a captcha; I'd assume that's why the element doesn't exist.
Amazon probably has specific measures to counter "robots" accessing their pages. I suggest looking at their APIs to see if there's anything helpful instead of scraping the web pages directly.
I'm writing a script to find out which full URLs a large number of shortened URLs lead to. I'm using the requests module to follow redirects and get the URL one would end up at if entering the URL in a browser. This works for almost all link shorteners, but fails for URLs from disq.us for reasons I can't figure out (i.e. for disq.us URLs I get back the same URL I enter, whereas when I enter it in a browser, I get redirected).
Below is a snippet which correctly resolves a bit.ly-shortened link but fails with a disq.us link. I run it with Python 3.6.4 and version 2.18.4 of the requests module.
SO will not allow me to include shortened URLs in the question, so I'll leave those in a comment.
import requests
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
url1 = "SOME BITLY URL"
url2 = "SOME DISQ.US URL"
for url in [url1, url2]:
    s = requests.Session()
    s.headers['User-Agent'] = user_agent
    r = s.get(url, allow_redirects=True, timeout=10)
    print(r.url)
Your first URL is a 404 for me. Interestingly, I just tried this with the second url and it worked, but I used a different user agent. Then I tried it with your user agent, and it isn't redirecting.
This suggests that the webserver is doing something strange in response to that user agent string, and that the problem isn't with requests.
>>> import requests
>>> user_agent = 'foo'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'https://www.elsevier.com/connect/could-dissolvable-microneedles-replace-injected-vaccines'
vs.
>>> import requests
>>> user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'THE_DISCUS_URL'
I got curious, so I investigated a little more. The actual content of the response is a noscript tag with the link, plus some JavaScript that does the redirect.
What's probably going on here is that if Disqus sees a real web browser user agent, it tries to redirect via JavaScript (and probably does a bunch of tracking in the process). On the other hand, if the user agent isn't familiar, the site assumes the visitor is a script, which probably can't run JavaScript, and just redirects.
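Based on that observation, here is a rough sketch of how the target link could be pulled out of the noscript fallback instead of relying on the JavaScript redirect (the exact markup is an assumption on my part):

import requests
from bs4 import BeautifulSoup

# Fetch the disq.us page with a browser-like User-Agent, then look for the
# link inside the <noscript> fallback (markup is assumed, not verified).
s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
r = s.get('THE_DISCUS_URL', timeout=10)
soup = BeautifulSoup(r.text, 'html.parser')
noscript = soup.find('noscript')
if noscript is not None and noscript.find('a') is not None:
    print(noscript.find('a')['href'])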
Hi, I need to scrape a web page and extract the data-id values using a regular expression.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://clarity-project.info/tenders/?entity=38163425&offset=100")
bsObj = BeautifulSoup(html,"html.parser")
DataId = bsObg.findAll("data-id", {"skr":re.compile("data-id=[0-9,a-f]")})
for DataId in DataId:
    print(DataId["skr"])
When I run my program in Jupyter, I get:
HTTPError: HTTP Error 403: Forbidden
It looks like the web server is asking you to authenticate before serving content to Python's urllib. However, they serve everything neatly to wget and curl, and https://clarity-project.info/robots.txt doesn't seem to exist, so I reckon scraping as such is fine with them. Still, it might be a good idea to ask them first.
As for the code, simply changing the User Agent string to something they like better seems to work:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request
request = Request(
    'https://clarity-project.info/tenders/?entity=38163425&offset=100',
    headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'})
html = urlopen(request).read().decode()
(unrelated, you have another mistake in your code: bsObj ≠ bsObg)
EDIT: added code below to answer an additional question from the comments.
What you seem to need is to find the value of the data-id attribute, no matter to which tag it belongs. The code below does just that:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = Request(url, headers={'User-Agent': agent})
html = urlopen(request).read().decode()
soup = BeautifulSoup(html, 'html.parser')
tags = soup.findAll(lambda tag: tag.get('data-id', None) is not None)
for tag in tags:
    print(tag['data-id'])
The key is to simply use a lambda expression as the parameter to the findAll function of BeautifulSoup.
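As a side note (not from the original answer), BeautifulSoup can also match any tag that merely has the attribute by passing True in attrs, which is equivalent here:

# Equivalent alternative: attrs={'data-id': True} matches any tag that has
# a data-id attribute, reusing the soup object built above.
tags = soup.find_all(attrs={'data-id': True})
for tag in tags:
    print(tag['data-id'])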
The server is likely blocking your requests because of the default user agent. You can change this so that you will appear to the server to be a web browser. For example, a Chrome User-Agent is:
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
To add a User-Agent, you can create a Request object with the URL as a parameter and the User-Agent passed in a dictionary as the keyword argument 'headers'.
See:
import urllib.request
r = urllib.request.Request(url, headers= {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
html = urllib.request.urlopen(r)
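As a follow-up sketch (not part of the original answer), the response can then be read, decoded, and parsed just like in the snippets above:

from bs4 import BeautifulSoup

# Continuing from the snippet above: `html` is the response object returned
# by urlopen(r); read and decode its body, then hand it to BeautifulSoup.
soup = BeautifulSoup(html.read().decode(), 'html.parser')
print(soup.title)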
You could try this:
#!/usr/bin/env python
from bs4 import BeautifulSoup
import requests
url = 'your url here'
soup = BeautifulSoup(requests.get(url).text,"html.parser")
for i in soup.find_all('tr', attrs={'class': 'table-row'}):
    print('[Data id] => {}'.format(i.get('data-id')))
This should work!