Python web scraping error 403 even with User-Agent header - python

I'm a newbie learning Python. While using BeautifulSoup and Requests to scrape "https://batdongsan.com.vn/nha-dat-ban-tp-hcm" to collect data on housing prices in my hometown, I get blocked by a 403 error even though I have tried a User-Agent header. Here is my code:
import requests

url3 = "https://batdongsan.com.vn/nha-dat-ban-tp-hcm"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49"}
page = requests.get(url3, headers=headers)
print(page)
Result: <Response [403]>
Has anyone run into and solved the same problem? Any help is highly appreciated.
Many thanks.

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()
soup = BeautifulSoup(scraper.get("https://batdongsan.com.vn/nha-dat-ban-tp-hcm").text, "html.parser")
print(soup.text)  # do what you want with the response
You can install cloudscraper with pip install cloudscraper.
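If the default scraper still gets a 403, cloudscraper can also emulate a specific browser profile. A minimal sketch, assuming the site accepts a desktop Chrome fingerprint (the browser dict is a cloudscraper option, not something this site is known to require):
import cloudscraper
from bs4 import BeautifulSoup

# Assumption: emulating desktop Chrome on Windows is enough to pass the challenge.
scraper = cloudscraper.create_scraper(
    browser={'browser': 'chrome', 'platform': 'windows', 'mobile': False}
)
resp = scraper.get("https://batdongsan.com.vn/nha-dat-ban-tp-hcm")
print(resp.status_code)  # expect 200 if the challenge was solved
soup = BeautifulSoup(resp.text, "html.parser")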

Related

Problems with Cloudflare 403 and Python

I've been trying to make an automatic code redeemer for a site, but there's a problem: every time I send a request to the website, I get a 403 error, which means I haven't passed the right fooling methods like headers, cookies, and Cloudflare tokens. But I have, so I'm lost. I've tried everything; the problem is 100% Cloudflare having a strange verification I can't find a way to bypass. I've passed auth headers with correct cookies as well. I've tried with the requests library and with cloudscraper and bs4.
The site is
from bs4 import BeautifulSoup
import cloudscraper

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
scraper = cloudscraper.create_scraper()
r = scraper.get('https://rblxwild.com/api/promo-code/redeem-code', headers=headers)
print(r)  # <Response [403]>
Can someone tell me how to bypass the Cloudflare protection methods?

While web scraping, this error shows: Not Acceptable! An appropriate representation of the requested resource could not be found on this server

I am trying to scrape data from a website, but it shows this error. I don't know how to fix it.
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
This is my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
page = requests.get(url).content
page
Output: the Mod_Security "Not Acceptable!" response shown above.
You need to add a user-agent header and it works.
If you do not send the user-agent of a real browser, the site thinks that you are a bot and blocks you.
from bs4 import BeautifulSoup
import requests
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}
page = requests.get(url, headers=headers).content
print(page)
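If you are going to fetch several pages from the same site, a requests.Session is a convenient way to send the User-Agent (and keep any cookies the server sets) on every call. A minimal sketch reusing the URL and header from the answer above:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Headers set on the session are sent with every request it makes.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
})
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
soup = BeautifulSoup(session.get(url).content, "html.parser")
print(soup.title)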

What does this robots.txt mean?

There's a website that I need to crawl. I have no financial purpose; it's just for study.
I checked its robots.txt, and it was as follows.
User-agent: *
Allow: /
Disallow: /*.notfound.html
Can I crawl this website using requests and BeautifulSoup?
I found that crawling without a header causes a 403 error. Does this mean that crawling is not allowed?
A 403 status code is a client-side error, not a statement from the server that extracting data is forbidden. The robots.txt you posted allows every user agent to crawl everything except URLs matching /*.notfound.html, so crawling is permitted. To get rid of a 403 error you usually need to send something extra with the request, such as headers; most of the time (but not always) injecting just a User-Agent header solves the problem. Here is an example of how to inject a User-Agent using the requests module with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}
response = requests.get("Your url", headers=headers)
print(response.status_code)
# soup = BeautifulSoup(response.content, "lxml")
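As for what the robots.txt itself allows, the standard library can evaluate the rules for you. A minimal sketch using urllib.robotparser, with a placeholder domain since the question does not name the site:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder: substitute the real site
rp.read()
# With the rules you posted (Allow: / for all agents), this prints True
# for any URL that does not match /*.notfound.html.
print(rp.can_fetch("*", "https://example.com/some-page.html"))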

I try to get a football game schedule from Google and this error occurs

import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("https://www.google.com/search?sxsrf=ACYBGNTOhiadhX5wH-HLBzUmxJSBAPzpbQ%3A1574342044444&source=hp&ei=nI3WXbq4GMWGoASf-I2oAw&q=%EB%A6%AC%EB%B2%84%ED%92%80+&oq=%EB%A6%AC%EB%B2%84%ED%92%80+&gs_l=psy-ab.3..35i39j0l9.463.2481..2802...2.0..1.124.1086.0j10......0....1..gws-wiz.....10..0i131j0i10j35i362i39.ciJHtFLjhCA&ved=0ahUKEwi69r6SsfvlAhVFA4gKHR98AzUQ4dUDCAY&uact=5#sie=t;/m/04ltf;2;/m/02_tc;mt;fp;1;;").read()
soup = BeautifulSoup(page, 'html.parser')
rank = soup.find('table', {'class': 'imspo_mt__mit'})
print(rank)
The urlopen call raises:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I'm trying to get a football game schedule from Google, and this error occurs. What's the reason?
Google has blocked you from accessing the page; that's what the 403 error means.
Try spoofing a user agent. The following works for me:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
page = requests.get("https://www.google.com/search?sxsrf=ACYBGNTOhiadhX5wH-HLBzUmxJSBAPzpbQ%3A1574342044444&source=hp&ei=nI3WXbq4GMWGoASf-I2oAw&q=%EB%A6%AC%EB%B2%84%ED%92%80+&oq=%EB%A6%AC%EB%B2%84%ED%92%80+&gs_l=psy-ab.3..35i39j0l9.463.2481..2802...2.0..1.124.1086.0j10......0....1..gws-wiz.....10..0i131j0i10j35i362i39.ciJHtFLjhCA&ved=0ahUKEwi69r6SsfvlAhVFA4gKHR98AzUQ4dUDCAY&uact=5#sie=t;/m/04ltf;2;/m/02_tc;mt;fp;1;;", headers=user_agent)
soup = BeautifulSoup(page.text,'html.parser')
rank = soup.find('table',{'class':'imspo_mt__mit'})
print(rank)

Python requests HTML 403 response

I'm using the requests module in Python to try to make a search on the following website: http://musicpleer.audio/. However, this website appears to be blocking me, as it issues nothing but a 403 when I attempt to access it. I'm wondering how I can get around this. I've tried sending it the user agent of my web browser (Chrome) and it still returns error 403. Any suggestions on how I could get around this? An example of downloading a song from the site would be very helpful. Thanks in advance.
My code:
import requests, os

def funGetList():
    start_path = 'C:/Users/Jordan/Music/'  # current directory
    list = []
    for path, dirs, files in os.walk(start_path):
        for filename in files:
            temp = (os.path.join(path, filename))
            tempLen = len(temp)
            # print(tempLen)
            iterate = 0
            list.append(temp[22:(len(temp)) - 4])

def funDownloadMP3():
    for i in list:
        print(i)
        payload = {'searchQuery': 'meme', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
        url = 'http://musicpleer.audio/'
        print(requests.post(url, data=payload))
Putting the User-Agent in the headers seems to work:
In []:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
url = 'http://musicpleer.audio/'
r = requests.get('{}#!{}'.format(url, 'meme'), headers=headers)
r.status_code
Out[]:
200
Note: It looks like the search url is simply '#!<search-term>'
HTTP 403 Forbidden error code.
The server might be expecting some more request headers, like Host or Cookie.
You might want to use Postman to debug it with ease.
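With requests you can approximate what a browser (or Postman) sends by adding the usual headers yourself. A minimal sketch; these header values are common browser defaults, not something this particular site is known to require:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'http://musicpleer.audio/',
}
r = requests.get('http://musicpleer.audio/', headers=headers)
print(r.status_code)  # compare against the bare request's 403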
