<Response [404]> Scraping data with BeautifulSoup - python

I'm trying to scrape data from this website 1xbet but I'm getting this error <Response [404]> all the time.
Here is my code.
import requests
from bs4 import BeautifulSoup
requests.packages.urllib3.disable_warnings()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = "https://1xbet.com/sports/basketball/early"
response = requests.get(url, headers=headers, verify=False)
print(response)
# soup = BeautifulSoup(response.content, 'html.parser')
# lists = soup.find_all('section', class_="a_event")
# print(lists)
How can I solve this?
I tried to include the headers and verify=False so that it won't raise the "certificate verify failed" error, but after doing that I got this 404 response. Any help would be appreciated.

A 404 means the server could not find a resource at this URL. You need to check that the URL is correct.
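To see what the server actually answered, it can help to print the final URL and any redirects alongside the status code. The following is a minimal sketch, not a fix for this particular site: `explain_status` and `report` are hypothetical helper names, and `report` only assumes an object with the `status_code`, `url`, `history` and `ok` attributes that a `requests.Response` provides.

```python
from http import HTTPStatus


def explain_status(status_code):
    """Return a short label such as '404 Not Found' for a status code."""
    try:
        return f"{status_code} {HTTPStatus(status_code).phrase}"
    except ValueError:
        # Codes outside the standard registry have no phrase.
        return f"{status_code} (non-standard status)"


def report(response):
    """Print each redirect hop and the final answer; return response.ok.

    `response` is assumed to look like a requests.Response.
    """
    for hop in response.history:
        print("redirected:", explain_status(hop.status_code), hop.url)
    print("final:", explain_status(response.status_code), response.url)
    return response.ok


# Hypothetical usage with requests (not executed here):
# import requests
# report(requests.get(url, headers=headers, timeout=10))
```

If the final URL differs from the one requested, or the same URL loads in a browser but returns 404 to a script, the site may be redirecting or rejecting non-browser clients rather than the path being wrong.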

Related

Request Website Timeout While Trying To Read Website- Python

I am attempting to read and parse a website that returns JSON. Every attempt I have made either gives me a timeout error or hangs with no error at all (I have to stop it manually).
URL:
https://api.louisvuitton.com/api/eng-us/catalog/availability/M57089
Code I have tried:
import json
import requests
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
#Trial 1
BASE_URL = 'https://api.louisvuitton.com/api/eng-us/catalog/availability/M57089'
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
}
response = requests.get(BASE_URL, headers=headers)
#Trial2
url = ('https://api.louisvuitton.com/api/eng-us/catalog/availability/M57089')
req = Request(url, headers= headers)
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
obj=json.loads(str(page_soup))
#Trial3
import dload
j = dload.json('https://api.louisvuitton.com/api/eng-us/catalog/availability/M57089')
print(j)
So far none of these attempts or any variation similar to these have been successful to open the website and read it. Any help would be appreciated.
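Without an explicit timeout, both `requests.get` and `urlopen` can wait indefinitely for a server that never answers, which would match the "no error at all" symptom. Below is a minimal sketch using only the standard library. `fetch_json`, `HEADERS` and the `opener` parameter are hypothetical names introduced for illustration (`opener` exists only so the function can be exercised without a network connection), and the `Accept` header is an assumption about what a JSON endpoint expects.

```python
import json
import socket
from urllib.error import URLError
from urllib.request import Request, urlopen

# Assumed headers; the real endpoint may require more (cookies, tokens, ...).
HEADERS = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}


def fetch_json(url, timeout=10, opener=urlopen):
    """Return the decoded JSON body, or None if the request fails or times out."""
    request = Request(url, headers=HEADERS)
    try:
        with opener(request, timeout=timeout) as reply:
            # The body is already JSON, so it is parsed directly
            # instead of being passed through an HTML parser first.
            return json.loads(reply.read().decode("utf-8"))
    except (socket.timeout, TimeoutError):
        print(f"no answer within {timeout}s: {url}")
    except URLError as exc:
        # Covers connection errors and HTTP error statuses.
        print(f"request failed: {exc.reason}")
    return None
```

The same idea applies to requests via its `timeout=` argument. A timeout only turns a hang into a quick failure; if the server deliberately ignores scripted clients, the request will still not succeed.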

Why is requests.get() returning an outdated website in Python?

The relevant line of code is:
response = requests.get(url)
Here's what I've tried so far:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
and:
from fake_useragent import UserAgent
import requests
ua = UserAgent()
headers = {'User-Agent':str(ua.chrome)}
response = requests.get(url, headers=headers)
But the data I get is still not the current version of the website.
The website I'm trying to scrape is this grocery store flyer.
Can anyone tell me why the data I get is outdated and/or how to fix it?
Update: it works all of a sudden but I haven't changed anything so I'm still curious as to why ...
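The thread does not establish a cause, but one possible explanation is that an intermediate cache (for example a CDN) served a stale copy for a while, which would be consistent with the problem disappearing on its own. A minimal sketch for checking this is shown below: `NO_CACHE_HEADERS`, `CACHE_FIELDS` and `cache_report` are hypothetical names, and whether a given server honours the request headers is not guaranteed.

```python
# Request headers that ask caches along the way to revalidate the page.
NO_CACHE_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
}

# Standard response headers that hint at how old the returned copy is.
CACHE_FIELDS = ("Date", "Age", "Last-Modified", "ETag", "Cache-Control", "Expires")


def cache_report(response_headers):
    """Pick the cache-related fields out of a mapping of response headers."""
    lowered = {key.lower(): value for key, value in response_headers.items()}
    return {
        name: lowered[name.lower()]
        for name in CACHE_FIELDS
        if name.lower() in lowered
    }


# Hypothetical usage with requests (not executed here):
# import requests
# response = requests.get(url, headers=NO_CACHE_HEADERS, timeout=10)
# print(cache_report(response.headers))
```

A large `Age` value or an old `Last-Modified` date in the report would suggest that the response came from a cache rather than from the current version of the page.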

Scraping webpage using BeautifulSoup

I am attempting to scrape this site: https://www.senate.gov/general/contact_information/senators_cfm.cfm
My Code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.senate.gov/general/contact_information/senators_cfm.cfm'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)
The issue is that it's not actually going to the site. The HTML that I get in my soup var is not at all what the HTML is in the correct webpage.
This worked for me
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
}
r = requests.get(URL,headers=headers)
Found the info here - https://towardsdatascience.com/5-strategies-to-write-unblock-able-web-scrapers-in-python-5e40c147bdaf
DUPLICATE HTTP 503 Error while using python requests module
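Try this:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.senate.gov/general/contact_information/senators_cfm.cfm'
# headers was not defined in the original snippet; this reuses the User-Agent from the answer above
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
page = requests.post(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)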

BeautifulSoup using Python keep returning null even though the element exists

I am running the following code to parse an Amazon page using Beautiful Soup in Python, but when I run the print line I keep getting None. I am wondering whether I am doing something wrong or if there's an explanation/solution to this. Any help will be appreciated.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-Bulletin-Board/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
Your code is absolutely correct.
There seems to be some issue with the parser that you have used (html.parser).
I used html5lib in place of html.parser and the code now works:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-BulletinBoard/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
title = soup.find(id='productTitle')
print(title)
More Info not directly related to the answer:
For the other answer given to this question, I wasn't asked for a captcha when visiting the page.
However, Amazon does change the response content if it detects that a bot is visiting the website. To see this, remove the headers from the requests.get() call and inspect page.text.
The default headers added by the requests library lead to the request being identified as coming from a bot.
When I requested that page outside of a normal browser environment, it asked for a captcha; I'd assume that's why the element doesn't exist.
Amazon probably has specific measures to counter "robots" accessing their pages. I suggest looking at their APIs to see if there's anything helpful instead of scraping the webpages directly.
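Before blaming the parser, it may be worth checking whether the downloaded HTML contains the element at all, or whether a block page came back instead. The sketch below uses only the standard library; `IdFinder`, `diagnose` and `CAPTCHA_MARKERS` are hypothetical names, and the marker strings are assumptions about what a block page might contain, not documented Amazon behaviour.

```python
from html.parser import HTMLParser

# Assumed substrings that suggest a bot-check page; adjust after
# inspecting the HTML that is actually returned.
CAPTCHA_MARKERS = ("captcha", "robot check")


class IdFinder(HTMLParser):
    """Record whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def diagnose(html, wanted_id):
    """Return 'found', 'blocked' or 'missing' for the given HTML text."""
    finder = IdFinder(wanted_id)
    finder.feed(html)
    if finder.found:
        return "found"
    if any(marker in html.lower() for marker in CAPTCHA_MARKERS):
        return "blocked"
    return "missing"


# Hypothetical usage with the question's code (not executed here):
# print(diagnose(page.text, "productTitle"))
```

A result of "blocked" would point at bot detection, while "found" combined with `soup.find(...)` returning None would point at the parser.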

I am not able to scrape the web data from the given website using python

Hi, I am trying to scrape data from the site https://health.usnews.com/doctors/city-index/new-jersey. I want all the city names, and then I want to scrape the data behind each city's link. But something goes wrong when I use the requests library in Python. There seems to be some session or cookie mechanism that stops me from crawling the data. Please help me out.
>>> import requests
>>> url = 'https://health.usnews.com/doctors/city-index/new-jersey'
>>> html_content = requests.get(url)
>>> html_content.status_code
403
>>> html_content.content
'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://health.usnews.com/doctors/city-index/new-jersey" on this server.<P>\nReference #18.7d70b17.1528874823.3fac5589\n</BODY>\n</HTML>\n'
>>>
Here is the error I am getting.
You need to add a header to your request so that the site thinks you are a genuine user.
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
html_content = requests.get(url, headers=headers)
First of all, as the previous answer suggested, I would recommend adding a header to your request, so your code should look something like this:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'}
url = 'https://health.usnews.com/doctors/city-index/new-jersey'
html_content = requests.get(url, headers=headers)
# print the status code; a bare expression shows nothing when run as a script
print(html_content.status_code)
print(html_content.text)
