Different HTML returned when using Python requests

I'm trying to access a URL with Python's requests library, but the result of requests.get differs from what a browser shows. The browser displays a list of hotel rates, while requests.get returns a page saying 'No results found'.
import requests
url = 'https://m.ihg.com/hotels/ihg/us/en/searchresults?destination=Atlanta%2C+GA%2C+United+States&checkInDay=01&checkInMonthYear=12016&checkOutDay=02&checkOutMonthYear=12016&rateCode=IMGOV&numberOfRooms=1&numberOfAdults=1&numberOfChildren=0&lat=33.748901&lng=-84.3881&corporateNumber=&installationCode=&travelType=&officialType=&dvqBranch=&dvqRank=&installationName='
headers = {'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25'}
response = requests.get(url)
response.text
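Note that the headers dict above is defined but never passed to requests.get, so the request still goes out with the default python-requests User-Agent. A minimal sketch of actually sending it, which is also the fix suggested in the answers below:
response = requests.get(url, headers=headers)  # pass the headers dict along
print(response.status_code)
print(response.text[:500])  # peek at the start of the returned HTML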

Related

BeautifulSoup took too long to get data, plus never got data

This is my code. It takes too long, and it never actually retrieves the data.
import requests
from bs4 import BeautifulSoup

print("started")
url = "https://www.analog.com/en/products.html#"

def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    return soup

def parse(soup):
    datas = soup.find_all("div", {"class": "product-row row"})
    print(len(datas))
    return

print("started")
soup = get_data(url)
print("got data")
parse(soup)
You will need to provide a User-Agent in your request headers. Just add
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
at the top of your file and then pass it via the headers parameter of your request, as follows:
r = requests.get(url, headers=header)
You can read more at this question: How to use Python requests to fake a browser visit a.k.a and generate User Agent?
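Putting the fragments together with the code from the question, a minimal version of the suggested fix might look like this (the User-Agent string is just an example; any common browser string should do):
import requests
from bs4 import BeautifulSoup

url = "https://www.analog.com/en/products.html#"
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

r = requests.get(url, headers=header)  # the header makes the request look like a browser
soup = BeautifulSoup(r.text, "html.parser")
print(len(soup.find_all("div", {"class": "product-row row"})))
Note that if the product list is rendered by JavaScript after the page loads, it will not appear in r.text no matter which headers are sent; that would need a tool that executes JavaScript.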

Why is requests.get() returning an outdated website in Python?

The relevant line of code is:
response = requests.get(url)
Here's what I've tried so far:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
and:
from fake_useragent import UserAgent
import requests
ua = UserAgent()
headers = {'User-Agent':str(ua.chrome)}
response = requests.get(url, headers=headers)
But the data I get is still not the current version of the website.
The website I'm trying to scrape is this grocery store flyer.
Can anyone tell me why the data I get is outdated and/or how to fix it?
Update: it works all of a sudden, but I haven't changed anything, so I'm still curious as to why...
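One plausible explanation, though not confirmed in the question, is a cache (for example a CDN) between the client and the site serving a stale copy; that would also explain why it started working without any code change. A sketch that asks intermediaries for a fresh copy, with assumed header values that caches may or may not honor:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Cache-Control': 'no-cache',  # ask caches to revalidate with the origin server
    'Pragma': 'no-cache',  # HTTP/1.0 equivalent for older proxies
}
response = requests.get(url, headers=headers)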

Why are the parsed label names different?

from bs4 import BeautifulSoup
import requests
web_url = r'https://www.mlb.com/scores/2019-05-12'
get_web = requests.get(web_url).text
soup = BeautifulSoup(get_web,"html.parser")
score = soup.find_all('div',class_='container')
print(score)
I want to find the content as it appears in the browser, but the result I get is different markup.
Send headers with your request to tell the server "hey, I'm a desktop browser", so it returns the same HTML it serves to a real browser:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = requests.get(url, headers={'User-Agent': user_agent})
Useful links:
How to use Python requests to fake a browser visit?
Sending "User-agent" using Requests library in Python

I am not able to scrape the web data from the given website using python

Hi, I am trying to scrape data from https://health.usnews.com/doctors/city-index/new-jersey . I want all the city names, and then from each city link I want to scrape the data. But something goes wrong when using the requests library in Python; some session or cookie mechanism seems to be blocking the crawl. Please help me out.
>>> import requests
>>> url = 'https://health.usnews.com/doctors/city-index/new-jersey'
>>> html_content = requests.get(url)
>>> html_content.status_code
403
>>> html_content.content
'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://health.usnews.com/doctors/city-index/new-jersey" on this server.<P>\nReference #18.7d70b17.1528874823.3fac5589\n</BODY>\n</HTML>\n'
>>>
Here is the error I am getting.
You need to add a header to your request so that the site thinks you are a genuine user.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
html_content = requests.get(url, headers=headers)
First of all, like the previous answer suggested, I would recommend adding a header to your code, so it should look something like this:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'}
url = 'https://health.usnews.com/doctors/city-index/new-jersey'
html_content = requests.get(url, headers=headers)
print(html_content.status_code)
print(html_content.text)
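Since the question mentions sessions and cookies: requests.Session keeps cookies between requests, which can help when the site sets a cookie on the first page and expects it back on later ones. A minimal sketch, reusing the header above (the crawl of the individual city links is only indicated by a comment):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'}

with requests.Session() as session:
    session.headers.update(headers)  # sent with every request made on this session
    index = session.get('https://health.usnews.com/doctors/city-index/new-jersey')
    print(index.status_code)
    # any cookies set above are automatically resent when you follow the city links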

Web scraping: HTTPError: HTTP Error 403: Forbidden, python3

Hi, I need to scrape a web page and extract the data-id values using a regular expression.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://clarity-project.info/tenders/?entity=38163425&offset=100")
bsObj = BeautifulSoup(html,"html.parser")
DataId = bsObg.findAll("data-id", {"skr":re.compile("data-id=[0-9,a-f]")})
for DataId in DataId:
    print(DataId["skr"])
When I run my program in Jupyter, I get:
HTTPError: HTTP Error 403: Forbidden
It looks like the web server is asking you to authenticate before serving content to Python's urllib. However, they serve everything neatly to wget and curl and https://clarity-project.info/robots.txt doesn't seem to exist, so I reckon scraping as such is fine with them. Still, it might be a good idea to ask them first.
As for the code, simply changing the User Agent string to something they like better seems to work:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request
request = Request(
    'https://clarity-project.info/tenders/?entity=38163425&offset=100',
    headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'})
html = urlopen(request).read().decode()
(Unrelated: you have another mistake in your code, bsObj ≠ bsObg.)
EDIT: I added the code below to answer the additional question from the comments:
What you seem to need is to find the value of the data-id attribute, no matter to which tag it belongs. The code below does just that:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
         '(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')
request = Request(url, headers={'User-Agent': agent})
html = urlopen(request).read().decode()
soup = BeautifulSoup(html, 'html.parser')
# match any tag that carries a data-id attribute, whatever the tag name
tags = soup.findAll(lambda tag: tag.get('data-id', None) is not None)
for tag in tags:
    print(tag['data-id'])
The key is simply to use a lambda expression as the parameter to BeautifulSoup's findAll function.
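If you prefer CSS selectors, a reasonably recent bs4 accepts an attribute selector through select that does the same thing:
# Equivalent to the lambda above: select any tag that has a data-id attribute.
for tag in soup.select('[data-id]'):
    print(tag['data-id'])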
The server is likely blocking your requests because of the default user agent. You can change this so that you will appear to the server to be a web browser. For example, a Chrome User-Agent is:
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
To add a User-Agent you can create a request object with the url as a parameter and the User-Agent passed in a dictionary as the keyword argument 'headers'.
See:
import urllib.request
r = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
html = urllib.request.urlopen(r).read()
You could try with this:
#!/usr/bin/env python
from bs4 import BeautifulSoup
import requests
url = 'your url here'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
for i in soup.find_all('tr', attrs={'class': 'table-row'}):
    print('[Data id] => {}'.format(i.get('data-id')))
This should work!
