Why are the parsed label names different?

from bs4 import BeautifulSoup
import requests

web_url = 'https://www.mlb.com/scores/2019-05-12'
get_web = requests.get(web_url).text
soup = BeautifulSoup(get_web, "html.parser")
score = soup.find_all('div', class_='container')
print(score)
I want to find the element I see in the browser's developer tools, but the result I get contains different class names.

Send headers with the request to tell the server "hey, I'm a desktop browser" so you get the same HTML the browser sees:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = requests.get(web_url, headers={'User-Agent': user_agent})
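Applied to the question's code, a minimal end-to-end sketch (whether this page serves its scoreboard markup server-side to this User-Agent is an assumption; heavily scripted pages may need a browser-driving tool instead):
from bs4 import BeautifulSoup
import requests

web_url = 'https://www.mlb.com/scores/2019-05-12'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'

# Identify as a desktop browser so the server returns the desktop markup.
get_web = requests.get(web_url, headers={'User-Agent': user_agent}).text
soup = BeautifulSoup(get_web, "html.parser")
score = soup.find_all('div', class_='container')
print(score)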
Useful links:
How to use Python requests to fake a browser visit?
Sending "User-agent" using Requests library in Python

Beautiful Soup took too long to get data, plus never got the data

This is the code; it takes too long to get the data, and it never actually retrieves anything.
import requests
from bs4 import BeautifulSoup

url = "https://www.analog.com/en/products.html#"

def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    return soup

def parse(soup):
    datas = soup.find_all("div", {"class": "product-row row"})
    print(len(datas))
    return

print("started")
soup = get_data(url)
print("got data")
parse(soup)
You will need to provide a User-Agent in your request headers. Just add
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
at the top of your file, and then add the "headers" parameter to your request, as follows:
r = requests.get(url, headers=header)
You can read more at this question: How to use Python requests to fake a browser visit a.k.a and generate User Agent?
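Putting the two changes into the original script, a sketch (the product-row class is taken from the question; if the page builds its product list with JavaScript, requests alone will still see zero rows):
import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://www.analog.com/en/products.html#"

def get_data(url):
    # Send the browser-like header with the request.
    r = requests.get(url, headers=header)
    return BeautifulSoup(r.text, "html.parser")

def parse(soup):
    print(len(soup.find_all("div", {"class": "product-row row"})))

print("started")
parse(get_data(url))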

Why is requests.get() returning an outdated website in Python?

The relevant line of code is:
response = requests.get(url)
Here's what I've tried so far:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
and :
from fake_useragent import UserAgent
import requests
ua = UserAgent()
headers = {'User-Agent':str(ua.chrome)}
response = requests.get(url, headers=headers)
But the data I get is still not the current version of the website.
The website I'm trying to scrape is this grocery store flyer.
Can anyone tell me why the data I get is outdated and/or how to fix it?
Update: it suddenly works, but I haven't changed anything, so I'm still curious as to why...
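One plausible explanation (an assumption, not confirmed in the question) is that an intermediate HTTP cache, such as a CDN or proxy, was serving a stale copy of the page. A sketch of asking for a fresh copy with cache-busting headers; the URL is a hypothetical placeholder, and honoring these headers is up to the server:
import requests

url = 'https://example.com/grocery-flyer'  # hypothetical placeholder for the flyer URL
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    # Ask intermediate caches to revalidate rather than serve a stored copy.
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
}
response = requests.get(url, headers=headers)
# An 'Age' header or an old 'Date' header can hint that a cache served the response.
print(response.headers.get('Age'), response.headers.get('Date'))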

I am not able to scrape the web data from the given website using Python

Hi, I am trying to scrape data from the site https://health.usnews.com/doctors/city-index/new-jersey . I want to get all the city names, and then scrape the data from each city's link. But something goes wrong when I use the requests library in Python. Some session or cookie mechanism seems to be stopping the crawl. Please help me out.
>>> import requests
>>> url = 'https://health.usnews.com/doctors/city-index/new-jersey'
>>> html_content = requests.get(url)
>>> html_content.status_code
403
>>> html_content.content
'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://health.usnews.com/doctors/city-index/new-jersey" on this server.<P>\nReference #18.7d70b17.1528874823.3fac5589\n</BODY>\n</HTML>\n'
>>>
Here is the error I am getting.
You need to add a header to your request so that the site thinks you are a genuine user.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
html_content = requests.get(url, headers=headers)
First of all, like the previous answer suggested, I would recommend you add a header to your code, so it should look something like this:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'}
url = 'https://health.usnews.com/doctors/city-index/new-jersey'
html_content = requests.get(url, headers=headers)
print(html_content.status_code)
print(html_content.text)
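Once the 403 is resolved, the city names can be pulled out of the HTML. A hedged sketch (the href filter is an assumption about the page's markup, not something confirmed from the site; adjust it after inspecting the real page):
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'}
url = 'https://health.usnews.com/doctors/city-index/new-jersey'
html_content = requests.get(url, headers=headers)

soup = BeautifulSoup(html_content.text, "html.parser")
# Assumption: the city pages are linked as ordinary anchors on the index page.
for a in soup.find_all("a", href=True):
    if "/doctors/" in a["href"]:
        print(a.get_text(strip=True), a["href"])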

Able to see image in browser, but urllib.urlretrieve() fails to download it. How can I download it?

Image path --> http://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg
Code I am using:
import urllib
urllib.urlretrieve("https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg" , "photu.jpg")
What it returns (it returns the same thing for successful and unsuccessful attempts):
('photu.jpg', <httplib.HTTPMessage instance at 0x7fe3cfb27d88>)
Can someone help?
You need to fake the User-Agent to bypass this restriction by the web server.
Using Python 3 and the requests library, I managed to get the picture:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg'
res = requests.get(url, headers=headers)
with open('photo.jpg', 'wb') as W:
    W.write(res.content)
This might help (Python 2 urllib):
import urllib

f = open('photu.jpg', 'wb')
f.write(urllib.urlopen('https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg').read())
f.close()
Since you're sending a raw HTTP request without any User-Agent header, the server is not letting the request through. You can mimic a browser by setting a User-Agent in the headers, and it will work just as it does in the browser.
import urllib.request

url = "https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg"
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
with open('image.jpg', 'wb') as img_file:
    img_file.write(urllib.request.urlopen(req).read())
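Since the question specifically asks about urlretrieve(), a sketch that keeps it while still sending a browser User-Agent (note that install_opener() affects all subsequent urllib calls in the process):
import urllib.request

# Build an opener that sends a browser User-Agent and install it globally;
# urlretrieve() goes through the installed opener.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')]
urllib.request.install_opener(opener)

urllib.request.urlretrieve(
    'https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg',
    'photu.jpg',
)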

Python 3, BeautifulSoup returns nothing on specific pages

On some pages, when I use BeautifulSoup, it returns nothing... just blank pages.
from bs4 import BeautifulSoup
import urllib.request
Site = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
URL = Site
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use BeautifulSoup on any other site except this one, and I don't know why...
This URL requires certain headers to be passed with the request.
Pass this headers parameter while requesting the URL and you will get the HTML:
HTML = requests.get(URL, headers=headers).content
where
headers = {
    "method": "GET",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "Host": "gall.dcinside.com",
    "Pragma": "no-cache",
    "Upgrade-Insecure-Requests": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}
As far as I can see, this site uses cookies. You can see the headers in the browser's developer tools. You can get the cookie as follows:
import urllib.request
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')
Now you can create the headers like this and send them with subsequent requests:
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Cookie": ck,
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
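If you prefer requests over urllib, a Session gives the same effect with less bookkeeping; a sketch, assuming the site only checks the User-Agent plus cookies it sets itself:
import requests

# A Session stores cookies from earlier responses and resends them on later requests.
session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'
html = session.get('http://gall.dcinside.com/board/lists/?id=parkbogum&page=2').content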
Some website servers look for robot scripts trying to access their pages. One of the simpler checks is to see which User-Agent the client sends. In this case, as you are using Python and not a web browser, the following is being sent:
python-requests/2.18.4
When the server sees an agent it does not like, it returns nothing. To get around this, you need to change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each release of a browser. For example, see this list of Firefox User-Agent strings:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few and find one that the server is happy with. In your case, only the header needs to be changed for the website to return HTML. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)
