urlopen of urllib.request cannot open a page in Python 3.7

I want to write a web scraper that collects article titles from the Medium.com website. I am using Python 3.7 and imported urlopen from urllib.request.
But it cannot open the site and fails with
"urllib.error.HTTPError: HTTP Error 403: Forbidden".
from bs4 import BeautifulSoup
from urllib.request import urlopen
webAdd = urlopen("https://medium.com/")
bsObj = BeautifulSoup(webAdd.read())
Result: urllib.error.HTTPError: HTTP Error 403: Forbidden
The expected result is that it reads the website without any error. And indeed the error does not occur when I use the requests module:
import requests
from bs4 import BeautifulSoup
url = 'https://medium.com/'
response = requests.get(url, timeout=5)
This time it works without error.
Why?

urllib is a fairly old, low-level module. For web scraping, the requests module is recommended.
You can check out this answer for additional information.
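The practical difference here is most likely the default User-Agent header each library sends: urllib announces itself as Python-urllib/3.x, which the site apparently rejects, while requests sends python-requests/x.y.z. You can inspect both defaults locally without hitting any site:

```python
import urllib.request
import requests

# urllib's default User-Agent, as attached by the default opener
urllib_ua = dict(urllib.request.build_opener().addheaders)['User-agent']
print(urllib_ua)      # Python-urllib/3.x

# requests' default User-Agent
requests_ua = requests.utils.default_headers()['User-Agent']
print(requests_ua)    # python-requests/x.y.z
```

Whether a given site blocks one string and not the other is the server's choice; the point is that the two libraries do not identify themselves the same way.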

Many sites nowadays check which user agent a request comes from, to try and deter bots. requests is the better module to use, but if you really want to use urllib, you can alter the request headers to pretend to be Firefox or something else, so that the request is not blocked. A quick example can be found here:
https://stackoverflow.com/a/16187955
import urllib.request
user_agent = 'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'
url = "http://example.com"
request = urllib.request.Request(url)
request.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(request)
You will need to fill in the user_agent string with appropriate platform and version values too. Hope this helps.
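As a side note, Request also accepts the headers as a constructor argument, and you can confirm the header is attached before any network traffic happens (the Firefox string below is only an illustrative placeholder):

```python
import urllib.request

user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'
request = urllib.request.Request('http://example.com',
                                 headers={'User-Agent': user_agent})
# urllib normalizes header names to "Capitalized" form internally
print(request.get_header('User-agent'))
```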

This worked for me (note it sends no custom User-Agent, so it may still fail on sites that block urllib's default one):
from urllib.request import urlopen
html = urlopen(MY_URL)
contents = html.read()
print(contents)

Related

Can't parse coin gecko page from today with BeautifulSoup because of Cloudflare

from bs4 import BeautifulSoup as bs
import requests
import re
import cloudscraper

def get_btc_price(br):
    data = requests.get('https://www.coingecko.com/en/coins/bitcoin')
    soup = bs(data.text, 'html.parser')
    price1 = soup.find('table', {'class': 'table b-b'})
    fclas = price1.find('td')
    spans = fclas.find('span')
    price2 = spans.text
    price = price2.strip()
    x = float(price[1:])
    y = x * br
    z = round(y, 2)
    print(z)
    return z
This had been working for months and this morning it decided to stop. The messages I'm getting are like: "Checking your browser before you can continue...", "check your antivirus or consult with your manager to get access", and some Cloudflare gibberish.
I tried
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
print(scraper.get("https://www.coingecko.com/en/coins/bitcoin").text)
and it still blocks my access. What should I do? Is there any other way to bypass this, or am I doing something wrong?
It doesn't seem to be a problem with the scraper but with the server, during the negotiation for the connection.
Add a user agent, otherwise requests uses its default one:
user_agent = #
response = requests.get(url, headers={ "user-agent": user_agent})
Check the server's "requirements" by inspecting the response headers:
url = #
response = requests.get(url)
for key, value in response.headers.items():
print(key, ":", value)
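Before debugging against Cloudflare itself, you can also prepare the request locally, without sending anything, to see exactly which headers requests would put on the wire (the user-agent value below is a stand-in):

```python
import requests

user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) placeholder'
req = requests.Request(
    'GET', 'https://www.coingecko.com/en/coins/bitcoin',
    headers={'User-Agent': user_agent},
).prepare()
# Nothing has been sent yet; this is the header that would go on the wire
print(req.headers['User-Agent'])
```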

Method not allowed first API

I've been through a few web scraping tutorials and am now trying a basic API scraper.
This is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://qships.tmr.qld.gov.au/webx/services/wxdata.svc/GetDataX'
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
print (content)
It comes up with "method not allowed" :(
I'm still learning, so any advice will be well received.
Cheers
It is clearly a problem with your URL: the service doesn't allow information to be retrieved this way. But you can check this URL, where the steps for retrieving metadata are described:
https://qships.tmr.qld.gov.au/webx/services/wxdata.svc
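For background, a 405 "method not allowed" means the URL exists but rejects the HTTP verb you used; servers usually list the accepted verbs in the Allow response header. Since the real endpoint's expected method isn't documented here, the sketch below demonstrates the diagnosis against a throwaway local server that, hypothetically like this .svc endpoint, only accepts POST:

```python
import http.server
import threading
import requests

class Handler(http.server.BaseHTTPRequestHandler):
    # Stand-in endpoint that rejects GET and advertises what it accepts
    def do_GET(self):
        self.send_response(405)
        self.send_header('Allow', 'POST')
        self.end_headers()

    def do_POST(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'ok')

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = http.server.HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

get_resp = requests.get(url)
print(get_resp.status_code, get_resp.headers.get('Allow'))  # 405 POST
post_resp = requests.post(url)
print(post_resp.status_code)  # 200
server.shutdown()
```

If the real service reports Allow: POST, the next step would be finding the body format it expects, which its metadata page should describe.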

Python 3, urlopen - HTTP Error 403: Forbidden

I'm trying to automatically download the first image that appears in a Google image search, but I'm not able to read the website source, and an error occurs ("HTTP Error 403: Forbidden").
Any ideas? Thank you for your help!
That's my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
word = 'house'
r = urlopen('https://www.google.pl/search?&dcr=0&tbm=isch&q='+word)
data = r.read()
Apparently you have to pass a headers argument, because the website is blocking you, thinking you are a bot requesting data. I found an example of doing this here: HTTP error 403 in Python 3 Web Scraping.
Also, urlopen itself doesn't support a headers argument, so I had to use a Request object instead:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
word = 'house'
r = Request('https://www.google.pl/search?&dcr=0&tbm=isch&q='+word, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(r).read()
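One extra caveat with building the search URL by string concatenation: if the query can contain spaces or non-ASCII characters, encode it first, e.g. with urllib.parse.quote_plus (the phrase below is just an example query):

```python
from urllib.parse import quote_plus

word = 'summer house'  # a phrase with a space must be encoded
url = 'https://www.google.pl/search?&dcr=0&tbm=isch&q=' + quote_plus(word)
print(url)
```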

python requests return a different web page from browser or urllib

I use requests to scrape a webpage for some content.
When I use
import requests
requests.get('https://example.org')
I get a different page from the one I get in my browser or when using
import urllib.request
urllib.request.urlopen('https://example.org')
I tried using urllib, but it was really slow; in a comparison test I did, it was 50% slower than requests!
How do you solve this?
After a lot of investigation I found that the site attaches a cookie in the header for the first visit only.
So the solution is to get the cookies with a HEAD request, then resend them with your GET request:
import requests
url = 'https://example.org'
# get the cookies with head(); HEAD doesn't fetch the body, so it's FAST
cookies = requests.head(url).cookies
# send the GET request with the cookies
result = requests.get(url, cookies=cookies)
Now it's faster than urllib, with the same result :)
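An alternative to juggling cookies by hand is requests.Session, which stores cookies from every response and replays them on later requests automatically. The sketch below seeds the jar manually (a real first request would populate it via Set-Cookie) and inspects the header that would be sent, without touching the network:

```python
import requests

session = requests.Session()
# Simulate a cookie that a first visit would have stored in the jar
session.cookies.set('first_visit', 'yes')

# prepare_request merges the session's cookies into the outgoing headers
prepared = session.prepare_request(requests.Request('GET', 'https://example.org/'))
print(prepared.headers.get('Cookie'))
```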

Headers, User-Agents, Url Requests in Ubuntu

Currently scraping LINK for products and deploying my script on an Ubuntu server. This site requires you to specify a User-Agent and related request headers. As I am using Ubuntu and connecting through a proxy server on Ubuntu, what should my "hdr" variable be within this script:
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
import urllib2, sys
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
page = urllib2.urlopen(req)
This script works just fine running off my computer; however, I'm not sure what I would specify as the browser or user-agent for Ubuntu.
The code:
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import urllib2, sys
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "lxml")
result = soup.find_all("span", {"class":"availability"})
returns the error: urllib2.HTTPError: HTTP Error 403: Forbidden
But this only occurs on Ubuntu, not on my local machine.
I am not sure about the whole urllib2 thing, but if you are just trying to get the string from the HTML, you are importing way too much here. For the URL you've provided, the following is sufficient to scrape the text:
from bs4 import BeautifulSoup
import requests
As for the user-agent, that depends on whether you want the site maintainer to know about your existence; mostly it is unrelated to the ability to scrape itself. For some sites you might want to hide your user-agent; for some you might prefer it to be explicitly stated.
For the url you've provided the following code worked for me without errors:
from bs4 import BeautifulSoup
import requests
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = requests.Session()
page_raw = req.get(url, headers=hdr)
page_raw.status_code # This was 200
page_raw.encoding = "utf-8" # Just to be sure
page_text = page_raw.text
page_soup = BeautifulSoup(page_text, "lxml")
page_availability = page_soup.find_all("span", class_="availability")
You don't have to specify different User-Agent strings based on the operating systems where the script is running. You can leave it as is.
You can go further and start rotating User-Agent value - for instance, randomly picking it up from the fake-useragent's list of real world user agents.
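If you'd rather not add a dependency, a minimal rotation sketch looks like this (the strings below are shortened stand-ins; fake-useragent maintains real, current ones):

```python
import random

# Stand-in pool; in practice use full, current browser UA strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (X11; Linux x86_64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]

def random_headers():
    """Pick a random User-Agent for the next request."""
    return {'User-Agent': random.choice(user_agents)}

print(random_headers())
```

Pass the result as the headers argument on each request, e.g. requests.get(url, headers=random_headers()).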
You can specify any agent you want. The header is just a string that is part of the HTTP protocol; there is no verification on the server side. Beware that the header you specify can affect how the HTML returned to your request appears, i.e. with older agents the page might not contain all the information you expect.
