I'm trying to automatically download the first image that appears in a Google image search, but I can't read the page source: the request fails with "HTTP Error 403: Forbidden".
Any ideas? Thank you for your help!
This is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
word = 'house'
r = urlopen('https://www.google.pl/search?&dcr=0&tbm=isch&q='+word)
data = r.read()
Apparently you have to pass a headers argument, because the website blocks you if it thinks you are a bot requesting data. I found an example of doing this here: HTTP error 403 in Python 3 Web Scraping.
Also, urlopen doesn't accept a headers argument, so I had to build a Request object instead and pass that to urlopen.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
word = 'house'
r = Request('https://www.google.pl/search?&dcr=0&tbm=isch&q='+word, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(r).read()
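From there, extracting the first image is a matter of parsing the response with BeautifulSoup. A minimal sketch of just the parsing step, run against a stub HTML snippet rather than the live Google response (whose actual markup may differ and change over time):

```python
from bs4 import BeautifulSoup

# Stub HTML standing in for the decoded Google response; the real page's
# structure may differ, so the selector may need adjusting.
html = '<div><img src="http://example.com/a.jpg"><img src="http://example.com/b.jpg"></div>'
soup = BeautifulSoup(html, 'html.parser')
first_img = soup.find('img')  # first <img> tag in document order
print(first_img['src'])       # http://example.com/a.jpg
```

The URL found this way could then be fetched with a second Request carrying the same User-Agent header.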
I am trying to use requests to get data from Twitter, but when I run my code I get this error: simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This is my code so far:
import requests
url = 'https://twitter.com/search?q=memes&src=typed_query'
results = requests.get(url)
better_results = results.json()
better_results['results'][1]['text'].encode('utf-8')
print(better_results)
That happens because you are making a request to a dynamic website. When requesting a dynamic website, you must render the HTML first in order to receive all the content you expect; just making the request is not enough.
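The error message itself is simply the JSON parser choking on the HTML document Twitter returns instead of JSON; a minimal reproduction of that failure:

```python
import json

# Feeding an HTML document to the JSON parser fails at the very first
# character, which is the "char 0" in the original traceback.
try:
    json.loads("<html><body>memes</body></html>")
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)
```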
Libraries such as requests_html render the HTML and JavaScript in the background using a lightweight browser.
You can try this code:
# pip install requests_html
from requests_html import HTMLSession
url = 'https://twitter.com/search?q=memes&src=typed_query'
session = HTMLSession()
response = session.get(url)
# rendering part
response.html.render(timeout=20)
# query the rendered HTML instead of parsing the response as JSON
# (this CSS selector is a guess and may need updating for Twitter's markup)
tweets = response.html.find('[data-testid="tweet"]')
for tweet in tweets:
    print(tweet.text)
I want to write a web scraper to collect titles of articles from the Medium.com webpage.
I am trying to write a Python script that will scrape headlines from the Medium.com website. I am using Python 3.7 and imported urlopen from urllib.request.
But it cannot open the site and fails with
"urllib.error.HTTPError: HTTP Error 403: Forbidden".
from bs4 import BeautifulSoup
from urllib.request import urlopen
webAdd = urlopen("https://medium.com/")
bsObj = BeautifulSoup(webAdd.read(), "html.parser")
Result = urllib.error.HTTPError: HTTP Error 403: Forbidden
The expected result is that it reads the website without showing any error. And indeed, the error does not happen when I use the requests module instead.
import requests
from bs4 import BeautifulSoup
url = 'https://medium.com/'
response = requests.get(url, timeout=5)
This time around it works without error.
Why ??
urllib is a fairly old, minimal module. For web scraping, the requests module is recommended.
You can check out this answer for additional information.
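The short version of why the two modules behave differently: each sends a different default User-Agent, and some sites reportedly block urllib's one outright while letting requests' (python-requests/x.y.z) through. You can inspect urllib's default directly (a sketch; the exact version suffix depends on your Python):

```python
import urllib.request

# build_opener() preinstalls a User-agent header advertising urllib itself;
# sites that filter bot-like agents can reject it with a 403.
opener = urllib.request.build_opener()
print(dict(opener.addheaders)['User-agent'])  # e.g. Python-urllib/3.x
```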
Many sites nowadays check the User-Agent of incoming requests to try to deter bots. requests is the better module to use, but if you really want to use urllib, you can alter the headers to pretend to be Firefox or something else, so that the request is not blocked. A quick example can be found here:
https://stackoverflow.com/a/16187955
import urllib.request
user_agent = 'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'
url = "http://example.com"
request = urllib.request.Request(url)
request.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(request)
You will need to fill in the user_agent string with the appropriate version values too. Hope this helps.
This worked for me:
from urllib.request import urlopen
html = urlopen(MY_URL)
contents = html.read()
print(contents)
I am trying to fetch data from Quandl using urllib2. Please check the code below.
import json
from pymongo import MongoClient
import urllib2
import requests
import ssl
#import quandl
codes = [100526];
for id in codes:
url = 'https://www.quandl.com.com//api/v3/datasets/AMFI/"+str(id)+".json?api_key=XXXXXXXX&start_date=2013-08-30'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
data = response.read()
print data
OR
for id in codes:
url = "https://www.quandl.com.com//api/v3/datasets/AMFI/"+str(id)+".json?api_key=XXXXXXXX&start_date=2013-08-30"
request = requests.get(url,verify=False)
print request
I am getting an HTTPError 404 exception in the first case, and when I use the requests module I get an SSL error even after using verify=False. I have looked through previous posts, but most of them are related to plain HTTP requests.
Thanks for the help.
J
This is working for me. You get a warning about the SSL certificate, but you don't need to worry about it.
import requests
codes = [100526]
for id in codes:
url = "https://www.quandl.com.com//api/v3/datasets/AMFI/"+str(id)+".json?api_key=XXXXXXXX&start_date=2013-08-30"
request = requests.get(url, verify=False)
print request.text
request.text has your response data.
You seem to be using a wrong URL (.com.com instead of .com), as well as a mix of different quote characters in the first version of your code. Use the following instead and it should work:
import urllib2
import requests
codes = [100526]
for id in codes:
url = "https://www.quandl.com//api/v3/datasets/AMFI/"+str(id)+".json?start_date=2013-08-30"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
print response.read()
for id in codes:
url = "https://www.quandl.com//api/v3/datasets/AMFI/"+str(id)+".json?start_date=2013-08-30"
response = requests.get(url,verify=False)
print response.text
To disable the warning about the SSL certificate, use the following code before making the request using requests:
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
I am currently scraping LINK for products and deploying my script on an Ubuntu server. This site requires you to specify a User-Agent and related request headers. As I am using Ubuntu and connecting through a proxy server, what should my "hdr" variable be within this script:
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
import urllib2, sys
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
page = urllib2.urlopen(req)
This script works just fine coming off my computer, but I am not sure what I should specify as the browser or user-agent on Ubuntu.
The code:
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import urllib2, sys
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "lxml")
result = soup.find_all("span", {"class":"availability"})
raises the error: urllib2.HTTPError: HTTP Error 403: Forbidden
But this only occurs on Ubuntu, not on my local machine.
I am not sure about the whole urllib2 thing, but if you are just trying to get the text within the HTML, you are importing way too much here. For the URL you've provided, the following is sufficient to scrape the text:
from bs4 import BeautifulSoup
import requests
As for the user-agent, that depends on whether you want the site maintainer to know about your existence; it is mostly unrelated to the capacity of scraping itself. For some sites you might want to hide your user-agent; for others you might prefer it to be explicitly stated.
For the url you've provided the following code worked for me without errors:
from bs4 import BeautifulSoup
import requests
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = requests.Session()
page_raw = req.get(url, headers=hdr)
page_raw.status_code # This was 200
page_raw.encoding = "utf-8" # Just to be sure
page_text = page_raw.text
page_soup = BeautifulSoup(page_text, "lxml")
page_availability = page_soup.find_all("span", class_="availability")
You don't have to specify different User-Agent strings based on the operating systems where the script is running. You can leave it as is.
You can go further and start rotating the User-Agent value, for instance randomly picking one from fake-useragent's list of real-world user agents.
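A hand-rolled sketch of that rotation (the strings below are illustrative stand-ins; fake-useragent maintains a much larger list of real-world ones):

```python
import random

# Illustrative User-Agent strings; a real rotation would draw from a
# maintained list such as the one fake-useragent ships.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

# Pick a fresh agent per request so repeated requests look less uniform.
hdr = {'User-Agent': random.choice(USER_AGENTS)}
print(hdr['User-Agent'])
```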
You can specify any agent you want. The header is just a string that is part of the HTTP protocol; there is no verification on the server side. Beware that the agent you specify can affect the HTML your request returns, i.e. with older agents the response might not contain all the information you expect.
I don't understand why this code works:
import urllib2
url = urllib2.urlopen('http://www.google.fr/search?hl=en&q=voiture').read()
print url
and not this one :
import urllib2
url = urllib2.urlopen('http://www.google.fr/search?hl=en&q=voiture&start=2&sa=N').read()
print url
it displays the following error:
urllib2.HTTPError: HTTP Error 403: Forbidden
Thanks ;)
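For what it's worth, the same User-Agent workaround covered in the answers above is the usual fix here as well; a sketch in Python 3 syntax (whether Google accepts this particular query with the extra start/sa parameters is an assumption, untested here):

```python
from urllib.request import Request, urlopen

# Sending a browser-like User-Agent usually avoids the 403 on the
# paginated search URL; the network call itself is left commented out.
url = 'http://www.google.fr/search?hl=en&q=voiture&start=2&sa=N'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# html = urlopen(req).read()
print(req.get_header('User-agent'))  # the header urlopen(req) would send
```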