Headers, User-Agents, URL Requests in Ubuntu - Python

Currently scraping LINK for products, deploying my script on an Ubuntu server. This site requires you to specify a User-Agent and URL-related headers. As I am using Ubuntu and connecting through a proxy server on Ubuntu, what should my "hdr" variable be within this script:
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
import urllib2, sys
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
page = urllib2.urlopen(req)
This script works just fine when run from my own computer; however, I am not sure what I should specify as the browser or user-agent for Ubuntu.
The code:
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import urllib2, sys
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "lxml")
result = soup.find_all("span", {"class":"availability"})
returns the error: urllib2.HTTPError: HTTP Error 403: Forbidden
But this only occurs on Ubuntu, not on my local machine.

I am not sure about the whole urllib2 thing, but if you are just trying to get the text out of the HTML, you are importing way too much here. For the url you've provided, the following imports are sufficient to scrape the text:
from bs4 import BeautifulSoup
import requests
As for the user-agent, that depends on whether you want the site maintainer to know about your existence or not; mostly it is not related to the ability to scrape itself. For some sites you might want to hide your user-agent; for others you might prefer it to be stated explicitly.
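If you do want to state it explicitly, a descriptive User-Agent is just another string in the headers dict. In the sketch below the project name and contact address are made-up placeholders, and some sites may still insist on a browser-like string:
import requests
explicit_hdr = {'User-Agent': 'my-availability-checker/0.1 (contact: you@example.com)'}  # placeholder values
page = requests.get("http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s", headers=explicit_hdr)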
For the url you've provided the following code worked for me without errors:
from bs4 import BeautifulSoup
import requests
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = requests.Session()
page_raw = req.get(url, headers=hdr)
page_raw.status_code # This was 200
page_raw.encoding = "utf-8" # Just to be sure
page_text = page_raw.text
page_soup = BeautifulSoup(page_text, "lxml")
page_availability = page_soup.find_all("span", class_="availability")
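If you then want the plain text out of those tags, a short follow-up to the snippet above could be:
availability_texts = [tag.get_text(strip=True) for tag in page_availability]  # plain strings
print(availability_texts)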

You don't have to specify a different User-Agent string based on the operating system where the script is running. You can leave it as is.
You can go further and start rotating the User-Agent value - for instance, randomly picking one from fake-useragent's list of real-world user agents.
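A minimal sketch of that idea, assuming the fake-useragent package is installed (pip install fake-useragent):
from fake_useragent import UserAgent
import requests
url = "http://www.sneakersnstuff.com/en/product/22422/adidas-superstar-80s"
ua = UserAgent()                  # maintains a list of real-world browser strings
hdr = {'User-Agent': ua.random}   # pick a random one per request
page_raw = requests.get(url, headers=hdr)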

You can specify any agent you want. The header is just a string that is part of the HTTP protocol; no verification goes on at the server. Beware that the header you specify can determine how the HTML returned for your request looks, i.e. with older agents the page might not contain all the information you expect.
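If you want to check exactly which User-Agent string a server receives, an echo service such as httpbin.org can show it (the agent name below is arbitrary):
import requests
resp = requests.get('https://httpbin.org/user-agent', headers={'User-Agent': 'my-arbitrary-agent/1.0'})
print(resp.json())  # {'user-agent': 'my-arbitrary-agent/1.0'}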

Related

Can't parse coin gecko page from today with BeautifulSoup because of Cloudflare

from bs4 import BeautifulSoup as bs
import requests
import re
import cloudscraper
def get_btc_price(br):
    data = requests.get('https://www.coingecko.com/en/coins/bitcoin')
    soup = bs(data.text, 'html.parser')
    price1 = soup.find('table', {'class': 'table b-b'})
    fclas = price1.find('td')
    spans = fclas.find('span')
    price2 = spans.text
    price = price2.strip()
    x = float(price[1:])
    y = x * br
    z = round(y, 2)
    print(z)
    return z
This has been working for months, and this morning it decided to stop. The messages I'm getting are like: "Checking your browser before you can continue...", "check your antivirus or consult with your manager to get access...", and some Cloudflare gibberish.
I tried
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
print(scraper.get("https://www.coingecko.com/en/coins/bitcoin").text)
and it still blocks my access. What should I do? Is there any other way to bypass this, or am I doing something wrong?
It doesn't seem to be a problem with the scraper but with the server when negotiating the connection.
Add a user agent, otherwise requests uses its default:
import requests
url = 'https://www.coingecko.com/en/coins/bitcoin'
user_agent = 'Mozilla/5.0'  # or any browser User-Agent string of your choice
response = requests.get(url, headers={"user-agent": user_agent})
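If you are curious what that default is, requests exposes it (the version suffix depends on your installed release):
import requests
print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.28.1'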
Check the "requirements" the server sends back in its response headers:
url = 'https://www.coingecko.com/en/coins/bitcoin'
response = requests.get(url)
for key, value in response.headers.items():
    print(key, ":", value)

urlopen of urllib.request cannot open a page in python 3.7

I want to write a web scraper to collect article titles from Medium.com.
I am writing a Python script that will scrape headlines from the Medium.com website, using Python 3.7 and urlopen imported from urllib.request.
But it cannot open the site and raises an "urllib.error.HTTPError: HTTP Error 403: Forbidden" error.
from bs4 import BeautifulSoup
from urllib.request import urlopen
webAdd = urlopen("https://medium.com/")
bsObj = BeautifulSoup(webAdd.read())
Result = urllib.error.HTTPError: HTTP Error 403: Forbidden
The expected result is that it will not show any error and just read the website.
But this error does not occur when I use the requests module:
import requests
from bs4 import BeautifulSoup
url = 'https://medium.com/'
response = requests.get(url, timeout=5)
This time around it works without error.
Why?
urllib is a pretty old and low-level module. For web scraping, the requests module is recommended.
You can check out this answer for additional information.
Many sites nowadays check the user agent of incoming requests to try to deter bots. requests is the better module to use, but if you really want to use urllib, you can alter the header text to pretend to be Firefox or something else, so that the request is not blocked. A quick example can be found here:
https://stackoverflow.com/a/16187955
import urllib.request
user_agent = 'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'
url = "http://example.com"
request = urllib.request.Request(url)
request.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(request)
You will need to fill in the user_agent string with the appropriate platform and version values too. Hope this helps.
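For example, a filled-in Firefox-on-Linux string looks roughly like this (the platform and version numbers are only illustrative, not values the site requires):
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0'  # illustrative versions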
This worked for me:
import urllib
from urllib.request import urlopen
html = urlopen(MY_URL)
contents = html.read()
print(contents)

How to send cookies with urllib

I'm attempting to connect to a website that requires you to have a specific cookie to access it. For the sake of this question, we'll call the cookie 'required_cookie' and the value 'required_value'.
This is my code:
import urllib.request
from urllib.request import Request, urlopen
import http.cookiejar
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('required_cookie', 'required_value'), ('User-Agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
req = Request('https://www.thewebsite.com/')
webpage = urlopen(req).read()
print(webpage)
I'm new to urllib, so please explain it to me as a beginner.
To do this with urllib, you need to:
Construct a Cookie object. The constructor isn't documented, but if you run help(http.cookiejar.Cookie) in the interactive interpreter, you can see that it demands values for all 16 attributes. Notice that the docs say, "It is not expected that users of http.cookiejar construct their own Cookie instances."
Add it to the cookiejar with cj.set_cookie(cookie).
Tell the cookiejar to add the correct headers to the request with cj.add_cookie_header(req).
Assuming you've configured the policy correctly, you're set.
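A minimal sketch of those three steps, assuming a session cookie for www.thewebsite.com (the Cookie attribute values below are just reasonable defaults, not something the site dictates):
import urllib.request
import http.cookiejar
cj = http.cookiejar.CookieJar()
# Step 1: build the Cookie by hand -- every attribute must be supplied.
cookie = http.cookiejar.Cookie(
    version=0, name='required_cookie', value='required_value',
    port=None, port_specified=False,
    domain='www.thewebsite.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={})
# Step 2: put it in the jar.
cj.set_cookie(cookie)
# Step 3: let the jar attach the Cookie header to the request, then open it.
req = urllib.request.Request('https://www.thewebsite.com/', headers={'User-Agent': 'Mozilla/5.0'})
cj.add_cookie_header(req)
webpage = urllib.request.urlopen(req).read()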
But this is a huge pain. As the docs for urllib.request say:
See also The Requests package is recommended for a higher-level HTTP client interface.
And, unless you have some good reason you can't install requests, you really should go that way. urllib is tolerable for really simple cases, and it can be handy when you need to get deep under the covers—but for everything else, requests is much better.
With requests, your whole program becomes a one-liner:
webpage = requests.get('https://www.thewebsite.com/', cookies={'required_cookie': required_value}, headers={'User-Agent': 'Mozilla/5.0'}).text
… although it's probably more readable as a few lines:
cookies = {'required_cookie': required_value}
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.thewebsite.com/', cookies=cookies, headers=headers)
webpage = response.text
With the help of Kite documentation: https://www.kite.com/python/answers/how-to-add-a-cookie-to-an-http-request-using-urllib-in-python
You can add a cookie this way:
import urllib.request
a_request = urllib.request.Request("http://www.kite.com/")
a_request.add_header("Cookie", "cookiename=cookievalue")
or in a different way:
from urllib.request import Request
url = "https://www.kite.com/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie':'myCookie=lovely'})
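Either way, the Request object only carries the headers; continuing the second snippet, it still has to be passed to urlopen to actually fetch the page:
from urllib.request import urlopen
response = urlopen(req)   # sends the request with the User-Agent and Cookie headers attached
html = response.read()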

Python 3, urlopen - HTTP Error 403: Forbidden

I'm trying to automatically download the first image that appears in a Google image search, but I'm not able to read the website source; an error occurs ("HTTP Error 403: Forbidden").
Any ideas? Thank you for your help!
That's my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
word = 'house'
r = urlopen('https://www.google.pl/search?&dcr=0&tbm=isch&q='+word)
data = r.read()
Apparently you have to pass the headers argument because the website is blocking you, thinking you are a bot requesting data. I found an example of doing this here: HTTP error 403 in Python 3 Web Scraping.
Also, urlopen doesn't support a headers argument, so I had to use the Request object instead.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
word = 'house'
r = Request('https://www.google.pl/search?&dcr=0&tbm=isch&q='+word, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(r).read()

using python urlopen for a url query

Using urlopen also for url queries seems obvious. What I tried is:
import urllib2
query='http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
f = urllib2.urlopen(query)
s = f.read()
f.close()
However, for this specific url query it fails with HTTP Error 403: Forbidden.
When entering this query in my browser, it works.
Also when using http://www.httpquery.com/ to submit the query, it works.
Do you have suggestions how to use Python right to grab the correct response?
Looks like it requires cookies (which you can handle with urllib2), but an easier way, if you're doing this, is to use requests:
import requests
session = requests.session()
r = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')
This is generally a much easier and less stressful method of retrieving URLs in Python.
requests will automatically store and re-use cookies for you. Creating a session is slight overkill here, but it is useful when you need to submit data to login pages, or re-use cookies across a site, etc.
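Continuing the snippet above, you can inspect the cookies the session stored from the first response; they are sent back automatically on later requests:
print(session.cookies.get_dict())  # cookies the server set on the first response
r2 = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')  # same cookies re-used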
Using urllib2, it would be something like:
import urllib2, cookielib
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))
data = opener.open('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627').read()
It appears that the urllib2 default user agent is banned by the host. You can simply supply your own user agent string:
import urllib2
url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
request = urllib2.Request(url, headers={"User-Agent" : "MyUserAgent"})
contents = urllib2.urlopen(request).read()
print contents
