How can I recreate a urllib.request in Python 2.7? - python

I'm crawling some web pages and parsing data from them, but one of the sites seems to be blocking my requests. The version of the code using Python 3 with urllib.request works fine. My problem is that I need to use Python 2.7, and I can't get a response using urllib2.
Shouldn't these requests be identical?
Python 3 version:
import urllib.request

def fetch_title(url):
    req = urllib.request.Request(
        url,
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    # read() returns bytes in Python 3, so decode first, then escape non-ASCII
    html = urllib.request.urlopen(req).read().decode('utf-8').encode('unicode-escape').decode('ascii')
    return html
Python 2.7 version:
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [(
    'User-Agent',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
)]
response = opener.open('http://website.com')
print response.read()

The following code should work. In Python 2.7 you can create a dictionary with your desired headers and use urllib2.Request to format the request so that it works properly with urllib2.urlopen:
import urllib2

def fetch_title(url):
    my_headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
    }
    return urllib2.urlopen(urllib2.Request(url, headers=my_headers)).read()
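For scripts that must straddle both interpreters, the opener-based approach from the question also works once the opener is installed globally. A minimal sketch (the try/except import is the usual 2/3 compatibility shim; the URL in the question is a placeholder):

```python
try:                      # Python 2
    import urllib2 as request_mod
except ImportError:       # Python 3
    import urllib.request as request_mod

opener = request_mod.build_opener()
# Note: assigning addheaders REPLACES the default list, which is what
# drops the telltale 'Python-urllib/x.y' User-Agent.
opener.addheaders = [(
    'User-Agent',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
)]
# Install the opener so every subsequent urlopen() call uses these headers
request_mod.install_opener(opener)
```

After install_opener, a plain request_mod.urlopen(url) sends the browser-like User-Agent on either Python version.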

Related

Read JSON metadata for a token from Solscan

I'm using Python and trying to read the metadata for a token on Solscan.
I am looking for the name, image, etc. from the metadata.
I am currently using a JSON request, which seems to work (i.e. it doesn't fail), but it only returns:
{"holder":0}
Process finished with exit code 0
I am making several other requests to the website, so I think my request is correct.
I tried looking at the documentation at https://public-api.solscan.io/docs and I believe I am requesting the correct info, but I don't get it.
Here is my current code:
import requests

headers = {
    'accept': 'application/jsonParsed',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
params = (
    ('tokenAddress', 'EArf8AxBi44QxFVnSab9gZpXTxVGiAX2YCLokccr1UsW'),
)
response = requests.get('https://public-api.solscan.io/token/meta', headers=headers, params=params)
# response = requests.get('https://arweave.net/viPcoBnO9OjXvnzGMXGvqJ2BEgl25BMtqGaj-I1tkCM', headers=headers)
print(response.content.decode())
Any help appreciated!
This code sample works:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
params = {
    'address': 'EArf8AxBi44QxFVnSab9gZpXTxVGiAX2YCLokccr1UsW',
}
response = requests.get('https://api.solscan.io/account', headers=headers, params=params)
print(response.content.decode())
I use a different URL and parameters in my sample: https://api.solscan.io/account instead of https://public-api.solscan.io/token/meta, and the address param instead of tokenAddress.
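For reference, requests builds the final URL by URL-encoding the params mapping into the query string; the same URL can be reproduced with the standard library alone. A sketch using the address from the question:

```python
from urllib.parse import urlencode

base = "https://api.solscan.io/account"
params = {"address": "EArf8AxBi44QxFVnSab9gZpXTxVGiAX2YCLokccr1UsW"}

# requests.get(base, params=params) requests exactly this URL:
url = base + "?" + urlencode(params)
print(url)
```

This makes it easy to see, before sending anything, whether the parameter name the endpoint expects (address vs. tokenAddress) is actually what ends up on the wire.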

Python request yields status code 500 even though the website is available

I'm trying to use Python to check whether or not a list of websites is online. However, on several sites, requests yields the wrong status code. For example, the status code I get for https://signaturehound.com/ is 500 even though the website is online and in the Chrome developer tools the response code 200 is shown.
My code looks as follows:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

def url_ok(url):
    r = requests.head(url, timeout=5, allow_redirects=True, headers=headers)
    status_code = r.status_code
    return status_code

print(url_ok("https://signaturehound.com/"))
As suggested by @CaptainDaVinci in the comments, the solution is to replace head with get in the code:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

def url_ok(url):
    r = requests.get(url, timeout=5, allow_redirects=True, headers=headers)
    status_code = r.status_code
    return status_code

print(url_ok("https://signaturehound.com/"))
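The underlying cause is that some servers mishandle HEAD requests (answering 405, 501, or even 500) while serving GET normally. A sketch of a fallback that only pays the cost of a full GET when the HEAD result looks suspicious (the status-code list is a heuristic, not exhaustive):

```python
def should_retry_with_get(status_code):
    """Some servers don't implement HEAD properly and answer 405
    (Method Not Allowed), 501, or even 500, even though a normal
    GET to the same URL succeeds."""
    return status_code in (405, 500, 501)

def url_ok(url, timeout=5, headers=None):
    import requests  # local import keeps the helper above dependency-free
    r = requests.head(url, timeout=timeout, allow_redirects=True, headers=headers)
    if should_retry_with_get(r.status_code):
        # HEAD looks unreliable here; re-check with a real GET
        r = requests.get(url, timeout=timeout, allow_redirects=True, headers=headers)
    return r.status_code
```

For a long list of sites this keeps most checks cheap while still reporting the correct status for servers like the one in the question.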

python user-agent xpath amazon

I am trying to scan the HTML from two requests.
The first one works, but when I try the other one, the HTML I am trying to locate is not visible; it gets the wrong markup.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get('https://www.amazon.com/gp/aw/ol/B00DZKQSRQ/ref=mw_dp_olp?ie=UTF8&condition=new', headers=headers)

newHeader = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'}
pagePrice = requests.get('https://www.amazon.com/gp/aw/ol/B01EQJU8AW/ref=mw_dp_olp?ie=UTF8&condition=new', headers=newHeader)
The first request works fine and gets me the good HTML.
The second request gives bad HTML.
I saw this package, but had no success with it:
https://pypi.python.org/pypi/fake-useragent
And I saw this topic, which went unanswered:
Double user-agent tag, "user-agent: user-agent: Mozilla/"
Thank you very much!
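One common workaround (a sketch, not a guaranteed fix for Amazon's bot detection) is to rotate through a small hand-maintained pool of User-Agent strings instead of relying on fake-useragent. The strings below are the sample values from this thread and are not guaranteed to be current:

```python
import random

# Hand-maintained pool of desktop User-Agent strings (sample values)
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
]

def random_headers():
    """Return a headers dict with a freshly picked User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}
```

Each call to requests.get(url, headers=random_headers()) then presents a different browser identity, which sometimes avoids the degraded HTML that Amazon serves to suspected bots.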

Ruby vs Python get request

I'm trying to make a GET request to the NBA API in Ruby. I've tried the unirest library, but the request hangs and times out after 5 minutes.
When I use Python's requests library for the same URL + endpoint, it works. Here's what I'm trying for each:
require "unirest"

base_url = "http://stats.nba.com/stats"
team_endpoint = "/teamgamelog"
params = {:TeamID => "1610612739", :Season => "2016-17", :SeasonType => "Playoffs"}
HEADERS = {
  'user-agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
  'Accept' => "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
}
response = Unirest.get "#{base_url}#{team_endpoint}", headers: HEADERS, parameters: params
The above code doesn't work (the connection eventually times-out). The following python code works:
import requests

base_url = 'http://stats.nba.com/stats'
HEADERS = {
    'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/45.0.2454.101 Safari/537.36'),
}
end_point = '/teamgamelog'
PARAMS = {'TeamID': '1610612739', 'Season': '2016-17', 'SeasonType': 'Playoffs'}
r = requests.get(base_url + end_point, headers=HEADERS, params=PARAMS)
Any advice on solving the ruby case?

Testing for user agent spoofing in python

I'm new to Python and am using Python 3. I'm trying to download a webpage, and what I want to know is whether there is a way to actually see the name of the user agent the way a sysadmin or Google sees it. In my code I download and save the webpage to a text file like this:
# Import
from urllib.request import urlopen, Request

url1 = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}))

# Create file and write
f = open('mah_a.txt', 'wb')
f.write(url1.read())
f.close()
How do I check to see if my user agent name has changed ?
You changed the User-Agent header, yes. You can use an online echo server like httpbin.org if you want to see what the server received:
url = 'http://httpbin.org/get'
url1 = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}))
print(url1.read().decode('utf8'))
Demo:
>>> from urllib.request import urlopen,Request
>>> url = 'http://httpbin.org/get'
>>> url1 = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}))
>>> print(url1.read().decode('utf8'))
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
  },
  "origin": "188.29.165.166",
  "url": "http://httpbin.org/get"
}
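The outgoing header can also be checked locally, before anything is sent, by inspecting the Request object itself. Note that urllib normalizes stored header names (only the first letter is capitalized), so the lookup key is 'User-agent':

```python
from urllib.request import Request

req = Request(
    'http://httpbin.org/get',
    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/35.0.1916.47 Safari/537.36'}
)
# No network traffic happens until urlopen(req) is called, so this
# inspection is free and works offline.
print(req.get_header('User-agent'))
```

This is handy when you only want to verify that the spoofed value was set, without depending on an external echo service.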
