Using Requests for a mixture of URLs in Python 3.x

I have a .txt file that contains a list of URLs. The structure of the URLs varies: some may begin with https, some with http, others with just www, and others with just the domain name (stackoverflow.com). So an example of the .txt file content is:
www.google.com
microsoft.com
https://www.yahoo.com
http://www.bing.com
What I want to do is parse through the list and check whether the URLs are live. In order to do that, the structure of the URL must be correct, otherwise the request will fail. Here's my code so far:
import requests

with open('urls.txt', 'r') as f:
    urls = f.readlines()

for url in urls:
    url = url.replace('\n', '')
    if not url.startswith('http'):  # This is to handle just domain names and those that begin with 'www'
        url = 'http://' + url
    if url.startswith('http:'):
        print("trying url {}".format(url))
        response = requests.get(url, timeout=10)
        status_code = response.status_code
        if status_code == 200:
            continue
        else:
            print("URL {} has a response code of {}".format(url, status_code))
            print("encountered error. Now trying with https")
            url = url.replace('http://', 'https://')
            print("Now replacing http with https and trying again")
            response = requests.get(url, timeout=10)
            status_code = response.status_code
            print("URL {} has a response code of {}".format(url, status_code))
    else:
        response = requests.get(url, timeout=10)
        status_code = response.status_code
        print("URL {} has a response code of {}".format(url, status_code))
I feel like I've overcomplicated this somewhat, and there must be an easier way of trying the variants (i.e. the bare domain name, the domain with 'www' at the beginning, with 'http://' at the beginning, and with 'https://' at the beginning) until a site is identified as being live or not (i.e. all variants have been exhausted).
Any suggestions on my code or a better way to approach this? In essence, I want to handle the formatting of the URL to ensure that I then attempt to check the status of the URL.
Thanks in advance

This is a little too long for a comment, but yes, it can be simplified, starting from (and replacing) the startswith part:
if '//' not in url:
    url = 'http://' + url
response = requests.get(url, timeout=10)
etc.
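Fleshing that out into the "try each variant until one works" approach the question asks about, here is a minimal sketch (the check_url helper and the https-then-http order are my own choices, not part of the answer):

import requests

def check_url(url, timeout=10):
    '''Try the https:// and http:// variants of a URL; return (variant, status_code) or None.'''
    bare = url.split('//', 1)[-1]  # strip any scheme that is already there
    for variant in ('https://' + bare, 'http://' + bare):
        try:
            response = requests.get(variant, timeout=timeout)
            return variant, response.status_code
        except requests.RequestException:
            continue  # connection failed, try the next variant
    return None  # every variant failed

with open('urls.txt') as f:
    for line in f:
        url = line.strip()
        result = check_url(url)
        print(result if result else "no live variant for {}".format(url))

Note that requests follows redirects by default, so a plain http:// attempt that redirects to https:// will still succeed.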

Related

Python - Multiple URL - Append/Extend to Single Dataframe

I'm new to Python, but I've successfully connected to an API and upserted the data to our SQL database. However, I need to run the same process with multiple URLs, with identical data being returned. I'd like to build a single dataframe out of it, and then utilize all my existing upsert code.
import requests
import pandas as pd

URLs = ["https://www.url1.com/fall", "https://www.url1.com/spring"]
data_results = []
payload = {}
headers = {
    'apikey': apikey
}

for url in URLs:
    resp = requests.get(url, headers=headers, data=payload)
    if resp.status_code != 200:
        print(f"Error {url}")
        continue
    data_results.extend(resp)
    data_results = resp.json(strict=False)
I've also changed .extend to .append
Then I wanted to build the dataframe from data_results
I get the output of the 2nd url only.
Am I missing something easy?
It was a combination of both of you!
for url in URLs:
    resp = requests.get(url, headers=headers, data=payload)
    if resp.status_code != 200:
        print(f"Error {url}")
        continue
    data_results.extend(resp.json(strict=False))
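Once the loop has run, data_results is one flat list of records, so building the single dataframe is just (a sketch, assuming each URL returns a JSON array of flat objects):

import pandas as pd

# data_results is the combined list built by the loop above
df = pd.DataFrame(data_results)
print(df.shape)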

POST request using urllib2 doesn't correctly send data (401 error)

I am trying to make a POST request in Python 2, using urllib2. My code is currently as follows:
url = 'http://' + server_url + '/playlists/upload?'
data = urllib.urlencode(OrderedDict([("sectionID", section_id), ("path", current_playlist), ("X-Plex-Token", plex_token)]))
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
d = response.read()
print(d)
'url' and 'data' are returned correctly formatted with the variables; I know this because I can copy their output into Postman for checking, and the POST works fine (see the example URL below).
http://192.168.1.96:32400/playlists/upload?sectionID=11&path=D%3A%5CMedia%5CPPP%5Ctmp%5Cplex%5CAmbient.m3u&X-Plex-Token=XXXXXXXXX
When I run my Python code I get a 401 error returned, presumably meaning the X-Plex-Token parameter was not correctly sent, hence I am not allowed access.
Can anyone tell me where I'm going wrong? Help is greatly appreciated.
Have you tried removing the question mark and not using OrderedDict (no idea why you would need that)?
url = 'http://' + server_url + '/playlists/upload'
data = urllib.urlencode({"sectionID": section_id, "path": current_playlist, "X-Plex-Token": plex_token})
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
d = response.read()
print(d)
Of course you should be using requests instead anyway:
import requests
r = requests.post('http://{}/playlists/upload'.format(server_url), data = {"sectionID": section_id, "path": current_playlist, "X-Plex-Token": plex_token})
print r.url
print r.text
print r.json()
I've ended up switching to Python 3, as I didn't realise that the requests module was included by default. Still no idea why the above wasn't working, but maybe something to do with the lack of headers
headers = {'cache-control': "no-cache"}
edit:
This is what I'm using now; as mentioned above, I probably don't need OrderedDict.
import requests
import urllib.parse
from collections import OrderedDict

url = 'http://' + server_url + '/playlists/upload'
headers = {'cache-control': "no-cache"}
querystring = urllib.parse.urlencode(OrderedDict([("sectionID", section_id), ("path", current_playlist), ("X-Plex-Token", plex_token)]))
response = requests.request("POST", url, data="", headers=headers, params=querystring)
print(response.text)
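For what it's worth, requests can also build the query string itself from a plain dict, so the manual urlencode step can be dropped entirely; a sketch, reusing the same server_url, section_id, current_playlist and plex_token variables:

import requests

params = {
    "sectionID": section_id,
    "path": current_playlist,
    "X-Plex-Token": plex_token,
}
headers = {'cache-control': "no-cache"}
response = requests.post('http://{}/playlists/upload'.format(server_url),
                         params=params, headers=headers)
print(response.status_code, response.text)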

python using requests with valid hostname

Trying to use requests to download a list of urls and catch the exception if it is a bad url. Here's my test code:
import requests
from requests.exceptions import ConnectionError

#goodurl
url = "http://www.google.com"
#badurl with good host
#url = "http://www.google.com/thereisnothing.jpg"
#url with bad host
#url = "http://somethingpotato.com"
print url
try:
    r = requests.get(url, allow_redirects=True)
    print "the url is good"
except ConnectionError, e:
    print e
    print "the url is bad"
The problem is if I pass in url = "http://www.google.com" everything works as it should and as expected since it is a good url.
http://www.google.com
the url is good
But if I pass in url = "http://www.google.com/thereisnothing.jpg"
I still get :
http://www.google.com/thereisnothing.jpg
the url is good
So it's almost like it's not even looking at anything after the "/".
Just to see if the error checking is working at all, I passed a bad hostname: #url = "http://somethingpotato.com"
Which kicked back the error message I expected:
http://somethingpotato.com
HTTPConnectionPool(host='somethingpotato.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1b6cd15b90>: Failed to establish a new connection: [Errno -2] Name or service not known',))
the url is bad
What am I missing to make request capture a bad url not just a bad hostname?
Thanks
requests does not raise an exception for a 404 response. Instead, you need to filter those out by checking whether the status is 'ok' (HTTP response 200):
import requests
from requests.exceptions import ConnectionError

#goodurl
url = "http://www.google.com/nothing"
#badurl with good host
#url = "http://www.google.com/thereisnothing.jpg"
#url with bad host
#url = "http://somethingpotato.com"
print url
try:
    r = requests.get(url, allow_redirects=True)
    if r.status_code == requests.codes.ok:
        print "the url is good"
    else:
        print "the url is bad"
except ConnectionError, e:
    print e
    print "the url is bad"
EDIT:
import requests
from requests.exceptions import ConnectionError

def printFailedUrl(url, response):
    if isinstance(response, ConnectionError):
        print "The url " + url + " failed to connect with the exception " + str(response)
    else:
        print "The url " + url + " produced the failed response code " + str(response.status_code)

def testUrl(url):
    try:
        r = requests.get(url, allow_redirects=True)
        if r.status_code == requests.codes.ok:
            print "the url is good"
        else:
            printFailedUrl(url, r)
    except ConnectionError, e:
        printFailedUrl(url, e)

def main():
    testUrl("http://www.google.com")  # 'Good' url
    testUrl("http://www.google.com/doesnotexist.jpg")  # 'Bad' url with 404 response
    testUrl("http://sdjgb")  # 'Bad' url with an inaccessible host

main()
In this case one function can handle either an exception or a request response being passed into it. This way you can have separate messages for when the url returns some non-'good' (non-200) response versus an unusable url that throws an exception. Hope this has the information you need in it.
What you want is to check r.status_code. Getting r.status_code on "http://www.google.com/thereisnothing.jpg" will give you 404. You can add a condition so that only URLs with a 200 code are treated as "good".
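A related option, not mentioned in either answer: requests can turn 4xx/5xx responses into exceptions with raise_for_status(), which keeps the original try/except shape. A minimal Python 3 sketch:

import requests

def url_is_good(url):
    try:
        r = requests.get(url, allow_redirects=True, timeout=10)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx responses
        return True
    except requests.RequestException as e:  # covers connection errors and HTTPError
        print("the url is bad: {}".format(e))
        return False

print(url_is_good("http://www.google.com/thereisnothing.jpg"))  # False (404)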

how to handle python crawler's urlopen error?

When I write a python crawler, I often use urlopen. Sometimes it can't open the url (so I get an error), but when I retry opening the url, it succeeds. So I handle this situation by writing my crawler like this:
import urllib.request

def url_open(url):
    '''open the url and return its content'''
    req = urllib.request.Request(headers=header, url=url)
    while True:
        try:
            response = urllib.request.urlopen(req)
            break
        except:
            continue
    contents = response.read().decode('utf8')
    return contents
I think this code is ugly... but it works, so is there some elegant way to do this?
I would strongly recommend using the requests library. You may end up with the same problem, but I found requests easier to work with and also more reliable.
The same request would go like this
def url_open(url):
    while True:
        try:
            response = requests.get(url, headers=header)
            break
        except:
            continue
    return response.text
What error are you getting?
I would recommend going ahead and using the requests API with Sessions and Adapters so that you can explicitly set the number of retries. It is more code, but it is definitely cleaner:
import requests

session = requests.Session()
http_adapter = requests.adapters.HTTPAdapter(max_retries=3)
https_adapter = requests.adapters.HTTPAdapter(max_retries=3)
session.mount('http://', http_adapter)
session.mount('https://', https_adapter)

response = session.get(url)
if response.status_code != 200:
    # Handle the request failure here
    pass
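Wired back into the original helper, that looks roughly like this (the retry count of 3 comes from the answer above; the timeout value and raise_for_status call are my own additions):

import requests

def url_open(url, header=None):
    '''Open the url with a bounded number of connection retries and return its content.'''
    session = requests.Session()
    adapter = requests.adapters.HTTPAdapter(max_retries=3)  # retry failed connections up to 3 times
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    response = session.get(url, headers=header or {}, timeout=10)
    response.raise_for_status()  # surface HTTP errors instead of looping forever
    return response.text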

Python Requests - Is it possible to receive a partial response after an HTTP POST?

I am using the Python Requests module to datamine a website. As part of the datamining, I have to HTTP POST a form and check if it succeeded by checking the resulting URL. My question is: after the POST, is it possible to ask the server not to send the entire page? I only need to check the URL, yet my program downloads the entire page and consumes unnecessary bandwidth. The code is very simple:
import requests

r = requests.post(URL, payload)
if 'keyword' in r.url:
    success
else:
    fail
An easy solution, if it's implementable for you, is to go low-level and use the socket library. For example, say you need to send a POST with some data in its body. I used this in my crawler for one site.
import socket
from urllib import quote  # POST body is escaped, use quote

req_header = "POST /{0} HTTP/1.1\r\nHost: www.yourtarget.com\r\nUser-Agent: For the lulz..\r\nContent-Type: application/x-www-form-urlencoded; charset=UTF-8\r\nContent-Length: {1}"
req_body = quote("data1=yourtestdata&data2=foo&data3=bar=")
req_url = "test.php"
header = req_header.format(req_url, str(len(req_body)))  # plug in req_url as {0} and the length of req_body as Content-Length

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create a socket
s.connect(("www.yourtarget.com", 80))  # connect it
s.send(header + "\r\n\r\n" + req_body + "\r\n\r\n")  # send header + two CRLFs + body + two CRLFs to complete the request

page = ""
while True:
    buf = s.recv(1024)  # receive the first 1024 bytes; this should be enough to get the header in one try
    if not buf:
        break
    if "\r\n\r\n" in page:  # if we received the whole header (ending with 2x CRLF), break
        break
    page += buf
s.close()  # close the socket here, which should close the TCP connection even if data is still flowing in
# This should leave you with a header in which you should find a 302 redirect, and your target URL in the "Location:" header.
There's a chance the site uses the Post/Redirect/Get (PRG) pattern. If so, it's enough not to follow the redirect and to read the Location header from the response.
Example
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
>>> response.status_code
302
>>> response.headers['location']
'http://httpbin.org/get'
If you need more information on what you would get if you had followed the redirection, you can use HEAD on the url given in the Location header.
Example
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
>>> response.status_code
302
>>> response.headers['location']
'http://httpbin.org/get'
>>> response2 = requests.head(response.headers['location'])
>>> response2.status_code
200
>>> response2.headers
{'date': 'Wed, 07 Nov 2012 20:04:16 GMT', 'content-length': '352', 'content-type':
'application/json', 'connection': 'keep-alive', 'server': 'gunicorn/0.13.4'}
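Applied to the question's POST, that idea looks roughly like this (a sketch, reusing the URL and payload names from the question): don't follow the redirect, and check the Location header instead of the body.

import requests

r = requests.post(URL, data=payload, allow_redirects=False)  # stop at the redirect, don't fetch its target
if r.is_redirect and 'keyword' in r.headers.get('location', ''):
    print("success")
else:
    print("fail")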
It would help if you gave some more data, for example a sample URL that you're trying to request. That being said, it seems to me that you're generally checking whether you had the correct URL after your POST request, using the following algorithm relying on redirection or HTTP 404 errors:
if original_url == returned request url:
    correct url to a correctly made request
else:
    wrong url and a wrongly made request
If this is the case, what you can do here is use the HTTP HEAD request (another type of HTTP request like GET, POST, etc.) in Python's requests library to get only the header and not also the page body. Then, you'd check the response code and redirection url (if present) to see if you made a request to a valid URL.
For example:
import requests

def attempt_url(url):
    '''Checks the url to see if it is valid, or returns a redirect or error.
    Returns True if valid, False otherwise.'''
    r = requests.head(url)
    if r.status_code == 200:
        return True
    elif r.status_code in (301, 302):
        if r.headers['location'] == url:
            return True
        else:
            return False
    elif r.status_code == 404:
        return False
    else:
        raise Exception("A status code we haven't prepared for has arisen!")
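A quick call against httpbin.org (just to illustrate; any reachable host works):

print(attempt_url('http://httpbin.org/status/200'))  # True
print(attempt_url('http://httpbin.org/status/404'))  # False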
If this isn't quite what you're looking for, additional detail on your requirements would help. At the very least, this gets you the status code and headers without pulling all of the page data.
