Python - Check if URL exists fails

I'm trying to create a Python function that will connect to a URL and check whether a list of directories exists on that website. So the input consists of a target and the directories. My ultimate goal is to write some sort of DirBuster-like program.
This is my function until now:
import httplib
from urlparse import urlparse

# target, x, total_resp and found are defined elsewhere in the script
def checkDir(checkDir_target):
    breakurl = urlparse(target)
    conn = httplib.HTTPConnection(breakurl.netloc)
    conn.request('HEAD', checkDir_target)
    response = conn.getresponse()
    print response.status
    complete = target + x
    if (response.status < 400):
        print(" [X] " + complete)
        global total_resp
        total_resp += 1
        found.append(complete)
    else:
        print(" [ ] " + complete)
The only problem I'm having right now is that dynamically created pages, such as WordPress pages, also return HTTP status 200. So even when I test a non-existent URL, the website still returns HTTP 200 OK.
Example: testing www.wordpressexamplesite.com/DIRECTORYTHATDOESNTEXISTS/ gives an HTTP 200 code, just like URLs that DO exist.
This means that the whole check in the checkDir function is not doing its work like I want it to.
Can one of you guys give me some ideas on how to resolve this?

Unfortunately for you, when the server returns a "200 OK", that means the URL does, in fact, exist, and the response body is its contents. Those contents might be a page that says "This doesn't exist". To identify that, you would need some sort of artificial intelligence that can render and read the returned content and comprehend it like a human would.
I consider it bad website design (and even worse for AJAX APIs) to always return "200 OK" and embed the "real" status in the payload, but that is how some people code it.
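One common workaround in DirBuster-style tools (not something the answer above prescribes, just a heuristic) is to first request a path that is almost certainly missing, record what the site returns for it, and then treat any later response that looks the same as a "soft 404". A minimal sketch using the requests library; the random-path trick and the 50-byte length tolerance are assumptions you would tune:

import uuid
import requests

def soft_404_baseline(base_url):
    # Ask for a path that almost certainly does not exist and remember
    # the status code and body length the site returns for it.
    bogus = base_url.rstrip('/') + '/' + uuid.uuid4().hex + '/'
    r = requests.get(bogus, allow_redirects=True, timeout=10)
    return r.status_code, len(r.content)

def dir_exists(base_url, path, baseline):
    r = requests.get(base_url.rstrip('/') + '/' + path, allow_redirects=True, timeout=10)
    if r.status_code >= 400:
        return False
    base_status, base_length = baseline
    # A "hit" that looks just like the known-missing page is a soft 404.
    if r.status_code == base_status and abs(len(r.content) - base_length) < 50:
        return False
    return True

For the WordPress example above, the baseline call would capture what the site returns for a nonsense directory, so a directory only counts as found when its response differs from that baseline.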

Try using the requests library:
import requests

def checkDir(checkDir_target):
    complete = target + checkDir_target
    response = requests.get(complete)
    # requests exposes the code as status_code, not status
    if response.status_code < 400:
        print(" [X] " + complete)
        global total_resp
        total_resp += 1
        found.append(complete)
    else:
        print(" [ ] " + complete)
I think this can work for you.
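For example, driving it from a small wordlist might look like this (the target and the wordlist entries are just placeholders; total_resp and found are the globals from the question):

target = "http://www.wordpressexamplesite.com/"
total_resp = 0
found = []

for directory in ["admin/", "images/", "wp-content/"]:
    checkDir(directory)

print("Found %d directories:" % total_resp)
print(found)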

Getting the number of Twitter friends with Python

I need to get the Twitter friends of a user. By Twitter "friends" I mean: user A follows user B, and user B also follows user A.
OK, so my idea was to get the list of people that user A follows, and for each of them check whether they follow user A back; if so, increase numFriends.
The first part works fine, but when I make the second request it falls apart and throws an ugly 'error 400' at me :(
I read about Twitter's restrictions and all that, but something about that second request seems off, so I don't know whether I'm doing it right.
Thanks in advance. I'm a noob at Python and the Twitter API, and my mother tongue is not English, so I really hope everything is clear. Any help is appreciated.
Here is the code :)
def getNumFriends(user):
    dataUser = 0
    numFriends = 0
    url = "https://api.twitter.com/1.1/friends/ids.json?cursor=-1&screen_name=%s&count=5000" % (user.screen_name)
    auth = OAuth1(getConsumerKey(), getConsumerSecret(), getAccessToken(), getAccessTokenSecret())
    response = requests.get(url, auth=auth)
    if response.status_code == 200:
        dataUser = response.json()
        userIDs = dataUser['ids']
    else:
        print "Error code %s" % response.status_code

    #Here comes the problem :S
    for friend in userIDs:
        url = "https://api.twitter.com/1.1/friendships/show.json?source_id=%s&target_screen_name=%s" % (friend, user)
        response = requests.get(url, auth)
        if response.status_code == 200:
            dataCompare = response.json()
            mutualfriends = dataCompare['relationship']['target']['followed_by']
            if mutualfriends == 'true':
                numFriends = numFriends + 1
        else:
            print "First request OK. Second request error code %s" % response.status_code
            break
    return numFriends
Your code is fine; there are only a couple of minor mistakes. Error code 400, "Bad Request", does not give you very concrete info, but it tells you that something in the way you've written the URL is wrong.
It should be:
url = "https://api.twitter.com/1.1/friendships/show.json?source_id=%s&target_screen_name=%s"%(friend, user.screen_name)
i.e. it should be user.screen_name instead of user.
Besides, the second positional argument of requests.get() is not auth, so you should always pass that argument by name,
response = requests.get(url, auth=auth)
which wasn't the case in your second call.
BTW, just curious, any reason why you are not using a library like twython?
Hope it helps.
EDIT after comments:
There was another minor mistake in the code. Note that your variable mutualfriends is already a boolean, so to check its value you can do it like this:
if mutualfriends:
    ...
BTW, to check the type of a variable,
print(type(mutualfriends))
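Putting those fixes together, the second half of getNumFriends would look roughly like this (just the loop; everything else stays as in the question):

for friend in userIDs:
    url = "https://api.twitter.com/1.1/friendships/show.json?source_id=%s&target_screen_name=%s" % (friend, user.screen_name)
    response = requests.get(url, auth=auth)   # auth passed by keyword
    if response.status_code == 200:
        dataCompare = response.json()
        mutualfriends = dataCompare['relationship']['target']['followed_by']
        if mutualfriends:                      # already a boolean
            numFriends = numFriends + 1
    else:
        print "First request OK. Second request error code %s" % response.status_code
        break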

Unable to upload file to Google Cloud Storage using Python - get a 404 error

I am trying to upload a file to Google Cloud Storage via a Python script but keep getting a 404 error! I am sure I am not referencing an unavailable resource. My code snippet is:
uploadFile = open("testUploadFile.txt", "r")
httpObj = httplib.HTTPSConnection("googleapis.com", timeout = 10)
httpObj.request("PUT", requestString, uploadFile, headerString)
uploadResponse = httpObj.getresponse()
print "Request string is:" + requestString
print "Return status:" + str(uploadResponse.status)
print "Reason:" + str(uploadResponse.reason)
Where
requestString = /upload/storage/v1beta2/b/bucket_id_12345678/o?uploadType=resumable&name=1%2FtestUploadFile.txt%7Calm_1391258335&upload_id=AbCd-1234
headerString = {'Content-Length': '47', 'Content-Type': 'text/plain'}
Any idea where I'm going wrong?
If you're doing a resumable upload, you'll need to start with a POST as described here: https://developers.google.com/storage/docs/json_api/v1/how-tos/upload#resumable
However, for a 47-byte object, you can use a simple upload, which will be much ... simpler. Instructions are here:
https://developers.google.com/storage/docs/json_api/v1/how-tos/upload#simple
It should be easy enough for you to replace the appropriate lines in your code with:
httpObj.request("POST", requestString, uploadFile, headerString)
requestString = /upload/storage/v1beta2/b/bucket_id_12345678/o?uploadType=media&name=1%2FtestUploadFile.txt%7Calm_1391258335
As an aside, in your code, headerString is actually a dict, not a string.
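Putting the two suggested changes together, a simple (non-resumable) upload could look roughly like the following sketch. The www.googleapis.com host and the Authorization header are assumptions; the original snippet does not show how the request is authenticated, but the JSON API needs an OAuth2 access token:

import httplib

requestString = "/upload/storage/v1beta2/b/bucket_id_12345678/o?uploadType=media&name=1%2FtestUploadFile.txt%7Calm_1391258335"
headers = {
    "Content-Length": "47",
    "Content-Type": "text/plain",
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",   # placeholder token
}

uploadFile = open("testUploadFile.txt", "r")
httpObj = httplib.HTTPSConnection("www.googleapis.com", timeout=10)
httpObj.request("POST", requestString, uploadFile, headers)
uploadResponse = httpObj.getresponse()
print "Return status:" + str(uploadResponse.status)
print "Reason:" + str(uploadResponse.reason)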

GAE - How can I combine the results of several asynchronous url fetches?

I have a Google AppEngine (in Python) application where I need to perform 4 to 5 url fetches, and then combine the data before I print it out to the response.
I can do this without any problems using a synchronous workflow, but since the URLs I am fetching are not related to or dependent on each other, doing it asynchronously would be ideal (and quickest).
I have read and re-read the documentation here, but I just can't figure out how to read the contents of each URL. I've also searched the web for a small example (which is really what I need). I have seen this SO question, but again, it doesn't mention anything about reading the contents of these individual asynchronous URL fetches.
Does anyone have any simple examples of how to perform 4 or 5 asynchronous url fetches with AppEngine? And then combine the results before I print it to the response?
Here is what I have so far:
rpcs = []
for album in result_object['data']:
    total_facebook_photo_count = total_facebook_photo_count + album['count']
    facebook_albumid_array.append(album['id'])

    #Get the photos in the photo album
    facebook_photos_url = 'https://graph.facebook.com/%s/photos?access_token=%s&limit=1000' % (album['id'], access_token)
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, facebook_photos_url)
    rpcs.append(rpc)

for rpc in rpcs:
    result = rpc.get_result()
    self.response.out.write(result.content)
However, it still looks like the line result = rpc.get_result() is forcing it to wait for the first request to finish, then the second, then the third, and so forth. Is there a way to simply put the results into variables as they are received?
Thanks!
In the example, text = result.content is where you get the content (body).
To do URL fetches in parallel, you could set them up, add them to a list, and check the results afterwards. Expanding on the example already mentioned, it could look something like:
from google.appengine.api import urlfetch

futures = []
for url in urls:
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, url)
    futures.append(rpc)

contents = []
for rpc in futures:
    try:
        result = rpc.get_result()
        if result.status_code == 200:
            contents.append(result.content)
        # ...
    except urlfetch.DownloadError:
        # Request timed out or failed.
        # ...
        pass

concatenated_result = '\n'.join(contents)
In this example, we collect the bodies of all the requests that returned status code 200 and concatenate them with a line break between them.
Or with ndb, my personal preference for anything async on GAE, something like:
@ndb.tasklet
def get_urls(urls):
    ctx = ndb.get_context()
    result = yield map(ctx.urlfetch, urls)
    contents = [r.content for r in result if r.status_code == 200]
    raise ndb.Return('\n'.join(contents))
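Calling a tasklet returns a future, so from inside a handler you would use it roughly like this (a sketch; assumes ndb is imported from google.appengine.ext):

from google.appengine.ext import ndb

# get_urls() returns a future; get_result() blocks until all fetches finish.
combined = get_urls(urls).get_result()
self.response.out.write(combined)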
I use this code (implemented before I learned about ndb tasklets):
while rpcs:
    rpc = UserRPC.wait_any(rpcs)
    result = rpc.get_result()
    # process result here
    rpcs.remove(rpc)

Python add data to post body

I am struggling to get a REST API POST to work with a vendor API and hope someone can give me a pointer.
The intent is to put a CLI command in the POST body and send it to a device, which returns the output.
The call looks like this (this works for all other calls, but this one is different because it posts to the body):
def __init__(self, host, username, password, sid, method, http_meth):
    self.host = host
    self.username = username
    self.password = password
    self.sid = sid
    self.method = method
    self.http_meth = http_meth

def __str__(self):
    self.url = 'http://' + self.host + '/rest/'
    self.authparams = urllib.urlencode({
        "session_id": self.sid,
        "method": self.method,
        "username": self.username,
        "password": self.password,
    })
    call = urllib2.urlopen(self.url.__str__(), self.authparams).read()
    return (call)
No matter how I have tried this, I can't get it to work properly. Here is an excerpt from the API docs that explains how to use this method:
To process these APIs, place your CLI commands in the HTTP post buffer, and then place the method name, session ID, and other parameters in the URL.
Can anyone give me an idea of how to do this properly? I am not a developer and am trying to learn this correctly. For example, how would I send the command "help" in the POST body?
Thanks for any guidance
OK, this was ridiculously simple and I was over-thinking it. I find that I can sometimes look at a problem from a much higher level than it really needs and waste time. Anyway, this is how it should work:
def cli(self, method):
    self.url = ("http://" + str(self.host) + "/rest//?method=" + str(method) +
                "&username=" + str(self.username) + "&password=" + str(self.password) +
                "&enable_password=test&session_id=" + str(self.session_id))
    data = "show ver"
    try:
        req = urllib2.Request(self.url)
        response = urllib2.urlopen(req, data)
        result = response.read()
        print result
    except urllib2.URLError, e:
        print e.reason
The CLI commands are just placed in the POST buffer and not encoded...
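For comparison, the same call with the requests library would look roughly like this (a sketch; the URL parameters come from the snippet above, and requests is not part of the original code):

import requests

url = ("http://" + host + "/rest//?method=" + method +
       "&username=" + username + "&password=" + password +
       "&enable_password=test&session_id=" + session_id)

# The raw CLI command goes in the POST body, unencoded.
response = requests.post(url, data="show ver")
print response.text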

How can I un-shorten a URL using python?

I have seen this thread already - How can I unshorten a URL?
My issue with the accepted answer (the one using the unshort.me API) is that I am focusing on unshortening YouTube links. Since unshort.me is used so heavily, almost 90% of the results come back with captchas, which I am unable to resolve.
So far I am stuck with using:
def unshorten_url(url):
    resolvedURL = urllib2.urlopen(url)
    print resolvedURL.url

    #t = Test()
    #c = pycurl.Curl()
    #c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
    #c.setopt(c.WRITEFUNCTION, t.body_callback)
    #c.perform()
    #c.close()
    #dom = xml.dom.minidom.parseString(t.contents)
    #resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue

    return resolvedURL.url
Note: the commented-out code is what I tried when using the unshort.me service, which was returning captcha links.
Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?
A one-line function, using the requests library, and yes, it handles chains of redirects.
import requests

def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url
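For example, using the bit.ly link that shows up in the telnet session further down:

print(unshorten_url("http://bit.ly/cXEInp"))   # prints the final URL after all redirects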
Use the best rated answer (not the accepted answer) in that question:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location'))  # changed to process chains of short urls
    else:
        return url
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:
A short link is a key into somebody else's database; you can't expand the link without querying the database
Now to your question.
Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?
The more efficient way is to not close the connection, but to keep it open in the background, by using HTTP's Connection: keep-alive.
After a small test, unshorten.me seems to take the HEAD method into account and doing a redirect to itself:
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me
HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.
Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.
You can get away with keeping a number of connections to the unshortening service alive, roughly equal to the number of concurrent connections your own service receives.
You could manage these connections in a pool. That's the closest you can get, short of tweaking your kernel's TCP/IP stack.
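As an illustration of the idea (not part of the answer above), the requests library's Session object keeps connections alive and pools them, so repeated lookups skip the TCP handshake:

import requests

# A Session reuses TCP connections (HTTP keep-alive) and maintains a small
# connection pool under the hood.
session = requests.Session()

def unshorten_url(url):
    return session.head(url, allow_redirects=True).url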
Here is some source code that takes into account most of the useful corner cases:
set a custom Timeout.
set a custom User Agent.
check whether we have to use an http or https connection.
resolve the input URL recursively and avoid ending up in a loop.
The source code is on GitHub: https://github.com/amirkrifa/UnShortenUrl
comments are welcome ...
import logging
logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10

class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        import urlparse
        import httplib
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except:
                import traceback
                traceback.print_exc()  # was print_exec(), which does not exist
                return url

            logging.info('Response status: %d' % response.status)
            if response.status/100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s' % (red_url, previous_url))
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except:
            import traceback
            traceback.print_exc()
            return None
import requests
short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)
