Unexpected behaviour with urllib in Python

My system is not behind any proxy.
params = urllib.urlencode({'search':"August Rush"})
f = urllib.urlopen("http://www.thepiratebay.org/search/query", params)
This goes into an infinite loop (or just hangs). I can obviously get rid of this, use FancyURLopener, and build the query myself rather than passing parameters. But I think the way I'm doing it now is a better and cleaner approach.
Edit: This turned out to be a networking problem: my Ubuntu workstation was configured to use a different proxy. I had to make certain changes and then it worked. Thank you!

The posted code works fine for me, with Python 2.7.2 on Windows.
Have you tried using an HTTP debugging tool, like Fiddler2, to see the actual conversation between your program and the site?
If you run Fiddler2 on port 8888 on localhost, you can do this to see the request and response:
import urllib
proxies = {"http": "http://localhost:8888"}
params = urllib.urlencode({'search':"August Rush"})
f = urllib.urlopen("http://www.thepiratebay.org/search/query", params, proxies)
print len(f.read())

This works for me:
import urllib
params = urllib.urlencode({'q': "August Rush", 'page': '0', 'orderby': '99'})
f = urllib.urlopen("http://www.thepiratebay.org/s/", params)
with open('text.html', 'w') as ff:
    ff.write('\n'.join(f.readlines()))
I opened http://www.thepiratebay.org in Google Chrome with the network inspector enabled. I put "August Rush" into the search field and pressed 'Search'. Then I analyzed the headers sent and wrote the code above.
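If you want to double-check what your own script actually receives, you can also inspect the response object; a small optional sketch using the same request as above:
import urllib
params = urllib.urlencode({'q': "August Rush", 'page': '0', 'orderby': '99'})
f = urllib.urlopen("http://www.thepiratebay.org/s/", params)
print f.geturl()     # final URL after any redirects
print f.info()       # response headers that came back
print len(f.read())  # size of the returned HTML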

what is the proper way to use proxies with requests in python

Requests is not honoring the proxies flag.
There is something I am missing about making a request over a proxy with the Python requests library.
If I enable the OS system proxy it works, but if I make the request with just the requests module's proxies setting, the remote machine does not see the proxy set in requests but sees my real IP; it is as if no proxy was set.
The example below shows this effect. At the time of this post the proxy below is alive, but any working proxy should replicate the effect.
import requests
proxy = {
    'http:': 'https://143.208.200.26:7878',
    'https:': 'http://143.208.200.26:7878'
}
data = requests.get(url='http://ip-api.com/json', proxies=proxy).json()
print('Ip: %s\nCity: %s\nCountry: %s' % (data['query'], data['city'], data['country']))
I also tried changing the proxy_dict format:
proxy = {
    'http:': '143.208.200.26:7878',
    'https:': '143.208.200.26:7878'
}
But it still has no effect.
I am using:
- Windows 10
- Python 3.9.6
- urllib 1.25.8
Many thanks in advance for any response to help sort this out.
OK, it is working now!
Credit for solving this goes to Olvin Rogh. Thanks, Olvin, for your help and for pointing out my problem: I was adding a colon ":" inside the keys.
This code is working now.
import json
import requests
PROXY = {'https': 'https://143.208.200.26:7878',
         'http': 'http://143.208.200.26:7878'}
with requests.Session() as session:
    session.proxies = PROXY
    r = session.get('http://ip-api.com/json')
    print(json.dumps(r.json(), indent=2))
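The same fix (scheme keys without the trailing colon) should also work for the original one-off call; a minimal sketch reusing the proxy address and URL from above:
import requests
proxy = {'http': 'http://143.208.200.26:7878',    # no colon inside the keys
         'https': 'https://143.208.200.26:7878'}
data = requests.get('http://ip-api.com/json', proxies=proxy).json()
print('Ip: %s\nCity: %s\nCountry: %s' % (data['query'], data['city'], data['country']))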

Logging HTTP requests in Robot Framework

I have been struggling to find much information on this, so I have turned here for help.
I am running UI tests of a web app using Robot Framework. When a test fails I want a log of the HTTP requests so I can look back and see what failed, e.g. things not loading, 500 errors, etc.
So far I haven't managed to find anything within Robot Framework or Selenium.
Another option is to see if there is a Python library for logging this sort of thing, or whether it would be a reasonable task to create one.
I have also looked into using AutoIt to drive the browser's internal network logging tools, but using those is a whole test of its own and I am not sure how well it would work. Surely I am not the first person to want this functionality?
I have continued to look into this and found that a viable option may be a packet sniffer using pcapy. However, I have no idea about network programming or how I would process packets to get only POST and GET requests and their responses. Any help would be much appreciated.
Cheers
Selenium only emulates user behaviour, so it does not help you here. You could use a proxy that logs all the traffic and lets you examine it. BrowserMob Proxy lets you do that. See the Create Webdriver keyword from Selenium2Library for how to configure a proxy for your browser.
This way you can ask the proxy to return the traffic after you notice a failure in your test.
I have implemented the same thing using BrowserMob Proxy. It captures network traffic based on the test requirements.
The first function, CaptureNetworkTraffic(), opens the browser with the configuration provided in the parameters.
The second function, Parse_Request_Response(), reads the HAR file produced by the first function and returns the respective network data based on the parameters passed.
e.g.
print Parse_Request_Response("g:\\har.txt","google.com",True,True,False,False,False)
In this case, it checks for URLs containing "google.com" and returns the response content and request headers for that URL.
from browsermobproxy import Server
from selenium import webdriver
import json
def CaptureNetworkTraffic(url,server_ip,headers,file_path):
    '''
    This function can be used to capture network traffic from the browser.
    Using it we can capture headers/cookies/HTTP calls made from the browser.
    url       - page URL
    server_ip - host/IP to which the remapped URLs should point
    headers   - a dictionary of the headers to be set
    file_path - file in which the HAR gets stored
    '''
    port = {'port': 9090}
    server = Server("G:\\browsermob\\bin\\browsermob-proxy", port)  # path to the BrowserMob Proxy binary
    server.start()
    proxy = server.create_proxy()
    proxy.remap_hosts("www.example.com", server_ip)
    proxy.remap_hosts("www.example1.com", server_ip)
    proxy.remap_hosts("www.example2.com", server_ip)
    proxy.headers(headers)
    profile = webdriver.FirefoxProfile()
    profile.set_proxy(proxy.selenium_proxy())
    driver = webdriver.Firefox(firefox_profile=profile)
    new = {'captureHeaders': 'True', 'captureContent': 'True'}
    proxy.new_har("google", new)
    driver.get(url)
    har = proxy.har  # grab the HAR JSON blob before the proxy server is stopped
    server.stop()
    driver.quit()
    file1 = open(file_path, 'w')
    json.dump(har, file1)
    file1.close()
def Parse_Request_Response(filename,url,response=False,request_header=False,request_cookies=False,response_header=False,response_cookies=False):
    resp = {}
    har_data = open(filename, 'rb').read()
    har = json.loads(har_data)
    for i in har['log']['entries']:
        if url in i['request']['url']:
            resp['request'] = i['request']['url']
            if response:
                resp['response'] = i['response']['content']
            if request_header:
                resp['request_header'] = i['request']['headers']
            if request_cookies:
                resp['request_cookies'] = i['request']['cookies']
            if response_header:
                resp['response_header'] = i['response']['headers']
            if response_cookies:
                resp['response_cookies'] = i['response']['cookies']
    return resp
if __name__ == "__main__":
    headers = {"User-Agent": "Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3"}
    CaptureNetworkTraffic("http://www.google.com", "192.168.1.1", headers, "g:\\har.txt")
    print Parse_Request_Response("g:\\har.txt", "google.com", False, True, False, False, False)

Newbie: update changing IP using urlopen with https and do login

This is a newbie problem with Python; advice is much appreciated.
no-ip.com provides an easy way to update a computer's changing IP address: simply open the URL
http://user:password@dynupdate.no-ip.com/nic/update?hostname=my.host.name
...both http and https work when entered in Firefox. I tried to implement that in a script residing in "/etc/NetworkManager/dispatcher.d", to be used by NetworkManager on a recent version of Ubuntu.
What works is the python script:
from urllib import urlopen
urlopen("http://user:password@dynupdate.no-ip.com/nic/update?hostname=my.host.name")
What I want to have is the same with "https", which does not work as easily. Could anyone, please,
(1) show me what the script should look like for https,
(2) give me some keywords which I can use to learn about this, and
(3) perhaps even explain why it does not work any more when the script is changed to use "urllib2":
from urllib2 import urlopen
urlopen("http://user:password@dynupdate.no-ip.com/nic/update?hostname=my.host.name")
Thank you!
The user:password part isn't in the actual URL, but a shortcut for HTTP authentication. The browser's URL parsing lib will filter them out. In urllib2, you want to
import base64, urllib2
user, password = 'john_smith', '123456'
request = urllib2.Request('https://dynupdate.no-ip.com/nic/update?hostname=my.host.name')
auth = base64.b64encode(user + ':' + password)
request.add_header('Authorization', 'Basic ' + auth)
urllib2.urlopen(request)
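As an alternative, here is a sketch (untested, with the same placeholder credentials) that lets urllib2's password manager and basic-auth handler answer the 401 challenge instead of building the header by hand:
import urllib2
passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, 'https://dynupdate.no-ip.com', 'john_smith', '123456')
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(passmgr))
print opener.open('https://dynupdate.no-ip.com/nic/update?hostname=my.host.name').read()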

urllib2.urlopen through proxy fails after a few calls

Edit: after much fiddling, it seems urlgrabber succeeds where urllib2 fails, even when telling it to close the connection after each file. It seems like there might be something wrong with the way urllib2 handles proxies, or with the way I use it!
Anyways, here is the simplest possible code to retrieve files in a loop:
import urlgrabber
for i in range(1, 100):
    url = "http://www.iana.org/domains/example/"
    urlgrabber.urlgrab(url, proxies={'http': 'http://<user>:<password>@<proxy url>:<proxy port>'}, keepalive=1, close_connection=1, throttle=0)
Hello all!
I am trying to write a very simple python script to grab a bunch of files via urllib2.
This script needs to work through the proxy at work (my issue does not exist if grabbing files on the intranet, i.e. without the proxy).
Said script fails after a couple of requests with "HTTPError: HTTP Error 401: basic auth failed". Any idea why that might be? It seems the proxy is rejecting my authentication, but why? The first couple of urlopen requests went through correctly!
Edit: Adding a sleep of 10 seconds between requests to avoid some kind of throttling that might be done by the proxy did not change the results.
Here is a simplified version of my script (with identifying information stripped, obviously):
import urllib2
passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
authinfo = urllib2.ProxyBasicAuthHandler(passmgr)
proxy_support = urllib2.ProxyHandler({"http": "<proxy http address>"})
opener = urllib2.build_opener(authinfo, proxy_support)
urllib2.install_opener(opener)
for i in range(100):
    with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
        f = urllib2.urlopen("http://www.iana.org/domains/example/")
        outfile.write(f.read())
Thanks in advance !
You can minimize the number of connections by using the keepalive handler from the urlgrabber module.
import urllib2
from keepalive import HTTPHandler
keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(keepalive_handler)
urllib2.install_opener(opener)
fo = urllib2.urlopen('http://www.python.org')
I am unsure whether this will work correctly with your proxy setup.
You may have to hack the keepalive module.
The proxy might be throttling your requests. I guess it thinks you look like a bot.
You could add a timeout, and see if that gets you through.
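A minimal sketch of that suggestion, assuming the timeout goes on each urlopen call (the 30-second value is arbitrary) and the proxy opener from the question is still installed:
import urllib2
for i in range(100):
    # per-request timeout; the globally installed proxy opener still applies
    f = urllib2.urlopen("http://www.iana.org/domains/example/", timeout=30)
    data = f.read()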

How to stay alive in HTTP/1.1 using python urllib

For now I am doing this: (Python3, urllib)
import urllib.request
url = 'someurl'
headers = (('HOST', 'somehost'),
           ('Connection', 'keep-alive'),
           ('Accept-Encoding', 'gzip,deflate'))
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
for h in headers:
    opener.addheaders.append(h)
data = 'some login data'  # username, pw etc.
opener.open('somesite/login.php', data)
res = opener.open(someurl)
data = res.read()
# ... some stuff here ...
res1 = opener.open(someurl2)
data = res1.read()
# etc.
What is happening is this:
I keep getting gzipped responses from the server and I stay logged in (I am fetching some content which is not available if I am not logged in), but I think the connection is dropping between every opener.open request.
I think that because connecting is very slow, and it seems like there is a new connection every time. Two questions:
a) How do I test whether the connection is in fact staying alive or dying?
b) How do I make it stay alive between requests for other URLs?
Take care :)
This will be a very delayed answer, but:
You should look at urllib3. It is for Python 2.x, but you'll get the idea when you see its README document.
And yes, urllib by default doesn't keep connections alive. I'm now implementing urllib3 for Python 3 so it stays in my toolbag :)
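For reference, a rough sketch of how urllib3's connection pooling is typically used (placeholder URLs, not the poster's code); requests made through the same PoolManager can reuse the underlying connection:
import urllib3
http = urllib3.PoolManager()
r1 = http.request('GET', 'http://somesite/page1')  # opens a connection
r2 = http.request('GET', 'http://somesite/page2')  # may reuse the pooled connection
print(r1.status, r2.status)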
In case you didn't know yet, python-requests offers a keep-alive feature, thanks to urllib3.
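A minimal sketch of the same login-then-fetch flow with python-requests (URLs and form fields are placeholders); a Session pools connections and keeps them alive automatically:
import requests
session = requests.Session()  # connections are pooled and kept alive across requests
session.post('http://somesite/login.php', data={'username': 'user', 'password': 'pw'})
res = session.get('http://somesite/someurl')
data = res.content  # requests also transparently decodes gzip/deflate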
