EDIT: I changed the code using the links from the answer, but it still doesn't work!
Why doesn't this work? When I run it, it takes a long time and never finishes!
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName': 'USER',
          'inUserPass': 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Encoding','gzip, deflate')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
#user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
#headers = { 'User-Agent' : user_agent }
response = urllib2.urlopen(req)
page = response.read()
print page
The remote server (the one at www.locationary.com) is waiting for the content of your HTTP POST request, as announced by the Content-Type and Content-Length headers. Since you never actually send that awaited data, the remote server keeps waiting, and so does read(), until you do.
I need to know how to send the content of my HTTP POST request.
Well, you need to actually send some data in the request. See:
urllib2 - The Missing Manual
How do I send a HTTP POST value to a (PHP) page using Python?
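For context, the body the server is waiting for is nothing more than the URL-encoded form data, i.e. the string that urlencode builds. A minimal sketch, shown with Python 3's urllib.parse for reproducibility (the Python 2 spelling is urllib.urlencode):

```python
from urllib.parse import urlencode  # urllib.urlencode in Python 2

# The same form fields as in the question; credentials are placeholders
values = {'inUserName': 'USER', 'inUserPass': 'PASSWORD'}
body = urlencode(values)
print(body)  # inUserName=USER&inUserPass=PASSWORD
```

This string is what must travel in the request body; passing it as the second argument to urllib2.Request is what makes the request a POST.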
Final, "working" version:
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName': 'USER',
          'inUserPass': 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
response = urllib2.urlopen(req)
page = response.read()
print page
Don't explicitly set the Content-Length header.
Remove the req.add_header('Accept-Encoding','gzip, deflate') line, so that the response doesn't have to be decompressed (or, as an exercise left to the reader, ungzip it yourself).
Related
I'm trying to request data from a federal agency in Germany. I first have to send a POST request to an HTML form and afterwards request a URL with a CSV download.
After opening a requests.Session(), sending the POST request works with no problems, though I have to set a header with the user agent.
When afterwards trying to get the CSV with requests.get, I need to supply the header again (or I will be blocked), as well as the JSESSIONID, so the website knows which data I am requesting (from filling in the HTML form earlier).
The problem I'm facing is that on my GET request, when I set the header with the user agent, my JSESSIONID changes. When I don't set the header, the JSESSIONID remains the same, but I get blocked for not providing a User-Agent.
What am I doing wrong?
As you can test, when removing headers=headers from the line r2 = s.get(csv_url, headers=headers), the JSESSIONID stays the same. But without the headers the website blocks my request.
import requests
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'}
api_url = "https://foerderportal.bund.de/"
url_post = api_url + "foekat/jsp/SucheAction.do?actionMode=searchlist"
csv_url = api_url + "foekat/jsp/SucheAction.do?actionMode=print&presentationType=csv"
# Sending the HTML form
payload = {"suche.bundeslandSuche[0]": "Hessen"}
r = s.post(url_post, data=payload, headers=headers)
print(s.cookies)
# Requesting the CSV
r2 = s.get(csv_url, headers=headers)
print(s.cookies)
# Writing the file
with open("test.csv", "w") as file:
    file.write(r2.text)
The default user agent is python-requests/{version}.
Some websites block access from non-browser user agents to prevent scraping.
You can define a default user agent for your session like this:
import requests
s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
api_url = "https://foerderportal.bund.de/"
url_post = api_url + "foekat/jsp/SucheAction.do?actionMode=searchlist"
csv_url = api_url + "foekat/jsp/SucheAction.do?actionMode=print&presentationType=csv"
# Sending the HTML form
payload = {"suche.bundeslandSuche[0]": "Hessen"}
r = s.post(url_post, data=payload)
print(s.cookies)
# Requesting the CSV
r2 = s.get(csv_url)
print(s.cookies)
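As a quick sanity check (assuming the requests package is installed), you can inspect the session's headers before and after overriding the User-Agent; no request has to be sent:

```python
import requests

s = requests.Session()
# Out of the box, the session announces itself as python-requests/<version>
print(s.headers['User-Agent'])

# Overriding it once applies to every request made through this session
s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
print(s.headers['User-Agent'])
```

Because the header is attached at the session level, the POST and the follow-up GET go out with the same User-Agent, so the server has no reason to hand out a new JSESSIONID between them.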
I want to send a value for "User-Agent" while requesting a webpage using Python Requests. I am not sure if it is okay to send this as part of the header, as in the code below:
debug = {'verbose': sys.stderr}
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers = user_agent, config=debug)
The debug information isn't showing the headers being sent during the request.
Is it acceptable to send this information in the header? If not, how can I send it?
The user-agent should be specified as a field in the header.
Here is a list of HTTP header fields, and you'd probably be interested in request-specific fields, which includes User-Agent.
If you're using requests v2.13 and newer
The simplest way to do what you want is to create a dictionary and specify your headers directly, like so:
import requests
url = 'SOME URL'
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.example'  # This is another valid field
}
response = requests.get(url, headers=headers)
If you're using requests v2.12.x and older
Older versions of requests clobbered default headers, so you'd want to do the following to preserve default headers and then add your own to them.
import requests
url = 'SOME URL'
# Get a copy of the default headers that requests would use
headers = requests.utils.default_headers()
# Update the headers with your custom ones
# You don't have to worry about case-sensitivity with
# the dictionary keys, because default_headers uses a custom
# CaseInsensitiveDict implementation within requests' source code.
headers.update(
    {
        'User-Agent': 'My User Agent 1.0',
    }
)
response = requests.get(url, headers=headers)
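The case-insensitivity mentioned in the comments can be observed directly (assuming requests is installed): 'user-agent' and 'User-Agent' address the same entry in the headers mapping.

```python
import requests

# Start from the library's default headers (a CaseInsensitiveDict)
headers = requests.utils.default_headers()

# Updating with a lowercase key still replaces the default User-Agent
headers.update({'user-agent': 'My User Agent 1.0'})

# Lookups are case-insensitive, so both spellings return the custom value
print(headers['User-Agent'])
print(headers['USER-AGENT'])
```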
It's more convenient to use a session, this way you don't have to remember to set headers each time:
session = requests.Session()
session.headers.update({'User-Agent': 'Custom user agent'})
session.get('https://httpbin.org/headers')
By default, session also manages cookies for you. In case you want to disable that, see this question.
This will send the request as if it came from a browser:
import requests
url = 'https://Your-url'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
response = requests.get(url.strip(), headers=headers, timeout=10)
I'm trying to set the user agent for my urllib request:
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cj),
    urllib.request.HTTPRedirectHandler(),
    urllib.request.ProxyHandler({'http': proxy})
)
and finally:
response3 = opener.open("https://www.google.com:443/search?q=test", timeout=timeout_value).read().decode("utf-8")
What would be the best way to set the user-agent header to
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36
With urllib we have two options, as far as I know.
build_opener returns an OpenerDirector object, which has an addheaders attribute. We can change the user agent and other headers through that attribute.
opener.addheaders = [('User-Agent', 'My User-Agent')]
url = 'http://httpbin.org/user-agent'
r = opener.open(url, timeout=5)
text = r.read().decode("utf-8")
Alternatively, we can install the OpenerDirector object as the global opener with install_opener and use urlopen to submit the request. Now we can use Request to set the headers.
urllib.request.install_opener(opener)
url = 'http://httpbin.org/user-agent'
headers = {'user-agent': "My User-Agent"}
req = urllib.request.Request(url, headers=headers)
r = urllib.request.urlopen(req, timeout=5)
text = r.read().decode("utf-8")
Personally, I prefer the second method because it is more consistent. Once we install the opener all requests will have the same handlers, and we can continue using urllib the same way. However, if you don't want to use those handlers for all requests you should choose the first method and use addheaders to set headers for a specific OpenerDirector object.
With requests things are simpler.
We can use the session.headers attribute if we want to change the user agent or other headers for all requests,
s = requests.Session()
s.headers['user-agent'] = "My User-Agent"
r = s.get(url, timeout=5)
or use the headers parameter if we want to set headers for a specific request only.
headers = {'user-agent': "My User-Agent"}
r = requests.get(url, headers=headers, timeout=5)
I use the following code to retrieve a web page.
import requests
payload = {'name': temp} #I extract temp from another page.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:49.0) Gecko/20100101 Firefox/49.0','Accept': 'text/html, */*; q=0.01','Accept-Language': 'en-US,en;q=0.5', 'X-Requested-With': 'XMLHttpRequest' }
full_url = url.rstrip() + '/test/log?'
r = requests.get(full_url, params=payload, headers=headers, stream=True)
for line in r.iter_lines():
    if line:
        print line
However, for some reason the HTTP response is missing the text inside the tags.
I found out that if I send the request to Burp, intercept it, and wait for three seconds before forwarding it, then I get the complete HTML page containing the text inside the tags.
I still could not find the cause. Ideas?
From requests documentation:
By default, when you make a request, the body of the response is
downloaded immediately. You can override this behaviour and defer
downloading the response body until you access the Response.content
attribute with the stream parameter:
Body Content Workflow
In other words, try removing stream=True from your requests.get()
or
You will have all the content when you access r.content, where r is the response.
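To see the difference, here is a small self-contained sketch (assuming requests is installed) run against a throwaway local HTTP server, so nothing external is contacted. With a plain get() the body arrives eagerly; with stream=True it only arrives once you iterate or touch r.content:

```python
import http.server
import threading

import requests  # assumed to be installed

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        payload = b'line1\nline2\n'
        self.send_response(200)
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = http.server.HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

# Without stream=True the body is downloaded immediately
r = requests.get(url)
print(r.text)

# With stream=True, the data is pulled lazily as you iterate
r2 = requests.get(url, stream=True)
lines = [ln for ln in r2.iter_lines() if ln]
print(lines)

server.shutdown()
```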
I am testing an application by sending some POST requests. I want to test its behavior when certain headers are missing from the request, to verify that it generates the correct error codes.
To do this, my code is as follows.
header = {'Content-type': 'application/json'}
data = "hello world"
request = urllib2.Request(url, data, header)
f = urllib2.urlopen(request)
response = f.read()
The problem is that urllib2 adds its own headers, like Content-Length and Accept-Encoding, when it sends the POST request. I don't want urllib2 to add any headers beyond the ones I specified in the headers dict above. Is there a way to do that? I tried setting the other headers I don't want to None, but they still go out with empty values as part of the request, which I don't want.
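The urllib2 machinery will always add headers such as Host and Content-Length on its own. If you genuinely need byte-level control over the header set, one option (a sketch, not the only way) is to drop down to the stdlib http.client (httplib in Python 2) and use putrequest/putheader, verified here against a throwaway local server:

```python
import http.client
import http.server
import threading

received = {}

class Handler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        # Record exactly which headers the client sent
        received.update(self.headers.items())
        self.rfile.read(int(self.headers.get('Content-Length', 0)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = http.server.HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

body = b'hello world'
conn = http.client.HTTPConnection('127.0.0.1', server.server_port)
# skip_host / skip_accept_encoding suppress the headers the library
# would otherwise add; everything else is now up to us
conn.putrequest('POST', '/', skip_host=True, skip_accept_encoding=True)
conn.putheader('Host', '127.0.0.1')
conn.putheader('Content-Type', 'application/json')
conn.putheader('Content-Length', str(len(body)))
conn.endheaders(body)
conn.getresponse().read()
conn.close()
server.shutdown()

print(sorted(received))  # only the headers we explicitly chose
```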
The headers argument takes a dictionary; the example below uses a Chrome user agent. For all standard and some non-standard header fields, take a look here. You also need to encode your data with urllib, not urllib2. This is all mentioned in the Python documentation here.
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()