I use the following code to retrieve a web page.
import requests

payload = {'name': temp}  # temp is extracted from another page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html, */*; q=0.01',
    'Accept-Language': 'en-US,en;q=0.5',
    'X-Requested-With': 'XMLHttpRequest',
}
full_url = url.rstrip() + '/test/log?'
r = requests.get(full_url, params=payload, headers=headers, stream=True)

for line in r.iter_lines():
    if line:
        print(line)
However, for some reason the HTTP response is missing the text inside the tags.
I found out that if I send the request through Burp, intercept it, and wait about 3 seconds before forwarding it, I get the complete HTML page, including the text inside the tags.
I still could not find the cause. Any ideas?
From the requests documentation:
By default, when you make a request, the body of the response is
downloaded immediately. You can override this behaviour and defer
downloading the response body until you access the Response.content
attribute with the stream parameter:
Body Content Workflow
In other words, try removing stream=True from your requests.get() call,
or
access r.content, where r is the response; you will then have the full body.
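For example, either of these should give you the full body (a minimal sketch, reusing full_url, payload, and headers from the question):

import requests

# Option 1: drop stream=True so the body is downloaded immediately
r = requests.get(full_url, params=payload, headers=headers)
print(r.text)

# Option 2: keep stream=True, but touch r.content to force the full download
r = requests.get(full_url, params=payload, headers=headers, stream=True)
print(r.content)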
I'm trying to use Python requests to get the results of this URL: Target URL. As you can see, the page updates via JavaScript when you push the "Consultar" button (leaving the fields empty), so a plain POST is not working.
I'm trying this code:
import requests
URL = "https://www.cmfchile.cl/institucional/mercados/entidad.php?mercado=V&rut=61808000&tipoentidad=RVEMI&control=svs&pestania=25"
page = requests.post(URL, headers=headers)  # headers: a User-Agent dict defined elsewhere
print(page.text)
Does anyone know any other way or how I could solve this?
This works:
import requests
url = "https://www.cmfchile.cl/institucional/mercados/entidad.php?mercado=V&rut=61808000&tipoentidad=RVEMI&control=svs&pestania=25"
# Form body exactly as the browser sends it (already URL-encoded)
payload = 'dd=%23%23%23&mm=%23%23%23&aa=%23%23%23&dd2=%23%23%23&mm2=%23%23%23&aa2=%23%23%23&dias=&entidad=AGUAS%2BANDINAS%2BS.A.&rut=61808000%2B&formulario=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0',
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
The process I followed was to:
1. Copy the request as cURL in Firefox.
2. Import it into Postman.
3. Export it as Python code from Postman.
I'm trying to request data from a federal agency in Germany. I first have to send a POST request to an HTML form and afterwards request a URL with a CSV download.
After opening a requests.Session(), sending the POST request works with no problems, though I have to set a header with the user agent.
When afterwards trying to get the CSV with requests.get, I need to supply the header again (or I will be blocked), as well as the JSESSIONID, so the website knows which data I am requesting (from filling in the HTML form earlier).
The problem I'm facing is that on my GET request, when I set the header with the user agent, my JSESSIONID changes. When I don't set the header, the JSESSIONID remains the same, but I'm blocked for not providing a User-Agent.
What am I doing wrong?
As you can test, when removing headers=headers from the line r2 = s.get(csv_url, headers=headers), the JSESSIONID stays the same, but without the headers the website blocks my request.
import requests
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'}
api_url = "https://foerderportal.bund.de/"
url_post = api_url + "foekat/jsp/SucheAction.do?actionMode=searchlist"
csv_url = api_url + "foekat/jsp/SucheAction.do?actionMode=print&presentationType=csv"
# Sending the HTML form
payload = {"suche.bundeslandSuche[0]": "Hessen"}
r = s.post(url_post, data=payload, headers=headers)
print(s.cookies)
# Requesting the CSV
r2 = s.get(csv_url, headers=headers)
print(s.cookies)
# Writing the file
with open("test.csv", "w") as file:
    file.write(r2.text)
The default user agent is python-requests/{version}. Some websites block requests from non-browser user agents to prevent scraping.
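You can check what your requests version sends by default:

import requests
print(requests.utils.default_headers()['User-Agent'])  # e.g. 'python-requests/2.28.1'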
You can define a default user-agent for your session like this:
import requests
s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
api_url = "https://foerderportal.bund.de/"
url_post = api_url + "foekat/jsp/SucheAction.do?actionMode=searchlist"
csv_url = api_url + "foekat/jsp/SucheAction.do?actionMode=print&presentationType=csv"
# Sending the HTML form
payload = {"suche.bundeslandSuche[0]": "Hessen"}
r = s.post(url_post, data=payload)
print(s.cookies)
# Requesting the CSV
r2 = s.get(csv_url)
print(s.cookies)
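Because the session now carries both the User-Agent header and the JSESSIONID cookie across requests, writing the CSV works exactly as in the question:

# Writing the file
with open("test.csv", "w") as file:
    file.write(r2.text)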
I want to send a value for "User-agent" while requesting a web page using Python requests. I am not sure if it is okay to send this as part of the headers, as in the code below:
import sys, requests

debug = {'verbose': sys.stderr}
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=user_agent, config=debug)  # config= existed only in very old requests versions
The debug information isn't showing the headers being sent during the request.
Is it acceptable to send this information in the header? If not, how can I send it?
The user-agent should be specified as a field in the header.
Here is a list of HTTP header fields, and you'd probably be interested in the request-specific fields, which include User-Agent.
If you're using requests v2.13 and newer
The simplest way to do what you want is to create a dictionary and specify your headers directly, like so:
import requests
url = 'SOME URL'
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.example',  # This is another valid field
}
response = requests.get(url, headers=headers)
If you're using requests v2.12.x and older
Older versions of requests clobbered default headers, so you'd want to do the following to preserve default headers and then add your own to them.
import requests
url = 'SOME URL'
# Get a copy of the default headers that requests would use
headers = requests.utils.default_headers()
# Update the headers with your custom ones
# You don't have to worry about case-sensitivity with
# the dictionary keys, because default_headers uses a custom
# CaseInsensitiveDict implementation within requests' source code.
headers.update(
    {
        'User-Agent': 'My User Agent 1.0',
    }
)
response = requests.get(url, headers=headers)
It's more convenient to use a session; this way you don't have to remember to set headers each time:
session = requests.Session()
session.headers.update({'User-Agent': 'Custom user agent'})
session.get('https://httpbin.org/headers')
By default, the session also manages cookies for you. If you want to disable that, see this question.
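Note that headers passed to an individual request are merged on top of the session headers rather than replacing them, so you can still override a field for a single call:

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Custom user agent'})
# The per-request value wins for this call only; the session default is untouched
r = session.get('https://httpbin.org/headers', headers={'User-Agent': 'One-off agent'})
print(r.json())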
This will send the request as if it came from a browser:
import requests
url = 'https://Your-url'
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
response= requests.get(url.strip(), headers=headers, timeout=10)
I've been trying to learn and understand how HTTP requests are made between client and server, so I decided to test this out by sending a simple POST request to geeksforgeeks' online IDE. Using requests, I get the response
{'status': 'SUCCESS', 'sid': 'df1acaffefacb25dfef2c7cf66022925'}
with the following code:
import requests
import time
url = "https://ide.geeksforgeeks.org/main.php"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'}
code = """
print("hello")
"""
data = {
    'lang': 'Python3',
    'code': code,
    'input': '0',
    'save': 'false'
}
r = requests.post(url, data=data, headers=headers)
print(r.json())
Next, I know that I have to send the sid parameter back to https://ide.geeksforgeeks.org/submissionResult.php in order to get a successful response. However, when I run the following code,
requesttype = {
    'sid': r.json()['sid'],
    'requestType': 'fetchResults'
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'}
url2 = "https://ide.geeksforgeeks.org/submissionResult.php"
session = requests.Session()
outcome = session.post(url2, data=requesttype, headers=headers)
It returns {'status': 'IN-QUEUE'}. Upon analyzing the Network tab in my browser, it seems the online IDE has a queue system and only parses incoming requests server-side through a queue-like mechanism. Hence I would like to know how to obtain a success response by "communicating" with their queue system.
A successful response looks like this,
{"valid":"1","output":"testing\n","time":"0.02","compResult":"S","memory":"0.125","hash":"79c7a40c6f6b36f1dfc23119f27ba66e_Tester_U16","status":"SUCCESS"}
My hunch is telling me that the website is detecting me as a bot, and that I should perhaps use something else, like Selenium.
When I run the code on my computer, it works just fine. I believe the problem could be something to do with the SSL certificate that you might want to initialize.
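If what you are seeing is just the queue delay, one approach is to poll submissionResult.php until the status changes from IN-QUEUE. A minimal sketch, reusing session, url2, requesttype, and headers from the question (the retry interval and cap are my own assumptions, not anything the site documents):

import time

result = {}
for _ in range(10):  # arbitrary safety cap on retries
    outcome = session.post(url2, data=requesttype, headers=headers)
    result = outcome.json()
    if result.get('status') != 'IN-QUEUE':
        break
    time.sleep(1)  # guessed polling interval
print(result)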
EDIT: I changed the code and it still doesn't work! I used the links from the answer, but it didn't work!
Why does this not work? When I run it, it takes a long time and never finishes!
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName': 'USER',
          'inUserPass': 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Encoding','gzip, deflate')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
#user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
#headers = { 'User-Agent' : user_agent }
response = urllib2.urlopen(req)
page = response.read()
print page
The remote server (the one at www.locationary.com) is waiting for the content of your HTTP POST request, based on the Content-Type and Content-Length headers you send. Since the data it expects never fully arrives, the remote server keeps waiting, and so does read(), until it does.
I need to know how to send the content of my HTTP POST request.
Well, you need to actually send some data in the request. See:
urllib2 - The Missing Manual
How do I send a HTTP POST value to a (PHP) page using Python?
Final, "working" version:
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName': 'USER',
          'inUserPass': 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
response = urllib2.urlopen(req)
page = response.read()
print page
The changes from the original code:
1. Don't explicitly set the Content-Length header; urllib2 computes it from the data for you.
2. Remove the req.add_header('Accept-Encoding','gzip, deflate') line, so that the response doesn't have to be decompressed (or, as an exercise left to the reader, ungzip it yourself).
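If you do keep the Accept-Encoding header, you can decompress a gzip body yourself. A minimal sketch in the same Python 2 / urllib2 style as the answer (the wbits value 16 + MAX_WBITS tells zlib to expect gzip framing):

import zlib

raw = response.read()
if response.info().get('Content-Encoding') == 'gzip':
    raw = zlib.decompress(raw, 16 + zlib.MAX_WBITS)  # unwrap the gzip header and trailer
print raw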