I am trying to fetch data from a webpage using urllib2. The page is visible on the browser but through the script I keep getting HTTPError: HTTP Error 403: Forbidden
I also tried mimicking a browser request by changing the user-agent string but no success.
Any ideas on this?
I tried with tamper data and firefox to send only user agent, and I get 403.
Try to add other headers:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
I tried, and this should work.
The site is checking your User-Agent just set it to Internet Explorer:
request.add_header('User-Agent', 'Internet Explorer')
I confirmed that this works with wget, and you get 403 unless you set your user agent to Internet Explorer.
:) Am trying to get quotes from NSE too ! like pythonFoo says you need additional headers. Hower only Accept is sufficient.
The user-agent can say python ( stay true ! )
Related
I'm trying to scrape a website using requests. However, a post method that I need to use requires the headers below. I can fill everything apart from the JSESSION ID. The only way I can get this post method to work is if I manually go into the browser, start a session and inspect the page to retrieve the JSESSIONID.
I am looking for a way to retrieve this JSESSIONID using the requests package in python. I saw some suggestions for using a session. However, the requests session does not grab the JSESSIONID, which is the only thing I need. How should I go about a possible solution?
Host:
Connection:
Content-Length:
Accept:
X-Requested-With:
User-Agent:
Content-Type:
Sec-GPC:
Origin:
Sec-Fetch-Site:
Sec-Fetch-Mode:
Sec-Fetch-Dest:
Referer:
Accept-Encoding:
Accept-Language:
Cookie: _1aa19=; JSESSIONID=;
What I currently tried is use a session from the requests package, which should store the cookies of the session. However, After I use a .get method requests.cookies does not have the JSESSIONID stored
query = 'Example%20query'
s = requests.Session()
suggest = s.get(f'https://www.examplewebsite.nl/api_route/suggest?query={query}').json()
s.cookies
JSESSIONID is generated when you go to https://www.examplewebsite.nl page first.
import requests
query = 'Example%20query'
s = requests.Session()
s.get('https://www.examplewebsite.nl')
suggest = s.get(f'https://www.examplewebsite.nl/api_route/suggest?query={query}').json()
print(s.cookies.get("JSESSIONID"))
So I created a code which a client uploads a file to the server folder and he has an option to download it back, it works perfectly fine in chrome, I click on the item I want to download and it downloads it
def send_image(request, cs):
request = request.split('=')
try:
name = request[1]
except:
name = request[0]
print('using send_iamge!')
print('Na ' + name)
path = 'C:\\Users\\x\\Desktop\\webroot\\uploads' + '\\file-name=' + name
print(path)
with open(path, 'rb') as re:
print('exist!')
read = re.read()
cs.send(read)
the code above reads the file that you choose and sends the data as bytes to the client back.
In chrome, it downloads the file as I showed you already but in for example internet explorer, it just prints the data to the client and doesn't download it The real question is why doesn't it just prints the data in chrome, why does it download it and doesn't print it as internet explorer does and how can I fix it?(for your info: all the files that I download have the name file-name before them that's why I put it there)
http request:
UPDATE:
POST /upload?file-name=Screenshot_2.png HTTP/1.1
Host: 127.0.0.1
Connection: keep-alive
Content-Length: 3534
Accept: */*
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
Content-Type: application/octet-stream
Origin: http://127.0.0.1
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Referer: http://127.0.0.1/
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en;q=0.9,en-US;q=0.8,he;q=0.7
It looks like that you don't send a HTTP/1 response but a HTTP/0.9 response (Note that I'm talking about the response send from the server not the request send from the client). A HTTP/1 response consists of a HTTP header and a HTTP body, similar to how a HTTP request is constructed. A HTTP/0.9 response instead only consists of the actual body, i.e. no header and thus no meta information in the header which tell the browser what to do with the body.
HTTP/0.9 is obsolete for 25 years but some browsers still support it. When a browser gets a HTTP/0.9 request it could anything with it since there is no defined meaning from the HTTP header. Browsers might try to interpret is as HTML, as plain text, offer it for download, refuse it in total ... - whatever.
The way to fix the problem is to send an actual HTTP response header before sending the body, i.e. something like this
cs.send("HTTP/1.0 200 ok\r\nContent-type: application/octet-stream\r\n\r\n")
with open(path, 'rb') as re:
...
cs.send(read)
In any case: HTTP is way more complex than you might think. There are established libraries to deal with this complexity. If you insist on not using any library please study the standard in order to avoid such problems.
Im trying to login to this website, seeking.com/login through scrapy shell. i also installed burp suite to analyze its url and headers, etc.
from scrapy.http import FormRequest
frmdata = {"captcha":"","email":"MYEMAIL.com","password":"MY_PASSWORD","is_rememberme":"0","locale":"en_US","auth_type":"bearer_token","date":"2018-12-13T09:56:22.957Z"}
url = "https://www.seeking.com/v3/auth/login"
r = FormRequest(url, formdata=frmdata)
fetch(r)
with this code i get a HTTP 401 Error, as far as i can tell essentially an authentication error.
I forwarded the calls through burpsuite and got the following intercept.
POST /v3/auth/login HTTP/1.1
Host: www.seeking.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0)
Gecko/20100101 Firefox/63.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://www.seeking.com/login?fromLogout=1
Content-Type: application/json;charset=utf-8
Web-Version: 3.59
Authorization: Basic NTI2ZTIwYzExMDI0NDYzNTk5OTI5MzUwZThiNWEzMTI6bHN0emd4ZzpSRzRzS3VmdEJMRTQxMm92TnMxbDR6L0ZkZ1dESHZuM2wwZWxtYWhyMGtnPQ==
Content-Length: 166
Connection: close
Cookie: __cfduid=dcf9fd66583d55382f362c18a83d904ca1544519479;
_gcl_au=1.1.2035701377.1544519485; _ga=GA1.2.1740241044.1544519486;
com.silverpop.iMAWebCookie=e88c45d1-3c24-11c6-089e-e287aae2c678;
__cfruid=3eebbdc1e401ed560c23a7c474c41e59b2e93018-1544520179;
device_cookie=1; __gads=ID=a1e437c03ddad1b3:T=1544519579:S=ALNI_MYb30xY4z76J4NniCK_ZtOyOdPMKA;_lb_user=gfpuzje6kg; seeking_session=eyJpdiI6Im4yMTNJNVNRZjkxbnZzMmNpYnQ4dkE9PSIsInZhbHVlIjoiVGhGVUJDejc1dElJbEwxekh5d2hXUnhjeDlpVWR2dW9IWWJqeDZvRmI3VU9Pc1lpZXZGWGJxejQ1alNXbGVXUGJqaEpORU9LNFJITVh0N3IwR1E0bUE9PSIsIm1hYyI6IjUyODU3MWIxYjM3MGU3M2E0YjI1YzM2MzNmNDc5ZDMzZDdjYTg1ZWMxYWU2ODJjY2JlMTJmZWJlNmUyZDkyNWMifQ%3D%3D {"captcha":"","email":"MYEMAIL","password":"MYPASS","is_rememberme":0,"locale":"en_US","auth_type":"bearer_token","date":"2018-12-14T09:15:56.016Z"}
I am completely new to this, and have spent 2 days trying to figure out what i need to pass to this POST to login.
My question is
1) based on this intercept what should my request via FormRequest look like?
2) I see there are cookies/authorization (Authorization token, that changes with each POST, session cookies, etc) tokens that are being passed in to the post... Where do they come from? How do i get them when i am scraping so that i can successfully login?
3) Do i need to store these session variables when scraping other pages on the site after login? Anything special i need to do to stay logged in to access other pages?
It looks like the login page is expecting to be passed soon data, and not a url-encoded string (which is what FormRequest will create).
Something like this should work:
r = scrapy.Request(
url=url,
method='POST',
body=json.dumps(frmdata),
headers={'Content-Type': 'application/json'},
)
The tokens, cookies, etc. are probably created when you initially request the login page, so you might need to request the login page before trying to log in.
It is possible that some of it is generated with javascript (haven't checked), so you might need to dig through the js code to figure out what's going on, or even execute the js yourself (e.g. using a browser).
Scrapy will keep track of your session for you, so there's nothing you need to do to stay logged in.
I have 403 FORBIDDEN error in agon-ratings plugin when submit rating.
I have read the doc. But csrf token exists in header:
Request Headersview source
Accept */*
Accept-Encoding gzip, deflate
Accept-Language en-US,en;q=0.5
Cache-Control no-cache
Connection keep-alive
Content-Length 22
Content-Type application/x-www-form-urlencoded; charset=UTF-8
Cookie csrftoken=6C7zHmrBufWbiYeTXwRkCWC9hDfdxGoW; sessionid=4d6b6977721fcb97f6903d0aaab5e632
Host localhost:8000
Pragma no-cache
Referer http://localhost:8000/news/40/asdas/
User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:23.0) Gecko/20100101 Firefox/23.0
X-Requested-With XMLHttpRequest
Any help with this issue will be appreciated.
Thanks in advance
I guess you are doing a post via ajax request, if thats correct than you need to send the csrf token as part of POST data or via the X-CSRFToken header.
I dont see any of that in the request header you posted.
Django docs you linked have a working example about how to do this (and if you use jQuery its mostly a copy and paste job)
You are using POST method. And with POST method, you need to write {% csrf_token %} in HTML inside form element or pass csrftoken in ajax request i.e https://docs.djangoproject.com/en/dev/ref/contrib/csrf/.
It is compulsory when you write CSRFMiddleware and csrf context processor in your django settings.
I am writing a very basic web server as a homework assignment and I have it running on localhost port 14000. When I browser to localhost:14000, the server sends back an HTML page with a form on it (the form's action is the same address - localhost:14000, not sure if that's proper or not).
Basically I want to be able to gather the data from the GET request once the page reloads after the submit - how can I do this? How can i access the stuff in the GET in general?
NOTE: I already tried socket.recv(xxx), that doesn't work if the page is being loaded first time - in that case we are not "receiving" anything from the client so it just keeps spinning.
The secret lies in conn.recv which will give you the headers sent by the browser/client of the request. If they look like the one I generated with safari you can easily parse them (even without a complex regex pattern).
data = conn.recv(1024)
#Parse headers
"""
data will now be something like this:
GET /?banana=True HTTP/1.1
Host: localhost:50008
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.53.11 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: keep-alive
"""
#A simple parsing of the get data would be:
GET={i.split("=")[0]:i.split("=")[1] for i in data.split("\n")[0].split(" ")[1][2:].split("&")}