Error logging into an HTTP server - Python

I'm trying to log in to an HTTP server to fetch some tables. The code I'm using is this:
import mechanize
import urllib2

MechBrowser = mechanize.Browser()
LoginUrl = 'http://www.jlrvehiclefeedback.com'
LoginData = "username=my_username&password=my_password&do=login"
LoginHeader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
LoginRequest = urllib2.Request(LoginUrl, LoginData, LoginHeader)
LoginResponse = MechBrowser.open(LoginRequest)
However, I get this error:
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
As you can see, I've defined a User-Agent, but I still can't get past the bot policy.
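mechanize enforces robots.txt on its own, before the request is ever sent, so this 403 is raised client-side regardless of the User-Agent set on the request. A minimal sketch of switching that check off via Browser.set_handle_robots (whether the site's login then succeeds is a separate question):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # don't fetch and enforce robots.txt
br.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36')]
response = br.open('http://www.jlrvehiclefeedback.com')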

Related

Issue using the Python requests module

Good morning,
Since yesterday I've been getting timeouts on requests to the eBay website. The code is simple:
import requests
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
html = requests.get("https://www.ebay.es", headers=headers).text
Tested with Google, and it works. This is the response I receive:
'\nGateway Timeout - In read \n\nGateway Timeout\nThe proxy server did not receive a timely response from the upstream server.\nReference #1.477f1602.1645295618.7675ccad\n\n'
What happened or changed? How could I solve it?
Removing the headers should work. Perhaps they don't like that user agent for some reason.
import requests
# headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
headers = {}
url = "https://www.ebay.es"
response = requests.get(url, headers=headers)
html_text = response.text
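With the headers dict empty, requests falls back to its default python-requests/x.y User-Agent. If timeouts come back anyway, it can also help to set an explicit timeout so a hung connection fails fast rather than waiting on the proxy, and to surface HTTP errors; a small sketch of the same request with those added:

import requests

url = "https://www.ebay.es"
response = requests.get(url, timeout=10)  # give up after 10 s instead of hanging
response.raise_for_status()               # raise on 4xx/5xx responses
html_text = response.text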

Python Requests error 403 even with user agent

I'm trying to parse an auction website with Python and requests.
But so far it returns error 403 Forbidden (Cloudflare protection).
Here's my code:
import requests
url = "https://www.interencheres.com"
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
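Cloudflare's bot detection looks at much more than the User-Agent (TLS fingerprint, a JavaScript challenge), so spoofing headers with plain requests is usually not enough. One commonly suggested workaround is the third-party cloudscraper package, which tries to solve the non-interactive JavaScript challenge; a sketch, assuming the site's protection level is one it can handle:

import cloudscraper

scraper = cloudscraper.create_scraper()  # behaves like a requests.Session
response = scraper.get("https://www.interencheres.com")
print(response.status_code)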

Web scraping request stopped working, showing "Response [401]" in Python?

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}
url = 'https://www.nseindia.com/api/chart-databyindex?index=ACCEQN'
r = requests.get(url, headers=headers)
data = r.json()
print(data)
prices = data['grapthData']
print(prices)
It was working fine, but now it shows error "Response [401]".
Well, it's all about the site's authentication requirements: the endpoint now rejects requests that don't carry the session cookies NSE sets when you visit the site itself.
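A workaround that is commonly reported to restore access (no guarantee NSE hasn't tightened things since) is to use a Session and visit the homepage first, so the cookies NSE sets there accompany the API call:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}

with requests.Session() as s:
    # Visiting the homepage sets the cookies the API endpoint checks.
    s.get('https://www.nseindia.com', headers=headers, timeout=10)
    r = s.get('https://www.nseindia.com/api/chart-databyindex?index=ACCEQN',
              headers=headers, timeout=10)
    data = r.json()
    print(data['grapthData'])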

How can I handle "The page has expired due to inactivity."

Currently, I am trying to make a user generator for a website, and there is a fundamental problem that I've been facing. The code below works, but what it prints out is
The page has expired due to inactivity. Please refresh and try again
I have seen some of the proposed solutions, including using the XSRF token, but either I am doing something wrong or it is not related to the token.
import requests
import bs4

with requests.Session() as s:
    s.get('http://www.watchill.org/register')
    token = s.cookies["XSRF-TOKEN"]
    agent = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 OPR/62.0.3331.116",
             "XSRF-TOKEN": token}
    r = s.post('http://www.watchill.org/register', headers=agent)
    print(bs4.BeautifulSoup(r.content, "html.parser"))
The problem is with your CSRF token, which is invalid.
I didn't check whether this code does what you're aiming for, but it does not return the page-expired message:
import requests
from bs4 import BeautifulSoup

def getXsrf(cookies):
    # Iterate the cookie jar passed in, not the global session
    for cookie in cookies:
        if cookie.name == 'XSRF-TOKEN':
            return cookie.value

with requests.Session() as s:
    s.get('http://www.watchill.org/register')
    xsrf = getXsrf(s.cookies)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 OPR/62.0.3331.116"}
    headers['X-XSRF-TOKEN'] = xsrf
    r = s.post('http://www.watchill.org/register', headers=headers)
    print(BeautifulSoup(r.content, "html.parser"))
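The detail that carries the fix here is the header name: "The page has expired due to inactivity" is the stock CSRF-failure response of the Laravel framework, which reads the token from a request header named X-XSRF-TOKEN rather than XSRF-TOKEN, so renaming the header is what lets the POST pass the check.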

Login to website via Python Requests

For a university project I am currently trying to log in to a website and scrape a small detail (a list of news articles) from my user profile.
I am new to Python, but I did this before with another website. My first two approaches produce different HTTP errors. I have considered problems with the headers my request is sending; however, my understanding of this site's login process appears to be insufficient.
This is the login page: http://seekingalpha.com/account/login
My first approach looks like this:
import requests

with requests.Session() as c:
    requestUrl = 'http://seekingalpha.com/account/orthodox_login'
    USERNAME = 'XXX'
    PASSWORD = 'XXX'
    userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
    login_data = {
        "slugs[]": None,
        "rt": None,
        "user[url_source]": None,
        "user[location_source]": "orthodox_login",
        "user[email]": USERNAME,
        "user[password]": PASSWORD
    }
    c.post(requestUrl, data=login_data, headers={"referer": "http://seekingalpha.com/account/login", 'user-agent': userAgent})
    page = c.get("http://seekingalpha.com/account/email_preferences")
    print(page.content)
This results in "403 Forbidden"
My second approach looks like this:
from requests import Request, Session

requestUrl = 'http://seekingalpha.com/account/orthodox_login'
USERNAME = 'XXX'
PASSWORD = 'XXX'
userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
# c.get(requestUrl)
login_data = {
    "slugs[]": None,
    "rt": None,
    "user[url_source]": None,
    "user[location_source]": "orthodox_login",
    "user[email]": USERNAME,
    "user[password]": PASSWORD
}
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4",
    "origin": "http://seekingalpha.com",
    "referer": "http://seekingalpha.com/account/login",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",  # header values must be strings, not ints
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
}
s = Session()
req = Request('POST', requestUrl, data=login_data, headers=headers)
prepped = s.prepare_request(req)
prepped.body = "slugs%5B%5D=&rt=&user%5Burl_source%5D=&user%5Blocation_source%5D=orthodox_login&user%5Bemail%5D=XXX%40XXX.com&user%5Bpassword%5D=XXX"
resp = s.send(prepped)
print(resp.status_code)
In this approach I was trying to prepare the headers exactly as my browser would. Sorry for the redundancy. This results in HTTP error 400.
Does someone have an idea what went wrong? Probably a lot.
Instead of spending a lot of energy on manually logging in and playing with Session, I suggest you just scrape the pages right away using your cookies.
When you log in, the site usually sets a cookie in your browser that identifies your session; you can pass it along with the request.
Your code will look like this:
import requests

# The cookie names and values below are placeholders; use the ones your site sets.
response = requests.get("https://www.example.com", cookies={
    "c_user": "my_cookie_part",
    "xs": "my_other_cookie_part"
})
print(response.content)
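You can grab the cookie names and values from your browser's developer tools after logging in manually (in Chrome: Application → Cookies). Keep in mind that session cookies expire, so the values will need refreshing from time to time.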
