I've been trying to scrape this website (www.dearedu.com); specifically, I've been having tremendous difficulty logging in, having tried everything I could find in previously answered authorization questions on Stack Exchange.
Currently, I am using a requests session to log in, with the following code:
import requests

mySession = requests.Session()  # the session keeps cookies between requests, so no manual CookieJar is needed
mySession.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})

# visit the homepage first so the session picks up any initial cookies
mySession.get('http://www.dearedu.com/')

login_data = {'userid': myusername, 'pwd': mypassword,
              'fmdo': 'login', 'dopost': 'login',
              'keeptime': '604800', 'teshu': 't'}
response = mySession.post('http://club.dearedu.com/member/index_do.php', data=login_data)
When the code above is run with the correct username and password, you get the following HTML:
<head>
<title>第二教育网提示信息</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<base target='_self'/>
<style>div{line-height:160%;}</style></head>
<body leftmargin='0' topmargin='0' bgcolor='#FFFFFF'>
<center>
<script>
var pgo=0;
function JumpUrl(){
if(pgo==0){ location='http://www.dearedu.com'; pgo=1; }
}
document.write("<br /><div style='width:450px;padding:0px;border:1px
solid #DADADA;'><div style='padding:6px;font-size:12px;border-
bottom:1px solid #DADADA;background:#DBEEBD
url(/plus/img/wbg.gif)';'><b>第二教育网提示信息!</b></div>");
document.write("<div style='height:130px;font-
size:10pt;background:#ffffff'><br />");
document.write("成功登录,现在转向系统主页...");
document.write("<br /><a href='http://www.dearedu.com'>如果你的浏览器没
反应,请点击这里...</a><br/></div>");
setTimeout('JumpUrl()',1000);</script>
</center>
</body>
</html>
What I do not understand is that the cookies and status code received indicate that I have successfully logged in, but when I then try to access the main page, it indicates that I am not.
If I had to take a guess, it has something to do with the URL jumping: the page waits a second or so before redirecting you to the main page.
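For reference, here is a minimal sketch of how I check the session afterwards; the assumption that the homepage embeds my username when logged in is mine, so adjust it to whatever logged-in marker the real page shows:
home = mySession.get('http://www.dearedu.com/')
home.encoding = 'gb2312'        # the site declares charset=gb2312
print(home.status_code)         # 200
print(myusername in home.text)  # False, i.e. the page does not treat me as logged in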
Can someone explain what is wrong and how to fix it? Thank you!
EDIT:
"成功登录,现在转向系统主页" = successfully logged in, redirecting to homepage
"如果你的浏览器没反应,请点击这里" = If your browser does not respond, please click here
I don't think the rest is relevant. Thanks!
Related
I need to get the HTML code of a website, but I only get a 403 error (403 status_code).
import urllib.request

url = 'https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/'
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7')
response = urllib.request.urlopen(req)
data = response.read() # a `bytes` object
html = data.decode('utf-8') # a `str`; this step can't be used if data is binary
print(html)
This gives the following error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
And I tried this too (verify=False doesn't work either):
import requests
res = requests.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/', headers={'user-agent':'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'})
print(res.text, res.status_code)
I figured it out: the response is a browser-check page, which never appears when I open this website in a real browser.
<!DOCTYPE html>
<html lang="ru">
<head>
<title>Проверка браузера</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width,initial-scale=1.0" />
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
</header>
<div class="message">
<div class="wrapper wrapper_message">
<div class="message-content">
<div class="message-title">Проверяем браузер</div>
<div>Сайт скоро откроется</div>
</div>
<div class="loader"></div>
</div>
</div>
<div class="captcha">
<form id="challenge-form" class="challenge-form" action="/cart-link/b15caf63da6313698633f41747b9d9eb/?__cf_chl_f_tk=iRm3UPbqu59isE.16Z5X1Rx8TYQzALW_hvQcP9ji_Rc-1676103711-0-gaNycGzNCXs" method="POST" enctype="application/x-www-form-urlencoded">
<div id="cf-please-wait">
<div id="spinner">
Please help me get a 200 status_code and the real HTML code.
If you're getting a status code of 403, then according to Google:
An HTTP 403 response code means that a client is forbidden from accessing a valid URL. The server understands the request, but it can't fulfill the request because of client-side issues.
So the server is deliberately refusing your client rather than failing to find the page.
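As a quick first check (a generic sketch, nothing site-specific), inspect the status code and the response headers to see what is rejecting you; Cloudflare, for example, usually identifies itself in the Server header:
import requests

res = requests.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/')
print(res.status_code)            # 403 here
print(res.headers.get('Server'))  # e.g. 'cloudflare' when Cloudflare fronts the site
print(res.text[:200])             # the beginning of the block page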
The best way to get HTML from websites (in my opinion) is with the BeautifulSoup library.
A simple example:
import urllib.request
from bs4 import BeautifulSoup

# Fetch the html file
response = urllib.request.urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Format the parsed html file
strhtm = soup.prettify()

# Print the first few characters
print(strhtm[:225])
The best part of this library is how easily it pulls values out of tags. This part extracts tag values:
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)         # the whole <title> tag
print(soup.title.string)  # just its text
print(soup.a.string)      # text of the first <a> tag
print(soup.b.string)      # text of the first <b> tag
Also, regarding the error: it may be IP blocking, in which case you can use a proxy or different user agents.
But I think the website may be relying on cookies. You can watch the Network tab in your browser's developer console to see which cookies are set when you first visit the site, then add those cookies to your request and see the results.
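A sketch of that idea, with a made-up cookie name and value (copy the real ones from the Network tab):
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
cookies = {'session_cookie': 'value-copied-from-devtools'}  # hypothetical name/value
res = requests.get('https://santehnika-online.ru/', headers=headers, cookies=cookies)
print(res.status_code)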
The site is protected by Cloudflare, which is currently one of the best anti-bot services. Cloudflare checks whether the client can run JavaScript (this is the biggest difference between python requests and a browser).
Use one of these projects, which are designed to pass through Cloudflare's protection:
cloudscraper
undetected-chromedriver (for Selenium)
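A minimal cloudscraper sketch (assuming the site's challenge is one cloudscraper can solve; the heavier interactive challenges still require a real browser):
import cloudscraper

scraper = cloudscraper.create_scraper()  # a drop-in replacement for a requests.Session
res = scraper.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/')
print(res.status_code)
print(res.text[:200])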
I am facing a strange problem that I don't fully understand, owing to my lack of HTML knowledge. I want to download an Excel file, post login, from a website.
The file_url is:
file_url="https://xyz.xyz.com/portal/workspace/IN AWP ABRL/Reports & Analysis Library/CDI Reports/CDI_SM_Mar'20.xlsx"
There is a share button for the file which gives link 2 (for the same file):
file_url2='http://xyz.xyz.com/portal/traffic/4a8367bfd0fae3046d45cd83085072a0'
When I use requests.get to read link 2 (after logging in to a session), I am able to read the Excel file into pandas. However, link 2 does not serve my purpose, as I can't schedule my report around it on a periodic basis (by changing Mar'20 to Apr'20, etc.). Link 1 suits my purpose, but when I fetch it with r = requests.get(...), r.content gives the following:
b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'
I have tried all sorts of encoding and decoding of the URL, but I can't make sense of this alphanumeric URL (link 2).
My Python code (working) is:
import requests
import pandas as pd
from io import BytesIO
url = 'http://xyz.xyz.com/portal/site'
username=''
password=''
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
r = s.get(url,auth=(username, password),verify=False,headers=headers)
r2 = s.get(file_url,verify=False,allow_redirects=True)
r2.content
# df=pd.read_excel(BytesIO(r2.content))
You get HTML with JavaScript that redirects the browser to a new URL, but requests can't run JavaScript. It is a simple method for blocking simple scripts/bots.
HTML is only a string, though, so you can use string functions to pull the URL out of it, and then use that URL with requests to get the file.
content = b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'
text = content.decode()
print(text)
print('\n---\n')
start = text.find('href="') + len('href="')
end = text.find('";', start)
url = text[start:end]
print('url:', url)
response = s.get(url)  # `s` is the logged-in session from the question
Results:
<html>
<head>
<title></title>
</head>
<body bgcolor="#FFFFFF">
<script language="javascript">
<!--
top.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx";
-->
</script>
</body>
</html>
---
url: https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx
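From there, building on the commented-out line in the question, the extracted URL should feed straight into pandas (a sketch, assuming the session `s` is still logged in):
import pandas as pd
from io import BytesIO

r2 = s.get(url, verify=False, allow_redirects=True)
df = pd.read_excel(BytesIO(r2.content))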
My Django website is hosted on an Apache server. I want to send data to it using requests.post from a Python script on my PC, but it is giving 403 Forbidden.
import json
import requests
url = "http://54.161.205.225/Project/devicecapture"
headers = {'User-Agent':
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
'content-type': 'application/json'}
data = {
"nb_imsi":"test API",
"tmsi1":"test",
"tmsi2":"test",
"imsi":"test API",
"country":"USA",
"brand":"Vodafone",
"operator":"test",
"mcc":"aa",
"mnc":"jhj",
"lac":"hfhf",
"cellIid":"test cell"
}
response = requests.post(url, data=json.dumps(data), headers=headers)
print(response.status_code)
I have also given permissions to the directory containing the views.py where this request will go.
I have gone through many other answers, but they didn't help.
I have also tried the code without json.dumps, but it isn't working either.
How do I resolve this?
After investigating it looks like the URL that you need to post to in order to login is: http://54.161.205.225/Project/accounts/login/?next=/Project/
You can work out what you need to send in a post request by looking in the Chrome DevTools, Network tab. This tells us that you need to send the fields username, password and csrfmiddlewaretoken, which you need to pull from the page.
You can get it by extracting it from the response of the first get request. It is stored on the page like this:
<input type="hidden" name="csrfmiddlewaretoken" value="OspZfYQscPMHXZ3inZ5Yy5HUPt52LTiARwVuAxpD6r4xbgyVu4wYbfpgYMxDgHta">
So you'll need to pull it out of the HTML; a minimal regex sketch follows (BeautifulSoup would work just as well).
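import re

# `response` is the result of the initial GET of the login page
match = re.search(r'name="csrfmiddlewaretoken" value="([^"]+)"', response.text)
csrfmiddlewaretoken = match.group(1) if match else None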
So first you have to create a session. Then load the login page with a get request. Then send a post request with your login credentials to that same URL. And then your session will gain the required cookies that will then allow you to post to your desired URL. This is an example below.
import requests
# Create session
session = requests.session()
# Add user-agent string
session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) ' +
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
# Get login page
response = session.get('http://54.161.205.225/Project/accounts/login/?next=/Project/')
# Get csrf
# Extract csrfmiddlewaretoken from response.text (see the regex sketch above)
# Post to login
response = session.post('http://54.161.205.225/Project/accounts/login/?next=/Project/', data={
'username': 'example123',
'password': 'examplexamplexample',
'csrfmiddlewaretoken': 'something123123',
})
# Post desired data
response = session.post('http://url.url/other_page', data={
'data': 'something',
})
print(response.status_code)
Hopefully this should get you there. Good luck.
For more information check out this question on requests: Python Requests and persistent sessions
I have faced that situation many times.
The problems were:
54.161.205.225 was not added to ALLOWED_HOSTS in settings.py (see the snippet below)
the Apache WSGI was not correctly configured
Things that might help with debugging:
check the Apache error logs to investigate what went wrong
try running the server locally and posting to it, to make sure the problem is not related to Apache
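For the first point, a minimal settings.py sketch (the IP is the one from the question):
# settings.py
ALLOWED_HOSTS = ['54.161.205.225']  # Django rejects requests whose Host header is not listed here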
I am trying to scrape an HTTP website through proxies, and when I try to extract text it shows "Web Page Blocked". How can I avoid this error?
My code is as follows:
import requests
from bs4 import BeautifulSoup

url = "http://campanulaceae.myspecies.info/"
proxy_dict = {
'http' : "174.138.54.49:8080",
'https' : "174.138.54.49:8080"
}
page = requests.get(url, proxies=proxy_dict)
soup = BeautifulSoup(page.text,'html.parser')
print(soup)
I get the output below when I try to print text from the website:
<html>
<head>
<title>Web Page Blocked</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="NO-CACHE" http-equiv="PRAGMA"/>
<meta content="initial-scale=1.0" name="viewport"/>
........
<body bgcolor="#e7e8e9">
<div id="content">
<h1>Web Page Blocked</h1>
<p>Access to the web page you were trying to visit has been blocked in accordance with company policy. Please contact your system administrator if you believe this is in error.</p>
That's because you did not specify a user-agent in the request headers.
Quite often, sites block requests that come from robot-like sources.
Try it like this:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
page = requests.get(url, headers=headers, proxies=proxy_dict)
I am trying to get data from a page. I've read the posts of other people who had the same problem and tried their suggestions (making a GET request first to pick up cookies, setting headers), but none of it works. When I examine the output of print(soup.title.get_text()), I still get "Log In" as the returned title. The login_data has the same key names as the HTML <input> elements, e.g. <input name=ctl00$cphMain$logIn$UserName ...> for the username and <input name=ctl00$cphMain$logIn$Password ...> for the password. Not sure what to do next. I can't use Selenium, as I have to run this script on an EC2 instance that's running a Splunk server.
import requests
from bs4 import BeautifulSoup
link = "****"
login_URL = "https://erecruit.elwoodstaffing.com/Login.aspx"
login_data = {
"ctl00$cphMain$logIn$UserName": "****",
"ctl00$cphMain$logIn$Password": "****"
}
with requests.Session() as session:
z = session.get(login_URL)
session.headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36',
'Content-Type':'application/json;charset=UTF-8',
}
post = session.post(login_URL, data=login_data)
response = session.get(link)
html = response.text
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())
I actually found the answer.
You can basically just go to the Network tab in Chrome DevTools and copy the request as a cURL statement, then use a website or tool to convert the cURL statement into its programming-language equivalent (Python, Node, Java, and so forth).
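For an ASP.NET WebForms login page like this one, the copied request will usually include hidden fields such as __VIEWSTATE and __EVENTVALIDATION alongside the username and password. Rather than hard-coding those values, a common alternative (a sketch under that assumption, not the exact converted output) is to harvest every hidden <input> from the login page first:
import requests
from bs4 import BeautifulSoup

login_URL = "https://erecruit.elwoodstaffing.com/Login.aspx"

with requests.Session() as session:
    page = session.get(login_URL)
    soup = BeautifulSoup(page.text, "html.parser")

    # start from every hidden field the server rendered (__VIEWSTATE etc.)
    login_data = {tag["name"]: tag.get("value", "")
                  for tag in soup.select("input[type=hidden]")
                  if tag.has_attr("name")}
    login_data["ctl00$cphMain$logIn$UserName"] = "****"
    login_data["ctl00$cphMain$logIn$Password"] = "****"
    # the form may also require the submit button's name/value pair

    post = session.post(login_URL, data=login_data)
    print(post.status_code)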