Python web scrape from company SharePoint

I need to scrape data from my company's SharePoint site using Python, but I am stuck at the authentication phase. I have tried HttpNtlmAuth from requests_ntlm, HttpNegotiateAuth from requests_negotiate_sspi, and mechanize, and none of them worked. I am new to web scraping and have been stuck on this issue for a few days already. I just need to get the HTML source so I can start filtering for the data I need. Can anyone give me some guidance on this issue?
Methods I've tried:
import requests
from requests_negotiate_sspi import HttpNegotiateAuth

# this is the security certificate I downloaded using Chrome
cert = 'certsharepoint.cer'

response = requests.get(
    r'https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx',
    auth=HttpNegotiateAuth(),
    verify=cert)
print(response.status_code)
Error:
[X509: NO_CERTIFICATE_OR_CRL_FOUND] no certificate or crl found (_ssl.c:4293)
Another method:
import sharepy

s = sharepy.connect("https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx",
                    username="username",
                    password="password")
Error:
Invalid Request: AADSTS90023: Invalid STS request
There seems to be a problem with the certificate in the first method, and researching the "Invalid STS request" error does not bring up any solutions that work for me.
Another method:
import requests
from requests_ntlm import HttpNtlmAuth

r = requests.get("http://ntlm_protected_site.com",
                 auth=HttpNtlmAuth('domain\\username', 'password'))
Error:
403 FORBIDDEN
Using requests.get with headers like so:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.11 (KHTML, like Gecko) '
                         'Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

auth = HttpNtlmAuth(username=username,
                    password=password)

responseObject = requests.get(url, auth=auth, headers=headers)
returns a 200 response, whereas using requests.get without headers returns a 403 Forbidden response. The returned HTML, however, is of no use, because it is the HTML of Microsoft's "We can't sign you in" page.
Moreover, removing the auth parameter from requests.get (responseObject = requests.get(url, headers=headers)) does not change anything: it still returns a 200 response with the same HTML for the "We can't sign you in" page.

If doing this interactively, try using Selenium (https://selenium-python.readthedocs.io/) together with webdriver_manager, so you can skip having to download the web browser driver yourself (https://pypi.org/project/webdriver-manager/). Selenium will not only allow you to authenticate to your tenant interactively, but also makes it possible to collect dynamic content that may require interaction after loading the page, like pushing a button to reveal a table.
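A minimal sketch of that interactive approach, assuming selenium and webdriver-manager are installed; the function name and the manual "press Enter" pause are my own additions, not part of either library:

```python
def fetch_rendered_html(url):
    """Open a real Chrome window, let the user log in, then return the page source."""
    # Imports kept inside the function: selenium and webdriver-manager
    # are third-party packages and may not be installed everywhere.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # webdriver_manager downloads a matching chromedriver automatically.
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(url)
        # Authenticate interactively in the browser window, then continue.
        input("Log in in the browser window, then press Enter here...")
        return driver.page_source
    finally:
        driver.quit()

# Example (opens a real browser window):
# html = fetch_rendered_html("https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx")
```

Once you have the page source, you can hand it to BeautifulSoup or similar to start filtering for the data you need.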

I managed to connect to my company's sharepoint by using https://pypi.org/project/sharepy/2.0.0b1.post2/ instead of https://pypi.org/project/sharepy/
Using the current release of sharepy (1.3.0) and this code:
s = sharepy.connect("https://company.sharepoint.com",
                    username=username,
                    password=password)
responseObject = s.get("https://company.sharepoint.com/teams/xxx/xxx/xxx.aspx")
I got this error:
Authentication Failure: AADSTS50126: Error validating credentials due to invalid username or password
BUT using sharepy 2.0.0b1.post2 with the same code returns no error and successfully authenticates to SharePoint.

Related

requests.post from python script to my Django website hosted using Apache giving 403 Forbidden

My Django website is hosted on an Apache server. I want to send data to it with requests.post from a Python script on my PC, but it is giving 403 Forbidden.
import json
import requests

url = "http://54.161.205.225/Project/devicecapture"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
           'content-type': 'application/json'}
data = {
    "nb_imsi": "test API",
    "tmsi1": "test",
    "tmsi2": "test",
    "imsi": "test API",
    "country": "USA",
    "brand": "Vodafone",
    "operator": "test",
    "mcc": "aa",
    "mnc": "jhj",
    "lac": "hfhf",
    "cellIid": "test cell"
}
response = requests.post(url, data=json.dumps(data), headers=headers)
print(response.status_code)
I have also given permissions on the directory containing the views.py where this request will go.
I have gone through many other answers, but they didn't help.
I have also tried the code without json.dumps, but it doesn't work that way either.
How can I resolve this?
After investigating, it looks like the URL you need to post to in order to log in is: http://54.161.205.225/Project/accounts/login/?next=/Project/
You can work out what you need to send in a post request by looking in the Chrome DevTools, Network tab. This tells us that you need to send the fields username, password and csrfmiddlewaretoken, which you need to pull from the page.
You can get it by extracting it from the response of the first get request. It is stored on the page like this:
<input type="hidden" name="csrfmiddlewaretoken" value="OspZfYQscPMHXZ3inZ5Yy5HUPt52LTiARwVuAxpD6r4xbgyVu4wYbfpgYMxDgHta">
So you'll need to do some kind of Regex to get it. You'll work it out.
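As a sketch, that extraction can be done with the standard library's re module; the HTML string below is just the hidden-input example shown above:

```python
import re

# The hidden input as it appears in the login page's HTML.
html = ('<input type="hidden" name="csrfmiddlewaretoken" '
        'value="OspZfYQscPMHXZ3inZ5Yy5HUPt52LTiARwVuAxpD6r4xbgyVu4wYbfpgYMxDgHta">')

# Capture whatever sits in the value="..." attribute of that input.
match = re.search(r'name="csrfmiddlewaretoken" value="([^"]+)"', html)
token = match.group(1) if match else None
print(token)  # OspZfYQscPMHXZ3inZ5Yy5HUPt52LTiARwVuAxpD6r4xbgyVu4wYbfpgYMxDgHta
```

In the real flow you would run this regex against response.text from the GET request instead of a hard-coded string.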
So first you have to create a session. Then load the login page with a get request. Then send a post request with your login credentials to that same URL. Your session will then have gained the required cookies that allow you to post to your desired URL. Here is an example:
import requests

# Create session
session = requests.session()

# Add user-agent string
session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) ' +
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})

# Get login page
response = session.get('http://54.161.205.225/Project/accounts/login/?next=/Project/')

# Get csrf
# Do something to response.text

# Post to login
response = session.post('http://54.161.205.225/Project/accounts/login/?next=/Project/', data={
    'username': 'example123',
    'password': 'examplexamplexample',
    'csrfmiddlewaretoken': 'something123123',
})

# Post desired data
response = session.post('http://url.url/other_page', data={
    'data': 'something',
})

print(response.status_code)
Hopefully this should get you there. Good luck.
For more information check out this question on requests: Python Requests and persistent sessions
I have faced that situation many times. The problems were:
- 54.161.205.225 is not added to ALLOWED_HOSTS in settings.py
- the Apache WSGI is not correctly configured
Things that might help with debugging:
- check the Apache error logs to investigate what went wrong
- try running the server locally and posting to it, to make sure the problem is not related to Apache

Can't login to a website with python request

I'm trying to log in to the website https://app.factomos.com/connexion but it doesn't work; I still get a 403 error. I have tried different headers and different data, but I really don't know where the problem is...
I tried another way with MechanicalSoup, but that still returns the connection page.
If someone can help me... Thank you for your time :/
import requests
from bs4 import BeautifulSoup as bs

url = 'https://factomos.com'
email = 'myemail'
password = 'mypassword'
url_login = 'https://factomos.com/connexion'

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
})

data_login = {
    'appAction': 'login',
    'email': email,
    'password': password
}

with requests.Session() as s:
    dash = s.post(url_login, headers=headers, data=data_login)
    print(dash.status_code)
# MechanicalSoup
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
resp = browser.open("https://app.factomos.com/connexion")
browser.select_form('form[id="login-form"]')
browser["email"] = 'myemail'
browser["password"] = 'mypassword'
response = browser.submit_selected()
print("submit: ", response.status_code)
print(browser.get_current_page())
I expect a response 200 with the dashboard page but the actual response is 403 or the connection page.
The URL you are using to login (https://factomos.com/connexion) is not the correct endpoint to log in. You can find this out using a browser's devtools/inspect element panel, specifically the "Network" tab.
Accessing this panel varies by browser. Here's how you do it in Chrome, but in general you can access it by right-clicking and choosing Inspect element.
From there, I sent a fake login attempt and I found the actual login endpoint is:
https://app.factomos.com/controllers/app-pro/login-ajax.php
As soon as you send the request, you can view details about it, and once you get a response you can see those details too. Sending the form data (the email and password fields) to that endpoint, the response was:
{"error":{"code":-1,"message":"Identifiant ou mot de passe incorrect"}}
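A hedged sketch of what logging in against that endpoint might look like; the field names (email, password) and the error format are taken from the question and the response above, and the helper name parse_login_error is my own:

```python
import json

def parse_login_error(text):
    """Return the error message from a login-ajax.php response, or None on success."""
    payload = json.loads(text)
    error = payload.get("error")
    return error["message"] if error else None

# Example with the response body shown above:
sample = '{"error":{"code":-1,"message":"Identifiant ou mot de passe incorrect"}}'
print(parse_login_error(sample))  # Identifiant ou mot de passe incorrect

# To attempt a real login (untested, requires valid credentials and network access):
# import requests
# with requests.Session() as s:
#     resp = s.post("https://app.factomos.com/controllers/app-pro/login-ajax.php",
#                   data={"email": "myemail", "password": "mypassword"})
#     print(parse_login_error(resp.text))
```

Posting with a session (rather than a bare request) should keep any cookies the endpoint sets for the subsequent dashboard request.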

Python Requests ASP.net always redirected to login

Using Python 3, I am trying to download a file (xlsx) from an HTTPS ASP.NET form page using Python requests. I am creating a session and first trying to log in to the site. It is HTTPS, but I do not have access to the SSL cert, so I am using verify=False, which I am happy with for this purpose.
I have manually set the User-Agent header with help from here, to the same value as the browser's, captured under IE's F12 Network tab, as this page seems to need a browser user-agent; the python-requests user-agent may be forbidden.
I am also capturing __VIEWSTATE and __VIEWSTATEGENERATOR from the response text, as advised in this answer, and adding them to my POST data along with the username and password.
import requests
import bs4

login_payload = {'ctl00_txtEmailAddr': my_login, 'ctl00_txtPwd': pwd}
headers = {'User-Agent': user_agent,
           'Accept': r'*/*',
           'Accept-Encoding': r'gzip, deflate',
           'Connection': r'Keep-Alive'}

s = requests.Session()
req = requests.Request('GET', my_url, headers=headers)
prep0 = s.prepare_request(req)
s.headers.update(headers)
resp = s.send(prep0,
              verify=False,
              allow_redirects=True)

soup = bs4.BeautifulSoup(resp.text)
login_payload["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
login_payload["__VIEWSTATEGENERATOR"] = \
    soup.select_one("#__VIEWSTATEGENERATOR")["value"]

req_login = requests.Request('POST', juvo_url, headers=s.headers,
                             data=login_payload)
prep1 = s.prepare_request(req_login)
login_resp = s.send(prep1, verify=False)
Here is the rest of the request body, if it helps; I am not using this:
__EVENTTARGET=&__EVENTARGUMENT=&forErrorMsg=&ctl00%24txtEmailAddr=*MYLOGIN*&ctl00%24txtPwd=*MYPASSWORD*&ctl00%24ImgBtnLoging.x=0&ctl00%24ImgBtnLoging.y=0
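As a side note, the URL-encoded body above can be decoded with the standard library to see the field names the form actually posts; note that they decode to ctl00$txtEmailAddr and ctl00$txtPwd (with a $), not ctl00_txtEmailAddr as in the payload dict above. A quick check (with the *MYLOGIN*/*MYPASSWORD* placeholders simplified):

```python
from urllib.parse import parse_qs

# The captured request body, with credential placeholders simplified.
body = ("__EVENTTARGET=&__EVENTARGUMENT=&forErrorMsg="
        "&ctl00%24txtEmailAddr=MYLOGIN&ctl00%24txtPwd=MYPASSWORD"
        "&ctl00%24ImgBtnLoging.x=0&ctl00%24ImgBtnLoging.y=0")

# keep_blank_values=True preserves the empty __EVENTTARGET-style fields.
fields = parse_qs(body, keep_blank_values=True)
print(sorted(fields))
```

If the server validates field names strictly, sending ctl00_txtEmailAddr instead of ctl00$txtEmailAddr could itself explain the redirect back to the login page.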
With other attempts building on the code above, every page, including the direct hyperlink copied from IE, returns "Object moved to here" (with a direct link to the file I need, which works in a browser) or redirects me to the login page.
If I try to download this direct link from requests.history in Python, I get an HTML file containing, depending on the response, either "Object moved to here" or the HTML of the login page.
My request status is always 302 or 200, as seen with urllib3 debugging enabled, but I am yet to see any response other than the login page or "Object moved to here".
The closest I can get is this header, after doing a GET request with the browser URL modified in Python to the date I am interested in (which may actually be a website vulnerability, if I can get this far without being logged in...):
{'Cache-Control': 'private', 'Content-Length': '873', 'Content-Type': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet; charset=utf-8', 'Location': 'redacted login page with a whole load of params', 'Server': 'Microsoft-IIS/7.5', 'content-disposition': "attachment;filename='redacted filename'", 'X-AspNet-Version': '2.0.50727', 'X-Powered-By': 'ASP.NET'}
With almost every SO hyperlink now purple, any clues/suggestions would be greatly appreciated.
Many thanks.

Getting 403 Forbidden requesting Amazon S3 file

I want to get the size of a file on Amazon S3 without having to download it. My approach is to send an HTTP HEAD request, since the response will include the content-length HTTP header.
Here is my code:
import httplib
import urllib

urlPATH = urllib.unquote("/ticket/fakefile.zip?AWSAccessKeyId=AKIAIX44POYZ6RD4KV2A&Expires=1495332764&Signature=swGAc7vqIkFbtrfXjTPmY3Jffew%3D")
conn = httplib.HTTPConnection("cptl.s3.amazonaws.com")
conn.request("HEAD", urlPATH,
             headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                      'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                      'Accept-Encoding': 'none',
                      'Accept-Language': 'en-US,en;q=0.8',
                      'Connection': 'keep-alive'})
res = conn.getresponse()
print res.status, res.reason
Error message is:
403 Forbidden
To escape the "%" in the URL, I used urllib.unquote, and after getting 403 Forbidden I also tried adding some headers, as I thought Amazon might only return files that appear to be requested by a browser, but I continue to get the 403 error.
Is this a case of Amazon needing particular arguments to service the HTTP request properly, or is my code bad?
OK... I found a solution by using a workaround. My best guess is that curl/wget were missing HTTP headers in the request to S3, so they all failed while the browser worked. I tried to start analyzing the request, but didn't get far.
Ultimately, I got it working with the following code:
import urllib
d = urllib.urlopen("S3URL")
print d.info()['Content-Length']
403 Forbidden mildly points to an auth problem. Are you sure your access key and signature are correct?
If there's doubt, you could always try and get the metadata via Boto3, which handles all the auth stuff for you (pulling from config files or data you've passed in). Heck, if it works, you can even maybe turn on debug mode and see what it's actually sending that works.
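A sketch of that Boto3 approach, assuming boto3 is installed and AWS credentials are configured (e.g. in ~/.aws/credentials); the bucket and key below are read off the URL in the question, and the function name is my own:

```python
def get_s3_object_size(bucket, key):
    """Return the size in bytes of an S3 object via a HEAD request (no download)."""
    # Import inside the function: boto3 is a third-party package
    # and needs configured AWS credentials to actually run.
    import boto3
    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=bucket, Key=key)
    return head["ContentLength"]

# Example (requires network access and permission on the bucket):
# print(get_s3_object_size("cptl", "ticket/fakefile.zip"))
```

head_object issues the same HEAD request as the hand-rolled httplib code, but Boto3 signs it for you, which sidesteps the signature/escaping problems entirely.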

How to login phone.ipkall.com with python requests library?

I am trying to learn Python, but I have no knowledge of HTTP. I read some posts here about how to use requests to log in to a web site, but it doesn't work. My simple code is here (not the real number and password):
#!/usr/bin/env python3
import requests

login_data = {'txtDID': '111111111',
              'txtPswd': 'mypassword'}

with requests.Session() as c:
    c.post('http://phone.ipkall.com/login.asp', data=login_data)
    r = c.get('http://phone.ipkall.com/update.asp')
    print(r.text)
print("Done")
print("Done")
But I can't get my personal information, which should be shown after login. Can anyone give me a hint or point me in the right direction? I have no idea what's going wrong.
Servers don't like bots (scripts), for security reasons, so your script has to behave like a human using a real browser. First use get() to get the session cookies, and set the User-Agent header to a real one. Use http://httpbin.org/headers to see what User-Agent is sent by your browser.
Always check the results of r.status_code and r.url.
So you can start with this (I don't have an account on this server, so I can't test it):
#!/usr/bin/env python3
import requests

s = requests.Session()
s.headers.update({
    'User-agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0",
})

# --------
# to get cookies, session ID, etc.
r = s.get('http://phone.ipkall.com/login.asp')
print(r.status_code, r.url)

# --------
login_data = {
    'txtDID': '111111111',
    'txtPswd': 'mypassword',
    'submit1': 'Submit'
}
r = s.post('http://phone.ipkall.com/process.asp?action=verify', data=login_data)
print(r.status_code, r.url)
# --------
BTW: if the page uses JavaScript, you have a problem, because requests can't run the JavaScript on the page.
