I would like to crawl some data from a website. To access the target data manually, I need to log in and then click some buttons to finally get the target HTML page. Currently, I am using the Python requests library to simulate this process. I am doing it like this:
import requests

ss = requests.session()
# log in
resp = ss.post(url, data=(('username', 'xxx'), ('password', 'xxx')))
# then send a request to the target url
result = ss.get(target_url)
However, I found that the final request did not return what I wanted.
So I changed my approach. I downloaded all the network traffic and looked into the headers and cookies of the last request. I found that some values differ in each login session, like the session ID and some other variables. So I traced back where these variables are returned in the responses and got the values again by sending the corresponding requests. After this, I constructed the correct headers and cookies and sent a request like this:
resp = ss.get(target_url, headers=myheader, cookies=mycookie)
But still, it does not return what I want. Can anyone help?
I was in the same boat some time ago, and I eventually switched from trying to get requests to work to using Selenium instead, which made life much easier (pip install selenium). You can then log into a website and navigate to a desired page like this:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

website_with_logins = "https://website.com"
website_to_access_after_login = "https://website.com/page"

driver = webdriver.Firefox()  # or webdriver.Chrome(), etc.
driver.get(website_with_logins)

# fill in the login form and submit it
username = driver.find_element_by_name("username")
username.send_keys("your_username")
password = driver.find_element_by_name("password")
password.send_keys("your_password")
password.send_keys(Keys.RETURN)

# now navigate to the page you actually want
driver.get(website_to_access_after_login)
Once you have website_to_access_after_login loaded (you'll see it appear), you can get the HTML and have a field day using just
html = driver.page_source
Hope this helps.
I'm not sure how else to describe this. I'm trying to log into a website using the requests library with Python, but it doesn't seem to capture all cookies from when I log in, and subsequent requests to the site go back to the login page.
The code I'm using is as follows: (with redactions)
with requests.Session() as s:
    r = s.post('https://www.website.co.uk/login', data={
        'amember_login': 'username',
        'amember_password': 'password'
    })
Looking at the developer tools in Chrome, I see the following:
After checking r.cookies, it seems only PHPSESSID was captured; there's no sign of the amember_nr cookie.
The value in PyCharm only shows:
{RequestsCookieJar: 1}<RequestsCookieJar[<Cookie PHPSESSID=kjlb0a33jm65o1sjh25ahb23j4 for .website.co.uk/>]>
Why does this code fail to save 'amember_nr' and is there any way to retrieve it?
SOLUTION:
It appears the only way I can get this code to work properly is by using Selenium, selecting the elements on the page, and automating the typing/clicking. The following code produces the desired result.
from seleniumrequests import Chrome

driver = Chrome()
driver.get('http://www.website.co.uk')
username = driver.find_element_by_xpath("//input[@name='amember_login']")
password = driver.find_element_by_xpath("//input[@name='amember_pass']")
username.send_keys("username")
password.send_keys("password")
driver.find_element_by_xpath("//input[@type='submit']").click()  # page is logged in and all relevant cookies saved
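The reason for using seleniumrequests rather than plain Selenium here is that it extends the webdriver with a Requests-style .request() method that reuses the browser's cookies, so follow-up requests stay logged in. A minimal sketch of that, where the /members path is a hypothetical placeholder:

# after the click() above, the browser session is authenticated
response = driver.request('GET', 'http://www.website.co.uk/members')  # hypothetical page
print(response.text)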
You can try this:
with requests.Session() as s:
    s.get('https://www.website.co.uk/login')
    r = s.post('https://www.website.co.uk/login', data={
        'amember_login': 'username',
        'amember_password': 'password'
    })
The initial GET request will set the required cookies before the login POST.
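If you want to confirm the cookies were captured, a quick check of the session's cookie jar (inside the with block) should now show more than just PHPSESSID:

    print(s.cookies.get_dict())  # e.g. {'PHPSESSID': '...', 'amember_nr': '...'}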
FYI, I would use something like Burp Suite to capture ALL the data being sent to the server and sort out which headers etc. are required. Sometimes servers do referrer checking, set cookies via JavaScript, or use wonky scripting; I've even seen JavaScript obfuscation and blocking of user-agent strings not in a whitelist. It's most likely something in the headers that the server expects but isn't getting, so it won't give you the cookie.
Also, you can have Python use Burp as a proxy so you can see exactly what gets sent to the server and the response.
https://github.com/freeload101/Python/blob/master/CS_HIDE/CS_HIDE.py (proxy support)
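As a rough sketch, pointing requests at Burp's default listener looks something like this (the login URL is a placeholder; verify=False lets Burp's self-signed certificate intercept HTTPS, or install Burp's CA certificate instead):

import requests

# Burp's proxy listener defaults to 127.0.0.1:8080
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}
session = requests.Session()
session.proxies.update(proxies)
response = session.get("https://www.website.co.uk/login", verify=False)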
I need to parse a site that is protected by an ADFS service, and I am struggling with authentication to it.
Is there any option to get in?
From what I can see, most solutions are for backend applications or for "system users" (with app_id, app_secret). In my case, I can't use those; I only have a login and password.
Example of the problem:
In Chrome I open www.example.com and it redirects me to https://login.microsoftonline.com/ and then to https://federation-sts.example.com/adfs/ls/?blabla with a login and password form.
How do I get access to it with Python 3?
ADFS uses complicated redirection and CSRF protection techniques. Thus, it is better to use a browser automation tool to perform the authentication and parse the webpage afterwards. I recommend the Selenium toolkit with Python bindings. Here is a working example:
import time
from selenium import webdriver

def MS_login(usrname, passwd):  # call this with username and password
    driver = webdriver.Edge()   # change to your browser (Firefox, Chrome, ... are supported)
    driver.delete_all_cookies() # clean up prior login sessions
    driver.get('https://login.microsoftonline.com/')  # change the url to your website
    time.sleep(5)  # wait for redirection and rendering
    driver.find_element_by_xpath("//input[@name='loginfmt']").send_keys(usrname)
    driver.find_element_by_xpath("//input[@type='submit']").click()
    time.sleep(5)
    driver.find_element_by_xpath("//input[@name='passwd']").send_keys(passwd)
    driver.find_element_by_xpath("//input[@name='KMSI' and @type='checkbox']").click()
    driver.find_element_by_xpath("//input[@type='submit']").click()
    time.sleep(5)
    driver.find_element_by_xpath("//input[@type='submit']").click()
    # successfully logged in; parse the site, then let the caller
    # close the browser with driver.close() when done
    return driver
This script calls Microsoft Edge to open the website. It injects the username and password into the correct DOM elements and then lets the browser handle the rest. It has been tested on the webpage "https://login.microsoftonline.com". You may need to modify it to suit your website.
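Hypothetical usage of the function above, assuming the account details are valid (close the browser yourself once you are done parsing):

driver = MS_login("user@example.com", "password")
html = driver.page_source
# ... parse html ...
driver.close()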
To answer your question "how to get in with Python": I am assuming you want to perform some web scraping on pages secured by Azure AD authentication.
In this kind of scenario, you have to perform the following steps.
Step 1:
Log in
For this script we will only need to import the following:
import requests
from lxml import html
First, we would like to create our session object. This object will allow us to persist the login session across all our requests.
session_requests = requests.session()
Second, we would like to extract the CSRF token from the web page; this token is used during login. For this example we are using lxml and XPath; we could have used regular expressions or any other method to extract this data.
login_url = "https://bitbucket.org/account/signin/?next=/"
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]
Next, we would like to perform the login phase. In this phase, we send a POST request to the login URL, using the payload (shown below) as the data. We also set a referer header on the request, pointing at the same URL.
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)
The payload is a dictionary of the user name, password, etc.:
payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": "<CSRF_TOKEN>"
}
Note: this is just an example.
Step 2:
Scrape content
Now that we have successfully logged in, we will perform the actual scraping:
url = 'https://bitbucket.org/dashboard/overview'
result = session_requests.get(
    url,
    headers=dict(referer=url)
)
In other words, you need to get the request payload details from Azure AD, create a session object, log in with it, and then finally do the scraping.
The above is a very good example of web scraping a secured website.
Hope it helps.
I'm trying to get contest data from the url: "https://www.draftkings.com/contest/gamecenter/32947401"
If you go to this URL and aren't logged in, it'll just re-direct you to the lobby. If you're logged in, it'll actually show you the contest results.
Here are some things I tried:
- First, I used Chrome's dev networking tools to watch requests while I manually logged in.
- I then tried copying the cookie that I thought contained the authentication info; it was of the form:
'ajs_anonymous_id=%123123123123123, mlc=true; optimizelyEndUserId'
- I then stored that cookie as an environment variable and ran this code:
HEADERS = {'cookie': os.environ['MY_COOKIE']}
requests.get(draft_kings_url, headers=HEADERS)
No luck; this just gave me the lobby.
I then tried requests' built-in:
HTTPBasicAuth
HTTPDigestAuth
No luck here either.
I'm no Python expert by far, and I've pretty much exhausted what I know and the search results I've found. Any ideas?
The tool that you want is Selenium. Something along the lines of:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.draftkings.com/contest/gamecenter/32947401")

username = browser.find_element_by_id("user")
username.send_keys("username")
password = browser.find_element_by_id("password")
password.send_keys("top_secret")
login = browser.find_element_by_name("login")
login.click()
Use Fiddler to see the exact request the browser makes when you try to log in. Then use the Session class in the requests package:
import requests
session = requests.Session()
session.get('YOUR_URL_LOGIN_PAGE')
This will save all the cookies from your URL in your session variable (like when you use a browser).
Then make a POST request to the login URL with the appropriate data.
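A minimal sketch of that POST, where the field names and URL are placeholders you should take from the request captured in Fiddler:

login_data = {'username': 'your_username', 'password': 'your_password'}
response = session.post('YOUR_URL_LOGIN_PAGE', data=login_data)
# the session now carries any cookies the server set during login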
You don't have to pass cookie data manually, as it is generated automatically when you first visit a website. However, you can set some headers explicitly, like User-Agent etc., by:
session.headers.update({'header_name':'header_value'})
HTTPBasicAuth and HTTPDigestAuth may not work, depending on the website.
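For reference, this is how those classes are wired in; they only help when the server itself issues an HTTP Basic/Digest challenge, not for form/cookie-based logins, which is likely why they didn't work here:

import requests
from requests.auth import HTTPBasicAuth

response = requests.get('https://www.draftkings.com/contest/gamecenter/32947401',
                        auth=HTTPBasicAuth('user', 'pass'))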
I'm using Python to scrape my school's webpage, but in order to do that I needed to simulate a user login first. Here is my code:
import requests, lxml.html

s = requests.session()
url = "https://my.emich.edu"
login = s.get(url)
login_html = lxml.html.fromstring(login.text)
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
form["username"] = "myusername"
form["password"] = "mypassword"
form["submit"] = "LOGIN"
response = s.post("https://netid.emich.edu/cas/login?service=https%3A%2F%2Fmy.emich.edu%2Fc%2Fportal%2Flogin", form)
response = s.get("http://my.emich.edu")
f = open("result.html", "w")
f.write(response.text)
print(response.text)
I am expecting that response.text will give me my own student account page; instead, it gives me a login page. Can anyone help me with this issue?
BTW, this is not homework.
There are a few options here, and I think your requests approach can be made much easier by logging in manually and copying over the headers.
Use a Python scripting package like http://wwwsearch.sourceforge.net/mechanize/ to scrape the site.
Use a browser emulator such as http://casperjs.org/. Using this you can basically do anything you'd be able to do in a browser.
My suggestion here would be to go to the website, log in, and then open the developer console and copy those headers/cookies into your requests headers/cookies. This way you can just hardcode the 'already-authenticated request' and it will work fine. Note that this method is the least reliable for doing robust, everyday scraping, but if you're looking for something that will be the quickest to implement and will work until the authentication runs out, use this method.
Also, you need to request the logged-in homepage again after you successfully do the POST.
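A minimal sketch of the hardcoded "already-authenticated request" approach described above; the cookie names and values below are hypothetical placeholders you would copy from the developer console after logging in:

import requests

headers = {
    "User-Agent": "Mozilla/5.0",  # match your real browser's agent string
    "Cookie": "JSESSIONID=abc123; CASTGC=TGT-xyz",  # paste your real cookies
}
response = requests.get("https://my.emich.edu", headers=headers)
print(response.text)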
I'm trying to log in and scrape a job site and send myself a notification whenever certain keywords are found. I think I have correctly traced the XPath for the value of the field "login[iovation]", but I cannot extract the value. Here is what I have done so far to log in:
import requests
from lxml import html

header = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
login_url = 'https://www.upwork.com/ab/account-security/login'
session_requests = requests.session()

# get csrf
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
auth_token = list(set(tree.xpath('//*[@name="login[_token]"]/@value')))
auth_iovat = list(set(tree.xpath('//*[@name="login[iovation]"]/@value')))

# create payload
payload = {
    "login[username]": "myemail@gmail.com",
    "login[password]": "pa$$w0rD",
    "login[_token]": auth_token,
    "login[iovation]": auth_iovat,
    "login[redir]": "/home"
}

# perform login
scrapeurl = 'https://www.upwork.com/ab/find-work/'
result = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))

# test the result
print(result.text)
This is a screenshot of the form data when I log in successfully:
This is because Upwork uses something called iovation (https://www.iovation.com/) to reduce fraud. iovation uses a digital fingerprint of your device/browser, which is sent via the login[iovation] parameter.
If you look at the JavaScript loaded on the site, you will find two scripts being loaded from the iesnare.com domain. This domain, and many others, are owned by iovation to drop third-party JavaScript that identifies your device/browser.
I think if you copy the string from a successful login and send it over along with all the HTTP headers as-is, including the browser user agent, in your Python code, you should be okay.
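As a rough sketch building on the question's code, with the copied value and headers as placeholders to fill in from the successful browser login:

payload["login[iovation]"] = "<string copied from the successful login>"
headers = {
    "User-Agent": "Mozilla/5.0",  # match the browser you logged in with
    "Referer": login_url,
}
result = session_requests.post(login_url, data=payload, headers=headers)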
Are you sure that result is returning a 2XX status code?
When I run result = session_requests.get(login_url), it returns a 403 status code, which means I am not even getting to login_url itself.
They have an official API now; there is no need for scraping, just register for API keys.