web scraping a dynamic table with authentication - python

I am new to Python and web scraping, and I'm trying to scrape a website that uses JavaScript. I have managed to automate the login sequence via Selenium, but when I try to send the API call to get the data, I don't get anything back. I'm assuming it's because the API call requires some sort of authentication. How can I get past this?
Here's my code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
import pandas as pd
import requests
import json
username = 'xxx'
password = 'xxx'
url = 'https://www.example.com/login'
#log in
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
driver.find_element(By.XPATH, '//*[@id="username"]').send_keys(username)
driver.find_element(By.XPATH, '//*[@id="password"]').send_keys(password)
driver.find_element(By.XPATH, '//*[@id="login_button"]').click()
# go to User Lines
driver.get('http://www.example.com/lines')
time.sleep(5)
response = requests.request("GET", url, headers=headers, data=payload)  # headers and payload are never defined above, and url still points at the login page
subs = json.loads(response.text)
print(subs)

Every time an HTTP request is made, some metadata is included: the headers, cookies, and possibly other session data. It has to be sent every time, because that's the only way to maintain a 'session'.
If you log in with Selenium, the browser is managing your session there. Making a request with the Python requests library has nothing to do with Selenium, and most likely the authentication you're missing is exactly what logging in through Selenium provides.
So you have a few options:
1. Make the API call using Selenium. After logging in, just get() the API URL; the page source should be the data, usually wrapped in a <pre> tag.
2. Log in using the requests library. Instead of using Selenium, you can use requests exclusively. This can be tedious: you'll have to inspect the network calls in the devtools and piece together what you need to replicate in requests to simulate the login that happens in the browser. You would also need a persistent session: requests.Session() creates a session instance, and you make your requests through that object instead of through the requests module directly. Once that works, you can make the API request as you were. This method also has the fastest runtime, since you're not rendering a whole browser, running its JavaScript, and making all the network requests that entails.
3. Pass the session data from Selenium to your requests session instance. I haven't tried this, but since session data is just strings passed along in the headers, you can probably get the cookies out of Selenium and add them to your requests session to make the API call without Selenium; see the sketch below.
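A minimal sketch of option 3, assuming the login code from the question has already run; the API endpoint here is a placeholder, take the real one from the devtools Network tab:
import requests
session = requests.Session()
# copy every cookie held by the logged-in browser into the requests session
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'], domain=cookie.get('domain'))
# hypothetical endpoint; replace with the API URL observed in devtools
api_response = session.get('https://www.example.com/api/lines')
print(api_response.json())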

Related

Get site url after redirect with requests

I was wondering if I can get the current URL after a redirect from the starting page, using requests.
For example:
I send the request to "google.com", which instantly redirects me to "google.com/page-123456"; the page number changes every time. Can I get the "google.com/page-123456" in my script?
With selenium it can be made like this:
from selenium import webdriver
import time
driver = (...)  # webdriver setup omitted
driver.get('https://google.com')
time.sleep(2)
url = driver.current_url
Can this be done with requests / BeautifulSoup? How?
Thanks
Try the url property of the request object, which you can access via response.request; since requests follows redirects by default, response.url gives the same final URL:
import requests
response = requests.get("https://google.com")
url = response.request.url
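If you also want the intermediate hops, the redirect chain is kept on the response object:
import requests
response = requests.get("https://google.com")
print(response.url)  # final URL after all redirects
for hop in response.history:  # each intermediate 30x response
    print(hop.status_code, hop.url)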

How does Cloudflare differentiate Selenium and Requests traffic?

Context
I am currently attempting to build a small-scale bot using the Selenium and Requests modules in Python.
However, the webpage I want to interact with is running behind Cloudflare.
My Python script runs over Tor using the stem module.
My traffic analysis is based on Firefox's Developer Tools -> Network panel with "Persist Logs" enabled.
My findings so far:
Selenium's Firefox webdriver can often access the webpage without going through "checking browser page" (return code 503) and "captcha page" (return code 403).
Requests session object with the same user agent always results in "captcha page" (return code 403).
If Cloudflare were checking my JavaScript functionality, shouldn't my requests call also return 503?
Code Example
driver = webdriver.Firefox(firefox_profile=fp, options=fOptions)
driver.get("https://www.cloudflare.com") # usually returns code 200 without verifying the browser
session = requests.Session()
# ... applied socks5 proxy for both http and https ... #
session.headers.update({"user-agent": driver.execute_script("return navigator.userAgent;")})
page = session.get("https://www.cloudflare.com")
print(page.status_code) # return code 403
print(page.text) # returns "captcha page"
Both Selenium and Requests are using the same user agent and IP.
Both are using GET without any parameters.
How does Cloudflare distinguish between these two kinds of traffic?
Am I missing something?
I tried to transfer cookies from the webdriver to the requests session to see if a bypass is possible, but had no luck.
Here is the code I used:
for c in driver.get_cookies():
    session.cookies.set(c['name'], c['value'], domain=c['domain'])
There are additional JavaScript APIs exposed to the webpage when using Selenium, most notably navigator.webdriver. If you can disable them, you may be able to fix the problem.
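For example, with a Firefox profile you can try switching the webdriver flag off. This is a commonly cited tweak, not a guarantee; whether it still works depends on the Firefox version:
from selenium import webdriver
fp = webdriver.FirefoxProfile()
# hide navigator.webdriver, the most obvious automation giveaway
# (effective only on some Firefox versions)
fp.set_preference("dom.webdriver.enabled", False)
fp.set_preference("useAutomationExtension", False)
fp.update_preferences()
driver = webdriver.Firefox(firefox_profile=fp)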
Cloudflare doesn't only check HTTP headers or JavaScript; it also analyses the TLS handshake, effectively fingerprinting the client (along the lines of JA3). I'm not sure exactly how it does it, but I've found that it can be circumvented by using NSS instead of OpenSSL (though NSS isn't well integrated with Requests).
The captcha response depends on the browser fingerprint; it's not just about sending cookies and a User-Agent.
Copy all the headers from the Network tab in the developer console and send all the key-value pairs as headers with the requests library.
In principle, this should work.
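A minimal sketch of that idea; every header value below is a placeholder to be replaced with the ones copied from your own browser:
import requests
headers = {
    # placeholders: copy the real values from the devtools Network tab
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}
page = requests.get("https://www.cloudflare.com", headers=headers)
print(page.status_code)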

HTTP GET and POST requests from python without using the requests module

I would like to access a resource at a particular URL. Let's say I only have access to a PC (without admin rights) on which I cannot use the requests module, for various reasons.
Normally, I would address an API and perform HTTP GET and HTTP POST requests with:
import requests
url = r"https://httpbin.org/json"
r = requests.get(url)
If I wanted to provide header and authorisation details, I would add
headers = {"Content-Type": "application/json"}
auth = ("username", "password")
r = requests.post(url, auth=auth, headers=headers)
as well as the payload in the data exchange format of the API (either JSON or XML).
Unfortunately, I cannot use the requests module on the aforementioned system. However, I can use the selenium module with the Internet Explorer webdriver (no Firefox and no Chrome).
I tried to access the URL of the API with
from selenium import webdriver
driver = webdriver.Ie()
driver.get(url)
This opens an authentication popup, which I cannot access with Selenium's switch_to functions. Ideally, I would like to perform an HTTP POST via Selenium and provide authentication as well as header information. Would that be possible?
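For what it's worth, the standard library's urllib.request can perform the same GET and POST with basic auth, with no third-party install; a minimal sketch against httpbin:
import json
import urllib.request
# build an opener that answers HTTP basic-auth challenges
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "https://httpbin.org/", "username", "password")
opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(password_mgr))
# GET against an endpoint that challenges with basic auth
with opener.open("https://httpbin.org/basic-auth/username/password") as r:
    print(r.status)
# POST with a JSON payload (Request defaults to POST when data is given)
data = json.dumps({"key": "value"}).encode("utf-8")
req = urllib.request.Request("https://httpbin.org/post", data=data,
                             headers={"Content-Type": "application/json"})
with opener.open(req) as r:
    print(json.loads(r.read())["json"])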

How to get Authorization token from a webpage using python requests

How do I get an Authorization token from a webpage using Python requests? I used requests' basic auth to log in, and that worked, but the subsequent pages are not accepting the basic auth; they return "Authuser is not validated".
There is a login URL where I successfully logged in using requests' basic auth, but the succeeding pages didn't accept the basic auth credentials; they needed an Authorization header. After looking in the browser's inspect tool, I found that this Authorization header's value is generated as part of the session's local storage. Is there any way to get this value without using the webdriver API?
Sounds like what you need is a requests persistent session:
import requests
s = requests.Session()
# then simply make the request like you already are
r = s.get(r'https://stackoverflow.com/')
# the cookies are persisted
s.cookies.get_dict()
> {'prov': ......}
I can't really get more specific without more info about the site you're using.
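If the token is minted by the login call itself, one option is to pull it out of the login response and promote it to a header; a hypothetical sketch (the endpoint, field names, and the 'token' key are all assumptions about your site):
# hypothetical login endpoint that returns the token in its JSON body
login = s.post('https://example.com/api/login', json={'user': 'u', 'password': 'p'})
token = login.json().get('token')  # the key name is an assumption
# reuse it as a bearer token on every subsequent request in the session
s.headers.update({'Authorization': f'Bearer {token}'})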

Using Python to request draftkings.com info that requires login?

I'm trying to get contest data from the url: "https://www.draftkings.com/contest/gamecenter/32947401"
If you go to this URL and aren't logged in, it'll just redirect you to the lobby. If you're logged in, it'll actually show you the contest results.
Here's some things I tried:
-First, I used Chrome's Dev networking tools to watch requests while I manually logged in
-I then tried copying the cookie that I thought contained the authentication info, it was of the form:
'ajs_anonymous_id=%123123123123123, mlc=true; optimizelyEndUserId'
-I then stored that cookie as an Evironment variable and ran this code:
HEADERS = {'cookie': os.environ['MY_COOKIE']}
requests.get(draft_kings_url, headers=HEADERS)
No luck; this just gave me the lobby.
I then tried requests' built-in:
HTTPBasicAuth
HTTPDigestAuth
No luck here either.
I'm no Python expert by far, and I've pretty much exhausted what I know and the search results I've found. Any ideas?
The tool you want is Selenium. Something along the lines of:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(r"https://www.draftkings.com/contest/gamecenter/32947401" )
username = browser.find_element_by_id("user")
username.send_keys("username")
password = browser.find_element_by_id("password")
password.send_keys("top_secret")
login = browser.find_element_by_name("login")
login.click()
Use Fiddler to see the exact request being made when you try to log in. Then use the Session class from the requests package.
import requests
session = requests.Session()
session.get('YOUR_URL_LOGIN_PAGE')
This will save all the cookies from the URL in your session variable (like when you use a browser).
Then make a POST request to the login URL with the appropriate data.
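A hypothetical sketch; the endpoint and form-field names below are guesses, so copy the real ones from Fiddler:
# endpoint and field names are assumptions, not DraftKings' real ones
payload = {'username': 'your_username', 'password': 'your_password'}
session.post('https://www.draftkings.com/account/login', data=payload)
# with the auth cookies now stored on the session, the contest page should load
r = session.get('https://www.draftkings.com/contest/gamecenter/32947401')
print(r.status_code)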
You don't have to pass cookie data manually, as it's generated automatically when you first visit a website. However, you can set some headers explicitly, like the User-Agent, with:
session.headers.update({'header_name': 'header_value'})
HTTPBasicAuth and HTTPDigestAuth might not work, depending on the website.
