Get site URL after redirect with requests - Python

I was wondering if I can get the current URL after a redirect from the starting page when using requests.
For example:
I send a request to "google.com", which instantly redirects me to "google.com/page-123456"; the page number changes every time. Can I get the "google.com/page-123456" URL in my script?
With Selenium it can be done like this:
from selenium import webdriver
import time
driver = (...)  # some webdriver instance, e.g. webdriver.Chrome()
driver.get('https://google.com')
time.sleep(2)
url = driver.current_url
Can this be done with requests / BeautifulSoup? How?
Thanks

Try the url property of the Request object, which you can access via response.request:
import requests
response = requests.get("https://google.com")
url = response.request.url
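Note that requests also exposes this directly: response.url is the final URL after all redirects, and response.history lists the intermediate redirect responses. A minimal illustration:
import requests
response = requests.get("https://google.com")
print(response.url)  # final URL after all redirects
for hop in response.history:
    print(hop.status_code, hop.url)  # each intermediate redirect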

Related

web scraping a dynamic table with authentication

I am new to Python and web scraping, and I'm trying to scrape a website that uses JavaScript. I have managed to automate the login sequence via Selenium, but when I try to send the API call to get the data, I am not able to get anything. I'm assuming it's because the API call requires some sort of authentication. How can I get past this?
Here's my code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
import pandas as pd
import requests
import json
username = 'xxx'
password = 'xxx'
url = 'https://www.example.com/login'
#log in
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
driver.find_element(By.XPATH, '//*[@id="username"]').send_keys(username)
driver.find_element(By.XPATH, '//*[@id="password"]').send_keys(password)
driver.find_element(By.XPATH, '//*[@id="login_button"]').click()
# go to User Lines
driver.get('http://www.example.com/lines')
time.sleep(5)
response = requests.request("GET", url, headers=headers, data=payload)  # headers and payload defined elsewhere (omitted)
subs = json.loads(response.text)
print(subs)
Every time an HTTP request is made, some metadata is included: the header data, cookies, and possibly other session data. It has to be sent every time because that's the only way to maintain a 'session'.
If you log in with Selenium, the browser manages your session there. Making a request with the Python requests library has nothing to do with Selenium, and the authentication you're missing is most likely exactly what logging in through Selenium provides.
So you have a few options:
1. Make the API call using Selenium. After logging in, just get() the API URL; the page source should be the data (for a JSON response, typically wrapped in a tag such as <pre>).
2. Log in using the requests library. Instead of using Selenium, you can use requests exclusively. This can be tedious: you'll have to inspect the network calls in the devtools and piece together what you need to replicate with requests to simulate the login that happens in the browser. You'll also want a persistent session via requests.Session(); use that session instance to make the requests instead of calling the requests module directly. Once you do, you can make the API request as you were. This method also has the fastest runtime, since you're not rendering a whole browser, running the JavaScript within it, and making all the network requests that entails.
3. Pass the session data from Selenium to your requests session instance. I haven't tried this, but since session data is just strings passed along in the headers, you can probably get the cookies from Selenium and add them to a requests session instance, then make your API call without Selenium, as sketched below.
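A minimal sketch of option 3, assuming a driver that is already logged in; the API endpoint is hypothetical, so replace it with the real URL from the devtools network tab:
import requests
session = requests.Session()
# copy Selenium's cookies into the requests session
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
# hypothetical endpoint; take the real URL from the devtools network tab
response = session.get('https://www.example.com/api/lines')
print(response.json())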

Get cookies from selenium to requests

I can log in to a website with Selenium and retrieve all the cookies.
But then I have to submit a request to the site quickly, and Selenium is too slow for that.
That's why I want to grab the cookies with Selenium and send the requests via the requests module.
My Selenium code (first I log in to the website and retrieve all the cookies with the code below):
browser.get('https://www.example.com/login')
cookiem1 = browser.get_cookies()
print(cookiem1)
In the second stage, I go to another page of the website and make a request:
import requests
s = requests.Session()
for cookie in cookiem1:
    s.cookies.set(cookie['name'], cookie['value'])
r = s.get("https://example.com/postcomment")
print(r.content)
I pass the cookies this way, but when I send the request via the requests module, the site does not authorize my user.
My error:
"errorMessage": "Unauthorized user",\r\n "errorDetails": "No cookie"
Apparently the site doesn't recognize my session with this code.
Thanks in advance
try this
import requests
ck = browser.get_cookies()
s = requests.Session()
for c in ck:
    s.cookies.set(c['name'], c['value'])
response = s.get("https://example.com/postcomment")
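If the cookies alone still come back unauthorized, one possible cause (an assumption here, not something the question confirms) is that the server also checks headers such as User-Agent; you can copy the browser's value onto the session:
# read the real User-Agent out of the Selenium-driven browser
ua = browser.execute_script("return navigator.userAgent")
s.headers.update({"User-Agent": ua})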

How to get the response status code with selenium?

As a newbie, I wonder whether there is a method to get the HTTP response status code, so I can detect cases like the remote server being down, a broken URL, a URL redirect, etc.
In Selenium it's Not Possible!
You can accomplish it with requests:
import requests
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("url")
r = requests.get("url")
print(r.status_code)
Update:
It actually is possible using the Chrome DevTools Protocol with event listeners.
See the example script at https://stackoverflow.com/a/75067388/20443541
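A sketch of that idea using Chrome's performance log (this assumes Chrome with Selenium 4; the linked answer uses CDP event listeners, while this variant parses the log after the page load):
import json
from selenium import webdriver
options = webdriver.ChromeOptions()
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
# each performance-log entry wraps a CDP event as a JSON string
for entry in driver.get_log("performance"):
    event = json.loads(entry["message"])["message"]
    if event["method"] == "Network.responseReceived":
        response = event["params"]["response"]
        print(response["status"], response["url"])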

python requests returns a different web page from browser or urllib

I use requests to scrape a webpage for some content.
When I use
import requests
requests.get('https://example.org')
I get a different page from the one I get in my browser or when using
import urllib.request
urllib.request.urlopen('https://example.org')
I tried using urllib instead, but it was really slow.
In a comparison test I did, it was 50% slower than requests!
How do you solve this?
After a lot of investigation, I found that the site sends a cookie in the response headers on the first visit only.
So the solution is to get the cookies with a HEAD request, then resend them with your GET request:
import requests
# get the cookies with head(); this doesn't fetch the body, so it's fast
cookies = requests.head('https://example.com').cookies
# send the get request with the cookies
result = requests.get('https://example.com', cookies=cookies)
Now it's faster than urllib, with the same result :)
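An equivalent way to write this is with requests.Session(), which stores cookies from earlier responses and sends them on later requests automatically:
import requests
s = requests.Session()
s.head('https://example.com')  # the session stores the cookie from this response
result = s.get('https://example.com')  # and sends it back automatically here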

Crawl data from a website using python

I would like to crawl some data from a website. To manually access the target data, I need to log in and then click some buttons to finally get the target HTML page. Currently, I am using the Python requests library to simulate this process, like this:
ss = requests.session()
#log in
resp = ss.post(url, data = (('username', 'xxx'), ('password', 'xxx')))
#then send requests to the target url
result = ss.get(target_url)
However, I found that the final request did not return what I wanted.
So I changed my approach. I captured all the network traffic and looked into the headers and cookies of the last request. I found that some contents differ in each login session, like the session ID and some other variables. So I traced back where these variables are returned in the responses, and got the values again by sending the corresponding requests. After this, I constructed the correct headers and cookies and sent the request like this:
resp = ss.get(target_url, headers = myheader, cookies = mycookie)
But still, it does not return anything. Can anyone help?
I was in the same boat some time ago, and I eventually switched from trying to get requests to work to using Selenium instead, which made life much easier (pip install selenium). You can then log in to a website and navigate to the desired page like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
website_with_logins = "https://website.com"
website_to_access_after_login = "https://website.com/page"
driver = webdriver.Chrome()  # any webdriver works here
driver.get(website_with_logins)
username = driver.find_element(By.NAME, "username")
username.send_keys("your_username")
password = driver.find_element(By.NAME, "password")
password.send_keys("your_password")
password.send_keys(Keys.RETURN)
driver.get(website_to_access_after_login)
Once you have website_to_access_after_login loaded (you'll see it appear), you can get the HTML and have a field day using just
html = driver.page_source
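From there you can, for example, hand the HTML to BeautifulSoup; the table selector below is hypothetical, so inspect the real page to pick the right one:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# hypothetical selector; adjust to the structure of the real page
for row in soup.select("table tr"):
    print([cell.get_text(strip=True) for cell in row.find_all("td")])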
Hope this helps.
