Using Mechanicalsoup to navigate multiple pages / forms - python

I've had success using mechanicalsoup with single pages / single forms, but am having difficulty with a multistep problem. The pages I am attempting to navigate start here: https://webapps2.ncua.gov/CustomQuery/CUSelect.aspx
I get through the first page/form, but then I am not sure how to deal with the second page/form. The third page includes the result that I wish to scrape.
import requests
import urllib.parse
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://webapps2.ncua.gov/CustomQuery/CUSelect.aspx")
form=browser.select_form()
browser["operand0"] = "State"
browser["operator0"] = "Not Equal"
browser["value0"] = "XX"
response = browser.submit_selected()
form2 = browser.get_current_form()
submit = browser.get_current_page().find('input', id='BtnAllAcct')
form2.choose_submit(submit)
browser.submit_selected()
submit = browser.get_current_page().find('input', id='Btndata1')
form2.choose_submit(submit)
browser.submit_selected()
Any ideas? This is my second attempt after first trying to interact with the API, but two separate forms is stumping me on that as well.

I was solving a similar issue, following the advice in MechanicalSoup follow a link without inside a button I switched to selenium

Related

Get redirected ULR with Python to get access code

This question has been asked several times, and didn't find any answer that works for me. I am using request library to get the redirect url, however my code returns the original url. If I click on the link it takes few second before I get the redirect url and then manually extract the code, but I need to get this information by python.
Here is my code. I have tried response.history but it returns empty list.
import requests
response = requests.get("https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize?client_id={client_id}&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%2Fmyapp%2F&response_mode=query&scope=user.read%20chat.read&state=12345")
print(response)
print('-------------------')
print(response.url)
I am trying to get the code by following this Microsoft documention "https://learn.microsoft.com/en-us/graph/auth-v2-user".
Here are the links that I found in stack over flow and didn't solve my issue.
To get redirected URL with requests , How to get redirect url code with Python? ( this is probably very close to my situation), how to get redirect url using python requests and this one Python Requests library redirect new url
I didn't have any luck to get redirected url back by using requests as mentioned in previous posts. But I was able to work around this using webbrowser library and then get the browser history using sqlite 3 and was able to get the result that I was looking for.
I had to go through postman and add postman url into my app registration for using Graph API, but if you simply want to get redirected url you can follow the same code and you should get redirected url.
let me know if there are better solutions.
here is my code:
import webbrowser
import sqlite3
import pandas as pd
import shutil
webbrowser.open("https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize?client_id={client_id}&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%2Fmyapp%2F&response_mode=query&scope=user.read%20chat.read&state=12345")
#source file is where the history of your webbroser is saved, I was using chrome, but it should be the same process if you are using different browser
source_file = 'C:\\Users\\{your_user_id}\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\History'
# could not directly connect to history file as it was locked and had to make a copy of it in different location
destination_file = 'C:\\Users\\{user}\\Downloads\\History'
time.sleep(30) # there is some delay to update the history file, so 30 sec wait give it enough time to make sure your last url get logged
shutil.copy(source_file,destination_file) # copying the file.
con = sqlite3.connect('C:\\Users\\{user}\\Downloads\\History')#connecting to browser history
cursor = con.execute("SELECT * FROM urls")
names = [description[0] for description in cursor.description]
urls = cursor.fetchall()
con.close()
df_history = pd.DataFrame(urls,columns=names)
last_url = df_history.loc[len(df_history)-1,'url']
print(last_url)
>>https://oauth.pstmn.io/v1/browser-callback?code={code}&state=12345&session_state={session_state}#

Python webscraping page which requires login

I am trying to automate a web data gathering process using Python. In my case, I need to pull the information from https://app.ixml.com.br/documentos/nfe page. However, before you go to this page, you need to log in at https://app.ixml.com/login. The code below should theoretically log into the site:
import re
from robobrowser import RoboBrowser
username = 'email'
password = 'password'
br = RoboBrowser()
br.open('https://app.ixml.com.br/login')
form = br.get_form()
form['email'] = username
form['senha'] = password
br.submit_form(form)
src = str(br.parsed())
However, by printing the src variable, I get the source code from the https://app.ixml.com.br/login page, ie before logging in. If I add the following lines at the end of the previous code
br.open('https://app.ixml.com.br/documentos/nfe')
src2 = str(br.parsed())
The src2 variable contains the source code of the page https://app.ixml.com.br/. I tried some variations, such as creating a new br object, but got the same result. How can I access the information at https://app.ixml.com.br/documentos/nfe?
If it is ok to have a webpage opening you can try to solve this using selenium. This package makes it possible to create a program that reacts just like a user would.
The following code would have you login:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://app.ixml.com.br/login")
browser.find_element_by_id("email").send_keys("abc#mail")
browser.find_element_by_id("senha").send_keys("abc")
browser.find_element_by_css_selector("button").click()

Scraping a secure website requiring clicks on javascript links

I have a daily task at work to download some files from internal company website. The site requires a login. But the main url is something like:
https://abcd.com
But when I open that in the browser, it redirects to something like:
https://abcdGW/ln-eng.aspx?lang=eng&lnid=e69d5d-xxx-xxx-1111cef&regl=en-US
My task normally is to open this site, login, click some links back and forth and download some files. This takes me 10 minutes everyday. But I wanna automate this using python. Using my basic knowledge I have written below code:
import urllib3
from bs4 import BeautifulSoup
import requests
import http
url = "https://abcd.com"
redirectURL = requests.get(url).url
jar = http.cookiejar.CookieJar(policy=None)
http = urllib3.PoolManager()
acc_pwd = {'datasouce': 'Data1', 'user':'xxxx', 'password':'xxxx'}
response = http.request('GET', redirectURL)
soup = BeautifulSoup(response.data)
r = requests.get(redirectURL, cookies=jar)
r = requests.post(redirectURL, cookies=jar, data=acc_pwd)
print ("RData %s" % r.text)
This shows that I am able to successfully login. The next step is something where i am stuck. On the page after login I have some links on left side, one of those I need to click. When I inspect them in Chrome, I see them as:
href="javascript:__doPostBack('myAppControl$menu_itm_proj11','')"><div class="menu-cell">
<img class="menu-image" src="images/LiteMenu/projects.png" style="border-width:0px;"><span class="menu-text">Projects</span> </div></a>
This is probably a javascript link. I need to click this, and then on new page another link, then another to download a file and back to the main page and do this all over again to download different files.
I would be grateful to anyone who can help or suggest.
Thanks to chris, I was able to complete this..
First using the request library I got the redirect url as:
redirectURL = requests.get(url).url
After that I use scrapy and selenium for click links and downloading files..
By adding selenium to the browser as add-in/plugin, it was quite simple.

Get current URL using Python and webbot

Is there a way to get the current url using webbot?
I looked at ways using os.environ, but it didn't work for me.
I need the url of the page I am opening using webbot.Browser(), because some links are redirected and change.
you can use get_current_url().
Consider the following example:
from webbot import Browser
web = Browser()
url = 'something.com' # site name
web.go_to(url)
... #your code
webbot_result_url = web.get_current_url()
print(webbot_result_url )

Select and submit form with python requests library

I am trying to scrape data from this website. To access the tables, I need to click the "Search" button. I was able to successfully do this using mechanize:
br = mechanize.Browser()
br.open(url + 'Wildnew_Online_Status_New.aspx')
br.select_form(name='aspnetForm')
page = br.submit(id='ctl00_ContentPlaceHolder1_Button1')
"page" gives me the resulting webpage with the table, as needed. However, I'd like to iterate through the links to subsequent pages at the bottom, and this triggers javascript. I've heard mechanize does not support this, so I need a new strategy.
I believe I can get to subsequent pages using a post request from the requests library. However, I am not able to click "search" on the main page to get to the initial table. In other words, I want to replicate the above code using requests. I tried
s = requests.Session()
form_data = {'name': 'aspnetForm', 'id': 'ctl00_ContentPlaceHolder1_Button1'}
r = s.post('http://forestsclearance.nic.in/Wildnew_Online_Status_New.aspx', data=form_data)
Not sure why, but this returns the main page again (without clicking Search). Any help appreciated.
I think you should look into scrapy
you forgot some parameters in ths post request:
https://www.pastiebin.com/5bc6562304e3c
check the Post request with google dev tools

Categories

Resources