I am looking to open a connection with Python to http://www.horseandcountry.tv, which takes my login parameters via the POST method. I would like to open a connection to this website in order to scrape it for all video links (this I also don't know how to do yet, but I am using the project to learn).
My question is: how do I pass my credentials to the individual pages of the website? For example, if all I wanted to do was use Python code to open a browser window pointing to http://play.horseandcountry.tv/live/ and have it open with me already logged in, how would I go about this?
As far as I know you have two options, depending on how you want to crawl and what you need to crawl:
1) Use urllib. You can send your POST request with the necessary login credentials. This is the low-level solution, which means it is fast but doesn't handle high-level features like JavaScript.
2) Use selenium. With that you can simulate a browser (Chrome, Firefox, others...) and drive it from your Python code. It is much slower, but it works well with more "sophisticated" websites (a rough sketch is included at the end of this answer).
What I usually do: I try the first option, and if I encounter a problem like a JavaScript security layer on the website, I go for option 2. Moreover, selenium can open a real web browser on your desktop and give you a visual of your scraping.
In any case, just google "urllib/selenium login to website" and you'll find what you need.
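For option 2, a minimal selenium sketch could look like the following; the login URL and the form field names are assumptions borrowed from the requests answer below, so verify them against the actual login page:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get('https://play.horseandcountry.tv/login/')

# The field names below are assumptions; inspect the real login form to confirm them.
driver.find_element(By.NAME, 'account_email').send_keys('your_email')
driver.find_element(By.NAME, 'account_password').send_keys('your_password')
driver.find_element(By.NAME, 'submit').click()

# The browser session is now logged in, so you can navigate to other pages,
# e.g. driver.get('https://play.horseandcountry.tv/live/')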
If you want to avoid using Selenium (opening web browsers), you can go for requests; it can log in to the website and grab anything you need in the background.
Here is how you can log in to that website with requests.
import requests
from bs4 import BeautifulSoup

# Login form data
payload = {
    'account_email': 'your_email',
    'account_password': 'your_password',
    'submit': 'Sign In'
}

with requests.Session() as s:
    # Log in to the website; the session keeps the login cookies.
    response = s.post('https://play.horseandcountry.tv/login/', data=payload)

    # Check if logged in successfully.
    soup = BeautifulSoup(response.text, 'lxml')
    logged_in = soup.find('p', attrs={'class': 'navbar-text pull-right'})
    print(s.cookies)
    print(response.status_code)
    if logged_in is not None and logged_in.text.startswith('Logged in as'):
        print('Logged In Successfully!')
If you need explanations for this, you can check this answer, or the requests documentation.
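Once logged in, you can keep using the same session to fetch pages behind the login and collect links. Here is a rough continuation of the snippet above; the idea that the video links are ordinary <a> tags is an assumption, so adjust it to the real page structure:
    # Continuing inside the "with requests.Session() as s:" block above.
    live = s.get('https://play.horseandcountry.tv/live/')
    live_soup = BeautifulSoup(live.text, 'lxml')
    # Grab every link on the page, then filter down to the video URLs you need.
    for link in live_soup.find_all('a', href=True):
        print(link['href'])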
You could also use the requests module; it is one of the most popular HTTP libraries for Python. Here are some questions related to what you would like to do:
Log in to website using Python Requests module
logging in to website using requests
I'm trying to make a script to auto-login to this website and I'm having some trouble. I was hoping I could get assistance with making this work. I have the code below assembled, but I get 'Your request cannot be processed at this time\n' at the bottom of what's returned to me, when I should be getting different HTML if it were successful:
from pyquery import PyQuery
import requests

url = 'https://licensing.gov.nl.ca/miriad/sfjsp?interviewID=MRlogin'

values = {
    'd_1553779889165': 'email#email.com',
    'd_1553779889166': 'thisIsMyPassw0rd$$$',
    'd_1618409713756': 'true',
    'd_1642075435596': 'Sign in'
}

r = requests.post(url, data=values)
print(r.content)
I do this in .NET, but I think the logic can be written in Python as well.
Firstly, I always use Fiddler to capture the requests that a webpage sends, then identify the request you want to replicate and add all the cookies and headers that are sent with it to your code.
After sending the login request you will get some cookies that identify that you've logged in, and you use those cookies to proceed further on the site. For example, if you want to retrieve the user's info after logging in, you first need to convince the server that you are logged in, and that is where those login cookies help you.
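As a rough Python illustration of that cookie handling with requests (the URL and field names are copied from the question; the extra header is just an example of something you might copy from Fiddler):
import requests

login_url = 'https://licensing.gov.nl.ca/miriad/sfjsp?interviewID=MRlogin'
payload = {
    'd_1553779889165': 'email#email.com',
    'd_1553779889166': 'thisIsMyPassw0rd$$$',
    'd_1618409713756': 'true',
    'd_1642075435596': 'Sign in'
}

with requests.Session() as s:
    # Copy any headers you saw in Fiddler that the real browser sent.
    s.headers.update({'User-Agent': 'Mozilla/5.0'})

    # The session stores the cookies set by the login response...
    r = s.post(login_url, data=payload)
    print(r.status_code, s.cookies.get_dict())

    # ...and sends them automatically on every later request in the same session,
    # e.g. s.get(some_protected_page_url)  # hypothetical follow-up URL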
Also, I don't think the login will be that simple to script: since you're trying to automate a government site, it may have some anti-bot security in place, such as fingerprinting or a captcha.
Hope this helps!
For privacy reasons, I cannot share the URL publicly.
I have been able to access this site successfully using Python requests: session = requests.Session(); r = session.post(url, auth=HttpNtlmAuth(USERNAME, PASSWORD), proxies=proxies), which works great, and I can parse the webpage with bs4. I have tried to return cookies using session.cookies.get_dict(), but it returns an empty dict (I assume because the site is hosted on SharePoint). My original thought was to retrieve the cookies and then use them to access the site.
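Written out, the call described above looks roughly like this (assuming HttpNtlmAuth comes from the requests_ntlm package; the URL and proxies are placeholders since I can't share the real ones):
import requests
from requests_ntlm import HttpNtlmAuth

url = 'https://placeholder.sharepoint.site/page'  # real URL withheld
proxies = {'http': 'http://proxy:8080', 'https': 'http://proxy:8080'}  # placeholder proxy

session = requests.Session()
r = session.post(url, auth=HttpNtlmAuth('USERNAME', 'PASSWORD'), proxies=proxies)
print(r.status_code)
print(session.cookies.get_dict())  # comes back empty for me, as described above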
The issue I'm facing is that when you go to the URL, a box comes up asking for credentials, which, when entered, takes you to the page. You cannot inspect the page that the box is on, which means I can't use send_keys() etc. to log in using selenium/chromedriver.
I read through some documentation but was unable to find a way to enter the username/password when calling driver = webdriver.Chrome(path_driver) or in subsequent calls.
Any help/thoughts would be appreciated.
When right-clicking the credentials box, there is no option to inspect the webpage.
I'm trying to scrape multiple financial websites (Wells Fargo, etc.) to pull my transaction history for data analysis purposes. I can do the scraping part once I get to the page I need; the problem I'm having is getting there. I don't know how to pass my username and password and then navigate from there. I would like to do this without actually opening a browser.
I found Michael Foord's article "HOWTO Fetch Internet Resources Using The urllib Package" and tried to adapt one of the examples to meet my needs but can't get it to work (I've tried adapting to several other search results as well). Here's my code:
import bs4
import urllib.request
import urllib.parse

# Navigate to the website.
url = 'https://www.wellsfargo.com/'
values = {'j_username': 'USERNAME', 'j_password': 'PASSWORD'}

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data)

with urllib.request.urlopen(req) as response:
    the_page = response.read()
    soup = bs4.BeautifulSoup(the_page, "html.parser")
The 'j_username' and 'j_password' both come from inspecting the text boxes on the login page.
I just don't think I'm pointing to the right place or passing my credentials correctly. The URL I'm using is just the login page, so is it actually logging me in? When I print the URL from the response it returns https://wellsfargo.com/. If I'm ever able to log in successfully, it just takes me to a summary page of my accounts; I would then need to follow another link to my checking, savings, etc.
I really appreciate any help you can offer.
I'm filling a form on a web page using Python's requests module. I'm submitting the form as a POST request, which works fine; I get the expected response from the POST. However, it's a multistep form: after the first "submit", the site loads another form on the same page (using AJAX). The POST response contains this HTML page. Now, how do I use this response to fill the form on the new page? Can I intertwine the requests module with Twill or Mechanize in some way?
Here's the code for the POST:
import requests
from requests.auth import HTTPProxyAuth
import formfill
from twill import get_browser
from twill.commands import *
import mechanize
from mechanize import ParseResponse, urlopen, urljoin

http_proxy = "some_Proxy"
https_proxy = "some_Proxy"

proxyDict = {
    "http": http_proxy,
    "https": https_proxy
}

auth = HTTPProxyAuth("user", "pass")
r = requests.post("site_url", data={'key': 'value'}, proxies=proxyDict, auth=auth)
The response r above contains the new HTML page that resulted from submitting that form. This HTML page also has a form which I have to fill. Can I send this r to twill or mechanize in some way and use Mechanize's form-filling API? Any ideas would be helpful.
The problem here is that you need to actually interact with the JavaScript on the page. requests, while an excellent library, has no support for JavaScript interaction; it is just an HTTP library.
If you want to interact with JavaScript-rich web pages in a meaningful way, I would suggest selenium. Selenium actually drives a full web browser that can navigate exactly as a person would.
The main issue is that you'll see your speed drop precipitously: rendering a web page takes a lot longer than a raw HTML request. If that's a real deal breaker for you, you've got two options:
Go headless: there are many options here, but I personally prefer casper. You should see roughly a 3x speed-up in browsing times by going headless, but every site is different (a rough Python sketch follows this list).
Find a way to do everything through HTTP: most non-visual site functions have equivalent HTTP functionality. Using the Google developer tools Network tab you can dig into the requests that are actually being launched, then replicate those in Python.
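casper itself is a JavaScript tool, so as a rough Python-side sketch of "going headless", here is the same idea with selenium driving headless Chrome (the URL is a placeholder):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # render pages without opening a visible browser window

driver = webdriver.Chrome(options=options)
driver.get("https://example.org/login")  # placeholder URL
print(driver.title)
driver.quit()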
As far as the tools you mentioned go, neither mechanize nor twill will help. Since your main issue here is JavaScript interaction rather than cookie management, and neither of those frameworks supports JavaScript interaction, you would run into the same issue.
UPDATE: If the POST response is actually the new page, then you're not really interacting with AJAX at all. If that's the case and you actually have the raw HTML, you should simply mimic the typical HTTP request that the form would send. The same approach you used on the first form will work on the second: you can either grab the information out of the HTML response or simply hard-code the successive requests.
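A minimal sketch of that idea, assuming placeholder URLs and field names (and using a requests.Session so any cookies carry over between the two submissions):
import requests

with requests.Session() as s:
    # Step 1: submit the first form; the session keeps any cookies it sets.
    r1 = s.post("https://example.com/first_form", data={"key": "value"})

    # If the second form needs values from the first response (e.g. hidden fields),
    # parse them out of r1.text before building the next payload.

    # Step 2: mimic the request the second form would send.
    r2 = s.post("https://example.com/second_form", data={"another_key": "another_value"})
    print(r2.status_code)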
Using Mechanize:
import mechanize

br = mechanize.Browser()
br.open("site_url")  # placeholder: the page containing the form

# Get the name of each form on the page.
for form in br.forms():
    print("Form name:", form.name)
    print(form)

# Select the 1st form on the page - nr=1 for the next, etc.,
# OR just select the form by name: br.select_form(form.name)
br.select_form(nr=0)

# Fill in the fields ('form#' is a placeholder field name - use the real one).
br.form['form#'] = 'Test Name'

r = br.submit()  # can always pass in additional params
At work, I sit behind a proxy. When I connect to the company WiFi and open up a browser, a pop-up box usually appears asking for my company credentials before it will let me navigate to any internal/external site.
I am using the Python requests package to automate pulling data from an external site, but I'm encountering a 401 error related to not having authenticated first; this happens when I don't authenticate using the browser beforehand. If I authenticate first with the browser and then use Python requests, everything is fine and I'm able to navigate to any site.
My question is: how do I perform the work authentication part using Python? I want to be able to automate this process so that I can set up a cron job that grabs data from an external source every night.
I've tried providing a blank URL:
import requests
response = requests.get('')
But requests.get() requires a properly structured URL. I want to be able to emulate opening a browser and capture the pop-up that asks for authentication, which does not rely on any particular URL being used.
From the requests documentation
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)