I am trying to use requests (python) to grab some pages from a website that requires me to be logged in.
I inspected the login page to check the names of the username and password fields, but I found that they are not the standard 'username' and 'password' used by most sites, as the screenshots below show.
(screenshot: password field)
I used those names directly in my Python script, but each time I get a syntax error. Sublime Text even highlights part of the name in orange, as the screenshot below shows.
From this I know there must be some problem with the name, but trying to escape the $ signs did not help.
Even the login.aspx entry disappears from the Network tab before Google Chrome can register it.
The site is www.bncnetwork.net
I'd be happy if someone could help me figure out what to do about this.
Here is the code:

import requests

def get_project_page(seed_page):
    username = "*******************"
    password = "*******************"
    bnc_login = dict(ctl00$MainContent$txtEmailID=username, ctl00$MainContent$txtPassword=password)
    sess_req = requests.Session()
    sess_req.get(seed_page)
    sess_req.post(seed_page, data=bnc_login, headers={"Referer": "http://www.bncnetwork.net/MyBNC.aspx"})
    page = sess_req.get(seed_page)
    return page.text
You need to use strings for the keys; the $ will cause a syntax error if you don't:
data = {"ctl00$MainContent$txtPassword":password, "ctl00$MainContent$txtEmailID":email}
There are also validation fields (ASP.NET's hidden __EVENTVALIDATION and __VIEWSTATE inputs, for example) to be filled in; follow the logic from this answer to fill them out. All the fields can be seen in Chrome's dev tools:
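A rough sketch of that logic with requests and BeautifulSoup follows; it is an assumption on my part that the login form lives at MyBNC.aspx and that the usual ASP.NET hidden inputs are what need to be echoed back, so verify the field names in Chrome's dev tools first:

import requests
from bs4 import BeautifulSoup

login_url = "http://www.bncnetwork.net/MyBNC.aspx"  # assumed login page
username = "you@example.com"                        # placeholder credentials
password = "secret"

with requests.Session() as s:
    # Load the login page and copy every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...)
    soup = BeautifulSoup(s.get(login_url).text, "html.parser")
    payload = {tag["name"]: tag.get("value", "")
               for tag in soup.find_all("input", {"type": "hidden"}) if tag.get("name")}
    # String keys make the $ harmless
    payload["ctl00$MainContent$txtEmailID"] = username
    payload["ctl00$MainContent$txtPassword"] = password
    # The login button's name/value may also be required; check the form in dev tools
    resp = s.post(login_url, data=payload, headers={"Referer": login_url})
    print(resp.status_code)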
I have a subscription to the site https://www.naturalgasintel.com/ for daily feeds of data that show up on their site directly as .txt files; their user login page is https://www.naturalgasintel.com/user/login/
For example, a file for today's feed is given by the link https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2019/01/20190104td.txt and shows up on the site as in the picture below:
What I'd like to do is log in using my user_email and user_password and scrape this data into an Excel file.
When I use Twill to 'point' me to the data, first logging me into the site, I use this code:
from email.mime.text import MIMEText
from subprocess import Popen, PIPE
from datetime import datetime
import twill
from twill.commands import *

# NOW is assumed to be an ISO-format date string (YYYY-MM-DD...), which the slices below expect
NOW = datetime.now().isoformat()
year = NOW[0:4]
month = NOW[5:7]
day = NOW[8:10]
date = (year + month + day)

path = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/"
end = "td.txt"
go("http://www.naturalgasintel.com/user/login")
fv("2", "user[email]", user_email)
fv("2", "user[password]", user_password)
fv("2", "commit", "Login")
datafilelocation = path + year + "/" + month + "/" + date + end
go(datafilelocation)
However, logging in from the user login page sends me to this referrer link when I go to the data's location.
https://www.naturalgasintel.com/user/login?referer=%2Fext%2Fresources%2FData-Feed%2FDaily-GPI%2F2019%2F01%2F20190104td.txt
Rather than:
https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2019/01/20190104td.txt
I've tried modules like requests as well to log in to the site and then access this data, but whatever method I use sends me to the HTML source rather than the .txt data location itself.
I've posted my complete walk-through with the Python 2.7 module Twill, to which I attached a bounty, here:
Using Twill to grab .txt from login page Python
What would be the best way to access these password-protected files?
If you have a compatible version of Firefox, get the plugin javascript 0.0.1 by Chee and add the following to run on the page:
document.getElementById('user_email').value = "E-What";
document.getElementById('user_password').value = " ABC Password ";
Change the email and password as you like. It will load the page, and after that it will fill in your username and password.
There are other ways to do this all by yourself with your own stand-alone process, so you do not have to download other people's programs and learn them (beyond this little thing).
I would have upvoted this question.
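If you would rather keep everything in a stand-alone Python process, here is a minimal sketch with requests; it assumes the login form really does post user[email] and user[password] (the field names used in the Twill code above) and that a logged-in session cookie is enough to reach the .txt file:

import requests

login_url = "https://www.naturalgasintel.com/user/login"
data_url = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2019/01/20190104td.txt"

session = requests.Session()
payload = {
    "user[email]": "you@example.com",   # placeholder credentials
    "user[password]": "your_password",
}
session.post(login_url, data=payload)   # log in; the session keeps any cookies it is given
feed = session.get(data_url)            # request the protected .txt with those cookies
print(feed.text[:500])                  # first 500 characters of the daily feed

If the site also requires a hidden CSRF token in its login form, you would need to scrape it from the login page first, along the same lines as the ASP.NET example earlier.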
I am trying to develop a Python script to scrape some information from a specific website for learning purposes.
I went over a lot of different tutorials and posts, trying to gather some insights from them; they are very useful, but they still didn't help me find a way to log in to the website and do searches with different keywords.
I tried different libraries, such as requests and urllib, but maybe I didn't find the right way to use them.
The steps are as follows:
set up the login information
send the login information to the website and get the response for future use
set up the keywords
import the header
set up the cookie jar
from the login response, do the search
After testing, it works only intermittently.
Here is the code:
import getpass
# marvin
# date:2018/2/7
# login stage preparation
def login_values():
    login = "https://www.****.com/login"
    username = input("Please insert your username: ")
    password = getpass.getpass("Please type in your password: ")
    host = "www.****.com"
    # store login secrets
    data = {
        "username": username,
        "password": password,
    }
    return login, host, data
The following is for getting the HTML file from a website
import requests
import random
import http.cookiejar
import socket
# Set up web scraping function to output the html text file
def webscrape(login_url, host_url, login_data, target_url):
    # static values preparation
    ## import header
    user_agents = [
        ***
    ]
    agent = random.choice(user_agents)
    headers = {'User-agent': agent,
               'Accept': '*/*',
               'Accept-Language': 'en-US,en;q=0.9;zh-cmn-Hans',
               'Host': host_url,
               'charset': 'utf-8',
               }
    ## set up cookie jar
    cj = http.cookiejar.CookieJar()

    # get the html file
    socket.setdefaulttimeout(20)
    s = requests.Session()
    req = s.post(login_url, data=login_data)
    res = s.get(target_url, cookies=cj, headers=headers)
    html = res.text
    return html
Here is the code to get each link from the HTML:
from bs4 import BeautifulSoup
#set up html parsing function for parsing all the list links
def getlist(keyword, loginurl, hosturl, valuesurl, html_lists):
    page = 1
    pagenum = 10  # set up maximum page num
    links = []
    soup = BeautifulSoup(html_lists, "lxml")
    try:
        for li in soup.find("div", class_="search_pager human_pager in-block").ul.find_all('li'):
            target_part = soup.find_all("div", class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
            [links.append(link.find('a')['href']) for link in target_part]
            page += 1
            if page <= pagenum:
                try:
                    nexturl = soup.find('div', class_='search_pager human_pager in-block').ul.find('li', class_='pagination-next ng-scope ').a['href']  # next page
                except AttributeError:
                    print("{}'s links are all stored!".format(keyword))
                    return links
                else:
                    chs_html = webscrape(loginurl, hosturl, valuesurl, nexturl)
                    soup = BeautifulSoup(chs_html, "lxml")
    except AttributeError:
        target_part = soup.find_all("div", class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
        [links.append(link.find('a')['href']) for link in target_part]
        print("There is only one page")
        return links
The test code is:
keyword = "****"
login, host, values = login_values()   # unpack the login URL, host and credentials returned by login_values()
myurl = "https://www.****.com/search/os2?key={}".format(keyword)
chs_html = webscrape(login, host, values, myurl)
chs_links = getlist(keyword, login, host, values, chs_html)
targethtml = webscrape(login, host, values, chs_links[1])
There are 22 links in total and one page contains 19 links, so there should be more than one page; if the result "There is only one page" shows up, it indicates a failure.
Problems:
The login_values() function is meant to protect my login information by bundling everything into a single final function, but apparently the username and password are still really easy to reveal with a simple print() call.
This is the main problem! As I mentioned before, this method works only some of the time. By "not working" I mean that the returned HTML file is only the login page instead of the search results. I want better control so that it works most of the time. I printed the user agent each time to check whether a particular agent was the cause, and it is not; I also cleared the cookies on the suspicion that storage was full, and that is not it either.
Sometimes I get a max-retries error or an OS error; I guess these come from the server I am trying to reach. Is there a way I can set up a wait timer to prevent these errors from happening?
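For the last problem, one minimal sketch of a wait-and-retry wrapper around the webscrape() function defined above (the retry count and delay are arbitrary values chosen for illustration):

import time
import requests

def fetch_with_retries(login_url, host_url, login_data, target_url,
                       max_tries=3, wait_seconds=5):
    # Call webscrape(), sleeping and retrying when the connection fails
    for attempt in range(1, max_tries + 1):
        try:
            return webscrape(login_url, host_url, login_data, target_url)
        except (requests.exceptions.RequestException, OSError) as exc:
            print("Attempt {} failed: {}".format(attempt, exc))
            if attempt == max_tries:
                raise
            time.sleep(wait_seconds)  # wait before trying again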
I've been writing a program to log in to Facebook and update the status as a side project. I managed to get the program to login. However, I'm having trouble selecting the textarea that ends up being the "Enter your status here" box. Using "Inspect Element" in Chrome, I'm able to see the form under which it's located, but listing the forms in the program doesn't seem to list said form...
import mechanize
import re
br = mechanize.Browser()
usernamecorrect = 0
while usernamecorrect == 0:
    username = raw_input("What is the username for your Facebook Account? ")
    matchmail = re.search(r'[\w.-]+@[\w.-]+', username)
    if matchmail:
        print matchmail.group()
        usernamecorrect = 1
    else:
        print "That is not a valid username; please enter the e-mail address registered with your account.\n"
password = raw_input("What is the password for your account?")
print "Logging in..."
br.set_handle_robots(False)
br.open("https://www.facebook.com/")
br.select_form(nr = 0)
br['email'] = username
br['pass'] = password
br.submit()
raw_input("Login successful!")
print "Forms: \n"
for f in br.forms():
    print f.name
The full output is as follows:
What is the username for your Facebook Account? myemail@website.com
What is the password for your account? thisisapassword
Logging in...
Login successful!
Forms:
navSearch
None
I took another look through the source of Facebook via Inspect Element, and "navSearch" is the "Find People, things, etc." search bar, while the unnamed form appears to be related to the logout button. However, Inspect Element shows at least two more forms, one of which holds the status update box. I haven't been able to determine whether JavaScript is the cause (the status update box's markup is encapsulated the same way as the navSearch and logout forms). The most relevant thing I've been able to find is that navSearch and the logout form are in a separate div, but I somehow feel that shouldn't be much of a problem for mechanize. Is there just something wrong with my code, or is it something else entirely?
Is there just something wrong with my code, or is it something else entirely?
Your whole approach is wrong:
I've been writing a program to log in to Facebook and update the status
That’s what the Graph API is for.
Scraping FB pages and trying to act as a “browser” is not the way to go. Apart from the fact that FB policies do not allow that, you can see how difficult it gets on a page that uses JavaScript/AJAX so heavily.
Go with the API, it’s the easy way.
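For illustration, here is a minimal sketch of posting a status through the Graph API with requests; it assumes you have already obtained an access token with permission to publish (the token string below is a placeholder):

import requests

access_token = "YOUR_ACCESS_TOKEN"   # placeholder; obtain one through Facebook's OAuth flow

response = requests.post(
    "https://graph.facebook.com/me/feed",
    data={"message": "Hello from the Graph API", "access_token": access_token},
)
print(response.json())   # the id of the new post, or an error description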
I am having trouble getting a video entry which includes a link rel="edit". I need such an entry in order to be able to call DeleteVideoEntry(...) on it.
I am retrieving the video using GetYouTubeVideoEntry(youtube_id=XXXXXXX). My yt_service is initialized with a username, password, and a developer key. I use ProgrammaticLogin. This part seems to work fine. I use the same yt_service to upload said video earlier. Also, if I change the developer key to something bogus (during debugging) and try to authenticate, I get a 403 error. This leads me to believe that authentication works OK.
Needless to say, the video entry retrieved with GetYouTubeVideoEntry(youtube_id=XXXXXXX) does not contain the edit link, and I cannot use the entry in a DeleteVideoEntry(...) call.
Is there some special way to get a video entry which will contain a link element with a rel="edit"? Can anyone suggest some way to resolve my issue? Could this possibly be a bug?
Update:
For the records, when I tried getting the feed of all my uploads, and then looping through the video entries, the video entries do have an edit link. So using this works:
uri = 'http://gdata.youtube.com/feeds/api/users/%s/uploads' % username
feed = yt_service.GetYouTubeVideoFeed(uri)
for entry in feed.entry:
    yt_service.DeleteVideoEntry(entry)
But this does not:
entry = yt_service.GetYouTubeVideoEntry(video_id = video.youtube_id)
yt_service.DeleteVideoEntry(entry)
Using the same yt_service.
I've just deleted a YouTube video using gdata and ProgrammaticLogin().
Here are the steps to reproduce:
import gdata.youtube.service
yt_service = gdata.youtube.service.YouTubeService()
yt_service.developer_key = 'developer_key'
yt_service.email = 'email'
yt_service.password = 'password'
yt_service.ProgrammaticLogin()
# video_id should looks like 'iu6Gq-tUsTc'
uri = 'https://gdata.youtube.com/feeds/api/users/%s/uploads/%s' % (username, video_id)
entry = yt_service.GetYouTubeUserEntry(uri=uri)
response = yt_service.DeleteVideoEntry(entry)
print response # True
yt_service.GetYouTubeVideoFeed(uri) works because GetYouTubeVideoFeed doesn't check the uri and just calls self.Get(uri, ...), but originally, I think, it expected an 'https://gdata.youtube.com/feeds/api/videos' uri.
Conversely, yt_service.GetYouTubeVideoEntry() uses YOUTUBE_VIDEO_URI = 'https://gdata.youtube.com/feeds/api/videos', but that entry doesn't contain rel="edit".
Hope that helps you out.
You can view the HTTP headers of the generated requests by setting the debug flag to true. This is as simple as:
yt_service = gdata.youtube.service.YouTubeService()
yt_service.debug = True
You can read about this in the documentation here.
I'm currently trying to get a grasp on pycurl. I'm attempting to log in to a website. After logging in, the site should redirect to the main page; however, when trying this script I just get returned to the login page. What might I be doing wrong?
import pycurl
import urllib
import StringIO
pf = {'username' : 'user', 'password' : 'pass' }
fields = urllib.urlencode(pf)
pageContents = StringIO.StringIO()
p = pycurl.Curl()
p.setopt(pycurl.FOLLOWLOCATION, 1)
p.setopt(pycurl.COOKIEFILE, './cookie_test.txt')
p.setopt(pycurl.COOKIEJAR, './cookie_test.txt')
p.setopt(pycurl.POST, 1)
p.setopt(pycurl.POSTFIELDS, fields)
p.setopt(pycurl.WRITEFUNCTION, pageContents.write)
p.setopt(pycurl.URL, 'http://localhost')
p.perform()
pageContents.seek(0)
print pageContents.readlines()
EDIT: As pointed out by Peter, the URL should point to a login URL, but the site I'm trying to get this working for doesn't show me what that URL would be. The form's action just points to the home page (/index.html).
As you're troubleshooting this problem, I suggest getting a browser plugin like FireBug or LiveHTTPHeaders (I suggest Firefox plugins, but there are similar plugins for other browsers as well). Then you can exercise a request to the site and see what action (URL), method, and form parameters are being passed to the target server. This will likely help elucidate the crux of the problem.
If that's no help, you may consider using a different tool for your mechanization. I've used ClientForm and BeautifulSoup to perform similar operations. Based on what I've read in the pycURL docs and your code above, ClientForm might be a better tool to use. ClientForm will parse your HTML page, locate the forms on it (including login forms), and construct the appropriate request for you based on the answers you supply to the form. You could even use ClientForm with pycURL... but at least ClientForm will provide you with the appropriate action to which to POST, and construct all of the appropriate parameters.
Be aware, though, that if there is JavaScript handling any necessary part of the login form, even ClientForm can't help you there. You will need something that interprets the JavaScript to effectively automate the login. In that case, I've used SeleniumRC to control a browser (and I let the browser handle the JavaScript).
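As a rough illustration of that form-driven approach, here is a sketch with mechanize (which bundles ClientForm); the form index and field names are assumptions borrowed from the pycurl snippet above and would need to be confirmed against the real login page:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://localhost/index.html")   # the page that contains the login form
br.select_form(nr=0)                     # assume the login form is the first form on the page
br["username"] = "user"                  # field names assumed, as in the pycurl example
br["password"] = "pass"
response = br.submit()                   # mechanize POSTs to the form's actual action URL
print(response.geturl())                 # shows where the login actually landed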
One of the golden rules: to 'break the ice' when troubleshooting a pycurl example, have debugging enabled:
Note: don't forget to use p.close() after p.perform()
def test(debug_type, debug_msg):
    if len(debug_msg) < 300:
        print "debug(%d): %s" % (debug_type, debug_msg.strip())
p.setopt(pycurl.VERBOSE, True)
p.setopt(pycurl.DEBUGFUNCTION, test)
Now you can see how your code is breathing, because you have debugging enabled
import pycurl
import urllib
import StringIO
def test(debug_type, debug_msg):
    if len(debug_msg) < 300:
        print "debug(%d): %s" % (debug_type, debug_msg.strip())
pf = {'username' : 'user', 'password' : 'pass' }
fields = urllib.urlencode(pf)
pageContents = StringIO.StringIO()
p = pycurl.Curl()
p.setopt(pycurl.FOLLOWLOCATION, 1)
p.setopt(pycurl.COOKIEFILE, './cookie_test.txt')
p.setopt(pycurl.COOKIEJAR, './cookie_test.txt')
p.setopt(pycurl.POST, 1)
p.setopt(pycurl.POSTFIELDS, fields)
p.setopt(pycurl.WRITEFUNCTION, pageContents.write)
p.setopt(pycurl.VERBOSE, True)
p.setopt(pycurl.DEBUGFUNCTION, test)
p.setopt(pycurl.URL, 'http://localhost')
p.perform()
p.close() # This is mandatory.
pageContents.seek(0)
print pageContents.readlines()