I have a subscription to the site https://www.naturalgasintel.com/ for daily feeds of data that show up on their site directly as .txt files; their user login page being https://www.naturalgasintel.com/user/login/
For example a file for today's feed is given by the link https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2019/01/20190104td.txt and shows up on the site like the picture below:
What I'd like to do is to log in using my user_email and user_password and scrape this data in the form of an Excel file.
When I use Twill to try and 'point' me to the data by first logging me into the site I use this code:
from email.mime.text import MIMEText
from subprocess import Popen, PIPE
import twill
from twill.commands import *
year= NOW[0:4]
month=NOW[5:7]
day=NOW[8:10]
date=(year+month+day)
path = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/"
end = "td.txt"
go("http://www.naturalgasintel.com/user/login")
fv("2", "user[email]", user_email)
fv("2", "user[password]", user_password)
fv("2", "commit", "Login")
datafilelocation = path + year + "/" + month + "/" + date + end
go(datafilelocation)
However, logging in from the user login page sends me to this referrer link when I go to the data's location.
https://www.naturalgasintel.com/user/login?referer=%2Fext%2Fresources%2FData-Feed%2FDaily-GPI%2F2019%2F01%2F20190104td.txt
Rather than:
https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2019/01/20190104td.txt
I've tried using modules like requests as well to log in from the site and then access this data but whatever method I use sends me to the HTML source rather than the .txt data location itself.
I've posted my complete walk-through with the Python 2.7 module Twill which I attached a bounty to here:
Using Twill to grab .txt from login page Python
What would the best solution to being able to access these password protected files be?
If you have a compatible version of FireFox for this, then get the plugin javascript 0.0.1 by Chee and add the following to run on the page:
document.getElementById('user_email').value = "E-What";
document.getElementById('user_password').value = " ABC Password ";
Change the email and password as you like. It will load the page, then after that it will put in your username and password.
There are other ways to do this all by yourself with your own stand-alone process. You do not have to download other people's programs and try to learn them (beyond this little thing) if you change it this way.
I would have up voted this question.
Related
I am trying to make a script that auto fills the HTML5 form values (Name and Password). I don't know how to insert the values.
On other sites I can't find a good explanation about how to do it.
Login site: https://ncod22.n-able.com/
username = TestName
password = TestPwd
(These aren't working)
Pyton code (so far):
import webbrowser, sys
webbrowser.open('https://ncod22.n-able.com/')
Try mechanize:
import mechanize
url="https://ncod22.n-able.com/"
pg=mechanize.Browser()
pg.set_handle_robots(False)
r=pg.open(url) #open page
pg.select_form(nr=0) #select form number
pg.form["username"]="TestName" #<input> name
pg.form["password"]="TestPwd" #<input name="password">
pg.method="POST" #form method
r=pg.submit() #submitting form
#print(r.read()) to see Page source
I am trying to develop a script with python to web scraping some information on a specific website for learning purposes.
I went over a lot of different tutorials and posts, trying to gather some insights from them, they are very useful but still didn't help me to find a way to log in the website and do searches with different keywords.
I tried to use different APIs, such as requests and urllib, maybe I didn't find the right way to solve it.
The steps lists as follow:
login information set up
Send login information to the website and get response for future use
keywords setup
import header
set up cookiejar
from login response, do the search
After I tried, it will work randomly, and
here is the code:
import getpass
# marvin
# date:2018/2/7
# login stage preparation
def login_values():
login="https://www.****.com/login"
username = input("Please insert your username: ")
password = getpass.getpass("Please type in your password: ")
host="www.****.com"
#store login screts
data = {
"username": username,
"password": password,
}
return login,host,data
The following is for getting the HTML file from a website
import requests
import random
import http.cookiejar
import socket
# Set up web scraping function to output the html text file
def webscrape(login_url,host_url,login_data,target_url):
#static values preparation
##import header
user_agents = [
***
]
agent = random.choice(user_agents)
headers={'User-agent':agent,
'Accept':'*/*',
'Accept-Language':'en-US,en;q=0.9;zh-cmn-Hans',
'Host':host_url,
'charset':'utf-8',
}
##set up cookie jar
cj = http.cookiejar.CookieJar()
#
# get the html file
socket.setdefaulttimeout(20)
s=requests.Session()
req=s.post(login_url, data=login_data)
res = s.get(target_url, cookies=cj,headers=headers)
html=res.text
return html
Here is the code to get each links from html:
from bs4 import BeautifulSoup
#set up html parsing function for parsing all the list links
def getlist(keyword,loginurl,hosturl,valuesurl,html_lists):
page=1
pagenum=10# set up maximum page num
links=[]
soup=BeautifulSoup(html_lists,"lxml")
try:
for li in soup.find("div",class_="search_pager human_pager in-block").ul.find_all('li'):
target_part=soup.find_all("div",class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
[links.append(link.find('a')['href']) for link in target_part]
page+=1
if page<=pagenum:
try:
nexturl=soup.find('div',class_='search_pager human_pager in-block').ul.find('li',class_='pagination-next ng-scope ').a['href'] #next page
except AttributeError:
print("{}'s links are all stored!".format(keyword))
return links
else:
chs_html=webscrape(loginurl,hosturl,valuesurl,nexturl)
soup=BeautifulSoup(chs_html,"lxml")
except AttributeError:
target_part=soup.find_all("div",class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
[links.append(link.find('a')['href']) for link in target_part]
print("There is only one page")
return links
The test code is:
keyword="****"
myurl="https://www.****.com/search/os2?key={}".format(keyword)
chs_html=webscrape(login,host,values,myurl)
chs_links=getlist(keyword,login,host,values,chs_html)
targethtml=webscrape(login,host,values,chs_links[1])
There are total 22 links and one page containing 19 links, so it is supposed to have more than one page, if the result "There is only one page" shown up, it indicates a failure.
Problems:
The login_values function is to secure my login information by combining all functions to a final function, but apparently, the username and password are still really easy to show just by print() command.
This the main problem!! Like I mentioned before, this method works randomly. By the way, what I mean not working, it is that the HTML file is only the login page instead of the searching result. I want to get a better control to make it work most of the time. I checked user-agents by print agent every time to see if they are relevant, and it is not! I cleared cookies with suspicious to full storage memory, and it is not.
There are sometimes I facing max trial error or OS error, I guess it is the error from the server I was trying to reach, is there a way I can set up a wait timer for me to prevent these errors from happening?
I am trying to use requests (python) to grab some pages from a website that requires me to be logged in.
I did inspect the login page to check out the username and password headers. But I found the names for those fields are not the standard 'username', 'password' used by most sites as you can see from the below screenshots
password field
I used them that way in my python script but each time I get a 'wrong syntax' error. Even sublimetext displayed a part of the name in orange as you can see from the pix below
From this I know there must be some problem with the name. But try to escape the $ signs did not help.
Even the login.aspx header disappears before google chrome could register it on the network.
The site is www dot bncnetwork dot net
I'd be happy if someone could help me figure out what to do about this.
Here is the code`import requests
import requests
def get_project_page(seed_page):
username = "*******************"
password = "*******************"
bnc_login = dict(ctl00$MainContent$txtEmailID=username, ctl00$MainContent$txtPassword=password)
sess_req = requests.Session()
sess_req.get(seed_page)
sess_req.post(seed_page, data=bnc_login, headers={"Referer":"http://www.bncnetwork.net/MyBNC.aspx"})
page = sess_req.get(seed_page)
return page.text`
You need to use strings for the keys, the $ will cause a syntax error if you don't:
data = {"ctl00$MainContent$txtPassword":password, "ctl00$MainContent$txtEmailID":email}
There are evenvalidation fileds etc.. to be filled in also, follow the logic from this answer to fill them out, all the fields can be seen in chrome tools:
I'm trying to automate the process of creating an account for something, lets call it X, but I cant figure out what to do.
I saw this code somewhere,
import urllib
import urllib2
import webbrowser
data = urllib.urlencode({'q': 'Python'})
url = 'http://duckduckgo.com/html/'
full_url = url + '?' + data
response = urllib2.urlopen(full_url)
with open("results.html", "w") as f:
f.write(response.read())
webbrowser.open("results.html")
But I cant figure out how to modify it for my use.
I would highly recommend utilizing Selenium+Webdriver for this, since your question appears UI and browser-based. You can install Selenium via 'pip install selenium' in most cases. Here are a couple of good references to get started.
- http://selenium-python.readthedocs.io/
- https://pypi.python.org/pypi/selenium
Also, if this process needs to drive the browser headlessly, look into including PhantomJS (via GhostDriver), which can be downloaded from the phantomjs.org website.
I'm currently trying to get a grasp on pycurl. I'm attempting to login to a website. After logging into the site it should redirect to the main page. However when trying this script it just gets returned to the login page. What might I be doing wrong?
import pycurl
import urllib
import StringIO
pf = {'username' : 'user', 'password' : 'pass' }
fields = urllib.urlencode(pf)
pageContents = StringIO.StringIO()
p = pycurl.Curl()
p.setopt(pycurl.FOLLOWLOCATION, 1)
p.setopt(pycurl.COOKIEFILE, './cookie_test.txt')
p.setopt(pycurl.COOKIEJAR, './cookie_test.txt')
p.setopt(pycurl.POST, 1)
p.setopt(pycurl.POSTFIELDS, fields)
p.setopt(pycurl.WRITEFUNCTION, pageContents.write)
p.setopt(pycurl.URL, 'http://localhost')
p.perform()
pageContents.seek(0)
print pageContents.readlines()
EDIT: As pointed out by Peter the URL should point to a login URL but the site I'm trying to get this to work for fails to show me what URL this would be. The form's action just points to the home page ( /index.html )
As you're troubleshooting this problem, I suggest getting a browser plugin like FireBug or LiveHTTPHeaders (I suggest Firefox plugins, but there are similar plugins for other browsers as well). Then you can exercise a request to the site and see what action (URL), method, and form parameters are being passed to the target server. This will likely help elucidate the crux of the problem.
If that's no help, you may consider using a different tool for your mechanization. I've used ClientForm and BeautifulSoup to perform similar operations. Based on what I've read in the pycURL docs and your code above, ClientForm might be a better tool to use. ClientForm will parse your HTML page, locate the forms on it (including login forms), and construct the appropriate request for you based on the answers you supply to the form. You could even use ClientForm with pycURL... but at least ClientForm will provide you with the appropriate action to which to POST, and construct all of the appropriate parameters.
Be aware, though, that if there is JavaScript handling any necessary part of the login form, even ClientForm can't help you there. You will need something that interprets the JavaScript to effectively automate the login. In that case, I've used SeleniumRC to control a browser (and I let the browser handle the JavaScript).
One of the golden rule, you need to 'brake the ice', have debugging enabled when trying to solve pycurl example:
Note: don't forget to use p.close() after p.perform()
def test(debug_type, debug_msg):
if len(debug_msg) < 300:
print "debug(%d): %s" % (debug_type, debug_msg.strip())
p.setopt(pycurl.VERBOSE, True)
p.setopt(pycurl.DEBUGFUNCTION, test)
Now you can see how your code is breathing, because you have debugging enabled
import pycurl
import urllib
import StringIO
def test(debug_type, debug_msg):
if len(debug_msg) < 300:
print "debug(%d): %s" % (debug_type, debug_msg.strip())
pf = {'username' : 'user', 'password' : 'pass' }
fields = urllib.urlencode(pf)
pageContents = StringIO.StringIO()
p = pycurl.Curl()
p.setopt(pycurl.FOLLOWLOCATION, 1)
p.setopt(pycurl.COOKIEFILE, './cookie_test.txt')
p.setopt(pycurl.COOKIEJAR, './cookie_test.txt')
p.setopt(pycurl.POST, 1)
p.setopt(pycurl.POSTFIELDS, fields)
p.setopt(pycurl.WRITEFUNCTION, pageContents.write)
p.setopt(pycurl.VERBOSE, True)
p.setopt(pycurl.DEBUGFUNCTION, test)
p.setopt(pycurl.URL, 'http://localhost')
p.perform()
p.close() # This is mandatory.
pageContents.seek(0)
print pageContents.readlines()