Scrape Data Point Using Python

I am looking to scrape a data point using Python from the URL http://www.cavirtex.com/orderbook .
The data point I am looking to scrape is the lowest bid offer, which at the moment looks like this:
<tr>
<td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
<td><b>0.0775/0.1146</b></td>
<td><b>860.00000</b></td>
<td><b>66.65 CAD</b></td>
</tr>
The relevant value is the 860.00. I am looking to build this into a script which can send me an email to alert me of certain price differentials compared to other exchanges.
I'm quite a newbie, so if you could explain your thought process on why you've done certain things, it would be very much appreciated.
This is what I have so far; it returns the title correctly, but I'm having trouble grabbing the table data.
import urllib2, sys
from bs4 import BeautifulSoup

site = "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}  # some sites reject the default urllib2 user agent
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title

Here is the code for scraping the lowest bid from the 'Buying BTC' table:
from selenium import webdriver

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get('http://www.cavirtex.com/orderbook')

lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')
for element in elements:
    text = element.get_attribute('innerHTML').strip('<b>|</b>')
    try:
        bid = float(text)
        if lowest_bid > bid:
            lowest_bid = bid
    except ValueError:  # skip cells (dates, ranges) that aren't plain numbers
        pass
browser.quit()
print lowest_bid
In order to install Selenium for Python on your Windows PC, run from a command line:
pip install selenium
(or pip install selenium --upgrade if you already have it).
If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".
If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".
Note:
If you consider performance critical, you can implement the data scraping via a plain URL connection instead of Selenium, and have your program run much faster. However, your code will probably end up a lot "messier", due to the tedious HTML parsing that you'll be obliged to apply...
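For example, here is a minimal sketch of that faster approach using urllib2 and BeautifulSoup (the div id "orderbook_buy" is taken from the Selenium XPath above; the exact markup of the page is an assumption and may differ):
import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request("http://www.cavirtex.com/orderbook", headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(req))

lowest_bid = float('inf')
orderbook = soup.find('div', id='orderbook_buy')  # same container the XPath above targets
for td in orderbook.find_all('td'):
    try:
        bid = float(td.get_text(strip=True))  # cells like dates won't parse and are skipped
    except ValueError:
        continue
    if bid < lowest_bid:
        lowest_bid = bid
print lowest_bid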
Here is the code for sending the previous output in an email from yourself to yourself:
import smtplib, ssl

def SendMail(username, password, contents):
    server = Connect(username)
    try:
        server.login(username, password)
        server.sendmail(username, username, contents)
    except smtplib.SMTPException, error:
        print error
    Disconnect(server)

def Connect(username):
    serverName = username[username.index("@")+1:username.index(".")]
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException, error:
            print error
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException, ssl.SSLError), error:
            print error
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException, error:
        print error

serverDict = {
    "gmail"  : "smtp.gmail.com",
    "hotmail": "smtp.live.com",
    "yahoo"  : "smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com", "your_password", str(lowest_bid))
The above code should work if your email provider is Gmail, Hotmail or Yahoo.
Please note that depending on your firewall configuration, it may ask for your permission the first time you try it...

Related

Shell script to download a lot of HTML files and store them statically with all CSS

I have posted roughly 290 questions on a science forum that I would like to get back by downloading them with all the associated answers.
The first issue is that I have to be logged in to my personal space to see the list of all my messages. How can I circumvent this first barrier so that a shell script or a single wget command can retrieve all the URLs and their content? Can I pass a login and a password to wget so that I am logged in and redirected to the appropriate URL containing the list of all messages?
Once this first issue is solved, the second issue is that I have to start from 6 different menu pages that all contain the titles and the links of the questions.
Moreover, for some of my questions, the answers and the discussions may span multiple pages.
So I wonder if I could achieve this operation of global downloading, knowing that I would like to store the pages statically, with the CSS also stored locally on my computer (to keep the same format in my browser when I consult them on my PC).
The URL of the first menu page of questions is only reachable once I am logged in on the website; that could also be an issue when downloading with wget if I am obliged to be connected.
An example of a URL containing the list of messages, once I am logged in, is:
https://forums.futura-sciences.com/search.php?searchid=22897684
The other pages (there are 6 or 7 pages of discussion titles in total appearing in the main menu page) have the format:
https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=2 (for page 2)
https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=5 (for page 5)
On each of these pages one can see the title and the link of each of the discussions that I would like to download, together with the CSS (knowing that each discussion may also span multiple pages):
For example, the first page of the discussion "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps.html"
has page 2: "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-2.html"
and page 3: "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-3.html"
Naively, I tried to do all this with only one command (with the example URL of my personal space that I gave at the beginning of the post, i.e. "https://forums.futura-sciences.com/search.php?searchid=22897684"):
wget -r --no-check-certificate --html-extension --convert-links "https://forums.futura-sciences.com/search.php?searchid=22897684"
but unfortunately, this command downloads all files, and maybe not even what I want, i.e. my discussions.
I don't know what approach to use: must I first store all the URLs in a file (with all the sub-pages containing all the answers and the whole discussion for each of my initial questions)?
Afterwards, I could perhaps do wget -i all_URL_questions.txt. How can I carry out this operation?
Update
My issue needs a script. I tried the following things with Python:
1)
import urllib, urllib2, cookielib
username = 'USERNAME'
password = 'PASSWORD'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
opener.open('https://forums.futura-sciences.com/login.php', login_data)
resp = opener.open('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
print resp.read()
But the printed page is not the home page of my personal space.
2)
import requests

# Fill in your details here to be posted to the login form.
payload = {
    'inUserName': 'USERNAME',
    'inUserPass': 'PASSWORD'
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=payload)
    # Print the html returned, or something more intelligent, to see if it's a successful login page.
    print p.text.encode('utf8')
    # An authorised request.
    r = s.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
    print r.text.encode('utf8')
Here too, this doesn't work.
3)
import requests
import bs4

site_url = 'https://forums.futura-sciences.com/login.php?do=login'
userid = 'USERNAME'
password = 'PASSWORD'
file_url = 'https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1'
o_file = 'abc.html'

# Create a session
s = requests.Session()
# GET request. This will generate the cookie for you
s.get(site_url)
# Log in to the site.
s.post(site_url, data={'vb_login_username': userid, 'vb_login_password': password})
# Next thing will be to visit the URL for the file you would like to download.
r = s.get(file_url)
# Download the file
with open(o_file, 'wb') as output:
    output.write(r.content)
print("requests:: File {} downloaded successfully!".format(o_file))
# Close the session once all work is done
s.close()
Same thing: the content is wrong.
4)
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
webdriver.get('https://forums.futura-sciences.com/')
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()
browser = webdriver.Firefox()
browser.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
Still not able to log in with USERNAME and PASSWORD and get the content of the home page of my personal space.
5)
from selenium import webdriver
from selenium.webdriver.firefox.webdriver import FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
def MS_login(username, passwd):  # call this with username and password
    firefox_capabilities = DesiredCapabilities.FIREFOX
    firefox_capabilities['moz:webdriverClick'] = False
    fp = webdriver.FirefoxProfile()
    fp.set_preference("browser.download.folderList", 2)  # 0: desktop, 1: default "Downloads" directory, 2: the directory set below
    fp.set_preference("browser.download.dir", "/Users/user/work_archives_futura/")
    driver = webdriver.Firefox(capabilities=firefox_capabilities, firefox_profile=fp)
    driver.get('https://forums.futura-sciences.com/')  # change the url to your website
    time.sleep(5)  # wait for redirection and rendering
    driver.delete_all_cookies()  # clean up the prior login sessions
    driver.find_element_by_xpath("//input[@name='vb_login_username']").send_keys(username)
    elem = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='vb_login_password']")))
    elem.send_keys(Keys.TAB)
    driver.find_element_by_xpath("//input[@type='submit']").click()
    print("success !!!!")
    driver.close()  # close the browser
    return driver

if __name__ == '__main__':
    MS_login("USERNAME", "PASSWORD")
The window opens fine and the username is filled in, but it is impossible to fill in the password, or to click on submit.
PS: the main issue could come from the fact that the password field has the display:none property, so I can't simulate the TAB operation to reach the password field and fill it in once I have entered the login.
It seems you're already pretty knowledgeable about scraping using the various methods. All that was missing were the correct field names in the POST request.
I used the Chrome dev tools (F12, then go to the Network tab). With this open, if you log in and quickly stop the browser window from redirecting, you'll be able to see the full request to login.php and look at the fields, etc.
With that I was able to build this for you. It includes a nice dumping function for responses. To test that my code works, you can use your real password for the positive case and the bad-password line for the negative case.
import requests
import json

s = requests.Session()

def dumpResponseData(r, fileName):
    print(r.status_code)
    print(json.dumps(dict(r.headers), indent=1))
    cookieDict = s.cookies.get_dict()
    print(json.dumps(cookieDict, indent=1))
    outfile = open(fileName, mode="w")
    outfile.write(r.text)
    outfile.close()

username = "your-username"
password = "your-password"
# password = "bad password"

def step1():
    data = dict()
    data["do"] = "login"
    data["vb_login_md5password"] = ""
    data["vb_login_md5password_utf"] = ""
    data["s"] = ""
    data["securitytoken"] = "guest"
    data["url"] = "/search.php?do=finduser&userid=1077817&contenttype=vBForum_Post&showposts=1"
    data["vb_login_username"] = username
    data["vb_login_password"] = password
    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=data)
    # Logged In?
    if "vbseo_loggedin" in s.cookies.keys():
        print("Logged In!")
    else:
        print("Login Failed :(")

if __name__ == "__main__":
    step1()
I don't have any posts in my newly created Futura account, so I can't really do any more testing for you; I don't want to spam their forum with garbage.
But I would probably start by requesting the post-search URL and scraping the links using bs4.
Then you could probably just use wget -r for each link you've scraped.
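For instance, a rough sketch of that idea, reusing the logged-in session s from the code above (untested against the live forum; the "archives" substring filter is only a guess based on the discussion URLs quoted in the question, and the wget flags mirror the ones you already tried):
import subprocess
from bs4 import BeautifulSoup

r = s.get("https://forums.futura-sciences.com/search.php?do=finduser&userid=1077817&contenttype=vBForum_Post&showposts=1")
soup = BeautifulSoup(r.text, "html.parser")

# Collect the discussion links from the search results page.
links = {a["href"] for a in soup.find_all("a", href=True) if "archives" in a["href"]}

# Mirror each discussion with its page requisites (CSS, images) for offline reading.
for link in links:
    subprocess.call(["wget", "-r", "-l", "1", "--convert-links", "--page-requisites", link])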
@Researcher is correct in their advice when it comes to the requests library. You are not posting all of the request params that the browser would send. Overall, I think it will be difficult to get requests to pull everything once you factor in static content and client-side JavaScript.
Your selenium code from section 4 has a few mistakes in it:
# yours
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()
# should be
webdriver.find_element_by_id('vb_login_username').send_keys('USERNAME')
webdriver.find_element_by_id('vb_login_password').send_keys('PASSWORD')
webdriver.find_element_by_xpath("//input[@type='submit']").click()
You may need to fiddle with the XPath for the submit button.
Hint: you can debug along the way by taking screenshots:
webdriver.find_element_by_id('vb_login_username').send_keys('USERNAME')
webdriver.find_element_by_id('vb_login_password').send_keys('PASSWORD')
webdriver.get_screenshot_as_file('before_submit.png')
webdriver.find_element_by_xpath("//input[@type='submit']").click()
webdriver.get_screenshot_as_file('after_submit.png')
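If the password field really is hidden via display:none, as the question's PS suggests, send_keys won't work on it; one common workaround is to set its value directly with JavaScript. A minimal sketch (the field name vb_login_password is taken from the corrected code above):
# Assign the hidden input's value via JavaScript instead of send_keys.
webdriver.execute_script(
    "document.getElementsByName('vb_login_password')[0].value = arguments[0];",
    'PASSWORD')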

Scraping Flickr with Selenium/Beautiful soup in Python - ABSWP

I'm going through Automate the Boring Stuff with Python and I'm stuck at the chapter about downloading data from the internet. One of the tasks is to download photos for a given keyword from Flickr.
I have a massive problem with scraping this site. I've tried BeautifulSoup (which I think is not appropriate in this case, as the site relies on JavaScript) and Selenium. Looking at the HTML, I think that I should locate the 'overlay' class. However, no matter which option I use (find_element_by_class_name, ...by_text, ...by_partial_text) I am not able to find these elements.
Could you please help me clarify what I'm doing wrong? I'd also be grateful for any materials that could help me understand such cases better. Thanks!
Here's my simple code:
import sys
search_keywords = sys.argv[1]
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(f'https://www.flickr.com/search/?text={search_keywords}')
elems = browser.find_element_by_class_name("overlay")
print(elems)
elems.click()
Sample keywords I type in shell: "industrial design interior"
Are you getting any error message? With Selenium it's useful to surround your code in try/except blocks.
What are you trying to do exactly, download the photos? With a bit of rewriting:
import time
from selenium import webdriver

try:
    options = webdriver.ChromeOptions()
    #options.add_argument('--headless')
    driver = webdriver.Chrome(chrome_options=options)
    search_keywords = "cars"
    driver.get(f'https://www.flickr.com/search/?text={search_keywords}')
    time.sleep(1)
except Exception as e:
    print("Error loading search results page" + str(e))

try:
    elems = driver.find_element_by_class_name("overlay")
    print(elems)
    elems.click()
    time.sleep(5)
except Exception as e:
    print(str(e))
This loads the page as expected and then clicks on the first photo, taking us to that photo's page.
I would be able to help more if you could go into more detail about what you're trying to accomplish.
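If the goal is to download the photos themselves, one possible continuation is to read the image URL off the photo page and save it with requests; a sketch (the plain <img> lookup is an assumption about Flickr's markup, which changes often):
import requests

# On the photo page, pick up the main <img> element and its source URL.
img = driver.find_element_by_tag_name('img')
src = img.get_attribute('src')

# Fetch the image bytes and write them to disk.
resp = requests.get(src)
with open('photo.jpg', 'wb') as f:
    f.write(resp.content)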

Python exception handling with selenium

I am new to Python and I am trying to write a Nagios script which uses Selenium to log into a web app and print out some information. As of now the script works as expected, but I would like it to alert the system if it fails to retrieve the website. Here is what I have:
#!/usr/bin/env python
import sys
from selenium import webdriver
url = '<main web site>'
systemInformation = '<sys information site>'
logout = '<log out link>'
browser = webdriver.PhantomJS('<path to phantomjs for headless operation>')
login_username = '<username>'
login_password = '<password>'
try:
    browser.get(url)
    username = browser.find_element_by_name("username")
    password = browser.find_element_by_name("password")
    username.send_keys(login_username)
    password.send_keys(login_password)
    link = browser.find_element_by_name('loginbutton')
    link.click()
    browser.get(systemInformation)
    print "OK: Web Application is Running"
    for element in browser.find_elements_by_name('SystemReportsForm'):
        print element.text
    browser.get(logout)
    browser.quit()
    sys.exit(0)
except:
    print "WARNING: Web Application is Down!"
    sys.exit(2)
I would expect that if the first section fails, it would then go to the except section; however, the script prints out both the try and the except messages even though there is an exit. I'm sure it's something simple I am missing.
Thanks in advance.
Update
This is how I ended up resolving this issue, thanks for the help
#!/usr/bin/env python
import sys, urllib2
from selenium import webdriver
url = '<log in url>'
systemInformation = '<sys info url>'
logout = '<logout url>'
browser = webdriver.PhantomJS('<phantomjs location for headless browser>')
login_username = '<user>'
login_password = '<password>'
def login(login_url, status_url):
    browser.get(login_url)
    username = browser.find_element_by_name("username")
    password = browser.find_element_by_name("password")
    username.send_keys(login_username)
    password.send_keys(login_password)
    link = browser.find_element_by_name('loginbutton')
    link.click()
    browser.get(status_url)
    if browser.title == 'Log In':
        print "WARNING: Site up but Failed to login!"
        browser.get(logout)
        browser.quit()
        sys.exit(1)
    else:
        print "OK: Everything Looks Good"
        for element in browser.find_elements_by_name('SystemReportsForm'):
            print element.text
        browser.get(logout)
        browser.quit()
        sys.exit(0)

req = urllib2.Request(url)
try:
    urllib2.urlopen(req)
    login(url, systemInformation)
except urllib2.HTTPError as e:
    print('CRITICAL: Site Appears to be Down!')
    browser.get(logout)
    browser.quit()
    sys.exit(2)
sys.exit([status]) raises a SystemExit(status) exception; that's why the except clause is executed:
Exit the interpreter by raising SystemExit(status). If the status is
omitted or None, it defaults to zero (i.e., success). If the status is
an integer, it will be used as the system exit status. If it is
another kind of object, it will be printed and the system exit status
will be one (i.e., failure).
Remove the sys.exit(0) inside the try block (assuming you have shown the complete version of the script).
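A minimal demonstration of the behaviour; in Python 2 a bare except catches SystemExit as well, since SystemExit inherits from BaseException:
import sys

try:
    print "OK: Web Application is Running"
    sys.exit(0)   # raises SystemExit(0)
except:           # a bare except also catches SystemExit
    print "WARNING: Web Application is Down!"
    sys.exit(2)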

How to read source web site already open in browser

I'm wondering whether there is any way to open a URL in a browser and read the source of that opened URL.
I'm trying to check whether my XPath selector gets the right value of the captcha img src. I can't do this by making 2 connections to the URL, because the captcha reloads every single time I connect.
To read the source I'm using:
import urllib

url = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx"
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()
To open the URL in a browser I'm using:
import os, sys, subprocess

if sys.platform == 'win32':
    os.startfile(url)
elif sys.platform == 'darwin':
    subprocess.Popen(['open', url])
else:
    try:
        subprocess.Popen(['xdg-open', url])
    except OSError:
        print 'Please open a browser on: ' + url
Do any of you guys know how to solve this?
Thanks
I found a solution. To see the URL in a browser and at the same time see the source code of the page, just use this code:
from selenium import webdriver
from lxml import etree, html

url = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx"
adres_prefix = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/"
adres_sufix = etree.XPath('string(//img[@class="captcha"]/@src)')

browser = webdriver.Firefox()
browser.get(url)
html_source = browser.page_source  # get the source code of the opened url
root = etree.HTML(html_source)
result = etree.tostring(root, pretty_print=True, method="html")
result2 = adres_sufix(root)
www = adres_prefix + result2
print www  # now I can see whether the XPath gives me the right value
Hope it will help others
Thanks anyway for any help
Most of the cross-platform Python GUI toolkits, such as wxPython, PySide, etc., have an HTML display window that you can use to show the HTML source from within your Python code. I would recommend using one of those to display your content.
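For instance, a minimal sketch using wxPython's HtmlWindow (assuming html_source holds the page source fetched by the code above; note HtmlWindow only renders a basic subset of HTML):
import wx
import wx.html

app = wx.App(False)
frame = wx.Frame(None, title="Page source preview")
view = wx.html.HtmlWindow(frame)
view.SetPage(html_source)  # render the HTML string fetched earlier
frame.Show()
app.MainLoop()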
You are probably going to need to make more than one request to get the CAPTCHA. Get yourself a copy of Fiddler 2 (free): http://fiddler2.com/get-fiddler. It will allow you to see the "conversation" between the server and your browser. Once you see that, you will probably know what you need.

Python urllib POST question

I'm making a simple Python POST script, but it's not working well.
There are two parts to the login.
The first login uses 'http://mybuddy.buddybuddy.co.kr/userinfo/UserInfo.asp',
and the second login uses 'http://user.buddybuddy.co.kr/usercheck/UserCheckPWExec.asp'.
I can log in on the first login page, but I couldn't log in on the second page; it returns an error like 'illegal access'.
I heard this is related to cookies, but I don't know how to implement a fix for this problem.
If anyone can help me it would be much appreciated!! Thanks!
import re,sys,os,mechanize,urllib,time
import datetime,socket
params = urllib.urlencode({'ID':'ph896011', 'PWD':'pk1089' })
rq = mechanize.Request("http://mybuddy.buddybuddy.co.kr/userinfo/UserInfo.asp", params)
rs = mechanize.urlopen(rq)
data = rs.read()
logged_fail = r';history.back();</script>' in data
if not logged_fail:
    print 'login success'
    try:
        params = urllib.urlencode({'PASSWORD':'pk1089'})
        rq = mechanize.Request("http://user.buddybuddy.co.kr/usercheck/UserCheckPWExec.asp", params)
        rs = mechanize.urlopen(rq)
        data = rs.read()
        print data
    except:
        print 'error'
Can't you use Selenium? IMHO it's better to do automation with it.
To install it, run:
pip install selenium
An example:
from selenium import webdriver
browser = webdriver.Firefox()
# open site
browser.get('http://google.com.br')
# get page source
browser.page_source
A login example:
# different methods to get a html item
form = browser.find_element_by_tag_name('form')
username = browser.find_element_by_id('input_username')
password = browser.find_element_by_css_selector('input[type=password]')
username.send_keys('myUser')
password.send_keys('myPass')
form.submit()
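Alternatively, if you want to stay with urllib/mechanize, the 'illegal access' error on the second POST is most likely due to the missing session cookie mentioned in the question; here is a minimal sketch using one opener with a shared cookie jar (URLs and form fields taken from the question's code):
import urllib, urllib2, cookielib

# One opener with a CookieJar, so the session cookie set by the first
# login is sent along automatically with the second request.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

params = urllib.urlencode({'ID': 'ph896011', 'PWD': 'pk1089'})
opener.open("http://mybuddy.buddybuddy.co.kr/userinfo/UserInfo.asp", params)

params = urllib.urlencode({'PASSWORD': 'pk1089'})
data = opener.open("http://user.buddybuddy.co.kr/usercheck/UserCheckPWExec.asp", params).read()
print data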
