I'm currently doing a site search like : site:somedomain.com into BING using Python and Mechanize.
It is submitting fine to bing and returning an output - looks like Json? I can't seem to figure out a good way to further parse the results. Is it JSON?
I'm getting an output like:
Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite - Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite - Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite - Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])
I want to get all the urls like:
http://www.somesite.com/prof.php?pID=478
http://www.somesite.com/prof.php?pID=527
http://www.somesite.com/prof.php?pID=645
and so on, so the url attribute within
How can I further do this with mechanize within my code? Keep in mind, some urls in the future might look like:
http://www.anothersite.com/dir/dir/dir/send.php?pID=100
Thank you !
Well mechanize is more a browser like package for Python, for parsing HTML/XML I would recommend Lxml, you can feed that data to lxml and look for urls. Another option is to use regular expressions to look for urls, this approach would be more flexible.
import re
url_regex = re.compile('http:[^\']+')
urls = re.findall(url_regex, html_text)
Edit:
Well instead of printing output, just pass output instead of html_text in re.findall() and then print urls
Using Microsoft's Azure Datamarket API with Python requests, you can request JSON strings directly:
import requests, urllib
q = u'Hello World'
q = urllib.quote(q.encode('utf8'), '')
req = requests.get(
u'https://api.datamarket.azure.com/Data.ashx/Bing/SearchWeb/Web?$format=JSON&Query=%%27%s%%27' % q,
auth=('', u'YOU_API_KEY')
)
# print req.json()
results = req.json()['d']['results']
list_of_urls = [ r['Url'] for r in results]
Depending on your input data you may or may not need the .encode('utf8') part of "q". The "site:xy.com" query should work, too, but I didn't test this. Additionally, we occasionally had some strange encodings returned from Bing ... so we had to re-encode returned URLs like so:
url = r['Url'].encode('latin1')
But those were really special cases ...
You need to register for the Azure API (free of charge) and up to 5000 Bing search requests per month are free: http://datamarket.azure.com/dataset/bing/search
There are several params to fine tune your results: http://datamarket.azure.com/dataset/bing/search#schema
Related
Since Yahoo finance updated their website. some tables seem to be created dynamically and not actually stored in the HTML (I used to get this information using BeautifulSoup, urllib but this won't work anymore). I am after the Analyst tables for example ADP specifically the Earnings Estimates for Year Ago EPS (Current Year Column). You cannot get this information from the API.
I found this link which works well for the Analyst Recommendations Trends. does anyone know how to do something similar for the main table on this page? (LINK:
python lxml etree applet information from yahoo )
I tried to follow the steps taken but frankly its beyond me.
returning the whole table is all I need I can pick out bits from there. cheers
In order to get that data, you need to open Chrome DevTools and select Network tab with XHR filter. If you click on ADP request you can see link in RequestUrl.
You can use Requests library for making a request and parsing json response from the site.
import requests
from pprint import pprint
url = 'https://query1.finance.yahoo.com/v10/finance/quoteSummary/ADP?formatted=true&crumb=ILlIC9tOoXt&lang=en-US®ion=US&modules=upgradeDowngradeHistory%2CrecommendationTrend%2CfinancialData%2CearningsHistory%2CearningsTrend%2CindustryTrend%2CindexTrend%2CsectorTrend&corsDomain=finance.yahoo.com'
r = requests.get(url).json()
pprint(r)
further to volds answer above and using the answer in the link I posted above. (credit to saaj). This gives just the dataset I need and is neater when calling the module. I am not sure what the parameter crumb is but, it seems to work ok without it.
import json
from pprint import pprint
from urllib.request import urlopen
from urllib.parse import urlencode
def parse():
host = 'https://query1.finance.yahoo.com'
#host = 'https://query2.finance.yahoo.com' # try if above doesn't work
path = '/v10/finance/quoteSummary/%s' % 'ADP'
params = {
'formatted' : 'true',
#'crumb' : 'ILlIC9tOoXt',
'lang' : 'en-US',
'region' : 'US',
'modules' : 'earningsTrend',
'domain' : 'finance.yahoo.com'
}
response = urlopen('{}{}?{}'.format(host, path, urlencode(params)))
data = json.loads(response.read().decode())
pprint(data)
if __name__ == '__main__':
parse()
Other modules (just add a comma between them):
assetProfile
financialData
defaultKeyStatistics
calendarEvents
incomeStatementHistory
cashflowStatementHistory
balanceSheetHistory
recommendationTrend
upgradeDowngradeHistory
earningsHistory
earningsTrend
industryTrend
In GitHub, c0redumb has proposed a whole solution. You can download the yqd.py. After import it, you can get Yahoo finance data by one line of code, as blew.
import yqd
yf_data = yqd.load_yahoo_quote('GOOG', '20170722', '20170725')
The result 'yf_data' is:
['Date,Open,High,Low,Close,Adj Close,Volume',
'2017-07-24,972.219971,986.200012,970.770020,980.340027,980.340027,3248300',
'2017-07-25,953.809998,959.700012,945.400024,950.700012,950.700012,4661000',
'']
I am trying to crawl wordreference, but I am not succeding.
The first problem I have encountered is, that a big part is loaded via JavaScript, but that shouldn't be much problem because I can see what I need in the source code.
So, for example, I want to extract for a given word, the first two meanings, so in this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grĂșa.
This is my code:
import lxml.html as lh
import urllib2
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse((urllib2.urlopen(url)))
trans = doc.xpath('//td[#class="ToWrd"]/text()')
for i in trans:
print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, no success. I am not sure what is going on, the only way I have been able to crawl it is using curl, but that is sloopy, I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get("http://www.wordreference.com/es/translation.asp?tranword=crane")
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[#class="ToWrd"]/text()')
for i in trans:
print(i)
Prints:
grulla
grĂșa
plataforma
...
grulla blanca
grulla trompetera
I have a colleague having the task submitting gene sequences of Hepatitis C viruses from patient samples into a request form of a specific website which then identifies mutations which provides information about potential drug resistance.
This is very cumbersome and takes days.
My thought is to automate this with a Python script using urllib2 (I cannot use mechanize, I have to develop on MAC OS and for reasons I not understand neither Python setup.py install nor pip mechanize install work - so I am bound to urllib2).
My first try was to access the respective website and first submit a sample gene sequence. (On the original website you simple paste the sequence into an entry field named "or paste in" and then press "go".)
On the next page, you will get the result and I want to read out the mutations via regular expressions.
My first try:
import url lib
import urllib2
url = 'http://hcv.geno2pheno.org/index.php'
form_data = {'or paste in:': 'CTTCACGGAGGCTATGACGAGGTACTCCGCTCCCCCCGGGGACCCCCCCCAACCAGAATACGACTTGGAGCTCATAACATCGTGCTCCTCTAACGTGTCAGTCGCCCACGACGGCGCTGGAAAAAGGGTCTACTACCTTACCCGTGACCCTACAACCCCCCTCGCAAGAGCTGCGTGGGAGACAGCAAGACACACTCCAGTCAATTCCTGGCTAGGCAACATAATCATGTTTGCCCCCACATTGTGGGCGAGAATGATACTGATGACCCACTTCTTCAGTGTCCTCATCGCCAGGGATCAACTTGAACAGGCCCTTGATTGCGAAATCTACGGAGCCTGCTACTCCATTCAACCACTGGACCTACCTCCAATCATTCAAAGACTCCATGGCCTTAGCGCATTTTCACTCCACAGTTACTCTCCAGGTGAAATCAATAGGGTGGCCGCATGCCTCAGGAAACTTGGGGTCCCGCCCTTGCGAGCTTGGAGACACCGGGCCCGGAGCGTCCGCGCTAAGCTTCTGTCCAGAGGAGGCAGGGCTGCCATATGTGGCAAGTACCTCTTCAATTGGGCAGTAAGAACAAAGCTCAAACTCACTCCAATAGCGGCCGCTGGCCAGCTGGACTTGTCCGGCTGGTTCACGGCTGGCTACAGCGGGGGAGACATTTATCACAGCGTGTCTC'}
params = urllib.urlencode(form_data)
response = urllib2.urlopen(url, params)
data = response.read()
print data
What I get from "data" is the source code from http://hcv.geno2pheno.org/index.php and not from the following result page.
Therefore, I have two questions:
1) How can I be sure that my sequence was pasted into the entry field "or paste in:" properly?
2) How do I access the source code of the result page so I can apply regular expressions?
There are a couple of things going wrong here. First, you need more parameters in your form_data dict. Just because you only manually fill in one field doesn't mean that's the only parameter the server needs to complete your request. I've included a form_data dict that worked for me below. The main key you're concerned with is 'v3seq'. This is the sequence you want to "paste in".
Then, when you're requesting the page, you need to use a Request object and read the response of that request. Looks like this:
import urllib
import urllib2
url = 'http://hcv.geno2pheno.org/index.php'
form_data = {
'v3seq': 'CTTCACGGAGGCTATGACGAGGTACTCCGCTCCCCCCGGGGACCCCCCCCAACCAGAATACGACTTGGAGCTCATAACATCGTGCTCCTCTAACGTGTCAGTCGCCCACGACGGCGCTGGAAAAAGGGTCTACTACCTTACCCGTGACCCTACAACCCCCCTCGCAAGAGCTGCGTGGGAGACAGCAAGACACACTCCAGTCAATTCCTGGCTAGGCAACATAATCATGTTTGCCCCCACATTGTGGGCGAGAATGATACTGATGACCCACTTCTTCAGTGTCCTCATCGCCAGGGATCAACTTGAACAGGCCCTTGATTGCGAAATCTACGGAGCCTGCTACTCCATTCAACCACTGGACCTACCTCCAATCATTCAAAGACTCCATGGCCTTAGCGCATTTTCACTCCACAGTTACTCTCCAGGTGAAATCAATAGGGTGGCCGCATGCCTCAGGAAACTTGGGGTCCCGCCCTTGCGAGCTTGGAGACACCGGGCCCGGAGCGTCCGCGCTAAGCTTCTGTCCAGAGGAGGCAGGGCTGCCATATGTGGCAAGTACCTCTTCAATTGGGCAGTAAGAACAAAGCTCAAACTCACTCCAATAGCGGCCGCTGGCCAGCTGGACTTGTCCGGCTGGTTCACGGCTGGCTACAGCGGGGGAGACATTTATCACAGCGTGTCTC',
'H77Switch': '1',
'ignore_sgtSwitch': '1',
'alignwidth': '3',
'action': '1',
'go': 'Go',
'viewResults': '1',
'viewResSec': 'Prediction'
}
data = urllib.urlencode(form_data)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html_data = response.read()
You can then scrape the data from the response and apply your regular expressions. If you're able to get your pip working, I would also suggest taking a look at BeautifulSoup - it's an excellent library for scraping data from html.
I have tried this before. I'm completely at a loss for ideas.
On this page this dialog box to qet quotes.
http://www.schwab.com/public/schwab/non_navigable/marketing/email/get_quote.html?
I used SPY, XLV, IBM, MSFT
The output is the above with a table.
If you have an account the quote are real time --- via cookie.
How do I get the table into python using 2.6. The data as list or dictionary
Use something like Beautiful Soup to parse the HTML response from the web site and load it into a dictionary. use the symbol as the key and a tuple of whatever data you're interested in as the value. Iterate over all the symbols returned and add one entry per symbol.
You can see examples of how to do this in Toby Segaran's "Programming Collective Intelligence". The samples are all in Python.
First problem: the data is actually in an iframe in a frame; you need to be looking at https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=APC (where you substitute the appropriate symbol on the end of the URL).
Second problem: extracting the data from the page. I personally like lxml and xpath, but there are many packages which will do the job. I would probably expect some code like
import urllib2
import lxml.html
import re
re_dollars = '\$?\s*(\d+\.\d{2})'
def urlExtractData(url, defs):
"""
Get html from url, parse according to defs, return as dictionary
defs is a list of tuples ("name", "xpath", "regex", fn )
name becomes the key in the returned dictionary
xpath is used to extract a string from the page
regex further processes the string (skipped if None)
fn casts the string to the desired type (skipped if None)
"""
page = urllib2.urlopen(url) # can modify this to include your cookies
tree = lxml.html.parse(page)
res = {}
for name,path,reg,fn in defs:
txt = tree.xpath(path)[0]
if reg != None:
match = re.search(reg,txt)
txt = match.group(1)
if fn != None:
txt = fn(txt)
res[name] = txt
return res
def getStockData(code):
url = 'https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=' + code
defs = [
("stock_name", '//span[#class="header1"]/text()', None, str),
("stock_symbol", '//span[#class="header2"]/text()', None, str),
("last_price", '//span[#class="neu"]/text()', re_dollars, float)
# etc
]
return urlExtractData(url, defs)
When called as
print repr(getStockData('MSFT'))
it returns
{'stock_name': 'Microsoft Corp', 'last_price': 25.690000000000001, 'stock_symbol': 'MSFT:NASDAQ'}
Third problem: the markup on this page is presentational, not structural - which says to me that code based on it will likely be fragile, ie any change to the structure of the page (or variation between pages) will require reworking your xpaths.
Hope that helps!
Have you thought of using yahoo's quotes api?
see: http://developer.yahoo.com/yql/console/?q=show%20tables&env=store://datatables.org/alltableswithkeys#h=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20%3D%20%22YHOO%22
You will be able to dynamically generate a request to the website such as:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20%3D%20%22YHOO%22&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys
And just poll it with standard a http GET request. The response is in XML format.
matplotlib has a module that gets historical quotes from Yahoo:
>>> from matplotlib.finance import quotes_historical_yahoo
>>> from datetime import date
>>> from pprint import pprint
>>> pprint(quotes_historical_yahoo('IBM', date(2010, 11, 12), date(2010, 11, 18)))
[(734088.0,
144.59,
143.74000000000001,
145.77000000000001,
143.55000000000001,
4731500.0),
(734091.0,
143.88999999999999,
143.63999999999999,
144.75,
143.27000000000001,
3827700.0),
(734092.0,
142.93000000000001,
142.24000000000001,
143.38,
141.18000000000001,
6342100.0),
(734093.0,
142.49000000000001,
141.94999999999999,
142.49000000000001,
141.38999999999999,
4785900.0)]
Looking for a python script that would simply connect to a web page (maybe some querystring parameters).
I am going to run this script as a batch job in unix.
urllib2 will do what you want and it's pretty simple to use.
import urllib
import urllib2
params = {'param1': 'value1'}
req = urllib2.Request("http://someurl", urllib.urlencode(params))
res = urllib2.urlopen(req)
data = res.read()
It's also nice because it's easy to modify the above code to do all sorts of other things like POST requests, Basic Authentication, etc.
Try this:
aResp = urllib2.urlopen("http://google.com/");
print aResp.read();
If you need your script to actually function as a user of the site (clicking links, etc.) then you're probably looking for the python mechanize library.
Python Mechanize
A simple wget called from a shell script might suffice.
in python 2.7:
import urllib2
params = "key=val&key2=val2" #make sure that it's in GET request format
url = "http://www.example.com"
html = urllib2.urlopen(url+"?"+params).read()
print html
more info at https://docs.python.org/2.7/library/urllib2.html
in python 3.6:
from urllib.request import urlopen
params = "key=val&key2=val2" #make sure that it's in GET request format
url = "http://www.example.com"
html = urlopen(url+"?"+params).read()
print(html)
more info at https://docs.python.org/3.6/library/urllib.request.html
to encode params into GET format:
def myEncode(dictionary):
result = ""
for k in dictionary: #k is the key
result += k+"="+dictionary[k]+"&"
return result[:-1] #all but that last `&`
I'm pretty sure this should work in either python2 or python3...
What are you trying to do? If you're just trying to fetch a web page, cURL is a pre-existing (and very common) tool that does exactly that.
Basic usage is very simple:
curl www.example.com
You might want to simply use httplib from the standard library.
myConnection = httplib.HTTPConnection('http://www.example.com')
you can find the official reference here: http://docs.python.org/library/httplib.html