Since Yahoo Finance updated their website, some tables seem to be created dynamically rather than stored in the HTML (I used to get this information using BeautifulSoup and urllib, but that no longer works). I am after the Analyst tables, for example for ADP, specifically the Earnings Estimates for Year Ago EPS (Current Year column). You cannot get this information from the API.
I found this link, which works well for the Analyst Recommendation Trends. Does anyone know how to do something similar for the main table on that page? (Link: python lxml etree applet information from yahoo)
I tried to follow the steps taken, but frankly it's beyond me.
Returning the whole table is all I need; I can pick out the bits from there. Cheers.
In order to get that data, you need to open Chrome DevTools and select the Network tab with the XHR filter. If you click on the ADP request you can see the link under Request URL.
You can use the Requests library to make the request and parse the JSON response from the site.
import requests
from pprint import pprint
url = 'https://query1.finance.yahoo.com/v10/finance/quoteSummary/ADP?formatted=true&crumb=ILlIC9tOoXt&lang=en-US&region=US&modules=upgradeDowngradeHistory%2CrecommendationTrend%2CfinancialData%2CearningsHistory%2CearningsTrend%2CindustryTrend%2CindexTrend%2CsectorTrend&corsDomain=finance.yahoo.com'
r = requests.get(url).json()
pprint(r)
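If you only need a specific figure rather than the whole response, you can drill into the returned dictionary. Below is a minimal sketch, assuming the usual quoteSummary layout (a 'result' list with one dict per requested module, and an 'earningsTrend' module containing a 'trend' list); the exact key names may have changed since:
import requests
from pprint import pprint

url = ('https://query1.finance.yahoo.com/v10/finance/quoteSummary/ADP'
       '?formatted=true&lang=en-US&region=US'
       '&modules=earningsTrend&corsDomain=finance.yahoo.com')
r = requests.get(url).json()

# assumed structure: quoteSummary -> result -> [0] -> earningsTrend -> trend
trend = r['quoteSummary']['result'][0]['earningsTrend']['trend']
for entry in trend:
    # each entry covers one period, e.g. '0y' is the current-year estimate;
    # 'yearAgoEps' should sit inside 'earningsEstimate' (assumption)
    pprint((entry.get('period'), entry.get('earningsEstimate')))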
Further to vold's answer above, and using the answer in the link I posted above (credit to saaj), this gives just the dataset I need and is neater when calling the module. I am not sure what the crumb parameter is, but it seems to work fine without it.
import json
from pprint import pprint
from urllib.request import urlopen
from urllib.parse import urlencode
def parse():
    host = 'https://query1.finance.yahoo.com'
    # host = 'https://query2.finance.yahoo.com'  # try this one if the above doesn't work
    path = '/v10/finance/quoteSummary/%s' % 'ADP'
    params = {
        'formatted': 'true',
        # 'crumb': 'ILlIC9tOoXt',
        'lang': 'en-US',
        'region': 'US',
        'modules': 'earningsTrend',
        'domain': 'finance.yahoo.com'
    }
    response = urlopen('{}{}?{}'.format(host, path, urlencode(params)))
    data = json.loads(response.read().decode())
    pprint(data)

if __name__ == '__main__':
    parse()
Other modules you can request (just add a comma between them in the modules parameter; see the sketch after this list):
assetProfile
financialData
defaultKeyStatistics
calendarEvents
incomeStatementHistory
cashflowStatementHistory
balanceSheetHistory
recommendationTrend
upgradeDowngradeHistory
earningsHistory
earningsTrend
industryTrend
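For example, a small sketch requesting several modules at once with the parse() approach above (the quoteSummary/result wrapper in the last line is an assumption about the response layout):
import json
from pprint import pprint
from urllib.request import urlopen
from urllib.parse import urlencode

host = 'https://query1.finance.yahoo.com'
path = '/v10/finance/quoteSummary/ADP'
params = {
    'formatted': 'true',
    'lang': 'en-US',
    'region': 'US',
    # several modules, comma-separated; urlencode() escapes the commas for you
    'modules': 'earningsTrend,financialData,defaultKeyStatistics',
    'domain': 'finance.yahoo.com'
}
response = urlopen('{}{}?{}'.format(host, path, urlencode(params)))
data = json.loads(response.read().decode())
pprint(list(data['quoteSummary']['result'][0].keys()))  # which modules came back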
On GitHub, c0redumb has proposed a whole solution. You can download yqd.py; after importing it, you can get Yahoo Finance data with one line of code, as below.
import yqd
yf_data = yqd.load_yahoo_quote('GOOG', '20170722', '20170725')
The result 'yf_data' is:
['Date,Open,High,Low,Close,Adj Close,Volume',
'2017-07-24,972.219971,986.200012,970.770020,980.340027,980.340027,3248300',
'2017-07-25,953.809998,959.700012,945.400024,950.700012,950.700012,4661000',
'']
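Since load_yahoo_quote() returns the CSV rows as plain strings, it can be handy to turn them into a pandas DataFrame. A minimal sketch, assuming yf_data looks like the list above:
import io
import pandas as pd

# yf_data is the list of CSV strings returned by yqd.load_yahoo_quote() above
df = pd.read_csv(io.StringIO('\n'.join(yf_data)), parse_dates=['Date'])
print(df[['Date', 'Close', 'Volume']])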
Given a company ticker or name, I would like to get its sector using Python.
I have already tried several potential solutions, but none has worked successfully.
The two most promising are:
1) Using the script from: https://gist.github.com/pratapvardhan/9b57634d57f21cf3874c
from urllib import urlopen
from lxml.html import parse

'''
Returns a tuple (Sector, Industry)
Usage: GFinSectorIndustry('IBM')
'''
def GFinSectorIndustry(name):
    tree = parse(urlopen('http://www.google.com/finance?&q=' + name))
    return tree.xpath("//a[@id='sector']")[0].text, tree.xpath("//a[@id='sector']")[0].getnext().text
However, I am using Python 3.8.
I have been able to tweak this solution, but the last line is not working, and since I am completely new to scraping web pages I would appreciate any suggestions.
Here is my current code:
from urllib.request import Request, urlopen
from lxml.html import parse
name="IBM"
req = Request('http://www.google.com/finance?&q='+name, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req)
tree = parse(webpage)
But then the last part is not working, and I am very new to this XPath syntax:
tree.xpath("//a[@id='sector']")[0].text, tree.xpath("//a[@id='sector']")[0].getnext().text
2) The other option was embedding R's TTR package as shown here: Find which sector a stock belongs to
However, I want to run it within my Jupyter notebook, and ss <- stockSymbols() just takes ages to run.
Following your comment, for marketwatch.com/investing/stock specifically, the XPath that is likely to work is "//div[@class='intraday__sector']/span[@class='label']", meaning that doing
tree.xpath("//div[@class='intraday__sector']/span[@class='label']")[0].text
should return the desired information.
I am completely new to scraping web pages [...]
A few clarifications:
This XPath depends entirely on the website you are looking at, which is why there was no hope in searching for "//a[@id='sector']" in the page you mention in the comments: that XPath (now outdated) was specific to Google Finance. Put differently, you first need to "study" the page you are interested in to find out where the information you want is located.
To conduct such a "study" I use Chrome DevTools and check any XPath in the console by doing $x(<your-xpath-of-interest>), where the $x function is documented here (with examples!).
Luckily for you, the information you want to get from marketwatch.com/investing/stock -- the sector's name -- is statically generated (i.e. not dynamically generated at page load, in which case other scraping techniques would have been required, resorting to other Python libraries such as Selenium, but that is another question).
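Putting the pieces together for marketwatch.com, a minimal sketch (the URL pattern and the class names are assumptions based on inspecting the page, and the site may refuse automated requests):
from urllib.request import Request, urlopen
from lxml.html import parse

name = 'IBM'
url = 'https://www.marketwatch.com/investing/stock/' + name.lower()  # assumed URL pattern
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
tree = parse(urlopen(req))

# xpath taken from the DevTools inspection described above
sector = tree.xpath("//div[@class='intraday__sector']/span[@class='label']")
if sector:
    print(sector[0].text)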
You can easily obtain the sector for any given company/ticker with Yahoo Finance:
import yfinance as yf

tickerdata = yf.Ticker('TSLA')  # the ticker symbol for Tesla
print(tickerdata.info['sector'])
The code returns: 'Consumer Cyclical'
If you want other information about the company/ticker, just print(tickerdata.info) to see all the other possible dictionary keys and their corresponding values, like the ['sector'] key used in the code above.
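For instance, a quick sketch that lists the available keys and pulls out a few of them (keys such as 'industry' and 'marketCap' are typically present but can vary by ticker, so .get() is used defensively):
import yfinance as yf

info = yf.Ticker('TSLA').info
print(sorted(info.keys()))  # every key available for this ticker

for key in ('sector', 'industry', 'marketCap'):
    print(key, '->', info.get(key))  # .get() avoids a KeyError if a key is missing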
To answer the question:
How to obtain stock market company sector from ticker or company name in python?
I had to find a workaround after reading some material and some nice suggestions from @keepAlive.
The following does the job in a reverse way, i.e. it gets the companies for a given sector. There are 10 sectors, so it is not too much work if you want the info for all of them: https://www.stockmonitor.com/sectors/
Given that marketwatch.com/investing/stock was throwing a 405 error, I decided to use https://www.stockmonitor.com/sectors/, for example:
https://www.stockmonitor.com/sector/healthcare/
Here is the code:
import requests
import pandas as pd
from lxml.html import parse
from urllib.request import Request, urlopen
headers = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3)" + " "
    "AppleWebKit/537.36 (KHTML, like Gecko)" + " " + "Chrome/35.0.1916.47" +
    " " + "Safari/537.36"
]

url = 'https://www.stockmonitor.com/sector/healthcare/'
headers_dict = {'User-Agent': headers[0]}
req = Request(url, headers=headers_dict)
webpage = urlopen(req)
tree = parse(webpage)

healthcare_tickers = []
for element in tree.xpath("//tbody/tr/td[@class='text-left']/a"):
    healthcare_tickers.append(element.text)

pd.Series(healthcare_tickers)
Thus, healthcare_tickers has the stock companies in the healthcare sector.
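If you need the ticker-to-sector mapping itself, you can run the same scrape for each sector page and invert it. A sketch, where the sector slugs are assumptions taken from https://www.stockmonitor.com/sectors/ and may need adjusting:
import pandas as pd
from lxml.html import parse
from urllib.request import Request, urlopen

sectors = ['healthcare', 'technology', 'energy', 'utilities', 'financial']  # assumed slugs

ticker_to_sector = {}
for sector in sectors:
    url = 'https://www.stockmonitor.com/sector/{}/'.format(sector)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    tree = parse(urlopen(req))
    for element in tree.xpath("//tbody/tr/td[@class='text-left']/a"):
        ticker_to_sector[element.text] = sector

sector_lookup = pd.Series(ticker_to_sector)
print(sector_lookup.get('JNJ'))  # look up a single ticker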
I am trying to extract specific data from a requested JSON file.
After passing the Authorization header and using requests.get I got my response; I think it is called a dictionary by Python coders and JSON by JavaScript coders.
It contains far more information than I need, and I would like to extract only one or two fields,
for example {"bio": "hello world"},
and the JSON contains more than one "bio". For example, I am scraping 100 accounts and I would like to extract all the "bio" values in one go.
So I tried this:
from bs4 import BeautifulSoup
import requests

headers = {"Authorization": "xxxx"}
req = requests.get('website', headers=headers)
data = req.text
soup = BeautifulSoup(data, 'html.parser')
titles = soup.find_all('span', {'class': 'bio'})
for title in titles:
    print(title.text)
That didn't work, and I have tried multiple other ideas with no success.
If possible, please write code that I can understand, since I am trying to learn from my mistakes.
Thanks.
The Aphid library I created is perfect for this.
From the command prompt:
py -m pip install Aphid
Then it's just as easy as loading your JSON data and searching it with Aphid.
import json
import requests
import Aphid

resp = requests.get(yoururl)
data = json.loads(resp.text)
results = Aphid.findall(data, 'bio')
results is now a list of (key, value) tuples, one for every occurrence of the 'bio' key.
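For example, to keep just the values (a small sketch continuing from the snippet above):
# every occurrence of the 'bio' key, wherever it sits in the nested JSON
for key, value in results:
    print(value)

bios = [value for _, value in results]  # or collect them into a plain list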
After you get your response, either:
you get a plain JSON file, in which case you load it into Python with the json module, or
you get an HTML file from which you can extract the embedded JSON (using BeautifulSoup), which in turn you parse with the json library.
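A rough sketch of both branches (the 'website' URL and the script id used in the HTML branch are placeholders, and the final list comprehension assumes the JSON is a list of account dicts):
import json
import requests
from bs4 import BeautifulSoup

headers = {'Authorization': 'xxxx'}
resp = requests.get('website', headers=headers)

try:
    # case 1: the response body is already JSON
    data = resp.json()
except ValueError:
    # case 2: the response is HTML and the JSON is embedded somewhere inside it
    soup = BeautifulSoup(resp.text, 'html.parser')
    script = soup.find('script', id='data-json')  # placeholder selector
    data = json.loads(script.string)

if isinstance(data, list):
    bios = [account.get('bio') for account in data]
    print(bios)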
I am trying to save the data at the following URL as triples in a triple store for future querying. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
url='http://gnafld.net/address/?per_page=10&page=7'
page = requests.get(url)
response = requests.get(url)
response.raise_for_status()
results = re.findall('\"Address ID: (GAACT[0-9]+)\"', response.text)
address1=results[0]
a = "http://gnafld.net/address/"
new_url = a + address1
r = requests.get(new_url).content
print(r)
After I run the code above, I get output like this:
[screenshot in the original post: the raw RDF triples returned for that address]
My question is how to insert this RDF data into a Fuseki server SPARQL endpoint. I tried code like this:
import rdflib
from rdflib.plugins.stores import sparqlstore

# the following SPARQL endpoint is provided by the GNAF website
endpoint = 'http://gnafld.net/sparql'
store = sparqlstore.SPARQLUpdateStore(endpoint)
gs = rdflib.ConjunctiveGraph(store)
gs.open((endpoint, endpoint))

for stmt in r:
    gs.add(stmt)
But it seems that it does not work. How can I fix this problem? Thanks for your help!
The answer you show in the image is in RDF triple format; it is just not pretty-printed.
To store the RDF data in an RDF store you can use RDFLib. Here is an example of how to do that.
If you use a Jena Fuseki server, you should be able to access it from Python just as you would access any other SPARQL endpoint.
You may want to see my answer to a related SO question as well.
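As a rough sketch of that route, you could parse the fetched RDF with RDFLib and push it to Fuseki over the Graph Store Protocol (the dataset name, the local endpoint URL and the 'turtle' format are all assumptions here):
import requests
from rdflib import Graph

# parse the RDF returned for the address (new_url comes from the question's code)
g = Graph()
g.parse(data=requests.get(new_url).text, format='turtle')  # format is an assumption

# push the triples into a local Fuseki dataset (hypothetical dataset name 'gnaf')
fuseki_data_endpoint = 'http://localhost:3030/gnaf/data'
resp = requests.post(fuseki_data_endpoint,
                     data=g.serialize(format='turtle'),
                     headers={'Content-Type': 'text/turtle'})
resp.raise_for_status()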
Forgive me if I come straight out with it, but Python is driving me nuts with something that seemed quite simple.
In a nutshell
I'm writing an extension for a music video scraper which is responsible for getting the fanart backdrop.
Here is the URL:
github.com/MViDLibraryToolKit/.../APICaller
So I was able to call the fanart.tv API and receive the right JSON response. My problem is that I'm too dumb to collect the URLs under the "artistbackground" element.
I searched the internet and found a very similar post here on Stack Overflow, but unluckily it concerned Python 2, API v2 and a different category on fanart.tv, so I was not able to make use of it. Here it is:
Anyway, here is my poor try at collecting the URLs into a list:
# --------------------- Response processing
# debug output
# print(fanartTVresp)
# http://webservice.fanart.tv/v3/music/albums/ba853904-ae25-4ebb-89d6-c44cfbd71bd2?api_key=fdadba00cfaaf3621eaa748669256a9e&client_key=dce01d75553d7e3fbc2ad742aaf5d371

# list to be filled
url_list = []

# load the web response
json_response = json.loads(fanartTVresp)

# loop over the artistbackground element
for artistbackground in json_response:
    url = urllib.parse.quote(['url'], ':/')
    if url:
        url_list.append(url)

print(url_list)
The libs I loaded...
import musicbrainzngs
import urllib
import json
import socket
from pprint import pprint
from urllib.parse import quote
The rest of the code you can find at my GitHub link. Please help me, it's driving me crazy ^^
Kind regards
P.S. Please excuse my English, I come from Germany :)
I think I finally got it.
# URL list for background images
url_list = []

# set only for debug / the value comes from the PowerShell runtime later
location = os.path.abspath('C:/temp')

# decode the JSON
json_response = json.loads(fanartTVresp.decode())

# the artistbackground entries and, for reference, the first URL
bgitem = json_response["artistbackground"]
bgcoverurl = json_response["artistbackground"][0]["url"]

# iterate over the items and collect the URLs
for bg in bgitem:
    url_list.append(bg["url"])  # each entry is a dict; take its 'url' field

print(url_list)
After getting some hours of sleep, I realized that json.loads deserializes the response into regular Python objects. Correct me if I'm wrong.
Anyway, it finally works!
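For reference, json.loads() does indeed deserialize the JSON text into plain Python dicts and lists, which is why the "artistbackground" value can be iterated directly. A tiny illustration with made-up values:
import json

sample = '{"artistbackground": [{"id": "1", "url": "http://example.com/a.jpg"}]}'
parsed = json.loads(sample)

print(type(parsed))                        # <class 'dict'>
print(type(parsed['artistbackground']))    # <class 'list'>
print(parsed['artistbackground'][0]['url'])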
I am trying to crawl WordReference, but I am not succeeding.
The first problem I encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract the first two meanings for a given word; in this URL: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with Scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is with curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much.
It looks like you need a User-Agent header to be sent; see Changing user agent on urllib2.urlopen.
Also, just switching to requests would do the trick (it automatically sends a python-requests/version User-Agent by default):
import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera