Web Scraping with Python - Satellites

I'm trying to get data on satellite positions several times a day from https://www.n2yo.com/. The satellite I'm focused on is MUOS 5. My problem is that I'm not able to get at any of the data that changes in the table.
from bs4 import BeautifulSoup
import requests
import csv

# the original snippet used `soup` without defining it; the request below is an
# assumption based on the satellite page mentioned in the question (MUOS 5, NORAD 41622)
page = requests.get('https://www.n2yo.com/satellite/?s=41622')
soup = BeautifulSoup(page.text, 'html.parser')

info = soup.find('div', class_='container-main')
info = info.find('div', id='trackinginfo')
info = info.find('div', id='paneldata')
info = info.find('table', id='tabledata')
info = info.find('tr')
print(info)
I expect to see the information shown below, with 41622 in the second column, but I don't know how to access only the second td:
<tr>
<td>NORAD ID:</td><td><div id="noradid"></div></td>
</tr>
Any help/direction would be appreciated.

This is probably not a complete answer, but I believe it will get you started.
The page you linked to is loaded dynamically with JavaScript, so BeautifulSoup can't handle it. The data itself is located at another URL (see below - it can be found through the Network tab of your browser's developer tools) and, since it's in JSON format, it can be loaded into Python.
The JSON contains historical information, and the most recent item is located at the end of the JSON string. Once you have that, you can extract the relevant data from it.
As you'll see below, I managed to match some of the dynamic values to field names, but I'm not really familiar with the terminology, so you will probably have to do some extra work to complete it. But, as I said, it will at least get you started:
import requests
import json
req = requests.get('https://www.n2yo.com/sat/instant-tracking.php?s=41622&hlat=40.71427&hlng=-74.00597&d=300&r=547647090737.1928&tz=GMT-04:00&O=n2yocom&rnd_str=8fde3fd56c515d8fb110d5145c7df86b&callback=')
data = json.loads(req.text)
heads = ['LATITUDE','LONGITUDE', 'AZIMUTH','ELEVATION','??','DECLINATION','ALTITUDE [km]','???','NORAD ID','ABC','xxx'] #as I said, not sure exactly what's what...
target = list(data[0].values())[-1][-1] #this is the most recent data
dats = [item for item in list(target.values())[0].split('|')]
for v, d in zip(dats, heads):
    print(d, ':', v)
Output:
LATITUDE : -6.18764033
LONGITUDE : -102.54010579
AZIMUTH : 216.12
ELEVATION : 28.65
?? : 242.47221474
DECLINATION : -12.92722520
ALTITUDE [km] : 35625.07
??? : 0.19091700104938
NORAD ID : 41622
ABC : 1598058725
xxx : 0
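If the goal is to log a reading several times a day to CSV (the question imports csv), here is a minimal sketch built on the same instant-tracking URL; the field names are the guesses above, and the r/rnd_str parameters come from the original request and may be session-specific:
import csv
import json
import os
import time
import requests

URL = ('https://www.n2yo.com/sat/instant-tracking.php?s=41622'
       '&hlat=40.71427&hlng=-74.00597&d=300&r=547647090737.1928'
       '&tz=GMT-04:00&O=n2yocom&rnd_str=8fde3fd56c515d8fb110d5145c7df86b&callback=')
HEADS = ['LATITUDE', 'LONGITUDE', 'AZIMUTH', 'ELEVATION', '??',
         'DECLINATION', 'ALTITUDE [km]', '???', 'NORAD ID', 'ABC', 'xxx']

def read_once():
    # fetch the most recent tracking record and return it as a dict
    data = json.loads(requests.get(URL).text)
    target = list(data[0].values())[-1][-1]
    values = list(target.values())[0].split('|')
    return dict(zip(HEADS, values))

def log_readings(path='muos5_log.csv', interval_s=6 * 60 * 60, count=4):
    # append `count` readings to a CSV file, one every `interval_s` seconds
    new_file = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=HEADS)
        if new_file:
            writer.writeheader()
        for _ in range(count):
            writer.writerow(read_once())
            f.flush()
            time.sleep(interval_s)

if __name__ == '__main__':
    log_readings()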
Hopefully this helps.

If I understood your question correctly, you are looking for data on this page: https://www.n2yo.com/satellite/?s=41622
Since, in my opinion, the HTML is not really "made for BeautifulSoup", I would recommend regular expressions:
... </ul></p><br/><B>NORAD ID</B>: 41622 <a ...
import re
import requests

read_data = requests.get('https://www.n2yo.com/satellite/?s=41622').text  # the page HTML

m = re.search(r'\<B\>NORAD ID\<\/B\>: (.*?) \<a', read_data)
print(m.group(1))

Related

How to obtain stock market company sector from ticker or company name in python

Given a company ticker or name I would like to get its sector using Python.
I have already tried several potential solutions but none has worked successfully.
The two most promising are:
1) Using the script from: https://gist.github.com/pratapvardhan/9b57634d57f21cf3874c
from urllib import urlopen
from lxml.html import parse

'''
Returns a tuple (Sector, Industry)
Usage: GFinSectorIndustry('IBM')
'''
def GFinSectorIndustry(name):
    tree = parse(urlopen('http://www.google.com/finance?&q=' + name))
    return tree.xpath("//a[@id='sector']")[0].text, tree.xpath("//a[@id='sector']")[0].getnext().text
However, I am using Python 3.8.
I have been able to tweak this solution, but the last line is not working, and I am completely new to scraping web pages, so I would appreciate it if anyone has suggestions.
Here is my current code:
from urllib.request import Request, urlopen
from lxml.html import parse
name="IBM"
req = Request('http://www.google.com/finance?&q='+name, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req)
tree = parse(webpage)
But then the last part is not working and I am very new to this xpath syntax:
tree.xpath("//a[#id='sector']")[0].text, tree.xpath("//a[#id='sector']")[0].getnext().text
2) The other option was embedding R's TTR package as shown here: Find which sector a stock belongs to
However, I want to run it within my Jupyter notebook, and it is just taking ages to run ss <- stockSymbols()
Following your comment, for marketwatch.com/investing/stock specifically, the xpath that is likely to work is "//div[@class='intraday__sector']/span[@class='label']", meaning that doing
tree.xpath("//div[@class='intraday__sector']/span[@class='label']")[0].text
should return the desired information.
I am completely new to scraping web pages [...]
Some clarifications:
This xpath totally depends on the website you are looking at, which explains why there was no hope in searching for "//a[@id='sector']" in the page you mention in the comments, since that xpath (now outdated) was google-finance specific. Put differently, you first need to "study" the page you are interested in to know where the information you want is located.
To conduct such a "study" I use Chrome DevTools and check any xpath in the console, doing $x(<your-xpath-of-interest>), where the function $x is documented here (with examples!).
Luckily for you, the information you want to get from marketwatch.com/investing/stock -- the sector's name -- is statically generated (i.e. not dynamically generated at page load, in which case other scraping techniques would have been required, resorting to other Python libraries such as Selenium... but this is another question).
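For illustration, here is a minimal sketch of that idea; the ticker page URL pattern and the User-Agent header are assumptions, and marketwatch may change its markup or block requests at any time:
from urllib.request import Request, urlopen
from lxml.html import parse

name = 'IBM'
# assumed URL pattern for a ticker page on marketwatch
url = 'https://www.marketwatch.com/investing/stock/' + name.lower()
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
tree = parse(urlopen(req))

matches = tree.xpath("//div[@class='intraday__sector']/span[@class='label']")
if matches:
    print(matches[0].text)
else:
    print('sector element not found - the page layout may have changed')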
You can easily obtain the sector for any given company/ticker with yahoo finance:
import yfinance as yf
tickerdata = yf.Ticker('TSLA') #the tickersymbol for Tesla
print (tickerdata.info['sector'])
Code returns: 'Consumer Cyclical'
If you want other information about the company/ticker, just print(tickerdata.info) to see all other possible dictionary keys and corresponding values, like ['sector'] used in the code above.
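Along the same lines, a small sketch (the ticker list is just an example) that collects the sector for several tickers at once:
import yfinance as yf

tickers = ['TSLA', 'IBM', 'AAPL']  # example tickers
sectors = {}
for symbol in tickers:
    info = yf.Ticker(symbol).info
    # some instruments (ETFs, funds) have no 'sector' key
    sectors[symbol] = info.get('sector', 'n/a')
print(sectors)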
To answer the question:
How to obtain stock market company sector from ticker or company name in python?
I had to find a workaround after reading some material and some nice suggestions from @keepAlive.
The following does the job in a reverse way, i.e. gets the companies given the sector. There are 10 sectors, so it is not too much work if one wants info for all sectors: https://www.stockmonitor.com/sectors/
Given that marketwatch.com/investing/stock was throwing a 405 Error, I decided to use https://www.stockmonitor.com/sectors/, for example:
https://www.stockmonitor.com/sector/healthcare/
Here is the code:
import requests
import pandas as pd
from lxml.html import parse
from urllib.request import Request, urlopen
headers = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/35.0.1916.47 Safari/537.36'
]
url = 'https://www.stockmonitor.com/sector/healthcare/'
headers_dict = {'User-Agent': headers[0]}
req = Request(url, headers=headers_dict)
webpage = urlopen(req)
tree = parse(webpage)

healthcare_tickers = []
for element in tree.xpath("//tbody/tr/td[@class='text-left']/a"):
    healthcare_tickers.append(element.text)

pd.Series(healthcare_tickers)
Thus, healthcare_tickers has the stock companies in the healthcare sector.
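To cover the remaining sectors, the same request can be wrapped in a function and looped over the sector slugs; a minimal sketch (only the healthcare slug is taken from above, the others have to be filled in from https://www.stockmonitor.com/sectors/):
import pandas as pd
from lxml.html import parse
from urllib.request import Request, urlopen

USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/35.0.1916.47 Safari/537.36')

def tickers_for_sector(slug):
    # return the tickers listed on https://www.stockmonitor.com/sector/<slug>/
    url = 'https://www.stockmonitor.com/sector/{}/'.format(slug)
    req = Request(url, headers={'User-Agent': USER_AGENT})
    tree = parse(urlopen(req))
    return [a.text for a in tree.xpath("//tbody/tr/td[@class='text-left']/a")]

# add the other nine sector slugs from the sectors overview page
sector_slugs = ['healthcare']
sectors = {slug: pd.Series(tickers_for_sector(slug)) for slug in sector_slugs}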

Sorting audiobooks on Audible.com by release date when using Python requests library

I am trying to reproduce the result of "Scraping and Exploring the Entire English Audible Catalog" by Toby Manders to add results for the books released after this article was published. The idea is to take Manders' dataset and add equivalent fields for all the new audiobooks in the past year or so, and to do that with as few http requests to Audible as possible. I'm using a different Python library than Manders, and Audible has also changed a bit since that piece was published.
The approach used by Manders of getting paged results for each category view is working so far, but my HTTP request is not sorting the results by release date. Here is my code:
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.audible.com/search?pf_rd_p=7fe4387b-4762-42a8-8d9a-a63254c74bb2&pf_rd_r=C7ENYKDADHMCH4KY12D4&ref=a_search_l1_feature_five_browse-bin_6&feature_six_browse-bin=9178177011&pageSize=50'
r = requests.get(base_url)
html = BeautifulSoup(r.text)
# get category list, and links
cat_tuples = []
for cat in html.find('div', {'class': 'categories'}).find_all('li', {'class': 'bc-list-item'}):
    a = cat.find('a')
    mytuple = (a.text, 'https://audible.com' + a['href'] + '&sort=pubdate-desc-rank')
    cat_tuples.append(mytuple)
# each tuple has a format like this ... ('Arts & Entertainment',
# 'https://audible.com/search?feature_six_browse-bin=9178177011&node=2226646011&pageSize=50&pf_rd_p=7fe4387b-4762-42a8-8d9a-a63254c74bb2&pf_rd_r=C7ENYKDADHMCH4KY12D4&ref=a_search_l1_feature_five_browse-bin_6&sort=pubdate-desc-rank')
#request first page of first category
r_page = requests.get(cat_tuples[0][1])
html_page = BeautifulSoup(r.text)
# results should start with '2Pac in the Studio' but instead it's 'Can't Hurt Me: Master Your Mind and Defy the Odds'
Adding sort=pubdate-desc-rank to the request URL appears to work in Chrome, but not with Python. I have tried changing the User Agent in my code as well, but that didn't work.
Note: I would describe Audible.com as generally unfriendly to scraping, but I don't see a clear prohibition against it. My interest is purely informational, and I do not seek to profit from gathering these results.
I took a fresh look at my code this morning and discovered that the solution to this one is a silly coding error on my part. I'm leaving it up in case anyone else has a similar issue. These lines of code:
#request first page of first category
r_page = requests.get(cat_tuples[0][1])
html_page = BeautifulSoup(r.text)
Should be as follows:
#request first page of first category
r_page = requests.get(cat_tuples[0][1])
html_page = BeautifulSoup(r_page.text)
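As a side note, the sort key can also be handed to requests as a query parameter instead of being glued onto the URL; a small sketch of the same request (the sort=pubdate-desc-rank value is the one from the question):
import requests
from bs4 import BeautifulSoup

# cat_tuples[0][1] already ends with '&sort=pubdate-desc-rank'; letting
# requests append the parameter itself is equivalent and a bit tidier
base_category_url = cat_tuples[0][1].rsplit('&sort=', 1)[0]
r_page = requests.get(base_category_url, params={'sort': 'pubdate-desc-rank'})
html_page = BeautifulSoup(r_page.text, 'html.parser')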

python web-scraping yahoo finance

Since Yahoo Finance updated their website, some tables seem to be created dynamically and are not actually stored in the HTML (I used to get this information using BeautifulSoup and urllib, but this won't work anymore). I am after the Analyst tables, for example for ADP, specifically the Earnings Estimates for Year Ago EPS (Current Year column). You cannot get this information from the API.
I found this link, which works well for the Analyst Recommendation Trends. Does anyone know how to do something similar for the main table on this page? (Link:
python lxml etree applet information from yahoo )
I tried to follow the steps taken but frankly it's beyond me.
Returning the whole table is all I need; I can pick out bits from there. Cheers.
In order to get that data, you need to open Chrome DevTools and select the Network tab with the XHR filter. If you click on the ADP request you can see the link in the Request URL.
You can use the Requests library to make the request and parse the JSON response from the site.
import requests
from pprint import pprint
url = 'https://query1.finance.yahoo.com/v10/finance/quoteSummary/ADP?formatted=true&crumb=ILlIC9tOoXt&lang=en-US&region=US&modules=upgradeDowngradeHistory%2CrecommendationTrend%2CfinancialData%2CearningsHistory%2CearningsTrend%2CindustryTrend%2CindexTrend%2CsectorTrend&corsDomain=finance.yahoo.com'
r = requests.get(url).json()
pprint(r)
Further to vold's answer above, and using the answer in the link I posted above (credit to saaj), this gives just the dataset I need and is neater when calling the module. I am not sure what the crumb parameter is, but it seems to work OK without it.
import json
from pprint import pprint
from urllib.request import urlopen
from urllib.parse import urlencode

def parse():
    host = 'https://query1.finance.yahoo.com'
    # host = 'https://query2.finance.yahoo.com'  # try if the above doesn't work
    path = '/v10/finance/quoteSummary/%s' % 'ADP'
    params = {
        'formatted': 'true',
        # 'crumb': 'ILlIC9tOoXt',
        'lang': 'en-US',
        'region': 'US',
        'modules': 'earningsTrend',
        'domain': 'finance.yahoo.com'
    }
    response = urlopen('{}{}?{}'.format(host, path, urlencode(params)))
    data = json.loads(response.read().decode())
    pprint(data)

if __name__ == '__main__':
    parse()
Other modules (just add a comma between them; see the example after this list):
assetProfile
financialData
defaultKeyStatistics
calendarEvents
incomeStatementHistory
cashflowStatementHistory
balanceSheetHistory
recommendationTrend
upgradeDowngradeHistory
earningsHistory
earningsTrend
industryTrend
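For instance, to request several of these modules in a single call, the params dict in the snippet above would change to something like this (a sketch; which modules return data may vary by ticker):
params = {
    'formatted': 'true',
    'lang': 'en-US',
    'region': 'US',
    # several modules, comma-separated, fetched in one request
    'modules': 'assetProfile,financialData,earningsTrend,recommendationTrend',
    'domain': 'finance.yahoo.com'
}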
On GitHub, c0redumb has proposed a whole solution. You can download yqd.py. After importing it, you can get Yahoo Finance data with one line of code, as below.
import yqd
yf_data = yqd.load_yahoo_quote('GOOG', '20170722', '20170725')
The result 'yf_data' is:
['Date,Open,High,Low,Close,Adj Close,Volume',
'2017-07-24,972.219971,986.200012,970.770020,980.340027,980.340027,3248300',
'2017-07-25,953.809998,959.700012,945.400024,950.700012,950.700012,4661000',
'']
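Since load_yahoo_quote returns CSV-style strings, a quick sketch for turning that output into something easier to work with:
import csv
from io import StringIO

# drop the trailing empty string and parse the header + rows
rows = list(csv.DictReader(StringIO('\n'.join(line for line in yf_data if line))))
for row in rows:
    print(row['Date'], row['Close'])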

Bioinformatics : Programmatic Access to the BacDive Database

The resource at BacDive (http://bacdive.dsmz.de/) is a highly useful database for accessing bacterial knowledge, such as strain information, species information and parameters such as optimum growth temperatures.
I have a scenario in which I have a set of organism names in a plain text file, and I would like to programmatically search them one by one against the BacDive database (which doesn't allow a flat file to be downloaded), retrieve the relevant information and populate my text file accordingly.
What are the main modules (such as BeautifulSoup) that I would need to accomplish this? Is it straightforward? Is it allowed to access web pages programmatically? Do I need permission?
A bacterium name would be "Pseudomonas putida". Searching this gives 60 hits on BacDive. Clicking one of the hits takes us to the specific page, where the line "Growth temperature: [Ref.: #27] Recommended growth temperature: 26 °C" is the most important.
The script would have to access BacDive (which I have tried accessing using requests, but I feel they do not allow programmatic access; I asked the moderator about this, and they said I should register for their API first).
I now have the API access. This is the page (http://www.bacdive.dsmz.de/api/bacdive/). This may seem quite simple to people who do HTML scraping, but I am not sure what to do now that I have access to the API.
Here is the solution...
import re
import urllib
from bs4 import BeautifulSoup

def get_growth_temp(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    no_hits = int(map(float, re.findall(r'[+-]?[0-9]+', str(soup.find_all("span", class_="searchresultlayerhits"))))[0])
    if no_hits > 1:
        letters = soup.find_all("li", class_="searchresultrow1") + soup.find_all("li", class_="searchresultrow2")
        all_urls = []
        for i in letters:
            all_urls.append('http://bacdive.dsmz.de/index.php' + i.a["href"])
        max_temp = []
        for ind_url in all_urls:
            soup = BeautifulSoup(urllib.urlopen(ind_url).read())
            a = soup.body.findAll(text=re.compile('Recommended growth temperature :'))
            if a:
                max_temp.append(int(map(float, re.findall(r'[+-]?[0-9]+', str(a)))[0]))
        print "Recommended growth temperature : %d °C:\t" % max(max_temp)

url = 'http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida'

if __name__ == "__main__":
    # To open the file and iterate through the urls/bacteria:
    # with open('file.txt', 'rU') as f:
    #     for url in f:
    #         get_growth_temp(url)
    get_growth_temp(url)
Edit:
Here I am passing a single URL. If you want to pass multiple URLs to get their growth temperatures, call the function for each URL by opening the file; that code is commented out above.
Hope it helps.
Thanks
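As for the API route mentioned in the question, here is a heavily hedged sketch of what a first call might look like; the endpoint path and the response layout are assumptions rather than something taken from the BacDive documentation, and the credentials are placeholders for the ones issued at registration:
import requests

BASE = 'http://www.bacdive.dsmz.de/api/bacdive/'
AUTH = ('your-registered-email', 'your-password')  # placeholders

# hypothetical endpoint - check the documentation at /api/bacdive/ for the real paths
resp = requests.get(BASE + 'taxon/Pseudomonas/putida/', auth=AUTH,
                    headers={'Accept': 'application/json'})
resp.raise_for_status()
print(resp.json())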

Extracting parts of a webpage with python

So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the URLs also look like this, meaning that there are multiple, separate statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a> '):
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a> ', end_link+1)
        legal += page[start_legal+2: end_link]
        if
            break
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using the requests library to fetch the page content here - this is just a recommended and very easy to use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with the requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. having found the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have made about a million HTTP requests to a web service (and they were OK with that), each fetching about 1 KB. This would have taken a long time, and would have been quite inconvenient (requiring some error handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could well be similarly accommodating.
