How can I detect the method to request data from this site? - python

UPDATE: I've put together the following script to use the url for the XML without the time-code-like suffix as recommended in the answer below, and report the downlink powers which clearly fluctuate on the website. I'm getting three hour old, unvarying data.
So it looks like I need to properly construct that (time code? authorization? secret password?) in order to do this successfully. Like I say in the comment below, "I don't want to do anything that's not allowed and welcome - NASA has enough challenges already trying to talk to a forty year old spacecraft 20 billion kilometers away!"
import xml.etree.ElementTree as ET
from copy import copy
import urllib2

def dictify(r, root=True):
    """from: https://stackoverflow.com/a/30923963/3904031"""
    if root:
        return {r.tag: dictify(r, False)}
    d = copy(r.attrib)
    if r.text:
        d["_text"] = r.text
    for x in r.findall("./*"):
        if x.tag not in d:
            d[x.tag] = []
        d[x.tag].append(dictify(x, False))
    return d

url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'
contents = urllib2.urlopen(url).read()
root = ET.fromstring(contents)
DSNdict = dictify(root)
dishes = DSNdict['dsn']['dish']
dp_dict = dict()
for dish in dishes:
    powers = [float(sig['power']) for sig in dish['downSignal'] if sig['power']]
    dp_dict[dish['name']] = powers
print dp_dict['DSS26']
I'd like to keep track of which spacecraft the NASA Deep Space Network (DSN) is communicating with, say once per minute.
I learned how to do something similar from Flight Radar 24 from the answer to my previous question, which also still represents my current skills in getting data from web sites.
With FR24 I had explanations in this blog as a great place to start. I have opened the page with the Developer Tools function in the Chrome browser, and I can see that data for items such as dishes, spacecraft and associated numerical data are requested as an XML with urls such as
https://eyes.nasa.gov/dsn/data/dsn.xml?r=293849023
so it looks like I need to construct the integer (time code? authorization? secret password?) after the r= once a minute.
My Question: Using python, how could I best find out what that integer represents, and how to generate it in order to correctly request data once per minute?
above: screen shot montage from NASA's DSN Now page https://eyes.nasa.gov/dsn/dsn.html see also this question

Using a random number (or a timestamp...) in a get parameter tricks the browser into really making the request (instead of using the browser cache).
This method is some kind of "hack" the webdevs use so that they are sure the request actually happens.
Since you aren't using a web browser, I'm pretty sure you could totally ignore this parameter, and still get the refreshed data.
--- Edit ---
Actually r seems to be required, and has to be updated.
#!/bin/bash
# Quote the URL so the shell does not treat "?" as a glob character.
wget "https://eyes.nasa.gov/dsn/data/dsn.xml?r=$(date +%s)" -O a.xml -nv
while true; do
    sleep 1
    wget "https://eyes.nasa.gov/dsn/data/dsn.xml?r=$(date +%s)" -O b.xml -nv
    diff a.xml b.xml
    cp -f b.xml a.xml
done
You don't need to emulate a browser. Simply set r to anything and increment it. (Or use a timestamp)

Regarding your updated question, why avoid sending the r query string parameter when it is very easy to generate it? Also, with the requests module, it's easy to send the parameter with the request too:
import time
import requests
import xml.etree.ElementTree as ET
url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'
r = int(time.time() / 5)
response = requests.get(url, params={'r': r})
root = ET.fromstring(response.content)
# etc....
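Putting the answers above together, a once-per-minute poller might look roughly like the sketch below. The helper names (build_params, fetch_xml, poll) and the injectable fetch and sleep arguments are illustrative and not part of any NASA interface. It uses the current Unix time for r, on the assumption (taken from the answers above) that any changing value is enough.

```python
import time
import xml.etree.ElementTree as ET

DSN_URL = 'https://eyes.nasa.gov/dsn/data/dsn.xml'


def build_params(now=None):
    """Return query parameters with a changing 'r' value (Unix time in seconds)."""
    if now is None:
        now = time.time()
    return {'r': int(now)}


def fetch_xml(url, params):
    """Download the XML document; needs the third-party 'requests' package."""
    import requests  # imported lazily so the other helpers work without it
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.content


def poll(count, interval=60, fetch=fetch_xml, sleep=time.sleep):
    """Fetch the feed 'count' times, waiting 'interval' seconds between requests."""
    roots = []
    for i in range(count):
        roots.append(ET.fromstring(fetch(DSN_URL, build_params())))
        if i < count - 1:
            sleep(interval)
    return roots
```

Calling poll(3) would make three requests one minute apart and return the three parsed XML roots. Passing a different fetch function makes it possible to try the loop without touching the network.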

Related

Web Scraping AccuWeather site

I have recently started learning web scraping with Scrapy in Python and am having trouble scraping data from the AccuWeather site (https://www.accuweather.com/en/gb/london/ec4a-2/may-weather/328328?year=2020).
Basically, I am capturing dates and their temperatures for reporting purposes.
When I inspected the site I found so many div tags that I got confused about how to write the code, so I thought I would ask for expert help.
Here is my code for reference.
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.accuweather.com/en/gb/london/ec4a-2/may-weather/328328?year=2020']

    def parse(self, response):
        All_div_tags = response.css('div.content-module')[0]
        #Grid_tag = All_div_tags.css('div.monthly-grid')
        Date_tag = All_div_tags.css('div.date::text').extract()
        yield {
            'Date': Date_tag}
I wrote this in PyCharm and am getting the error "code is not handled or not allowed".
Could someone please help me with this?
I've tried to read some websites that gave me the same error. It happens because some websites don't allow web scraping on them. To get data from these websites, you would probably need to use their API if they have one.
Fortunately, AccuWeather has made it easy to use their API (unlike other APIs):
You first need to create an account at their developers' website: https://developer.accuweather.com/
Now, create a new app by going to My Apps > Add a new app.
You will probably see some information about your app (if you don't, press its name and it will probably show up). The only information you will need is your API Key, which is essential for APIs.
AccuWeather has pretty good documentation for their API, but I will show you how to use the most useful endpoints. You will need the location key of the city you want the weather for, which is shown in the URL of its weather page. For example, London's URL is www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328, so its location key is 328328.
When you have the location key of the city/cities you want to get the weather from, open a file, and type:
import requests
import json
If you want the daily weather (as shown here), type:
response = requests.get(url="http://dataservice.accuweather.com/forecasts/v1/daily/1day/LOCATIONKEY?apikey=APIKEY")
print(response.status_code)
Replace APIKEY with your API key and LOCATIONKEY with the city's location key. Running it should now display 200 (meaning the request was successful).
Now, parse the response body as JSON:
response_json = json.loads(response.content)
And you can now get some information from it, such as the day's "definition":
print(response_json["Headline"]["Text"])
The minimum temperature:
min_temperature = response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Value"]
print(f"Minimum Temperature: {min_temperature}")
The maximum temperature:
max_temperature = response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Value"]
print(f"Maximum Temperature: {max_temperature}")
The minimum temperature and maximum temperature with the unit:
min_temperature = str(response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Value"]) + response_json["DailyForecasts"][0]["Temperature"]["Minimum"]["Unit"]
print(f"Minimum Temperature: {min_temperature}")
max_temperature = str(response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Value"]) + response_json["DailyForecasts"][0]["Temperature"]["Maximum"]["Unit"]
print(f"Maximum Temperature: {max_temperature}")
And more.
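The lookups above can be gathered into one helper. This is only a sketch: summarize_forecast is a made-up name, and the sample dictionary is invented to mirror the keys used in this answer (Headline, DailyForecasts, Temperature). It is not a real API response.

```python
def summarize_forecast(response_json):
    """Pull the headline and the first day's min/max temperature from the decoded JSON."""
    day = response_json["DailyForecasts"][0]["Temperature"]

    def with_unit(entry):
        # Join the value and its unit, e.g. 48.0 and "F" become "48.0F".
        return f'{entry["Value"]}{entry["Unit"]}'

    return {
        "headline": response_json["Headline"]["Text"],
        "minimum": with_unit(day["Minimum"]),
        "maximum": with_unit(day["Maximum"]),
    }


# Invented sample data with the same shape as the keys used above.
sample = {
    "Headline": {"Text": "Sample headline"},
    "DailyForecasts": [
        {"Temperature": {"Minimum": {"Value": 48.0, "Unit": "F"},
                         "Maximum": {"Value": 63.0, "Unit": "F"}}}
    ],
}
print(summarize_forecast(sample))
```

With a real request you would pass response_json from the steps above instead of sample.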
If you have any questions, let me know. I hope I could help you!

Python - Facebook fb_dtsg

On Facebook I want to find fb_dtsg to make a status:
import urllib, urllib2, cookielib
jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie)
data = urllib.urlencode({'email':"email",'pass':"password", "Log+In":"Log+In"})
req = urllib2.Request('http://www.facebook.com/login.php')
opener.open(req, data)
opener.open(req, data) #Needs to be opened twice to log on.
req2 = urllib2.Request("http://www.facebook.com/")
page = opener.open(req2).read()  # read the body so that page is a string and supports find()
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 33] #This just finds the value of "fb_dtsg".
Yes, this does find a value, and one that looks the way an fb_dtsg value should. But the value changes every time I open the page again, and it does not work when I use it to post a status. When I record what happens in Google Chrome while posting a status normally, I get a working fb_dtsg value that does not change (for a long session) and that works if I use it to post a status. Please show me how I can fix this without using the API.
The slice used to find fb_dtsg cuts off the last character, so change 33 to 34:
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 34]
Anyway, a better way of finding fb_dtsg is to use re:
re.findall('fb_dtsg.+?value="([^"]+)"',page)
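As a rough illustration of why the regular expression is more robust than fixed offsets, the sketch below runs it against a made-up HTML fragment. The sample_html string, the token value and the find_fb_dtsg helper are all invented for the example, and the real page's markup may differ.

```python
import re

# Invented fragment; the real page's markup and token format may differ.
sample_html = (
    '<form><input type="hidden" name="fb_dtsg" '
    'value="AQExampleToken" autocomplete="off" /></form>'
)


def find_fb_dtsg(page):
    """Return the first fb_dtsg value found in the page text, or None."""
    matches = re.findall(r'fb_dtsg.+?value="([^"]+)"', page)
    return matches[0] if matches else None


print(find_fb_dtsg(sample_html))
```

Because the pattern captures everything up to the closing quote, it does not depend on the token having a particular length.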
As I answered in one of your earlier posts, the request may also require other hidden variables.
If this still doesn't work, can you provide the code where you make the post, including all the post form data?
BTW sorry for not looking at all your previous posts with same content :P

Script for a changing URL

I am having a bit of trouble in coding a process or a script that would do the following:
I need to get data from the URL of:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
But the file URLs change (the days and model runs change), so the script has to assume this base structure, with the following variables:
Y - Year
M - Month
D - Day
C - Model Forecast/Initialization Hour
F - Model Frame Hour
Like so:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
The script would run and fill in the current date (as YYYYMMDD, plus CC) for those variables.
So while the goal is to get
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
the variables should be filled in from the current date, following the format:
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
Can you please advise how to build these URLs and find the latest date in this format? Whether it's a script or something with wget, I'm all ears. Thank you in advance.
In Python, the requests library can be used to get at the URLs.
You can generate the URL by combining the base URL string with a timestamp: use the datetime module's datetime and timedelta classes to step through candidate times, and strftime to produce the date in the required format.
That is, start by getting the current time with datetime.datetime.now(), then in a loop subtract an hour (or whatever time step you think they're using) via timedelta and keep checking the URL with the requests library. The first one that exists is the latest one, and you can then do whatever further processing you need with it.
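A minimal sketch of that idea follows. It assumes runs are published every six hours at 00, 06, 12 and 18 UTC, which you should check against the server, and the helper names are illustrative. The exists argument stands in for the requests check so that the URL logic can be tried offline.

```python
from datetime import datetime, timedelta

BASE = 'http://nomads.ncep.noaa.gov/dods/gfs_hd'


def run_url(run_time):
    """Build the URL for one model run, e.g. .../gfs_hd20140430/gfs_hd_00z."""
    return '{base}/gfs_hd{date}/gfs_hd_{hour}z'.format(
        base=BASE,
        date=run_time.strftime('%Y%m%d'),
        hour=run_time.strftime('%H'),
    )


def candidate_urls(now, step_hours=6, count=8):
    """Yield URLs for the most recent runs, newest first.

    Assumes runs start on multiples of step_hours (00, 06, 12, 18 for 6).
    """
    # Round down to the latest run hour at or before 'now'.
    run = now.replace(hour=now.hour - now.hour % step_hours,
                      minute=0, second=0, microsecond=0)
    for _ in range(count):
        yield run_url(run)
        run -= timedelta(hours=step_hours)


def latest_available(now, exists):
    """Return the first candidate URL for which exists(url) is true, else None."""
    for url in candidate_urls(now):
        if exists(url):
            return url
    return None
```

In real use, now would be the current UTC time and exists could wrap something like requests.get(url).ok. Whether a plain GET on that URL is the right existence check for this server is something to verify.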
If you need to scrape the contents of the page, scrapy works well for that.
I'd try scraping the index one level up at http://nomads.ncep.noaa.gov/dods/gfs_hd ; the last link-of-particular-form there should take you to the daily downloads pages, where you could do something similar.
Here's an outline of scraping the daily downloads page:
import BeautifulSoup
import urllib

grdd = urllib.urlopen('http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140522')
soup = BeautifulSoup.BeautifulSoup(grdd)
datalinks = 'http://nomads.ncep.noaa.gov:80/dods/gfs_hd/gfs_hd'
for link in soup.findAll('a'):
    if link.get('href').startswith(datalinks):
        print('Suitable link: ' + link.get('href')[len(datalinks):])
        # Figure out if you already have it, choose if you want info, das, dds, etc etc.
and scraping the page with the last thirty would, of course, be very similar.
The easiest solution would be just to mirror the parent directory:
wget -np -m -r http://nomads.ncep.noaa.gov:9090/dods/gfs_hd
However, if you just want the latest date, you can use Mojo::UserAgent as demonstrated on Mojocast Episode 5
use strict;
use warnings;
use Mojo::UserAgent;
my $url = 'http://nomads.ncep.noaa.gov:9090/dods/gfs_hd';
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;
my @links = $dom->find('a')->attr('href')->each;
my @gfs_hd = reverse sort grep {m{gfs_hd/}} @links;
print $gfs_hd[0], "\n";
On May 23rd, 2014, this outputs:
http://nomads.ncep.noaa.gov:9090/dods/gfs_hd/gfs_hd20140523

Parse what you google search

I'd like to write a script (preferably in Python, but other languages are not a problem) that can parse what you type into a Google search. Suppose I search 'cats'; then I'd like to be able to parse the string cats and, for example, append it to a .txt file on my computer.
So if my searches were 'cats', 'dogs', 'cows' then I could have a .txt file like so,
cats
dogs
cows
Anyone know any APIs that can parse the search bar and return the string inputted? Or some object that I can cast into a string?
EDIT: I don't want to make a chrome extension or anything, but preferably a python (or bash or ruby) script I can run in terminal that can do this.
Thanks
If you have access to the URL, you can look for "&q=" to find the search term. (http://google.com/...&q=cats..., for example).
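For example, given a search URL as a string (how you obtain it, for instance from browser history, is outside this sketch), Python's standard library can pull out the q parameter. The function names and the searches.txt file name are illustrative.

```python
from urllib.parse import urlparse, parse_qs


def search_term(url):
    """Return the value of the 'q' query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get('q')
    return values[0] if values else None


def log_search(url, path='searches.txt'):
    """Append the search term from 'url' to a text file, one term per line."""
    term = search_term(url)
    if term is not None:
        with open(path, 'a', encoding='utf-8') as handle:
            handle.write(term + '\n')
    return term
```

Using parse_qs rather than searching for the literal "&q=" also handles the case where q is the first parameter, and it decodes encoded characters such as "+" for spaces.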
I can offer two popular solutions:
1) Google has a search API: https://developers.google.com/products/#google-search
(It is limited to 100 requests per day.)
Trimmed code:
import json
import re
import urllib2

def gapi_parser(args):
    query = args.text; count = args.max_sites
    import config
    api_key = config.api_key
    cx = config.cx
    #Note: This API returns up to the first 100 results only.
    #https://developers.google.com/custom-search/v1/using_rest?hl=ru-RU#WorkingResults
    results = []; domains = set(); errors = []; start = 1
    while True:
        req = 'https://www.googleapis.com/customsearch/v1?key={key}&cx={cx}&q={q}&alt=json&start={start}'.format(key=api_key, cx=cx, q=query, start=start)
        if start >= 100:  # the Google API cannot return more
            break
        con = urllib2.urlopen(req)
        if con.getcode() == 200:
            data = con.read()
            j = json.loads(data)
            start = int(j['queries']['nextPage'][0]['startIndex'])
            for item in j['items']:
                match = re.search(r'^(https?://)?\w(\w|\.|-)+', item['link'])
                if match:
                    domain = match.group(0)
                    if domain not in results:
                        results.append(domain)
                        domains.update([domain])
                else:
                    errors.append('Can`t recognize domain: %s' % item['link'])
            if len(domains) >= args.max_sites:
                break
    print
    for error in errors:
        print error
    return (results, domains)
2) I wrote a Selenium-based script that parses the page in a real browser instance, but this solution has some restrictions, for example CAPTCHAs if you run searches like a robot.
A few options you might consider, with their advantages and disadvantages:
URL:
advantage: as Chris mentioned, accessing the URL and manually changing it is an option. It should be easy to write a script for this, and I can send you my Perl script if you want
disadvantage: I am not sure if you can do it. I made a Perl script for that before, but it didn't work because Google states that you can't use its services outside the Google interface. You might face the same problem
Google's search API:
advantage: popular choice. Good documentation. It should be a safe choice
disadvantage: Google's restrictions.
Research other search engines:
advantage: they might not have the same restrictions as Google. You might find some search engines that let you play around more and have more freedom in general.
disadvantage: you're not going to get results that are as good as Google's

ruby fetching url content is always empty

I am so frustrated trying to use Ruby to fetch the content of a specific URL.
I've tried many different ways, such as open-uri and a standard request; none have worked so far. I always get empty HTML. I also tried using Python to fetch the same URL, which always returned the correct HTML content. I am really not sure why... Please help, as I am a newbie to both Ruby and Python. I want to use Ruby (I prefer the tidy syntax and human-friendly function names, and installing libs with gem and Homebrew on a Mac is easier than Python's easy_install), but I am now considering Python because it just works (though I am still trying to get my head around the 2.x and 3.x issue). I may be doing something really stupid, but I think that is very unlikely.
ruby 1.9.2p136 (2010-12-25 revision 30365) [i386-darwin10.6.0]
Implementation 1:
url = URI.parse('http//:www.stackoverflow.com/') req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body #empty
Implementation 2:
doc = Nokogiri::HTML(open("http//:www.stackoverflow.com/", "User-Agent" => "Safari"))
#empty
#I tried to use without user agent, without Nokogiri none worked.
Python Implementation which worked every time perfectly
f = urllib.urlopen("http//:www.stackoverflow.com/")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()
print s
If that is your exact code, it is invalid for several reasons:
http: should be http://
The URL needs a path. If you want the root page of example.com, it needs to be http://example.com/ (the trailing slash is significant).
If you put two lines of code on one line, you need to use ; to mark the end of the first statement.
So:
require 'net/http'
url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body
The same is true when using open with Nokogiri.
EDIT: that site often returns bad results:
counter = 0
20.times do
  url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
  req = Net::HTTP::Get.new(url.path)
  res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
  sleep 1
  counter +=1 unless res.body.empty?
end
puts counter
For me this returned a non-empty body only once. If you substitute another site, it works every time.
curl "http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia"
Yields the same inconsistent results.
Two examples with open-uri (standard lib), a wrapper for (among others) the rather cumbersome Net::HTTP:
require 'open-uri'
open("http://www.stackoverflow.com/"){|f| puts f.read}
puts URI::parse("http://www.google.com/").read
