I am having a bit of trouble coding a process or script that would do the following:
I need to get data from the URL of:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
But the file URLs change (the days and model runs change), so the script has to assume this base structure with variables:
Y - Year
M - Month
D - Day
C - Model Forecast/Initialization Hour
F- Model Frame Hour
Like so:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
This script would run and then fill in that date (the YYYYMMDD, as well as the CC) for those variables.
So while the goal is to fetch
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
these variables should correspond to the current date, in the format of:
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
Can you please advise how to go about building the URLs to find the latest date in this format? Whether it's a script or something with wget, I'm all ears. Thank you in advance.
In Python, the requests library can be used to get at the URLs.
You can generate the URL from the base URL string plus a timestamp built with the datetime class, using timedelta to step through time and strftime to format the date as required.
I.e. start by getting the current time with datetime.datetime.now(), then in a loop subtract an hour (or whichever time step you think they're using) via timedelta and keep checking the URL with the requests library. The first one you see that's there is the latest one, and you can then do whatever further processing you need with it.
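A minimal sketch of that approach (the six-hour step between GFS runs and the use of HEAD requests are assumptions; switch to requests.get if the server doesn't answer HEAD properly):

import datetime
import requests

base = 'http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd{date}/gfs_hd_{run:02d}z'

# Start from the current UTC time, snap to the most recent run hour
# (00z/06z/12z/18z), and step back run by run until a URL responds.
t = datetime.datetime.utcnow()
t = t.replace(hour=(t.hour // 6) * 6, minute=0, second=0, microsecond=0)
for _ in range(8):  # look back at most two days
    url = base.format(date=t.strftime('%Y%m%d'), run=t.hour)
    if requests.head(url).status_code == 200:
        print('Latest available run:', url)
        break
    t -= datetime.timedelta(hours=6)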
If you need to scrape the contents of the page, scrapy works well for that.
I'd try scraping the index one level up at http://nomads.ncep.noaa.gov/dods/gfs_hd ; the last link of the relevant form there should take you to the daily download pages, where you could do something similar.
Here's an outline of scraping the daily downloads page:
import BeautifulSoup
import urllib

grdd = urllib.urlopen('http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140522')
soup = BeautifulSoup.BeautifulSoup(grdd)
datalinks = 'http://nomads.ncep.noaa.gov:80/dods/gfs_hd/gfs_hd'
for link in soup.findAll('a'):
    if link.get('href').startswith(datalinks):
        print('Suitable link: ' + link.get('href')[len(datalinks):])
        # Figure out if you already have it, choose if you want info, das, dds, etc etc.
and scraping the page with the last thirty would, of course, be very similar.
The easiest solution would be just to mirror the parent directory:
wget -np -m -r http://nomads.ncep.noaa.gov:9090/dods/gfs_hd
However, if you just want the latest date, you can use Mojo::UserAgent as demonstrated on Mojocast Episode 5
use strict;
use warnings;
use Mojo::UserAgent;
my $url = 'http://nomads.ncep.noaa.gov:9090/dods/gfs_hd';
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;
my @links = $dom->find('a')->attr('href')->each;
my @gfs_hd = reverse sort grep {m{gfs_hd/}} @links;
print $gfs_hd[0], "\n";
As of May 23rd, 2014, this outputs:
http://nomads.ncep.noaa.gov:9090/dods/gfs_hd/gfs_hd20140523
Even though Google's official API does not offer time information in the query results - nor any time filtering for keywords - there is a time filtering option in the advanced search:
Google results for stackoverflow in the last one hour
The GoogleScraper library offers many flexible options, but no time-related ones. How can time features be added using the library?
After a bit of inspection, I've found that Google sends the time filtering information as a qdr value in the tbs parameter (possibly meaning "time based search", although that is not officially stated):
https://www.google.com/search?tbs=qdr:h1&q=stackoverflow
This gets the results for the past hour. The m and y letters can be used for months and years respectively.
Also, to add sorting by date, add the sbd value (presumably "sort by date") as well:
https://www.google.com/search?tbs=qdr:h1,sbd:1&q=stackoverflow
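Outside of GoogleScraper, the same parameter can be passed with plain requests; a minimal sketch (Google may still throttle or captcha automated queries):

import requests

# Hedged sketch: build the time-filtered search URL with the tbs parameter
# shown above. The User-Agent header is just to look like a normal browser.
params = {'q': 'stackoverflow', 'tbs': 'qdr:h1,sbd:1'}
response = requests.get('https://www.google.com/search', params=params,
                        headers={'User-Agent': 'Mozilla/5.0'})
print(response.url)          # the generated search URL
print(response.status_code)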
I was able to insert these keywords into the base Google URL of GoogleScraper. Insert the lines below at the end of the get_base_search_url_by_search_engine() method (just before the return) in scraping.py:
if("google" in str(specific_base_url)):
specific_base_url = "https://www.google.com/search?tbs=qdr:{},sbd:1".format(config.get("time_filter", ""))
Now use the time_filter option in your config:
from GoogleScraper import scrape_with_config
config = {
    'use_own_ip': True,
    'keyword_file': "keywords.txt",
    'search_engines': ['google'],
    'num_pages_for_keyword': 2,
    'scrape_method': 'http',
    "time_filter": "d15"  # up to 15 days ago
}
search = scrape_with_config(config)
Results will only include pages from the given time range. Additionally, text snippets in the results will carry the raw date information:
one_sample_result = search.serps[0].links[0]
print(one_sample_result.snippet)
4 mins ago It must be pretty easy - let propertytotalPriceOfOrder =
order.items.map(item => +item.unit * +item.quantity * +item.price);.
where order is your entire json object.
Like many others I have been looking for an alternative source of stock prices now that the Yahoo and Google APIs are defunct. I decided to try web scraping the Yahoo site, from which historical prices are still available. I managed to put together the following code, which almost does what I need:
import urllib.request as web
import bs4 as bs
import pandas as pd

def yahooPrice(tkr):
    tkr = tkr.upper()
    url = 'https://finance.yahoo.com/quote/' + tkr + '/history?p=' + tkr
    sauce = web.urlopen(url)
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    allrows = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        if len(row) == 7:
            allrows.append(row)
    vixdf = pd.DataFrame(allrows).iloc[0:-1]
    vixdf.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Aclose', 'Volume']
    vixdf.set_index('Date', inplace=True)
    return vixdf
which produces a dataframe with the information I want. Unfortunately, even though the actual web page shows a full year's worth of prices, my routine only returns 100 records (including dividend records). Any idea how I can get more?
The Yahoo Finance API was deprecated in May '17, I believe. Now there are not too many options for downloading time series data for free, at least that I know of. Nevertheless, there is always some kind of alternative. Check out the URL below to find a tool to download historical prices.
http://investexcel.net/multiple-stock-quote-downloader-for-excel/
See this too.
https://blog.quandl.com/api-for-stock-data
I don't have the exact solution to your question, but I have a workaround (I had the same problem and hence used this approach). Basically, you can use the BDay() offset from pandas.tseries.offsets and look back x business days when collecting the data. In my case, I ran the loop three times to get 300 business days of data, knowing that 100 was the maximum I was getting by default.
Basically, you run the loop three times and set the BDay() offset so that the first iteration grabs the 100 days up to now, the next one the 100 days before that (200 days back), and the last one the 100 days before that (300 days back), as sketched below. The whole point of doing this is that at any given point you can only scrape 100 days of data, so even if you loop through 300 days in one go you may not get 300 days of data - your original problem (possibly Yahoo limits the amount of data extracted in one go). I have my code here: https://github.com/ee07kkr/stock_forex_analysis/tree/dataGathering
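A rough sketch of that windowing idea using pandas business-day offsets (how the window boundaries are actually fed to the scraper, e.g. as date query parameters on the Yahoo history URL, is left as an assumption):

import pandas as pd
from pandas.tseries.offsets import BDay

# Compute the boundaries of three consecutive 100-business-day windows,
# walking back from today; scrape each window separately.
end = pd.Timestamp.today().normalize()
windows = []
for _ in range(3):
    start = end - BDay(100)
    windows.append((start, end))
    end = start

for start, end in windows:
    print(start.date(), '->', end.date())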
Note, the csv files for some reason are not working with the \t delimiter in my case...but basically you can use the data frame. One more issue I currently have is that 'Volume' is a string instead of a float....the way to get around it is:
apple = pd.DataFrame.from_csv('AAPL.csv',sep ='\t')
apple['Volume'] = apple['Volume'].str.replace(',','').astype(float)
First - Run the code below to get your 100 days.
Then - Use SQL to insert the data into a small db (Sqlite3 is pretty easy to use with python).
Finally - Amend code below to then get daily prices which you can add to grow your database.
from pandas import DataFrame
import bs4
import requests

def function():
    url = 'https://uk.finance.yahoo.com/quote/VOD.L/history?p=VOD.L'
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    headers = soup.find_all('th')   # column headers (Date, Open, High, ...)
    rows = soup.find_all('tr')
    # Text of every cell in every table row
    ts = [[td.getText() for td in rows[i].find_all('td')] for i in range(len(rows))]
    data = []
    for i in ts:
        if len(i) > 6:              # keep only full price rows
            data.append(i[:-6])     # just the leading (date) cell
    # Walk backwards through the last 100 scraped rows and print the dates
    days = 100
    num = -1
    while days > 0 and num >= -len(data):
        now = data[num]
        now = DataFrame(now)
        now = str(now[0][0])
        print(now)
        num = num - 1
        days = days - 1
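For the database step, a minimal sqlite3 sketch (the table layout and the sample row are assumptions; adapt them to whatever your scrape actually returns):

import sqlite3

# Store scraped rows in a small SQLite database, one row per trading day.
conn = sqlite3.connect('prices.db')
conn.execute("""CREATE TABLE IF NOT EXISTS prices
                (date TEXT PRIMARY KEY, open REAL, high REAL,
                 low REAL, close REAL, volume REAL)""")
rows = [('2017-06-01', 1.0, 1.1, 0.9, 1.05, 100000.0)]  # hypothetical scraped row
conn.executemany('INSERT OR REPLACE INTO prices VALUES (?,?,?,?,?,?)', rows)
conn.commit()
conn.close()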
UPDATE: I've put together the following script, which uses the URL for the XML without the time-code-like suffix, as recommended in the answer below, and reports the downlink powers that clearly fluctuate on the website. I'm getting three-hour-old, unvarying data.
So it looks like I need to properly construct that (time code? authorization? secret password?) in order to do this successfully. Like I say in the comment below, "I don't want to do anything that's not allowed and welcome - NASA has enough challenges already trying to talk to a forty year old spacecraft 20 billion kilometers away!"
def dictify(r, root=True):
    """from: https://stackoverflow.com/a/30923963/3904031"""
    if root:
        return {r.tag: dictify(r, False)}
    d = copy(r.attrib)
    if r.text:
        d["_text"] = r.text
    for x in r.findall("./*"):
        if x.tag not in d:
            d[x.tag] = []
        d[x.tag].append(dictify(x, False))
    return d
import xml.etree.ElementTree as ET
from copy import copy
import urllib2
url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'
contents = urllib2.urlopen(url).read()
root = ET.fromstring(contents)
DSNdict = dictify(root)
dishes = DSNdict['dsn']['dish']
dp_dict = dict()
for dish in dishes:
    powers = [float(sig['power']) for sig in dish['downSignal'] if sig['power']]
    dp_dict[dish['name']] = powers

print dp_dict['DSS26']
I'd like to keep track of which spacecraft that the NASA Deep Space Network (DSN) is communicating with, say once per minute.
I learned how to do something similar from Flight Radar 24 from the answer to my previous question, which also still represents my current skills in getting data from web sites.
With FR24 I had explanations in this blog as a great place to start. I have opened the page with the Developer Tools function in the Chrome browser, and I can see that data for items such as dishes, spacecraft and associated numerical data are requested as an XML with urls such as
https://eyes.nasa.gov/dsn/data/dsn.xml?r=293849023
so it looks like I need to construct the integer (time code? authorization? secret password?) after the r= once a minute.
My Question: Using python, how could I best find out what that integer represents, and how to generate it in order to correctly request data once per minute?
above: screen shot montage from NASA's DSN Now page https://eyes.nasa.gov/dsn/dsn.html see also this question
Using a random number (or a timestamp...) in a GET parameter tricks the browser into really making the request (instead of serving it from its cache).
This is a kind of "hack" web developers use to make sure the request actually happens.
Since you aren't using a web browser, I'm pretty sure you could ignore this parameter entirely and still get the refreshed data.
--- Edit ---
Actually r seems to be required, and has to be updated.
#!/bin/bash
wget https://eyes.nasa.gov/dsn/data/dsn.xml?r=$(date +%s) -O a.xml -nv
while true; do
sleep 1
wget https://eyes.nasa.gov/dsn/data/dsn.xml?r=$(date +%s) -O b.xml -nv
diff a.xml b.xml
cp b.xml a.xml -f
done
You don't need to emulate a browser. Simply set r to anything and increment it. (Or use a timestamp)
Regarding your updated question, why avoid sending the r query string parameter when it is very easy to generate it? Also, with the requests module, it's easy to send the parameter with the request too:
import time
import requests
import xml.etree.ElementTree as ET
url = 'https://eyes.nasa.gov/dsn/data/dsn.xml'
r = int(time.time() / 5)
response = requests.get(url, params={'r': r})
root = ET.fromstring(response.content)
# etc....
I'm new to python and would like to have a python script that would update Toggl based on the picture below. Note that I don't want to start/stop the timer (although if you want to throw that in, I may use it); what I really want to do is simply add time after the fact.
I just want to pass in:
text about what I did for the day
existing project to link to
duration
start time
date
I tried togglwrapper (https://pypi.python.org/pypi/togglwrapper/1.0.1) and connected to my account via API token just fine. I'm just not sure how to send a request to add a time entry.
togglwrapper does not have an option to create a Time Entry directly, as specified in the following endpoint:
https://github.com/toggl/toggl_api_docs/blob/master/chapters/time_entries.md#create-a-time-entry
But you can achieve the same thing by starting and stopping the timer, like this (be sure to include the proper fields in data):
>>> from togglwrapper import Toggl
>>> toggl = Toggl('your_api_token')
>>> data = {"time_entry":{"description":"description","tags":["billed"],"pid":123,"created_with":"curl"}}
>>> response = toggl.TimeEntries.start(data)
It will call this API endpoint: https://github.com/toggl/toggl_api_docs/blob/master/chapters/time_entries.md#start-a-time-entry
Now get the Time entry ID to stop it.
>>> toggl.TimeEntries.stop(response.get('data').get('id'))
Hope it helps!
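If you really do want to add time after the fact rather than start/stop a timer, here is a hedged sketch of calling the create endpoint from the linked docs directly with requests (the v8 endpoint path, field names, and the duration-in-seconds convention are my reading of those docs; verify them before relying on this):

import requests

# Hedged sketch: POST directly to the time entry creation endpoint described
# in the toggl_api_docs link above, bypassing togglwrapper.
api_token = 'your_api_token'
payload = {
    'time_entry': {
        'description': 'what I did for the day',
        'pid': 123,                             # existing project id
        'start': '2016-01-20T09:00:00+00:00',   # ISO 8601 start time
        'duration': 3600,                       # seconds
        'created_with': 'my_script',
    }
}
r = requests.post('https://www.toggl.com/api/v8/time_entries',
                  json=payload, auth=(api_token, 'api_token'))
print(r.status_code, r.json())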
One of the pages of my Django app simply displays all employees' information in a table, like so:
First Name    Last Name    Age    Hire Date
Bob           Johnson      21     03/19/2011
Fred          Jackson      50     12/01/1999
Now, I prompt the user for 2 dates and I want to know if an employee was hired between those 2 dates.
For HTTP GET I just render the page, and for HTTP POST I redirect to a URL with the variables embedded in it.
my urls.py file has these patterns:
('^employees/employees_by_date/$', 'project.reports.filter_by_date'),
('^employees/employees_by_date/sort/(?P<begin_date>\d+)/(?P<end_date>\d+)/$', EmployeesByDate.as_view()),
And my filter_by_date function looks like this:
def filter_by_date(request):
    if request.method == 'GET':
        return render(request, "../templates/reports/employees_by_date.html", {'form': BasicPrompt()})
    else:
        form = BasicPrompt(request.POST)
        if form.is_valid():
            begin_date = form.cleaned_data['begin_date']
            end_date = form.cleaned_data['end_date']
            return HttpResponseRedirect('../reports/employees_by_date/sort/' + str(begin_date) + '/' + str(end_date) + '/')
The code works fine, the problem is I'm new to web dev and this doesn't feel like I'm accomplishing this in the right way. I want to use best practices so can anyone either confirm I am or guide me in the proper way to filter by dates?
Thanks!
You're right, it's a bit awkward to query your API in that way. If you need to add the employee name and something else to the filter, you will end up with a very long URL and it won't be flexible.
Your filter parameters (start and end date) should be added as a query string in the URL, not as part of the path.
In this case, the URL would be employees/employees_by_date/?start_date=xxx&end_date=yyy and the dates can be retrieved in the view using start_date = request.GET['start_date'].
If a form is used with method='get', the inputs in the form are automatically converted to a query string and appended to the end of the URL.
If no form is used, parameters need to be URL-encoded so that values with special characters like / $ % can be passed.
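A minimal sketch of that approach in the view (BasicPrompt and the template name are reused from the question; the Employee model and the hire_date field name are assumptions):

from django.shortcuts import render

def employees_by_date(request):
    # Bind the form to the GET query string, e.g. ?begin_date=...&end_date=...
    form = BasicPrompt(request.GET or None)
    employees = Employee.objects.all()
    if form.is_valid():
        begin_date = form.cleaned_data['begin_date']
        end_date = form.cleaned_data['end_date']
        employees = employees.filter(hire_date__range=(begin_date, end_date))
    return render(request, 'reports/employees_by_date.html',
                  {'form': form, 'employees': employees})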
Use Unix timestamps instead of mm/dd/yyyy dates. A Unix timestamp is the number of seconds that have elapsed since Jan 1 1970 ("the Epoch"), so it's just a simple integer. As I'm writing this, the Unix time is 1432071354.
They aren't very human-readable, but Unix timestamps are unambiguous, concise, and can be filtered for with the simple regex [\d]+.
You'll see lots of APIs around the web use them, for example Facebook. Scroll down to "time based pagination", those numbers are Unix timestamps.
The problem with mm/dd/yyyy dates is ambiguity. Is it mm/dd/yyyy (US)? or dd/mm/yyyy (elsewhere)? What about mm-dd-yyyy?
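For example, converting back and forth in Python:

import time
from datetime import datetime, timezone

print(int(time.time()))                                      # current Unix timestamp
print(datetime.fromtimestamp(1432071354, tz=timezone.utc))   # timestamp -> readable UTC datetime
print(int(datetime(2015, 5, 19, tzinfo=timezone.utc).timestamp()))  # datetime -> timestamp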