I am getting an
HTTPError: Bad response
when trying to retrieve weather data from the Dark Sky API using darkskylib in Python. The underlying status is actually a 400 Bad Request.
It seems to happen only when I loop through my pandas dataframe rows: when I run the code for a single row I get the correct values, and a direct URL request in my browser also works.
Here is my function, which is called later (with df being the dataframe):
def engineer_features(df):
    from datetime import datetime as dt
    from darksky import forecast

    print("Add weather data...")

    # Add Windspeed
    df['ISSTORM'] = 0
    # Add Temperature
    df['ISHOT'] = 0
    df['ISCOLD'] = 0
    # Add Precipitation probability
    # (the weather station is not at the coordinates of the taxi, so only a
    # probability is added, relative to the center of Porto; otherwise the
    # API calls would have been costly)
    df['PRECIPPROB'] = 0

    # sort data frame
    data_times = df.sort_values(by='TIMESTAMP')

    # initialize variable for the previous day (the day before the first day)
    prevDay = data_times.at[0, 'TIMESTAMP'].date().day - 1

    # initialize hour counter
    hourCount = 0

    # add personal DarkSky API key and combine it with the location data
    key = 'MY_API_KEY'
    PORTO = key, 41.1579, -8.6291

    # loop through the sorted dataframe and add weather-related data to the main dataframe
    for index, row in data_times.iterrows():
        # if the new row is a new day, make a new API call for weather data of that day
        if row["TIMESTAMP"].day != prevDay:
            # get weather data
            t = row["TIMESTAMP"].date().isoformat()
            porto = forecast(*PORTO, time=t)
            porto.refresh(units='si')
            # ...more code
My particular issue was that I converted my datetime into a date. So instead of writing
t = row["TIMESTAMP"].date().isoformat()
I needed to write
t = row["TIMESTAMP"].isoformat()
Related
I'm trying to download weekly Sentinel-2 data for one year, i.e. one Sentinel dataset for each week of the year. I can create a list of datasets using this code:
from sentinelsat import SentinelAPI

api = SentinelAPI(user, password, 'https://scihub.copernicus.eu/dhus')
products = api.query(footprint,
                     date=('20211001', '20221031'),
                     platformname='Sentinel-2',
                     processinglevel='Level-2A',
                     cloudcoverpercentage=(0, 10))

products_gdf = api.to_geodataframe(products)
products_gdf_sorted = products_gdf.sort_values(['beginposition'], ascending=[False])
products_gdf_sorted
This creates a list of all datasets available within the year, and since data capture happens roughly once every five days, you could argue I can work off this list. Instead, though, I would like just one option for each week (Mon - Sun). I thought I could create a dataframe with a start date and an end date for each week and loop it through the api.query code, but I am not sure how to do this.
I have created a dataframe using:
import pandas as pd

dates_df = pd.DataFrame({
    'StartDate': pd.date_range(start='20211001', end='20221030', freq='W-MON'),
    'EndDate': pd.date_range(start='20211004', end='20221031', freq='W-SUN'),
})
print (dates_df)
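A rough sketch of that looping idea, reusing the api object and footprint from the first snippet and keeping one product per week (untested, so treat it as a starting point rather than working code):

weekly_picks = []

for week in dates_df.itertuples(index=False):
    # query one Mon-Sun window at a time
    products = api.query(footprint,
                         date=(week.StartDate.strftime('%Y%m%d'),
                               week.EndDate.strftime('%Y%m%d')),
                         platformname='Sentinel-2',
                         processinglevel='Level-2A',
                         cloudcoverpercentage=(0, 10))
    if products:
        gdf = api.to_geodataframe(products)
        # keep a single dataset per week, e.g. the most recent acquisition
        weekly_picks.append(gdf.sort_values('beginposition', ascending=False).iloc[0])

weekly_gdf = pd.DataFrame(weekly_picks)  # one row per week that returned data
print(len(weekly_gdf), 'weeks with a product')

Sorting by cloudcoverpercentage instead of beginposition would keep the clearest scene of each week rather than the most recent one.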
Any tips or advice is greatly appreciated. Thanks!
I am attempting to fetch data from the NSRDB PSM API via a formatted URL, but I keep getting an HTTP 400 error and I can't get a specific reason from the stack trace.
The following is the code:
import pandas as pd

# Declare all variables as strings. Spaces must be replaced with '+', i.e., change 'John Smith' to 'John+Smith'.
# Define the lat, lon of the location and the year.
# Using UWI, St. Augustine Campus as the target: 10.6416° N, 61.3995° W
lat, lon, year = 10.6416, 61.3995, 2019
# You must request an NSRDB API key from the link above (replace this with your API key).
api_key = 'xxxxxxxxxxxxxxx'
# Set the attributes to extract (e.g., dhi, ghi, etc.), separated by commas.
attributes = 'ghi,dhi,dni,wind_speed,air_temperature,solar_zenith_angle'
# Set leap year to true or false. True will return leap day data if present, false will not.
leap_year = 'false'
# Set time interval in minutes, i.e., '30' is half hour intervals. Valid intervals are 30 & 60.
interval = '30'
# Specify Coordinated Universal Time (UTC), 'true' will use UTC, 'false' will use the local time zone of the data.
# NOTE: In order to use the NSRDB data in SAM, you must specify UTC as 'false'. SAM requires the data to be in the
# local time zone.
utc = 'false'
# Your full name, use '+' instead of spaces.
your_name = 'Shankar+Ramharack'
# Your reason for using the NSRDB.
reason_for_use = 'research'
# Your affiliation
your_affiliation = 'UWI'
# Your email address
your_email = 'XXXXXXX'
# Please join our mailing list so we can keep you up-to-date on new developments.
mailing_list = 'false'
# Declare url string
url = 'https://developer.nrel.gov/api/solar/nsrdb_psm3_download.csv?wkt=POINT({lon}%20{lat})&names={year}&leap_day={leap}&interval={interval}&utc={utc}&full_name={name}&email={email}&affiliation={affiliation}&mailing_list={mailing_list}&reason={reason}&api_key={api}&attributes={attr}'.format(year=year, lat=lat, lon=lon, leap=leap_year, interval=interval, utc=utc, name=your_name, email=your_email, mailing_list=mailing_list, affiliation=your_affiliation, reason=reason_for_use, api=api_key, attr=attributes)
# Return just the first 2 lines to get metadata:
info = pd.read_csv(url, nrows=1)
# See metadata for specified properties, e.g., timezone and elevation
timezone, elevation = info['Local Time Zone'], info['Elevation']
I followed the method specified on their page: here. I am not sure whether I am making a pandas error in the read or a more general Python error. I am relatively new to the language.
I got feedback from the devs: it was an issue on their end, and they are currently updating the documentation. To access the data, use the following:
url = ('https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-download.csv'
       '?wkt=POINT({lon}%20{lat})&names={year}&leap_day={leap}&interval={interval}'
       '&utc={utc}&full_name={name}&email={email}&affiliation={affiliation}'
       '&mailing_list={mailing_list}&reason={reason}&api_key={api}&attributes={attr}'
       ).format(year=year, lat=lat, lon=lon, leap=leap_year, interval=interval,
                utc=utc, name=your_name, email=your_email, mailing_list=mailing_list,
                affiliation=your_affiliation, reason=reason_for_use, api=api_key,
                attr=attributes)
df = pd.read_csv(url, skiprows=2)
An alternate method of obtaining NSRDB data is to use the wrapper functions from the pvlib library. The source code can be found here. Using the pvlib.iotools.get_psm3() function, data can be queried from the PSM model API.
Use the latest release for best performance.
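A rough sketch of that route is below. The parameter names follow the pvlib documentation as I recall it, so check the signature of your installed version; the key and email are placeholders.

from pvlib.iotools import get_psm3

API_KEY = 'xxxxxxxxxxxxxxx'       # placeholder; use your own NSRDB key
EMAIL = 'your_email@example.com'  # must match the email the key was registered with

# 10.6416 N, 61.3995 W as decimal degrees (west longitude is negative)
data, metadata = get_psm3(
    latitude=10.6416,
    longitude=-61.3995,
    api_key=API_KEY,
    email=EMAIL,
    names='2019',  # the year to download
    interval=30,   # 30- or 60-minute data
)

print(metadata['Local Time Zone'], metadata['Elevation'])
print(data.head())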
I have a dataset in JSON with GPS coordinates:
"utc_date_and_time":"2021-06-05 13:54:34", # timestamp
"hdg":"018.0", # heading
"sog":"000.0", # speed
"lat":"5905.3262N", # latitude
"lon":"00554.2433E" # longitude
This data will be imported into a database, with one entry every second for every "vessel".
As you can imagine this is a huge amount of data that provides a level of accuracy I do not need.
My goal:
Create a new entry in the database for every X seconds
If I set X to 60 (one minute) and 10 entries are missing within that period, the 50 available entries should be used. Data can be missing for certain periods, and I do not want this to create bogus positions.
Use timestamp from last entry in period.
Use the heading (hdg) that is appearing the most times within this period.
Calculate average speed within this period.
Latitude and longitude could use the last entry, but I have seen "spikes" that need to be filtered out; alternatively, use an average and remove values that differ too much.
My script is now pushing all the data to the database via a for loop with different data-checks inside it, and this is working.
I am new to Python and still learning every day through reading and YouTube videos, but it would be great if anyone could point me in the right direction for how to achieve the goal above.
As of now the data is imported into a dictionary, and I am wondering whether a dictionary keyed by timestamp is the way to go, but I am a little lost.
Code:
import os
import json
from pathlib import Path
from datetime import datetime, timedelta, date

def generator(data):
    for entry in data:
        yield entry

data = json.load(open("5_gps_2021-06-05T141524.1397180000.json"))["gps_data"]
gps_count = len(data)
start_time = None
new_gps = list()
tempdata = list()
seconds = 60
i = 0

for entry in generator(data):
    i = i + 1
    if start_time == None:
        start_time = datetime.fromisoformat(entry['utc_date_and_time'])

    # TODO: Filter out values with too much deviation
    tempdata.append(entry)

    elapsed = (datetime.fromisoformat(entry['utc_date_and_time']) - start_time).total_seconds()
    if (elapsed >= seconds) or (i == gps_count):
        # TODO: Calculate average values etc. instead of using last
        new_gps.append(tempdata)
        tempdata = []
        start_time = None

print("GPS count before:" + str(gps_count))
print("GPS count after:" + str(len(new_gps)))
Output:
GPS count before:1186
GPS count after:20
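As a sketch of the kind of aggregation described in the goals above, pandas can do the grouping instead of the manual loop. This is untested against your file; the field names come from the JSON sample, and spike filtering plus the lat/lon handling are left out:

import json
import pandas as pd

data = json.load(open("5_gps_2021-06-05T141524.1397180000.json"))["gps_data"]
df = pd.DataFrame(data)

# the JSON stores everything as strings, so parse the fields that get aggregated
df['ts'] = pd.to_datetime(df['utc_date_and_time'])
df['sog'] = df['sog'].astype(float)

# group the entries into 60-second buckets on the timestamp
summary = df.groupby(pd.Grouper(key='ts', freq='60s')).agg(
    last_time=('utc_date_and_time', 'last'),  # timestamp of the last entry in the period
    heading=('hdg', lambda s: s.mode().iloc[0] if not s.empty else None),  # most frequent heading
    avg_speed=('sog', 'mean'),                # average speed over the period
    samples=('sog', 'size'),                  # how many raw entries the period actually had
)

# periods with no data at all produce empty groups; drop them rather than inventing positions
summary = summary[summary['samples'] > 0]
print(summary.head())

From there each row of summary would become one database entry, and latitude/longitude could be handled the same way once you decide between "last value in the period" and "filtered average".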
I'm trying to collect the 150 rows of data from the text that appears at the bottom of a given Showbuzzdaily.com web page (example), but my script only collects 132 rows.
I'm new to Python. Is there something I need to add to my loop to ensure all records are collected as intended?
To troubleshoot, I created a list (program_count) to verify this is happening in the code before the CSV is generated, which shows there are only 132 items in the list, rather than 150. Interestingly, the final row (#132) ends up being duplicated at the end of the CSV for some reason.
I experienced similar issues scraping Google Trends (using pytrends), where only about 80% of the data I tried to scrape ended up in the CSV. So I suspect there is something wrong with my code, or that I'm overwhelming my target with requests.
Adding time.sleep(0.1) to the loop in this code didn't produce different results.
import time
import requests
import datetime
from bs4 import BeautifulSoup
import pandas as pd  # import pandas module
from datetime import date, timedelta

# creates empty 'records' list
records = []

start_date = date(2021, 4, 12)
orig_start_date = start_date  # Used for naming the CSV
end_date = date(2021, 4, 12)
delta = timedelta(days=1)  # Defines delta as +1 day

print(str(start_date) + ' to ' + str(end_date))  # Visual reassurance

# begins while loop that will continue for each daily viewership report until end_date is reached
while start_date <= end_date:
    start_weekday = start_date.strftime("%A")  # define weekday name
    start_month_num = int(start_date.strftime("%m"))  # define month number
    start_month_num = str(start_month_num)  # convert to string so it is ready to be put into the address
    start_month_day_num = int(start_date.strftime("%d"))  # define day of the month
    start_month_day_num = str(start_month_day_num)  # convert to string so it is ready to be put into the address
    start_year = int(start_date.strftime("%Y"))  # define year
    start_year = str(start_year)  # convert to string so it is ready to be put into the address

    # define address (URL)
    address = 'http://www.showbuzzdaily.com/articles/showbuzzdailys-top-150-'+start_weekday.lower()+'-cable-originals-network-finals-'+start_month_num+'-'+start_month_day_num+'-'+start_year+'.html'
    print(address)  # print for visual reassurance

    # read the web page at the defined address (URL)
    r = requests.get(address)
    soup = BeautifulSoup(r.text, 'html.parser')

    # we're going to deal with results that appear within <td> tags
    results = soup.find_all('td')

    # reads the date text at the top of the web page so it can be inserted later into the CSV in the 'Date' column
    date_line = results[0].text.split(": ", 1)[1]  # reads the text after the colon and space (': '), which is where the date information is located
    weekday_name = date_line.split(' ')[0]  # stores the weekday name
    month_name = date_line.split(' ', 2)[1]  # stores the month name
    day_month_num = date_line.split(' ', 1)[1].split(' ')[1].split(',')[0]  # stores the day of the month
    year = date_line.split(', ', 1)[1]  # stores the year

    # concatenates and stores the full date value
    mmmmm_d_yyyy = month_name+' '+day_month_num+', '+year

    del results[:10]  # deletes the first 10 results, which contained the date information and column headers

    program_count = []  # empty list for program counting

    # (within the while loop) begins a for loop that appends data for each program in a daily viewership report
    for result in results:
        rank = results[0].text  # stores P18-49 rank
        program = results[1].text  # stores program name
        network = results[2].text  # stores network name
        start_time = results[3].text  # stores program's start time
        mins = results[4].text  # stores program's duration in minutes
        p18_49 = results[5].text  # stores program's P18-49 rating
        p2 = results[6].text  # stores program's P2+ viewer count (in thousands)
        records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2))  # appends the data to the 'records' list
        program_count.append(program)  # adds each program name to the list
        del results[:7]  # deletes the first 7 results remaining, which contained the data for 1 row (1 program) which was just stored in 'records'

    print(len(program_count))  # Troubleshooting: prints to screen the number of programs counted. Should be 150.
    records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2))  # appends the data to the 'records' list
    print(str(start_date)+' collected...')  # Visual reassurance one page/day is finished being collected
    start_date += delta  # at the end of the while loop, advance one day

df = pd.DataFrame(records, columns=['Date','Weekday','P18-49 Rank','Program','Network','Start time','Mins','P18-49','P2+'])  # Creates DataFrame using the columns listed
df.to_csv('showbuzz ' + str(orig_start_date) + ' to ' + str(end_date) + '.csv', index=False, encoding='utf-8')  # generates the CSV file, using start and end dates in the filename
It seems like you're making debugging a lot tougher on yourself by pulling all the table data (<td>) individually like that. After stepping through the code and making a couple of changes, my best guess is that the bug comes from deleting entries from results while you're iterating over it, which gets messy. As a side note, you also never use result inside the loop, which makes that variable pointless. Something like this ends up a little cleaner and gets you your 150 results:
results = soup.find_all('tr')

# reads the date text at the top of the web page so it can be inserted later into the CSV in the 'Date' column
date_line = results[0].select_one('td').text.split(": ", 1)[1]  # selects the first td it finds under the first tr
weekday_name = date_line.split(' ')[0]
month_name = date_line.split(' ', 2)[1]
day_month_num = date_line.split(' ', 1)[1].split(' ')[1].split(',')[0]
year = date_line.split(', ', 1)[1]
mmmmm_d_yyyy = month_name + ' ' + day_month_num + ', ' + year

program_count = []  # empty list for program counting

for result in results[2:]:
    children = result.find_all('td')
    rank = children[0].text  # stores P18-49 rank
    program = children[1].text  # stores program name
    network = children[2].text  # stores network name
    start_time = children[3].text  # stores program's start time
    mins = children[4].text  # stores program's duration in minutes
    p18_49 = children[5].text  # stores program's P18-49 rating
    p2 = children[6].text  # stores program's P2+ viewer count (in thousands)
    records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2))
    program_count.append(program)  # adds each program name to the list
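For reference, a toy example (not from the original code) of why deleting from a list while iterating over it drops items:

items = list(range(10))
seen = []
for item in items:
    seen.append(item)
    del items[:2]   # mutate the list being iterated over, like del results[:7] did
print(seen)         # [0, 3, 6, 9] -- only 4 of the 10 items are ever visited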
You also shouldn't need a second list to count the programs you've retrieved (appending each program to program_count). Both lists end up with the same number of entries no matter what, since you append a program name for every result, so instead of print(len(program_count)) you could have used print(len(records)). I'm assuming it was just for debugging purposes, though.
I need historical weather data (temperature) on an hourly basis for Chicago, IL (zip code 60603).
Basically, I need it for June and July 2017, either hourly or at 15-minute intervals.
I have searched NOAA, Weather Underground, etc., but haven't found anything relevant to my use case. I tried my hand at scraping with R and Python, but no luck.
Here are snippets of what I tried.
R:
library(httr)
library(XML)

url <- "http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXMLclient.php"
response <- GET(url, query = list(zipCodeList = "10001",
                                  product = "time-series",
                                  begin = format(Sys.Date(), "%Y-%m-%d"),
                                  Unit = "e",
                                  temp = "temp", rh = "rh", wspd = "wspd"))
doc <- content(response,type="text/xml", encoding = "UTF-8") # XML document with the data
# extract the date-times
dates <- doc["//time-layout/start-valid-time"]
dates <- as.POSIXct(xmlSApply(dates,xmlValue),format="%Y-%m-%dT%H:%M:%S")
# extract the actual data
data <- doc["//parameters/*"]
data <- sapply(data,function(d)removeChildren(d,kids=list("name")))
result <- do.call(data.frame,lapply(data,function(d)xmlSApply(d,xmlValue)))
colnames(result) <- sapply(data,xmlName)
# combine into a data frame
result <- data.frame(dates,result)
head(result)
Error:
Error in UseMethod("xmlSApply") :
no applicable method for 'xmlSApply' applied to an object of class "list"
Python:
from pydap.client import open_url

# setup the connection
url = 'http://nomads.ncdc.noaa.gov/dods/NCEP_NARR_DAILY/197901/197901/narr-a_221_197901dd_hh00_000'
modelconn = open_url(url)
tmp2m = modelconn['tmp2m']

# grab the data
lat_index = 200  # you could tie this to tmp2m.lat[:]
lon_index = 200  # you could tie this to tmp2m.lon[:]
print(tmp2m.array[:, lat_index, lon_index])
Error:
HTTPError: 503 Service Temporarily Unavailable
Any other solution in R or Python is appreciated, as is a link to any related online dataset.
There is an R package rwunderground, but I've not had much success getting what I want out of it. In all honesty, I'm not sure if that's the package, or if it is me.
Eventually, I broke down and wrote a quick diddy to get the daily weather history for personal weather stations. You'll need to sign up for a Weather Underground API token (I'll leave that to you). Then you can use the following:
library(rjson)

api_key <- "your_key_here"
date <- seq(as.Date("2017-06-01"), as.Date("2017-07-31"), by = 1)
pws <- "KILCHICA403"

Weather <- vector("list", length = length(date))

for(i in seq_along(Weather)){
  url <- paste0("http://api.wunderground.com/api/", api_key,
                "/history_", format(date[i], format = "%Y%m%d"), "/q/pws:",
                pws, ".json")
  result <- rjson::fromJSON(paste0(readLines(url), collapse = " "))
  Weather[[i]] <- do.call("rbind", lapply(result[[2]][[3]], as.data.frame,
                                          stringsAsFactors = FALSE))
  Sys.sleep(6)
}

Weather <- do.call("rbind", Weather)
There's a call to Sys.sleep, which causes the loop to wait 6 seconds before going to the next iteration. This is done because the free API only allows ten calls per minute (up to 500 per day).
Also, some days may not have data. Remember that this connects to a personal weather station. There could be any number of reasons that it stopped uploading data, including internet outages, power outages, or the owner turned off the link to Weather Underground. If you can't get the data off of one station, try another nearby and fill in the gaps.
To get a weather station code, go to weatherunderground.com and enter your desired zip code into the search bar. Then click on the "Change" link; you will see the station code of the current station, along with options for other stations nearby.
Just to provide a Python solution for whoever comes across this question looking for one. This will (per the post) go through each day in June and July 2017, getting all observations for a given location. It does not restrict the data to 15-minute or hourly intervals, but it does return everything observed on each day. Additional parsing of the observation time per observation is necessary, but this is a start.
WunderWeather
pip install WunderWeather
pip install arrow
import arrow  # learn more: https://python.org/pypi/arrow
from WunderWeather import weather  # learn more: https://python.org/pypi/WunderWeather

api_key = ''
extractor = weather.Extract(api_key)
zip = '02481'
begin_date = arrow.get("201706", "YYYYMM")
end_date = arrow.get("201708", "YYYYMM").shift(days=-1)

for date in arrow.Arrow.range('day', begin_date, end_date):
    # get date object for feature
    # http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.weather.Extract.date
    date_weather = extractor.date(zip, date.format('YYYYMMDD'))

    # use shortcut to get observations and data
    # http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.date.Observation
    for observation in date_weather.observations:
        print("Date:", observation.date_pretty)
        print("Temp:", observation.temp_f)
If you are solving an ML task and want to try historical weather data, I recommend trying the Python library upgini for smart enrichment. It contains 12 years of historical weather data for 68 countries.
My usage code is as follows:
%pip install -Uq upgini

from upgini import SearchKey, FeaturesEnricher
from upgini.metadata import CVType, RuntimeParameters

## define search keys
search_keys = {
    "Date": SearchKey.DATE,
    "country": SearchKey.COUNTRY,
    "postal_code": SearchKey.POSTAL_CODE
}

## define X_train / y_train
X_train = df_prices.drop(columns=['Target'])
y_train = df_prices.Target

## define Features Enricher
features_enricher = FeaturesEnricher(
    search_keys = search_keys,
    cv = CVType.time_series
)

X_enriched = features_enricher.fit_transform(X_train, y_train, calculate_metrics=True)
As a result you'll get a dataframe with new features that have non-zero feature importance on the target, such as temperature, wind speed, etc.
Web: https://upgini.com GitHub: https://github.com/upgini