I hope I can explain clearly what I'm trying to do here. I'm able to retrieve the series dates and values through the Bureau of Labor Statistics (BLS) API: https://www.bls.gov/
I now want to get the 1-year percent change for each value. I'm grateful for any help with this. XXXXX is where I put my registration ID, so I can access more than three years of data.
Here's the developer's page: https://www.bls.gov/developers/api_signature_v2.htm#parameters
import requests
import pandas as pd

base_url = 'https://api.bls.gov/publicAPI/v2/timeseries/data/'
series = {'id': 'CUSR0000SA0',
          'name': 'Consumer Price Index - All Urban Consumers'}
data_url = '{}{}/?registrationkey=XXXXXXXXXXXXXXXXXX&startyear=2010&endyear=2022'.format(base_url, series['id'])

r = requests.get(data_url).json()
print('Status: ' + r['status'])
r = r['Results']['series'][0]['data']
print(r[0])

# M13 is the annual average, which I'm skipping since I only want months 1-12
dates = ['{}{}'.format(i['period'], i['year']) for i in r if i['period'] < 'M13']
index = pd.to_datetime(dates)
data = {series['id']: [float(i['value']) for i in r if i['period'] < 'M13']}
df = pd.DataFrame(index=index, data=data).iloc[::-1]
I'm not sure how to retrieve the calculations in my code.
I tried "calculations: {[i['calculations'][0] for i in r if i['period'] < 'M13']}", but that is not right.
I'm trying to download weekly Sentinel-2 data for one year, i.e. one Sentinel-2 dataset for each week of the year. I can create a list of datasets using this code:
from sentinelsat import SentinelAPI

api = SentinelAPI(user, password, 'https://scihub.copernicus.eu/dhus')
products = api.query(footprint,
                     date=('20211001', '20221031'),
                     platformname='Sentinel-2',
                     processinglevel='Level-2A',
                     cloudcoverpercentage=(0, 10))
products_gdf = api.to_geodataframe(products)
products_gdf_sorted = products_gdf.sort_values(['beginposition'], ascending=[False])
products_gdf_sorted
This creates a list of all datasets available within the year, and since data is captured roughly once every five days you could argue I can work off this list. Instead, though, I would like just one option for each week (Mon-Sun). I thought I could create a dataframe with a start date and an end date for each week and loop that through the api.query code, but I'm not sure how I would do this.
I have created a dataframe using:
import pandas as pd
dates_df = pd.DataFrame({'StartDate': pd.date_range(start='20211001', end='20221030', freq='W-MON'),
                         'EndDate': pd.date_range(start='20211004', end='20221031', freq='W-SUN')})
print (dates_df)
Any tips or advice are greatly appreciated. Thanks!
I only have basic knowledge of Python, so I'm not even sure whether this is possible.
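In case it helps, a minimal sketch of that weekly loop, assuming the api, footprint, and dates_df objects defined above, and assuming the cloudcoverpercentage column returned by to_geodataframe; keeping the least cloudy product per week is just one possible selection rule:

import pandas as pd

weekly_best = []
for _, week in dates_df.iterrows():
    # query a single Monday-Sunday window
    products = api.query(footprint,
                         date=(week['StartDate'].strftime('%Y%m%d'),
                               week['EndDate'].strftime('%Y%m%d')),
                         platformname='Sentinel-2',
                         processinglevel='Level-2A',
                         cloudcoverpercentage=(0, 10))
    if not products:
        continue  # some weeks may have no acquisition under 10% cloud cover
    gdf = api.to_geodataframe(products)
    # keep one product per week, here the least cloudy one
    weekly_best.append(gdf.sort_values('cloudcoverpercentage').iloc[0])

weekly_gdf = pd.DataFrame(weekly_best)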
I have a csv that looks like this:
(Screenshot of the CSV: https://i.stack.imgur.com/8clYM.png)
(This is dummy data, the real one is about 30K rows.)
I need to find the most recent job title for each employee (unique id) and then calculate how long (= how many days) the employee has been on the same job title.
What I have done so far:
import csv
from datetime import datetime

data = open("C:\\Users\\User\\PycharmProjects\\pythonProject\\jts.csv", encoding="utf-8")
csv_data = csv.reader(data)
data_lines = list(csv_data)
print(data_lines)

for i in data_lines:
    for j in i[0]:
        pass  # this is where I got stuck
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
I also know that at one point I will need:
datetime.strptime(data_lines[1][2], '%Y/%M/%d').date()
Could somebody help, please? I just need a new list saying something like:
id jt days
500 plumber 370
Edit to clarify: the dates are the points at which the data was recorded. I need to calculate back from the most recent of those until the job title was something else. So in my example, for employee 5000, that is from 04/07/2021 back to 01/03/2020.
Let's consider sample data as follows:
id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
6000,software architecture,06/02/2021
7000,software tester,06/02/2019
The following code works.
import pandas as pd

# load data
data = pd.read_csv('data.csv')

# convert to datetime objects (day first, e.g. 01/03/2020 -> 2020-03-01)
data.date = pd.to_datetime(data.date, dayfirst=True)
print(data)

# most recent record per employee id
latest = data.sort_values('date', ascending=False).groupby('id').nth(0)
print(latest)

# date of the previous (second most recent) record per employee id
prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date
print(prev_date)

# difference in days between the two
latest['days'] = latest.date - prev_date
print(latest)
Output:
jtitle date days
id
5000 senior plumber 2020-03-02 61 days
6000 software architecture 2021-02-06 371 days
7000 software tester 2019-02-06 NaT
"But then I haven't got anywhere because I can't even conceptualise how to structure this."
Keep a map (dict) from employee id to (date, title).
For every row, check whether you already have an entry for that employee. If you don't, just put the information in the map; otherwise compare the row's date with the entry's date, and if the row has a more recent date, replace the entry.
Once you've gone through all the rows, you can just go through the map you've collected and compute the difference between the date you ended up with and "today".
Incidentally, your pattern is not correct: the sample data uses a %d/%m/%Y (day/month/year) or %m/%d/%Y (month/day/year) format. The sample data is not sufficient to say which, but it certainly is not year/month/day.
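For what it's worth, a minimal sketch of that approach; the file name jts.csv and the day-first date format are assumptions (swap the format string if the data turns out to be month-first), and, as described above, it measures from the most recent record to today rather than back to the previous title change:

import csv
from datetime import datetime, date

latest = {}  # employee id -> (most recent date seen so far, job title on that date)

with open('jts.csv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for emp_id, jtitle, raw_date in reader:
        row_date = datetime.strptime(raw_date, '%d/%m/%Y').date()
        entry = latest.get(emp_id)
        if entry is None or row_date > entry[0]:
            latest[emp_id] = (row_date, jtitle)

# days between the most recent record and today
for emp_id, (last_date, jtitle) in latest.items():
    print(emp_id, jtitle, (date.today() - last_date).days)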
Seems like I'm too late... Nevertheless, in case you're interested, here's a suggestion in pure Python (nothing wrong with Pandas, though!):
import csv
import datetime as dt
from operator import itemgetter
from itertools import groupby

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # Discard header row
    # Read, transform (date), and sort in reverse (id first, then date):
    data = sorted(((i, jtitle, dt.datetime.strptime(date, '%d/%m/%Y'))
                   for i, jtitle, date in reader),
                  key=itemgetter(0, 2), reverse=True)
# Process data grouped by id
result = []
for i, group in groupby(data, key=itemgetter(0)):
    _, jtitle, end = next(group)  # Fetch the last job title and its date
    # Search for the first occurrence of a different job title:
    start = end
    for _, jt, start in group:
        if jt != jtitle:
            break
    # Collect results in a list, with datetimes transformed back to strings
    result.append((i, jtitle, end.strftime('%d/%m/%Y'), (end - start).days))
result = sorted(result, key=itemgetter(0))
The result for the input data
id,jtitle,date
5000,plumber,01/01/2020
5000,plumber,01/02/2020
5000,senior plumber,01/03/2020
5000,head plumber,01/05/2020
5000,head plumber,02/09/2020
5000,head plumber,05/01/2021
5000,head plumber,04/07/2021
6000,electrician,01/02/2018
6000,qualified electrician,01/06/2020
7000,plumber,01/01/2004
7000,plumber,09/11/2020
7000,senior plumber,05/06/2021
is
[('5000', 'head plumber', '04/07/2021', 490),
('6000', 'qualified electrician', '01/06/2020', 851),
('7000', 'senior plumber', '05/06/2021', 208)]
Using the Quandl API and the Quandl Python library, I'm attempting to do a bulk download of the past 100 days' worth of EOD data.
The bulk download uses this call to download all EOD data for all tickers for the last collected day. Removing the download_type=partial parameter will download all historical EOD data:
https://www.quandl.com/api/v3/databases/EOD/data?download_type=partial
This call will download the last n days' worth of EOD data for a single ticker:
https://www.quandl.com/api/v3/datasets/EOD/AAPL?start_date=2019-02-07
Is it possible to combine these and download the last n days' worth of EOD data for all stocks at once?
At this point it seems my only options are:
Make individual API calls for all 8,000 tickers
Download all historical data for every stock
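For completeness, a rough, untested sketch of option 1 (the per-ticker route) using the quandl Python library; the API key, ticker list, and start date below are placeholders, and whether the bulk partial download can be date-filtered is exactly what's being asked, so this only covers the fallback:

import quandl
import pandas as pd

quandl.ApiConfig.api_key = 'YOUR_API_KEY'  # placeholder

tickers = ['AAPL', 'MSFT', 'GOOGL']  # in practice, the full ~8,000-ticker list
frames = []
for t in tickers:
    # one request per ticker, restricted to the last ~100 days
    df = quandl.get('EOD/{}'.format(t), start_date='2019-02-07')
    df['ticker'] = t
    frames.append(df)

eod = pd.concat(frames)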
Quandl is no longer free; it used to be in the past.
If you want, you can use IEX instead. Check the example below, which will give you the daily returns:
from datetime import datetime
from pandas_datareader import data
import pandas as pd

start = '2014-01-01'
end = datetime.today().utcnow()

datasets_original_test = ['AAPL', 'MSFT','NFLX','FB','GS','TSLA','BAC','TWTR','COF','TOL','EA','PFE','MS','C','SKX','GLD','SPY','EEM','XLF','GDX','EWZ','QQQ','FXI','XOP','EFA','VXXB','HYG','XLI','XLU','JNK','USO','IWM','XLP','XLE','EWJ','XLK','KRE','XLV','VNQ','MBB','OIH','FEZ','RSX','EWG','SMH','TLT','IBB','SLV','IYR','XRT','XLB','EMB','AGG','INDA','EWW','DBO','SPLV','KBE','VGK','XLY','EWH','EWT','DIA','IVV','XLRE','EPI','IJR','IEF']
dataset_names_test = list(datasets_original_test)  # same tickers, reused as column names

datasets_test = []
for d in datasets_original_test:
    # daily bars from IEX via pandas_datareader
    data_original = data.DataReader(d, 'iex', start, end)
    data_original.index = pd.to_datetime(data_original.index, format='%Y/%m/%d')
    # daily returns from closing prices
    data_ch = data_original['close'].pct_change()
    datasets_test.append(data_ch)

# align all return series on the first ticker's dates
# (join_axes was removed from pd.concat in newer pandas, hence the reindex)
df_returns = pd.concat(datasets_test, axis=1).reindex(datasets_test[0].index)
df_returns.columns = dataset_names_test
I need historical weather data (temperature) on an hourly basis for Chicago, IL (zip code 60603).
Basically I need it for June and July 2017, either hourly or at 15-minute intervals.
I have searched NOAA, Weather Underground, etc., but haven't found anything relevant to my use case. I tried my hand at scraping with R and Python, but no luck.
Here is a snippet of each attempt.
R:
library(httr)
library(XML)

url <- "http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXMLclient.php"
response <- GET(url, query = list(zipCodeList = "10001",
                                  product = "time-series",
                                  begin = format(Sys.Date(), "%Y-%m-%d"),
                                  Unit = "e",
                                  temp = "temp", rh = "rh", wspd = "wspd"))
doc <- content(response, type = "text/xml", encoding = "UTF-8") # XML document with the data
# extract the date-times
dates <- doc["//time-layout/start-valid-time"]
dates <- as.POSIXct(xmlSApply(dates, xmlValue), format = "%Y-%m-%dT%H:%M:%S")
# extract the actual data
data <- doc["//parameters/*"]
data <- sapply(data, function(d) removeChildren(d, kids = list("name")))
result <- do.call(data.frame, lapply(data, function(d) xmlSApply(d, xmlValue)))
colnames(result) <- sapply(data, xmlName)
# combine into a data frame
result <- data.frame(dates, result)
head(result)
Error:
Error in UseMethod("xmlSApply") :
  no applicable method for 'xmlSApply' applied to an object of class "list"
Python:
from pydap.client import open_url

# setup the connection
url = 'http://nomads.ncdc.noaa.gov/dods/NCEP_NARR_DAILY/197901/197901/narr-a_221_197901dd_hh00_000'
modelconn = open_url(url)
tmp2m = modelconn['tmp2m']

# grab the data
lat_index = 200  # you could tie this to tmp2m.lat[:]
lon_index = 200  # you could tie this to tmp2m.lon[:]
print(tmp2m.array[:, lat_index, lon_index])
Error:
HTTPError: 503 Service Temporarily Unavailable
Any other solution, in R or Python, or a link to a relevant online dataset, would be appreciated.
There is an R package rwunderground, but I've not had much success getting what I want out of it. In all honesty, I'm not sure if that's the package, or if it is me.
Eventually, I broke down and wrote a quick script to get the daily weather history for personal weather stations. You'll need to sign up for a Weather Underground API token (I'll leave that to you). Then you can use the following:
library(rjson)

api_key <- "your_key_here"
date <- seq(as.Date("2017-06-01"), as.Date("2017-07-31"), by = 1)
pws <- "KILCHICA403"

Weather <- vector("list", length = length(date))

for(i in seq_along(Weather)){
  url <- paste0("http://api.wunderground.com/api/", api_key,
                "/history_", format(date[i], format = "%Y%m%d"), "/q/pws:",
                pws, ".json")
  result <- rjson::fromJSON(paste0(readLines(url), collapse = " "))
  Weather[[i]] <- do.call("rbind", lapply(result[[2]][[3]], as.data.frame,
                                          stringsAsFactors = FALSE))
  Sys.sleep(6)
}

Weather <- do.call("rbind", Weather)
There's a call to Sys.sleep, which causes the loop to wait 6 seconds before going to the next iteration. This is done because the free API only allows ten calls per minute (up to 500 per day).
Also, some days may not have data. Remember that this connects to a personal weather station. There could be any number of reasons that it stopped uploading data, including internet outages, power outages, or the owner turned off the link to Weather Underground. If you can't get the data off of one station, try another nearby and fill in the gaps.
To get a weather station code, go to weatherunderground.com and enter your desired zip code into the search bar.
Click on the "Change" link.
You will see the station code of the current station, along with options for other stations nearby.
Just to provide a Python solution for whoever comes across this question looking for one: this will (per the post) go through each day in June and July 2017 and get all observations for a given location. It does not restrict results to 15-minute or hourly intervals, but it does return all data observed on each day. Additional parsing of the observation time for each observation is needed, but this is a start.
WunderWeather
pip install WunderWeather
pip install arrow
import arrow  # learn more: https://python.org/pypi/arrow
from WunderWeather import weather  # learn more: https://python.org/pypi/WunderWeather

api_key = ''
extractor = weather.Extract(api_key)
zip = '02481'
begin_date = arrow.get("201706", "YYYYMM")
end_date = arrow.get("201708", "YYYYMM").shift(days=-1)

for date in arrow.Arrow.range('day', begin_date, end_date):
    # get date object for feature
    # http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.weather.Extract.date
    date_weather = extractor.date(zip, date.format('YYYYMMDD'))

    # use shortcut to get observations and data
    # http://wunderweather.readthedocs.io/en/latest/WunderWeather.html#WunderWeather.date.Observation
    for observation in date_weather.observations:
        print("Date:", observation.date_pretty)
        print("Temp:", observation.temp_f)
If you are working on an ML task and want to try historical weather data, I recommend the Python library upgini for smart enrichment. It contains 12 years of historical weather data for 68 countries.
Here is how I use it:
%pip install -Uq upgini

from upgini import SearchKey, FeaturesEnricher
from upgini.metadata import CVType, RuntimeParameters

## define search keys
search_keys = {
    "Date": SearchKey.DATE,
    "country": SearchKey.COUNTRY,
    "postal_code": SearchKey.POSTAL_CODE
}

## define X_train / y_train
X_train = df_prices.drop(columns=['Target'])
y_train = df_prices.Target

## define Features Enricher
features_enricher = FeaturesEnricher(
    search_keys = search_keys,
    cv = CVType.time_series
)

X_enriched = features_enricher.fit_transform(X_train, y_train, calculate_metrics=True)
As a result, you'll get a dataframe with new features that have non-zero feature importance on the target, such as temperature, wind speed, etc.
Web: https://upgini.com GitHub: https://github.com/upgini