Python: API request nested dictionaries to dataframe with datetime indexed values

I run a query in Python to get hourly price data from an API, using the get function:
result = (requests.get(url_prices, headers=headers, params={'SpotKey':'1','Fields':'hours','FromDate':'2016-05-05','ToDate':'2016-12-05','Currency':'eur','SortType':'ascending'}).json())
where 'SpotKey' identifies the item I want to retrieve from the API; in this example '1' is the hourly price timeseries (the other parameters are self-explanatory).
The result from the query is:
{'SpotKey': '1',
'SpotName': 'APX',
'Denomination': 'eur/mwh',
'Elements': [{'Date': '2016-05-05T00:00:00.0000000',
'TimeSpans': [{'TimeSpan': '00:00-01:00', 'Value': 23.69},
{'TimeSpan': '01:00-02:00', 'Value': 21.86},
{'TimeSpan': '02:00-03:00', 'Value': 21.26},
{'TimeSpan': '03:00-04:00', 'Value': 20.26},
{'TimeSpan': '04:00-05:00', 'Value': 19.79},
{'TimeSpan': '05:00-06:00', 'Value': 19.79},
...
{'TimeSpan': '19:00-20:00', 'Value': 57.52},
{'TimeSpan': '20:00-21:00', 'Value': 49.4},
{'TimeSpan': '21:00-22:00', 'Value': 42.23},
{'TimeSpan': '22:00-23:00', 'Value': 34.99},
{'TimeSpan': '23:00-24:00', 'Value': 33.51}]}]}
where 'Elements' is the relevant list containing the timeseries, structured as nested dictionaries with 'Date' and 'TimeSpans' keys.
Each 'TimeSpans' key contains a further list of dictionaries, one per hour of the day, with a 'TimeSpan' key for the hour and a 'Value' key for the price.
I would like to transform it to a dataframe like:
Datetime eur/mwh
2016-05-05 00:00:00 23.69
2016-05-05 01:00:00 21.86
2016-05-05 02:00:00 21.26
2016-05-05 03:00:00 20.26
2016-05-05 04:00:00 19.79
... ...
2016-12-05 19:00:00 57.52
2016-12-05 20:00:00 49.40
2016-12-05 21:00:00 42.23
2016-12-05 22:00:00 34.99
2016-12-05 23:00:00 33.51
For the time being I managed to do so with:
df = pd.concat([pd.DataFrame(x) for x in result['Elements']])
df['Date'] = pd.to_datetime(df['Date'] + ' ' + [x['TimeSpan'][:5] for x in df['TimeSpans']], errors='coerce')
df[result['Denomination']] = [x['Value'] for x in df['TimeSpans']]
df = df.set_index(df['Date'], drop=True).drop(columns=['Date','TimeSpans'])
df = df[~df.index.isnull()]
I did so because on daylight-saving-time changes the 'TimeSpan' hourly values are replaced with a 'dst' string, giving date-parsing errors when creating the datetime index.
Since I will request data very frequently and potentially at different granularities (e.g. half-hourly), is there a better / quicker / standard way to shape so many nested dictionaries into a dataframe with the format I am looking for, one that also avoids the date-parsing error on daylight-saving-time changes?
thank you in advance, cheers.

You did not give examples of the dst entries, so I cannot verify. But in principle, treating the Date as a timestamp and the TimeSpan as a timedelta should give you both the ability to ignore granularity changes and potentially include additional 'dst' parsing.
import pandas as pd

def parse_time(x):
    # regular spans: take the start hour "HH:MM" and extend it to "HH:MM:SS"
    if "dst" not in x:
        return x[:5] + ":00"
    # crude DST handling: shift the start hour by one
    return f"{int(x[:2])+1}{x[2:5]}:00"  # TODO actually parse, handle hour overflow etc.

df = pd.DataFrame(result['Elements']).set_index("Date")
d2 = df.TimeSpans.explode().apply(pd.Series)
d2['Datetime'] = pd.to_datetime(d2.index) + pd.to_timedelta(d2.TimeSpan.apply(parse_time))
pd.DataFrame(d2.set_index(d2.Datetime).Value).rename(columns={"Value": "eur/mwh"})
This gives a dataframe with a Datetime index and a single eur/mwh column, matching the desired format.
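Not from either answer, but since the question asks for a standard way: a minimal sketch using pd.json_normalize (assuming the result structure shown above), which keeps the OP's errors='coerce' handling of the daylight-saving 'dst' rows:
import pandas as pd

# one row per TimeSpan entry, keeping the parent Date via meta
flat = pd.json_normalize(result['Elements'], record_path='TimeSpans', meta='Date')
# build the timestamp from the date part plus the start hour of each span; 'dst' rows become NaT
flat['Datetime'] = pd.to_datetime(flat['Date'].str[:10] + ' ' + flat['TimeSpan'].str[:5], errors='coerce')
flat = flat.dropna(subset=['Datetime']).set_index('Datetime')[['Value']]
flat.columns = [result['Denomination']]   # e.g. 'eur/mwh'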

this should work:
import pandas as pd

df = pd.DataFrame()
cols = ['Datetime', 'eur/mwh']
# concat days together into one df
for day in result['Elements']:
    # chunk represents a day's worth of data to concat
    chunk = []
    date = pd.to_datetime(day['Date'])
    for pair in day['TimeSpans']:
        # the hour offset is just the first 2 characters of TimeSpan
        offset = pd.DateOffset(hours=int(pair['TimeSpan'][:2]))
        value = pair['Value']
        chunk.append([(date + offset), value])
    # concat the day-chunk to df
    df = pd.concat([df, pd.DataFrame(chunk, columns=cols)])
The only thing I'm not 100% sure of is the pd.to_datetime(), but if it doesn't work you just need to pass a format argument to it.
hope it helps :)

Related

How to access nested data in a pandas dataframe?

Here's an example of the data I'm working with:
values variable.variableName timeZone
0 [{'value': [], turbidity PST
'qualifier': [],
'qualityControlLevel': [],
'method': [{
'methodDescription': '[TS087: YSI 6136]',
'methodID': 15009}],
'source': [],
'offset': [],
'sample': [],
'censorCode': []},
{'value': [{
'value': '17.2',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:30:00.000-08:00'},
{'value': '17.5',
'qualifiers': ['P'],
'dateTime': '2022-01-05T14:00:00.000-08:00'}
}]
1 [{'value': degC PST
[{'value': '9.3',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:30:00.000-08:00'},
{'value': '9.4',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:45:00.000-08:00'},
}]
I'm trying to break each of the variables in the data out into its own dataframe. What I have so far works; however, if there are multiple sets of values (like in turbidity), it only pulls in the first set, which is sometimes empty. How do I pull in all the value sets? Here's what I have so far:
import requests
import pandas as pd
url = ('https://waterservices.usgs.gov/nwis/iv?sites=11273400&period=P1D&format=json')
response = requests.get(url)
result = response.json()
json_list = result['value']['timeSeries']
df = pd.json_normalize(json_list)
new_df = df['values'].apply(lambda x: pd.DataFrame(x[0]['value']))
new_df.index = df['variable.variableName']
# print turbidity
print(new_df.loc['Turbidity, water, unfiltered, monochrome near infra-red LED light, '
                 '780-900 nm, detection angle 90 ±2.5°, formazin nephelometric units (FNU)'])
This outputs:
turbidity df
Empty DataFrame
Columns: []
Index: []
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Whereas I want my output to be something like:
turbidity df
value qualifiers dateTime
0 17.2 P 2022-01-05T12:30:00.000-08:00
1 17.5 P 2022-01-05T14:00:00.000-08:00
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Unfortunately, it only grabs the first value set, which in the case of turbidity is empty. How can I grab them all or check to see if the data frame is empty and grab the next one?
I believe the missing link here is DataFrame.explode() -- it allows you to split a single row that contains a list of values (your "values" column) into multiple rows.
You can then use
new_df = df.explode("values")
which will split the "turbidity" row into two.
You can then filter rows with empty "value" dictionaries and apply .explode() once again.
You can then also use pd.json_normalize again to expand a dictionary of values into multiple columns, or also look into Series.str.get() to extract a single element from a dict or list.
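Putting those steps together, a rough sketch (assuming df = pd.json_normalize(json_list) from the question, and that each entry of the "values" lists is a dict with a 'value' list as shown above):
import pandas as pd

# one row per entry in the outer "values" list
step1 = df.explode('values')
# keep only entries whose inner 'value' list is non-empty (skips the empty turbidity block)
step1 = step1[step1['values'].apply(lambda v: isinstance(v, dict) and len(v.get('value', [])) > 0)]
# one row per individual reading
step2 = step1.assign(reading=step1['values'].str.get('value')).explode('reading')
# expand each reading dict into value / qualifiers / dateTime columns, keeping the variable name
readings = pd.concat(
    [step2[['variable.variableName']].reset_index(drop=True),
     pd.json_normalize(step2['reading'].tolist())],
    axis=1)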
This JSON is deeply nested, so I think it takes a few steps to transform it into what you want.
# First, use json_normalize on top level to extract values and variableName.
df = pd.json_normalize(json_list, record_path=['values'], meta=[['variable', 'variableName']])
# Then explode the value to flatten the array and filter out any empty array
df = df.explode('value').dropna(subset=['value'])
# Another json_normalize on the exploded value to extract the value, qualifier and dateTime, concat with variableName.
# explode('qualifiers') is to take out the wrapping array.
df = pd.concat([df[['variable.variableName']].reset_index(drop=True),
                pd.json_normalize(df.value).explode('qualifiers')], axis=1)
The resulting dataframe should look like this.
variable.variableName value qualifiers dateTime
0 Temperature, water, °C 10.7 P 2022-01-06T12:15:00.000-08:00
1 Temperature, water, °C 10.7 P 2022-01-06T12:30:00.000-08:00
2 Temperature, water, °C 10.7 P 2022-01-06T12:45:00.000-08:00
3 Temperature, water, °C 10.8 P 2022-01-06T13:00:00.000-08:00
If you will do further data processing, it is probably better to keep everything in one dataframe, but if you really need separate dataframes, split them out by filtering:
df_turbidity = df[df['variable.variableName'].str.startswith('Turbidity')]

JSON to Pandas Dataframe types change

I have JSON output from m3inference package in python like this:
{'input': {'description': 'Bundeskanzlerin',
'id': '2631881902',
'img_path': '/root/m3/cache/angelamerkeicdu_224x224.jpg',
'lang': 'de',
'name': 'Angela Merkel',
'screen_name': 'angelamerkeicdu'},
'output': {'age': {'19-29': 0.0,
'30-39': 0.0001,
'<=18': 0.0001,
'>=40': 0.9998},
'gender': {'female': 0.9991, 'male': 0.0009},
'org': {'is-org': 0.0032, 'non-org': 0.9968}}}
I store it in:
org = pd.DataFrame.from_dict(json_normalize(org['output']), orient='columns')
gender.male gender.female age.<=18 ... age.>=40 org.non-org org.is-org
0 0.0009 0.9991 0.0000 ... 0.9998 0.9968 0.0032
I don't know where the 0 value in the first column is coming from. I save the org.is-org column to isorg:
isorg = org['org.is-org']
but when I append it to a pandas dataframe the dtype is object, and the value is shown as
0    0.0032
Name: org.is-org, dtype: float64
rather than just 0.0032.
How to fix this?
"i dont know where 0 value in first column coming from then i save org.isorg column to isorg"
That "0" is an index to your dataframe. Unless you specify your dataframe index, pandas will auto create the index. You can change you index instead.
code example:
org.set_index('gender.male', inplace=True)
Index is like an address to your data. It is how any data point across the dataframe or series can be accessed.
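A small sketch of what this answer describes, using the same normalized frame (data here is a hypothetical name for the m3inference JSON shown above; the .iloc[0] part is my own addition, for when you want the bare number rather than a one-row Series):
import pandas as pd

org = pd.json_normalize(data['output'])   # data holds the JSON from the question
print(org.index)                          # RangeIndex(start=0, stop=1, step=1) -- the auto-created "0"

isorg = org['org.is-org']                 # a one-row Series, hence the printed index, name and dtype
print(isorg.iloc[0])                      # 0.0032 -- just the float value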

Efficiently calculating remaining useful lifetime with pandas

I have a pandas dataframe that contains multiple rows with a datetime and a sensor value. My goal is to add a column that calculates the days until the sensor value will exceed the threshold the next time.
For instance, for the data <2019-01-05 11:00:00, 200>, <2019-01-06 12:00:00, 250>, <2019-01-07 13:00:00, 300> I would want the additional column to look like [1 day, 0 days, 0 days] for thresholds between 200 and 250 and [2 days, 1 day, 0 days] for thresholds between 250 and 300.
I tried subsampling the dataframe with df_sub = df[df[sensor_value] >= threshold], iterating over both dataframes and calculating the next timestamp in df_sub given the current timestamp in df. However, this solution seems to be very inefficient, and I think that pandas might have an optimized way of calculating what I need.
In the following example code, I tried what I described above.
import pandas as pd
data = [{'time': '2019-01-05 11:00:00', 'sensor_value' : 200},
{'time': '2019-01-05 14:37:52', 'sensor_value' : 220},
{'time': '2019-01-05 17:55:12', 'sensor_value' : 235},
{'time': '2019-01-06 12:00:00', 'sensor_value' : 250},
{'time': '2019-01-07 13:00:00', 'sensor_value' : 300},
{'time': '2019-01-08 14:00:00', 'sensor_value' : 250},
{'time': '2019-01-09 15:00:00', 'sensor_value' : 320}]
df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'])
def calc_rul(df, threshold):
    # calculate all datetimes where the threshold is exceeded
    df_sub = sorted(df[df['sensor_value'] >= threshold]['time'].tolist())
    # variable to store all days
    remaining_days = []
    for v1 in df['time'].tolist():
        for v2 in df_sub:
            # if the exceeding date is the first one in the future, calculate the day difference
            if v2 > v1:
                remaining_days.append((v2 - v1).days)
                break
            elif v2 == v1:
                remaining_days.append(0)
                break
    df['RUL'] = pd.Series(remaining_days)
calc_rul(df, 300)
Expected output (output of the above sample):
                 time  sensor_value  RUL
0 2019-01-05 11:00:00           200    2
1 2019-01-05 14:37:52           220    1
2 2019-01-05 17:55:12           235    1
3 2019-01-06 12:00:00           250    1
4 2019-01-07 13:00:00           300    0
5 2019-01-08 14:00:00           250    1
6 2019-01-09 15:00:00           320    0
Here's what I would do for one threshold:
import numpy as np
import pandas as pd

def calc_rul(df, thresh):
    # mark all the rows whose sensor value reaches the threshold
    markers = df.sensor_value.ge(thresh)
    # copy the dates of those rows
    df['last_day'] = np.nan
    df.loc[markers, 'last_day'] = df.time
    # back fill those dates
    df['last_day'] = df['last_day'].bfill().astype('datetime64[ns]')
    df['RUL'] = (df.last_day - df.time).dt.days
    # drop the helper column if necessary,
    # remove this line to better see how the code works
    df.drop('last_day', axis=1, inplace=True)

calc_rul(df, 300)
Instead of splitting the dataframe, you can use .loc, which allows you to filter and iterate through your thresholds the same way:
df['RUL'] = '[2 days, 1 day, 0 days]'
for threshold in threshold_list:
    df.loc[df['sensor_value'] > <your_rule>, 'RUL'] = '[1 day, 0 days, 0 days]'
This technique avoids splitting the dataframe.
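A rough sketch of that loop idea, combined with the backfill approach from the answer above (the threshold values here are just examples):
import pandas as pd

def rul_for_threshold(df, thresh):
    # timestamp of the next row (including the current one) where the threshold is reached
    next_hit = df['time'].where(df['sensor_value'].ge(thresh)).bfill()
    return (next_hit - df['time']).dt.days

# one RUL column per threshold of interest
for threshold in [250, 300]:
    df[f'RUL_{threshold}'] = rul_for_threshold(df, threshold)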

Pandas OHLCV to JSON format

I have real time data that I resample with pandas in order to get OHLCV data:
ohlcv = df.resample(_period).agg({'bid': 'ohlc', 'volume': 'sum'})
The dataframe looks like this:
volume bid
volume open high low close
timestamp
2016-09-01 300.0 77.644997 78.320331 77.638 78.320331
and the JSON output using ohlcv.to_json(orient='index') is:
{"1472688000000":{"["volume","volume"]":300.0,"["bid","open"]":77.644997,"["bid","high"]":78.320331,"["bid","low"]":77.638,"["bid","close"]":78.320331}}
How can I convert the dataframe in the following JSON:
{
"timestamp":1472688000000,
"open":77.644997,
"high":78.320331,
"close":78.320331,
"low":77.638,
"volume":300
}
Use MultiIndex.droplevel to convert the MultiIndex columns to flat columns:
ohlcv = df.resample(_period).agg({'bid': 'ohlc', 'volume': 'sum'})
ohlcv.columns = ohlcv.columns.droplevel(0)
ohlcv.to_json(orient='index')
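If you also want the timestamp as a key inside each record, as in the desired output, one possible follow-up (my assumption, not part of the answer) is to reset the index after the droplevel step and use orient='records'; to_json serializes datetimes as epoch milliseconds by default:
ohlcv = ohlcv.reset_index()
print(ohlcv.to_json(orient='records'))
# [{"timestamp":1472688000000,"volume":300.0,"open":77.644997,"high":78.320331,"low":77.638,"close":78.320331}]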

How to parse a Timestamp() with Python?

I am iterating over a dictionary that contains data from a SQL database and I want to count the number of times that user values appear between initial_date and ending_date; however, I am having some problems when I try to parse the Timestamp values. This is the code I have:
initial_date = datetime(2017, 9, 1, 0, 0, 0)
ending_date = datetime(2017, 9, 30, 0, 0, 0)
Here is a sample of the dictionaries I got:
sample = [{'id': 100008222, 'sector name': 'BONGOX', 'site name': 'BONGO', 'region': 'EMEA',
           'open date': Timestamp('2017-09-11 00:00:00'), 'mtti': '16', 'mttr': '1', 'mttc': '2', 'user': 'John D.'},
          {'id': 100008234, 'sector name': 'BONGOY', 'site name': 'BONGO', 'region': 'EMEA',
           'open date': Timestamp('2017-09-09 12:05:00'), 'mtti': '1', 'mttr': '14', 'mttc': '7', 'user': 'John D.'},
          {'id': 101108234, 'sector name': 'BONGOA', 'site name': 'BONGO', 'region': 'EMEA',
           'open date': Timestamp('2017-09-01 10:00:00'), 'mtti': '1', 'mttr': '12', 'mttc': '1', 'user': 'John C.'},
          {'id': 101108254, 'sector name': 'BONGOB', 'site name': 'BONGO', 'region': 'EMEA',
           'open date': Timestamp('2017-09-02 20:00:00'), 'mtti': '2', 'mttr': '19', 'mttc': '73', 'user': 'John C.'}]
This is the code that I use to count the number of times user values appear between initial_date and ending_date
from datetime import time, datetime
from collections import Counter
#This approach does not work
Counter([li['user'] for li in sample if initial_date < dateutil.parser.parse(time.strptime(str(li.get(
'open date'),"%Y-%m-%d %H:%M:%S") < ending_date])
The code above does not work; it fails with the error decoding to str: need a bytes-like object, Timestamp found.
I have two questions:
How can I parse this Timestamp value that I encountered in these dictionaries?
I read in the post Why Counter is slow that collections.Counter is slow compared to other approaches for counting how many times an item appears. If I want to avoid using collections.Counter, how can I achieve my desired result of counting the number of times user values appear between these dates?
use Timestamp.to_pydatetime() to convert to a datetime object
Question: How can I parse this Timestamp value that I encountered in these dictionaries?
Using the Timestamp class from pandas:
from pandas import Timestamp
Using Counter()
# Initialize a Counter() object
c = Counter()
# Iterate the data
for s in sample:
    # Get a datetime from the Timestamp
    dt = s['open date'].to_pydatetime()
    # Compare with ending_date
    if dt < ending_date:
        print('{} < {}'.format(dt, ending_date))
        # Increment the Counter key=s['user']
        c[s['user']] += 1
print(c)
Output:
2017-09-11 00:00:00 < 2017-09-30 00:00:00
2017-09-09 12:05:00 < 2017-09-30 00:00:00
2017-09-01 10:00:00 < 2017-09-30 00:00:00
2017-09-02 20:00:00 < 2017-09-30 00:00:00
Counter({'John C.': 2, 'John D.': 2})
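If the lower bound should be enforced as well (the question asks for counts between initial_date and ending_date), a minimal variant of the same loop, assuming the same sample list and dates:
c = Counter(
    s['user']
    for s in sample
    if initial_date <= s['open date'].to_pydatetime() < ending_date
)
print(c)  # for this sample the counts are unchanged: 2 for each user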
Question: If I want to avoid using collections.Counter, how can I achieve my desired result of counting?
Without Counter()
# Initialize a dict object
c = {}
# Iterate the data
for s in sample:
    # Get a datetime from the Timestamp
    dt = s['open date'].to_pydatetime()
    # Compare with ending_date
    if dt < ending_date:
        # Add key=s['user'] to the dict if it does not exist
        c.setdefault(s['user'], 0)
        # Increment the dict key=s['user']
        c[s['user']] += 1
print(c)
Output:
{'John D.': 2, 'John C.': 2}
Tested with Python: 3.4.2
