I run a query in Python to get hourly price data from an API, using the requests get function:
result = (requests.get(url_prices, headers=headers, params={'SpotKey':'1','Fields':'hours','FromDate':'2016-05-05','ToDate':'2016-12-05','Currency':'eur','SortType':'ascending'}).json())
where 'SpotKey' identifies the item I want to retrieve from the API; in this example '1' is the hourly price time series (the other parameters are self-explanatory).
The result from the query is:
{'SpotKey': '1',
'SpotName': 'APX',
'Denomination': 'eur/mwh',
'Elements': [{'Date': '2016-05-05T00:00:00.0000000',
'TimeSpans': [{'TimeSpan': '00:00-01:00', 'Value': 23.69},
{'TimeSpan': '01:00-02:00', 'Value': 21.86},
{'TimeSpan': '02:00-03:00', 'Value': 21.26},
{'TimeSpan': '03:00-04:00', 'Value': 20.26},
{'TimeSpan': '04:00-05:00', 'Value': 19.79},
{'TimeSpan': '05:00-06:00', 'Value': 19.79},
...
{'TimeSpan': '19:00-20:00', 'Value': 57.52},
{'TimeSpan': '20:00-21:00', 'Value': 49.4},
{'TimeSpan': '21:00-22:00', 'Value': 42.23},
{'TimeSpan': '22:00-23:00', 'Value': 34.99},
{'TimeSpan': '23:00-24:00', 'Value': 33.51}]}]}
where 'Elements' is the relevant list containing the time series, structured as nested dictionaries with 'Date' and 'TimeSpans' keys.
Each 'TimeSpans' key contains a further list of dictionaries, one per hour of the day, with a 'TimeSpan' key for the hour and a 'Value' key for the price.
I would like to transform it to a dataframe like:
Datetime eur/mwh
2016-05-05 00:00:00 23.69
2016-05-05 01:00:00 21.86
2016-05-05 02:00:00 21.26
2016-05-05 03:00:00 20.26
2016-05-05 04:00:00 19.79
... ...
2016-12-05 19:00:00 57.52
2016-12-05 20:00:00 49.40
2016-12-05 21:00:00 42.23
2016-12-05 22:00:00 34.99
2016-12-05 23:00:00 33.51
For the time being I managed to do this with:
df = pd.concat([pd.DataFrame(x) for x in result['Elements']])
df['Date'] = pd.to_datetime(df['Date'] + ' ' + [x['TimeSpan'][:5] for x in df['TimeSpans']], errors='coerce')
df[result['Denomination']] = [x['Value'] for x in df['TimeSpans']]
df = df.set_index(df['Date'], drop=True).drop(columns=['Date','TimeSpans'])
df = df[~df.index.isnull()]
I added errors='coerce' and the final filter because, around daylight-saving-time changes, the 'TimeSpan' hourly value is replaced with a 'dts' string, which causes date-parsing errors when creating the datetime index.
Since I will request data very frequently, and potentially at different granularities (e.g. half-hourly), is there a better / quicker / standard way to shape so many nested dictionaries into a dataframe in the format I'm looking for, one that also avoids the date-parsing errors on daylight-saving-time changes?
thank you in advance, cheers.
You did not give examples of the dts values, so I cannot verify. But in principle, treating the Date as a timestamp and TimeSpan as a timedelta should give you both the ability to ignore granularity changes and potentially include additional 'dts' parsing.
def parse_time(x):
    # NOTE: the question reports a 'dts' placeholder on DST changes; adjust this check to the actual string
    if "dts" not in x:
        return x[:5] + ":00"
    return f"{int(x[:2])+1}{x[2:5]}:00"  # TODO: actually parse, handle hour overflow etc.

df = pd.DataFrame(result['Elements']).set_index("Date")
d2 = df.TimeSpans.explode().apply(pd.Series)
d2['Datetime'] = pd.to_datetime(d2.index) + pd.to_timedelta(d2.TimeSpan.apply(parse_time))
pd.DataFrame(d2.set_index(d2.Datetime).Value).rename(columns={"Value": "eur/mwh"})
which gives a dataframe in the requested format.
this should work:
df = pd.DataFrame()
cols = ['Datetime', 'eur/mwh']
# concat days together into one df
for day in result['Elements']:
    # chunk represents a day's worth of data to concat
    chunk = []
    date = pd.to_datetime(day['Date'])
    for pair in day['TimeSpans']:
        # the hour offset is just the first 2 characters of TimeSpan
        offset = pd.DateOffset(hours=int(pair['TimeSpan'][:2]))
        value = pair['Value']
        chunk.append([(date + offset), value])
    # concat the day chunk to df
    df = pd.concat([df, pd.DataFrame(chunk, columns=cols)])
The only thing I'm not 100% sure of is the pd.to_datetime() call, but if it doesn't work you just need to pass a format argument to it.
hope it helps :)
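As a minimal sketch of that fallback (the example string and format below are assumptions based on the date + hour string built in the question):
import pandas as pd

# hypothetical combined string; adjust the format to match whatever string you actually build
pd.to_datetime('2016-05-05 00:00', format='%Y-%m-%d %H:%M')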
Here's an example of the data I'm working with:
values variable.variableName timeZone
0 [{'value': [], turbidity PST
'qualifier': [],
'qualityControlLevel': [],
'method': [{
'methodDescription': '[TS087: YSI 6136]',
'methodID': 15009}],
'source': [],
'offset': [],
'sample': [],
'censorCode': []},
{'value': [{
'value': '17.2',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:30:00.000-08:00'},
{'value': '17.5',
'qualifiers': ['P'],
'dateTime': '2022-01-05T14:00:00.000-08:00'}
}]
1 [{'value': degC PST
[{'value': '9.3',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:30:00.000-08:00'},
{'value': '9.4',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:45:00.000-08:00'},
}]
I'm trying to break out each of the variables in the data into its own dataframe. What I have so far works; however, if there are multiple sets of values (as with turbidity), it only pulls in the first set, which is sometimes empty. How do I pull in all the value sets? Here's what I have so far:
import requests
import pandas as pd
url = ('https://waterservices.usgs.gov/nwis/iv?sites=11273400&period=P1D&format=json')
response = requests.get(url)
result = response.json()
json_list = result['value']['timeSeries']
df = pd.json_normalize(json_list)
new_df = df['values'].apply(lambda x: pd.DataFrame(x[0]['value']))
new_df.index = df['variable.variableName']
# print turbidity
print(new_df.loc['Turbidity, water, unfiltered, monochrome near infra-red LED light, '
                 '780-900 nm, detection angle 90 ±2.5°, formazin nephelometric units (FNU)'])
This outputs:
turbidity df
Empty DataFrame
Columns: []
Index: []
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Whereas I want my output to be something like:
turbidity df
value qualifiers dateTime
0 17.2 P 2022-01-05T12:30:00.000-08:00
1 17.5 P 2022-01-05T14:00:00.000-08:00
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Unfortunately, it only grabs the first value set, which in the case of turbidity is empty. How can I grab them all or check to see if the data frame is empty and grab the next one?
I believe the missing link here is DataFrame.explode() -- it allows you to split a single row that contains a list of values (your "values" column) into multiple rows.
You can then use
new_df = df.explode("values")
which will split the "turbidity" row into two.
You can then filter out rows whose "value" list is empty and apply .explode() once again on that column.
You can then use pd.json_normalize again to expand each dictionary of values into multiple columns, or look into Series.str.get() to extract a single element from a dict or list.
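A rough sketch of that chain, assuming df comes from the pd.json_normalize(json_list) call in the question (column names as shown there; details may need adjusting):
# one row per value-set, dropping anything produced from an empty list
exploded = df.explode('values').dropna(subset=['values'])
# keep only value-sets that actually contain readings
exploded = exploded[exploded['values'].apply(lambda v: len(v['value']) > 0)]
# one row per reading, carrying the variable name along
readings = pd.json_normalize(exploded.to_dict('records'),
                             record_path=['values', 'value'],
                             meta=['variable.variableName'])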
This JSON is deeply nested, so I think it takes a few steps to transform it into what you want.
# First, use json_normalize on top level to extract values and variableName.
df = pd.json_normalize(json_list, record_path=['values'], meta=[['variable', 'variableName']])
# Then explode the value to flatten the array and filter out any empty array
df = df.explode('value').dropna(subset=['value'])
# Another json_normalize on the exploded value to extract the value and qualifier and dateTime, concat with variableName.
# explode('qualifiers') is to take out wrapping array.
df = pd.concat([df[['variable.variableName']].reset_index(drop=True),
pd.json_normalize(df.value).explode('qualifiers')], axis=1)
The resulting dataframe should look like this.
variable.variableName value qualifiers dateTime
0 Temperature, water, °C 10.7 P 2022-01-06T12:15:00.000-08:00
1 Temperature, water, °C 10.7 P 2022-01-06T12:30:00.000-08:00
2 Temperature, water, °C 10.7 P 2022-01-06T12:45:00.000-08:00
3 Temperature, water, °C 10.8 P 2022-01-06T13:00:00.000-08:00
If you will do further data processing, it is probably better to keep everything in one dataframe, but if you really need separate dataframes you can split them out by filtering:
df_turbidity = df[df['variable.variableName'].str.startswith('Turbidity')]
I am working on some portfolio analysis and am trying to get a working function for pulling data for stocks, using a list of Ticker Symbols. Here is my list:
Ticker_List={'Tickers':['SPY', 'AAPL', 'TSLA', 'AMZN', 'BRK.B', 'DAL', 'EURN', 'AMD',
'NVDA', 'SPG', 'DIS', 'SBUX', 'MMP', 'USFD', 'CHEF', 'SYY',
'GOOGL', 'MSFT']}
I'm passing the list through this function like so:
Port=kit.d(Ticker_List)

def d(Ticker_List):
    x=[]
    for i in Ticker_List['Tickers']:
        x.append(Closing_price_alltime(i))
    return x

def Closing_price_alltime(Ticker):
    Closedf=td_client.get_price_history(Ticker, period_type='year', period=20, frequency_type='daily', frequency=1)
    return Closedf
Which pulls data from TDAmeritrade and gives me back:
[{'candles': [{'open': 147.46875,'high': 148.21875,
'low': 146.875,'close': 147.125,
'volume': 6998100,'datetime': 960181200000},
{'open': 146.625,'high': 147.78125,
'low': 145.90625,'close': 146.46875,
'volume': 4858900,'datetime': 960267600000},
...],
'symbol': 'MSFT',
'empty': False}]
(This is just a sample of course)
Finally, I'm cleaning up with:
Port=pd.DataFrame(Port)
Port=Port.drop(columns='empty')
Which gives the DataFrame:
candles symbol
0 [{'open': 147.46875, 'high': 148.21875, 'low': 146.875, 'close': 147.125, 'volume': 6998100, 'datetime': 960181200000}, {'open': 146.625, 'high': ...} SPY
1 [{'open': 3.33259, 'high': 3.401786, 'low': 3.203126, 'close': 3.261161, 'volume': 80917200, 'datetime': 960181200000}, {'open': 3.284599, 'high':...} AAPL
How can I get the close price out of the nested dictionary in each row and set it as the columns, with the ticker symbols (currently in their own column) as the headers of the closing-price columns? Also, how do I extract the datetime from each nested dictionary and set it as the index?
EDIT: More info
My original method of building this DataFrame was:
SPY_close=kit.Closing_price_alltime('SPY')
AAPL_close=kit.Closing_price_alltime('AAPL')
TSLA_close=kit.Closing_price_alltime('TSLA')
AMZN_close=kit.Closing_price_alltime('AMZN')
BRKB_close=kit.Closing_price_alltime('BRK.B')
DAL_close=kit.Closing_price_alltime('DAL')
EURN_close=kit.Closing_price_alltime('EURN')
AMD_close=kit.Closing_price_alltime('AMD')
NVDA_close=kit.Closing_price_alltime('NVDA')
SPG_close=kit.Closing_price_alltime('SPG')
DIS_close=kit.Closing_price_alltime('DIS')
SBUX_close=kit.Closing_price_alltime('SBUX')
MMP_close=kit.Closing_price_alltime('MMP')
USFD_close=kit.Closing_price_alltime('USFD')
CHEF_close=kit.Closing_price_alltime('CHEF')
SYY_close=kit.Closing_price_alltime('SYY')
GOOGL_close=kit.Closing_price_alltime('GOOGL')
MSFT_close=kit.Closing_price_alltime('MSFT')
def Closing_price_alltime(Ticker):
    """
    Gets Closing Price for Past 20 Years w/ Daily Intervals
    and Formats it to correct Date and single 'Closing Price'
    column.
    """
    Raw_close=td_client.get_price_history(Ticker,
        period_type='year', period=20, frequency_type='daily', frequency=1)
    #Closedf = pd.DataFrame(Raw_close['candles']).set_index('datetime')
    #Closedf=pd.DataFrame.drop(Closedf, columns=['open', 'high', 'low', 'volume'])
    #Closedf.index = pd.to_datetime(Closedf.index, unit='ms')
    #Closedf.index.names=['Date']
    #Closedf.columns=[f'{Ticker} Close']
    #Closedf=Closedf.dropna()
    return Closedf
SPY_pct=kit.pct_change(SPY_close)
AAPL_pct=kit.pct_change(AAPL_close)
TSLA_pct=kit.pct_change(TSLA_close)
AMZN_pct=kit.pct_change(AMZN_close)
BRKB_pct=kit.pct_change(BRKB_close)
DAL_pct=kit.pct_change(DAL_close)
EURN_pct=kit.pct_change(EURN_close)
AMD_pct=kit.pct_change(AMD_close)
NVDA_pct=kit.pct_change(NVDA_close)
SPG_pct=kit.pct_change(SPG_close)
DIS_pct=kit.pct_change(DIS_close)
SBUX_pct=kit.pct_change(SBUX_close)
MMP_pct=kit.pct_change(MMP_close)
USFD_pct=kit.pct_change(USFD_close)
CHEF_pct=kit.pct_change(CHEF_close)
SYY_pct=kit.pct_change(SYY_close)
GOOGL_pct=kit.pct_change(GOOGL_close)
MSFT_pct=kit.pct_change(MSFT_close)
def pct_change(Ticker_ClosingValues):
    """
    Takes Closing Values and Finds Percent Change.
    Closing Value Column must be named 'Closing Price'.
    """
    return_pct=Ticker_ClosingValues.pct_change()
    return_pct=return_pct.dropna()
    return return_pct
Portfolio_hist_rets=[SPY_pct, AAPL_pct, TSLA_pct, AMZN_pct,
BRKB_pct, DAL_pct, EURN_pct, AMD_pct,
NVDA_pct, SPG_pct, DIS_pct, SBUX_pct,
MMP_pct, USFD_pct, CHEF_pct, SYY_pct,
GOOGL_pct, MSFT_pct]
Which returned exactly what I wanted:
SPY Close AAPL Close TSLA Close AMZN Close BRK.B Close
Date
2000-06-06 05:00:00 -0.004460 0.017111 NaN -0.072248 -0.002060
2000-06-07 05:00:00 0.006934 0.039704 NaN 0.024722 0.013416
2000-06-08 05:00:00 -0.003920 -0.018123 NaN 0.001206 -0.004073
This method is obviously much less efficient than just using a for loop to create a DataFrame from a list of tickers.
In short, I'm asking what changes can be made to my new code (above the edit) to achieve the same end result as my old code (below the edit): a well-formatted and labeled DataFrame.
Closing_price_alltime return value:
d = [{'candles': [{'open': 147.46875,'high': 148.21875,
'low': 146.875,'close': 147.125,
'volume': 6998100,'datetime': 960181200000},
{'open': 146.625,'high': 147.78125,
'low': 145.90625,'close': 146.46875,
'volume': 4858900,'datetime': 960267600000}
],
'symbol': 'MSFT',
'empty': False}]
You could extract the symbol, datetime and closing price like this.
import operator
import pandas as pd
data = operator.itemgetter('datetime','close')
symbol = d[0]['symbol']
candles = d[0]['candles']
dt, closing = zip(*map(data, candles))
# for loop equivalent to zip(*map...)
#dt = []
#closing = []
#for candle in candles:
# dt.append(candle['datetime'])
# closing.append(candle['close'])
s = pd.Series(data=closing,index=dt,name=symbol)
This will create a DataFrame of closing prices for each symbol in the list.
results = []
for ticker in Ticker_List['Tickers']:
    d = Closing_price_alltime(ticker)
    symbol = d[0]['symbol']
    candles = d[0]['candles']
    dt, closing = zip(*map(data, candles))
    results.append(pd.Series(data=closing,index=dt,name=symbol))
df = pd.concat(results, axis=1)
From there, see pandas.DataFrame.pct_change to compute returns.
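A minimal sketch of that last step, assuming df is the concatenated frame from the snippet above and that its index is still in epoch milliseconds (as in the sample data):
# convert the epoch-millisecond index to datetimes, then compute period-over-period returns
df.index = pd.to_datetime(df.index, unit='ms')
returns = df.pct_change().dropna()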
This is the final function I wrote which accomplishes my goal:
def Port_consol(Ticker_List):
    """
    Consolidates Ticker Symbol Returns and Returns
    a Single Portfolio
    """
    Port=[]
    Port_=[]
    for i in Ticker_List['Tickers']:
        Port.append(Closing_price_alltime(i))
    for i in range(len(Port)):
        data = operator.itemgetter('datetime','close')
        symbol = Port[i]['symbol']
        candles = Port[i]['candles']
        dt, closing = zip(*map(data, candles))
        s = pd.Series(data=closing,index=dt,name=symbol)
        s=pd.DataFrame(s)
        s.index = pd.to_datetime(s.index, unit='ms')
        Port_.append(s)
    Portfolio=pd.concat(Port_, axis=1, sort=False)
    return Portfolio
I can now pass a list of tickers to this function; the data is pulled from TDAmeritrade's API (using the Python package td-ameritrade-python-api), and a DataFrame is formed with historical closing prices for the stocks whose tickers I pass in.
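For example, with the Ticker_List from the top of the question:
Portfolio = Port_consol(Ticker_List)
print(Portfolio.head())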
I have JSON output from the m3inference package in Python like this:
{'input': {'description': 'Bundeskanzlerin',
'id': '2631881902',
'img_path': '/root/m3/cache/angelamerkeicdu_224x224.jpg',
'lang': 'de',
'name': 'Angela Merkel',
'screen_name': 'angelamerkeicdu'},
'output': {'age': {'19-29': 0.0,
'30-39': 0.0001,
'<=18': 0.0001,
'>=40': 0.9998},
'gender': {'female': 0.9991, 'male': 0.0009},
'org': {'is-org': 0.0032, 'non-org': 0.9968}}}
I store it in:
org = pd.DataFrame.from_dict(json_normalize(org['output']), orient='columns')
gender.male gender.female age.<=18 ... age.>=40 org.non-org org.is-org
0 0.0009 0.9991 0.0000 ... 0.9998 0.9968 0.0032
I don't know where the 0 value in the first column is coming from. I save the org.is-org column to isorg:
isorg = org['org.is-org']
but when I append it to a pandas dataframe the dtype is object and the value shows up as
0 0.0032 Name: org.is-org, dtype: float64
rather than just 0.0032.
How do I fix this?
"i dont know where 0 value in first column coming from then i save org.isorg column to isorg"
That "0" is an index to your dataframe. Unless you specify your dataframe index, pandas will auto create the index. You can change you index instead.
code example:
org.set_index('gender.male', inplace=True)
The index is like an address for your data: it is how any data point across the dataframe or series can be accessed.
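For example, a minimal sketch assuming the org dataframe from the question: you can look up the single row through its auto-created index label and get the scalar back without the Series metadata.
isorg = org.loc[0, 'org.is-org']  # plain float 0.0032, addressed via index label 0 and column name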
I have a Dataset structured like this:
"Date","Time","Open","High","Low","Close","Up","Down","Volume"
01/03/2000,00:05,1481.50,1481.50,1481.00,1481.00,2,0,0.00
01/03/2000,00:10,1480.75,1480.75,1480.75,1480.75,1,0,1.00
01/03/2000,00:20,1480.50,1480.50,1480.50,1480.50,1,0,1.00
[...]
03/01/2018,11:05,2717.25,2718.00,2708.50,2709.25,9935,15371,25306.00
03/01/2018,11:10,2709.25,2711.75,2706.50,2709.50,8388,8234,16622.00
03/01/2018,11:15,2709.25,2711.50,2708.25,2709.50,4738,4703,9441.00
03/01/2018,11:20,2709.25,2709.50,2706.00,2707.25,3609,4685,8294.00
I read this file in this way:
rows = pd.read_csv("Datasets/myfile.txt")
I want to get this information with pandas: for each day (so grouped day by day), the first value of "Open", the last value of "Close", the highest value of "High", the lowest value of "Low", and the sum of "Volume".
I know how to do it with a for loop, but that is very inefficient. Is it possible to do it in a few lines with Pandas?
Thanks
Use groupby and agg:
df.groupby('Date').agg({
'Close': 'last',
'Open': 'first',
'High': 'max',
'Low': 'min',
'Volume': 'sum'
})
Output:
Close Open High Low Volume
Date
01/03/2000 1480.50 1481.50 1481.5 1480.5 2.0
03/01/2018 2707.25 2717.25 2718.0 2706.0 59663.0