I have the following API request, whose response I then clean and sort:
import requests
import pandas as pd

Base_URL = "https://api.jao.eu/OWSMP/getauctions?"
headers = {
    "AUTH_API_KEY": "api_key"
}
params = {
    "horizon": "Daily",
    "corridor": "IF1-FR-GB",
    "fromdate": "2021-01-01"
}
data = "results"
r = requests.get(Base_URL, headers=headers, params=params, json=data)
j = r.json()
df = pd.DataFrame.from_dict(j)
df = df.explode('results')
df = df.join(pd.json_normalize(df.pop('results')).add_suffix('_new'))
df.drop(['ftroption','identification','horizonName','periodToBeSecuredStart','periodToBeSecuredStop','bidGateOpening','bidGateClosure','isBidGateOpen','atcGateOpening','atcGateClosure','marketPeriodStop','disputeSubmissionGateOpening','disputeSubmissionGateClosure','disputeProcessGateOpening','disputeProcessGateClosure','ltResaleGateOpening','ltResaleGateClosure','maintenances','xnRule','winningParties','operationalMessage','products','lastDataUpdate','cancelled','comment_new','corridorCode_new','productIdentification_new','additionalMessage_new'], axis=1, inplace=True)
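# (an aside: passing errors='ignore' to .drop() would skip any listed column
#  that happens to be missing from a particular month's response)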
df
I then sort it by the date column. This is why it is important to be able to run it for every month: I need to repeat this process and hope to automate it in the future:
df['new'] = pd.to_datetime(df['marketPeriodStart'])
df = df.sort_values(by='new', ascending=True)  # sort on real datetimes, not on dd/mm/yyyy strings
df['new'] = df['new'].dt.strftime('%d/%m/%Y')
df
As the API only returns one-month periods, I am trying to loop through it, changing the "fromdate" param for every month. I could then change the "corridor" param and repeat the whole process. Thank you!
Get all data:
import pandas as pd
import requests

Base_URL = "https://api.jao.eu/OWSMP/getauctions?"
headers = {
    "AUTH_API_KEY": "api_key"
}
frames = []  # each month's data will be collected here
errors = []

# create month starts like 2021-01-01, 2021-02-01, ..., 2022-10-01
year = ['2021', '2022']
month = list(range(1, 13))
dates = []
for i in year:
    for j in month:
        if i == '2022' and j in [11, 12]:
            continue  # months with no data yet
        dates.append(i + '-' + f'{j:02}' + '-01')
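# (an aside: the same month starts can be built in one line with pandas itself —
#  dates = pd.date_range('2021-01-01', '2022-10-01', freq='MS').strftime('%Y-%m-%d').tolist())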
# dates are ready. Request each date and collect the data.
for i in dates:
    params = {
        "horizon": "Daily",
        "corridor": "IF1-FR-GB",
        "fromdate": i
    }
    data = "results"
    r = requests.get(Base_URL, headers=headers, params=params, json=data)
    j = r.json()
    try:
        frames.append(pd.DataFrame.from_dict(j))
    except Exception:
        errors.append(j)

# concatenate once at the end (DataFrame.append was removed in pandas 2.0)
final_df = pd.concat(frames, ignore_index=True)
# now apply the same cleaning to the combined data.
final_df = final_df.explode('results')
final_df = final_df.join(pd.json_normalize(final_df.pop('results')).add_suffix('_new'))
final_df.drop(['ftroption','identification','horizonName','periodToBeSecuredStart','periodToBeSecuredStop','bidGateOpening','bidGateClosure','isBidGateOpen','atcGateOpening','atcGateClosure','marketPeriodStop','disputeSubmissionGateOpening','disputeSubmissionGateClosure','disputeProcessGateOpening','disputeProcessGateClosure','ltResaleGateOpening','ltResaleGateClosure','maintenances','xnRule','winningParties','operationalMessage','products','lastDataUpdate','cancelled','comment_new','corridorCode_new','productIdentification_new','additionalMessage_new'], axis=1, inplace=True)
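The question also mentions repeating this for other corridors. A minimal sketch of that outer loop, reusing the dates, Base_URL and headers defined above and assuming a hypothetical corridor list (only "IF1-FR-GB" comes from the question):
import pandas as pd
import requests

corridors = ["IF1-FR-GB", "IF2-FR-GB"]  # hypothetical list; replace with the codes you need
frames, errors = [], []
for corridor in corridors:
    for fromdate in dates:  # the month starts built above
        params = {
            "horizon": "Daily",
            "corridor": corridor,
            "fromdate": fromdate
        }
        r = requests.get(Base_URL, headers=headers, params=params, json="results")
        try:
            frames.append(pd.DataFrame.from_dict(r.json()))
        except Exception:
            errors.append(r.json())
final_df = pd.concat(frames, ignore_index=True)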
After you have all the historical data, you can fetch new data automatically every month by scheduling the script to run on the first day of each month (for a different day, change the days value in timedelta).
import pandas as pd
import requests
from datetime import datetime, timedelta

Base_URL = "https://api.jao.eu/OWSMP/getauctions?"
headers = {
    "AUTH_API_KEY": "api_key"
}
# two days back, clamped to the first of that month
now = (datetime.today() - timedelta(days=2)).strftime('%Y-%m-01')
params = {
    "horizon": "Daily",
    "corridor": "IF1-FR-GB",
    "fromdate": now
}
data = "results"
r = requests.get(Base_URL, headers=headers, params=params, json=data)
j = r.json()
df = pd.DataFrame.from_dict(j)
df = df.explode('results')
df = df.join(pd.json_normalize(df.pop('results')).add_suffix('_new'))
df.drop(['ftroption','identification','horizonName','periodToBeSecuredStart','periodToBeSecuredStop','bidGateOpening','bidGateClosure','isBidGateOpen','atcGateOpening','atcGateClosure','marketPeriodStop','disputeSubmissionGateOpening','disputeSubmissionGateClosure','disputeProcessGateOpening','disputeProcessGateClosure','ltResaleGateOpening','ltResaleGateClosure','maintenances','xnRule','winningParties','operationalMessage','products','lastDataUpdate','cancelled','comment_new','corridorCode_new','productIdentification_new','additionalMessage_new'], axis=1, inplace=True)
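A minimal sketch of the scheduling itself (an assumption — any external scheduler such as cron or Windows Task Scheduler would work just as well): run the script daily and let it do nothing except on the first of the month.
from datetime import datetime

def monthly_job():
    # place the request and cleaning code above inside this function
    print("fetching last month's auctions")

if datetime.today().day == 1:
    monthly_job()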
I'm trying to create a pandas dataframe with bucket object data (list_objects_v2) using boto3.
Without pagination, I can easily create a dataframe by iterating over the response and appending rows to it.
import boto3
import pandas
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket_name) #this creates a nested json
print(response)
{'ResponseMetadata': {'RequestId': 'PGMCTZNAPV677CWE', 'HostId': '/8qweqweEfpdegFSNU/hfqweqweqweSHtM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '/8yacqweqwe/hfjuSwKXDv3qweqweqweHtM=', 'x-amz-request-id': 'PqweqweqweE', 'date': 'Fri, 09 Sep 2022 09:25:04 GMT', 'x-amz-bucket-region': 'eu-central-1', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Contents': [{'Key': 'qweqweIntraday.csv', 'LastModified': datetime.datetime(2022, 7, 12, 8, 32, 10, tzinfo=tzutc()), 'ETag': '"qweqweqwe4"', 'Size': 1165, 'StorageClass': 'STANDARD'}], 'Name': 'test-bucket', 'Prefix': '', 'MaxKeys': 1000, 'EncodingType': 'url', 'KeyCount': 1}
object_df = pandas.DataFrame()
for elem in response:
    if 'Contents' in elem:
        object_df = pandas.json_normalize(response['Contents'])
Because of the 1000-key limit of list_objects_v2, I'm trying to get to the same result using pagination. I attempted this with the following code, but I don't get the desired output (it ends up looping infinitely on larger buckets).
object_df = pandas.DataFrame()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    for elem in page:
        if 'Contents' in elem:
            object_df = pandas.json_normalize(page['Contents'])
I managed to find a solution by adding another dataframe and appending each page to it.
appended_object_df = pandas.DataFrame()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    object_df = pandas.DataFrame()
    object_df = pandas.json_normalize(page['Contents'])
    appended_object_df = appended_object_df.append(object_df, ignore_index=True)
I'm still curious if it's possible to skip the appending part and have the code directly generate the complete df.
Per the pandas documentation:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
So, you could do:
df_list = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    page_df = pandas.json_normalize(page['Contents'])
    df_list.append(page_df)
object_df = pandas.concat(df_list, ignore_index=True)
I'm trying to get the latest data from the bitmex API.
Base URI: https://www.bitmex.com/api/v1
I don't really understand how to get the latest data (from today) using filters: https://www.bitmex.com/app/restAPI
Here is my code:
from datetime import date
import requests
import json
import pandas as pd

today = date.today()
d1 = today.strftime("%Y-%m-%d")
# print("d1 =", d1)

def parser():
    today = date.today()
    # yyyy-mm-dd
    d1 = today.strftime("%Y-%m-%d")
    # print("d1 =", d1)
    return f'https://www.bitmex.com/api/v1/trade?symbol=.BVOL24H&startTime={d1}&timestamp.time=12:00:00.000&columns=price'

# Making a GET request
response = requests.get(parser()).json()
# print(response)
for elem in response:
    print(elem)
and the response is:
...
{'symbol': '.BVOL24H', 'timestamp': '2021-12-27T08:05:00.000Z', 'price': 2.02}
{'symbol': '.BVOL24H', 'timestamp': '2021-12-27T08:10:00.000Z', 'price': 2.02}
{'symbol': '.BVOL24H', 'timestamp': '2021-12-27T08:15:00.000Z', 'price': 2.02}
It's missing a few hours. I tried using endTime, startTime and count without success.
I think I need to pass another filter like endTime = now and timestamp.time = now, but I don't know how to send the payload or how to URL-encode it.
As the Filtering section of the docs explains:
"Many table endpoints take a filter parameter. This is expected to be JSON."
These parameters are not keys in the query string, but keys in a dictionary passed in the filter key:
import json
from datetime import date
import requests

url = "https://www.bitmex.com/api/v1/trade"
filters = {
    'startTime': date(2021, 12, 20).strftime("%Y-%m-%d"),
    'timestamp.time': '12:00:00.000'
}
params = {
    'symbol': '.BVOL24H',
    'filter': json.dumps(filters),
}
response = requests.get(url, params=params)
for elem in response.json():
    print(elem)
Example
/trade?symbol=.BVOL24H&filter={%22startTime%22:%222021-12-20%22,%22timestamp.time%22:%2212:00:00.000%22}
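If you also want to bound the range on the other side (the asker's "endTime = now" idea), the same filter dictionary should accept an endTime key — a sketch, assuming endTime is honoured inside filter the same way startTime is:
import json
from datetime import date
import requests

filters = {
    'startTime': date(2021, 12, 20).strftime("%Y-%m-%d"),
    'endTime': date.today().strftime("%Y-%m-%d"),  # assumption: accepted alongside startTime
    'timestamp.time': '12:00:00.000'
}
params = {
    'symbol': '.BVOL24H',
    'filter': json.dumps(filters),
}
response = requests.get("https://www.bitmex.com/api/v1/trade", params=params)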
You can add additional parameters to the url with & like below (note it must be an f-string for {d1} and {date.today()} to be interpolated):
f'https://www.bitmex.com/api/v1/trade?symbol=.BVOL24H&startTime={d1}&timestamp.time=12:00:00.000&columns=price&endTime={date.today()}&timestamp.time={date.today()}'
I'm trying, with no luck, to save the output of an API response into a CSV file in a clear and ordered way. This is the script that retrieves the API data:
import json
import requests
import csv

# List of keywords to be checked
keywords = open("/test.txt", encoding="ISO-8859-1")
keywords_to_check = []
try:
    for keyword in keywords:
        keyword = keyword.replace("\n", "")
        keywords_to_check.append(keyword)
except Exception:
    print("An error occurred. I will try again!")
    pass

apikey = # my api key
apiurl = # api url
apiparams = {
    'apikey': apikey,
    'keyword': json.dumps(keywords_to_check),
    'metrics_location': '2840',
    'metrics_language': 'en',
    'metrics_network': 'googlesearchnetwork',
    'metrics_currency': 'USD',
    'output': 'csv'
}
response = requests.post(apiurl, data=apiparams)
jsonize = json.dumps(response.json(), indent=4, sort_keys=True)
if response.status_code == 200:
    print(json.dumps(response.json(), indent=4, sort_keys=True))
The output I get is the following:
{
"results": {
"bin": {
"cmp": 0.795286539,
"cpc": 3.645033,
"m1": 110000,
"m10": 90500,
"m10_month": 2,
"m10_year": 2019,
"m11": 135000,
"m11_month": 1,
"m11_year": 2019,
"m12": 135000,
"m12_month": 12,
"m12_year": 2018,
"m1_month": 11,
"m1_year": 2019,
"m2": 110000,
"m2_month": 10,
"m2_year": 2019,
"m3": 110000,
"m3_month": 9,
"m3_year": 2019,
"m4": 135000,
"m4_month": 8,
"m4_year": 2019,
"m5": 135000,
"m5_month": 7,
"m5_year": 2019,
"m6": 110000,
"m6_month": 6,
"m6_year": 2019,
"m7": 110000,
"m7_month": 5,
"m7_year": 2019,
"m8": 90500,
"m8_month": 4,
"m8_year": 2019,
"m9": 90500,
"m9_month": 3,
"m9_year": 2019,
"string": "bin",
"volume": 110000
},
"chair": {
"cmp": 1,
"cpc": 1.751945,
"m1": 1000000,
"m10": 823000,
"m10_month": 2,
"m10_year": 2019,
"m11": 1500000,
"m11_month": 1,
"m11_year": 2019,
"m12": 1500000,
"m12_month": 12,
"m12_year": 2018,
"m1_month": 11,
"m1_year": 2019,
"m2": 1000000,
"m2_month": 10,
"m2_year": 2019,
"m3": 1000000,
"m3_month": 9,
"m3_year": 2019,
"m4": 1220000,
"m4_month": 8,
"m4_year": 2019,
"m5": 1220000,
"m5_month": 7,
"m5_year": 2019,
"m6": 1000000,
"m6_month": 6,
"m6_year": 2019,
"m7": 1000000,
"m7_month": 5,
"m7_year": 2019,
"m8": 1000000,
"m8_month": 4,
"m8_year": 2019,
"m9": 1000000,
"m9_month": 3,
"m9_year": 2019,
"string": "chair",
"volume": 1220000
}, ....
What I'd like to achieve is a csv file showing the following info and ordering, with the columns being string, cmp, cpc and volume:
string;cmp;cpc;volume
bin;0.795286539;3.645033;110000
chair;1;1.751945;1220000
Following Sidous' suggestion I've come to the following:
import pandas as pd
data = response.json()
df = pd.DataFrame.from_dict(data)
df.head()
Which gave me the following output:
results
bin {'string': 'bin', 'volume': 110000, 'm1': 1100...
chair {'string': 'chair', 'volume': 1220000, 'm1': 1...
flower {'string': 'flower', 'volume': 1830000, 'm1': ...
table {'string': 'table', 'volume': 673000, 'm1': 82...
water {'string': 'water', 'volume': 673000, 'm1': 67...
Close, but how can I show "string", "volume", etc. as columns and avoid displaying the braces of the dictionary?
Thanks a lot to whoever can help me sort this out :)
I propose saving the response in a pandas data frame and then writing it out with pandas (CSV files are easily handled by pandas).
import pandas as pd

# receiving results in a dictionary
dic = response.json()
# keep only the value stored under the "results" key
dic = dic.pop("results", None)
# convert dictionary to dataframe
data = pd.DataFrame.from_dict(dic, orient='index')
# string;cmp;cpc;volume
new_data = pd.concat([data['string'], data['cmp'], data['cpc'], data['volume']], axis=1)
# removing the default index (the bin and chair keys)
new_data.reset_index(drop=True, inplace=True)
print(new_data)
# saving new_data into a csv file, ';'-separated to match the desired layout
new_data.to_csv('name_of_file.csv', sep=';', index=False)
You will find the csv file in the same directory as the Python file (otherwise you can pass a full path to the .to_csv() method).
Open a text file using a with open statement and write the data by iterating through the whole dict:
with open("text.csv", "w+") as f:
    f.write('string;cmp;cpc;volume\n')
    for res in response.values():  # this assumes `response` is a dict like {'results': {...}}
        for r in res.values():
            f.write(r['string'] + ';' + str(r['cmp']) + ';' + str(r['cpc']) + ';' + str(r['volume']) + '\n')
try this:
import pandas as pd

data = response.json()
cleaned_data = []
for key, val in data["results"].items():
    cleaned_data.append(val)
df = pd.DataFrame(cleaned_data)  # a list of dicts can go straight to the DataFrame constructor
df1 = df[["string", "cmp", "cpc", "volume"]]
df1.head()
df1.to_csv("output.csv", sep=';', index=False)  # ';' separator to match the desired layout
What about using a csv.DictWriter, since your data is almost exactly what it needs?
import csv

if __name__ == "__main__":
    results = {"chair": {"cmp": 1, "cpc": 3.64}, "bin": {"cmp": 0.5, "cpc": 1.75}}  # two rows will do for the example
    # Now let's get the data structure we really want: a list of rows
    rows = []
    for key, value in results.items():
        rows.append(value)
        # And, while we're at it, set the string part
        rows[-1]["string"] = key
    # Create the header
    fieldnames = set()
    for row in rows:
        for fname in row:
            fieldnames.add(fname)
    # Write to the file (sorted so the column order is deterministic;
    # ';' delimiter to match the desired layout)
    with open("mycsv.csv", "w", newline="") as file_:
        writer = csv.DictWriter(file_, fieldnames=sorted(fieldnames), delimiter=";")
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
That should be all you need, using only the standard library.
I have a list of places from an Excel file which I would like to enrich with geonames IDs. Starting from the Excel file I made a pandas DataFrame, and I would like to use its values as params in my request.
Here is the script I made:
import pandas as pd
import requests
import json

require_cols = [1]
required_df = pd.read_excel('grp.xlsx', usecols=require_cols)
print(required_df)

url = 'http://api.geonames.org/searchJSON?'
params = {'username': "XXXXXXXX",
          'name_equals': (required_df),
          'maxRows': "1"}
e = requests.get(url, params=params)
pretty_json = json.loads(e.content)
print(json.dumps(pretty_json, indent=2))
The problem is related to the definition of this parameter:
'name_equals': (required_df)
I would like to use the places (around 15k) from the DF as the param, retrieve the related geonames ID for each one, and write the output to a separate Excel file.
The simple request works:
import requests
import json

url = 'http://api.geonames.org/searchJSON?'
params = {'username': "XXXXXXX",
          'name_equals': "Aire",
          'maxRows': "1"}
e = requests.get(url, params=params)
pretty_json = json.loads(e.content)
print(json.dumps(pretty_json, indent=2))
# print(e.content)
As does the definition of the pandas DataFrame:
# import pandas lib as pd
import pandas as pd
require_cols = [0,1]
# only read specific columns from an excel file
required_df = pd.read_excel('grp.xlsx', usecols = require_cols)
print(required_df)
I also tried via SPARQL without results, so I decided to go via Python.
Thanks for your time.
You can use a for-loop:
import pandas as pd

df = pd.DataFrame({'Places': ['London', 'Paris', 'Berlin']})
for item in df['Places']:
    print('requests for:', item)
    # ... rest of code ...
or df.apply():
import pandas as pd

def run(item):
    print('requests for:', item)
    # ... rest of code ...
    return 'result for ' + item

df = pd.DataFrame({'Places': ['London', 'Paris', 'Berlin']})
df['Results'] = df['Places'].apply(run)
Thanks @furas for your reply.
I solved it like this:
import pandas as pd
import requests
import json

url = 'http://api.geonames.org/searchJSON?'
df = pd.read_excel('Book.xlsx', sheet_name='Sheet1', usecols="B")
for item in df.place_name:
    params = {'username': "XXXXXX",
              'name_equals': item,
              'maxRows': "1"}
    e = requests.get(url, params=params)
    pretty_json = json.loads(e.content)
    for item in pretty_json["geonames"]:
        print(json.dumps(item["geonameId"], indent=2))
        with open('data.json', 'w', encoding='utf-8') as f:
            json.dump(item["geonameId"], f, ensure_ascii=False, indent=4)
# print(e.content)
The only problem now is the JSON output: the print shows the complete list of IDs, but when I write the output to a file I get just the last ID from the list.
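That happens because the file is opened in 'w' mode inside the loop, so every iteration overwrites the previous ID. A minimal sketch of a fix (same assumptions as the script above — Book.xlsx with a place_name column): collect the IDs in a list while looping and dump the list once at the end.
import pandas as pd
import requests
import json

url = 'http://api.geonames.org/searchJSON?'
df = pd.read_excel('Book.xlsx', sheet_name='Sheet1', usecols="B")

geoname_ids = []
for place in df.place_name:
    params = {'username': "XXXXXX",
              'name_equals': place,
              'maxRows': "1"}
    e = requests.get(url, params=params)
    for result in e.json().get("geonames", []):
        geoname_ids.append(result["geonameId"])

# one write, after the loop, so nothing gets overwritten
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(geoname_ids, f, ensure_ascii=False, indent=4)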