I have a Pandas dataframe and I want to call an API and pass some parameters from that dataframe. Then I get the results from the API and create a new column from that. This is my working code:
import http.client, urllib.request, urllib.parse, urllib.error, base64
import pandas as pd
import json

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': 'my-api-key-goes-here',
}

params = urllib.parse.urlencode({})

df = pd.read_csv('mydata.csv', names=['id', 'text'])

def call_api(row):
    try:
        body = {
            "documents": [
                {
                    "language": "en",
                    "id": row['id'],
                    "text": row['text']
                }
            ]
        }
        conn = http.client.HTTPSConnection('api-url')
        # json.dumps() produces valid JSON; str(body) would send a single-quoted Python repr
        conn.request("POST", "api-endpoint?%s" % params, json.dumps(body), headers)
        response = conn.getresponse()
        data = json.loads(response.read())
        conn.close()  # close before returning, otherwise this line is unreachable
        return data['documents'][0]['score']
    except Exception as e:
        # a generic Exception has no .errno/.strerror, so print the exception itself
        print("Request failed: {0}".format(e))

df['score'] = df.apply(call_api, axis=1)
The above works quite well. However, I have a limit on the number of API requests I can make, and the API lets me send up to 100 documents in the same request by adding more entries to the body['documents'] list.
The returned data follows this schema:
{
  "documents": [
    {
      "score": 0.92,
      "id": "1"
    },
    {
      "score": 0.85,
      "id": "2"
    },
    {
      "score": 0.34,
      "id": "3"
    }
  ],
  "errors": null
}
So, what I am looking for is a way to apply the same API call not row by row, but in batches of 100 rows at a time. Is there a way to do this in Pandas, or should I iterate over the dataframe rows, create the batches myself, and then iterate again to add the returned values in the new column?
DataFrame.apply() is slow; we can do better. This will create the "documents" list-of-dicts in one go:
df.to_dict('records')
Then all you need to do is split it into chunks of 100:
start = 0
while start < len(df):
    documents = df.iloc[start:start + 100].to_dict('records')
    call_api(documents)
    start += 100
Finally, you could use a single HTTP session with the requests library:
import requests
session = requests.Session()
call_api(session, documents)
Then inside call_api() you do session.post(...). This is more efficient than setting up a new connection each time.
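Putting this together, a minimal sketch of a batched call_api() might look like the following; it assumes the same placeholder endpoint and headers as in the question, adds the "language" key that the batched body still needs, and maps the returned scores back to rows by id:

import requests
import pandas as pd

API_URL = "https://api-url/api-endpoint"  # placeholder, as in the question
headers = {
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': 'my-api-key-goes-here',
}

def call_api(session, records):
    # records is a list of {'id': ..., 'text': ...} dicts from to_dict('records')
    body = {
        "documents": [
            {"language": "en", "id": str(r['id']), "text": r['text']}
            for r in records
        ]
    }
    resp = session.post(API_URL, json=body, headers=headers)
    resp.raise_for_status()
    # map each returned id to its score (the schema above shows ids come back as strings)
    return {doc['id']: doc['score'] for doc in resp.json()['documents']}

session = requests.Session()
scores = {}
start = 0
while start < len(df):  # df is the dataframe from the question
    batch = df.iloc[start:start + 100].to_dict('records')
    scores.update(call_api(session, batch))
    start += 100

df['score'] = df['id'].astype(str).map(scores)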
Trying to retrieve data via the EIA data API (v2): https://www.eia.gov/opendata/documentation.php.
I'm able to use the API dashboard to return data:
https://www.eia.gov/opendata/browser/electricity/retail-sales?frequency=monthly&data=price;revenue;sales;&start=2013-01
But when I attempt to retrieve within Python using the attached documentation, I don't appear to be returning any values when using the same parameters.
import requests

url = 'https://api.eia.gov/v2/electricity/retail-sales/data/?api_key=' + API_KEY
params = {
    "frequency": "monthly",
    "data": [
        "revenue",
        "sales",
        "price"
    ],
    "start": "2013-01"
}

x = requests.get(url, params=params)  # issue the GET request

if x.status_code == 200:
    print('Success')
else:
    print('Failed')

res = x.json()['response']
data = res['data']
If I print the URL created by the GET method and compare it to the API URL shown in the dashboard, the issue appears to be in the way the GET method serializes the data parameter:
Works
https://api.eia.gov/v2/electricity/retail-sales/data/?frequency=monthly&data[0]=price&data[1]=revenue&data[2]=sales&start=2013-01&sort[0][column]=period&sort[0][direction]=desc&offset=0&length=5000
Doesn't work (returned by GET method):
https://api.eia.gov/v2/electricity/retail-sales/data/?api_key=MY_API&frequency=monthly&data=revenue&data=sales&data=price&start=2013-01
Can anyone provide guidance on how to coerce the GET method into passing my data parameters the same way the API dashboard does?
Your data in params is not formatted correctly in the URL. Try this if you want the URL to be formed as in your working version:
import requests

url = 'https://api.eia.gov/v2/electricity/retail-sales/data/?api_key=' + API_KEY
data = [
    "revenue",
    "sales",
    "price"
]
params = {
    "frequency": "monthly",
    "start": "2013-01"
}
# build indexed keys: data[0]=revenue&data[1]=sales&data[2]=price
for index, value in enumerate(data):
    params[f"data[{index}]"] = value

response = requests.get(url, params=params)
But if the server handles it, square brackets in the name of the data[] parameter are enough:
url = 'https://api.eia.gov/v2/electricity/retail-sales/data/?api_key=' + API_KEY
params = {
    "frequency": "monthly",
    "data[]": [
        "revenue",
        "sales",
        "price"
    ],
    "start": "2013-01"
}
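To see exactly what either variant will send without hitting the API, you can build a prepared request and inspect its URL; this is just a sanity check on the encoding, not part of the original answer:

import requests

url = "https://api.eia.gov/v2/electricity/retail-sales/data/"
params = {"frequency": "monthly", "start": "2013-01"}
for index, value in enumerate(["revenue", "sales", "price"]):
    params[f"data[{index}]"] = value

# prepare() applies the same encoding requests.get() would use
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)
# the brackets come out percent-encoded (data%5B0%5D=revenue&...),
# which the server decodes back to data[0]=revenue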
I am new to Python and the REST world.
My Python script:
import json
import requests

with open(r"create-multiple-Users.json", "r") as payload:
    data = json.load(payload)

json_data = json.dumps(data, indent=2)
headers = {'content-type': 'application/json; charset=utf-8'}

for i in range(len(data)):
    r = requests.post('http://localhost:3000/users',
                      data=json_data, headers=headers)
Mock API server: https://github.com/typicode/json-server.
Entry file: "info.json", with endpoint /users, which has one user initially:
{
  "users": [
    {
      "id": 1,
      "name": "John",
      "job": "Wong"
    }
  ]
}
Issue:
POSTing from a file with only one user works perfectly. The new user is appended to info.json as an object, as expected.
But when I try to POST, say, 3 users from the file "create-multiple-Users.json" below, the whole list of objects gets appended to "info.json" 3 times (i.e. once per object/iteration).
[
  {
    "id": 10,
    "name": "Janet",
    "job": "Weaver"
  },
  {
    "id": 12,
    "name": "Kwonn",
    "job": "Wingtsel"
  },
  {
    "id": 13,
    "name": "Eve",
    "job": "Holt"
  }
]
I would expect the users to be appended one by one as separate objects.
Maybe I am oversimplifying the looping?
Any help is highly appreciated.
PS: Sorry I couldn't get the multiple-users file formatted ;(
A simple change in your for iteration would help:
import json
import requests

with open(r"create-multiple-Users.json", "r") as payload:
    data = json.load(payload)

headers = {'content-type': 'application/json; charset=utf-8'}

for row in data:  # iterate over the JSON list itself
    r = requests.post('http://localhost:3000/users',
                      data=json.dumps(row),  # send one serialized object; a bare dict would be form-encoded
                      headers=headers)
I found the solution by using the hint, thanks to "enriqueojedalara":
import json
import requests

with open(r"create-multiple-Users.json", "r") as payload:
    data = json.load(payload)  # <class 'list'>

headers = {'content-type': 'application/json; charset=utf-8'}
print("Total number of objects:", len(data))

for i in range(len(data)):
    data_new = json.dumps(data[i])
    r = requests.post('http://localhost:3000/users', data=data_new, headers=headers)
    print("Item#", i, "added", " -> ", data_new)
I am really struggling with this one. I'm new to Python and I'm trying to extract data from an API.
I have managed to run the script below, but I need to amend it to filter on multiple values for one column, let's say England and Scotland. Is there an equivalent of the SQL IN operator, e.g. Area_Name IN ('England', 'Scotland')?
from requests import get
from json import dumps

ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"
AREA_TYPE = "nation"
AREA_NAME = "england"

filters = [
    f"areaType={ AREA_TYPE }",
    f"areaName={ AREA_NAME }"
]

structure = {
    "date": "date",
    "name": "areaName",
    "code": "areaCode",
    "dailyCases": "newCasesByPublishDate",
}

api_params = {
    "filters": str.join(";", filters),
    "structure": dumps(structure, separators=(",", ":")),
    "latestBy": "cumCasesByPublishDate"
}

formats = [
    "json",
    "xml",
    "csv"
]

for fmt in formats:
    api_params["format"] = fmt
    response = get(ENDPOINT, params=api_params, timeout=10)
    assert response.status_code == 200, f"Failed request for {fmt}: {response.text}"
    print(f"{fmt} data:")
    print(response.content.decode())
I have tried the script, and a dict is the easiest type to handle in this case.
Given your JSON data output:
data = {"length":1,"maxPageLimit":1,"data":[{"date":"2020-09-17","name":"England","code":"E92000001","dailyCases":2788}],"pagination":{"current":"/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1","next":null,"previous":null,"first":"/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1","last":"/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1"}}
You can try something like this:
countries = ['England', 'France', 'Whatever']
return [row for row in data['data'] if row['name'] in countries]
I presume the data list is the only interesting key in the data dict since all others do not have any meaningful values.
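If you would rather have the API do the filtering, a sketch under the assumption that the v1 filters parameter takes one areaName per request is to loop over the areas and combine the results with pandas:

from requests import get
from json import dumps
import pandas as pd

ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"
structure = {
    "date": "date",
    "name": "areaName",
    "code": "areaCode",
    "dailyCases": "newCasesByPublishDate",
}

frames = []
for area in ["england", "scotland"]:  # one request per area name
    api_params = {
        "filters": f"areaType=nation;areaName={area}",
        "structure": dumps(structure, separators=(",", ":")),
        "latestBy": "cumCasesByPublishDate",
        "format": "json",
    }
    response = get(ENDPOINT, params=api_params, timeout=10)
    response.raise_for_status()
    frames.append(pd.DataFrame(response.json()["data"]))

combined = pd.concat(frames, ignore_index=True)
print(combined)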
*** I have updated code at the bottom.
I have a JSON object I'm working with, and it's coming from Azure analytics for an application we built. I'm trying to figure out how to parse the URL that comes back so that just the limit and location keys' data land in separate columns. The code I'm using is listed here (keys are taken out, as well as the URL, because of API keys and tokens):
import requests
import pandas as pd
from urllib.parse import urlparse
from furl import furl
import json

d1 = '<This is where I have the url for the API>'
querystring = {"timespan": "P7D"}  # gets last 7 days
headers = {}  # request headers go in here (redacted)

response = requests.request("GET", d1, headers=headers, params=querystring)
data = json.loads(response.text)

# then I clean up the stuff in the dataframe
for stuff in data['value']:
    del stuff['count']  # ... (just a list of all the non-needed fields in the json script)

newstuff = json.dumps(data, indent=2, sort_keys=True)
data2 = json.loads(newstuff)
OK, now here is the part I am having problems with. I want to pull out 3 columns of data from each row: ['request']['url'], ['timestamp'], and ['user']['id'].
I'm pretty sure I need a for loop, so I'm doing the following to get the pieces out:
for x in data2['value']:
    time = x['timestamp']
    user = x['user']['id']
    url = furl(x['request']['url'])
    limit = url.args['limit']
    location = url.args['location']
What's happening is that when I try this I get "limit does not exist" for every URL. I think I need an if/else statement, but I'm not sure how to formulate it. I need to get everything into a dataframe so I can parse it out into a cursor.execute statement, which I know how to do.
What's needed:
1. Get the information in the for loop into a dataframe.
2. Take the URL; if it does not have a limit or a location, make that value None, else put limit in a column and do the same for location in a column by itself.
Dataframe would look like this
timestamp user limit location
2018-01-01 bob#home.com null
2018-01-01 bill#home.com null
2018-01-01 same#home.com null null
2018-01-02 bob#home.com
here is the information on furl
here is some sample json to test with:
{
  "value": [
    {
      "request": {
        "url": "https://website/testing"
      },
      "timestamp": "2018-09-23T18:32:58.153z",
      "user": {
        "id": ""
      }
    },
    {
      "request": {
        "url": "https://website/testing/limit?location=31737863-c431-e6611-9420-90b11c44c42f"
      },
      "timestamp": "2018-09-23T18:32:58.153z",
      "user": {
        "id": "steve#home.com"
      }
    },
    {
      "request": {
        "url": "https://website/testing/dealanalyzer?limit=57bd5872-3f45-42cf-bc32-72ec21c3b989&location=31737863-c431-e611-9420-90b11c44c42f"
      },
      "timestamp": "2018-09-23T18:32:58.153z",
      "user": {
        "id": "tom#home.com"
      }
    }
  ]
}
import requests
import pandas as pd
from urllib.parse import urlparse
import json
from pandas.io.json import json_normalize

d1 = "https://nowebsite/v1/apps/11111111-2222-2222-2222-33333333333333/events/requests"
querystring = {"timespan": "P7D"}
headers = {
    'x-api-key': "xxxxxxxxxxxxxxxxxxxxxxxx",
    'Cache-Control': "no-cache",
    'Postman-Token': "xxxxxxxxxxxxxxxxxxxx"
}

response = requests.request("GET", d1, headers=headers, params=querystring)
data = json.loads(response.text)

# delete unneeded fields from the API GET response
for stuff in data['value']:
    del stuff['count']
    del stuff['customDimensions']
    del stuff['operation']
    del stuff['session']
    del stuff['cloud']
    del stuff['ai']
    del stuff['application']
    del stuff['client']
    del stuff['id']
    del stuff['type']
    del stuff['customMeasurements']
    del stuff['user']['authenticatedId']
    del stuff['user']['accountId']
    del stuff['request']['name']
    del stuff['request']['success']
    del stuff['request']['duration']
    del stuff['request']['performanceBucket']
    del stuff['request']['resultCode']
    del stuff['request']['source']
    del stuff['request']['id']

newstuff = json.dumps(data, indent=2, sort_keys=True)
# print(newstuff)

# Now it's in a cleaner format to work with
data2 = json.loads(newstuff)
json_normalize(data2['value'])
From here the data is in a pandas dataframe and looks like I want it to.
I just need to know how to use furl to pull the limit and location out of the URL on a per-row basis and create new columns called limit and location, as mentioned above.
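No accepted answer is shown here, but a minimal sketch of that missing piece could look like this: furl's args behaves like a dict, so .get() returns None when a query parameter is absent, which covers the if/else case; the helper name and column order are my own choices:

import pandas as pd
from furl import furl

def extract_row(x):
    # pull timestamp, user id, and the limit/location query args (None when absent)
    args = furl(x['request']['url']).args
    return {
        'timestamp': x['timestamp'],
        'user': x['user']['id'],
        'limit': args.get('limit'),        # .get() avoids a KeyError on missing params
        'location': args.get('location'),
    }

rows = [extract_row(x) for x in data2['value']]  # data2 as built above
df = pd.DataFrame(rows, columns=['timestamp', 'user', 'limit', 'location'])
print(df)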
I am facing this error while making a request to fetch JSON from an API.
I can get JSON data using the '/v1/articles' path.
import http.client
import codecs
import json

conn = http.client.HTTPSConnection("api.xxxx.com.tr")
headers = {
    'accept': "application/json",
    'apikey': "cd6b6c96799847698d87dec9f9a731d6"
}
filter = "daily"
conn.request("GET", "/v1/articles", headers=headers)
reader = codecs.getreader("utf-8")
res = conn.getresponse()
data = json.load(reader(res))
json.dumps(data)
return data
But I get a JSONDecodeError if I set a filter. Code:
conn = http.client.HTTPSConnection("api.xxxx.com.tr")
headers = {
    'accept': "application/json",
    'apikey': "cd6b6c96799847698d87dec9f9a731d6"
}
conn.request("GET", "/v1/articles?$filter=Path eq '/daily/'", headers=headers)
reader = codecs.getreader("utf-8")
res = conn.getresponse()
data = json.load(reader(res))
json.dumps(data)
return data
I tried the same filter using Postman with no error, and I can get the JSON data.
Returned JSON data from Postman:
[
  {
    "Id": "40778196",
    "ContentType": "Article",
    "CreatedDate": "2018-03-20T08:28:05.385Z",
    "Description": "İspanya'da 2016 yılında çalınan lüks otomobil, şasi numarası değiştirilerek Bulgaristan üzerinden getirildiği Türkiye'de bulundu.",
    "Files": [
      {
        "FileUrl": "http://i.xxxx.com/i/xxxx/98/620x0/5ab0c6a9c9de3d18a866eb54.jpg",
        "Metadata": {
          "Title": "",
          "Description": ""
        }
      }
    ],
    "ModifiedDate": "2018-03-20T08:32:12.001Z",
    "Path": "/gundem/",
    "StartDate": "2018-03-20T08:32:12.001Z",
    "Tags": [
      "ispanya",
      "Araç",
      "Hırsız",
      "Dolandırıcı"
    ],
    "Title": "İspanya'da çalınan lüks araç Türkiye'de bulundu!",
    "Url": "http://www.xxxx.com.tr/gundem/ispanyada-calinan-luks-arac-turkiyede-bulundu-40778196"
  }
]
I cannot figure out the problem. It would be great if anyone could help me with this issue. Thank you.
I finally figured out the problem! Using the requests library solved it; now I can filter the API request.
data = requests.get('https://api.xxxxx.com.tr/v1/articles',
                    headers=headers,
                    params={"$filter": "Path eq '/xxxxxx/'"}).json()
I am leaving this answer here for anyone else who may need this solution in the future.
Thanks for all your suggestions.
The problem is in the following line:
data = json.load(reader(res))
When your response is not a JSON string, a JSONDecodeError occurs. So add additional logic to check whether the response is empty or a valid JSON string. As a first step, print what reader(res) returns and see what the body contains.
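For what it's worth, the likely root cause is that the filter string contains spaces and quotes, which http.client sends unencoded, while requests percent-encodes params for you. Here is a minimal sketch of the same call with http.client, encoding the query string by hand; the host, key, and filter value are the placeholders from the question:

import http.client
import json
from urllib.parse import urlencode

headers = {
    'accept': "application/json",
    'apikey': "cd6b6c96799847698d87dec9f9a731d6"
}

# urlencode percent-encodes the spaces and quotes in the OData-style filter value
query = urlencode({"$filter": "Path eq '/daily/'"})

conn = http.client.HTTPSConnection("api.xxxx.com.tr")
conn.request("GET", "/v1/articles?" + query, headers=headers)
res = conn.getresponse()
data = json.loads(res.read().decode("utf-8"))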