I am trying to make an API query in Python to return a list of districts and their unique IDs in Indonesia. I was able to pull data from the API and store it as JSON.
However, the API query only returns the first 100 results. In this particular example, I know that Indonesia has 700+ districts, so it would require at least 8 queries (but since I don't always know how many districts are in a country, I'd like to be able to run a loop that keeps querying until all of the results are exhausted).
Ideally, I would like to make repeated queries to the API, and append the results of each query into one variable until I have extracted the full list of districts.
e.g.
Query 1: store results 1 - 100 in variable X
Query 2: append results 101 - 200 to variable X.
repeat until all results are stored.
For more information on the API, see here: https://gtmp.linkssystem.org/docs/districts
Here is the code I started:
from urllib2 import Request, urlopen, URLError
request = Request('https://gtmp.linkssystem.org/api/districts?admin0=indonesia')
response = urlopen(request)
geo = response.read()
import json
geolist = json.loads(geo)
If this affects anything, I was hoping to write all of the results to a .csv like so:
import csv
with open('C:\\Users\\Owner\\Desktop\\geo.csv', 'wb') as fp:
    a = csv.writer(fp)
    a.writerow(["admin0id", "admin0","admin1id","admin1","admin2id","admin2","admin3id","admin3"])
    for e in geolist:
        a.writerow([e["admin0id"],
                    e["admin0"],
                    e["admin1id"],
                    e["admin1"],
                    e["admin2id"],
                    e["admin2"],
                    e["admin3id"],
                    e["admin3"]])
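The "keep querying until the results are exhausted" loop described above could be sketched roughly as follows. The question does not say how this API selects a page, so `fetch_page` is a hypothetical callable that takes a page number and returns that page as a list of dicts; here it is replaced by a stand-in that serves made-up data, and the 730-record total is invented for the demo.

```python
def fetch_all(fetch_page, page_size=100):
    """Call fetch_page(1), fetch_page(2), ... and collect every record."""
    results = []
    page = 1
    while True:
        batch = fetch_page(page)
        results.extend(batch)
        # A short (or empty) page means there is nothing left to request.
        if len(batch) < page_size:
            break
        page += 1
    return results


# Stand-in for the real HTTP call: 730 made-up districts, served 100 at a time.
_FAKE_DISTRICTS = [{"admin2id": i, "admin2": "district %d" % i} for i in range(730)]


def fake_fetch_page(page, page_size=100):
    start = (page - 1) * page_size
    return _FAKE_DISTRICTS[start:start + page_size]


geolist = fetch_all(fake_fetch_page)
print(len(geolist))  # 730
```

In real use, `fetch_page` would wrap the `urlopen` call and pass whatever paging parameter the API documentation specifies. The combined `geolist` can then be written to CSV with the same loop as above.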
I'm going crazy trying to get data through an API call using requests and pandas. It looks like nested data, but I can't get the data I need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
Above is the API documentation. I'm just trying to keep it simple and get the ItemNumber and QtyRemainingToShip, but I can't even figure out how to access the nested data. I'm trying to use a DataFrame to get it, but I'm just lost. Any help would be appreciated. I keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax
I used the link to grab a sample response from the API documentation page you provided. From the code you provided, it looks like you are already able to get the data, and I'm assuming that you already have it as a dictionary.
From what I can tell, I don't think you should be using pandas, unless it's a downstream requirement of the task you are doing. But to get the ItemNumber and QtyRemainingToShip you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
#the data_list is only one element long, so grab the first element which is of type dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# similar to the data list the value associated with "SoEstimateItemLineArr" is of type list and has 1 element in it, so we grab the first & only element.
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
Side Note
As a side note, I wouldn't name any variable json, because that's also the name of a popular Python library for parsing JSON. It will confuse future readers and will clash with the library name if you end up having to import json.
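If the response can hold more than one order or more than one item line, the same lookups can be applied in a loop instead of taking element `[0]` each time. The sketch below uses a made-up sample shaped like the structure described above; only the key names come from the answer, and the values (including the second item line) are invented.

```python
# Made-up sample with the same nesting: Data -> list of orders,
# each order -> SoEstimateItemLineArr -> list of item lines.
api_response = {
    "Data": [
        {
            "SoEstimateHeader": {},
            "SoEstimateItemLineArr": [
                {"ItemNumber": "BC", "QtyRemainingToShip": 1},
                {"ItemNumber": "XY", "QtyRemainingToShip": 4},
            ],
        },
    ]
}

# Walk every order and every item line, keeping only the two fields of interest.
rows = [
    {
        "ItemNumber": line["ItemNumber"],
        "QtyRemainingToShip": line["QtyRemainingToShip"],
    }
    for order in api_response["Data"]
    for line in order["SoEstimateItemLineArr"]
]

for row in rows:
    print(row["ItemNumber"], row["QtyRemainingToShip"])
```

If a table is needed downstream, a flat list of dicts like `rows` can be passed to `pd.DataFrame(rows)`.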
I am working on code that fetches information from an online database through the API they provide. The information I receive is a large block of data arranged in square and curly brackets. As of now, I fetch the whole block of data every time I want a specific parameter. This is a problem because I am limited in how many queries I can send through their API. I would therefore like to fetch the large block of data once and save it in a variable that I can then refer to when I want specific parts of it. I would then be able to get all the specific parts of the data with only one request.
As of now the code I have looks like this:
from monkeylearn import MonkeyLearn
ml = MonkeyLearn('my personal API-key')
model_id = 'the id of the model'
text1 = 'This is the text that is analyzed by monkeylearn'
data = ['first text', {'text': text1, 'external_id': 'ANY_ID'}, '']
response = ml.extractors.extract(model_id, data).body
company_name_tag = response[1]['extractions'][0]['tag_name']
company_name = response[1]['extractions'][0]['extracted_text']
response contains all the information I get from the request so I would like to only have to fetch it once. As of now, if I were to print(company_name_tag) and print(company_name) it would fetch the data through the api two times. This leads to me reaching my limit on the api a lot faster than necessary.
I appreciate all help with this issue!
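A minimal sketch of the fetch-once idea, using a stand-in function instead of the real MonkeyLearn client (the stand-in, its return values, and the call counter are all made up for illustration). It assumes the response body is an ordinary Python list of dictionaries, as the indexing above suggests. Reading from such a list is a local lookup and should not trigger another request.

```python
call_count = 0


def fake_extract(model_id, data):
    """Stand-in for the remote call; counts how often it is invoked."""
    global call_count
    call_count += 1
    # One result per input text, mirroring the three-element input list.
    return [
        {"extractions": []},
        {"extractions": [{"tag_name": "COMPANY", "extracted_text": "Acme"}]},
        {"extractions": []},
    ]


# The only line that performs a "request".
response = fake_extract("the id of the model", ["first text", {"text": "some text"}, ""])

# Everything below only reads from the stored variable.
company_name_tag = response[1]["extractions"][0]["tag_name"]
company_name = response[1]["extractions"][0]["extracted_text"]
print(company_name_tag)
print(company_name)
print(call_count)  # 1
```

If the counter stays at 1 no matter how many values are read or printed, the request was sent once, and the API quota is only affected by how often the fetching line itself runs.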
I'm trying to access the table details from the following site, to ultimately put a limited number of rows (the dataset is massive) into a dataframe and save it as a CSV: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data
I'm just starting out with web scraping and was practicing on this dataset. I can pull tags like div, but when I try soup.findAll('tr') or td, it returns an empty set.
The table appears to be embedded in different code (see link above), so maybe that's my issue, but I'm still unsure how to access the detail rows, headers, etc. Selenium, maybe?
Thanks in advance!
By the looks of it, the website already allows you to export the data:
As it would seem, the original link is:
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data
The .csv download link is:
https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD
The .json link is:
https://data.cityofchicago.org/resource/ijzp-q8t2.json
Therefore you could simply extract the ID of the data, in this case ijzp-q8t2, and replace it on the download links above. Here is the official documentation of their API.
import pandas as pd
from sodapy import Socrata
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofchicago.org", None)
# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofchicago.org",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")
# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("ijzp-q8t2", limit=2000)
# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
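The ID substitution described earlier in this answer can be sketched as a small helper. The function name is made up, and the URL patterns are simply the two links quoted above with the domain and dataset ID filled in. Whether every dataset on the portal follows the same pattern is an assumption.

```python
def socrata_export_urls(domain, dataset_id):
    """Build the CSV and JSON links for a dataset, following the two links quoted above."""
    return {
        "csv": "https://%s/api/views/%s/rows.csv?accessType=DOWNLOAD" % (domain, dataset_id),
        "json": "https://%s/resource/%s.json" % (domain, dataset_id),
    }


urls = socrata_export_urls("data.cityofchicago.org", "ijzp-q8t2")
print(urls["csv"])
print(urls["json"])
```

Since the dataset is large, something like `pd.read_csv(urls["csv"], nrows=1000)` should keep only the first rows in the DataFrame rather than parsing the whole file.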
Recently I have been reading a stock price database on Quandl, using an API call to extract the data. But I am really confused by the example I have.
import requests
api_url = 'https://www.quandl.com/api/v1/datasets/WIKI/%s.json' % stock
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=3))
raw_data = session.get(api_url)
Can anyone explain that to me?
1) For api_url, if I open that web page, it says 404 Not Found. So if I want to use another database, how do I prepare this api_url? What does '% stock' mean?
2) Here requests appears to be used to extract the data. What is the format of raw_data? How do I know the column names? How do I extract the columns?
To expand on my comment above:
% stock is a string formatting operation, replacing %s in the preceding string with the value referenced by stock. Further details can be found here
raw_data actually references a Response object (part of the requests module; details found here).
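The `%` formatting step can be tried on its own, without sending any request. The snippet below is a small illustration using the stock code 'AAPL'. The `str.format` line is an equivalent alternative, not something the original example uses.

```python
stock = "AAPL"

# %s is a placeholder; the % operator substitutes the value of stock into it.
api_url = "https://www.quandl.com/api/v1/datasets/WIKI/%s.json" % stock
print(api_url)  # https://www.quandl.com/api/v1/datasets/WIKI/AAPL.json

# The same result with str.format.
same_url = "https://www.quandl.com/api/v1/datasets/WIKI/{}.json".format(stock)
print(same_url == api_url)  # True
```

This also explains the 404: the literal text `%s` in the URL is not a real dataset name until a stock code has been substituted for it.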
To expand on your code.
import requests
#Set the stock we are interested in, AAPL is Apple stock code
stock = 'AAPL'
#Your code
api_url = 'https://www.quandl.com/api/v1/datasets/WIKI/%s.json' % stock
session = requests.Session()
# Mount on 'https://' so the retry adapter applies to the HTTPS URL above
session.mount('https://', requests.adapters.HTTPAdapter(max_retries=3))
raw_data = session.get(api_url)
# Probably want to check that requests.Response is 200 - OK here
# to make sure we got the content successfully.
# requests.Response has a function to return json file as python dict
aapl_stock = raw_data.json()
# We can then look at the keys to see what we have access to
aapl_stock.keys()
# column_names Seems to be describing the individual data points
aapl_stock['column_names']
# A big list of data, lets just look at the first ten points...
aapl_stock['data'][0:10]
Edit to answer question in comment
So aapl_stock['column_names'] shows Date and Open as the first and second values respectively. This means they correspond to positions 0 and 1 in each element of the data.
Therefore, to access the dates use [row[0] for row in aapl_stock['data'][0:10]] (date value for the first ten items), and to access the open values use [row[1] for row in aapl_stock['data'][0:78]] (open value for the first 78 items). Note that aapl_stock['data'][0:10][0] would instead return the whole first row, because slicing the outer list just yields another list of rows.
To get a list covering the whole dataset, where each element is a list with the values for Date and Open, you could add something like aapl_date_open = [row[0:2] for row in aapl_stock['data']].
If you are new to Python, I seriously recommend looking at list slice notation; a quick intro can be found here
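To see how row slicing and column selection differ, here is a small sketch on made-up data in the same shape (a list of column names plus a list of rows). The numbers are invented and are not real Quandl values.

```python
# Made-up sample in the shape described above; values are not real prices.
aapl_stock = {
    "column_names": ["Date", "Open", "High"],
    "data": [
        ["2015-01-05", 108.29, 108.65],
        ["2015-01-02", 111.39, 111.44],
        ["2014-12-31", 112.82, 113.13],
    ],
}

# Look the positions up by name instead of hard-coding 0 and 1.
date_idx = aapl_stock["column_names"].index("Date")
open_idx = aapl_stock["column_names"].index("Open")

# Slicing the outer list selects rows; indexing that result picks ONE row.
first_row = aapl_stock["data"][0:10][0]

# A comprehension is needed to pick one column out of every row.
dates = [row[date_idx] for row in aapl_stock["data"]]
date_open = [[row[date_idx], row[open_idx]] for row in aapl_stock["data"]]

print(first_row)
print(dates)
print(date_open)
```

Looking up the index through `column_names` keeps the code working even if the column order differs between datasets.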
I have an elasticsearch index which has 60k elements. I know that by checking the head plugin and I get the same information via Sense (the result is in the lower right corner)
I then wanted to query the same index from Python, in two different ways: via a direct requests call and using the elasticsearch module:
import elasticsearch
import json
import requests
# the requests version
data = {"query": {"match_all": {}}}
r = requests.get('http://elk.example.com:9200/nessus_current/_search', data=json.dumps(data))
print(len(r.json()['hits']['hits']))
# the elasticsearch module version
es = elasticsearch.Elasticsearch(hosts='elk.example.com')
res = es.search(index="nessus_current", body={"query": {"match_all": {}}})
print(len(res['hits']['hits']))
In both cases the result is 10 - far from the expected 60k. The results of the query make sense (the content is what I expect), it is just that there are only a few of them.
I took one of these 10 hits and queried with Sense for its _id to close the loop. It is, as expected, found indeed:
So it looks like the 10 hits are a subset of the whole index, why aren't all elements reported in the Python version of the calls?
10 is the default number of results returned by Elasticsearch. If you want more, specify "size": 100, for example. But be careful: returning all the docs using size is not recommended, as it can bring down your cluster. To get back all the results, use scan and scroll.
Also, I think you want res['hits']['total'] rather than len(res['hits']['hits']) to get the total number of hits; the latter only counts the hits returned in this response.
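The scroll pattern mentioned above boils down to a loop: run the first search, then keep asking for the next batch with the returned scroll ID until a batch comes back empty. The sketch below shows only that loop, with the two network calls passed in as callables and replaced by a stand-in that serves 25 made-up documents. The helper names are hypothetical.

```python
def scroll_all(first_search, next_page):
    """Collect every hit from a scroll-style API.

    first_search()      -> first response dict
    next_page(scroll_id) -> following response dict
    """
    response = first_search()
    hits = []
    # An empty batch signals that the scroll is exhausted.
    while response["hits"]["hits"]:
        hits.extend(response["hits"]["hits"])
        response = next_page(response["_scroll_id"])
    return hits


# Stand-in for the cluster: 25 made-up documents, served 10 at a time.
_DOCS = [{"_id": str(i)} for i in range(25)]


def _page(start, size=10):
    return {
        "_scroll_id": str(start + size),
        "hits": {"total": len(_DOCS), "hits": _DOCS[start:start + size]},
    }


all_hits = scroll_all(lambda: _page(0), lambda scroll_id: _page(int(scroll_id)))
print(len(all_hits))  # 25
```

With the elasticsearch module, the two callables would typically wrap `es.search(..., scroll="2m", size=...)` and `es.scroll(scroll_id=..., scroll="2m")`, though the exact arguments depend on the client version. `elasticsearch.helpers.scan` packages the same loop as a generator.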