Recently I have been reading a stock prices database on Quandl, using an API call to extract the data. But I am really confused by the example I have.
import requests
api_url = 'https://www.quandl.com/api/v1/datasets/WIKI/%s.json' % stock
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=3))
raw_data = session.get(api_url)
Can anyone explain that to me?
1) For api_url, if I copy that URL into a browser, it says 404 not found. So if I want to use another database, how do I prepare this api_url? What does '% stock' mean?
2) Here requests seems to be used to extract the data. What is the format of raw_data? How do I know the column names? How do I extract the columns?
To expand on my comment above:
% stock is a string formatting operation, replacing %s in the preceding string with the value referenced by stock. Further details can be found here
raw_data actually references a Response object (part of the requests module - details found here).
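For instance, a quick illustration of the substitution (and of why pasting the literal %s URL into a browser gives a 404):
stock = 'AAPL'
'https://www.quandl.com/api/v1/datasets/WIKI/%s.json' % stock
# -> 'https://www.quandl.com/api/v1/datasets/WIKI/AAPL.json'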
To expand on your code:
import requests
#Set the stock we are interested in, AAPL is Apple stock code
stock = 'AAPL'
#Your code
api_url = 'https://www.quandl.com/api/v1/datasets/WIKI/%s.json' % stock
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=3))
raw_data = session.get(api_url)
# Probably want to check that requests.Response is 200 - OK here
# to make sure we got the content successfully.
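# One minimal way to do that (a sketch): raise_for_status() raises an HTTPError
# for any 4xx/5xx response, so execution stops here if the request failed.
raw_data.raise_for_status()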
# requests.Response has a function to return json file as python dict
aapl_stock = raw_data.json()
# We can then look at the keys to see what we have access to
aapl_stock.keys()
# column_names seems to describe the individual data points
aapl_stock['column_names']
# A big list of data, let's just look at the first ten points...
aapl_stock['data'][0:10]
Edit to answer question in comment
So aapl_stock['column_names'] shows Date and Open as the first and second values respectively. This means they correspond to positions 0 and 1 in each element of the data.
Therefore, to get the date for each of the first ten items use [row[0] for row in aapl_stock['data'][0:10]], and to get the open value for the first 78 items use [row[1] for row in aapl_stock['data'][0:78]] (indexing a single row directly, e.g. aapl_stock['data'][0][1], gives one value).
To get a list of every value in the dataset, where each element is a list with the values for Date and Open, you could use something like aapl_date_open = [row[0:2] for row in aapl_stock['data']].
If you are new to python I seriously recommend looking at the list slice notation, a quick intro can be found here
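As a tiny illustration of slice notation on a plain list:
letters = ['a', 'b', 'c', 'd', 'e']
letters[0:2]    # ['a', 'b'] - items at positions 0 and 1
letters[2:]     # ['c', 'd', 'e'] - from position 2 to the end
letters[:]      # a shallow copy of the whole list
letters[-1]     # 'e' - the last item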
Related
I'm going crazy trying to get data through an API call using requests and pandas. It looks like it's nested data, but I can't get the data I need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
Above is the API documentation. I'm just trying to keep it simple and get the ItemNumber and QtyRemainingToShip, but I can't even figure out how to access the nested data. I'm trying to use DataFrame to get it, but am just lost. Any help would be appreciated. I keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax
I used the link to grab a sample response from the API documentation page you provided. From the code you provided it looks like you are already able to get the data, and I'm assuming that you have it as a dictionary already.
From what I can tell I don't think you should be using pandas, unless it's some downstream requirement in the task you are doing. But to get the ItemNumber & QtyRemainingToShip you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
#the data_list is only one element long, so grab the first element which is of type dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# similar to the data list the value associated with "SoEstimateItemLineArr" is of type list and has 1 element in it, so we grab the first & only element.
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
Side Note
As a side note, I wouldn't name any variable json because that's also the name of a popular Python library for parsing JSON, so it will be confusing to future readers and will shadow the name if you end up having to import the json library.
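If a DataFrame really is needed downstream, a rough sketch using pandas' json_normalize might look like the following (assuming the response dict is stored in a variable called api_response instead of json, and that its structure matches the sample above):
import pandas as pd

# one row per element of the nested SoEstimateItemLineArr list
df = pd.json_normalize(api_response['Data'], record_path='SoEstimateItemLineArr')
print(df[['ItemNumber', 'QtyRemainingToShip']])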
I'm trying to access the table details to ultimately put into a dataframe and save as a csv with a limited number of rows(the dataset is massive) from the following site: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data
I'm just starting out with web scraping and was practicing on this dataset. I can effectively pull tags like div, but when I try soup.findAll('tr') or td, it returns an empty set.
The table appears to be embedded in different code (see link above), so that's maybe my issue, but I'm still unsure how to access the detail rows, headers, etc. Selenium maybe?
Thanks in advance!
By the looks of it, the website already allows you to export the data:
As it would seem, the original link is:
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data
The .csv download link is:
https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD
The .json link is:
https://data.cityofchicago.org/resource/ijzp-q8t2.json
Therefore you could simply extract the ID of the data, in this case ijzp-q8t2, and replace it on the download links above. Here is the official documentation of their API.
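For example, a sketch of that substitution:
dataset_id = 'ijzp-q8t2'
csv_url = 'https://data.cityofchicago.org/api/views/%s/rows.csv?accessType=DOWNLOAD' % dataset_id
json_url = 'https://data.cityofchicago.org/resource/%s.json' % dataset_id
Alternatively, the dataset can be queried through the Socrata API, for example with the sodapy client: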
import pandas as pd
from sodapy import Socrata
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofchicago.org", None)
# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofchicago.org",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")
# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("ijzp-q8t2", limit=2000)
# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
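Writing the limited result set out to a .csv is then a one-liner (the filename below is just a placeholder):
results_df.to_csv('chicago_crimes_sample.csv', index=False)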
I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01' : '910', 'pf7331' : '11'}
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01' : '910', 'pf7331' : '12'}
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, and not all of the payloads differ only in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I found urisplit, which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python 3 and I used uritools because it appears to be the standards-compliant replacement for urltools.
I fell back on shell script to get pages with wget, which does work, but it is so un-Python-ish that I'm asking here for what to do. I mean, this does work:
import subprocess
for i in hrefs2:
subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
# list to store all url parameters after they're converted to dicts
urldata = []
#iterate over list of params
for param in a:
    data = {}
    # split the string into key value pairs
    for kv in param.split('&'):
        # split the pairs up
        b = kv.split('=')
        # first part is the key, second is the value
        data[b[0]] = b[1]
    # After converting every kv pair in the parameter, add the result to a list.
    urldata.append(data)
You could do this with less code, but I wanted to be clear about what was going on. I'm sure there is already a module out there that does this for you too.
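In fact, the standard library can do this parsing for you. Here is a rough sketch using urllib.parse, assuming the hrefs2 list and base url from the question:
import requests
from urllib.parse import urlsplit, parse_qsl

url = 'http://ku.edu/pls/WP040/'
for href in hrefs2:
    # parse the query string of each href into a payload dict
    payload = dict(parse_qsl(urlsplit(href).query))
    r = requests.get(url, params=payload)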
My code:
from urllib import request
import json
lat = 31.33 ; long = -84.52
webpage = "http://www.mapquestapi.com/geocoding/v1/reverse?key=MY_KEY&callback=renderReverse&location={},{}".format(lat, long)
response = request.urlopen(webpage)
json_data = response.read().decode(response.info().get_param('charset') or 'utf-8')
data = json.loads(json_data)
print(data)
This gives me following error:
ValueError: Expecting value: line 1 column 1 (char 0)
I am trying to read the county and state from the MapQuest reverse geocoding API. The response looks like this:
renderReverse({"info":{"statuscode":0,"copyright":{"text":"\u00A9 2015 MapQuest, Inc.","imageUrl":"http://api.mqcdn.com/res/mqlogo.gif","imageAltText":"\u00A9 2015 MapQuest, Inc."},"messages":[]},"options":{"maxResults":1,"thumbMaps":true,"ignoreLatLngInput":false},"results":[{"providedLocation":{"latLng":{"lat":32.841516,"lng":-83.660992}},"locations":[{"street":"562 Patterson St","adminArea6":"","adminArea6Type":"Neighborhood","adminArea5":"Macon","adminArea5Type":"City","adminArea4":"Bibb","adminArea4Type":"County","adminArea3":"GA","adminArea3Type":"State","adminArea1":"US","adminArea1Type":"Country","postalCode":"31204-3508","geocodeQualityCode":"P1AAA","geocodeQuality":"POINT","dragPoint":false,"sideOfStreet":"L","linkId":"0","unknownInput":"","type":"s","latLng":{"lat":32.84117,"lng":-83.660973},"displayLatLng":{"lat":32.84117,"lng":-83.660973},"mapUrl":"http://www.mapquestapi.com/staticmap/v4/getmap?key=MY_KEY&type=map&size=225,160&pois=purple-1,32.84117,-83.660973,0,0,|¢er=32.84117,-83.660973&zoom=15&rand=-189494136"}]}]})
How do I convert this string to a dict from which I can query using a key? Any help would be appreciated. Thanks!
First, get rid of the callback parameter in your URL since that's what is causing the response to be wrapped in renderReverse()
webpage = "http://www.mapquestapi.com/geocoding/v1/reverse?key=MY_KEY&location={},{}".format(lat, long)
That will give you valid JSON, which should work with the json.loads function you call. At this point you can interact with data like a dictionary, getting the county and state names by their keys. The way MapQuest structures its JSON is a little odd, so it looks like you may have to do some matching on the *Type fields to find the correct key name. In this case, 'adminArea4Type' is set to 'County', so you want to access the 'adminArea4' key to return the county name.
data['results'][0]['locations'][0]['adminArea4']
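By the same logic, 'adminArea3Type' is 'State' in the sample response, so the state abbreviation should come from 'adminArea3':
data['results'][0]['locations'][0]['adminArea3']   # 'GA' in the sample above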
I am trying to make an API query in Python to return a list of districts and their unique IDs in Indonesia. I was able to pull data from the API and store it as JSON.
However, the API query only returns the first 100 results. In this particular example, I know that Indonesia has 700+ districts, so it would take at least 7 queries (but since I don't always know how many districts are in a country, I'd like to be able to run a loop that keeps querying until all of the results are exhausted).
Ideally, I would like to make repeated queries to the API, and append the results of each query into one variable until I have extracted the full list of districts.
e.g.
Query 1: store results 1 - 100 in variable X
Query 2: append results 101 - 200 to variable X.
repeat until all results are stored.
For more information on the API, see here: https://gtmp.linkssystem.org/docs/districts
Here is the code I started:
from urllib2 import Request, urlopen, URLError
request = Request('https://gtmp.linkssystem.org/api/districts?admin0=indonesia')
response = urlopen(request)
geo = response.read()
import json
geolist = json.loads(geo)
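Something along these lines is what I imagine (just a sketch; I'm assuming a hypothetical offset query parameter for paging, which I have not confirmed in the API docs):
from urllib2 import Request, urlopen
import json

geolist = []
offset = 0
while True:
    # 'offset' is a guess at the paging parameter name - check the docs linked above
    url = 'https://gtmp.linkssystem.org/api/districts?admin0=indonesia&offset=%d' % offset
    page = json.loads(urlopen(Request(url)).read())
    if not page:  # stop once a query comes back empty
        break
    geolist.extend(page)  # append this batch of results to the running list
    offset += len(page)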
If this affects anything, I was hoping to write all of the results to a .csv like so:
import csv
with open('C:\\Users\\Owner\\Desktop\\geo.csv', 'wb') as fp:
    a = csv.writer(fp)
    a.writerow(["admin0id", "admin0","admin1id","admin1","admin2id","admin2","admin3id","admin3"])
    for e in geolist:
        a.writerow([e["admin0id"],
                    e["admin0"],
                    e["admin1id"],
                    e["admin1"],
                    e["admin2id"],
                    e["admin2"],
                    e["admin3id"],
                    e["admin3"]])