Unable to extract a table from an API using Python

I am trying to extract a table using an API but I am unable to do so; I am fairly sure I am not using it correctly, and any help would be appreciated.
Specifically, I want to extract the Latest_full_data table from this API, but I cannot figure out the right way to do it based on what is described on the website.
This is my code to get the table, but I am getting an error:
import json
import urllib.request

locu_api = 'api_Key'

def locu_search(query):
    api_key = locu_api
    url = 'https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=' + api_key
    response = urllib.request.urlopen(url).read()
    json_obj = str(response, 'utf-8')
    datanew = json.loads(json_obj)
    return datanew
When I do print(datanew), I get the error below. Update: even if I change it to return datanew, the error is still the same.
NameError: name 'datanew' is not defined
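For context, datanew exists only inside locu_search, so referring to it at the top level raises this NameError. A minimal sketch of calling the function and capturing its return value instead (the query argument is unused in the code above, so any string works):
# Call the function and bind its result to a top-level name before printing
datanew = locu_search('WIKI')
print(datanew)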

I had the same issues with urllib before. If possible, try to use requests; in my opinion it is a better-designed library, and it can decode JSON with a single call, so there is no need to run the response through multiple lines. Sample code here:
import requests

locu_api = 'api_Key'

def locu_search():
    url = 'https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=' + locu_api
    return requests.get(url).json()

locu_search()
Edit:
The endpoint that you are calling might not be the correct one. I think you are looking for the following one:
import requests

api_key = 'your_api_key_here'

def locu_search(dataset_code):
    url = f'https://www.quandl.com/api/v3/datasets/WIKI/{dataset_code}/metadata.json?api_key={api_key}'
    req = requests.get(url)
    return req.json()

data = locu_search("FB")
This will return all the metadata for a company, in this case Facebook.
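To quickly inspect what comes back without assuming the exact shape of Quandl's response:
# Peek at the payload's top-level keys; the inner structure is not assumed here
print(list(data))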

Maybe it doesn't apply to your specific problem, but what I normally do is the following:
import json
import requests

def get_values(url):
    response = requests.get(url).text
    values = json.loads(response)
    return values
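A quick usage sketch (the test URL is just a placeholder for any JSON endpoint):
values = get_values('https://jsonplaceholder.typicode.com/todos/1')
print(values)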

Related

How to scrape a multi-page website with Python? [duplicate]

I need to scrape the following table: https://haexpeditions.com/advice/list-of-mount-everest-climbers/
How can I do it with Python?
The site uses this API to fetch the table data, so you could request it from there.
(I used cloudscraper because it's easier than trying to figure out the right set of request headers to avoid getting a 406 error response; and the try...except...print approach, instead of just doing tableData = [dict(...) for row in api_req.json()] directly, helps you understand what went wrong in case of an error, without raising anything that might break program execution.)
import cloudscraper

api_url = 'https://haexpeditions.com/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=1084&target_action=get-all-data&default_sorting=old_first&ninja_table_public_nonce=2491a56a39&chunk_number=0'
api_req = cloudscraper.create_scraper().get(api_url)
try:
    jData, jdMsg = api_req.json(), f'- {len(api_req.json())} rows from'
except Exception as e:
    jData, jdMsg = [], f'failed to get data - {e} \nfrom'
print(api_req.status_code, api_req.reason, jdMsg, api_req.url)

tableData = [dict(
    [(k, v) for k, v in row['value'].items()] +
    [(f'{k}_options', v) for k, v in row['options'].items()]
) for row in jData]
At this point tableData is a list of dictionaries but you can build a DataFrame from it with pandas and save it to a CSV file with .to_csv.
import pandas

pandas.DataFrame(tableData).set_index('number').to_csv('list_of_mount_everest_climbers.csv')
The API URL can be either copied from the browser network logs or extracted from the script tag containing it in the source HTML of the page.
The shorter way would be to just split the HTML string:
import cloudscraper

pg_url = 'https://haexpeditions.com/advice/list-of-mount-everest-climbers/'
pg_req = cloudscraper.create_scraper().get(pg_url)
api_url = pg_req.text.split('"data_request_url":"', 1)[-1].split('"')[0]
api_url = api_url.replace('\\', '')
print(pg_req.status_code, pg_req.reason, 'from', pg_req.url, '\napi_url:', api_url)
However, it's a little risky in case "data_request_url":" appears in any other context in the HTML aside from the one that we want. So, another way would be to parse with bs4 and json.
import json

import cloudscraper
from bs4 import BeautifulSoup

pg_url = 'https://haexpeditions.com/advice/list-of-mount-everest-climbers/'
sel = 'div.footer.footer-inverse>div.bottom-bar+script[type="text/javascript"]'
api_url = 'https://haexpeditions.com/wp-admin/admin-ajax.php...'  ## will be updated
pg_req = cloudscraper.create_scraper().get(pg_url)
jScript = BeautifulSoup(pg_req.content, 'html.parser').select_one(sel)
try:
    sjData = json.loads(jScript.get_text().split('=', 1)[-1].strip())
    api_url = sjData['init_config']['data_request_url']
    auMsg = f'api_url: {api_url}'
except Exception as e:
    auMsg = f'failed to extract API URL - {type(e)} {e}'
print(pg_req.status_code, pg_req.reason, 'from', pg_req.url, '\n' + auMsg)
(I would consider the second method more reliable even though it's a bit longer.)

How to get a link on a website using Python that updates dynamically?

I am trying to download the most recent zip file from the ERCOT website (https://www.ercot.com/mp/data-products/compliance-and-disclosure/?id=NP3-965-ER). However, the link of the zip file has a doclookup id that changes every time, and the id is also populated dynamically. I have tried using BeautifulSoup to get the link, but since it's loaded dynamically, it does not provide any links. Any feedback or solutions would be appreciated.
Using the exposed API:
import json

import pandas as pd
import pendulum
import requests


def get_document_id(type_id: int) -> int:
    url = (
        "https://www.ercot.com/misapp/servlets/IceDocListJsonWS?"
        f"reportTypeId={type_id}&"
        f"_={pendulum.now().format('X')}"
    )
    with requests.Session() as request:
        response = request.get(url, timeout=10)
        if response.status_code != 200:
            print(response.raise_for_status())
        data = json.loads(response.text)
        return (
            pd.json_normalize(data=data["ListDocsByRptTypeRes"], record_path="DocumentList")
            .head(1)["Document.DocID"]
            .squeeze()
        )
id_number = get_document_id(13052)
print(id_number)
869234127
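From there, a possible next step is to request the zip itself. This is a sketch, not tested; the mirDownload servlet URL pattern below is an assumption based on the shape of the doclookup links the page generates:
# Hypothetical download step: the servlet URL pattern is an assumption,
# not confirmed against the ERCOT site.
dl_url = f"https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId={id_number}"
dl = requests.get(dl_url, timeout=30)
dl.raise_for_status()
with open(f"{id_number}.zip", "wb") as f:
    f.write(dl.content)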

Store RDF data into a triplestore via SPARQL endpoint using Python

I am trying to save the data at the following URL as triples in a triple store for future queries. Here is my code:
import re

import requests

url = 'http://gnafld.net/address/?per_page=10&page=7'
response = requests.get(url)
response.raise_for_status()

results = re.findall(r'"Address ID: (GAACT[0-9]+)"', response.text)
address1 = results[0]
a = "http://gnafld.net/address/"
new_url = a + address1
r = requests.get(new_url).content
print(r)
After I run the code above, I get output like this:
[screenshot of the raw RDF response omitted]
My question is: how do I insert this RDF data into a Fuseki server SPARQL endpoint? I tried code like this:
import rdflib
from rdflib.plugins.stores import sparqlstore

# the following SPARQL endpoint is provided by the GNAF website
endpoint = 'http://gnafld.net/sparql'
store = sparqlstore.SPARQLUpdateStore(endpoint)
gs = rdflib.ConjunctiveGraph(store)
gs.open((endpoint, endpoint))
for stmt in r:
    gs.add(stmt)
But it seems that it does not work. How can I fix this problem? Thanks for your help!
The answer you show in the image is in RDF triple format; it is just not pretty-printed.
To store the RDF data in an RDF store you can use RDFLib. Here is an example of how to do that.
If you use a Jena Fuseki server, you should be able to access it from Python just as you would any other SPARQL endpoint.
You may want to see my answer to a related SO question as well.
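A rough, untested sketch of that approach with RDFLib (the local Fuseki dataset URLs and the "turtle" parse format are assumptions; adjust them to your setup):
import requests
from rdflib import Graph, URIRef
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

# Parse the fetched RDF into an in-memory graph first
g = Graph()
g.parse(data=requests.get(new_url).text, format="turtle")

# Open a store against Fuseki's query and update endpoints
# (the dataset name "gnaf" here is hypothetical)
store = SPARQLUpdateStore()
store.open(("http://localhost:3030/gnaf/query",
            "http://localhost:3030/gnaf/update"))

remote = Graph(store, identifier=URIRef("urn:example:gnaf"))
for triple in g:
    remote.add(triple)  # each add is sent to the update endpoint
store.close()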

Python reading JSON from a URL [duplicate]

I am trying to GET a URL using Python and the response is JSON. However, when I run
import urllib2

response = urllib2.urlopen('https://api.instagram.com/v1/tags/pizza/media/XXXXXX')
html = response.read()
print html
The html is of type str, and I am expecting JSON. Is there any way I can capture the response as JSON or a Python dictionary instead of a str?
If the URL is returning valid JSON-encoded data, use the json library to decode that:
import urllib2
import json
response = urllib2.urlopen('https://api.instagram.com/v1/tags/pizza/media/XXXXXX')
data = json.load(response)
print data
import json
import urllib.request

url = 'http://example.com/file.json'
r = urllib.request.urlopen(url)
data = json.loads(r.read().decode(r.info().get_param('charset') or 'utf-8'))
print(data)
References: the urllib documentation for Python 3.4, and HTTPMessage, which is returned by r.info().
"""
Return JSON to webpage
Adding to wonderful answer by #Sanal
For Django 3.4
Adding a working url that returns a json (Source: http://www.jsontest.com/#echo)
"""
import json
import urllib
url = 'http://echo.jsontest.com/insert-key-here/insert-value-here/key/value'
respons = urllib.request.urlopen(url)
data = json.loads(respons.read().decode(respons.info().get_param('charset') or 'utf-8'))
return HttpResponse(json.dumps(data), content_type="application/json")
Be careful about validation and so on, but the straightforward solution is this:
import json
the_dict = json.load(response)
import json
import urllib2

resource_url = 'http://localhost:8080/service/'
response = json.loads(urllib2.urlopen(resource_url).read())
Python 3 standard library one-liner:
load(urlopen(url))
With the imports placed above the code and a URL defined, a runnable version:
from json import load
from urllib.request import urlopen

url = 'https://jsonplaceholder.typicode.com/todos/1'
data = load(urlopen(url))
You can also get JSON by using requests, as below:
import requests
r = requests.get('http://yoursite.com/your-json-pfile.json')
json_response = r.json()
Though I guess it has already been answered, I would like to add my little bit to this:
import json
import urllib2

class Website(object):
    def __init__(self, name):
        self.name = name

    def dump(self):
        self.data = urllib2.urlopen(self.name)
        return self.data

    def convJSON(self):
        data = json.load(self.dump())
        print data

domain = Website("https://example.com")
domain.convJSON()
Note: the object passed to json.load() should support .read(); therefore urllib2.urlopen(self.name).read() would not work. The domain passed should include the protocol, in this case https.
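If you have already read the response into a string, a small alternative sketch is json.loads, which accepts a str rather than a file-like object:
import json
import urllib2

raw = urllib2.urlopen("https://example.com").read()
data = json.loads(raw)  # loads() takes a string; load() takes the file-like response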
This is another, simpler solution to your question:
pd.read_json(json_data)
where json_data is the str output from the following code:
import pandas as pd
from urllib.request import urlopen

response = urlopen("https://data.nasa.gov/resource/y77d-th95.json")
json_data = response.read().decode('utf-8', 'replace')
None of the examples provided here worked for me. They were either for Python 2 (urllib2), or the Python 3 ones returned the error "ImportError: No module named request". I googled the error message, and it apparently requires me to install the module, which is obviously unacceptable for such a simple task.
This code worked for me:
import json
import urllib

data = urllib.urlopen("https://api.github.com/users?since=0").read()
d = json.loads(data)
print(d)
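For Python 3, an equivalent sketch using urllib.request from the standard library (same URL):
import json
import urllib.request

data = urllib.request.urlopen("https://api.github.com/users?since=0").read()
d = json.loads(data)
print(d)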

