I would like to create a dataframe by pulling only certain information from this website.
https://www.stockrover.com/build/production/Research/tail.js?1644930560
I would like to pull all the entries like this one. ["0005.HK","HSBC HOLDINGS","",""]
Another problem is, suppose I only want only the first 20,000 lines which is the stock information and there is other information after line 20,000 that I don't want included in the dataframe.
To summarize, could someone show me how to pull out just the information I'm trying to extract and create a dataframe with those results if this is possible.
A sample of the website results
function getStocksLibraryArray(){return[["0005.HK","HSBC HOLDINGS","",""],["0006.HK","Power Assets Holdings Ltd","",""],["000660.KS","SK hynix","",""],["004370.KS","Nongshim","",""],["005930.KS","Samsung Electroni","",""],["0123.HK","YUEXIU PROPERTY","",""],["0336.HK","HUABAO INTL","",""],["0408.HK","YIP'S CHEMICAL","",""],["0522.HK","ASM PACIFIC","",""],["0688.HK","CHINA OVERSEAS","",""],["0700.HK","TENCENT","",""],["0762.HK","CHINA UNICOM","",""],["0808.HK","PROSPERITY REIT","",""],["0813.HK","SHIMAO PROPERTY",
Code to pull all lines including ones not wanted
import requests
import pandas as pd
import requests
url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
payload={}
headers = {}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Use regex to extract the details followed by literal_eval to convert string to python object
import re
from ast import literal_eval
import pandas as pd
import requests
url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
response = requests.request("GET", url, headers={}, data={})
regex_ = re.compile(r"getStocksLibraryArray\(\)\{return(.+?)}", re.DOTALL)
print(pd.DataFrame(literal_eval(regex_.search(response.text).group(1))))
0 1 2 3
0 0005.HK HSBC HOLDINGS
1 0006.HK Power Assets Holdings Ltd
2 000660.KS SK hynix
3 004370.KS Nongshim
4 005930.KS Samsung Electroni
... ... ... ... ..
21426 ZZHGF ZhongAn Online P&C _INSUP
21427 ZZHGY ZhongAn Online P&C _INSUP
21428 ZZLL ZZLL Information Tech _INTEC
21429 ZZZ.TO Sleep Country Canada _SPECR
21430 ZZZOF Zinc One Resources _OTHEI
I'm trying to obtain a table of data obtaining just the country, year and value from this World Bank API but I can't seem to filter for just the data I want. I've seen that these types of questions have already been asked but all the answers didn't seem to work.
Would really appreciate some help. Thank you!
import requests
import pandas as pd
from bs4 import BeautifulSoup
import json
url ="http://api.worldbank.org/v2/country/{}/indicator/NY.GDP.PCAP.CD?date=2015&format=json"
country = ["DZA","AGO","ARG","AUS","AUT","BEL","BRA","CAN","CHL","CHN","COL","CYP", "CZE","DNK","FIN","FRA","GEO","DEU",
"GRC""HUN","ISL","IND","IDN","IRL","ISR","ITA","JPN","KAZ","KWT","LBN","LIE","MYS","MEX","MCO","MAR","NPL","NLD",
"NZL","NGA","NOR","OMN","PER","PHL","POL","PRT","QAT","ROU","SGP","ZAF","ESP","SWE","CHE","TZA","THA","TUR","UKR",
"GBR","USA","VNM","ZWE"]
html={}
for i in country:
url_one = url.format(i)
html[i] = requests.get(url_one).json()
my_values=[]
for i in country:
value=html[i][1][0]['value']
my_values.append(value)
Edit
My data currently looks like this, I'm trying to extract the country name which is in '{'country': {'id': 'AO', 'value': 'Angola''}, the 'date' and the 'value'
Edit 2
Got the data I'm looking for but its repeated twice each
Note: Assumed that it would be great to store information for all the years at once and not only for one year - Enables you to simply filter in later processing. Take a look, there is a missing "," between your countries "GRC""HUN"
There are different options to achieve your goal, just point with two of them in the right direction.
Option #1
Pick information needed from json response, create a reshaped dict and append() it to my_values:
for d in data[1]:
my_values.append({
'country':d['country']['value'],
'date':d['date'],
'value':d['value']
})
Example
import requests
import pandas as pd
url = 'http://api.worldbank.org/v2/country/%s/indicator/NY.GDP.PCAP.CD?format=json'
countries = ["DZA","AGO","ARG","AUS","AUT","BEL","BRA","CAN","CHL","CHN","COL","CYP", "CZE","DNK","FIN","FRA","GEO","DEU",
"GRC","HUN","ISL","IND","IDN","IRL","ISR","ITA","JPN","KAZ","KWT","LBN","LIE","MYS","MEX","MCO","MAR","NPL","NLD",
"NZL","NGA","NOR","OMN","PER","PHL","POL","PRT","QAT","ROU","SGP","ZAF","ESP","SWE","CHE","TZA","THA","TUR","UKR",
"GBR","USA","VNM","ZWE"]
my_values = []
for country in countries:
data = requests.get(url %country).json()
try:
for d in data[1]:
my_values.append({
'country':d['country']['value'],
'date':d['date'],
'value':d['value']
})
except Exception as err:
print(f'[ERROR] country ==> {country} error ==> {err}')
pd.DataFrame(my_values).sort_values(['country', 'date'], ascending=True)
Option #2
Create a dataframes directly from the json response, concat them and make some adjustments on the final dataframe:
for d in data[1]:
my_values.append(pd.DataFrame(d))
...
pd.concat(my_values).loc[['value']][['country','date','value']].sort_values(['country', 'date'], ascending=True)
Output
country
date
value
Algeria
1971
341.389
Algeria
1972
442.678
Algeria
1973
554.293
Algeria
1974
818.008
Algeria
1975
936.79
...
...
...
Zimbabwe
2016
1464.59
Zimbabwe
2017
1235.19
Zimbabwe
2018
1254.64
Zimbabwe
2019
1316.74
Zimbabwe
2020
1214.51
Pandas read_json method needs valid JSON str, path object or file-like object, but you put string.
https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
Try this:
import requests
import pandas as pd
url = "http://api.worldbank.org/v2/country/%s/indicator/NY.GDP.PCAP.CD?date=2015&format=json"
countries = ["DZA","AGO","ARG","AUS","AUT","BEL","BRA","CAN","CHL","CHN","COL","CYP", "CZE","DNK","FIN","FRA","GEO","DEU",
"GRC""HUN","ISL","IND","IDN","IRL","ISR","ITA","JPN","KAZ","KWT","LBN","LIE","MYS","MEX","MCO","MAR","NPL","NLD",
"NZL","NGA","NOR","OMN","PER","PHL","POL","PRT","QAT","ROU","SGP","ZAF","ESP","SWE","CHE","TZA","THA","TUR","UKR",
"GBR","USA","VNM","ZWE"]
datas = []
for country in countries:
data = requests.get(url %country).json()
try:
values = data[1][0]
datas.append(pd.DataFrame(values))
except Exception as err:
print(f"[ERROR] country ==> {country} with error ==> {err}")
df = pd.concat(datas)
Based on this post here, I have the possibility to transform the ISIN to some form ticker symbol with help of library investpy. This transformation is correct for most of united states stocks.
But this symbol itself is not in any case the same as the ticker-symbol I need to call pandas_dataframe. I think more exactly it conforms the RIC-symbol (e.g. look here).
For example if I try the following call:
import investpy
df = investpy.stocks.search_stocks(by='isin', value='DE0006048432')
print(df)
My output is:
country name ... currency symbol
0 germany Henkel VZO ... EUR HNKG_p
1 italy Henkel VZO ... EUR HNKG_p
2 switzerland Henkel VZO ... EUR HNKG_pEUR
but
from pandas_datareader import data as pdr
stock = pdr.DataReader('HNKG_p', data_source="yahoo", start="2021-01-01", end="2021-10-30")
gives me an error.
The correct call I need is:
stock = pdr.DataReader('HEN3.DE', data_source="yahoo", start="2021-01-01", end="2021-10-30")
So my question is:
is there a way to transform an ISIN, maybe WKN or also RIC to the
ticker-symbol yahoo needs for DataReader call.
Or more general
Is there a way to get historical stock data with the knowledge of ISIN, maybe WKN or RIC?
Super ugly and error prone but better than nothing:
import investpy as ip
import yahooquery as yq
from pandas_datareader import data as pdr
company_name = ip.stocks.search_stocks(by='isin', value='DE0006048432')
company_name = company_name["name"][0].split(' ')[0]
symbol = yq.search(company_name)["quotes"][0]["symbol"]
stock = pdr.DataReader(symbol, data_source="yahoo", start="2021-01-01", end="2021-10-30")
You could extend this code using things like fuzzywuzzy and an ordinary testing module with doctest. Do not use this code in production.
I am not even sure if this call keeps the order of the returned values:
yq.search(company_name)["quotes"]
So this code might actually behave randomly but it might give you a direction.
I am trying to pull search results data from an API on a website and put it into a pandas dataframe. I've been able to successfully pull the info from the API into a JSON format.
The next step I'm stuck on is how to loop through the search results on a particular page and then again for each page of results.
Here is what I've tried so far:
#Step 1: Connect to an API
import requests
import json
response_API = requests.get('https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page=1')
#200
#Step 2: Get the data from API
data = response_API.text
#Step 3: Parse the data into JSON format
parse_json = json.loads(data)
#Step 4: Extract data
title = parse_json['results'][0]['title']
pub_date = parse_json['results'][0]['publication_date']
agency = parse_json['results'][0]['agencies'][0]['name']
Here is where I've tried to put this all into a loop:
import numpy as np
import pandas as pd
df=[]
for page in np.arange(0,7):
url = 'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}'.format(page=page)
response_API = requests.get(url)
print(response_API.status_code)
data = response_API.text
parse_json = json.loads(data)
for i in parse_json:
title = parse_json['results'][i]['title']
pub_date = parse_json['results'][i]['publication_date']
agency = parse_json['results'][i]['agencies'][0]['name']
df.append([title,pub_date,agency])
cols = ["Title", "Date","Agency"]
df = pd.DataFrame(df,columns=cols)
I feel like I'm close to the correct answer, but I'm not sure how to move forward from here. I need to iterate through the results where I placed the i's when parsing through the json data, but I get an error that reads, "Type Error: list indices must be integers or slices, not str". I understand I can't put the i's in those spots, but how else am I supposed to iterate through the results?
Any help would be appreciated!
Thank you!
I think you are very close!
import numpy as np
import pandas as pd
import requests
BASE_URL = "'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}"
results = []
for page in range(0, 7):
response = requests.get(BASE_URL.format(page=page))
if response.ok:
resp_json = response.json()
for res in resp_json["results"]:
results.append(
[
res["title"],
res["publication_date"],
[agency["name"] for agency in res["agencies"]]
]
)
df = pd.DataFrame(results, columns=["Title", "Date", "Agencies"])
In this block of code, I used the requests library's built-in .json() method, which can automatically convert a response's text to a JSON dict (if it's in the proper format).
The if response.ok is a little less-verbose way provided by requests to check if the status code is < 400, and can prevent errors that might occur when attempting to parse the response if there was a problem with the HTTP call.
Finally, I'm not sure what data you need exactly for your DataFrame, but each object in the
"results" list from the pages pulled from that website has "agencies" as a list of agencies... wasn't sure if you wanted to drop all that data, so I kept the names as a list.
*Edit:
In case the response objects don't contain the proper keys, we can use the .get() method of Python dictionaries.
# ...snip
for res in resp_json["results"]:
results.append(
[
res.get("title"), # This will return `None` as a default, instead of causing a KeyError
res.get("publication_date"),
[
# Here, get the 'raw_name' or None, in case 'name' key doesn't exist
agency.get("name", agency.get("raw_name"))
for agency in res.get("agencies", [])
]
]
)
Slightly different approach: rather than iterating through the response, read into a dataframe then save what you need. The saves the first agency name in the list.
df_list=[]
for page in np.arange(0,7):
url = 'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}'.format(page=page)
response_API = requests.get(url)
# print(response_API.status_code)
data = response_API.text
parse_json = json.loads(data)
df = pd.json_normalize(parse_json['results'])
df['Agency'] = df['agencies'][0][0]['raw_name']
df_list.append(df[['title', 'publication_date', 'Agency']])
df_final = pd.concat(df_list)
df_final
title publication_date Agency
0 Determination of the Promotion of Economy and ... 2021-09-28 OFFICE OF MANAGEMENT AND BUDGET
1 Corporate Average Fuel Economy Standards for M... 2021-09-03 OFFICE OF MANAGEMENT AND BUDGET
2 Public Hearing for Corporate Average Fuel Econ... 2021-09-14 OFFICE OF MANAGEMENT AND BUDGET
3 Investigation of Urea Ammonium Nitrate Solutio... 2021-09-08 OFFICE OF MANAGEMENT AND BUDGET
4 Call for Nominations To Serve on the National ... 2021-09-08 OFFICE OF MANAGEMENT AND BUDGET
.. ... ... ...
15 Energy Conservation Program: Test Procedure fo... 2021-09-14 DEPARTMENT OF COMMERCE
16 Self-Regulatory Organizations; The Nasdaq Stoc... 2021-09-09 DEPARTMENT OF COMMERCE
17 Regulations To Improve Administration and Enfo... 2021-09-20 DEPARTMENT OF COMMERCE
18 Towing Vessel Firefighting Training 2021-09-01 DEPARTMENT OF COMMERCE
19 Patient Protection and Affordable Care Act; Up... 2021-09-27 DEPARTMENT OF COMMERCE
[140 rows x 3 columns]
I looked at a few other answers but couldn't find a solution which worked for me.
Here's my complete code, which you can run without any API key:
import requests
r = requests.get('http://api.worldbank.org/v2/country/GBR/indicator/NY.GDP.MKTP.KD.ZG')
If I print r.text, I get a string that starts with
'\ufeff<?xml version="1.0" encoding="utf-8"?>\r\n<wb:data page="1" pages="2" per_page="50" total="60" sourceid="2" lastupdated="2019-12-20" xmlns:wb="http://www.worldbank.org">\r\n <wb:data>\r\n <wb:indicator id="NY.GDP.MKTP.KD.ZG">GDP growth (annual %)</wb:indicator>\r\n <wb:country id="GB">United Kingdom</wb:country>\r\n <wb:countryiso3code>GBR</wb:countryiso3code>\r\n <wb:date>2019</wb:date>\r\n`
and goes on for a while.
One way of getting what I'd like out of it (which, as far as I understand, is heavily discouraged) is to use regex:
import regex
import pandas as pd
import re
pd.DataFrame(
re.findall(
r"<wb:date>(\d{4})</wb:date>\r\n <wb:value>((?:\d\.)?\d{14})", r.text
),
columns=["date", "value"],
)
What is a "proper" way of parsing this xml output? My final objective is to have a DataFrame with date and value columns, such as
date value
0 2018 1.38567356958762
1 2017 1.89207703836381
2 2016 1.91815510596298
3 2015 2.35552430595799
...
How about the following:
Decode the response:
decoded_response = response.content.decode('utf-8')
Convert to json:
response_json = json.loads(json.dumps(xmltodict.parse(decoded)))
Read into DataFrame:
pd.read_json(response_json)
Then you just need to play with the orient and such
(docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)
You can use ElementTree API (as described here )
import requests
from xml.etree import ElementTree
response = requests.get('http://api.worldbank.org/v2/country/GBR/indicator/NY.GDP.MKTP.KD.ZG')
tree = ElementTree.fromstring(response.content)
print(tree)
But you will have to explore the structure to get what you want.
Full code I ended up using (based on Omri's excellent answer):
import xmltodict
import json
import pandas as pd
r = requests.get("http://api.worldbank.org/v2/country/GBR/indicator/NY.GDP.MKTP.KD.ZG")
decoded_response = r.content.decode("utf-8")
response_json = json.loads(json.dumps(xmltodict.parse(decoded_response)))
pd.DataFrame(response_json["wb:data"]["wb:data"])[["wb:date", "wb:value"]].rename(
columns=lambda x: x.replace("wb:", "")
)
which gives
date value
0 2019 None
1 2018 1.38567356958762
2 2017 1.89207703836381
3 2016 1.91815510596298
4 2015 2.35552430595799
...