How to parse xml from requests?

I looked at a few other answers but couldn't find a solution which worked for me.
Here's my complete code, which you can run without any API key:
import requests
r = requests.get('http://api.worldbank.org/v2/country/GBR/indicator/NY.GDP.MKTP.KD.ZG')
If I print r.text, I get a string that starts with
'\ufeff<?xml version="1.0" encoding="utf-8"?>\r\n<wb:data page="1" pages="2" per_page="50" total="60" sourceid="2" lastupdated="2019-12-20" xmlns:wb="http://www.worldbank.org">\r\n <wb:data>\r\n <wb:indicator id="NY.GDP.MKTP.KD.ZG">GDP growth (annual %)</wb:indicator>\r\n <wb:country id="GB">United Kingdom</wb:country>\r\n <wb:countryiso3code>GBR</wb:countryiso3code>\r\n <wb:date>2019</wb:date>\r\n`
and goes on for a while.
One way of getting what I'd like out of it (which, as far as I understand, is heavily discouraged) is to use regex:
import re
import pandas as pd
pd.DataFrame(
    re.findall(
        r"<wb:date>(\d{4})</wb:date>\r\n <wb:value>((?:\d\.)?\d{14})", r.text
    ),
    columns=["date", "value"],
)
What is a "proper" way of parsing this xml output? My final objective is to have a DataFrame with date and value columns, such as
date value
0 2018 1.38567356958762
1 2017 1.89207703836381
2 2016 1.91815510596298
3 2015 2.35552430595799
...

How about the following:
Decode the response:
decoded_response = response.content.decode('utf-8')
Convert to json:
response_json = json.loads(json.dumps(xmltodict.parse(decoded_response)))
Read into DataFrame:
pd.read_json(response_json)
Then you just need to play with the orient and such
(docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)

You can use the ElementTree API:
import requests
from xml.etree import ElementTree
response = requests.get('http://api.worldbank.org/v2/country/GBR/indicator/NY.GDP.MKTP.KD.ZG')
tree = ElementTree.fromstring(response.content)
print(tree)
But you will have to explore the structure to get what you want.
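As a rough sketch of that exploration, assuming the response keeps the shape shown in the question (a root wb:data element containing one wb:data child per year, all in the http://www.worldbank.org namespace), the date and value fields could be pulled out as below. SAMPLE_XML is a hand-written stand-in for response.content, and extract_rows is a hypothetical helper name, not part of any library.

```python
from xml.etree import ElementTree

# Hand-written stand-in for response.content (bytes), shaped like the
# output quoted in the question. Not real API output.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<wb:data page="1" pages="2" per_page="50" total="60" xmlns:wb="http://www.worldbank.org">
  <wb:data>
    <wb:date>2019</wb:date>
    <wb:value />
  </wb:data>
  <wb:data>
    <wb:date>2018</wb:date>
    <wb:value>1.38567356958762</wb:value>
  </wb:data>
</wb:data>"""

# Map the "wb" prefix to its namespace URI so it can be used in paths.
NAMESPACES = {"wb": "http://www.worldbank.org"}


def extract_rows(xml_bytes):
    """Return one {'date', 'value'} dict per wb:data record (hypothetical helper)."""
    root = ElementTree.fromstring(xml_bytes)
    rows = []
    # Direct children named wb:data are the yearly records.
    for record in root.findall("wb:data", NAMESPACES):
        value_text = record.findtext("wb:value", namespaces=NAMESPACES)
        rows.append({
            "date": record.findtext("wb:date", namespaces=NAMESPACES),
            # An empty <wb:value /> yields "", which is mapped to None here.
            "value": float(value_text) if value_text else None,
        })
    return rows


rows = extract_rows(SAMPLE_XML)
print(rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) to get the date and value columns.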

Full code I ended up using (based on Omri's excellent answer):
import requests
import xmltodict
import json
import pandas as pd
r = requests.get("http://api.worldbank.org/v2/country/GBR/indicator/NY.GDP.MKTP.KD.ZG")
decoded_response = r.content.decode("utf-8")
response_json = json.loads(json.dumps(xmltodict.parse(decoded_response)))
pd.DataFrame(response_json["wb:data"]["wb:data"])[["wb:date", "wb:value"]].rename(
    columns=lambda x: x.replace("wb:", "")
)
which gives
date value
0 2019 None
1 2018 1.38567356958762
2 2017 1.89207703836381
3 2016 1.91815510596298
4 2015 2.35552430595799
...

Related

Extracting chosen information from URL results into a dataframe

I would like to create a dataframe by pulling only certain information from this website.
https://www.stockrover.com/build/production/Research/tail.js?1644930560
I would like to pull all the entries like this one. ["0005.HK","HSBC HOLDINGS","",""]
Another problem: suppose I want only the first 20,000 lines, which hold the stock information, and there is other information after line 20,000 that I don't want included in the dataframe.
To summarize, could someone show me how to pull out just the information I'm trying to extract and create a dataframe with those results if this is possible.
A sample of the website results
function getStocksLibraryArray(){return[["0005.HK","HSBC HOLDINGS","",""],["0006.HK","Power Assets Holdings Ltd","",""],["000660.KS","SK hynix","",""],["004370.KS","Nongshim","",""],["005930.KS","Samsung Electroni","",""],["0123.HK","YUEXIU PROPERTY","",""],["0336.HK","HUABAO INTL","",""],["0408.HK","YIP'S CHEMICAL","",""],["0522.HK","ASM PACIFIC","",""],["0688.HK","CHINA OVERSEAS","",""],["0700.HK","TENCENT","",""],["0762.HK","CHINA UNICOM","",""],["0808.HK","PROSPERITY REIT","",""],["0813.HK","SHIMAO PROPERTY",
Code to pull all lines including ones not wanted
import requests
import pandas as pd
url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
payload = {}
headers = {}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)

Use a regex to extract the array, then literal_eval to convert the string into a Python object:
import re
from ast import literal_eval
import pandas as pd
import requests
url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
response = requests.request("GET", url, headers={}, data={})
regex_ = re.compile(r"getStocksLibraryArray\(\)\{return(.+?)}", re.DOTALL)
print(pd.DataFrame(literal_eval(regex_.search(response.text).group(1))))
0 1 2 3
0 0005.HK HSBC HOLDINGS
1 0006.HK Power Assets Holdings Ltd
2 000660.KS SK hynix
3 004370.KS Nongshim
4 005930.KS Samsung Electroni
... ... ... ... ..
21426 ZZHGF ZhongAn Online P&C _INSUP
21427 ZZHGY ZhongAn Online P&C _INSUP
21428 ZZLL ZZLL Information Tech _INTEC
21429 ZZZ.TO Sleep Country Canada _SPECR
21430 ZZZOF Zinc One Resources _OTHEI
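The same regex-plus-literal_eval idea can be tried offline on a small sample. The sketch below assumes the array contains only string literals (so literal_eval can read it) and no "}" characters before the function's closing brace. SAMPLE_JS is a hand-written fragment based on the sample in the question, and extract_stock_rows with its limit parameter is a hypothetical helper for keeping only the first N entries.

```python
import re
from ast import literal_eval

# Hand-written fragment modelled on the sample in the question, followed by
# an unrelated function to stand in for the unwanted trailing content.
SAMPLE_JS = (
    'function getStocksLibraryArray(){return[["0005.HK","HSBC HOLDINGS","",""],'
    '["0006.HK","Power Assets Holdings Ltd","",""],'
    '["0408.HK","YIP\'S CHEMICAL","",""]]}'
    'function somethingElse(){return 1}'
)

# Capture everything between "return" and the first closing brace.
ARRAY_PATTERN = re.compile(r"getStocksLibraryArray\(\)\{return(.+?)\}", re.DOTALL)


def extract_stock_rows(js_text, limit=None):
    """Return the stock rows as a list of lists (hypothetical helper)."""
    match = ARRAY_PATTERN.search(js_text)
    if match is None:
        raise ValueError("getStocksLibraryArray() was not found in the text")
    # The captured text is a nested list of string literals, which is also
    # valid Python literal syntax.
    rows = literal_eval(match.group(1))
    # limit=None keeps every row; an integer keeps only the first N rows.
    return rows[:limit]


all_rows = extract_stock_rows(SAMPLE_JS)
first_two = extract_stock_rows(SAMPLE_JS, limit=2)
print(first_two)
```

Either list can be handed to pd.DataFrame to get the four unnamed columns shown above.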

Getting data from World Bank API using pandas

I'm trying to obtain a table of data obtaining just the country, year and value from this World Bank API but I can't seem to filter for just the data I want. I've seen that these types of questions have already been asked but all the answers didn't seem to work.
Would really appreciate some help. Thank you!
import requests
import pandas as pd
from bs4 import BeautifulSoup
import json
url ="http://api.worldbank.org/v2/country/{}/indicator/NY.GDP.PCAP.CD?date=2015&format=json"
country = ["DZA","AGO","ARG","AUS","AUT","BEL","BRA","CAN","CHL","CHN","COL","CYP", "CZE","DNK","FIN","FRA","GEO","DEU",
"GRC""HUN","ISL","IND","IDN","IRL","ISR","ITA","JPN","KAZ","KWT","LBN","LIE","MYS","MEX","MCO","MAR","NPL","NLD",
"NZL","NGA","NOR","OMN","PER","PHL","POL","PRT","QAT","ROU","SGP","ZAF","ESP","SWE","CHE","TZA","THA","TUR","UKR",
"GBR","USA","VNM","ZWE"]
html={}
for i in country:
    url_one = url.format(i)
    html[i] = requests.get(url_one).json()
my_values=[]
for i in country:
    value=html[i][1][0]['value']
    my_values.append(value)
Edit
My data currently looks like this. I'm trying to extract the country name, which is in {'country': {'id': 'AO', 'value': 'Angola'}}, along with the 'date' and the 'value'.
Edit 2
I got the data I'm looking for, but each entry is repeated twice.

Note: I assumed it would be useful to store the information for all years at once rather than for a single year, since that lets you simply filter later during processing. Also, there is a missing "," between the countries "GRC""HUN" in your list.
There are several ways to achieve your goal; here are two of them to point you in the right direction.
Option #1
Pick the information needed from the JSON response, build a reshaped dict and append() it to my_values:
for d in data[1]:
    my_values.append({
        'country': d['country']['value'],
        'date': d['date'],
        'value': d['value']
    })
Example
import requests
import pandas as pd
url = 'http://api.worldbank.org/v2/country/%s/indicator/NY.GDP.PCAP.CD?format=json'
countries = ["DZA","AGO","ARG","AUS","AUT","BEL","BRA","CAN","CHL","CHN","COL","CYP", "CZE","DNK","FIN","FRA","GEO","DEU",
"GRC","HUN","ISL","IND","IDN","IRL","ISR","ITA","JPN","KAZ","KWT","LBN","LIE","MYS","MEX","MCO","MAR","NPL","NLD",
"NZL","NGA","NOR","OMN","PER","PHL","POL","PRT","QAT","ROU","SGP","ZAF","ESP","SWE","CHE","TZA","THA","TUR","UKR",
"GBR","USA","VNM","ZWE"]
my_values = []
for country in countries:
    data = requests.get(url % country).json()
    try:
        for d in data[1]:
            my_values.append({
                'country': d['country']['value'],
                'date': d['date'],
                'value': d['value']
            })
    except Exception as err:
        print(f'[ERROR] country ==> {country} error ==> {err}')
pd.DataFrame(my_values).sort_values(['country', 'date'], ascending=True)
Option #2
Create DataFrames directly from the JSON response, concat them, and make some adjustments to the final DataFrame:
for d in data[1]:
    my_values.append(pd.DataFrame(d))
...
pd.concat(my_values).loc[['value']][['country','date','value']].sort_values(['country', 'date'], ascending=True)
Output
country   date  value
Algeria   1971  341.389
Algeria   1972  442.678
Algeria   1973  554.293
Algeria   1974  818.008
Algeria   1975  936.79
...       ...   ...
Zimbabwe  2016  1464.59
Zimbabwe  2017  1235.19
Zimbabwe  2018  1254.64
Zimbabwe  2019  1316.74
Zimbabwe  2020  1214.51
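The reshaping step in Option #1 can be checked without calling the API. This is a minimal sketch, assuming the payload is a two-element list (paging metadata first, then the list of observations), which is the layout the code above indexes into with data[1]. SAMPLE_RESPONSE is hand-written, reusing two Algeria figures from the output above, and flatten_observations is a hypothetical helper that returns an empty list instead of raising when the second element is missing.

```python
# Hand-written sample in the layout the answer indexes into with data[1].
# Only the keys used below are included; the paging values are placeholders.
SAMPLE_RESPONSE = [
    {"page": 1, "pages": 1, "per_page": 50, "total": 2},
    [
        {"country": {"id": "DZ", "value": "Algeria"}, "date": "1972", "value": 442.678},
        {"country": {"id": "DZ", "value": "Algeria"}, "date": "1971", "value": 341.389},
    ],
]


def flatten_observations(payload):
    """Turn one decoded response into flat dicts (hypothetical helper)."""
    # Guard against payloads that carry no observation list at all.
    if not isinstance(payload, list) or len(payload) < 2 or not payload[1]:
        return []
    return [
        {
            "country": obs["country"]["value"],  # nested dict -> plain name
            "date": obs["date"],
            "value": obs["value"],
        }
        for obs in payload[1]
    ]


rows = sorted(
    flatten_observations(SAMPLE_RESPONSE),
    key=lambda row: (row["country"], row["date"]),
)
print(rows)
```

An explicit guard like this is one alternative to the broad try/except in the example; which of the two is preferable depends on how much detail about failing countries you want to log.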

The pandas read_json method needs a valid JSON string, a path object or a file-like object, which is not what you passed to it.
https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
Try this:
import requests
import pandas as pd
url = "http://api.worldbank.org/v2/country/%s/indicator/NY.GDP.PCAP.CD?date=2015&format=json"
countries = ["DZA","AGO","ARG","AUS","AUT","BEL","BRA","CAN","CHL","CHN","COL","CYP", "CZE","DNK","FIN","FRA","GEO","DEU",
"GRC","HUN","ISL","IND","IDN","IRL","ISR","ITA","JPN","KAZ","KWT","LBN","LIE","MYS","MEX","MCO","MAR","NPL","NLD",
"NZL","NGA","NOR","OMN","PER","PHL","POL","PRT","QAT","ROU","SGP","ZAF","ESP","SWE","CHE","TZA","THA","TUR","UKR",
"GBR","USA","VNM","ZWE"]
datas = []
for country in countries:
    data = requests.get(url % country).json()
    try:
        values = data[1][0]
        datas.append(pd.DataFrame(values))
    except Exception as err:
        print(f"[ERROR] country ==> {country} with error ==> {err}")
df = pd.concat(datas)

I can't correctly visualize a json dataframe from api

I am currently trying to read some data from a public API. It offers several output formats (json, csv, txt, among others); you just change the label in the url (/json, /csv, /txt, ...). The url is as follows:
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/
...
My problem is that when trying to import into the Pandas dataframe it doesn't read the data correctly. I am trying the following alternatives:
import pandas as pd
import requests
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'
r = requests.get(url)
rjson = r.json()
df = pd.json_normalize(rjson)
df['periods']
Also I try to read the data in csv format:
import pandas as pd
import requests
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/'
collisions = pd.read_csv(url, sep='<br>')
collisions.head()
But I don't get good results: the dataframe is not displayed correctly because the 'periods' column is grouped together with all the values, and all the data appears as columns.
What alternative do you recommend trying?
Thank you in advance for your time and help !!
I will be attentive to your answers, regards!

For CSV, you can use StringIO from the io package:
import io
import pandas as pd
import requests
res = requests.get("https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/")
df = pd.read_csv(io.StringIO(res.text.strip().replace("<br>","\n")), engine='python')
df
Output:
Mes/Año Tipo de cambio - promedio del periodo (S/ por US$) - Bancario - Promedio
0 Jul.2018 3.276595
1 Ago.2018 3.288071
2 Sep.2018 3.311325
3 Oct.2018 3.333909
4 Nov.2018 3.374675
5 Dic.2018 3.364026
6 Ene.2019 3.343864
7 Feb.2019 3.321475
8 Mar.2019 3.304690
9 Abr.2019 3.303825
10 May.2019 3.332364
11 Jun.2019 3.325650
12 Jul.2019 3.290214
13 Ago.2019 3.377560
14 Sep.2019 3.357357
15 Oct.2019 3.359762
16 Nov.2019 3.371700
17 Dic.2019 3.355190
18 Ene.2020 3.327364
19 Feb.2020 3.390350
20 Mar.2020 3.491364
21 Abr.2020 3.397500
22 May.2020 3.421150
23 Jun.2020 3.470167
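The key step in that answer is turning the "<br>" separators into real newlines before handing the text to a CSV parser. The sketch below shows only that normalization on a hand-written string, using the standard-library csv module in place of pandas so it runs without extra dependencies. SAMPLE_BODY, its simplified ASCII header names, and parse_br_separated_csv are all assumptions made for illustration, not the API's actual output.

```python
import csv
import io

# Hand-written stand-in for res.text: records separated by "<br>" instead of
# newlines. The header names are simplified placeholders.
SAMPLE_BODY = "period,rate<br>Jul.2018,3.276595<br>Ago.2018,3.288071<br>"


def parse_br_separated_csv(body):
    """Parse '<br>'-separated CSV text into a list of dicts (hypothetical helper)."""
    # Replace the HTML line breaks with newlines so each record is one line.
    normalized = body.strip().replace("<br>", "\n")
    reader = csv.reader(io.StringIO(normalized))
    # Drop any empty rows left over from stray separators.
    rows = [row for row in reader if row]
    header, data_rows = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data_rows]


records = parse_br_separated_csv(SAMPLE_BODY)
print(records)
```

With pandas available, the same normalized text can be wrapped in io.StringIO and passed to pd.read_csv, as in the answer above.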

Sorry, I couldn't find the link about reading JSON with multiple objects inside it. The point is that json.load()/json.loads() can't be used for this kind of format, so raw_decode() has to be used instead.
This code should work:
import pandas as pd
import json
import urllib.request as ur
from pprint import pprint
decoder = json.JSONDecoder()
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'
# read the response and split it into a list of decoded JSON objects
data = []
with ur.urlopen(url) as json_file:
    x = json_file.read().decode()  # decode to convert the bytes into a normal string
while True:
    try:
        # j is the decoded object, n is the index where it ended in x
        j, n = decoder.raw_decode(x)
    except ValueError:
        break
    #print(j)
    data.append(j)
    x = x[n:]
#pprint(data)
# build a list of dictionaries to convert into a dataframe
clean_list = []
for period in data[0]['periods']:
    dict_data = {
        "month_year": period['name'],
        "value": period['values'][0],
    }
    clean_list.append(dict_data)
#print(clean_list)
#pd.options.display.width = 0
df = pd.DataFrame(clean_list)
print(df)
result
month_year value
0 Jul.2018 3.27659523809524
1 Ago.2018 3.28807142857143
2 Sep.2018 3.311325
3 Oct.2018 3.33390909090909
4 Nov.2018 3.374675
5 Dic.2018 3.36402631578947
6 Ene.2019 3.34386363636364
7 Feb.2019 3.321475
8 Mar.2019 3.30469047619048
9 Abr.2019 3.303825
10 May.2019 3.33236363636364
11 Jun.2019 3.32565
12 Jul.2019 3.29021428571428
13 Ago.2019 3.37756
14 Sep.2019 3.35735714285714
15 Oct.2019 3.3597619047619
16 Nov.2019 3.3717
17 Dic.2019 3.35519047619048
18 Ene.2020 3.32736363636364
19 Feb.2020 3.39035
20 Mar.2020 3.49136363636364
21 Abr.2020 3.3975
22 May.2020 3.42115
23 Jun.2020 3.47016666666667
if I somehow found the link again, I'll edit/comment my answer
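To see why raw_decode() helps, here is a small offline sketch of the same idea. It assumes the text holds several JSON documents back to back, optionally separated by whitespace. SAMPLE_TEXT is hand-written (its second object is purely illustrative), and decode_concatenated_json is a hypothetical helper that passes a start index to raw_decode() instead of slicing the string.

```python
import json

# Hand-written text with two JSON documents back to back; json.loads()
# would reject this with an "Extra data" error.
SAMPLE_TEXT = (
    '{"periods": [{"name": "Jul.2018", "values": ["3.27659523809524"]}]}\n'
    '{"note": "second object"}'
)


def decode_concatenated_json(text):
    """Decode every JSON document found in text, in order (hypothetical helper)."""
    decoder = json.JSONDecoder()
    objects = []
    position = 0
    while position < len(text):
        # raw_decode() does not skip leading whitespace, so do it here.
        if text[position].isspace():
            position += 1
            continue
        # Returns the decoded object and the index just past its end.
        obj, position = decoder.raw_decode(text, position)
        objects.append(obj)
    return objects


objects = decode_concatenated_json(SAMPLE_TEXT)
print(objects[0]["periods"][0])
```

Unlike the loop in the answer, this version lets a malformed document raise json.JSONDecodeError rather than silently stopping, which may or may not be what you want.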

How to scrape NHL skater stats using Xpath?

I am trying to scrape the stats for 2017/2018 NHL skaters. I have started on the code but I am running into issues parsing the data and printing to excel.
Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml.html import fromstring
import pandas as pd
#connect to url
url = "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"
#remove HTML comment markup
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)
#setting up excel columns
columns = ("names", "gp", "g", "s", "team")
df = pd.DataFrame(columns=columns)
#attempt at parsing data while using loop
for nhl, skater_row in enumerate(tree.xpath('//table[contains(@class,"stats_table")]/tr')):
    names = pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text
    gp = skater_row.xpath('.//td[@data-stat="games_played"]/text()')[0]
    g = skater_row.xpath('.//td[@data-stat="goals"]/text()')[0]
    s = skater_row.xpath('.//td[@data-stat="shots"]/text()')[0]
    try:
        team = skater_row.xpath('.//td[@data-stat="team_id"]/a')[0].text
    # create pandas dataframe to export data to excel
    df.loc[nhl] = (names, team, gp, g, s)
#write data to excel
writer = pd.ExcelWriter('NHL skater.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
Can someone please explain how to parse this data? Are there any tips you have to help write the Xpath so I can loop through the data?
I am having trouble writing the line:
for nhl, skater_row in enumerate(tree.xpath...
How did you find the Xpath? Did you use Xpath Finder or Xpath Helper?
Also, I ran into an error with the line:
df.loc[nhl] = (names, team, gp, g, s)
It shows an invalid syntax for df.
I am new to web scraping and have no prior experience coding. Any help would be greatly appreciated. Thanks in advance for your time!

If you still want to stick to XPath and get required data only instead of filtering complete data, you can try below:
for row in tree.xpath('//table[@id="stats"]/tbody/tr[not(@class="thead")]'):
    name = row.xpath('.//td[@data-stat="player"]')[0].text_content()
    gp = row.xpath('.//td[@data-stat="games_played"]')[0].text_content()
    g = row.xpath('.//td[@data-stat="goals"]')[0].text_content()
    s = row.xpath('.//td[@data-stat="shots"]')[0].text_content()
    team = row.xpath('.//td[@data-stat="team_id"]')[0].text_content()
Output of print(name, gp, g, s, team):
Justin Abdelkader 75 13 110 DET
Pontus Aberg 53 4 70 TOT
Pontus Aberg 37 2 39 NSH
Pontus Aberg 16 2 31 EDM
Noel Acciari 60 10 66 BOS
Kenny Agostino 5 0 11 BOS
Sebastian Aho 78 29 200 CAR
...
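To experiment with these attribute predicates without fetching the page, the sketch below runs similar paths against a small hand-written table. It swaps lxml for the standard-library ElementTree, which supports only a subset of XPath (no not(), no text_content()), so the header rows are skipped in Python and the text is joined with itertext(). SAMPLE_HTML must be well-formed XML for this to work, which real pages often are not; extract_skaters and cell_text are hypothetical helpers.

```python
from xml.etree import ElementTree

# Hand-written, well-formed fragment that imitates the table layout the
# answer targets. Values are taken from the output shown above.
SAMPLE_HTML = """<div>
<table id="stats">
  <tbody>
    <tr>
      <td data-stat="player"><a href="#">Justin Abdelkader</a></td>
      <td data-stat="team_id">DET</td>
      <td data-stat="games_played">75</td>
      <td data-stat="goals">13</td>
      <td data-stat="shots">110</td>
    </tr>
    <tr class="thead"><th>Player</th></tr>
    <tr>
      <td data-stat="player"><a href="#">Pontus Aberg</a></td>
      <td data-stat="team_id">TOT</td>
      <td data-stat="games_played">53</td>
      <td data-stat="goals">4</td>
      <td data-stat="shots">70</td>
    </tr>
  </tbody>
</table>
</div>"""

FIELDS = ("player", "games_played", "goals", "shots", "team_id")


def cell_text(row, stat):
    """Return the full text of the cell whose data-stat attribute equals stat."""
    cell = row.find(f"td[@data-stat='{stat}']")
    # itertext() also collects text nested inside child tags such as <a>.
    return "".join(cell.itertext()).strip()


def extract_skaters(markup):
    """Return one (name, gp, g, s, team) tuple per data row (hypothetical helper)."""
    root = ElementTree.fromstring(markup)
    skaters = []
    for row in root.findall(".//table[@id='stats']/tbody/tr"):
        # ElementTree has no not(...) predicate, so filter header rows here.
        if row.get("class") == "thead":
            continue
        skaters.append(tuple(cell_text(row, stat) for stat in FIELDS))
    return skaters


skaters = extract_skaters(SAMPLE_HTML)
print(skaters)
```

For the live page, lxml.html as used in the answer is the more forgiving choice, since it tolerates markup that is not well-formed.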

IIUC, it can be done like this with BeautifulSoup and pandas read_html:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.hockey-reference.com/leagues/NHL_2018_skaters.html'
pg = requests.get(url)
bsf = BeautifulSoup(pg.content, 'html5lib')
tables = bsf.findAll('table', attrs={'id':'stats'})
dfs = pd.read_html(tables[0].prettify())
df = dfs[0]
The resulting dataframe will contain all the columns in the table; use pandas to filter down to the columns you need.
# Keeps only columns 1, 3 and 5; any other required columns can be selected the same way.
dff = df[df.columns[[1, 3, 5]]]

Pandas import csv: pandas.io.common.EmptyDataError: No columns to parse from file

The data in my file looks like this:
link availability product_type
1 1016842-5 "GlamWhite Home Bleaching Refill Kit (6% Wasserstoffperoxid)"
1 1045231-4 "Cabernet Sauvignon Burgenland Weingut Erich Scheiblhofer 2011 - 75cl"
1 1045232-4 "Blaufränkisch Ried Oberer Wald Burgenland Ernst Triebaumer 2009 - 75cl"
And I am using pandas trying to read the csv with this:
import csv
import pandas as pd
file_path = '/Users/nasiantalla/Downloads/pdsfeed (2).csv'
data = pd.read_csv(file_path,error_bad_lines=False,skiprows=1066576,sep='\t',lineterminator='\r', encoding='utf-8',header=0,
usecols=['availability', 'link'])
However I still get error:
pandas.io.common.EmptyDataError: No columns to parse from file
I don't understand, I tried all the different encodings, but no luck.. Do you see something I haven't noticed?
Thanks!

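Try without skiprows=1066576, sep='\t' and lineterminator='\r'. If skiprows is larger than the number of lines in the file, nothing is left to parse, which is one way to end up with this error.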
