Getting date modified of the files - webscraping with beautifulsoup in python - python

I am trying to download all csv files from the following website: https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices . I have managed to do that with the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
csv_links = ['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td.csv a')]
contents = []
for i in csv_links:
req = requests.get(i)
csv_contents = req.content
s=str(csv_contents,'utf-8')
data = StringIO(s)
df=pd.read_csv(data)
contents.append(df)
final_price = pd.concat(contents)
If at all feasible, I'd like to streamline this process.  The file on the website is modified every day, and I don't want to run the script every day to extract all of the files; instead, I simply want to extract files from Yesterday and append the existing files in my folder. And to achieve this, I need to scrape the Date Modified column along with the files URL. I'd be grateful if someone could tell me how to acquire the dates when the files were updated.

You can use nth-child range to filter for columns 1 and 2 of table along with the appropriate row offset within the table initially matched by class.
Then extract url or date (as text) within list comprehensions over split initial returned list (as will alternate column 1 column 2 column 1 etc). Complete the url or convert to actual dates (text) within respective list comprehensions, zip the resultant lists and convert to DataFrame
import requests
from datetime import datetime
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get(
'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices')
soup = bs(r.content, 'lxml')
selected_columns = soup.select('.table tr:nth-child(n+3) td:nth-child(-n+2)')
df = pd.DataFrame(zip(['https://emi.ea.govt.nz' + i.a['href'] for i in selected_columns[0::1]],
[datetime.strptime(i.text, '%d %b %Y').date() for i in selected_columns[1::2]]), columns=['name', 'date_modified'])
print(df)

You can apply list comprehension technique
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'
r = requests.get(url)
print(r)
soup = BeautifulSoup(r.text, 'html.parser')
links=[]
date=[]
csv_links = ['https://emi.ea.govt.nz'+a['href'] for a in soup.select('td[class="expand-column csv"] a')]
modified_date=[ date.text for date in soup.select('td[class="two"] a')[1:]]
links.extend(csv_links)
date.extend(modified_date)
df = pd.DataFrame(data=list(zip(links,date)),columns=['csv_links','modified_date'])
print(df)
Output:
csv_links modified_date
0 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
1 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
2 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
3 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
4 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 22 Mar 2022
.. ... ...
107 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
108 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
109 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
110 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
111 https://emi.ea.govt.nz/Wholesale/Datasets/Fina... 20 Dec 2021
[112 rows x 2 columns]

Related

How can I use pd.read_html for scraping HTML tables with % values?

I'm trying to scrape the 'Profile and investment' table from the following url: https://markets.ft.com/data/funds/tearsheet/summary?s=LU0526609390:EUR, using the following code:
import requests
import pandas as pd
# Define all urls required for data scraping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s='+ df['List']
for url in urls:
r = requests.get(url).content
df = pd.read_html(r)[0]
print (df)
However, when I use the pd.read_html function, I get the following error code:
ValueError: invalid literal for int() with base 10: '100%'
because the table has entries in %. Is there a way to make Pandas accept % values?
My required output is to get a table with the following format:
Fund_ID Fund_type Income_treatment Morningstar category ......
LU0526609390:EUR ... ... ....
IE00BHBX0Z19:EUR ... ... ....
LU1076093779:EUR ... ... ....
LU1116896363:EUR ... ... ....
The issue is the site uses the 'colspan' attribute and uses % instead of with an int. As AsishM mentions in the comments:
browsers are usually more lenient with things like %, but the html spec for colspan clearly mentions this should be an integer. Browsers treat 100% as 100. mdn link. It's not a pandas problem per se.
these should be in the form of an int, and while some browsers will accommodate for that, pandas is specifically wanting it to be the appropriate syntax of:
<td colspan="number">
Ways to approach this is:
Use BeautifulSoup to fix those attributes
Since it's not within the table you actually want to parse, use BeautifulSoup to grab that first table and then don't need to worry about it.
See if the table has a specific attribute and could add that to the .read_html() as a parameter so it grabs only that specific table.
I chose option 2 here:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Define all urls required for data __scraping__ from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s='+ df['List']
results = pd.DataFrame()
for url in urls:
print(url)
r = requests.get(url).content
soup = BeautifulSoup(r, 'html.parser')
table = soup.find('table')
df = pd.read_html(str(table), index_col=0)[0].T
results = results.append(df, sort=False)
results = results.reset_index(drop=True)
print (results)
Output:
print(results.to_string())
0 Fund type Income treatment Morningstar category IMA sector Launch date Price currency Domicile ISIN Manager & start date Investment style (bonds) Investment style (stocks)
0 SICAV Income Global Bond - EUR Hedged -- 06 Aug 2010 GBP Luxembourg LU0526609390 Jonathan Gregory01 Nov 2012Vivek Acharya09 Dec 2015Simon Foster01 Nov 2012 NaN NaN
1 Open Ended Investment Company Income EUR Diversified Bond -- 21 Feb 2014 EUR Ireland IE00BHBX0Z19 Lorenzo Pagani12 May 2017Konstantin Veit01 Jul 2019 Credit Quality: HighInterest-Rate Sensitivity: Mod NaN
2 SICAV Income Eurozone Large-Cap Equity -- 11 Jul 2014 GBP Luxembourg LU1076093779 NaN NaN Market Cap: LargeInvestment Style: Blend
3 SICAV Income EUR Flexible Bond -- 01 Dec 2014 EUR Luxembourg LU1116896363 NaN NaN NaN
Here's how you could use BeautifulSoup to fix those colspan attributes.
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Define all urls required for data scraping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s='+ df['List']
for url in urls:
print(url)
r = requests.get(url).content
soup = BeautifulSoup(r, 'html.parser')
all_colspan = soup.find_all(attrs={'colspan':True})
for colspan in all_colspan:
colspan.attrs['colspan'] = colspan.attrs['colspan'].replace('%', '')
df = pd.read_html(str(soup))

I can't correctly visualize a json dataframe from api

I am currently trying to read some data from a public API. It has different ways of reading (json, csv, txt, among others), just change the label in the url (/ json, / csv, / txt ...). The url is as follows:
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/
...
My problem is that when trying to import into the Pandas dataframe it doesn't read the data correctly. I am trying the following alternatives:
import pandas as pd
import requests
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'
r = requests.get(url)
rjson = r.json()
df= json_normalize(rjson)
df['periods']
Also I try to read the data in csv format:
import pandas as pd
import requests
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/'
collisions = pd.read_csv(url, sep='<br>')
collisions.head()
But I don't get good results; the dataframe cannot be visualized correctly since the 'periods' column is grouped with all the values ...
the output is displayed as follows:
all data appears as columns: /
Here is an example of how the data is displayed correctly:
What alternative do you recommend trying?
Thank you in advance for your time and help !!
I will be attentive to your answers, regards!
For csv you can use StringIO from io package
In [20]: import requests
In [21]: res = requests.get("https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/")
In [22]: import pandas as pd
In [23]: import io
In [24]: df = pd.read_csv(io.StringIO(res.text.strip().replace("<br>","\n")), engine='python')
In [25]: df
Out[25]:
Mes/Año Tipo de cambio - promedio del periodo (S/ por US$) - Bancario - Promedio
0 Jul.2018 3.276595
1 Ago.2018 3.288071
2 Sep.2018 3.311325
3 Oct.2018 3.333909
4 Nov.2018 3.374675
5 Dic.2018 3.364026
6 Ene.2019 3.343864
7 Feb.2019 3.321475
8 Mar.2019 3.304690
9 Abr.2019 3.303825
10 May.2019 3.332364
11 Jun.2019 3.325650
12 Jul.2019 3.290214
13 Ago.2019 3.377560
14 Sep.2019 3.357357
15 Oct.2019 3.359762
16 Nov.2019 3.371700
17 Dic.2019 3.355190
18 Ene.2020 3.327364
19 Feb.2020 3.390350
20 Mar.2020 3.491364
21 Abr.2020 3.397500
22 May.2020 3.421150
23 Jun.2020 3.470167
erh, sorry couldnt find the link for the read json with multiple objects inside it. the thing is we cant use load/s for this kind of format. so have to use raw_decode() instead
this code should work
import pandas as pd
import json
import urllib.request as ur
from pprint import pprint
d = json.JSONDecoder()
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'
#reading and transforming json into list of dictionaries
data = []
with ur.urlopen(url) as json_file:
x = json_file.read().decode() # decode to convert bytes string into normal string
while True:
try:
j, n = d.raw_decode(x)
except ValueError:
break
#print(j)
data.append(j)
x = x[n:]
#pprint(data)
#creating list of dictionaries to convert into dataframe
clean_list = []
for i, d in enumerate(data[0]['periods']):
dict_data = {
"month_year": d['name'],
"value": d['values'][0],
}
clean_list.append(dict_data)
#print(clean_list)
#pd.options.display.width = 0
df = pd.DataFrame(clean_list)
print(df)
result
month_year value
0 Jul.2018 3.27659523809524
1 Ago.2018 3.28807142857143
2 Sep.2018 3.311325
3 Oct.2018 3.33390909090909
4 Nov.2018 3.374675
5 Dic.2018 3.36402631578947
6 Ene.2019 3.34386363636364
7 Feb.2019 3.321475
8 Mar.2019 3.30469047619048
9 Abr.2019 3.303825
10 May.2019 3.33236363636364
11 Jun.2019 3.32565
12 Jul.2019 3.29021428571428
13 Ago.2019 3.37756
14 Sep.2019 3.35735714285714
15 Oct.2019 3.3597619047619
16 Nov.2019 3.3717
17 Dic.2019 3.35519047619048
18 Ene.2020 3.32736363636364
19 Feb.2020 3.39035
20 Mar.2020 3.49136363636364
21 Abr.2020 3.3975
22 May.2020 3.42115
23 Jun.2020 3.47016666666667
if I somehow found the link again, I'll edit/comment my answer

How do you extract a HTML table and add a new column with constant values from an earlier <strong> tag?

I'm attempting to extract a series of tables from an HTML document and append a new column with a constant value from a tag being used as a header. The idea would then be to make this new three column table a dataframe. below is the code i've come up with so far. I.e. each table would have a third column where all the row values would equal either AGO, DPK, ATK, or PMS depending which header precedes the series of tables. Would be grateful for any help as i'm new to python and HTML. Thanks a mill!
import pandas as pd
from bs4 import BeautifulSoup
from robobrowser import RoboBrowser
br = RoboBrowser()
br.open("https://oilpriceng.net/03-09-2019")
table = br.find_all('td', class_='vc_table_cell')
for element in table:
data = element.find('span', class_='vc_table_content')
prod_name = br.find_all('strong')
ago = prod_name[0].text
dpk = prod_name[1].text
atk = prod_name[2].text
pms = prod_name[3].text
if br.find('strong').text == ago:
data.append(ago.text)
elif br.find('strong').text == dpk:
data.append(dpk.text)
elif br.find('strong').text == atk:
data.append(atk.text)
elif br.find('strong').text == pms:
data.append(pms.text)
print(data.text)
df = pd.DataFrame(data)
The result i'm hoping for is to go from this
AGO
Enterprise Price
Coy A $0.5/L
Coy B $0.6/L
Coy C $0.7/L
to the new table below as a dataframe in Pandas
Enterprise Price Product
Coy A $0.5/L AGO
Coy B $0.6/L AGO
Coy C $0.7/L AGO
and to repeat the same thing for other tables with DPK, ATK and PMS information
I hope I understood your question right. This script will scrape all tables found in the page into the dataframe and save it to csv file:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://oilpriceng.net/03-09-2019/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data, last = {'Enterprise':[], 'Price':[], 'Product':[]}, ''
for tag in soup.select('h1 strong, tr:has(td.vc_table_cell)'):
if tag.name == 'strong':
last = tag.get_text(strip=True)
else:
a, b = tag.select('td')
a, b = a.get_text(strip=True), b.get_text(strip=True)
if a and b != 'DEPOT PRICE':
data['Enterprise'].append(a)
data['Price'].append(b)
data['Product'].append(last)
df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv')
Prints:
Enterprise Price Product
0 AVIDOR PH ₦190.0 AGO
1 SHORELINK AGO
2 BULK STRATEGIC PH ₦190.0 AGO
3 TSL AGO
4 MASTERS AGO
.. ... ... ...
165 CHIPET ₦132.0 PMS
166 BOND PMS
167 RAIN OIL PMS
168 MENJ ₦133.0 PMS
169 NIPCO ₦ 2,9000,000 LPG
[170 rows x 3 columns]
The data.csv (screenshot from LibreOffice):

How to parse xml from requests?

I looked at a few other answers but couldn't find a solution which worked for me.
Here's my complete code, which you can run without any API key:
import requests
r = requests.get('http://api.worldbank.org/v2/country/GBR/indicator/NY.GDP.MKTP.KD.ZG')
If I print r.text, I get a string that starts with
'\ufeff<?xml version="1.0" encoding="utf-8"?>\r\n<wb:data page="1" pages="2" per_page="50" total="60" sourceid="2" lastupdated="2019-12-20" xmlns:wb="http://www.worldbank.org">\r\n <wb:data>\r\n <wb:indicator id="NY.GDP.MKTP.KD.ZG">GDP growth (annual %)</wb:indicator>\r\n <wb:country id="GB">United Kingdom</wb:country>\r\n <wb:countryiso3code>GBR</wb:countryiso3code>\r\n <wb:date>2019</wb:date>\r\n`
and goes on for a while.
One way of getting what I'd like out of it (which, as far as I understand, is heavily discouraged) is to use regex:
import regex
import pandas as pd
import re
pd.DataFrame(
re.findall(
r"<wb:date>(\d{4})</wb:date>\r\n <wb:value>((?:\d\.)?\d{14})", r.text
),
columns=["date", "value"],
)
What is a "proper" way of parsing this xml output? My final objective is to have a DataFrame with date and value columns, such as
date value
0 2018 1.38567356958762
1 2017 1.89207703836381
2 2016 1.91815510596298
3 2015 2.35552430595799
...
How about the following:
Decode the response:
decoded_response = response.content.decode('utf-8')
Convert to json:
response_json = json.loads(json.dumps(xmltodict.parse(decoded)))
Read into DataFrame:
pd.read_json(response_json)
Then you just need to play with the orient and such
(docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)
You can use ElementTree API (as described here )
import requests
from xml.etree import ElementTree
response = requests.get('http://api.worldbank.org/v2/country/GBR/indicator/NY.GDP.MKTP.KD.ZG')
tree = ElementTree.fromstring(response.content)
print(tree)
But you will have to explore the structure to get what you want.
Full code I ended up using (based on Omri's excellent answer):
import xmltodict
import json
import pandas as pd
r = requests.get("http://api.worldbank.org/v2/country/GBR/indicator/NY.GDP.MKTP.KD.ZG")
decoded_response = r.content.decode("utf-8")
response_json = json.loads(json.dumps(xmltodict.parse(decoded_response)))
pd.DataFrame(response_json["wb:data"]["wb:data"])[["wb:date", "wb:value"]].rename(
columns=lambda x: x.replace("wb:", "")
)
which gives
date value
0 2019 None
1 2018 1.38567356958762
2 2017 1.89207703836381
3 2016 1.91815510596298
4 2015 2.35552430595799
...

Get content of table in BeautifulSoup

I have the following table on a website which I am extracting with BeautifulSoup
This is the url (I have also attached a picture
Ideally I would like to have each company in one row in csv however I am getting it in different rows. Please see picture attached.
I would like it to have it like in field "D" but I am getting it in A1,A2,A3...
This is the code I am using to extract:
def _writeInCSV(text):
print "Writing in CSV File"
with open('sara.csv', 'wb') as csvfile:
#spamwriter = csv.writer(csvfile, delimiter='\t',quotechar='\n', quoting=csv.QUOTE_MINIMAL)
spamwriter = csv.writer(csvfile, delimiter='\t',quotechar="\n")
for item in text:
spamwriter.writerow([item])
read_list=[]
initial_list=[]
url="http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
r=requests.get(url)
soup = BeautifulSoup(r._content, "html.parser")
#gdata_even=soup.find_all("td", {"class":"ms-rteTableEvenRow-3"})
gdata_even=soup.find_all("td", {"class":"ms-rteTable-default"})
for item in gdata_even:
print item.text.encode("utf-8")
initial_list.append(item.text.encode("utf-8"))
print ""
_writeInCSV(initial_list)
Can someone help please ?
Here is the idea:
read the header cells from the table
read all the other rows from the table
zip all the data row cells with headers producing a list of dictionaries
use csv.DictWriter() to dump to csv
Implementation:
import csv
from pprint import pprint
from bs4 import BeautifulSoup
import requests
url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
rows = soup.select("table.ms-rteTable-default tr")
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("td")]
data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")]))
for row in rows[1:]]
# see what the data looks like at this point
pprint(data)
with open('sara.csv', 'wb') as csvfile:
spamwriter = csv.DictWriter(csvfile, headers, delimiter='\t', quotechar="\n")
for row in data:
spamwriter.writerow(row)
Since #alecxe has already provided an amazing answer, here's another take using the pandas library.
import pandas as pd
url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
tables = pd.read_html(url)
tb1 = tables[0] # Get the first table.
tb1.columns = tb1.iloc[0] # Assign the first row as header.
tb1 = tb1.iloc[1:] # Drop the first row.
tb1.reset_index(drop=True, inplace=True) # Reset the index.
print tb1.head() # Print first 5 rows.
# tb1.to_csv("table1.csv") # Export to CSV file.
Result:
In [5]: runfile('C:/Users/.../.spyder2/temp.py', wdir='C:/Users/.../.spyder2')
0 Company Dividend Bonus Closure of Register \
0 Nigerian Breweries Plc N3.50 Nil 5th - 11th March 2015
1 Forte Oil Plc N2.50 1 for 5 1st – 7th April 2015
2 Nestle Nigeria N17.50 Nil 27th April 2015
3 Greif Nigeria Plc 60 kobo Nil 25th - 27th March 2015
4 Guaranty Bank Plc N1.50 (final) Nil 17th March 2015
0 AGM Date Payment Date
0 13th May 2015 14th May 2015
1 15th April 2015 22nd April 2015
2 11th May 2015 12th May 2015
3 28th April 2015 5th May 2015
4 ​31st March 2015 31st March 2015
In [6]:

Categories

Resources