Python 3 html table data

Python 3 html table data - python

I'm new to Python and I need to get the data from a table on a
Webpage and send to a list.
I've tried everything, and the best I got is:
f = urllib.request.urlopen(url)
url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR#"
soup = BeautifulSoup(urllib.request.urlopen(url).read(),'lxml')
rows=list()
for tr in soup.findAll('table'):
rows.append(tr)
Any suggestions?

You're not that far !
First make sure to import the proper version of BeautifulSoup which is BeautifulSoup4 by doing apt-get install python3-bs4 (assuming you're on Ubuntu or Debian and running Python 3).
Then isolate the td elements of html table and clean data a bit. For example remove the first 3 elements of the lists which are useless, and remove the ugly '\n':
import urllib
from bs4 import BeautifulSoup
url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR#"
soup = BeautifulSoup(urllib.request.urlopen(url).read(),'lxml')
rows=list()
for tr in soup.findAll('table'):
for td in tr:
rows.append(td.string)
temp_list=rows[3:]
final_list=[element for element in temp_list if element != '\n']
I don't know which data you want to extract precisely. Now you need to work on your Python list (called final_list here)!
Hope it's clear.

There is a Dowload option at the end of the webpage. If you can download the file manually you are good to go.
If you want to access different dates automatically, and since it is JavaScript, I suggest to use Selenium to download the xlsx files through Python.
With the xlsx file you can use Xlsxwriter to read the data and do what you want.

Related

Generate and download tsv from a website (with python)

I have this website and want to write a script which can execute a code which gives the same output as clicking on 'Export' -> 'Generate tsv' -> Wait to generate -> 'Download'.
The endgoal is to use this for a list of approx. 1700 proteins which I have in .txt (so extract a protein, in this case 'Q9BXF6' and put it in the url: https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table) and download all results in .tsv files.
I tried inspecting the 'Export' button but the sourcecode wasn't illuminating (or I didn't know where to look). I also tried this:
r = requests.get('https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table')
soup = BeautifulSoup(r.content, 'html.parser')
to locate what I need but it outputs a bunch of characters that I can't really understand.
I also tried downloading the whole page just like it is with the urllib library:
with
myurl = 'https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table'
urllib.request.urlopen() as f:
html = f.read().decode('utf-8')
or
urllib.urlretrieve (myurl, 'interpro.txt') # although this didn't work
It seems as if all content is written somewhere else and refered to and everything I've tried outputs something stupid, but I don't know anything about html and am really new to python (I only use R).

For your first question, you can use the URL of the following element to retrieve the protein value that you require for the next problem.
href="blob:https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b"
The URL is set to the href tag which you can then use it to make the request to download the file. You can also obtain this by right-clicking on the download button for TSV and clicking Inspect-Element you will then be able to see the presence of this href tag.
Following that, download by doing e.g.
import urllib.request
url = 'https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b'
urllib.request.urlretrieve(url, '/Users/abc/Downloads/file.tsv') # any dir to save
with open("/Users/abc/Downloads/file.tsv") as file_in:
for line in file_in:
#here make your calls for your second problem.
You can also use a Web-Automator such as selenium to gracefully solve this problem. If the latter is of interest do look into it - it's not hard.

Trouble downloading xlsx file from website - Scraping

I'm trying to write some code which download the two latest publications of the Outage Weeks found at the bottom of http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/
It's xlsx-files, which I'm going to load into Excel afterwards.
It doesn't matter which programming language the code is written in.
My first idea was to use the direct url's, like http://www.eirgridgroup.com/site-files/library/EirGrid/Outage-Weeks_36(2016)-51(2016)_31%20August.xlsx
, and then make some code which guesses the url of the two latest publications.
But I have noticed some inconsistencies in the url names, so that solution wouldn't work.
Instead it might be solution to scrape the website and use the XPath to download the files. I found out that the two latest publications always have the following XPaths:
/html/body/div[3]/div[3]/div/div/p[5]/a
/html/body/div[3]/div[3]/div/div/p[6]/a
This is where I need help. I'm new to both XPath and Web Scraping. I have tried stuff like this in Python
from lxml import html
import requests
page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)
v = tree.xpath('/html/body/div[3]/div[3]/div/div/p[5]/a')
But v seems to be empty.
Any ideas would be greatly appreciated!

Just use contains to find the hrefs and slice the first two:
tree.xpath('//p/a[contains(#href, "/site-files/library/EirGrid/Outage-Weeks")]/#href')[:2]
Or doing it all with the xpath using [position() < 3]:
tree.xpath'(//p/a[contains(#href, "site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/#href')
The files are ordered from latest to oldest so getting the first two gives you the two newest.
To download the files you just need to join each href to the base url and write the content to a file:
from lxml import html
import requests
import os
from urlparse import urljoin # from urllib.parse import urljoin
page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)
v = tree.xpath('(//p/a[contains(#href, "/site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/#href')
for href in v:
# os.path.basename(href) -> Outage-Weeks_35(2016)-50(2016).xlsx
with open(os.path.basename(href), "wb") as f:
f.write(requests.get(urljoin("http://www.eirgridgroup.com", link)).content)

Python BeautifulSoup to csv scraping

I am attempting to scrape some simple dictionary information from an html page. So far I am able to print all the words I need on the IDE. My next step was to transfer the words to an array. My last step was to save the array as a csv file... When I run my code it seems to stop taking information after the 1309th or 1311th word, although I believe there to be over 1 million on the web page. I am stuck and would be very appreciative of any help. Thank you
from bs4 import BeautifulSoup
from urllib import urlopen
import csv
html = urlopen('http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_a.html').read()
soup = BeautifulSoup(html,"lxml")
words = []
for section in soup.findAll('b'):
words.append(section.renderContents())
print ('success')
print (len(words))
myfile = open('A.csv', 'wb')
wr = csv.writer(myfile)
wr.writerow(words)

I was not able to reproduce the problem (always getting 11616 items), but I suspect you have outdated beautifulsoup4 or lxml versions installed. Upgrade:
pip install --upgrade beautifulsoup4
pip install --upgrade lxml
Of course, this is just a theory.

I suspect a good deal of your problem may lie in how you're processing the scraped content. Do you need to scrape all the content before you output it to the file? Or can you do it as you go?
Instead of appending over and over to a list, you should use yield.
def tokenize(soup_):
for section in soup_.findAll('b'):
yield section.renderContents()
This'll give you a generator that as long as section.renderContents() returns a string, the csv module can write out with no problem.

Parsing xml in python - don't understand the DOM

I've been reading up on parsing xml with python all day, but looking at the site i need to extract data on, i'm not sure if i'm barking up the wrong tree. Basically i want to get the 13-digit barcodes from a supermarket website (found in the name of the images). For example:
http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985
has 11 items and 11 images, the barcode for the first item is 0000003235676. However when i look at the page source (i assume this is the best way to extract all of the barcodes in one go with python, urllib and beautifulsoup) all of the barcodes are on one line (line 12) however the data doesn't seem to be structured as i would expect in terms of elements and attributes.
new TESCO.sites.UI.entities.Product({name:"Lb Mens Mattifying Dust 7G",xsiType:"QuantityOnlyProduct",productId:"275303365",baseProductId:"72617958",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/805/5021320051805/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"g",unitPrice:3.58,catchWeight:"0",shelfName:"Mens Styling",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Thickening Shampoo 250Ml",xsiType:"QuantityOnlyProduct",productId:"275301223",baseProductId:"72617751",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/225/5021320051225/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:1,catchWeight:"0",shelfName:"Mens Shampoo ",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Sculpting Puty 75Ml",xsiType:"QuantityOnlyProduct",productId:"275301557",baseProductId:"72617906",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/287/5021320051287/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:3.34,catchWeight:"0",shelfName:"Pastes, Putty, Gums, Pomades",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
Maybe something like BeautifulSoup is overkill? I understand the DOM tree is not the same thing as the raw source, but why are they so different - when i go to inspect element in firefox the data seems structured as i would expect.
Apologies if this comes across as totally stupid, thanks in advance.

Unfortunately, the barcode is not given in the HTML as structured data; it only appears embedded as part of a URL. So we'll need to isolate the URL and then pick off the barcode with string manipulation:
import urllib2
import bs4 as bs
import re
import urlparse
url = 'http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985'
response = urllib2.urlopen(url)
content = response.read()
# with open('/tmp/test.html', 'w') as f:
# f.write(content)
# Useful for debugging off-line:
# with open('/tmp/test.html', 'r') as f:
# content = f.read()
soup = bs.BeautifulSoup(content)
barcodes = set()
for tag in soup.find_all('img', {'src': re.compile(r'/pi/')}):
href = tag['src']
scheme, netloc, path, query, fragment = urlparse.urlsplit(href)
barcodes.add(path.split('\\')[1])
print(barcodes)
yields
set(['0000003222737', '0000010039670', '0000010036297', '0000010008393', '0000003050453', '0000010062951', '0000003239438', '0000010078402', '0000010016312', '0000003235676', '0000003203132'])

As your site uses javascript to format its content, You might find useful switching from urllib to a tool like Selenium. That way you can crawl pages as they render for a real user with a web browser. This github project seems to solve your task.
Other option will be filtering out json data from page javascript scripts and getting data directly from there.

How to extract tables from websites in Python

Here,
http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500
There is a table. My goal is to extract the table and save it to a csv file. I wrote a code:
import urllib
import os
web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
web.close()
ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
ff.write(s)
ff.close()
I lost from here. Anyone who can help on this? Thanks!

Pandas can do this right out of the box, saving you from having to parse the html yourself. read_html() extracts all tables from your html and puts them in a list of dataframes. to_csv() can be used to convert each dataframe to a csv file. For the web page in your example, the relevant table is the last one, which is why I used df_list[-1] in the code below.
import requests
import pandas as pd
url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')
It's simple enough to do in one line, if you prefer:
pd.read_html(requests.get(<url>).content)[-1].to_csv(<csv file>)
P.S. Just make sure you have lxml, html5lib, and BeautifulSoup4 packages installed in advance.

So essentially you want to parse out html file to get elements out of it. You can use BeautifulSoup or lxml for this task.
You already have solutions using BeautifulSoup. I'll post a solution using lxml:
from lxml import etree
import urllib.request
web = urllib.request.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[#id="Report1_dgReportDemographic"]/tr')
## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]
## Get text from rest all 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]

I would recommend BeautifulSoup as it has the most functionality. I modified a table parser that I found online that can extract all tables from a webpage, as long as there are no nested tables. Some of the code is specific to the problem I was trying to solve, but it should be pretty easy to modify for your usage. Here is the pastbin link.
http://pastebin.com/RPNbtX8Q
You could use it as follows:
from urllib2 import Request, urlopen, URLError
from TableParser import TableParser
url_addr ='http://foo/bar'
req = Request(url_addr)
url = urlopen(req)
tp = TableParser()
tp.feed(url.read())
# NOTE: Here you need to know exactly how many tables are on the page and which one
# you want. Let's say it's the first table
my_table = tp.get_tables()[0]
filename = 'table_as_csv.csv'
f = open(filename, 'wb')
with f:
writer = csv.writer(f)
for row in table:
writer.writerow(row)
The code above is an outline, but if you use the table parser from the pastbin link you should be able to get to where you want to go.

You need to parse the table into an internal data structure and then output it in CSV form.
Use BeautifulSoup to parse the table. This question is about how to do that (the accepted answer uses version 3.0.8 which is out of date by now, but you can still use it, or convert the instructions to work with BeautifulSoup version 4).
Once you have the table in a data structure (probably a list of lists in this case) you can write it out with csv.write.

Look at BeautifulSOup module. In documentation you will find many examples of parsing html.
Also for csv you have ready solution - csv module.
It should be quite easy.

Look at this answer parsing table with BeautifulSoup and write in text file.
Also use google with next words "python beautifulsoup"

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python 3 html table data - python

Related

Generate and download tsv from a website (with python)

Trouble downloading xlsx file from website - Scraping

Python BeautifulSoup to csv scraping

Parsing xml in python - don't understand the DOM

How to extract tables from websites in Python

Categories

Resources