I have been trying to parse the table contents from here
I have tried a couple of alternatives, like
xpath('//table//tr/td//text()')
xpath('//div[@id="replacetext"]/table/tbody//tr/td/a//text()')
Here is my last code:
import requests, lxml.html
url ='https://nseindia.com/products/content/derivatives/equities/fo_underlying_home.htm'
url = requests.get(url)
html = lxml.html.fromstring(url.content)
packages = html.xpath('//div[@id="replacetext"]/table/tbody//tr/td/a//text()')  # get the text inside all "<tr><td><a ...>text</a></td></tr>"
However, none of the alternatives seems to work. In the past, I have scraped data with similar code (although not from this URL!). Any guidance would be really helpful.
I tried your code. The problem is not caused by lxml; it is caused by how you load the webpage.
I know that you use requests to get the content of the webpage; however, the content you get from requests may be different from the content you see in the browser.
For this page, 'https://nseindia.com/products/content/derivatives/equities/fo_underlying_home.htm', if you print the content returned by requests.get, you will find that the source code contains no table at all! The table is loaded by an AJAX query.
So find a way to load the page you really want; then you can use lxml.
By the way, there are a few other things to pay attention to in web scraping, for example, request headers. It is good practice to set your request headers when you make an HTTP request. Some sites may block you if you do not provide a reasonable User-Agent header, although that has nothing to do with your current problem.
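For example, a minimal sketch of setting a User-Agent with requests (the header value here is just an illustrative placeholder, not something this site specifically requires):
import requests

# a hypothetical but realistic browser User-Agent string
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
url = 'https://nseindia.com/products/content/derivatives/equities/fo_underlying_home.htm'
response = requests.get(url, headers=headers)
print(response.status_code)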
Thanks.
In the HTML page, there is a namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
So, you need to specify it:
NSMAP = {'html' : "http://www.w3.org/1999/xhtml"}
path = '//html:div[@id="replacetext"]/html:table/html:tbody//html:tr/html:td/html:a//text()'
packages = html.xpath(path, namespaces=NSMAP)
See http://lxml.de/xpathxslt.html#namespaces-and-prefixes
Interpreting Ajax call
import requests
from lxml import html
base_url = 'https://nseindia.com'
# simulate the JavaScript call
url = base_url + "/products/content/derivatives/equities/fo_underlyinglist.htm"
url = requests.get(url)
content = url.content
# -> <table>
# <tr><th>S. No.</td>
# <th>Underlying</td>
# <th>Symbol</th></tr>
# <tr>
# <td style='text-align: right;' >1</td>
# <td class="normalText" ><a href=fo_INDIAVIX.htm>INDIA VIX</a></td>
# <td class="normalText" >INDIAVIX</td>
# </tr>
# ...
html = html.fromstring(content)
packages = html.xpath('//td/a//text()')
# -> ['INDIA VIX',
# 'INDIAVIX',
# 'Nifty 50',
# 'NIFTY',
# 'Nifty IT',
# 'NIFTYIT',
# 'Nifty Bank',
# 'BANKNIFTY',
# 'Nifty Midcap 50',
Related
As the title says, I am crawling data from a Vietnamese website (https://webgia.com/lai-suat/). I used BeautifulSoup at first, and it does not return the data as shown in the HTML source in Chrome; the numbers are hidden. However, when I changed the method and used Selenium to get the HTML source, it returned the ideal result with all the numbers shown.
The code is as below:
Using bs4:
import requests
from bs4 import BeautifulSoup
url = "https://webgia.com/lai-suat/"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
table = soup.find_all('table', attrs={'class': 'table table-radius table-hover text-center'})
table_body = table[0].find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col)
The data is hidden, as the result shows:
<td class="text-left"><a class="bank-icon" href="https://webgia.com/lai-suat/abbank/" title="Lãi suất ABBank - Ngân hàng TMCP An Bình"><span class="bak-icon bi-abbank"></span><span>ABBank</span></a></td>
<td class="text-right lsd" nb="E3c7370616e20636c617C37B33d2B2746578742d6772H65I656e223e3A02c32303c2f7370616e3Ie"><small>web giá</small></td>
<td class="text-right lsd" nb="R3ZJ3YKJ2c3F635D"><small>xem tại webgia.com</small></td>
<td class="text-right lsd" nb="3c7370616e20636Fc61C73733d22746578742dC6772A65656e223e3S42cT303N03c2f7370616e3e"><small>webgia.com</small></td>
<td class="text-right lsd" nb="352cMA3Z6BE30"><small>web giá</small></td>
<td class="text-right lsd" nb="352cLXG3A7I30"><small>web giá</small></td>
But if I get the HTML source by using Selenium and then run the same code as above:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

s = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service = s)
driver.maximize_window()
url = "https://webgia.com/lai-suat/"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
...
The result shows all the numbers:
<td class="text-right"><span class="text-green">0,20</span></td>
<td class="text-right">3,65</td>
<td class="text-right"><span class="text-green">4,00</span></td>
<td class="text-right">5,60</td>
<td class="text-right">5,70</td>
<td class="text-right">5,70</td>
...
So can anyone explain why there is this difference? How can I get the same result by just using bs4 instead of Selenium?
Thank you.
The difference is because most websites today ship not only HTML but also JS scripts capable of modifying the HTML when executed. To execute those scripts, a JS engine is required, and that is exactly what web browsers provide you with: a JS engine (V8 for Chrome).
HTML contents fetched using BeautifulSoup are the "raw" ones, unmodified by any JS scripts, because there is no JS engine to execute them in the first place. It is those JS scripts that are in charge of fetching the data and updating the HTML with it.
HTML contents provided by Selenium, on the other hand, are the ones after the JS scripts have been executed. Selenium can do this because it has an external browser, driven through a webdriver, execute the scripts for you, not because Selenium itself can execute JS scripts.
Since you'll eventually need a JS engine to execute the JS scripts, I don't think BeautifulSoup alone can cut it.
The reason is that Selenium runs JavaScript, which can modify the contents of the page, whereas using requests to get the page only returns the HTML that the server initially sends back and does not execute the JavaScript.
The page source has that content obfuscated and placed inside the nb attribute of the relevant tds. When JavaScript runs in the browser the following script content runs which converts the obfuscated data into what you see on the page.
function gm(r) {
    r = r.replace(/A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z/g, "");
    for (var n = [], t = 0; t < r.length - 1; t += 2) n.push(parseInt(r.substr(t, 2), 16));
    return String.fromCharCode.apply(String, n)
}

$(document).ready(function() {
    $("td.blstg").each(function() {
        var gtls = $(this).attr("nb");
        $(this).removeClass("blstg").removeAttr("nb");
        if (gtls) {
            $(this).html(gm(gtls));
        } else {
            $(this).html("-");
        }
    });
});
With requests this script doesn't run so you are left with the generic text.
To answer your question about how to use bs4 to get this, you could write your own custom function(s) to reproduce the logic of the script.
Additionally, the class of these target elements, whose nb attribute require conversion, is dynamic, so that needs to be picked up also. In the above JavaScript the dynamic class value was blstg at the time of viewing. In the code below, I use regex to pick up the correct current value.
I have used thousands = None, as per this GitHub pandas issue, to preserve "," as the decimal point, as per source, when using read_html() to generate the final dataframe.
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
def gm(r):
    r = re.sub(r'A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z', '', r)
    n = []
    t = 0
    while t < len(r) - 1:
        n.append(int(r[t:t+2], 16))
        t += 2
    return ''.join(map(chr, n))
url = "https://webgia.com/lai-suat/"
req = requests.get(url, headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(req.text, "lxml")
dynamic_class = re.search(r'\$\("td\.([a-z]+)"', req.text).group(1)
for i in soup.select(f'td.{dynamic_class}'):
    replacement = i['nb']
    del i['class']  # not actually needed as I replace innerText
    del i['nb']  # not actually needed as I replace innerText
    if replacement:
        i.string.replace_with(bs(gm(replacement), 'lxml').text)
    else:
        i.replace_with('-')
df = pd.read_html(str(soup.select_one(".table-radius")), thousands=None)[0]
print(df)
Expanding on the above answer, and generally speaking:
To tell whether specific data is fetched/generated by JS or returned with the page HTML, you can use the Chrome DevTools setting that blocks JS execution (click Inspect, then F1 to open the settings). If you keep DevTools open when you visit the page and the data is still there, that is a clear indication the data is returned with the HTML.
If it is not there, then it is either fetched or generated by JS.
If the data is fetched, then by simply inspecting the network requests your browser makes while you visit the website, you should see the call that fetches the data, and you should be able to replicate it using the requests module.
If not, then you have to reverse engineer the JS: set a page-load breakpoint and refresh the page, and JS execution will stop as the page loads. By right-clicking the element the data is set on, you can choose "Break on subtree modifications" or "Break on attribute modifications"; after removing the page-load breakpoint and refreshing the page, Chrome will now break on the JS code responsible for generating the data.
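For instance, once you have found the underlying request in the Network tab, a minimal sketch of replicating it with requests might look like this (the URL and headers are hypothetical placeholders; copy the real ones from DevTools):
import requests

# hypothetical endpoint discovered in the Network tab; replace with the real one
api_url = 'https://example.com/api/data'
headers = {'User-Agent': 'Mozilla/5.0', 'X-Requested-With': 'XMLHttpRequest'}

response = requests.get(api_url, headers=headers)
print(response.json())  # many such endpoints return JSON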
I am trying to access rows of the table from https://www.parliament.gov.za/hansard?sorts[date]=-1. I ultimately want to download the PDFs contained in each row, but I am having a hard time accessing the rows of the table. When I inspect a table row element, I see that it is under the <tbody> tag. However, I can't seem to access this data using BeautifulSoup. I have done a decent amount of web scraping, but this is the first time I've run into this issue. This is the code that I currently have:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.parliament.gov.za/hansard?sorts[date]=-1'
request = requests.get(url)
soup = bs(request.text, 'html.parser')
table1 = soup.findAll('table')[0]
print(table1)
Output:
<table id="papers-table">
<thead>
<th>Name</th>
<th>House</th>
<th>Language</th>
<th>Date</th>
<th>Type</th>
<th data-dynatable-column="file_location" style="display:none">File Location</th>
</thead>
<tbody>
</tbody>
</table>
Clearly, there is nothing in the <tbody> tag even though this is where I believe the row data should be. In general, whenever I try to find the tr tags, which is where Chrome says the row data is stored, I can't find any of the ones with the PDFs. I am fairly certain that the issue has something to do with the fact that the source code is missing this data as well, but I have no idea how to find it. Since it's on the website, I assume that there must be a way, right? Thanks!
The data is loaded dynamically, so requests alone won't see it. However, the data is available by sending a GET request to the website's API:
https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0
There's no need to use BeautifulSoup; using just the requests library is enough:
import requests
URL = "https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0"
BASE_URL = "https://www.parliament.gov.za/storage/app/media/Docs/"
response = requests.get(URL).json()
for data in response["records"]:
    print(BASE_URL + data["file_location"])
Output:
https://www.parliament.gov.za/storage/app/media/Docs/hansard/3a888bc6-ffc7-46a1-9803-ffc148b07bfc.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/3eb3103c-2d3c-418f-bb24-494b17bdeb22.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/bf0afdf8-352c-4dde-a380-11ce0a038dad.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/285e1633-aaeb-4a0d-bd54-98a4d5ec5127.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/966926ce-4cfe-4f68-b4a1-f99a09433137.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/d4bdb2c2-e8c8-461f-bc0b-9ffff3403be3.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/daecc145-bb44-47f1-a3b2-9400437f71d8.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/4f204d7e-0a25-4b64-b5a7-46c8730abe91.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/f2863e16-b448-46e3-939d-e14859984513.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/cd30e289-2ff2-47f5-b2a7-77e496e52f3a.pdf
You're trying to scrape dynamic content, whereas you're only loading the HTML of the static page with requests. Look for ways to get the dynamically generated HTML, or use Selenium.
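A minimal Selenium sketch along those lines (a rough illustration, assuming Chrome and chromedriver are available; the table id comes from the question above):
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.parliament.gov.za/hansard?sorts[date]=-1')
time.sleep(5)  # crude wait for the JavaScript to populate the table

# the browser has executed the JavaScript, so the rows are now in the DOM
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find('table', id='papers-table')
for row in table.find('tbody').find_all('tr'):
    print([td.get_text(strip=True) for td in row.find_all('td')])

driver.quit()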
I'm trying to scrape a website that has a table in it using bs4, but the content I'm getting is not as complete as what I see when I inspect the page. I cannot find the <tr> and <td> tags in it. How can I get the full content of that site, especially the tags for the table?
Here's my code:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
soup = BeautifulSoup(src, "html.parser")
print(soup)
I expect the content to have the <tr> and <td> tags in it, because they do exist when I inspect the page, but I found none in the output.
Here's the image of the page where the <tr> and <td> tags appear.
You should dump the contents of the text you're trying to parse to a file and look at it. This will tell you for sure what is and isn't there. Like this:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
with open("/tmp/content.html", "wb") as f:
    f.write(src)
soup = BeautifulSoup(src, "html.parser")
print(soup)
Run this code, and then look at the file "/tmp/content.html" (use a different path, obviously, if you're on Windows), and look at what is actually in the file. You could probably do this with your browser, but this is the way to be most sure you know what you are getting. You could, of course, also just add print(src), but if it were me, I'd dump it to a file.
If the HTML you're looking for is not in the initial HTML that you're getting back, then that HTML is coming from somewhere else. The table could be being built dynamically by JavaScript, or coming from another URL reference, possibly one that calls an HTTP API to grab the table's HTML via parameters passed to the API endpoint.
You will have to reverse engineer the site's design to find where that HTML comes from. If it comes from JavaScript, you may be stuck short of scripting the execution of a browser so you can gain access programmatically to the DOM in the browser's memory.
I would recommend running a debugging proxy that will show you each HTTP request being made by your browser. You'll be able to see the contents of each request and response. If you can do this, you can find the URL that actually returns the content you're looking for, if such a URL exists. You'll have to deal with SSL certificates and such because this is a https endpoint. Debugging proxies usually make that pretty easy. We use Charles. The standard browser toolboxes might do this too...allow you to see each request and response that is generated by a particular page load.
If you can discover the URL that actually returns the table HTML, then you can use that URL to grab it and parse it with BS.
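If you do find such a URL, a minimal sketch of using it might look like this (the endpoint is a hypothetical placeholder for whatever your proxy or the browser's network tab reveals):
import requests
from bs4 import BeautifulSoup

# hypothetical endpoint discovered with the debugging proxy; replace with the real one
table_url = "https://example.com/api/table-fragment"
resp = requests.get(table_url)

soup = BeautifulSoup(resp.text, "html.parser")
for row in soup.find_all("tr"):
    print([td.get_text(strip=True) for td in row.find_all("td")])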
I've asked this question before to no avail. I am trying to figure out how to use bs4 to grab the links used for downloading from within the website's source. The problem I can't figure out is that the links are within a dynamic content library. I've removed the previous HTML snippet; see below.
We've been able to grab the links with this script only after manually grabbing the source code from the website:
import re

# loops over the page source that we saved manually into SourceCode.txt (see below)
for line in open('SourceCode.txt'):
    line = line.rstrip()
    x = re.findall('href=[\'"]?([^\'" >]+)tif', line)
    if len(x) > 0:
        result.write('tif">link</a><br>\n<a href="'.join(x))
        result.write('tif">link</a><br>\n\n</html>\n</body>\n')
        result.write("There are " + str(len(x)) + " links")
print "Download HTML page created."
But only after going into the website ctrl + a -> view source -> select all & copy -> paste onto SourceCode.txt. I would like to remove the manual labor from all this.
I'd greatly appreciate any information/tips/advice!
EDIT
I wanted to add some more information regarding the website we are using: the library content will only show up once it has been manually expanded. Otherwise, the content (i.e., the download links/href *.tif) is not visible. Here's an example of what we see:
Source Code of site without opening the library element.
<html><body>
Source Code after opening library element.
<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
<div><table><tbody>
<tr>
<td>Tile12</td>
<td>Button</td>
</tr>
</tbody></table></div>
</div>
Source code after expanding all download options.
<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
<div><table><tbody>
<tr>
<td>Tile12</td>
<td>Button</td>
</tr>
<tr>
<td>Tile12_Set1.tif</td>
<td>Button</td>
</tr>
<tr>
<td>Tile12_Set2.tif</td>
<td>Button</td>
</tr>
</tbody></table></div>
</div>
Our end goal is to grab the download links while only having to input the website URL. The issue seems to be the way the content is displayed (i.e., the dynamic content is only visible after manual expansion of the library).
Do not try and parse HTML with regular expressions. It's not possible and it won't work. Use BeautifulSoup4 instead:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "http://www.your-server.com/page.html"
document = urlopen(url)
soup = BeautifulSoup(document)
# look for all URLs:
found_urls = [link["href"] for link in soup.find_all("a", href=True)]
# look only for URLs to *.tif files:
found_tif_urls = [link["href"] for link in soup.find_all("a", href=True) if link["href"].endswith(".tif")]
You may also want to take a look at the PyQuery library, which uses a (sub)set of the CSS selectors from jQuery:
from pyquery import PyQuery

pq = PyQuery(body)  # body is the HTML document as a string
pq('div.content div#filter-container div.filter-section')
I would like to create a Python script that grabs digits of Pi from this site:
http://www.piday.org/million.php
and reposts them to this site:
http://www.freelove-forum.com/index.php
I am NOT spamming or playing a prank; it is an inside joke with the creator and webmaster, a belated Pi Day celebration, if you will.
Import urllib2 and BeautifulSoup
import urllib2
from BeautifulSoup import BeautifulSoup
Specify the URL and fetch it using urllib2:
url = 'http://www.piday.org/million.php'
response = urllib2.urlopen(url)
Then use BeautifulSoup, which uses the tags in the page to build a parse tree; you can then query that tree with the relevant tags that identify the data you want to extract.
soup = BeautifulSoup(response)
pi = soup.findAll('TAG')
where 'TAG' is the relevant tag you want to find that identifies where pi is.
Specify what you want to print out
out = '<html><body>' + str(pi) + '</body></html>'
You can then write this to a HTML file that you serve, using pythons inbuilt file operations.
f = open('file.html', 'w')
f.write(out)
f.close()
You then serve the file 'file.html' using your webserver.
If you don't want to use BeautifulSoup you could use re and urllib, but it is not as 'pretty' as BeautifulSoup.
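A rough sketch of that approach, just to illustrate (the regular expression is a guess at the page's markup, not something taken from the actual site):
import urllib, re

page = urllib.urlopen('http://www.piday.org/million.php').read()
# crude guess: grab whatever sits inside the first <p>...</p> block
match = re.search(r'<p[^>]*>(.*?)</p>', page, re.DOTALL)
if match:
    print(match.group(1)[:50])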
When you submit a post, it is done with a POST request which is sent to the server. Look at the code on your site:
<form action="enter.php" method="post">
<textarea name="post">Enter text here</textarea>
</form>
You are going to send a POST request with a parameter of post (bad object naming IMO), which is your text.
As for the site you are grabbing from, if you look at the source code, the Pi is actually inside of an <iframe> with this URL:
http://www.piday.org/includes/pi_to_1million_digits_v2.html
Looking at that source code, you can see that the page is just a single <p> tag directly descending from a <body> tag (the site doesn't have the <!DOCTYPE>, but I'll include one):
<!DOCTYPE html>
<html>
<head>
...
</head>
<body>
<p>3.1415926535897932384...</p>
</body>
</html>
Since HTML is structured markup, much like XML, you will need a proper parser to parse the webpage. I use BeautifulSoup, as it works very well with malformed or invalid markup, and even better with perfectly valid HTML.
To download the actual page, which you would feed into the XML parser, you can use Python's built-in urllib2. For the POST request, I'd use Python's standard httplib.
So a complete example would be this:
import urllib, httplib
from BeautifulSoup import BeautifulSoup
# Downloads and parses the webpage with Pi
page = urllib.urlopen('http://www.piday.org/includes/pi_to_1million_digits_v2.html')
soup = BeautifulSoup(page)
# Extracts the Pi. There's only one <p> tag, so just select the first one
pi_list = soup.findAll('p')[0].contents
pi = ''.join(str(s).replace('\n', '') for s in pi_list).replace('<br />', '')
# Creates the POST request's body. Still bad object naming on the creator's part...
parameters = urllib.urlencode({'post': pi,
                               'name': 'spammer',
                               'post_type': 'confession',
                               'school': 'all'})

# Crafts the POST request's header.
headers = {'Content-type': 'application/x-www-form-urlencoded',
           'Accept': 'text/plain'}
# Creates the connection to the website
connection = httplib.HTTPConnection('freelove-forum.com:80')
connection.request('POST', '/enter.php', parameters, headers)
# Sends it out and gets the response
response = connection.getresponse()
print response.status, response.reason
# Finishes the connections
data = response.read()
connection.close()
But if you are using this for a malicious purpose, do know that the server logs all IP addresses.
You could use the urllib2 module, which comes with any Python distribution.
It allows you to open a URL as if you were opening a file on the filesystem. So you can fetch the pi data with:
pi_million_file = urllib2.urlopen("http://www.piday.org/million.php")
and parse the resulting file, which will be the HTML code of the webpage you see in your browser.
Then you should use the right URL for your website to POST pi to it.
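A minimal sketch of that POST with urllib2 (the 'post' field name comes from the form shown earlier; the exact URL and any extra fields the forum expects are assumptions):
import urllib, urllib2

pi_million_file = urllib2.urlopen("http://www.piday.org/million.php")
pi_text = pi_million_file.read()  # still raw HTML; extract just the digits before posting

# 'post' is the textarea name from the form shown above; the target URL is assumed
data = urllib.urlencode({'post': pi_text})
request = urllib2.Request('http://www.freelove-forum.com/enter.php', data)
response = urllib2.urlopen(request)
print(response.read())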