Extracting href links from within website source with Python

I've asked this question before, to no avail. I am trying to figure out how to use bs4 to grab the download links from within the website's source. The problem I can't figure out is that the links sit inside a dynamic content library. (I've removed the previous HTML snippet; see below.)
We've been able to grab the links with this script only after manually grabbing the source code from the website:
import re

# SourceCode.txt holds the page source copied manually from the browser.
result = open('DownloadLinks.html', 'w')
for line in open('SourceCode.txt'):
    line = line.rstrip()
    x = re.findall('href=[\'"]?([^\'" >]+)tif', line)
    if len(x) > 0:
        result.write('tif">link</a><br>\n<a href="'.join(x))
        result.write('tif">link</a><br>\n\n</body>\n</html>\n')
        result.write("There are " + str(len(x)) + " links")
print("Download HTML page created.")
But only after going to the website, doing view source -> select all & copy -> paste into SourceCode.txt. I would like to remove the manual labor from all this.
I'd greatly appreciate any information/tips/advice!
EDIT
I wanted to add some more information about the website we are using: the library content only shows up after it has been manually expanded. Otherwise, the content (i.e., the download links / *.tif hrefs) is not visible. Here's an example of what we see:
Source Code of site without opening the library element.
<html><body>
Source Code after opening library element.
<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
<div><table><tbody>
<tr>
<td>Tile12</td>
<td>Button</td>
</tr>
</tbody></table></div>
</div>
Source code after expanding all download options.
<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
<div><table><tbody>
<tr>
<td>Tile12</td>
<td>Button</td>
</tr>
<tr>
<td>Tile12_Set1.tif</td>
<td>Button</td>
</tr>
<tr>
<td>Tile12_Set2.tif</td>
<td>Button</td>
</tr>
</tbody></table></div>
</div>
Our end goal would be to grab the download links with only having to input the website URL. The issue seems to be the way the content is displayed (i.e., dynamic content that is only visible after manual expansion of the library).

Do not try to parse HTML with regular expressions; it is not a robust approach and it won't work in general. Use BeautifulSoup4 instead:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "http://www.your-server.com/page.html"
document = urlopen(url)
soup = BeautifulSoup(document, "html.parser")
# look for all URLs:
found_urls = [link["href"] for link in soup.find_all("a", href=True)]
# look only for URLs to *.tif files:
found_tif_urls = [link["href"] for link in soup.find_all("a", href=True) if link["href"].endswith(".tif")]

You may also want to take a look at the PyQuery library, which uses a subset of jQuery's CSS selectors:
from pyquery import PyQuery
pq = PyQuery(body)  # body is the page's HTML as a string
pq('div.content div#filter-container div.filter-section')
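For the original goal of collecting the *.tif links, a minimal PyQuery sketch might look like this (the URL is the placeholder from the example above, and the selector is an assumption, not taken from the actual page):
from pyquery import PyQuery

# Load the page and keep only anchors whose href ends in .tif
pq = PyQuery(url="http://www.your-server.com/page.html")
tif_urls = [a.attrib["href"] for a in pq('a[href$=".tif"]')]
print(tif_urls)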

Related

Why is the page source different between Selenium and BeautifulSoup?

As the title says, I am crawling data from a Vietnamese website (https://webgia.com/lai-suat/). I used BeautifulSoup at first and it does not return the data shown in the HTML source in Chrome; the numbers are hidden. However, when I changed the method and used Selenium to get the HTML source, it returns the ideal result, with all the numbers shown.
The code is as below:
Using bs4:
import requests
from bs4 import BeautifulSoup
url = "https://webgia.com/lai-suat/"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
table = soup.find_all('table', attrs={'class': 'table table-radius table-hover text-center'})
table_body = table[0].find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col)
The data is hidden, as the result shows:
<td class="text-left"><a class="bank-icon" href="https://webgia.com/lai-suat/abbank/" title="Lãi suất ABBank - Ngân hàng TMCP An Bình"><span class="bak-icon bi-abbank"></span><span>ABBank</span></a></td>
<td class="text-right lsd" nb="E3c7370616e20636c617C37B33d2B2746578742d6772H65I656e223e3A02c32303c2f7370616e3Ie"><small>web giá</small></td>
<td class="text-right lsd" nb="R3ZJ3YKJ2c3F635D"><small>xem tại webgia.com</small></td>
<td class="text-right lsd" nb="3c7370616e20636Fc61C73733d22746578742dC6772A65656e223e3S42cT303N03c2f7370616e3e"><small>webgia.com</small></td>
<td class="text-right lsd" nb="352cMA3Z6BE30"><small>web giá</small></td>
<td class="text-right lsd" nb="352cLXG3A7I30"><small>web giá</small></td>
But if I get the HTML source using Selenium and then run the same code above:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

s = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.maximize_window()
url = "https://webgia.com/lai-suat/"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
...
The result shows all the numbers:
<td class="text-right"><span class="text-green">0,20</span></td>
<td class="text-right">3,65</td>
<td class="text-right"><span class="text-green">4,00</span></td>
<td class="text-right">5,60</td>
<td class="text-right">5,70</td>
<td class="text-right">5,70</td>
...
So can anyone explain why there is such a difference? And how can I get the same result using just bs4 instead of Selenium?
Thank you, guys.
The difference arises because most websites today ship not only HTML but also JS scripts capable of modifying the HTML when executed. To execute those scripts, a JS engine is required, and that's exactly what web browsers provide you with: a JS engine (V8 for Chrome).
The HTML content fetched by BeautifulSoup is the "raw" content, unmodified by any JS scripts, because there is no JS engine to execute them in the first place. It is those JS scripts that are in charge of fetching the data and updating the HTML with it.
The HTML content provided by Selenium, on the other hand, is the content after the JS scripts have been executed. Selenium can do this because it has an external webdriver (a real browser) execute the scripts for you, not because Selenium itself can execute JS.
Since you'll eventually need a JS engine to execute the JS scripts, I don't think BeautifulSoup alone can cut it.
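If you do go the Selenium route, here is a minimal sketch that waits for the de-obfuscated cells to render before handing the HTML to BeautifulSoup (the CSS selector is an assumption based on the output shown above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://webgia.com/lai-suat/")
# Wait until at least one rendered rate cell appears before reading page_source.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "td.text-right span.text-green"))
)
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()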
The reason is that selenium runs JavaScript, which can modify the contents of the page, whereas using requests to get the page only returns the HTML of the page that is initially sent in the request and does not execute the JavaScript.
The page source has that content obfuscated and placed inside the nb attribute of the relevant tds. When JavaScript runs in the browser, the following script runs, converting the obfuscated data into what you see on the page.
function gm(r) {
    r = r.replace(/A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z/g, "");
    for (var n = [], t = 0; t < r.length - 1; t += 2) n.push(parseInt(r.substr(t, 2), 16));
    return String.fromCharCode.apply(String, n)
}

$(document).ready(function() {
    $("td.blstg").each(function() {
        var gtls = $(this).attr("nb");
        $(this).removeClass("blstg").removeAttr("nb");
        if (gtls) {
            $(this).html(gm(gtls));
        } else {
            $(this).html("-");
        }
    });
});
With requests this script doesn't run so you are left with the generic text.
To answer your question about how to use bs4 to get this, you could write your own custom function(s) to reproduce the logic of the script.
Additionally, the class of these target elements, whose nb attribute requires conversion, is dynamic, so that needs to be picked up as well. In the above JavaScript the dynamic class value was blstg at the time of viewing. In the code below, I use a regex to pick up the correct current value.
I have used thousands=None, as per this GitHub pandas issue, to preserve "," as the decimal separator (as in the source) when using read_html() to generate the final dataframe.
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
def gm(r):
    r = re.sub(r'A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z', '', r)
    n = []
    t = 0
    while t < len(r) - 1:
        n.append(int(r[t:t+2], 16))
        t += 2
    return ''.join(map(chr, n))

url = "https://webgia.com/lai-suat/"
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(req.text, "lxml")
dynamic_class = re.search(r'\$\("td\.([a-z]+)"', req.text).group(1)

for i in soup.select(f'td.{dynamic_class}'):
    replacement = i['nb']
    del i['class']  # not actually needed as I replace innerText
    del i['nb']     # not actually needed as I replace innerText
    if replacement:
        i.string.replace_with(bs(gm(replacement), 'lxml').text)
    else:
        i.replace_with('-')

df = pd.read_html(str(soup.select_one(".table-radius")), thousands=None)[0]
print(df)
Expanding on the above answer, and generally speaking:
To tell whether specific data is fetched/generated by JS or returned with the page HTML, you can use the Chrome DevTools option to disable JavaScript (click Inspect, then press F1 to open the settings). If you visit the page with DevTools open and JS disabled and the data is still there, that is a clear indication the data is delivered with the HTML.
If it's not, then it is either fetched or generated by JS.
If the data is fetched, then simply by inspecting the network requests your browser makes while you visit the website, you should see the call that fetches the data, and you should be able to replicate it using the requests module.
If not, then you have to reverse-engineer the JS: set a breakpoint on page load and refresh the page, so JS execution stops as the page loads. By right-clicking the element the data is written to, you can choose "break on subtree modifications" or "attribute modifications". After removing the page-load breakpoint and refreshing the page, Chrome will now break on the JS code responsible for generating the data.
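As a rough illustration of the "replicate the network call" route (the endpoint, parameters, and headers below are hypothetical placeholders for whatever you find in the Network tab):
import requests

# Hypothetical XHR endpoint spotted in the Network tab; substitute the real one you find.
api_url = "https://example.com/api/data"
headers = {"User-Agent": "Mozilla/5.0", "X-Requested-With": "XMLHttpRequest"}
resp = requests.get(api_url, headers=headers, params={"page": 1})
resp.raise_for_status()
print(resp.json())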

Can't find <tr> data with BeautifulSoup

I am trying to access rows of the table from https://www.parliament.gov.za/hansard?sorts[date]=-1. I ultimately want to download the PDFs contained in each row, but I am having a hard time accessing the rows of the table. When I inspect a table row element, I see that it is under the <tbody> tag. However, I can't seem to access this data using BeautifulSoup. I have done a decent amount of web scraping, but this is the first time I've run into this issue. This is the code that I currently have:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.parliament.gov.za/hansard?sorts[date]=-1'
request = requests.get(url)
soup = bs(request.text, 'html.parser')
table1 = soup.findAll('table')[0]
print(table1)
Output:
<table id="papers-table">
<thead>
<th>Name</th>
<th>House</th>
<th>Language</th>
<th>Date</th>
<th>Type</th>
<th data-dynatable-column="file_location" style="display:none">File Location</th>
</thead>
<tbody>
</tbody>
</table>
Clearly, there is nothing in the <tbody> tag even though this is where I believe the row data should be. In general, whenever I try to find the tr tags, which is where Chrome says the row data is stored, I can't find any of the ones with the PDFs. I am fairly certain that the issue has something to do with the fact that the source code is missing this data as well, but I have no idea how to find it. Since it's on the website, I assume that there must be a way, right? Thanks!
The data is loaded dynamically, so requests alone won't pick it up. However, the data is available by sending a GET request to the website's API:
https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0
There's no need to use BeautifulSoup; the requests library alone is enough:
import requests
URL = "https://www.parliament.gov.za/docsjson?queries%5Btype%5D=hansard&sorts%5Bdate%5D=-1&page=1&perPage=10&offset=0"
BASE_URL = "https://www.parliament.gov.za/storage/app/media/Docs/"
response = requests.get(URL).json()
for data in response["records"]:
    print(BASE_URL + data["file_location"])
Output:
https://www.parliament.gov.za/storage/app/media/Docs/hansard/3a888bc6-ffc7-46a1-9803-ffc148b07bfc.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/3eb3103c-2d3c-418f-bb24-494b17bdeb22.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/bf0afdf8-352c-4dde-a380-11ce0a038dad.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/285e1633-aaeb-4a0d-bd54-98a4d5ec5127.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/966926ce-4cfe-4f68-b4a1-f99a09433137.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/d4bdb2c2-e8c8-461f-bc0b-9ffff3403be3.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/daecc145-bb44-47f1-a3b2-9400437f71d8.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/4f204d7e-0a25-4b64-b5a7-46c8730abe91.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/f2863e16-b448-46e3-939d-e14859984513.pdf
https://www.parliament.gov.za/storage/app/media/Docs/hansard/cd30e289-2ff2-47f5-b2a7-77e496e52f3a.pdf
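Since the ultimate goal was to download the PDFs, here is a short, hedged continuation of the snippet above (it reuses the response and BASE_URL variables already defined there):
# Sketch: save each hansard PDF locally, naming the file after the last URL segment.
for data in response["records"]:
    pdf_url = BASE_URL + data["file_location"]
    pdf = requests.get(pdf_url)
    pdf.raise_for_status()
    with open(pdf_url.rsplit("/", 1)[-1], "wb") as f:
        f.write(pdf.content)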
You're trying to scrape dynamic content, whereas with requests you're only loading the HTML of the static page. Look for ways to get the dynamically generated HTML, or use Selenium.

Beautiful Soup page table parsing problem

I want to get the data (numbers) from this page. With those numbers I want to do some math.
My current code:
import requests
from bs4 import BeautifulSoup
result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c , features='lxml')
cld=soup.select("#d03")
print(cld)
Output:
[]
From the page-request I get this result:
<td id="d04" class="">2,105</td>
<td id="d03" class=""><span style="font-size:15px;font-weight:bold">2,147</span> <span style="font-size:11px;color:green">305 (16.56%)</span></td>
<td id="d05" class="">1,842</td>
From this result I only want the <td> IDs to be output.
The problem with that page is that its content is generated dynamically. By the time you fetch the HTML of the page, the actual elements aren't generated yet (I suppose they are filled in by the JavaScript on the page). There are two ways you can approach this.
Try using Selenium, which simulates a browser. You can then wait for the content to be generated and fetch the HTML element you want.
The other way would be to look at the network requests the page makes to fetch the data. If the data was not loaded in the HTML, there must be another API call made to their servers to fetch it.
On an initial look, I can see that the data you need is being fetched from this URL (http://www.tsetmc.com/tsev2/data/instinfodata.aspx?i=45050389997905274&c=57+). The response looks like this:
12:29:48,A ,2150,2147,2105,1842,2210,2105,2700,53654226,115204065144,1,20190814,122948;98/5/23 16:30:51,F,261391.50,<div class='pn'>4294.29</div>,9596315531133973,3376955600,11101143554708,345522,F,2046434489,11459858578563,282945,F,12927,3823488480,235,;8#240000#2148#2159#500#1,1#600#2145#2160#198067#2,10#1000000#2141#2161#2000#1,;61157,377398,660897;;;;0;
You can figure out the parsing logic in detail by going through their code, I suppose. But it looks like you only need the second element, 2147.
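A rough sketch of fetching and splitting that response with requests (the field position is an assumption based on the sample above, since the format isn't documented):
import requests

api = "http://www.tsetmc.com/tsev2/data/instinfodata.aspx?i=45050389997905274&c=57+"
raw = requests.get(api, headers={"User-Agent": "Mozilla/5.0"}).text
first_section = raw.split(";")[0]  # "12:29:48,A ,2150,2147,2105,1842,..."
fields = first_section.split(",")
print(fields[3])  # the "2147" value from the sample response above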
Perhaps this might work:
import requests
from bs4 import BeautifulSoup

result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c, features='lxml')
for tag in soup.find_all('td')[0:2]:
    print(tag.get('id'))

Text missing after scraping a website using BeautifulSoup

I'm writing a Python script to get the number of pull requests generated by a particular user during the ongoing Hacktoberfest event.
Here's a link to the official website of Hacktoberfest.
Here's my code:
url= 'https://hacktoberfest.digitalocean.com/stats/user'
import urllib.request
from bs4 import BeautifulSoup
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
name_box = soup.find('div', attrs={'class': 'userstats--progress'})
print(name_box)
Where 'user' in the first line of the code should be replaced by the user's github handle ( eg. BAJUKA ).
Below is the HTML tag I'm aiming to scrape:
<div class="userstats--progress">
<p>
Progress (<span data-js="userPRCount">5</span>/5)
</p>
<div class="ProgressBar ProgressBar--three u-mb--regular ProgressBar--full" data-js="progressBar"></div>
</div>
This is what I get after I run my code:
<div class="userstats--progress">
<p>
Progress (<span data-js="userPRCount"></span>/5)
</p>
<div class="ProgressBar ProgressBar--three u-mb--regular" data-js="progressBar"></div>
</div>
The difference is on the third line, where the number of pull requests is missing (i.e., in the span tag the 5 is missing).
These are the questions that I want to ask:
1. Why is the number of pull requests (i.e., 5 in this case) missing from the scraped lines?
2. How can I solve this issue, i.e., get the number of pull requests successfully?
The data you're looking for is not in the original data that the Hacktoberfest server sends and Beautiful Soup downloads and parses; it's inserted into the HTML by the JavaScript code that runs on that page in your browser after the original data is loaded.
If you use this shell command to download the data that's actually served as the page, you'll see that the span tag you're looking at starts off empty:
curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 Progress
What's the JavaScript that fills that tag? Well, it's minified, so it's very hard to unpick what's going on. You can find it included at the very bottom of the original data, here:
curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 "script src=" | tail -n5
which when I run it, outputs this:
<script src="https://go.digitalocean.com/js/forms2/js/forms2.min.js"></script>
<script src="/assets/application-134859a20456d7d32be9ea1bc32779e87cad0963355b5372df99a0cff784b7f0.js"></script>
That crazy-looking source URL is a minified piece of JavaScript, which means it's been automatically shrunk and is almost unreadable. But if you go to that page and scroll right down to the bottom, you can see some garbled JavaScript which you can try to decode.
I noticed this bit:
var d="2018-09-30T10%3A00%3A00%2B00%3A00",f="2018-11-01T12%3A00%3A00%2B00%3A00";$.getJSON("https://api.github.com/search/issues?q=-label:invalid+created:"+d+".."+f+"+type:pr+is:public+author:"+t+"&per_page=300"
This, I think, is where it gets the data to fill that div. If you load and parse that URL, I think you'll find the data you need. You'll need to fill in the dates for that search, and the author. Good luck!
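A minimal, hedged sketch of querying that GitHub search endpoint directly with requests (the dates and author are values you supply yourself, mirroring the minified JS above):
import requests

author = "BAJUKA"
start = "2018-09-30T10:00:00+00:00"
end = "2018-11-01T12:00:00+00:00"
params = {
    "q": f"-label:invalid created:{start}..{end} type:pr is:public author:{author}",
    "per_page": 300,
}
resp = requests.get("https://api.github.com/search/issues", params=params)
resp.raise_for_status()
print(resp.json()["total_count"], "pull requests")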

parse html tables with lxml

I have been trying to parse the table contents from here.
I have tried a couple of alternatives, like:
xpath('//table//tr/td//text()')
xpath('//div[@id="replacetext"]/table/tbody//tr/td/a//text()')
Here is my latest code:
import requests, lxml.html

url = 'https://nseindia.com/products/content/derivatives/equities/fo_underlying_home.htm'
url = requests.get(url)
html = lxml.html.fromstring(url.content)
# get the text inside all "<tr><td><a ...>text</a></td></tr>"
packages = html.xpath('//div[@id="replacetext"]/table/tbody//tr/td/a//text()')
However, none of the alternatives seems to be working. In the past, I have scraped data with similar code (although not from this URL!). Any guidance will be really helpful.
I tried your code. The problem is not caused by lxml; it is caused by how you load the webpage.
I know that you use requests to get the content of the webpage; however, the content you get from requests may be different from the content you see in the browser.
For this page, 'https://nseindia.com/products/content/derivatives/equities/fo_underlying_home.htm', if you print the content returned by requests.get, you will find that the source code of the page contains no table! The table is loaded by an AJAX query.
So find a way to load the page you really want; then you can use lxml.
By the way, in web scraping there are also some things worth mentioning, for example request headers. It's good practice to set your request headers when you make an HTTP request. Some sites may block you if you do not provide a reasonable User-Agent in the header, though that has nothing to do with your current problem.
Thanks.
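For example, a minimal sketch of sending a browser-like User-Agent with requests (the header value is only an illustration):
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
url = "https://nseindia.com/products/content/derivatives/equities/fo_underlying_home.htm"
response = requests.get(url, headers=headers)
print(response.status_code)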
In the HTML page, there is a namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
So, you need to specify it:
NSMAP = {'html': "http://www.w3.org/1999/xhtml"}
path = '//html:div[@id="replacetext"]/html:table/html:tbody//html:tr/html:td/html:a//text()'
packages = html.xpath(path, namespaces=NSMAP)
See http://lxml.de/xpathxslt.html#namespaces-and-prefixes
Interpreting Ajax call
import requests
from lxml import html
base_url = 'https://nseindia.com'
# simulate the JavaScript call
url = base_url + "/products/content/derivatives/equities/fo_underlyinglist.htm"
url = requests.get(url)
content = url.content
# -> <table>
# <tr><th>S. No.</td>
# <th>Underlying</td>
# <th>Symbol</th></tr>
# <tr>
# <td style='text-align: right;' >1</td>
# <td class="normalText" ><a href=fo_INDIAVIX.htm>INDIA VIX</a></td>
# <td class="normalText" >INDIAVIX</td>
# </tr>
# ...
html = html.fromstring(content)
packages = html.xpath('//td/a//text()')
# -> ['INDIA VIX',
# 'INDIAVIX',
# 'Nifty 50',
# 'NIFTY',
# 'Nifty IT',
# 'NIFTYIT',
# 'Nifty Bank',
# 'BANKNIFTY',
# 'Nifty Midcap 50',
