How to get URL out of href that is itself a hyperlink? - python

I'm using Python and lxml to try to scrape this HTML page. The problem I'm running into is trying to get the URL out of the hyperlink whose href is "Chapter02a". (Note that I couldn't get the link markup to render when posting; the list item actually looks like this.)
<li><a href="Chapter02a">Examples of Operations</a></li>
I have tried
//ol[@id="ProbList"]/li/a/@href
but that only gives me the text "Chapter02a".
Also:
//ol[@id="ProbList"]/li/a
This returns an lxml.html.HtmlElement object, and none of the properties I found in the documentation accomplish what I'm trying to do.
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
print(sections[0])
I want sections to be a list of URLs to the subsections.

The return value you are seeing is correct: Chapter02a is a relative link to the next section. The full URL is not returned because that is not how it is stored in the HTML.
To get the full urls you can use:
url_base = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/'
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
section_urls = [url_base + s for s in sections]
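If you would rather not hard-code the prefix, lxml can rewrite every relative link in the tree against a base URL for you. A minimal sketch using lxml's make_links_absolute, with the page URL you already requested serving as the base:
from lxml import html
import requests

page_url = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02'
chapter_req = requests.get(page_url)
chapter_html = html.fromstring(chapter_req.content)
# Resolve every href/src in the tree to an absolute URL against page_url.
chapter_html.make_links_absolute(page_url)
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
print(sections)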

You can also do the concatenation directly at the XPath level to regenerate the full URL from the relative link:
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('concat("https://www.math.wisc.edu/~mstemper2/Math/Pinter/", //ol[@id="ProbList"]/li/a/@href)')
print(sections)
output:
https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02A
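Note that XPath 1.0's concat() converts a node-set argument to a string by taking only the first matching node, which is why a single URL is printed here. To rebuild every section URL, use the Python-level list comprehension shown above.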

Related

How to get href from a class containing a specific text using CSS selector (Scrapy)

I am working with the following website: https://inmuebles.mercadolibre.com.mx/venta/, and I am trying to get the link from the "ver todos" button in the "Inmueble" section (highlighted in red in my screenshot). However, the "Tour virtual" and "Publicados hoy" sections (in blue) may or may not appear when visiting the site.
As shown in the screenshot, the ui-search-filter-dl classes contain the individual sections of the menu above, while the ui-search-filter-container classes contain the sub-sections displayed by the site (e.g. Casas, Departamento & Terrenos for Inmueble). With the intention of obtaining the link from the "ver todos" button of the "Inmueble" section, I was using this line of code:
ver_todos = response.css('div.ui-search-filter-dl')[2].css('a.ui-search-modal__link').attrib['href']
But since "Tour virtual" and "Publicados hoy" are not always on the page, I cannot be sure that the ui-search-filter-dl at index 2 always corresponds to the "ver todos" button.
I was trying to get the link from "ver todos" by using this line of code:
response.css(''':contains("Inmueble") ~ .ui-search-filter-dt-title
.ui-search-modal__link::attr(href)''').extract()
Basically, I was trying to get the href from a ui-search-filter-dt-title class that contains the title "Inmueble". Unfortunately, the output is an empty list. I would like to find the "ver todos" link using CSS and regex, but I'm having trouble with it. How can I achieve that?
I think XPath makes it easier to select the target elements in most cases:
Code:
xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(#class,'ui-search-modal__link')]/#href"
url = response.xpath(xpath).extract()[0]
I didn't actually create a Scrapy project to check your code; instead, I verified the XPath with the following lxml code:
from lxml import html
import requests
res = requests.get("https://inmuebles.mercadolibre.com.mx/venta/")
dom = html.fromstring(res.text)
xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(#class,'ui-search-modal__link')]/#href"
url = dom.xpath(xpath)[0]
assert url == 'https://inmuebles.mercadolibre.com.mx/venta/_FiltersAvailableSidebar?filter=PROPERTY_TYPE'
Since the XPath is the same in Scrapy and lxml, the expression shown at the beginning should also work fine in your Scrapy project.
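For reference, a minimal sketch of how the same expression would look inside a Scrapy callback (untested, since I didn't set up a Scrapy project):
def parse(self, response):
    xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(@class,'ui-search-modal__link')]/@href"
    # .get() returns the first match or None, so this won't raise if the section is missing.
    yield {'ver_todos_url': response.xpath(xpath).get()}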
An easy way to do it is to get all the <a> links and then check whether any of their text matches "ver todos".
import requests
from bs4 import BeautifulSoup
link = "https://inmuebles.mercadolibre.com.mx/venta/"
def main():
    res = requests.get(link)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, "html.parser")
        links = [a["href"] for a in soup.select("a") if a.text.strip().lower() == "ver todos"]
        print(links)

if __name__ == "__main__":
    main()
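If you prefer to push the text match into BeautifulSoup itself, find_all accepts a callable for its string argument. A sketch; note that string= only matches anchors whose content is a single text node, so the list comprehension above is safer when the link text is nested inside other tags:
# Match anchors whose own text is exactly "ver todos" (case-insensitive).
links = [a["href"] for a in soup.find_all("a", string=lambda t: t and t.strip().lower() == "ver todos")]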

Python requests, xpath download whole link

I am trying to scrape this page
http://kenyalaw.org:8181/exist/kenyalex/actview.xql?actid=CAP.%2016
And have this sample python code:
import requests
from lxml import html
r = requests.get("http://kenyalaw.org:8181/exist/kenyalex/actview.xql?actid=CAP.%2016")
data = html.fromstring(r.content)
print(data.xpath("//div[@class='subleg']/a/@href")[0])
and this gives me this output:
sublegview.xql?subleg=CAP. 16
but when I hover over this link in the browser, a different URL is shown:
http://kenyalaw.org:8181/exist/kenyalex/sublegview.xql?subleg=CAP.%2016
The href is relative: it denotes a path within the directory of the URL you are scraping. So remove everything after the last / in your URL using a regex and join the targeted element's href onto that base.
import requests
import re
from lxml import html
url = "http://kenyalaw.org:8181/exist/kenyalex/actview.xql?actid=CAP.%2016"
r = requests.get(url)
data = html.fromstring(r.content)
print(''.join([re.sub(r'(?<=/)[^/]*$', '', url), data.xpath("//div[@class='subleg']/a/@href")[0]]).replace(' ', ''))
Tell me if it's not working...
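A standard-library alternative to the regex is urllib.parse.urljoin, which performs exactly this kind of relative resolution. A sketch; the %20 replacement mirrors the encoded space the browser shows:
import requests
from urllib.parse import urljoin
from lxml import html

url = "http://kenyalaw.org:8181/exist/kenyalex/actview.xql?actid=CAP.%2016"
data = html.fromstring(requests.get(url).content)
href = data.xpath("//div[@class='subleg']/a/@href")[0]
# urljoin drops the last path segment of url and appends the relative href.
print(urljoin(url, href).replace(' ', '%20'))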

Trying to use Xpath to get data from a website but returns empty list

So I am trying to pull the APR for each pool off the website below, using lxml and XPath. I am using the XPath Helper extension to check that I have the correct XPath, and it seems to show that I do, but only an empty list is returned. The code I'm using is below.
import requests
from lxml import html
url = 'https://polyroll.org/farms'
response = requests.get(url)
tree = html.fromstring(response.content)
xpath = '//div[@class="sc-hmbstg iriXCP"]/div[1]/div[2]/div[1]/div/text()'
result = tree.xpath(xpath)
Any help would be appreciated.
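One quick sanity check (an assumption on my part, not something stated in the question): class names like sc-hmbstg are typical of styled-components, which suggests the page is rendered client-side with JavaScript. If the class never appears in the raw HTML that requests receives, the XPath has nothing to match, no matter how correct it looks in the browser:
import requests

response = requests.get('https://polyroll.org/farms')
# False here means the nodes are injected by JavaScript after page load,
# so a plain HTTP fetch will never contain them.
print('sc-hmbstg' in response.text)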

How to collect data from a website when the wanted tag has no class?

I would like to know how to get data from a website.
I found a tutorial and ended up with this:
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
page = requete.content
soup = BeautifulSoup(page)
The tutorial says I should use something like this to get the string of a tag:
h1 = soup.find("h1", {"class": "ico-after ico-tutorials"})
print(h1.string)
But I have a problem: the tag whose text content I want has no class... what should I do?
I tried passing {}, and also {"class": ""}, but both return None.
I want to get the text content of this part of the website :
<div style="font-size:3em; color:#6200C5;">
Orchard</div>
Where Orchard is the random word
Thanks for any kind of help.
Unfortunately, there aren't many handles for BeautifulSoup to grab onto here: the page you are trying to scrape is terribly ill-suited for your task (no IDs, classes, or other useful HTML features to point at).
Hence, you should change the way you point at the HTML element and use an XPath instead, and you can't do that with BeautifulSoup. To do so, just use html from the lxml package to parse the page. Below is a code snippet (based on the answers to this question) which extracts the random word in your example.
import requests
from lxml import html
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
tree = html.fromstring(requete.content)
rand_w = tree.xpath('/html/body/center/center/table[1]/tr/td/div/text()')
print(rand_w)
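Since text() returns a list of raw text nodes, the printed value still carries the surrounding whitespace; a small follow-up to get the bare word:
# rand_w is a list of string nodes; strip the whitespace around the word.
print(rand_w[0].strip())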

Get the code from inspect element using Python

In the Safari browser, I can right-click and select "Inspect Element", and a lot of code appears. Is it possible to get this code using Python? The best solution would be to get a file with the code in it.
More specifically, I am trying to find the links to the images on this page: http://500px.com/popular. I can see the links from "Inspect Element" and I would like to retrieve them with Python.
One way to get at the source code of a web page is to use the Beautiful Soup library; a tutorial is shown here. The code from that tutorial is shown below; the comments are mine. This particular code no longer works as-is because the contents of the example site have changed, but the concept should help you do what you want to do. Hope it helps.
from bs4 import BeautifulSoup
# If Python2:
#from urllib2 import urlopen
# If Python3 (urllib2 has been split into urllib.request and urllib.error):
from urllib.request import urlopen
BASE_URL = "http://www.chicagoreader.com"
def get_category_links(section_url):
    # Put the stuff you see when using Inspect Element in a variable called html.
    html = urlopen(section_url).read()
    # Parse the stuff.
    soup = BeautifulSoup(html, "lxml")
    # The next two lines will change depending on what you're looking for. This
    # line is looking for <dl class="boccat">.
    boccat = soup.find("dl", "boccat")
    # This line organizes what is found in the above line into a list of
    # hrefs (i.e. links).
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.findAll("dd")]
    return category_links
EDIT 1: The solution above provides a general way to web-scrape, but I agree with the comments to the question. The API is definitely the way to go for this site. Thanks to yuvi for providing it. The API is available at https://github.com/500px/PxMagic.
EDIT 2: The API documentation includes an example that matches your question about getting links to popular photos. The Python code from the example is pasted below. You will need to install the API library.
import fhp.api.five_hundred_px as f
import fhp.helpers.authentication as authentication
from pprint import pprint
key = authentication.get_consumer_key()
secret = authentication.get_consumer_secret()
client = f.FiveHundredPx(key, secret)
results = client.get_photos(feature='popular')
i = 0
PHOTOS_NEEDED = 2
for photo in results:
    pprint(photo)
    i += 1
    if i == PHOTOS_NEEDED:
        break
