Python requests, xpath download whole link - python

I am trying to scrape this page
http://kenyalaw.org:8181/exist/kenyalex/actview.xql?actid=CAP.%2016
and I have this sample Python code:
import requests
from lxml import html
r = requests.get("http://kenyalaw.org:8181/exist/kenyalex/actview.xql?actid=CAP.%2016")
data = html.fromstring(r.content)
print(data.xpath("//div[@class='subleg']/a/@href")[0])
and this gives me this output:
sublegview.xql?subleg=CAP. 16
but when I hover over this element in the browser, the link shown is different, as you can see in the picture below:
http://kenyalaw.org:8181/exist/kenyalex/sublegview.xql?subleg=CAP.%2016

I think the href is just relative to the directory of the URL you are scraping, so remove everything after the last / in your URL (with a regex, for example) and join it with the href of the targeted element:
import requests
import re
from lxml import html
url = "http://kenyalaw.org:8181/exist/kenyalex/actview.xql?actid=CAP.%2016"
r = requests.get(url)
data = html.fromstring(r.content)
base = re.sub(r'(?<=/)[^/]*$', '', url)  # everything up to and including the last /
href = data.xpath("//div[@class='subleg']/a/@href")[0]  # relative link from the page
print((base + href).replace(' ', ''))  # join them and drop the space in "CAP. 16"
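If you'd rather not build the base with a regex, the standard library's urllib.parse.urljoin resolves a relative href against the page URL the same way a browser does. A minimal sketch of that variant (same page and XPath as above; the space is percent-encoded so the result matches the hover URL):
import requests
from urllib.parse import urljoin
from lxml import html
url = "http://kenyalaw.org:8181/exist/kenyalex/actview.xql?actid=CAP.%2016"
r = requests.get(url)
data = html.fromstring(r.content)
href = data.xpath("//div[@class='subleg']/a/@href")[0]  # e.g. "sublegview.xql?subleg=CAP. 16"
print(urljoin(url, href).replace(' ', '%20'))  # resolve against the page URL, re-encode the space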
Tell me if it's not working...

Related

Extracting info from HTML that has no tags

I am using both Selenium and BeautifulSoup to do some web scraping. So far I have the following piece of code:
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
driver = Chrome()  # the driver has to be created before calling driver.get()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
The output soup produces has the following structure:
<html>
<head>
</head>
<body>
<rf-list-detail line-color="245,150,40" line-number="C2" line-text="Línea C2"
    list='[{..., "direction": "Place1"},
           ...,
           {..., "direction": "Place2"}, ...]'>
Note that both the text and output style have been modified for readability. I attach an image of the actual output just in case it is more convenient.
Does anyone know how I could obtain every PlaceN (in the image, Moixent would be Place1) in a list? Something like
places = [Place1,...,PlaceN]
I have tried parsing it, but as it has no tags (or at least my HTML knowledge, which is close to none, says so) I obtain nothing. I have also tried using a regular expression, which I have just found out was a thing, but I am not sure how to do it properly.
Any thoughts?
Thank you in advance!!
[image: output of soup]
This site responds with a non-HTML structure, so you don't need an HTML parser like BeautifulSoup or lxml for this task.
Here is an example using the requests library. You can install it like this:
pip install requests
import requests
import html
import json
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
response = requests.get(url)
data = response.text # get data from site
raw_list = data.split("'")[1] # extract rf-list-detail.list attribute
json_list = html.unescape(raw_list) # decode html symbols
parsed_list = json.loads(json_list) # parse json
print(parsed_list) # printing result
directions = []
for item in parsed_list:
    directions.append(item["direction"])
print(directions) # extracting directions
# ['Moixent', 'Vallada', 'Montesa', "L'Alcudia de Crespins", 'Xàtiva', "L'Enova-Manuel", 'La Pobla Llarga', 'Carcaixent', 'Alzira', 'Algemesí', 'Benifaió-Almussafes', 'Silla', 'Catarroja', 'Massanassa', 'Alfafar-Benetússer', 'València Nord']
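If you do want to stay with BeautifulSoup, the custom rf-list-detail tag from your snippet can also be located and its list attribute parsed as JSON. This is only a sketch that assumes the tag and attribute names really are as shown in the question; the parser already unescapes the &quot; entities in the attribute value:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
detail = soup.find('rf-list-detail')  # the custom element from the question's snippet
parsed_list = json.loads(detail['list'])  # the attribute value is HTML-escaped JSON
directions = [item['direction'] for item in parsed_list]
print(directions)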

How to get URL out of href that is itself a hyperlink?

I'm using Python and lxml to try to scrape this html page. The problem I'm running into is trying to get the URL out of this hyperlink text "Chapter02a". (Note that I can't seem to get the link formatting to work here).
<li><a href="Chapter02A">Examples of Operations</a></li>
I have tried
//ol[@id="ProbList"]/li/a/@href
but that only gives me the text "Chapter02a".
Also:
//ol[@id="ProbList"]/li/a
This returns an lxml.html.HtmlElement object, and none of the properties that I found in the documentation accomplish what I'm trying to do.
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
print(sections[0])
I want sections to be a list of URLs to the subsections.
The return you are seeing is correct because Chapter02a is a "relative" link to the next section. The full URL is not listed because that is not how it is stored in the HTML.
To get the full urls you can use:
url_base = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/'
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
section_urls = [url_base + s for s in sections]
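Alternatively, lxml can resolve the relative links for you: HtmlElement.make_links_absolute() rewrites every href in the tree against a base URL. A sketch using the page URL from the question:
from lxml import html
import requests
page_url = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02'
chapter_html = html.fromstring(requests.get(page_url).content)
chapter_html.make_links_absolute(page_url)  # resolves relative hrefs in place
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
print(sections)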
You can also do the concatenation directly at the XPATH level to regenerate the URL from the relative link:
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('concat("https://www.math.wisc.edu/~mstemper2/Math/Pinter/",//ol[@id="ProbList"]/li/a/@href)')
print(sections)
output:
https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02A

Extract element from HTML with Python's BeautifulSoup library

I'm looking to extract data from Instagram and record the time of the post without using auth.
The code below gives me the HTML of the page for the IG post, but I'm not able to extract the time element from it.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import json
url_path = 'https://www.instagram.com/<username>'
session = HTMLSession()
r = session.get(url_path)
soup = BeautifulSoup(r.content,features='lxml')
print(soup)
I would like to extract data from the time element near the bottom of this screenshot
To extract the time you can use the HTML tag and its class:
time = soup.find("time", {"class": "_1o9PC Nzb55"}).text
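If the time element really does end up in the parsed HTML (the answer below argues it doesn't for Instagram), the machine-readable value usually lives in its datetime attribute rather than in the text, so a slightly safer sketch would be:
time_tag = soup.find("time", {"class": "_1o9PC Nzb55"})
if time_tag is not None:
    print(time_tag.get("datetime"))  # typically an ISO-8601 timestamp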
I'm guessing that the picture you've shared is a browser inspector screenshot. Although inspecting the page is a good starting point for web scraping, you should check what BeautifulSoup is actually getting. If you look at the printed soup you will see that the data you are looking for is JSON inside a script tag, so your code, and any other solution that targets the time tag, won't work with BS4. You might try Selenium instead.
Anyway, here is the BeautifulSoup pseudo-solution using the Instagram account from your screenshot:
from bs4 import BeautifulSoup
import json
import re
import requests
import time
url_path = "https://www.instagram.com/srirachi9/"
response = requests.get(url_path)
soup = BeautifulSoup(response.content, "lxml")
pattern = re.compile(r"window\._sharedData = (.*);", re.MULTILINE)
script = soup.find("script", text=lambda x: x and "window._sharedData" in x).text
data = json.loads(re.search(pattern, script).group(1))
times = len(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'])
for x in range(times):
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(
        data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'][x]['node']['taken_at_timestamp'])))
The times variable is the number of timestamps the JSON contains. It may look like hell, but it's just a matter of patiently following the JSON structure and indexing accordingly.

Retrieving a subset of href's from findall() in BeautifulSoup

My goal is to write a Python script that takes an artist's name as a string input and appends it to the base URL of the Genius search query, then retrieves all the lyrics links from the returned web page (the subset I need, where every link also contains the artist's name). Right now I am in the initial phase and have only been able to retrieve all links from the page, including the ones I don't want in my subset. I tried to find a simple solution but failed repeatedly.
import requests
# The Requests library.
from bs4 import BeautifulSoup
from lxml import html
user_input = input("Enter Artist Name = ").replace(" ","+")
base_url = "https://genius.com/search?q="+user_input
header = {'User-Agent':''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
for link in soup.find_all('a', href=True):
    print(link['href'])
This returns the complete list below, while I only need the ones that end with "lyrics" and contain the artist's name (here, for instance, Drake). These will be the links from which I should be able to retrieve the lyrics.
https://genius.com/
/signup
/login
https://www.facebook.com/geniusdotcom/
https://twitter.com/Genius
https://www.instagram.com/genius/
https://www.youtube.com/user/RapGeniusVideo
https://genius.com/new
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
/search?page=2&q=drake
/search?page=3&q=drake
/search?page=4&q=drake
/search?page=5&q=drake
/search?page=6&q=drake
/search?page=7&q=drake
/search?page=8&q=drake
/search?page=9&q=drake
/search?page=672&q=drake
/search?page=673&q=drake
/search?page=2&q=drake
/embed_guide
/verified-artists
/contributor_guidelines
/about
/static/press
mailto:brands@genius.com
https://eventspace.genius.com/
/static/privacy_policy
/jobs
/developers
/static/terms
/static/copyright
/feedback/new
https://genius.com/Genius-how-genius-works-annotated
https://genius.com/Genius-how-genius-works-annotated
My next step would be to use Selenium to emulate scrolling, which in the case of genius.com yields the entire set of search results. Any suggestions or resources would be appreciated. I would also like a few comments on how I intend to proceed with this solution. Can we make it more generic?
P.S. I may not have explained my problem very lucidly, but I have tried my best. Questions about any ambiguities are welcome too. I am new to scraping, Python, and programming in general, so I just wanted to make sure that I am following the right path.
Use the re module to match only the links you want.
import requests
# The Requests library.
from bs4 import BeautifulSoup
from lxml import html
import re
user_input = input("Enter Artist Name = ").replace(" ","+")
base_url = "https://genius.com/search?q="+user_input
header = {'User-Agent':''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
pattern = re.compile(r"\S+-lyrics$")
for link in soup.find_all('a', href=True):
    if pattern.match(link['href']):
        print(link['href'])
Output:
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
This just checks whether the link matches the pattern ending in -lyrics. You may use similar logic to filter on the user_input variable as well, as shown in the sketch below.
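For instance, a sketch that also checks the artist, assuming Genius keeps the artist's name at the start of the song slug (as in the URLs above) and that spaces become hyphens there:
artist_slug = user_input.replace("+", "-")  # e.g. "Kendrick+Lamar" -> "Kendrick-Lamar"
artist_pattern = re.compile(r"^https://genius\.com/" + re.escape(artist_slug) + r"-\S+-lyrics$", re.IGNORECASE)
for link in soup.find_all('a', href=True):
    if artist_pattern.match(link['href']):
        print(link['href'])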
Hope this helps.

get all links from html even with show more link

I am using python and beautifulsoup for html parsing.
I am using the following code :
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"
main_url = urllib2.urlopen(url)
content = url.read()
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True):
    print a[href]
but I am not getting output links like:
http://www.wikipathways.org/index.php/Pathway:WP26
Also, an important point is that there are 107 pathways, but I will not get all the links, as the others depend on the "show links" at the bottom of the page.
So, how can I get all the links (107 links) from that URL?
Your problem is the line content = url.read(). You're not actually reading the web page there; url is just a string, so you should be getting an AttributeError.
main_url is the response object you want to read, so change that line to:
content = main_url.read()
You also have another error, print a[href]. href should be a string, so it should be:
print a['href']
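Putting both fixes together, the corrected script (still Python 2 and the old BeautifulSoup import, as in the question) looks like this:
from BeautifulSoup import BeautifulSoup
import urllib2
url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"
main_url = urllib2.urlopen(url)
content = main_url.read()  # read from the response object, not the url string
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True):
    print a['href']  # index with the string 'href'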
I would suggest using lxml; it's faster and better for parsing HTML, and worth investing the time to learn.
from lxml.html import parse
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
links = dom.cssselect('a')
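To get from those elements to the actual URLs, and to keep only the pathway pages you mentioned, a follow-up sketch (assuming the pathway links all contain /index.php/Pathway: as in your example) could be:
dom.make_links_absolute()  # a document parsed from a URL remembers its base URL
hrefs = [a.get('href') for a in links if a.get('href')]
pathway_links = [h for h in hrefs if '/index.php/Pathway:' in h]
print(len(pathway_links))
print(pathway_links[:5])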
That should get you going.
