I am new to Python and working on a scraping project. I am using Firebug to copy the CSS paths of the links I need. I am trying to collect the links under the "UPCOMING EVENTS" tab of http://kiascenehai.pk/, but only as an exercise in learning how to get specific links.
I am looking for a fix for this problem, and also for suggestions on how to retrieve specific links using CSS selectors.
from bs4 import BeautifulSoup
import requests
url = "http://kiascenehai.pk/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
for link in soup.select("html body div.body-outer-wrapper div.body-wrapper.boxed-mode div.main-outer-wrapper.mt30 div.main-wrapper.container div.row.row-wrapper div.page-wrapper.twelve.columns.b0 div.row div.page-wrapper.twelve.columns div.row div.eight.columns.b0 div.content.clearfix section#main-content div.row div.six.columns div.small-post-wrapper div.small-post-content h2.small-post-title a"):
    print(link.get('href'))
First of all, that page requires a city selection to be made (in a cookie). Use a Session object to handle this:
s = requests.Session()
s.post('http://kiascenehai.pk/select_city/submit_city', data={'city': 'Lahore'})
response = s.get('http://kiascenehai.pk/')
Now the response gets the actual page content, not redirected to the city selection page.
Next, keep your CSS selector no larger than needed. This page doesn't give you much to go on, as it uses a grid layout, so we first need to zoom in on the right rows:
soup = BeautifulSoup(response.content, 'html.parser')
upcoming_events_header = soup.find('div', class_='featured-event')
upcoming_events_row = upcoming_events_header.find_next(class_='row')

for link in upcoming_events_row.select('h2 a[href]'):
    print(link['href'])
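The find / find_next / select chain can be seen in isolation on a minimal snippet; the markup below is hypothetical, written only to mirror the structure described above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the page structure described above.
html = """
<div class="featured-event">UPCOMING EVENTS</div>
<div class="row">
  <h2 class="small-post-title"><a href="/event/1">First event</a></h2>
  <h2 class="small-post-title"><a href="/event/2">Second event</a></h2>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

header = soup.find('div', class_='featured-event')  # the section header
row = header.find_next(class_='row')                # the row that follows it
links = [a['href'] for a in row.select('h2 a[href]')]
print(links)  # ['/event/1', '/event/2']
```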
This is the co-founder of KiaSceneHai.pk; please don't scrape the website, a lot of effort goes into collecting the data. We offer access through our API; you can use the contact form to request access. Thanks.
I want to scrape links to patents from a Google Patents search using BeautifulSoup, but I'm not sure whether Google renders the results with JavaScript (which BeautifulSoup cannot execute) or whether something else is the issue.
Here is some simple code:
from bs4 import BeautifulSoup
import requests

url = 'https://patents.google.com/?assignee=Roche&after=priority:20110602&type=PATENT&num=100'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

links = []
for link in soup.find_all('a', href=True):
    links.append(link['href'])
    print(link['href'])
I also wanted to append the links to the list, but nothing is printed because the soup contains no 'a' tags.
Is there any way to grab the links to all of the patents?
The data is rendered dynamically, so it's hard to get with bs4 alone. Instead, open Chrome's developer tools and go to the Network tab. Under the XHR filter, reload the page; a list of requests will appear under the Name column, and one of those links returns all of the data in JSON format.
Copy the address of that link and request it with the requests module; you can then extract whatever data you want.
If you want the individual links, each one is built from a publication_number, which you can join with the base URL to get the links of the publications.
import requests

main_url = "https://patents.google.com/"
params = "?assignee=Roche&after=priority:20110602&type=PATENT&num=100"

# This is the XHR endpoint the search page calls for its results.
res = requests.get("https://patents.google.com/xhr/query?url=assignee%3DRoche%26after%3Dpriority%3A20110602%26type%3DPATENT%26num%3D100&exp=")
main_data = res.json()

data = main_data['results']['cluster']
for result in data[0]['result']:
    num = result['patent']['publication_number']
    print(num)
    print(main_url + "patent/" + num + "/en" + params)
Output:
US10287352B2
https://patents.google.com/patent/US10287352B2/en?assignee=Roche&after=priority:20110602&type=PATENT&num=100
US10364292B2
https://patents.google.com/patent/US10364292B2/en?assignee=Roche&after=priority:20110602&type=PATENT&num=100
US10494633B2
.....
I am looking for a way to scrape data from the student-accommodation website Uniplaces: https://www.uniplaces.com/en/accommodation/berlin.
In the end, I would like to scrape particular information for each property, such as bedroom size, number of roommates, location. In order to do this, I will first have to scrape all property links and then scrape the individual links afterwards.
However, even after going through the console and using BeautifulSoup for the extraction of urls, I was not able to extract the urls leading to the separate listings. They don't seem to be included as an href, and I wasn't able to identify the links in any other format within the html code.
This is the python code I used but it also didn't return anything:
from bs4 import BeautifulSoup
import urllib.request

resp = urllib.request.urlopen("https://www.uniplaces.com/accommodation/lisbon")
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])
So my question is: If links are not included in http:// format or referenced as [href]: is there any way to extract the listings urls?
I would really highly appreciate any support on this!
All the best,
Hannah
If you look at the Network tab, you will find an API call to this url: https://www.uniplaces.com/api/search/offers?city=PT-lisbon&limit=24&locale=en_GB&ne=38.79507211908374%2C-9.046124472314432&page=1&sw=38.68769060641113%2C-9.327992453271463
It specifies the location (PT-lisbon) and the north-east (ne) and south-west (sw) corners of the map. From this response you can get the id of each offer and append it to the current url, and you can also get all of the info shown on the webpage (price, description, etc.).
For instance :
import requests

resp = requests.get(
    url='https://www.uniplaces.com/api/search/offers',
    params={
        "city": 'PT-lisbon',
        "limit": '24',
        "locale": 'en_GB',
        # Pass raw commas here; requests url-encodes parameter values itself,
        # so pre-encoded "%2C" would be double-encoded.
        "ne": '38.79507211908374,-9.046124472314432',
        "page": '1',
        "sw": '38.68769060641113,-9.327992453271463',
    })
body = resp.json()

base_url = 'https://www.uniplaces.com/accommodation/lisbon'
data = [
    (
        t['id'],                                            # offer id
        base_url + '/' + t['id'],                           # the offer page
        t['attributes']['accommodation_offer']['title'],
        t['attributes']['accommodation_offer']['price']['amount'],
        t['attributes']['accommodation_offer']['available_from'],
    )
    for t in body['data']
]
print(data)
I am scraping an html file; each page has a video on it, and the html contains the video id. I want to print out the video id.
I know that if I want to print a headline from a div class, I would do this:
from bs4 import BeautifulSoup

with open('yeehaw.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    article = soup.find('div', class_='article')
    headline = article.h2.a.text
    print(headline)
However, the id for the video is found inside an attribute, data-id='qe67234'.
I don't know how to access 'qe67234' and print it out.
Please help, thank you!
Assuming that the tag for data-id begins with div:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup('<div class="_article" data-id="qe67234"></div>', 'html.parser')
results = soup.find_all("div", {"data-id": re.compile(r".*")})
print('output: ', results[0]['data-id'])
# output: qe67234
Assuming that the data-id is in a div:
BeautifulSoup.find returns the found html element as a Tag object. You can navigate it using standard means to get access to the text (as you did in your question) as well as to the tag's attributes, via dictionary-style access (as shown in the code below):
soup = BeautifulSoup('<div class="_article" data-id="qe67234"></div>', 'html.parser')
soup.find("div", {"class": "_article"})['data-id']
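A small self-contained snippet makes the difference between text content and attribute access visible (the markup here is hypothetical):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="_article" data-id="qe67234">My headline</div>',
                     'html.parser')
tag = soup.find('div', {'class': '_article'})

print(tag.text)        # text content between the tags: My headline
print(tag['data-id'])  # dictionary-style attribute access: qe67234
```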
Note that, oftentimes, video elements require JS for playback, and you might not be able to find the necessary element if it was scraped with a non-javascript client (i.e. python requests).
If this happens, you have to use tools like phantomjs + selenium browser to get the website combined with the javascript to perform your scraping.
EDIT
If the data-id attribute name itself is not constant, you should look into the lxml library as a replacement for BeautifulSoup, and use XPath expressions to find the element that you need.
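As a sketch of the XPath approach, the standard library's xml.etree.ElementTree supports a limited XPath subset that is enough for an attribute-existence test like this one (lxml offers the full syntax and a more forgiving HTML parser; the snippet below is a minimal well-formed example):

```python
import xml.etree.ElementTree as ET

# Minimal, well-formed snippet; a real page would need lxml's HTML parser.
root = ET.fromstring('<html><body><div class="_article" data-id="qe67234"/></body></html>')

# XPath-style predicate: match any element carrying a data-id attribute,
# regardless of the tag name.
for el in root.findall('.//*[@data-id]'):
    print(el.get('data-id'))  # qe67234
```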
I'm trying to scrape the links for the top 10 articles on Medium each day. It looks like all of the article links are in the class "postArticle-content", but when I run this code I only get the top 3. Is there a way to get all 10?
from bs4 import BeautifulSoup
import requests

r = requests.get("https://medium.com/browse/726a53df8c8b")
data = r.text
soup = BeautifulSoup(data, 'html.parser')

articles = soup.find_all('div', attrs={'class': 'postArticle-content'})
for div in articles:
    for link in div.find_all('a'):
        print(link.get('href'))
requests did give you the entire response.
That page contains only the first three articles. The website is designed to use javascript code, running in the browser, to load additional content and add it to the page.
You need an entire web browser, with a javascript engine, to do what you are trying to do. The requests and beautiful-soup libraries are not a web browser. They are merely an implementation of the HTTP protocol and an HTML parser, respectively.
I am new to Python and learning it for scraping purposes. I am using BeautifulSoup to collect links (i.e. the href of 'a' tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I am using Firebug to inspect the elements and get the CSS path, but this code returns nothing. I am looking for the fix, and also for some suggestions on how I can choose proper CSS selectors to retrieve desired links from any site. I wrote this piece of code:
from bs4 import BeautifulSoup
import requests
url = "http://allevents.in/lahore/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
for link in soup.select('html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
    print(link.get('href'))
The page is not the most friendly in the use of classes and markup, but even so your CSS selector is too specific to be useful here.
If you want Upcoming Events, you want just the first <div class="events-horizontal">, then just grab the <div class="title"><a href="..."></div> tags, so the links on titles:
upcoming_events_div = soup.select_one('div.events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
print(link['href'])
Note that you should not use r.text; use r.content and leave the decoding to BeautifulSoup. See: Encoding issue of a character in utf-8.
import bs4, requests

res = requests.get("http://allevents.in/lahore/")
soup = bs4.BeautifulSoup(res.text, 'html.parser')

for link in soup.select('a[property="schema:url"]'):
    print(link.get('href'))
This code will work fine!!