Hello, I want to pull the links from this page. I can process the rest of the data with my own methods, but I just need the links. How can I scrape links? (Python, BeautifulSoup)
make_list = base_soup.findAll('div', {'a class': 'link--muted no--text--decoration result-item'})
one_make = make_list.findAll('href')
print(one_make)
The structure to extract the data is as follows:
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top">
<a class="link--muted no--text--decoration result-item" href="https://link structure"
Every single link I want to collect is here ("link structure").
This is the kind of method I tried. Thank you very much in advance for your help.
Note: in newer code, avoid the old syntax findAll(); use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
Iterate your ResultSet and extract the value of href attribute:
make_list = soup.find_all('a', {'class': 'link--muted no--text--decoration result-item'})
for e in make_list:
    print(e.get('href'))
Example
from bs4 import BeautifulSoup
html='''
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top">
<a class="link--muted no--text--decoration result-item" href="https://link structure"></a>
</div>
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top">
<a class="link--muted no--text--decoration result-item" href="https://link structure"></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
make_list = soup.find_all('a', {'class': 'link--muted no--text--decoration result-item'})
for e in make_list:
    print(e.get('href'))
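As an aside, the same extraction can be written with select() and a CSS selector, as recommended in the note above. A minimal sketch using the markup from the question (the href values here are made-up placeholders):

```python
from bs4 import BeautifulSoup

html = '''
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top">
<a class="link--muted no--text--decoration result-item" href="https://example.com/1"></a>
</div>
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top">
<a class="link--muted no--text--decoration result-item" href="https://example.com/2"></a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# chaining the class names in the selector matches anchors that carry
# all three classes
links = [a.get('href') for a in soup.select('a.link--muted.no--text--decoration.result-item')]
print(links)
```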
This is an example of how you can achieve that:
from bs4 import BeautifulSoup
html = '''
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top">
<a class="link--muted no--text--decoration result-item" href="https://link structure"></a>
</div>
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top">
<a class="link--muted no--text--decoration result-item" href="https://link example.2"></a>
</div>
'''
soup = BeautifulSoup(html, features="lxml")
anchors = soup.find_all('a')
for anchor in anchors:
    print(anchor['href'])
Alternatively, you can use a third-party service such as WebScrapingAPI. I recommend this service because it is beginner friendly and offers CSS extraction as well as advanced features such as IP rotation, JavaScript rendering, CAPTCHA solving, and custom geolocation, which you can find out about by checking the docs. This is an example of how you can get links from a webpage using WebScrapingAPI:
from bs4 import BeautifulSoup
import requests
import json
API_KEY = '<YOUR-API-KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://docs.webscrapingapi.com/'
PARAMS = {
"api_key":API_KEY,
"url": TARGET_URL,
"extract_rules": '{"linksList": {"selector": "a[href]", "output": "html", "all": 1 }}',
}
response = requests.get(SCRAPER_URL, params=PARAMS)
parsed_result = json.loads(response.text)
linksList = parsed_result['linksList']
for link in linksList:
    soup = BeautifulSoup(link, features='lxml')
    print(soup.find('a').get('href'))
If you are interested, you can find more information about this in our Extraction Rules docs.
I have found similar questions but none that directly address my issue. I have worked on this for about a week now with no luck.
I am trying to scrape data from this link: https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070
The issue is that the span holding the value I'm looking for has no class attribute, and when I use the div class attributes instead, the class name is shared with other values on the page. I want my code to return $22,807, but everything I try returns either $25,195 or []. See the following HTML:
<div class="text-right col-3 col-sm-4 col-md-6">
<div class="label-block label-block-1 label-block-sm-2 text-muted" data-qa="vehicle-header-msrp"
data-test="vehicleHeaderMsrp">
<div class="label-block-title" data-qa="LabelBlock-title" data-test="labelBlockTitle"></div>
<div class="label-block-subtitle" data-qa="LabelBlock-subTitle" data-test="labelBlockSubTitle"></div>
<div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
<span class="pricing-block-amount-strikethrough">$25,195</span>
</div>
</div>
</div>
<div class="text-right col-3 col-sm-4 col-md-6">
<div class="label-block label-block-1 label-block-sm-2" data-qa="vehicle-header-average-market-price"
data-test="vehicleHeaderAverageMarketPrice">
<div class="label-block-title" data-qa="LabelBlock-title" data-test="labelBlockTitle"></div>
<div class="label-block-subtitle" data-qa="LabelBlock-subTitle" data-test="labelBlockSubTitle"></div>
<div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
<span class="">$22,807</span>
</div>
</div>
</div>
I can easily get the $25,195 returned with the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:70.0) Gecko/20190101 Firefox/70.0"
}
url = "https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070"
print(url)
page = requests.get(
    url,
    headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
test = soup.find('span', {'class': 'pricing-block-amount-strikethrough'})
print(test.get_text())
But no combination of calls that I try will return the $22,807 that I need.
What's interesting is that I can get the $25,195 value if I use
test = soup.find('div', {'class': 'label-block label-block-1 label-block-sm-2 text-muted'})
So I assumed that I could simply delete the "text-muted" part like:
test = soup.find('div', {'class': 'label-block label-block-1 label-block-sm-2'})
to get the $22,807 number, but it just returns [].
Disclaimer: the dollar amount that I need changes frequently so if you help with this and end up getting a number slightly different than $22,807 it might still be correct. If you click on the link, the number I am looking for is the "Market Average" not the "MSRP."
Thank you!
If you browse the page yourself, it takes time for the second value you are looking for to appear. The requests module fetches the content immediately and doesn't wait for the page to load completely. This is where selenium comes in alongside bs4: it waits for the site to load and then gets the page content.
You can download geckodriver from the link.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070"
driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)
time.sleep(7)
soup = BeautifulSoup(driver.page_source, 'html.parser')
div = soup.find_all('div', {'class': 'label-block-text'})
for x in div:
    span = x.find('span')
    print(span.get_text())
This is the part of the HTML that I am extracting on the platform, and it has the snippet I want to get: the value of the href attribute of the tag with the class "bookTitle".
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
After logging in using the mechanize library, I have this piece of code to try to extract it, but it returns the name of the book instead. I have tried several ways to get only the href value, but none have worked so far.
from bs4 import BeautifulSoup as bs4
from requests import Session
from lxml import html
import mechanize as mc  # used below but missing from the original imports
import Downloader as dw
import requests

def getGenders(browser: mc.Browser, url: str, name: str) -> None:
    res = browser.open(url)
    aux = res.read()
    html2 = bs4(aux, 'html.parser')
    with open(name, "w", encoding='utf-8') as file2:
        file2.write(str(html2))
getGenders(br, "https://www.goodreads.com/shelf/show/art", "gendersBooks.html")
with open("gendersBooks.html", "r", encoding='utf8') as file:
    contents = file.read()
bsObj = bs4(contents, "lxml")
aux = open("books.text", "w", encoding='utf8')
officials = bsObj.find_all('a', {'class': 'booktitle'})
for text in officials:
    print(text.get_text())
    aux.write(text.get_text().format())
aux.close()
file.close()
Can you try this? (sorry if it doesn't work, I am not on a pc with python right now)
for text in officials:
    print(text['href'])
BeautifulSoup works just fine with the HTML code that you provided. If you want the text of a tag you simply use ".text"; if you want the href you use ".get('href')", or, if you are sure the tag has an href value, you can use "['href']".
Here is a simple, easy-to-understand example with your HTML code snippet.
from bs4 import BeautifulSoup
html_code = '''
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
'''
soup = BeautifulSoup(html_code, 'html.parser')
tag = soup.find('a', {'class':'bookTitle'})
# - Book Title -
title = tag.text
print(title)
# - Href Link -
href = tag.get('href')
print(href)
I don't know why you downloaded the HTML, saved it to disk, and then opened it again. If you just want to get some tag values, that round trip is totally unnecessary: you can save the HTML to a variable and pass that variable to BeautifulSoup.
I also see that you imported the requests library but used mechanize instead. As far as I know, requests is the easiest and most modern library for getting data from web pages in Python. You also imported Session from requests; a session is not necessary unless you want to make multiple requests and keep the connection to the server open for faster subsequent requests.
Also, if you open a file with the "with" statement, you are using a Python context manager, which handles closing the file for you, so you don't have to close it at the end.
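To illustrate that point about "with", a minimal sketch (the file name here is made up for the demonstration):

```python
import os
import tempfile

# a throwaway file path, just for the demonstration
path = os.path.join(tempfile.gettempdir(), 'demo.txt')

# the with statement closes the file automatically when the block exits,
# even if an exception is raised inside it
with open(path, 'w', encoding='utf-8') as f:
    f.write('some text')

print(f.closed)  # True, with no explicit f.close()
```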
So, simplified and without saving the downloaded HTML to disk, I would write your code like this:
from bs4 import BeautifulSoup
import requests

url = 'https://www.goodreads.com/shelf/show/art/gendersBooks.html'
html_source = requests.get(url).content
soup = BeautifulSoup(html_source, 'html.parser')

# - To get the tag that we want -
tag = soup.find('a', {'class': 'bookTitle'})

# - Extract Book Title -
title = tag.text

# - Extract href from Tag -
href = tag.get('href')
Now, if you have multiple "a" tags with the same class name ('a', {'class': 'bookTitle'}), then you do it like this.
get all the "a" tags first:
a_tags = soup.find_all('a', {'class': 'bookTitle'})
and then scrape each tag's info and append it to a books list:
books = []
for a in a_tags:
    try:
        title = a.text
        href = a.get('href')
        books.append({'title': title, 'href': href})  # <-- add each book dict to books list
        print(title)
        print(href)
    except:
        pass
To understand your code better, I advise you to read these related links:
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests:
https://requests.readthedocs.io/en/master/
Python Context Manager:
https://book.pythontips.com/en/latest/context_managers.html
https://effbot.org/zone/python-with-statement.htm
I have some HTML I am parsing in Python using the BeautifulSoup package. Here's the HTML:
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
I am capturing the results using this code chunk:
names = soup3.find_all('div', {'class': "n"})
contact = soup3.find_all('div', {'class': "x"})
other = soup3.find_all('div', {'class': "x c"})
Right now, both classes 'x' and 'x c' are being captured in the 'contact' variable. How can I prevent this from happening?
Try:
soup.select('div[class="x"]')
Output:
[<div class="x">Address</div>, <div class="x">Phone</div>]
from bs4 import BeautifulSoup
html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, 'html.parser')
contact = soup.find_all("div", class_="x")[1]
print(contact)
Output:
<div class="x">Phone</div>
What about using sets?
others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others
others will be {<div class="x c">Other</div>}
and
contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}
Note that this will only work in this specific case of classes. It may not work in general; it depends on the combinations of classes you have in the HTML.
See BeautifulSoup webscraping find_all( ): finding exact match for more details on how .find_all() works.
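Another option, if you want an exact class match without a CSS selector, is to filter on the element's full class list, since find_all(class_="x") matches any div whose class list merely contains "x". A sketch using the snippet from the question:

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# tag.get('class') returns the list of classes, so comparing it to ['x']
# keeps only divs whose class attribute is exactly "x"
contact = [d for d in soup.find_all('div') if d.get('class') == ['x']]
print(contact)
```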
Currently I use lxml to parse the HTML document and get data from the HTML elements,
but there is a new challenge: one piece of data is stored as a rating inside the HTML element's attributes.
https://i.stack.imgur.com/bwGle.png
<p data-rating="3">
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
</p>
It's easy to extract text between tags, but I have no idea how to get values from within a tag itself.
What do you suggest?
The challenge: I want to extract "3".
URL:https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
Br,
Gabriel.
If I understand your question and comments correctly, the following should extract all the ratings on that page:
import lxml.html
import requests
BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL)
root = lxml.html.fromstring(html.text)
targets = root.xpath('//p[./span[@class]]/@data-rating')
For example:
targets[0]
output
3
Try the script below:
from bs4 import BeautifulSoup
import requests
import re

BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("div", {"class": "ratings"}):
    # get all children of the tag
    for h in tag.children:
        # convert to string data type
        s = h.encode('utf-8').decode("utf-8")
        # find the tag with data-rating and get the text after the keyword
        m = re.search('(?<=data-rating=)(.*)', s)
        # check if not None
        if m:
            # print the text after data-rating and remove the last char
            print(m.group()[:-1])
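Since data-rating is an ordinary attribute of the p tag, it can also be read directly with BeautifulSoup's dict-style attribute access, which avoids the regex entirely. A sketch using the snippet from the question:

```python
from bs4 import BeautifulSoup

html = '''
<p data-rating="3">
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
</p>
'''

soup = BeautifulSoup(html, 'html.parser')
# tag['attribute'] returns the attribute's value as a string
rating = soup.find('p')['data-rating']
print(rating)
```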
I am scraping a few URLs in batch using BeautifulSoup.
Here is my script (only relevant stuff):
import urllib2
from bs4 import BeautifulSoup
quote_page = 'https://example.com/foo/bar'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
url_box = soup.find('div', attrs={'class': 'player'})
print url_box
This gives two different kinds of output depending on the HTML of the URL (about half the pages give the first print and the rest give the second).
Here's first kind of print:
<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>
And here's the other:
<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>
I want to extract the image URL which is poster in first and src in second.
Any ideas how I can do that so same script extracts that URL from either kind of print?
P.S The first print also has a mp4 link which I do not need.
You can use the get() method to get the value of attrs from the targeted tag.
You should be able to do something like this:
if url_box.find('video'):
    url = url_box.find('video').get('poster')
    mp4 = url_box.find('span').get('data-url')
if url_box.find('img'):
    url = url_box.find('img').get('src')
Decide which version you are dealing with and split accordingly:
firstVersion = '''<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>'''
secondVersion = '''<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>'''
def extractImageUrl(htmlInput):
    imageUrl = ""
    if "poster" in htmlInput:
        imageUrl = htmlInput.split('poster="')[1].split('"')[0]
    elif "src" in htmlInput:
        imageUrl = htmlInput.split('src="')[1].split('"')[0]
    return imageUrl
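String splitting works for these exact snippets, but it breaks if attribute order or quoting changes. A more robust sketch is to parse each variant with BeautifulSoup and branch on which tag is present (the URLs below are the placeholder ones from the question):

```python
from bs4 import BeautifulSoup

def extract_image_url(html_input):
    box = BeautifulSoup(html_input, 'html.parser')
    video = box.find('video')
    if video and video.get('poster'):  # first variant: poster on the video tag
        return video['poster']
    img = box.find('img')
    if img and img.get('src'):         # second variant: src on the img tag
        return img['src']
    return ""

first = '''<div class="player">
<video class="video-js vjs-fluid video-player" poster="https://example.com/path/to/jpg/random.jpg"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>'''

second = '''<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>'''

print(extract_image_url(first))
print(extract_image_url(second))
```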