I am trying to scan a web page to find the link to a specific product using part of the product name.
The HTML below is the part I am trying to extract information from:
<article class='product' data-json-url='/en/GB/men/products/omia066s188000161001.json' id='product_24793' itemscope='' itemtype='http://schema.org/Product'>
<header>
<h3>OMIA066S188000161001</h3>
</header>
<a itemProp="url" href="/en/GB/men/products/omia066s188000161001"><span content='OFF WHITE Shoes OMIA066S188000161001' itemProp='name' style='display:none'></span>
<span content='OFF WHITE' itemProp='brand' style='display:none'></span>
<span content='OMIA066S188000161001' itemProp='model' style='display:none'></span>
<figure>
<img itemProp="image" alt="OMIA066S188000161001 image" class="top" src="https://cdn.off---white.com/images/156374/product_OMIA066S188000161001_1.jpg?1498806560" />
<figcaption>
<div class='brand-name'>
HIGH 3.0 SNEAKER
</div>
<div class='category-and-season'>
<span class='category'>Shoes</span>
</div>
<div class='price' itemProp='offers' itemscope='' itemtype='http://schema.org/Offer'>
<span content='530.0' itemProp='price'>
<strong>£ 530</strong>
</span>
<span content='GBP' itemProp='priceCurrency'></span>
</div>
<div class='size-box js-size-box'>
<!-- / .available-size -->
<!-- / = render 'availability', product: product -->
<div class='sizes'></div>
</div>
</figcaption>
</figure>
</a></article>
My code is below:
import requests
from bs4 import BeautifulSoup
item_to_find = 'off white shoes'
s = requests.Session()
r = s.get('https://www.off---white.com/en/GB/section/new-arrivals.js')
soup = BeautifulSoup(r.content, 'html.parser')
#find_url = soup.find("a", {"content":item_to_find})['href']
#print(find_url)
How do I filter only the line where 'content' contains item_to_find and then extract the 'href' for that product?
The final output should look like the below:
/en/GB/men/products/omia066s188000161001
Give this a shot.
import requests
from bs4 import BeautifulSoup
item_to_find = 'off white shoes'
s = requests.Session()
r = s.get('https://www.off---white.com/en/GB/section/new-arrivals.js')
soup = BeautifulSoup(r.content, 'html.parser')
links = soup.find_all("a")
for link in links:
    if b'OFF WHITE Shoes' in link.encode_contents():
        print(link.get('href'))
Since the "OFF WHITE Shoes" text exists within a span we can use encode_contents() to check all of the mark up within each link. If the text we are searching for exists we get the link by using BeautifulSoups .get method.
A more specific answer for Python 3 would be:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
search_item = 'orange timberland' #make sure the search terms are in small letters (a portion of text will suffice)
URL = 'https://www.off---white.com/en/GB/section/new-arrivals.js'
res = requests.get(URL)
soup = BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all(class_="brand-name"):
    if search_item in link.text.lower():
        item_name = link.get_text(strip=True)
        item_link = urljoin(URL, link.find_parents()[2].get('href'))
        print("Name: {}\nLink: {}".format(item_name, item_link))
Output:
Name: ORANGE TIMBERLAND BOOTS
Link: https://www.off---white.com/en/GB/men/products/omia073s184780161900
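Note that find_parents()[2] relies on the brand-name div sitting exactly three levels below the product <a>; if that nesting ever changes, climbing straight to the nearest anchor ancestor is a bit more robust. A sketch of that variation, under the same assumptions as the answer above:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

search_item = 'orange timberland'
URL = 'https://www.off---white.com/en/GB/section/new-arrivals.js'

res = requests.get(URL)
soup = BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all(class_="brand-name"):
    if search_item in link.text.lower():
        # find_parent('a') walks up to the closest enclosing <a>, however deeply
        # the brand-name div happens to be nested
        anchor = link.find_parent('a')
        if anchor is not None:
            print("Name: {}\nLink: {}".format(link.get_text(strip=True),
                                              urljoin(URL, anchor.get('href'))))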
Here's the HTML code:
<div class="sizeBlock">
<div class="size">
<a class="selectSize" id="44526" data-size-original="36.5">36.5</a>
</div>
<div class="size inactive active">
<a class="selectSize" id="44524" data-size-original="40">40</a>
</div>
<div class="size ">
<a class="selectSize" id="44525" data-size-original="40.5">40.5</a>
</div>
</div>
I want to get the values of the id attribute and the data-size-original attribute.
Here's my code:
for sizeBlock in soup.find_all('a', class_="selectSize"):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-us')
The problem is that it gets the values of other ids that have the same class "selectSize".
I think this is what you want; you won't get the ids and size data from the div with class='size inactive active':
for sizeBlock in soup.select('div.size a.selectSize'):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-us')
This is already answered here: How to Beautiful Soup (bs4) match just one, and only one, css class.
Use soup.select. Here's a simple test:
from bs4 import BeautifulSoup
html_doc = """<div class="size">
<a class="selectSize otherclass" id="44526" data-ean="0193394075362" " data-tprice="" data-sku="1171177-36.5" data-size-original="36.5">5</a>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
#for sizeBlock in soup.find_all('a', class_="selectSize"): # this would include the anchor
for sizeBlock in soup.select("a[class='selectSize']"):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-original')
    print(aid, size)
I found that I can extract all the information I want from this HTML. I need to extract the title, href and src from it.
HTML:
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3090" class="main">
<img src="/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop" alt="apple" title="apple" />
</a>
</div>
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3091" class="main">
<img src="/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop" alt="banana" title="banana" />
</a>
</div>
Code:
import requests
from bs4 import BeautifulSoup
res = requests.get('http://www.cad.com/')
soup = BeautifulSoup(res.text,"lxml")
for a in soup.findAll('div', {"id": "home"}):
    for b in a.select(".main"):
        print("http://www.cad.com" + b.get('href'))
        print(b.get('title'))
I can successfully get the href from this, but since the title and src are on other lines, I don't know how to extract them. After this, I want to save them in Excel, so maybe I need to finish one step first and then do the second one.
Expected output:
/slim?p=3090
apple
/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop
/slim?p=3091
banana
/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop
My own solution:
import requests
from bs4 import BeautifulSoup
res = requests.get('http://www.cad.com/')
soup = BeautifulSoup(res.text,"lxml")
for a in soup.findAll('div', {"id": "home"}):
    divs = a.findAll('div', {"class": "home-hot-thumb"})
    for div in divs:
        title = div.img.get('title')
        print(title)
        href = 'http://www.cad.com/' + div.a.get('href')
        print(href)
        src = 'http://www.cad.com/' + div.img.get('src')
        print(src.replace('?w=70&h=70&mode=crop', ''))
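The question also mentions saving the results to Excel; since that step is not shown above, here is a minimal sketch using Python's built-in csv module (a .csv file opens in Excel, and the products.csv file name and column headers are just examples):
import csv
import requests
from bs4 import BeautifulSoup

res = requests.get('http://www.cad.com/')
soup = BeautifulSoup(res.text, "lxml")

# collect one (title, href, src) row per product thumbnail and write them out
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'href', 'src'])
    for a in soup.findAll('div', {"id": "home"}):
        for div in a.findAll('div', {"class": "home-hot-thumb"}):
            writer.writerow([
                div.img.get('title'),
                'http://www.cad.com/' + div.a.get('href'),
                'http://www.cad.com/' + div.img.get('src').replace('?w=70&h=70&mode=crop', ''),
            ])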
I am trying to scrape the title, summary, date, and link from http://www.indiainfoline.com/top-news for each div with class 'row'.
import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()

soup = BeautifulSoup(redditHtml, "lxml")
productDivs = soup.findAll('div', attrs={'class': 'row'})
for div in productDivs:
    result = {}
    try:
        heading = div.find('p', attrs={'class': 'heading fs20e robo_slab mb10'}).get_text()
        title = heading.get_text()
        article_link = "http://www.indiainfoline.com" + heading.find('a')['href']
        summary = div.find('p')
But none of the components are getting fetched. Any suggestion on how to fix this?
There are many class="row" elements in the HTML source, so you need to narrow things down to the section that holds the actual row data. In your case all 16 expected rows live under id="search-list", so first extract that section and then the rows. Since .select returns a list, we use [0] to get the section. Once you have the row data, iterate over it and extract the heading, article URL, summary, etc.:
import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()

soup = BeautifulSoup(redditHtml, "lxml")
section = soup.select('#search-list')
rowdata = section[0].select('.row')
for row in rowdata[1:]:
    heading = row.select('.heading.fs20e.robo_slab.mb10')[0].text
    article_link = 'http://www.indiainfoline.com' + row.select('a')[0]['href']
    summary = row.select('p')[0].text
    print(heading)
    print(article_link)
    print(summary)
Output:
PFC board to consider bonus issue; stock surges by 4%
http://www.indiainfoline.com/article/news-top-story/pfc-pfc-board-to-consider-bonus-issue-stock-surges-by-4-117080300814_1.html
PFC board to consider bonus issue; stock surges by 4%
...
...
Try this
from bs4 import BeautifulSoup
from urllib.request import urlopen
link = 'http://www.indiainfoline.com/top-news'
soup = BeautifulSoup(urlopen(link),"lxml")
fixed_html = soup.prettify()
ul = soup.find('ul', attrs={'class':'row'})
print(ul.find('li'))
You will get
<li class="animated" onclick="location.href='/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html';">
<div class="row">
<div class="col-lg-9 col-md-9 col-sm-9 col-xs-12 ">
<p class="heading fs20e robo_slab mb10">Lupin gets USFDA nod to market Rosuvastatin Calcium</p>
<p><!--style="color: green !important"-->
<img class="img-responsive visible-xs mob-img" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
Pharma major, Lupin announced on Thursday that the company has received the United States Food and Drug Administra...
</p>
<p class="source fs12e">India Infoline News Service |
Mumbai 15:42 IST | August 03, 2017 </p>
</div>
<div class="col-lg-3 col-md-3 col-sm-3 hidden-xs pl0 listing-image">
<img class="img-responsive" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
</div>
</div>
</li>
Of course you can print fixed_html to get the whole site content.
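The question asked for the title, summary, date and link; building on the <li> markup printed above, here is a hedged sketch of pulling those fields out of each list item (the class names and the onclick pattern are taken from that output, so adjust them if the live page differs):
import re

# `ul` is the <ul class="row"> found in the answer above
for li in ul.find_all('li'):
    title = li.find('p', class_='heading').get_text(strip=True)

    # the article URL sits inside the onclick handler: location.href='/article/...'
    m = re.search(r"location\.href='([^']+)'", li.get('onclick', ''))
    link = 'http://www.indiainfoline.com' + m.group(1) if m else None

    # the date is the last part of the <p class="source"> text, after the final '|'
    source = li.find('p', class_='source')
    date = source.get_text(strip=True).rsplit('|', 1)[-1].strip() if source else None

    # the summary is the paragraph without a class attribute
    summary_p = li.find(lambda t: t.name == 'p' and not t.get('class'))
    summary = summary_p.get_text(strip=True) if summary_p else None

    print(title, link, date, summary, sep='\n')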
I have written a Python script to scrape data from http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings
It is a list of 100 players and I successfully scraped this data. The problem is, when I run the script, instead of scraping the data just once it scrapes the same data three times.
<div class="cb-col cb-col-100 cb-font-14 cb-lst-itm text-center">
<div class="cb-col cb-col-16 cb-rank-tbl cb-font-16">1</div>
<div class="cb-col cb-col-50 cb-lst-itm-sm text-left">
<div class="cb-col cb-col-33">
<div class="cb-col cb-col-50">
<span class=" cb-ico" style="position:absolute;"></span> –
</div>
<div class="cb-col cb-col-50">
<img src="http://i.cricketcb.com/i/stats/fw/50x50/img/faceImages/2250.jpg" class="img-responsive cb-rank-plyr-img">
</div>
</div>
<div class="cb-col cb-col-67 cb-rank-plyr">
<a class="text-hvr-underline text-bold cb-font-16" href="/profiles/2250/steven-smith" title="Steven Smith's Profile">Steven Smith</a>
<div class="cb-font-12 text-gray">AUSTRALIA</div>
</div>
</div>
<div class="cb-col cb-col-17 cb-rank-tbl">906</div>
<div class="cb-col cb-col-17 cb-rank-tbl">1</div>
</div>
And here is the Python script I wrote to scrape each player's data.
import sys,requests,csv,io
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content, "html.parser")
maindiv = soup.find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)
but instead of scraping the data once, it scrapes the same data 3 times.
Where can I make changes to get the data just once?
Select the table and look for the divs in that:
maindiv = soup.select("#batsmen-tests div.text-center")
for div in maindiv:
    print(div.text)
Your original output, and the one above, gets all the text from each div as one line, which is not really useful. If you just want the player names:
anchors = soup.select("#batsmen-tests div.cb-rank-plyr a")
for a in anchors:
    print(a.text)
A quick and easy way to get the data in a nice csv format is to just get text from each child:
maindiv = soup.select("#batsmen-tests div.text-center")
for d in maindiv[1:]:
    row_data = u",".join(s.strip() for s in filter(None, (t.find(text=True, recursive=False) for t in d.find_all())))
    if row_data:
        print(row_data)
Now you get output like:
# rank, up/down, name, country, rating, best rank
1,–,Steven Smith,AUSTRALIA,906,1
2,–,Joe Root,ENGLAND,878,1
3,–,Kane Williamson,NEW ZEALAND,876,1
4,–,Hashim Amla,SOUTH AFRICA,847,1
5,–,Younis Khan,PAKISTAN,845,1
6,–,Adam Voges,AUSTRALIA,802,5
7,–,AB de Villiers,SOUTH AFRICA,802,1
8,–,Ajinkya Rahane,INDIA,785,8
9,2,David Warner,AUSTRALIA,772,3
10,–,Alastair Cook,ENGLAND,770,2
11,1,Misbah-ul-Haq,PAKISTAN,764,6
As opposed to:
PositionPlayerRatingBest Rank
Player
1 –Steven SmithAUSTRALIA9061
2 –Joe RootENGLAND8781
3 –Kane WilliamsonNEW ZEALAND8761
4 –Hashim AmlaSOUTH AFRICA8471
5 –Younis KhanPAKISTAN8451
6 –Adam VogesAUSTRALIA8025
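If you want that in an actual .csv file rather than printed to the console, a minimal sketch using the csv module (the rankings.csv file name and the header row, taken from the comment above, are just examples):
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open('rankings.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['rank', 'up/down', 'name', 'country', 'rating', 'best rank'])
    # same per-child text extraction as above, written out as csv rows
    for d in soup.select("#batsmen-tests div.text-center")[1:]:
        row = [s.strip() for s in
               filter(None, (t.find(text=True, recursive=False) for t in d.find_all()))]
        if row:
            writer.writerow(row)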
The reason you get the output three times is that the website has three ranking categories; you have to select the one you want and then use it accordingly.
The simplest way of doing it with your code would be to add just one line:
import sys,requests,csv,io
from bs4 import BeautifulSoup
url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen- rankings"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content, "html.parser")
specific_div = soup.find_all("div", {"id": "batsmen-tests"})
maindiv = specific_div[0].find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)
This will give similar results with just the Test batsmen; for other output, just change the "id" in the specific_div line.
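Since only the "id" changes between the categories, the same scrape can also be wrapped in a small loop; a sketch assuming each category tab has its own container id (only "batsmen-tests" is confirmed above, so the other ids would need to be read from the page source):
# `soup` is the BeautifulSoup object built in the snippet above;
# only "batsmen-tests" is confirmed, add the other tab ids after checking the page source
category_ids = ["batsmen-tests"]

for cat_id in category_ids:
    specific_div = soup.find_all("div", {"id": cat_id})
    if not specific_div:
        continue
    print("--- {} ---".format(cat_id))
    for div in specific_div[0].find_all("div", {"class": "text-center"}):
        print(div.text)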
I want to edit a Kodi addon that use re.compile to scrape data, and make it use BeautifulSoup4 instead.
The original code is like this:
import urllib, urllib2, re, sys, xbmcplugin, xbmcgui
link = read_url(url)
match = re.compile('<a class="frame[^"]*"'
                   ' href="(http://somelink.com/section/[^"]+)" '
                   'title="([^"]+)">.*?<img src="([^"]+)".+?Length:([^<]+)',
                   re.DOTALL).findall(link)
for url, name, thumbnail, length in match:
    addDownLink(name + length, url, 2, thumbnail)
The HTML it is scraping is like this:
<div id="content">
<span class="someclass">
<span class="sec">
<a class="frame" href="http://somlink.com/section/name-here" title="name here">
<img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg" >
</a>
</span>
<h3 class="title">
name here
</h3>
<span class="details"><span class="length">Length: 99:99</span>
</span>
.
.
.
</div>
How do I get the url (href), name, length and thumbnail using BeautifulSoup4, and pass them to addDownLink(name + length, url, 2, thumbnail)?
from bs4 import BeautifulSoup
html = """<div id="content">
<span class="someclass">
<span class="sec">
<a class="frame" href="http://somlink.com/section/name-here" title="name here">
<img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg" >
</a>
</span>
<h3 class="title">
name here
</h3>
<span class="details"><span class="length">Length: 99:99</span>
</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
sec = soup.find("span", {"class": "someclass"})
# get a tag with frame class
fr = sec.find("a", {"class": "frame"})
# pull img src and href from the a/frame
url, img = fr["href"], fr.find("img")["src"]
# get h3 with title class and extract the text from the anchor
name = sec.select("h3.title a")[0].text
# "size" is in the span with the details class
size = sec.select("span.details")[0].text.split(None,1)[-1]
print(url, img, name.strip(), size.strip())
Which gives you:
('http://somlink.com/section/name-here', 'http://www.somlink.com/thumb/imgsection/thumbnail.jpg', u'name here', u'99:99')
If you have multiple sections, we just need find_all and to apply the same logic to each section:
def secs():
    soup = BeautifulSoup(html, "lxml")
    sections = soup.find_all("span", {"class": "someclass"})
    for sec in sections:
        fr = sec.find("a", {"class": "frame"})
        url, img = fr["href"], fr.find("img")["src"]
        name, size = sec.select("h3.title a")[0].text, sec.select("span.details")[0].text.split(None,1)[-1]
        yield url, name, img, size
If you don't know all the classes but you know, for instance, that there is only one img tag, you can call find on the section:
sec.find("img")["src"]
And the same logic applies to the rest.
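To wire this back into the addon, the secs() generator above can feed addDownLink directly; a short sketch, where read_url and addDownLink are the helpers from the original addon code and html is the page markup that secs() parses:
# read_url() and addDownLink() come from the original addon code
html = read_url(url)

for link_url, name, thumbnail, length in secs():
    # mirrors the original addDownLink(name + length, url, 2, thumbnail) call
    addDownLink(name + ' ' + length, link_url, 2, thumbnail)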