Extracting information by scraping products - python

I'm learning how to extract product information by scraping e-commerce sites, and I have made some progress, but there are parts that I'm not able to parse.
With this code I can extract the information that is inside the tags:
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup('A LOT OF HTML HERE', 'html.parser')
productos = soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})
web_content_list = []
for product_info in productos:
    # Store the information in a dictionary
    web_content_dict = {}
    web_content_dict['Marca'] = product_info.find('div', {'class': 'product-item-manufacturer'}).text
    web_content_dict['Producto'] = product_info.find('strong', {'class': 'product name product-item-name'}).text
    web_content_dict['Precio'] = product_info.find('span', {'class': 'price'}).text
    # Append the dictionary to a list
    web_content_list.append(web_content_dict)
df_kiwoko = pd.DataFrame(web_content_list)
I can extract information from, for example:
<div class="product-item-manufacturer"> PEDIGREE </div>
And I'd like to extract information from this part:
<a href="https://www.kiwoko.com/sobre-pedigree-pollo-en-salsa-100-g-pollo-y-verduras.html"
class="product photo product-item-photo" tabindex="-1" data-id="PED321441"
data-name="Sobre Pedigree Vital Protection pollo y verduras en salsa 100 g"
data-price="0.49" data-category="PERROS" data-list="PERROS" data-brand="PEDIGREE"
data-quantity="1" data-click="">
For example, take "PERROS" from
data-category="PERROS"
How can I extract information that is not between > and <, i.e. the attribute values between the quotes?
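Attribute values such as data-category live on the tag itself rather than in its text, so you can read them with dictionary-style access or .get(). A minimal sketch (the URL and markup below are trimmed from the question):

```python
from bs4 import BeautifulSoup

html = '''<a href="https://www.kiwoko.com/sobre-pedigree-pollo-en-salsa-100-g-pollo-y-verduras.html"
class="product photo product-item-photo" data-id="PED321441"
data-price="0.49" data-category="PERROS" data-brand="PEDIGREE">link</a>'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a', class_='product-item-photo')

# Dictionary-style access raises KeyError if the attribute is missing
print(link['data-category'])         # PERROS
# .get() returns None (or a supplied default) instead of raising
print(link.get('data-brand'))        # PEDIGREE
print(link.get('data-missing', 'n/a'))  # n/a
```

Inside the existing loop you would call product_info.find('a', class_='product-item-photo') first and then read the attributes the same way.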

Related

Python BS4 can not extract data properly

So I have this source code
<div class="field-fluid col-xs-12 col-sm-12 col-md-12 col-lg-6">
<div class="label-fluid">Email Address</div>
<div class="data-fluid rev-field" aria-data="rei-0">maldapalmer<span class="hd-form-field">ajk89;fjioasjdfwjepu90f30 v09u30r nv8704rhnv987rjl3409u0asu[amav084-8235 087307304u0[9fd0]] asf74 john 9##83r8cva sarah sj4t8g#!$%#7h v7hgv 398#$&&^#7y9</span>#gmail<span class="hd-form-field">ajk89;fjioasjdfwjepu90f30 v09u30r nv8704rhnv987rjl3409u0asu[amav084-8235 087307304u0[9fd0]] asf74 john 9##83r8cva sarah sj4t8g#!$%#7h v7hgv 398#$&&^#7y9</span>.com</div>
</div>
I seem to be doing everything right, but I just cannot extract the email address housed in the second div within the main div element. This is my code:
fields = []
for row in rows:
    fields.append(row.find_all('div', recursive=False))
email = fields[0][0].find(class_="data-fluid rev-field").text
row here is the element that the main div is housed in. Any suggestions are welcome; I hope I explained the issue well enough.
The problem is that the string comes back empty (''). Thanks!
You can extract the email by using the following code:
from bs4 import *
from requests import get
response = get('http://127.0.0.1/bs.html')  # Replace 'http://127.0.0.1/bs.html' with your URL
sp = BeautifulSoup(response.text, 'html.parser')
email = sp.find('div', class_= "data-fluid rev-field").text
spn = sp.find('span', class_= "hd-form-field").text
email = email.replace(spn,"")
print(email)
Output:
maldapalmer#gmail.com
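Another option, rather than string replacement, is to remove the obfuscation spans from the tree entirely with decompose() before reading the text. A sketch using a shortened version of the question's HTML (the junk strings are stand-ins for the real filler):

```python
from bs4 import BeautifulSoup

html = ('<div class="data-fluid rev-field">maldapalmer'
        '<span class="hd-form-field">OBFUSCATED JUNK</span>#gmail'
        '<span class="hd-form-field">MORE JUNK</span>.com</div>')

soup = BeautifulSoup(html, 'html.parser')
field = soup.find('div', class_='data-fluid')

# Delete every hidden obfuscation span in place, then read what remains
for span in field.find_all('span', class_='hd-form-field'):
    span.decompose()

email = field.text
print(email)  # maldapalmer#gmail.com
```

This also handles pages where the two hidden spans contain different junk, which the single replace() above would miss.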

Finding tag of text-searched element in HTML

I am trying to scrape multiple web pages to compare the prices of books. Because every site has a different layout (and class names), I want to find the title of the book using regex and then the surrounding elements. An example of the code is given below.
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all(string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all(string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2)
This returns:
Names1: ['Title Book']
Names2: ['Title Book']
Now I want to use this information to find the corresponding price. I know that when an element has been selected using tags and class names, "next_sibling" can be used; however, this doesn't work for an element selected by text:
select_title = soup1.find('h2', {"class": "title"})
next_sib = select_title.next_sibling
print(next_sib)  # returns <p class='price>18.45
# now try the same thing on element selected by name, this will result in an error
next_sib = names1.next_sibling
How can I use the same method to find the price when I have found the element using its text?
A similar question can be found here: Find data within HTML tags using Python. However, it still uses the HTML tags.
EDIT: The problem is that I have many pages with different layouts and class names. Because of that I cannot use a tag/class/id name to find the elements, and I have to find the book titles using a regex.
To get the price, include the 'h2' tag in find_all() and then use find_next('p').
In the first example the closing quote was missing from the p tag's class name; I have corrected it to class='price'.
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all('h2',string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1[0].find_next('p').text)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all('h2',string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2[0].find_next('p').text)
Or change string to text
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1[0].find_next('p').text)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2[0].find_next('p').text)
EDITED
Use text to get the element without a tag and next_element to get the price value.
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1[0])
print('Price1: ', names1[0].next_element.next_element.next_element)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2[0])
print('Price2: ', names2[0].next_element.next_element.next_element)
Output:
Names1: Title Book
Price1: 18.45
Names2: Title Book
Price2: 18.45
You missed the closing quote for the p.price class in html_page1.
With names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))")) you get a NavigableString, which is why you get None for next_sibling.
You can find a regex-based solution in @Kunduk's answer.
An alternative, clearer and simpler solution that works for both html_page1 and html_page2:
soup = BeautifulSoup(html_page1, 'html.parser')
# or BeautifulSoup(html_page2, 'html.parser')
books = soup.select('div[class*=box]')
for book in books:
    book_title = book.select_one('h2').text
    book_price = book.select_one('p[class*=price]').text
    print(book_title, book_price)
div[class*=box] means a div whose class attribute contains box.
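Since find_all(text=...) returns NavigableString objects, another option (a sketch, using one of the pages from the question and a simplified title regex) is to climb from the matched string to its parent tag and then navigate siblings from there, which works regardless of class names:

```python
from bs4 import BeautifulSoup
import re

html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""

soup = BeautifulSoup(html_page2, 'html.parser')
# find(string=...) returns the NavigableString that matched
title = soup.find(string=re.compile(r"[A-Z][a-z]+\s[A-Z][a-z]+"))

# .parent is the enclosing <h2>; from a Tag, sibling navigation works again
price = title.parent.find_next_sibling('p').text
print(title.strip(), price)  # Title Book 18.45
```

The key point is that sibling/child navigation lives on Tag objects, so one .parent step restores it after a text search.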

How to get text with no class/id using selenium/python?

I'm trying to get a list of variables (date, size, medium, etc.) from this page (https://www.artprice.com/artist/844/hans-arp/lots/pasts) using python/selenium.
For the titles it was easy enough to use :
titles = driver.find_elements_by_class_name("sln_lot_show")
for title in titles:
    print(title.text)
However, the other variables seem to be plain text in the source code with no identifiable id or class.
For example, to fetch the dates made I have tried:
dates_made = driver.find_elements_by_xpath("//div[@class='col-sm-6']/p[1]")
for date_made in dates_made:
    print(date_made.get_attribute("date"))
and
dates_made = driver.find_elements_by_xpath("//div[@class='col-sm-6']/p[1]/date")
for date_made in dates_made:
    print(date_made.text)
which both produce no error, but are not printing any results.
Is there some way to get this text, which has no specific class or id?
Specific html here :
......
<div class="col-xs-8 col-sm-6">
<p>
<i><a id="sln_16564482" class="sln_lot_show" href="/artist/844/hans-arp/print-multiple/16564482/vers-le-blanc-infini" title=""Vers le Blanc Infini"" ng-click="send_ga_event_now('artist_past_lots_search', 'select_lot_position', 'title', {eventValue: 1})">
"Vers le Blanc Infini"
</a></i>
<date>
(1960)
</date>
</p>
<p>
Print-Multiple, Etching, aquatint,
<span ng-show="unite_to == 'in'" class="ng-hide">15 3/4 x 18 in</span>
<span ng-show="unite_to == 'cm'">39 x 45 cm</span>
</p>
Progressive mode: the JavaScript below returns a two-dimensional array of lots and details (0, 1, 2, 8, 9 are your indexes):
lots = driver.execute_script(
    "return [...document.querySelectorAll('.lot .row')]"
    ".map(e => [...e.querySelectorAll('p')].map(e1 => e1.textContent.trim()))")
Classic mode:
lots = driver.find_elements_by_css_selector(".lot .row")
for lot in lots:
    lotNo = lot.find_element_by_xpath("./div[1]/p[1]").get_attribute("textContent").strip()
    title = lot.find_element_by_xpath("./div[2]/i").get_attribute("textContent").strip()
    details = lot.find_element_by_xpath("./div[2]/p[2]").get_attribute("textContent").strip()
    date = lot.find_element_by_xpath("./div[3]/p[1]").get_attribute("textContent").strip()
    country = lot.find_element_by_xpath("./div[3]/p[2]").get_attribute("textContent").strip()
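Once Selenium has loaded the page, you can also hand driver.page_source to BeautifulSoup and parse the unclassed elements statically. A sketch against a simplified version of the HTML fragment above (the href and layout classes are abbreviated; in real use you would pass driver.page_source instead of the literal string):

```python
from bs4 import BeautifulSoup

html = '''
<div class="col-xs-8 col-sm-6">
  <p>
    <i><a id="sln_16564482" class="sln_lot_show" href="#">"Vers le Blanc Infini"</a></i>
    <date> (1960) </date>
  </p>
  <p>
    Print-Multiple, Etching, aquatint,
    <span class="ng-hide">15 3/4 x 18 in</span>
    <span>39 x 45 cm</span>
  </p>
</div>
'''

# In real use: soup = BeautifulSoup(driver.page_source, 'html.parser')
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('a', class_='sln_lot_show').get_text(strip=True)
# <date> is a non-standard tag, but find() locates it like any other tag
date = soup.find('date').get_text(strip=True)
print(title, date)  # "Vers le Blanc Infini" (1960)
```

This sidesteps the class/id problem entirely: the custom <date> tag is itself a reliable hook.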

BeautifulSoup: Scraping different data sets having same set of attributes in the source code

I'm using the BeautifulSoup module to scrape the total number of followers and the total number of tweets from a Twitter account. However, when I inspected the elements of the respective fields on the web page, I found that both fields are enclosed in the same set of HTML attributes:
Followers
<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor" data-nav="followers" href="/IAmJericho/followers" data-original-title="2,469,681 Followers">
<span class="ProfileNav-label">Followers</span>
<span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
Tweet count
<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" data-nav="tweets" tabindex="0" data-original-title="21,769 Tweets">
<span class="ProfileNav-label">Tweets</span>
<span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
The mining script that I wrote:
import urllib2
from bs4 import BeautifulSoup

link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src)
followers = ''
for e in res.findAll('span', {'data-is-compact': 'true'}):
    followers = e.text
print followers
However, since the total tweet count and the total number of followers are both enclosed in the same set of HTML attributes (a span tag with class="ProfileNav-value" and data-is-compact="true"), running the script above only gives me the number of followers.
How can I extract two pieces of information that are enclosed in similar HTML attributes with BeautifulSoup?
In this case, one way to achieve it is to rely on the fact that data-is-compact="true" appears exactly once for each piece of data you want to extract, and that tweets come first and followers second. You can keep a list of those titles in the same order and zip it with the matches to print both at the same time:
import urllib2
from bs4 import BeautifulSoup
profile = ['Tweets', 'Followers']
link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src)
followers = ''
for p, d in zip(profile, res.find_all('span', {'data-is-compact': "true"})):
    print p, d.text
It yields:
Tweets 21.8K
Followers 2.47M
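A more robust alternative, which does not depend on element order, is to disambiguate via the data-nav attribute on the parent <a> tag ("tweets" vs "followers" in the markup shown in the question). A sketch against a trimmed copy of that markup:

```python
from bs4 import BeautifulSoup

html = '''
<a class="ProfileNav-stat" data-nav="followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
<a class="ProfileNav-stat" data-nav="tweets" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# Look up each counter by its data-nav attribute, then read the value span
stats = {}
for nav in ('tweets', 'followers'):
    anchor = soup.find('a', attrs={'data-nav': nav})
    stats[nav] = anchor.find('span', class_='ProfileNav-value').text

print(stats)  # {'tweets': '21.8K', 'followers': '2.47M'}
```

Because each value is fetched by name, this keeps working even if Twitter reorders the navigation bar.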

Python: How to extract URL from HTML Page using BeautifulSoup?

I have an HTML page with multiple divs like:
<div class="article-additional-info">
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t...
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece">
<span class="arrows">»</span>
</a>
</div>
<div class="article-additional-info">
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe...
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece">
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments">
</div>
and I need to get the <a href=> value for all the divs with class article-additional-info
I am new to BeautifulSoup, so I just need the URLs:
"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"
What is the best way to achieve this?
According to your criteria, it returns three URLs (not two). Did you want to filter out the third?
Basic idea is to iterate through the HTML, pulling out only those elements in your class, and then iterating through all of the links in that class, pulling out the actual links:
In [1]: from bs4 import BeautifulSoup
In [2]: html = # your HTML
In [3]: soup = BeautifulSoup(html)
In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
...: for link in item.find_all('a'):
...: print link.get('href')
...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments
This limits your search to just those elements with the article-additional-info class tag, and inside of there looks for all anchor (a) tags and grabs their corresponding href link.
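If you do want to filter out the third (#comments) URL, one option is to select only the anchors with class more via a CSS selector. A sketch against a tidied copy of the question's two divs (the unclosed tags are closed here for clarity):

```python
from bs4 import BeautifulSoup

html = '''
<div class="article-additional-info">
  Story one...
  <a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece">
    <span class="arrows">&#187;</span>
  </a>
</div>
<div class="article-additional-info">
  Story two...
  <a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"></a>
  <a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments"></a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Only anchors with class "more" inside the target divs
urls = [a['href'] for a in soup.select('div.article-additional-info a.more')]
print(urls)  # the two story URLs, without the #comments link
```

The commentsCount anchor never matches a.more, so exactly one URL per article survives.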
After working with the documentation, I did it the following way. Thank you all for your answers; I appreciate them.
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f.fp)
>>> for link in soup.select('.article-additional-info'):
...     print link.find('a').attrs['href']
...
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article4323210.ece
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece
>>>
from bs4 import BeautifulSoup as BS

html = # Your HTML
soup = BS(html)
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print links.get('href')
Which prints:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments
