Python BS4 cannot extract data properly

So I have this source code
<div class="field-fluid col-xs-12 col-sm-12 col-md-12 col-lg-6">
<div class="label-fluid">Email Address</div>
<div class="data-fluid rev-field" aria-data="rei-0">maldapalmer<span class="hd-form-field">ajk89;fjioasjdfwjepu90f30 v09u30r nv8704rhnv987rjl3409u0asu[amav084-8235 087307304u0[9fd0]] asf74 john 9##83r8cva sarah sj4t8g#!$%#7h v7hgv 398#$&&^#7y9</span>#gmail<span class="hd-form-field">ajk89;fjioasjdfwjepu90f30 v09u30r nv8704rhnv987rjl3409u0asu[amav084-8235 087307304u0[9fd0]] asf74 john 9##83r8cva sarah sj4t8g#!$%#7h v7hgv 398#$&&^#7y9</span>.com</div>
</div>
I seem to be doing everything right, however I just cannot extract the email address housed in the second div within the main div element. This is my code:
fields = []
for row in rows:
    fields.append(row.find_all('div', recursive=False))
email = fields[0][0].find(class_="data-fluid rev-field").text
row here is the element in which the main div is housed. Any suggestions are welcome; I also hope I explained the issue well enough.
The problem is that the extracted string shows up as empty (''). Thanks!

You can extract the email by using the following code:
from bs4 import BeautifulSoup
from requests import get

response = get('http://127.0.0.1/bs.html')  # replace 'http://127.0.0.1/bs.html' with your URL
sp = BeautifulSoup(response.text, 'html.parser')

email = sp.find('div', class_="data-fluid rev-field").text
spn = sp.find('span', class_="hd-form-field").text
# both obfuscation spans contain the same junk string, so one replace removes both copies
email = email.replace(spn, "")
print(email)
Output:
maldapalmer#gmail.com
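A slightly more general sketch (assuming every obfuscation chunk is a span with class hd-form-field, as in the HTML above) removes all of those spans first and then reads whatever text is left:
from bs4 import BeautifulSoup
from requests import get

response = get('http://127.0.0.1/bs.html')  # replace with your URL
sp = BeautifulSoup(response.text, 'html.parser')

field = sp.find('div', class_='data-fluid rev-field')
# drop every hidden obfuscation span, then read the remaining text
for span in field.find_all('span', class_='hd-form-field'):
    span.decompose()
print(field.get_text(strip=True))  # maldapalmer#gmail.com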

Related

Extracting information by scraping products

I'm learning how to extract product information by scraping an ecommerce site, and I have achieved a little, but there are parts that I'm not able to parse.
With this code I can extract the information that is inside the tags:
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup('A LOT OF HTML HERE', 'html.parser')
productos = soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})

web_content_list = []
for product_info in productos:
    # Store the information in a dictionary
    web_content_dict = {}
    web_content_dict['Marca'] = product_info.find('div', {'class': 'product-item-manufacturer'}).text
    web_content_dict['Producto'] = product_info.find('strong', {'class': 'product name product-item-name'}).text
    web_content_dict['Precio'] = product_info.find('span', {'class': 'price'}).text
    # Append the dictionary to a list
    web_content_list.append(web_content_dict)

df_kiwoko = pd.DataFrame(web_content_list)
I can extract information from, for example:
<div class="product-item-manufacturer"> PEDIGREE </div>
And I'd like to take information from this part:
<a href="https://www.kiwoko.com/sobre-pedigree-pollo-en-salsa-100-g-pollo-
y-verduras.html" class="product photo product-item-photo" tabindex="-1"
data-id="PED321441" data-name="Sobre Pedigree Vital Protection pollo y
verduras en salsa 100 g" data-price="0.49" data-category="PERROS" data-
list="PERROS" data-brand="PEDIGREE" data-quantity="1" data-click=""
For example take "Perros" from
data-category="PERROS"
How can I take information from parts that are not between >< and take the elements between ""?
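No answer was recorded for this one, but as a hedged sketch: values such as data-category live in the tag's attributes, not in its text, so you read them by indexing the tag like a dictionary (or with .get()). Assuming each product's <a> carries the product-item-photo class shown above, you could add something like this inside the existing loop (the dictionary keys here are only illustrative):
link = product_info.find('a', class_='product-item-photo')
if link is not None:
    web_content_dict['Categoria'] = link.get('data-category')  # e.g. "PERROS"
    web_content_dict['Referencia'] = link.get('data-id')       # e.g. "PED321441"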

Certain content not loading when scraping a site with Beautiful Soup

I'm trying to scrape the ratings off recipes on NYT Cooking, but I'm having issues getting the content I need. When I look at the source on the NYT page, I see the following:
<div class="ratings-rating">
<span class="ratings-header ratings-content">194 ratings</span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content four-star-rating avg-rating">
The content I'm trying to pull out is 194 ratings and four-star-rating. However, when I pull in the page source via Beautiful Soup I only see this:
<div class="ratings-rating">
<span class="ratings-header ratings-content"><%= header %></span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content <%= ratingClass %> <%= state %>">
The code I'm using is:
url = 'https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill'
r = get(url, headers = headers, timeout=15)
page_soup = soup(r.text,'html.parser')
Any thoughts why that information isn't pulling through?
Try using the code below:
import re
import requests
from lxml import html

url = "https://cooking.nytimes.com/recipes/1019706-spiced-roasted-cauliflower-with-feta-and-garlic?action=click&module=Recirculation%20Band%20Recipe%20Card&region=More%20recipes%20from%20Alison%20Roman&pgType=recipedetails&rank=1"
r = requests.get(url)
tree = html.fromstring(r.content)
t = tree.xpath('/html/body/script[14]')[0]

# look for the value of bootstrap.recipe.avg_rating
m = re.search("bootstrap.recipe.avg_rating = ", t.text)
end = re.search(";", t.text[m.end():])
rating = t.text[m.end():m.end() + end.start()]
print(rating)

# look for the value of bootstrap.recipe.num_ratings
n = re.search("bootstrap.recipe.num_ratings = ", t.text)
end2 = re.search(";", t.text[n.end():])
num_ratings = t.text[n.end():n.end() + end2.start()]
print(num_ratings)
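The fixed index in /html/body/script[14] is brittle if the page layout changes. As a hedged variant of the same idea (still assuming the page embeds the bootstrap.recipe.avg_rating and bootstrap.recipe.num_ratings assignments in an inline script), you can scan every <script> element for those markers instead:
import re
import requests
from lxml import html

url = 'https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill'
tree = html.fromstring(requests.get(url).content)

avg_rating = num_ratings = None
for script in tree.xpath('//script'):
    text = script.text or ''  # scripts loaded via src have no inline text
    m = re.search(r'bootstrap\.recipe\.avg_rating\s*=\s*([^;]+);', text)
    n = re.search(r'bootstrap\.recipe\.num_ratings\s*=\s*([^;]+);', text)
    if m and n:
        avg_rating, num_ratings = m.group(1).strip(), n.group(1).strip()
        break

print(avg_rating, num_ratings)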
It's much easier to use attribute = value selectors to grab the values from the span with class ratings-metadata:
import requests
from bs4 import BeautifulSoup
data = requests.get('https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill')
soup = BeautifulSoup(data.content, 'lxml')
rating = soup.select_one('[itemprop=ratingValue]').text
ratingCount = soup.select_one('[itemprop=ratingCount]').text
print(rating, ratingCount)
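A small defensive variant of that selector approach (a sketch; the itemprop markup may have changed since the answer was written) guards against select_one returning None before touching .text:
import requests
from bs4 import BeautifulSoup

data = requests.get('https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill')
soup = BeautifulSoup(data.content, 'lxml')

rating_tag = soup.select_one('[itemprop=ratingValue]')
count_tag = soup.select_one('[itemprop=ratingCount]')
if rating_tag and count_tag:
    # average rating and number of ratings (e.g. 194 ratings per the question)
    print(rating_tag.text, count_tag.text)
else:
    print('Rating markup not found; the page structure may have changed.')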

Extract text from HTML using Python

I hope someone can help me. I am fairly new to Python, but I want to scrape data from a site, which unfortunately needs an account. However, I am not able to extract the date (i.e. 2017-06-01).
<li class="latest-value-item">
<div class="latest-value-label">Date</div>
<div class="latest-value">2017-06-01</div>
</li>
<li class="latest-value-item">
<div class="latest-value-label">Index</div>
<div class="latest-value">1430</div>
</li>
This is my code:
import urllib3
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import requests
import csv
from datetime import datetime
url = 'https://www.quandl.com/data/LLOYDS/BCI-Baltic-Capesize-Index'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
Baltic_Indices = []
New_Value = []
#new = soup.find_all('div', attrs={'class':'latest-value'}).get_text()
date = soup.find_all(class_="latest value")
text1 = date.text
print(text1)
date = soup.find_all(class_="latest value")
You are using the wrong CSS class name ('latest value' != 'latest-value')
print(soup.find_all(attrs={'class': 'latest-value'}))
# [<div class="latest-value">2017-06-01</div>, <div class="latest-value">1430</div>]
for element in soup.find_all(attrs={'class': 'latest-value'}):
    print(element.text)
# 2017-06-01
# 1430
I prefer to use the attrs kwarg, but your method works as well (given the correct CSS class name):
for element in soup.find_all(class_='latest-value'):
    print(element.text)
# 2017-06-01
# 1430
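If you want the value for one particular label (the question is after the date), a short sketch based on the latest-value-item blocks shown above pairs each label with its value:
# soup as built in the question's code above
latest = {}
for item in soup.find_all('li', class_='latest-value-item'):
    label = item.find('div', class_='latest-value-label')
    value = item.find('div', class_='latest-value')
    if label and value:
        latest[label.get_text(strip=True)] = value.get_text(strip=True)

print(latest.get('Date'))   # 2017-06-01
print(latest.get('Index'))  # 1430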

Grabbing values from HTML with BS4

I am having a hard time figuring out how to grab certain data from this HTML snippet, which I've obtained by parsing a page with BeautifulSoup.
Here is my code:
import requests
from bs4 import BeautifulSoup

productpage = 'http://www.sneakersnstuff.com/en/product/26133/adidas-samba-waves-x-naked'
rr = requests.get(productpage)
soup1 = BeautifulSoup(rr.content, 'xml')
productIDArray = soup1.find_all("div", class_="size-button property available")
# print for debugging purposes
print(productIDArray[0])
productIDArray[0] returns
<div class="size-button property available" data-productId="207789">
<span class="size-type" title="UK 3.5 | 36">
US 4
</span>
</div>
How would I grab the value of data-productId and the title of the span so that I can place them into variables?
Thank you.
productIDArray[0]['data-productId']
out:
'207789'
productIDArray[0].span['title']
out:
'UK 3.5 | 36'
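To collect every available size rather than just the first button, a sketch (assuming all size buttons share the class used in the question) could look like this:
# soup1 as built in the question's code above
sizes = []
for button in soup1.find_all("div", class_="size-button property available"):
    product_id = button.get('data-productId')
    span = button.find('span', class_='size-type')
    title = span.get('title') if span else None
    size = span.get_text(strip=True) if span else None
    sizes.append((product_id, title, size))

print(sizes[0])  # ('207789', 'UK 3.5 | 36', 'US 4')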

How can I get data from a specific class of a html tag using beautifulsoup?

I want to get the data (name, city and address) located in a div tag from an HTML file like this:
<div class="mainInfoWrapper">
<h4 itemprop="name">name</h4>
<div>
city
Address
</div>
</div>
I don't know how I can get the data I want from that specific tag.
Obviously I'm using Python with the BeautifulSoup library.
There are several <h4> tags in the source HTML, but only one <h4> with the itemprop="name" attribute, so you can search for that first. Then access the remaining values from there. Note that the following HTML is correctly reproduced from the source page, whereas the HTML in the question was not:
from bs4 import BeautifulSoup

html = '''<div class="mainInfoWrapper">
<h4 itemprop="name">
NAME
</h4>
<div>
<a href="#">PROVINCE</a> - <a href="#">CITY</a>
ADDRESS
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()
When run for the URL that you provided
import requests
from bs4 import BeautifulSoup

r = requests.get('http://goo.gl/sCXNp2')
soup = BeautifulSoup(r.content, 'html.parser')

name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()
>>> print(name)
بیمارستان حضرت فاطمه (س)
>>> print(province)
تهران
>>> print(city)
تهران
>>> print(address)
یوسف آباد، خیابان بیست و یکم، جنب پارک شفق، بیمارستان ترمیمی پلاستیک فک و صورت
I'm not sure whether the printed output displays correctly on my terminal; however, this code should produce the correct text on a properly configured terminal.
You can do it with the lxml.html module:
>>> s="""<div class="mainInfoWrapper">
... <h4 itemprop="name">name</h4>
... <div>
...
... city
...
... Address
... </div>
... </div>"""
>>>
>>> import lxml.html
>>> document = lxml.html.document_fromstring(s)
>>> print(document.text_content().split())
['name', 'city', 'Address']
And with BeautifulSoup to get the text between your tags:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> print(soup.text)
And to get the text from a specific tag, just use soup.find_all:
soup = BeautifulSoup(your_HTML_source, 'html.parser')
for line in soup.find_all('div', attrs={"class": "mainInfoWrapper"}):
    print(line.text)
If h4 is used only once, then you can do this:
name = soup.find('h4', attrs={'itemprop': 'name'})
print(name.text)
parentdiv = name.find_parent('div', class_='mainInfoWrapper')
cityaddressdiv = name.find_next_sibling('div')
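To pull the city and address out of that sibling div (a sketch, assuming they are plain text lines inside the div as in the question's HTML):
parts = list(cityaddressdiv.stripped_strings)  # e.g. ['city', 'Address']
city = parts[0] if parts else None
address = parts[1] if len(parts) > 1 else None
print(city, address)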
