extract text from html using python - python

I hope someone can help me. I am fairly new to Python, but I want to scrape data from a site which unfortunately requires an account. However, I am not able to extract the date (i.e. 2017-06-01).
<li class="latest-value-item">
<div class="latest-value-label">Date</div>
<div class="latest-value">2017-06-01</div>
</li>
<li class="latest-value-item">
<div class="latest-value-label">Index</div>
<div class="latest-value">1430</div>
</li>
This is my code:
import urllib3
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import requests
import csv
from datetime import datetime
url = 'https://www.quandl.com/data/LLOYDS/BCI-Baltic-Capesize-Index'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
Baltic_Indices = []
New_Value = []
#new = soup.find_all('div', attrs={'class':'latest-value'}).get_text()
date = soup.find_all(class_="latest value")
text1 = date.text
print(text1)

date = soup.find_all(class_="latest value")
You are using the wrong CSS class name ('latest value' != 'latest-value')
print(soup.find_all(attrs={'class': 'latest-value'}))
# [<div class="latest-value">2017-06-01</div>, <div class="latest-value">1430</div>]
for element in soup.find_all(attrs={'class': 'latest-value'}):
print(element.text)
# 2017-06-01
# 1430
I prefer to use the attrs kwarg, but your method works as well (given the correct CSS class name):
for element in soup.find_all(class_='latest-value'):
print(element.text)
# 2017-06-01
# 1430
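Since the labels and values appear in matching order in the snippet from the question, the two lists can be zipped into a dict. A minimal sketch, run directly against the inlined HTML fragment (so no account or network request is needed):

```python
from bs4 import BeautifulSoup

html = '''<li class="latest-value-item">
<div class="latest-value-label">Date</div>
<div class="latest-value">2017-06-01</div>
</li>
<li class="latest-value-item">
<div class="latest-value-label">Index</div>
<div class="latest-value">1430</div>
</li>'''

soup = BeautifulSoup(html, 'html.parser')
# the labels and values come in the same document order, so zip pairs them up
labels = [div.text for div in soup.find_all(class_='latest-value-label')]
values = [div.text for div in soup.find_all(class_='latest-value')]
data = dict(zip(labels, values))
print(data)  # {'Date': '2017-06-01', 'Index': '1430'}
```

From here, `data['Date']` gives the date string directly.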

Related

Getting only numbers from BeautifulSoup instead of whole div

I am trying to learn Python by creating a small web scraping program to make life easier, although I am having issues with getting only the number when using BS4. I was able to get the price when I scraped an actual ad, but I would like to get all the prices from the page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried with text=True, but this did not work; I got no output from it besides an empty list.
Thank you
You need to get the text portion of the tag and then perform some regex processing on it.
import re

def get_price_from_div(div_item):
    # strip everything except digits and the decimal point
    str_price = re.sub(r'[^0-9.]', '', div_item.text)
    float_price = float(str_price)
    return float_price
Just call this method in your code after you find the divs
import re
import requests
from bs4 import BeautifulSoup

prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print(result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
price = soup.find_all("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print(prices)
An option without using RegEx is to filter out the tags whose text starts with a dollar sign $, using startswith():
import requests
from bs4 import BeautifulSoup

URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
    tag.get_text(strip=True)[1:] for tag in price_tags
    if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']
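Both answers can be sanity-checked offline against the price div reproduced in the question, with no request to Kijiji. This sketch applies the regex cleanup to the text that get_text(strip=True) produces:

```python
import re
from bs4 import BeautifulSoup

html = '''<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image"></div>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='price')
# get_text(strip=True) collapses the nested markup down to '$46,999.00'
text = div.get_text(strip=True)
# drop everything except digits and the decimal point
price = float(re.sub(r'[^0-9.]', '', text))
print(price)  # 46999.0
```

The float conversion is what makes the regex version handy if you want to sort or sum the prices afterwards.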

Extracting each same tag from the same class beautifulsoup

I would like to extract every 'data-src' attribute from this page and then save the results to CSV. There are several 'data-src' on this page, all in the same class, and I don't know how to deal with it.
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
from csv import writer
def test_list():
    with open('largeXDDDDDDDDDD.csv', 'w') as f1:
        writer = csv.writer(f1, delimiter='\t', lineterminator='\n',)
        #df = pd.read_csv("C:\\Users\\Lukasz\\Desktop\\PROJEKTY PYTHON\\W TRAKCIE\\large.csv")
        #url = df['LINKS'][1]
        url = 'https://paypalshop.x.yupoo.com/albums/81513820?uid=1'
        response = requests.get(url)
        data = response.text
        soup = BeautifulSoup(data, 'lxml')
        szukaj = soup.find_all('div', {'class': "showalbum__children image__main"})
        for XD in szukaj:
            q = (soup.find_all("data-src"))
            print(q)
        #q = soup.find("img", {"class": "autocover image__img image__portrait"})
        #q = (tag.get('data-src'))
test_list()
HTML:
<div class="showalbum__children image__main" data-id="30842210">
<div class="image__imagewrap" data-type="photo">
<img alt="" class="autocover image__img image__portrait" data-album-id="83047567" data-frame="1" data-height="1080" data-origin-src="//photo.yupoo.com/ven-way/aac32ed1/2d2ed235.jpg" data-path="/ven-way/aac32ed1/2d2ed235.jpg" data-src="//photo.yupoo.com/ven-way/aac32ed1/big.jpg" data-type="photo" data-videoformats="" data-width="1080" src="//photo.yupoo.com/ven-way/aac32ed1/small.jpg"/>
<div class="image__clickhandle" data-photoid="30842210" style="width: 1080px; padding-bottom: 100.00%" title="点击查看详情">
</div>
Use a class selector for one of the children of the elements you are currently matching, so you are at the right level. I use select and dict-style accessor notation to retrieve the attribute. You cannot use find_all("data-src") with the syntax as you have written it, because find_all matches tag names, not attribute names.
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
from csv import writer

def test_list():
    #with open('largeXDDDDDDDDDD.csv', 'w') as f1:
    #    writer = csv.writer(f1, delimiter='\t', lineterminator='\n',)
    url = 'https://paypalshop.x.yupoo.com/albums/81513820?uid=1'
    response = requests.get(url)
    data = response.content
    soup = BeautifulSoup(data, 'lxml')
    szukaj = soup.select('.image__portrait')
    for x in szukaj:
        q = x['data-src']
        print(q)

test_list()
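Since the original goal was to save the results to CSV, here is a sketch that pulls the data-src attribute from the img in the question's HTML and writes one value per row with csv.writer (the output file name data_src.csv is an assumption):

```python
import csv
from bs4 import BeautifulSoup

html = '''<div class="showalbum__children image__main" data-id="30842210">
<div class="image__imagewrap" data-type="photo">
<img class="autocover image__img image__portrait" data-src="//photo.yupoo.com/ven-way/aac32ed1/big.jpg" src="//photo.yupoo.com/ven-way/aac32ed1/small.jpg"/>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# one row per matched img; x['data-src'] raises KeyError if the attribute is missing
rows = [[x['data-src']] for x in soup.select('.image__portrait')]
with open('data_src.csv', 'w', newline='') as f:  # hypothetical file name
    csv.writer(f).writerows(rows)
print(rows)
```

On the real page the same rows list would simply be longer, one entry per image.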

How to access text from both <p> using beautifulsoup4?

I want to grab the text from both <p> tags; how do I get that?
For the first <p> my code works, but I am not able to get the second <p>.
<p>
<a href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
Emerging online threats changing Homeland Security's role from merely fighting terrorism
</a>
</p>
</hgroup>
</header>
<p>
Homeland Security Secretary Kirstjen Nielsen said Monday that her department may have been founded to combat terrorism, but its mission is shifting to also confront emerging online threats.
China, Iran and other countries are mimicking the approach that Russia used to interfere in the U.S. ...
<a class="more_link" href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
<span class="icon-arrow-2">
</span>
</a>
</p>
My code is:
import ssl
import urllib.request
from bs4 import BeautifulSoup

ssl._create_default_https_context = ssl._create_unverified_context
article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
page = urllib.request.urlopen(article)
soup = BeautifulSoup(page, 'html.parser')
article = soup.find('div', class_="content_col")
date = article.h3.find('span', class_="right date")
date = date.text
headline = article.p.find('a')
headline = headline.text
content = article.p.text
print(date, headline, content)
Use the parent id and a p selector, and index into the returned list for the required number of paragraphs. You can use the time tag for when it was posted:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/#.XJIQNDj7TX4')
soup = bs(r.content, 'lxml')
posted = soup.select_one('time').text
print(posted)
paras = [item.text.strip() for item in soup.select('#jtarticle p')]
print(paras[:2])
You could use .find_next(). However, it won't give you the full article:
from bs4 import BeautifulSoup
import requests
article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
page = requests.get(article)
soup = BeautifulSoup(page.text, 'html.parser')
article = soup.find('div', class_="content_col")
date = article.h3.find('span', class_= "right date")
date_text = date.text
headline = article.p.find('a')
headline_text = headline.text
content_text = article.p.find_next('p').text
print(date_text, headline_text ,content_text)
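The .find_next() idea can be seen on a stripped-down fragment (the class name mirrors the question's page; the text here is placeholder): article.p is the first <p>, and find_next('p') walks forward to the one after it.

```python
from bs4 import BeautifulSoup

html = '''<div class="content_col">
<p><a href="#">Headline text</a></p>
<p>First paragraph of the summary.</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
article = soup.find('div', class_='content_col')
headline = article.p.text.strip()                 # first <p>: the headline link text
content = article.p.find_next('p').text.strip()   # the <p> that follows it
print(headline)  # Headline text
print(content)   # First paragraph of the summary.
```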

Certain content not loading when scraping a site with Beautiful Soup

I'm trying to scrape the ratings off recipes on NYT Cooking but having issues getting the content I need. When I look at the source on the NYT page, I see the following:
<div class="ratings-rating">
<span class="ratings-header ratings-content">194 ratings</span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content four-star-rating avg-rating">
The content I'm trying to pull out is 194 ratings and four-star-rating. However, when I pull in the page source via Beautiful Soup I only see this:
<div class="ratings-rating">
<span class="ratings-header ratings-content"><%= header %></span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content <%= ratingClass %> <%= state %>">
The code I'm using is:
url = 'https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill'
r = get(url, headers = headers, timeout=15)
page_soup = soup(r.text,'html.parser')
Any thoughts why that information isn't pulling through?
Try using the code below:
import requests
from lxml import html
import re

url = "https://cooking.nytimes.com/recipes/1019706-spiced-roasted-cauliflower-with-feta-and-garlic?action=click&module=Recirculation%20Band%20Recipe%20Card&region=More%20recipes%20from%20Alison%20Roman&pgType=recipedetails&rank=1"
r = requests.get(url)
tree = html.fromstring(r.content)
t = tree.xpath('/html/body/script[14]')[0]
# look for the value of bootstrap.recipe.avg_rating
m = re.search("bootstrap.recipe.avg_rating = ", t.text)
semicolon = re.search(";", t.text[m.end():])
rating = t.text[m.end():m.end() + semicolon.start()]
print(rating)
# look for the value of bootstrap.recipe.num_ratings
n = re.search("bootstrap.recipe.num_ratings = ", t.text)
semicolon2 = re.search(";", t.text[n.end():])
num_ratings = t.text[n.end():n.end() + semicolon2.start()]
print(num_ratings)
It's much easier to use attribute = value selectors to grab the values from the span with class ratings-metadata:
import requests
from bs4 import BeautifulSoup
data = requests.get('https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill')
soup = BeautifulSoup(data.content, 'lxml')
rating = soup.select_one('[itemprop=ratingValue]').text
ratingCount = soup.select_one('[itemprop=ratingCount]').text
print(rating, ratingCount)
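The regex extraction in the first answer can be sanity-checked on a made-up script string of the same shape (the values 4.58 and 194 here are placeholders, not live data). This is a compact variant of the same idea: anchor on the variable name and capture everything up to the semicolon:

```python
import re

# sample inline-script text of the assumed shape; not fetched from the live page
script = "bootstrap.recipe.avg_rating = 4.58; bootstrap.recipe.num_ratings = 194;"

def extract(name, text):
    # match 'name = <value>;' and return <value> as a stripped string
    m = re.search(re.escape(name) + r'\s*=\s*([^;]+);', text)
    return m.group(1).strip() if m else None

print(extract('bootstrap.recipe.avg_rating', script))   # 4.58
print(extract('bootstrap.recipe.num_ratings', script))  # 194
```

re.escape matters here because the dots in the variable name would otherwise act as regex wildcards.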

How can I get data from a specific class of a html tag using beautifulsoup?

I want to get the data (name, city, and address) located in a div tag from an HTML file like this:
<div class="mainInfoWrapper">
<h4 itemprop="name">name</h4>
<div>
city
Address
</div>
</div>
I don't know how I can get the data I want from that specific tag.
Obviously I'm using Python with the BeautifulSoup library.
There are several <h4> tags in the source HTML, but only one <h4> with the itemprop="name" attribute, so you can search for that first. Then access the remaining values from there. Note that the following HTML is correctly reproduced from the source page, whereas the HTML in the question was not:
from bs4 import BeautifulSoup
html = '''<div class="mainInfoWrapper">
<h4 itemprop="name">
NAME
</h4>
<div>
<a>PROVINCE</a> - <a>CITY</a> ADDRESS
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()
When run for the URL that you provided
import requests
from bs4 import BeautifulSoup

r = requests.get('http://goo.gl/sCXNp2')
soup = BeautifulSoup(r.content, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()
>>> print(name)
بیمارستان حضرت فاطمه (س)
>>> print(province)
تهران
>>> print(city)
تهران
>>> print(address)
یوسف آباد، خیابان بیست و یکم، جنب پارک شفق، بیمارستان ترمیمی پلاستیک فک و صورت
I'm not sure that the printed output is correct on my terminal, however, this code should produce the correct text for a properly configured terminal.
You can do it with the lxml.html module (note that lxml is a third-party package, not part of the standard library):
>>> s="""<div class="mainInfoWrapper">
... <h4 itemprop="name">name</h4>
... <div>
...
... city
...
... Address
... </div>
... </div>"""
>>>
>>> import lxml.html
>>> document = lxml.html.document_fromstring(s)
>>> print(document.text_content().split())
['name', 'city', 'Address']
And with BeautifulSoup to get the text between your tags:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> print(soup.text)
And to get the text from a specific tag, just use soup.find_all:
soup = BeautifulSoup(your_HTML_source, 'html.parser')
for line in soup.find_all('div', attrs={"class": "mainInfoWrapper"}):
    print(line.text)
If h4 is used only once then you can do this -
name = soup.find('h4', attrs={'itemprop': 'name'})
print(name.text)
parentdiv = name.find_parent('div', class_='mainInfoWrapper')
cityaddressdiv = name.find_next_sibling('div')
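Putting the sibling-navigation idea together on the fragment exactly as shown in the question (city and address on their own lines, no anchor tags), a minimal sketch:

```python
from bs4 import BeautifulSoup

html = '''<div class="mainInfoWrapper">
<h4 itemprop="name">name</h4>
<div>
city
Address
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
name = name_tag.text.strip()
# the sibling <div> holds city and address, separated only by whitespace
city, address = name_tag.find_next_sibling('div').text.split()
print(name, city, address)  # name city Address
```

The split() here assumes single-word city and address values, as in the sample; real pages would need the anchor-based approach shown above.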
