Extract an object's description with BeautifulSoup in Python

I want to extract the description near the figure (the one that goes from "Figurine model" to "Stay Tuned :)") and store it in the variable information using BeautifulSoup. How can I do it?
Here's my code, but I don't know how to continue it:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.myminifactory.com/object/3d-print-the-little-prince-4707')
soup = BeautifulSoup(response.text, "lxml")
information =
The description I want to extract is on the object page linked above. Thank you in advance!

This works for me. I'm not proud of the script because of the way I used the break statement, but it works.
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

url = r'https://www.myminifactory.com/object/3d-print-the-little-prince-4707'
html = urlopen(url).read()
soup = BS(html, "lxml")
desc = soup.find('div', {'class': 'short-text text-auto-link'}).text

description = ''
for line in desc.split('\n'):
    # Stop at the underscore divider that ends the description
    if line.strip() == '_________________________________________________________________________':
        break
    if line.strip():
        description += line.strip()
print(description)

Find the parent tag, then look for <p> tags, filtering out whitespace-only paragraphs and the ____ divider lines.
parent = soup.find("div", class_="row container-info-obj margin-t-10")
result = [" ".join(p.text.split()) for p in parent.find_all("p") if p.text.strip() and "_" * 8 not in p.text]
#youtube_v = parent.find("iframe")["src"]
print(result)


Is there a function available in BeautifulSoup that will delete all the whitespace?

I am pretty new to Python. I am trying to scrape the website https://nl.soccerway.com/ using BeautifulSoup. The only problem is that when I scrape the team names, they come out with whitespace surrounding them on the left and right. How can I delete this? I know many people have asked this question before, but I cannot get it to work.
Second question: how can I extract an href title out of a td? See the provided HTML code; the club name is Perugia.
import requests
from bs4 import BeautifulSoup

def main():
    url = 'https://nl.soccerway.com/'
    get_detail_data(get_page(url))

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('response code is:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'lxml')
        return soup

def get_detail_data(soup):
    minutes = ""
    score = ""
    TeamA = ""
    TeamB = ""
    table_data = soup.find('table', class_='table-container')
    try:
        for tr in table_data.find_all('td', class_='minute visible'):
            minutes = tr.text
            print(minutes)
    except:
        pass
    try:
        for tr in soup.find_all('td', class_='team team-a'):
            TeamA = tr.text
            print(TeamA)
    except:
        pass

if __name__ == '__main__':
    main()
You can use the get_text(strip=True) method from BeautifulSoup:
tr.get_text(strip=True)
Use the strip() method to remove trailing and leading whitespace. So in your case, it would be:
TeamA = tr.text.strip()
To get the href attribute, use the pattern tag['attribute']. In your case, it would be:
href = tr.a['href']
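As a standalone illustration, the suggestions above can be combined on a minimal snippet (the markup and href below are made up for illustration, not taken from soccerway.com):

```python
from bs4 import BeautifulSoup

# Hypothetical cell standing in for one td from the soccerway table
html = """
<td class="team team-a">
  <a href="/teams/italy/ac-perugia/1233/" title="Perugia">  Perugia  </a>
</td>
"""
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="team team-a")

print(td.text.strip())          # strip() removes the surrounding whitespace
print(td.get_text(strip=True))  # get_text can strip each piece for you
print(td.a["href"])             # attribute access with tag['attribute']
print(td.a["title"])
```

Both calls print the bare club name; the difference is that get_text(strip=True) strips every text fragment inside the tag, while .text.strip() only trims the outer ends of the combined string.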

How to get the "contents" in '<span>contents</span>==$0' with BeautifulSoup

When I was trying to get some house information on this site (https://cd.lianjia.com/ershoufang/106101326994.html), I had a problem getting the "contents" in the '<span>contents</span>==$0' statements with the beautifulsoup4 module: I always got a '0', not the contents. Thanks a lot!
here is my code:
import requests
from bs4 import BeautifulSoup
from Headers import headers

def getSigleHouseDetail(houseurl):
    result = {}
    res = requests.get(houseurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['totalcount'] = soup.select('.totalCount')[0].select('span')[0].text
    return result

url = 'https://cd.lianjia.com/ershoufang/106101326994.html'
print(getSigleHouseDetail(url)['totalcount'])
What you are doing now prints whatever the chained index lookups happen to select on this line:
result['totalcount'] = soup.select('.totalCount')[0].select('span')[0].text
Instead, you should capture the content by selecting on attributes such as class, id, and others:
import requests
from bs4 import BeautifulSoup

def getSigleHouseDetail(houseurl):
    res = requests.get(houseurl)
    soup = BeautifulSoup(res.text, 'html.parser', from_encoding='utf-8')
    method_divs = soup.body.find_all('span', attrs={'class': 'className'})
    return method_divs[0].text

url = 'https://cd.lianjia.com/ershoufang/106101326994.html'
print(getSigleHouseDetail(url))
The line
return method_divs[0].text
returns the text of the first span with that class name.
Thanks for all your answers. I found that the contents in the '<span>contents</span>==$0' statement are loaded from JavaScript data.

Python/bs4: trying to print temperature/city from a local website

I'm trying to get and print the current weather temperature and city name from a local website, but with no success.
All I need is to read and print the city (Londrina), the temperature (23.1°C) and, if possible, the title in ca-cond-firs ("Temperatura em declínio") - this last one changes as the temperature goes up or down...
This is the html section of the site:
THIS IS THE HTML (the part that matters):
<div class="ca-cidade">Londrina</div>
<ul class="ca-condicoes">
  <li class="ca-cond-firs"><img src="/site/imagens/icones_condicoes/temperatura/temp_baixa.png" title="Temperatura em declínio"/><br/>23.1°C</li>
  <li class="ca-cond"><img src="/site/imagens/icones_condicoes/vento/L.png"/><br/>10 km/h</li>
  <li class="ca-cond"><div class="ur">UR</div><br/>54%</li>
  <li class="ca-cond"><img src="/site/imagens/icones_condicoes/chuva.png"/><br/>0.0 mm</li>
</ul>
THIS IS THE CODE I DID SO FAR:
from bs4 import BeautifulSoup
import requests
URL = 'http://www.simepar.br/site/index.shtml'
rawhtml = requests.get(URL).text
soup = BeautifulSoup(rawhtml, 'lxml')
id = soup.find('a', 'id=23185109')
print(id)
any help?
from bs4 import BeautifulSoup
import requests

URL = 'http://www.simepar.br/site/index.shtml'
rawhtml = requests.get(URL).text
soup = BeautifulSoup(rawhtml, 'html.parser')  # parse page as html
temp_table = soup.find_all('table', {'class': 'cidadeTempo'})  # get details of the table with class name cidadeTempo
for entity in temp_table:
    city_name = entity.find('h3').text  # fetches name of city
    city_temp_max = entity.find('span', {'class': 'tempMax'}).text  # fetches max temperature
    city_temp_min = entity.find('span', {'class': 'tempMin'}).text  # fetches min temperature
    print("City: {} \t Max_temp: {} \t Min_temp: {}".format(city_name, city_temp_max, city_temp_min))
The code below gets the temperature details on the right side of the page, as you require.
result_table = soup.find('div', {'class': 'ca-content-wrapper'})
# No other div on this page has the class ca-content-wrapper, so it can be used
# directly without iterating; add an if condition to control which city's
# temperature to print.
print(result_table.text)
# output will be like :
# Apucarana
# 21.5°C
# 4 km/h
# UR60%
# 0.0 mm
I'm not sure what problems you are running into with your code. In my attempts to use your code, I found that I needed to use the html parser to successfully parse the website. I also used soup.findAll() in order to find elements that matched the desired class. Hopefully the below will lead you to your answer:
from bs4 import BeautifulSoup
import requests

URL = 'http://www.simepar.br/site/index.shtml'
rawhtml = requests.get(URL).text
soup = BeautifulSoup(rawhtml, 'html.parser')
rows = soup.findAll('li', {'class': 'ca-cond-firs'})
print(rows)
You should try out the CSS3 selectors in BS4; I personally find them a lot easier to use than find and find_all.
from bs4 import BeautifulSoup
import requests

URL = 'http://www.simepar.br/site/index.shtml'
rawhtml = requests.get(URL).text
soup = BeautifulSoup(rawhtml, 'lxml')

# soup.select returns the list of all the elements that match the CSS3 selector

# get the text inside each <a> tag inside div.ca-cidade
cities = [cityTag.text for cityTag in soup.select("div.ca-cidade > a")]

# get the temperature inside each li.ca-cond-firs
temps = [tempTag.text for tempTag in soup.select("li.ca-cond-firs")]

# get the temperature status from each li.ca-cond-firs > img title attribute
tempStatus = [tag["title"] for tag in soup.select("li.ca-cond-firs > img")]

# len(cities) == len(temps) == len(tempStatus) => this is normally true
for i in range(len(cities)):
    print("City: {}, Temperature: {}, Status: {}.".format(cities[i], temps[i], tempStatus[i]))
Here you go. You can customize the wind part depending on the icon name.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

def get_weather_data():
    URL = 'http://www.simepar.br/site/index.shtml'
    rawhtml = requests.get(URL).text
    soup = BeautifulSoup(rawhtml, 'html.parser')
    cities = soup.find('div', {"class": "ca-content-wrapper"})
    weather_data = []
    for city in cities.findAll("div", {"class": "ca-bg"}):
        name = city.find("div", {"class": "ca-cidade"}).text
        temp = city.find("li", {"class": "ca-cond-firs"}).text
        conditions = city.findAll("li", {"class": "ca-cond"})
        weather_data.append({
            "city": name,
            "temp": temp,
            "conditions": [{
                "wind": conditions[0].text + " " + what_wind(conditions[0].find("img")["src"]),
                "humidity": conditions[1].text,
                "rain": conditions[2].text,
            }]
        })
    return weather_data

def what_wind(img):
    # str.find returns -1 when not found, so compare explicitly
    if img.find("NE") != -1:
        return "From North East"
    if img.find("O") != -1:
        return "From West"
    if img.find("N") != -1:
        return "From North"
    # you can add other icons here
    return ""

print(get_weather_data())
And that is all weather data from that website.

Get all HTML data EXCEPT mailto: and tel: in BS4 Python decompose()

I need to take out phone numbers and Emails from HTML.
I can get the data.
description_source = soup.select('a[href^="mailto:"]'),
soup.select('a[href^="tel:"]')
But I do not want it.
I am trying to use
decompose
description_source = soup.decompose('a[href^="mailto:"]')
I get this error
TypeError: decompose() takes 1 positional argument but 2 were given
I have thought about using
SoupStrainer
But it looks like I would have to include everything but the mailto and tel to get the correct information...
Full current code for this bit is this:
import requests
from bs4 import BeautifulSoup as bs4
item_number = '122124438749'
ebay_url = "http://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=" + item_number
r = requests.get(ebay_url)
html_bytes = r.text
soup = bs4(html_bytes, 'html.parser')
description_source = soup.decompose('a[href^="mailto:"]')
#description_source.
print(description_source)
Try using find_all(). Find all the links on that page, then check which ones contain phone and email, and remove those using extract().
Use the lxml parser for faster processing; it's also recommended in the official documentation.
import requests
from bs4 import BeautifulSoup

item_number = '122124438749'
ebay_url = "http://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=" + item_number
r = requests.get(ebay_url)
html_bytes = r.text
soup = BeautifulSoup(html_bytes, 'lxml')

links = soup.find_all('a')
for link in links:
    href = link.get('href', '')  # default to '' so links without an href don't crash
    if href.find('tel:') > -1:
        link.extract()
    elif href.find('mailto:') > -1:
        link.extract()
print(soup.prettify())
You can use decompose() also instead of extract().
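For reference, decompose() is a method on a Tag, not a soup-level filter that accepts a selector, which is why soup.decompose('a[href^="mailto:"]') raises the TypeError above. Select the tags first, then decompose each one. A minimal sketch with made-up markup (the div content below is hypothetical, not the real eBay page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the eBay description page
html = """
<div>
  <p>Great item, fast shipping.</p>
  <a href="mailto:seller@example.com">Email us</a>
  <a href="tel:+15550100">Call us</a>
  <a href="/store">Visit our store</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Select the unwanted links with CSS selectors, then destroy each tag
for a in soup.select('a[href^="mailto:"], a[href^="tel:"]'):
    a.decompose()

print(soup.get_text(" ", strip=True))
```

The difference from extract() is that extract() returns the removed tag so you can keep it, while decompose() destroys it entirely.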

Web Scraping with Python (city) as parameter

def findWeather(city):
    import urllib
    connection = urllib.urlopen("http://www.canoe.ca/Weather/World.html")
    rate = connection.read()
    connection.close()
    currentLoc = rate.find(city)
    curr = rate.find("currentDegree")
    temploc = rate.find("</span>", curr)
    tempstart = rate.rfind(">", 0, temploc)
    print "current temp:", rate[tempstart+1:temploc]
The link is provided above. The issue I have is that every time I run the program with, say, "Brussels" in Belgium as the parameter, i.e. findWeather("Brussels"), it always prints 24c as the temperature whereas (as I am writing this) it should be 19c. This is the case for many other cities provided by the site. Help on this code would be appreciated.
Thanks!
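For what it's worth, the likely reason the original approach always prints the same value is that rate.find("currentDegree") searches from the start of the page, ignoring currentLoc; passing the city's offset as find()'s start argument anchors the search at that city. A minimal sketch with made-up markup (not the real canoe.ca page):

```python
# Made-up page fragment; the real page has one currentDegree span per city
rate = ('<b>Brussels</b><span class="currentDegree">19</span>'
        '<b>Paris</b><span class="currentDegree">24</span>')

def temp_for(city):
    currentLoc = rate.find(city)
    # Start searching at the city's position instead of index 0
    curr = rate.find("currentDegree", currentLoc)
    temploc = rate.find("</span>", curr)
    tempstart = rate.rfind(">", 0, temploc)
    return rate[tempstart + 1:temploc]

print(temp_for("Brussels"))  # 19
print(temp_for("Paris"))     # 24
```

That said, fixed-offset string slicing stays fragile; the BeautifulSoup answer below is the more robust route.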
This one should work:
import requests
from bs4 import BeautifulSoup

url = 'http://www.canoe.ca/Weather/World.html'
response = requests.get(url)

# Get the text of the contents
html_content = response.text

# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'lxml')

cities = soup.find_all("span", class_="titleText")
cels = soup.find_all("span", class_="currentDegree")

for x, y in zip(cities, cels):
    print(x.text, y.text)
