I am trying to extract the value of the 'title' attribute from the following HTML, but so far I haven't managed to.
<div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00">
This is my code:
from bs4 import BeautifulSoup
with open("messages.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results)
The output is a list of every matching <div> element in the HTML file, not the title values.
To access the value of the title attribute, index the tag with ['title'].
find_all returns a list, so you need an index first (e.g. [0]['title']). find returns a single tag, so no index is needed.
For example:
from bs4 import BeautifulSoup
fp = '<html><div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00"></html>'
soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results[0]['title'])
Or:
results = soup.find('div', attrs={'class':'pull_right date details'})
print(results['title'])
Output:
22.12.2022 01:49:03 UTC-03:00
22.12.2022 01:49:03 UTC-03:00
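If the file holds many such divs, a loop over the find_all result is probably what you want. The following is only a sketch: the inline HTML stands in for messages.html, and the second and third divs are made up for illustration. It uses tag.get('title'), which returns None instead of raising KeyError when a div has no title.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for messages.html; only the first div comes from the question.
html = """
<div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00">01:49</div>
<div class="pull_right date details" title="22.12.2022 01:50:10 UTC-03:00">01:50</div>
<div class="pull_right date details">no title here</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", attrs={"class": "pull_right date details"})

# Collect the title of every matching div, skipping divs without the attribute.
titles = [div.get("title") for div in divs if div.get("title") is not None]
print(titles)
```

Indexing with div['title'] is fine when every div is known to carry the attribute; .get is just the more forgiving choice.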
Why can't I get the text 3.7M from these spans (several share the same class name) with the code below?
result_prices_pc = soup.find_all("span", attrs={"class": "pc_color font-weight-bold"})
HTML:
<td><span class="pc_color font-weight-bold">3.7M <img alt="c" class="small-coins-icon" src="/design/img/coins_bin.png"></span></td>
I try to get all prices with a for loop:
for price in result_prices_pc:
    print(price.text)
But I can't get the text from it.
The "problem" is the pc_color CSS class. When you load the page, you need to specify which version of the page you need (PS4/XBOX/PC). This is done with the "platform" cookie (or you can use ps4_color instead of pc_color, for example):
import requests
from bs4 import BeautifulSoup
url = "https://www.futbin.com/players"
cookies = {"platform": "pc"}
soup = BeautifulSoup(requests.get(url, cookies=cookies).content, "html.parser")
result_prices_pc = soup.find_all(
    "span", attrs={"class": "pc_color font-weight-bold"}
)
for price in result_prices_pc:
    print(price.text)
Prints:
0
1.15M
3.75M
1.7M
4.19M
1.81M
351.65K
0
1.66M
98K
1.16M
3M
775K
99K
1.62M
187K
280K
245K
220K
1.03M
395K
100K
185K
864.2K
0
1.95M
540K
0
0
89K
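The printed values are still strings with K/M suffixes. If you need numbers, something like the sketch below might do. It assumes K means thousand and M means million, and parse_price is a hypothetical helper name, not part of requests or BeautifulSoup.

```python
from decimal import Decimal

# Assumed meaning of the suffixes shown in the output above.
SUFFIXES = {"K": Decimal(1_000), "M": Decimal(1_000_000)}

def parse_price(text):
    """Convert strings such as '1.15M', '351.65K' or '0' to an integer coin count."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        # Decimal avoids float rounding surprises such as 351.65 * 1000.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(Decimal(text))

print([parse_price(p) for p in ["0", "1.15M", "351.65K", "98K"]])
```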
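These elements actually have multiple class names: pc_color font-weight-bold is two classes, pc_color and font-weight-bold.
For this case you can pass a list of class names:
result_prices_pc = soup.find_all("span", attrs={"class": ['pc_color', 'font-weight-bold']})
Note that a list matches spans carrying any one of the listed classes. To require both, use a CSS selector such as soup.select("span.pc_color.font-weight-bold").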
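To see how class matching behaves, here is a small offline sketch. The three spans are made up for illustration; only the first one mirrors the HTML from the question. It compares the list form of find_all with a CSS selector that requires both classes.

```python
from bs4 import BeautifulSoup

# Made-up sample: only the first span has both classes.
html = """
<span class="pc_color font-weight-bold">3.7M</span>
<span class="pc_color">only pc_color</span>
<span class="font-weight-bold ps4_color">1.2M</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of classes matches a tag that has ANY of them.
any_of = soup.find_all("span", attrs={"class": ["pc_color", "font-weight-bold"]})

# A CSS selector with chained classes matches only tags that have ALL of them.
all_of = soup.select("span.pc_color.font-weight-bold")

print(len(any_of))
print([s.get_text(strip=True) for s in all_of])
```

So if the goal is "both classes at once", the selector form is the safer bet.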
I am trying to learn Python by creating a small web-scraping program to make life easier, but I am having trouble getting only the number when using BS4. I was able to get the price when I scraped an individual ad, but I would like to get all the prices from the page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried with text=True, although this did not work and I would not get any output from it besides an empty list.
Thank you
You need to get the text portion of each tag and then clean it up with a regex.
import re
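def get_price_from_div(div_item):
    # Keep only digits and the decimal point, dropping "$", "," and whitespace.
    str_price = re.sub(r'[^0-9.]', '', div_item.text)
    # Convert the cleaned string, e.g. "46999.00", to a float.
    float_price = float(str_price)
    return float_price
Call this function in your code after you find the divs: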
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)
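The regex step can be tried offline on plain strings, without hitting the site. This is only a sketch: price_text_to_float is a hypothetical variant of the function above that takes the text directly, and the "Please Contact" string is an assumed example of a listing without a numeric price. It returns None in that case instead of letting float() raise ValueError.

```python
import re

def price_text_to_float(text):
    """Strip everything except digits and '.', then convert to float (None if that fails)."""
    digits = re.sub(r"[^0-9.]", "", text)
    try:
        return float(digits)
    except ValueError:
        # Nothing numeric was left, e.g. an empty string.
        return None

print(price_text_to_float("\n    $46,999.00\n    "))
print(price_text_to_float("Please Contact"))
```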
An option without regex is to keep only the tags whose text startswith() a dollar sign $ and slice that character off:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
    tag.get_text(strip=True)[1:] for tag in price_tags
    if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']
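The same filter can be checked against a small inline snippet. The HTML below is modelled on the output shown in the question; the "Please Contact" div, the second price and the logo.png src are assumptions added to show what the startswith test skips.

```python
from bs4 import BeautifulSoup

# First div mirrors the question's output; the other two are made up for illustration.
html = """
<div class="price">
    $46,999.00
    <div class="dealer-logo"><div class="dealer-logo-image"><img src="logo.png"/></div></div>
</div>
<div class="price">Please Contact</div>
<div class="price">$5,500.00</div>
"""
soup = BeautifulSoup(html, "html.parser")

prices = []
for tag in soup.find_all("div", class_="price"):
    # get_text(strip=True) drops the surrounding whitespace and the empty nested divs.
    text = tag.get_text(strip=True)
    if text.startswith("$"):
        prices.append(text[1:])  # slice off the leading "$"

print(prices)
```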
I wrote my code, but it extracts all links no matter what the seeders count is.
Here is the code I wrote:
from bs4 import BeautifulSoup
import urllib.request
import re
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"
url = input('What site you working on today, sir?\n-> ')
opener = AppURLopener()
html_page = opener.open(url)
soup = BeautifulSoup(html_page, "lxml")
pd = str(soup.findAll('td', attrs={'align':re.compile('right')}))
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
    if not('0' is pd[18]):
        print (link.get('href'),'\n')
This is the HTML I am working on: https://imgur.com/a/32J9qF4
In this case there are 0 seeders, but it still gives me the magnet link. Help!
This code snippet will extract all magnet links from the page, where seeders != 0:
from bs4 import BeautifulSoup
import requests
from pprint import pprint
soup = BeautifulSoup(requests.get('https://pirateproxy.mx/browse/201/1/3').text, 'lxml')
tds = soup.select('#searchResult td.vertTh ~ td')
links = [
    name.select_one('a[href^=magnet]')['href']
    for name, seeders, leechers in zip(tds[0::3], tds[1::3], tds[2::3])
    if seeders.text.strip() != '0'
]
pprint(links, width=120)
Prints:
['magnet:?xt=urn:btih:aa8a1f7847a49e640638c02ce851effff38d440f&dn=Affairs.of.State.2018.BRRip.x264.AC3-Manning&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:819cb9b477462cd61ab6653ebc4a6f4e790589c3&dn=Bad.Samaritan.2018.BRRip.x264.AC3-Manning&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:843d01992aa81d52be68190ee6a733ec9eee9b13&dn=The+Darkest+Minds+2018+HDCAM-1XBET&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:09a23daa69c42003d905ecf0a1cefdb0474e7d88&dn=Insidious+The+Last+Key+2018+BRRip+x264+AAC-SSN&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:98c42d5d620b4db834c5437a75f6da6f2d158207&dn=The+Darkest+Minds+2018+HDCAM-1XBET%5BTGx%5D&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:f30ebc409b215f2a5237433d7508c7ebfabb0e16&dn=Journeyman.2017.SWESUB.BRRiP.x264.mp4&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
...and so on.
EDIT:
soup.select('#searchResult td.vertTh ~ td') selects every <td> that follows, as a sibling, a <td> with class vertTh inside the element with id=searchResult. There are three such siblings in each row (name, seeders, leechers).
select_one('a[href^=magnet]') then selects the first link in the name cell whose href begins with magnet.
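A row-by-row variant may be easier to follow than slicing one flat list. The sketch below runs on a made-up two-row table: the ids and classes follow the selectors above, while the link targets and counts are invented. It keeps the magnet link only when the seeders cell of the same row is not 0.

```python
from bs4 import BeautifulSoup

# Made-up table with the same id/class layout the selectors above rely on.
html = """
<table id="searchResult">
  <tr>
    <td class="vertTh">Video</td>
    <td><a href="/torrent/1">Example A</a> <a href="magnet:?xt=urn:btih:aaa">magnet</a></td>
    <td align="right">12</td>
    <td align="right">3</td>
  </tr>
  <tr>
    <td class="vertTh">Video</td>
    <td><a href="/torrent/2">Example B</a> <a href="magnet:?xt=urn:btih:bbb">magnet</a></td>
    <td align="right">0</td>
    <td align="right">1</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for row in soup.select("#searchResult tr"):
    # The three cells after td.vertTh: name, seeders, leechers.
    cells = row.select("td.vertTh ~ td")
    if len(cells) != 3:
        continue  # header or malformed row
    name, seeders, leechers = cells
    magnet = name.select_one('a[href^="magnet"]')
    # Compare the seeders text of THIS row, not a fixed character of the whole page.
    if magnet is not None and seeders.get_text(strip=True) != "0":
        links.append(magnet["href"])

print(links)
```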
I need help with a problem. I am writing code to read the content of a tag. How can I get the content if the tag has an id?
from bs4 import BeautifulSoup
code = '<span class="vi-is1-prcp" id="v4-27"> 15,00 EUR </span>'
soup = BeautifulSoup(code, 'html.parser')
price = soup.find('a', id='v4-27') # <-- PROBLEM
print(price)
If that is the HTML, replace the 'a' tag name with 'span', because the element carrying the id is a <span>. It should look something like this:
...
price = soup.find('span', id="v4-27")
print(price)  # prints the entire <span> element
# Optional: price.string gives just the text " 15,00 EUR " instead of the entire HTML line
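For completeness, here is a self-contained Python 3 sketch of the fix, using the HTML string from the question. It also shows that find(id=...) works without a tag name, and that get_text(strip=True) trims the spaces around the price.

```python
from bs4 import BeautifulSoup

code = '<span class="vi-is1-prcp" id="v4-27"> 15,00 EUR </span>'
soup = BeautifulSoup(code, "html.parser")

# Search by tag name and id ...
price = soup.find("span", id="v4-27")
# ... or by id alone, since the tag name is optional.
by_id_only = soup.find(id="v4-27")

print(price.get_text(strip=True))
# The original search returns None because there is no <a> with that id.
print(soup.find("a", id="v4-27"))
```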