Getting data from html table with BeautifulSoup1


content = soup.find_all("div", id = "ctl00_ContentPlaceHolder1_ctl03_divHO")
for book in content:
    stock = book.find('I', {'class'= "Item_Price10"}).text
    print (stock)
I would like to get the stock price with BS4 by finding the value in content, but the code does not work. Please help me; thank you in advance.

The Item_Price10 class belongs to the <td> tags, so you could try something like:
import requests
from bs4 import BeautifulSoup
req = requests.get("https://s.cafef.vn/Lich-su-giao-dich-FPT-1.chn")
soup = BeautifulSoup(req.content, "html.parser")
for div in soup.find_all("div", id = "ctl00_ContentPlaceHolder1_ctl03_divHO"):
    for tr in div.find_all('tr'):
        for td in tr.find_all('td', {'class': "Item_Price10"}):
            print(td.text)
        print()
This would display something starting like:
95.70 
95.70 
3,325,700 
319,480,000,000 
1,243,600 
95.50 
97.00 
95.40 
0 
0 
0 
94.70 
94.70 
...
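
As a rough sketch of the same idea without a network request, the nested loops can be collapsed into one CSS selector. The markup below is a made-up sample that only reuses the div id and the Item_Price10 class from the answer; the Item_DateItem class and the dates are assumptions for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the div id and the Item_Price10 class
# come from the answer above, everything else is invented for illustration.
html = """
<div id="ctl00_ContentPlaceHolder1_ctl03_divHO">
  <table>
    <tr><td class="Item_DateItem">01/02/2023</td>
        <td class="Item_Price10">95.70 </td>
        <td class="Item_Price10">3,325,700 </td></tr>
    <tr><td class="Item_DateItem">31/01/2023</td>
        <td class="Item_Price10">95.50 </td>
        <td class="Item_Price10">1,243,600 </td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One CSS selector stands in for the three nested find_all() loops.
cells = soup.select("div#ctl00_ContentPlaceHolder1_ctl03_divHO td.Item_Price10")

# Strip whitespace and thousands separators before converting to numbers.
values = [float(td.get_text(strip=True).replace(",", "")) for td in cells]
print(values)
```

Converting to float is optional; if only the displayed text is needed, td.get_text(strip=True) alone removes the trailing spaces.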

Related

How to extract specific part of html using Beautifulsoup?

I am trying to extract the value of the 'title' attribute from the following HTML, but so far I haven't managed to.
<div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00">
This is my code:
from bs4 import BeautifulSoup
with open("messages.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results)
The output is a list of all the matching <div> elements in the HTML file.

To access the value of the title attribute, index the tag with ['title'].
find_all returns a list, so you need to pick an element first (e.g. [0]['title']).
For example:
from bs4 import BeautifulSoup
fp = '<html><div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00"></html>'
soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results[0]['title'])
Or:
results = soup.find('div', attrs={'class':'pull_right date details'})
print(results['title'])
Output:
22.12.2022 01:49:03 UTC-03:00
22.12.2022 01:49:03 UTC-03:00
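
A minimal sketch of a more defensive variant, assuming some matching divs might lack a title attribute: tag.get("title") returns None instead of raising KeyError. The second, title-less div and the strptime format string are assumptions added for illustration; the format relies on %z accepting a colon in the offset, which is the case from Python 3.7 on.

```python
from datetime import datetime
from bs4 import BeautifulSoup

# The first div is from the question; the second (no title) is a made-up
# case to show what happens when the attribute is missing.
html = (
    '<div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00"></div>'
    '<div class="pull_right date details"></div>'
)
soup = BeautifulSoup(html, "html.parser")

# .get() returns None for a missing attribute, where ['title'] would raise KeyError.
titles = [div.get("title") for div in soup.select("div.pull_right.date.details")]
print(titles)

# Optionally turn the string into a timezone-aware datetime.
parsed = datetime.strptime(titles[0], "%d.%m.%Y %H:%M:%S UTC%z")
print(parsed.isoformat())
```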

cant get text from span with Beautifulsoup

Why can't I get the text 3.7M from these spans (there are several with the same class name) with the code below?
result_prices_pc = soup.find_all("span", attrs={"class": "pc_color font-weight-bold"})
HTML:
<td><span class="pc_color font-weight-bold">3.7M <img alt="c" class="small-coins-icon" src="/design/img/coins_bin.png"></span></td>
I try to get all prices with a for loop:
for price in result_prices_pc:
    print(price.text)
But I can't get the text from them.

The "problem" is the pc_color CSS class. When you load the page, you need to specify which version of the page you want (PS4/XBOX/PC). This is done with the "platform" cookie (alternatively, you can search for ps4_color instead of pc_color, for example):
import requests
from bs4 import BeautifulSoup
url = "https://www.futbin.com/players"
cookies = {"platform": "pc"}
soup = BeautifulSoup(requests.get(url, cookies=cookies).content, "html.parser")
result_prices_pc = soup.find_all(
    "span", attrs={"class": "pc_color font-weight-bold"}
)
for price in result_prices_pc:
    print(price.text)
Prints:
0
1.15M
3.75M
1.7M
4.19M
1.81M
351.65K
0
1.66M
98K
1.16M
3M
775K
99K
1.62M
187K
280K
245K
220K
1.03M
395K
100K
185K
864.2K
0
1.95M
540K
0
0
89K

These elements actually have multiple class names: pc_color font-weight-bold is two separate classes, pc_color and font-weight-bold.
In this case you can use this syntax:
result_prices_pc = soup.find_all("span", attrs={"class": ['pc_color', 'font-weight-bold']})
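
One caveat worth checking on a small sample: as far as I know, a list passed for class matches a tag that has any of the listed classes, not necessarily all of them. The sketch below compares the three forms on made-up markup (the second and third spans are assumptions for illustration); a CSS selector is one way to require both classes regardless of their order.

```python
from bs4 import BeautifulSoup

# Made-up markup: same two classes in different orders, plus a span
# that carries only one of them.
html = """
<span class="pc_color font-weight-bold">3.7M</span>
<span class="font-weight-bold pc_color">1.2M</span>
<span class="font-weight-bold">other</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A single string matches the exact class attribute value, order included.
exact = soup.find_all("span", attrs={"class": "pc_color font-weight-bold"})

# A list matches tags that carry ANY of the listed classes.
any_of = soup.find_all("span", attrs={"class": ["pc_color", "font-weight-bold"]})

# A CSS selector requires BOTH classes, in any order.
both = soup.select("span.pc_color.font-weight-bold")

print(len(exact), len(any_of), len(both))
```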

Getting only numbers from BeautifulSoup instead of whole div

I am trying to learn Python by creating a small web scraping program to make life easier, but I am having trouble getting only the number when using BS4. I was able to get the price when I scraped an individual ad, but I would like to get all the prices from the page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried with text=True, although this did not work and I would not get any output from it besides an empty list.
Thank you

You need to get the text portion of the tag and then apply a regex to it.
import re
def get_price_from_div(div_item):
    # Drop everything except digits and the decimal point, e.g. "$46,999.00" -> "46999.00"
    str_price = re.sub(r'[^0-9.]', '', div_item.text)
    float_price = float(str_price)
    return float_price
Just call this function in your code after you find the divs:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)
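
One thing to watch with the function above: float('') raises ValueError, so a div with no digits in it would stop the script. A minimal sketch of a more tolerant variant follows, run on inline HTML instead of the live page. The "Please Contact" listing, the parse_price name, and the use of Decimal are assumptions for illustration, not part of the original answer.

```python
import re
from decimal import Decimal
from bs4 import BeautifulSoup

# The first div mirrors the output in the question; the second is a
# hypothetical listing without a numeric price.
html = """
<div class="price">
    $46,999.00
    <div class="dealer-logo"><div class="dealer-logo-image"><img src="logo.png"/></div></div>
</div>
<div class="price">Please Contact</div>
"""

# A digit, optionally followed by more digits/commas and a decimal part.
PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_price(div):
    """Return the first price in the div as a Decimal, or None if there is none."""
    match = PRICE_RE.search(div.get_text())
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))

soup = BeautifulSoup(html, "html.parser")
prices = [parse_price(div) for div in soup.find_all("div", class_="price")]
print(prices)
```

Decimal keeps the cents exact; swapping in float works the same way if exactness does not matter.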

An option without regex is to keep only the tags whose text startswith() a dollar sign $ and strip that first character:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
    tag.get_text(strip=True)[1:] for tag in price_tags
    if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']

I want my code to not extract links with 0 seeders using python

I wrote my code, but it extracts all the links no matter what the seeders count is.
Here is the code I wrote:
from bs4 import BeautifulSoup
import urllib.request
import re
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"
url = input('What site you working on today, sir?\n-> ')
opener = AppURLopener()
html_page = opener.open(url)
soup = BeautifulSoup(html_page, "lxml")
pd = str(soup.findAll('td', attrs={'align':re.compile('right')}))
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
    if not('0' is pd[18]):
        print (link.get('href'),'\n')
This is the HTML I am working on: https://imgur.com/a/32J9qF4
In this case there are 0 seeders, but it still gives me the magnet link. Please help.

This code snippet will extract all magnet links from the page, where seeders != 0:
from bs4 import BeautifulSoup
import requests
from pprint import pprint
soup = BeautifulSoup(requests.get('https://pirateproxy.mx/browse/201/1/3').text, 'lxml')
tds = soup.select('#searchResult td.vertTh ~ td')
links = [name.select_one('a[href^=magnet]')['href'] for name, seeders, leechers in zip(tds[0::3], tds[1::3], tds[2::3]) if seeders.text.strip() != '0']
pprint(links, width=120)
Prints:
['magnet:?xt=urn:btih:aa8a1f7847a49e640638c02ce851effff38d440f&dn=Affairs.of.State.2018.BRRip.x264.AC3-Manning&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:819cb9b477462cd61ab6653ebc4a6f4e790589c3&dn=Bad.Samaritan.2018.BRRip.x264.AC3-Manning&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:843d01992aa81d52be68190ee6a733ec9eee9b13&dn=The+Darkest+Minds+2018+HDCAM-1XBET&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:09a23daa69c42003d905ecf0a1cefdb0474e7d88&dn=Insidious+The+Last+Key+2018+BRRip+x264+AAC-SSN&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:98c42d5d620b4db834c5437a75f6da6f2d158207&dn=The+Darkest+Minds+2018+HDCAM-1XBET%5BTGx%5D&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:f30ebc409b215f2a5237433d7508c7ebfabb0e16&dn=Journeyman.2017.SWESUB.BRRiP.x264.mp4&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
...and so on.
EDIT:
soup.select('#searchResult td.vertTh ~ td') selects all <td> siblings that follow the <td> with class vertTh inside the tag with id=searchResult. There are three such siblings in each row, which is why the list is sliced in steps of three.
select_one('a[href^=magnet]') then selects the first link in the name cell whose href begins with magnet.
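
A row-by-row variant can be easier to follow than slicing a flat list of cells. The sketch below assumes a simplified, made-up table (name cell, then seeders and leechers in right-aligned cells); the real page layout may differ. It also compares the seeders count as a number, since a character test on '0' would misread a value such as 10.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: name cell, seeders cell, leechers cell.
html = """
<table id="searchResult">
  <tr><th>Name</th><th>SE</th><th>LE</th></tr>
  <tr><td><a href="magnet:?xt=urn:btih:aaa">A</a></td><td align="right">12</td><td align="right">3</td></tr>
  <tr><td><a href="magnet:?xt=urn:btih:bbb">B</a></td><td align="right">0</td><td align="right">5</td></tr>
  <tr><td><a href="magnet:?xt=urn:btih:ccc">C</a></td><td align="right">10</td><td align="right">0</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for row in soup.select("#searchResult tr"):
    magnet = row.select_one('a[href^="magnet"]')
    counts = row.find_all("td", attrs={"align": "right"})
    if magnet is None or len(counts) < 2:
        continue  # header row or a row without the expected cells
    seeders = int(counts[0].get_text(strip=True))
    if seeders > 0:
        links.append(magnet["href"])

print(links)
```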

Search an id in python with BeautifulSoup

I need help with a problem. I am writing code to read the content of a tag. How can I get the content if the tag has an id?
from bs4 import BeautifulSoup
import urllib2
code = '<span class="vi-is1-prcp" id="v4-27"> 15,00 EUR </span>'
soup = BeautifulSoup(code)
price = soup.find('a', id='v4-27') # <-- PROBLEM
print price

If that is the HTML, then you should search for a 'span' tag instead of an 'a' tag. It should look something like this:
...
price = soup.find('span', id="v4-27")
print price  # optionally, price.string gives you just the 15,00 EUR text
             # instead of the entire HTML element
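
The snippets above use Python 2 syntax. For reference, here is a minimal Python 3 sketch of the same lookup. Searching by id alone, naming the parser explicitly, and converting the amount to a float are my additions, not part of the original answer.

```python
from bs4 import BeautifulSoup

code = '<span class="vi-is1-prcp" id="v4-27"> 15,00 EUR </span>'
soup = BeautifulSoup(code, "html.parser")

# An id should be unique in a document, so the tag name can be left out.
price_tag = soup.find(id="v4-27")

# get_text(strip=True) drops the surrounding whitespace.
price_text = price_tag.get_text(strip=True)

# Assumption: the amount uses a decimal comma, as in "15,00".
amount = float(price_text.split()[0].replace(",", "."))
print(price_tag.name, price_text, amount)
```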
