BeautifulSoup: Problems with numbers - python

I am getting some trouble with my code. The thing is that I'm extracting prices for a website and some of this prices includes the original price and the discounted one, and what i'm trying to get is only the discounted one. How do i fix it?
import requests
from bs4 import BeautifulSoup
search_url = "https://store.steampowered.com/search/?sort_by=Price_ASC&category1=998%2C996&category2=29"
category1 = ('998', '996')
category2 = '29'
params = {
'sort_by': 'Price_ASC',
'category1': ','.join(category1),
'category2': category2,
}
response = requests.get(
search_url,
params=params
)
soup = BeautifulSoup(response.text, "html.parser")
elms = soup.find_all("span", {"class": "title"})
prcs = soup.find_all("div",{"class": "col search_price discounted responsive_secondrow"})
for elm in elms:
print(elm.text)
for prc in prcs:
print(prc.text)

You can also use .contents. .contents list all the children of the selected tag and from there select your desired tag, in your case the last children of the tag.
for prc in prcs:
print(prc.contents[-1])

Use next_sibling
prcs = soup.find_all("div",{"class": "col search_price discounted responsive_secondrow"})
for p in prcs:
print(p.find('span').next_sibling.next)

Related

Beautiful Soup only extracting one tag when can see all the others in the html code

Trying to understand how web scraping works:
import requests
from bs4 import BeautifulSoup as soup
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
result = requests.get(url)
doc = soup(result.text, "lxml")
items = doc.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'})
for item in items:
caption = item.find('div', {'class': 'caption'})
price = item.find('h4', {'class': 'pull-right price'})
print(price.string)
However, when I run this all that returns is the final price from the website ($1799.00). Why does it skip all the other h4 tags and just return the last one?
Any help would be much appreciated!
If you need any more information please let me know
What happens?
You call the print() after you finally iterated over your results, thats why you only get the last one.
How to fix?
Put the print() into your loop
for item in items:
caption = item.find('div', {'class': 'caption'})
price = item.find('h4', {'class': 'pull-right price'})
print(price.string)
Output
$295.99
$299.00
$299.00
$306.99
$321.94
$356.49
$364.46
$372.70
$379.94
$379.95
$391.48
$393.88
$399.00
$399.99
$404.23
$408.98
$409.63
$410.46
$410.66
$416.99
$433.30
$436.29
$436.29
$439.73
$454.62
$454.73
$457.38
$465.95
$468.56
$469.10
$484.23
$485.90
$487.80
$488.64
$488.78
$494.71
$497.17
$498.23
$520.99
$564.98
$577.99
$581.99
$609.99
$679.00
$679.00
$729.00
$739.99
$745.99
$799.00
$809.00
$899.00
$999.00
$1033.99
$1096.02
$1098.42
$1099.00
$1099.00
$1101.83
$1102.66
$1110.14
$1112.91
$1114.55
$1123.87
$1123.87
$1124.20
$1133.82
$1133.91
$1139.54
$1140.62
$1143.40
$1144.20
$1144.40
$1149.00
$1149.00
$1149.73
$1154.04
$1170.10
$1178.19
$1178.99
$1179.00
$1187.88
$1187.98
$1199.00
$1199.00
$1199.73
$1203.41
$1212.16
$1221.58
$1223.99
$1235.49
$1238.37
$1239.20
$1244.99
$1259.00
$1260.13
$1271.06
$1273.11
$1281.99
$1294.74
$1299.00
$1310.39
$1311.99
$1326.83
$1333.00
$1337.28
$1338.37
$1341.22
$1347.78
$1349.23
$1362.24
$1366.32
$1381.13
$1399.00
$1399.00
$1769.00
$1769.00
$1799.00
Example
Instead of just printing the results while iterating, store them structured in a list of dicts and print or save it after the for loop
import requests
from bs4 import BeautifulSoup as soup
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
result = requests.get(url)
doc = soup(result.text, "lxml")
items = doc.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'})
data = []
for item in items:
data.append({
'caption' : item.a['title'],
'price' : item.find('h4', {'class': 'pull-right price'}).string
})
print(data)

python Beautifulsoup how to get text

I have a issue to get the text from this, i cannot do .text at the end of find_all().
I would like to get the text into ('a', class_ = 'tw-hidden lg:tw-flex font-bold tw-items-center tw-justify-between')
I'm beginner, i've tried many things but i get stocked, if someone can help me, that would be nice.
Thanks
import requests
from bs4 import BeautifulSoup
for i in range(2):
url = 'https://www.coingecko.com/fr?page=1' + str(i)
response = requests.get(url)
if response.ok:
soup = BeautifulSoup(response.text, "lxml")
coin = soup.find_all('a', class_ = 'tw-hidden lg:tw-flex font-bold tw-items-center tw-justify-between')
print(coin)
What happens?
find_all() create a results set / a list, that do not have a method .text.
How to get the text?
Simply iterate over your result set and call the .text methode on each element separatly:
for coin in soup.find_all('a', class_ = 'tw-hidden lg:tw-flex font-bold tw-items-center tw-justify-between'):
print(coin.text)
Hint
You can use get_text(strip=True) to get a "cleaner" result:
for coin in soup.find_all('a', class_ = 'tw-hidden lg:tw-flex font-bold tw-items-center tw-justify-between'):
print(coin.get_text(strip=True))
Complete example
import requests
from bs4 import BeautifulSoup
for i in range(2):
url = 'https://www.coingecko.com/fr?page=1' + str(i)
response = requests.get(url)
if response.ok:
soup = BeautifulSoup(response.text, "lxml")
for coin in soup.find_all('a', class_ = 'tw-hidden lg:tw-flex font-bold tw-items-center tw-justify-between'):
print(coin.get_text(strip=True))
Output
Neblio
Measurable Data Token
Secret Finance
MASS
88mph
GROWTH DeFi
My DeFi Pet
Bridge Mutual
Dfyn Network
e-Money
Defi Warrior
DeepBrain Chain
Peony Coin
Oraichain Token
FOAM
DVF
...

cant get text from span with Beautifulsoup

Why can I not get the text 3.7M from those (multiple with same class name) span with below code?:
result_prices_pc = soup.find_all("span", attrs={"class": "pc_color font-weight-bold"})
HTML:
<td><span class="pc_color font-weight-bold">3.7M <img alt="c" class="small-coins-icon" src="/design/img/coins_bin.png"></span></td>
I try to get all prices with a for loop:
for price in result_prices_pc:
print(price.text)
But I cant get the text from it.
The "problem" is the pc_color CSS class. When you load the page, you need to specify what version of page do you need (PS4/XBOX/PC) - this is done by "platform" cookie (or you can use ps4_color instead of pc_color, for example):
import requests
from bs4 import BeautifulSoup
url = "https://www.futbin.com/players"
cookies = {"platform": "pc"}
soup = BeautifulSoup(requests.get(url, cookies=cookies).content, "html.parser")
result_prices_pc = soup.find_all(
"span", attrs={"class": "pc_color font-weight-bold"}
)
for price in result_prices_pc:
print(price.text)
Prints:
0
1.15M
3.75M
1.7M
4.19M
1.81M
351.65K
0
1.66M
98K
1.16M
3M
775K
99K
1.62M
187K
280K
245K
220K
1.03M
395K
100K
185K
864.2K
0
1.95M
540K
0
0
89K
These elements are actually having multiple class names: pc_color font-weight-bold are actually pc_color and font-weight-bold class names.
Forthis case you should use this syntax:
result_prices_pc = soup.find_all("span", attrs={"class": ['pc_color', 'font-weight-bold']})

Selecting HTML page values using Beautiful Soup

I have been trying to figure this out for a few hour and it is doing my head in. Every method I try is not presenting the correct value.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.off---white.com/en/GB/products/omia065s188000160100')
soup = BeautifulSoup(r.content, 'html.parser')
I want to extract the following values from the webpage (https://www.off---white.com/en/GB/products/omia065s188000160100)
Name = LOW 3.0 SNEAKER
Price = £ 415
img_url = https://cdn.off---white.com/images/156365/large_OMIA065S188000160100_4.jpg?1498202305
How would I extract these 3 values using Beautiful Soup?
import requests
from bs4 import BeautifulSoup
# Get prod name
r = requests.get('https://www.off---white.com/en/GB/products/omia065s188000160100')
soup = BeautifulSoup(r.text, 'html.parser')
spans = soup.find_all('span', {'class' : 'prod-title'})
data = [span.get_text() for span in spans]
prod_name = ''.join(data)
# Find prod price
spans = soup.find_all('div', {'class' : 'price'})
data = [span.get_text() for span in spans]
prod_price = ''.join(data)
# Find prod img
spans = soup.find_all('img', {'id' : 'image-0'})
for meta in spans:
prod_img = meta.attrs['src']
print(prod_name)
print(prod_price)
print(prod_img)

Best way to loop this situation?

I have a list of divs, and I'm trying to get certain info in each of them. The div classes are the same so I'm not sure how I would go about this.
I have tried for loops but have been getting various errors
Code to get list of divs:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://sneakernews.com/release-dates/'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "lxml")
soup1 = soup.find("div", {'class': 'popular-releases-block'})
soup1 = str(soup1.find("div", {'class': 'row'}))
soup1 = soup1.split('</div>')
print(soup1)
Code I want to loop for each item in the soup1 list:
linkinfo = soup1.find('a')['href']
date = str(soup1.find('span'))
name = soup1.find('a')
non_decimal = re.compile(r'[^\d.]+')
date = non_decimal.sub('', date)
name = str(name)
name = re.sub('</a>', '', name)
link, name = name.split('>')
link = re.sub('<a href="', '', link)
link = re.sub('"', '', link)
name = name.split(' ')
name = str(name[-1])
date = str(date)
link = str(link)
print(link)
print(name)
print(date)
Based on the URL you posted above, I imagine you are interested in something like this:
import requests
from bs4 import BeautifulSoup
url = requests.get('https://sneakernews.com/release-dates/').text
soup = BeautifulSoup(url, 'html.parser')
tags = soup.find_all('div', {'class': 'col lg-2 sm-3 popular-releases-box'})
for tag in tags:
link = tag.find('a').get('href')
print(link)
print(tag.text)
#Anything else you want to do
If you are using the BeautifulSoup library, then you do not need regex to try to parse through HTML tags. Instead, use the handy methods that accompany BeautifulSoup. If you would like to apply a regex to the text output from the tags you locate via BeautifulSoup to accomplish a more specific task, then that would be reasonable.
My understanding is that you want to loop your code for each item within a list.
An example of this:
my_list = ["John", "Fred", "Tom"]
for name in my_list:
print(name)
This will loop for each name that is in my_list and print out each item (reffered to here as name in the list). You could do something similar with your code:
for item in soup1:
# perform some action

Categories

Resources