Python parse string in span class

I've already tried the other solutions, but they did not work.
Here is my span tag:
<span class="DFlfde SwHCTb" data-precision="2" data-value="7.0498">7,05</span>
and here is my full code
import requests
from bs4 import BeautifulSoup
#<span class="DFlfde SwHCTb" data-precision="2" data-value="7.0498">7,05</span>
url = "https://www.google.com/search?q={}+kaç+tl".format(input())
r = requests.get(url)
source = BeautifulSoup(r.content,"html")
print(source.find_all("span",string="DFlfde SwHCTb"))
It returns an empty list. I need the value "7.05"; how can I get it? Thanks.

There are two things to do to get your data:
1. Specify a User-Agent header (Google needs this to return the correct data).
2. In .find_all, pass the class_= parameter instead of string=.
Code:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?q=100+kaç+tl"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}
r = requests.get(url, headers=headers)
source = BeautifulSoup(r.content, "html.parser")
print(source.find_all("span", class_="DFlfde SwHCTb"))
# or
print(source.select_one('span.DFlfde.SwHCTb').text)
Prints:
[<span class="DFlfde SwHCTb" data-precision="2" data-value="13.0095">13,01</span>]
13,01
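To see why string= fails without touching the network, here is a minimal sketch that runs both lookups against the span copied from the question. The variable names and the final conversion steps are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# The span copied from the question, parsed offline.
html = '<span class="DFlfde SwHCTb" data-precision="2" data-value="7.0498">7,05</span>'
soup = BeautifulSoup(html, "html.parser")

# string= compares against the tag's text ("7,05"), so a class name never matches.
by_string = soup.find_all("span", string="DFlfde SwHCTb")

# class_= compares against the class attribute instead.
by_class = soup.find_all("span", class_="DFlfde SwHCTb")

span = by_class[0]
display_text = span.text.replace(",", ".")   # decimal comma to decimal point
exact_value = float(span["data-value"])       # unrounded value from the attribute
rounded = round(exact_value, int(span["data-precision"]))

print(by_string)
print(display_text, exact_value, rounded)
```

Reading data-value avoids parsing the localized text at all, which may be more robust if the displayed format changes.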

Related

BeautifulSoup doesn’t find tags

BeautifulSoup doesn’t find any tag on this page. Does anyone know what the problem can be?
I can find elements on the page with selenium, but since I have a list of pages, I don’t want to use selenium.
import requests
from bs4 import BeautifulSoup
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
soup.find_all('h1')

You can get the content of that page by adding headers to your request that mimic the main request to that URL, as shown in the browser's Dev Tools (Network tab). Here is one way to get all links from that page:
import requests
from bs4 import BeautifulSoup as bs
headers = {
'Cookie': 'sso_checked=1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
links = [a.get('href') for a in soup.select('a')]
print(links)
Result printed in terminal:
['/news', 'https://dzen.ru/news', 'https://dzen.ru/news/region/moscow', 'https://dzen.ru/news/rubric/mobilizatsiya', 'https://dzen.ru/news/rubric/personal_feed', 'https://dzen.ru/news/rubric/politics', 'https://dzen.ru/news/rubric/society', 'https://dzen.ru/news/rubric/business', 'https://dzen.ru/news/rubric/world', 'https://dzen.ru/news/rubric/sport', 'https://dzen.ru/news/rubric/incident', 'https://dzen.ru/news/rubric/culture', 'https://dzen.ru/news/rubric/computers', 'https://dzen.ru/news/rubric/science', 'https://dzen.ru/news/rubric/auto', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://mosregtoday.ru/soc/v-podmoskove-zapustili-chat-bot-po-voprosam-chastichnoj-mobilizacii/?utm_source=yxnews&utm_medium=desktop', ...]
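The printed list mixes a relative path ('/news') with absolute URLs and contains repeats. As a hedged follow-up sketch, the links could be normalized with urljoin and de-duplicated; page_url here is a hypothetical shortened stand-in for the real story URL, and the links list is a hand-picked sample of the output above.

```python
from urllib.parse import urljoin

# Hypothetical shortened stand-in for the story URL used in the answer.
page_url = "https://dzen.ru/news/story/example"

# A small sample shaped like the printed result.
links = [
    "/news",
    "https://dzen.ru/news",
    "https://dzen.ru/news/region/moscow",
    "https://dzen.ru/news/region/moscow",
    None,  # a.get('href') returns None for <a> tags without an href
]

# Resolve relative paths against the page URL and skip missing hrefs.
absolute = [urljoin(page_url, href) for href in links if href]

# dict.fromkeys keeps the first occurrence of each link, preserving order.
unique = list(dict.fromkeys(absolute))
print(unique)
```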

extracting text from id in beautifulsoup

I have HTML code like this. Using BeautifulSoup, I want to extract the text, which is 2,441.
It is a span element whose id is lastPrice.
<span id="lastPrice">2,441.00</span>
I have tried to look this up online, but I am still unable to do it. I am a beginner.
I have tried this:
tag = soup.span
price = soup.find(id="lastPrice")
print(price.text)

The data you see on the page is rendered via JavaScript, so BeautifulSoup doesn't see the price in the span. The price is, however, embedded in the page as JSON inside the element with id responseDiv. You can extract it like this, for example:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www1.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HDFC"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(soup.find(id="responseDiv").contents[0])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
print(data["data"][0]["lastPrice"])
Prints:
2,407.90
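The same extraction can be tried offline. The sketch below assumes a heavily cut-down, hypothetical version of the page in which responseDiv holds only the lastPrice field; the real payload contains many more keys.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, cut-down stand-in for the real page.
html = '<div id="responseDiv">{"data": [{"lastPrice": "2,407.90"}]}</div>'
soup = BeautifulSoup(html, "html.parser")

container = soup.find(id="responseDiv")
if container is None or not container.contents:
    # Guard against layout changes instead of failing with an AttributeError.
    raise SystemExit("responseDiv not found; the page layout may have changed")

# The first child of the div is the JSON text.
data = json.loads(container.contents[0])
last_price_text = data["data"][0]["lastPrice"]

# Strip the thousands separator before converting to a number.
last_price = float(last_price_text.replace(",", ""))
print(last_price_text, last_price)
```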

Try this:
price = soup.select("#lastPrice")[0]
print(price.text)

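Without bs4, using a regex instead:
import re
line = '<span id="lastPrice">2,441.00</span>'
# p was undefined in the original; this tag-stripping pattern is an assumed reconstruction
p = re.compile(r"<[^>]+>")
print(p.sub("", line))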

soup.find_all returns empty list

I was trying to scrape prices from booking.com, but it just keeps returning an empty list.
If anyone can explain to me what is happening, I would be really thankful.
Here is the website from which I am trying to scrape data:
https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1
Here is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get("https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1").text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('div', class_='fde444d7ef _e885fdc12')
print(prices)

After checking several possible causes, I found two problems:
1. The price is in a <span>, but you search in <div>.
2. The server sends different HTML for different browsers or devices, so the code needs a full User-Agent header from a real browser. A short Mozilla/5.0 is not enough, and by default requests sends something like python-requests/2.27.
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
}
url = "https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1"
response = requests.get(url, headers=headers)
#print(response.status_code)
html_text = response.text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('span', class_='fde444d7ef _e885fdc12')
for item in prices:
    print(item.text)
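To check which User-Agent would actually be sent, without making a request, a minimal sketch like the following can help. It uses a Session only so the request can be prepared and inspected; the shortened URL is an assumption for readability.

```python
import requests

# What requests sends when no User-Agent header is given.
default_ua = requests.utils.default_user_agent()

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0"
})

# Prepare (but do not send) a request to inspect the final headers.
# The URL is shortened here; the query string from the answer is omitted.
request = requests.Request("GET", "https://www.booking.com/searchresults.html")
prepared = session.prepare_request(request)
sent_ua = prepared.headers["User-Agent"]

print("default:", default_ua)
print("sent:   ", sent_ua)
```

If the default value shows up in sent_ua, the headers were not attached to the request.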

Extracting specific part of html

I am working on a web scraper using requests-html and Beautiful Soup (I am new to this). On one web page (https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1) I am trying to scrape one value, and I will then replicate the approach for other products. The HTML looks like:
<div class="plp-listing-load-status c-list-header__counter initialized" data-page-number="1" data-total-pages-count="57" data-products-count="60" data-total-products-count="3361" data-status-format="{available}/{total} results">60/3361 results</div>
I want to scrape the "57" from data-total-pages-count="57". I have tried using:
soup = BeautifulSoup(page.content, "html.parser")
nopagesstr = soup.find(class_="plp-listing-load-status c-list-header__counter initialized").get('data-total-pages-count')
and
nopagesstr = r.html.find('[data-total-pages-count]',first=True)
But both return None. I am not sure how to select the 57 specifically. Any help would be appreciated.

To get total pages count, you can use this example:
import requests
from bs4 import BeautifulSoup
url = "https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
print(soup.select_one("[data-total-pages-count]")["data-total-pages-count"])
Prints:
56
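The attribute selector can be verified against the div from the question without a network call. The sketch below also builds the list of page URLs; that the pn query parameter is the page number is an assumption based on the ?pn=1 in the question's URL.

```python
from bs4 import BeautifulSoup

# The div copied from the question.
html = (
    '<div class="plp-listing-load-status c-list-header__counter initialized" '
    'data-page-number="1" data-total-pages-count="57" data-products-count="60" '
    'data-total-products-count="3361" '
    'data-status-format="{available}/{total} results">60/3361 results</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Match any element that carries the attribute, regardless of its classes.
tag = soup.select_one("[data-total-pages-count]")
total_pages = int(tag["data-total-pages-count"]) if tag else None

# Assumption: pn is the page number, as suggested by "?pn=1" in the question.
base = "https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn={}"
page_urls = [base.format(n) for n in range(1, total_pages + 1)] if total_pages else []

print(total_pages, len(page_urls))
```

Checking tag for None before indexing avoids a TypeError when the server returns a page without the counter, for example when the User-Agent header is missing.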

There is a link to a comment under a VKontakte post. How do I get the content of the comment from the link?

I tried using BeautifulSoup, but when parsing the page I get the mobile version for some reason, where links to comments look different. With vk_api, I would have to specify the group id. I would be grateful for any advice. Here is the link "link". If you click on it, the comment "sexism" appears. I want to do this programmatically in Python.

If I understand you correctly, you want to extract the word "sexism" from the provided link, which includes reply=<some number>.
You can do it like this, for example:
import requests
import urllib.parse
from bs4 import BeautifulSoup
url = 'https://vk.com/wall-12648877_5011889?reply=5011893'
reply_number = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)['reply'][0]
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
text = soup.select_one('div.ReplyItem:has(a[name="reply{}"]) .ReplyItem__body'.format(reply_number)).text.strip()
print(text)
Prints:
sexism
EDIT:
import requests
import urllib.parse
from bs4 import BeautifulSoup
url = 'https://vk.com/wall-12648877_5013166?reply=5013335&thread=5013176'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
reply_number = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)['reply'][0]
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
text = soup.select_one('div.ReplyItem:has(a[name="reply{reply_number}"]) .ReplyItem__body, .reply_text div[id$="_{reply_number}"]'.format(reply_number=reply_number)).text.strip()
print(text)
Prints:
ахаха, опять ты. и опять твою фотку запостили в комментах, просто ОР, в прошлый раз ты что-то про геев говорил кажется. ))
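The reply-number extraction used in both snippets can be checked on its own. This is a small sketch; the helper name reply_id is hypothetical, and the third URL is added only to show the case where the parameter is missing.

```python
from urllib.parse import parse_qs, urlparse


def reply_id(url):
    """Return the value of the reply query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("reply")
    return values[0] if values else None


urls = [
    "https://vk.com/wall-12648877_5011889?reply=5011893",
    "https://vk.com/wall-12648877_5013166?reply=5013335&thread=5013176",
    "https://vk.com/wall-12648877_5011889",  # no reply parameter
]

for url in urls:
    print(reply_id(url))

# The id is then substituted into the CSS selector from the first snippet.
selector = 'div.ReplyItem:has(a[name="reply{}"]) .ReplyItem__body'.format(reply_id(urls[0]))
print(selector)
```

Using .get("reply") instead of indexing with ['reply'] avoids a KeyError for links that point to a post rather than to a comment.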
