Extract text with Beautiful Soup [duplicate]

This question already has answers here:
BS4 Beautiful Soup extract text from find_all
(2 answers)
Closed 2 years ago.
I'm trying to learn how to use Beautiful Soup, using this website as a very simple example:
https://www.espncricinfo.com/ci/content/ground/56490.html#Profile
Let's say I want to extract the capacity of the ground. So far I have written the following code, which gives me the field names, but I can't seem to understand how to get the actual value of 18,000.
Can anyone help?
url="https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text)
soup.findAll('label')

Perhaps something like
from bs4 import BeautifulSoup
import requests
url = "https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stats = soup.find('div', {'id': 'stats'})
for e in stats.findAll('label'):
    print(f"{e.text}: {e.nextSibling}")

Web scraping with beautiful soup [duplicate]

This question already has answers here:
Getting all Links from a page Beautiful Soup
(3 answers)
Closed 2 months ago.
I am trying to scrape links from this page (https://www.setlist.fm/search?query=nightwish).
While this code does retrieve the links I want, it also comes back with a load of other stuff I don't want.
Examples of what I want:
setlist/nightwish/2022/quarterback-immobilien-arena-leipzig-germany-2bbca8f2.html
setlist/nightwish/2022/brose-arena-bamberg-germany-3bf4963.html
setlist/nightwish/2022/arena-gliwice-gliwice-poland-3bc9dc7.html
Can I use Beautiful Soup to get just these links, or do I need to use a regex?
import requests
import bs4

url = 'https://www.setlist.fm/search?query=nightwish'
reqs = requests.get(url)
soup = bs4.BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.select('a'):
    urls.append(link)
    print(link.get('href'))
Please check to see if the code snippet below is useful.
import requests
from bs4 import BeautifulSoup
url = 'https://www.setlist.fm/search?query=nightwish'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
for g_data in soup.find_all('a', {'class': 'link-primary'}, href=True):
    print(g_data['href'])
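Note that these hrefs are relative (setlist/nightwish/...). If you need absolute URLs, you could join each one against the page URL with urllib.parse.urljoin; a small sketch:

from urllib.parse import urljoin

base = 'https://www.setlist.fm/search?query=nightwish'
relative = 'setlist/nightwish/2022/brose-arena-bamberg-germany-3bf4963.html'
print(urljoin(base, relative))
# https://www.setlist.fm/setlist/nightwish/2022/brose-arena-bamberg-germany-3bf4963.html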

Python 3 - Extract content between <td></td> [duplicate]

This question already has an answer here:
How to get inner text value of an HTML tag with BeautifulSoup bs4?
(1 answer)
Closed 5 years ago.
from bs4 import BeautifulSoup
import re

data = open(r'C:\folder')  # raw string so the backslash is not treated as an escape
soup = BeautifulSoup(data, 'html.parser')
emails = soup.find_all('td', text=re.compile('#'))
for line in emails:
    print(line)
I have the script above, which works perfectly in Python 2.7 with BeautifulSoup for extracting the content between several <td></td> tags in an HTML file. When I run the same script in Python 3.6.4, however, I get the following results:
<td>xxx#xxx.com</td>
<td>xxx#xxx.com</td>
I want the content between the tags, without the <td> markup...
Why is this happening in Python 3?
I found the answer...
from bs4 import BeautifulSoup
import re

data = open(r'C:\folder')
soup = BeautifulSoup(data, 'html.parser')
emails = soup.find_all('td', text=re.compile('#'))
for td in emails:
    print(td.get_text())

Look closely at the last two lines :)
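The difference is get_text(): printing a Tag shows its full markup, while get_text() returns only the inner text. A self-contained sketch (the inline HTML is made up) showing the two side by side:

from bs4 import BeautifulSoup

html = "<table><tr><td>xxx#xxx.com</td></tr></table>"  # stand-in for the file on disk
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')
print(td)             # <td>xxx#xxx.com</td>
print(td.get_text())  # xxx#xxx.com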

Code not on BS4, but can be found in 'Inspect Element' [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 2 years ago.
I have tried making a website which uses Beautiful Soup 4 to search G2A for game prices (by class). The problem is that when I look at the HTML in the browser, it clearly shows the price of the first result (£2.30), but when I search for the same class with Beautiful Soup 4, there is nothing between the tags:
import requests
from bs4 import BeautifulSoup

# summoning g2a
r = requests.get('https://www.g2a.com/?search=x')
data = r.text
soup = BeautifulSoup(data, 'html.parser')

# finding prices
prices = soup.find_all("strong", class_="mp-pi-price-min")
print(soup.prettify())
requests doesn't handle dynamic page content. Your best bet is using Selenium to drive a browser. From there you can parse page_source with BeautifulSoup to get the results you're looking for.
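A minimal sketch of that approach (it assumes Chrome plus a driver Selenium can find; the class name is taken from the question):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.g2a.com/?search=x')

# parse the JavaScript-rendered DOM instead of the raw HTTP response
soup = BeautifulSoup(driver.page_source, 'html.parser')
for price in soup.find_all("strong", class_="mp-pi-price-min"):
    print(price.get_text())

driver.quit()

Depending on how the page loads, you may also need an explicit wait before reading page_source so the prices have time to render.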
Alternatively, in Chrome's developer tools you can find the URL of the AJAX request (made by JavaScript) in the Network tab. You can mimic that request and get the data back.

import requests

r = requests.get('the ajax request url')  # the URL found in the Network tab
data = r.text
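If that endpoint returns JSON, as such AJAX endpoints often do, you could decode it directly instead of parsing HTML; a sketch with the URL still left as a placeholder:

import requests

r = requests.get('the ajax request url')  # placeholder, as above
if 'json' in r.headers.get('Content-Type', ''):
    data = r.json()  # a Python dict or list rather than raw text
    print(data)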

How to scrape href with Python 3.5 and BeautifulSoup [duplicate]

This question already has answers here:
retrieve links from web page using python and BeautifulSoup [closed]
(16 answers)
Closed 6 years ago.
I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
That's my code
#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)
I get [None, None, ..., None, None] back.
I need a list with all the hrefs from the project-title links.
Any ideas?
The .href lookup is the problem: attribute access on a Tag searches for a child tag of that name (there is no <href> tag, hence the Nones), while HTML attributes are read with subscripting, e.g. project['href']. Try something like this:
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)
This will return all the href instances. As I can see in your link, a lot of href attributes just contain #. You can avoid these with a simple regex for proper links, or just ignore the # symbols.
project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you some trash links like /discover?ref=nav, so if you want to narrow it down use a proper regex for the links you need.
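For instance, Kickstarter project links appear to live under /projects/, so a sketch of such a filter (the pattern is an assumption about the site's URL scheme):

import re
import urllib.request
from bs4 import BeautifulSoup

theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
soup = BeautifulSoup(urllib.request.urlopen(theurl), "html.parser")

# '/projects/' is an assumed prefix for project URLs on the site
pattern = re.compile(r'^/projects/')
project_href = [a['href'] for a in soup.find_all('a', href=True)
                if pattern.match(a['href'])]
print(project_href)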
EDIT:
To solve the problem you mentioned in the comments:
soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class': 'project-card-content'}):
    print(i.a['href'])

beautiful soup and requests not getting full page [duplicate]

This question already has an answer here:
Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds
(1 answer)
Closed 8 years ago.
My code looks like this.
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.data.com.sg/iCurrentLaunch.jsp")
data = r.text
soup = BeautifulSoup(data)
n = soup.findAll('table')[7].findAll('table')
for tab in n:
    print(tab.findAll('td')[1].text)
What I am getting is the property names up to IDYLLIC SUITES; after that I get the error "list index out of range". What is the problem?
I am not sure exactly what is bothering you, because when I tried your code (as it is) it worked for me.
Still, try changing the parser, maybe to html5lib.
So do,
pip install html5lib
And then change your code to,
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.data.com.sg/iCurrentLaunch.jsp")
data = r.text
soup = BeautifulSoup(data, 'html5lib')  # change of parser
n = soup.findAll('table')[7].findAll('table')
for tab in n:
    print(tab.findAll('td')[1].text)
Let me know if it helps
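The reason a parser swap can matter: parsers repair malformed markup differently, and html5lib rebuilds the tree the way a browser would, which can recover tables that html.parser leaves incomplete. A small self-contained sketch of the difference:

from bs4 import BeautifulSoup

broken = "<table><tr><td>IDYLLIC SUITES"  # malformed: nothing is closed
print(BeautifulSoup(broken, "html.parser"))
# html.parser keeps the fragment roughly as-is
print(BeautifulSoup(broken, "html5lib"))
# html5lib adds <html>, <body> and <tbody>, as a browser would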
