from bs4 import BeautifulSoup
import requests
req = requests.get("https://www.airtel.in/myplan-infinity/")
soup = BeautifulSoup(req.content, 'html.parser')
#finding the div with the id
div_bs4 = soup.find('div')
print(div_bs4)
What should I do to scrape the recharge plans of the page?
you should use content when you need to get media data (like pictures, videos, etc)
if you need to get non-media data you've to use text instead of content
make the soup like this: `soup = BeautifulSoup(req.text, 'lxml')
make sure that this like div_bs4 = soup.find('div') finds the exact div you need (cuz it will just get the first div in html)
this line print(div_bs4) won't give you needed data
you'd better use this print(div_bs4.text)
Related
I am trying to Scrape NBA.com play by play table so I want to get the text for each box that is in the example picture.
for example(https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play).
checking the html code I figured that each line is in an article tag that contains div tag that contains two p tags with the information I want, however I wrote the following code and I get back that there are 0 articles and only 9 P tags (should be much more) but even the tags I get their text is not the box but something else. I get 9 tags so I am doing something terrible wrong and I am not sure what it is.
this is the code to get the tags:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def contains_word(t):
return t and 'keyword' in t
url = "https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
div_tags = soup.find_all('div', text=contains_word("playByPlayContainer"))
articles=soup.find_all('article')
p_tag = soup.find_all('p', text=contains_word("md:bg"))
thank you!
Use Selenium since it's using Javascript and pass it to Beautifulsoup. Also pip install selenium and get the chromedriver.exe
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play")
soup = BeautifulSoup(driver.page_source, "html.parser")
I have used Beautifulsoup to scrape a website. My current code helps me to get the website content in HTML format. I used soup to find the word if it is present or not but I am not able to get the paragraph it belongs to.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
"https://manychat.com/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title.text
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_body, page_head)
thirdParty = soup.find(text = 'Facebook')
Usually, the areas you're interested in searching are of a common kind, like <div> with a common class. So, you have Soup return all of the <div>s with that class, and you search the div text for your word.
This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters, by clicking on the encircled heading1
This is how the webpage looks like after clicking on the heading2
I want to get names of all the places displayed on the webpage but I seem to be having trouble with it as the url doesn't get changed on applying the filter.
I am using python urllib for this.
Here is my code:
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4. Bs4 is a python module that allows you to get certain things off of webpages. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something that is not the text, maybe something with a certain tag you can also use bs4:
soup.findall('p') # Getting all p tags
soup.findall('p', class_='Title') #getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
I am trying to scrape sub-content from Wikipedia pages based on the internal link using python, The problem is that scrape all content from the page, how can scrape just internal link paragraph, Thanks in advance
base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
sub_link="#الأسباب"
total=base_link+sub_link
r=requests.get(total)
soup = bs(r.text, 'html.parser')
results=soup.find('p')
print(results)
It is because it's not a sublink you are trying to scrape. It's an anchor.
Try to request the entire page and then to find the given id.
Something like this:
from bs4 import BeautifulSoup as soup
import requests
base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
anchor_id="ﺍﻸﺴﺑﺎﺑ"
r=requests.get(base_link)
page = soup(r.text, 'html.parser')
span = page.find('span', {'id': anchor_id})
results = span.parent.find_next_siblings('p')
print(results[0].text)
tyring to pull the href links for the products on this webpage. The code pulls all of the href's except the products that are listed on the page.
from bs4 import BeautifulSoup
import requests
url = "https://www.neb.com/search#t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
print(tag.get('href'))
The products are loaded through rest API dynamically, the URL is this:
https://international.neb.com/coveo/rest/v2/?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D%3Flang%3Den%26ver%3D1&siteName=nebinternational
Loading this response will get you the URLs.
Next time, check your network inspector if any part of web page isn't loading dynamically (or use selenium).
Try to verify if the product href's is in the received response. I'm telling you to do this because if the part of the products is being dynamically generated by ajax, for example, a simple get on the main page will not bring them.
Print the response and verifiy if the products are being received in the html
I think you want something like this:
from bs4 import BeautifulSoup
import urllib.request
for numb in ('1', '100'):
resp = urllib.request.urlopen("https://www.neb.com/search#first=" + numb + "&t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
print(link['href'])