How to scrape a website using selected words if present? - python

I have used Beautifulsoup to scrape a website. My current code helps me to get the website content in HTML format. I used soup to find the word if it is present or not but I am not able to get the paragraph it belongs to.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
"https://manychat.com/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title.text
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_body, page_head)
thirdParty = soup.find(text = 'Facebook')

Usually, the areas you're interested in searching are of a common kind, like <div> with a common class. So, you have Soup return all of the <div>s with that class, and you search the div text for your word.

Related

Can't able to scrape the airtel website plans with python BeautifulSoup

from bs4 import BeautifulSoup
import requests
req = requests.get("https://www.airtel.in/myplan-infinity/")
soup = BeautifulSoup(req.content, 'html.parser')
#finding the div with the id
div_bs4 = soup.find('div')
print(div_bs4)
What should I do to scrape the recharge plans of the page?
you should use content when you need to get media data (like pictures, videos, etc)
if you need to get non-media data you've to use text instead of content
make the soup like this: `soup = BeautifulSoup(req.text, 'lxml')
make sure that this like div_bs4 = soup.find('div') finds the exact div you need (cuz it will just get the first div in html)
this line print(div_bs4) won't give you needed data
you'd better use this print(div_bs4.text)

KeyError when trying to access ['style'] using Beautiful Soup

I'm trying to access the style of a DIV element on a page using Beautiful Soup 4 but I keep getting a key error. I know the styles are definitely there because I can inspect them using the inspector in the browser and I can see styles for the DIV with class "header large border first". (see the attached image)
Here is my code;
url = 'https://www.themoviedb.org/movie/595743-sas-red-notice'
response = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
header_image_style = soup.find("div", class_="header large border first")['style']
I'm not sure what I'm doing wrong can anyone help?
Here is an image of the DIV with styles;
Beautiful soup does not analyze the contents in style tags or in linked style sheets unfortunately, so it will be difficult to retrieve that value since we will need to handle parsing the css on our own.
The value we are looking for is contained within the document's style tag, so we can get the contents of the style tag and parse it for ourselves to get the value. Here's a working example:
from bs4 import BeautifulSoup
import cssutils
import requests
url = 'https://www.themoviedb.org/movie/595743-sas-red-notice'
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, 'html.parser')
# get the style tag contents
style_str = soup.find("style").text
# parse the tag's contents
rules = cssutils.parseString(style_str)
# find the first rule that applies to "div.header.large.first"
rule = next(filter(lambda x: x.selectorText == "div.header.large.first", rules))
# get the backgroundImage property
background_property = rule.style.backgroundImage
# Cut out the start of the text that says "url(" and ")"
img_url = background_property[4:-1]
print(img_url)
You will need to pip install cssutils in order for this example to work.

scrape book body text from project gutenberg de

I am new to python and I am looking for a way to extract with beautiful soup existing open source books that are available on gutenberg-de, such as this one
I need to use them for further analysis and text mining.
I tried this code, found in a tutorial, and it extracts metadata, but instead of the body content it gives me a list of the "pages" I need to scrape the text from.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
"https://www.projekt-gutenberg.org/keller/heinrich/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_title, page_head)
I suppose I could use that as a second step to extract it then? I am not sure how, though.
Ideally I would like to store them in a tabular way and be able to save them as csv, preserving the metadata author, title, year, and chapter. any ideas?
What happens?
First of all you will get a list of pages, cause you not requesting the right url it to:
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
Recommend that if your looping all the urls store the content in a list of dicts and push it to csv or pandas or ...
Example
import requests
from bs4 import BeautifulSoup
data = []
# Make a request
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
soup = BeautifulSoup(page.content, 'html.parser')
data.append({
'title': soup.title,
'chapter': soup.h2.get_text(),
'text': ' '.join([p.get_text(strip=True) for p in soup.select('body p')[2:]])
}
)
data

How to use web scraping to get visible text on the webpage?

This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters, by clicking on the encircled heading1
This is how the webpage looks like after clicking on the heading2
I want to get names of all the places displayed on the webpage but I seem to be having trouble with it as the url doesn't get changed on applying the filter.
I am using python urllib for this.
Here is my code:
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4. Bs4 is a python module that allows you to get certain things off of webpages. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something that is not the text, maybe something with a certain tag you can also use bs4:
soup.findall('p') # Getting all p tags
soup.findall('p', class_='Title') #getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Webscraping a particular element of html

I’m having trouble scraping information from government travel advice websites for a research project I’m doing on Python.
I’ve picked the Turkey page but the logic could extend to any country.
The site is "https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security"
The code I'm using is:
import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-
and-security")
page
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
soup.find_all('p')[0].get_text()
At the moment this is extracting all the html of the page. Having inspected the website the information I am interested in is located in:
<div class="govuk-govspeak direction-ltr">
<p>
Does anyone know how to amend the code above to only extract that part of the html?
Thanks
If you are only interested in data located inside govuk-govspeak direction-ltr class, therefore you can try these steps :
Beautiful Soup supports the most commonly-used CSS selectors. Just pass a string into the .select() method of a Tag object or the BeautifulSoup object itself. For class use . and for id use #
data = soup.select('.govuk-govspeak.direction-ltr')
# extract h3 tags
h3_tags = data[0].select('h3')
print(h3_tags)
[<h3 id="local-travel---syrian-border">Local travel - Syrian border</h3>, <h3 id="local-travel--eastern-provinces">Local travel – eastern provinces</h3>, <h3 id="political-situation">Political situation</h3>,...]
#extract p tags
p3_tags = data[0].select('p')
[<p>The FCO advise against all travel to within 10 ...]
You can find that particular <div> and then under that div you can find the <p> tags and get the data like this
import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
div=soup.find("div",{"class":"govuk-govspeak direction-ltr"})
data=[]
for i in div.find_all("p"):
data.append(i.get_text().encode("ascii","ignore"))
data="\n".join(data)
now data will contain the whole content with paragraphs seperated by \n
Note: The above code will give you only the text content of paragraph heading content will not be included
if you want both heading with paragraph text then you can extract both <h3> and <p> like this
import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
div=soup.find("div",{"class":"govuk-govspeak direction-ltr"})
data=[]
for i in div:
if i.name=="h3":
data.append(i.get_text().encode("ascii","ignore")+"\n\n")
if i.name=="p":
data.append(i.get_text().encode("ascii","ignore")+"\n")
data="".join(data)
Now data will have both headings and paragraphs where headings will be seperated by \n\n and paragraphs will be seperated by \n

Categories

Resources