BeautifulSoup find all occurrences of specific text - python

I will be analyzing a lot of sites with different HTML and I am trying to find all lines that contain specific text (inside the HTML) using BeautifulSoup.
import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
for text in soup.find_all():
    if "price" in text:
        print(text)
This approach doesn't work (even though "price" appears more than 40 times in the HTML). Is there perhaps a better approach?

Why not let BeautifulSoup find the nodes containing the desired text for you:
for node in soup.find_all(text=lambda x: x and "price" in x):
    print(node)
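Each match returned this way is a NavigableString, so you can climb to the enclosing tag via .parent. A minimal self-contained sketch (the HTML snippet here is made up; string= is the newer alias for the text= argument):

```python
from bs4 import BeautifulSoup

html = '<div><span>Total price: $10</span><p>no match here</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Each matching node is a NavigableString; .parent is the tag that contains it.
for node in soup.find_all(string=lambda x: x and "price" in x):
    print(node.parent.name, "->", node)
```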

With bs4 4.7.1+ you can use the :contains pseudo-class with the universal selector * to consider all elements. Expect some repeats, as parents may contain children with the same text. Here I search for "price".
import requests
from bs4 import BeautifulSoup
url = 'https://www.visitsealife.com/brighton/tickets/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
items = soup.select('*:contains(price)')
print(items)
print(len(items))
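For reference, the same idea works offline too; with recent soupsieve versions (the selector engine bundled with bs4), the non-standard :contains alias is spelled :-soup-contains. A sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

html = '<div><p>Ticket price: $20</p><p>Opening hours</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Every element whose text contains "price" matches -- parents included.
items = soup.select('*:-soup-contains("price")')
print([item.name for item in items])
```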

To extract all text from a given URL, you could just use something like:
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
for element in soup.findAll(['script', 'style']):
    element.extract()
text = soup.get_text()
This will also remove possibly unwanted text inside script and style sections. You could then search for your required text using that.
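For instance, a sketch of searching the stripped text line by line (the HTML here is made up):

```python
from bs4 import BeautifulSoup

html = """<html><head><style>.price { color: red }</style></head>
<body><script>var price = 1;</script><p>Ticket price: $20</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Drop script/style so their source code doesn't pollute the text search.
for element in soup.find_all(['script', 'style']):
    element.extract()

text = soup.get_text()
print([line for line in text.splitlines() if "price" in line])
```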

You don't have to use BeautifulSoup to find specific text in the HTML; you can search the response from requests directly. For example:
r = requests.get(url)
if 'specific text' in r.text:
    print(r.text)
Note that r.content is bytes in Python 3, so search r.text (a str) instead.

Related

Removing duplicate links from scraper I'm making

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re
url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
Down at the bottom, where it prints the links, I can't think of a way to remove duplicate entries. Can someone help me with that, please?
Use a set to remove duplicates. You call add() to add an item and if the item is already present then it won't be added again.
Try this:
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re
url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)
soup = BeautifulSoup(html, "html.parser")
urls = set()
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")}):
    urls.add(link.get('href'))
print(urls)  # urls contains a unique set of URLs
Note that some URLs might start with http://, so you may want to use the regexp ^https?:// to catch both http and https URLs.
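A quick sketch of that pattern on made-up markup:

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="http://a.example">a</a>'
        '<a href="https://b.example">b</a>'
        '<a href="/relative">c</a>')
soup = BeautifulSoup(html, "html.parser")

# ^https?:// matches absolute http and https URLs, but not relative links.
urls = {a['href'] for a in soup.find_all('a', href=re.compile(r"^https?://"))}
print(sorted(urls))
```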
You can also use set comprehension syntax to rewrite the assignment and for statements like this.
urls = {
    link.get("href")
    for link in soup.find_all("a", attrs={"href": re.compile(r"^https://")})
}
Instead of printing each link, you need to collect them somewhere so you can compare them.
Try this:
find_all returns a list with all results; turn it into a set.
data = set(link.get('href') for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")}))
for elem in data:
    print(elem)

KeyError when trying to access ['style'] using Beautiful Soup

I'm trying to access the style of a DIV element on a page using Beautiful Soup 4 but I keep getting a key error. I know the styles are definitely there because I can inspect them using the inspector in the browser and I can see styles for the DIV with class "header large border first". (see the attached image)
Here is my code:
url = 'https://www.themoviedb.org/movie/595743-sas-red-notice'
response = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
header_image_style = soup.find("div", class_="header large border first")['style']
I'm not sure what I'm doing wrong; can anyone help?
Unfortunately, BeautifulSoup does not parse the contents of style tags or linked style sheets, so it is difficult to retrieve that value: we need to handle parsing the CSS ourselves.
The value we are looking for is contained within the document's style tag, so we can get the contents of the style tag and parse it for ourselves to get the value. Here's a working example:
from bs4 import BeautifulSoup
import cssutils
import requests
url = 'https://www.themoviedb.org/movie/595743-sas-red-notice'
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, 'html.parser')
# get the style tag contents
style_str = soup.find("style").text
# parse the tag's contents
rules = cssutils.parseString(style_str)
# find the first rule that applies to "div.header.large.first"
rule = next(filter(lambda x: x.selectorText == "div.header.large.first", rules))
# get the backgroundImage property
background_property = rule.style.backgroundImage
# Cut out the start of the text that says "url(" and ")"
img_url = background_property[4:-1]
print(img_url)
You will need to pip install cssutils in order for this example to work.

how to remove the starting and ending tags using python Beautiful soup

I'm having difficulty stripping the starting and ending tags from a JSON URL. I've used Beautiful Soup, and the only problem I'm facing is that I'm getting <pre> tags in my response. Please advise how I can remove the starting and ending tags. The code chunk I'm using is here:
page = Page("link to json")
soup = bs.BeautifulSoup(page.html, "html.parser")
#fetching the response i want from the url it's inside pre tags.
json = soup.find("pre")
print(json)
Thanks to Demian Wolf, the solution is something like this:
page = Page("link to json")
soup = bs.BeautifulSoup(page.html, "html.parser")
#fetching the response i want from the url it's inside pre tags.
json = soup.find("pre")
print(json.text)
You may use soup.text to remove all the tags:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<pre>Hello, world!</pre>", "html.parser")
print(soup.find("pre").text)
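Since the payload inside the <pre> tag is JSON, you will usually want to parse it rather than just print it; a sketch with made-up data:

```python
import json
from bs4 import BeautifulSoup

html = '<pre>{"name": "example", "count": 3}</pre>'
soup = BeautifulSoup(html, "html.parser")

# .text strips the <pre> tags; json.loads turns the payload into a dict.
data = json.loads(soup.find("pre").text)
print(data["count"])
```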

How can I get the text from this specific div class?

I want to extract the text here
a lot of text
I used
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class":"bbcode bbcode--profile-page"})
but it always returns None in the terminal.
How can I go about this?
Link is "https://osu.ppy.sh/users/1521445"
(This is a repost since the old question was super old. I don't know if I should've made another question or not but aa)
Data is dynamically loaded from a script tag so, as in the other answer, you can grab it from there. Target the tag by its id, pull out the relevant JSON, extract the HTML from that JSON, then parse that HTML, which would have been loaded dynamically on the page (at this point you can use your original class selector).
import requests, json, pprint
from bs4 import BeautifulSoup as bs
r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
all_data = json.loads(soup.select_one('#json-user').text)
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))
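The same unwrap-the-embedded-HTML pattern on a self-contained snippet (the markup here is invented to mirror the page's structure):

```python
import json
from bs4 import BeautifulSoup

html = ('<script id="json-user">'
        '{"page": {"html": "<div class=\\"bbcode bbcode--profile-page\\">a lot of text</div>"}}'
        '</script>')
soup = BeautifulSoup(html, "html.parser")

# 1) grab the script by id, 2) parse its JSON, 3) parse the HTML embedded in the JSON
data = json.loads(soup.select_one('#json-user').text)
inner = BeautifulSoup(data['page']['html'], "html.parser")
print(inner.select_one('.bbcode--profile-page').get_text())
```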
You could try this:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://osu.ppy.sh/users/1521445'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
x = soup.findAll("script", {"id": re.compile(r"json-user")})
result = re.findall(r'raw\":(.+)},\"previous_usernames', x[0].text.strip())
print(result)
I'm not sure why the div with class='bbcode bbcode--profile-page' is a string inside the script tag with id='json-user'; that's why you can't get its value by searching for the div directly.
Hope this could help

Beautiful Soup finding href based on hyperlink Text

I'm having an issue trying to get Beautiful Soup to find an a tag with specific hyperlink text and extract its href only.
I have the code below, but I can't seem to make it return just the href (whatever is between the opening " and closing ") based on the hyperlink text of that a tag.
res = requests.get(website_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
temp_tag_href = soup.select_one("a[href*=some text]")
sometexthrefonly = temp_tag_href.attrs['href']
Effectively, I would like it to go through the entire HTML parsed in soup and only return what is between the href's opening " and closing " when that hyperlink's text is 'some text'.
So the steps would be:
1: parse the HTML,
2: look at all the a href tags,
3: find the href whose hyperlink text is 'some text',
4: output only what is in between the href's " " (not including the quotes) for that href.
Any help will greatly be appreciated!
ahmed,
So after some quick refreshers on requests and researching the BeautifulSoup library, I think you'll want something like the following:
res = requests.get(website_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
link = list(filter(lambda x: x.get_text() == 'some text', soup.find_all('a')))[0]
print(link['href']) # since you don't specify output to where, I'll use stdout for simplicity
As it turns out, per the Beautiful Soup documentation, there is a convenient way to access whatever attributes you want from an HTML element using dictionary lookup syntax. You can also do all kinds of lookups using this library.
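That dictionary-style attribute access looks like this in isolation; .get() is the safe variant that returns None instead of raising KeyError (markup made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/docs" title="Docs">read the docs</a>', "html.parser")
tag = soup.find('a')

print(tag['href'])        # dictionary lookup -- raises KeyError if the attribute is missing
print(tag.get('style'))   # .get() returns None for a missing attribute
print(tag.attrs)          # the full attribute dictionary
```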
If you are doing web scraping, it may also be useful to try switching to a library that supports XPath, which allows you to write powerful queries such as //a[@href="some text"][1], which will get you the first link whose href equals "some text".
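If you go that route, lxml supports XPath directly; a sketch matching a link by its visible text (markup invented):

```python
from lxml import html

tree = html.fromstring(
    '<p><a href="/first">some text</a><a href="/second">other text</a></p>'
)

# First <a> whose text is exactly "some text"; /@href pulls out the attribute value.
hrefs = tree.xpath('(//a[text()="some text"])[1]/@href')
print(hrefs)
```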
this should do the work:
from bs4 import BeautifulSoup
html = '''<a href="some_text">next</a>
<div>later</div>
<h3>later</h3>'''
soup = BeautifulSoup(html, 'html.parser')
# iterate all hrefs
for a in soup.find_all('a', href=True):
    print("Next HREF: %s" % a['href'])
    if a['href'] == 'some_text':
        print("Found it!")
