from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import requests

# Random User-Agent so the request is less likely to be blocked.
user = UserAgent()
headers = {
    'user-agent': user.random
}
url = 'https://www.wildberries.ru/?utm_source=domain&utm_campaign=wilberes.ru'


def main():
    """Fetch the Wildberries start page and append every burger-menu
    link href to link.txt, one URL per line."""
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, 'lxml')
    # Renamed from `main`/`all`: the originals shadowed this function's
    # own name and the `all` builtin.
    menu = soup.find('div', class_='menu-burger__main')
    menu_list = menu.find('ul', class_='menu-burger__main-list')
    # find_all (descendants of this <ul>) instead of find_all_next, which
    # scans *everything after* the <ul> in document order and would also
    # match list items belonging to later menus on the page.
    items = menu_list.find_all('li', class_='menu-burger__main-list-item')
    # Context manager guarantees the file is closed even if a lookup raises.
    with open('link.txt', 'a') as f:
        for item in items:
            link = item.find('a').get('href')
            f.write(link + '\n')


if __name__ == '__main__':
    main()
I'm trying to parse a link to a section and its name. I managed to get the link, but how can I get the name if it is not in the tag?
Using the string property seems to be the correct method based on BS4 documentation:
myLink = soup.find('a')
myLinkText = str(myLink.string)
The idea of using str() is to convert the text to a regular Python string, since .string returns a BS4 NavigableString object. You may find you don't really have to do this, but it is safer, and it is also worth stripping whitespace from the string so that you don't get stray newlines or padding in the result: str(myLink.string).strip().
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring
In your code, you are actually getting the href, so take note that in my code above I am getting the anchor tag itself, not just its href attribute.
Related
I am trying to parse some information from a website but ran into a little problem: the information I need won't print out and just shows [] when I need the values (3, for example, from the source code provided). I would need some help to get it working. Hope someone here can help me out and assist in the matter.
Best of regards.
import re
import requests
from bs4 import BeautifulSoup

# Page to scrape and the (hashed) CSS class of the target buttons.
page_url = "https://www.webpage.com"

response = requests.get(page_url)
page_html = response.text
soup = BeautifulSoup(page_html, 'lxml')
# print(soup.prettify())

# Regex match so the class is found even alongside other class names.
class_pattern = re.compile('c76a6')
buttons = soup.find_all('button', attrs={'class': class_pattern})
print(buttons)
source: <button class="c76a6" type="button" data-test-name="valueButton"><span class="_5a5c0" data-test-name="value">3</span></button>
Because find_all() returns a list, you have to index into it or loop through the matches, and that takes time. If you know that the target is unique, use find() to get just the first match; in that case you can then use the .text attribute to get only the value.
import re
import requests
from bs4 import BeautifulSoup

url_to_parse = "https://www.webpage.com"
response = requests.get(url_to_parse)
response_content = response.content
soup = BeautifulSoup(response_content, 'lxml')
# print(soup.prettify())

# Regex lets find() match the class even when the element carries
# additional class names.
regex = re.compile('c76a6')
content_list = soup.find('button', {'class': regex})
# find() returns None when nothing matches; guard before dereferencing
# .text to avoid an AttributeError on a miss.
if content_list is not None:
    print(content_list.text)
else:
    print('no matching <button> found')
In this code I think I made a mistake or something, because I'm not getting the correct JSON when I print it — in fact I get nothing. When I index the script I do get the JSON, but using .text nothing appears; I want the JSON alone.
CODE :
# Fetch an Instagram profile page and print its JSON-LD <script> payload.
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import requests
import selenium.webdriver as webdriver  # NOTE(review): imported but never used
base_url = 'https://www.instagram.com/{}'
search = input('Enter the instagram account: ')
# quote_plus URL-encodes whatever the user typed.
final_url = base_url.format(quote_plus(search))
response = requests.get(final_url)
print(response.status_code)
if response.ok:
html = response.text
# NOTE(review): no parser argument -- BeautifulSoup will warn and pick one;
# pass e.g. 'html.parser' explicitly for reproducible parsing.
bs_html = BeautifulSoup(html)
scripts = bs_html.select('script[type="application/ld+json"]')
# NOTE(review): for <script> contents .string is the documented accessor;
# .text can come back empty here -- this is the bug being asked about.
print(scripts[0].text)
Change the line print(scripts[0].text) to print(scripts[0].string).
scripts[0] is a Beautiful Soup Tag object, and its string contents can be accessed through the .string property.
Source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
If you want to then turn the string into a json so that you can access the data, you can do something like this:
...
if response.ok:
html = response.text
# NOTE(review): no parser argument -- BeautifulSoup will warn and guess.
bs_html = BeautifulSoup(html)
scripts = bs_html.select('script[type="application/ld+json"]')
# Parse the JSON-LD payload; requires `import json` (not shown in this snippet).
json_output = json.loads(scripts[0].string)
Then, for example, if you run print(json_output['name']) you should be able to access the name on the account.
findall() returns empty list when specifying class
Specifying tags work fine
# Python 2 code: urllib2 was split into urllib.request/urllib.error in Python 3.
import urllib2
from bs4 import BeautifulSoup
url = "https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week"
# Custom User-Agent so the request is not served the default-UA block page.
hdr = { 'User-Agent' : 'tempro' }
req = urllib2.Request(url, headers=hdr)
htmlpage = urllib2.urlopen(req).read()
BeautifulSoupFormat = BeautifulSoup(htmlpage,'lxml')
# findAll is the legacy BS3-style alias of find_all.
# NOTE(review): per the answers below, the served page has no
# <a class="title"> elements, which is why this list comes back empty.
name_box = BeautifulSoupFormat.findAll("a",{'class':'title'})
for data in name_box:
print(data.text)
I'm trying to get only the text of the post. The current code prints out nothing. If I remove the {'class':'title'} it prints out the post text as well as username and comments of the post which I don't want.
I'm using python2 with the latest versions of BeautifulSoup and urllib2
To get all the comments you are going to need a method like selenium which will allow you to scroll. Without that, just to get initial results, you can grab from a script tag in the requests response
import requests
from bs4 import BeautifulSoup as bs
import re
import json

headers = {'User-Agent' : 'Mozilla/5.0'}
r = requests.get('https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week', headers = headers)
soup = bs(r.content, 'lxml')
# Reddit embeds its initial page state as JSON inside the <script id="data"> tag.
script = soup.select_one('#data').text
# Dots are escaped: the original pattern's bare '.' matched *any* character,
# so it could in principle match more than the literal 'window.___r'.
p = re.compile(r'window\.___r = (.*); window')
data = json.loads(p.findall(script)[0])
# Iterate the models mapping's values directly instead of re-indexing
# data['posts']['models'][item] on every pass.
for post in data['posts']['models'].values():
    print(post['title'])
The selector you try to use is not good, because you do not have a class = "title" for those posts. Please try this below:
name_box = BeautifulSoupFormat.select('a[data-click-id="body"] > h2')
this finds all the <a data-click-id="body"> where you have <h2> tag that contain the post text you need
More about selectors using BeatufulSoup you can read here:
(https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)
I want a python script that opens a link and print the email address from that page.
E.g
Go to some site like example.com
Search for email in that.
Search in all the pages in that link.
I was tried below code
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.digitalseo.in/')
data = r.text
# NOTE(review): no parser argument -- BeautifulSoup will warn and guess.
soup = BeautifulSoup(data)
# NOTE(review): find_all() filters on tag *names*, not CSS selectors, so
# find_all('#') matches nothing -- this is the bug the answer explains.
for rate in soup.find_all('#'):
print rate.text  # Python 2 print statement
I take this website for reference.
Anyone help me to get this?
Because find_all() will only search Tags. From document:
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
So you need add a keyword argument like this:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.digitalseo.in/')
data = r.text
soup = BeautifulSoup(data, "html.parser")
# A keyword argument acts as an attribute filter: match every tag whose
# href attribute contains "mailto", i.e. e-mail links.
for i in soup.find_all(href=re.compile("mailto")):
print i.string  # Python 2 print statement
Demo:
contact#digitalseo.in
contact#digitalseo.in
From document:
Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
If you pass in a value for href, Beautiful Soup will filter against each tag's 'href' attribute:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
You can see the document for more info: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
And if you'd like find the email address from a document, regex is a good choice.
For example:
import re
re.findall( '[^#]+#[^#]+\.[^#]+ ', text) # remember change `text` variable
And if you'd like find a link in a page by keyword, just use .get like this:
# Python 2 script (raw_input, print statement).
import re
import requests
from bs4 import BeautifulSoup
def get_link_by_keyword(keyword):
# Yields de-duplicated hrefs containing `keyword`; reads the module-level
# `soup` and `link` globals defined by the script body below.
links = set()
# NOTE(review): "[http|/]" is a character *class* (matches one of
# h,t,p,| or /), not the alternation "http or /" -- verify the intent.
for i in soup.find_all(href=re.compile(r"[http|/].*"+str(keyword))):
links.add(i.get('href'))
for i in links:
if i[0] == 'h':
# Absolute URL: yield as-is.
yield i
elif i[0] == '/':
# Site-relative URL: prefix with the base link entered by the user.
yield link+i
else:
pass
global link
link = raw_input('Please enter a link: ')
# Strip a trailing slash so relative paths concatenate cleanly above.
if link[-1] == '/':
link = link[:-1]
r = requests.get(link, verify=True)
data = r.text
soup = BeautifulSoup(data, "html.parser")
for i in get_link_by_keyword(raw_input('Enter a keyword: ')):
print i
I am trying to use Python and Beautifulsoup to get this page from sfglobe website: http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore.
This is the code:
# Python 2 code: urllib2 became urllib.request in Python 3.
import urllib2
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = urllib2.urlopen(url)
html = req.read()
# NOTE(review): no parser argument -- BeautifulSoup will warn and guess.
soup = BeautifulSoup(html)
# find() returns None when no <span class="articletext intro"> exists.
desc = soup.find('span', class_='articletext intro')
Could anyone help me to solve this problem?
From the question title, I assume that the only thing you want is the description of the article, which can be found in the <meta> tag within the HTML <head>.
You were on the right track, but I'm not exactly sure why you did:
desc = soup.find('span', class_='articletext intro')
Regardless, I came up with something using requests (see http://stackoverflow.com/questions/2018026/should-i-use-urllib-or-urllib2-or-requests) rather than urllib2
import requests
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltim\
ore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html)
tag = soup.find(attrs={'name':'description'}) # find meta tag w/ description
desc = tag['value'] # get value of attribute 'value'
print desc
If that isn't what you are looking for, please clarify so I can try and help you more.
EDIT: after some clarification, I pieced together why you were originally using desc = soup.find('span', class_='articletext intro').
Maybe this is what you are looking for:
# Extract the plain text of the article body, skipping <script> content.
import requests
from bs4 import BeautifulSoup, NavigableString
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
# NOTE(review): no parser argument -- BeautifulSoup will warn and guess.
soup = BeautifulSoup(html)
body = soup.find('span', class_='articletext intro')
# remove script tags (extract() detaches them from the tree in place)
[s.extract() for s in body('script')]
text = ""
# iterate through non-script elements in the content body
for stuff in body.select('*'):
# get contents of tags, .contents returns a list
content = stuff.contents
# check if the list has the text content a.k.a. isn't empty AND is a NavigableString, not a tag
if len(content) == 1 and isinstance(content[0], NavigableString):
text += content[0]
print text  # Python 2 print statement