This is my first attempt at web scraping. So far I am able to navigate to and find the part of the HTML I want, and I can print it as well. The problem is printing only the text, which does not work. I get the following error when trying it: AttributeError: 'ResultSet' object has no attribute 'get_text'
Here is my code:
from bs4 import BeautifulSoup
import urllib
page = urllib.urlopen('some url')
soup = BeautifulSoup(page)
zeug = soup.find_all('div', attrs={'class': 'fm_linkeSpalte'}).get_text()
print zeug
find_all() returns a list of elements (a ResultSet). You should go through all of them, select the one you need, and then call get_text() on it.
Update:
For example:
for el in soup.find_all('div', attrs={'class': 'fm_linkeSpalte'}):
    print el.get_text()
But note that you may have more than one element.
Try a list comprehension to get the data, like this:
zeug = [x.get_text() for x in soup.find_all('div', attrs={'class': 'fm_linkeSpalte'})]
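If you want a single string rather than a list, you can also join the pieces (a minimal sketch; pick whatever separator suits you):
zeug = '\n'.join(x.get_text() for x in soup.find_all('div', attrs={'class': 'fm_linkeSpalte'}))
print zeug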
Alternatively, you can join all the text nodes under each element directly:
for el in soup.findAll('div', attrs={'class': 'fm_linkeSpalte'}):
    print ''.join(el.findAll(text=True))
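Note that ''.join(el.findAll(text=True)) concatenates every text node under the element, which is essentially what get_text() does, so the loop body could equally be written as:
    print el.get_text()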
from bs4 import BeautifulSoup
import requests
yt_link = "https://www.youtube.com/watch?v=bKDdT_nyP54"
response = requests.get(yt_link)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.findAll('div', {'class': 'style-scope ytd-app'})
print(title)
It prints an empty list [], and if I use the find() method it prints None instead.
Why does this happen? Please help, I am stuck here.
Yes, it is difficult to find the title because YouTube uses JavaScript and dynamic content rendering. What you can do is print the soup first and look for the title in it: you will find the title in a meta tag, which you can extract. This will probably work for any YouTube URL.
from bs4 import BeautifulSoup
import requests
yt_link = "https://www.youtube.com/watch?v=bKDdT_nyP54"
response = requests.get(yt_link)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('meta', attrs={"name": "title"})
print(title.get("content"))
output:
Akon - Smack That (Official Music Video) ft. Eminem
The find() method returns the first matching element, or None if nothing matches; the findAll() method returns a list of matching elements, or an empty list if nothing matches.
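A quick way to see the difference (a minimal sketch on an inline snippet rather than the YouTube page):
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.find("p"))        # <p>hello</p>
print(soup.find("div"))      # None
print(soup.find_all("p"))    # [<p>hello</p>]
print(soup.find_all("div"))  # []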
I'm trying to scrape the unique links from a website, but when I do, I get the following error and I'm not sure what's causing it.
ResultSet object has no attribute 'endswith'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I tried changing the URL to see if the link was the problem, and it didn't work, which didn't surprise me, but I wanted to check.
I looked at the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#miscellaneous) and, if I'm understanding it correctly, it suggests using find() instead of find_all(). I tried find() instead, but that didn't pull up anything; even if it had, it wouldn't pull up what I'm looking for, since I want all of the unique links.
Anyway, here's the code. Any ideas or places I can look to understand this error more?
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse

url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a")
for link in links:
    link.get("href")

def unique_links(tags, url):
    cleaned_links = set()
    for link in links:
        link = link.get("href")
        if link is None:
            continue
        if link.endswith('/') or links.endswith('#'):
            link = link[-1]
        actual_url = urllib.parse.urljoin(url, link)
        cleaned_links.add(actual_url)
    return cleaned_links

cleaned_links = unique_links(links, url)
There is a typo in your code: you wrote links.endswith('#') where it should be link.endswith('#'):
if link.endswith('/') or link.endswith('#'):
    link = link[:-1]
Note that link[-1] would keep only the last character; to strip the trailing '/' or '#' you want link[:-1].
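Putting it together, a corrected version of the function might look like this (a sketch, assuming the intent is to strip a trailing '/' or '#' before joining, and also fixing the unused tags parameter):
def unique_links(tags, url):
    cleaned_links = set()
    for link in tags:  # iterate over the parameter, not the global links
        href = link.get("href")
        if href is None:
            continue
        if href.endswith('/') or href.endswith('#'):
            href = href[:-1]  # drop the trailing character
        cleaned_links.add(urllib.parse.urljoin(url, href))
    return cleaned_links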
I'm fairly new to scraping with Python.
I am trying to obtain the number of search results from a query on Exalead. In this example I would like to get "586,564 results".
This is the code I am running:
from lxml import html
import requests

r = requests.get(URL, headers=headers)
tree = html.fromstring(r.text)
stats = tree.xpath('//*[@id="searchform"]/div/div/small/text()')
This returns an empty list.
I copy-pasted the XPath directly from the element's page.
As an alternative, I have tried using Beautiful soup:
html = r.text
soup = BeautifulSoup(html, 'xml')
stats = soup.find('small', {'class': 'pull-right'}).text
which returns an AttributeError: 'NoneType' object has no attribute 'text'.
When I checked the HTML source, I realised I actually cannot find the element I am looking for (the number of results) in the source.
Does anyone know why this is happening and how this can be resolved?
Thanks a lot!
When I checked the HTML source, I realised I actually cannot find the element I am looking for (the number of results) in the source.
This suggests that the data you're looking for is dynamically generated with JavaScript. For your selectors to match, you need to be able to see the element you're looking for in the HTML source.
To confirm this being the cause of your error, you could try something really simple like:
html = r.text
soup = BeautifulSoup(html, 'lxml')
*note the 'lxml' above.
And then manually check 'soup' to see if your desired element is there.
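For example, a quick membership test on the raw response will tell you whether the markup is there at all (using the class name from your selector):
html = r.text
print('pull-right' in html)  # False suggests the element is rendered client-side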
You can get that with the CSS selector combination small.pull-right, which targets the tag name and the class name of the element.
from bs4 import BeautifulSoup
import requests
url = 'https://www.exalead.com/search/web/results/?q=lead+poisoning'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
print(soup.select_one('small.pull-right').text)
I am wondering how I would open another page in my list with BeautifulSoup. I have followed this tutorial, but it does not explain how to open another page on the list. Also, how would I open an "a href" that is nested inside of a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")
for link in soup.find_all("a"):
    print link.get("href")
for link in soup.find_all("a"):
    print link.text
for link in soup.find_all("a"):
    print link.text, link.get("href")
g_data = soup.find_all("div", {"class": "listing__left-column"})
for item in g_data:
    print item.contents
for item in g_data:
    print item.contents[0].text
    print link.get('href')
for item in g_data:
    print item.contents[0]
I am trying to collect the href's from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href values, then the following approach should work based on the image you have posted:
import requests
from bs4 import BeautifulSoup
r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
Adding href=True to the find_all() ensures that only a elements that contain an href attribute are returned, which removes the need to test for the attribute separately.
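For example, this collects every href on the page in one pass:
hrefs = [a_tag['href'] for a_tag in soup.find_all('a', href=True)]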
Just to warn you, you might find some websites will lock you out after one or two attempts, as they are able to detect that you are trying to access a site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it is still what you expect.
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup
# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60  # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
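For reference, a Python 3 version of the second snippet might look like this (same assumptions about the URL and class name):
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content, "html.parser")
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print('-' * 60)
    print('href: ', a_tag['href'])
    request_href = requests.get(base_url + a_tag['href'])
    print(request_href.content)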
I had the same problem and would like to share my findings. I tried the answer above, but for some reason it did not work for me; after some research, I found something that did.
You might need to find the attributes of the href link itself. You will need the exact class which contains the href link; in your case, I am thinking it is "class": "listing__left-column". Assign the result to a variable, say all, for example:
from bs4 import BeautifulSoup
# soup here is built from the page response, as in the question
all = soup.find_all("div", {"class": "listing__left-column"})
for item in all:
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            a = link.attrs['href']
            print(a)
            print("")
Doing this, I was able to get into another link which was embedded in the home page.
from bs4 import BeautifulSoup
source_code = """
"""
soup = BeautifulSoup(source_code)
print soup.a['name'] #prints 'One'
Using BeautifulSoup, I can grab the first name attribute, which is "One", but I am not sure how I can print the second, which is "Two".
Is anyone able to help me out?
You should read the documentation. There you can see that soup.find_all returns a list, so you can iterate over it and, for each element, extract the attribute you are looking for. You should do something like this (not tested here):
from bs4 import BeautifulSoup
soup = BeautifulSoup(source_code)
for item in soup.find_all('a'):
    print item['name']
To get any a child element other than the first, use find_all. For the second a tag:
print soup.find_all('a', recursive=False)[1]['name']
To stay on the same level and avoid a deep search, pass the argument: recursive=False
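To illustrate the difference (a minimal sketch with nested markup, not from the original question):
from bs4 import BeautifulSoup
snippet = '<div><a name="One"></a><p><a name="Deep"></a></p><a name="Two"></a></div>'
soup = BeautifulSoup(snippet, 'html.parser')
div = soup.div
print(len(div.find_all('a')))                   # 3: the default deep search finds all of them
print(len(div.find_all('a', recursive=False)))  # 2: only the direct children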
This will give you all the tags of "a":
>>> from BeautifulSoup import BeautifulSoup
>>> aTags = BeautifulSoup(source_code).findAll('a')
>>> for tag in aTags: print tag["name"]
...
One
Two
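For reference, with a current version of bs4 and Python 3, the same thing would look like this (assuming the same source_code as above):
from bs4 import BeautifulSoup
soup = BeautifulSoup(source_code, 'html.parser')
for tag in soup.find_all('a'):
    print(tag["name"])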