I'm trying to scrape the unique links from a website, but when I do, I get the following error and I'm not sure what's causing it.
ResultSet object has no attribute 'endswith'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I tried changing the URL to see if that was the problem, and it didn't work, which didn't surprise me, but I wanted to check.
I looked at the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#miscellaneous) and, if I'm understanding it correctly, it says to use find() instead of find_all(). I tried using find() instead, but that didn't pull up anything, and even if it did, it wouldn't give me what I'm looking for, since I want all the unique links.
Anyway, here's the code. Any ideas or places I can look to understand this error more?
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse

url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a")
for link in links:
    link.get("href")

def unique_links(tags, url):
    cleaned_links = set()
    for link in links:
        link = link.get("href")
        if link is None:
            continue
        if link.endswith('/') or links.endswith('#'):
            link = link[-1]
        actual_url = urllib.parse.urljoin(url, link)
        cleaned_links.add(actual_url)
    return cleaned_links

cleaned_links = unique_links(links, url)
There is a typo in your code: links.endswith('#') should be link.endswith('#'). links is the ResultSet returned by find_all(), and a ResultSet has no string methods, which is exactly what the error message is telling you:

if link.endswith('/') or link.endswith('#'):
    link = link[:-1]

Note that you also want link[:-1] (everything up to the last character) rather than link[-1] (just the last character) if the intent is to strip the trailing '/' or '#'.
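For reference, a minimal corrected sketch of the whole function, assuming the goal is to strip a trailing '/' or '#' before resolving each link against the page URL (it also uses its tags parameter instead of reaching for the outer links variable):

import urllib.parse

def unique_links(tags, url):
    cleaned_links = set()
    for tag in tags:
        href = tag.get("href")
        if href is None:
            continue
        # strip a single trailing '/' or '#' so near-duplicates collapse
        if href.endswith('/') or href.endswith('#'):
            href = href[:-1]
        # resolve relative links against the page URL
        cleaned_links.add(urllib.parse.urljoin(url, href))
    return cleaned_links

Called as unique_links(soup.find_all("a"), url), this returns the set of unique absolute URLs.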
I am working on scraping links from a Christmas tree farm website. First, I used this tutorial method to get all the links. Then I noticed that the links I wanted did not lead with the proper hypertext transfer protocol, so I created a variable to concatenate onto them. Now I am trying to create an if statement that grabs each link and looks for any two characters followed by "xmastrees.php". If that matches, my concatenate variable is prepended to the front of it; if the link does not contain that text, it is deleted. For example, NYxmastrees.php should become http://www.pickyourownchristmastree.org/NYxmastrees.php, and ../disclaimer.htm should be removed. I've tried multiple ways, but can't seem to find the right one.
Here is what I currently have and keep running into a syntax error: del. I commented out that line and get another error saying my string object has no attribute 're'. This confuses me because I though i could use regex with strings??
import requests
from bs4 import BeautifulSoup

source = requests.get('http://www.pickyourownchristmastree.org/').text
soup = BeautifulSoup(source, 'lxml')
concatenate = 'http://www.pickyourownchristmastree.org/'

find_state_group = soup.find('div', class_='alert')
for link in find_state_group.find_all('a', href=True):
    if link['href'].re.search('^.\B.\$xmastrees'):
        states = concatenate + link
    else del link['href']
    print(link['href'])
Error with else del link['href']:
else del link['href']
^
SyntaxError: invalid syntax
Error without else del link['href']:
if link['href'].re.search('^.\B.\$xmastrees'):
AttributeError: 'str' object has no attribute 're'
You can try using:

import requests
from bs4 import BeautifulSoup as bs

u = "http://www.pickyourownchristmastree.org/"
soup = bs(requests.get(u).text, 'html5lib')
find_state_group = soup.find('div', {"class": 'alert'})
for link in find_state_group.find_all('a', href=True):
    if "mastrees" in link['href']:
        states = u + link['href']
        print(states)
http://www.pickyourownchristmastree.org/ALxmastrees.php
http://www.pickyourownchristmastree.org/AZxmastrees.php
http://www.pickyourownchristmastree.org/AKxmastrees.php
...
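If you do want the regex the question was aiming for (any two characters followed by "xmastrees.php"), here is a sketch using the re module on the href string. Note that re.search is a module-level function, not a string method, which is what caused the AttributeError:

import re
import requests
from bs4 import BeautifulSoup as bs

u = "http://www.pickyourownchristmastree.org/"
soup = bs(requests.get(u).text, 'html5lib')
find_state_group = soup.find('div', {"class": 'alert'})
for link in find_state_group.find_all('a', href=True):
    # two characters, then the literal 'xmastrees.php'
    if re.search(r'^..xmastrees\.php$', link['href']):
        print(u + link['href'])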
I am trying to scrape news articles using Beautiful Soup. However, it only works for some articles on the website and not for others. I can't find any apparent differences in the source code so I would be very grateful for any ideas on how to solve this.
For example, this is fine:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.dn.se/nyheter/sverige/ewa-stenberg-darfor-ligger-sverige-steget-efter/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
content = soup.find('div', class_='article__body')
body = content.text
print(body)
But changing the url to:
result = requests.get("https://www.dn.se/nyheter/sverige/Regeringen-vill-att-skolor-ska-fa-satta-betyg-i-arskurs-4")
produces the following error:
AttributeError: 'NoneType' object has no attribute 'text'
In this example there is nothing wrong with the scraping itself: the second URL responds with a 301 (Moved Permanently) redirect, which means you get a new URL in the response. With requests you need to take a couple of steps to follow the redirect and work with the page you actually end up on.
See this answer https://stackoverflow.com/a/50606372/10201813 for how to solve it, or read http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history for more information.
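As a quick diagnostic, here is a minimal sketch that prints the redirect chain and the final URL (requests follows redirects on GET by default, and response.history records the intermediate hops):

import requests

result = requests.get("https://www.dn.se/nyheter/sverige/Regeringen-vill-att-skolor-ska-fa-satta-betyg-i-arskurs-4")

for hop in result.history:
    print(hop.status_code, hop.url)    # e.g. 301 and the original URL
print(result.status_code, result.url)  # the URL you actually ended up on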
I'm trying to print the href attributes of the links on the page below.
Here's my first attempt.
# the Python 3 version:
from bs4 import BeautifulSoup
import urllib.request

resp = urllib.request.urlopen("https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])
When I run that, I get this.
/feed/
/feed/
/feed/
/mynetwork/
/jobs/
/messaging/
/notifications/
#
Here's my second attempt.
# and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="https"]') ]
print(links)
When I run that, I get this.
['https://static-exp1.licdn.com/sc/h/n7m1fekt1d9hawp3s7wats11', 'https://static-exp1.licdn.com/sc/h/al2o9zrvru7aqj8e1x2rzsrca', 'https://static-exp1.licdn.com/sc/h/2if24wp7oqlodqdlgei1n1520', 'https://static-exp1.licdn.com/sc/h/eahiplrwoq61f4uan012ia17i', 'https://static-exp1.licdn.com/sc/h/2if24wp7oqlodqdlgei1n1520', 'https://static-exp1.licdn.com/sc/h/eahiplrwoq61f4uan012ia17i', 'https://static-exp1.licdn.com/sc/h/c7y7qgvm2uh1zn8pgl84l3rty', 'https://static-exp1.licdn.com/sc/h/auhsc2hi2zkvt7nbqep2ejauv', 'https://static-exp1.licdn.com/sc/h/9vf4mi871c6wolrcm3pgqywes', 'https://static-exp1.licdn.com/sc/h/7z1536jzhgep1sw5uk19e8ec7', 'https://static-exp1.licdn.com/sc/h/a0on5mxqtufmy9y66neg9mdgy', 'https://static-exp1.licdn.com/sc/h/1edhu1lemiqjsbgubat2dejxr', 'https://static-exp1.licdn.com/sc/h/2gdon0pq1074su3zwdop1y2g1']
I was expecting to see something like this:
https://www.linkedin.com/in/timlmorgan/
https://www.linkedin.com/in/timmorgan3/
https://www.linkedin.com/in/tim-morgan-19543731/
etc., etc., etc.
I guess LinkedIn must be doing something special, which I'm not aware of. When I run the same code against 'https://www.nytimes.com/', I get the results that I would expect. This is just a learning exercise. I'm curious to know what's going on here. I'm not interested in actually scanning LinkedIn for data.
LinkedIn loads its data asynchronously. If you actually view the source (Ctrl + U on Windows) of the URL you're fetching, you won't find your expected results, because JavaScript fetches them after the page has already loaded with the base information.
BeautifulSoup won't execute the JavaScript on the page that fetches that data.
To solve this, you would have to figure out the internal API endpoints and have your script call those directly, e.g.:
https://www.linkedin.com/voyager/api/search/filters?filters=List()&keywords=tim%20morgan&q=universalAll&queryContext=List(primaryHitType-%3EPEOPLE)
...while also adjusting your call to pass the CSRF check. Or you could use their official API.
I tested some Selenium code which seems to do the trick.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox(executable_path=r'C:\files\geckodriver.exe')
driver.set_page_load_timeout(30)

driver.get("https://www.google.com/")
driver.get("https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER")

continue_link = driver.find_element_by_tag_name('a')
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
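Since the results arrive asynchronously, it can also help to wait explicitly for the links to appear before reading them. A sketch continuing with the driver from the snippet above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 10 seconds for at least one anchor with an href to be present
wait = WebDriverWait(driver, 10)
elems = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//a[@href]")))
for elem in elems:
    print(elem.get_attribute("href"))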
I'm fairly new to scraping with Python.
I am trying to obtain the number of search results from a query on Exalead. In this example I would like to get "586,564 results".
This is the code I am running:
import requests
from lxml import html

r = requests.get(URL, headers=headers)
tree = html.fromstring(r.text)
stats = tree.xpath('//*[@id="searchform"]/div/div/small/text()')
This returns an empty list.
I copy-pasted the xPath directly from the elements' page.
As an alternative, I have tried using Beautiful Soup:
html = r.text
soup = BeautifulSoup(html, 'xml')
stats = soup.find('small', {'class': 'pull-right'}).text
which returns an AttributeError: 'NoneType' object has no attribute 'text'.
When I checked the html source I realised I actually cannot find the element I am looking for (the number of results) on the source.
Does anyone know why this is happening and how this can be resolved?
Thanks a lot!
"When I checked the html source I realised I actually cannot find the element I am looking for (the number of results) on the source."
This suggests that the data you're looking for is dynamically generated with JavaScript. You'll need to be able to see the element you're looking for in the HTML source.
To confirm this being the cause of your error, you could try something really simple like:
html = r.text
soup = BeautifulSoup(html, 'lxml')
*note the 'lxml' above.
And then manually check 'soup' to see if your desired element is there.
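For example, here is a minimal sanity check along those lines, assuming the same request as in the question:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.exalead.com/search/web/results/?q=lead+poisoning')

# quick check: does the class name appear in the raw HTML at all?
print('pull-right' in r.text)

soup = BeautifulSoup(r.text, 'lxml')
# None here means the element is not in the static source
print(soup.find('small', {'class': 'pull-right'}))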
I can get that with the CSS selector combination small.pull-right, which targets the tag name and the class name of the element.
from bs4 import BeautifulSoup
import requests
url = 'https://www.exalead.com/search/web/results/?q=lead+poisoning'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
print(soup.select_one('small.pull-right').text)
This is my first piece of work with web scraping. So far I am able to navigate and find the part of the HTML I want, and I can print it as well. The problem is printing only the text, which does not work. I get the following error when trying it: AttributeError: 'ResultSet' object has no attribute 'get_text'
Here is my code:
from bs4 import BeautifulSoup
import urllib
page = urllib.urlopen('some url')
soup = BeautifulSoup(page)
zeug = soup.find_all('div', attrs={'class': 'fm_linkeSpalte'}).get_text()
print zeug
find_all() returns a list of elements. You should go through all of them, select the one you need, and then call get_text() on it.
Update:
For example:
for el in soup.find_all('div', attrs={'class': 'fm_linkeSpalte'}):
    print el.get_text()
But note that you may have more than one element.
Try a list comprehension to collect the text from each element, like this:
zeug = [x.get_text() for x in soup.find_all('div', attrs={'class': 'fm_linkeSpalte'})]
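If you then want the result as a single block of text, a minimal follow-up (joining the list with newlines):

# zeug is a list of strings, one per matching div
print '\n'.join(zeug)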
I would close this question as a duplicate and link you to another I found that answers it, but I don't think I have the reputation needed to moderate... So...
Original Answer
Code for this:
for el in soup.findAll('div', attrs={'class': 'fm_linkeSpalte'}):
    print ''.join(el.findAll(text=True))
If a mod wants to close this question that would be helpful.