I am trying to parse a website for all links that have the attribute nofollow.
I want to print that list, one link per line.
However, I failed to append the results of findAll() to my list box (my attempt is in the commented-out lines).
What did I do wrong?
import sys
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen(sys.argv[1]).read()
soup = BeautifulSoup(page)
soup.prettify()
box = []
for anchor in soup.findAll('a', href=True, attrs={'rel': 'nofollow'}):
    # box.extend(anchor['href'])
    print anchor['href']
# print box
You are looping over the results of soup.findAll, so each anchor is a single tag, not a list. extend() iterates over its argument, so extending with a string adds it character by character; use .append() to add one element at a time:
box.append(anchor['href'])
You could also use a list comprehension to grab all href attributes:
box = [a['href'] for a in soup.findAll('a', href=True, attrs={'rel': 'nofollow'})]
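To see why the original attempt produced stray characters, here is a minimal standalone sketch (no network needed; the href value is made up):

```python
# extend() iterates over its argument, so a string gets split into characters
box = []
box.extend('/page1')
print(box)   # ['/', 'p', 'a', 'g', 'e', '1']

# append() adds its argument as a single element
box = []
box.append('/page1')
box.append('/page2')
print(box)   # ['/page1', '/page2']
```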
When I parse for more than one class I get an error on the find_all line (it appears when I change find to find_all).
Error: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element.
import requests
from bs4 import BeautifulSoup
heroes_page_list=[]
url = f'https://dota2.fandom.com/wiki/Dota_2_Wiki'
q = requests.get(url)
result = q.content
soup = BeautifulSoup(result, 'lxml')
heroes = soup.find_all('div', class_='heroentry').find('a')
for hero in heroes:
    hero_url = heroes.get('href')
    heroes_page_list.append("https://dota2.fandom.com" + hero_url)

# print(heroes_page_list)

with open('heroes_page_list.txt', "w") as file:
    for line in heroes_page_list:
        file.write(f'{line}\n')
You are calling find on a list of div tags; you need to call it on each element, like this:
heroes = soup.find_all('div', class_= 'heroentry')
a_tags = [hero.find('a') for hero in heroes]
for a_tag in a_tags:
    hero_url = a_tag.get('href')
    heroes_page_list.append("https://dota2.fandom.com" + hero_url)
heroes_page_list then looks like this:
['https://dota2.fandom.com/wiki/Abaddon',
'https://dota2.fandom.com/wiki/Alchemist',
'https://dota2.fandom.com/wiki/Axe',
'https://dota2.fandom.com/wiki/Beastmaster',
'https://dota2.fandom.com/wiki/Brewmaster',
'https://dota2.fandom.com/wiki/Bristleback',
'https://dota2.fandom.com/wiki/Centaur_Warrunner',
....
The error message tells you exactly what is wrong.
The find() method is only usable on a single element, while find_all() returns a list of elements; you are trying to apply find() to that list.
If you want to apply find('a'), you should do something similar to this:
heroes = soup.find_all('div', class_='heroentry')
for hero in heroes:
    hero_a_tag = hero.find('a')
    hero_url = hero_a_tag.get('href')
    heroes_page_list.append("https://dota2.fandom.com" + hero_url)
You basically have to apply the find() method to every element present in the list generated by the find_all() method.
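To illustrate the ResultSet-vs-Tag distinction without hitting the network, here is a sketch on a hypothetical snippet of markup shaped like the question's page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure in the question
html = """
<div class="heroentry"><a href="/wiki/Abaddon">Abaddon</a></div>
<div class="heroentry"><a href="/wiki/Axe">Axe</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

divs = soup.find_all('div', class_='heroentry')   # ResultSet (list-like)
# divs.find('a') would raise AttributeError; call find() on each element instead
urls = [div.find('a').get('href') for div in divs]
print(urls)   # ['/wiki/Abaddon', '/wiki/Axe']
```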
import requests
import re
from bs4 import BeautifulSoup
respond = requests.get("http://www.kulugyminiszterium.hu/dtwebe/Irodak.aspx")
print(respond)
soup = BeautifulSoup(respond.text, 'html.parser')
for link in soup.find_all('a'):
    links = link.get('href')
    linki_bloc = ('http://www.kulugyminiszterium.hu/dtwebe/' + links).replace(' ', '%20')
    print(linki_bloc)
    value = linki_bloc
    print(value.split())
I am trying to use the results of find_all('a') as a list, but the only thing that succeeds for me is the last link.
It seems to me that the problem is the newline characters separating the links; I tried many ways to get rid of the newline character but failed. Saving to a file (e.g. .txt) also fails, saving only the last link.
Close to your goal, but you overwrite the result with each iteration. Simply collect your manipulated links in a list, either directly with a list comprehension:
['http://www.kulugyminiszterium.hu/dtwebe/'+link.get('href').replace(' ', '%20' ) for link in soup.find_all('a')]
or as in your example:
links = []
for link in soup.find_all('a'):
    links.append('http://www.kulugyminiszterium.hu/dtwebe/' + link.get('href').replace(' ', '%20'))
Example
import requests
from bs4 import BeautifulSoup
respond = requests.get("http://www.kulugyminiszterium.hu/dtwebe/Irodak.aspx")
soup = BeautifulSoup(respond.text, 'html.parser')
links = []
for link in soup.find_all('a'):
    links.append('http://www.kulugyminiszterium.hu/dtwebe/' + link.get('href').replace(' ', '%20'))
print(links)
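As a side note, replace(' ', '%20') only handles spaces; the standard library's urllib.parse.quote escapes other unsafe characters as well. A small sketch with a made-up href value:

```python
from urllib.parse import quote

href = 'Iroda lista.aspx'   # hypothetical href containing a space
url = 'http://www.kulugyminiszterium.hu/dtwebe/' + quote(href)
print(url)  # http://www.kulugyminiszterium.hu/dtwebe/Iroda%20lista.aspx
```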
I am attempting to pull line and over/under data for games from ESPN. To do this I need to pull a list item underneath a div tag. I can successfully get the over/under data because it's clear to me what the tag is, but the list item for the line doesn't seem to have a clear tag. Essentially, I want to pull out "Line: IOWA -3.5" from this specific URL.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = BeautifulSoup(page.text, 'html.parser')
# Get over/under
game_ou = soup.find('li', class_='ou')
game_ou2 = game_ou.contents[0]
game_ou3 = game_ou2.strip()
#Get Line
game_line = soup.find('div',class_='odds-details')
print(game_line)
Add in the parent class (with a descendant combinator and an li type selector); then you can either retrieve both li elements in a list and index in, or just use select_one to retrieve the first:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = bs(r.content, 'lxml')
lis = [i.text.strip() for i in soup.select('.odds-details li')]
print(lis[0])
print(lis[1])
print(soup.select_one('.odds-details li').text)
Use find('li') after finding the div element:
from bs4 import BeautifulSoup
page = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find("div",class_="odds-details").find('li').text)
print(soup.find("div",class_="odds-details").find('li',class_='ou').text.strip())
Output:
Line: IOWA -3.5
Over/Under: 47
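The same chaining can be tried offline on a mock of the markup (a hypothetical structure; the real page layout may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the odds section described above
html = """
<div class="odds-details">
  <ul>
    <li>Line: IOWA -3.5</li>
    <li class="ou">Over/Under: 47</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.select_one('.odds-details li').text)                              # Line: IOWA -3.5
print(soup.find('div', class_='odds-details').find('li', class_='ou').text)  # Over/Under: 47
```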
I'm trying to extract a link under "a href="link"...".
As there are multiple rows, I iterate over every one of them. The first link per row is the one I need, so I use find_all('tr') and find('a').
I know find('a') can return None, but I do not know how to work around this.
I had a piece of code that worked but was inefficient (in the comments).
import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen('https://morocco.observation.org/soortenlijst_wg_v3.php')
soup = bs.BeautifulSoup(sauce, 'lxml')
tabel = soup.find('table', {'class': 'tablesorter'})

for i in tabel.find_all('tr'):
    # if 'view' in i.get('href'):
    #     link_list.append(i.get('href'))
    link = i.find('a')
    # <a class="z1" href="/soort/view/164?from=1987-12-05&to=2019-05-31">Common Reed Bunting - <em>Emberiza schoeniclus</em></a>
How do I retrieve the link under href, working around the None results, to get only /soort/view/164?from=1987-12-05&to=2019-05-31?
Thanks in advance
A logical way is to use nth-of-type to isolate the target column:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://morocco.observation.org/soortenlijst_wg_v3.php')
soup = bs(r.content, 'lxml')
base = 'https://morocco.observation.org'
urls = [base + item['href'] for item in soup.select('#mytable_S td:nth-of-type(3) a')]
You could also pass a list of classes
urls = [base + item['href'] for item in soup.select('.z1, .z2,.z3,.z4')]
Or even use starts with, ^, operator for class
urls = [base + item['href'] for item in soup.select('[class^=z]')]
Or contains, *, operator for href
urls = [base + item['href'] for item in soup.select('[href*=view]')]
Read about different css selector methods here: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
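A quick way to experiment with these selector variants is a small inline sample (hypothetical markup mirroring the class names above):

```python
from bs4 import BeautifulSoup

# Hypothetical row shaped like the table in the question
html = """
<table id="mytable_S"><tr>
  <td>1</td><td>x</td>
  <td><a class="z1" href="/soort/view/164">Reed Bunting</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')
base = 'https://morocco.observation.org'

# Third-column anchors, class starts-with, and href contains all match here
print([base + a['href'] for a in soup.select('#mytable_S td:nth-of-type(3) a')])
print([base + a['href'] for a in soup.select('[class^=z]')])
print([base + a['href'] for a in soup.select('[href*=view]')])
```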
link = i.find('a')
_href = link['href']
print(_href)
O/P:
"/soort/view/164?from=1987-12-05&to=2019-05-31?"
This is not a complete URL; you should concatenate it with the domain name:
new_url = "https://morocco.observation.org"+_href
print(new_url)
O/p:
https://morocco.observation.org/soort/view/164?from=1987-12-05&to=2019-05-31?
Update:
from bs4 import BeautifulSoup
from bs4.element import Tag
import requests
resp = requests.get("https://morocco.observation.org/soortenlijst_wg_v3.php")
soup = BeautifulSoup(resp.text, 'lxml')
tabel = soup.find('table', {'class': 'tablesorter'})
base_url = "https://morocco.observation.org"
for i in tabel.find_all('tr'):
    link = i.find('a', href=True)
    if link is None or not isinstance(link, Tag):
        continue
    url = base_url + link['href']
    print(url)
O/P:
https://morocco.observation.org/soort/view/248?from=1975-05-05&to=2019-06-01
https://morocco.observation.org/soort/view/174?from=1989-12-15&to=2019-06-01
https://morocco.observation.org/soort/view/57?from=1975-05-05&to=2019-06-01
https://morocco.observation.org/soort/view/19278?from=1975-05-13&to=2019-06-01
https://morocco.observation.org/soort/view/56?from=1993-03-25&to=2019-06-01
https://morocco.observation.org/soort/view/1504?from=1979-05-25&to=2019-06-01
https://morocco.observation.org/soort/view/78394?from=1975-05-09&to=2019-06-01
https://morocco.observation.org/soort/view/164?from=1987-12-05&to=2019-06-01
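Rather than plain string concatenation, the standard library's urllib.parse.urljoin handles joining a base URL with a relative href; a small sketch using an href like those above:

```python
from urllib.parse import urljoin

base = 'https://morocco.observation.org'
href = '/soort/view/164?from=1987-12-05&to=2019-05-31'
print(urljoin(base, href))
# https://morocco.observation.org/soort/view/164?from=1987-12-05&to=2019-05-31
```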
I have a list of divs, and I'm trying to get certain info in each of them. The div classes are the same so I'm not sure how I would go about this.
I have tried for loops but have been getting various errors
Code to get list of divs:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://sneakernews.com/release-dates/'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "lxml")
soup1 = soup.find("div", {'class': 'popular-releases-block'})
soup1 = str(soup1.find("div", {'class': 'row'}))
soup1 = soup1.split('</div>')
print(soup1)
Code I want to loop for each item in the soup1 list:
linkinfo = soup1.find('a')['href']
date = str(soup1.find('span'))
name = soup1.find('a')
non_decimal = re.compile(r'[^\d.]+')
date = non_decimal.sub('', date)
name = str(name)
name = re.sub('</a>', '', name)
link, name = name.split('>')
link = re.sub('<a href="', '', link)
link = re.sub('"', '', link)
name = name.split(' ')
name = str(name[-1])
date = str(date)
link = str(link)
print(link)
print(name)
print(date)
Based on the URL you posted above, I imagine you are interested in something like this:
import requests
from bs4 import BeautifulSoup
url = requests.get('https://sneakernews.com/release-dates/').text
soup = BeautifulSoup(url, 'html.parser')
tags = soup.find_all('div', {'class': 'col lg-2 sm-3 popular-releases-box'})
for tag in tags:
    link = tag.find('a').get('href')
    print(link)
    print(tag.text)
    # Anything else you want to do
If you are using the BeautifulSoup library, then you do not need regex to try to parse through HTML tags. Instead, use the handy methods that accompany BeautifulSoup. If you would like to apply a regex to the text output from the tags you locate via BeautifulSoup to accomplish a more specific task, then that would be reasonable.
My understanding is that you want to loop your code for each item within a list.
An example of this:
my_list = ["John", "Fred", "Tom"]
for name in my_list:
    print(name)
This will loop over each name in my_list and print out each item (referred to here as name). You could do something similar with your code:
for item in soup1:
    # perform some action
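Putting that together: instead of converting the soup to a string and splitting on '</div>', you can keep the Tag objects and loop over them directly. A sketch on hypothetical markup shaped like the release boxes:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the release boxes in the question
html = """
<div class="popular-releases-block"><div class="row">
  <div class="col lg-2 sm-3 popular-releases-box">
    <a href="https://example.com/shoe-1"><span>11/23</span> Shoe One</a>
  </div>
  <div class="col lg-2 sm-3 popular-releases-box">
    <a href="https://example.com/shoe-2"><span>11/24</span> Shoe Two</a>
  </div>
</div></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Each box stays a Tag, so link, date, and name come from its accessors
for box in soup.find_all('div', class_='popular-releases-box'):
    a = box.find('a')
    print(a['href'], a.find('span').text, a.text)
```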