I have looked at similar posts, which come close to my case, but my result nonetheless seems unexpected.
import BeautifulSoup
import re

soup = BeautifulSoup.BeautifulSoup(<html page of interest>)
if soup.find_all("td", attrs={"class": "FilterElement"}, text=re.compile("HERE IS THE TEXT I AM LOOKING FOR")) is None:
    print('There was no entry')
else:
    print(soup.find("td", attrs={"class": "FilterElement"}, text=re.compile("HERE IS THE TEXT I AM LOOKING FOR")))
I have obviously redacted the actual HTML page, as well as the text in the regular expression. The rest is exactly as written. I get the following error:
Traceback (most recent call last):
File "/Users/appa/src/workspace/web_forms/WebForms/src/root/queryForms.py", line 51, in <module>
LoopThroughDays(form, id, trailer)
File "/Users/appa/src/workspace/web_forms/WebForms/src/root/queryForms.py", line 33, in LoopThroughDays
if (soup.find_all("td", attrs= {"class": "FilterElement"}, text= re.compile("HERE IS THE TEXT I AM LOOKING FOR")) is None):
TypeError: 'NoneType' object is not callable
I understand that the text will sometimes be missing. But I thought the way I set up the if statement would capture exactly the case when it is missing and the result is None.
Thanks in advance for any help!
It looks like it's just a typo: in BeautifulSoup 3 the method is soup.findAll, not soup.find_all (the underscore name was only introduced in bs4). In BeautifulSoup 3, an unknown attribute such as soup.find_all is treated as a search for a <find_all> tag and returns None, which is why calling it raises TypeError: 'NoneType' object is not callable. I tried running it, and it works with the correction. So the full program should be:
import BeautifulSoup
import re

soup = BeautifulSoup.BeautifulSoup(<html page of interest>)
if soup.findAll("td", attrs={"class": "FilterElement"}, text=re.compile("HERE IS THE TEXT I AM LOOKING FOR")) is None:
    print('There was no entry')
else:
    print(soup.find("td", attrs={"class": "FilterElement"}, text=re.compile("HERE IS THE TEXT I AM LOOKING FOR")))
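One more thing worth knowing (shown here with the modern bs4 package, since BeautifulSoup 3 is Python 2 only, and with a made-up sample document): findAll/find_all return an empty ResultSet when nothing matches, never None, so an `is None` check will never fire. Test the result's truthiness instead.

```python
from bs4 import BeautifulSoup

# Sketch using bs4 (the sample HTML is made up): a non-matching
# find_all returns an empty ResultSet, not None.
html = '<table><tr><td class="Other">no match here</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

cells = soup.find_all("td", attrs={"class": "FilterElement"})
print(cells is None)  # False -- even though nothing matched
print(len(cells))     # 0
if not cells:
    print("There was no entry")
```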
I was trying to make a Pokédex (https://replit.com/#Yoplayer1py/Gui-Pokedex) and I wanted to get each Pokémon's description from https://www.pokemon.com/us/pokedex/{__pokename__}, where pokename is the name of the Pokémon, for example: https://www.pokemon.com/us/pokedex/unown
There is a p tag that contains the description, and its class is version-xactive.
When I print the description I get nothing, or sometimes I get None.
Here's the code:
import requests
from bs4 import BeautifulSoup
# Assign URL
url = "https://www.pokemon.com/us/pokedex/"+text_id_name.get(1.0, "end-1c")
# Fetch raw HTML content
html_content = requests.get(url).text
# Now that the content is ready, iterate
# through the content using BeautifulSoup:
soup = BeautifulSoup(html_content, "html.parser")
# Find the description paragraph and print its text
print(soup.find('p', attrs={'class': 'version-xactive'}).text)
The text_id_name.get(1.0, "end-1c") is from tkinter text input.
It shows this error:
Exception in Tkinter callback
Traceback (most recent call last):
File "/usr/lib/python3.8/tkinter/__init__.py", line 1883, in __call__
return self.func(*args)
File "main.py", line 57, in load_pokemon
print(soup.find('p', attrs={'class': 'version-xactive'}).text)
AttributeError: 'NoneType' object has no attribute 'text'
Thanks in advance !!
It looks like the description element actually has multiple classes: version-x active (at least for Unown). That is why soup.find('p', attrs={'class': 'version-xactive'}) is not finding the element and returns None (hence the error you are getting).
Adding the space fixes the problem: print(soup.find('p', attrs={'class': 'version-x active'}).text). Just to note: if there are multiple p elements with the same classes, find might not return the element you want.
Adding a null check will also prevent the error from occurring:
description = soup.find('p', attrs={'class': 'version-x active'})
if description:
    print(description.text)
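If the page holds several paragraphs with those classes, find() only ever returns the first one. A minimal sketch with find_all (the sample HTML is made up; the class names are assumed from the answer above):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two description paragraphs sharing the same
# classes; find() would return only the first, find_all() returns both.
html = '''<p class="version-x active">First description</p>
<p class="version-x active">Second description</p>'''
soup = BeautifulSoup(html, "html.parser")

for p in soup.find_all('p', attrs={'class': 'version-x active'}):
    print(p.text)
```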
You should probably separate out your calls so you can do a safety check and a type check.
Replace
print(soup.find('p', attrs={'class': 'version-xactive'}).text)
with
tags = soup.find('p', attrs={'class': 'version-xactive'})
print("Tags:", tags)
if tags is not None:
    print(tags.text)
That should give you more information at least. It might still break on tags.text. If it does, put the printout from print("Tags:", tags) up so we can see what the data looks like.
I'm trying to scrape a website. I learned to scrape from two resources: one used tag.get('href') to get the href from an a tag, and one used tag['href'] to get the same. As far as I understand it, they both do the same thing. But when I tried this code:
link_list = [l.get('href') for l in soup.find_all('a')]
it worked with the .get method, but not with the dictionary access way.
link_list = [l['href'] for l in soup.find_all('a')]
This throws a KeyError. I'm very new to scraping, so please pardon if this is a silly one.
Edit: both approaches work with find instead of find_all.
You can let BeautifulSoup find only those links that actually have an href attribute.
You can do it in two common ways, via find_all():
link_list = [a['href'] for a in soup.find_all('a', href=True)]
Or, with a CSS selector:
link_list = [a['href'] for a in soup.select('a[href]')]
Maybe the HTML string does not have an href attribute?
For example:
from bs4 import BeautifulSoup
doc_html = """<a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>"""
soup = BeautifulSoup(doc_html, 'html.parser')
ahref = soup.find('a')
ahref.get('href')
This simply returns None, but
ahref['href']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sergey/.virtualenvs/soup_example/lib/python3.5/site-packages/bs4/element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'href'
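A small sketch contrasting the two access styles on the same made-up anchor as above: dictionary-style access raises KeyError for a missing attribute, while .get() returns None, or a default if you pass one.

```python
from bs4 import BeautifulSoup

doc_html = '<a class="vote-up-off" title="useful and clear">up vote</a>'
soup = BeautifulSoup(doc_html, 'html.parser')
a = soup.find('a')

print(a.get('href'))       # None -- no exception raised
print(a.get('href', '#'))  # '#' -- optional fallback value
try:
    a['href']
except KeyError as e:
    print('KeyError:', e)
```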
I am a beginner python programmer and I am trying to make a webcrawler as practice.
Currently I am facing a problem that I cannot find the right solution for. The problem is that I am trying to get a link location/address from a page that has no class, so I have no idea how to filter that specific link.
It is probably better to show you.
The page I am trying to get the link from.
As you can see, I am trying to get what is inside of the href attribute of the "Historical prices" link. Here is my python code:
import requests
from bs4 import BeautifulSoup
def find_historicalprices_link(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find_all('li', 'fjfe-nav-sub')
    href = str(link.get('href'))
    find_spreadsheet(href)

def find_spreadsheet(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find('a', {'class': 'nowrap'})
    href = str(link.get('href'))
    download_spreadsheet(href)

def download_spreadsheet(url):
    response = requests.get(url)
    text = response.text
    lines = text.split("\\n")
    filename = r'google.csv'
    file = open(filename, 'w')
    for line in lines:
        file.write(line + "\n")
    file.close()

find_historicalprices_link('https://www.google.com/finance?q=NASDAQ%3AGOOGL&ei=3lowWYGRJNSvsgGPgaywDw')
In the function "find_spreadsheet(url)", I could easily filter the link by looking for the class called "nowrap". Unfortunately, the Historical prices link does not have such a class and right now my script just gives me the following error:
AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
How do I make sure that my crawler only takes the href from the "Historical prices"?
Thank you in advance.
UPDATE:
I found the way to do it. By only looking for the link with a specific text attached to it, I could find the href I needed.
Solution:
soup.find('a', string="Historical prices")
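That one-liner can be checked against a tiny document; the HTML and URLs below are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two links; only the one whose text is exactly
# "Historical prices" should be matched by the string argument.
html = """<a href='http://example.com/overview'>Company overview</a>
<a href='http://example.com/history'>Historical prices</a>"""
soup = BeautifulSoup(html, 'html.parser')

link = soup.find('a', string="Historical prices")
print(link['href'])  # http://example.com/history
```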
Does the following code snippet help you? I think it can solve your problem:
from bs4 import BeautifulSoup
html = """<a href='http://www.google.com'>Something else</a>
<a href='http://www.yahoo.com'>Historical prices</a>"""
soup = BeautifulSoup(html, "html5lib")
urls = soup.find_all("a")
print(urls)
print([a["href"] for a in urls if a.text == "Historical prices"])
This is my first attempt at web scraping. So far I am able to navigate and find the part of the HTML I want, and I can print it as well. The problem is printing only the text, which does not work. I get the following error when trying it: AttributeError: 'ResultSet' object has no attribute 'get_text'
Here my code:
from bs4 import BeautifulSoup
import urllib
page = urllib.urlopen('some url')
soup = BeautifulSoup(page)
zeug = soup.find_all('div', attrs={'class': 'fm_linkeSpalte'}).get_text()
print zeug
find_all() returns a list of elements. You should iterate over them, select the one you need, and then call get_text() on it.
Update:
For example:
for el in soup.find_all('div', attrs={'class': 'fm_linkeSpalte'}):
    print el.get_text()
But note that you may have more than one element.
Try a for inside a list comprehension to get the data, like this:
zeug = [x.get_text() for x in soup.find_all('div', attrs={'class': 'fm_linkeSpalte'})]
Code for this:
for el in soup.findAll('div', attrs={'class': 'fm_linkeSpalte'}):
    print ''.join(el.findAll(text=True))
I am trying to get the links from a news website page(from one of its archives). I wrote the following lines of code in Python:
main.py contains :
import mechanize
from bs4 import BeautifulSoup
url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()
articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print articletext
An example of the object in tag.contents[0] :
ITC to issue 1:1 bonus
But on running it I get the following error:
File "C:\Python27\crawler\main.py", line 4, in <module>
text = articletext.getArticle(url)
File "C:\Python27\crawler\articletext.py", line 23, in getArticle
return getArticleText(htmltext)
File "C:\Python27\crawler\articletext.py", line 18, in getArticleText
articletext += tag.contents[0]
TypeError: cannot concatenate 'str' and 'Tag' objects
Can someone help me sort it out? I am new to Python programming. Thanks and regards.
You are using link_dictionary vaguely. If you are not using it for reading purposes, then try the following code:
br = mechanize.Browser()
htmltext = br.open(url).read()
articletext = ""
for tag_li in soup.findAll('li', attrs={"data-section":"Op-Ed"}):
    for link in tag_li.findAll('a'):
        urlnew = link.get('href')
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print re.sub('\s+', ' ', articletext, flags=re.M)
Note: re is for regular expressions; for this you need to import the re module.
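The re.sub call above just collapses every run of whitespace into a single space; a quick stdlib-only sketch with made-up input:

```python
import re

# Collapse any run of whitespace (spaces, tabs, newlines) into one space.
articletext = "ITC to issue\n   1:1    bonus\t shares"
print(re.sub(r'\s+', ' ', articletext))  # ITC to issue 1:1 bonus shares
```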
I believe you may want to try accessing the text inside the list item like so:
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.string
Edited: General Comments on getting links from a page
Probably the easiest data type to use to gather a bunch of links and retrieve them later is a dictionary.
To get links from a page using BeautifulSoup, you could do something like the following:
link_dictionary = {}

with urlopen(url_source) as f:
    soup = BeautifulSoup(f)
    for link in soup.findAll('a'):
        link_dictionary[link.string] = link.get('href')
This will provide you with a dictionary named link_dictionary, where every key is the text between the <a> </a> tags and every value is the value of the href attribute.
How to combine this with your previous attempt
Now, if we combine this with the problem you were having before, we could try something like the following:
link_dictionary = {}

for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    for link in tag.findAll('a'):
        link_dictionary[link.string] = link.get('href')
If this doesn't make sense, or you have a lot more questions, you will need to experiment first and try to come up with a solution before asking another new, clearer question.
You might want to use the powerful XPath query language with the faster lxml module. As simple as that:
import urllib2
from lxml import etree
url = 'http://www.thehindu.com/archive/web/2010/06/19/'
html = etree.HTML(urllib2.urlopen(url).read())
for link in html.xpath("//li[@data-section='Business']/a"):
    print '{} ({})'.format(link.text, link.attrib['href'])
Update for @data-section='Chennai'
#!/usr/bin/python
import urllib2
from lxml import etree
url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
html = etree.HTML(urllib2.urlopen(url).read())
for link in html.xpath("//li[@data-section='Chennai']/a"):
    print '{} => {}'.format(link.text, link.attrib['href'])