python beautifulsoup can't prettify - python

I seem to be doing something wrong. I have an HTML source that I pull using urllib. Based on this HTML file I use beautifulsoup to findAll elements with an ID based on a specified array. This works for me, however the output is messy and includes linebreaks "\n".
Python: 2.7.12
BeautifulSoup: bs4
I have tried to use prettify() to correct the output but always get an error:
AttributeError: 'ResultSet' object has no attribute 'prettify'
import urllib
import re
from bs4 import BeautifulSoup
cfile = open("test.txt")
clist = cfile.read()
clist = clist.split('\n')
i=0
while i<len (clist):
url = "https://example.com/"+clist[i]
htmlfile = urllib.urlopen (url)
htmltext = htmlfile.read()
soup = BeautifulSoup (htmltext, "html.parser")
soup = soup.findAll (id=["id1", "id2", "id3"])
print soup.prettify()
i+=1
I'm sure there is something simple I am overlooking with this line:
soup = soup.findAll (id=["id1", "id2", "id3"])
I'm just not sure what. Sorry if this is a stupid question. I've only been using Python and Beautiful Soup for a few days.

You are reassigning the soup variable to the result of .findAll(), which is a ResultSet object (basically, a list of tags) which does not have the prettify() method.
The solution is to keep the soup variable pointing to the BeautifulSoup instance.

You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects:
findAll return a list of match tags, so your code equal to [tag1,tag2..].prettify()
and it will not work.

Related

How to select a tags and scrape href value?

I am having trouble getting hyperlinks for tennis matches listed on a webpage, how do I go about fixing the code below so that it can obtain it through print?
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02")
webpage = response.content
soup = BeautifulSoup(webpage, "html.parser")
print(soup.findAll('a href'))
In newer code avoid old syntax findAll() instead use find_all() or select() with css selectors - For more take a minute to check docs
Select your elements more specific and use set comprehension to avoid duplicates:
set('https://www.betexplorer.com'+a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)'))
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02')
soup = BeautifulSoup(r.text)
set('https://www.betexplorer.com'+a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)'))
Output
{'https://www.betexplorer.com/tennis/itf-men-singles/m15-new-delhi-2/sinha-nitin-kumar-vardhan-vishnu/tOasQaJm/',
'https://www.betexplorer.com/tennis/itf-women-doubles/w25-jerusalem/mushika-mao-mushika-mio-cohen-sapir-nagornaia-sofiia/xbNOHTEH/',
'https://www.betexplorer.com/tennis/itf-men-singles/m25-jakarta-2/barki-nathan-anthony-sun-fajing/zy2r8bp0/',
'https://www.betexplorer.com/tennis/itf-women-singles/w15-solarino/margherita-marcon-abbagnato-anastasia/lpq2YX4d/',
'https://www.betexplorer.com/tennis/itf-women-singles/w60-sydney/lee-ya-hsuan-namigata-junri/CEQrNPIG/',
'https://www.betexplorer.com/tennis/itf-men-doubles/m15-sharm-elsheikh-16/echeverria-john-marrero-curbelo-ivan-ianin-nikita-jasper-lai/nsGbyqiT/',...}
Change the last line to
print([a['href'] for a in soup.findAll('a')])
See a full tutorial here: https://pythonprogramminglanguage.com/get-links-from-webpage/

Python HTML scraping cannot find attribute I know exists?

I am using the lxml and requests modules, and just trying to parse the article from a website. I tried using find_all from BeautifulSoup but still came up empty
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[#class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 and the class name in css select_one
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[#class="article"]//text()')
you get a list and still get all the \n but also the text which I think you can handle with re.sub or conditional logic.

Trying to extract some data from a webpage (scraping beginner)

I'm trying to extract some data from a webpage using Requests and then Beautifulsoup. I started by getting the html code with Requests and then "putting it" in Beautifulsoup:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXXX")
#print(result.status_code)
#print(result.headers)
src = result.content
soup = BeautifulSoup(src, 'lxml')
Then I singled out some pieces of code:
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags)
Here is a part of what I got:
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
What I want now is to extract the data after data-user-id=which consists of numbers between "". Then I would like that data to be entered into some kind of calc sheet.
I am an absolute beginner and I'm postly pasting code I found elsewhere on tutorials or documentation.
Thanks a lot for your time...
EDIT:
So here's what I tried:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags['data-user-id'])
And here's what I got:
TypeError: list indices must be integers or slices, not str
So I tried that:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content soup = BeautifulSoup(src, 'html.parser')
#tags = soup.findAll('a',{'class':'account-group js-user-profile-link'})
tags = soup.findAll('ol',{'class':'activity-popup-users'})
tags.attrs
#print(tags['data-user-id'])
And got:
File "C:\Users\XXXX\element.py", line 1884, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'attrs'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
You can get any attribute value of a tag by treating the tag like an attribute-value dictionary.
Read the BeautifulSoup documentation on attributes.
tag['data-user-id']
For example
html="""
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
tag=soup.find('div')
print(tag['data-user-id'])
Output
3787869561
Edit to include OP's question change:
from bs4 import BeautifulSoup
import requests
result = requests.get("http://twitter.com/RussiaUN/media")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div',class_='account')
#just print
for div in divs:
print(div['data-user-id'])
#write to a file
with open('file.txt','w') as f:
for div in divs:
f.write(div['data-user-id']+'\n')
Output:
255471924
2154112404
408696260
1267887043
475954041
3787869561
796979978
261711504
398068796
1174451010
...

Getting 'name' attributes with Beautiful Soup

from bs4 import BeautifulSoup
source_code = """
"""
soup = BeautifulSoup(source_code)
print soup.a['name'] #prints 'One'
Using BeautifulSoup, i can grab the first name attribute which is one, but i am not sure how i can print the second, which is Two
Anyone able to help me out?
You should read the documentation. There you can see that soup.find_all returns a list
so you can iterate over the list and, for each element, extract the tag you are looking for. So you should do something like (not tested here):
from bs4 import BeautifulSoup
soup = BeautifulSoup(source_code)
for item in soup.find_all('a'):
print item['name']
To get any a child element other than the first, use find_all. For the second a tag:
print soup.find_all('a', recursive=False)[1]['name']
To stay on the same level and avoid a deep search, pass the argument: recursive=False
This will give you all the tags of "a":
>>> from BeautifulSoup import BeautifulSoup
>>> aTags = BeautifulSoup(source_code).findAll('a')
>>> for tag in aTags: print tag["name"]
...
One
Two

BeautifulSoup 'AttributeError'

I'm getting "AttributeError: 'NoneType' object has no attribute 'string'" when I run the following. however, when the same tasks are performed on a block string variable; it works.
Any Ideas as to what I'm doing wrong?
from BeautifulSoup import BeautifulSoup
from urllib import urlopen
url = ("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&explaintext")
print ((BeautifulSoup(((urlopen(url)).read()))).find('extract').string).split("\n", 1)[0]
from BeautifulSoup import BeautifulSoup
from urllib import urlopen
url = ("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&explaintext")
soup = BeautifulSoup(urlopen(url).read())
print soup.find('extract') # returns None
The find method is not finding anything with the tag 'extract'. If you want to see it work then give it a HTML tag that exists in the document like 'pre' or 'html'
'extract' looks like an xml tag. You might want to try reading the BeautifulSoup documentation on parsing XML - http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML. Also there is a new version of BeautifulSoup out there (bs4). I find the API much nicer.

Categories

Resources