Trying to extract some data from a webpage (scraping beginner) - python

I'm trying to extract some data from a webpage using Requests and then BeautifulSoup. I started by getting the HTML code with Requests and then "putting it" in BeautifulSoup:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXXX")
#print(result.status_code)
#print(result.headers)
src = result.content
soup = BeautifulSoup(src, 'lxml')
Then I singled out some pieces of code:
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags)
Here is a part of what I got:
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
What I want now is to extract the data after data-user-id=, which consists of the numbers between the quotes. Then I would like that data to be entered into some kind of spreadsheet.
I am an absolute beginner and I'm mostly pasting code I found elsewhere in tutorials or documentation.
Thanks a lot for your time...
EDIT:
So here's what I tried:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags['data-user-id'])
And here's what I got:
TypeError: list indices must be integers or slices, not str
So I tried that:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
#tags = soup.findAll('a',{'class':'account-group js-user-profile-link'})
tags = soup.findAll('ol',{'class':'activity-popup-users'})
tags.attrs
#print(tags['data-user-id'])
And got:
File "C:\Users\XXXX\element.py", line 1884, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'attrs'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

You can get any attribute value of a tag by treating the tag like an attribute-value dictionary.
Read the BeautifulSoup documentation on attributes.
tag['data-user-id']
For example
html="""
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
tag=soup.find('div')
print(tag['data-user-id'])
Output
3787869561
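As a rough sketch of why the two attempts in the question failed: find_all() gives back a ResultSet (a list of tags), so the attribute lookup has to happen on each tag, not on the list. The markup below is a trimmed, made-up variant of the snippet from the question (the second div without an ID is an assumption added for illustration), and tag.get() is used so that a missing attribute does not raise KeyError.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup; the second <div> (no data-user-id) is invented for illustration.
html = """
<ol class="activity-popup-users">
  <li><div class="account" data-screen-name="TheUNTimes" data-user-id="3787869561"></div></li>
  <li><div class="account" data-screen-name="NoIdHere"></div></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (a list of Tag objects), so loop over it or index it.
divs = soup.find_all("div", class_="account")

# tag.get() returns the given default when the attribute is missing,
# whereas tag["data-user-id"] would raise KeyError for the second div.
user_ids = [div.get("data-user-id", "unknown") for div in divs]
print(user_ids)  # ['3787869561', 'unknown']

# .attrs exists on a single Tag, not on the ResultSet.
first_attrs = divs[0].attrs
print(first_attrs["data-screen-name"])  # TheUNTimes
```

If only one element is expected, soup.find() returns a single Tag (or None), which avoids the list handling altogether.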
Edit to include OP's question change:
from bs4 import BeautifulSoup
import requests
result = requests.get("http://twitter.com/RussiaUN/media")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div',class_='account')
#just print
for div in divs:
    print(div['data-user-id'])
#write to a file
with open('file.txt','w') as f:
    for div in divs:
        f.write(div['data-user-id']+'\n')
Output:
255471924
2154112404
408696260
1267887043
475954041
3787869561
796979978
261711504
398068796
1174451010
...
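Since the question asks for a spreadsheet, here is a minimal sketch that writes the values to a CSV file using the standard csv module, which most spreadsheet programs can open. The file name user_ids.csv, the column names, and the second sample div are assumptions made up for illustration; inline HTML stands in for the downloaded page.

```python
import csv
from bs4 import BeautifulSoup

# Inline sample instead of a live request; the second <div> is invented sample data.
html = """
<div class="account" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561"></div>
<div class="account" data-name="Example" data-screen-name="example" data-user-id="255471924"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# One row per matching div: (user id, screen name, display name).
rows = [
    (div["data-user-id"], div["data-screen-name"], div["data-name"])
    for div in soup.find_all("div", class_="account")
]

# newline="" is what the csv module documentation recommends for files it writes.
with open("user_ids.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "screen_name", "name"])  # hypothetical header names
    writer.writerows(rows)
```

The resulting file has one header line followed by one line per account.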

Related

Can't scrape <h3> tag from page

It seems like I can scrape any tag and class except h3 on this page. It keeps returning None or an empty list. I'm trying to get this h3 tag:
...on the following webpage:
https://www.empireonline.com/movies/features/best-movies-2/
And this is the code I use:
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll(name = "h3" , class_ = "jsx-4245974604")
movies_text=[]
for item in movies:
    result = item.getText()
    movies_text.append(result)
print(movies_text)
Can you please help with the solution for this problem?
As other people mentioned, this is dynamic content, which has to be generated first when the webpage is opened and run. Therefore you can't find the class "jsx-4245974604" with BS4.
If you print out your "soup" variable, you can see that the class is not there. But if you simply want to get the names of the movies, you can use another part of the HTML in this case.
The movie name is in the alt attribute of the picture (and also in many other parts of the HTML).
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll("img", class_="jsx-952983560")
movies_text=[]
for item in movies:
    result = item.get('alt')
    movies_text.append(result)
print(movies_text)
If you run into this issue in the future, remember to just print out the initial html you can get with soup and just check by eye if the information you need can be found.
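Checking by eye can also be done in code. The following is only a sketch: the helper name count_matches and the sample markup (including the two movie titles) are made up for illustration, and the string web_html stands in for response.text. A count of zero suggests the element is generated by JavaScript and is not in the HTML that requests downloaded.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; this markup and its titles are invented for illustration.
web_html = """
<div id="root">
  <img class="jsx-952983560" alt="The Godfather" src="x.jpg">
  <img class="jsx-952983560" alt="Goodfellas" src="y.jpg">
</div>
"""


def count_matches(html, name, css_class):
    """Return how many <name> tags with the given class exist in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(name, class_=css_class))


# The h3 class from the question is absent in this sample, the img class is present.
h3_count = count_matches(web_html, "h3", "jsx-4245974604")
img_count = count_matches(web_html, "img", "jsx-952983560")
print(h3_count, img_count)  # 0 2
```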

How to avoid AttributeError: ResultSet object has no attribute 'text' in BeautifulSoup?

I want to scrape the title attributes of all a tags in the New Texts section of this website:
I tried to do it this way:
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import requests
url = 'https://en.wikisource.org/wiki/Main_Page'
r = requests.get(url)
Soup = BeautifulSoup(r.text, "html5lib")
List = Soup.find("div",class_="enws-mainpage-widget-content").find_all('a')
for ebook in List:
    print(List.get('title'))
When I run this I get this error:
File "C:\Users\Özdal\AppData\Local\Programs\Python\Python38-32\lib\site-packages\bs4\element.py", line 2173, in __getattr__
raise AttributeError(
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
What happens?
In your for loop, you try to get the title from the List object, not from each ebook. That is why the error occurred.
Change your printing line to print(ebook.get('title')) and you will get the results.
To get only the titles from the New texts section, you have to be more specific; otherwise you grab every a tag, including the author links and so on.
One way to fix this is, for example, .select("b > i > a").
Example:
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import requests
url = 'https://en.wikisource.org/wiki/Main_Page'
r = requests.get(url)
Soup = BeautifulSoup(r.text, "html5lib")
List = Soup.find("div",{"id":"enws-mainpage-newtexts-content"}).select("b > i > a")
for ebook in List:
    print(ebook.get('title'))
Output
The Center of the Web
Bobby Bumps Starts a Lodge
May (Mácha)
Animal Life and the World of Nature/1903/06/Notes and Comments
The Czechoslovak Review/Volume 2/No Compromise
She's All the World to Me
Their One Love
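To see what the child combinator changes without hitting the live page, here is a small sketch on made-up markup. The id is the one used in the answer, but the list items, titles, and author links are invented, so the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics a "title in bold italics, then an author link" layout.
html = """
<div id="enws-mainpage-newtexts-content">
  <ul>
    <li><b><i><a href="/wiki/A" title="Text A">Text A</a></i></b> by <a href="/wiki/Author:X" title="Author:X">X</a></li>
    <li><b><i><a href="/wiki/B" title="Text B">Text B</a></i></b> by <a href="/wiki/Author:Y" title="Author:Y">Y</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("div", {"id": "enws-mainpage-newtexts-content"})

# find_all("a") grabs every link in the section, author links included.
all_titles = [a.get("title") for a in section.find_all("a")]

# "b > i > a" keeps only links that are direct children of <i> inside <b>.
text_titles = [a.get("title") for a in section.select("b > i > a")]

print(all_titles)   # ['Text A', 'Author:X', 'Text B', 'Author:Y']
print(text_titles)  # ['Text A', 'Text B']
```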

Get value of span tag using BeautifulSoup

I have a number of facebook groups that I would like to get the count of the members of. An example would be this group: https://www.facebook.com/groups/347805588637627/
I have looked at inspect element on the page and it is stored like so:
<span id="count_text">9,413 members</span>
I am trying to get "9,413 members" out of the page. I have tried using BeautifulSoup but cannot work it out.
Thanks
Edit:
from bs4 import BeautifulSoup
import requests
url = "https://www.facebook.com/groups/347805588637627/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
span = soup.find("span", id="count_text")
print(span.text)
In case there is more than one span tag in the page:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_input, 'html.parser')
span = soup.find("span", id="count_text")
span.text
You can use the text attribute of the parsed span:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<span id="count_text">9,413 members</span>', 'html.parser')
>>> soup.span
<span id="count_text">9,413 members</span>
>>> soup.span.text
'9,413 members'
If you have more than one span tag you can try this
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
for tag in tags:
    print(tag.contents[0])
Facebook uses JavaScript to prevent bots from scraping. You need to use Selenium to extract the data in Python.
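If the span is not in the HTML that requests receives, soup.find() returns None and span.text raises AttributeError. A small guard makes that case visible. This is only a sketch: the helper name member_count_text and the js_only_html sample are made up for illustration, and no request is sent.

```python
from bs4 import BeautifulSoup


def member_count_text(html):
    """Return the text of <span id="count_text">, or None if the span is absent."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id="count_text")
    if span is None:
        # Nothing to read: the element is probably added later by JavaScript.
        return None
    return span.get_text(strip=True)


# The span from the question, and an invented page where it is missing.
static_html = '<span id="count_text">9,413 members</span>'
js_only_html = '<div id="root"></div><script>/* content rendered later */</script>'

print(member_count_text(static_html))   # 9,413 members
print(member_count_text(js_only_html))  # None
```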

python beautifulsoup can't prettify

I seem to be doing something wrong. I have an HTML source that I pull using urllib. Based on this HTML file I use beautifulsoup to findAll elements with an ID based on a specified array. This works for me, however the output is messy and includes linebreaks "\n".
Python: 2.7.12
BeautifulSoup: bs4
I have tried to use prettify() to correct the output but always get an error:
AttributeError: 'ResultSet' object has no attribute 'prettify'
import urllib
import re
from bs4 import BeautifulSoup
cfile = open("test.txt")
clist = cfile.read()
clist = clist.split('\n')
i=0
while i < len(clist):
    url = "https://example.com/"+clist[i]
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    soup = BeautifulSoup(htmltext, "html.parser")
    soup = soup.findAll(id=["id1", "id2", "id3"])
    print soup.prettify()
    i+=1
I'm sure there is something simple I am overlooking with this line:
soup = soup.findAll (id=["id1", "id2", "id3"])
I'm just not sure what. Sorry if this is a stupid question. I've only been using Python and Beautiful Soup for a few days.
You are reassigning the soup variable to the result of .findAll(), which is a ResultSet object (basically, a list of tags) which does not have the prettify() method.
The solution is to keep the soup variable pointing to the BeautifulSoup instance.
You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects.
findAll returns a list of matching tags, so your code is equivalent to [tag1, tag2, ...].prettify(), which will not work.
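A minimal sketch of the fix, written for Python 3 (the question uses 2.7) and using inline HTML in place of the downloaded page. The sample markup is made up; the ids are the placeholders from the question. The search result gets its own name so that soup keeps pointing at the BeautifulSoup object, and prettify() is called on each tag in turn.

```python
from bs4 import BeautifulSoup

# Invented sample markup using the placeholder ids from the question.
htmltext = '<div id="id1"><p>one</p></div><div id="id2"><p>two</p></div><div id="other">x</div>'

soup = BeautifulSoup(htmltext, "html.parser")

# Keep the ResultSet under a separate name instead of overwriting soup.
matches = soup.find_all(id=["id1", "id2", "id3"])

# prettify() exists on each Tag, not on the list of tags.
for tag in matches:
    print(tag.prettify())

# If only the text is wanted, get_text(strip=True) drops surrounding whitespace.
texts = [tag.get_text(strip=True) for tag in matches]
print(texts)  # ['one', 'two']
```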

How find specific data attribute from html tag in BeautifulSoup4?

Is there a way to find an element using only the data attribute in html, and then grab that value?
For example, with this line inside an html doc:
<ul data-bin="Sdafdo39">
How do I retrieve Sdafdo39 by searching the entire html doc for the element that has the data-bin attribute?
A little bit more accurate
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]
This way, the iterated list only contains the ul elements that have the attribute you want to find.
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc, 'html.parser')
[item['data-bin'] for item in bs.find_all('ul', attrs={'data-bin' : True})]
You can use the find_all method to get all the tags, and then filter on whether "data-bin" is among each tag's attributes to find the tag that has it. Then you can simply extract the corresponding value, like this:
from bs4 import BeautifulSoup
html_doc = """<ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc, 'html.parser')
print([item["data-bin"] for item in bs.find_all() if "data-bin" in item.attrs])
# ['Sdafdo39']
You could solve this with gazpacho in just a couple of lines:
First, import and turn the html into a Soup object:
from gazpacho import Soup
html = """<ul data-bin="Sdafdo39">"""
soup = Soup(html)
Then you can just search for the "ul" tag and extract the data-bin attribute:
soup.find("ul").attrs["data-bin"]
# Sdafdo39
As an alternative if one prefers to use CSS selectors via select() instead of find_all():
from bs4 import BeautifulSoup
html_doc = """<ul class="foo">foo</ul><ul data-bin="Sdafdo39">"""
soup = BeautifulSoup(html_doc)
# Select
soup.select('ul[data-bin]')
