String after <div class> not visible when scraping with BeautifulSoup - Python

I'm scraping news article. Here is the link.
I want to get the "13" string inside the comment__counter total_comment_share class. That string is visible in the browser's inspect-element view, and you can verify it yourself at the link above. But when I call find() and print the result, the string is missing, so I can't scrape it. This is my code:
import requests
from bs4 import BeautifulSoup

a = 'https://tekno.kompas.com/read/2020/11/12/08030087/youtube-down-pagi-ini-tidak-bisa-memutar-video'
b = requests.get(a)
c = b.content
d = BeautifulSoup(c, 'html.parser')
e = d.find('div', {'class': 'social--inline eee'})
f = d.find('div', {'class': 'comment__read__text'})
print(f)
In the code I call find() on the comment__read__text class to show that I can find the surrounding elements, just not the "13" string. The result is the same if I call find() on the comment__counter total_comment_share class. This is the output of the code above:
<div class="comment__read__text">
Komentar <div class="comment__counter total_comment_share"></div>
</div>
As you can see the "13" string is not there. Anyone knows why?
Any help would be appreciated.

It's because the comment count is fetched by a separate request made while the page loads, i.e. the page renders that content dynamically with JavaScript. Try this out:
import requests

a = 'https://tekno.kompas.com/read/2020/11/12/08030087/youtube-down-pagi-ini-tidak-bisa-memutar-video'
# Query the comment API directly instead of parsing the article HTML.
b = requests.get('https://apis.kompas.com/api/comment/list?urlpage={}&json&limit=1'.format(a))
c = b.json()
f = c["result"]["total"]  # the comment count rendered as "13" on the page
print(f)
PS: if you're interested in scraping all the comments, just change limit to 100000, which will get you all the comments in one request as JSON. A sketch of that idea follows below.
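For illustration, a minimal sketch (only result.total is confirmed by the answer above; the key holding the comment bodies is not shown, so the sketch prints the available keys rather than guessing one):

import requests

article = 'https://tekno.kompas.com/read/2020/11/12/08030087/youtube-down-pagi-ini-tidak-bisa-memutar-video'
api = 'https://apis.kompas.com/api/comment/list?urlpage={}&json&limit=100000'.format(article)

data = requests.get(api).json()
print(data["result"]["total"])  # confirmed key: the total comment count

# The exact key holding the comment list is not confirmed; inspect the
# structure first, then drill in.
print(data["result"].keys())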

Related

Extracting information from website with BeautifulSoup and Python

I'm attempting to extract information from this website. I can't get the text in the three fields marked in the image (in green, blue, and red rectangles), no matter how hard I try.
Using the following function, I thought I would manage to get all of the text on the page, but it didn't work:
from bs4 import BeautifulSoup
import requests

def get_text_from_maagarim_page(url: str):
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, "html.parser")
    res = soup.find_all(class_="tooltippedWord")
    text = [el.getText() for el in res]
    return text

url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"
print(get_text_from_maagarim_page(url))  # >> empty list
I attempted to use the Chrome inspection tool and the exact reference provided here, but I couldn't figure out how to use that data hierarchy to extract the desired data.
I would love to hear if you have any suggestions on how to access this data.
Update and more details
As far as I can tell from the structure of the above-mentioned webpage, the element I'm looking for is in the following structure location:
<form name="aspnetForm" ...>
...
<div id="wrapper">
...
<div class="content">
...
<div class="mainContentArea">
...
<div id="mainSearchPannel" class="mainSearchContent">
...
<div class="searchPanes">
...
<div class="wordsSearchPane" style="display: block;">
...
<div id="searchResultsAreaWord"
class="searchResultsContainer">
...
<div id="srPanes">
...
<div id="srPane-2" class="resRefPane"
style>
...
<div style="height:600px;overflow:auto">
...
<ul class="esResultList">
...
# HERE IS THE TARGET ITEMS
The relevant items look like this (screenshot omitted), and the relevant data is in <td id ... >.
The content you want is not present in the web page that Beautiful Soup loads. It is fetched in separate HTTP requests made when a web browser runs the JavaScript code present in said page. Beautiful Soup does not run JavaScript.
You may try to figure out which HTTP request responded with the required data using the "Network" tab in your browser's developer tools. If that turns out to be a predictable HTTP request, you can recreate it in Python directly and then use Beautiful Soup to pick out the useful parts. @Martin Evans's answer (https://stackoverflow.com/a/72090358/1921546) uses this approach.
Or, you may use methods that involve remote-controlling a web browser from Python. This lets a web browser load the page, after which you can access the DOM in Python to get what you want from the rendered page. Other answers, like Scraping javascript-generated data using Python and scrape html generated by javascript with python, can point you in that direction.
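As an illustration of that second route, here is a minimal Selenium sketch, not something from the original answer: the class name tooltippedWord comes from the question, and the driver setup and 30-second timeout are assumptions to adapt:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://maagarim.hebrew-academy.org.il/Pages/PMain.aspx?koderekh=1484&page=1"

driver = webdriver.Chrome()  # assumes a Chrome/chromedriver setup
try:
    driver.get(url)
    # Wait until the JavaScript has inserted the target elements.
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CLASS_NAME, "tooltippedWord"))
    )
    words = [el.text for el in driver.find_elements(By.CLASS_NAME, "tooltippedWord")]
    print(words)
finally:
    driver.quit()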
Exactly what tag/class are you trying to scrape from the webpage? When I copied and ran your code, I included this line to check for the class name in the page's HTML, but did not find any:
print("tooltippedWord" in requests.get(url).text) #False
I can say that it's generally easier to use the attrs kwarg when using find_all or findAll.
res = soup.findAll(attrs={"class":"tooltippedWord"})
There is less confusion overall when typing it out. As for possible approaches, one would be to look at the page in Chrome (or another browser) using the dev tools to search for some non-random class or id tags like esResultListItem.
From there, if you know which tag you are looking for, you can include it in the search like so:
res = soup.findAll("div",attrs={"class":"tooltippedWord"})
It's definitely easier if you know which tag you are looking for, as well as whether there are any class names or ids included in the tag:
<span id="somespecialname" class="verySpecialName"></span>
If you're still looking for help, I can check by tomorrow; it is nearly 1:00 AM CST where I live and I still need to finish my CS assignments. It's just a lot easier to help you if you can provide more examples (pictures, tags, etc.) so we know how best to explain the process to you.
It is a bit difficult to understand what the text is, but what you are looking for is returned from a separate request made by the browser. The parameters used will hopefully make some sense to you.
This request returns JSON data which contains a d entry holding the HTML that you are looking for.
The following shows a possible approach to extract data near to what you are looking for:
import requests
from bs4 import BeautifulSoup

post_json = {"tabNum":3,"type":"Muvaot","kod1":"","sug1":"","tnua":"","kod2":"","zurot":"","kod":"","erechzman":"","erechzura":"","arachim":"1484","erechzurazman":"","cMaxDist":"","aMaxDist":"","sql1expr":"","sql1sug":"","sql2expr":"","sql2sug":"","sql3expr":"","sql3sug":"","sql4expr":"","sql4sug":"","sql5expr":"","sql5sug":"","sql6expr":"","sql6sug":"","sederZeruf":"","distance":"","kotm":"הערך: <b>אֶלָּא</b>","mislifnay":"0","misacharay":"0","sOrder":"standart","pagenum":"1","lines":"0","takeMaxPage":"true","nMaxPage":-1,"year":"","hekKazar":False}
req = requests.post('https://maagarim.hebrew-academy.org.il/Pages/ws/Arachim.asmx/GetMuvaot', json=post_json)
d = req.json()['d']
soup = BeautifulSoup(d, "html.parser")

for num, table in enumerate(soup.find_all('table'), start=1):
    print(f"Entry {num}")
    tr_row_second = table.find('tr', class_='srRowSecond')
    td = tr_row_second.find_all('td')[1]
    print(" ", td.strong.text)
    tr_row_third = table.find('tr', class_='srRowThird')
    td = tr_row_third.find_all('td')[1]
    print(" ", td.text)
This would give you information starting:
Entry 1
תעודות בר כוכבא, ואדי מורבעאת 45
המסירה: Mur, 45
Entry 2
תעודות בר כוכבא, איגרת מיהונתן אל יוסה
מראה מקום: <שו' 4>  |  המסירה: Mur, 46
Entry 3
ברכת המזון
מראה מקום: רחם נא יי אלהינו על ישראל עמך, ברכה ג <שו' 6> (גרסה)  |  המסירה: New York, Jewish Theological Seminary (JTS), ENA, 2150, 47
Entry 4
ברכת המזון
מראה מקום: נחמנו יי אלהינו, ברכה ד, לשבת <שו' 6>  |  המסירה: Cambridge, University Library, T-S Collection, 8H 11, 4
I suggest you print(soup) to understand better what is returned.

Unable to list "all" class text from a webpage

I'm trying to list all nicknames from a specific forum thread (webpage):
url = "https://www.webpage.com"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
username = doc.find('div', class_='userText')
userd = username.a.text
print(userd)
On the webpage:
<div class="userText">
Nickname1
</div>
Nickname2
</div>
etc
So I'm successfully isolating the userText name from the webpage.
The thing is that I'm only able to get the first nickname, while there are more than 150 on the page.
I tried doc.find_all instead of doc.find, but then I'm hit with:
You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I'm unsure how to tackle this.
Fixed with a loop plus putting the divs inside a list:
username = doc.find_all("div", class_="userText")
for user in username:  # iterate over the ResultSet instead of indexing a fixed range
    print(user.a.text)
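A slightly more defensive variant, in case some userText divs lack an <a> child (an assumption; drop the check if every div is guaranteed to contain one):

for div in doc.find_all("div", class_="userText"):
    if div.a is not None:  # skip divs without a nested <a> tag
        print(div.a.text)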

BeautifulSoup Web Scrape Running but Not Printing

Mega new coder here, as I learned web scraping yesterday. I'm attempting to scrape a site with the following HTML code:
<div id="db_detail_colorways">
<a class="db_colorway_line" href="database_detail_colorway.php?
ID=11240&table_name=glasses">
<div class="db_colorway_line_image"><img
src="database/Sport/small/BallisticNewMFrameStrike_MatteBlack_Clear.jpg"/>.
</div>.
<div class="grid_4" style="overflow:hidden;">Matte Black</div><div
class="grid_3">Clear</div><div class="grid_1">$133</div><div
class="grid_2">OO9060-01</div><div class="clear"></div></a><a
There are 4 total items being scraped. The goal is to print the value stored in <div class="grid_4">; the code should loop over the 4 items being scraped, so for the HTML above the first value displayed would be "Matte Black". Here is my code:
for frame_colors in soup.find_all('a', class_ = 'db_colorway_line'):
    all_frame_colors = frame_colors.find_all('div', class_ = 'grid_4').text
    print(all_frame_colors)
Basically the code runs, and everything else so far has run correctly in this Jupyter notebook, but this runs and does not print anything. I'm thinking it's a syntax error, but I could be wrong. Hopefully this makes sense. Can anyone help? Thanks!
You are treating a list of elements as a single element
frame_colors.find_all('div', class_ = 'grid_4').text
You can loop over all_frame_colors and get the text from each element like this:
for frame_colors in soup.find_all('a', class_ = 'db_colorway_line'):
    all_frame_colors = frame_colors.find_all('div', class_ = 'grid_4')
    for af in all_frame_colors:
        print(af.text)
If it solves your problem, don't forget to mark this as the answer!

Scraping an onclick value in BeautifulSoup in Pandas

For class, we've been asked to scrape the North Korean News Agency's website: http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf
The question asks us to scrape the onclick values from the website. I've tried solving this in two different ways: by navigating the DOM tree, and by building a regex within a loop to systematically pull them out. I've failed on both counts.
Attempt 1:
onclick_soup = soup_doc.find_all('a', class_='titlebet')[0]
onclick_soup
Output:
<a class="titlebet" href="#this" onclick='fn_showArticle("AR0140322",
"", "NT00", "L")'>경애하는 최고령도자 <nobr><strong><font
style="font-size:10pt;">김정은</font></strong></nobr>동지께서 라오스인민혁명당 중앙위원회
총비서인 라오스인민민주주의공화국 주석에게 축전을 보내시였다</a>
Attempt 2:
regex_for_onclick_value = r"onclick='(.*?)\("
onclick_value_soup = soup_doc.find_all('a', class_='titlebet')
for onclick_value in onclick_value_soup:
    value = re.findall(regex_for_onclick_value, onclick_value)
    print(onclick_value)
Attempt 2 results in a TypeError.
I'm doing this in pandas. Any guidance would be helpful.
You can simply iterate over every tag in the HTML and check for an onclick attribute:
import requests
from bs4 import BeautifulSoup

page = requests.get('http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf')
soup = BeautifulSoup(page.content, 'lxml')

for tag in soup.find_all():
    on_click = tag.get('onclick')
    if on_click:
        print(on_click)
Note that when calling find_all() without any arguments it retrieves every tag. We then use those tags to look for an onclick attribute that is not None and print it out.
Outputs:
fn_convertLanguage('kor')
fn_convertLanguage('eng')
fn_convertLanguage('chn')
fn_convertLanguage('rus')
fn_convertLanguage('spn')
fn_convertLanguage('jpn')
GotoLogin()
register()
evalSearch()
...
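As for Attempt 2, the TypeError comes from passing a Tag object to re.findall(), which expects a string. A minimal sketch of a corrected regex route (the pattern, which captures the handler name before the opening parenthesis, is illustrative, not from the original answer):

import re
import requests
from bs4 import BeautifulSoup

page = requests.get('http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf')
soup = BeautifulSoup(page.content, 'lxml')

# Capture everything before the first "(" -- i.e. the handler's name.
regex_for_onclick = re.compile(r"^(.*?)\(")

for tag in soup.find_all('a', class_='titlebet'):
    onclick = tag.get('onclick')  # a plain string (or None), safe to pass to re
    if onclick:
        print(regex_for_onclick.findall(onclick))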

Using XPath to get a string from a webpage

I am trying to get the UniProt ID from this webpage: ENSEMBL. But I am having trouble using XPath. Right now I am getting an empty list and I do not understand why.
My idea is to write a small function that takes the ENSEMBL IDs and returns the uniprot ID.
import requests
from lxml import html

ens_code = 'ENST00000378404'
webpage = 'http://www.ensembl.org/id/' + ens_code
response = requests.get(webpage)
tree = html.fromstring(response.content)
path = '//*[@id="ensembl_panel_1"]/div[2]/div[3]/div[3]/div[2]/p/a'
uniprot_id = tree.xpath(path)
print(uniprot_id)
Any help would be appreciated :)
It now prints only the lists that exist, but it is still returning None the rest of the time.
def getUniprot(ensembl_code):
    ensembl_code = ensembl_code[:-1]
    webpage = 'http://www.ensembl.org/id/' + ensembl_code
    response = requests.get(webpage)
    tree = html.fromstring(response.content)
    path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
    uniprot_id = tree.xpath(path)
    if uniprot_id:
        print(uniprot_id)
        return uniprot_id
The reason you are getting an empty list is that you used the XPath Chrome supplied when you right-clicked and chose "Copy XPath". That XPath returns nothing because the tag is not in the source: it is dynamically generated, so what requests returns does not contain the element.
In [6]: response = requests.get(webpage)
In [7]: "ensembl_panel_1" in response.content
Out[7]: False
You should always check the page source to see what you are actually getting back; what you see in the developer console is not necessarily what you get when you download the source.
You can also use a more specific XPath in case there are other http://www.uniprot.org/uniprot/ links on the page: search the divs for the class "lhs" and the text "Uniprot", then get the text from the first following anchor tag:
path = '//div[#class="lhs" and text()="Uniprot"]/following::a[1]/text()'
Which would give you:
['Q8TDY3']
You can also select the following-sibling div, where the anchor is inside its child p tag:
path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
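Putting the pieces together, a minimal end-to-end sketch using the label-based XPath (assuming the Ensembl page still serves this static markup):

import requests
from lxml import html

ens_code = 'ENST00000378404'
response = requests.get('http://www.ensembl.org/id/' + ens_code)
tree = html.fromstring(response.content)

# Anchor on the static "Uniprot" label rather than a brittle positional path.
path = '//div[@class="lhs" and text()="Uniprot"]/following::a[1]/text()'
print(tree.xpath(path))  # expected: ['Q8TDY3']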
