Unable to list "all" class text from a webpage - python

I'm trying to list all nicknames from a specific forum thread (webpage):
import requests
from bs4 import BeautifulSoup

url = "https://www.webpage.com"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
username = doc.find('div', class_='userText')
userd = username.a.text
print(userd)
On the webpage:
<div class="userText">
Nickname1
</div>
<div class="userText">
Nickname2
</div>
etc.
So I'm successfully isolating the "userText" name from the webpage.
The thing is that I'm only able to get the first nickname, while there are more than 150 on the page.
I tried a
doc.find_all
instead of my
doc.find
But then I'm hit with a
You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I'm unsure on how to tackle this.

Fixed with a loop + put the div inside a list:
username = doc.find_all(["div"], class_="userText")
for i in range(0, 150):
    print(username[i].a.text)
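An alternative that avoids hardcoding the 150 count: iterate over whatever find_all() returns. A minimal sketch with stand-in markup (the real page's HTML will differ):

```python
from bs4 import BeautifulSoup

# Stand-in for the forum page markup described above.
html = """
<div class="userText"><a>Nickname1</a></div>
<div class="userText"><a>Nickname2</a></div>
"""
doc = BeautifulSoup(html, "html.parser")

# find_all() returns a list of Tags; looping over it directly works
# for any number of users, with no index bookkeeping.
names = [div.a.text for div in doc.find_all("div", class_="userText")]
print(names)  # ['Nickname1', 'Nickname2']
```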

Related

BeautifulSoup Web Scrape Running but Not Printing

Mega new coder here as I learned Web scraping yesterday. I'm attempting to scrape a site with the following html code:
<div id="db_detail_colorways">
<a class="db_colorway_line" href="database_detail_colorway.php?ID=11240&table_name=glasses">
<div class="db_colorway_line_image"><img src="database/Sport/small/BallisticNewMFrameStrike_MatteBlack_Clear.jpg"/></div>
<div class="grid_4" style="overflow:hidden;">Matte Black</div><div class="grid_3">Clear</div><div class="grid_1">$133</div><div class="grid_2">OO9060-01</div><div class="clear"></div></a><a
There are 4 total items being scraped. The goal is to print the value stored in <div class="grid_4">: the code should loop over the 4 items, so for the HTML provided, the first value displayed is "Matte Black". Here is my code:
for frame_colors in soup.find_all('a', class_='db_colorway_line'):
    all_frame_colors = frame_colors.find_all('div', class_='grid_4').text
    print(all_frame_colors)
Basically everything else in this Jupyter notebook has run correctly so far, but this runs without printing anything. I'm thinking it's a syntax error, but I could be wrong. Hopefully this makes sense. Can anyone help? Thanks!
You are treating a list of elements as a single element
frame_colors.find_all('div', class_ = 'grid_4').text
You can loop over all_frame_colors and get the text from each element like this:
for frame_colors in soup.find_all('a', class_='db_colorway_line'):
    all_frame_colors = frame_colors.find_all('div', class_='grid_4')
    for af in all_frame_colors:
        print(af.text)
If it solves your problem then don't forget to mark this as an answer!

Scraping an onclick value in BeautifulSoup in Pandas

For class, we've been asked to scrape the North Korean News Agency's website: http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf
The question asks to scrape the onclick values from the website. I've tried solving this in two different ways: by navigating the DOM tree, and by building a regex within a loop to systematically pull them out. I've failed on both counts.
Attempt1:
onclick_soup = soup_doc.find_all('a', class_='titlebet')[0]
onclick_soup
Output:
<a class="titlebet" href="#this" onclick='fn_showArticle("AR0140322",
"", "NT00", "L")'>경애하는 최고령도자 <nobr><strong><font
style="font-size:10pt;">김정은</font></strong></nobr>동지께서 라오스인민혁명당 중앙위원회
총비서인 라오스인민민주주의공화국 주석에게 축전을 보내시였다</a>
Attempt2:
regex_for_onclick_value = r"onclick='(.*?)\("
onclick_value_soup = soup_doc.find_all('a', class_='titlebet')
for onclick_value in onclick_value_soup:
    value = re.findall(regex_for_onclick_value, onclick_value)
    print(onclick_value)
Attempt2 results in a TypeError
I'm doing this in pandas. Any guidance would be helpful.
You can simply iterate over every element tag in your html and check for the onclick event.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf')
soup = BeautifulSoup(page.content, 'lxml')
for tag in soup.find_all():
    on_click = tag.get('onclick')
    if on_click:
        print(on_click)
Note that calling find_all() without any arguments retrieves every tag. We then check each tag for an onclick attribute that is not None and print it out.
Outputs:
fn_convertLanguage('kor')
fn_convertLanguage('eng')
fn_convertLanguage('chn')
fn_convertLanguage('rus')
fn_convertLanguage('spn')
fn_convertLanguage('jpn')
GotoLogin()
register()
evalSearch()
...
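For comparison, the asker's regex idea can also be made to work: re.findall() and re.match() need a string, so they must be run on the tag's onclick attribute value rather than on the Tag object itself (passing a Tag is what raises the TypeError in Attempt 2). A sketch on made-up anchors modeled on the output above:

```python
import re
from bs4 import BeautifulSoup

# Made-up anchors modeled on the site's markup.
html = """
<a class="titlebet" onclick="fn_showArticle('AR0140322', '', 'NT00', 'L')">story</a>
<a class="titlebet" onclick="fn_convertLanguage('eng')">eng</a>
"""
soup_doc = BeautifulSoup(html, "html.parser")

calls = []
for tag in soup_doc.find_all('a', class_='titlebet'):
    # tag['onclick'] is a plain string, which re can handle.
    match = re.match(r"(\w+)\(", tag['onclick'])
    if match:
        calls.append(match.group(1))
print(calls)  # ['fn_showArticle', 'fn_convertLanguage']
```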

String after <div class> not visible when scraping beautifulsoup

I'm scraping news article. Here is the link.
I want to get the "13" string inside the comment__counter total_comment_share class. The string is visible in inspect element, and you can verify it yourself from the link above. But when I find() and print it, the string is missing, so I can't scrape it. This is my code:
import requests
from bs4 import BeautifulSoup

a = 'https://tekno.kompas.com/read/2020/11/12/08030087/youtube-down-pagi-ini-tidak-bisa-memutar-video'
b = requests.get(a)
c = b.content
d = BeautifulSoup(c, 'html.parser')
e = d.find('div', {'class': 'social--inline eee'})
f = d.find('div', {'class': 'comment__read__text'})
print(f)
From my code I'm using find() on the comment__read__text class to make it clearer: I can find the elements, but not the "13" string. The result is the same if I use find() on the comment__counter total_comment_share class. This is the output from the code above:
<div class="comment__read__text">
Komentar <div class="comment__counter total_comment_share"></div>
</div>
As you can see, the "13" string is not there. Does anyone know why?
Any help would be appreciated.
It's because a request is made while the page loads, which means the page renders that content dynamically. Try this out:
import requests
a = 'https://tekno.kompas.com/read/2020/11/12/08030087/youtube-down-pagi-ini-tidak-bisa-memutar-video'
b = requests.get('https://apis.kompas.com/api/comment/list?urlpage={}&json&limit=1'.format(a))
c = b.json()
f = c["result"]["total"]
print(f)
PS: if you're interested in scraping all the comments, just change limit to 100000 which will get you all the comments in one request as JSON.

Webscraping with varying page numbers

So I'm trying to webscrape a bunch of profiles. Each profile has a collection of videos. I'm trying to webscrape information about each video. The problem I'm running into is that each profile uploads a different number of videos, so the number of pages containing videos per profile varies. For instance, one profile has 45 pages of videos, as you can see by the html below:
<div class="pagination "><ul><li><a class="active" href="">1</a></li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li><li>7</li><li>8</li><li>9</li><li>10</li><li>11</li><li class="no-page">...<li>45</li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>
While another profile has 2 pages
<div class="pagination "><ul><li><a class="active" href="">1</a></li><li>2</li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>
My question is, how do I account for the varying changes in page? I was thinking of making a for loop and just adding a random number at the end, like
for i in range(0, 1000):
    new_url = 'url' + str(i)
where i accounts for the page, but I want to know if there's a more efficient way of doing this.
Thank you.
The "skeleton" of the loop can look like this:
url = 'http://url/?page={page}'
page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    # ...
    # do we have a next page?
    next_page = soup.select_one('.next-page')
    # no, so break from the loop
    if not next_page:
        break
    page += 1
You can have an infinite loop with while True:, and you break out of it only when there is no next page (when there isn't any class="next-page" tag on the last page).
1. Get the <li>...</li> elements of the <div class="pagination "><ul>.
2. Exclude the last one by its class <li class="no-page">.
3. Parse the "href" and build your next url destinations.
4. Scrape every new url destination.
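Those steps might look like this on the pagination markup from the question (a sketch; real hrefs and classes may vary per site):

```python
from bs4 import BeautifulSoup

# Trimmed version of the pagination block shown in the question.
html = ('<div class="pagination "><ul>'
        '<li><a class="active" href="">1</a></li>'
        '<li>2</li><li>3</li>'
        '<li class="no-page">...</li>'
        '<li>45</li>'
        '<li><a href="#1" class="no-page next-page">Next</a></li>'
        '</ul></div>')
soup = BeautifulSoup(html, 'html.parser')

# Step 1: grab the <li> elements inside the pagination <ul>.
items = soup.select('div.pagination ul li')
# Step 2: keep only the numeric entries, dropping "..." and "Next".
pages = [li.get_text(strip=True) for li in items
         if li.get_text(strip=True).isdigit()]
# The largest page number tells us how many urls to build.
last_page = int(pages[-1])
print(pages, last_page)  # ['1', '2', '3', '45'] 45
```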
I just want to thank everyone who answered for taking the time. I figured out the answer, or at least what worked for me, and decided to share it in case it helps anyone else.
url = 'insert url'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
# look for the pagination class
page = soup.find(class_='pagination')
# create a list to hold all page numbers
href = []
# look for all 'li' tags as the users above suggested
links = page.findAll('li')
for link in links:
    # not every <li> wraps its number in an <a>, so fall back to the li's text
    a = link.find('a', href=True)
    href += [a.text if a else link.get_text(strip=True)]
'''
href will now include all pages plus the word Next, so for instance it
will look something like this: [1, 2, 3, ..., 44, Next].
I want to get 44, which is href[-2], and convert it to an int for a
for loop. In the for loop, add +1 because range stops at i-1, not i:
if you iterate over range(0, 44), the last value of i is 43, which is
why we add 1.
'''
for i in range(0, int(href[-2]) + 1):
    new_url = url + str(i)

Python: Tell BeautifulSoup to choose one value from two

I am scraping a value using BeautifulSoup, but the output gives me two values because it appears twice on the page. How do I choose one of them? This is my code:
url = 'URL'
r = requests.get(url,headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
data = soup.find_all("input", {'name': 'CsrfToken', 'type':'hidden'})
for data in data:
    print(data.get('value'))
Output:
c8b3226dc829256687cac584a9421e8acc4649ff4ee5f8f386ea11ce03a811c8
c8b3226dc829256687cac584a9421e8acc4649ff4ee5f8f386ea11ce03a811c8
The first 'CsrfToken' is in:
<form method="post" data-url="url" id="test-form" data-test-form="" action="url" name="test-form"><input type="hidden" name="CSRFToken" value="c8b3226dc829256687cac584a9421e8acc4649ff4ee5f8f386ea11ce03a811c8">
The second 'CsrfToken' is in:
<form method="post" name="AnotherForm" class="th-form th-form__compact th-form__compact__inline" data-testid="th-comp-Another-form" action="url" id="AnotherForm"><input type="hidden" name="CSRFToken" value="c8b3226dc829256687cac584a9421e8acc4649ff4ee5f8f386ea11ce03a811c8">
I only want the first or second value so that my payload request can load correctly.
Use find(), it will give you the first instance of the tag on the page.
find_all() returns all instances of the tag on the page.
From the documentation regarding find_all() vs. find():
The find_all() method scans the entire document looking for results,
but sometimes you only want to find one result. If you know a document
only has one <body> tag, it’s a waste of time to scan the entire
document looking for more. Rather than passing in limit=1 every time
you call find_all, you can use the find() method.
So you could still use find_all(), just pass in 1 as the limit parameter.
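A small illustration of both approaches on markup with a duplicated hidden input (a sketch; the token value is invented):

```python
from bs4 import BeautifulSoup

# Two forms carrying the same hidden token, as in the question.
html = """
<form><input type="hidden" name="CsrfToken" value="abc123"></form>
<form><input type="hidden" name="CsrfToken" value="abc123"></form>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag (or None if nothing matches).
first = soup.find("input", {"name": "CsrfToken", "type": "hidden"})
token = first["value"] if first else None

# find_all(..., limit=1) returns a list with at most one Tag.
limited = soup.find_all("input", {"name": "CsrfToken", "type": "hidden"}, limit=1)
print(token, len(limited))  # abc123 1
```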
To leave the loop early, try:
for data in data:
    print(data.get('value'))
    break
To always get the first element, index the list that find_all() returns, guarding against an empty result:
data = soup.find_all("input", {'name': 'CsrfToken', 'type': 'hidden'})
value = data[0].get('value') if data else None
