Python: Tell BeautifulSoup to choose one value from two

I am scraping a value using BeautifulSoup, but the output gives me two values because the element appears twice on the page. How do I choose just one of them? This is my code:
import requests
from bs4 import BeautifulSoup

url = 'URL'
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; real headers are set elsewhere
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
data = soup.find_all("input", {'name': 'CsrfToken', 'type': 'hidden'})
for item in data:
    print(item.get('value'))
Output:
c8b3226dc829256687cac584a9421e8acc4649ff4ee5f8f386ea11ce03a811c8
c8b3226dc829256687cac584a9421e8acc4649ff4ee5f8f386ea11ce03a811c8
The first 'CsrfToken' is in:
<form method="post" data-url="url" id="test-form" data-test-form="" action="url" name="test-form"><input type="hidden" name="CSRFToken" value="c8b3226dc829256687cac584a9421e8acc4649ff4ee5f8f386ea11ce03a811c8">
The second 'CsrfToken' is in:
<form method="post" name="AnotherForm" class="th-form th-form__compact th-form__compact__inline" data-testid="th-comp-Another-form" action="url" id="AnotherForm"><input type="hidden" name="CSRFToken" value="c8b3226dc829256687cac584a9421e8acc4649ff4ee5f8f386ea11ce03a811c8">
I only want the first or second value so that my payload request can load correctly.

Use find(); it will give you the first instance of the tag on the page.
find_all() returns all instances of the tag on the page.
From the documentation regarding find_all() vs. find():
The find_all() method scans the entire document looking for results,
but sometimes you only want to find one result. If you know a document
only has one <body> tag, it’s a waste of time to scan the entire
document looking for more. Rather than passing in limit=1 every time
you call find_all, you can use the find() method.
So you could still use find_all(), just pass in 1 as the limit parameter.
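For instance, a minimal sketch against standalone HTML modeled on the snippets above (the token value here is made up):
import requests
from bs4 import BeautifulSoup

html = '<form><input type="hidden" name="CSRFToken" value="abc123"></form>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None if nothing matches
token = soup.find('input', {'name': 'CSRFToken', 'type': 'hidden'})
print(token.get('value'))  # abc123

# equivalent with find_all(), which returns a list, capped at one result
tokens = soup.find_all('input', {'name': 'CSRFToken', 'type': 'hidden'}, limit=1)
print(tokens[0].get('value'))  # abc123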

To leave the loop early, try:
for item in data:
    print(item.get('value'))
    break
To always get the first element you can do:
def get_first_value(results):
    try:
        return results[0].get('value')
    except IndexError:
        return None

value = get_first_value(data)

Related

find_all in bs4 returns one element when there are more in the web page

I am web scraping a Newegg page, and I want to scrape the consumer rating of the product using this code:
import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product').text
soup = bs(page, 'lxml')
the_rating = soup.find_all(class_='rating rating-4')
print(the_rating)
And it returns only this one element, even though I am using find_all:
[<i class="rating rating-4"></i>]
I get [] with your code. To see why, print the response status and URL:
r = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product')
print(f'<{r.status_code} {r.reason}> from {r.url}')
# soup = bs(r.content, 'lxml')
output:
<200 OK> from https://www.newegg.com/areyouahuman?referer=/areyouahuman?referer=https%3A%2F%2Fwww.newegg.com%2Fmsi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc%2Fp%2FN82E16814137632%3FDescription%3Dgpu%26cm_re%3Dgpu-_-14-137-632-_-Product&why=8&cm_re=gpu-_-14-137-632-_-Product&Description=gpu
It's been redirected to a CAPTCHA...
Anyway, even if you get past that (I couldn't, so to test I just pasted and parsed the response from my browser's network logs), all you can get from page is the source HTML, which does not contain any elements with class="rating rating-4". Using selenium and waiting for the page to finish loading yielded a bit more, but even then there weren't any exact matches.
[There were some matches when I inspected in browser, but only if I wasn't in incognito mode, which is likely why selenium didn't find them either.]
So the site probably adds or removes some classes depending on the source of the request. If you just want all elements with both the rating and rating-4 classes (which will include elements with class="rating is-large rating-4"), you can use .find_all with a lambda (or a separate function), or use .select with a CSS selector like
the_rating = soup.select('.rating.rating-4') # shorter than
# .find_all(lambda t: {'rating', 'rating-4'}.issubset(set(t.get('class', []))))
[Just make sure you have the full/correct HTML.]
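To illustrate on a self-contained snippet (the HTML below is made up; the class names mirror the ones discussed above):
from bs4 import BeautifulSoup

html = '''
<i class="rating rating-4"></i>
<i class="rating is-large rating-4"></i>
<i class="rating rating-5"></i>
'''
soup = BeautifulSoup(html, 'html.parser')

# .rating.rating-4 matches any element carrying BOTH classes, extra classes allowed
print(soup.select('.rating.rating-4'))
# [<i class="rating rating-4"></i>, <i class="rating is-large rating-4"></i>]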

Unable to list "all" class text from a webpage

I'm trying to list all the nicknames from a specific forum thread (webpage):
url = "https://www.webpage.com"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
username = doc.find('div', class_='userText')
userd = username.a.text
print(userd)
On the webpage:
<div class="userText">
Nickname1
</div>
<div class="userText">
Nickname2
</div>
etc.
So I'm successfully isolating the "userText" name from the webpage.
The thing is, I'm only able to get the first nickname, while there are more than 150 on the page.
I tried doc.find_all instead of doc.find, but then I'm hit with:
You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I'm unsure how to tackle this.
Fixed with a loop, plus putting the div inside a list:
username = doc.find_all(["div"], class_="userText")
for i in range(0, 150):
    print(username[i].a.text)
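A slightly more robust variant of the same fix is to iterate over the ResultSet directly, so it doesn't depend on there being exactly 150 entries:
for user in doc.find_all("div", class_="userText"):
    print(user.a.text)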

Scraping an onclick value in BeautifulSoup in Pandas

For class, we've been asked to scrape the North Korean News Agency's website: http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf
The question asks us to scrape the onclick values from the website. I've tried solving this in two different ways: by navigating the DOM tree, and by building a regex within a loop to systematically pull them out. I've failed on both counts.
Attempt 1:
onclick_soup = soup_doc.find_all('a', class_='titlebet')[0]
onclick_soup
Output:
<a class="titlebet" href="#this" onclick='fn_showArticle("AR0140322",
"", "NT00", "L")'>경애하는 최고령도자 <nobr><strong><font
style="font-size:10pt;">김정은</font></strong></nobr>동지께서 라오스인민혁명당 중앙위원회
총비서인 라오스인민민주주의공화국 주석에게 축전을 보내시였다</a>
Attempt 2:
regex_for_onclick_soup = r"onclick='(.*?)\("
onclick_value_soup = soup_doc.find_all('a', class_='titlebet')
for onclick_value in onclick_value_soup:
    value = re.findall(regex_for_onclick_value, onclick_value)
    print(onclick_value)
Attempt 2 results in a TypeError.
I'm doing this in pandas. Any guidance would be helpful.
You can simply iterate over every tag in your HTML and check for the onclick attribute.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf')
soup = BeautifulSoup(page.content, 'lxml')
for tag in soup.find_all():
    on_click = tag.get('onclick')
    if on_click:
        print(on_click)
Note that calling find_all() without any arguments retrieves every tag. We then check each tag for an onclick attribute that is not None and print it out.
Outputs:
fn_convertLanguage('kor')
fn_convertLanguage('eng')
fn_convertLanguage('chn')
fn_convertLanguage('rus')
fn_convertLanguage('spn')
fn_convertLanguage('jpn')
GotoLogin()
register()
evalSearch()
...
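As an aside, find_all() also accepts attribute filters, so you can let BeautifulSoup do the onclick check for you (a minor shortcut that should produce the same output):
# True matches any tag that has the attribute, whatever its value
for tag in soup.find_all(onclick=True):
    print(tag['onclick'])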

String after <div class> not visible when scraping with BeautifulSoup

I'm scraping a news article. Here is the link.
I want to get the "13" string inside the comment__counter total_comment_share class. That string is visible in the browser's inspect element, and you can try it yourself from the link above. But when I do find() and print it, the string is missing, so I can't scrape it. This is my code:
import requests
from bs4 import BeautifulSoup

a = 'https://tekno.kompas.com/read/2020/11/12/08030087/youtube-down-pagi-ini-tidak-bisa-memutar-video'
b = requests.get(a)
c = b.content
d = BeautifulSoup(c, 'html.parser')
e = d.find('div', {'class': 'social--inline eee'})
f = d.find('div', {'class': 'comment__read__text'})
print(f)
In my code I'm using find() on the comment__read__text class to make it clearer that I can find the elements, just not that "13" string. The result is the same if I use find() on the comment__counter total_comment_share class. This is the output from the code above:
<div class="comment__read__text">
Komentar <div class="comment__counter total_comment_share"></div>
</div>
As you can see the "13" string is not there. Anyone knows why?
Any help would be appreciated.
It's because the comment count is fetched by a separate request made while the page loads, i.e. the page renders that content dynamically. Try this out:
import requests
a = 'https://tekno.kompas.com/read/2020/11/12/08030087/youtube-down-pagi-ini-tidak-bisa-memutar-video'
b = requests.get('https://apis.kompas.com/api/comment/list?urlpage={}&json&limit=1'.format(a))
c = b.json()
f = c["result"]["total"]
print(f)
PS: if you're interested in scraping all the comments, just change limit to 100000 which will get you all the comments in one request as JSON.
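For instance, a sketch of the same request with the limit raised (the endpoint, the urlpage/json/limit parameters, and the result/total fields are taken from the answer above; anything else about the response shape is an assumption):
import requests

article = 'https://tekno.kompas.com/read/2020/11/12/08030087/youtube-down-pagi-ini-tidak-bisa-memutar-video'
# limit=100000 should return every comment in one JSON response
r = requests.get('https://apis.kompas.com/api/comment/list?urlpage={}&json&limit=100000'.format(article))
data = r.json()
print(data['result']['total'])  # total number of comments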

How can I get text of an element in Selenium WebDriver, without including child element text?

Consider:
<div id="a">This is some
<div id="b">text</div>
</div>
Getting "This is some" is nontrivial. For instance, this returns "This is some text":
driver.find_element_by_id('a').text
How does one, in a general way, get the text of a specific element without including the text of its children?
Here's a general solution:
def get_text_excluding_children(driver, element):
    return driver.execute_script("""
        return jQuery(arguments[0]).contents().filter(function() {
            return this.nodeType == Node.TEXT_NODE;
        }).text();
    """, element)
The element passed to the function can be something obtained from the find_element...() methods (i.e., it can be a WebElement object).
Or if you don't have jQuery or don't want to use it, you can replace the body of the function above with this:
return driver.execute_script("""
    var parent = arguments[0];
    var child = parent.firstChild;
    var ret = "";
    while(child) {
        if (child.nodeType === Node.TEXT_NODE)
            ret += child.textContent;
        child = child.nextSibling;
    }
    return ret;
""", element)
I'm actually using this code in a test suite.
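For reference, a minimal usage sketch (assuming the HTML from the question and an already-created driver):
element = driver.find_element_by_id('a')
print(get_text_excluding_children(driver, element).strip())  # "This is some"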
In the HTML which you have shared:
<div id="a">This is some
<div id="b">text</div>
</div>
The text This is some is within a text node. To depict the text node in a structured way:
<div id="a">
This is some
<div id="b">text</div>
</div>
This use case
To extract and print the text This is some from the text node using Selenium's Python client, you have two approaches:
Using splitlines(): You can identify the parent element, i.e. <div id="a">, extract its innerHTML, and then use splitlines() as follows:
using xpath:
print(driver.find_element_by_xpath("//div[@id='a']").get_attribute("innerHTML").splitlines()[0])
using css_selector:
print(driver.find_element_by_css_selector("div#a").get_attribute("innerHTML").splitlines()[0])
Using execute_script(): You can also use the execute_script() method which can synchronously execute JavaScript in the current window/frame as follows:
using xpath and firstChild:
parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].firstChild.textContent;', parent_element).strip())
using xpath and childNodes[n]:
parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].childNodes[0].textContent;', parent_element).strip())
Use:
def get_true_text(tag):
    children = tag.find_elements_by_xpath('*')
    original_text = tag.text
    for child in children:
        original_text = original_text.replace(child.text, '', 1)
    return original_text
You don't have to do a replace. You can get the length of the children text, subtract that from the overall length, and slice into the original text. That should be substantially faster.
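A sketch of that slicing idea (hypothetical helper name; it assumes all of the children's text comes after the element's own text, as in the HTML above):
def get_own_text_by_slicing(tag):
    # total length of all child elements' text
    children_len = sum(len(child.text) for child in tag.find_elements_by_xpath('*'))
    full_text = tag.text
    # slice the trailing children's text off the combined text
    return full_text[:len(full_text) - children_len] if children_len else full_text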
Unfortunately, Selenium was only built to work with Elements, not Text nodes.
If you try to use a function like find_element_by_xpath to target the text nodes, Selenium will throw an InvalidSelectorException.
One workaround is to grab the relevant HTML with Selenium and then use an HTML parsing library like Beautiful Soup that can handle text nodes more elegantly.
import bs4
from bs4 import BeautifulSoup
inner_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("innerHTML")
inner_soup = BeautifulSoup(inner_html, 'html.parser')
outer_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("outerHTML")
outer_soup = BeautifulSoup(outer_html, 'html.parser')
From there, there are several ways to search for the Text content. You'll have to experiment to see what works best for your use case.
Here's a simple one-liner that may be sufficient:
inner_soup.find(text=True)
If that doesn't work, then you can loop through the element's child nodes with .contents and check their object type.
Beautiful Soup has four types of elements, and the one that you'll be interested in is the NavigableString type, which is produced by Text nodes. By contrast, Elements will have a type of Tag.
contents = inner_soup.contents
for bs4_object in contents:
    if type(bs4_object) == bs4.Tag:
        print("This object is an Element.")
    elif type(bs4_object) == bs4.NavigableString:
        print("This object is a Text node.")
Note that Beautiful Soup doesn't support XPath expressions. If you need those, then you can use some of the workarounds in this question.
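For completeness, a minimal sketch of one such workaround using lxml, which does support XPath text nodes (standalone example, built on the HTML from the question):
from lxml import html

tree = html.fromstring('<div id="a">This is some\n<div id="b">text</div>\n</div>')
print(tree.xpath('//div[@id="a"]/text()')[0].strip())  # "This is some"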
