Scraping an onclick value in BeautifulSoup in Pandas - python

For class, we've been asked to scrape the North Korean News Agency's website: http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf
The question asks us to scrape the onclick values from the website. I've tried solving this in two different ways: by navigating the DOM tree, and by building a regex within a loop to systematically pull them out. I've failed on both counts.
Attempt 1:
onclick_soup = soup_doc.find_all('a', class_='titlebet')[0]
onclick_soup
Output:
<a class="titlebet" href="#this" onclick='fn_showArticle("AR0140322",
"", "NT00", "L")'>경애하는 최고령도자 <nobr><strong><font
style="font-size:10pt;">김정은</font></strong></nobr>동지께서 라오스인민혁명당 중앙위원회
총비서인 라오스인민민주주의공화국 주석에게 축전을 보내시였다</a>
Attempt 2:
regex_for_onclick_soup = r"onclick='(.*?)\("
onclick_value_soup = soup_doc.find_all('a', class_='titlebet')
for onclick_value in onclick_value_soup:
    value = re.findall(regex_for_onclick_value, onclick_value)
    print(onclick_value)
Attempt 2 results in a TypeError.
I'm doing this in pandas. Any guidance would be helpful.
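(A note on Attempt 2: the TypeError most likely comes from passing the Tag object itself to re.findall, which expects a string. A minimal corrected sketch, assuming the same soup_doc, pulls the onclick attribute out first and matches the function name instead of the onclick=' prefix, which is not part of the attribute value:)
import re

regex_for_onclick_value = r'(\w+)\('  # function name before the opening parenthesis

for tag in soup_doc.find_all('a', class_='titlebet'):
    onclick_value = tag.get('onclick')  # plain string, e.g. fn_showArticle("AR0140322", ...)
    if onclick_value:
        print(re.findall(regex_for_onclick_value, onclick_value), onclick_value)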

You can simply iterate over every tag in your HTML and check for the onclick attribute.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf')
soup = BeautifulSoup(page.content, 'lxml')
for tag in soup.find_all():
    on_click = tag.get('onclick')
    if on_click:
        print(on_click)
Note that when using find_all() without any arguments, it retrieves every tag. We then use these tags to look for an onclick value that is not None and print it out.
Outputs:
fn_convertLanguage('kor')
fn_convertLanguage('eng')
fn_convertLanguage('chn')
fn_convertLanguage('rus')
fn_convertLanguage('spn')
fn_convertLanguage('jpn')
GotoLogin()
register()
evalSearch()
...
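Since the question mentions pandas, here is a small follow-on sketch (an assumption that you want the headline text next to each onclick value; it reuses the soup object from above) that collects everything into a DataFrame:
import pandas as pd

rows = []
for tag in soup.find_all('a', class_='titlebet'):
    on_click = tag.get('onclick')
    if on_click:
        rows.append({'headline': tag.get_text(strip=True), 'onclick': on_click})

df = pd.DataFrame(rows)
print(df.head())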

Related

find_all in bs4 returns one element when there are more in the web page

I am doing web scraping on a Newegg page and I want to scrape the rating of the product by the consumers, and I am using this code
page = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product').text
soup = bs(page, 'lxml')
the_rating = soup.find_all(class_='rating rating-4')
print(the_rating)
And it returns only this one element even though I am using find_all:
[<i class="rating rating-4"></i>]
I get [] with your code; judging by the response content, the request isn't reaching the actual product page. When I break it up and print the response status and URL
r = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product')
print(f'<{r.status_code} {r.reason}> from {r.url}')
# soup = bs(r.content , 'lxml')
output:
<200 OK> from https://www.newegg.com/areyouahuman?referer=/areyouahuman?referer=https%3A%2F%2Fwww.newegg.com%2Fmsi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc%2Fp%2FN82E16814137632%3FDescription%3Dgpu%26cm_re%3Dgpu-_-14-137-632-_-Product&why=8&cm_re=gpu-_-14-137-632-_-Product&Description=gpu
It's been redirected to a CAPTCHA...
Anyway, even if you get past that (I couldn't, so I just pasted and parsed the response from my browser's network logs to test), all you can get from page is the source HTML, which does not contain any elements with class="rating rating-4"; using selenium and waiting for the page to finish loading yielded a bit more, but even then there weren't any exact matches.
[There were some matches when I inspected in browser, but only if I wasn't in incognito mode, which is likely why selenium didn't find them either.]
So the site probably adds or removes some classes based on the source of the request. If you just want to get all elements with both the rating and rating-4 classes (that will include elements with class="rating is-large rating-4"), you can use .find_all with a lambda (or define a separate function), or use .select with a CSS selector like
the_rating = soup.select('.rating.rating-4') # shorter than
# .find_all(lambda t: {'rating', 'rating-4'}.issubset(set(t.get('class', []))))
[Just make sure you have the full/correct HTML.]
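For completeness, a rough sketch of the Selenium route mentioned above (assuming chromedriver is installed and on PATH; the wait selector is an assumption, and the CAPTCHA may still get in the way):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632')
# wait until at least one rating element is present (assumed selector)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.rating')))
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.select('.rating.rating-4'))
driver.quit()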

How do you use beautifulsoup and selenium to scrape html inside shadow dom?

I'm trying to make an automation program to scrape part of a website. But this website is built with JavaScript, and the part of the website I want to scrape is inside a shadow DOM.
So I figured out that I should use selenium to go to that website and use this code to access elements in the shadow DOM
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
and use
driver.page_source
to get the HTML of that website. But this code doesn't show me elements that are inside the shadow dom.
I've tried combining those two and tried
root1 = driver.find_element(By.CSS_SELECTOR, "path1")
shadow_root = expand_shadow_element(root1)
html = shadow_root.page_source
but I got
AttributeError: 'ShadowRoot' object has no attribute 'page_source'
for a response. So I think that I need to use BeautifulSoup to scrape data from that page, but I can't figure out how to combine BeautifulSoup and Selenium to scrape data from a shadow dom.
P.S. If the part I want to scrape is
<h3>apple</h3>
<p>1$</p>
<p>red</p>
I want to scrape that code exactly, not
apple
1$
red
You would use BeautifulSoup here as follows:
soup = BeautifulSoup(driver.page_source, 'lxml')
my_parts = soup.select('h3') # for example
Most likely you need to wait for the element to appear, so set an implicit or an explicit wait; once the element is loaded you can soup that page's HTML.
driver.implicitly_wait(15)  # in seconds
text = shadow_root.find_element(By.CSS_SELECTOR, "path2").get_attribute('innerHTML')
None of the answers solved my problem, so I tinkered with the code and this worked! The answer was get_attribute!
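Putting the pieces together: a ShadowRoot has no page_source, but you can ask the browser for the shadow root's innerHTML and hand that string to BeautifulSoup, which keeps the tags intact rather than just their text. A minimal sketch, assuming the same driver and the placeholder selector "path1" for the host element:
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

host = driver.find_element(By.CSS_SELECTOR, "path1")  # element that owns the shadow root
shadow_html = driver.execute_script('return arguments[0].shadowRoot.innerHTML', host)

soup = BeautifulSoup(shadow_html, 'lxml')
for tag in soup.select('h3, p'):
    print(tag)  # prints the tags themselves, e.g. <h3>apple</h3>, not just the text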

Python Selenium - How do you extract a link from an element with no href? [duplicate]

I am trying to iterate through a series of car listings and return the links to the individual CarFax and Experian Autocheck documents for each listing.
Page I am trying to pull the links from
The XPATH for the one constant parent element across all child elements I am looking for is:
.//div[@class="display-inline-block align-self-start"]/div[1]
I initially tried to simply extract the href attribute from the child <div> and <a> tags at this XPATH: .//div[@class="display-inline-block align-self-start"]/div[1]/a[1]
This works great for some of the listings but does not work for others that do not have an <a> tag and instead include a <span> tag containing the inline text link "Get AutoCheck Vehicle History".
That link functions correctly on the page, but there is no href attribute or any link I can find attached to the element in the page and I do not know how to scrape it with Selenium. Any advice would be appreciated as I am new to Python and Selenium.
For reference, here is the code I was using to scrape through the page (this eventually returns an IndexError, as only some of the iterations of elements on the list have the <a> tag, and the final amount does not match the total amount of listings on the page indicated by len(name)).
s = Service('/Users/admin/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.autotrader.com/cars-for-sale/ferrari/458-spider/beverly-hills-ca-90210?dma=&searchRadius=0&location=&isNewSearch=true&marketExtension=include&showAccelerateBanner=false&sortBy=relevance&numRecords=100")
nameList = []
autoCheckList = []
name = driver.find_elements(By.XPATH, './/h2[@class="text-bold text-size-400 text-size-sm-500 link-unstyled"]')
autoCheck = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]/div[1]/a[1]')
for i in range(len(name)):
    nameList.append(name[i].text)
    autoCheckList.append(autoCheck[i].get_attribute('href'))
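One way to sidestep the IndexError (a sketch, not a drop-in fix: the card-level XPath is an assumption) is to scope the lookup to each listing, so a listing without an <a> tag contributes None instead of shifting the two lists out of sync:
cards = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]')
for card in cards:
    links = card.find_elements(By.XPATH, './div[1]/a[1]')  # empty list when only a <span> is present
    autoCheckList.append(links[0].get_attribute('href') if links else None)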

how to use soup.find_all() providing attrs starting with sometext

I am trying to scrape this website: https://beta.sam.gov/search?keywords=&sort=-modifiedDate&index=opp&is_active=true&page=1. All the data which I want to scrape is inserted into a div whose class gets a dynamic value each time, so I want to find all those divs using soup.find_all() and provide the starting string of its class.
This is my current code:
outerDivs = soup.find_all(attrs={"tabindex": "-1", "class": "ng-tns-c1-1 ng-star-inserted"})
What I want is to find_all() divs having the attribute tabindex="-1" and a class that starts with ng-tns-... and ends with ng-star-inserted. The only value that changes comes after ng-tns; right now it looks like ng-tns-c294-1 ng-star-inserted. Please note that ng-star-inserted always remains the same.
This is how I get the soup:
driver.get(
    f'https://beta.sam.gov/search?keywords=&sort=-modifiedDate&index=opp&is_active=true&page={currentpage}')
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div#search-results")))
source = driver.page_source
soup = BeautifulSoup(source, 'lxml')
currentpage increases by one each time to go to the next page.
I'm not the best with regex, so there might be a better way to do it, but this should do the trick:
soup.find_all(attrs={"tabindex": "-1", "class": re.compile("^ng-tns.*ng-star-inserted$")})
It will only match a class that starts specifically with "ng-tns," has any number of characters, then ends specifically with "ng-star-inserted."
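If you would rather avoid regex, find_all() also accepts a filter function; an equivalent sketch, assuming the same soup:
def dynamic_ng_div(tag):
    classes = tag.get('class', [])
    return (tag.get('tabindex') == '-1'
            and 'ng-star-inserted' in classes
            and any(c.startswith('ng-tns-') for c in classes))

outerDivs = soup.find_all(dynamic_ng_div)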
