how to use soup.find_all() providing attrs starting with sometext - python

I am trying to scrape this website: https://beta.sam.gov/search?keywords=&sort=-modifiedDate&index=opp&is_active=true&page=1
All the data I want to scrape is inserted into a div whose class gets a dynamic value each time, so I want to find all those divs using soup.find_all() and provide the starting string of the class.
This is my current code:
outerDivs = soup.find_all(attrs={"tabindex": "-1", "class": "ng-tns-c1-1 ng-star-inserted"})
What I want is to find_all() divs having the attribute tabindex="-1" and a class that starts with ng-tns-... and ends with ng-star-inserted. The only value that changes comes after ng-tns; right now it looks like ng-tns-c294-1 ng-star-inserted. Please note ng-star-inserted always remains the same.
This is how I get the soup:
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(
    f'https://beta.sam.gov/search?keywords=&sort=-modifiedDate&index=opp&is_active=true&page={currentpage}')
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div#search-results")))
source = driver.page_source
soup = BeautifulSoup(source, 'lxml')
currentpage increases by one each time to go to the next page.

I'm not the best with regex, so there might be a better way to do it, but this should do the trick:
soup.find_all(attrs={"tabindex": "-1", "class": re.compile("^ng-tns.*ng-star-inserted$")})
It will only match a class attribute that starts specifically with "ng-tns", has any number of characters in between, and ends specifically with "ng-star-inserted".
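For completeness, a minimal sketch assembling the answer's regex with the soup built above (import re is the only extra requirement):

import re

# The regex matches the full class attribute string: it starts with "ng-tns",
# ends with "ng-star-inserted", and the dynamic part (e.g. c294-1) sits in between
pattern = re.compile(r"^ng-tns.*ng-star-inserted$")
outerDivs = soup.find_all("div", attrs={"tabindex": "-1", "class": pattern})

An alternative without regex, assuming the class attribute literally begins with ng-tns, is a CSS attribute selector: soup.select('div[tabindex="-1"][class^="ng-tns"]').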

Related

find_all in bs4 returns one element when there are more in the web page

I am doing web scraping on a Newegg page and I want to scrape the product's rating by consumers, and I am using this code:
import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product').text
soup = bs(page, 'lxml')
the_rating = soup.find_all(class_='rating rating-4')
print(the_rating)
And it returns only this one element even though I am using find_all:
[<i class="rating rating-4"></i>]
I get [] with your code; to see why, print the response status and URL:
r = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product')
print(f'<{r.status_code} {r.reason}> from {r.url}')
# soup = bs(r.content , 'lxml')
output:
<200 OK> from https://www.newegg.com/areyouahuman?referer=/areyouahuman?referer=https%3A%2F%2Fwww.newegg.com%2Fmsi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc%2Fp%2FN82E16814137632%3FDescription%3Dgpu%26cm_re%3Dgpu-_-14-137-632-_-Product&why=8&cm_re=gpu-_-14-137-632-_-Product&Description=gpu
It's been redirected to a CAPTCHA...
Anyway, even if you get past that (I couldn't, so I pasted and parsed the response from my browser's network logs to test), all you can get from page is the source HTML, which does not contain any elements with class="rating rating-4". Using selenium and waiting for the page to finish loading yielded a bit more, but even then there weren't any exact matches.
[There were some matches when I inspected in the browser, but only when I wasn't in incognito mode, which is likely why selenium didn't find them either.]
So, the site probably adds or removes some classes based on the source of the request. If you just want to get all elements with both the rating and rating-4 classes (that will include elements with class="rating is-large rating-4"), you can use .find_all with a lambda (or define a separate function), or use .select with a CSS selector:
the_rating = soup.select('.rating.rating-4') # shorter than
# .find_all(lambda t: {'rating', 'rating-4'}.issubset(set(t.get('class', []))))
[Just make sure you have the full/correct HTML.]
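A rough sketch of the selenium route mentioned above, assuming chromedriver is available (the site may still redirect to the CAPTCHA page, in which case the wait will time out):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product')
# Wait until at least one rating element exists before grabbing the source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.rating')))
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.select('.rating.rating-4'))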

Python Selenium: How to pull a link from an element with no href

I am trying to iterate through a series of car listings and return the links to the individual CarFax and Experian Autocheck documents for each listing.
Page I am trying to pull the links from
The XPath for the one constant parent element across all the child elements I am looking for is:
.//div[@class="display-inline-block align-self-start"]/div[1]
I initially tried to simply extract the href attribute from the child <div> and <a> tags at this XPath: .//div[@class="display-inline-block align-self-start"]/div[1]/a[1]
This works great for some of the listings but does not work for others that do not have an <a> tag and instead include a <span> tag with an inline text link using text element "Get AutoCheck Vehicle History".
That link functions correctly on the page, but there is no href attribute or any link I can find attached to the element, and I do not know how to scrape it with Selenium. Any advice would be appreciated, as I am new to Python and Selenium.
For reference, here is the code I was using to scrape the page (it eventually returns an IndexError, since only some of the listings have the <a> tag, so the final count does not match the total number of listings on the page indicated by len(name)):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

s = Service('/Users/admin/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.autotrader.com/cars-for-sale/ferrari/458-spider/beverly-hills-ca-90210?dma=&searchRadius=0&location=&isNewSearch=true&marketExtension=include&showAccelerateBanner=false&sortBy=relevance&numRecords=100")
nameList = []
autoCheckList = []
name = driver.find_elements(By.XPATH, './/h2[@class="text-bold text-size-400 text-size-sm-500 link-unstyled"]')
autoCheck = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]/div[1]/a[1]')
for i in range(len(name)):
    nameList.append(name[i].text)
    autoCheckList.append(autoCheck[i].get_attribute('href'))
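One way to keep the two lists aligned, sketched under the assumption that the page structure is as described above: search for the link inside each listing instead of globally, and record None when it is missing.

cards = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]')
for card in cards:
    links = card.find_elements(By.XPATH, './div[1]/a[1]')
    # find_elements returns an empty list instead of raising, so a missing
    # <a> tag simply becomes None here
    autoCheckList.append(links[0].get_attribute('href') if links else None)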

Scraping an onclick value in BeautifulSoup in Pandas

For class, we've been asked to scrape the North Korean News Agency's website: http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf
The question asks us to scrape the onclick values from the website. I've tried solving this in two different ways: by navigating the DOM tree, and by building a regex within a loop to systematically pull them out. I've failed on both counts.
Attempt 1:
onclick_soup = soup_doc.find_all('a', class_='titlebet')[0]
onclick_soup
Output:
<a class="titlebet" href="#this" onclick='fn_showArticle("AR0140322",
"", "NT00", "L")'>경애하는 최고령도자 <nobr><strong><font
style="font-size:10pt;">김정은</font></strong></nobr>동지께서 라오스인민혁명당 중앙위원회
총비서인 라오스인민민주주의공화국 주석에게 축전을 보내시였다</a>
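For what it's worth, the onclick string of that first element can be read directly off the tag (a one-liner building on the soup_doc above):

print(soup_doc.find_all('a', class_='titlebet')[0].get('onclick'))
# -> fn_showArticle("AR0140322", "", "NT00", "L")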
Attempt 2:
regex_for_onclick_soup = r"onclick='(.*?)\("
onclick_value_soup = soup_doc.find_all('a', class_='titlebet')
for onclick_value in onclick_value_soup:
    value = re.findall(regex_for_onclick_value, onclick_value)
    print(onclick_value)
Attempt 2 results in a TypeError.
I'm doing this in pandas. Any guidance would be helpful.
You can simply iterate over every tag in your HTML and check for the onclick attribute:
import requests
from bs4 import BeautifulSoup

page = requests.get('http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf')
soup = BeautifulSoup(page.content, 'lxml')
for tag in soup.find_all():
    on_click = tag.get('onclick')
    if on_click:
        print(on_click)
Note that when using find_all() without any arguments it will retrieve every tag. Then we check each tag for an onclick that is not None and print it out.
Outputs:
fn_convertLanguage('kor')
fn_convertLanguage('eng')
fn_convertLanguage('chn')
fn_convertLanguage('rus')
fn_convertLanguage('spn')
fn_convertLanguage('jpn')
GotoLogin()
register()
evalSearch()
...
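If you only want the function name inside each onclick (what the asker's regex was aiming for), a minimal sketch building on the soup above:

import re

# Pull the leading function name out of each onclick value, e.g. fn_showArticle
pattern = re.compile(r"^(\w+)\(")
for tag in soup.find_all(onclick=True):
    match = pattern.match(tag['onclick'])
    if match:
        print(match.group(1))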

Is there any way to get all the "inner html text" of a website and its corresponding coordinates using python selenium?

I'm able to get the div elements by using this code:
divs = driver.find_elements_by_xpath("//div")
and by looping through the divs and using the .text attribute, I'm able to get the text as well.
code:
for i in divs:
    print(i.text)
but in my use-case I want the location as well as the size of the text.
Please help!
My code:
for i in range(0, len(WEBSITES)):
    print(timestamp())  # timestamp
    print(i, WEBSITES[i])  # name of the website
    driver.get(WEBSITES[i])
    delay = 10
    time.sleep(delay)
    img = cv2.imread(os.getcwd() + '/' + str(i) + '.png')  # read the image to be inscribed
    print("getting div tags \n")
    divs = driver.find_elements_by_xpath("//div")  # find all the div tags
    # anchors = divs.find_elements_by_xpath("//*")  # find all the child tags in the divs
    for i in divs:
        print(i.text.location)
Whenever I try the .location or .size attribute I get a Unicode error.
Disclaimer: I have searched through all the posts, so this is not a duplicate question.
Can you try getting the coordinates of the div rather than the text? Like below:
for i in divs:
    print(i.location)
Edit
So if you want to get the coordinates of all the text in a page, get the text elements in the page like below and read their coordinates:
textElements = driver.find_elements_by_xpath("//body//*[text()]")  # Gets all text elements
for i in textElements:
    print(i.text)
    print(i.location)
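Since the question also asks for the size of the text, the same loop extends naturally (.location is a dict like {'x': ..., 'y': ...} and .size like {'height': ..., 'width': ...}):

for i in textElements:
    # Skip whitespace-only nodes so the output only shows visible text
    if i.text.strip():
        print(i.text, i.location, i.size)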

how to extract data from autocomplete box with selenium python

I am trying to extract data from a search box; you can see a good example on Wikipedia.
This is my code:
driver = webdriver.Firefox()
driver.get(response.url)
city = driver.find_element_by_id('searchInput')
city.click()
city.clear()
city.send_keys('a')
time.sleep(1.5)  # waiting for ajax to load
selen_html = driver.page_source
#print selen_html.encode('utf-8')
hxs = HtmlXPathSelector(text=selen_html)
ajaxWikiList = hxs.select('//div[@class="suggestions"]')
items = []
for city in ajaxWikiList:
    item = TestItem()
    item['ajax'] = city.select('/div[@class="suggestions-results"]/a/@title').extract()
    items.append(item)
print items
The XPath expression is OK; I checked it on a static page. If I uncomment the line that prints out the scraped HTML, the code for the box shows at the end of the file. But for some reason I can't extract data from it with the above code. I must be missing something, since I tried 2 different sources; the wikipedia page is just another source where I can't get the data extracted.
Any advice here? Thanks!
Instead of passing the .page_source, which in your case contains an empty suggestions div, get the innerHTML of the element and pass it to the Selector:
selen_html = driver.find_element_by_class_name('suggestions').get_attribute('innerHTML')
hxs = HtmlXPathSelector(text=selen_html)
suggestions = hxs.select('//div[@class="suggestions-results"]/a/@title').extract()
for suggestion in suggestions:
    print suggestion
Outputs:
Animal
Association football
Arthropod
Australia
AllMusic
African American (U.S. Census)
Album
Angiosperms
Actor
American football
Note that it would be better to use selenium's Waits feature to wait for the element to be accessible/visible; see:
How can I get Selenium Web Driver to wait for an element to be accessible, not just present?
Selenium waitForElement
Also, note that HtmlXPathSelector is deprecated, use Selector instead.
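For reference, a minimal sketch of the same extraction with the non-deprecated Selector, assuming a recent Scrapy (where .getall() replaces .extract()):

from scrapy.selector import Selector

sel = Selector(text=selen_html)
# Same XPath as above, run against the innerHTML of the suggestions div
for title in sel.xpath('//div[@class="suggestions-results"]/a/@title').getall():
    print(title)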
