python selenium search element by class name

I want to collect the detailed recommendation description paragraphs that a person has received on their LinkedIn profile, for example:
https://www.linkedin.com/in/teddunning/details/recommendations/
(This link can be viewed after logging in to any LinkedIn account.)
Here is my best try:
for index, row in df2.iterrows():
    linkedin = row['LinkedIn Website']
    current_url = f'{linkedin}/details/recommendations/'
    driver.get(current_url)
    time.sleep(random.uniform(2, 3))
    descriptions = driver.find_elements_by_xpath("//*[@class='display-flex align-items-center t-14 t-normal t-black']")
    s = 0
    for description in descriptions:
        s += 1
        print(description.text)
        df2.loc[index, f'RecDescription_{s}'] = description.text
The URLs I scraped into df2 are all similar to the example link above.
The code finds nothing in the descriptions variable.
My question is: which element should I use to find the detailed recommendation content under the "Received" tab? Thank you very much!

Well, you would first get the direct parent of the paragraphs. You can do that with XPath, a class, or an id, whatever fits best. After that you can call Your_Parent.find_elements(by=By.XPATH, value='./child::*') and loop over the result to get all paragraphs.
Edit
This selects all the paragraphs. I have not yet looked into separating them by post, but here is what I have so far:
parents_of_paragraphs = driver.find_elements(By.CSS_SELECTOR, "div.display-flex.align-items-center.t-14.t-normal.t-black")
text_total = ""
for element in parents_of_paragraphs:
    paragraph = element.find_element(by=By.XPATH, value='./child::*')
    text_total += f"{paragraph.text}\n"
print(text_total)
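If the paragraphs need to be grouped per recommendation, one possible approach is to iterate over each recommendation's container first and scope the paragraph lookup to it. A rough sketch, assuming each recommendation sits in its own list item (the li.pvs-list__item--line-separated selector is a guess, not verified against LinkedIn's current markup):
from selenium.webdriver.common.by import By

# assumed container for a single recommendation entry (unverified selector)
recommendations = driver.find_elements(By.CSS_SELECTOR, 'li.pvs-list__item--line-separated')
for recommendation in recommendations:
    # reuse the paragraph classes from the answer above, scoped to this entry
    paragraphs = recommendation.find_elements(
        By.CSS_SELECTOR, 'div.display-flex.align-items-center.t-14.t-normal.t-black')
    text = '\n'.join(p.text for p in paragraphs)
    print(text)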

Related

Scraping news articles using Selenium Python

I am learning to scrape news articles from the website https://tribune.com.pk/pakistan/archives. The first step is to scrape the link of every news article. The problem is that the <a> tag contains two hrefs in it, but I want to get only the first one, which I am unable to do.
I am attaching the HTML of that particular part.
The code I have written returns two href values, but I only want the first one:
def Url_Extraction():
    category_name = driver.find_element(By.XPATH, '//*[@id="main-section"]/h1')
    cat = category_name.text  # save category name in a variable
    print(f"{cat}")
    news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]//a")
    for element in news_articles:
        URL = element.get_attribute('href')
        print(URL)
        Url.append(URL)
        Category.append(cat)
    current_time = time.time() - start_time
    print(f'{len(Url)} urls extracted')
    print(f'{len(Category)} categories extracted')
    print(f'Current Time: {current_time / 3600:.2f} hr, {current_time / 60:.2f} min, {current_time:.2f} sec',
          flush=True)
Moreover, I am able to paginate, but I can't get the full article by clicking the individual links given on the main page.
You have to modify the below XPath.
Instead of this:
news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]//a")
Use this:
news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]/a")
The double slash in //a matches every descendant anchor of the div, which picks up the nested second link as well; the single slash in /a matches only direct children, so each div yields just its first, outer link.

Python Selenium - How do you extract a link from an element with no href? [duplicate]

I am trying to iterate through a series of car listings and return the links to the individual CarFax and Experian Autocheck documents for each listing.
(Screenshot: the page I am trying to pull the links from.)
The XPATH for the one constant parent element across all child elements I am looking for is:
.//div[@class="display-inline-block align-self-start"]/div[1]
I initially tried to simply extract the href attribute from the child <div> and <a> tags at this XPath: .//div[@class="display-inline-block align-self-start"]/div[1]/a[1]
This works great for some of the listings but does not work for others that do not have an <a> tag and instead include a <span> tag with an inline text link using text element "Get AutoCheck Vehicle History".
That link functions correctly on the page, but there is no href attribute or any link I can find attached to the element in the page and I do not know how to scrape it with Selenium. Any advice would be appreciated as I am new to Python and Selenium.
For reference, here is the code I was using to scrape the page (this eventually returns an IndexError, as only some of the listings have the <a> tag, so the final count does not match the total number of listings on the page indicated by len(name)):
s = Service('/Users/admin/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.autotrader.com/cars-for-sale/ferrari/458-spider/beverly-hills-ca-90210?dma=&searchRadius=0&location=&isNewSearch=true&marketExtension=include&showAccelerateBanner=false&sortBy=relevance&numRecords=100")
nameList = []
autoCheckList = []
name = driver.find_elements(By.XPATH, './/h2[@class="text-bold text-size-400 text-size-sm-500 link-unstyled"]')
autoCheck = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]/div[1]/a[1]')
for i in range(len(name)):
    nameList.append(name[i].text)
    autoCheckList.append(autoCheck[i].get_attribute('href'))
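One way to avoid that IndexError is to search for the link inside each listing individually and record a placeholder when the <a> tag is missing, so the names and links stay aligned. A sketch under that assumption, reusing the parent XPath from above (recording None for the <span> variant is my choice, since that markup exposes no href to extract):
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

containers = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]/div[1]')
for container in containers:
    try:
        link = container.find_element(By.XPATH, './a[1]').get_attribute('href')
    except NoSuchElementException:
        link = None  # <span> variant: no href attribute in the markup
    autoCheckList.append(link)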

Scraping an onclick value in BeautifulSoup in Pandas

For class, we've been asked to scrape the North Korean News Agency's website: http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf
The assignment asks us to scrape the onclick values on the website. I've tried solving this in two different ways: by navigating the DOM tree, and by building a regex within a loop to systematically pull the values out. I've failed on both counts.
Attempt 1:
onclick_soup = soup_doc.find_all('a', class_='titlebet')[0]
onclick_soup
Output:
<a class="titlebet" href="#this" onclick='fn_showArticle("AR0140322",
"", "NT00", "L")'>경애하는 최고령도자 <nobr><strong><font
style="font-size:10pt;">김정은</font></strong></nobr>동지께서 라오스인민혁명당 중앙위원회
총비서인 라오스인민민주주의공화국 주석에게 축전을 보내시였다</a>
Attempt 2:
regex_for_onclick_value = r"onclick='(.*?)\("
onclick_value_soup = soup_doc.find_all('a', class_='titlebet')
for onclick_value in onclick_value_soup:
    value = re.findall(regex_for_onclick_value, onclick_value)
    print(onclick_value)
Attempt 2 results in a TypeError.
I'm doing this in pandas. Any guidance would be helpful.
You can simply iterate over every tag in your HTML and check for the onclick attribute.
page = requests.get('http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf')
soup = BeautifulSoup(page.content, 'lxml')
for tag in soup.find_all():
    on_click = tag.get('onclick')
    if on_click:
        print(on_click)
Note that when find_all() is used without any arguments it retrieves every tag. We then check each of those tags for an onclick attribute that is not None and print it out.
Outputs:
fn_convertLanguage('kor')
fn_convertLanguage('eng')
fn_convertLanguage('chn')
fn_convertLanguage('rus')
fn_convertLanguage('spn')
fn_convertLanguage('jpn')
GotoLogin()
register()
evalSearch()
...
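The TypeError in Attempt 2 comes from passing the whole tag object to re.findall(), which expects a string. If the goal is just the function name before the parenthesis, a sketch applying the question's pattern idea to the attribute value instead (reusing the soup from above):
import re

for tag in soup.find_all():
    on_click = tag.get('onclick')
    if on_click:
        match = re.match(r"(\w+)\(", on_click)
        if match:
            print(match.group(1))  # e.g. fn_convertLanguage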

Webscraping with varying page numbers

So I'm trying to webscrape a bunch of profiles. Each profile has a collection of videos, and I'm trying to scrape information about each video. The problem I'm running into is that each profile uploads a different number of videos, so the number of pages containing videos varies per profile. For instance, one profile has 45 pages of videos, as you can see in the HTML below:
<div class="pagination "><ul><li><a class="active" href="">1</a></li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li><li>7</li><li>8</li><li>9</li><li>10</li><li>11</li><li class="no-page">...<li>45</li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>
While another profile has 2 pages
<div class="pagination "><ul><li><a class="active" href="">1</a></li><li>2</li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>
My question is, how do I account for the varying number of pages? I was thinking of making a for loop and just adding a number at the end, like
for i in range(0, 1000):
    new_url = 'url' + str(i)
where i accounts for the page, but I want to know if there's a more efficient way of doing this.
Thank you.
The "skeleton" of the loop can look like this:
url = 'http://url/?page={page}'
page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    # ...
    # do we have a next page?
    next_page = soup.select_one('.next-page')
    # no, so break from the loop
    if not next_page:
        break
    page += 1
You can have an infinite loop with while True: and break out only when there is no next page (when no class="next-page" tag is found on the last page).
Get the <li>...</li> elements of the <div class="pagination "><ul>.
Exclude the last one by its class, <li class="no-page">.
Parse the "href" values and build your next URL destinations.
Scrape every new URL destination. A sketch of these steps follows:
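A rough sketch of those steps with BeautifulSoup, assuming the pagination markup shown in the question; the profile URL and the ?page= scheme are placeholders, and I filter to numeric entries, which drops both the "..." and the "Next" items:
import requests
from bs4 import BeautifulSoup

base_url = 'http://example.com/profile/videos'  # placeholder profile URL
soup = BeautifulSoup(requests.get(base_url).content, 'html.parser')

pagination = soup.find('div', class_='pagination')
# keep only the <li> entries whose text is a page number
page_numbers = [li.get_text(strip=True) for li in pagination.find_all('li')
                if li.get_text(strip=True).isdigit()]

for number in page_numbers:
    page_url = f'{base_url}?page={number}'  # assumed URL scheme
    # scrape each page_url here
    print(page_url)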
I just want to thank everyone who took the time to answer my question. I figured out the answer, or at least what worked for me, and decided to share in case it helps anyone else.
url = 'insert url'
re = requests.get(url)
soup = BeautifulSoup(re.content, 'html.parser')
# look for pagination class
page = soup.find(class_='pagination')
# create list to include all page numbers
href = []
# look for all 'li' tags as the users above suggested
links = page.findAll('li')
for link in links:
    href += [link.find('a', href=True).text]
'''
href will now include all pages and the word Next.
For instance, it will look something like this: [1, 2, 3, ..., 44, Next].
I want to get 44, which will be href[-2], and then convert that to an int for
a for loop. In the for loop, add +1 because range stops at i-1 instead of i.
For instance, if you're iterating over (0, 44), the last value of i will be 43,
which is why we add 1.
'''
for i in range(0, int(href[-2]) + 1):
    new_url = url + str(i)

how to extract data from autocomplete box with selenium python

I am trying to extract data from a search box; you can see a good example on Wikipedia.
This is my code:
driver = webdriver.Firefox()
driver.get(response.url)
city = driver.find_element_by_id('searchInput')
city.click()
city.clear()
city.send_keys('a')
time.sleep(1.5)  # waiting for ajax to load
selen_html = driver.page_source
# print selen_html.encode('utf-8')
hxs = HtmlXPathSelector(text=selen_html)
ajaxWikiList = hxs.select('//div[@class="suggestions"]')
items = []
for city in ajaxWikiList:
    item = TestItem()
    item['ajax'] = city.select('/div[@class="suggestions-results"]/a/@title').extract()
    items.append(item)
print items
The XPath expression is OK; I checked it on a static page. If I uncomment the line that prints out the scraped HTML, the code for the box shows up at the end of the file. But for some reason I can't extract data from it with the above code. I must be missing something, since I tried two different sources; the Wikipedia page is just another source where I can't get these data extracted.
Any advice here? Thanks!
Instead of passing the .page_source which in your case contains an empty suggestions div, get the innerHTML of the element and pass it to the Selector:
selen_html = driver.find_element_by_class_name('suggestions').get_attribute('innerHTML')
hxs = HtmlXPathSelector(text=selen_html)
suggestions = hxs.select('//div[@class="suggestions-results"]/a/@title').extract()
for suggestion in suggestions:
    print suggestion
Outputs:
Animal
Association football
Arthropod
Australia
AllMusic
African American (U.S. Census)
Album
Angiosperms
Actor
American football
Note that it would be better to use Selenium's wait feature to wait for the element to become accessible/visible; see:
How can I get Selenium Web Driver to wait for an element to be accessible, not just present?
Selenium waitForElement
Also, note that HtmlXPathSelector is deprecated, use Selector instead.
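For example, a minimal sketch of such an explicit wait, assuming the div with class suggestions from the question:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the suggestions box to become visible
wait = WebDriverWait(driver, 10)
suggestions_div = wait.until(
    EC.visibility_of_element_located((By.CLASS_NAME, 'suggestions')))
selen_html = suggestions_div.get_attribute('innerHTML')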
