Trouble getting href - Instagram automation - python

I am not the first and will not be the last to run into this: I cannot get all hrefs from Instagram. Although it is a common task, I cannot get all hrefs from a class, and every solution I have tried so far has failed. I would appreciate a hand or a push in the right direction.
I am searching for a hashtag:
hashtags = '#hashtag'
search.send_keys(hashtags)
time.sleep(2)
search.send_keys(Keys.ENTER)
time.sleep(2)
search.send_keys(Keys.ENTER)
link_list=[]
links = driver.find_elements_by_class_name('Nnq7C weEfm')
for link in links:
    link_list.append(link.get_attribute('href'))
print(link_list)
There are several higher-level classes that select all of the pics, but none of them gives me an href.
I can get an href from v1Nh3 kIKUG _bz0w, the class defining an individual pic on the search results page. But although there are 33 v1Nh3 kIKUG _bz0w elements on the page, I get only one href.

links = [x.get_attribute("href") for x in driver.find_elements_by_xpath("//div[@class='v1Nh3 kIKUG _bz0w']/a")]
Just append /a to the class selector and get the hrefs like so. I'd look for a more suitable XPath, though, since that class name looks dynamic.
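If you want something less brittle, here is a minimal sketch (untested against the live page, and assuming post thumbnails are still anchors whose href contains "/p/") that targets the href pattern instead of the obfuscated class names:

# Sketch: select every anchor that links to an individual post ("/p/..." URLs)
# instead of relying on the auto-generated class names.
links = driver.find_elements_by_css_selector("a[href*='/p/']")
link_list = [link.get_attribute('href') for link in links]
print(link_list)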

Related

Python Beautiful Soup (Select only the one class with same name)

I am using Beautiful Soup to parse the elements of an email, and I have successfully been able to extract the links from a button in the email. However, the class name on the button appears twice in the email HTML, so two links are extracted and printed. I only need the first link, i.e. the first reference to the class with that name.
This is my code:
soup = BeautifulSoup(msg.html, 'html.parser')
for link in soup('a', class_='mcnButton', href=True):
    print(link['href'])
The 'mcnButton' class references two HTML buttons within the email, containing two separate links. I only need the first reference to the 'mcnButton' class and the link it contains.
The above code prints out two links (again, I only need the first):
https://leemarpet.us10.list-manage.com/track/XXXXXXX1
https://leemarpet.us10.list-manage.com/track/XXXXXXX2
I figured there should be a way to index and separately access the first reference to the class and link. Any assistance would be greatly appreciated, Thanks!
I tried select_one, find, and attempts to index the class, but unfortunately they resulted in a syntax error.
To find only the first element matching your pattern use .find():
soup.find('a', class_='mcnButton', href=True)
and to get its href:
soup.find('a', class_='mcnButton', href=True).get('href')
For more information, check the docs.
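Put together with the question's msg.html, a minimal sketch looks like this (the None check is just a guard in case the button is missing):

from bs4 import BeautifulSoup

soup = BeautifulSoup(msg.html, 'html.parser')

# .find() returns the first match or None, so guard before reading the attribute
first_button = soup.find('a', class_='mcnButton', href=True)
if first_button is not None:
    print(first_button.get('href'))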

Python Selenium: How do I print the correct tag?

I am trying to print by ID using Selenium. As far as I can tell, "a" is the tag and "title" is the attribute. See HTML below.
When I run the following code:
print(driver.find_elements(By.TAG_NAME, "a")[0].get_attribute('title'))
I get the output:
Zero Tolerance
So I'm getting the first attribute correctly. When I increment the code above:
print(driver.find_elements(By.TAG_NAME, "a")[1].get_attribute('title'))
My expected output is:
Aaliyah Love
However, I'm just getting a blank line. No errors. What am I doing incorrectly? Please don't suggest using XPath or CSS; I'm trying to learn Selenium tags.
HTML:
<a class=" Link ScenePlayer-ChannelName-Link styles_1lHAYbZZr4 Link ScenePlayer-ChannelName-Link styles_1lHAYbZZr4" href="/en/channel/ztfilms" title="Zero Tolerance" rel="">Zero Tolerance</a>
...
<a class=" Link ActorThumb-ActorImage-Link styles_3dXcTxVCON Link ActorThumb-ActorImage-Link styles_3dXcTxVCON" href="/[RETRACTED]/Aaliyah-Love/63565" title="Aaliyah Love"
Selenium locators are a toolbox and you're saying you only want to use a screwdriver (By.TAG_NAME) for all jobs. We aren't saying that you shouldn't use By.TAG_NAME, we're saying that you should use the right tool for the right job and sometimes (most times) By.TAG_NAME is not the right tool for the job. CSS selectors are WAY more powerful locators because they can search for not only tags but also classes, properties, etc.
It's hard to say for sure what's going on without access to the site/page. It could be that the entire page isn't loaded and you need to add a wait for the page to finish loading (maybe count links expected on the page?). It could be that your locator isn't specific enough and is catching other A tags that don't have a title attribute.
I would start by doing some debugging.
links = driver.find_elements(By.TAG_NAME, "a")
for link in links:
    print(link.get_attribute('title'))
What does this print?
If it prints some blank lines sprinkled throughout the actual titles, your locator is probably not specific enough. Try a CSS selector
links = driver.find_elements(By.CSS_SELECTOR, "a[title]")
for link in links:
    print(link.get_attribute('title'))
If instead it returns some titles and then nothing but blank lines, the page is probably not fully loaded. Try something like
count = 20 # the number of expected links on the page
link_locator = (By.TAG_NAME, "a")
WebDriverWait(driver, 10).until(lambda wd: len(wd.find_elements(*link_locator)) == count)
links = driver.find_elements(*link_locator)
for link in links:
    print(link.get_attribute('title'))
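As an alternative sketch, Selenium's built-in expected condition can replace the hand-rolled lambda; note it only waits until at least one matching element is present, not for an exact count:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link_locator = (By.CSS_SELECTOR, "a[title]")

# Waits until at least one matching element is in the DOM, then returns them all
links = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(link_locator))
for link in links:
    print(link.get_attribute('title'))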

How to open a list of links and scrape the text with Selenium

I am new to programming in Python and I want to write code to scrape text from articles on Reuters using Selenium. I'm trying to open the article links and then get the full text from each article, but it doesn't work. I would be glad if somebody could help me.
article_links1 = []
for link in driver.find_elements_by_xpath("/html/body/div[4]/section[2]/div/div[1]/div[4]/div/div[3]/div[*]/div/h3/a"):
    links = link.get_attribute("href")
    article_links1.append(links)
article_links = article_links1[:5]
article_links
This is a shortened list of the articles, so it doesn't take that long to scrape for testing. It contains 5 links; this is the output:
['https://www.reuters.com/article/idUSKCN2DM21B',
'https://www.reuters.com/article/idUSL2N2NS20U',
'https://www.reuters.com/article/idUSKCN2DM20N',
'https://www.reuters.com/article/idUSKCN2DM21W',
'https://www.reuters.com/article/idUSL3N2NS2F7']
Then I tried to iterate over the links and scrape the text out of the paragraphs, but it doesn't work.
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)
for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
    full_text.append(article_text.text)
full_text
The output is only the empty list:
[]
There are a couple issues with your current code. The first one is an easy fix. You need to indent your second for loop, so that it's within the for loop that is iterating through each article. Otherwise, you won't be adding anything to the full_text list until it gets to the last article. It should look like this:
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)
    for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
        full_text.append(article_text.text)
The second problem lies within your XPath. XPaths can be very long when they're generated automatically by a browser. (I'd suggest learning CSS selectors, which are pretty concise; a good place to learn them is CSS Diner.)
I've changed your find_elements_by_xpath() function to find_elements_by_css_selector(). You can see the example below.
for article_text in driver.find_elements_by_css_selector("article p"):
    full_text.append(article_text.text)
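Putting both fixes together, the whole loop could look like the sketch below (it reuses the question's driver and article_links; the "article p" selector assumes the article body paragraphs live inside an <article> element, as they did on the page at the time):

import time

full_text = []
for article in article_links:
    driver.get(article)
    time.sleep(5)  # crude fixed wait; an explicit WebDriverWait would be more robust
    # the nested loop is indented so each article's paragraphs are collected
    for article_text in driver.find_elements_by_css_selector("article p"):
        full_text.append(article_text.text)

print(full_text)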

get_attribute('href') returning None

I am learning Selenium. I am trying to scrape the Amazon website with Selenium. Here is the link I am trying to scrape.
From the above URL I am trying to extract all the elements with the class a-size-mini and get the link from each of these elements.
here is my code
links = driver.find_elements_by_class_name("a-size-mini")
for link in links:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.LINK_TEXT, link.text)))
    print(element.get_attribute('href'))
But this is returning None. I am not sure what I am doing wrong. The length of the links list shows as 55, and when I try to print the element variable I get the following:
<selenium.webdriver.remote.webelement.WebElement (session="121606058bd493d1a70fc957699d7f6d", element="c3dd6f5b-a9bb-409c-8ee2-666cac7e7432")>
So these variables are not empty or None. But when I try to extract the link using the get_attribute('href') method, it returns None.
Please help me out. Thanks in advance
Please use this command.
links = driver.find_elements_by_xpath('//h2[contains(@class, "a-size-mini")]/a')
The a-size-mini class sits on the <h2> headings, which don't carry an href themselves, so you need to reach the <a> inside each one; an XPath (or CSS selector) lets you do that in a single locator instead of matching by class name alone.
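For completeness, a minimal sketch of iterating those anchors (assuming the h2.a-size-mini > a structure the XPath above relies on):

# Each matched element is the <a> inside an <h2 class="a-size-mini ...">,
# so it carries the href directly and no extra wait or lookup is needed.
links = driver.find_elements_by_xpath('//h2[contains(@class, "a-size-mini")]/a')
for link in links:
    print(link.get_attribute('href'))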

Find a link by href in selenium python

Let's take the example of Spotify, because I'm listening to music on it right now.
I would like to get the value of the href attribute in the following code.
<a data-testid="nowplaying-track-link" href="/album/3xIwVbGJuAcovYIhzbLO3J">Toosie Slide</a>
What I want is to get "/album/3xIwVbGJuAcovYIhzbLO3J" or, if that's not possible, "Toosie Slide", in order to store it in a variable and compare it with a constant.
The difficulty with Spotify (and many other sites) is that this kind of href appears several times on the page, so I'd like to get only the link inside the element whose data-testid is "nowplaying-track-link".
There, I hope I was clear.
PS: I already know commands like driver.find_element_by_xpath, etc., but I can't use them in this case...
I'm not sure what you mean about not being able to use those commands, but this is how you would get the info you're seeking:
element = driver.find_element_by_css_selector('[data-testid="nowplaying-track-link"]')
href = element.get_attribute('href')
element_text = element.text
If you want to put together the full link, you can do it this way:
link = driver.current_url + href
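Note that in practice Selenium's get_attribute('href') usually returns the already-resolved absolute URL; if you do end up with a relative path, urljoin is a safer way to combine it with the current page than plain string concatenation (a small sketch):

from urllib.parse import urljoin

# Resolves a relative href against the current page URL, handling paths
# and trailing segments correctly instead of naive concatenation.
link = urljoin(driver.current_url, href)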
