Data-e2e to list in Selenium - python

I'm scraping TikTok with Selenium; my goal is to extract the views, likes, comments, and shares of the videos made with a given audio.
In the process I found this element:
<div data-e2e="music-item-list" mode="compact" class="tiktok-yvmafn-DivVideoFeedV2 e5w7ny40">
This element contains the audio's videos, but they sit in nested <div>s rather than <li> items.
How do I turn those <div>s into a list that I can manipulate?
This is what I did:
url = 'https://www.tiktok.com/music/Sweater-Weather-Sped-Up-7086537183875599110'
driver.get(url)
posts = driver.find_element(By.XPATH, '//div[@data-e2e="music-item-list"]')
post1 = posts[0]

A proper way to locate those elements is to wait for them first, then locate them as a list and access its items:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
wait = WebDriverWait(driver, 20)
[...]
posts = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@data-e2e="music-item-list"]/div')))
for post in posts:
    print(post.text)
Selenium documentation: https://www.selenium.dev/documentation/
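Since presence_of_all_elements_located resolves to a plain Python list of WebElements, you can index, slice, or filter it like any other list. A minimal sketch of pulling data out of one card (the nested data-e2e value "video-views" is an assumption for illustration only; inspect the live DOM to confirm the real attribute names):
post1 = posts[0]      # first video card
top_five = posts[:5]  # slicing works as with any list
# Hypothetical nested locator -- confirm the attribute in DevTools first
views = post1.find_element(By.XPATH, './/*[@data-e2e="video-views"]').text
print(views)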

Related

Can't get tweets with selenium in python

I am trying to build a tweet scraper for my NLP project, but I can't get any tweets.
Here is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
query = 'mutluluk'
URL = 'https://twitter.com/search?q=' + query + '&src=typed_query&f=live'
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(URL)
wait.until(EC.title_contains(query + ' - Twitter Araması / Twitter'))
tweets = driver.find_elements_by_css_selector("div#tweet-text").text
print(tweets)
The page that is returned does not have the title you expect; your wait condition is too specific. If you change it to:
wait.until(EC.title_contains(query))
or
wait.until(EC.title_contains(query + ' - Twitter'))
you'll get a page of tweets. After that, I don't think you have the right CSS selector, because it finds no matching element, so you need to further investigate the page contents with the developer tools.
You can wait until the elements you are searching for are present, rather than waiting explicitly for a text in the page title.
Your CSS selector is too fragile for this kind of website. I recommend using XPath, because big sites generally randomize the classes of most elements in the DOM, so parsing the document by class will not be easy for beginners.
Use this snippet and you will get the text of your elements:
elements = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//main//article")))
for ele in elements:
    print(ele.text)
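Note that the live-search feed loads more tweets as you scroll, so if one screenful is not enough you can scroll and re-collect in a loop. A rough sketch under that assumption, reusing the //main//article locator and the driver from above:
import time

collected = set()
for _ in range(5):  # scroll through five screens
    for ele in driver.find_elements(By.XPATH, "//main//article"):
        collected.add(ele.text)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the new tweets time to render
print(len(collected), "unique tweets collected")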

Can't grab all the pdf links within a table from a webpage

I've written a script in Python with Selenium to scrape the different pdf links that are revealed when you click the numbered entries (110015710, 110015670, etc.) in a table on a webpage.
Site link
My script can click those links and reveal the pdf files, but it parses only 5 of them out of many.
How can I get them all?
I've tried so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "replace_with_above_link"
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(link)
[driver.execute_script("arguments[0].click();",item) for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"tr.Iec")))]
for elem in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".IecAttachments li a[href$='.pdf']"))):
    print(elem.get_attribute("href"))
driver.quit()
When you click an element it fires an XHR request for that row's pdf links, so add a delay after every click:
import time

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "tr.Iec"))):
    driver.execute_script("arguments[0].click();", item)
    time.sleep(1)
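A fixed sleep works, but it is fragile: too short and you miss links, too long and you waste time. A sketch of a sturdier alternative, assuming each click appends at least one pdf link to the attachment list, is to wait until the link count actually grows after each click:
pdf_css = ".IecAttachments li a[href$='.pdf']"
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "tr.Iec"))):
    before = len(driver.find_elements(By.CSS_SELECTOR, pdf_css))
    driver.execute_script("arguments[0].click();", item)
    # block until this row's XHR has appended its links
    wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, pdf_css)) > before)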

Selenium - Get full html of site (maybe need to wait?)

I apologise in advance for the (probably) very basic question. I spent a lot of time searching forums but my knowledge is too poor to make sense of the results.
I just need to get the HTML after the page has finished loading, as almost all of the content sits inside <div id="root"></div>, but at the moment I just get that one line and nothing inside it.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
browser = webdriver.Chrome() #replace with .Firefox(), or with the browser of your choice
url = "https://beta.footballindex.co.uk/top-200"
browser.get(url) #navigate to the page
innerHTML = browser.execute_script("return document.body.innerHTML") #returns the inner HTML as a string
print(innerHTML)
Returns:
<div id="root"></div>
<script src="https://static.footballindex.co.uk/bundle_1537553245755.js"></script>
And this matches the innerHTML you see with 'view page source'. But if I inspect the element in my browser, I can expand <div id="root"></div>, see all the content inside, and copy the HTML manually.
How do I get this automatically?
Many thanks in advance.
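The root div is empty in the source because the bundled script renders everything client-side; the usual fix, as in the other answers on this page, is to wait for something inside #root to exist before reading the DOM. A sketch along those lines (the "#root *" locator just waits for any rendered child; swap in a more specific element you know appears once loading is done):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get("https://beta.footballindex.co.uk/top-200")
# wait until the client-side render has put something inside the root div
WebDriverWait(browser, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#root *")))
innerHTML = browser.execute_script("return document.body.innerHTML")
print(innerHTML)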

Not able to Scrape data using BeautifulSoup

I'm using Selenium to log in to the webpage and fetch it for scraping, and I'm able to get the page.
I searched the HTML for the table that I want to scrape; here it is:
<table cellspacing="0" class=" tablehasmenu table hoverable sensors" id="table_devicesensortable">
This is the script:
rawpage = driver.page_source  # store the webpage in a variable
souppage = BeautifulSoup(rawpage, 'html.parser')  # parse the webpage
tbody = souppage.find('table', attrs={'id': 'table_devicesensortable'})  # scrape the table
I'm able to get the parsed page into the souppage variable, but the table never ends up in tbody.
The required table might be generated dynamically, so you need to wait until it is present on the page:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
tbody = wait(driver, 10).until(EC.presence_of_element_located((By.ID, "table_devicesensortable")))
Also note that there is no need to use BeautifulSoup here, as Selenium has enough built-in methods and properties to do the same job for you, e.g.
headers = tbody.find_elements(By.TAG_NAME, "th")
rows = tbody.find_elements(By.TAG_NAME, "tr")
cells = tbody.find_elements(By.TAG_NAME, "td")
cell_values = [cell.text for cell in cells]
etc...
I was searching Stack Overflow for this issue and came across this post:
BeautifulSoup returning none when element definitely exists
The answer provided by luiyezheng gave me the hint that the data might be fetched dynamically. So the table is probably created dynamically, and that is why I was unable to find it.
The workaround is to put a delay before storing the webpage, so the code goes like this:
import time

time.sleep(4)
rawpage = driver.page_source  # store the webpage in a variable
souppage = BeautifulSoup(rawpage, "html.parser")  # parse the webpage
tbody = souppage.find("table", {"id": "table_devicesensortable"})  # scrape the table
I hope it helps someone.
As per the HTML you have shared, to scrape the <table> you have to induce WebDriverWait with the expected_conditions clause set to presence_of_element_located, and to achieve that you can use either of the following code blocks:
Using class:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class=' tablehasmenu table hoverable sensors' and @id='table_devicesensortable']")))
rawpage = driver.page_source  # store the webpage in a variable
souppage = BeautifulSoup(rawpage, "html.parser")  # parse the webpage
tbody = souppage.find("table", {"class": " tablehasmenu table hoverable sensors"})  # scrape by class
Using id:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class=' tablehasmenu table hoverable sensors' and @id='table_devicesensortable']")))
rawpage = driver.page_source  # store the webpage in a variable
souppage = BeautifulSoup(rawpage, "html.parser")  # parse the webpage
tbody = souppage.find("table", {"id": "table_devicesensortable"})  # scrape by id
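Once tbody holds the parsed <table>, pulling the cell text out with BeautifulSoup takes only a couple of lines; a short sketch:
rows = []
for tr in tbody.find_all("tr"):
    # collect header and data cells alike, stripped of whitespace
    rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])
print(rows[:3])  # first three rows of the sensor table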

href selector of a link (xpath or css)

I'm trying to grab the href element of each shoe in this site:
http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/
But I can't get the proper selectors right.
response.xpath('.//*[#class="newnav itemnamelink"]')
[]
Does anyone know how I would do this in XPath or CSS?
The required links are generated dynamically, so you won't be able to scrape them from the raw HTML source you get with requests.get("http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/").
You might use Selenium to get the required values via a browser session:
from selenium import webdriver as web
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = web.Chrome()
driver.get('http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/')
wait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table[@class='getproductdisplay-innertable']")))
links = [link.get_attribute('href')
         for link in driver.find_elements(By.XPATH, '//a[@class="newnav itemnamelink"]')]
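Since the question asks for XPath or CSS: the CSS form of the same locator would be a.newnav.itemnamelink. Note the subtle difference: the XPath above compares the whole class attribute string, while the CSS version matches any element carrying both classes, whatever else the attribute contains:
links = [link.get_attribute('href')
         for link in driver.find_elements(By.CSS_SELECTOR, 'a.newnav.itemnamelink')]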
