Not able to scrape text from a website using Selenium - python

I want to get the values for the fields shown in the attached picture. This is my sample code; it isn't fetching the required fields, and any corrections are welcome.
span_xpath = "//div[@id='se-siteDetailsPanel-panel']"
# name:
name_xpath = "//div[@id='se-siteDetailsPanel-name']" + span_xpath
site_data.append(browser.find_element_by_xpath(name_xpath).text)
# address:
adrs1_xpath = "//div[@id='se-siteDetailsPanel-firstAddress']" + span_xpath
adrs2_xpath = "//div[@id='se-siteDetailsPanel-address']" + span_xpath
address = browser.find_element_by_xpath(adrs1_xpath).text + \
          browser.find_element_by_xpath(adrs2_xpath).text
site_data.append(address)
# installed:
installed_xpath = "//div[@id='se-siteDetailsPanel-installationDate']" + span_xpath
site_data.append(browser.find_element_by_xpath(installed_xpath).text)
# updated:
updated_xpath = "//div[@id='se-siteDetailsPanel-lastUpdateTime']" + span_xpath
site_data.append(browser.find_element_by_xpath(updated_xpath).text)
# peak:
peak_xpath = "//div[@id='se-siteDetailsPanel-peakPower']" + span_xpath
peak = browser.find_element_by_xpath(peak_xpath).text
site_data.append(peak.split()[0])

You can try using an XPath or By.ID.
If you cannot work out the XPath, try the ChroPath extension in Chrome; it makes finding the XPath easy.
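For example, a minimal sketch of the by-id route, reusing the panel ids from the question (whether the visible value lives directly in that div or in a child span is an assumption about the page's markup):

# hypothetical example based on the ids posted in the question
name = browser.find_element_by_id("se-siteDetailsPanel-name").text  # .text includes text of child elements
site_data.append(name)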

@itronic1990... your hint worked. I saw that the span XPath was wrong; I corrected the span path and it started fetching the values.


Is there a way to do multiple screenshots using selenium?

I have code to check whether an Instagram account exists or not:
import requests

exist = []
url = []
for i in cli:
    r = requests.get("https://www.instagram.com/" + i + "/")
    if r.apparent_encoding == 'Windows-1252':
        exist.append(i)
        url.append("instagram.com/" + i + "/")

exist
['duolingoenglishtest',
 'duolingo',
 'duolingoespanol',
 'duolingofrance']
I want to take a screenshot of each Instagram account. I think I have found a way to capture each account, but I don't know how to change the screenshot's filename for each image.
for ss in exist:
    driver.get("https://www.instagram.com/" + ss + "/")
    time.sleep(5)
    screenshot = driver.save_screenshot('Pictures/Insta2.png')
driver.quit()
I really appreciate the help,
Thanks!
You could use your exist entries as filenames:
screenshot = driver.save_screenshot('Pictures/' + ss + '.png')
Or set up a numbering scheme:
i = 1
for ss in exist:
    driver.get("https://www.instagram.com/" + ss + "/")
    time.sleep(5)
    screenshot = driver.save_screenshot('Pictures/Insta' + str(i) + '.png')
    i += 1
driver.quit()
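As a side note, Python's built-in enumerate can manage the counter for you; a sketch of the same loop:

# enumerate yields (index, item) pairs; start=1 begins the numbering at 1
for i, ss in enumerate(exist, start=1):
    driver.get("https://www.instagram.com/" + ss + "/")
    time.sleep(5)
    driver.save_screenshot('Pictures/Insta' + str(i) + '.png')
driver.quit()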

Find the xpath with get_attribute() in python selenium

This is a somewhat backwards approach to web scraping. I need to locate the XPath of a web element AFTER I have already found it with a text()= identifier.
Because the XPath values differ based on what information shows up, I need to use predictable labels inside the row to locate the span text next to the found element. A simple and reliable way I found is locating the keyword label and then increasing the td integer by one inside the XPath.
def x_label(self, contains):
    mls_data_xpath = f"//span[text()='{contains}']"
    string = self.driver.find_element_by_xpath(mls_data_xpath).get_attribute("xpath")
    digits = string.split("td[")[1]
    num = int(re.findall(r'(\d+)', digits)[0]) + 1
    labeled_data = f'{string.split("td[")[0]}td[{num}]/span'
    print(labeled_data)
    labeled_text = self.driver.find_element_by_xpath(labeled_data).text
    return labeled_text
I cannot find much information on .get_attribute() and .get_property(), so I am hoping there is something like .get_attribute("xpath"), but I haven't been able to find it.
Basically, I am taking in a string like "ApprxTotalLivArea" which I can rely on, and then increasing the integer after td[ by 1 to find the span data from the cell next door. I am hoping there is something like get_attribute("xpath") to locate the XPath string from the element I find through my text()='{contains}' search.
The remote WebElement does include the following methods:
get_attribute()
get_dom_attribute()
get_property()
But xpath isn't a valid attribute or property of a WebElement, so get_attribute("xpath") will always return None.
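A quick sketch to illustrate (the span locator is borrowed from the question's code):

element = driver.find_element_by_xpath("//span[text()='ApprxTotalLivArea']")
print(element.get_attribute("class"))   # value of the DOM attribute, or None if the element has none
print(element.get_attribute("xpath"))   # always None -- "xpath" is not a DOM attribute or property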
I was able to find a Python version of the execute-script approach from this post, based on a JavaScript answer in another forum. I had to make a lot of .replace() calls on the string this function creates, but I was able to universally find the label string I need and increment the td/span XPath by +1 to find the column data and retrieve it, regardless of differences in XPath values on different page listings.
def x_label(self, contains):
    label_contains = f"//span[contains(text(), '{contains}')]"
    print(label_contains)
    labeled_element = self.driver.find_element_by_xpath(label_contains)
    print(labeled_element)
    element_label = labeled_element.text
    print(element_label)
    self.driver.execute_script("""
        window.getPathTo = function (element) {
            if (element.id !== '')
                return 'id("' + element.id + '")';
            if (element === document.body)
                return element.tagName;
            var ix = 0;
            var siblings = element.parentNode.childNodes;
            for (var i = 0; i < siblings.length; i++) {
                var sibling = siblings[i];
                if (sibling === element)
                    return window.getPathTo(element.parentNode) + '/' + element.tagName + '[' + (ix + 1) + ']';
                if (sibling.nodeType === 1 && sibling.tagName === element.tagName)
                    ix++;
            }
        }
    """)
    generated_xpath = self.driver.execute_script("return window.getPathTo(arguments[0]);", labeled_element)
    generated_xpath = f'//*[@{generated_xpath}'.lower().replace('tbody[1]', 'tbody')
    print(f'generated_xpath = {generated_xpath}')
    expected_path = r'//*[@id="wrapperTable"]/tbody/tr/td/table/tbody/tr[26]/td[6]/span'
    generated_xpath = generated_xpath.replace('[@id("wrappertable")', '[@id="wrapperTable"]').replace('tr[1]', 'tr')
    clean_path = generated_xpath.replace('td[1]', 'td').replace('table[1]', 'table').replace('span[1]', 'span')
    print(f'clean_path = {clean_path}')
    print(f'expected_path = {expected_path}')
    digits = generated_xpath.split("]/td[")[1]
    print(digits)
    num = int(re.findall(r'(\d+)', digits)[0]) + 1
    print(f'Number = {num}')
    labeled_data = f'{clean_path.split("td[")[0]}td[{num}]/span'
    print(f'labeled_data = {labeled_data}')
    print(f'expected_path = {expected_path}')
    if labeled_data == expected_path:
        print('Congrats')
    else:
        print('Rats')
    labeled_text = self.driver.find_element_by_xpath(labeled_data).text
    print(labeled_text)
    return labeled_text
This function iteratively gets the parent until it hits the html element at the top:
from selenium import webdriver
from selenium.webdriver.common.by import By

def get_xpath(elm):
    e = elm
    xpath = elm.tag_name
    while e.tag_name != "html":
        e = e.find_element(By.XPATH, "..")
        neighbours = e.find_elements(By.XPATH, "../" + e.tag_name)
        level = e.tag_name
        if len(neighbours) > 1:
            level += "[" + str(neighbours.index(e) + 1) + "]"
        xpath = level + "/" + xpath
    return "/" + xpath

driver = webdriver.Chrome()
driver.get("https://www.stackoverflow.com")
login = driver.find_element(By.XPATH, "//a[text() ='Log in']")
xpath = get_xpath(login)
print(xpath)
assert login == driver.find_element(By.XPATH, xpath)
Hope this helps!

Web Scraping Without Getting Blocked [duplicate]

This question already has answers here:
Website blocking Selenium : is there a way to bypass?
(2 answers)
Closed 3 years ago.
I read a lot of posts on the topic, and also tried some of this article's advice, but I am still blocked.
https://www.scraperapi.com/blog/5-tips-for-web-scraping
1. IP Rotation: done. I'm using a VPN and often changing IP (but not DURING the script, obviously).
2. Set a Real User-Agent: implemented fake-useragent, with no luck (see the sketch after this list).
3. Set other request headers: tried with Selenium Wire, but how do I use it at the same time as 2.?
4. Set random intervals in between your requests: done, but at the present time I cannot even access the starting home page!
5. Set a referer: same as 3.
6. Use a headless browser: no clue.
7. Avoid honeypot traps: same as 4.
10. irrelevant
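For reference, wiring fake-useragent into Selenium usually looks something like the sketch below (the ChromeDriver path is taken from the snippet further down; whether a randomized User-Agent helps against this particular site is uncertain):

from fake_useragent import UserAgent
from selenium import webdriver

options = webdriver.ChromeOptions()
# pick a random real-world User-Agent string for this browser session
options.add_argument("user-agent=" + UserAgent().random)
driver = webdriver.Chrome(executable_path="chromedriver", options=options)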
The website I want to scrape: https://www.winamax.fr/paris-sportifs/
Without Selenium: it goes smoothly to a page with some games and their odds, and I can navigate from there.
With Selenium: the page shows a "Winamax est actuellement en maintenance" ("Winamax is currently under maintenance") message, with no games and no odds.
Try to execute this piece of code and you might get blocked quite quickly:
from selenium import webdriver
import time
from time import sleep
import json

driver = webdriver.Chrome(executable_path="chromedriver")
driver.get("https://www.winamax.fr/paris-sportifs/")  # I'm even blocked here now!!!
toto = driver.page_source.splitlines()
titi = {}
matchez = []
matchez_detail = []
resultat_1 = {}
resultat_2 = {}
taratata = 1
comptine = 1

for tut in toto:
    if tut[0:53] == "<script type=\"text/javascript\">var PRELOADED_STATE = ":
        titi = json.loads(tut[53:tut.find(";var BETTING_CONFIGURATION = ")])

for p_id in titi.items():
    if p_id[0] == "sports":
        for fufu in p_id:
            if isinstance(fufu, dict):
                for tyty in fufu.items():
                    resultat_1[tyty[0]] = tyty[1]["categories"]

for p_id in titi.items():
    if p_id[0] == "categories":
        for fufu in p_id:
            if isinstance(fufu, dict):
                for tyty in fufu.items():
                    resultat_2[tyty[0]] = tyty[1]["tournaments"]

for p_id in resultat_1.items():
    for tgtg in p_id[1]:
        for p_id2 in resultat_2.items():
            if str(tgtg) == p_id2[0]:
                for p_id3 in p_id2[1]:
                    matchez.append("https://www.winamax.fr/paris-sportifs/sports/" + str(p_id[0]) + "/" + str(tgtg) + "/" + str(p_id3))

for alisson in matchez:
    print("compet " + str(taratata) + "/" + str(len(matchez)) + " : " + alisson)
    taratata = taratata + 1
    driver.get(alisson)
    sleep(1)
    elements = driver.find_elements_by_xpath("//*[@id='app-inner']/div/div[1]/span/div/div[2]/div/section/div/div/div[1]/div/div/div/div/a")
    for elm in elements:
        matchez_detail.append(elm.get_attribute("href"))

for mat in matchez_detail:
    print("match " + str(comptine) + "/" + str(len(matchez_detail)) + " : " + mat)
    comptine = comptine + 1
    driver.get(mat)
    sleep(1)
    elements = driver.find_elements_by_xpath("//*[@id='app-inner']//button/div/span")
    for elm in elements:
        elm.click()
        sleep(1)  # and after, my specific code to scrape what I want
I recommend using requests. I don't see a reason to use Selenium, since you said requests works, and requests can work with pretty much any site as long as you are using appropriate headers. You can see the headers needed by looking at the developer console in Chrome or Firefox and checking the request headers.
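A minimal sketch of that suggestion; the header values below are only illustrative, so copy the real ones your browser sends from the developer tools (Network tab, request headers):

import requests

headers = {
    # example values -- replace with the headers your own browser actually sends
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "fr-FR,fr;q=0.9,en;q=0.8",
    "Referer": "https://www.winamax.fr/",
}
response = requests.get("https://www.winamax.fr/paris-sportifs/", headers=headers)
print(response.status_code)  # 200 means the request got through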

How To Scrape Specific Characters in Selenium using Python

I want to scrape 70 characters from this HTML code:
<p>2) Proof of payment emailed to satrader03<strong>@gmail.com</strong> direct from online banking 3) Selfie of you holding your ID 4) Selfie of you holding your bank card from which payment will be made OR 5) Skype or what's app Video call while logged onto online banking displaying account name which should match personal verified name Strictly no 3rd party payments</p>
I want to know how to scrape a specific number of characters with Selenium, for example 30 characters, or some other count.
Here is my code:
description = driver.find_elements_by_css_selector("p")
items = len(title)
with open('btc_gmail.csv', 'a', encoding="utf-8") as s:
    for i in range(items):
        s.write(str(title[i].text) + ',' + link[i].text + ',' + description[i].text + '\n')
How do I scrape 30 characters, or 70, or some other number?
Edit (full code):
from random import randrange
from selenium import webdriver
import time

driver = webdriver.Firefox()
r = randrange(3, 7)
for url_p in url_pattren:
    time.sleep(3)
    url1 = 'https://www.bing.com/search?q=site%3alocalbitcoins.com+%27%40gmail.com%27&qs=n&sp=-1&pq=site%3alocalbitcoins+%27%40gmail.com%27&sc=1-31&sk=&cvid=9547A785CF084BAE94D3F00168283D1D&first=' + str(url_p) + '&FORM=PERE3'
    driver.get(url1)
    time.sleep(r)
    title = driver.find_elements_by_tag_name('h2')
    link = driver.find_elements_by_css_selector("cite")
    description = driver.find_elements_by_css_selector("p")
    items = len(title)
    with open('btc_gmail.csv', 'a', encoding="utf-8") as s:
        for i in range(items):
            s.write(str(title[i].text) + ',' + link[i].text + ',' + description[i].text[30:70] + '\n')
Any Solution?
You can get the text of the tag and then use a slice on the string:
>>> description = driver.find_elements_by_css_selector("p")[0].text
>>> print(description[30:70])  # printed from 30th to 70th symbol
'satrader03<strong>@gmail.com</strong>'
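Note that Python slicing is zero-based with an exclusive end index, so description[30:70] returns the 40 characters at indices 30 through 69; out-of-range bounds are simply clipped rather than raising an error.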

Excluding 'duplicated' scraped URLs in Python app?

I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of the threads. So far so good, except that it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
Ideally, I want to scrape just one of these:
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests

def get_source(url):
    return requests.get(url).content

def is_forum_link(self):
    return self.find('special string') != -1

def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")

main_url = "http://example.com/forum/"
forumLinks = fetch_all_links_with_word(main_url, "forums")

forums = []
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])
print('Fetched ' + str(len(forums)) + ' forums')

threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink
print('Fetched ' + str(len(threads)) + ' threads')
This solution assumes that what should be removed from the URL to check for uniqueness is always "/page-#...". If that is not the case, this solution will not work.
Instead of using a list to store your URLs, you can use a set, which only keeps unique values. Then, before adding a URL to the set, remove the last instance of "page" and anything after it, if it is in the format "/page-#", where # is any number.
forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
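An equivalent, slightly more compact sketch using a regular expression to strip the trailing "/page-<number>" segment (same assumption about the URL format):

import re

forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        # drop a trailing "page-<digits>" segment, keeping the slash before it
        forums.add(re.sub(r'page-\d+$', '', link.attrs['href']))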
