Instead of writing 10+ if statements, I'm trying to create a single one driven by a variable. Unfortunately, I'm not familiar with how to do string concatenation for XPath in Python. Can anyone show me how to perform string formatting for the following code segments?
I would greatly appreciate it, thanks.
if page_number == 1:
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    page_number = page_number + 1
    time.sleep(30)
elif page_number == 2:
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 3"]')
    next_link.click()
    page_number = page_number + 1
    time.sleep(30)
This answer is not about string concatenation, but about a simpler solution to the problem: instead of clicking a particular link in the pagination, you can click the "Next" button:
pages_number = 10
for _ in range(pages_number):
    driver.find_element_by_xpath("//a[@title='Go to next page']").click()
    time.sleep(30)
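If the pagination loads asynchronously, an explicit wait is usually more reliable than a fixed time.sleep. Here is a minimal sketch of the same loop using WebDriverWait, assuming the same title attribute on the link:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

pages_number = 10
for _ in range(pages_number):
    # wait up to 30 seconds for the "Next" link to become clickable, then click it
    WebDriverWait(driver, 30).until(
        EC.element_to_be_clickable((By.XPATH, "//a[@title='Go to next page']"))
    ).click()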
If you need to open a specific page, you can use the below:
required_page = 3
driver.find_element_by_link_text(str(required_page)).click()  # link text must be a string
P.S. I assumed you are talking about this site
You can use a for loop.
Example:
for i in range(1, 10):
    next_link = browser.find_element_by_xpath('//*[@title="Go to page {0}"]'.format(i + 1))  # using str.format for the page number
    next_link.click()
    time.sleep(30)
Just like the title says: how do I write Python code that replaces part of a URL? For this example, replace a specific part with 1, 2, 3, 4 and so on for this link (https://test.com/page/1), then do something on that page, go to the next one, and repeat.
So: "open URL > click on button or whatever > replace link with the new link carrying the next number in order".
(I know my code is a mess, I am still a newbie, but I am trying to learn, and I am adding whatever mess I've written so far to follow the posting rules.)
PATH = Service(r"C:\Program Files (x86)\chromedriver.exe")
driver = webdriver.Chrome(service=PATH)
driver.maximize_window()

get = 1
url = "https://test.com/page/{get}"
while get < 5:
    driver.get(url)
    time.sleep(1)
    driver.find_element_by_xpath("/html/body/div/div/div[2]/form/section[3]/input[4]").click()
    get = get + 1
    driver.get(url)
driver.close()
get = 1
url = f"https://test.com/page/{get}"
while get < 5:
    driver.get(url)
    driver.find_element_by_xpath("/html/body/div/div/div[2]/form/section[3]/input[4]").click()
    print(get)
    print(url)
    get += 1
    url = f"https://test.com/page/{get}"
This simply updates the url inside the loop.
Output:
1
https://test.com/page/1
2
https://test.com/page/2
3
https://test.com/page/3
4
https://test.com/page/4
Use the range() function and string interpolation (f-strings) as follows:
for i in range(1, 5):
    print(f"https://test.com/page/{i}")
    driver.get(f"https://test.com/page/{i}")
    driver.find_element_by_xpath("/html/body/div/div/div[2]/form/section[3]/input[4]").click()
Console Output:
https://test.com/page/1
https://test.com/page/2
https://test.com/page/3
https://test.com/page/4
I am working on web-scraping code which needs to scrape multiple websites. I use the Selenium WebDriver to do that, and find_element_by_xpath to extract the information that I want. Here is my code:
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=op)
driver.get(web)

content1 = driver.find_elements_by_class_name('k-master-row')
for c1 in content1:
    for i in range(1, 401):
        amount1 = i
        try:
            caseid = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[2]/a'.format(amount1)).text.strip()
            case_id.append(caseid)
        except:
            pass
        try:
            style = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[3]/div'.format(amount1)).text.strip()
            defendant.append(style)
        except:
            defendant.append('')
The code works perfectly fine, but I have to manually set the range every time I scrape a different URL. //*[@id="CasesGrid"]/table/tbody/tr[1]/td[2]/a is the XPath, which is identical across all the URLs that I need to scrape. The only thing that differs is the tr[1] index: it can range from 1 to 500 or 600. If the website I am scraping runs from 1 to 35, then I have to manually change the range to (1, 35), which is very time-consuming with every new URL. I am wondering whether there is a better way to set the range so that it stops at the end of whichever page I scrape, and I don't need to manually inspect the XPath to find the end number.
Thank you all!!
Use an infinite loop, and break out when you reach the end.
amount1 = 1
while True:
    try:
        tr = c1.find_element_by_xpath(f'//*[@id="CasesGrid"]/table/tbody/tr[{amount1}]')
    except:
        break
    try:
        caseid = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[2]/a'.format(amount1)).text.strip()
        case_id.append(caseid)
    except:
        pass
    try:
        style = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[3]/div'.format(amount1)).text.strip()
        defendant.append(style)
    except:
        defendant.append('')
    amount1 += 1
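Alternatively, you can avoid the numbered tr[...] XPaths entirely by fetching every row at once with the plural find_elements method, so the loop length always matches however many rows the page actually has. A minimal sketch, reusing the question's case_id and defendant lists:

rows = driver.find_elements_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr')
for row in rows:
    try:
        # td[2]/a and td[3]/div are relative to each row, as in the question
        case_id.append(row.find_element_by_xpath('./td[2]/a').text.strip())
    except:
        pass
    try:
        defendant.append(row.find_element_by_xpath('./td[3]/div').text.strip())
    except:
        defendant.append('')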
I'm trying to work out how many times to run a for loop, depending on the number of list items (totalListNum), and it seems to be returning NoneType when in fact it should return either text or an int.
Website: https://stamprally.org/?search_keywords=&search_keywords_operator=and&search_cat1=68&search_cat2=0
Code below:
for prefectureValue in prefectureValueStorage:
    driver.get(
        f"https://stamprally.org/?search_keywords&search_keywords_operator=and&search_cat1={prefectureValue}&search_cat2=0")
    # Calculate how many times to run the page loop
    totalListNum = driver.find_element_by_css_selector(
        'div.page_navi2.clearfix>p').get_attribute('text')
    totalListNum.text.split("件中")
    if totalListNum[0] % 10 != 0:
        pageLoopCount = math.ceil(totalListNum[0])
    else:
        continue
    currentpage = 0
    while currentpage < pageLoopCount:
        currentpage += 1
        print(currentpage)
I don't think you should use get_attribute here. Instead, try this:
totalListNum = driver.find_element_by_css_selector('div.page_navi2.clearfix>p').text
First, your locator is not unique. Use this:
div.page_navi2.clearfix:nth-of-type(1)>p
or for the second element:
div.page_navi2.clearfix:nth-of-type(2)>p
Second, as already mentioned, use .text to get the text. If .text does not work, you can use .get_attribute('innerHTML').
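Putting both fixes together, a minimal sketch of the count parsing; this assumes the paragraph text starts with the total, e.g. "123件中 ...", and 10 listings per page:

import math

total_text = driver.find_element_by_css_selector('div.page_navi2.clearfix:nth-of-type(1)>p').text
total = int(total_text.split("件中")[0])  # the number before the 件中 marker; the format is an assumption
pageLoopCount = math.ceil(total / 10)    # assuming 10 listings per page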
I have a main page containing links to 5 other pages, with the following XPaths from tr[1] to tr[5]:
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[1]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[2]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[3]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[4]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[5]/td[3]/div[1]/a
Inside every page I have the following actions:
driver.find_element_by_name('key').send_keys('test_1')
driver.find_element_by_name('i18n[en_EN][value]').send_keys('Test 1')
# and at the end this takes me back to the main page again
driver.find_element_by_xpath('/html/body/div[3]/div[2]/div/div[3]/div/ul/li[2]/a').click()
How can I iterate so that the script goes through all 5 pages and performs the above actions? I tried a for loop, but I guess I didn't do it right... any help would be greatly appreciated.
You can try this:
xpath = '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[{}]/td[3]/div[1]/a'
for i in range(1, 6):
    driver.find_element_by_xpath(xpath.format(i)).click()
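Since each click navigates away from the main page, the per-page actions and the click that returns to the main page belong inside the same loop. A sketch combining the pieces already given in the question:

xpath = '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[{}]/td[3]/div[1]/a'
for i in range(1, 6):
    driver.find_element_by_xpath(xpath.format(i)).click()
    driver.find_element_by_name('key').send_keys('test_1')
    driver.find_element_by_name('i18n[en_EN][value]').send_keys('Test 1')
    # back to the main page
    driver.find_element_by_xpath('/html/body/div[3]/div[2]/div/div[3]/div/ul/li[2]/a').click()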
Seems like I figured it out, so here is the answer which works for me now:
wls = ['/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[1]/td[3]/div[1]/a',
       '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[2]/td[3]/div[1]/a',
       '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[3]/td[3]/div[1]/a',
       '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[4]/td[3]/div[1]/a',
       '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[5]/td[3]/div[1]/a']

for i in wls:
    driver.find_element_by_xpath(i).click()
See below:
template = '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[{}]/td[3]/div[1]/a'
for x in range(1, 6):
    a = template.format(x)
    print(a)
    # do what you need to do with the 'a' element
Output:
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[1]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[2]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[3]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[4]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[5]/td[3]/div[1]/a
I'm writing code in Python to get all the 'a' tags from a URL using BeautifulSoup; then I take the link at position 3 and follow it, repeating this process about 18 times. The code below repeats the process 4 times. When I run this code I get the same 4 links in the results, but I should get 4 different links. I think there is something wrong in my loop, specifically in the line that says url=y. I need help figuring out what the problem is.
import re
import urllib
from BeautifulSoup import *

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range(4):  # repeat 4 times
    htm2 = urllib.urlopen(url).read()
    soup1 = BeautifulSoup(htm2)
    tags1 = soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y = list1[2]
    if len(x2) < 3:  # no 3rd link
        break  # exit the loop
    else:
        url = y
    print y
You're continuing to add the third link EVER FOUND to your result list. Instead you should be adding the third link OF THAT ITERATION (which is soup('a')[2]), then reassigning your url and going again.
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
result = []
for i in range(4):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    for link in links:
        result.append(link)
    try:
        third_link = links[2]['href']
    except IndexError:  # less than three links
        break
    else:
        url = third_link
    print(url)
This is actually pretty simple in a recursive function:
def get_links(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    if len(links) < 3:
        # base case
        return links
    else:
        # recurse on the third link
        return links + get_links(links[2]['href'])
You can even modify that to make sure you don't recurse too deep:
def get_links(url, times=None):
    '''Returns all <a> tags from `url` and every 3rd link, up to `times` deep

    get_links("protocol://hostname.tld", times=2) -> list

    If times is None, recurse until there are fewer than 3 links to be found
    '''
    def _get_links(url, TTL):
        soup = BeautifulSoup(urllib.urlopen(url).read())
        links = soup('a')
        if (times is not None and TTL >= times) or \
           len(links) < 3:
            # base case
            return links
        else:
            return links + _get_links(links[2]['href'], TTL + 1)
    return _get_links(url, 0)
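For example, a call capped at two levels deep, using the question's starting URL:

# collect tags starting from the question's URL, following the third link at most twice
tags = get_links('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html', times=2)
print(len(tags))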
Your current code
y= list1[2]
just prints the URL located at index 2 of list1. Since that list only gets appended to, list[2] doesn't change. You should instead be selecting different indices each time you print if you want different URLs. I'm not sure what it is specifically that you're trying to print, but y= list1[-1] for instance would end up printing the last URL added to the list on that iteration (different each time).