Setting a for loop range without the end number - Python

I am working on a web-scraping project that needs to scrape multiple websites. I use the Selenium WebDriver to do this, and find_element_by_xpath to extract the information I want. Here is my code:
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=op)
driver.get(web)
content1 = driver.find_elements_by_class_name('k-master-row')
for c1 in content1:
    for i in range(1, 401):
        amount1 = i
        try:
            caseid = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[2]/a'.format(amount1)).text.strip()
            case_id.append(caseid)
        except:
            pass
        try:
            style = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[3]/div'.format(amount1)).text.strip()
            defendant.append(style)
        except:
            defendant.append('')
The code works perfectly fine, but I have to manually set the range every time I scrape a different URL. //*[@id="CasesGrid"]/table/tbody/tr[1]/td[2]/a is the XPath, and it is identical across all the URLs I need to scrape. The only thing that differs is tr[1]: it can range from 1 to 500 or 600. If the website I scrape runs from 1 to 35, then I have to manually change the range to (1, 35). This is very time consuming whenever I scrape a new URL. I am wondering whether there is a better way to set the range so that the loop simply stops at the end of whichever page I scrape, and I don't need to manually inspect the XPath to find the end number for the range.
Thank you all!!

Use an infinite loop, and break out when you reach the end.
amount1 = 1
while True:
    try:
        tr = c1.find_element_by_xpath(f'//*[@id="CasesGrid"]/table/tbody/tr[{amount1}]')
    except:
        break
    try:
        caseid = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[2]/a'.format(amount1)).text.strip()
        case_id.append(caseid)
    except:
        pass
    try:
        style = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[3]/div'.format(amount1)).text.strip()
        defendant.append(style)
    except:
        defendant.append('')
    amount1 += 1
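Alternatively, you can let Selenium count the rows for you: find_elements (plural) returns a list, so its length gives you the end of the range without any guessing. A minimal sketch, assuming the same CasesGrid table structure as above:
rows = c1.find_elements_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr')
# len(rows) is the actual number of rows on this page, so the range
# adapts to each URL automatically.
for amount1 in range(1, len(rows) + 1):
    try:
        caseid = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[2]/a'.format(amount1)).text.strip()
        case_id.append(caseid)
    except:
        pass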

Related

How to replace the URL in a loop using Selenium and Python

Just like the title says: how do I write the code in Python if I want to replace a part of the URL?
For this example, that means replacing a specific part with 1, 2, 3, 4 and so on for this link (https://test.com/page/1), doing something on that page, then going to the next and repeating.
So: "open URL > click on a button or whatever > replace the link with the new link with the next number in order".
(I know my code is a mess, I am still a newbie, but I am trying to learn, and I am adding whatever mess I've written so far to follow the posting rules.)
PATH = Service("C:\Program Files (x86)\chromedriver.exe")
driver = webdriver.Chrome(service=PATH)
driver.maximize_window()
get = 1
url = "https://test.com/page/{get}"
while get < 5:
    driver.get(url)
    time.sleep(1)
    driver.find_element_by_xpath("/html/body/div/div/div[2]/form/section[3]/input[4]").click()
    get = get + 1
    driver.get(url)
driver.close()
get = 1
url = f"https://test.com/page/{get}"
while get < 5:
    driver.get(url)
    driver.find_element_by_xpath("/html/body/div/div/div[2]/form/section[3]/input[4]").click()
    print(get)
    print(url)
    get += 1
    url = f"https://test.com/page/{get}"
This simply updates url inside the loop.
Output:
1
https://test.com/page/1
2
https://test.com/page/2
3
https://test.com/page/3
4
https://test.com/page/4
Use the range() function and string interpolation (f-strings) as follows:
for i in range(1, 5):
    print(f"https://test.com/page/{i}")
    driver.get(f"https://test.com/page/{i}")
    driver.find_element_by_xpath("/html/body/div/div/div[2]/form/section[3]/input[4]").click()
Console Output:
https://test.com/page/1
https://test.com/page/2
https://test.com/page/3
https://test.com/page/4

Getting NoneType in Python Selenium despite using the .text method

I'm trying to work out how many loop iterations to run, depending on the number of list items (totalListNum).
It seems that it is returning NoneType when in fact it should be returning either text or an int.
Website: https://stamprally.org/?search_keywords=&search_keywords_operator=and&search_cat1=68&search_cat2=0
Code below:
for prefectureValue in prefectureValueStorage:
    driver.get(
        f"https://stamprally.org/?search_keywords&search_keywords_operator=and&search_cat1={prefectureValue}&search_cat2=0")
    # Calculate how many times to run the page loop
    totalListNum = driver.find_element_by_css_selector(
        'div.page_navi2.clearfix>p').get_attribute('text')
    totalListNum.text.split("件中")
    if totalListNum[0] % 10 != 0:
        pageLoopCount = math.ceil(totalListNum[0])
    else:
        continue
    currentpage = 0
    while currentpage < pageLoopCount:
        currentpage += 1
        print(currentpage)
I don't think you should use get_attribute here. Instead, try this:
totalListNum = driver.find_element_by_css_selector('div.page_navi2.clearfix>p').text
First, your locator is not unique. Use this:
div.page_navi2.clearfix:nth-of-type(1)>p
or, for the second element:
div.page_navi2.clearfix:nth-of-type(2)>p
Second, as already mentioned, use .text to get the text. If .text does not work, you can use .get_attribute('innerHTML').
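Putting those two fixes together, a minimal sketch of the counting logic might look like this (the 件中 split and the ten-results-per-page assumption are carried over from the question's own code):
import math

# Read the text of the result-count paragraph (.text, not
# get_attribute('text'), which returns None).
count_text = driver.find_element_by_css_selector(
    'div.page_navi2.clearfix:nth-of-type(1)>p').text
# The question splits on "件中", so the text presumably starts with the
# total result count, e.g. "123件中 ...".
total = int(count_text.split("件中")[0])
# Assumption from the question's `% 10` check: ten results per page.
pageLoopCount = math.ceil(total / 10)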

Get value in next row based on value in current row with Selenium

Set-up
I need to obtain the population data for all NUTS3 regions on this Wikipedia page.
I have obtained all URLs per NUTS3 region and will let Selenium loop over them to obtain each region's population number as displayed on its page.
That is to say, for each region I need to get the population displayed in its infobox geography vcard element. E.g. for this region, the population would be 591680.
Code
Before writing the loop, I'm trying to obtain the population for one individual region:
url = 'https://en.wikipedia.org/wiki/Arcadia'
browser.get(url)
vcard_element = browser.find_element_by_css_selector('#mw-content-text > div > table.infobox.geography.vcard').find_element_by_xpath('tbody')
for row in vcard_element.find_elements_by_xpath('tr'):
    try:
        if 'Population' in row.find_element_by_xpath('th').text:
            print(row.find_element_by_xpath('th').text)
    except Exception:
        pass
Issue
The code works. That is, it prints the row containing the word 'Population'.
Question: How do I tell Selenium to get the next row, the one containing the actual population number?
Use ./following::tr[1] or ./following-sibling::tr[1]
url = 'https://en.wikipedia.org/wiki/Arcadia'
browser = webdriver.Chrome()
browser.get(url)
vcard_element = browser.find_element_by_css_selector('#mw-content-text > div > table.infobox.geography.vcard').find_element_by_xpath('tbody')
for row in vcard_element.find_elements_by_xpath('tr'):
    try:
        if 'Population' in row.find_element_by_xpath('th').text:
            print(row.find_element_by_xpath('th').text)
            print(row.find_element_by_xpath('./following::tr[1]').text)  # whole row
            print(row.find_element_by_xpath('./following::tr[1]/td').text)  # only the number
    except Exception:
        pass
Output on Console:
Population (2011)
• Total 86,685
86,685
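The ./following-sibling::tr[1] variant mentioned above behaves the same way here, since the infobox rows are siblings within one tbody; inside the same loop the line would be (a sketch):
print(row.find_element_by_xpath('./following-sibling::tr[1]/td').text)  # only the number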
While you can certainly do this with Selenium, I would personally recommend using requests and lxml, as they are much lighter weight than Selenium and can get the job done just as well. I found the below to work for a few regions I tested:
import requests
from lxml import html

try:
    response = requests.get(url)
    infocard_rows = html.fromstring(response.content).xpath("//table[@class='infobox geography vcard']/tbody/tr")
except:
    print('Error retrieving information from ' + url)

try:
    population_row = 0
    for i in range(len(infocard_rows)):
        if infocard_rows[i].findtext('th') == 'Population':
            population_row = i + 1
            break
    population = infocard_rows[population_row].findtext('td')
except:
    print('Unable to find population')
In essence, html.fromstring().xpath() gets all of the rows from the infobox geography vcard table on the page. The next try/except block then locates the th whose inner text is Population and pulls the text from the next row's td (which is the population number).
Hopefully this is helpful, even if it isn't Selenium like you were asking! Usually you'd use Selenium when you want to recreate browser behavior or inspect JavaScript-rendered elements. You can certainly use it here as well, though.

String concatenation with Python and XPath

Instead of writing 10+ if statements, I'm trying to create one if statement using a variable. Unfortunately, I'm not familiar with how to implement string concatenation for XPath in Python. Can anyone show me how to perform string formatting for the following code segments?
I would greatly appreciate it, thanks.
if page_number == 1:
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    page_number = page_number + 1
    time.sleep(30)
elif page_number == 2:
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 3"]')
    next_link.click()
    page_number = page_number + 1
    time.sleep(30)
This answer is not about string concatenation, but about a simpler solution to the problem...
Instead of clicking a particular link in the pagination, you can click the "Next" button:
pages_number = 10
for _ in range(pages_number):
    driver.find_element_by_xpath("//a[@title='Go to next page']").click()
    time.sleep(30)
If you need to open a specific page, you can use the below (note that find_element_by_link_text expects a string, so the page number is converted):
required_page = 3
driver.find_element_by_link_text(str(required_page)).click()
P.S. I assumed you are talking about this site
You can use a for loop.
Ex:
for i in range(1, 10):
    # use str.format to insert the page number into the XPath
    next_link = browser.find_element_by_xpath('//*[@title="Go to page {0}"]'.format(i + 1))
    next_link.click()
    time.sleep(30)
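The same loop written with an f-string, matching the interpolation style used in the earlier answers (a sketch):
for i in range(1, 10):
    next_link = browser.find_element_by_xpath(f'//*[@title="Go to page {i + 1}"]')
    next_link.click()
    time.sleep(30)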

My loop isn't running

I'm writing code in Python to get all the 'a' tags from a URL using BeautifulSoup, then take the link at position 3 and follow it, repeating this process about 18 times. The code below repeats the process 4 times. When I run it I get the same 4 links in the results, but I should get 4 different links. I think there is something wrong with my loop, specifically the line that says url = y. I need help figuring out what the problem is.
import re
import urllib
from BeautifulSoup import *

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range(4):  # repeat 4 times
    htm2 = urllib.urlopen(url).read()
    soup1 = BeautifulSoup(htm2)
    tags1 = soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y = list1[2]
    if len(x2) < 3:  # no 3rd link
        break  # exit the loop
    else:
        url = y
    print y
You're continuing to read the third link EVER FOUND from your result list. Instead you should be taking the third link OF THAT ITERATION (which is soup('a')[2]), then reassigning your url and going again.
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
result = []
for i in range(4):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    for link in links:
        result.append(link)
    try:
        third_link = links[2]['href']
    except IndexError:  # fewer than three links
        break
    else:
        url = third_link
    print(url)
This is actually pretty simple with a recursive function:
def get_links(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    if len(links) < 3:
        # base case
        return links
    else:
        # recurse on the third link
        return links + get_links(links[2]['href'])
You can even modify that to make sure you don't recurse too deep:
def get_links(url, times=None):
    '''Returns all <a> tags from `url` and every 3rd link, up to `times` deep.

    get_links("protocol://hostname.tld", times=2) -> list

    If times is None, recurse until there are fewer than 3 links to be found.
    '''
    def _get_links(url, TTL):
        soup = BeautifulSoup(urllib.urlopen(url).read())
        links = soup('a')
        if (times is not None and TTL >= times) or \
                len(links) < 3:
            # base case
            return links
        else:
            return links + _get_links(links[2]['href'], TTL + 1)
    return _get_links(url, 0)
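Usage would look something like this for the question's four hops (a sketch, reusing the question's URL):
links = get_links('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html', times=4)
print(len(links))  # every <a> tag collected along the way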
Your current code
y = list1[2]
just prints the URL located at index 2 of list1. Since that list only gets appended to, list1[2] never changes. You should instead select a different index each time you print if you want different URLs. I'm not sure what specifically you're trying to print, but y = list1[-1], for instance, would print the last URL added to the list on that iteration (different each time).
