Set-up
I need to obtain the population data for all NUTS3 regions on this Wikipedia page.
I have obtained the URL for each NUTS3 region and will have Selenium loop over them to obtain each region's population number as displayed on its page.
That is to say, for each region I need to get the population displayed in its infobox geography vcard element. E.g. for this region, the population would be 591680.
Code
Before writing the loop, I'm trying to obtain the population for one individual region:
from selenium import webdriver

browser = webdriver.Chrome()

url = 'https://en.wikipedia.org/wiki/Arcadia'
browser.get(url)
vcard_element = browser.find_element_by_css_selector('#mw-content-text > div > table.infobox.geography.vcard').find_element_by_xpath('tbody')
for row in vcard_element.find_elements_by_xpath('tr'):
    try:
        if 'Population' in row.find_element_by_xpath('th').text:
            print(row.find_element_by_xpath('th').text)
    except Exception:
        pass
Issue
The code works. That is, it prints the row containing the word 'Population'.
Question: How do I tell Selenium to get the next row – the row containing the actual population number?
Use ./following::tr[1] or ./following-sibling::tr[1]
from selenium import webdriver

url = 'https://en.wikipedia.org/wiki/Arcadia'
browser = webdriver.Chrome()
browser.get(url)
vcard_element = browser.find_element_by_css_selector('#mw-content-text > div > table.infobox.geography.vcard').find_element_by_xpath('tbody')
for row in vcard_element.find_elements_by_xpath('tr'):
    try:
        if 'Population' in row.find_element_by_xpath('th').text:
            print(row.find_element_by_xpath('th').text)
            print(row.find_element_by_xpath('./following::tr[1]').text)     # whole row
            print(row.find_element_by_xpath('./following::tr[1]/td').text)  # only the number
    except Exception:
        pass
Output on Console:
Population (2011)
• Total 86,685
86,685
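A note on the two axes: ./following::tr[1] matches the first tr that appears anywhere after the current row in document order, while ./following-sibling::tr[1] only matches rows inside the same tbody, which is the safer choice if an infobox ever contains a nested table. With the sibling axis, the last print would read:

print(row.find_element_by_xpath('./following-sibling::tr[1]/td').text)  # only the number, same table only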
While you can certainly do this with selenium, I would personally recommend using requests and lxml, as they are much lighter weight than selenium and can get the job done just as well. I found the below to work for a few regions I tested:
import requests
from lxml import html

try:
    response = requests.get(url)
    infocard_rows = html.fromstring(response.content).xpath("//table[@class='infobox geography vcard']/tbody/tr")
except:
    print('Error retrieving information from ' + url)

try:
    population_row = 0
    for i in range(len(infocard_rows)):
        if infocard_rows[i].findtext('th') == 'Population':
            population_row = i + 1
            break
    population = infocard_rows[population_row].findtext('td')
except:
    print('Unable to find population')
In essence, html.fromstring().xpath() grabs all of the rows from the infobox geography vcard table at that XPath. The next try-except then locates the th whose inner text is Population and pulls the text from the td in the row after it (which is the population number).
Hopefully this is helpful, even if it isn't Selenium like you were asking! Usually you'd use Selenium if you want to recreate browser behavior or inspect JavaScript-rendered elements. You can certainly use it here as well, though.
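To extend this to the full set of regions, a minimal sketch of the loop might look like the following; nuts3_urls is a hypothetical placeholder for the list of per-region URLs the question says has already been collected:

import requests
from lxml import html

nuts3_urls = ['https://en.wikipedia.org/wiki/Arcadia']  # hypothetical: the already-collected region URLs

populations = {}
for url in nuts3_urls:
    try:
        response = requests.get(url)
        rows = html.fromstring(response.content).xpath(
            "//table[@class='infobox geography vcard']/tbody/tr")
        for i, row in enumerate(rows):
            header = row.findtext('th')
            # the row after the 'Population' header holds the number in its td
            if header and 'Population' in header:
                populations[url] = rows[i + 1].findtext('td')
                break
    except Exception:
        print('Error retrieving information from ' + url)

print(populations)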
Related
My goal is to scrape all the reviews of this firm. I tried adapting @Driftr95's code:
import requests
from bs4 import BeautifulSoup

def extract(pg):
    headers = {'user-agent': 'Mozilla/5.0'}
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{pg}.htm?filter.iso3Language=eng'
    # f'https://www.glassdoor.com/Reviews/Google-Engineering-Reviews-EI_IE9079.0,6_DEPT1007_IP{pg}.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')  # returns the whole page as a soup object
    return soup
# subRatSel, getDECstars, refDict and empRevs come from @Driftr95's original code
for j in range(1, 21, 10):
    for i in range(j+1, j+11, 1):  # 3M: 4251 reviews
        soup = extract(f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{i}.htm?filter.iso3Language=eng')
        print(f' page {i}')
        for r in soup.select('li[id^="empReview_"]'):
            rDet = {'reviewId': r.get('id')}
            for sr in r.select(subRatSel):
                k = sr.select_one('div:first-of-type').get_text(' ').strip()
                sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
                rDet[f'[rating] {k}'] = sval
            for k, sel in refDict.items():
                sval = r.select_one(sel)
                if sval: sval = sval.get_text(' ').strip()
                rDet[k] = sval
            empRevs.append(rDet)
In cases where not all of the subratings are available, all four subratings turn out to be N.A.
There were some things that I didn't account for because I hadn't encountered them before, but the updated version of getDECstars shouldn't have that issue. (If you use the longer version with the argument isv=True, it's easier to debug and figure out what's missing from the code...)
I scraped 200 reviews in this case, and it turned out that only 170 were unique.
Duplicates are fairly easy to avoid by maintaining a list of reviewIds that have already been added and checking against it before adding a new review to empRevs:
scrapedIds = []
# for ...
#     for ...
#         soup = extract(...)
#         for r in ...:
              if r.get('id') in scrapedIds: continue  # skip duplicate
              ## rDet = ..... ## AND REST OF INNER FOR-LOOP ##
              empRevs.append(rDet)
              scrapedIds.append(rDet['reviewId'])  # add to list of ids to check against
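(A small variation, not part of the original suggestion: since that membership check runs once per review, a set would give constant-time lookups - scrapedIds = set() together with scrapedIds.add(rDet['reviewId']).)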
HTTPS tends to time out after 100 rounds...
You could try adding breaks and switching out user agents every 50 [or 5 or 10 or...] requests, but I'm quick to resort to Selenium at times like this. This is my suggested solution - just call it like this, passing a URL to start from:
## PASTE [OR DOWNLOAD&IMPORT] from https://pastebin.com/RsFHWNnt ##
startUrl = 'https://www.glassdoor.com/Reviews/3M-Reviews-E446.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
scrape_gdRevs(startUrl, 'empRevs_3M.csv', maxScrapes=1000, constBreak=False)
[last 3 lines of] printed output:
total reviews: 4252
total reviews scraped this run: 4252
total reviews scraped over all time: 4252
It clicks through the pages until it reaches the last page (or maxes out maxScrapes). You do have to log in at the beginning though, so fill out login_to_gd with your username and password, or log in manually by replacing the login_to_gd(driverG) line with the input(...) line that waits for you to log in [then press ENTER in the terminal] before continuing.
I think cookies can also be used instead (with requests), but I'm not good at handling that. If you figure it out, then you can use some version of linkToSoup or your extract(pg); then, you'll have to comment out or remove the lines ending in ## for selenium and uncomment [or follow instructions from] the lines that end with ## without selenium. [But please note that I've only fully tested the selenium version.]
The CSVs [like "empRevs_3M.csv" and "scrapeLogs_empRevs_3M.csv" in this example] are updated after every page-scrape, so even if the program crashes [or you decide to interrupt it], it will have saved everything up to the previous scrape. Since it also tries to load from the CSVs at the beginning, you can just continue it later (just set startUrl to the URL of the page you want to continue from - even if it's at page 1, remember that duplicates will be ignored, so it's okay - it'll just waste some time).
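For reference, if you do stay with requests, the "breaks and switching out user-agents" idea mentioned above might look roughly like this sketch; the user-agent strings, the 50-request interval and the 30-second pause are arbitrary assumptions, not values from the original answer:

import random
import time

import requests

# arbitrary example pool of user-agent strings - swap in whatever you like
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

session = requests.Session()
session.headers['user-agent'] = random.choice(USER_AGENTS)

def polite_get(url, request_count, every=50, pause=30):
    # every `every` requests, take a break and switch to a different user agent
    if request_count and request_count % every == 0:
        time.sleep(pause)
        session.headers['user-agent'] = random.choice(USER_AGENTS)
    return session.get(url)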
I'm creating a price tracker for Amazon at the moment and have been running into a few problems. I'm wondering if anyone can shed any light on why I can't get this list to remove an element when a condition is met.
Here is what I am trying to make happen:
1. Check if the item is in stock and get its price.
2. If the price is below MAX, print the Name, URL & Price.
3. If it is found in stock and below the MAX price, remove the current URL and current MAX price from the lists.
4. Continue checking the other item(s) for prices below MAX.
As I am converting (mapping) the first list to integers, I seem to be unable to use .remove() or any similar function to remove the MAX from the list once the condition is met. (The URL is removed just fine.)
Can anyone point me in the right direction or explain how I can remove the MAX once a product is found?
Alternatively, does anyone know a way to sleep/ignore a list element for a specific number of loop iterations once a condition has been met? That would be even better.
Essentially, once the condition is met, both elements at the current position in the looped lists should either be removed or skipped for a specific period of time.
Thank you very much in advance if you're able to help or assist me to figure the answer out!
import requests
from bs4 import BeautifulSoup

MAXprices = ["80", "19"]
url_list = ["***AN EXAMPLE AMAZON LINK", "*****AN EXAMPLE AMAZON LINK********"]

while True:
    MAXprice = map(int, MAXprices)
    for (URL, MAX) in zip(url_list, MAXprice):
        headers = {"User-Agent": ''}
        page = requests.get(URL, headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        productName = soup.find(id="productTitle").get_text().strip()
        productPrice = soup.find(class_="a-offscreen").get_text().strip()
        converted_price = int(productPrice[1:3])
        if converted_price < MAX:
            print(productName)
            print(URL)
            print(productPrice)
            url_list.remove(URL)
            MAXprices.remove(MAX)
            continue
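There is no answer to this one in the thread, but here is a hedged sketch of one way out: .remove() fails because MAX is the int produced by map() while MAXprices still holds strings, and mutating lists you are zipping over is fragile anyway. Pairing each URL with its threshold in a single list of tuples (an assumed restructuring, not the asker's layout) sidesteps both problems:

import requests
from bs4 import BeautifulSoup

# assumed structure: (url, max_price) pairs instead of two parallel lists
watchlist = [("***AN EXAMPLE AMAZON LINK", 80), ("*****AN EXAMPLE AMAZON LINK********", 19)]

while watchlist:
    still_watching = []
    for URL, MAX in watchlist:
        page = requests.get(URL, headers={"User-Agent": ''})
        soup = BeautifulSoup(page.content, 'html.parser')
        productName = soup.find(id="productTitle").get_text().strip()
        productPrice = soup.find(class_="a-offscreen").get_text().strip()
        converted_price = int(productPrice[1:3])  # keeps the asker's rough price parsing
        if converted_price < MAX:
            print(productName)
            print(URL)
            print(productPrice)
            # dropped: simply not carried over into the next round
        else:
            still_watching.append((URL, MAX))
    watchlist = still_watching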
I am working on a web-scraping script that needs to scrape multiple websites. I use the Selenium WebDriver and find_element_by_xpath to extract the information that I want. Here is my code:
from selenium import webdriver

op = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=op)
driver.get(web)
content1 = driver.find_elements_by_class_name('k-master-row')
for c1 in content1:
    for i in range(1, 401):
        amount1 = i
        try:
            caseid = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[2]/a'.format(amount1)).text.strip()
            case_id.append(caseid)
        except:
            pass
        try:
            style = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[3]/div'.format(amount1)).text.strip()
            defendant.append(style)
        except:
            defendant.append('')
The code works perfectly fine, but I have to manually set the range every time I scrape a different URL. //*[@id="CasesGrid"]/table/tbody/tr[1]/td[2]/a This is the XPath, which is identical across all the URLs that I need to scrape. The only thing that differs is tr[1]: it can range from 1 to 500 or 600. If the website that I scrape runs from 1 to 35, then I have to manually change the range to (1, 35). This is very time consuming when I scrape new URLs. Are there any better ways to set the range so that it simply stops at the end of whichever page I scrape, so I don't need to manually inspect the XPath to find the end number for the range?
Thank you all!!
Use an infinite loop, and break out when you reach the end.
amount1 = 1
while True:
    try:
        tr = c1.find_element_by_xpath(f'//*[@id="CasesGrid"]/table/tbody/tr[{amount1}]')
    except:
        break
    try:
        caseid = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[2]/a'.format(amount1)).text.strip()
        case_id.append(caseid)
    except:
        pass
    try:
        style = c1.find_element_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr[{}]/td[3]/div'.format(amount1)).text.strip()
        defendant.append(style)
    except:
        defendant.append('')
    amount1 += 1
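An alternative sketch, not from the original answer: since the rows are all present in the DOM, you can let Selenium count them once with find_elements and iterate over that list, which removes both the hard-coded range and the per-row existence check:

rows = c1.find_elements_by_xpath('//*[@id="CasesGrid"]/table/tbody/tr')  # however many rows there are
for row in rows:
    try:
        case_id.append(row.find_element_by_xpath('./td[2]/a').text.strip())
    except:
        pass
    try:
        defendant.append(row.find_element_by_xpath('./td[3]/div').text.strip())
    except:
        defendant.append('')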
I'm currently trying to extract text and labels (Topics) from a webpage with the following code:
import requests
from bs4 import BeautifulSoup

Texts = []
Topics = []
url = 'https://www.unep.org/news-and-stories/story/yes-climate-change-driving-wildfires'
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
if response.ok:
    soup = BeautifulSoup(response.text, 'lxml')
    txt = soup.findAll('div', {'class': 'para_content_text'})
    for div in txt:
        p = div.findAll('p')
        Texts.append(p)
    print(Texts)
    top = soup.find('div', {'class': 'article_tags_topics'})
    a = top.findAll('a')
    Topics.append(a)
    print(Topics)
No code errors, but here is an extract of what I obtained with the previous code:
</p>, <p><strong>UNEP:</strong> And this is bad news?</p>, <p><strong>NH:</strong> This is bad news. This is bad for our health, for our wallet and for the fabric of society.</p>, <p><strong>UNEP:</strong> The world is heading towards a global average temperature that’s 3<strong>°</strong>C to 4<strong>°</strong>C higher than it was before the industrial revolution. For many people, that might not seem like a lot. What do you say to them?</p>, <p><strong>NH:</strong> Just think about your own body. When your temperature goes up from 36.7°C (98°F) to 37.7°C (100°F), you’ll probably consider taking the day off. If it goes 1.5°C above normal, you’re staying home for sure. If you add 3°C, people who are older and have preexisting conditions – they may die. The tolerances are just as tight for the planet.</p>]]
[[Forests, Climate change]]
As I'm looking for a "clean" text result, I tried adding the following line inside my loops in order to obtain only the text:
p = p.text
but I got:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I've also noticed that for the Topics result I get unwanted markup; I would like to obtain only Forests and Climate change (without a comma between them).
Any idea what I can add to my code to obtain clean text and topics?
This happens because p is a ResultSet object. You can see this by running the following:
print(type(Texts[0]))
Output:
<class 'bs4.element.ResultSet'>
To get the actual text, you can address each item in each ResultSet directly:
for result in Texts:
    for item in result:
        print(item.text)
Output:
As wildfires sweep across the western United States, taking lives, destroying homes and blanketing the country in smoke, Niklas Hagelberg has a sobering message: this could be America’s new normal.
......
Or even use a list comprehension:
full_text = '\n'.join([item.text for result in Texts for item in result])
The AttributeError means that you have a list of elements because you used p = div.findAll('p').
Try:
p[0].text
or change p = div.findAll('p') to p = div.find('p') (it will only return the first match it finds)
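For the Topics part of the question, which neither answer covers directly, the same idea applies: the ResultSet of <a> tags can be unpacked into plain strings. A small sketch along those lines:

# Topics holds ResultSets of <a> tags; pull just the link text out of each
topic_names = [a.get_text(strip=True) for result in Topics for a in result]
print(' '.join(topic_names))  # e.g. "Forests Climate change" - no commas, no markup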
I'm writing a script that scans through a set of links. Within each link the script searches a table for a row. Once found, it increments the variable total_rank, which is the sum of the ranks found on each web page. The rank is equal to the row number.
The code looks like this and is outputting zero:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []
for a in soup.select(".chooser-list ul"):
    list_entry = a.findAll('li')
    relative_link = list_entry[0].find('a')['href']
    link = "https://www.teamrankings.com" + relative_link
    stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    team_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer table")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank
    # time.sleep(1)
print(total_rank)
Debugging shows team_rows is empty after the select() call. The thing is, I've also tried different selectors. For example, I've tried soup.select(".scroll-wrapper div") and soup.select("#DataTables_Table_0_wrapper div"); all of them return nothing.
The selector
".tr-table datatable scrollable dataTable no-footer tr"
selects a <tr> element anywhere under a <no-footer> element, anywhere under a <dataTable> element, and so on.
I think "datatable scrollable dataTable no-footer" are really classes on your .tr-table element. In that case, they should be joined to the first class with periods, so I believe the final correct selector is:
".tr-table.datatable.scrollable.dataTable.no-footer tr"
UPDATE: the new selector looks like this:
".tr-table.datatable.scrollable.dataTable.no-footer table"
The problem here is that the first part, .tr-table.datatable... refers to the table itself. Assuming you're trying to get the rows of this table:
<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">
The proper selector remains the one I originally suggested.
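If that selector does match on your page, substituting it into the question's loop would look roughly like this sketch (with a guard for header rows, which contain no td cells, and an int() conversion so the ranks actually sum):

team_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer tr")
for row in team_rows:
    cells = row.findAll('td')
    if len(cells) > 1 and cells[1].text.strip() == 'Oklahoma':
        total_rank += int(cells[0].text.strip())  # convert the rank text before summing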
@audiodude's answer is correct, though the suggested selector is not working for me.
You don't need to check every single class of the table element. Here is the working selector:
team_rows = soup.select("table.datatable tr")
Also, if you need to find Oklahoma inside the table, you don't have to iterate over every row and cell. Just search directly for the specific cell and get the previous sibling cell containing the rank:
rank = soup.find("td", {"data-sort": "Oklahoma"}).find_previous_sibling("td").get_text()
total_rank += int(rank) # it is important to convert the row number to int
Also note that you are extracting more stats links than you should - looks like the Player Stats links should not be followed since you are focused specifically on the Team Stats. Here is one way to get Team Stats links only:
links_list = soup.find("h2", text="Team Stats").find_next_sibling("ul")
stat_links = ["https://www.teamrankings.com" + a["href"]
              for a in links_list.select("ul.expand-content li a[href]")]