Delete unwanted elements of python webscraping loop results - python

I'm currently trying to extract text and labels (Topics) from a webpage with the following code:
import requests
from bs4 import BeautifulSoup

Texts = []
Topics = []

url = 'https://www.unep.org/news-and-stories/story/yes-climate-change-driving-wildfires'
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
if response.ok:
    soup = BeautifulSoup(response.text, 'lxml')
    txt = soup.findAll('div', {'class': 'para_content_text'})
    for div in txt:
        p = div.findAll('p')
        Texts.append(p)
    print(Texts)

    top = soup.find('div', {'class': 'article_tags_topics'})
    a = top.findAll('a')
    Topics.append(a)
    print(Topics)
The code itself runs without errors, but here is an extract of what I obtained with it:
</p>, <p><strong>UNEP:</strong> And this is bad news?</p>, <p><strong>NH:</strong> This is bad news. This is bad for our health, for our wallet and for the fabric of society.</p>, <p><strong>UNEP:</strong> The world is heading towards a global average temperature that’s 3<strong>°</strong>C to 4<strong>°</strong>C higher than  it was before the industrial revolution. For many people, that might not seem like a lot. What do you say to them?</p>, <p><strong>NH:</strong> Just think about your own body. When your temperature goes up from 36.7°C (98°F) to 37.7°C (100°F), you’ll probably consider taking the day off. If it goes 1.5°C above normal, you’re staying home for sure. If you add 3°C, people who are older and have preexisting conditions –  they may die. The tolerances are just as tight for the planet.</p>]]
[[Forests, Climate change]]
As I'm looking for a "clean" text result, I tried adding the following line inside my loop to get only the text:
p = p.text
but I got:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I've also noticed that the Topics result contains unwanted URLs; I would like to obtain only the topic names, Forests and Climate change, without the comma between them.
Any idea what I can add to my code to obtain clean text and topics?

This happens because p is a ResultSet object. You can see this by running the following:
print(type(Texts[0]))
Output:
<class 'bs4.element.ResultSet'>
To get the actual text, you can address each item in each ResultSet directly:
for result in Texts:
    for item in result:
        print(item.text)
Output:
As wildfires sweep across the western United States, taking lives, destroying homes and blanketing the country in smoke, Niklas Hagelberg has a sobering message: this could be America’s new normal.
......
Or even use a list comprehension:
full_text = '\n'.join([item.text for result in Texts for item in result])

The AttributeError means that you have a list of elements because you used p = div.findAll('p').
Try:
p[0].text
or change p = div.findAll('p') to p = div.find('p') (it will return only the first match it finds).
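For the Topics part of the question, the same idea applies. A minimal sketch, reusing the soup object from the question and assuming the article_tags_topics div is present on the page:

top = soup.find('div', {'class': 'article_tags_topics'})
if top:
    topic_names = [a.text.strip() for a in top.findAll('a')]
    print(topic_names)  # e.g. ['Forests', 'Climate change']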

Related

How to remove list element after condition met and also remove mapped INT from list

I'm creating a price tracker for Amazon at the moment and have been running into a few problems; I'm wondering if anyone can shed any light as to why I can't get this list to remove an element once a condition is met.
Here is what I am trying to make happen:
1. Check if in stock and price
2. If price below MAX, print Name, URL & Price
3. If found in stock and below MAX price, remove current URL and current MAX price from lists.
4. Continue checking other item(s) for prices below MAX.
As I am converting (mapping) the first list to integers, I seem to be unable to use .remove() or any similar function to remove the MAX from the list once the condition is met. (The URL is removed just fine.)
Is anyone able to point me in the right direction or explain how I can remove the MAX once a product is found?
Alternatively, does anyone know a way to sleep/ignore a list element for a specific number of loops once a condition has been met? That would be even more perfect.
Essentially, once the condition is met, both elements at the current position in the looped lists should either be removed or skipped for a specific period of time.
Thank you very much in advance if you're able to help or assist me in figuring out the answer!
import requests
from bs4 import BeautifulSoup

MAXprices = ["80", "19"]
url_list = ["***AN EXAMPLE AMAZON LINK", "*****AN EXAMPLE AMAZON LINK********"]

while True:
    MAXprice = map(int, MAXprices)
    for (URL, MAX) in zip(url_list, MAXprice):
        headers = {"User-Agent": ''}
        page = requests.get(URL, headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        productName = soup.find(id="productTitle").get_text().strip()
        productPrice = soup.find(class_="a-offscreen").get_text().strip()
        converted_price = int(productPrice[1:3])
        if converted_price < MAX:
            print(productName)
            print(URL)
            print(productPrice)
            url_list.remove(URL)
            MAXprices.remove(MAX)
            continue
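As a minimal sketch of one possible way around the .remove() problem described above: MAXprices holds strings while map(int, ...) yields ints, so MAXprices.remove(MAX) looks for an int in a list of strings. Keeping the limits as ints and pairing each URL with its limit avoids the mismatch and the need to remove items while iterating (fetch_price is a hypothetical stand-in for the requests/BeautifulSoup lookup in the question):

import requests
from bs4 import BeautifulSoup

def fetch_price(url):
    # Hypothetical helper: same lookup as in the question, returning an int price.
    page = requests.get(url, headers={"User-Agent": ''})
    soup = BeautifulSoup(page.content, 'html.parser')
    return int(soup.find(class_="a-offscreen").get_text().strip()[1:3])  # keeps the question's [1:3] slice

# Pair each URL with its integer price limit instead of keeping two parallel lists.
watchlist = [("***AN EXAMPLE AMAZON LINK", 80), ("*****AN EXAMPLE AMAZON LINK********", 19)]

while watchlist:
    still_watching = []
    for url, max_price in watchlist:
        if fetch_price(url) < max_price:
            print(url)  # found a deal: this item simply isn't carried over
        else:
            still_watching.append((url, max_price))
    watchlist = still_watching  # rebuild the list instead of removing while iterating

Rebuilding the list each pass also avoids mutating url_list while zip() is still iterating over it.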

Python BeautifulSoup - Improve readability of find by Id function?

I would like to improve the readability of the following code, especially lines 8 to 11:
import requests
from bs4 import BeautifulSoup
URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
question1 = str(soup.find(id='i1'))
question1 = question1.split('>')[1].lstrip().split('.')[1]
question1 = question1[1:]
question1 = question1.replace("_", "")
print(question1)
Thanks in advance :)
You could use the following
question1 = soup.find(id='i1').getText().split(".")[1].replace("_","").strip()
to replace lines 8 to 11.
.getText() takes care of removing the HTML tags. The rest is pretty much the same.
In Python you can almost always just chain operations, so your code would also be valid as a one-liner:
question1 = str(soup.find(id='i1')).split('>')[1].lstrip().split('.')[1][1:].replace("_", "")
But in most cases it is better to leave the code in a more readable form than to reduce the line count.
Abhinav, it is not very clear what you want to achieve; the script is actually already very simple, which is a good thing and follows the Pythonic principle from The Zen of Python:
"Simple is better than complex."
It is also not clear what you actually mean by improving it:
Make it simpler, as in more understandable and clear for human beings?
Make it simpler for the machine to compute, and hence improve performance?
Reduce the number of lines of code and follow programming guidelines more closely?
I point this out because next time it would be better to make this explicit in the question. Having said that, as I don't know exactly what you mean, I have come up with an answer that more or less covers all 3 points:
ANSWER
import requests
from bs4 import BeautifulSoup

URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# ========= < FUNCTION TO GET ALL QUESTIONS DYNAMICALLY > ========= #
def clean_string_by_id(page, id):
    content = str(page.find(id=id))  # Get the content of the page by id
    if content != 'None':  # Check whether there is actual content or not
        find_question = content.split('>')  # NOTE: Split at tag closings
        if len(find_question) >= 2 and find_question[1][0].isdigit():  # NOTE: A length of 1 means it is not the right element; we also check that the text starts with a digit
            cleaned_question = find_question[1].split('.')[1].strip()  # Get the actual question and strip it already!
            result = cleaned_question.replace('_', '')
            return result
    else:
        return

# ========= < SCAN THE ENTIRE PAGE DYNAMICALLY + ADD RESULTS TO A LIST > ========= #
all_questions = []
for i in range(1, 50):  # NOTE: I went up to 50 but there may be many more, I let you test it
    get_question = clean_string_by_id(soup, f'i{i}')
    if get_question:  # Append the result to the list only if there is actual content
        all_questions.append(get_question)

# ========= < SHOW ALL RESULTS > ========= #
for question in all_questions:
    print(question)
NOTE
Here I'm assuming that you want to get all the elements from this page, hence you don't want to write 2000 variables. As you can see, I left the logic basically the same as yours but wrapped everything in a function instead.
In fact the steps you followed were pretty good, and yes, you may "improve it" or make it "smarter", but comprehensibility wins over complexity. Also bear in mind that I assumed that getting all the 'questions' from that Google Form was your goal.
EDIT
As pointed out by @wuerfelfreak and as he explains in his answer, further improvement can be achieved by using the getText() function.
Hence, here is the result of the above function using getText:
def clean_string_by_id(page, id):
    content = page.find(id=id)
    if content:  # NOTE: Check whether there is actual content (i.e. the element was found)
        find_question = content.getText()  # NOTE: getText() strips the HTML tags for us
        if find_question:  # NOTE: Skip it if the resulting string is empty
            cleaned_question = find_question.split('.')[1].strip()  # Same as before
            result = cleaned_question.replace('_', '')
            return result
Documentation & Guides
Zen of Python
getText
geeksforgeeks.org | isdigit()

Get value next row based on value current row Selenium

Set-up
I need to obtain the population data for all NUTS3 regions on this Wikipedia page.
I have obtained all URLs per NUTS3 region and will let Selenium loop over them to obtain each region's population number as displayed on its page.
That is to say, for each region I need to get the population displayed in its infobox geography vcard element. E.g. for this region, the population would be 591680.
Code
Before writing the loop, I'm trying to obtain the population for one individual region,
url = 'https://en.wikipedia.org/wiki/Arcadia'
browser.get(url)

vcard_element = browser.find_element_by_css_selector('#mw-content-text > div > table.infobox.geography.vcard').find_element_by_xpath('tbody')
for row in vcard_element.find_elements_by_xpath('tr'):
    try:
        if 'Population' in row.find_element_by_xpath('th').text:
            print(row.find_element_by_xpath('th').text)
    except Exception:
        pass
Issue
The code works. That is, it prints the row containing the word 'Population'.
Question: How do I tell Selenium to get the next row – the row containing the actual population number?
Use ./following::tr[1] or ./following-sibling::tr[1]
from selenium import webdriver

url = 'https://en.wikipedia.org/wiki/Arcadia'
browser = webdriver.Chrome()
browser.get(url)

vcard_element = browser.find_element_by_css_selector('#mw-content-text > div > table.infobox.geography.vcard').find_element_by_xpath('tbody')
for row in vcard_element.find_elements_by_xpath('tr'):
    try:
        if 'Population' in row.find_element_by_xpath('th').text:
            print(row.find_element_by_xpath('th').text)
            print(row.find_element_by_xpath('./following::tr[1]').text)     # whole row
            print(row.find_element_by_xpath('./following::tr[1]/td').text)  # only the number
    except Exception:
        pass
Output on Console:
Population (2011)
• Total 86,685
86,685
While you can certainly do this with selenium, I would personally recommend using requests and lxml, as they are much lighter weight than selenium and can get the job done just as well. I found the below to work for a few regions I tested:
import requests
from lxml import html

try:
    response = requests.get(url)
    infocard_rows = html.fromstring(response.content).xpath("//table[@class='infobox geography vcard']/tbody/tr")
except:
    print('Error retrieving information from ' + url)

try:
    population_row = 0
    for i in range(len(infocard_rows)):
        if infocard_rows[i].findtext('th') == 'Population':
            population_row = i + 1
            break
    population = infocard_rows[population_row].findtext('td')
except:
    print('Unable to find population')
In essence, html.fromstring().xpath() gets all of the rows from the infobox geography vcard table on the page. The next try/except block then just tries to locate the th whose inner text is Population and pulls the text from the next row's td (which is the population number).
Hopefully this is helpful, even if it isn't selenium like you were asking! Usually you'd use Selenium if you want to recreate browser behavior or inspect javascript elements. You can certainly use it here as well though.

Storing Multi-dimensional Lists?

(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I want everything to be in JSON format because I want to save this and load it in again later when I add "tags".
So, to be less vague: I'm writing a program which takes in data like what characters you have and what the missions require you to do (you can complete multiple at once if the attributes align), then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data, but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name, so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, 1 for the headers of the table and one for the rows of the table. The rows contain the "Answers" for the Headers' "Questions" / "Titles"; i.e. Maximum Level, 50.
This is true for everything but the first entry, which is the Name, Pronunciation (and I just want to store the name, of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os

target_dir = r"D:\00Coding\Js\WebScraper"  # Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir, 'TsumData.txt')

StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links

def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
    print(HeaderArray)
    print(RowArray)
    return (RowArray, HeaderArray)
    #Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
    #print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
    #TempFile = open(fullname, 'w') #Read only, Write Only, Append
    #TempFile.write("EHLLO")
    #TempFile.close()
    #print(td.tbody.Series)
    #print(td.tbody[Series])
    #print(td.tbody["Series"])
    #print(td.data-name)
    #time.sleep(1)

if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links:  # Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38:  # There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
        MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0:  # I don't want to scrape the entire website for a prototype
            break
        #print("This is the end ??")
        #time.sleep(3)

    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())

    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []

    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations:  # This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0])  # Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
        print(Iterations)

    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
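Put another way, a tiny illustration of the nesting described here (the values are placeholders, not scraped data):

MainHeaderArray = []
page_one_headers = ["Mickey", "Series", "Box Type"]   # stand-in for what get_tds() returns for one page
MainHeaderArray.append(page_one_headers)              # append() nests the whole list as a single entry
print(MainHeaderArray[0][0])                          # -> "Mickey": index the page first, then the field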
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use the "with open(...) as json_file" pattern and write/read (super easy).
I ultimately stored 3 JSON files. No big deal. Much easier than appending everything into one big file.
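For reference, a minimal sketch of that "with open(...) as json_file" pattern, assuming a dict keyed by Tsum name (the field names and values are placeholders based on the question, not real data):

import json

# Hypothetical structure: one entry per Tsum, keyed by name.
tsum_data = {
    "Mickey": {
        "Series": "Mickey & Friends",
        "Tags": ["Black", "White Gloves", "Mickey & Friends"],
    },
}

# Write the whole structure out once...
with open("TsumData.json", "w") as json_file:
    json.dump(tsum_data, json_file, indent=2)

# ...and load it back in later.
with open("TsumData.json") as json_file:
    loaded = json.load(json_file)

# Looking up characters that match a set of desired traits is then a simple filter.
wanted = {"Black", "Mickey & Friends"}
print([name for name, info in loaded.items() if wanted <= set(info["Tags"])])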

Python script extract from HTML

I'm writing a script that scans through a set of links. Within each link the script searches a table for a row. Once found, it increments the variable total_rank which is the sum ranks found on each web page. The rank is equal to the row number.
The code looks like this and is outputting zero:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'

r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []
for a in soup.select(".chooser-list ul"):
    list_entry = a.findAll('li')
    relative_link = list_entry[0].find('a')['href']
    link = "https://www.teamrankings.com" + relative_link
    stat_links.append(link)

total_rank = 0

for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    team_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer table")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank
    # time.sleep(1)

print total_rank
Debugging: team_rows is empty after the select() call. The thing is, I've also tried different tags. For example, I've tried soup.select(".scroll-wrapper div") and soup.select("#DataTables_Table_0_wrapper div"); all return nothing.
The selector
".tr-table datatable scrollable dataTable no-footer tr"
Selects a <tr> element anywhere under a <no-footer> element anywhere under a <dataTable> element....etc.
I think really "datatable scrollable dataTable no-footer" are classes on your .tr-table? So in that case, they should be joined with the first class with a period. So I believe the final correct selector is:
".tr-table.datatable.scrollable.dataTable.no-footer tr"
UPDATE: the new selector looks like this:
".tr-table.datatable.scrollable.dataTable.no-footer table"
The problem here is that the first part, .tr-table.datatable... refers to the table itself. Assuming you're trying to get the rows of this table:
<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">
The proper selector remains the one I originally suggested.
@audiodude's answer is correct, though the suggested selector is not working for me.
You don't need to check every single class of the table element. Here is the working selector:
team_rows = soup.select("table.datatable tr")
Also, if you need to find Oklahoma inside the table, you don't have to iterate over every row and cell in the table. Just search directly for the specific cell and get the previous sibling td containing the rank:
rank = soup.find("td", {"data-sort": "Oklahoma"}).find_previous_sibling("td").get_text()
total_rank += int(rank) # it is important to convert the row number to int
Also note that you are extracting more stats links than you should - looks like the Player Stats links should not be followed since you are focused specifically on the Team Stats. Here is one way to get Team Stats links only:
links_list = soup.find("h2", text="Team Stats").find_next_sibling("ul")
stat_links = ["https://www.teamrankings.com" + a["href"]
              for a in links_list.select("ul.expand-content li a[href]")]
