Python: not every web page has a certain element

When I use URLs to scrape web pages, I find that some elements exist only on some pages and not on others. Take this code as an example:
Code:
for urls in article_url_set:
    re=requests.get(urls)
    soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
    title_tag = soup.select_one('.page_article_title')
    if title_tag=True:
        print(title_tag.text)
    else:
        #do something
If title_tag exists, I want to print it; if it does not, just skip it.
Another thing: I need to save other elements and title_tag.text in data.
data={
    "Title":title_tag.text,
    "Registration":fruit_tag.text,
    "Keywords":list2
}
This raises an error because not every article has a title: 'NoneType' object has no attribute 'text'. What should I do to skip them when I try to save?
Edit: I decided not to skip them and to keep them as Null or None.

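Your code is wrong:
for urls in article_url_set:
    re=requests.get(urls)
    soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
    title_tag = soup.select_one('.page_article_title')
    if title_tag=True: # wrong
        print(title_tag.text)
    else:
        #do something
Your code has if title_tag=True; a single = is assignment, not comparison.
The comparison is written title_tag == True.
It is also recommended to write such conditions with the constant first:
title_tag == True => True == title_tag
That way a mistake produces an error straight away:
if you accidentally write True = title_tag, Python raises an error.
Note that a tag returned by select_one is truthy but does not compare equal to True, so for this particular check a plain truth test (if title_tag:) or if title_tag is not None: is what actually works.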

You can simply use a truth test to check whether the tag exists; otherwise assign a value like None. Then you can insert it into the data container:
title_tag = soup.select_one('.page_article_title')
if title_tag:
    print(title_tag.text)
    title = title_tag.text
else:
    title = None
Or in one line:
title = title_tag.text if title_tag else None
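The same test can be wrapped in a small helper so that every field of data is filled the same way. This is a minimal sketch, not part of the original answer: text_or_none is a hypothetical helper name, and SimpleNamespace objects stand in for the tags that soup.select_one(...) would return, so the snippet runs without fetching a page. The sample values are made up.

```python
from types import SimpleNamespace


def text_or_none(tag):
    """Return tag.text, or None when the lookup found nothing."""
    return tag.text if tag is not None else None


# Stand-ins for the results of soup.select_one(...): a found tag exposes
# .text, a missing element is None.
title_tag = None                             # this page has no title
fruit_tag = SimpleNamespace(text="REG-001")  # made-up registration value
list2 = ["python", "scraping"]               # made-up keywords

data = {
    "Title": text_or_none(title_tag),
    "Registration": text_or_none(fruit_tag),
    "Keywords": list2,
}
print(data)
```

With real pages, the only change would be passing the actual select_one results to text_or_none.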

Related

Is there any way instead of status_code to determine the request is true or false?

I'm using Python 3 with BeautifulSoup. I want to scrape data for a few employees from a site, depending on their ID number.
My code:
for UID in range(201810000,201810020):
    ID = UID
    print(ID)
    # scraped data
    ZeroDay = s.post("https://site/Add_StudantRow.php",data={"SID":ID})
    ZeroDay_content = bs(ZeroDay.content,"html.parser", from_encoding='windows-1256')
    std_ID = ZeroDay_content.find("input", {"name":"SID[]"})["value"]
    std_name = ZeroDay_content.find("input", {"name":"Name[]"})["value"]
    std_major_= ZeroDay_content.select_one("option[selected]", {"name":"Qualifications[]"})["value"]
    std_major = ZeroDay_content.find("input", {"name":"Specialization[]"})["value"]
    std_social= ZeroDay_content.select_one("select[name='MILITARY_STATUS[]'] option[selected]")["value"]
    std_ID_num= ZeroDay_content.find("input", {"name":"ID_Number[]"})["value"]
    std_gender= ZeroDay_content.select_one("select[name='Gender[]'] option[selected]")["value"]
    print(std_ID,std_name,std_gender,std_major,std_major_,std_ID_num,std_social)
After I ran my code, this error appeared:
    std_ID = ZeroDay_content.find("input", {"name":"SID[]"})["value"]
TypeError: 'NoneType' object is not subscriptable
I assigned a range of IDs from 201810000 to 201810020, but not all of the IDs are valid. For example, 201810015 may be invalid while 201810018 is valid.
Note: when I put a valid ID in UID, the error does not appear, so it seems the error occurs when an ID returns nothing. How can I loop over a range of IDs in this case?

As not all of your UID values return a valid page, you would just need to first test for the presence of a required tag. As you are looking for form elements, I assume there will be an enclosing <form> tag you could test for first.
For example:
for UID in range(201810000, 201810020):
    ID = UID
    print(ID)
    ZeroDay = s.post("https://site/Add_StudantRow.php", data={"SID":ID})
    ZeroDay_content = bs(ZeroDay.content, "html.parser", from_encoding='windows-1256')
    if ZeroDay_content.find("form", <xxxxxxx>):
        std_ID = ZeroDay_content.find("input", {"name":"SID[]"})["value"]
        std_name = ZeroDay_content.find("input", {"name":"Name[]"})["value"]
        std_major_= ZeroDay_content.select_one("option[selected]", {"name":"Qualifications[]"})["value"]
        std_major = ZeroDay_content.find("input", {"name":"Specialization[]"})["value"]
        std_social= ZeroDay_content.select_one("select[name='MILITARY_STATUS[]'] option[selected]")["value"]
        std_ID_num= ZeroDay_content.find("input", {"name":"ID_Number[]"})["value"]
        std_gender= ZeroDay_content.select_one("select[name='Gender[]'] option[selected]")["value"]
        print(std_ID, std_name, std_gender, std_major, std_major_, std_ID_num, std_social)
Where <xxxxxxx> would be suitable attributes to search for.
The error occurs because your first .find() call returns None to indicate that the item is not present. You then index None with ["value"] without first testing whether the required item was found, which raises the TypeError.
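If no suitable enclosing tag is available to test for, the same guard can be applied to the first lookup itself. The following is a minimal sketch under stated assumptions: attr_or_none is a hypothetical helper, and plain dicts stand in for the tags returned by find(), since both support .get(). None models the result for an invalid ID. The two IDs are the ones mentioned in the question; which of them is really valid is not known.

```python
def attr_or_none(tag, attr):
    """Return the attribute value, or None if the tag or attribute is missing."""
    if tag is None:
        return None
    return tag.get(attr)


# Stand-ins for ZeroDay_content.find("input", {"name": "SID[]"}) per ID:
# None models an invalid ID, a dict models a found tag.
lookups = {
    201810015: None,
    201810018: {"value": "201810018"},
}

valid_ids = []
for uid, sid_tag in lookups.items():
    std_ID = attr_or_none(sid_tag, "value")
    if std_ID is None:
        continue  # invalid ID: skip the remaining lookups for this page
    valid_ids.append(std_ID)

print(valid_ids)
```

In the real loop, the continue would sit directly after the SID[] lookup, so the other fields are only read for pages that actually contain a record.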

I resolved this by adding an if statement that uses the Content-Length to determine whether the request returned anything, because I noticed that the Content-Length is less than 170 when the request returns nothing and more than 170 when it returns something.

BeautifulSoup checking if an element has a specific class

for containerElement in container:
    brandingElement = containerElement.find("div", class_="item-branding")
    titleElement=containerElement.find("a", class_="item-title")
    rating = brandingElement.find("i", {"class":"rating"})["aria-label"]
    priceElement = containerElement.find("li", class_="price-current")
This for loop checks for the price, rating, and name of each item on a website. It works. However, some items have no reviews, in which case it fails. How do I fix this? I was thinking of an if statement to check whether the containerElement (the container that holds the item and all its information) has a rating, but I'm not exactly sure how to do that.

for containerElement in container:
    brandingElement = containerElement.find("div", class_="item-branding")
    titleElement = containerElement.find("a", class_="item-title")
    # find() returns None when the item has no rating, so test before indexing
    ratingElement = brandingElement.find("i", {"class": "rating"})
    rating = ratingElement["aria-label"] if ratingElement else ""
    priceElement = containerElement.find("li", class_="price-current")

Hacker News API, KeyError: 'title'

I'm just starting to learn Python, and I am beginning with the web-based side of it.
Following the instructions I have, I keep getting a KeyError: 'title' on line 18. As I see it, the request is not returning a title, so it raises an error. How would I write it to give a generic description if there is no 'title'?
import requests
from operator import itemgetter as ig
url = 'https://hacker-news.firebaseio.com/v0/topstories.json'
r = requests.get(url)
print("Status Code:", r.status_code)
submission_ids= r.json()
submission_dicts = []
for submission_id in submission_ids[:30]:
    url= ("https://hacker-news.firebaseio.com/v0/item" + str(submission_id) + '.json')
    submission_r = requests.get(url)
    print(submission_r.status_code)
    response_dict = submission_r.json()
    submission_dict = {
        'title': response_dict['title'],
        'link': "https://news.ycombinator.com/item?id=" +str(submission_id),
        'comments': response_dict.get('descendants', 0)
    }
submission_dicts = sorted(submission_dicts, key= ig('comments'), reverse= True)
for submission_dict in submission_dicts:
    print("\nTitle:", submission_dict['title'])
    print("Discussion link:", submission_dict['link'])
    print("Comments:", submission_dict['comments'])
Status Code: 200
401
Traceback (most recent call last):
File "C:\Users\Shit Head\Documents\Programming\Tutorial Files\hn_submissions.py", line 18, in <module>
'title': response_dict['title'],
KeyError: 'title'
[Finished in 1.2s]

It sounds like you're just looking for the get method:
get(key[, default])
Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.
So, instead of this:
'title': response_dict['title'],
… you do this:
'title': response_dict.get('title', 'Generic Hacker News Submission'),
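One observation about the output in the question: the 401 printed just before the traceback suggests that the item request itself failed, and the URL built in the loop has no slash between item and the ID. The following is a minimal sketch, assuming the item endpoint has the form https://hacker-news.firebaseio.com/v0/item/<id>.json. The helper names item_url and build_submission are hypothetical, and the sample dictionaries are made up for illustration.

```python
def item_url(submission_id):
    """Build the item URL, with a slash between 'item' and the ID."""
    return f"https://hacker-news.firebaseio.com/v0/item/{submission_id}.json"


def build_submission(response_dict, submission_id):
    """Summarize one API response, tolerating missing keys via dict.get."""
    return {
        'title': response_dict.get('title', 'Generic Hacker News Submission'),
        'link': "https://news.ycombinator.com/item?id=" + str(submission_id),
        'comments': response_dict.get('descendants', 0),
    }


# Made-up responses: one complete, one error payload without a title.
complete = build_submission({'title': 'Example', 'descendants': 12}, 1)
untitled = build_submission({'error': 'Permission denied'}, 2)
print(item_url(1))
print(complete['title'], '|', untitled['title'], '|', untitled['comments'])
```

In the question's loop, each submission_dict would also need to be appended to submission_dicts before the list is sorted; as posted, the list stays empty.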
Under the covers, this is just a more convenient way to write something you could have done anyway. The following are all pretty much equivalent:
# dict.get with a default
title = response_dict.get('title', 'Generic')
# conditional expression
title = response_dict['title'] if 'title' in response_dict else 'Generic'
# if/else statement
if 'title' in response_dict:
    title = response_dict['title']
else:
    title = 'Generic'
# try/except
try:
    title = response_dict['title']
except KeyError:
    title = 'Generic'
This is worth knowing because Python usually only provides shortcuts like get for really common cases like looking things up in a dictionary. If you wanted to do the same thing with, say, a list that may be empty or have one item, or a file that might or might not exist, or a regular expression that might return a match with a group string or might return None, you'd need to do things the long way.
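To illustrate "the long way" for two of those cases, here is a minimal sketch using only the standard library. The helper names first_or_default and group_or_default are hypothetical, as are the sample inputs.

```python
import re


def first_or_default(items, default=None):
    """Return the first element of a possibly empty list, else default."""
    return items[0] if items else default


def group_or_default(pattern, text, default=None):
    """Return group 1 of the first match, else default when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default


print(first_or_default([], 'Generic'))
print(first_or_default(['only item'], 'Generic'))
print(group_or_default(r'id=(\d+)', 'item?id=42', 'Generic'))
print(group_or_default(r'id=(\d+)', 'no id here', 'Generic'))
```

Both helpers follow the same pattern as the conditional expression shown for the dictionary: test for the "missing" case first, and only then index or call into the result.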

Find page for a specific item in paginate() SQLAlchemy

I am using Flask-SQLAlchemy's paginate(). Now I need to find the page that contains a specific comment id.
For example, this works if all comments are on the same page:
new_dict['url'] = '/comments#comment_' + str(comment.id)
However, in my case I need this structure:
/comments?page=1#comment_73
How can I find the page?

From the docs, the Pagination class has .items and .has_next properties and a .next method we can use:
page_number = 0
search = Comment.query.get(15)
query = Comment.query.filter(Comment.id<40)
for num in range(1, query.paginate(1).pages + 1):
    if search in query.paginate(num).items:
        page_number = num
        break
or
page_number = 0
search = Comment.query.get(15)
pag = Comment.query.filter(Comment.id<40).paginate(1)
while pag.has_next:
    if search in pag.items:
        page_number = num
        break
    pag.next()

As far as I know, Celeo's answer won't work. For example, according to the documentation, what pag.next() does in his code is:
Returns a Pagination object for the next page.
So it does nothing unless you update your variable. I also recommend not creating a new query, since you already have the comment_id:
comment_id = request.args.get('comment_id')
if comment_id and comment_id.isdigit():
    comment_id = int(comment_id)
    page_number = -1
    index = 1  # page numbers are 1-indexed in the Pagination object
    while True:
        for comment in comments_pagination_object.items:
            if comment.id == comment_id:
                page_number = index
                break
        # stop once the comment is found or the last page has been checked
        if page_number != -1 or not comments_pagination_object.has_next:
            break
        index += 1
        comments_pagination_object = comments_pagination_object.next()
Then, in the URL, you will have something like:
/comments?comment_id=2
The comments_pagination_object.next() call advances the Pagination object page by page until one of its items (here an instance of class Comment) has the same id as your request argument.
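The idea shared by both answers, walking the pages until the comment turns up, can be sketched as one helper. This is an illustration under stated assumptions, not the Flask-SQLAlchemy API itself: find_page is a hypothetical name, and FakePagination is a tiny stand-in that only models the .items, .has_next and .next() members mentioned above, so the snippet runs without a database.

```python
from types import SimpleNamespace


class FakePagination:
    """Stand-in exposing only .items, .has_next and .next()."""

    def __init__(self, pages, index=0):
        self._pages = pages
        self._index = index

    @property
    def items(self):
        return self._pages[self._index]

    @property
    def has_next(self):
        return self._index + 1 < len(self._pages)

    def next(self):
        return FakePagination(self._pages, self._index + 1)


def find_page(pagination, comment_id):
    """Return the 1-based page holding comment_id, or None if absent."""
    page_number = 1
    while True:
        if any(comment.id == comment_id for comment in pagination.items):
            return page_number
        if not pagination.has_next:
            return None  # the last page was checked too
        pagination = pagination.next()  # rebind, next() returns a new object
        page_number += 1


# Made-up comments spread over three pages.
pages = [
    [SimpleNamespace(id=i) for i in (1, 2, 3)],
    [SimpleNamespace(id=i) for i in (4, 5, 6)],
    [SimpleNamespace(id=73)],
]
first_page = FakePagination(pages)
print(find_page(first_page, 73))
```

The result could then be used to build a URL of the form /comments?page=<n>#comment_<id>, as requested in the question.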

Check whether a link is disabled using Selenium in Python?

I need to check whether a link's disabled attribute is set in the following code:
<a id="ctl00_ContentPlaceHolder1_lbtnNext" disabled="disabled">Next</a>
However, on the last page, if I execute
next_pg=driver.find_element_by_xpath("//a[@id='ctl00_ContentPlaceHolder1_lbtnNext']")
next_pg.click()
print next_pg.is_enabled()
I get True as the output, which should not be the case.
Also, Next is coded as shown above only on the last page. On all other pages it is coded as follows, and checking is_enabled() there produces an error.
<a id="ctl00_ContentPlaceHolder1_lbtnNext" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$lbtnNext','')">
How should I solve this?

Use this answer to get the attributes of the tag:
attrs = driver.execute_script('var items = {}; for (index = 0; index < arguments[0].attributes.length; ++index) { items[arguments[0].attributes[index].name] = arguments[0].attributes[index].value }; return items;', next_pg)
and check for the presence of the disabled attribute and its value:
if 'disabled' in attrs and attrs['disabled'] == 'disabled':
    # ...
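A shorter variant may be to ask the element for the single attribute instead of collecting all of them. The sketch below assumes that the element's get_attribute method returns None when the attribute is absent, which is how Selenium's Python bindings document it. The helper name is_disabled is hypothetical, and FakeElement is only a stand-in for a WebElement so that the snippet runs without a browser.

```python
def is_disabled(element):
    """Return True when the element carries a disabled attribute."""
    return element.get_attribute("disabled") is not None


class FakeElement:
    """Stand-in for a Selenium WebElement; only get_attribute is modeled."""

    def __init__(self, attrs):
        self._attrs = attrs

    def get_attribute(self, name):
        return self._attrs.get(name)


# The two forms of the Next link described in the question.
last_page_link = FakeElement({
    "id": "ctl00_ContentPlaceHolder1_lbtnNext",
    "disabled": "disabled",
})
other_page_link = FakeElement({
    "id": "ctl00_ContentPlaceHolder1_lbtnNext",
    "href": "javascript:__doPostBack('ctl00$ContentPlaceHolder1$lbtnNext','')",
})

print(is_disabled(last_page_link), is_disabled(other_page_link))
```

With a real driver, the check would go before next_pg.click(), so the loop over pages can stop once the link is disabled.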
