I'm trying to get my code to increment through the pages of this website, but I can't seem to get it to loop and increment; instead it does the first page and gives up. Is there something I'm doing wrong?
if(pageExist is not None):
    if(countitup != pageNum):
        countitup = countitup + 1
        driver.get('http://800notes.com/Phone.aspx/%s/%s' % (tele800, countitup))
        delay = 4
        scamNum = soup.find_all(text=re.compile(r"Scam"))
        spamNum = soup.find_all(text=re.compile(r"Call type: Telemarketer"))
        debtNum = soup.find_all(text=re.compile(r"Call type: Debt Collector"))
        hospitalNum = soup.find_all(text=re.compile(r"Hospital"))
        scamCount = len(scamNum) + scamCount
        spamCount = len(spamNum) + spamCount
        debtCount = len(debtNum) + debtCount
        hospitalCount = len(hospitalNum) + hospitalCount
        block = soup.find(text=re.compile(r"OctoNet HTTP filter"))
        extrablock = soup.find(text=re.compile(r"returning an unknown error"))
        type(block) is str
        type(extrablock) is str
        if(block is not None or extrablock is not None):
            print("\n Damn. Gimme an hour to fix this.")
            time.sleep(2000)
Repo: https://github.com/GarnetSunset/Haircuttery/tree/Experimental
pageExist is not None seems to be the problem.
It checks whether the page is None, and it will most likely never be None. There is no official way to check for HTTP response codes in Selenium, but we can use something like this:
if driver.find_elements_by_xpath('/html/body/p[contains(text(), "400")]'):
    # this will check if there's a 400 code in the p tag
or
if '400' in driver.find_element_by_xpath('/html/body/p[1]').text:
I'm sure there are other ways to do this, but this is one of them, and that's the only issue here. You can then increment and keep the rest of your code pretty much the same once you fix the first if.
I might have made some (syntax) mistakes in my code since I'm not testing it, but the logic applies. Great code, though!
Also instead of
type(block) is str
type(extrablock) is str
the Pythonic way is to use isinstance:
isinstance(block, str)
isinstance(extrablock, str)
And as for time.sleep, you can use WebDriverWait; there are two available approaches, implicit and explicit waits. Please take a look here.
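For example, a minimal sketch of an explicit wait, assuming you are waiting for the page body to be present (swap in whatever locator you actually need):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to `delay` seconds for the element instead of sleeping blindly.
delay = 4
WebDriverWait(driver, delay).until(
    EC.presence_of_element_located((By.TAG_NAME, 'body'))
)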
Related
I'm trying to work out the number of for loops to run depending on the number of listings (totalListNum),
and it seems that it is returning NoneType when in fact it should be returning either text or an int.
Website: https://stamprally.org/?search_keywords=&search_keywords_operator=and&search_cat1=68&search_cat2=0
Code Below
for prefectureValue in prefectureValueStorage:
    driver.get(
        f"https://stamprally.org/?search_keywords&search_keywords_operator=and&search_cat1={prefectureValue}&search_cat2=0")
    # Calculate How Many Times To Run Page Loop
    totalListNum = driver.find_element_by_css_selector(
        'div.page_navi2.clearfix>p').get_attribute('text')
    totalListNum.text.split("件中")
    if totalListNum[0] % 10 != 0:
        pageLoopCount = math.ceil(totalListNum[0])
    else:
        continue
    currentpage = 0
    while currentpage < pageLoopCount:
        currentpage += 1
        print(currentpage)
I don't think you should use get_attribute. Instead, try this:
totalListNum = driver.find_element_by_css_selector('div.page_navi2.clearfix>p').text
First, your locator is not unique
Use this:
div.page_navi2.clearfix:nth-of-type(1)>p
or for the second element:
div.page_navi2.clearfix:nth-of-type(2)>p
Second, as already mentioned, use .text to get the text.
If .text does not work you can use .get_attribute('innerHTML')
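Putting those pieces together, here is a minimal sketch of the counting logic. It assumes the paragraph text starts with the total count followed by "件中" (e.g. "135件中 ...") and that the site shows 10 listings per page; adjust the parsing if the page differs:
import math

# Assumption: the part of the <p> text before "件中" is the total number of listings.
total_text = driver.find_element_by_css_selector('div.page_navi2.clearfix:nth-of-type(1)>p').text
total_list_num = int(total_text.split("件中")[0])

# 10 listings per page, so round up to get how many pages to loop over.
page_loop_count = math.ceil(total_list_num / 10)

for current_page in range(1, page_loop_count + 1):
    print(current_page)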
I would like to improve the readability of the following code, especially lines 8 to 11:
import requests
from bs4 import BeautifulSoup
URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
question1 = str(soup.find(id='i1'))
question1 = question1.split('>')[1].lstrip().split('.')[1]
question1 = question1[1:]
question1 = question1.replace("_", "")
print(question1)
Thanks in advance :)
You could use the following
question1 = soup.find(id='i1').getText().split(".")[1].replace("_","").strip()
to replace lines 8 to 11.
.getText() takes care of removing the html-tags. Rest is pretty much the same.
In Python you can almost always just chain operations, so your code would also be valid as a one-liner:
question1 = str(soup.find(id='i1')).split('>')[1].lstrip().split('.')[1][1:].replace("_", "")
But in most cases it is better to leave the code in a more readable form than to reduce the line-count.
Abhinav, it's not very clear what you want to achieve; the script is actually already very simple, which is a good thing and follows the Pythonic principle from The Zen of Python:
"Simple is better than complex."
It's also not clear what you actually mean by improving it:
Make it simpler, as in understandable and clear for human beings?
Make it simpler for the machine to compute, and hence improve performance?
Reduce the number of lines of code and follow programming guidelines more closely?
I point this out because next time it would be better to make this explicit in the question. Having said that, since I don't know exactly what you mean, I have come up with an answer that more or less covers all 3 points:
ANSWER
import requests
from bs4 import BeautifulSoup
URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# ========= < FUNCTION TO GET ALL QUESTION DYNAMICALLY > ========= #
def clean_string_by_id(page, id):
    content = str(page.find(id=id))  # Get the content of the page by the different ids
    if content != 'None':  # Check whether there is actual content or not
        find_question = content.split('>')  # NOTE: Split at tag closings
        if len(find_question) >= 2 and find_question[1][0].isdigit():  # NOTE: If len is 1 it is not the correct element; we also check that the first character is a digit
            cleaned_question = find_question[1].split('.')[1].strip()  # We get the actual question and strip it already!
            result = cleaned_question.replace('_', '')
            return result
    else:
        return

# ========= < Scan the entire page dynamically + add results to a list > ========= #
all_questions = []
for i in range(1, 50):  # NOTE: I went up to 50 but there may be many more, I let you test it
    get_question = clean_string_by_id(soup, f'i{i}')
    if get_question:  # Append the result to the list only if there is actual content
        all_questions.append(get_question)

# ========= < Show all results > ========= #
for question in all_questions:
    print(question)
NOTE
Here I'm assuming that you want to get all the elements from this page, hence you don't want to write 2000 variables. As you can see, I left the logic basically the same as yours but wrapped everything in a function instead.
In fact, the steps you followed were pretty good, and yes, you could "improve it" or make it "smarter"; however, comprehensibility wins over complexity. Also keep in mind that I assumed that getting all the 'questions' from that Google Form was your goal.
EDIT
As pointed out by #wuerfelfreak, and as he explains in his answer, further improvement can be achieved by using the getText() function.
Hence, here is the result of the above function using getText:
def clean_string_by_id(page, id):
    content = page.find(id=id)
    if content:  # NOTE: Check whether there is actual content or not
        find_question = content.getText()  # NOTE: getText() already strips the html tags
        if find_question:  # NOTE: Same as `if len(find_question) >= 1`; if it is empty we skip it
            cleaned_question = find_question.split('.')[1].strip()  # Same as before
            result = cleaned_question.replace('_', '')
            return result
Documentation & Guides
Zen of Python
getText
geeksforgeeks.org | isdigit()
I am trying to get a list of all JIRA issues so that I may iterate through them in the following manner:
from jira import JIRA
jira = JIRA(basic_auth=('username', 'password'), options={'server':'https://MY_JIRA.atlassian.net'})
issue = jira.issue('ISSUE_KEY')
print(issue.fields.project.key)
print(issue.fields.issuetype.name)
print(issue.fields.reporter.displayName)
print(issue.fields.summary)
print(issue.fields.comment.comments)
The code above returns the desired fields (but only one issue at a time); however, I need to be able to pass a list of all issue keys into:
issue = jira.issue('ISSUE_KEY')
The idea is to write a for loop that would go through this list and print the indicated fields.
I have not been able to populate this list.
Can someone point me in the right direction please?
def get_all_issues(jira_client, project_name, fields):
    issues = []
    i = 0
    chunk_size = 100
    while True:
        chunk = jira_client.search_issues(f'project = {project_name}', startAt=i, maxResults=chunk_size, fields=fields)
        i += chunk_size
        issues += chunk.iterable
        if i >= chunk.total:
            break
    return issues

issues = get_all_issues(jira, 'JIR', ["id", "fixVersion"])
options = {'server': 'YOUR SERVER NAME'}
jira = JIRA(options, basic_auth=('YOUR EMAIL', 'YOUR PASSWORD'))
size = 100
initial = 0
while True:
    start = initial * size
    issues = jira.search_issues('project=<NAME OR ID>', start, size)
    if len(issues) == 0:
        break
    initial += 1
    for issue in issues:
        print('ticket-no=', issue)
        print('IssueType=', issue.fields.issuetype.name)
        print('Status=', issue.fields.status.name)
        print('Summary=', issue.fields.summary)
The first 3 arguments of jira.search_issues() are the JQL query, the starting index (0-based, hence the multiplication on line 6), and the maximum number of results.
You can execute a search instead of a single issue get.
Let's say your project key is PRO-KEY, to perform a search, you have to use this query:
https://MY_JIRA.atlassian.net/rest/api/2/search?jql=project=PRO-KEY
This will return the first 50 issues of PRO-KEY along with the total number of matching issues (in the total field of the response).
Taking that number, you can perform further searches by adding this to the previous query:
&startAt=50
With this new parameter you will be able to fetch the issues from 51 to 100 (or 50 to 99 if you consider the first issue to be 0).
The next query will use &startAt=100, and so on, until you have fetched all the issues in PRO-KEY.
If you wish to fetch more than 50 issues, add to the query:
&maxResults=200
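As a rough sketch of that pagination loop in Python using the requests library (the credentials, server name, and JQL here are placeholders, not values from your setup):
import requests

BASE = 'https://MY_JIRA.atlassian.net/rest/api/2/search'
auth = ('username', 'password')

start_at = 0
all_issues = []
while True:
    resp = requests.get(BASE,
                        params={'jql': 'project=PRO-KEY', 'startAt': start_at, 'maxResults': 50},
                        auth=auth)
    data = resp.json()
    all_issues.extend(data['issues'])   # each entry is one issue as JSON
    start_at += len(data['issues'])
    if start_at >= data['total']:       # 'total' is the total number of matching issues
        break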
You can use the jira.search_issues() method to pass in a JQL query. It will return the list of issues matching the JQL:
issues_in_proj = jira.search_issues('project=PROJ')
This will give you a list of issues that you can iterate through
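For example, to print a couple of the fields from the original question for each returned issue:
for issue in issues_in_proj:
    print(issue.key, issue.fields.summary)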
Starting with Python 3.8, reading all issues can be done relatively short and elegantly:
issues = []
while issues_chunk := jira.search_issues('project=PROJ', startAt=len(issues)):
    issues += [issue for issue in issues_chunk]
(since we need len(issues) in every step, we cannot use a list comprehension for the whole loop, can we?)
Together with initialization, caching, and some "preprocessing" (e.g. just taking issue.raw) you could write something like this:
jira = jira.JIRA(
    server="https://jira.at-home.com",
    basic_auth=json.load(open(os.path.expanduser("~/.jira-credentials"))),
    validate=True,
)
issues = json.load(open("jira_issues.json"))

while issues_chunk := jira.search_issues('project=PROJ', startAt=len(issues)):
    issues += [issue.raw for issue in issues_chunk]

json.dump(issues, open("jira_issues.json", "w"))
I am trying to build a script where I can get the check-ins for a specific location. For some reason, when I specify lat/long coords, VK never returns any check-ins, so I have to fetch location IDs first and then request the check-ins from that list. However, I am not sure how to use the offset feature, which I presume is supposed to work somewhat like a pagination function.
So far I have this:
import vk
import json

app_id = #enter app id
login_nr = #enter your login phone or email
password = '' #enter password

vkapi = vk.API(app_id, login_nr, password)
vkapi.getServerTime()

def get_places(lat, lon, rad):
    name_list = []
    try:
        locations = vkapi.places.search(latitude=lat, longitude=lon, radius=rad)
        name_list.append(locations['items'])
    except Exception, e:
        print '*********------------ ERROR ------------*********'
        print str(e)
    return name_list

# Returns last checkins up to a maximum of 100
# Define the number of checkins you want, 100 being maximum
def get_checkins_id(place_id, check_count):
    checkin_list = []
    try:
        checkins = vkapi.places.getCheckins(place=place_id, count=check_count)
        checkin_list.append(checkins['items'])
    except Exception, e:
        print '*********------------ ERROR ------------*********'
        print str(e)
    return checkin_list
What I would like to do eventually is combine the two into a single function but before that I have to figure out how offset works, the current VK API documentation does not explain that too well. I would like the code to read something similar to:
def get_users_list_geo(lat, lon, rad, count):
    users_list = []
    locations_lists = []
    users = []
    locations = vkapi.places.search(latitude=lat, longitude=lon, radius=rad)
    for i in locations[0]:
        locations_list.append(i['id'])
    for i in locations:
        # Get each location ID
        # Get checkins for the location
        # Append the checkin and ID to the list
From what I understand, I have to track the offset when getting the check-ins and then somehow account for locations that have more than 100 check-ins. Anyway, I would greatly appreciate any kind of help or advice. If you have any suggestions on the script, I would love to hear them as well. I am teaching myself Python, so clearly I am not very good so far.
Thanks!
I've worked with the VK API in JavaScript, but I think the logic is the same.
TL;DR: Offset is the number of results (starting with the first) which the API should skip in the response.
For example, you make a query which should return 1000 results (let's imagine that you know the exact number of results).
But VK returns only 100 per request. So, how do you get the other 900?
You say to the API: give me the next 100 results. "Next" is the offset - the number of results you want to skip because you've already handled them. So, the VK API takes the 1000 results, skips the first 100, and returns the next (second) 100 to you.
Also, if you are talking about this method (http://vk.com/dev/places.getCheckins) in the first paragraph, please check that your lat/long values are floats, not integers. It could also be useful to try swapping lat/long - maybe you got them mixed up?
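To make the offset idea concrete, here is a minimal sketch of a paging loop built on the OP's vkapi object, assuming places.getCheckins accepts an offset parameter as described above:
# Sketch only: `vkapi` is the API object from the question, and the `offset`
# parameter is assumed to behave as explained above (skip N already-seen results).
def get_all_checkins(place_id, page_size=100):
    all_checkins = []
    offset = 0
    while True:
        chunk = vkapi.places.getCheckins(place=place_id, count=page_size, offset=offset)
        items = chunk['items']
        if not items:
            break
        all_checkins.extend(items)
        offset += len(items)  # skip what we have already handled on the next request
    return all_checkins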
I'm incredibly new to Python, and I'm trying to write something to get the first result returned from Google's "I'm Feeling Lucky" button. I have a list of 100 items I need it to get URLs for. Here's what I have:
import requests

with open('2012.txt') as f:
    lines = f.readlines()

for i in range(0, 100):
    temp1 = "r'http://www.google.com/search?q=\""
    temp2 = "\"&btnI'"
    temp3 = lines[i]
    temp3 = temp3[:-1]
    temp4 = temp1 + temp3 + temp2
    print temp4
    var = requests.get(temp4)
    print var.url
Now, if I print the value in temp4 and paste it into requests.get(), it works as I want it to. However, I get errors every time I try to pass temp4 in instead of a hard-coded string.
Specifically, I guess you're getting:
requests.exceptions.InvalidSchema: No connection adapters were found for 'r'http://www.google.com/search?q="foo"&btnI''
(except with something else in lieu of foo :-) -- please post exceptions as part of your question; why make us guess or have to reproduce them?!
The problem is obviously that leading r' which does indeed make the string into an invalid schema (the trailing ' doesn't help either).
So, try instead something like:
temp1 = 'http://www.google.com/search?q="'
temp2 = '"&btnI'
and things should go better... specifically, when I do that (still with 'foo' in lieu of a real temp3), I get
http://en.wikipedia.org/wiki/Foobar
which seems to make sense as the top search result for "foo"!-)
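Putting the fix together, a minimal sketch of the corrected loop (keeping the same 2012.txt input file; whether Google still honors &btnI for your queries is up to the site):
import requests

with open('2012.txt') as f:
    lines = [line.rstrip('\n') for line in f]

for query in lines[:100]:
    # Build a plain URL: no stray r' prefix, no trailing quote
    url = 'http://www.google.com/search?q="' + query + '"&btnI'
    response = requests.get(url)
    print(response.url)  # the URL you were redirected to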