How to web scrape when page offset value never ends - python

I'm trying to scrape player data from https://sofifa.com using BeautifulSoup. Every page displays 60 players, so I use the offset query param (for example, https://sofifa.com/players?offset=60 shows the 2nd page) to access every player's info.
One thing I noticed is that the offset value never ends (i.e. no matter how large an offset I provide, it always shows me a page). Specifically, for offset > 20000 or so it always displays the 1st page (essentially, after exhausting all players it rolls over to the 1st page and keeps showing that for all subsequent, higher offset values). Try https://sofifa.com/players?offset=20000000 to get an idea of what I mean.
I want to know if there is any way to programmatically find the last "valid" offset value, beyond which I am sure to get the 1st page back. That would help me decide when I have reached the end of the dataset.
Currently this is how I scrape:
for offset in range(0, 20000, 60):
    try:
        print("Processing page at offset " + str(offset))
        sofifa_url = "https://sofifa.com/players?offset=" + str(offset)
        # start scraping the page
        # ...
        # ...
    except Exception as e:
        print("Exception occurred: " + str(e))
        continue
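One way to detect the rollover programmatically, rather than hard-coding an upper bound, is to remember a fingerprint of the first page (for example, the first player link) and stop as soon as a later offset serves the same content or serves no player rows at all. Below is a minimal sketch of that idea, not from the original post; it assumes requests and BeautifulSoup, and the `tbody tr a[href^='/player/']` selector is a guess that may need adjusting to the site's actual markup.

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://sofifa.com/players?offset="

def first_player_href(offset):
    """Return the href of the first player row at this offset, or None if no rows."""
    resp = requests.get(BASE_URL + str(offset), headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")
    # NOTE: this selector is an assumption; adapt it to the page's real structure.
    link = soup.select_one("tbody tr a[href^='/player/']")
    return link["href"] if link else None

page_one_marker = first_player_href(0)

offset = 0
while True:
    marker = first_player_href(offset)
    # Stop when the page is empty or has rolled over to page 1 again.
    if marker is None or (offset > 0 and marker == page_one_marker):
        print("Last valid offset was " + str(offset - 60))
        break
    # ... scrape the page at this offset ...
    offset += 60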

Related

How can I actually free up CPU resources for this for loop in Jupyter Notebook?

I'm trying to run an automated process in a Jupyter Notebook (on deepnote.com) every single day, but after running the very first iteration of a while loop and starting the next one (at the for loop inside the while loop), the virtual machine crashes, throwing the message below:
KernelInterrupted: Execution interrupted by the Jupyter kernel
Here's the code:
# ...
# ...
# ...
while y < 5:
    print(f'\u001b[45m Try No. {y} out of 5 \033[0m')
    # make the driver wait up to 10 seconds before doing anything.
    driver.implicitly_wait(10)
    # values for the example.
    # Declaring several variables for looping.
    # Let's start at the newest page.
    link = 'https...'
    driver.get(link)
    # Here we use an XPath expression to get the initial page
    initial_page = int(driver.find_element_by_xpath('Xpath').text)
    print(f'The initial page is the No. {initial_page}')
    final_page = initial_page + 120
    pages = np.arange(initial_page, final_page + 1, 1)
    minimun_value = 0.95
    maximum_value = 1.2
    # the variable to_place is set as a string value that must exist in the rows in order to be scraped.
    # if it doesn't exist, the row is ignored.
    to_place = 'A particular place'
    # the same comment stated above applies to the variable POINTS.
    POINTS = 'POINTS'
    # let's set a final dataframe which will contain all the scraped data from the arange that
    # matches the parameters set (minimun_value, maximum_value, to_place, POINTS).
    df_final = pd.DataFrame()
    dataframe_final = pd.DataFrame()
    # set another final dataframe for the 2ND PART OF THE PROCESS.
    initial_df = pd.DataFrame()
    # set a for loop for each page from the arange.
    for page in pages:
        # INITIAL SEARCH.
        # look for general data of the link.
        # amount of results and pages for the execution of the for loop, "page" variable is used within the {}.
        url = 'https...page={}&p=1'.format(page)
        print(f'\u001b[42m Current page: {page} \033[0m ' + '\u001b[42m Final page: ' + str(final_page) + '\033[0m ' + '\u001b[42m Pages left: ' + str(final_page - page) + '\033[0m ' + '\u001b[45m Try No. ' + str(y) + ' out of ' + str(5) + '\033[0m' + '\n')
        driver.get(url)
        # Here we tell the scraper to try to find the total number of subpages a particular page has, if that page IS NOT empty.
        # If so, the scraper will proceed to execute the rest of the procedure.
        try:
            subpages = driver.find_element_by_xpath('Xpath').text
            print(f'Reading the information about the number of subpages of this page ... {subpages}')
            subpages = int(re.search(r'\d{0,3}$', subpages).group())
            print(f'This page has {subpages} subpages in total')
            df = pd.DataFrame()
            df2 = pd.DataFrame()
            print(df)
            print(df2)
            # FOR LOOP.
            # search at each subpage all the rows that match the parameters set previously:
            # minimun_value, maximum_value, to_place, POINTS.
            # set a sub-loop for each row from the table of each subpage of each page
            for subpage in range(1, subpages + 1):
                url = 'https...page={}&p={}'.format(page, subpage)
                driver.get(url)
                identities_found = int(driver.find_element_by_xpath('Xpath').text.replace('A total of ', '').replace(' identities found', '').replace(',', ''))
                identities_found_last = identities_found % 50
                print(f'Page: {page} of {pages}')  # AT THIS LINE IT CRASHED THE LAST TIME
                # ...
                # ...
                # ...
        # If the particular page is empty
        except:
            print(f"This page No. {page} IS EMPTY ¯\\_₍⸍⸌̣ʷ̣̫⸍̣⸌₎_/¯, NEXT!")
            # ...
            # ...
            # ...
    y += 1
Initially I thought the KernelInterrupted error was thrown because my virtual machine lacked virtual memory at the moment of running the second iteration...
But after several tests I figured out that my program isn't RAM-hungry at all: the virtual RAM barely changed throughout the whole process until the kernel crashed. I can guarantee that.
So now I think that maybe the virtual CPU of my virtual machine is what's causing the kernel to crash, but if that were the case I just don't understand why. This is the first time I've had to deal with such a situation; this program runs perfectly on my PC.
Is there any data scientist or machine learning engineer here who could assist me? Thanks in advance.
I found the answer in the Deepnote community forum itself: the "Free Tier" machines on this platform simply do not guarantee permanent operation (24/7), regardless of the program executed in their VM.
That's it. Problem is solved.

Used IDs are not available anymore in Selenium Python

I am using Python and Selenium to scrape some data out of a website. This website has the following structure:
The first group item has the base ID frmGroupList_Label_GroupName, and you add _2 or _3 at the end of this base ID to get the 2nd/3rd group's ID.
The same goes for the user item: it has the base ID frmGroupContacts_TextLabel3, and you add _2 or _3 at the end to get the 2nd/3rd user's ID.
What I am trying to do is get all the users out of each group. This is how I did it: find the first group, select it and grab all of its users, then go to the 2nd group, grab its users, and so on.
def grab_contact(number_of_members):
    groupContact = 'frmGroupContacts_TextLabel3'
    contact = browser.find_element_by_id(groupContact).text
    print(contact)
    i = 2
    time.sleep(1)
    # write_to_excel(contact, group)
    while i <= number_of_members:
        group_contact_string = groupContact + '_' + str(i)
        print(group_contact_string)
        try:
            contact = browser.find_element_by_id(group_contact_string).text
            print(contact)
            i = i + 1
            time.sleep(1)
            # write_to_excel(contact, group)
        except NoSuchElementException:
            break
    time.sleep(3)
The same code applies for scraping the groups. And it works, up to a point! Although the IDs of the groups are different, the IDs of the users are the same from one group to another. Example:
group_id_1 = user_id_1, user_id_2
group_id_2 = user_id_1, user_id_2, user_id_3, user_id_4, user_id_5
group_id_3 = user_id_1, user_id_2, user_id_3
The code runs: it goes to group_id_1 and grabs user_id_1 and user_id_2 correctly, but when it gets to group_id_2, user_id_1 and user_id_2 (which differ in content) are EMPTY, and only user_id_3, user_id_4, user_id_5 are correct. Then, when it gets to group_id_3, all of the users are empty.
This has to do with the users having the same IDs. As soon as it gets to a certain user ID in one group, I cannot retrieve any of the users before that ID in another group. I tried quitting the browser and reopening a new one (it doesn't work, the new browser doesn't open), refreshing the page (doesn't work), and opening a new tab (doesn't work).
I think the content of the IDs gets stuck in memory once accessed and is not freed when accessing a new group. Any ideas on how to get past this?
Thanks!
As the saying goes... it ain't stupid, if it works.
def refresh():
    # accessing the groups page
    url = "https://google.com"
    browser.get(url)
    time.sleep(5)
    url = "https://my_url.com"
    browser.get(url)
    time.sleep(5)
While trying to debug this and find a solution, I thought: "what if you go to another website, then come back to yours, between scraping groups"... and it works! Until I find another solution, I'll stick with this one.
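For what it's worth, an alternative worth trying (my own suggestion, not part of the workaround above) is to explicitly wait until each user element actually has non-empty text before reading it, instead of relying on fixed sleeps. A sketch using Selenium's WebDriverWait, keeping the same find_element_by_id style as the question; `grab_contact_text` is a hypothetical helper name:

from selenium.webdriver.support.ui import WebDriverWait

def grab_contact_text(element_id, timeout=10):
    """Wait until the element with this ID has non-empty text, then return it."""
    # Re-locate the element on every poll so we never read a stale reference.
    WebDriverWait(browser, timeout).until(
        lambda d: d.find_element_by_id(element_id).text.strip() != ''
    )
    return browser.find_element_by_id(element_id).text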

Automatically change what website I pull from

My Python program is pulling from a website from inside of a subprocess. This is working properly.
url = 'https://www.website.com/us/{0}/recent/kvc-4020_120/'.format(zipCode)
However, depending on the zip code, the website may have multiple pages of results. When this occurs, the URLs take the format:
https://www.website.com/us/ZIPCODE/recent/kvc-4020_120?sortId=2&offset=48
In this case, ?sortId=2&offset= stays constant. My question is: how can I change the URL automatically, as if I were manually clicking through to the next page? The only thing that changes is the offset, which increases by 24 each page. Example:
Page 1, /recent/kvc-4020_120
Page 2, /recent/kvc-4020_120?sortId=2&offset=24
Page 3, /recent/kvc-4020_120?sortId=2&offset=48
etc etc.
This can only reach up to 150 pages. I'm just unsure how to account for the page 1 URL versus anything past page 1.
After pulling from the website, I write to a txt file. I want to automatically check whether there is a next page and, if there is, change the URL and repeat the process. If there's no next page, move on to the next zip code.
A for loop:
for i in ['/recent/kvc-' + str(y) + '_120'
          if x == 0 else '/recent/kvc-' + str(y) + '_120?sortid=2&offset=' + str(x)
          for x in range(0, 48, 24) for y in range(4000, 5000)]:
    your_function('web_prefix' + i)
Where:
range(0, 48, 24) # increment to 48 by 24 (just an example)
range(4000, 5000) # Assumed range of Postcodes
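Since the question also asks how to stop when there is no next page (rather than iterating over a fixed range), here is a rough sketch of that idea. It uses requests just for illustration, the base URL is the placeholder from the question, and `has_results()` is a hypothetical check you would implement against the page's HTML (for example, looking for result rows):

import requests

def scrape_zip(zip_code):
    base = 'https://www.website.com/us/{0}/recent/kvc-4020_120'.format(zip_code)
    offset = 0
    while True:
        # Page 1 has no query string; later pages add ?sortId=2&offset=24, 48, ...
        url = base if offset == 0 else base + '?sortId=2&offset=' + str(offset)
        html = requests.get(url).text
        if not has_results(html):   # hypothetical: returns False when the page is empty
            break                   # no next page -> move on to the next zip code
        # ... parse the page and write the rows to the txt file ...
        offset += 24
        if offset >= 150 * 24:      # the site never goes past ~150 pages
            break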

OFFSET must not be negative

When we have zero results for a pagination object and we force ?page=-1,
we get the error OFFSET must not be negative.
-1 gets the last page by default.
So, by adding that parameter to the URL, you can always trigger an internal error whenever there is nothing to paginate.
Example:
page = request.args.get('page', 1, type=int)
pagination = company.comments.order_by(Comment.timestamp.asc()).paginate(
    page, per_page=current_app.config['COMMENTS_PER_PAGE'],
    error_out=False)
This avoids the error, but it is annoying to always have to do this kind of validation just to handle potentially empty paginations:
if company.comments.count() > 0:
    pagination = ...
else:
    pagination = None
My question is about the best way to handle this particular Internal server error.
This is probably what you're trying to do, but SQLAlchemy won't evaluate it for you.
My suggestion is to calculate the number of pages yourself and use that as the last page.
from math import ceil
from sqlalchemy import func

if page < 1:
    count = session.query(func.count(Comments.id)).scalar()
    comments_per_page = current_app.config['COMMENTS_PER_PAGE']
    page = max(int(ceil(count / float(comments_per_page))), 1)  # gets the last page
Please be aware that this is untested.
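One way to avoid repeating that validation everywhere (a sketch of my own, not part of the answer above) is to wrap it in a small helper; `safe_paginate` is a hypothetical name, and it assumes the same positional paginate() signature used in the question:

def safe_paginate(query, page, per_page):
    """Paginate a query without ever passing a negative page to the database."""
    if page < 1:
        # Fall back to page 1 when the result set is empty; otherwise keep the
        # "-1 means last page" behaviour by computing the last page number.
        count = query.count()
        page = max(1, -(-count // per_page))  # ceiling division; 1 if count == 0
    return query.paginate(page, per_page=per_page, error_out=False)

# Usage, mirroring the example from the question:
page = request.args.get('page', 1, type=int)
pagination = safe_paginate(
    company.comments.order_by(Comment.timestamp.asc()),
    page,
    current_app.config['COMMENTS_PER_PAGE'])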

GAE Datastore - Is there a next page / Are there x+1 entities?

Currently, to determine whether or not there is a next page of entities I'm using the following code:
q = Entity.all()
results = q.fetch(10)
cursor = q.cursor()
extra = q.fetch(1)
has_next_page = False
if extra:
    has_next_page = True
However, this is very expensive in terms of the time it takes to execute the 'extra' query. I need to extract the cursor after 10 results, but I need to fetch 11 to see if there is a succeeding page.
Anyone have any better methods?
If you fetch 11 items straight away you'll only have to fetch 1 extra item to know if there is a next page or not. And you can just display the first 10 results and use the 11th result only as a "next page" indicator.
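In code, that suggestion looks roughly like the sketch below (my own rendering of the answer, using the same old db-style query as the question; cursor handling for resuming on the next page is left out):

q = Entity.all()
results = q.fetch(11)              # one round trip: 10 to display plus 1 to peek ahead
has_next_page = len(results) > 10  # the 11th result only signals that a next page exists
page_items = results[:10]          # display just the first 10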
