My Python program pulls from a website from inside a subprocess. This is working properly.
url = 'https://www.website.com/us/{0}/recent/kvc-4020_120/'.format(zipCode)
However, depending on the zip code, the website may have multiple pages of results. When this occurs, the URL takes the format:
https://www.website.com/us/ZIPCODE/recent/kvc-4020_120?sortId=2&offset=48
In this case, ?sortId=2&offset= stays constant. My question is: how can I change the URL automatically, as if I were manually clicking through to the next page? The only thing that changes is the offset, which increases by 24 per page. Example:
Page 1, /recent/kvc-4020_120
Page 2, /recent/kvc-4020_120?sortId=2&offset=24
Page 3, /recent/kvc-4020_120?sortId=2&offset=48
etc etc.
This can reach at most 150 pages. I'm just unsure how to account for the page 1 URL versus the URL for anything past page 1.
After pulling from the website, I write to a txt file. I want to automatically check whether there is a next page and, if there is, change the URL and repeat the process. If there's no next page, move on to the next zip code.
A for loop:
for i in ['/recent/kvc-' + str(y) + '_120'
          if x == 0 else
          '/recent/kvc-' + str(y) + '_120?sortId=2&offset=' + str(x)
          for x in range(0, 48, 24) for y in range(4000, 5000)]:
    your_function('web_prefix' + i)
Where:
range(0, 48, 24)   # offsets 0 and 24, i.e. pages 1 and 2 (just an example; use range(0, 150*24, 24) for up to 150 pages)
range(4000, 5000)  # assumed range of zip codes
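Since the question also wants to stop when a page has no results and then move on to the next zip code, a per-zip-code loop may be easier to follow than a single comprehension. Here is a rough sketch, not a drop-in solution: it uses requests for the fetch, a hypothetical has_results() check you would replace with however you already detect an empty page, and a zip_codes list of your own:
import requests

MAX_PAGES = 150          # stated upper bound in the question
BASE = 'https://www.website.com/us/{0}/recent/kvc-4020_120'

def has_results(html):
    # Placeholder check: replace with whatever marker tells you a results
    # page is empty on this particular site.
    return 'no results found' not in html.lower()

for zip_code in zip_codes:                      # zip_codes: your own list of ZIP codes
    base = BASE.format(zip_code)
    for page in range(MAX_PAGES):
        # Page 1 has no query string; every later page adds the growing offset.
        url = base if page == 0 else base + '?sortId=2&offset={0}'.format(page * 24)
        html = requests.get(url).text
        if not has_results(html):
            break                               # no next page -> move on to the next ZIP code
        with open('output_{0}.txt'.format(zip_code), 'a') as f:
            f.write(html)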
The British Library has a large number of high-quality scans of books, which are available for download. Unfortunately, their tool for downloading more than one page at a time does not work. For this reason, I've been trying to create a Python script with the Requests module that will download every page of a given book.
The JPG of every page has a specific URL; in this case, that of the first page is https://api.bl.uk/image/iiif/ark:/81055/vdc_000000038900.0x000001/full/2306,/0/default.jpg and that of the second is https://api.bl.uk/image/iiif/ark:/81055/vdc_000000038900.0x000002/full/2306,/0/default.jpg. Extrapolating from the first nine pages (in this example, the book is 456 pages long), I naively created the following script:
import requests

base_url = "https://api.bl.uk/image/iiif/ark:/81055/vdc_000000038900.0x0000"
for i in range(1, 456):
    target_url = base_url + str(i) + "/full/2306,/0/default.jpg"
    r = requests.get(target_url)
    with open('bl_' + str(i) + '.jpg', 'wb') as f:
        f.write(r.content)
    print(target_url)
This worked for the first 9 pages, but unfortunately pages 10-15 are not 0000010-0000015 but 00000A-00000F. And the complications do not end there: pages 16-25 are 10-19, but with one leading zero fewer (likewise, 3-digit numbers have two zeros fewer, etc.). After that, pages 26-31 are 1A-1F, pages 32-41 are 20-29, and pages 42-47 are 2A-2F. This pattern continues for as long as it can, up to page 159, which is 9F. After this, in order to remain in two digits, the pattern changes: pages 160-169 are A0-A9, pages 170-175 are AA-AF, pages 176-191 are B0-BF, and so on until page 255, which is FF. After this, pages 256-265 are 100-109, pages 266-271 are 10A-10F, pages 272-281 are 110-119, pages 282-287 are 11A-11F, and so on until page 415, which is 19F. After this, pages 416-425 are 1A0-1A9, pages 426-431 are 1AA-1AF, pages 432-441 are 1B0-1B9, and so on in this pattern until page 456, which is the final page of the book.
Evidently there is an algorithm generating this sequence according to certain parameters. Just as evidently, these parameters can be incorporated into the Python script I am trying to create. Sadly, my meagre coding knowledge was more than exhausted by the modest scriptlet above. I hope anyone here can help.
Replacing str(i) with f"{i:06x}" should give the correct numbering in hexadecimal over 6 zero padded digits in the URL.
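Applied to the script above, that might look like the sketch below. It assumes the base URL is trimmed back to end at .0x, since the six zero-padded hex digits now supply the zeros that were previously hard-coded, and it assumes the server accepts lowercase hex:
import requests

# Base URL trimmed to end at ".0x"; the 6 zero-padded hex digits are appended below.
base_url = "https://api.bl.uk/image/iiif/ark:/81055/vdc_000000038900.0x"

for i in range(1, 457):                       # pages 1..456 inclusive
    page_id = f"{i:06x}"                      # e.g. 1 -> "000001", 26 -> "00001a"
    target_url = base_url + page_id + "/full/2306,/0/default.jpg"
    r = requests.get(target_url)
    with open(f"bl_{i:03d}.jpg", "wb") as f:  # zero-padded filename so pages sort correctly
        f.write(r.content)
    print(target_url)
If the server turns out to be case-sensitive about the hex letters (the examples in the question use uppercase A-F), f"{i:06X}" produces uppercase digits instead.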
I'm trying to run an automated process in a Jupyter Notebook (from deepnote.com) every single day, but after running the very first iteration of a while loop and starting the next iteration (at the for loop inside the while loop), the virtual machine crashes, throwing the message below:
KernelInterrupted: Execution interrupted by the Jupyter kernel
Here's the code:
.
.
.
while y < 5:
    print(f'\u001b[45m Try No. {y} out of 5 \033[0m')
    # Make the driver wait up to 10 seconds before doing anything.
    driver.implicitly_wait(10)
    # Values for the example.
    # Declaring several variables for looping.
    # Let's start at the newest page.
    link = 'https...'
    driver.get(link)
    # Here we use an XPath element to get the initial page.
    initial_page = int(driver.find_element_by_xpath('Xpath').text)
    print(f'The initial page is the No. {initial_page}')
    final_page = initial_page + 120
    pages = np.arange(initial_page, final_page + 1, 1)
    minimun_value = 0.95
    maximum_value = 1.2
    # The variable to_place is a string value that must exist in a row in order for it to be scraped;
    # if it doesn't exist, the row is ignored.
    to_place = 'A particular place'
    # The same comment stated above applies to the variable POINTS.
    POINTS = 'POINTS'
    # Let's set a final dataframe which will contain all the scraped data from the arange that
    # matches the parameters set (minimun_value, maximum_value, to_place, POINTS).
    df_final = pd.DataFrame()
    dataframe_final = pd.DataFrame()
    # Set another final dataframe for the 2ND PART OF THE PROCESS.
    initial_df = pd.DataFrame()
    # Set a for loop for each page from the arange.
    for page in pages:
        # INITIAL SEARCH.
        # Look for general data of the link:
        # the amount of results and pages for the execution of the for loop; the "page" variable is used within the {}.
        url = 'https...page={}&p=1'.format(page)
        print(f'\u001b[42m Current page: {page} \033[0m '
              + '\u001b[42m Final page: ' + str(final_page) + '\033[0m '
              + '\u001b[42m Pages left: ' + str(final_page - page) + '\033[0m '
              + '\u001b[45m Try No. ' + str(y) + ' out of ' + str(5) + '\033[0m' + '\n')
        driver.get(url)
        # Here we tell the scraper to try to find the total number of subpages a particular page has,
        # if that page IS NOT empty; if so, the scraper proceeds to execute the rest of the procedure.
        try:
            subpages = driver.find_element_by_xpath('Xpath').text
            print(f'Reading the information about the number of subpages of this page ... {subpages}')
            subpages = int(re.search(r'\d{0,3}$', subpages).group())
            print(f'This page has {subpages} subpages in total')
            df = pd.DataFrame()
            df2 = pd.DataFrame()
            print(df)
            print(df2)
            # FOR LOOP.
            # Search each subpage for all the rows that match the parameters set above
            # (minimun_value, maximum_value, to_place, POINTS).
            # Set a sub-loop for each row from the table of each subpage of each page.
            for subpage in range(1, subpages + 1):
                url = 'https...page={}&p={}'.format(page, subpage)
                driver.get(url)
                identities_found = int(driver.find_element_by_xpath('Xpath').text.replace('A total of ', '').replace(' identities found', '').replace(',', ''))
                identities_found_last = identities_found % 50
                print(f'Page: {page} of {pages}')  # AT THIS LINE IT CRASHED THE LAST TIME
.
.
.
        # If the particular page is empty.
        except:
            print(f"This page No. {page} IS EMPTY ¯\\_₍⸍⸌̣ʷ̣̫⸍̣⸌₎_/¯, NEXT!")
.
.
.
    y += 1
Initially I thought the KernelInterrupted error was thrown because my virtual machine lacked enough virtual memory at the moment it started the second iteration...
But after several tests I found that my program isn't RAM-hungry at all: the virtual machine's RAM usage barely changed throughout the whole process until the kernel crashed. I can guarantee that.
So now I think that maybe the virtual CPU of my virtual machine is what's causing the kernel to crash, but if that were the case I just don't understand why. This is the first time I've had to deal with such a situation; the program runs perfectly on my PC.
Is there any data scientist or machine learning engineer here who could assist me? Thanks in advance.
I found the answer in the Deepnote community forum itself: the platform's "Free Tier" machines simply do not guarantee permanent operation (24/7), regardless of the program executed in the VM.
That's it. Problem solved.
I am using Python and Selenium to scrape some data out of a website. This website has the following structure:
The first group item has the base ID frmGroupList_Label_GroupName; you then add _2 or _3 to the end of this base ID to get the 2nd/3rd group's ID.
The same goes for the user item: it has the base ID frmGroupContacts_TextLabel3, and you add _2 or _3 to the end to get the 2nd/3rd user's ID.
What I am trying to do is get all the users out of each group. And this is how I did it: find the first group, select it, and grab all of its users; then go to the 2nd group, grab its users, and so on.
def grab_contact(number_of_members):
    groupContact = 'frmGroupContacts_TextLabel3'
    contact = browser.find_element_by_id(groupContact).text
    print(contact)
    i = 2
    time.sleep(1)
    # write_to_excel(contact, group)
    while i <= number_of_members:
        group_contact_string = groupContact + '_' + str(i)
        print(group_contact_string)
        try:
            contact = browser.find_element_by_id(group_contact_string).text
            print(contact)
            i = i + 1
            time.sleep(1)
            # write_to_excel(contact, group)
        except NoSuchElementException:
            break
    time.sleep(3)
The same code applies to scraping the groups. And it works, up to a point! Although the IDs of the groups are different, the IDs of the users are the same from one group to another. Example:
group_id_1 = user_id_1, user_id_2
group_id_2 = user_id_1, user_id_2, user_id_3, user_id_4, user_id_5
group_id_3 = user_id_1, user_id_2, user_id_3
The code runs: it goes to group_id_1 and grabs user_id_1 and user_id_2 correctly, but when it gets to group_id_2, user_id_1 and user_id_2 (whose content is different) come back EMPTY, and only user_id_3, user_id_4, and user_id_5 are correct. Then, when it gets to group_id_3, all of the users are empty.
This has to do with the users having the same IDs. As soon as a certain user ID has been read in one group, I cannot retrieve the users before that ID in another group. I tried quitting the browser and reopening a new one (it doesn't work, the new browser doesn't open), tried refreshing the page (doesn't work), and tried opening a new tab (doesn't work).
I think the content of the IDs gets stuck in memory when they are accessed and is not freed when accessing a new group. Any ideas on how to get past this?
Thanks!
As the saying goes... it ain't stupid, if it works.
def refresh():
    # go to another website first...
    url = "https://google.com"
    browser.get(url)
    time.sleep(5)
    # ...then come back to the groups page
    url = "https://my_url.com"
    browser.get(url)
    time.sleep(5)
While trying to debug this and find a solution, I thought: "what if you go to another website, then come back to yours, between scraping groups?"... and it works! Until I find another solution, I'll stick with this one.
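Another thing that may be worth trying (a sketch, not a confirmed fix for this particular site; element_id and the 10-second timeout are placeholders): Selenium's .text only returns the visible text of an element, so if the user rows exist in the DOM but haven't been rendered yet when you switch groups, waiting explicitly for the element and reading its textContent attribute can avoid the empty strings:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def grab_contact_text(element_id, timeout=10):
    # Wait until the element is present in the DOM...
    element = WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.ID, element_id))
    )
    # ...then read textContent, which also returns text that .text
    # would skip because the element is not (yet) visible.
    return element.get_attribute('textContent').strip()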
I'm trying to scrape player data off of https://sofifa.com using BeautifulSoup. Every page displays 60 players, so I use the offset query param (for example, https://sofifa.com/players?offset=60 shows the 2nd page) to access all players' info.
One thing I noticed is that the offset value never ends (i.e. no matter how large an offset value I provide, it always shows me a page). Specifically, I noticed that for offset > 20000 or so, it always displays the 1st page (essentially, after exhausting all players, it kind of rolls over to the 1st page and displays that for all subsequent, higher offset values). Try https://sofifa.com/players?offset=20000000 to get an idea of what I mean.
I want to know if there is any way I can programmatically find out the last "valid" offset value, beyond which I am sure to get the 1st page back. That'll help me decide when I've reached the end of the dataset.
Currently this is how I scrape:
for offset in range(0, 20000, 60):
try:
print("Processing page at offset " + str(offset))
sofifa_url = "https://sofifa.com/players?offset=" + str(offset)
# start scraping the page
:
:
except Exception as e:
print("Exception occured: " + str(e))
continue
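For reference, one way to detect the roll-over described above (a sketch under assumptions: it uses requests + BeautifulSoup and assumes each player row links to a /player/... profile URL, so the exact selector may need adjusting) is to remember which players the first page contains and stop as soon as a later offset serves up that same set again:
import requests
from bs4 import BeautifulSoup

def player_ids(offset):
    """Return the set of player profile paths shown at a given offset."""
    resp = requests.get("https://sofifa.com/players", params={"offset": offset})
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assumed selector: anchors pointing at individual player pages.
    return {a["href"] for a in soup.select('a[href^="/player/"]')}

first_page = player_ids(0)

for offset in range(60, 30000, 60):
    current = player_ids(offset)
    if not current or current == first_page:
        print("Last valid offset was", offset - 60)
        break
    # ... scrape the page as usual ...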
Currently, to determine whether or not there is a next page of entities I'm using the following code:
q = Entity.all()
results = q.fetch(10)
cursor = q.cursor()
# a second round trip just to see whether an 11th entity exists
q.with_cursor(cursor)
extra = q.fetch(1)
has_next_page = False
if extra:
    has_next_page = True
However, this is very expensive in terms of the time it takes to execute the 'extra' query. I need to extract the cursor after 10 results, but I need to fetch 11 to see if there is a succeeding page.
Anyone have any better methods?
If you fetch 11 items straight away, the one extra item tells you whether there is a next page without a second query. Just display the first 10 results and use the 11th only as a "next page" indicator.
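A minimal sketch of that idea with the old db API, where Entity is the question's model. It swaps cursors for key-based bookmarks, and PAGE_SIZE/get_page are illustrative names rather than anything from the question:
PAGE_SIZE = 10

def get_page(bookmark_key=None):
    # One query fetches the page plus a single extra entity; the extra
    # entity only serves as the "there is a next page" indicator.
    q = Entity.all().order('__key__')
    if bookmark_key is not None:
        q.filter('__key__ >=', bookmark_key)

    results = q.fetch(PAGE_SIZE + 1)
    items = results[:PAGE_SIZE]                # what actually gets displayed
    has_next_page = len(results) > PAGE_SIZE
    # The 11th entity's key doubles as the bookmark where the next page starts.
    next_bookmark = results[PAGE_SIZE].key() if has_next_page else None
    return items, has_next_page, next_bookmark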