Used IDs are not available anymore in Selenium (Python)

I am using Python and Selenium to scrape some data out of a website. The website has the following structure:
The first group item has the base ID frmGroupList_Label_GroupName; you append _2 or _3 to this base ID to get the 2nd/3rd group's ID.
The same goes for the user item: its base ID is frmGroupContacts_TextLabel3, and you append _2 or _3 to get the 2nd/3rd user's ID.
What I am trying to do is get all the users out of each group. This is how I did it: find the first group, select it and grab all of its users, then move on to the 2nd group, grab its users, and so on.
import time
from selenium.common.exceptions import NoSuchElementException

# 'browser' is the already-initialised webdriver instance used throughout the script.

def grab_contact(number_of_members):
    groupContact = 'frmGroupContacts_TextLabel3'
    # The first contact uses the base ID with no suffix.
    contact = browser.find_element_by_id(groupContact).text
    print(contact)
    i = 2
    time.sleep(1)
    # write_to_excel(contact, group)
    while i <= number_of_members:
        group_contact_string = groupContact + '_' + str(i)
        print(group_contact_string)
        try:
            contact = browser.find_element_by_id(group_contact_string).text
            print(contact)
            i = i + 1
            time.sleep(1)
            # write_to_excel(contact, group)
        except NoSuchElementException:
            break
    time.sleep(3)
The same code applies to scraping the groups, and it works, up to a point. Although the IDs of the groups are different, the IDs of the users are the same from one group to another. Example:
group_id_1 = user_id_1, user_id_2
group_id_2 = user_id_1, user_id_2, user_id_3, user_id_4, user_id_5
group_id_3 = user_id_1, user_id_2, user_id_3
The code runs: it goes to group_id_1 and grabs user_id_1 and user_id_2 correctly, but when it gets to group_id_2, user_id_1 and user_id_2 (whose content is different) come back EMPTY, and only user_id_3, user_id_4 and user_id_5 are correct. Then, when it gets to group_id_3, all of the users are empty.
This has to do with the users having the same IDs. As soon as a certain user ID has been read in one group, I cannot retrieve the users before that ID in another group. I tried quitting the browser and reopening a new one (it doesn't work, the new browser doesn't open), refreshing the page (doesn't work), and opening a new tab (doesn't work).
I think the content of the IDs gets stuck in memory when they are accessed and is not freed when accessing a new group. Any ideas on how to get past this?
Thanks!

As the saying goes... it ain't stupid, if it works.
def refresh():
    # navigate away to another site, then back to the groups page
    url = "https://google.com"
    browser.get(url)
    time.sleep(5)
    url = "https://my_url.com"
    browser.get(url)
    time.sleep(5)
While trying to debug this and find a solution, I thought: "what if you go to another website, then come back to yours, between group scrapes"... and it works! Until I find another solution, I'll stick with this one.
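If the detour through another site ever stops working, another thing that might be worth trying (untested against this particular site) is to wait explicitly until the element's text is non-empty instead of relying on fixed sleeps. A minimal sketch, assuming the same browser object and the find_element_by_id style already used above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def contact_text(element_id, timeout=10):
    # Return the element's text once it is non-empty, or '' if it never fills in.
    try:
        return WebDriverWait(browser, timeout).until(
            lambda drv: drv.find_element_by_id(element_id).text or False
        )
    except TimeoutException:
        return ''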

Related

How to automate checking whether a Twitter profile has the option of sending a message or not

I want to check, for a list of certain profiles, whether each profile has the message function or not.
Is this possible with Selenium, with the help of Python and requests?
You can just use try/except to find that element and set a value based on whether it is found:
from selenium.common.exceptions import NoSuchElementException

try:
    button = driver.find_element(...)
    send_button = 1
except NoSuchElementException:
    send_button = 0
You can expand this structure into a loop and store the values in a nicer way, such as a list or a dataframe column.
Be sure to verify carefully that the except branch triggers only when the button is genuinely absent, and not, for example, because of a loading error.
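A rough sketch of that loop, assuming a list called profile_urls and a placeholder XPath for the message button (both are illustrative, not taken from Twitter's actual markup):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
results = []  # (profile_url, 1 or 0) pairs

for url in profile_urls:  # profile_urls: your own list of profile URLs
    driver.get(url)
    try:
        # Placeholder locator -- point it at whatever element marks the message button.
        driver.find_element(By.XPATH, '//div[@data-testid="sendDMFromProfile"]')
        results.append((url, 1))
    except NoSuchElementException:
        results.append((url, 0))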

How can I actually free up CPU resources for this for loop in Jupyter Notebook?

I'm trying to run an automated process in a Jupyter Notebook (from deepnote.com) every single day, but after running the very first iteration of a while loop and starting the next one (at the for loop inside the while loop), the virtual machine crashes and throws the message below:
KernelInterrupted: Execution interrupted by the Jupyter kernel
Here's the code:
.
.
.
while y < 5:
    print(f'\u001b[45m Try No. {y} out of 5 \033[0m')
    # Make the driver wait up to 10 seconds before doing anything.
    driver.implicitly_wait(10)
    # Values for the example.
    # Declaring several variables for looping.
    # Let's start at the newest page.
    link = 'https...'
    driver.get(link)
    # Here we use an XPath element to get the initial page.
    initial_page = int(driver.find_element_by_xpath('Xpath').text)
    print(f'The initial page is No. {initial_page}')
    final_page = initial_page + 120
    pages = np.arange(initial_page, final_page + 1, 1)
    minimun_value = 0.95
    maximum_value = 1.2
    # The variable to_place is a string value that must exist in a row for it to be scraped;
    # if it doesn't exist, the row is ignored.
    to_place = 'A particular place'
    # The same comment stated above applies to the variable POINTS.
    POINTS = 'POINTS'
    # Set a final dataframe which will contain all the scraped data from the arange that
    # matches the parameters set (minimun_value, maximum_value, to_place, POINTS).
    df_final = pd.DataFrame()
    dataframe_final = pd.DataFrame()
    # Set another final dataframe for the 2ND PART OF THE PROCESS.
    initial_df = pd.DataFrame()
    # Set a for loop for each page from the arange.
    for page in pages:
        # INITIAL SEARCH.
        # Look for general data of the link: the amount of results and pages for the
        # execution of the for loop; the "page" variable is used within the {}.
        url = 'https...page={}&p=1'.format(page)
        print(f'\u001b[42m Current page: {page} \033[0m '
              + '\u001b[42m Final page: ' + str(final_page) + '\033[0m '
              + '\u001b[42m Pages left: ' + str(final_page - page) + '\033[0m '
              + '\u001b[45m Try No. ' + str(y) + ' out of ' + str(5) + '\033[0m' + '\n')
        driver.get(url)
        # Try to find the total number of subpages this page has, if the page IS NOT empty;
        # if so, proceed to execute the rest of the procedure.
        try:
            subpages = driver.find_element_by_xpath('Xpath').text
            print(f'Reading the information about the number of subpages of this page ... {subpages}')
            subpages = int(re.search(r'\d{0,3}$', subpages).group())
            print(f'This page has {subpages} subpages in total')
            df = pd.DataFrame()
            df2 = pd.DataFrame()
            print(df)
            print(df2)
            # FOR LOOP.
            # Search each subpage for all the rows that match the previous parameters:
            # minimun_value, maximum_value, to_place, POINTS.
            # Set a sub-loop for each row of the table of each subpage of each page.
            for subpage in range(1, subpages + 1):
                url = 'https...page={}&p={}'.format(page, subpage)
                driver.get(url)
                identities_found = int(driver.find_element_by_xpath('Xpath').text
                                       .replace('A total of ', '')
                                       .replace(' identities found', '')
                                       .replace(',', ''))
                identities_found_last = identities_found % 50
                print(f'Page: {page} of {pages}')  # AT THIS LINE IT CRASHED THE LAST TIME
                .
                .
                .
        # If the particular page is empty
        except:
            print(f"This page No. {page} IS EMPTY ¯\\_₍⸍⸌̣ʷ̣̫⸍̣⸌₎_/¯, NEXT!")
            .
            .
            .
    y += 1
Initially I thought the KernelInterrupted error was thrown because my virtual machine lacked virtual memory at the moment of running the second iteration...
But after several tests I figured out that my program isn't RAM-hungry at all, because the virtual RAM barely changed during the whole process until the kernel crashed. I can guarantee that.
So now I think that maybe the virtual CPU of my virtual machine is what's causing the kernel to crash, but if that were the case I just don't understand why; this is the first time I've had to deal with such a situation, and this program runs perfectly on my PC.
Is there any data scientist or machine learning engineer here who could assist me? Thanks in advance.
I have found the answer in the Deepnote community forum itself: the "Free Tier" machines on this platform simply do not guarantee permanent operation (24/7), regardless of the program executed in their VM.
That's it. Problem is solved.

Catching an exception and continuing with the loop to continue web search with python

I am running into exceptions when I try to search for data from a list of values in a search bar. I would like to capture these exceptions and continue with the rest of the loop. Is there a way I could do this? I am getting two kinds of exceptions, one above the search bar and one below it. I am currently using Selenium to log in and get the necessary details.
Error Messages:
Above the search bar:
Your search returned more than 100 results. Only the first 100 results will be displayed. Please select 'Reset' and refine the search criteria for specific results. (29)
Employer Number is not a valid . Minimum length should be 8. (890)
Error Message below the search bar.
No records found...
This is my code:
for i in ids:
    driver.find_element_by_xpath('//*[@id="print_area"]/table/tbody/tr[16]/td[1]/a').click()
    driver.find_element_by_xpath('//*[@id="print_area"]/table/tbody/tr[4]/td[3]/a').click()
    # searching for an id
    driver.find_element_by_xpath('//*[@id="ctl00_ctl00_cphMain_cphMain_txtEmprAcctNu"]').send_keys(i)
    driver.find_element_by_id('ctl00_ctl00_cphMain_cphMain_btnSearch').click()
    driver.find_element_by_xpath('//*[@id="ctl00_ctl00_cphMain_cphMain_grdAgentEmprResults"]/tbody/tr[2]/td[1]/a').click()
    # navigating to the employee details
    driver.find_element_by_xpath('//*[@id="print_area"]/table/tbody/tr[8]/td[3]/a').click()
    driver.find_element_by_xpath('//*[@id="print_area"]/table/tbody/tr[4]/td[1]/a').click()
After the above code runs, if there is an error or a mismatch I get the exceptions mentioned and the code shuts down. How do I capture those exceptions and continue with the code? If I could do something similar to the way I am capturing the date below, that would be really helpful.
# copying and storing the date
subdate = driver.find_element_by_id('ctl00_ctl00_cphMain_cphMain_frmViewAccountProfile_lblSubjectivityDate').text
subjectivitydate.append(subdate)
#exiting current employee details
driver.find_element_by_id('ctl00_ctl00_cphMain_ULinkButton4').click()
sleep(1)
Edited Code:
for i in ids:
    try:
        driver.find_element_by_xpath('//*[@id="print_area"]/table/tbody/tr[16]/td[1]/a').click()
        driver.find_element_by_xpath('//*[@id="print_area"]/table/tbody/tr[4]/td[3]/a').click()
        # searching for an id
        driver.find_element_by_xpath('//*[@id="ctl00_ctl00_cphMain_cphMain_txtEmprAcctNu"]').send_keys(i)
        driver.find_element_by_id('ctl00_ctl00_cphMain_cphMain_btnSearch').click()
        driver.find_element_by_xpath('//*[@id="ctl00_ctl00_cphMain_cphMain_grdAgentEmprResults"]/tbody/tr[2]/td[1]/a').click()
        # navigating to the employee profile
        driver.find_element_by_xpath('//*[@id="print_area"]/table/tbody/tr[8]/td[3]/a').click()
        driver.find_element_by_xpath('//*[@id="print_area"]/table/tbody/tr[4]/td[1]/a').click()
        # copying and storing the date
        subdate = driver.find_element_by_id('ctl00_ctl00_cphMain_cphMain_frmViewAccountProfile_lblSubjectivityDate').text
        subjectivitydate.append(subdate)
        # exiting current employee details
        driver.find_element_by_id('ctl00_ctl00_cphMain_ULinkButton4').click()
        sleep(1)
    except:
        continue
How do I restart the loop?
Regards,
Ren
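One way to tighten the edited loop, sketched under the assumption that the failures surface as Selenium's NoSuchElementException or TimeoutException (process_employer is a hypothetical helper wrapping the navigation and date-copying steps shown above):

from selenium.common.exceptions import NoSuchElementException, TimeoutException

failed_ids = []  # ids that raised an error, kept so they can be retried later

for i in ids:
    try:
        process_employer(i)  # hypothetical helper: the clicks, search and date copy from above
    except (NoSuchElementException, TimeoutException):
        failed_ids.append(i)  # record the failure and move on to the next id
        continue

This keeps the loop running but, unlike a bare except, it won't silently swallow unrelated errors, and failed_ids tells you which searches to retry.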

Scrapy, Crawling Reviews on Tripadvisor: extract more hotel and user information

I need to extract more information from TripAdvisor.
My code:
item = TripadvisorItem()
item['url'] = response.url.encode('ascii', errors='ignore')
item['state'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()[0].encode('ascii', errors='ignore')
if(item['state']==[]):
    item['state'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[contains(@class,"region_title")][2]/text()').extract()
item['city'] = hxs.select('//*[@id="PAGE"]/div[2]/div[1]/ul/li[3]/a/span/text()').extract()
if(item['city']==[]):
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[1]/span/text()').extract()
if(item['city']==[]):
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[3]/span/text()').extract()
item['city'] = item['city'][0].encode('ascii', errors='ignore')
item['hotelName'] = hxs.xpath('//*[@id="HEADING"]/span[2]/span/a/text()').extract()
item['hotelName'] = item['hotelName'][0].encode('ascii', errors='ignore')
reviews = hxs.select('.//div[contains(@id, "review")]')
1. For every hotel on TripAdvisor there is an ID number, like 80075 for this hotel: http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS
How can I extract this ID for the TripadvisorItem?
2. More information I need for every hotel: shortDescription, stars, zipCode, country and coordinates (long, lat). Can I extract these things?
3. I need to extract the traveller type for every review. How?
My code for the reviews:
for review in reviews:
    it = Review()
    it['state'] = item['state']
    it['city'] = item['city']
    it['hotelName'] = item['hotelName']
    it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/@title').extract()
    if(it['date']==[]):
        it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/text()').extract()
    if(it['date']!=[]):
        it['date'] = it['date'][0].encode('ascii', errors='ignore').replace("Reviewed","").strip()
    it['userName'] = review.xpath('.//div[contains(@class,"username mo")]/span/text()').extract()
    if (it['userName']!=[]):
        it['userName'] = it['userName'][0].encode('ascii', errors='ignore')
    it['userLocation'] = ''.join(review.xpath('.//div[contains(@class,"location")]/text()').extract()).strip().encode('ascii', errors='ignore')
    it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div[1]/div[contains(@class,"quote")]/text()').extract()
    if(it['reviewTitle']!=[]):
        it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    else:
        it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div/div[1]/a/span[contains(@class,"noQuotes")]/text()').extract()
        if(it['reviewTitle']!=[]):
            it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    it['reviewContent'] = review.xpath('.//div[1]/div[2]/div[1]/div[3]/p/text()').extract()
    if(it['reviewContent']!=[]):
        it['reviewContent'] = it['reviewContent'][0].encode('ascii', errors='ignore').strip()
    it['generalRating'] = review.xpath('.//div/div[2]/div/div[2]/span[1]/img/@alt').extract()
    if(it['generalRating']!=[]):
        it['generalRating'] = it['generalRating'][0].encode('ascii', errors='ignore').split()[0]
Is there a good manual on how to find these things? I got lost among all the spans and divs.
Thanks!
I'll try to do this purely in XPath. Unfortunately, it looks like most of the info you want is contained in <script> tags:
Hotel ID - Returns "80075"
substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "locId:")), ",")
Alternatively, the Hotel ID is in the URL, as another answerer mentioned. If you're sure the format will always be the same (such as including a "d" prior to the ID), then you can use that instead.
Rating (the one at the top) - Returns "3.5"
//span[contains(@class, "rating_rr")]/img/@content
There are a couple of instances of ratings on this page. The main rating at the top is what I've grabbed here. I haven't tested this within Scrapy, so it's possible that it's populated by JavaScript and not initially loaded as part of the HTML. If that's the case, you'll need to grab it somewhere else or use something like Selenium/PhantomJS.
Zip Code - Returns "10019"
(//span[@property="v:postal-code"]/text())[1]
Again, same deal as above. It's in the HTML, but you should check whether it's there upon page load.
Country - Returns ""US""
substring-before(substring-after(//script[contains(., "modelLocaleCountry")]/text(), "modelLocaleCountry = "), ";")
This one comes with quotes. You can always (and you should) use a pipeline to sanitize scraped data to get it to look the way you want.
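A rough sketch of such a pipeline, with a hypothetical class name and field name, enabled via ITEM_PIPELINES in settings.py:

class StripQuotesPipeline(object):
    def process_item(self, item, spider):
        country = item.get('country')
        if country:
            item['country'] = country.strip('"')  # drop the surrounding quotes
        return item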
Coordinates - Returns "40.76174" and "-73.985275", respectively
Lat: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lat:")), ",")
Lon: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lng:")), ",")
I'm not entirely sure where the short description exists on this page, so I didn't include that. It's possible you have to navigate elsewhere to get it. I also wasn't 100% sure what the "traveler type" meant, so I'll leave that one up to you.
As far as a manual, it's really about practice. You learn tricks and hacks for working within XPath, and Scrapy allows you to use some added features, such as regex and pipelines. I wouldn't recommend doing the whole "absolute path" XPath (i.e., ./div/div[3]/div[2]/ul/li[3]/...), since any deviation from that within the DOM will completely ruin your scraping. If you have a lot of data to scrape, and you plan on keeping this around a while, your project will become unmanageable very quickly if the site moves around even a single <div>.
I'd recommend more "querying" XPaths, such as //div[contains(@class, "foo")]//a[contains(@href, "detailID")]. Paths like that will make sure that no matter how many elements are placed between the elements you know will be there, and even if multiple target elements are slightly different from each other, you'll be able to grab them consistently.
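For instance, a quick comparison in scrapy shell; the first selector is the absolute path from the question, while the second is only illustrative and keys off attributes that are more likely to stay put:

# Brittle: any extra <div> inserted into the page breaks it.
response.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()

# Querying style: anchored on the review container id and the rating image's alt text.
response.xpath('//div[contains(@id, "review")]//img[contains(@alt, "of 5 stars")]/@alt').extract()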
XPaths are a lot of trial and error. A LOT. Here are a few tools that help me out significantly:
XPath Helper (Chrome extension)
scrapy shell <URL>
scrapy view <URL> (for rendering Scrapy's response in a browser)
PhantomJS (if you're interested in getting data that's been inserted via JavaScript)
Hope some of this helped.
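If it helps, here is roughly how the hotel-ID expression above could slot into the existing spider (untested; hxs is the selector already used in the question, and hotelId would be a new field added to TripadvisorItem):

locid_xpath = ('substring-before(normalize-space(substring-after('
               '//script[contains(., "geoId:") and contains(., "lat")]/text(), "locId:")), ",")')
hotel_id = hxs.xpath(locid_xpath).extract()
item['hotelId'] = hotel_id[0] if hotel_id else ''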
Is it acceptable to get it from the URL using a regex?
id = re.search('(-d)([0-9]+)',url).group(2)

Checking if A follows B on twitter using Tweepy/Python

I have a list of a few thousand twitter ids and I would like to check who follows who in this network.
I used Tweepy to get the accounts using something like:
ids = {}
for i in list_of_accounts:
    for page in tweepy.Cursor(api.followers_ids, screen_name=i).pages():
        ids[i] = page
        time.sleep(60)
The values in the dictionary ids form the network I would like to analyze. If I try to get the complete list of followers for each id (to compare to the list of users in the network) I run into two problems.
The first is that I may not have permission to see the user's followers - that's okay and I can skip those - but they stop my program. This is the case with the following code:
connections = {}
for x in user_ids:
    l = []
    for page in tweepy.Cursor(api.followers_ids, user_id=x).pages():
        l.append(page)
    connections[x] = l
The second is that I have no way of telling when my program will need to sleep to avoid the rate-limit. If I put a 60 second wait after every page in this query - my program would take too long to run.
I tried to find a simple 'exists_friendship' command that might get around these issues in a simpler way - but I only find things that became obsolete with the change to API 1.1. I am open to using other packages for Python. Thanks.
if api.exists_friendship(userid_a, userid_b):
    print("a follows b")
else:
    print("a doesn't follow b, check separately if b follows a")
