I have this automation tool I've built with Selenium 2 and BrowserMob Proxy that works quite well for most of what I need. However, I've run into a snag capturing network traffic.
I basically want to capture the HAR that a click produces before the page redirects. For example, I have an analytics call happening on the click that I want to capture, then another analytics call on the page load that I don't want to capture.
All of my attempts so far capture the HAR too late, so I see both the click analytics call and the page-load one. Is there any way to get this working? I've included the relevant sections of my current code below.
METHODS INSIDE HELPER CLASS
class _check_for_page_load(object):
    def __init__(self, browser, parent):
        self.browser = browser
        self.maxWait = 5
        self.parent = parent

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def wait_for(self, condition_function):
        start_time = time.time()
        while time.time() < start_time + self.maxWait:
            if condition_function():
                return True
            else:
                time.sleep(0.01)
        raise Exception(
            'Timeout waiting for {}'.format(condition_function.__name__)
        )

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        ###self.parent.log("testing ---- " + str(new_page.id) + " " + str(self.old_page.id))
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        try:
            self.wait_for(self.page_has_loaded)
        except:
            pass
def startNetworkCalls(self):
    if self._p != None:
        self._p.new_har("Step" + str(self._currStep))

def getNetworkCalls(self, waitForTrafficToStop=True):
    if self._p != None:
        if waitForTrafficToStop:
            self._p.wait_for_traffic_to_stop(5000, 30 * 1000)
        return self._p.har
    else:
        return "{}"
def click(self, selector):
    ''' clicks on an element '''
    self.log("Clicking element '" + selector + "'")
    el = self.findEl(selector)
    traffic = ""
    with self._check_for_page_load(self._d, self):
        try:
            self._curr_window = self._d.window_handles[0]
            el.click()
        except:
            actions = ActionChains(self._d)
            actions.move_to_element(el).click().perform()
        traffic = self.getNetworkCalls(False)
        try:
            popup = self._d.switch_to.alert
            if popup != None:
                popup.dismiss()
        except:
            pass
        try:
            window_after = self._d.window_handles[1]
            if window_after != self._curr_window:
                self._d.close()
                self._d.switch_to_window(self._curr_window)
        except:
            pass
    return traffic
INSIDE FILE THAT RUNS MULTIPLE SELENIUM ACTIONS
##inside a for loop, we get an action that looks like "click('#selector')"
util.startNetworkCalls()
if action.startswith("click"):
    temp_traffic = eval(action)
    if temp_traffic == "":
        temp_traffic = util.getNetworkCalls()
    traffic = json.dumps(temp_traffic, sort_keys=True)  ##gives json har info that is saved later
You can see from these two snippets that I initiate the "click" function, which returns network traffic. Inside the click function, you can see it references the class "_check_for_page_load". However, the first time it reaches this line:
###self.parent.log("testing ---- " + str(new_page.id) + " " + str(self.old_page.id))
the log (when enabled) shows that the element ids already don't match, indicating the page load has already started. I'm pretty stuck right now, as I've tried everything I can think of to accomplish this functionality.
I found a solution to my own question, though it isn't perfect. I told my network calls to capture headers:
def startNetworkCalls(self):
    if self._p != None:
        self._p.new_har("Step" + str(self._currStep), {"captureHeaders": "true"})
Then, when I retrieve the HAR data, I can look for the "Referer" header and compare it with the page that was initially loaded (before the redirect from the click). From there, I can split the HAR into two separate lists of network calls to process further later.
This works for my needs, but it isn't perfect. Some things, like image requests, sometimes get the same referrer as the previous page's URL, so the splitting puts those into the first bucket rather than the appropriate second bucket. However, since I'm more interested in requests that aren't on the same domain, this isn't really an issue.
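For anyone doing the same split, here is a minimal sketch of the idea, assuming the standard HAR layout (log.entries[].request.headers) that the BrowserMob Python client returns; the function name, original_url parameter, and example URL are just illustrative:

def split_har_by_referer(har, original_url):
    """Split HAR entries into calls made from the original page vs. calls made after the redirect."""
    click_calls, page_load_calls = [], []
    for entry in har.get("log", {}).get("entries", []):
        headers = {h["name"].lower(): h["value"] for h in entry["request"].get("headers", [])}
        if headers.get("referer", "") == original_url:
            click_calls.append(entry)      # fired from the page we clicked on
        else:
            page_load_calls.append(entry)  # fired by the page that loaded after the click
    return click_calls, page_load_calls

click_traffic, page_load_traffic = split_har_by_referer(util.getNetworkCalls(), "http://example.com/start")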
Related
So I have this project: it is a website with multiple WebElements on it. I find those WebElements by their class name (they all have the same one, obviously) and then iterate through them to click on each of them. After clicking on one, I have to click on another button, "next". Some of them then open a website in a new tab (others don't). I then immediately close the newly opened tab and try to move on to the next element, at which point I get the StaleElementReferenceException.
Don't get me wrong here, I know what a StaleElementReferenceException is, I just don't know why it occurs. The DOM of the initial website doesn't seem to change and, more importantly, the WebElement I'm trying to reach in the next iteration is still known, so I can print it out, but not click on it.
I have tried working around this issue by creating a new class CustomElement to permanently "save" the found WebElements so I can reach them after the DOM has changed, but that also doesn't seem to be working.
Whatever, here's some code for you guys:
def myMethod():
    driver.get("https://initialwebsite.com")
    time.sleep(1)
    scrollToBottom()  # custom method to scroll to the bottom of the website to make sure I find all web elements
    ways = driver.find_elements_by_class_name("sc-dYzWWc")
    waysCounter = 1
    for way in ways:
        # print("clicking: " + str(way))  ## this will get executed even if there was a new tab opened in the previous iteration....
        driver.execute_script("arguments[0].click();", way)
        # print("clicked: " + str(way))  ## ...but this won't get executed
        try:
            text = driver.find_element_by_xpath("/html/body/div[1]/div/div[2]/div/div[1]/div/div[7]/div[2]/div[" + str(waysCounter) + "]/div[1]/div/div/div[1]").text
        except:
            waysCounter += 1
            text = driver.find_element_by_xpath("/html/body/div[1]/div/div[2]/div/div[1]/div/div[7]/div[2]/div[" + str(waysCounter) + "]/div[1]/div/div/div[1]").text
        methode = None
        # big bunch of if and else statements to give methode a specific number based on what text reads
        print(methode)
        weiterButton = driver.find_element_by_xpath(
            "/html/body/div[1]/div/div[2]/div/div[1]/div/div[7]/div[2]/div[" + str(
                waysCounter) + "]/div[2]/div/div/div/div/div/div[2]/button[2]")
        try:
            driver.execute_script("arguments[0].click();", weiterButton)
        except:
            pass
        if methode == 19:
            time.sleep(0.2)
            try:
                driver.switch_to.window(driver.window_handles[1])
                driver.close()
                time.sleep(0.5)
                driver.switch_to.window(driver.window_handles[0])
                time.sleep(0.5)
            except:
                pass
        waysCounter += 1
        time.sleep(0.5)
And for those who are curious, here's the workaround class I set up:
class CustomElement:
    def __init__(self, text, id, location):
        self.text = text
        self.id = id
        self.location = location

    def __str__(self):
        return str(str(self.text) + " \t" + str(self.id) + " \t" + str(self.location))


def storeWebElements(seleniumElements):
    result = []
    for elem in seleniumElements:
        result.append(CustomElement(elem.text, elem.id, elem.location))
    return result
I then tried working with the id and "re-finding" the WebElement ("way") by id, but apparently the id that gets saved is a completely different id.
So, what can I say: I really tried my best and searched nearly every forum, but didn't come up with a good solution. I really hope you've got one for me :)
Thanks!
Are you crawling links? If so, then you want to save the destination, not the element.
Otherwise you could force the link to open in a new window (perhaps like https://stackoverflow.com/a/19152396/1387701), switch to that window, parse the page, close the page, and still have the original window open.
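A rough sketch of that second approach, assuming the elements are links with an href and using only standard Selenium calls (the class name and URL below are placeholders taken from the question):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://initialwebsite.com")

original_window = driver.current_window_handle
for link in driver.find_elements_by_class_name("sc-dYzWWc"):
    # Open the destination in a new window instead of navigating away,
    # so the original DOM (and the element list) never goes stale.
    href = link.get_attribute("href")
    driver.execute_script("window.open(arguments[0]);", href)
    driver.switch_to.window(driver.window_handles[-1])
    # ... parse the newly opened page here ...
    driver.close()
    driver.switch_to.window(original_window)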
I am trying to listen for different RFID cards with an ACR122 reader and the nfcpy Python library.
I would like to get the card's ID when the user presents it (without recognizing it over and over) and get an event when the user removes it. Ideally this runs in a loop, so it listens for the next card once the user takes their card away.
Below is my code, but the on-release event fires even if the card is still on the reader. What is the correct way to:
Get on-connect without recognizing the card over and over?
Get on-release when the card is taken away?
import nfc

def on_startup(targets):
    return targets

def on_connect(tag):
    uid = str(tag.identifier).encode("hex").upper()
    print(uid)
    return True

def on_release(tag):
    print('Released')
    return tag

rdwr_options = {
    'on-startup': on_startup,
    'on-connect': on_connect,
    'on-release': on_release,
    'beep-on-connect': False,
}

with nfc.ContactlessFrontend('usb') as clf:
    tag = clf.connect(rdwr=rdwr_options)
You might need to set an interval in your ContactlessFrontend config. Try this example:
import nfc
import ndef

tags = set()
rec = ndef.UriRecord("https://google.com")

def on_connect(tag):
    if tag.identifier not in tags:
        tags.add(tag.identifier)
        fmt = tag.format()
        if fmt is None:
            print("Tag cannot be formatted (not supported).")
        elif fmt is False:
            print("Tag failed to be formatted (for some reason).")
        else:
            tag.ndef.records = [rec]

if __name__ == "__main__":
    clf = nfc.ContactlessFrontend()
    if not clf.open('usb'):
        raise RuntimeError("Failed to open NFC device.")
    while True:
        config = {
            'interval': 0.35,
            'on-connect': on_connect
        }
        ret = clf.connect(rdwr=config)
        if ret is None:
            pass
        elif not ret:
            print("NFC connection terminated due to an exception.")
            break
        else:
            pass
    clf.close()
https://gist.github.com/henrycjc/c1632b2d1f210ae0ff33d860c7c2eb8f
This discussion helped me figure out how to solve this.
When reading the docs (‘on-release’ : function(tag)) very carefully – yes, it took me some loops – it becomes apparent that on-release is called as soon as on-connect returns True.
This function is called when the presence check was run (the ‘on-connect’ function returned a true value) and determined that communication with the tag has become impossible, or when the ‘terminate’ function returned a true value. The tag object may be used for cleanup actions but not for communication.
It seems that on-release must not be understood in a physical way, but rather in a communicative way (Released from communications, you may now remove the card).
To solve this issue, one needs to determine whether a card is present after it connected (or more precisely, after it was released – more about that later). The following code does the trick:
import nfc
import time
import logging
import inspect

logging.basicConfig(format="[%(name)s:%(levelname).4s] %(message)s")
logging.getLogger().setLevel(logging.DEBUG)

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

def on_startup(targets):
    logger.debug(inspect.currentframe().f_code.co_name)
    for target in targets:
        target.sensf_req = bytearray.fromhex("0012FC0000")
    return targets

def on_discover(target):
    logger.debug(inspect.currentframe().f_code.co_name)
    logger.info(target)
    return True

def on_connect(tag):
    logger.debug(inspect.currentframe().f_code.co_name)
    logger.info(tag)
    return True

def on_release(tag):
    logger.debug(inspect.currentframe().f_code.co_name)
    # Loop while card is present
    while True:
        time.sleep(1)
        if not clf.sense(*[nfc.clf.RemoteTarget(target) for target in rdwr_options["targets"]]):
            logger.info("Card removed")
            break
    return True

rdwr_options = {
    "targets": ("106A", "106B", "212F"),
    "on-startup": on_startup,
    "on-discover": on_discover,  # Here just for completeness :)
    "on-connect": on_connect,
    "on-release": on_release,
}

if __name__ == "__main__":
    logger.debug(inspect.currentframe().f_code.co_name)
    with nfc.ContactlessFrontend() as clf:
        if not clf.open("usb"):
            raise RuntimeError("Failed to open NFC device.")
        while True:
            ret = clf.connect(rdwr=rdwr_options)
            if not ret:
                break
Now for the "later" part: if we wait for removal of the card during the on-connect stage, we run into trouble, because on-release expects to retrieve information from the card (the tag argument), which it can no longer get once the card is removed and communication is impossible.
PS: The above-mentioned discussion says that the behavior of on-release depends on the type of card one is using.
So, I have a need to register when a card is present and when it leaves again. I have some Type2 tags for this.
Given this code:
def connected(tag):
    print(tag)
    return True

def released(tag):
    print("Bye")

tag = clf.connect(rdwr={'on-connect': connected, 'on-release': released})
I would expect it to echo out the tag ID when I present the card to the reader, and then echo "Bye" once I remove it. This works as expected on a Type 4 tag I have.
I'm afraid those cards are Mifare Classic 1K, which is not supported by nfcpy. It is possible to read the UID, but any other command requires authenticating first and using the Mifare Crypto scheme. This should be doable with NXP reader ICs [...]. And NXP has a good selection of NFC Forum compatible Type 2 Tags that work just great.
My Selenium code checks that a subroutine has completed by waiting for the site's title to change, which worked perfectly. The code looks like this:
waitUntilDone = WebDriverWait(session, 15).until(EC.title_contains(somestring))
However, this can fail sometimes, since the site's landing page changes after manual website visits. The server remembers where you left off. This forces me to check for an alternate condition (website title = "somestring2").
Here is what I came up with so far (also works as far as I can tell):
try:
    waitUntilDone = WebDriverWait(session, 15).until(EC.title_contains(somestring))  # the old condition
except:
    try:
        waitUntilDone = WebDriverWait(session, 15).until(EC.title_contains(somestring2))  # the new condition, which is also valid
    except:
        print "oh crap"  # we should never reach this point
Either one of these conditions is always true, I just don't know which one beforehand.
Is there any way to include an "OR" inside these waits, or to make the try/except block look nicer?
Looks like Selenium will let you do this by creating your own class. Check out the documentation here: http://selenium-python.readthedocs.io/waits.html
Here's a quick example for your case. Note that the key is to have a method named __call__ in your class that defines the check you want. Selenium will call that function every 500 milliseconds until it returns True or some non-null value.
class title_is_either(object):
    def __init__(self, locator, string1, string2):
        self.locator = locator
        self.string1 = string1
        self.string2 = string2

    def __call__(self, driver):
        element = driver.find_element(*self.locator)  # Finding the referenced element
        title = element.text
        if self.string1 in title or self.string2 in title:
            return element
        else:
            return False

# Wait until an element with id='ID-of-title' contains text from one of your two strings
somestring = "Title 1"
somestring2 = "Title 2"

wait = WebDriverWait(driver, 10)
element = wait.until(title_is_either((By.ID, 'ID-of-title'), somestring, somestring2))
Recently I tried to add threading to my scraper so that it can scrape more efficiently.
But somehow it randomly causes python.exe to crash with "has stopped working" and no further information, so I have no idea how to debug it.
Here is some relevant code:
Where the threads are initiated:
def run(self):
    """
    create the threads and run the scraper
    :return:
    """
    self.__load_resource()
    self.__prepare_threads_args()  # each thread is allocated a different set of links to scrape, so there should be no collisions
    for item in self.threads_args:
        try:
            t = threading.Thread(target=self.urllib_method, args=(item,))
            # use the following expression to use the selenium scraper
            # t = threading.Thread(target=self.__scrape_site, args=(item,))
            self.threads.append(t)
            t.start()
        except Exception as ex:
            print ex
What the scraper looks like:
def urllib_method(self, thread_args):
    """
    :param thread_args: arguments containing the files to scrape and the proxy to use
    :return:
    """
    site_scraper = SiteScraper()
    for file in thread_args["files"]:
        current_folder_path = self.__prepare_output_folder(file["name"])

        articles_without_comments_file = os.path.join(current_folder_path, "articles_without_comments")
        articles_without_comments_links = get_links_from_file(articles_without_comments_file) if isfile(articles_without_comments_file) else []

        articles_scraped_file = os.path.join(current_folder_path, "articles_scraped")
        articles_scraped_links = get_links_from_file(articles_without_comments_file) if isfile(articles_without_comments_file) else []

        links = get_links_from_file(file["path"])
        for link in links:
            article_id = extract_article_id(link)
            if isfile(join(current_folder_path, article_id)):
                print "skip: ", link
                if link not in articles_scraped_links:
                    append_text_to_file(articles_scraped_file, link)
                continue
            if link in articles_without_comments_links:
                continue
            comments = site_scraper.call_comments_endpoint(article_id, thread_args["proxy"])
            if comments != "Pro article" and comments != "Crash" and comments != "No Comments" and comments is not None:
                print article_id, comments[0:14]
                write_text_to_file(os.path.join(current_folder_path, article_id), comments)
                sleep(1)
                append_text_to_file(articles_scraped_file, link)
            elif comments == "No Comments":
                print "article without comments: ", article_id
                if link not in articles_without_comments_links:
                    append_text_to_file(articles_without_comments_file, link)
                sleep(1)
I have tried to run the script on both Windows 10 and 8.1; the issue exists on both.
Also, the more data it scrapes, the more frequently it happens, and the more threads are used, the more frequently it happens.
Threads in Python pre-3.2 are very unsafe to use, due to the diabolical Global Interpreter Lock.
The preferred way to utilize multiple cores and processes in Python is via the multiprocessing package:
https://docs.python.org/2/library/multiprocessing.html
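A minimal sketch of how the existing per-worker argument dicts could be farmed out to processes instead of threads; scrape_one is a hypothetical stand-in for urllib_method, and the pool size and placeholder args are arbitrary:

import multiprocessing

def scrape_one(thread_args):
    # Hypothetical stand-in for urllib_method: scrape the links described by
    # one argument dict and return something picklable (e.g. a count).
    return len(thread_args["files"])

if __name__ == "__main__":
    threads_args = [{"files": [], "proxy": None}]  # placeholder; use the real per-worker args
    pool = multiprocessing.Pool(processes=4)       # one process per worker instead of a thread
    try:
        results = pool.map(scrape_one, threads_args)
        print(results)
    finally:
        pool.close()
        pool.join()

Because each worker runs in its own process, a crash in native code only takes down that worker, and CPU-bound work is no longer serialized by the GIL.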
I have a comprehensive list of Australian postcodes, and I need to use the search function of a specific site to get corresponding remoteness codes. I created a Python script to do that, and it does it efficiently.
Except that, at a seemingly random time during the iteration, it throws a 'Modal dialog present' exception. The problem is, I see no dialog! The webpage looks as usual, and I can interact normally with it with my mouse. What could be the problem and is there a solution?
browser = webdriver.Firefox()  # Get local session of firefox
browser.set_page_load_timeout(30)
browser.get("http://www.doctorconnect.gov.au/internet/otd/Publishing.nsf/Content/locator")  # Load page
assert "Locator" in browser.title
search = browser.find_element_by_name("Search")  # Find the query box
ret = browser.find_element_by_id("searchButton")
doha_addr = []
doha_ra = []

for i in search_string_list:
    search.send_keys(i)
    ret.send_keys(Keys.RETURN)
    addr = browser.find_element_by_xpath("//table/tbody/tr[2]/td[2]")
    doha_addr.append(addr.text)
    ra = browser.find_element_by_xpath("//table/tbody/tr[4]/td[2]")
    doha_ra.append(ra.text)
    try:
        browser.find_element_by_xpath("//html/body/div/div[3]/div/div/div/div/div/div[7]/div/div[13]/div[1]").click()
    except:
        pass
    search.clear()
I seem to have caught a glimpse of a popup dialog that shows up and hides itself while my script was running. Anyway, this becomes irrelevant with a while switch and a try/except clause... :7D
doha_ra = {}

for i in search_string_list:
    switch = True
    while switch == True:
        search.send_keys(i)
        ret.send_keys(Keys.RETURN)
        try:
            addr = browser.find_element_by_xpath("//table/tbody/tr[2]/td[2]")
            ra = browser.find_element_by_xpath("//table/tbody/tr[4]/td[2]")
            doha_ra[addr.text] = ra.text
            switch = False
        except:
            pass
    search.clear()