Selenium-rc: How do you use CaptureNetworkTraffic in python

I've found many tutorials for selenium in java in which you first start selenium using s.start("captureNetworkTraffic=True"), but in python start() does not take any arguments.
How do you pass this argument? Or don't you need it in python?

I changed the start method in selenium.py:
def start(self, captureNetworkTraffic=False):
    l = [self.browserStartCommand, self.browserURL, self.extensionJs]
    if captureNetworkTraffic:
        l.append("captureNetworkTraffic=true")
    result = self.get_string("getNewBrowserSession", l)
Then you do:
sel = selenium.selenium('localhost', 4444, '*firefox', 'http://www.google.com')
sel.start(True)
sel.open('')
print sel.captureNetworkTraffic('json')
and it works like a charm.

Start the browser in "proxy-injection mode" (note *pifirefox instead of *firefox). Then you can call the captureNetworkTraffic method.
import selenium
import time
sel = selenium.selenium("localhost", 4444, "*pifirefox", "http://www.google.com/webhp")
sel.start()
time.sleep(1)
print(sel.captureNetworkTraffic('json'))
I learned the *pifirefox "trick" here.

Related

Python schedule with commandline

I want to automate a script.
In past projects I've used the Python scheduler for this, but for this project I'm unsure how to handle it.
The problem is that the script works with login details that are kept outside the code and entered on the command line when launching the script.
ex. python scriptname.py email#youremail.com password
How can I automate this with python scheduler?
The code that is in 'scriptname.py' is:
# LinkedBot.py
import argparse, os, time
import urlparse, random
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
def getPeopleLinks(page):
    links = []
    for link in page.find_all('a'):
        url = link.get('href')
        if url:
            if 'profile/view?id=' in url:
                links.append(url)
    return links
def getJobLinks(page):
    links = []
    for link in page.find_all('a'):
        url = link.get('href')
        if url:
            if '/jobs' in url:
                links.append(url)
    return links
def getID(url):
    pUrl = urlparse.urlparse(url)
    return urlparse.parse_qs(pUrl.query)['id'][0]
def ViewBot(browser):
    visited = {}
    pList = []
    count = 0
    while True:
        # sleep to make sure everything loads, add random to make us look human.
        time.sleep(random.uniform(3.5,6.9))
        page = BeautifulSoup(browser.page_source)
        people = getPeopleLinks(page)
        if people:
            for person in people:
                ID = getID(person)
                if ID not in visited:
                    pList.append(person)
                    visited[ID] = 1
        if pList: # if there are people to look at, look at them
            person = pList.pop()
            browser.get(person)
            count += 1
        else: # otherwise find people via the job pages
            jobs = getJobLinks(page)
            if jobs:
                job = random.choice(jobs)
                root = 'http://www.linkedin.com'
                roots = 'https://www.linkedin.com'
                # prepend the host only when the link is relative
                if root not in job and roots not in job:
                    job = 'https://www.linkedin.com'+job
                browser.get(job)
            else:
                print "I'm Lost Exiting"
                break
        # Output (Make option for this)
        print "[+] "+browser.title+" Visited! \n("\
            +str(count)+"/"+str(len(pList))+") Visited/Queue)"
def Main():
    parser = argparse.ArgumentParser()
    parser.add_argument("email", help="linkedin email")
    parser.add_argument("password", help="linkedin password")
    args = parser.parse_args()
    browser = webdriver.Firefox()
    browser.get("https://linkedin.com/uas/login")
    emailElement = browser.find_element_by_id("session_key-login")
    emailElement.send_keys(args.email)
    passElement = browser.find_element_by_id("session_password-login")
    passElement.send_keys(args.password)
    passElement.submit()
Running this on OSX.

I can see at least two different ways of triggering your script automatically. You mention that your script is started this way:
python scriptname.py email#youremail.com password
That means you start it from a shell. Since you want it scheduled, a crontab sounds like a perfect answer (see https://kvz.io/blog/2007/07/29/schedule-tasks-on-linux-using-crontab/ for example).
If you really want to use the Python scheduler, you can use the subprocess module.
In your file using the Python scheduler:
import subprocess
subprocess.call("python scriptname.py email#youremail.com password", shell=True)
See also: What is the best way to call a Python script from another Python script?
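For what it's worth, the same call can be made with an argument list instead of a single shell string, which sidesteps shell quoting of the email and password. This is only a minimal sketch: build_command and run_script are hypothetical helper names, and the script path and credentials are whatever you would otherwise type on the command line.

```python
import subprocess
import sys


def build_command(script, email, password):
    """Return the argv list that runs `script` with the current interpreter."""
    return [sys.executable, script, email, password]


def run_script(script, email, password):
    """Run the script without a shell and return its exit code."""
    return subprocess.call(build_command(script, email, password))

# Hypothetical usage, with the placeholders from the question:
#   run_script("scriptname.py", "<your email>", "<your password>")
```

Because no shell is involved, special characters in the password are passed through unchanged.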

About the code itself
LinkedIn REST Api
Have you tried using LinkedIn's REST API instead of retrieving heavy pages, filling in a form and sending it back?
Your code is likely to break whenever LinkedIn changes elements on its pages, whereas the API is a contract between LinkedIn and its users.
See https://developer.linkedin.com/docs/rest-api and https://developer.linkedin.com/docs/guide/v2/concepts/methods
Credentials
So that you don't have to pass your credentials on the command line (especially your password, which will be readable in clear text in the shell history), you should either
use a config file (with your API key) and read it with ConfigParser (or anything else, depending on the format of your config file: JSON, Python, etc.)
or set them in environment variables.
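The environment-variable option could look roughly like this. It is a sketch under assumptions: LINKEDIN_EMAIL and LINKEDIN_PASSWORD are variable names made up for illustration, and load_credentials is a hypothetical helper, not part of any library.

```python
import os


def load_credentials(environ=os.environ):
    """Read the login details from environment variables.

    LINKEDIN_EMAIL and LINKEDIN_PASSWORD are hypothetical names;
    use whatever names you export in your shell or crontab.
    """
    try:
        return environ["LINKEDIN_EMAIL"], environ["LINKEDIN_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError("missing environment variable: %s" % missing)
```

The script would then call load_credentials() instead of parsing the email and password from the command line, so nothing sensitive ends up in the shell history.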
For the scheduling
Using Cron
Moreover, for the scheduling part, you can use cron.
Using Celery
If you're looking for a 100% Python solution, you can use the excellent Celery project. Check its periodic tasks.

You can pass the args to the python scheduler.
scheduler.enter(delay, priority, action, argument=(), kwargs={})
Schedule an event for delay more time units. Other than the relative time, the other arguments, the effect and the return value are the same as those for enterabs().
Changed in version 3.3: argument parameter is optional.
New in version 3.3: kwargs parameter was added.
>>> import sched, time
>>> s = sched.scheduler(time.time, time.sleep)
>>> def print_time(a='default'):
...     print("From print_time", time.time(), a)
...
>>> def print_some_times():
...     print(time.time())
...     s.enter(10, 1, print_time)
...     s.enter(5, 2, print_time, argument=('positional',))
...     s.enter(5, 1, print_time, kwargs={'a': 'keyword'})
...     s.run()
...     print(time.time())
...
>>> print_some_times()
930343690.257
From print_time 930343695.274 positional
From print_time 930343695.275 keyword
From print_time 930343700.273 default
930343700.276
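Applied to the question, the argument parameter is where the email and password would go. A rough sketch, assuming you want a fixed number of runs at a fixed interval: run_periodically and launch are hypothetical names, and launch only prints instead of starting the real script.

```python
import sched
import time


def run_periodically(scheduler, interval, action, argument=(), repeats=3):
    """Queue `action(*argument)` `repeats` times, `interval` seconds apart."""
    for i in range(1, repeats + 1):
        scheduler.enter(interval * i, 1, action, argument=argument)


def launch(email, password):
    # Stand-in for starting the real script; the password is not printed.
    print("would start scriptname.py for", email)

# Hypothetical usage:
#   s = sched.scheduler(time.time, time.sleep)
#   run_periodically(s, 3600, launch, argument=("<your email>", "<your password>"))
#   s.run()   # blocks until every queued event has run
```

Note that s.run() blocks, so the scheduling process has to stay alive for the whole period; that is one reason cron is often the simpler choice.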

request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using
...
import multiprocessing as mp
def getImages(val):
    # Download images
    try:
        url= # preprocess the url from the input val
        local= #Filename Generation From Global Variables And Rand Stuffs...
        urllib.request.urlretrieve(url,local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url )
        return 0
if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print ("tempw")
It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for the half of the list it has processed, but I don't know what is happening with the rest). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance

Ok, I have found an answer.
A possible culprit was that the script got stuck while connecting to or downloading from a URL. So I added a socket timeout to limit the time allowed to connect and download each image.
And now, the issue no longer bothers me.
Here is my complete code
...
import multiprocessing as mp
import socket
# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)
def getImages(val):
    # Download images
    try:
        url= # preprocess the url from the input val
        local= #Filename Generation From Global Variables And Rand Stuffs...
        urllib.request.urlretrieve(url,local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url )
        return 0
if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print ("tempw")
Hope this solution helps others who are facing the same issue
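If you would rather not change the process-wide socket default, a per-request timeout is another option. The sketch below swaps urlretrieve for urllib.request.urlopen, which accepts a timeout argument; download is a hypothetical helper name, and the URL preprocessing and filename generation from the original code are left out.

```python
import shutil
import urllib.request


def download(url, local, timeout=20):
    """Save `url` to the file `local`; return True on success, False on failure."""
    try:
        # The timeout applies to this request only, not to every socket.
        with urllib.request.urlopen(url, timeout=timeout) as response, \
                open(local, "wb") as out:
            shutil.copyfileobj(response, out)
        return True
    except Exception as exc:
        print("CAN'T DOWNLOAD - %s (%s)" % (url, exc))
        return False
```

A function like this can be handed to pool.map in the same way as getImages, as long as each worker call builds its own url and local values.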

It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time within one interpreter.
The multiprocessing module really launches separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all these instances: each of them is trying to lock the IO process; the one that succeeds (i.e. comes first) gets you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here: Multiprocessing useless with urllib2?
And more info about the GIL here: What is a global interpreter lock (GIL)?
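One common alternative for download-style work is a thread pool, since CPython releases the GIL while a thread waits on network I/O. A minimal sketch, not taken from the linked questions: fetch_all is a hypothetical helper, and fetch is whatever single-URL download function you already have.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call `fetch(url)` for every URL on a pool of threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

The results come back in the same order as urls, so they can be zipped with the input list to see which downloads failed.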

Python Selenium Geckodrive

Hey there, I'm just trying a basic browser launch using Firefox.
I've tried using an executable path, with an if statement and without one, and the browser still will not open. I've checked the shell and I don't get an error. My best guess is that I'm missing an action of some sort; I just need someone to point me in the right direction using my current code. Thank you.
from selenium import webdriver
class testbot():
    def botfox(self):
        driver = self.driver = webdriver.firfox(geckodriver)
        driver.get("https://wwww.google.com")
if __name__ == "__botfox__":
    botfox()

ok, try this :)
from selenium import webdriver
class testbot():
    def botfox(self):
        self.driver = webdriver.Firefox()
        self.driver.get("https://www.google.com")
if __name__ == '__main__':
    testBotInstance = testbot()
    testBotInstance.botfox()

I'd be surprised if that worked. Have you tried calling it via testbot().botfox() ?
webdriver.firfox would not work, as the syntax is webdriver.Firefox
webdriver.firfox(geckodriver) would not work as geckodriver is not defined anywhere
botfox() would not work because there is no function defined as that. There is one inside of testbot but you would need to first instantiate the class and then call it via testbot().botfox()

Not understanding what's going on with my python code and httplib2 package

I seem to be getting different results when running my script normally or entering it in my cmd.
Here's the full code:
import httplib2, re
def search_for_Title(content):
    searchBounds = re.compile('title(.{1,100})title')
    Title = re.findall(searchBounds,content)
    return Title
def main():
    url = "http://www.nytimes.com/services/xml/rss/index.html"
    h = httplib2.Http('.cache')
    content = h.request(url)
    print(content)
    print(findTitle(str(content)))
I get nothing printed when running this.
The weird thing is, if I manually paste it into the cmd, I do actually get a printout for content. I do not see where else my script could be going wrong, seeing as I've tested the search_for_Title function and it works fine.
So ye... what's going on here?
PS Is there really no good IDE like Visual Studio for C++ or eclipse for Java? I feel naked without a debugger, using notepad++ at the moment. Also, what does httplib2.Http('.cache') actually do?

For your script to work, you need to call main(); you are only defining the functions, not calling them. Example:
import httplib2, re
def search_for_Title(content):
    searchBounds = re.compile('title(.{1,100})title')
    Title = re.findall(searchBounds,content)
    return Title
def main():
    url = "http://www.nytimes.com/services/xml/rss/index.html"
    h = httplib2.Http('.cache')
    content = h.request(url)
    print(content)
    print(search_for_Title(str(content)))
main()
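A slightly more conventional layout puts the call behind a `__name__` guard, so main() runs when the file is executed but not when it is imported. The sketch below leaves the HTTP request out and works on a made-up sample string; search_for_title and the `<title>` pattern are assumptions about what the feed looks like, not the asker's exact regex.

```python
import re


def search_for_title(content):
    """Return the text found between <title> and </title> tags."""
    return re.findall(r"<title>(.*?)</title>", content, flags=re.DOTALL)


def main():
    # Made-up sample input standing in for the downloaded page.
    sample = "<rss><title>First</title><title>Second</title></rss>"
    print(search_for_title(sample))


if __name__ == "__main__":
    main()
```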

Python readline, tab completion cycling with the Cmd interface

I am using the cmd.Cmd class in Python to offer a simple readline interface to my program.
Self contained example:
from cmd import Cmd
class CommandParser(Cmd):
def do_x(self, line):
pass
def do_xy(self, line):
pass
def do_xyz(self, line):
pass
if __name__ == "__main__":
parser = CommandParser()
parser.cmdloop()
Pressing tab twice will show possibilities. Pressing tab again does the same.
My question is, how do I get the options to cycle on the third tab press? In readline terms I think this is called Tab: menu-complete, but I can't see how to apply this to a Cmd instance.
I already tried:
readline.parse_and_bind('Tab: menu-complete')
Both before and after instantiating the parser instance. No luck.
I also tried passing "Tab: menu-complete" to the Cmd constructor. No Luck here either.
Anyone know how it's done?
Cheers!

The easiest trick would be to add a space after menu-complete:
parser = CommandParser(completekey="tab: menu-complete ")
The bind expression that is executed
readline.parse_and_bind(self.completekey+": complete")
will then become
readline.parse_and_bind("tab: menu-complete : complete")
Everything after the second space is actually ignored, so it's the same as tab: menu-complete.
If you don't want to rely on that behaviour of readline parsing (I haven't seen it documented) you could use a subclass of str that refuses to be extended as completekey:
class stubborn_str(str):
def __add__(self, other):
return self
parser = CommandParser(completekey=stubborn_str("tab: menu-complete"))
self.completekey+": complete" is now the same as self.completekey.
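A quick way to convince yourself of that last claim without touching readline is to repeat the concatenation that cmd.Cmd performs. This is only a check of the string behaviour; bind_expression is a name made up for the illustration.

```python
class stubborn_str(str):
    def __add__(self, other):
        # Ignore whatever is appended and hand back the original string.
        return self


completekey = stubborn_str("tab: menu-complete")
# Same expression shape as in cmd.Cmd.cmdloop: self.completekey + ": complete"
bind_expression = completekey + ": complete"
print(bind_expression)
```

With a plain str the suffix would be appended as usual; only the subclass swallows it.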

Unfortunately, it seems as though the only way around it is to monkey-patch the method cmdloop from the cmd.Cmd class, or roll your own.
The right approach is to use "Tab: menu-complete", but that is overridden by the class, as shown in line 115: readline.parse_and_bind(self.completekey+": complete"), so it is never activated. (For line 115 and the entire cmd module, see https://hg.python.org/cpython/file/2.7/Lib/cmd.py.) I've shown an edited version of that function below, and how to use it:
import cmd
# note: taken from Python's library: https://hg.python.org/cpython/file/2.7/Lib/cmd.py
def cmdloop(self, intro=None):
    """Repeatedly issue a prompt, accept input, parse an initial prefix
    off the received input, and dispatch to action methods, passing them
    the remainder of the line as argument.
    """
    self.preloop()
    if self.use_rawinput and self.completekey:
        try:
            import readline
            self.old_completer = readline.get_completer()
            readline.set_completer(self.complete)
            readline.parse_and_bind(self.completekey+": menu-complete") # <---
        except ImportError:
            pass
    try:
        if intro is not None:
            self.intro = intro
        if self.intro:
            self.stdout.write(str(self.intro)+"\n")
        stop = None
        while not stop:
            if self.cmdqueue:
                line = self.cmdqueue.pop(0)
            else:
                if self.use_rawinput:
                    try:
                        line = raw_input(self.prompt)
                    except EOFError:
                        line = 'EOF'
                else:
                    self.stdout.write(self.prompt)
                    self.stdout.flush()
                    line = self.stdin.readline()
                    if not len(line):
                        line = 'EOF'
                    else:
                        line = line.rstrip('\r\n')
            line = self.precmd(line)
            stop = self.onecmd(line)
            stop = self.postcmd(stop, line)
        self.postloop()
    finally:
        if self.use_rawinput and self.completekey:
            try:
                import readline
                readline.set_completer(self.old_completer)
            except ImportError:
                pass
# monkey-patch - make sure this is done before any sort of inheritance is used!
cmd.Cmd.cmdloop = cmdloop
# inheritance of the class with the active monkey-patched `cmdloop`
class MyCmd(cmd.Cmd):
    pass
Once you've monkey-patched the class method, (or implemented your own class), it provides the correct behavior (albeit without highlighting and reverse-tabbing, but these can be implemented with other keys as necessary).
