I am attempting to quantify how much download quota is consumed when a certain web page is loaded (in Chrome, in my case), including all the page's assets (i.e. 'loaded' in the sense of regular human use of the webpage).
Is there a way to achieve this using mainstream techniques (e.g. a Python library, Selenium, the netstat command-line utility, curl, or something else)?
Note: I guess one very crude way would be to check my ISP's usage stats before and after the page load, but this is fraught with potential inaccuracies, most notably the device performing background tasks and the ISP not providing usage figures fine-grained enough to discern the additional kilobytes consumed by the page load, so I don't think this method would be reliable.
There may be better ways, but I found one that seems to work.
In Chrome, open Developer Tools (Cmd + Option + J on macOS), click on the 'Network' tab, and refresh the page. When it has fully loaded, look at the resource list and the total transferred size shown at the bottom of the panel.
Note: to get an accurate reading, it is important to ensure the 'Disable cache' checkbox is ticked (leaving the cache enabled can underestimate the download quota required, since some assets will be served locally).
For the page we're on now, we see it uses 1.5 MB without disabling the cache.
Note: the amount varied quite a bit for me each time I ran it (not always in a downward direction), so depending on the circumstances it could be worth doing this several times and taking an average.
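If you would rather script the measurement than read it off DevTools, one option is to sum the on-the-wire bytes reported in Chrome's performance log via Selenium. This is only a rough sketch (the URL is a placeholder, and the total covers what the Network events report, not necessarily every byte on the wire):

import json
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chrome to record DevTools-style network events
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
driver = webdriver.Chrome(options=options)

driver.get('https://example.com')  # placeholder URL

total_bytes = 0
for entry in driver.get_log('performance'):
    message = json.loads(entry['message'])['message']
    if message['method'] == 'Network.loadingFinished':
        # encodedDataLength is the compressed, on-the-wire size
        total_bytes += message['params'].get('encodedDataLength', 0)

print(f'{total_bytes / 1e6:.2f} MB transferred')
driver.quit()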
I have worked on a Selenium project in Python which uses both Chrome and its driver.
For security reasons the customer doesn't want to use Chrome, Firefox, etc., as new security bugs are discovered in those browsers every day (highly sensitive data is stored on his machine and he doesn't want to risk it).
What can be done to solve this issue? What other alternatives are there, so I can change my code accordingly?
Your two main options are:
Most secure option
If your project is really that sensitive (national security), your only option is to air gap [1] all systems and ensure that nobody can breach that air gap, such as with USB sticks in the staff car park [2], perhaps by putting your infrastructure on Bouvet Island.
Most projects (99.999%) do not need this level of security.
Slightly less secure, significantly more practical.
Ensure that the version of Chrome that selenium is using is always the latest and greatest, with the most up to date security patches applied.
Run your Chrome containers in an environment that has limited access to the Internet:
Allow incoming and outgoing requests from only certain countries.
Allow only certain ports to be open.
Side note: never use block lists, only use allow lists (see the sketch after the comparison below).
Block lists:
- block this stuff only
- allow everything else
Allow lists:
- allow this stuff only
- block everything else
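To make the distinction concrete, here is a minimal sketch in Python (ALLOWED_HOSTS is a hypothetical example):

# Hypothetical allow list: anything not explicitly listed is denied
ALLOWED_HOSTS = {'updates.example.com', 'api.example.com'}

def is_permitted(host):
    # Default-deny: new or unknown hosts are blocked automatically,
    # with no need to anticipate every bad actor in advance
    return host in ALLOWED_HOSTS

print(is_permitted('api.example.com'))   # True
print(is_permitted('evil.example.net'))  # False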
Chrome is maintained by Google, a company that:
- is huge, so has the funds necessary to keep its products patched
- has a vested interest in browser security; they want to ensure that users can continue to click on Google ads.
I would also strongly suggest contributing to the Selenium project, as it is FOSS and needs support.
Life is all about risk mitigation strategies. I could get run over by a bus tomorrow, but I'm not going to ban all buses to mitigate that risk; I'll just look both ways before I cross. Equally, someone might discover a bug in Chrome tomorrow, but does that mean you should stop using the most popular browser on the planet to test your code? Probably not.
[1] Buttice, C. (2021). What is an Air Gap? - Definition from Techopedia. [online] Techopedia. Available at: https://www.techopedia.com/definition/17037/air-gap [Accessed 20 Apr. 2022].
[2] Pompon, R. (2018). Attacking Air-Gap-Segregated Computers. [online] F5 Labs. Available at: https://www.f5.com/labs/articles/cisotociso/attacking-air-gap-segregated-computers.
I want to know if it is possible to capture a series of actions as a flow and export it to Selenium, in order to repeat that flow.
For example, I need to uninstall, reinstall, and configure a few applications several times each day. The process is always the same, and it's a long one, so to avoid navigating through code to capture all the IDs and classes, is there any way of doing that?
Kind regards.
I think Selenium IDE is basically what you are looking for. It is a Chrome or Firefox extension with a record-and-playback feature, and it is able to export working code in a variety of languages (including Python).
Word of caution: tests produced by the tool tend to be pretty unreliable/flaky; you can attain much better stability by coding against WebDriver directly.
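For comparison, a hand-written flow with explicit waits looks roughly like this (a sketch; the URL and locator are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com/login')  # placeholder URL

wait = WebDriverWait(driver, 10)
# Wait for the element to be ready instead of sleeping a fixed time,
# which is what makes hand-written flows less flaky than recordings
button = wait.until(EC.element_to_be_clickable((By.ID, 'submit')))  # placeholder locator
button.click()
driver.quit()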
Using pyautogui or something similar, you could record the location of each click and either use the color of certain pixels to initiate different stages, or wait a fixed amount of time before clicking each saved point on screen.
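A rough sketch of the pixel-color approach (the coordinates and color are placeholders you would record for your own screen):

import time
import pyautogui

# Points recorded earlier, e.g. by printing pyautogui.position()
SAVED_CLICKS = [(100, 200), (350, 420)]
READY_PIXEL = (105, 205)      # pixel whose color signals the next stage
READY_COLOR = (0, 128, 0)     # expected RGB value at that pixel

for point in SAVED_CLICKS:
    # Poll the indicator pixel until the stage is ready, then click
    while pyautogui.pixel(*READY_PIXEL) != READY_COLOR:
        time.sleep(0.5)
    pyautogui.click(*point)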
I use Selenium (Python) to run multiple chromedriver instances on a Raspberry Pi. Preferably I would like to run as many instances as the Pi's hardware can manage, so lower resource usage will be beneficial. Running a headless chromedriver is unfortunately not possible, because chromedriver does not allow DRM content to load in that mode.
A starting point would most likely be to optimize switches. A very comprehensive list of available switches can be found here. The list contains a lot of switches, and I was wondering if anyone has a list of performance-optimizing switches that I can use as a starting point.
In addition, I would like to mention that it is not important for me to actually see what happens in the browser window. So might it be relevant to, e.g., prevent images from loading or hide/move the browser window off screen?
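For the image-loading and off-screen parts specifically, something along these lines should work (a sketch; the exact effect of each flag varies between Chrome versions):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--blink-settings=imagesEnabled=false')  # skip image loads
options.add_argument('--window-position=-2000,0')             # push the window off screen
options.add_argument('--disable-extensions')                  # avoid extension overhead
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)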
What is a responsible / ethical time delay to put in a web crawler that only crawls one root page?
I'm using time.sleep(#) between successive requests.get(url) calls.
I'm looking for a rough idea of what timescales are:
1. Way too conservative
2. Standard
3. Going to cause problems / get you noticed
I want to touch every page (at least 20,000, probably a lot more) meeting certain criteria. Is this feasible within a reasonable timeframe?
EDIT
This question is less about avoiding being blocked (though any relevant info would be appreciated) and more about what time delays do not cause issues for the host website/servers.
I've tested with 10-second delays and around 50 pages. I just don't have a clue whether I'm being overcautious.
I'd check their robots.txt. If it lists a Crawl-delay, use it! If not, try something reasonable, scaled to the size of the page: for a large page, try 2 requests per second; for a simple .txt file, 10 requests per second should be fine.
If all else fails, contact the site owner to see what they're capable of handling nicely.
(I'm assuming this is an amateur server with minimal bandwidth)
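If you want to automate the robots.txt check in Python, the standard library can do it. A sketch (the URLs and fallback delay are placeholders):

import time
import urllib.robotparser

import requests

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Fall back to a conservative default when no Crawl-delay is listed
delay = rp.crawl_delay('*') or 2

for url in ('https://example.com/page1', 'https://example.com/page2'):
    if rp.can_fetch('*', url):
        requests.get(url)
    time.sleep(delay)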
I have just made a simple script that opens links from a hand-typed list in the default browser. It succeeds at opening the pages, but often opens two or three windows with all the pages spread out across them. What am I missing?
import webbrowser

new = 2

def open_page(url):
    webbrowser.open(url, new=new)

def file_len(fname):
    # Count the lines in the file
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

line_len = file_len('social.txt')

with open('social.txt') as f:
    content1 = f.readlines()

for i in range(line_len):
    url = content1[i].rstrip()
    open_page(url)
    print(url)
This is up to your browser. As the documentation for webbrowser.open says:
If new is 2, a new browser page (“tab”) is opened if possible.
So, why might it not be possible?
At least two reasons, and probably more:
When you call webbrowser.open the second time, it hasn't finished opening the first window yet. So, when it tries to create a new tab in the current window, there is no current window, so it creates a new window. This may happen a few times if you're spamming it as fast as possible (as you are). Not every browser works this way on every platform, but some do. This is particularly likely in cases where the "browser" program is actually just a script that talks to the real browser program, as with the firefox script on most *nix platforms except Mac OS X.
When you call webbrowser.open the 13th or so time, that exceeds some limit on max tabs/window, so it opens a new window. I believe Gecko-based browsers have this feature but it's disabled in Firefox (as in, you have to dig into about:preferences or edit prefs.js to set a limit), WebKit-based browsers don't have it at all, and I have no idea about IE or Opera.
Again, those are just two possible reasons.
So, how do you fix this? Well, it depends on the problem.
First, there are already some hacky workarounds to avoid the first problem with some browsers. In particular, if you're on a *nix platform, and your default browser is Firefox, but Python can't figure that out (and it's therefore just using $BROWSER and/or using xdg-open or similar), explicitly using firefox instead of the default may help.
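For example (a sketch, assuming a firefox controller is registered on your platform):

import webbrowser

try:
    browser = webbrowser.get('firefox')  # name the browser explicitly
except webbrowser.Error:
    browser = webbrowser.get()           # fall back to the platform default

browser.open('https://example.com', new=2)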
But beyond that, this is a classic race condition. The solution to any race condition is to find the right thing to synchronize on. But there's no way to do that in this case. You're calling a function that kicks off a chain of events and gets no feedback whatsoever (e.g., it executes a wrapper script that talks to the real browser program that may itself just be a front-end that sends a message over a pipe to the real real browser program…). As the docs say:
For non-Unix platforms, or when a remote browser is available on Unix, the controlling process will not wait for the user to finish with the browser, but allow the remote browser to maintain its own windows on the display.
As usual with races, you can sort of paper over them with sleeps, but it's never a good solution. No matter how long you sleep between the first and second call, it's not guaranteed to be always long enough—and it's almost certain to be usually too long.
If you dig through the source for BackgroundBrowser and the UnixBrowser, MozillaBrowser, etc. that follow it, you can see that there are some hacks that you might be able to extend that might work in some cases. Or, you can go deeper under the covers and talk to a specific web browser using some more powerful specific API that it provides via, say, COM or AppleEvents or over some pipe or on the command line. But short of that, there's no real answer. (Well, you might be able to use something like Selenium that knows how to drive a number of different browsers, but it's the same basic idea.)
Here is how I searched multiple questions from a .txt file:

import webbrowser
import pandas as pd

# 'ques try.txt' is expected to start with a header line 'questions',
# followed by one query per line
ques = pd.read_csv('ques try.txt', sep='\n')

for i in ques['questions']:
    webbrowser.open_new_tab(f'https://www.google.com/search?q={i}')