I've been doing some form-entry-stuffing tests on a buddy's website with Selenium, filling a form automatically. The form has a reCAPTCHA box at the bottom, which has been giving us nightmares for weeks. Basically, the first couple of entries go in perfectly (every instance of Selenium runs with a different proxy). But the more entries I submit, the harder the CAPTCHA becomes, until it takes me a minute to solve a single one. I've tried many things without success.
At first the CAPTCHAs would be long right off the bat, since the webdriver flag was set. I managed to patch ChromeDriver so reCAPTCHA wouldn't recognize me as a bot. Later on I disabled WebRTC, which seemed to make me less conspicuous, and I started getting shorter challenges, sometimes even one challenge and done.
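For reference, the flag-level version of this (short of patching the ChromeDriver binary itself) usually looks something like the sketch below. These particular options are assumptions about what detection keys on, and their effectiveness varies by Chrome/driver version:

```python
# Sketch only: flags commonly suggested for hiding the navigator.webdriver
# signal. What reCAPTCHA actually checks is not public, so treat these as
# assumptions, not a guaranteed fix.
STEALTH_ARGS = ["--disable-blink-features=AutomationControlled"]

def build_driver():
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    for arg in STEALTH_ARGS:
        opts.add_argument(arg)
    # Drop the "Chrome is being controlled by automated test software"
    # infobar and the automation extension.
    opts.add_experimental_option("excludeSwitches", ["enable-automation"])
    opts.add_experimental_option("useAutomationExtension", False)

    driver = webdriver.Chrome(options=opts)
    # Belt and braces: overwrite navigator.webdriver before any page
    # script gets a chance to read it.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": "Object.defineProperty(navigator, 'webdriver', "
                   "{get: () => undefined})"},
    )
    return driver
```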
But all of this only worked as long as I solved the CAPTCHAs at a reasonable rate. When I need to fill in 30 forms in half an hour, by the 10th form I'm getting extremely long CAPTCHAs that sometimes fail even when the answer is correct, and I can't use the audio CAPTCHA.
I thought it might be related to WebGL, so I disabled it and tried again. Nope, same thing: I couldn't even choose the audio option, because it said I could be sending automated requests. I tried disabling JS, then both; neither worked. I tried another computer and got a few easy first CAPTCHAs, then it was back to the hard ones.
I'm not aware of how reCAPTCHA does its validation behind the scenes, so even if it's not public knowledge, I'm asking whether anyone has an informed guess about how it works, and whether anyone has succeeded at avoiding being temporarily blocked by it.
Just as extra info: using proxies and attempting the same thing in a normal browser in incognito mode gives the same results. A few easy, quick challenges, then reCAPTCHA destroys you with minute-long ones.
If anyone has any tips, I'd appreciate it!
Related
Ok, so I'm a total noob with aspirations of learning to code. I've read about a guy who, for example, wrote a script which, if he was at work past a certain time, would automatically send a text to his wife stating he would be late. I want to do something sorta similar.
What I want in essence is a script that will log in to a website at a certain time of day, check if a box/text is green/yes or red/no, and send a text or notification to my phone informing me of the result each day.
The progress I've made so far is installing Python, installing PyCharm, and doing some research about tools I could use toward achieving my goal. Selenium seems like it would be capable of logging into the website, but I've no idea how to go about setting up a conditional statement to check the result, nor how I could set it up to send a text/notification to my phone. Also, if there is a more appropriate tool I should look into rather than Selenium and Python, I'm not attached to the idea of using these specific tools.
Finally, I realize that this may end up being too complicated for a first project, so I'd be up for hiring a freelancer to set this up. Equally, if this is something that could feasibly be written by someone with very little knowledge of coding such as myself, I'd really appreciate some direction from an expert!
Thanks for any input!
You are on the right track with Selenium for web-form automation. Sending a notification, however, would require something else, as was pointed out. And if you're on Windows, you can use Task Scheduler to run the script automatically, only at a certain time of day, and so on.
To simplify things, you can also look at general-purpose automation programs that might support all these features together. For example, JRVSInputs uses Selenium for web auto-fills (https://jrvs.in/forums/viewtopic.php?t=182) and has features to send email or Windows notifications. It can convert all of its scripts into a neat batch file, which you can then schedule in Task Scheduler.
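If you'd rather stay with plain Python, the whole check-and-notify flow is small. A sketch, with the URL, selectors, credentials, and SMTP host all placeholders for the real site (many phone carriers expose an email-to-SMS gateway, so plain email can reach a phone as a text):

```python
def status_message(is_green):
    """Turn the page check into the text you'd want on your phone."""
    return "Today's result: YES (green)" if is_green else "Today's result: NO (red)"

def check_site():
    # Everything below is a placeholder for the real site's layout.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com/login")
        driver.find_element(By.NAME, "username").send_keys("me")
        driver.find_element(By.NAME, "password").send_keys("secret")
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
        box = driver.find_element(By.ID, "status-box")  # hypothetical id
        return "green" in box.get_attribute("class")
    finally:
        driver.quit()

def send_email(body, to_addr, smtp_host="smtp.example.com"):
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Daily site check"
    msg["From"] = "bot@example.com"
    msg["To"] = to_addr  # e.g. a carrier's email-to-SMS address
    msg.set_content(body)
    with smtplib.SMTP(smtp_host) as s:
        s.send_message(msg)
```

Task Scheduler (or cron) then just runs the script once a day; the script itself doesn't need any timing logic.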
I want to know whether it's possible to capture a series of actions as a flow and export it to Selenium, in order to repeat that flow.
For example, I need to uninstall, reinstall, and configure a few applications several times each day. The process is always the same, and it's a long one, so to avoid digging through the page source to capture all the IDs and classes, is there any way of doing this?
Kind regards.
I think Selenium IDE is basically what you're looking for. It's a Chrome or Firefox extension with a record-and-playback feature, and it can export working code in a variety of languages (including Python).
Word of caution: tests produced by the tool tend to be pretty unreliable/flaky; you can attain much better stability by hand-coding against WebDriver.
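Much of that stability comes from explicit waits, which recorded scripts usually lack. A minimal sketch of the pattern (the selector is hypothetical):

```python
# Wait until an element is actually clickable instead of clicking blindly,
# which is where recorded IDE scripts tend to flake out.
SUBMIT_SELECTOR = "button#submit"  # placeholder for the real element

def click_when_ready(driver, timeout=10):
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(driver, timeout)
    button = wait.until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, SUBMIT_SELECTOR))
    )
    button.click()
```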
Using pyautogui or something similar, you could record the location of each click, then either use the color of certain pixels to initiate different stages or wait a set amount of time before clicking each saved point on screen.
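A sketch of the pixel-trigger idea; the coordinates, colors, and tolerance are placeholders for whatever you record:

```python
def colors_match(pixel, target, tolerance=10):
    """True if every RGB channel of `pixel` is within `tolerance` of `target`."""
    return all(abs(p - t) <= tolerance for p, t in zip(pixel, target))

def wait_and_click(x, y, target_rgb, poll_seconds=0.5):
    """Poll one screen pixel until it turns the recorded color, then click there."""
    import time
    import pyautogui  # third-party: pip install pyautogui

    while not colors_match(pyautogui.pixel(x, y), target_rgb):
        time.sleep(poll_seconds)
    pyautogui.click(x, y)
```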
I have a webscraper I wrote with Python/Selenium that automatically reserves a spot at my gym for me every morning (You have to reserve at 7am and they fill up quick so I just automated it to run at 7 every day). It's been working well for me for a while but a couple days ago it stopped working. So I got up early and checked what was going on - to find that this gym has added Captcha to its reservation process.
Does this mean that someone working on the website added a Captcha to it? Or is it Google-added? Regardless, am I screwed? Is there any way for my bot to get around Captcha?
I found that when I run the Selenium script, the CAPTCHA requires additional steps (e.g., finding all the crosswalks), whereas when I try to reserve manually the CAPTCHA is still there, but it only requires me to click on it before moving on. Is this something I can take advantage of?
Thank you in advance for any help.
I've run into similar problems before. Sometimes you're just stuck and can't get past it. That's exactly what Captcha is meant to accomplish, after all.
However, I've found that sometimes the site will only present you with a CAPTCHA if it suspects, based on your behavior, that you are a bot. This can be partially overcome, especially if you're only making occasional calls to a site, by making your bot somewhat less predictable. I do this using np.random. Within an individual session, I draw the gaps between simulated user actions from an exponential distribution, since user activity often resembles a Poisson process, whose inter-arrival times are exponential. And I randomize the time I log into a site by simply choosing a random time within a certain range. These simple measures are highly effective, although eventually most sites will figure out what you're doing.
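A minimal sketch of both tricks; the standard library's `random.expovariate` behaves the same way as the np.random approach, and the mean gap here is a made-up figure to tune per site:

```python
import random

def session_delays(n_actions, mean_gap_seconds=8.0):
    """Gaps between simulated user actions within one session.

    Inter-arrival times of a Poisson process are exponentially
    distributed, so each gap is drawn from an exponential with the
    given mean (mean_gap_seconds is an assumption; tune it per site).
    """
    return [random.expovariate(1.0 / mean_gap_seconds) for _ in range(n_actions)]

def random_start_time(earliest, latest):
    """Pick a random login time (e.g. seconds past midnight) within a window."""
    return random.uniform(earliest, latest)
```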
Before you implement either of these solutions, however, I strongly recommend you read the site's Terms of Use and consider whether overcoming their Captcha is a violation. If you signed a use agreement with them the right thing to do is to honor it, even if it's somewhat inconvenient. I'd argue this separate ethical decision is of much greater importance than the technical challenge of trying to bypass their Captcha.
Try using https://github.com/dessant/buster to solve the CAPTCHA.
Implementation in Python/Selenium -> repository
I wrote a script for my company that randomly selects employees for random drug tests. It works wonderfully, except when I gave it to the person who would actually use the program. She clicked on it, and a message popped up asking if she trusted the program. After clicking "run anyway," AVG flagged it two more times before it would finally load. I read someone else's comment saying to make an exception for it in the antivirus. The problem is, I wrote another program that reads other scripts, reads/writes txt files, generates Excel spreadsheets, and does many other things. I'm really close to releasing the final product to a few select companies as a trial, and this certificate thing is going to be an issue. I code for fun, so there's a lot of lingo that goes right by me. Can someone point me in the right direction for information on creating a trusted program?
Obtaining a digital certificate appears to be a long process. You need one issued by a certificate authority; Microsoft has a docs page on it.
After you have the certificate, you'd need to sign your .exe file after it's built, using a tool like SignTool. You may find more useful and detailed answers in this thread than I can provide; I actually know quite little about this process and can only point you to people who know more, so I'd suggest looking through what I've listed here first, since past this point I probably know about as much as you do.
If anyone else is having this problem, I stumbled on a solution that works for me.
I created an install wizard using Inno Setup. Before I could install the software (my drug-test program), it got flagged, asking if I trusted the software. I clicked "run anyway" and my antivirus flagged it two more times. After the program was installed, it never flagged me again. Since my main program will probably be used by 100-200 people, I'm completely fine having to do that procedure once. However, for a more "professional" result, it's probably worth investing in certificates.
What is a responsible / ethical time delay to put in a web crawler that only crawls one root page?
I'm using time.sleep(#) between the following calls
requests.get(url)
I'm looking for a rough idea on what timescales are:
1. Way too conservative
2. Standard
3. Going to cause problems / get you noticed
I want to touch every page (at least 20,000, probably a lot more) meeting certain criteria. Is this feasible within a reasonable timeframe?
EDIT
This question is less about avoiding being blocked (though any relevant info would be appreciated) and more about what time delays avoid causing problems for the host website/servers.
I've tested with 10 second time delays and around 50 pages. I just don't have a clue if I'm being over cautious.
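For a rough sense of scale, single-threaded total runtime is just pages × delay (ignoring request latency):

```python
def crawl_hours(pages, delay_seconds):
    """Lower bound on wall-clock time for a single-threaded polite crawl."""
    return pages * delay_seconds / 3600

# 20,000 pages at a 10-second delay is about 56 hours of wall-clock time;
# at a 2-second delay it drops to roughly 11 hours.
```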
I'd check their robots.txt. If it lists a crawl-delay, use it! If not, try something reasonable, depending on the size of the page: for a large page, try 2 requests/second; if it's a simple .txt file, 10/second should be fine.
If all else fails, contact the site owner to see what they're capable of handling nicely.
(I'm assuming this is an amateur server with minimal bandwidth)
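Reading the crawl-delay doesn't need anything beyond the standard library; a sketch, with the fetch function and default delay as placeholders:

```python
import time
import urllib.robotparser

def polite_delay(robots_lines, default_delay=10.0, agent="*"):
    """Return the site's Crawl-delay if robots.txt declares one,
    otherwise fall back to a conservative default."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)  # for a live site: rp.set_url(...); rp.read()
    delay = rp.crawl_delay(agent)
    return float(delay) if delay is not None else default_delay

def crawl(urls, delay, fetch):
    """`fetch` is whatever makes the request, e.g. requests.get."""
    for url in urls:
        page = fetch(url)
        # ... process `page` here ...
        time.sleep(delay)
```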