I've written a script that scrapes data from a div and returns a flag indicating whether a prespecified string exists within it. Everything works perfectly locally. However, when I copy the code into a Colab notebook, the script hits a reCAPTCHA and returns a 403 status code.
My code is below:
import requests
import pandas as pd
from bs4 import BeautifulSoup as soup
from tqdm import tqdm

def stock_checker(listofurls):
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
    }
    stock_level = []
    for target_url in tqdm(listofurls):
        print(target_url)
        query = requests.get(target_url, headers=headers).text
        html = soup(query, "html.parser")
        soup_result = html.find("div", {"class": "product-details__options-basket"}).text
        stock_bool = "Out of Stock" if "Out of Stock" in str(soup_result) else "In Stock"
        stock_level.append(stock_bool)
    return pd.DataFrame({"URLs": listofurls, "In Stock": stock_level})

print(stock_checker(myurllist))
The HTML returned is the reCAPTCHA page, so the div I'm referencing does not exist and the code errors out.
Any ideas why this happens in Colab but not locally, and/or how to fix the issue?
P.S. I'm putting it in Colab so others can use it by just running the notebook, without needing to write any code.
RE: "why is this happening?" --
Automated clients such as scripts and headless browsers are often used for abuse, and hence more often receive counter-abuse challenges like CAPTCHAs. The chance is greater still when executing from the shared IP ranges typical of cloud providers, which is where Colab notebooks run.
The short version is that the site you are using is probably working as intended. If you are not adhering to its robots.txt directives, I'd start there in order to reduce the chance of encountering counter-abuse mechanisms.
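A minimal sketch of that robots.txt check using Python's standard-library parser (the shop domain and product path here are placeholders, not the asker's actual site):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder domain).
parser = RobotFileParser("https://www.example-shop.com/robots.txt")
parser.read()

# Ask whether our user agent is allowed to fetch a given product page.
user_agent = "Mozilla/5.0"
product_url = "https://www.example-shop.com/product/12345"
if parser.can_fetch(user_agent, product_url):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt - skip this URL")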
Related
I am currently trying to build a webscraping program to pull data from a real estate website using Beautiful Soup. I haven't gotten very far but the code is as follows:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/")
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup)
When I try to print the data to at least see if the program is working, I get an error message saying "Not Acceptable! An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security." How do I get the server to stop blocking my IP address? I've read about similar issues with other programs and tried clearing cookies, trying different browsers, etc., and nothing has fixed it.
This is happening because the webpage thinks you're a bot (and it's correct), so your request gets blocked.
To "bypass" this, try adding a user-agent to the headers parameter of the requests.get() method.
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
url = "http://pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
print(soup.prettify())
I've tried searching for this - can't seem to find the answer!
I'm trying to do a really simple scrape of an entire webpage so that I can look for key words. I'm using the following code:
import requests
Website = requests.get('http://www.somfy.com', {'User-Agent':'a'}, headers = {'Accept': '*/*'})
print (Website.text)
print (Website.status_code)
When I visit this website in a browser (e.g. Chrome or Firefox) it works. When I run the Python code I just get the result "Gone" (status code 410).
I'd like to be able to reliably put in a range of website urls, and pull back the raw html to be able to look for key-words.
Questions
1. What have I done wrong, and how should I set this up to have the best chance of success in the future?
2. Could you point me to any guidance on how to go about working out what is wrong?
Many thanks - and sorry for the beginner questions!
You passed an invalid User-Agent, and in any case it wasn't included in your headers (the second positional argument of requests.get() is params, not headers).
I have fixed your code for you - it returns a 200 status code.
import requests
Website = requests.get('http://www.somfy.com', headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3835.0 Safari/537.36', 'Accept': '*/*'})
print (Website.text)
print (Website.status_code)
I'm trying to run the following Python code:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
url = "https://search.bilibili.com/all?keyword=Steins;Gate0"
try:
    r = requests.get(url=url, headers=headers)
    r.encoding = 'utf-8'
    if r.status_code == 200:
        print(r.text)
except:
    print("This is the selection of Steins Gate")
I'm a beginner at web crawling. This crawler is written with requests in Python, but I cannot get the full page, and I think the problem is asynchronous page loading (perhaps the website uses some other strategy).
So the question is: how do I get the full page?
What you're dealing with is a well-known problem that is conceptually simple but more involved to execute, because the content you want doesn't exist in the initial HTML; it only appears after some browser-side interaction or JavaScript rendering.
Some recommendations:
Investigate headless browsers such as headless Chrome and their use cases
Investigate Selenium, how to use it with Python, and how to pair it with a headless browser (a sketch follows below)
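A minimal sketch using Selenium to drive headless Chrome (this assumes Chrome plus a matching chromedriver are installed; the fixed sleep is a crude stand-in for a proper explicit wait):

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH

driver.get("https://search.bilibili.com/all?keyword=Steins;Gate0")
time.sleep(3)                               # crude wait for the JavaScript-rendered results

html = driver.page_source                   # full DOM after rendering
driver.quit()

print(len(html))

The html string can then be fed to BeautifulSoup exactly as before.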
I'm trying to create a script that parses the source code from https://www.youtube.com/feed/subscriptions and retrieves the URLs of the videos in my subscription feed, so I can pass them to an MP4 downloader and save the files to my FTP server.
However, I have been stuck on this problem for a couple of hours.
import bs4
import requests
source = requests.get('https://www.youtube.com/feed/subscriptions')
sourceSoup = bs4.BeautifulSoup(source.text,'html.parser')
sourceSoup.select('#grid-319397 > li:nth-child(1) > div > div.yt-lockup-dismissable > div.yt-lockup-content > h3')
[]
I am right-clicking the element, choosing 'Inspect element', then 'Copy selector', and pasting the result inside the select method.
As you can see, it keeps returning an empty list.
I have tried many different derivatives of this, but it’s not picking up anything. I am having the same problem when doing the same things on the homepage, therefore I doubt that it is because it is behind a login (although I am logged in on the PC in which the script is running).
Can someone please point me in the right direction?
You are facing two different (but related) issues:
The page that the server returns for the GET request sent by your code might be different from the page you receive when you visit it with your browser, because the server does not recognize your user-agent.
The item you're looking for is only visible after you log in.
Now, instead of handling both of these issues manually, there is a YouTube API that you should consider using (a sketch follows after the demo below).
Demo code showing that we get a different page for different user agents:
import requests

python_user_agent_request = requests.get('http://www.youtube.com')
chrome_user_agent_request = requests.get(
    'http://www.youtube.com',
    headers={'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

print(python_user_agent_request.request.headers['user-agent'])
>> python-requests/2.7.0 CPython/3.4.2 Windows/7

print(chrome_user_agent_request.request.headers['user-agent'])
>> Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

# .text holds the HTML page source
print(python_user_agent_request.text == chrome_user_agent_request.text)
>> False
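As for the YouTube API mentioned above: reading your actual subscriptions feed requires OAuth, but a sketch of pulling a channel's most recent video URLs with just an API key (using the google-api-python-client package; the API key and channel ID below are placeholders) could look like this:

from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"                 # placeholder - create one in the Google Cloud console
CHANNEL_ID = "UC_x5XG1OV2P6uZZ5FSM9Ttw"  # placeholder channel ID

youtube = build("youtube", "v3", developerKey=API_KEY)

# Search the channel's uploads, newest first.
response = youtube.search().list(
    part="snippet",
    channelId=CHANNEL_ID,
    order="date",
    type="video",
    maxResults=5,
).execute()

for item in response["items"]:
    print("https://www.youtube.com/watch?v=" + item["id"]["videoId"])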
I'm trying to use a Python script to log in to my bank account so I can scrape the transactions I've made. I've read a lot about this and it doesn't seem complicated, but I can't manage to log in. I think the issue is that when I POST my form, I'm missing a token which is generated by ?? and I don't know how to retrieve it.
Here is my sample code:
import requests

s = requests.Session()
s.headers.update({
    'User-agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36",
})

url_identification = 'https://accweb.mouv.desjardins.com/identifiantunique/identification'
url_identification_process = 'https://accweb.mouv.desjardins.com/identifiantunique/identification/identificationProcess'

login_data = {
    'fnMemoriserUtilisateurActive': '0',
    'InfoPosteRaa': {},
    'OtherIdRaa': {},
    'nbrCartesMemorisees': 1,
    'codeUtilisateur': xxxxxxxxxx,
    'description': {},
    '_tk': {},
    'infoPosteClient': {}
}

r = s.get(url_identification)
r = s.post(url_identification_process, data=login_data)
print(r.text)
My POST redirects me back to the url_identification page since I don't have the proper token. Any idea how I could retrieve it? I don't know how it is generated, and as this is a bank website, I'm afraid it is sadly not possible :(
To answer your direct question first:
The value of _tk can be found in the response to a GET of url_identification, e.g.
<input type="hidden" name="_tk" value="86167503-a1e1-4fdf-a661-47348671df9d" />
The value of infoPosteClient is calculated by a JavaScript function, add_deviceprint. You may need to simulate that behaviour, or simply give it a constant value.
Other parameters of the POST request to url_identification_process, including InfoPosteRaa, OtherIdRaa, and description, can be empty (at least in my test run). A sketch of retrieving the token follows below.
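A minimal sketch of pulling _tk out of that GET response with BeautifulSoup, reusing the session, URL, and login_data from your code (everything beyond the hidden-input markup shown above is an assumption):

from bs4 import BeautifulSoup

r = s.get(url_identification)
page = BeautifulSoup(r.text, "html.parser")

# Grab the hidden input that carries the token and reuse its value in the POST.
tk_input = page.find("input", {"name": "_tk"})
if tk_input is not None:
    login_data["_tk"] = tk_input["value"]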
I don't have an account with that bank to test with, so I'm afraid I can't verify the subsequent requests.
A piece of advice: I do think Python is capable of performing this task, and in fact I personally do many similar things (simulating XHR and other complex flows in Python). But a JavaScript-based solution (e.g. PhantomJS) may be more suitable for your specific case.