I'm following the guide here:
Python3 Urllib Tutorial
Everything works fine for those first few examples:
import urllib.request
html = urllib.request.urlopen('https://arstechnica.com').read()
print(html)
and
import urllib.request
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"
req = urllib.request.Request('https://arstechnica.com', headers = headers)
html = urllib.request.urlopen(req).read()
print(html)
But if I replace "arstechnica" with "digikey", the urllib request always times out, even though the website is easily accessible through a browser. What's going on?
Most websites will try to defend themselves against unwanted bots. If they detect suspicious traffic, they may decide to stop responding without properly closing the connection (leaving you hanging). Some sites are more sophisticated at detecting bots than others.
Firefox 48.0 was released back in 2016, so it will be pretty obvious to Digikey that you are probably spoofing the header information. There are also additional headers that browsers typically send and your script doesn't.
In Firefox, if you open the Developer Tools and go to the Network Monitor tab, you can inspect a request to see what headers it sends, then copy these to better mimic the behaviour of a typical browser.
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Upgrade-Insecure-Requests": "1"
}

req = urllib.request.Request('https://www.digikey.com', headers=headers)
html = urllib.request.urlopen(req).read()
print(html)
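Independently of the headers, consider passing urlopen's optional timeout argument (in seconds) so a stonewalled request raises an exception quickly instead of leaving you hanging. A minimal sketch reusing the headers above:
import urllib.request

req = urllib.request.Request('https://www.digikey.com', headers=headers)
# timeout makes urlopen raise an error instead of waiting forever on a silent server
html = urllib.request.urlopen(req, timeout=10).read()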
Related
I am trying to get https://panel.op-net.com/login with the Python Requests library, but I get a server error instantly. Here is the code:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
response = requests.get('https://panel.op-net.com/login', headers=headers)
print(response) # <Response [503]>
It seems like the website is protected by Cloudflare.
I didn't find an easy way of bypassing that using the requests library. You could use selenium-profiles (undetected Selenium), or at least use a JavaScript fetch inside Selenium to initiate the request.
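For what it's worth, a minimal sketch of the plain-Selenium route, assuming selenium and a matching geckodriver are installed (a starting point, not a guaranteed bypass):
from selenium import webdriver

driver = webdriver.Firefox()
try:
    # a real browser can run the JavaScript challenge that plain requests cannot
    driver.get('https://panel.op-net.com/login')
    html = driver.page_source
    print(html[:500])
finally:
    driver.quit()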
I am trying to get data from the following website. https://www1.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol=SBIN&segmentLink=3&symbolCount=2&series=EQ&dateRange=+&fromDate=01-01-2020&toDate=31-12-2020&dataType=PRICEVOLUMEDELIVERABLE
I tried the following:
Get the whole url in requests:
response = requests.get('https://www1.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol=SBIN&segmentLink=3&symbolCount=2&series=EQ&dateRange=+&fromDate=01-01-2020&toDate=31-12-2020&dataType=PRICEVOLUMEDELIVERABLE')
Get the base webpage and add the params:
response = requests.get('https://www1.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp', params = {'symbol':'SBIN','segmentLink':'3','symbolCount':'2','series':'EQ','dateRange':' ','fromDate':'01-01-2020','toDate':'31-12-2020','dataType':'PRICEVOLUMEDELIVERABLE'})
Use urllib:
f = urllib.request.urlopen('https://www1.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol=SBIN&segmentLink=3&symbolCount=2&series=EQ&dateRange=+&fromDate=01-01-2020&toDate=31-12-2020&dataType=PRICEVOLUMEDELIVERABLE')
None of the above methods work; they just load indefinitely.
Thanks in advance.
Don't forget to add a User-Agent to the request header, like this:
header = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0',
    'X-Requested-With': 'XMLHttpRequest'
}
response = requests.get('your_url', headers=header)
print(response)
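If the headers alone don't help, some sites also expect cookies that get set on a first page view. A hedged sketch using requests.Session, assuming the NSE homepage sets the cookies the data page checks:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0',
    'X-Requested-With': 'XMLHttpRequest'
}

with requests.Session() as session:
    session.headers.update(headers)
    # visit the base site first so the session picks up any cookies it sets
    session.get('https://www1.nseindia.com', timeout=10)
    response = session.get(
        'https://www1.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp',
        params={'symbol': 'SBIN', 'segmentLink': '3', 'symbolCount': '2', 'series': 'EQ',
                'dateRange': ' ', 'fromDate': '01-01-2020', 'toDate': '31-12-2020',
                'dataType': 'PRICEVOLUMEDELIVERABLE'},
        timeout=10)
    print(response.status_code)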
I tried running this Python script using BeautifulSoup and requests modules :
from bs4 import BeautifulSoup as bs
import requests
url = 'https://udemyfreecourses.org/'
headers = {'UserAgent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
soup = bs(requests.get(url, headers= headers).text, 'lxml')
But when I run this line:
print(soup.get_text())
it doesn't scrape the text data; instead, it returns this output:
Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.
I even used headers when requesting the webpage so that it looks like a normal browser, but I'm still getting this message that prevents me from accessing the real webpage.
Note: The webpage works perfectly in a browser, but it doesn't show much info when I try to scrape it.
Is there any way, other than the headers approach I used, to send a valid request to the website and get past this Mod_Security protection?
Any help would be much appreciated, thanks.
EDIT: The dash in "User-Agent" is essential.
Following this answer: https://stackoverflow.com/a/61968635/8106583
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
Your User-Agent is the problem. This User-Agent works for me.
Also: your IP might be blocked by now :D
I am trying to get a value from a website using Beautiful Soup, but it keeps returning None. This is what I have for my code so far:
import bs4
import requests

def getmarketchange():
    source = requests.get("https://www.coinbase.com/price").text
    soup = bs4.BeautifulSoup(source, "lxml")
    marketchange = soup.get("MarketHealth__Percent-sc-1a64a42-2.bEMszd")
    print(marketchange)

getmarketchange()
Attached is a screenshot of the HTML I was trying to grab.
Thank you for your help in advance!
Have a look at the HTML source returned from your get() request - it's a CAPTCHA challenge. You won't be able to get to the Coinbase pricing without passing this challenge.
Excerpt:
<h2 class="cf-subheadline"><span data-translate="complete_sec_check">
Please complete the security check to access</span> www.coinbase.com
</h2>
<div class="cf-section cf-highlight cf-captcha-container">
Coinbase is recognizing that the HTTP request isn't coming from a standard browser-based user, and it is challenging the requester. BeautifulSoup doesn't have a way to pass this check on its own.
Passing in User-Agent headers (to mimic a browser request) also doesn't resolve this issue.
For example:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
source = requests.get("https://www.coinbase.com/price", headers=headers).text
You might find better luck with Selenium, although I'm not sure about that.
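If you try that route, a rough sketch (assuming selenium and a matching chromedriver are installed, and untested against Coinbase's current challenge):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get('https://www.coinbase.com/price')
    # wait until the market-health element renders, then read its text
    element = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '[class^="MarketHealth__Percent"]')))
    print(element.text)
finally:
    driver.quit()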
To prevent a captcha page, try specifying the User-Agent header:
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/price'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print(soup.select_one('[class^="MarketHealth__Percent"]').text)
Prints:
0.65%
I am a beginner in Python. I just wrote a very simple web crawler, and it causes high memory usage when I run it. I'm not sure what's wrong in my code; I spent quite some time on it but couldn't resolve it.
I intend to use it to capture some job info from the following link: http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=070200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=06%2C07%2C08%2C09%2C10&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9
The crawler extracts the link of each job and generates the id of each job from the link. Then it reads the job title from the link through XPath and prints all the info out at the end. Even though there are only 50 links, it makes my computer nearly unresponsive every time before it finishes printing all the info. Below is my code.
I just added the headers, which are needed to parse the link of each job. My environment is Ubuntu 16.04, Python 3.5, PyCharm.
import requests
from lxml import etree
import re

headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "Accept-Encoding": "gzip, deflate",
           "Accept-Language": "en-US,en;q=0.5",
           "Connection": "keep-alive",
           "Host": "jobs.51job.com",
           "Upgrade-Insecure-Requests": "1",
           "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}

def generate_info(url):
    # fetch a single job page, derive the job id from the URL, and read the title via XPath
    html = requests.get(url, headers=headers)
    html.encoding = 'GBK'
    select = etree.HTML(html.text.encode('utf-8'))
    job_id = re.sub('[^0-9]', '', url)
    job_title = select.xpath('/html/body//h1/text()')
    print(job_id, job_title)

sum_page = 'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=070200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=06%2C07%2C08%2C09%2C10&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9'
sum_html = requests.get(sum_page)
sum_select = etree.HTML(sum_html.text.encode('utf-8'))
# collect the link of every job on the result page
urls = sum_select.xpath('//*[@id="resultList"]/div/p/span/a/@href')
for url in urls:
    generate_info(url)