I am trying to get the HTML of a page on dcinside, a Korean site. I am using requests, but I cannot get the HTML.
This is my code:
import requests
url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = requests.get(url)
print (req)
print (req.content)
but the result was
Why can't I get the HTML even when using requests?
Most likely they are detecting that you are trying to crawl data dynamically, and not giving any content as a response. Try pretending to be a browser and passing some User-Agent headers.
headers = {
'User-Agent': 'My User Agent 1.0',
'From': 'youremail@domain.com'
}
response = requests.get(url, headers=headers)
# use authentic mozilla or chrome user-agent strings if this doesn't work
Take a look at this:
Python Web Crawlers and "getting" html source code
As the answer in that post says, you can use urllib2 (urllib.request in Python 3), which lets you easily obtain web resources.
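As a minimal sketch of the same idea in Python 3, where urllib2's functionality lives in urllib.request: the helper names `build_browser_request` and `fetch_html` and the User-Agent string below are illustrative assumptions, not values the site is known to require.

```python
import urllib.request

# Hypothetical browser-like User-Agent string; any realistic value may work.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_browser_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Send the request and decode the body (performs network I/O)."""
    with urllib.request.urlopen(build_browser_request(url), timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch_html(url)` with the URL from the question would then send the request with that header; whether the site accepts it is not guaranteed.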
I tried to get the title of a web page by scraping it with the Beautifulsoup4 Python module, and it returns the string "Not Acceptable!" as the title, but when I open the page in a browser the title is different. I also tried looping through a list of links to extract the titles of all the pages, but it returns the same string "Not Acceptable!" for every link.
Here is the Python code:
from bs4 import BeautifulSoup
import requests
URL = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
result = requests.get(URL)
doc = BeautifulSoup(result.text, 'html.parser')
tag = doc.title
print(tag.get_text())
I don't know whether the problem is with Beautifulsoup4 or with the requests library. Is it because the site has bot protection enabled and does not return the HTML for these requests?
The server expects the User-Agent header. Interestingly, it is happy with any User-Agent, even a fictitious one:
result = requests.get(URL, headers = {'User-Agent': 'My User Agent 1.0'})
An easy way to debug this kind of issue is to print (or write to a file) the response text, result.text. Some servers don't allow scraping, and some websites generate their HTML with JavaScript at runtime (e.g. YouTube). In these scenarios the response text can differ from the source HTML we see in the browser. The text below was returned by the server:
<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
Edit:
As pointed out by DYZ, this is a 406 error: the User-Agent header was missing from the request.
https://www.exai.com/blog/406-not-acceptable
The 406 Not Acceptable status code is a client-side error. It's part
of the HTTP response status codes in the 4xx category, which are
considered client error responses
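A small check along these lines can flag a block page before it is parsed. This is only a heuristic sketch: the function name `looks_blocked` is hypothetical, and the marker strings are simply taken from the Mod_Security response shown above, so other servers may word their block pages differently.

```python
def looks_blocked(status_code, body):
    """Heuristically decide whether a response is a block page, not the real HTML."""
    # Marker strings copied from the Mod_Security error page quoted above.
    markers = ("Not Acceptable!", "Mod_Security")
    return status_code == 406 or any(marker in body for marker in markers)
```

With requests, it could be called as `looks_blocked(result.status_code, result.text)` before handing the text to BeautifulSoup.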
I have been trying to solve this for the past 3 days and can't figure out why it happens.
I'm trying to access the HTML of a site that requires login first.
I tried every way I could, and all of them end with the same problem.
Here is what I tried:
response = requests.get('https://de-legalization.tlscontact.com/eg/CAI/myapp.php', headers=headers, params=params, cookies=cookies)
print(response.content)
payload = {
'_token': 'TOKEN HERE',
'email': 'EMAIL HERE',
'pwd': 'PASSWORDHERE',
'client_token': 'CLIENT_TOKEN HERE'
}
with requests.session() as s:
    r = s.post(login_url, data=payload)
    print(r.text)
I also tried using urllib, but every attempt returns this:
<script>window.location="https://de-legalization.tlscontact.com/eg/CAI/index.php";</script>
Does anyone know why this is happening?
Here is the URL of the page whose HTML I want:
https://de-legalization.tlscontact.com/eg/CAI/myapp.php
You see this particular output because it is in fact the content of the page you are downloading.
You can test it in chrome by opening the following url:
view-source:https://de-legalization.tlscontact.com/eg/CAI/myapp.php
This is happening because the JavaScript code on the page redirects you.
Since the page you are trying to access requires login, you cannot reach it just by sending an HTTP request to the internal page.
You need either to extract all the cookies from a logged-in browser session and add them to the Python script,
or to use a tool like Selenium that lets you control a browser from your Python code.
Here you can find how to extract all the cookies from the browser session:
How to copy cookies in Google Chrome?
Here you can find how to add cookies to the http request in Python:
import requests
cookies = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookies)
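To notice this situation in a script, the response body can be scanned for a JavaScript redirect like the one returned above. The sketch below is an assumption-laden illustration: `find_js_redirect` is a hypothetical helper, and the pattern only covers simple `window.location = "..."` assignments, not every way a page can redirect.

```python
import re

# Matches assignments such as window.location="..." or window.location.href = '...'
_JS_REDIRECT = re.compile(r"""window\.location(?:\.href)?\s*=\s*["']([^"']+)["']""")


def find_js_redirect(html):
    """Return the target URL of a simple JavaScript redirect, or None if absent."""
    match = _JS_REDIRECT.search(html)
    return match.group(1) if match else None
```

If the function returns the login or index page, the cookies sent with the request were probably missing or no longer valid.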
I'm trying to use Python's urllib2 to read some pages, but different URLs return the same page.
The page is an inquiry for campsite availability at a given campground on recreation.gov. Since a campground may have a lot of campsites, the last parameter in the URL tells the page which campsites to list.
For example, if startIdx=0 the page lists campsites 1~25, and if startIdx=25 the page lists campsites 26~50.
So I constructed some URLs with different startIdx values, but after reading the pages with urllib2 the returned HTML was always the same. It seems the startIdx in the URL was somehow ignored.
In addition, if I manually open those urls in browser the pages look normal, but if I use webbrowser.open to open those urls the pages look weird.
The brief sample code duplicates the problem I'm having:
import urllib2
url1 = 'http://www.recreation.gov/campsiteCalendar.do?page=calendar&contractCode=NRSO&parkId=70928&calarvdate=03/11/2016&sitepage=true&startIdx=0'
url2 = 'http://www.recreation.gov/campsiteCalendar.do?page=calendar&contractCode=NRSO&parkId=70928&calarvdate=03/11/2016&sitepage=true&startIdx=25'
hdr = {'User-Agent': 'Mozilla/5.0'}
request1 = urllib2.Request( url1, headers = hdr )
response1 = urllib2.urlopen( request1 )
html1 = response1.read()
request2 = urllib2.Request( url2, headers = hdr )
response2 = urllib2.urlopen( request2 )
html2 = response2.read()
In [1]: html1 == html2
Out[1]: True
I don't know much about how such inquiries or PHP work, so I'm curious why urllib2 behaves like this. The Python version I'm using is 2.7.
Thanks!
The web page may change at runtime, whereas you are only requesting the HTML. There is probably some JavaScript that changes the contents of the page based on the information encoded in the URL. If the content were generated server-side with PHP, it would already be present in the response, because the server changes the HTML before sending it. JavaScript changes the HTML after it has been sent.
In other words, a regular browser will change the HTML based on the URL using JavaScript. Your simple request will not do that.
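Before blaming JavaScript, it can help to confirm that the two requests really differ only in the parameter of interest. The following is a minimal Python 3 sketch using the standard library; `query_diff` is a hypothetical helper name, not part of urllib.

```python
from urllib.parse import parse_qsl, urlsplit


def query_diff(url_a, url_b):
    """Return {param: (value_a, value_b)} for query parameters that differ."""
    a = dict(parse_qsl(urlsplit(url_a).query))
    b = dict(parse_qsl(urlsplit(url_b).query))
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }
```

If the only difference is startIdx and the two bodies are still identical, the HTML sent by the server likely does not depend on that parameter, which is consistent with the explanation above.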
I ran into a problem when using Python Requests or urllib2 to open URLs: I got 404 'page not found' responses. For example, url = 'https://www.facebook.com/mojombo'. However, I can copy and paste those URLs into a browser and visit them. Why does this happen?
I need to get some content from the HTML source of those pages. Since I can't open the URLs using Requests or urllib2, I can't use BeautifulSoup to extract elements from the HTML source. Is there a way to get the source of those pages and extract content from it using Python?
Although this is a general question, I still need some working code to solve it. Thanks!
It looks like your browser is using cookies to log you in. Try opening that url in a private or incognito tab, and you'll probably not be able to access it.
However, if you are using Requests, you can pass the appropriate login information as a dictionary of values. You'll need to check the form information to see what the fields are, but Requests can handle that as well.
The normal format would be:
payload = {
'username': 'your username',
'password': 'your password'
}
p = requests.post(myurl, data=payload)
with more or fewer fields added as needed.
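A login usually only helps if the cookies it sets are sent with the following requests. Here is a minimal sketch of that flow using only the Python 3 standard library, so it runs without extra packages; the helper names and the form field names are assumptions and must be replaced with the fields of the real login form.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session_opener():
    """Return an opener that stores and resends cookies, like a browser session."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )


def build_login_request(login_url, payload):
    """Encode the form fields as a POST body, as a browser form submit would."""
    data = urllib.parse.urlencode(payload).encode("utf-8")
    # Supplying a body makes urllib send this request as a POST.
    return urllib.request.Request(login_url, data=data)
```

The opener would be used once to open the login request and then again to open the protected page, so that both requests share the same cookie jar. With Requests, `requests.Session()` plays the same role.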
I am using cookielib. Sometimes opening a URL in a browser downloads many other files, because the browser makes many additional requests. Can I replicate the same behaviour using cookielib or any other Python library?
For example: To get all the required information from page https://applicant.keybank.com/psp/hrsappl/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1
I have to make more than one GET request from my Python script. I got the URLs of all the requests the browser makes by analysing the network requests when I opened the page.
I would like to know whether there is a way to make just one request and have it fetch all the related requests by itself, as a browser does.
I am not very interested in the JS or CSS, only in the main HTML.
I tried the following code, but it couldn't download the whole page:
import cookielib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open('https://applicant.keybank.com/psp/hrsappl/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1')
html = response.read()
But when I fetch 3 other GET URLs in sequence, the third GET response gives me the required HTML. I got these URLs by examining the network tab of the browser:
'https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PT_NAV.ISCRIPT1.FieldFormula.IScript_UniHeader_Frame?c=NNTCgkqGs001AcPaisqGbYpTu%2fbGx4jx&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes')
'https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PTPPB.ISCRIPT2.FieldFormula.IScript_TemplatePageletBuilder?PTPPB_PAGELET_ID=KC_LNAV_APPLICANT&target=KCNV_KC_LNAV_APPLICANT_TMPL&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes&PortalTargetFrame=TargetContent'
'https://hronline.keybank.com/psc/hrshrm/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalCRefLabel=Careers&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&NoCrumbs=yes&PortalKeyStruct=yes'
The following is the complete code for the other fetches I am making:
response = opener.open('https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PT_NAV.ISCRIPT1.FieldFormula.IScript_UniHeader_Frame?c=NNTCgkqGs001AcPaisqGbYpTu%2fbGx4jx&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes')
response.read()
response = opener.open('https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PTPPB.ISCRIPT2.FieldFormula.IScript_TemplatePageletBuilder?PTPPB_PAGELET_ID=KC_LNAV_APPLICANT&target=KCNV_KC_LNAV_APPLICANT_TMPL&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes&PortalTargetFrame=TargetContent')
response.read()
response = opener.open('https://hronline.keybank.com/psc/hrshrm/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalCRefLabel=Careers&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&NoCrumbs=yes&PortalKeyStruct=yes')
required_html = response.read()
requests can handle cookies, as you can see here.
It's a great library, far more powerful than urllib2, and yet simpler-looking.
>>> import requests
>>> r = requests.get('https://applicant.keybank.com/psp/hrsappl/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1')
>>> r.cookies
Edit: This answer does not really address the problem; I read too fast. Sorry about that.
As suggested by @J.F.Sebastian, I'm adding a link to a Python WebKit client, Ghost.py, that could emulate a browser, as you requested.
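If a full browser emulator is more than is needed, the sequence of GET requests from the question can be kept but tidied up. This is a minimal Python 3 sketch, assuming cookielib is replaced by http.cookiejar and urllib2 by urllib.request; `fetch_in_sequence` is a hypothetical helper, and it does not discover the follow-up URLs by itself, they still have to come from the browser's network tab.

```python
import urllib.request
from http.cookiejar import CookieJar


def fetch_in_sequence(opener, urls):
    """Open each URL in order with one opener; return the body of the last one."""
    body = b""
    for url in urls:
        # Every request goes through the same opener, so cookies set by
        # earlier responses are sent along with the later requests.
        with opener.open(url) as response:
            body = response.read()
    return body


cookie_jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
```

Passing the first URL followed by the three follow-up URLs, in the order the browser requests them, would leave the body of the final response in the return value.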