I am building a web scraper and need to get the HTML page source as it actually appears on the page. However, I only get a limited HTML source, one that does not include the needed info. I think I am either seeing the page before the JavaScript has run, or else I'm not getting the full info because I don't have the right authentication. My result is the same as "view source" in Chrome, when what I want is what Chrome's "inspect element" shows. My test case is cimber.dk after entering flight information and searching.
I am coding in Python and tried the urllib2 library. Then I heard that Selenium was good for this, so I tried that too. However, that also gets me the same limited page source.
This is what I tried with urllib2 after using Firebug to see the parameters. (I deleted all my cookies after opening cimber.dk, so I was starting with a clean slate.)
import urllib
import urllib2

url = 'https://www.cimber.dk/booking/'
values = {'ARRANGE_BY' : 'D', ...}  # one entry for each form value
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
# Using HTTPRedirectHandler instead of HTTPCookieProcessor gives the same result.
urllib2.install_opener(opener)
request = urllib2.Request(url)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0')]
request.add_header(....)  # one call for each header, including the cookie header
p = urllib.urlencode(values)
data = opener.open(request, p).read()
# data is now the limited source, like Chrome View Source
#I tried to add the following in some vain attempt to do a redirect.
#The result is always "HTTP Error 400: Bad request"
f = opener.open('https://wftc2.e-travel.com/plnext/cimber/Override.action')
data = f.read()
f.close()
Most libraries like this do not execute JavaScript.
If you need JavaScript, you will have to either automate an existing browser or browser engine, or use a much heavier library that is essentially a full web crawler.
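For example, here is a minimal sketch of the browser-automation route with Selenium. The wait condition and CSS selector are assumptions; the question mentions that Selenium returned the same limited source, and one common cause is reading page_source before the JavaScript has finished, which the explicit wait below is meant to avoid.
# Sketch: render a JavaScript-heavy page with Selenium and read the live DOM.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get('https://www.cimber.dk/booking/')
    # Wait until an element produced by the JavaScript is present (hypothetical selector).
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#search-results'))
    )
    html = driver.page_source  # the rendered DOM, i.e. what "inspect element" shows
finally:
    driver.quit()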
Related
I'm trying to automate logging in to Costco.com to check some member-only prices.
I used the dev tools' Network tab to identify the request that handles the logon, from which I inferred the POST URL and the parameters.
The code looks like:
import requests

s = requests.session()
payload = {'logonId': 'email#email.com',
           'logonPassword': 'mypassword'}
# get this from Googling "my user agent"
user_agent = {"User-Agent": "myusergent"}
url = 'https://www.costco.com/Logon'
response = s.post(url, headers=user_agent, data=payload)
print(response.status_code)
When I run this, it just runs and runs and never returns anything. I waited 5 minutes and it was still running.
What am I doing wrong?
Maybe you should try making a GET request first to pick up some cookies before making the POST request. If the POST request still doesn't work, add a timeout so the script stops and you know it isn't working.
r = requests.get(url, verify=False, timeout=10)
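A sketch of that suggestion, reusing the URLs and form fields from the question (everything else, including the user agent, is a placeholder):
# Sketch: GET first so the session collects cookies, then POST with a timeout.
import requests

with requests.Session() as s:
    s.headers.update({'User-Agent': 'myusergent'})  # placeholder user agent
    s.get('https://www.costco.com/LogonForm', timeout=10)  # pick up session cookies
    payload = {'logonId': 'email#email.com', 'logonPassword': 'mypassword'}
    r = s.post('https://www.costco.com/Logon', data=payload, timeout=10)
    print(r.status_code)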
This one is tough. Usually, in order to set the proper cookies, a GET request to the URL is required first. We can go directly to https://www.costco.com/LogonForm as long as we change the user agent from the default python-requests one. This is accomplished as follows:
import requests

agent = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/85.0.4183.102 Safari/537.36"
)

with requests.Session() as s:
    headers = {'user-agent': agent}
    s.headers.update(headers)
    logon = s.get('https://www.costco.com/LogonForm')
    # Save the cookies in a variable; explanation below
    cks = s.cookies
The logon GET request is successful, i.e. status code 200. Taking a look at cks:
print(sorted([c.name for c in cks]))
['C_LOC',
'CriteoSessionUserId',
'JSESSIONID',
'WC_ACTIVEPOINTER',
'WC_AUTHENTICATION_-1002',
'WC_GENERIC_ACTIVITYDATA',
'WC_PERSISTENT',
'WC_SESSION_ESTABLISHED',
'WC_USERACTIVITY_-1002',
'_abck',
'ak_bmsc',
'akaas_AS01',
'bm_sz',
'client-zip-short']
Then, using the Network tab in Google Chrome's inspector and clicking login, the following form data for the login POST shows up. (Place this below cks.)
data = {'logonId': username,
        'logonPassword': password,
        'reLogonURL': 'LogonForm',
        'isPharmacy': 'false',
        'fromCheckout': '',
        'authToken': '-1002,5M9R2fZEDWOZ1d8MBwy40LOFIV0=',
        'URL': 'Lw=='}

login = s.post('https://www.costco.com/Logon', data=data, allow_redirects=True)
However, simply trying this makes the request just sit there and redirect infinitely.
Using Burp Suite, I stepped into the POST and captured the request as the browser sends it. That POST carries many more cookies than were obtained in the initial GET request.
Quite a few more, in fact:
# cookies is equal to the curl from burp, then converted curl to python req
sorted(cookies.keys())
['$JSESSIONID',
'AKA_A2',
'AMCVS_97B21CFE5329614E0A490D45%40AdobeOrg',
'AMCV_97B21CFE5329614E0A490D45%40AdobeOrg',
'C_LOC',
'CriteoSessionUserId',
'OptanonConsent',
'RT',
'WAREHOUSEDELIVERY_WHS',
'WC_ACTIVEPOINTER',
'WC_AUTHENTICATION_-1002',
'WC_GENERIC_ACTIVITYDATA',
'WC_PERSISTENT',
'WC_SESSION_ESTABLISHED',
'WC_USERACTIVITY_-1002',
'WRIgnore',
'WRUIDCD20200731',
'__CT_Data',
'_abck',
'_cs_c',
'_cs_cvars',
'_cs_id',
'_cs_s',
'_fbp',
'ajs_anonymous_id_2',
'ak_bmsc',
'akaas_AS01',
'at_check',
'bm_sz',
'client-zip-short',
'invCheckPostalCode',
'invCheckStateCode',
'mbox',
'rememberedLogonId',
's_cc',
's_sq',
'sto__count',
'sto__session']
Most of these look to be static; however, because there are so many, it's hard to tell which is which and what each is supposed to be. This is where I get stuck myself, and I am genuinely curious how this would be accomplished. In some of the cookie data I can also see some sort of IBM Commerce information, so I am linking Prevent Encryption (Krypto) Of Url Paramaters in IBM Commerce Server 6, as it's the only other SO question that pertains even remotely to this.
Essentially, though, the steps would be to determine the proper cookies to pass for this POST (and then the proper cookies and info for the redirect!). I believe some of these are being set by JavaScript or something, since they are not in the GET response from the site. Sorry I can't be more help here.
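If you want to experiment with that step, one option is to copy the cookies captured from the real browser session (via Burp or the dev tools) into the requests session before the POST. The values below are placeholders, not real cookie data:
# Sketch: inject cookies captured from the browser into the requests session.
browser_cookies = {
    'JSESSIONID': '<copied from the browser>',
    '_abck': '<copied from the browser>',
    'ak_bmsc': '<copied from the browser>',
}
for name, value in browser_cookies.items():
    s.cookies.set(name, value, domain='.costco.com')

login = s.post('https://www.costco.com/Logon', data=data, allow_redirects=True)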
If you absolutely need to log in, try using Selenium, as it simulates a browser. Otherwise, if you just want to check whether an item is in stock, this guide uses requests and doesn't require being logged in: https://aryaboudaie.com/python/technical/educational/2020/07/05/using-python-to-buy-a-gift.html
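A minimal Selenium login sketch, assuming the logon form's inputs can be located by the field names shown above (the locators are guesses and would need to be checked against the live page):
# Sketch: log in with Selenium so the browser itself handles cookies and JavaScript.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.costco.com/LogonForm')
driver.find_element(By.NAME, 'logonId').send_keys('email#email.com')
driver.find_element(By.NAME, 'logonPassword').send_keys('mypassword')
driver.find_element(By.NAME, 'logonPassword').submit()  # submit the login form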
The code below is supposed to retrieve links from Google's search results page.
Without using a header, linkedElems has 0 elements, but when I use a header, linkedElems has 44 elements, meaning that with the header select('.r a') found 44 elements in the page. Does the HTML code of a page change when a header is used?
I inspected the page's HTML using Firefox's developer tools to find the links and select them, so select('.r a') isn't supposed to return 0.
Code:
import requests,bs4
print("Search something in google:")
searchKeyword = input()
print("Googling.... " + searchKeyword)
head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'}
responseObj = requests.get("https://www.google.com/search?q="+searchKeyword, headers = head)
responseObj.raise_for_status()
print("Status code: " + str(responseObj.status_code))
soupObj = bs4.BeautifulSoup(responseObj.text, features='html.parser')
linkedElems = soupObj.select('.r a')
print(len(linkedElems))
Result (With header):
Search something in google:
test
Googling.... test
Status code: 200
44
Process finished with exit code 0
Result (Without header):
Search something in google:
test
Googling.... test
Status code: 200
0
Process finished with exit code 0
The User-Agent header is specifically designed for the server to know the browser/OS/hardware of the client that issued the request so it can build the proper response to that specific client:
The User-Agent request header is a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent.
If Google's server was designed to return a specific HTML for specific clients (spoiler alert, it was), then the answer is "yes, the HTML will be different for different values of User-Agent", as you discovered yourself.
A missing User-Agent in requests.get causes the requests module to substitute its own default User-Agent, which in turn causes Google to return a much simpler page to you, and that page probably has a different structure. Without a User-Agent I get a response of length 30328; with one, 219287.
You can see the difference by doing something like
with open("temp.html", "w") as f:
f.write(responseObj.text)
and then opening the temp.html file in a browser.
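A quick way to see the size difference described above (the exact lengths will differ from run to run):
# Sketch: compare the response size with and without a browser User-Agent.
import requests

url = "https://www.google.com/search?q=test"
head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'}

plain = requests.get(url)                  # default python-requests User-Agent
spoofed = requests.get(url, headers=head)  # browser-like User-Agent
print(len(plain.text), len(spoofed.text))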
The problem, as the other answers mention regarding the user-agent, is that the default requests user-agent is python-requests, so Google understands that the request was made by a bot/script, and you receive different HTML with some sort of error page that doesn't contain the .r a CSS selector. Check what your user-agent is.
You can forget about such a problem by using Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to maintain the code, figure out how to bypass blocks from Google and other search engines, or figure out how to extract something from Javascript since it's already done for the end user.
Example code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",              # search engine
    "q": "tesla",                    # query
    "hl": "en",                      # language
    "gl": "us",                      # country to search from
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# scrapes all organic results (in this case, from the first page)
for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
Disclaimer, I work for SerpApi.
I tried to get the HTML code from a site named dcinside in Korea. I am using requests but cannot get the HTML code.
This is my code:
import requests
url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = requests.get(url)
print(req)
print(req.content)
but the result was
Why can't I get the HTML even when using requests?
Most likely they are detecting that you are trying to crawl their data programmatically and are not returning any content in the response. Try pretending to be a browser by passing User-Agent headers.
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail#domain.com'
}

response = requests.get(url, headers=headers)
# use an authentic Mozilla or Chrome user-agent string if this doesn't work
Take a look at this:
Python Web Crawlers and "getting" html source code
Like the guy said in the aforementioned post, you should use urllib2, which will allow you to easily obtain web resources.
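A minimal urllib2 sketch along those lines (Python 2; the User-Agent string is a placeholder, and you may need a full browser string):
# Sketch (Python 2): fetch the page with urllib2 and a browser-like User-Agent.
import urllib2

req = urllib2.Request("http://gall.dcinside.com/board/lists/?id=bitcoins&page=1",
                      headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
print(len(html))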
Here is the situation: I want to access the content of a URL in Python via urllib2.
import urllib2

url = 'http://www.iwanttoknowwhatsinside.com'  # placeholder URL
hdr = {
    'User-Agent': 'OpenAnything/1.0 +http://somepage.org/',
    'Connection': 'keep-alive'
}
request = urllib2.Request(url, headers=hdr)
opener = urllib2.build_opener()
HTML = opener.open(request).read()
This code normally works fine. But if I access a certain page via a web browser, it says something like "Checking your browser before accessing ... Your browser will be redirected shortly" and then the page loads. The URL never changes. ADD: Then I can freely click around on the page, or open a second tab with the same URL. I only have to wait before the initial access.
If I try to access this page via Python, I get an urllib2.HTTPError (Service Temporarily Unavailable) instantly, so I figured urllib2 doesn't wait out that check. Is there a way to force some wait time before throwing exceptions or retrieving the content? Or am I looking at this the wrong way?
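For the wait-time part of the question: there is no built-in retry in urllib2, but a simple retry loop with a delay is one way to approximate waiting. This is only a sketch, and it may not be enough if the "checking your browser" step requires JavaScript to run:
# Sketch: retry with a delay instead of failing on the first HTTPError.
import time
import urllib2

def fetch_with_retries(request, attempts=5, delay=10):
    for i in range(attempts):
        try:
            return urllib2.urlopen(request).read()
        except urllib2.HTTPError as e:
            if e.code != 503 or i == attempts - 1:
                raise
            time.sleep(delay)  # wait before trying again

HTML = fetch_with_retries(request)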
Python noob here. I'm trying to extract a link, specifically the link to 'all reviews' on an Amazon product page. I get an unexpected result.
import urllib2
req = urllib2.Request('http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/dp/B000A0ADT8/ref=sr_1_1?s=hpc&ie=UTF8&qid=1342922857&sr=1-1&keywords=truth')
response = urllib2.urlopen(req)
page = response.read()
start = page.find("all reviews")
link_start = page.find("href=", start) + 6
link_end = page.find('"', link_start)
print page[link_start:link_end]
The program should output:
http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/product-reviews/B000A0ADT8/ref=dp_top_cm_cr_acr_txt?ie=UTF8&showViewpoints=1
Instead, it outputs:
http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/product-reviews/B000A0ADT8
I get the same result you do, but that appears to be simply because the page Amazon serves to your Python script is different from what it serves to my browser. I wrote the downloaded page to disk and loaded it in a text editor, and sure enough, the link ends with ADT8" without all the /ref=dp_top stuff.
In order to help convince Amazon to serve you the same page as a browser, your script is probably going to have to act a lot more like a browser (by accepting and sending cookies, for example). The mechanize module can help with this.
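For example, a minimal mechanize sketch (mechanize keeps cookies between requests; the User-Agent string here is just a browser-like placeholder):
# Sketch: fetch the page with mechanize, which stores and resends cookies.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # don't fetch robots.txt before every request
br.addheaders = [('User-Agent',
                  'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20110506 Firefox/4.0.1')]
br.open('http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/dp/B000A0ADT8/')
page = br.response().read()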
Ah, okay. If you do the usual trick of faking a user agent, for example:
req = urllib2.Request('http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/dp/B000A0ADT8/ref=sr_1_1?s=hpc&ie=UTF8&qid=1342922857&sr=1-1&keywords=truth')
ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20110506 Firefox/4.0.1'
req.add_header('User-Agent', ua)
response = urllib2.urlopen(req)
then you should get something like
localhost-2:coding $ python plink.py
http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/product-reviews/B000A0ADT8/ref=dp_top_cm_cr_acr_txt/190-6179299-9485047?ie=UTF8&showViewpoints=1
which might be closer to what you want.
[Disclaimer: be sure to verify that Amazon's TOS permits whatever you're going to do before you do it.]