Page content not in response to GET request - Python

I'm trying to access my transaction history from my online banking page using Python and requests. I have no trouble logging in with requests and getting my account overview page content, but the bank account transaction data is not in the response text. Obviously, it shows up in my browser when I access the same page.
When I view the raw HTML through my browser, the transaction data is present; however, it is not present in the response content I receive from a GET request in Python.
I'm thinking this has something to do with the following:
When the page is accessed through a browser, the transaction data is temporarily not visible because it is being loaded by some unknown background process. I think the same process happens when I access the site via Python, but the response only contains the content present in the initial state of the page; this state does not include the transaction data because the data is still loading.
One thing that supports this theory is that the response text received through Python and the response text in the browser (when viewed in developer tools) are identical up until this line in the HTML:
<div id="accountRefreshDiv" style="display:none"><img blah blah>Updating...</div>
"Updating" also appears in the browser, along with a little spiny wheel, when the page is first accessed.
So my question is: what type of background process could be going on, and how do I go about fetching the data that it is fetching (probably with JavaScript), but with Python?
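Here is a sketch of what I imagine the approach looks like (the endpoint URL below is a placeholder I made up; I assume the real one can be found in the browser's network panel while "Updating..." is shown):

import requests

session = requests.Session()
# ... log in exactly as I already do ...

# placeholder: the real endpoint should show up in the network panel
xhr_url = "https://www.mybank.example/accounts/transactions.json"

resp = session.get(xhr_url)   # same session, so the login cookies are sent
print(resp.json())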

Related

How to identify the original HTTP request from client?

"To display a Web page, the browser sends an original request to fetch
the HTML document that represents the page. It then parses this file,
making additional requests corresponding to execution scripts, layout
information (CSS) to display, and sub-resources contained within the
page (usually images and videos)."
The previous quote is from MDN Web Docs, "An overview of HTTP".
My question is: I want to identify the original request from the client and temporarily store that request and all subrequests made to the server, but when the client makes another original request I want to replace the temporarily stored data with the new requests.
For example, say I have an HTML page that, when parsed by the client, makes additional requests to some resources on the server. When the user reloads the page, he is just making another original request, so the temporarily stored request data should be replaced by the new original request and its subrequests. The same happens when the client requests another HTML page.
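A sketch of one way to express that on the server side (entirely my own assumption about the stack: a Flask app and browsers that send the Sec-Fetch-Mode header, which is "navigate" for top-level navigations and other values for subresource requests):

from flask import Flask, request

app = Flask(__name__)
current_batch = []   # the most recent original request and its subrequests

@app.before_request
def track_requests():
    global current_batch
    is_original = request.headers.get("Sec-Fetch-Mode") == "navigate"
    if is_original:
        current_batch = []            # new original request: drop the old data
    current_batch.append({"path": request.path, "original": is_original})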

Intercepting AJAX calls while web-scraping

I am trying to scrape an app store, particularly the reviews for every app. The issue is that the reviews are dynamic and the client side makes an AJAX POST call to get that data. The payload for this call is also generated dynamically for every page. I was wondering if there is any way to intercept this request through code to get the payload, with which I could make the call using requests or any other library. I am able to make this call through Postman using the parameters I get from the Network panel of the browser's developer tools.
I could use Selenium to scrape the fully loaded page, but it waits for the entire page to load, which is highly inefficient since I don't need to wait for the whole page.
import requests

payload = "<This is dynamically created for every page and is constant for that given page>"
headers = {"Content-Type": "application/x-www-form-urlencoded"}
url = 'https://appexchange.salesforce.com/appxListingDetail'
r = requests.post(url=url, data=payload, headers=headers)
I was wondering if it's possible to get this payload through the scraper, by intercepting all the AJAX calls made when it tries to scrape the base web page.
If you want to scrape a page that contains scripts and dynamic content, I would recommend using Puppeteer.
It is a scriptable headless browser (it can also be used in non-headless mode); it loads the page exactly like a browser, with all its dynamic content and scripts. So you can simply wait for the page to finish loading and then read the rendered content.
You can also use Selenium.
Also check this out: a list of almost all headless browsers
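If the goal is specifically to capture the AJAX payload rather than the rendered page, one option (my own suggestion, not something the answer above mentions) is selenium-wire, which drives a real browser but records every request it makes:

from seleniumwire import webdriver   # pip install selenium-wire

driver = webdriver.Chrome()
driver.get("https://appexchange.salesforce.com/appxListingDetail")

# block until a request whose URL matches the pattern is captured;
# the pattern is a placeholder - the real AJAX URL shows up in the Network panel
req = driver.wait_for_request("appxListingDetail", timeout=30)
print(req.method, req.url)
print(req.body)                       # the dynamically generated payload
print(req.response.status_code)

driver.quit()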
UPDATE
I know it's been some time, but I will post the answer to your comment here, in case you still need it or just for the sake of completeness:
In Puppeteer you can asynchronously wait for a specific DOM node to appear on the page, or for some other event to occur. So you can continue with your work after that and not care about loading the rest of the page (it will still load in the background, but you don't need to wait for it).
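A rough Python equivalent of that idea, using pyppeteer (a community port of Puppeteer; the URL and the "#reviews" selector are placeholders):

import asyncio
from pyppeteer import launch

async def fetch_rendered(url):
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    await page.waitForSelector("#reviews")   # continue as soon as this node exists
    html = await page.content()              # rendered HTML at that moment
    await browser.close()
    return html

html = asyncio.run(fetch_rendered("https://example.com"))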

Scraping site that uses AJAX

I've read some relevant posts here but couldn't figure an answer.
I'm trying to crawl a web page with reviews. When the site is visited, there are only 10 reviews at first, and a user has to press "Show more" to get 10 more reviews (which also adds #add10 to the end of the site's address) every time he scrolls down to the end of the review list. Actually, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that I get only the first 10 reviews using site_url#add1000 in my spider, just as with site_url, so this approach doesn't work.
I also can't find a way to make an appropriate Request imitating the original one from the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y' and I tried all of this:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers}, cookies={all_cookies})
But every time I'm getting a 404 Error. Can anyone explain what I'm doing wrong?
What you need here is a headless browser, since the requests module cannot handle AJAX well.
One such headless browser option is Selenium.
For example:
driver.find_element_by_id("show more").click() # This is just an example case
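A slightly fuller sketch of that idea (Selenium 4 syntax; the button locator and review selector are placeholders for the real ones):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get("https://example.com/reviews")

wait = WebDriverWait(driver, 10)
while True:
    try:
        button = wait.until(EC.element_to_be_clickable((By.ID, "show-more")))
        button.click()                    # loads the next batch of reviews via AJAX
    except TimeoutException:
        break                             # no button left: all reviews are loaded

reviews = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".review")]
driver.quit()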
Normally, when you scroll down the page, AJAX sends a request to the server, and the server responds with a JSON/XML file that your browser uses to refresh the page.
You need to figure out the URL that serves this JSON/XML file. Normally, you can open your Firefox browser, go to Tools / Web Developer / Web Console, monitor the network activity, and easily catch this JSON/XML request.
Once you find that URL, you can parse the reviews from it directly (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to use some different clients and IPs. Be nice to the server and it won't block you.
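Once the endpoint is identified, something along these lines is usually enough (the URL, parameters, and JSON structure here are placeholders; the real ones come from the network panel):

import time
import requests

url = "https://example.com/ajaxlst"                  # endpoint seen in the network panel
headers = {"User-Agent": "Mozilla/5.0"}

for page in range(1, 5):
    resp = requests.get(url, params={"par1": "x", "page": page}, headers=headers)
    resp.raise_for_status()
    for review in resp.json().get("reviews", []):    # structure depends on the site
        print(review)
    time.sleep(1)                                     # be nice to the server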

Downloading URL To file... Not returning JSON data but Login HTML instead

I am writing a web scraping application. When I enter the URL directly into a browser, it displays the JSON data I want.
However, if I use Python's requests library, or URLDownloadToFile in C++, it simply downloads the HTML for the login page.
The site I am trying to scrape from (DraftKings.com) requires a login. The other sites I scrape from don't.
I am 100% sure this is related, since if I paste the url when I am logged out, I get the login page, rather than the JSON data. Once I log in, if I paste the URL again, I get the JSON data again.
The thing is that even if I remain logged in and then use the Python script or C++ app to download the JSON data, it still downloads the login HTML, as mentioned.
Anyone know how I can fix this issue?
Please don't ask us to help with an activity that violates the terms of service of the site you are trying to (ab-)use:
Using automated means (including but not limited to harvesting bots, robots, parser, spiders or screen scrapers) to obtain, collect or access any information on the Website or of any User for any purpose.
Even if that kind of usage were allowed, the answer would be boring:
You'd need to implement the login functionality in your scraper.
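For completeness, and only where a site's terms actually permit automated access, that usually amounts to something like this (the login URL and field names are placeholders):

import requests

session = requests.Session()
session.post(
    "https://example.com/login",                     # the site's login endpoint
    data={"username": "me", "password": "secret"},   # field names vary per site
)
# the session now carries the authentication cookies, so the JSON URL
# returns data instead of the login page
data = session.get("https://example.com/some/data.json").json()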

How can I scrape information from HowLongToBeat.com? It doesn't use a variable in the URL

I'm trying to scrape information from How Long to Beat. How can I make a request for a search without having to put the search term in the URL?
EDIT for clarity:
The problem I face is that the site doesn't use something like http://www.howlongtobeat.com/search.php?s=search-term, so I cannot do something like:
import requests

url = 'http://www.howlongtobeat.com/search.php?s='
search_term = raw_input("Search: ")  # Python 2; use input() on Python 3
r = requests.get(url + search_term)
In other words, when you type the search term into the search dialog, the site doesn't refresh or show a change in the URL, so I can't find a way to search from outside the site.
I'm sorry if I made grammar mistakes, English is not my first language.
This is because the page is driven by AJAX requests - it updates automatically without redirecting you to a new URL.
If you open the developer tools in your browser (F12) and navigate to the Network panel, you will see that there are indeed requests sent to the server. I typed "test2" and got the following:
As you can see, the request is sent to a URL that looks like this: http://www.howlongtobeat.com/search_main.php?t=games&page=1&sorthead=popular&sortd=Normal%20Order&plat=&detail=0.
I typed "test2", but it's nowhere to be seen in that URL.
That's because it was sent using a POST request, i.e. the parameters were embedded in the HTTP request body, not the URL. When I navigated to the "Params" tab in the developer tools, I could indeed see my input:
queryString: "test2"
So in order to use this search form, you should send a POST request to that URL with the variable "queryString" set to whatever value you need.
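Putting that together (a minimal sketch; the site may expect additional form fields or headers, which the Params tab would show):

import requests

url = ("http://www.howlongtobeat.com/search_main.php"
       "?t=games&page=1&sorthead=popular&sortd=Normal%20Order&plat=&detail=0")
data = {"queryString": "test2"}
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.post(url, data=data, headers=headers)
print(r.status_code)
print(r.text[:500])   # HTML fragment containing the search results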
I strongly encourage asking the site owners about an API, though. Using publicly available form engines that are designed for end users in an automated fashion is considered unethical.
