How to store data scraped by BeautifulSoup on GAE? - python

My site logs in to a third-party secured website using gaemechanize and scrapes data with BeautifulSoup. After logging in successfully, if I refresh the page, sometimes the data is gone and a "500 Internal Server Error" page appears; other times the data is unexpectedly persisted and shown on another computer without logging in.
My question is: how do I store the data until the user clicks logout, so that only that one session can access it?

Use webapp2 sessions and store the data in the session. Once the user logs out, you can persist the data to the datastore permanently, or just kill the session and let its data disappear.
Check out WebApp2 sessions: http://webapp-improved.appspot.com/api/webapp2_extras/sessions.html
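A minimal sketch of that pattern, following the session setup from the webapp2 docs (the handler names and the 'scraped_rows' key are made up for illustration):

    import webapp2
    from webapp2_extras import sessions

    class BaseHandler(webapp2.RequestHandler):
        def dispatch(self):
            # Get a session store for this request.
            self.session_store = sessions.get_store(request=self.request)
            try:
                webapp2.RequestHandler.dispatch(self)
            finally:
                # Save the session (writes the session cookie).
                self.session_store.save_sessions(self.response)

        @webapp2.cached_property
        def session(self):
            # Uses the default (secure cookie) backend.
            return self.session_store.get_session()

    class ScrapeHandler(BaseHandler):
        def get(self):
            # Placeholder for data produced by your gaemechanize +
            # BeautifulSoup code.
            scraped_rows = ['row 1', 'row 2']
            self.session['scraped_rows'] = scraped_rows
            self.response.write('stored %d rows' % len(scraped_rows))

    class LogoutHandler(BaseHandler):
        def get(self):
            # Dropping the key discards the data for this session only.
            self.session.pop('scraped_rows', None)
            self.response.write('logged out')

    config = {'webapp2_extras.sessions': {'secret_key': 'replace-with-a-secret'}}
    app = webapp2.WSGIApplication([('/scrape', ScrapeHandler),
                                   ('/logout', LogoutHandler)], config=config)

Note that the default backend stores the session in a cookie, which is size-limited, so for larger scraped payloads the memcache or datastore session backends may be a better fit.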

Related

Downloading URL To file... Not returning JSON data but Login HTML instead

I am writing a web scraping application. When I enter the URL directly into a browser, it displays the JSON data I want.
However, if I use Python's requests library, or URLDownloadToFile in C++, it simply downloads the HTML for the login page.
The site I am trying to scrape (DraftKings.com) requires a login. The other sites I scrape from don't.
I am 100% sure this is related, since if I paste the URL while logged out, I get the login page rather than the JSON data. Once I log in and paste the URL again, I get the JSON data.
The thing is, even if I remain logged in in the browser and then use the Python script or C++ app to download the JSON data, it still downloads the login HTML.
Anyone know how I can fix this issue?
Please don't ask us to help with an activity that violates the terms of service of the site you are trying to (ab-)use:
Using automated means (including but not limited to harvesting bots, robots, parser, spiders or screen scrapers) to obtain, collect or access any information on the Website or of any User for any purpose.
Even if that kind of usage were allowed, the answer would be boring:
You'd need to implement the login functionality in your scraper; the session cookies in your browser are not shared with your script, so the script has to obtain its own.

Scraping data from external site with username and password

I have an application with many users, some of these users have an account on an external website with data I want to scrape.
This external site has a members area protected by an email/password form. Submitting it sets some cookies (a couple of ASP ones). You can then pull up the needed page and grab the data the external site holds for the user who just logged in.
The external site has no API.
I envisage my application asking users for their credentials to the external site, logging in on their behalf and grabbing the data we want.
How would I go about this in Python? For instance, do I need to run a GUI web browser on the server that Python drives to handle the cookies? (I'd rather not.)
Find the call the page makes to the back end by inspecting the format of the login request in your browser's network inspector.
Make the same request, using getpass to collect the user's credentials from the terminal (or a GUI). You can use urllib2 to make the requests.
Save all the cookies from the response in a cookiejar.
Reuse the cookies in subsequent requests and fetch data.
Then, profit. A sketch of these steps is shown below.
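A rough sketch with urllib2 and cookielib (the URLs and form field names here are hypothetical; inspect the real login request to find the actual ones):

    import urllib
    import urllib2
    import cookielib
    import getpass

    # Hypothetical URLs and form field names -- inspect the real login
    # request in your browser's network tab to find the actual ones.
    LOGIN_URL = 'https://external-site.example.com/login'
    DATA_URL = 'https://external-site.example.com/members/data'

    email = raw_input('Email: ')
    password = getpass.getpass('Password: ')

    # One cookie jar shared by every request made through this opener,
    # so the ASP session cookies set at login are sent back automatically.
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

    # POST the credentials; the response sets the session cookies.
    opener.open(LOGIN_URL, urllib.urlencode({'email': email,
                                             'password': password}))

    # Subsequent requests reuse those cookies and return the members-only page.
    html = opener.open(DATA_URL).read()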
Usually, this is done with a session.
I recommend using the requests library (http://docs.python-requests.org/en/latest/) for this.
You can use its Session feature (http://docs.python-requests.org/en/latest/user/advanced/#session-objects). Simply perform an authentication HTTP request (the URL and parameters depend on the website you are requesting), and then perform a request for the resource you want to scrape, as in the sketch below.
Without further information, we cannot help you more.
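A minimal sketch of that approach (again, the URLs and form fields are assumptions; they depend on the site):

    import requests

    # Hypothetical URLs and form fields -- replace with the site's real ones.
    LOGIN_URL = 'https://external-site.example.com/login'
    DATA_URL = 'https://external-site.example.com/members/data'

    session = requests.Session()

    # The Session object keeps the cookies set by the login response and
    # sends them automatically with every later request.
    resp = session.post(LOGIN_URL, data={'email': 'user@example.com',
                                         'password': 'secret'})
    resp.raise_for_status()

    data_page = session.get(DATA_URL)
    print(data_page.text)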

How to request different session IDs from Java-based web-server using python?

As you may know, on some web sites, when I send a home-page request to the web server, it assigns a session ID to me. Using this session ID and cookies, it prevents multiple sign-ins (logging in with different accounts in a single browser simultaneously). That is, if I haven't signed out, when I send another request to the web server in another tab of my browser, it returns the same session ID.
Now, I want to receive a different session ID for each request. How can I do that?
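If you only need the server to treat each request as a brand-new visitor, one approach is simply not to send the session cookie back. A sketch with the requests library (the URL is hypothetical; JSESSIONID is the usual session cookie name for Java servlet containers):

    import requests

    URL = 'https://java-site.example.com/'  # hypothetical home page

    # The server issues a fresh session ID to any client that arrives
    # without a session cookie, so using a new Session per request
    # yields a different ID every time.
    for _ in range(3):
        resp = requests.Session().get(URL)
        print(resp.cookies.get('JSESSIONID'))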

Page content not in response to get request

I'm trying to access my transaction history from my online banking page using Python and requests. I have no trouble logging in with requests and getting my account overview page content, but the bank transaction data is not in the response text. Obviously, it shows up in my browser when I access the same page.
Viewing the raw HTML through my browser, my bank transaction data is present; however, it is not present in the response content I receive from a GET request in Python.
I'm thinking this has something to do with the following:
When the page is accessed through a browser, the transaction data is temporarily not visible because it is being loaded by some background process. I think the same process happens when I access the site via Python, but the response only contains the content present in the initial state of the page; this state does not include the transaction data because the data is still loading.
One thing that supports this theory is that the response text received through Python and the response text in the browser (viewed in developer tools) are identical up to this line in the HTML:
<div id="accountRefreshDiv" style="display:none"><img blah blah>Updating...</div>
"Updating" also appears in the browser, along with a little spiny wheel, when the page is first accessed.
So my question is, what type of sub process could be going on in the background and how do I go about fetching the data that it is fetching (probably with JavaScript) but with python?
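For what it's worth, the usual pattern: the "Updating..." placeholder is replaced by the result of an AJAX (XMLHttpRequest) call that the page's JavaScript makes after load. If you find that call in your browser's developer tools (Network tab), you can often request its URL directly with the same authenticated session. The endpoint and header below are assumptions, not the bank's real API:

    import requests

    session = requests.Session()
    # ... perform the login you already have working ...

    # Hypothetical endpoint -- copy the real URL from the XHR request that
    # appears in the Network tab while the spinner is showing.
    AJAX_URL = 'https://bank.example.com/accounts/transactions.json'

    resp = session.get(AJAX_URL,
                       headers={'X-Requested-With': 'XMLHttpRequest'})
    print(resp.text)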

Python - reading the returnURL

I'm using Python without a server to deploy to. I'm trying to test the accept-payment flow for PayPal.
In the code, after sending a POST request, I store the result in a local file and then open this file using "webbrowser".
However, I suspect I am missing something here, since once I log in as a user I'm not automatically redirected to authorize the transaction; I just end up logged in.
Now, I suspect this is because once I hit the payment API endpoint, it redirects me to the login API with some parameters:
http://www.paypal.com?hypotheticalredirecturl=etc
Now, I am capturing the response, i.e., the HTML of the PayPal login page. However, the whole URL cannot be captured, so the part
hypotheticalredirecturl=etc
cannot be captured, and I think this is what stops the flow from going to the authorization page after I log in as a user.
I think if I appended the "hypotheticalredirect" part to my web page after opening it using "webbrowser", I might be able to make the flow normal.
Does anyone know of a way to capture the URL of the response?
I tried looking into the page itself, but I don't think it's there.
Any help will be appreciated.
Thanks,
Ashwin
EDIT: I'm using urllib and urllib2. Should I be looking at httplib?
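If you are already on urllib2, note that it follows redirects automatically, and the response object's geturl() method returns the final URL after all redirects, query string included, which sounds like exactly the piece you are missing. A sketch (the endpoint and parameters are placeholders):

    import urllib
    import urllib2

    # Placeholder endpoint and parameters for the payment call.
    PAYMENT_URL = 'https://api.example-paypal.com/payment'
    body = urllib.urlencode({'amount': '10.00'})

    resp = urllib2.urlopen(PAYMENT_URL, body)

    # geturl() is the URL actually landed on after any redirects, so a
    # hypotheticalredirecturl=... parameter would be preserved here.
    final_url = resp.geturl()
    print final_url  # open this URL with webbrowser, not the saved HTML file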
