Security risks of a link scraping system - python

I'm implementing a link scraping system like Facebook's link share feature: a user enters a URL, which is passed to our server via AJAX, and our server then does a GET request (using the requests library) and parses the response HTML with Beautiful Soup to capture relevant information about the page.
In this kind of system a person can obviously enter any URL they want. What types of security risk could our server be exposed to in this scenario? Could such a setup be exploited maliciously?

You probably want to make sure that your server doesn't execute any plugins or copy any videos/images.
JavaScript is trickier: if you ignore it you will miss some links, but if you execute it you had better be sure you aren't being used to do something like send spam.
If you are asking on SO you probably aren't sure enough!

You should do a Google search on RFI/LFI (Remote/Local File Inclusion) vulnerabilities and iframe attacks. If you are safe from those two, then you're good.

I have built quite a few small and large crawling systems. Honestly, I'm not sure what kind of security risks you are talking about; I am not clear on your requirements.
But if all you are doing is fetching the HTML, extracting certain things about the page with BeautifulSoup (like the title tag and meta tag info), and then storing that data, I don't see any problems.
As long as you are not blindly doing some kind of eval, either on the response from the URL or on the stuff the user entered, you are safe, I feel.
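For what it's worth, a minimal sketch of that kind of extraction, assuming the requests/BeautifulSoup setup described in the question (the metadata fields picked out are illustrative):

    import requests
    from bs4 import BeautifulSoup

    def extract_page_info(url):
        # Fetch the page; nothing from the response is ever passed to eval/exec
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")

        info = {}
        if soup.title and soup.title.string:
            info["title"] = soup.title.string.strip()
        # Pick up the description and Open Graph meta tags
        for meta in soup.find_all("meta"):
            key = meta.get("name") or meta.get("property")
            if key in ("description", "og:title", "og:description", "og:image"):
                info[key] = meta.get("content", "")
        return info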

Related

Python: interrogate database over http

I want to do automatic searches on a database (in this example www.scopus.com) with a simple Python script. I need somewhere to start. For example, I would like to do a search, get a list of links, open the links, and extract information from the opened pages. Where do I start?
Technically speaking, scopus.com is not "a database", it's a web site that lets you search/consult a database. If you want to programmatically access their service, the obvious way is to use their API, which mostly requires sending HTTP requests and parsing the HTTP responses. You can do this with the standard lib's modules, but you'll certainly save a lot of time using python-requests instead. And you'll certainly want to get some understanding of the HTTP protocol before...
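To illustrate that request/parse cycle with python-requests (the endpoint and parameter names below are made up for the example; a real service such as the Scopus API documents its own):

    import requests

    # Hypothetical search endpoint and parameters
    response = requests.get(
        "https://api.example.com/search",
        params={"query": "machine learning", "count": 25},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    response.raise_for_status()

    # Walk the parsed JSON response and pull out the fields of interest
    for entry in response.json().get("results", []):
        print(entry.get("title"), entry.get("url"))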

Redirecting all links from scraped webpage

The basic idea is that the web application fetches an external website and overlays some JavaScript on it, for additional functionality.
However, the links on the fetched webpage shouldn't navigate to the external website, but stay on my website. I figured that converting the links with regular expressions (or a similar method) wouldn't be enough, as it would not cover dynamically generated links, such as AJAX requests or other JavaScript functionality. So basically what I can't seem to find is a method to change/intercept/redirect all links of the scraped website.
So, what is a (good) way to change/intercept the dynamically generated links of a scraped website? Preferably in Python.
Unless you're changing the URLs on the scraped web page (including dynamic ones), you can't do what you ask.
If a client is served a web page with a URL pointing to an external site, your website will have no opportunity to intercept or change it, since their browser will navigate there without ever touching your site (not strictly true though; read on). Theoretically, you could attach event handlers to all links (before serving up the scraped page), and even intercept dynamically created ones (by parsing their JavaScript), but this might prove pretty difficult. You also have to stop other ways the URL can change (like header redirection).
Clients themselves can use proxies in their browsers (that affect all outgoing URLs), but this is the client deciding that all traffic should be routed through a proxy server. You can't do this on their behalf (without actually changing the URLs).
EDIT: Since the OP removed the suggestion of using a web proxy, the details of the answer change slightly, but the end result is the same. For all practical purposes, it's nearly impossible to do this.
You could try parsing the JavaScript on the page and be successful for some pages (or, with a sophisticated enough script, for many typical pages); but throw in one little eval on the page and you'll need your own JavaScript engine written in JavaScript to try to figure out every possible external request on a page. ...and even then you couldn't do it.
Basically, give me a script which someone claims can parse any webpage (including its JavaScript) to intercept any external calls, and I'll give you a webpage that the script won't work for. Disclaimer: I'm talking about intercepting the links while letting the site function normally afterwards, not just parsing the page to remove all JavaScript entirely.
Someone else may be able to provide you an answer that works sometimes on some web pages - maybe that would be good enough for your purposes.
Also, have you considered that most JavaScript on a page isn't embedded, but rather loaded via <script> tags, or possibly even loaded dynamically, from the original server? I assume you'd want to distinguish "stuff loaded from the original server needed to make the page function and look correct" from "stuff loaded from the original server for other things". How does your program "know" this?
You could try parsing the page and removing all JavaScript, but even that would be very difficult, since there are still tricky ways of getting around it.
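For completeness, the static part, rewriting the href/src attributes that are literally present in the fetched HTML, is straightforward; it is everything generated afterwards by JavaScript that a pass like this cannot touch. A rough sketch, assuming a hypothetical /proxy?url= route on your own site:

    from urllib.parse import urljoin, quote
    from bs4 import BeautifulSoup

    def rewrite_static_links(html, base_url, proxy_prefix="/proxy?url="):
        # Only rewrites URLs literally present in the HTML; anything built
        # later by JavaScript (AJAX calls, location changes) is untouched.
        soup = BeautifulSoup(html, "html.parser")
        for tag, attr in (("a", "href"), ("form", "action"),
                          ("img", "src"), ("script", "src"), ("link", "href")):
            for element in soup.find_all(tag):
                if element.get(attr):
                    absolute = urljoin(base_url, element[attr])
                    element[attr] = proxy_prefix + quote(absolute, safe="")
        return str(soup)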

URL fetch: prevent abuse, malicious URLs etc. in python/django

I'm building a webpage featuring something very much like the Facebook wall/newsfeed. Registered users (or users authenticated through Facebook Connect or Google auth) can submit URLs. At the moment, I'm taking these URLs and using urllib2 to fetch the content of the URL and searching for relevant information like og:properties, the HTML title tag and perhaps some <img> tags for images.
Now, I understand that I'm putting my server at risk when I'm letting users feed my server with URLs to open.
My question is: how high is the risk? What standard security checks can I make?
For now, I am simply opening the URL without any "active" protection because I don't know what to check for.
And what about storing the fetched content in the database? Does Django have built-in protection against SQL injection?
Thank you!
One of the obvious risks here is that one could use your website as a vector for spreading malicious URLs.
E.g. say I figure out malformed HTML that allows arbitrary code execution in WebKit-based browsers, say by exploiting a certain 0-day buffer overflow. If your website gets popular, that'd be one of the spots I'd definitely try.
Now, you can't possibly inspect the contents behind every submitted URL for security flaws; you'd be turning into an anti-virus/security company. Both Chrome and Safari do take care of this to some extent.
For the users' and the content's sake, and because of the risk I explained, you could build in a flagging system that learns from users' actions. You could train a classifier whenever someone flags a URL; see examples here.
I'm sure there is a variety of such solutions, in Python too.
For a quick overview of security and SQL injection in Django's context, check out this link.
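On the SQL injection part specifically: Django's ORM parameterizes queries for you, so scraped values can be stored without building SQL strings by hand. A minimal sketch, assuming a hypothetical LinkPreview model:

    # models.py: a hypothetical model for the fetched metadata
    from django.db import models

    class LinkPreview(models.Model):
        url = models.URLField()
        title = models.CharField(max_length=500, blank=True)
        description = models.TextField(blank=True)

    # Elsewhere, storing untrusted scraped values through the ORM: the values
    # are passed to the database as query parameters, never spliced into SQL.
    def save_preview(submitted_url, scraped_title, scraped_description):
        LinkPreview.objects.create(
            url=submitted_url,
            title=scraped_title,
            description=scraped_description,
        )

Note that this only covers SQL injection; escaping those stored values when they are rendered again is a separate concern (Django templates auto-escape by default).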

Python get data from secured website

I'd like to know if there is a way to get information from my banking website with Python. I'd like to retrieve my card history and display it, and possibly save it into a text document each month.
I have found the URLs etc. to log in and get the information from the website, which work from a browser, but I have been using urllib2 to "open" the webpages from Python and I have a feeling it's not working because of some cookie or session things.
I can get any information I want from a website that does not require a login with urllib2, and then save the actual HTML and go through it later, but I can't on my bank's website.
Any help would be appreciated
This is part of web scraping:
Web scraping is a standard task that can serve various needs.
Scraping data out of a secure website means HTTPS.
Handling HTTPS is not a problem with mechanize and BeautifulSoup.
urllib2 with a cookielib CookieJar also works fine.
If managing the cookies is the problem, then I would recommend mechanize; a minimal cookie-jar sketch follows below.
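A minimal sketch of the cookie-jar approach, in the Python 2 spelling to match the urllib2 mentioned in the question (the URLs and form field names are placeholders; whether a real bank login works this way is another matter, for the reasons below):

    import urllib
    import urllib2
    import cookielib

    # The jar keeps whatever session cookie the login response sets
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

    # Placeholder login URL and form fields
    login_data = urllib.urlencode({"username": "me", "password": "secret"})
    opener.open("https://www.example-bank.com/login", login_data)

    # Requests made through the same opener send the stored cookies back
    html = opener.open("https://www.example-bank.com/card-history").read()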
Considering the case of your bank's site:
I would recommend not playing with your account.
If you must, then it's not as easy as a normal secure/non-secure site.
These sites are designed to withstand such scripts.
Problems that you would face with this:
Bank sites will surely have a captcha that is almost impossible to bypass with a script unless you put in a lot of rocket science and effort.
Another problem you will definitely face is JavaScript: standard scripting solutions focus on managing cookies, HTML parsing, etc. To follow links produced by JavaScript you will have to process the JS in your Python script, which again takes a lot of effort.
Then there is AJAX, which again comes from JavaScript and fetches data from the server after page load.
So this task will require a lot of effort.
Also, if you try it you risk losing access to your account, since banking sites are quick to block accounts after 3-4 unsuccessful login or captcha attempts.
So, think before you do.

Retrieve cookie created using javascript in python

I've had a look at many tutorials regarding cookiejar, but my problem is that the webpage that I want to scrape creates the cookie using JavaScript and I can't seem to retrieve the cookie. Does anybody have a solution to this problem?
If all pages have the same JavaScript then maybe you could parse the HTML to find that piece of code, and from that get the value the cookie would be set to?
That would make your scraping quite vulnerable to changes in the third party website, but that's most often the case while scraping. (Please bear in mind that the third-party website owner may not like that you're getting the content this way.)
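A rough illustration of that idea, assuming the cookie is set by a simple inline document.cookie assignment whose shape you have inspected by hand (the URL, regex and cookie name are illustrative and will need adjusting to the real script):

    import re
    import requests

    # Placeholder URL for the page whose inline script sets the cookie
    html = requests.get("http://example.com/page-that-sets-cookie", timeout=10).text

    # Look for an inline assignment such as: document.cookie = "token=abc123; path=/"
    match = re.search(r'document\.cookie\s*=\s*["\']([^"\']+)["\']', html)
    if match:
        cookie = match.group(1).split(";")[0]   # "token=abc123"
        name, _, value = cookie.partition("=")  # ("token", "=", "abc123")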
I responded to your other question as well: take a look at mechanize. It's probably the most fully featured scraping module I know: if the cookie is sent, then I'm sure you can get to it with this module.
Maybe you can execute the JavaScript code in a JavaScript engine with Python bindings (like python-spidermonkey or pyv8) and then retrieve the cookie. Or, since the JavaScript code is executed client-side anyway, you may be able to convert the cookie-generating code to Python.
You could access the page using a real browser, via PAMIE, win32com or similar; then the JavaScript will run in its native environment.
