bulkloader.py --dump without authentication - python

Is there some way of using the bulkloader.py dump and restore functionality without authentication?
I have tried using:
- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
without the login-parameter, but login still seems to be required.
I still get
[ERROR ] Exception during authentication
I struggled with this for 6 hours yesterday, without any solution.
And yes, I have tried GAEBAR. It failed, however, when it got to entities containing Blobs of up to 1 MB (the maximum per entity).
So, I am looking to dump (and restore) for backup-purposes mainly.

remote_api, which the bulkloader uses, is written to deliberately require authentication, even if you omit the relevant clause in app.yaml. You can override it if you really want, but it's an incredibly bad idea - it would allow any anonymous user to do practically anything they liked to your app!
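For reference, the standard (and much safer) setup keeps the login restriction on the handler, so only app administrators can reach remote_api; the bulkloader then authenticates with your admin account when it connects. A sketch of that usual app.yaml stanza:
- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin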


How do I stop git from asking credentials when I try to clone a repository that doesn't exist?

I'm doing some research and I need to download a very large number of git repositories, something like 17k+.
I wrote a very simple Python script to automate the cloning routine from a dataset containing the GitHub URLs.
import os

# data is a pandas DataFrame with 'url' and 'project_name' columns
first_10 = data.url
name = data.project_name
for x, i in zip(first_10, name):
    os.system(r'git clone {} D:\gitres\{}'.format(x, i))
It just iterates over those two pandas dataframe columns for the URL to be cloned and the folder name.
Here comes the problem: every time the script finds a URL that no longer exists on GitHub, the script halts its routine, asks for credentials, and won't resume until I give it input. It doesn't matter whether I enter correct credentials or gibberish; it does this every time it hits an invalid GitHub URL. How do I stop git from asking for those credentials?
The reason GitHub sends you a 401 to prompt for credentials if the repository is missing is because they don't want to leak whether a private repository exists. If they didn't prompt, you could easily determine that the repository does exist by getting a 401 and that it doesn't by getting a 404. Instead, GitHub always prompts for credentials, and only then returns a 404 if the repository doesn't exist or isn't accessible to you.
If your desire is not to be prompted at all, as torek mentioned, you can simply set the environment variable GIT_ASKPASS to false and this will work. You could also set GIT_TERMINAL_PROMPT to 0 and that would prevent any prompting for credentials.
However, I strongly recommend that you do indeed set some credentials because GitHub will much more aggressively rate-limit you if you don't set any credentials, and if you do end up using an excessive amount of resources, it's much easier for GitHub to contact you about the problem and ask you to fix it, rather than just block you or make an abuse report to your network provider.
On that note, your Python script is not likely to handle the case where you have a large number of failures for that reason, so you should strongly consider handling that case more robustly. In general, anyone making a large number of HTTP requests to any server needs to learn to gracefully back off.
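A rough sketch of that (assuming the first_10/name columns from the question and Python 3's subprocess.run; the D:\gitres target is carried over from the original script):
import os
import subprocess
import time

# never prompt for credentials: missing/private repos just make git exit non-zero
env = dict(os.environ, GIT_TERMINAL_PROMPT='0', GIT_ASKPASS='false')

for url, folder in zip(first_10, name):
    result = subprocess.run(['git', 'clone', url, os.path.join(r'D:\gitres', folder)], env=env)
    if result.returncode != 0:
        print('skipping', url)  # repository missing or inaccessible
        time.sleep(1)           # crude back-off so repeated failures don't hammer GitHub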
If you decide you do want to pass credentials, you can do so from the environment using a custom credential helper, or you can use an SSH key and SSH URLs to do this.
I'd suggest you've overspecified the problem (turning this into an XY problem): you don't specifically need to make Git not ask for credentials since you could instead merely clone those repositories that do exist.
That said, there are two ways to prevent Git from asking for credentials:
Use a URL that cannot take credentials. (Any given server may or may not accept such URLs. With GitHub, you could try to log in as git@github.com via ssh, and present a valid public key. After ssh has authenticated you, Git will give you access to any accessible URL, and deny you access to any inaccessible or invalid URL, without asking for further credentials.)
Supply a credential helper that never actually provides any credentials, without asking for any. For instance, you could run with GIT_ASKPASS=false in the environment. See the credentials documentation for details.
(There's one more option as well, which is to allow Git to ask for credentials but redirect the input to a program. This is trickier than just overriding GIT_ASKPASS so there is no reason to cover it here.)
To solve the problem better, find a way to list out the repositories that do exist, and do not attempt to clone the ones that don't. This is likely to go significantly faster.
My guess is that you are using https:// data urls. If you use a personal access token, then GitHub shouldn't be asking you for a username/password. Take a look here on how to set it up.
I think that if you use ssh:// data urls instead, then you wouldn't encounter that problem because git defaults to using your ssh-key for authentication instead of password.
You probably want to check that the repo exists before attempting the clone. There are answers on Stack Overflow for this here.
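One way to do that check from Python (a sketch, again with prompting disabled so a missing repository just makes git exit non-zero):
import os
import subprocess

def repo_exists(url):
    # git ls-remote exits with a non-zero status if the repository is missing or inaccessible
    env = dict(os.environ, GIT_TERMINAL_PROMPT='0')
    return subprocess.call(['git', 'ls-remote', url], env=env,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0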
Alternatively, if you switch to using subprocess instead of os.system, you can simply "trick" it by reading input from /dev/null which will prevent the prompt - that way, no intervention will be needed and the invalid URLs will simply be skipped.
import subprocess

for x, i in zip(first_10, name):
    subprocess.call(['git', 'clone', x, i], shell=False, stdin=subprocess.DEVNULL, start_new_session=True)
I have come across another useful trick that might come in handy in desktop environments.
In conjunction with GIT_TERMINAL_PROMPT=0, running git -c credential.helper= <rest of command> also suppresses the credential-manager windows that would otherwise pop up.

authorisation failure in Pyramid (Python)

So I'm trying to port some old Pylons code to Pyramid, and I'd like to be able to improve on the Auth setup - specifically support better RBAC, and Pyramid has good support for this.
However, I'd like to offer unauthorised users better info when they try illegal pages:
"Sorry, in order to view [page] you ([user]) need [group] privileges - please contact [admin]"
However, I don't see how that's practical in Pyramid. I can do stuff in the forbidden_view_config view, but I can't easily find all the info needed about the page that was attempted. Is it possible to get the exception (or similar) with the actual reason why permission was not granted?
The request object itself should have all the bits you need.
Specifically, security-related pieces lists some of the request attributes that you can retrieve. Also the request.exception attribute will be available when an exception is raised. There are several URL-related pieces available to get the "page", including application_url.
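A minimal sketch of such a forbidden view (the message wording and the 'string' renderer are just for illustration; depending on your Pyramid version the user id may also be available as request.authenticated_userid):
from pyramid.security import authenticated_userid
from pyramid.view import forbidden_view_config

@forbidden_view_config(renderer='string')
def forbidden(request):
    # request.exception holds the HTTPForbidden that triggered this view
    user = authenticated_userid(request) or 'anonymous'
    return ('Sorry, in order to view %s you (%s) need more privileges - '
            'please contact the administrator.' % (request.url, user))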

Fetching images from URL and saving on server and/or Table (ImageField)

I'm not seeing much documentation on this. I'm trying to get an image uploaded onto the server from a URL. Ideally I'd like to keep things simple, but I'm in two minds as to whether using an ImageField is the best way, or whether it is simpler to just store the file on the server and display it as a static file. I'm not uploading any files myself, so I need to fetch them in. Can anyone suggest any decent code examples before I try to re-invent the wheel?
Given a URL, say http://www.xyx.com/image.jpg, I'd like to download that image to the server and put it into a suitable location after renaming it. My question is general, as I'm looking for examples of what people have already done. So far I just see examples relating to uploading images, but that doesn't apply. This should be a simple case and I'm looking for a canonical example that might help.
This is for uploading an image from the user: Django: Image Upload to the Server
So are there any examples out there that just deal with the process of fetching an image and storing it on the server and/or in an ImageField?
Well, just fetching an image and storing it into a file is straightforward:
import urllib2

# write in binary mode; make_a_unique_name() is a placeholder for your own naming scheme
with open('/path/to/storage/' + make_a_unique_name(), 'wb') as f:
    f.write(urllib2.urlopen(your_url).read())
Then you need to configure your Web server to serve files from that directory.
But this comes with security risks.
A malicious user could come along and type a URL that points nowhere. Or that points to their own evil server, which accepts your connection but never responds. This would be a typical denial of service attack.
A naive fix could be:
urllib2.urlopen(your_url, timeout=5)
But then the adversary could build a server that accepts a connection and writes out a line every second indefinitely, never stopping. The timeout doesn’t cover that.
So a proper solution is to run a task queue, also with timeouts, and a carefully chosen number of workers, all strictly independent of your Web-facing processes.
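For example, a sketch with Celery as the task queue (the broker URL and limits here are made up); the download then runs in a worker process with a hard time limit, independent of the web-facing processes:
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task(time_limit=30)  # hard cap on the whole download, not just the connect
def fetch_image(url):
    import urllib2
    return urllib2.urlopen(url, timeout=5).read()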
Another kind of attack is to point your server at something private. Suppose, for the sake of example, that you have an internal admin site that is running on port 8000, and it is not accessible to the outside world, but it is accessible to your own processes. Then I could type http://localhost:8000/path/to/secret/stats.png and see all your valuable secret graphs, or even modify something. This is known as server-side request forgery or SSRF, and it’s not trivial to defend against. You can try parsing the URL and checking the hostname against a blacklist, or explicitly resolving the hostname and making sure it doesn’t point to any of your machines or networks (including 127.0.0.0/8).
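A rough sketch of the resolve-and-check idea (using Python 3's ipaddress module; note this alone doesn't handle redirects or DNS rebinding):
import ipaddress
import socket
from urllib.parse import urlparse

def looks_safe(url):
    host = urlparse(url).hostname
    if host is None:
        return False
    addr = ipaddress.ip_address(socket.gethostbyname(host))
    # reject loopback, RFC 1918, link-local and other internal ranges
    return not (addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved)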
Then of course, there is the problem of validating that the file you receive is actually an image, not an HTML file or a Windows executable. But this is common to the upload scenario as well.
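And if you do want the file to end up in an ImageField rather than a hand-managed directory, a hedged sketch (the model and field names are made up; Django's ContentFile wraps the downloaded bytes so the field can save them through its storage backend):
import urllib2
from django.core.files.base import ContentFile

def save_remote_image(photo, url):
    # 'photo' is a hypothetical model instance with an ImageField named 'image'
    data = urllib2.urlopen(url, timeout=5).read()
    photo.image.save(make_a_unique_name(), ContentFile(data), save=True)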

Get current, requested URL in Python without a framework

I'm trying to get the URL that has been requested in Python without using a web framework.
For example, on a page (let's say /main/index.html), the user clicks on a URL to go to /main/foo/bar (/foo/bar doesn't exist). Apache (with mod_wsgi) then redirects the user to a PHP script at /main/, which then gets the URL and searches MySQL for any matching fields. Then the rest of the field is returned. This helped in PHP:
$_SERVER["REQUEST_URI"];
I'd rather not use PHP since it's becoming increasingly difficult to maintain the PHP code whilst the database keeps changing in structure.
I'm pretty sure there's a better way altogether and any mention would be greatly appreciated. For the sake of relevancy, is this even possible (to get the requested URL in Python)? Should I just use a framework, although it seems quite simple?
Thanks in advance,
Jamie
Note: I don't want to use GET for security purposes.
Well, if you run your program as a CGI script, you can get the same information in os.environ. However, if I recall correctly, REQUEST_URI as such is not part of the CGI standard and you need to use os.environ['SCRIPT_NAME'], os.environ['PATH_INFO'] and os.environ['QUERY_STRING'] to get the equivalent data.
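Something along these lines, assuming a plain CGI script (a sketch):
import os

script_name = os.environ.get('SCRIPT_NAME', '')
path_info = os.environ.get('PATH_INFO', '')
query_string = os.environ.get('QUERY_STRING', '')

requested = script_name + path_info
if query_string:
    requested += '?' + query_string  # roughly what PHP's REQUEST_URI would contain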
However, I seriously urge you to look at some lightweight framework, such as Pyramid. Plain CGI with Python is slow and generally just a pain in the ass.
Unlike PHP, Python is a general purpose language and doesn't have this built-in.
The way you can gather this information depends on the deployment solution:
CGI (mostly Apache with mod_python, deprecated): see Antti Haapala's solution
WSGI (most other deployment solutions): see gurney alex's solution
But you will encounter many more problems: session handling, URL management, cookies, and even just simple POST/GET parsing. All of this needs to be done manually if you don't use a framework.
Now, if you feel like a framework is overkill (but really, incredible tools like Django are worth it), you can use a micro framework like bottle.
Microframeworks will typically do this heavy lifting for you, but without the complicated setup or the additional advanced features. Bottle actually has zero setup and is a one-file lib.
Hello world with bottle:
from bottle import route, run, request

@route('/hello/:name')
def index(name='World'):
    return '<b>Hello %s! You are at %s</b>' % (name, request.path)

run(host='localhost', port=8080)
request.path contains what you want, and if you visit http://127.0.0.1:8080/hello/you, you will get:
Hello you! You are at /hello/you
When I want to get a URL outside of any framework using Apache2 and Mod_WSGI I use
environ.get('PATH_INFO')
inside of my application() function.
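In other words, something like this minimal WSGI application (a sketch):
def application(environ, start_response):
    # PATH_INFO holds the requested path; QUERY_STRING holds any ?key=value part
    path = environ.get('PATH_INFO', '')
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [('You requested %s' % path).encode('utf-8')]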
When using mod_python, if I recall correctly you can use something like:
from mod_python import util

def handler(request):
    parameters = util.FieldStorage(request)
    url = parameters.get("url", "/")
See http://www.modpython.org/live/current/doc-html/pyapi-util.html for more info on the mod_python.util module and the FieldStorage class (including examples)

ModSecurity error with Django

I'm trying to access a Django page through a Facebook App (iframe) I made using fb.py on DreamHost and I keep getting an internal server error.
Looking in the error logs, this is what I see:
ModSecurity: Output filter: Failed to read bucket (rc 104): Connection reset by peer
I think it just has to do with the POST request. Somebody else asked about this error on a number of forums almost a year ago, to no avail:
ModSecurity: Output filter: Failed to read bucket (rc 104): Connection reset by peer
All I could find searching was this at http://www.modsecurity.org:
"When mod_security denies such a request, it sends an error bucket with e.g. code 403 down the output filter chain, leaving r->status as is (e.g. 500)."
Any ideas? Thanks!
Have you implemented CSRF protection as per https://docs.djangoproject.com/en/dev/ref/contrib/csrf/#ajax ?
Note: make sure to cross-check against the version of Django you are using.
So I've spent way too much time trying to figure this out. I've settled on a (slightly shitty) work-around: add {% csrf_token %} to any place in your template (I'm assuming you passed in the context_instance=RequestContext(request) argument to your render_to_response or whatever).
I think what is happening is that the cookie doesn't actually get set (this can be confirmed by inspecting the cookies in any browser's development tools). Adding the above code to your template forces this. I have a feeling that this may be remedied in later versions of Django, and it seems as though there are obvious fixes for 1.4+ (e.g., see here). Unfortunately DreamHost has stuck us with 1.2.3, so we need to make do.
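For reference, the pieces that work-around relies on look roughly like this on Django 1.2 (the view and template names are made up):
from django.shortcuts import render_to_response
from django.template import RequestContext

def my_view(request):
    # context_instance=RequestContext(request) runs the context processors, which
    # is what makes {% csrf_token %} in the template actually emit/set the token
    return render_to_response('my_template.html', {},
                              context_instance=RequestContext(request))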
