Starting, Stopping, and Continuing the Google App Engine BulkLoader - python

I have quite a bit of data that I will be uploading into Google App Engine. I want to use the bulkloader to help get it in there. However, I have so much data that I generally use up my CPU quota before it's done. Also, any other problem, such as a bad internet connection or a random computer issue, can stop the process.
Is there any way to continue a bulkload from where you left off? Or to only bulkload data that has not been written to the datastore?
I couldn't find anything in the docs, so I assume any answer will include digging into the code.

Well, it is in the docs:
If the transfer is interrupted, you can resume the transfer from where it left off using the --db_filename=... argument. The value is the name of the progress file created by the tool, which is either a name you provided with the --db_filename argument when you started the transfer, or a default name that includes a timestamp. This assumes you have sqlite3 installed, and did not disable the progress file with --db_filename=skip.
http://code.google.com/appengine/docs/python/tools/uploadingdata.html
(I've used it some time ago, so I had a feeling it should be there)
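As a rough sketch of how the quoted advice could be applied, the helper below builds an appcfg.py upload_data command line that always names the same progress file, so rerunning it after an interruption should pick up where the previous run stopped. The function name and all file names are hypothetical, and the flags other than --db_filename are written from memory of the legacy bulkloader documentation, so check them against your SDK version.

```python
def build_upload_command(app_dir, config_file, data_file, kind, progress_db):
    """Build the argument list for a resumable bulkloader upload.

    Passing the same progress_db on every run is what lets the tool
    resume instead of starting over.
    """
    return [
        "appcfg.py",
        "upload_data",
        "--config_file=" + config_file,   # bulkloader configuration (assumed name)
        "--filename=" + data_file,        # data file to upload (assumed name)
        "--kind=" + kind,                 # datastore kind being loaded
        "--db_filename=" + progress_db,   # progress file reused across runs
        app_dir,
    ]


command = build_upload_command(
    "myapp", "bulkloader.yaml", "data.csv", "Greeting", "progress.sql3"
)
```

The resulting list can be handed to subprocess.call(command) for the first run and again, unchanged, for every retry.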

Related

Want to Use Whisper in My Flutter Project and Not Sure Where to Start

First I'd like to say that I know similar questions about calling Python code in Flutter have been asked before, but I think this particular case has some challenges.
Some notes about the app I'm aiming for:
Basically a note taking app, records a lecture or meeting or whatever and transcribes the text for you, with a few extra features thrown in. I'd like to have all speech being processed locally both to ensure it works offline and reduce the app's dependence on cloud services.
I'm trying to use Whisper, a new speech to text software that processes everything locally, which is a necessity for my app. I know I could make a Flutter plugin but I'm not sure if that's the best route to go about this for a few reasons:
I haven't done it before, so it would be quite a time investment to do this and just hope it works out.
One of the ways I've seen of doing this involves sending data over http between Python and Flutter, but Whisper would need a continuous stream of audio to work properly which I'm not sure this approach is suited for.
I'd really like to have 1 codebase that runs on any device.
I'd be fine with the app only working on PC for now, but I'd like to also have it working on Android and maybe iOS if reasonably possible. Any other routes I can take towards development are great too, but I'd really like to stick with Flutter for this app if I can.
I just found this one: https://github.com/azkadev/whisper_dart
I haven't tried it yet, but it seems worth a try.

How do I run Python scripts automatically, while my Flask website is running on a VPS?

Okay, so basically I am creating a website. The data I need to display on this website is delivered twice daily, where I need to read the delivered data from a file and store this new data in the database (instead of the old data).
I have created the Python functions to do this. However, I would like to know the best way to run this script while my Flask application is running. This may have a very simple answer, but I have seen some answers saying to incorporate the script into the website design (without explaining how), and others saying to run it separately. The script needs to run automatically throughout the day with no monitoring or input from me.
TIA
Generally it's a really bad idea to make a web server (your Flask application, in this case) handle such tasks. There are many reasons for this; to name a few:
Python's Achilles heel - GIL.
Sharing system resources of the application between users and other operations.
Crashes - it happens, it could be unlikely but it does. And if you are not careful, the web application goes down along with it.
So with that in mind, I'd advise you to drop this idea and use cron instead. Write a standalone script that does whatever transformations or operations it needs to do, and create a cron job that runs it at the desired times.
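A minimal sketch of such a standalone script, assuming the delivered file is a CSV with name and value columns and the site reads from a SQLite database. The table name, column names, and paths are all hypothetical; swap in whatever your Flask app actually uses.

```python
import csv
import sqlite3


def refresh_table(conn, csv_path):
    """Replace all rows of the (hypothetical) readings table with the CSV contents."""
    with open(csv_path, newline="") as handle:
        rows = [(r["name"], r["value"]) for r in csv.DictReader(handle)]
    with conn:  # single transaction: old rows are only dropped if the insert succeeds
        conn.execute("CREATE TABLE IF NOT EXISTS readings (name TEXT, value TEXT)")
        conn.execute("DELETE FROM readings")
        conn.executemany("INSERT INTO readings (name, value) VALUES (?, ?)", rows)
    return len(rows)


def main():
    # Placeholder paths; cron would run this script twice a day, for example:
    #   0 6,18 * * * /usr/bin/python3 /path/to/refresh.py
    conn = sqlite3.connect("site.db")
    try:
        print(refresh_table(conn, "delivery.csv"))
    finally:
        conn.close()
```

Calling main() at the bottom of the script makes it runnable from cron. Because it runs in its own process, a crash during the refresh does not take the web application down with it.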

Python API implementing a simple log file

I have a Python script that will regularly check an API for data updates. Since it runs without supervision, I would like to be able to monitor what the script does to make sure it works properly.
My initial thought is just to write every communication attempt with the API to a text file with the date, the time, and whether data was pulled or not, with a new line for every input. Would you recommend doing it another way? Writing to Excel, for example, to be able to sort the columns? Or are there any other options worth considering?
I would say it really depends on two factors:
How often you update
How much interaction do you want with the monitoring data (i.e. notification, reporting etc)
I have had projects where we've updated Google Sheets (using the API) to be able to collaboratively extract reports from update data.
However, note that this means a web call at every update, so if your updates are close together, this will affect performance. Also, if your app is interactive, there may be a delay while the data gets updated.
The upside is you can build things like graphs and timelines really easily (and collaboratively) where needed.
Also - yes, definitely the logging module as answered below. I sort of assumed you were using the logging module already for the local file for some reason!
Take a look at the logging documentation.
A new line for every input is a good start. You can configure the logging module to print date and time automatically.
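As an illustration of that last point, here is a small sketch using the standard logging module, where a Formatter adds the date, time, and level to each line. The logger name and messages are made up for the example, and an in-memory buffer stands in for the log file so the snippet is self-contained.

```python
import io
import logging


def build_logger(name, handler):
    """Return a logger whose records are prefixed with date, time, and level."""
    formatter = logging.Formatter(
        fmt="%(asctime)s %(levelname)s %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
    handler.setFormatter(formatter)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


# In a real script, pass logging.FileHandler("api_poll.log") instead
# (hypothetical file name) so each attempt is appended as a new line.
buffer = io.StringIO()
log = build_logger("api_poll", logging.StreamHandler(buffer))
log.info("checked API: new data pulled")
log.warning("checked API: request failed")
```

Each call produces one line starting with a timestamp, so the file can later be sorted or filtered by date and level without needing a spreadsheet.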

Making SQLAlchemy Play Nice With Google App Engine

I'm currently working on a Google App Engine (Python) project which primarily uses Google Cloud SQL (with SQLAlchemy) for back-end data storage.
Most of the time everything works perfectly well. However, occasionally "something" goes haywire and we start getting bizarre exceptions. For example:
AttributeError: 'ColumnProperty' object has no attribute 'strategy'
AttributeError: 'RelationshipProperty' object has no attribute 'strategy'
We think this might be related to the spinning up of a new GAE instance, but we can't really be sure.
With all that being said, my question is this. What are some strategies that my team and I can use to track down this issue?
Keep in mind that the application is running on Google App Engine so that might limit our options a bit.
Update: Owen Nelson's comment below is right on. We've added threading.RLock as suggested by Google. However we are still seeing this issue, but much less often.
I want to be clear, to this point we've been unable to reproduce this issue in our local environment. We are pretty sure this has something to do with dynamic instances spinning up and that isn't something that we can really do in development.
From what I can understand, your application has problems only in production mode.
Try to reproduce the bug in dev mode
The best possible solution would be to reproduce the bug in development mode. To do that, you could try running a batch of unit tests with LOTS of data. (See how to do local testing on App Engine.)
If that doesn't work...
Turn on appstats to get more information on the handler
You can turn on appstats to try to find out which handler is causing the problem. Appstats mainly reports on datastore calls, which are not relevant in this case, but it also gives general information about requests (such as response time).
Identify the handler and wrap it in a beautiful try/except
Once you identify the source of the problem, or where the exception is raised, you can surround it with a try/except. With that you can get more information from the traceback and hopefully solve your problem.
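One way the last step might look in Python is sketched below: a decorator that logs the full traceback and then re-raises, so the request still fails visibly but the log keeps the details. The decorator name, logger name, and the load_report handler are hypothetical; the handler just raises the error from the question to show the effect.

```python
import functools
import logging

logger = logging.getLogger("handlers")


def log_exceptions(func):
    """Log the full traceback of any exception raised by func, then re-raise it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logger.exception records the message plus the current traceback.
            logger.exception("unhandled error in %s", func.__name__)
            raise
    return wrapper


@log_exceptions
def load_report(report_id):
    # Stand-in for the real handler body; simulates the intermittent failure.
    raise AttributeError("'ColumnProperty' object has no attribute 'strategy'")
```

Re-raising rather than swallowing the exception keeps the behaviour of the handler unchanged, which matters when the goal is only to gather more information.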

In Python in GAE, what is the best way to limit the risk of executing untrusted code?

I would like to enable students to submit Python code solutions to a few simple Python problems. My application will be running in GAE. How can I limit the risk from malicious code that is submitted? I realize that this is a hard problem, and I have read related Stack Overflow and other posts on the subject. I am curious whether the restrictions already in place in the GAE environment make it simpler to limit the damage that untrusted code could inflict. Is it possible to simply scan the submitted code for a few restricted keywords (exec, import, etc.) and then ensure the code only runs for less than a fixed amount of time, or is it still difficult to sandbox untrusted code even in the restricted GAE environment? For example:
# Import and execute untrusted code in GAE
untrustedCode = """#Untrusted code from students."""

class TestSpace(object):
    pass

testspace = TestSpace()

try:
    # Check the untrusted code somehow and throw an exception.
    pass
except Exception:
    print("Code attempted to import or access network")

try:
    # exec code in a new namespace (Thanks Alex Martelli)
    # limit runtime somehow
    exec(untrustedCode, vars(testspace))
except Exception:
    print("Code took more than x seconds to run")
@mjv's smiley comment is actually spot-on: make sure the submitter IS identified and associated with the code in question (which presumably is going to be sent to a task queue), and log any diagnostics caused by an individual's submissions.
Beyond that, you can indeed prepare a test-space that's more restrictive (thanks for the acknowledgment;-) including a special 'builtin' that has all you want the students to be able to use and redefines __import__ &c. That, plus a token pass to forbid exec, eval, import, __subclasses__, __bases__, __mro__, ..., gets you closer. A totally secure sandbox in a GAE environment however is a real challenge, unless you can whitelist a tiny subset of the language that the students are allowed.
So I would suggest a layered approach: the sandbox GAE app in which the students upload and execute their code has essentially no persistent layer to worry about; rather, it "persists" by sending urlfetch requests to ANOTHER app, which never runs any untrusted code and is able to vet each request very critically. Default-denial with whitelisting is still the holy grail, but with such an extra layer for security you may be able to afford a default-acceptance with blacklisting...
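To make the token pass and the restricted test-space a little more concrete, here is a rough Python 3 sketch. It rejects source containing any of the names listed above and then runs the code with a small whitelist of builtins. The function names and the whitelist are assumptions chosen for illustration, and this is not a secure sandbox on its own; it only shows the shape of the first layer.

```python
import io
import tokenize

# Names rejected by the token pass (taken from the list above; not exhaustive).
FORBIDDEN_NAMES = {
    "exec", "eval", "import", "__import__",
    "__subclasses__", "__bases__", "__mro__",
}

# Hypothetical whitelist of builtins the students may use.
SAFE_BUILTINS = {"len": len, "range": range, "sum": sum,
                 "min": min, "max": max, "abs": abs}


def find_forbidden_names(source):
    """Return the sorted forbidden names that appear as NAME tokens in source."""
    found = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string in FORBIDDEN_NAMES:
            found.add(tok.string)
    return sorted(found)


def run_restricted(source):
    """Run source with only SAFE_BUILTINS visible and return its namespace."""
    bad = find_forbidden_names(source)
    if bad:
        raise ValueError("forbidden names: " + ", ".join(bad))
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(source, namespace)
    return namespace
```

Working on tokens rather than raw text means a forbidden word inside a string literal or comment is not flagged, while the same word used as an identifier or keyword is. It still does nothing about runtime or memory limits, which is one reason the separate-app layer matters.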
You really can't sandbox Python code inside App Engine with any degree of certainty. Alex's idea of logging who's running what is a good one, but if the user manages to break out of the sandbox, they can erase the event logs. The only place this information would be safe is in the per-request logging, since users can't erase that.
For a good example of what a rathole trying to sandbox Python turns into, see this post. For Guido's take on securing Python, see this post.
There are a couple of other options: If you're free to choose the language, you could run Rhino (a JavaScript interpreter) on the Java runtime; Rhino is nicely sandboxed. You may also be able to use Jython; I don't know if it's practical to sandbox it, but it seems likely.
Alex's suggestion of using a separate app is also a good one. This is pretty much the approach that shell.appspot.com takes: It can't prevent you from doing malicious things, but the app itself stores nothing of value, so there's no harm if you do.
Here's an idea. Instead of running the code server-side, run it client-side with Skulpt:
http://www.skulpt.org/
This is both safer, and easier to implement.
