I am trying to host a REST API using Django which takes in some parameters, processes them and returns a result. For this processing to happen, I have to use certain datasets which are loaded from Excel, TIFF, CSV, and txt files. Loading these datasets and putting them in Python variables (to use them, of course) takes a bit of time; the problem is that I don't want my backend to extract info from all these files every time I get a request. The best way I could think of is literally copying the raw values from these files and putting them in Python variables using the = operator, but that would be at least 100,000 lines of code. Is there some way to pre-define certain variables in my backend? i.e. a variable that would get defined once on a run, and then every time I get a request, I'll just use these pre-defined variables instead of loading them again.
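What I have in mind is roughly the sketch below (the module and file names are made up): the data gets loaded once, when the module is first imported, and every request just reuses the already-populated variables.

```python
# datasets.py (hypothetical module) -- the body runs once, on first import
import pandas as pd

# Made-up file paths; in reality these would be my Excel/CSV/TIFF/txt datasets.
EXCEL_DATA = pd.read_excel('data/reference.xlsx')
CSV_DATA = pd.read_csv('data/lookup.csv')
```

Any view that does `from myapp import datasets` would then work with `datasets.EXCEL_DATA` directly; since Python caches imported modules, the files would only be read the first time the process needs them.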
I am working in Python on some data obtained from a SAS server. I am currently using SASPy's to_df() function to bring it from SAS into a local pandas DataFrame.
I would like to know if it's possible to filter/query the data that is being transferred, so I could avoid bringing over unneeded data and speed up my download.
I couldn't find anything in the saspy documentation; it only offers the possibility of using "**kwargs", but I couldn't figure out how to use it.
Thanks.
You need to define the sasdata object using the WHERE= dataset option to limit the observations pulled.
https://sassoftware.github.io/saspy/api.html#saspy.sasdata.SASdata
Then when you use the to_df() method only the selected data will be transferred.
You can also use the KEEP= or DROP= dataset option to limit the variables that are transferred. Remember that in order to reference any variables in the WHERE= option they have to be kept.
The "**kwargs" looks to be about changing how you connect to the SAS server, so that is not important for what you want.
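For example (an untested sketch; the libref, table, and column names are made up), the dataset options go into the dsopts dictionary when the SASdata object is created:

```python
import saspy

sas = saspy.SASsession()  # connection details come from your sascfg setup

# WHERE= limits the observations, KEEP= limits the variables transferred.
cars = sas.sasdata('cars', libref='sashelp',
                   dsopts={'where': 'make = "Ford"',
                           'keep': ['make', 'model', 'msrp']})

df = cars.to_df()  # only the filtered rows/columns come across
```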
I am building an image mosaic that detects whether the user's selected areas are taken or not.
My idea is to store the available spots in a list, available_spots, and I would just have to look through the list to check whether a spot is available or not.
The problem is that when I reload the website, available_spots also gets reset to an empty list,
so I want to store this array somewhere that is fast to read from and write to.
I am currently thinking about using a text file to store this, but that might take forever to read since the array length is over 1.4 million. Are there any other solutions that might be better?
You can't store the data in a file for a few reasons: (1) GAE standard won't let you, (2) the data is lost when your server is restarted, and (3) different instances will have different data.
Of course you can and should store the data in a database of your choice. Firestore is likely a better and cheaper option than SQL. It should be fast enough for you and you can implement caching if needed.
You might be able to store the data in a single Firestore entity and consider using compression if you are getting close to the max entity size.
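A rough sketch of that idea (the collection/document names are made up, and it assumes the google-cloud-firestore client):

```python
import json
import zlib

from google.cloud import firestore

db = firestore.Client()
doc = db.collection('mosaic').document('available_spots')  # hypothetical names

def save_spots(available_spots):
    # Compress the list so a ~1.4M-element array has a chance of fitting
    # under the ~1 MiB document size limit.
    doc.set({'spots': zlib.compress(json.dumps(available_spots).encode('utf-8'))})

def load_spots():
    snapshot = doc.get()
    if not snapshot.exists:
        return []
    return json.loads(zlib.decompress(snapshot.to_dict()['spots']).decode('utf-8'))
```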
If you want to store it in a database, you can use the "sqlite3" module.
It's a simple database that gets stored in a file, so you don't have to install a database program. It's great for small projects.
If you want to do more complex things with databases, you can use "sqlalchemy".
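A minimal sketch of how that could look for the spots example (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect('spots.db')  # just a file on disk, no server to install
conn.execute('CREATE TABLE IF NOT EXISTS spots (id INTEGER PRIMARY KEY, taken INTEGER)')

def mark_taken(spot_id):
    with conn:  # commits automatically on success
        conn.execute('INSERT OR REPLACE INTO spots (id, taken) VALUES (?, 1)', (spot_id,))

def is_taken(spot_id):
    row = conn.execute('SELECT taken FROM spots WHERE id = ?', (spot_id,)).fetchone()
    return bool(row and row[0])
```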
I'm impressed by the speed of running transformations and loading data and by the ease of use of Pandas, and I want to leverage all these nice properties (among others) to model some largish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.
I'm currently using a Postgres database to store the data, but the import of the data (coming from CSV files) is slow, tedious and error prone, and getting the data out of the database and processing it is not much easier. The data is never going to be changed once imported (no CRUD operations), so I thought it's ideal to store it as several pandas DataFrames (stored in HDF5 format and loaded via PyTables).
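Roughly what I have in mind for the storage side (the file/key names are made up):

```python
import pandas as pd

# One-time import: convert the CSV into HDF5 so later loads are fast.
df = pd.read_csv('dataset.csv')
df.to_hdf('datasets.h5', key='dataset', mode='w')

# At request/processing time: load the (immutable) DataFrame straight from HDF5.
df = pd.read_hdf('datasets.h5', key='dataset')
```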
The question is:
(1) Is this a good idea, and what are the things to watch out for? (For instance, I don't expect concurrency problems, as DataFrames are (or should be?) stateless and immutable, which is taken care of from the application side.) What else needs to be watched out for?
(2) How would I go about caching the data once it's loaded from the HDF5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least for the most recent/frequent DataFrames)? Flask (or Werkzeug) has a SimpleCache class, but internally it pickles the data and unpickles the cached data on access. I wonder if this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn? Is it possible to have static data (the cache), and can concurrent (different-process) requests access the same cache?
I realise these are many questions, but before I invest more time and build a proof of concept, I thought I'd get some feedback here. Any thoughts are welcome.
Answers to some aspects of what you're asking for:
It's not quite clear from your description whether you have the tables in your SQL database only, stored as HDF5 files, or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled, leading to fairly large files. You can also generate pandas DataFrames directly from SQL queries using read_sql, for example.
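For instance (the connection string and table name are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydb')
df = pd.read_sql('SELECT * FROM my_table', engine)
```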
If you don't need any relational operations, then I would say ditch the Postgres server; if it's already set up and you might need it in the future, keep using the SQL server. The nice thing about the server is that even if you don't expect concurrency issues, they will be handled automatically for you by (Flask-)SQLAlchemy, causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than to maintain multiple files lying around.
Whichever way you go, Flask-Cache will be your friend, using either a memcached or a Redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either SQL or the HDF5 file. Importantly, it also lets you cache templates, which may play a role in displaying large tables.
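A sketch of the memoization idea (using Flask-Caching, the maintained fork of Flask-Cache; the exact backend name depends on the version you install, and the file/key names are made up):

```python
from flask import Flask
from flask_caching import Cache
import pandas as pd

app = Flask(__name__)
cache = Cache(app, config={'CACHE_TYPE': 'RedisCache'})  # older versions use 'redis'

@cache.memoize(timeout=3600)
def load_dataframe(name):
    # Cached in Redis, so concurrent Gunicorn workers share the same cache.
    return pd.read_hdf('datasets.h5', key=name)
```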
You could, of course, also create a global variable, for example where you create the Flask app, and just import it wherever it's needed. I have not tried this and would thus not recommend it; it might cause all sorts of concurrency issues.
I need to perform a nightly update in the datastore on a relatively large dataset (syncing a subset of corporate data with GAE). I've been using the bulkloader, and it does the job, but the write costs are really adding up. Since I'm specifying key strings for each entity, the bulkloader is essentially rewriting the ENTIRE entity for every record it loads, which in my case, is about 90 writes PER ENTITY. (It's a large, flat dataset with a lot of indexes.) But within my dataset, only six of my 50 properties actually change overnight, so I'm doing a lot of redundant writing.
My first thought was to keep a cache of the prior night's build, loop through it for changes, get the entity, then execute a put() on only the properties that need it. This works effectively to reduce writes, but takes a LONG time -- even when I batch the put(). It only takes ~3 minutes to load the ENTIRE dataset with the bulkloader -- and 16-18 minutes just to run the updates! (I'm using the remote API, BTW.) This won't work when I scale up.
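For reference, the update pass is roughly the sketch below (the model and property names are made up; the change dict comes from diffing tonight's build against the cached one):

```python
from google.appengine.ext import ndb

class MyModel(ndb.Model):          # stand-in for my real model (~50 properties)
    price = ndb.FloatProperty()
    qty_on_hand = ndb.IntegerProperty()
    # ... plus the other properties

def apply_changes(changed):
    # changed: {key_name: {property_name: new_value}} for entities that differ.
    items = sorted(changed.items())
    entities = ndb.get_multi([ndb.Key(MyModel, name) for name, _ in items])
    for entity, (_, props) in zip(entities, items):
        for prop, value in props.items():
            setattr(entity, prop, value)
    ndb.put_multi(entities)  # batched, but still a full rewrite of each changed entity
```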
I tried using ndb.KeyProperty in my model and only updating the changed fields via the bulkloader, but then I lose the ability to query/sort on the KeyProperty, which I need.
I also tried StructuredProperty, which DOES let you query/sort, but a structured property doesn't allow you to set an ID for it, so I can't load just the structured property.
So... is there a way for me to reduce these writes and keep the functionality I need? Can I use the bulkloader to update changes only? Do I need to restructure my dataset?
I have a file that contains ~16,000 lines of information on entities. The user is supposed to upload the file using an HTML upload form, and then the system handles this by reading it line by line and creating, then put()'ing, entities into the datastore.
I'm limited by the 30-second request time limit. I have tried a lot of different workarounds using Task Queue, forced HTML redirecting, etc., and nothing has worked for me.
I am using forced HTML redirecting to delete all data, and this works, albeit VERY slowly. (See the 4th answer here: Delete all data for a kind in Google App Engine.)
I can't seem to apply this to my uploading problem, since my method has to be a POST method. Is there a solution somehow? Sample code would be much appreciated since I'm very new to web development in general.
To solve a similar problem, I stored the dataset in a model with a single TextProperty, then spawned a task queue task that:
Fetches a dataset from the datastore if there are any left.
Checks if the length of the dataset is <= N, where N is some small number of entities you can put() without a timeout (I used 5). If so, it writes the individual entities, deletes the dataset record, and spawns a new copy of the task.
If the dataset is bigger than N, it splits it into N parts in the same format, writes those to the datastore, deletes the original entity, and spawns a new copy of the task.
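A rough sketch of that task (Python 2 runtime; the kind names and task URL are made up):

```python
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

N = 5  # how many entities can safely be put() before the request deadline

class Dataset(ndb.Model):   # holds a chunk of the raw upload as one big text blob
    lines = ndb.TextProperty()

class Record(ndb.Model):    # the real entity kind being loaded
    payload = ndb.TextProperty()

def process_one_dataset():
    ds = Dataset.query().get()
    if ds is None:
        return  # nothing left to do
    lines = ds.lines.splitlines()
    if len(lines) <= N:
        # Small enough: write the real entities and drop this chunk.
        ndb.put_multi([Record(payload=line) for line in lines])
    else:
        # Too big: split into N parts in the same format and store them.
        step = -(-len(lines) // N)  # ceiling division
        ndb.put_multi([Dataset(lines='\n'.join(lines[i:i + step]))
                       for i in range(0, len(lines), step)])
    ds.key.delete()
    taskqueue.add(url='/tasks/process-dataset', method='POST')  # spawn the next copy
```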
If you're doing this to bulk load data, why not use the bulk loader?
If you need the interface to be accessible to non-admin users, then, as suggested, you need to break the file up into decent-sized chunks (by taking blocks of n lines each), put them into the datastore, and start a task to deal with each of them.