I'm trying to execute a task after the app binds to its port, so that it doesn't get killed by Heroku for taking too long on startup. I am aware of before_first_request, but I would like this action to be performed as soon as possible after startup, without requiring a request.
I am loading an object as an attribute of the app object (because I need to access it across requests), and this object has to initialize in an unusual way: it checks whether a file exists, downloads it if it doesn't, and then performs a bunch of computations.
Currently I'm doing this in the following way:
def create_app() -> Flask:
    ...
    with app.app_context():
        app.model = RecommenderModel()  # This downloads a pretty heavy file if it isn't there
        app.model.load_products()  # This performs a bunch of calculations
    ...
    return app
This initializes the app properly (as tested locally), but Heroku kills it (error R10) because it takes too long.
Is there a way to do this asynchronously? When I tried, the app context got lost.
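For illustration, this is roughly the kind of background-thread approach I have in mind (a sketch only; the explicit app.app_context() push inside the worker thread is my guess at what was missing, not something I have verified):

import threading

def create_app() -> Flask:
    app = Flask(__name__)
    ...

    def load_model() -> None:
        # The worker thread does not inherit the context used while the
        # factory runs, so push an application context explicitly.
        with app.app_context():
            app.model = RecommenderModel()  # heavy download
            app.model.load_products()       # heavy computation

    # Start the heavy initialization in the background so the port can be
    # bound immediately; requests arriving before the thread finishes must
    # tolerate app.model not existing yet.
    threading.Thread(target=load_model, daemon=True).start()

    return app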
Edit: Additional information regarding what I'm doing:
The RecommenderModel object models the logic of a recommendation system. As of now, the recommendations are based on vector cosine similarity. Those vectors are extracted using pre-trained word2vec embeddings (which is the large file that needs to be downloaded). The conversion from products to vectors is handled by a Preprocessor class.
The Recommender Model initialization looks like this:
class RecommenderModel(object):
    def __init__(self) -> None:
        self.preproc = Preprocessor()
        self.product_vector: dict = {}

    def load_products(self) -> None:
        for product in Product.get_all():
            self.product_vector[product.id] = self.preproc.compute_vector(product)
The Preprocessor initialization looks like this:
class Preprocessor(object):
    def __init__(self, embeddings: str = embeddings) -> None:
        S3.ensure_file(embeddings)
        self.vectors = KeyedVectors.load_word2vec_format(embeddings)
The S3.ensure_file method basically checks if the file exists and downloads it if it doesn't:
class S3(object):
    client = boto3.client('s3')

    @classmethod
    def ensure_file(cls, filepath: str) -> None:
        if os.path.exists(filepath):
            return
        dirname, filename = os.path.split(filepath)
        bucket_name = os.environ.get('BUCKET_NAME')
        cls.client.download_file(bucket_name, filename, filepath)
I have a script that continually runs, processing data that it gets from an external device. The core logic follows something like:
from external_module import process_data, get_data, load_interesting_things

class MyService:
    def __init__(self):
        self.interesting_items = load_interesting_things()
        self.run()

    def run(self):
        try:
            while True:
                data = get_data()
                for item in self.interesting_items:
                    item.add_datapoint(process_data(data, item))
        except KeyboardInterrupt:
            pass
I would like to add the ability to request information for the various interesting things via a RESTful API.
Is there a way in which I can add something like a Flask web service to the program such that the web service can get a stat from the interesting_items list to return? For example something along the lines of:
@app.route("/item/<idx>/average")
def average(idx: int):
    avg = interesting_items[idx].getAverage()
    return jsonify({"average": avg})
Assume the necessary idx bounds checking and any appropriate locking are implemented.
It does not have to be Flask, but it should be lightweight. I want to avoid using a database. I would prefer a web service, but if that is not possible without completely restructuring the code base I can use a socket instead, though that is less preferable.
The server would run on a local network only, usually handling a single user, though sometimes it may have a few.
I needed to move the call to run() out of the __init__() method, so that I could keep a global reference to the service and start the run method in a separate thread. Something along the lines of:
service = MyService()
service_thread = threading.Thread(target=service.run, daemon=True)
service_thread.start()

app = flask.Flask("appname")

...

@app.route("/item/<idx>/average")
def average(idx: int):
    avg = service.interesting_items[idx].getAverage()
    return jsonify({"average": avg})

...

app.run()
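The service class itself then looks something like this (a sketch only; the comment marks the one change):

class MyService:
    def __init__(self):
        self.interesting_items = load_interesting_things()
        # self.run() is no longer called here; the daemon thread above
        # starts it, so constructing the service no longer blocks.

    def run(self):
        try:
            while True:
                data = get_data()
                for item in self.interesting_items:
                    item.add_datapoint(process_data(data, item))
        except KeyboardInterrupt:
            pass

Any locking around interesting_items (mentioned in the question) would then go inside add_datapoint and getAverage.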
My Bash script using kubectl create/apply -f ... to deploy lots of Kubernetes resources has grown too large for Bash. I'm converting it to Python using the PyPI kubernetes package.
Is there a generic way to create resources given the YAML manifest? Otherwise, the only way I can see to do it would be to create and maintain a mapping from Kind to API method create_namespaced_<kind>. That seems tedious and error prone to me.
Update: I'm deploying many (10-20) resources to many (10+) GKE clusters.
Update in the year 2020, for anyone still interested in this (since the docs for the Python library are mostly empty).
At the end of 2018 this pull request was merged, so it's now possible to do:
from kubernetes import client, config
from kubernetes import utils
config.load_kube_config()
api = client.ApiClient()
file_path = ... # A path to a deployment file
namespace = 'default'
utils.create_from_yaml(api, file_path, namespace=namespace)
EDIT: following a request in a comment, a snippet for skipping the Python error if the deployment already exists:
from kubernetes import client, config
from kubernetes import utils

config.load_kube_config()
api = client.ApiClient()

def skip_if_already_exists(e):
    import json
    # found in https://github.com/kubernetes-client/python/blob/master/kubernetes/utils/create_from_yaml.py#L165
    info = json.loads(e.api_exceptions[0].body)
    if info.get('reason').lower() == 'alreadyexists':
        pass
    else:
        raise e

file_path = ... # A path to a deployment file
namespace = 'default'

try:
    utils.create_from_yaml(api, file_path, namespace=namespace)
except utils.FailToCreateError as e:
    skip_if_already_exists(e)
I have written the following piece of code to create k8s resources from their json/yaml files:
def create_from_yaml(yaml_file):
    """
    :param yaml_file:
    :return:
    """
    yaml_object = yaml.safe_load(common.load_file(yaml_file))
    # Derive the client class name from apiVersion, e.g. "apps/v1" -> AppsV1Api
    group, _, version = yaml_object["apiVersion"].partition("/")
    if version == "":
        version = group
        group = "core"
    group = "".join(group.rsplit(".k8s.io", 1))
    func_to_call = "{0}{1}Api".format(group.capitalize(), version.capitalize())
    k8s_api = getattr(client, func_to_call)()
    # Convert the CamelCase kind to the snake_case method suffix,
    # e.g. "StatefulSet" -> "stateful_set"
    kind = yaml_object["kind"]
    kind = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', kind)
    kind = re.sub('([a-z0-9])([A-Z])', r'\1_\2', kind).lower()
    if "namespace" in yaml_object["metadata"]:
        namespace = yaml_object["metadata"]["namespace"]
    else:
        namespace = "default"
    try:
        if hasattr(k8s_api, "create_namespaced_{0}".format(kind)):
            resp = getattr(k8s_api, "create_namespaced_{0}".format(kind))(
                body=yaml_object, namespace=namespace)
        else:
            resp = getattr(k8s_api, "create_{0}".format(kind))(
                body=yaml_object)
    except Exception as e:
        raise e
    print("{0} created. status='{1}'".format(kind, str(resp.status)))
    return k8s_api
In the above function, if you provide any object's yaml/json file, it will automatically pick the right API and object type and create the object (StatefulSet, Deployment, Service, etc.).
PS: The above code doesn't handle multiple Kubernetes resources in one file, so you should have only one object per yaml file.
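If you do need several resources in one file, one possible extension (a sketch built on the same assumptions as the function above, not tested against every resource kind) is to parse the file with yaml.safe_load_all and push each document through the same naming logic:

import re

import yaml
from kubernetes import client

def create_all_from_yaml(yaml_file, default_namespace="default"):
    with open(yaml_file) as f:
        for yaml_object in yaml.safe_load_all(f):
            if not yaml_object:
                continue  # skip empty documents (e.g. a trailing '---')
            group, _, version = yaml_object["apiVersion"].partition("/")
            if version == "":
                group, version = "core", group
            group = "".join(group.rsplit(".k8s.io", 1))
            api = getattr(client, "{0}{1}Api".format(group.capitalize(), version.capitalize()))()
            kind = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', yaml_object["kind"])
            kind = re.sub('([a-z0-9])([A-Z])', r'\1_\2', kind).lower()
            namespace = yaml_object.get("metadata", {}).get("namespace", default_namespace)
            if hasattr(api, "create_namespaced_{0}".format(kind)):
                getattr(api, "create_namespaced_{0}".format(kind))(body=yaml_object, namespace=namespace)
            else:
                getattr(api, "create_{0}".format(kind))(body=yaml_object)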
I see what you are looking for. This is possible with other k8s clients available in other languages; here is an example in Java. Unfortunately the Python client library does not support that functionality yet. I opened a new feature request requesting the same, and you can either choose to track it or contribute yourself :). Here is the link for the issue on GitHub.
The other way to do what you are trying to do is to use the Java/Golang client and put your code in a Docker container.
I previously asked a similar question (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I did not receive a definite answer I'll ask it again.
I've downloaded a large number of files to an AWS S3 bucket using Scrapy's Files Pipeline. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "re-downloading media that was downloaded recently", but it does not say how long ago "recent" is or how to set this parameter.
Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, it would appear that this is obtained from the FILES_EXPIRES setting, for which the default is 90 days:
class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading

    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.

    `new` files are those that pipeline never processed and needs to be
    downloaded from supplier site the first time.

    `uptodate` files are the ones that the pipeline processed and are still
    valid files.

    `expired` files are those that pipeline already processed but the last
    modification was made long time ago, so a reprocessing is recommended to
    refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download
Do I understand this correctly? Also, I do not see a similar Boolean statement with age_days in the S3FilesStore class; is the checking of age also implemented for files on S3? (I was also unable to find any tests testing this age-checking feature for S3).
FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file can be before it is downloaded again.
The key section of the code is in media_to_download:
the _onsuccess callback checks the result of the pipeline's self.store.stat_file call and, for your question, it especially looks at the "last_modified" info. If the last modification is older than FILES_EXPIRES days, then the download is triggered.
You can check how the S3 store gets the "last modified" information. It depends on whether botocore is available or not.
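For example, to shrink that window you would set the value (in days) in your project's settings:

# settings.py: re-download files whose stored copy is older than 30 days
FILES_EXPIRES = 30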
A one-line answer to this would be: FilesPipeline(MediaPipeline) is the only class responsible for managing, validating and downloading files to your local paths; S3FilesStore just takes the files from those local paths and uploads them to S3.
FSFilesStore is the one which manages all your local paths, and FilesPipeline uses it to store your files locally.
Links:
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L264
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L397
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L299
I am using the cloudant python library to connect to my cloudant account.
Here is the code I have so far:
import cloudant

class WorkflowsCloudant(cloudant.Account):
    def __init__(self):
        super(WorkflowsCloudant, self).__init__(settings.COUCH_DB_ACCOUNT_NAME,
                                                auth=(settings.COUCH_PUBLIC_KEY,
                                                      settings.COUCH_PRIVATE_KEY))

@blueprint.route('/<workflow_id>')
def get_single_workflow(account_id, workflow_id):
    account = WorkflowsCloudant()
    db = account.database(settings.COUCH_DB_NAME)
    doc = db.document(workflow_id)
    resp = doc.get().json()

    if resp['account_id'] != account_id:
        return error_helpers.forbidden('Invalid Account')

    return jsonify(resp)
This Flask controller will have CRUD operations inside of it, but with the current implementation I will have to set the account and db variables in each method before performing operations on the document I want to view or manipulate. How can I clean up (or DRY up) my code so that I only have to call my main WorkflowsCloudant class?
I don't know cloudant, so I may be totally off base, but I believe this answers your question:
Delete the account, db, and doc lines from get_single_workflow.
Have __init__ accept the workflow_id and add the following lines to it:
db = self.database(settings.COUCH_DB_NAME)
self.doc = db.document(workflow_id)
Change the resp line in get_single_workflow to:
resp = WorkflowsCloudant(workflow_id).doc.get().json()
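Putting that together, the refactored code might look like this (a sketch only; I haven't run it against cloudant):

import cloudant

class WorkflowsCloudant(cloudant.Account):
    def __init__(self, workflow_id):
        super(WorkflowsCloudant, self).__init__(settings.COUCH_DB_ACCOUNT_NAME,
                                                auth=(settings.COUCH_PUBLIC_KEY,
                                                      settings.COUCH_PRIVATE_KEY))
        # The document handle is built once here, so each view only needs
        # to construct WorkflowsCloudant with the workflow_id it cares about.
        db = self.database(settings.COUCH_DB_NAME)
        self.doc = db.document(workflow_id)

@blueprint.route('/<workflow_id>')
def get_single_workflow(account_id, workflow_id):
    resp = WorkflowsCloudant(workflow_id).doc.get().json()
    if resp['account_id'] != account_id:
        return error_helpers.forbidden('Invalid Account')
    return jsonify(resp)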
As part of an integration test, I create documents in a subfolder:
class CmisTestBase(unittest.TestCase):
    def setUp(self):
        """ Create a root test folder for the test. """
        logger.debug("Creating client")
        self.cmis_client = CmisClient(REPOSITORY_URL, USERNAME, PASSWORD)
        logger.debug("Getting default repo")
        self.repo = self.cmis_client.getDefaultRepository()
        logger.debug("Getting root folder")
        self.root_folder = self.repo.getObjectByPath(TEST_ROOT_PATH)
        logger.debug("Creating test folder")
        self.folder_name = ".".join(['cmislib', self.__class__.__name__, str(time.time())])
        self.test_folder = self.root_folder.createFolder(self.folder_name)

    def tearDown(self):
        """ Clean up after the test. """
        logger.debug("Deleting test folder")
        self.test_folder.deleteTree()
And in my tests I create documents and then test that I can query them using repo.query:
class SearchNoauth(SearchTest):
    def setUp(self):
        super(SearchNoauth, self).setUp()

    def tearDown(self):
        super(SearchNoauth, self).tearDown()

    def test_noauth_empty(self):
        logger.debug("Calling test_noauth_empty")
        # Create a single document
        self.create_document_simple()

        # Retrieve all documents (No argument passed)
        response = self.client.profile_noauth()
        self.assertEqual(response.status_code, 200)

        result_data = response.json()
        logger.debug("results: {0}".format(pformat(result_data, indent=4)))
        self.assertEqual(len(result_data), 1)
But as near as I can tell, my custom content created in the scope of the test is not found because the default repo does not search the test folder.
I'd expect an API:
to allow searches against a folder (not just a repo) or
to support a syntax for looking for objects in a specific folder
How do I construct a CMIS query that finds matching custom documents in a folder?
A bit more:
self.client.profile_noauth is a call to the profile_noauth method in a Python client library
that hits a Pyramid server
that aggregates a number of services, including Alfresco,
and eventually calls repo.query against an Alfresco default repository.
Much of this question is about how to modify the facade service's CMIS query to look in a folder.
Later: I think I may have an answer. The basic idea is to get the ID of the folder and use in_folder():
>>> folder = repo.getObjectByPath('/Sites/test-site-1/documentLibrary')
>>> query = """
... select cmis:objectId, cmis:name
... from cmis:document
... where in_folder('%s')
... order by cmis:lastModificationDate desc
... """ % folder.id
You answered the question in your post. If you want to query for documents in a specific folder, you can use the in_folder() clause, and that requires the ID of the folder you want to search.