As part of an integration test, I create documents in a subfolder:
class CmisTestBase(unittest.TestCase):

    def setUp(self):
        """ Create a root test folder for the test. """
        logger.debug("Creating client")
        self.cmis_client = CmisClient(REPOSITORY_URL, USERNAME, PASSWORD)
        logger.debug("Getting default repo")
        self.repo = self.cmis_client.getDefaultRepository()
        logger.debug("Getting root folder")
        self.root_folder = self.repo.getObjectByPath(TEST_ROOT_PATH)
        logger.debug("Creating test folder")
        self.folder_name = ".".join(['cmislib', self.__class__.__name__, str(time.time())])
        self.test_folder = self.root_folder.createFolder(self.folder_name)

    def tearDown(self):
        """ Clean up after the test. """
        logger.debug("Deleting test folder")
        self.test_folder.deleteTree()
And in my tests I create documents and then test that I can query them using repo.query:
class SearchNoauth(SearchTest):

    def setUp(self):
        super(SearchNoauth, self).setUp()

    def tearDown(self):
        super(SearchNoauth, self).tearDown()

    def test_noauth_empty(self):
        logger.debug("Calling test_noauth_empty")
        # Create a single document
        self.create_document_simple()
        # Retrieve all documents (No argument passed)
        response = self.client.profile_noauth()
        self.assertEqual(response.status_code, 200)

        result_data = response.json()
        logger.debug("results: {0}".format(pformat(result_data, indent=4)))
        self.assertEqual(len(result_data), 1)
But as near as I can tell, my custom content created in the scope of the test is not found because the default repo does not search the test folder.
I'd expect an API either:
- to allow searches against a folder (not just a repo), or
- to support a syntax for looking for objects in a specific folder.
How do I construct a CMIS query that finds matching custom documents in a folder?
A bit more:
self.client.profile_noauth is a call to profile_noauth method in a python client library
that hits a Pyramid server
that aggregates a number of services, including Alfresco,
and eventually calls repo.query against an Alfresco default repository.
Much of this question is how to modify the facade service's CMIS query to look in a folder.
Later: I think I may have an answer. The basic idea is to get the ID of the folder and use in_folder():
>>> folder = repo.getObjectByPath('/Sites/test-site-1/documentLibrary')
>>> query = """
... select cmis:objectId, cmis:name
... from cmis:document
... where in_folder('%s')
... order by cmis:lastModificationDate desc
... """ % folder.id
You answered the question in your post. If you want to query for documents in a specific folder, you can use the IN_FOLDER clause, which requires the ID of the folder you want to search.
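For completeness, a minimal sketch of running that query with cmislib (the path and property names are taken from the question; iterating the result set and calling getName() is an assumption based on cmislib's documented API, not code from the original post):

folder = repo.getObjectByPath('/Sites/test-site-1/documentLibrary')
query = ("SELECT cmis:objectId, cmis:name "
         "FROM cmis:document "
         "WHERE IN_FOLDER('%s') "
         "ORDER BY cmis:lastModificationDate DESC") % folder.id

for result in repo.query(query):
    # each result is a CmisObject; print its name as a quick sanity check
    print(result.getName())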
Related
I'm trying to execute a task after the app binds to its port, so it doesn't get killed by Heroku for taking too long on startup. I am aware of before_first_request, but I would like this action to be performed as soon as possible after app startup, without requiring a request.
I am loading an object as an attribute of the app object (because I need to access it across requests), and this object has to initialize in an unusual way: it checks whether a file exists, downloads it if it doesn't, and afterwards performs a bunch of computations.
Currently I'm doing this in the following way:
def create_app() -> Flask:
    ...
    with app.app_context():
        app.model = RecommenderModel()  # This downloads a pretty heavy file if it isn't there
        app.model.load_products()       # This performs a bunch of calculations
    ...
    return app
This initializes the app properly (as tested locally) however Heroku kills it (Error R10) because it takes too long.
Is there a way to do this asynchronously? When I tried to do so the app context got lost.
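One possible direction (a hedged sketch, not something from the original post): run the heavy initialization in a background thread and push the application context explicitly inside that thread, so the context isn't lost while the port gets bound immediately:

import threading
from flask import Flask

def create_app() -> Flask:
    app = Flask(__name__)

    def load_model(flask_app: Flask) -> None:
        # Pushing the app context inside the thread avoids losing it
        with flask_app.app_context():
            flask_app.model = RecommenderModel()
            flask_app.model.load_products()

    # create_app() returns (and the port is bound) while loading continues in the background
    threading.Thread(target=load_model, args=(app,), daemon=True).start()
    return app

Note that any request needing app.model before loading finishes would still have to handle its absence.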
Edit: Additional information regarding what I'm doing:
The RecommenderModel object models the logic of a recommendation system. As of now, the recommendations are based on vector cosine similarity. Those vectors are extracted using pre-trained word2vec embeddings (which is the large file that needs to be downloaded). The conversion from products to vectors is handled by a Preprocessor class.
The Recommender Model initialization looks like this:
class RecommenderModel(object):
    def __init__(self) -> None:
        self.preproc = Preprocessor()
        self.product_vector: dict = {}

    def load_products(self) -> None:
        for product in Product.get_all():
            self.product_vector[product.id] = self.preproc.compute_vector(product)
The Preprocessor initialization looks like this:
class Preprocessor(object):
    def __init__(self, embeddings: str = embeddings) -> None:
        S3.ensure_file(embeddings)
        self.vectors = KeyedVectors.load_word2vec_format(embeddings)
The S3.ensure_file method basically checks if the file exists and downloads it if it doesn't:
class S3(object):
    client = boto3.client('s3')

    @classmethod
    def ensure_file(cls, filepath: str) -> None:
        if os.path.exists(filepath):
            return
        dirname, filename = os.path.split(filepath)
        bucket_name = os.environ.get('BUCKET_NAME')
        cls.client.download_file(bucket_name, filename, filepath)
After extracting some of an email's data, I would like to move the email to a specified folder with Python. I've searched and haven't found what I need.
Has anyone done this before?
Per a comment, I've added my current logic in hopes that it will clarify my problem. I loop through my folder and extract the details; after doing that, I want to move the email to a different folder.
import win32com.client
import getpass
import re

'''
Loops through Lotus Notes folder to view messages
'''

def docGenerator(folderName):
    # Get credentials
    mailServer = 'server'
    mailPath = 'PubDir\inbox.nsf'
    # Password
    pw = getpass.getpass('Enter password: ')
    # Connect
    session = win32com.client.Dispatch('Lotus.NotesSession')
    # Initializing the session and database
    session.Initialize(pw)
    db = session.GetDatabase(mailServer, mailPath)
    # Get folder
    folder = db.GetView(folderName)
    if not folder:
        raise Exception('Folder "%s" not found' % folderName)
    # Get the first document
    doc = folder.GetFirstDocument()
    # If the document exists,
    while doc:
        # Yield it
        yield doc
        # Get the next document
        doc = folder.GetNextDocument(doc)

# Loop through emails
for doc in docGenerator('Folder\Here'):
    # setting variables
    subject = doc.GetItemValue('Subject')[0].strip()
    invoice = re.findall(r'\d+', subject)[0]
    body = doc.GetItemValue('Body')[0].strip()
    # Move email after extracting above data
    # ???
Since you will move the document before getting the next one, I'd recommend replacing your loop with:
doc = folder.GetFirstDocument()
while doc:
    docN = folder.GetNextDocument(doc)
    yield doc
    doc = docN
And then, to move the message to the proper folder, you need:
doc.PutInFolder(r"Destination\Folder")
doc.RemoveFromFolder(r"Origin\Folder")
Of course, take care of escaping your backslashes or using raw string literals so the view names are passed correctly.
doc.PutInFolder creates the folder if it doesn't exist. In that case the user needs to have permissions to create public folders, otherwise the created folder will be private. (If the folder already exists, of course, this is not a problem.)
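Putting the two pieces together with the loop from the question (the destination folder name below is a hypothetical placeholder, not something from the original post):

for doc in docGenerator(r'Folder\Here'):
    subject = doc.GetItemValue('Subject')[0].strip()
    invoice = re.findall(r'\d+', subject)[0]
    body = doc.GetItemValue('Body')[0].strip()
    # Move the email once the data has been extracted
    doc.PutInFolder(r"Processed\Invoices")   # hypothetical destination folder
    doc.RemoveFromFolder(r"Folder\Here")

With the generator fetching the next document before yielding, moving the current document inside the loop is safe.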
I previously asked a similar question (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I did not receive a definite answer I'll ask it again.
I've downloaded a large number of files to an AWS S3 bucket using Scrapy's Files Pipeline. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "re-downloading media that was downloaded recently", but it does not say how long ago "recent" is or how to set this parameter.
Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, it would appear that this is obtained from the FILES_EXPIRES setting, for which the default is 90 days:
class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading

    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.

    `new` files are those that pipeline never processed and needs to be
    downloaded from supplier site the first time.

    `uptodate` files are the ones that the pipeline processed and are still
    valid files.

    `expired` files are those that pipeline already processed but the last
    modification was made long time ago, so a reprocessing is recommended to
    refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download
Do I understand this correctly? Also, I do not see a similar age_days check in the S3FilesStore class; is the age check also implemented for files stored on S3? (I was also unable to find any tests covering this age-checking feature for S3.)
FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file can be before it is downloaded again.
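For example, a sketch of overriding it in a project's settings.py (the store URI is a made-up placeholder; the setting names come from the Scrapy docs):

# settings.py
FILES_STORE = 's3://my-bucket/files/'  # hypothetical bucket
FILES_EXPIRES = 180                    # treat files younger than 180 days as up to date (default is 90)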
The key section of the code is in media_to_download:
the _onsuccess callback checks the result of the pipeline's self.store.stat_file call and, in particular, looks at the "last_modified" value. If the last modification is older than the expiry window in days, the download is triggered again.
You can check how the S3 store gets the "last modified" information; it depends on whether botocore is available or not.
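This is not Scrapy's own code, but as a rough illustration, reading an object's last-modified age from S3 with boto3 looks something like this (bucket and key are hypothetical):

import time
import boto3

s3 = boto3.client('s3')
head = s3.head_object(Bucket='my-bucket', Key='files/abc123.pdf')
last_modified = time.mktime(head['LastModified'].timetuple())
age_days = (time.time() - last_modified) / 60 / 60 / 24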
A one-line answer would be: FilesPipeline(MediaPipeline) is the only class responsible for managing, validating, and downloading files to your local paths; S3FilesStore(object) just takes the files from the local paths and uploads them to S3.
FSFilesStore is the class that manages all your local paths, and FilesPipeline uses it to store your files locally.
Links:
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L264
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L397
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L299
I have a module that needs to update a few variable values from the web about once a week. I could place those values in a file and load them on startup. Or, a simpler solution would be to auto-update the code itself.
Is this possible in Python?
Something like this...
def self_updating_module_template():
    dynamic_var1 = {'dynamic var1'}  # some kind of place holder tag
    dynamic_var2 = {'dynamic var2'}  # some kind of place holder tag
    return

def self_updating_module():
    dynamic_var1 = 'old data'
    dynamic_var2 = 'old data'
    return

def updater():
    new_data_from_web = ''
    new_dynamic_var1 = new_data_from_web  # Makes API call. gets values.
    new_dynamic_var2 = new_data_from_web
    # loads self_updating_module_template
    dynamic_var1 = new_dynamic_var1
    dynamic_var2 = new_dynamic_var2
    # replace module place holders with new values.
    # overwrite self_updating_module.py.
    return
I would recommend that you use configparser and a set of default values located in an ini-style file.
The ConfigParser class implements a basic configuration file parser
language which provides a structure similar to what you would find on
Microsoft Windows INI files. You can use this to write Python programs
which can be customized by end users easily.
Whenever the configuration values are updated from the web API endpoint, configparser also lets us write them back out to the configuration file. That said, be careful: the reason most people recommend that configuration be included at build/deploy time rather than at run time is security and stability. In production you have to lock down the endpoint that allows updates to your running configuration, and have some way to verify any configuration updates before your application picks them up:
import configparser

filename = 'config.ini'

def load_config():
    config = configparser.ConfigParser()
    config.read(filename)
    if 'WEB_DATA' not in config:
        config['WEB_DATA'] = {'dynamic_var1': 'dynamic var1',  # some kind of place holder tag
                              'dynamic_var2': 'dynamic var2'}  # some kind of place holder tag
    return config

def update_config(config):
    new_data_from_web = ''
    new_dynamic_var1 = new_data_from_web  # Makes API call. gets values.
    new_dynamic_var2 = new_data_from_web
    config['WEB_DATA']['dynamic_var1'] = new_dynamic_var1
    config['WEB_DATA']['dynamic_var2'] = new_dynamic_var2

def save_config(config):
    with open(filename, 'w') as configfile:
        config.write(configfile)
Example usage:
# Load the configuration
config = load_config()
# Get new data from the web
update_config(config)
# Save the newly updated configuration back to the file
save_config(config)
I am using the cloudant python library to connect to my cloudant account.
Here is the code I have so far:
import cloudant

class WorkflowsCloudant(cloudant.Account):
    def __init__(self):
        super(WorkflowsCloudant, self).__init__(settings.COUCH_DB_ACCOUNT_NAME,
                                                auth=(settings.COUCH_PUBLIC_KEY,
                                                      settings.COUCH_PRIVATE_KEY))

@blueprint.route('/<workflow_id>')
def get_single_workflow(account_id, workflow_id):
    account = WorkflowsCloudant()
    db = account.database(settings.COUCH_DB_NAME)
    doc = db.document(workflow_id)
    resp = doc.get().json()

    if resp['account_id'] != account_id:
        return error_helpers.forbidden('Invalid Account')

    return jsonify(resp)
This Flask controller will have CRUD operations inside of it, but with the current implementation I will have to set the account and db variables in each method before performing operations on the document I want to view or manipulate. How can I clean up (or DRY up) my code so that I only have to call my main WorkflowsCloudant class?
I don't know cloudant, so I may be totally off base, but I believe this answers your question:
Delete the account, db, and doc lines from get_single_workflow.
Add the following lines to __init__ (which would then need to accept workflow_id as a parameter):
    db = self.database(settings.COUCH_DB_NAME)
    self.doc = db.document(workflow_id)
Change the resp line in get_single_workflow to:
    resp = WorkflowsCloudant(workflow_id).doc.get().json()
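Put together, a minimal sketch of the refactored class and view (still assuming the settings names, blueprint, and helpers from the question, since I can't verify the cloudant API beyond what the post shows):

import cloudant

class WorkflowsCloudant(cloudant.Account):
    def __init__(self, workflow_id):
        super(WorkflowsCloudant, self).__init__(settings.COUCH_DB_ACCOUNT_NAME,
                                                auth=(settings.COUCH_PUBLIC_KEY,
                                                      settings.COUCH_PRIVATE_KEY))
        # The account now resolves the database and document itself
        db = self.database(settings.COUCH_DB_NAME)
        self.doc = db.document(workflow_id)

@blueprint.route('/<workflow_id>')
def get_single_workflow(account_id, workflow_id):
    resp = WorkflowsCloudant(workflow_id).doc.get().json()
    if resp['account_id'] != account_id:
        return error_helpers.forbidden('Invalid Account')
    return jsonify(resp)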