I'm trying to set up Whoosh search in a serverless environment (aws lambda hosted api) and having trouble with Whoosh since it hosts the index on the local filesystem. That becomes an issue with containers that aren't able to update and reference a single index.
Does anyone know if there is a solution to this problem. I am able to select the location that the directory is hosted but it has to be on the local filesystem. Is there a way to represent an s3 file as a local file?
I'm currently having to reindex every time the app is initialized and while it works it's clearly an expensive and terrible workaround.
The answer seems to be no. Serverless environments by default are ephemeral and don't support persistent data storage which is needed for something like storing an index that Whoosh generates.
You can always use Whoosh in RAM.
from whoosh.filedb.filestore import RamStorage
store = RamStorage()
ix = store.create_index(...)
Related
Let's say:
I have my python code in main.py and I am using Pandas
I am storing my API Key(to some azure service) in a Windows Environment Variable ( variable name = "AZURE_KEY" and variable_value = "abc123abc")
I will import this API Key in main.py using azure_key = os.environ.get("AZURE_KEY")
Question:
How can I be sure that Pandas Library hasn't sent azure_key's value to somewhere outside my local system?
Possible Approach:
I know one way is to go through the entire Pandas module files and understand the source code to see if any fishy stuff is happening , but such an approach is not feasible.
Note:
Pandas is just an example for the question.I want to use an API Key within a Streamlit code.
Hence,Please take this question agnostic to the library..
For a production system (on a server), you could use a firewall to filter outgoing connections
For a development system (your machine), you could add restrictions to the "API Key" account (e.g. only access test data, only access systems you really need, etc.)
I need to set a global application setting in my app. I don't really want to use the database as it's just a single flag. I can't use Memcache because it's not durable. Not sure if env variable is shared will al instances after the change?
This setting might be changed in the app (like once a month)
Is there any other Google Service where I could place it? There will be quite a lot of reads
Here are a few options:
Store it in your code.
Store it in an environment variable. This will be the same for all instances.
Store it in datastore but read the value into a Python module variable so you don't need to access the datastore to use it after the first time.
Google recently deployed "Secrets management". I haven't tried it yet, but it could work for this.
Variables are not supported in S3 backend I need alternative way to do this can any one suggests I go through online some are saying terragrunt some are say like python, workspaces.environments.Actually we are built some Dev environment for clients from the app they will enter the details like for ec2 they will enter count, ami, type from here all are fine but with the backend state file issue which does not support variables I need to change every time bucket name,path.Can some one please explain me the structure and sample code to resolve this thanks in advance. #23208
You don't need to have different backends for every client, you need to have different tfstate. Then you can use terraform init --reconfigure "key=<client>", where <client> is an identifier you set for your clients.
I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket. This is fine when using boto (as it's part of the API), but I can't find a definitive answer as to if PySpark supports this out of the box.
Ideally, I'd like to be able to assume a role when running in standalone mode locally and point my SparkContext to that s3 path. I've seen that non-IAM calls usually follow :
spark_conf = SparkConf().setMaster('local[*]').setAppName('MyApp')
sc = SparkContext(conf=spark_conf)
rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>#some-bucket/some-key')
Does something like this exist for providing IAM info? :
rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>:<MY-SESSION>#some-bucket/some-key')
or
rdd = sc.textFile('s3://<ROLE-ARN>:<ROLE-SESSION-NAME>#some-bucket/some-key')
If not, what are the best practices for working with IAM creds? Is it even possible?
I'm using Python 1.7 and PySpark 1.6.0
Thanks!
IAM role for accessing s3 is only support by s3a, because it is using AWS SDK.
You need to put hadoop-aws JAR and aws-java-sdk JAR (and third-party Jars in its package) into your CLASSPATH.
hadoop-aws link.
aws-java-sdk link.
Then set this in core-site.xml:
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
Hadoop 2.8+'s s3a connector supports IAM roles via a new credential provider; It's not in the Hadoop 2.7 release.
To use it you need to change the credential provider.
fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
fs.s3a.access.key = <your access key>
fs.s3a.secret.key = <session secret>
fs.s3a.session.token = <session token>
What is in Hadoop 2.7 (and enabled by default) is the picking up of the AWS_ environment variables.
If you set the AWS env vars for session login on your local system and the remote ones then they should get picked up.
I know its a pain, but as far as the Hadoop team are concerned Hadoop 2.7 shipped mid-2016 and we've done a lot since then, stuff which we aren't going to backport
IAM Role-based access to files in S3 is supported by Spark, you just need to be careful with your config. Specifically, you need:
Compatible versions of aws-java-sdk and hadoop-aws. This is quite brittle so only specific combinations work.
You must use the S3AFileSystem, not NativeS3FileSystem. The former permits role based access, whereas the later only allows user credentials.
To find out which combinations work, go to hadoop-aws on mvnrepository here. Click through the version of hadoop-aws you have look for the version of the aws-java-sdk compile dependency.
To find out what version of hadoop-aws you are using, in PySpark you can execute:
sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion()
where sc is the SparkContext
This is what worked for me:
import os
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'
sc = SparkContext.getOrCreate()
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark = SparkSession(sc)
df = spark.read.csv("s3a://mybucket/spark/iris/",header=True)
df.show()
It's the specific combination of aws-java-sdk:1.7.4 and hadoop-aws:2.7.1 that made it work. There is good guidance on troubleshooting s3a access here
Specially note that
Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.
Here is a useful post containing further information.
Here's some more useful information about compatibility between the java libraries
I was trying to get this to work in the jupyter pyspark notebook. Note that the aws-hadoop version had to match the hadoop install in the Dockerfile i.e. here.
You could try the approach in Locally reading S3 files through Spark (or better: pyspark).
However I've had better luck with setting environment variables (AWS_ACCESS_KEY_ID etc) in Bash ... pyspark will automatically pick these up for your session.
After more research, I'm convinced this is not yet supported as evidenced here.
Others have suggested taking a more manual approach (see this blog post) which suggests to list s3 keys using boto, then parallelize that list using Spark to read each object.
The problem here (and I don't yet see how they themselves get around it) is that the s3 objects given back from listing within a bucket are not serializable/pickle-able (remember : it's suggested that these objects are given to the workers to read in independent processes via map or flatMap). Furthering the problem is that the boto s3 client itself isn't serializable (which is reasonable in my opinion).
What we're left with is the only choice of recreating the assumed-role s3 client per file, which isn't optimal or feasible past a certain point.
If anyone sees any flaws in this reasoning or an alternative solution/approach, I'd love to hear it.
I'm working on an App Engine project (Python) where we'd like to make certain changes to the app's behavior when debugging/developing (most often locally). For example, when debugging, we'd like to disable our rate-limiting decorators, turn on the debug param in the WSGIApplication, maybe add some asserts.
As far as I can tell, App Engine doesn't naturally have any concept of a global dev-mode or debug-mode, so I'm wondering how best to implement such a mode. The options I've been able to come up with so far:
Use google.appengine.api.app_identity.get_default_version_hostname() to get the hostname and check if it begins with localhost. This seems... unreliable, and doesn't allow for using the debug mode in a deployed app instance.
Use os.environ.get('APPLICATION_ID') to get the application id, which according to this page is automatically prepended with dev~ by the development server. Worryingly, the very source of this information is in a box warning:
Do not get the App ID from the environment variable. The development
server simulates the production App Engine service. One way in which
it does this is to prepend a string (dev~) to the APPLICATION_ID
environment variable, which is similar to the string prepended in
production for applications using the High Replication Datastore. You
can modify this behavior with the --default_partition flag, choosing a
value of "" to match the master-slave option in production. Google
recommends always getting the application ID using get_application_id,
as described above.
Not sure if this is an acceptable use of the environment variable. Either way it's probably equally hacky, and suffers the same problem of only working with a locally running instance.
Use a custom app-id for development (locally and deployed), use the -A flag in dev_appserver.py, and use google.appengine.api.app_identity.get_application_id() in the code. I don't like this for a number of reasons (namely having to have two separate app engine projects).
Use a dev app engine version for development and detect with os.environ.get('CURRENT_VERSION_ID').split('.')[0] in code. When deployed this is easy, but I'm not sure how to make dev_appserver.py use a custom version without modifying app.yaml. I suppose I could sed app.yaml to a temp file in /tmp/ with the version replaced and the relative paths resolved (or just create a persistent dev-app.yaml), then pass that into dev_appserver.py. But that seems also kinda dirty and prone to error/sync issues.
Am I missing any other approaches? Any considerations I failed to acknowledge? Any other advice?
In regards to "detecting" localhost development we use the following in our applications settings / config file.
IS_DEV_APPSERVER = 'development' in os.environ.get('SERVER_SOFTWARE', '').lower()
That used in conjunction with the debug flag should do the trick for you.