Parsing env files - python

How do I safely store passwords and API keys in an .env file and properly parse them using Python?
I want to store passwords that I do not wish to push into public repos.

To read the values from an .env file you can use os.getenv(key), where you replace key with the name of the value you want to access.
Suppose the contents of the .env file are:
A=B
FOO=BAR
SECRET=VERYMUCH
You can parse the content like this:
import os
print(os.getenv("A"))
print(os.getenv("FOO"))
print(os.getenv("SECRET"))
# And so on...
Which would produce the following output (stdout):
B
BAR
VERYMUCH
Now, if you are using git version control and never want to push an environment file accidentally, add this to your .gitignore file and it will be conveniently ignored:
.env
(You can make your life a bit easier by starting from an existing .gitignore template; most of them ignore .env files by default.)
Lastly:
You can also load .env just like a text file:
with open(".env") as env:
...
and then parse it manually, for example with a regex or by splitting each line on =.
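For example, a minimal sketch of such a manual parser (assuming plain KEY=VALUE lines without quoting or multi-line values):
import os

with open(".env") as env:
    for line in env:
        line = line.strip()
        # Skip blanks, comments and anything without a KEY=VALUE shape
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ[key.strip()] = value.strip()

print(os.getenv("SECRET"))  # VERYMUCH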
If for some reason the .env file isn't picked up by your Python script, use the python-dotenv module.
You can use it like this:
from dotenv import load_dotenv
load_dotenv() # take environment variables from .env.
# Code of your application, which uses environment variables (e.g. from `os.environ` or
# `os.getenv`) as if they came from the actual environment.
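For instance, tying it back to the example .env file from the beginning (a minimal sketch):
import os
from dotenv import load_dotenv

load_dotenv()               # loads A, FOO and SECRET from .env
print(os.getenv("SECRET"))  # VERYMUCH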

Related

How to read values from a Kubernetes secret in a Python application and set them as environment variables

I am trying to get values from a Kubernetes secret in my Python application as environment variables. I see that the secrets are created as separate files and mounted on a specific path (in my case I mount them on etc/secrets/azure-bs). There are five secret files, namely:
accessKeyId
bucket.properties
storageAccount
key.json
bucketName.
Now, bucket.properties has some key-value pairs. There is a property_source parser used in the application (it is abstracted away from my team) that usually parses secret values. However, I am only able to parse bucket.properties since it has key-value pairs. I want to be able to read the contents of the other files and store them as environment variables, but I am not sure how to go about that. The contents of these other files are not in the "key=value" format; instead, the key is the filename itself and the value is the contents of the file.
Without more details on your question, I can tell you that you can certainly create env vars from secrets:
envFrom:
  - secretRef:
      name: mysecret
or
env:
  - name: SECRET_USERNAME
    valueFrom:
      secretKeyRef:
        name: othersecret
        key: username
More details here:
https://kubernetes.io/docs/concepts/configuration/secret/
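If you would rather handle it on the Python side, here is a minimal sketch that reads each mounted secret file and exports it as an environment variable named after the file (the mount path is taken from the question and assumed to be absolute; this is only an illustration, not the property_source parser mentioned above):
import os
from pathlib import Path

SECRETS_DIR = Path("/etc/secrets/azure-bs")  # mount path from the question (assumed absolute)

# Each file name becomes the variable name; the file contents become the value.
for secret_file in SECRETS_DIR.iterdir():
    if secret_file.is_file():
        os.environ[secret_file.name] = secret_file.read_text().strip()

print(os.environ.get("storageAccount"))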

Handle environmental variables in config options

I have snakemake command line with configuration options like this:
snakemake --config \
f1=$PWD/file1.txt \
f2=$PWD/file2.txt \
f3=/path/to/file3.txt \
...more key-value pairs \
--directory /path/to/output/dir
file1.txt and file2.txt are expected to be in the same directory as the Snakefile; file3.txt is somewhere else. I need the paths to the files to be absolute, hence the $PWD variable, so Snakemake can find the files after moving to /path/to/output/dir.
Because I start having several configuration options, I would like to move all the --config items to a separate yaml configuration file. The problem is: How do I transfer the variable $PWD to a configuration file?
I could have a dummy string in the yaml file indicating that that string is to be replaced by the directory where the Snakefile is (e.g. f1: <replace me>/file1.txt) but I feel it's awkward. Any better ideas? It may be that I should rethink how the files fileX.txt are passed to snakemake...
You can access the directory the Snakefile lives in with workflow.basedir. You might be able to get away with specifying the relative path in the config file and then building the absolute path in your Snakefile, e.g. as
import pathlib

file1 = pathlib.Path(workflow.basedir) / config["f1"]
file2 = pathlib.Path(workflow.basedir) / config["f2"]
One option is to use an external module, intake, to handle the environment variable integration. There is a similar answer, but a more specific example for this question is as follows.
A YAML file which follows the syntax expected by intake has a field called sources that contains nested entries specifying, at the very least, a (possibly local) url at which the file can be accessed:
# config.yml
sources:
  file1:
    args:
      url: "{{env(PWD)}}/file1.txt"
  file2:
    args:
      url: "{{env(PWD)}}/file2.txt"
Inside the Snakefile, the relevant code would be:
import intake
cat = intake.open_catalog('config.yml')
f1 = cat['file1'].urlpath
f2 = cat['file2'].urlpath
Note that for less verbose yaml files, intake provides syntax for parameterization, see the docs or this example.

How to access DVC-controlled files from Oracle?

I have been storing my large files in CLOBs within Oracle, but I am thinking of storing my large files in a shared drive, then having a column in Oracle contain pointers to the files. This would use DVC.
When I do this,
(a) are the paths in Oracle paths that point to the files in my shared drive, as in, the actual files themselves?
(b) or do the paths in Oracle point somehow to the DVC metafile?
Any insight would help me out!
Thanks :)
Justin
EDIT to provide more clarity:
I checked here (https://dvc.org/doc/api-reference/open), and it helped, but I'm not fully there yet ...
I want to pull a file from a remote dvc repository using python (which I have connected to the Oracle database). So, if we can make that work, I think I will be good. But, I am confused. If I specify 'remote' below, then how do I name the file (e.g., 'activity.log') when the remote files are all encoded?
with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
(NOTE: For testing purposes, my "remote" DVC directory is just another folder on my MacBook.)
I feel like I'm missing a key concept about getting remote files ...
I hope that adds more clarity. Any help figuring out remote file access is appreciated! :)
Justin
EDIT to get insights on 'rev' parameter:
Before my question, some background/my setup:
(a) I have a repo on my MacBook called 'basics'.
(b) I copied into 'basics' a directory of 501 files (called 'surface_files') that I subsequently pushed to a remote storage folder called 'gss'. After the push, 'gss' contains 220 hash directories.
The steps I used to get here are as follows:
> cd ~/Desktop/Work/basics
> git init
> dvc init
> dvc add ~/Desktop/Work/basics/surface_files
> git add .gitignore surface_files.dvc
> git commit -m "Add raw data"
> dvc remote add -d remote_storage ~/Desktop/Work/gss
> git commit .dvc/config -m "Configure remote storage"
> dvc push
> rm -rf ./.dvc/cache
> rm -rf ./surface_files
Next, I ran the following Python code to take one of my surface files, named surface_100141.dat, and used dvc.api.get_url() to get the corresponding remote storage file name. I then copied this remote storage file into my desktop under the file's original name, i.e., surface_100141.dat.
The code that does all this is as follows, but FIRST, MY QUESTION --- when I run the code as it is shown below, no problems; but when I uncomment the 'rev=' line, it fails. I am not sure why this is happening. I used git log and cat .git/refs/heads/master to make sure that I was getting the right hash. WHY IS THIS FAILING? That is my question.
(In full disclosure, my git knowledge is not too strong yet. I'm getting there, but it's still a work in progress! :))
import dvc.api
import os.path
from os import path
import shutil
filename = 'surface_100141.dat' # This file name would be stored in my Oracle database
home_dir = os.path.expanduser('~')+'/' # This simply expands '~' into '/Users/ricej/'
resource_url = dvc.api.get_url(
    path=f'surface_files/{filename}', # Works when 'surface_files.dvc' exists, even when 'surface_files' directory and .dvc/cache do not
    repo=f'{home_dir}Desktop/Work/basics',
    # rev='5c92710e68c045d75865fa24f1b56a0a486a8a45', # Commit hash, found using 'git log' or 'cat .git/refs/heads/master'
    remote='remote_storage')
resource_url = home_dir+resource_url
print(f'Remote file: {resource_url}')
new_dir = f'{home_dir}Desktop/' # Will copy fetched file to desktop, for demonstration
new_file = new_dir+filename
print(f'Remote file copy: {new_file}')
if path.exists(new_file):
    os.remove(new_file)
dest = shutil.copy(resource_url, new_file) # Check your desktop after this to see remote file copy
I'm not 100% sure that I understand the question (it would be great to expand it a bit on the actual use case you are trying to solve with this database), but I can share a few thoughts.
When we talk about DVC, I think you need to specify a few things to identify the file/directory:
Git commit + path (actual path like data/data/xml). Commit (or to be precise any Git revision) is needed to identify the version of the data file.
Or a path in the DVC storage (/mnt/shared/storage/00/198493ef2343ao ...) + the actual name of this file. This way you would be saving the info that .dvc files have.
I would say that the second way is not recommended since, to some extent, it's an implementation detail (how DVC stores files internally). The public interface to DVC-organized data storage is its repository URL + commit + file name.
Edit (example):
with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
location/of/dvc/project: this path must point to an actual Git repo. This repo should have a .dvc or dvc.lock file that contains the activity.log name plus its hash in the remote storage:
outs:
- md5: a304afb96060aad90176268345e10355
  path: activity.log
By reading this Git repo and analyzing, say, activity.log.dvc, DVC will be able to construct the right path: s3://my-bucket/storage/a3/04afb96060aad90176268345e10355
The remote='my-s3-bucket' argument is optional. By default, DVC will use the remote that is defined in the repo itself.
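To pin the data to a specific version (the first way described above: repo + Git revision + path), dvc.api.open also accepts a rev argument; a sketch with a placeholder revision:
import re
import dvc.api

# rev can be any Git revision: a commit hash, branch or tag ('v1.0.0' is a placeholder)
with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    rev='v1.0.0'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log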
Let's take another real example:
with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
In https://github.com/iterative/dataset-registry you can find the .dvc file which, together with the repo's config, is enough for DVC to construct a path to the file:
https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
You can run wget on this URL to download the file.
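If you only need the resolved storage path rather than the contents, dvc.api.get_url performs the same resolution; a small sketch against the same dataset registry:
import dvc.api

# Resolves the path/repo pair to the remote storage URL shown above
url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
)
print(url)  # e.g. https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355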

Ansible module to compare files in 2 folders and delete extras

I have 2 folders, one on my server and one on my Ansible server, and I want to make sure that there are no extra files on my server (such as .bk files or files that aren't used anymore) and to be able to exclude files ending with .xyz. Is there a module or a way this can be done in Ansible? I don't want to use cmd or shell tasks but wasn't able to find any module.
Folder 1 :
Bla.txt
Hello.jar
Folder 2:
Bla.txt
Hello.jar
info.log
bla.txt.bk
I would like bla.txt.bk to be deleted from folder 2.
Thank you in advance
An option would be to use synchronize.
"To make sure that there are no extra files", use the delete parameter or the rsync_opts parameter.
"To exclude files ending with .xyz", use the rsync_opts parameter.
For example (not tested):
- synchronize:
    src: folder_1
    dest: folder_2
    rsync_opts:
      - "--exclude=*.xyz"
      - "--delete"
To exclude a list of patterns, use --exclude-from=FILE, which reads the exclude patterns from FILE.

Need a file system metadata layer for applications

I'm looking for a metadata layer that sits on top of files which can interpret key-value pairs of information in file names for apps that work with thousands of files. More info:
These aren't necessarily media files that have built-in metadata - hence the key-value pairs.
The metadata goes beyond os information (file sizes, etc) - to whatever the app puts into the key-values.
It should be accessible via command line as well as a python module so that my applications can talk to it.
ADDED: It should also be supported by common os commands (cp, mv, tar, etc) so that it doesn't get lost if a file is copied or moved.
Examples of functionality I'd like include:
list files in directory x for organization_id 3375
report on files in directory y by translating load_time to year/month and show file count & size for each year/month combo
get oldest file in directory z based upon key of loadtime
Files with this simple metadata embedded within them might look like:
bowling_state-ky_league-15_game-8_gametime-201209141830.tgz
bowling_state-ky_league-15_game-9_gametime-201209141930.tgz
This metadata is very accessible & tightly joined to the file. But - I'd prefer to avoid needing to use cut or wild-cards for all operations.
I've looked around and can only find media & os metadata solutions and don't want to build something if it already exists.
Have you looked at extended file attributes? See: http://en.wikipedia.org/wiki/Extended_file_attributes
Basically, you store the key-value pairs as zero terminated strings in the filesystem itself. You can set these attributes from the command line like this:
$ setfattr -n user.comment -v "this is a comment" testfile
$ getfattr testfile
# file: testfile
user.comment
$ getfattr -n user.comment testfile
# file: testfile
user.comment="this is a comment"
To set and query extended file system attributes from python, you can try the python module xattr. See: http://pypi.python.org/pypi/xattr
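For instance, a minimal sketch using the standard library's os functions on Linux (Python 3.3+), which mirror the setfattr/getfattr commands above; the third-party xattr module offers a similar, more portable interface:
import os

path = "testfile"
os.setxattr(path, "user.comment", b"this is a comment")  # like setfattr -n user.comment -v "..."
print(os.listxattr(path))                                # ['user.comment']
print(os.getxattr(path, "user.comment").decode())        # this is a comment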
EDIT
Extended attributes are supported by most filesystem manipulation commands, such as cp, mv and tar, by adding command line flags, e.g. cp -a or tar --xattrs. You may need to make these commands work transparently (you may have users who are unaware of your extended attributes); in this case you can create an alias, e.g. alias cp="cp -a".
As already discussed, xattrs are a good solution when available. However, when you can't use xattrs:
NTFS alternate data streams
On Microsoft Windows, xattrs are not available, but NTFS alternate data streams provide a similar feature. ADSs let you store arbitrary amounts of data together with the main stream of a file. They are accessed using
drive:\path\to\file:streamname
An ADS is effectively just its own file with a special name. Apparently you can access them from Python by specifying a filename containing a colon:
open(r"drive:\path\to\file:streamname", "wb")
and then using it like an ordinary file. (Disclaimer: not tested.)
From the command line, use Microsoft's streams program.
Since ADSs store arbitrary binary data, you are responsible for writing the querying functionality.
SQLite
SQLite is an embedded RDBMS that you can use. Store the .sqlite database file alongside your directory tree.
For each file you add, also record it in a table:
CREATE TABLE file (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT
);
Then, for example, you could store a piece of metadata as a table:
CREATE TABLE organization_id (
    file_id INTEGER PRIMARY KEY,
    value INTEGER,
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Then you can query on it:
SELECT path FROM file NATURAL JOIN organization_id
WHERE value == 3375 AND path LIKE '/x/%';
Alternatively, if you want a pure key-value store, you can store all the metadata in one table:
CREATE TABLE metadata (
    file_id INTEGER,
    key TEXT,
    value TEXT,
    PRIMARY KEY(file_id, key),
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Query:
SELECT path FROM file NATURAL JOIN metadata
WHERE key == 'organization_id' AND value == 3375 AND path LIKE '/x/%';
Obviously it's your responsibility to update the database whenever you read or write a file. You must also make sure these updates are atomic (e.g. add a column active to the file table; when adding a file: set active = FALSE, write file, fsync, set active = TRUE, and as cleanup delete any files that have active = FALSE).
The Python standard library includes SQLite support as the sqlite3 package.
From the command line, use the sqlite3 program.
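For instance, a minimal Python sketch (paths and values are illustrative) that registers a file in the key-value schema above and runs the "files in directory x for organization_id 3375" query:
import sqlite3

conn = sqlite3.connect("metadata.sqlite")
conn.executescript("""
CREATE TABLE IF NOT EXISTS file (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT
);
CREATE TABLE IF NOT EXISTS metadata (
    file_id INTEGER,
    key TEXT,
    value TEXT,
    PRIMARY KEY(file_id, key),
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
""")

# Register a file and attach a piece of metadata to it
cur = conn.execute("INSERT INTO file (path) VALUES (?)",
                   ("/x/bowling_state-ky_league-15_game-8_gametime-201209141830.tgz",))
file_id = cur.lastrowid
conn.execute("INSERT INTO metadata (file_id, key, value) VALUES (?, ?, ?)",
             (file_id, "organization_id", "3375"))
conn.commit()

# "list files in directory x for organization_id 3375"
for (path,) in conn.execute(
        "SELECT path FROM file NATURAL JOIN metadata "
        "WHERE key = 'organization_id' AND value = '3375' AND path LIKE '/x/%'"):
    print(path)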
Note that xattrs have size limits (ext4 allows up to 4 KB) and on Linux the xattr key must be prefixed with 'user.'.
Also, not all file systems support xattrs.
Try the iDB.py library, which wraps xattr and can easily be switched to disable xattr support.
