I am using Azure Batch with Python and I would like to create a directory within the shared space from a batch task.
According to the docs:
Shared: This directory provides read/write access to all tasks that run on a node. Any task that runs on the node can create, read, update, and delete files in this directory. Tasks can access this directory by referencing the AZ_BATCH_NODE_SHARED_DIR environment variable.
Imagine that folder is called test_dir:
if not os.path.exists('test_dir'):
    os.makedirs('test_dir')
Now, what if I want to write a file to that directory? I cannot use:
with open('$AZ_BATCH_NODE_SHARED_DIR/test_dir/test.txt', 'a') as output:
    output.write('hello\n')
How do I get the full path from $AZ_BATCH_NODE_SHARED_DIR?
Use os.environ, which exposes the current environment as a mapping:
import os

shared = os.environ['AZ_BATCH_NODE_SHARED_DIR']
with open(os.path.join(shared, 'test_dir', 'test.txt'), 'a') as output:
    output.write('hello\n')
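And since the original goal was to create test_dir inside the shared space, the directory itself can be created under that same resolved path, for example:

import os

shared = os.environ['AZ_BATCH_NODE_SHARED_DIR']
target_dir = os.path.join(shared, 'test_dir')

# exist_ok=True makes this safe if another task already created the directory
os.makedirs(target_dir, exist_ok=True)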
Hi community, I need some help.
I have a GCS bucket called "staging". This bucket contains folders and subfolders (see picture).
There may be several "date folders" (e.g. 20221128). Each date folder has 3 subfolders; I'm interested in "main_folder". The main_folder has 2 project folders, each project folder has several subfolders, and each of those subfolders contains a .txt file.
The main objective is:
1. Obtain a list of all the paths to the .txt files (e.g. gs://staging/20221128/main_folder/project_1/subfold_1/file.txt, ...)
2. Export the list to an Airflow Variable
3. Use the "list" Variable to run some DAGs dynamically.
The folders in the staging bucket may vary every day, so I don't have static paths.
I'm using Apache Beam with the Python SDK on Cloud Dataflow, and Airflow on Cloud Composer.
Is there a way to obtain the list of paths (like os.listdir() in Python) with Beam, and to schedule this workflow daily? (I need to overwrite the "list" Variable every day with new paths.)
For example, I can achieve step 1 (locally) with the following Python script:

import os

def collect_paths(startpath="C:/Users/PycharmProjects/staging/"):
    list_paths = []
    for path, dirs, files in os.walk(startpath):
        for f in files:
            file_path = os.path.join(path, f)
            list_paths.append(file_path)
    return list_paths
Thank you all in advance.
Edit n.1.
I've retrieved the file paths thanks to the google.cloud.storage API in my collect_paths script. Now I want to access XCom and get the list of paths. This is my task instance:
collect_paths_job = PythonOperator(
    task_id='collect_paths',
    python_callable=collect_paths,
    op_kwargs={'bucket_name': 'staging'},
    do_xcom_push=True
)
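For reference, collect_paths now looks roughly like this (just a sketch using the google.cloud.storage client; the main_folder filter and the naming are assumptions):

from google.cloud import storage

def collect_paths(bucket_name):
    # List every object in the bucket and keep the .txt files under main_folder
    client = storage.Client()
    list_paths = []
    for blob in client.list_blobs(bucket_name):
        if "/main_folder/" in blob.name and blob.name.endswith(".txt"):
            list_paths.append("gs://{}/{}".format(bucket_name, blob.name))
    return list_paths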
I want to iterate over the list in order to run (in the same DAG) N concurrent tasks, each processing a single file. I tried:
files = collect_paths_job.xcom_pull(task_ids='collect_paths', include_prior_dates=False)
for f in files:
    job = get_job_operator(f)
    chain(job)
But got the following error:
TypeError: xcom_pull() missing 1 required positional argument: 'context'
First, a correction on your usage of the term Variable: Airflow attaches a special meaning to that object. What you actually want is for the file info to be accessible as parameters in a task.
Use XCom
Assume your DAG has a Python task called list_files_from_gcs.
This task runs exactly the collect_paths function you have written. Since that function returns a list, Airflow automatically pushes the return value into XCom, so you can access this information from any subsequent task in your DAG.
Your subsequent task can again be a Python task in the same DAG, in which case you can access XCom very easily:
@task
def next_task(xyz, abc, **context):
    ti = context['ti']
    files_list = ti.xcom_pull(task_ids='list_files_from_gcs')
    ...
If you are instead looking to trigger an entirely different DAG, you can use TriggerDagRunOperator and pass this list as the dag_run conf, like this:
TriggerDagRunOperator(
    conf={
        "gcs_files_list": "{{ task_instance.xcom_pull(task_ids='list_files_from_gcs') }}"
    },
    ....
    ....
)
Then your triggered DAG can just parse the DAG run config to move ahead.
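A minimal sketch of how the triggered DAG might read that list (names are illustrative; note that the Jinja-templated conf value arrives as a string unless the DAG rendering the template sets render_template_as_native_obj=True, so you may need to parse it):

from airflow.decorators import task

@task
def handle_gcs_files(**context):
    # dag_run.conf carries whatever the TriggerDagRunOperator passed in conf
    files_list = context["dag_run"].conf.get("gcs_files_list", [])
    for path in files_list:
        print(path)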
I am using pathlib to set up a folder structure, for which I'd like permissions set to drwxrwx--- (770) for all folders in the tree.
My current code is:
p=Path('name/{}/{}/{}/category'.format(year,month,day))
pp=Path('name/{}/{}/{}'.format(year,month,day))
p.mkdir(mode=0o770,parents=True,exist_ok=True)
I need exist_ok=True as I want the same line to work as I loop through category values. However, while testing this I'm deleting the folders.
After running,

oct(p.stat().st_mode)   # 0o40770
oct(pp.stat().st_mode)  # 0o40775

i.e., the parent directories were created with the default permissions (0o777 masked by umask 002, giving 775) rather than 770.
The only way around this I can think of, which seems inefficient, is:
p.mkdir(mode=0o770,parents=True,exist_ok=True)
os.system("chmod -R 770 {}".format(name))
Is there a way to apply the desired permissions with the Path().mkdir() call, or is the os.system() call unavoidable?
The documentation for Path.mkdir mentions this behavior:
If parents is true, any missing parents of this path are created as needed; they are created with the default permissions without taking mode into account (mimicking the POSIX mkdir -p command).
One way to avoid this would be to iterate over each path's parts or parents yourself, calling mkdir with exist_ok but without parents on each. That way, the missing directories are still created, but the mode is taken into account. That would look something like:
# reversed() walks from the outermost ancestor down to the immediate parent
for parent in reversed(p.parents):
    parent.mkdir(mode=0o770, exist_ok=True)
p.mkdir(mode=0o770, exist_ok=True)
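If you need this in more than one place, a small helper along these lines (just a sketch; the mode is still subject to the process umask, and directories that already exist are left unchanged) keeps it tidy:

from pathlib import Path

def mkdir_with_mode(path, mode=0o770):
    # Create path and any missing parents, applying mode to each newly created level
    path = Path(path)
    for parent in reversed(path.parents):
        parent.mkdir(mode=mode, exist_ok=True)
    path.mkdir(mode=mode, exist_ok=True)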
Working with scientific data, specifically climate data, I am constantly hard-coding paths to data directories in my Python code. Even if I were to write the most extensible code in the world, the hard-coded file paths prevent it from ever being truly portable. I also feel like having information about the file system of your machine coded in your programs could be a security issue.
What solutions are out there for handling the configuration of paths in Python to avoid having to code them out explicitly?
One solution relies on using configuration files.
You can store all your paths in a JSON file like so:
{
    "base_path": "/home/bob/base_folder",
    "low_temp_area_path": "/home/bob/base_folder/low_temp"
}
and then in your Python code, you could just do:

import json

with open("conf.json") as json_conf:
    CONF = json.load(json_conf)
and then you can use your paths (or any configuration variable you like) like so:

print("The base path is {}".format(CONF["base_path"]))
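Building concrete file paths from the configured base then stays in one place; for example (the sub-path is just illustrative):

import os

data_file = os.path.join(CONF["base_path"], "low_temp", "2022", "temps.csv")
with open(data_file) as fp:
    data = fp.read()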
First off, it's always good practice to add a main section to each file so you can test the classes or functions it contains. Along with this, you can determine the directory the script lives in, which becomes incredibly important when running Python from a cron job or from a directory that is not the current working directory. No JSON files or environment variables are then needed, and you get interoperation across Mac, RHEL and Debian distributions.
This is how you do it, and it will also work on Windows if you use '\' instead of '/' (if that is even necessary in your case).
if "__main__" == __name__:
workingDirectory = os.path.realpath(sys.argv[0])
As you can see, the directory is resolved whether you provide a full path or a relative path when running your command, meaning it will also work automatically from a cron job.
After that, if you want to work with data stored in a subfolder of that directory, use:
fileName = os.path.join(workingDirectory, 'sub-folder-of-current-directory', 'filename.csv')
fp = open(fileName, 'r')
or, for a folder parallel to your project directory:

fileName = os.path.join(workingDirectory, '..', 'folder-at-same-level-as-my-project', 'filename.csv')
fp = open(fileName, 'r')
I believe there are many ways around this, but here is what I would do:
Create a JSON config file with all the paths I need defined.
For even more portability, I'd have a default path where I look for this config file but also have a command line input to change it.
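A minimal sketch of that idea, assuming a conf.json like the one above (the default location is just an example):

import argparse
import json
import os

parser = argparse.ArgumentParser()
parser.add_argument(
    "--config",
    default=os.path.expanduser("~/.config/climate/conf.json"),  # assumed default location
    help="Path to the JSON config file that defines the data directories",
)
args = parser.parse_args()

with open(args.config) as fh:
    conf = json.load(fh)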
In my opinion, passing arguments from the command line would be the best solution. You should take a look at argparse. It gives you a clean way to handle arguments from the command line. For example:
myDataScript.py /home/userName/datasource1/
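A short argparse sketch that would accept that invocation (the argument name is illustrative):

import argparse

parser = argparse.ArgumentParser(description="Process climate data")
parser.add_argument("data_dir", help="Directory containing the input data files")
args = parser.parse_args()

print("Reading data from", args.data_dir)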
I have a string of Java source code in Python that I want to compile, execute, and collect the output (stdout and stderr). Unfortunately, as far as I can tell, javac and java require real files, so I have to create a temporary directory.
What is the best way to do this? The tempfile module seems to be oriented towards creating files and directories that are only visible to the Python process. But in this case, I need Java to be able to see them too. However, I also want everything else to be handled intelligently if possible (such as deleting the folder when done, or using the appropriate system temp folder).
tempfile.NamedTemporaryFile and tempfile.TemporaryDirectory work perfectly fine for your purposes. The resulting objects have a .name attribute that provides a file-system-visible name that java/javac can handle just fine. Just make sure to:
Set the suffix appropriately if the compiler insists on files being named with a .java extension
Always call .flush() on the file handle before handing the .name of a NamedTemporaryFile to an external process or it may (usually will) see an incomplete file
If you don't want Python cleaning up the files when you close the objects, either pass delete=False to NamedTemporaryFile's constructor, or use the mkstemp and mkdtemp functions (which create the files and directories, but don't clean them up for you).
So for example, you might do:
import os
import subprocess
import tempfile

# Create temporary directory for source and class files
with tempfile.TemporaryDirectory() as d:
    # Write source code; the file name should match the public class name
    srcpath = os.path.join(d, "myclass.java")
    with open(srcpath, "w") as srcfile:
        srcfile.write('source code goes here')
    # Compile source code
    subprocess.check_call(['javac', srcpath])
    # Run the compiled class: pass the bare class name (no extension) and
    # point the classpath at the temporary directory
    classname = os.path.splitext(os.path.basename(srcpath))[0]
    subprocess.check_call(['java', '-cp', d, classname])
# when the with block ends, the temp directory is cleaned up
tempfile.mkstemp creates a file that is normally visible in the filesystem and returns you the path as well. You should be able to use it to create your input and output files; assuming javac atomically overwrites the output file if it already exists, there should be no race condition as long as other processes on your system don't misbehave.
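For example, a sketch using mkstemp (cleanup is then your responsibility):

import os
import tempfile

# mkstemp returns an open OS-level file descriptor and the real on-disk path;
# the file is visible to other processes and is not deleted automatically.
fd, srcpath = tempfile.mkstemp(suffix=".java")
try:
    with os.fdopen(fd, "w") as src:
        src.write("// source code goes here")
    # hand srcpath to javac here
finally:
    os.remove(srcpath)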
I am using Django views, and I create a temp_dir using tempfile.gettempdir().
I write a gzipped text file in there, and then scp the file elsewhere. When these tasks are complete I try to delete the temp_dir.
if os.path.exists(temp_dir):
    shutil.rmtree(temp_dir)
However, occasionally I get this error back:
Operation not permitted: '/tmp/.ICE-unix'
Any ideas what this error means and how to best handle this situation?
tempfile.gettempdir() does not create a temp directory - it returns your system's standard tmp directory. DO NOT DELETE IT! That will blow everybody's temp files away. You can delete the file you created inside the temp dir, or you can create your own temp dir, but leave this one alone.
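If you want a directory you can safely remove afterwards, a sketch along these lines creates a private one with mkdtemp instead of touching /tmp itself:

import shutil
import tempfile

# mkdtemp creates a fresh directory under the system temp dir and returns its path
temp_dir = tempfile.mkdtemp()
try:
    # write the gzipped file into temp_dir and scp it from here
    ...
finally:
    shutil.rmtree(temp_dir)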
The value for temp_dir is taken from the OS environment variables, and apparently some other process is also using that directory to create files. The other file might be in use or locked, and that will prevent you from deleting it.
Q: What is /tmp/.ICE-unix?
A: It's a directory where X Window System session information is saved.
I am no expert, but try running the Python program (or whatever you are using to do this) as an administrator; that will most likely allow the operation to complete...