How do I move a file from local to HDFS in Python?

I have a script that checks a directory for files. If the right files (those containing a keyword) are present, I want to move them to an HDFS location.
import os
import subprocess
tRoot = "/tmp/mike"
keyword = "test"
for root, dirs, files in os.walk(tRoot):
    for file in files:
        if keyword in file:
            fullPath = str(os.path.join(root, file))
            subprocess.call(['hdfs', 'dfs', '-copyFromLocal','/tmp/mike/test*','hdfs:///user/edwaeadt/app'], shell=True)
I'm seeing the error below:
Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
I also tried:
subprocess.call(['hadoop', 'fs', '-copyFromLocal', '/tmp/mike/test*', 'hdfs:///user/edwaeadt/app'], shell=True)
But then I see:
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
CLASSNAME
Also, the copy seems to run three times: the file does end up in the HDFS location, but I also get a "file exists" message twice. Any ideas?

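If you are intent on using subprocess with shell=True, pass the whole command as a single string. On POSIX, when shell=True is combined with a list, only the first item is run as the command and the remaining items become arguments to the shell itself, so hdfs starts with no arguments and prints its usage message. Since the wildcard already covers every matching file, run the command once rather than inside the loop; calling it for each matching file is the likely reason you see the "file exists" message. The command should read:
subprocess.call('hadoop fs -copyFromLocal /tmp/mike/test* hdfs:///user/edwaeadt/app', shell=True)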
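A shell is not strictly required here. As a rough sketch, the wildcard can be expanded in Python with glob and the result passed to subprocess as one argument list. The helper name build_copy_command is made up for this example, and the paths are the ones from the question; the sketch assumes the hdfs client is on PATH and that the matching files sit directly in the local directory.

```python
import glob
import os
import shutil
import subprocess


def build_copy_command(local_dir, keyword, hdfs_dest):
    """Return an argv list copying every file in local_dir whose name contains keyword.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    # Expand the wildcard in Python because no shell is involved.
    matches = sorted(glob.glob(os.path.join(local_dir, "*" + keyword + "*")))
    files = [m for m in matches if os.path.isfile(m)]
    if not files:
        return None
    # All local sources go first, the HDFS destination goes last.
    return ["hdfs", "dfs", "-copyFromLocal"] + files + [hdfs_dest]


if __name__ == "__main__":
    cmd = build_copy_command("/tmp/mike", "test", "hdfs:///user/edwaeadt/app")
    # Only run when there is something to copy and the hdfs client is installed.
    if cmd is not None and shutil.which("hdfs"):
        subprocess.call(cmd)  # one call for all matches, so nothing is copied twice
```

If the files can also live in subdirectories, the glob call could be swapped for the os.walk loop from the question, collecting fullPath values into the list instead of calling hdfs per file.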

Related

How do I select a file in my directory whose suffix is the name of another file?

I am trying to automate a program in Python that runs in terminal on Ubuntu.
I have a lot of files in my directory, and each file has a sister file associated with it, all in the same directory. The names of the files start with 1 and go all the way up to 500.
For example, there will be files like 1.mp3, 1_hello.mp4, 1_something.mp4, 1_something_else.mp3 and 2.mp3, 2_hello.mp4, 2_something.mp4, 2_something_else.mp3.
I am only concerned with 1.mp3 and 1_hello.mp4, and similarly for all the other numbers in my directory. Other files with the same prefix don't matter to me. How can I automate this? I have tried this, but it doesn't work.
import os
directory_name = '/home/user/folder/'
for file_name in os.listdir(directory_name):
    if file_name.endswith(".mp3"):
        for sis_file in os.listdir(directory_name):
            if sis_file.endswith("file_name._hello.mp4"):
                os.system("command file_name -a file_name.sis_file file_name.new_file")
The command is written in the os.system line. new_file is created as the result of this operation, and it too must carry the number of the original file so that it is easily identifiable. Also, for the command in os.system, each file must be paired only with its sister file, or I will get inconsistent results.
Edit: I will have a lot of files in my directory. All the files are sequentially numbered, beginning from 1. Each file has many other files associated with it, and all these sister files have the same prefix as the parent file. The command that I have to automate looks like this:
command name_of_the_parent_file.mp3 -a name_of_the_sister_file.txt name_of_the_output_file.mp4
name_of_the_parent_file would be like 1.mp3, 2.mp3, 3.mp3
name_of_the_sister_file would be like 1_hello.txt, 2_hello.txt
name_of_the_output_file would be the name of the new file that this command creates.

If I understand you correctly, you need to run this command on every parent file, combined with each of its paired child files individually:
import os
from glob import glob
files = glob('/home/user/folder/*.*') # get a list of files in the folder
# parent files are those whose base name (without extension) is a number, e.g. 1.mp3
parent_files = [f for f in files if os.path.splitext(os.path.basename(f))[0].isdigit()]
for parent in parent_files:
    prefix = os.path.splitext(os.path.basename(parent))[0] + '_'
    # run the command on the files that start with the same number as the parent file
    for child in [f for f in files if os.path.basename(f).startswith(prefix)]:
        # I have no idea what the output file should look like, so I named it after the child file with a placeholder extension
        os.system(f"command {parent} -a {child} {os.path.splitext(child)[0]}.extension")
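If only the _hello sister file matters, as the question states, the pairing can be looked up directly instead of scanning every file with the same number. The following is a minimal sketch under assumptions: the function name build_commands and the output suffix _new.mp4 are invented for illustration, and "command" is the placeholder program name from the question.

```python
from pathlib import Path


def build_commands(folder, sister_suffix="_hello.mp4", output_suffix="_new.mp4"):
    """Return one argv list per numbered .mp3 that has a matching sister file.

    Hypothetical helper; the suffix defaults are assumptions, not from the question.
    """
    folder = Path(folder)
    commands = []
    # Keep only purely numeric names and sort numerically so 2.mp3 precedes 10.mp3.
    parents = sorted(
        (p for p in folder.glob("*.mp3") if p.stem.isdigit()),
        key=lambda p: int(p.stem),
    )
    for parent in parents:
        sister = folder / (parent.stem + sister_suffix)
        if not sister.is_file():
            continue  # skip parents without the expected sister file
        output = folder / (parent.stem + output_suffix)
        # "command" is the placeholder program name used in the question.
        commands.append(["command", str(parent), "-a", str(sister), str(output)])
    return commands


if __name__ == "__main__":
    for argv in build_commands("/home/user/folder"):
        print(" ".join(argv))
```

Each returned list can be handed to subprocess.call, which avoids quoting problems with unusual file names.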

This is what I gathered from your question:
from glob import glob
import os
import re
path = glob("/home/testing/*") # your file path here
# first number in each file name, e.g. 11 for 11_something.csv
individual_files = [int(re.findall(r'\d+', os.path.basename(x))[0]) for x in path]
set_files = list(set(individual_files))
for i in set_files:
    files = [path[idx] for idx, x in enumerate(individual_files) if x == i]
    print(*files, sep = "\n")
    print("\n\n")
The output groups all the numbered files in the folder:
/home/testing/1.csv
/home/testing/1_something_else.csv
/home/testing/1_something.csv
/home/testing/2_something.csv
/home/testing/2_something_else.csv
/home/testing/2.csv
/home/testing/11_something.csv
/home/testing/11_something_else.csv
/home/testing/11.csv
/home/testing/111.csv
/home/testing/111_something.csv
/home/testing/111_something_else.csv
/home/testing/111_testing.txt

python: how to check if a specific pattern log file is there if a directory exists

I am running a command (ls -ltr /a/b/c/filename*.log) in a shell from Python code. The problem is that if the directory /a/b/c doesn't exist, the command generates an error, and it also fails if no log file matches the pattern filename*.log. So I want to check for both the directory and the file pattern before running the command in the shell. So far, I have tried the code below.
def subprocess_cmd(command):
    process = subprocess.Popen(command,stdout=subprocess.PIPE, shell=True)
    proc_stdout = process.communicate()[0].strip()
    return proc_stdout
DIR_PATH = "/a/b/c"
if os.path.exists(DIR_PATH):
    files = os.listdir(DIR_PATH)
    if "filename" in BASENAME for BASENAME in files:
        CMD = "ls -ltr /a/b/c/filename*.log"
        LOGFILE = subprocess_cmd(CMD)
I get the error below:
if "filename" in BASENAME for BASENAME in files:
^
SyntaxError: invalid syntax
Please note that I am not checking whether a particular file exists or not. I am concerned about the pattern.

Use
try:
    ....
except Exception as e:
    # some code to handle the missing directories

Try
if [f for f in files if f.startswith(BASENAME)]:
    ...
This builds a list of the files whose names begin with BASENAME (here, the prefix you are looking for, such as "filename") and checks that the list is not empty.
Note that it does NOT check for the extension (but you may of course add f.endswith(".log") as a second condition).
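As an alternative sketch, glob can cover both checks at once, because glob.glob returns an empty list when the directory is missing as well as when nothing matches. The helper name find_logs is hypothetical; the directory and pattern are taken from the question. (The SyntaxError in the question comes from using a bare generator expression as a condition; wrapping it in any(...) would also make that line valid.)

```python
import glob
import os


def find_logs(dir_path, prefix="filename", suffix=".log"):
    """Return paths matching <dir_path>/<prefix>*<suffix>, oldest first.

    Hypothetical helper for illustration.
    """
    pattern = os.path.join(dir_path, prefix + "*" + suffix)
    # glob.glob returns [] for a missing directory or when nothing matches.
    matches = glob.glob(pattern)
    # Sort by modification time, oldest first, similar to `ls -tr`.
    return sorted(matches, key=os.path.getmtime)


if __name__ == "__main__":
    logs = find_logs("/a/b/c")
    if logs:
        print("\n".join(logs))
    else:
        print("no matching log files")
```

With the list of paths in hand, the ls call may not be needed at all; os.stat on each path gives the size and timestamps if those are required.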

Copy files from one folder to another with matching names in .txt file

I want to copy files from one big folder to another folder based on matching file names in a .txt file.
My list.txt file contains file names:
S001
S002
S003
and another big folder contains many files for ex. S001, S002, S003, S004, S005.
I only want to copy the files from this big folder that matches the file names in my list.txt file.
I have tried Bash, Python - not working.
for /f %%f in list.txt do robocopy SourceFolder/ DestinationFolder/ %%f
is not working either.
My logic in Python is not working:
import os
import shutil
def main():
    destination = "DestinationFolder/copy"
    source = "SourceFolder/MyBigData"
    with open(source, "r") as lines:
        filenames_to_copy = set(line.rstrip() for line in lines)
    for filenames in os.walk(destination):
        for filename in filenames:
            if filename in filenames_to_copy:
                shutil.copy(source, destination)
Any answers in Bash, Python or R?
Thanks

I think the issue with your Python code is that with os.walk() your filename is never a single file name: each item is either a directory path or a whole list of names, so it will not be found in your filenames_to_copy.
I'd recommend trying os.listdir() instead, as it returns the names of the files and folders as strings, which are easier to compare against your filenames_to_copy.
One other note: you probably want to call os.listdir() (or os.walk()) on the source instead of the destination. Currently, you only copy a file from the source to the destination if it already exists in the destination.

os.walk() yields a tuple of three elements: the name of the directory currently being inspected, the list of folders in it, and the list of files in it. You are only interested in the last one, so you should iterate with:
for _, _, filenames in os.walk(destination):
As pointed out by JezMonkey, os.listdir() is easier to use, as it lists the files and folders in the requested directory. However, you lose the recursive search that os.walk() provides. If all your files are in the same folder and not nested in subfolders, you are better off using os.listdir().
The second problem I see in your code is that you copy source, when I think you want to copy os.path.join(source, filename).
Can you post the exact error you get from the Python script so that we can better help you?
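To make the difference between the two calls concrete, here is a small sketch that collects file names both ways. The function names are invented for this example.

```python
import os


def names_with_walk(folder):
    """Collect file names from folder and all of its subfolders."""
    found = []
    # Each iteration yields (current directory, subfolder names, file names).
    for _, _, filenames in os.walk(folder):
        found.extend(filenames)
    return sorted(found)


def names_with_listdir(folder):
    """Collect file names directly inside folder, ignoring subfolders."""
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )
```

Either result can be compared against filenames_to_copy with a plain `in` test, because both contain strings rather than lists.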
UPDATE
You actually don't need to list all the files in the source folder. With os.path.exists you can check that the file exists and copy it if it does.
import os
import shutil
def main():
    destination = "DestinationFolder/copy"
    source = "SourceFolder/MyBigData"
    with open("list.txt", "r") as lines: # adapt the name of the file to open to your exact location.
        filenames_to_copy = set(line.rstrip() for line in lines)
    for filename in filenames_to_copy:
        source_path = os.path.join(source, filename)
        if os.path.exists(source_path):
            print("copying {} to {}".format(source_path, destination))
            shutil.copy(source_path, destination)
if __name__ == "__main__":
    main()

Thank you @PySaad and @Guillaume for your contributions. My script is working now; I added:
if os.path.exists(copy_to):
    shutil.rmtree(copy_to)
shutil.copytree(file_to_copy, copy_to)
to the script and it's working like a charm :)
Thanks a lot for your help!

You can try the code below:
import glob
import os
import shutil

def copy_(source_dir, dest_dir):
    # copy every file inside source_dir into dest_dir/<name of source_dir>
    files = glob.iglob(os.path.join(source_dir, '*'))
    for file in files:
        dest = os.path.join(dest_dir, os.path.basename(os.path.dirname(file)))
        if not os.path.exists(dest):
            os.makedirs(dest)
        shutil.copy2(file, dest)

# expanduser resolves "~"; a literal "~\big_dir" would also turn \b into a backspace
big_dir = os.path.expanduser(os.path.join("~", "big_dir"))
copy_to = os.path.expanduser(os.path.join("~", "copy_to"))
copy_ref = os.path.expanduser(os.path.join("~", "copy_ref.txt"))
big_dir_files = [os.path.basename(f) for f in glob.glob(os.path.join(big_dir, '*'))]
print('big_dir', big_dir_files) # Returns all filenames from big directory
with open(copy_ref, "r") as lines:
    filenames_to_copy = set(line.rstrip() for line in lines)
print(filenames_to_copy) # prints the names which you have in the .txt file
for file in filenames_to_copy:
    if file in big_dir_files: # Matches a name from the .txt file with a name in the big dir
        file_to_copy = os.path.join(big_dir, file)
        copy_(file_to_copy, copy_to)
Reference:
https://docs.python.org/3/library/glob.html

If you want an overkill bash script/tool, check out https://github.com/jordyjwilliams/copy_filenames_from_txt.
It can be invoked with ./copy_filenames_from_txt.sh -f ./text_file_with_names -d search_dir -o output_dir
The script can be summarised (without error handling, argument parsing, etc.) as:
cat $SEARCH_FILE | while read i; do
    find $SEARCH_DIR -name "*$i*" | while read file; do
        cp -r $file $OUTPUT_DIR/
    done
done
The 2nd while loop here is not even strictly necessary... One could just pass $i to the cp. (eg a list of files if multiple matches) I just wanted to have each file handled separately for the tool I was writing...
To make this a bit nicer and add error handling...
The active part of my tool in the repo is (ignore the color markers):
cat $SEARCH_FILE | while read i; do
    # Handle non-matching file
    if [[ -z $(find $SEARCH_DIR -name "*$i*") && $VERBOSE ]]; then
        echo -e "❌: ${RED}$i${RESET} ||${YELLOW} No matching file found${RESET}"
        continue
    fi
    # Handle case of more than one matching file
    find $SEARCH_DIR -name "*$i*" | while read file; do
        if [[ -n "$file" ]]; then # Check file matching
            FILENAME=$(basename ${file}); DIRNAME=$(dirname "${file}")
            if [[ -f $OUTPUT_DIR/"$FILENAME" ]]; then
                if [[ $DIRNAME/ != $OUTPUT_DIR && $VERBOSE ]]; then
                    echo -e "📁: ${CYAN}$FILENAME ${GREEN}already exists${RESET} in ${MAGENTA}$OUTPUT_DIR${RESET}"
                fi
            else
                if [[ $VERBOSE ]]; then
                    echo -e "✅: ${GREEN}$i${RESET} >> ${CYAN}$file${RESET} >> ${MAGENTA}$OUTPUT_DIR${RESET}"
                fi
                cp -r $file $OUTPUT_DIR/
            fi
        fi
    done
done

automation to run script files for number of definition files using python

I have an autojson file that I need to run from the terminal with the following command:
python script.py -i input.json -o output.h
Now I want to run the same script automatically for a number of input files stored in a folder and store the output in another folder. How can I write a Python script to automate this?
Currently I have to keep retyping the input file names; instead, the script should read the files from a given folder by itself and generate the outputs.

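import os
import subprocess
cwd = os.getcwd()
path_def = os.path.join(cwd, "definitions") # assumed name: folder holding the input .json files
path_out = os.path.join(cwd, "output") # assumed name: folder for the generated .h files
os.makedirs(path_out, exist_ok=True)
for filename in os.listdir(path_def):
    if not filename.endswith(".json"):
        continue
    x = os.path.join(path_def, filename)
    # splitext drops only the extension; strip(".json") would remove characters, not the suffix
    y = os.path.join(path_out, os.path.splitext(filename)[0])
    subprocess.call(["python", "script.py", "-i", x, "-o", y + ".h"])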

Run command in python

I am writing a Python script that iterates over a directory and runs a command for each matching file, like this:
import os
import shutil
import uuid

def process_file(filepath):
    # get the file name and extension
    name, ext = os.path.splitext(os.path.basename(filepath))
    # copy the file to a tmp file, because the filepath may contain some non-ASCII characters
    tmp_file_src = 'c:/' + str(uuid.uuid1()) + '.xx'
    tmp_file_dest = tmp_file_src.replace('.xx', '.yy')
    shutil.copy2(filepath, tmp_file_src)
    # run the command, which may take 10-20 seconds
    os.system('xx %s %s' % (tmp_file_src, tmp_file_dest))
    # copy the generated output file back under the original file name
    shutil.copy2(tmp_file_dest, os.path.dirname(filepath) + '/' + name + '.yy')

for root, directories, files in os.walk(directory):
    for filename in files:
        if filename.endswith('.xx'):
            filepath = os.path.join(root, filename)
            process_file(filepath)
As you can see, there is one command per file, and I have to wait for each command to finish before doing further work.
Now the execution order is:
file1-->file2.....
I wonder if they can be executed in parallel?
file1
file2
....

You could also use the threading module.
import os
import threading
def starter_function(cmd_to_execute):
    os.system(cmd_to_execute)
# cmd_to_execute is the command string built for one file
execution_thread = threading.Thread(target=starter_function, args=(cmd_to_execute,))
execution_thread.start()
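Starting one thread per file can launch a large number of commands at once. As a hedged sketch, a thread pool from concurrent.futures caps how many run at the same time. The names run_command and run_all are invented for this example, and the demo commands are stand-ins for the 'xx' tool from the question. Threads are enough here because the real work happens in the child processes.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(argv):
    """Run one external command and return its exit code."""
    return subprocess.call(argv)


def run_all(commands, max_workers=4):
    """Run the commands concurrently; exit codes come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_command, commands))


if __name__ == "__main__":
    # Stand-in commands; replace with ['xx', src, dest] lists built per file.
    demo = [[sys.executable, "-c", "import time; time.sleep(0.1)"] for _ in range(3)]
    print(run_all(demo))
```

In the question's script, the whole process_file function could be submitted to the pool instead, so the copy steps before and after the command also overlap.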
