Pyspark error handling with file name having spaces

I am using PySpark 2.1.
Problem statement: I need to validate an HDFS path and, if the file exists, copy the file name into a variable.
Below is the code I have so far, after referring to a few websites and Stack Overflow:
import os
import subprocess
import pandas as pd
import times
def run_cmd(args_list):
    print('Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    proc.communicate()
    return proc.returncode
today = datetime.now().date().strftime('%d%b%Y')
source_dir = '/user/dev/input/'+ today
hdfs_file_path=source_dir+'\'student marks details.csv\''
cmd = ['hdfs', 'dfs', '-find','{}','-name', hdfs_file_path]
code=run_cmd(cmd)
if code<>1:
    print 'file doesnot exist'
    System.exit(1)
else:
    print 'file exist'
With the above code I get the message "file doesnot exist", even though the file is present in that folder.
When I run the command below in a shell console, I get the complete path:
hdfs dfs -find () -name /user/dev/input/08Aug2017/'student marks details.csv'
When I run the code above from PySpark, it fails because the file name contains a space. Please help me resolve this issue.

The problem
Your problem is on this line:
hdfs_file_path = source_dir + '\'student marks details.csv\''
You are adding two unneeded single quotes, and also forgetting to add a directory separator.
The reason the path works in this command:
hdfs dfs -find () -name /user/dev/input/08Aug2017/'student marks details.csv'
is because this is a shell command. On the shell that you are using (presumably it is bash), the following commands are equivalent:
echo '/user/dev/input/08Aug2017/student marks details.csv'
echo /user/dev/input/08Aug2017/'student marks details.csv'
bash removes the quotes, and merges the strings together, yielding the same string result, which is /user/dev/input/08Aug2017/student marks details.csv. The quotes are not actually part of the path, but just a way to tell bash to not split the string at the spaces, but create a single string, and then remove the quotes.
When you write:
hdfs_file_path = source_dir + '\'student marks details.csv\''
The path you end up getting is /user/dev/input/08Aug2017'student marks details.csv', instead of the correct /user/dev/input/08Aug2017/student marks details.csv.
The subprocess call just requires plain strings that correspond to the values that you want, and will not process them the same way the shell does.
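The quote removal described above can be reproduced from Python with shlex.split, which follows POSIX shell word-splitting rules. This is only an illustrative sketch using the paths from the question; it is not part of the original code.

```python
import shlex

# Two spellings of the same shell command line.
quoted_whole = "echo '/user/dev/input/08Aug2017/student marks details.csv'"
quoted_part = "echo /user/dev/input/08Aug2017/'student marks details.csv'"

# shlex.split mimics how a POSIX shell splits words and strips the quotes.
args_whole = shlex.split(quoted_whole)
args_part = shlex.split(quoted_part)

print(args_whole)
print(args_part)
```

Both lists contain the same single path argument with no quote characters in it, which is the plain string that subprocess expects.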
Solution
In Python, joining paths together is best done with os.path.join. I suggest replacing these lines:
source_dir = '/user/dev/input/' + today
hdfs_file_path = source_dir + '\'student marks details.csv\''
with the following:
source_dir = os.path.join('/user/dev/input/', today)
hdfs_file_path = os.path.join(source_dir, 'student marks details.csv')
os.path.join takes care to add a single directory separator (/ on Unix, \ on Windows) between its arguments, so you can't accidentally either forget the separator, or add it twice.
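A minimal sketch of the corrected path building follows. It makes two assumptions that are not in the original answer: it uses posixpath.join (identical to os.path.join on Unix) because HDFS paths always use /, and it swaps the question's -find call for hdfs dfs -test -e, which only checks whether a path exists. The helper name build_hdfs_path is hypothetical.

```python
import posixpath
from datetime import datetime

def build_hdfs_path(base_dir, day, file_name):
    """Join base directory, date folder, and file name with single '/' separators."""
    date_dir = day.strftime('%d%b%Y')
    return posixpath.join(base_dir, date_dir, file_name)

hdfs_file_path = build_hdfs_path('/user/dev/input/', datetime(2017, 8, 8),
                                 'student marks details.csv')

# Each list element reaches the hdfs process as one argument, spaces included.
# Assumption: '-test -e' replaces the question's '-find' invocation.
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
print(cmd)
```

The list can then be handed to the question's run_cmd function unchanged; no quotes are added around the file name.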

Related

remove file command not working on spaces name

My file name is:
file_name = '19-00165_my-test - Copy (7)_Basic_sample_data'
My function call looks like this:
call("rm -rf /tmp/" + file_name + '.csv', shell=True)
but I get this error:
/bin/sh: -c: line 0: syntax error near unexpected token `('

My usual response is: don't use spaces in file names.
But if you really need to, then put the path in quotes, like this:
call("rm -f '/tmp/{0}.csv'".format(file_name), shell=True)

Why are you using shell=True? That means the command will be passed to a shell for parsing, which is what's causing all the trouble. With shell=False, you pass a list consisting of the commands followed by its arguments, each as a separate list element (rather than all mashed together as a single string). Since the filename never goes through shell parsing, it can't get mis-parsed.
call(["rm", "-rf", "/tmp/" + file_name + '.csv'], shell=False)
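As a rough illustration of the list form, the sketch below deletes a file whose name contains spaces and parentheses. It assumes a Unix-like system where rm is available, and it works in a temporary directory rather than /tmp itself; the helper name remove_csv is made up for this example.

```python
import os
import subprocess
import tempfile

def remove_csv(directory, file_name):
    """Delete <directory>/<file_name>.csv by calling rm without a shell."""
    target = os.path.join(directory, file_name + '.csv')
    # The list form hands target to rm as a single argument, so spaces
    # and parentheses need no quoting.
    return subprocess.call(['rm', '-f', target])

# Demonstration in a throwaway directory instead of /tmp itself.
demo_dir = tempfile.mkdtemp()
file_name = '19-00165_my-test - Copy (7)_Basic_sample_data'
demo_path = os.path.join(demo_dir, file_name + '.csv')
open(demo_path, 'w').close()

exit_code = remove_csv(demo_dir, file_name)
still_there = os.path.exists(demo_path)
os.rmdir(demo_dir)
print(exit_code, still_there)
```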

To avoid problems with unescaped characters, one option is the shlex module.
Its quote() function returns a shell-escaped version of the string:
import shlex
file_name = "19-00165_my-test - Copy (7)_Basic_sample_'data"
call(f"rm -f /tmp/{shlex.quote(file_name)}.csv", shell=True)
# rm -f /tmp/'19-00165_my-test - Copy (7)_Basic_sample_'"'"'data'.csv
You can also use join() (available since Python 3.8):
import shlex
file_name = "19-00165_my-test - Copy (7)_Basic_sample_'data"
call(shlex.join(["rm", "-f", f"/tmp/{file_name}.csv"]), shell=True)
# rm -f '/tmp/19-00165_my-test - Copy (7)_Basic_sample_'"'"'data.csv'
Note: This answer is only relevant if shell=True is required to make the command work. Otherwise the answer by @Gordon Davisson is much simpler.
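One way to check that the quoting is right without deleting anything is to parse the quoted command back with shlex.split, which applies POSIX shell word splitting. This is a small verification sketch, not part of the answer above.

```python
import shlex

file_name = "19-00165_my-test - Copy (7)_Basic_sample_'data"
command = "rm -f /tmp/{}.csv".format(shlex.quote(file_name))

# shlex.split applies POSIX shell word splitting, so this shows the
# arguments that rm would receive from the shell.
parsed = shlex.split(command)
print(command)
print(parsed)
```

The last element of parsed is the original path with its embedded single quote intact, which shows that the escaping round-trips.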

Using python to insert a single escape character in front of specified character in a string

I have a path string that I would like to use inside of a subprocess command. This path contains directories with a whitespace, so a string like "foo/foo bar/bar" would need to be converted to "foo/foo\ bar/bar" beforehand. I have tried
path = "foo/foo bar/bar"
path = path.replace(" ","\\ ")
which results in "foo/foo\\ bar/bar"
I have also tried
path = os.path.normpath(path)
which changes nothing and
path = repr(path.replace(" ","\\ "))
which returns "foo/foo\\\\ bar/bar"
Is there a good solution to this while still using subprocess or os.system to call the command?

You must be expecting to put the entire command in a string and let a shell parse it. Call subprocess with a list of arguments instead to avoid any need for quoting:
path = "foo/foo bar/bar"
subprocess.run(["ls", "-l", path])
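The doubled backslash seen in the question is only how Python displays the string, not what the string contains. The short sketch below makes that visible; it is an illustration and is not needed when the list form is used.

```python
path = "foo/foo bar/bar"
escaped = path.replace(" ", "\\ ")

# The string holds exactly one backslash; repr() doubles it for display only.
print(escaped)        # what the string really contains
print(repr(escaped))  # the Python-literal form, with the backslash doubled
backslashes = escaped.count("\\")
print(backslashes)
```

So the replace() call in the question already produced a single escape character; the interactive prompt simply echoed the repr() form.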

How to escape a spacebar in a path name with subprocess?

I'm trying to convert a file from .m4a to .mp3 using ffmpeg and I need to access the music folder.
The path name of this folder is : C:\\Users\A B\Desktop\Music
I can't access it with subprocess.call() because only C:\\Users\A gets recognized. The white space is not processed.
Here's my python script :
import constants
import os
import subprocess
path = 'C:\\Users\A B\Desktop\Music'
def main():
    files = sorted(os.listdir(path), key=lambda x: os.path.getctime(os.path.join(path, x)))
    if "Thumbs.db" in files: files.remove("Thumbs.db")
    for f in files:
        if f.lower()[-3:] == "m4a":
            process(f)
def process(f):
    inFile = f
    outFile = f[:-3] + "mp3"
    subprocess.call('ffmpeg -i {} {} {}'.format('C:\\Users\A B\Desktop\Music', inFile, outFile))
main()
When I run it I get an error that states :
C:\Users\A: No such file or directory
Does anyone know how to pass my full path name (C:\Users\A B\Desktop\Music) to subprocess.call()?

Preliminary note: spaces or not, the command line -i <directory> <infilename> <outfilename> is not correct for ffmpeg, since it expects the -i option followed by the input file and then the output file, not a directory first. So you have more than one problem here (which explains the "permission denied" message you had, because ffmpeg was trying to open a directory as a file!)
I suppose that you want to:
read all files from directory
convert them all to a file located in the same directory
In that case, you could put quotes around both the input and output absolute paths, like this:
subprocess.call('ffmpeg -i "{0}\{1}" "{0}\{2}"'.format('C:\\Users\A B\Desktop\Music', inFile, outFile))
That would work, but it's not the best approach: it's not very efficient, you format a string when you already have all the arguments separately, you may not know which other characters need escaping, and so on. Don't reinvent the wheel.
The best way is to pass the arguments as a list, so that the subprocess module handles the quoting/escaping when necessary:
path = r'C:\Users\A B\Desktop\Music' # use raw prefix to avoid backslash escaping
subprocess.call(['ffmpeg','-i',os.path.join(path,inFile), os.path.join(path,outFile)])
Aside: if you're the user in question, it's even better to do:
path = os.path.join(os.getenv("USERPROFILE"), 'Desktop', 'Music')
and you could even run the process in the path directory with cwd option:
subprocess.call(['ffmpeg','-i',inFile, outFile],cwd=path)
and if you're not, be sure to run the script with elevated privileges or you won't get access to another user directory (read-protected)
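Putting the pieces together, the argument list can be built in a small helper and inspected before anything is run. This is a sketch under assumptions: the helper name build_ffmpeg_cmd and the file name 'my song.m4a' are made up for illustration, and ffmpeg itself is not invoked here.

```python
import os

def build_ffmpeg_cmd(directory, in_name):
    """Return the argument list for converting one .m4a file to .mp3."""
    base, _ext = os.path.splitext(in_name)
    in_path = os.path.join(directory, in_name)
    out_path = os.path.join(directory, base + '.mp3')
    return ['ffmpeg', '-i', in_path, out_path]

music_dir = r'C:\Users\A B\Desktop\Music'
cmd = build_ffmpeg_cmd(music_dir, 'my song.m4a')  # hypothetical file name
print(cmd)
# To run it for real (requires ffmpeg on PATH): subprocess.call(cmd)
```

Because each path is its own list element, the spaces in both the directory and the file name need no quoting.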

python subprocess module can't parse filename with special characters "("

I have a Python program that reads files and then tars them into tar balls of a certain size.
One of my files not only has spaces in it but also contains parentheses. I have the following code:
cmd = "/bin/tar -cvf " + tmpname + " '" + filename + "'"
NOTE: Those are single quotes inside double quotes outside of the filename variable. It's a little difficult to see.
Where tmpname and filename are variables in a for-loop that are subject to change each iteration (irrelevant).
As you can see the filename I'm tarballing contains single quotes around the file name so that the shell (bash) interprets it literally as is and doesn't try to do variable substitution which "" will do or program execution which ` will do.
As far as I can see, the cmd variable contains the exact syntax for the shell to interpret the command as I want it to. However when I run the following subprocess command substituting the cmd variable:
cmdobj = call(cmd, shell=True)
I get the following output/error:
/bin/tar: 237-r Property Transport Request (PTR) 012314.pdf: Cannot stat: No such file or directory
/bin/tar: Exiting with failure status due to previous errors
unable to tar: 237-r Property Transport Request (PTR) 012314.pdf
I even print the command out to the console before running the subprocess command to see what it will look when running in the shell and it's:
cmd: /bin/tar -cvf tempname0.tar '237-r Property Transport Request (PTR) 012314.pdf'
When I run the above command in the shell as is it works just fine. Not really sure what's going on here. Help please!

Pass a list of args without shell=True and the full path to the file if running from a different directory:
from subprocess import check_call
check_call(["tar","-cvf",tmpname ,"Property Transport Request (PTR) 012314.pdf"])
Also use tar rather than 'bin/tar'. check_call will raise a CalledProcessError if the command returns a non-zero exit status.
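To see that a list argument reaches the child process untouched, the sketch below uses the Python interpreter itself as a stand-in for tar (an assumption made so the example runs without creating archives). The child simply writes back the first argument it receives.

```python
import subprocess
import sys

filename = "237-r Property Transport Request (PTR) 012314.pdf"

# A tiny child program that prints the first argument it receives.
# sys.executable stands in for tar here so the sketch runs anywhere.
child_code = "import sys; sys.stdout.write(sys.argv[1])"
received = subprocess.check_output([sys.executable, "-c", child_code, filename])
received = received.decode()
print(received)
```

The spaces and parentheses arrive as part of one argument, with no shell involved and no quoting required.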

The call function in the subprocess module should be passed a list of strings.
On the command line you would call
tar -cvf "file folder with space/"
The following is equivalent in python
call(["tar", "-cvf", "file folder with space/"])
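You are instead building a single command string:
"tar -cvf 'file folder with space/'"
If such a string is passed without shell=True on a POSIX system, Python looks for a program whose exact name is tar -cvf 'file folder with space/'. With shell=True the shell parses the string instead, so every argument has to be quoted correctly for the shell.
Passing a list avoids string concatenation and quoting, which makes for cleaner code.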

subprocess.call() to remove files

I want to remove all the files and directories except for some of them by using
subprocess.call(['rm','-r','!(new_models|creat_model.py|my_mos.tit)'])
but it returns this message:
rm: cannot remove `!(new_models|creat_model.py|my_mos.tit)': No such file or directory
how can I fix this? Thanks

If you use that rm command on the command line the !(…|…|…) pattern is expanded by the shell into all file names except those in the pattern before calling rm. Your code calls rm directly so rm gets the shell pattern as a file name and tries to delete a file with that name.
You have to add shell=True to the argument list of subprocess.call() or actually code this in Python instead of calling external commands. Downside: That would be more than one line. Upside: it can be done independently from external shells and system dependent external programs.
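The pure-Python route mentioned above might look like the following sketch. The function name remove_all_except is hypothetical, and the demonstration runs in a temporary directory with made-up entries so that nothing real is deleted.

```python
import os
import shutil
import tempfile

def remove_all_except(directory, keep):
    """Delete every entry in directory whose name is not in keep."""
    for name in os.listdir(directory):
        if name in keep:
            continue
        path = os.path.join(directory, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)   # directories, like rm -r
        else:
            os.remove(path)       # regular files and symlinks

# Demonstration in a temporary directory with made-up entries.
demo_dir = tempfile.mkdtemp()
for name in ('creat_model.py', 'my_mos.tit', 'old.log'):
    open(os.path.join(demo_dir, name), 'w').close()
os.mkdir(os.path.join(demo_dir, 'new_models'))
os.mkdir(os.path.join(demo_dir, 'scratch'))

remove_all_except(demo_dir, {'new_models', 'creat_model.py', 'my_mos.tit'})
remaining = sorted(os.listdir(demo_dir))
shutil.rmtree(demo_dir)
print(remaining)
```

This needs neither a shell nor the rm program, so it behaves the same on any platform.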

An alternative to shell=True could be the usage of glob and manual filtering:
import glob
files = [i for i in glob.glob("*") if i not in ('new_models', 'creat_model.py', 'my_mos.tit')]
subprocess.call(['rm','-r'] + files)
Edit 4 years later:
Without glob (of which I don't remember why I suggested it):
import os
files = [i for i in os.listdir() if i not in ('new_models', 'creat_model.py', 'my_mos.tit')]
subprocess.call(['rm','-r'] + files)

Code to remove all png
args = ('rm', '-rf', '/path/to/files/temp/*.png')
subprocess.call('%s %s %s' % args, shell=True)
