Pig Script: STORE command not working - python

This is my first time posting to Stack Overflow, and I'm hoping someone can assist. I'm fairly new to Pig scripts and have encountered a problem I can't solve.
Below is a Pig script that fails when I attempt to write results to a file:
register 'myudf.py' using jython as myfuncs;
A = LOAD '$file_nm' USING PigStorage('$delimiter') AS ($fields);
B = FILTER A by ($field_nm) IS NOT NULL;
C = FOREACH B GENERATE ($field_nm) as fld;
D = GROUP C ALL;
E = FOREACH D GENERATE myfuncs.theResult(C.fld);
--DUMP E;
STORE E INTO 'myoutput/theResult';
EXEC;
I see the results of E when I DUMP to the screen. However, I need to store the results temporarily in a file. After the STORE command, the error I receive is: Output Location Validation Failed.
I've tried numerous workarounds, like removing the theResult folder and removing the earlier contents of theResult, but none of the commands I use work. These have been along the lines of:
hdfs dfs -rm myoutput/theResult
and
hadoop fs -rm myoutput/theResult
...using both the shell (hs) and file system (fs) commands. I've also tried calling another function (a shell script, a Python function, etc.) to clear the earlier results stored in the myoutput/theResult folder. I've read every website I can find and nothing is working. Any ideas?

The output location of a MapReduce job is a directory, so you should remove it this way:
hadoop fs -rmr myoutput/theResult
and then run the Pig script. It will work.
"rmr" - remove recursive, which deletes both folders and files
"rm" - just remove, which removes only files
Every time, you need to either change the output path or delete the old one and reuse it, since HDFS is WORM (write once, read many) storage.

A couple of things you can try:
Make sure the output directory is a valid path.
Remove the entire directory, not just the content within it. Remove the directory with "rmr" and check that the path doesn't exist before running the Pig script.

Thanks for both of your replies. I now have a solution that is working:
fs -mkdir -p myoutput/theResult
fs -rm -r myoutput/theResult
The first line attempts to create a directory, but the "-p" prevents an error if it already exists. Then the second line removes it. Either way, there will be a directory to remove, so no error!
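Since this is driven from Python anyway (per the title), the same trick can be wrapped around the Pig invocation itself; a minimal sketch, assuming the hdfs and pig binaries are on the PATH and a hypothetical myscript.pig:

import subprocess

out_dir = 'myoutput/theResult'
# '-p' makes mkdir succeed even if the directory already exists,
# so the recursive remove below always has something to delete.
subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', out_dir])
subprocess.check_call(['hdfs', 'dfs', '-rm', '-r', out_dir])
subprocess.check_call(['pig', '-f', 'myscript.pig'])  # hypothetical script name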

The output of STORE is confusing when you're using Pig for the first time.
store grp into '/output1';
This creates a folder named 'output1' under the HDFS root. The folder must not already exist.
You can give your own hdfs path here like /user/thewhitetulip.
hdfs dfs -ls /output1
output:
/output1/_SUCCESS
/output1/part-r-00000
The part-r-00000 file is the output of the STORE statement.
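If you want the result back on the local file system, you can concatenate the part files; a sketch in Python (assuming the hdfs binary is on the PATH):

import subprocess

# Concatenate every part file of the STORE output into one local file.
with open('theResult.txt', 'wb') as out:
    subprocess.check_call(['hdfs', 'dfs', '-cat', '/output1/part-*'], stdout=out)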

Related

"Sort by random" in OSX Finder setting kMDItemFinderComment to a random hash for every file in a folder?

Finder allows you to sort files by many different attributes.
In the OSX filesystem, there is an attribute on every file called "comments" (com.apple.metadata:kMDItemFinderComment) which allows you to add any arbitrary string data as metadata for that file.
Finder exposes this "comment" attribute in the GUI and you can sort by it. I thought I could abuse this attribute to fill in random data for each file's "comments", and then sort by those random comments.
tl;dr: I'm trying to create "sort by random" functionality (in Finder) with the help of a Bash script and some Python.
This does achieve that (sort of):
find "$1" -type f -print0 | while IFS= read -r -d $'\0' file; # get a list of files in the dir
do
    if [[ $file == *.wav ]]
    then
        hash=$(openssl rand -hex 12); # generate a random hash
        osxmetadata --set findercomment "$hash" "$file"; # set the comment
    fi
done
Here I'm using the osxmetadata Python utility to do the heavy lifting.
It works as intended, but it's really slow:
https://i.stack.imgur.com/d7exk.gif
I'm trying to do this operation on folders with many items, and I would frequently be "re-seeding" the files with random comments.
Can anyone suggest an optimization to make this faster? I tried using xattrs directly, but that doesn't seem to reindex the comments in Finder when they update.
I'd wrap the then-clause in a (...)& and add a wait after the loop. Then it will do every file in parallel.
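Most of the cost is probably process startup: the loop spawns a fresh Python interpreter (the osxmetadata CLI) for every file. Staying inside one Python process and parallelizing there should help; a sketch, assuming osxmetadata's Python API exposes findercomment as an attribute the way its docs describe, and a hypothetical target directory:

import secrets
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from osxmetadata import OSXMetaData

def seed(path):
    # Equivalent of `openssl rand -hex 12`: a random 24-char hex string.
    OSXMetaData(str(path)).findercomment = secrets.token_hex(12)

wavs = Path('path/to/dir').rglob('*.wav')  # hypothetical directory
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(seed, wavs))  # list() forces the lazy map to run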

diff output of two python programs in windows cmd

So I am trying to compare the output of two Python programs, which I will call tracing1.py and tracing2.py. Currently I am using process substitution with diff to compare their outputs, but I'm having trouble with finding both files, since they are in separate sub-directories of my current directory:
diff <(python /subdir1/tracing1.py) <(python /subdir2/tracing2.py)
When I run this, I get
The system cannot find the file specified.
I think I'm messing up some sort of path formatting, or else I'm using the process substitution incorrectly.
EDIT: In the end I decided that I didn't need to use process substitution, and instead could just diff program output after each program is run. However thanks to Fallenreaper in the comments, I was able to find a single command that does what I initially wanted:
python subdir1/tracing1.py > outfile1.txt & python subdir2/tracing2.py > outfile2.txt & diff outfile1.txt outfile2.txt
Sorry, not enough rep to comment yet :(
Your line works perfectly when you remove that slash. I would suggest using absolute path names or a path relative to the current directory, because that leading slash takes you to your root directory.
Cheers.
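If process substitution keeps fighting the shell, the comparison can also be done entirely in Python with difflib; a rough sketch using the paths from the question:

import difflib
import subprocess

out1 = subprocess.check_output(['python', 'subdir1/tracing1.py'],
                               universal_newlines=True)
out2 = subprocess.check_output(['python', 'subdir2/tracing2.py'],
                               universal_newlines=True)

diff = difflib.unified_diff(out1.splitlines(), out2.splitlines(),
                            fromfile='tracing1.py', tofile='tracing2.py',
                            lineterm='')
print('\n'.join(diff))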

Streaming a directory with Spark on Windows 7

I am running Spark 1.6.1 with Python 2.7 on Windows 7.
The root scratch dir: /tmp/hive on HDFS is writable and my current permissions are: rwxrwxrwx (using winutils tools).
I want to stream files from a directory. According to the doc, the function textFileStream(directory):
Create an input stream that monitors a Hadoop-compatible file system
for new files and reads them as text files. Files must be written to
the monitored directory by “moving” them from another location within
the same file system. File names starting with . are ignored.
When I launch Spark Streaming command:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DirStream")
ssc = StreamingContext(sc, 10)  # 10-second batch interval
lines = ssc.textFileStream(r"C:/tmp/hive/")
counts = lines.flatMap(lambda line: line.split(" "))\
    .map(lambda x: (x, 1))\
    .reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start()
ssc.awaitTermination()
and then create the files to stream in my directory, nothing happens.
I also tried this:
lines = ssc.textFileStream("/tmp/hive/")
and
lines = ssc.textFileStream("hdfs://tmp/hive/")
which is an HDFS-style path, but again nothing happens.
Am I doing something wrong?
Try using "file:///C:/tmp/hive" as the directory on Windows. That worked for me on Windows 8 with Spark 1.6.3, though I had to fiddle a bit with the file name and content before it worked. I also tried other paths, so I can confirm it works the same way with paths that have nothing to do with winutils, e.g. you can use "file:///D:/someotherpath" if your data is there.
It is not straightforward, though: I had a file in the monitored directory and made a few content and file-name changes before it got picked up, and at some point it stopped reacting to my changes, so the results are not consistent. I guess it's a Windows thing.
I know it works, so every time I try I have to be patient and make a few name changes before a file gets picked up; that's enough to prove it works, but obviously not good for anything else.
One other thing I was doing is using Unix line endings instead of Windows line endings in the files, but I cannot say for sure that it is required.
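Applied to the snippet in the question, the only change is the URI scheme on the monitored directory:

# An explicit file:// scheme keeps Spark from resolving the path
# against the default (HDFS) file system.
lines = ssc.textFileStream("file:///C:/tmp/hive/")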

Get full path of currently open files

I'm trying to code a simple application that must read all currently open files within a certain directory.
More specifically, I want to get a list of files open anywhere inside my Documents folder,
but I don't want only the process IDs or process names; I want the full path of each open file.
The thing is, I haven't found anything that does that.
I couldn't do it either in the Linux shell (using the ps and lsof commands) or using Python's psutil library. Neither gives me the information I need, which is only the paths of the files currently open in a dir.
Any advice?
P.S.: I'm tagging this as a Python question (besides OS-related tags) because it would be a plus if it could be done using some Python library.
This seems to work (on Linux):
import shlex
import subprocess

cmd = shlex.split('lsof -F n +d .')
try:
    output = subprocess.check_output(cmd).decode().splitlines()
except subprocess.CalledProcessError as err:
    # lsof exits non-zero when some files can't be inspected,
    # but its partial output is still usable
    output = err.output.decode().splitlines()
# '-F n' prefixes each path with 'n'; keep entries under './' and strip the prefix
open_files = [line[3:] for line in output if line.startswith('n./')]
# e.g. ['file.tmp']
It reads open files from the current directory, non-recursively.
For a recursive search, use the +D option. Keep in mind that it is vulnerable to a race condition: by the time you get your output, the situation may already have changed. It is always best to just try the operation (open the file) and check for failure, e.g. open the file and catch the exception, or check for a NULL FILE value in C.
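For the cross-process case the question actually asks about, psutil can report open-file paths too, at least on Linux and with enough privileges; a sketch that collects every open file under a directory (target path is a placeholder):

import os
import psutil

target = os.path.expanduser('~/Documents')
open_paths = set()
for proc in psutil.process_iter():
    try:
        for f in proc.open_files():
            if f.path.startswith(target):
                open_paths.add(f.path)
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        # processes we can't inspect, or that exited mid-scan
        continue

print(sorted(open_paths))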

Script to rename files in folder to match names of files in another folder

I need to do a batch rename given the following scenario:
I have a bunch of files in Folder A
A bunch of files in Folder B.
The files in Folder A are all ".doc",
the files in Folder B are all ".jpg".
The files in Folder A are named "A0001.doc"
The files in Folder B are named "A0001johnsmith.jpg"
I want to merge the folders, and rename the files in Folder A so that they append the name portion of the matching file in Folder B.
Example:
Before:
Folder A:    Folder B:
A0001.doc    A0001johnsmith.jpg
After:
Folder C:
A0001johnsmith.doc
A0001johnsmith.jpg
I have seen some batch renaming scripts, but the difference here is that I need to capture the name portion in a variable so I can append it to the end of the corresponding file in Folder A.
I figure that the best way to do it would be to do a simple python script that would do a recursive loop, working on each item in the folder as follows:
Parse filename of A0001.doc
Match string to filenames in Folder B
Take the portion following the string that matched but before the "." and assign variable
Take the original string A0001 and append the variable containing the name element and rename it
Copy both files to Folder C (non-destructive, in case of errors etc)
I was thinking of using Python for this, but I could use some help with syntax and such. I only know a little bit of the base Python library, and I am guessing I would be importing modules such as os and maybe sys. I have never used them before; any help would be appreciated. I am also open to using a Windows batch script or even PowerShell. Any input is helpful.
This is Powershell since you said you would use that.
Please note that I HAVE NOT TESTED THIS. I don't have access to a Windows machine right now so I can't test it. I'm basing this off of memory, but I think it's mostly correct.
foreach($aFile in ls "/path/to/FolderA")
{
    $matchString = $aFile.Name.Split(".")[0]
    $bFile = $(ls "/path/to/FolderB" |? { $_.Name -Match $matchString })[0]
    $addString = $bFile.Name.Split(".")[0].Replace($matchString, "")
    cp $aFile.FullName ("/path/to/FolderC/" + $matchString + $addString + ".doc")
    cp $bFile.FullName "/path/to/FolderC"
}
This makes a lot of assumptions about the name structure. For example, I assumed the string to add doesn't appear in the common filename strings.
It is very simple with a plain batch script.
@echo off
for %%A in ("folderA\*.doc") do (
for %%B in ("folderB\%%~nA*.jpg") do (
copy "%%A" "folderC\%%~nB.doc"
copy "%%B" "folderC"
)
)
I haven't added any error checking.
You could have problems if you have a file like "A1.doc" matching multiple files like "A1file1.jpg" and "A10file2.jpg".
As long as the .doc files have fixed-width names, and there exists a .jpg for every .doc, then I think the code should work.
Obviously more code could be added to handle various scenarios and error conditions.
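Since the question mentions Python: the five-step plan in the question maps almost directly onto the standard library. A rough sketch (assumes exactly one matching .jpg per .doc, that Folder C already exists, and placeholder folder paths):

import os
import shutil

folder_a = 'FolderA'   # .doc files, e.g. A0001.doc
folder_b = 'FolderB'   # .jpg files, e.g. A0001johnsmith.jpg
folder_c = 'FolderC'   # merged output

jpgs = [f for f in os.listdir(folder_b) if f.endswith('.jpg')]

for doc in os.listdir(folder_a):
    if not doc.endswith('.doc'):
        continue
    code = os.path.splitext(doc)[0]            # 'A0001'
    matches = [j for j in jpgs if j.startswith(code)]
    if len(matches) != 1:
        print('skipping %s: %d matches' % (doc, len(matches)))
        continue
    name = os.path.splitext(matches[0])[0]     # 'A0001johnsmith'
    # copy both files to Folder C (non-destructive, as the question asks)
    shutil.copy(os.path.join(folder_a, doc),
                os.path.join(folder_c, name + '.doc'))
    shutil.copy(os.path.join(folder_b, matches[0]), folder_c)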
