Streaming a directory with Spark on Windows 7

I am running Spark 1.6.1 with Python 2.7 on Windows 7.
The root scratch dir: /tmp/hive on HDFS is writable and my current permissions are: rwxrwxrwx (using winutils tools).
I want to stream files from a directory. According to the doc, the function textFileStream(directory):
Create an input stream that monitors a Hadoop-compatible file system
for new files and reads them as text files. Files must be written to
the monitored directory by “moving” them from another location within
the same file system. File names starting with . are ignored.
When I launch Spark Streaming command:
lines = ssc.textFileStream(r"C:/tmp/hive/")
counts = lines.flatMap(lambda line: line.split(" "))\
              .map(lambda x: (x, 1))\
              .reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start()
and then create the files to stream in my directory, nothing happens.
I also tried this:
lines = ssc.textFileStream("/tmp/hive/")
and
lines = ssc.textFileStream("hdfs://tmp/hive/")
which are HDFS-style paths, but again nothing happens.
Am I doing something wrong?

Try using "file:///C:/tmp/hive" as the directory on Windows. It worked for me on Windows 8 with Spark 1.6.3, but I had to fiddle a bit with the file name and content before it worked. I also tried other paths, so I can confirm that it works the same way with paths that are unrelated to winutils; e.g. you can use "file:///D:/someotherpath" if your data is there.
It is not straightforward, though. I had a file in the monitored directory and made a few content and file name changes before it got picked up, and then at some point it stopped reacting to my changes, so the results are not consistent. I guess it's a Windows thing.
I know it works, so every time I try I know I have to be patient and try a few name changes before the file gets picked up. That is only good for proving it works, obviously not for anything else.
One other thing I was doing is using Unix line endings instead of Windows line endings in the files, but I cannot say whether that is required.
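The documentation quoted in the question says files must be moved into the monitored directory, and the answer above suggests a file:/// path and Unix-style files. A minimal sketch of those pieces, assuming Python 3; the helper names watched_dir_uri and drop_into_watched_dir are hypothetical, and the staging directory is assumed to sit on the same drive as the monitored one so the last step is a single rename:

```python
import os
from pathlib import PureWindowsPath


def watched_dir_uri(windows_path):
    """Turn a Windows path such as C:/tmp/hive into a file:/// URI."""
    return PureWindowsPath(windows_path).as_uri()


def drop_into_watched_dir(staging_dir, watched_dir, name, text):
    """Write the file elsewhere first, then move it in with one rename.

    Hypothetical helper: the monitored directory only ever sees a
    complete file, never a half-written one.
    """
    staged = os.path.join(staging_dir, name)
    # newline="\n" forces Unix line endings even on Windows
    with open(staged, "w", newline="\n") as handle:
        handle.write(text)
    final = os.path.join(watched_dir, name)
    os.replace(staged, final)  # both paths must be on the same file system
    return final
```

The URI helper could then feed the call from the question, e.g. ssc.textFileStream(watched_dir_uri("C:/tmp/hive")). Whether this removes the inconsistency described above is not something this sketch can promise.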

Related

How do I access a file for reading/writing in a different (non-current) directory?

I am working on the listener portion of a backdoor program (for an ETHICAL hacking course) and I would like to be able to read files from any part of my linux system and not just from within the directory where my listener python script is located - however, this has not proven to be as simple as specifying a typical absolute path such as "~/Desktop/test.txt"
So far my code is able to read files and upload them to the virtual machine where my reverse backdoor script is actively running. But this is only when I read and upload files that are in the same directory as my listener script (aptly named listener.py). Code shown below.
def read_file(self, path):
    with open(path, "rb") as file:
        return base64.b64encode(file.read())
As I've mentioned previously, the above function only works if I try to open and read a file that is in the same directory as the script that the above code belongs to, meaning that path in the above content is a simple file name such as "picture.jpg"
I would like to be able to read a file from any part of my filesystem while maintaining the same functionality.
For example, I would love to be able to specify "~/Desktop/another_picture.jpg" as the path so that the contents of "another_picture.jpg" from my "~/Desktop" directory are base64 encoded for further processing and eventual upload.
Any and all help is much appreciated.
Edit 1:
My script where all the code is contained, "listener.py", is located in /root/PycharmProjects/virus_related/reverse_backdoor/. Within this directory is a file that, for simplicity's sake, we can call "picture.jpg". The same file, "picture.jpg", is also located on my desktop, absolute path = "/root/Desktop/picture.jpg"
When I try read_file("picture.jpg"), there are no problems, the file is read.
When I try read_file("/root/Desktop/picture.jpg"), the file is not read and my terminal becomes stuck.
Edit 2:
I forgot to note that I am using the latest version of Kali Linux and Pycharm.
I have run "realpath picture.jpg" and it has yielded the path "/root/Desktop/picture.jpg"
Upon running read_file("/root/Desktop/picture.jpg"), I encounter the same problem where my terminal becomes stuck.
[FINAL EDIT aka Problem solved]:
Based on the answer suggesting trying to read a file like "../file", I realized that the code was fully functional because read_file("../file") worked without any flaws, indicating that my Python script had no trouble locating the given path. Once the file was read, it was uploaded to the machine running my backdoor where, curiously, it landed in the parent directory of the script. It was then that I realized that the problem lay in the handling of paths in the backdoor script rather than in my listener.py.
Credit is also due to the commenter who pointed out that "~" does not count as a valid path element. Once I reached the conclusion mentioned just above, I attempted read_file("~/Desktop/picture.jpg"), which failed. But with a quick modification, read_file("/root/Desktop/picture.jpg") was successfully executed and the file was uploaded to the same directory as my backdoor script on my target machine once I implemented some quick-fix code.
My apologies for not being so specific; efforts to aid were certainly confounded by the unmentioned complexity of my situation and I would like to personally thank everyone who chipped in.
This was my first whole-hearted attempt to reach out to the stackoverflow community for help and I have not been disappointed. Cheers!
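The "~" failure mentioned in the final edit happens because open() passes the path to the operating system as-is; "~" is normally expanded by the shell. A minimal sketch of a variant of read_file that expands it first, assuming Python 3 and written as a plain function rather than a method for brevity:

```python
import base64
import os


def read_file(path):
    """Base64-encode a file, accepting "~" and relative paths."""
    # expanduser turns "~/Desktop/x" into "<home directory>/Desktop/x";
    # abspath then resolves any relative part against the working directory.
    full_path = os.path.abspath(os.path.expanduser(path))
    with open(full_path, "rb") as handle:
        return base64.b64encode(handle.read())
```

With this change, read_file("~/Desktop/picture.jpg") and read_file("/root/Desktop/picture.jpg") would refer to the same file when the script runs as root.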

A solution I found is putting "../" before the filename if the file is in the directory directly above the script's directory.
test.py (in a directory directly inside the "Desktop" directory, i.e. /Desktop/test):
with open("../test.txt", "r") as test:
    print(test.readlines())
test.txt (in the directory "/Desktop")
Hi!
Hello!
Result:
['Hi!\n', 'Hello!\n']
This is likely the simplest solution. Note that "../" is resolved against the current working directory, so this works when the script is run from its own directory. I found this solution because I always use "cd ../" on the terminal.
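A relative path such as "../test.txt" is resolved against the current working directory, not against the script's location. One alternative, sketched here under the assumption of Python 3 and with a hypothetical helper name, is to anchor the path to the script file itself:

```python
from pathlib import Path


def path_in_parent_dir(script_file, name):
    """Return the path of `name` in the directory above the script's directory.

    `script_file` would usually be __file__; it is a parameter here so the
    helper can be tried without being saved as a script first.
    """
    script_dir = Path(script_file).resolve().parent
    return script_dir.parent / name
```

Inside test.py this could be used as open(path_in_parent_dir(__file__, "test.txt")), which would keep working regardless of the directory the script is started from.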

This lets you work not only with the current file, but with all other files in the same directory as the script.
import os
from PIL import Image  # provided by the Pillow package

path = os.path.dirname(os.path.abspath(__file__))
for filename in os.listdir(path):
    full_path = os.path.join(path, filename)
    if not os.path.isfile(full_path):
        continue  # skip subdirectories
    with open(full_path, 'rb') as f:
        content = f.read()
    print(filename, len(content))
    try:
        im = Image.open(full_path)
        im.show()
    except IOError:
        print('The following file is not an image type:', filename)

How to open any program in Python?

I searched a lot and found different ways to open a program in Python.
For example:
import os
os.startfile(path)  # needs the full path, which I cannot supply for every program in my case
The second one that I'm currently using
import os
os.system(fileName+'.exe')
The problem with the second example:
To open the calculator I need to know that its .exe file is named calc.exe, and the same goes for every other program (and I don't know the .exe file names of all programs).
And even if I hard-coded every program name, what happens when the user installs a new program? (My program won't be able to open it.)
If there is no other way to open programs in Python, is it possible to get a list of all programs installed on the user's computer,
along with their .exe file names (e.g. the calculator is calc.exe; you get the point)?
Note: I want a generic solution.

There's always:
from subprocess import call
call(["calc.exe"])
This should allow you to use a dict or list or set to hold your program names and call them at will. This is covered also in this answer by David Cournapeau and chobok.
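The dict idea mentioned above could look roughly like this. It is a minimal sketch assuming Python 3; PROGRAMS and launch are hypothetical names, and sys.executable stands in for a real application so the example runs on any machine:

```python
import shutil
import subprocess
import sys

# Hypothetical registry: friendly name -> command line.
# On Windows an entry could be "calculator": ["calc.exe"].
PROGRAMS = {
    "python-version": [sys.executable, "--version"],
}


def launch(name, programs=PROGRAMS):
    """Run a registered program and return its exit code."""
    command = programs[name]
    # shutil.which returns None when the executable cannot be found,
    # which gives a clearer error than a failed launch.
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(command[0])
    return subprocess.call(command)
```

New programs can then be supported by adding entries to the dict, although this does not by itself discover what is installed.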

You can try os.walk:
import os
exe_list = []
# Walk the tree below the current directory; `files` holds the file names found in `root`
for root, dirs, files in os.walk("."):
    for name in files:
        if name.lower().endswith('.exe'):
            exe_list.append(os.path.join(root, name))
for index, path in enumerate(exe_list):
    print('index : {} file :{}'.format(index, os.path.basename(path)))
ip = int(input('Enter index of file :'))
print('executing {}...'.format(exe_list[ip]))
os.system(exe_list[ip])
os.path.join(root, name) prepends the directory being walked to the exe file name.
os.path.basename(path) fetches just the filename.exe
exe_list stores the whole path of an exe file at each index

Can be done with winapps
First install winapps by typing:
pip install winapps
After that use the library:
# This will give you list of installed applications along with some information
import winapps
for app in winapps.list_installed():
    print(app)
If you want to search for an app you can simply do:
application = 'chrome'
for app in winapps.search_installed(application):
    print(app)

Pig Script: STORE command not working

This is my first time posting to StackOverflow and I'm hoping someone can assist. I'm fairly new to Pig scripts and have encountered a problem I can't solve.
Below is a pig script that fails when I attempt to write results to a file:
register 'myudf.py' using jython as myfuncs;
A = LOAD '$file_nm' USING PigStorage('$delimiter') AS ($fields);
B = FILTER A by ($field_nm) IS NOT NULL;
C = FOREACH B GENERATE ($field_nm) as fld;
D = GROUP C ALL;
E = FOREACH D GENERATE myfuncs.theResult(C.fld);
--DUMP E;
STORE E INTO 'myoutput/theResult';
EXEC;
I see the results of E when I Dump to the screen. However, I need to store the results temporarily in a file. After the Store command, the error I receive is: Output Location Validation Failed.
I've tried numerous workarounds, like removing the theResult folder and removing the earlier contents of theResult, but none of the commands I use work. These have been along the lines of:
hdfs dfs -rm myoutput/theResult
and
hadoop fs -rm myoutput/theResult
...using both the shell (sh) and file system (fs) commands. I've tried to call another function (shell script, Python function, etc.) to clear the earlier results stored in the myoutput/theResult folder. I've read every website I can find and nothing is working. Any ideas?

The output location of a MapReduce job is a directory, so you should remove it this way:
hadoop fs -rmr myoutput/theResult
and then run the Pig script. It will work.
"rmr" - remove recursively, which deletes folders as well as files (newer Hadoop releases deprecate it in favor of "rm -r")
"rm" - plain remove, which deletes only files
Every time, you need to either change the output path or delete the existing one before reusing it, since HDFS is a WORM (write once, read many) storage model.

A couple of things you can try:
Make sure the output directory is a valid path.
Remove the entire directory and not just the content within it. Remove the directory with 'rmr' and check that the path doesn't exist before running the Pig script.

Thanks for both of your replies. I now have a solution that is working:
fs -mkdir -p myoutput/theResult
fs -rm -r myoutput/theResult
The first line attempts to create a directory, but the "-p" prevents an error if it already exists. Then the second line removes it. Either way, there will be a directory to remove, so no error!

The output of store is confusing when we are using Pig for the first time.
store grp into '/output1';
This will create the folder named 'output1' in the root directory. The folder must not already exist.
You can give your own hdfs path here like /user/thewhitetulip.
hdfs dfs -ls /output1
output:
/output1/_SUCCESS
/output1/part-r-00000
The part-r-00000 file is the output of the store command.

Get full path of currently open files

I'm trying to code a simple application that must read all currently open files within a certain directory.
More specifically, I want to get a list of files open anywhere inside my Documents folder,
but I don't want only the processes' IDs or process name, I want the full path of the open file.
The thing is I haven't quite found anything to do that.
I could do it neither in the Linux shell (using the ps and lsof commands) nor with Python's psutil library. Neither gives me the information I need, which is only the paths of currently open files in a directory.
Any advice?
P.S: I'm tagging this as python question (besides os related tags) because it would be a plus if it could be done using some python library.

This seems to work (on Linux):
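import subprocess
import shlex
cmd = shlex.split('lsof -F n +d .')
try:
    # universal_newlines=True returns text rather than bytes on Python 3
    output = subprocess.check_output(cmd, universal_newlines=True).splitlines()
except subprocess.CalledProcessError as err:
    # lsof may exit non-zero while still printing results
    output = err.output.splitlines()
# keep only the name fields ("n./...") and strip the "n./" prefix
output = [line[3:] for line in output if line.startswith('n./')]
# Out[3]: ['file.tmp']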
It reads open files from the current directory, non-recursively.
For a recursive search, use the +D option. Keep in mind that this is vulnerable to a race condition: by the time you get your output, the situation might already have changed. It is always best to try to do something (open the file) and check for failure, e.g. open the file and catch the exception, or check for a null FILE value in C.
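On Linux, a similar result can also be obtained without lsof by reading the /proc file system, where each /proc/<pid>/fd entry is a symlink to the file that process has open. A minimal sketch, assuming Python 3 on Linux; the function name is hypothetical, and it only sees processes whose descriptors the current user is allowed to inspect:

```python
import os


def open_files_under(directory):
    """Return the set of full paths under `directory` held open by visible processes."""
    prefix = os.path.realpath(directory) + os.sep
    found = set()
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join('/proc', pid, 'fd')
        try:
            descriptors = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in descriptors:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if target.startswith(prefix):
                found.add(target)
    return found
```

The search is recursive because it matches on the path prefix, and the same race-condition caveat applies: the result is a snapshot that may be stale by the time it is used.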

converting/mapping linux reference path without altering the file?

I'm currently on a project where my client needs the reference file paths to
remain in Linux format. For example:
A.ma , referencing objects from --> //linux/project/scene/B.ma
B.ma , referencing objects from --> //linux/project/scene/C.ma
Most of our Maya licenses here, however, are on Windows. I can run a
Python script that converts all the paths to Windows paths and saves the
file. For example:
Z:\project\scene\B.ma
However I'm trying to figure out a way to do this without converting
or altering the original file.... I'll try to explain what I'm trying to do.
Run the script to open the file.
The script checks for the linux formatted reference path, and all
child path down the hierarchy.
Maps all paths to their appropriate windows formatted paths.
Giving the animators the ability to "save" files normally without running a separate save script.
Is it possible to achieve this with a Python script? Or will I need a
fully compiled plug-in to get this to work?
Any suggestion is greatly appreciated.
edit: Thank you for your input.
A little more clarification. The projects were set up for us by a remote company and part of the requirement is that we have to keep the paths as they are. They come as absolute paths and we have no choice in that matter.
We match the mount //linux/ on our Fedora workstations. That same drive is mapped to Z:\ on our Windows workstations. We only have 2 Maya licenses for Linux, though, which is why I'm trying to do this.

Here is a solution. First step is to create a dict that keeps track of linux/windows references (don't forget to import the re module for regexp):
>>> def windows_path(path):
...     return path.replace('//linux', 'Z:').replace('/', '\\')
>>> reg = re.compile('(\w+\.ma) , referencing objects from --> (.*)')
>>> d = {}
>>> for line in open('D:\\temp\\Toto.txt'):
...     match = reg.match(line)
...     if match:
...         file_name = match.groups()[0]
...         linux_path = match.groups()[1]
...         d[file_name] = (linux_path, windows_path(linux_path))
>>> d
{'B.ma': ('//linux/project/scene/C.ma', 'Z:\\project\\scene\\C.ma'),
'A.ma': ('//linux/project/scene/B.ma', 'Z:\\project\\scene\\B.ma')}
Then you just need to loop on this dict to ask for file save:
>>> for file_name in d.keys():
...     s = raw_input('do you want to save file %s ? ' % file_name)
...     if s.lower() in ('y', 'yes'):
...         # TODO: save your file thanks to d[file_name][0] for linux path,
...         # d[file_name][1] for windows path
...         print '-> file %s was saved' % file_name
...     else:
...         print '-> file %s was not saved' % file_name
do you want to save file B.ma ? n
-> file B.ma was not saved
do you want to save file A.ma ? yes
-> file A.ma was saved

Many Windows applications will interpret paths with two leading "/"s as UNC paths. I don't know if Maya is one of those, but try it out. If Maya can understand paths like "//servername/share/foo", then all you need to do is set up an SMB server named "linux", and the paths will work as they are. I would guess that this is actually what your client does, since the path "//linux" would not make sense in a Linux-only environment.

You can use environment variables to do this. Maya will expand environment vars present in a file path, you could use Maya.env to set them up properly for each platform.

What you are looking for is the dirmap MEL command. It is completely non-intrusive to your files, as you just define a mapping from your Linux paths to Windows paths and/or vice versa. Maya will apply the mapping internally to resolve the paths, without changing them when saving the file.
To set up dirmap, you need to run a MEL script which issues the respective commands on Maya startup. userSetup.mel could be one place to put it.
For more details, see the official documentation. This particular link points to Maya 2012; the command is available in Maya 7.0 and earlier as well, though:
http://download.autodesk.com/global/docs/maya2012/en_us/Commands/dirmap.html
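Conceptually, a mapping of this kind is a prefix substitution applied when a path is resolved, while the stored path stays untouched. The following is a pure-Python illustration of that idea only; it is not the Maya API, and resolve_reference and WINDOWS_MAPPINGS are hypothetical names built from the //linux and Z: example in the question:

```python
# Hypothetical mapping table: stored prefix -> local Windows prefix.
WINDOWS_MAPPINGS = [('//linux', 'Z:')]


def resolve_reference(stored_path, mappings=WINDOWS_MAPPINGS):
    """Return a local path for `stored_path` without modifying the original."""
    for source_prefix, target_prefix in mappings:
        prefix = source_prefix.rstrip('/')
        # Match whole path components only, so '//linuxfoo' is not rewritten.
        if stored_path == prefix or stored_path.startswith(prefix + '/'):
            remainder = stored_path[len(prefix):]
            return target_prefix + remainder.replace('/', '\\')
    return stored_path  # no mapping applies; leave the path untouched
```

Because the function returns a new string, the reference written back to the file can remain the original Linux-style path, which is the property the question asks for.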
