Hadoop mapreduce python command line arguments - python

In my python mapper code, I need to access the 'path' given in -input 'path'. How is it possible to access this in python code?

You can read the input file from os.environ. For example,
import os
input_file = os.environ['map_input_file']
Actually, you can also read other JobConf values from os.environ. Note: during the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. To get the values in a streaming job's mapper/reducer, use the parameter names with the underscores. See Configured Parameters.
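As an illustration, a streaming mapper could tag every output line with the file it came from. This is a minimal sketch; note that newer Hadoop releases expose the variable as mapreduce_map_input_file rather than map_input_file, so checking both is the safest option:
#!/usr/bin/env python
import os
import sys

# fall back across the old and new environment variable names
input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", "unknown"))

for line in sys.stdin:
    sys.stdout.write("%s\t%s" % (input_file, line))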
I also found a very useful post for you: A Guide to Python Frameworks for Hadoop.

Related

How to load .mat file into workspace using Matlab Engine API for Python?

I have a .mat workspace file containing 4 character variables. These variables contain paths to various folders I need to be able to cd to and from relatively quickly. Usually, when using only Matlab I can load this workspace as follows (provided the .mat file is in the current directory).
load paths.mat
Currently I am experimenting with the Matlab Engine API for Python. The Matlab help docs recommend using the following Python formula to send variables to the current workspace in the desktop app:
import matlab.engine
eng = matlab.engine.start_matlab()
x = 4.0
eng.workspace['y'] = x
a = eng.eval('sqrt(y)')
print(a)
This works well. However, the whole point of the .mat file is that it can quickly load entire sets of variables the user is comfortable with, so the above is not efficient when trying to load the workspace.
I have also tried two different variations in Python:
eng.load("paths.mat")
eng.eval("load paths.mat")
The first variation successfully loads a dict variable in Python containing all four keys and values but this does not propagate to the workspace in Matlab. The second variation throws an error:
File "", line unknown SyntaxError: Error: Unexpected MATLAB
expression.
How do I load up a workspace through the engine without having to manually do it in Matlab? This is an important part of my workflow....
You didn't specify the number of output arguments from the MATLAB engine, which is a possible reason for the error.
I would expect the error from eng.load("paths.mat") to read something like
TypeError: unsupported data type returned from MATLAB
The difference in error messages may arise from different versions of MATLAB, engine API...
In any case, try specifying the number of output arguments like so,
eng.load("paths.mat", nargout=0)
This was giving me fits for a while, so here are a few things to try. I was able to get this working on Matlab 2019a with Python 3.7. I had the most trouble creating a string and using it as an argument for load and eval/evalin, so there may be some trickiness with single versus double quotes, or a need for an additional set of quotes inside the string.
Make sure the MAT file is on the Matlab Path. You can use addpath and rmpath really easily with pathlib objects:
from pathlib import Path
mat_file = Path('local/path/from/cwd/example.mat').resolve()  # get absolute path
eng.addpath(str(mat_file.parent))
# Execute other commands
eng.rmpath(str(mat_file.parent))
Per dML's answer, make sure to specify nargout=0 when there are no outputs from the function, and always when calling a script. If there are one or more outputs you don't have to capture them in Python, and if there is more than one they will be returned as a tuple.
You can also turn your script into a function (just won't have access to base workspace without using evalin/assignin):
function load_example_matfile()
evalin('base','load example.mat')
end
eng.feval('load_example_matfile', nargout=0)
And it does seem to work with plain vanilla eval and load as well, but if you leave off the nargout=0 it either errors out or gives you the contents of the file back in Python directly.
Both of these work.
eng.eval('load example.mat', nargout=0)
eng.load('example.mat', nargout=0)
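Putting the pieces together, a minimal end-to-end sketch looks like this (it assumes paths.mat is on the MATLAB path and contains a variable named data_dir; both names are placeholders for your own):
import matlab.engine

eng = matlab.engine.start_matlab()

# nargout=0 because load invoked this way returns nothing to Python
eng.eval("load paths.mat", nargout=0)

# the loaded variables now live in the engine's base workspace
print(eng.workspace['data_dir'])

eng.quit()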

Using Hadoop InputFormat in Pyspark

I'm working on a file parser for Spark that can basically read in n lines at a time and place all of those lines as a single row in a dataframe.
I know I need to use InputFormat to try and specify that, but I cannot find a good guide to this in Python.
Is there a method for specifying a custom InputFormat in Python or do I need to create it as a scala file and then specify the jar in spark-submit?
You can directly use the InputFormats with Pyspark.
Quoting from the documentation,
PySpark can also read any Hadoop InputFormat or write any Hadoop
OutputFormat, for both ‘new’ and ‘old’ Hadoop MapReduce APIs.
Pass the HadoopInputFormat class to any of these methods of pyspark.SparkContext as suited,
hadoopFile()
hadoopRDD()
newAPIHadoopFile()
newAPIHadoopRDD()
To read n lines, org.apache.hadoop.mapreduce.lib.input.NLineInputFormat can be used as the HadoopInputFormat class with the newAPI methods.
I cannot find a good guide to this in Python
In the Spark docs, under "Saving and Loading Other Hadoop Input/Output Formats", there is an Elasticsearch example + links to an HBase example.
can basically read in n lines at a time... I know I need to use InputFormat to try and specify that
There is NLineInputFormat specifically for that.
This is a rough translation of some Scala code I have from NLineInputFormat not working in Spark
from pyspark import SparkContext

def nline(n, path):
    sc = SparkContext.getOrCreate()
    conf = {
        # Hadoop configuration values must be passed as strings
        "mapreduce.input.lineinputformat.linespermap": str(n)
    }
    hadoopIO = "org.apache.hadoop.io"
    return sc.newAPIHadoopFile(path,
        "org.apache.hadoop.mapreduce.lib.input.NLineInputFormat",
        hadoopIO + ".LongWritable",
        hadoopIO + ".Text",
        conf=conf).map(lambda x: x[1])  # strip out the file-offset key

n = 3
rdd = nline(n, "/file/input")
and place all of those lines as a single row in a dataframe
With NLineInputFormat, each string in the RDD is actually newline-delimited. You can rdd.map(lambda record: "\t".join(record.split('\n'))), for example, to make one line out of them.
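To then get from that RDD to a one-record-per-row DataFrame, something like the following should work (it assumes an active SparkSession; the column name lines is arbitrary):
from pyspark.sql import Row

rdd = nline(3, "/file/input")
# each element of the RDD is one n-line record; join its lines with tabs
df = rdd.map(lambda rec: Row(lines=rec.replace("\n", "\t"))).toDF()
df.show(truncate=False)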

Most Effecient way to parse Evtx files for specific content

I have hundreds of gigs of Evtx security event logs I want to parse for specific Event IDs (4624) and usernames (joe) based on the Event IDs. I have attempted to use Powershell cmdlet like below:
get-winevent -filterhashtable @{Path="mypath.evtx"; providername="securitystuffprovider"; id=4624}
I know I can pass a variable containing a list to the Path parameter for all of my evtx files, but I am unable to filter based on a subset of the message of the EVTX. Also, this takes an incredibly long time to parse just one Evtx file, much less 150 or so. I know there is a Python package to parse Evtx, but I am not sure how that would look, as the python-evtx parser doesn't provide great examples of importing and using the package itself. I cannot extract all of the data into CSV as that would take too much disk space. Any ideas on how to do this would be amazing. Thanks.
Use -Path with the -FilterXPath parameter, and then filter using an XPath expression like so:
$Username = 'jdoe'
$XPathFilter = "*[System[(EventID=4624)] and EventData[Data[@Name='SubjectUserName'] and (Data='$Username')]]"
Get-WinEvent -Path C:\path\to\log\files\*.evtx -FilterXPath $XPathFilter
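If you prefer to stay in Python, the python-evtx package mentioned in the question can do a first pass. Below is a rough sketch, assuming the commonly used Evtx.Evtx interface with records() and record.xml() (verify against your installed version); it uses a cheap substring pre-filter rather than real XML parsing:
import Evtx.Evtx as evtx

TARGET_ID = "4624"    # event ID of interest
TARGET_USER = "joe"   # username of interest

def matching_events(path):
    with evtx.Evtx(path) as log:
        for record in log.records():
            xml = record.xml()
            # substring pre-filter; parse the XML properly if you need to
            # distinguish SubjectUserName from TargetUserName, etc.
            if TARGET_ID in xml and TARGET_USER in xml:
                yield xml

for event_xml in matching_events("Security.evtx"):
    print(event_xml)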

Transforming a JSON file in Hadoop

I have 100GB of JSON files whose each row looks like this:
{"field1":100, "field2":200, "field3":[{"in1":20, "in2":"abc"},{"in1":30, "in2":"xyz"}]}
(It's actually a lot more complicated, but this will do as a small demo.)
I want to process it to something whose each row looks like this:
{"field1":100, "field2":200, "abc":20, "xyz":30}
Being extremely new to Hadoop, I just want to know if I'm on the right path:
Referring to this:
http://www.glennklockwood.com/di/hadoop-streaming.php
For conventional applications I'd create a mapper and reducer in Python and execute it using something like:
hadoop \
jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input "wordcount/mobydick.txt" \
-output "wordcount/output"
Now let me know if I'm on the right track:
Since I just need to parse a lot of files into another form, I suppose I don't need any reduction step. I can simply write a mapper which:
Takes input from stdin
Reads stdin line by line
Transforms each line according to my specifications
Outputs into stdout
Then I can run hadoop with simply a mapper and 0 reducers.
Does this approach seem correct? Will I be actually using the cluster properly or would this be as bad as running the Python script on a single host?
You are correct: in this case you don't need any reducers, since the output of your mapper is already what you want, so you should set the number of reducers to 0. When you tell Hadoop the input path where your JSON data is, it will automatically feed each mapper its own split of the input lines; your mapper processes them and emits the results to the context, so that the values are stored in the output path. The approach is correct, and this task is 100% parallelizable, so if you have more than one machine in your cluster and your configuration is correct, it should take full advantage of the cluster and run much faster than on a single host.
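For the demo format above, a minimal map-only mapper.py might look like the sketch below (field names are taken from the demo line; a real mapper would also want error handling for malformed input):
#!/usr/bin/env python
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    out = {"field1": record["field1"], "field2": record["field2"]}
    # flatten the nested list: each {"in1": ..., "in2": ...} entry becomes
    # a top-level "in2" -> "in1" pair, e.g. "abc": 20
    for entry in record.get("field3", []):
        out[entry["in2"]] = entry["in1"]
    print(json.dumps(out))
With streaming you can then drop the -reducer argument and set the number of reducers to 0 (for example with -D mapreduce.job.reduces=0) so the mapper output is written directly to the output path.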

Python Advice for a beginner. Regex, Dictionaries etc?

I'm writing my second Python script to try to parse the contents of a config file and would like some noob advice. I'm not sure if it's best to use regex to parse it since it spans multiple lines. I've also been reading about dictionaries and wondered if this would be good practice. I'm not necessarily looking for the code, just a push in the right direction.
Example: My config file looks like this.
Job {
Name = "host.domain.com-foo"
Client = host.domain.com-fd
JobDefs = "DefaultJob"
FileSet = "local"
Write Bootstrap = "/etc/foo/host.domain.com-foo.bsr"
Pool = storage-disk1
}
Should I used regex, line splitting or maybe a module? If I had multiple jobs in my config file would I use a dictionary to correlate a job to a pool?
If you can change the configuration file format, you can directly write your file as a Python file.
config.py
job = {
'Name' : "host.domain.com-foo",
'Client' : "host.domain.com-fd",
'JobDefs' : "DefaultJob",
'FileSet' : "local",
'Write Bootstrap' : "/etc/foo/host.domain.com-foo.bsr",
'Pool' : 'storage-disk1'
}
yourscript.py
from config import job
print(job['Name'])
There are numerous existing alternatives for this task: json, pickle and yaml, to name three. Unless you really want to implement this yourself, you should use one of these. Even if you do roll your own, following the format of one of the above is still a good idea.
Also, it's a much better idea to use a parser/generator or similar tool to do the parsing; regexes are going to be harder to maintain and less efficient for this type of task.
If your config file can be turned into a python file, just make it a dictionary and import the module.
Job = { "Name" : "host.domain.com-foo",
"Client" : "host.domain.com-fd",
"JobDefs" : "DefaultJob",
"FileSet" : "local",
"Write BootStrap" : "/etc/foo/host.domain.com-foo.bsr",
"Pool" : "storage-disk1" }
You can access the options by simply using Job["Name"], etc.
The ConfigParser is easy to use as well. You can create a text file that looks like this:
[Job]
Name=host.domain.com-foo
Client=host.domain.com-fd
JobDefs=DefaultJob
FileSet=local
Write BootStrap=/etc/foo/host.domain.com-foo.bsr
Pool=storage-disk1
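Reading it back is then only a few lines (a minimal sketch; the module is configparser in Python 3, ConfigParser in Python 2, and the filename here is made up):
import configparser

config = configparser.ConfigParser()
config.read("job.ini")                   # hypothetical filename
print(config["Job"]["Name"])             # host.domain.com-foo
print(config["Job"]["Write BootStrap"])  # option lookups are case-insensitive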
Just keep it simple like one of the above.
The ConfigParser module from the standard library is probably the most Pythonic and straightforward way to parse a configuration file that your Python script is using.
If you are restricted to using the particular format you have outlined, then using pyparsing is pretty good.
I don't think a regex is adequate for parsing something like this. You could look at a true parser, such as pyparsing. Or if the file format is within your control, you might consider XML. There are standard Python libraries for parsing that.
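If you do go the pyparsing route, here is a rough sketch of a grammar for the Job { ... } format in the question; it is untested against real configs of this kind, so treat it as a starting point rather than a finished parser:
from pyparsing import (Word, alphas, alphanums, QuotedString,
                       Group, Suppress, ZeroOrMore)

config_text = '''
Job {
  Name = "host.domain.com-foo"
  Client = host.domain.com-fd
  JobDefs = "DefaultJob"
  FileSet = "local"
  Write Bootstrap = "/etc/foo/host.domain.com-foo.bsr"
  Pool = storage-disk1
}
'''

# option names may contain spaces ("Write Bootstrap"), hence the strip
key = Word(alphas, alphanums + " ").setParseAction(lambda t: t[0].strip())
value = QuotedString('"') | Word(alphanums + "./-")
setting = Group(key + Suppress("=") + value)
block = (Word(alphas)("type") + Suppress("{")
         + Group(ZeroOrMore(setting))("settings") + Suppress("}"))

result = block.parseString(config_text)
job = dict(pair.asList() for pair in result.settings)
print(job["Name"])             # host.domain.com-foo
print(job["Write Bootstrap"])  # /etc/foo/host.domain.com-foo.bsr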
