I have 100GB of JSON files, each row of which looks like this:
{"field1":100, "field2":200, "field3":[{"in1":20, "in2":"abc"},{"in1":30, "in2":"xyz"}]}
(It's actually a lot more complicated, but this'll do as a small demo.)
I want to process it into something where each row looks like this:
{"field1":100, "field2":200, "abc":20, "xyz":30}
Being extremely new to Hadoop, I just want to know if I'm on the right path:
Referring to this:
http://www.glennklockwood.com/di/hadoop-streaming.php
For conventional applications I'd create a mapper and a reducer in Python and execute them using something like:
hadoop \
jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input "wordcount/mobydick.txt" \
-output "wordcount/output"
Now let me know if I'm on the right track:
Since I just need to parse a lot of files into another form, I suppose I don't need any reduction step. I can simply write a mapper which:
Takes input from stdin
Reads stdin line by line
Transforms each line according to my specifications
Outputs into stdout
Then I can run hadoop with simply a mapper and 0 reducers.
Does this approach seem correct? Will I be actually using the cluster properly or would this be as bad as running the Python script on a single host?
You are correct: in this case you don't need a reducer, because the output of your mapper is already what you want, so you should set the number of reducers to 0. When you give Hadoop the input path where your JSON data lives, it automatically feeds each mapper a chunk (split) of the input lines; your mapper processes those lines and emits the results (in streaming, simply by writing to stdout), and Hadoop stores them in the output path. The approach is correct, and this task is 100% parallelizable, so if you have more than one machine in your cluster and your configuration is correct, it will take full advantage of the cluster and run much faster than on a single host.
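A minimal sketch of such a streaming mapper, assuming each input line is one self-contained JSON object shaped like the demo row above (the field names come from the example, not from your real data):

#!/usr/bin/env python
# mapper.py - read JSON lines from stdin, write flattened JSON lines to stdout
import sys
import json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    out = {"field1": record["field1"], "field2": record["field2"]}
    # flatten the nested list: each inner object contributes one "in2": in1 pair
    for inner in record.get("field3", []):
        out[inner["in2"]] = inner["in1"]
    sys.stdout.write(json.dumps(out) + "\n")

You would then launch it with the streaming jar as in the command above, dropping the -reducer option and adding -numReduceTasks 0 (or -D mapred.reduce.tasks=0 with old streaming versions) so that the mapper output is written straight to the output path.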
Related
I am a beginner with Spark and have started to write some scripts in Python. My understanding is that Spark executes transformations in parallel (map).
from datetime import datetime
import time
from pyspark import SparkConf, SparkContext

def some_function(name, content):
    print(name, datetime.now())
    time.sleep(30)
    return content

config = SparkConf().setAppName("sample2").setMaster("local[*]")
filesRDD = SparkContext(conf=config).binaryFiles("F:\\usr\\temp\\*.zip")
inputfileRDD = filesRDD.map(lambda job_bundle: (job_bundle[0], some_function(job_bundle[0], job_bundle[1])))
print(inputfileRDD.collect())
The above code collects the list of .zip files from a folder and processes them. When I execute it, I see that this happens sequentially.
Output
file:/F:/usr/temp/sample.zip 2020-10-22 10:42:37.089085
file:/F:/usr/temp/sample1.zip 2020-10-22 10:43:07.103317
You can see that it started processing the 2nd file after 30 seconds, i.e. after completing the first file. What went wrong in my code? Why is it not executing the RDD in parallel? Can you please help me?
I don't know exactly how the method binaryFiles distributes the files across Spark partitions. It seems that, contrary to textFile, it tends to create only one partition. Let's look at that with an example directory called dir containing 5 files.
> ls dir
test1 test2 test3 test4 test5
If I use textFile, things run in parallel. I don't show the output because it is not very pretty, but you can check for yourself. We can verify that things run in parallel with getNumPartitions.
>>> sc.textFile("dir").foreach(lambda x: some_function(x, None))
# ugly output, but everything starts at the same time,
# except maybe the last one since you have 4 cores.
>>> sc.textFile("dir").getNumPartitions()
5
With binaryFiles things are different and for some reason everything goes to the same partition.
>>> sc.binaryFiles("dir").getNumPartitions()
1
I even tried with 10k files and everything still goes to the same partition. I believe the reason is that in Scala, binaryFiles returns an RDD of file names paired with an object that allows the file to be read lazily (no reading is performed yet). The resulting RDD is therefore fast to build and small, so having it on one partition is OK.
In Scala, we can thus call repartition after binaryFiles and things work great.
scala> sc.binaryFiles("dir").getNumPartitions
1
scala> sc.binaryFiles("dir").repartition(4).getNumPartitions
4
scala> sc.binaryFiles("dir").repartition(4)
         .foreach{ case (name, ds) => {
           println(System.currentTimeMillis + ": " + name)
           Thread.sleep(2000)
           // do some reading on the DataStream ds
         }}
1603352918396: file:/home/oanicol/sandbox/dir/test1
1603352918396: file:/home/oanicol/sandbox/dir/test3
1603352918396: file:/home/oanicol/sandbox/dir/test4
1603352918396: file:/home/oanicol/sandbox/dir/test5
1603352920397: file:/home/oanicol/sandbox/dir/test2
The problem in Python is that binaryFiles actually reads the files onto one single partition. And, quite mysteriously to me, the following lines of code in pyspark 2.4 yield the same behaviour you noticed, which does not make sense.
# this should work but does not
sc.binaryFiles("dir", minPartitions=4).foreach(lambda x: some_function(x, ''))
# this does not work either, which is strange but it would not be advised anyway
# since all the data would be read on one partition
sc.binaryFiles("dir").repartition(4).foreach(lambda x: some_function(x, ''))
Yet, since binaryFiles actually reads the file anyway, you can use wholeTextFiles, which reads the files as text and behaves as expected:
# this works
sc.wholeTextFiles("dir", minPartitions=4).foreach(lambda x: some_function(x, ''))
I have a python script that currently runs on my desktop. It takes a csv file with roughly 25 million lines (maybe 15 or so columns) and performs line-by-line operations.
For each line of input, multiple output lines are produced. The results are then written line by line into a csv file; the output ends up at around 100 million lines.
Code looks something like this:
import csv

with open(outputfile, "a") as outputcsv:
    with open(inputfile, "r") as csvfile:
        reader = csv.reader(csvfile)
        headerlist = next(reader)
        for row in reader:
            variable1 = row[headerlist.index("VAR1")]
            variableN = row[headerlist.index("VARN")]
            while calculations not complete:
                do stuff  # Some complex calculations are done at this point
            outputcsv.write(stuff)
We're now trying to convert the script to run via Hadoop, using pyspark.
I have no idea how to even start. I'm trying to work out how to iterate through an RDD object but don't think it can be done.
Is a line by line calculation like this suitable for distributed processing?
If you directly want to run the script, you could do so via spark-submit:
spark-submit --master local[*] other_parameters path_to_your_script.py   # or --master yarn on a cluster
But I would suggest going for the Spark APIs, as they are easy to use and will lower the coding overhead.
First you have to create a SparkSession variable so that you can access all Spark functions:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSessionZipsExample") \
    .config("parameter", "value") \
    .getOrCreate()
Next, if you want to load a csv file:
file = spark.read.csv("path to file")
You can specify optional parameters like header, inferSchema, etc.:
file=spark.read.option("header","true").csv("path to your file")
'file' will now be a PySpark DataFrame.
You can now write the end output like this:
file.write.csv("output_path")
Please refer to the Spark documentation for transformations and other information.
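Since each of your input lines produces multiple output lines, one way to express that with the DataFrame/RDD API is flatMap. Here is a minimal sketch; the column names VAR1/RESULT and the expand_row logic are placeholders standing in for your actual calculation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_line_by_line").getOrCreate()

df = spark.read.option("header", "true").csv("path to your file")

def expand_row(row):
    # placeholder for your complex calculation: return a list of output records
    return [(row["VAR1"], "result_a"), (row["VAR1"], "result_b")]

# flatMap emits zero or more output records per input row, in parallel across the cluster
output = df.rdd.flatMap(expand_row).toDF(["VAR1", "RESULT"])
output.write.csv("output_path")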
I have two files on HDFS and I just want to join these two files on a column say employee id.
I am trying to simply print the files to make sure we are reading that correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried foreach and println functions as well and I am not able to display file data.
I am working in Python and am totally new to both Python and Spark.
This is really easy, just do a collect.
You must be sure that all the data fits in the memory of your master.
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case, you should just take a sample using the take method.
# I use an exaggerated number to remind you it is very large and won't fit in the memory of your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
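As for the join on employee id that the question mentions, here is a minimal sketch assuming both files are comma-separated with the employee id in the first column (the second file name dept.txt and the column positions are assumptions):

emp = sc.textFile("hdfs://ip:8020/emp.txt") \
    .map(lambda line: line.split(",")) \
    .map(lambda cols: (cols[0], cols[1:]))   # key each row by employee id

dept = sc.textFile("hdfs://ip:8020/dept.txt") \
    .map(lambda line: line.split(",")) \
    .map(lambda cols: (cols[0], cols[1:]))

joined = emp.join(dept)    # (employee_id, (emp_columns, dept_columns))
print joined.take(10)      # preview a few joined rows instead of collecting everything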
In my Python mapper code, I need to access the 'path' given in -input 'path'. How can I access this in Python code?
You can read the input file path from os.environ. For example:
import os
input_file = os.environ['map_input_file']
Actually, you can also read other JobConf values from os.environ. Note: during the execution of a streaming job, the names of the "mapred" parameters are transformed: the dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. To get the values in a streaming job's mapper/reducer, use the parameter names with the underscores. See Configured Parameters.
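A minimal sketch of a streaming mapper that tags every output line with the file it came from; note that depending on the Hadoop version the variable may be called map_input_file or mapreduce_map_input_file:

#!/usr/bin/env python
import os
import sys

# older streaming releases expose map_input_file, newer ones mapreduce_map_input_file
input_file = os.environ.get('map_input_file') or os.environ.get('mapreduce_map_input_file', '')

for line in sys.stdin:
    sys.stdout.write("%s\t%s" % (input_file, line))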
I also found a very useful post for you: A Guide to Python Frameworks for Hadoop.
I would like to be able to take data from a file (a spreadsheet or other) and create a dictionary that I can then iterate over in a loop on the keys, with the corresponding values inserted into my command for each key. Sorry if that does not make much sense, I will explain in more detail below.
I have several samples that I am running through a bioinformatics pipeline and I am trying to automate the process. One of the steps is adding "read group" information to my files which is done with the following shell command:
picard-tools AddOrReplaceReadGroups I=input.bam O=output.bam RGID=IDXX
RGLB=LBXX RGPL=PLXX RGPU=PUXX RGSM=SMXX VALIDATION_STRINGENCY=SILENT
SORT_ORDER=coordinate CREATE_INDEX=true
For each sample ID there is a different RGID, RGLB, RGPL, RGPU, and RGSM (and different input files, but I already know how to call that info). What I would like to do is have a loop that executes this command for each sample ID and inserts the corresponding RGID, RGLB, RGPL, RGPU, and RGSM into the command. Is there an easy way to do this? I have been reading a bit and it seems like a dictionary is probably the way to go, but it is not clear to me how to generate the dictionary and call the individual values into my command.
This should be pretty easy, but how you do it depends on the format of your input file. You're going to want something basically like this:
import subprocess  # This is how we're going to call the commands.

samples = {}  # Empty dict
with open('inputfile', 'r') as f:
    for line in f:
        # Extract sampleID, rgid, rglb, rgpl, rgpu, rgsm depending on your file format...
        samples[sampleID] = [rgid, rglb, rgpl, rgpu, rgsm]  # Populate dict

for sampleID in samples:
    rgid, rglb, rgpl, rgpu, rgsm = samples[sampleID]
    # Now you can run your commands using the subprocess module.
    # Remember to adjust e.g. the input/output file names based on sampleID if they differ.
    subprocess.call(['picard-tools', 'AddOrReplaceReadGroups', 'I=input.bam',
                     'O=output.bam', 'RGID=%s' % rgid, 'RGLB=%s' % rglb, 'RGPL=%s' % rgpl,
                     'RGPU=%s' % rgpu, 'RGSM=%s' % rgsm, 'VALIDATION_STRINGENCY=SILENT',
                     'SORT_ORDER=coordinate', 'CREATE_INDEX=true'])
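For concreteness, here is one way the dict could be populated, assuming a comma-separated sample sheet named samples.csv with a header row and columns sampleID,RGID,RGLB,RGPL,RGPU,RGSM (the file name and column layout are assumptions; adapt them to your actual spreadsheet export):

import csv

samples = {}
with open('samples.csv', 'r') as f:      # hypothetical sample sheet
    reader = csv.DictReader(f)           # uses the header row as keys
    for row in reader:
        samples[row['sampleID']] = [row['RGID'], row['RGLB'],
                                    row['RGPL'], row['RGPU'], row['RGSM']]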