Spark using Python: save RDD output into text files - python

I am trying the word count problem in Spark using Python, but I am running into a problem when I try to save the output RDD to a text file using the .saveAsTextFile command. Here is my code. Please help me; I am stuck. I appreciate your time.
import re
from pyspark import SparkConf, SparkContext

def normalizewords(text):
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

conf = SparkConf().setMaster("local[2]").setAppName("sorted result")
sc = SparkContext(conf=conf)
input = sc.textFile("file:///home/cloudera/PythonTask/sample.txt")
words = input.flatMap(normalizewords)
wordsCount = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
sortedwordsCount = wordsCount.map(lambda (x, y): (y, x)).sortByKey()
results = sortedwordsCount.collect()
for result in results:
    count = str(result[0])
    word = result[1].encode('ascii', 'ignore')
    if word:
        print word + "\t\t" + count
results.saveAsTextFile("/var/www/myoutput")

Since you called results = sortedwordsCount.collect(), results is not an RDD; it is a plain Python list of tuples.
As you know, a list is a Python object/data structure and append is the method for adding an element:
>>> x = []
>>> x.append(5)
>>> x
[5]
Similarly, an RDD is Spark's object/data structure and saveAsTextFile is the method for writing it out to files. The important difference is that an RDD is a distributed data structure.
So you cannot call append on an RDD or saveAsTextFile on a list. collect is the RDD method that brings the RDD's contents into driver memory.
As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and write results into it.

Change results = sortedwordsCount.collect() to results = sortedwordsCount, because with .collect(), results is a plain list rather than an RDD.
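For illustration, a minimal sketch of both fixes (the paths are placeholders): save the RDD directly with saveAsTextFile, or collect to the driver and write a local file by hand:

# Option 1: save the RDD itself; Spark writes part files into the directory
sortedwordsCount.saveAsTextFile("file:///var/www/myoutput")

# Option 2: collect to the driver and write a single local file by hand
results = sortedwordsCount.collect()
with open("/var/www/myoutput.txt", "w") as f:
    for count, word in results:
        f.write("%s\t\t%s\n" % (word, count))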

Related

Saving multiple items to HDFS with (spark, python, pyspark, jupyter)

I'm used to programming in Python. My company now has a Hadoop cluster with Jupyter installed. Until now I have never used Spark / PySpark for anything.
I am able to load files from HDFS as easily as this:
text_file = sc.textFile("/user/myname/student_grades.txt")
And I'm able to write output like this:
text_file.saveAsTextFile("/user/myname/student_grades2.txt")
What I'm trying to achieve is to use a simple for loop to read the text files one by one and write their contents into one HDFS file. So I tried this:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile("/user/myname/all.txt")
So this works for the first element of the list, but then gives me this error message:
Py4JJavaError: An error occurred while calling o714.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
XXXXXXXX/user/myname/all.txt already exists
To avoid confusion I blurred out the address with XXXXXXXX.
What is the right way to do this?
I will have tons of datasets (like 'text1', 'text2', ...) and want to apply a Python function to each of them before saving them to HDFS, but I would like to have all the results together in one output file.
Thanks a lot!
MG
EDIT:
It seems that my final goal was not really clear. I need to apply a function to each text file separately and then append the output to the existing output directory. Something like this:
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file = really_cool_python_function(text_file)
    text_file.saveAsTextFile("/user/myname/all.txt")
I wanted to post this as a comment but could not do so, as I do not have enough reputation.
You have to convert your RDD to a DataFrame and then write it in append mode (a sketch follows below). To convert an RDD to a DataFrame, please look into this answer:
https://stackoverflow.com/a/39705464/3287419
or this link http://spark.apache.org/docs/latest/sql-programming-guide.html
To save a DataFrame in append mode, the link below may be useful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
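As a hedged illustration of that approach (assuming Spark 2.x with an active SparkSession, a hypothetical output path, and the asker's really_cool_python_function as a placeholder):

for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file = really_cool_python_function(text_file)
    # single string column so DataFrameWriter.text() can write it
    df = text_file.map(lambda line: (line,)).toDF(["value"])
    # append mode adds new part files to the same output directory
    df.write.mode("append").text("/user/myname/all_out")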
Almost the same question is asked here as well: Spark: Saving RDD in an already existing path in HDFS. But the answer provided is for Scala; I hope something similar can be done in Python too.
There is yet another (but ugly) approach: convert your RDD to a string, say resultString, and use subprocess to append that string to the destination file, i.e.
subprocess.call("echo "+resultString+" | hdfs dfs -appendToFile - <destination>", shell=True)
You can read multiple files at once and save them with:
textfile = sc.textFile(','.join(['/user/myname/'+f for f in list]))
textfile.saveAsTextFile('/user/myname/all')
You will get all the part files within the output directory.
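If each file needs its own processing first (as in the edit above), a hedged variant of this idea is to transform each file's RDD separately and union the results before a single save (really_cool_python_function is the asker's placeholder):

rdds = []
for i in ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']:
    rdd = sc.textFile("/user/myname/" + i)
    rdds.append(really_cool_python_function(rdd))
# sc.union merges the per-file RDDs into one, which is then saved once
sc.union(rdds).saveAsTextFile("/user/myname/all")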
If the text files all have the same schema, you could use Hive to read the whole folder as a single table, and directly write that output.
I would try this, it should be fine:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile(f"/user/myname/{i}")

Using Hadoop InputFormat in Pyspark

I'm working on a file parser for Spark that can basically read in n lines at a time and place all of those lines as a single row in a dataframe.
I know I need to use InputFormat to try and specify that, but I cannot find a good guide to this in Python.
Is there a method for specifying a custom InputFormat in Python or do I need to create it as a scala file and then specify the jar in spark-submit?
You can use Hadoop InputFormats directly with PySpark.
Quoting from the documentation,
PySpark can also read any Hadoop InputFormat or write any Hadoop
OutputFormat, for both ‘new’ and ‘old’ Hadoop MapReduce APIs.
Pass the Hadoop InputFormat class to whichever of these pyspark.SparkContext methods suits your case:
hadoopFile()
hadoopRDD()
newAPIHadoopFile()
newAPIHadoopRDD()
To read n lines at a time, org.apache.hadoop.mapreduce.lib.input.NLineInputFormat can be used as the InputFormat class with the newAPI methods.
I cannot find a good guide to this in Python
In the Spark docs, under "Saving and Loading Other Hadoop Input/Output Formats", there is an Elasticsearch example + links to an HBase example.
can basically read in n lines at a time... I know I need to use InputFormat to try and specify that
There is NLineInputFormat specifically for that.
This is a rough translation of some Scala code I have from NLineInputFormat not working in Spark
from pyspark import SparkContext

def nline(n, path):
    sc = SparkContext.getOrCreate()
    # NLineInputFormat splits the input so that each task gets n lines
    conf = {
        "mapreduce.input.lineinputformat.linespermap": str(n)
    }
    hadoopIO = "org.apache.hadoop.io"
    return sc.newAPIHadoopFile(path,
                               "org.apache.hadoop.mapreduce.lib.input.NLineInputFormat",
                               hadoopIO + ".LongWritable",
                               hadoopIO + ".Text",
                               conf=conf).map(lambda x: x[1])  # strip out the file-offset key

n = 3
rdd = nline(n, "/file/input")
and place all of those lines as a single row in a dataframe
With NLineInputFormat, each string in the resulting RDD is actually newline-delimited. You can, for example, rdd.map(lambda record: "\t".join(record.split('\n'))) to make one line out of each group.
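To then place each group as a single row in a dataframe, a hedged sketch (assuming Spark 2.x with a SparkSession named spark, and that every group contains exactly n lines) could be:

n = 3
rdd = nline(n, "/file/input")
# split each n-line string into its component lines and treat them as one row
rows = rdd.map(lambda record: tuple(record.split('\n')))
df = spark.createDataFrame(rows, ["line%d" % i for i in range(1, n + 1)])
df.show()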

Convert a python List of List to spark RDD

I am running a Spark job which runs in the following steps:
First it reads a directory of files:
data = sc.binaryFiles()
Process each file separately:
res = data.map(lambda (x,y): func_1(x,y))
func_1 calls another function, func_2, which processes the content of each file separately and returns a list of lists to func_1. Now I need to turn this list of lists into a Spark RDD and write it to HDFS, but I don't have any idea how to do this.
I am very new to Spark. Any help in this case will be appreciated. Thank you in advance.
Edited: As per the suggestion, here are the Func_1 and Func_2 definitions:
def Func_1(filename, file_content):
    Outputfile = "some code for generating output file name for each input file"
    decode_data = Func_2(StringIO(file_content))
    ## save decode_data here in HDFS.

def Func_2():
    ## It decodes the file sequentially (necessary, as each binary file has headers attached to each portion of the file) and returns a list of lists, where each inner list corresponds to a row of the decoded data and the outer list is the collection of such rows (code skipped as it is trivial)
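A hedged sketch of one way to do this (assuming Func_1 returns the list of lists from Func_2 instead of saving it itself, and that each inner list should become one comma-separated output line; the paths are placeholders):

def Func_1(filename, file_content):
    # return the decoded rows instead of writing them inside the function
    return Func_2(StringIO(file_content))

data = sc.binaryFiles("/path/to/input")                  # placeholder path
rows = data.flatMap(lambda kv: Func_1(kv[0], kv[1]))     # one element per decoded row
lines = rows.map(lambda row: ",".join(str(v) for v in row))
lines.saveAsTextFile("/path/to/output")                  # placeholder path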

cannot send pyspark output to a file in the local file system

I'm running a pyspark job on spark (single node, stand-alone) and trying to save the output in a text file in the local file system.
input = sc.textFile(inputfilepath)
words = input.flatMap(lambda x: x.split())
wordCount = words.countByValue()
wordCount.saveAsTextFile("file:///home/username/output.txt")
I get an error saying
AttributeError: 'collections.defaultdict' object has no attribute 'saveAsTextFile'
Basically, whatever I call on the 'wordCount' object, for example collect() or map(), returns the same error. The code works with no problem when the output goes to the terminal (with a for loop), but I can't figure out what is missing to send the output to a file.
The countByValue() method that you're calling returns a dictionary of word counts. This is just a standard Python dictionary and doesn't have any Spark methods available to it.
You can use your favorite method to save the dictionary locally.
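For example, a hedged sketch of both routes (the output paths are placeholders): keep the counts as an RDD so saveAsTextFile is available, or write the collected dictionary with plain Python:

# Route 1: stay with an RDD, which has saveAsTextFile
wordCount = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
wordCount.saveAsTextFile("file:///home/username/output_dir")

# Route 2: keep countByValue() and write the plain dict yourself
counts = words.countByValue()
with open("/home/username/output.txt", "w") as f:
    for word, count in counts.items():
        f.write("%s\t%d\n" % (word, count))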

Use Apache Spark to implement the python function

I have Python code to implement in Spark, but I am unable to get the RDD logic right for Spark version 1.1. This code works perfectly in plain Python, but I would like to implement it in Spark.
import lxml.etree
import csv
sc = SparkContext
data = sc.textFile("pain001.xml")
rdd = sc.parallelize(data)
# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm')  # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]
# open result csv file
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open(rdd) as painxml:
        # process each record
        for index, line in enumerate(painxml):
            if not line.strip():  # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception, e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}".format(index, str(e)))
                raise
I am getting an error saying the RDD is not iterable.
I want to implement this code as a function and run it as a standalone program in Spark.
I would like the input file to be processed from HDFS as well as in local mode in Spark with the Python module.
I appreciate any responses to the problem.
The error you are getting is very informative. When you do with open(rdd) as painxml:, you then try to iterate over the RDD as if it were a normal list or tuple in Python, but an RDD is not iterable. Furthermore, if you read the textFile documentation, you can see that it returns an RDD.
I think the problem is that you are trying to achieve this in a classic, sequential way, when you must approach it within the MapReduce paradigm. If you are really new to Apache Spark, you can audit the course Scalable Machine Learning with Apache Spark; I would also recommend updating your Spark version to 1.5 or 1.6 (which will come out soon).
Just as a small example (but not using XML):
Import the required files
import re
import csv
Read the input file
content = sc.textFile("../test")
content.collect()
# Out[8]: [u'1st record-1', u'2nd record-2', u'3rd record-3', u'4th record-4']
Map the RDD to manipulate each row
# Map it and convert it to tuples
rdd = content.map(lambda s: tuple(re.split("-+",s)))
rdd.collect()
# Out[9]: [(u'1st record', u'1'),
# (u'2nd record', u'2'),
# (u'3rd record', u'3'),
# (u'4th record', u'4')]
Write your data
with open("../test.csv", "w") as fw:
writer = csv.writer(fw)
for r1 in rdd.toLocalIterator():
writer.writerow(r1)
Take a look...
$ cat test.csv
1st record,1
2nd record,2
3rd record,3
4th record,4
Note: if you want to read XML with Apache Spark, there are libraries on GitHub such as spark-xml; you may also find the question xml processing in spark interesting.
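Applying the same pattern to the original XML use case, a hedged sketch (assuming each input line holds one 'CstmrCdtTrfInitn' record, that lxml is installed on the workers, and that the paths are placeholders) would push the xpath extraction into a map and let Spark write the rows:

import lxml.etree

selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm')  # etc...

def extract_fields(line):
    # compile the xpaths inside the task to avoid pickling compiled objects
    xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]
    pain001 = lxml.etree.fromstring(line)       # each line is an xml doc
    elem = pain001.find('CstmrCdtTrfInitn')     # move to the customer elem
    return ",".join(xp(elem)[0].strip() for xp in xpath)

data = sc.textFile("pain001.xml")               # works for HDFS or local paths
rows = data.filter(lambda l: l.strip()).map(extract_fields)
rows.saveAsTextFile("pain_csv_output")          # placeholder output directory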
