Let's say I have a file called "foo.txt" and I want to process the data in it. The file is relatively large, so I want to use Spark to do the work in parallel. I know there is a function called parallelize() in pyspark; the example given by the RDD Programming Guide looks like this:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
Is there any way to use parallelize() to process a file?
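One way to look at it (a minimal sketch, not taken from the guide): parallelize() works on an in-memory collection, so you can read the lines on the driver first and distribute them, but for a large file sc.textFile() is usually the better fit because it builds a distributed RDD of lines without loading everything into driver memory. The .strip().upper() step below is just a placeholder for whatever per-line processing you need.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("foo-example")
sc = SparkContext(conf=conf)

# Option 1: read the lines on the driver, then distribute them.
# Only workable if the whole file fits in the driver's memory.
with open("foo.txt") as f:
    distData = sc.parallelize(f.readlines())

# Option 2 (usually better for big files): let Spark read the file itself,
# producing an RDD with one element per line.
distData = sc.textFile("foo.txt")

# Placeholder per-line processing.
processed = distData.map(lambda line: line.strip().upper())
print(processed.take(5))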
Related
I have a very large XML file (about 100 MB) and I am using pyspark for the following reasons:
to reduce the run time and to convert the data into a data frame.
Any idea how to read the XML file? (Modification is required in the code below.) One more thing: I took it from
Google, and I don't understand precisely how I can define get_values.
spark = SparkSession.builder.master("local[2]").appName('finale').getOrCreate()
xml = os.path.join(self.violation_path, xml)
file_rdd = spark.read.text(xml, wholetext=False)
pyspark.sql.udf.UDFRegistration.register(name="get_values", f = get_values,
returnType=StringType())
myRDD = spark.parallelize(file_rdd.take(4), 4)
parsed_records = spark.runJob(myRDD, lambda part: [get_values(x) for x in part])
print (parsed_records)
Another method:
root = ET.parse(xml).getroot()
Is it the right approach to use pyspark as a local cluster? Will it be faster?
Can it only be run in a cloud container, or can it also be used on a local machine?
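Since the question asks how get_values could be defined, here is a rough sketch of one way to tie the pieces above together: read each XML file as a single string (wholetext=True) and parse it with ElementTree inside an ordinary Python function that runs on the executors. The tag name record, the attribute id, and the path are placeholders for whatever your XML actually contains; this is a sketch, not the one required approach.

import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("finale").getOrCreate()

def get_values(xml_string):
    # Parse one whole XML document and pull out the values of interest.
    # "record" and "id" are placeholders for the real tags/attributes.
    root = ET.fromstring(xml_string)
    return [elem.attrib.get("id") for elem in root.iter("record")]

xml_path = "path/to/violations.xml"  # placeholder path
file_df = spark.read.text(xml_path, wholetext=True)  # one row per file

# Run the parser on the executors through the underlying RDD.
parsed_records = file_df.rdd.map(lambda row: get_values(row.value)).collect()
print(parsed_records)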
I have a Python script that currently runs on my desktop. It takes a CSV file with roughly 25 million lines (maybe 15 or so columns) and performs line-by-line operations.
For each line of input, multiple output lines are produced. The results are then written line by line into a CSV file; the output ends up at around 100 million lines.
Code looks something like this:
with open(outputfile, "a") as outputcsv:
    with open(inputfile, "r") as inputcsv:
        reader = csv.reader(inputcsv)
        headerlist = next(reader)
        for row in reader:
            variable1 = row[headerlist.index("VAR1")]
            variableN = row[headerlist.index("VARN")]
            while calculations not complete:
                do stuff  # Some complex calculations are done at this point
            outputcsv.write(stuff)
We're now trying to convert the script to run via Hadoop, using pyspark.
I have no idea how to even start. I'm trying to work out how to iterate through an RDD object but don't think it can be done.
Is a line by line calculation like this suitable for distributed processing?
If you directly want to run the script, you could do so via spark-submit:
spark-submit --master local[*] other_parameters path_to_your_script.py
(use --master yarn instead to run on a YARN cluster)
But I would suggest going for the Spark APIs, as they are easy to use and will lower the coding overhead.
First you have to create a SparkSession variable so that you can access all the Spark functions:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("SparkSessionZipsExample") \
    .config("parameters", "value") \
    .getOrCreate()
Next, if you want to load a csv file:
file = spark.read.csv("path to file")
You can specify optional parameters like header, inferSchema, etc.:
file = spark.read.option("header", "true").csv("path to your file")
'file' will now be a pyspark dataframe.
You can now write the end output like this:
file.write.csv("output_path")
Please refer to the Spark documentation for transformations and other information.
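As a rough illustration of the kind of DataFrame transformations covered there (the column names VAR1 and VARN are placeholders borrowed from the question's CSV, and the paths are placeholders too):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-example").getOrCreate()

df = spark.read.option("header", "true").csv("path to your file")

# Example transformations: keep two columns, drop rows with a missing VAR1,
# and add a derived column.
result = (df.select("VAR1", "VARN")
            .filter(F.col("VAR1").isNotNull())
            .withColumn("VAR1_times_2", F.col("VAR1").cast("double") * 2))

result.write.csv("output_path")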
I am trying the word count problem in Spark using Python, but I am facing a problem when I try to save the output RDD to a text file using the .saveAsTextFile() method. Here is my code. Please help me, I am stuck. I appreciate your time.
import re
from pyspark import SparkConf, SparkContext

def normalizewords(text):
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

conf = SparkConf().setMaster("local[2]").setAppName("sorted result")
sc = SparkContext(conf=conf)

input = sc.textFile("file:///home/cloudera/PythonTask/sample.txt")
words = input.flatMap(normalizewords)
wordsCount = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
sortedwordsCount = wordsCount.map(lambda (x, y): (y, x)).sortByKey()
results = sortedwordsCount.collect()

for result in results:
    count = str(result[0])
    word = result[1].encode('ascii', 'ignore')
    if word:
        print word + "\t\t" + count

results.saveAsTextFile("/var/www/myoutput")
Since you collected it with results = sortedwordsCount.collect(), results is no longer an RDD; it will be a normal Python list (or tuple).
As you know, a list is a Python object/data structure and append is a method to add an element.
>>> x = []
>>> x.append(5)
>>> x
[5]
Similarly, an RDD is Spark's object/data structure, and saveAsTextFile is a method to write it to a file. The important thing is that it is a distributed data structure.
So we cannot use append on an RDD or saveAsTextFile on a list. collect is a method on an RDD that brings the RDD's contents into driver memory.
As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and use results to write to it.
Change results = sortedwordsCount.collect() to results = sortedwordsCount, because with .collect() results will be a list.
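A minimal sketch of both options, reusing sortedwordsCount and results from the script in the question (the output paths are placeholders):

# Option 1: keep the RDD and let Spark write the output in parallel.
sortedwordsCount.saveAsTextFile("/var/www/myoutput")

# Option 2: collect to the driver and write with plain Python file I/O.
results = sortedwordsCount.collect()
with open("/var/www/myoutput.txt", "w") as f:
    for count, word in results:
        f.write("%s\t\t%s\n" % (word, count))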
I have two files on HDFS and I just want to join these two files on a column, say employee ID.
I am trying to simply print the files to make sure we are reading that correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried foreach and println functions as well and I am not able to display file data.
I am working in Python and am totally new to both Python and Spark.
This is really easy: just do a collect.
You must be sure that all the data fits in the memory of your master (the driver).
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case, you should just take a sample by using the take method.
# I use an exaggerated number to remind you that it is very large and won't fit in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
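The question also mentions joining the two files on an employee ID; here is a rough sketch using DataFrames, assuming both files are CSVs with a header row and share an emp_id column (the second file name and the column name are placeholders, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

emp = spark.read.option("header", "true").csv("hdfs://ip:8020/emp.txt")
dept = spark.read.option("header", "true").csv("hdfs://ip:8020/dept.txt")  # placeholder second file

# Inner join on the shared employee id column.
joined = emp.join(dept, on="emp_id", how="inner")
joined.show(10)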
Another example using .ipynb:
Say I got a dictionary like this:
Test = {"apple":[3,{1:1,3:5,6:7}],"banana":[4,{1:1,3:5,6:7,11:2}]}
Now I want to save this dictionary into a temporary file so that I can reconstruct the dictionary later. (I am doing external sorting XD)
Can anyone kindly help me? I know there is a way to save it in CSV format, but this one is a special kind of dictionary. Thanks a lot.
In order to save a data structure to a file you need to decide on a serialization format. A serialization format takes an in-memory data structure and turns it into a sequence of bytes that can be written to a file. The process of turning that sequence of bytes back into a data structure is called deserialization.
Python provides a number of options for serialization with different characteristics. Here are two common choices and their tradeoffs:
pickle uses a compact format that can represent almost any Python object. However, it's specific to Python, so only Python programs can (easily) decode the file later. The details of the format can also vary between Python releases, so it's best to use pickle only if you will re-read the data file with the same or a newer version of Python than the one that created it. Pickle is able to deal with recursive data structures. Finally, pickle is inappropriate for reading data provided by other possibly-malicious programs, since decoding a pickled data structure can cause arbitrary code to run in your application in order to reconstruct complex objects.
json uses a human-readable text-based format that can represent only a small set of data types: numbers, strings, None, booleans, lists and dictionaries. However, it is a standard format that can be read by equivalent libraries in many other programming languages, and it does not vary from one Python release to another. JSON does not support recursive data structures, but it is generally safe to decode a JSON object from an unknown source except that of course the resulting data structure might use a lot of memory.
In both of these modules the dump function can write a data structure to a file and the load function can later recover it. The difference is in the format of the data written to the file.
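A minimal json version of the same round trip (note that JSON object keys must be strings, so the integer keys in the nested dictionaries come back as strings after a round trip):

import json

Test = {"apple": [3, {1: 1, 3: 5, 6: 7}], "banana": [4, {1: 1, 3: 5, 6: 7, 11: 2}]}

with open('test_file.json', 'w') as f:
    json.dump(Test, f)  # integer keys are written out as strings

with open('test_file.json') as f:
    restored = json.load(f)

print(restored)
# {'apple': [3, {'1': 1, '3': 5, '6': 7}], 'banana': [4, {'1': 1, '3': 5, '6': 7, '11': 2}]}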
The pickle module is quite convenient for serializing python data. It is also probably the fastest way to dump and reload a python data structure.
>>> import pickle
>>> Test = {"apple":[3,{1:1,3:5,6:7}],"banana":[4,{1:1,3:5,6:7,11:2}]}
>>> pickle.dump(Test, open('test_file', 'wb'))
>>> pickle.load(open('test_file', 'rb'))
{'apple': [3, {1: 1, 3: 5, 6: 7}], 'banana': [4, {1: 1, 3: 5, 11: 2, 6: 7}]}
For me, clearly, the best serializer is msgpack.
http://jmoiron.net/blog/python-serialization/
It exposes the same kind of dump/load methods as the others.
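A rough sketch, assuming the msgpack package is installed (pip install msgpack); depending on the msgpack version, strict_map_key=False may be needed when unpacking because the nested dictionaries use integer keys:

import msgpack

Test = {"apple": [3, {1: 1, 3: 5, 6: 7}], "banana": [4, {1: 1, 3: 5, 6: 7, 11: 2}]}

with open('test_file.msgpack', 'wb') as f:
    f.write(msgpack.packb(Test))

with open('test_file.msgpack', 'rb') as f:
    restored = msgpack.unpackb(f.read(), strict_map_key=False)

print(restored)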