Convert a python List of List to spark RDD - python

I am running a Spark job that works in the following steps:
First it reads a directory of files:
data = sc.binaryFiles()
Process each file separately:
res = data.map(lambda kv: func_1(kv[0], kv[1]))
func_1 calls another function, func_2, which processes the content of each file separately and returns a list of lists to func_1. Now I need to turn this list of lists into a Spark RDD and write it to HDFS, but I have no idea how to do this.
I am very new to Spark. Any help would be appreciated. Thank you in advance.
Edited: As suggested, here are the definitions of Func_1 and Func_2:
def Func_1(filename, file_content):
    Outputfile = "some code for generating output file name for each input file"
    decode_data = Func_2(StringIO(file_content))
    ## save decode_data here in HDFS.
def Func_2(file_obj):
    ## Decodes the file sequentially (this is necessary because each binary file has headers attached to each portion of the file) and returns a list of lists, where each inner list is one row of the decoded data and the outer list is the collection of such rows (code skipped as it is trivial).
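One way to look at this, as a minimal sketch rather than a definitive answer: the result of map or flatMap is already an RDD, so there is no need to build an RDD inside Func_1. If the decoder's rows are flattened with flatMap, every inner list becomes one RDD element that can be written as a text line. The names decode_file, rows_to_lines and save_decoded are hypothetical, decode_file is only a stand-in for Func_2, and the sketch assumes a single output directory for all files is acceptable.

```python
def decode_file(file_content):
    """Hypothetical stand-in for Func_2: returns a list of rows (a list of lists)."""
    text = file_content.decode("utf-8")
    return [line.split(",") for line in text.splitlines() if line]


def rows_to_lines(rows):
    """Turn each inner list into one comma-separated text line (no quoting)."""
    return [",".join(str(value) for value in row) for row in rows]


def save_decoded(sc, input_dir, output_dir):
    """Decode every file under input_dir and write all rows to output_dir.

    sc is an existing SparkContext; it is passed in so this module
    does not need to create one itself.
    """
    files = sc.binaryFiles(input_dir)  # RDD of (filename, bytes) pairs
    # flatMap flattens the per-file list of rows, so each row is one element.
    lines = files.flatMap(lambda kv: rows_to_lines(decode_file(kv[1])))
    lines.saveAsTextFile(output_dir)  # fails if output_dir already exists
```

Calling save_decoded(sc, input_dir, output_dir) would then write part files under output_dir, with the decoding running on the executors instead of the driver.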

Related

Saving multiple items to HDFS with (spark, python, pyspark, jupyter)

I'm used to programming in Python. My company now has a Hadoop cluster with Jupyter installed. Until now I have never used Spark / PySpark for anything.
I am able to load files from HDFS as easily as this:
text_file = sc.textFile("/user/myname/student_grades.txt")
And I'm able to write output like this:
text_file.saveAsTextFile("/user/myname/student_grades2.txt")
What I'm trying to achieve is to use a simple "for loop" to read text files one by one and write their content into one HDFS file. So I tried this:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile("/user/myname/all.txt")
So this works for the first element of the list, but then gives me this error message:
Py4JJavaError: An error occurred while calling o714.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
XXXXXXXX/user/myname/all.txt already exists
To avoid confusion, I blurred out the IP address with XXXXXXXX.
What is the right way to do this?
I will have tons of datasets (like 'text1', 'text2' ...) and want to perform a python function with each of them before saving them into HDFS. But I would like to have the results all together in "one" output file.
Thanks a lot!
MG
EDIT:
It seems that my final goal was not really clear. I need to apply a function to each text file separately and then append the output to the existing output directory. Something like this:
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file = really_cool_python_function(text_file)
    text_file.saveAsTextFile("/user/myname/all.txt")
I wanted to post this as a comment but could not, as I do not have enough reputation.
You have to convert your RDD to a DataFrame and then write it in append mode. To convert an RDD to a DataFrame, please look at this answer:
https://stackoverflow.com/a/39705464/3287419
or this link: http://spark.apache.org/docs/latest/sql-programming-guide.html
To save a DataFrame in append mode, the link below may be useful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
Almost the same question is asked in "Spark: Saving RDD in an already existing path in HDFS", but the answer provided there is for Scala. I hope something similar can be done in Python.
There is yet another (but ugly) approach. Convert your RDD to a string; let the resulting string be resultString. Then use subprocess to append that string to the destination file, i.e.
subprocess.call("echo "+resultString+" | hdfs dfs -appendToFile - <destination>", shell=True)
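A slightly safer variant of the same idea, as a sketch: pass the command as a list and feed the text through stdin, so that quotes or shell metacharacters in the data cannot break the command. The function name append_to_hdfs and the runner parameter are hypothetical (runner only exists so the call can be replaced in a dry run); the `hdfs dfs -appendToFile - <destination>` command is the one from the answer above.

```python
import subprocess


def append_to_hdfs(text, destination, runner=subprocess.run):
    """Append text to an HDFS file by piping it to `hdfs dfs -appendToFile -`."""
    cmd = ["hdfs", "dfs", "-appendToFile", "-", destination]
    # A list without shell=True means the text is never interpreted by a shell;
    # it reaches the hdfs process via stdin only.
    return runner(cmd, input=text.encode("utf-8"), check=True)
```

As with the original one-liner, this still collects the data on the driver, so it only suits results small enough to fit in memory there.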
You can read multiple files at once and save them with:
textfile = sc.textFile(','.join(['/user/myname/'+f for f in list]))
textfile.saveAsTextFile('/user/myname/all')
You will get all the part files inside the output directory.
If the text files all have the same schema, you could use Hive to read the whole folder as a single table, and directly write that output.
I would try this, it should be fine:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile(f"/user/myname/{i}")
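For the case in the question's edit, where a function is applied to each file before everything lands in one output directory, a possible sketch is to build one RDD per file, combine them with SparkContext.union, and call saveAsTextFile only once. The function name save_all and its parameters are hypothetical; transform stands for something like really_cool_python_function from the question.

```python
def save_all(sc, base_dir, names, transform, output_dir):
    """Apply transform to each file's RDD, then save all results in one go.

    sc is an existing SparkContext; transform takes an RDD and returns an RDD.
    """
    rdds = [transform(sc.textFile(base_dir + "/" + name)) for name in names]
    combined = sc.union(rdds)  # one RDD holding every file's results
    # Saving once avoids the FileAlreadyExistsException raised in the loop.
    combined.saveAsTextFile(output_dir)
    return combined
```

The output is still a directory of part files rather than a single file, which is how saveAsTextFile normally behaves.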

RxPy read csv files and process lines

I want to use RxPy to open a (csv) file and process the file line by line. More precisely, I envision the following steps:
provide a filename to the stream
open the file
read file line by line
remove lines which start with a comment (e.g. # ...)
apply csv reader
filter records matching some criteria
So far I have:
def to_file(filename):
    f = open(filename)
    return Observable.using(
        lambda: AnonymousDisposable(lambda: f.close()),
        lambda d: Observable.just(f)
    )
def to_reader(f):
    return csv.reader(f)
def print_rows(reader):
    for row in reader:
        print(row)
This works:
Observable.from_(["filename.csv", "filename2.csv"]) \
    .flat_map(to_file).map(to_reader).subscribe(print_rows)
This doesn't (ValueError: I/O operation on closed file):
Observable.from_(["filename.csv", "filename2.csv"]) \
    .flat_map(to_file).flat_map(to_rows).subscribe(print)
The 2nd doesn't work because (see https://github.com/ReactiveX/RxPY/issues/69)
When the observables from the first flatmap is merged by the second flatmap, the inner subscriptions will be disposed when they complete. Thus the files will be closed, even if the file handles are on_next'ed into the new observable set up by the second flatmap.
Any idea how I can achieve:
Something like:
Observable.from_(["filename.csv", "filename2.csv"]
).flat_map(to_file
).filter(comment_lines
).filter(empty_lines
).map(to_csv_reader
).filter(filter_by.. )
).do whatever
Thanks a lot for your help
Juergen
I just started working with RxPy recently and needed to do the same thing. Surprised someone hasn't already answered your question but decided to answer just in case someone else needs to know. Assuming you have a CSV file like this:
$ cat datafile.csv
"iata","airport","city","state","country","lat","long"
"00M","Thigpen ","Bay Springs","MS","USA",31.95376472,-89.23450472
"00R","Livingston Municipal","Livingston","TX","USA",30.68586111,-95.01792778
"00V","Meadow Lake","Colorado Springs","CO","USA",38.94574889,-104.5698933
"01G","Perry-Warsaw","Perry","NY","USA",42.74134667,-78.05208056
"01J","Hilliard Airpark","Hilliard","FL","USA",30.6880125,-81.90594389
Here is a solution:
from rx import Observable
from csv import DictReader
Observable.from_(DictReader(open('datafile.csv', 'r'))) \
    .subscribe(lambda row:
        print("{0:3}\t{1:<35}".format(row['iata'], row['airport'][:35]))
    )
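For comparison, here is a sketch of the steps listed in the question (open each file, drop comment and empty lines, apply the csv reader, filter records) written with plain Python generators instead of RxPy. This is a swapped-in technique, not an Rx solution; read_records and keep are hypothetical names. Because the file is opened inside the generator, it stays open while its rows are being consumed, which sidesteps the closed-file error.

```python
import csv


def read_records(filenames, keep=lambda row: True):
    """Yield CSV rows from each file, skipping comment and empty lines."""
    for filename in filenames:
        with open(filename, newline="") as f:
            # Drop blank lines and lines starting with '#' before parsing.
            lines = (
                line for line in f
                if line.strip() and not line.lstrip().startswith("#")
            )
            for row in csv.reader(lines):
                if keep(row):  # keep only records matching the criterion
                    yield row
```

The resulting generator could then be handed to Observable.from_ in the same way the DictReader is in the answer above, if an observable is still wanted downstream.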

Spark using Python : save RDD output into text files

I am trying the word count problem in Spark using Python, but I run into a problem when I try to save the output RDD to a text file with the .saveAsTextFile command. Here is my code. Please help me, I am stuck. I appreciate your time.
import re
from pyspark import SparkConf , SparkContext
def normalizewords(text):
    return re.compile(r'\W+',re.UNICODE).split(text.lower())
conf=SparkConf().setMaster("local[2]").setAppName("sorted result")
sc=SparkContext(conf=conf)
input=sc.textFile("file:///home/cloudera/PythonTask/sample.txt")
words=input.flatMap(normalizewords)
wordsCount=words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
sortedwordsCount=wordsCount.map(lambda (x,y):(y,x)).sortByKey()
results=sortedwordsCount.collect()
for result in results:
    count=str(result[0])
    word=result[1].encode('ascii','ignore')
    if(word):
        print word +"\t\t"+ count
results.saveAsTextFile("/var/www/myoutput")
Since you collected the results with results=sortedwordsCount.collect(), results is no longer an RDD. It is a normal Python list (of tuples).
As you know, a list is a Python object/data structure, and append is its method for adding an element.
>>> x = []
>>> x.append(5)
>>> x
[5]
Similarly, an RDD is Spark's object/data structure, and saveAsTextFile is its method for writing to a file. The important thing is that it is a distributed data structure.
So we cannot use append on an RDD or saveAsTextFile on a list. collect is the RDD method that brings the RDD's contents into driver memory.
As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and write results to it.
Change results=sortedwordsCount.collect() to results=sortedwordsCount, because with .collect() results will be a list. Note that the print loop then has to iterate over results.collect(), since an RDD is not directly iterable.
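The second option mentioned above (keep the collected list and write it with ordinary Python file handling) could look roughly like this. It is only a sketch: write_counts is a hypothetical helper, and it assumes results is the list of (count, word) tuples produced by collect() in the question.

```python
def write_counts(results, path):
    """Write (count, word) tuples from collect() to a local text file."""
    with open(path, "w") as out:
        for count, word in results:
            if word:  # skip the empty token, as the question's loop does
                out.write("{}\t{}\n".format(word, count))

# The other option keeps the data distributed and lets Spark write it:
#     sortedwordsCount.saveAsTextFile("/var/www/myoutput")
```

Writing from the driver like this produces a single local file, whereas saveAsTextFile produces a directory of part files.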

Python - Extract private variables from a function?

I have a function f2(a, b)
It is only ever called by a minimize algorithm which iterates the function for different values of a and b each time. I would like to store these iterations in excel for plotting.
Is it possible to extract these values easily (I only need to paste them all into Excel or a text file)? Conventional return and print won't work within f2. Is there some other way to extract the values a and b to a public list in the main body?
The algorithm may iterate dozens or hundreds of times.
So far I have tried:
Print to console (can't paste this data into excel easily)
Write to file (csv) within f2, the csv file gets overwritten within the function each time though.
Append the values to a global list.
values = []
def f2(a,b):
    values.append((a,b))
    #todo: actual function logic goes here
Then you can look at values in the main scope once you're done iterating.
Write to file (csv) within f2, the csv file gets overwritten within the function each time though.
Not if you open the file in append mode:
with open("file.csv", "a") as myfile:
    myfile.write("{},{}\n".format(a, b))
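The two suggestions can also be combined without touching f2 itself, as in the following sketch: wrap the function so every call is recorded, pass the wrapper to the minimizer, and write the recorded values to a CSV file once at the end. The names record_calls and save_log are hypothetical, and the sketch assumes the minimizer calls the function with two positional arguments, as f2(a, b) in the question.

```python
import csv


def record_calls(func, log):
    """Return a wrapper around func that appends each (a, b) call to log."""
    def wrapper(a, b):
        log.append((a, b))
        return func(a, b)
    return wrapper


def save_log(log, path):
    """Write the recorded (a, b) pairs to a CSV file that Excel can open."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["a", "b"])  # header row
        writer.writerows(log)
```

Usage would be along the lines of calls = [], then handing record_calls(f2, calls) to the minimizer in place of f2, and finally save_log(calls, "file.csv").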

Is there a memory efficient and fast way to load big JSON files?

I have some JSON files of 500 MB each.
If I use the "trivial" json.load() to load a file's content all at once, it consumes a lot of memory.
Is there a way to read the file partially? If it were a line-delimited text file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which allows you to parse arbitrarily sized chunks. You can get it here and check out the README for examples. It's fast because it uses the 'C' yajl library.
It can be done by using ijson. The working of ijson has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each json from the list. For example, file content is as below
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the below method
def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the default prefix given by ijson.
If you want to access only specific JSON objects based on a condition, you can do it in the following way.
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        # the key in the sample data is "drug", so the prefix is 'item.drug'
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those drug objects whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
I don't know what generates your json content. If possible, I would consider generating a number of manageable files, instead of one huge file.
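As a rough illustration of the generator idea above, here is a standard-library sketch for the case where the file is a single top-level array. It reads the file in chunks and uses json.JSONDecoder.raw_decode to peel off one element at a time. The names iter_array_items and chunk_size are hypothetical, and the sketch is deliberately lenient: it does not validate commas strictly and re-parses the buffer when an element spans several chunks, so a dedicated streaming parser such as ijson is likely the better choice for real use.

```python
import json


def iter_array_items(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array read from text file fp."""
    decoder = json.JSONDecoder()
    buf = ""
    started = False  # True once the opening '[' has been consumed
    eof = False
    while True:
        buf = buf.lstrip()
        if not started:
            if buf:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf, started = buf[1:], True
                continue
        elif buf[:1] == ",":
            buf = buf[1:]  # separator between elements
            continue
        elif buf[:1] == "]":
            return  # end of the array
        elif buf:
            try:
                item, end = decoder.raw_decode(buf)
            except ValueError:
                end = -1  # element is incomplete so far
            # Accept an element only if more text follows it (or at EOF),
            # so a number split across two chunks is not cut short.
            if end != -1 and (end < len(buf) or eof):
                yield item
                buf = buf[end:]
                continue
        # More input is needed to make progress.
        if eof:
            raise ValueError("unexpected end of JSON input")
        chunk = fp.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True
```

Iterating with `for obj in iter_array_items(open(json_file_name)):` keeps only the current element and one read buffer in memory, rather than the whole document.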
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to @codeape:
I would try writing a custom json parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests - break the file up into smaller chunks, etc.
You can convert the JSON file to a CSV file and then process it line by line:
import ijson
import csv
def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    # open both files in one with statement so they are closed reliably
    with open(file_path, 'rb') as json_file, \
            open(file_path + '.csv', 'w', newline='') as csv_file:
        # delimiter, quotechar and quoting must be passed as keyword arguments
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in ijson.parse(json_file):
            if event == 'end_map':
                # the first completed object supplies the header row
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, if the file contains one JSON object per line, you can load it line by line, turn each line's key/value pairs into a dictionary, collect those dictionaries in a final dictionary, and convert that to a pandas DataFrame, which will help you with further analysis:
import json
import pandas as pd
def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line
data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        #print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    # use the line index as the key of the final dictionary
    data_dict[i] = each
Data = pd.DataFrame(data_dict)
#Data will give you the dictionary data in dataFrame (table format) but it will
#be in transposed form , so will then finally transpose the dataframe as ->
Data_1 = Data.T
