How to read a Hadoop MapFile using Python?

I have a MapFile that is block compressed using DefaultCodec. The MapFile is created by a Java application like this:
MapFile.Writer writer =
    new MapFile.Writer(conf, path,
        MapFile.Writer.keyClass(IntWritable.class),
        MapFile.Writer.valueClass(BytesWritable.class),
        MapFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()));
This file is stored in HDFS, and I need to read some key/value pairs from it in another application using Python. I can't find any library that can do that. Do you have any suggestions or an example?
Thanks

I would suggest using Spark, which has a function called textFile() that can read files from HDFS and turn them into RDDs for further processing with other Spark libraries.
Here's the documentation: Pyspark
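Since a MapFile is a directory containing "data" and "index" SequenceFiles of binary Writable key/value pairs, PySpark's sequenceFile() can usually read it directly. Here is a minimal hedged sketch; the path is a placeholder, and block compression with DefaultCodec is handled transparently by Hadoop:
from pyspark import SparkContext

# Minimal sketch, assuming the MapFile from the question (IntWritable keys,
# BytesWritable values) lives at the placeholder path below. PySpark converts
# the Writables to Python types (int keys, bytearray values).
sc = SparkContext(appName="read-mapfile")
pairs = sc.sequenceFile("/hdfs/path/to/mapfile")
print(pairs.take(5))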

Create a reader as follows (this uses the hadoop.io module from the examples linked below):
path = '/hdfs/path/to/file'
key = LongWritable()
value = LongWritable()
reader = MapFile.Reader(path)
while reader.next(key, value):
    print key, value
Check out these hadoop.io.MapFile Python examples,
and the available methods in MapFile.py.

Related

Read xml file using RDD using local cluster pyspark

I have a very large XML file (100 MB) and I am using pyspark for the following reason: to reduce the run time and convert the data into a data frame.
Any idea how to read the XML file? (Modification is probably required in the code below.) One more thing: I took this from Google and, to be precise, I did not understand how to define get_values.
spark = SparkSession.builder.master("local[2]").appName('finale').getOrCreate()
xml = os.path.join(self.violation_path, xml)
file_rdd = spark.read.text(xml, wholetext=False)
pyspark.sql.udf.UDFRegistration.register(name="get_values", f=get_values,
                                         returnType=StringType())
myRDD = spark.parallelize(file_rdd.take(4), 4)
parsed_records = spark.runJob(myRDD, lambda part: [get_values(x) for x in part])
print (parsed_records)
Another method:
root = ET.parse(xml).getroot()
Is it the right approach to use pyspark as a local cluster? Will it be faster?
Can it only be run in a cloud container, or can it also use a local machine?
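As a purely hypothetical illustration of what the undefined get_values might look like (the tag names "Id" and "Name" are placeholders, not from the question):
import xml.etree.ElementTree as ET

# Hypothetical sketch of a get_values UDF: take one XML fragment as a string
# and return selected element texts as a single comma-separated string.
def get_values(xml_fragment):
    root = ET.fromstring(xml_fragment)
    record_id = root.findtext("Id", default="")
    name = root.findtext("Name", default="")
    return "{},{}".format(record_id, name)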

writing to pysftp fileobject using pandas to_csv with compression doesn't actually compress

I have looked at many related answers here on Stack Overflow, and this question seems most related: How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python?. I want to do something similar; however, I want to compress the file when I send it to the SFTP location, so I end up with a .csv.gz file, essentially. The files I am working with are 15-40 MB in size uncompressed, but there are sometimes lots of them, so I need to keep the footprint small.
I have been using code like this to move the dataframe to the destination, after pulling it from another location as a CSV and doing some transformations on the data itself:
fileList = source_sftp.listdir('/Inbox/')
dataList = []
for item in fileList:  # for each file in the list...
    print(item)
    if item[-3:] == u'csv':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item))  # read the csv directly from the sftp server into a pd Dataframe
    elif item[-3:] == u'zip':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item), compression='zip')
    elif item[-3:] == u'.gz':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item), compression='gzip')
    else:
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item), compression='infer')
    dataList.append(temp)  # keep each
# ... Some transformations in here on the data
FL = [(x.replace('.csv', '')) + suffix  # just swap out to suffix
      for x in fileList]
locpath = '{}/some/new/dir/'.format(dest_sftp.pwd)
i = 0
for item in dataList:
    with dest_sftp.open(locpath + FL[i], 'w') as f:
        item.to_csv(f, index=False, compression='gzip')
    i = i + 1
It seems like I should be able to get this to work, but I am guessing there is something being skipped over when I use to_csv to convert the dataframe back and then compress it on the SFTP file object. Should I be streaming this somehow, or is there a solution I am missing somewhere in the documentation for pysftp or pandas?
If I can avoid saving the CSV file somewhere local first, I would like to, but I don't think I should have to, right? I am able to get the file compressed in the end if I just save the file locally with temp.to_csv('/local/path/myfile.csv.gz', compression='gzip'), and after transferring this local file to the destination it is still compressed, so I don't think it has to do with the transfer, just with how pandas.DataFrame.to_csv and pysftp.Connection.open are used together.
I should probably add that I still consider myself a newbie to much of Python, but I have been working with local-to-SFTP and SFTP-to-local transfers, and have not had to do much in the way of transferring (directly or indirectly) between them.
Make sure you have the latest version of pandas.
It supports compression with a file-like object only since 0.24:
GH21227: df.to_csv ignores compression when provided with a file handle
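If upgrading pandas is not an option, a rough sketch of doing the gzip step by hand over the SFTP handle (reusing dest_sftp, locpath, dataList and FL from the question, and assuming the names in FL already end in .csv.gz):
import gzip
import io

# Compress by hand instead of relying on to_csv's compression argument:
# wrap the remote SFTP file handle in a GzipFile, and that in a text wrapper.
for df, name in zip(dataList, FL):
    with dest_sftp.open(locpath + name, 'wb') as remote_f:
        with gzip.GzipFile(fileobj=remote_f, mode='wb') as gz:
            with io.TextIOWrapper(gz, encoding='utf-8') as text_gz:
                df.to_csv(text_gz, index=False)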

Saving multiple items to HDFS with (spark, python, pyspark, jupyter)

I'm used to programming in Python. My company now has a Hadoop cluster with Jupyter installed. Until now I never used Spark / Pyspark for anything.
I am able to load files from HDFS as easy as this:
text_file = sc.textFile("/user/myname/student_grades.txt")
And I'm able to write output like this:
text_file.saveAsTextFile("/user/myname/student_grades2.txt")
The thing I'm trying to achieve is to use a simple "for loop" to read text files one by one and write their content into one HDFS file. So I tried this:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile("/user/myname/all.txt")
So this works for the first element of the list, but then gives me this error message:
Py4JJavaError: An error occurred while calling o714.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
XXXXXXXX/user/myname/all.txt already exists
To avoid confusion I "blurred" out the IP address with XXXXXXXX.
What is the right way to do this?
I will have tons of datasets (like 'text1', 'text2', ...) and want to perform a Python function on each of them before saving them to HDFS. But I would like to have all the results together in "one" output file.
Thanks a lot!
MG
EDIT:
It seems like my final goal was not really clear. I need to apply a function to each text file separately, and then I want to append the output to the existing output directory. Something like this:
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file = really_cool_python_function(text_file)
    text_file.saveAsTextFile("/user/myname/all.txt")
I wanted to post this as a comment but could not do so, as I do not have enough reputation.
You have to convert your RDD to a dataframe and then write it in append mode. To convert an RDD to a dataframe, please look at this answer:
https://stackoverflow.com/a/39705464/3287419
or this link http://spark.apache.org/docs/latest/sql-programming-guide.html
To save a dataframe in append mode, the link below may be useful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
Almost the same question is here also: Spark: Saving RDD in an already existing path in HDFS. But the answer provided is for Scala; I hope something similar can be done in Python as well.
There is yet another (but ugly) approach. Convert your RDD to a string; let the resulting string be resultString. Use subprocess to append that string to the destination file, i.e.
subprocess.call("echo "+resultString+" | hdfs dfs -appendToFile - <destination>", shell=True)
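Putting the DataFrame/append route together, a rough sketch (assuming Spark 2.x with a SparkSession named spark, and that really_cool_python_function returns an RDD of strings):
# Rough sketch of the RDD -> DataFrame -> append-mode write route.
sc = spark.sparkContext
for name in ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']:
    rdd = really_cool_python_function(sc.textFile("/user/myname/" + name))
    df = rdd.map(lambda line: (line,)).toDF(["value"])  # single string column
    df.write.mode("append").text("/user/myname/all")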
You can read multiple files and save them with:
textfile = sc.textFile(','.join(['/user/myname/'+f for f in list]))
textfile.saveAsTextFile('/user/myname/all')
You will get all the part files within the output directory.
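A hedged variant that also fits the EDIT above: transform each file's RDD separately, union the results, and save once (assuming really_cool_python_function takes an RDD and returns an RDD):
# Apply the per-file transformation, then union and save in one go.
files = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
rdds = [really_cool_python_function(sc.textFile("/user/myname/" + f)) for f in files]
sc.union(rdds).saveAsTextFile("/user/myname/all")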
If the text files all have the same schema, you could use Hive to read the whole folder as a single table, and directly write that output.
I would try this, it should be fine:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile(f"/user/myname/{i}")

Using Hadoop InputFormat in Pyspark

I'm working on a file parser for Spark that can basically read in n lines at a time and place all of those lines as a single row in a dataframe.
I know I need to use InputFormat to try and specify that, but I cannot find a good guide to this in Python.
Is there a method for specifying a custom InputFormat in Python or do I need to create it as a scala file and then specify the jar in spark-submit?
You can directly use the InputFormats with Pyspark.
Quoting from the documentation,
PySpark can also read any Hadoop InputFormat or write any Hadoop
OutputFormat, for both ‘new’ and ‘old’ Hadoop MapReduce APIs.
Pass the HadoopInputFormat class to any of these methods of pyspark.SparkContext as suited:
hadoopFile()
hadoopRDD()
newAPIHadoopFile()
newAPIHadoopRDD()
To read n lines, org.apache.hadoop.mapreduce.lib.input.NLineInputFormat can be used as the HadoopInputFormat class with the newAPI methods.
I cannot find a good guide to this in Python
In the Spark docs, under "Saving and Loading Other Hadoop Input/Output Formats", there is an Elasticsearch example + links to an HBase example.
can basically read in n lines at a time... I know I need to use InputFormat to try and specify that
There is NLineInputFormat specifically for that.
This is a rough translation of some Scala code I have, from NLineInputFormat not working in Spark:
from pyspark import SparkContext

def nline(n, path):
    sc = SparkContext.getOrCreate()
    # Hadoop configuration values must be strings
    conf = {
        "mapreduce.input.lineinputformat.linespermap": str(n)
    }
    hadoopIO = "org.apache.hadoop.io"
    return sc.newAPIHadoopFile(path,
                               "org.apache.hadoop.mapreduce.lib.input.NLineInputFormat",
                               hadoopIO + ".LongWritable",
                               hadoopIO + ".Text",
                               conf=conf).map(lambda x: x[1])  # strip out the file offset

n = 3
rdd = nline(n, "/file/input")
and place all of those lines as a single row in a dataframe
With NLineInputFormat, each string in the RDD is actually newline-delimited. You can use rdd.map(lambda record: "\t".join(record.split('\n'))), for example, to make one line out of them.
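For the original goal of one DataFrame row per n-line record, a hedged follow-up (assuming a SparkSession named spark exists so toDF() is available, and rdd comes from the nline() helper above):
# One row per n-line record, keeping the lines as an array column.
df = rdd.map(lambda record: (record.split('\n'),)).toDF(["lines"])
df.show(truncate=False)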

Use Apache Spark to implement the python function

I have Python code to implement in Spark; however, I am unable to get the logic right for the RDD to make it work in Spark version 1.1. This code works perfectly in Python, but I would like to implement it in Spark.
import lxml.etree
import csv

sc = SparkContext
data = sc.textFile("pain001.xml")
rdd = sc.parallelize(data)
# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm')  # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]
# open result csv file
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open(rdd) as painxml:
        # process each record
        for index, line in enumerate(painxml):
            if not line.strip():  # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception, e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}".format(index, str(e)))
                raise
I am getting an error saying the RDD is not iterable.
I want to implement this code as a function and as a standalone program in Spark.
I would like the input file to be processed from HDFS as well as in local mode in Spark with the Python module.
I appreciate any responses to the problem.
The error you are getting is very informative. When you do with open(rdd) as painxml: and then try to iterate over the RDD as if it were a normal list or tuple in Python, it fails because an RDD is not iterable; furthermore, if you read the textFile documentation, you will notice that it returns an RDD.
I think the problem is that you are trying to achieve this in a classic way, whereas you must approach it inside the MapReduce paradigm. If you are really new to Apache Spark, you can audit the course Scalable Machine Learning with Apache Spark; I would also recommend updating your Spark version to 1.5 or 1.6 (the latter will come out soon).
Just as a small example (but not using XML):
Import the required files
import re
import csv
Read the input file
content = sc.textFile("../test")
content.collect()
# Out[8]: [u'1st record-1', u'2nd record-2', u'3rd record-3', u'4th record-4']
Map the RDD to manipulate each row
# Map it and convert it to tuples
rdd = content.map(lambda s: tuple(re.split("-+",s)))
rdd.collect()
# Out[9]: [(u'1st record', u'1'),
# (u'2nd record', u'2'),
# (u'3rd record', u'3'),
# (u'4th record', u'4')]
Write your data
with open("../test.csv", "w") as fw:
    writer = csv.writer(fw)
    for r1 in rdd.toLocalIterator():
        writer.writerow(r1)
Take a look...
$ cat test.csv
1st record,1
2nd record,2
3rd record,3
4th record,4
Note: If you want to read XML with Apache Spark, there are some libraries on GitHub like spark-xml; you may also find this question interesting: xml processing in spark.
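As a hedged sketch of the spark-xml route (the package version is a placeholder, and the rowTag/column names are guessed from the question's selectors):
# Launch with something like:
#   --packages com.databricks:spark-xml_2.11:0.9.0   (version is a placeholder)
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "CstmrCdtTrfInitn") \
    .load("pain001.xml")
df.select("GrpHdr.MsgId", "GrpHdr.CreDtTm").show()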
