I am a beginner to Spark and have started writing some scripts in Python. My understanding is that Spark executes transformations in parallel (map).
import time
from datetime import datetime

from pyspark import SparkConf, SparkContext

def some_function(name, content):
    # log which file is being processed and when, then simulate 30 s of work
    print(name, datetime.now())
    time.sleep(30)
    return content

config = SparkConf().setAppName("sample2").setMaster("local[*]")
filesRDD = SparkContext(conf=config).binaryFiles("F:\\usr\\temp\\*.zip")
inputfileRDD = filesRDD.map(lambda job_bundle: (job_bundle[0], some_function(job_bundle[0], job_bundle[1])))
print(inputfileRDD.collect())
The above code collects the list of .zip files from a folder and processes them. When I execute it, I see that this happens sequentially.
Output
file:/F:/usr/temp/sample.zip 2020-10-22 10:42:37.089085
file:/F:/usr/temp/sample1.zip 2020-10-22 10:43:07.103317
You can see that it started processing the 2nd file only after 30 seconds, i.e. after completing the first file. What went wrong in my code? Why is it not executing the RDD in parallel? Can you please help me?
I don't know exactly how the method binaryFiles partitions the files across Spark partitions. It seems that, contrary to textFile, it tends to create only one partition. Let's look at an example directory called dir containing 5 files.
> ls dir
test1 test2 test3 test4 test5
If I use textFile, things run in parallel. I don't include the output because it is not very pretty, but you can check for yourself. We can verify that things run in parallel with getNumPartitions.
>>> sc.textFile("dir").foreach(lambda x: some_function(x, None))
# ugly output, but everything starts at the same time,
# except maybe the last one since you have 4 cores.
>>> sc.textFile("dir").getNumPartitions()
5
With binaryFiles things are different and, for some reason, everything goes to the same partition.
>>> sc.binaryFiles("dir").getNumPartitions()
1
I even tried with 10k files and everything still goes to the same partition. I believe the reason is that in Scala, binaryFiles returns an RDD with file names and an object that allows reading the files (but no reading is performed). It is therefore fast, the resulting RDD is small, and having it on one partition is fine.
In Scala, we can thus use repartition after binaryFiles and things will work great.
scala> sc.binaryFiles("dir").getNumPartitions
1
scala> sc.binaryFiles("dir").repartition(4).getNumPartitions
4
scala> sc.binaryFiles("dir").repartition(4)
.foreach{ case (name, ds) => {
println(System.currentTimeMillis+": "+name)
Thread.sleep(2000)
// do some reading on the DataStream ds
}}
1603352918396: file:/home/oanicol/sandbox/dir/test1
1603352918396: file:/home/oanicol/sandbox/dir/test3
1603352918396: file:/home/oanicol/sandbox/dir/test4
1603352918396: file:/home/oanicol/sandbox/dir/test5
1603352920397: file:/home/oanicol/sandbox/dir/test2
The problem in Python is that binaryFiles actually reads the files onto one single partition. What is even more mysterious to me is that the following lines of code in pyspark 2.4 yield the same behaviour you observe, which does not make sense.
# this should work but does not
sc.binaryFiles("dir", minPartitions=4).foreach(lambda x: some_function(x, ''))
# this does not work either, which is strange but it would not be advised anyway
# since all the data would be read on one partition
sc.binaryFiles("dir").repartition(4).foreach(lambda x: some_function(x, ''))
Yet, since binaryFiles actually reads the file, you can use wholeTextFiles, which reads the files as text and behaves as expected:
# this works
sc.wholeTextFiles("dir", minPartitions=4).foreach(lambda x: some_function(x, ''))
Related
I have many binary files (.tdms format, similar to .wav) stored in S3 and I would like to read them with nptdms then process them in a distributed fashion with Dask on a cluster.
In PySpark there is pyspark.SparkContext.binaryFiles(), which produces an RDD with a bytearray for each input file - a simple solution to this problem.
I have not found an equivalent function in Dask - is there one? If not, how could the equivalent functionality be achieved in Dask?
I noticed there's dask.bytes.read_bytes() if it's necessary to involve this; however, nptdms can't read a chunk of a file - it needs the entire file to be available, and I'm not sure how to accomplish that.
dask.bytes.read_bytes() will give you whole files if you use blocksize=None, i.e., exactly one block per file. The most common use case for that is compressed files (e.g., gzip), where you can't start mid-stream, but it should work for your use case too. Note that the delayed objects you get each return bytes, not open files.
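As a minimal sketch of that approach (the S3 glob is a placeholder, and I'm assuming TdmsFile.read is happy with a file-like object wrapped around the raw bytes):

import io

import dask
from dask.bytes import read_bytes
from nptdms import TdmsFile

# blocksize=None yields exactly one delayed block of bytes per input file
_sample, blocks = read_bytes("s3://bucket/path/*.tdms", blocksize=None)

@dask.delayed
def parse_tdms(file_bytes):
    # each block is the complete file content as bytes, so wrap it for nptdms
    return TdmsFile.read(io.BytesIO(file_bytes))

# blocks has one entry per file, each a list containing that file's single block
tdms = [parse_tdms(file_blocks[0]) for file_blocks in blocks]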
Alternatively, you can use fsspec.open_files. This returns OpenFile objects, which are safe to serialise, so you can use them in dask.delayed calls such as
ofs = fsspec.open_files("s3://...", ...)
#dask.delayed
def read_a_file(of):
with of as f:
# entering context actually touches storage
return TdmsFile.read(f)
tdms = [read_a_file(of) for of in ofs]
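These are lazy delayed objects; a short usage sketch to actually trigger the reads would be something like:

# nothing has touched storage yet; compute() runs the reads/parses in parallel
results = dask.compute(*tdms)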
I have a Python script that currently runs on my desktop. It takes a csv file with roughly 25 million lines (maybe 15 or so columns) and performs line-by-line operations.
For each line of input, multiple output lines are produced. The results are then written line by line into a csv file; the output ends up at around 100 million lines.
Code looks something like this:
with open(outputfile,"a") as outputcsv:
with open(inputfile,"r") as input csv:
headerlist=next(csv.reader(csvfile)
for row in csv.reader(csvfile):
variable1 = row[headerlist.index("VAR1")]
variableN = row[headerlist.index("VARN")]
while calculations not complete:
do stuff #Some complex calculations are done at this point
outputcsv.write(stuff)
We're now trying to convert the script to run via Hadoop, using pyspark.
I have no idea how to even start. I'm trying to work out how to iterate through an RDD object but don't think it can be done.
Is a line by line calculation like this suitable for distributed processing?
If you directly want to run the script, you could do so via spark-submit:
spark-submit --master local[*] other_parameters path_to_your_script.py   # or --master yarn
But I would suggest going for the Spark APIs, as they are easy to use and will lower the coding overhead.
First you have to create a SparkSession variable so that you can access all Spark functions:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("SparkSessionZipsExample") \
    .config("parameters", "value") \
    .getOrCreate()
Next, if you want to load a csv file:
file = spark.read.csv("path to file")
You can specify optional parameters like header, inferSchema, etc.:
file = spark.read.option("header", "true").csv("path to your file")
file will now be a PySpark DataFrame.
You can now write the end output like this:
file.write.csv("output_path")
Please refer to the Spark documentation for transformations and other information.
This tutorial https://www.dataquest.io/blog/python-json-tutorial/ has a 600MB file that they work with, however when I run their code
import ijson
filename = "md_traffic.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, 'meta.view.columns.item')
    columns = list(objects)
I'm running into 10+ minutes of waiting for the file to be read into ijson, and I'm really confused about how this is supposed to be reasonable. Shouldn't this just be parsing? Am I missing something?
The main problem is not that you are creating a list after parsing (that only collects the individual results into a single structure), but that you are using the default pure-Python backend provided by ijson.
There are other backends that can be used which are much faster. ijson's homepage explains how to import them. The yajl2_cffi backend is the fastest currently available, but I've created a new yajl2_c backend (there's a pull request pending acceptance) that performs even better.
On my laptop (Intel(R) Core(TM) i7-5600U), using the yajl2_cffi backend your code runs in ~1.5 minutes. Using the yajl2_c backend it runs in ~10.5 seconds (Python 3) and ~15 seconds (Python 2.7.12).
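For reference, a backend is chosen at import time; a minimal sketch of switching to yajl2_cffi (assuming the underlying yajl library is installed on your machine) would be:

import ijson.backends.yajl2_cffi as ijson

filename = "md_traffic.json"
with open(filename, "rb") as f:
    # same items() API as the pure-python backend, just a faster parser underneath
    columns = list(ijson.items(f, "meta.view.columns.item"))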
Edit: @lex-scarisbrick is of course also right in that you can quickly break out of the loop if you are only interested in the column names.
This looks like a direct copy/paste of the tutorial found here:
https://www.dataquest.io/blog/python-json-tutorial/
The reason it's taking so long is the list() around the output of the ijson.items function. This effectively forces parsing of the entire file before returning any results. Taking advantage of ijson.items being a generator, the first result can be returned almost immediately:
import ijson

filename = "md_traffic.json"
with open(filename, 'r') as f:
    for item in ijson.items(f, 'meta.view.columns.item'):
        print(item)
        break
EDIT: The very next step in the tutorial is print(columns[0]), which is why I included printing the first item in the answer. Also, it's not clear whether the question was for Python 2 or 3, so the answer uses syntax that works in both, albeit inelegantly.
I tried running your code and killed the program after 25 minutes. So yes, 10 minutes is reasonably fast.
I'm new to Python as well as MPI.
I have a huge data file, 10 GB, and I want to load it into, say, a list or whatever is more efficient; please suggest.
Here is the way I load the file content into a list
def load(source, size):
    data = [[] for _ in range(size)]
    ln = 0
    with open(source, 'r') as input:
        for line in input:
            ln += 1
            data[ln % size].sanitize(line)
    return data
Note:
source: the file name
size: the number of concurrent processes; I divide the data into [size] sublists for parallel computing using MPI in Python.
Please advise how to load the data more efficiently and faster. I've been searching for days but couldn't find anything that matches my purpose; if something exists, please comment with a link here.
Regards
If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in contiguous blocks on the HDD, then I don't know a way to read it faster than reading the file from the first byte to the end.
But if the file is fragmented, create multiple threads, each reading a part of the file. This might slow down the process of reading, but modern HDDs implement a technique named NCQ (Native Command Queueing). It works by giving higher priority to read operations on sectors with addresses near the current position of the HDD head, hence improving the overall speed of read operations issued from multiple threads.
To recommend an efficient data structure in Python for your program, you need to say what operations you will perform on the data (delete, add, insert, search, append and so on) and how often.
By the way, if you use commodity hardware, 10 GB of RAM is expensive. Try to reduce the need for this amount of RAM by loading only the data necessary for a computation, then replacing the results with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
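As a rough illustration of that last idea (do_computation is a placeholder for whatever work you run per chunk, not part of the original question), processing the file chunk by chunk keeps only one slice in RAM at a time:

def process_in_chunks(source, do_computation, chunk_lines=1000000):
    # yield one result per chunk so only `chunk_lines` lines sit in RAM at once
    chunk = []
    with open(source, 'r') as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_lines:
                yield do_computation(chunk)
                chunk = []
    if chunk:
        yield do_computation(chunk)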
(original) Solution using pickling
The strategy for your task can go this way:
- split the large file into smaller ones, making sure they are divided on line boundaries
- have Python code which can convert the smaller files into the resulting list of records and save them as pickled files
- run the Python code for all the smaller files in parallel (using Python or other means)
- run integrating code, taking the pickled files one by one, loading the list from each and appending it to the final result
To gain anything, you have to be careful, as overhead can overcome all possible gains from parallel runs:

- As Python uses the Global Interpreter Lock (GIL), do not use threads for parallel processing, use processes. As processes cannot simply pass data around, you have to pickle the results and let the other (final integrating) part read them from disk.
- Try to minimize the number of loops. For this reason it is better to:
  - not split the large file into too many smaller parts. To use the power of your cores, best fit the number of parts to the number of cores (or possibly twice as much, but going higher will spend too much time on switching between processes).
  - create a list of items (records) and pickle the list as one item, even though pickling allows saving particular items. Pickling one list of 1000 items will be faster than pickling 1000 small items one by one.
- Some tasks (splitting the file, calling the conversion task in parallel) can often be done faster by existing tools in the system. If you have this option, use it.
In my small test, I created a file with 100 thousand lines with content "98-BBBBBBBBBBBBBB", "99-BBBBBBBBBBB" etc. and tested converting it to a list of numbers [..., 98, 99, ...].
For splitting I used the Linux command split, asking it to create 4 parts while preserving line borders:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used the following script, which converts the content into a file with the extension .pickled containing the pickled list.
# chunk2pickle.py
import pickle
import sys


def process_line(line):
    return int(line.split("-", 1)[0])


def main(fname, pick_fname):
    with open(pick_fname, "wb") as fo:
        with open(fname) as f:
            pickle.dump([process_line(line) for line in f], fo)


if __name__ == "__main__":
    fname = sys.argv[1]
    pick_fname = fname + ".pickled"
    main(fname, pick_fname)
To convert one chunk of lines into a pickled list of records:
$ python chunk2pickle.py xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used the parallel tool (which has to be installed on the system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel; adjust it to your system or leave it out and it will default to the number of cores you have.
parallel can also get the list of parameters (input file names in our case) by other means, like the ls command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use the script integrate.py:
# integrate.py
import pickle


def main(file_names):
    res = []
    for fname in file_names:
        with open(fname, "rb") as f:
            res.extend(pickle.load(f))
    return res


if __name__ == "__main__":
    file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
    # here you have the list of records you asked for
    records = main(file_names)
    print(records)
In my answer I have used a couple of external tools (split and parallel). You may do a similar task with Python too. My answer focuses only on giving you an option to keep the Python code for converting lines into the required data structures; a complete pure-Python answer is not covered here (it would get much longer and probably slower).
Solution using process Pool (no explicit pickling needed)
The following solution uses multiprocessing from Python. In this case there is no need to pickle the results explicitly (I am not sure if that is done by the library automatically, or if it is not necessary and data are passed using other means).
# direct_integrate.py
from multiprocessing import Pool


def process_line(line):
    return int(line.split("-", 1)[0])


def process_chunkfile(fname):
    with open(fname) as f:
        return [process_line(line) for line in f]


def main(file_names, cores=4):
    p = Pool(cores)
    return p.map(process_chunkfile, file_names)


if __name__ == "__main__":
    file_names = ["xaa", "xab", "xac", "xad"]
    # here you have the list of records you asked for
    # warning: records are in groups.
    record_groups = main(file_names)
    for rec_group in record_groups:
        print(rec_group)
This updated solution still assumes the large file is available in the form of four smaller files.
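If the shell split step is not convenient, a hedged pure-Python stand-in (the chunk names and the round-robin distribution of lines are my choice here, not part of the answer above) could produce those smaller files like this:

def split_file(source, parts=4, prefix="chunk_"):
    # distribute lines round-robin over `parts` files, preserving line boundaries
    names = ["{}{}".format(prefix, i) for i in range(parts)]
    outputs = [open(name, "w") for name in names]
    try:
        with open(source) as f:
            for i, line in enumerate(f):
                outputs[i % parts].write(line)
    finally:
        for fo in outputs:
            fo.close()
    return names

# e.g. split_file("long.txt") returns ["chunk_0", "chunk_1", "chunk_2", "chunk_3"],
# which can be passed to main() above instead of ["xaa", "xab", "xac", "xad"]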
I have two files on HDFS and I just want to join these two files on a column say employee id.
I am trying to simply print the files to make sure we are reading them correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried the foreach and println functions as well and I am not able to display the file data.
I am working in Python and am totally new to both Python and Spark.
This is really easy: just do a collect.
You must be sure, though, that all the data fits in the memory of your master:
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case, you must just take a sample by using the take method.
# I use an exaggerated number to remind you it is very large and won't fit in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
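Coming back to the original goal of joining the two files on an employee id, here is a minimal hedged sketch (dept.txt and the comma-separated layout with the id in the first column are assumptions, since the question only shows emp.txt):

emp = sc.textFile("hdfs://ip:8020/emp.txt") \
        .map(lambda line: (line.split(",")[0], line))
dept = sc.textFile("hdfs://ip:8020/dept.txt") \
         .map(lambda line: (line.split(",")[0], line))

# inner join on the employee-id key; take() a few rows instead of collect()
print(emp.join(dept).take(10))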