How can I unzip a large file in PySpark? - python

I need to unzip a 1.6G file in PySpark.
I've tried doing things along the lines of:
unzipped_files = sc.union(
    [sc.binaryFiles(path) for path in paths]
).flatMap(lambda kv: unzip_file(kv))
where paths is a list of filepaths (that currently just has one element) and unzip_file looks something along the lines of:
def unzip_file(zipped_file):
    zipped_file_obj = zipfile.ZipFile(io.BytesIO(zipped_file[1]), "r")
    return [
        zipped_file_obj.open(filename).read()
        for filename in zipped_file_obj.namelist()
    ]
but the resulting unzipped_files RDD I get is completely unusable. Things as simple as .isEmpty() cause the job to get shut down with a message like py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. which doesn't help me AT ALL.
Even if I try to just do the following:
file = sc.binaryFiles(paths[0])
that is a completely unusable RDD, as well. I'm about to pull my hair out here. Is the file just too big?

Related

Python crashes when loading very large file

I get this error, "Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow", when I load a very large .txt file into memory. When I use a small text file, everything works perfectly.
I want to load each line of my text file, split it on "-", and put each [0] in one list and each [1] in another list. So my text file would be like this:
aaaaa-bbbb
cccc-ddddddd
eeeee-fffff
So:
list1 = ["aaaaa", "cccc", "eeeee"]
list2 = ["bbbb", "ddddddd", "fffff"]
Try executing the code below in a terminal; it will possibly run smoothly:
import pandas as pd
df = pd.read_csv('file_name',sep="-", header=None)
Alternatively, try breaking the file into smaller subfiles, or pause after processing some number of lines.
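If the goal is the two lists from the question, here is a small sketch building on the pandas answer above (the integer column labels 0 and 1 come from header=None):

import pandas as pd

# Read the "-"-separated file; header=None gives integer column labels 0 and 1.
df = pd.read_csv('file_name', sep="-", header=None)

list1 = df[0].tolist()   # ["aaaaa", "cccc", "eeeee"]
list2 = df[1].tolist()   # ["bbbb", "ddddddd", "fffff"]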

Converting Python script to be able to run in Spark/Hadoop

I have a python script that currently runs on my desktop. It takes a csv file with roughly 25 million lines (maybe 15 or so columns) and performs line-by-line operations.
For each line of input, multiple output lines are produced. The results are then written line by line into a csv file; the output ends up at around 100 million lines.
Code looks something like this:
with open(outputfile, "a") as outputcsv:
    with open(inputfile, "r") as inputcsv:
        reader = csv.reader(inputcsv)
        headerlist = next(reader)
        for row in reader:
            variable1 = row[headerlist.index("VAR1")]
            variableN = row[headerlist.index("VARN")]
            while calculations not complete:
                do stuff  # Some complex calculations are done at this point
            outputcsv.write(stuff)
We're now trying to convert the script to run via Hadoop, using pyspark.
I have no idea how to even start. I'm trying to work out how to iterate through an RDD object but don't think it can be done.
Is a line by line calculation like this suitable for distributed processing?
If you want to run the script directly, you could do so via spark-submit:
spark-submit --master local[*] other_parameters path_to_your_script.py
(use --master yarn instead of local[*] to run on the cluster)
But I would suggest going for the Spark APIs, as they are easy to use and will lower the coding overhead.
First you have to create a SparkSession variable so that you can access all the Spark functions:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSessionZipsExample") \
    .config("parameters", "value") \
    .getOrCreate()
Next, if you want to load a csv file:
file = spark.read.csv("path to file")
You can specify optional parameters like header, inferSchema, etc.:
file = spark.read.option("header", "true").csv("path to your file")
'file' will now be a pyspark dataframe.
You can now write the end output like this:
file.write.csv("output_path")
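For the one-to-many, row-by-row part of your script, the RDD flatMap transformation is the closest fit. Here is a rough sketch under my own assumptions (expand_row, the output column names, and the paths are placeholders for your own logic):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RowByRowExample").getOrCreate()
df = spark.read.option("header", "true").csv("path to your file")

def expand_row(row):
    # Placeholder for the "do stuff" calculations: return a list of
    # output tuples for each input row.
    return [(row["VAR1"], i) for i in range(3)]

# flatMap emits zero or more output records per input record, which
# matches the one-input-line -> many-output-lines pattern.
output = df.rdd.flatMap(expand_row).toDF(["VAR1", "result"])
output.write.csv("output_path")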
Please refer to the Spark documentation for transformations and other information.

python script for run-time log segregation

I have a requirement to read a log file at run time and segregate its lines into multiple different files based on a search.
Since the log file is rotated on a daily basis, I have used "getmtime" to pick the most recently modified log file, read its lines dynamically as the log is updated, and segregate them into multiple files.
However, my code fails to read new lines in the log file. I would appreciate your input here.
import time
import os
import glob

newest = max(glob.iglob('/var/log/*.log'), key=os.path.getmtime)

with open(newest,'r') as file, \
     open(‘result1.log’, ‘w’) as output_file1, \
     open(‘result2.log’, ‘w’) as output_file2, \
     open(‘result3.log’, ‘w’) as output_file3:
    while 1:
        where = file.tell()
        line = file.readline()
        if not line:
            time.sleep(1)
            file.seek(where)
        else:
            if “abc” in line:
                output_file1.write(line)
            if “def” in line:
                output_file2.write(line)
            if “ghi” in line:
                output-file3.write(line)
        newest1 = max(glob.iglob('/var/log/*.log'), key=os.path.getmtime)
        if newest1 != newest
            newest= newest1
            file = open(newest, 'r')
Thanks & Regards,
Ankith
For starters, your code contains syntax errors, so I don't think the code you presented above is the same as the code you really use, or you would have noticed. I have copy-pasted your sample, fixed both errors and ran it - the results were as you expected, new lines were read correctly. Thus, I believe your problem is not related to this sample.

Saving multiple items to HDFS with (spark, python, pyspark, jupyter)

I'm used to programming in Python. My company now has a Hadoop cluster with Jupyter installed. Until now I have never used Spark / PySpark for anything.
I am able to load files from HDFS as easily as this:
text_file = sc.textFile("/user/myname/student_grades.txt")
And I'm able to write output like this:
text_file.saveAsTextFile("/user/myname/student_grades2.txt")
The thing I'm trying to achieve is to use a simple "for loop" to read the text files one by one and write their content into one HDFS file. So I tried this:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile("/user/myname/all.txt")
So this works for the first element of the list, but then gives me this error message:
Py4JJavaError: An error occurred while calling o714.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
XXXXXXXX/user/myname/all.txt already exists
To avoid confusion I "blurred" out the IP address with XXXXXXXX.
What is the right way to do this?
I will have tons of datasets (like 'text1', 'text2' ...) and want to apply a python function to each of them before saving them into HDFS, but I would like to have the results all together in "one" output file.
Thanks a lot!
MG
EDIT:
It seems that my final goal was not really clear. I need to apply a function to each text file separately and then append the output to the existing output directory. Something like this:
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file = really_cool_python_function(text_file)
    text_file.saveAsTextFile("/user/myname/all.txt")
I wanted to post this as a comment but could not do so, as I do not have enough reputation.
You have to convert your RDD to a DataFrame and then write it in append mode. To convert an RDD to a DataFrame, please look at this answer:
https://stackoverflow.com/a/39705464/3287419
or this link http://spark.apache.org/docs/latest/sql-programming-guide.html
To save dataframe in append mode below link may be useful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
Almost the same question is also here: Spark: Saving RDD in an already existing path in HDFS. But the answer provided is for Scala; I hope something similar can be done in Python as well.
There is yet another (but ugly) approach: convert your RDD to a string, say resultString, and use subprocess to append that string to the destination file, i.e.
subprocess.call("echo "+resultString+" | hdfs dfs -appendToFile - <destination>", shell=True)
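For the first suggestion, here is a minimal sketch of the RDD-to-DataFrame append idea as I would assume it looks, reusing sc and the file list from the question (the column name "line" is illustrative, and a SparkSession is assumed to already exist so that toDF() is available):

from pyspark.sql import Row

for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    # text_file = really_cool_python_function(text_file) would go here.
    df = text_file.map(lambda line: Row(line=line)).toDF()
    # Append mode adds new part files to the existing output directory
    # instead of failing with FileAlreadyExistsException.
    df.write.mode("append").text("/user/myname/all")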
You can read multiple files and save them with:
textfile = sc.textFile(','.join(['/user/myname/'+f for f in list]))
textfile.saveAsTextFile('/user/myname/all')
You will get all the part files within the output directory.
If the text files all have the same schema, you could use Hive to read the whole folder as a single table, and directly write that output.
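As a rough sketch of that folder-as-one-dataset idea, but using Spark's built-in text reader rather than Hive (the paths are illustrative and a SparkSession named spark is assumed):

# Read every matching file under the directory into a single DataFrame.
all_text = spark.read.text("/user/myname/*.txt")
all_text.write.text("/user/myname/all")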
I would try this, it should be fine:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile(f"/user/myname/{i}")

RxPy read csv files and process lines

I want to use RxPy to open a (csv) file and process the file line by line. More precisely, I envision the following steps:
provide a filename to the stream
open the file
read file line by line
remove lines which start with a comment (e.g. # ...)
apply csv reader
filter records matching some criteria
So far I have:
def to_file(filename):
    f = open(filename)
    return Observable.using(
        lambda: AnonymousDisposable(lambda: f.close()),
        lambda d: Observable.just(f)
    )

def to_reader(f):
    return csv.reader(f)

def print_rows(reader):
    for row in reader:
        print(row)
This works:
Observable.from_(["filename.csv", "filename2.csv"]) \
    .flat_map(to_file).map(to_reader).subscribe(print_rows)
This doesn't (ValueError: I/O operation on closed file):
Observable.from_(["filename.csv", "filename2.csv"]) \
    .flat_map(to_file).flat_map(to_rows).subscribe(print)
The 2nd doesn't work because (see https://github.com/ReactiveX/RxPY/issues/69)
When the observables from the first flatmap is merged by the second flatmap, the inner subscriptions will be disposed when they complete. Thus the files will be closed, even if the file handles are on_next'ed into the new observable set up by the second flatmap.
Any idea how I can achieve something like this?
Observable.from_(["filename.csv", "filename2.csv"]) \
    .flat_map(to_file) \
    .filter(comment_lines) \
    .filter(empty_lines) \
    .map(to_csv_reader) \
    .filter(filter_by..)
    # ... do whatever
Thanks a lot for your help
Juergen
I just started working with RxPy recently and needed to do the same thing. I'm surprised someone hasn't already answered your question, but I decided to answer just in case someone else needs to know. Assuming you have a CSV file like this:
$ cat datafile.csv
"iata","airport","city","state","country","lat","long"
"00M","Thigpen ","Bay Springs","MS","USA",31.95376472,-89.23450472
"00R","Livingston Municipal","Livingston","TX","USA",30.68586111,-95.01792778
"00V","Meadow Lake","Colorado Springs","CO","USA",38.94574889,-104.5698933
"01G","Perry-Warsaw","Perry","NY","USA",42.74134667,-78.05208056
"01J","Hilliard Airpark","Hilliard","FL","USA",30.6880125,-81.90594389
Here is a solution:
from rx import Observable
from csv import DictReader
Observable.from_(DictReader(open('datafile.csv', 'r'))) \
    .subscribe(lambda row:
        print("{0:3}\t{1:<35}".format(row['iata'], row['airport'][:35]))
    )
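To get closer to the full pipeline in the question, here is a sketch under my own assumptions: each file is drained into plain line values so no open handle has to survive the flat_map, and the comment/empty filters and per-line csv parsing are illustrative:

from rx import Observable
import csv

def read_lines(filename):
    # A plain generator: the file is opened lazily, read line by line,
    # and closed once it is exhausted, so no file object is passed downstream.
    with open(filename) as f:
        for line in f:
            yield line

Observable.from_(["filename.csv", "filename2.csv"]) \
    .flat_map(lambda name: Observable.from_(read_lines(name))) \
    .filter(lambda line: not line.lstrip().startswith("#")) \
    .filter(lambda line: line.strip() != "") \
    .map(lambda line: next(csv.reader([line]))) \
    .subscribe(print)

Because only strings flow through the stream, the disposal issue from the issue linked in the question does not come up.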
