Converting Python script to be able to run in Spark/Hadoop - python

I have a python script that currently runs on my desktop. It takes a csv file with roughly 25 million lines (Maybe 15 or so columns) and performs line by line operations.
For each line of input, multiple output lines are produced. The results are then output line by line into a csv file, the output ends up at around 100 million lines.
Code looks something like this:
with open(outputfile,"a") as outputcsv:
with open(inputfile,"r") as input csv:
headerlist=next(csv.reader(csvfile)
for row in csv.reader(csvfile):
variable1 = row[headerlist.index("VAR1")]
variableN = row[headerlist.index("VARN")]
while calculations not complete:
do stuff #Some complex calculations are done at this point
outputcsv.write(stuff)
We're now trying to convert the script to run via Hadoop, using pyspark.
I have no idea how to even start. I'm trying to work out how to iterate through an RDD object but don't think it can be done.
Is a line by line calculation like this suitable for distributed processing?

If you directly want to run the script, you could do so via spark-submit:
spark-submit master local[*]/yarn other_parameters path_to_your_script.py
But I would suggest to go for spark API's as they are easy to use. It will lower the coding overhead.
First you have to create a spark session variable so that you could access all spark functions:
spark = SparkSession
.builder()
.appName("SparkSessionZipsExample")
.config("parameters", "value")
.getOrCreate()
Next, if you want to load a csv file:
file = spark.read.csv("path to file")
You can specify optional parameters like headers, inferschema, etc:
file=spark.read.option("header","true").csv("path to your file")
'file' will now be a pyspark dataframe.
You can now write the end output like this:
file.write.csv("output_path")
Please refer to the documentation : spark documentation for transformations and other information.

Related

PySpark DataFrame writing empty (zero bytes) files

I'm working with PySpark DataFrame API with Spark version 3.1.1 on a local setup. After reading in data, performing some transformations etc. I save the DataFrame to disk. Output directories get created, along with part-0000* file and there is _SUCCESS file present in the output directory as well. However, my part-0000* is always empty i.e. zero bytes.
I've tried writing it in both parquet as well as csv formats with the same result. Just before writing, I called df.show() to make sure there is data in the DataFrame.
### code.py ###
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import configs
spark = SparkSession.builder.appName('My Spark App').getOrCreate()
data = spark.read.csv(configs.dataset_path, sep=configs.data_delim)
rdd = data.rdd.map(...)
data = spark.createDataFrame(rdd)
data = data.withColumn('col1', F.lit(1))
data.show() # Shows top 20 rows with data
data.write.parquet(save_path + '/dataset_parquet/', mode='overwrite') # Zero Bytes
data.write.csv(save_path + '/dataset_csv/', mode='overwrite') # Zero Bytes
I'm running this code as follows
export PYSPARK_PYTHON=python3
$SPARK_HOME/bin/spark-submit \
--master local[*] \
code.py
So I ran into a similar issue with pyspark and one thing I also noticed is that when I tried to set the mode to overwrite it was also failing. The issue with the overwrite was that it was failing to write while it was in the middle of the write, so it would create the file, fail, retry and the retry would fail with the 'file already exists' because it was past the point in its process of handling the overwrite.
So I added cache to force the evaluation because like your .show() above I was doing a data.cache().count(). The count showed records but any further evaluation using show or write showed the DF as empty.
So try adding .cache() to the first reference of that dataframe and see it it fixes your issue. It did for me.
df_bad = df_cln.filter(F.col('isInvalid')).select(F.concat(F.col('line')\
,F.lit(">> LINE:"),F.col('monotonically_increasing_id'))\
.alias("line"),F.col('monotonically_increasing_id'))
removed_file_cnt = df_bad.cache().count()
print(f"The count of the records still containing udf chars in the file: {removed_file_cnt}")
if removed_file_cnt > 0:
df_bad.coalesce(1)\
.orderBy('monotonically_increasing_id')\
.drop('monotonically_increasing_id')\
.write.option("ignoreTrailingWhiteSpace","false").option("encoding", "UTF-8")\
.format('text').save(s3_error_bucket_path, mode='overwrite')
Alternatively, consider using a .localCheckpoint() on the data column. It is fast and convenient. Since we can always restart the job there essentially no critical need for a checkpoint.

input file is not getting read from pd.read_csv

I'm trying to read a file stored in google storage from apache beam using pandas but getting error
def Panda_a(self):
import pandas as pd
data = 'gs://tegclorox/Input/merge1.csv'
df1 = pd.read_csv(data, names = ['first_name', 'last_name', 'age',
'preTestScore', 'postTestScore'])
return df1
ip2 = p |'Split WeeklyDueto' >> beam.Map(Panda_a)
ip7 = ip2 | 'print' >> beam.io.WriteToText('gs://tegclorox/Output/merge1234')
When I'm executing the above code , the error says the path does not exist. Any idea why ?
A bunch of things are wrong with this code.
Trying to get Pandas to read a file from Google Cloud Storage. Pandas does not support the Google Cloud Storage filesystem (as #Andrew pointed out - documentation says supported schemes are http, ftp, s3, file). However, you can use the Beam FileSystems.open() API to get a file object, and give that object to Pandas instead of the file path.
p | ... >> beam.Map(...) - beam.Map(f) transforms every element of the input PCollection using the given function f, it can't be applied to the pipeline itself. It seems that in your case, you want to simply run the Pandas code without any input. You can simulate that by supplying a bogus input, e.g. beam.Create(['ignored'])
beam.Map(f) requires f to return a single value (or more like: if it returns a list, it will interpret that list as a single value), but your code is giving it a function that returns a Pandas dataframe. I strongly doubt that you want to create a PCollection containing a single element where this element is the entire dataframe - more likely, you're looking to have 1 element for every row of the dataframe. For that, you need to use beam.FlatMap, and you need df.iterrows() or something like it.
In general, I am not sure why read the CSV file using Pandas at all. You can read it using Beam's ReadFromText with skip_header_lines=1, and then parse each line yourself - if you have a large amount of data, this will be a lot more efficient (and if you have only a small amount of data and do not anticipate it becoming large enough to exceed the capabilities of a single machine - say, if it will never be above a few GB - then Beam is the wrong tool).

Using Hadoop InputFormat in Pyspark

I'm working on a file parser for Spark that can basically read in n lines at a time and place all of those lines as a single row in a dataframe.
I know I need to use InputFormat to try and specify that, but I cannot find a good guide to this in Python.
Is there a method for specifying a custom InputFormat in Python or do I need to create it as a scala file and then specify the jar in spark-submit?
You can directly use the InputFormats with Pyspark.
Quoting from the documentation,
PySpark can also read any Hadoop InputFormat or write any Hadoop
OutputFormat, for both ‘new’ and ‘old’ Hadoop MapReduce APIs.
Pass the HadoopInputFormat class to any of these methods of pyspark.SparkContext as suited,
hadoopFile()
hadoopRDD()
newAPIHadoopFile()
newAPIHadoopRDD()
To read n lines, org.apache.hadoop.mapreduce.lib.NLineInputFormat can be used as the HadoopInputFormat class with the newAPI methods.
I cannot find a good guide to this in Python
In the Spark docs, under "Saving and Loading Other Hadoop Input/Output Formats", there is an Elasticsearch example + links to an HBase example.
can basically read in n lines at a time... I know I need to use InputFormat to try and specify that
There is NLineInputFormat specifically for that.
This is a rough translation of some Scala code I have from NLineInputFormat not working in Spark
def nline(n, path):
sc = SparkContext.getOrCreate
conf = {
"mapreduce.input.lineinputformat.linespermap": n
}
hadoopIO = "org.apache.hadoop.io"
return sc.newAPIHadoopFile(path,
"org.apache.hadoop.mapreduce.lib.NLineInputFormat",
hadoopIO + ".LongWritable",
hadoopIO + ".Text",
conf=conf).map(lambda x : x[1]) # To strip out the file-offset
n = 3
rdd = nline(n, "/file/input")
and place all of those lines as a single row in a dataframe
With NLineInputFormat, each string in the RDD is actually new-line delimited. You can rdd.map(lambda record : "\t".join(record.split('\n'))), for example to put make one line out them.

How to print rdd in python in spark

I have two files on HDFS and I just want to join these two files on a column say employee id.
I am trying to simply print the files to make sure we are reading that correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried foreach and println functions as well and I am not able to display file data.
I am working in python and totally new to both python and spark as well.
This is really easy just do a collect
You must be sure that all the data fits the memory on your master
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case You must just take a sample by using take method.
# I use an exagerated number to remind you it is very large and won't fit the memory in your master so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
Another example using .ipynb:

How to operate on unsaved Excel file?

I'd like to automate a loop:
ABAQUS generates a Excel file;
Matlab utilises data in Excel file;
loop 1 and 2.
Now my question is: after step 1, the Excel file from ABAQUS is unsaved as Book1. I cannot use Matlab command to save it. Is there a way not to save this ''Book1'' file, but use the data in it? Or if I can find where it is so I can use the data inside? (I assume that Excel always saves the file even though user doesn't?)
Thank you! 
As agentp mentioned, if you are running Abaqus via a Python script, you can just use Python to create a .txt file to save all the relevant information. If well structured, a .txt file can be as readable as an Excel spreadsheet. Because Matlab and Python have intrinsic functions to read and write files this communication can be easily done.
As for Matlab calling Abaqus, you can use something similar to:
system('abaqus cae nogui=YOUR_SCRIPT.py')
Your script that pipes to Excel should have some code similar to this:
abq_ExcelUtilities.excelUtilities.XYtoExcel(
xyDataNames='S:Mises PI: PART-1-1 E: 4309 IP: 1', trueName='')
writing the same data to a report (.rpt) file the code looks like this:
x0 = session.xyDataObjects['S:Mises PI: PART-1-1 E: 4309 IP: 1']
session.writeXYReport(fileName='abaqus.rpt', xyData=(x0, ))
now to "roll your own", use that x0 object: x0.data is a regular python tuple holding the actual data which you can write to a file however you like, eg:
file=open('myfile.csv','w')
for point in x0.data: file.write('%g,%g\n'%point)
file.close()
(you can comment or delete the writeXYReport call )

Categories

Resources