pyspark intersection() function to compare data frames - python

Below is the code I have written to compare two dataframes and impose intersection function on them.
import os
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://xxx:xxx").option("databaseName","xxx").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","xxx").option("user","xxxx").option("password","xxxx").load()
df.registerTempTable("test")
df1= sqlContext.sql("select * from test where amitesh<= 300")
df2= sqlContext.sql("select * from test where amitesh <= 400")
df3= df1.intersection(df2)
df3.show()
I am getting below error:
AttributeError: 'DataFrame' object has no attribute 'intersection'
If my understanding is correct, intersection() is an inbuilt sub-function derived from python set function. So,
1) if I am trying to use it inside pyspark, do I need to import any special module inside my code, or it should work as in-built for pyspark as well?
2) To use this intersection() function, do we first need to convert df to rdd?
Please correct me wherever I am wrong. Can somebody give me a working example?
My motive is to get the common record from SQL server and move to HIVE. As of now, I am first trying to get my intersection function work and then start with the HIVE requirement that I can take care off if intersection() is working.

I got it working for me, instead of intersection(), I used intersect(), it worked.

Related

How to quickly instantiate a pyspark SparkContext for unit testing?

In my Python 3.8 unit test code, I need to instantiate a SparkContext to test some functions manipulating a RDD.
The problem is that instantiating a SparkContext takes a few seconds, which is too slow. I'm using this code to instantiate a SparkContext:
from pyspark import SparkContext
return SparkContext.getOrCreate()
The tests only run filter, map and distinct on an RDD. In my tests, I only want to test an expected output against obtained output on a tiny dataset (lists of length between 5 and 10) and I want them to run almost instantaneously. I don't need all other features offered by PySpark (such as parallelization).
I tried various ways, such as below, but instantiation takes as much time:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
sc = SparkContext(conf=conf)

NameError: name 'spark' is not defined, how to solve?

I have just installed pyspark2.4.5 in my ubuntu18.04 laptop, and when I run following codes,
#this is a part of the code.
import pubmed_parser as pp
from pyspark.sql import SparkSession
from pyspark.sql import Row
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
parse_results_rdd = medline_files_rdd.\
flatMap(lambda x: [Row(file_name=os.path.basename(x), **publication_dict)
for publication_dict in pp.parse_medline_xml(x)])
medline_df = parse_results_rdd.toDF()
# save to parquet
medline_df.write.parquet('raw_medline.parquet', mode='overwrite')
medline_df = spark.read.parquet('raw_medline.parquet')
I get such error,
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
NameError: name 'spark' is not defined
I have seen similiar questions on StackOverflow, but all of them can not solve my problem.Does anyone can help me?Thanks a lot.
By the way, I am new in spark, if I just want to use spark in Python, does it enough that I just install pyspark by using
pip install pyspark ? any others should I do? Should I install Hadoop or others?
Just create spark session in the starting
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()

How to change SparkContext property spark.sql.pivotMaxValues in jupyter PySpark session

Q: How to change SparkContext property spark.sql.pivotMaxValues in jupyter PySpark session
I made the following code change to increase spark.sql.pivotMaxValues. It sadly had no effect in the resulting error after restarting jupyter and running the code again.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
import numpy as np
try:
#conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker') # original
#conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", "99999")
conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", 99999)
sc = SparkContext(conf=conf)
except:
print("Variables sc and conf are now defined. Everything is OK and ready to run.")
<... (other code) ...>
df = sess.read.csv(in_filename, header=False, mode="DROPMALFORMED", schema=csv_schema)
ct = df.crosstab('username', 'itemname')
Spark error message that was thrown on my crosstab line of code:
IllegalArgumentException: "requirement failed: The number of distinct values for itemname, can't exceed 1e4. Currently 16467"
I expect I'm not actually setting the config variable that I was trying to set, so what is a way to get that value actually set, programmatically if possible? THanks.
References:
Finally, you may be interested to know that there is a maximum number
of values for the pivot column if none are specified. This is mainly
to catch mistakes and avoid OOM situations. The config key is
spark.sql.pivotMaxValues and its default is 10,000.
Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
I would prefer to change the config variable upwards, since I have written the crosstab code already which works great on smaller datasets. If it turns out there truly is no way to change this config variable then my backup plans are, in order:
relational right outer join to implement my own Spark crosstab with higher capacity than was provided by databricks
scipy dense vectors with handmade unique combinations calculation code using dictionaries
kernel.json
This configuration file should be distributed together with jupyter
~/.ipython/kernels/pyspark/kernel.json
It contains SPARK configuration, including variable PYSPARK_SUBMIT_ARGS - list of arguments that will be used with spark-submit script.
You can try to add --conf spark.sql.pivotMaxValues=99999 to this variable in mentioned script.
PS
There are also cases where people are trying to override this variable programmatically. You can give it a try too...

SparkSession initialization error - Unable to use spark.read

I tried to create a standalone PySpark program that reads a csv and stores it in a hive table. I have trouble configuring Spark session, conference and contexts objects. Here is my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import *
conf = SparkConf().setAppName("test_import")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.config(conf=conf)
dfRaw = spark.read.csv("hdfs:/user/..../test.csv",header=False)
dfRaw.createOrReplaceTempView('tempTable')
sqlContext.sql("create table customer.temp as select * from tempTable")
And I get the error:
dfRaw = spark.read.csv("hdfs:/user/../test.csv",header=False)
AttributeError: 'Builder' object has no attribute 'read'
Which is the right way to configure spark session object in order to use read.csv command? Also, can someone explain the diference between Session, Context and Conference objects?
There is no need to use both SparkContext and SparkSession to initialize Spark. SparkSession is the newer, recommended way to use.
To initialize your environment, simply do:
spark = SparkSession\
.builder\
.appName("test_import")\
.getOrCreate()
You can run SQL commands by doing:
spark.sql(...)
Prior to Spark 2.0.0, three separate objects were used: SparkContext, SQLContext and HiveContext. These were used separatly depending on what you wanted to do and the data types used.
With the intruduction of the Dataset/DataFrame abstractions, the SparkSession object became the main entry point to the Spark environment. It's still possible to access the other objects by first initialize a SparkSession (say in a variable named spark) and then do spark.sparkContext/spark.sqlContext.

can't define a udf inside pyspark project

I have a python project that uses pyspark and i am trying to define a udf function inside the spark project (not in my python project) specifically in spark\python\pyspark\ml\tuning.py but i get pickling problems. it can't load the udf.
The code:
from pyspark.sql.functions import udf, log
test_udf = udf(lambda x : -x[1], returnType=FloatType())
d = data.withColumn("new_col", test_udf(data["x"]))
d.show()
when i try d.show() i am getting exception of unknown attribute test_udf
In my python project i defined many udf and it worked fine.
add the following to your code. It isn't recognizing the datatype.
from pyspark.sql.types import *
Let me know if this helps. Thanks.
Found it there was 2 problems
1) for some reason it didn't like the returnType=FloatType() i needed to convert it to just FloatType() though this was the signature
2) The data in column x was a vector and for some reason i had to cast it to float
The working code:
from pyspark.sql.functions import udf, log
test_udf = udf(lambda x : -float(x[1]), FloatType())
d = data.withColumn("new_col", test_udf(data["x"]))
d.show()

Categories

Resources