I have a Python project that uses PySpark, and I am trying to define a UDF inside the Spark project itself (not in my Python project), specifically in spark\python\pyspark\ml\tuning.py, but I get pickling problems: it can't load the UDF.
The code:
from pyspark.sql.functions import udf, log
test_udf = udf(lambda x : -x[1], returnType=FloatType())
d = data.withColumn("new_col", test_udf(data["x"]))
d.show()
When I call d.show() I get an exception about an unknown attribute test_udf.
In my Python project I have defined many UDFs and they worked fine.
Add the following to your code. It isn't recognizing the data type:
from pyspark.sql.types import *
Let me know if this helps. Thanks.
Found it. There were two problems:
1) For some reason it didn't like returnType=FloatType(); I had to pass just FloatType() positionally, even though the keyword is part of the signature.
2) The data in column x was a vector, and for some reason I had to cast the element to float.
The working code:
from pyspark.sql.functions import udf, log
from pyspark.sql.types import *
test_udf = udf(lambda x : -float(x[1]), FloatType())
d = data.withColumn("new_col", test_udf(data["x"]))
d.show()
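For reference, an equivalent form uses udf as a decorator; this is only a sketch (the function name negate_second is made up), and the cast to float is still needed because the vector's elements are not plain Python floats:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

@udf(FloatType())
def negate_second(x):
    # x is a vector; cast its element to a plain Python float so FloatType can serialize it
    return -float(x[1])

d = data.withColumn("new_col", negate_second(data["x"]))
d.show()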
Related
In Python you can simply do:
from pyspark.sql.functions import struct
predict = mlflow.pyfunc.spark_udf(spark, "/my/local/model")
df.withColumn("prediction", predict(struct("name", "age"))).show()
Is it possible with Scala?
I have just installed pyspark 2.4.5 on my Ubuntu 18.04 laptop, and when I run the following code,
#this is a part of the code.
import pubmed_parser as pp
from pyspark.sql import SparkSession
from pyspark.sql import Row
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
parse_results_rdd = medline_files_rdd.\
    flatMap(lambda x: [Row(file_name=os.path.basename(x), **publication_dict)
                       for publication_dict in pp.parse_medline_xml(x)])
medline_df = parse_results_rdd.toDF()
# save to parquet
medline_df.write.parquet('raw_medline.parquet', mode='overwrite')
medline_df = spark.read.parquet('raw_medline.parquet')
I get this error:
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
NameError: name 'spark' is not defined
I have seen similar questions on StackOverflow, but none of them solves my problem. Can anyone help me? Thanks a lot.
By the way, I am new to Spark. If I just want to use Spark in Python, is it enough to install pyspark by using
pip install pyspark
or is there anything else I should do? Should I install Hadoop or anything else?
Just create the Spark session at the start:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
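With the session created, spark.sparkContext is available, so the line from the question can run against it; a sketch reusing the path from the question:
from glob import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

# the previously undefined name `spark` now exists
medline_files_rdd = spark.sparkContext.parallelize(
    glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)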
I am trying to run some code on a Spark Kubernetes cluster with the following container image:
"spark.kubernetes.container.image", "kublr/spark-py:2.4.0-hadoop-2.6"
The code I am trying to run is the following:
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def getMax(row, subtract):
    '''
    getMax takes two parameters -
    row: array with parameters
    subtract: normal value of the parameter
    It outputs the value most distant from the normal
    '''
    try:
        row = np.array(row)
        out = row[np.argmax(row - subtract)]
    except ValueError:
        return None
    return out.item()

udf_getMax = F.udf(getMax, FloatType())
The dataframe I am passing is as below.
However, I am getting the following error:
ModuleNotFoundError: No module named 'numpy'
When I did a StackOverflow search I could find a similar issue of a numpy import error in Spark on YARN:
ImportError: No module named numpy on spark workers
And the crazy part is that I am able to import numpy outside, and the
import numpy as np
statement outside the function does not raise any errors.
Why is this happening? How can I fix this or move forward? Any help is appreciated.
Thank you
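One way to confirm that this is a driver-versus-executor environment mismatch is a quick probe like the sketch below; it assumes an existing SparkSession named spark, and the function name and partition count are made up:
import sys

def numpy_status(_):
    # runs on whichever Python interpreter executes it (driver or executor)
    try:
        import numpy
        return (sys.executable, numpy.__version__)
    except ImportError:
        return (sys.executable, "numpy missing")

print("driver:   ", numpy_status(None))
print("executors:", spark.sparkContext.parallelize(range(4), 4)
                         .map(numpy_status).distinct().collect())

If the executors report "numpy missing", numpy has to be installed in the worker image (or otherwise shipped to the executors), not just on the driver.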
I am reading files from a folder in a loop and creating dataframes from these.
However, I am getting this weird error: TypeError: 'str' object is not callable.
Please find the code here:
for yr in range(2014, 2018):
    cat_bank_yr = sqlCtx.read.csv(cat_bank_path+str(yr)+'_'+h1+'bank.csv000', sep='|', schema=schema)
    cat_bank_yr = cat_bank_yr.withColumn("cat_ledger", trim(lower(col("cat_ledger"))))
    cat_bank_yr = cat_bank_yr.withColumn("category", trim(lower(col("category"))))
The code runs for one iteration and then stops at the line
cat_bank_yr = cat_bank_yr.withColumn("cat_ledger", trim(lower(col("cat_ledger"))))
with the above error.
Can anyone help out?
Your code looks fine; if the error indeed happens in the line you say it happens, you probably accidentally overwrote one of the PySpark functions with a string.
To check this, put the following line directly above your for loop and see whether the code runs without an error now:
from pyspark.sql.functions import col, trim, lower
Alternatively, double-check whether the code really stops at the line you said, or check whether col, trim, and lower are what you expect them to be by evaluating them; for example,
col
should return
<function pyspark.sql.functions._create_function.<locals>._(col)>
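For illustration, shadowing one of these imports with a string reproduces exactly this error (a minimal sketch; the assigned value is made up):
from pyspark.sql.functions import col, lower, trim

col = "cat_ledger"            # accidentally shadows the col() function with a string
trim(lower(col("category")))  # raises TypeError: 'str' object is not callable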
In the import section use:
from pyspark.sql import functions as F
Then wherever the code uses col, use F.col, so your code would be:
# on top/header part of code
from pyspark.sql import functions as F

for yr in range(2014, 2018):
    cat_bank_yr = sqlCtx.read.csv(cat_bank_path+str(yr)+'_'+h1+'bank.csv000', sep='|', schema=schema)
    cat_bank_yr = cat_bank_yr.withColumn("cat_ledger", F.trim(F.lower(F.col("cat_ledger"))))
    cat_bank_yr = cat_bank_yr.withColumn("category", F.trim(F.lower(F.col("category"))))
Hope this will work. Good luck.
Below is the code I have written to compare two dataframes and apply the intersection function to them.
import os
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://xxx:xxx").option("databaseName","xxx").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","xxx").option("user","xxxx").option("password","xxxx").load()
df.registerTempTable("test")
df1= sqlContext.sql("select * from test where amitesh<= 300")
df2= sqlContext.sql("select * from test where amitesh <= 400")
df3= df1.intersection(df2)
df3.show()
I am getting the error below:
AttributeError: 'DataFrame' object has no attribute 'intersection'
If my understanding is correct, intersection() is a built-in method of the Python set type. So:
1) If I am trying to use it inside PySpark, do I need to import any special module in my code, or should it work as a built-in for PySpark as well?
2) To use this intersection() function, do we first need to convert the df to an RDD?
Please correct me wherever I am wrong. Can somebody give me a working example?
My motive is to get the common records from SQL Server and move them to Hive. As of now, I am first trying to get my intersection function working, and then I will start with the Hive requirement, which I can take care of once intersection() is working.
I got it working: instead of intersection(), I used intersect(), and it worked.
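For anyone landing on this later, a minimal sketch of intersect() on toy data, reusing the sqlContext from the question (the column names and values are made up):
# intersect() returns the distinct rows that appear in both DataFrames;
# DataFrames have no intersection() method, hence the AttributeError.
a = sqlContext.createDataFrame([(1, "x"), (2, "y"), (3, "z")], ["id", "val"])
b = sqlContext.createDataFrame([(2, "y"), (3, "z"), (4, "w")], ["id", "val"])

a.intersect(b).show()  # shows the rows (2, y) and (3, z)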