pyspark udf print row being analyzed - python

I have a problem inside a PySpark UDF and I want to print the number of the row generating the problem.
I tried to count the rows using the equivalent of a "static variable" in Python, so that a counter is incremented each time the UDF is called with a new row. However, it is not working:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def myF(input):
    myF.lineNumber += 1
    if somethingBad:          # some condition that marks this row as problematic
        print(myF.lineNumber)
    return res

myF.lineNumber = 0
myF_udf = F.udf(myF, StringType())
How can I count the number of times a udf is called in order to find the number of the row generating the problem in pyspark?

UDFs are executed on the workers, so print statements inside them won't show up in the driver's output. The best way to handle issues with UDFs is to change the return type of the UDF to a struct or a list and pass the error information along with the returned output. In the code below I am just appending the error info to the string res that you were returning originally.
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def myF(input):
    myF.lineNumber += 1
    res = ''                  # res is whatever your UDF was computing originally
    if somethingBad:          # your error condition
        res += 'Error in line {}'.format(myF.lineNumber)
    return res

myF.lineNumber = 0
myF_udf = F.udf(myF, StringType())
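If you prefer not to overload the result string, here is a minimal sketch of the struct variant mentioned above; the column names, schema, and error handling are illustrative, not from the original code:

import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Return the result and an optional error message together as a struct,
# so problems can be inspected on the driver with plain DataFrame filters.
result_schema = StructType([
    StructField("value", StringType(), True),
    StructField("error", StringType(), True),
])

def myF_with_error(input):
    try:
        res = str(input)              # placeholder for the real computation
        return (res, None)
    except Exception as e:            # record whatever went wrong for this row
        return (None, str(e))

myF_struct_udf = F.udf(myF_with_error, result_schema)

# Usage sketch:
# df.withColumn("out", myF_struct_udf("some_column")) \
#   .filter(F.col("out.error").isNotNull()) \
#   .show()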

Related

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable

I am getting this error after running the function below.
PicklingError: Could not serialize object: Exception: It appears that
you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
The objective of this piece of code is to create a flag for every row based on the date differences. Multiple rows per user are supplied to the function to create the values of the flag.
Does anyone know what this means and how I can fix it?
def selection(df):
    _i = 0
    l = []
    x = df.select("Date").orderBy("Date").rdd.flatMap(lambda x: x).collect()
    rdd = spark.sparkContext.parallelize(x)
    l.append((x[_i], 1))
    for _j in range(_i + 1, rdd.count()):
        if (d2.year - d1.year) * 12 + (d2.month - d1.month) >= 2:
            l.append((x[_j], 1))
        else:
            l.append((x[_j], 0))
            continue
        _i = _j
    columns = ['Date', 'flag']
    new_df = spark.createDataFrame(l, columns)
    df_new = df.join(new_df, ['Date'], "inner")
    return df_new

ToKeep = udf(lambda z: selection(z))
sample_new = sample.withColumn("toKeep", ToKeep(sample).over(Window.partitionBy(col("id")).orderBy(col("id"), col("Date"))))
@udf(StringType())
def function(x):
    # A UDF runs on the workers, so it cannot contain Spark operations
    # (such as spark.sql): SparkContext can only be used on the driver.
    pass
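As an aside, the flag described in the question (comparing each row's Date with the previous one per user) can be expressed with built-in window functions, which avoids calling Spark from inside a UDF altogether. A minimal sketch, assuming the DataFrame has the id and Date columns mentioned in the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id").orderBy("Date")

df_flagged = (
    df.withColumn("prevDate", F.lag("Date", 1).over(w))
      .withColumn(
          "flag",
          F.when(F.col("prevDate").isNull(), F.lit(1))                      # first row per user
           .when((F.year("Date") - F.year("prevDate")) * 12
                 + (F.month("Date") - F.month("prevDate")) >= 2, F.lit(1))  # gap of two months or more
           .otherwise(F.lit(0)),
      )
)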

Dask map_partitions() prints `partition_info` as None

I'm trying to use DataFrame.map_partitions() from Dask to apply a function on each partition. The function takes as input a list of values and has to return the rows of the dataframe partition whose values in a specific column are contained in that list (using loc() and isin()).
The issue is that I get this error:
"index = partition_info['number'] - 1
TypeError: 'NoneType' object is not subscriptable"
When I print partition_info, it prints None hundreds of times (but I only have 60 elements in the loop, so I expect only 60 prints). Is it normal for it to print None because it's a child process, or am I missing something with partition_info? I cannot find useful information on that.
from typing import List

from dask.dataframe import from_pandas

def apply_f(df, barcodes_per_core: List[List[str]], partition_info=None):
    print(partition_info)
    index = partition_info['number'] - 1
    indexes = barcodes_per_core[index]
    return df.loc[df['barcode'].isin(indexes)]

df = from_pandas(df, npartitions=nb_cores)
dfs_per_core = df.map_partitions(apply_f, barcodes_per_core, meta=df)
dfs_per_core = dfs_per_core.compute(scheduler='processes')
The documentation for partition_info is at the end of this page.
It's not clear why things are not working on your end; one potential issue is that you are re-using df multiple times. Here's a MWE that works:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(range(10), columns=["a"])
ddf = dd.from_pandas(df, npartitions=3)

def my_func(d, x, partition_info=None):
    print(x, partition_info)
    return d   # return the partition so the result matches meta

ddf.map_partitions(my_func, 3, meta=df.head()).compute(scheduler='processes')
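Regarding the None prints: Dask may also call the mapped function on an empty frame for metadata inference, and in that case partition_info is None, which matches the TypeError in the question. A sketch of how apply_f could guard against that, assuming each partition maps directly onto one entry of barcodes_per_core (the 'number' key is the partition index, so the -1 offset may not be needed):

from typing import List

def apply_f(df, barcodes_per_core: List[List[str]], partition_info=None):
    if partition_info is None:
        # Called without partition information (e.g. on the empty meta frame):
        # return the frame unchanged instead of crashing.
        return df
    index = partition_info['number']              # 0-based partition index (assumption)
    indexes = barcodes_per_core[index]
    return df.loc[df['barcode'].isin(indexes)]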

Why do two different data processing ways in pyspark produce separate results? [duplicate]

This question already has answers here:
PySpark Evaluation
(2 answers)
Closed 4 years ago.
I am trying to create a sample dataset from my current dataset. I try two different ways and they produce two different results. Each sampled row should pair an integer with a string (e.g. [5, unprivate], [1, hiprivate]). The first way gives me a string and a string for each row ([private, private], [unprivate, hiprivate]). The second way gives me the correct output.
Why are they producing two totally different datasets?
dataset
5,unprivate
1,private
2,hiprivate
ingest data
from pyspark import SparkContext
sc = SparkContext()
INPUT = "./dataset"
def parse_line(line):
    bits = line.split(",")
    return bits

df = sc.textFile(INPUT).map(parse_line)
1st way - outputs something like
[[u'unprivate', u'unprivate'], [u'unprivate', u'unprivate']]
#1st way
columns = df.first()
new_df = None
for i in range(0, len(columns)):
    column = df.sample(withReplacement=True, fraction=1.0).map(lambda row: row[i]).zipWithIndex().map(lambda e: (e[1], [e[0]]))
    if new_df is None:
        new_df = column
    else:
        new_df = new_df.join(column)
        new_df = new_df.map(lambda e: (e[0], e[1][0] + e[1][1]))
new_df = new_df.map(lambda e: e[1])
print new_df.collect()
2nd way - outputs something like
[(0, [u'5', u'unprivate']), (1, [u'1', u'unprivate']), (2, [u'2', u'private'])]
#2nd way
new_df = df.sample(withReplacement=True, fraction=1.0).map(lambda row: row[0]).zipWithIndex().map(lambda e: (e[1], [e[0]]))
new_df2 = df.sample(withReplacement=True, fraction=1.0).map(lambda row: row[1]).zipWithIndex().map(lambda e: (e[1], [e[0]]))
new_df = new_df.join(new_df2)
new_df = new_df.map(lambda e: (e[0], e[1][0] + e[1][1]))
print new_df.collect()
I am trying to figure out the unisample function on page 62 of
http://info.mapr.com/rs/mapr/images/Getting_Started_With_Apache_Spark.pdf
This has to do with how Spark executes code. Try putting this print statement in your code in the first example:
for i in range(0, len(columns)):
    if new_df:
        print(new_df.take(1))
Since the code is executed lazily, for loops like this won't do what you expect: the lambdas only hold a reference to i, and nothing is actually evaluated until an action runs. By that time i has its final value, so every column you built effectively reads the same (last) index, which is why both fields come out as strings.
You have to use the approach you use in your second example.
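Alternatively (this is not part of the original answer), the late-binding problem can be worked around by binding i at lambda definition time, so each closure keeps its own copy of the index. A sketch of the first approach with that change:

columns = df.first()
new_df = None
for i in range(0, len(columns)):
    # i=i binds the current value as a default argument, so this lambda does
    # not read the loop variable later, after it has changed.
    column = (df.sample(withReplacement=True, fraction=1.0)
                .map(lambda row, i=i: row[i])
                .zipWithIndex()
                .map(lambda e: (e[1], [e[0]])))
    if new_df is None:
        new_df = column
    else:
        new_df = new_df.join(column)
        new_df = new_df.map(lambda e: (e[0], e[1][0] + e[1][1]))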

How to run parallel programs with pyspark?

I would like to use our Spark cluster to run programs in parallel. My idea is to do something like the following:
def simulate():
    # some magic happening in here
    return 0

spark = (
    SparkSession.builder
    .appName('my_simulation')
    .enableHiveSupport()
    .getOrCreate())
sc = spark.sparkContext

no_parallel_instances = sc.parallelize(xrange(500))
res = no_parallel_instances.map(lambda row: simulate())
print res.collect()
The question I have is whether there's a way to execute simulate() with different parameters. The only way I can currently imagine is to have a dataframe specifying the parameters, so something like this:
parameter_list = [[5,2.3,3], [3,0.2,4]]
no_parallel_instances = sc.parallelize(parameter_list)
res = no_parallel_instances.map(lambda row: simulate(row))
print res.collect()
Is there another, more elegant way to run parallel functions with spark?
If the data you are looking to parameterize your call with differs between each row, then yes you will need to include that with each row.
However, if you are looking to set global parameters that affect every row, then you can use a broadcast variable.
http://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
Broadcast variables are created once in your script and cannot be modified after that. Spark will efficiently distribute those values to every executor to make them available to your transformations. To create one you provide the data to spark and it gives you back a handle you can use to access it on the executors. For example:
settings_bc = sc.broadcast({
    'num_of_donkeys': 3,
    'donkey_color': 'brown'
})

def simulate(settings, n):
    # do magic
    return n

no_parallel_instances = sc.parallelize(xrange(500))
res = no_parallel_instances.map(lambda row: simulate(settings_bc.value, row))
print res.collect()
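Combining the two ideas (per-row parameters from the question plus shared broadcast settings), a minimal sketch; the parameter values and setting names are only illustrative:

settings_bc = sc.broadcast({'num_of_donkeys': 3, 'donkey_color': 'brown'})

def simulate(settings, params):
    # params is one row of per-simulation parameters, settings are shared globals
    return sum(params) * settings['num_of_donkeys']   # placeholder "magic"

parameter_list = [[5, 2.3, 3], [3, 0.2, 4]]
no_parallel_instances = sc.parallelize(parameter_list)
res = no_parallel_instances.map(lambda row: simulate(settings_bc.value, row))
print(res.collect())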

Check logs with Spark

I'm new to Spark and I'm trying to develop a python script that reads a csv file with some logs:
userId,timestamp,ip,event
13,2016-12-29 16:53:44,86.20.90.121,login
43,2016-12-29 16:53:44,106.9.38.79,login
66,2016-12-29 16:53:44,204.102.78.108,logoff
101,2016-12-29 16:53:44,14.139.102.226,login
91,2016-12-29 16:53:44,23.195.2.174,logoff
And checks whether a user shows some strange behavior, for example two consecutive 'login' events without a 'logoff' in between. I've loaded the csv as a Spark DataFrame and I want to compare the log rows of a single user, ordered by timestamp, checking whether two consecutive events are of the same type (login - login, logoff - logoff). I'd like to do it in a 'map-reduce' way, but at the moment I can't figure out how to use a reduce function that compares consecutive rows.
The code I've written works, but the performance is very bad.
sc = SparkContext("local", "Data Check")
sqlContext = SQLContext(sc)
LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
N_USERS = 10*1000

dataFrame = sqlContext.read.format("com.databricks.spark.csv").load(LOG_FILE_PATH)
dataFrame = dataFrame.selectExpr("C0 as userID", "C1 as timestamp", "C2 as ip", "C3 as event")

wrongUsers = []
for i in range(0, N_USERS):
    userDataFrame = dataFrame.where(dataFrame['userId'] == i)
    userDataFrame = userDataFrame.sort('timestamp')
    prevEvent = ''
    for row in userDataFrame.rdd.collect():
        currEvent = row[3]
        if prevEvent == currEvent:
            wrongUsers.append(row[0])
        prevEvent = currEvent

badUsers = sqlContext.createDataFrame(wrongUsers)
badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
First (not related, but still): be sure that the number of entries per user is not too big, because the collect in for row in userDataFrame.rdd.collect(): is dangerous.
Second, you don't need to leave the DataFrame area here to use classical Python, just stick to Spark.
Now, your problem. It's basically "for each line I want to know something from the previous line": that belongs to the concept of Window functions and to be precise the lag function. Here are two interesting articles about Window functions in Spark: one from Databricks with code in Python and one from Xinh with (I think easier to understand) examples in Scala.
I have a solution in Scala, but I think you'll pull it off translating it in Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._

val LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
val RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"

val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true") // use the header from your csv
  .load(LOG_FILE_PATH)

val wSpec = Window.partitionBy("userId").orderBy("timestamp")

val badUsers = data
  .withColumn("previousEvent", lag($"event", 1).over(wSpec))
  .filter($"previousEvent" === $"event")
  .select("userId")
  .distinct

badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
Basically you just retrieve the value from the previous line and compare it to the value on your current line; if they match, that is a wrong behavior and you keep the userId. For the first line in your "block" of lines for each userId, the previous value will be null: when comparing it with the current value, the boolean expression will be false, so no problem here.
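Since the question is in Python, here is a rough translation of the Scala snippet above (same logic and the same assumption that the csv has a header); treat it as a sketch rather than tested code:

from pyspark.sql.window import Window
from pyspark.sql.functions import lag, col

data = (sqlContext.read
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")   # use the header from your csv
        .load(LOG_FILE_PATH))

w_spec = Window.partitionBy("userId").orderBy("timestamp")

bad_users = (data
             .withColumn("previousEvent", lag(col("event"), 1).over(w_spec))
             .filter(col("previousEvent") == col("event"))
             .select("userId")
             .distinct())

bad_users.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)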
