How to run parallel programs with pyspark? - python

I would like to use our Spark cluster to run programs in parallel. My idea is to do something like the following:
from pyspark.sql import SparkSession
def simulate():
    # some magic happening in here
    return 0
spark = (
    SparkSession.builder
    .appName('my_simulation')
    .enableHiveSupport()
    .getOrCreate())
sc = spark.sparkContext
no_parallel_instances = sc.parallelize(range(500))
res = no_parallel_instances.map(lambda row: simulate())
print(res.collect())
The question I have is whether there's a way to execute simulate() with different parameters. The only way I can currently imagine is to have a dataframe specifying the parameters, something like this:
parameter_list = [[5,2.3,3], [3,0.2,4]]
no_parallel_instances = sc.parallelize(parameter_list)
res = no_parallel_instances.map(lambda row: simulate(row))
print(res.collect())
Is there another, more elegant way to run parallel functions with spark?

If the data you are looking to parameterize your call with differs between each row, then yes you will need to include that with each row.
However, if you are looking to set global parameters that affect every row, then you can use a broadcast variable.
http://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
Broadcast variables are created once in your script and cannot be modified after that. Spark will efficiently distribute those values to every executor to make them available to your transformations. To create one you provide the data to spark and it gives you back a handle you can use to access it on the executors. For example:
settings_bc = sc.broadcast({
    'num_of_donkeys': 3,
    'donkey_color': 'brown'
})
def simulate(settings, n):
    # do magic
    return n
no_parallel_instances = sc.parallelize(range(500))
res = no_parallel_instances.map(lambda row: simulate(settings_bc.value, row))
print(res.collect())
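When each run needs its own parameters, as in the first case above, one option is to generate the parameter rows programmatically instead of typing them out. The following is only a minimal sketch under assumptions: simulate here is a hypothetical stand-in that multiplies its inputs, the 'scale' setting and the candidate values are made up, and the built-in map stands in for rdd.map so the snippet runs without a cluster.

```python
from itertools import product


def simulate(settings, params):
    """Hypothetical stand-in for the real simulation."""
    a, b, c = params
    return settings["scale"] * a * b * c


# Settings shared by every run (the role the broadcast variable plays above).
settings = {"scale": 2}

# One tuple per run: every combination of the candidate values.
parameter_list = list(product([5, 3], [2.0, 0.5], [3, 4]))

# Locally, the built-in map plays the role of rdd.map; on a cluster the same
# list could be passed to sc.parallelize(parameter_list) instead.
results = list(map(lambda row: simulate(settings, row), parameter_list))
```

Each element of parameter_list is one row, so the per-row parameters travel with the data while the shared settings stay separate.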

Related

How to loop through a list of BQ tables in Python?

I have a list of BQ tables that I'd like to use one at a time. The purpose is to process each table individually, perform some action (in my example, score the dataset with a previously fitted model), then compute, append, and save the probabilities in the all_scores list.
Here's the entire code snippet.
# List of BQ Table
scoring_tables = ["`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_01`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_02`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_03`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_04`",
                  "`Customer_Analytics.HS_AF_PROPERTY_ALL_DATA_SCORE_PV_CLEANED_05`"]
# List to store probabilities/scores
all_scores = []
# Loop through each BQ tables, calculate, append and store the probabilities in the all_score = []
for t in scoring_tables:
    %%bigquery property_data_score_00
    SELECT * FROM t
    score_data = property_data_score_00.copy()
    score_data.set_index('HH_ID',inplace = True)
    # Fixing for HOUSE_INCOME = 0 & AGE = 0 based on means
    score_data['HOUSE_INCOME'] = np.where(score_data['HOUSE_INCOME']==0,107,score_data['HOUSE_INCOME'])
    score_data['AGE'] = np.where(score_data['AGE']==0,54,score_data['AGE'])
    # recategorize PROP_EXTR_WALL_TYPE | PROP_GRG_TYPE | PROP_ROOF_TYPE
    condition = [(score_data['PROP_EXTR_WALL_TYPE'].str.contains("BRICK")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("WOOD")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("CONCRETE")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("METAL")),
                 (score_data['PROP_EXTR_WALL_TYPE'].str.contains("STEEL"))]
    choice = ["BRICK","WOOD","CONCRETE","METAL","METAL"]
    score_data['PROP_EXTR_WALL_TYPE_MOD'] = np.select(condition,choice,default="OTHERS")
    condition = [(score_data['PROP_GRG_TYPE'].str.contains("ATTACHED")),
                 (score_data['PROP_GRG_TYPE'].str.contains("DETACHED")),
                 (score_data['PROP_GRG_TYPE'].str.contains("CARPORT")),
                 (score_data['PROP_GRG_TYPE'].str.contains("BASEMENT"))]
    choice = ["ATTACHED","DETACHED","CARPORT","BASEMENT"]
    score_data['PROP_GRG_TYPE_MOD'] = np.select(condition,choice,default="OTHERS")
    condition = [(score_data['PROP_ROOF_TYPE'].str.contains("GABLE")),
                 (score_data['PROP_ROOF_TYPE'].str.contains("HIP")),
                 (score_data['PROP_ROOF_TYPE'].str.contains("GAMBREL"))]
    choice = ["GABLE","HIP","GAMBREL"]
    score_data['PROP_ROOF_TYPE_MOD'] = np.select(condition,choice,default="OTHERS")
    # one-hot encoding
    to_encode = ["IND_ETHNICITY","IND_GENDER","IND_MOVERS_FLAG","IND_OCCUPATION","IND_REGION","PROP_EXTR_WALL_TYPE","PROP_GRG_TYPE","PROP_ROOF_TYPE","TSI"]
    score_data_dm = pd.get_dummies(data = score_data, columns = to_encode, drop_first = False)
    columns_not_in_score_data_dm = [c for c in train_X.columns if c not in score_data_dm.columns] #columns which might not get produced during pd.get_dummies(data = score_data....), if categories are not available in score data
    score_data_dm[columns_not_in_score_data_dm] = 0 #initializing above columns as 0
    score_data_dm_filt = score_data_dm[select_columns] # making sure to select only the columns which are in the train_X
    y_pred = xgb_prop_PV.predict_proba(score_data_dm_filt)[:,1] #final scoring
    all_scores.extend(y_pred) # extend appends each score; list + ndarray would not concatenate
Inside the loop, I'm having trouble with the SELECT * FROM t step. The error is shown below. I believe the indentation within the loop is causing the %%bigquery step to fail. I looked at itertools here, but it appears to be useful only for conditional looping, which is not my situation.
Also, this seems like a complex approach; is there a more elegant solution? Because the table was too large (600GB), it had to be split into smaller datasets, which is why we tried this method. PS: It works without the loop if run for one table at a time, but it's quite a manual effort.
Thanks,
Piyush
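To address the error message: if you are going to pass a variable into a query using the BigQuery magics command, you need to use the --params flag. Note that query parameters substitute values only; BigQuery does not accept a parameter in place of a table name, so this alone will not make FROM t pick up the loop variable.
Your code should end up looking something like this: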
t="table_name"
my_params = {"t": t}
%%bigquery --params $my_params
SELECT @t
You may also try the approach below.
Instead of using BigQuery magics, you can use the BigQuery client library for Python. From there, you can loop over your list of tables using an f-string, as shown in the sample code below.
from google.cloud import bigquery
bqclient = bigquery.Client()
scoring_tables = ["`your-project-id.your-dataset.test_table1`",
                  "`your-project-id.your-dataset.test_table2`",
                  "`your-project-id.your-dataset.test_table3`"]
for t in scoring_tables:
    # Download query results.
    query_string = f"""
    SELECT * FROM {t}
    """
    property_data_score_00 = (
        bqclient.query(query_string)
        .result()
        .to_dataframe(
            # Optionally, explicitly request to use the BigQuery Storage API. As of
            # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
            # API is used by default.
            create_bqstorage_client=True,
        )
    )
    score_data = property_data_score_00.copy()
    score_data.set_index('HH_ID',inplace = True)
    print(score_data)
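Because the table name is interpolated straight into the SQL text, it may be worth checking each name before building the query. The helper below is a hypothetical sketch, not part of the BigQuery client library: build_query and the naming pattern are assumptions (real BigQuery identifiers allow more than this pattern), and the table names are placeholders.

```python
import re

# Assumed naming rule: optional project, then dataset and table, dot-separated.
# Real BigQuery identifiers allow more than this; widen the pattern if needed.
_TABLE_ID = re.compile(r"(?:[A-Za-z0-9_-]+\.)?[A-Za-z0-9_]+\.[A-Za-z0-9_]+")


def build_query(table_id):
    """Return a SELECT * statement for table_id (hypothetical helper)."""
    if not _TABLE_ID.fullmatch(table_id):
        raise ValueError(f"unexpected table id: {table_id!r}")
    return f"SELECT * FROM `{table_id}`"


# Placeholder names; the backticks are added by build_query.
scoring_tables = ["my_dataset.test_table1", "my_dataset.test_table2"]
queries = [build_query(t) for t in scoring_tables]
```

Each string in queries could then be passed to bqclient.query(...) inside the loop in place of the inline f-string.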

Running Functions with Multiple Arguments Concurrently and Aggregating Complex Results

Set Up
This is part two of a question that I posted regarding accessing results from multiple processes.
For part one click Here: Link to Part One
I have a complex set of data that I need to compare to various sets of constraints concurrently, but I'm running into multiple issues. The first issue is getting results out of my multiple processes, and the second issue is making anything beyond an extremely simple function to run concurrently.
Example
I have multiple sets of constraints that I need to compare against some data, and I would like to do this concurrently because I have a lot of sets of constraints. In this example I'll just be using two sets of constraints.
Jupyter Notebook
Create Some Sample Constraints & Data
import pandas as pd
# Create a set of constraints
constraints = pd.DataFrame([['2x2x2', 2,2,2],['5x5x5',5,5,5],['7x7x7',7,7,7]],
                           columns=['Name','First', 'Second', 'Third'])
constraints.set_index('Name', inplace=True)
# Create a second set of constraints
constraints2 = pd.DataFrame([['4x4x4', 4,4,4],['6x6x6',6,6,6],['7x7x7',7,7,7]],
                            columns=['Name','First', 'Second', 'Third'])
constraints2.set_index('Name', inplace=True)
# Create some sample data
items = pd.DataFrame([['a', 2,8,2],['b',5,3,5],['c',7,4,7]], columns=['Name','First', 'Second', 'Third'])
items.set_index('Name', inplace=True)
Running Sequentially
If I run this sequentially I can get my desired results, but with the data that I am actually dealing with it can take over 12 hours. Here is what it looks like when run sequentially, so that you know what my desired result looks like.
# Function
def seq_check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()
    df_items['Product'] = df_items.product(axis=1)
    df_constraints['Product'] = df_constraints.product(axis=1)
    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint,'Product']
    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True
    df_res = df_items.iloc[:: ,7:]
    return df_res
constraint_sets = [constraints, constraints2, ...]
results = {}
# enumerate supplies the running counter used in the result keys
for counter, df in enumerate(constraint_sets):
    res = seq_check_constraint(df, items)
    results['constraints'+str(counter)] = res
or uglier:
df_res1 = seq_check_constraint(constraints, items)
df_res2 = seq_check_constraint(constraints2, items)
results = {'constraints0':df_res1, 'constraints1': df_res2}
As a result of running these sequentially I end up with DataFrames like the one shown here:
I'd ultimately like to end up with a dictionary or list of the DataFrames, or be able to append the DataFrames all together. The order in which I get the results doesn't matter to me; I just want to have them all together and need to be able to do further analysis on them.
What I've Tried
So this brings me to my attempts at multiprocessing. From what I understand you can use either Queues or Managers to handle shared data and memory, but I haven't been able to get either to work. I am also struggling to get my function, which takes two arguments, to execute within the Pool at all.
Here is my code as it stands right now using the same sample data from above:
Function
def check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()
    df_items['Product'] = df_items.product(axis=1) # Mathematical Product
    df_constraints['Product'] = df_constraints.product(axis=1)
    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint,'Product']
    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True
    df_res = df_items.iloc[:: ,7:]
    return df_res
Jupyter Notebook
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        res = p.map_async(mpf.check_constraint, (df_ns.constraint_sets, itertools.repeat(items)))
        print(res.get())
and my current error:
TypeError: check_constraint() missing 1 required positional argument: 'df_items_input'
The easiest way is to create a list of tuples (where one tuple represents one set of arguments to the function) and pass it to starmap.
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        check_constraint_args = []
        for constraint in df_ns.constraint_sets:
            check_constraint_args.append((constraint, items))
        # starmap blocks and returns a plain list, so no .get() is needed
        res = p.starmap(mpf.check_constraint, check_constraint_args)
        print(res)
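The tuple-per-call pattern can be tried on plain data before moving to DataFrames. The sketch below rests on assumptions: check_fit and run_all are hypothetical names, and it uses multiprocessing.pool.ThreadPool, which exposes the same starmap interface as Pool, only so that it runs anywhere, including a notebook. For CPU-bound work the process-based Pool from the answer is the better fit.

```python
from multiprocessing.pool import ThreadPool


def check_fit(limits, item):
    """Hypothetical check: True if every item dimension is within its limit."""
    return all(value <= limit for value, limit in zip(item, limits))


def run_all(constraint_sets, item, workers=2):
    # One tuple per call; starmap unpacks each tuple into positional arguments.
    args = [(limits, item) for limits in constraint_sets.values()]
    with ThreadPool(workers) as pool:
        fits = pool.starmap(check_fit, args)  # blocks and returns a plain list
    # Results come back in input order, so they can be re-keyed by name.
    return dict(zip(constraint_sets, fits))


constraint_sets = {"2x2x2": (2, 2, 2), "5x5x5": (5, 5, 5), "7x7x7": (7, 7, 7)}
results = run_all(constraint_sets, item=(5, 3, 5))
```

Keying the results by constraint name gives the dictionary of results the question asks for without any shared Manager or Queue.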

Applying parallelization when updating dictionary values

datasets = {}
datasets['df1'] = df1
datasets['df2'] = df2
datasets['df3'] = df3
datasets['df4'] = df4
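def prepare_dataframe(dataframe):
    # regex=True is required in recent pandas versions, where str.replace defaults to literal matching
    return dataframe.apply(lambda x: x.astype(str).str.lower().str.replace(r'[^\w\s]', '', regex=True))
for key, value in datasets.items():
    datasets[key] = prepare_dataframe(value)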
I need to prepare the data in some dataframes for further analysis. I would like to parallelize the for loop that updates the dictionary with a prepared dataframe. This code will eventually run on a machine with dozens of cores and thousands of dataframes. On my local machine I do not appear to be using more than a single core in the prepare_dataframe function.
I have looked at Numba and Joblib but I cannot find a way to work with dictionary values in either library.
Any insight would be very much appreciated!
You can use the multiprocessing library. You can read about its basics here.
Here is the code that does what you need:
from multiprocessing import Pool
def prepare_dataframe(dataframe):
    # do whatever you want here
    # changes made here are *not* global
    # return a modified version of what you want
    return dataframe
def worker(dict_item):
    key,value = dict_item
    return (key,prepare_dataframe(value))
def parallelize(data, func):
    data_list = list(data.items())
    pool = Pool()
    data = dict(pool.map(func, data_list))
    pool.close()
    pool.join()
    return data
datasets = parallelize(datasets,worker)
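The same idea can be written against concurrent.futures, which lets the caller choose the executor. This is a sketch under assumptions: parallelize_dict and prepare_text are hypothetical names, plain strings stand in for the dataframes, and a thread pool is used only so the snippet runs anywhere. For CPU-bound pandas work a ProcessPoolExecutor (created under an if __name__ == '__main__': guard) would likely be the one to pass in.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def prepare_text(text):
    """Stand-in for prepare_dataframe: lower-case and strip punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


def parallelize_dict(data, func, executor):
    """Apply func to every value of data, keeping the original keys."""
    keys = list(data)
    # executor.map yields results in the order of its inputs.
    results = executor.map(func, (data[key] for key in keys))
    return dict(zip(keys, results))


# Made-up sample values standing in for df1, df2, ...
datasets = {"df1": "Hello, World!", "df2": "Foo-Bar"}

with ThreadPoolExecutor(max_workers=2) as executor:
    cleaned = parallelize_dict(datasets, prepare_text, executor)
```

Passing the executor in keeps the dictionary handling identical whether the work runs in threads or in processes.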

Check logs with Spark

I'm new to Spark and I'm trying to develop a python script that reads a csv file with some logs:
userId,timestamp,ip,event
13,2016-12-29 16:53:44,86.20.90.121,login
43,2016-12-29 16:53:44,106.9.38.79,login
66,2016-12-29 16:53:44,204.102.78.108,logoff
101,2016-12-29 16:53:44,14.139.102.226,login
91,2016-12-29 16:53:44,23.195.2.174,logoff
The script should then check whether a user showed strange behavior, for example two consecutive 'login' events without a 'logoff' in between. I've loaded the csv as a Spark DataFrame and I want to compare the log rows of a single user, ordered by timestamp, checking whether two consecutive events are of the same type (login - login, logoff - logoff). I'd like to do this in a 'map-reduce' way, but at the moment I can't figure out how to use a reduce function that compares consecutive rows.
The code I've written works, but the performance is very bad.
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local","Data Check")
sqlContext = SQLContext(sc)
LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
N_USERS = 10*1000
dataFrame = sqlContext.read.format("com.databricks.spark.csv").load(LOG_FILE_PATH)
dataFrame = dataFrame.selectExpr("C0 as userID","C1 as timestamp","C2 as ip","C3 as event")
wrongUsers = []
for i in range(0,N_USERS):
    userDataFrame = dataFrame.where(dataFrame['userId'] == i)
    userDataFrame = userDataFrame.sort('timestamp')
    prevEvent = ''
    for row in userDataFrame.rdd.collect():
        currEvent = row[3]
        if(prevEvent == currEvent):
            wrongUsers.append(row[0])
        prevEvent = currEvent
badUsers = sqlContext.createDataFrame(wrongUsers)
badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
First (not related but still), make sure the number of entries per user is not too large, because the collect in for row in userDataFrame.rdd.collect(): is dangerous: it pulls every row to the driver.
Second, you don't need to leave the DataFrame API for plain Python here; just stick to Spark.
Now, your problem. It's basically "for each line I want to know something from the previous line": that is what window functions are for, and more precisely the lag function. Here are two interesting articles about window functions in Spark: one from Databricks with code in Python and one from Xinh with (I think easier to understand) examples in Scala.
I have a solution in Scala, but I think you'll manage to translate it to Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._
val LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
val RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true") // use the header from your csv
  .load(LOG_FILE_PATH)
val wSpec = Window.partitionBy("userId").orderBy("timestamp")
val badUsers = data
  .withColumn("previousEvent", lag($"event", 1).over(wSpec))
  .filter($"previousEvent" === $"event")
  .select("userId")
  .distinct
badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
Basically you just retrieve the value from the previous line and compare it to the value on the current line; if they match, that is the wrong behavior and you keep the userId. For the first line in each userId's "block" of lines, the previous value will be null: the comparison with the current value then evaluates to null, which the filter treats as false, so no problem here.
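For readers working in Python, the logic can be sketched in two ways. bad_users is a plain-Python illustration of what the lag-over-window comparison computes, and bad_users_spark is an assumed PySpark translation of the Scala code above, which has not been run against the asker's data. Both function names are hypothetical, and the sample rows are made up in the shape of the log file.

```python
from itertools import groupby
from operator import itemgetter


def bad_users(rows):
    """Return the userIds that have two consecutive identical events."""
    flagged = set()
    # Sorting by (userId, timestamp) mirrors partitionBy("userId").orderBy("timestamp").
    ordered = sorted(rows, key=itemgetter("userId", "timestamp"))
    for user_id, events in groupby(ordered, key=itemgetter("userId")):
        previous = None  # like lag(): a user's first row has no previous event
        for row in events:
            if row["event"] == previous:
                flagged.add(user_id)
            previous = row["event"]
    return flagged


def bad_users_spark(df):
    """Assumed PySpark equivalent of the Scala code; expects a Spark DataFrame."""
    # Imported lazily so the sketch loads without Spark installed.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    w = Window.partitionBy("userId").orderBy("timestamp")
    return (df.withColumn("previousEvent", F.lag("event", 1).over(w))
              .filter(F.col("previousEvent") == F.col("event"))
              .select("userId")
              .distinct())


# Made-up sample rows in the shape of the log file.
rows = [
    {"userId": 13, "timestamp": "2016-12-29 16:53:44", "event": "login"},
    {"userId": 13, "timestamp": "2016-12-29 16:55:10", "event": "login"},
    {"userId": 43, "timestamp": "2016-12-29 16:53:44", "event": "login"},
    {"userId": 43, "timestamp": "2016-12-29 17:01:02", "event": "logoff"},
]
flagged = bad_users(rows)
```

The plain-Python version is only for checking the logic on small samples; on the cluster the window-function version keeps all the work inside Spark.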

Spark with own map and reduce functions python

I'm trying to do a MapReduce-like operation using Python Spark. Here is what I have and my problem.
object_list = list(objects) #this is precomputed earlier in my script
def my_map(obj):
    return [f(obj)]
def my_reduce(obj_list1, obj_list2):
    return obj_list1 + obj_list2
What I am trying to do is something like the following:
myrdd = rdd(object_list) #objects are now spread out
myrdd.map(my_map)
myrdd.reduce(my_reduce)
my_result = myrdd.result()
where my_result should now just be [f(obj1), f(obj2), ..., f(objn)]. I want to use Spark purely for the speed; my script has been taking too long when doing this in a for loop. Does anyone know how to do the above in Spark?
It would usually look like this:
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).reduce(lambda a,b:a+b)
There is a sum function for RDDs, so this could also be:
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).sum()
However, this will give you a single number. f(obj1)+f(obj2)+...
If you want an array of all the responses [f(obj1),f(obj2), ...], you would not use .reduce() or .sum() but instead use .collect():
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).collect()
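The relationship between the question's my_map/my_reduce pair and collect() can be checked locally without Spark. In this sketch f is a hypothetical squaring function and functools.reduce stands in for the RDD's reduce, so it only illustrates the logic, not Spark's execution.

```python
from functools import reduce


def f(obj):
    """Hypothetical per-object function."""
    return obj * obj


def my_map(obj):
    return [f(obj)]


def my_reduce(obj_list1, obj_list2):
    return obj_list1 + obj_list2


object_list = [1, 2, 3, 4]

# The question's approach: wrap each result in a list, then concatenate.
via_reduce = reduce(my_reduce, map(my_map, object_list))

# What map(f).collect() returns on an RDD: the mapped values as a list.
via_collect = [f(obj) for obj in object_list]
```

Both give the same list here. Spark's reduce expects a commutative and associative function, and list concatenation is not commutative, so collect() is the safer choice when the order of results matters.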
