Spark with own map and reduce functions python

I'm trying to do a MapReduce-like operation using PySpark. Here is what I have, and my problem.
object_list = list(objects)  # this is precomputed earlier in my script
def my_map(obj):
    return [f(obj)]
def my_reduce(obj_list1, obj_list2):
    return obj_list1 + obj_list2
What I am trying to do is something like the following:
myrdd = rdd(object_list) #objects are now spread out
myrdd.map(my_map)
myrdd.reduce(my_reduce)
my_result = myrdd.result()
where my_result should now just be [f(obj1), f(obj2), ..., f(objn)]. I want to use Spark purely for the speed; my script has been taking too long when doing this in a for loop. Does anyone know how to do the above in Spark?

It would usually look like this:
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).reduce(lambda a,b:a+b)
There is a sum function for RDDs, so this could also be:
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).sum()
However, this will give you a single number: f(obj1) + f(obj2) + ...
If you want a list of all the results, [f(obj1), f(obj2), ...], you would not use .reduce() or .sum() but instead use .collect():
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).collect()
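To see why collect() is enough here, the my_map/my_reduce pair from the question can be run locally without Spark. This is only a sketch: f is a made-up stand-in (squaring), and functools.reduce plays the role of RDD.reduce. Wrapping each result in a list and concatenating gives the same list as a plain map, which is what map(f).collect() returns.

```python
from functools import reduce

def f(obj):
    # Hypothetical stand-in for the question's f
    return obj * obj

def my_map(obj):
    return [f(obj)]

def my_reduce(obj_list1, obj_list2):
    return obj_list1 + obj_list2

object_list = [1, 2, 3, 4]

# Map to single-element lists, then reduce by concatenation
via_reduce = reduce(my_reduce, map(my_map, object_list))
# Plain map, the local analogue of myrdd.map(f).collect()
via_map = list(map(f, object_list))

print(via_reduce == via_map)
```

The concatenating reduce also builds a new intermediate list at every step, so collect() tends to be the cheaper choice as well as the shorter one.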

Related

Python, loops with changeable parts of filenames

I have a bunch of very similar commands which all look like this (df means pandas dataframe):
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
I would like to make a loop for it, as follows:
for i in range(1,5):
    for j in range(1,5):
        df%i_part%j=...
Of course, it doesn't work with %. But there has to be some easy way to do it, I suppose.
Could you help me, please?

You can try one of the following options:
Create a dictionary that maps each name to its dataframe, and access the dataframes by name:
mapping = {"df1_part1": df1_part1, "df1_part2": df1_part2}
for i in range(1,5):
    for j in range(1,5):
        mapping[f"df{i}_part{j}"] = ...
Use globals() to access your variables dynamically:
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
for i in range(1,5):
    for j in range(1,5):
        globals()[f"df{i}_part{j}"] = ...

One way would be to collect your pandas dataframes in a list of lists and iterate over that list instead of trying to dynamically parse your Python code.
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
dflist = [[df1_part1, df1_part2, df1_part3, df1_part4, df1_part5],
          [df2_part1, df2_part2, df2_part3, df2_part4, df2_part5]]
for df in dflist:
    for df_part in df:
        # do something with df_part
        pass

Assuming that this process is part of data preparation, I would like to mention that you should try to work with "data preparation pipelines" whenever it is possible. Otherwise, the code will be a huge mess to read after a couple of months.
There are several ways to deal with this problem.
A dictionary is the most straightforward way to deal with this.
df_parts = {
    'df1' : {'part1': df1_part1, 'part2': df1_part2,...,'partN': df1_partN},
    'df2' : {'part1': df2_part1, 'part2': df2_part2,...,'partN': df2_partN},
    '...' : {'part1': ..._part1, 'part2': ..._part2,...,'partN': ..._partN},
    'dfN' : {'part1': dfN_part1, 'part2': dfN_part2,...,'partN': dfN_partN},
}
# print parts from `dfN`
for val in df_parts['dfN'].values():
    print(val)
# print part1 for all dfs
for df in df_parts.values():
    print(df['part1'])
# print everything
for df in df_parts:
    for val in df_parts[df].values():
        print(val)
The good thing about this approach is that you can iterate through the whole dictionary without using range, which may be confusing later. Also, it is better to assign every df_part directly to the dict than to create N*N variables that may be used only once or twice. In that case you can use a single variable and re-assign it as you progress:
# code using df1_partN
df1 = df_parts['df1']['partN']
# stuff to do
# happy? checkpoint
df_parts['df1']['partN'] = df1
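If the parts are created in a loop anyway, the nested dictionary can be filled in as you go instead of being written out by hand. A minimal sketch, assuming a hypothetical make_part(i, j) helper that stands in for whatever produces each dataframe (here it just returns a string so the example runs without pandas):

```python
def make_part(i, j):
    # Hypothetical placeholder for the real code that builds a dataframe
    return f"data for df{i}, part {j}"

df_parts = {}
for i in range(1, 3):          # df1, df2
    parts = {}
    for j in range(1, 6):      # part1 .. part5
        parts[f"part{j}"] = make_part(i, j)
    df_parts[f"df{i}"] = parts

print(df_parts["df2"]["part5"])
```

Note that range(1, 6) is needed to reach part5, since the upper bound of range is exclusive.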

issue in writing function to filter rows data frame

I am writing a function that will serve as a filter for the rows that I want to use.
The sample data frame is as follows:
df = pd.DataFrame()
df ['Xstart'] = [1,2.5,3,4,5]
df ['Xend'] = [6,8,9,10,12]
df ['Ystart'] = [0,1,2,3,4]
df ['Yend'] = [6,8,9,10,12]
df ['GW'] = [1,1,2,3,4]
def filter(data,Game_week):
    pass_data = data [(data['GW'] == Game_week)]
When I call the function filter as follows, I get an error.
df1 = filter(df,1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
but when I filter manually, it works.
pass_data = df [(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter the rows with multiple GW values (1, 2, 3, etc.).
For that I can do it manually as follows:
pass_data = df [(df['GW'] == [1])|(df['GW'] == [2])|(df['GW'] == [3])]
If I want to pass a list such as [1,2,3] as the function input,
how can I write the function so that I can input a range of 1 to 3?
Could anyone please advise?
Thanks,
Zep

Use isin to pass a list of values instead of a scalar. Also, filter is an existing built-in function in Python, so it is better to change the function name:
def filter_vals(data,Game_week):
    return data[data['GW'].isin(Game_week)]
df1 = filter_vals(df,range(1,4))

Your function doesn't return anything, so the call evaluates to None rather than the desired dataframe. Return the result instead (note that the parentheses inside data[...] are not needed):
def filter(data,Game_week):
    return data[data['GW'] == Game_week]
Also, isin may well be better:
def filter(data,Game_week):
    return data[data['GW'].isin(Game_week)]
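The None behaviour is plain Python and has nothing to do with pandas. A small sketch, with a list of dicts standing in for the dataframe (the names no_return and with_return are made up for the example):

```python
rows = [{"GW": 1}, {"GW": 1}, {"GW": 2}, {"GW": 3}]

def no_return(data, game_weeks):
    # The result is computed but never returned, so the call yields None
    selected = [row for row in data if row["GW"] in game_weeks]

def with_return(data, game_weeks):
    return [row for row in data if row["GW"] in game_weeks]

print(no_return(rows, [1, 2]))         # None
print(len(with_return(rows, [1, 2])))  # 3
```

Calling .head() on the first result is what produces the AttributeError from the question.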

For the first part, use return to return data from the function. For the second, use:
def filter(data,Game_week):
    return data[data['GW'].isin(Game_week)]
Now apply the filter function:
df1 = filter(df,[1,2])

Formatting Multiple Columns in a Pandas Dataframe

I have a dataframe I'm working with that has a large number of columns, and I'm trying to format them as efficiently as possible. I have a bunch of columns that all end in .pct that need to be formatted as percentages, some that end in .cost that need to be formatted as currency, etc.
I know I can do something like this:
cost_calc.style.format({'c.somecolumn.cost' : "${:,.2f}",
'c.somecolumn.cost' : "${:,.2f}",
'e.somecolumn.cost' : "${:,.2f}",
'e.somecolumn.cost' : "${:,.2f}",...
and format each column individually, but I was hoping there was a way to do something similar to this:
cost_calc.style.format({'*.cost' : "${:,.2f}",
'*.pct' : "{:,.2%}",...
Any ideas? Thanks!

The first way doesn't seem bad if you can build that dictionary automatically. You can generate a list of all columns fitting the *.cost description with something like
costcols = [x for x in cost_calc.columns.values if x[-5:] == '.cost']
then build your dict like:
formatdict = {}
for costcol in costcols:
    formatdict[costcol] = "${:,.2f}"
then as you suggested:
cost_calc.style.format(formatdict)
You can easily add the .pct cases similarly. Hope this helps!

I would use a regex with dict comprehensions:
import re
mylist = cost_calc.columns
r = re.compile(r'.*\.cost$')
cost_cols = {key: "${:,.2f}" for key in mylist if r.match(key)}
r = re.compile(r'.*\.pct$')
pct_cols = {key: "{:,.2%}" for key in mylist if r.match(key)}
cost_calc.style.format({**cost_cols, **pct_cols})
Note: merging dicts with {**a, **b} requires Python 3.5 or later.
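Both answers can be folded into one helper driven by a suffix-to-format mapping. This is a sketch rather than a pandas feature: build_formats and the column names are made up, and the function only builds the dictionary you would then pass to cost_calc.style.format.

```python
def build_formats(columns, suffix_formats):
    """Map each column to the format of the first suffix it ends with."""
    formats = {}
    for column in columns:
        for suffix, fmt in suffix_formats.items():
            if column.endswith(suffix):
                formats[column] = fmt
                break
    return formats

# Hypothetical column names
columns = ["c.labor.cost", "e.labor.cost", "c.margin.pct", "name"]
suffix_formats = {".cost": "${:,.2f}", ".pct": "{:,.2%}"}

formatdict = build_formats(columns, suffix_formats)
print(formatdict["c.labor.cost"].format(1234.5))   # $1,234.50
print(formatdict["c.margin.pct"].format(0.1234))   # 12.34%
```

Adding another family of columns then only means adding one entry to suffix_formats.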

How to run parallel programs with pyspark?

I would like to use our Spark cluster to run programs in parallel. My idea is to do something like the following:
from pyspark.sql import SparkSession
def simulate():
    # some magic happening in here
    return 0
spark = (
    SparkSession.builder
    .appName('my_simulation')
    .enableHiveSupport()
    .getOrCreate())
sc = spark.sparkContext
no_parallel_instances = sc.parallelize(range(500))
res = no_parallel_instances.map(lambda row: simulate())
print(res.collect())
The question I have is whether there's a way to execute simulate() with different parameters. The only way I can currently imagine is to have a dataframe specifying the parameters, so something like this:
parameter_list = [[5,2.3,3], [3,0.2,4]]
no_parallel_instances = sc.parallelize(parameter_list)
res = no_parallel_instances.map(lambda row: simulate(row))
print(res.collect())
Is there another, more elegant way to run parallel functions with spark?

If the data you are looking to parameterize your call with differs between each row, then yes you will need to include that with each row.
However, if you are looking to set global parameters that affect every row, then you can use a broadcast variable.
http://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
Broadcast variables are created once in your script and cannot be modified after that. Spark will efficiently distribute those values to every executor to make them available to your transformations. To create one, you provide the data to Spark and it gives you back a handle you can use to access it on the executors. For example:
settings_bc = sc.broadcast({
    'num_of_donkeys': 3,
    'donkey_color': 'brown'
})
def simulate(settings, n):
    # do magic
    return n
no_parallel_instances = sc.parallelize(range(500))
res = no_parallel_instances.map(lambda row: simulate(settings_bc.value, row))
print(res.collect())
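When each run needs its own parameters, one option is to generate the parameter rows up front and unpack them inside the mapped function. The sketch below runs locally with the built-in map; simulate's arguments and the candidate values are invented for illustration. With Spark, the same param_grid would be handed to sc.parallelize(param_grid).map(run_one).

```python
from itertools import product

def simulate(n_agents, rate, steps):
    # Hypothetical simulation; returns something cheap to check
    return n_agents * rate * steps

def run_one(params):
    # Unpack one parameter row into positional arguments
    return simulate(*params)

# Every combination of the candidate values: 2 * 2 * 1 = 4 rows
param_grid = list(product([5, 3], [0.5, 2.0], [4]))

results = list(map(run_one, param_grid))
print(list(zip(param_grid, results)))
```

Keeping run_one as a named top-level function rather than a lambda also makes it easier to test outside the cluster.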

Filtering a Pyspark DataFrame with SQL-like IN clause

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a.
The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?

The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable, you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.
In practice DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
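One catch with formatting a Python tuple straight into the query: a one-element tuple renders as (1,), which is not valid SQL. Here is a small sketch of a helper that builds the list explicitly. in_clause is a made-up name, and it deliberately accepts only integers, so no quoting or escaping is involved:

```python
def in_clause(values):
    """Render integers as a SQL list, e.g. (1, 2, 3)."""
    values = list(values)
    if not values:
        raise ValueError("IN () needs at least one value")
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        raise TypeError("only plain integers are supported")
    return "(" + ", ".join(str(v) for v in values) + ")"

a = (1, 2, 3)
query = "SELECT * FROM my_df WHERE field1 IN {0}".format(in_clause(a))
print(query)
print("{0}".format((1,)))   # (1,)
print(in_clause((1,)))      # (1)
```

For string values the DataFrame isin() route above is the safer option, because it avoids hand-written quoting altogether.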

Reiterating what @zero323 mentioned above: we can do the same thing using a list as well (not only a set), like below
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()

Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe "df" so that it keeps only the rows whose column "v" takes a value from choice_list, then
from pyspark.sql.functions import col
df_filtered = df.where(col("v").isin(choice_list))

You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")

A slightly different approach that worked for me is to filter with a custom filter function.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
def filter_func(a):
    """wrapper function to pass a in udf"""
    def filter_func_(value):
        """filtering function"""
        return value in a.value
    return udf(filter_func_, BooleanType())
# Broadcasting allows large variables to be passed efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1')))

from pyspark.sql import SparkSession
import pandas as pd
spark=SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark=spark.read.csv('datasets/myData.csv',header=True,inferSchema=True)
df_pyspark.createOrReplaceTempView("df")  # we need to create a temp view first
spark.sql("SELECT * FROM df where Departments in ('IOT','Big Data') order by Departments").show()
