I have this data in a DataFrame:
id,col
65475383,acacia
63975914,acacia
65475383,excelsa
63975914,better
I want a dictionary that maps each word in "col" to every id associated with it, something like this:
word: ids
acacia: 65475383,63975914
excelsa: 65475383
better: 63975914
I tried groupBy, but that is a way to aggregate data. How should I approach this problem?
I'm not sure whether you want the result as a Python dictionary or as a DataFrame (it is not clear from your question).
However, if you do want a DataFrame, then one way to compute it is:
from pyspark.sql.functions import collect_list
idsByWords = df \
    .groupBy("col") \
    .agg(collect_list("id").alias("ids")) \
    .withColumnRenamed("col", "word")
This will result in:
idsByWords.show(truncate=False)
+-------+--------------------+
|word |ids |
+-------+--------------------+
|excelsa|[65475383] |
|better |[63975914] |
|acacia |[65475383, 63975914]|
+-------+--------------------+
Then you can turn that DataFrame into a Python dictionary:
d = {r.asDict()["word"]: r.asDict()["ids"] for r in idsByWords.collect()}
To finally get:
{
'excelsa': [65475383],
'better': [63975914],
'acacia': [65475383, 63975914]
}
Note that collect may crash your driver program if the collected data exceeds your driver memory.
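If the grouped result is too large to collect in one go, a hedged alternative sketch is to stream the rows back with toLocalIterator, which only pulls one partition at a time into the driver:
# build the dictionary incrementally; assumes the idsByWords DataFrame from above
d = {}
for row in idsByWords.toLocalIterator():
    d[row["word"]] = row["ids"]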
Given a PySpark DataFrame, is it possible to obtain a list of the source columns that are being referenced by the DataFrame?
Perhaps a more concrete example might help explain what I'm after. Say I have a DataFrame defined as:
import pyspark.sql.functions as func
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame(
[("pru", 23, "finance"), ("paul", 26, "HR"), ("noel", 20, "HR")],
["name", "age", "department"],
)
source_df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT name, age, department FROM people")
df = sqlDF.groupBy("department").agg(func.max("age").alias("max_age"))
df.show()
which returns:
+----------+--------+
|department|max_age |
+----------+--------+
| finance| 23|
| HR| 26|
+----------+--------+
The columns that are referenced by df are [department, age]. Is it possible to get that list of referenced columns programmatically?
Thanks to Capturing the result of explain() in pyspark I know I can extract the plan as a string:
df._sc._jvm.PythonSQLUtils.explainString(df._jdf.queryExecution(), "formatted")
which returns:
== Physical Plan ==
AdaptiveSparkPlan (6)
+- HashAggregate (5)
+- Exchange (4)
+- HashAggregate (3)
+- Project (2)
+- Scan ExistingRDD (1)
(1) Scan ExistingRDD
Output [3]: [name#0, age#1L, department#2]
Arguments: [name#0, age#1L, department#2], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
(2) Project
Output [2]: [age#1L, department#2]
Input [3]: [name#0, age#1L, department#2]
(3) HashAggregate
Input [2]: [age#1L, department#2]
Keys [1]: [department#2]
Functions [1]: [partial_max(age#1L)]
Aggregate Attributes [1]: [max#22L]
Results [2]: [department#2, max#23L]
(4) Exchange
Input [2]: [department#2, max#23L]
Arguments: hashpartitioning(department#2, 200), ENSURE_REQUIREMENTS, [plan_id=60]
(5) HashAggregate
Input [2]: [department#2, max#23L]
Keys [1]: [department#2]
Functions [1]: [max(age#1L)]
Aggregate Attributes [1]: [max(age#1L)#12L]
Results [2]: [department#2, max(age#1L)#12L AS max_age#13L]
(6) AdaptiveSparkPlan
Output [2]: [department#2, max_age#13L]
Arguments: isFinalPlan=false
which is useful; however, it's not what I need. I need a list of the referenced columns. Is this possible?
Perhaps another way of asking the question is... is there a way to obtain the explain plan as an object that I can iterate over/explore?
UPDATE: Thanks to the reply from @matt-andruff I have got this:
df._jdf.queryExecution().executedPlan().treeString().split("+-")[-2]
which returns:
' Project [age#1L, department#2]\n '
from which I guess I could parse the information I'm after, but this is a far from elegant way to do it, and it is particularly error-prone.
What I'm really after is a failsafe, reliable, API-supported way to get this information. I'm starting to think it isn't possible.
There is an object for that; unfortunately it's a Java object and it isn't translated into PySpark.
You can still access it with Spark constructs:
>>> df._jdf.queryExecution().executedPlan().apply(0).output().apply(0).toString()
u'department#1621'
>>> df._jdf.queryExecution().executedPlan().apply(0).output().apply(1).toString()
u'max_age#1632L'
You could loop through the apply calls above to get the information you are looking for, with something like:
plan = df._jdf.queryExecution().executedPlan()
steps = [ plan.apply(i) for i in range(1,100) if not isinstance(plan.apply(i), type(None)) ]
iterator = steps[0].inputSet().iterator()
>>> iterator.next().toString()
u'department#1621'
>>> iterator.next().toString()
u'max#1642L'
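If you need more than a couple of attributes, a small helper (a sketch, not part of any public API) can drain the py4j-wrapped Scala iterator into a Python list instead of calling next() by hand:
# drain a Scala Iterator exposed through py4j into a Python list of strings
def scala_iterator_to_list(it):
    items = []
    while it.hasNext():
        items.append(it.next().toString())
    return items

# e.g. every attribute in the input set of the first step above
attrs = scala_iterator_to_list(steps[0].inputSet().iterator())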
steps = [ plan.apply(i) for i in range(1,100) if not isinstance(plan.apply(i), type(None)) ]
projections = [ (steps[0].p(i).toJSON().encode('ascii','ignore')) for i in range(1,100) if not( isinstance(steps[0].p(i), type(None) )) and steps[0].p(i).nodeName().encode('ascii','ignore') == 'Project' ]
rdd = spark.sparkContext.parallelize(projections)
df2 = spark.read.json(rdd)
>>> df2.show(1,False)
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
|child|class |name|num-children|output|outputOrdering|outputPartitioning|projectList |rdd |
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
|0 |org.apache.spark.sql.execution.ProjectExec|null|1 |null |null |null |[[[org.apache.spark.sql.catalyst.expressions.AttributeReference, long, [1620, 4ad48da6-03cf-45d4-9b35-76ac246fadac, org.apache.spark.sql.catalyst.expressions.ExprId], age, true, 0, [people]]], [[org.apache.spark.sql.catalyst.expressions.AttributeReference, string, [1621, 4ad48da6-03cf-45d4-9b35-76ac246fadac, org.apache.spark.sql.catalyst.expressions.ExprId], department, true, 0, [people]]]]|null|
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
df2.select(func.explode(func.col('projectList'))).select(func.col('col')[0]["name"]).show(100, False)
+-----------+
|col[0].name|
+-----------+
|age |
|department |
+-----------+
The range(1, 100) loop is a bit of a hack, but apparently size doesn't work. I'm sure with more time I could refine the range hack.
You can then use JSON to pull the information programmatically.
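If you would rather stay entirely on the Python side, a hedged alternative is to parse the plan's JSON with the standard json module and walk the nodes yourself. This sketch assumes the JSON layout shown above; the plan JSON is an internal format and can change between Spark versions, so treat it as exploratory rather than API-supported:
import json

# flatten the executed plan to JSON and collect the columns referenced by Project nodes
plan_nodes = json.loads(df._jdf.queryExecution().executedPlan().prettyJson())
referenced = set()
for node in plan_nodes:
    if node.get("class", "").endswith("ProjectExec"):
        for expr_tree in node.get("projectList", []):
            for expr in expr_tree:
                if expr.get("class", "").endswith("AttributeReference"):
                    referenced.add(expr["name"])
print(sorted(referenced))  # expected here: ['age', 'department']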
I have something that, while not being an answer to my original question (see Matt Andruff's answer for that), could still be useful here. It's a way to get all the source columns referenced by a pyspark.sql.column.Column.
Simple repro:
from pyspark.sql import functions as f, SparkSession
SparkSession.builder.getOrCreate()
col = f.concat(f.col("A"), f.col("B"))
type(col)
col._jc.expr().references().toList().toString()
returns:
<class 'pyspark.sql.column.Column'>
"List('A, 'B)"
It's definitely not perfect; it still requires you to parse the column names out of the string that is returned, but at least the information I'm after is available. There might be more methods on the object returned from references() that make it easier to parse the returned string, but if there are, I haven't found them!
Here is a function I wrote to do the parsing:
def parse_references(references: str):
    return sorted(
        "".join(
            references.replace("'", "")
            .replace("List(", "")
            .replace(")", "")
            .split()
        ).split(",")
    )
assert parse_references("List('A, 'B)") == ["A", "B"]
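A hedged alternative that avoids string parsing altogether: references() returns a Scala collection, so you can walk its iterator through py4j (much like the inputSet() example in Matt Andruff's answer) and read each attribute's name. A sketch, assuming the col object from the repro above:
# collect referenced column names without parsing the List(...) string
refs = col._jc.expr().references().iterator()
names = []
while refs.hasNext():
    names.append(refs.next().name())
print(sorted(names))  # expected here: ['A', 'B']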
PySpark is not really designed for such lower-level tricks (which call for Scala, the language Spark is developed in and which therefore exposes everything there is).
The step where you access QueryExecution is the main entry point to the machinery of Spark SQL's query execution engine.
The issue is that py4j (which is used as a bridge between the JVM and Python environments) makes it of little practical use on PySpark's side.
You can use the following if you need to access the final query plan (just before it's converted into RDDs):
df._jdf.queryExecution().executedPlan().prettyJson()
Review the QueryExecution API.
QueryExecutionListener
You should really consider Scala to intercept whatever you want about your queries, and QueryExecutionListener seems a fairly viable starting point.
There is more, but it's all in Scala :)
What I'm really after is a failsafe, reliable, API-supported way to get this information. I'm starting to think it isn't possible.
I'm not surprised, since you're throwing away the best possible answer: Scala. I'd recommend using it for a PoC to see what you can get, and only then (if you have to) look for a Python solution (which I think is doable, yet highly error-prone).
You can try the code below; it will give you the list of columns and their data types in the DataFrame.
for field in df.schema.fields:
    print(field.name + " , " + str(field.dataType))
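Equivalently, df.dtypes already exposes the same information as a list of (column name, type string) pairs; a hedged one-liner:
# dtypes is a list of (name, type) tuples, e.g. [('department', 'string'), ('max_age', 'bigint')]
print(df.dtypes)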
I have three arrays of string type containing the following information:
groupBy array: containing names of the columns I want to group my data by.
aggregate array: containing names of columns I want to aggregate.
operations array: containing the aggregate operations I want to perform.
I am trying to use Spark DataFrames to achieve this. Spark DataFrames provide an agg() to which you can pass a Map[String, String] (of column name and respective aggregate operation) as input; however, I want to perform different aggregation operations on the same column of the data. Any suggestions on how to achieve this?
Scala:
You can for example map over a list of functions with a defined mapping from name to function:
import org.apache.spark.sql.functions.{col, min, max, mean}
import org.apache.spark.sql.Column
val df = Seq((1L, 3.0), (1L, 3.0), (2L, -5.0)).toDF("k", "v")
val mapping: Map[String, Column => Column] = Map(
"min" -> min, "max" -> max, "mean" -> avg)
val groupBy = Seq("k")
val aggregate = Seq("v")
val operations = Seq("min", "max", "mean")
val exprs = aggregate.flatMap(c => operations.map(f => mapping(f)(col(c))))
df.groupBy(groupBy.map(col): _*).agg(exprs.head, exprs.tail: _*).show
// +---+------+------+------+
// | k|min(v)|max(v)|avg(v)|
// +---+------+------+------+
// | 1| 3.0| 3.0| 3.0|
// | 2| -5.0| -5.0| -5.0|
// +---+------+------+------+
or
df.groupBy(groupBy.head, groupBy.tail: _*).agg(exprs.head, exprs.tail: _*).show
Unfortunately, the parser which is used internally by SQLContext is not exposed publicly, but you can always try to build plain SQL queries:
df.registerTempTable("df")
val groupExprs = groupBy.mkString(",")
val aggExprs = aggregate.flatMap(c => operations.map(
f => s"$f($c) AS ${c}_${f}")
).mkString(",")
sqlContext.sql(s"SELECT $groupExprs, $aggExprs FROM df GROUP BY $groupExprs")
Python:
from pyspark.sql.functions import mean, sum, max, col
df = sc.parallelize([(1, 3.0), (1, 3.0), (2, -5.0)]).toDF(["k", "v"])
groupBy = ["k"]
aggregate = ["v"]
funs = [mean, sum, max]
exprs = [f(col(c)) for f in funs for c in aggregate]
# or equivalent df.groupby(groupBy).agg(*exprs)
df.groupby(*groupBy).agg(*exprs)
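If you also need predictable output column names (by default they come out as avg(v), sum(v), and so on), a hedged variant is to drive the same loop from a name-to-function mapping, mirroring the Scala version above, and alias each expression:
from pyspark.sql.functions import mean, sum, max, col

# map operation names to functions so each alias can reuse the name
funs = {"mean": mean, "sum": sum, "max": max}
exprs = [f(col(c)).alias(f"{c}_{name}") for name, f in funs.items() for c in aggregate]
df.groupby(*groupBy).agg(*exprs)  # columns: k, v_mean, v_sum, v_max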
See also:
Spark SQL: apply aggregate functions to a list of columns
For those wondering how @zero323's answer can be written without a list comprehension in Python:
from pyspark.sql.functions import min, max, col
# init your spark dataframe
expr = [min(col("valueName")),max(col("valueName"))]
df.groupBy("keyName").agg(*expr)
Do something like
from pyspark.sql import functions as F
df.groupBy('groupByColName') \
.agg(F.sum('col1').alias('col1_sum'),
F.max('col2').alias('col2_max'),
F.avg('col2').alias('col2_avg')) \
.show()
Here is another straightforward way to apply different aggregate functions to the same column using Scala (this has been tested in Azure Databricks).
import org.apache.spark.sql.functions.{avg, max, min, round}

val groupByColName = "Store"
val colName = "Weekly_Sales"

df.groupBy(groupByColName)
  .agg(min(colName),
       max(colName),
       round(avg(colName), 2))
  .show()
For example, if you want to compute the percentage of zeroes in each column of a PySpark DataFrame, you can build an expression to be executed on each column of the DataFrame:
from pyspark.sql.functions import count, col, sum as spark_sum

def count_zero_percentage(c):
    # fraction of rows in which column c equals zero
    pred = col(c) == 0
    return (spark_sum(pred.cast("integer")) / count('*')).alias(c)

df.agg(*[count_zero_percentage(c) for c in df.columns]).show()
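For instance (a hedged toy run, assuming a SparkSession named spark), on a small frame this yields the fraction of zeroes per column:
# toy data: column "a" is zero in 1 of 3 rows, column "b" in 2 of 3 rows
df = spark.createDataFrame([(0, 0), (1, 0), (2, 3)], ["a", "b"])
df.agg(*[count_zero_percentage(c) for c in df.columns]).show()
# expected: a = 1/3, b = 2/3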
case class soExample(firstName: String, lastName: String, Amount: Int)
val df = Seq(soExample("me", "zack", 100)).toDF
import org.apache.spark.sql.functions._
val grouped = df.groupBy("firstName", "lastName").agg(
sum("Amount"),
mean("Amount"),
stddev("Amount"),
count(lit(1)).alias("numOfRecords")
).toDF()
display(grouped)
// Courtesy of Zach.
Zach's simplified answer for the post marked as a duplicate:
Spark Scala Data Frame to have multiple aggregation of single Group By
I asked this question a while back for Python (pandas), but now I need to do the same thing in PySpark.
I have a DataFrame (df) like so:
|cust_id|address |store_id|email |sales_channel|category|
-------------------------------------------------------------------
|1234567|123 Main St|10SjtT |idk#gmail.com|ecom |direct |
|4567345|345 Main St|10SjtT |101#gmail.com|instore |direct |
|1569457|876 Main St|51FstT |404#gmail.com|ecom |direct |
and I would like to combine the last 4 fields into one metadata field that is a JSON object, like so:
|cust_id|address |metadata |
-------------------------------------------------------------------------------------------------------------------
|1234567|123 Main St|{'store_id':'10SjtT', 'email':'idk#gmail.com','sales_channel':'ecom', 'category':'direct'} |
|4567345|345 Main St|{'store_id':'10SjtT', 'email':'101#gmail.com','sales_channel':'instore', 'category':'direct'}|
|1569457|876 Main St|{'store_id':'51FstT', 'email':'404#gmail.com','sales_channel':'ecom', 'category':'direct'} |
Here's the code I used to do this in Python (pandas):
cols = [
'store_id',
'store_category',
'sales_channel',
'email'
]
df1 = df.copy()
df1['metadata'] = df1[cols].to_dict(orient='records')
df1 = df1.drop(columns=cols)
but I would like to translate this to PySpark code that works with a Spark DataFrame; I do NOT want to use pandas in Spark.
Use the to_json function to create the JSON object.
Example:
from pyspark.sql.functions import *
#sample data
df=spark.createDataFrame([('1234567','123 Main St','10SjtT','idk#gmail.com','ecom','direct')],['cust_id','address','store_id','email','sales_channel','category'])
df.select("cust_id","address",to_json(struct("store_id","category","sales_channel","email")).alias("metadata")).show(10,False)
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","category":"direct","sales_channel":"ecom","email":"idk#gmail.com"}|
+-------+-----------+----------------------------------------------------------------------------------------+
to_json by passing a list of columns:
ll=['store_id','email','sales_channel','category']
df.withColumn("metadata", to_json(struct([x for x in ll]))).drop(*ll).show()
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","email":"idk#gmail.com","sales_channel":"ecom","category":"direct"}|
+-------+-----------+----------------------------------------------------------------------------------------+
@Shu gives a good answer; here's a variant that works out slightly better for my use case. I'm going from Kafka -> Spark -> Kafka and this one-liner does exactly what I want. The struct(*) will pack up all the fields in the DataFrame.
# Packup the fields in preparation for sending to Kafka sink
kafka_df = df.selectExpr('cast(id as string) as key', 'to_json(struct(*)) as value')
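For completeness, a hedged sketch of the sink side of that pipeline; the broker address, topic name, and checkpoint path below are hypothetical placeholders, not values from the original post:
# write the packed key/value frame to a Kafka topic via structured streaming
(kafka_df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("topic", "events_out")                        # placeholder topic
    .option("checkpointLocation", "/tmp/checkpoints/events_out")
    .start())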
I am very new to Python/PySpark and am currently using it with Databricks.
I have the following list:
dummyJson= [
('{"name":"leo", "object" : ["191.168.192.96", "191.168.192.99"]}',),
('{"name":"anne", "object" : ["191.168.192.103", "191.168.192.107"]}',),
]
When I try
jsonRDD = sc.parallelize(dummyJson)
and then put it in a DataFrame with
spark.read.json(jsonRDD)
it does not parse the JSON correctly. The resulting DataFrame is one column with _corrupt_record as the header.
Looking at the elements in dummyJson, it looks like there is an extra/unnecessary comma just before the closing parenthesis of each element/record.
How can I remove this comma from each element of this list?
Thanks
If you can fix the input format at the source, that would be ideal.
But for your given case, you can fix it by taking the objects out of the tuples.
>>> dJson = [i[0] for i in dummyJson]
>>> jsonRDD = sc.parallelize(dJson)
>>> jsonDF = spark.read.json(jsonRDD)
>>> jsonDF.show()
+----+--------------------+
|name| object|
+----+--------------------+
| leo|[191.168.192.96, ...|
|anne|[191.168.192.103,...|
+----+--------------------+
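Another hedged option, if you would rather keep the tuples and be explicit about the schema instead of relying on inference, is to load the strings as a one-column DataFrame and parse them with from_json:
from pyspark.sql import functions as F, types as T

# explicit schema for the JSON payload
schema = T.StructType([
    T.StructField("name", T.StringType()),
    T.StructField("object", T.ArrayType(T.StringType())),
])

raw = spark.createDataFrame(dummyJson, ["value"])  # one string column from the 1-tuples
parsed = raw.select(F.from_json("value", schema).alias("j")).select("j.*")
parsed.show(truncate=False)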
I am new to Spark. Is there any built-in function that will give the date one month ahead of the current date? For example, if today is 27-12-2016, the function should return 27-01-2017. I have used date_add(), but there is no function for adding a month. I tried date_add(date, 31), but what if the month has 30 days?
spark.sql("select date_add(current_date(),31)") .show()
Could anyone help me with this problem? Do I need to write a custom function for that? I haven't found any built-in function for it so far.
Thanks in advance
Kalyan
The most straightforward DataFrame-friendly solution I found for adding/subtracting months:
from pyspark.sql import functions as F
# assume df has "current_date" column as type DateType
months_to_add = 1 # int value, could be negative
df = df.withColumn("new_date", F.add_months("current_date", months_to_add))
This result will include any other columns previously contained in df.
This is not PySpark-specific. You can use add_months; it has been available since Spark 1.5. For example:
spark.sql("select current_date(), add_months(current_date(),1)").show()
# +--------------+-----------------------------+
# |current_date()|add_months(current_date(), 1)|
# +--------------+-----------------------------+
# | 2016-12-27| 2017-01-27|
# +--------------+-----------------------------+
You can also use negative integers to subtract months:
spark.sql("select current_date(), add_months(current_date(),-1) as last_month").show()
# +--------------+----------+
# |current_date()|last_month|
# +--------------+----------+
# | 2016-12-27|2016-11-27|
# +--------------+----------+
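To address the 30-vs-31-day concern directly: add_months clamps to the last valid day of the target month, so you never get an invalid date. A quick check:
spark.sql("select add_months(to_date('2017-01-31'), 1) as next_month").show()
# +----------+
# |next_month|
# +----------+
# |2017-02-28|
# +----------+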