I have a DataFrame that I have processed to look like this:
+---------+-------+
| inputs | temp |
+---------+-------+
| [1,0,0] | 12 |
+---------+-------+
| [0,1,0] | 10 |
+---------+-------+
...
inputs is a column of DenseVectors. temp is a column of values. I want to append each temp value to its DenseVector to create a single column, but I am not sure how to start. Any tips for producing this desired output:
+---------------+
| inputsMerged |
+---------------+
| [1,0,0,12] |
+---------------+
| [0,1,0,10] |
+---------------+
...
EDIT: I am trying to use the VectorAssembler method but my resulting array is not as intended.
You might do something like this:
df.show()
+-------------+----+
| inputs|temp|
+-------------+----+
|[1.0,0.0,0.0]| 12|
|[0.0,1.0,0.0]| 10|
+-------------+----+
df.printSchema()
root
|-- inputs: vector (nullable = true)
|-- temp: long (nullable = true)
Import:
import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
Create a UDF to merge the vector and the element:
concat = F.udf(lambda v, e: Vectors.dense(list(v) + [e]), VectorUDT())
Apply the UDF to the inputs and temp columns:
merged_df = df.select(concat(df.inputs, df.temp).alias('inputsMerged'))
merged_df.show()
+------------------+
| inputsMerged|
+------------------+
|[1.0,0.0,0.0,12.0]|
|[0.0,1.0,0.0,10.0]|
+------------------+
merged_df.printSchema()
root
|-- inputsMerged: vector (nullable = true)
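If you would rather keep the original columns alongside the merged vector, the same UDF works with withColumn instead of select (df_with_merged is just an illustrative name). Also, regarding the EDIT in the question: VectorAssembler can emit a SparseVector representation when that is more compact, which prints differently from a dense array and may be why its result looked unexpected.
# Reuses the concat UDF defined above and keeps the inputs and temp columns
df_with_merged = df.withColumn('inputsMerged', concat(df.inputs, df.temp))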
I have an example dataset:
+---+------------------------------+
|id |example_field |
+---+------------------------------+
|1 |{[{[{111, AAA}, {222, BBB}]}]}|
+---+------------------------------+
The data types of the two fields are:
[('id', 'int'),
('example_field',
'struct<xxx:array<struct<nested_field:array<struct<field_1:int,field_2:string>>>>>')]
My question is whether there's a way/function to flatten the field example_field using PySpark.
My expected output is something like this:
id field_1 field_2
1 111 AAA
1 222 BBB
The following code should do the trick:
from pyspark.sql import functions as F
(
df
.withColumn('_temp_ef', F.explode('example_field.xxx'))
.withColumn('_temp_nf', F.explode('_temp_ef.nested_field'))
.select(
'id',
F.col('_temp_nf.*')
)
)
The explode function creates a row for each element of an array, while the final select turns the fields of the nested_field struct into columns.
The result is:
+---+-------+-------+
|id |field_1|field_2|
+---+-------+-------+
|1 |111 |AAA |
|1 |222 |BBB |
+---+-------+-------+
Note: I assumed that your DataFrame is something like this:
root
|-- id: integer (nullable = true)
|-- example_field: struct (nullable = true)
| |-- xxx: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- nested_field: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- field_1: integer (nullable = true)
| | | | | |-- field_2: string (nullable = true)
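For reference, a minimal way to build a test DataFrame with the assumed schema above (the field names and values come from the question; the spark session variable is an assumption):
from pyspark.sql import Row

# Nested Rows become structs and Python lists become arrays during schema inference
sample = [Row(id=1,
              example_field=Row(xxx=[Row(nested_field=[Row(field_1=111, field_2="AAA"),
                                                       Row(field_1=222, field_2="BBB")])]))]
df = spark.createDataFrame(sample)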
Let's suppose that we have the following two tables
+---------+-------+
|AUTHOR_ID| NAME  |
+---------+-------+
| 102     | Camus |
| 103     | Hugo  |
+---------+-------+

+---------+---------+-----------+
|AUTHOR_ID| BOOK_ID | BOOK_NAME |
+---------+---------+-----------+
| 1       | Camus   | Etranger  |
| 1       | Hugo    | Mesirable |
+---------+---------+-----------+
I want to join the two tables in order to get a DataFrame with the following schema:
root
|-- AUTHORID: integer
|-- NAME: string
|-- BOOK_LIST: array
| |-- BOOK_ID: integer
| |-- BOOK_NAME: string
I'm using PySpark. Thanks in advance.
A simple join + groupBy should do the job:
from pyspark.sql import functions as F
result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
.groupBy("AUTHOR_ID", "NAME")
.agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")))
)
In the aggregation we use collect_list to create the array of structs.
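If you also want the aggregated column to be named BOOK_LIST, as in the desired schema, a minor variant of the snippet above is to alias the aggregation:
result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
          .groupBy("AUTHOR_ID", "NAME")
          # name the collected array BOOK_LIST to match the requested schema
          .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST")))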
I have a PySpark DataFrame that contains N columns of integers. Some of the fields might be null as well.
For example:
+---+-----+-----+
| id| f_1 | f_2 |
+---+-----+-----+
| 1| null| null|
| 2|123 | null|
| 3|124 |127 |
+---+-----+-----+
What I want is to combine all f-prefixed columns into a pyspark array in a new column. For example:
+---+---------+
| id| combined|
+---+---------+
| 1| [] |
| 2|[123] |
| 3|[124,127]|
+---+---------+
The closest I have managed to get is this:
features_filtered = features.select(F.concat(* features.columns[1:]).alias('combined'))
which returns null (I assume due to the nulls in the initial dataframe).
From what I've searched, I would like to use .coalesce() or maybe .fillna() to handle/remove the nulls, but I haven't managed to make it work.
My main requirements are that the newly created column should be of type Array and that I don't want to enumerate all the column names that I need to concat.
In PySpark this can be done as:
import pyspark.sql.functions as f
from pyspark.sql.functions import expr

df = (df.withColumn("combined_array", f.array(*[i for i in df.columns if i.startswith('f')]))
        .withColumn("combined", expr('FILTER(combined_array, x -> x is not null)')))
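If you are on Spark 3.4 or later, array_compact drops the nulls without the FILTER expression; a small sketch starting from the original DataFrame (df_compact is just an illustrative name):
import pyspark.sql.functions as f

# array_compact removes null entries from the assembled array (Spark 3.4+)
df_compact = df.withColumn("combined",
                           f.array_compact(f.array(*[i for i in df.columns if i.startswith('f')])))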
Try this (in Scala, but it can be implemented in Python with minimal changes):
Load the data
val data =
"""
|id| f_1 | f_2
| 1| null| null
| 2|123 | null
| 3|124 |127
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- id: integer (nullable = true)
* |-- f_1: integer (nullable = true)
* |-- f_2: integer (nullable = true)
*
* +---+----+----+
* |id |f_1 |f_2 |
* +---+----+----+
* |1 |null|null|
* |2 |123 |null|
* |3 |124 |127 |
* +---+----+----+
*/
Convert it to array
df.withColumn("array", array(df.columns.filter(_.startsWith("f")).map(col): _*))
.withColumn("combined", expr("FILTER(array, x -> x is not null)"))
.show(false)
/**
* +---+----+----+----------+----------+
* |id |f_1 |f_2 |array |combined |
* +---+----+----+----------+----------+
* |1 |null|null|[,] |[] |
* |2 |123 |null|[123,] |[123] |
* |3 |124 |127 |[124, 127]|[124, 127]|
* +---+----+----+----------+----------+
*/
Currently I am working in PySpark and have little knowledge of this technology. My DataFrame looks like:
id dob var1
1 13-02-1976 aab#dfsfs
2 01-04-2000 bb#NAm
3 28-11-1979 adam11#kjfd
4 30-01-1955 rehan42#ggg
My desired output looks like:
id dob var1 age var2
1 13-02-1976 aab#dfsfs 43 aab
2 01-04-2000 bb#NAm 19 bb
3 28-11-1979 adam11#kjfd 39 adam11
4 30-01-1955 rehan42#ggg 64 rehan42
What I have done so far:
df= df.select( df.id.cast('int').alias('id'),
df.dob.cast('date').alias('dob'),
df.var1.cast('string').alias('var1'))
But I think dob is not converted properly.
df= df.withColumn('age', F.datediff(F.current_date(), df.dob))
As you said, the casting of the dob column is not proper. Please try this:
from pyspark.sql.functions import col, unix_timestamp, to_date
import pyspark.sql.functions as F
df2 = df.withColumn('date_in_dateFormat', to_date(unix_timestamp(F.col('dob'), 'dd-MM-yyyy').cast("timestamp")))
df2.show()
+---+----------+-----------+------------------+
| id| dob| var1|date_in_dateFormat|
+---+----------+-----------+------------------+
| 1|13-02-1976| aab#dfsfs| 1976-02-13|
| 2|01-04-2000| bb#NAm| 2000-04-01|
| 3|28-11-1979|adam11#kjfd| 1979-11-28|
| 4|30-01-1955|rehan42#ggg| 1955-01-30|
+---+----------+-----------+------------------+
df2.printSchema()
root
|-- id: integer (nullable = true)
|-- dob: string (nullable = true)
|-- var1: string (nullable = true)
|-- date_in_dateFormat: date (nullable = true)
df3= df2.withColumn('age', F.datediff(F.current_date(), df2.date_in_dateFormat))
df3.show()
+---+----------+-----------+------------------+-----+
| id| dob| var1|date_in_dateFormat| age|
+---+----------+-----------+------------------+-----+
| 1|13-02-1976| aab#dfsfs| 1976-02-13|15789|
| 2|01-04-2000| bb#NAm| 2000-04-01| 6975|
| 3|28-11-1979|adam11#kjfd| 1979-11-28|14405|
| 4|30-01-1955|rehan42#ggg| 1955-01-30|23473|
+---+----------+-----------+------------------+-----+
split_col = F.split(df3['var1'], '#')
df4=df3.withColumn('Var2', split_col.getItem(0))
df4.show()
+---+----------+-----------+------------------+-----+-------+
| id| dob| var1|date_in_dateFormat| age| Var2|
+---+----------+-----------+------------------+-----+-------+
| 1|13-02-1976| aab#dfsfs| 1976-02-13|15789| aab|
| 2|01-04-2000| bb#NAm| 2000-04-01| 6975| bb|
| 3|28-11-1979|adam11#kjfd| 1979-11-28|14405| adam11|
| 4|30-01-1955|rehan42#ggg| 1955-01-30|23473|rehan42|
+---+----------+-----------+------------------+-----+-------+
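Note that datediff returns the age in days. If you want whole years, as in the expected output in the question, one rough conversion (assuming df4 from above; age_years is just an illustrative name) is:
# Approximate age in years from the day difference
df5 = df4.withColumn('age_years', F.floor(F.col('age') / 365.25))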
I have a dataframe with a MapType column where the key is an id and the value is another StructType with two numbers, a counter and a revenue.
It looks like that:
+--------------------------------------+
| myMapColumn |
+--------------------------------------+
| Map(1 -> [1, 4.0], 2 -> [1, 1.5]) |
| Map() |
| Map(1 -> [3, 5.5]) |
| Map(1 -> [4, 0.1], 2 -> [6, 101.56]) |
+--------------------------------------+
Now I need to sum up these two values per id and the result would be:
+----+-------+---------+
| id | count | revenue |
+----+-------+---------+
| 1  | 8     | 9.6     |
| 2  | 7     | 103.06  |
+----+-------+---------+
I actually have no idea how to do that and could not find documentation for this special case. I tried using DataFrame.groupBy but could not make it work :(
Any ideas?
I'm using Spark 1.5.2 with Python 2.6.6
Assuming that the schema is equivalent to this:
root
|-- myMapColumn: map (nullable = true)
| |-- key: integer
| |-- value: struct (valueContainsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: double (nullable = false)
all you need is explode and a simple aggregation:
from pyspark.sql.functions import col, explode, sum as sum_
(df
.select(explode(col("myMapColumn")))
.groupBy(col("key").alias("id"))
.agg(sum_("value._1").alias("count"), sum_("value._2").alias("revenue")))
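For reference, exploding a MapType column yields key and value columns, which is what the groupBy relies on. A minimal way to reproduce the input for testing on a recent Spark (the spark session variable is an assumption; on 1.5.2 you would use sqlContext.createDataFrame instead):
# Dicts become map values and tuples become structs with fields _1 and _2
sample = [({1: (1, 4.0), 2: (1, 1.5)},),
          ({},),
          ({1: (3, 5.5)},),
          ({1: (4, 0.1), 2: (6, 101.56)},)]
df = spark.createDataFrame(sample, "myMapColumn map<int, struct<_1:int, _2:double>>")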