Extracting a sub array from PySpark DataFrame column [duplicate] - python

This question already has answers here:
get first N elements from dataframe ArrayType column in pyspark
(2 answers)
Closed 4 years ago.
I wish to remove the last element of the array from this DataFrame. We have this link demonstrating the same thing, but with UDFs, which I wish to avoid. Is there a simple way to do this - something like list[:2]?
data = [(['cat','dog','sheep'],),(['bus','truck','car'],),(['ice','pizza','pasta'],)]
df = sqlContext.createDataFrame(data,['data'])
df.show()
+-------------------+
| data|
+-------------------+
| [cat, dog, sheep]|
| [bus, truck, car]|
|[ice, pizza, pasta]|
+-------------------+
Expected DataFrame:
+--------------+
| data|
+--------------+
| [cat, dog]|
| [bus, truck]|
| [ice, pizza]|
+--------------+

UDF is the best thing you can find for PySpark :)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
# Get the first two elements, keeping the column as an array of strings
split_row = udf(lambda row: row[:2], ArrayType(StringType()))
# Apply the udf to each row
new_df = df.withColumn("data", split_row(df["data"]))
new_df.show()
# Output
+------------+
| data|
+------------+
| [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+
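That said, if your Spark version is 2.4 or newer, the built-in slice SQL function should do this without a UDF; a minimal sketch on the df from the question:
from pyspark.sql.functions import expr

# Keep the first two elements
df.withColumn("data", expr("slice(data, 1, 2)")).show()

# Or drop the last element, whatever the array length is
df.withColumn("data", expr("slice(data, 1, size(data) - 1)")).show()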

Related

Combine values from multiple columns into one Pyspark Dataframe [duplicate]

This question already has answers here:
Concat multiple columns of a dataframe using pyspark
(1 answer)
Concatenate columns in Apache Spark DataFrame
(18 answers)
How to concatenate multiple columns in PySpark with a separator?
(2 answers)
Closed 2 years ago.
I have a pyspark dataframe that has fields:
"id",
"fields_0_type" ,
"fields_0_price",
"fields_1_type",
"fields_1_price"
+----+-------------+--------------+-------------+--------------+
|id  |fields_0_type|fields_0_price|fields_1_type|fields_1_price|
+----+-------------+--------------+-------------+--------------+
|1234|Return       |45            |New          |50            |
+----+-------------+--------------+-------------+--------------+
How can I combine these fields into two columns called "type" and "price", with the values joined by ","? The ideal dataframe looks like this:
+----+----------+-----+
|id  |type      |price|
+----+----------+-----+
|1234|Return,New|45,50|
+----+----------+-----+
Note that the data I am providing here is a sample. In reality I have more than just "type" and "price" columns that will need to be combined.
Update:
Thanks, it works. But is there any way to get rid of the extra ","? They are caused by blank values in the columns. Is there a way to simply skip the columns that hold blank values?
What it is showing now:
+------------------------------------------------------------------+
|type |
+------------------------------------------------------------------+
|New,New,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, |
|New,New,Sale,Sale,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,|
+------------------------------------------------------------------+
How I want it:
+---------------------+
|type                 |
+---------------------+
|New,New,New          |
|New,New,Sale,Sale,New|
+---------------------+
Collect all the columns into an array, then use the concat_ws function.
Example:
from pyspark.sql.functions import array, concat_ws

df.show()
#+----+-------------+-------------+-------------+
#| id|fields_0_type|fields_1_type|fields_2_type|
#+----+-------------+-------------+-------------+
#|1234| a| b| c|
#+----+-------------+-------------+-------------+
columns=df.columns
columns.remove('id')
df.withColumn("type",concat_ws(",",array(*columns))).drop(*columns).show()
#+----+-----+
#| id| type|
#+----+-----+
#|1234|a,b,c|
#+----+-----+
UPDATE:
df.show()
#+----+-------------+--------------+-------------+--------------+
#| id|fields_0_type|fields_0_price|fields_1_type|fields_1_price|
#+----+-------------+--------------+-------------+--------------+
#|1234| a| 45| b| 50|
#+----+-------------+--------------+-------------+--------------+
type_cols=[f for f in df.columns if 'type' in f]
price_cols=[f for f in df.columns if 'price' in f]
df.withColumn("type",concat_ws(",",array(*type_cols))).withColumn("price",concat_ws(",",array(*price_cols))).\
drop(*type_cols,*price_cols).\
show()
#+----+----+-----+
#| id|type|price|
#+----+----+-----+
#|1234| a,b|45,50|
#+----+----+-----+
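Regarding the follow-up about the extra commas: concat_ws skips nulls but keeps empty strings, so one option (a sketch, assuming Spark 2.4+ where array_remove is available) is to strip the empty strings from the arrays before joining:
from pyspark.sql.functions import array, array_remove, concat_ws

# Drop empty-string entries from each array, then join the remaining values
df.withColumn("type", concat_ws(",", array_remove(array(*type_cols), ""))).\
    withColumn("price", concat_ws(",", array_remove(array(*price_cols), ""))).\
    drop(*type_cols, *price_cols).\
    show()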

Spark (PySpark) - How to convert a DataFrame string column to multiple DataFrame columns

I have the below pyspark DataFrame[recoms: string] where the recoms column is of string type. Each row here is a single string.
+--------------------+
| recoms |
+--------------------+
|{"a":"1","b":"5",..}|
|{"a":"2","b":"4",..}|
|{"a":"3","b":"9",..}|
+--------------------+
The rows above don't have a defined schema I could pass to the from_json method, so I'm looking for alternate options here. How do I convert or split this into multiple dataframe columns, as below, using pyspark sql functions? In the above table, every value left of the : should become a dataframe column name and every value right of the : should become a row value.
+--------+---+
| a | b |
+--------+---+
| 1 | 5|
| 2 | 4|
| 3 | 9|
+--------+---+
I tried the explode sql function, df.select(explode("recoms")).show(), and got the below error.
org.apache.spark.sql.AnalysisException: cannot resolve 'explode(true_recoms)' due to data type mismatch: input to function explode should be array or map type, not string;;
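One possible direction (a sketch, not a verified answer, assuming Spark 2.3+ and that every value in the JSON objects is a string) is to parse the strings into a map with from_json, which only needs a key/value type rather than a full struct schema, and then promote the keys to columns:
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

# Parse each JSON string into a map<string,string>; no full struct schema is required
parsed = df.withColumn("recoms_map",
                       F.from_json("recoms", MapType(StringType(), StringType())))

# Find the distinct keys once (this runs a small job), then turn each key into a column
keys = (parsed
        .select(F.explode(F.map_keys("recoms_map")).alias("k"))
        .distinct()
        .rdd.flatMap(lambda r: r)
        .collect())

parsed.select([F.col("recoms_map").getItem(k).alias(k) for k in sorted(keys)]).show()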

Flip a Dataframe [duplicate]

This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Unpivot in Spark SQL / PySpark
(2 answers)
Dataframe transpose with pyspark in Apache Spark
(2 answers)
Closed 4 years ago.
I am working on Databricks using Python 2.
I have a PySpark dataframe like:
|Germany|USA|UAE|Turkey|Canada|...
|5      |3  |3  |42    |12    |...
Which, as you can see, consists of hundreds of columns and only a single row.
I want to flip it in a way such that I get:
Name    | Views
---------------
Germany | 5
USA     | 3
UAE     | 3
Turkey  | 42
Canada  | 12
How would I approach this?
Edit: I have hundreds of columns, so I can't write them out. I don't know most of them; they just exist there. I can't use the column names in this process.
Edit 2: Example code:
dicttest = {'Germany': 5, 'USA': 20, 'Turkey': 15}
rdd=sc.parallelize([dicttest]).toDF()
df = rdd.toPandas().transpose()
This answer might be a bit 'overkill' but it does not use Pandas or collect anything to the driver. It will also work when you have multiple rows. We can just pass an empty list to the melt function from "How to melt Spark DataFrame?"
A working example would be as follows:
import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SQLContext, Column
import pandas as pd
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql import DataFrame
from typing import Iterable
try:
    sc
except NameError:
    sc = ps.SparkContext()
sqlContext = SQLContext(sc)

# From https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe
def melt(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""
    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

# Sample data
df1 = sqlContext.createDataFrame(
    [(0, 1, 2, 3, 4)],
    ("col1", "col2", "col3", "col4", "col5"))
df1.show()
df2 = melt(df1,id_vars=[],value_vars=df1.columns)
df2.show()
Output:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| 0| 1| 2| 3| 4|
+----+----+----+----+----+
+--------+-----+
|variable|value|
+--------+-----+
| col1| 0|
| col2| 1|
| col3| 2|
| col4| 3|
| col5| 4|
+--------+-----+
Hope this helps.
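On newer Spark versions (3.4+), DataFrame.unpivot (also exposed as melt) covers this without the helper; a small sketch under that version assumption, reusing df1 from above with a throwaway id column:
from pyspark.sql.functions import lit

# A dummy id column keeps the call simple when there is nothing to preserve
wide = df1.withColumn("_dummy", lit(0))
long_df = wide.unpivot(["_dummy"], df1.columns, "variable", "value").drop("_dummy")
long_df.show()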
You can convert the pyspark dataframe to a pandas dataframe and use the transpose function:
%pyspark
import numpy as np
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit
dt1 = [[1,2,4,5,6,7]]
dt = sc.parallelize(dt1).toDF()
dt.show()
dt.toPandas().transpose()
Output:
Other solution
dt2 = [{"1":1,"2":2,"4":4,"5":5,"6":29,"7":8}]
df = sc.parallelize(dt2).toDF()
df.show()
a = [{"name":i,"value":df.select(i).collect()[0][0]} for i in df.columns ]
df1 = sc.parallelize(a).toDF()
print(df1)
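The list comprehension above runs one collect per column; a lighter variant of the same idea (a sketch, assuming the single-row df from the snippet above) pulls the row once:
# Collect the single row once and turn its (column, value) pairs into rows
row = df.first().asDict()
pairs = [(name, value) for name, value in row.items()]
df1 = sc.parallelize(pairs).toDF(["name", "value"])
df1.show()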

Efficient column processing in PySpark

I have a dataframe with a very large number of columns (>30000).
I'm filling it with 1 and 0 based on the first column like this:
for column in list_of_column_names:
    df = df.withColumn(column, when(array_contains(df['list_column'], column), 1).otherwise(0))
However this process takes a lot of time. Is there a way to do this more efficiently? Something tells me that column processing can be parallelized.
Edit:
Sample input data
+----------------+-----+-----+-----+
| list_column | Foo | Bar | Baz |
+----------------+-----+-----+-----+
| ['Foo', 'Bak'] | | | |
| ['Bar', 'Baz'] |     |     |     |
| ['Foo'] | | | |
+----------------+-----+-----+-----+
There is nothing specifically wrong with your code, other than very wide data:
for column in list_of_column_names:
    df = df.withColumn(...)
only generates the execution plan.
Actual data processing will be concurrent and parallelized once the result is evaluated.
It is, however, an expensive process, as it requires O(NMK) operations with N rows, M columns and K values in the list.
Additionally, execution plans on very wide data are very expensive to compute (though the cost is constant in terms of the number of records). If that becomes a limiting factor, you might be better off with RDDs, roughly as sketched after the steps below:
Sort column array using sort_array function.
Convert data to RDD.
Apply search for each column using binary search.
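A rough sketch of that RDD route (assuming list_of_column_names is known up front, and keeping only list_column plus the flag columns):
from bisect import bisect_left
from pyspark.sql.functions import sort_array

def contains(sorted_values, value):
    # Binary search in the pre-sorted array
    i = bisect_left(sorted_values, value)
    return 1 if i < len(sorted_values) and sorted_values[i] == value else 0

sorted_df = df.select(sort_array("list_column").alias("list_column"))
result = (sorted_df.rdd
          .map(lambda row: (row["list_column"],) +
                           tuple(contains(row["list_column"], c) for c in list_of_column_names))
          .toDF(["list_column"] + list_of_column_names))
result.show()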
You might approach it like this; building all the expressions up front and issuing a single select avoids growing the plan with one projection per withColumn call:
import pyspark.sql.functions as F
exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]
df = df.select(['list_column'] + exprs)
withColumn is already distributed, so a faster approach would be difficult to get beyond what you already have. You can try defining a udf function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t
def containsUdf(listColumn):
    # Build a {column: flag} dict for the struct returned by the udf
    row = {}
    for column in list_of_column_names:
        if column in listColumn:
            row.update({column: 1})
        else:
            row.update({column: 0})
    return row

callContainsUdf = f.udf(containsUdf, t.StructType([t.StructField(x, t.IntegerType(), True) for x in list_of_column_names]))

df.withColumn('struct', callContainsUdf(df['list_column']))\
    .select(f.col('list_column'), f.col('struct.*'))\
    .show(truncate=False)
which should give you
+-----------+---+---+---+
|list_column|Foo|Bar|Baz|
+-----------+---+---+---+
|[Foo, Bak] |1 |0 |0 |
|[Bar, Baz] |0 |1 |1 |
|[Foo] |1 |0 |0 |
+-----------+---+---+---+
Note: list_of_column_names = ["Foo","Bar","Baz"]

Filter a large number of IDs from a dataframe Spark

I have a large dataframe with a format similar to
+-----+------+------+
|ID |Cat |date |
+-----+------+------+
|12 | A |201602|
|14 | B |201601|
|19 | A |201608|
|12 | F |201605|
|11 | G |201603|
+-----+------+------+
and I need to filter rows based on a list with around 5000 thousand IDs. The straightforward way would be to filter with isin, but that has really bad performance. How can this filter be done?
If you're committed to using Spark SQL and isin doesn't scale anymore, then an inner equi-join should be a decent fit.
First convert the id list to a single-column DataFrame. If it is a local collection:
ids_df = sc.parallelize(id_list).map(lambda x: (x, )).toDF(["id"])
and join:
df.join(ids_df, ["ID"], "inner")
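Since the id list is small compared to the main dataframe, it may also be worth hinting a broadcast join so the ids are shipped to every executor instead of shuffling the large table (a sketch, reusing ids_df from above):
from pyspark.sql.functions import broadcast

# Broadcast the small id table; the big df is then filtered without a shuffle on df
filtered = df.join(broadcast(ids_df), ["ID"], "inner")
filtered.show()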
