Encode a column with integer in pyspark

Encode a column with integer in pyspark - python

I have to encode the column in a big DataFrame in pyspark(spark 2.0). All the values are almost unique(about 1000mln values).
The best choice could be StringIndexer, but at some reason it always fails and kills my spark session.
Can I somehow write a function like that:
id_dict() = dict()
def indexer(x):
id_dict.setdefault(x, len(id_dict))
return id_dict[x]
And map it to DataFrame with id_dict saving the items()? Will this dict will be synced on each executor?
I need all this for preprocessing tuples ('x', 3, 5) for spark.mllib ALS model.
Thank you.

StringIndexer keeps all labels in memory, so if values are almost unique, it just won't scale.
You can take unique values, sort and add id, which is expensive, but more robust in this case:
from pyspark.sql.functions import monotonically_increasing_id
df = spark.createDataFrame(["a", "b", "c", "a", "d"], "string").toDF("value")
indexer = (df.select("value").distinct()
.orderBy("value")
.withColumn("label", monotonically_increasing_id()))
df.join(indexer, ["value"]).show()
# +-----+-----------+
# |value| label|
# +-----+-----------+
# | d|25769803776|
# | c|17179869184|
# | b| 8589934592|
# | a| 0|
# | a| 0|
# +-----+-----------+
Note that labels are not consecutive and can differ from run to run or can change if spark.sql.shuffle.partitions changes. If it is not acceptable you'll have to use RDDs:
from operator import itemgetter
indexer = (df.select("value").distinct()
.rdd.map(itemgetter(0)).zipWithIndex()
.toDF(["value", "label"]))
df.join(indexer, ["value"]).show()
# +-----+-----+
# |value|label|
# +-----+-----+
# | d| 0|
# | c| 1|
# | b| 2|
# | a| 3|
# | a| 3|
# +-----+-----+

Related

How do I count percentage of zeroes for a specific column in pyspark dataframe aggregated by some groupby variables(s)

I have a pyspark dataframe with following columns
source_cd Day Date hour five_min_block five_min_block_volume
Here, the dates are varying from 31st January 2020 to 31st March 2021. There are 'Day' fields accordingly. Also, source_cd has 5 categories, the hours for every unique date vary from 0 to 23 and corresponding five_min_block varies from 1 to 12. And then I have my value column named as five_min_block_volume.
Now there can be any value in this five_min_block_volume field, starting from 0 to any positive definite number. What I want to do is to count the percentage of zeroes for this column, when aggregated by certain groupby variables ('Date' will never be a part of this groupby variable).
So assume that I want to group it by 'Source_cd', 'Day', 'hour' and 'five_min_block' (and maybe perform mean aggregation for the five_min_block_volume column as the output column). Essentially, my new dataframe will now contain source_cd,Day,hour,five_min_block fields, and no date field now.
Lets say, for a particular combination of source_cd,Day,hour,five_min_block, there were 50 entries in my original dataframe. Out of those 50 entries, 20 had five_min_block_volume as 0 value. So I want to display 40% as my 'percentage of zeroes' column as the newly created column, for this combination, in this grouped dataframe. And likewise for all other rows. I want to acheive this using pyspark. How do I go about doing this

May I suggest for quicker response and more clarity, it is useful if you post some code which someone trying to answer your question could easily copy and paste to produce the example you describe? In any case, I have tried to reproduce your example from the description as best I can below.
Note that the solution is only a couple of lines of code at the end. Hopefully it can help.
Reproducing the example
import pandas as pd
import numpy as np
import pyspark.sql.functions as func
# create the dummy data in pandas then convert to pyspark df
pdf = pd.DataFrame(columns=['source_cd', 'Day', 'Date', 'hour', 'five_min_block', 'five_min_block_volume'])
# create the date range by 5 minute blocks
pdf['Date'] = pd.date_range(start='2020-01-31', end='2020-03-31', freq='5min')
n = pdf.shape[0]
# extract hour and day
pdf['Day'] = pdf['Date'].dt.day
pdf['hour'] = pdf['Date'].dt.hour
pdf['date-temp'] = pdf['Date'].dt.date
# generate the 5 min block labels
pdf['five_min_block'] = 1
pdf['five_min_block'] = pdf.groupby(['date-temp', 'hour'])['five_min_block'].cumsum()
pdf.drop('date-temp', axis=1, inplace=True)
# random source column
pdf['source_cd'] = np.random.randint(low=0, high=5, size=n)
# random volumes, and add some extra zeros
pdf['five_min_block_volume'] = np.random.randint(low=0, high=20000, size=n)
pdf['five_min_block_volume'].iloc[np.random.choice(range(n),size=int(0.2*n))] = 0
Convert to spark dataframe, and to the grouping as described in question
sdf = spark.createDataFrame(pdf)
grouping_columns = ['Source_cd', 'Day', 'hour', 'five_min_block']
sdf.groupBy(grouping_columns).agg(
func.mean(func.col('five_min_block_volume')).alias('avg_of_block_volume'),
func.mean((func.col('five_min_block_volume') == 0).cast('float')).alias('percent_blocks_with_0_volume')
).show()
Output:

You could use something like this:
#funcs.pandas_udf('float', funcs.PandasUDFType.GROUPED_AGG)
def percentage_of_zeroes_agg(percentage_of_zeroes_col: funcs.col) -> float:
return percentage_of_zeroes_col.sum() / percentage_of_zeroes_col.count()
# == Example =============================================================================
# Columns to group dataframe by
groupby_columns = ['Source_cd', 'Day', 'hour']
# Aggregation expression, that computes the rate of zeroes for each group.
aggregation = percentage_of_zeroes_agg(df.percentage_of_zeroes).alias('percentage_of_zeroes')
# Perform the groupby operation
grouped_df = df.groupBy(*groupby_columns).agg(aggregation)
Full code
Here's the entire code, including some helper functions I've created to build a sample dataframe, based on the columns descriptions you gave.
# == Necessary Imports ===================================================================
from __future__ import annotations
import string
import pandas as pd
import numpy as np
import pyspark
from pyspark.sql import functions as funcs
from dateutil.relativedelta import relativedelta
# == Define spark session ================================================================
spark = pyspark.sql.SparkSession.builder.getOrCreate()
# == Helper functions to generate sample dataframe =======================================
# You can ignore these functions, as their purpose is only to create a sample dataframe to
# show how to solve your problem
def get_random_source_cd(n: int, num_cats: int = 5) -> list[str]:
source_cd_cats = string.ascii_uppercase[:num_cats]
return list(
map(source_cd_cats.__getitem__, np.random.randint(0, num_cats, n))
)
def get_random_hours(n: int) -> list[int]:
return np.random.randint(0, 23, n).tolist()
def get_random_dates(
n: int,
start_date: str | pd.Timestamp,
end_date: str | pd.Timestamp | None = None,
days: int | None = None,
) -> list[pd.Timestamp]:
start_date = pd.to_datetime(start_date)
if end_date is None:
if days is None:
days = n * 2
end_date = start_date + relativedelta(days=int(days))
else:
end_date = pd.to_datetime(end_date)
possible_dates = pd.date_range(start_date, end_date, freq='d').to_series()
return list(
map(possible_dates.__getitem__, np.random.randint(0, len(possible_dates), n))
)
def get_random_five_min_blocks(n: int) -> list[int]:
return np.random.randint(0, 13, n).tolist()
def generate_random_frame(n: int, **kwargs) -> pd.DataFrame:
dates = get_random_dates(
n, '2022-06-01', end_date=kwargs.get('end_date', None), days=kwargs.get('days', None)
)
days = list(map(lambda date: date.day, dates))
return spark.createDataFrame(
pd.DataFrame(
{
'source_cd': get_random_source_cd(n),
'Day': days,
'Date': dates,
'hour': get_random_hours(n),
'five_min_block': get_random_five_min_blocks(n),
'five_min_block_volume': np.random.random(n),
}
)
).withColumn(
'percentage_of_zeroes',
funcs.when(funcs.col('five_min_block') == 0, 1).otherwise(0)
)
# == User defined function used during aggregation =======================================
#funcs.pandas_udf('float', funcs.PandasUDFType.GROUPED_AGG)
def percentage_of_zeroes_agg(percentage_of_zeroes_col: funcs.col) -> float:
"""Pandas user defined function to compute the percentage of zeroes during aggregation.
Parameters
----------
percentage_of_zeroes_col : funcs.col
The `percentage_of_zeroes_col` column, as `pyspark.sql.column.Column`.
You can specify this parameter like so:
.. code-block:: python
groupby_columns = ['Source_cd', 'Day']
aggregation = percentage_of_zeroes_agg(df.percentage_of_zeroes).alias('percentage_of_zeroes')
grouped_df = df.groupBy(*groupby_columns).agg(aggregation)
In the above example, the aggregation variable shows how you can
use this function.
Returns
-------
float
The rate of values with column `percentage_of_zeroes` equal to 1.
Notes
-----
The `percentage_of_zeroes` column contains the value 1, when the column
`five_min_block` equals zero, and 0 otherwise. Therefore, when you sum all values,
you get the total count of rows from a given group that equal 0. The `count`
returns the number of observations (rows) from each group.
Dividing the sum by count, you get the ratio of zeroes on a given group.
"""
return percentage_of_zeroes_col.sum() / percentage_of_zeroes_col.count()
# == Example =============================================================================
# Generate a randomized Spark Dataframe, based on your columns specifications
df = generate_random_frame(50_000, end_date='2023-12-31')
# Columns to group dataframe by
groupby_columns = ['Source_cd', 'Day', 'hour']
# Aggregation expression, that computes the rate of zeroes for each group.
# NOTE: edit the `.alias` parameter, to change the name of the column that stores
# the aggregation results.
aggregation = percentage_of_zeroes_agg(df.percentage_of_zeroes).alias('percentage_of_zeroes')
# Perform the groupby operation
grouped_df = (
df
.groupBy(*groupby_columns)
.agg(aggregation)
# OPTIONAL: uncomment the next line, to sort the grouped dataframe
# by a set of columns (statement has a heavy impact on performance)
# .orderBy('count_of_zeroes', ascending=False)
)
# OPTIONAL: create column `pretty_percentage_of_zeroes` to store results from aggregation
# in percentage format.
grouped_df = grouped_df.withColumn(
'pretty_percentage_of_zeroes',
funcs.concat(
(funcs.format_number(grouped_df.percentage_of_zeroes * 100, 2)).cast('string'),
funcs.lit('%')
)
)
grouped_df.show()
# +---------+---+----+--------------------+---------------+---------------------------+
# |Source_cd|Day|hour|percentage_of_zeroes|count_of_zeroes|pretty_percentage_of_zeroes|
# +---------+---+----+--------------------+---------------+---------------------------+
# | A| 1| 0| 0.07692308| 1| 7.69%|
# | A| 1| 1| 0.11764706| 2| 11.76%|
# | A| 1| 2| 0.083333336| 1| 8.33%|
# | A| 1| 3| 0.0| 0| 0.00%|
# | A| 1| 4| 0.13333334| 2| 13.33%|
# | A| 1| 5| 0.0| 0| 0.00%|
# | A| 1| 6| 0.2| 2| 20.00%|
# | A| 1| 7| 0.0| 0| 0.00%|
# | A| 1| 8| 0.1764706| 3| 17.65%|
# | A| 1| 9| 0.10526316| 2| 10.53%|
# | A| 1| 10| 0.0| 0| 0.00%|
# | A| 1| 11| 0.0| 0| 0.00%|
# | A| 1| 12| 0.125| 2| 12.50%|
# | A| 1| 13| 0.05882353| 1| 5.88%|
# | A| 1| 14| 0.055555556| 1| 5.56%|
# | A| 1| 15| 0.0625| 1| 6.25%|
# | A| 1| 16| 0.083333336| 1| 8.33%|
# | A| 1| 17| 0.071428575| 1| 7.14%|
# | A| 1| 18| 0.11111111| 1| 11.11%|
# | A| 1| 19| 0.06666667| 1| 6.67%|
# +---------+---+----+--------------------+---------------+---------------------------+

Reversing Group By in PySpark

I am not sure about the correctness of the question itself. The solutions I've found for SQL do not work at Hive SQL or recursion is prohibited.
Thus, I'd like to solve the problem in Pyspark and need a solution or at least ideas, how to tackle the problem.
I have an original table which looks like this:
+--------+----------+
|customer|nr_tickets|
+--------+----------+
| A| 3|
| B| 1|
| C| 2|
+--------+----------+
This is how I want the table:
+--------+
|customer|
+--------+
| A|
| A|
| A|
| B|
| C|
| C|
+--------+
Do you have any suggestions?
Thank you very much in advance!

For Spark2.4+, use array_repeat with explode.
from pyspark.sql import functions as F
df.selectExpr("""explode(array_repeat(customer,cast(nr_tickets as int))) as customer""").show()
#+--------+
#|customer|
#+--------+
#| A|
#| A|
#| A|
#| B|
#| C|
#| C|
#+--------+

You can make a new dataframe by iterating over rows(groups).
1st make list of Rows havingcustomer (Row(customer=a["customer"])) repeated nr_tickets times for that customer using range(int(a["nr_tickets"]))
df_list + [Row(customer=a["customer"]) for T in range(int(a["nr_tickets"]))]
you can store and append these in a list and later make a dataframe with it.
df= spark.createDataFrame(df_list)
Overall,
from pyspark.sql import Row
df_list = []
for a in df.select(["customer","nr_tickets"]).collect():
df_list = df_list + [Row(customer=a["customer"]) for T in range(int(a["nr_tickets"]))]
df= spark.createDataFrame(df_list)
df.show()
you can also do it with list comprehension as
from pyspark.sql import Row
from functools import reduce #python 3
df_list = [
[Row(customer=a["customer"])]*int(a["nr_tickets"])
for a in df.select(["customer","nr_tickets"]).collect()
]
df= spark.createDataFrame(reduce(lambda x,y: x+y,df_list))
df.show()
Produces
+--------+
|customer|
+--------+
| A|
| A|
| A|
| B|
| C|
| C|
+--------+

in the meanwhile I have also found a solution by myself:
for i in range(1, max_nr_of_tickets):
table = table.filter(F.col('nr_tickets') >= 1).union(test)
table = table.withColumn('nr_tickets', F.col('nr_tickets') - 1)
Explanation: The DFs "table" and "test" are the same at the beginning.
So "max_nr_of_tickets" is just the highest "nr_tickets". It works.
I am only struggling with the format of the max number:
max_nr_of_tickets = df.select(F.max('nr_tickets')).collect()
I cannot use the result in the for loop's range as it is a list. So I manually enter the highest number.
Any ideas how I could get the max_nr_of_tickets into the right format so the loops range will accept it?
Thanks

Pyspark: how to split a dataframe into chunks and save them?

I need to split a pyspark dataframe df and save the different chunks.
This is what I am doing: I define a column id_tmp and I split the dataframe based on that.
chunk = 10000
id1 = 0
id2 = chunk
df = df.withColumn('id_tmp', row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)
c = df.count()
while id1 < c:
stop_df = df.filter( (tmp.id_tmp < id2) & (tmp.id_tmp >= id1))
stop_df.write.format('com.databricks.spark.csv').save('myFolder/')
id1+=chunk
id2+=chunk
Is there a more efficient way without defining the column id_tmp

I suggest you to use the partitionBy method from the DataFrameWriter interface built-in Spark (docs). Here is an example.
Given the df DataFrame, the chuck identifier needs to be one or more columns. In my example id_tmp. The following snippet generates a DF with 12 records with 4 chunk ids.
import pyspark.sql.functions as F
df = spark.range(0, 12).withColumn("id_tmp", F.col("id") % 4).orderBy("id_tmp")
df.show()
Returns:
+---+------+
| id|id_tmp|
+---+------+
| 8| 0|
| 0| 0|
| 4| 0|
| 1| 1|
| 9| 1|
| 5| 1|
| 6| 2|
| 2| 2|
| 10| 2|
| 3| 3|
| 11| 3|
| 7| 3|
+---+------+
To save each chunk indepedently you need:
(df
.repartition("id_tmp")
.write
.partitionBy("id_tmp")
.mode("overwrite")
.format("csv")
.save("output_folder"))
repartition will shuffle the records so that each node has a complete set of records for one "id_tmp" value. Then each chunk is written to one file with the partitionBy.
Resulting folder structure:
output_folder/
output_folder/._SUCCESS.crc
output_folder/id_tmp=0
output_folder/id_tmp=0/.part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv.crc
output_folder/id_tmp=0/part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv
output_folder/id_tmp=1
output_folder/id_tmp=1/.part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv.crc
output_folder/id_tmp=1/part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv
output_folder/id_tmp=2
output_folder/id_tmp=2/.part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv.crc
output_folder/id_tmp=2/part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv
output_folder/id_tmp=3
output_folder/id_tmp=3/.part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv.crc
output_folder/id_tmp=3/part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv
output_folder/_SUCCESS
The size and number of partitions are quite important for Spark's performance. Don't partition the dataset too much and have reasonable file sizes (like 1GB per file) especially if you are using cloud storage services. It is also advised to use the partition variables if you want to filter the data when loading (i.e.: year=YYYY/month=MM/day=DD)

How to retrieve all columns using pyspark collect_list functions

I have a pyspark 2.0.1. I'm trying to groupby my data frame & retrieve the value for all the fields from my data frame. I found that
z=data1.groupby('country').agg(F.collect_list('names'))
will give me values for country & names attribute & for names attribute it will give column header as collect_list(names). But for my job I have dataframe with around 15 columns & I will run a loop & will change the groupby field each time inside loop & need the output for all of the remaining fields.Can you please suggest me how to do it using collect_list() or any other pyspark functions?
I tried this code too
from pyspark.sql import functions as F
fieldnames=data1.schema.names
names1= list()
for item in names:
if item != 'names':
names1.append(item)
z=data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got error message
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist

Use struct to combine the columns before calling groupBy
suppose you have a dataframe
df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")
df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
| a|newcol|
+---+------+
| 0| [1,2]|
| 0| [4,5]|
| 1| [7,8]|
| 1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
| a| collected_col|
+---+--------------+
| 0|[[1,2], [4,5]]|
| 1|[[7,8], [8,7]]|
+---+--------------+
Aggregation operation can be done only on single columns.
After aggregation, You can collect the result and iterate over it to separate the combined columns generate the index dict. or you can write a
udf to separate the combined columns.
from pyspark.sql.types import *
def foo(x):
x1 = [y[0] for y in x]
x2 = [y[1] for y in x]
return(x1,x2)
st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)
df = df.withColumn("ncol",
udf_foo("collected_col")).select("a",
col("ncol").getItem("b").alias("b"),
col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
| a| b| c|
+---+------+------+
| 0|[1, 4]|[2, 5]|
| 1|[7, 8]|[8, 7]|
+---+------+------+

Actually we can do it in pyspark 2.2 .
First we need create a constant column ("Temp"), groupBy with that column ("Temp") and apply agg by pass iterable *exprs in which expression of collect_list exits.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools
def groupColumnData(df, columns):
df = df.withColumn("Temp", ftions.lit(1))
exprs = [ftions.collect_list(colName) for colName in columns]
df = df.groupby('Temp').agg(*exprs)
df = df.drop("Temp")
df = df.toDF(*columns)
return df
Input Data:
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
Output Data:
df.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+

in spark 2.4.4 and python 3.7 (I guess its also relevant for previous spark and python version) --
My suggestion is a based on pauli's answer,
instead of creating the struct and then using the agg function, create the struct inside collect_list:
df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()
result :
+---+-----------------+
| a|res |
+---+-----------------+
| 0|[[1, 2], [4, 5]] |
| 1|[[7, 8], [8, 7]] |
+---+-----------------+

I just use Concat_ws function it's perfectly fine.
> from pyspark.sql.functions import * df =
> spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
> df.groupBy('a').agg(collect_list(concat_ws(',','b','c'))).alias('r').show()

Transpose column to row with Spark

I'm trying to transpose some columns of my table to row.
I'm using Python and Spark 1.5.0. Here is my initial table:
+-----+-----+-----+-------+
| A |col_1|col_2|col_...|
+-----+-------------------+
| 1 | 0.0| 0.6| ... |
| 2 | 0.6| 0.7| ... |
| 3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
I would like to have somthing like this:
+-----+--------+-----------+
| A | col_id | col_value |
+-----+--------+-----------+
| 1 | col_1| 0.0|
| 1 | col_2| 0.6|
| ...| ...| ...|
| 2 | col_1| 0.6|
| 2 | col_2| 0.7|
| ...| ...| ...|
| 3 | col_1| 0.5|
| 3 | col_2| 0.9|
| ...| ...| ...|
Does someone know haw I can do it? Thank you for your help.

Spark >= 3.4
You can use built-in melt method. With Python:
df.melt(
ids=["A"], values=["col_1", "col_2"],
variableColumnName="key", valueColumnName="val"
)
with Scala
df.melt(Array($"A"), Array($"col_1", $"col_2"), "key", "val")
Spark < 3.4
It is relatively simple to do with basic Spark SQL functions.
Python
from pyspark.sql.functions import array, col, explode, struct, lit
df = sc.parallelize([(1, 0.0, 0.6), (1, 0.6, 0.7)]).toDF(["A", "col_1", "col_2"])
def to_long(df, by):
# Filter dtypes and split into column names and type description
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
# Spark SQL supports only homogeneous columns
assert len(set(dtypes)) == 1, "All columns have to be of the same type"
# Create and explode an array of (column_name, column_value) structs
kvs = explode(array([
struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
])).alias("kvs")
return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
to_long(df, ["A"])
Scala:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
val df = Seq((1, 0.0, 0.6), (1, 0.6, 0.7)).toDF("A", "col_1", "col_2")
def toLong(df: DataFrame, by: Seq[String]): DataFrame = {
val (cols, types) = df.dtypes.filter{ case (c, _) => !by.contains(c)}.unzip
require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")
val kvs = explode(array(
cols.map(c => struct(lit(c).alias("key"), col(c).alias("val"))): _*
))
val byExprs = by.map(col(_))
df
.select(byExprs :+ kvs.alias("_kvs"): _*)
.select(byExprs ++ Seq($"_kvs.key", $"_kvs.val"): _*)
}
toLong(df, Seq("A"))

One way to solve with pyspark sql using functions create_map and explode.
from pyspark.sql import functions as func
#Use `create_map` to create the map of columns with constant
df = df.withColumn('mapCol', \
func.create_map(func.lit('col_1'),df.col_1,
func.lit('col_2'),df.col_2,
func.lit('col_3'),df.col_3
)
)
#Use explode function to explode the map
res = df.select('*',func.explode(df.mapCol).alias('col_id','col_value'))
res.show()

The Spark local linear algebra libraries are presently very weak: and they do not include basic operations as the above.
There is a JIRA for fixing this for Spark 2.1 - but that will not help you today.
Something to consider: performing a transpose will likely require completely shuffling the data.
For now you will need to write RDD code directly. I have written transpose in scala - but not in python. Here is the scala version:
def transpose(mat: DMatrix) = {
val nCols = mat(0).length
val matT = mat
.flatten
.zipWithIndex
.groupBy {
_._2 % nCols
}
.toSeq.sortBy {
_._1
}
.map(_._2)
.map(_.map(_._1))
.toArray
matT
}
So you can convert that to python for your use. I do not have bandwidth to write/test that at this particular moment: let me know if you were unable to do that conversion.
At the least - the following are readily converted to python.
zipWithIndex --> enumerate() (python equivalent - credit to #zero323)
map --> [someOperation(x) for x in ..]
groupBy --> itertools.groupBy()
Here is the implementation for flatten which does not have a python equivalent:
def flatten(L):
for item in L:
try:
for i in flatten(item):
yield i
except TypeError:
yield item
So you should be able to put those together for a solution.

You could use the stack function:
for example:
df.selectExpr("stack(2, 'col_1', col_1, 'col_2', col_2) as (key, value)")
where:
2 is the number of columns to stack (col_1 and col_2)
'col_1' is a string for the key
col_1 is the column from which to take the values
if you have several columns, you could build the whole stack string iterating the column names and pass that to selectExpr

Use flatmap. Something like below should work
from pyspark.sql import Row
def rowExpander(row):
rowDict = row.asDict()
valA = rowDict.pop('A')
for k in rowDict:
yield Row(**{'A': valA , 'colID': k, 'colValue': row[k]})
newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander))

I took the Scala answer that #javadba wrote and created a Python version for transposing all columns in a DataFrame. This might be a bit different from what OP was asking...
from itertools import chain
from pyspark.sql import DataFrame
def _sort_transpose_tuple(tup):
x, y = tup
return x, tuple(zip(*sorted(y, key=lambda v_k: v_k[1], reverse=False)))[0]
def transpose(X):
"""Transpose a PySpark DataFrame.
Parameters
----------
X : PySpark ``DataFrame``
The ``DataFrame`` that should be tranposed.
"""
# validate
if not isinstance(X, DataFrame):
raise TypeError('X should be a DataFrame, not a %s'
% type(X))
cols = X.columns
n_features = len(cols)
# Sorry for this unreadability...
return X.rdd.flatMap( # make into an RDD
lambda xs: chain(xs)).zipWithIndex().groupBy( # zip index
lambda val_idx: val_idx[1] % n_features).sortBy( # group by index % n_features as key
lambda grp_res: grp_res[0]).map( # sort by index % n_features key
lambda grp_res: _sort_transpose_tuple(grp_res)).map( # maintain order
lambda key_col: key_col[1]).toDF() # return to DF
For example:
>>> X = sc.parallelize([(1,2,3), (4,5,6), (7,8,9)]).toDF()
>>> X.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
>>> transpose(X).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 4| 7|
| 2| 5| 8|
| 3| 6| 9|
+---+---+---+

A very handy way to implement:
from pyspark.sql import Row
def rowExpander(row):
rowDict = row.asDict()
valA = rowDict.pop('A')
for k in rowDict:
yield Row(**{'A': valA , 'colID' : k, 'colValue' : row[k]})
newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander)

To transpose Dataframe in pySpark, I use pivot over the temporary created column, which I drop at the end of the operation.
Say, we have a table like this. What we wanna do is to find all users over each listed_days_bin value.
+------------------+-------------+
| listed_days_bin | users_count |
+------------------+-------------+
|1 | 5|
|0 | 2|
|0 | 1|
|1 | 3|
|1 | 4|
|2 | 5|
|2 | 7|
|2 | 2|
|1 | 1|
+------------------+-------------+
Create new temp column - 'pvt_value', aggregate over it and pivot results
import pyspark.sql.functions as F
agg_df = df.withColumn('pvt_value', lit(1))\
.groupby('pvt_value')\
.pivot('listed_days_bin')\
.agg(F.sum('users_count')).drop('pvt_value')
New Dataframe should look like:
+----+---+---+
| 0 | 1 | 2 | # Columns
+----+---+---+
| 3| 13| 14| # Users over the bin
+----+---+---+

I found PySpark to be too complicated to transpose so I just convert my dataframe to Pandas and use the transpose() method and convert the dataframe back to PySpark if required.
dfOutput = spark.createDataFrame(dfPySpark.toPandas().transpose())
dfOutput.display()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Encode a column with integer in pyspark - python

Related

How do I count percentage of zeroes for a specific column in pyspark dataframe aggregated by some groupby variables(s)

Reversing Group By in PySpark

Pyspark: how to split a dataframe into chunks and save them?

How to retrieve all columns using pyspark collect_list functions

Transpose column to row with Spark

Categories

Resources