This question already has answers here:
Spark DataFrame: count distinct values of every column (6 answers)
Show distinct column values in pyspark dataframe (12 answers)
Closed 1 year ago.
How is it possible to calculate the number of unique elements in each column of a pyspark dataframe:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame([[1, 100], [1, 200], [2, 300], [3, 100], [4, 100], [4, 300]], columns=['col1', 'col2'])
df_spark = spark.createDataFrame(df)
df_spark.show()
# +----+----+
# |col1|col2|
# +----+----+
# | 1| 100|
# | 1| 200|
# | 2| 300|
# | 3| 100|
# | 4| 100|
# | 4| 300|
# +----+----+
# Some transformations on df_spark here
# How to get a number of unique elements (just a number) in each columns?
The only solution I know is the following, which is very slow; both of these lines take about the same amount of time to compute:
col1_num_unique = df_spark.select('col1').distinct().count()
col2_num_unique = df_spark.select('col2').distinct().count()
There are about 10 million rows in df_spark.
Try this:
from pyspark.sql.functions import col, countDistinct
df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns))
EDIT:
As @pault suggested, countDistinct() is an expensive operation, so you can use approx_count_distinct() instead; the function he originally suggested is deprecated as of Spark 2.1.
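For reference, here is a minimal sketch of the approximate variant (assuming the same df_spark as in the question); approx_count_distinct() trades a small error margin for much better speed:
from pyspark.sql.functions import approx_count_distinct

# Approximate distinct counts for every column in a single pass over the data.
df_spark.agg(*(approx_count_distinct(c).alias(c) for c in df_spark.columns)).show()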
@Manrique's answer solved the problem, but only a slightly modified version worked for me:
from pyspark.sql.functions import countDistinct

expression = [countDistinct(c).alias(c) for c in df.columns]
df.select(*expression).show()
This is much faster:
from pyspark.sql import functions as F

df_spark.select(F.countDistinct("col1")).show()
How can I find the combination of columns in a data set (pyspark) that can be considered a primary key?
I tried generating all combinations of the columns and comparing the number of distinct records of each subset with the total number of records, but it is very expensive.
from itertools import combinations
l_key = []
for i in range(len(df.columns)+1):
    print(f'Iter:{i+2}..{len(df.columns)+1}')
    for c in list(combinations(df.columns, i+2)):
        if df.select(*c).distinct().count() == df.count():
            l_key.append(c)
            print(f'Key:{c}')
Are there any functions or libraries that can generate this type of analysis?
You can try it by creating the combinations as columns and then grouping.
import pyspark.sql.functions as F
import pandas as pd
from itertools import combinations
pdf = pd.DataFrame([[1, 1, 1], [2, 1, 1], [3, 2, 1]], columns=['x', 'y', 'z'])
df = spark.createDataFrame(pdf)
initCols = df.columns
for i in range(len(initCols)+1):
    for c in list(combinations(initCols, i+2)):
        df = df.withColumn(','.join(c), F.concat_ws(',', *c))

finalCols = df.columns
exprs = [F.size(F.collect_set(x)).alias(x) for x in finalCols]

df = df \
    .withColumn("aggCol", F.lit("a")) \
    .groupBy("aggCol") \
    .agg(*exprs)
df.show()
Output:
+------+---+---+---+---+---+---+-----+
|aggCol| x| y| z|x,y|x,z|y,z|x,y,z|
+------+---+---+---+---+---+---+-----+
| a| 3| 2| 1| 3| 3| 2| 3|
+------+---+---+---+---+---+---+-----+
I believe this should be less expensive. I tested it quickly on a small dataframe (~20k rows, 7 cols) and it didn't take too much time. Let me know how this works out for your dataset.
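As a small follow-up (my own sketch, not part of the original answer): the candidate keys can be read straight off that aggregated row by comparing each distinct count with the total number of rows.
# Assumes `total = df.count()` was captured before the withColumn/groupBy steps above.
row = df.collect()[0].asDict()
candidate_keys = [name for name, n_distinct in row.items()
                  if name != "aggCol" and n_distinct == total]
print(candidate_keys)  # for the sample data: ['x', 'x,y', 'x,z', 'x,y,z']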
I am trying to increase all values in a dataframe by 1, except for one column, which is the ID column.
This is what I have so far, but it gets long when there are many columns to handle (e.g. 50):
df_add = df.select(
    'Id',
    (df['col_a'] + 1).alias('col_a'),
    ..
    ..
)
Is there a more pythonic way of achieving the same results?
EDIT (based on @Daniel's comment):
You can use the lit function directly:
from pyspark.sql.functions import col, lit

for column in plus_one_cols:
    df = df.withColumn(column, col(column) + lit(1))
PREVIOUS ANSWER:
Adding 1 to the columns is a columnar operation, which is better suited to a pandas_udf:
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

plus_one_cols = [x for x in df.columns if x != "Id"]
for column in plus_one_cols:
    df = df.withColumn(column, plus_one(col(column)))
This will work much faster than the row-wise operations. You can also refer to Introducing Pandas UDF for PySpark - Databricks
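As an aside (my addition, not in the original answer): on Spark 3.x the same UDF can be written with Python type hints, since the PandasUDFType style above is deprecated there:
# Type-hinted pandas UDF (Spark 3.x style); behaves the same as plus_one above.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('double')
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1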
If there are a lot of columns, you can use the one-liner below:
from pyspark.sql.functions import lit,col
df.select('Id', *[(col(i) + lit(1)) for i in df.columns if i != 'Id']).toDF(*df.columns).show()
Output:
+---+-----+-----+-----+
| Id|col_a|col_b|col_c|
+---+-----+-----+-----+
| 1| 4| 21| 6|
| 5| 6| 1| 1|
| 6| 10| 2| 1|
+---+-----+-----+-----+
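One caveat (my note, not from the original answer): the trailing toDF(*df.columns) renames columns by position, so it relies on 'Id' being the first column. Aliasing each expression avoids that:
# Alias each incremented column instead of renaming by position (sketch).
df.select('Id', *[(col(i) + lit(1)).alias(i) for i in df.columns if i != 'Id']).show()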
You can use the withColumn method and iterate over the columns as follows:
from pyspark.sql.functions import expr

df_add = df
for column in ["col_a", "col_b", "col_c"]:
    df_add = df_add.withColumn(column, expr(f"{column} + 1").cast("integer"))
Use pyspark.sql.functions.lit to add values to columns
Ex:
from pyspark.sql import functions as psf
df = spark.sql("""select 1 as test""")
df.show()
# +----+
# |test|
# +----+
# | 1|
# +----+
df_add = df.select(
    'test',
    (df['test'] + psf.lit(1)).alias('col_a'),
)
df_add.show()
# +----+-----+
# |test|col_a|
# +----+-----+
# | 1| 2|
# +----+-----+
###
# If you want to do it for all columns then:
###
list_of_columns = ["col1", "col2", ...]
df_add = df.select(
    [(df[col] + psf.lit(1)).alias(col) for col in list_of_columns]
)
df_add.show()
This question already has an answer here:
Pyspark filter dataframe by columns of another dataframe (1 answer)
Closed 4 years ago.
I am looking for a way to find the difference of two DataFrames based on one column. For example:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)
df_a = sql_context.createDataFrame([("fa", 3), ("fb", 5), ("fc", 7)], ["first name", "id"])
df_b = sql_context.createDataFrame([("la", 3), ("lb", 10), ("lc", 13)], ["last name", "id"])
DataFrame A:
+----------+---+
|first name| id|
+----------+---+
| fa| 3|
| fb| 5|
| fc| 7|
+----------+---+
DataFrame B:
+---------+---+
|last name| id|
+---------+---+
| la| 3|
| lb| 10|
| lc| 13|
+---------+---+
My goal is to find the difference of DataFrame A and DataFrame B based on the id column; the output would be the following DataFrame:
+---------+---+
|last name| id|
+---------+---+
| lb| 10|
| lc| 13|
+---------+---+
I don't want to use the following method:
a_ids = set(df_a.rdd.map(lambda r: r.id).collect())
df_c = df_b.filter(~col('id').isin(a_ids))
I'm looking for an efficient method (in terms of memory and speed) where I don't have to collect the ids (there can be billions of them), something like the RDD subtractByKey but for DataFrames.
PS: I can map df_a to an RDD, but I don't want to map df_b to an RDD.
You can do a left_anti join on column id:
df_b.join(df_a.select('id'), how='left_anti', on=['id']).show()
+---+---------+
| id|last name|
+---+---------+
| 10| lb|
| 13| lc|
+---+---------+
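As a side note (a hedged sketch, not from the original answer), the same anti-join can be written in Spark SQL once both DataFrames are registered as temporary views:
# Equivalent anti-join in Spark SQL; assumes the df_a / df_b and sql_context from the question.
df_a.createOrReplaceTempView("a")
df_b.createOrReplaceTempView("b")
sql_context.sql("SELECT * FROM b LEFT ANTI JOIN a ON a.id = b.id").show()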
This question already has answers here:
How to melt Spark DataFrame? (6 answers)
Unpivot in Spark SQL / PySpark (2 answers)
Dataframe transpose with pyspark in Apache Spark (2 answers)
Closed 4 years ago.
I am working on Databricks using Python 2.
I have a PySpark dataframe like:
|Germany|USA|UAE|Turkey|Canada|...
|      5|  3|  3|    42|    12|...
As you can see, it consists of hundreds of columns and only a single row.
I want to flip it so that I get:
Name | Views
--------------
Germany| 5
USA | 3
UAE | 3
Turkey | 42
Canada | 12
How would I approach this?
Edit: I have hundreds of columns, so I can't write them out by hand. I don't know most of them; they just exist there. I can't use the column names in this process.
Edit 2: Example code:
dicttest = {'Germany': 5, 'USA': 20, 'Turkey': 15}
rdd=sc.parallelize([dicttest]).toDF()
df = rdd.toPandas().transpose()
This answer might be a bit 'overkill' but it does not use Pandas or collect anything to the driver. It will also work when you have multiple rows. We can just pass an empty list to the melt function from "How to melt Spark DataFrame?"
A working example would be as follows:
import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SQLContext, Column
import pandas as pd
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql import DataFrame
from typing import Iterable
try:
    sc
except NameError:
    sc = ps.SparkContext()

sqlContext = SQLContext(sc)

# From https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe
def melt(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""
    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

# Sample data
df1 = sqlContext.createDataFrame(
    [(0, 1, 2, 3, 4)],
    ("col1", "col2", "col3", "col4", "col5"))
df1.show()

df2 = melt(df1, id_vars=[], value_vars=df1.columns)
df2.show()
Output:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| 0| 1| 2| 3| 4|
+----+----+----+----+----+
+--------+-----+
|variable|value|
+--------+-----+
| col1| 0|
| col2| 1|
| col3| 2|
| col4| 3|
| col5| 4|
+--------+-----+
Hope this helps.
You can convert the pyspark dataframe to a pandas dataframe and use the transpose function:
%pyspark
import numpy as np
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit
dt1 = [[1,2,4,5,6,7]]
dt = sc.parallelize(dt1).toDF()
dt.show()
dt.toPandas().transpose()
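As a small follow-up (my own sketch, not from the original answer), the transposed pandas frame can be given readable column names and brought back to Spark; this assumes a SparkSession named spark is available:
# Rename the transposed frame and convert it back to a Spark DataFrame.
pdf_t = dt.toPandas().transpose().reset_index()
pdf_t.columns = ["Name", "Views"]
spark.createDataFrame(pdf_t).show()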
Other solution
dt2 = [{"1": 1, "2": 2, "4": 4, "5": 5, "6": 29, "7": 8}]
df = sc.parallelize(dt2).toDF()
df.show()
a = [{"name": i, "value": df.select(i).collect()[0][0]} for i in df.columns]
df1 = sc.parallelize(a).toDF()
df1.show()
I have pyspark 2.0.1. I'm trying to group my data frame and retrieve the values for all the fields. I found that
z=data1.groupby('country').agg(F.collect_list('names'))
will give me the values for the country and names attributes, with the names column labelled collect_list(names). But my dataframe has around 15 columns, and I will run a loop that changes the groupby field each time and needs the output for all of the remaining fields. Can you please suggest how to do this using collect_list() or any other pyspark function?
I tried this code too:
from pyspark.sql import functions as F

fieldnames = data1.schema.names
names1 = list()
for item in fieldnames:
    if item != 'names':
        names1.append(item)
z = data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got error message
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
Use struct to combine the columns before calling groupBy
Suppose you have a dataframe:
import pyspark.sql.functions as f

df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")
df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
| a|newcol|
+---+------+
| 0| [1,2]|
| 0| [4,5]|
| 1| [7,8]|
| 1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
| a| collected_col|
+---+--------------+
| 0|[[1,2], [4,5]]|
| 1|[[7,8], [8,7]]|
+---+--------------+
The aggregation can only be done on single columns. After the aggregation, you can collect the result and iterate over it to separate the combined columns, or you can write a udf to separate them:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import *

def foo(x):
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)

st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)

df = df.withColumn("ncol", udf_foo("collected_col")).select(
    "a",
    col("ncol").getItem("b").alias("b"),
    col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
| a| b| c|
+---+------+------+
| 0|[1, 4]|[2, 5]|
| 1|[7, 8]|[8, 7]|
+---+------+------+
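A hedged alternative (my addition, not in the original answer): because Spark can extract a struct field across an array of structs, the same separation can usually be done without a udf, applied to the original df with columns a, b, c (before the struct/groupBy steps above):
import pyspark.sql.functions as f

# Collect structs, then pull each field out of the array of structs (no udf needed).
df.groupBy("a") \
  .agg(f.collect_list(f.struct("b", "c")).alias("collected_col")) \
  .select("a",
          f.col("collected_col.b").alias("b"),
          f.col("collected_col.c").alias("c")) \
  .show()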
Actually, we can do it in pyspark 2.2.
First we need to create a constant column ("Temp"), group by that column, and apply agg with an iterable *exprs of collect_list expressions.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools
def groupColumnData(df, columns):
    df = df.withColumn("Temp", ftions.lit(1))
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    df = df.drop("Temp")
    df = df.toDF(*columns)
    return df
Input Data:
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
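The output below presumably comes from applying the helper to all three columns, along these lines (the exact call is not shown in the original answer):
# Assumed invocation: collect every column of the sample DataFrame.
df = groupColumnData(df, ["a", "b", "c"])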
Output Data:
df.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
In Spark 2.4.4 and Python 3.7 (I guess it is also relevant for earlier Spark and Python versions):
My suggestion is based on pauli's answer; instead of creating the struct and then using the agg function, create the struct inside collect_list:
from pyspark.sql.functions import collect_list, struct

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()
Result:
+---+-----------------+
| a|res |
+---+-----------------+
| 0|[[1, 2], [4, 5]] |
| 1|[[7, 8], [8, 7]] |
+---+-----------------+
I just use the concat_ws function and it works perfectly fine.
from pyspark.sql.functions import *

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy('a').agg(collect_list(concat_ws(',','b','c')).alias('r')).show()