This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Unpivot in Spark SQL / PySpark
(2 answers)
Dataframe transpose with pyspark in Apache Spark
(2 answers)
Closed 4 years ago.
I am working on Databricks using Python 2.
I have a PySpark dataframe like:
|Germany|USA|UAE|Turkey|Canada|...
|5      |3  |3  |42    |12    |...
As you can see, it consists of hundreds of columns and only a single row.
I want to flip it in a way such that I get:
Name | Views
--------------
Germany| 5
USA | 3
UAE | 3
Turkey | 42
Canada | 12
How would I approach this?
Edit: I have hundreds of columns, so I can't write them out. I don't know most of their names, but they exist. I can't hardcode the column names in this process.
Edit 2: Example code:
dicttest = {'Germany': 5, 'USA': 20, 'Turkey': 15}
rdd=sc.parallelize([dicttest]).toDF()
df = rdd.toPandas().transpose()
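If you go the pandas route, one way to get back to a Spark DataFrame with the desired Name/Views headers could look like this. It is only a minimal sketch continuing the snippet above; it assumes the spark session that Databricks provides and a single-row input:
# Transpose in pandas, then turn the index (the old column names) into a column
pdf = rdd.toPandas().transpose().reset_index()
pdf.columns = ["Name", "Views"]
# Back to a Spark DataFrame
df = spark.createDataFrame(pdf)
df.show()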
This answer might be a bit of an overkill, but it does not use pandas or collect anything to the driver. It also works when you have multiple rows. We can simply pass an empty list of id_vars to the melt function from "How to melt Spark DataFrame?".
A working example would be as follows:
import findspark
findspark.init()

import pyspark as ps
import pandas as pd
from pyspark.sql import SQLContext, Column, DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct
from typing import Iterable

try:
    sc
except NameError:
    sc = ps.SparkContext()

sqlContext = SQLContext(sc)

# From https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe
def melt(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""
    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))

    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))

    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

# Sample data
df1 = sqlContext.createDataFrame(
    [(0, 1, 2, 3, 4)],
    ("col1", "col2", "col3", "col4", "col5"))
df1.show()

df2 = melt(df1, id_vars=[], value_vars=df1.columns)
df2.show()
Output:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| 0| 1| 2| 3| 4|
+----+----+----+----+----+
+--------+-----+
|variable|value|
+--------+-----+
| col1| 0|
| col2| 1|
| col3| 2|
| col4| 3|
| col5| 4|
+--------+-----+
Hope this helps.
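For reference, the same unpivot can also be expressed with Spark SQL's built-in stack expression, without UDFs or pandas. This is just a sketch under the assumption that all columns share a compatible type (all numeric in the sample df1 above):
# Build "stack(n, 'col1', col1, 'col2', col2, ...)" over every column
stack_expr = "stack({n}, {pairs}) as (variable, value)".format(
    n=len(df1.columns),
    pairs=", ".join("'{c}', `{c}`".format(c=c) for c in df1.columns))

df1.selectExpr(stack_expr).show()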
You can convert the PySpark dataframe to a pandas dataframe and use its transpose function:
%pyspark
import numpy as np
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit
dt1 = [[1,2,4,5,6,7]]
dt = sc.parallelize(dt1).toDF()
dt.show()
dt.toPandas().transpose()
Output:
+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6|
+---+---+---+---+---+---+
|  1|  2|  4|  5|  6|  7|
+---+---+---+---+---+---+
    0
_1  1
_2  2
_3  4
_4  5
_5  6
_6  7
Another solution:
dt2 = [{"1": 1, "2": 2, "4": 4, "5": 5, "6": 29, "7": 8}]
df = sc.parallelize(dt2).toDF()
df.show()

# Collect each column's single value on the driver (one job per column)
a = [{"name": i, "value": df.select(i).collect()[0][0]} for i in df.columns]
df1 = sc.parallelize(a).toDF()
df1.show()
Related
I am trying to increase all values in a dataframe by 1, except for one column, which is the ID column.
Example:
Results:
This is what I have so far, but it gets quite long when I have a lot of columns to do (e.g. 50):
df_add = df.select(
    'Id',
    (df['col_a'] + 1).alias('col_a'),
    ..
    ..
)
Is there a more pythonic way of achieving the same results?
EDIT (based on @Daniel's comment):
You can use the lit function directly:
from pyspark.sql.functions import col, lit

plus_one_cols = [x for x in df.columns if x != "Id"]  # every column except Id
for column in plus_one_cols:
    df = df.withColumn(column, col(column) + lit(1))
PREVIOUS ANSWER:
Adding 1 to columns is a columnar operation, which is better suited to a pandas_udf:
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

plus_one_cols = [x for x in df.columns if x != "Id"]
for column in plus_one_cols:
    df = df.withColumn(column, plus_one(col(column)))
This will work much faster than row-wise operations. You can also refer to Introducing Pandas UDF for PySpark - Databricks.
If there are a lot of columns, you can use the one-liner below:
from pyspark.sql.functions import lit,col
df.select('Id', *[(col(i) + lit(1)) for i in df.columns if i != 'Id']).toDF(*df.columns).show()
Output:
+---+-----+-----+-----+
| Id|col_a|col_b|col_c|
+---+-----+-----+-----+
| 1| 4| 21| 6|
| 5| 6| 1| 1|
| 6| 10| 2| 1|
+---+-----+-----+-----+
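If you prefer SQL expressions, an equivalent sketch with selectExpr, under the same assumption that Id is the only column to pass through unchanged:
df.selectExpr('Id', *["{0} + 1 as {0}".format(c) for c in df.columns if c != 'Id']).show()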
You can use the withColumn method and then iterate over the columns as follows:
from pyspark.sql.functions import expr

df_add = df
for column in ["col_a", "col_b", "col_c"]:
    df_add = df_add.withColumn(column, expr(f"{column} + 1").cast("integer"))
Use pyspark.sql.functions.lit to add values to columns
Ex:
from pyspark.sql import functions as psf
df = spark.sql("""select 1 as test""")
df.show()
# +----+
# |test|
# +----+
# | 1|
# +----+
df_add = df.select(
'test',
(df['test'] + psf.lit(1)).alias('col_a'),
)
df_add.show()
# +----+-----+
# |test|col_a|
# +----+-----+
# | 1| 2|
# +----+-----+
###
# If you want to do it for all columns then:
###
list_of_columns = ["col1", "col2", ...]
df_add = df.select(
[(df[col] + psf.lit(1)).alias(col) for col in list_of_columns]
)
df_add.show()
This question already has answers here:
Spark DataFrame: count distinct values of every column
(6 answers)
Show distinct column values in pyspark dataframe
(12 answers)
Closed 1 year ago.
How is it possible to calculate the number of unique elements in each column of a PySpark dataframe?
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame([[1, 100], [1, 200], [2, 300], [3, 100], [4, 100], [4, 300]], columns=['col1', 'col2'])
df_spark = spark.createDataFrame(df)
df_spark.show()
# +----+----+
# |col1|col2|
# +----+----+
# | 1| 100|
# | 1| 200|
# | 2| 300|
# | 3| 100|
# | 4| 100|
# | 4| 300|
# +----+----+
# Some transformations on df_spark here
# How to get a number of unique elements (just a number) in each columns?
I only know the following solution, which is very slow; both of these lines take about the same amount of time:
col1_num_unique = df_spark.select('col1').distinct().count()
col2_num_unique = df_spark.select('col2').distinct().count()
There are about 10 million rows in df_spark.
Try this:
from pyspark.sql.functions import col, countDistinct
df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns))
EDIT:
As @pault suggested, this is an expensive operation and you can use approx_count_distinct() instead (the one he originally suggested is deprecated as of Spark 2.1).
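A minimal sketch of the approximate variant, assuming the same df_spark as above:
from pyspark.sql.functions import approx_count_distinct, col

# Approximate distinct count per column; much cheaper than an exact countDistinct
df_spark.agg(*(approx_count_distinct(col(c)).alias(c) for c in df_spark.columns)).show()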
@Manrique solved the problem, but only a slightly modified solution worked for me:
from pyspark.sql.functions import countDistinct

expression = [countDistinct(c).alias(c) for c in df.columns]
df.select(*expression).show()
This is much faster:
from pyspark.sql import functions as F

df_spark.select(F.countDistinct("col1")).show()
This question already has answers here:
get first N elements from dataframe ArrayType column in pyspark
(2 answers)
Closed 4 years ago.
I wish to remove the last element of the array from this DataFrame. We have this link demonstrating the same thing, but it uses UDFs, which I wish to avoid. Is there a simple way to do this, something like list[:2]?
data = [(['cat','dog','sheep'],),(['bus','truck','car'],),(['ice','pizza','pasta'],)]
df = sqlContext.createDataFrame(data,['data'])
df.show()
+-------------------+
| data|
+-------------------+
| [cat, dog, sheep]|
| [bus, truck, car]|
|[ice, pizza, pasta]|
+-------------------+
Expected DataFrame:
+--------------+
| data|
+--------------+
| [cat, dog]|
| [bus, truck]|
| [ice, pizza]|
+--------------+
UDF is the best thing you can find for PySpark :)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Get the first two elements, keeping the column as an array of strings
split_row = udf(lambda row: row[:2], ArrayType(StringType()))

# Apply the udf to each row
new_df = df.withColumn("data", split_row(df["data"]))
new_df.show()
# Output
+------------+
| data|
+------------+
| [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+
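Since the question asks to avoid UDFs: on Spark 2.4+ the built-in slice and size SQL functions can do the same thing natively. A minimal sketch, assuming the same df as above:
from pyspark.sql.functions import expr

# slice(data, 1, size(data) - 1) keeps every element except the last one
df.withColumn("data", expr("slice(data, 1, size(data) - 1)")).show()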
I want to create a sample single-column DataFrame, but the following code is not working:
df = spark.createDataFrame(["10","11","13"], ("age"))
## ValueError
## ...
## ValueError: Could not parse datatype: age
The expected result:
age
10
11
13
the following code is not working
With single elements you need the schema as a type:
spark.createDataFrame(["10","11","13"], "string").toDF("age")
or DataType:
from pyspark.sql.types import StringType
spark.createDataFrame(["10","11","13"], StringType()).toDF("age")
With names, elements should be tuples and the schema a sequence:
spark.createDataFrame([("10", ), ("11", ), ("13", )], ["age"])
Well, there is a pretty easy method for creating a sample dataframe in PySpark:
>>> df = sc.parallelize([[1,2,3], [2,3,4]]).toDF()
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
To create it with some column names:
>>> df1 = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a", "b", "c"))
>>> df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
This way, there is no need to define a schema either. Hope this is the simplest way.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
Output: (no need to define schema)
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  x|  y|  3|
+---+---+---+
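Note that newer Spark versions warn that inferring the schema from a plain dict is deprecated and suggest pyspark.sql.Row instead. A minimal equivalent sketch:
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Same single-row DataFrame, built from a Row instead of a plain dict
df = spark.createDataFrame([Row(a="x", b="y", c="3")])
df.show()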
For pandas + pyspark users, if you've already installed pandas in the cluster, you can do this simply:
# create pandas dataframe
df = pd.DataFrame({'col1':[1,2,3], 'col2':['a','b','c']})
# convert to spark dataframe
df = spark.createDataFrame(df)
Local Spark Setup
import findspark
findspark.init()
import pyspark
spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master("local")
         .getOrCreate())
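A quick usage sketch with that local session; the sample rows and column names below are made up for illustration:
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()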
See my farsante lib for creating a DataFrame with fake data:
import farsante
df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()
+----------+---------+
|first_name|last_name|
+----------+---------+
| Tommy| Hess|
| Arthur| Melendez|
| Clemente| Blair|
| Wesley| Conrad|
| Willis| Dunlap|
| Bruna| Sellers|
| Tonda| Schwartz|
+----------+---------+
Here's how to explicitly specify the schema when creating the PySpark DataFrame:
from pyspark.sql.types import IntegerType, StructField, StructType

df = spark.createDataFrame(
    [(10,), (11,), (13,)],
    StructType([StructField("some_int", IntegerType(), True)]))
df.show()
+--------+
|some_int|
+--------+
| 10|
| 11|
| 13|
+--------+
You can also try something like this:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the spark context
sample = sqlContext.createDataFrame(
    [
        ('qwe', 23),  # enter your data here
        ('rty', 34),
        ('yui', 56),
    ],
    ['abc', 'def']  # the row header/column labels should be entered here
)
There are several ways to create a DataFrame; creating one is among the first steps you learn while working with PySpark.
I assume you already have data, columns, and an RDD.
1) df = rdd.toDF()
2) df = rdd.toDF(columns)  # assigns column names
3) df = spark.createDataFrame(rdd).toDF(*columns)
4) df = spark.createDataFrame(data).toDF(*columns)
5) df = spark.createDataFrame(rowData,columns)
Besides these, you can find several more examples of creating a PySpark DataFrame online; a rough sketch of options 4) and 5) is below.
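As a sketch, with made-up data and column names chosen here purely for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [("James", 30), ("Anna", 25)]   # hypothetical sample rows
columns = ["name", "age"]              # hypothetical column names

df4 = spark.createDataFrame(data).toDF(*columns)   # option 4)
df5 = spark.createDataFrame(data, columns)         # option 5)
df5.show()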
I have this PySpark dataframe
+-----------+--------------------+
|uuid       | test_123           |
+-----------+--------------------+
| 1         |[test, test2, test3]|
| 2         |[test4, test, test6]|
| 3         |[test6, test9, t55o]|
+-----------+--------------------+
and I want to convert the column test_123 to be like this:
+-----------+--------------------+
|uuid       | test_123           |
+-----------+--------------------+
| 1         |"test,test2,test3"  |
| 2         |"test4,test,test6"  |
| 3         |"test6,test9,t55o"  |
+-----------+--------------------+
So, from a list to a string. How can I do it with PySpark?
While you can use a UserDefinedFunction, it is very inefficient. Instead, it is better to use the concat_ws function:
from pyspark.sql.functions import concat_ws
df.withColumn("test_123", concat_ws(",", "test_123")).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+
You can create a udf that joins the array/list and then apply it to the test column:
from pyspark.sql.functions import udf, col
join_udf = udf(lambda x: ",".join(x))
df.withColumn("test_123", join_udf(col("test_123"))).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+
The initial data frame is created from:
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField("uuid", IntegerType(), True),
    StructField("test_123", ArrayType(StringType(), True), True)])
rdd = sc.parallelize([
    [1, ["test", "test2", "test3"]],
    [2, ["test4", "test", "test6"]],
    [3, ["test6", "test9", "t55o"]]])
df = spark.createDataFrame(rdd, schema)
df.show()
+----+--------------------+
|uuid| test_123|
+----+--------------------+
| 1|[test, test2, test3]|
| 2|[test4, test, test6]|
| 3|[test6, test9, t55o]|
+----+--------------------+
As of version 2.4.0, you can use array_join (Spark docs):
from pyspark.sql.functions import array_join
df.withColumn("test_123", array_join("test_123", ",")).show()
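This gives the same result as the concat_ws approach above:
+----+----------------+
|uuid|        test_123|
+----+----------------+
|   1|test,test2,test3|
|   2|test4,test,test6|
|   3|test6,test9,t55o|
+----+----------------+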