I need to add a large number of columns (4000) to a dataframe in PySpark. I am using the withColumn function, but I am getting an assertion error.
df3 = df2.withColumn("['ftr' + str(i) for i in range(0, 4000)]", [expr('ftr[' + str(x) + ']') for x in range(0, 4000)])
Not sure what is wrong.
You can use .select() instead of .withColumn(), passing a list of columns to get the same result as chaining multiple .withColumn() calls. The ["*"] also selects every existing column in the dataframe.
import pyspark.sql.functions as F
df2:
+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+
df3 = df2.select(["*"] + [F.lit(f"{x}").alias(f"ftr{x}") for x in range(0,10)])
Results in:
+---+----+----+----+----+----+----+----+----+----+----+
|age|ftr0|ftr1|ftr2|ftr3|ftr4|ftr5|ftr6|ftr7|ftr8|ftr9|
+---+----+----+----+----+----+----+----+----+----+----+
| 10| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|
| 11| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|
| 13| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|
+---+----+----+----+----+----+----+----+----+----+----+
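Applied to the original question, the same pattern would look roughly like this (a sketch, assuming df2 has an array column named ftr, which is what the question's expr('ftr[...]') calls imply):
import pyspark.sql.functions as F

# Select every existing column plus one element of the `ftr` array per new column
df3 = df2.select(["*"] + [F.expr(f"ftr[{x}]").alias(f"ftr{x}") for x in range(0, 4000)])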
Try to do something like this:
from pyspark.sql.functions import lit

df3 = df2
for i in range(0, 4000):
    df3 = df3.withColumn(f"ftr{i}", lit(f"ftr{i}"))
I have a dataframe like this
+---+---------------------+
| id| csv|
+---+---------------------+
| 1|a,b,c\n1,2,3\n2,3,4\n|
| 2|a,b,c\n3,4,5\n4,5,6\n|
| 3|a,b,c\n5,6,7\n6,7,8\n|
+---+---------------------+
and I want to explode the string-type csv column; in fact, I'm only interested in that column. So I'm looking for a way to obtain the following dataframe from the one above.
+--+--+--+
| a| b| c|
+--+--+--+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 4| 5| 6|
| 5| 6| 7|
| 6| 7| 8|
+--+--+--+
Looking at the from_csv documentation, it seems that the input csv string can contain only one row of data, which I found stated more clearly here. So that's not an option.
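For reference, parsing a single record with from_csv does work; a minimal sketch, just to illustrate the limitation rather than solve the problem:
from pyspark.sql import functions as F

# One csv record parses into a single struct; multi-line csv strings are not split into rows
spark.range(1).select(
    F.from_csv(F.lit("1,2,3"), "a int, b int, c int").alias("parsed")
).show()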
I guess I could loop over the individual rows of the input dataframe, extract and parse the csv string from each row and then stitch everything together:
rows = df.collect()
for (i, row) in enumerate(rows):
    data = row['csv']
    data = data.split('\\n')
    rdd = spark.sparkContext.parallelize(data)
    df_row = (spark.read
              .option('header', 'true')
              .schema('a int, b int, c int')
              .csv(rdd))
    if i == 0:
        df_new = df_row
    else:
        df_new = df_new.union(df_row)
df_new.show()
But that seems awfully inefficient. Is there a better way to achieve the desired result?
Using the split and from_csv functions along with transform, you can do something like this:
from pyspark.sql import functions as F

df = spark.createDataFrame([
    (1, r"a,b,c\n1,2,3\n2,3,4\n"), (2, r"a,b,c\n3,4,5\n4,5,6\n"),
    (3, r"a,b,c\n5,6,7\n6,7,8\n")], ["id", "csv"]
)

df1 = df.withColumn(
    "csv",
    F.transform(
        F.split(F.regexp_replace("csv", r"^a,b,c\\n|\\n$", ""), r"\\n"),
        lambda x: F.from_csv(x, "a int, b int, c int")
    )
).selectExpr("inline(csv)")

df1.show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# | 1| 2| 3|
# | 2| 3| 4|
# | 3| 4| 5|
# | 4| 5| 6|
# | 5| 6| 7|
# | 6| 7| 8|
# +---+---+---+
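A brief note on how this works: regexp_replace strips the header and the trailing separator, split turns the remaining string into an array of row strings, transform maps each row string through from_csv into a struct, and inline explodes the array of structs into rows with one column per field. To inspect the intermediate array of structs before inline, you could run an optional check like this (not part of the original answer):
df.withColumn(
    "csv",
    F.transform(
        F.split(F.regexp_replace("csv", r"^a,b,c\\n|\\n$", ""), r"\\n"),
        lambda x: F.from_csv(x, "a int, b int, c int")
    )
).printSchema()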
May I know what's wrong with the code below? It does not print anything.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, current_timestamp,last_day,next_day, date_format, date_add, year, month, dayofmonth, dayofyear, dayofweek, date_trunc, date_sub, to_date, add_months, weekofyear, quarter, col
from pyspark.sql.types import StructType,StructField,StringType, IntegerType
ss = SparkSession.builder.appName('DateDim').master('local[1]').getOrCreate()
df = ss.createDataFrame([],StructType([]))
current_date()
df = df.select(
    current_date().alias("current_date"),
    next_day(current_date(), 'sunday').alias("next_day"),
    dayofweek(current_date()).alias("day_of_week"),
    dayofmonth(current_date()).alias("day_of_month"),
    dayofyear(current_date()).alias("day_of_year"),
    last_day(current_date()).alias("last_day"),
    year(current_date()).alias("year"),
    month(current_date()).alias("month"),
    weekofyear(current_date()).alias("week_of_year"),
    quarter(current_date()).alias("quarter")
).collect()
print(df)
for i in range(1, 1000):
    print(i)

for i in range(1, 1000):
    v_date = date_add(v_date, i)
    df.unionAll(df.select(
        v_date.alias("current_date"),
        next_day(v_date, 'sunday').alias("next_day"),
        dayofweek(v_date).alias("day_of_week"),
        dayofmonth(v_date).alias("day_of_month"),
        dayofyear(v_date).alias("day_of_year"),
        last_day(v_date).alias("last_day"),
        year(v_date).alias("year"),
        month(v_date).alias("month"),
        weekofyear(v_date).alias("week_of_year"),
        quarter(v_date).alias("quarter")
    ))
df.show()
You're getting zero rows because there are no rows in the initial df. Any columns being created will have no values as there are no rows in df.
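A quick way to see this (a small sketch reusing the empty dataframe from the question): projecting columns onto a dataframe with zero rows still yields zero rows.
from pyspark.sql.functions import current_date
from pyspark.sql.types import StructType

empty_df = ss.createDataFrame([], StructType([]))
print(empty_df.select(current_date().alias("current_date")).count())  # prints 0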
It seems you're trying to create a dataframe with 1000 dates starting from the current day. There's a simpler approach using the sequence function.
from pyspark.sql import functions as func

data_sdf = spark.createDataFrame([(1,)], 'id string')

data_sdf. \
    withColumn('min_dt', func.current_date().cast('date')). \
    withColumn('max_dt', func.date_add('min_dt', 1000).cast('date')). \
    withColumn('all_dates', func.expr('sequence(min_dt, max_dt, interval 1 day)')). \
    withColumn('dates_exp', func.explode('all_dates')). \
    drop('id'). \
    show(10)
# +----------+----------+--------------------+----------+
# | min_dt| max_dt| all_dates| dates_exp|
# +----------+----------+--------------------+----------+
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-27|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-28|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-29|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-30|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-31|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-01|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-02|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-03|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-04|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-05|
# +----------+----------+--------------------+----------+
# only showing top 10 rows
Select the dates_exp field for further use.
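For example, a minimal sketch of keeping only that field, reusing data_sdf and func from the snippet above (the variable name date_dim_sdf is just illustrative):
date_dim_sdf = data_sdf. \
    withColumn('min_dt', func.current_date().cast('date')). \
    withColumn('max_dt', func.date_add('min_dt', 1000).cast('date')). \
    withColumn('all_dates', func.expr('sequence(min_dt, max_dt, interval 1 day)')). \
    withColumn('dates_exp', func.explode('all_dates')). \
    select('dates_exp')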
You want to use the range() function, which generates rows directly (with sequence you generate an array that you then need to explode into rows).
That's how you can use it:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, next_day, dayofweek, dayofmonth, dayofyear, last_day, year, month, \
    weekofyear, quarter, current_date

spark = SparkSession.builder.getOrCreate()

(
    spark
    .range(0, 1000)
    .alias("id")
    .select(
        (current_date() + col('id').cast("int")).alias("date")
    )
    .select(
        "date",
        next_day("date", 'sunday').alias("next_sunday"),
        dayofweek("date").alias("day_of_week"),
        dayofmonth("date").alias("day_of_month"),
        dayofyear("date").alias("day_of_year"),
        last_day("date").alias("last_day"),
        year("date").alias("year"),
        month("date").alias("month"),
        weekofyear("date").alias("week_of_year"),
        quarter("date").alias("quarter")
    )
).show()
It returns:
+----------+-----------+-----------+------------+-----------+----------+----+-----+------------+-------+
| date|next_sunday|day_of_week|day_of_month|day_of_year| last_day|year|month|week_of_year|quarter|
+----------+-----------+-----------+------------+-----------+----------+----+-----+------------+-------+
|2022-09-22| 2022-09-25| 5| 22| 265|2022-09-30|2022| 9| 38| 3|
|2022-09-23| 2022-09-25| 6| 23| 266|2022-09-30|2022| 9| 38| 3|
|2022-09-24| 2022-09-25| 7| 24| 267|2022-09-30|2022| 9| 38| 3|
|2022-09-25| 2022-10-02| 1| 25| 268|2022-09-30|2022| 9| 38| 3|
|2022-09-26| 2022-10-02| 2| 26| 269|2022-09-30|2022| 9| 39| 3|
|2022-09-27| 2022-10-02| 3| 27| 270|2022-09-30|2022| 9| 39| 3|
|2022-09-28| 2022-10-02| 4| 28| 271|2022-09-30|2022| 9| 39| 3|
|2022-09-29| 2022-10-02| 5| 29| 272|2022-09-30|2022| 9| 39| 3|
|2022-09-30| 2022-10-02| 6| 30| 273|2022-09-30|2022| 9| 39| 3|
|2022-10-01| 2022-10-02| 7| 1| 274|2022-10-31|2022| 10| 39| 4|
|2022-10-02| 2022-10-09| 1| 2| 275|2022-10-31|2022| 10| 39| 4|
|2022-10-03| 2022-10-09| 2| 3| 276|2022-10-31|2022| 10| 40| 4|
|2022-10-04| 2022-10-09| 3| 4| 277|2022-10-31|2022| 10| 40| 4|
|2022-10-05| 2022-10-09| 4| 5| 278|2022-10-31|2022| 10| 40| 4|
|2022-10-06| 2022-10-09| 5| 6| 279|2022-10-31|2022| 10| 40| 4|
|2022-10-07| 2022-10-09| 6| 7| 280|2022-10-31|2022| 10| 40| 4|
|2022-10-08| 2022-10-09| 7| 8| 281|2022-10-31|2022| 10| 40| 4|
|2022-10-09| 2022-10-16| 1| 9| 282|2022-10-31|2022| 10| 40| 4|
|2022-10-10| 2022-10-16| 2| 10| 283|2022-10-31|2022| 10| 41| 4|
|2022-10-11| 2022-10-16| 3| 11| 284|2022-10-31|2022| 10| 41| 4|
+----------+-----------+-----------+------------+-----------+----------+----+-----+------------+-------+
only showing top 20 rows
I have the following dataframe
dataframe - columnA, columnB, columnC, columnD, columnE
I want to groupBy columnC and then consider max value of columnE
dataframe.select('*').groupBy('columnC').max('columnE')
expected output
dataframe - columnA, columnB, columnC, columnD, columnE
Real output
dataframe - columnC, columnE
Why are all the columns in the dataframe not displayed as expected?
For Spark version >= 3.0.0 you can use max_by to select the additional columns.
import random
from pyspark.sql import functions as F
# create some test data
df = spark.createDataFrame(
    [[random.randint(1, 3)] + random.sample(range(0, 30), 4) for _ in range(10)],
    schema=["columnC", "columnB", "columnA", "columnD", "columnE"]) \
    .select("columnA", "columnB", "columnC", "columnD", "columnE")

df.groupBy("columnC") \
    .agg(F.max("columnE"),
         F.expr("max_by(columnA, columnE) as columnA"),
         F.expr("max_by(columnB, columnE) as columnB"),
         F.expr("max_by(columnD, columnE) as columnD")) \
    .show()
For the test data
+-------+-------+-------+-------+-------+
|columnA|columnB|columnC|columnD|columnE|
+-------+-------+-------+-------+-------+
| 25| 20| 2| 0| 2|
| 14| 2| 2| 24| 6|
| 26| 13| 3| 2| 1|
| 5| 24| 3| 19| 17|
| 22| 5| 3| 14| 21|
| 24| 5| 1| 8| 4|
| 7| 22| 3| 16| 20|
| 6| 17| 1| 5| 7|
| 24| 22| 2| 8| 3|
| 4| 14| 1| 16| 11|
+-------+-------+-------+-------+-------+
the result is
+-------+------------+-------+-------+-------+
|columnC|max(columnE)|columnA|columnB|columnD|
+-------+------------+-------+-------+-------+
| 1| 11| 4| 14| 16|
| 3| 21| 22| 5| 14|
| 2| 6| 14| 2| 24|
+-------+------------+-------+-------+-------+
What you want to achieve can be done with a window function, not groupBy:
Partition your data by columnC.
Order the rows within each partition by columnE in descending order and rank them.
Filter out the desired result (rank 1).
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
from pyspark.sql.functions import col
windowSpec = Window.partitionBy("columnC").orderBy(col("columnE").desc())
expectedDf = df.withColumn("rank", rank().over(windowSpec)) \
    .filter(col("rank") == 1)
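If you only want the original columns in the result, the helper rank column can be dropped afterwards (a small addition, assuming that is the desired final shape):
expectedDf = expectedDf.drop("rank")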
You might want to restructure your question.
How do I duplicate each row of my original dataframe and then add dataframe 2 so that my final output is: I am writing this in Python with a PySpark dataframe.
What you want is a cross join:
result = df1.crossJoin(df2)
result.show()
#+------+--------+------+-------+------------+-----------------+
#| name| address|salary|bonus %|allowances %|employee category|
#+------+--------+------+-------+------------+-----------------+
#| Tom| Chicago| 75000| 5| 5| onsite|
#| Tom| Chicago| 75000| 10| 10| off shore|
#|Martha|New york| 80000| 5| 5| onsite|
#|Martha|New york| 80000| 10| 10| off shore|
#|Samuel| Phoenix| 90000| 5| 5| onsite|
#|Samuel| Phoenix| 90000| 10| 10| off shore|
#| Rom| Dallas| 65000| 5| 5| onsite|
#| Rom| Dallas| 65000| 10| 10| off shore|
#+------+--------+------+-------+------------+-----------------+
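Since the question doesn't show the input dataframes, inputs roughly like the following would produce the result above (a sketch reconstructed from the output, not taken from the question):
df1 = spark.createDataFrame(
    [("Tom", "Chicago", 75000), ("Martha", "New york", 80000),
     ("Samuel", "Phoenix", 90000), ("Rom", "Dallas", 65000)],
    ["name", "address", "salary"])

df2 = spark.createDataFrame(
    [(5, 5, "onsite"), (10, 10, "off shore")],
    ["bonus %", "allowances %", "employee category"])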
I have a dataframe that looks like this
+-----------+-----------+-----------+
|salesperson| device|amount_sold|
+-----------+-----------+-----------+
| john| notebook| 2|
| gary| notebook| 3|
| john|small_phone| 2|
| mary|small_phone| 3|
| john|large_phone| 3|
| john| camera| 3|
+-----------+-----------+-----------+
and I have transformed it using the pivot function into the following, with a Total column:
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
+-----------+------+-----------+--------+-----------+-----+
but I would also like a Total row that contains the total for every column, like below:
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
| Total| 3| 3| 5| 5| 16|
+-----------+------+-----------+--------+-----------+-----+
Is it possible to do this in Spark using Scala/Python (preferably Scala), and without using union if possible?
TIA
You can do something like below:
import org.apache.spark.sql.functions.{col, lit, sum}

val columns = df.columns.dropWhile(_ == "salesperson").map(col)

// Apply `sum` to each column and union the result with the original DataFrame.
val withTotalAsRow = df.union(df.select(lit("Total").as("salesperson") +: columns.map(sum): _*))

// The Total column already exists in the pivoted DataFrame; if it didn't,
// it could be appended by adding up the value from each column.
val withTotalAsColumn = withTotalAsRow.withColumn("Total", columns.reduce(_ plus _))
With Spark Scala, you can achieve this using the following snippet of code.
// Assuming the spark session is available as a variable named 'spark'
import spark.implicits._

val resultDF = df.withColumn("Total", $"camera" + $"large_phone" + $"notebook" + $"small_phone")