I am trying to increase all values in a dataframe by 1, except for one column, which is the ID column.
This is what I have so far, but it gets a bit long when I have a lot of columns to handle (e.g. 50):
df_add = df.select(
    'Id',
    (df['col_a'] + 1).alias('col_a'),
    ..
    ..
)
Is there a more pythonic way of achieving the same results?
EDIT (based on @Daniel's comment):
You can directly use the lit function
from pyspark.sql.functions import col, lit
for column in plus_one_cols:
    df = df.withColumn(column, col(column) + lit(1))
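If you prefer not to reassign df inside a loop, a minimal alternative sketch uses functools.reduce (assuming plus_one_cols is the list of columns to increment, as defined in the previous answer below):
from functools import reduce
from pyspark.sql.functions import col, lit

# fold the "+ 1" transformation over every column name in plus_one_cols
df = reduce(lambda acc, c: acc.withColumn(c, col(c) + lit(1)), plus_one_cols, df)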
PREVIOUS ANSWER:
Adding "1" to columns is a columnar operation, which can be better suited to a pandas_udf:
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

plus_one_cols = [x for x in df.columns if x != "Id"]
for column in plus_one_cols:
    df = df.withColumn(column, plus_one(col(column)))
This will work much faster than row-wise operations. You can also refer to Introducing Pandas UDF for PySpark - Databricks.
If there are a lot of columns, you can use the one-liner below:
from pyspark.sql.functions import lit, col
df.select('Id', *[(col(i) + lit(1)) for i in df.columns if i != 'Id']).toDF(*df.columns).show()
Output:
+---+-----+-----+-----+
| Id|col_a|col_b|col_c|
+---+-----+-----+-----+
| 1| 4| 21| 6|
| 5| 6| 1| 1|
| 6| 10| 2| 1|
+---+-----+-----+-----+
You can use the withColumn method and then iterate over the columns as follows:
from pyspark.sql.functions import expr

df_add = df
for column in ["col_a", "col_b", "col_c"]:
    df_add = df_add.withColumn(column, expr(f"{column} + 1").cast("integer"))
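If the column list is long, here is a sketch of the same idea that builds the expressions dynamically (assuming 'Id' is the only column to skip):
# derive the column list instead of hard-coding it, then build one SQL expression per column
cols_to_bump = [c for c in df.columns if c != "Id"]
df_add = df.selectExpr("Id", *[f"cast({c} + 1 as int) as {c}" for c in cols_to_bump])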
Use pyspark.sql.functions.lit to add values to columns
Ex:
from pyspark.sql import functions as psf
df = spark.sql("""select 1 as test""")
df.show()
# +----+
# |test|
# +----+
# | 1|
# +----+
df_add = df.select(
    'test',
    (df['test'] + psf.lit(1)).alias('col_a'),
)
df_add.show()
# +----+-----+
# |test|col_a|
# +----+-----+
# | 1| 2|
# +----+-----+
###
# If you want to do it for all columns then:
###
list_of_columns = ["col1", "col2", ...]
df_add = df.select(
    [(df[col] + psf.lit(1)).alias(col) for col in list_of_columns]
)
df_add.show()
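To map this back to the original question, the column list can be derived from df.columns so that only the ID column is left untouched (a sketch, assuming the original dataframe has an 'Id' column):
# keep 'Id' as-is and add 1 to every other column
list_of_columns = [c for c in df.columns if c != 'Id']
df_add = df.select('Id', *[(df[c] + psf.lit(1)).alias(c) for c in list_of_columns])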
Related
I want to add a new column new_col: if the value of column a is in yes_list, then the value in new_col should be 1, else 0.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

rdd = sc.parallelize([{"a": 'y'}, {"a": 'y', "b": 2}, {"a": 'n', "c": 3}])
rdd_df = sqlContext.read.json(rdd)
yes_list = ['y']
Something like this:
rdd_df.withColumn("new_col", [1 if val in yes_list else 0 for val in rdd_df["a"]])
But the above is not correct, and raises an error:
TypeError: Column is not iterable
How to achieve it?
You can use the when and isin functions from the Spark SQL API. It would go as follows:
from pyspark.sql import functions
rdd_df.withColumn("new_col", functions.when(rdd_df['a'].isin(yes_list), 1).otherwise(0)).show()
+---+----+----+-------+
| a| b| c|new_col|
+---+----+----+-------+
| y|null|null| 1|
| y| 2|null| 1|
| n|null| 3| 0|
+---+----+----+-------+
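As a side note, an equivalent minimal sketch casts the boolean result of isin directly to an integer, which gives the same 1/0 values:
# True/False from isin becomes 1/0 after the cast
rdd_df.withColumn("new_col", rdd_df['a'].isin(yes_list).cast("int")).show()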
I have a df that only has one row.
id |id2 |score|score2|
----------------------
0 |1 |4 |2 |
and I want to add a row with the percentages of these at the bottom, i.e. every number divided by 7:
0/7 |1/7 |4/7 |2/7 |
But the solution I came up with is incredibly slow:
from pyspark.sql import Row

temp = [i/7 for i in df.collect()[0]]
row = sc.parallelize(Row(temp)).toDF()
df.union(row)
This took 21 seconds to run, almost all of which is the last two lines of code. Is there a better way to do this? My other thought was to transpose the table; then this can easily be done with df.withColumn(). Ideally, I also want to filter out the column with 0, but I haven't really looked into that yet.
Check this out and let me know if it helps:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .appName('practice') \
    .getOrCreate()
sc = spark.sparkContext

df = sc.parallelize([
    (0, 1, 4, 2)]).toDF(["id", "id2", "score", "score2"])
df2 = df.select(*[(F.col(column)/7).alias(column) for column in df.columns])
df3 = df.union(df2)
df3.show()
+---+-------------------+------------------+------------------+
| id| id2| score| score2|
+---+-------------------+------------------+------------------+
|0.0| 1.0| 4.0| 2.0|
|0.0|0.14285714285714285|0.5714285714285714|0.2857142857142857|
+---+-------------------+------------------+------------------+
If you want to filter out the columns having 0, you can use the below code:
non_zero_cols = [c for c in df.columns if df[[c]].first()[c] > 0]
df1 = df.select(*non_zero_cols)
df2 = df1.select(*[(F.col(column)/7).alias(column) for column in df1.columns])
df3 = df1.union(df2)
df3.show()
+-------------------+------------------+------------------+
| id2| score| score2|
+-------------------+------------------+------------------+
| 1.0| 4.0| 2.0|
|0.14285714285714285|0.5714285714285714|0.2857142857142857|
+-------------------+------------------+------------------+
Please check the below code if you also want the df to have a type column:
non_zero_cols = [c for c in df.columns if df[[c]].first()[c] > 0]
df1 = df.select(*non_zero_cols, F.lit('count').alias('type') )
df2 = df1.select(*[(F.col(column)/7).alias(column)
                   for column in df1.columns if column != 'type'],
                 F.lit('percent').alias('type'))
df3 = df1.union(df2)
df3.show()
+-------------------+------------------+------------------+-------+
| id2| score| score2| type|
+-------------------+------------------+------------------+-------+
| 1.0| 4.0| 2.0| count|
|0.14285714285714285|0.5714285714285714|0.2857142857142857|percent|
+-------------------+------------------+------------------+-------+
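As a side note (my own addition, not from the original answer): if you would rather not rely on column order when unioning, unionByName, available since Spark 2.3, matches columns by name:
# same result as df1.union(df2) here, but matched by column name rather than position
df3 = df1.unionByName(df2)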
I am trying to change all the columns of a Spark dataframe to double type, but I want to know if there is a better way of doing it than just looping over the columns and casting.
With this dataframe:
df = spark.createDataFrame(
[
(1,2),
(2,3),
],
["foo","bar"]
)
df.show()
+---+---+
|foo|bar|
+---+---+
| 1| 2|
| 2| 3|
+---+---+
The for loop is probably the easiest and most natural solution.
from pyspark.sql import functions as F
for col in df.columns:
df = df.withColumn(
col,
F.col(col).cast("double")
)
df.show()
+---+---+
|foo|bar|
+---+---+
|1.0|2.0|
|2.0|3.0|
+---+---+
Of course, you can also use a Python comprehension:
df.select(
*(
F.col(col).cast("double").alias(col)
for col
in df.columns
)
).show()
+---+---+
|foo|bar|
+---+---+
|1.0|2.0|
|2.0|3.0|
+---+---+
If you have a lot of columns, the second solution is a little bit better.
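As an extra sketch (my own addition; it assumes Spark 3.3+, where DataFrame.withColumns accepts a dict of expressions), the same cast can be written without an explicit loop:
# cast every column to double in a single plan-building call
df = df.withColumns({c: F.col(c).cast("double") for c in df.columns})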
First of all, don't post PySpark solutions on Spark (Scala) questions; for beginners I find it really annoying, and not every implementation can be smoothly translated to Spark.
Suppose df is the DataFrame:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

def func(column: Column) = column.cast(DoubleType)

val df2 = df.select(df.columns.map(c => func(col(c))): _*)
I want to create a sample single-column DataFrame, but the following code is not working:
df = spark.createDataFrame(["10","11","13"], ("age"))
## ValueError
## ...
## ValueError: Could not parse datatype: age
The expected result:
age
10
11
13
the following code is not working
With a single element, you need the schema as a type:
spark.createDataFrame(["10","11","13"], "string").toDF("age")
or DataType:
from pyspark.sql.types import StringType
spark.createDataFrame(["10","11","13"], StringType()).toDF("age")
With a name, the elements should be tuples and the schema a sequence:
spark.createDataFrame([("10", ), ("11", ), ("13", )], ["age"])
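Another minimal sketch (my addition, not from the original answer) is to build Row objects explicitly, so the column name comes from the Row fields:
from pyspark.sql import Row

# each Row carries the column name, so no separate schema argument is needed
spark.createDataFrame([Row(age="10"), Row(age="11"), Row(age="13")]).show()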
Well, there is a pretty easy method for creating a sample dataframe in PySpark:
>>> df = sc.parallelize([[1,2,3], [2,3,4]]).toDF()
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
To create it with some column names:
>>> df1 = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a", "b", "c"))
>>> df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
This way, there is no need to define a schema either. Hope this is the simplest way.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
Output: (no need to define schema)
+---+---+---+
| a | b | c |
+---+---+---+
| x| y| 3|
+---+---+---+
For pandas + pyspark users, if you've already installed pandas in the cluster, you can do this simply:
import pandas as pd

# create pandas dataframe
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
# convert to spark dataframe
df = spark.createDataFrame(df)
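As a side note (my assumption; it applies to Spark 3.x with pyarrow installed), enabling Arrow can speed up the pandas-to-Spark conversion:
# Arrow-based conversion between pandas and Spark dataframes
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df = spark.createDataFrame(pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}))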
Local Spark Setup
import findspark
findspark.init()
import pyspark
spark = (pyspark
.sql
.SparkSession
.builder
.master("local")
.getOrCreate())
See my farsante lib for creating a DataFrame with fake data:
import farsante
df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()
+----------+---------+
|first_name|last_name|
+----------+---------+
| Tommy| Hess|
| Arthur| Melendez|
| Clemente| Blair|
| Wesley| Conrad|
| Willis| Dunlap|
| Bruna| Sellers|
| Tonda| Schwartz|
+----------+---------+
Here's how to explicitly specify the schema when creating the PySpark DataFrame:
from pyspark.sql.types import StructType, StructField, IntegerType

df = spark.createDataFrame(
    [(10,), (11,), (13,)],
    StructType([StructField("some_int", IntegerType(), True)]))
df.show()
+--------+
|some_int|
+--------+
| 10|
| 11|
| 13|
+--------+
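An equivalent sketch (my addition) passes the schema as a DDL-formatted string, which recent Spark versions also accept:
# "column_name type" DDL string instead of a StructType
df = spark.createDataFrame([(10,), (11,), (13,)], "some_int int")
df.show()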
You can also try something like this -
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the spark context
sample = sqlContext.createDataFrame(
    [
        ('qwe', 23),  # enter your data here
        ('rty', 34),
        ('yui', 56),
    ],
    ['abc', 'def']  # the row header/column labels should be entered here
)
There are several ways to create a DataFrame; creating one is among the first steps you learn while working with PySpark.
I assume you already have data, columns, and an RDD.
1) df = rdd.toDF()
2) df = rdd.toDF(columns)  # assigns column names
3) df = spark.createDataFrame(rdd).toDF(*columns)
4) df = spark.createDataFrame(data).toDF(*columns)
5) df = spark.createDataFrame(rowData, columns)
Besides these, you can find several examples on pyspark create dataframe.
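For illustration, here is a minimal runnable sketch of options 4) and 5) with placeholder data and column names (my own assumptions):
# toy data and column names, purely for demonstration
data = [("James", 30), ("Anna", 25)]
columns = ["name", "age"]

df4 = spark.createDataFrame(data).toDF(*columns)
df5 = spark.createDataFrame(data, columns)
df5.show()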
I have PySpark 2.0.1. I'm trying to group by my data frame & retrieve the values for all the fields from my data frame. I found that
z=data1.groupby('country').agg(F.collect_list('names'))
will give me values for the country & names attributes, & for the names attribute it will give the column header as collect_list(names). But for my job I have a dataframe with around 15 columns & I will run a loop & will change the groupby field each time inside the loop & need the output for all of the remaining fields. Can you please suggest how to do it using collect_list() or any other pyspark functions?
I tried this code too
from pyspark.sql import functions as F
fieldnames = data1.schema.names
names1 = list()
for item in fieldnames:
    if item != 'names':
        names1.append(item)
z = data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got the error message:
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
Use struct to combine the columns before calling groupBy
Suppose you have a dataframe:
from pyspark.sql import functions as f

df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")
df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
| a|newcol|
+---+------+
| 0| [1,2]|
| 0| [4,5]|
| 1| [7,8]|
| 1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
| a| collected_col|
+---+--------------+
| 0|[[1,2], [4,5]]|
| 1|[[7,8], [8,7]]|
+---+--------------+
Aggregation operations can be done only on single columns.
After the aggregation, you can collect the result and iterate over it to separate the combined columns and generate the index dict, or you can write a udf to separate the combined columns.
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

def foo(x):
    # split each (b, c) struct in the collected list into two parallel lists
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)

st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)

df = df.withColumn("ncol", udf_foo("collected_col")).select(
    "a",
    col("ncol").getItem("b").alias("b"),
    col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
| a| b| c|
+---+------+------+
| 0|[1, 4]|[2, 5]|
| 1|[7, 8]|[8, 7]|
+---+------+------+
Actually, we can do it in PySpark 2.2.
First we need to create a constant column ("Temp"), group by that column ("Temp"), and apply agg, passing the iterable *exprs that contains the collect_list expressions.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools

def groupColumnData(df, columns):
    df = df.withColumn("Temp", ftions.lit(1))
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    df = df.drop("Temp")
    df = df.toDF(*columns)
    return df
Input Data:
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
Output Data:
df.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
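For completeness, the output above would come from a call like this (a usage sketch based on the input shown):
# collect every column of df into a single-row dataframe of lists
result = groupColumnData(df, ["a", "b", "c"])
result.show()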
In Spark 2.4.4 and Python 3.7 (I guess it's also relevant for previous Spark and Python versions):
My suggestion is based on pauli's answer; instead of creating the struct and then using the agg function, create the struct inside collect_list:
from pyspark.sql.functions import collect_list, struct

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()
Result:
+---+-----------------+
| a|res |
+---+-----------------+
| 0|[[1, 2], [4, 5]] |
| 1|[[7, 8], [8, 7]] |
+---+-----------------+
I just use the concat_ws function and it works perfectly fine.
from pyspark.sql.functions import *

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy('a').agg(collect_list(concat_ws(',','b','c')).alias('r')).show()