My dataset has one column called 'eventAction'.
It has values like 'conversion', 'purchase', 'check-out', etc. I want to transform this column so that 'conversion' maps to 1 and every other category maps to 0.
I used a lambda function like this:
e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0)
where event1 is the name of my spark dataframe.
When printing e1 I get this:
print(e1.take(5))
[0, 0, 0, 0, 0]
So the lambda function seems to have worked properly. But when I convert the result to a PySpark dataframe, I get null values, as shown:
schema1 = StructType([StructField('conversion',IntegerType(),True)])
df = spark.createDataFrame(data=[e1],schema=schema1)
df.printSchema()
df.show()
It would be great if you could help me with this.
Thanks!
spark.createDataFrame expects an RDD of Row objects, not an RDD of integers, so you need to map the RDD to Rows before converting it to a dataframe. Also note that there is no need to wrap e1 in square brackets.
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType

e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0).map(lambda x: Row(x))
schema1 = StructType([StructField('conversion', IntegerType(), True)])
df = spark.createDataFrame(data=e1, schema=schema1)
That said, what you're trying to do is easily done with the Spark SQL when function; there is no need to drop down to the RDD API with a custom lambda. For example:
import pyspark.sql.functions as F
df = events.select(F.when(F.col('eventAction') == 'conversion', 1).otherwise(0).alias('conversion'))
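If you want to keep all existing columns alongside the new flag, a withColumn variant of the same when/otherwise logic should work (a minimal sketch on the same events dataframe, i.e. event1 in the question):

import pyspark.sql.functions as F

# Adds a 0/1 'conversion' column while keeping every existing column
df = events.withColumn(
    'conversion',
    F.when(F.col('eventAction') == 'conversion', 1).otherwise(0)
)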
I need to count a value in several columns, and I want all those individual counts for each column in a list.
Is there a faster/better way of doing this? Because my solution takes quite some time.
dataframe.cache()
list = [dataframe.filter(col(str(i)) == "value").count() for i in range(150)]
You can do a conditional count aggregation:
import pyspark.sql.functions as F
df2 = df.agg(*[
    F.count(F.when(F.col(str(i)) == "value", 1)).alias(str(i))
    for i in range(150)
])
result = df2.toPandas().transpose()[0].tolist()
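As a quick sanity check, here is a small self-contained sketch of the same pattern on made-up data with two string columns named "0" and "1" (toy values, purely for illustration):

import pyspark.sql.functions as F

toy = spark.createDataFrame(
    [("value", "x"), ("value", "value"), ("y", "x")],
    ["0", "1"],
)
# One conditional count per column, all computed in a single aggregation
toy.agg(*[
    F.count(F.when(F.col(c) == "value", 1)).alias(c)
    for c in ["0", "1"]
]).show()  # expected counts: column "0" -> 2, column "1" -> 1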
You can try the following approach/design:
Write a map function that is applied to each row of the data frame, like this:
VALUE = 'value'
def row_mapper(df_row):
    return [each == VALUE for each in df_row]
Write a reduce function for the data frame that takes two rows as input:
def reduce_rows(df_row1, df_row2):
    return [x + y for x, y in zip(df_row1, df_row2)]
Note: these are plain Python functions to help you understand the idea, not UDFs that you can apply directly in PySpark.
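If you do want to run this design, one hedged way to wire it up (assuming the question's dataframe is called dataframe and row_mapper/reduce_rows are defined as above) is through the RDD API:

# True/False sum as 1/0, so the result is one count of "value" per column
counts = dataframe.rdd.map(row_mapper).reduce(reduce_rows)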
What is the best way to iterate over a Spark dataframe (using PySpark) and, wherever the data type is decimal(38,10), change it to bigint (saving everything back to the same dataframe)?
I have the part for changing a data type, e.g.:
df = df.withColumn("COLUMN_X", df["COLUMN_X"].cast(IntegerType()))
but I am trying to figure out how to combine this with iterating over the columns.
Thanks.
You can loop through df.dtypes and cast to bigint when the type is equal to decimal(38,10):
from pyspark.sql.functions import col

select_expr = [
    col(c).cast("bigint") if t == "decimal(38,10)" else col(c) for c, t in df.dtypes
]
df = df.select(*select_expr)
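A quick sanity check afterwards, if you want one (just a sketch):

# No column should still report decimal(38,10) after the select
assert all(t != "decimal(38,10)" for _, t in df.dtypes)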
I found this post https://stackoverflow.com/a/54399474/11268096, where you can loop through all columns and cast them to your desired data type.
from pyspark.sql import functions as F

for col in df.columns:
    df = df.withColumn(
        col,
        F.col(col).cast("double")
    )
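Note that the loop above casts every column to double; adapted to the original goal (casting only the decimal(38,10) columns to bigint), a hedged version might look like this:

from pyspark.sql import functions as F

# Cast only the columns whose current type is decimal(38,10)
for name, dtype in df.dtypes:
    if dtype == "decimal(38,10)":
        df = df.withColumn(name, F.col(name).cast("bigint"))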
I have a simple dataset with some null values:
Age,Title
10,Mr
20,Mr
null,Mr
1, Miss
2, Miss
null, Miss
I want to fill the null values with an aggregate computed over a grouping by a different column (in this case, Title). E.g. the mean of Age per Title is:
15, Mr
1.5, Miss
So the final result should look like this:
Age,Title
10,Mr
20,Mr
15,Mr
1, Miss
2, Miss
1.5, Miss
I have seen a lot of examples using Pandas using Transform:
df["Age"] = df.groupby("Title").transform(lambda x: x.fillna(x.mean()))
I am trying not to use external libraries and to do it natively in PySpark. The PySpark dataframe does not have a transform method.
I was thinking of storing the aggregates in a separate dataframe like this:
meanAgeDf = df.groupBy("Title").mean("Age").select("Title", col("avg(Age)").alias("AgeMean"))
and then, for each group, look up the Title and fill the null values with that mean:
from pyspark.sql.functions import when, col
x = df.join(meanAgeDf, "Title").withColumn("AgeMean", when(col("Age").isNull(), col("AgeMean")).otherwise(col("Age")))
Is this the most efficient way to do this?
This can be done in a single pass using the window function avg (which ignores nulls when computing the group mean).
from pyspark.sql import Window
from pyspark.sql.functions import avg

w = Window.partitionBy("Title")
res = df.withColumn("mean_col", avg(df["Age"]).over(w))
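To actually replace the nulls rather than just adding the mean as a new column, a hedged follow-up (building on res above, with the Title/Age schema from the question) is to coalesce Age with the windowed mean:

from pyspark.sql.functions import coalesce

# Keep Age where it is present, otherwise use the per-Title mean, then drop the helper column
res = res.withColumn("Age", coalesce(res["Age"], res["mean_col"])).drop("mean_col")
res.show()  # should match the expected output in the question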
My function get_data returns a tuple: two integer values.
get_data_udf = udf(lambda id: get_data(spark, id), (IntegerType(), IntegerType()))
I need to split them into two columns val1 and val2. How can I do it?
dfnew = df \
    .withColumn("val", get_data_udf(col("id")))
Should I save the tuple in a column, e.g. val, and then somehow split it into two columns? Or is there a shorter way?
You can declare a StructType with named StructFields as the UDF's return type, so that the fields can be accessed later.
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

get_data_udf = udf(lambda id: get_data(spark, id),
                   StructType([StructField('first', IntegerType()), StructField('second', IntegerType())]))
dfnew = df \
    .withColumn("val", get_data_udf(col("id"))) \
    .select('*', col('val.first').alias('first'), col('val.second').alias('second'))
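As a small follow-up: if you don't need to rename the fields, you can expand the whole struct at once with 'val.*' (a sketch assuming the same get_data_udf as above):

# Expands the struct into columns named 'first' and 'second', then drops the struct itself
dfnew = df \
    .withColumn("val", get_data_udf(col("id"))) \
    .select('*', 'val.*') \
    .drop('val')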
Tuples can be indexed just like lists, so you can take the value for the first column as get_data()[0] and the value for the second column as get_data()[1].
You can also do v1, v2 = get_data() to unpack the returned tuple into the variables v1 and v2.
Take a look at this question here for further clarification.
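A plain Python illustration of both points (not a Spark UDF; the values are just placeholders):

t = (1, 2)      # e.g. the kind of tuple get_data returns
first = t[0]    # indexing -> 1
second = t[1]   # indexing -> 2
v1, v2 = t      # unpacking -> v1 == 1, v2 == 2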
For example, say you have a sample dataframe with one column, like below (Scala):
import org.apache.spark.sql.functions.{col, typedLit}

val df = sc.parallelize(Seq(3)).toDF()
df.show()

// A helper function that returns a tuple
def tupleFunction(): (Int, Int) = (1, 2)

// Create two new columns from the tuple returned above
df.withColumn("newCol", typedLit(tupleFunction.toString.replace("(", "").replace(")", "").split(",")))
  .select((0 to 1).map(i => col("newCol").getItem(i).alias(s"newColFromTuple$i")): _*)
  .show
I have a pandas dataframe with a column that contains a small selection of strings. Let's call the column 'A'; all of its values are string_1, string_2, or string_3.
Now, I want to add another column and fill it with numeric values that correspond to the strings.
I created a dictionary
d = { 'string_1' : 1, 'string_2' : 2, 'string_3': 3}
I then initialized the new column:
df['B'] = pd.Series(index=df.index)
Now, I want to fill it with the integer values. I can look up the value associated with each string in the dictionary with:
for s in df['A']:
    n = d[s]
That works fine, but assigning plain df['B'] = n inside the for-loop doesn't fill the new column correctly, and I haven't been able to figure out the right indexing in pandas.
If I understand you correctly, you can just call map:
df['B'] = df['A'].map(d)
This will perform the lookup and fill the values you are looking for.
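A minimal, self-contained sketch of this on made-up data (using the dictionary from the question):

import pandas as pd

d = {'string_1': 1, 'string_2': 2, 'string_3': 3}
df = pd.DataFrame({'A': ['string_1', 'string_3', 'string_2', 'string_1']})
df['B'] = df['A'].map(d)
print(df)  # column B contains 1, 3, 2, 1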
Rather than initializing an empty column first, you can simply populate it with apply:
df['B'] = df['A'].apply(d.get)