How can I assign a row with a PySpark DataFrame? - python

Can you please convert this expression below from Pandas to a PySpark DataFrame? I am trying to find the equivalent of loc in PySpark.
import pandas as pd
df3 = pd.DataFrame(columns=["Devices","months"])
new_entry = {'Devices': 'device1', 'months': 'month1'}
df3.loc[len(df3)] = new_entry

In PySpark you need a union to add a new row to an existing data frame. But Spark data frames are unordered and have no index as in pandas, so there is no exact equivalent of loc. For the example you gave, you would write it like this in PySpark:
from pyspark.sql.types import *
schema = StructType([
    StructField('Devices', StringType(), True),
    StructField('months', StringType(), True)  # 'month1' is a string, not a timestamp
])
df = spark.createDataFrame(sc.emptyRDD(), schema)
new_row_df = spark.createDataFrame([("device1", "month1")], ["Devices", "months"])
# or using spark.sql()
# new_row_df = spark.sql("select 'device1' as Devices, 'month1' as months")
df = df.union(new_row_df)
df.show()
#+-------+------+
#|Devices|months|
#+-------+------+
#|device1|month1|
#+-------+------+
If you want to add a row at a "specific position", you can create an index column with the row_number function by defining an ordering, filter out the row number you want to assign the new row to, and then union:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))
# df.loc[1] = ...
df = df.filter("rn <> 1").drop("rn").union(new_row_df)
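Putting the two steps together, here is a minimal runnable sketch of the "replace the row at position 1" idea (the device9/month9 values are just placeholders, and an active SparkSession named spark is assumed):
from pyspark.sql import functions as F
from pyspark.sql import Window

df = spark.createDataFrame(
    [("device1", "month1"), ("device2", "month2")], ["Devices", "months"])
new_row_df = spark.createDataFrame([("device9", "month9")], ["Devices", "months"])

# Build a deterministic row index from an explicit ordering
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))

# Drop the row currently at position 1, then append the replacement
df = df.filter("rn <> 1").drop("rn").union(new_row_df)
df.show()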

Related

How to save pyspark 'for' loop output as a single dataframe?

I have a basic 'for' loop that shows the number of active customers each year. I can print the output, but I want the output to be a single table/dataframe with 2 columns (year and number of customers), where each iteration of the loop creates one row in the table:
for yr in range(2018, 2023):
    print(yr, df.filter(year(col('first_sale')) <= yr).count())
I was able to solve it by creating a blank dataframe with the desired schema outside the loop and using union, but I am still curious if there's a shorter solution:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([StructField("year", IntegerType(), True), StructField("customer_count", IntegerType(), True)])
df2 = spark.createDataFrame([], schema=schema)
for yr in range(2018, 2023):
    c1 = yr
    c2 = df.filter(year(col('first_sale')) <= yr).count()
    newRow = spark.createDataFrame([(c1, c2)], schema)
    df2 = df2.union(newRow)
display(df2)
I don't have your data, so I can't test if this works, but how about something like this:
from pyspark.sql.functions import col, year

year_col = year(col('first_sale')).alias('year')
grp = df.groupby(year_col).count().toPandas().sort_values('year').reset_index(drop=True)
grp['cumsum'] = grp['count'].cumsum()
The view grp[['year', 'cumsum']] should be the same as your for-loop.
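If you would rather keep everything in Spark, a window-based cumulative sum should give the same result (a sketch, assuming first_sale is a date/timestamp column as in your filter):
from pyspark.sql import functions as F
from pyspark.sql import Window

# Customers gained per year, then a running total ordered by year
per_year = df.groupBy(F.year('first_sale').alias('year')).count()
w = Window.orderBy('year').rowsBetween(Window.unboundedPreceding, Window.currentRow)
result = (per_year
          .withColumn('customer_count', F.sum('count').over(w))
          .filter(F.col('year').between(2018, 2022))  # keep only the years from the loop
          .select('year', 'customer_count'))
result.show()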

Create new column with max value based on filtered rows with groupby in pyspark

I have a spark dataframe
import pandas as pd
foo = pd.DataFrame({'id': [1,1,2,2,2], 'col': ['a','b','a','a','b'], 'value': [1,5,2,3,4],
                    'col_b': ['a','c','a','a','c']})
I want to create a new column with the max of the value column, grouped by id. But I want the max computed only over the rows where col == col_b.
My resulting spark dataframe should look like this:
foo = pd.DataFrame({'id': [1,1,2,2,2], 'col': ['a','b','a','a','b'], 'value': [1,5,2,3,4],
                    'max_value': [1,1,3,3,3], 'col_b': ['a','c','a','a','c']})
I have tried
from pyspark.sql import functions as f
from pyspark.sql.window import Window
w = Window.partitionBy('id')
foo = foo.withColumn('max_value', f.max('value').over(w))\
    .where(f.col('col') == f.col('col_b'))
But I end up losing some rows.
Any ideas?
Use the when function to make the max aggregation conditional:
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window.partitionBy('id')
foo = foo.withColumn('max_value', F.max(F.when(F.col('col') == F.col('col_b'), F.col('value'))).over(w))
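For completeness, a minimal end-to-end sketch using the sample data from the question (an active SparkSession named spark is assumed):
import pandas as pd
from pyspark.sql import Window
from pyspark.sql import functions as F

pdf = pd.DataFrame({'id': [1,1,2,2,2], 'col': ['a','b','a','a','b'], 'value': [1,5,2,3,4],
                    'col_b': ['a','c','a','a','c']})
foo = spark.createDataFrame(pdf)

w = Window.partitionBy('id')
# Rows where col != col_b contribute null to the max, so they are ignored but not dropped
foo = foo.withColumn('max_value',
                     F.max(F.when(F.col('col') == F.col('col_b'), F.col('value'))).over(w))
foo.show()
# max_value is 1 for id 1 and 3 for id 2, and all five rows are kept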

Append pandas dataframe to existing table in databricks

I want to append a pandas dataframe (8 columns) to an existing table in databricks (12 columns) and fill the 4 columns that can't be matched with None values. Here is what I've tried:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("append").insertInto("my_table")
It threw the error:
ParseException: "\nmismatched input ':' expecting (line 1, pos 4)\n\n== SQL ==\n my_table
Looks like spark can't handle this operation with unmatched columns. Is there any way to achieve what I want?
I think that the most natural course of action would be a select() transformation to add the missing columns to the 8-column dataframe, followed by a unionAll() transformation to merge the two.
from pyspark.sql import Row
from pyspark.sql.functions import lit
bigrow = Row(a='foo', b='bar')
bigdf = spark.createDataFrame([bigrow])
smallrow = Row(a='foobar')
smalldf = spark.createDataFrame([smallrow])
fitdf = smalldf.select(smalldf.a, lit(None).alias('b'))
uniondf = bigdf.unionAll(fitdf)
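On this toy data, uniondf.show() should print something like:
#+------+----+
#|     a|   b|
#+------+----+
#|   foo| bar|
#|foobar|null|
#+------+----+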
Can you try this:
from pyspark.sql import functions as F

df = spark.createDataFrame(pandas_df)
df_table_struct = spark.sql('select * from my_table limit 0')
for col in set(df_table_struct.columns) - set(df.columns):
    df = df.withColumn(col, F.lit(None))
df_table_struct = df_table_struct.unionByName(df)
df_table_struct.write.saveAsTable('my_table', mode='append')
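If you are on Spark 3.1 or later, unionByName can fill in the missing columns with nulls for you, which shortens this to (a sketch, reusing the names above):
df = spark.createDataFrame(pandas_df)
# Empty frame carrying the full 12-column schema of the target table
target = spark.table('my_table').limit(0)
aligned = target.unionByName(df, allowMissingColumns=True)
aligned.write.mode('append').saveAsTable('my_table')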

pyspark dataframe "condition should be string or Column"

I am unable to use a filter on a data frame. I keep getting the error "TypeError("condition should be string or Column")".
I have tried changing the filter to use a col object. Still, it does not work.
path = 'dbfs:/FileStore/tables/TravelData.txt'
data = spark.read.text(path)
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
schema = StructType([
    StructField("fromLocation", StringType(), True),
    StructField("toLocation", StringType(), True),
    StructField("productType", IntegerType(), True)
])
df = spark.read.option("delimiter", "\t").csv(path, header=False, schema=schema)
from pyspark.sql.functions import col
answerthree = df.select("toLocation").groupBy("toLocation").count().sort("count", ascending=False).take(10) # works fine
display(answerthree)
I added a filter to the variable "answerthree" as follows:
answerthree = df.select("toLocation").groupBy("toLocation").count().filter(col("productType")==1).sort("count", ascending=False).take(10)
It throws errors as follows:
"cannot resolve 'productType' given input columns"
"condition should be string or Column"
In gist, I am trying to solve problem 3 given in the link below using pyspark instead of scala. The dataset is also provided in the URL below.
https://acadgild.com/blog/spark-use-case-travel-data-analysis?fbclid=IwAR0fgLr-8aHVBsSO_yWNzeyh7CoiGraFEGddahDmDixic6wmumFwUlLgQ2c
I should be able to get the desired result only for productType value 1
The productType column is dropped by the select/groupBy step, which is why Spark cannot resolve it; the filter has to be applied before that step. The easiest is to use a string condition:
answerthree = df.filter("productType = 1")\
    .select("toLocation").groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)
Alternatively, you can use a column-based filter:
answerthree = df.filter(df['productType'] == 1)\
    .select("toLocation").groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)

How to modify/transform the column of a dataframe?

I have an instance of pyspark.sql.dataframe.DataFrame created using
dataframe = sqlContext.sql("select * from table").
One column is 'arrival_date' and contains a string.
How do I modify this column so as to only take the first 4 characters from it and throw away the rest?
How would I convert the type of this column from string to date?
In graphlab.SFrame, this would be:
dataframe['column_name'] = dataframe['column_name'].apply(lambda x: x[:4] )
and
dataframe['column_name'] = dataframe['column_name'].str_to_datetime()
As stated by Orions, you can't modify a column, but you can overwrite it. Also, you shouldn't need to create a user-defined function, as there is a built-in function for extracting substrings:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", df['arrival_date'].substr(0, 4))
To convert it to date, you can use to_date, as Orions said:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4)))
However, if you need to specify the format, you should use unix_timestamp:
from pyspark.sql.functions import *
format = 'yyMM'
col = unix_timestamp(df['arrival_date'].substr(0, 4), format).cast('timestamp')
df = df.withColumn("arrival_date", col)
All this can be found in the pyspark documentation.
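As a side note, on Spark 2.2+ to_date also accepts a format string directly, which avoids the unix_timestamp round-trip (the 'yyMM' format is the same assumption as above):
from pyspark.sql.functions import to_date
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4), 'yyMM'))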
To extract the first 4 characters from the arrival_date (StringType) column, create a new_df by using UserDefinedFunction (as you cannot modify the columns: they are immutable):
from pyspark.sql.functions import UserDefinedFunction, to_date
from pyspark.sql.types import StringType

old_df = spark.sql("SELECT * FROM table")
udf = UserDefinedFunction(lambda x: str(x)[:4], StringType())
new_df = old_df.select(*[udf(column).alias('arrival_date') if column == 'arrival_date' else column for column in old_df.columns])
And to convert the arrival_date (StringType) column into a DateType column, use the to_date function as shown below:
new_df = old_df.select(old_df.other_cols_if_any, to_date(old_df.arrival_date).alias('arrival_date'))
Sources:
https://stackoverflow.com/a/29257220/2873538
https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html
