I am unable to use a filter on a DataFrame; I keep getting the error "TypeError: condition should be string or Column".
I have tried changing the filter to use a col object, but it still does not work.
path = 'dbfs:/FileStore/tables/TravelData.txt'
data = spark.read.text(path)
from pyspark.sql.types import StructType, StructField, IntegerType , StringType, DoubleType
schema = StructType([
StructField("fromLocation", StringType(), True),
StructField("toLocation", StringType(), True),
StructField("productType", IntegerType(), True)
])
df = spark.read.option("delimiter", "\t").csv(path, header=False, schema=schema)
from pyspark.sql.functions import col
answerthree = df.select("toLocation").groupBy("toLocation").count().sort("count", ascending=False).take(10) # works fine
display(answerthree)
I add a filter to variable "answerthree" as follows:
answerthree = df.select("toLocation").groupBy("toLocation").count().filter(col("productType")==1).sort("count", ascending=False).take(10)
It throws the following errors:
"cannot resolve 'productType' given input columns" and "condition should be string or Column"
In gist, I am trying to solve problem 3 from the link below using PySpark instead of Scala. The dataset is also provided at that URL.
https://acadgild.com/blog/spark-use-case-travel-data-analysis?fbclid=IwAR0fgLr-8aHVBsSO_yWNzeyh7CoiGraFEGddahDmDixic6wmumFwUlLgQ2c
I should get the desired result only for rows with productType value 1.
The second part of the error ("cannot resolve 'productType'") appears because the filter runs after the aggregation: once you select("toLocation") and group, productType is no longer among the columns, so apply the filter before aggregating. As you don't have a variable referencing the data frame, the easiest is to use a string condition:
answerthree = df.filter("productType = 1")\
    .select("toLocation")\
    .groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)
Alternatively, you can use a data frame variable and a column-based filter:
filtered_df = df.filter(df['productType'] == 1)
answerthree = filtered_df.select("toLocation")\
    .groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)
Related
I have a basic for loop that shows the number of active customers each year. I can print the output, but I want it as a single table/DataFrame with two columns, year and number of customers, where each iteration of the loop creates one row in the table:
for yr in range(2018, 2023):
    print(yr, df.filter(year(col('first_sale')) <= yr).count())
I was able to solve it by creating a blank DataFrame with the desired schema outside the loop and using union, but I'm still curious whether there's a shorter solution:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField("year", IntegerType(), True), StructField("customer_count", IntegerType(), True)])
df2 = spark.createDataFrame([], schema=schema)
for yr in range(2018, 2023):
    c1 = yr
    c2 = df.filter(year(col('first_sale')) <= yr).count()
    newRow = spark.createDataFrame([(c1, c2)], schema)
    df2 = df2.union(newRow)
display(df2)
I don't have your data, so I can't test if this works, but how about something like this:
from pyspark.sql.functions import year, col

# Count customers per first-sale year, then take a running total in pandas
year_col = year(col('first_sale')).alias('year')
grp = df.groupby(year_col).count().toPandas().sort_values('year').reset_index(drop=True)
grp['cumsum'] = grp['count'].cumsum()
The view grp[['year', 'cumsum']] should match the output of your for loop.
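If you prefer to stay entirely in Spark, the same running total can be computed with a window function instead of going through toPandas(); a minimal sketch, assuming df has a first_sale column as above:
from pyspark.sql import functions as F
from pyspark.sql import Window

# Count customers per first-sale year, then accumulate the counts in year order.
# An un-partitioned window pulls everything into one partition, which is fine for a handful of years.
w = Window.orderBy('year').rowsBetween(Window.unboundedPreceding, Window.currentRow)
grp = (df.groupBy(F.year('first_sale').alias('year')).count()
         .withColumn('customer_count', F.sum('count').over(w))
         .select('year', 'customer_count'))
grp.show()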
I have a dataframe.
from pyspark.sql.types import *
input_schema = StructType(
[
StructField("ID", StringType(), True),
StructField("Date", StringType(), True),
StructField("code", StringType(), True),
])
input_data = [
("1", "2021-12-01", "a"),
("2", "2021-12-01", "b"),
]
input_df = spark.createDataFrame(data=input_data, schema=input_schema)
I would like to perform a transformation that combines a set of columns into a JSON string. The columns to be combined are known ahead of time. The output should look something like the example below.
Is there any suggested method to achieve this?
Appreciate any help on this.
You can create a struct type and then convert to json:
from pyspark.sql import functions as F
col_to_combine = ['Date','code']
output = input_df.withColumn('combined',F.to_json(F.struct(*col_to_combine)))\
.drop(*col_to_combine)
output.show(truncate=False)
+---+--------------------------------+
|ID |combined |
+---+--------------------------------+
|1 |{"Date":"2021-12-01","code":"a"}|
|2 |{"Date":"2021-12-01","code":"b"}|
+---+--------------------------------+
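If you later need to unpack the combined column again, from_json with a matching schema reverses the transformation; a minimal sketch, assuming the output DataFrame shown above:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
    StructField("Date", StringType(), True),
    StructField("code", StringType(), True),
])

# Parse the JSON string back into a struct, then flatten the struct into columns
restored = output.withColumn('parsed', F.from_json('combined', json_schema))\
                 .select('ID', 'parsed.*')
restored.show()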
I have to run a script that takes a few arguments as input and returns some results as output, so I first developed it on my local machine, where it works fine, and my goal now is to run it in Databricks in order to parallelize it.
The issue comes when I try to parallelize it. I'm reading the data from an already-mounted Data Lake (the issue is not there, as I'm able to print the DataFrame after reading it), transforming it to a Spark DataFrame, and passing the rows to the main function grouped by material:
import pandas as pd
import os
import numpy as np
import scipy.stats as stats
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType

# Pandas UDF output schema
schema = StructType([StructField('Material', IntegerType(), True),
                     StructField('Alpha', IntegerType(), True),
                     StructField('Beta', IntegerType(), True),
                     StructField('Sales', IntegerType(), True),
                     StructField('SL', FloatType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def main(data):
    material = data['Material'].iloc[0]
    print(material)       # <-------- THIS IS NOT PRINTING
    print('Hello world')  # <-------- NEITHER IS THIS
    start = data['start '].iloc[0]
    end = data['end '].iloc[0]
    mu_lt = data['mu_lt'].iloc[0]
    sigma_lt = data['sigma_lt'].iloc[0]
    df = pd.DataFrame(columns=('Material', 'Alpha', 'Beta', 'Sales', 'SL'))
    for beta in range(1, 2):
        for alpha in range(3, 5):
            # Do stuff
            pass
    return df

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
    params = pd.read_csv('/dbfs/mnt/input/params_input.csv')
    params_spark = spark.createDataFrame(params)
    params_spark.groupby('Material').apply(main).show()
I'm not sure whether I'm passing the DataFrame to the main function correctly or even declaring the UDF right, but none of the prints nor the DataFrame built inside main seem to be produced. The code throws no error, but no output is returned either.
Try this:
@pandas_udf('y int, ds int, store_id string, product_id string, log string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    return pd.DataFrame([[3, 5, 'store123', 'product123', 'My log message']],
                        columns=['y', 'ds', 'store_id', 'product_id', 'log'])
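A note on why the original prints never show up: the body of a grouped-map UDF runs inside the Python workers on the executors, so anything printed there goes to the executor stdout/logs rather than the driver notebook; returning diagnostic text as a column, like the log column above, is one way to get it back. A minimal usage sketch with the stub UDF, reusing the Material grouping from the question:
# Each group's rows arrive as a pandas DataFrame; the rows of the returned DataFrame
# become rows of the Spark result, so the 'log' column travels back to the driver.
result = params_spark.groupby('Material').apply(train_predict)
result.show(truncate=False)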
Can you please convert the expression below from pandas to a PySpark DataFrame? I am trying to find the equivalent of loc in PySpark.
import pandas as pd
df3 = pd.DataFrame(columns=["Devices","months"])
new_entry = {'Devices': 'device1', 'months': 'month1'}
df3.loc[len(df3)] = new_entry
In PySpark you need a union to add a new row to an existing data frame. But Spark data frames are unordered and there is no index as in pandas, so there is no exact equivalent. For the example you gave, you would write it like this in PySpark:
from pyspark.sql.types import *

schema = StructType([
    StructField('Devices', StringType(), True),
    StructField('months', StringType(), True)
])
df = spark.createDataFrame(sc.emptyRDD(), schema)
new_row_df = spark.createDataFrame([("device1", "month1")], ["Devices", "months"])
# or using spark.sql()
# new_row_df = spark.sql("select 'device1' as Devices, 'month1' as months")
df = df.union(new_row_df)
df.show()
#+-------+------+
#|Devices|months|
#+-------+------+
#|device1|month1|
#+-------+------+
If you want to add a row at a "specific position", you can create an index column using, for example, the row_number function over a defined ordering, then filter out the row number you want to assign the new row to before doing the union:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))
# df.loc[1] = ...
df = df.filter("rn <> 1").drop("rn").union(new_row_df)
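A worked sketch of that replace-at-position idea (the device rows here are made up for illustration, and ordering by Devices is assumed to define the positions):
from pyspark.sql import functions as F
from pyspark.sql import Window

# Two existing rows; ordered by Devices, deviceA sits at position 1
df = spark.createDataFrame([("deviceA", "month1"), ("deviceB", "month2")], ["Devices", "months"])
new_row_df = spark.createDataFrame([("deviceZ", "month9")], ["Devices", "months"])

# Number the rows, drop the one at position 1, then union in the new row
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))\
       .filter("rn <> 1").drop("rn")\
       .union(new_row_df)
df.show()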
I need to remove the unicode (u'') formatting and convert a column to integer type.
Below is what I have done to remove the unicode formatting:
>>auction_data = auction_raw_data.map(lambda line: line.encode("ascii","ignore").split(","))
>>auction_data.take(2)
>>[['8211480551', '52.99', '1.201505', 'hanna1104', '94', '49.99', '311.6'], ['8211480551', '50.99', '1.203843', 'wrufai1', '90', '49.99', '311.6']]
But when I create a DataFrame for the same data using the schema and try to retrieve particular rows, I get the values prefixed with u'.
>>schema = StructType([ StructField("auctionid", StringType(), True),
StructField("bid", StringType(), True),
StructField("bidtime", StringType(), True),
StructField("bidder", StringType(), True),
StructField("bidderrate", StringType(), True),
StructField("openbid", StringType(), True),
StructField("price", StringType(), True)])`
>>xbox_df = sqlContext.createDataFrame(auction_data,schema)
>>xbox_df.registerTempTable("auction")
>>first_line = sqlContext.sql("select * from auction where auctionid=8211480551").collect()
>>for i in first_line:
>> print i
>>Row(auctionid=u'8211480551', bid=u'52.99', bidtime=u'1.201505', bidder=u'hanna1104', bidderrate=u'94', openbid=u'49.99', price=u'311.6')
>>Row(auctionid=u'8211480551', bid=u'50.99', bidtime=u'1.203843', bidder=u'wrufai1', bidderrate=u'90', openbid=u'49.99', price=u'311.6')
How do I remove the u' in front of the values? I also want to convert the bid value into an integer. When I change the type directly in the schema definition, I get an error saying
"TypeError: IntegerType can not accept object in type ..."
I am loading JSON and not using a schema, so I don't know if there's a difference. I have no issues converting fields to int when using select. This is what I do:
from pyspark.sql.functions import *
...
df = df.select(col('intField').cast('int'))
df.show()
# intField is now an integer column; collect() returns e.g. Row(intField=123) instead of Row(intField=u'123')
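For the original bid column, a withColumn cast keeps all the other columns in place; a minimal sketch, assuming the xbox_df defined above (values like '52.99' need a double rather than an int cast, and the u'' prefix is only Python 2's unicode repr, not part of the stored data):
from pyspark.sql.functions import col

# Cast the string column in place; values that don't parse as numbers become null
xbox_df = xbox_df.withColumn("bid", col("bid").cast("double"))
xbox_df.printSchema()         # bid: double
xbox_df.select("bid").show(2)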