PySpark create a json string by combining columns

I have a dataframe.
from pyspark.sql.types import *
input_schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("code", StringType(), True),
])
input_data = [
    ("1", "2021-12-01", "a"),
    ("2", "2021-12-01", "b"),
]
input_df = spark.createDataFrame(data=input_data, schema=input_schema)
I would like to perform a transformation that combines a set of columns into a JSON string. The columns to be combined are known ahead of time. The output should look like the example below.
Is there any suggested method to achieve this?
Appreciate any help on this.

You can create a struct type and then convert to json:
from pyspark.sql import functions as F
col_to_combine = ['Date','code']
output = input_df.withColumn('combined', F.to_json(F.struct(*col_to_combine)))\
    .drop(*col_to_combine)
output.show(truncate=False)
+---+--------------------------------+
|ID |combined                        |
+---+--------------------------------+
|1  |{"Date":"2021-12-01","code":"a"}|
|2  |{"Date":"2021-12-01","code":"b"}|
+---+--------------------------------+
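Should you later need the original columns back, from_json can reverse the transformation. A minimal sketch, assuming Spark 2.3+ (where from_json accepts a DDL schema string) and the same two string columns:
from pyspark.sql import functions as F

# DDL schema describing the fields packed into the JSON string
json_schema = "Date STRING, code STRING"

restored = output.withColumn("parsed", F.from_json("combined", json_schema)) \
    .select("ID", "parsed.Date", "parsed.code")
restored.show(truncate=False)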

Related

How to convert a string received from one dataframe to another dataframe

I have the following data stored in a parquet file, which I extract into a variable using collect:
mapping_in_parquet = [('filial','filial','S','string'),('doc','numero_do_documento','S','string'),('serie','serie_do_documento','S','string')]
mapping = (df.select('mapping').distinct().collect()[0][0])
The problem is when I try to convert the string back to a dataframe:
schema = StructType([
    StructField("fieldName", StringType(), True),
    StructField("alias", StringType(), True),
    StructField("column_active", StringType(), True),
    StructField("typeField", StringType(), True)])
df = (spark.createDataFrame(mapping, schema))
print(mapping)
I get the error:
StructType can not accept object '[' in type
When running the code directly in the console I don't get any error. The error only occurs when I extract the column value into a variable.
There are two issues in your code:
First, there is a typo while creating the dataframe: use df = (spark.createDataFrame(mapping_in_parquet, schema)).
Second, in mapping = (df.select('mapping').distinct().collect()[0][0]) there is no column named mapping; use one of fieldName, alias, column_active or typeField.
Full example code with the corrections:
mapping_in_parquet = [('filial','filial','S','string'),('doc','numero_do_documento','S','string'),('serie','serie_do_documento','S','string')]
schema = StructType([
    StructField("fieldName", StringType(), True),
    StructField("alias", StringType(), True),
    StructField("column_active", StringType(), True),
    StructField("typeField", StringType(), True)])
df = (spark.createDataFrame(mapping_in_parquet, schema))
mapping = (df.select('fieldName').distinct().collect()[0][0])
print(mapping)
[Out]:
doc
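That said, if the value collected into mapping is actually a string such as "[('filial', ...)]" rather than a Python list, createDataFrame will iterate over its characters, which is exactly what produces the "StructType can not accept object '['" error. A sketch under that assumption parses the literal first:
import ast

# mapping was collected as a string, e.g. "[('filial','filial','S','string'), ...]"
parsed_mapping = ast.literal_eval(mapping)  # back to a list of tuples
df = spark.createDataFrame(parsed_mapping, schema)
df.show()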

How to calculate number of times an element has 2 fields the same after an RDD Join (Spark)

I started with two RDDs: one with the userID and the "SHL ..." value, and one with the userID and the rest of the information.
So, after joining the two RDDs I now have data in this format:
(u'5839477', (u'SHL UNRESTRICTED', (u'AGBAMA,JAMES', u'MEDALLION TAXI DRIVER', u'12/27/2020', u'08/22/2019', u'13:20')))
The first field is the userID and the second is information about them. I need to see how many users have both "SHL UNRESTRICTED" and 'MEDALLION TAXI DRIVER'. I believe I should perhaps have reformatted the data right after the .join. The issue I'm having is accessing the specific fields within that second element.
This would be easier to express with DataFrames, available since Spark 1.3. Is there a specific reason you're using RDDs?
If not, start working with DataFrames from the start, or convert your existing RDD to a DataFrame by specifying a schema, like this:
>>> rdd = spark.sparkContext.parallelize([("5839477", ("SHL UNRESTRICTED", (u'AGBAMA,JAMES', u'MEDALLION TAXI DRIVER',u'12/27/2020', u'08/22/2019', u'13:20')))])
>>> from pyspark.sql.types import *
>>> schema = StructType([
... StructField("user_id", StringType(), True),
... StructField("details", StructType([
... StructField("restrictions", StringType(), True),
... StructField("more_details", StructType([
... StructField("name", StringType(), True),
... StructField("function", StringType(), True),
... StructField("date1", StringType(), True),
... StructField("date2", StringType(), True),
... StructField("time_of_day", StringType(), True)
... ]), True) ]), True)])
...
>>> df = rdd.toDF(schema=schema)
>>> df.filter(
... (df.details.restrictions == "SHL UNRESTRICTED")
... & (df.details.more_details.function == "MEDALLION TAXI DRIVER")
... ).count()
...
1
Additionally, you could consider flattening the structure, so you won’t need to dig inside the nested columns.
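A minimal sketch of such a flattening, based on the schema above (my addition, not part of the original answer):
# promote the nested fields to top-level columns
flat_df = df.select(
    "user_id",
    df.details.restrictions.alias("restrictions"),
    "details.more_details.*",  # expands name, function, date1, date2, time_of_day
)
flat_df.filter(
    (flat_df.restrictions == "SHL UNRESTRICTED")
    & (flat_df.function == "MEDALLION TAXI DRIVER")
).count()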
Alternatively, if you really must stay with the RDD, then you would access the elements as if they were nested collections:
>>> rdd.filter(lambda x: (x[1][0] == "SHL UNRESTRICTED")
... and (x[1][1][1] == "MEDALLION TAXI DRIVER")
... ).count()
...
1
Notice how much less expressive this code is (what meaning does x[1][1][1] have to the reader of your code?). I definitely recommend adding names to your data structures.
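For instance, a sketch of adding those names while staying on RDDs (assuming the tuple layout shown above; not part of the original answer) maps each element into a Row:
from pyspark.sql import Row

named = rdd.map(lambda x: Row(
    user_id=x[0],
    restrictions=x[1][0],
    function=x[1][1][1],
))
named.filter(lambda r: r.restrictions == "SHL UNRESTRICTED"
             and r.function == "MEDALLION TAXI DRIVER").count()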

How to convert multiple columns, i.e. time, year, month and date, into datetime format in a pyspark dataframe

The data frame has 4 columns: year, month, date, hhmm.
hhmm is the hour and minute concatenated, e.g. 10:30 is represented as 1030.
dd=spark.createDataFrame([(2019,2,13,1030),(2018,2,14,1000),(2029,12,13,0300)],["Year","month","date","hhmm"])
dd.collect()
Expected output in datetime format in the pyspark dataframe dd:
dd.collect()
2019-02-13 10:30:00
2018-2-14 10:00:00
2019-12-13 03:00:00
For Spark 3+, you can use the make_timestamp function:
from pyspark.sql import functions as F
dd = dd.withColumn(
    "time",
    F.expr("make_timestamp(Year, month, date, substr(hhmm,1,2), substr(hhmm,3,2), 0)")
)
dd.show(truncate=False)
#+----+-----+----+----+-------------------+
#|Year|month|date|hhmm|time               |
#+----+-----+----+----+-------------------+
#|2019|2    |13  |1030|2019-02-13 10:30:00|
#|2018|2    |14  |1000|2018-02-14 10:00:00|
#|2029|12   |13  |0300|2029-12-13 03:00:00|
#+----+-----+----+----+-------------------+
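Note that substr assumes hhmm is already a zero-padded 4-character string; if hhmm was loaded as an integer (e.g. 300 instead of '0300'), one extra line (my addition, using lpad) normalises it first:
dd = dd.withColumn("hhmm", F.lpad(F.col("hhmm").cast("string"), 4, "0"))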
There is a problem with your data: the integer 0300 will not load in the desired format. A leading-zero literal like 0300 is octal in Python 2 (which is why it loaded as 192 for me) and a syntax error in Python 3. So first you have to load hhmm as a string; you just need to assign the data types with a schema when doing the load (refer to the documentation). E.g. for a .csv:
from pyspark.sql.types import *
schema = StructType([StructField("Year", StringType(), True), StructField("month", StringType(), True), StructField("date", StringType(), True), StructField("hhmm", StringType(), True)])
dd = spark.read.csv('your/data/path', schema=schema)
After that you need to fix the data format and convert it to timestamp:
from pyspark.sql import functions as F
dd = spark.createDataFrame([('2019','2','13','1030'),('2018','2','14','1000'),('2029','12','13','300')],["Year","month","date","hhmm"])
# zero-pad month, date and hhmm so the concatenated value is always a yyyyMMddHHmm string
dd = (dd.withColumn('month', F.when(F.length(F.col('month')) == 1, F.concat(F.lit('0'), F.col('month'))).otherwise(F.col('month')))
        .withColumn('date', F.when(F.length(F.col('date')) == 1, F.concat(F.lit('0'), F.col('date'))).otherwise(F.col('date')))
        .withColumn('hhmm', F.when(F.length(F.col('hhmm')) == 1, F.concat(F.lit('000'), F.col('hhmm')))
                             .when(F.length(F.col('hhmm')) == 2, F.concat(F.lit('00'), F.col('hhmm')))
                             .when(F.length(F.col('hhmm')) == 3, F.concat(F.lit('0'), F.col('hhmm')))
                             .otherwise(F.col('hhmm')))
        # dd.columns is ['Year', 'month', 'date', 'hhmm']
        .withColumn('time', F.to_timestamp(F.concat(*dd.columns), format='yyyyMMddHHmm'))
)
dd.show()
+----+-----+----+----+-------------------+
|Year|month|date|hhmm|               time|
+----+-----+----+----+-------------------+
|2019|   02|  13|1030|2019-02-13 10:30:00|
|2018|   02|  14|1000|2018-02-14 10:00:00|
|2029|   12|  13|0300|2029-12-13 03:00:00|
+----+-----+----+----+-------------------+

pyspark dataframe "condition should be string or Column"

I am unable to use a filter on a data frame. I keep getting the error TypeError("condition should be string or Column").
I have tried changing the filter to use a col object. Still, it does not work.
path = 'dbfs:/FileStore/tables/TravelData.txt'
data = spark.read.text(path)
from pyspark.sql.types import StructType, StructField, IntegerType , StringType, DoubleType
schema = StructType([
    StructField("fromLocation", StringType(), True),
    StructField("toLocation", StringType(), True),
    StructField("productType", IntegerType(), True)
])
df = spark.read.option("delimiter", "\t").csv(path, header=False, schema=schema)
from pyspark.sql.functions import col
answerthree = df.select("toLocation").groupBy("toLocation").count().sort("count", ascending=False).take(10) # works fine
display(answerthree)
I add a filter to variable "answerthree" as follows:
answerthree = df.select("toLocation").groupBy("toLocation").count().filter(col("productType")==1).sort("count", ascending=False).take(10)
It throws the following errors:
"cannot resolve 'productType' given input columns" and "condition should be string or Column"
In short, I am trying to solve problem 3 from the link below using pyspark instead of scala. The dataset is also provided at the URL below.
https://acadgild.com/blog/spark-use-case-travel-data-analysis?fbclid=IwAR0fgLr-8aHVBsSO_yWNzeyh7CoiGraFEGddahDmDixic6wmumFwUlLgQ2c
I should be able to get the desired result only for productType value 1
As you don't have a variable referencing the intermediate data frame, the easiest is to use a string condition. Note that the filter has to come before the select/groupBy/count, because productType is dropped by that aggregation, which is exactly what the "cannot resolve 'productType'" error is telling you:
answerthree = df.filter("productType = 1")\
    .select("toLocation").groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)
Alternatively, you can keep a reference to the data frame and use a column-based filter:
filtered_df = df.filter(df['productType'] == 1)
answerthree = filtered_df.select("toLocation").groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)
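If you prefer SQL, an equivalent sketch (the travel view name is my own) registers a temp view and pushes the whole query into spark.sql:
df.createOrReplaceTempView("travel")
answerthree = spark.sql("""
    SELECT toLocation, COUNT(*) AS count
    FROM travel
    WHERE productType = 1
    GROUP BY toLocation
    ORDER BY count DESC
    LIMIT 10
""").collect()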

How to remove utf format in a dataframe in Pyspark and convert column from string to Integer

I need to remove utf formatting and convert a column to Integer type.
Below is what I have done to remove the utf format
>>auction_data = auction_raw_data.map(lambda line: line.encode("ascii","ignore").split(","))
>>auction_data.take(2)
>>[['8211480551', '52.99', '1.201505', 'hanna1104', '94', '49.99', '311.6'], ['8211480551', '50.99', '1.203843', 'wrufai1', '90', '49.99', '311.6']]
But when I create a dataframe for the same data using the schema and try to retrieve particular data, the values come back prefixed with u'.
>>schema = StructType([ StructField("auctionid", StringType(), True),
StructField("bid", StringType(), True),
StructField("bidtime", StringType(), True),
StructField("bidder", StringType(), True),
StructField("bidderrate", StringType(), True),
StructField("openbid", StringType(), True),
StructField("price", StringType(), True)])`
>>xbox_df = sqlContext.createDataFrame(auction_data,schema)
>>xbox_df.registerTempTable("auction")
>>first_line = sqlContext.sql("select * from auction where auctionid=8211480551").collect()
>>for i in first_line:
>> print i
>>Row(auctionid=u'8211480551', bid=u'52.99', bidtime=u'1.201505', bidder=u'hanna1104', bidderrate=u'94', openbid=u'49.99', price=u'311.6')
>>Row(auctionid=u'8211480551', bid=u'50.99', bidtime=u'1.203843', bidder=u'wrufai1', bidderrate=u'90', openbid=u'49.99', price=u'311.6')
How do I remove the u' in front of the values? I also want to convert the bid value into an integer. When I change the type directly in the schema definition, I get an error saying
"TypeError: IntegerType can not accept object in type".
I am loading a JSON and not using a schema, so I don't know if there's a difference. I have no issues converting fields to int when using select. This is what I do:
from pyspark.sql.functions import *
...
df = df.select(col('intField').cast('int'))
df.show()
# the intField column now has integer type
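Applied to the bid column from the question, a sketch could look like this (bid values such as '52.99' are decimals, so casting to double fits better than int):
from pyspark.sql.functions import col

xbox_df = xbox_df.withColumn("bid", col("bid").cast("double"))
xbox_df.printSchema()  # bid is now double
The u'' prefix, by the way, is just Python 2's repr of unicode strings; it is not part of the stored data.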
