Splitting Dataframe column containing structs into new columns - python

I have a Dataframe named df with following structure:
root
|-- country: string (nullable = true)
|-- competition: string (nullable = true)
|-- competitor: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- time: string (nullable = true)
Which looks like this:
+-------+-----------+---------------+
|country|competition|competitor     |
+-------+-----------+---------------+
|USA    |WN         |[{Adam, 9.43}] |
|China  |FN         |[{John, 9.56}] |
|China  |FN         |[{Adam, 9.48}] |
|USA    |MNU        |[{Phil, 10.02}]|
|...    |...        |...            |
+-------+-----------+---------------+
I want to pivot (or something similar) the competitor column into new columns depending on the values in each struct, so it looks like this:
+-------+-----------+----+----+-----+
|country|competition|Adam|John|Phil |
+-------+-----------+----+----+-----+
|USA    |WN         |9.43|... |...  |
|China  |FN         |9.48|9.56|...  |
|USA    |MNU        |... |... |10.02|
+-------+-----------+----+----+-----+
The names are unique, so if a column already exists I don't want to create a new one but fill the value into the existing column. There are a lot of names, so it needs to be done dynamically.
I have a pretty big dataset, so I can't use Pandas.

We can extract the struct from the array and then create new columns from that struct. The name column can be pivoted with the time as values.
from pyspark.sql import functions as func

# take the first (and only) struct out of the array, expand its fields into
# columns, then pivot the name values into columns with time as the cell value
data_sdf. \
    withColumn('name_time_struct', func.col('competitor')[0]). \
    select('country', 'competition', func.col('name_time_struct.*')). \
    groupBy('country', 'competition'). \
    pivot('name'). \
    agg(func.first('time')). \
    show()
+-------+-----------+----+----+-----+
|country|competition|Adam|John| Phil|
+-------+-----------+----+----+-----+
| USA| MNU|null|null|10.02|
| USA| WN|9.43|null| null|
| China| FN|9.48|9.56| null|
+-------+-----------+----+----+-----+
P.S. this assumes there's just one struct in the array; a variant for arrays with multiple structs is sketched below.
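If the array can hold more than one competitor per row, a minimal variant (a sketch assuming the same data_sdf and column names as above) is to explode the array before pivoting:
from pyspark.sql import functions as func

# explode() emits one row per struct in the competitor array,
# so every (name, time) pair survives the pivot
data_sdf. \
    withColumn('name_time_struct', func.explode('competitor')). \
    select('country', 'competition', func.col('name_time_struct.*')). \
    groupBy('country', 'competition'). \
    pivot('name'). \
    agg(func.first('time')). \
    show()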

Related

PySpark DataFrame schema is showing String for every column

I am reading a CSV file with the code snippet below:
df_pyspark = spark.read.csv("sample_data.csv")
df_pyspark
and when I try to print the DataFrame, the output is as shown below:
DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string]
Every column's data type is showing as 'string' even though the columns contain different data types, as below:
df_pyspark.show()
|_c0| _c1| _c2| _c3| _c4| _c5|
+---+----------+---------+--------------------+-----------+----------+
| id|first_name|last_name| email| gender| phone|
| 1| Bidget| Mirfield|bmirfield0@scient...| Female|5628618353|
| 2| Gonzalo| Vango| gvango1@ning.com| Male|9556535457|
| 3| Rock| Pampling|rpampling2@guardi...| Bigender|4472741337|
| 4| Dorella| Edelman|dedelman3@histats...| Female|4303062344|
| 5| Faber| Thwaite|fthwaite4@google....|Genderqueer|1348658809|
| 6| Debee| Philcott|dphilcott5@cafepr...| Female|7906881842|
How can I print the exact data type of every column?
Use the inferSchema parameter when reading the CSV file; it will show the correct data type according to the values in each column:
df_pyspark = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
+---+----------+---------+--------------------+-----------+----------+
| id|first_name|last_name| email| gender| phone|
+---+----------+---------+--------------------+-----------+----------+
| 1| Bidget| Mirfield|bmirfield0@scient...| Female|5628618353|
| 2| Gonzalo| Vango| gvango1@ning.com| Male|9556535457|
| 3| Rock| Pampling|rpampling2@guardi...| Bigender|4472741337|
| 4| Dorella| Edelman|dedelman3@histats...| Female|4303062344|
| 5| Faber| Thwaite|fthwaite4@google....|Genderqueer|1348658809|
+---+----------+---------+--------------------+-----------+----------+
only showing top 5 rows
df_pyspark.printSchema()
root
|-- id: integer (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- email: string (nullable = true)
|-- gender: string (nullable = true)
|-- phone: long (nullable = true)
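If you would rather not pay for schema inference (it requires an extra pass over the data), a hedged alternative is to declare the schema explicitly; the field names and types below are simply copied from the inferred schema above:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType

# explicit schema matching the inferred one above
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("phone", LongType(), True),
])

df_pyspark = spark.read.csv("sample_data.csv", header=True, schema=schema)
df_pyspark.printSchema()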

Pyspark dataframe isin function datatype conversion

I am using the isin function to filter a pyspark dataframe. Surprisingly, although the column data type (double) does not match the data type in the list (Decimal), there was a match. Can someone help me understand why this is the case?
Example
(Pdb) df.show(3)
+--------------------+---------+------------+
| employee_id|threshold|wage|
+--------------------+---------+------------+
|AAA | 0.9| 0.5|
|BBB | 0.8| 0.5|
|CCC | 0.9| 0.5|
+--------------------+---------+------------+
(Pdb) df.printSchema()
root
|-- employee_id: string (nullable = true)
|-- threshold: double (nullable = true)
|-- wage: double (nullable = true)
(Pdb) include_thresholds
[Decimal('0.8')]
(Pdb) df.count()
3267
(Pdb) df.filter(fn.col("threshold").isin(include_thresholds)).count()
1633
However, if I use the normal "in" operator to test whether 0.8 belongs to include_thresholds, it's obviously false:
(Pdb) 0.8 in include_thresholds
False
Do the col or isin functions implicitly perform a datatype conversion?
When you bring external input to Spark for comparison, the values are taken as literals and up-cast based on the context.
So what you observe with Python datatypes may not hold in Spark.
import decimal
include_thresholds=[decimal.Decimal(0.8)]
include_thresholds2=[decimal.Decimal('0.8')]
0.8 in include_thresholds # True
0.8 in include_thresholds2 # False
And, note the values
include_thresholds
[Decimal('0.8000000000000000444089209850062616169452667236328125')]
include_thresholds2
[Decimal('0.8')]
Coming to the dataframe
df = spark.sql(""" with t1 (
select 'AAA' c1, 0.9 c2, 0.5 c3 union all
select 'BBB' c1, 0.8 c2, 0.5 c3 union all
select 'CCC' c1, 0.9 c2, 0.5 c3
) select c1 employee_id, cast(c2 as double) threshold, cast(c3 as double) wage from t1
""")
df.show()
df.printSchema()
+-----------+---------+----+
|employee_id|threshold|wage|
+-----------+---------+----+
| AAA| 0.9| 0.5|
| BBB| 0.8| 0.5|
| CCC| 0.9| 0.5|
+-----------+---------+----+
root
|-- employee_id: string (nullable = false)
|-- threshold: double (nullable = false)
|-- wage: double (nullable = false)
include_thresholds2 would work fine.
from pyspark.sql.functions import col
df.filter(col("threshold").isin(include_thresholds2)).show()
+-----------+---------+----+
|employee_id|threshold|wage|
+-----------+---------+----+
| BBB| 0.8| 0.5|
+-----------+---------+----+
Now, the below throws an error.
df.filter(col("threshold").isin(include_thresholds)).show()
org.apache.spark.sql.AnalysisException: decimal can only support precision up to 38;
This happens because it takes the value 0.8000000000000000444089209850062616169452667236328125 as such, tries to up-cast it, and thus throws the error.
Found the answer in the isin documentation:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#isin-java.lang.Object...-
isin
public Column isin(Object... list)
A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
Note: Since the type of the elements in the list are inferred only during the run time, the elements will be "up-casted" to the most common type for comparison. For eg: 1) In the case of "Int vs String", the "Int" will be up-casted to "String" and the comparison will look like "String vs String". 2) In the case of "Float vs Double", the "Float" will be up-casted to "Double" and the comparison will look like "Double vs Double"
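A practical workaround that follows from this note (a sketch, assuming the df and include_thresholds from the question): cast the Decimal values to float before calling isin, so the comparison is double vs. double with no surprising up-cast.
import pyspark.sql.functions as fn

# float(Decimal(...)) yields the nearest double, which matches the double column
thresholds_as_double = [float(t) for t in include_thresholds]
df.filter(fn.col("threshold").isin(thresholds_as_double)).count()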

PySpark: How to convert column with Ljava.lang.Object

I created a data frame in PySpark by reading data from HDFS like this:
df = spark.read.parquet('path/to/parquet')
I expect the data frame to have two columns of strings:
+------------+------------------+
|my_column |my_other_column |
+------------+------------------+
|my_string_1 |my_other_string_1 |
|my_string_2 |my_other_string_2 |
|my_string_3 |my_other_string_3 |
|my_string_4 |my_other_string_4 |
|my_string_5 |my_other_string_5 |
|my_string_6 |my_other_string_6 |
|my_string_7 |my_other_string_7 |
|my_string_8 |my_other_string_8 |
+------------+------------------+
However, I get the my_column column with strings starting with [Ljava.lang.Object;, looking like this:
>> df.show(truncate=False)
+-----------------------------+------------------+
|my_column |my_other_column |
+-----------------------------+------------------+
|[Ljava.lang.Object;@7abeeeb6 |my_other_string_1 |
|[Ljava.lang.Object;@5c1bbb1c |my_other_string_2 |
|[Ljava.lang.Object;@6be335ee |my_other_string_3 |
|[Ljava.lang.Object;@153bdb33 |my_other_string_4 |
|[Ljava.lang.Object;@1a23b57f |my_other_string_5 |
|[Ljava.lang.Object;@3a101a1a |my_other_string_6 |
|[Ljava.lang.Object;@33846636 |my_other_string_7 |
|[Ljava.lang.Object;@521a0a3d |my_other_string_8 |
+-----------------------------+------------------+
>> df.printSchema()
root
|-- my_column: string (nullable = true)
|-- my_other_column: string (nullable = true)
As you can see, the my_other_column column looks as expected. Is there any way to convert the objects in the my_column column to human-readable strings?
Jaroslav,
I tried with the following code and used a sample parquet file from here. I am able to get the desired output from the dataframe. Can you please check your code using the code snippet below, and also the sample file referred to above, to see if there's any other issue:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Read a Parquet file").getOrCreate()
df = spark.read.parquet('E:\\...\\..\\userdata1.parquet')
df.show(10)
df.printSchema()
Replace the path with your HDFS location.
Dataframe output for your reference:

Extracting the year from Date in Pyspark dataframe

I have a Pyspark data frame that contains a date column "Reported Date" (type: string). I would like to get the count of another column after extracting the year from the date.
I can get the count if I use the string date column.
crimeFile_date.groupBy("Reported Date").sum("Offence Count").show()
and I get this output
+-------------+------------------+
|Reported Date|sum(Offence Count)|
+-------------+------------------+
| 13/08/2010| 342|
| 6/10/2011| 334|
| 27/11/2011| 269|
| 12/01/2012| 303|
| 22/02/2012| 286|
| 31/07/2012| 276|
| 25/04/2013| 222|
+-------------+------------------+
To extract the year from "Reported Date" I have converted it to a date format (using this approach) and named the column "Date".
However, when I try to use the same code to group by the new column and do the count I get an error message.
crimeFile_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()
TypeError: strptime() argument 1 must be str, not None
This is the data schema:
root
|-- Offence Count: integer (nullable = true)
|-- Reported Date: string (nullable = true)
|-- Date: date (nullable = true)
Is there a way to fix this error, or to extract the year using another method?
Thank you
If I understand correctly, you want to extract the year from a string date column. Of course, one way is using regex, but that can sometimes throw your logic off if the regex does not handle all scenarios.
Here is the date data type approach.
Imports
import pyspark.sql.functions as f
Creating your Dataframe
l1 = [('13/08/2010',342),('6/10/2011',334),('27/11/2011',269),('12/01/2012',303),('22/02/2012',286),('31/07/2012',276),('25/04/2013',222)]
dfl1 = spark.createDataFrame(l1).toDF("dates","sum")
dfl1.show()
+----------+---+
| dates|sum|
+----------+---+
|13/08/2010|342|
| 6/10/2011|334|
|27/11/2011|269|
|12/01/2012|303|
|22/02/2012|286|
|31/07/2012|276|
|25/04/2013|222|
+----------+---+
Now, you can use the to_timestamp or to_date APIs of the functions package:
dfl2 = dfl1.withColumn('years',f.year(f.to_timestamp('dates', 'dd/MM/yyyy')))
dfl2.show()
+----------+---+-----+
| dates|sum|years|
+----------+---+-----+
|13/08/2010|342| 2010|
| 6/10/2011|334| 2011|
|27/11/2011|269| 2011|
|12/01/2012|303| 2012|
|22/02/2012|286| 2012|
|31/07/2012|276| 2012|
|25/04/2013|222| 2013|
+----------+---+-----+
Now, group by on years.
dfl2.groupBy('years').sum('sum').show()
+-----+--------+
|years|sum(sum)|
+-----+--------+
| 2013| 222|
| 2012| 865|
| 2010| 342|
| 2011| 603|
+-----+--------+
I have shown it in multiple steps for understanding, but you can combine the year extraction and the group-by into one step, as sketched below.
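For example, a one-step version of the same aggregation (a sketch reusing dfl1 from above, with to_date instead of to_timestamp; either works for pulling out the year):
import pyspark.sql.functions as f

# parse the string, extract the year, and aggregate in a single expression
dfl1.groupBy(f.year(f.to_date('dates', 'dd/MM/yyyy')).alias('years')).sum('sum').show()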
Happy to extend if you need some other help.

Creating a new dataframe from a pyspark dataframe column efficiently

I wonder what the most efficient way is to extract a column from a pyspark dataframe and turn it into a new dataframe. The following code runs without any problem on small datasets, but runs very slowly and even causes an out-of-memory error. How can I improve the efficiency of this code?
pdf_edges = sdf_grp.rdd.flatMap(lambda x: x).collect()
edgelist = reduce(lambda a, b: a + b, pdf_edges, [])
sdf_edges = spark.createDataFrame(edgelist)
In the pyspark dataframe sdf_grp, the column "pairs" contains information as below:
+-------------------------------------------------------------------+
|pairs |
+-------------------------------------------------------------------+
|[[39169813, 24907492], [39169813, 19650174]] |
|[[10876191, 139604770]] |
|[[6481958, 22689674]] |
|[[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]]|
|[[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]] |
+-------------------------------------------------------------------+
with a schema of
root
|-- pairs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- node1: integer (nullable = false)
| | |-- node2: integer (nullable = false)
I'd like to convert them into a new dataframe sdf_edges that looks like below:
+---------+---------+
| node1| node2|
+---------+---------+
| 39169813| 24907492|
| 39169813| 19650174|
| 10876191|139604770|
| 6481958| 22689674|
| 73450939|114203936|
| 73450939| 21226555|
| 73450939| 24367554|
| 66306616| 32911686|
| 66306616| 19319140|
| 66306616| 48712544|
+---------+---------+
The most efficient way to extract columns is to avoid collect(). When you call collect(), all the data is transferred to the driver and processed there. A better way to achieve what you want is to use the explode() function. Have a look at the example below:
from pyspark.sql import types as T
import pyspark.sql.functions as F
schema = T.StructType([
    T.StructField("pairs", T.ArrayType(
        T.StructType([
            T.StructField("node1", T.IntegerType()),
            T.StructField("node2", T.IntegerType())
        ])
    ))
])
df = spark.createDataFrame(
    [
        ([[39169813, 24907492], [39169813, 19650174]],),
        ([[10876191, 139604770]],),
        ([[6481958, 22689674]],),
        ([[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]],),
        ([[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]],)
    ], schema)
df = df.select(F.explode('pairs').alias('exploded')).select('exploded.node1', 'exploded.node2')
df.show(truncate=False)
Output:
+--------+---------+
| node1 | node2 |
+--------+---------+
|39169813|24907492 |
|39169813|19650174 |
|10876191|139604770|
|6481958 |22689674 |
|73450939|114203936|
|73450939|21226555 |
|73450939|24367554 |
|66306616|32911686 |
|66306616|19319140 |
|66306616|48712544 |
+--------+---------+
Well, I just solved it with the below:
sdf_edges = sdf_grp.select('pairs').rdd.flatMap(lambda x: x[0]).toDF()
