I am using the isin function to filter a PySpark dataframe. Surprisingly, although the column data type (double) does not match the data type in the list (Decimal), there was a match. Can someone help me understand why this is the case?
Example
(Pdb) df.show(3)
+--------------------+---------+----+
|         employee_id|threshold|wage|
+--------------------+---------+----+
|                 AAA|      0.9| 0.5|
|                 BBB|      0.8| 0.5|
|                 CCC|      0.9| 0.5|
+--------------------+---------+----+
(Pdb) df.printSchema()
root
|-- employee_id: string (nullable = true)
|-- threshold: double (nullable = true)
|-- wage: double (nullable = true)
(Pdb) include_thresholds
[Decimal('0.8')]
(Pdb) df.count()
3267
(Pdb) df.filter(fn.col("threshold").isin(include_thresholds)).count()
1633
However, if I use the normal "in" operator to test whether 0.8 belongs to include_thresholds, it's obviously False:
(Pdb) 0.8 in include_thresholds
False
Do the col or isin functions implicitly perform a datatype conversion?
When you bring external input to Spark for comparison, the values are just taken as literals and upcast based on the context.
So what you observe with Python datatypes may not hold in Spark.
import decimal

include_thresholds = [decimal.Decimal(0.8)]     # constructed from a float
include_thresholds2 = [decimal.Decimal('0.8')]  # constructed from a string

0.8 in include_thresholds   # True
0.8 in include_thresholds2  # False
And, note the values
include_thresholds
[Decimal('0.8000000000000000444089209850062616169452667236328125')]
include_thresholds2
[Decimal('0.8')]
Coming to the dataframe
df = spark.sql(""" with t1 (
select 'AAA' c1, 0.9 c2, 0.5 c3 union all
select 'BBB' c1, 0.8 c2, 0.5 c3 union all
select 'CCC' c1, 0.9 c2, 0.5 c3
) select c1 employee_id, cast(c2 as double) threshold, cast(c3 as double) wage from t1
""")
df.show()
df.printSchema()
+-----------+---------+----+
|employee_id|threshold|wage|
+-----------+---------+----+
| AAA| 0.9| 0.5|
| BBB| 0.8| 0.5|
| CCC| 0.9| 0.5|
+-----------+---------+----+
root
|-- employee_id: string (nullable = false)
|-- threshold: double (nullable = false)
|-- wage: double (nullable = false)
include_thresholds2 would work fine.
df.filter(col("threshold").isin(include_thresholds2)).show()
+-----------+---------+----+
|employee_id|threshold|wage|
+-----------+---------+----+
| BBB| 0.8| 0.5|
+-----------+---------+----+
Now the below throws an error:
df.filter(col("threshold").isin(include_thresholds)).show()
org.apache.spark.sql.AnalysisException: decimal can only support precision up to 38;
It takes the value 0.8000000000000000444089209850062616169452667236328125 as-is and tries to upcast it, which exceeds the maximum decimal precision of 38 and therefore throws the error.
Found the answer in the isin documentation:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#isin-java.lang.Object...-
isin
public Column isin(Object... list)
A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
Note: Since the type of the elements in the list are inferred only during the run time, the elements will be "up-casted" to the most common type for comparison. For eg: 1) In the case of "Int vs String", the "Int" will be up-casted to "String" and the comparison will look like "String vs String". 2) In the case of "Float vs Double", the "Float" will be up-casted to "Double" and the comparison will look like "Double vs Double"
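A practical way to sidestep this is to make the Python-side values match the column type before calling isin. A minimal sketch, assuming df and include_thresholds from the question (the name include_thresholds_dbl is just for illustration):
from pyspark.sql import functions as fn

# Convert the Decimals to plain Python floats so the literals Spark builds
# for isin() are doubles, the same type as the threshold column.
include_thresholds_dbl = [float(t) for t in include_thresholds]

df.filter(fn.col("threshold").isin(include_thresholds_dbl)).count()
Alternatively, build the Decimals from strings (as in include_thresholds2 above), so the up-cast comparison uses the exact decimal value you intended.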
Related
I am reading a CSV file with the code snippet below:
df_pyspark = spark.read.csv("sample_data.csv")
df_pyspark
and when I try to print the DataFrame, its output is as shown below:
DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string]
For each column the dataType shows as 'string' even though the columns contain different dataTypes, as below:
df_pyspark.show()
+---+----------+---------+--------------------+-----------+----------+
|_c0|       _c1|      _c2|                 _c3|        _c4|       _c5|
+---+----------+---------+--------------------+-----------+----------+
| id|first_name|last_name|               email|     gender|     phone|
|  1|    Bidget| Mirfield|bmirfield0#scient...|     Female|5628618353|
|  2|   Gonzalo|    Vango|    gvango1#ning.com|       Male|9556535457|
|  3|      Rock| Pampling|rpampling2#guardi...|   Bigender|4472741337|
|  4|   Dorella|  Edelman|dedelman3#histats...|     Female|4303062344|
|  5|     Faber|  Thwaite|fthwaite4#google....|Genderqueer|1348658809|
|  6|     Debee| Philcott|dphilcott5#cafepr...|     Female|7906881842|
+---+----------+---------+--------------------+-----------+----------+
How can I print the exact dataType of every column?
Use the inferSchema parameter when reading the CSV file; it will show the exact/correct datatype according to the values in the columns:
df_pyspark = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
+---+----------+---------+--------------------+-----------+----------+
| id|first_name|last_name|               email|     gender|     phone|
+---+----------+---------+--------------------+-----------+----------+
| 1| Bidget| Mirfield|bmirfield0#scient...| Female|5628618353|
| 2| Gonzalo| Vango| gvango1#ning.com| Male|9556535457|
| 3| Rock| Pampling|rpampling2#guardi...| Bigender|4472741337|
| 4| Dorella| Edelman|dedelman3#histats...| Female|4303062344|
| 5| Faber| Thwaite|fthwaite4#google....|Genderqueer|1348658809|
+---+----------+---------+--------------------+-----------+----------+
only showing top 5 rows
df_pyspark.printSchema()
root
|-- id: integer (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- email: string (nullable = true)
|-- gender: string (nullable = true)
|-- phone: long (nullable = true)
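inferSchema makes Spark scan the data an extra time to work out the types. If you already know them, you can supply a schema explicitly instead; a minimal sketch, assuming the column names and types from the sample above:
from pyspark.sql import types as T

# assumed schema based on the sample rows shown above; adjust as needed
schema = T.StructType([
    T.StructField("id", T.IntegerType(), True),
    T.StructField("first_name", T.StringType(), True),
    T.StructField("last_name", T.StringType(), True),
    T.StructField("email", T.StringType(), True),
    T.StructField("gender", T.StringType(), True),
    T.StructField("phone", T.LongType(), True),
])

df_pyspark = spark.read.csv("sample_data.csv", header=True, schema=schema)
df_pyspark.printSchema()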
I have a DataFrame named df with the following structure:
root
 |-- country: string (nullable = true)
 |-- competition: string (nullable = true)
 |-- competitor: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- time: string (nullable = true)
Which looks like this:
|country|competition|competitor     |
|___________________________________|
|USA    |WN         |[{Adam, 9.43}] |
|China  |FN         |[{John, 9.56}] |
|China  |FN         |[{Adam, 9.48}] |
|USA    |MNU        |[{Phil, 10.02}]|
|...    |...        |...            |
I want to pivot (or something similar) the competitor column into new columns depending on the values in each struct, so it looks like this:
|country|competition|Adam|John|Phil |
|____________________________________|
|USA |WN |9.43|... |... |
|China |FN |9.48|9.56|... |
|USA |MNU |... |... |10.02 |
The names are unique, so if a column already exists I don't want to create a new one but rather fill the value into the existing column. There are a lot of names, so it needs to be done dynamically.
I have a pretty big dataset, so I can't use Pandas.
We can extract the struct from the array and then create new columns from that struct. The name column can be pivoted with the time as values.
from pyspark.sql import functions as func

# take the first (and only) struct out of the array, flatten it into
# name/time columns, then pivot the names into columns with time as the value
data_sdf. \
    withColumn('name_time_struct', func.col('competitor')[0]). \
    select('country', 'competition', func.col('name_time_struct.*')). \
    groupBy('country', 'competition'). \
    pivot('name'). \
    agg(func.first('time')). \
    show()
+-------+-----------+----+----+-----+
|country|competition|Adam|John| Phil|
+-------+-----------+----+----+-----+
| USA| MNU|null|null|10.02|
| USA| WN|9.43|null| null|
| China| FN|9.48|9.56| null|
+-------+-----------+----+----+-----+
P.S. this assumes there's just 1 struct in the array.
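If the array can hold more than one competitor per row, one way to handle it (a sketch under that assumption, reusing data_sdf and func from above) is to explode the array before pivoting, so every struct gets its own row:
# explode() gives one row per {name, time} struct; the pivot then works
# the same way regardless of how many competitors a row holds
data_sdf. \
    select('country', 'competition', func.explode('competitor').alias('c')). \
    select('country', 'competition', func.col('c.name').alias('name'), func.col('c.time').alias('time')). \
    groupBy('country', 'competition'). \
    pivot('name'). \
    agg(func.first('time')). \
    show()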
I'm trying to count word pairs in a text file. First, I've done some pre-processing on the text, and then I counted word pairs as shown below:
((Aspire, to), 1) ; ((to, inspire), 4) ; ((inspire, before), 38)...
Now, I want to report the 1000 most frequent pairs, sorted by:
Word (second word of the pair)
Relative frequency (pair occurrences / 2nd word total occurrences)
Here's what I've done so far
from pyspark.sql import SparkSession
import re
spark = SparkSession.builder.appName("Bigram occurrences and relative frequencies").master("local[*]").getOrCreate()
sc = spark.sparkContext
text = sc.textFile("big.txt")
tokens = text.map(lambda x: x.lower()).map(lambda x: re.split(r"[\s,.;:!?]+", x))
pairs = tokens.flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:]))).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
frame = pairs.toDF(['pair', 'count'])
# Dataframe ordered by the most frequent pair to the least
most_frequent = frame.sort(frame['count'].desc())
# For each row, trying to add a column with the relative frequency, but I'm getting an error
with_rf = frame.withColumn("rf", frame['count'] / (frame.pair._2.sum()))
I think I'm relatively close to the result I want but I can't figure it out. I'm new to Spark and DataFrames in general.
I also tried
import pyspark.sql.functions as F
frame.groupBy(frame['pair._2']).agg((F.col('count') / F.sum('count')).alias('rf')).show()
Any help would be appreciated.
EDIT: here's a sample of the frame dataframe
+--------------------+-----+
| pair|count|
+--------------------+-----+
|{project, gutenberg}| 69|
| {gutenberg, ebook}| 14|
| {ebook, of}| 5|
| {adventures, of}| 6|
| {by, sir}| 12|
| {conan, doyle)}| 1|
| {changing, all}| 2|
| {all, over}| 24|
+--------------------+-----+
root
|-- pair: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
|-- count: long (nullable = true)
The relative frequency can be computed using a window function that partitions by the second word in the pair and applies a sum.
Then we limit the entries in the df to the top x based on count, and finally order by the second word in the pair and the relative frequency.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [(("project", "gutenberg"), 69,),
(("gutenberg", "ebook"), 14,),
(("ebook", "of"), 5,),
(("adventures", "of"), 6,),
(("by", "sir"), 12,),
(("conan", "doyle"), 1,),
(("changing", "all"), 2,),
(("all", "over"), 24,), ]
df = spark.createDataFrame(data, ("pair", "count", ))
ws = W.partitionBy(F.col("pair")._2).rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
(df.withColumn("relative_freq", F.col("count") / F.sum("count").over(ws))
.orderBy(F.col("count").desc())
.limit(3) # change here to select top 1000
.orderBy(F.desc(F.col("pair")._2), F.col("relative_freq").desc())
).show()
"""
+--------------------+-----+-------------+
| pair|count|relative_freq|
+--------------------+-----+-------------+
| {all, over}| 24| 1.0|
|{project, gutenberg}| 69| 1.0|
| {gutenberg, ebook}| 14| 1.0|
+--------------------+-----+-------------+
"""
I created data frame in PySpark by reading data from HDFS like this:
df = spark.read.parquet('path/to/parquet')
I expect the data frame to have two columns of strings:
+------------+------------------+
|my_column |my_other_column |
+------------+------------------+
|my_string_1 |my_other_string_1 |
|my_string_2 |my_other_string_2 |
|my_string_3 |my_other_string_3 |
|my_string_4 |my_other_string_4 |
|my_string_5 |my_other_string_5 |
|my_string_6 |my_other_string_6 |
|my_string_7 |my_other_string_7 |
|my_string_8 |my_other_string_8 |
+------------+------------------+
However, the my_column column contains strings starting with [Ljava.lang.Object;, looking like this:
>> df.show(truncate=False)
+-----------------------------+------------------+
|my_column |my_other_column |
+-----------------------------+------------------+
|[Ljava.lang.Object;#7abeeeb6 |my_other_string_1 |
|[Ljava.lang.Object;#5c1bbb1c |my_other_string_2 |
|[Ljava.lang.Object;#6be335ee |my_other_string_3 |
|[Ljava.lang.Object;#153bdb33 |my_other_string_4 |
|[Ljava.lang.Object;#1a23b57f |my_other_string_5 |
|[Ljava.lang.Object;#3a101a1a |my_other_string_6 |
|[Ljava.lang.Object;#33846636 |my_other_string_7 |
|[Ljava.lang.Object;#521a0a3d |my_other_string_8 |
+-----------------------------+------------------+
>> df.printSchema()
root
|-- my_column: string (nullable = true)
|-- my_other_column: string (nullable = true)
As you can see, the my_other_column column looks as expected. Is there any way to convert the objects in the my_column column to human-readable strings?
Jaroslav,
I tried with the following code and used a sample parquet file from here. I am able to get the desired output from the dataframe. Can you please check your code using the code snippet below, and the sample file referred to above, to see if there's any other issue:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Read a Parquet file").getOrCreate()
df = spark.read.parquet('E:\\...\\..\\userdata1.parquet')
df.show(10)
df.printSchema()
Replace the path with your HDFS location.
Dataframe output for your reference:
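For what it's worth, the [Ljava.lang.Object;@... pattern is Java's default string form of an object array, which hints that the values were already stringified when the parquet file was written rather than mangled by spark.read. As an extra check you could inspect a part file's schema outside Spark; a sketch, assuming pyarrow is installed and a part file is reachable locally (the file name is a placeholder):
import pyarrow.parquet as pq

# prints the column types stored in the parquet file itself; if my_column is
# already a plain string here, the values were written that way upstream
print(pq.read_schema('path/to/parquet/part-00000-xxxx.snappy.parquet'))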
I wonder what is the most efficient way to extract a column from a PySpark dataframe and turn it into a new dataframe? The following code runs without any problem on small datasets, but it runs very slowly and even causes an out-of-memory error. How can I improve the efficiency of this code?
pdf_edges = sdf_grp.rdd.flatMap(lambda x: x).collect()
edgelist = reduce(lambda a, b: a + b, pdf_edges, [])
sdf_edges = spark.createDataFrame(edgelist)
In the PySpark dataframe sdf_grp, the column "pairs" contains information as below:
+-------------------------------------------------------------------+
|pairs |
+-------------------------------------------------------------------+
|[[39169813, 24907492], [39169813, 19650174]] |
|[[10876191, 139604770]] |
|[[6481958, 22689674]] |
|[[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]]|
|[[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]] |
+-------------------------------------------------------------------+
with a schema of
root
|-- pairs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- node1: integer (nullable = false)
| | |-- node2: integer (nullable = false)
I'd like to convert them into a new dataframe sdf_edges that looks like below:
+---------+---------+
| node1| node2|
+---------+---------+
| 39169813| 24907492|
| 39169813| 19650174|
| 10876191|139604770|
| 6481958| 22689674|
| 73450939|114203936|
| 73450939| 21226555|
| 73450939| 24367554|
| 66306616| 32911686|
| 66306616| 19319140|
| 66306616| 48712544|
+---------+---------+
The most efficient way to extract columns is to avoid collect(). When you call collect(), all the data is transferred to the driver and processed there. A better way to achieve what you want is to use the explode() function. Have a look at the example below:
from pyspark.sql import types as T
import pyspark.sql.functions as F
schema = T.StructType([
    T.StructField("pairs", T.ArrayType(
        T.StructType([
            T.StructField("node1", T.IntegerType()),
            T.StructField("node2", T.IntegerType())
        ])
    ))
])

df = spark.createDataFrame(
    [
        ([[39169813, 24907492], [39169813, 19650174]],),
        ([[10876191, 139604770]],),
        ([[6481958, 22689674]],),
        ([[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]],),
        ([[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]],)
    ], schema)
df = df.select(F.explode('pairs').alias('exploded')).select('exploded.node1', 'exploded.node2')
df.show(truncate=False)
Output:
+--------+---------+
|   node1|    node2|
+--------+---------+
|39169813|24907492 |
|39169813|19650174 |
|10876191|139604770|
|6481958 |22689674 |
|73450939|114203936|
|73450939|21226555 |
|73450939|24367554 |
|66306616|32911686 |
|66306616|19319140 |
|66306616|48712544 |
+--------+---------+
Well, I just solved it with the below:
sdf_edges = sdf_grp.select('pairs').rdd.flatMap(lambda x: x[0]).toDF()
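For reference, the RDD round trip can also be skipped entirely with Spark SQL's inline generator, which expands an array of structs into one row per struct with the struct fields as columns (a short sketch against the same sdf_grp):
# inline(pairs) yields node1/node2 as top-level columns, one row per struct
sdf_edges = sdf_grp.selectExpr("inline(pairs)")
sdf_edges.show()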