I am reading a CSV file with the code snippet below:
df_pyspark = spark.read.csv("sample_data.csv")
df_pyspark
When I try to print the DataFrame, the output looks like this:
DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string]
Every column's data type shows as string, even though the columns contain different data types, as seen below:
df_pyspark.show()
+---+----------+---------+--------------------+-----------+----------+
|_c0|       _c1|      _c2|                 _c3|        _c4|       _c5|
+---+----------+---------+--------------------+-----------+----------+
| id|first_name|last_name|               email|     gender|     phone|
|  1|    Bidget| Mirfield|bmirfield0#scient...|     Female|5628618353|
|  2|   Gonzalo|    Vango|    gvango1#ning.com|       Male|9556535457|
|  3|      Rock| Pampling|rpampling2#guardi...|   Bigender|4472741337|
|  4|   Dorella|  Edelman|dedelman3#histats...|     Female|4303062344|
|  5|     Faber|  Thwaite|fthwaite4#google....|Genderqueer|1348658809|
|  6|     Debee| Philcott|dphilcott5#cafepr...|     Female|7906881842|
+---+----------+---------+--------------------+-----------+----------+
How can I print the exact data type of every column?
Use the inferSchema parameter when reading the CSV file; it will infer the correct data type for each column based on its values:
df_pyspark = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
+---+----------+---------+--------------------+-----------+----------+
| id|first_name|last_name|               email|     gender|     phone|
+---+----------+---------+--------------------+-----------+----------+
|  1|    Bidget| Mirfield|bmirfield0#scient...|     Female|5628618353|
|  2|   Gonzalo|    Vango|    gvango1#ning.com|       Male|9556535457|
|  3|      Rock| Pampling|rpampling2#guardi...|   Bigender|4472741337|
|  4|   Dorella|  Edelman|dedelman3#histats...|     Female|4303062344|
|  5|     Faber|  Thwaite|fthwaite4#google....|Genderqueer|1348658809|
+---+----------+---------+--------------------+-----------+----------+
only showing top 5 rows
df_pyspark.printSchema()
root
|-- id: integer (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- email: string (nullable = true)
|-- gender: string (nullable = true)
|-- phone: long (nullable = true)
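If you just want a flat list of (column, type) pairs rather than the schema tree, the DataFrame's dtypes attribute can be printed as well; a minimal sketch, assuming the same df_pyspark as above:
# print each column name together with its inferred data type
for name, dtype in df_pyspark.dtypes:
    print(name, dtype)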
I have a dataframe as shown below:
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 9| 11| 0| XXXX2288|110XXXX2288MKKKKK...| CHAR0088| ERROR|Records out of se...| N|
| 9| 12| 0| XXXX2288|130XXXX22880011ZZ...| CHAR0088| ERROR|Records out of se...| N|
| 9| 18| 0| XXXX2288|140XXXX2288 ...| CHAR0088| ERROR|Records out of se...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
The code below uses UDFs to populate the errorType and errorDescription columns.
The UDFs resolveErrorTypeUDF and resolveErrorDescUDF each take a single errorCode as input and return the corresponding errorType and errorDescription, respectively.
from pyspark.sql.functions import col, when, trim

errorFinalDf = errorDfAll.na.fill("") \
    .withColumn("errorType", resolveErrorTypeUDF(col("errorCode"))) \
    .withColumn("errorDescription", resolveErrorDescUDF(col("errorCode"))) \
    .withColumn("isSuccessful", when(trim(col("errorCode")).eqNullSafe(""), "Y").otherwise("N")) \
    .dropDuplicates()
Please note that I used to get only one error code in the errorCode column. From now on, I will be getting one or more '-'-separated error codes in the errorCode column, and I need to populate all of the corresponding errorType and errorDescription values into their respective columns, also separated by '-'.
The new dataframe would look like this.
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 7| 1| 0| XXXX8822|010XXXX8822XBCDEF...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 11| 0| XXXX8822|110XXXX8822LLLLLL...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 12| 0| XXXX8822|120XXXX8822011GB ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX8822 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX88220 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
What changes would be needed to accommodate the new scenario? Please help. Thank you.
You need minimal changes, limited only to your UDFs.
Suppose you have a simple Python function, get_type_from_code, able to convert an error code string into the corresponding type (the same applies to the description).
from pyspark.sql import functions as F, types as T

def get_type_from_code(c: str) -> str:
    """Function to convert error code to error type.
    Mind the interface: string in, string out
    """
    return {'CHAR0009': 'ERROR', 'CHAR0021': 'WARNING'}.get(c, 'UNKNOWN')

@F.udf(returnType=T.StringType())
def convert_errcodes_to_types(codes: str) -> str:
    """Convert a string of error codes separated by '-' into a string of types concatenated with '-'"""
    return '-'.join(
        map(get_type_from_code, codes.split('-'))
    )
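With that in place, the existing pipeline only needs to call the new UDF. A minimal usage sketch, assuming a parallel convert_errcodes_to_descs UDF built the same way for descriptions (that name is illustrative, not from the original code):
# same pipeline as before, only the UDFs changed
errorFinalDf = errorDfAll.na.fill("") \
    .withColumn("errorType", convert_errcodes_to_types(F.col("errorCode"))) \
    .withColumn("errorDescription", convert_errcodes_to_descs(F.col("errorCode"))) \
    .withColumn("isSuccessful", F.when(F.trim(F.col("errorCode")).eqNullSafe(""), "Y").otherwise("N")) \
    .dropDuplicates()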
Done!
I have a DataFrame named df with the following structure:
root
|-- country: string (nullable = true)
|-- competition: string (nullable = true)
|-- competitor: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- time: string (nullable = true)
Which looks like this:
|country|competition|competitor |
|___________________________________|
|USA |WN |[{Adam, 9.43}] |
|China |FN |[{John, 9.56}] |
|China |FN |[{Adam, 9.48}] |
|USA |MNU |[{Phil, 10.02}] |
|... |... |... |
I want to pivot (or something similar) the competitor column into new columns based on the values in each struct, so it looks like this:
|country|competition|Adam|John|Phil |
|____________________________________|
|USA |WN |9.43|... |... |
|China |FN |9.48|9.56|... |
|USA |MNU |... |... |10.02 |
The names are unique, so if a column already exists I don't want to create a new one but instead fill the value into the existing column. There are a lot of names, so it needs to be done dynamically.
I have a pretty big dataset, so I can't use Pandas.
We can extract the struct from the array and then create new columns from that struct. The name column can be pivoted with the time as values.
from pyspark.sql import functions as func

data_sdf. \
    withColumn('name_time_struct', func.col('competitor')[0]). \
    select('country', 'competition', func.col('name_time_struct.*')). \
    groupBy('country', 'competition'). \
    pivot('name'). \
    agg(func.first('time')). \
    show()
+-------+-----------+----+----+-----+
|country|competition|Adam|John| Phil|
+-------+-----------+----+----+-----+
| USA| MNU|null|null|10.02|
| USA| WN|9.43|null| null|
| China| FN|9.48|9.56| null|
+-------+-----------+----+----+-----+
P.S. this assumes there's just 1 struct in the array.
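If the array can hold more than one competitor, a hedged variant of the same idea (same assumed data_sdf) is to explode the array before pivoting, instead of taking element [0]:
data_sdf. \
    withColumn('name_time_struct', func.explode('competitor')). \
    select('country', 'competition', func.col('name_time_struct.*')). \
    groupBy('country', 'competition'). \
    pivot('name'). \
    agg(func.first('time')). \
    show()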
I'm trying to count word pairs in a text file. First, I've done some pre-processing on the text, and then I counted word pairs as shown below:
((Aspire, to), 1) ; ((to, inspire), 4) ; ((inspire, before), 38)...
Now, I want to report the 1000 most frequent pairs, sorted by:
Word (the second word of the pair)
Relative frequency (pair occurrences / total occurrences of the 2nd word)
Here's what I've done so far
from pyspark.sql import SparkSession
import re
spark = SparkSession.builder.appName("Bigram occurences and relative frequencies").master("local[*]").getOrCreate()
sc = spark.sparkContext
text = sc.textFile("big.txt")
tokens = text.map(lambda x: x.lower()).map(lambda x: re.split(r"[\s,.;:!?]+", x))
pairs = tokens.flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:]))).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
frame = pairs.toDF(['pair', 'count'])
# Dataframe ordered by the most frequent pair to the least
most_frequent = frame.sort(frame['count'].desc())
# For each row, trying to add a column with the relative frequency, but I'm getting an error
with_rf = frame.withColumn("rf", frame['count'] / (frame.pair._2.sum()))
I think I'm relatively close to the result I want but I can't figure it out. I'm new to Spark and DataFrames in general.
I also tried
import pyspark.sql.functions as F
frame.groupBy(frame['pair._2']).agg((F.col('count') / F.sum('count')).alias('rf')).show()
Any help would be appreciated.
EDIT: here's a sample of the frame dataframe
+--------------------+-----+
| pair|count|
+--------------------+-----+
|{project, gutenberg}| 69|
| {gutenberg, ebook}| 14|
| {ebook, of}| 5|
| {adventures, of}| 6|
| {by, sir}| 12|
| {conan, doyle)}| 1|
| {changing, all}| 2|
| {all, over}| 24|
+--------------------+-----+
root
|-- pair: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
|-- count: long (nullable = true)
The relative frequency can be computed using a window function that partitions by the second word in the pair and applies a sum.
Then we limit the entries in the dataframe to the top x based on count, and finally order by the second word in the pair and the relative frequency.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [(("project", "gutenberg"), 69,),
(("gutenberg", "ebook"), 14,),
(("ebook", "of"), 5,),
(("adventures", "of"), 6,),
(("by", "sir"), 12,),
(("conan", "doyle"), 1,),
(("changing", "all"), 2,),
(("all", "over"), 24,), ]
df = spark.createDataFrame(data, ("pair", "count", ))
ws = W.partitionBy(F.col("pair")._2).rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
(df.withColumn("relative_freq", F.col("count") / F.sum("count").over(ws))
.orderBy(F.col("count").desc())
.limit(3) # change here to select top 1000
.orderBy(F.desc(F.col("pair")._2), F.col("relative_freq").desc())
).show()
"""
+--------------------+-----+-------------+
| pair|count|relative_freq|
+--------------------+-----+-------------+
| {all, over}| 24| 1.0|
|{project, gutenberg}| 69| 1.0|
| {gutenberg, ebook}| 14| 1.0|
+--------------------+-----+-------------+
"""
I am using the isin function to filter a PySpark dataframe. Surprisingly, although the column data type (double) does not match the data type of the list elements (Decimal), there was a match. Can someone help me understand why this is the case?
Example
(Pdb) df.show(3)
+--------------------+---------+------------+
| employee_id|threshold|wage|
+--------------------+---------+------------+
|AAA | 0.9| 0.5|
|BBB | 0.8| 0.5|
|CCC | 0.9| 0.5|
+--------------------+---------+------------+
(Pdb) df.printSchema()
root
|-- employee_id: string (nullable = true)
|-- threshold: double (nullable = true)
|-- wage: double (nullable = true)
(Pdb) include_thresholds
[Decimal('0.8')]
(Pdb) df.count()
3267
(Pdb) df.filter(fn.col("threshold").isin(include_thresholds)).count()
1633
However, if I use the normal "in" operator to test whether 0.8 belongs to include_thresholds, it's obviously False:
(Pdb) 0.8 in include_thresholds
False
Do the col or isin functions implicitly perform a datatype conversion?
When you bring external input to Spark for comparison, it is just taken as a string and upcast based on the context.
So what you observe with Python datatypes may not hold in Spark.
import decimal
include_thresholds=[decimal.Decimal(0.8)]
include_thresholds2=[decimal.Decimal('0.8')]
0.8 in include_thresholds # True
0.8 in include_thresholds2 # False
And, note the values
include_thresholds
[Decimal('0.8000000000000000444089209850062616169452667236328125')]
include_thresholds2
[Decimal('0.8')]
Coming to the dataframe
df = spark.sql(""" with t1 (
select 'AAA' c1, 0.9 c2, 0.5 c3 union all
select 'BBB' c1, 0.8 c2, 0.5 c3 union all
select 'CCC' c1, 0.9 c2, 0.5 c3
) select c1 employee_id, cast(c2 as double) threshold, cast(c3 as double) wage from t1
""")
df.show()
df.printSchema()
+-----------+---------+----+
|employee_id|threshold|wage|
+-----------+---------+----+
| AAA| 0.9| 0.5|
| BBB| 0.8| 0.5|
| CCC| 0.9| 0.5|
+-----------+---------+----+
root
|-- employee_id: string (nullable = false)
|-- threshold: double (nullable = false)
|-- wage: double (nullable = false)
include_thresholds2 would work fine.
df.filter(col("threshold").isin(include_thresholds2)).show()
+-----------+---------+----+
|employee_id|threshold|wage|
+-----------+---------+----+
| BBB| 0.8| 0.5|
+-----------+---------+----+
Now the below throws an error:
df.filter(col("threshold").isin(include_thresholds)).show()
org.apache.spark.sql.AnalysisException: decimal can only support precision up to 38;
It takes the value 0.8000000000000000444089209850062616169452667236328125 as-is and tries to upcast it, which throws the error.
Found the answer in the isin documentation:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#isin-java.lang.Object...-
isin
public Column isin(Object... list)
A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
Note: Since the type of the elements in the list are inferred only during the run time, the elements will be "up-casted" to the most common type for comparison. For eg: 1) In the case of "Int vs String", the "Int" will be up-casted to "String" and the comparison will look like "String vs String". 2) In the case of "Float vs Double", the "Float" will be up-casted to "Double" and the comparison will look like "Double vs Double"
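Given that upcasting behaviour, a simple hedged workaround is to convert the Decimal values to float (matching the double column) before calling isin:
# Decimal('0.8') -> 0.8 as a Python float, which maps to Spark's double
include_thresholds_dbl = [float(x) for x in include_thresholds2]
df.filter(col("threshold").isin(include_thresholds_dbl)).count()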
What is the most efficient way to extract a column from a PySpark dataframe and turn it into a new dataframe? The following code runs without any problem on small datasets, but runs very slowly and even causes an out-of-memory error. How can I improve the efficiency of this code?
from functools import reduce

pdf_edges = sdf_grp.rdd.flatMap(lambda x: x).collect()
edgelist = reduce(lambda a, b: a + b, pdf_edges, [])
sdf_edges = spark.createDataFrame(edgelist)
In the PySpark dataframe sdf_grp, the column "pairs" contains information as below:
+-------------------------------------------------------------------+
|pairs |
+-------------------------------------------------------------------+
|[[39169813, 24907492], [39169813, 19650174]] |
|[[10876191, 139604770]] |
|[[6481958, 22689674]] |
|[[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]]|
|[[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]] |
+-------------------------------------------------------------------+
with a schema of
root
|-- pairs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- node1: integer (nullable = false)
| | |-- node2: integer (nullable = false)
I'd like to convert them into a new dataframe sdf_edges that looks like below:
+---------+---------+
| node1| node2|
+---------+---------+
| 39169813| 24907492|
| 39169813| 19650174|
| 10876191|139604770|
| 6481958| 22689674|
| 73450939|114203936|
| 73450939| 21226555|
| 73450939| 24367554|
| 66306616| 32911686|
| 66306616| 19319140|
| 66306616| 48712544|
+---------+---------+
The most efficient way to extract columns is to avoid collect(). When you call collect(), all the data is transferred to the driver and processed there. A better way to achieve what you want is to use the explode() function. Have a look at the example below:
from pyspark.sql import types as T
import pyspark.sql.functions as F

schema = T.StructType([
    T.StructField("pairs", T.ArrayType(
        T.StructType([
            T.StructField("node1", T.IntegerType()),
            T.StructField("node2", T.IntegerType())
        ])
    ))
])

df = spark.createDataFrame(
    [
        ([[39169813, 24907492], [39169813, 19650174]],),
        ([[10876191, 139604770]],),
        ([[6481958, 22689674]],),
        ([[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]],),
        ([[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]],)
    ], schema)

df = df.select(F.explode('pairs').alias('exploded')).select('exploded.node1', 'exploded.node2')
df.show(truncate=False)
Output:
+--------+---------+
| node1 | node2 |
+--------+---------+
|39169813|24907492 |
|39169813|19650174 |
|10876191|139604770|
|6481958 |22689674 |
|73450939|114203936|
|73450939|21226555 |
|73450939|24367554 |
|66306616|32911686 |
|66306616|19319140 |
|66306616|48712544 |
+--------+---------+
Well, I just solved it with the below:
sdf_edges = sdf_grp.select('pairs').rdd.flatMap(lambda x: x[0]).toDF()