I have a dataframe with a schema as follows:
root
|-- column: struct (nullable = true)
| |-- column-string: string (nullable = true)
|-- count: long (nullable = true)
What I want to do is:
Get rid of the struct, or rather "promote" column-string, so that my dataframe has only 2 columns: column-string and count
I then want to split column-string into 3 different columns, so I end up with the schema:
The text within column-string always fits the format:
Some-Text,Text,MoreText
Does anyone know how this is possible?
I'm using PySpark (Python).
PS. I am new to Pyspark & I don't know much about the struct format and couldn't find how to write an example into my post to make it reproducible - sorry.
You can also use from_csv to convert the comma-delimited string into a struct, and then star expand the struct:
import pyspark.sql.functions as F
df2 = df.withColumn(
'col',
F.from_csv(
'column.column-string',
'`column-string` string, `column-string2` string, `column-string3` string'
)
).select('col.*', 'count')
df2.show()
+-------------+--------------+--------------+-----+
|column-string|column-string2|column-string3|count|
+-------------+--------------+--------------+-----+
| SomeText| Text| MoreText| 1|
+-------------+--------------+--------------+-----+
Note that it's better to avoid hyphens in column names: in SQL expressions a hyphen is parsed as the subtraction operator, so such names have to be quoted with backticks. Underscores are a safer choice.
You can select the column-string field from the struct using column.column-string, then simply split it by a comma to get three columns:
from pyspark.sql import functions as F
df1 = df.withColumn(
"column_string", F.split(F.col("column.column-string"), ",")
).select(
F.col("column_string")[0].alias("column-string"),
F.col("column_string")[1].alias("column-string2"),
F.col("column_string")[2].alias("column-string3"),
F.col("count")
)
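The two answers above differ in one respect worth checking against your data: a plain split cuts at every comma, while a CSV parser treats a quoted field as a single value. The plain-Python sketch below illustrates that difference with the standard csv module. It is not Spark code, the helper names and the quoted sample string are made up, and it assumes from_csv follows ordinary CSV quoting rules with its default options.

```python
import csv

def split_plain(value):
    # Cut at every comma, regardless of quotes.
    return value.split(",")

def split_csv(value):
    # Parse the string as one CSV record, so a quoted field may contain commas.
    return next(csv.reader([value]))

simple = "Some-Text,Text,MoreText"
quoted = '"Some,Text",Text,MoreText'  # made-up value with a quoted comma

print(split_plain(simple))  # ['Some-Text', 'Text', 'MoreText']
print(split_csv(simple))    # ['Some-Text', 'Text', 'MoreText']
print(split_plain(quoted))  # ['"Some', 'Text"', 'Text', 'MoreText']
print(split_csv(quoted))    # ['Some,Text', 'Text', 'MoreText']
```

If the text always fits the Some-Text,Text,MoreText format with no quoted commas, both approaches should give the same three columns.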
I'm new to Pyspark 3.0 and I have this homework where I need to change the string (geolocation) to tuple numeric data type (geolocation1).
Here is my code:
from pyspark.sql.functions import *
from pyspark.sql.types import *
df = df2.withColumn('geolocation1', col('geolocation').cast('double'))
Output:
| geolocation | geolocation1 |
| ------------------------------------------| ----------------- |
| (-37.80899950, 140.96004459) | null |
| (-37.80899952, 140.96004451) | null |
What have I done wrong here?
If you have a string like that, you can remove the parentheses and split by comma, then cast to array<double>:
import pyspark.sql.functions as F
df = df2.withColumn(
'geolocation1',
F.split(
F.regexp_replace('geolocation', r'[\( \)]', ''),  # raw string avoids invalid-escape warnings
','
).cast('array<double>')
)
df.show(truncate=False)
+----------------------------+---------------------------+
|geolocation |geolocation1 |
+----------------------------+---------------------------+
|(-37.80899950, 140.96004459)|[-37.8089995, 140.96004459]|
+----------------------------+---------------------------+
df.printSchema()
root
|-- geolocation: string (nullable = false)
|-- geolocation1: array (nullable = false)
| |-- element: double (containsNull = true)
I would like to give some suggestions before answering this.
First, you need to understand what a double type is. Here you are casting a string that also contains non-numeric characters (parentheses, a comma and a space) to a numeric type. Internally the parse fails, and Spark returns null as the output instead of raising an error.
Second, as I understand from the name of the field, it is a geolocation, which is a combination of latitude and longitude. So I assume whoever gave you this homework needs these two values as new columns. If my assumption is correct, below is one of the ways to achieve it.
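The parsing logic behind splitting the string into latitude and longitude can be sketched in plain Python. This is not Spark code, and parse_geolocation is a hypothetical helper introduced only for illustration; in Spark the same steps correspond to removing the parentheses, splitting on the comma and casting each part to double.

```python
def parse_geolocation(text):
    """Turn '(lat, lon)' into a (float, float) tuple; hypothetical helper."""
    # Strip the surrounding whitespace and parentheses, then split on the comma.
    latitude, longitude = text.strip().strip("()").split(",")
    # float() tolerates the leading space left in front of the longitude.
    return float(latitude), float(longitude)

raw = "(-37.80899950, 140.96004459)"

# Converting the whole string at once fails, which mirrors the null from the direct cast.
try:
    float(raw)
    direct_conversion_works = True
except ValueError:
    direct_conversion_works = False

latitude, longitude = parse_geolocation(raw)
print(direct_conversion_works)  # False
print(latitude, longitude)      # -37.8089995 140.96004459
```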
I am trying to save a pyspark.sql.dataframe.DataFrame in CSV format (it could also be another format, as long as it is easily readable).
So far, I have found a couple of examples for saving the DataFrame. However, it loses information every time I write it.
Dataset example:
# Create an example Pyspark DataFrame
from pyspark.sql import Row
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('A', 'AA', 'mail1', 100000)
employee2 = Employee('B', 'BB', 'mail2', 120000 )
employee3 = Employee('C', None, 'mail3', 140000 )
employee4 = Employee('D', 'DD', 'mail4', 160000 )
employee5 = Employee('E', 'EE', 'mail5', 160000 )
department1 = Row(id='123', name='HR')
department2 = Row(id='456', name='OPS')
department3 = Row(id='789', name='FN')
department4 = Row(id='101112', name='DEV')
departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2, employee5])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4, employee3])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])
departmentsWithEmployees_Seq = [departmentWithEmployees1, departmentWithEmployees2]
dframe = spark.createDataFrame(departmentsWithEmployees_Seq)
In order to save this file as CSV, I firstly tried this solution:
type(dframe)
Out[]: pyspark.sql.dataframe.DataFrame
dframe.write.csv('junk_mycsv.csv')
Unfortunately, that results in this error:
org.apache.spark.sql.AnalysisException: CSV data source does not support struct<id:string,name:string> data type.;
That is why I tried another approach: converting the Spark dataframe into a pandas dataframe and saving that, as mentioned in this example.
pandas_df = dframe.toPandas()
This works. However, when I display the data, information is missing:
print(pandas_df.head())
department employees
0 (123, HR) [(A, AA, mail1, 100000), (B, BB, mail2, 120000...
1 (456, OPS) [(C, None, mail3, 140000), (D, DD, mail4, 1600...
As you can see, information is missing. The data should look like this:
department employees
0 id:123, name:HR firstName: A, lastName: AA, email: mail1, salary: 100000
# Info is missing like 'id', 'name', 'firstName', 'lastName', 'email' etc.
# For the complete expected example, see screenshot below.
Just for information: I am working in Databricks, with Python.
Therefore, how can I write my data (dframe from the example above) without losing information?
Many thanks in advance!
Edit
Adding a picture for Pault, to show the format of the csv (and the headers).
Edit2
Replacing the picture for example csv output:
After running Pault's code:
from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
.repartition(1).write.csv("junk_mycsv.csv", header= True)
The output is not tidy, since most column headers are empty (due to the nested format?). Only copying the first row:
department employees (empty ColName) (empty ColName) (and so on)
{\id\":\"123\" \"name\":\"HR\"}" [{\firstName\":\"A\" \"lastName\":\"AA\" (...)
Your dataframe has the following schema:
dframe.printSchema()
#root
# |-- department: struct (nullable = true)
# | |-- id: string (nullable = true)
# | |-- name: string (nullable = true)
# |-- employees: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- firstName: string (nullable = true)
# | | |-- lastName: string (nullable = true)
# | | |-- email: string (nullable = true)
# | | |-- salary: long (nullable = true)
So the department column is a StructType with two named fields and the employees column is an array of structs with four named fields. It appears what you want is to write the data in a format that saves both the key and the value for each record.
One option is to write the file in JSON format instead of CSV:
dframe.write.json("junk.json")
Which produces the following output:
{"department":{"id":"123","name":"HR"},"employees":[{"firstName":"A","lastName":"AA","email":"mail1","salary":100000},{"firstName":"B","lastName":"BB","email":"mail2","salary":120000},{"firstName":"E","lastName":"EE","email":"mail5","salary":160000}]}
{"department":{"id":"456","name":"OPS"},"employees":[{"firstName":"C","email":"mail3","salary":140000},{"firstName":"D","lastName":"DD","email":"mail4","salary":160000}]}
Or if you wanted to keep it in CSV format, you can use to_json to convert each column to JSON before writing the CSV.
# looping over all columns
# but you can also just limit this to the columns you want to convert
from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
.write.csv("junk_mycsv.csv")
This produces the following output:
"{\"id\":\"123\",\"name\":\"HR\"}","[{\"firstName\":\"A\",\"lastName\":\"AA\",\"email\":\"mail1\",\"salary\":100000},{\"firstName\":\"B\",\"lastName\":\"BB\",\"email\":\"mail2\",\"salary\":120000},{\"firstName\":\"E\",\"lastName\":\"EE\",\"email\":\"mail5\",\"salary\":160000}]"
"{\"id\":\"456\",\"name\":\"OPS\"}","[{\"firstName\":\"C\",\"email\":\"mail3\",\"salary\":140000},{\"firstName\":\"D\",\"lastName\":\"DD\",\"email\":\"mail4\",\"salary\":160000}]"
Note that the double-quotes are escaped.
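To see what that escaping means for whoever reads the file back, here is a plain-Python sketch using only the standard json and csv modules. It is not Spark code: it serializes a made-up subset of the record above to JSON strings, writes them as CSV fields with backslash-escaped quotes to mimic the output shown, and then recovers the nested data. The dialect settings are assumptions chosen to match that output.

```python
import csv
import io
import json

record = {
    "department": {"id": "123", "name": "HR"},
    "employees": [
        {"firstName": "A", "lastName": "AA", "email": "mail1", "salary": 100000},
    ],
}

# Serialize each nested column to a JSON string, one string per column.
row = [json.dumps(record["department"]), json.dumps(record["employees"])]

# Write a header and the row, escaping embedded quotes with a backslash.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, doublequote=False, escapechar="\\")
writer.writerow(["department", "employees"])
writer.writerow(row)

# Read back with the same settings and decode the JSON strings.
buffer.seek(0)
reader = csv.reader(buffer, doublequote=False, escapechar="\\")
header = next(reader)
department_json, employees_json = next(reader)
restored = {
    "department": json.loads(department_json),
    "employees": json.loads(employees_json),
}

print(header)              # ['department', 'employees']
print(restored == record)  # True
```

The keys survive the round trip because they are stored inside each field. A reader only has to use the same quote and escape settings as the writer.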
I have a data table in PySpark that contains two columns with a data type of 'struct'.
Please see sample data frame below:
word_verb word_noun
{_1=cook, _2=VB} {_1=chicken, _2=NN}
{_1=pack, _2=VBN} {_1=lunch, _2=NN}
{_1=reconnected, _2=VBN} {_1=wifi, _2=NN}
I want to concatenate the two columns together so I can do a frequency count of the concatenated verb and noun chunk.
I tried the code below:
df = df.withColumn('word_chunk_final', F.concat(F.col('word_verb'), F.col('word_noun')))
But I get the following error:
AnalysisException: u"cannot resolve 'concat(`word_verb`, `word_noun`)' due to data type mismatch: input to function concat should have been string, binary or array, but it's [struct<_1:string,_2:string>, struct<_1:string,_2:string>]
My desired output table is as follows. The concatenated new field would have datatype of string:
word_verb word_noun word_chunk_final
{_1=cook, _2=VB} {_1=chicken, _2=NN} cook chicken
{_1=pack, _2=VBN} {_1=lunch, _2=NN} pack lunch
{_1=reconnected, _2=VBN} {_1=wifi, _2=NN} reconnected wifi
Your code is almost there.
Assuming your schema is as follows:
df.printSchema()
#root
# |-- word_verb: struct (nullable = true)
# | |-- _1: string (nullable = true)
# | |-- _2: string (nullable = true)
# |-- word_noun: struct (nullable = true)
# | |-- _1: string (nullable = true)
# | |-- _2: string (nullable = true)
You just need to access the value of the _1 field for each column:
import pyspark.sql.functions as F
df.withColumn(
"word_chunk_final",
F.concat_ws(' ', F.col('word_verb')['_1'], F.col('word_noun')['_1'])
).show()
#+-----------------+------------+----------------+
#| word_verb| word_noun|word_chunk_final|
#+-----------------+------------+----------------+
#| [cook,VB]|[chicken,NN]| cook chicken|
#| [pack,VBN]| [lunch,NN]| pack lunch|
#|[reconnected,VBN]| [wifi,NN]|reconnected wifi|
#+-----------------+------------+----------------+
Also, you should use concat_ws ("concatenate with separator") instead of concat to add the strings together with a space in between them. It's similar to how str.join works in Python.
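The str.join comparison, together with the frequency count the question was ultimately after, can be sketched in plain Python. This is not Spark code: join_skipping_none is a hypothetical helper, the rows are dictionaries standing in for the structs, and the fourth row is made up so that one chunk repeats. One difference to keep in mind is that concat_ws skips null values, whereas str.join raises on None, so the helper filters them out.

```python
from collections import Counter

def join_skipping_none(separator, *values):
    # Join the values with the separator, leaving out missing ones.
    return separator.join(v for v in values if v is not None)

rows = [
    {"word_verb": {"_1": "cook", "_2": "VB"}, "word_noun": {"_1": "chicken", "_2": "NN"}},
    {"word_verb": {"_1": "pack", "_2": "VBN"}, "word_noun": {"_1": "lunch", "_2": "NN"}},
    {"word_verb": {"_1": "reconnected", "_2": "VBN"}, "word_noun": {"_1": "wifi", "_2": "NN"}},
    {"word_verb": {"_1": "cook", "_2": "VB"}, "word_noun": {"_1": "chicken", "_2": "NN"}},  # made-up repeat
]

# Pick the _1 field of each struct and join the two words with a space.
chunks = [
    join_skipping_none(" ", row["word_verb"]["_1"], row["word_noun"]["_1"])
    for row in rows
]

# Count how often each verb-noun chunk occurs.
counts = Counter(chunks)
print(chunks)  # ['cook chicken', 'pack lunch', 'reconnected wifi', 'cook chicken']
print(counts["cook chicken"])  # 2
```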
I have a PySpark dataframe with a column named Filters of type:
"array<struct<Op:string,Type:string,Val:string>>"
I want to save my dataframe to a CSV file, and for that I need to cast the array to a string type.
I tried DF.Filters.tostring() and DF.Filters.cast(StringType()), but both give the following result for each row in the Filters column:
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#56234c19
The code is as follows
from pyspark.sql.types import StringType
DF.printSchema()
|-- ClientNum: string (nullable = true)
|-- Filters: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Op: string (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Val: string (nullable = true)
DF_cast = DF.select ('ClientNum',DF.Filters.cast(StringType()))
DF_cast.printSchema()
|-- ClientNum: string (nullable = true)
|-- Filters: string (nullable = true)
DF_cast.show()
| ClientNum | Filters
| 32103 | org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#d9e517ce
| 218056 | org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#3c744494
Sample JSON data:
{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}
Thanks !!
I created a sample JSON dataset to match that schema:
{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}
s.select(s.col("ClientNum"), s.col("Filters").cast(StringType)).show(false)
+---------+------------------------------------------------------------------+
|ClientNum|Filters |
+---------+------------------------------------------------------------------+
|abc123 |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#60fca57e|
+---------+------------------------------------------------------------------+
Your problem is best solved using the explode() function which flattens an array, then the star expand notation:
s.selectExpr("explode(Filters) AS structCol").selectExpr("structCol.*").show()
+---+----+---+
| Op|Type|Val|
+---+----+---+
|foo| bar|baz|
+---+----+---+
To make it a single column string separated by commas:
s.selectExpr("explode(Filters) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("single_col")).show()
+-----------+
| single_col|
+-----------+
|foo,bar,baz|
+-----------+
Explode Array reference: Flattening Rows in Spark
Star expand reference for "struct" type: How to flatten a struct in a spark dataframe?
For me in Pyspark the function to_json() did the job.
As a plus compared to the simple casting to String, it keeps the "struct keys" as well (not only the "struct values"). So for the reported example I would have something like:
[{"Op":"foo","Type":"bar","Val":"baz"}]
This was much more useful to me since I had to write the results to a Postgres table. In this format I can easily use the JSON functions that Postgres supports.
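The difference between the two string forms suggested so far can be sketched in plain Python on the sample record. This is not Spark code: a list comprehension stands in for explode followed by a comma join, and json.dumps stands in for to_json.

```python
import json

record = {
    "ClientNum": "abc123",
    "Filters": [{"Op": "foo", "Type": "bar", "Val": "baz"}],
}

# Option 1: one string per array element, values only, joined by commas.
exploded = [",".join(struct.values()) for struct in record["Filters"]]

# Option 2: one JSON string for the whole array, keys included.
as_json = json.dumps(record["Filters"], separators=(",", ":"))

print(exploded)  # ['foo,bar,baz']
print(as_json)   # [{"Op":"foo","Type":"bar","Val":"baz"}]

# Only the JSON form can be turned back into the original structure.
print(json.loads(as_json) == record["Filters"])  # True
```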
You can try this:
DF = DF.withColumn('Filters', DF.Filters.cast("string"))
In PySpark 1.6 DataFrame currently there is no Spark builtin function to convert from string to float/double.
Assume we have an RDD of ('house_name', 'price') pairs with both values stored as strings, and we would like to convert price from string to float.
In PySpark, we can apply map and the Python float function to achieve this.
New_RDD = RawDataRDD.map(lambda row: (row[0], float(row[1])))  # this works
In a PySpark 1.6 DataFrame, it does not work:
New_DF = rawdataDF.select('house name', float('price')) # did not work
Until a built-in PySpark function is available, how can I achieve this conversion with a UDF?
I developed this conversion UDF as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def string_to_float(x):
    # Convert the incoming string to a Python float.
    return float(x)

# The declared return type must match what the function returns.
udfstring_to_float = udf(string_to_float, DoubleType())
rawdata.withColumn("price", udfstring_to_float("price"))
Is there a better and much simpler way to achieve the same?
According to the documentation, you can use the cast function on a column like this:
from pyspark.sql.types import DoubleType
rawdata.withColumn("price", rawdata["price"].cast(DoubleType()))
The answer should be as follows:
>>> rawdata.printSchema()
root
|-- house name: string (nullable = true)
|-- price: string (nullable = true)
>>> rawdata=rawdata.withColumn('price',rawdata['price'].cast("float").alias('price'))
>>> rawdata.printSchema()
root
|-- house name: string (nullable = true)
|-- price: float (nullable = true)
This is the shortest one-line solution and it needs no user-defined function. You can check whether it worked correctly by calling printSchema().
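One behavior of cast is worth knowing before relying on it: assuming Spark's default (non-ANSI) settings, a string that cannot be parsed as a number becomes null instead of raising an error, whereas Python's float raises. The plain-Python sketch below mimics that semantics. It is not Spark code, to_float_or_none is a hypothetical helper, and the rows are made up.

```python
def to_float_or_none(text):
    """Return float(text), or None when text is missing or not a valid number."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

# Made-up (house_name, price) pairs, with prices stored as strings.
rows = [("house_a", "250000.5"), ("house_b", "n/a"), ("house_c", None)]

converted = [(name, to_float_or_none(price)) for name, price in rows]
print(converted)
# [('house_a', 250000.5), ('house_b', None), ('house_c', None)]
```

After a cast, counting the nulls in the new column is a quick way to spot values that failed to convert.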