I have a dataframe with a schema as follows:
root
 |-- column: struct (nullable = true)
 |    |-- column-string: string (nullable = true)
 |-- count: long (nullable = true)
What I want to do is:
Get rid of the struct - or by that I mean "promote" column-string, so my dataframe only has 2 columns - column-string and count
I then want to split column-string into 3 different columns, so I end up with the schema:
The text within column-string always fits the format:
Some-Text,Text,MoreText
Does anyone know how this is possible?
I'm using Pyspark Python.
PS. I am new to Pyspark & I don't know much about the struct format and couldn't find how to write an example into my post to make it reproducible - sorry.
You can also use from_csv (available since Spark 3.0) to convert the comma-delimited string into a struct, and then star-expand the struct:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'col',
    F.from_csv(
        'column.column-string',
        '`column-string` string, `column-string2` string, `column-string3` string'
    )
).select('col.*', 'count')
df2.show()
+-------------+--------------+--------------+-----+
|column-string|column-string2|column-string3|count|
+-------------+--------------+--------------+-----+
|     SomeText|          Text|      MoreText|    1|
+-------------+--------------+--------------+-----+
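Note that it's better not to have hyphens in column names, because a hyphen is parsed as the subtraction operator in SQL expressions, so such names have to be quoted with backticks (as in the schema string above). Underscores are better.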
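The renaming suggested above can be sketched in a few lines. This is only an illustration: to_underscores is a hypothetical helper name, and the commented usage line assumes a DataFrame called df2 like the one produced earlier.

```python
def to_underscores(names):
    """Return the column names with every hyphen replaced by an underscore."""
    return [name.replace("-", "_") for name in names]


# Hypothetical usage with a Spark DataFrame named df2 (not executed here):
#   df3 = df2.toDF(*to_underscores(df2.columns))
new_names = to_underscores(
    ["column-string", "column-string2", "column-string3", "count"]
)
print(new_names)
```

Names that already contain no hyphen, such as count, pass through unchanged.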
You can select the column-string field from the struct using column.column-string, then simply split it by a comma to get three columns:
from pyspark.sql import functions as F
df1 = df.withColumn(
    "column_string", F.split(F.col("column.column-string"), ",")
).select(
    F.col("column_string")[0].alias("column-string"),
    F.col("column_string")[1].alias("column-string2"),
    F.col("column_string")[2].alias("column-string3"),
    F.col("count")
)
I have a difficult issue regarding rows in a PySpark DataFrame which contains a series of json strings.
The problem is that each row might have a different schema from the others, so when I want to transform these rows into a subscriptable datatype in PySpark, I need a "unified" schema.
For example, consider this dataframe:
import pandas as pd
json_1 = '{"a": 10, "b": 100}'
json_2 = '{"a": 20, "c": 2000}'
json_3 = '{"c": 300, "b": "3000", "d": 100.0, "f": {"some_other": {"A": 10}, "maybe_this": 10}}'
df = spark.createDataFrame(pd.DataFrame({'A': [1, 2, 3], 'B': [json_1, json_2, json_3]}))
Notice that each row contains different versions of the json string. To combat this, I do the following transforms
import json
import pyspark.sql.functions as fcn
from pyspark.sql import Row
from collections import OrderedDict
from pyspark.sql import DataFrame as SparkDataFrame
def convert_to_row(d: dict) -> Row:
    """Convert a dictionary to a SparkRow.

    Parameters
    ----------
    d : dict
        Dictionary to convert.

    Returns
    -------
    Row
    """
    return Row(**OrderedDict(sorted(d.items())))


def get_schema_from_dictionary(the_dict: dict):
    """Create a schema from a dictionary.

    Parameters
    ----------
    the_dict : dict

    Returns
    -------
    schema
        Schema understood by PySpark.
    """
    return spark.read.json(sc.parallelize([json.dumps(the_dict)])).schema


def get_universal_schema(df: SparkDataFrame, column: str):
    """Given a dataframe, retrieve the "global" schema for the column.

    NOTE: It does this by merging across all the rows, so this will
    take a long time for larger dataframes.

    Parameters
    ----------
    df : SparkDataFrame
        Dataframe containing the column
    column : str
        Column to parse.

    Returns
    -------
    schema
        Schema understood by PySpark.
    """
    col_values = [json.loads(getattr(item, column)) for item in df.select(column).collect()]
    mega_dict = {}
    for value in col_values:
        mega_dict = {**mega_dict, **value}
    return get_schema_from_dictionary(mega_dict)


def get_sample_schema(df, column):
    """Given a dataframe, sample a single value to convert.

    NOTE: This assumes that the dataframe has the same schema
    over all rows.

    Parameters
    ----------
    df : SparkDataFrame
        Dataframe containing the column
    column : str
        Column to parse.

    Returns
    -------
    schema
        Schema understood by PySpark.
    """
    return get_universal_schema(df.limit(1), column)


def from_json(df: SparkDataFrame, column: str, manual_schema=None, merge: bool = False) -> SparkDataFrame:
    """Convert json-string column to a subscriptable object.

    Parameters
    ----------
    df : SparkDataFrame
        Dataframe containing the column
    column : str
        Column to parse.
    manual_schema : PysparkSchema, optional
        Schema understood by PySpark, by default None
    merge : bool, optional
        Parse the whole dataframe to extract a global schema, by default False

    Returns
    -------
    SparkDataFrame
    """
    if manual_schema is None or manual_schema == {}:
        if merge:
            schema = get_universal_schema(df, column)
        else:
            schema = get_sample_schema(df, column)
    else:
        schema = manual_schema
    return df.withColumn(column, fcn.from_json(column, schema))
Then, I can simply do the following, to get a new dataframe, which has a unified schema
df = from_json(df, column='B', merge=True)
df.printSchema()
root
 |-- A: long (nullable = true)
 |-- B: struct (nullable = true)
 |    |-- a: long (nullable = true)
 |    |-- b: string (nullable = true)
 |    |-- c: long (nullable = true)
 |    |-- d: double (nullable = true)
 |    |-- f: struct (nullable = true)
 |    |    |-- maybe_this: long (nullable = true)
 |    |    |-- some_other: struct (nullable = true)
 |    |    |    |-- A: long (nullable = true)
Now we come to the crux of the issue. Since I'm doing this here col_values = [json.loads(getattr(item, column)) for item in df.select(column).collect()] I'm limited to the amount of memory on the master node.
How can I do a similar procedure such that the work is distributed across the workers before anything is collected to the master node?
If I understand your question correctly: spark.read.json() accepts an RDD of JSON strings as its path parameter, and because an RDD is distributed, inferring the schema this way avoids the potential OOM issue of calling collect() on a large dataset. So you can try adjusting the function get_universal_schema to the following:
def get_universal_schema(df: SparkDataFrame, column: str):
    return spark.read.json(df.select(column).rdd.map(lambda x: x[0])).schema
and keep two functions: get_sample_schema() and from_json() as-is.
Spark DataFrames are designed to work with data that has a schema. The DataFrame API exposes methods that are useful on data with a defined schema, such as groupBy on a column or aggregation functions that operate on columns.
Given the requirements presented in the question, it appears to me that there is no fixed schema in the input data, and that you won't benefit from a DataFrame API. In fact it will likely add more constraints instead.
I think it is better to consider this data "schemaless" and use a lower-level API - the RDDs. RDDs are distributed across the cluster by definition. So, using RDD API you can first pre-process the data (consuming it as text), and then convert it to a DataFrame.
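As a rough illustration of that pre-processing idea, the sketch below treats each record as text and summarises which keys and value types occur, using only a map step and a reduce step. It runs locally on a plain list; field_types and merge_field_types are hypothetical helper names, and passing them to rdd.map(...) and rdd.reduce(...) on a real cluster is an assumption, not something tested here.

```python
import json
from functools import reduce


def field_types(json_str):
    """Map each top-level key of one JSON record to the set of type names seen."""
    record = json.loads(json_str)
    return {key: {type(value).__name__} for key, value in record.items()}


def merge_field_types(left, right):
    """Union two key -> type-name-set mappings without mutating either input."""
    merged = {key: set(types) for key, types in left.items()}
    for key, types in right.items():
        merged.setdefault(key, set()).update(types)
    return merged


rows = [
    '{"a": 10, "b": 100}',
    '{"a": 20, "c": 2000}',
    '{"c": 300, "b": "3000", "d": 100.0}',
]

# Locally this is map + reduce over a list; the same two functions could be
# handed to rdd.map(...) and rdd.reduce(...) so the merge happens on workers.
summary = reduce(merge_field_types, map(field_types, rows))
print(sorted(summary))
print(sorted(summary["b"]))
```

Because the merge is associative and commutative, partial results can be combined in any order, which is what makes it suitable for a distributed reduce. Only the small summary reaches the driver, and it also exposes conflicts such as b being an integer in one record and a string in another.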
I'm working in Spark 1.6.1 and Python 2.7 and I have this thing to solve:
Get a dataframe A with X rows
For each row in A, depending on a field, create one or more rows of a new dataframe B
Save that new dataframe B
The solution that I've come up with so far is to collect dataframe A, go over it, append the row(s) of B to a list, and then create dataframe B from that list.
With this solution I obviously lose all the perks of working with dataframes, and I would like to use foreach, but I can't find a way to make this work. I've tried this so far:
Pass an empty list to the foreach function (this just ignores the foreach function and doesn't do anything)
Create a global variable to be used in the foreach function (complains that it can't find the list)
Does anyone have any ideas?
Thank you
----------------------EDIT:
Examples of the things I've tried:
def f(row, list):
    if row.one:
        list += [Row(type='one', field='ok')]
    else:
        list += [Row(type='one', field='ok')]
        list += [Row(type='two', field='nok')]

list = []
dfA.foreach(lambda x : f(x, list))
As I mentioned, this does nothing; it doesn't execute the function.
And I've also tried this (with list defined at the beginning of the class):
global list
def f(row):
    if row.one:
        list += [Row(type='one', field='ok')]
    else:
        list += [Row(type='one', field='ok')]
        list += [Row(type='two', field='nok')]

dfA.foreach(list)
---------EDIT 2:
What I'm doing right now is:
list = []
for row in dfA.collect():
    string = re.search(a_regex, row['raw'])
    if string:
        dates = re.findall(date_regex, string.group())
        for date in dates:
            date_string = datetime.strptime(date, '%Y-%m-%d').date()
            list += [Row(event_type='1', event_date=date_string)]
    b_string = re.search(b_regex, row['raw'])
    if b_string:
        dates = re.findall(date_regex, b_string.group())
        for date in dates:
            scheduled_to = datetime.strptime(date, '%Y-%m-%d').date()
            list += [Row(event_type='2', event_date=scheduled_to)]
and then:
dfB = self._sql_context.createDataFrame(list)
dfA is given by another process and I can't change it. I know it's a very stupid way of using dataframes, but I can't do anything about that.
--------------------EDIT3:
dfA.raw sample:
{"new":[],"removed":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}]}
{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]}
{"new":[{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"}],"removed":[{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2018-02-10","end":"2018-02-16"}]}|
and the regex:
a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
b_regex = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'
dfA.select('raw').show(2,False)
+-------------------------------------------------------------------------------------------------------+
|raw |
+-------------------------------------------------------------------------------------------------------+
|{"new":[{"start":"2018-03-24","end":"2018-03-30","scheduled_by_system":null}],"removed":[]}|
|{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]}|
+-------------------------------------------------------------------------------------------------------+
only showing top 2 rows
df.select('raw').printSchema()
root
|-- raw: string (nullable = true)
You would need to write a udf that returns the event dates after you have selected the required raw column; the event_type is added afterwards as a literal column.
import re
from datetime import datetime

from pyspark.sql import functions as F
from pyspark.sql import types as T

def searchUdf(regex, dateRegex, x):
    # Collect every start date found inside the part of x matched by regex.
    list_return = []
    string = re.search(regex, x)
    if string:
        dates = re.findall(dateRegex, string.group())
        for date in dates:
            date_string = datetime.strptime(date, '%Y-%m-%d').date()
            list_return.append(date_string)
    return list_return

udfFunctionCall = F.udf(searchUdf, T.ArrayType(T.DateType()))
The udf parses the raw column string with the regex and dateRegex passed as arguments and returns the matched dates as an ArrayType(DateType()) column.
Call the udf once per event type, filter out the rows whose array is empty, and name the columns event_type and event_date:
df = df.select("raw")
adf = df.select(F.lit(1).alias("event_type"), udfFunctionCall(F.lit(a_regex), F.lit(date_regex), df.raw).alias("event_date"))\
    .filter(F.size(F.col("event_date")) > 0)
# b_regex selects the "removed" entries; passing a_regex here would just repeat the "new" dates.
bdf = df.select(F.lit(2).alias("event_type"), udfFunctionCall(F.lit(b_regex), F.lit(date_regex), df.raw).alias("event_date")) \
    .filter(F.size(F.col("event_date")) > 0)
The regexes used are the ones provided in the question:
a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
b_regex = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'
Now that you have one dataframe per event_type, the final step is to merge them together:
adf.unionAll(bdf)
And that's it.
With the following raw column

|raw |

|{"new":[],"removed":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}]} |
|{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]} |
|{"new":[{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"}],"removed":[{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2018-02-10","end":"2018-02-16"}]}|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
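You should be getting the following, where event_date is an array column. The event_type 2 rows shown here repeat the "new" dates, which is what results when a_regex is passed for both dataframes; with b_regex they hold the start dates listed under "removed" instead: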
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|event_type|event_date |
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[2018-03-10] |
|1 |[2017-01-28, 2017-02-04, 2017-02-11, 2017-02-18, 2017-03-04, 2017-03-11, 2017-03-18, 2017-09-02, 2017-09-16, 2017-09-23, 2017-09-30, 2017-10-07, 2017-12-02, 2017-12-09, 2017-12-16, 2017-12-23, 2018-01-06]|
|2 |[2018-03-10] |
|2 |[2017-01-28, 2017-02-04, 2017-02-11, 2017-02-18, 2017-03-04, 2017-03-11, 2017-03-18, 2017-09-02, 2017-09-16, 2017-09-23, 2017-09-30, 2017-10-07, 2017-12-02, 2017-12-09, 2017-12-16, 2017-12-23, 2018-01-06]|
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
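If one row per date is wanted rather than an array per input row, the same regexes can drive a function that returns several (event_type, event_date) tuples for each raw string. The sketch below is plain Python and only an illustration: extract_events is a hypothetical name, the sample string is made up in the same shape as the raw column, and the flatMap usage in the comment is an assumption that was not run against Spark 1.6.

```python
import re
from datetime import date, datetime

# Regexes copied from the question.
a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
b_regex = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'


def extract_events(raw):
    """Return one (event_type, event_date) tuple per start date found in raw."""
    events = []
    for event_type, regex in ((1, a_regex), (2, b_regex)):
        match = re.search(regex, raw)
        if match:
            for found in re.findall(date_regex, match.group()):
                parsed = datetime.strptime(found, '%Y-%m-%d').date()
                events.append((event_type, parsed))
    return events


# Made-up sample in the same shape as the raw column.
sample = ('{"new":[{"start":"2018-03-24","end":"2018-03-30"}],'
          '"removed":[{"start":"2018-03-10","end":"2018-03-16"}]}')
print(extract_events(sample))

# Hypothetical usage on the cluster (not executed here):
#   rows = dfA.rdd.flatMap(lambda row: extract_events(row['raw']))
#   dfB = sqlContext.createDataFrame(rows, ['event_type', 'event_date'])
```

Since the function returns a list, flatMap would turn each input row into zero, one, or many output rows on the workers, which is the behaviour the question tried to get from foreach.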
In Spark, I have the following data frame called "df" with some null entries:
+-------+--------------------+--------------------+
| id| features1| features2|
+-------+--------------------+--------------------+
| 185|(5,[0,1,4],[0.1,0...| null|
| 220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
| 225| null|(10,[1,3,5],[0.1,...|
+-------+--------------------+--------------------+
df.features1 and df.features2 are of type vector (nullable). I then tried to use the following code to fill the null entries with SparseVectors:
df1 = df.na.fill({"features1":SparseVector(5,{}), "features2":SparseVector(10, {})})
This code led to the following error:
AttributeError: 'SparseVector' object has no attribute '_get_object_id'
Then I found the following paragraph in the Spark documentation:
fillna(value, subset=None)
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
Parameters:
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.
Does this explain my failure to replace null entries with SparseVectors in DataFrame? Or does this mean that there's no way to do this in DataFrame?
I can achieve my goal by converting DataFrame to RDD and replacing None values with SparseVectors, but it will be much more convenient for me to do this directly in DataFrame.
Is there any method to do this directly in DataFrame?
Thanks!
You can use udf:
from pyspark.sql.functions import udf, lit
from pyspark.ml.linalg import SparseVector, VectorUDT

# Return the vector unchanged, or an empty SparseVector of size i when it is null.
fill_with_vector = udf(
    lambda x, i: x if x is not None else SparseVector(i, {}),
    VectorUDT()
)

df = sc.parallelize([
    (SparseVector(5, {1: 1.0}), SparseVector(10, {1: -1.0})), (None, None)
]).toDF(["features1", "features2"])

(df
    .withColumn("features1", fill_with_vector("features1", lit(5)))
    .withColumn("features2", fill_with_vector("features2", lit(10)))
    .show())
# +-------------+---------------+
# | features1| features2|
# +-------------+---------------+
# |(5,[1],[1.0])|(10,[1],[-1.0])|
# | (5,[],[])| (10,[],[])|
# +-------------+---------------+