PySpark: merge multiple columns into a JSON column - python

I asked the question a while back for python, but now I need to do the same thing in PySpark.
I have a dataframe (df) like so:
|cust_id|address    |store_id|email        |sales_channel|category|
|-------|-----------|--------|-------------|-------------|--------|
|1234567|123 Main St|10SjtT  |idk#gmail.com|ecom         |direct  |
|4567345|345 Main St|10SjtT  |101#gmail.com|instore      |direct  |
|1569457|876 Main St|51FstT  |404#gmail.com|ecom         |direct  |
and I would like to combine the last 4 fields into one metadata field that is a json like so:
|cust_id|address    |metadata                                                                                      |
|-------|-----------|----------------------------------------------------------------------------------------------|
|1234567|123 Main St|{'store_id':'10SjtT', 'email':'idk#gmail.com','sales_channel':'ecom', 'category':'direct'}   |
|4567345|345 Main St|{'store_id':'10SjtT', 'email':'101#gmail.com','sales_channel':'instore', 'category':'direct'}|
|1569457|876 Main St|{'store_id':'51FstT', 'email':'404#gmail.com','sales_channel':'ecom', 'category':'direct'}   |
Here's the code I used to do this in python:
cols = [
    'store_id',
    'store_category',
    'sales_channel',
    'email'
]
df1 = df.copy()
df1['metadata'] = df1[cols].to_dict(orient='records')
df1 = df1.drop(columns=cols)
but I would like to translate this to PySpark code to work with a spark dataframe; I do NOT want to use pandas in Spark.

Use the to_json function to create a JSON object!
Example:
from pyspark.sql.functions import *

# sample data
df = spark.createDataFrame(
    [('1234567', '123 Main St', '10SjtT', 'idk#gmail.com', 'ecom', 'direct')],
    ['cust_id', 'address', 'store_id', 'email', 'sales_channel', 'category'])

df.select(
    "cust_id",
    "address",
    to_json(struct("store_id", "category", "sales_channel", "email")).alias("metadata")
).show(10, False)
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","category":"direct","sales_channel":"ecom","email":"idk#gmail.com"}|
+-------+-----------+----------------------------------------------------------------------------------------+
to_json by passing a list of columns:
ll = ['store_id', 'email', 'sales_channel', 'category']
df.withColumn("metadata", to_json(struct(*ll))).drop(*ll).show(truncate=False)
#result
+-------+-----------+----------------------------------------------------------------------------------------+
|cust_id|address |metadata |
+-------+-----------+----------------------------------------------------------------------------------------+
|1234567|123 Main St|{"store_id":"10SjtT","email":"idk#gmail.com","sales_channel":"ecom","category":"direct"}|
+-------+-----------+----------------------------------------------------------------------------------------+

@Shu gives a good answer; here's a variant that works out slightly better for my use case. I'm going from Kafka -> Spark -> Kafka, and this one-liner does exactly what I want. The struct(*) will pack up all the fields in the dataframe.
# Packup the fields in preparation for sending to Kafka sink
kafka_df = df.selectExpr('cast(id as string) as key', 'to_json(struct(*)) as value')
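For context, here is a minimal sketch of how such a frame might then be written to the Kafka sink; the broker address and topic name below are placeholders, not from the original post.
# Hypothetical broker and topic names; substitute your own.
(kafka_df
 .write
 .format("kafka")
 .option("kafka.bootstrap.servers", "broker1:9092")
 .option("topic", "customer-metadata")
 .save())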

Related

How to filter with multiple contains in PySpark

I want to write a query with PySpark to filter rows that contain at least one word from an array. For example, the dataframe is:
"content" "other"
My father is big. ...
My mother is beautiful. ...
I'm going to travel. ...
I have an array:
array=["mother","father"]
And the output must be this:
"content" "other"
My father is big. ...
My mother is beautiful. ...
A simple filter for word in array.
I think this solution works. Let me know what you think.
import pyspark.sql.functions as f

phrases = ['bc', 'ij']

df = spark.createDataFrame([
    ('abcd',),
    ('efgh',),
    ('ijkl',)
], ['col1'])

(df
 .withColumn('phrases', f.array([f.lit(element) for element in phrases]))
 .where(f.expr('exists(phrases, element -> col1 like concat("%", element, "%"))'))
 .drop('phrases')
 .show()
)
output
+----+
|col1|
+----+
|abcd|
|ijkl|
+----+
Had the same thoughts as @ARCrow, but using instr.
lst = ["mother", "father"]
DataFrame
data = [
    (1, "My father is big."),
    (2, "My mother is beautiful"),
    (3, "I'm going to travel.")
]
df = spark.createDataFrame(data, ("id", 'content'))
Solution
import pyspark.sql.functions as f

df = (df
      .withColumn('phrases', f.array([f.lit(element) for element in lst]))
      .where(f.expr('exists(phrases, element -> instr(content, element) >= 1)'))
      .drop('phrases')
     )
df.show()
Outcome
+---+--------------------+
| id| content|
+---+--------------------+
| 1| My father is big.|
| 2|My mother is beau...|
+---+--------------------+
Taking the same configuration as @wwnde:
data = [
    (1, "My father is big."),
    (2, "My mother is beautiful"),
    (3, "I'm going to travel.")
]
df = spark.createDataFrame(data, ("id", 'content'))
Solution
from pyspark.sql import functions as F

words = ["father", "mother"]
conditions = " or ".join([f"content like '%{word}%'" for word in words])
(
    df
    .filter(F.expr(conditions))
    .show(truncate=False)
)
+---+----------------------+
|id |content |
+---+----------------------+
|1 |My father is big. |
|2 |My mother is beautiful|
+---+----------------------+
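A variant of the same idea that stays in the DataFrame API instead of building a SQL string is to OR together Column.contains predicates; a minimal sketch using the same data as above:
from functools import reduce
import pyspark.sql.functions as F

words = ["father", "mother"]

# Build one boolean Column by OR-ing a contains() predicate per word.
cond = reduce(lambda a, b: a | b, [F.col("content").contains(w) for w in words])

df.filter(cond).show(truncate=False)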
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
First, we set up:
import pandas as pd
array=["mother","father"]
df = pd.DataFrame({"sentence": ["My father is big.", "My mother is beautiful.", "I'm going to travel. "]})
and then we can create a native Python function to express the logic:
from typing import List, Dict, Any, Iterable

def myfilter(df: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    for row in df:
        for value in array:
            if value in row["sentence"]:
                yield row
and then test it on Pandas:
from fugue import transform
transform(df, myfilter, schema="*")
Because it works on Pandas, we can execute it on Spark by specifying the engine:
import fugue_spark
transform(df, myfilter, schema="*", engine="spark").show()
+---+--------------------+
| id| sentence|
+---+--------------------+
| 0| My father is big.|
| 1|My mother is beau...|
+---+--------------------+
Note we need .show() because Spark evaluates lazily. Schema is also a Spark requirement so Fugue interprets the "*" as all columns in = all columns out.
The fugue transform function can take both Pandas DataFrame inputs and Spark DataFrame inputs.
Edit:
You can replace the myfilter function above with a Pandas implementation like this:
def myfilter(df: pd.DataFrame) -> pd.DataFrame:
    res = df.loc[df["sentence"].str.contains("|".join(array))]
    return res
and Fugue will be able to port it to Spark the same way. Fugue knows how to adjust to the type hints and this will be faster than the native Python implementation because it takes advantage of Pandas being vectorized.

What is the most efficient way to "flatten" a JSON within a dataframe in pandas?

I'm having trouble loading a big JSON lines file in pandas, mainly because I need to "flatten" one of the resulting columns after using pd.read_json.
For example, for this JSON line:
{"user_history": [{"event_info": 248595, "event_timestamp": "2019-10-01T12:46:03.145-0400", "event_type": "view"}, {"event_info": 248595, "event_timestamp": "2019-10-01T13:21:50.697-0400", "event_type": "view"}], "item_bought": 1909110}
I'd need to load 2 rows with 4 columns in pandas like this:
+--------------+--------------------------------+--------------+---------------+
| "event_info" | "event_timestamp" | "event_type" | "item_bought" |
+--------------+--------------------------------+--------------+---------------+
| 248595 | "2019-10-01T12:46:03.145-0400" | "view" | 1909110 |
| 248595 | "2019-10-01T13:21:50.697-0400" | "view" | 1909110 |
+--------------+--------------------------------+--------------+---------------+
The thing is, given the size of the file (413000+ lines, over 1GB), none of the ways that I managed to do it is fast enough for me. I was trying a rather rudimentary way, iterating over the loaded dataframe, creating a dictionary and appending the values to an empty dataframe:
history_df = pd.read_json('data/train_dataset.jl', lines=True)
history_df['index1'] = history_df.index
normalized_history = pd.DataFrame()
for index, row in history_df.iterrows():
    for dic in row['user_history']:
        dic['index1'] = row['index1']
        dic['item_bought'] = row['item_bought']
        normalized_history = normalized_history.append(dic, ignore_index=True)
So the question is which would be the fastest way to accomplish this? Is there any way without iterating the whole history_df dataframe?
Thank you in advance
Maybe try this:
import pandas as pd
import json

data = []
# assuming each line from data/train_dataset.jl
# is a json object like the one you posted above:
with open('data/train_dataset.jl') as f:
    for line in f:
        data.append(json.loads(line))

normalized_history = pd.json_normalize(data, 'user_history', 'item_bought')
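If the file has already been loaded with pd.read_json(..., lines=True) as in the question, a roughly equivalent route (a sketch, not benchmarked against the json_normalize call above) is to explode the list column and normalize the resulting dicts:
import pandas as pd

# history_df as loaded in the question with pd.read_json(..., lines=True)
exploded = history_df.explode("user_history").reset_index(drop=True)

# Flatten the per-event dicts, then re-attach item_bought by row position.
normalized_history = pd.json_normalize(
    exploded["user_history"].tolist()
).join(exploded["item_bought"])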

PySpark groupby elements with key of their occurrence

I have this data in a DATAFRAME:
id,col
65475383,acacia
63975914,acacia
65475383,excelsa
63975914,better
I want to have a dictionary that will contain the column "word" and every id that is associated with it, something like this:
word:key
acacia: 65475383,63975914
excelsa: 65475383
better: 63975914
I tried groupBy, but that is a way to aggregate data. How should I approach this problem?
I'm not sure if you intend to have the result as a Python dictionary or as a Dataframe (it is not clear from your question).
However, if you do want a Dataframe then one way to calculate that is:
from pyspark.sql.functions import collect_list

idsByWords = df \
    .groupBy("col") \
    .agg(collect_list("id").alias("ids")) \
    .withColumnRenamed("col", "word")
This will result in:
idsByWords.show(truncate=False)
+-------+--------------------+
|word |ids |
+-------+--------------------+
|excelsa|[65475383] |
|better |[63975914] |
|acacia |[65475383, 63975914]|
+-------+--------------------+
Then you can turn that dataframe into a Python dictionary :
d = {r.asDict()["word"]: r.asDict()["ids"] for r in idsByWords.collect()}
To finally get:
{
    'excelsa': [65475383],
    'better': [63975914],
    'acacia': [65475383, 63975914]
}
Note that collect may crash your driver program if it exceeds your driver memory.
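An equivalent one-liner, for reference (the same caveat about collecting to the driver applies):
# collectAsMap pulls every (word, ids) pair into driver memory, like collect().
d = idsByWords.rdd.map(lambda r: (r["word"], r["ids"])).collectAsMap()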

How to create a dataframe with a single header (1 row, many cols) and update values in this dataframe in PySpark?

I want to create a dataframe in pyspark like the table below :
category | category_id | bucket | prop_count | event_count | accum_prop_count | accum_event_count
--------------------------------------------------------------------------------------------------
nation   | nation      | 1      | 222        | 444         | 555              | 6677
So, the code I tried below :
schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df = df.withColumn("category",F.lit('nation')).withColumn("category_id",F.lit('nation')).withColumn("bucket",bucket)
df = df.withColumn("prop_count",prop_count).withColumn("event_count",event_count).withColumn("accum_prop_count",accum_prop_count).withColumn("accum_event_count",accum_event_count)
df.show()
This is giving an error :
AssertionError: col should be Column
Also, the values of the columns have to be updated later, and that update will also be a single row.
How can I do this?
I think the problem with your code lies in the lines where you use variables, like .withColumn("bucket", bucket). You are trying to create a new column by giving it an integer value, but withColumn expects a Column, not a plain integer.
To solve this, you can wrap the values in lit, just like you are already doing for "nation", like so:
df = df\
.withColumn("category",F.lit('nation'))\
.withColumn("category_id",F.lit('nation'))\
.withColumn("bucket",F.lit(bucket))\
.withColumn("prop_count",F.lit(prop_count))\
.withColumn("event_count",F.lit(event_count))\
.withColumn("accum_prop_count",F.lit(accum_prop_count))\
.withColumn("accum_event_count",F.lit(accum_event_count))
Another, simpler and cleaner way to write it:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# create schema
fields = [StructField("category", StringType(), True),
          StructField("category_id", StringType(), True),
          StructField("bucket", IntegerType(), True),
          StructField("prop_count", IntegerType(), True),
          StructField("event_count", IntegerType(), True),
          StructField("accum_prop_count", IntegerType(), True),
          StructField("accum_event_count", IntegerType(), True)]
schema = StructType(fields)

# load data
data = [["nation", "nation", 1, 222, 444, 555, 6677]]
df = spark.createDataFrame(data, schema)
df.show()
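On the second part of the question (updating the values later): Spark DataFrames are immutable, so an "update" of that single row just means deriving a new DataFrame, for example by overwriting the columns with new literals. A sketch using the df built above; the new values are made up:
import pyspark.sql.functions as F

# Returns a new DataFrame; df itself is not modified in place.
df_updated = (df
              .withColumn("prop_count", F.lit(333))
              .withColumn("event_count", F.lit(666)))
df_updated.show()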

Can data loaded in as newAPIHadoopRDD be converted into a DataFrame?

I'm using PySpark to load data from Google BigQuery.
I've loaded data by using:
dfRates = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
Where conf is defined as https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example.
I need this data as a DataFrame, so I tried,
row = Row(['userId','accoId','rating']) # or row = Row(('userId','accoId','rating'))
dataRDD = dfRates.map(row).toDF()
and
dataRDD = sqlContext.createDataFrame(dfRates,['userId','accoId','rating'])
But it does not convert the data into a DataFrame. Is there a way to convert it into a DataFrame?
As long as the types can be represented using Spark SQL types, there is no reason it couldn't be. The only problem here seems to be your code.
newAPIHadoopRDD returns an RDD of pairs (tuples of length two). In this particular context it looks like you'll get (int, str) in Python, which clearly cannot be unpacked into ['userId','accoId','rating'].
According to the doc you've linked, com.google.gson.JsonObject is represented as a JSON string, which can be parsed either on the Python side using the standard json module:
import json
from pyspark.sql import Row

def parse(v, fields=["userId", "accoId", "rating"]):
    row = Row(*fields)
    try:
        parsed = json.loads(v)
    except json.JSONDecodeError:
        parsed = {}
    return row(*[parsed.get(x) for x in fields])

# parse only the JSON value of each (key, value) pair
dfRates.values().map(parse).toDF()
or on the Scala / DataFrame side using get_json_object:
from pyspark.sql.functions import col, get_json_object
dfRates.toDF(["id", "json_string"]).select(
# This assumes you expect userId field
get_json_object(col("json_string"), "$.userId"),
...
)
Please note the differences in the syntax I've used to define and create rows.
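For completeness, a fuller version of that select, pulling all three fields mentioned in the question and aliasing them; this is a sketch that assumes the value really is a JSON string with those keys:
from pyspark.sql.functions import col, get_json_object

parsed = dfRates.toDF(["id", "json_string"]).select(
    get_json_object(col("json_string"), "$.userId").alias("userId"),
    get_json_object(col("json_string"), "$.accoId").alias("accoId"),
    get_json_object(col("json_string"), "$.rating").alias("rating"),
)
parsed.show()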
HBase table rows:
hbase(main):008:0> scan 'test_hbase_table'
ROW COLUMN+CELL
dalin column=cf:age, timestamp=1464101679601, value=40
tangtang column=cf:age, timestamp=1464101649704, value=9
tangtang column=cf:name, timestamp=1464108263191, value=zitang
2 row(s) in 0.0170 seconds
Here we go:
import json

host = '172.23.18.139'
table = 'test_hbase_table'
conf = {"hbase.zookeeper.quorum": host,
        "zookeeper.znode.parent": "/hbase-unsecure",
        "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
hbase_rdd1 = hbase_rdd.flatMapValues(lambda v: v.split("\n"))
and here we get the results:
tt = sqlContext.jsonRDD(hbase_rdd1.values())
tt.show()
+------------+---------+--------+-------------+----+------+
|columnFamily|qualifier| row| timestamp|type| value|
+------------+---------+--------+-------------+----+------+
| cf| age| dalin|1464101679601| Put| 40|
| cf| age|tangtang|1464101649704| Put| 9|
| cf| name|tangtang|1464108263191| Put|zitang|
+------------+---------+--------+-------------+----+------+
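A side note: sqlContext.jsonRDD was removed in later Spark releases; on Spark 2.x+ the same step can be written with spark.read.json, which also accepts an RDD of JSON strings. A sketch using the hbase_rdd1 from above:
# spark.read.json accepts an RDD of JSON strings (one object per record).
tt = spark.read.json(hbase_rdd1.values())
tt.show()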
