Split pyspark dataframe to chunks and convert to dictionary - python

I have a pyspark dataframe which looks like the following:
+----+--------------------+
| ID| Email|
+----+--------------------+
| 1| sample@example.org|
| 2| sample2@example.org|
| 3| sampleexample.org|
| 4| sample@exampleorg|
+----+--------------------+
What I need to do is to split it into chunks and then convert those chunks to dictionaries like:
chunk1
[{'ID': 1, 'Email': 'sample@example.org'}, {'ID': 2, 'Email': 'sample2@example.org'}]
chunk2
[{'ID': 3, 'Email': 'sampleexample.org'}, {'ID': 4, 'Email': 'sample@exampleorg'}]
I've found this post on SO, but I figured it would not make sense to first convert each chunk to a pandas DataFrame and from there to a dictionary when I might be able to do it directly. Using the idea in that post, I came up with the following code, but I'm not sure if this is the best way of doing it:
columns = spark_df.schema.fieldNames()
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [iterator.to_dict('records')]).toLocalIterator()
for list_of_dicts in chunks:
    # do work locally on list_of_dicts

You can return [[x.asDict() for x in iterator]] from the mapPartitions function (no need for pandas). [x.asDict() for x in iterator] creates a list of dicts covering all rows in the same partition; we then wrap it in another list so that it is treated as a single item by toLocalIterator():
from json import dumps
num_chunks = 2
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [[x.asDict() for x in iterator]]).toLocalIterator()
for list_of_dicts in chunks:
    print(dumps(list_of_dicts))
#[{"ID": "2", "Email": "sample2@example.org"}, {"ID": "1", "Email": "sample@example.org"}]
#[{"ID": "4", "Email": "sample@exampleorg"}, {"ID": "3", "Email": "sampleexample.org"}]

Related

Python Pandas Flatten nested JSON

I'm new to Python and pandas, and it took me a while to get the results I wanted with the code below. Basically, I am trying to flatten a nested JSON. While json_normalize works, there are columns that contain a list of objects (key/value pairs). I want to break down those lists and add their contents as separate columns. The sample code I wrote below works fine, but I was wondering if it could be simplified or improved further, or whether there is an alternative. Most of the articles I've found rely on explicitly naming the columns (code that I can't pick up on just yet), but I would rather have this as a function that names the columns dynamically. The output CSV will be used for Power BI.
import json
import pandas as pd

with open('json.json', 'r') as json_file:
    jsondata = json.load(json_file)

df = pd.json_normalize(jsondata['results'], errors='ignore')
y = df.columns[df.applymap(type).eq(list).any()]
for x in y:
    df = df.explode(x).reset_index(drop=True)
    df_exploded = pd.json_normalize(df.pop(x))
    for i in df_exploded.columns:
        df_exploded = df_exploded.rename(columns={i: x + '_' + i})
    df = df.join(df_exploded)
df.to_csv('json.csv')
Sample JSON format (not including the large JSON I was working on):
data = {
    'results': [
        {
            'recordType': 'recordType',
            'id': 'id',
            'values': {
                'orderType': [{'value': 'value', 'text': 'text'}],
                'type': [{'value': 'value', 'text': 'text'}],
                'entity': [{'value': 'value', 'text': 'text'}],
                'trandate': 'trandate'
            }
        }
    ]
}
The values part does not get flattened by json_normalize and requires the explode and join.
You can use something like this:
df = pd.json_normalize(jsondata['results'], meta=['recordType', 'id'])[['recordType', 'id', 'values.trandate']]
record_paths = [['values', 'orderType'], ['values', 'type'], ['values', 'entity']]
for i in record_paths:
    df = pd.concat([df, pd.json_normalize(jsondata['results'], record_path=i, record_prefix=i[1])], axis=1)
df.to_csv('json.csv')
Or (much faster):
df = pd.DataFrame({'recordType': [i['recordType'] for i in jsondata['results']],
                   'id': [i['id'] for i in jsondata['results']],
                   'values.trandate': [i['values']['trandate'] for i in jsondata['results']],
                   'orderTypevalue': [i['values']['orderType'][0]['value'] for i in jsondata['results']],
                   'orderTypetext': [i['values']['orderType'][0]['text'] for i in jsondata['results']],
                   'typevalue': [i['values']['type'][0]['value'] for i in jsondata['results']],
                   'typetext': [i['values']['type'][0]['text'] for i in jsondata['results']],
                   'entityvalue': [i['values']['entity'][0]['value'] for i in jsondata['results']],
                   'entitytext': [i['values']['entity'][0]['text'] for i in jsondata['results']]})
df.to_csv('json.csv')
Output:
| | recordType | id | values.trandate | orderTypevalue | orderTypetext | typevalue | typetext | entityvalue | entitytext |
|---:|:-------------|:-----|:------------------|:-----------------|:----------------|:------------|:-----------|:--------------|:-------------|
| 0 | recordType | id | trandate | value | text | value | text | value | text |
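If you would rather not hard-code the record paths, here is a sketch that wraps the question's own applymap/explode idea in a reusable function and discovers the list-valued columns dynamically. flatten_results is a hypothetical helper name, and the sketch assumes each list holds a single dict per record, as in the sample:
import pandas as pd

def flatten_results(results):
    df = pd.json_normalize(results)
    # Columns whose cells are lists (e.g. 'values.orderType') get expanded into prefixed columns.
    list_cols = df.columns[df.applymap(type).eq(list).any()]
    for col in list_cols:
        expanded = pd.json_normalize(df.pop(col).explode().tolist()).add_prefix(col + '_')
        df = df.join(expanded)
    return df

flatten_results(data['results']).to_csv('json.csv')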

Convert CSV into Dictionary using python

I have a CSV/EXCEL file. The sample data is shown below
+-----------+------------+------------+
| Number | start_date | end_date |
+-----------+------------+------------+
| 987654321 | 2021-07-15 | 2021-08-15 |
| 999999999 | 2021-07-15 | 2021-08-15 |
| 888888888 | 2021-07-15 | 2021-08-15 |
| 777777777 | 2021-07-15 | 2021-09-15 |
+-----------+------------+------------+
I need to convert it into dictionaries (JSON) with some conditions applied and then pass those dictionaries on to DB rows. This means the CSV can produce any number of dictionaries.
Conditions to be applied:
Numbers having the same start date and end date should go into the same dictionary.
All numbers in a dictionary should be concatenated into a comma-separated string.
Expected dictionaries from the above input:
dict1 = {
    "request": [
        {
            "key": "AMI_LIST",
            "value": "987654321,999999999,888888888"
        },
        {
            "key": "START_DATE",
            "value": "2021-07-15"
        },
        {
            "key": "END_DATE",
            "value": "2021-08-15"
        }
    ]
}
dict2 = {
    "request": [
        {
            "key": "AMI_LIST",
            "value": "777777777"
        },
        {
            "key": "START_DATE",
            "value": "2021-07-15"
        },
        {
            "key": "END_DATE",
            "value": "2021-09-15"
        }
    ]
}
All these dictionaries will be stored as model objects and then passed on to the DB. I am not actually creating separate variables for each dict; they will be handled in a loop. dict1 and dict2 are just notation to explain the expected output.
NOTE: A file will contain at most 500 rows.
I have tried using a for loop, but that increases the complexity. Is there any other way to approach this problem?
Thanks in advance for your help.
Yep, pandas is a really good option; you can do something like this:
import pandas as pd
import json
df = pd.read_csv("table.csv")
dfgrp = df.groupby(['end_date', 'start_date'], as_index = False).agg({"Number": list})
dfgrp.to_json()
which gives you:
{
    'end_date': {'0': '2021-08-15', '1': '2021-09-15'},
    'start_date': {'0': '2021-07-15', '1': '2021-07-15'},
    'Number': {'0': [987654321, 999999999, 888888888], '1': [777777777]}
}
And you're almost there!
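From there, one possible way to finish building the payloads is a short loop over the grouped frame; this is a sketch that reuses dfgrp from above, with the "request"/"key"/"value" layout taken straight from the question:
requests = []
for _, row in dfgrp.iterrows():
    requests.append({
        "request": [
            {"key": "AMI_LIST", "value": ",".join(str(n) for n in row["Number"])},
            {"key": "START_DATE", "value": row["start_date"]},
            {"key": "END_DATE", "value": row["end_date"]},
        ]
    })
print(json.dumps(requests, indent=2))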
How about using pandas to do the job?
It should be able to read your Excel files in as a DataFrame (pandas.read_excel); then you can apply your conditions and use DataFrame.to_json to export your DataFrames to JSON files.

Pyspark - from long to wide with new column names

I have this dataframe:
data = [{"name": "test", "sentiment":'positive', "avg":13.65, "stddev":15.24},
{"name": "test", "sentiment":'neutral', "avg":338.74, "stddev":187.27},
{"name": "test", "sentiment":'negative', "avg":54.58, "stddev":50.19}]
df = spark.createDataFrame(data).select("name", "sentiment", "avg", "stddev")
df.show()
+----+---------+------+------+
|name|sentiment| avg|stddev|
+----+---------+------+------+
|test| positive| 13.65| 15.24|
|test| neutral|338.74|187.27|
|test| negative| 54.58| 50.19|
+----+---------+------+------+
I'd like to create a dataframe with this structure:
+----+------------+-----------+------------+------------+-----------+------------+
|name|avg_positive|avg_neutral|avg_negative|std_positive|std_neutral|std_negative|
+----+------------+-----------+------------+------------+-----------+------------+
|test| 13.65| 338.74| 54.58| 15.24| 187.27| 50.19|
+----+------------+-----------+------------+------------+-----------+------------+
I also don't know the name of this operation; feel free to suggest a proper title.
Thanks!
Use groupBy() and pivot():
from pyspark.sql import functions as F

df_grp = df.groupBy("name").pivot("sentiment").agg(F.first("avg").alias("avg"), F.first("stddev").alias("stddev"))
df_grp.show()
+----+------------+---------------+-----------+--------------+------------+---------------+
|name|negative_avg|negative_stddev|neutral_avg|neutral_stddev|positive_avg|positive_stddev|
+----+------------+---------------+-----------+--------------+------------+---------------+
|test| 54.58| 50.19| 338.74| 187.27| 13.65| 15.24|
+----+------------+---------------+-----------+--------------+------------+---------------+
Rename the columns afterwards if you really want the exact names from the question.
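For instance, a minimal sketch that maps the pivoted names onto the ones asked for (note the question abbreviates stddev to std):
df_renamed = (df_grp
    .withColumnRenamed("positive_avg", "avg_positive")
    .withColumnRenamed("neutral_avg", "avg_neutral")
    .withColumnRenamed("negative_avg", "avg_negative")
    .withColumnRenamed("positive_stddev", "std_positive")
    .withColumnRenamed("neutral_stddev", "std_neutral")
    .withColumnRenamed("negative_stddev", "std_negative"))
df_renamed.show()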

How to decode dictionary column when using pyarrow to read parquet files?

I have three .snappy.parquet files stored in an S3 bucket. I tried to use pandas.read_parquet(), but it only works when I specify a single parquet file, e.g. df = pandas.read_parquet("s3://bucketname/xxx.snappy.parquet"). If I don't specify the filename, df = pandas.read_parquet("s3://bucketname") does not work and gives me the error: Seek before start of file.
I did a lot of reading and then found this page.
It suggests that we can use pyarrow to read multiple parquet files, so here's what I tried:
import s3fs
import pyarrow.parquet as pq
s3 = s3fs.S3FileSystem()
bucket_uri = f's3://bucketname'
data = pq.ParquetDataset(bucket_uri, filesystem=s3)
df = data.read().to_pandas()
This works, but I found that the value of one of the columns in this df is a dictionary. How can I decode this dictionary so that the keys become column names and the values become the corresponding column values?
For example, the current column:
column_1
{'Id': 'xxxxx', 'name': 'xxxxx', 'age': 'xxxxx', ...}
The expected columns:
Id    age
xxx   xxx
xxx   xxx
Here's the output of data.read().schema:
column_0: string
  -- field metadata --
  PARQUET:field_id: '1'
column_1: struct<Id: string, name: string, age: string, ...>
  child 0, Id: string
    -- field metadata --
    PARQUET:field_id: '3'
  child 1, name: string
    -- field metadata --
    PARQUET:field_id: '7'
  child 2, age: string
    -- field metadata --
    PARQUET:field_id: '8'
  ...
You have a column with a struct type and you want to flatten it. To do so, call flatten before calling to_pandas:
import pandas as pd
import pyarrow as pa

COLUMN1_SCHEMA = pa.struct([('Id', pa.string()), ('Name', pa.string()), ('Age', pa.string())])
SCHEMA = pa.schema([("column1", COLUMN1_SCHEMA), ('column2', pa.int32())])
df = pd.DataFrame({
    "column1": [("1", "foo", "16"), ("2", "bar", "17")],
    "column2": [1, 2],
})
pa.Table.from_pandas(df, SCHEMA).to_pandas() # without flatten
| column1 | column2 |
|:----------------------------------------|----------:|
| {'Id': '1', 'Name': 'foo', 'Age': '16'} | 1 |
| {'Id': '2', 'Name': 'bar', 'Age': '17'} | 2 |
pa.Table.from_pandas(df, SCHEMA).flatten().to_pandas() # with flatten
| column1.Id | column1.Name | column1.Age | column2 |
|-------------:|:---------------|--------------:|----------:|
| 1 | foo | 16 | 1 |
| 2 | bar | 17 | 2 |
As a side note, you shouldn't call it a dictionary column: dictionary is a loaded term in pyarrow and usually refers to dictionary encoding.
Edit: how to read a subset of columns in parquet
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df, SCHEMA)
pq.write_table(table, 'data.pq')
# Using read_table:
pq.read_table('data.pq', columns=['column1.Id', 'column1.Age'])
# Using ParquetDataSet:
pq.ParquetDataset('data.pq').read(columns=['column1.Id', 'column1.Age'])
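Applied to the S3 dataset from the question, the same flatten call would look roughly like this; a sketch that simply reuses the bucket placeholder and s3fs setup shown above:
import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
data = pq.ParquetDataset('s3://bucketname', filesystem=s3)
df = data.read().flatten().to_pandas()   # column_1 becomes column_1.Id, column_1.name, ...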

Convert pyspark.sql.dataframe.DataFrame type Dataframe to Dictionary

I have a pyspark DataFrame and I need to convert it into a python dictionary.
The code below is reproducible:
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Alice', age=5, height=80),Row(name='Alice', age=10, height=80)])
df = rdd.toDF()
Once I have this DataFrame, I need to convert it into a dictionary.
I tried this:
df.set_index('name').to_dict()
But it gives an error. How can I achieve this?
Please see the example below:
>>> from pyspark.sql.functions import col
>>> df = (sc.textFile('data.txt')
...         .map(lambda line: line.split(","))
...         .toDF(['name', 'age', 'height'])
...         .select(col('name'), col('age').cast('int'), col('height').cast('int')))
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 5| 80|
| Bob| 5| 80|
|Alice| 10| 80|
+-----+---+------+
>>> list_persons = map(lambda row: row.asDict(), df.collect())
>>> list_persons
[
    {'age': 5, 'name': u'Alice', 'height': 80},
    {'age': 5, 'name': u'Bob', 'height': 80},
    {'age': 10, 'name': u'Alice', 'height': 80}
]
>>> dict_persons = {person['name']: person for person in list_persons}
>>> dict_persons
{u'Bob': {'age': 5, 'name': u'Bob', 'height': 80}, u'Alice': {'age': 10, 'name': u'Alice', 'height': 80}}
The input that I'm using to test data.txt:
Alice,5,80
Bob,5,80
Alice,10,80
First we do the loading with pyspark by reading the lines. Then we convert the lines to columns by splitting on the comma. Then we convert the native RDD to a DataFrame and add names to the columns. Finally we cast the columns to the appropriate types.
Then we collect everything to the driver, and using some python list comprehensions we convert the data to the form we prefer. We convert each Row object to a dictionary using the asDict() method. In the output we can observe that Alice appears only once, but that is of course because the key Alice gets overwritten.
Please keep in mind that you want to do all the processing and filtering inside pyspark before returning the result to the driver.
Hope this helps, cheers.
You need to first convert to a pandas.DataFrame using toPandas(), then you can use the to_dict() method on the transposed dataframe with orient='list':
df.toPandas().set_index('name').T.to_dict('list')
# Out[1]: {u'Alice': [10, 80]}
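Since duplicate names collapse to a single key here (as noted above, the key Alice gets overwritten), a small variation on the same toPandas() route, sketched below, keeps one dict per row instead:
list_of_rows = df.toPandas().to_dict('records')
# [{'name': 'Alice', 'age': 5, 'height': 80}, {'name': 'Alice', 'age': 5, 'height': 80}, ...] (key order may differ)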
Each Row in an RDD has a built-in asDict() method that represents the row as a dict.
If you have a dataframe df, you need to convert it to an RDD and apply asDict() to each row.
new_rdd = df.rdd.map(lambda row: row.asDict(True))
One can then use the new_rdd to perform normal python map operations like:
# You can define normal python functions like below and plug them in when needed
def transform(row):
    # Add a new key to each row
    row["new_key"] = "my_new_value"
    return row

new_rdd = new_rdd.map(lambda row: transform(row))
One easy way is to collect the rows and iterate over them with a dictionary comprehension. Here I will try to demonstrate something similar.
Let's assume a movie dataframe, movie_df:
+-------+----------+
|movieId|avg_rating|
+-------+----------+
|      1|      3.92|
|     10|       3.5|
|    100|      2.79|
| 100044|       4.0|
| 100068|       3.5|
| 100083|       3.5|
| 100106|       3.5|
| 100159|       4.5|
| 100163|       2.9|
| 100194|       4.5|
+-------+----------+
We can use a dictionary comprehension and iterate over the collected rows like below:
movie_dict = {int(row.asDict()['movieId']): row.asDict()['avg_rating'] for row in movie_df.collect()}
print(movie_dict)
{1: 3.92,
10: 3.5,
100: 2.79,
100044: 4.0,
100068: 3.5,
100083: 3.5,
100106: 3.5,
100159: 4.5,
100163: 2.9,
100194: 4.5}
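A slightly leaner sketch of the same comprehension: Row objects support key lookup directly, so asDict() does not need to be called twice per row:
movie_dict = {int(row['movieId']): row['avg_rating'] for row in movie_df.collect()}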
