PySpark problem flattening array with nested JSON and other elements - python

I'm struggling with the correct syntax to flatten some data.
I have a dlt table with a column (named lorem for the sake of the example) where each row looks like this:
[{"field1": {"field1_1": null, "field1_2": null},
"field2": "blabla", "field3": 13209914,
"field4": {"field4_1": null, "field4_2": null}, "field5": 4}, ...
]
I want to create a new table based on the first one, with one row per element of the array shown above.
The table should look like:
|field1_1|field1_2|field2|field3|field4_1|field4_2|field5|
|:-------|:-------|:-----|:-----|:-------|:-------|:------|
|null|null|blabla|13209914|null|null|4|
However, when I explode with select(explode("lorem")), I do not get the wanted output; instead I only get the exploded element and the other top-level fields, and nothing inside field4 gets flattened.
My question is, in what other way should I be flattening this data?
I can provide a clearer example if needed.

Use withColumn to add the additional columns you need. A simple example:
%%pyspark
from pyspark.sql.functions import col

# Read the raw JSON (note the container@account form of the abfss URI)
df = spark.read.json("abfss://somelake@somestorage.dfs.core.windows.net/raw/flattenJson.json")

# Promote the nested struct fields to top-level columns
df2 = df \
    .withColumn("field4_1", col("field4.field4_1")) \
    .withColumn("field4_2", col("field4.field4_2"))
df2.show()
My results:
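For the array-of-structs column in the question, the explode and the struct flattening can be combined in one select. A minimal sketch, assuming the source DataFrame has the array column lorem with the schema shown above:
from pyspark.sql.functions import explode, col

# One row per array element; the element becomes a single struct column "item"
exploded = df.select(explode(col("lorem")).alias("item"))

# Pull every nested field up into a flat, top-level column
flat = exploded.select(
    col("item.field1.field1_1").alias("field1_1"),
    col("item.field1.field1_2").alias("field1_2"),
    col("item.field2").alias("field2"),
    col("item.field3").alias("field3"),
    col("item.field4.field4_1").alias("field4_1"),
    col("item.field4.field4_2").alias("field4_2"),
    col("item.field5").alias("field5"),
)
flat.show()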

Related

Error when manipulating dataframe with columns of type string with Pandas Pivot Table

I have the dataframe:
And I would like to obtain this result using pivot_table or an alternative function:
I am trying to transform the rows of the Custom Field column into columns with the pandas pivot_table function, and I get an error:
import pandas as pd

data = {
    "Custom Field": ["CF1", "CF2", "CF3"],
    "id": ["RSA", "RSB", "RSC"],
    "Name": ["Wilson", "Junior", "Otavio"]
}

### create the dataframe ###
df = pd.DataFrame(data)
print(df)

df2 = df.pivot_table(columns=['Custom Field'], index=['Name'])
print(df2)
I suspect it is because I am working with Strings.
Any suggestions?
Thanks in advance.
You need pivot, not pivot_table. The latter aggregates possibly repeating values (by default with mean, which is what fails on your string columns), whereas the former is just a rearrangement of the values and fails for duplicate index/column pairs.
df.pivot(columns=['Custom Field'], index=['Name'])
Update as per comment: if there are multiple values per cell, you need to use pivot_table and specify an appropriate aggregate function, e.g. concatenate the string values. You can also specify a fill value for empty cells (instead of NaN):
df = pd.DataFrame({"Custom Field": ["CF1", "CF2", "CF3", "CF1"],
                   "id": ["RSA", "RSB", "RSC", "RSD"],
                   "Name": ["Wilson", "Junior", "Otavio", "Wilson"]})
df.pivot_table(columns=['Custom Field'], index=['Name'], aggfunc=','.join, fill_value='-')
                   id
Custom Field      CF1  CF2  CF3
Name
Junior              -  RSB    -
Otavio              -    -  RSC
Wilson        RSA,RSD    -    -
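To see why plain pivot does not work once a Name/Custom Field pair repeats, you can try it on the same duplicated data. This is a small illustration, not from the original answer, and the exact error wording may vary between pandas versions:
# pivot cannot reshape duplicate index/column pairs and raises a ValueError,
# e.g. "Index contains duplicate entries, cannot reshape"
try:
    df.pivot(columns=['Custom Field'], index=['Name'])
except ValueError as exc:
    print(exc)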

Python pandas: How many values of one series are in another?

I have two pandas dataframes:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "col1": ["1", "2", np.nan, "3"],
    }
)
df2 = pd.DataFrame(
    {
        "col1": [2.0, 3.0, 4.0, np.nan],
    }
)
I would like to know how many values of df1.col1 exist in df2.col1. In this case it should be 2 as I want "2" and 2.0 to be seen as equal.
I do have a working solution, but because I think I'll need this more often (and for learning purposes, of course), I wanted to ask you if there is a more comfortable way to do it.
df1.col1[df1.col1.notnull()].isin(df2.col1[df2.col1.notnull()].astype(int).astype(str)).value_counts()
Use Series.dropna and convert to floats, if working with integers and missing values:
a = df1.col1.dropna().astype(float).isin(df2.col1.dropna()).value_counts()
Or:
a = df1.col1.dropna().isin(df2.col1.dropna().astype(int).astype(str)).value_counts()
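A small usage note, not part of the original answer: since isin returns a boolean Series, summing it gives the matching count directly instead of the True/False breakdown from value_counts:
count = df1.col1.dropna().astype(float).isin(df2.col1.dropna()).sum()
print(count)  # 2 -- "1", "2", "3" become 1.0, 2.0, 3.0, of which 2.0 and 3.0 appear in df2.col1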

Merge 2 relational dataframes to nested JSON / dataframe

I have another problem with joining two dataframes using pandas. I want to merge a complete dataframe into a column/field of another dataframe where the foreign key field of DF2 matches the unique key of DF1.
The input data are 2 CSV files roughly looking like this:
CSV 1 / DF 1:
cid;name;surname;address
1;Mueller;Hans;42553
2;Meier;Peter;42873
3;Schmidt;Micha;42567
4;Pauli;Ulli;98790
5;Dick;Franz;45632
CSV 2 / DF 2:
OID;ticketid;XID;message
1;9;1;fgsgfs
2;8;2;gdfg
3;7;3;gfsfgfg
4;6;4;fgsfdgfd
5;5;5;dgsgd
6;4;5;dfgsgdf
7;3;1;dfgdhfd
8;2;2;dfdghgdh
I want each row of DF2 whose XID matches a cid of DF1 to end up as a single field in DF1. My final goal is to convert the above input files into a nested JSON format.
Edit 1:
Something like this:
[
    {
        "cid": 1,
        "name": "Mueller",
        "surname": "Hans",
        "address": 42553,
        "ticket": [{
            "OID": 1,
            "ticketid": 9,
            "XID": 1,
            "message": "fgsgfs"
        }]
    },
    ...
]
Edit 2:
Some further thoughts: would it be possible to create a dictionary from each row in dataframe 2 and then append this dictionary to a new column in dataframe 1 where some value (xid) of the dictionary matches the unique id in a row (cid)?
Some pseudo code I have in my mind:
Add new column "ticket" in DF1
Iterate over rows in DF2:
    row to dictionary
    iterate over DF1:
        find row where cid = dict.XID
        append dictionary to field in "ticket"
Convert DF1 to JSON
Non-Python solutions are also acceptable.
Not sure what you expect as output, but check merge:
df1.merge(df2, left_on="cid", right_on="XID", how="left")
[EDIT based on the expected output]
Maybe something like this:
(
    df1.merge(
        df2.groupby("XID").apply(lambda g: g.to_dict(orient="records")).reset_index(name="ticket"),
        how="left", left_on="cid", right_on="XID")
    .drop(["XID"], axis=1)
    .to_json(orient="records")
)
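A self-contained sketch of the same approach, assuming the two semicolon-separated CSV files from the question are stored as csv1.csv and csv2.csv (hypothetical file names):
import pandas as pd

# The input files use ";" as the field separator
df1 = pd.read_csv("csv1.csv", sep=";")
df2 = pd.read_csv("csv2.csv", sep=";")

# Collapse DF2 into one list of row-dicts per XID, then attach it to DF1
tickets = (
    df2.groupby("XID")
       .apply(lambda g: g.to_dict(orient="records"))
       .reset_index(name="ticket")
)

nested = (
    df1.merge(tickets, how="left", left_on="cid", right_on="XID")
       .drop(["XID"], axis=1)
)

# Serialize to the nested JSON structure shown in Edit 1
print(nested.to_json(orient="records"))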

Accessing JSON sub attributes using Pandas from .csv file

I have this nested JSON data set which I have converted to .csv using pandas:
[{
    "attribute1": "One",
    "attribute2": "Two",
    "attribute3": [{
        "attribute4": "Four",
        "attribute5": "Five"
    }, {
        "attribute4": "Four",
        "attribute5": "Five"
    }]
}]
df = pd.DataFrame(data, columns=["attribute1", "attribute2", "attribute3"])
df.to_csv('example.csv')
The data in the attribute3 column is still JSON. How can I access the values of the sub-attributes of attribute3, i.e. attribute4 and attribute5, using indexing?
For instance something like this: data[0][2:0] for getting data at zeroth row, second column and its sub attribute zero.
I would appreciate some help regarding how to access nested values. Should I flatten the one single column that contains nested values? How can I do that?
It would be easier to parse your original JSON (data) using json_normalize():
In [5]: pd.io.json.json_normalize(data, ['attribute3'], ['attribute1','attribute2'])
Out[5]:
attribute4 attribute5 attribute1 attribute2
0 Four Five One Two
1 Four Five One Two
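Note that in pandas 1.0 and later json_normalize is exposed at the top level, so the equivalent call (same arguments, different access path) would be:
import pandas as pd

flat = pd.json_normalize(data, record_path=['attribute3'], meta=['attribute1', 'attribute2'])
print(flat)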

Pandas JSON data parsing projection

It seems to me that it would be eminently useful for pandas to support the idea of projection (omitting or selecting columns) during data parsing.
Many JSON datasets I find have a ton of extraneous fields I don't need, or I need to parse a specific field in the nested structure.
What I do currently is pipe through jq to create a file that contains only the fields I need. This becomes the "cleaned" file.
I would prefer a method where I didn't have to create a new cleaned file every time I want to look at a particular facet or set of facets, but I could instead tell pandas to load the JSON path .data.interesting and only project fields: A B C.
As an example:
{
    "data": {
        "not interesting": ["milk", "yogurt", "dirt"],
        "interesting": [{ "A": "moonlanding", "B": "1956", "C": 100000, "D": "meh" }]
    }
}
Unfortunately, it seems like there's no easy way to do it on load, but if you're okay with doing it immediately after...
# drop by index
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# drop by name
df.drop(['B', 'C'], axis=1, inplace=True)
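If the file is read with the standard json module first, the nested path and the column projection can be handled in one step. A sketch, assuming the document above is stored in data.json (hypothetical file name):
import json
import pandas as pd

with open("data.json") as fh:
    raw = json.load(fh)

# Normalize only the nested ".data.interesting" list, then keep just A, B and C
df = pd.json_normalize(raw["data"]["interesting"])[["A", "B", "C"]]
print(df)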
