Accessing JSON sub attributes using Pandas from .csv file - python

I have this nested JSON data set which I have converted to .csv using pandas:
[{
    "attribute1": "One",
    "attribute2": "Two",
    "attribute3": [{
        "attribute4": "Four",
        "attribute5": "Five"
    }, {
        "attribute4": "Four",
        "attribute5": "Five"
    }]
}]
df = pd.DataFrame(data, columns=["attribute1", "attribute2", "attribute3"])
df.to_csv('example.csv')
The data in the column attribute3 is still JSON. How can I access the values of the sub-attributes of attribute3, i.e. attribute4 and attribute5, using indexing?
For instance, something like data[0][2:0] to get the data at the zeroth row, second column, and its sub-attribute zero.
I would appreciate some help on how to access the nested values. Should I flatten the single column that contains the nested values, and if so, how can I do that?

It would be easier to parse your original JSON (data) using json_normalize():
In [5]: pd.io.json.json_normalize(data, ['attribute3'], ['attribute1', 'attribute2'])
Out[5]:
  attribute4 attribute5 attribute1 attribute2
0       Four       Five        One        Two
1       Four       Five        One        Two
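In recent pandas versions json_normalize is available at the top level as pd.json_normalize. A minimal, self-contained sketch of the whole round trip (flatten first, then write the flat frame to CSV), using only the data from the question, could look like this:
import pandas as pd

data = [{
    "attribute1": "One",
    "attribute2": "Two",
    "attribute3": [
        {"attribute4": "Four", "attribute5": "Five"},
        {"attribute4": "Four", "attribute5": "Five"},
    ],
}]

# One row per element of attribute3, with the parent attributes repeated
flat = pd.json_normalize(data, record_path=["attribute3"],
                         meta=["attribute1", "attribute2"])

# Plain positional indexing now works, e.g. row 0 of attribute4
print(flat.iloc[0]["attribute4"])

flat.to_csv("example.csv", index=False)
Flattening before writing the CSV avoids having to re-parse JSON strings out of the attribute3 column later.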

Related

Merging Lists of Different Format into Pandas DataFrame

I have two different lists and an id:
id = 1
timestamps = [1,2,3,4]
values = ['A','B','C','D']
What I want to do with them is concatenate them into a pandas DataFrame so that:
id  timestamp  value
 1          1      A
 1          2      B
 1          3      C
 1          4      D
With each iteration of a for loop I will produce a new pair of lists and a new ID, which should then be concatenated to the existing data frame. The pseudocode would look like this:
# for each sample in group:
# do some calculation to create the two lists
# merge the lists into the data frame, using the ID as index
What I have tried so far is using concat like this:
pd.concat([
    existing_dataframe,
    pd.DataFrame(
        {
            "id": id,
            "timestamp": timestamps,
            "value": values,
        }
    ),
])
But there seems to be a problem because the ID and the other lists are of different lengths. Thanks for your help!
Use:
pd.DataFrame(
    {
        "timestamp": timestamps,
        "value": values,
    }
).assign(id=id).reindex(columns=["id", "timestamp", "value"])
Or:
df = pd.DataFrame(
    {
        "timestamp": timestamps,
        "value": values,
    }
)
df.insert(loc=0, column='id', value=id)
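Building on the answer and on the pseudocode from the question, a minimal sketch of the full loop might look like the following; group and compute_lists() are placeholders for the per-sample input and calculation mentioned in the question, not real names:
import pandas as pd

frames = []
for sample_id, sample in enumerate(group, start=1):
    # placeholder for "do some calculation to create the two lists"
    timestamps, values = compute_lists(sample)
    frames.append(
        pd.DataFrame({"timestamp": timestamps, "value": values}).assign(id=sample_id)
    )

# Concatenate once at the end; repeated concat inside the loop is much slower
result = pd.concat(frames, ignore_index=True)[["id", "timestamp", "value"]]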

How to access rows in a MultiIndex dataframe by using integer-location based indexing

Suppose I have the following MultiIndex DataFrame, titled df:
import numpy as np
import pandas as pd

arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.Series(np.random.randn(8), index=index)
If I wanted to access all the rows associated with baz, for example, I would use cross-section: df.xs(('baz')).
But is there a way to access the rows by referencing the integer location in the first level, similar to iloc for single index DataFrames? In my example, I think that would be index location 1.
I attempted it with a workaround using .loc as per the following:
df.loc[[df.index.get_level_values(0)[1]]]
But that returns the first group of rows, associated with bar, which I believe is because integer location 1 is still within bar; I would have to reference 2 to get to baz.
Can I make it so that location 0, 1, 2, and 3 references bar, baz, foo, and qux respectively?
You can use levels:
df.xs(df.index.levels[0][1])
second
one   -1.052578
two    0.565691
dtype: float64
More details:
df.index.levels[0][0]
'bar'
df.index.levels[0][1]
'baz'
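df.index.levels stores the unique values of the first level in sorted order, which happens to match the order the groups appear in here. If you want positions to follow the order of appearance in the data instead, a small self-contained sketch (rebuilding the question's example so it runs on its own) could use unique():
import numpy as np
import pandas as pd

arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["first", "second"])
df = pd.Series(np.random.randn(8), index=index)

# Unique first-level labels in order of appearance
first_level = df.index.get_level_values(0).unique()

# Integer-location access to the whole group at position 1 ("baz")
print(df.xs(first_level[1]))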

PySpark problem flattening array with nested JSON and other elements

I'm struggling with the correct syntax to flatten some data.
I have a dlt table with a column (named lorem for the sake of the example) where each row looks like this:
[{"field1": {"field1_1": null, "field1_2": null},
"field2": "blabla", "field3": 13209914,
"field4": {"field4_1": null, "field4_2": null}, "field5": 4}, ...
]
I want my output to create a new table based on the first that basically creates a row per each element in the array I shared above.
Table should look like:
|field1_1|field1_2|field2|field3|field4_1|field4_2|field5|
|:-------|:-------|:-----|:-----|:-------|:-------|:------|
|null|null|blabla|13209914|null|null|4|
However, when I explode like select(explode("lorem")) I do not get the wanted output; instead I only get the exploded column and the other fields, except for everything inside field4.
My question is, in what other way should I be flattening this data?
I can provide a clearer example if needed.
Use withColumn to add the additional columns you need. A simple example:
%%pyspark
from pyspark.sql.functions import col

df = spark.read.json("abfss://somelake@somestorage.dfs.core.windows.net/raw/flattenJson.json")

df2 = df \
    .withColumn("field4_1", col("field4.field4_1")) \
    .withColumn("field4_2", col("field4.field4_2"))

df2.show()
My results show the two new columns field4_1 and field4_2 alongside the original fields.
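The snippet above only pulls field4 apart on an already-flat row; to also get one row per array element, as the question asks, a hedged sketch combining explode with nested-field selection could look like this (spark.table("source_table") and the column name lorem stand in for the actual DLT table and column):
from pyspark.sql.functions import col, explode

# "source_table" is a placeholder for the actual DLT table name
df = spark.table("source_table")

flattened = (
    df.select(explode(col("lorem")).alias("item"))
      .select(
          col("item.field1.field1_1").alias("field1_1"),
          col("item.field1.field1_2").alias("field1_2"),
          col("item.field2").alias("field2"),
          col("item.field3").alias("field3"),
          col("item.field4.field4_1").alias("field4_1"),
          col("item.field4.field4_2").alias("field4_2"),
          col("item.field5").alias("field5"),
      )
)
flattened.show()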

Pythonic way of replace values in one column from a two column table

I have a df with the origin and destination between two points, and I want to convert the strings to a numerical index; I also need a representation to convert back for model interpretation.
df1 = pd.DataFrame({"Origin": ["London", "Liverpool", "Paris", "..."], "Destination": ["Liverpool", "Paris", "Liverpool", "..."]})
I separately created a new index on the sorted values.
df2 = pd.DataFrame({"Location": ["Liverpool", "London", "Paris", "..."], "Idx": ["1", "2", "3", "..."]})
What I want to get is this:
df3 = pd.DataFrame({"Origin": ["1", "2", "3", "..."], "Destination": ["1", "3", "1", "..."]})
I am sure there is a simpler way of doing this, but the only two methods I can think of are to do a left join onto the Origin column (Origin to Location), do the same for Destination, and then remove the extraneous columns, or to loop over every item in df1 and df2 and replace matching values. I've done the looped version and it works, but it's not very fast, which is to be expected.
I am sure there must be an easier way to replace these values but I am drawing a complete blank.
You can use .map():
mapping = dict(zip(df2.Location, df2.Idx))
df1.Origin = df1.Origin.map(mapping)
df1.Destination = df1.Destination.map(mapping)
print(df1)
Prints:
  Origin Destination
0      2           1
1      1           3
2      3           1
3    ...         ...
Or "bulk" .replace():
df1 = df1.replace(mapping)
print(df1)
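Since the question also mentions needing to convert the indices back for model interpretation, a minimal sketch of the reverse lookup, built from the same df2, might be:
# Reverse mapping: index -> location name
reverse_mapping = dict(zip(df2.Idx, df2.Location))

# df1 here holds the mapped indices from the answer above
decoded = df1.replace(reverse_mapping)
print(decoded)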

Merge 2 relational dataframes to nested JSON / dataframe

I have another problem with joining two dataframes using pandas. I want to merge a complete dataframe into a column/field of another dataframe where the foreign key field of DF2 matches the unique key of DF1.
The input data are 2 CSV files roughly looking like this:
CSV 1 / DF 1:
cid;name;surname;address
1;Mueller;Hans;42553
2;Meier;Peter;42873
3;Schmidt;Micha;42567
4;Pauli;Ulli;98790
5;Dick;Franz;45632
CSV 2 / DF 2:
OID;ticketid;XID;message
1;9;1;fgsgfs
2;8;2;gdfg
3;7;3;gfsfgfg
4;6;4;fgsfdgfd
5;5;5;dgsgd
6;4;5;dfgsgdf
7;3;1;dfgdhfd
8;2;2;dfdghgdh
I want each row of DF2 whose XID matches a cid of DF1 to end up as a single field in DF1. My final goal is to convert the above input files into a nested JSON format.
Edit 1:
Something like this:
[
    {
        "cid": 1,
        "name": "Mueller",
        "surname": "Hans",
        "address": 42553,
        "ticket": [{
            "OID": 1,
            "ticketid": 9,
            "XID": 1,
            "message": "fgsgfs"
        }]
    },
    ...
]
Edit 2:
Some further thoughts: would it be possible to create a dictionary from each row of dataframe 2 and then append this dictionary to a new column in dataframe 1 where a value of the dictionary (XID) matches the unique id of a row (cid)?
Some pseudo code I have in my mind:
Add new column "ticket" in DF1
Iterate over rows in DF2:
    row to dictionary
    iterate over DF1
        find row where cid = dict.XID
        append dictionary to field in "ticket"
Convert DF1 to JSON
Non-Python solutions are also acceptable.
Not sure what you expect as output, but check merge:
df1.merge(df2, left_on="cid", right_on="XID", how="left")
[EDIT based on the expected output]
Maybe something like this:
(
    df1.merge(
        df2.groupby("XID")
           .apply(lambda g: g.to_dict(orient="records"))
           .reset_index(name="ticket"),
        how="left", left_on="cid", right_on="XID")
    .drop(["XID"], axis=1)
    .to_json(orient="records")
)
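As a minimal end-to-end sketch, assuming the two CSVs are saved as df1.csv and df2.csv (the file names are assumptions) with ; as the separator, the same idea reads roughly as:
import pandas as pd

df1 = pd.read_csv("df1.csv", sep=";")
df2 = pd.read_csv("df2.csv", sep=";")

# One list of ticket dicts per XID
tickets = (
    df2.groupby("XID")
       .apply(lambda g: g.to_dict(orient="records"))
       .reset_index(name="ticket")
)

# Attach the ticket lists to the matching cid and emit nested JSON
nested = (
    df1.merge(tickets, how="left", left_on="cid", right_on="XID")
       .drop(columns=["XID"])
       .to_json(orient="records")
)
print(nested)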
