I have a pandas dataframe in which one column, custom, consists of a list of dictionaries. The list may be empty or contain one or more dictionary objects. For example:
id custom
1 []
2 [{'key': 'impact', 'name': 'Impact', 'value': 'abc', 'type': 'string'}, {'key': 'proposed_status','name': 'PROPOSED Status [temporary]', 'value': 'pqr', 'type': 'string'}]
3 [{'key': 'impact', 'name': 'Impact', 'value': 'xyz', 'type': 'string'}]
I'm interested in extracting the data into separate columns based on the dict entries named 'key' and 'value'.
For example, here the output df will have the additional columns impact and proposed_status:
id custom impact proposed_status
1 ... NA NA
2 ... abc pqr
3 ... xyz NA
Could the smart people of StackOverflow please guide me on the right way to solve this? Thanks!
The approach is explained in the comments:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'custom': [[],
                              [{'key': 'impact', 'name': 'Impact', 'value': 'abc', 'type': 'string'},
                               {'key': 'proposed_status',
                                'name': 'PROPOSED Status [temporary]',
                                'value': 'pqr',
                                'type': 'string'}],
                              [{'key': 'impact', 'name': 'Impact', 'value': 'xyz', 'type': 'string'}]]})
# expand out the lists; reset_index() so join() will work
df2 = df.explode("custom").reset_index(drop=True)
# join to keep "id"
df2 = (df2.join(df2["custom"]
                # expand the embedded dicts into columns
                .apply(pd.Series))
          .loc[:, ["id", "key", "value"]]
          # empty lists generate spurious NaN rows; remove them
          .dropna()
          # turn the key attribute into columns
          .set_index(["id", "key"]).unstack(1)
          # clean up the multi-index columns
          .droplevel(0, axis=1)
      )
df.merge(df2, on="id", how="left")
   id                                                                                              custom impact proposed_status
0   1                                                                                                  []    nan             nan
1   2  [{'key': 'impact', 'name': 'Impact', 'value': 'abc', 'type': 'string'}, {'key': 'proposed_status', 'name': 'PROPOSED Status [temporary]', 'value': 'pqr', 'type': 'string'}]    abc             pqr
2   3                                             [{'key': 'impact', 'name': 'Impact', 'value': 'xyz', 'type': 'string'}]    xyz             nan
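As a possible alternative to the explode/unstack route, a per-row dict comprehension can build the key-to-value mapping directly and expand it with the DataFrame constructor. This is a sketch using a trimmed-down version of the example data (only the 'key' and 'value' entries are kept, since the other dict fields are unused):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'custom': [[],
                              [{'key': 'impact', 'value': 'abc'},
                               {'key': 'proposed_status', 'value': 'pqr'}],
                              [{'key': 'impact', 'value': 'xyz'}]]})

# Build one {key: value} dict per row, then expand the dicts into columns;
# empty lists become empty dicts and yield NaN in every new column
expanded = pd.DataFrame([{d['key']: d['value'] for d in row} for row in df['custom']])
out = df.join(expanded)
```

Because `expanded` keeps the same default RangeIndex as `df`, a plain `join` lines the rows up without needing a merge on "id".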
I have a JSON response with product details for multiple products, in this structure:
all_content=[{'productcode': '0502570SRE',
'brand': {'code': 'MJ', 'name': 'Marie Jo'},
'series': {'code': '0257', 'name': 'DANAE'},
'family': {'code': '0257SRE', 'name': 'DANAE Red'},
'style': {'code': '0502570'},
'introSeason': {'code': '226', 'name': 'WINTER 2022'},
'seasons': [{'code': '226', 'name': 'WINTER 2022'}],
'otherColors': [{'code': '0502570ZWA'}, {'code': '0502570PIR'}],
'synonyms': [{'code': '0502571SRE'}],
'stayerB2B': False,
'name': [{'language': 'de', 'text': 'DANAE Rot Rioslip'},
{'language': 'sv', 'text': 'DANAE Red '},
{'language': 'en', 'text': 'DANAE Red rio briefs'},
{'language': 'it', 'text': 'DANAE Rouge slip brasiliano'},
{'language': 'fr', 'text': 'DANAE Rouge slip brésilien'},
{'language': 'da', 'text': 'DANAE Red rio briefs'},
{'language': 'nl', 'text': 'DANAE Rood rioslip'},
.......]
What I need is a dataframe with, for each productcode, only the values of specific keys in the sub-dictionaries. For example:
productcode synonyms_code name_language_nl
0522570SRE 0522571SRE rioslip
I've tried a nested for loop, which gives me a list of all values of one specific key-value pair in a sub-dict:
for item in all_content:
    synonyms_details = item['synonyms']
    for i in synonyms_details:
        print(i['code'])
How do I get from here to a DF like this?
productcode synonyms_code name_language_nl
0522570SRE 0522571SRE rioslip
I took another route with json_normalize to make a flattened df, keeping the original key in a column:
# test with json_normalize to extract the list of dicts of lists of dicts
# meta_prefix is needed to avoid a name clash, since 'code' is used as a key at multiple levels
df_syn = pd.json_normalize(all_content, record_path=['synonyms'], meta='code', meta_prefix='org_')
result
code org_code
0 0162934FRO 0162935FRO
1 0241472TFE 0241473TF
I changed the column names for the merge with the original df:
df_syn = df_syn.rename(columns={"code": "syn_code", "org_code":"code"})
result
syn_code code
0 0162934FRO 0162935FRO
1 0241472TFE 0241473T
I merged the flattened df with a left merge on the shared key:
df = pd.merge(left=df, right=df_syn, how='left', left_on='code', right_on='code')
Result: see the last column. NaN because not every product has a synonym.
code brand series family style introSeason seasons otherColors synonyms stayerB2B ... composition recycleComposition media assortmentIds preOrderPeriod firstDeliveryDates productGroup language text syn_code
0 0502570SRE {'code': 'MJ', 'name': 'Marie Jo'} {'code': '0257', 'name': 'DANAE'} {'code': '0257SRE', 'name': 'DANAE Red'} {'code': '0502570'} {'code': '226', 'name': 'WINTER 2022'} [{'code': '226', 'name': 'WINTER 2022'}] [{'code': '0502570ZWA'}, {'code': '0502570PIR'}] [] False ... [{'material': [{'language': 'de', 'text': 'Pol... [{'origin': [{'language': 'de', 'text': 'Nicht... [{'type': 'IMAGE', 'imageType': 'No body pictu... [BO_MJ] {'startDate': '2022-01-01', 'endDate': '2022-0... {'common': '2022-09-05', 'deviations': []} {'code': '0SRI1'} nl DANAE Rood rioslip NaN
The next step is to go one level deeper to get the values of a list of lists of lists.
Any suggestions for a more straightforward way? Extracting data like this for all nested values is quite time-consuming.
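Rather than normalizing and merging one nested list at a time, one possible approach is a single pass over the records that picks out only the wanted keys. This is a sketch on a trimmed-down sample mimicking the question's structure (the joining of multiple synonym codes with a comma is an assumption about the desired output):

```python
import pandas as pd

# Trimmed-down sample in the shape of all_content from the question
all_content = [{'productcode': '0502570SRE',
                'synonyms': [{'code': '0502571SRE'}],
                'name': [{'language': 'nl', 'text': 'DANAE Rood rioslip'},
                         {'language': 'en', 'text': 'DANAE Red rio briefs'}]}]

rows = []
for item in all_content:
    rows.append({
        'productcode': item['productcode'],
        # join multiple synonym codes into one cell; empty string if none
        'synonyms_code': ','.join(s['code'] for s in item.get('synonyms', [])),
        # pick the Dutch name, or None if there is no 'nl' entry
        'name_language_nl': next((n['text'] for n in item.get('name', [])
                                  if n['language'] == 'nl'), None),
    })

df = pd.DataFrame(rows)
```

One dict per product goes straight into the DataFrame constructor, so no merge step is needed and missing sub-lists simply produce empty/None cells.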
I have a first list A:
[{'name': 'PASSWORD', 'id': '5f2496e5-dc40-418a-92e0-098e4642a92e'},
{'name': 'PERSON_NAME', 'id': '3a255440-e2aa-4c4d-993f-4cdef3237920'},
{'name': 'PERU_DNI_NUMBER', 'id': '41f41303-4a71-4732-a8a4-0eecea464562'},
{'name': 'PHONE_NUMBER', 'id': 'ac24413b-bb8f-4adc-ada5-a984f145a70b'},
{'name': 'POLAND_NATIONAL_ID_NUMBER',
'id': '32c49d92-6d5f-408e-b41e-dfec76ceae6a'}]
and a second list B:
[{'name': 'PHONE_NUMBER', 'count': '96'}]
I want to filter the first list based on the second, in order to get the following list:
[{'name': 'PHONE_NUMBER', 'count': '96', 'id': 'ac24413b-bb8f-4adc-ada5-a984f145a70b'}]
I have used the following code, but I don't get the right output:
filtered = []
for x,i in DLP_job[i]['name']:
    if x,i in ids[i]['name']:
        filtered.append(x)
print(filtered)
Here is a solution using a list comprehension that merges the matching dicts:
A = [{'name': 'PASSWORD', 'id': '5f2496e5-dc40-418a-92e0-098e4642a92e'},
{'name': 'PERSON_NAME', 'id': '3a255440-e2aa-4c4d-993f-4cdef3237920'},
{'name': 'PERU_DNI_NUMBER', 'id': '41f41303-4a71-4732-a8a4-0eecea464562'},
{'name': 'PHONE_NUMBER', 'id': 'ac24413b-bb8f-4adc-ada5-a984f145a70b'},
{'name': 'POLAND_NATIONAL_ID_NUMBER',
'id': '32c49d92-6d5f-408e-b41e-dfec76ceae6a'}]
B = [{'name': 'PHONE_NUMBER', 'count': '96'}]
print([{**x, **y} for x in A for y in B if y['name'] == x['name']])
One way is to walk both lists and, wherever the name keys match, use the merge of the two dicts:
l1 = [{'name': 'PASSWORD', 'id': '5f2496e5-dc40-418a-92e0-098e4642a92e'},
{'name': 'PERSON_NAME', 'id': '3a255440-e2aa-4c4d-993f-4cdef3237920'},
{'name': 'PERU_DNI_NUMBER', 'id': '41f41303-4a71-4732-a8a4-0eecea464562'},
{'name': 'PHONE_NUMBER', 'id': 'ac24413b-bb8f-4adc-ada5-a984f145a70b'},
{'name': 'POLAND_NATIONAL_ID_NUMBER',
'id': '32c49d92-6d5f-408e-b41e-dfec76ceae6a'}]
l2 = [{'name': 'PHONE_NUMBER', 'count': '96'}, {'name': 'PERSON_NAME', 'count': '100'}]
result = []
for d2 in l2:
    for d1 in l1:
        if d1['name'] == d2['name']:
            result.append({**d1, **d2})
print(result)
[{'name': 'PHONE_NUMBER', 'id': 'ac24413b-bb8f-4adc-ada5-a984f145a70b', 'count': '96'},
{'name': 'PERSON_NAME', 'id': '3a255440-e2aa-4c4d-993f-4cdef3237920', 'count': '100'}]
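If the lists are large, the nested loop does len(l1) * len(l2) comparisons; indexing the first list by name once makes the filtering linear. A minimal sketch on shortened sample data:

```python
l1 = [{'name': 'PHONE_NUMBER', 'id': 'ac24413b-bb8f-4adc-ada5-a984f145a70b'},
      {'name': 'PERSON_NAME', 'id': '3a255440-e2aa-4c4d-993f-4cdef3237920'}]
l2 = [{'name': 'PHONE_NUMBER', 'count': '96'}]

# Build a lookup dict keyed on 'name' once, then merge each match
by_name = {d['name']: d for d in l1}
result = [{**by_name[d['name']], **d} for d in l2 if d['name'] in by_name]
```

The `if d['name'] in by_name` guard also makes the behavior explicit when an entry in the second list has no counterpart in the first: it is simply skipped.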
I have a REST API that returns some data, including some fields in the form of links. I call each link and store everything in a dataframe, but I need to extract some values from these nested results and concatenate them with the dataframe. Does anyone know a way to do this?
import requests
import pandas as pd

response = requests.get(url, auth=(usr, psw), headers=headers)
df = pd.DataFrame(response.json()['result'])

def get_data_from_link(data):
    return requests.get(data['link'], auth=(usr, psw), headers=headers).json()

df['assignment_group_response'] = df['assignment_group'].apply(get_data_from_link)
The column I need to transform:
0 {'result': {'attested_date': '', 'skip_sync': ...
1 {'result': {'attested_date': '', 'skip_sync': ...
2 {'result': {'attested_date': '', 'skip_sync': ...
Initial dataframe after fetching your data using the link:
assignment_group_response
0 {'name': 'abc', 'extra': {'value': 123}}
1 {'name': 'def', 'extra': {'value': 456}}
2 {'name': 'xyz', 'extra': {'value': 789}}
Now, I'll create new columns and get the values from the nested dictionary:
df["name"] = df["assignment_group_response"].apply(lambda x: x["name"])
df["extra"] = df["assignment_group_response"].apply(lambda x: x["extra"])
df["value"] = df["assignment_group_response"].apply(lambda x: x["extra"]["value"])
After adding the columns, the dataframe would look like:
assignment_group_response name extra value
0 {'name': 'abc', 'extra': {'value': 123}} abc {'value': 123} 123
1 {'name': 'def', 'extra': {'value': 456}} def {'value': 456} 456
2 {'name': 'xyz', 'extra': {'value': 789}} xyz {'value': 789} 789
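The three apply calls can also be collapsed into a single pd.json_normalize pass over the column; nested keys then come out as dotted column names such as extra.value. A sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'assignment_group_response': [
    {'name': 'abc', 'extra': {'value': 123}},
    {'name': 'def', 'extra': {'value': 456}},
    {'name': 'xyz', 'extra': {'value': 789}}]})

# json_normalize flattens every nested dict in one call;
# .tolist() hands it a plain list of dicts
flat = pd.json_normalize(df['assignment_group_response'].tolist())
df = df.join(flat)  # adds the 'name' and 'extra.value' columns
```

Both frames share the default RangeIndex, so `join` aligns the flattened rows with the originals without an explicit key.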
I have a dataframe like this:
df['likes']
0 {'data': [{'id': '651703178310339', 'name': 'A...
1 {'data': [{'id': '798659570200808', 'name': 'B...
2 {'data': [{'id': '10200132902001105', 'name': ...
3 {'data': [{'id': '10151983313320836', 'name': ...
4 NaN
5 {'data': [{'id': '1551927888235503', 'name': '...
6 {'data': [{'id': '10204089171847031', 'name': ...
7 {'data': [{'id': '399992547089295', 'name': 'В...
8 {'data': [{'id': '10201813292573808', 'name': ...
9 NaN
Some cells have several 'id' elements:
df['likes'][0]
{'data': [{'id': '651703178310339', 'name': 'A'},
{'id': '10204089171847031', 'name': 'B'}],
'paging': {'cursors': {'after': 'MTAyMDQwODkxNzE4NDcwMzEZD',
'before': 'NjUxNzAzMTc4MzEwMzM5'}}}
Some cells have none (NaN). I want to get a new variable
df['number']
0 2
1 4
2 3
4 0
that contains the number of 'id' elements. df['likes'] was obtained from a dict. I tried to count 'id':
df['likes'].apply(lambda x: x.count('id'))
AttributeError: 'dict' object has no attribute 'count'
So I tried this:
df['likes'].apply(lambda x: len(x.keys()))
AttributeError: 'float' object has no attribute 'keys'
How to fix it?
I was asked to publish a full set of data; I'm publishing three rows so as not to take up much space:
df['likes']
0    {'data': [{'id': '651703178310339', 'name': 'A'},
               {'id': '10204089171847031', 'name': 'B'}],
      'paging': {'cursors': {'after': 'MTAyMDQwODkxNzE4NDcwMzEZD',
                             'before': 'NjUxNzAzMTc4MzEwMzM5'}}}
1    {'data': [{'id': '798659570200808', 'name': 'C'},
               {'id': '574668895969867', 'name': 'D'},
               {'id': '651703178310339', 'name': 'A'},
               {'id': '1365088683555195', 'name': 'G'}],
      'paging': {'cursors': {'after': 'MTM2NTA4ODY4MzU1NTE5NQZDZD',
                             'before': 'Nzk4NjU5NTcwMjAwODA4'}}}
2    NaN
Option 1:
In [120]: df.likes.apply(pd.Series)['data'].apply(lambda x: pd.Series(x).notnull()).sum(1)
Out[120]:
0 2.0
1 4.0
2 0.0
dtype: float64
Option 2:
In [146]: df['count'] = [sum('id' in d for d in x.get('data', []))
                         if pd.notna(x) else 0
                         for x in df['likes']]
In [147]: df
Out[147]:
likes count
0 {'data': [{'id': '651703178310339', 'name': 'A... 2
1 {'data': [{'id': '798659570200808', 'name': 'C... 4
2 NaN 0
Data set:
In [137]: df.to_dict('r')
Out[137]:
[{'likes': {'data': [{'id': '651703178310339', 'name': 'A'},
{'id': '10204089171847031', 'name': 'B'}],
'paging': {'cursors': {'after': 'MTAyMDQwODkxNzE4NDcwMzEZD',
'before': 'NjUxNzAzMTc4MzEwMzM5'}}}},
{'likes': {'data': [{'id': '798659570200808', 'name': 'C'},
{'id': '574668895969867', 'name': 'D'},
{'id': '651703178310339', 'name': 'A'},
{'id': '1365088683555195', 'name': 'G'}],
'paging': {'cursors': {'after': 'MTM2NTA4ODY4MzU1NTE5NQZDZD',
'before': 'Nzk4NjU5NTcwMjAwODA4'}}}},
{'likes': nan}]
This almost works:
df['likes'].apply(lambda x: len(x['data']))
Note the error:
> AttributeError: 'float' object has no attribute 'keys'
That happens because you have some NaN values (which are represented as float NaN). So:
df['likes'][df['likes'].notnull()].apply(lambda x: len(x['data']))
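To get 0 for the NaN rows instead of dropping them, the same idea can be folded into a single apply with an isinstance check. A sketch on shortened data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'likes': [
    {'data': [{'id': '651703178310339', 'name': 'A'},
              {'id': '10204089171847031', 'name': 'B'}]},
    np.nan]})

# Anything that is not a dict (the NaN floats) counts as zero likes
df['number'] = df['likes'].apply(lambda x: len(x['data']) if isinstance(x, dict) else 0)
```

This keeps the result aligned with the original index, so it can be assigned straight back as a new column.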
I'm struggling to convert a JSON API response into a pandas DataFrame object. I've read answers to similar questions and the documentation, but nothing has helped. My closest attempt is below:
r = requests.get('https://api.xxx')
data = r.text
df = pd.read_json(data, orient='records')
Which returns the following format:
0 {'type': 'bid', 'price': 6.193e-05, ...},
1 {'type': 'bid', 'price': 6.194e-05, ...},
3 {'type': 'bid', 'price': 6.149e-05, ...} etc
The original format of the data is:
{'abc': [{'type': 'bid',
'price': 6.194e-05,
'amount': 2321.37952545,
'tid': 8577050,
'timestamp': 1498649162},
{'type': 'bid',
'price': 6.194e-05,
'amount': 498.78993587,
'tid': 8577047,
'timestamp': 1498649151},
...]}
I'm happy to be directed to good documentation.
I think you need json_normalize:
from pandas import json_normalize
import requests
r = requests.get('https://api.xxx')
data = r.json()  # parse to a dict; json_normalize needs a dict, not the raw text
df = json_normalize(data, 'abc')
print (df)
amount price tid timestamp type
0 2321.379525 0.000062 8577050 1498649162 bid
1 498.789936 0.000062 8577047 1498649151 bid
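When there is only the single 'abc' key and the records inside it are flat (no nesting), passing the inner list straight to the DataFrame constructor is an equivalent, possibly simpler route. A sketch on the data shown in the question:

```python
import pandas as pd

d = {'abc': [{'type': 'bid', 'price': 6.194e-05, 'amount': 2321.37952545,
              'tid': 8577050, 'timestamp': 1498649162},
             {'type': 'bid', 'price': 6.194e-05, 'amount': 498.78993587,
              'tid': 8577047, 'timestamp': 1498649151}]}

# A list of flat dicts maps directly to rows and columns
df = pd.DataFrame(d['abc'])
```

json_normalize becomes worthwhile once the records contain nested dicts or when metadata from outer levels has to be carried along.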
For multiple keys it is possible to use concat with a list comprehension and the DataFrame constructor:
d = {'abc': [{'type': 'bid', 'price': 6.194e-05, 'amount': 2321.37952545, 'tid': 8577050, 'timestamp': 1498649162}, {'type': 'bid', 'price': 6.194e-05, 'amount': 498.78993587, 'tid': 8577047, 'timestamp': 1498649151}],
'def': [{'type': 'bid', 'price': 6.194e-05, 'amount': 2321.37952545, 'tid': 8577050, 'timestamp': 1498649162}, {'type': 'bid', 'price': 6.194e-05, 'amount': 498.78993587, 'tid': 8577047, 'timestamp': 1498649151}]}
df = pd.concat([pd.DataFrame(v) for k,v in d.items()], keys=d)
print (df)
amount price tid timestamp type
abc 0 2321.379525 0.000062 8577050 1498649162 bid
1 498.789936 0.000062 8577047 1498649151 bid
def 0 2321.379525 0.000062 8577050 1498649162 bid
1 498.789936 0.000062 8577047 1498649151 bid