Accessing YAML data in Python - python

I have a YAML file that parses into an object, e.g.:
{'name': [{'proj_directory': '/directory/'},
{'categories': [{'quick': [{'directory': 'quick'},
{'description': None},
{'table_name': 'quick'}]},
{'intermediate': [{'directory': 'intermediate'},
{'description': None},
{'table_name': 'intermediate'}]},
{'research': [{'directory': 'research'},
{'description': None},
{'table_name': 'research'}]}]},
{'nomenclature': [{'extension': 'nc'}
{'handler': 'script'},
{'filename': [{'id': [{'type': 'VARCHAR'}]},
{'date': [{'type': 'DATE'}]},
{'v': [{'type': 'INT'}]}]},
{'data': [{'time': [{'variable_name': 'time'},
{'units': 'minutes since 1-1-1980 00:00 UTC'},
{'latitude': [{'variable_n...
I'm having trouble accessing the data in python and regularly see the error TypeError: list indices must be integers, not str
I want to be able to access all elements corresponding to 'name' so to retrieve each data field I imagine it would look something like:
import yaml
settings_stream = open('file.yaml', 'r')
settingsMap = yaml.safe_load(settings_stream)
yaml_stream = True
print 'loaded settings for: ',
for project in settingsMap:
print project + ', ' + settingsMap[project]['project_directory']
and I would expect each element would be accessible via something like ['name']['categories']['quick']['directory']
and something a little deeper would just be:
['name']['nomenclature']['data']['latitude']['variable_name']
or am I completely wrong here?

The brackets, [], indicate that you have lists of dicts, not just a dict.
For example, settingsMap['name'] is a list of dicts.
Therefore, you need to select the correct dict in the list using an integer index, before you can select the key in the dict.
So, giving your current data structure, you'd need to use:
settingsMap['name'][1]['categories'][0]['quick'][0]['directory']
Or, revise the underlying YAML data structure.
For example, if the data structure looked like this:
settingsMap = {
'name':
{'proj_directory': '/directory/',
'categories': {'quick': {'directory': 'quick',
'description': None,
'table_name': 'quick'}},
'intermediate': {'directory': 'intermediate',
'description': None,
'table_name': 'intermediate'},
'research': {'directory': 'research',
'description': None,
'table_name': 'research'},
'nomenclature': {'extension': 'nc',
'handler': 'script',
'filename': {'id': {'type': 'VARCHAR'},
'date': {'type': 'DATE'},
'v': {'type': 'INT'}},
'data': {'time': {'variable_name': 'time',
'units': 'minutes since 1-1-1980 00:00 UTC'}}}}}
then you could access the same value as above with
settingsMap['name']['categories']['quick']['directory']
# quick

Related

How to convert json into a pandas dataframe?

I'm trying to covert an api response from json to a dataframe in pandas. the problem I am having is that de data is nested in the json format and I am not getting the right columns in my dataframe.
The data is collect from a api with the following format:
{'tickets': [{'url': 'https...',
'id': 1,
'external_id': None,
'via': {'channel': 'web',
'source': {'from': {}, 'to': {}, 'rel': None}},
'created_at': '2020-05-01T04:16:33Z',
'updated_at': '2020-05-23T03:02:49Z',
'type': 'incident',
'subject': 'Subject',
'raw_subject': 'Raw subject',
'description': 'Hi, this is the description',
'priority': 'normal',
'status': 'closed',
'recipient': None,
'requester_id': 409467360874,
'submitter_id': 409126461453,
'assignee_id': 409126461453,
'organization_id': None,
'group_id': 360009916453,
'collaborator_ids': [],
'follower_ids': [],
'email_cc_ids': [],
'forum_topic_id': None,
'problem_id': None,
'has_incidents': False,
'is_public': True,
'due_at': None,
'tags': ['tag_1',
'tag_2',
'tag_3',
'tag_4'],
'custom_fields': [{'id': 360042034433, 'value': 'value of the first custom field'},
{'id': 360041487874, 'value': 'value of the second custom field'},
{'id': 360041489414, 'value': 'value of the third custom field'},
{'id': 360040980053, 'value': 'correo_electrónico'},
{'id': 360040980373, 'value': 'suscribe_newsletter'},
{'id': 360042046173, 'value': None},
{'id': 360041028574, 'value': 'product'},
{'id': 360042103034, 'value': None}],
'satisfaction_rating': {'score': 'unoffered'},
'sharing_agreement_ids': [],
'comment_count': 2,
'fields': [{'id': 360042034433, 'value': 'value of the first custom field'},
{'id': 360041487874, 'value': 'value of the second custom field'},
{'id': 360041489414, 'value': 'value of the third custom field'},
{'id': 360040980053, 'value': 'correo_electrónico'},
{'id': 360040980373, 'value': 'suscribe_newsletter'},
{'id': 360042046173, 'value': None},
{'id': 360041028574, 'value': 'product'},
{'id': 360042103034, 'value': None}],
'followup_ids': [],
'ticket_form_id': 360003608013,
'deleted_ticket_form_id': 360003608013,
'brand_id': 360004571673,
'satisfaction_probability': None,
'allow_channelback': False,
'allow_attachments': True},
What I already tried is the following: I have converted the JSON format into a dict as following:
x = response.json()
df = pd.DataFrame(x['tickets'])
But I'm struggling with the output. I don't know how to get a correct, ordered, normalized dataframe.
(I'm new in this :) )
Let's supose you get your request data by this code r = requests.get(url, auth)
Your data ins't clear yet, so let's get a dataframe of it data = pd.read_json(json.dumps(r.json, ensure_ascii = False))
But, probably you will get a dataframe with one single row.
When I faced a problem like this, I wrote this function to get the full data:
listParam = []
def listDict(entry):
if type(entry) is dict:
listParam.append(entry)
elif type(entry) is list:
for ent in entry:
listDict(ent)
Because your data looks like a dict because of {'tickets': ...} you will need to get the information like that:
listDict(data.iloc[0][0])
And then,
pd.DataFrame(listParam)
I can't show the results because you didn't post the complete data nor told where I can find the data to test, but this will probably work.
You have to convert the json to dictionary first and then convert the dictionary value for key 'tickets' into dataframe.
file = open('file.json').read()
ticketDictionary = json.loads(file)
df = pd.DataFrame(ticketDictionary['tickets'])
'file.json' contains your data here.
df now contains your dataFrame in this format.
For the lists within the response you can have separate dataframes if required:
for field in df['fields']:
df = pd.DataFrame(field)
It will give you this for lengths:
id value
0 360042034433 value of the first custom field
1 360041487874 value of the second custom field
2 360041489414 value of the third custom field
3 360040980053 correo_electrónico
4 360040980373 suscribe_newsletter
5 360042046173 None
6 360041028574 product
7 360042103034 None
This can be one way to structure as you haven't mentioned the exact expected format.

Convert nested dictionary within JSON from a string

I have JSON data that I loaded that appears to have a bit of a messy data structure where nested dictionaries are wrapped in single quotes and recognized as a string, rather than a single dictionary which I can loop through. What is the best way to drop the single quotes from the key-value property ('value').
Provided below is an example of the structure:
for val in json_data:
print(val)
{'id': 'status6',
'title': 'Estimation',
'text': '> 2 days',
'type': 'color',
'value': '{"index":14,"post_id":null,"changed_at":"2020-06-12T09:04:58.659Z"}',
'name': 'Internal: online course'},
{'id': 'date',
'title': 'Deadline',
'text': '2020-06-26',
'type': 'date',
'value': '{"date":"2020-06-26","changed_at":"2020-06-12T11:33:37.195Z"}',
'name': 'Internal: online course'},
{'id': 'tags',
'title': 'Tags',
'text': 'Internal',
'type': 'tag',
'value': '{"tag_ids":[3223513]}',
'name': 'Internal: online course'},
If I add a nested look targeting ['value'], it loops by character and not key-value pair in the dictionary.
Using json.loads to convert string to dict
import json
json_data = [{'id': 'status6',
'title': 'Estimation',
'text': '> 2 days',
'type': 'color',
'value': '{"index":14,"post_id":null,"changed_at":"2020-06-12T09:04:58.659Z"}',
'name': 'Internal: online course'},
{'id': 'date',
'title': 'Deadline',
'text': '2020-06-26',
'type': 'date',
'value': '{"date":"2020-06-26","changed_at":"2020-06-12T11:33:37.195Z"}',
'name': 'Internal: online course'},
{'id': 'tags',
'title': 'Tags',
'text': 'Internal',
'type': 'tag',
'value': '{"tag_ids":[3223513]}',
'name': 'Internal: online course'}]
# the result is a Python dictionary:
for val in json_data:
print(json.loads(val['value']))
this should be work!!

I have list of nested dict variable and need to convert to dict variable type for Json object

Below json data has 3 rules (dict type). I have created as list with some changes. Now i need to convert this "list to dict" data type. The below data has lot of nested list/dict. I want to split this list of list (3 list) and append it to dictionary.(dict datatype)
<class 'list'>
[
{'ID': 'Glacierize bird_sporr after 2 weeks',
'Status': 'Enabled',
'Transitions': [{'Days': 14, 'StorageClass': 'GLACIER'}],
'NoncurrentVersionTransitions': [{'NoncurrentDays': 14, 'StorageClass': 'GLACIER'}],
'Prefix': 'bird_sporr'},
{'Expiration':
{'Days': 45},
'ID': 'Delete files after 45 days',
'Status': 'Enabled',
'NoncurrentVersionExpiration': {'NoncurrentDays': 45},
'Prefix': 'bird_sporr'
},
{'ID': 'PruneAbandonedMultipartUpload',
'Status': 'Enabled',
'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 30},
'Prefix': ''}
]
I need the below output with dict data type.. This API will not acccept the list data type. Please help on this. Let me know if any queries.
<class 'dict'>
{'ID': 'Glacierize bird_sporr after 2 weeks',
'Status': 'Enabled',
'Transitions': [{'Days': 14, 'StorageClass': 'GLACIER'}],
'NoncurrentVersionTransitions': [{'NoncurrentDays': 14, 'StorageClass': 'GLACIER'}],
'Prefix': 'bird_sporr'},
{'Expiration':
{'Days': 45},
'ID': 'Delete files after 45 days',
'Status': 'Enabled',
'NoncurrentVersionExpiration': {'NoncurrentDays': 45},
'Prefix': 'bird_sporr'},
{'ID': 'PruneAbandonedMultipartUpload',
'Status': 'Enabled',
'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 30},
'Prefix': ''}
If your problem is just that, you have a list with your output. But you need just the output, without it being contained by a list, Then you should simply be able to do this:
list[0] should give you your desired dictionary.

How to convert a JSON list into two different columns in a CSV file?

This is my first question on this spectacular website, I need to know how to export complex information from a JSON to a CSV.
The problem is that I need from the list that I have in the column to have two different values.
I tried a lot of different combinations and I couldn't so one of my last resources are asked to the community.
My code is this:
def output(alerts):
output = list()
for alert in alerts:
applications = alerts['applications']
for app in applications:
categories = app['categories']
for cat in categories:
output_alert = [list(cat.items())[0], app['confidence'], app['icon'],
app['name'], app['version'], app['website'], alerts['language'], alerts['status']]
output.append(output_alert)
df = pd.DataFrame(output, columns=['Categories', 'Confidence', 'Icon', 'Name', 'Version', 'Website',
'Language', 'Status'])
df.to_csv(args.output)
print('Scan completed, you already have your new CSV file')
return
enter image description here
I left you a picture of the CSV file with the problem in column B (I have a list there) but I need actually two columns with each value...
I attached the JSON response that I have from a REST API
{'applications': [{'categories': [{'59': 'JavaScript libraries'}],
'confidence': '100',
'icon': 'Lo-dash.png',
'name': 'Lodash',
'version': '4.17.15',
'website': 'http://www.lodash.com'},
{'categories': [{'12': 'JavaScript frameworks'}],
'confidence': '100',
'icon': 'RequireJS.png',
'name': 'RequireJS',
'version': '2.3.6',
'website': 'http://requirejs.org'},
{'categories': [{'13': 'Issue trackers'}],
'confidence': '100',
'icon': 'Sentry.svg',
'name': 'Sentry',
'version': '4.6.2',
'website': 'https://sentry.io/'},
{'categories': [{'1': 'CMS'},
{'6': 'Ecommerce'},
{'11': 'Blogs'}],
'confidence': '100',
'icon': 'Wix.png',
'name': 'Wix',
'version': None,
'website': 'https://www.wix.com'},
{'categories': [{'59': 'JavaScript libraries'}],
'confidence': '100',
'icon': 'Zepto.png',
'name': 'Zepto',
'version': None,
'website': 'http://zeptojs.com'},
{'categories': [{'19': 'Miscellaneous'}],
'confidence': '100',
'icon': 'webpack.svg',
'name': 'webpack',
'version': None,
'website': 'https://webpack.js.org/'},
{'categories': [{'12': 'JavaScript frameworks'}],
'confidence': '0',
'icon': 'React.png',
'name': 'React',
'version': None,
'website': 'https://reactjs.org'}], 'language': 'es', 'status': 'success'}
[{'59': 'JavaScript libraries'}] this last thing is my big problem! Thank you for your time and help!
You could try using list(cat.keys())[0], list(cat.values())[0] in your output_alert variable to extract key and value separately.
You can use json_normalize to extract your columns without the for-loop, and then create two new columns with the extracted keys and values from categories:
result = pd.json_normalize(
alerts,
record_path=["applications"],
meta=["language", "status"]
).explode("categories")
result["category_labels"] = result.categories.apply(lambda x: list(x.keys())[0])
result["category_values"] = result.categories.apply(lambda x: list(x.values())[0])
The output is:

Extracting value for one dictionary key in Pandas based on another in the same dictionary

This is from an R guy.
I have this mess in a Pandas column: data['crew'].
array(["[{'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production', 'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor', 'profile_path': None}, {'credit_id': '56407fa89251417055000b58', 'department': 'Sound', 'gender': 0, 'id': 6745, 'job': 'Music Editor', 'name': 'Richard Henderson', 'profile_path': None}, {'credit_id': '5789212392514135d60025fd', 'department': 'Production', 'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production', 'name': 'Jeffrey Stott', 'profile_path': None}, {'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up', 'gender': 0, 'id': 23783, 'job': 'Makeup Artist', 'name': 'Heather Plott', 'profile_path': None}
It goes on for quite some time. Each new dict starts with a credit_id field. One sell can hold several dicts in an array.
Assume I want the names of all Casting directors, as shown in the first entry. I need to check check the job entry in every dict and, if it's Casting, grab what's in the name field and store it in my data frame in data['crew'].
I tried several strategies, then backed off and went for something simple.
Running the following shut me down, so I can't even access a simple field. How can I get this done in Pandas.
for row in data.head().iterrows():
if row['crew'].job == 'Casting':
print(row['crew'])
EDIT: Error Message
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-138-aa6183fdf7ac> in <module>()
1 for row in data.head().iterrows():
----> 2 if row['crew'].job == 'Casting':
3 print(row['crew'])
TypeError: tuple indices must be integers or slices, not str
EDIT: Code used to get the array of dict (strings?) in the first place.
def convert_JSON(data_as_string):
try:
dict_representation = ast.literal_eval(data_as_string)
return dict_representation
except ValueError:
return []
data["crew"] = data["crew"].map(lambda x: sorted([d['name'] if d['job'] == 'Casting' else '' for d in convert_JSON(x)])).map(lambda x: ','.join(map(str, x))
To create a DataFrame from your sample data, write:
df = pd.DataFrame(data=[
{ 'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production',
'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor',
'profile_path': None},
{ 'credit_id': '56407fa89251417055000b58', 'department': 'Sound',
'gender': 0, 'id': 6745, 'job': 'Music Editor',
'name': 'Richard Henderson', 'profile_path': None},
{ 'credit_id': '5789212392514135d60025fd', 'department': 'Production',
'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production',
'name': 'Jeffrey Stott', 'profile_path': None},
{ 'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up',
'gender': 0, 'id': 23783, 'job': 'Makeup Artist',
'name': 'Heather Plott', 'profile_path': None}])
Then you can get your data with a single instruction:
df[df.job == 'Casting'].name
The result is:
0 Terri Taylor
Name: name, dtype: object
The above result is Pandas Series object with names found.
In this case, 0 is the index value for the record found and
Terri Taylor is the name of (the only in your data) Casting Director.
Edit
If you want just a list (not Series), write:
df[df.job == 'Casting'].name.tolist()
The result is ['Terri Taylor'] - just a list.
I think, both my solutions should be quicker than "ordinary" loop
based on iterrows().
Checking the execution time, you may try also yet another solution:
df.query("job == 'Casting'").name.tolist()
==========
And as far as your code is concerned:
iterrows() returns each time a pair containing:
the key of the current row,
a named tuple - the content of this row.
So your loop should look something like:
for row in df.iterrows():
if row[1].job == 'Casting':
print(row[1]['name'])
You can not write row[1].name because it refers to the index value
(here we have a collision with default attributes of the named tuple).

Categories

Resources