How can I explode a nested dictionary into a dataframe?

How can I explode a nested dictionary into a dataframe? - python

I have a nested dictionary as below. I'm trying to convert the below to a dataframe with the columns iid, Invnum, #type, execId, CId, AId, df, type. What’s the best way to go about it?
data = {'A': {'B1': {'iid': 'B1', 'Invnum': {'B11': {'#type': '/test_data', 'execId': 42, 'CId': 42, 'AId': 'BAZ'}, 'B12': {'#type': '/test_data', 'CId': 8, 'AId': '123'}}}}, 'B2': {'iid': 'B2', 'Invnum': {'B21': {'#type': '/test_data', 'execId': 215, 'CId': 253,'df': [], 'type': 'F'}, 'B22': {'#type': '/test_data', 'execId': 10,'df': [], 'type': 'F'}}}}
for key1 in data['A'].keys():
for key2 in data['A'][key1]['Invnum']:
print(key1,key2)
Expected output:

As indicated in the comments, your input data is very obscure. This provides a lot of trouble for us, because we don't know what we can assume or not. For my solution I will assume at least the following, based on the example you provide:
In the dictionary there is an entry containing the iid and Invnum as keys in the same level.
The Invnum key is the only key, which has multiple values, or in otherwords is iterable (besides df), and on iteration it must hold the last dictionary. In otherwords, after the Invnum value (e.g. B11), you can only get the last dict with the other fields as keys (#type, execId, CId, AId, df, type), if they exists.
If there is a df value, it will hold a list.
# This is a place holder dictionary, so we can create entries that have the same pattern.
entry = {'#type': '', 'execId': '', 'CId': '', 'AId': '', 'df': '', 'type': ''}
# This will hold all the (properly) format entries for the df.
items = []
def flatten(data):
if isinstance(data, dict):
match data:
# We are searching for a level that contains both an `iid` and `Invnum` key.
case {'iid': id, 'Invnum': remainder}:
for each in remainder:
entry_row = dict(**entry, iid=id, Invnum=each)
entry_row.update(remainder[each])
items.append(entry_row)
case _:
for key, value in data.items():
flatten(value)
# We flatten the data, such that the `items` variable will hold consistent entries
flatten(data)
# Transfer to pandas dataframe, and reorder the values for easy comparison.
df = pd.DataFrame(items)
df = df[['iid', 'Invnum', '#type', 'execId', 'CId', 'AId', 'df', 'type']]
print(df.to_string(index=False))
Output:
iid Invnum #type execId CId AId df type
B1 B11 /test_data 42 42 BAZ
B1 B12 /test_data 8 123
B2 B21 /test_data 215 253 [] F
B2 B22 /test_data 10 [] F
Note:
All entries have been turned into strings, since I am using '' for empty values.
I heavily rely on the above made assumptions, in case they are incorrect, the answer will not match your expectation.
I am using Structural pattern matching, which is introduced in python 3.10.

Related

Generating a list based on a list of dictionaries and data from another list

The following lists are given:
atr = [{'name': 'surname', 'type': 'varchar(50)', 'table': None}, {'name': 'ls_data', 'type': 'timestamp', 'table': None}, {'name': 'cpn', 'type': 'int', 'table': None}, {'name': 'code', 'type': 'varchar(200)', 'table': None}]
pk = ['surname', 'cpn', 'ls_data']
It is necessary to form a list of "type" from the atr list, while "name" from atr = pk.
The order should be as in the pk list.
Expected output
lst = ['varchar(50)', 'int', 'timestamp']
I tried it like this
lst = [d["type"] for d in atr if d["name"] in pk]
But this is incorrect, the order is not the same as in the pk list.

It would work using something like this:
lst = [atr[[d["name"] for d in atr].index(p)]["type"] for p in pk]
The output for print(lst) is:
['varchar(50)', 'int', 'timestamp']
i.e. the same result items list order as in the query items list as opposed to your original approach which gives a different order.
Though I'm not sure how readable/performant that is; it
generates a new list (only containing the values of the "name" key) from the list of dictionaries for each query item p in pk
searches for the index of the current query item p in that list
uses this index to retrieve the respective dictionary from the original atr list
and finally, select the value of the "type" key from that dictionary

Use list of indices to manipulate a nested dictionary

I'm trying to perform operations on a nested dictionary (data retrieved from a yaml file):
data = {'services': {'web': {'name': 'x'}}, 'networks': {'prod': 'value'}}
I'm trying to modify the above using the inputs like:
{'services.web.name': 'new'}
I converted the above to a list of indices ['services', 'web', 'name']. But I'm not able to/not sure how to perform the below operation in a loop:
data['services']['web']['name'] = new
That way I can modify dict the data. There are other values I plan to change in the above dictionary (it is extensive one) so I need a solution that works in cases where I have to change, EG:
data['services2']['web2']['networks']['local'].
Is there a easy way to do this? Any help is appreciated.

You may iterate over the keys while moving a reference:
data = {'networks': {'prod': 'value'}, 'services': {'web': {'name': 'x'}}}
modification = {'services.web.name': 'new'}
for key, value in modification.items():
keyparts = key.split('.')
to_modify = data
for keypart in keyparts[:-1]:
to_modify = to_modify[keypart]
to_modify[keyparts[-1]] = value
print(data)
Giving:
{'networks': {'prod': 'value'}, 'services': {'web': {'name': 'new'}}}

python key value two dict matches

I am tyring to match the value of two dicts of two sperate keys by looping over them-with hopefully if i in line_aum['id_class'] == line_investor['id_class'] becoming True, then the next sum dunction will work:
Tho it kicks out a different result
so far I have:
for line_aum in aum_obj:
for line_investor in investor_obj:
if i in line_aum['id_class'] == line_investor['id_class']:
total = (sum,line_investor['amount'], line_aum['value'])
amount = line['id_class']
print(amount,total)
Example data:
{'fund_name': '', 'fund_code': 'LFE', 'aumc': '406.37', 'value': '500', 'ddate': '2013-01-01', 'id_fund': '165', 'currency': 'EUR', 'nav': '24.02', 'shares': '16.918', 'estimate': '0', 'id_class': '4526', 'class_name': 'LTD - CLASS B (EUR)'}

Use itertools.product instead of nested loops if both aum_obj and investor_obj are lists:
from itertools import product
for line_aum, line_investor in product(aum_obj, investor_obj):
if line_aum['id_class'] == line_investor['id_class']:
# `line_aum` and `line_investor` have matching values for the `id_class` keys.

Using Python CSV DictReader to create multi-level nested dictionary

Total Python noob here, probably missing something obvious. I've searched everywhere and haven't found a solution yet, so I thought I'd ask for some help.
I'm trying to write a function that will build a nested dictionary from a large csv file. The input file is in the following format:
Product,Price,Cost,Brand,
blue widget,5,4,sony,
red widget,6,5,sony,
green widget,7,5,microsoft,
purple widget,7,6,microsoft,
etc...
The output dictionary I need would look like:
projects = { `<Brand>`: { `<Product>`: { 'Price': `<Price>`, 'Cost': `<Cost>` },},}
But obviously with many different brands containing different products. In the input file, the data is ordered alphabetically by brand name, but I know that it becomes unordered as soon as DictReader executes, so I definitely need a better way to handle the duplicates. The if statement as written is redundant and unnecessary.
Here's the non-working, useless code I have so far:
def build_dict(source_file):
projects = {}
headers = ['Product', 'Price', 'Cost', 'Brand']
reader = csv.DictReader(open(source_file), fieldnames = headers, dialect = 'excel')
current_brand = 'None'
for row in reader:
if Brand != current_brand:
current_brand = Brand
projects[Brand] = {Product: {'Price': Price, 'Cost': Cost}}
return projects
source_file = 'merged.csv'
print build_dict(source_file)
I have of course imported the csv module at the top of the file.
What's the best way to do this? I feel like I'm way off course, but there is very little information available about creating nested dicts from a CSV, and the examples that are out there are highly specific and tend not to go into detail about why the solution actually works, so as someone new to Python, it's a little hard to draw conclusions.
Also, the input csv file doesn't normally have headers, but for the sake of trying to get a working version of this function, I manually inserted a header row. Ideally, there would be some code that assigns the headers.
Any help/direction/recommendation is much appreciated, thanks!

import csv
from collections import defaultdict
def build_dict(source_file):
projects = defaultdict(dict)
headers = ['Product', 'Price', 'Cost', 'Brand']
with open(source_file, 'rb') as fp:
reader = csv.DictReader(fp, fieldnames=headers, dialect='excel',
skipinitialspace=True)
for rowdict in reader:
if None in rowdict:
del rowdict[None]
brand = rowdict.pop("Brand")
product = rowdict.pop("Product")
projects[brand][product] = rowdict
return dict(projects)
source_file = 'merged.csv'
print build_dict(source_file)
produces
{'microsoft': {'green widget': {'Cost': '5', 'Price': '7'},
'purple widget': {'Cost': '6', 'Price': '7'}},
'sony': {'blue widget': {'Cost': '4', 'Price': '5'},
'red widget': {'Cost': '5', 'Price': '6'}}}
from your input data (where merged.csv doesn't have the headers, only the data.)
I used a defaultdict here, which is just like a dictionary but when you refer to a key that doesn't exist instead of raising an Exception it simply makes a default value, in this case a dict. Then I get out -- and remove -- Brand and Product, and store the remainder.
All that's left I think would be to turn the cost and price into numbers instead of strings.
[modified to use DictReader directly rather than reader]

Here I offer another way to satisfy your requirement(different from DSM)
Firstly, this is my code:
import csv
new_dict={}
with open('merged.csv','rb')as csv_file:
data=csv.DictReader(csv_file,delimiter=",")
for row in data:
dict_brand=new_dict.get(row['Brand'],dict())
dict_brand[row['Product']]={k:row[k] for k in ('Cost','Price')}
new_dict[row['Brand']]=dict_brand
print new_dict
Briefly speaking, the main point to solve is to figure out what the key-value pairs are in your requirements. According to your requirement,it can be called as a 3-level-dict,here the key of first level is the value of Brand int the original dictionary, so I extract it from the original csv file as
dict_brand=new_dict.get(row['Brand'],dict())
which is going to judge if there exists the Brand value same as the original dict in our new dict, if yes, it just inserts, if no, it creates, then maybe the most complicated part is the second level or middle level, here you set the value of Product of original dict as the value of the new dict of key Brand, and the value of Product is also the key of the the third level dict which has Price and Cost of the original dict as the value,and here I extract them like:
dict_brand[row['Product']]={k:row[k] for k in ('Cost','Price')}
and finally, what we need to do is just set the created 'middle dict' as the value of our new dict which has Brand as the key.
Finally, the output is
{'sony': {'blue widget': {'Price': '5', 'Cost': '4'},
'red widget': {'Price': '6', 'Cost': '5'}},
'microsoft': {'purple widget': {'Price': '7', 'Cost': '6'},
'green widget': {'Price': '7', 'Cost': '5'}}}
That's that.

Data structure for maintaining tabular data in memory?

My scenario is as follows: I have a table of data (handful of fields, less than a hundred rows) that I use extensively in my program. I also need this data to be persistent, so I save it as a CSV and load it on start-up. I choose not to use a database because every option (even SQLite) is an overkill for my humble requirement (also - I would like to be able to edit the values offline in a simple way, and nothing is simpler than notepad).
Assume my data looks as follows (in the file it's comma separated without titles, this is just an illustration):
Row | Name | Year | Priority
------------------------------------
1 | Cat | 1998 | 1
2 | Fish | 1998 | 2
3 | Dog | 1999 | 1
4 | Aardvark | 2000 | 1
5 | Wallaby | 2000 | 1
6 | Zebra | 2001 | 3
Notes:
Row may be a "real" value written to the file or just an auto-generated value that represents the row number. Either way it exists in memory.
Names are unique.
Things I do with the data:
Look-up a row based on either ID (iteration) or name (direct access).
Display the table in different orders based on multiple field: I need to sort it e.g. by Priority and then Year, or Year and then Priority, etc.
I need to count instances based on sets of parameters, e.g. how many rows have their year between 1997 and 2002, or how many rows are in 1998 and priority > 2, etc.
I know this "cries" for SQL...
I'm trying to figure out what's the best choice for data structure. Following are several choices I see:
List of row lists:
a = []
a.append( [1, "Cat", 1998, 1] )
a.append( [2, "Fish", 1998, 2] )
a.append( [3, "Dog", 1999, 1] )
...
List of column lists (there will obviously be an API for add_row etc):
a = []
a.append( [1, 2, 3, 4, 5, 6] )
a.append( ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"] )
a.append( [1998, 1998, 1999, 2000, 2000, 2001] )
a.append( [1, 2, 1, 1, 1, 3] )
Dictionary of columns lists (constants can be created to replace the string keys):
a = {}
a['ID'] = [1, 2, 3, 4, 5, 6]
a['Name'] = ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"]
a['Year'] = [1998, 1998, 1999, 2000, 2000, 2001]
a['Priority'] = [1, 2, 1, 1, 1, 3]
Dictionary with keys being tuples of (Row, Field):
Create constants to avoid string searching
NAME=1
YEAR=2
PRIORITY=3
a={}
a[(1, NAME)] = "Cat"
a[(1, YEAR)] = 1998
a[(1, PRIORITY)] = 1
a[(2, NAME)] = "Fish"
a[(2, YEAR)] = 1998
a[(2, PRIORITY)] = 2
...
And I'm sure there are other ways... However each way has disadvantages when it comes to my requirements (complex ordering and counting).
What's the recommended approach?
EDIT:
To clarify, performance is not a major issue for me. Because the table is so small, I believe almost every operation will be in the range of milliseconds, which is not a concern for my application.

Having a "table" in memory that needs lookups, sorting, and arbitrary aggregation really does call out for SQL. You said you tried SQLite, but did you realize that SQLite can use an in-memory-only database?
connection = sqlite3.connect(':memory:')
Then you can create/drop/query/update tables in memory with all the functionality of SQLite and no files left over when you're done. And as of Python 2.5, sqlite3 is in the standard library, so it's not really "overkill" IMO.
Here is a sample of how one might create and populate the database:
import csv
import sqlite3
db = sqlite3.connect(':memory:')
def init_db(cur):
cur.execute('''CREATE TABLE foo (
Row INTEGER,
Name TEXT,
Year INTEGER,
Priority INTEGER)''')
def populate_db(cur, csv_fp):
rdr = csv.reader(csv_fp)
cur.executemany('''
INSERT INTO foo (Row, Name, Year, Priority)
VALUES (?,?,?,?)''', rdr)
cur = db.cursor()
init_db(cur)
populate_db(cur, open('my_csv_input_file.csv'))
db.commit()
If you'd really prefer not to use SQL, you should probably use a list of dictionaries:
lod = [ ] # "list of dicts"
def populate_lod(lod, csv_fp):
rdr = csv.DictReader(csv_fp, ['Row', 'Name', 'Year', 'Priority'])
lod.extend(rdr)
def query_lod(lod, filter=None, sort_keys=None):
if filter is not None:
lod = (r for r in lod if filter(r))
if sort_keys is not None:
lod = sorted(lod, key=lambda r:[r[k] for k in sort_keys])
else:
lod = list(lod)
return lod
def lookup_lod(lod, **kw):
for row in lod:
for k,v in kw.iteritems():
if row[k] != str(v): break
else:
return row
return None
Testing then yields:
>>> lod = []
>>> populate_lod(lod, csv_fp)
>>>
>>> pprint(lookup_lod(lod, Row=1))
{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'}
>>> pprint(lookup_lod(lod, Name='Aardvark'))
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'}
>>> pprint(query_lod(lod, sort_keys=('Priority', 'Year')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
{'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
{'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
{'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
{'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> pprint(query_lod(lod, sort_keys=('Year', 'Priority')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
{'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
{'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
{'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
{'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> print len(query_lod(lod, lambda r:1997 <= int(r['Year']) <= 2002))
6
>>> print len(query_lod(lod, lambda r:int(r['Year'])==1998 and int(r['Priority']) > 2))
0
Personally I like the SQLite version better since it preserves your types better (without extra conversion code in Python) and easily grows to accommodate future requirements. But then again, I'm quite comfortable with SQL, so YMMV.

A very old question I know but...
A pandas DataFrame seems to be the ideal option here.
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.html
From the blurb
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects. The primary pandas data structure
http://pandas.pydata.org/

I personally would use the list of row lists. Because the data for each row is always in the same order, you can easily sort by any of the columns by simply accessing that element in each of the lists. You can also easily count based on a particular column in each list, and make searches as well. It's basically as close as it gets to a 2-d array.
Really the only disadvantage here is that you have to know in what order the data is in, and if you change that ordering, you'll have to change your search/sorting routines to match.
Another thing you can do is have a list of dictionaries.
rows = []
rows.append({"ID":"1", "name":"Cat", "year":"1998", "priority":"1"})
This would avoid needing to know the order of the parameters, so you can look through each "year" field in the list.

Have a Table class whose rows is a list of dict or better row objects
In table do not directly add rows but have a method which update few lookup maps e.g. for name
if you are not adding rows in order or id are not consecutive you can have idMap too
e.g.
class Table(object):
def __init__(self):
self.rows = []# list of row objects, we assume if order of id
self.nameMap = {} # for faster direct lookup for row by name
def addRow(self, row):
self.rows.append(row)
self.nameMap[row['name']] = row
def getRow(self, name):
return self.nameMap[name]
table = Table()
table.addRow({'ID':1,'name':'a'})

First, given that you have a complex data retrieval scenario, are you sure even SQLite is overkill?
You'll end up having an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQLite, paraphrasing Greenspun's Tenth Rule.
That said, you are very right in saying that choosing a single data structure will impact one or more of searching, sorting or counting, so if performance is paramount and your data is constant, you could consider having more than one structure for different purposes.
Above all, measure what operations will be more common and decide which structure will end up costing less.

I personally wrote a lib for pretty much that quite recently, it is called BD_XML
as its most fundamental reason of existence is to serve as a way to send data back and forth between XML files and SQL databases.
It is written in Spanish (if that matters in a programming language) but it is very simple.
from BD_XML import Tabla
It defines an object called Tabla (Table), it can be created with a name for identification an a pre-created connection object of a pep-246 compatible database interface.
Table = Tabla('Animals')
Then you need to add columns with the agregar_columna (add_column) method, with can take various key word arguments:
campo (field): the name of the field
tipo (type): the type of data stored, can be a things like 'varchar' and 'double' or name of python objects if you aren't interested in exporting to a data base latter.
defecto (default): set a default value for the column if there is none when you add a row
there are other 3 but are only there for database tings and not actually functional
like:
Table.agregar_columna(campo='Name', tipo='str')
Table.agregar_columna(campo='Year', tipo='date')
#declaring it date, time, datetime or timestamp is important for being able to store it as a time object and not only as a number, But you can always put it as a int if you don't care for dates
Table.agregar_columna(campo='Priority', tipo='int')
Then you add the rows with the += operator (or + if you want to create a copy with an extra row)
Table += ('Cat', date(1998,1,1), 1)
Table += {'Year':date(1998,1,1), 'Priority':2, Name:'Fish'}
#…
#The condition for adding is that is a container accessible with either the column name or the position of the column in the table
Then you can generate XML and write it to a file with exportar_XML (export_XML) and escribir_XML (write_XML):
file = os.path.abspath(os.path.join(os.path.dirname(__file__), 'Animals.xml'))
Table.exportar_xml()
Table.escribir_xml(file)
And then import it back with importar_XML (import_XML) with the file name and indication that you are using a file and not an string literal:
Table.importar_xml(file, tipo='archivo')
#archivo means file
Advanced
This are ways you can use a Tabla object in a SQL manner.
#UPDATE <Table> SET Name = CONCAT(Name,' ',Priority), Priority = NULL WHERE id = 2
for row in Table:
if row['id'] == 2:
row['Name'] += ' ' + row['Priority']
row['Priority'] = None
print(Table)
#DELETE FROM <Table> WHERE MOD(id,2) = 0 LIMIT 1
n = 0
nmax = 1
for row in Table:
if row['id'] % 2 == 0:
del Table[row]
n += 1
if n >= nmax: break
print(Table)
this examples assume a column named 'id'
but can be replaced width row.pos for your example.
if row.pos == 2:
The file can be download from:
https://bitbucket.org/WolfangT/librerias

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I explode a nested dictionary into a dataframe? - python

Related

Generating a list based on a list of dictionaries and data from another list

Use list of indices to manipulate a nested dictionary

python key value two dict matches

Using Python CSV DictReader to create multi-level nested dictionary

Data structure for maintaining tabular data in memory?

Categories

Resources