Nested dictionary with key: list[key:value] pairs to dataframe - python

I'm currently struggling to create a dataframe from a dictionary that is nested like {key1: [{key: value}, {key: value}, ...], key2: [{key: value}, {key: value}, ...]}.
I want this to go into a dataframe where the values of key1, key2, etc. are the index, while the nested key:value pairs in the lists become the columns and record values.
For each key1, key2, etc. the list of key:value pairs can differ in size. Example data:
some_dict = {'0000297386FB11E2A2730050568F1BAB': [{'FILE_ID': '0000297386FB11E2A2730050568F1BAB'},
{'FileTime': '1362642335'},
{'Size': '1016439'},
{'DocType_Code': 'AF3BD580734A77068DD083389AD7FDAF'},
{'Filenr': 'F682B798EC9481FF031C4C12865AEB9A'},
{'DateRegistered': 'FAC4F7F9C3217645C518D5AE473DCB1E'},
{'TITLE': '2096158F036B0F8ACF6F766A9B61A58B'}],
'000031EA51DA11E397D30050568F1BAB': [{'FILE_ID': '000031EA51DA11E397D30050568F1BAB'},
{'FileTime': '1384948248'},
{'Size': '873514'},
{'DatePosted': '7C6BCB90AC45DA1ED6D1C376FC300E7B'},
{'DocType_Code': '28F404E9F3C394518AF2FD6A043D3A81'},
{'Filenr': '13A6A062672A88DE75C4D35917F3C415'},
{'DateRegistered': '8DD4262899F20DE45F09F22B3107B026'},
{'Comment': 'AE207D73C9DDB76E1EEAA9241VJGN02'},
{'TITLE': 'DF96336A6FE08E34C5A94F6A828B4B62'}]}
The final result should look like this:
Index | File_ID | ... | DatePosted | ... | Comment | Title
0000297386FB11E2A2730050568F1BAB|0000297386FB11E2A2730050568F1BAB|...|NaN|...|NaN|2096158F036B0F8ACF6F766A9B61A58B
000031EA51DA11E397D30050568F1BAB|000031EA51DA11E397D30050568F1BAB|...|7C6BCB90AC45DA1ED6D1C376FC300E7B|...|AE207D73C9DDB76E1EEAA9241VJGN02|DF96336A6FE08E34C5A94F6A828B4B62
I've tried parsing the dict directly into pandas with a comprehension, as suggested in "Creating dataframe from a dictionary where entries have different lengths", and I've tried flattening the dict further and then parsing it into pandas, as in "Flatten nested dictionaries, compressing keys". Both to no avail.

Here you go.
You do not need the key of the outer dict, because it is also available in the nested dicts (as FILE_ID).
Then you need to merge each list of small dicts into a single dict; I did that with update.
Then we turn each merged dict into a pd.Series,
and concat the series into a dataframe.
In [39]: seriess = []
    ...: for values in some_dict.values():
    ...:     d = {}
    ...:     for thing in values:
    ...:         d.update(thing)
    ...:     s = pd.Series(d)
    ...:     seriess.append(s)
    ...:

In [40]: pd.concat(seriess, axis=1).T
Out[40]:
FILE_ID FileTime Size ... TITLE DatePosted Comment
0 0000297386FB11E2A2730050568F1BAB 1362642335 1016439 ... 2096158F036B0F8ACF6F766A9B61A58B NaN NaN
1 000031EA51DA11E397D30050568F1BAB 1384948248 873514 ... DF96336A6FE08E34C5A94F6A828B4B62 7C6BCB90AC45DA1ED6D1C376FC300E7B AE207D73C9DDB76E1EEAA9241VJGN02
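For what it's worth, the same merge can be written as a dict comprehension and fed to DataFrame.from_dict, which also keeps the outer keys as the index the question asked for. A minimal sketch, assuming the some_dict from the question:

import pandas as pd

# merge each list of single-pair dicts into one dict per outer key
merged = {k: {key: val for d in v for key, val in d.items()}
          for k, v in some_dict.items()}
df = pd.DataFrame.from_dict(merged, orient='index')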

Let's try the following code:
dfs = []
for k in some_dict.keys():
    dfs.append(pd.DataFrame.from_records(some_dict[k]))

new_df = [dfs[0].append(x) for x in dfs[1:]][0]

final_result = (new_df
                .groupby(new_df['FILE_ID'].notna().cumsum())
                .first())
Output
FILE_ID FileTime Size DocType_Code Filenr DateRegistered TITLE DatePosted Comment
FILE_ID
1 0000297386FB11E2A2730050568F1BAB 1362642335 1016439 AF3BD580734A77068DD083389AD7FDAF F682B798EC9481FF031C4C12865AEB9A FAC4F7F9C3217645C518D5AE473DCB1E 2096158F036B0F8ACF6F766A9B61A58B None None
2 000031EA51DA11E397D30050568F1BAB 1384948248 873514 28F404E9F3C394518AF2FD6A043D3A81 13A6A062672A88DE75C4D35917F3C415 8DD4262899F20DE45F09F22B3107B026 DF96336A6FE08E34C5A94F6A828B4B62 7C6BCB90AC45DA1ED6D1C376FC300E7B AE207D73C9DDB76E1EEAA9241VJGN02
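One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same idea has to go through pd.concat, which also handles any number of keys. A rough equivalent sketch, assuming the some_dict from the question:

dfs = [pd.DataFrame.from_records(some_dict[k]) for k in some_dict]
new_df = pd.concat(dfs, ignore_index=True)
final_result = new_df.groupby(new_df['FILE_ID'].notna().cumsum()).first()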

Related

Make a list from a data frame that has repeated and non repeat values in columns

I have a data frame like this
data = [['Ma', 1, 'too'], ['Ma', 1, 'taa'], ['Ma', 1, 'tuu'],
        ['Ga', 2, 'too'], ['Ga', 2, 'taa'], ['Ga', 2, 'tuu']]
df = pd.DataFrame(data, columns=['NAME', 'ID', 'SUBTYPE'])
NAME ID SUBTYPE
Ma 1 too
Ma 1 taa
Ma 1 tuu
Ga 2 too
Ga 2 taa
Ga 2 tuu
There are repeated NAME and ID values with different SUBTYPE values,
and I want a list like this:
Ma-1-[too,taa,tuu], Ga-2-[too,taa,tuu]
EDIT: NAME and ID always belong together (the same NAME always has the same ID).
Generally, to achieve this in Python we would use a dictionary, since dictionary keys cannot be duplicated.
# We combine the NAME and ID keys, so we can use them together as a key.
df["NAMEID"] = df["NAME"] + "-" + df["ID"].astype(str)

# Convert the desired fields to lists.
name_id_list = df["NAMEID"].tolist()
subtype_list = df["SUBTYPE"].tolist()

# Loop through the lists by zipping them together.
results_dict = {}
for name_id, subtype in zip(name_id_list, subtype_list):
    if results_dict.get(name_id):
        # If the key already exists, append the value to its list.
        results_dict[name_id].append(subtype)
    else:
        # If the key does not exist yet, add it with a single-element list.
        results_dict[name_id] = [subtype]
results_dict will end up looking like:
{'Ma-1': ['too', 'taa', 'tuu'], 'Ga-2': ['too', 'taa', 'tuu']}
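Since the data is already in a DataFrame, a groupby reaches the same result without the manual loop. A minimal sketch, assuming the df defined in the question:

# group SUBTYPE values into a list per NAME/ID pair
grouped = df.groupby(['NAME', 'ID'])['SUBTYPE'].apply(list)
# turn the (NAME, ID) tuples into "NAME-ID" string keys like results_dict above
grouped_dict = {f"{name}-{i}": subtypes for (name, i), subtypes in grouped.items()}
# {'Ga-2': ['too', 'taa', 'tuu'], 'Ma-1': ['too', 'taa', 'tuu']}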

Dataframe to dictionary, values came out scrambled

I have a dataframe that contains two columns that I would like to convert into a dictionary to use as a map.
I have tried multiple ways of converting, but my dictionary values always come up in the wrong order.
My python version is 3 and Pandas version is 0.24.2.
This is what the first few rows of my dataframe looks like:
geozip.head()
Out[30]:
Geoid ZIP
0 100100 36276
1 100124 36310
2 100460 35005
3 100460 35062
4 100460 35214
I would like my dictionary to look like this:
{100100: 36276,
100124: 36310,
100460: 35005,
100460: 35062,
100460: 35214,...}
But instead my outputs came up with the wrong order for the values.
{100100: 98520,
100124: 36310,
100460: 57520,
100484: 35540,
100676: 19018,
100820: 57311,
100988: 15483,
101132: 36861,...}
I tried this first but the dictionary came out unordered:
geozipmap = geozip.set_index('Geoid')['ZIP'].to_dict()
Then I tried coverting the two columns into list first then convert to dictionary, but same problem occurred:
geoid = geozip.Geoid.tolist()
zipcode = geozip.ZIP.tolist()
geozipmap = dict(zip(geoid, zipcode))
I tried converting to OrderedDict and that didn't work either.
Then I've tried:
geozipmap = {k: v for k, v in zip(geoid, zipcode)}
I've also tried:
geozipmap = {}
for index, g in enumerate(geoid):
    geozipmap[geoid[index]] = zipcode[index]
I've also tried the answers suggested in "panda dataframe to ordered dictionary".
None of these work. I'm really not sure what is going on.
Try defaultdict; if the same key has multiple values, you can keep those as a list:
from collections import defaultdict

df = pd.DataFrame(data={"Geoid": [100100, 100124, 100460, 100460, 100460],
                        "ZIP": [36276, 36310, 35005, 35062, 35214]})

data_dict = defaultdict(list)
for k, v in zip(df['Geoid'], df['ZIP']):
    data_dict[k].append(v)
print(data_dict)
defaultdict(<class 'list'>, {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214]})
Will this work for you?
dfG = df['Geoid'].values
dfZ = df['ZIP'].values
for g, z in zip(dfG, dfZ):
    print(str(g) + ':' + str(z))
This gives the output as below (but the values are strings)
100100:36276
100124:36310
100460:35005
100460:35062
100460:35214
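If the duplicate Geoid values are the real issue (a plain dict simply cannot hold the same key more than once), one more option is to collapse them into lists with a groupby. A minimal sketch, assuming the geozip frame from the question:

# one list of ZIP codes per Geoid, as a plain dict
geozipmap = geozip.groupby('Geoid')['ZIP'].apply(list).to_dict()
# e.g. {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214], ...}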

create a filtered list of dictionaries based on existing list of dictionaries

I have a list of dictionaries read in from csv DictReader that represent rows of a csv file:
rows = [{"id":"123","date":"1/1/18","foo":"bar"},
{"id":"123","date":"2/2/18", "foo":"baz"}]
I would like to create a new dictionary where only unique IDs are stored, but keep only the row entry with the most recent date. Based on the above example, it would keep the row with date 2/2/18.
I was thinking of doing something like this, but I'm having trouble translating the pseudocode in the else statement into actual Python.
I can figure out how to check which of two dates is more recent, but I'm struggling with how to find the entry in the new list that has the same id and then retrieve the date from that entry.
Note: Unfortunately, due to resource constraints on our platform I am unable to use pandas for this project.
new_data = []
for row in rows:
    if row['id'] not in new_data:
        new_data.append(row)
    else:
        # check the element in new_data with the same id as row['id']
        # if that element's date value is less recent:
        #     replace it with the current row
        # else:
        #     continue to the next row in rows
You'll need a function to convert your date (as string) to a date (as date).
import datetime

def to_date(date_str):
    # note: '18' is kept as year 18, which is fine as long as all dates use two-digit years
    d1, m1, y1 = [int(s) for s in date_str.split('/')]
    return datetime.date(y1, m1, d1)
I assumed your date format is d/m/yy. Consider using datetime.strptime to parse your dates, as illustrated by Alex Hall's answer.
Then, the idea is to loop over your rows and store them in a new structure (here, a dict whose keys are the IDs). If a key already exists, compare its date with the current row, and take the right one. Following your pseudo-code, this leads to:
rows = [{"id":"123","date":"1/1/18","foo":"bar"},
{"id":"123","date":"2/2/18", "foo":"baz"}]
new_data = dict()
for row in rows:
existing = new_data.get(row['id'], None)
if existing is None or to_date(existing['date']) < to_date(row['date']):
new_data[row['id']] = row
If you want your new_data variable to be a list, use new_data = list(new_data.values()).
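For the two example rows, a quick check shows only the later entry is kept, matching the expected result:

print(list(new_data.values()))
# [{'id': '123', 'date': '2/2/18', 'foo': 'baz'}]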
import datetime

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

def parse_date(d):
    return datetime.datetime.strptime(d, "%d/%m/%y").date()

tmp_dict = {}
for row in rows:
    if row['id'] not in tmp_dict:
        tmp_dict[row['id']] = row
    else:
        if parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']):
            tmp_dict[row['id']] = row
print(list(tmp_dict.values()))
output
[{'date': '2/2/18', 'foo': 'baz', 'id': '123'}]
Note: you can merge the two ifs into if row['id'] not in tmp_dict or parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']) for cleaner and shorter code.
Firstly, work with proper date objects, not strings. Here is how to parse them:
from datetime import datetime, date

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

for row in rows:
    row['date'] = datetime.strptime(row['date'], '%d/%m/%y').date()
(check if the format is correct)
Then for the actual task:
new_data = {}
for row in rows:
    new_data[row['id']] = max(new_data.get(row['id'], row), row,
                              key=lambda r: r['date'])
print(list(new_data.values()))
Alternatively:
Here are some generic utility functions that work well here which I use in many places:
from collections import defaultdict

def group_by_key_func(iterable, key_func):
    """
    Create a dictionary from an iterable such that the keys are the result of
    evaluating a key function on elements of the iterable and the values are
    lists of elements all of which correspond to the key.
    """
    result = defaultdict(list)
    for item in iterable:
        result[key_func(item)].append(item)
    return result

def group_by_key(iterable, key):
    return group_by_key_func(iterable, lambda x: x[key])
Then the solution can be written as:
by_id = group_by_key(rows, 'id')
for id_num, group in list(by_id.items()):
    by_id[id_num] = max(group, key=lambda r: r['date'])
print(by_id.values())
This is less efficient than the first solution because it creates lists along the way that are discarded, but I use the general principles in many places and I thought of it first, so here it is.
If you like to utilize classes as much as I do, then you could make your own class to do this:
from datetime import date
rows = [
    {"id": "123", "date": "1/1/18", "foo": "bar"},
    {"id": "123", "date": "2/2/18", "foo": "baz"},
    {"id": "456", "date": "3/3/18", "foo": "bar"},
    {"id": "456", "date": "1/1/18", "foo": "bar"}
]

class unique(dict):
    def __setitem__(self, key, value):
        # Add key if missing or replace key if date is newer
        if key not in self or self[key]["date"] < value["date"]:
            dict.__setitem__(self, key, value)

data = unique()  # Initialize new class based on dict
for row in rows:
    d, m, y = map(int, row["date"].split('/'))  # Split date into parts
    row["date"] = date(y, m, d)                 # Replace date value
    data[row["id"]] = row  # Set new data; same ids get overwritten by more recent rows

print(list(data.values()))
Outputs:
[
{'date': datetime.date(18, 2, 2), 'foo': 'baz', 'id': '123'},
{'date': datetime.date(18, 3, 3), 'foo': 'bar', 'id': '456'}
]
Keep in mind that data is a dict subclass that overrides the __setitem__ method and uses the IDs as keys. And the dates are date objects, so they can be compared easily.

Creating pandas dataframe from list of dictionaries containing lists of data

I have a list of dictionaries with this structure.
{
    'data' : [[year1, value1], [year2, value2], ... m entries],
    'description' : string,
    'end' : string,
    'f' : string,
    'lastHistoricalperiod' : string,
    'name' : string,
    'series_id' : string,
    'start' : int,
    'units' : string,
    'unitsshort' : string,
    'updated' : string
}
I want to put this in a pandas DataFrame that looks like
year value updated (other dict keys ... )
0 2040 120.592468 2014-05-23T12:06:16-0400 other key-values
1 2039 120.189987 2014-05-23T12:06:16-0400 ...
2 other year-value pairs ...
...
n
where n = m * len(list of dictionaries) (each 'data' list having length m).
That is, each tuple in 'data' should have its own row. What I've done thus far is this:
x = [list of dictionaries as described above]

# Create an empty DataFrame
output = pd.DataFrame()

# Loop through each dictionary in the list
for dictionary in x:
    # Create a new DataFrame from the 2-D list alone.
    data = dictionary['data']
    y = pd.DataFrame(data, columns=['year', 'value'])
    # Loop through all the other dictionary key-value pairs and fill in values
    for key in dictionary:
        if key != 'data':
            y[key] = dictionary[key]
    # Concatenate the most recent output with the frame from this dictionary.
    output = pd.concat([output, y], ignore_index=True)
This seems very hacky, and I was wondering if there's a more 'pythonic' way to do this, or at least if there are any obvious speedups here.
If your data is in the form [{}, {}, ...] you can do the following...
The issue with your data is in the data key of your dictionaries.
df = pd.DataFrame(data)
fix = df.groupby(level=0)['data'].apply(lambda x:pd.DataFrame(x.iloc[0],columns = ['Year','Value']))
fix = fix.reset_index(level=1,drop=True)
df = pd.merge(fix,df.drop(['data'],1),how='inner',left_index=True,right_index=True)
The code does the following:
- creates a DataFrame from your list of dictionaries,
- creates a new dataframe by stretching out your data column into more rows,
- removes the irrelevant index level that the stretching line introduced (a multiindex),
- finally merges on the original index to get the desired DataFrame.
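On newer pandas (0.25+ has DataFrame.explode), another route is to explode the data column into one row per pair and then split the pairs out. A minimal sketch with made-up sample_records, since no concrete data was given in the question:

import pandas as pd

# hypothetical records in the structure the question describes
sample_records = [
    {'data': [[2040, 120.5], [2039, 120.1]], 'description': 'example series', 'units': 'example units'},
    {'data': [[2038, 119.9]], 'description': 'another series', 'units': 'example units'},
]

df = pd.DataFrame(sample_records)
exploded = df.explode('data').reset_index(drop=True)   # one row per [year, value] pair
pairs = pd.DataFrame(exploded.pop('data').tolist(), columns=['year', 'value'])
result = pd.concat([pairs, exploded], axis=1)
print(result)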
Some data would have been helpful when answering this question. However, from your data structure some example data might look like this:
dict_list = [{'data' : [['1999', 1], ['2000', 2], ['2001', 3]],
              'description' : 'foo_dictionary',
              'end' : 'foo1',
              'f' : 'foo2',},
             {'data' : [['2002', 4], ['2003', 5]],
              'description' : 'bar_dictionary',
              'end' : 'bar1',
              'f' : 'bar2',}
             ]
My suggestion would be to manipulate and reshape this data into a new dictionary and then simply pass that dictionary to the DataFrame constructor. In order to pass a dictionary to the pd.DataFrame constructor you could very simply reshape the data into a new dict as follows:
data_dict = {'years' : [],
             'value' : [],
             'description' : [],
             'end' : [],
             'f' : []}

for dictionary in dict_list:
    data_dict['years'].extend([elem[0] for elem in dictionary['data']])
    data_dict['value'].extend([elem[1] for elem in dictionary['data']])
    data_dict['description'].extend(dictionary['description'] for x in range(len(dictionary['data'])))
    data_dict['end'].extend(dictionary['end'] for x in range(len(dictionary['data'])))
    data_dict['f'].extend(dictionary['f'] for x in range(len(dictionary['data'])))
and then just pass this to pandas
import pandas as pd
pd.DataFrame(data_dict)
which gives me the following output:
description end f value years
0 foo_dictionary foo1 foo2 1 1999
1 foo_dictionary foo1 foo2 2 2000
2 foo_dictionary foo1 foo2 3 2001
3 bar_dictionary bar1 bar2 4 2002
4 bar_dictionary bar1 bar2 5 2003
I would say that if this is the type of output you want, then this system would be a decent simplification.
In fact you could simplify it even further by creating a year:value dictionary, and a dict for the other vals. Then you would not have to type out the new dictionary and you could run a nested for loop. This could look as follows:
year_val_dict = {'years' : [],
                 'value' : []}
other_val_dict = {_key : [] for _key in dict_list[0] if _key != 'data'}

for dictionary in dict_list:
    year_val_dict['years'].extend([elem[0] for elem in dictionary['data']])
    year_val_dict['value'].extend([elem[1] for elem in dictionary['data']])
    for _key in other_val_dict:
        other_val_dict[_key].extend(dictionary[_key] for x in range(len(dictionary['data'])))

year_val_dict.update(other_val_dict)
pd.DataFrame(year_val_dict)
NB this of course assumes that all the dicts in the dict_list have the same structure....

Searching items of large list in large python dictionary quickly

I am currently working to make a dictionary with a tuple of names as keys and a float as the value, of the form {(nameA, nameB): datavalue, (nameB, nameC): datavalue, ...}
The value data comes from a matrix I have made into a pandas DataFrame, with the names as both the index and column labels. I have created an ordered list of the keys for my final dictionary, called keys, with the function createDictionaryKeys(). The issue I have is that not all the names from this list appear in my data matrix. I want to include only the names that do appear in the data matrix in my final dictionary.
How can I do this search avoiding the slow linear for loop? I created a dictionary that has the name as key and a value of 1 if it should be included and 0 otherwise as well. It has the form {nameA : 1, nameB: 0, ... } and is called allow_dict. I was hoping to use this to do some sort of hash search.
def createDictionary(keynamefile, seperator, datamatrix, matrixsep):
    import pandas as pd
    keys = createDictionaryKeys(keynamefile, seperator)
    final_dict = {}
    data_df = pd.read_csv(open(datamatrix), sep=matrixsep)
    pd.set_option("display.max_rows", len(data_df))
    df_indices = list(data_df.index.values)
    df_cols = list(data_df.columns.values)[1:]
    for i in df_indices:
        data_df = data_df.rename(index={i: df_cols[i]})
    data_df = data_df.drop("Unnamed: 0", 1)
    allow_dict = descriminatePromoters(HARDCODEDFILENAME, SEP, THRESHOLD)
    #print ( item for item in df_cols if allow_dict[item] == 0 ).next()
    present = [x for x in keys if x[0] in df_cols and x[1] in df_cols]
    for i in present:
        final_dict[i] = final_df.loc[i[0], i[1]]
    return final_dict
Testing existence in python sets is O(1), so simply:
present = [ x for x in keys if x[0] in set(df_cols) and x[1] in set(df_cols)]
...should give you some speed up. Since you're iterating through in O(n) anyway (and have to, in order to construct your final_dict), something like:
colset = set(df_cols)
final_dict = {k: final_df.loc[k[0], k[1]]
              for k in keys if (k[0] in colset)
              and (k[1] in colset)}
Would be nice, I would think.
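A tiny self-contained illustration of the same pattern, with made-up names and values standing in for the real keys and for the final_df.loc lookups:

keys = [('a', 'b'), ('b', 'c'), ('a', 'z')]
df_cols = ['a', 'b', 'c']
lookup = {('a', 'b'): 1.0, ('b', 'c'): 2.0}   # stands in for the final_df.loc lookups

colset = set(df_cols)                          # build the set once; membership tests are O(1)
final_dict = {k: lookup[k] for k in keys
              if k[0] in colset and k[1] in colset}
print(final_dict)   # {('a', 'b'): 1.0, ('b', 'c'): 2.0}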
