How to read data and create a dictionary of dictionaries in Python?

I need to read a huge database from HDF5 files and organize it in a nice way to make it easy to read and use.
I saw this post Python List as variable name and I'm trying to make a dictionary of dictionaries.
Basically, I have a list of datasets and variables that I need to read from the HDF5 files. As an example, I created these two lists:
dataset = [0,1,2,3]
var = ['a','b','c']
Now, there is a legacy "home brewed" read_hdf5(dataset, var) function that reads the data from the HDF5 files and returns the appropriate array.
I can easily read from one specific dataset (say 0) at a time by creating a dictionary like this:
data = {}
for type in var:
    data[type] = read_hdf5(0, type)
This gives me a nice dictionary of all the data for each variable in dataset 0.
Now I want to implement a dictionary of dictionaries, so that I can access the data like this:
data[dataset][var]
which returns the array of data for the given set and variable.
I tried the following, but the only thing the loop does is overwrite the last variable read:
for set in dataset:
    for type in var:
        data[set] = {'set': set, str(type): read_hdf5(set, type)}
Any ideas? Thank you!!!

You have to create a new dict for each set before iterating over the variables:
dataset = [0, 1, 2, 3]
var = ['a', 'b', 'c']
data = {}
for set in dataset:
    data[set] = {}
    for type in var:
        data[set][type] = read_hdf5(set, type)
As a side note: set and type are built-in names, so you'd better use something else.
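As a side benefit of renaming, the whole structure can also be built in one pass with a nested dict comprehension; a minimal sketch, assuming read_hdf5 behaves as described in the question:
# ds and v replace the built-in names set and type
data = {ds: {v: read_hdf5(ds, v) for v in var} for ds in dataset}
Then data[0]['a'] is the array for variable 'a' in dataset 0.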

Related

Delete entries of a list nested in a dictionary depending on values located in a different nested list in the same dictionary

I am writing a script to read in a lot of data contained in multiple CSV files. When I read the data from each CSV I put it in a list which is stored in a dictionary, so the eventual data structure is:
data_set = {user1: {filenames: [file1, file2, ...],
                    labels: [file1label_1, file1label_2, file1label_3, file2label_1, ...],
                    features: [file1feat_1, file1feat_2, file1feat_3, file2feat_1, ...],
                    file_timepoints: [file1time_1, file1time_2, file1time_3, file2time_1, ...]
                    },
            user2: {filenames: [file1, file2, ...],
                    labels: [file1label_1, file1label_2, file1label_3, file2label_1, ...],
                    features: [file1feat_1, file1feat_2, file1feat_3, file2feat_1, ...],
                    file_timepoints: [file1time_1, file1time_2, file1time_3, file2time_1, ...]
                    }
            }
Now the filenames variable is a list of 200 files so the length is 200, but all other variables in the dictionary are lists of length 7000 because they contain data from each timestep of each file.
I am wondering: what would be an efficient way to delete the data corresponding to a specific file from all of the lists within the dictionary? For example, if I wanted to delete the file1 data for user1, the resulting dictionary would look like:
data_set = {user1: {filenames: [file2, ...],
                    labels: [file2label_1, ...],
                    features: [file2feat_1, ...],
                    file_timepoints: [file2time_1, ...]
                    },
            user2: {filenames: [file1, file2, ...],
                    labels: [file1label_1, file1label_2, file1label_3, file2label_1, ...],
                    features: [file1feat_1, file1feat_2, file1feat_3, file2feat_1, ...],
                    file_timepoints: [file1time_1, file1time_2, file1time_3, file2time_1, ...]
                    }
            }
So far I have tried using nested for loops but it gets extremely messy and is highly inefficient. Any suggestions would be greatly appreciated!
EDIT:
This is an example of what the data looks like. The labels come from a CSV that is a single 1 x Ntimesteps row of data, the feature data comes from a CSV that is Nfeatures x Ntimesteps, and the timepoints come from a CSV that is a single 1 x Ntimesteps row.
After doing some research I think the best way for me to approach this issue would be to use object-oriented concepts, namely using a factory design pattern with classes and an inheritance hierarchy with the files and variables.
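One way to make the deletion cheap, in line with that idea, is to group the per-timestep data by file instead of keeping flat 7000-element lists; removing a file then drops all of its data in one step. A minimal sketch (the FileData class and the sample values are illustrative, not from the original post):
class FileData:
    """Holds all per-timestep data for a single CSV file."""
    def __init__(self, labels, features, timepoints):
        self.labels = labels          # one entry per timestep
        self.features = features      # one feature vector per timestep
        self.timepoints = timepoints  # one timestamp per timestep

# user -> filename -> FileData
data_set = {'user1': {
    'file1': FileData([1, 0, 1], [[0.1], [0.2], [0.3]], [0, 1, 2]),
    'file2': FileData([0], [[0.4]], [3]),
}}

# deleting file1's labels, features, and timepoints is now a single operation
del data_set['user1']['file1']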

Creating/Getting/Extracting multiple data frames from python dictionary of dataframes

I have a Python dictionary whose keys are dataset names and whose values are the data frames themselves; see the dictionary dict below:
[Dictionary of DataFrames]
One way is to write out all the assignments manually, like below:
csv = dict['csv.pkl']
csv_emp = dict['csv_emp.pkl']
csv_emp_yr = dict['csv_emp_yr.pkl']
emp_wf = dict['emp_wf.pkl']
emp_yr_wf = dict['emp_yr_wf.pkl']
But this gets very tedious as the number of datasets grows.
Any help on how to do this in a loop?
Although I would not recommend this method, you can try this:
import sys
this = sys.modules[__name__] # this is now your current namespace
for key in dict.keys():
    setattr(this, key, dict[key])
Now new variables exist with the same names as the keys of the dictionary.
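One caveat: this only produces usable variables when the keys are valid Python identifiers. Keys like 'csv_emp.pkl' contain a dot, so a sketch that strips the extension first might look like this (frames stands in for the dictionary, since the name dict shadows a built-in):
import sys

this = sys.modules[__name__]
for key, df in frames.items():
    # 'csv_emp.pkl' is not a valid identifier; drop the extension first
    setattr(this, key.rsplit('.', 1)[0], df)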
globals() has a risk: it gives you whatever the namespace currently points to, but this can change, so modifying the return value of globals() is not a good idea.
A list can also be used (for limited use cases):
dataframes = []
for key in dict.keys():
    dataframes.append(dict[key])
Still, this is your choice; both of the above methods have some limitations.

python: initializing list of sets within dictionary

I want to initialize a list of sets in a dictionary. I can directly enter the values, but I couldn't initialize it and get input from the user. My data structure should look something like this:
d = {1: [{1, 2}, {3, 4}], 2: [{2, 3}, {4, 5, 100}]}
So that, if I want to access the element 100, it could be done as d[2][1][2].
I could define the data structure but couldn't initialize it. Could someone help me initialize the structure?
If I understand correctly, you want to initialise the list of sets automatically. defaultdict seems the most appropriate way to do this:
from collections import defaultdict
a = defaultdict(lambda: [set(), set()])
a[100][1].add(2)
print(a.items())
This would initialise a list with two sets whenever a non-existent key is accessed.
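To build the exact structure from the question on top of that, a minimal sketch (updating the auto-created sets in place):
from collections import defaultdict

d = defaultdict(lambda: [set(), set()])
d[1][0].update({1, 2})
d[1][1].update({3, 4})
d[2][0].update({2, 3})
d[2][1].update({4, 5, 100})
print(100 in d[2][1])  # True: the set at d[2][1] contains 100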

Dataset that holds a list of pre-defined classes for each observation - in R

I'm new to R and need to keep a dataset that contains, for each observation (let's say a user), a list of classes (let's say events).
For example, for each user_ID I hold a list of events, and every event class contains the fields: name, time, type.
My question is: what is the optimal way to hold such data in R? I have several million such observations, so I need to hold them in an optimal manner (in terms of space).
In addition, after I decide how to hold the data, I need to create it from within Python, as my original data is in a Python dict. What is the best way to do that?
Thanks!
You can save your dict as a .csv using the csv module in Python.
import csv

mydict = {"a": 1, "b": 2, "c": 3}
with open("test.csv", "w", newline="") as myfile:
    w = csv.writer(myfile)
    w.writerows(mydict.items())
Then just load it into R with read.csv.
Of course, depending on what your Python dict looks like, you may need some more post processing, but without a reproducible example it's hard to say what that would be.
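Since the original data is a dict holding a list of events per user, a long layout (one row per event) is usually the easiest shape to read back with read.csv, and it is compact on disk. A minimal sketch, assuming each event is a dict with name, time, and type fields:
import csv

# hypothetical input: user_ID -> list of event dicts
users = {"u1": [{"name": "login", "time": 1, "type": "auth"},
                {"name": "click", "time": 2, "type": "ui"}]}

with open("events.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["user_id", "name", "time", "type"])  # header row for read.csv
    for user_id, events in users.items():
        for e in events:
            w.writerow([user_id, e["name"], e["time"], e["type"]])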

Create a dictionary with name of variable

I am trying to write a program to parse a file, break it into sections, and read it into a nested dictionary. I want the output to be something like this:
output = {'section1':{'nested_section1':{'value1':'value2'}}}
I'm trying to do this by building separate dictionaries, then merging them, but I'm running into trouble naming them. I want the dictionaries inside the others to be named based on the sections of the file they're taken from. But it seems I can't name a dictionary from a variable.
You can name a dictionary entry from a variable. If you have
text = "myKey" # or myNumber or any hashable type
data = dict()
You can do
data[text] = anyValue
Store all your dictionaries in a single root dictionary.
all_dicts['output'] = {'section1':{'nested_section1':{'value1':'value2'}}}
As you merge dictionaries, remove the children from all_dicts.
all_dicts['someotherdict']['key'] = all_dicts['output']
del all_dicts['output']
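Putting this together for the file-parsing case, here is a minimal sketch that uses the current section name, held in a variable, as the dictionary key; the [section] / key=value file format is an assumption, not from the original question:
output = {}
section = None
with open("input.txt") as f:  # hypothetical input file
    for line in f:
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]  # the variable's value names the nested dict
            output[section] = {}
        elif "=" in line and section is not None:
            key, value = line.split("=", 1)
            output[section][key.strip()] = value.strip()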
