Creating dataframe from xml - python

I have an xml that I want to parse out and create a dataframe. What I have been trying so far is something like this:
all_dicts = []
fields = ['f1','f2','f3','f4','f5','f6','f7']
for i in root.findall('.//item'):
    d = {}
    for j in i.findall('.//subitems'):
        for k in j.findall('.//subitem'):
            if k.attrib['name'] in fields:
                d[k.attrib['name']] = k.text
    all_dicts.append(d)
This gives me a list of dictionaries that I can easily pass to pd.DataFrame(all_dicts) to get what I want. However, an item tends to have multiple subitem elements with the same name. For example, k.attrib['name'] == 'f1' can occur several times within one item, so each occurrence writes to the same dictionary key and overwrites the previous value, when I need all of them. Is there a way to create such a data frame easily?

Use dict.get to check if the key exists
If the key does not exist, add it as a list
If the key does exist, append to the list
Without a comprehensive example of the xml, I can't offer a more detailed example.
all_dicts = []
fields = ['f1','f2','f3','f4','f5','f6','f7']
for i in root.findall('.//item'):
    d = dict()
    for j in i.findall('.//subitems'):  # search within the current item
        for k in j.findall('.//subitem'):
            n = k.attrib['name']
            if n in fields:
                if d.get(n) is None:  # check if key exists
                    d[n] = [k.text]  # add key as a list
                else:
                    d[n].append(k.text)  # append to list
    all_dicts.append(d)
Alternatively, store the value as a list only if the field is 'f1'.
all_dicts = []
fields = ['f1','f2','f3','f4','f5','f6','f7']
for i in root.findall('.//item'):
    d = dict()
    for j in i.findall('.//subitems'):  # search within the current item
        for k in j.findall('.//subitem'):
            n = k.attrib['name']
            if n in fields and n == 'f1':  # if field is 'f1' add list
                if d.get(n) is None:  # check if key exists
                    d[n] = [k.text]  # add key as a list
                else:
                    d[n].append(k.text)  # append to list
            elif n in fields:  # if field isn't 'f1' just add the text
                d[n] = k.text
    all_dicts.append(d)
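As a self-contained sketch of the list-per-key idea, the following parses a small made-up XML string and collects repeated names with collections.defaultdict instead of dict.get. The element layout and sample values are assumptions, since the question does not show the real file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; the real XML from the question is not shown.
XML = """
<root>
  <item>
    <subitems>
      <subitem name="f1">a</subitem>
      <subitem name="f1">b</subitem>
      <subitem name="f2">c</subitem>
      <subitem name="zz">ignored</subitem>
    </subitems>
  </item>
  <item>
    <subitems>
      <subitem name="f1">d</subitem>
    </subitems>
  </item>
</root>
"""

fields = {'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7'}
root = ET.fromstring(XML)

all_dicts = []
for item in root.findall('.//item'):
    d = defaultdict(list)  # a missing key starts out as an empty list
    for sub in item.findall('.//subitem'):
        name = sub.attrib['name']
        if name in fields:
            d[name].append(sub.text)  # repeated names accumulate
    all_dicts.append(dict(d))  # plain dict, one per item

print(all_dicts)
```

Passing all_dicts to pd.DataFrame should then give one row per item, with a list in each cell where a name occurred.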

Related

how can I create nested dictionary keys and assign them values from a list of namespaced key value pairs?

I have env vars that look like this:
CONFIG-SOMEKEY-SOMEOTHERKEY = val345
CONFIG-SOMEKEY-SOMEOTHEROTHERKEY = val678
CONFIG-ANOTHERKEY = val222
I want to create a dictionary out of them that would look like:
{
    'SOMEKEY': {
        'SOMEOTHERKEY': 'val345',
        'SOMEOTHEROTHERKEY': 'val678'
    },
    'ANOTHERKEY': 'val222'
}
"CONFIG-" is a prefix to denote which vars this should be done with- so I can filter them easily like this:
config_fields = [i for i in os.environ if i.startswith("CONFIG-")]
But I'm unsure how to loop over the string, split on "-" and build a dict.
While looping I was thinking I could check if it's the last item and assign the value, but how would it know the full path of keys it's on?
I suspect this is a job for recursion, I'm just not sure exactly how to implement it.
You could do:
data = ['CONFIG-SOMEKEY-SOMEOTHERKEY = val345',
        'CONFIG-SOMEKEY-SOMEOTHEROTHERKEY = val678',
        'CONFIG-ANOTHERKEY = val222']
result = {}
for e in data:
    key, value = e.split(" = ")  # split into key and value
    path = key.split("-")  # split the key into parts
    ref = result
    for part in path[1:-1]:
        ref[part] = part in ref and ref[part] or {}
        ref = ref[part]
    ref[path[-1]] = value  # take the last part of key and set the value
print(result)
Output
{'SOMEKEY': {'SOMEOTHERKEY': 'val345', 'SOMEOTHEROTHERKEY': 'val678'}, 'ANOTHERKEY': 'val222'}
This part:
ref = result
for part in path[1:-1]:
    ref[part] = part in ref and ref[part] or {}
    ref = ref[part]
ref[path[-1]] = value
creates the nested dictionaries. The loop is equivalent to:
for part in path[1:-1]:
    if part not in ref:
        ref[part] = {}
    ref = ref[part]
So if the part is already in the dictionary, ref moves to the value stored under part; otherwise a new dictionary is created there first.
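You can use the assoc_in function from toolz. Split the name on - and slice off the prefix. Note that assoc_in returns a new dictionary rather than modifying its argument, so the result has to be assigned back.
import os
from toolz.dicttoolz import assoc_in
CONFIG = {}
for k, v in os.environ.items():
    if k.startswith("CONFIG-"):
        CONFIG = assoc_in(CONFIG, k.split('-')[1:], v)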
If you don't want to add a dependency, you can see the implementation of assoc_in here. A simpler substitute might be something like
def assoc_in(d, ks, v):
    for k in ks[:-1]:
        d = d.setdefault(k, {})
    d[ks[-1]] = v
This uses the .setdefault() method to get the nested dicts, which will add a new one if it doesn't exist yet. Unlike the toolz function, this version modifies d in place and returns None.
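To make the setdefault approach concrete, here is a small sketch that runs it on a plain dict standing in for os.environ. The sample variables are the ones from the question; the helper name nest_keys and its parameters are made up for this example.

```python
def nest_keys(flat, prefix="CONFIG-", sep="-"):
    """Build a nested dict from keys such as 'CONFIG-A-B' (hypothetical helper)."""
    nested = {}
    for name, value in flat.items():
        if not name.startswith(prefix):
            continue  # not one of the namespaced variables
        parts = name[len(prefix):].split(sep)
        level = nested
        for part in parts[:-1]:
            # descend one level, creating the dict if it is missing
            level = level.setdefault(part, {})
        level[parts[-1]] = value  # the last part holds the value
    return nested


# Stand-in for os.environ, using the values from the question.
sample_env = {
    "CONFIG-SOMEKEY-SOMEOTHERKEY": "val345",
    "CONFIG-SOMEKEY-SOMEOTHEROTHERKEY": "val678",
    "CONFIG-ANOTHERKEY": "val222",
    "HOME": "/home/user",
}

config = nest_keys(sample_env)
print(config)
```

With real environment variables, nest_keys(os.environ) would be called instead. A name that is both a value and a prefix of another name (for example CONFIG-A together with CONFIG-A-B) is not handled by this sketch.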
You can get your environment variables like so:
import os
text = [f"{k} = {v}" for k,v in os.environ.items() if k.startswith("CONFIG-")]
print(text)
(inspired by How to access environment variable values? - especially this answer)
Then you can split each name on the dashes and descend one dict level per part:
text = """CONFIG-SOMEKEY-SOMEOTHERKEY = val345
CONFIG-SOMEKEY-SOMEOTHEROTHERKEY = val678
CONFIG-ANOTHERKEY = val222"""
text = text.split("\n")
d = {}
for line in text:
    name, value = line.split("=", 1)  # split into name and value
    parts = name.strip().split("-")[1:]  # split the name, drop the CONFIG prefix
    curr_d = d
    for part in parts[:-1]:
        curr_d = curr_d.setdefault(part, {})  # descend, creating levels as needed
    curr_d[parts[-1]] = value.strip()
print(d)
Output:
{'SOMEKEY': {'SOMEOTHERKEY': 'val345',
             'SOMEOTHEROTHERKEY': 'val678'},
 'ANOTHERKEY': 'val222'}

In Python, How to assign value of variable to the dictionary, where the variable will keep getting values for each iteration

Ex:
for x in myresult:
    y=str(x)
    if y.startswith('(') and y.endswith(')'):
        y = y[2:-3]
        y=y.replace("\\","").replace(";",'')
        chr_num = y.find("property_name")
        chr_num=chr_num+15
        PropertyName = y[chr_num:-1]
        chr_num1 = y.find("phrase_value")
        chr_num1 = chr_num1 + 14
        chr_num2 = y.find("where")
        chr_num2=chr_num2-2
        PhraseValue = y[chr_num1:chr_num2]
This is the existing code. Now I want to store 'PhraseValue' in a dictionary or array.
NOTE: PhraseValue will keep getting a new value on each iteration.
This is a very basic question. In your case, PropertyName and PhraseValue are overwritten on each iteration, so at the end of the loop they contain only the last values.
If you want to store multiple values, the easiest structure is a list.
ret = []  # empty list
for x in some_iterator():
    y = some_computation(x)
    ret.append(y)  # add the value to the list
# ret contains all the y's
If you want to use a dict, you have to compute a key and a value:
ret = {}  # empty dict
for x in some_iterator():
    y = some_computation(x)
    k = the_key(x)  # maybe k = x
    ret[k] = y  # map k to y
# ret contains all the k's and their mapped values.
The choice between a list and a dict depends on your specific problem: use a dict if you want to find values by key, like in a dictionary; use a list if you need ordered values.
Assuming that PropertyName is the key, then you could simply add
results = {}
before the loop, and
results[PropertyName] = PhraseValue
as the last line of the if statement inside the loop.
This solution does have one problem. What if a given PropertyName occurs more than once? The above solution would only keep the last found value.
If you want to keep all values, you can use collections.defaultdict:
import collections
results = collections.defaultdict(list)
Then, as the last line of the if statement inside the loop:
results[PropertyName].append(PhraseValue)
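As an illustration of the difference, the sketch below feeds a few made-up (PropertyName, PhraseValue) pairs through both approaches. The sample pairs are hypothetical and stand in for the values parsed inside the loop.

```python
import collections

# Hypothetical parsed values; "color" occurs twice on purpose.
pairs = [("color", "red"), ("size", "large"), ("color", "blue")]

last_only = {}  # plain dict: a repeated key is overwritten
all_values = collections.defaultdict(list)  # every value is kept

for property_name, phrase_value in pairs:
    last_only[property_name] = phrase_value
    all_values[property_name].append(phrase_value)

print(last_only)
print(dict(all_values))
```

The plain dict ends up with only "blue" for "color", while the defaultdict keeps both "red" and "blue" in order of appearance.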

I couldn't understand what these lines of code do

I did not understand what this part of the class does:
for file in os.listdir(path):
    if(os.path.isfile(os.path.join(path,file)) and select in file):
        temp = scipy.io.loadmat(os.path.join(path,file))
        temp = {k:v for k, v in temp.items() if k[0] != '_'}
        for i in range(len(temp[patch_type+"_patches"])):
            self.tensors.append(temp[patch_type+"_patches"][i])
            self.labels.append(temp[patch_type+"_labels"][0][i])
self.tensors = np.array(self.tensors)
self.labels = np.array(self.labels)
Especially this line:
temp = {k:v for k, v in temp.items() if k[0] != '_'}
The whole class is as follows:
class Datasets(Dataset):
    def __init__(self,path,train,transform=None):
        if(train):
            select ="Training"
            patch_type = "train"
        else:
            select = "Testing"
            patch_type = "testing"
        self.tensors = []
        self.labels = []
        self.transform = transform
        for file in os.listdir(path):
            if(os.path.isfile(os.path.join(path,file)) and select in file):
                temp = scipy.io.loadmat(os.path.join(path,file))
                temp = {k:v for k, v in temp.items() if k[0] != '_'}
                for i in range(len(temp[patch_type+"_patches"])):
                    self.tensors.append(temp[patch_type+"_patches"][i])
                    self.labels.append(temp[patch_type+"_labels"][0][i])
        self.tensors = np.array(self.tensors)
        self.labels = np.array(self.labels)
    def __len__(self):
        try:
            if len(self.tensors) != len(self.labels):
                raise Exception("Lengths of the tensor and labels list are not the same")
        except Exception as e:
            print(e.args[0])
        return len(self.tensors)
    def __getitem__(self,idx):
        sample = (self.tensors[idx],self.labels[idx])
        # print(self.labels)
        sample = (torch.from_numpy(self.tensors[idx]),torch.from_numpy(np.array(self.labels[idx])).long())
        return sample
        #tuple containing the image patch and its corresponding label
It's a dict comprehension; in this particular case, it creates a new dict from an existing dict temp, but only for items for which the key k does not start with an underscore. That check is performed by the if ... part.
It is equivalent to
new = {}
for k, v in temp.items():
    if k[0] != '_':
        new[k] = v
temp = new
or, slightly different:
new = {}
for key, value in temp.items():
    if not key.startswith('_'):
        new[key] = value
temp = new
You can see that it looks a bit nicer as a single line, since it avoids a temporary dict (new; under the hood, it still creates a nameless temporary dict though).
It is filtering out the underscore-prefixed variables from the loaded MATLAB file. According to the scipy documentation, scipy.io.loadmat returns a dictionary with the variable names from the loaded file as keys and the matrices as values. The line of code you reference is a dictionary comprehension that copies the dictionary, leaving out the entries that fail the conditional check.
Update
What happens here is roughly this:
Load a MATLAB file (file in your code) as a hashmap (dictionary) where the keys are the variable names from the file and the values are the matrices, and assign it to temp.
Iterate through those key/value pairs, drop the underscore-prefixed ones, and reassign the result to temp.
Profit
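A small sketch of that filtering step, using a hand-built dict in place of the result of scipy.io.loadmat. loadmat adds metadata entries such as __header__, __version__ and __globals__; the train_patches and train_labels values here are invented placeholders.

```python
# Stand-in for what scipy.io.loadmat returns: metadata keys start with "_".
loaded = {
    "__header__": b"MATLAB 5.0 MAT-file",
    "__version__": "1.0",
    "__globals__": [],
    "train_patches": [[1, 2], [3, 4]],  # placeholder data
    "train_labels": [[0, 1]],  # placeholder data
}

# Keep only the entries whose key does not start with an underscore.
data_only = {k: v for k, v in loaded.items() if not k.startswith("_")}

print(sorted(data_only))
```

Only the two data variables survive, which is why the class can afterwards index temp by patch_type + "_patches" and patch_type + "_labels" without touching the metadata.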

Python nested lists within dict element

I'm trying to create the following datastructure (list containing several lists) within a shared dict:
{'my123': [['TEST'],['BLA']]}
code:
records = manager.dict({})
<within some loop>
dictkey = "my123"
tempval = "TEST"  # as an example, gets new values with every loop
list = []
list.append(tempval)
if dictkey not in records.keys():
    records[dictkey] = [list]
else:
    records[dictkey][0].append([tempval])
The first list within the dict element 'my123' gets populated with "TEST", but when I loop a second time (where tempval is "BLA"), the list doesn't get nested.
Instead I'm getting:
{'my123': [['TEST']]}
What am I doing wrong in the else statement?
Edit:
Have modified the code, but still doesn't get added:
records = manager.dict({})
<within some loop>
dictkey = "my123"
tempval = "TEST"  # as an example, gets new values with every loop
list = []
list.append(tempval)
if dictkey == "my123":
    print tempval  # prints new values with every loop to make sure
if dictkey not in records.keys():
    records[dictkey] = [list]
else:
    records[dictkey].append([list])
Remove the [0] part from the last line. The value in the dictionary is already a list. It is that list you wish to append the second list (['BLA']) to.
You're almost there. You will want to append the list like so:
records = manager.dict({})
# within some loop
dictkey = "my123"
tempval = "TEST"  # as an example, gets new values with every loop
temp_list = [tempval]  # holds a list of value
if dictkey not in records:
    records[dictkey] = [temp_list]
else:
    records[dictkey].append(temp_list)  # append list of value
I've found the solution. It looks like the append in the else statement doesn't work for class multiprocessing.managers.DictProxy: reading records[dictkey] returns a copy of the stored list, so appending to it never reaches the shared dict. Assigning the new list back to the key does propagate the change.
I've modified the else statement and now it's working.
records = manager.dict({})
< within some loop >
dictkey = "my123"
tempval = "TEST"  # as an example, gets new values with every loop
temp_list = [tempval]  # holds a list of value
if dictkey not in records:
    records[dictkey] = [temp_list]
else:
    records[dictkey] = records.get(dictkey, []) + [temp_list]
Thanks everyone for your help!

Python Group by count

Given a dictionary, I need some way to do the following:
In the dictionary, we have names, gender, occupation, and salary. For each name I search in the dictionary, I need to figure out whether there are no more than 5 other employees that have the same name, gender and occupation. If so, I output it. Otherwise, I remove it.
Any help or resources would be appreciated!
What I researched:
count = Counter(tok['Name'] for tok in input_file)
This counts the number of occurrences of each name (i.e. Bob: 2, Amy: 4). However, I need to add the gender and occupation to this as well (i.e. Bob, M, Salesperson: 2, Amy, F, Manager: 1).
Checking whether the dictionary has 5 or more (key, value) pairs in which the name, gender and occupation of the employee are the same is quite simple. Removing all such entries is trickier.
# data = {}
# key = 'UID'
# value = ('Name','Male','Accountant','20000')
# data[key] = value
def consistency(dictionary):
    temp_list_of_values_we_care_about = [(x[0],x[1],x[2]) for x in dictionary.values()]
    temp_dict = {}
    for val in temp_list_of_values_we_care_about:
        if val in temp_dict:
            temp_dict[val] += 1
        else:
            temp_dict[val] = 1
    if max(temp_dict.values()) >=5:
        return False
    else:
        return True
To actually get a dictionary with those particular values removed, there are two ways:
Edit and update the original dictionary (doing it in place).
Create a new dictionary and add only those values which satisfy our constraint.
def consistency(dictionary):
    temp_list_of_values_we_care_about = [(x[0],x[1],x[2]) for x in dictionary.values()]
    temp_dict = {}
    for val in temp_list_of_values_we_care_about:
        if val in temp_dict:
            temp_dict[val] += 1
        else:
            temp_dict[val] = 1
    new_dictionary = {}
    for key in dictionary:
        value = dictionary[key]
        temp = (value[0],value[1],value[2])
        if temp_dict[temp] <=5:
            new_dictionary[key] = value
    return new_dictionary
P.S. I have chosen the much easier second way. The first method needs extra bookkeeping, since entries cannot be deleted from a dictionary while iterating over it, and we would rather avoid that.
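The same grouping can be written with collections.Counter keyed on a (name, gender, occupation) tuple, which is what the question was reaching for. This is only a sketch: the records list, its field order and the max_group_size threshold are assumptions made for the example, and the threshold should be adjusted to whatever "no more than 5 others" means for the real data.

```python
from collections import Counter

# Hypothetical rows: (name, gender, occupation, salary).
records = [
    ("Bob", "M", "Salesperson", 30000),
    ("Bob", "M", "Salesperson", 32000),
    ("Amy", "F", "Manager", 50000),
    ("Bob", "M", "Manager", 45000),
]

# Count how often each (name, gender, occupation) combination occurs.
counts = Counter((name, gender, job) for name, gender, job, _salary in records)

# Assumed threshold, kept small so the tiny sample shows the effect.
max_group_size = 1

# Keep only rows whose combination occurs at most max_group_size times.
kept = [row for row in records if counts[row[:3]] <= max_group_size]

print(counts)
print(kept)
```

Because the Counter key is a tuple, two employees named Bob with different occupations are counted separately, which the name-only Counter in the question could not do.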
