I have some data like this:
FeatureName,Machine,LicenseHost
Feature1,host1,lichost1
Feature1,host2,lichost1
Feature2,host1,lichost2
Feature1,host1,lichost1
and so on...
I want to maintain a nested dictionary where the first level of key is the feature name, next is machine name, finally license host, and the value is the number of times that combination occurs.
Something like:
dictionary['Feature1']['host1']['lichost1'] = 2
dictionary['Feature1']['host2']['lichost1'] = 1
dictionary['Feature2']['host1']['lichost2'] = 1
The obvious way of creating/updating such a dictionary is (assuming I am reading the data line by line from the CSV):
for line in file:
    feature, machine, license = line.split(',')
    if feature not in dictionary:
        dictionary[feature] = {}
    if machine not in dictionary[feature]:
        dictionary[feature][machine] = {}
    if license not in dictionary[feature][machine]:
        dictionary[feature][machine][license] = 1
    else:
        dictionary[feature][machine][license] += 1
This ensures that I will never run into key not found errors at any level.
What is the best way to do the above (for any number of nested levels)?
You could use defaultdict:
from collections import defaultdict
import csv
def d1(): return defaultdict(int)
def d2(): return defaultdict(d1)
def d3(): return defaultdict(d2)
dictionary = d3()
with open('input.csv') as input_file:
    next(input_file)  # skip the header row
    for line in csv.reader(input_file):
        dictionary[line[0]][line[1]][line[2]] += 1
assert dictionary['Feature1']['host1']['lichost1'] == 2
assert dictionary['Feature1']['host2']['lichost1'] == 1
assert dictionary['Feature2']['host1']['lichost2'] == 1
assert dictionary['InvalidFeature']['host1']['lichost1'] == 0
If the multiple function defs bother you, you can say the same thing more succinctly:
dictionary = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
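If the nesting depth isn't fixed at three, a recursive factory is a common pattern; here is a minimal sketch (nested_counter is just an illustrative name, not from the question):
from collections import defaultdict

def nested_counter(levels):
    # Return a defaultdict nested `levels` deep, with int counters at the leaves.
    if levels == 1:
        return defaultdict(int)
    return defaultdict(lambda: nested_counter(levels - 1))

dictionary = nested_counter(3)
dictionary['Feature1']['host1']['lichost1'] += 1
dictionary['Feature1']['host1']['lichost1'] += 1
assert dictionary['Feature1']['host1']['lichost1'] == 2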
How can I check that my dict contains only one filled value?
I want to enter my condition only if the value is the only non-empty one in my dict, and it is this specific one (in my example, "test2").
For now I have this if statement:
my_dict = {}
my_dict["test1"] = ""
my_dict["test2"] = "example"
my_dict["test3"] = ""
my_dict["test4"] = ""
if my_dict["test2"] and not my_dict["test1"] and not my_dict["test3"] and not my_dict["test4"]:
print("inside")
I would like to find a better, classier and more "pep8" way to achieve that.
Any ideas?
You have to check every value for truthiness; there's no way around that, e.g.
if sum(1 for v in my_dict.values() if v) == 1:
    print('inside')
You can use filter() as below to check how many truthy values there are in the dictionary.
if len(list(filter(None, my_dict.values()))) == 1:
    print("inside")
Assuming that all your values are strings, what about
ref_key = "test2"
if ''.join(my_dict.values()) == my_dict[ref_key]:
print("inside")
... since it looks like you have a specific key in mind (given the if my_dict["test2"]). Otherwise, my answer is less general than some of the others.
Maybe you want to check whether only one pair remains in the dictionary after removing the empty values.
my_dict = {}
my_dict["test1"] = ""
my_dict["test2"] = "example"
my_dict["test3"] = ""
my_dict["test4"] = ""
my_dict = {key: val for key, val in my_dict.items() if val}
if len(my_dict) == 1:
    print("inside")
Here is another flavour (without an explicit loop):
data = list(my_dict.values())
if data.count('') + 1 == len(data):
    print("Inside")
I have a question for you, dear python lovers.
I have a corpus file, as the following:
Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon 's coordinator in here ?
Excuse me , aren 't you Chae Yoon 's coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?
I want to assign a specific number for each token, and replace it with the assigned number on the file.
What I mean by token is basically each group of characters in the file separated by ' '. So, for example, ? is a token, and Excuse is a token as well.
I have a corpus file which contains more than 4 million lines like the above. Can you show me the fastest way to do what I want?
Thanks,
Might be overkill but you could write your own classifier:
# Python 3.x
class Classifier(dict):
    def __init__(self, args=None):
        '''args is an iterable of keys (only)'''
        self.n = 1
        super().__init__()
        if args:
            for thing in args:
                self[thing] = self.n

    def __setitem__(self, key, value=None):
        # the value argument is ignored; each previously unseen key gets the next number
        if key not in self:
            super().__setitem__(key, self.n)
            self.n += 1

    def setdefault(self, key, default=None):
        # return the existing number, or assign the next one to a new key
        increment = key not in self
        n = super().setdefault(key, self.n)
        self.n += int(increment)
        return n

    def update(self, other):
        '''other is an iterable of keys (only)'''
        for k in other:
            self.setdefault(k)

    def transpose(self):
        '''return the reverse mapping: number -> token'''
        return {v: k for k, v in self.items()}
Usage:
c = Classifier()
with open('foo.txt') as infile, open('classified.txt', 'w+') as outfile:
    for line in infile:
        line = (str(c.setdefault(token)) for token in line.strip().split())
        outfile.write(' '.join(line))
        outfile.write('\n')
To reduce the number of writes you could accumulate lines in a list and use writelines() at some set length.
If you have enough memory, you could read the entire file in and split it then feed that to Classifier.
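Rough sketches of those two variations (the buffer size and file names here are arbitrary):
# Variation 1: buffer output lines and flush them with writelines() every 10,000 lines.
c = Classifier()
buffer = []
with open('foo.txt') as infile, open('classified.txt', 'w+') as outfile:
    for line in infile:
        tokens = (str(c.setdefault(token)) for token in line.strip().split())
        buffer.append(' '.join(tokens) + '\n')
        if len(buffer) >= 10000:
            outfile.writelines(buffer)
            buffer = []
    outfile.writelines(buffer)  # flush whatever is left

# Variation 2: if the whole file fits in memory, read and split it once,
# then feed all the tokens to the Classifier in one go.
with open('foo.txt') as infile:
    c = Classifier(infile.read().split())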
De-classify
z = c.transpose()
with open('classified.txt') as f:
    for line in f:
        line = (z[int(n)] for n in line.strip().split())
        print(' '.join(line))
For Python 2.7, super() requires arguments: replace super() with super(Classifier, self).
If you are going to be working mainly with strings for the token numbers, convert self.n to a string when saving it in the class; then you won't have to convert back and forth between strings and ints in your working code.
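That tweak might look roughly like this inside the class (an untested sketch of the two affected methods):
def __setitem__(self, key, value=None):
    # store the number as a string so callers can join tokens directly
    if key not in self:
        super().__setitem__(key, str(self.n))
        self.n += 1

def setdefault(self, key, default=None):
    increment = key not in self
    n = super().setdefault(key, str(self.n))
    self.n += int(increment)
    return n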
You also may be able to use LabelEncoder from sklearn.
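A rough sketch of the LabelEncoder approach (it needs all tokens in memory to fit the encoder; the file name is just a placeholder):
from sklearn.preprocessing import LabelEncoder

with open('foo.txt') as f:
    lines = [line.split() for line in f]

le = LabelEncoder()
le.fit([token for line in lines for token in line])  # one integer per unique token

encoded = [' '.join(str(n) for n in le.transform(line)) for line in lines]
# le.inverse_transform() maps the integers back to tokens if needed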
If you already have a specific dictionary for changing your values, you simply need to map in the new values:
mapping = {'?': '1', 'Excuse': '2', ...}  # values must be strings for str.replace
for k, v in mapping.items():
    my_string = my_string.replace(k, v)
If you want to create a brand new dictionary:
mapping = list(set(my_string.split(' ')))
mapping = {x: str(i) for i, x in enumerate(mapping)}
for k, v in mapping.items():
    my_string = my_string.replace(k, v)
from collections import defaultdict
from itertools import count

with open(filename) as f:
    with open(output, 'w+') as out:
        c = count()
        d = defaultdict(c.__next__)  # each new token gets the next integer
        for line in f:
            tokens = line.split()
            out.write(' '.join(str(d[token]) for token in tokens))
            out.write('\n')
Using a defaultdict, we remember what tokens we've seen. Every time we see a new token, we get the next number and assign it to that token. This writes output to a different file.
split = "super string".split(' ')
map = []
result = ''
foreach word in split:
if not map.__contains__(word):
map[word] = len(map)
result += ' ' + str(map[word]
This way you avoid doing my_string = my_string.replace(k, v), which is slow.
Try the following: it assigns a number to each token, then replaces each token with its corresponding number.
a = """Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon 's coordinator in here ?
Excuse me , aren 't you Chae Yoon 's coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?""".split()
key_map = {token: str(i) for i, token in enumerate(set(a))}
" ".join(map(lambda x: key_map[x], a))
i.e. first map each unique token to a number, then you can use the key_map to assign the numeric value to each token
So I have a P = dict(). I have the following code:
def someFunction():
    P['key'] += 1
    '''do other task'''
What is the simplest way to check if P['key'] is defined or not?
I checked How do I check if a variable exists? but I am not sure if that answers my question.
Two main ways to check in an ordinary dict:
The "look before you leap" paradigm. The else statement isn't required, of course, unless you want to define some alternate behavior:
if 'key' in P:
    P['key'] += 1
else:
    pass
The "easier to ask for forgiveness than permission" paradigm:
try:
    P['key'] += 1
except KeyError:
    pass  # Or do something else
Or you could use a defaultdict as suggested.
You should use a defaultdict from the collections module.
from collections import defaultdict

d = defaultdict(int)
d[0] = 5
d[1] = 10

for i in range(3):
    d[i] += 1

# Note that d[2] was not set before the loop
for k, v in d.items():
    print('%i: %i' % (k, v))
prints:
brunsgaard#archbook /tmp> python test.py
0: 6
1: 11
2: 1
Usually I check for key presence with:
if some_key in some_dict:
    print("do something")
Advanced usage: if you have a dictionary whose keys are strings and whose values are lists, and you want to append an element to the list associated with a key, you can do
some_dict[some_key] = some_dict.get(some_key, []) + [new_item]
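Two equivalent sketches using setdefault or collections.defaultdict (reusing the some_dict / some_key / new_item names from above):
# With setdefault: create the empty list on first use, then append in place.
some_dict.setdefault(some_key, []).append(new_item)

# With defaultdict: the empty list is created automatically on first access.
from collections import defaultdict
some_dict = defaultdict(list)
some_dict[some_key].append(new_item)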
a is a list filled dynamically with values being received in no specific order. So, if the next value received was (ID2,13), how could I remove the (ID2,10) based on the fact that ID2 was the next value received? Because I don't know the order in which the list is being populated, I won't know the index.
Also, how would I know the count of a specific ID?
I have tried a.count(ID1), but because of the second tuple element, it fails to find any.
a = [(ID1,10),(ID2,10),(ID1,12),(ID2,15)]
My current usage:
while True:
    'Receive ID information in format (IDx,Value)'
    ID_info = (ID2, 13)  # For example
    if a.count(ID2) == 2:  # I need to modify this line as it always returns 0
        del a[0]  # This is what I need to modify to delete the correct information, as it is not always index 0
        a.append(ID_info)
    else:
        a.append(ID_info)
Assuming that the IDs are hashable, it sounds like you want to be using a dictionary.
a = {ID1: 10, ID2: 10}
id, val = (ID2, 13)
a[id] = val
With the "keep two" addition, I still think it's easier with a dictionary, though with some modifications.
EDIT: Simpler version using collections.defaultdict.
from collections import defaultdict

a = defaultdict(list)
a[ID1].append(10)
a[ID2].append(10)

id, val = (ID2, 13)
a[id].append(val)
if len(a[id]) > 2:
    a[id].pop(0)

def count(a, id):
    return len(a[id])
a = {ID1: [10], ID2: [10]}

id, val = (ID2, 13)
if id not in a.keys():
    a[id] = []
a[id].append(val)
if len(a[id]) > 2:
    a[id].pop(0)

def count(a, id):
    if id not in a.keys():
        return 0
    else:
        return len(a[id])
You could (and probably should) also encapsulate this behavior into a simple class inherited from dict.
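A minimal sketch of such a class (the name LastTwoDict and the hard-coded limit of two values per ID are just illustrative):
from collections import defaultdict

class LastTwoDict(defaultdict):
    # Maps each ID to (at most) its two most recent values.
    def __init__(self):
        super().__init__(list)

    def add(self, id, val):
        self[id].append(val)
        if len(self[id]) > 2:
            self[id].pop(0)

    def count(self, id):
        return len(self[id]) if id in self else 0

a = LastTwoDict()
a.add('ID1', 10)
a.add('ID2', 10)
a.add('ID2', 13)
a.add('ID2', 15)  # 'ID2' now maps to [13, 15]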
Given a dictionary, I need some way to do the following:
In the dictionary, we have names, gender, occupation, and salary. For each name I search in the dictionary, I need to figure out whether there are no more than 5 other employees with the same name, gender and occupation. If so, I output it; otherwise, I remove it.
Any help or resources would be appreciated!
What I researched:
count = Counter(tok['Name'] for tok in input_file)
This counts the number of occurrences for each name (i.e. Bob: 2, Amy: 4). However, I need to add the gender and occupation to this as well (i.e. Bob, M, Salesperson: 2, Amy, F, Manager: 1).
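For reference, extending that Counter to the full triple might look roughly like this (assuming each tok also has 'Gender' and 'Occupation' keys, which are guessed field names):
from collections import Counter

count = Counter((tok['Name'], tok['Gender'], tok['Occupation']) for tok in input_file)
# e.g. count[('Bob', 'M', 'Salesperson')] == 2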
Simply checking whether the dictionary has 5 or more (key, value) pairs in which the employee's name, gender and occupation are the same is quite simple. Removing all such inconsistencies is trickier.
# data = {}
# key = 'UID'
# value = ('Name', 'Male', 'Accountant', '20000')
# data[key] = value

def consistency(dictionary):
    temp_list_of_values_we_care_about = [(x[0], x[1], x[2]) for x in dictionary.values()]
    temp_dict = {}
    for val in temp_list_of_values_we_care_about:
        if val in temp_dict:
            temp_dict[val] += 1
        else:
            temp_dict[val] = 1
    if max(temp_dict.values()) >= 5:
        return False
    else:
        return True
And to actually get a dictionary with those particular values removed, there are two ways:
Edit and update the original dictionary. (Doing it in-place)
Create a new dictionary and add only those values which satisfy our constraint.
def consistency(dictionary):
    temp_list_of_values_we_care_about = [(x[0], x[1], x[2]) for x in dictionary.values()]
    temp_dict = {}
    for val in temp_list_of_values_we_care_about:
        if val in temp_dict:
            temp_dict[val] += 1
        else:
            temp_dict[val] = 1
    new_dictionary = {}
    for key in dictionary:
        value = dictionary[key]
        temp = (value[0], value[1], value[2])
        if temp_dict[temp] <= 5:
            new_dictionary[key] = value
    return new_dictionary
P.S. I have chosen the much easier second way to do it. Choosing the first method will cause a lot of computation overhead, and we certainly would want to avoid that.