Python Group by count - python

Given a dictionary, I need some way to do the following:
In the dictionary, we have names, gender, occupation, and salary. For each name I search for in the dictionary, I need to figure out whether there are no more than 5 other employees with the same name, gender and occupation. If so, I output it. Otherwise, I remove it.
Any help or resources would be appreciated!
What I researched:
count = Counter(tok['Name'] for tok in input_file)
This counts the number of occurrences of each name (e.g. Bob: 2, Amy: 4). However, I need to add the gender and occupation to this as well (e.g. Bob, M, Salesperson: 2, Amy, F, Manager: 1).

Checking whether the dictionary has 5 or more (key, value) pairs in which the name, gender and occupation of the employee are the same is quite simple. Removing all such inconsistencies is trickier.
# data = {}
# key = 'UID'
# value = ('Name','Male','Accountant','20000')
# data[key] = value
def consistency(dictionary):
    # .values() works on both Python 2 and 3 (itervalues() is Python 2 only)
    temp_list_of_values_we_care_about = [(x[0],x[1],x[2]) for x in dictionary.values()]
    temp_dict = {}
    for val in temp_list_of_values_we_care_about:
        if val in temp_dict:
            temp_dict[val] += 1
        else:
            temp_dict[val] = 1
    if max(temp_dict.values()) >=5:
        return False
    else:
        return True
To actually get a dictionary with those particular values removed, there are two ways:
1. Edit and update the original dictionary (doing it in place).
2. Create a new dictionary and add only those values which satisfy our constraint.
def consistency(dictionary):
    # .values() works on both Python 2 and 3 (itervalues() is Python 2 only)
    temp_list_of_values_we_care_about = [(x[0],x[1],x[2]) for x in dictionary.values()]
    temp_dict = {}
    for val in temp_list_of_values_we_care_about:
        if val in temp_dict:
            temp_dict[val] += 1
        else:
            temp_dict[val] = 1
    new_dictionary = {}
    for key in dictionary:
        value = dictionary[key]
        temp = (value[0],value[1],value[2])
        if temp_dict[temp] <=5:
            new_dictionary[key] = value
    return new_dictionary
P.S. I have chosen the much easier second way to do it. Choosing the first method will cause a lot of computation overhead, and we certainly would want to avoid that.
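The counting step can also be written with collections.Counter keyed on a (name, gender, occupation) tuple, which is the extension the question asks about. This is only a minimal sketch, assuming each value is a tuple laid out as (name, gender, occupation, salary) as in the commented example above. The function name filter_by_group_count, the max_count parameter and the sample data are hypothetical.

```python
from collections import Counter


def filter_by_group_count(employees, max_count=5):
    """Keep entries whose (name, gender, occupation) group has at most max_count members."""
    # Count how often each (name, gender, occupation) combination occurs.
    counts = Counter(value[:3] for value in employees.values())
    # Build a new dictionary instead of deleting from the original.
    return {
        uid: value
        for uid, value in employees.items()
        if counts[value[:3]] <= max_count
    }


# Hypothetical sample data: UID -> (name, gender, occupation, salary)
employees = {
    1: ("Bob", "M", "Salesperson", "20000"),
    2: ("Bob", "M", "Salesperson", "21000"),
    3: ("Bob", "M", "Salesperson", "22000"),
    4: ("Amy", "F", "Manager", "50000"),
}

# With max_count=2 the three matching "Bob" entries are dropped.
filtered = filter_by_group_count(employees, max_count=2)
print(filtered)
```

Whether the limit should count the entry itself ("no more than 5 others") is a matter of adjusting max_count.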

Related

Creating dataframe from xml

I have an xml that I want to parse out and create a dataframe. What I have been trying so far is something like this:
all_dicts = []
fields = ['f1','f2','f3','f4','f5','f6','f7']
for i in root.findall('.//item'):
    d = {}
    for j in product.findall('.//subitems'):
        for k in j.findall('.//subitem'):
            if k.attrib['name'] in fields:
                d[k.attrib['name']] = k.text
    all_dicts.append(d)
This gives me a list of dictionaries that I can easily pass to pd.DataFrame(all_dicts) to get what I want. However, the subitems tend to have multiple sub-elements with the same name. For example, a subitem could have several entries where k.attrib['name'] == 'f1', so the same key is set again and the previous value is overwritten, when I need all of them. Is there an easy way to create such a data frame?
Use dict.get to check if the key exists
If the key does not exist, add it as a list
If the key does exist, append to the list
Without a comprehensive example of the xml, I can't offer a more detailed example.
all_dicts = []
fields = ['f1','f2','f3','f4','f5','f6','f7']
for i in root.findall('.//item'):
    d = dict()
    # `product` in the question is assumed to be the current item `i`
    for j in i.findall('.//subitems'):
        for k in j.findall('.//subitem'):
            n = k.attrib['name']
            if n in fields:
                if d.get(n) is None: # check if key exists
                    d[n] = [k.text] # add key as a list
                else:
                    d[n].append(k.text) # append to list
    all_dicts.append(d)
Alternatively, store the value as a list only if the field is 'f1'.
all_dicts = []
fields = ['f1','f2','f3','f4','f5','f6','f7']
for i in root.findall('.//item'):
    d = dict()
    # `product` in the question is assumed to be the current item `i`
    for j in i.findall('.//subitems'):
        for k in j.findall('.//subitem'):
            n = k.attrib['name']
            if n in fields and n == 'f1': # if field is 'f1' add list
                if d.get(n) is None: # check if key exists
                    d[n] = [k.text] # add key as a list
                else:
                    d[n].append(k.text) # append to list
            elif n in fields: # if field isn't 'f1' just add the text
                d[n] = k.text
    all_dicts.append(d)
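The "create the list if missing, otherwise append" steps can also be handed to collections.defaultdict. The following is a rough sketch only: the XML layout is invented for illustration, since the question does not show the real document, and the shortened fields set is likewise an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML; the real document's layout is not shown in the question.
xml_text = """
<root>
  <item>
    <subitems>
      <subitem name="f1">a</subitem>
      <subitem name="f1">b</subitem>
      <subitem name="f2">c</subitem>
      <subitem name="other">ignored</subitem>
    </subitems>
  </item>
  <item>
    <subitems>
      <subitem name="f2">d</subitem>
    </subitems>
  </item>
</root>
"""

fields = {"f1", "f2", "f3"}
root = ET.fromstring(xml_text)

all_dicts = []
for item in root.findall(".//item"):
    d = defaultdict(list)  # missing keys start out as empty lists
    for sub in item.findall(".//subitem"):
        name = sub.attrib["name"]
        if name in fields:
            d[name].append(sub.text)  # repeated names accumulate
    all_dicts.append(dict(d))

print(all_dicts)
```

Passing all_dicts to pd.DataFrame, as in the question, should then give cells that hold lists rather than single strings.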

Create a list from dictionary values and some conditions in python

I need some help with Python dictionaries.
The basic idea is to create a list containing several values from a Python dictionary.
I go through each key of the dict, and if the number of values is > 1, I check whether these values have a particular prefix; if so, I put the values that do not have the prefix into a list.
Here is the dict:
defaultdict(<class 'list'>, {'ACTGA': ['YP_3878.3', 'HUYUII.1'], 'ACTGT': ['XP_46744.1', 'JUUIIL.2'], 'ACCTTGG': ['YP_8990.1'], 'ACCTTTT': ['YP_8992.1'], 'ATGA': ['YP_9000.1', 'YP_3222.1'], 'AAATTTGG': ['ORAAA.8', 'OTTTAX']})
and here is the prefix_list = ["XP_","YP_"]
Let me explain it better:
I would like to create a new sequence_list from the values.
The basic idea is to go through each key and, if there are > 1 values, put n-1 of the values into the sequence_list depending on some conditions.
Here is an example:
The first key is 'ACTGA', which has 2 values: YP_3878.3 and HUYUII.1. Because HUYUII.1 does not start with any prefix in the prefix_list, I put it into the sequence_list:
print(sequence_list):
["HUYUII.1"]
The second key is 'ACTGT', which has 3 values: XP_46744.1, JUUIIL.2 and JUUIIL.3. Because JUUIIL.2 and JUUIIL.3 do not start with any prefix in the prefix_list, I put them into the sequence_list:
print(sequence_list):
["HUYUII.1","JUUIIL.2","JUUIIL.3"]
The third key with more than 1 value is 'ATGAAA', which has 3 values: 'YP_9000.1', 'YP_3222.1' and 'HUU3222.1'. Because HUU3222.1 does not start with any prefix in the prefix_list, I put it into the sequence_list, AND because the 2 remaining values both have a prefix, I also put the first of them into the sequence_list:
print(sequence_list):
["HUYUII.1","JUUIIL.2","JUUIIL.3","YP_9000.1","HUU3222.1"]
The fourth key with more than 1 value is 'AAATTTGG', which has 2 values: 'ORLOP.8' and 'OTTTAX'. Because neither starts with a prefix in the prefix_list, I put the first one into the sequence_list:
print(sequence_list):
["HUYUII.1","JUUIIL.2","JUUIIL.3","YP_9000.1","HUU3222.1","ORAAA.8"]
So at the end I should get this sequence_list:
["HUYUII.1","JUUIIL.2","JUUIIL.3","YP_9000.1","HUU3222.1","ORAAA.8"]
Does anyone have an idea? I tried something, but it is quite difficult and may be totally messy:
sequence_list=[]
for value in dedup_records.items():
    if(len(value[1]))>1:
        try:
            length=len(value[1])
            liste=value[1]
            print("liste1",liste)
            r = re.compile("YP_*.|XP_*.")
            newlist = list(filter(r.match, liste))
            if len(newlist)!=0:
                print(newlist)
                for i in newlist:
                    if i in liste:
                        liste.remove(i)
                while len(newlist)>1:
                    liste.remove(newlist[0])
            else:
                while len(liste)>1:
                    liste.pop(0)
            print(liste)
        except :
            continue
        for i in liste:
            sequence_list.append(i)
You can make your code much cleaner by using a function, so that it is easier to read what is happening inside the loop.
Also, as a personal preference, I'd suggest using list_ as a variable name instead of liste, as misspellings can be tough to work with.
The approach is to first split every list into two groups: one with a prefix and one without. After that, we just need to check whether there is at least 1 item with a prefix. If so, append every prefixed item except the last one, and append all non-prefixed items. Otherwise, leave out 1 non-prefixed item and append all the others.
dedup_records = {'ACTGA': ['YP_3890.3', 'HUYUII.1'], 'ACTGT': ['XP_46744.1', 'JUUIIL.2','JUUIIL.3'], 'ACCTTGG': ['YP_8990.1'], 'ACCTTTT': ['YP_8992.1'], 'ATGAAA': ['YP_9000.1', 'YP_3222.1','HUU3222.1'], 'AAATTTGG': ['ORLOP.8', 'OTTTAX']}
prefix_list = ["XP_","YP_"]
def separate_items_with_prefix(list_, prefix_list):
    '''separates a list into two lists based on prefix
    returns two lists: one for items with prefix
    another for items without prefix
    '''
    with_prefix = []
    without_prefix = []
    for item in list_:
        if any(item.startswith(prefix) for prefix in prefix_list):
            with_prefix.append(item)
        else:
            without_prefix.append(item)
    return with_prefix, without_prefix
sequence_list = []
for val in dedup_records.values():
    if len(val) <= 1:
        continue #skip items with only upto 1 value in them
    with_prefix, without_prefix = separate_items_with_prefix(val, prefix_list)
    if with_prefix: #So there is at least 1 item in the list with prefix
        sequence_list.extend(with_prefix[:-1])
        sequence_list.extend(without_prefix)
    else: #there are no items with a prefix in the list
        sequence_list.extend(without_prefix[:-1])
Output:
print(sequence_list)
['HUYUII.1', 'JUUIIL.2', 'JUUIIL.3', 'YP_9000.1', 'HUU3222.1', 'ORLOP.8']
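The split-then-select approach above can be condensed a little, because str.startswith also accepts a tuple of prefixes. The sketch below follows the same selection rule. The function name select_sequences is hypothetical, and the sample data is a shortened copy of the dictionary used in this answer.

```python
def select_sequences(records, prefixes):
    """For each key with more than one value, collect all values except one held back."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    selected = []
    for values in records.values():
        if len(values) <= 1:
            continue  # nothing to select for single-value keys
        with_prefix = [v for v in values if v.startswith(prefixes)]
        without_prefix = [v for v in values if not v.startswith(prefixes)]
        if with_prefix:
            # Hold back the last prefixed value, take everything else.
            selected.extend(with_prefix[:-1])
            selected.extend(without_prefix)
        else:
            # No prefixed value: hold back the last unprefixed one.
            selected.extend(without_prefix[:-1])
    return selected


dedup_records = {
    "ACTGA": ["YP_3890.3", "HUYUII.1"],
    "ACTGT": ["XP_46744.1", "JUUIIL.2", "JUUIIL.3"],
    "ACCTTGG": ["YP_8990.1"],
    "ATGAAA": ["YP_9000.1", "YP_3222.1", "HUU3222.1"],
    "AAATTTGG": ["ORLOP.8", "OTTTAX"],
}
result = select_sequences(dedup_records, ["XP_", "YP_"])
print(result)
```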
If I get your code right, you want to achieve this:
prefix_list = ["XP_", "YP_"]
sequence_list = []
have_interesting_prefix = lambda v: any(
    v.startswith(prefix) for prefix in prefix_list
)
for values in dedup_records.values():
    if len(values) > 1:
        sequence_list.extend(v for v in values if not have_interesting_prefix(v))
        # filter() returns an iterator in Python 3, so materialise it as a list
        prefixed = list(filter(have_interesting_prefix, values))
        if len(prefixed) > 1:
            sequence_list.append(prefixed[0])

Removing specific dictionary values inside a loop

I'm trying to write software that simplifies context-free grammars.
I'm stuck when it comes to deleting specific items from the dictionary's values, or even a whole key-value pair.
The problem is that it doesn't follow a pattern.
If the element belongs to V1, I need to keep it in the dictionary.
(V1 is the list of all values that derive a terminal; those are the only ones I need to keep in my dictionary, but it's not that simple.)
If the element doesn't belong to V1 and the dictionary's value is a string, I need to remove the element.
If the element doesn't belong to V1 and the dictionary's value is a list, I need to check whether it's the single element of that list; if so, delete the value.
The failing loop is below.
I printed messages at the points where I can't figure out the logic for modifying the dictionary.
counter = 0
for k,v in derivations.items():
    derivationsCount = len(v)
    while counter < derivationsCount:
        if lista_ou_string(v[counter]): # returns True for lists, False for else
            sizeOfList = len(v[counter])
            counter2 = 0
            while counter2 <= (sizeOfList - 1):
                if v[counter][counter2] not in V1:
                    if derivationsCount == 1:
                        print("# NEED TO DELETE BOTH KEY AND VALUE FROM derivatios.items()")
                    else:
                        print("# NEED TO DELETE ONLY THE VALUE FROM derivations.items()")
                counter2 += 1
        else: # strings \/
            if v[counter] not in V1:
                if derivationsCount == 1:
                    print("# NEED TO DELETE BOTH KEY AND VALUE FROM derivatios.items()")
                else:
                    print("# NEED TO DELETE ONLY THE VALUE FROM derivations.items()")
            else:
                print("# DO NOT DELETE ANYTHING! ALL LISTS ELEMENTS BELONGS TO 'V1'")
        counter += 1
One does not want to modify a dictionary (or list) while looping over it. Therefore I create a copy of derivations, called new_derivations, and modify that copy:
import copy
new_derivations = copy.deepcopy(derivations)
for k, v in derivations.items():
    for vi in v:
        is_list = lista_ou_string(vi)
        # check lists element-wise and strings directly, so a valid list is not removed
        if (is_list and not set(vi).issubset(V1)) or (not is_list and vi not in V1):
            if len(v) == 1:
                # NEED TO DELETE BOTH KEY AND VALUE FROM derivations.items()
                del new_derivations[k]
                break
            else:
                # NEED TO DELETE ONLY THE VALUE FROM derivations.items()
                idx = new_derivations[k].index(vi)
                del new_derivations[k][idx]
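The reason for working on a copy can be seen in a small experiment. This is only an illustrative sketch with made-up data, assuming Python 3: deleting keys while iterating over the dictionary itself raises a RuntimeError, whereas iterating over a snapshot of the keys does not.

```python
d = {"a": 1, "b": 2, "c": 3}

# Deleting while iterating over the dictionary itself fails.
try:
    for key in d:
        if d[key] < 3:
            del d[key]
except RuntimeError as exc:
    print("error:", exc)

# Iterating over a snapshot of the keys is safe.
d = {"a": 1, "b": 2, "c": 3}
for key in list(d):
    if d[key] < 3:
        del d[key]
print(d)
```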
I would actually implement the above code differently: instead of thinking in terms of removing items from derivations, think instead of when an element should be added to the list. Then the code becomes much simpler:
new_derivations = {}
for k, v in derivations.items():
    nv = [vi for vi in v if ((isinstance(vi, list) and set(vi).issubset(V1))
                             or vi in V1)]
    if nv:
        new_derivations[k] = nv
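To make the "build a new dictionary" idea concrete, here is a small runnable sketch. The function name prune_derivations and the grammar fragment are hypothetical, and it assumes that every symbol is a hashable string and that each alternative is either a list of symbols or a single string.

```python
def prune_derivations(derivations, V1):
    """Keep only the alternatives whose symbols all belong to V1."""
    allowed = set(V1)
    pruned = {}
    for key, alternatives in derivations.items():
        kept = []
        for alt in alternatives:
            # An alternative is either a list of symbols or a single string.
            symbols = alt if isinstance(alt, list) else [alt]
            if all(symbol in allowed for symbol in symbols):
                kept.append(alt)
        if kept:  # drop the key entirely when nothing survives
            pruned[key] = kept
    return pruned


# Hypothetical grammar fragment
V1 = ["A", "B", "a"]
derivations = {
    "S": [["A", "B"], ["A", "C"], "a"],
    "C": ["C"],
}
pruned = prune_derivations(derivations, V1)
print(pruned)
```

Because nothing is deleted from derivations, the original dictionary stays intact and no copy is needed.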
If you want to delete a key-value pair from a dictionary, use del:
>>> my_dictionary = {'foo':'bar', 'boo':'baz'}
>>> del my_dictionary['foo']
>>> my_dictionary
{'boo': 'baz'}
If you want to delete the value but keep the key, you can assign None to the key:
>>> my_dictionary = {'foo':'bar', 'boo':'baz'}
>>> my_dictionary['foo'] = None
>>> my_dictionary
{'foo': None, 'boo': 'baz'}

Python: update dictionary key with tuple values

I have a dictionary that has keys with two values each. I need to update the second value as I pass duplicate keys.
Clearly what I'm trying isn't working out.
if value1 not in dict.keys():
    dict.update({key:(value1,value2)})
else:
    dict.update({key:value1,+1)})
This just returned a dictionary with 1s for value2 instead of incrementing by 1.
The expression +1 doesn't increment anything; it's just the number 1.
Also, avoid using dict as a name, because it shadows the Python built-in.
Try structuring your code more like this:
my_dict = {} # some dict
my_key = 'some_key' # placeholder: something
if my_key not in my_dict:
    new_value = ('some_value', 1) # placeholder: some new value here
    my_dict[my_key] = new_value
else:
    # First calculate what should be the new value.
    # Without tuples this would be a trivial new_value = old_value + 1:
    #     new_value = my_dict[my_key] + 1
    # For tuples you can e.g. increment the second element only
    # Of course assuming you have at least 2 elements,
    # or old_value[0] and old_value[1] will fail
    old_value = my_dict[my_key] # this is a tuple
    new_value = old_value[0], old_value[1] + 1
    my_dict[my_key] = new_value
There may be shorter or smarter ways to do it, e.g. using the += operator, but this snippet is written for clarity.
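Put together as a loop over incoming pairs, the pattern might look like the sketch below. The helper name add_or_increment and the sample data are hypothetical, and it assumes that the second tuple element is a counter starting at 1.

```python
def add_or_increment(counts, key, value):
    """Store (value, 1) for a new key, otherwise bump the second element."""
    if key not in counts:
        counts[key] = (value, 1)
    else:
        first, n = counts[key]
        # Tuples are immutable, so build a new one and reassign it.
        counts[key] = (first, n + 1)


counts = {}
for key, value in [("x", "apple"), ("y", "pear"), ("x", "apple")]:
    add_or_increment(counts, key, value)
print(counts)
```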

python list, working with multiple elements

a is a list filled dynamically with values being received in no specific order. So, if the next value received was (ID2,13), how could I remove the (ID2,10) based on the fact that ID2 was the next value received? Because I don't know the order in which the list is being populated, I won't know the index.
Also, how would I know the count of a specific ID?
I have tried a.count(ID1) but because of the second element, it fails to find any.
a = [(ID1,10),(ID2,10),(ID1,12),(ID2,15)]
My current usage:
while True:
    'Receive ID information in format (IDx,Value)'
    ID_info = (ID2,13) #For example
    if a.count(ID2) == 2: #I need to modify this line as it always returns 0
        del a[0] #This is what I need to modify to delete the correct information, as it is not always index 0
        a.append(ID_info)
    else:
        a.append(ID_info)
Assuming that the IDs are hashable, it sounds like you want to be using a dictionary.
a = {ID1: 10, ID2: 10}
id, val = (ID2, 13)
a[id] = val
With the "keep two" addition, I still think it's easier with a dictionary, though with some modifications.
EDIT: Simpler version using collections.defaultdict.
from collections import defaultdict
a = defaultdict(list)
a[ID1].append(10)
a[ID2].append(10)
id, val = (ID2, 13)
a[id].append(val)
if len(a[id]) > 2:
    a[id].pop(0)
def count(a, id):
    return len(a[id])
The same logic with a plain dict:
a = {ID1: [10], ID2: [10]}
id, val = (ID2, 13)
if id not in a.keys():
    a[id] = []
a[id].append(val)
if len(a[id]) > 2:
    a[id].pop(0)
def count(a, id):
    if id not in a.keys():
        return 0
    else:
        return len(a[id])
You could (and probably should) also encapsulate this behavior into a simple class inherited from dict.
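One possible shape for such a class is sketched below. The class name LastTwo, its limit parameter and the add and count methods are hypothetical, and string IDs stand in for the ID1 and ID2 objects of the question.

```python
class LastTwo(dict):
    """Maps each ID to a list holding at most its `limit` most recent values."""

    def __init__(self, limit=2):
        super().__init__()
        self.limit = limit

    def add(self, key, value):
        # Create the list on first use, then append the new value.
        values = self.setdefault(key, [])
        values.append(value)
        if len(values) > self.limit:
            values.pop(0)  # discard the oldest value

    def count(self, key):
        # Unknown IDs simply count as zero.
        return len(self.get(key, []))


a = LastTwo()
for key, value in [("ID1", 10), ("ID2", 10), ("ID1", 12), ("ID2", 15), ("ID2", 13)]:
    a.add(key, value)
print(a)
print(a.count("ID2"), a.count("ID3"))
```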
