Operation similar to group by for lists

I have lists of ids and scores:
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
I want to remove the duplicates from ids and sum the corresponding scores. This is very similar to what groupby.sum() does with DataFrames.
So, as output I expect:
ids=[1,2,3]
scores=[60,20,40]
I use the following code, but it doesn't work in all cases:
for indi, i in enumerate(ids):
    for indj, j in enumerate(ids):
        if (i == j) and (indi != indj):
            del ids[i]
            scores[indj] = scores[indi] + scores[indj]
            del scores[indi]

You can build a dictionary from ids and scores in which each key is an id and each value is the list of scores belonging to that id. You can then sum the values to get the new ids and scores lists.
from collections import defaultdict
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
dct = defaultdict(list)
#Create the dictionary of element of ids vs list of elements of scores
for id, score in zip(ids, scores):
    dct[id].append(score)
print(dct)
#defaultdict(<class 'list'>, {1: [10, 10, 30, 10], 2: [20], 3: [40]})
#Calculate the sum of values, and get the new ids and scores list
new_ids, new_scores = zip(*((key, sum(value)) for key, value in dct.items()))
print(list(new_ids))
print(list(new_scores))
The output will be
[1, 2, 3]
[60, 20, 40]
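If only the totals are needed, the intermediate lists can be skipped. The following is a minimal sketch of that variation, not part of the original answer: `group_sum` is a hypothetical helper name, and it accumulates straight into a `defaultdict(int)`.

```python
from collections import defaultdict


def group_sum(ids, scores):
    """Sum scores per id in one pass (hypothetical helper, for illustration)."""
    totals = defaultdict(int)  # missing keys start at 0
    for id_, score in zip(ids, scores):
        totals[id_] += score
    # Dicts keep insertion order (Python 3.7+), so ids come out in first-seen order
    return list(totals.keys()), list(totals.values())


new_ids, new_scores = group_sum([1, 2, 1, 1, 3, 1], [10, 20, 10, 30, 40, 10])
print(new_ids)     # [1, 2, 3]
print(new_scores)  # [60, 20, 40]
```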

As suggested in comments, using a dictionary is one way. You can iterate one time over the list and update the sum per id.
If you want two lists at the end, select the keys and values with keys() and values() methods from the dictionary:
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
# Initialize the dict with every id mapped to 0
dict_ = {i: 0 for i in set(ids)}
for id, score in zip(ids, scores):
    dict_[id] += score
print(dict_)
# {1: 60, 2: 20, 3: 40}
new_ids = list(dict_.keys())
sum_score = list(dict_.values())
print(new_ids)
# [1, 2, 3]
print(sum_score)
# [60, 20, 40]

Simply loop through them and add if the ids match.
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
ans={}
for i,s in zip(ids,scores):
    if i in ans:
        ans[i]+=s
    else:
        ans[i]=s
ids, scores=list(ans.keys()), list(ans.values())
Output:
[1, 2, 3]
[60, 20, 40]

# Find all unique ids and keep track of their scores
id_to_score = {id : 0 for id in set(ids)}
# Sum up the scores for that id
for index, id in enumerate(ids):
    id_to_score[id] += scores[index]
unique_ids = []
score_sum = []
for (i, s) in id_to_score.items():
    unique_ids.append(i)
    score_sum.append(s)
print(unique_ids) # [1, 2, 3]
print(score_sum) # [60, 20, 40]

This may help you.
# Solution 1
import pandas as pd
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
df = pd.DataFrame(list(zip(ids, scores)),
                  columns=['ids', 'scores'])
print(df.groupby('ids').sum())
#### Output ####
#      scores
# ids
# 1        60
# 2        20
# 3        40
# Solution 2
from itertools import groupby
zipped_list = list(zip(ids, scores))
print([[k, sum(v for _, v in g)] for k, g in groupby(sorted(zipped_list), key = lambda x: x[0])])
#### Output ####
[[1, 60], [2, 20], [3, 40]]
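The `sorted` call in Solution 2 matters, because `itertools.groupby` only merges consecutive items that share a key. A small sketch of the difference on the same data (`sum_runs` is a hypothetical helper name used only for this illustration):

```python
from itertools import groupby

ids = [1, 2, 1, 1, 3, 1]
scores = [10, 20, 10, 30, 40, 10]
pairs = list(zip(ids, scores))


def sum_runs(items):
    """Sum the scores of each run of consecutive pairs with the same id."""
    return [[k, sum(v for _, v in g)] for k, g in groupby(items, key=lambda x: x[0])]


unsorted_result = sum_runs(pairs)        # id 1 shows up in three separate runs
sorted_result = sum_runs(sorted(pairs))  # one run per id
print(unsorted_result)  # [[1, 10], [2, 20], [1, 40], [3, 40], [1, 10]]
print(sorted_result)    # [[1, 60], [2, 20], [3, 40]]
```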

Using only built-in Python tools, I would do it the following way:
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
uids = list(set(ids)) # unique ids
for uid in uids:
    print(uid,sum(s for inx,s in enumerate(scores) if ids[inx]==uid))
Output:
1 60
2 20
3 40
The code above only prints the result, but it can easily be changed to build a dict:
output_dict = {uid:sum(s for inx,s in enumerate(scores) if ids[inx]==uid) for uid in uids} # {1: 60, 2: 20, 3: 40}
or another data structure. Keep in mind that this method requires a separate pass over the data for every unique id, so it may be slower than the other approaches. Whether that is a problem depends on how big your data is.
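One caveat with `list(set(ids))` is that a set does not promise any particular order. A sketch of an order-preserving variant, assuming first-seen order is what is wanted, swaps the set for `dict.fromkeys`. The sample data here is made up for the illustration, and the method still makes one pass per unique id.

```python
ids = [3, 1, 3, 2, 1]
scores = [5, 10, 5, 20, 30]

# dict.fromkeys drops duplicates and keeps first-seen order (Python 3.7+)
uids = list(dict.fromkeys(ids))

# Per-id summation, written with zip instead of enumerate
output_dict = {uid: sum(s for i, s in zip(ids, scores) if i == uid) for uid in uids}
print(uids)         # [3, 1, 2]
print(output_dict)  # {3: 10, 1: 40, 2: 20}
```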

Related

count how often a key appears in a dataset

I have a pandas DataFrame with three columns; the third is the second with some string slicing applied.
For every warranty_claim_number there is a key_part_number (first column).
The DataFrame has a lot of rows.
I have a second list, which contains 70 randomly selected warranty_claim_numbers.
I would like to find the key_part_number corresponding to each of those 70 claims in my dataset.
Then I would like to create a dictionary with the key_part_number as key and the corresponding warranty_claim_number as value.
Finally, I want to count how often each key_part_number appears in this dataset and replace the key with that count.
It should look like this:
dicti = {4:'000120648353',10:'000119582589',....}

First of all, you need to change the datatype of warranty_claim_number to string, or you won't get the leading zeros.
You can subset your df using that list of claim numbers:
df = df[df["warranty_claim_number"].isin(claimnumberlist)]
This gives you a DataFrame containing only the rows with those claim numbers.
countofkeyparts = df["key_part_number"].value_counts()
This gives you a pandas Series of counts, which you can convert to a dict with to_dict():
countofkeyparts = countofkeyparts.to_dict()
The keys in a dict have to be unique, so if you want the count as the key, the value can be a list of key_part_numbers:
values = {}
for key, value in countofkeyparts.items():
    values[value] = values.get(value, [])
    values[value].append(key)
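On the first point of this answer, the leading zeros disappear as soon as a claim number is parsed as an integer. A plain-Python sketch of the effect, assuming (from the example in the question) that claim numbers are 12 characters wide:

```python
claim = "000120648353"

as_int = int(claim)               # the leading zeros are gone
restored = str(as_int).zfill(12)  # pad back to the assumed width of 12

print(as_int)    # 120648353
print(restored)  # 000120648353
```

If the data is read from a CSV file, passing `dtype=str` for that column to `pandas.read_csv` should keep the text intact from the start, so no padding is needed afterwards.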

As your example shows, you can't use the number of occurrences directly as the dictionary key: keys must be unique, and several part numbers may occur the same number of times. I therefore recommend storing the result in this format: dicti = {4:['000120648353', '09824091'],10:['000119582589'] ,....}
I'll use randomly generated data as an example:
from collections import Counter
import random
lst = [random.randint(1, 10) for i in range(20)]
counter = Counter(lst)
print(counter) # First element, then number of occurrences
nums = set(counter.values()) # All occurrences
res = {item: [val for val in counter if counter[val] == item] for item in nums}
print(res)
# Counter({5: 6, 8: 4, 3: 2, 4: 2, 9: 2, 2: 2, 6: 1, 10: 1})
# {1: [6, 10], 2: [3, 4, 9, 2], 4: [8], 6: [5]}

This does what you want:
# Select the key_part_number of every row whose claim number is in lst:
df_wanted = df.loc[df["warranty_claim_numbers"].isin(lst), "key_part_number"]
# Count how often each value occurs in that column:
count_values = df_wanted.value_counts()
# Convert to a dictionary:
print(count_values.to_dict())

Make a new list depending on group number and add scores up as well

If I have a list within another list that looks like this...
[['Harry',9,1],['Harry',17,1],['Jake',4,1], ['Dave',9,2],['Sam',17,2],['Sam',4,2]]
How can I add the middle elements together, so that 'Harry', for example, shows up as ['Harry', 26], and also have Python look at the group number (third element) and output only the winner (the one with the highest score, which is the middle element)? For each group there needs to be exactly one winner, so the final output is:
[['Harry', 26],['Sam',21]]
THIS QUESTION IS NOT A DUPLICATE: it also has a third element, which is what I am stuck on.
The similar question gave me this answer:
grouped_scores = {}
for name, score, group_number in players_info:
    if name not in grouped_scores:
        grouped_scores[name] = score
        grouped_scores[group_number] = group_number
    else:
        grouped_scores[name] += score
But that only adds the scores up, it doesn't take out the winner from each group. Please help.
I had thought doing something like this, but I'm not sure exactly what to do...
grouped_scores = {}
for name, score, group_number in players_info:
    if name not in grouped_scores:
        grouped_scores[name] = score
    else:
        grouped_scores[name] += score
for group in group_number:
    if grouped_scores[group_number] = group_number:
        [don't know what to do here]

Solution:
Use itertools.groupby, and collections.defaultdict:
l=[['Harry',9,1],['Harry',17,1],['Jake',4,1], ['Dave',9,2],['Sam',17,2],['Sam',4,2]]
from itertools import groupby
from collections import defaultdict
l2=[list(y) for x,y in groupby(l,key=lambda x: x[-1])]
l3=[]
for group in l2:
    d=defaultdict(int)
    # add up the scores of each player in this group
    for name,score,_ in group:
        d[name]+=score
    l3.append(max(list(map(list,dict(d).items())),key=lambda x: x[-1]))
Now:
print(l3)
Is:
[['Harry', 26], ['Sam', 21]]
Explanation:
The two import lines load the modules. The next line uses groupby to split the list into two groups based on the last element of each sub-list; this works because the sub-lists are already ordered by group. An empty list is then created, and the loop iterates over the groups. For each group, a defaultdict is created and the inner loop adds each player's score to it. The last line converts the dictionary items into lists and appends the one with the highest total.

I would aggregate the data first with a defaultdict.
>>> from collections import defaultdict
>>>
>>> combined = defaultdict(lambda: defaultdict(int))
>>> data = [['Harry',9,1],['Harry',17,1],['Jake',4,1], ['Dave',9,2],['Sam',17,2],['Sam',4,2]]
>>>
>>> for name, score, group in data:
...     combined[group][name] += score
...
>>> combined
defaultdict(<function __main__.<lambda>()>,
            {1: defaultdict(int, {'Harry': 26, 'Jake': 4}),
             2: defaultdict(int, {'Dave': 9, 'Sam': 21})})
Then apply max to each value in that dict.
>>> from operator import itemgetter
>>> [list(max(v.items(), key=itemgetter(1))) for v in combined.values()]
[['Harry', 26], ['Sam', 21]]
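The two steps can also be wrapped in one function. This is only a sketch: `winners_per_group` is a hypothetical name, groups are visited in sorted order (an added assumption), and ties go to whichever name was seen first in the group, which is simply how `max` behaves.

```python
from collections import defaultdict
from operator import itemgetter


def winners_per_group(rows):
    """Return [name, total] for the top scorer of each group (hypothetical helper)."""
    combined = defaultdict(lambda: defaultdict(int))
    for name, score, group in rows:
        combined[group][name] += score  # total per player within each group
    # Visit groups in sorted order so the result does not depend on input order
    return [list(max(combined[g].items(), key=itemgetter(1))) for g in sorted(combined)]


data = [['Harry', 9, 1], ['Harry', 17, 1], ['Jake', 4, 1],
        ['Dave', 9, 2], ['Sam', 17, 2], ['Sam', 4, 2]]
result = winners_per_group(data)
print(result)  # [['Harry', 26], ['Sam', 21]]
```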

Use itertools.groupby, take the middle value from each grouped element, and append the result to a list based on the maximum condition:
import itertools
l=[['Harry',9,1],['Harry',17,1],['Jake',4,1], ['Dave',9,2],['Sam',17,2],['Sam',4,2]]
maxlist=[]
maxmiddleindexvalue=0
for key,value in itertools.groupby(l,key=lambda x:x[0]):
    s=0
    m=0
    for element in value:
        s+=element[1]
        m=max(m,element[1])
    if(m==maxmiddleindexvalue):
        maxlist.append([(key,s)])
    if(m>maxmiddleindexvalue):
        maxlist=[(key,s)]
        maxmiddleindexvalue=m
print(maxlist)
OUTPUT
[('Harry', 26), [('Sam', 21)]]

Creating nested dictionaries in Python using Openpyxl

I am trying to build a dictionary in Python by looping through an Excel file with Openpyxl, where each key is the name of a person and each value is a list of dictionaries in which the key is the Location and the value is an array of [Start, End] pairs.
Here is the Excel file:
And here is what I want:
people = {
    'John':[{20:[[2,4],[3,5]]}, {21:[[2,4]]}],
    'Jane':[{20:[[9,10]]},{21:[[2,4]]}]
}
Here is my current script:
my_file = openpyxl.load_workbook('Book2.xlsx', read_only=True)
ws = my_file.active
people = {}
for row in ws.iter_rows(row_offset=1):
    a = row[0] # Name
    b = row[1] # Date
    c = row[2] # Start
    d = row[3] # End
    if a.value: # Only operate on rows that contain data
        if a.value in people.keys(): # If name already in dict
            for k, v in people.items():
                for item in v:
                    #print(item)
                    for x in item:
                        if x == int(b.value):
                            print(people[k])
                            people[k][0][x].append([c.value,d.value])
                        else:
                            pass
                            #people[k].append([c.value,d.value]) # Creates inf loop
        else:
            people[a.value] = [{b.value:[[c.value,d.value]]}]
Which successfully creates this:
{'John': [{20: [[2, 4], [9, 10]]}], 'Jane': [{20: [[9, 10]]}]}
But when I uncomment the line after the else: block to try to add a new Location dictionary to the initial list, it creates an infinite loop.
if x == int(b.value):
    people[k][0][x].append([c.value,d.value])
else:
    #people[k].append([c.value,d.value]) # Creates inf loop
I am sure there's a more Pythonic way of doing this, but I'm pretty stuck here and looking for a nudge in the right direction. The goal is to analyze all of the dict items for overlapping Start/End times per person and per location. For example, John's 3.00 - 5.00 at location 20 overlaps with his 2.00 - 4.00 at the same location.

It seems you're overthinking this; a combination of default dictionaries should do the trick.
from collections import defaultdict
person = defaultdict(dict)
for row in ws.iter_rows(min_row=2, max_col=4):
    p, l, s, e = (c.value for c in row)
    if p not in person:
        person[p] = defaultdict(list)
    person[p][l].append((s, e))
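Since the stated goal is to find overlapping Start/End pairs per person and location, here is one possible sketch of that check on a structure of this shape. The helper name `find_overlaps` and the rule that intervals which merely touch at an endpoint do not overlap are assumptions, not something from the question, and the data is typed in by hand rather than read from the workbook.

```python
from itertools import combinations


def find_overlaps(intervals):
    """Return pairs of (start, end) tuples that overlap (hypothetical helper)."""
    # Two intervals overlap when each one starts before the other ends
    return [(a, b) for a, b in combinations(sorted(intervals), 2)
            if a[0] < b[1] and b[0] < a[1]]


# Same shape as the `person` mapping: name -> location -> list of (start, end)
person = {
    'John': {20: [(2, 4), (3, 5)], 21: [(2, 4)]},
    'Jane': {20: [(9, 10)], 21: [(2, 4)]},
}

# Keep only the (name, location) combinations that have at least one overlap
overlaps = {(name, loc): find_overlaps(times)
            for name, locs in person.items()
            for loc, times in locs.items()
            if find_overlaps(times)}
print(overlaps)  # {('John', 20): [((2, 4), (3, 5))]}
```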

You can use the Pandas library for this. The core of this solution is a nested dictionary comprehension, each using groupby. You can, as below, use a function to take care of the nesting to aid readability / maintenance.
import pandas as pd
# define dataframe, or df = pd.read_excel('file.xlsx')
df = pd.DataFrame({'Name': ['John']*3 + ['Jane']*2,
                   'Location': [20, 20, 21, 20, 21],
                   'Start': [2.00, 3.00, 2.00, 9.00, 2.00],
                   'End': [4.00, 5.00, 4.00, 10.00, 4.00]})
# convert cols to integers
int_cols = ['Start', 'End']
df[int_cols] = df[int_cols].apply(pd.to_numeric, downcast='integer')
# define inner dictionary grouper and split into list of dictionaries
def loc_list(x):
    d = {loc: w[int_cols].values.tolist() for loc, w in x.groupby('Location')}
    return [{i: j} for i, j in d.items()]
# define outer dictionary grouper
people = {k: loc_list(v) for k, v in df.groupby('Name')}
# {'Jane': [{20: [[9, 10]]}, {21: [[2, 4]]}],
#  'John': [{20: [[2, 4], [3, 5]]}, {21: [[2, 4]]}]}

How to join values between sublists

I have a list with sublists, for example:
LA=[[1,2],[2,7],[4,5],[1,9],[6,5],[4,3],[2,1],[2,2]]
If the first element in each sublist is an ID, how do I join the ones with the same ID to get something like this:
LR=[[1,11],[2,10],[4,8],[6,5]]
I've tried using a for loop, but it's too long and not efficient.

You can use itertools.groupby:
import itertools
LA=[[1,2],[2,7],[4,5],[1,9],[6,5],[4,3],[2,1],[2,2]]
new_d = [[a, sum(i[-1] for i in list(b))] for a, b in itertools.groupby(sorted(LA), key=lambda x:x[0])]
Output:
[[1, 11], [2, 10], [4, 8], [6, 5]]

LA=[[1,2],[2,7],[4,5],[1,9],[6,5],[4,3],[2,1],[2,2]]
new_dict = {}
for (key, value) in LA:
    if key in new_dict:
        new_dict[key].append(value)
    else:
        new_dict[key] = [value]
for key, value in new_dict.items():
    new_dict[key] = (sum(value))
dictlist = []
for key, value in new_dict.items():
    temp = [key,value]
    dictlist.append(temp)
print(dictlist)
This will do the job too.

You can do it just using list comprehensions:
LR = [[i,sum([L[1] for L in LA if L[0]==i])] for i in set([L[0] for L in LA])]
Gives the desired result.
To break this down a bit: set([L[0] for L in LA]) gives a set (with no repeats) of all of the IDs; we then simply iterate over that set and sum the values that have that ID.
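The same breakdown written out step by step, as a sketch. The `sorted` call is an addition here, since a set does not guarantee the order in which the IDs come out.

```python
LA = [[1, 2], [2, 7], [4, 5], [1, 9], [6, 5], [4, 3], [2, 1], [2, 2]]

# Step 1: collect every distinct ID (first element of each sublist)
unique_ids = {L[0] for L in LA}

# Step 2: for each ID, sum the second elements of the sublists carrying that ID
LR = sorted([i, sum(L[1] for L in LA if L[0] == i)] for i in unique_ids)
print(LR)  # [[1, 11], [2, 10], [4, 8], [6, 5]]
```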

Grouping with collections.defaultdict() is always straightforward:
from collections import defaultdict
LA = [[1,2],[2,7],[4,5],[1,9],[6,5],[4,3],[2,1],[2,2]]
# create defaultdict of list values
d = defaultdict(list)
# loop over each list
for sublist in LA:
    # group by first element of each list
    key = sublist[0]
    # add to dictionary, each key will have a list of sublists
    d[key].append(sublist)
# sum the second element of every sublist in each group
result = [[key, sum(x[1] for x in value)] for (key, value) in sorted(d.items())]
print(result)
Output:
[[1, 11], [2, 10], [4, 8], [6, 5]]

python running count of values in dict

I have a dictionary like this
d = {1:'Bob', 2:'Joe', 3:'Bob', 4:'Bill', 5:'Bill'}
I want to keep a count of how many times each name occurs as a dictionary value. So, the output should be like this:
d = {1:['Bob', 1], 2:['Joe',1], 3:['Bob', 2], 4:['Bill',1] , 5:['Bill',2]}

One way of counting the values like you want, is shown below:
from collections import Counter
d = {1:'Bob',2:'Joe',3:'Bob', 4:'Bill', 5:'Bill'}
c = Counter()
new_d = {}
for k in sorted(d.keys()):
    name = d[k]
    c[name] += 1
    new_d[k] = [name, c[name]]
print(new_d)
# {1: ['Bob', 1], 2: ['Joe', 1], 3: ['Bob', 2], 4: ['Bill', 1], 5: ['Bill', 2]}
Here I use Counter to keep track of occurrences of names in the input dictionary. Hope this helps. Maybe not most elegant code, but it works.

To impose a well-defined order (before Python 3.7 a dict did not guarantee one), let's say you go through the keys in sorted order. Then you could do the following, assuming the values are hashable, as in your example:
import collections
def enriched_by_count(somedict):
    countsofar = collections.defaultdict(int)
    result = {}
    for k in sorted(somedict):
        v = somedict[k]
        countsofar[v] += 1
        result[k] = [v, countsofar[v]]
    return result
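Since Python 3.7 a plain dict preserves insertion order, so when the desired order is simply the order in which the entries were added, the sort can be dropped. A sketch under that assumption (`running_count` is a hypothetical name, and the sample dict is deliberately not in key order):

```python
from collections import defaultdict


def running_count(somedict):
    """Pair each value with how often it has occurred so far, in insertion order."""
    seen = defaultdict(int)
    result = {}
    for k, v in somedict.items():  # insertion order on Python 3.7+
        seen[v] += 1
        result[k] = [v, seen[v]]
    return result


d = {5: 'Bill', 1: 'Bob', 3: 'Bob', 2: 'Joe', 4: 'Bill'}
counted = running_count(d)
print(counted)
# {5: ['Bill', 1], 1: ['Bob', 1], 3: ['Bob', 2], 2: ['Joe', 1], 4: ['Bill', 2]}
```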

Without using any modules, this is the code I came up with. Maybe not as short, but I am scared of modules.
def new_dict(d):
    check = [] #List for checking against
    new_dict = {} #The new dictionary to be returned
    for i in sorted(d.keys()): #Loop through all the dictionary items
        val = d[i] #Store the dictionary item value in a variable just for clarity
        check.append(val) #Add the current item to the list
        new_dict[i] = [d[i], check.count(val)] #See how many of the items there are in the list
    return new_dict
Use like so:
d = {1:'Bob', 2:'Joe', 3:'Bob', 4:'Bill', 5:'Bill'}
d = new_dict(d)
print(d)
Output:
{1: ['Bob', 1], 2: ['Joe', 1], 3: ['Bob', 2], 4: ['Bill', 1], 5: ['Bill', 2]}
