count how often a key appears in a dataset

I have a pandas DataFrame with three columns; the third is the second one with some string slicing applied.
Every warranty_claim_number has a key_part_number (first column).
The DataFrame has a lot of rows.
I also have a list of 70 randomly selected warranty_claim_numbers.
I want to find the corresponding key_part_number for each of those 70 claims in my dataset.
Then I would like to create a dictionary with the key_part_number as key and the corresponding warranty_claim_number as value.
Finally, I want to count how often each key_part_number appears in this dataset and use that count as the key.
The result should look like this:
dicti = {4:'000120648353',10:'000119582589',....}

First of all, you need to change the datatype of warranty_claim_number to string, or you won't get the leading 0's.
You can subset your df from that list of claim numbers:
df = df[df["warranty_claim_number"].isin(claimnumberlist)]
This gives you a DataFrame with only the rows with those claim numbers.
countofkeyparts = df["key_part_number"].value_counts()
This gives you a pandas Series with the counts, which you can cast to a dict with to_dict():
countofkeyparts = countofkeyparts.to_dict()
The keys in a dict have to be unique, so if you want the count as a key, the value can be a list of key_part_numbers:
values = {}
for key, value in countofkeyparts.items():
    values[value] = values.get(value, [])
    values[value].append(key)

As your example stands, you can't use the number of occurrences as the dictionary key: keys are unique, and several values can occur with the same frequency. I recommend a result in this format instead: dicti = {4:['000120648353', '09824091'],10:['000119582589'] ,....}
I'll use randomly generated data as an example:
from collections import Counter
import random
lst = [random.randint(1, 10) for i in range(20)]
counter = Counter(lst)
print(counter) # First element, then number of occurrences
nums = set(counter.values()) # All occurrences
res = {item: [val for val in counter if counter[val] == item] for item in nums}
print(res)
# Counter({5: 6, 8: 4, 3: 2, 4: 2, 9: 2, 2: 2, 6: 1, 10: 1})
# {1: [6, 10], 2: [3, 4, 9, 2], 4: [8], 6: [5]}
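Applied to the claim data from the question, the same grouping might look like the sketch below. It works on plain (key_part_number, warranty_claim_number) tuples instead of a DataFrame; the sample rows and the names rows, selected_claims, part_counts and parts_by_count are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (key_part_number, warranty_claim_number) pairs.
rows = [
    ("A1", "000120648353"),
    ("A1", "000119582589"),
    ("B2", "000119582590"),
    ("C3", "000119582591"),
    ("A1", "000119999999"),  # this claim is not in the selected list
]
selected_claims = {"000120648353", "000119582589", "000119582590", "000119582591"}

# Count each key_part_number among the selected claims only.
part_counts = Counter(part for part, claim in rows if claim in selected_claims)

# Invert the counts: number of occurrences -> list of key_part_numbers.
parts_by_count = defaultdict(list)
for part, count in part_counts.items():
    parts_by_count[count].append(part)

print(dict(part_counts))     # {'A1': 2, 'B2': 1, 'C3': 1}
print(dict(parts_by_count))  # {2: ['A1'], 1: ['B2', 'C3']}
```

Keeping the claim numbers as strings, as in the sample rows, is what preserves the leading zeros.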

This does what you want:
# Select rows where warranty_claim_numbers item is in lst:
df_wanted = df.loc[df["warranty_claim_numbers"].isin(lst), "warranty_claim_numbers"]
# Count the values in that column:
count_values = df_wanted.value_counts()
# Transform to Dictionary:
print(count_values.to_dict())

Related

How do I convert horizontal format to vertical format for ECLAT algorithm in a better way in python?

I am using ECLAT algorithm for association rule mining on transaction data.
The data I have is in the format:
Order 1: item 1, item 2, item 3
Order 2: item 1, item 2
Order 3: item 1, item 3
.
.
.
Order n: item 1, item 2, ... item N
[Image 1: the actual pandas DataFrame]
I want to convert it into this format for ECLAT:
item 1: order 1, order 2, order 3
item 2: order 1, order 3,...order N
.
.
.
item N: order 2, order N
This is how I am doing it (but it's extremely slow, as I am using 2 loops over the dataframe):
unique_item_codes = df['item_code'].unique()  # gets unique SKUs from main dataframe df
sku_list = list(unique_item_codes)
vert_dict = {}
for sku in sku_list:  # going over each unique SKU
    sku_order = []  # empty list for adding all order_no's where sku in the above loop appears
    # order_code is another df from main df consisting only item_code and order_no
    for i in range(order_code.shape[0]):
        order_no = order_code.iloc[i][0]
        if order_code.iloc[i][1] == sku:
            sku_order.append(order_no)  # adding the order_no in the list sku_order
    vert_dict[sku] = sku_order  # adding all the order_no's for a particular SKU in vert_dict
vert_dict
[Image: the order_code pandas DataFrame]
Final format of vert_dict is:
{item 1: [order 1, order 2, order 3...order N], item 2: [order 1, order 2, order 3...order N]... item N}
How do I do it faster, or in a better way than this?
Any help is appreciated! Thanks!

What you need is called an inverted index.
Since you did not provide example data for df, I will show you the idea with both dict input and dict output:
from collections import defaultdict
inverted_index = defaultdict(list)
for order, item_list in input_data.items():
    for item in item_list:
        inverted_index[item].append(order)
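Here is a runnable sketch of that idea. The contents of input_data are invented to mirror the layout in the question, and order_item_pairs is a hypothetical stand-in for the two-column order_code table (order_no, item_code); the same single pass works on either shape.

```python
from collections import defaultdict

# Hypothetical input: order number -> list of item codes in that order.
input_data = {
    "order 1": ["item 1", "item 2", "item 3"],
    "order 2": ["item 1", "item 2"],
    "order 3": ["item 1", "item 3"],
}

# Horizontal -> vertical: one pass over every (order, item) combination.
inverted_index = defaultdict(list)
for order, item_list in input_data.items():
    for item in item_list:
        inverted_index[item].append(order)

vert_dict = dict(inverted_index)
print(vert_dict)

# Same result starting from long-format (order_no, item_code) pairs.
order_item_pairs = [(order, item) for order, items in input_data.items() for item in items]
by_item = defaultdict(list)
for order_no, item_code in order_item_pairs:
    by_item[item_code].append(order_no)
```

Each row is visited once, instead of once per unique SKU as in the nested loops from the question.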

Operation similar to group by for lists

I have lists of ids and scores:
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
I want to remove duplicates from the list ids so that the scores are summed up accordingly. This is very similar to what groupby.sum() does with dataframes.
So, as output I expect:
ids=[1,2,3]
scores=[60,20,40]
I use the following code but it doesn't work well for all cases:
for indi, i in enumerate(ids):
    for indj, j in enumerate(ids):
        if (i == j) and (indi != indj):
            del ids[i]
            scores[indj] = scores[indi] + scores[indj]
            del scores[indi]

You can create a dictionary from ids and scores in which each key is an element of ids and each value is the list of scores for that id. Then sum up the values to get your new ids and scores lists:
from collections import defaultdict
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
dct = defaultdict(list)
# Create the dictionary of element of ids vs list of elements of scores
for id, score in zip(ids, scores):
    dct[id].append(score)
print(dct)
# defaultdict(<class 'list'>, {1: [10, 10, 30, 10], 2: [20], 3: [40]})
# Calculate the sum of values, and get the new ids and scores list
new_ids, new_scores = zip(*((key, sum(value)) for key, value in dct.items()))
print(list(new_ids))
print(list(new_scores))
The output will be
[1, 2, 3]
[60, 20, 40]

As suggested in the comments, using a dictionary is one way. You can iterate once over the lists and update the sum per id.
If you want two lists at the end, select the keys and values with the keys() and values() methods of the dictionary:
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
# Init the dict with all ids at 0
dict_ = {i:0 for i in set(ids)}
# Use a separate loop variable so the scores list is not overwritten
for id, score in zip(ids, scores):
    dict_[id] += score
print(dict_)
# {1: 60, 2: 20, 3: 40}
new_ids = list(dict_.keys())
sum_score = list(dict_.values())
print(new_ids)
# [1, 2, 3]
print(sum_score)
# [60, 20, 40]

Simply loop through them and add the scores when the ids match.
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
ans={}
for i,s in zip(ids,scores):
    if i in ans:
        ans[i]+=s
    else:
        ans[i]=s
ids, scores=list(ans.keys()), list(ans.values())
print(ids)
print(scores)
Output:
[1, 2, 3]
[60, 20, 40]

ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
# Find all unique ids and keep track of their scores
id_to_score = {id : 0 for id in set(ids)}
# Sum up the scores for that id
for index, id in enumerate(ids):
    id_to_score[id] += scores[index]
unique_ids = []
score_sum = []
for (i, s) in id_to_score.items():
    unique_ids.append(i)
    score_sum.append(s)
print(unique_ids) # [1, 2, 3]
print(score_sum) # [60, 20, 40]

This may help you.
# Solution 1
import pandas as pd
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
df = pd.DataFrame(list(zip(ids, scores)),
                  columns=['ids', 'scores'])
print(df.groupby('ids').sum())
#### Output ####
     scores
ids
1        60
2        20
3        40
# Solution 2
from itertools import groupby
zipped_list = list(zip(ids, scores))
print([[k, sum(v for _, v in g)] for k, g in groupby(sorted(zipped_list), key = lambda x: x[0])])
#### Output ####
[[1, 60], [2, 20], [3, 40]]
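The sorted() call in Solution 2 matters: itertools.groupby only groups consecutive items with equal keys. The sketch below shows the difference on the same data; the helper name sum_by_key is made up for this illustration.

```python
from itertools import groupby

ids = [1, 2, 1, 1, 3, 1]
scores = [10, 20, 10, 30, 40, 10]
pairs = list(zip(ids, scores))

def sum_by_key(pairs):
    """Sum the second element of each pair per run of equal first elements."""
    return [(k, sum(v for _, v in g)) for k, g in groupby(pairs, key=lambda p: p[0])]

# Unsorted input: id 1 appears in three separate runs.
print(sum_by_key(pairs))          # [(1, 10), (2, 20), (1, 40), (3, 40), (1, 10)]
# Sorted input: exactly one group per id.
print(sum_by_key(sorted(pairs)))  # [(1, 60), (2, 20), (3, 40)]
```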

With only built-in Python tools I would do the task the following way:
ids=[1,2,1,1,3,1]
scores=[10,20,10,30,40,10]
uids = list(set(ids)) # unique ids
for uid in uids:
    print(uid,sum(s for inx,s in enumerate(scores) if ids[inx]==uid))
Output:
1 60
2 20
3 40
The code above just prints the result, but it can easily be changed to produce a dict:
output_dict = {uid:sum(s for inx,s in enumerate(scores) if ids[inx]==uid) for uid in uids} # {1: 60, 2: 20, 3: 40}
or another data structure. Keep in mind that this method requires a separate pass over the data for every unique id, so it may be slower than other approaches. Whether that is an issue depends on how big your data is.

keyerror 1 in my code

I am writing a function that takes a dictionary as input and returns a list of the keys which have unique values in that dictionary. Consider,
ip = {1: 1, 2: 1, 3: 3}
so the output should be [3], as key 3 has a value that no other key in the dict has.
Now there is a problem in the given function:
def uniqueValues(aDict):
    dicta = aDict
    dum = 0
    for key in aDict.keys():
        for key1 in aDict.keys():
            if key == key1:
                dum = 0
            else:
                if aDict[key] == aDict[key1]:
                    if key in dicta:
                        dicta.pop(key)
                    if key1 in dicta:
                        dicta.pop(key1)
    listop = dicta.keys()
    print listop
    return listop
I am getting an error like:
File "main.py", line 14, in uniqueValues
    if aDict[key] == aDict[key1]:
KeyError: 1
Where am I going wrong?

Your main problem is this line:
dicta = aDict
You think you're making a copy of the dictionary, but actually you still have just one dictionary, so operations on dicta also change aDict (when you remove values from dicta, they also get removed from aDict, and so you get your KeyError).
One solution would be
dicta = aDict.copy()
(You should also give your variables clearer names to make it more obvious to yourself what you're doing.)
(edit) Also, an easier way of doing what you're doing:
def iter_unique_keys(d):
    values = list(d.values())
    for key, value in d.iteritems():
        if values.count(value) == 1:
            yield key
print list(iter_unique_keys({1: 1, 2: 1, 3: 3}))
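To see the aliasing for yourself, here is a small sketch (written for Python 3, unlike the Python 2 snippets above). The names original, alias and shallow are only illustrative.

```python
original = {1: 1, 2: 1, 3: 3}

alias = original           # a second name for the same dict object
shallow = original.copy()  # a new dict with the same keys and values

alias.pop(1)               # removing through the alias...

print(original)  # {2: 1, 3: 3}  (...also changes the original)
print(shallow)   # {1: 1, 2: 1, 3: 3}  (the copy is unaffected)
print(alias is original, shallow is original)  # True False
```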

Use Counter from the collections library:
from collections import Counter
ip = {
    1: 1,
    2: 1,
    3: 3,
    4: 5,
    5: 1,
    6: 1,
    7: 9
}
# Generate a dict with the number of occurrences of each value in the 'ip' dict
count = Counter(ip.values())
# For each item (key, value) in the ip dict, check the number of occurrences of its value.
# Add the key to the 'results' list only if that number equals 1.
results = [x for x,y in ip.items() if count[y] == 1]
# Finally, print the results list
print results
Output:
[3, 4, 7]

Link two dictionaries lists and calculate average values

I am not used to coding in Python, but I have to do this task with it. What I am trying to do is reproduce the result of an SQL statement like this:
SELECT T2.item, AVG(T1.Value) AS MEAN FROM TABLE_DATA T1 INNER JOIN TABLE_ITEMS T2 ON T1.ptid = T2.ptid GROUP BY T2.item.
In Python, I have two lists of dictionaries with the common key 'ptid'. My dctData contains around 100 000 ptid entries, and dctItems around 7000. Using a comprehension like [i for i in dctData for j in dctItems if i['ptid'] == j['ptid']] takes forever:
ptid = 1
for line in lines[6:]:  # Skipping header
    data = line.split()
    for d in data:
        dctData.append({'ptid' : ptid, 'Value': float(d)})
        ptid += 1
dctData = [{'ptid':1,'Value': 0}, {'ptid':2,'Value': 2}, {'ptid':3,'Value': 2}, {'ptid':4,'Value': 5}, {'ptid':5,'Value': 3}, {'ptid':6,'Value': 2}]
for line in lines[1:]:  # Skipping header
    data = line.split(';')
    dctItems.append({'ptid' : int(data[1]), 'item' : data[3]})
dctItems = [{'item':21, 'ptid':1}, {'item':21, 'ptid':2}, {'item':21, 'ptid':6}, {'item':22, 'ptid':2}, {'item':22, 'ptid':5}, {'item':23, 'ptid':4}]
Now, what I would like to get as a result is a third list that presents the average value for each item in dctItems, where the link between the two lists is based on the 'ptid' value.
For example, for item 21 it would calculate the mean value 1.3 from the values (0, 2, 2) of ptid 1, 2 and 6.
Finally, the result would look something like this, where the key Value holds the calculated mean:
dctResults = [{'id':21, 'Value':1.3}, {'id':22, 'Value':2.5}, {'id':23, 'Value':5}]
How can I achieve this?
Thank you all for your help.

Given those data structures that you use, this is not trivial, but it will become much easier if you use a single dictionary mapping items to their values, instead.
First, let's try to re-structure your data in that way:
values = {entry['ptid']: entry['Value'] for entry in dctData}
items = {}
for item in dctItems:
    items.setdefault(item['item'], []).append(values[item['ptid']])
Now, items has the form {21: [0, 2, 2], 22: [2, 3], 23: [5]}. Of course, it would be even better if you could create the dictionary in this form in the first place.
Now, we can pretty easily calculate the average for all those lists of values:
avg = lambda lst: float(sum(lst))/len(lst)
result = {item: avg(values) for item, values in items.items()}
This way, result is {21: 1.3333333333333333, 22: 2.5, 23: 5.0}
Or if you prefer your "list of dictionaries" style:
dctResult = [{'id': item, 'Value': avg(values)} for item, values in items.items()]
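Put together, the steps above could look like the following sketch, which uses the sample lists from the question and Python 3 division. The names value_by_ptid and values_by_item are illustrative, and the membership check is an added assumption to mimic INNER JOIN behaviour when a ptid has no matching value.

```python
from collections import defaultdict

dctData = [{'ptid': 1, 'Value': 0}, {'ptid': 2, 'Value': 2}, {'ptid': 3, 'Value': 2},
           {'ptid': 4, 'Value': 5}, {'ptid': 5, 'Value': 3}, {'ptid': 6, 'Value': 2}]
dctItems = [{'item': 21, 'ptid': 1}, {'item': 21, 'ptid': 2}, {'item': 21, 'ptid': 6},
            {'item': 22, 'ptid': 2}, {'item': 22, 'ptid': 5}, {'item': 23, 'ptid': 4}]

# Index the values by ptid once, so every later lookup is a single dict access.
value_by_ptid = {entry['ptid']: entry['Value'] for entry in dctData}

# Group the values per item; ptids without a matching value are skipped.
values_by_item = defaultdict(list)
for entry in dctItems:
    if entry['ptid'] in value_by_ptid:
        values_by_item[entry['item']].append(value_by_ptid[entry['ptid']])

# Average each group and return the "list of dictionaries" shape.
dctResults = [{'id': item, 'Value': sum(vals) / len(vals)}
              for item, vals in values_by_item.items()]
print(dctResults)
```

Because the join is one dictionary lookup per entry in dctItems, the work grows with the combined size of the two lists rather than with their product.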

Appending values to dictionary in Python

I have a dictionary, and for each drug in it I want to append numbers to a list, like this:
append(0), append(1234), append(123), etc.
def make_drug_dictionary(data):
    drug_dictionary={'MORPHINE':[],
                     'OXYCODONE':[],
                     'OXYMORPHONE':[],
                     'METHADONE':[],
                     'BUPRENORPHINE':[],
                     'HYDROMORPHONE':[],
                     'CODEINE':[],
                     'HYDROCODONE':[]}
    prev = None
    for row in data:
        if prev is None or prev==row[11]:
            drug_dictionary.append[row[11][]
    return drug_dictionary
I later want to be able to access the entire set of entries in, for example, 'MORPHINE'.
How do I append a number into the drug_dictionary?
How do I later traverse through each entry?

Just use append:
list1 = [1, 2, 3, 4, 5]
list2 = [123, 234, 456]
d = {'a': [], 'b': []}
d['a'].append(list1)
d['a'].append(list2)
print d['a']

You should use append to add to the list. Here are also a few code tips:
I would use dict.setdefault or defaultdict to avoid having to specify the empty list in the dictionary definition.
If you use prev to filter out duplicated values, you can simplify the code using groupby from itertools.
Your code with the amendments looks as follows:
import itertools
def make_drug_dictionary(data):
    drug_dictionary = {}
    for key, row in itertools.groupby(data, lambda x: x[11]):
        drug_dictionary.setdefault(key,[]).append(row[?])
    return drug_dictionary
If you don't know how groupby works, just check this example:
>>> list(key for key, val in itertools.groupby('aaabbccddeefaa'))
['a', 'b', 'c', 'd', 'e', 'f', 'a']
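Note that groupby yields (key, group) pairs, where each group is an iterator over the consecutive rows sharing that key. A minimal sketch of how that could fill the dictionary follows. It assumes simplified two-field rows of (drug name, number); in the question the name sits in row[11], and the column holding the number is not stated, so the indices here are placeholders.

```python
import itertools

# Hypothetical rows: (drug_name, number). The real rows keep the name in row[11];
# the position of the number is an assumption made for this sketch.
data = [
    ("MORPHINE", 0),
    ("MORPHINE", 1234),
    ("CODEINE", 123),
    ("MORPHINE", 7),
]

def make_drug_dictionary(data):
    drug_dictionary = {}
    # Each group is an iterator over consecutive rows with the same drug name.
    for name, group in itertools.groupby(data, key=lambda row: row[0]):
        drug_dictionary.setdefault(name, []).extend(row[1] for row in group)
    return drug_dictionary

drugs = make_drug_dictionary(data)
print(drugs)  # {'MORPHINE': [0, 1234, 7], 'CODEINE': [123]}

# Traversing every entry later on:
for name, numbers in drugs.items():
    print(name, numbers)
```

Because setdefault reuses the existing list, a drug that shows up again in a later run of rows keeps accumulating into the same entry.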

It sounds as if you are trying to set up a list of lists as each value in the dictionary. Your initial value for each drug in the dict is []. So assuming that you have list1 that you want to append to the list for 'MORPHINE', you should do:
drug_dictionary['MORPHINE'].append(list1)
You can then access the various lists in the way that you want, as drug_dictionary['MORPHINE'][0] etc.
To traverse the lists stored against a key you would do:
for listx in drug_dictionary['MORPHINE']:
    # do stuff on listx
    pass

To append entries to the table:
for row in data:
    name = ??? # figure out the name of the drug
    number = ??? # figure out the number you want to append
    drug_dictionary[name].append(number)
To loop through the data:
for name, numbers in drug_dictionary.items():
    print name, numbers

If you want to append to the lists of each key inside a dictionary, you can append new values to them using the += operator (tested in Python 3.7):
mydict = {'a':[], 'b':[]}
print(mydict)
mydict['a'] += [1,3]
mydict['b'] += [4,6]
print(mydict)
mydict['a'] += [2,8]
print(mydict)
and the output:
{'a': [], 'b': []}
{'a': [1, 3], 'b': [4, 6]}
{'a': [1, 3, 2, 8], 'b': [4, 6]}
mydict['a'].extend([1,3]) does the same job; like +=, it modifies the existing list in place rather than creating a new one.
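A quick way to check which operations modify a list in place is to keep a second name for the same list and watch it. This is only an illustrative sketch; the names a and same are arbitrary.

```python
a = [1, 3]
same = a          # second name for the same list object

a += [2, 8]       # in place: the shared list grows
print(same)       # [1, 3, 2, 8]

a.extend([5])     # also in place
print(same)       # [1, 3, 2, 8, 5]

a = a + [9]       # builds a new list and rebinds the name a
print(same)       # [1, 3, 2, 8, 5]
print(a is same)  # False
```

So for lists stored in a dictionary, += and extend() both update the stored list, while a plain + needs the result assigned back to the key.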

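You can use the update() method as well to add a new key (note that it replaces the value of an existing key rather than appending to it):
d = {"a": 2}
d.update({"b": 4})
print(d) # {'a': 2, 'b': 4}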
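When the values are lists, the difference between replacing and appending is easy to miss. The sketch below contrasts update() with setdefault(...).append(...); the keys and numbers are arbitrary examples.

```python
d = {"a": [1]}

d.update({"a": [2]})             # replaces the list stored under "a"
print(d)                         # {'a': [2]}

d.setdefault("a", []).append(3)  # appends to the existing list
d.setdefault("b", []).append(4)  # creates the list first, then appends
print(d)                         # {'a': [2, 3], 'b': [4]}
```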

How do I append a number into the drug_dictionary?
Do you wish to add "a number" or a set of values?
I use dictionaries to build associative arrays and lookup tables quite a bit.
Since Python is so good at handling strings, I often add the values into a dict as a comma-separated string:
drug_dictionary = {}
drug_dictionary={'MORPHINE':'',
                 'OXYCODONE':'',
                 'OXYMORPHONE':'',
                 'METHADONE':'',
                 'BUPRENORPHINE':'',
                 'HYDROMORPHONE':'',
                 'CODEINE':'',
                 'HYDROCODONE':''}
drug_to_update = 'MORPHINE'
try:
    oldvalue = drug_dictionary[drug_to_update]
except KeyError:
    oldvalue = ''
# to increment a value
try:
    newval = int(oldvalue)
    newval += 1
except ValueError:
    newval = 1
drug_dictionary[drug_to_update] = "%s" % newval
# to append a value
try:
    newval = int(oldvalue)
    newval += 1
except ValueError:
    newval = 1
drug_dictionary[drug_to_update] = "%s%s," % (oldvalue,newval)
The append approach allows for storing a list of values but leaves you with a trailing comma,
which you can remove with
drug_dictionary[drug_to_update][:-1]
Appending the values as a string means that you can append lists of values as you need to, and
print "'%s':'%s'" % ( drug_to_update, drug_dictionary[drug_to_update])
can return
'MORPHINE':'10,5,7,42,12,'

vowels = ("a","e","i","o","u") # create a tuple of vowels
my_str = "this is my dog and a cat" # sample string to get the vowel count
count = {}.fromkeys(vowels,0) # create a dict initializing the count of each vowel to 0
for char in my_str:
    if char in count:
        count[char] += 1
print(count)
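The same count can also be sketched with collections.Counter. This is an alternative to the loop above, not a correction of it; the dict comprehension at the end keeps the vowels that never occur, with a count of 0.

```python
from collections import Counter

vowels = ("a", "e", "i", "o", "u")
my_str = "this is my dog and a cat"

# Count only the characters that are vowels.
found = Counter(ch for ch in my_str if ch in vowels)

# Counter returns 0 for missing keys, so every vowel gets an entry.
count = {v: found[v] for v in vowels}
print(count)  # {'a': 3, 'e': 0, 'i': 2, 'o': 1, 'u': 0}
```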
