Python Aggregation without PANDAS

I have a sorted, nested list. Each element in the list has 3 sub-elements: 'Drugname', Doctor_id, Amount. A given drug name repeats, and each of its rows has a different doctor ID and amount. See the sample list below.
For each drug name, I need the count of UNIQUE doctor IDs and the sum of the dollar amounts for that drug. For example, given the list snippet below:
[
['CIPROFLOXACIN HCL', 1801093968, 61.49],
['CIPROFLOXACIN HCL', 1588763981, 445.23],
['HYDROCODONE-ACETAMINOPHEN', 1801093968, 251.52],
['HYDROCODONE-ACETAMINOPHEN', 1588763981, 263.16],
['HYDROXYZINE HCL', 1952310666, 945.5],
['IBUPROFEN', 1801093968, 67.06],
['INVEGA SUSTENNA', 1952310666, 75345.68]
]
The desired output is as below.
[
['CIPROFLOXACIN HCL', 2, 506.72],
['HYDROCODONE-ACETAMINOPHEN', 2, 514.68],
['HYDROXYZINE HCL', 1, 945.5],
['IBUPROFEN', 1, 67.06],
['INVEGA SUSTENNA', 1, 75345.68]
]
In a database world this is the easiest thing with a simple GROUP BY on drugname. In Python, I am not allowed to use PANDAS, NumPy etc. Just the basic building blocks of Python. I tried the below code but I am unable to reset the count variable to count doctor ids and amounts. This commented code is one of several attempts. Not sure if I need to use a nested for loop or a for loop-while loop combo.
All help is appreciated!
aggr_list = []
temp_drug_name = ''
doc_count = 0
amount = 0
for list_element in sorted_new_list:
    temp_drug_name = list_element[0]
    if temp_drug_name == list_element[0]:
        amount += float(amount)
        doc_count += 1
    aggr_list.append([temp_drug_name, doc_count, amount])
print(aggr_list)

Since the list is already sorted, you can simply iterate through it (named l in the example below) and keep track of the name seen in the previous iteration. If the name of the current iteration differs from the last one, append a new entry to the output. Use a set to keep track of the doctor IDs already seen for the current drug, and increment the second item of the last output entry by 1 only if the doctor ID has not been seen yet. Then add the amount of the current iteration to the third item of the last output entry:
output = []
last = None
for name, id, amount in l:
    if name != last:
        output.append([name, 0, 0])
        last = name
        ids = set()
    if id not in ids:
        output[-1][1] += 1
        ids.add(id)
    output[-1][2] += amount
output becomes:
[['CIPROFLOXACIN HCL', 2, 506.72],
['HYDROCODONE-ACETAMINOPHEN', 2, 514.6800000000001],
['HYDROXYZINE HCL', 1, 945.5],
['IBUPROFEN', 1, 67.06],
['INVEGA SUSTENNA', 1, 75345.68]]
Note that decimal floating points are approximated in the binary system that the computer uses (please read Is floating point math broken?), so some minor errors are inevitable as seen in the sum of the second entry above.
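If the tiny rounding error matters for display or comparison, two common workarounds are rounding the final total to cents or summing Decimal values. The snippet below is only a sketch: the variable names are made up for illustration, and it assumes the amounts are dollar values with at most two decimal places.

```python
from decimal import Decimal

# Two amounts taken from the sample data for HYDROCODONE-ACETAMINOPHEN.
amounts = [251.52, 263.16]

# Plain float addition may carry a small binary rounding error.
float_total = sum(amounts)

# Option 1: round the final total to cents.
rounded_total = round(float_total, 2)

# Option 2: convert each amount through its string form and add exactly.
decimal_total = sum(Decimal(str(a)) for a in amounts)

print(rounded_total)
print(decimal_total)
```

Rounding once at the end is usually enough for reporting; Decimal is worth considering when the totals feed further money calculations.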

Here is a solution with a focus on readability. It does not rely on the entries in your original list being sorted by drug name.
It does one pass over all the entries of your data, then a pass over the unique drugs.
To do only a single pass over all the entries of your sorted data, see @blhsing's solution.
from collections import defaultdict, namedtuple
Entry = namedtuple('Entry',['doctors', 'prices'])
processed_data = defaultdict(lambda: Entry(doctors=set(), prices=[]))
for entry in data:
    drug_name, doctor_id, price = entry
    processed_data[drug_name].doctors.add(doctor_id)
    processed_data[drug_name].prices.append(price)
stat_list = [[drug_name, len(entry.doctors), sum(entry.prices)] for drug_name, entry in processed_data.items()]

Without Pandas or defaultdict:
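d = {}
for row in l:
    # Each value is a flat list: [doctor_id, amount, doctor_id, amount, ...]
    if row[0] in d:
        d[row[0]].append(row[1])
        d[row[0]].append(row[2])
    else:
        d[row[0]] = [row[1]]
        d[row[0]].append(row[2])
# Even positions hold doctor IDs, odd positions hold amounts.
result = [[key, len(set(val[0::2])), sum(val[1::2])] for key, val in d.items()]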

Reusable solution, meant for those who arrive here through Google:
def group_by(rows, key):
    m = {}
    for row in rows:
        k = key(row)
        try:
            m[k].append(row)
        except KeyError:
            m[k] = [row]
    return m.values()
grouped_by_drug = group_by(data, key=lambda row: row[0])
result = [
    (
        drug_rows[0][0],
        len({row[1] for row in drug_rows}),  # count unique doctor IDs, not rows
        sum(row[2] for row in drug_rows)
    )
    for drug_rows in grouped_by_drug
]
You can also use defaultdict in this implementation, which for my use case is slightly faster.
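A defaultdict version of the same helper might look like the sketch below. The data list is copied from the question's sample, and rounding the sum to two decimals is an added assumption so the totals print cleanly.

```python
from collections import defaultdict

def group_by(rows, key):
    # defaultdict(list) creates the empty list on first access,
    # so no try/except or membership test is needed.
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return list(groups.values())

data = [
    ['CIPROFLOXACIN HCL', 1801093968, 61.49],
    ['CIPROFLOXACIN HCL', 1588763981, 445.23],
    ['HYDROCODONE-ACETAMINOPHEN', 1801093968, 251.52],
    ['HYDROCODONE-ACETAMINOPHEN', 1588763981, 263.16],
    ['HYDROXYZINE HCL', 1952310666, 945.5],
    ['IBUPROFEN', 1801093968, 67.06],
    ['INVEGA SUSTENNA', 1952310666, 75345.68],
]

result = [
    (
        rows[0][0],                             # drug name
        len({row[1] for row in rows}),          # unique doctor IDs
        round(sum(row[2] for row in rows), 2),  # total amount, rounded to cents
    )
    for rows in group_by(data, key=lambda row: row[0])
]
print(result)
```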

Related

Finding average value of list elements, if a condition is met using Python?

I have a list that has the following format:
mylist = ["Joe, 100%", "Joe, 80%", "Joe, 90%", "Sally, 95%", "Sally, 80%", "Jimmy, 90%", ...]
What I am trying to do is, first count the number of times each name appears. If a name appears 2 or more times, append that name along with the average percent. So, I'm trying to get to the following output:
newlist = ["Joe, 90%", "Sally, 87.5%"]
To try this, I did mylist.split(", ") to get the names only, and used Counter() to find how many times the name appears. Then, I used a simple if >= 2 statement to append the name to newlist if the name appears 2 or more times.
However, despite trying many different things, I wasn't able to figure out how to get the percentages back with the name in the final list. I am also unsure how to word my question on Google, so I wasn't able to find any help. Does anyone know how to do this? If this question is a duplicate, please let me know (and provide a link so I can learn), and I can delete this question. Thanks!

You can try this:
from collections import defaultdict
counts = defaultdict(int)
percents = defaultdict(int)
for item in mylist:
    name, percent = item.split(',')
    percent = int(percent.lstrip().rstrip('%'))
    percents[name]+=percent
    counts[name]+=1
result = []
for k,v in counts.items():
    if v > 1:
        result.append(f"{k}, {percents[k]/v}%")
print(result)
Output
['Joe, 90.0%', 'Sally, 87.5%']

I would recommend you create a dictionary of the scores, where the key would be the name and the value would be a list of their scores. This snippet shows how you can achieve that:
mydict = {}
for item in mylist:
    name, score = item.split(", ") # splits each item into a name and a score
    score = float(score.replace("%", "")) # converts string score to a float
    if name in mydict: # checks if the name already exists in the dictionary
        mydict[name].append(score)
    else:
        mydict[name] = [score]
This would leave you with a dictionary of scores organized by name. Now all you have to do is average the scores in the dictionary:
newlist = []
for name in mydict:
    if len(mydict[name]) >= 2:
        average = str(sum(mydict[name]) / len(mydict[name])) + "%"
        straverage = name + ", " + average
        newlist.append(straverage)
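The two steps can also be combined into a compact sketch that reproduces the exact strings asked for in the question ("Joe, 90%" rather than "Joe, 90.0%"). The use of dict.setdefault and the :g format specifier are choices made here for illustration, not part of the answers above.

```python
mylist = ["Joe, 100%", "Joe, 80%", "Joe, 90%", "Sally, 95%", "Sally, 80%", "Jimmy, 90%"]

# Collect every score under its name.
scores = {}
for item in mylist:
    name, percent = item.split(", ")
    scores.setdefault(name, []).append(float(percent.rstrip("%")))

# Keep only names that appear at least twice; :g drops a trailing ".0".
newlist = [
    f"{name}, {sum(vals) / len(vals):g}%"
    for name, vals in scores.items()
    if len(vals) >= 2
]
print(newlist)
```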

Comparing the elements of a list with themselves

I have lists of items:
['MRS_103_005_010_BG_001_v001',
'MRS_103_005_010_BG_001_v002',
'MRS_103_005_010_FG_001_v001',
'MRS_103_005_010_FG_001_v002',
'MRS_103_005_010_FG_001_v003',
'MRS_103_005_020_BG_001_v001',
'MRS_103_005_020_BG_001_v002',
'MRS_103_005_020_BG_001_v003']
I need to identify the latest version of each item and store it to a new list. Having trouble with my logic.
Based on how this has been built I believe I need to first compare the indices to each other. If I find a match I then check to see which number is greater.
I figured I first needed to do a check to see if the folder names matched between the current index and the next index. I did this by making two variables, 0 and 1, to represent the index so I could do a staggered incremental comparison of the list on itself. If the two indices matched I then needed to check the vXXX number on the end. whichever one was the highest would be appended to the new list.
I suspect that the problem lies in one copy of the list getting to an empty index before the other one does but I'm unsure of how to compensate for that.
Again, I am not a programmer by trade. Any help would be appreciated! Thank you.
# Preparing variables for filtering the folders
versions = foundVerList
verAmountTotal = len(foundVerList)
verIndex = 0
verNextIndex = 1
highestVerCount = 1
filteredVersions = []
# Filtering, this will find the latest version of each folder and store to a list
while verIndex < verAmountTotal:
    try:
        nextVer = (versions[verIndex])
        nextVerCompare = (versions[verNextIndex])
    except IndexError:
        verNextIndex -= 1
    if nextVer[0:24] == nextVerCompare[0:24]:
        if nextVer[-3:] < nextVerCompare [-3:]:
            filteredVersions.append(nextVerCompare)
        else:
            filteredVersions.append(nextVer)
    verIndex += 1
    verNextIndex += 1
My expected output is:
print filteredVersions
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v003']
['MRS_103_005_020_BG_001_v003']
The actual output is:
print filteredVersions
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v002',
'MRS_103_005_010_FG_001_v003']
['MRS_103_005_020_BG_001_v002', 'MRS_103_005_020_BG_001_v003']
During the with loop I am using os.list on each folder referenced via verIndex. I believe the problem is that a list is being generated for every folder that is searched but I want all the searches to be combined in a single list which will THEN go through the groupby and sorted actions.

Seems like a case for itertools.groupby:
from itertools import groupby
grouped = groupby(data, key=lambda version: version.rsplit('_', 1)[0])
result = [sorted(group, reverse=True)[0] for key, group in grouped]
print(result)
Output:
['MRS_103_005_010_BG_001_v002',
'MRS_103_005_010_FG_001_v003',
'MRS_103_005_020_BG_001_v003']
This groups the entries by everything before the last underscore, which I understand to be the "item code".
Then, it sorts each group in reverse order. The elements of each group differ only by the version, so the entry with the highest version number will be first.
Lastly, it extracts the first entry from each group, and puts it back into a result list.
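One caveat worth keeping in mind: itertools.groupby only groups adjacent items, so the input has to be sorted by the same key first. The sketch below sorts explicitly and picks the highest version by its numeric value instead of by string order. The helper names item_code and version_number are made up for this example, and it assumes every name ends in "_v" followed by digits.

```python
from itertools import groupby

# Deliberately unsorted input to show why the explicit sort is needed.
versions = [
    'MRS_103_005_010_FG_001_v002',
    'MRS_103_005_020_BG_001_v003',
    'MRS_103_005_010_BG_001_v001',
    'MRS_103_005_010_FG_001_v003',
    'MRS_103_005_010_BG_001_v002',
    'MRS_103_005_020_BG_001_v001',
    'MRS_103_005_010_FG_001_v001',
    'MRS_103_005_020_BG_001_v002',
]

def item_code(name):
    # Everything before the last underscore.
    return name.rsplit('_', 1)[0]

def version_number(name):
    # Digits after the final "_v", compared as an integer.
    return int(name.rsplit('_v', 1)[1])

latest = [
    max(group, key=version_number)
    for _, group in groupby(sorted(versions, key=item_code), key=item_code)
]
print(latest)
```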

Try this:
text = """MRS_103_005_010_BG_001_v001
MRS_103_005_010_BG_001_v002
MRS_103_005_010_FG_001_v001
MRS_103_005_010_FG_001_v002
MRS_103_005_010_FG_001_v003
MRS_103_005_020_BG_001_v001
MRS_103_005_020_BG_001_v002
MRS_103_005_020_BG_001_v003
"""
result = {}
versions = text.splitlines()
for item in versions:
    v = item.split('_')
    num = int(v.pop()[1:])  # numeric part of the version, e.g. 'v002' -> 2
    name = item[:-3]        # everything up to and including the 'v'
    if result.get(name, 0) < num:
        result[name] = num
# zfill(3) restores the zero padding of the original names
filteredVersions = [k + str(v).zfill(3) for k, v in result.items()]
print(filteredVersions)
output:
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v003', 'MRS_103_005_020_BG_001_v003']

What are the advantages to using count sort in this algorithm?

Working on following algorithm:
You are given an array of student objects. Every student has an integer-valued age field that is to be treated as a key. Rearrange the elements of the array so that students of equal age appear together. The order in which different ages appear is not important. How would your solution change if ages have to appear in sorted order?
The author solution of the book the problem comes from says to create two hashes - one that maps age => number of occurrences of that age, and another that maps age to the offset in the final array. Then, iterate over these two hashes and write the values with appropriate offsets in the final array.
Author code:
import collections
Person = collections.namedtuple('Person', ('age', 'name'))
def group_by_age(people):
    age_to_count = collections.Counter([person.age for person in people])
    age_to_offset, offset = {}, 0
    for age, count in age_to_count.items():
        age_to_offset[age] = offset
        offset += count
    while age_to_offset:
        from_age = next(iter(age_to_offset))
        from_idx = age_to_offset[from_age]
        to_age = people[from_idx].age
        to_idx = age_to_offset[people[from_idx].age]
        people[from_idx], people[to_idx] = people[to_idx], people[from_idx]
        age_to_count[to_age] -= 1
        if age_to_count[to_age]:
            age_to_offset[to_age] = to_idx + 1
        else:
            del age_to_offset[to_age]
I'm wondering why we can't simplify things and just create a hash with key = age and value = object, then iterate over the hash keys and write the values to the array. If the output needs to be sorted, we can sort the hash keys and look each one up in the hash to write the values to the input array.
Question 1: Why did the author go with a less intuitive route?
Question 2: Is my code below (based on the author's solution) as good? This code is much cleaner, which raises the question of why the author didn't go this route.
import collections
Person = collections.namedtuple('Person', ('age', 'tuple'))
def group_ages(people):
    age_to_count = collections.Counter([person.age for person in people])
    age_to_offset, offset = {}, 0
    for age, count in age_to_count.items():
        age_to_offset[age] = offset
        offset += count
    for old_index, person in enumerate(people):
        new_index = age_to_offset[person.age]
        people[old_index], people[new_index] = people[new_index], people[old_index]
        age_to_count[person.age] -= 1
        if age_to_count[person.age]:
            age_to_offset[person.age] += 1

It's true that you can sort the hash keys and print them in order, but then how will you keep track of the values?
For example, say you have 3 students with ages 20, 40 and 15.
In the hashmap they will look something like h[20]="", h[40]="", h[15]="".
But after sorting the keys, how will you print the values from h? You won't have indices any more.
And if you try to get the keys from the main array, you will again need some other sort for that.
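For comparison, the simplification the question proposes can be sketched as below, with one adjustment: the hash has to map each age to a list of students, because a single value per age would lose students who share an age. The function name and the sort_ages flag are made up for this example. Note the trade-off: this version holds a second reference to every student in the buckets (extra space proportional to the number of students), whereas the book's code swaps elements in place and only keeps per-age counters and offsets.

```python
import collections

Person = collections.namedtuple('Person', ('age', 'name'))

def group_by_age_with_buckets(people, sort_ages=False):
    # Bucket every student under their age.
    buckets = collections.defaultdict(list)
    for person in people:
        buckets[person.age].append(person)
    # Optionally visit the ages in sorted order.
    ages = sorted(buckets) if sort_ages else list(buckets)
    # Write the buckets back into the original list, one age at a time.
    i = 0
    for age in ages:
        for person in buckets[age]:
            people[i] = person
            i += 1

people = [
    Person(20, 'a'), Person(15, 'b'), Person(20, 'c'),
    Person(40, 'd'), Person(15, 'e'),
]
group_by_age_with_buckets(people, sort_ages=True)
print(people)
```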

Remove duplicates from user input

I want to ignore any duplicate entries the user gives as input. I have the code below:
def pITEMName():
    global ITEMList,fITEMList
    pITEMList = []
    fITEMList = []
    ITEMList = str(raw_input('Enter pipe separated list of ITEMS : ')).upper().strip()
    items = ITEMList.split("|")
    count = len(items)
    print 'Total Distint ITEM Count : ', count
    pipelst = [i.replace('-mc','').replace('-MC','').replace('$','').replace('^','') for i in ITEMList.split('|')]
    filepath = '/location/data.txt'
    f = open(filepath, 'r')
    for lns in f:
        split_pipe = lns.split(':', 1)
        if split_pipe[0] in pipelst:
            index = pipelst.index(split_pipe[0])
            pITEMList=split_pipe[0]+"|"
            fITEMList.append(pITEMList)
            del pipelst[index]
    for lns in pipelst:
        print bcolors.red + lns,' is wrong ITEM Name' + bcolors.ENDC
    f.close()
When I execute above code it prompts me for user input as :
Enter pipe separated list of items :
And if I provide the input as :
Enter pipe separated list of items : AAA|IFA|AAA
After pressing enter I am getting the result as :
Enter pipe separated list of Items : AAA|IFA|AAA
Total Distint Item Count : 3
AAA is wrong Item Name
Items Belonging to other Centers :
Other Centers :
Item Count From Other Center = 0
Items Belonging to Current Centers :
Active Items in US1:
^IFA$
Active Items in US2 :
^AAA$|^AAA$
Ignored Item Count From Current Center = 0
You Have Entered ItemList belonging to this Center as:
^IFA$|^AAA$|^AAA$
Active Item Count : 3
Do You Want To Continue [YES|Y|NO|N] :
In the result above, you can see that I entered AAA twice, so the second one is counted as a wrong item. I want duplicate entries to be ignored. I also want the comparison to be case-insensitive: if I enter AAA|aaa|ifa, one 'aaa' should be ignored.
Please help me with how I can implement this.

First, you're doing ITEMList.split("|") several times. You should just use your already calculated items.
Second, you probably want:
items = set(ITEMList.lower().split("|"))
This way you get a set with unique, all lowercase elements.
I assume this doesn't matter since you can discard either uppercase or lowercase.

If item order is not important, then a set will do this very well.
items = set(ITEMList.split("|"))

Lots of great answers here; throwing my hat into the ring as well. One straightforward way to do this:
items = list(set(ITEMList.split("|")))
items.sort()
This preserves your items object as a list and orders it (which is something you may or may not prefer in this case).
If you decide later that you want to return an element of your items list in your code, you will be able to do it by referring to the list index (this functionality doesn't exist with sets).
If you want to preserve the value of the variable count, you could implement the code as:
items = ITEMList.split("|")
count = len(items)
items = list(set(ITEMList.split("|")))
items.sort()
You will also want to adjust this line:
pipelst = [i.replace('-mc','').replace('-MC','').replace('$','').replace('^','') for i in ITEMList.split('|')]
to this:
pipelst = [i.replace('-mc','').replace('-MC','').replace('$','').replace('^','') for i in items]

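If order is important:
import collections
my_list = "^IFA$|^AAA$|^AAA$"
"|".join(collections.Counter(my_list.upper().split("|")).keys())
is one way to do it. Note that this relies on Counter remembering insertion order, which it only does on Python 3.7 and later.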
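On Python 3.7 or later, an order-preserving, case-insensitive de-duplication can be sketched with dict.fromkeys. The function name unique_items is made up for this example, and it assumes that upper-casing every entry is acceptable, as in the question's code.

```python
def unique_items(raw):
    # Normalise case and surrounding whitespace, and skip empty entries.
    parts = [p.strip().upper() for p in raw.split("|") if p.strip()]
    # dict keys are unique and keep first-seen order.
    return list(dict.fromkeys(parts))

items = unique_items("AAA|aaa|ifa")
print(items)
print(len(items))
```

Because the result is a list, len(items) gives the distinct count and the entries can still be accessed by index.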

Maintaining order in large list of movies/ratings

I have a text file with hundreds of thousands of students, and their ratings for certain films organized with the first word being the student number, the second being the name of the movie (with no spaces), and the third being the rating they gave the movie:
student1000 Thor 1
student1001 Superbad -3
student1002 Prince_of_Persia:_The_Sands_of_Time 5
student1003 Old_School 3
student1004 Inception 5
student1005 Finding_Nemo 3
student1006 Tangled 5
I would like to arrange them in a dictionary so that I have each student mapped to a list of their movie ratings, where the ratings are in the same order for each student. In other words, I would like to have it like this:
{student1000 : [1, 3, -5, 0, 0, 3, 0,...]}
{student1001 : [0, 1, 0, 0, -3, 0, 1,...]}
Such that the first, second, third, etc. ratings for each student correspond to the same movies. The order is completely random for movies AND student numbers, and I'm having quite a bit of trouble doing this effectively. Any help in coming up with something that will minimize the big-O complexity of this problem would be awesome.
I ended up figuring it out. Here's the code I used for anyone wondering:
def get_movie_data(fileLoc):
    movieDic = {}
    movieList = set()
    f = open(fileLoc)
    setHold = set()
    for line in f:
        setHold.add(line.split()[1])
    f.close()
    movieList = sorted(setHold)
    f = open(fileLoc)
    for line in f:
        hold = line.strip().split()
        student = hold[0]
        movie = hold[1]
        rating = int(hold[2])
        if student not in movieDic:
            lst = [0]*len(movieList)
            movieDic[student] = lst
        hold2 = movieList.index(movie)
        rate = movieDic[student]
        rate[hold2] = rating
    f.close()
    return movieList, movieDic
Thanks for the help!
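On the big-O point: list.index scans the movie list on every line, so with many distinct movies it may help to precompute a movie-to-position dictionary, which makes each lookup constant time on average. The sketch below is a variation, not the code above: the function name is made up, it takes any iterable of lines rather than a file path, and it keeps all parsed rows in memory instead of reading the file twice.

```python
def build_rating_lists(lines):
    # Parse once, skipping blank lines.
    rows = [line.split() for line in lines if line.strip()]
    # Fixed movie order, plus a dictionary from movie name to its position.
    movie_list = sorted({movie for _, movie, _ in rows})
    movie_index = {movie: i for i, movie in enumerate(movie_list)}
    ratings = {}
    for student, movie, rating in rows:
        # Start each student with zeros for every movie.
        row = ratings.setdefault(student, [0] * len(movie_list))
        row[movie_index[movie]] = int(rating)
    return movie_list, ratings

sample = [
    "student1000 Thor 1",
    "student1001 Superbad -3",
    "student1000 Superbad 2",
]
movie_list, ratings = build_rating_lists(sample)
print(movie_list)
print(ratings)
```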

You can first build a dictionary of dictionaries:
{
'student1000' : {'Thor': 1, 'Superbad': 3, ...},
'student1001' : {'Thor': 0, 'Superbad': 1, ...},
...
}
Then you can go through that to get a master list of all the movies, establish an order for them (corresponding to the order within each student's rating list), and finally go through each student in the dictionary, converting the dictionary to the list you want. Or, like another answer said, just keep it as a dictionary.
defaultdict will probably come in handy. It lets you say that the default value for each student is an empty list (or dictionary) so you don't have to initialize it before you start appending values (or setting key-value pairs).
from collections import defaultdict
students = defaultdict(dict)
with open(filename, 'r') as f:
    for line in f.readlines():
        elts = line.split()
        student = elts[0]
        movie = elts[1]
        rating = int(elts[2])
        students[student][movie] = rating
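The conversion step described above (master movie list, then one aligned list per student) could then look like the sketch below. The small students dictionary is invented sample data, and using 0 for unrated movies is an assumption taken from the question's example output.

```python
# Invented sample in the dictionary-of-dictionaries shape built above.
students = {
    'student1000': {'Thor': 1, 'Superbad': 3},
    'student1001': {'Superbad': 1, 'Inception': 5},
}

# Master list of every movie seen, in a fixed (alphabetical) order.
all_movies = sorted({movie for rated in students.values() for movie in rated})

# One list per student, aligned with all_movies; unrated movies become 0.
rating_lists = {
    student: [rated.get(movie, 0) for movie in all_movies]
    for student, rated in students.items()
}
print(all_movies)
print(rating_lists)
```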

So, the answers here are functionally the same as what you seem to be looking for, but as far as directly constructing the lists you're looking for, they seem to be answering slightly different questions. Personally I would prefer to do this in a more dynamic way. Since it doesn't seem to me like you actually know the movies that are going to be rated ahead of time, you've gotta keep some kind of running tally of that.
ratings = {}
allMovies = []
for line in file:
    info = line.split(" ")
    movie = info[1].strip().lower()
    student = info[0].strip().lower()
    rating = float(info[2].strip().lower())
    if movie not in allMovies:
        allMovies.append(movie)
    movieIndex = allMovies.index(movie)
    if student not in ratings:
        # New student: start with a zero for every movie seen so far
        ratings[student] = [0] * len(allMovies)
    elif len(allMovies) > len(ratings[student]):
        # extend() modifies the list in place and returns None, so don't reassign
        ratings[student].extend([0] * (len(allMovies) - len(ratings[student])))
    ratings[student][movieIndex] = rating
It's not the way I would approach this problem, but I think this solution is closest to the original intent of the question and you can use a buffer to feed in the lines if there's a memory issue, but unless your file is multiple gigabytes there shouldn't be an issue with that.

Just put the scores into a dictionary rather than a list. After you've read all the data, you can then extract the movie names and put them in any order you want. Assuming students can rate different movies, maintaining some kind of consistent order while reading the file, without knowing the order of the movies to begin with, seems like a lot of work.
If you're worried about the keys taking up a lot of memory, use intern() on the keys (sys.intern() in Python 3) to make sure you're only storing one copy of each string.
